Research Highlights

Our mission-focused research agenda: advancing AI for humanity.

Foundation of AI

Foundation Models

The Evolution of LLM / MLLM (Multimodal LLM)

Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Kosmos-2.5: A Multimodal Literate Model
Kosmos-2: Grounding Multimodal Large Language Models (MLLMs) to the World
(Kosmos-1) Language Is Not All You Need: Aligning Perception with Language Models. NeurIPS'23.
(MetaLM) Language Models are General-Purpose Interfaces

(VALL-E) Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

#TheBigConvergence of foundation models and large-scale pre-training across tasks, languages, and modalities

(BEiT-3) Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks. CVPR'23.

BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
BEiT: BERT Pre-Training of Image Transformers. ICLR'22 (Oral).

XLM-E: Cross-lingual Language Model Pre-training via ELECTRA. ACL'22.
mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs. EMNLP'21.
InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training. NAACL'21.
UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training. ICML'20.
(UniLM) Unified Language Model Pre-training for Natural Language Understanding and Generation. NeurIPS'19.

Foundation Architecture

The Revolution of Model Architecture

BitNet: Scaling 1-bit Transformers for Large Language Models
(RetNet) Retentive Network: A Successor to Transformer for Large Language Models
LongNet: Scaling Transformers to 1,000,000,000 Tokens

Fundamental research on modeling generality and capability, as well as training stability and efficiency

TorchScale: Transformers at Scale
(PoSE) Efficient Context Window Extension of LLMs via Positional Skip-wise Training
(XPos) A Length-Extrapolatable Transformer. ACL'23.
Magneto: A Foundation Transformer. ICML'23.
DeepNet: Scaling Transformers to 1,000 Layers
On the Representation Collapse of Sparse Mixture of Experts. NeurIPS'22.

Science of Intelligence

Fundamental research to understand the principles and theoretical boundary of (artificial general) intelligence.

Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta Optimizers. ACL'23.


General technology for enabling AI capabilities with (M)LLMs.

LLM Adaptation

(AdaLLM) Adapting Large Language Models via Reading Comprehension

LLM Distillation

(MiniLLM) Knowledge Distillation of Large Language Models

LLM Accelerator

(llma) Inference with Reference: Lossless Acceleration of Large Language Models

Prompt Intelligence

Prompt as a new language of Foundation Models and Generative AI, and a new programming language and interface for Human-AI communication and collaboration.

Learning to Retrieve In-Context Examples for Large Language Models
(Cross-Lingual-Thought) Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting. EMNLP'23.
(Promptist) Optimizing Prompts for Text-to-Image Generation. NeurIPS'23.
Structured Prompting: Scaling In-Context Learning to 1,000 Examples
Extensible Prompts for Language Models. NeurIPS'23.

Democratizing Foundation Models

Research and development of effective and efficient approaches to deploying large AI (foundation) models in practice.

Lossless Acceleration for Seq2seq Generation with Aggressive Decoding
EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation. EMNLP'22.
(xTune) Consistency Regularization for Cross-Lingual Fine-Tuning. ACL'21.
MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers. ACL'21.
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. NeurIPS'20.


Our research also pushes disruptive technologies for vertical domains and/or tasks.

Revolutionizing Document AI

Our pioneering research on multimodal document foundation models, driving the evolution of Document AI technology.

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. ACM MM'22.
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding. ACL'21.
LayoutLM: Pre-training of Text and Layout for Document Image Understanding. KDD'20.
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding. ACL'21.
DiT: Self-supervised Pre-training for Document Image Transformer. ACM MM'22.

MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding. ACL'22.

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. AAAI'23.