Advancing AI for humanity

Building Frontier Efficiency Model / Coming Apr 1, 2025

A New Paradigm of AI / Coming Jan 1, 2025

The Next Recipe / Dec 15, 2024

The Second Curve of Scaling Law / Jan 15, 2024

Scaling Factors / Jun 15, 2024

Introducing RPT: Reinforcement Pre-Training

Jun 10, 2025

Rectified Sparse Attention

June 5, 2025
On-Policy RL with Optimal Reward Baseline

Advancing the Foundation of RL in LLMs

May 30, 2025

The Frontier of Reward Models

May 21, 2025
Think Only When You Need with Large Hybrid-Reasoning Models

Adaptive Thinking Models

May 21, 2025
BitNet

Introducing BitNet b1.58 2B4T - Scaling Native 1-bit LLMs

Apr 15, 2025
BitNet

BitNet v2: Native 4-bit Activations for 1-bit LLMs

Apr 17, 2025

LatentLM: A Grand Unification of Multimodality

Dec 12, 2024

Differential Transformer

Oct 7, 2024
BitNet

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Feb 28, 2024

Scaling Laws of Synthetic Data for Language Models

Mar 25, 2025
YOCO

YOCO: Decoder-Decoder Architectures for Large Language Models

Gated RetNet (RetNet-3)

May 9, 2024
BitNet

BitNet a4.8: 4-bit Activations for 1-bit LLMs

Nov 8, 2024

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Jan 13, 2025

Bootstrap Your Own Context Length

Dec 25, 2024
MoE

MH-MoE (v2): Multi-Head Mixture-of-Experts

Nov 26, 2024
BitNet

1-bit AI Infra / bitnet.cpp: Running LLMs on CPUs

Oct 17, 2024
Fully Sparsely-Activated LLMs

Q-Sparse / Block Q-Sparse: Fully Sparsely-Activated LLMs

Jul 15, 2024
BitNet

The Era of 1-bit LLMs: Training Tips, Code and FAQ

Mar 20, 2024

MELLE: Autoregressive Speech Synthesis without Vector Quantization

Jul 11, 2024
VALL-E

VALL-E 2: Human Parity Zero-Shot Text to Speech Synthesis

Jun 8, 2024
Learning Law

The Learning Law: Towards Optimal Learning of Language Models

Feb 28, 2024

The Mind's Eye of (M)LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

Apr 4, 2024

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Feb 20, 2024

Multilingual E5 Text Embeddings

Feb 8, 2024

BitNet: 1-bit Transformers and LLMs

Oct 18, 2023
RetNet

Retentive Network: Revolutionizing Transformers for Large Language Models

Jul 18, 2023
LongViT

LongViT (LongNet for Vision): When an Image is Worth 1,024 × 1,024 Words

Dec 7, 2023
Kosmos

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Oct 4, 2023
Kosmos-2.5

Kosmos-2.5: A Multimodal Literate Model

Sep 20, 2023
LLM4Science

Large Language Model for Science: A Study on P vs. NP

Sep 13, 2023
LongNet

LongNet: Scaling Transformers to 1,000,000,000 Tokens

Jul 6, 2023
Kosmos-2

Kosmos-2: Grounding Multimodal Large Language Models (MLLMs) to the World

Jun 26, 2023
Kosmos-1

Kosmos-1: A Multimodal Large Language Model (MLLM)

Feb 28, 2023
VALL-E

WavMark: Watermarking for Audio Generation

Aug 24, 2023
VALL-E

VALL-E (X): Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Jan 6, 2023

PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

Sep 19, 2023

AdaptLLM: Adapting Large Language Models via Reading Comprehension

Sep 18, 2023
MiniLLM

MiniLLM: Knowledge Distillation of Large Language Models

Jun 14, 2023
LongMem

Large Language Models with Long-Term Memory

Jun 12, 2023
LLMA

LLM Accelerator: Lossless Acceleration of Large Language Models

Apr 11, 2023
XPos

A Length-Extrapolatable Transformer

Dec 20, 2022
ICL

Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers

Dec 20, 2022
Promptist

Promptist: Optimizing Prompts for Text-to-Image Generation

Dec 19, 2022
Structured Prompting

Structured Prompting: Scaling In-Context Learning to 1,000 Examples

Dec 12, 2022
TorchScale

TorchScale: Transformers at (Any) Scale

Nov 24, 2022

Magneto: A Foundation Transformer

Oct 13, 2022
BEiT-3

BEiT-3: A General-Purpose Multimodal Foundation Model

Aug 30, 2022

Language Models are General-Purpose Interfaces

Jun 13, 2022

DeepNet: Scaling Transformers to 1,000 Layers

Mar 1, 2022

BEiT: BERT Pre-Training of Image Transformers

Jun 15, 2021

MiniLM: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers

Feb 25, 2021

XLM-E: Efficient Multilingual Language Model Pre-training

Jun 30, 2021

UniLM: Unified Language Model Pre-training

May 8, 2019