Advancing AI for Humanity

Star ...

Building Frontier Efficiency Model

coming, April 1, 2025

A New Paradigm of AI

coming, Jan 1, 2025

The Next Recipe / Dec 15, 2024

The Second Curve of Scaling Law / Jan 15, 2024

Scaling Factors / Jun 15, 2024

BitNet

Introducing RPT: Reinforcement Pre-Training

BitNet

Rectified Sparse Attention

On-Policy RL with Optimal Reward Baseline

Advancing the Foundation of RL in LLMs

BitNet

The Frontier of Reward Models

Think Only When You Need with Large Hybrid-Reasoning Models

Adaptive Thinking Models

BitNet

Introducing BitNet b1.58 2B4T - Scaling Native 1-bit LLMs

BitNet

BitNet v2: Native 4-bit Activations for 1-bit LLMs

BitNet

LatentLM: A Grand Unification of Multimodality

BitNet

Differential Transformer

BitNet

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

BitNet

Scaling Laws of Synthetic Data for Language Models

YOCO

YOCO: Decoder-Decoder Architectures for Large Language Models
// Gated RetNet (RetNet-3)

BitNet

BitNet a4.8: 4-bit Activations for 1-bit LLMs

MoE

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

MoE

Bootstrap Your Own Context Length

MoE

MH-MoE (v2): Multi-Head Mixture-of-Experts

Fully Sparsely-Activated LLMs

1-bit AI Infra / bitnet.cpp: Running LLMs on CPUs

Fully Sparsely-Activated LLMs

Q-Sparse / Block Q-Sparse: Fully Sparsely-Activated LLMs

BitNet

The Era of 1-bit LLMs: Training Tips, Code and FAQ

MoE

MELLE: Autoregressive Speech Synthesis without Vector Quantization

MoE

VALL-E 2: Human Parity Zero-Shot Text to Speech Synthesis

Learning Law

The Learning Law: Towards Optimal Learning of Language Models

The Mind's Eye of (M)LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Multilingual E5 Text Embeddings

BitNet: 1-bit Transformers and LLMs

RetNet

Retentive Network: Revolutionizing Transformers for Large Language Models

LongViT

LongViT (LongNet for Vision): When an Image is Worth 1,024 × 1,024 Words

Kosmos

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Kosmos 2.5

Kosmos-2.5: A Multimodal Literate Model

LLM4Science

Large Language Model for Science: A Study on P vs. NP

LongNet

LongNet: Scaling Transformers to 1,000,000,000 Tokens

KOSMOS-2

Kosmos-2: Grounding Multimodal Large Language Models (MLLMs) to the World

KOSMOS-1

Kosmos-1: A Multimodal Large Language Model (MLLM)

VALL-E

WavMark: Watermarking for Audio Generation

VALL-E

VALL-E (X): Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

LLMA

PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

LLMA

AdaLLM: Adapting Large Language Models via Reading Comprehension

MiniLLM

MiniLLM: Knowledge Distillation of Large Language Models

LongMem

Large Language Models with Long-Term Memory

LLMA

LLM Accelerator: Lossless Acceleration of Large Language Models

XPos

A Length-Extrapolatable Transformer

ICL

Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta Optimizers

Promptist

Promptist: Optimizing Prompts for Text-to-Image Generation

Structured Prompting

Structured Prompting: Scaling In-Context Learning to 1,000 Examples

TorchScale

TorchScale: Transformers at (Any) Scale

Magneto: A Foundation Transformer

October 13, 2022

BEiT-3

BEiT-3: A General-Purpose Multimodal Foundation Model

Language Models are General-Purpose Interfaces

DeepNet: Scaling Transformers to 1,000 Layers

BEiT: BERT Pre-Training of Image Transformers

MiniLM: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers

XLM-E: Efficient Multilingual Language Model Pre-training

UniLM: Unified Language Model Pre-training