Vaswani, Shazeer, Parmar et al. — Google Brain
Attention Is All You Need
The 2017 paper that eliminated recurrence and convolutions entirely, replacing them with pure self-attention. Result: 28.4 BLEU on English-to-German translation, 41.8 on English-to-French — superior quality, faster training, and a parallelisable architecture that scaled to GPT-4, Gemini, and everything since. Now among the most-cited papers in history at 238,000+ citations.
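The paper's core operation fits in a few lines. Here is an illustrative pure-Python sketch of scaled dot-product attention — softmax(QKᵀ/√d)V — written over plain lists for clarity; a real implementation batches this as matrix multiplications:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Q, K, V are lists of vectors (lists of floats)."""
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # Output is the weight-averaged mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A query aligned with the first key pulls the output toward the first value vector — that selective mixing is the whole mechanism.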
Transformer: 238,000+ citations, no recurrence, no convolutions. There is a before and after this paper. Every AI system you interact with today runs on the architecture it introduced.
Ian Goodfellow, Jean Pouget-Abadie et al. — Université de Montréal
Generative Adversarial Networks
Two neural networks trained in adversarial opposition: a generator that learns to produce increasingly realistic outputs, a discriminator that learns to tell real from fake. Conceived in a single evening, written in a week, and cited 92,000+ times. This framework underpins virtually all generative image AI, from early deepfakes to Stable Diffusion.
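The adversarial objective itself is tiny. A sketch of the two loss terms, where `d_real` and `d_fake` stand for the discriminator's probability outputs on a real and a generated sample (the variable names are illustrative, not the paper's):

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator objective: maximise log D(x) + log(1 - D(G(z))).
    Returned negated, as a loss to minimise."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss from the paper:
    maximise log D(G(z)) rather than minimise log(1 - D(G(z)))."""
    return -math.log(d_fake)
```

Training alternates gradient steps on these two losses; the generator improves exactly when it drives `d_fake` upward.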
The conceptual leap that made generative AI possible. Reportedly conceived in a single evening and written in a week — and it changed everything.
Krizhevsky, Sutskever, Hinton — University of Toronto
ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)
AlexNet won the 2012 ImageNet competition by a margin that shocked the computer vision community — a 15.3% top-5 error rate against the runner-up's 26.2%, achieved by training a deep CNN on GPUs directly on raw pixels. DeepMind's first deep RL paper, published the following year, extended the same principle: learn directly from raw pixels and beat human experts at Atari games. Two papers, one revolution.
Raw pixels to superhuman performance. The single most consequential experiment in modern AI history — everything that followed traces back to this result.
Brown, Mann, Ryder et al. — OpenAI
Language Models are Few-Shot Learners (GPT-3)
The paper introducing GPT-3, a 175-billion-parameter language model that demonstrated remarkable few-shot learning — the ability to perform new tasks from just a handful of examples in the prompt, with no gradient updates. This was the first model to make the general public seriously reckon with what large language models could do, and it set the template for the ChatGPT era.
GPT-3 was the first model that felt like something genuinely new. This paper is the record of that moment.
Richard Sutton
The Bitter Lesson
70 years of AI research condensed into 1,700 words. Sutton's thesis: general methods that leverage computation always beat human-knowledge-based approaches. The "bitter lesson" for researchers is that their clever, domain-specific solutions are always eventually surpassed by brute-force scale. Written in 2019, it predicted the LLM era before it arrived.
70 years of AI prove: computation beats human knowledge. The most important 1,700-word essay in AI. Read it, then re-read it after every AI breakthrough.
Silver, Huang, Maddison et al. — DeepMind
Mastering the Game of Go with Deep Neural Networks and Tree Search (AlphaGo)
The paper documenting AlphaGo's defeat of European Go champion Fan Hui — the first time a computer program beat a professional human player at Go. The system combined deep convolutional policy and value networks with Monte Carlo tree search, trained on both human expert games and self-play. A landmark in AI capability that the world watched in real time.
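The glue between the networks and the search is a selection rule: at each tree node, pick the move with the best trade-off between estimated value and the policy network's prior. A simplified sketch of that PUCT score (constant and argument names are mine; the paper's full formulation has more detail):

```python
import math

def puct_score(q, prior, parent_visits, visits, c_puct=1.0):
    """Simplified PUCT rule from AlphaGo-style tree search:
    exploitation term q (mean value of this move so far) plus an
    exploration bonus scaled by the policy network's prior and
    shrinking as the move accumulates visits."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
```

Unvisited moves with a high prior get a large bonus, so the policy network steers the search toward promising lines before the value estimates firm up.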
Go was considered AI-hard for decades. AlphaGo's victory changed the field's sense of what was possible — and when.
Kaplan, McCandlish, Henighan et al. — OpenAI
Scaling Laws for Neural Language Models
GPT-3's secret: model performance scales predictably as a power law with compute, dataset size, and parameter count — and the three can be traded off against each other. This OpenAI paper gave the AI industry its roadmap, justifying the massive investment in scaling that produced GPT-3, GPT-4, and every frontier model since. The blueprint for modern AI development.
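The headline result is a one-line formula. A sketch using the paper's reported fitted constants for the parameter-count law, L(N) = (N_c / N)^α, holding data and compute non-limiting (constants quoted from the paper; the function name is illustrative):

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Kaplan et al.'s power law for test loss as a function of
    non-embedding parameter count N: L(N) = (N_c / N) ** alpha.
    Loss falls smoothly and predictably as N grows."""
    return (n_c / n_params) ** alpha
```

The strategic punchline is visible immediately: a 100x larger model yields a predictable, non-zero loss improvement, which is what justified each successive scale-up.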
This paper is why every major lab spent billions scaling up. Understanding it is understanding the strategic logic of the AI race.
Ouyang, Wu, Jiang et al. — OpenAI
Training Language Models to Follow Instructions with Human Feedback (InstructGPT)
The paper behind ChatGPT. InstructGPT introduced RLHF (Reinforcement Learning from Human Feedback) as a practical method for aligning language models with human intent. By supervised fine-tuning GPT-3 on human demonstrations, training a reward model on human preference comparisons, and then optimising against that reward model with reinforcement learning, the authors produced a model that was dramatically more helpful and less harmful than the base model — labellers preferred the 1.3-billion-parameter InstructGPT over the 100x-larger GPT-3.
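The heart of the pipeline is the reward model's pairwise loss: given two responses to the same prompt, penalise the model unless it scores the human-preferred one higher. An illustrative sketch (function and argument names are mine):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise reward-model loss used in RLHF:
    -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the model scores the human-preferred
    response increasingly above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Once trained this way, the reward model stands in for the human labellers, letting reinforcement learning run at a scale no annotation budget could match.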
RLHF is what turned a text predictor into an assistant. This paper is the technical foundation of every chat AI.
Leopold Aschenbrenner
Situational Awareness: The Decade Ahead
A 165-page essay by former OpenAI researcher Leopold Aschenbrenner, arguing that AGI is likely by 2027, that the US-China AI race is the defining geopolitical contest of the decade, and that AI labs are dangerously under-secured. Released in June 2024, it became the most-discussed AI document of the year, read by policymakers, investors, and researchers worldwide.
Whether you agree with it or not, this essay shaped the conversation about AI risk, national security, and the AGI timeline more than any other document in 2024.
Mnih, Kavukcuoglu, Silver et al. — DeepMind
Playing Atari with Deep Reinforcement Learning
The paper that launched modern deep reinforcement learning. DeepMind's DQN agent learned to play seven Atari games directly from raw pixel inputs, using only the game score as reward — no hand-crafted features, no game-specific knowledge — and outperformed a human expert on three of them. The 2015 Nature follow-up scaled the same algorithm to 49 games, reaching human-level or better on 29. This was the proof of concept that a single algorithm could master diverse tasks from raw sensory data.
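DQN's neural network approximates the classic Q-learning update. A tabular sketch of that underlying rule, with the Q-table stored as nested dicts (the data layout is illustrative):

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step, the rule DQN approximates with a
    deep network: move Q(s, a) toward the bootstrapped target
    r + gamma * max_a' Q(s', a')."""
    best_next = max(q[next_state].values()) if q[next_state] else 0.0
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]
```

DQN's contributions — experience replay and a periodically frozen target network — exist to make this update stable when Q is a neural network instead of a table.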
The paper that made the world take DeepMind seriously, and that made reinforcement learning a mainstream research direction.
Tim Urban
The AI Revolution: The Road to Superintelligence
A two-part, deeply researched long-form essay that introduced the concepts of AGI, superintelligence, and existential AI risk to a mainstream audience. Tim Urban's signature style — rigorous research, irreverent humour, hand-drawn diagrams — made ideas from Bostrom's Superintelligence accessible to millions of non-technical readers. Elon Musk shared it widely; it remains the most-read popular introduction to AI risk.
This is the essay that made AI risk a mainstream conversation. If you want to understand why people are worried, start here.
Devlin, Chang, Lee, Toutanova — Google AI Language
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT introduced the pre-training / fine-tuning paradigm that now dominates NLP. By pre-training a transformer bidirectionally on masked language modelling and next sentence prediction, then fine-tuning on downstream tasks, BERT achieved state-of-the-art results on 11 NLP benchmarks simultaneously. It established that a single pre-trained model could be adapted to almost any language task.
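BERT's masked-LM corruption scheme is simple to sketch: select roughly 15% of positions, and of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged. An illustrative toy version (vocabulary and function name are mine):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=("the", "cat", "sat"), seed=0):
    """BERT-style masked-LM corruption. Returns the corrupted
    sequence and the positions the model must predict."""
    rng = random.Random(seed)
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            roll = rng.random()
            if roll < 0.8:
                out[i] = "[MASK]"       # 80%: mask it
            elif roll < 0.9:
                out[i] = rng.choice(vocab)  # 10%: random token
            # else 10%: keep the original token unchanged
    return out, targets
```

Keeping some selected tokens unchanged or randomised stops the model from relying on `[MASK]` ever appearing at fine-tuning time, when it never does.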
BERT is the model that proved transfer learning works at scale in NLP. It is the direct ancestor of every modern language model.
Gwern Branwen
The Scaling Hypothesis
Gwern's essay arguing that simply scaling up neural networks — more parameters, more data, more compute — produces qualitatively new capabilities, not just quantitative improvements. Written in the wake of GPT-3's release, it treated GPT-3 as decisive evidence for the hypothesis and predicted that further scale would keep delivering. The essay is a masterclass in reasoning from first principles about a technology's trajectory.
Gwern called the scaling era before it happened. This essay is required reading for understanding why the field bet so heavily on scale.
He, Zhang, Ren, Sun — Microsoft Research
Deep Residual Learning for Image Recognition (ResNet)
The paper introducing residual connections — skip connections that allow gradients to flow directly through layers, enabling the training of networks hundreds of layers deep. ResNet won the 2015 ImageNet competition with a 152-layer network and a 3.57% top-5 error rate. Residual connections are now a standard component of virtually every deep learning architecture, including transformers.
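The residual idea reduces to one line: the block outputs x + F(x), so the layer only has to learn a correction to the identity. A minimal sketch, with `layer` standing in for any learned transformation:

```python
def residual_block(x, layer):
    """Core ResNet computation: output = x + F(x).
    Gradients flow untouched through the identity path, which is
    what makes very deep stacks trainable."""
    return [xi + fi for xi, fi in zip(x, layer(x))]
```

If the layer learns to output zeros, the block is a perfect identity — so adding depth can never make the network strictly worse, which is exactly the degradation problem the paper set out to solve.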
One of the most cited papers in all of computer science. Residual connections are in everything — you are using ResNet ideas every time you use a modern AI system.
Sam Altman
Moore's Law for Everything
Sam Altman's vision essay arguing that AI will soon drive a Moore's Law-style compression of costs across every domain — labour, healthcare, education, housing — and that this will require new economic structures to distribute the gains. Written before ChatGPT, it reads as a blueprint for how OpenAI's CEO thinks about AI's civilisational impact and the policy responses it demands.
The clearest statement of the techno-optimist case for AI. Understand this essay to understand the worldview driving the leading AI labs.
Bai, Jones, Ndousse et al. — Anthropic
Constitutional AI: Harmlessness from AI Feedback
Anthropic's paper introducing Constitutional AI — a method for training AI systems to be helpful and harmless using a written set of principles (a "constitution") and AI-generated feedback, rather than relying solely on human labellers. The technique underlies Claude and represents a significant advance in scalable alignment: using AI to supervise AI, guided by explicit human values.
Constitutional AI is Anthropic's answer to the alignment problem. This paper is the technical foundation of the approach that produced Claude.
Jumper, Evans, Pritzel et al. — DeepMind
Highly Accurate Protein Structure Prediction with AlphaFold
AlphaFold 2 solved the 50-year-old protein folding problem — predicting a protein's 3D structure from its amino acid sequence with atomic accuracy. DeepMind's system achieved a median GDT score of 92.4 across targets in the CASP14 assessment, far exceeding all prior methods. The paper is widely considered the most significant scientific application of AI to date, with direct implications for drug discovery and biology.
The moment AI moved from beating humans at games to solving real scientific problems. The most important AI paper outside of NLP.
Zhao, Zhou, Li et al. — Renmin University of China
A Survey of Large Language Models
The most comprehensive survey of large language models available, covering pre-training, fine-tuning, alignment, evaluation, and application across 200+ pages. The paper tracks the evolution from early language models through GPT-3, ChatGPT, and GPT-4, and provides a structured taxonomy of the field. Widely used as a reference by researchers and practitioners entering the LLM space.
If you want a single document that maps the entire LLM landscape, this is it. The most-cited survey paper in the field.
Ho, Jain, Abbeel — UC Berkeley
Denoising Diffusion Probabilistic Models
The paper that established diffusion models as the dominant paradigm for generative image AI. Ho et al. showed that a model trained to iteratively denoise images could generate high-quality samples competitive with GANs, with better training stability and mode coverage. This work is the direct foundation of Stable Diffusion, DALL-E 2, Midjourney, and every modern image generation system.
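The forward (noising) half of a diffusion model has a closed form: x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε, with ε drawn from a standard normal. A minimal sketch over a plain list of floats (names and the scalar-per-sequence treatment of ᾱ are illustrative):

```python
import math
import random

def forward_diffuse(x0, alpha_bar, seed=0):
    """Closed-form DDPM forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    eps ~ N(0, 1). The network is trained to predict eps, then
    inverts this corruption step by step at sampling time."""
    rng = random.Random(seed)
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    return [scale_signal * xi + scale_noise * rng.gauss(0.0, 1.0)
            for xi in x0]
```

At ᾱ = 1 the data passes through untouched; as ᾱ → 0 the signal drowns in Gaussian noise — training simply asks the model to undo every point on that spectrum.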
Stable Diffusion, Midjourney, DALL-E — they all run on the ideas in this paper. The foundation of the generative image revolution.
Bubeck, Chandrasekaran, Eldan et al. — Microsoft Research
Sparks of Artificial General Intelligence: Early experiments with GPT-4
A 154-page paper from Microsoft Research presenting early experiments with GPT-4 and arguing that it shows "sparks of AGI" — the ability to reason, plan, and solve problems across domains in ways that go beyond pattern matching. The paper sparked intense debate about what GPT-4 actually understands, and whether current LLMs are approaching general intelligence. One of the most-read AI papers of 2023.
The paper that made the AGI debate mainstream. Whether you agree with its conclusions or not, it defined the conversation of 2023.