OpenScholar: Scientific Literature Synthesis with Retrieval-Augmented LMs
OpenScholar is an open-source, retrieval-augmented language model (LM) designed to help researchers navigate and synthesize scientific literature. Developed by the Allen Institute for AI (AI2) and the University of Washington, OpenScholar answers scientific queries by searching a datastore of 45 million open-access papers, retrieving relevant passages, and generating citation-backed responses. The work was published in Nature in February 2026.
Unlike general-purpose LLMs, which frequently hallucinate citations (GPT-4o hallucinates citations 78-90% of the time), OpenScholar achieves citation accuracy on par with human experts. In human evaluations conducted by 16 PhD-level experts, OpenScholar’s responses were preferred over expert-written answers 51% of the time for the 8B variant and 70% of the time for the GPT-4o-augmented variant.
Key features
- Retrieval-augmented generation over 45M papers: OpenScholar searches a datastore of 45 million open-access papers (~236 million passage embeddings) drawn from Semantic Scholar, ensuring responses are grounded in real, retrievable literature rather than model memory.
- Iterative self-feedback inference: At inference time, OpenScholar uses a self-feedback loop to iteratively refine its outputs — each iteration retrieves additional papers, improving factuality, coverage, and citation accuracy through natural language feedback.
- Highly accurate citations: While GPT-4o hallucinates the vast majority of its cited papers, OpenScholar’s retrieval-first design ensures all citations correspond to real, retrievable sources.
- Fully open-source: All code, model checkpoints, retriever/reranker weights, retrieval index, training data, and evaluation benchmarks are publicly available — the first complete open release of a scientific assistant LM pipeline.
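The retrieve-generate-refine cycle described above can be sketched as follows. This is an illustrative outline only: `retrieve`, `generate`, and `get_feedback` are toy stand-ins for the real retriever, reranker, and generator LM, and the control flow is an assumption based on the description, not the released implementation.

```python
# Hypothetical sketch of OpenScholar-style self-feedback inference.
# All three helpers are toy stand-ins, not the real components.

def retrieve(query, k=3):
    # Stand-in for the dense retriever + reranker over the paper datastore.
    corpus = {
        "p1": "Retrieval-augmented generation grounds LM outputs in documents.",
        "p2": "Citation accuracy improves when passages support each claim.",
        "p3": "Self-feedback loops iteratively refine long-form answers.",
    }
    return list(corpus.items())[:k]

def generate(query, passages):
    # Stand-in for the generator LM producing a citation-backed draft.
    cites = " ".join(f"[{pid}]" for pid, _ in passages)
    return f"Answer to '{query}' {cites}"

def get_feedback(draft):
    # Stand-in for natural-language self-feedback; returns follow-up
    # queries for claims needing more support, or [] when satisfied.
    return []

def openscholar_answer(query, max_iters=3):
    passages = retrieve(query)
    draft = generate(query, passages)
    for _ in range(max_iters):
        feedback = get_feedback(draft)
        if not feedback:
            break
        for follow_up in feedback:            # each feedback item triggers
            passages += retrieve(follow_up)   # additional retrieval
        draft = generate(query, passages)     # regenerate with more context
    return draft
```

The key design point the sketch captures is that refinement is retrieval-driven: each feedback round pulls in new passages before regenerating, rather than rewriting from the same context.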
Model variants and sizes
OpenScholar can be used with different underlying language models:
- OpenScholar-8B (OS-8B): A fine-tuned version of Llama 3.1 8B, optimized for scientific literature synthesis. This is the flagship open-weight model. Available on Hugging Face. Despite its compact size, it outperforms GPT-4o by 6.1% in correctness on multi-paper synthesis tasks, and is 100x more cost-efficient than comparable systems like PaperQA2.
- OpenScholar-GPT4o (OS-GPT4o): The OpenScholar pipeline (datastore, retriever, reranker, and self-feedback loop) applied on top of GPT-4o. This variant improves GPT-4o’s correctness by 12% and raises citation F1 from 0.1 to 39.5, demonstrating that the pipeline can enhance an off-the-shelf LLM.
- OpenScholar-70B (OS-70B): The pipeline applied using Llama 3.1 70B as the underlying generator, offering a middle ground between the compact 8B model and proprietary API-based options.
How the 8B model was trained
The OpenScholar-8B model was trained using the same self-feedback pipeline used at inference time, but repurposed for synthetic data generation:
- Curated abstracts: The pipeline starts from 1 million curated scientific paper abstracts drawn from the datastore.
- Synthetic data generation: The self-feedback loop was then used to generate 130,000 high-quality training instances, in which the model iteratively refined its own outputs with retrieval feedback.
- Instruction tuning: The final 13K-instance instruction-tuning dataset (OS_Train_Data) was used to fine-tune Llama 3.1 8B with a modified version of torchtune on 8x A100 GPUs.
This approach allows a compact 8B model to achieve performance competitive with much larger proprietary models by distilling the quality of the iterative self-feedback pipeline into the model weights.
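In outline, repurposing the inference pipeline for data generation amounts to turning each abstract into a question and letting the self-feedback loop produce the answer side of a training pair. The sketch below illustrates that shape only; the function names and the record format are assumptions for illustration, not the released OS_Train_Data schema.

```python
# Illustrative sketch: packaging self-feedback outputs as
# instruction-tuning pairs. Schema and helpers are hypothetical.

def make_training_instance(abstract, question_fn, answer_fn):
    # question_fn: derives a literature question from an abstract
    # answer_fn: runs the retrieval + self-feedback loop to produce
    #            a refined, citation-backed answer
    question = question_fn(abstract)
    answer = answer_fn(question)
    return {"instruction": question, "output": answer}

# Toy stand-ins so the sketch is runnable:
abstracts = ["Transformers scale with data.", "RAG reduces hallucination."]
question_fn = lambda a: f"What does the literature say about: {a}"
answer_fn = lambda q: f"Refined, citation-backed answer to '{q}' [1][2]"

train_data = [make_training_instance(a, question_fn, answer_fn)
              for a in abstracts]
```

Fine-tuning on pairs like these is what distills the (expensive) iterative pipeline into a single forward pass of the 8B model.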
Evaluation: ScholarQABench
To rigorously evaluate scientific literature synthesis, the authors created ScholarQABench, the first large-scale multi-domain benchmark for this task:
- 2,967 expert-written queries and 208 long-form answers across four domains: computer science, physics, neuroscience, and biomedicine.
- Evaluation metrics include correctness, citation accuracy (are cited papers real and relevant?), coverage (does the response address all aspects of the query?), and writing quality.
- Human evaluations were conducted by 16 PhD-level experts on 108 questions, providing gold-standard comparisons between model-generated and expert-written responses.
Key results on ScholarQABench:
| Model | Correctness vs. GPT-4o | Citation quality | Human preference vs. expert |
|---|---|---|---|
| GPT-4o (no retrieval) | baseline | Hallucinates 78-90% of citations | Preferred 32% of the time |
| OpenScholar-8B | +6.1% | On par with human experts | Preferred 51% of the time |
| OpenScholar-GPT4o | +12% | Citation F1: 0.1 → 39.5 | Preferred 70% of the time |
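For context on the citation F1 numbers in the table: citation F1 is conventionally the harmonic mean of citation precision (what fraction of cited papers actually support their claims) and citation recall (what fraction of claims are backed by a correct citation). A minimal sketch, with illustrative precision/recall values chosen for the example rather than taken from the paper:

```python
def citation_f1(precision, recall):
    # Harmonic mean of citation precision and recall; defined as 0
    # when both components are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values: if 45% of citations support their claims
# (precision) and 35% of claims have a correct citation (recall),
# F1 is about 0.394 -- i.e. roughly 39.4 on the 0-100 scale used above.
score = citation_f1(0.45, 0.35)
```

The harmonic mean punishes imbalance: a system that cites many real-looking but irrelevant papers (high recall, low precision) still scores near zero, which is why hallucination-prone baselines land around 0.1.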
GenAI use at UW-Madison
UW–Madison faculty, staff, students, and affiliates are required to follow campus policies relevant to AI use. Uses of generative AI that are explicitly prohibited by policy include, but are not limited to, the following:
- Entering any sensitive, restricted or otherwise protected institutional data – including hard-coded passwords – into any generative AI tool or service;
- Using AI-generated code for institutional IT systems or services without review by a human to verify the absence of malicious elements;
- Using generative AI to violate laws; institutional policies, rules or guidelines; or agreements or contracts.
Potential use cases
- Literature reviews: Rapidly synthesize the state of research on a topic with properly cited sources, saving hours of manual search and reading. Particularly useful for getting up to speed in unfamiliar fields.
- Research question exploration: Ask nuanced scientific questions and receive grounded answers that point you to the most relevant papers, helping identify gaps and opportunities in the literature.
- Grant writing and proposals: Quickly gather and cite supporting evidence for research proposals, ensuring claims are backed by real, verifiable literature.
- Cross-disciplinary research: Explore connections between fields (e.g., neuroscience and computer science) by querying across OpenScholar’s multi-domain datastore of 45 million papers.
- Teaching and mentoring: Help students and early-career researchers learn to navigate scientific literature effectively, with a tool that models good citation practices.
Links
- Paper: arXiv:2411.14199 | Nature
- Demo: openscholar.allen.ai
- Hugging Face model (8B): OpenSciLM/Llama-3.1_OpenScholar-8B
- Hugging Face retriever: OpenSciLM/OpenScholar_Retriever
- GitHub: AkariAsai/OpenScholar
- Project page: openscilm.allen.ai
Questions
If you have any lingering questions about this resource, please feel free to post to the Nexus Q&A on GitHub. We will improve materials on this website as additional questions come in.