Glossary term
Evaluation and Benchmarks
A long-context evaluation task that tests whether a model can recall a specific fact (the "needle") buried in a long document (the "haystack").
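Anthropic used Needle in a Haystack to evaluate Claude 3's 200k-token context window. In the classic form of the test, a sentence ('The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park') is inserted at various positions in a long document and the model is asked to recall it. That needle comes from Greg Kamradt's original test; Anthropic's Claude 3 evaluation used a variant with 30 randomized needle/question pairs.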
Google used Needle in a Haystack to validate Gemini 1.5 Pro's 1M-token context window, reporting near-perfect recall at all positions across 1M tokens, versus degradation at 50k tokens for GPT-4.
LangChain's eval library provides a Needle in a Haystack test suite for RAG pipelines, which teams use to verify that important information is not lost when retrieving from large document collections.
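The procedure described in the examples above (place a needle sentence at a chosen depth in filler text, then ask the model to recall it) can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's actual harness: `build_haystack`, `run_needle_test`, the `ask_model` callable, and the substring-based scoring are all hypothetical names and simplifications introduced here, and depth is measured in sentences rather than tokens.

```python
from typing import Callable, Dict, List

# Needle and question taken from the example sentence quoted above.
NEEDLE = "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"


def build_haystack(filler_sentences: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences)


def run_needle_test(
    ask_model: Callable[[str], str],  # hypothetical: any function mapping prompt -> answer
    filler_sentences: List[str],
    depths: List[float],
    expected_phrase: str = "Dolores Park",
) -> Dict[float, bool]:
    """Return, for each depth, whether the model's answer contains the expected phrase."""
    results: Dict[float, bool] = {}
    for depth in depths:
        document = build_haystack(filler_sentences, NEEDLE, depth)
        prompt = (
            f"{document}\n\n"
            f"Question: {QUESTION}\n"
            "Answer using only the document above."
        )
        answer = ask_model(prompt)
        # Simplified scoring (an assumption): case-insensitive substring match.
        results[depth] = expected_phrase.lower() in answer.lower()
    return results
```

In practice `ask_model` would wrap a call to the model under test, and the run would be repeated across several context lengths as well as depths to produce the familiar recall heatmap. Real evaluations often score answers with a grader model instead of a substring match.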