Chunking

Splitting documents into smaller pieces for indexing and retrieval.
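
As a minimal sketch of the definition above, the simplest form of chunking cuts a text into fixed-size windows that share a few words with their neighbours. The function name `chunk_words` and its parameters are hypothetical, and whitespace-separated words stand in for real tokenizer tokens, so the sizes are only approximate.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words. Words are an assumed
    stand-in for tokens; a real pipeline would count tokenizer tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made only of overlap is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

For example, `chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)` returns `["a b c d", "d e f g", "g h i j"]`: each chunk starts with the last word of the previous one.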

1. LlamaIndex uses recursive character chunking with 512-token chunks and 50-token overlaps when indexing a law firm's contract database. The overlap repeats the end of each chunk at the start of the next, which reduces the chance that a clause is cut off at a chunk boundary.
2. Unstructured.io's document-processing pipeline uses semantic chunking that splits PDF reports at natural section boundaries (headings, tables, figures) rather than at fixed token counts, improving retrieval precision by 22%.
3. A hospital knowledge-base system uses sentence-window chunking: each indexed chunk is a single sentence, and at retrieval time the three sentences on either side are added back so the model sees the full context around the matched sentence.
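
The sentence-window idea in the last example can be sketched in a few lines. This is an illustration under stated assumptions, not the API of any particular library: `split_sentences` and `build_sentence_windows` are hypothetical names, and the regex-based sentence splitter is deliberately naive (it will mis-split abbreviations such as "Dr.").

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter (an assumption): break on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_sentence_windows(text: str, window: int = 3) -> list[dict]:
    """Return one record per sentence.

    "sentence" is the text that would be embedded and matched;
    "window" is the sentence plus up to `window` neighbours on each
    side, which would be handed to the model after a match.
    """
    sentences = split_sentences(text)
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window)                   # clamp at the document start
        hi = min(len(sentences), i + window + 1)  # clamp at the document end
        records.append({
            "sentence": sentence,
            "window": " ".join(sentences[lo:hi]),
        })
    return records
```

Matching is done against the short `sentence` field, which keeps the match precise, while the longer `window` field supplies the surrounding context at answer time. With `window=1`, the middle sentence of "A one. B two. C three." yields the window "A one. B two. C three.".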
