Glossary term
Multimodal AI
AI capability to parse, interpret, and reason over tabular data in images or documents.
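TableFormer (IBM Research) recognizes table structure in document images. It predicts where each cell sits in the row and column layout, including merged cells, multi-level headers, and spanning rows that rule-based parsers often miss. IBM's open-source Docling toolkit uses it to extract structured tables from documents such as financial reports and regulatory filings.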
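TableFormer is a learned model, but what it hands downstream is a set of cells with positions and spans. The Python sketch below assumes a hypothetical `Cell` record with `row`, `col`, `row_span`, and `col_span` fields (not TableFormer's or any toolkit's real output schema). It shows the post-processing that such structure makes possible: expanding merged cells into a dense grid and flattening a two-level header into column names. The table contents are invented.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """A detected table cell (hypothetical format, not TableFormer's actual output schema)."""
    text: str
    row: int           # top row the cell occupies
    col: int           # leftmost column the cell occupies
    row_span: int = 1  # number of rows the cell covers
    col_span: int = 1  # number of columns the cell covers


def to_grid(cells: list[Cell]) -> list[list[str]]:
    """Expand merged/spanning cells into a dense grid, copying each cell's text into every slot it covers."""
    n_rows = max(c.row + c.row_span for c in cells)
    n_cols = max(c.col + c.col_span for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        for r in range(c.row, c.row + c.row_span):
            for k in range(c.col, c.col + c.col_span):
                grid[r][k] = c.text
    return grid


def flatten_headers(grid: list[list[str]], n_header_rows: int) -> list[str]:
    """Join stacked header rows into one name per column, skipping repeats caused by vertical spans."""
    names = []
    for k in range(len(grid[0])):
        parts: list[str] = []
        for r in range(n_header_rows):
            text = grid[r][k]
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        names.append(" / ".join(parts))
    return names


# Hypothetical two-level header: "Segment" spans two rows, "Revenue" spans two columns.
cells = [
    Cell("Segment", 0, 0, row_span=2),
    Cell("Revenue", 0, 1, col_span=2),
    Cell("2022", 1, 1),
    Cell("2023", 1, 2),
    Cell("Cloud", 2, 0),
    Cell("10", 2, 1),
    Cell("12", 2, 2),
]
grid = to_grid(cells)
header = flatten_headers(grid, n_header_rows=2)
print(header)  # ['Segment', 'Revenue / 2022', 'Revenue / 2023']
```

Copying a spanning cell's text into every slot it covers is one design choice among several; it keeps each row self-describing, which suits loading the result into a dataframe or database.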
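TAPAS (Google) answers natural-language questions over tables without generating SQL or any other logical form. It selects the relevant cells and, where needed, predicts an aggregation (SUM, AVERAGE, or COUNT) to apply to them. Because it reads a table as rows and columns of text, HTML or PDF tables must first be extracted into that form. Paired with such an extraction step, it lets analysts query tables embedded in reports in plain language, for example inside enterprise BI tools.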
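At inference, TAPAS scores each cell for selection and picks an aggregation operator (NONE, SUM, AVERAGE, or COUNT in the paper's setup). The Python sketch below shows only that final step: turning per-cell probabilities and an aggregation label into an answer. `answer_from_predictions` is a hypothetical helper, the 0.5 threshold is an assumption, and the probabilities are invented rather than real model output. In the Hugging Face `transformers` library, the model itself is available as `TapasForQuestionAnswering`.

```python
from statistics import mean


def answer_from_predictions(table, cell_probs, aggregation, threshold=0.5):
    """Hypothetical TAPAS-style post-processing: select cells, then aggregate.

    table:       rows of cell strings (header row excluded)
    cell_probs:  per-cell selection probabilities, same shape as table
    aggregation: one of "NONE", "SUM", "AVERAGE", "COUNT"
    """
    # Keep every cell whose selection probability exceeds the (assumed) threshold.
    selected = [
        cell
        for row, probs in zip(table, cell_probs)
        for cell, p in zip(row, probs)
        if p > threshold
    ]
    if aggregation == "NONE":
        return selected  # the answer is the selected cell text itself
    if aggregation == "COUNT":
        return len(selected)
    numbers = [float(v.replace(",", "")) for v in selected]  # strip thousands separators
    if aggregation == "SUM":
        return sum(numbers)
    if aggregation == "AVERAGE":
        return mean(numbers)
    raise ValueError(f"unknown aggregation: {aggregation}")


# Columns: Region, Revenue. Probabilities are invented, as if the question were
# "What is the total revenue of North and West?"
rows = [["North", "120"], ["South", "80"], ["West", "100"]]
probs = [[0.02, 0.91], [0.01, 0.08], [0.03, 0.87]]
print(answer_from_predictions(rows, probs, "SUM"))  # 220.0
```

No query is produced at any point: the answer is computed directly from the cells the model selects and the operator it chooses.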
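LayoutLMv3 (Microsoft) is a general document-understanding model that jointly encodes text tokens, their 2D layout positions, and image patches, which lets it interpret tables and forms in scanned documents. At publication (2022) it reported state-of-the-art results on benchmarks including FUNSD (form understanding) and DocVQA (document visual question answering), tasks relevant to invoice and form processing.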
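LayoutLM-family models take a bounding box for each word, scaled to a 0-1000 integer grid so that layout positions do not depend on page resolution. The Python sketch below covers only that layout-input step, not the model. `normalize_box` is a hypothetical helper, and the page size, words, and pixel coordinates are invented OCR output.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 integer grid used by LayoutLM-style models."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR output for a 2000 x 3000 pixel scan: one word and one pixel box per token.
page_width, page_height = 2000, 3000
words = ["Total", "$1,250.00"]
pixel_boxes = [(1200, 2700, 1400, 2760), (1500, 2700, 1800, 2760)]

boxes = [normalize_box(b, page_width, page_height) for b in pixel_boxes]
for word, box in zip(words, boxes):
    print(word, box)
# Total [600, 900, 700, 920]
# $1,250.00 [750, 900, 900, 920]
```

The words, their normalized boxes, and the page image together make up the three input streams (text, layout, image) that the model encodes jointly.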