Upload a PDF, extract tables as JSON or CSV. Structured data in seconds, no signup required.
This tool uses the same API you can call from your code. Extract tables from PDFs at scale — one endpoint, structured output, every time.
import PDFBase from 'pdfbase'
const client = new PDFBase()
const result = await client.extract.tables({
  source: { url: 'https://example.com/report.pdf' },
  format: 'json'
})
// result.tables → array of extracted tables
// result.tables[0].headers → column names
// result.tables[0].rows → row data
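Once you have a table, a common next step is turning it into one object per row. The sketch below assumes each table looks like `{ headers: string[], rows: unknown[][] }` with cells aligned positionally to `headers`; the exact shape of `rows` is not documented on this page, and `tableToRecords` is a hypothetical helper, not part of the PDFBase client.

```javascript
// Assumption: every row is an array whose cells line up with `headers` by index.
function tableToRecords(table) {
  return table.rows.map((row) =>
    Object.fromEntries(
      // Missing trailing cells become null instead of undefined.
      table.headers.map((header, i) => [header, row[i] ?? null])
    )
  );
}

// Hand-written table in the assumed shape, standing in for result.tables[0].
const quarterlyTable = {
  headers: ['Quarter', 'Revenue'],
  rows: [
    ['Q1', 1200],
    ['Q2', 1350],
  ],
};

const records = tableToRecords(quarterlyTable);
// records[0] is { Quarter: 'Q1', Revenue: 1200 }
```

Keyed objects are easier to filter and validate than positional arrays, at the cost of silently overwriting values if two columns share a header name.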
PDF table extraction is the process of pulling structured tabular data out of PDF documents and converting it into machine-readable formats like JSON or CSV. PDFs were designed for visual presentation, not data interchange. The table you see in a PDF is just positioned text — there's no underlying spreadsheet, no rows-and-columns data model. Extracting that data requires analyzing text positions, detecting alignment patterns, and reconstructing the original table structure.
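To make "positioned text" concrete, here is a deliberately simplified sketch of rebuilding a grid from text fragments. It is not how the PDFBase API works internally. The fragment shape (`text`, `x`, `y`) and both tolerance values are assumptions chosen for illustration. Fragments are grouped into rows by similar `y`, column anchors are found by clustering `x` positions, and each fragment is assigned to its nearest anchor.

```javascript
// Group fragments whose y positions are within yTolerance of the row's first fragment.
function groupIntoRows(fragments, yTolerance) {
  const sorted = [...fragments].sort((a, b) => a.y - b.y);
  const rows = [];
  for (const frag of sorted) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(last.y - frag.y) <= yTolerance) {
      last.items.push(frag);
    } else {
      rows.push({ y: frag.y, items: [frag] });
    }
  }
  return rows;
}

// Cluster left edges into column anchors: a new column starts when the gap exceeds xTolerance.
function clusterColumns(fragments, xTolerance) {
  const xs = fragments.map((f) => f.x).sort((a, b) => a - b);
  const anchors = [];
  for (const x of xs) {
    if (anchors.length === 0 || x - anchors[anchors.length - 1] > xTolerance) {
      anchors.push(x);
    }
  }
  return anchors;
}

function reconstructGrid(fragments, { xTolerance = 5, yTolerance = 2 } = {}) {
  const anchors = clusterColumns(fragments, xTolerance);
  return groupIntoRows(fragments, yTolerance).map((row) => {
    const cells = anchors.map(() => '');
    // Left-to-right order keeps multi-fragment cells readable.
    row.items.sort((a, b) => a.x - b.x);
    for (const frag of row.items) {
      let best = 0;
      for (let i = 1; i < anchors.length; i++) {
        if (Math.abs(anchors[i] - frag.x) < Math.abs(anchors[best] - frag.x)) best = i;
      }
      cells[best] = cells[best] ? `${cells[best]} ${frag.text}` : frag.text;
    }
    return cells;
  });
}

// Made-up fragments with slightly jittered positions, as real PDFs often have.
const grid = reconstructGrid([
  { text: 'Item', x: 50, y: 100 },
  { text: 'Price', x: 200, y: 100 },
  { text: 'Widget', x: 50, y: 120 },
  { text: '9.99', x: 201, y: 121 },
  { text: 'Gadget', x: 51, y: 140 },
  { text: '19.50', x: 200, y: 140 },
]);
// grid is [['Item', 'Price'], ['Widget', '9.99'], ['Gadget', '19.50']]
```

This handles only left-aligned columns on a single page. Right-aligned numbers, merged cells, and wrapped text all break the nearest-anchor assumption.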
This is a hard problem. Tables in PDFs come in many forms: bordered tables with visible gridlines, borderless tables aligned by whitespace, tables that span multiple pages, cells that are merged across rows or columns, and tables nested inside other tables. Copy-pasting from a PDF reader destroys the structure entirely — you get a wall of concatenated text with no column separation.
The PDFBase table extraction API solves this by combining layout analysis with heuristic-based structure detection. Upload any PDF with tabular data — financial reports, research papers, government filings, invoices — and get clean, structured output you can immediately pipe into your data pipeline, import into a spreadsheet, or consume in your application.
Drag and drop a PDF file into the upload zone, or click to browse your files. The file is read client-side as base64 — nothing is uploaded until you click "Extract Tables". Maximum file size is 10MB.
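If you prepare the same kind of payload from your own code, a size check and base64 encoding might look like the sketch below. It assumes a Node.js runtime (it uses `Buffer`; in a browser, `FileReader.readAsDataURL` is the usual route) and that "10MB" means 10 × 1024 × 1024 bytes, which this page does not specify. `encodePdfForUpload` is a hypothetical helper name.

```javascript
// Assumption: the 10MB limit is counted in binary megabytes.
const MAX_PDF_BYTES = 10 * 1024 * 1024;

function encodePdfForUpload(bytes, maxBytes = MAX_PDF_BYTES) {
  if (bytes.length > maxBytes) {
    throw new RangeError(`PDF is ${bytes.length} bytes; the limit is ${maxBytes}`);
  }
  // A well-formed PDF begins with the ASCII marker "%PDF-".
  const magic = String.fromCharCode(...bytes.subarray(0, 5));
  if (magic !== '%PDF-') {
    throw new TypeError('File does not look like a PDF');
  }
  return Buffer.from(bytes).toString('base64');
}

// Tiny stand-in for real file contents (for example, from fs.readFileSync).
const sampleBytes = new TextEncoder().encode('%PDF-1.7\n');
const encodedPdf = encodePdfForUpload(sampleBytes);
```

Checking the size and the file marker before encoding avoids sending a request that is likely to be rejected.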
Select JSON for structured objects with headers and typed values — ideal for API consumption and code integration. Or choose CSV for flat comma-separated output — ready to open in Excel, Google Sheets, or feed into a data pipeline.
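If you start from JSON and later need CSV, the conversion is short but the quoting rules are easy to get wrong. This sketch follows the common RFC 4180 conventions (quote fields that contain commas, quotes, or line breaks, and double any embedded quotes). It assumes the `{ headers, rows }` table shape with rows as arrays; it is not a description of how the API produces its own CSV output.

```javascript
// Quote a field only when it contains a comma, a double quote, or a line break.
function escapeCsvField(value) {
  const text = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Assumption: table.rows is an array of arrays aligned with table.headers.
function tableToCsv(table) {
  return [table.headers, ...table.rows]
    .map((row) => row.map(escapeCsvField).join(','))
    .join('\r\n');
}

const vendorCsv = tableToCsv({
  headers: ['Name', 'Note'],
  rows: [
    ['Acme, Inc.', 'said "ok"'],
    ['Beta', 42],
  ],
});
// vendorCsv is 'Name,Note\r\n"Acme, Inc.","said ""ok"""\r\nBeta,42'
```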
Click "Extract Tables" and the PDF is sent to the PDFBase API for processing. The API detects all tables in the document, reconstructs their structure, and returns them as your chosen format. You get a visual preview, raw output, and download — all in one step.
Much of the world's most valuable data still lives locked inside PDFs. Government agencies, financial institutions, research journals, and enterprises publish data as PDF reports rather than APIs. Extracting that data programmatically is a core data engineering task across industries:
Quarterly earnings reports, balance sheets, P&L statements, and SEC filings are published as PDFs. Extracting these tables feeds trading algorithms, financial models, and automated compliance checks. Manual re-entry is too slow and error-prone at scale.
Research results live in PDF tables — experiment outcomes, measurement data, statistical summaries. Meta-analyses and literature reviews require extracting data from dozens or hundreds of papers to synthesize results across studies.
Census data, regulatory filings, public procurement records, and policy reports are almost always PDF-first. Civic tech projects, journalists, and policy researchers need automated extraction to make this data queryable and analyzable.
Accounts payable automation starts with extracting line items from vendor invoices. ERP integrations, expense management, and bookkeeping workflows all depend on pulling structured data from PDF invoices at high accuracy.
There are several approaches to extracting tables from PDFs. Each has different tradeoffs around accuracy, setup complexity, and handling of edge cases. Here's how the major options compare:
| Method | Setup | Borderless tables | Merged cells | Multi-page |
|---|---|---|---|---|
| Camelot (Python) | pip install + Ghostscript | Stream mode (fragile) | Partial | Manual stitching |
| Tabula (Java) | JVM + tabula-py wrapper | Stream mode | Limited | Manual stitching |
| Unstructured | Docker + multiple models | ML-based (good) | Supported | Supported |
| PDFBase API | One API call | Layout analysis | Supported | Automatic |
Camelot and Tabula are open-source libraries that work well for simple bordered tables but struggle with complex layouts. Unstructured is powerful but requires heavy infrastructure. PDFBase trades self-hosting for zero setup and a consistent API.
Bordered tables extract better. Tables with visible gridlines are significantly easier to detect than whitespace-aligned data. If you're generating the source PDF yourself, add borders to your tables. Even thin, light-colored borders dramatically improve extraction accuracy.
Watch for merged cells. Cells that span multiple columns or rows can confuse extractors. The PDFBase API handles most merged cell patterns, but if you're seeing incorrect column alignment, check whether the source PDF uses complex cell merging. Simplifying the table structure in the source document often fixes it.
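One cheap way to spot alignment problems in your consuming code is to flag rows whose cell count differs from the header count. This is a sketch under the assumption that rows come back as arrays; if the extractor pads every row to the same width, the check will find nothing and you would need to inspect cell contents instead. `findMisalignedRows` is a hypothetical helper.

```javascript
// Report the index and width of every row that does not match the header width.
function findMisalignedRows(table) {
  const expected = table.headers.length;
  return table.rows
    .map((row, index) => ({ index, cells: row.length }))
    .filter(({ cells }) => cells !== expected);
}

const suspectRows = findMisalignedRows({
  headers: ['Region', 'Q1', 'Q2'],
  rows: [
    ['North', 10, 12],
    ['South (merged)', 25], // a merged Q1/Q2 cell collapsed into one value
    ['West', 7, 9],
  ],
});
// suspectRows is [{ index: 1, cells: 2 }]
```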
Use digitally created PDFs when possible. PDFs generated from Word, LaTeX, or HTML contain real text data that extractors can read directly. Scanned PDFs (photos of documents) contain image pixels, not text, so they need OCR before table extraction can work. If you have a choice, always use the digital version.
Check header detection. The API attempts to detect header rows automatically by analyzing font weight, position, and content patterns. If headers are mis-detected, the JSON output's headers array may need adjustment in your consuming code.
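A small fallback in your consuming code can cover the case where no usable headers come back. The sketch below assumes that undetected headers show up as an empty or all-blank `headers` array, which this page does not confirm, and promotes the first data row instead. `ensureHeaders` is a hypothetical helper.

```javascript
// If headers are missing or blank, treat the first row as the header row.
function ensureHeaders(table) {
  const hasUsableHeaders =
    Array.isArray(table.headers) &&
    table.headers.some((h) => String(h ?? '').trim() !== '');
  if (hasUsableHeaders || table.rows.length === 0) return table;

  const [first, ...rest] = table.rows;
  return {
    ...table,
    headers: first.map((cell) => String(cell ?? '').trim()),
    rows: rest,
  };
}

const repairedTable = ensureHeaders({
  headers: [],
  rows: [
    ['Region', 'Units'],
    ['North', 12],
  ],
});
// repairedTable.headers is ['Region', 'Units'] and repairedTable.rows is [['North', 12]]
```

Tables that already have at least one non-blank header pass through unchanged, so the helper is safe to apply to every table.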
Multi-page tables work automatically. If a table spans a page break, the API stitches the rows together into a single table. You don't need to pre-process the PDF or specify page ranges. Repeated header rows on continuation pages are deduplicated automatically.
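For comparison, this is roughly what manual stitching involves when a tool returns one table per page. The sketch is illustrative only and is not the API's implementation. It assumes each per-page fragment has the `{ headers, rows }` shape and that a repeated header row matches the first fragment's headers cell for cell.

```javascript
// Merge per-page fragments into one table, dropping rows that repeat the header.
function stitchTables(fragments) {
  if (fragments.length === 0) return { headers: [], rows: [] };

  const headers = fragments[0].headers;
  const isHeaderRow = (row) =>
    row.length === headers.length &&
    row.every((cell, i) => String(cell).trim() === String(headers[i]).trim());

  const rows = [];
  for (const fragment of fragments) {
    for (const row of fragment.rows) {
      if (!isHeaderRow(row)) rows.push(row);
    }
  }
  return { headers, rows };
}

const stitched = stitchTables([
  { headers: ['Date', 'Amount'], rows: [['2024-01-01', 100]] },
  // Page 2: the extractor picked up the repeated header as a data row.
  { headers: ['Date', 'Amount'], rows: [['Date', 'Amount'], ['2024-01-02', 250]] },
]);
// stitched.rows is [['2024-01-01', 100], ['2024-01-02', 250]]
```

Note the tradeoff: a genuine data row that happens to equal the header text would also be dropped.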
Ready to extract tables from PDFs in your code?
Try the API free — 100 credits, no card