Extract Tables from PDF to CSV and JSON

You need to extract tables from PDF and get clean rows out as CSV or JSON. Sounds simple. It is not, and the reason matters.

A table in a PDF is not a table. There is no row object, no column object, no cell. What you actually have is a pile of text fragments with x and y coordinates printed onto a page. The grid you see with your eyes is an illusion your brain assembles from alignment and spacing. The PDF format does not store "this number belongs in column 3, row 7." It stores "draw the string 1,240.50 at point (412, 308)." Every table extraction tool is reverse-engineering structure that was thrown away when the document was generated.

There is a second fork before you write any code. PDFs come in two flavors. A digital PDF has a real text layer: the characters are stored as text and you can select them in a viewer. A scanned PDF is just an image of a page, usually from a scanner or a photo. There is no text layer at all, only pixels. If you run a text-based extractor on a scanned PDF you get nothing back, because there is no text to find. Scanned documents need OCR first to turn pixels into characters, and only then can you look for tables. Most tools in this guide assume a digital PDF. We will be explicit about which ones handle scans.
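One way to make that fork explicit is to count extractable characters per page before choosing a path. The sketch below is only an illustration: `needs_ocr` and its `min_chars` threshold are hypothetical, and the page strings are assumed to come from whatever text extractor you already use (for example pdfplumber's `page.extract_text()`).

```python
def needs_ocr(page_texts, min_chars=20):
    """Return the indexes of pages that appear to have no usable text layer.

    page_texts: one string (or None) per page, as returned by a text extractor.
    min_chars: assumed threshold; pages with fewer non-space characters are
    treated as scans. Tune it for your own documents.
    """
    scanned = []
    for index, text in enumerate(page_texts):
        # Drop all whitespace so a page of blank lines does not count as text.
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            scanned.append(index)
    return scanned


# Illustrative input: one digital page, one page with no text, one whitespace-only page.
pages = [
    "Invoice 1042\nItem Qty Unit Price Total\nAPI Credits 5000 0.09 450.00",
    None,
    "  \n ",
]
print(needs_ocr(pages))  # [1, 2]
```

Pages flagged here go to OCR first; the rest can go straight to a text-layer extractor.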

This guide walks through the four approaches that actually work: Camelot, Tabula-py, pdfplumber, and a structured-extraction API. Real code for each. Then the edge cases that break all of them. If you also need plain text out of PDFs, see our companion post on PDF parsing in Python.

Camelot (Python)

Camelot is the tool people reach for first, and for good reason. It is built specifically for tables and it gives you an accuracy score for every table it finds, so you can tell when it has failed instead of silently handing you garbage.

The key concept is its two parsing modes, and picking the wrong one is the most common mistake.

lattice is for tables with visible ruling lines: the cells are boxed in by drawn borders. Camelot detects those lines and uses them to carve out cells. This is the reliable mode when it applies.
stream is for borderless tables where columns are separated only by whitespace. Camelot infers column boundaries from gaps in the text. It works, but it guesses, and it guesses worse when columns are tight or values are ragged.

camelot_example.py

import camelot

# lattice: ruled tables. stream: whitespace-aligned tables.
tables = camelot.read_pdf('report.pdf', pages='all', flavor='lattice')
print(len(tables))  # number of tables found

for i, table in enumerate(tables):
    print(table.parsing_report)
    # {'accuracy': 98.4, 'whitespace': 12.1, 'order': 1, 'page': 1}
    df = table.df  # a pandas DataFrame
    table.to_csv(f'table_{i}.csv')
    table.to_json(f'table_{i}.json')

That parsing_report is the thing that makes Camelot worth using. An accuracy near 100 means the line detection lined up cleanly. An accuracy in the 70s with high whitespace usually means columns merged or split. You can build a pipeline that flags low-accuracy tables for human review instead of trusting everything blindly.
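A review gate along those lines might look like the sketch below. It works on the parsing_report dictionaries only, so it runs without Camelot installed. The function name `needs_review`, both thresholds, and the second sample report are assumptions made for illustration, not values Camelot recommends.

```python
def needs_review(parsing_report, min_accuracy=90.0, max_whitespace=50.0):
    """Decide whether a table should go to a human instead of straight through.

    Both thresholds are assumed starting points; calibrate them against
    tables you have checked by hand.
    """
    accuracy = parsing_report.get("accuracy", 0.0)
    whitespace = parsing_report.get("whitespace", 100.0)
    return accuracy < min_accuracy or whitespace > max_whitespace


# Illustrative reports in the shape Camelot prints; the numbers are made up.
reports = [
    {"accuracy": 98.4, "whitespace": 12.1, "order": 1, "page": 1},
    {"accuracy": 74.0, "whitespace": 61.3, "order": 1, "page": 2},
]
flagged_pages = [r["page"] for r in reports if needs_review(r)]
print(flagged_pages)  # [2]
```

In a real pipeline you would pass `table.parsing_report` for each table and route the flagged ones to a review queue.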

The downsides

Ghostscript dependency. Camelot needs Ghostscript installed on the system for the lattice backend. That is another apt-get in your Dockerfile and a common source of "works on my machine" failures in CI.
Merged cells confuse it. A header that spans two columns, or a value that spans two rows, breaks the clean grid assumption. You get shifted columns or a value landing in the wrong cell.
Borderless tables push you to stream. The moment there are no ruling lines, you are in stream mode and accuracy gets soft. Tight columns and right-aligned numbers are where it hurts most.
No OCR. Camelot reads the text layer. Hand it a scanned PDF and it finds zero tables.

Tabula-py (Python)

Tabula-py is a thin Python wrapper around Tabula, which is written in Java. It has been around a long time and shows up in a lot of data-journalism pipelines because the original Tabula desktop app was built for exactly that: pulling tables out of government PDFs.

The output drops straight into pandas, which is convenient if your downstream code already lives there.

tabula_example.py

import tabula

# returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf('report.pdf', pages='all')
for df in dfs:
    print(df.head())

# write straight to CSV or JSON
tabula.convert_into('report.pdf', 'out.csv', output_format='csv', pages='all')
tabula.convert_into('report.pdf', 'out.json', output_format='json', pages='all')

# pin the region when auto-detection drifts (top, left, bottom, right in pts)
dfs = tabula.read_pdf('report.pdf', pages='2', area=[120, 40, 700, 560])

By default Tabula guesses where the tables are. When the guess is wrong, the fix is the area argument: you give it explicit coordinates for the table region and it stops hunting. This is the difference between fighting the tool for an hour and getting clean output in one call. Measure the region once in a PDF viewer and hard-code it for documents with a fixed layout.
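If you keep several fixed layouts, a small lookup keeps those hard-coded regions in one place. The following is a sketch under assumptions: the template names, the regions, and the helpers `area_from_inches` and `read_pdf_kwargs` are all hypothetical. The only fixed fact it relies on is that a PDF point is 1/72 of an inch, which helps when your viewer reports measurements in inches.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def area_from_inches(top, left, bottom, right):
    """Convert a region measured in inches from the page's top-left corner
    into the point-based [top, left, bottom, right] list that area expects."""
    return [round(v * POINTS_PER_INCH, 2) for v in (top, left, bottom, right)]


# Hypothetical templates: measure each region once, then reuse it.
TEMPLATE_AREAS = {
    "monthly_report": {"pages": "2", "area": [120, 40, 700, 560]},
    "invoice_letter": {"pages": "1", "area": area_from_inches(2.5, 0.5, 9.0, 8.0)},
}


def read_pdf_kwargs(template):
    """Return the keyword arguments to pass to tabula.read_pdf for a template."""
    if template not in TEMPLATE_AREAS:
        raise ValueError(f"no region recorded for template {template!r}")
    return dict(TEMPLATE_AREAS[template])


print(read_pdf_kwargs("invoice_letter"))
# {'pages': '1', 'area': [180.0, 36.0, 648.0, 576.0]}
```

The call site then becomes `tabula.read_pdf('report.pdf', **read_pdf_kwargs('monthly_report'))`, and an unknown template fails loudly instead of falling back to auto-detection.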

The downsides

JVM dependency. Tabula is Java. You need a Java runtime installed and on the path. On a slim container that is a real chunk of weight, and the startup cost of spinning up a JVM per call adds latency in a request path.
Weaker on complex layouts. Multi-line cells, nested headers, and irregular spacing trip it up more often than Camelot's lattice mode does on the same document.
Guess versus explicit. Auto-detection is fine for clean reports and unreliable for anything dense. For production you almost always end up specifying area per template.
No OCR. Same story. Text layer only.

pdfplumber (Python)

When Camelot and Tabula both misalign a table, pdfplumber is the tool that lets you fix it. It does not try to be clever by default. It exposes the raw geometry of the page: every character, every line, every rectangle, with coordinates. Then it gives you a table extractor you can tune with explicit settings until the grid lines up.

This is the right tool when the others fail, because you stop hoping the heuristic works and start telling it exactly how to find rows and columns.

pdfplumber_example.py

import csv

import pdfplumber

with pdfplumber.open('report.pdf') as pdf:
    page = pdf.pages[0]

    # default extraction, good for ruled tables
    tables = page.extract_tables()

    # borderless table: align on text instead of lines
    settings = {
        'vertical_strategy': 'text',
        'horizontal_strategy': 'text',
        'snap_tolerance': 4,
    }
    tables = page.extract_tables(settings)

# one file per table, so a later table does not overwrite an earlier one
for i, rows in enumerate(tables):
    with open(f'table_{i}.csv', 'w', newline='') as f:
        csv.writer(f).writerows(rows)

The two settings that matter most are vertical_strategy and horizontal_strategy. Set them to 'lines' and pdfplumber uses drawn rules to find cell edges, which is perfect for ruled tables. Set them to 'text' and it clusters by the position of words, which is how you handle borderless tables. For the truly stubborn cases you can pass explicit_vertical_lines and explicit_horizontal_lines with the exact x and y values where column and row separators sit, and it will use those.
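For the explicit case, the settings dictionary can be built from the column edges you measured. This is a minimal sketch: `explicit_column_settings` and the x positions are hypothetical, while the keys are pdfplumber's documented table settings. It only builds the dictionary, so it runs without pdfplumber installed.

```python
def explicit_column_settings(column_edges, horizontal_strategy="text"):
    """Build pdfplumber table settings that pin column boundaries by hand.

    column_edges: x positions (in points) of every vertical separator,
    including the left and right edges of the table.
    """
    edges = sorted(set(column_edges))
    if len(edges) < 2:
        raise ValueError("need at least two x positions to bound a column")
    return {
        # 'explicit' tells pdfplumber to use only the lines listed below.
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": edges,
        # Rows are still inferred; swap in 'lines' for ruled tables.
        "horizontal_strategy": horizontal_strategy,
    }


# Hypothetical edges for a four-column table, measured once in a PDF viewer.
settings = explicit_column_settings([40, 210, 320, 430, 560])
print(settings["explicit_vertical_lines"])  # [40, 210, 320, 430, 560]
```

The result is passed the same way as any other settings dictionary: `page.extract_tables(settings)`.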

The downsides

You tune per document. The power is also the cost. There is no one setting that works everywhere. Each new layout may need its own table_settings. For a fleet of identical invoices that is a one-time cost. For arbitrary user uploads it is a grind.
No accuracy score. Unlike Camelot, pdfplumber does not tell you how confident it is. You validate output yourself.
No OCR. Pure Python text-layer reader. Scanned PDFs need an OCR step in front of it.

pdfplumber is also excellent for plain text and word-level coordinates, which is why it shows up again in our Python PDF parsing comparison. For tables specifically, treat it as the precision instrument you bring out when the automatic tools give up.

The Node.js gap

If your stack is Node, here is the honest truth: table extraction tooling is weak. There is no Node equivalent of Camelot or Tabula that you would trust in production. The reason is historical. The serious table work happened in the Python and Java data-science world, and that is where the libraries grew.

What you actually have in Node is lower level. Libraries like pdf2json or pdfjs-dist give you text fragments with x and y positions, and then you write your own clustering: bucket fragments into rows by their y coordinate, then into columns by their x coordinate, then emit cells. It works for clean, fixed-layout tables. It falls apart on anything irregular, and you are now maintaining a table-detection algorithm, which is not the feature you set out to build.
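To show how little, and how fragile, that clustering is, here is the idea sketched in Python for consistency with the other examples; the logic ports directly to JavaScript. Everything in it is an assumption for illustration: the `(x, y, text)` fragment shape, a y axis that grows downward, the tolerances, and a known list of column start positions, which is what a fixed layout gives you.

```python
from bisect import bisect_right


def cluster_rows(fragments, y_tolerance=3.0):
    """Group (x, y, text) fragments into rows, top to bottom.

    A fragment joins the current row when its y is within y_tolerance
    of the first fragment placed in that row.
    """
    rows = []
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within a row, order fragments left to right.
    return [sorted(row["cells"]) for row in rows]


def to_grid(rows, column_starts, x_tolerance=2.0):
    """Assign each fragment to the last column that starts at or before its x."""
    starts = sorted(column_starts)
    grid = []
    for cells in rows:
        out = [""] * len(starts)
        for x, text in cells:
            col = max(bisect_right(starts, x + x_tolerance) - 1, 0)
            out[col] = (out[col] + " " + text).strip()
        grid.append(out)
    return grid


# Illustrative fragments with slightly jittered baselines.
fragments = [
    (40, 100.0, "Item"), (300, 100.4, "Qty"), (420, 99.8, "Total"),
    (40, 118.0, "API"), (62, 118.1, "Credits"),
    (310, 118.0, "5000"), (430, 118.2, "450.00"),
]
grid = to_grid(cluster_rows(fragments), column_starts=[40, 300, 420])
print(grid)
# [['Item', 'Qty', 'Total'], ['API Credits', '5000', '450.00']]
```

The weak points are visible in the parameters: a multi-line cell defeats the y tolerance, and a layout without known column starts forces you to infer them, which is the part the dedicated libraries spend most of their code on.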

So in practice Node teams do one of two things. They stand up a small Python sidecar service running Camelot or pdfplumber and call it over HTTP, which means now you run Python anyway. Or they call an API that returns structured tables directly. If you only need plain text rather than tables, the calculus is different and more forgiving, which we cover in extracting text from PDF in Node.js. For tables, do not fight Node. Offload it.

PDFBase table extraction API

The pitch is narrow and honest: one call returns the tables in a PDF as structured rows, in CSV or JSON, and it handles the two things the Python tools make you do yourself. It runs OCR for scanned documents, so a scan and a digital PDF go through the same endpoint. And there is nothing to install on your side: no Ghostscript, no JVM, no headless browser. You send the file, you get rows.

pdfbase_tables.js

import PDFBase from 'pdfbase'
import { readFileSync, writeFileSync } from 'fs'

const client = new PDFBase('pk_live_...')

const result = await client.extract.tables({
  file: readFileSync('report.pdf'),
  format: 'csv', // or 'json'
  ocr: 'auto' // runs OCR only when no text layer exists
})

console.log(result.tables.length) // 3

const first = result.tables[0]
console.log(first.page) // 1
console.log(first.rows[0]) // ['Item', 'Qty', 'Unit Price', 'Total']
console.log(first.rows[1]) // ['API Credits', '5000', '0.09', '450.00']

// result.csv holds the ready-to-write CSV string
writeFileSync('out.csv', result.csv)

A few things worth being straight about. This is not magic. The same hard cases that trip up Camelot, merged cells and deeply nested headers, are hard for any extractor including this one. What you are buying is not a perfect parser. It is the operational load lifted: no system dependencies, OCR folded in, multi-page tables stitched back together, and one HTTP call from any runtime including Node, Go, or Ruby where the Python tools are not an option.

Scanned PDFs work. OCR runs automatically when there is no text layer, so you do not branch your code on document type.
Multi-page tables get rejoined. A table that continues across a page break is returned as one table, not two fragments.
Both formats in one call. Ask for json for programmatic use or csv for a file you hand to a spreadsheet. The row data is the same.
No install footprint. Your container stays slim. Nothing to patch when Ghostscript or a JDK ships a CVE.

You can try it without writing any code first: the free PDF table extractor takes an upload and returns CSV in the browser, so you can sanity-check your documents before wiring up the API. Full options, including page ranges and region hints, are in the PDFBase docs.

Edge cases that break everything

No matter which tool you pick, these are the cases that will bite you. Knowing them ahead of time is the difference between shipping and a week of confused debugging.

Merged cells. A header that spans two columns or a value that spans two rows breaks the grid model every tool relies on. The fix is usually post-processing: detect the empty cell and forward-fill from the spanned value. No extractor gets this right unattended.
Multi-page tables that split. A long table continues onto the next page, often with the header repeated. Most tools return it as two separate tables. You have to detect the repeated header, drop it, and concatenate. The PDFBase API does this stitching for you; with the Python tools you write it.
Nested tables. A cell that itself contains a small table. Almost nothing handles this. You generally have to extract the outer table, then re-run extraction on the bounding box of the offending cell.
Rotated pages. Wide tables are often printed landscape on a portrait page, which means the page is rotated 90 degrees. Extractors that ignore the rotation flag read the text sideways and produce nonsense. Check the page rotation and normalize it before extracting.
Scanned PDFs needing OCR. The silent killer. The text-based tools return zero tables and no error, so it looks like an empty document instead of a missing OCR step. Always check for a text layer first; if there is none, OCR before you extract.
Numbers with thousands separators. A value like 1,240.50 contains a comma. Write that into CSV naively and the comma is read as a column delimiter, shifting every field after it. Quote the field, or strip separators, or use a different delimiter. This one corrupts data quietly and you find it three reports later.
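Three of these fixes are plain post-processing on lists of rows, so they can be sketched with the standard library alone. The helper names below are hypothetical, and each makes a simplifying assumption: the forward-fill treats every blank header cell as part of a span, and the stitching assumes all the fragments belong to one table and that the repeated header matches the first one exactly.

```python
import csv
import io


def forward_fill_header(header):
    """Fill blank header cells from the left, for a header spanning several columns."""
    filled, last = [], ""
    for cell in header:
        if cell and cell.strip():
            last = cell
        filled.append(last)
    return filled


def stitch_pages(tables):
    """Concatenate per-page row lists, dropping a header repeated on later pages."""
    tables = [t for t in tables if t]  # skip empty fragments
    if not tables:
        return []
    header = tables[0][0]
    rows = list(tables[0])
    for table in tables[1:]:
        rows.extend(table[1:] if table[0] == header else table)
    return rows


def to_csv(rows):
    """Serialize rows with the csv module, which quotes fields containing commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Illustrative fragments of one table split across two pages.
page1 = [["Item", "Total"], ["API Credits", "1,240.50"]]
page2 = [["Item", "Total"], ["Support", "450.00"]]
print(to_csv(stitch_pages([page1, page2])))
# Item,Total
# API Credits,"1,240.50"
# Support,450.00
```

The quoting matters most: joining fields with `','.join(...)` is the naive write that splits 1,240.50 into two columns, while the csv writer wraps it in quotes so spreadsheet tools read it back as one field.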

Comparison

Here is how the four approaches stack up on the dimensions that decide which one you reach for.

	Camelot	Tabula-py	pdfplumber	PDFBase API
Scanned support (OCR)	No	No	No	Yes, built in
Setup deps	Ghostscript	JVM / JDK	None (pip only)	None (API call)
Complex layouts	Good (lattice)	Fair	Best, with tuning	Good, no tuning
Output formats	CSV, JSON, DataFrame	CSV, JSON, DataFrame	Lists (write yourself)	CSV, JSON
Accuracy score	Yes	No	No	Per-table confidence
Best for	Ruled tables, Python	Clean reports, pandas	Hard layouts, control	Scans, Node, scale

Which should you choose?

Cut to it. Match the tool to your constraints, not to whatever the top Stack Overflow answer says.

Python, ruled tables, you want a confidence signal

Use Camelot in lattice mode. The parsing_report accuracy score lets you trust the good tables and flag the bad ones. Just budget for the Ghostscript dependency.

Clean reports, already in pandas, fixed layout

Use Tabula-py with an explicit area. Drop the DataFrames straight into your pipeline. Pin the region and it is reliable for documents with a stable layout.

The other two misaligned a stubborn table

Use pdfplumber with custom table_settings. When you need explicit control over column and row boundaries, this is the precision tool. The cost is tuning per layout.

Scanned docs, a Node stack, or you don't want the ops

Use the PDFBase table extraction API. OCR is folded in, multi-page tables get stitched, and there are no system dependencies to install or patch. One call from any runtime.

Wrapping up

Extracting tables from PDF is really one problem wearing two coats: reconstructing structure that the format threw away, and dealing with documents that have no text layer at all. The Python tools, Camelot, Tabula, and pdfplumber, are strong at the first problem on digital PDFs and leave the second to you. Pick among them by how ruled your tables are and how much control you need.

If you are on Node, or you have scanned documents, or you simply do not want to run Ghostscript and a JVM and write multi-page stitching by hand, an API that returns structured tables is the pragmatic call. Either way, respect the edge cases. Merged cells, rotated pages, and commas inside numbers will outlast whichever library you choose.

Want to skip the setup? Grab 100 free credits, no card, and pull tables out of your first PDF in a single call. Or run a file through the free PDF table extractor first to see how clean the output is.
