PDF Parsing in Python: PyPDF2 vs pdfplumber vs PDFBase API (2026)

You need to extract text from a PDF in Python. You pip install the first thing Google suggests. And then you spend the next 3 hours debugging encoding issues, garbled tables, and text that comes out in the wrong order.

The Python PDF ecosystem is weirdly fragmented. There are at least four serious options for parsing PDFs, and they all have different strengths, different failure modes, and different levels of "oh god why is the API like this." PyPDF2 got renamed. pdfplumber is amazing at tables but chokes on big files. pdfminer is powerful but makes you write 50 lines for what should be 5. And then there's the "just POST it to an API" option.

This post covers all four approaches with real Python code, real output, and honest tradeoffs. By the end you'll know which tool to reach for and—just as important—which ones will waste your afternoon.

First: The PyPDF2 Naming Confusion

Before we get into code, let's clear up the naming mess that trips up every Python developer who hasn't touched PDF parsing in a year.

Gotcha: PyPDF2 is dead. Long live pypdf.

The lineage goes: PyPDF (2005, abandoned) → PyPDF2 (2012, community fork) → PyPDF4 (2018, short-lived fork) → pypdf (late 2022, when the PyPDF2 maintainers moved development back to the original name). In 2026, pip install PyPDF2 still works, but it installs the final 3.0.x release of PyPDF2, which is deprecated and no longer updated. Older tutorials that import PyPDF2 often fail with import errors or deprecation warnings because classes and methods were renamed along the way. Install pypdf, not PyPDF2.
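If you inherit a project and aren't sure which of the two names is actually installed, you can probe for both. This is a minimal sketch using only the standard library; the helper name installed_pdf_package is made up for illustration, and it only checks importability, not versions.

```python
from importlib.util import find_spec


def installed_pdf_package(candidates=("pypdf", "PyPDF2")):
    """Return the first importable package name from candidates, or None.

    Hypothetical helper: it does not import the package or check its
    version, it only asks the import system whether the name resolves.
    """
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    found = installed_pdf_package()
    if found == "PyPDF2":
        print("Only the legacy PyPDF2 package is installed; consider pypdf.")
    elif found is None:
        print("Neither package is installed; try: pip install pypdf")
    else:
        print(f"Using {found}")
```

Because pypdf is listed first, an environment with both packages installed reports pypdf.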

1. pypdf (formerly PyPDF2)

The grand old library of Python PDF parsing. pypdf can extract text, merge/split documents, read metadata, encrypt and decrypt, and rotate pages. It's the Swiss Army knife—does a lot of things, but text extraction isn't the sharpest blade.

pypdf_extract.py

from pypdf import PdfReader

reader = PdfReader("invoice.pdf")

# Extract text page by page
for page in reader.pages:
    text = page.extract_text()
    print(text)

# Get metadata (reader.metadata is None when the PDF carries no info dictionary)
if reader.metadata is not None:
    print(reader.metadata.title)

print(len(reader.pages))

Simple and clean: a short loop gets the text from every page. But here's what the output actually looks like on a PDF with a table:

Output — pypdf on an invoice with a table

Invoice #4821
Date: March 15, 2026
Bill To: Acme Corp
Item Quantity Unit Price Total
Widget A 50 $12.00 $600.00
Widget B 25 $8.50 $212.50
Service Fee 1 $150.00 $150.00
Subtotal: $962.50
Tax: $77.00
Total: $1,039.50

Looks okay for this simple case, right? The text is there. But notice: the table has no structure. It's just lines of text concatenated together. There's no way to programmatically distinguish "Widget A" from "50" from "$12.00" without writing a custom parser. And on more complex layouts—multi-column PDFs, forms with overlapping text boxes, PDFs generated by design tools—the text order falls apart completely.
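To make "writing a custom parser" concrete, here is roughly what that looks like for this one invoice. It's a sketch, not a recommendation: the regex and the parse_line_items name are assumptions for illustration, and it assumes each table row lands on its own line of the extracted text, which pypdf does not guarantee.

```python
import re

# One line item: free-text name, integer quantity, then two dollar amounts.
ROW = re.compile(
    r"^(?P<item>.+?)\s+(?P<qty>\d+)\s+"
    r"\$(?P<unit>[\d,]+\.\d{2})\s+\$(?P<total>[\d,]+\.\d{2})$"
)


def parse_line_items(text):
    """Return a dict per line that looks like an invoice row; skip the rest."""
    items = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            items.append(match.groupdict())
    return items


sample = """Item Quantity Unit Price Total
Widget A 50 $12.00 $600.00
Widget B 25 $8.50 $212.50
Service Fee 1 $150.00 $150.00
Subtotal: $962.50"""

for item in parse_line_items(sample):
    print(item)
```

It works here, but the pattern is welded to this exact column order and currency format. A different invoice template means a different regex.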

When to use pypdf

Merging and splitting PDFs. This is where pypdf actually shines. Clean API, fast, reliable.
Reading metadata. Author, title, creation date, page count—one-liners.
Simple text extraction from well-structured, text-heavy PDFs (like contracts, plain-text reports).
Page rotation, encryption, watermarking. Solid utility functions for PDF manipulation.

Where it falls down: anything involving tables, complex layouts, or scanned documents.

2. pdfplumber — The Table Extraction King

pdfplumber was built by a journalist who needed to extract data from government PDFs. That origin story tells you everything about its priorities: it's laser-focused on getting structured data out of documents that were designed for humans, not machines.

pdfplumber_extract.py

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    # Plain text (pass layout=True to approximate the page layout)
    for page in pdf.pages:
        print(page.extract_text())

    # The magic: table extraction
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

Same invoice PDF. Here's what extract_tables() gives you:

Output — pdfplumber extract_tables()

['Item', 'Quantity', 'Unit Price', 'Total']
['Widget A', '50', '$12.00', '$600.00']
['Widget B', '25', '$8.50', '$212.50']
['Service Fee', '1', '$150.00', '$150.00']

That's structured data. Each row is a list. Each cell is a string. You can drop this straight into a pandas DataFrame, write it to CSV, or feed it into any data pipeline. No regex parsing, no hacky string splitting. pdfplumber detects the table boundaries by analyzing the positions of lines and text on the page, then reconstructs the grid.
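Here is a small sketch of that hand-off using only the standard library. The table literal is the output shown above; table_to_records and table_to_csv are hypothetical helper names, and the code assumes the first row is the header.

```python
import csv
import io

# The list of lists from extract_tables(), first row treated as the header.
table = [
    ["Item", "Quantity", "Unit Price", "Total"],
    ["Widget A", "50", "$12.00", "$600.00"],
    ["Widget B", "25", "$8.50", "$212.50"],
    ["Service Fee", "1", "$150.00", "$150.00"],
]


def table_to_records(table):
    """Pair each data row with the header row to get one dict per row."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def table_to_csv(table):
    """Serialize the table to a CSV string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


print(table_to_records(table)[0])
print(table_to_csv(table))
```

The list of dicts is also a shape that pandas.DataFrame accepts directly, if you'd rather go that route.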

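This is about the best table extraction you'll get from a pure Python library. It copes reasonably well with merged cells and, when you switch its table strategy to "text", with borderless tables that rely on text alignment. Tables that span pages are extracted one page at a time, so you stitch the pieces together yourself.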

The downsides

Slow on large documents. pdfplumber loads every character's position into memory to do spatial analysis. On a 50-page PDF, expect ~800ms. On a 500-page PDF, expect minutes—and possibly an OOM kill.
Memory-hungry. That spatial analysis means the entire page layout lives in memory. A single page with dense text can consume 50MB+. Multiply by hundreds of pages and you're in trouble.
No OCR. If the PDF is a scanned image, pdfplumber gets nothing. It only works on PDFs with actual text objects embedded.
Table detection isn't magic. It works great on well-structured tables with clear borders. On tables that use whitespace alignment with no lines? It guesses, and sometimes it guesses wrong.
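Since table detection can guess wrong, it's worth sanity-checking what comes back before it reaches your pipeline. The sketch below is one possible check, not something pdfplumber ships: table_problems is a hypothetical helper, the thresholds are arbitrary, and it assumes empty cells may arrive as None or blank strings.

```python
def table_problems(table, min_rows=2, max_empty_ratio=0.5):
    """Return a list of reasons a detected table looks suspicious.

    An empty list means the table passed these (deliberately crude) checks.
    """
    problems = []
    if len(table) < min_rows:
        return ["too few rows"]

    # A real grid has the same number of cells in every row.
    if len({len(row) for row in table}) > 1:
        problems.append("inconsistent column count")

    # A grid that is mostly blank is often a false positive.
    cells = [cell for row in table for cell in row]
    empty = sum(1 for cell in cells if cell is None or not str(cell).strip())
    if cells and empty / len(cells) > max_empty_ratio:
        problems.append("mostly empty cells")

    return problems


good = [["Item", "Total"], ["Widget A", "$600.00"]]
ragged = [["Item", "Total"], ["Widget A"]]
print(table_problems(good))
print(table_problems(ragged))
```

Tables that fail the check are good candidates for a second pass with different table settings or for manual review.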

For extracting tables from a reasonable number of well-formed PDFs, pdfplumber is the best pure-Python option. Full stop. Just don't feed it a 1,000-page government report.

Table extraction showdown: pypdf vs pdfplumber

To make the difference concrete, here's the same table extracted by both libraries. This is from a real invoice PDF with bordered table cells:

pypdf — garbled

Item Quantity Unit Price Total
Widget A 50 $12.00 $600.00
Widget B 25 $8.50 $212.50
Service Fee 1 $150.00 $150.00

Plain text. No structure. Good luck parsing this programmatically when columns don't align.

pdfplumber — structured

[['Item', 'Quantity', 'Unit Price', 'Total'], ['Widget A', '50', '$12.00', '$600.00'], ['Widget B', '25', '$8.50', '$212.50'], ['Service Fee', '1', '$150.00', '$150.00']]

Python list of lists. Each cell isolated. Ready for pandas, CSV, or JSON.

3. pdfminer.six — The Low-Level Powerhouse

pdfminer.six (the maintained Python 3 fork of the original pdfminer) gives you something the others don't: precise control over text position, font information, and page layout analysis. It exposes the raw building blocks of PDF text rendering.

It's also the library most likely to make you question your career choices.

pdfminer_extract.py

from pdfminer.high_level import extract_text

# The simple way (yes, pdfminer has one)
text = extract_text("invoice.pdf")
print(text)

Wait, that's only 3 lines? Turns out pdfminer has a high_level module that works fine for basic extraction. But the reason people use pdfminer is for the low-level layout analysis. Here's what that looks like:

pdfminer_layout.py

from pdfminer.high_level import extract_pages
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTChar

params = LAParams(
    line_margin=0.5,
    word_margin=0.1,
    char_margin=2.0,
    boxes_flow=0.5
)

for page_layout in extract_pages("invoice.pdf", laparams=params):
    for element in page_layout:
        if isinstance(element, LTTextBox):
            # Get bounding box coordinates
            x0, y0, x1, y1 = element.bbox
            print(f"[{x0:.0f},{y0:.0f}] {element.get_text().strip()}")

            # Get individual character info
            for line in element:
                if isinstance(line, LTTextLine):
                    for char in line:
                        if isinstance(char, LTChar):
                            print(f" '{char.get_text()}' font={char.fontname} size={char.size:.1f}")

That's 20 lines to get text with position data and font info. And this is actually the clean version—the older approach using PDFResourceManager, PDFPageInterpreter, and PDFPageAggregator is even more verbose. The class hierarchy reads like it was designed by someone who really loved the AbstractSingletonProxyFactoryBean school of API design.

But the output is powerful:

Output — pdfminer with position data

[72,742] Invoice #4821
 'I' font=Helvetica-Bold size=18.0
 'n' font=Helvetica-Bold size=18.0
 'v' font=Helvetica-Bold size=18.0
 ...
[72,710] Date: March 15, 2026
[72,580] Item
[200,580] Quantity
[320,580] Unit Price
[450,580] Total

See those x-coordinates? [72,580], [200,580], [320,580], [450,580]—same y-position, different x-positions. That's a table row. With pdfminer, you can reconstruct table structure by grouping elements with the same y-coordinate. It's manual work, but you have absolute precision.
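The grouping step can be sketched without pdfminer at all, given (x, y, text) tuples like the ones printed above. The function name group_into_rows and the 2-point tolerance are assumptions for illustration; real documents usually need the tolerance tuned to the font size.

```python
def group_into_rows(elements, y_tolerance=2.0):
    """Group (x, y, text) tuples into rows of text, top of page first.

    Elements whose y-coordinates are within y_tolerance of the first
    element in a row are treated as belonging to that row. PDF coordinates
    start at the bottom-left, so a larger y means higher on the page.
    """
    rows = []
    for x, y, text in sorted(elements, key=lambda e: (-e[1], e[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


elements = [
    (450, 580, "Total"),
    (72, 742, "Invoice #4821"),
    (72, 580, "Item"),
    (320, 580, "Unit Price"),
    (72, 710, "Date: March 15, 2026"),
    (200, 580, "Quantity"),
]

for row in group_into_rows(elements):
    print(row)
```

In practice you would feed it the first two values of element.bbox and the stripped element text from the layout loop.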

When to use pdfminer

You need exact text positions. Building a PDF search index? Highlighting text regions? pdfminer tells you the bounding box of every character.
You need font information. Which words are bold? What font size is the heading? pdfminer exposes this.
You're building your own extraction pipeline and need full control over how layout analysis works. The LAParams class lets you tune line margin, word margin, character margin, and text flow direction.

When not to use pdfminer

You just want text from a PDF. Use pypdf—it's 10x simpler.
You want tables. Use pdfplumber—it does the spatial reconstruction for you.
You have scanned PDFs. pdfminer can't help. None of these can.

The Scanned PDF Problem

Here's the thing nobody mentions in the "top 10 Python PDF libraries" listicles: none of the libraries above handle scanned PDFs. If your PDF is an image of a document (from a scanner, a photo, or an export from some legacy system), pypdf returns empty strings, pdfplumber returns empty tables, and pdfminer returns nothing.
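A cheap way to spot this before it bites you is to look at how much text each page actually produced. The helper below is a sketch: pages_needing_ocr is a made-up name, the 20-character threshold is an arbitrary assumption, and the input is whatever per-page strings your library returned (for example, one extract_text() result per page).

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is (nearly) empty.

    page_texts is a list with one entry per page; None and
    whitespace-only strings count as empty.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


texts = ["A long page of real extracted text here.", "", None, "  \n "]
print(pages_needing_ocr(texts))
```

Pages that come back flagged are the ones to route to an OCR step. A page with only a stamp or a page number will also be flagged, so treat the result as a hint rather than proof.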

To handle scanned PDFs in pure Python, you need a whole separate stack:

diy_ocr_stack.py

# You need ALL of these:
# pip install pytesseract Pillow pdf2image
# brew install tesseract poppler (system deps!)

from pdf2image import convert_from_path
import pytesseract

# Convert PDF pages to images (needs Poppler)
images = convert_from_path("scanned_doc.pdf", dpi=300)

# OCR each image (needs Tesseract binary)
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image)
    print(f"--- Page {i+1} ---")
    print(text)

That's five additional dependencies: three Python packages (pytesseract, Pillow, pdf2image) and two system-level binaries (Tesseract, Poppler). In Docker, you're adding Tesseract and Poppler to your image, which means a couple hundred megabytes of extra binaries. And Tesseract's OCR quality on anything that isn't a perfectly clean document scan is... not great.

Quick aside: Camelot and tabula-py

Two other libraries show up in table extraction discussions. Camelot uses Ghostscript and OpenCV to detect tables visually—genuinely clever, but requires Ghostscript as a system dependency. tabula-py wraps the Tabula Java library—good at tables, but requires a JVM running alongside your Python process. For most projects in 2026, pdfplumber handles table extraction well enough that the extra system dependencies aren't worth it.


4. PDFBase API — POST the File, Get Structured Data

The API approach flips the model: instead of installing libraries, managing dependencies, and handling edge cases locally, you POST the PDF to an endpoint and get text, tables, and metadata back. PDFBase handles OCR for scanned documents, table detection, and layout analysis on its infrastructure.

pdfbase_extract.py

import requests

url = "https://api.pdfbase.dev/v1/extract"
headers = {"Authorization": "Bearer pk_live_..."}

with open("invoice.pdf", "rb") as f:
    response = requests.post(
        url,
        headers=headers,
        files={"file": f},
        data={"extract": "text,tables,metadata"}
    )

# Fail loudly on HTTP errors instead of trying to parse an error body
response.raise_for_status()
result = response.json()

# Structured text
print(result["text"])

# Tables as structured JSON
for table in result["tables"]:
    print(table["headers"])
    for row in table["rows"]:
        print(row)

# Works on scanned PDFs too: OCR is automatic
print(result["metadata"]["pages"])
print(result["metadata"]["ocr_applied"])  # True if scanned

Output — PDFBase API response (tables)

{
  "tables": [
    {
      "page": 1,
      "headers": ["Item", "Quantity", "Unit Price", "Total"],
      "rows": [
        {"Item": "Widget A", "Quantity": "50", "Unit Price": "$12.00", "Total": "$600.00"},
        {"Item": "Widget B", "Quantity": "25", "Unit Price": "$8.50", "Total": "$212.50"},
        {"Item": "Service Fee", "Quantity": "1", "Unit Price": "$150.00", "Total": "$150.00"}
      ]
    }
  ],
  "metadata": {
    "pages": 1,
    "ocr_applied": false
  }
}

Tables come back as JSON objects with named headers—not arrays of arrays. You get row["Item"] instead of row[0]. And if the PDF happens to be a scanned image, the API automatically applies OCR. No Tesseract, no Poppler, no system dependencies.
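Named keys make downstream checks easy to read. As a sketch, here is a validation pass over the rows from the sample response above. The row shape is copied from that sample and may not match every response; money and mismatched_rows are hypothetical helpers.

```python
from decimal import Decimal

# Rows copied from the sample response; the real schema may differ.
rows = [
    {"Item": "Widget A", "Quantity": "50", "Unit Price": "$12.00", "Total": "$600.00"},
    {"Item": "Widget B", "Quantity": "25", "Unit Price": "$8.50", "Total": "$212.50"},
    {"Item": "Service Fee", "Quantity": "1", "Unit Price": "$150.00", "Total": "$150.00"},
]


def money(value):
    """Turn a string like "$1,039.50" into an exact Decimal."""
    return Decimal(value.replace("$", "").replace(",", ""))


def mismatched_rows(rows):
    """Return rows where Quantity x Unit Price does not equal Total."""
    return [
        row
        for row in rows
        if int(row["Quantity"]) * money(row["Unit Price"]) != money(row["Total"])
    ]


subtotal = sum(money(row["Total"]) for row in rows)
print(mismatched_rows(rows))
print(subtotal)
```

Decimal is used instead of float so that currency arithmetic stays exact. The same check with row[2] and row[3] works on pdfplumber's lists, it's just harder to read.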

Why the API approach wins for production pipelines

Zero local dependencies. Your Python app only needs requests. No C binaries, no system packages, no Docker layer headaches.
Automatic OCR. The API detects scanned pages and applies OCR transparently. You don't need to check upfront whether a PDF is text-based or image-based.
Consistent JSON output. Whether the input is a clean digital PDF or a scanned fax from 2003, the output format is the same structured JSON. Your downstream code doesn't need branching logic.
No memory pressure. Processing happens on PDFBase's infrastructure. You can parse a 500-page PDF without your application server breaking a sweat.

You can try it right now without code using the free PDF Text Extractor tool or the PDF Table Extractor. For API details, see the documentation.

Performance Benchmarks

Tested on a 50-page PDF (a mix of text, tables, and form fields). Machine: M2 MacBook Pro, Python 3.12. API benchmark includes network latency to the nearest PDFBase edge node.

Library	Time (50 pages)	Peak Memory	Notes
pypdf	~200ms	~15MB	Fast, but text-only extraction
pdfminer.six	~600ms	~45MB	Layout analysis adds overhead
pdfplumber	~800ms	~120MB	Full spatial analysis per page
PDFBase API	~500ms + network	~2MB (client)	Processing is server-side

pypdf is the fastest because it does the least work—it reads text objects in stream order without spatial analysis. pdfplumber is the slowest because it reconstructs the entire page layout in memory. The API sits in the middle on wall-clock time, but shifts all the computational cost off your machine.

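The memory column is where it gets interesting. If pdfplumber's footprint keeps growing at the same rate (120MB for 50 pages, so about 2.4MB per page), a 500-page document lands around 1.2GB. That's an extrapolation, not a measurement, but it's the right order of magnitude to plan for. The API client uses ~2MB regardless of PDF size because only the HTTP request/response lives in your process memory.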
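The arithmetic behind that projection is a straight linear extrapolation from the benchmark row. The function below is only a back-of-the-envelope sketch under that assumption; projected_peak_mb is a made-up name, and real memory use depends heavily on how dense each page is.

```python
def projected_peak_mb(measured_mb, measured_pages, target_pages):
    """Linearly scale a measured peak memory figure to a larger page count."""
    if measured_pages <= 0:
        raise ValueError("measured_pages must be positive")
    return measured_mb * target_pages / measured_pages


# pdfplumber figure from the benchmark table: ~120MB at 50 pages.
print(projected_peak_mb(120, 50, 500))   # megabytes at 500 pages
print(projected_peak_mb(120, 50, 1000))  # megabytes at 1,000 pages
```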

Full Comparison

Feature	pypdf	pdfplumber	pdfminer.six	PDFBase API
Text Quality	Good for simple docs	Good (layout-aware)	Excellent (precise)	Excellent + OCR
Table Extraction	None	Excellent	Manual (via coords)	Structured JSON
Scanned PDF / OCR	No	No	No	Yes (automatic)
Speed (50 pages)	~200ms	~800ms	~600ms	~500ms + network
API Complexity	Simple	Simple	Painful	Trivial (HTTP POST)
Dependencies	None (pure Python)	pdfminer.six, Pillow, pypdfium2 (pip only)	charset-normalizer, cryptography (pip only)	requests only
Maintenance Status	Active	Active	Maintained	Managed service

Which Should You Choose?

No hedging. Here's the decision tree:

Quick text extraction from simple PDFs

Use pypdf. A handful of lines of code, fast, no dependencies. If your PDFs are text-heavy documents without complex layouts (contracts, articles, plain reports), pypdf is all you need. Don't overthink it.

Table extraction from well-structured PDFs

Use pdfplumber. Nothing beats it for pulling structured tables out of digital PDFs. The extract_tables() method is genuinely great. Just watch your memory on documents over 100 pages.

Precise layout analysis and text positions

Use pdfminer.six. When you need to know the exact x,y coordinates of every text element, or you need font information, or you're building a custom extraction pipeline that requires full control. Accept that the API is verbose and move on.

Production pipelines, scanned PDFs, or you want to avoid dependency management

Use a PDF API like PDFBase. You get text + tables + OCR in one call. No Tesseract, no Poppler, no system deps. Consistent JSON output regardless of PDF type. The trade-off is cost ($0.01/page) and network dependency—a trade most production systems will happily make.
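If you prefer your decision trees executable, the four cases above collapse into a few lines. This is a sketch of the article's recommendations, not an official tool: choose_tool and estimated_api_cost are hypothetical names, the order in which the checks are applied is an assumption, and the default price is the $0.01/page figure quoted above.

```python
from decimal import Decimal


def choose_tool(*, scanned=False, production=False,
                needs_positions=False, needs_tables=False):
    """Map a few yes/no requirements to the tool recommended above."""
    if scanned or production:
        return "PDF API"
    if needs_positions:
        return "pdfminer.six"
    if needs_tables:
        return "pdfplumber"
    return "pypdf"


def estimated_api_cost(pages, price_per_page=Decimal("0.01")):
    """Estimate API spend in dollars for a given page count."""
    return pages * price_per_page


print(choose_tool(needs_tables=True))
print(choose_tool(needs_tables=True, scanned=True))
print(estimated_api_cost(500))
```

Scanned input is checked first because none of the local libraries can read it, whatever else you need from the document.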

The Bottom Line

The Python PDF ecosystem is fragmented for a reason: parsing PDFs is genuinely hard. The format was designed for rendering, not data extraction. Every library is making different tradeoffs about how to reverse-engineer structure from a flat document.

For most real-world projects: start with pypdf for simple text, reach for pdfplumber when you need tables, and switch to an API when you're tired of writing edge-case handling code or when scanned documents enter the picture. pdfminer is the "I need to build my own thing from primitives" option—powerful, but you're paying for that power with code complexity.

Try the free PDF Text Extractor or Table Extractor to see results on your own PDFs before writing any code. When you're ready to integrate, grab 100 free API credits—no credit card, takes about 30 seconds. Full reference at docs.pdfbase.dev.
