GenScrap - AI-powered content management platform


extracting tables from PDFs using Python

manual

Here’s a practical, up-to-date comparison of the main Python options for extracting tables from PDFs: what they’re best at, trade-offs, and quickstart code.

# TL;DR: Which should you use?

* **Digital PDFs with clear ruling/borders:** Start with **Camelot** (lattice/stream modes). ([Camelot][1], [Read the Docs][2])
* **Digital PDFs (simple/fast, Java OK):** **tabula-py** (wrapper of tabula-java). Great defaults, very fast. Requires Java. ([Tabula-Py][3])
* **Mixed/quirky layouts & you want control:** **pdfplumber** (fine-grained, tunable table finding). ([GitHub][4])
* **Scanned/image PDFs:** Pre-OCR with **OCRmyPDF**, then run Camelot/tabula/pdfplumber. ([OCRmyPDF][5], [GitHub][6], [Ubuntu Manpages][7])
* **Complex documents (forms, multi-span/merged cells) or images with no ruling:** ML approach like **Microsoft Table Transformer (TaTR)** via Hugging Face / layoutparser. Heavier, but most robust for detection/structure. ([GitHub][8])

---

## Side-by-side comparison

| Library | Best for | Accuracy | Speed | Setup friction | Notes |
| ---------------------------- | --------------------------------------------- | ----------------------------: | --------: | ------------------------ | ----------------------------------------------------------------------------------------------------------------- |
| **Camelot** | Digital PDFs with lines or consistent spacing | High (lattice); good (stream) | Fast | pip install (lattice mode needs OpenCV) | Two modes: *lattice* (uses ruling lines) and *stream* (whitespace). Exports DataFrame/CSV easily. ([Camelot][1]) |
| **tabula-py** | Bulk extraction from consistent PDFs | High on “regular” tables | Very fast | Needs Java (tabula-java) | Simple API (`read_pdf`, `convert_into`), good for pipelines. ([Tabula-Py][3]) |
| **pdfplumber** | Tricky layouts & fine control | Good to high (with tuning) | Medium | Pure Python | Exposes characters/lines/rects; you can tune table settings per page. ([GitHub][4]) |
| **OCRmyPDF** (pre-step) | Scanned/image PDFs | n/a | Medium | External CLI | Adds a searchable text layer; then use one of the above. ([OCRmyPDF][5], [GitHub][6]) |
| **Table Transformer (TaTR)** | Complex or image-only tables | High (best on hard layouts) | Slowest | ML model + deps | Detects table regions & structure; operates on page images, so PDF pages must be rendered to images first. ([GitHub][8]) |

---

## Minimal working examples

### 1) Camelot

```python
import camelot

# flavor="lattice" uses ruling lines; use flavor="stream" for whitespace-separated tables
tables = camelot.read_pdf("file.pdf", pages="1-3", flavor="lattice")
df = tables[0].df  # first detected table as a pandas DataFrame
tables.export("out.csv", f="csv")  # writes one CSV file per table
```

Camelot’s *lattice* mode relies on ruling lines; use *stream* for whitespace-based tables. ([Camelot][9])

### 2) tabula-py

```python
import tabula

# Returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf("file.pdf", pages="all", multiple_tables=True)
# Or write the extracted tables straight to a CSV file
tabula.convert_into("file.pdf", "out.csv", output_format="csv", pages="all")
```

Requires Java because it wraps **tabula-java**. Great for batch jobs. ([Tabula-Py][3])

### 3) pdfplumber

```python
import pandas as pd
import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[0]
    # extract_table returns the largest table on the page as a list of rows, or None
    table = page.extract_table(
        table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    )

# Treat the first row as the header; fall back to an empty frame if no table was found
df = pd.DataFrame(table[1:], columns=table[0]) if table else pd.DataFrame()
```

`table_settings` lets you adapt to non-standard layouts; you can also switch to the `"text"` strategy for tables without ruling lines, or adjust the snap/join tolerances. ([GitHub][4])

### 4) Scanned PDFs: OCRmyPDF → (Camelot/tabula/pdfplumber)

```bash
ocrmypdf --deskew --clean input_scanned.pdf input_searchable.pdf
```

Note that `--clean` needs the external `unpaper` tool to be installed. Then run Camelot/tabula/pdfplumber on `input_searchable.pdf`. ([OCRmyPDF][5], [GitHub][6])

### 5) ML route: Microsoft Table Transformer (TaTR)

* Typical stack: `layoutparser`/`transformers` + TaTR weights → detect tables → reconstruct structure → export CSV/DF. Best on complex images and messy scans; heavier to set up and slower. ([GitHub][8])

---

## Recommended decision flow

1. **Is it scanned?** → Yes: **OCRmyPDF**, then try Camelot (lattice) → Camelot (stream) → pdfplumber. ([OCRmyPDF][5])
2. **Digital with visible lines?** → **Camelot (lattice)** first; fall back to *stream*. ([Camelot][9])
3. **Bulk and consistent layout, Java OK?** → **tabula-py** for speed and simplicity. ([Tabula-Py][3])
4. **Weird/irregular layouts / need control?** → **pdfplumber** with tuned settings. ([GitHub][4])
5. **Still failing (merged cells, rotated, camera shots)?** → **TaTR** (ML). ([GitHub][8])

---

## Practical tips

* Try **multiple modes** (Camelot lattice vs stream) per page; some pages switch styles. ([Camelot][9])
* For **rotated** pages, rotate first (pdfplumber can detect/rotate). ([GitHub][4])
* With **tabula-py**, you can pass `area` or templates for consistent cropping; great for reports where tables live in fixed regions. ([Tabula-Py][10])
* For **no ruling lines**, increase tolerance and rely on whitespace clustering (Camelot stream / pdfplumber strategies). ([Camelot][9], [GitHub][4])
* For **end-to-end pipelines**, standardize outputs to CSV/Parquet and add validation (row/column counts, headers present, totals match).

If you share a representative PDF (one digital, one scanned), I can run a quick bake-off with 2–3 approaches and give you tuned code per page type.

[1]: https://camelot-py.readthedocs.io/?utm_source=chatgpt.com "Camelot: PDF Table Extraction for Humans - Camelot 1.0.9 ..."
[2]: https://readthedocs.org/projects/camelot-py/downloads/pdf/master/?utm_source=chatgpt.com "Camelot Documentation"
[3]: https://tabula-py.readthedocs.io/?utm_source=chatgpt.com "tabula-py: Read tables in a PDF into DataFrame - tabula-py ..."
[4]: https://github.com/jsvine/pdfplumber?utm_source=chatgpt.com "jsvine/pdfplumber - and easily extract text and tables."
[5]: https://ocrmypdf.readthedocs.io/?utm_source=chatgpt.com "OCRmyPDF documentation - ocrmypdf 16.10.5.dev10+ ..."
[6]: https://github.com/ocrmypdf/OCRmyPDF?utm_source=chatgpt.com "OCRmyPDF adds an OCR text layer to scanned PDF files ..."
[7]: https://manpages.ubuntu.com/manpages/lunar/man1/ocrmypdf.1.html?utm_source=chatgpt.com "ocrmypdf - add an OCR text layer to PDF files"
[8]: https://github.com/microsoft/table-transformer?utm_source=chatgpt.com "microsoft/table-transformer"
[9]: https://camelot-py.readthedocs.io/en/master/user/quickstart.html?utm_source=chatgpt.com "Quickstart - Camelot 1.0.9 documentation - Read the Docs"
[10]: https://tabula-py.readthedocs.io/en/latest/tabula.html?utm_source=chatgpt.com "tabula-py documentation"
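
The practical tips suggest validating extracted tables (row/column counts, headers present, totals match). A minimal sketch of such a check might look like the following. It is an illustration and not part of any of the libraries above: the function name `validate_table` and its parameters are hypothetical, and it assumes the extracted table is already a pandas DataFrame whose cells may still be strings.

```python
import pandas as pd


def validate_table(df, expected_columns, min_rows=1,
                   sum_column=None, expected_total=None, tolerance=0.01):
    """Return a list of problems found; an empty list means the table passed.

    Hypothetical helper for illustration only.
    """
    problems = []

    # Header check: every expected column name must be present.
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # Row-count check: catches empty or truncated extractions.
    if len(df) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(df)}")

    # Totals check: extracted cells are often strings, so coerce to numbers first.
    if sum_column is not None and expected_total is not None and sum_column in df.columns:
        values = pd.to_numeric(df[sum_column], errors="coerce")
        if values.isna().any():
            problems.append(f"non-numeric values in {sum_column!r}")
        elif abs(values.sum() - expected_total) > tolerance:
            problems.append(f"{sum_column!r} sums to {values.sum()}, expected {expected_total}")

    return problems


# Usage: a small hand-made table standing in for an extracted one.
sample = pd.DataFrame({"item": ["a", "b"], "amount": ["10", "15.5"]})
issues = validate_table(sample, ["item", "amount"], sum_column="amount", expected_total=25.5)
print(issues or "table looks OK")
```

Returning a list of problems instead of raising on the first failure makes it easier to log every issue for a page and then decide whether to retry with another extractor or mode.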


Displayed 2025/8/25