Extract and aggregate tables of empirical results from computer science papers!
This project requires Python 3.6. We recommend you set up a conda environment:
conda create -n corvid python=3.6
source activate corvid
The dependencies are listed in the requirements.in file:
pip install -r requirements.in
After installing, you can run all the unit tests:
pytest tests/
If you're interested in using one of the predefined Table extractors from the table_extraction module, you'll also need to install a tool to parse PDFs to XML. We currently support PDFLib's TET toolkit v5.1 and Nuance's OmniPage Capture SDK v20.2. For TET, you'll need the path to the bin/tet executable after installation. For OmniPage, you'll need to run make to build corvid.cpp within the module omnipage/ in this repo.
|-- corvid/
| |-- table_extraction/
| | |-- table_extractor.py
| | |-- evaluate.py
| |-- table_aggregation/
| | |-- schema_matcher.py
| | |-- evaluate.py
| |-- types/
| | |-- table.py
|-- tests/
|-- config.py
|-- requirements.in
A few important things:
- table.py contains the Table class, which is the data structure used to represent Tables. It's fine to think of Table as a wrapper around a 2D numpy array, where each [i,j] element represents a cell in the Table.
- table_extractor.py contains the TableExtractor class. The .extract() method extracts Table objects from a PDF input.
- schema_matcher.py contains the SchemaMatcher class. The .aggregate_tables() method takes a list of Table objects and finds alignments between columns. For example, a column "p" in Table 1 could be aligned with another column "precision" in Table 2. The .map_tables() method uses these alignments to build a single aggregate Table.
- evaluate.py contains a function evaluate() which computes a suite of performance metrics on a given Gold Table and Predicted Table pair. The table_extraction and table_aggregation modules have their own respective evaluation methods.
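To make the column alignment idea concrete, here is a minimal sketch that uses plain nested lists in place of the real Table and SchemaMatcher classes. The alias map and the functions canonical, align_columns, and merge_tables are hypothetical names chosen for illustration; they are not part of corvid, and the actual matcher may use a different strategy.

```python
# Toy stand-ins, NOT the corvid Table class: each table is a list of rows,
# and the first row holds the column headers.
table_1 = [["model", "p", "r"],
           ["baseline", "0.71", "0.64"]]
table_2 = [["model", "precision", "recall"],
           ["ours", "0.78", "0.70"]]

# Hypothetical alias map used to decide that two headers mean the same thing.
ALIASES = {"p": "precision", "r": "recall"}


def canonical(header):
    """Normalize a header and expand it through the alias map."""
    key = header.strip().lower()
    return ALIASES.get(key, key)


def align_columns(source, target):
    """Map each column index of `source` to the matching column index of `target`."""
    target_index = {canonical(h): j for j, h in enumerate(target[0])}
    return {i: target_index[canonical(h)]
            for i, h in enumerate(source[0])
            if canonical(h) in target_index}


def merge_tables(source, target, alignment):
    """Append the body rows of `source` under `target`, reordered by `alignment`."""
    merged = [list(row) for row in target]
    for row in source[1:]:
        new_row = [""] * len(target[0])  # unaligned columns stay empty
        for i, j in alignment.items():
            new_row[j] = row[i]
        merged.append(new_row)
    return merged


alignment = align_columns(table_1, table_2)
aggregate = merge_tables(table_1, table_2, alignment)
for row in aggregate:
    print(row)
```

The sketch separates finding the alignment from applying it, which mirrors the split between .aggregate_tables() and .map_tables() described above.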
The repo contains two modules: table_extraction and table_aggregation.
First, prepare paper_ids.txt that looks like:
0ad9e1f04af6a9727ea7a21d0e9e3cf062ca6d75
eda636e3abae829cf7ad8e0519fbaec3f29d1e82
...
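Before fetching anything, it can help to sanity-check the ID file. The sketch below assumes each paper ID is a 40-character lowercase hex string, which is only inferred from the two sample IDs above. parse_paper_ids is a hypothetical helper and is not part of this repo.

```python
import re

# Assumption: paper IDs are 40-character lowercase hex strings,
# inferred from the two sample IDs shown above.
PAPER_ID_PATTERN = re.compile(r"[0-9a-f]{40}")


def parse_paper_ids(text):
    """Return the paper IDs in `text`, one per line, skipping blank lines."""
    paper_ids = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        candidate = line.strip()
        if not candidate:
            continue
        if not PAPER_ID_PATTERN.fullmatch(candidate):
            raise ValueError(
                "line {}: not a paper ID: {!r}".format(line_number, candidate))
        paper_ids.append(candidate)
    return paper_ids


sample = (
    "0ad9e1f04af6a9727ea7a21d0e9e3cf062ca6d75\n"
    "eda636e3abae829cf7ad8e0519fbaec3f29d1e82\n"
)
paper_ids = parse_paper_ids(sample)
print(len(paper_ids), "paper IDs parsed")
```

To check a real file, read it with open('/path/to/paper_ids.txt').read() and pass the contents to parse_paper_ids.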
We can download PDFs from S3 for the papers in this file:
python scripts/fetch_papers_pdfs_from_s3.py \
    --mode pdf \
    --paper_ids /path/to/paper_ids.txt \
    --input_url s3://url-with-pdfs \
    --output_dir data/pdf/
After we download the PDFs, we can parse them into the TETML format using PDFLib's TET:
python scripts/parse_pdfs_to_tetml.py \
    --parser /path/to/pdflib-tet-binary \
    --input_dir data/pdf/ \
    --output_dir data/tetml/
If the options in scripts fetch_papers_*.py and parse_pdfs_*.py are left out, the scripts will attempt to use default values from a configuration file. See our example in example_config.py.
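One common way to get this kind of fallback is to feed configuration values into argparse as defaults. The sketch below is an assumption about the pattern, not a copy of how the scripts in this repo are written. The DEFAULTS dictionary is a hypothetical stand-in for values loaded from a config file.

```python
import argparse

# Hypothetical defaults standing in for values read from a config file;
# the keys and paths here are illustrative only.
DEFAULTS = {"input_dir": "data/pdf/", "output_dir": "data/tetml/"}


def build_parser(defaults):
    """Build a parser whose options fall back to `defaults` when omitted."""
    parser = argparse.ArgumentParser(description="Sketch of option fallback.")
    parser.add_argument("--input_dir", default=defaults["input_dir"])
    parser.add_argument("--output_dir", default=defaults["output_dir"])
    return parser


parser = build_parser(DEFAULTS)

# No options given: every value comes from the defaults.
args_default = parser.parse_args([])

# One option given: it overrides its default, the rest still fall back.
args_override = parser.parse_args(["--output_dir", "out/"])

print(args_default.input_dir, args_default.output_dir)
print(args_override.input_dir, args_override.output_dir)
```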
Now that we've processed all these papers to TETML format, let's try extracting tables from one of them:
from bs4 import BeautifulSoup
from corvid.table_extraction.table_extractor import TetmlTableExtractor
TETML_PATH = 'data/tetml/0ad9e1f04af6a9727ea7a21d0e9e3cf062ca6d75.tetml'
with open(TETML_PATH, 'r') as f_tetml:
    tetml = BeautifulSoup(f_tetml)
tables = TetmlTableExtractor.extract_tables(tetml)
Let's try manipulating the first table in this list:
table = tables[0]
# visualize
print(table)
# shape
table.nrow; table.ncol; table.dim
# indexing via grid
first_row = table[0,:]
first_col = table[:,0]
# indexing via cells
first_cell = table[0]
- read aliases from Madeleine's annotation and add to datasets.json
- font information in cells
- finish evaluation module for table extraction; write example script for API
- table normalizing function
- reorganize data/ file structure
- handling box after table transformations (maybe store externally from class)
- maybe store all metadata non-specific to table externally from class
- tests for file/tetml utils
- make [[cell for cell in row] for row in x] possible on Table x using __iter__; make .grid private after this
- Script for inspecting Table pickles
- Naming. Alignment seems to denote bidirectionality vs Mapping has direction.
- latex source to table (for training/evaluation)
- parsing heuristics
After downloading the .dmg, you'll need to mount the file:
sudo hdiutil attach TET-5.1-OSX-Perl-PHP-Python-Ruby.dmg
You can then find the TET binary at
ls /Volumes/TET-5.1-OSX-Perl-PHP-Python-Ruby/bin/tet
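Before passing the binary to --parser, you may want to confirm that the path points to an executable file. The helper below is a hypothetical convenience, not part of this repo. It uses only the standard library and demonstrates the check on the running Python interpreter, since the TET volume may not be mounted.

```python
import os
import sys


def check_executable(path):
    """Return `path` if it points to an executable file, else raise."""
    if not os.path.isfile(path):
        raise FileNotFoundError("no file at {}".format(path))
    if not os.access(path, os.X_OK):
        raise PermissionError("{} is not executable".format(path))
    return path


# Demonstrated on the current interpreter, which is always present.
checked = check_executable(sys.executable)
print("ok:", checked)

# With the volume mounted as described above, the same check would be:
# check_executable('/Volumes/TET-5.1-OSX-Perl-PHP-Python-Ruby/bin/tet')
```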