Ground-truthed datasets of PDF tables

Update 2013-11-25: Ground truth added to the competition dataset.

Update 2013-04-05: The competition dataset has now been released and can be downloaded from the competition home page.

Update 2013-03-17: Tools for automatic visualization and comparison to ground truth are now online.

On this page you will find two ground-truthed datasets of natively digital PDF documents containing tables. These documents have been collected systematically from the European Union and US Government websites, and we therefore expect them to have public domain status. Each PDF document is accompanied by three XML (or CSV) files containing its ground truth in the following models:

table regions (for evaluating table location)
cell structures (for evaluating table structure recognition)
functional representation (for evaluating table interpretation)
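
As an illustration of how the table-region ground truth might be used to evaluate table location, here is a minimal Python sketch. It is not the evaluation procedure used in the competition: the Region class, its fields, the overlap measure (intersection-over-union) and the 0.5 threshold are all assumptions made for this example, and it does not read the actual XML or CSV files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Axis-aligned table bounding box on one page (hypothetical layout)."""
    page: int
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def overlap_ratio(a: Region, b: Region) -> float:
    """Intersection-over-union of two regions; 0.0 if on different pages."""
    if a.page != b.page:
        return 0.0
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def match_regions(truth, detected, threshold=0.5):
    """Greedily pair each ground-truth region with at most one detection.

    Returns (precision, recall). The threshold is an assumed value.
    """
    unused = list(detected)
    matches = 0
    for t in truth:
        best = max(unused, key=lambda d: overlap_ratio(t, d), default=None)
        if best is not None and overlap_ratio(t, best) >= threshold:
            matches += 1
            unused.remove(best)
    precision = matches / len(detected) if detected else 0.0
    recall = matches / len(truth) if truth else 0.0
    return precision, recall
```

Calling match_regions(truth, detected) with two lists of Region objects returns a (precision, recall) pair. Greedy matching keeps the sketch short; a stricter evaluation would use an optimal one-to-one assignment.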

This work was carried out as a collaboration between Giorgio Orsi, Linda Oro, Max Göbel and myself. We currently have over 50 excerpts, taken from larger PDF documents, and are appealing to the document engineering community to help us increase this number to several hundred or more.

The datasets of EU documents and US Government documents are now online.

The tools for ground-truthing are currently at beta stage and are available on request.

We organized the competition on PDF table detection and structure recognition at ICDAR 2013. The datasets here were made available to all participants for practice. The competition dataset included a further collection of EU and US documents, and has now been made available with ground truth. However, it contains no ground truth for the functional representation, as the competition covered only table location and cell structure recognition.

Please contact me if you would like to join our collaborative effort in improving this dataset.
