Update 2013-04-15: The deadline for result submission has been extended to Thursday, 18 April, 23:59 GMT.
Update 2013-04-05: The competition dataset has now been released and can be downloaded from the competition home page.
Update 2013-03-17: Tools for automatic visualization and comparison to ground truth are now online.
New deadlines for registration (31 March) and result submission.
This page describes the format to which entries for both sub-competitions, table location and table structure recognition, must adhere, and how each entry will be numerically compared to the ground truth. This format is also described in [1]. Entrants can choose to take part in either or both sub-competitions. The format of the ground truth in the example dataset is a special case of this format. Please note that the example dataset also includes a functional model; however, functional analysis does not form part of the present competition.
We have made some tools for automatic visualization and comparison to ground truth available. We hope that this will make it easier for participants to prepare their results for submission in our format.
Table regions are defined as rectangular areas of a given page by their coordinates. Since a table can span more than one page, several regions can belong to the same table. For example, the ground truth file for a document with a table spanning from the first to the second page may look as follows:
<?xml version="1.0" encoding="UTF-8"?>
<document filename='filename.pdf'>
<table id='0'>
<region id='0' page='1'>
<instruction instr-id='83'/>
<instruction instr-id='90' subinstr-id='0'/>
...
<instruction instr-id='169'/>
<bounding-box x1='87' y1='117' x2='551' y2='220'/>
</region>
<region id='1' page='2'>
<instruction instr-id='202'/>
<instruction instr-id='209' subinstr-id='2'/>
...
<bounding-box x1='87' y1='261' x2='551' y2='364'/>
</region>
</table>
<table id='1'>
...
</table>
...
</document>
For each tabular region that is found, entrants are only required to return its rectangular bounding-box in PDF coordinates. The <instruction> tags, whose purpose is described below, need not be included. Note that the page-numbering is 1-based (all document excerpts in the example begin with page 1). The region and table ID numbering is 0-based; tables within a document and regions within a table can be output in any order.
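As a rough illustration of the minimal output described above, the following Python sketch writes a region model file containing only bounding boxes. The function name build_region_result and the input layout (a list of tables, each a list of (page, box) pairs) are assumptions made for this example and are not part of the competition tools.

```python
import xml.etree.ElementTree as ET


def build_region_result(filename, tables):
    """Serialise detected table regions in the region model format.

    tables: list of tables; each table is a list of
            (page, (x1, y1, x2, y2)) tuples, one per region.
    Table and region IDs are 0-based, page numbers are 1-based.
    """
    doc = ET.Element("document", filename=filename)
    for table_id, regions in enumerate(tables):
        table = ET.SubElement(doc, "table", id=str(table_id))
        for region_id, (page, (x1, y1, x2, y2)) in enumerate(regions):
            region = ET.SubElement(
                table, "region", id=str(region_id), page=str(page)
            )
            # Only the bounding box is required; <instruction> tags are omitted.
            ET.SubElement(
                region, "bounding-box",
                x1=str(x1), y1=str(y1), x2=str(x2), y2=str(y2),
            )
    return ET.tostring(doc, encoding="unicode")


# One table spanning pages 1 and 2, using the boxes from the example above.
xml_text = build_region_result(
    "filename.pdf",
    [[(1, (87, 117, 551, 220)), (2, (87, 261, 551, 364))]],
)
print(xml_text)
```

The XML declaration is not emitted by this sketch; it can be prepended when the string is written to a file.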
The XML schema definition for the region model can be downloaded here.
All regions in the ground truth dataset are specified by the smallest bounding box that encloses all the character elements within the region. Lines and other graphic elements are discounted. For the entries, the bounding boxes need not be minimal; our evaluation procedure will simply take all character elements whose centres fall within the specified area. These elements are then compared with the ground truth to calculate the completeness and purity of the result [2].
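The selection step described above can be sketched in Python as follows. This is only an illustration, not the official evaluation code: the function names, the dictionary of character centres, the inclusive treatment of box edges, and the reading of completeness and purity as per-table yes/no properties are all assumptions made for this example.

```python
def chars_in_box(chars, box):
    """Return the IDs of all characters whose centre lies inside box.

    chars: dict mapping a character ID to its centre (cx, cy)
    box:   (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2
    Box edges are treated as inclusive (an assumption).
    """
    x1, y1, x2, y2 = box
    return {
        cid for cid, (cx, cy) in chars.items()
        if x1 <= cx <= x2 and y1 <= cy <= y2
    }


def is_complete(result_chars, gt_chars):
    """Complete: the result contains every ground-truth character."""
    return gt_chars <= result_chars


def is_pure(result_chars, gt_chars):
    """Pure: the result contains no character outside the ground truth."""
    return result_chars <= gt_chars


# Hypothetical character centres on one page.
chars = {"a": (100, 120), "b": (200, 200), "c": (600, 150)}
found = chars_in_box(chars, (87, 117, 551, 220))
print(found, is_complete(found, {"a", "b"}), is_pure(found, {"a", "b"}))
```

Because only character centres are tested, a submitted box that is somewhat larger than the minimal one selects the same characters as long as no additional centres fall inside it.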
*A character element is defined as each character drawn by the <TJ> or <Tj> text-drawing operators. Textual content (e.g. in logos) drawn by vector graphics operators or by placing bitmaps is ignored. The coordinates of a character element are defined as follows:
In order to determine which tables have been detected completely and/or purely, it is necessary to map each GT table to its corresponding result table. In most cases, this mapping is obvious. However, the following special cases can arise:
Please note that for this sub-competition it is irrelevant whether several regions are grouped into a (logical) table or not. Only the <region> and not the <table> tags are used.
The aim of the table structure recognition sub-competition is to compare methods for determining the cell structure of tables given correct information about their location. It is therefore permissible to participate only in this sub-competition, and not in the table location sub-competition. We strongly recommend that entrants use manually generated or corrected input regarding the table locations when generating their results, in order to avoid being unnecessarily penalized.
The cell structure of a table is defined as a matrix of cells. Cells are defined by their textual content and their start and end column and row positions. Blank cells are not represented in this format. An example looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<document filename='filename.pdf'>
<table id='0'>
<region id='0' page='3' col-increment='0' row-increment='0'>
<cell id='0' start-row='0' start-col='0'>
<bounding-box x1='70' y1='79' x2='131' y2='91'/>
<content>COUNTRY</content>
<instruction instr-id='65' subinstr-id='0'/>
</cell>
<cell id='1' start-row='0' start-col='1' end-col='2'>
<bounding-box x1='165' y1='79' x2='201' y2='91'/>
<content>3 years</content>
<instruction instr-id='65' subinstr-id='2'/>
</cell>
<cell id='2' start-row='0' start-col='3'>
<bounding-box x1='234' y1='79' x2='271' y2='91'/>
<content>4 years</content>
<instruction instr-id='65' subinstr-id='4'/>
</cell>
...
</region>
...
</table>
...
</document>
In the ground truth for the example dataset, the table numbers correspond to those in the relevant region model files. In this competition, this need not be the case, as entries for the table structure recognition sub-competition will be evaluated independently of the table location competition.
In contrast to the region model, for the cell structure model, entrants are required to return the textual content (<content> tag) for each cell; the tags <bounding-box> and <instruction>, which are present in the ground truth, are not required and will be ignored.
The cell numbering begins at (0,0) for the top-left cell. The attributes end-col and end-row are optional; if they are omitted, the column and/or row span is assumed to be 1. If all cells are shifted, or an entire row or column is returned as spanning the same number of cells, this will not make any difference to the final result, as explained below.
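The defaulting rule for end-col and end-row can be illustrated with a short Python sketch that reads the cells of one <region> element. The function name parse_cells and the dictionary layout are assumptions for this example; <bounding-box> and <instruction> children are simply ignored, as they are not required in entries.

```python
import xml.etree.ElementTree as ET


def parse_cells(region):
    """Read the <cell> children of a <region> element.

    A missing end-row or end-col defaults to the start position,
    which corresponds to a span of 1.
    """
    cells = []
    for cell in region.findall("cell"):
        start_row = int(cell.get("start-row"))
        start_col = int(cell.get("start-col"))
        cells.append({
            "content": cell.findtext("content", default=""),
            "start_row": start_row,
            "start_col": start_col,
            "end_row": int(cell.get("end-row", start_row)),
            "end_col": int(cell.get("end-col", start_col)),
        })
    return cells


# Two cells taken from the example above, without the optional tags.
sample = """
<region id='0' page='3' col-increment='0' row-increment='0'>
  <cell id='0' start-row='0' start-col='0'><content>COUNTRY</content></cell>
  <cell id='1' start-row='0' start-col='1' end-col='2'><content>3 years</content></cell>
</region>
"""
for cell in parse_cells(ET.fromstring(sample)):
    print(cell)
```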
The XML schema definition for the cell structure model can be downloaded here.
First, the content will be stripped of all spaces and special characters so that errors in e.g. detecting spacing do not affect the result. For comparing two cell structures, we use a method inspired by Hurst's proto-links [3]: for each table region we generate a list of adjacency relations between each content cell and its nearest neighbour in horizontal and vertical directions. No adjacency relations are generated between two blank cells or between a blank cell and a content cell.
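The following Python sketch shows one possible way to generate such adjacency relations. It is not the official evaluation code: the names normalise and adjacency_relations, the tuple layout of a cell, the choice to scan rightwards and downwards, and the interpretation of "special characters" as anything that is not a letter or digit are all assumptions made for this example. The second table row is invented for illustration.

```python
def normalise(text):
    # Assumption: keep only letters and digits, dropping spaces and the rest.
    return "".join(ch for ch in text if ch.isalnum())


def adjacency_relations(cells):
    """cells: list of (content, start_row, start_col, end_row, end_col).

    Returns a list of (content_a, content_b, direction) triples linking
    each content cell to its nearest content neighbour to the right
    ("horizontal") and below ("vertical"). Blank cells are skipped.
    """
    grid = {}
    for idx, (content, r0, c0, r1, c1) in enumerate(cells):
        if not normalise(content):
            continue  # blank cells take no part in any relation
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[(r, c)] = idx
    if not grid:
        return []
    max_row = max(r for r, _ in grid)
    max_col = max(c for _, c in grid)

    pairs = set()  # (index_a, index_b, direction), deduplicated per cell pair
    for (r, c), idx in grid.items():
        scans = (
            ("horizontal", [(r, cc) for cc in range(c + 1, max_col + 1)]),
            ("vertical", [(rr, c) for rr in range(r + 1, max_row + 1)]),
        )
        for direction, positions in scans:
            for pos in positions:
                other = grid.get(pos)
                if other is None or other == idx:
                    continue  # empty position or the same spanning cell
                pairs.add((idx, other, direction))
                break  # only the nearest neighbour counts

    return sorted(
        (normalise(cells[a][0]), normalise(cells[b][0]), d) for a, b, d in pairs
    )


cells = [
    ("COUNTRY", 0, 0, 0, 0), ("3 years", 0, 1, 0, 2), ("4 years", 0, 3, 0, 3),
    ("Austria", 1, 0, 1, 0), ("", 1, 1, 1, 1), ("7", 1, 3, 1, 3),
]
for relation in adjacency_relations(cells):
    print(relation)
```

Relations are deduplicated per pair of cells rather than per pair of strings, so two distinct cell pairs with identical content still yield two relations.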
This 1-D list of adjacency relations is then compared to the ground truth by using precision and recall measures, as shown in the figure below. If both cells of a relation are identical to those of a ground-truth relation and the direction matches, then the relation is marked as correctly retrieved; otherwise it is marked as incorrect. Using neighbourhoods makes the comparison invariant to the absolute position of the table (e.g. if everything is shifted by one cell) and also avoids ambiguities that arise when dealing with different types of errors (merged/split cells, inserted empty column, etc.).
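A minimal Python sketch of this comparison is given below, assuming each adjacency relation is represented as a (content_a, content_b, direction) triple. The function name precision_recall, the multiset matching via collections.Counter, and the convention of returning 0.0 for an empty list are assumptions for this example and may differ from the official tools.

```python
from collections import Counter


def precision_recall(result_relations, gt_relations):
    """Compare two lists of (content_a, content_b, direction) triples.

    A result relation counts as correct if an identical triple exists in
    the ground truth; repeated triples are matched at most once each.
    """
    result = Counter(result_relations)
    truth = Counter(gt_relations)
    correct = sum((result & truth).values())  # multiset intersection
    precision = correct / sum(result.values()) if result else 0.0
    recall = correct / sum(truth.values()) if truth else 0.0
    return precision, recall


# Hypothetical relation lists for one table region.
gt = [("a", "b", "horizontal"), ("b", "c", "horizontal"), ("a", "d", "vertical")]
res = [("a", "b", "horizontal"), ("a", "c", "horizontal")]
print(precision_recall(res, gt))
```

Since the triples contain only cell content and direction, shifting every cell by the same offset leaves both lists, and therefore both measures, unchanged.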
References:
[1] Göbel, M., Hassan, T., Oro, E., Orsi, G.: A Methodology for Evaluating Algorithms for Table Understanding in PDF Documents, DocEng 2012
[2] Silva, A.C.: Metrics for evaluating performance in document analysis: application to tables, IJDAR 14(1):101-109, 2011
[3] Hurst, M.: A constraint-based approach to table structure derivation, ICDAR 2003
Competition organizers: Max Göbel, Tamir Hassan, Ermelinda Oro and Giorgio Orsi.
If you have any further questions, please feel free to get in touch.