Document Extractions¶
Documents are full of information. For instance, an invoice contains an amount to pay along with additional details such as a bank account and a due date. The Gini API extracts this information and provides it in a structured form. In the following, the extracted pieces of information are referred to as extractions. Examples of extractions are, as already mentioned, amounts and bank accounts, but also addresses, tax numbers, links to websites, etc.
Gini also maps a semantic property to an extraction. In other words, Gini not only extracts a date from a given document, but also infers that the date is the due date of an invoice. In the following, such an extraction is referred to as a specific extraction.
Extractions¶
An extraction contains an entity describing the general semantic type of the extraction (e.g. a date). The entity also determines the format of the value, which contains the information as text. Optionally, there may be a box describing the position of the extraction value on the document. In most instances, extractions without a bounding box are meta information (e.g. doctype).
Name | Type | Description |
---|---|---|
entity | string | Key (primary identification) of an entity type (e.g. banknumber). See Extraction Entities for a full list. |
value | string | A normalized textual representation of the text or information provided by the extraction value (e.g. a bank number without spaces between the digits) |
box | Bounding Box | (Optional) bounding box containing the position of the extraction value on the document |
Example¶
{
"entity": "date",
"value": "2012-06-20",
"box": { ... }
}
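As an illustration of the field table above, the following minimal sketch reads one extraction into a small Python type. The names `Extraction` and `parse_extraction` are hypothetical helpers, not part of the Gini API, and the doctype value `Invoice` is only an assumed sample.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Extraction:
    entity: str                 # general semantic type, e.g. "date"
    value: str                  # normalized textual value
    box: Optional[dict] = None  # optional; often missing for meta information


def parse_extraction(raw: dict) -> Extraction:
    """Build an Extraction from a decoded JSON object (hypothetical helper)."""
    return Extraction(entity=raw["entity"], value=raw["value"], box=raw.get("box"))


# An extraction with a position on the document.
due = parse_extraction({
    "entity": "date",
    "value": "2012-06-20",
    "box": {"page": 2, "left": 483.0, "top": 450.0, "width": 51.0, "height": 9.0},
})

# Meta information without a bounding box (sample value is an assumption).
doc_type = parse_extraction({"entity": "doctype", "value": "Invoice"})
```

Treating `box` as optional keeps the same code path usable for positioned extractions and for meta information.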
Specific extractions¶
A specific extraction assigns a semantic property to an extraction. It also has an additional field candidates:
Name | Type | Description |
---|---|---|
candidates | string | (Optional) A reference to a list of extraction candidates. See Available Extraction Candidates for the possible values. |
Example¶
"paymentDueDate": {
"entity": "date",
"value": "2012-06-20",
"box": { ... },
"candidates": "dates"
}
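Because the entity determines the format of the value, a client can convert the value into a native type once it knows the entity. The sketch below does this for `paymentDueDate`. It assumes that the specific extractions are available as a dictionary keyed by name and that date values always use the `YYYY-MM-DD` form shown in the example; neither assumption is confirmed by this section, and `due_date` is a hypothetical helper.

```python
from datetime import date
from typing import Optional


def due_date(extractions: dict) -> Optional[date]:
    """Return paymentDueDate as a date object, or None if it is unavailable."""
    specific = extractions.get("paymentDueDate")
    if specific is None or specific.get("entity") != "date":
        return None
    # Assumption: date values are formatted as YYYY-MM-DD, as in the example.
    return date.fromisoformat(specific["value"])


sample = {
    "paymentDueDate": {
        "entity": "date",
        "value": "2012-06-20",
        "candidates": "dates",
    }
}
parsed = due_date(sample)
```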
Available Specific Extractions¶
Name | Description | Entity | Candidates |
---|---|---|---|
amountToPay | The amount which has to be paid. | amount | amounts |
bankAccountNumber | The account number of a payment recipient. | bankaccount | bankAccountNumbers |
bankNumber | The bank number of a payment recipient. | banknumber | bankNumbers |
bic | The BIC of a payment recipient. | bic | bics |
docType | The document type of a given document. | doctype | n/a |
iban | The IBAN of a document sender. | iban | ibans |
paymentDueDate | The calculated payment due date (e.g. of an invoice). | date | dates |
paymentPurpose | The extra payment purpose text, used when the payment reference is not available. NOTE: Currently only available for clients in Austria. | text | n/a |
paymentRecipient | The payment recipient, i.e. the beneficiary of a money transfer. | companyname | senderNames |
paymentReference | The payment reference. | reference | n/a |
senderName | The sender name. | companyname | senderNames |
Extraction Candidates¶
Extraction candidates are a list of suggestions for an appropriate extraction.
Example¶
"dates": [
{"entity": "date","value": "2012-06-20","box": { ... } },
{"entity": "date","value": "2012-05-10","box": { ... } },
...
]
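The `candidates` field of a specific extraction names one of these lists, so a client can offer the remaining suggestions when the chosen value turns out to be wrong. The following sketch assumes the candidate lists are available as a dictionary keyed by list name; how the API delivers them is not described in this section, and `alternative_candidates` is a hypothetical helper.

```python
def alternative_candidates(specific: dict, candidates: dict) -> list:
    """Return the candidates a specific extraction refers to, minus the chosen value."""
    name = specific.get("candidates")  # optional field, e.g. "dates"
    if name is None:
        return []
    return [c for c in candidates.get(name, []) if c["value"] != specific["value"]]


# Sample data shaped like the examples in this section (boxes omitted).
candidates = {
    "dates": [
        {"entity": "date", "value": "2012-06-20"},
        {"entity": "date", "value": "2012-05-10"},
    ]
}
payment_due_date = {"entity": "date", "value": "2012-06-20", "candidates": "dates"}

others = alternative_candidates(payment_due_date, candidates)
```

Specific extractions without a `candidates` field, such as `docType`, simply yield an empty list.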
Available Extraction Candidates¶
Name | Description | Entity |
---|---|---|
amounts | All amounts of a given document. | amount |
bankAccountNumbers | All account numbers of a given document. | bankaccount |
bankNumbers | All bank numbers of a given document. | banknumber |
bics | All BICs of a given document. | bic |
ibans | All IBANs of a given document. | iban |
senderNames | All possible sender names of a given document. | companyname |
Extraction Entities¶
The available extraction entities are (follow each link for a detailed description):
Bounding Box¶
A bounding box creates a direct relation between an extraction and a document. The box describes the page and the position where the extraction originates.
Name | Type | Description |
---|---|---|
left | number | The distance from the left edge of the page, in points (pt) |
top | number | The distance from the top edge of the page, in points (pt) |
width | number | The horizontal dimension of the box, in points (pt) |
height | number | The vertical dimension of the box, in points (pt) |
page | number | The page on which the box can be found, starting with 1 |
Example¶
"box": {
"page": 2,
"left": 483.0,
"top": 450.0,
"width": 51.0,
"height": 9.0
}
Coordinate system¶
The origin of the coordinate system is the upper left corner of the page. The coordinate system uses the DTP point as its unit: 1 pt = 1 inch / 72 = 25.4 mm / 72 ≈ 0.3528 mm
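To draw a bounding box on a rendered page image, the point values have to be scaled to the image resolution. The sketch below applies the unit definition above. The function names and the 144 dpi rendering resolution are assumptions chosen for illustration.

```python
MM_PER_PT = 25.4 / 72  # 1 pt = 1/72 inch, roughly 0.3528 mm


def box_to_mm(box: dict) -> dict:
    """Convert the position and size of a bounding box from points to millimetres."""
    return {key: box[key] * MM_PER_PT for key in ("left", "top", "width", "height")}


def box_to_pixels(box: dict, dpi: float) -> tuple:
    """Return (left, top, right, bottom) in pixels for a page rendered at `dpi`.

    The origin is the upper left corner of the page, which matches the usual
    image coordinate convention, so no axis flip is needed.
    """
    scale = dpi / 72  # pixels per point
    left = round(box["left"] * scale)
    top = round(box["top"] * scale)
    right = round((box["left"] + box["width"]) * scale)
    bottom = round((box["top"] + box["height"]) * scale)
    return (left, top, right, bottom)


box = {"page": 2, "left": 483.0, "top": 450.0, "width": 51.0, "height": 9.0}
pixel_rect = box_to_pixels(box, dpi=144)  # scale factor 2
size_mm = box_to_mm(box)
```

Note that `page` is 1-based, so it needs an offset of one when used as an index into a zero-based list of page images.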