Document Extractions

Documents are full of information. For instance, an invoice contains an amount to pay along with additional information such as a bank account and a due date. The Gini API extracts this information and provides it in a structured form. In the following, each piece of extracted information is referred to as an extraction. Examples of extractions include amounts and bank accounts, as well as addresses, tax numbers, and links to websites.

Gini also assigns a semantic property to an extraction. In other words, Gini not only extracts a date from a given document, but also infers that the date is the due date of an invoice. In the following, such an extraction is referred to as a specific extraction.

Extractions

An extraction contains an entity, which describes the general semantic type of the extraction (e.g. a date) and also determines the format of the value, which holds the extracted information as text. Optionally, a box describes the position of the extraction value on the document. In most instances, extractions without a bounding box are meta information (e.g. doctype).

| Name | Type | Description |
| --- | --- | --- |
| entity | string | Key (primary identification) of an entity type (e.g. banknumber). See Extraction Entities for a full list. |
| value | string | A normalized textual representation of the text or information provided by the extraction value (e.g. a bank number without spaces between the digits). |
| box | Bounding Box | (Optional) Bounding box containing the position of the extraction value on the document. |

Example

{
  "entity": "date",
  "value": "2012-06-20",
  "box": { ... }
}
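
As an illustration of the structure described above, the following minimal Python sketch reads an extraction from an already decoded JSON object. The `Extraction` class and the `parse_extraction` helper are hypothetical names chosen for this example and are not part of the Gini API; the only assumption taken from the text is that `box` may be missing.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Extraction:
    entity: str                 # general semantic type, e.g. "date"
    value: str                  # normalized textual value
    box: Optional[dict] = None  # optional bounding box


def parse_extraction(raw: dict[str, Any]) -> Extraction:
    """Build an Extraction from a decoded JSON object; 'box' may be absent."""
    return Extraction(entity=raw["entity"], value=raw["value"], box=raw.get("box"))


# An extraction without a bounding box, as is typical for meta information.
date = parse_extraction({"entity": "date", "value": "2012-06-20"})
```

Keeping `box` optional mirrors the table above, so client code is forced to handle extractions that have no position on the page.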

Specific extractions

A specific extraction assigns a semantic property to an extraction. In addition to the fields of an extraction, it has the field candidates:

| Name | Type | Description |
| --- | --- | --- |
| candidates | string | (Optional) A reference to a list of extraction candidates. See Available Extraction Candidates for a list. |

Example

"paymentDueDate": {
  "entity": "date",
  "value": "2012-06-20",
  "box": { ... },
  "candidates": "dates"
}

Available Specific Extractions

| Name | Description | Entity | Candidates |
| --- | --- | --- | --- |
| amountToPay | The amount which has to be paid. | amount | amounts |
| bankAccountNumber | The account number of a payment recipient. | bankaccount | bankAccountNumbers |
| bankNumber | The bank number of a payment recipient. | banknumber | bankNumbers |
| bic | The BIC of a payment recipient. | bic | bics |
| docType | The document type of a given document. | doctype | n/a |
| iban | The IBAN of a document sender. | iban | ibans |
| paymentDueDate | The calculated payment due date (e.g. of an invoice). | date | dates |
| paymentPurpose | Additional payment purpose text, provided when the payment reference is not available. Note: currently only available for clients in Austria. | text | n/a |
| paymentRecipient | The payment recipient, i.e. the beneficiary of a money transfer. | companyname | senderNames |
| paymentReference | The payment reference. | reference | n/a |
| senderName | The sender name. | companyname | senderNames |
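
Because not every specific extraction is found in every document, client code usually has to cope with missing names. The sketch below shows one way to do that. It assumes, purely for illustration, that the specific extractions have already been decoded into a Python dictionary keyed by name; `PAYMENT_FIELDS` and `payment_summary` are hypothetical helpers, not part of the Gini API.

```python
# Names taken from the list of available specific extractions.
PAYMENT_FIELDS = (
    "paymentRecipient",
    "iban",
    "bic",
    "amountToPay",
    "paymentReference",
    "paymentDueDate",
)


def payment_summary(extractions: dict) -> dict:
    """Map each payment-related name to its value, or None if it was not extracted."""
    return {
        name: extractions[name]["value"] if name in extractions else None
        for name in PAYMENT_FIELDS
    }


# Illustrative input: only the due date was extracted for this document.
summary = payment_summary(
    {"paymentDueDate": {"entity": "date", "value": "2012-06-20"}}
)
```

Here `summary["paymentDueDate"]` holds the extracted value, while every other field is `None` and can be requested from the user instead.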

Extraction Candidates

Extraction candidates are a list of suggestions for an appropriate extraction.

Example

"dates": [
  {"entity": "date","value": "2012-06-20","box": { ... } },
  {"entity": "date","value": "2012-05-10","box": { ... } },
  ...
]
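
The candidates field of a specific extraction names one of these lists. A minimal Python sketch of following that reference is shown below. It assumes that the specific extractions and the candidate lists are available as two separate dictionaries; how they are nested in an actual API response is not specified in this section, and `candidates_for` is a hypothetical helper.

```python
def candidates_for(name: str, extractions: dict, candidates: dict) -> list:
    """Return the candidate list referenced by a specific extraction, or []."""
    ref = extractions.get(name, {}).get("candidates")
    return candidates.get(ref, []) if ref else []


extractions = {
    "paymentDueDate": {"entity": "date", "value": "2012-06-20", "candidates": "dates"}
}
candidates = {
    "dates": [
        {"entity": "date", "value": "2012-06-20"},
        {"entity": "date", "value": "2012-05-10"},
    ]
}

# Alternative dates that could be offered if the chosen due date is wrong.
alternatives = candidates_for("paymentDueDate", extractions, candidates)
```

Returning an empty list for extractions without a candidates reference (such as docType) keeps the calling code free of special cases.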

Available Extraction Candidates

| Name | Description | Entity |
| --- | --- | --- |
| amounts | All amounts of a given document. | amount |
| bankAccountNumbers | All account numbers of a given document. | bankaccount |
| bankNumbers | All bank numbers of a given document. | banknumber |
| bics | All BICs of a given document. | bic |
| ibans | All IBANs of a given document. | iban |
| senderNames | All possible sender names of a given document. | companyname |

Extraction Entities

The available extraction entities are (follow each link for a detailed description):

Bounding Box

A bounding box creates a direct relation between an extraction and a document. The box describes the page and the position where the extraction originates.

| Name | Type | Description |
| --- | --- | --- |
| left | number | The distance from the left edge of the page |
| top | number | The distance from the top edge of the page |
| width | number | The horizontal dimension of a box |
| height | number | The vertical dimension of a box |
| page | number | The page on which the box can be found, starting with 1 |

Example

"box": {
  "page": 2,
  "left": 483.0,
  "top": 450.0,
  "width": 51.0,
  "height": 9.0
}

Coordinate system

The origin of the coordinate system is the upper left corner of the page. The coordinate system uses the DTP point as its unit: 1 pt = 1 inch / 72 = 25.4 mm / 72 ≈ 0.3528 mm
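
To draw a bounding box on a rendered page image, the point values have to be scaled to the resolution of that image. The following Python sketch shows the arithmetic under the assumption that the page was rendered at a known DPI with the same upper-left origin; `points_to_mm` and `box_to_pixels` are hypothetical helper names, not part of the Gini API.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def points_to_mm(points: float) -> float:
    """Convert a length in DTP points to millimetres."""
    return points * MM_PER_INCH / POINTS_PER_INCH


def box_to_pixels(box: dict, dpi: float) -> dict:
    """Scale a bounding box from points to pixels of a page rendered at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return {
        "page": box["page"],  # the page number is not a length and stays as-is
        "left": round(box["left"] * scale),
        "top": round(box["top"] * scale),
        "width": round(box["width"] * scale),
        "height": round(box["height"] * scale),
    }


box = {"page": 2, "left": 483.0, "top": 450.0, "width": 51.0, "height": 9.0}
pixels = box_to_pixels(box, dpi=144)
# pixels: left=966, top=900, width=102, height=18 (144 dpi is exactly 2 px per pt)
```

Rounding each value independently is the simplest choice; if exact edge alignment matters, rounding the left and right edges separately and deriving the width from them may be preferable.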