Document Extractions

Documents are full of information. An invoice, for instance, contains an amount to pay along with details such as a bank account and a due date. The Gini API extracts this information and provides it in a structured form. In the following, each piece of extracted information is referred to as an extraction. Examples of extractions include amounts and bank accounts, as mentioned, but also addresses, tax numbers, and links to websites.

Gini also maps a semantic property to an extraction. In other words, Gini not only extracts a date from a document but also infers that the date is, for example, the due date of an invoice. In the following, such an extraction is referred to as a specific extraction.

Extractions

An extraction contains an entity that describes its general semantic type (e.g. a date). The entity also determines the format of the value, which holds the extracted information as text. An optional box describes the position of the extraction value on the document. Extractions without a bounding box are in most cases meta information (e.g. doctype).

| Name | Type | Description |
| --- | --- | --- |
| entity | string | Key (primary identifier) of an entity type (e.g. banknumber). See Extraction Entities for a full list. |
| value | string | A normalized textual representation of the text or information captured by the extraction (e.g. a bank number without spaces between the digits). |
| box | Bounding Box | (Optional) Bounding box giving the position of the extraction value on the document. |

Example

{
  "entity": "date",
  "value": "2012-06-20",
  "box": { ... }
}
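
As a minimal sketch of how a client might consume such an object, the following Python reads one decoded extraction into a small data class. The names `Extraction` and `parse_extraction` are hypothetical helpers, not part of the Gini API, and the doctype value `"Invoice"` is made up for illustration. The box values are taken from the bounding box example in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Extraction:
    """Hypothetical client-side container for one extraction."""
    entity: str                  # general semantic type, e.g. "date"
    value: str                   # normalized text of the extracted value
    box: Optional[dict] = None   # missing for meta information such as doctype


def parse_extraction(raw: dict) -> Extraction:
    """Build an Extraction from one decoded JSON object."""
    return Extraction(
        entity=raw["entity"],
        value=raw["value"],
        box=raw.get("box"),      # the box is optional, so fall back to None
    )


with_box = parse_extraction({
    "entity": "date",
    "value": "2012-06-20",
    "box": {"page": 2, "left": 483.0, "top": 450.0, "width": 51.0, "height": 9.0},
})
# Illustrative meta extraction without a box; the value is an assumption.
meta = parse_extraction({"entity": "doctype", "value": "Invoice"})
```

Reading `box` with `.get()` keeps the parser from failing on extractions that carry no position.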

Specific extractions

A specific extraction assigns a semantic property to an extraction. In addition to the extraction fields, it has a field named candidates:

| Name | Type | Description |
| --- | --- | --- |
| candidates | string | (Optional) A reference to a list of extraction candidates. See Available Extraction Candidates for a list. |

Example

"paymentDueDate": {
  "entity": "date",
  "value": "2012-06-20",
  "box": { ... },
  "candidates": "dates"
}

Available Specific Extractions

| Name | Description | Entity | Candidates |
| --- | --- | --- | --- |
| amountToPay | The amount which has to be paid. | amount | amounts |
| bankAccountNumber | The account number of a payment recipient. | bankaccount | bankAccountNumbers |
| bankNumber | The bank number of a payment recipient. | banknumber | bankNumbers |
| bic | The BIC of a payment recipient. | bic | bics |
| companyRegisterId | The commercial register number of a document sender. | companyregisterid | companyRegisterIds |
| customerId | The customer ID of a document recipient. | customerid | customerIds |
| docType | The document type of a given document. | doctype | n/a |
| documentDate | The document date. | date | dates |
| documentDomain | The domain of a given document. | documentdomain | n/a |
| email | The most probable email address of a sender. | email | emails |
| iban | The IBAN of a document sender. | iban | ibans |
| invoiceId | The invoice ID of a given document. | invoiceid | invoiceIds |
| paymentDueDate | The calculated payment due date (e.g. of an invoice). | date | dates |
| paymentPurpose | Additional payment purpose text, used when the payment reference is not available. | text | n/a |
| paymentRecipient | The payment recipient, i.e. the beneficiary of a money transfer. | companyname | senderNames |
| paymentReference | The payment reference. | reference | n/a |
| paymentState | Whether a document still has to be paid or has already been paid. | paymentstate | n/a |
| phoneNumber | The first phone number found in a given document. | phonenumber | phoneNumbers |
| referenceId | The first reference ID found in a given document. | text | referenceIds |
| senderCity | The sender city. | city | n/a |
| senderName | The sender name. | companyname | senderNames |
| senderNameAddition | The sender name addition. | companynameaddition | n/a |
| senderPoBox | The sender post-office box. | poboxnumber | n/a |
| senderPostalCode | The sender’s postal code. | zipcode | n/a |
| senderStreet | The sender’s street with house number. | street | n/a |
| taxNumber | The tax number of a document sender. | taxnumber | taxnumbers |
| vatRegNumber | The VAT number of a document sender. | vat | vatRegNumbers |
| website | The most probable web address of a sender. | url | websites |

Extraction Candidates

Extraction candidates are a list of suggestions for a suitable extraction. Each candidate is itself an extraction.

Example

"dates": [
  {"entity": "date","value": "2012-06-20","box": { ... } },
  {"entity": "date","value": "2012-05-10","box": { ... } },
  ...
]
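
The link between a specific extraction and its candidates can be sketched as a simple lookup: the candidates field holds the name of a candidate list. The sketch below assumes the decoded response is a dictionary with top-level `extractions` and `candidates` keys. That layout, the sample `docType` value, and the helper name `candidates_for` are assumptions made for illustration and are not confirmed by this section.

```python
def candidates_for(response: dict, name: str) -> list:
    """Return the candidate list referenced by the specific extraction `name`.

    Assumes (hypothetically) that `response` has top-level keys
    "extractions" and "candidates".
    """
    extraction = response["extractions"][name]
    key = extraction.get("candidates")   # optional reference, e.g. "dates"
    if key is None:
        return []                        # e.g. docType has no candidates
    return response.get("candidates", {}).get(key, [])


# Sample data modelled on the examples in this document; boxes omitted.
response = {
    "extractions": {
        "paymentDueDate": {"entity": "date", "value": "2012-06-20", "candidates": "dates"},
        "docType": {"entity": "doctype", "value": "Invoice"},
    },
    "candidates": {
        "dates": [
            {"entity": "date", "value": "2012-06-20"},
            {"entity": "date", "value": "2012-05-10"},
        ],
    },
}

due_date_options = [c["value"] for c in candidates_for(response, "paymentDueDate")]
# due_date_options == ["2012-06-20", "2012-05-10"]
```

A lookup like this lets a user interface offer the other dates found in the document when the inferred due date turns out to be wrong.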

Available Extraction Candidates

| Name | Description | Entity |
| --- | --- | --- |
| amounts | All amounts of a given document. | amount |
| bankAccountNumbers | All account numbers of a given document. | bankaccount |
| bankNumbers | All bank numbers of a given document. | banknumber |
| bics | All BICs of a given document. | bic |
| companyRegisterIds | All alphanumeric strings (with a structure similar to a German company register ID) of a given document. | companyregisterid |
| customerIds | All alphanumeric strings (with a structure similar to an identifier) of a given document. | customerid |
| dates | All dates of a given document. | date |
| emails | All email addresses of a given document. | email |
| ibans | All IBANs of a given document. | iban |
| invoiceIds | All alphanumeric strings (with a structure similar to an identifier) of a given document. | invoiceid |
| phoneNumbers | All phone numbers of a given document. | phonenumber |
| referenceIds | All potential reference IDs of a given document. | text |
| senderNames | All possible sender names of a given document. | companyname |
| taxNumbers | All strings of digits (with a structure similar to a German tax number) of a given document. | taxnumber |
| vatRegNumbers | All alphanumeric strings (with a structure similar to an identifier) of a given document. | vat |
| websites | All links found in a given document. | url |

Bounding Box

A bounding box links an extraction to its location in the document. It gives the page and the position on that page from which the extraction originates.

| Name | Type | Description |
| --- | --- | --- |
| left | number | The distance from the left edge of the page, in points. |
| top | number | The distance from the top edge of the page, in points. |
| width | number | The horizontal dimension of the box, in points. |
| height | number | The vertical dimension of the box, in points. |
| page | number | The page on which the box can be found, starting with 1. |

Example

"box": {
  "page": 2,
  "left": 483.0,
  "top": 450.0,
  "width": 51.0,
  "height": 9.0
}

Coordinate system

The origin of the coordinate system is the upper left corner of the page. Coordinates are given in DTP points: 1 pt = 1 inch / 72 = 25.4 mm / 72 ≈ 0.3528 mm
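
To highlight an extraction on a rendered page image, the point coordinates have to be scaled to the image resolution. The following sketch converts a bounding box to millimetres and to pixel corners. The helper names `box_to_mm` and `box_to_pixels` are hypothetical, and the sketch assumes the page image is rendered without cropping or rotation, so that its upper left corner coincides with the origin described above.

```python
MM_PER_POINT = 25.4 / 72   # 1 pt = 1/72 inch and 1 inch = 25.4 mm


def box_to_mm(box: dict) -> dict:
    """Convert left/top/width/height from points to millimetres."""
    converted = {k: box[k] * MM_PER_POINT for k in ("left", "top", "width", "height")}
    converted["page"] = box["page"]   # the page number is not a length
    return converted


def box_to_pixels(box: dict, dpi: float) -> tuple:
    """Return (left, top, right, bottom) in pixels for a page rendered at `dpi`."""
    scale = dpi / 72                  # pixels per point
    left = box["left"] * scale
    top = box["top"] * scale
    right = left + box["width"] * scale
    bottom = top + box["height"] * scale
    return (round(left), round(top), round(right), round(bottom))


box = {"page": 2, "left": 483.0, "top": 450.0, "width": 51.0, "height": 9.0}
mm_box = box_to_mm(box)
pixel_corners = box_to_pixels(box, dpi=144)   # (966, 900, 1068, 918)
```

Because the origin is at the upper left and the top coordinate grows downwards, the pixel corners can be passed directly to image libraries that use the same convention.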