Documents

Because the main purpose of the Gini API is to extract information from documents, the API is built around the concept of documents. A document can be any written representation of information, such as an invoice, a reminder or a contract.

The main idea is that you submit a document in the form of an electronic file to Gini. After Gini has analyzed the document, you can retrieve the extracted information by querying the API.

The following documentation explains those actions in detail.

Note

Most example requests show the usage of cURL, a command-line tool to perform HTTP requests.

Submitting files

To analyze a document, you must first submit its source file to Gini.

You can submit a document by sending its file in a POST request to the /documents path. After the document has been created successfully, its location is returned in the Location header.

The Gini API currently supports two different variants of uploads, one optimized for web applications running in a web browser and one for all other types of clients.

The variant optimized for web browsers expects the documents to be uploaded using a multipart/form-data request as constructed by typical web browsers.

The variant aimed at all other clients simply uses the request body (independent of the Content-Type of the request) as document.

Supported file formats

Gini currently supports input files in the formats PDF, GIF (non-animated), PNG, JPEG, TIFF and plain text. You can use native documents (PDF only) as well as scanned documents (all formats). Note that the Gini API accepts a document only within certain limitations:

  • Document file size must be less than 10 MiB.
  • PDF files must not have any security restrictions such as password protection.
  • Scanned documents should have a resolution of at least 300dpi in order for the OCR to return optimal results.
  • Plain text documents have to be encoded in UTF-8. The source size must be smaller than 512 KiB.
  • Only the first 10 pages of a document are processed.
  • Only contents of documents in German language are sufficiently well recognized.
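The size and encoding limits above can be checked on the client before an upload is attempted. The following is a minimal sketch, not part of the Gini API: the function name check_upload and its messages are hypothetical, and it only covers the limits that can be verified from the raw bytes (it does not inspect page counts, PDF security settings or scan resolution).

```python
from typing import List

MAX_FILE_BYTES = 10 * 1024 * 1024  # documents must be smaller than 10 MiB
MAX_TEXT_BYTES = 512 * 1024        # plain text must be smaller than 512 KiB


def check_upload(data: bytes, is_plain_text: bool = False) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if len(data) >= MAX_FILE_BYTES:
        problems.append("file must be smaller than 10 MiB")
    if is_plain_text:
        if len(data) >= MAX_TEXT_BYTES:
            problems.append("plain text must be smaller than 512 KiB")
        try:
            data.decode("utf-8")  # plain text has to be valid UTF-8
        except UnicodeDecodeError:
            problems.append("plain text must be UTF-8 encoded")
    return problems
```

Returning a list instead of raising on the first problem lets an application show all violations to the user at once.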

Document Type Hints

The type of a document is in many cases known to the client application. If you provide the doctype parameter with a valid value from the doctype entity, Gini can optimize the processing of the document in many ways. Furthermore, some incubating extractions may only be available if the document type is provided.

Request

Documents can be submitted by doing a POST request on the /documents resource.

POST /documents

Headers

Header Value
Content-Type multipart/form-data; boundary=... (web browser variant), or */* (all other clients)
Accept application/vnd.gini.v1+json

Request query parameters

If the upload does not use multipart/form-data, you can optionally provide a file name for the submitted document with a query parameter:

Name Type Description
filename string (Optional) File name of the submitted document
doctype string (Optional) Type of the submitted document. See doctype for possible values.

Body

Only in case of Content-Type: multipart/form-data (applications running in a web browser):

Key Description
Content-Disposition form-data
file File contents of document

Example

Variant for web applications running in a web browser:

curl -H 'Authorization: BEARER <token>' --form 'file=@file.pdf' -H 'Accept: application/vnd.gini.v1+json' -i https://api.gini.net/documents

Variant for other types of applications:

curl -H 'Authorization: BEARER <token>' --data-binary '@file.pdf' -H 'Accept: application/vnd.gini.v1+json' -i 'https://api.gini.net/documents?filename=file.pdf'

Response

Headers

Status Code Description
201 (Created) Success
Header Value
Content-Type application/vnd.gini.v1+json
Location Absolute URI of created document (document URI)

Errors

Status Code Description
400 (Bad Request) Returned when a file in an invalid format is sent
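For clients that are not web browsers, the raw-body upload can be sketched in Python with the standard library only. This is an illustration under assumptions, not an official client: the function names are hypothetical, and the Content-Type value application/octet-stream is an arbitrary choice (the text above states that the body is used independently of the Content-Type). The request is built separately from sending it so that the URL and headers can be inspected without network access.

```python
import urllib.parse
import urllib.request
from typing import Optional

API_BASE = "https://api.gini.net"


def build_upload_request(token: str, data: bytes, filename: str,
                         doctype: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a raw-body upload request for /documents."""
    params = {"filename": filename}
    if doctype is not None:
        params["doctype"] = doctype  # must be a valid doctype value
    url = f"{API_BASE}/documents?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Authorization": f"BEARER {token}",
            "Accept": "application/vnd.gini.v1+json",
            # Assumption: any Content-Type is accepted for raw-body uploads.
            "Content-Type": "application/octet-stream",
        },
    )


def upload(token: str, data: bytes, filename: str,
           doctype: Optional[str] = None) -> str:
    """Send the upload and return the document URI from the Location header."""
    request = build_upload_request(token, data, filename, doctype)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses such as 400.
    with urllib.request.urlopen(request) as response:
        return response.headers["Location"]
```

The returned document URI is the address to use for all later requests about this document.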

Checking processing status and getting document information

After submission, the document is processed. You can check the processing status of a single document by examining the document information, which can be retrieved with a GET request to the document URI. Once the document has been processed, you can retrieve its extractions and additional information such as the document layout. Alternatively, Notifications based on server-sent events (SSE) can be used.

Request

Document information can be retrieved by doing a GET request on the document URI.

GET /documents/{id}

Headers

Header Value
Accept application/vnd.gini.v1+json

Example

curl -H 'Authorization: BEARER <token>' -X GET -H 'Accept: application/vnd.gini.v1+json' -i https://api.gini.net/documents/c292af40-d06a-11e2-9a2f-000000000000

Response

Headers

Status Code Description
200 (OK) Success
Header Value
Content-Type application/vnd.gini.v1+json

Body (application/vnd.gini.v1+json)

Key Child Key Type Description
id   string Unique identifier of document as UUID Version 1
name   string Document name (as stated on upload)
pageCount   number Number of pages
creationDate   number Unix timestamp of document creation date in milliseconds
origin   string Source channel of the document, either UPLOAD (when uploaded through Gini API) or UNKNOWN
progress   string Processing status of the document, either PENDING, COMPLETED, or ERROR
sourceClassification   string Classification of the source file, either SCANNED, SANDWICH, NATIVE or TEXT.
pages   array List of page objects
  pageNumber number Page number in the document
  images object URIs to pre-rendered page images
_links   object Links to related resources, e.g. the found extractions or the document layout
  extractions string URI to extractions of the document (extractions URI)
  layout string URI to the layout of the document (layout URI)
  processed string URI to the processed document
  document string URI to the document, meaning the current resource (document URI)

Example

{
  "id": "626626a0-749f-11e2-bfd6-000000000000",
  "creationDate": 1360623867402,
  "name": "scanned.jpg",
  "progress": "COMPLETED",
  "origin": "UPLOAD",
  "sourceClassification": "SCANNED",
  "pageCount": 1,
  "pages" : [
    {
      "images" : {
        "750x900" : "http://api.gini.net/documents/626626a0-749f-11e2-bfd6-000000000000/pages/1/750x900",
        "1280x1810" : "http://api.gini.net/documents/626626a0-749f-11e2-bfd6-000000000000/pages/1/1280x1810"
      },
      "pageNumber" : 1
    }
  ],
  "_links": {
    "extractions": "https://api.gini.net/documents/626626a0-749f-11e2-bfd6-000000000000/extractions",
    "layout": "https://api.gini.net/documents/626626a0-749f-11e2-bfd6-000000000000/layout",
    "document": "https://api.gini.net/documents/626626a0-749f-11e2-bfd6-000000000000",
    "processed": "https://api.gini.net/documents/626626a0-749f-11e2-bfd6-000000000000/processed"
  }
}

Errors

Status Code Description
404 (Not Found) Returned when no document can be found under the specified URI
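Checking the processing status usually means polling the document URI until progress is no longer PENDING. A minimal sketch of such a loop follows. The function name, the polling interval and the attempt limit are assumptions, not values recommended by Gini; the HTTP call itself is passed in as a callable so that the loop stays independent of any particular HTTP library.

```python
import time
from typing import Callable, Dict


def wait_until_processed(fetch_document: Callable[[], Dict],
                         interval_seconds: float = 1.0,
                         max_attempts: int = 30,
                         sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll until progress is COMPLETED or ERROR, then return the document.

    fetch_document performs GET /documents/{id} and returns the parsed JSON.
    """
    for attempt in range(max_attempts):
        document = fetch_document()
        if document["progress"] != "PENDING":
            return document  # either COMPLETED or ERROR
        if attempt < max_attempts - 1:
            sleep(interval_seconds)  # wait before asking again
    raise TimeoutError("document is still PENDING")
```

Callers still have to check whether the returned progress is COMPLETED or ERROR before requesting extractions.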

Retrieving extractions

After a document has been processed, the extractions from the document analysis can be retrieved. See Document Extractions for details about extractions.

Request

Extractions can be retrieved by doing a GET request on the extractions URI:

GET /documents/{id}/extractions

Headers

Header Value
Accept application/vnd.gini.v1+json

Example

curl -H 'Authorization: BEARER <token>' -X GET -H 'Accept: application/vnd.gini.v1+json' -i https://api.gini.net/documents/c292af40-d06a-11e2-9a2f-000000000000/extractions

Response

Headers

Status Code Description
200 (OK) Success
Header Value
Content-Type application/vnd.gini.v1+json

Body (application/vnd.gini.v1+json)

A detailed explanation of the response format can be found in Document Extractions.

Name Type Description
extractions object A mapping of labels to Extractions (i.e. Specific extractions)
candidates object A mapping of labels to a list of Extraction Candidates

Example

{
    "extractions": {
        "amountToPay": {
            "box": {
                "height": 9.0,
                "left": 516.0,
                "page": 1,
                "top": 588.0,
                "width": 42.0
            },
            "entity": "amount",
            "value": "24.99:EUR",
            "candidates": "amounts"
        }
    },
    "candidates": {
        "amounts": [
            {
                "box": {
                    "height": 9.0,
                    "left": 516.0,
                    "page": 1,
                    "top": 588.0,
                    "width": 42.0
                },
                "entity": "amount",
                "value": "24.99:EUR"
            },
            {
                "box": {
                    "height": 9.0,
                    "left": 241.0,
                    "page": 1,
                    "top": 588.0,
                    "width": 42.0
                },
                "entity": "amount",
                "value": "21.0:EUR"
            }
        ]
        ...
    }
}

Errors

Status Code Description
404 (Not Found) Response status if the requested entity couldn’t be found
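In the example response, an extraction refers to its alternatives by name: the candidates field of amountToPay holds the key amounts, which is looked up in the top-level candidates mapping. The sketch below resolves that reference and splits an amount value. Both helper names are hypothetical, and the assumption that amount values always have the form number:currency is taken from the example only; see Document Extractions for the authoritative format.

```python
from decimal import Decimal
from typing import Dict, List, Tuple


def parse_amount(value: str) -> Tuple[Decimal, str]:
    """Split a value such as "24.99:EUR" into its number and currency code."""
    number, currency = value.split(":", 1)
    return Decimal(number), currency


def candidates_for(response: Dict, label: str) -> List[Dict]:
    """Return the candidate list that the extraction for label points to."""
    extraction = response.get("extractions", {}).get(label)
    if extraction is None or "candidates" not in extraction:
        return []  # label not extracted, or no alternatives offered
    return response.get("candidates", {}).get(extraction["candidates"], [])
```

Decimal is used instead of float so that monetary values are not affected by binary rounding.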

Submitting feedback on extractions

If real users review the extractions in your application, you should always submit feedback on them to help improve the recognition rate of the Gini API.

Note

Feedback should only be sent if real users can see, approve, correct and complement extraction-based data.

Gini uses several techniques to learn from feedback on extractions automatically. For this learning, feedback on correct extractions is just as important as feedback on incorrect ones.

There are currently two ways to submit feedback. The first and most common one is to submit the complete feedback in one request. This is the natural choice if your frontend (app) shows the extractions on one screen in an editable form, where the user edits the extractions and then confirms with the click of a button that the approval or correction is complete. The second way covers rare use cases in which no such final approval signal (e.g. the click of a button) is available. In those cases you can send the feedback for one label per request.

There are three different types of feedback:
  • positive feedback: The happy path - the extraction was correct and confirmed by the user
  • complementary feedback: The Gini API extracted nothing, the given label is not in the response and the user entered the correct value
  • negative feedback: The extraction was incomplete / erroneous and corrected by the user

Please see the full example for further details.

Submitting feedback on multiple extractions

The Gini API allows you to submit feedback on multiple extractions for a single document with a single request. This is the strongly recommended way to submit feedback, for two reasons. First, the total number of round trips is reduced to one and the feedback is handled internally as a batch, so the update is more efficient than submitting each piece of feedback with a separate request (see Submitting feedback on single extractions). Second, Gini’s training techniques can benefit from feedback on multiple extractions, because Gini knows that the individual parts of the submitted feedback belong together.

Note

You should send feedback only for labels which the user has seen. Unseen labels should be filtered out.

Request

To correct or verify multiple labeled extractions, send a single PUT request to the document’s extractions URI:

PUT /documents/{id}/extractions

The labels must correspond to the names of the extraction types (e.g. amountToPay; see Available Specific Extractions for a list of names).

Headers

Header Value
Content-Type application/vnd.gini.v1+json

Body

Key Type Description
feedback object A mapping of labels to Extractions (i.e. Specific extractions)

Note

The boxes are optional. Boxes should only be sent if your use case visualizes the document with the locations of the extractions, so that the user can confirm the correct location of an extraction.

Example

We show a more elaborate example here in order to explain the different types of feedback. The example scenario is as follows: the user uploads a document from which the labels amountToPay, paymentReference and iban were extracted. Unfortunately, the label paymentRecipient could not be extracted. The response for the extractions request is as follows:

{
    "candidates": {
    },
    "extractions": {
        "amountToPay": {
            "box": {
                "height": 8.0,
                "left": 545.0,
                "page": 1,
                "top": 586.0,
                "width": 17.0
            },
            "candidates": "amounts",
            "entity": "amount",
            "value": "5.60:EUR"
        },
        "iban": {
            "box": {
                "height": 7.0,
                "left": 447.0,
                "page": 1,
                "top": 746.0,
                "width": 100.0
            },
            "candidates": "ibans",
            "entity": "iban",
            "value": "DE68130300000017850360"
        },
        "paymentReference": {
            "entity": "reference",
            "value": "ReNr 123, KdNr 32"
        }
    }
}

The user adds the missed paymentRecipient value (complementary feedback) and corrects the paymentReference to “ReNr 1735, KdNr 37” (negative feedback). The iban and amountToPay were correct (positive feedback). The document is not shown, so we can leave out the boxes. The resulting feedback request is then as follows:

{
   "feedback": {
       "amountToPay": {
           "value": "5.60:EUR"
       },
       "iban": {
           "value": "DE68130300000017850360"
       },
       "paymentReference": {
           "value": "ReNr 1735, KdNr 37"
       },
       "paymentRecipient": {
           "value": "Zalando SE"
       }
   }
}

Response

Status Code Description
204 (No Content) The feedback was successfully processed
404 (Not Found) The document or label could not be found
422 (Unprocessable Entity) At least one value was not valid according to the validation rules of the label’s entity
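The scenario above can be reproduced in a few lines of Python: given the extractions that were returned and the values the user finally confirmed, one helper derives the type of feedback for each label and another builds the request body. This is only a sketch with hypothetical function names. The API itself does not take the feedback type as input; the classification is shown purely to illustrate the three types. The confirmed mapping should contain only labels the user has actually seen.

```python
import json
from typing import Dict


def classify_feedback(extractions: Dict, confirmed: Dict[str, str]) -> Dict[str, str]:
    """Label each confirmed value as positive, negative or complementary."""
    kinds = {}
    for label, value in confirmed.items():
        if label not in extractions:
            kinds[label] = "complementary"  # Gini extracted nothing
        elif extractions[label]["value"] == value:
            kinds[label] = "positive"       # user confirmed the value
        else:
            kinds[label] = "negative"       # user corrected the value
    return kinds


def build_feedback_body(confirmed: Dict[str, str]) -> str:
    """Serialize confirmed values into the body of PUT .../extractions."""
    feedback = {label: {"value": value} for label, value in confirmed.items()}
    return json.dumps({"feedback": feedback})
```

Boxes are left out here, which matches the case where the document itself is not shown to the user.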

Submitting feedback on single extractions

Request

To correct, verify or add a single labeled extraction, send a PUT request to the extraction URI. Please be aware that single-extraction feedback should be avoided; it is only suitable for very rare use cases.

PUT /documents/{id}/extractions/{label}

The label must correspond to the name of the extraction type (e.g. amountToPay; see Available Specific Extractions for a list of names).

Headers

Header Value
Content-Type application/vnd.gini.v1+json

Body

Key Type Description
value string New value of extraction
box object (Optional) Bounding box where the extraction can be found

Example

{
    "box": {
        "height": 14,
        "left": 405,
        "page": 1,
        "top": 421,
        "width": 36
    },
    "value": "new value"
}

Response

Status Code Description
204 (No Content) The feedback was successfully processed
404 (Not Found) The document or label could not be found
422 (Unprocessable Entity) At least one value was not valid according to the validation rules of the label’s entity

Submitting feedback for invalid extractions

Request

In case an extraction was found erroneously (i.e. is not present in the source document), you can delete it by issuing a DELETE request to the extraction URI:

DELETE /documents/{id}/extractions/{label}

Response

Status Code Description
204 (No Content) Removal of the label was successful
404 (Not Found) Returned when the document or label cannot be found

Retrieving a document’s pages

The Gini API renders preview images of a document’s pages. To retrieve the list of pages for a document, issue a GET request to the pages sub-resource of a document:

GET /documents/{id}/pages

The response is a list of pages.

Request

Path parameters

Name Value
id Document ID

Headers

Header Value
Accept application/vnd.gini.v1+json

Example

curl -H 'Authorization: BEARER <token>' -X GET -H 'Accept: application/vnd.gini.v1+json' -i https://api.gini.net/documents/c292af40-d06a-11e2-9a2f-000000000000/pages

Response

Headers

Status Code Description
200 (OK) The request was successful.
404 (Not Found) The requested document does not exist.

Body

Name Type Description
pages array All pages in the current result page

A page is an entity with the following fields:

Key Child key Type Description
documentId   string UUID of the document to which page belongs
pagenum   number Page number
_links   object Links to related resources
  document string Link to the document to which the page belongs
  pages string Link to the pages of the document
_images   object Links to pre-rendered page images in different resolutions
  <image resolution in pixels> string Link to a pre-rendered image of the page

Note

Image downloads require a corresponding Accept header such as image/jpeg or image/*.

Example

[
  {
    "images" : {
      "1280x1810" : "https://api.gini.net/documents/c292af40-d06a-11e2-9a2f-000000000000/pages/1/1280x1810",
      "750x900" : "https://api.gini.net/documents/c292af40-d06a-11e2-9a2f-000000000000/pages/1/750x900"
    },
    "pageNumber" : 1
  },
  {
    "pageNumber" : 2,
    "images" : {
      "1280x1810" : "https://api.gini.net/documents/c292af40-d06a-11e2-9a2f-000000000000/pages/2/1280x1810",
      "750x900" : "https://api.gini.net/documents/c292af40-d06a-11e2-9a2f-000000000000/pages/2/750x900"
    }
  }
]
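Each page lists its pre-rendered images under keys of the form WIDTHxHEIGHT, as in the example above. A client that wants the most detailed preview can pick the entry with the largest pixel area. The helper below is a hypothetical sketch that assumes every key follows this WIDTHxHEIGHT pattern.

```python
from typing import Dict, Tuple


def largest_image(images: Dict[str, str]) -> Tuple[str, str]:
    """Return the (resolution, url) pair with the largest pixel area."""

    def area(resolution: str) -> int:
        width, height = resolution.split("x")
        return int(width) * int(height)

    best = max(images, key=area)  # raises ValueError if images is empty
    return best, images[best]
```

When downloading the selected URL, remember to send an Accept header such as image/jpeg, as described in the note above.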

Retrieving the layout of a document

The layout of the document describes the textual content of a document with positional information, based on the processed document.

Coordinate system

The origin of the coordinate system is the upper left corner of the page. The coordinate system uses the DTP point as its unit: 1 pt = 1 inch / 72 = 25.4 mm / 72 ≈ 0.3528 mm
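To draw layout coordinates on top of a rendered page image, the point values have to be converted to another unit. The following sketch applies the definition of the DTP point given above; the function names are hypothetical, and the dpi argument stands for whatever resolution your own rendering uses.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72


def points_to_mm(points: float) -> float:
    """Convert DTP points to millimetres (1 pt = 25.4 mm / 72)."""
    return points * MM_PER_INCH / POINTS_PER_INCH


def points_to_pixels(points: float, dpi: float) -> float:
    """Convert DTP points to pixels for an image rendered at dpi."""
    return points * dpi / POINTS_PER_INCH
```

For example, a page width of 595.3 pt corresponds to roughly 210 mm, the width of an A4 page.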

Request

The layout of a document can be retrieved by a GET request to the layout URI:

GET /documents/{id}/layout

Headers

Header Value
Accept application/vnd.gini.v1+json

Example

curl -H 'Authorization: BEARER <token>' -X GET -H 'Accept: application/vnd.gini.v1+json' -i https://api.gini.net/documents/c292af40-d06a-11e2-9a2f-000000000000/layout

Response

Headers

Status Code Description
200 (OK) Success
Header Value
Content-Type application/vnd.gini.v1+json

Body (application/vnd.gini.v1+json)

Key Type Description
pages array Array of page objects

Page Object

Key Type Description
number number Number of the page starting with 1
sizeX number Width of the page
sizeY number Height of the page
textZones array Array of textzone objects
regions array Array of region objects

TextZone Object

Key Type Description
paragraphs array Array of paragraph objects

Paragraph Object

Key Type Description
w number Width of the paragraph
h number Height of the paragraph
t number Distance of the paragraph from the upper edge of the page
l number Distance of the paragraph from the left edge of the page
lines array Array of line objects

Line Object

Key Type Description
w number Width of the line
h number Height of the line
t number Distance of the line from the upper edge of the page
l number Distance of the line from the left edge of the page
wds array Array of word objects

Word Object

Key Type Description
h number Height of the word
w number Width of the word
l number Distance of the word from the left edge of the page
t number Distance of the word from the upper edge of the page
fontSize number Font size of the word in points
fontFamily string Name of the font family of the word
bold boolean Indicates bold font style
text string Text of word

Region Object

Key Type Description
h number Height of the region of interest
w number Width of the region of interest
l number Distance of the region from the left edge of the page
t number Distance of the region from the upper edge of the page
type string Type of the region of interest, e.g. RemittanceSlip

Example

{
  "pages": [
    {
      "number": 1,
      "sizeX": 595.3,
      "sizeY": 841.9,
      "textZones": [
        {
          "paragraphs": [
            {
              "l": 54.0,
              "t": 158.76,
              "w": 190.1,
              "h": 36.55000000000001,
              "lines": [
                {
                  "l": 54.0,
                  "t": 158.76,
                  "w": 190.1,
                  "h": 10.810000000000002,
                  "wds": [
                    {
                      "l": 54.0,
                      "t": 158.76,
                      "w": 18.129999999999995,
                      "h": 9.900000000000006,
                      "fontSize": 9.9,
                      "fontFamily": "Arial-BoldMT",
                      "bold":false,
                      "text": "Ihre"
                    },
                    {
                      "l": 74.86,
                      "t": 158.76,
                      "w": 83.91000000000001,
                      "h": 9.900000000000006,
                      "fontSize": 9.9,
                      "fontFamily": "Arial-BoldMT",
                      "bold":false,
                      "text": "Vorgangsnummer"
                    },
                    {
                      "l": 158.76,
                      "t": 158.76,
                      "w": 3.3000000000000114,
                      "h": 9.900000000000006,
                      "fontSize": 9.9,
                      "fontFamily": "Arial-BoldMT",
                      "bold":false,
                      "text": ":"
                    },
                    [...]
                  ]
                },
                [...]
              ]
            }
          ]
        }
      ],
      "regions": [
        {
          "l": 20.0,
          "t": 240.1,
          "w": 190.0,
          "h": 150.3,
          "type": "RemittanceSlip"
        },
        [...]
      ]
    },
    [...]
  ]
}

Errors

Status Code Description
404 (Not Found) Returned when the requested layout is invalid
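The layout response is nested as pages, text zones, paragraphs, lines and words. As an illustration of how to walk this structure, the sketch below collects the text of every line by joining its words with spaces. The function name is hypothetical, and joining with a single space is a simplification that ignores the positional information.

```python
from typing import Dict, List


def layout_lines(layout: Dict) -> List[str]:
    """Join the words of every line, page by page, in response order."""
    lines = []
    for page in layout.get("pages", []):
        for zone in page.get("textZones", []):
            for paragraph in zone.get("paragraphs", []):
                for line in paragraph.get("lines", []):
                    words = [word["text"] for word in line.get("wds", [])]
                    lines.append(" ".join(words))
    return lines
```

Applied to the three words of the example above, this yields the single line "Ihre Vorgangsnummer :".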

Retrieving the processed document

Request

Before Gini tries to extract information, it preprocesses the document, e.g. to deskew pages. The processed document can be retrieved by a GET request:

GET /documents/{id}/processed

Path parameters

Name Value
id Document ID

Response

Headers

Status Code Description
200 (OK) Success

Body

The version of the uploaded document file after preprocessing (i.e. color corrected, deskewed), which has been used for all layout and semantic extractions. For native PDF documents, it is identical to the original document file.

Errors

Status Code Description
404 (Not Found) The requested document does not exist.

Create an Error Report for a Document

If the processing result for a document was not satisfactory (e.g. extractions were empty or incorrect), you can create an error report for the document. This allows Gini to analyze and correct the problem that was found. The owner of the document must agree that Gini can use the document for debugging and error analysis. The returned errorId can be used to refer to the reported error when contacting Gini support.

Request

Create an error report with a POST request to the errorreport sub-resource of the document:

POST /documents/{id}/errorreport

Headers

Header Value
Content-Type application/vnd.gini.v1+json

Request query parameters

Name Type Description
summary string (Optional) Short summary of the error found
description string (Optional) More detailed description of the error found

Body

Empty.

Example

curl -H 'Authorization: BEARER <token>' -X POST -H 'Accept: application/vnd.gini.v1+json' -i 'https://api.gini.net/documents/c292af40-d06a-11e2-9a2f-000000000000/errorreport?summary=Extractions%20Empty&description=Despite%20the%20submitted%20remittance%20slip%20has%20a%20good%20image%20quality%2C%20the%20iban%20was%20not%20found.'

Response

Headers

Status Code Description
200 (OK) The error report was successfully submitted
404 (Not Found) The document couldn’t be found

Body

Key Type Description
message string Short feedback message
errorId string The unique errorId. Can be used to refer to this error.

Example

{
  "message": "error was reported, please refer to the given error id",
  "errorId": "436626a0-749f-11e2-bfd6-00000000000"
}
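Because summary and description are free text sent as query parameters, they have to be percent-encoded. A small sketch of building the error report URL with the Python standard library follows; the function name is hypothetical and the request itself (a POST with an empty body) is not sent here.

```python
import urllib.parse
from typing import Optional

API_BASE = "https://api.gini.net"


def error_report_url(document_id: str, summary: Optional[str] = None,
                     description: Optional[str] = None) -> str:
    """Build the errorreport URL with percent-encoded query parameters."""
    params = {}
    if summary is not None:
        params["summary"] = summary
    if description is not None:
        params["description"] = description
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
    url = f"{API_BASE}/documents/{document_id}/errorreport"
    return f"{url}?{query}" if query else url
```

Letting urlencode do the escaping avoids hand-written sequences such as %20 and %2C in application code.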

Deleting documents

To delete a document, send a DELETE request to the document URI. Deleting a document also deletes all associated resources (extractions, layout).

Request

Documents can be deleted by doing a DELETE request on the document URI.

DELETE /documents/{id}

Example

curl -H 'Authorization: BEARER <token>' -X DELETE -i https://api.gini.net/documents/c292af40-d06a-11e2-9a2f-000000000000

Response

Headers

Status Code Description
204 (No Content) Success

Errors

Status Code Description
404 (Not Found) Returned when no document can be found under the specified URI

Getting a list of all documents

To get a list of all documents, issue a GET request on the /documents resource. The response will be a paginated list of all documents.

GET /documents

Request query parameters

Name Type Description
limit number (Optional) Maximum number of documents to return. Defaults to 20.
offset number (Optional) Start offset. Defaults to 0.

Example request

curl -H 'Authorization: BEARER <token>' -H 'Accept: application/vnd.gini.v1+json' -X GET -i 'https://api.gini.net/documents?limit=50'

Response

The response is a paginated list of documents.

Headers

Status Code Description
200 (OK) Success

Body

The response entity has the following fields:

Name Type Description
totalCount number Total number of documents
documents array All documents of the current result page

Example

{
  "totalCount": 118,
  "documents": [
    {...},
    {...},
    ...
  ]
}
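Since the list is paginated, retrieving all documents means repeating the request with an increasing offset until totalCount documents have been seen. The generator below sketches that loop. Its name is hypothetical, and the HTTP call is passed in as fetch_page(limit, offset), a callable assumed to perform GET /documents with those query parameters and return the parsed JSON.

```python
from typing import Callable, Dict, Iterator


def iter_documents(fetch_page: Callable[[int, int], Dict],
                   limit: int = 20) -> Iterator[Dict]:
    """Yield every document by walking through all result pages."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        documents = page["documents"]
        yield from documents
        offset += len(documents)
        # Stop on an empty page or once all documents have been seen.
        if not documents or offset >= page["totalCount"]:
            break
```

The check for an empty page guards against an endless loop if documents are deleted while the listing is in progress.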