Document Validation & Extraction

Automatically verify document eligibility based on type, expiration, and recency, then extract the key information required for decision-making.

The Document Validation & Extraction API allows you to extract structured data from various document types.

Beyond extraction, the API can validate the extracted information against expected values and check document properties:

  • Field Validation: Verify that extracted fields (such as document numbers, names, addresses) match expected values using either exact or fuzzy matching

  • Expiry Validation: For identity documents, validate that documents are not expired and have sufficient time remaining until expiration

  • Recency Validation: For proof of address and tax documents, ensure documents are recent enough to be considered valid (within a specified number of months)

The API supports four document types: identity documents, organisation documents, proof-of-address documents, and tax documents.

Each document type supports different validation options:

  • Identity Documents:

    • Field validation (document number, surname, given name)

    • Expiry validation

  • Organisation Documents:

    • Field validation (organisation name, registration number, registered office address)

  • Proof of Address:

    • Field validation (recipient name, recipient address)

    • Recency validation

  • Tax Documents:

    • Field validation (tax number, taxpayer name, taxpayer address)

    • Recency validation

All validation results are returned in the response, allowing you to programmatically verify document authenticity and data accuracy.
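The support matrix above can be captured as data and used to catch an unsupported validation before a request is sent. This is a minimal client-side sketch, not part of the API: the helper name unsupported_validations is hypothetical, and the type identifiers are the document_type values used in the request examples in this document.

```python
# Validation options per document type, transcribed from the list above.
SUPPORTED_VALIDATIONS = {
    "identity": {
        "fields": {"document_number", "surname", "given_name"},
        "expiry": True,
        "recency": False,
    },
    "organisation_doc": {
        "fields": {"organisation_name", "registration_number",
                   "registered_office_address"},
        "expiry": False,
        "recency": False,
    },
    "proof_of_address": {
        "fields": {"recipient_name", "recipient_address"},
        "expiry": False,
        "recency": True,
    },
    "tax_doc": {
        "fields": {"tax_number", "taxpayer_name", "taxpayer_address"},
        "expiry": False,
        "recency": True,
    },
}


def unsupported_validations(document_type: str, validations: dict) -> list:
    """Return a list of problems; an empty list means nothing looks wrong."""
    rules = SUPPORTED_VALIDATIONS.get(document_type)
    if rules is None:
        return ["unknown document_type: " + repr(document_type)]
    problems = []
    for entry in validations.get("fields", []):
        if entry.get("field") not in rules["fields"]:
            problems.append(
                "field {!r} is not valid for {}".format(entry.get("field"), document_type)
            )
    for key in ("expiry", "recency"):
        if key in validations and not rules[key]:
            problems.append("{} validation is not supported for {}".format(key, document_type))
    return problems
```

An empty list means the validations object only uses options listed for that document type; it does not guarantee the API will accept the request.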

Endpoint

POST /document/extract

Authentication

All requests require API key authentication via the Authorization header:

Authorization: YOUR_API_KEY
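As an illustration, the endpoint and header can be combined into a prepared request using only the Python standard library. This is a sketch under assumptions: the documentation does not state the API host, so BASE_URL is a placeholder, and sending the body as JSON with a Content-Type header is assumed from the JSON examples rather than stated. The function name build_extract_request is hypothetical.

```python
import json
import urllib.request

# Assumption: placeholder host; replace with the real API base URL.
BASE_URL = "https://api.example.com"


def build_extract_request(api_key: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST /document/extract request."""
    return urllib.request.Request(
        url=BASE_URL + "/document/extract",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The raw key is sent as the header value, as in the example above.
            "Authorization": api_key,
            # Assumption: the API expects a JSON request body.
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen to perform the call.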

Document Sources

You can provide documents in two ways:

1. Base64 Encoding (source: "base64")

Encode your document as a base64 string and include it in the request payload.

Fields:

  • source (required): Must be "base64"

  • payload (required): Base64-encoded document content

Example:

{
  "document_type": "identity",
  "source": "base64",
  "payload": "JVBERi0xLjQKJeLjz9MKMy..."
}
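A request body like the one above could be produced from raw file bytes as follows. This sketch assumes the payload is plain standard base64 with no data-URI prefix, which is what the sample value suggests; build_base64_body is a hypothetical helper name.

```python
import base64


def build_base64_body(document_type: str, content: bytes) -> dict:
    """Build a request body that carries the document inline as base64."""
    return {
        "document_type": document_type,
        "source": "base64",
        # b64encode returns bytes; decode to str so the body is JSON-serialisable.
        "payload": base64.b64encode(content).decode("ascii"),
    }
```

In practice the bytes would typically come from reading a local file in binary mode, for example with pathlib.Path(...).read_bytes().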

2. Public URL (source: "url")

Provide a publicly accessible URL to your document. The API will fetch the document from this URL.

Fields:

  • source (required): Must be "url"

  • url (required): Publicly accessible URL to the document

Example:

{
  "document_type": "identity",
  "source": "url",
  "url": "https://example.com/document.pdf"
}

Important Notes:

  • The URL must be publicly accessible (no authentication required)

  • The URL must return the document directly (not a redirect to a login page)

  • The document must be accessible via HTTP or HTTPS
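Some of these requirements can be checked locally before the URL is submitted. The sketch below only inspects the URL string; it cannot tell whether the server actually serves the document without a login. The function name url_precheck and the embedded-credentials heuristic are assumptions for illustration.

```python
from urllib.parse import urlsplit


def url_precheck(url: str) -> list:
    """Return reasons the URL is unlikely to be fetchable; empty if none found."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        problems.append("scheme must be http or https")
    if not parts.hostname:
        problems.append("URL has no host")
    # Heuristic: credentials in the URL suggest the resource is not public.
    if parts.username or parts.password:
        problems.append("URL embeds credentials, which suggests it is not public")
    return problems
```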

Supported File Types

The API supports the following file formats:

  • PDF (application/pdf)

  • JPEG (image/jpeg)

  • PNG (image/png)

File Size Limits:

  • Maximum file size: 50MB

If you provide an unsupported file type, the API returns a document_content_type_not_supported error; if the file exceeds the size limit, it returns a document_too_large error (see Error Handling).
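A client can avoid a wasted round trip by checking the format and size locally first. The sketch below identifies the three supported formats by their well-known leading bytes. It rests on assumptions: "50MB" is read as 50 MiB and applied to the raw (not base64-encoded) bytes, and the API may detect formats differently. The helper names are hypothetical.

```python
# Assumption: "50MB" is interpreted as 50 * 1024 * 1024 bytes of raw content.
MAX_BYTES = 50 * 1024 * 1024

# Standard file signatures ("magic numbers") for the supported formats.
SIGNATURES = {
    "application/pdf": b"%PDF-",
    "image/jpeg": b"\xff\xd8\xff",
    "image/png": b"\x89PNG\r\n\x1a\n",
}


def sniff_content_type(content: bytes):
    """Return the MIME type implied by the leading bytes, or None if unknown."""
    for mime, magic in SIGNATURES.items():
        if content.startswith(magic):
            return mime
    return None


def precheck_document(content: bytes):
    """Return the error code the API would probably report, or None if OK."""
    if len(content) > MAX_BYTES:
        return "document_too_large"
    if sniff_content_type(content) is None:
        return "document_content_type_not_supported"
    return None
```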

Document Types

1. Identity Documents

Identity documents include passports, national ID cards, driver's licenses, and other government-issued identification documents.

Request Format

{
  "document_type": "identity",
  "source": "base64",
  "payload": "...",
  "validations": {
    "fields": [
      {
        "field": "document_number",
        "expected_value": "ABC123456",
        "match_type": "exact"
      },
      {
        "field": "surname",
        "expected_value": "Smith",
        "match_type": "fuzzy"
      },
      {
        "field": "given_name",
        "expected_value": "John",
        "match_type": "exact"
      }
    ],
    "expiry": {
      "minimum_months_until_expiry": 6,
      "validate_document_not_expired": true
    }
  }
}

Request Fields

Required:

  • document_type: Must be "identity"

  • source: Either "base64" or "url"

  • payload (if source is "base64"): Base64-encoded document

  • url (if source is "url"): Public URL to the document

Optional:

  • validations: Validation rules to apply to the extracted data

Validation Options

Field Validations (fields):

Validate specific fields extracted from the document against expected values.

  • field (required): The field to validate. Allowed values:

    • "document_number": The document number/ID

    • "surname": The person's surname/last name

    • "given_name": The person's given name/first name

  • expected_value (required): The expected value for the field

  • match_type (required): How to match the extracted value against the expected value:

    • "exact": Exact string match (case-sensitive)

    • "fuzzy": Fuzzy matching (handles minor variations, spacing, etc.)

Expiry Validations (expiry):

Validate the document's expiration date.

  • minimum_months_until_expiry (optional): Minimum number of months until the document expires. Must be at least 1.

  • validate_document_not_expired (optional): Boolean flag to check if the document has expired. Set to true to validate that the document is not expired.
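The constraints above (allowed field names, allowed match types, and the minimum of 1 month) can be enforced while the validations object is assembled. This is a minimal sketch; the helper names field_check and expiry_rules are hypothetical and the API performs its own validation regardless.

```python
MATCH_TYPES = ("exact", "fuzzy")
IDENTITY_FIELDS = ("document_number", "surname", "given_name")


def field_check(field: str, expected_value: str, match_type: str) -> dict:
    """Build one entry of validations.fields for an identity document."""
    if field not in IDENTITY_FIELDS:
        raise ValueError("unsupported identity field: " + repr(field))
    if match_type not in MATCH_TYPES:
        raise ValueError("match_type must be 'exact' or 'fuzzy'")
    return {"field": field, "expected_value": expected_value, "match_type": match_type}


def expiry_rules(minimum_months_until_expiry=None, validate_document_not_expired=None) -> dict:
    """Build validations.expiry, omitting options that were not requested."""
    rules = {}
    if minimum_months_until_expiry is not None:
        if minimum_months_until_expiry < 1:
            raise ValueError("minimum_months_until_expiry must be at least 1")
        rules["minimum_months_until_expiry"] = minimum_months_until_expiry
    if validate_document_not_expired is not None:
        rules["validate_document_not_expired"] = bool(validate_document_not_expired)
    return rules


# Usage: the validations object from the request example above.
validations = {
    "fields": [
        field_check("document_number", "ABC123456", "exact"),
        field_check("surname", "Smith", "fuzzy"),
        field_check("given_name", "John", "exact"),
    ],
    "expiry": expiry_rules(6, True),
}
```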

Response Format

{
  "id": "dex_abc123...",
  "object": "document_extraction",
  "livemode": false,
  "document_type": "identity",
  "issuing_country": "US",
  "document_number": "ABC123456",
  "document_date_of_issue": "2020-01-15",
  "document_expiry_date": "2030-01-15",
  "document_issuing_authority": "Department of State",
  "surname": "Smith",
  "given_name": "John",
  "date_of_birth": "1990-05-20",
  "place_of_birth": "New York, USA",
  "validation_result": {
    "number_of_checks_passed": 3,
    "number_of_checks_failed": 0,
    "number_of_checks_skipped": 0,
    "checks": [
      {
        "check_type": "field_validation",
        "field": "document_number",
        "expected_value": "ABC123456",
        "extracted_value": "ABC123456",
        "match_type": "exact",
        "status": "passed"
      },
      {
        "check_type": "expiry_validation",
        "check_name": "validate_document_not_expired",
        "document_expiry_date": "2030-01-15",
        "status": "passed"
      },
      {
        "check_type": "expiry_validation",
        "check_name": "minimum_months_until_expiry",
        "document_expiry_date": "2030-01-15",
        "months_until_expiry": 120,
        "status": "passed"
      }
    ]
  }
}

Response Fields

  • id: Unique identifier for the extraction (prefixed with dex_)

  • object: Always "document_extraction"

  • livemode: true if using live API key, false if using sandbox API key

  • document_type: Always "identity"

  • issuing_country: ISO 3166-1 alpha-2 country code (2 letters)

  • document_number: The document number or ID

  • document_date_of_issue: Date when the document was issued (YYYY-MM-DD format)

  • document_expiry_date: Date when the document expires (YYYY-MM-DD format)

  • document_issuing_authority: The authority that issued the document

  • surname: The person's surname/last name

  • given_name: The person's given name/first name

  • date_of_birth: The person's date of birth (YYYY-MM-DD format)

  • place_of_birth: The person's place of birth

  • validation_result: Results of any validations performed (only present if validations were requested)
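The date fields use the YYYY-MM-DD format, so they map directly onto Python date objects. The sketch below reduces an identity response to a few typed values. It assumes that fields the API could not extract may be missing or null, which this documentation does not state; IdentitySummary and summarise_identity are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


def parse_iso_date(value: Optional[str]) -> Optional[date]:
    """Convert a YYYY-MM-DD string to a date; pass missing values through."""
    return date.fromisoformat(value) if value else None


@dataclass
class IdentitySummary:
    extraction_id: str
    full_name: str
    date_of_birth: Optional[date]
    expiry_date: Optional[date]
    validated: bool  # True when the response carries a validation_result


def summarise_identity(response: dict) -> IdentitySummary:
    """Pick out and type the fields a caller most often needs."""
    names = [response.get("given_name"), response.get("surname")]
    return IdentitySummary(
        extraction_id=response["id"],
        full_name=" ".join(n for n in names if n),
        date_of_birth=parse_iso_date(response.get("date_of_birth")),
        expiry_date=parse_iso_date(response.get("document_expiry_date")),
        # validation_result is only present if validations were requested.
        validated="validation_result" in response,
    )
```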

2. Organisation Documents

Organisation documents include certificates of incorporation, business registration documents, and other official documents that establish a legal entity.

Request Format

{
  "document_type": "organisation_doc",
  "source": "url",
  "url": "https://example.com/certificate.pdf",
  "validations": {
    "fields": [
      {
        "field": "organisation_name",
        "expected_value": "Acme Corporation",
        "match_type": "fuzzy"
      },
      {
        "field": "registration_number",
        "expected_value": "12345678",
        "match_type": "exact"
      },
      {
        "field": "registered_office_address",
        "expected_value": "123 Main St, New York, NY 10001",
        "match_type": "fuzzy"
      }
    ]
  }
}

Request Fields

Required:

  • document_type: Must be "organisation_doc"

  • source: Either "base64" or "url"

  • payload (if source is "base64"): Base64-encoded document

  • url (if source is "url"): Public URL to the document

Optional:

  • validations: Validation rules to apply to the extracted data

Validation Options

Field Validations (fields):

  • field (required): The field to validate. Allowed values:

    • "organisation_name": The legal name of the organisation

    • "registration_number": The registration or incorporation number

    • "registered_office_address": The registered office address

  • expected_value (required): The expected value for the field

  • match_type (required): Either "exact" or "fuzzy"

Note: Organisation documents do not support expiry validations.

Response Format

{
  "id": "dex_abc123...",
  "object": "document_extraction",
  "livemode": false,
  "document_type": "organisation_doc",
  "organisation_name": "Acme Corporation",
  "registration_number": "12345678",
  "issuing_authority": "Secretary of State",
  "jurisdiction_of_incorporation": "US",
  "incorporation_date": "2010-03-15",
  "registered_office_address": "123 Main St, New York, NY 10001",
  "organisation_status": "Active",
  "organisation_type": "Corporation",
  "significant_people": [
    {
      "name": "John Smith",
      "role": "CEO",
      "address": "456 Oak Ave, New York, NY 10002"
    },
    {
      "name": "Jane Doe",
      "role": "CFO",
      "address": null
    }
  ],
  "validation_result": {
    "number_of_checks_passed": 1,
    "number_of_checks_failed": 0,
    "number_of_checks_skipped": 0,
    "checks": [
      {
        "check_type": "field_validation",
        "field": "organisation_name",
        "expected_value": "Acme Corporation",
        "extracted_value": "Acme Corporation",
        "match_type": "fuzzy",
        "status": "passed"
      }
    ]
  }
}

Response Fields

  • id: Unique identifier for the extraction (prefixed with dex_)

  • object: Always "document_extraction"

  • livemode: true if using live API key, false if using sandbox API key

  • document_type: Always "organisation_doc"

  • organisation_name: The legal name of the organisation

  • registration_number: The registration or incorporation number

  • issuing_authority: The authority that issued the document

  • jurisdiction_of_incorporation: ISO 3166-1 alpha-2 country code (2 letters)

  • incorporation_date: Date of incorporation (YYYY-MM-DD format)

  • registered_office_address: The registered office address

  • organisation_status: The current status of the organisation (e.g., "Active", "Inactive")

  • organisation_type: The type of organisation (e.g., "Corporation", "LLC", "Partnership")

  • significant_people: Array of significant people associated with the organisation (directors, officers, etc.)

    • name: Full name of the person

    • role: Their role or position (optional)

    • address: Their address (optional)

  • validation_result: Results of any validations performed (only present if validations were requested)
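Because role and address are optional, code that reads significant_people should tolerate missing or null values, as in the second entry of the response example. A minimal sketch, with a hypothetical helper name and placeholder wording for absent values:

```python
def format_significant_people(response: dict) -> list:
    """Return one readable line per person, tolerating absent role/address."""
    lines = []
    for person in response.get("significant_people") or []:
        role = person.get("role") or "role not stated"
        address = person.get("address") or "address not stated"
        lines.append("{} ({}), {}".format(person["name"], role, address))
    return lines
```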

3. Proof of Address Documents

Proof of address documents include utility bills, bank statements, government correspondence, and other documents that demonstrate a person's residential address.

Request Format

{
  "document_type": "proof_of_address",
  "source": "base64",
  "payload": "...",
  "validations": {
    "fields": [
      {
        "field": "recipient_name",
        "expected_value": "John Smith",
        "match_type": "fuzzy"
      },
      {
        "field": "recipient_address",
        "expected_value": "123 Main St, New York, NY 10001",
        "match_type": "fuzzy"
      }
    ],
    "recency": {
      "maximum_document_age_months": 3
    }
  }
}

Request Fields

Required:

  • document_type: Must be "proof_of_address"

  • source: Either "base64" or "url"

  • payload (if source is "base64"): Base64-encoded document

  • url (if source is "url"): Public URL to the document

Optional:

  • validations: Validation rules to apply to the extracted data

Validation Options

Field Validations (fields):

  • field (required): The field to validate. Allowed values:

    • "recipient_name": The name of the person receiving the document

    • "recipient_address": The address shown on the document

  • expected_value (required): The expected value for the field

  • match_type (required): Either "exact" or "fuzzy"

Recency Validation (recency):

Validate that the document is recent enough to be considered valid proof of address.

  • maximum_document_age_months (optional): Maximum age of the document in months. Must be at least 1. The document date must be within this number of months from the current date.
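To estimate locally whether a document would pass, the same rule can be approximated with calendar-month arithmetic. This is only a sketch of one plausible interpretation: the documentation does not specify how the API counts months, how it treats the boundary day, or which time zone defines the current date. Both function names are hypothetical.

```python
import calendar
from datetime import date


def months_before(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    The day of month is clamped, so one month before 31 March is the
    last day of February.
    """
    index = day.year * 12 + (day.month - 1) - months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def is_recent_enough(document_date: date, maximum_document_age_months: int, today: date) -> bool:
    """Assumed rule: the document date is on or after the cutoff date."""
    if maximum_document_age_months < 1:
        raise ValueError("maximum_document_age_months must be at least 1")
    return document_date >= months_before(today, maximum_document_age_months)
```

Passing today explicitly keeps the check deterministic and easy to test; in production it would usually be date.today().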

Response Format

{
  "id": "dex_abc123...",
  "object": "document_extraction",
  "livemode": false,
  "document_type": "proof_of_address",
  "recipient_name": "John Smith",
  "recipient_address": "123 Main St, New York, NY 10001",
  "issuer_name": "Electric Company",
  "issuer_address": "456 Utility Blvd, New York, NY 10002",
  "document_date": "2024-01-15",
  "validation_result": {
    "number_of_checks_passed": 2,
    "number_of_checks_failed": 0,
    "number_of_checks_skipped": 0,
    "checks": [
      {
        "check_type": "field_validation",
        "field": "recipient_name",
        "expected_value": "John Smith",
        "extracted_value": "John Smith",
        "match_type": "fuzzy",
        "status": "passed"
      },
      {
        "check_type": "recency_validation",
        "document_date_of_issue": "2024-01-15",
        "status": "passed"
      }
    ]
  }
}

Response Fields

  • id: Unique identifier for the extraction (prefixed with dex_)

  • object: Always "document_extraction"

  • livemode: true if using live API key, false if using sandbox API key

  • document_type: Always "proof_of_address"

  • recipient_name: The name of the person receiving the document

  • recipient_address: The address shown on the document

  • issuer_name: The name of the entity that issued the document (e.g., utility company, bank)

  • issuer_address: The address of the issuer

  • document_date: The date on the document (YYYY-MM-DD format)

  • validation_result: Results of any validations performed (only present if validations were requested)

4. Tax Documents

Tax documents include tax certificates, tax identification documents, and other official tax-related documents.

Request Format

{
  "document_type": "tax_doc",
  "source": "url",
  "url": "https://example.com/tax-certificate.pdf",
  "validations": {
    "fields": [
      {
        "field": "tax_number",
        "expected_value": "12-3456789",
        "match_type": "exact"
      },
      {
        "field": "taxpayer_name",
        "expected_value": "Acme Corporation",
        "match_type": "fuzzy"
      },
      {
        "field": "taxpayer_address",
        "expected_value": "123 Main St, New York, NY 10001",
        "match_type": "fuzzy"
      }
    ],
    "recency": {
      "maximum_document_age_months": 12
    }
  }
}

Request Fields

Required:

  • document_type: Must be "tax_doc"

  • source: Either "base64" or "url"

  • payload (if source is "base64"): Base64-encoded document

  • url (if source is "url"): Public URL to the document

Optional:

  • validations: Validation rules to apply to the extracted data

Validation Options

Field Validations (fields):

  • field (required): The field to validate. Allowed values:

    • "tax_number": The tax identification number

    • "taxpayer_name": The name of the taxpayer (individual or organisation)

    • "taxpayer_address": The address of the taxpayer

  • expected_value (required): The expected value for the field

  • match_type (required): Either "exact" or "fuzzy"

Recency Validation (recency):

Validate that the document is recent enough to be considered valid.

  • maximum_document_age_months (optional): Maximum age of the document in months. Must be at least 1. The document date must be within this number of months from the current date.

Response Format

{
  "id": "dex_abc123...",
  "object": "document_extraction",
  "livemode": false,
  "document_type": "tax_doc",
  "issuing_authority": "Internal Revenue Service",
  "issuing_authority_country": "US",
  "tax_number": "12-3456789",
  "taxpayer_name": "Acme Corporation",
  "taxpayer_address": "123 Main St, New York, NY 10001",
  "date_of_issue": "2024-01-10",
  "reference_number": "REF-123456",
  "validation_result": {
    "number_of_checks_passed": 2,
    "number_of_checks_failed": 0,
    "number_of_checks_skipped": 0,
    "checks": [
      {
        "check_type": "field_validation",
        "field": "tax_number",
        "expected_value": "12-3456789",
        "extracted_value": "12-3456789",
        "match_type": "exact",
        "status": "passed"
      },
      {
        "check_type": "recency_validation",
        "document_date_of_issue": "2024-01-10",
        "status": "passed"
      }
    ]
  }
}

Response Fields

  • id: Unique identifier for the extraction (prefixed with dex_)

  • object: Always "document_extraction"

  • livemode: true if using live API key, false if using sandbox API key

  • document_type: Always "tax_doc"

  • issuing_authority: The authority that issued the tax document

  • issuing_authority_country: ISO 3166-1 alpha-2 country code (2 letters)

  • tax_number: The tax identification number

  • taxpayer_name: The name of the taxpayer (individual or organisation)

  • taxpayer_address: The address of the taxpayer

  • date_of_issue: Date when the document was issued (YYYY-MM-DD format)

  • reference_number: Reference number or certificate number on the document

  • validation_result: Results of any validations performed (only present if validations were requested)

Validation Results

When you include validations in your request, the response will include a validation_result object that shows the results of all validation checks.

Validation Result Structure

{
  "validation_result": {
    "number_of_checks_passed": 2,
    "number_of_checks_failed": 1,
    "number_of_checks_skipped": 0,
    "checks": [
      {
        "check_type": "field_validation",
        "field": "document_number",
        "expected_value": "ABC123",
        "extracted_value": "ABC123",
        "match_type": "exact",
        "status": "passed"
      },
      {
        "check_type": "expiry_validation",
        "check_name": "validate_document_not_expired",
        "document_expiry_date": "2020-01-15",
        "status": "failed"
      },
      {
        "check_type": "recency_validation",
        "document_date_of_issue": "2024-01-15",
        "status": "passed"
      }
    ]
  }
}

Validation Check Types

  1. Field Validation (check_type: "field_validation")

    • Validates that extracted field values match expected values

    • Fields: field, expected_value, extracted_value, match_type, status

  2. Expiry Validation (check_type: "expiry_validation")

    • Validates document expiration

    • Two subtypes:

      • minimum_months_until_expiry: Checks that document has at least N months until expiry

      • validate_document_not_expired: Checks that document is not expired

    • Fields: check_name, document_expiry_date, status, and optionally months_until_expiry

  3. Recency Validation (check_type: "recency_validation")

    • Validates that document is recent enough

    • Fields: document_date_of_issue, status
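Since each check type carries different fields, code that reports on checks usually branches on check_type. The sketch below turns a check into a one-line description. It reads fields with .get because a skipped check may plausibly omit some of them, which is an assumption; describe_check is a hypothetical name.

```python
def describe_check(check: dict) -> str:
    """Render one entry of validation_result.checks as a readable line."""
    kind = check.get("check_type")
    prefix = "[{}] ".format(check.get("status"))
    if kind == "field_validation":
        return prefix + "field {}: expected {!r}, extracted {!r} ({} match)".format(
            check.get("field"),
            check.get("expected_value"),
            check.get("extracted_value"),
            check.get("match_type"),
        )
    if kind == "expiry_validation":
        text = prefix + "expiry {}: expires {}".format(
            check.get("check_name"), check.get("document_expiry_date")
        )
        # months_until_expiry is optional, so only mention it when present.
        if "months_until_expiry" in check:
            text += ", {} months left".format(check["months_until_expiry"])
        return text
    if kind == "recency_validation":
        return prefix + "recency: issued {}".format(check.get("document_date_of_issue"))
    return prefix + "unrecognised check_type {!r}".format(kind)
```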

Validation Status

Each check has a status field with one of three values:

  • "passed": The validation check passed

  • "failed": The validation check failed

  • "skipped": The validation check was skipped (e.g., if the required field was not found in the document)

Summary Counts

The validation_result includes summary counts:

  • number_of_checks_passed: Number of checks that passed

  • number_of_checks_failed: Number of checks that failed

  • number_of_checks_skipped: Number of checks that were skipped
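The summary counts are enough to drive a simple decision. The policy sketched here (reject on any failure, send skipped checks to manual review) is an illustrative assumption, not something the API prescribes, and the function names are hypothetical.

```python
def decide(validation_result: dict) -> str:
    """Map the summary counts to 'reject', 'review', or 'accept'."""
    if validation_result.get("number_of_checks_failed", 0) > 0:
        return "reject"
    # A skipped check means the data needed for it was not found,
    # so a human may need to look at the document.
    if validation_result.get("number_of_checks_skipped", 0) > 0:
        return "review"
    return "accept"


def checks_with_status(validation_result: dict, status: str) -> list:
    """Return the individual checks that ended with the given status."""
    return [c for c in validation_result.get("checks", []) if c.get("status") == status]
```

For the structure example earlier in this section, decide returns "reject" and checks_with_status(..., "failed") returns the single expiry check.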

Error Handling

Common Errors

Document Not of Expected Type

Error Code: document_not_of_type

HTTP Status: 400 Bad Request

Message: "The document you provided is not of the '{type}' type"

Occurs when the document you provide doesn't match the document_type you specified in the request, or when it is a sample document.

Invalid Base64 Payload

Error Code: document_base64_payload_invalid

HTTP Status: 400 Bad Request

Message: "The base64 payload of the document you provided is not valid"

Occurs when the base64-encoded payload is malformed or invalid.

Document URL Unreachable

Error Code: document_url_unreachable

HTTP Status: 400 Bad Request

Message: "The url '{url}' of the document you provided is unreachable"

Occurs when:

  • The URL is not publicly accessible

  • The URL returns an error (404, 403, etc.)

  • The URL requires authentication

  • The URL times out

Unsupported Content Type

Error Code: document_content_type_not_supported

HTTP Status: 400 Bad Request

Message: "The document you provided is not of a supported type. We only support PDF, JPEG, and PNG documents."

Occurs when the document is not a PDF, JPEG, or PNG file.

Document Too Large

Error Code: document_too_large

HTTP Status: 400 Bad Request

Message: "The document you provided is too large. The maximum allowed size is 50MB."

Occurs when the document exceeds the 50MB size limit.

Error Response Format

All errors follow this format:

{
  "error": {
    "code": "document_not_of_type",
    "message": "The document you provided is not of the 'identity' type",
    "livemode": false
  }
}

For field-specific errors, the response includes a field property:

{
  "error": {
    "code": "document_base64_payload_invalid",
    "field": "payload",
    "message": "The base64 payload of the document you provided is not valid",
    "livemode": false
  }
}
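On the client side, this envelope can be converted into an exception so callers can branch on the error code. A minimal sketch: the exception class, the helper name raise_for_error, and the suggested actions are all hypothetical, and only the five error codes themselves come from this document.

```python
from typing import Optional


class DocumentExtractionError(Exception):
    """Client-side wrapper around the API's error envelope."""

    def __init__(self, code: str, message: str, field: Optional[str] = None):
        super().__init__("{}: {}".format(code, message))
        self.code = code
        self.message = message
        self.field = field  # only set for field-specific errors


# Hypothetical handling hints, keyed by the documented error codes.
SUGGESTED_ACTION = {
    "document_not_of_type": "ask for a document of the requested type",
    "document_base64_payload_invalid": "re-encode the file and retry",
    "document_url_unreachable": "check that the URL is public, then retry",
    "document_content_type_not_supported": "convert the file to PDF, JPEG, or PNG",
    "document_too_large": "reduce the file size below 50MB",
}


def raise_for_error(body: dict) -> dict:
    """Return the body unchanged, or raise if it is an error envelope."""
    error = body.get("error")
    if isinstance(error, dict):
        raise DocumentExtractionError(
            error.get("code", "unknown"),
            error.get("message", ""),
            error.get("field"),
        )
    return body
```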

Best Practices

  1. Choose the Right Match Type:

    • Use "exact" for critical fields like document numbers, IDs, or registration numbers

    • Use "fuzzy" for names and addresses where minor variations are acceptable

  2. Handle Errors Gracefully: Always check the error response and handle different error codes appropriately in your application.

  3. Validate URLs Before Sending: If using URL-based document submission, ensure the URL is publicly accessible and returns the document directly.

  4. Check Validation Results: Always review the validation_result in the response to understand which checks passed or failed.
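The match-type guidance in the first practice can be encoded once and reused across document types. The sketch below treats the three identifier-style fields as exact and everything else as fuzzy; that split is an interpretation of the advice above, and build_field_validations is a hypothetical helper.

```python
# Identifier-like fields where a single wrong character matters.
EXACT_MATCH_FIELDS = {"document_number", "registration_number", "tax_number"}


def build_field_validations(expected: dict) -> list:
    """Turn {field: expected_value} into a validations.fields list."""
    return [
        {
            "field": field,
            "expected_value": value,
            "match_type": "exact" if field in EXACT_MATCH_FIELDS else "fuzzy",
        }
        for field, value in expected.items()
    ]
```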

Document extraction typically takes a few seconds to process, depending on:

  • Document size and complexity

  • Number of pages (for PDFs)

  • Current API load

The API has a 60-second timeout. If processing takes longer, you'll receive a timeout error.
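Given the 60-second limit, it is reasonable to set the client timeout slightly higher so the API's own timeout response arrives before the client gives up. The sketch below uses the standard library and returns None on a client-side timeout. The 65-second value and the function name send are assumptions, and the shape of the API's timeout error is not documented here, so it is not handled specially.

```python
import socket
import urllib.error
import urllib.request

# Assumption: a small margin above the documented 60-second API timeout.
CLIENT_TIMEOUT_SECONDS = 65.0


def send(request, opener=urllib.request.urlopen, timeout=CLIENT_TIMEOUT_SECONDS):
    """Send a prepared request; return the response bytes, or None on timeout.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(request, timeout=timeout) as response:
            return response.read()
    except socket.timeout:
        # Raised directly when reading the response takes too long.
        return None
    except urllib.error.URLError as exc:
        # Connection-phase timeouts are wrapped in URLError by urllib.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise
```

A None result can then be retried or surfaced to the user as a slow-processing condition, depending on the application.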