Content extraction
This guide explains how Papra extracts text from your documents and how to configure the available extraction strategies, including external providers and OCR.
What is content extraction?
When a document is added to Papra, its text content is extracted so it can be searched and used by other features such as auto tagging. Different document formats, and different quality requirements, call for different extraction methods, so Papra supports several extraction strategies.
Choosing strategies
The strategies Papra uses are controlled by the CONTENT_EXTRACTION_STRATEGY environment variable. It accepts a single strategy name or a comma-separated list, in order of preference:

    # Single strategy (default)
    CONTENT_EXTRACTION_STRATEGY=internal

    # Try Mistral OCR first, fall back to the internal strategy
    CONTENT_EXTRACTION_STRATEGY=mistral-ocr,internal

When multiple strategies are listed:
- The first strategy that can handle the document (based on its mime type, see mime type filtering) is used.
- If that strategy fails to process the document, the next eligible strategy is tried.
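The selection and fallback rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not Papra's actual code: the Strategy type, its can_handle and extract fields, and the choice to return None when nothing succeeds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Strategy:
    """Hypothetical stand-in for an extraction strategy (not Papra's real type)."""
    name: str
    can_handle: Callable[[str], bool]  # decides eligibility from the mime type
    extract: Callable[[bytes], str]    # returns the text, or raises on failure


def extract_with_fallback(
    strategies: list[Strategy], mime_type: str, content: bytes
) -> Optional[str]:
    """Try each eligible strategy in order and return the first successful result."""
    for strategy in strategies:
        if not strategy.can_handle(mime_type):
            continue  # skipped: the mime type is not accepted by this strategy
        try:
            return strategy.extract(content)
        except Exception:
            continue  # failed: move on to the next eligible strategy
    # Assumption: no text is produced when every strategy is skipped or fails.
    return None
```

Listing internal last, as in the examples above, means there is always an eligible strategy to fall back to.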
Available strategies
| Strategy | Description | Requires |
|---|---|---|
| internal | Built-in extraction using the lecture library, with Tesseract for OCR. Supports all common formats and is always available. | Nothing |
| mistral-ocr | OCR via the Mistral OCR API. | Mistral API key |
| docling | Extraction via a self-hosted Docling server. | Running Docling server |
| azure-di | Extraction via Azure Document Intelligence. | Azure endpoint + API key |
| custom-http | Extraction via any HTTP service you provide. | Your own HTTP endpoint |
Internal
The default strategy. It uses the bundled lecture library to extract text from common document formats and falls back to Tesseract OCR for images and scanned documents. No configuration is required, and it is always available.

    CONTENT_EXTRACTION_STRATEGY=internal

Mistral OCR
Uses the Mistral OCR API to extract text. It reuses the Mistral adapter credentials from the LLM configuration, so set MISTRAL_API_KEY (and optionally MISTRAL_BASE_URL).

    CONTENT_EXTRACTION_STRATEGY=mistral-ocr,internal
    MISTRAL_API_KEY=your-mistral-api-key

| Variable | Default | Description |
|---|---|---|
| MISTRAL_API_KEY | – | API key for the Mistral API (shared with the Mistral LLM adapter). |
| MISTRAL_OCR_MODEL_NAME | mistral-ocr-latest | The Mistral OCR model to use. |
| MISTRAL_OCR_MIME_TYPES_ALLOW_LIST | application/pdf,image/* | Mime types this strategy handles. See mime type filtering. |
Docling
Uses a self-hosted Docling server to extract text.

    CONTENT_EXTRACTION_STRATEGY=docling,internal
    DOCLING_BASE_URL=http://docling:5001

| Variable | Default | Description |
|---|---|---|
| DOCLING_BASE_URL | http://localhost:5001 | Base URL of the Docling service. |
| DOCLING_API_KEY | – | Optional API key, sent as the X-API-Key header. On the Docling side this maps to DOCLING_SERVE_API_KEY. |
| DOCLING_MIME_TYPES_ALLOW_LIST | * | Mime types this strategy handles. See mime type filtering. |
Azure Document Intelligence
Uses the Azure Document Intelligence service.

    CONTENT_EXTRACTION_STRATEGY=azure-di,internal
    AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com
    AZURE_DI_API_KEY=your-azure-api-key

| Variable | Default | Description |
|---|---|---|
| AZURE_DI_ENDPOINT | – | Endpoint of the Azure Document Intelligence resource. |
| AZURE_DI_API_KEY | – | API key, sent as the Ocp-Apim-Subscription-Key header. |
| AZURE_DI_MIME_TYPES_ALLOW_LIST | * | Mime types this strategy handles. See mime type filtering. |
| AZURE_DI_POLLING_ATTEMPTS | 10 | Number of times to poll for the result of a processing job. |
| AZURE_DI_POLLING_DELAY_MS | 2000 | Delay in milliseconds between polling attempts. |
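To get a feel for how AZURE_DI_POLLING_ATTEMPTS and AZURE_DI_POLLING_DELAY_MS interact, here is a generic polling loop in Python. It is a sketch under assumptions, not Papra's implementation: poll_for_result and fetch_result are hypothetical names, and the exact timing (for example, whether a delay precedes the first poll) may differ in practice.

```python
import time
from typing import Callable, Optional


def poll_for_result(
    fetch_result: Callable[[], Optional[str]],
    attempts: int = 10,     # mirrors the AZURE_DI_POLLING_ATTEMPTS default
    delay_ms: int = 2000,   # mirrors the AZURE_DI_POLLING_DELAY_MS default
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_result until it returns a value or the attempts run out."""
    for attempt in range(attempts):
        result = fetch_result()  # hypothetical: returns None while the job is running
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(delay_ms / 1000)  # wait before the next attempt
    raise TimeoutError(f"no result after {attempts} polling attempts")
```

With the default values, a loop like this gives up after roughly 20 seconds of waiting, not counting request time. If large documents time out, raising either variable extends that window.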
Custom HTTP
Sends the document to an HTTP service you provide and reads the extracted text from the response. This lets you plug in any extraction backend without writing a Papra adapter. It can be used with n8n, Zapier, or any other service that can receive a file and return text.

    CONTENT_EXTRACTION_STRATEGY=custom-http,internal
    CONTENT_EXTRACTION_CUSTOM_HTTP_URL=https://extractor.example.com/extract
    CONTENT_EXTRACTION_CUSTOM_HTTP_HEADERS={"Authorization":"Bearer your-token"}

Papra makes a POST request to the configured URL with the document, then reads the extracted text from the response.
| Variable | Default | Description |
|---|---|---|
| CONTENT_EXTRACTION_CUSTOM_HTTP_URL | – | URL of your HTTP extraction service. |
| CONTENT_EXTRACTION_CUSTOM_HTTP_HEADERS | {} | Extra headers as a JSON object, e.g. {"Authorization":"Bearer <token>"}. The Content-Type header is set automatically based on the upload format. |
| CONTENT_EXTRACTION_CUSTOM_HTTP_UPLOAD_FORMAT | form-data | How the document is sent: form-data or json (see below). |
| CONTENT_EXTRACTION_CUSTOM_HTTP_RESPONSE_FORMAT | json | How the response is read: json or text (see below). |
| CONTENT_EXTRACTION_CUSTOM_HTTP_JSON_RESPONSE_TEXT_PATH | text | For json responses, the dot path to the extracted text, e.g. data.content. |
| CONTENT_EXTRACTION_CUSTOM_HTTP_REQUEST_TIMEOUT_MS | 30000 | Request timeout in milliseconds. |
| CONTENT_EXTRACTION_CUSTOM_HTTP_MIME_TYPES_ALLOW_LIST | * | Mime types this strategy handles. See mime type filtering. |
Upload formats (how Papra sends the document):
- form-data: a multipart/form-data request with the document file in the file field.
- json: a JSON body with the document base64-encoded: { "document": { "filename": "file.pdf", "type": "application/pdf", "content": "<base64-encoded-content>" } }
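If you are writing your own extraction service, the json upload format described above can be produced and decoded as follows. This is a minimal sketch: build_json_upload and decode_json_upload are hypothetical helper names, and it assumes the standard (not URL-safe) base64 alphabet, which the description above does not specify.

```python
import base64
import json


def build_json_upload(filename: str, mime_type: str, content: bytes) -> str:
    """Build a request body shaped like the json upload format described above."""
    return json.dumps({
        "document": {
            "filename": filename,
            "type": mime_type,
            # Assumption: standard base64 alphabet, encoded as ASCII text.
            "content": base64.b64encode(content).decode("ascii"),
        }
    })


def decode_json_upload(body: str) -> tuple[str, str, bytes]:
    """Service side: recover the filename, mime type, and raw bytes from the body."""
    document = json.loads(body)["document"]
    return (
        document["filename"],
        document["type"],
        base64.b64decode(document["content"]),
    )
```

The builder is handy for exercising your endpoint in tests without a running Papra instance.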
Response formats (how Papra reads the extracted text):
- json: the response is parsed as JSON and the text is read from the path in CONTENT_EXTRACTION_CUSTOM_HTTP_JSON_RESPONSE_TEXT_PATH (e.g. text, or data.content).
- text: the entire response body is used as the extracted text.
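The two response formats can be illustrated with a short Python function. It is a sketch of the behavior described above rather than Papra's code: read_extracted_text is a hypothetical name, and the error handling (raising when the path is missing or does not point to a string) is an assumption.

```python
import json


def read_extracted_text(
    body: str, response_format: str = "json", text_path: str = "text"
) -> str:
    """Read the extracted text from a response body.

    response_format and text_path play the roles of
    CONTENT_EXTRACTION_CUSTOM_HTTP_RESPONSE_FORMAT and
    CONTENT_EXTRACTION_CUSTOM_HTTP_JSON_RESPONSE_TEXT_PATH.
    """
    if response_format == "text":
        return body  # the entire body is the extracted text

    value = json.loads(body)
    for key in text_path.split("."):  # walk the dot path one segment at a time
        value = value[key]            # raises KeyError if a segment is missing
    if not isinstance(value, str):
        # Assumption: a non-string value at the path is treated as an error.
        raise ValueError(f"value at {text_path!r} is not a string")
    return value
```

For example, a service that answers with {"data": {"content": "..."}} would be paired with a text path of data.content.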
Mime type filtering
Every strategy except internal accepts a mime types allow list that limits which documents it handles. If a document’s mime type does not match, the strategy is skipped and the next one in the list is tried.
The list is comma-separated and supports wildcards and negations:
- * matches all formats.
- image/* matches all image mime types.
- Prefix an entry with ! to negate it: *,!image/png allows everything except PNG.
- Negations always take precedence over allows, even more specific ones: image/png,!image/* rejects PNG.
For example, to use Azure Document Intelligence only for PDFs and fall back to the internal strategy for everything else:

    CONTENT_EXTRACTION_STRATEGY=azure-di,internal
    AZURE_DI_MIME_TYPES_ALLOW_LIST=application/pdf
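
To check an allow list before deploying it, the wildcard and negation rules described in this section can be approximated in Python. This is a sketch, not Papra's matcher: is_mime_type_allowed is a hypothetical name, matching is done with the standard library's fnmatchcase (case-sensitive, and it also treats ? and [ ] as special characters), and a list containing only negations is assumed to allow nothing.

```python
from fnmatch import fnmatchcase


def is_mime_type_allowed(mime_type: str, allow_list: str) -> bool:
    """Return True if mime_type passes a comma-separated allow list."""
    entries = [entry.strip() for entry in allow_list.split(",") if entry.strip()]
    negated = [entry[1:] for entry in entries if entry.startswith("!")]
    allowed = [entry for entry in entries if not entry.startswith("!")]

    # Negations always win, even over a more specific allow entry.
    if any(fnmatchcase(mime_type, pattern) for pattern in negated):
        return False
    # Assumption: with no allow entries at all, nothing is allowed.
    return any(fnmatchcase(mime_type, pattern) for pattern in allowed)
```

Running the examples from this section through the function, *,!image/png rejects image/png but accepts image/jpeg, and image/png,!image/* rejects image/png.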