Document Deduplication

Papra automatically detects and prevents duplicate documents per organization using content hashing. This ensures that if the same file is uploaded multiple times, only one copy is stored, saving storage space and reducing clutter.

When a document is added to an organization (upload, email ingestion, folder sync, …), the server computes a SHA-256 hash of the file content and checks if a document with the same hash already exists in that organization.

  • If no document with the same hash exists in the organization, the new document is added as usual.
  • If a document with the same content already exists, the upload is rejected.
  • If a document with the same content was previously deleted and is in the trash, it is restored instead of creating a new copy, and its metadata is updated to match the newly added document.

Papra uses SHA-256 for content hashing:

  • The hash is computed during the streaming upload, so no extra I/O is needed.
  • It is stored in the database as a 64-character hexadecimal string.
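
The behavior described above can be sketched in a few lines of Python. This is only an illustration, not Papra's actual implementation: the `Document` record, the `hash_stream` and `add_document` helpers, and the in-memory dictionary standing in for one organization's documents are all hypothetical names chosen for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Document:
    # Hypothetical record shape, for illustration only.
    name: str
    sha256: str
    in_trash: bool = False


def hash_stream(chunks: Iterable[bytes]) -> str:
    """Hash content chunk by chunk, as it would arrive during an upload."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()  # 64 hexadecimal characters


def add_document(org_docs: dict[str, Document], name: str, chunks: Iterable[bytes]) -> str:
    """Apply the three rules to one organization's documents, keyed by hash."""
    content_hash = hash_stream(chunks)
    existing = org_docs.get(content_hash)
    if existing is None:
        # No document with this hash yet: add it as usual.
        org_docs[content_hash] = Document(name, content_hash)
        return "created"
    if existing.in_trash:
        # Same content sits in the trash: restore it and refresh its metadata.
        existing.in_trash = False
        existing.name = name
        return "restored"
    # Same content is already present: reject the upload.
    return "rejected"


docs: dict[str, Document] = {}
print(add_document(docs, "invoice.pdf", [b"hello ", b"world"]))  # created
print(add_document(docs, "copy.pdf", [b"hello world"]))          # rejected
```

Because the hash is updated as each chunk arrives, the content never has to be read a second time, which is the point of computing it during the streaming upload.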

The database enforces uniqueness with a composite index:

UNIQUE (organization_id, original_sha256_hash)

This guarantees no two active documents in the same organization can have identical content.
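
To see how such a composite constraint behaves, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The column names come from the constraint above; the `documents` table name, the rest of the schema, and the `try_insert` helper are assumptions made for the example and may not match Papra's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        organization_id TEXT NOT NULL,
        original_sha256_hash TEXT NOT NULL,
        UNIQUE (organization_id, original_sha256_hash)
    )
    """
)


def try_insert(organization_id: str, content_hash: str) -> bool:
    """Return True if the row was inserted, False if the constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO documents (organization_id, original_sha256_hash) VALUES (?, ?)",
                (organization_id, content_hash),
            )
        return True
    except sqlite3.IntegrityError:
        return False


content_hash = "ab" * 32  # placeholder 64-character hex string

print(try_insert("org_a", content_hash))  # True
print(try_insert("org_a", content_hash))  # False
print(try_insert("org_b", content_hash))  # True
```

The third insert succeeds because uniqueness is scoped to the pair of columns: the same content may exist once in each organization.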

Only the file content is hashed and used for deduplication; filenames, upload dates, and metadata have no effect. Two files are considered duplicates if and only if their content is strictly identical.
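
As a quick local illustration of content-only comparison, the sketch below hashes three temporary files with Python's `hashlib`. The file names and contents are made up for the example, and `file_sha256` is a hypothetical helper that reads the whole file into memory for brevity.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp, "invoice.pdf")
    renamed = Path(tmp, "invoice-copy.pdf")
    edited = Path(tmp, "invoice-edited.pdf")

    original.write_bytes(b"example content")
    renamed.write_bytes(b"example content")   # same bytes, different name
    edited.write_bytes(b"example content.")   # one extra byte

    same_after_rename = file_sha256(original) == file_sha256(renamed)
    same_after_edit = file_sha256(original) == file_sha256(edited)

print(same_after_rename)  # True
print(same_after_edit)    # False
```

A renamed copy produces the same hash and would count as a duplicate, while a single changed byte produces a different hash and would be stored as a separate document.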