ISCC-UNIT Data-Code
| IEP: | 0008 |
|---|---|
| Title: | ISCC-UNIT Data-Code |
| Author: | Titusz Pan tp@iscc.foundation |
| Comments: | https://github.com/iscc/iscc-ieps/issues/13 |
| Status: | DRAFT |
| Type: | Core |
| License: | CC-BY-4.0 |
| Created: | 2022-09-28 |
| Updated: | 2024-01-04 |
Note
This document is a DRAFT contributed as input to ISO TC 46/SC 9/WG 18. The final version is developed at the International Organization for Standardization as ISO/DIS 24138.
1. General
- The Data-Code shall be a similarity hash for any kind of data regardless of its media type.
- The Data-Code shall cluster digital assets that have near-identical data.
- Small differences (as a proportion of the whole) in referent data shall yield identical Data-Codes.
- More significant differences in referent data shall produce similar Data-Codes that can be compared against each other to estimate the data-similarity of the referents.
- The Data-Code shall be resistant to data shifting and reordering sequences of data within referent data.
NOTE
Changes of the Data-Code do not reflect semantic or syntactic changes of the content.
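The comparison of two Data-Codes mentioned above can be sketched with a bitwise Hamming distance, a common way to compare fixed-length similarity hashes. This document does not prescribe the comparison metric, so the function names `hamming_distance` and `similarity` and the normalization to a 0..1 score are assumptions made for illustration. The inputs are the raw body bytes of two Data-Codes of equal, non-zero length.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length digests differ."""
    if len(a) != len(b):
        raise ValueError("digests must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def similarity(a: bytes, b: bytes) -> float:
    """Illustrative score: 1.0 for identical digests, 0.0 if every bit differs."""
    return 1.0 - hamming_distance(a, b) / (8 * len(a))
```

A small distance suggests near-identical data, while a distance around half of the bit length is what unrelated inputs would typically produce.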
2. Format
The Data-Code shall have the data format illustrated in Figure 10:
EXAMPLE 1: 64-bit Data-Code in its canonical form:
ISCC:GAAWAIBQLNWP7X32
EXAMPLE 2: 256-bit Data-Code in its canonical form:
ISCC:GADWAIBQLNWP7X32J3INMAMDUJ4QMN67BBQKVTVZIWHXQ7QJIKHYTBY
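The two examples can be taken apart with a short sketch. It assumes that the canonical form is the prefix `ISCC:` followed by unpadded RFC 4648 Base32, and that the decoded bytes consist of a 2-byte header followed by the body. This assumption is consistent with the lengths of both examples, but the header layout itself is not defined in this document. The name `split_data_code` is hypothetical.

```python
import base64
from typing import Tuple

HEADER_LEN = 2  # assumption: two header bytes precede the body


def split_data_code(code: str) -> Tuple[bytes, bytes]:
    """Return (header, body) bytes of a canonical ISCC string."""
    text = code[len("ISCC:"):] if code.startswith("ISCC:") else code
    padded = text + "=" * (-len(text) % 8)  # restore the Base32 padding
    raw = base64.b32decode(padded)
    return raw[:HEADER_LEN], raw[HEADER_LEN:]


header64, body64 = split_data_code("ISCC:GAAWAIBQLNWP7X32")
header256, body256 = split_data_code(
    "ISCC:GADWAIBQLNWP7X32J3INMAMDUJ4QMN67BBQKVTVZIWHXQ7QJIKHYTBY"
)
print(len(body64) * 8, len(body256) * 8)  # 64 256
```

In these two examples the 64-bit body is identical to the first 8 bytes of the 256-bit body, and only the header differs.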
3. Inputs
The input for calculating the Data-Code shall be the bytes of a file, without reference to their meaning or structure.
4. Outputs
Data-Code processing shall generate the following output elements:
- iscc: the Data-Code in its canonical form (required).
5. Processing
An ISCC processor shall calculate the Data-Code as follows:
- Split the data into variable-sized chunks with an average chunk size of 1024 bytes using the content-defined chunking (CDC) algorithm.
- Calculate the 32-bit integer hash of each chunk using the XXH32 algorithm.
- Apply the minhash algorithm to the array of 32-bit integers to calculate the ISCC-BODY of the Data-Code with appropriate length.
NOTE
For further technical details, see the source code in the modules code_data.py and minhash.py of the reference implementation.
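The three processing steps can be sketched as follows. This is a minimal illustration of the pipeline shape, not the reference algorithm, and it will not reproduce conforming Data-Codes. Several parts are assumptions: the gear table, the minimum and maximum chunk sizes, the 64 hash permutations and the one-bit-per-minimum compression are all illustrative. The chunker only targets an average near 1024 bytes. To keep the sketch runnable with the standard library, `zlib.crc32` stands in for XXH32. All function and constant names are hypothetical.

```python
import random
import zlib
from typing import Callable, Iterator, List

AVG_CHUNK = 1024            # average chunk size named in the text
MIN_CHUNK = AVG_CHUNK // 4  # assumption: lower bound, not from the text
MAX_CHUNK = AVG_CHUNK * 8   # assumption: upper bound, not from the text
MASK = AVG_CHUNK - 1        # cut when the low 10 bits of the rolling hash are zero

_rng = random.Random(0)     # illustrative tables, not the reference constants
GEAR = [_rng.getrandbits(32) for _ in range(256)]
PRIME = (1 << 61) - 1
PERMS = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME)) for _ in range(64)]


def cdc_chunks(data: bytes) -> Iterator[bytes]:
    """Step 1: yield content-defined chunks using a simple gear rolling hash."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_CHUNK and h & MASK == 0) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data) or not data:
        yield data[start:]  # trailing chunk; empty input yields one empty chunk


def minhash64(features: List[int]) -> bytes:
    """Step 3: keep one low bit from each of 64 permutation minima."""
    bits = 0
    for a, b in PERMS:
        low = min(((a * f + b) % PRIME) & 0xFFFFFFFF for f in features)
        bits = (bits << 1) | (low & 1)
    return bits.to_bytes(8, "big")


def data_code_body(
    data: bytes, hash32: Callable[[bytes], int] = zlib.crc32
) -> bytes:
    """Chunk, hash each chunk to 32 bits (step 2), then apply minhash."""
    # zlib.crc32 is a stand-in; a real XXH32 could be supplied instead,
    # e.g. lambda c: xxhash.xxh32(c).intdigest() from the xxhash package.
    features = [hash32(chunk) & 0xFFFFFFFF for chunk in cdc_chunks(data)]
    return minhash64(features)
```

Because chunk boundaries depend on the content rather than on fixed offsets, inserting bytes near the start of the input moves only the nearby boundaries, and most chunk hashes stay the same. This is what makes the resulting digest resistant to data shifting.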
6. Conformance
An implementation of the Data-Code algorithm shall be regarded as conforming to the standard if it creates the same Data-Code as the reference implementation for the same data input.
NOTE
The ISCC reference implementation uses the open source xxHash library [1] for XXH32 chunk hashing; appropriate use of this software will generate the same codes as the reference implementation.
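The conformance criterion amounts to an output comparison over a set of inputs, which can be sketched as a small harness. The name `find_mismatches` is hypothetical. Both callables are assumed to take the input bytes and return the Data-Code in its canonical string form; how the reference implementation is invoked is left to the caller.

```python
from typing import Callable, Iterable, List


def find_mismatches(
    candidate: Callable[[bytes], str],
    reference: Callable[[bytes], str],
    inputs: Iterable[bytes],
) -> List[bytes]:
    """Return every input for which the candidate and reference codes differ."""
    return [data for data in inputs if candidate(data) != reference(data)]
```

An empty result over a representative set of inputs is evidence of conformance for those inputs only.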
7. Bibliography
1. Collet, Yann. xxHash: Extremely fast hash algorithm. Accessed July 2022, available at https://cyan4973.github.io/xxHash/