Description
Priority Level
Medium
Dataset Name
ISNI
Description
Overview
This issue proposes adding support for the International Standard Name Identifier (ISNI) dataset to the Lux pipeline. ISNI provides authoritative identification for the public identities of parties involved in content creation across the creative industries. Much of the information below comes from ISNI's website. I would like to thank drjwbaker for getting this issue started, whose questions helped us identify the new data dump files provided by ISNI.
What is ISNI?
ISNI (International Standard Name Identifier) is an ISO Standard (ISO 27729) that assigns unique identifiers to public identities of parties involved throughout the media content industries. ISNI identifies contributors to creative works such as:
- Authors, writers, and creators
- Publishers and imprints
- Recording artists and performers
- Researchers and academics
- Organizations and institutions
Each ISNI consists of 16 characters (15 digits followed by a check character, which is a digit or "X") and provides persistent identification across different platforms and databases, enabling disambiguation of entities with similar names.
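As a minimal sketch of what the identifier format implies for validation: the final character of an ISNI is a check character computed with the ISO 7064 MOD 11-2 algorithm, so it can be a digit or "X". The function names below are hypothetical helpers, not part of the Lux pipeline.

```python
import re

# 15 digits followed by a check character (digit or X)
_ISNI_RE = re.compile(r"^\d{15}[\dX]$")


def normalize_isni(value: str) -> str:
    """Strip spaces, hyphens, and an optional 'ISNI' prefix."""
    compact = re.sub(r"[\s-]", "", value.upper())
    if compact.startswith("ISNI"):
        compact = compact[4:]
    return compact


def isni_check_character(first15: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in first15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_isni(value: str) -> bool:
    """Return True if the value is well formed and its check character matches."""
    compact = normalize_isni(value)
    if not _ISNI_RE.match(compact):
        return False
    return isni_check_character(compact[:15]) == compact[15]


print(is_valid_isni("ISNI 0000 0000 8045 6315"))
```

The two label examples given in the schema tables further down (0000 0000 8045 6315 and 0000 0001 2353 1945) both pass this check, which could be used to reject malformed identifiers before they reach the mapper.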
Benefits to Lux
- Authority Control: ISNI provides authoritative identification for creators and contributors
- Cross-Platform Linking: Enables connections to major library and cultural heritage databases
- Disambiguation: Resolves name conflicts for entities with similar names
- International Scope: Covers entities from creative industries worldwide
- Standard Compliance: Based on ISO 27729 international standard
Configuration Requirements
The pipeline would need configuration entries for:
- Person data URL: https://isni.org/isni/data-person/data.jsonld
- Organization data URL: https://isni.org/isni/data-organization/data.jsonld
- Update schedule: Every 6 months
- Data format: JSON-LD preferred for easier processing
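The entries above could be expressed roughly as follows. This is a minimal sketch: the keys `input_files`, `records`, `url`, and `dumps_dir` mirror the lookups made by the example downloader later in this issue, while `name`, `format`, `update_interval_months`, and the `dumps_dir` path are assumptions for illustration, not confirmed Lux configuration keys.

```python
# Hypothetical configuration entry for the ISNI source.
ISNI_CONFIG = {
    "name": "isni",                      # assumed source name
    "dumps_dir": "/data/dumps/isni",     # assumed local path for downloaded dumps
    "format": "jsonld",                  # assumed key; JSON-LD is the preferred format
    "update_interval_months": 6,         # assumed key; ISNI refreshes every 6 months
    "input_files": {
        "records": [
            # Order matters for the example downloader: persons first, organizations second
            {"url": "https://isni.org/isni/data-person/data.jsonld"},
            {"url": "https://isni.org/isni/data-organization/data.jsonld"},
        ]
    },
}

person_url = ISNI_CONFIG["input_files"]["records"][0]["url"]
org_url = ISNI_CONFIG["input_files"]["records"][1]["url"]
print(person_url)
print(org_url)
```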
Dataset Characteristics
- License: Creative Commons CC0 1.0 Universal Public Domain Dedication
- Update Frequency: Every 6 months
- URI Pattern: https://isni.org/isni/{ISNI} (where ISNI is the 16-digit identifier)
- Size: Millions of person and organization records
- No SPARQL endpoint currently available
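The URI pattern listed above can be applied in both directions. The sketch below uses hypothetical helper names and assumes the identifier appears in the URI without spaces, with a final character that may be a digit or "X".

```python
import re
from typing import Optional

ISNI_URI_PREFIX = "https://isni.org/isni/"
# Accept http or https; capture the 16-character identifier
_ISNI_URI_RE = re.compile(r"^https?://isni\.org/isni/(\d{15}[\dX])/?$")


def isni_to_uri(isni: str) -> str:
    """Build the canonical URI from an ISNI written with or without spaces."""
    compact = "".join(isni.split())
    return f"{ISNI_URI_PREFIX}{compact}"


def uri_to_isni(uri: str) -> Optional[str]:
    """Return the identifier from an ISNI URI, or None if the URI does not match."""
    match = _ISNI_URI_RE.match(uri)
    return match.group(1) if match else None


print(isni_to_uri("0000 0000 8045 6315"))
print(uri_to_isni("https://isni.org/isni/0000000080456315"))
```

Matching against the full pattern, rather than splitting on "/", avoids treating the dump landing pages (for example `https://isni.org/isni/data-person`) as identifiers.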
Data Access Method
Data Sources
- Person data: https://isni.org/isni/data-person
- Organization data: https://isni.org/isni/data-organization
Available Formats
- RDF/XML: https://isni.org/isni/data-person/data.rdf and https://isni.org/isni/data-organization/data.rdf
- JSON-LD: https://isni.org/isni/data-person/data.jsonld and https://isni.org/isni/data-organization/data.jsonld
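Because the dumps are plain files behind fixed URLs, fetching them needs nothing beyond a streamed HTTP download. The following is a standalone sketch using only the standard library; `fetch_dump` is a hypothetical helper, and the actual pipeline would presumably rely on its own `BaseDownloader` instead.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_dump(url: str, dest: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream a dump file to disk without holding it in memory."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary ".part" file so an interrupted download
    # never leaves a truncated file at the final path
    tmp = dest.with_suffix(dest.suffix + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    tmp.replace(dest)
    return dest


# Example call (not executed here):
# fetch_dump("https://isni.org/isni/data-person/data.jsonld",
#            Path("dumps/isni-persons.jsonld"))
```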
Data Format
Person Entity Schema
| RDF Property | Expected Value/Range | Definition | Cardinality |
|---|---|---|---|
| rdfs:label | Literal | 16 digit ISNI presented as stated in the ISNI ISO standard, e.g. ISNI 0000 0000 8045 6315 | 1 |
| rdf:type | Class | Always schema:Person | 1 |
| schema:alternateName | Literal | Name of the public identity | 1..* |
| schema:birthDate | Literal | Year of birth of the public identity | 0..1 |
| schema:deathDate | Literal | Year of death of the public identity | 0..1 |
| schema:identifier | Class | Always schema:PropertyValue | 1 |
| isni:hasDeprecatedISNI | Literal | Deprecated ISNI; 16 digits with no space | 0..* |
| owl:sameAs | Class | Entity identified by a URI and modelled as a real world object | 0..* |
| madsrdf:isIdentifiedByAuthority | Class | Entity identified by a URI and modelled as an authority/skos:Concept | 0..* |
| dcterms:source | Class | Entity identified by a non machine actionable URI, i.e. a URL | 0..* |
Organization or Group Entity Schema
| RDF Property | Expected Value/Range | Definition | Cardinality |
|---|---|---|---|
| rdfs:label | Literal | 16 digit ISNI presented as stated in the ISNI ISO standard, e.g. ISNI 0000 0001 2353 1945 | 1 |
| rdf:type | Class | Always schema:Organization | 1 |
| schema:alternateName | Literal | Name of the public identity | 1..* |
| schema:identifier | Class | Always schema:PropertyValue | 1 |
| isni:hasDeprecatedISNI | Literal | Deprecated ISNI; 16 digits with no space | 0..* |
| owl:sameAs | Class | Entity identified by a URI and modelled as a real world object | 0..* |
| madsrdf:isIdentifiedByAuthority | Class | Entity identified by a URI and modelled as an authority/skos:Concept | 0..* |
| dcterms:source | Class | Entity identified by a non machine actionable URI, i.e. a URL | 0..* |
Property Value Schema
Always a blank node:
| RDF Property | Expected Value/Range | Definition | Cardinality |
|---|---|---|---|
| rdf:type | Class | schema:PropertyValue | 1 |
| schema:propertyID | Class | Always the Wikidata identifier for the ISNI schema, i.e. http://www.wikidata.org/entity/Q423048 | 1 |
| schema:value | Literal | 16 digit ISNI – no blank spaces | 1 |
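The cardinalities in the tables above lend themselves to a simple sanity check on incoming records. The sketch below covers the person schema only and rests on two assumptions about the JSON-LD serialization that should be verified against the actual dump: properties appear as compact prefixed keys (e.g. `schema:alternateName`), and `rdf:type` is serialized as `@type`. The sample record is a placeholder built from the table, not real ISNI data.

```python
# (min, max) per property, transcribed from the Person Entity Schema table;
# max=None means unbounded ("*").
PERSON_RULES = {
    "rdfs:label": (1, 1),
    "@type": (1, 1),
    "schema:alternateName": (1, None),
    "schema:birthDate": (0, 1),
    "schema:deathDate": (0, 1),
    "schema:identifier": (1, 1),
    "isni:hasDeprecatedISNI": (0, None),
    "owl:sameAs": (0, None),
    "madsrdf:isIdentifiedByAuthority": (0, None),
    "dcterms:source": (0, None),
}


def as_list(value):
    """JSON-LD may serialize a single value as a scalar; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def cardinality_errors(record, rules=PERSON_RULES):
    """Return a message for every property whose value count is out of range."""
    errors = []
    for prop, (low, high) in rules.items():
        count = len(as_list(record.get(prop)))
        if count < low or (high is not None and count > high):
            upper = "*" if high is None else high
            errors.append(f"{prop}: expected {low}..{upper}, found {count}")
    return errors


# Placeholder record shaped after the tables above (the name is invented)
sample = {
    "@id": "https://isni.org/isni/0000000080456315",
    "@type": "schema:Person",
    "rdfs:label": "ISNI 0000 0000 8045 6315",
    "schema:alternateName": ["Example, Person"],
    "schema:identifier": {
        "@type": "schema:PropertyValue",
        "schema:propertyID": {"@id": "http://www.wikidata.org/entity/Q423048"},
        "schema:value": "0000000080456315",
    },
}

print(cardinality_errors(sample))
```

An equivalent rule set for organizations would drop the two date properties.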
Entity Matching
The dataset includes links to:
- LC/NACO (Library of Congress Name Authority Cooperative Program)
- data.bnf.fr (Bibliothèque nationale de France)
- Wikidata
- MusicBrainz
- National Library of Korea
- National Assembly Library of Korea
Linking Properties Used
- owl:sameAs: For resources modelled as real world objects (e.g., Wikidata)
- madsrdf:isIdentifiedByAuthority: For resources modelled as authorities (e.g., Library of Congress)
- dcterms:source: For non-machine actionable URLs
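Since the three properties carry different semantics, a reconciliation step would likely want them kept apart rather than merged into one list of equivalents. A minimal sketch, assuming compact prefixed keys and values given either as plain strings or as `{"@id": ...}` objects; `collect_links` is a hypothetical helper and the URIs in the sample are placeholders.

```python
LINK_PROPERTIES = (
    "owl:sameAs",                       # real world objects, e.g. Wikidata
    "madsrdf:isIdentifiedByAuthority",  # authorities, e.g. Library of Congress
    "dcterms:source",                   # non-machine actionable URLs
)


def collect_links(record):
    """Group a record's outgoing links by the property that carries them."""
    links = {}
    for prop in LINK_PROPERTIES:
        values = record.get(prop, [])
        if not isinstance(values, list):
            values = [values]
        uris = []
        for value in values:
            # Values may be bare strings or JSON-LD node references
            uri = value.get("@id") if isinstance(value, dict) else value
            if isinstance(uri, str) and uri:
                uris.append(uri)
        links[prop] = uris
    return links


# Placeholder record; the target URIs are invented for illustration
record = {
    "@id": "https://isni.org/isni/0000000080456315",
    "owl:sameAs": [{"@id": "http://www.wikidata.org/entity/Q_EXAMPLE"}],
    "madsrdf:isIdentifiedByAuthority": "http://id.loc.gov/authorities/names/n_EXAMPLE",
}

print(collect_links(record))
```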
Technical Requirements
- Review and approve the proposal (below)
- Implement the downloader, loader, and mapper components
- Add ISNI configuration to the pipeline
- Test with sample data
- Schedule regular updates aligned with ISNI's 6-month refresh cycle
Known Limitations
No response
Example Integration
1. Example Downloader

```python
import os

from pipeline.process.base.downloader import BaseDownloader


class ISNIDownloader(BaseDownloader):
    """
    Person data URL: https://isni.org/isni/data-person/data.jsonld
    Organization data URL: https://isni.org/isni/data-organization/data.jsonld
    """

    def get_urls(self):
        # The first configured record is the person dump, the second the organization dump
        person_url = self.config['input_files']["records"][0]['url']
        org_url = self.config['input_files']["records"][1]['url']
        dumps_dir = self.config['dumps_dir']

        person_path = os.path.join(dumps_dir, 'isni-persons.jsonld')
        org_path = os.path.join(dumps_dir, 'isni-organizations.jsonld')

        return [
            {"url": person_url, "path": person_path},
            {"url": org_url, "path": org_path}
        ]
```

2. Example Loader
```python
import time

import ujson as json

from pipeline.process.base.loader import Loader


class ISNILoader(Loader):

    def extract_identifier(self, record):
        """Extract ISNI identifier from the record URI"""
        uri = record.get('@id', '')
        if 'isni.org/isni/' in uri:
            return uri.split('/')[-1]
        return None

    def load(self):
        """Load ISNI JSON-LD data"""
        start = time.time()
        record_count = 0

        # Note: this reads the whole dump into memory at once
        with open(self.in_path, 'r', encoding='utf-8') as fh:
            data = json.load(fh)

        # Handle different JSON-LD structures
        if isinstance(data, list):
            records = data
        elif '@graph' in data:
            records = data['@graph']
        else:
            records = [data]

        for record in records:
            identifier = self.extract_identifier(record)
            if identifier:
                self.out_cache[identifier] = record
                record_count += 1
                if record_count % 10000 == 0:
                    elapsed = time.time() - start
                    # Guard against a zero elapsed time on very fast runs
                    rate = record_count / elapsed if elapsed else 0.0
                    print(f"Processed {record_count} records in {elapsed:.2f}s ({rate:.2f}/s)")

        print(f"Loaded {record_count} ISNI records")
        self.out_cache.commit()
```

3. Example Mapper
```python
from pipeline.process.base.mapper import Mapper
from cromulent import model, vocab


class ISNIMapper(Mapper):

    def __init__(self, config):
        Mapper.__init__(self, config)
        self.factory.auto_assign_id = False

    def guess_type(self, data):
        """Determine entity type from RDF type"""
        rdf_type = data.get("@type", [])
        if isinstance(rdf_type, str):
            rdf_type = [rdf_type]

        if "schema:Person" in rdf_type or "Person" in rdf_type:
            return model.Person
        elif "schema:Organization" in rdf_type or "Organization" in rdf_type:
            return model.Group
        return model.Person  # Default fallback

    def extract_isni_number(self, uri):
        """Extract 16-digit ISNI from URI"""
        if 'isni.org/isni/' in uri:
            return uri.split('/')[-1]
        return None

    def format_isni_display(self, isni):
        """Format ISNI for display: 0000 0000 0000 0000"""
        if len(isni) == 16:
            return f"{isni[:4]} {isni[4:8]} {isni[8:12]} {isni[12:16]}"
        return isni

    def parse_person(self, record):
        """Map ISNI person record to Linked Art Person"""
        uri = record.get('@id', '')
        isni_number = self.extract_isni_number(uri)
        if not isni_number:
            return None

        top = model.Person(ident=uri)

        # Add ISNI as identifier
        isni_id = vocab.LocalNumber(content=self.format_isni_display(isni_number))
        isni_id.assigned_by = model.AttributeAssignment()
        isni_id.assigned_by.carried_out_by = model.Group(ident="https://isni.org/", _label="ISNI International Agency")
        top.identified_by = isni_id

        # Add names from schema:alternateName
        alt_names = record.get('schema:alternateName', [])
        if isinstance(alt_names, str):
            alt_names = [alt_names]

        if alt_names:
            # First name as primary
            primary_name = vocab.PrimaryName(content=alt_names[0])
            top.identified_by = primary_name

            # Rest as alternate names
            for name in alt_names[1:]:
                alt_name = vocab.AlternateName(content=name)
                top.identified_by = alt_name

        # Add birth date
        birth_date = record.get('schema:birthDate')
        if birth_date:
            birth = model.Birth()
            birth.timespan = model.TimeSpan()
            birth.timespan.identified_by = vocab.DisplayName(content=str(birth_date))
            top.born = birth

        # Add death date
        death_date = record.get('schema:deathDate')
        if death_date:
            death = model.Death()
            death.timespan = model.TimeSpan()
            death.timespan.identified_by = vocab.DisplayName(content=str(death_date))
            top.died = death

        # Add external equivalents
        same_as = record.get('owl:sameAs', [])
        if isinstance(same_as, (str, dict)):
            same_as = [same_as]
        for equiv_uri in same_as:
            if isinstance(equiv_uri, dict):
                equiv_uri = equiv_uri.get('@id')
            if not equiv_uri:
                continue  # Skip entries without a usable URI
            top.equivalent = model.Person(ident=equiv_uri)

        data = model.factory.toJSON(top)
        return {"identifier": isni_number, "data": data, "source": "isni"}

    def parse_organization(self, record):
        """Map ISNI organization record to Linked Art Group"""
        uri = record.get('@id', '')
        isni_number = self.extract_isni_number(uri)
        if not isni_number:
            return None

        top = model.Group(ident=uri)

        # Add ISNI as identifier
        isni_id = vocab.LocalNumber(content=self.format_isni_display(isni_number))
        isni_id.assigned_by = model.AttributeAssignment()
        isni_id.assigned_by.carried_out_by = model.Group(ident="https://isni.org/", _label="ISNI International Agency")
        top.identified_by = isni_id

        # Add names from schema:alternateName
        alt_names = record.get('schema:alternateName', [])
        if isinstance(alt_names, str):
            alt_names = [alt_names]

        if alt_names:
            # First name as primary
            primary_name = vocab.PrimaryName(content=alt_names[0])
            top.identified_by = primary_name

            # Rest as alternate names
            for name in alt_names[1:]:
                alt_name = vocab.AlternateName(content=name)
                top.identified_by = alt_name

        # Add external equivalents
        same_as = record.get('owl:sameAs', [])
        if isinstance(same_as, (str, dict)):
            same_as = [same_as]
        for equiv_uri in same_as:
            if isinstance(equiv_uri, dict):
                equiv_uri = equiv_uri.get('@id')
            if not equiv_uri:
                continue  # Skip entries without a usable URI
            top.equivalent = model.Group(ident=equiv_uri)

        data = model.factory.toJSON(top)
        return {"identifier": isni_number, "data": data, "source": "isni"}

    def transform(self, record, rectype=None, reference=False):
        # Fall back to the record's own type when the caller does not supply one
        if not rectype:
            rectype = self.guess_type(record)

        if rectype == model.Person or "Person" in str(rectype):
            return self.parse_person(record)
        elif rectype == model.Group or "Organization" in str(rectype):
            return self.parse_organization(record)
        else:
            return None
```