Schemas for Representing Researcher Identity:
Comparison of EAC (2004 version) and ISO 2146¹

Nick Nicholas, Link Affiliates
2009-12-04 Version 2
opoudjis@optusnet.com.au


© Copyright 2009 University of Southern Queensland



This is an informal comparison of two standards for the encoding of information about parties related to content: Encoded Archival Context (EAC, 2004 version) and ISO 2146.

Note that a new version of the EAC standard is imminent. Known as EAC-CPF (Encoded Archival Context for Corporate bodies, Persons and Families), the proposed new standard is available at http://eac.staatsbibliothek-berlin.de/.

The context for this work is the use of profiles of the two standards by the National Library of Australia and the Australian National Data Service (ANDS), which wish to be able to exchange data in order to represent researcher identities in the Australian Research Data Commons. The National Library uses a profile of EAC for People Australia; ANDS uses RIF-CS, a profile of ISO 2146, for Research Data Australia.

The two bodies gather identity metadata for quite different purposes, and have a different view of the data. People Australia is an authoritative machine-readable reference on identities in Australia; Research Data Australia facilitates research data discovery by humans, and researcher identity is coded only as a means to that end. People Australia uses a rigorous data standard with metadata curation; Research Data Australia aggregates information from disparate and uneven sources. This means that mismatches between the two are to be expected.

The comparison is motivated by the specific consideration of importing identity data from People Australia into Research Data Australia: ANDS have identified People Australia as a major source of information on researcher identity.

The People Australia profile of EAC excludes some elements, but remains close to the full schema. For EAC records to be ingested, People Australia requires that they validate against the EAC XML Schema (http://www3.iath.virginia.edu/eac/shared/eac/eac.xsd). Once the EAC-CPF standard is released, the National Library will migrate to it and produce the relevant documentation and mappings. The new EAC-CPF standard is more generic and removes many of the dependencies on archival practices; as a result, it will be a better fit for what the People Australia program is aiming to achieve.

RIF-CS by contrast simplifies ISO 2146 more drastically.

The comparison is in two parts. First People Australia EAC is compared with the full ISO 2146 data model. Then it is compared with RIF-CS. RIF-CS is currently undergoing structural review, and this comparison has been written as feedback to that process. Readers should consult the latest version of RIF-CS.

Note that while RIF-CS is an explicit schema, ISO 2146 is a conceptual data model, and has no defined serialisation. Therefore issues with the structure of the RIF-CS schema are specific to RIF-CS, and are not inherited from ISO 2146.

1 People Australia EAC vs. ISO 2146

The elements and attributes of the Libraries Australia profile of EAC are mapped to ISO 2146. Elements and attributes not supported in ISO 2146 are split into those relevant to e-research registries, which have motivated this comparison, and those which are not. Some more general issues with the mapping are also noted.

ISO 2146 is clearly simpler than EAC, and not designed to deal with the level of detail in EAC. For example, in e-research registries, the absence of biographical detail is not of concern; nor is the lack of flexibility in nominating vocabularies a problem. On the other hand, the address information that ISO 2146 requires is more granular than that of EAC.

The essential information of EAC can be preserved in ISO 2146, with some profiling and reinterpretation of elements (typing elements in particular). Mappings from ISO 2146 to EAC are also possible.

  • The coding of professional activity is open to implementation choice.
  • Resources are not in scope of ISO 2146, outside hyperlinking (13.5 Info Pointer); the related resource metadata in EAC, including bibliographical citations, should be formulated in a different schema.
  • ISO 2146 support for metadata disambiguating names is poorer than EAC; ANDS is exploring alternate mechanisms for deduplication/disaggregation. ISO 2146 has a very simple model of sources of evidence compared to EAC; given the likely quality of metadata, elaborated sources of evidence would not be expected.
  • The ANDS use of ISO 2146 could usefully include the EAC attributes of level of detail and authority file (source), to make explicit the distinction between ANDS as an aggregator and the sources from which it has aggregated authority information. (Note however that level of detail is being dropped in EAC-CPF: it will be coded only through localControlEntry, which allows localisation of the EAC header parameters.)
  • It may also be useful to code country separately in address information, for international collaborations.

1.1 Mapping: elements

1.1.1 Supported

  • eac : 5 Registry Object. Note that ISO 2146 specialises Party as a type of Registry Object, with different contextual interpretation of the generic attributes of Registry Object.
  • eac/eacheader@status : 14.2.2 Status Value.
  • eac/eacheader, other attributes: see general discussion of attributes below
  • eac/eacid : 5.1 Registry Object Key. This is the primary key of the entity. ISO 2146 treats it as a simple key.
  • eac/eacid attributes : 5.2 Identifier. The identifier object allows more detail in identifier coding, and may be used to support the additional attributes of eacid. These attributes allow for the encoding of the identifier, the institution assigning the identifier and country encodings which make it possible to establish uniqueness.
  • eac/eacheader/mainhist/mainevent : 14.1 Update Event
  • eac/eacheader/mainhist/mainevent@maintype : 14.1.1 Update Event Type
  • eac/eacheader/mainhist/mainevent/maindate : 14.1 Update Event/13.3 Date Range
  • eac/eacheader/mainhist/mainevent/maindesc : 14.1 Update Event/13.12 Notes
  • eac/eacheader/mainhist/mainevent/name : 14.1.2 Record updater. EAC inserts a full name; ISO 2146 allows only an identifier.
  • eac/eacheader/mainhist/sourcedecl : 14.1.3 Record Source (applied to create and update event).
  • eac/eacheader/languagedecl : 5 Registry Object/13.9 Language.
  • eac/condesc/identity/didentifier : 5 Registry Object/13.5 Info Pointer
  • eac/condesc/identity/didentifier@role : 13.5.2 Target Resource
  • eac/condesc/identity/didentifier@type : 13.5.1 Info Pointer Type
  • eac/condesc/identity/persgp/pershead@role : 5.3.1 Name Role
  • eac/condesc/identity/persgp/pershead/part : 5.3.3 Name Part
  • eac/condesc/identity/persgp/pershead/part@type : 5.3.3.1 Name Part Type
  • eac/condesc/identity/persgp/pershead/usedate : 5.3 Name/13.3 Date Range
  • eac/condesc/identity/persgp/pershead/existdate : 5 Registry Object/13.3 Date Range
  • eac/condesc/identity/persgp/pershead/place : 5 Registry Object/13.10 Location
  • eac/condesc/identity/persgp/pershead/place@type : 13.10.1 Location Type. The intent behind 13.10.1 Location Type for parties (Home Location, Term Location, Business Location) is not the same as in EAC (Place of birth, Place of death); but that is a vocabulary issue that ISO 2146 does not prescribe.
  • eac/condesc/desc : 13.4 Description
  • eac/condesc/desc/persdesc/existdesc/existdate : 5 Registry Object/13.3 Date Range
  • eac/condesc/desc/persdesc/existdesc/place : 5 Registry Object/13.10 Location
  • eac/condesc/desc/persdesc/existdesc/place@type : 13.10.1 Location Type
  • eac/condesc/desc/persdesc/locations/location/place: The distinction between existdesc/place and location/place (place of birth, place of residence) can only be made in ISO 2146 through 13.10.1 Location Type
  • eac/condesc/desc/persdesc/locations/location/address : 13.1 Address. EAC Address does not have microformatting; ISO 2146 Address has a more refined structure, and the EAC Address needs to be parsed if converting to ISO 2146.
  • eac/condesc/desc/funactrels/funactrel/funact : 5.6 Event, or Related Activity (distinct registry object), or 5.5 Subject. See discussion under §1.3 Issues below.
  • eac/condesc/desc/funactrels/funactrel/funact@type : 5.6.1 Event Type or 5.4.1 Relation Type (presuming 5.6 Event or Related Activity for funact; otherwise no explicit provision)
  • eac/condesc/desc/funactrels/funactrel/date : 5.6 Event/13.3 Date Range or 5.4 Relation/13.3 Date Range (presuming 5.6 Event or Related Activity for funact; no explicit provision for 5.5 Subject)
  • eac/condesc/desc/funactrels/funactrel/place : 5.6 Event/13.10 Location or 8. Activity/13.10 Location (presuming 5.6 Event or Related Activity for funact; no explicit provision for 5.5 Subject)
  • eac/condesc/desc/funactrels/funactrel/descnote : 5.6 Event/13.12 Notes or 5.6 Event/13.4 Description
  • eac/condesc/eacrels/eacrel : 5.4 Relation
  • eac/condesc/eacrels/eacrel/persname : 5.4.2 Related Registry Object Key. Note that EAC identifies related parties by name, whereas ISO 2146 identifies them by identifier.
  • eac/condesc/eacrels/eacrel/persname@reltype : 5.4.1 Relation Type
  • eac/condesc/resourcesrels/resourcerel : 5.4 Relation. This requires expanding the ISO 2146 relation partycollection to partyresource. Normally resources are out of scope of ISO 2146. Alternatively 13.5 Info Pointer could be used, but that does not allow any metadata to be coded on the related resource.
  • eac/condesc/resourcesrels/resourcerel/archunit@reltype : 5.4.1 Relation Type.
  • eac/condesc/resourcesrels/resourcerel/archunit@type : 5.4.3 Relation Qualifier.
  • eac/condesc/resourcesrels/resourcerel/archunit : 13.5.2 Target Resource. ISO 2146 target resources can be bibliographic citations. The internal structure of the citation is out of scope for ISO 2146.
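
The supported mappings listed above can be held as a simple lookup table when writing a converter. The following is a minimal sketch in Python covering only a few of the rows; the dictionary layout and the function name iso2146_targets are illustrative assumptions, and neither standard prescribes any such structure.

```python
from typing import Dict, List

# Partial EAC -> ISO 2146 crosswalk, transcribed from the list above.
# The data layout is an assumption made for illustration only.
EAC_TO_ISO2146: Dict[str, List[str]] = {
    "eac": ["5 Registry Object"],
    "eac/eacheader@status": ["14.2.2 Status Value"],
    "eac/eacid": ["5.1 Registry Object Key"],
    "eac/eacheader/mainhist/mainevent": ["14.1 Update Event"],
    "eac/condesc/identity/persgp/pershead/part": ["5.3.3 Name Part"],
    "eac/condesc/identity/persgp/pershead/part@type": ["5.3.3.1 Name Part Type"],
    "eac/condesc/desc/persdesc/locations/location/address": ["13.1 Address"],
    # funact has three candidate targets; the choice is an implementation decision.
    "eac/condesc/desc/funactrels/funactrel/funact": [
        "5.6 Event",
        "Related Activity",
        "5.5 Subject",
    ],
    "eac/condesc/eacrels/eacrel": ["5.4 Relation"],
}


def iso2146_targets(eac_path: str) -> List[str]:
    """Return the candidate ISO 2146 elements for an EAC path, or [] if unmapped."""
    return EAC_TO_ISO2146.get(eac_path, [])
```

An empty result signals an element that falls under one of the "Not Supported" categories and needs a profiling decision before conversion.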

1.1.2 Not Supported, Not Relevant

  • eac/eacheader/ruledecl : no provision for identifying the standards or rules followed in creating the schema instance. Unlikely to be a problem for this context of use.
  • eac/condesc/identity/persgp, eac/condesc/corpgrp : See discussion under §1.3 Issues.
  • eac/condesc/identity/persgp/pershead/sourceref : information on the reference for a name would go to 13.12 Notes.
  • eac/condesc/identity/persgp/pershead/sourceref/sourceinfo : information on the source for a name would go to 13.12 Notes.
  • eac/condesc/identity/persgp/pershead/sourceref/sourceinfo/note : details on the source for a name would go to 13.12 Notes.
  • eac/condesc/identity/persgp/pershead/sourceref/sourceinfo/note/bibref : bibliographic references on the source for a name would go to 13.12 Notes or 13.5 Info Pointer.
  • eac/condesc/desc/bioghist : No provision in ISO 2146 for biographical description.
  • eac/condesc/desc/persdesc/ : No provision in ISO 2146 for historical or cultural context.
  • eac/condesc/desc/persdesc/descentry : ISO 2146 does not provide in its base schema for extension of party descriptions, e.g. nationality, heritage, gender. ISO 2146 can be profiled readily. The citizenship categories EAC envisions here are probably not relevant in an e-research context.

1.1.3 Not Supported, Relevant

  • eac/condesc/identity/persgp/pershead/nameadd : no provision for explicit disambiguating information in ISO 2146; 5.3 Name/13.12 Notes could be used, but would not by default be machine-readable.
  • eac/condesc/resourcesrels/resourcerel/archunit, musunit : (ISO 2146 without profiling would conflate archival and museum objects as collections). 13.13 Resource Type could provide limited information. The details of reference, such as title and imprint, are also out of scope for ISO 2146. In EAC-CPF the whole model for handling relations (related resources, people and functions) has changed dramatically. As a result issues with archunit, bibunit and musunit will disappear in the future.

1.2 Mapping: attributes

1.2.1 Supported

  • @adrtype : 13.10.1 Location Type
  • @emailtype : 13.10.1 Location Type
  • @form : ISO 2146 indicates whether a time span is open or closed by omitting the end date in 13.3 Date Range
  • @href : 13.5.2 Target Resource
  • @id : 5.1 Registry Object Key
  • @languagecode : 13.9 Language
  • @maintype : 14.1.1 Update Event Type
  • @placetype : 13.10.1 Location Type : the distinction between jurisdictions and geographic names can be expressed by ISO 2146 through typing, although that was not likely anticipated.
  • @reltype : 5.4.1 Relation Type
  • @role : 13.5.1 Info Pointer Type (?) : typing hyperlinks to classify the role of a remote resource in a link is possible through ISO 2146, but may extend that element beyond its intended use, which is descriptive (Assets, Documents, Inputs, etc.)
  • @status : 14.2.2 Status Value
  • @syskey : The local identifier of an XML entity is not in scope of the abstract data model of ISO 2146, but would be included in any profile.
  • @system : The local address of an XML entity is not in scope of the abstract data model of ISO 2146, but would be included in any profile.
  • @teltype : 13.10.1 Location Type
  • @type : Supported as separate elements rather than generically.
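
The treatment of @form above (an open span is signalled by omitting the end date) can be sketched as follows. This is a hedged illustration in Python: the class name DateRange, the function from_eac_date, and the @form values "openspan" and "closedspan" are assumptions made for the example and should be checked against the EAC profile in use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DateRange:
    """Sketch of 13.3 Date Range: an omitted end date marks an open span."""

    start: str                 # ISO 8601 date string
    end: Optional[str] = None  # None means the span is still open

    @property
    def is_open(self) -> bool:
        return self.end is None


def from_eac_date(start: str, end: Optional[str], form: str) -> DateRange:
    """Convert an EAC date and its @form value into a DateRange.

    The @form values "openspan" and "closedspan" are assumed for illustration.
    """
    if form == "openspan":
        return DateRange(start)  # any end date is dropped: the span is open
    if end is None:
        raise ValueError("a closed span needs an end date")
    return DateRange(start, end)
```

In this sketch the open or closed status is never stored separately; it is derived from the presence of the end date, which is the point of the ISO 2146 approach.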

1.2.2 Not Supported, Not Relevant

ISO 2146 does not provide flexibility in choice of vocabularies for standard elements. This is not a concern for an e-research context:

  • @calendar : ISO 2146 presupposes ISO 8601, and therefore the Gregorian calendar. This is not an issue for an e-research context.
  • @countryencoding : parallel with other attributes, if ISO 2146 tagged countries separately, it would only allow ISO 3166-1
  • @dateencoding : ISO 2146 is constrained to ISO 8601
  • @era : ISO 2146 does not differentiate between BCE and CE, as it relies on ISO 8601 (which uses positive and negative numbers); People Australia have also indicated they are unlikely to use this attribute, and it is irrelevant to information on researchers.
  • @langencoding : ISO 2146 is constrained to ISO 639-3.
  • @ownerencoding : ISO 2146 does not code for the owner of a system, per ISO 15511.
  • @scriptencoding : ISO 2146 does not code for non-Latin scripts; if it did, it would be constrained to ISO 15924.
  • @normal : Dates in ISO 2146 are only meant to be machine-readable, so there is no need to give a separate machine-readable normalised coding.
  • @typeauth : the type vocabularies used in ISO 2146 are implicit, and ISO 2146 does not allow standards names to be nominated in the schema.
  • @typekey : the type vocabularies used in ISO 2146 are implicit, and cannot be formulated ad hoc as in EAC.
  • @valueauth : the standards vocabularies used in ISO 2146 are implicit, and ISO 2146 does not allow standards names to be nominated in the schema. For date, language and spatial location, the standards are fixed as ISO 8601, ISO 639-3 and ISO 19100 respectively.
  • @valuekey : the standards vocabularies used in ISO 2146 are implicit, and cannot be formulated ad hoc as in EAC
  • @encodinganalogsys : ISO 2146 does not cross-reference other encoding schemes.
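
To illustrate why a separate era attribute is redundant under ISO 8601, the following sketch converts an era-labelled year to a signed year. It assumes the era labels "ce" and "bce" and a function name iso8601_year, all of which are hypothetical; it relies only on the astronomical year numbering that ISO 8601 uses, in which 1 BCE is year 0000.

```python
def iso8601_year(year: int, era: str) -> str:
    """Render an era-labelled year as an ISO 8601 signed year string.

    ISO 8601 uses astronomical year numbering: 1 BCE is year 0000,
    2 BCE is -0001, and so on. The era labels are assumptions.
    """
    if year < 1:
        raise ValueError("era-labelled years start at 1")
    if era == "ce":
        return f"{year:04d}"
    if era == "bce":
        astronomical = 1 - year  # 1 BCE -> 0, 44 BCE -> -43
        sign = "-" if astronomical < 0 else ""
        return f"{sign}{abs(astronomical):04d}"
    raise ValueError(f"unknown era: {era!r}")
```

Note that ISO 8601 treats years outside 0000 to 9999, including negative years, as an expanded representation that exchange partners must agree on; for researcher metadata this case should not arise.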

As an abstract data model, ISO 2146 does not provide support for the operational usage of schemata. The following attributes can straightforwardly be dealt with in any implementation of ISO 2146.

  • @audience : ISO 2146 does not specify the privacy level of elements, though this could be profiled easily.
  • @authorised : ISO 2146 does not specify authority metadata for elements.
  • @actuate : ISO 2146 does not specify hyperlink behaviour in browsers.
  • @label : ISO 2146 does not suggest display labels for rendering.
  • @show : ISO 2146 does not code how hyperlinks should be rendered, although this could be profiled.

Other fields

  • @ownercode : ISO 2146 does not code for the owner of a system, per ISO 15511.
  • @rule : ISO 2146 has no notion of capturing rules for the formulation of an element.
  • @scriptcode : ISO 2146 does not code for non-Latin scripts; this is unlikely to be relevant in this context.

1.2.3 Not Supported, Relevant

  • @countrycode : Country codes are not tagged separately from the rest of location descriptions in ISO 2146. If this becomes a search item, it could be microcoded in the ISO 2146 profile, or added as a separate field. Countries can be treated as a defined 13.1.3 Address Part in the context of addresses.
  • @detaillevel : ISO 2146 does not represent level of description detail, although this could easily be profiled.
  • @ea : ISO 2146 does not support reference to authority files. This may be useful to expose where ANDS has aggregated metadata from (e.g. People Australia as a reference for the identity metadata used).
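
As a rough illustration of treating country as a defined Address Part, the sketch below splits an unstructured address string into typed parts. Everything here is an assumption made for the example: the part type names "addressLine" and "country", the comma-splitting heuristic, and the three-entry country lookup (a real profile would use the full ISO 3166-1 list).

```python
from typing import List, Tuple

# Tiny illustrative lookup; a real profile would use the full ISO 3166-1 list.
COUNTRY_CODES = {"australia": "AU", "new zealand": "NZ", "germany": "DE"}


def split_address(text: str) -> List[Tuple[str, str]]:
    """Split a comma-separated address into (part type, value) pairs.

    The part type names "addressLine" and "country" are hypothetical.
    """
    pieces = [p.strip() for p in text.split(",") if p.strip()]
    parts: List[Tuple[str, str]] = []
    for i, piece in enumerate(pieces):
        is_last = i == len(pieces) - 1
        code = COUNTRY_CODES.get(piece.lower())
        if is_last and code:
            parts.append(("country", code))  # machine-readable country code
        else:
            parts.append(("addressLine", piece))
    return parts
```

Only the final component is tested against the country list, so a street or suburb that happens to share a country's name is left as an ordinary address line.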

1.3 Issues

  • ISO 2146 does not have any notion of grouping together multiple descriptions of parties, like EAC's eacgrp and condescgrp. This should not matter: descriptions of parties can be supplied one at a time.
  • The ISO 2146 model of sources for evidence, 14.1.3 Record Source, is not as fully elaborated as the EAC model, eac/eacheader/mainhist/sourcedecl.
  • There is no provision in ISO 2146 for grouping together names in different languages in a group element (eac/condesc/identity/persgp, eac/condesc/corpgrp); ISO 2146 instead does the grouping as a multilingual string (pushing the embedding one level down). Information is still preserved, but different name structures in different languages for the same entity become impossible.
  • EAC differentiates between dates and places used in disambiguation (eac/condesc/identity/persgp/pershead/usedate, eac/condesc/identity/persgp/pershead/existdate, eac/condesc/identity/persgp/pershead/place), and those used only as informative data (eac/condesc/desc/persdesc/existdesc/existdate, eac/condesc/desc/persdesc/existdesc/place). ISO 2146 does not make such a distinction, and conflates these.
  • eac/condesc/resourcesrels/resourcerel/archunit, bibunit, musunit : ISO 2146 does not provide for the breakdown of related resources into archival, bibliographic and museum objects, since it does not model resources at all; this would best be addressed by a distinct schema.
  • The EAC funact entity describes a function, profession, or activity performed by the entity. In terms of e-research, it can address both the institutional position of a researcher (degree, lectureship, fellowship), and more narrowly the funded project under which the associated collection was produced. The institutional position binds the researcher to an institution for a specific timespan, and allows parties to be discovered by their employer. The funded project is itself a primary search point for research data, and therefore essential to capture.
  • Of the three alternatives, 5.5 Subject can also be used for an occupation, but only as an enum, requiring a fixed vocabulary, and without any internal structure (such as date span). Event is an attribute of a Party, whereas Activity is a distinct object, related to the Party. The definitions of the two entities are only slightly different: Event is a happening occurring at a particular point in time or location that may be associated with a registry object, while Activity is something occurring over time that generates one or more outputs. However, Event is also named as a subtype of Activity (something that happens at a particular place or time as an organized activity with participants or an audience).
  • Activity matches the overall intent of EAC funact better: it also explicitly provides for courses, programs (including higher degree programs) and projects as subtypes, although it does not nominate professional appointments. However the definition of 5.6 Event is elastic enough to be used for institutional position as well. The distinction should instead be made on whether to expose the appointment or project as a top-level object, independently discoverable, or to subsume it as an attribute of the party.
  • Research is directly driven by funded projects, and the community understands them as self-standing entities: they are thought of as generators of research in their own right. Moreover, projects include the contributions of multiple parties. So projects should be coded as Activities, which exposes them as distinct objects. On the other hand, appointments and degrees undertaken are not seen independently of the party: users will not search for Joe Bloggs' Masters course or Joe Bloggs' Senior Lectureship independently of Joe Bloggs. So it is more appropriate to code professional milestones as Events.
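
The recommendation on funact in the points above amounts to a small decision rule, sketched here in Python. The kind labels ("project", "appointment", "degree", "occupation") and the function name are assumptions for illustration; they are not drawn from any EAC or ISO 2146 vocabulary.

```python
# Illustrative funact kinds; these labels are assumptions, not an EAC vocabulary.
ACTIVITY_KINDS = {"project"}             # self-standing, independently discoverable
EVENT_KINDS = {"appointment", "degree"}  # milestones subsumed under the party
SUBJECT_KINDS = {"occupation"}           # fixed-vocabulary term, no date span


def iso2146_target_for_funact(kind: str) -> str:
    """Pick the ISO 2146 construct for an EAC funact, following the argument above."""
    if kind in ACTIVITY_KINDS:
        return "Activity"    # distinct registry object related to the party
    if kind in EVENT_KINDS:
        return "5.6 Event"   # attribute of the party
    if kind in SUBJECT_KINDS:
        return "5.5 Subject"
    raise ValueError(f"no mapping decided for funact kind: {kind!r}")
```

Raising an error for unlisted kinds keeps the choice explicit, since the text leaves the coding of professional activity open to implementation.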

2 RIF-CS vs. ISO 2146

2.1 Supported

  • originatingSource codes the original source of information for the metadata. This refines the model for source of evidence in 14.1.3 Record Source, and can be mapped to eac/eacheader/sourcedecl/source
  • RIF-CS has the same model for multilingual strings as ISO 2146
  • 6.1 Party Type : party@type. RIF-CS allows parties to be individuals or groups.
  • 5.1 Registry Object Key : key
  • 5.2 Identifier : party/identifier
  • 5.3 Name : party/name
  • 5.4 Relation : party/relatedinfo
  • 13.10 Location : party/location
  • 5.5 Subject : party/subject. Vocabulary can be specified through party/subject@type
  • 13.4 Description : party/description
  • 13.5 Info Pointer : party/relatedobject
  • 13.5.2 Target Resource : party/relatedobject/url
  • 5.3 Name/13.3 Date Range : name@dateFrom, name@dateTo
  • 13.10 Location/13.3 Date Range : location@dateFrom, location@dateTo
  • 14.1 Update Event/13.3 Date Range (= 5.8 Date Record Last Modified): party@dateModified
  • 5.2.2 Identifier Role : identifier@type
  • 5.3.1 Name Role : name@type
  • 5.3.2 Unstructured Name : represented as a single name/namePart instance.
  • 5.3.3 Name Part : name/namePart
  • 5.3.3.1 Name Part Type : name/namePart@type
  • 5.4.1 Relation Type : party/relatedinfo@type, relation/type
  • 5.4.3 Relation Qualifier : relation/relationdescription
  • 13.1.1 Physical Address : location/address/physical
  • 13.1.2 Electronic Address : location/address/electronic
  • 13.1.3 Address Part : location/address/physical/addresspart. RIF-CS does not allow components of electronic addresses, but does distinguish args from values of service URIs.
  • 13.1.1.1 Physical Address Type : location/address/physical@type
  • 13.1.1.2 Electronic Address Type : location/address/electronic@type
  • 13.1.3.1 Address Part Type : location/address/physical/addresspart@type
  • 13.4.1 Description Role : description@type. The vocabulary for description is narrow, but it does include brief and full as values, which addresses the EAC @detaillevel attribute.
  • 13.10.1 Location Type : location@type
  • 13.10.2 Spatial Location : location/spatial
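
To make the supported paths above concrete, the following sketch builds a skeletal party record with Python's standard xml.etree.ElementTree. It follows the element paths as listed in this section and has not been validated against the RIF-CS schema; the placement of key inside party, the absence of namespaces, and the type values "person", "primary", "family", "given" and "email" are all assumptions.

```python
import xml.etree.ElementTree as ET


def build_party(key: str, family: str, given: str, email: str) -> ET.Element:
    """Build a skeletal party element using the paths listed above.

    Type values and the position of <key> are illustrative assumptions.
    """
    party = ET.Element("party", {"type": "person"})           # 6.1 Party Type
    ET.SubElement(party, "key").text = key                    # 5.1 Registry Object Key
    name = ET.SubElement(party, "name", {"type": "primary"})  # 5.3 Name, 5.3.1 Name Role
    ET.SubElement(name, "namePart", {"type": "family"}).text = family
    ET.SubElement(name, "namePart", {"type": "given"}).text = given
    location = ET.SubElement(party, "location")               # 13.10 Location
    address = ET.SubElement(location, "address")
    ET.SubElement(address, "electronic", {"type": "email"}).text = email
    return party


if __name__ == "__main__":
    record = build_party("example-key-1", "Bloggs", "Joe", "joe.bloggs@example.org")
    print(ET.tostring(record, encoding="unicode"))
```

Readers should consult the current RIF-CS schema for the actual nesting and vocabularies before relying on any of these names.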

2.2 Not Supported, Not Relevant

  • 13.7 Is Default
  • 13.9 Language
  • 13.6 Is Active
  • 13.1.4 Address Text Encoding
  • 13.2 Currency
  • 13.8 Supported
  • 13.11 Measurement Type

2.3 Not Supported, Relevant

  • 5.6 Event
  • 13.5.1 Info Pointer Type
  • 13.3 Date Range outside named instances. In particular, no support for dates of: 5 Registry Object (e.g. lifespan of party), 5.2 Identifier, 5.4 Relation, 5.6 Event, 14.2 Status, or generic 14.1 Update Event
  • 13.12 Notes
  • 14.1 Update Event
  • 14.2 Status
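
When converting from an ISO 2146-shaped record to RIF-CS, it can help to report which elements will be lost. The sketch below encodes a few rows of the supported and unsupported lists in this section; the table layout and the function name dropped_elements are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Optional

# A few ISO 2146 elements and their RIF-CS counterparts, as listed in this
# section. None marks an element with no RIF-CS counterpart.
RIFCS_SUPPORT: Dict[str, Optional[str]] = {
    "5.3 Name": "party/name",
    "5.4 Relation": "party/relatedinfo",
    "13.4 Description": "party/description",
    "13.10 Location": "party/location",
    "5.6 Event": None,
    "13.5.1 Info Pointer Type": None,
    "13.12 Notes": None,
    "14.1 Update Event": None,
    "14.2 Status": None,
}


def dropped_elements(used: Iterable[str]) -> List[str]:
    """Return, in input order, the listed elements that RIF-CS cannot carry."""
    return [e for e in used if e in RIFCS_SUPPORT and RIFCS_SUPPORT[e] is None]
```

Elements absent from the table are ignored rather than reported, so the table would need to be completed from the lists above before being used as a real check.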

The following simplifications in the current version of RIF-CS relative to ISO 2146 are to be noted. RIF-CS is under review, and some of these concerns have been raised independently. Please consult the latest version of RIF-CS for updates.

  • RIF-CS does not support Events, which restricts representation of professional activities to the activity object. As already noted above, this exposes what may be incidental biographical information as top-level objects. Particularly for past affiliations, the level of detail in an activity object may be difficult to provide.
  • Machine-readable context for hyperlinks is not provided: any clarification is limited to the prose of relation/relationdescription. That said, documenting the type of resource linked to in a source repository is the source repository's responsibility. It is not obvious that ANDS should even be providing the MIME type of the linked resource, especially as source repositories may reserve the right to change it.
  • Lifespans cannot be associated with parties or relations. This restricts the potential for disambiguation or context.
  • There is no provision for record history metadata in the schema itself, outside a simple indication of provenance. The detailed history of the data is held in the source collection, while this metadata profile is only intended for aggregation; so RIF-CS should not be replicating the authority metadata of the source repositories: ANDS does not supplant the source repositories' authority as a data generator. However, ANDS does have its own authority as a data aggregator, especially if it imposes quality control on the records it aggregates; it would be useful to expose such authority metadata as the quality control status.
  • There is limited provision for annotation of the schema with explanatory notes: currently they only concern date coverage. Data contributors may want to publish provisos on how metadata are to be interpreted, with varying scope of metadata.
  • There is no provision for language coding of metadata, outside the string level. This may become a concern as ANDS coverage expands; but given that ANDS registers metadata and not data, and that its primary coverage is Australian institutional data, this is not a priority.
  • Vocabularies are currently limited and not consistently explained. The RIF-CS vocabularies, particularly for entity relations, are currently under review.
  • No extension mechanism for the schema is provided, outside the use of local keywords. Because ANDS is acting as an aggregator, and consistency of RIF-CS instances is vital to that task, customisations of RIF-CS are not desirable. The question however is currently under discussion.

1 My thanks to Basil Dewhurst (NLA) and Joan Gray (ANDS) for their feedback.
