Diversity and Identity: Categories for OAI data-providers in the Open Language Archives Network

This work analyzes the network typology of data-providers who use the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to engage in ethnolinguistic information-resource stewardship. The Open Language Archives Community's (OLAC) network is analyzed with respect to: (1) the ontological nature of OAI data-providers, chiefly that not all data-providers are archives; (2) the classificatory nature of the data-providers in contrast to the existing OLAC categories of personal and institutional; and (3) the impact of classification and description on the social understanding of those providers. That is, discrete classificatory terminology does not exist within the target OLAC user community. A broader understanding of the classificatory distinctions among cultural heritage organizations would enable depositors to select the most appropriate institutions for cultural heritage preservation. Two classification taxonomies are presented for the data-providers. The taxonomy terms are applied to the members of the network: (1) as a lens by which one may understand metadata quality discrepancies across data-providers; (2) to identify strong and weak areas within the network; and (3) to identify network growth potential in contrast to the historically involved network participants. The developed taxonomies are applicable to cultural heritage networks outside of the set of OLAC data-providers and contribute to broader metadata quality discussions in the Library-Archive-Museum (LAM) community.


Introduction
Metadata quality across aggregated record sets and harvested record sets is a well-discussed topic in the literature (Stvilia et al. 2004; Ward 2004; Bui and Park 2006; Park 2006; Palmer, Zavalina, and Fenlon 2010; Zavalina 2011; Palavitsinis, Manouselis, and Sanchez-Alonso 2014). Less well discussed is how network typologies of data-providers impact the reported results in metadata quality studies. Understanding network typologies in aggregate data contexts can have several benefits for network managers and other stakeholders. However, defining appropriate categories for data-provider network members can be challenging. With too few or too many categories, the narrative evidenced via the data analysis becomes difficult to interpret. This study looks at the 60 plus members of the Open Language Archives Community (OLAC) and proposes two taxonomies relevant to cultural heritage institutions stewarding language resources. This stands in contrast to an existing two-way distinction that the OLAC application profile (OLAC-AP) provides. By exploring the diversity of the networked data-providers, this study does three things. First, it addresses an awareness gap among network participants related to who is involved. Second, it explores the classification of data-providers for the purpose of network health and growth potential. Third, by re-evaluating terminology used by the network of data-providers, it allows for a more holistic discussion about the kinds of network stakeholders and their long-term roles related to language resources in the cultural heritage domain.
The classification of data-providers is an important business function across industries. For example, Bessembinder and colleagues (2019) discuss the classification of climate data-providers and the meteorological services enabled via their data sharing. Exegy (2019) classifies financial data-providers and the level of business services bundled with data access. In these contexts, the classification of data-providers is used to indicate the authority weight and inform business operations about the metadata quality. Within the context of research on cultural heritage institutions (CHI), the social classification or "framing" of providers has implications for research evaluating CHI information retrieval systems. That is, the social assent of categories and divisions of Knowledge Organization (KO), using Hjørland's (2008) broad sense of KO, impacts the interpretation of metadata records, which are in fact KO products in Hjørland's (2008) narrow sense. I propose that a theory of KO must account for types of information resources as well as types of business models (social functions) used by organizations in a larger KO ecosystem.
The presented taxonomies applied to data-providers are applicable not only to OLAC, which was the analyzed network, but also to other data-sharing networks in the cultural heritage space.

Background
OLAC is a federation of 60 plus data-providers sharing metadata records (Bird and Simons 2022) via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for consumption and display via a common aggregator. OAI-PMH was designed to allow programmatic harvesting of metadata records (Shreeves, Kaczmarek, and Cole 2003). Typical applications of OAI-PMH include cross-institutional metadata harvests, metadata transmission between systems within the same institution, and networks of institutions contributing to a common data store. In the last case, the common data store is often coupled with a user interface for searching the collection of records from across institutional data-providers. This is the basic architecture behind OLAC and other large cultural heritage and scholarly communications discovery portals such as the Digital Public Library of America (DPLA), Europeana, the Directory of Open Access Journals, and the Platform for Open Data.
The consumption of records from diverse types of data-providers suggests, at least in cases of metadata aggregation, that various KO practices (in the narrow sense) for resource description are brought together for display and engagement. These practices can and often do represent different kinds of KO approaches, which affect, e.g., metadata quality assessment (Manghi, Candela, and Pagano 2010) or user experience design (Chopey 2005; Zavalina 2011, 2012). Therefore, it is important to account for these various local KO approaches when studying aggregated data, e.g., OLAC metadata as presented by Paterson (2022).
The presented taxonomies clarify several complex conceptual contrasts. The first is the distinction between the terms personal and institutional as they are used in the OLAC-AP. The second is the social use of the term archive by language-scholars. The third is the set of terms Data Provider, Service Provider, and Repository as used in the context of OAI-PMH.
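As a rough illustration of the harvesting architecture described above, the sketch below builds the kind of request URLs an aggregator sends to a data-provider. The verbs `Identify` and `ListRecords` and the `oai_dc` metadata prefix come from the OAI-PMH specification; the base URL and the helper name `oai_request` are hypothetical and stand in for whatever endpoint a given data-provider publishes.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; each real data-provider publishes its own base URL.
BASE_URL = "https://example.org/oai"


def oai_request(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL for the given verb and its arguments."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# Ask the data-provider to describe itself.
identify_url = oai_request(BASE_URL, "Identify")

# Ask for all records in unqualified Dublin Core, the format every
# OAI-PMH repository is required to support.
records_url = oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(records_url)
```

A harvester would fetch these URLs and parse the XML responses; that step is omitted here so the sketch runs without network access.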

Institutional vs personal
The OLAC-AP is based on both OAI-PMH and Dublin Core (Bird and Simons 2003, 2004, 2001; Simons and Bird 2003). The OAI-PMH schema has a container, <description>, which is used to describe the data-provider. One option provided by the OAI-PMH documentation allows application profiles to further define a schema for use within the <description> container to provide network-specific information about the data-provider (Lagoze et al. 2005). Since its establishment in 2001, the OLAC-AP has defined and required this component's use by data-providers to identify their nature as either personal or institutional (Simons and Bird 2008, §3). The OLAC-AP defines the terms as follows:
Institutional indicates that the repository is operated by an institution that is committed to maintaining it in the future, even after the individuals currently associated with it are no longer involved.
Personal indicates that the repository is being operated by an individual (or a group of individuals) without the commitment of an institution for maintenance far into the future.
The OLAC-AP definition for personal implies that a collective of individuals should be classified as personal. This is counterintuitive based on the common usage definition of personal. These terminological choices have made it challenging to evaluate data-providers clearly and objectively. For example, would a department of colleagues providing data together in a single feed be personal or institutional? Similarly, would an ad-hoc network of researchers, such as the Rift Valley Network, be equally classified as personal? In the former case, a department seems to be closer to institutional than an unincorporated association of researchers, yet even a set of colleagues may not have the same lasting duration as an organization with a preservation mandate. Prior to work by Paterson (2021a), no data-provider self-identified as personal, yet several of the data-providers were clearly departmental in scope and maintained by a single person.
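To make the role of the <description> container concrete, the following sketch reads the personal/institutional value out of an OAI-PMH Identify response. The OAI-PMH namespace is taken from the protocol specification. The `olac-archive` element name, its `type` attribute, and its namespace are assumptions based on my reading of the OLAC repositories documentation and should be checked against the current OLAC-AP; the sample response is invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
# Assumed namespace for the OLAC archive description; verify against the OLAC-AP.
OLAC_ARCHIVE_NS = "http://www.language-archives.org/OLAC/1.1/olac-archive"

# Invented, heavily abbreviated Identify response for illustration only.
SAMPLE_IDENTIFY = f"""<OAI-PMH xmlns="{OAI_NS}">
  <Identify>
    <repositoryName>Example Language Collection</repositoryName>
    <description>
      <olac-archive xmlns="{OLAC_ARCHIVE_NS}" type="institutional">
        <institution>Example University</institution>
      </olac-archive>
    </description>
  </Identify>
</OAI-PMH>"""


def provider_type(identify_xml: str) -> Optional[str]:
    """Return the declared type ('personal' or 'institutional'), if present."""
    root = ET.fromstring(identify_xml)
    path = (
        f"{{{OAI_NS}}}Identify/"
        f"{{{OAI_NS}}}description/"
        f"{{{OLAC_ARCHIVE_NS}}}olac-archive"
    )
    element = root.find(path)
    return element.get("type") if element is not None else None


print(provider_type(SAMPLE_IDENTIFY))
```

Because the value is a single attribute with two options, a department-level feed or an ad-hoc research network has no way to describe itself more precisely, which is the limitation this section discusses.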

Archive
A second point of terminological confusion revolves around the term archive. Within OLAC documentation, all data-providers are discussed as archives. OLAC documentation drafters have used inclusive language, choosing to cluster different types of institutional data-providers together. The choice maps well to the concepts of open and community found in the OLAC name, but the language of inclusion here does not acknowledge the diversity of the kinds of current or potential data-providers. OLAC's use of the term archive leads to a very interesting question: "What is an archive?" Is an archive an institution with a preservation mandate, as is commonly used in scholarly literature (Featherstone 2006; Seyfeddinipur et al. 2019; Burke et al. 2022; Matthews 2016)? Or is an archive a set of records often with a common origin and intra-record relationships, as is discussed by Duranti (1997) and others (Jenkinson 1937; Johnston and Schembri 2006)? The formative role that OLAC has had in the language-scholar community has strongly influenced the concept of archive among language-scholars.
Language-scholars use the term archive differently, e.g., some use it to mean a set of associated records with links (Ratner and MacWhinney 2016; Johnston and Schembri 2006), while others use the term in reference to an organization with a preservation mandate (Franchetto and Keren 2014; Skilton 2021). This indicates that, at least among language-scholars, there isn't a unified concept behind the term archive. This characterization of the language-scholar community is supported by results from a survey where 370 language-scholars responded to the question: "Have you archived your lexical dataset?" One hundred respondents replied "yes", but only 13 had made a deposit to a "long term stewardship institution" (Paterson III 2015).

Repository, data-provider, and service provider
The OAI-PMH documentation defines the terms Data Provider, Service Provider, and Repository. Within the OAI-PMH context, a corporate entity implementing data sharing technology is the Data Provider. The server technology implementing the access is called the Repository. Finally, Service Providers use metadata harvested via OAI-PMH as a basis for building value-added services. Clarifying the fine distinctions between the various technical and broader social use contexts is vital to terminological clarity within the presented taxonomies.

Methodology
To evaluate and classify data-providers, I investigated their descriptions as provided on the OLAC website. I also considered their names and web presence. When I considered the social function each data-provider was attempting to fill, three broad, functional, and mutually exclusive categories emerged: (1) institutions which in some way provide access to resources; (2) collections and exhibits (displays); and (3) reference resources. I classified data-providers into the first group if they were institutions which stewarded resources which may be acquired under some access policy. In contrast, I placed data-providers that never possess resources into group three. That is, group three data-providers either merely list additional resources, or they are the resource, e.g., several encyclopedia-like resources provide a record for each article within the larger work. Finally, the remaining data-providers were focused on interactive engagement with resources or telling a narrative about the resources. I put these data-providers into group two. In Figure 1, I arranged the three categories along a continuum where the left side is more likely to have the resource while the right side is less likely to have the actual resources described. The three emergent social functions, each pertaining to a category, are: engagement with resources (access institutions), engagement with narrative (collections and exhibits), and engagement with facts (reference resources).
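The three-way sorting described above can be sketched as a small decision procedure. This is only an approximation of a judgment that was made by reading provider descriptions; the two boolean fields and all class and function names are hypothetical simplifications introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class BroadCategory(Enum):
    ACCESS_INSTITUTION = "engagement with resources"
    COLLECTION_OR_EXHIBIT = "engagement with narrative"
    REFERENCE_RESOURCE = "engagement with facts"


@dataclass
class ProviderProfile:
    """Hypothetical summary of what a reviewer learned about a data-provider."""
    name: str
    # True if an institution stewards resources obtainable under an access policy.
    stewards_under_access_policy: bool
    # True if the provider only lists other resources, or is itself the resource.
    lists_or_is_the_resource: bool


def broad_category(profile: ProviderProfile) -> BroadCategory:
    """Apply the checks in the order the methodology describes them."""
    if profile.stewards_under_access_policy:
        return BroadCategory.ACCESS_INSTITUTION
    if profile.lists_or_is_the_resource:
        return BroadCategory.REFERENCE_RESOURCE
    # Everything remaining is treated as narrative or interactive engagement.
    return BroadCategory.COLLECTION_OR_EXHIBIT


example = ProviderProfile("Example Lab Corpus", False, False)
print(broad_category(example).name)
```

The ordering of the checks keeps the categories mutually exclusive: a provider is assigned to the first group whose condition it satisfies.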

Second Taxonomy
The preservation of resources beyond the efforts of a single person seems to be the major focus of the OLAC-AP personal/institutional dichotomy. The attempt to maintain this distinction motivated the exploration of a second, finer-grained taxonomy. In the process of applying the first taxonomy to OLAC data-providers, it was realized that (1) an access organization might not be mandated to preserve content; and (2) the three function-based categories are esoteric and may be challenging for practical use within the OLAC-AP because staff at data-providers cannot easily apply them.
For added perspective on possible kinds of Access Institutions participating in OAI-PMH networks, the list of providers to the DPLA was consulted. As discussed later, the consultation increased the number of relevant organization types which might be data-providers. However, Libraries, Archives, and Museums (LAMs) constituted a significant portion of data-providers. The rise of digital libraries and metadata sharing has opened up many new conversations between LAMs (Zorich, Waibel, and Erway 2008; Waibel and Erway 2009; Tonta 2008; Matthews 2016; Ke 2016; Roy, Bhasin, and Arriaga 2011; Katre 2011). Many authors approach LAM/G(allery)LAM discussions from the perspective of collaboration and unification. However, Besser points out the traditionally divergent business models of these memory institutions, which I found most useful while creating the second taxonomy (Dietz et al. 2005, 23).
Though libraries, museums, and archives all look like similar repositories housing cultural resources, there are some fundamental differences in mission, in what is collected, in how works are organized, and in how the institution relates to its users.
The traditional library is based upon the individual item, which is generally not unique. Archives manage groups of works and focus on maintaining a particular context for the overall collection. Museums collect specific objects and provide curatorial context for each of them. These distinctions of the fundamental unit that is collected, affect each institution's acquisition policy, cataloging, preservation, and presentation to the public.
Libraries and museums are both repositories, but libraries are user-driven. The role of the library is to provide access to a vast amount of material, which the user freely roams, making his/her own connections between works. Museums, historically, are curator-driven. They have only provided limited access to holdings, usually through a particular interpretative exhibition context, as provided by curatorial and educational staff. The museum provides a framework of context and interpretation, and the user can navigate within that smaller context. Archives tend to be research driven. They are accessible, often by appointment, in non-public spaces. The archivist has identified an area of the collection a researcher might be interested in, but s/he must go through it physically, item by item, to find out more information.
Thus, in the creation of the second taxonomy I adopted Besser's observations distinguishing the institutional characteristics of memory institutions. However, Besser's work pre-dates the widespread use of the term digital repository and as such doesn't directly address this important component of the current cultural preservation landscape. In considering how a digital repository (such as Zenodo or OSF) is different from any of the notions of Besser, I looked at the traditional content management practices in archives and compared them with the best practice OAIS model for data management in digital repositories (CCSDS 2012; and cf. Bel 2012). Archives have historically curated collections as a process of stewardship. As such, archivists view the archival collection much like a glacier, slow to move, but changing. This stands in contrast to the OAIS model, which calls for repositories to hold exact copies of content as submitted. Artifact curation does not occur within the conceptual model of repositories. With these distinctions in mind, the terms and characteristics outlined in Table 1 were adopted.
Museums and art galleries use entirely different business models to fund their operating costs and make money. The simplified difference between an art gallery and a museum is that a museum is a place of entertainment; it's an activity to visit a museum. However, an art gallery is a business that displays and sells goods. An art gallery, like Eden Gallery, aims to raise the profile of artists who exhibit in its spaces and ultimately sell artworks.
While galleries and museums are listed in Table 1, their roles are notable because these institutions often display their stewarded resources within collections and exhibits. As institutions, their public interactions are dedicated to the interpretation of artifacts. Museums in their curator-driven capacity are unapologetically expressionistic, while galleries, with their sales-oriented business model, measure their success through outcomes grounded in impressionistic interpretations. These two audience-oriented natures in some ways overlap with the types of taxonomic resources in the broad category of collections and exhibits illustrated in Table 2.
Regarding other types of institutions, a review of DPLA data-providers revealed several additional entity types including Networks, Centers, Publishers, Institutions with collection-specific management practice, Institutes, Historical Societies, Registries, and Services. Each of these entity types, not to mention specific entities, differs in business model implementation and the characteristics by which information resources are stewarded. However, it was decided for this taxonomy that such entities were not viable for inclusion in and of themselves. Two reasons for this were: (1) data-providers may be sub-entities of organizations with these names; and (2) often, organizations with these names truly do fit into the taxonomy terms available. Choosing a stewardship-oriented and function-based alignment allows data-providers the option to consider their most appropriate identifying term. For example, publishers most frequently align with behaviors consistent with repositories, even if they have a profit-centric model akin to galleries. Different corporate entities may operate a library or archive to the specific ends of the parent organization. It might be the library which is the data-provider rather than the larger organization which is responsible for the OAI-PMH data. Terms appearing in organizational titles can carry brand value rather than conforming with the characterizations presented in Table 1. For example, some entities may call themselves a library or an archive but function like a repository.
Within the broad category Collections and Exhibits, the terms and characterizations listed in Table 2 were considered. These types of data-providers are focused on a specific narrative and the metadata often supports this goal in its original context. Institutional data-providers may manage several special collections or archives (archives in the sense of a coherent set of related records). However, it might also be the case that data-providers only manage a single collection. Such a collection may provide access to resources (like a repository does) or may only point to source locations (like a bibliography). Considering the institutional versus personal dynamic available via the OLAC-AP, and the kinds of digital collections appearing across the internet, four distinct taxonomy terms were chosen: Special Collection, Personal Portfolio, Lab or Department Portfolio, and Project Portfolio. Exhibits or collections generally have different kinds of arrangements. It follows then that they also have different types of metadata supporting their cohesion and navigation. There are several reasons why the taxonomy terms in Table 2 deserve equal placement within a taxonomy also containing the terms in Table 1. First, many of these terms cover a use-case where an individual or community crafts an expressive statement via a collection of creative works. This sense of autonomy is often rescinded when content is committed to an institutional steward. Retaining the ability to craft the experience around resources is one reason scholars do not submit scholarly outputs to institutional stewards. Second, many of these cases would fall under the current OLAC-AP term personal, but they are not always personal to an individual. Third, there is a terminological ambiguity among language-scholars on how to refer to these types of exhibits and collections (cf. Paterson III 2021b, ftnt. 8). Even among information professionals, the types of collections created by language-scholars could be considered an archive in the sense of the term referring to a distinct set of records, which is ambiguous with the usage of archive referring to a type of stewardship institution. Content standards like Describing Archives: A Content Standard (Society of American Archivists 2013) provide guiding terminology for resolving these kinds of ambiguities. By acknowledging that more than just "archives" are data-providers, the diversity of the contributor network is embraced. Also, by acknowledging diversity within the OLAC-AP, contributors must ask themselves if they have taken the necessary steps for long-term resource stewardship. This still allows for situations which express a great deal of autonomy and expressivity through the building of unique interactive narratives.
The third group in the broad taxonomy shown in Figure 1 is Reference Resources. As listed in Table 3, I found three kinds of resources which fit into this broad category. The first were encyclopedia-like resources that covered a range of languages, which in the case of the OLAC aggregator's user interface, is a specially indexed access point. The second type of resource was a list of other resources like bibliographies and discographies. I included the OAI-PMH concept of gateway within the term Bibliography. OAI-PMH gateways are specific network nodes which grant access to a dataset via another protocol such as Z39.50. The third kind of resource is like the second in that it points to other resources. The term aggregator was chosen for these resources in contrast to gateways, which faithfully pass on metadata records. As used here, aggregators are nodes that operate in conjunction with a bibliographic utility (Hillmann 2008; Hillmann, Dushay, and Phipps 2004) which does data transformation or consistency alterations to the record. Both Europeana (Raemy 2020; Neale and Charles 2020) and some nodes within the DPLA network (Lynch and Gibson 2019; Lynch, Gibson, and Han 2020) operate this way. These types of entities do not currently exist in the OLAC network but could. A forward-looking taxonomy should account for these kinds of functions.
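The relationship between the broad categories and the finer-grained terms can be summarized as a simple lookup structure. In the sketch below, the four Collections and Exhibits terms are quoted from the text. The labels for the Access Institutions and Reference Resources groups are inferred from the surrounding prose, since the corresponding tables are not reproduced here, so they should be read as assumptions rather than the taxonomy's official wording. The function name is hypothetical.

```python
from typing import Dict, List, Optional

# Finer-grained terms grouped under the three broad categories.
# Labels in the first and third groups are inferred from the prose (assumption).
TAXONOMY: Dict[str, List[str]] = {
    "Access Institutions": [
        "Library", "Archive", "Museum", "Gallery", "Repository",
    ],
    "Collections and Exhibits": [
        "Special Collection", "Personal Portfolio",
        "Lab or Department Portfolio", "Project Portfolio",
    ],
    "Reference Resources": [
        "Encyclopedia", "Bibliography", "Aggregator",
    ],
}


def broad_category_of(term: str) -> Optional[str]:
    """Return the broad category a finer-grained term belongs to, if any."""
    for category, terms in TAXONOMY.items():
        if term in terms:
            return category
    return None


term_count = sum(len(terms) for terms in TAXONOMY.values())
print(term_count)
print(broad_category_of("Project Portfolio"))
```

Keeping the broad category derivable from the term means a data-provider only has to choose one value, and analyses can still be run at either level of granularity.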

Discussion
Table 4 presents a breakdown of the OLAC data-providers applying the two taxonomies. Applying the taxonomies to the network providers shows that a significant number of data-providers are repositories and encyclopedic resources. The distribution suggests that there is current value being realized by the producers of encyclopedic resources. If we assume that publishers most naturally fit into the category of repository, then the number of repositories participating in the network should be about two orders of magnitude higher than the current numbers. It also suggests that curation activities for language resources are not likely to happen among OLAC network participants. Given the large number of print resources in libraries, greater participation in the network by these types of institutions would add significant value to the network and serve to increase awareness around language resources. The scope of the data coverage informs the classification of the data-provider. For example, if the data feed covers the catalog for the entire institution, then the data-provider is classified according to one of the values in the access institution group; if the data provided from an institution is scoped to a specific collection, then it is classified according to a term within the collection and exhibit group. For institutions with more than one collection but not a whole institution's catalog there are at least two options: first, the use of multiple OAI-PMH data feeds, or second, the use of the ListSets/setSpec mechanism per the OAI-PMH specification. Currently, some OLAC data-providers do provide setSpec data; however, the OLAC-AP and aggregator do not specify or make use of this data. Specifying the use and scope of the OAI-PMH data feed description along with setSpec use would clarify the OLAC-AP for situations where resource stewards such as the Pacific And Regional Archive for Digital Sources In Endangered Languages (PARADISEC) and SIL International's Language & Culture Archives provide records for resources in their holdings as well as records for resources they know about but are not specifically in their holdings.
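The set mechanism mentioned above can be illustrated with a sketch that reads the sets a data-provider advertises. The `ListSets` response structure (`set`, `setSpec`, `setName`) follows the OAI-PMH specification. The sample response, including the set names that separate held resources from externally held ones, is invented to illustrate one possible convention and is not drawn from any actual OLAC data-provider.

```python
import xml.etree.ElementTree as ET
from typing import Dict

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Invented ListSets response; the setSpec values are hypothetical.
SAMPLE_LIST_SETS = f"""<OAI-PMH xmlns="{OAI_NS}">
  <ListSets>
    <set>
      <setSpec>holdings</setSpec>
      <setName>Resources held by the institution</setName>
    </set>
    <set>
      <setSpec>external</setSpec>
      <setName>Resources described but held elsewhere</setName>
    </set>
  </ListSets>
</OAI-PMH>"""


def advertised_sets(list_sets_xml: str) -> Dict[str, str]:
    """Map each setSpec in a ListSets response to its human-readable setName."""
    namespaces = {"oai": OAI_NS}
    root = ET.fromstring(list_sets_xml)
    return {
        element.findtext("oai:setSpec", namespaces=namespaces):
            element.findtext("oai:setName", namespaces=namespaces)
        for element in root.iterfind("oai:ListSets/oai:set", namespaces)
    }


for spec, name in advertised_sets(SAMPLE_LIST_SETS).items():
    print(spec, "->", name)
```

If an application profile specified such set semantics, an aggregator could pass a `set` argument to `ListRecords` and treat held and merely described resources differently, rather than inferring the distinction from record content.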
The classification of current data-providers suggests that efforts to include more data-providers within the network ought to consider systemic approaches for including data-providers from some of the underrepresented categories. Efforts to include portfolio collections would support the scholarly profiles of scholars and research units. Using appropriate terminology when referencing data-providers serves to: (1) clarify expectations around the long-term availability of resources, especially those representing the ethnolinguistic heritage of endangered language communities; and (2) clarify user experience design research related to information retrieval systems for language resources. Within the field of language-resource stewardship and language scholarship, the different senses of the term archive have been conflated, resulting in impacts on reported research results. For example, Yi et al. (2022) present a comparison of "language archives" and their web-facing user interfaces without differentiating the kind of information resource a website presents, e.g., institutions with diverse collections versus a single corpus. They also further conflate websites presenting either sense of archive with actual digital asset management infrastructure. The framing of their analysis suggests that the website is the archive.
Digital infrastructure facilitating asset storage, discovery, and acquisition is actually a complex construct which varies from implementation to implementation. In contrast to Yi et al.'s (2022) 'equal treatment' of different kinds of web-facing entities in the name of 'language archives', Ferreira et al. (2021) argue that archives and digital displays for interactions are distinct types of entities. They denote specific kinds of purposes for websites, highlighting the fact that some websites do not structure their existence around a preservation mandate; in a sense they are ephemeral, even as they provide meaningful community access to resources. These ephemeral websites are often community-produced exhibits or the presentation of the scholarly outputs of research labs. Although these websites are not archives in the preservation institution sense, they can equally be data-providers to OLAC for aggregation and increased awareness of the language resources available.
The framing of Yi and colleagues' work demonstrates that some language-scholars' understanding of an archive is related to the language-scholar's experience of it, a nod to digital materiality (and cf. Manoff 2006; Leonardi 2010; Jung and Stolterman 2012; Pink, Ardévol, and Lanzeni 2016). It also suggests that a scholar's understanding of preservation is the ability to access resources. These are two important and under-explored components in developing successful cultural preservation workflows which involve scholar-driven accessions and descriptions.

Conclusion
This study developed a 12-term taxonomy. When applied, it can be used to better understand the OAI-PMH data-provider network. A more diverse typology of data-providers within the OLAC application profile would serve user communities more effectively and impact the social perspective on stewardship organizations and access channels. The utility of the taxonomy can be realized in contexts beyond the Open Language Archives Community. By acknowledging diversity, reasonable expectations by language-scholars and data users can be established.
The taxonomy serves at least three functions. First, it allows for a gap analysis by revealing the kinds of OLAC data-providers which have found value through participation and the kinds of data-providers around which possible network-growth opportunities exist. Second, it allows for a more useful metadata quality evaluation by grouping like contributors together. Third, it raises awareness among language-scholars and metadata specialists concerning the differences between data-providers.
Hugh Paterson III. 2023. Diversity and Identity: Categories for OAI data-providers in the Open Language Archives Network. NASKO,

Table 1. Access Institutions

Table 4. Analysis of OLAC data-providers