Topic Modeling and Facet Analysis of an Emerging Domain: Research Data Management and Data Curation

Heather Moulaison Sandy, Heather Froehlich, Cynthia Hudson-Vitale, Denice Adkins


Research data management (RDM) is often seen as the overarching field concerned with managing research data and is related to data curation (DC), a subset of digital curation. Together, RDM and DC (RDM/DC) allow information professionals to work with clients and each other to make data available in support of the research enterprise. An emerging area of scholarly communication, RDM/DC represents a rich area of study from the perspective of knowledge organization (KO). This paper explores the following research question: What can facet analysis tell us about the emerging field of RDM/DC? First, the MAchine Learning for LanguagE Toolkit (MALLET) implementation of Latent Dirichlet Allocation (LDA) is used for topic modeling of abstracts of the RDM/DC scholarly literature. A preliminary analysis of these empirical data by the research team yields a number of topics and, when possible, their relevant aspects or contexts. Facet analysis principles are next applied to these results, producing four general facets: Practice, Stakeholders, Resources, and Study of RDM/DC; however, complex notions infused throughout the field such as “services” and “metadata” do not appear outright in the analysis. Each facet is then further explored through logical division, and the resulting system is encoded in Protégé and visualized using WebVOWL. We conclude that the major areas of emphasis in this data-intensive field will be fundamentally of interest to those in LIS, in scholarly communication, and, perhaps increasingly, in KO and other fields that manage and make available data of all kinds.
