Bringing data to life--lessons from the UK Data Service: our curation process ensures that data will always be available.

The UK Data Service has existed as a digital archive, in various guises, for nearly 50 years. In that time, we have processed and preserved thousands of social science data collections and made them available for reuse. We hold a library of digital data.

As a national data service for the social sciences, funded by the Economic and Social Research Council (ESRC), we hold data that unlock new discoveries in research, provide evidence for policy decisions, and help teach the next generation core data skills.

The service is a single point of access to a wide range of secondary data, including large-scale government surveys, international macrodata, business microdata, qualitative studies, and census data from 1971 to 2011. We have case studies describing how the data have been used, guides detailing how to use the data collections, publications, and outputs from the ESRC of relevance to the data collections--all linked to each other.

Curating Social Science Data

One of our early data collections was the 1975 "Labour Force Survey." These data are still available today, even though technology and standards have changed considerably since then. Our curation process ensures that data will always be available.

The UK Data Service follows the Open Archival Information System (OAIS) reference model, which promotes the long-term preservation of data.

Three parts of this model are as follows:

* Submission Information Package (SIP)--the information sent from the producer to the archive

* Archival Information Package (AIP)--the information stored by the archive

* Dissemination Information Package (DIP)--the information sent to a user when requested
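The three packages above can be sketched as simple data structures. This is a minimal illustration only, not the UK Data Service's implementation: the class names, the `ingest` and `disseminate` helpers, and the use of SHA-256 checksums as a fixity check are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SubmissionPackage:
    """SIP: the information a producer sends to the archive."""
    producer: str
    files: Dict[str, bytes]  # filename -> content as deposited


@dataclass
class ArchivalPackage:
    """AIP: the information the archive stores."""
    producer: str
    files: Dict[str, bytes]
    checksums: Dict[str, str]  # filename -> SHA-256 digest (assumed fixity check)


@dataclass
class DisseminationPackage:
    """DIP: the information sent to a user on request."""
    files: Dict[str, bytes]


def ingest(sip: SubmissionPackage) -> ArchivalPackage:
    """Turn a SIP into an AIP, recording a checksum for every file."""
    checksums = {
        name: hashlib.sha256(data).hexdigest() for name, data in sip.files.items()
    }
    return ArchivalPackage(sip.producer, dict(sip.files), checksums)


def disseminate(aip: ArchivalPackage, wanted: List[str]) -> DisseminationPackage:
    """Build a DIP from an AIP, checking each file is unchanged since ingest."""
    released = {}
    for name in wanted:
        data = aip.files[name]
        if hashlib.sha256(data).hexdigest() != aip.checksums[name]:
            raise ValueError(f"fixity check failed for {name}")
        released[name] = data
    return DisseminationPackage(released)
```

A depositor's files would pass through `ingest` once, and each user request would call `disseminate` for the files that user is allowed to receive.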

Trust is necessary throughout this process. The producers (or depositors) of the data have to trust that the archive will look after them, and the consumers (users) have to trust that the data provided are of good quality and are what they say they are.

Since 2005, our archive has been designated as a place of deposit by The National Archives, allowing us to curate public records. We hold the Data Seal of Approval and certification to the ISO 27001 information security standard.

Our depositors include government departments such as the Department for Work and Pensions and the Office for National Statistics; the International Monetary Fund; award-holders funded by the ESRC; the BBC with the "Great British Class Survey"; and others that do research and offer data that fit into our collection.

Creating and Managing Research Data

Data often have a longer life span than the research project that creates them. Researchers may continue to work on data after funding has ceased, follow-up projects may analyze or add to the data, and data may be reused by other researchers.

Well-organized, well-documented, preserved, and shared data are invaluable to advance scientific inquiry and to increase opportunities for learning and innovation. We provide training in preparing and managing research data so that when the data arrive at an archive for curation, they are of good quality and ready for reuse. Extensive guidance on how to do this is available online. We also supply a handbook, Managing and Sharing Research Data: A Guide to Good Practice, by Louise Corti, Veerle Van den Eynden, Libby Bishop, and Matthew Woollard, all from the UK Data Service. (1)

The following is a typical lifecycle journey of research data:

* Creating data--designing, planning consent, collection and management, capturing, and creating metadata

* Processing data--entering, transcribing, checking and validating, anonymizing, and describing

* Analyzing data--interpreting, deriving, producing outputs and publishing, and preparing for sharing

* Preserving data--migrating, backing up, storing, creating metadata and documentation, and archiving

* Giving access to data--distributing, sharing, controlling access, and promoting

* Reusing data--for follow-ups, new research, research reviews, scrutinizing, teaching, and learning

Discovering Data for Research

All our data collections can be freely browsed using metadata through our Discover interface. Our vision is a world of open metadata, connecting resources, and presenting the user with a map of possible pathways to the data. To aid this, we index all fields, using different filters for different items and linking between items.

Behind each catalog record is the Data Documentation Initiative (DDI) metadata. Each data collection within Discover is described by a structured catalog record that contains a number of predefined elements, such as title and abstract. These elements are based on the DDI, an XML-defined international standard for describing social science data.

Mandatory elements should always be present in each record, with the exception of some older data collections. Some elements consist of a free-text entry (e.g., the abstract and main topics). Others draw on controlled vocabularies (e.g., sampling procedures or keywords). Controlled vocabularies are lists of standardized terminology. They provide efficiency and consistency in creating and retrieving information.
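The distinction between mandatory elements, free-text elements, and controlled vocabularies can be sketched as a small record check. This is an illustration only: the element names, the vocabulary terms, and the `validate_record` function are hypothetical and are not taken from the DDI standard or from the Discover catalog.

```python
# Hypothetical mandatory elements; the article names title and abstract as examples.
MANDATORY_ELEMENTS = ("title", "abstract")

# Hypothetical controlled vocabulary: element name -> allowed standardized terms.
CONTROLLED_VOCABULARIES = {
    "sampling_procedure": {"Simple random sample", "Quota sample"},
}


def validate_record(record):
    """Return a list of problems found in a catalog record (empty if none)."""
    problems = []
    # Mandatory elements must be present and non-empty.
    for element in MANDATORY_ELEMENTS:
        if not str(record.get(element, "")).strip():
            problems.append(f"missing mandatory element: {element}")
    # Controlled elements may only use terms from their vocabulary.
    for element, allowed in CONTROLLED_VOCABULARIES.items():
        for value in record.get(element, []):
            if value not in allowed:
                problems.append(
                    f"{element}: '{value}' is not in the controlled vocabulary"
                )
    return problems
```

Free-text elements such as the abstract are only checked for presence, while controlled elements are checked term by term, which is what gives consistent retrieval.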

To help users make an informed decision on which data to use, Discover links together the following:

* 6,500-plus data collections

* 450,000-plus variables, with more than 250,000 containing question text/responses

* 170-plus case studies showing how our data have been used in research and teaching

* 13,000-plus ESRC outputs--conference papers, articles, reports, and research summaries

* 170-plus support/how-to guides--dataset, theme, and methods/statistics

Searching within Discover can be refined with facets (filters). For data collections, additional filters include data type, key data (showing the major, or key, data collections we hold), country (to which the data refer), kind of data, spatial unit, and access conditions.

For case studies, additional filters include whether the data have been used for research or teaching, the type of data used, and for the teaching case studies, the course level. Support and how-to guides have an additional filter for the type of guide, whether it is a dataset, a theme, a method, or statistics. Publications and outputs can also be filtered by source (showing the originating repository where the metadata were recorded), author, and type of output. Our 22,000 users find the data they want, downloading more than 75,000 data collections in a year (October 2014 to September 2015).

Discover was developed in-house to replace a multitude of separate search interfaces with a single portal. It is powered by Apache Solr, an open source search server written in Java. Solr is built on Apache Lucene, an indexing and search library. Solr is capable of indexing billions of records and can search millions of records in milliseconds. It features autocomplete and stemming.
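A faceted search of the kind described above could be expressed as a request to Solr's standard `select` handler. The sketch below only builds the request URL. The `q`, `fq`, `facet`, `facet.field`, and `wt` parameters are standard Solr query parameters, but the host, core name, and field names are hypothetical and do not describe the Discover index.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, filters, facet_fields):
    """Build a Solr select URL with filter queries and facet counts.

    base_url, the filter field names, and the facet field names are
    assumptions for illustration, not the real Discover schema.
    """
    params = [("q", text), ("wt", "json"), ("facet", "true")]
    for field_name, value in filters.items():
        # Quote the value as a phrase, escaping backslashes and quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field_name}:"{escaped}"'))
    for field_name in facet_fields:
        # Ask Solr to return counts for each value of this field.
        params.append(("facet.field", field_name))
    return f"{base_url}/select?{urlencode(params)}"


url = build_solr_query(
    "http://localhost:8983/solr/catalogue",  # hypothetical core
    "labour force",
    {"country": "England"},
    ["data_type", "access_conditions"],
)
```

Each `fq` narrows the result set without affecting relevance scoring, which is why filters are usually sent separately from the main query text.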

Much of our data can be downloaded directly from the website once access conditions have been met. Metadata can be harvested for data sharing through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
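Harvesting over OAI-PMH comes down to issuing HTTP requests with a `verb` argument and parsing the XML that comes back. The sketch below builds a `ListRecords` request and extracts Dublin Core titles from a response. The endpoint URL and the sample record are made up for illustration, and the service's actual endpoint and supported metadata formats are not stated in this article.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH and Dublin Core specifications.
NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # A resumption token must be sent without any other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml):
    """Return every Dublin Core title found in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iterfind(".//dc:title", NAMESPACES)]


# A made-up, heavily trimmed response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Survey, 2014</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

titles = extract_titles(SAMPLE_RESPONSE)
```

A real harvester would fetch `list_records_url(...)`, parse the response, and repeat with the returned resumption token until none is supplied.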

As well as Discover, we provide other resource discovery tools such as:

* Variable and Question Bank

* QualiBank--full-text interviews

* Thesauruses--Humanities and Social Science Electronic Thesaurus (HASSET) and European Language Social Science Thesaurus (ELSST)

* Data exploration tools--Nesstar, InFuse, and UKDS.Stat

Variable and Question Bank

The UK Data Service Variable and Question Bank contains more than 450,000 variables, with 250,000-plus containing question text/responses. It allows you to find and retrieve information about variables and questions from a range of survey datasets held by the UK Data Service. When a survey question belongs to a group--say, a set of harmonized questions or a survey instrument--the variable concerned is flagged. Coverage extends across most of the recent, major U.K. surveys and longitudinal studies, but not all studies covered in the Discover data catalog have had their variables added to this search.

QualiBank

QualiBank is the UK Data Service's search and browse interface for qualitative data objects. It allows you to find and retrieve extracts of textual data, audio files, and images from a selection of qualitative data collections. A persistent citation can be made for selected extracts of data, such as an interview extract. The system allows searching the content of text files, such as interviews, essays, open-ended questions, and reports. It also permits searching of metadata attached to these objects, such as a description of a photo or of an audio recording.

Thesauruses

HASSET is the subject thesaurus that the UK Data Service uses to index and search its data collection. Originally based on the UNESCO Thesaurus, it has been developed by the UK Data Archive for more than 20 years. HASSET terms reflect the contents of data collections curated by the UK Data Service. It can be used to retrieve studies and the key variables of data collections held within the Discover catalog.

ELSST is a broad-based, multilingual thesaurus for the social sciences that is used to search the CESSDA data catalog. It is based on HASSET, incorporates a selection of HASSET's concepts, and is currently available in 12 languages (Danish, Czech, English, Finnish, French, German, Greek, Lithuanian, Norwegian, Romanian, Spanish, and Swedish). The concepts that are shared between the two thesauruses are indicated by the qualifier (core). Funding from the CESSDA-ELSST project has enhanced the development of HASSET.

HASSET's term labels and thesaurus revisions aim to keep abreast of social science and humanities literature. HASSET terms also reflect the need for new concepts to index data in other CESSDA archives, and the thesaurus aims to include terms suggested by data users.

Both HASSET and ELSST offer a tree view to aid searching as well as a visual graph, making each thesaurus easier to navigate for novice and experienced users alike.

Data Exploration Tools

Many of our data collections can be explored online using our tools designed for various types of data. For example, for census data, these include UK BORDERS, InFuse, and WICID. For international macrodata, there is UKDS.Stat, and there is Nesstar for surveys.

Citing Data

Once you've found the data for your research, you should acknowledge their use by citing them. Citation does the following:

* Acknowledges the author's sources

* Makes identifying data easier

* Promotes the reproduction of research results

* Makes it easier to find data

* Allows the impact of data to be tracked

* Provides a structure that recognizes and can reward data creators

We've made it easy to cite our data in a number of well-known formats. You can also export the citation to EndNote or in CSL XML.

A citation looks like this:

   Office for National Statistics. Social Survey Division. (2015)
   International Passenger Survey, 2014 [data collection],
   3rd Edition. UK Data Service. SN: 7534,
   dx.doi.org/10.5255/UKDA-SN-7534-3

The fields within the citation are the creator(s) of the data, the year the data were published, the title of the data collection, an edition, who has published the data, and an identifier.

The identifier is made up of two parts: our study number and a digital object identifier (DOI), in this example, dx.doi.org/10.5255/UKDA-SN-7534-3.

A DOI is the following:

* An international standard for persistently identifying digital objects, or information about them, in a globally unique way

* A string of letters and numbers that can be used to make resources directly available to anyone over the internet

* Assigned to the data collection each time there is a major change to data, documentation, or metadata

Having a DOI ensures you can always find the data even if their physical location has changed. We are investigating the best ways to cite subsets of data.
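The citation fields and the two-part identifier described above can be assembled programmatically. The sketch below is one possible rendering and reproduces the layout of the example citation. The `DataCitation` class, its field names, and the `render` method are hypothetical and are not part of any UK Data Service tool.

```python
from dataclasses import dataclass

# Resolver host as written in the article's example; treated here as an assumption.
DOI_RESOLVER = "dx.doi.org"


@dataclass
class DataCitation:
    creators: str       # creator(s) of the data
    year: int           # year the data were published
    title: str          # title of the data collection
    edition: str        # edition of the data collection
    publisher: str      # who has published the data
    study_number: int   # first part of the identifier
    doi: str            # second part of the identifier

    def render(self) -> str:
        """Return the citation as a single line of text."""
        return (
            f"{self.creators} ({self.year}) {self.title} [data collection], "
            f"{self.edition}. {self.publisher}. SN: {self.study_number}, "
            f"{DOI_RESOLVER}/{self.doi}"
        )


citation = DataCitation(
    creators="Office for National Statistics. Social Survey Division.",
    year=2015,
    title="International Passenger Survey, 2014",
    edition="3rd Edition",
    publisher="UK Data Service",
    study_number=7534,
    doi="10.5255/UKDA-SN-7534-3",
)
text = citation.render()
```

Keeping the fields separate from the rendered string makes it straightforward to produce the same citation in other formats.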

The Future

We are now looking at the exciting challenges of Big Data and linked data, but the data we hold will still be accessible in years to come.


(1.) Corti, L., Van den Eynden, V., Bishop, L. and Woollard, M. (2014) Managing and Sharing Research Data: A Guide to Good Practice. London: Sage. ISBN: 978-1446267264

Anne Etheridge is the discovery metadata manager at the UK Data Archive. She is responsible for the archive's resource discovery tools, including searching, metadata, and display. Etheridge joined the archive in 2000 and has, over the years, worked on a number of projects, including the Humanities and Social Science Electronic Thesaurus (HASSET) and the European Language Social Science Thesaurus (ELSST). She is the archive's representative on the DDI (Data Documentation Initiative) Controlled Vocabularies Group.
COPYRIGHT 2016 Information Today, Inc.

Article Details
Author:Etheridge, Anne
Publication:Computers in Libraries
Article Type:Organization overview
Geographic Code:4EUUK
Date:Apr 1, 2016