
Data Sets, Game, and Match.

We're hearing a lot lately about how Big Data and the tools to analyze enormous data sets are transforming businesses. That's true, but the philosophy behind such data analysis is a logical extension of what businesses have always done. Companies rely on numbers to set strategy, alter product mix, develop new services, and adjust customer messaging. Henry Ford is famous for saying that customers could have his automobiles in any color they wanted as long as it was black. Customer buying behavior--the numbers demonstrated clearly that they wanted a choice in colors--changed Ford policy. And that predated Big Data.

The business world is replete with numbers. Financial performance, sales volume, price-to-earnings ratios, market share, manufacturing costs, and marketing data are expressed as numbers. Externally, economic, demographic, customer behavior, industry, financial, and market data are gathered by government agencies, institutions, and private research firms. This has been happening for decades. The difference lies in the amount of data that is now available and the software that now exists to allow machines to process that data. AI technologies, particularly machine learning (ML) and natural language processing (NLP), bring incredible power to data analytics. They make business insights possible, often based on pattern recognition in masses of data that are beyond the ability of humans to ascertain.

I frequently hear that data is the "new oil." If that's true, then data sets are an oil well gusher for business librarians. The existence of large data sets, coupled with the knowledge of how to exploit the numbers for business purposes, presents tremendous opportunities for expanding our skill sets into new and unexplored territory. Information professionals are the ones who know about data sources and evaluation criteria. Data may be big, but that doesn't mean that all data is equally valuable. Quantity is not the same as quality. Data librarians, information analysts, and data curators are job titles of the future--and keep in mind they may not exist in traditional libraries, not even semi-traditional special libraries.

An important finding from Ithaka S+R's Dec. 12, 2019, research report, "Teaching Business: Looking at the Support Needs of Instructors," identifies data as a critical skill. It states: "As data becomes a societal watchword, the private sector is in increasing need of employees that have a high level of data literacy. Yet significant barriers exist for both instructors and students in finding and accessing data, especially industry and financial data."


A multitude of data sets exist as free sources. Many, however, relate to scientific data rather than business data. The possibility always exists that these will have business implications, so don't completely ignore them. A data set about disease occurrence across time and in various places could hold information valuable to a pharmaceutical company. Data sets related to changes in climate contain crucial information for almost every company in every industry.

You don't always get a free lunch. Other sources for relevant data sets for business decision making are proprietary and far from free or low cost. Whether you can afford to buy these data sets will depend on how critical the information is to your organization. A corporation may well decide that spending the money is worthwhile as the ROI is proven to be favorable. In academia, the expenditure is more problematic.

The reason behind acquiring a data set also plays a role in an acquisition decision. You may want the data for training purposes. You may have a clientele interested in learning how to use R, Python, or a similar programming language to analyze large data sets. Not only does this encourage the use of free sources, it also leads to teachable moments about data quality and data literacy. One piece of data literacy is explaining to requestors why a particular data set is not available. Personal bank account information is one example (and, yes, I was once asked for this). Another is confidential internal data, such as prices paid to individual suppliers. Aggregate cost of materials is one thing, but individual invoices are unlikely to be in a data set.

Those learning about data analysis should also be examining how data was gathered, what time frame it covers, and whether it is biased. The bias could be unintentional or it could be explicitly stated. A data set, for example, might be geographically limited. If the data-gathering criteria changed at some point, that affects the validity of the analysis. Other variables to point out to analysts include the effect that legislation or regulation could have had on data being gathered.
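For a training session, a first look at time frame, geographic coverage, and gaps can be sketched in a few lines of Python. This is a minimal illustration only: the sample data, the column names (year, region, units_sold), and the profile function are all hypothetical stand-ins for whatever data set is actually being examined.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """year,region,units_sold
2015,Northeast,120
2016,Northeast,135
2016,Midwest,
2018,Northeast,150
2018,Midwest,98
"""


def profile(rows, year_field="year", region_field="region"):
    """Summarize time frame, geographic coverage, and gaps in a data set."""
    rows = list(rows)
    years = sorted({int(r[year_field]) for r in rows})
    # Years inside the covered span for which no rows exist at all.
    missing_years = [y for y in range(years[0], years[-1] + 1) if y not in years]
    # How many rows each region contributes (a quick check for geographic skew).
    regions = Counter(r[region_field] for r in rows)
    # Count of empty cells per column.
    blanks = Counter(
        field for r in rows for field, value in r.items() if not (value or "").strip()
    )
    return {
        "rows": len(rows),
        "time_frame": (years[0], years[-1]),
        "missing_years": missing_years,
        "rows_per_region": dict(regions),
        "blank_values": dict(blanks),
    }


summary = profile(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for key, value in summary.items():
    print(f"{key}: {value}")
```

A summary like this does not prove bias, but a missing year or a region with far fewer rows is a prompt to ask how and when the data was gathered.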

Data analysis is not restricted to academic exercises. Within companies, data scientists proficient with ML, predictive analytics, and other AI-based technologies are being hired to work with internally generated data sets. Customer preferences, buying behavior, and supplier activities, when properly analyzed, create insights that can validate a company's strategy or cause changes in the strategy. Companies may want to compare their internal data sets with external ones for a further validation of whether they are on the correct path to profitability.

This presents some challenges in trying to find an external data set that correlates to what has been gathered internally by the company. Inevitably, the precise parameters will not quite fit. The company has price data but the external data set has volume data. The company's definition of its sales areas does not mesh with the geographical breakdowns used by the external data set.
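One common workaround for the geography mismatch is a crosswalk table that maps the external data set's breakdowns onto the company's own sales areas. The sketch below assumes an external data set reported by state and an internal one reported by sales area; every name and number in it is invented for illustration, and a real crosswalk would have to come from the company itself.

```python
from collections import defaultdict

# Hypothetical internal data: revenue by company-defined sales area.
internal_sales = {"East": 500.0, "Central": 300.0, "West": 200.0}

# Hypothetical external data: unit volume by state.
external_volume = {"NY": 40, "NJ": 25, "OH": 30, "IL": 20, "CA": 55, "TX": 35}

# Assumed crosswalk from states to sales areas. TX is deliberately left
# unmapped to show what happens when the breakdowns do not fully mesh.
STATE_TO_AREA = {"NY": "East", "NJ": "East", "OH": "Central", "IL": "Central", "CA": "West"}


def roll_up(volume_by_state, crosswalk):
    """Aggregate state-level volume into sales areas; report unmapped states."""
    totals = defaultdict(int)
    unmapped = []
    for state, volume in volume_by_state.items():
        area = crosswalk.get(state)
        if area is None:
            unmapped.append(state)
        else:
            totals[area] += volume
    return dict(totals), unmapped


area_volume, unmapped = roll_up(external_volume, STATE_TO_AREA)
for area, revenue in internal_sales.items():
    print(f"{area}: internal revenue {revenue}, external volume {area_volume.get(area)}")
print("States with no matching sales area:", unmapped)
```

Reporting the unmapped states rather than silently dropping them keeps the imperfect fit visible to whoever interprets the comparison. The roll-up only aligns the geography; it does not resolve the separate problem of comparing price data with volume data.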


Whether for training purposes or real-world business usage, the chosen data set needs to match user needs. Here are some criteria to consider:

* Accessibility: Can you really get your hands on the data in the format you need, preferably a standard spreadsheet or statistical format such as XLS, XLSX, CSV, or SPSS?

* Time frame: Does the time frame of the data match your user needs, can you acquire it quickly, and how much time will it take to process the data?

* Cost: What is the price tag, and is it protected by a license that adds to the cost and potentially limits your ability to share analysis done on the data set?

* Uniqueness: Are there multiple sources for the data, which frequently happens with government data, or does only one data set exist that covers the subject area?

* Attributes: Are the data attributes relevant to the business needs, or are there too many extraneous attributes that you would need to factor out?

* Breadth: How broad is the data set, and does it account for exceptions?

* Manipulation: Is it raw data, or has the data been massaged or normalized?

* Updates: Is this a one-time data snapshot, or is the data being updated on a regular basis?

The value of data depends on context. Some data sets will decay with time while others remain fresh. How data is applied, whether on campus or in a company, determines value. It doesn't exist in a vacuum. The advantage of Big Data lies in the data mining tools that now exist to pull value out of previously incomprehensible mounds of information.


The first challenge confronting information professionals is finding appropriate data sets. Governments love to collect data. The U.S. government alone is responsible for thousands of different data sets. lists 261,077 data sets, which have the advantage of being free and easily downloadable. The downside is that many have not been kept up-to-date. You are frequently better off going directly to the government agency most likely to have the data set you want.

The U.S. Census Bureau Economic Indicators collects data on a monthly basis for both wholesale and retail businesses, downloadable as Excel spreadsheets. Data from the U.S. Department of Agriculture is downloadable in XML or JSON formats and originates with the Agricultural Marketing Service, the Economic Research Service, the Foreign Agricultural Service, the National Agricultural Statistics Service, the Natural Resources Conservation Service, Rural Development, and the World Agricultural Outlook Board. Bank data and statistics, derived from bank call reports, are available from the Federal Deposit Insurance Corp. (FDIC).

The Federal Reserve Bank of St. Louis offers FRED (Federal Reserve Economic Data), a fully searchable compendium of economic data sets. Data can be high level, such as electricity per kilowatt hour in U.S. cities, or extremely specific, such as anthracite coal prices for New York. FRED is not limited to U.S. data: you can see the U.K.'s Consumer Price Index starting with 1960, Danish government bond yields starting with 1987, and GDP for Senegal. FRED is an astounding resource that should not be overlooked in the search for external data sets.

The U.S. government does not have a monopoly on data collection and data set creation. The Organisation for Economic Co-operation and Development (OECD) makes data gleaned from its research reports freely available. The data relates to OECD countries and, sometimes, to selected non-member countries. One data set on the OECD site is particularly interesting. Unlike data sets in which the data originates with the OECD, its International Cartels Database ( OECD_HIC) is the long-running project of a Purdue University professor, John Connor. He sold it to the OECD in 2017. But it also resides within Purdue's institutional data repository. Both places make the data set freely available.

Eurostat, the European Union's statistical office, primarily covers data from EU countries. The International Monetary Fund (IMF; imf.org) and the World Bank collect and normalize data from all over the world. The IMF concentrates on economic and financial indicators, such as lending, exchange rates, debt, and trade statistics. The World Bank's DataBank is a visualization and analysis tool for worldwide development data.

If it's general training data you're after, try Kaggle (kaggle.com), which has more than 19,000 public data sets and 200,000 Jupyter notebooks. The data is wide-ranging, from avocado prices to blockchain to patents to World Development Indicators to sports scores. It also steers you to minicourses on various data analysis and data science tools. Google has also gotten into the data set act via its Cloud division ( sets). Its collection of public data sets can be used by companies to combine with internal data stored on the cloud to derive new insights. Some of the data sets are available without cost, but Google charges for large queries.


Data sets generated by researchers that power articles published in scholarly journals may or may not be available for data mining. Publisher licenses too frequently prohibit data mining, even for the information that exists in databases to which a library already subscribes.

To do effective data mining of financial, economic, news, or company data, researchers need to download a huge amount of information from commercial databases. In general, the companies that produce these databases, including Factiva, LexisNexis, and the like, are unhappy about supplying Big Data without a hefty, often unaffordable, price tag--and sometimes they simply outright refuse. As technology to effectively mine Big Data becomes more prevalent and joins the basic toolset of researchers, we can only hope that database producers will modify their licenses to accommodate modern research methods. Certainly, they are under pressure to do just that.


The landscape of available data from free sources is not static. Keeping track of it for you is the Association of Public Data Users. As an advocacy group for better public data, it sends a weekly email with feature articles, news, and updates about data emanating from government, nonprofits and foundations, and higher education.

If you accept the premise that data is the new oil, then the opportunities for information professionals to acquire, curate, analyze, and explain data are immense. Our oil wells of data are indeed gushing with promises of new approaches to our jobs and increased appreciation of our skills. James Dean, in the 1956 movie Giant, celebrated striking oil by showing up covered in "black gold." It's an apt image for us, although we'll be covered in data.

Marydee Ojala

Editor-in-Chief, Online Searcher

Marydee Ojala is editor-in-chief of Online Searcher.
COPYRIGHT 2020 Information Today, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2020 Gale, Cengage Learning. All rights reserved.

Article Details
Title Annotation:the dollar sign
Author:Ojala, Marydee
Publication:Online Searcher
Date:Jan 1, 2020
