Big data and R&D management: a new primer on big data offers insight into the basics of dealing with "uncomfortable data"--data that is too large or too unstructured to be accommodated by a firm's existing processes.
So what is "big data"? According to the findings of an Industrial Research Institute (IRI) Research working group focused on digitalization in R&D, the term has no easy definition (Alexander, Blackburn, and Legan 2015). Big data is difficult to pin down because it is not a particular technology or approach. It is everywhere--in the ubiquitous sensors that power the Internet of Things, in the incalculably huge data streams generated by social media, even in usage data generated by cellphone apps--and nowhere; no one really knows how to operationalize it, what its role should be, or even who is responsible for it. For many managers, big data is an unknown "thing" they are told to incorporate into their projects without any clear notion of what that thing is or how it might contribute. In this way, big data is, as the research group began to refer to it, "uncomfortable data."
Describing big data as uncomfortable highlights the way data's traditional role in organizations is being challenged both by the unprecedented scale of big data and the opportunities it presents and by the technologies that support it. Large, continual data streams are inherently uncertain, and the data streams available today, which differ in both scale and type from the kinds of data companies previously dealt with, often don't fit into traditional analytic frameworks. The sheer size of these streams means that an analyst working with them has no reliable way to verify the quality of the data or eliminate structural bias that can affect the results. In this world, "decisions become probabilistic--the data tells us what we think we know to be true, but we can't be completely sure" (Alexander, Blackburn, and Legan 2015, p. 24).
Clearly, big data will have implications for R&D, as it suggests new ways in which companies and products interact with consumers and with each other. Working with it "requires new technologies and new approaches to enable organizations to use data effectively in improving decision-making and operations" (Alexander, Blackburn, and Legan, pp. 4-5). Recognizing the wide-ranging implications of the rise of big data, IRI included in its Digitalization and R&D research platform a working group to explore the concept of big data, with the goal of building a deeper understanding of its implications for product innovation and R&D management (see "IRI's Digitalization and R&D Management Research Platform," p. 24).
In pursuit of this goal, the project team surveyed the existing literature and interviewed both R&D executives and big data practitioners, seeking to understand what big data is, what tools are needed to harness it, and how R&D practitioners are thinking about it. The product of this work, Big Data and the Future of R&D Management, is intended to serve as a primer on the topic for R&D practitioners. The book offers an overview of the world of big data and aims to help R&D professionals grasp the opportunities and challenges it may present for innovation.
Defining Big Data
The working group, whose membership consists of R&D management practitioners and academic subject matter experts, began its study by attempting to grasp the term big data. This first attempt met with an unexpected challenge: although the term has become pervasive, it lacks a single definition--it is not meaningful to everyone in the same way. Confounded by the proliferation of definitions and qualifiers around the concept, the team abandoned the attempt to define it. Indeed, given that big data has acquired such importance in spite of this nebulousness of definition, the group decided a single, concrete definition wasn't really a desirable outcome. The more useful approach, they decided, was "to look at the characteristics of what we call Big Data, and how those characteristics relate to the way that organizations are accustomed to using data" (Alexander, Blackburn, and Legan, p. 2).
One of those characteristics is obviously size: how big is big? Tools differ, companies differ, and the data sets being analyzed differ in size, scope, and complexity. At what point does an organization's data become "big"? Answers to that question, the team found, varied widely. Companies accustomed to working with massive data sets view big data as data whose scale far exceeds the capabilities of current state-of-the-art data management technologies. For smaller companies, big data may simply be a data set larger than can be managed using a spreadsheet or database program. The point, summarized best by Bill Pike of PNNL in an interview conducted by the research team, is that "Big Data is data of sufficient size and complexity to challenge contemporary analytical techniques" (Alexander, Blackburn, and Legan, p. 2). What is contemporary for one group or company may be outdated or insufficient for another.
But big data isn't only about size; rather, "Big Data as a concept is really a confluence of a whole set of trends in computing, information processing, computational methods, and analytical tools" (Alexander, Blackburn, and Legan, p. 2). The variety of these concepts is immense (see "The Vocabulary of Big Data," p. 25), and their application varies. Generally, though, big data can be characterized by five attributes that make it difficult or impossible to process using conventional tools and approaches:
1. Volume--Big data is, obviously, big.
2. Variety--Big data is generated from many sources and has a wide variety of different characteristics.
3. Velocity--Big data is generated continuously in near-real time and accumulates rapidly.
4. Variability--Big data is not necessarily delivered in one steady stream but will arrive or be discovered in vastly different timeframes.
5. Veracity--Big data comes from sources that may not be entirely trustworthy, like social media.
The first three of these attributes--volume, variety, and velocity--were defined by Doug Laney in a 2001 working paper for Meta Group. The two additional characteristics--variability and veracity--speak to a primary source of the uncomfortableness of big data: its lack of any clear foundation for understanding and assessing its trustworthiness. For the research group, "The issue of trustworthiness highlights an important issue raised by Big Data--the degree to which the scale, scope, and sophistication of Big Data analysis can overcome the potential pitfalls inherent in dealing with high-velocity, diverse datastreams" (Alexander, Blackburn, and Legan, p. 4).
Trustworthiness is no small matter. Data informs decision making at all levels of a business. If the data on which decisions are based is untrustworthy, the decisions themselves will be unreliable. Decisions based on data that proves inaccurate, distorted, or otherwise deceptive can lead the company in a bad strategic direction, causing irreparable harm to the organization or even to society as a whole. Thus, creating analytic mechanisms that account for gaps in the veracity of data is a top challenge among data scientists. However, there is a shortage of data scientists trained to build mechanisms of this sort. Taken together, the lack of trustworthy tools and the relatively small pool of talented data scientists to implement them mean that businesses must be careful to understand the reliability of the data they are depending on--and be prepared to live with big data discomfort.
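To make the volume and velocity attributes more concrete: when a stream is too large or too fast to store in full, an analyst can still maintain summary statistics one observation at a time, along with a rough measure of how uncertain those summaries are. The following is a minimal sketch in Python using Welford's online algorithm, a standard one-pass technique chosen here purely for illustration; it is not a method described in the IRI primer, and the class name RunningStats is hypothetical.

```python
import math


class RunningStats:
    """One-pass mean and variance (Welford's algorithm).

    Illustrative only: the class and its names are assumptions made
    for this sketch, not part of any tool discussed in the article.
    """

    def __init__(self) -> None:
        self.count = 0      # observations seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the summary without storing it."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two observations arrive."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

    @property
    def std_error(self) -> float:
        """Standard error of the mean, a rough gauge of uncertainty."""
        if self.count < 2:
            return float("inf")
        return math.sqrt(self.variance / self.count)


# Usage: simulate a stream by feeding readings one at a time.
stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(reading)
```

The summary occupies constant memory however long the stream runs, and the standard error gives a simple way to express the "probabilistic" character of conclusions drawn from it. It does nothing, however, to detect the structural bias or untrustworthy sources discussed above.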
The Tools of Big Data
Big data is made possible by a number of technological developments--affordable sensors, ubiquitous connectivity, and social media, to name a few. Its use in business and research is made possible by breakthroughs in data management, including tools like MapReduce and Hadoop, as well as emerging analytical systems and database tools (see "Big Data Tools," below).
Those tools are often another source of discomfort, as they are at the cutting edge of the IT field. Clay Heaton, a researcher for RTI International, noted in an interview with the IRI team that the tools he uses are typically experimental and require extensive customization. This use of noncommercial, highly customizable data management systems often does not sit well with the traditional IT management team's desire for stable, off-the-shelf platforms and packages. Further, the methodologies used to analyze any given stream of data must shift and adapt as the data itself is constantly shifting. Given this radical instability, organizations looking to leverage big data, whether to drive innovation programs or to support new products, must be tolerant of experimental and creative approaches to solve the many challenges presented by data science.
Tackling these challenges is about more than technology. Just as the term big data encompasses a diverse set of practices and technologies, working with big data requires a skillset that is uncommon in many scientific work environments. The term "data scientist" is itself recent; it was brought to a wide business audience by a 2012 Harvard Business Review article by Tom Davenport and D. J. Patil. As defined by Davenport and Patil, "Data scientists ... are able to bring structure to large quantities of formless data and make analysis possible" (p. 72). Extracting answers from big data requires a particular kind of talent, one that is still somewhat rare. These individuals are commonly described as "pi-shaped," in contrast to the T-shaped people that populate many research groups. T-shaped people "have some degree of knowledge across a broad range of topics, but very deep knowledge in only one domain," the IRI Research Team explains. Data scientists, by contrast, "need to complement that domain knowledge with equally deep knowledge of the tools and methods of data science--statistics, coding, and data management" (Alexander, Blackburn, and Legan, p. 10).
The result is a pi-shaped individual (Figure 1)--one with depth in two distinct areas. Data scientists may resemble IT professionals in that they use coding and data management tools, but they are typically trained as scientists first. Often, they begin their careers performing research that requires handling of messy data sets, and that work precipitates the acquisition of data science skills. Their domain of expertise, instead of being IT- or computer science-related, is usually in a field like chemistry, physics, biology, or even sales or business administration; they develop big data skills out of a need to support their primary research. This trend is changing--many universities now offer data science courses and degrees, producing specialists whose primary training is in data management and analysis--but for now data scientists tend to be pi-shaped researchers.
[FIGURE 1 OMITTED]
The Implications of Big Data for R&D Management
Despite the magnitude of the challenge big data presents for organizations and researchers, the value of a big data platform that generates a stream of useful output cannot be overstated. The ability to process massive amounts of data from different sources and synthesize those data into a holistic view of what is going on has significant implications, not just for how business decisions get made but for how research is done. Until recently, scientific hypotheses were tested by designing and conducting experiments. The costs involved in such a research and development process were large and it could take years or decades to bear fruit--if it ever did. With big data tools, scientists can engage in virtual experimentation using data sets that represent real-world conditions and limitations. Material science organizations can simulate chemical reactions to understand the properties of new compounds before ever actually combining the elements. Physical experiments may still be necessary, but virtual studies can help limit and target that lab work, freeing up resources that might otherwise have been tied up in lengthy research cycles.
The scale of the change reaches beyond experimentation. In a 2011 article in Science, researchers James Evans and Jacob Foster described how machine analysis of the scientific literature could identify hidden biases in how researchers in a given domain select, design, and conduct experiments, pinpointing where a gap in knowledge may exist. Such algorithmic discoveries, called "machine-generated hypotheses" by Evans and Foster (2011), could guide research to previously unexplored terrain, opening entirely new lines of inquiry. These technologies are still in development, however, and the tools needed to produce accurate, actionable intelligence are still far behind where they need to be to capture the full value of these new techniques. Still, one thing is clear: big data is poised to change how R&D is managed.
The big data research group identified three ways big data will change R&D management. First, with more complete and accurate data on what is happening in the market and in the lab, executive teams will have the information they need to make more efficient, more refined, and more accurate strategic decisions. Predictive analytics will make it possible for organizations to predict and navigate market forces, delivering the right products to the right consumers at the right time. Big data will also help organizations reconcile the perennial conflict between short- and long-term R&D goals, providing the building blocks of a stronger case for long-term research.
Second, big data will enable new approaches to conducting research and development, such as the use of machine-generated hypotheses. Data modeling--algorithms and programs that process and analyze real-world conditions in real time--and virtual experimentation will release R&D from the constraints imposed by the need to construct real-world prototypes, allowing new products to be conceptualized, developed, and delivered in a fraction of the time and at a much lower cost.
Finally, big data will change R&D management through disruption. As organizations learn how to use big data, predictive analytics, and the associated tools to gain competitive advantage, organizations that don't have this capability will miss opportunities in the market. While some of this competitive advantage may come from the finely honed strategic decision making enabled by market-based data, it will also rest on unique products developed using big data approaches. These kinds of advantages will reach far beyond the obvious applications in the Internet of Things and "smart" products. For instance, the startup Hampton Creek Foods has developed a vegan mayonnaise that replaces the eggs in traditional mayonnaise with a synthesized protein; that protein was developed using data analytics to model egg protein and understand how it behaved in the emulsion process. As Big Data and the Future of R&D Management notes, "Hampton Creek exemplifies how the application of Big Data in an unconventional area could enable a small team of data scientists [to] take on large incumbents in the market by developing new products with completely novel tools and techniques" (Alexander, Blackburn, and Legan, p. 16). As big data tools and techniques mature, big data-based startups like Hampton Creek will become more commonplace. Companies that fail to get into the big data and data analytics game are going to fall behind rapidly once these tools enter the mainstream in industry.
Our modern society is sitting atop a mountain of data that is increasing in size and complexity with each passing second. The challenge of making sense of this data is enormous, but new tools and applications for analyzing it and putting it to work are emerging every day. As with any potentially game-changing development, companies that move quickly into the big data space--by hiring data scientists, giving them the tools and flexibility to explore the data in a variety of ways, and allowing them space for creativity in their analysis--stand to gain a significant advantage over late adopters. Scientific research and discovery are moving into digital testing environments, and researchers are uncovering new ways of gathering and analyzing information from a wide array of sources. In that world, individuals and companies equipped with the skills and technologies to analyze the massive flow of data emerging from simulations, the Internet of Things, social media, surveillance systems, and other sources of real-time data will be the only ones in the game capable of keeping pace with scientific advancement.
IRI's Digitalization and R&D Management Research Platform
The Big Data research working group is one of the groups working within IRI's research platform, Digitalization and R&D. Designed to explore a broad range of issues relevant to digitalization and its likely effects on R&D organizations today and in the future, the platform incorporates three research groups:
Big Data: This team is focused on understanding how big data will inform, enable, and disrupt R&D by examining it through the lenses of R&D strategy, human capital, technology, and process integration. The team's first deliverable, the e-book Big Data and the Future of R&D Management, aimed to help IRI members understand the concepts encompassed in the term big data, see how big data is already having an impact on organizations, and establish the framework for the next phases of the project. The final deliverable will be a maturity matrix to help organizations understand where they are in terms of their ability to access the power of big data, using information gathered from case studies.
Collaboration: This project team is comparing virtual and physical collaboration spaces with the object of identifying how collaboration may become boundaryless over time, as ecosystems of hyper-collaboration develop around cutting-edge tools and evolving best practices in technical and managerial approaches.
Virtual Experimentation & Simulation: This team is looking at the principles shaping virtual spaces and their uses with an eye toward identifying best practices for determining when a virtual environment should be employed over a physical one as part of R&D processes.
Each of these three teams is composed of R&D management practitioners, drawn from IRI member companies, and subject matter experts. Work kicked off in early 2015; all three teams will complete their work with report-outs at IRI's October 2016 Member Summit in Chicago, Illinois, followed by an RTM special issue on the topic in 2017.
For more information about the IRI research platform or any of its associated research groups, contact Lee Green (email@example.com). IRI members can download a free copy of Big Data and the Future of R&D Management at www.iriweb.org/research-big-data.
RELATED ARTICLE: The vocabulary of big data.
The origins of the concept of big data date back to the early 2000s, when physicists working at large particle colliders developed grid computing techniques to handle the enormous data streams generated by their work. Terms like "Fourth Paradigm Science" (coined by Jim Gray of Microsoft) emerged to describe the practice of conducting scientific experiments entirely with data generated by other experiments. From there, grid computing systems, which consisted of "highly-distributed infrastructures for data storage, processing, and access" (Alexander, Blackburn, and Legan, p. 2), were adopted by data-reliant firms like Google and Yahoo! that needed their massive data storage and management capabilities. To better traverse the grids of data generated by this new computing model, Google built MapReduce and its Google File System. From there, Yahoo! took the MapReduce programming model and provided it to the open-source community via Hadoop, a versatile, easy-to-use implementation for MapReduce programs. New tools have been emerging ever since, and more arrive each year.
This history has generated a common set of tools and terms that shape the conversation around big data:
* Cloud computing--Data management and storage within a virtual environment hosted seamlessly on an array of remote or distributed servers.
* Datafication--The practice of capturing an increasing number of aspects of social and physical phenomena as digital data.
* Found data--Data used for a different purpose than the one for which it was originally generated, such as credit card transaction data that is analyzed to map consumer purchasing patterns.
* Machine learning--Modes of analysis in which automated algorithms process huge data sets and produce results that are interpretable by humans.
* Open data--Readily available data typically found in the public domain.
* Predictive analytics--The use of machine learning and related techniques to generate predictions about future events.
* Social media--Online nodes through which very large sets of users interact spontaneously, such as Facebook, Twitter, or LinkedIn, which provide massive sets of data about human interactions.
* Ubiquitous sensing--Using sensors across a wide array of environments to collect detailed data from a very large number of observations.
* Unstructured data--Information that either does not have a predefined data model or is not organized in a predefined manner. Typically, unstructured data is text-heavy, but it may also contain dates or numbers.
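As a small illustration of the last entry, the sketch below pulls the dates and numbers out of a free-text note so that they can be handled as structured fields. It is a toy example under stated assumptions: the function name extract_fields, the ISO-style date format, and the sample note are all hypothetical, and real unstructured data is far messier than two regular expressions can handle.

```python
import re
from typing import Dict, List, Union

# Assumption for this sketch: dates appear as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Integers or simple decimals such as 3 or 21.5.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def extract_fields(note: str) -> Dict[str, Union[str, List[str], List[float]]]:
    """Return the dates and numbers found in a free-text note.

    Dates are removed before numbers are searched for, so the digits
    inside a date are not counted a second time as numbers.
    """
    dates = DATE_PATTERN.findall(note)
    remainder = DATE_PATTERN.sub(" ", note)
    numbers = [float(match) for match in NUMBER_PATTERN.findall(remainder)]
    return {"text": note, "dates": dates, "numbers": numbers}


# Usage with a made-up note.
record = extract_fields(
    "On 2016-09-01 the sensor reported 21.5 degrees across 3 sites."
)
```

The original text is kept alongside the extracted fields, since the structure recovered this way is only a partial view of what the note says.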
RELATED ARTICLE: Big data tools.
The tools employed to make use of big data streams are many and varied, but the IRI Big Data team identified a set of common tools that appear frequently in discussions about the field. These tools are largely either programming languages or models for building software that can make sense of big data (such as MapReduce, MALLET, R), implementation frameworks for launching such software (such as Hadoop, Azure), or ways of visualizing data (such as D3.js).
* Azure is a cloud computing platform and infrastructure created by Microsoft for building, deploying, and managing applications and services through a global network of Microsoft-managed data centers. It integrates Hadoop into SQL Server along with Active Directory and Microsoft System Center, allowing for large-scale data storage and processing in the cloud.
* Hadoop is an open-source software framework for distributed storage and processing of very large data sets on computer clusters built from commodity hardware. The core of Hadoop consists of two parts: a storage component known as Hadoop Distributed File System (HDFS) and a processing component called MapReduce.
* Hive is a data warehouse infrastructure developed by Facebook that can sit atop Hadoop and perform queries on information stored in HDFS. It comes with its own query language, HiveQL (similar to SQL), and transparently converts HiveQL queries to MapReduce jobs.
* MALLET, or MAchine Learning for LanguagE Toolkit, is a Java-based package for statistical natural-language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications. It is used primarily for analysis of text and other unstructured data.
* MapReduce, invented by Google in the early 2000s, is a programming model, with an associated implementation, for processing and generating large data sets on a cluster. It is built on two operations familiar from functional programming: the Map() function, which filters and sorts data, and the Reduce() function, which performs a summary operation.
* R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. It is primarily used in developing statistical software and data analysis tools.
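The Map() and Reduce() functions described in the MapReduce entry can be sketched in a few lines of ordinary Python. This is a single-process illustration of the programming model only, not Hadoop's or Google's actual API; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical, and word counting is used simply because it is the customary teaching example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map(): filter out empty tokens and emit a (word, 1) pair per word."""
    for token in document.lower().split():
        word = token.strip(".,;:!?\"'()")
        if word:
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key and sort the keys.

    On a real cluster the framework performs this step between the
    map and reduce phases; here it is an ordinary dictionary.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce(): summarize all values for one key, here by summing them."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over a set of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Usage with two tiny made-up documents.
counts = word_count(["Big data is big", "data is everywhere"])
```

What makes the model useful for large data sets is that each call to map_phase and each call to reduce_phase is independent of the others, so a framework can spread them across many machines; this sketch runs them one after another on a single machine.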
Greg Holden is the business writer and social media manager for the Industrial Research Institute and a regular contributor to Research-Technology Management. He holds a BS in political science and an MA in Middle East history and is currently studying software engineering at the University of Maryland, firstname.lastname@example.org.
References
Alexander, J., Blackburn, M., and Legan, D. 2015. Big data and the future of R&D management: Deliverable 1-A primer on big data for innovation. Industrial Research Institute. http://www.iriweb.org/sites/default/files/Big%20Data%20Primer_O.pdf
Columbus, L. 2015. Where big data jobs will be in 2016. Forbes/Tech, November 16. http://www.forbes.com/sites/louiscolumbus/2015/11/16/where-big-data-jobs-will-be-in-2016/#4848dcd9f7f1
Davenport, T. H., and Patil, D. J. 2012. Data scientist: The sexiest job of the 21st century. Harvard Business Review 90(10): 70-76. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Evans, J. A., and Foster, J. G. 2011. Metaknowledge. Science 331(6018): 721-725.
Howe, B. 2015. A confluence of big data skills in academic and industry R&D. Presentation given at the IRI Annual Meeting, Seattle, Washington, April. Available on Slideshare as "Big Data Talent in Industry and R&D," http://www.slideshare.net/billhoweuw/iri-meeting
Laney, D. 2001. 3D data management: Controlling data volume, velocity, and variety. Application Delivery Strategies, February 6. Meta Group, File 949. https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., and Hung Byers, A. 2011. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, May. http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
Title Annotation: RESEARCH NOTE; research and development
Date: Sep 1, 2016