An Overview of Big Data: Concept, Frameworks and Research Issues

INTRODUCTION

A. What is Big Data?

As the term suggests, Big Data refers to techniques for dealing with data at very large scale. It is not a technology that stands alone; rather, it is a combination of other techniques such as cloud computing and information storage and retrieval. Four important characteristics distinguish Big Data from everything else:

* Volume--The amount of data to be processed

* Velocity--The speed at which the data is processed

* Variety--The type of the data to be processed

* Veracity--The exactness of the information provided by the analysis

The first three Vs were proposed in [1], and the last was added by IBM [2]. As noted in [3], the first two Vs are easier to measure than the others; for this reason, Variety and Veracity have been left largely unattended, at least as of now.

From the perspective of Volume, the amount of digitally stored data is increasing rapidly, as documented in [4], so a great deal of data is now available in every field. Velocity generally depends on computing power; in Big Data, however, it is affected not only by computing power but also by the diversity of the data sources. Variety means that Big Data deals with different types of data, which can be broadly divided into three categories:

* Structured Data

Data with a defined length and format. Examples include sensor data, point-of-sale (POS) data, and clickstream data.

* Semi Structured Data

Data that falls between structured and unstructured data, often organized as simple name-value pairs. Weblog data, XML, and SWIFT data belong to this category.

* Unstructured Data

Data that does not follow any predefined format. Examples include images, videos, and some scientific data.

According to [5], only 20-30% of data is structured and the rest is unstructured. 84% of IT companies process this unstructured data, which raises problems of latency, scalability, and infrastructure.

Architecture of Big Data:

This section discusses the basic frameworks that bring Big Data into reality. The two best-known frameworks, Hadoop and Spark, are explored here along with their respective pros and cons.

A. Apache Hadoop:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models [6]. Apache Hadoop 0.14 was first released in 2007, and the current version, 2.7, includes many changes from the base version. The general architecture of Hadoop is as follows.

As shown in Figure 1, the Hadoop architecture includes a number of components, each responsible for its own process. The base of the architecture is the Hadoop Distributed File System (HDFS), which is inspired by the Google File System (GFS).

B. HDFS:

The Hadoop Distributed File System (HDFS) is a distributed file system originally built to support Apache's Nutch search engine project. It is the primary storage used by Hadoop applications, storing data over a cluster of commodity hardware. The file system is highly available, fault tolerant, scalable, and simple [7]. A basic HDFS cluster consists of two kinds of components, the Name Node and the Data Nodes. The Name Node manages the file system metadata, while the Data Nodes store the actual data. In this master/slave architecture, each cluster has one Name Node, the master, and a number of Data Nodes, the slaves. A client first contacts the Name Node, which processes the metadata request; the actual I/O operations are then carried out against the Data Nodes to complete the request.
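To make the client/Name Node interaction concrete, the following is a minimal Python sketch using the pyarrow library; the host name, port, and path are illustrative assumptions, and the client machine must have libhdfs available.

    # A minimal sketch of an HDFS client interaction; the host, port, and
    # path are illustrative assumptions, and pyarrow needs libhdfs installed.
    from pyarrow import fs

    # The client first contacts the Name Node, which resolves the metadata.
    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

    # The file's blocks are then written to Data Nodes chosen by the Name Node.
    with hdfs.open_output_stream("/user/demo/sample.txt") as out:
        out.write(b"hello hdfs\n")

    # Reads follow the same pattern: metadata from the Name Node,
    # block contents from the Data Nodes.
    print(hdfs.get_file_info("/user/demo/sample.txt").size)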

C. Map Reduce:

Hadoop follows a programming paradigm called MapReduce. With this technique, the input is divided into a number of chunks (typically <key, value> pairs), which are processed by a Map function. The output produced by the Map function (also a set of <key, value> pairs) is passed as input to a Reduce function.
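As a concrete illustration, the following self-contained sketch runs a word count through the map, shuffle, and reduce stages; it is a pure-Python illustration of the paradigm, not Hadoop's own Java API.

    # A minimal, self-contained word-count sketch of the MapReduce flow
    # (a pure-Python illustration of the paradigm, not Hadoop's Java API).
    from collections import defaultdict

    def map_fn(key, line):
        # Map: emit a <word, 1> pair for every word in the input chunk.
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Reduce: combine all values that share the same key.
        yield word, sum(counts)

    chunks = ["big data is big", "data is data"]  # illustrative input

    # Shuffle: group the mapped pairs by key before reducing.
    groups = defaultdict(list)
    for key, line in enumerate(chunks):
        for word, one in map_fn(key, line):
            groups[word].append(one)

    for word, counts in groups.items():
        for k, v in reduce_fn(word, counts):
            print(k, v)  # big 2, data 3, is 2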

D. Hadoop YARN:

The next-generation MapReduce (MapReduce version 2, MRv2) is known as YARN and was introduced in Hadoop 2.x. It is essentially the scheduling and resource management layer of Hadoop. In MapReduce version 1 (MRv1), a component named JobTracker managed the clusters while multiple TaskTrackers ran on the Data Nodes [8]. In MRv2, the JobTracker and TaskTrackers are replaced by the Resource Manager, Application Masters, and Node Managers. Though jobs are executed differently, the basic MapReduce concept remains unchanged.

Other tools that are part of the Hadoop ecosystem include Hive, Pig, HBase, and Storm. Each of these tools serves its own purpose, as given in [6]:

* Hive--A data warehouse infrastructure for querying structured data

* Pig--A high-level data-flow language and execution framework

* HBase--A scalable, distributed database that supports structured data storage

* Storm--A distributed, real-time computing system

Berkeley Data Analytics Stack (BDAS):

An alternative to the Hadoop framework, designed to handle Big Data faster, was proposed by the AMPLab at the University of California, Berkeley. The stack consists of four layers: resource virtualization, storage, processing engine, and access and interfaces. The resource virtualization layer uses two components, Hadoop YARN and Mesos. Mesos, developed by the AMPLab, builds and runs distributed systems efficiently and with fault tolerance [9]. In the storage layer, the base is the file system, which can be any file system such as HDFS, Amazon S3, or Ceph. Tachyon [10] is a distributed storage system that enables memory-speed data transfer through a memory-centric design. As the stack shows, it is Hadoop compatible. Succinct [11] is a data store that makes it feasible to execute queries directly on compressed data.

A. Spark:

The primary drawback of Hadoop is data access: every time a Map process executes, the data has to be retrieved from disk. Consider the time needed to retrieve data from disks scattered around the world for iterative jobs. This issue created room for something faster than Hadoop. Spark overcomes the disk-access problem by keeping data in an in-memory cache, so for iterative processes Spark takes less time than Hadoop [12]. A performance comparison of Hadoop and Spark on iterative processes, as reported in [12], is given here.

In Hadoop, as the number of iterations increases, the running time grows rapidly. In Spark, by contrast, the difference in running time between 5 iterations and 30 iterations is relatively small.
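The difference comes down to caching. Below is a hedged PySpark sketch of an iterative job that reads and parses its input once and keeps it in memory; the input path and iteration count are illustrative assumptions.

    # A minimal PySpark sketch of an iterative job that benefits from caching
    # (the input path and iteration count are illustrative assumptions).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

    # Parse once and cache; without cache(), every action in the loop below
    # would re-read and re-parse the input from disk, as plain Hadoop does.
    values = (spark.sparkContext
              .textFile("hdfs:///user/demo/values.txt")
              .map(float)
              .cache())

    totals = []
    for _ in range(30):
        totals.append(values.sum())  # each pass now hits the in-memory copy

    spark.stop()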

Above the Spark layer sit a number of components used as interfaces, each with its own purpose. A few of them are listed here, followed by a short Spark SQL sketch:

* Spark SQL--Used to query structured data as a distributed dataset

* Spark Streaming--Used to process the streaming data

* SparkR--An interface that facilitates using Spark from R

* GraphX--Tool for graph computations

* MLlib--Spark's machine learning library
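As a small illustration of the first component, the following sketch registers an in-memory DataFrame and queries it through Spark SQL; the table name, columns, and rows are illustrative assumptions.

    # A minimal Spark SQL sketch (the table name, columns, and rows are
    # illustrative assumptions).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
    df.createOrReplaceTempView("people")  # expose the DataFrame to SQL

    spark.sql("SELECT name FROM people WHERE age > 30").show()
    spark.stop()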

The following table compares the tools used by each framework to accomplish the different tasks. Some components are compatible with the other framework to a certain extent, whereas others are not.

Research Issues:

Big Data is still not a single, settled concept; it evolves along multiple dimensions, which makes it a promising field for research. It comprises a number of subfields such as data mining, artificial intelligence, and mathematics. The immaturity of Big Data raises many research issues concerning, but not limited to, communications, security, privacy, and algorithms. This section discusses these issues as they relate to Big Data research.

A. Communication Perspective:

In Big Data, data is retrieved and processed from multiple locations around the world using parallel computing. All of this data must be collected and processed within a given time, which requires efficient communication mechanisms to ensure that the data is collected and processed as intended. From a communication perspective, research can be carried out to reduce communication cost and to ensure consistency in the communication between nodes.

B. Security Perspective:

Since Big Data involves a great deal of data about users (for user behavior analysis), companies (for market analysis), systems, and sensors, security plays a major role in keeping the data safe from unauthorized access and modification. Security issues can be addressed in multiple phases, including data gathering, data processing, and output modeling.

As noted in [13], only a small number of publications address security issues in Big Data. Considering this fact, the field clearly offers rich opportunities for research.

C. Privacy Perspective:

Privacy has been the most heavily studied concern since the dawn of data mining itself, and a vast number of articles address privacy issues. Since the data used for analysis is sensitive in every respect, it is mandatory to keep its privacy intact. A person's identity can often be recovered even when the collected data is claimed to be anonymous. It is the responsibility of whoever collects and processes the data to maintain the privacy of the persons or companies represented in the dataset.

D. Algorithm Development:

Algorithms such as clustering, classification, and pattern mining were developed specifically for data mining and are not suited to the nature of Big Data. The reason is that in Big Data, data collection and processing are carried out with parallel computing techniques, while data mining algorithms are heavily centralized. If these centralized algorithms are used in a Big Data environment, synchronization between processes becomes a problem: because each process may complete at a different time, it is difficult to synchronize the outputs of the processes (nodes).
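As an illustration of the recasting this calls for, the sketch below runs a few iterations of one-dimensional k-means in PySpark: the assignment step is a parallel map, the per-cluster sums are a parallel reduce, and recomputing the centers on the driver is exactly the synchronization barrier discussed above. The points, initial centers, and iteration count are illustrative assumptions.

    # A hedged sketch of k-means recast in map/reduce style with PySpark
    # (points, initial centers, and iteration count are illustrative).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-kmeans").getOrCreate()
    points = spark.sparkContext.parallelize(
        [1.0, 1.2, 0.8, 7.9, 8.1, 8.3]).cache()

    centers = [0.0, 10.0]  # arbitrary initial centers
    for _ in range(5):
        cs = centers  # snapshot shipped to the workers for this iteration
        # Map: assign each point to its nearest center, in parallel.
        assigned = points.map(lambda x: (
            min(range(len(cs)), key=lambda i: abs(x - cs[i])), (x, 1)))
        # Reduce: compute per-center sums and counts across the nodes.
        sums = assigned.reduceByKey(
            lambda a, b: (a[0] + b[0], a[1] + b[1])).collectAsMap()
        # Synchronization barrier: the driver must wait for every node
        # to finish before the new centers can be computed.
        centers = [sums[i][0] / sums[i][1] if i in sums else centers[i]
                   for i in range(len(centers))]

    print(centers)  # converges toward the two cluster means
    spark.stop()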

Conclusion:

In this paper, we reviewed the concept of Big Data, two architectures, and a few research areas. For researchers who want to begin working on Big Data, this clear introduction to the concepts may be helpful. In terms of architectures, though Spark outperforms Hadoop, both have their own pros and cons, and the choice must be made wisely according to preference and the availability of tools in each architecture. From a research point of view, combining parallel computing techniques with data mining algorithms should yield better results in Big Data research.

REFERENCES

[1] Gartner Group, 2011. "Pattern-Based Strategy: Getting Value from Big Data," press release, July 2011, available at http://www.gartner.com/it/page.jsp?id=1731916.

[2] IBM, "The 4 V's of Big Data," http://www.ibmbigdatahub.com/tag/587.

[3] Jagadish, H.V., 2015. Big Data and Science: Myths and Reality, Big Data Research, 2(2): 49-52, ISSN 2214-5796, http://dx.doi.org/10.1016/j.bdr.2015.01.005

[4] Hilbert, M. and P. López, 2011. "The world's technological capacity to store, communicate, and compute information," Science, 332(6025): 60-65.

[5] Douglas, K., 2012. "Infographic: big data brings marketing big numbers," http://www.marketingtechblog.com/ibm-big-data-marketing/.

[6] https://hadoop.apache.org/

[7] http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

[8] "Yarn Essentials", AmolFasalae, Nirmal Kumar, PACKT publiching, ISBN 978-1-78439-173-7 P39-42.

[9] http://mesos.apache.org/

[10] http://tachyon-project.org/

[11] http://succinct.cs.berkeley.edu/wp/wordpress/

[12] https://amplab.cs.berkeley.edu/projects/spark-lightning-fast-cluster-computing/

[13] Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao and Athanasios V. Vasilakos, 2015. "Big data analytics: a survey", Journal of Big Data, DOI 10.1186/s40537-015-0030-3

(1) Ponsuresh Manoharan, (2) D. Pradeep, (3) C. Sundar

(1) Assistant Professor, K.L. University, Vijayawada, Andhra Pradesh

(2) Assistant Professor, Christian College of Engineering and Technology, Oddanchatram, Tamilnadu.

(3) Professor, Christian College of Engineering and Technology, Oddanchatram, Tamilnadu.

Received 18 January 2017; Accepted 22 March 2017; Available online 28 March 2017

Address For Correspondence: Ponsuresh Manoharan, Assistant Professor, K.L. University, Vijayawada, Andhra Pradesh

Caption: Fig. 1: Hadoop Architecture

Caption: Fig. 2: Layers of Stacks

Caption: Fig. 3: Running time
Table I: The comparison of tools

Attribute                  Hadoop                  Spark

File system                HDFS, S3, Blob, Swift   HDFS, S3, Ceph
Querying structured data   HiveQL                  Shark (Spark SQL)
Machine learning           Mahout                  MLlib
Streaming Data Analysis    Hadoop Streaming        Spark Streaming