
A survey on Big Data and its various V's, with a comparison of Apache Hadoop and Apache Spark.

I. Introduction:

In earlier days, data was generated only by certain groups or organizations, such as newspapers and television, and everyone else was a consumer of that data. With the rise of the Internet, this unidirectional flow became bidirectional: a user can now be both a generator and a consumer of data. Internet use has become commonplace, and nearly half of the world's population is now online. [1] proposed a model for identifying frequent patterns in the different kinds of Big Data that flow over the Internet. People are not only Internet users but also carry out their day-to-day activities through it, which generates a huge volume of data of different types (variety) within a fraction of a second (velocity), and some of this data is uncertain (veracity). Computing over such data is not easy, because a traditional DBMS, RDBMS, or ORDBMS does not provide sufficient support. This problem can be addressed with Hadoop, using Map and Reduce and its ecosystem tools, or with Spark. The four-layer architecture of OHBDA [2] comprises a storage layer, an online and historical data processing layer, an analytics layer, and a decision-making layer for classifying and processing Big Data. These modern frameworks are designed specifically for the challenges of Big Data. Hadoop is best suited to batch-oriented tasks and works on the Map and Reduce model; the performance of MapReduce, including its use of cache memory, has been accelerated with the SSD-empowered approach of [3]. For streaming data and iterative processing, however, Hadoop offers very little support, and Spark was introduced to overcome this limitation. Separately, [9] reduces bandwidth for multimedia Big Data through on-demand services with a BoD broker, and reduces cost by using a DWDM backbone.
The advantages of Hadoop MapReduce and Spark's in-memory processing over Oracle are shown in Table I. [4] describes the use of Agile methodology in Big Data analytics by comparing traditional and modern approaches to catalogs in web-based systems that hold data. Although Hadoop and Spark help to process Big Data, the data centers that hold the data are equally crucial; [5] monitors them efficiently through adaptive sampling, using cross-validation and a Kalman filter block.

II. Rate Of Dataflow In A Minute:

Over the last two decades, data generation has been increasing rapidly. As of June 30, 2016, around 48.7% of the world's population was using the Internet, a growth of 890.8% over the period 2000-2016. Data can be represented graphically through social network analysis [7]: social network analysis (SNA) shows the interactions among actors (data) that communicate as individuals in a group [7]. Because storing Big Data requires physical storage, it increases the use of cloud computing and its resources for various applications [8]. [6] deals with visualizing data and enabling real-time interaction with data in motion, through different layers and an incremental approach to graphical visualization.

This indicates the huge volume of data flowing over the Internet. The number of Internet users also rose from 2 billion in 2011 to 3.57 billion in 2016. Interestingly, more than 80% of this data is unstructured and nearly 20% is structured. [10] describes how to store such volumes of data with proper resource allocation in cloud infrastructure for Big Data applications. With this volume of data, success in driving a business requires fast, trustworthy, authentic, and deep insight into the data. Such insight can be achieved with Hadoop or Spark, depending on the nature of the input. Recommendation models help people in many ways: [28] provides personalized travel sequence recommendation through topical package space construction, and [32] exploits common author relations through CARE to recommend scientific articles.

Some facts about the data generated on the Internet every minute: more than 347,000 tweets are posted, compared with 250,000 per minute in 2011; nearly 4 million Google searches are conducted, compared with more than 2 million per minute in 2011; and more than 300 hours of video are uploaded to YouTube, compared with 70 hours in 2011. Since 2013, the number of Facebook posts shared each minute has increased by 22%, from roughly 2.5 million to 3 million; compared with around 650,000 posts per minute in 2011, this is an increase of more than 300 percent. Vine users play more than 1 million videos each minute. In bytes, a petabyte is 1,125,899,906,842,624, an exabyte is 1,152,921,504,606,846,976, a zettabyte is 1,180,591,620,717,411,303,424, and a yottabyte is 1,208,925,819,614,629,174,706,176. As these volumes grow every day, the data becomes very difficult to process with a relational database or similar tools. The KIRA approach [21] helps to classify feasible and infeasible astronomy images with low error and standard deviation, and [23] uses multimedia pivot tables to support insight and analysis in Big Data.
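The byte counts quoted above are binary powers, from 2^50 for a petabyte to 2^80 for a yottabyte. A minimal Python sketch that reproduces them follows; the names `BINARY_UNITS` and `size_in_bytes` are illustrative and do not come from the article.

```python
# Exponents of two for each binary unit; names are illustrative only.
BINARY_UNITS = {
    "Petabyte": 50,
    "Exabyte": 60,
    "Zettabyte": 70,
    "Yottabyte": 80,
}


def size_in_bytes(unit: str) -> int:
    """Return the number of bytes in one binary unit, i.e. 2**exponent."""
    return 2 ** BINARY_UNITS[unit]


for unit in BINARY_UNITS:
    # The ',' format spec inserts thousands separators.
    print(f"{unit}: {size_in_bytes(unit):,} bytes")
```

Each unit is 1,024 times the previous one, which is why the figures differ from the decimal (power-of-ten) definitions of the same prefixes.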

III. Characteristics Of Bigdata:

Giant companies such as Google, Yahoo, Facebook, and Twitter generate enormous amounts of data every second, and this data is structured, unstructured, or semi-structured. Predicting facts from this data is very challenging because of its various V's. Big Data poses challenges on every V; the top 10 V's are Volume, Variety, Velocity, Veracity, Validity, Value, Variability, Vagueness, Venue, and Visualization.

1. Volume:

Volume is measured by data size, which varies with the organization, the usage, and the data generated. Volume indicates the quantity of data and the storage medium required. Sizes range from bytes through kilobytes, megabytes, gigabytes, and beyond, as in Table 2. As the volume increases, the complexity of storing the data also rises.

$$s^{n} \propto c^{n} \tag{1}$$

where s = data size, n = times of data size, and c = level of complexity.

2. Variety:

Variety describes the different forms of data that arrive from different sources. The generated data may be structured (rows and columns), unstructured, semi-structured, or a combination of all three. Combining varieties raises the complexity of processing.

$$\sum_{1}^{i} v = c^{n} \tag{2}$$

where v = structural form of the data, i = level of combination, c = complexity, and n = level of complexity.
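To make the three forms of data concrete, the sketch below parses one hypothetical sample of each with the Python standard library. The sample strings and function names are assumptions made for illustration; they are not taken from the article.

```python
import csv
import io
import json

# Hypothetical sample inputs, one per form of data.
structured = "id,name\n1,alice\n2,bob\n"
semi_structured = '{"id": 3, "name": "carol", "tags": ["a", "b"]}'
unstructured = "dave posted a long free-text review with no fixed fields"


def parse_structured(text):
    """Rows and columns: every record has the same named fields."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_semi_structured(text):
    """Self-describing record: fields may nest or vary between records."""
    return json.loads(text)


def parse_unstructured(text):
    """No schema at all: fall back to plain whitespace tokenisation."""
    return text.split()


rows = parse_structured(structured)
record = parse_semi_structured(semi_structured)
tokens = parse_unstructured(unstructured)
```

Each form needs its own parser, which is one simple way to see why mixing varieties increases processing complexity.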

3. Velocity:

Data arrives at very high speed, and processing it at the same rate, before the next data arrives, is very difficult with traditional approaches. The rate varies with the type and size of the data. Hadoop MapReduce does not directly support iterative processing, which makes such workloads somewhat complex; Spark supports iteration and so offers a solution for high-speed data. For example, the number of tweets f(s) posted by users in each time interval k varies rapidly.

$$f(s) = \Delta \sum_{k=0}^{n} s \tag{3}$$
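One hedged reading of the tweet example is to count arrivals in fixed time windows and look at how the count changes from one window to the next. The sketch below does exactly that; the window size, the arrival times, and the function names are assumptions, and the code is an illustration rather than an implementation of equation (3).

```python
from collections import Counter


def counts_per_window(timestamps, window_seconds):
    """Count events per fixed-size window, keyed by window index.

    Assumes non-negative timestamps measured in seconds from time zero.
    """
    return Counter(int(t // window_seconds) for t in timestamps)


def rate_changes(counts):
    """Change in event count between each window and the one before it."""
    if not counts:
        return []
    last = max(counts)
    # Windows with no events count as zero.
    series = [counts.get(i, 0) for i in range(last + 1)]
    return [b - a for a, b in zip(series, series[1:])]


# Hypothetical tweet arrival times in seconds.
arrivals = [0.5, 1.2, 3.9, 10.1, 10.4, 10.8, 11.5, 25.0]
per_window = counts_per_window(arrivals, window_seconds=10)
changes = rate_changes(per_window)
```

With these sample arrivals the three windows hold 3, 4, and 1 events, so the change from window to window is +1 and then -3.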

4. Veracity:

Uncertainty is very common in Big Data because of its volume and its variety of types. It can be reduced through data preprocessing. Missing and duplicated data occur and have to be minimized; collectively this is known as dirty data.
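A minimal preprocessing sketch for the two kinds of dirty data named above, missing values and duplicates, is shown below. The record layout and the `clean_records` helper are hypothetical; real pipelines would usually add type checks and fuzzy matching.

```python
def clean_records(records, required_fields):
    """Drop records with missing required fields, then drop duplicates.

    Two records count as duplicates when all required fields match.
    The order of first appearance is preserved.
    """
    seen = set()
    cleaned = []
    for record in records:
        # Treat absent keys, None, and empty strings as missing data.
        if any(record.get(f) in (None, "") for f in required_fields):
            continue
        key = tuple(record.get(f) for f in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical dirty input.
raw = [
    {"user": "a", "city": "Chennai"},
    {"user": "a", "city": "Chennai"},   # duplicate
    {"user": "b", "city": ""},          # missing value
    {"user": "c"},                      # missing field
    {"user": "d", "city": "Coimbatore"},
]
clean = clean_records(raw, required_fields=("user", "city"))
```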

5. Validity:

Much Big Data comes from transactions and is usable only for a limited time, e.g., an OTP for banking and other validation-related applications.
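The time-limited nature of such data can be sketched as a value with an issue time and a validity window. The `TimedValue` class, the 120-second window, and the sample OTP below are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedValue:
    """A value that is usable only for a limited period after it is issued."""

    value: str
    issued_at: float    # seconds since some reference time
    ttl_seconds: float  # length of the validity window

    def is_valid(self, now: float) -> bool:
        """True while now is inside [issued_at, issued_at + ttl_seconds)."""
        return self.issued_at <= now < self.issued_at + self.ttl_seconds


# Hypothetical one-time password that stays valid for 120 seconds.
otp = TimedValue(value="493027", issued_at=1_000.0, ttl_seconds=120.0)
```

Passing the current time in as an argument, rather than reading the clock inside the method, keeps the check deterministic and easy to test.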

6. Value:

The quality of the data determines how useful it is for identifying needs. Sometimes unimportant data is mixed in with high-quality data and distorts the resulting view. Estimating the value of Big Data has a very high impact in areas such as healthcare and crime detection.

7. Variability:

The storage volume of the data varies with its type and format, and the variability differs depending on the data warehouse.

8. Vagueness:

Big Data helps to identify unknown things from known data. This process of learning the unknown leads to vagueness.

9. Venue:

The origin of the data and its use vary according to need. Data about a person's behavior can sometimes help to predict an upcoming disease and, represented differently, the crime rate of a particular location. Crime can be reduced once such locations are identified from historical data.

10. Visualization:

Understanding the data is essential, and viewing it in pictorial form is an important part of that. Visualizing Big Data reduces its complexity and makes it easier for others to understand. Visualization tools include Tableau, Excel, Elasticsearch, Logstash, and Kibana; each helps people to view the data.

IV. Processing Platform For Big Data:

The main challenge in Big Data is choosing a suitable processing platform and framework. Initially, Hadoop did much of this work by storing data in HDFS and processing it with Map and Reduce functions. Spark arrived later to address some of Hadoop's drawbacks. Hadoop processes batch-oriented data and applies Map and Reduce to every kind of task. Spark can also process streaming data (data in motion) in memory, and it is faster than Hadoop. Hadoop ecosystem tools include Pig, Hive, Sqoop, Flume, HBase, and Mahout.

One kind of data in the WoT is spatiotemporal interval data, which has its own characteristics and differs from ordinary time series; it can be clustered using cluster evaluation and spatial distance evaluation [11]. Related work includes two-phase region outlier detection in fusion plasma [36] and the KVASIR architecture for provisioning semantic content [22]. Predicting needed information from historical data in a dataset, or from a streaming source such as Twitter, sometimes requires a feature alignment model; [27] proposed an opinion association graph for online review analytics. Privacy-preserving sharing of distributed streaming data is achieved with shadow and hashed shadow coding [35]. [37] applies a machine learning algorithm with a rich context model (RCM) to analyze Twitter. [30] uses a classification model and a minimum correlation method to support real-time multimedia analytics for transmission and storage with reduced load.

1. Hadoop:

Hadoop is a framework with two main components: HDFS and MapReduce. MapReduce splits every job into a map phase and a reduce phase. Performance guarantees in MapReduce systems can be obtained through a systematic approach that builds a dynamic performance model for Big Data analytics [15], and [17] improves performance by using the metadata of related jobs in an enhanced Hadoop architecture. [24] improves MapReduce in heterogeneous computing environments with the JAS algorithm.
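The map and reduce split can be illustrated with the classic word count, written here in plain single-process Python rather than against the Hadoop API. The function names are illustrative, and the explicit grouping step stands in for the shuffle and sort that Hadoop performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(k, v) for k, v in shuffle_and_sort(mapped)]


result = word_count(["big data big insight", "data in motion"])
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and each reducer receives only the keys assigned to it; this sketch keeps everything in one list to show the data flow.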

HDFS is the storage layer. Files are split into blocks, 64 MB by default in Hadoop 1.x and 128 MB in Hadoop 2.x and later. The namenode holds the metadata for the stored files, and the blocks themselves are stored on datanodes, where they are processed. Each datanode sends a periodic heartbeat to the namenode (every three seconds by default) to signal that it is still alive and available, and the namenode uses the heartbeats to decide which datanodes receive blocks. The default replication factor is 3, and a user can change it as needed. Replication provides safety against node failure and lets processing run close to a copy of the data. Besides the namenode and datanodes, HDFS has a secondary namenode; job scheduling is handled outside HDFS, by the JobTracker and TaskTrackers in Hadoop 1.x and by the YARN ResourceManager and NodeManagers in Hadoop 2.x. These services check that processing is proceeding correctly and whether a task needs to be rescheduled elsewhere. MapReduce runs the computation on the datanodes with map and reduce functions, with shuffling and sorting taking place between the map output and the reduce function. One drawback of Hadoop is that every job must be expressed in map and reduce form, even when that is not a natural fit. Hadoop also stores data on disk, so reads and writes take more time.

[18] runs Hadoop MapReduce jobs on nearby mobile clouds to host data with better energy efficiency and fault tolerance. [33] shows that geographical and social factors, derived from users' mobile locations, can help to predict service ratings. [34] presents muffin, a smartphone application that tracks student attendance in a secure manner.
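The effect of block size and replication factor on storage can be worked through with simple arithmetic. The sketch below uses the 64 MB block size quoted above and the 128 MB default of Hadoop 2.x and later; the `hdfs_block_plan` helper and the 1 GiB file are hypothetical, and the calculation ignores metadata overhead.

```python
MB = 1024 * 1024


def hdfs_block_plan(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, total block replicas, raw bytes stored).

    The last block occupies only as much space as the data it holds, so
    raw storage is the file size multiplied by the replication factor.
    """
    # Integer ceiling division: a partly filled final block still counts.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    replicas = blocks * replication
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


# A hypothetical 1 GiB file under both block sizes.
one_gib = 1024 * MB
plan_64 = hdfs_block_plan(one_gib, block_size_bytes=64 * MB)
plan_128 = hdfs_block_plan(one_gib)
```

Halving the block size doubles the number of blocks the namenode must track (16 instead of 8 here), while the raw storage consumed stays at three times the file size in both cases.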

2. Spark:

The Spark engine is well suited to processing large-scale data very quickly. According to the Spark project, programs can run up to 100x faster than Hadoop MapReduce when data is held in memory and up to 10x faster when data is on disk. Spark supports multiple languages, including Java, Scala, R, and Python. It ships with a set of libraries: Spark SQL, Spark Streaming, MLlib for machine learning, and GraphX for graph-oriented tasks. It can run standalone, on Hadoop, on Mesos, or in the cloud, and it can access diverse data sources including Cassandra, HDFS, HBase, and S3. Spark's core data structure is the Resilient Distributed Dataset (RDD), which is fault tolerant. An RDD is read-only: each transformation creates a new RDD rather than modifying an existing one, which suits highly interactive and iterative workloads. Tasks are carried out in memory (RAM) and therefore take very little time.
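The RDD idea of read-only datasets, new datasets per transformation, and recovery from lineage can be sketched with a toy single-machine class. `ToyRDD` below is a hypothetical teaching aid and is not the Spark API; it only mimics the shape of `map`, `filter`, `cache`, and `collect`.

```python
class ToyRDD:
    """A toy, single-machine stand-in for the RDD idea (not the Spark API).

    Transformations never modify an existing dataset. They return a new
    ToyRDD that remembers its parent and the function to apply (lineage).
    """

    def __init__(self, source=None, parent=None, transform=None):
        self._source = tuple(source) if source is not None else None
        self._parent = parent
        self._transform = transform
        self._keep = False   # whether to hold the result in memory
        self._data = None    # the in-memory copy, once computed

    def map(self, fn):
        return ToyRDD(parent=self, transform=lambda items: [fn(x) for x in items])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda items: [x for x in items if pred(x)])

    def cache(self):
        """Ask for the result to be kept in memory after it is first computed."""
        self._keep = True
        return self

    def collect(self):
        """Action: compute the result by walking the lineage when needed."""
        if self._data is not None:
            return list(self._data)
        if self._parent is None:
            result = list(self._source)
        else:
            result = self._transform(self._parent.collect())
        if self._keep:
            self._data = result
        return list(result)

    def drop_cache(self):
        """Simulate losing the in-memory copy; lineage can rebuild it."""
        self._data = None


numbers = ToyRDD(source=range(1, 6))
squares_of_even = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
first = squares_of_even.collect()
squares_of_even.drop_cache()
rebuilt = squares_of_even.collect()
```

Nothing is computed until `collect` is called, and the result can be rebuilt from the original source after the cached copy is dropped, which is the intuition behind the fault tolerance attributed to RDDs.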

V. Practical Implementations:

Hadoop and Spark can be used in many applications to process large amounts of data of different types. These processing frameworks can perform many tasks and reduce processing time while delivering high throughput and performance. Application areas include social media, medicine, crime detection, research, and banking.

a. Transaction in E-Banking:

This field has a very large number of customers, and the amount of data is enormous. Dealing with this data is not easy for traditional processing techniques, which leads to Big Data approaches. Network analytics and natural language processing are used to catch illegal trading and to support anti-money-laundering in the banking sector. The Semantic Web holds huge volumes of highly similar data with parameters at different scales. Semi-structured RDF data is processed through loading, querying, and distributed filters [31].

b. Favorite playlist in Media and Entertainment:

People expect rich media on demand, in different formats and on various devices. Big Data challenges in the communications, media, and entertainment industry include collecting, analyzing, and utilizing consumer insights; leveraging mobile and social media content; and understanding patterns of real-time media content usage. Selecting features from a particular Big Data dataset can cause conflicts and problems, because fields that hold little data sometimes play a major role. [14] discusses a distributed feature selection model with different selection approaches for economic data. [26] identifies a model for "Having Friends" and the distribution of friends in social networks (SN).

c. Patient Tracking Data in Healthcare:

A huge amount of data has to be processed in healthcare, but failure to make use of it raises healthcare costs and leaves the matching of patients to the proper medicine below the mark. This can also worsen outcomes for patients. The IoT (Internet of Things) is another source of Big Data, through applications such as eHealthcare. Security and privacy for this data are enhanced by agents for monitoring, policy, data collection, access control, ontology, and the physician [12], and by digitalization and large-scale data analytics [16]. [20] and [25] support interactive solutions for improving the health of individuals, through the simulated HBDA approach and through precision medicine with electronic healthcare records. [38] discusses predicting the health condition of individuals with the i2b2 model.

d. Predicting Outcome in Education:

A major challenge in education is to incorporate Big Data from different sources and to predict needs. Based on a student's earlier performance, a result, behavior, or placement can be predicted. [34] presents muffin, a smartphone application that tracks student attendance in a secure manner, which supports outcome prediction.

e. Increasing Customer Support in Manufacturing:

The demand for resources such as oil, agricultural products, minerals, gas, and metals has been increasing day by day, which leads to an increase in the volume, complexity, and velocity of data that is challenging to handle. Large volumes of data from the manufacturing industry, and the underutilization of this information, stand in the way of improved product quality, energy efficiency, reliability, and better profit margins. Data stored from various sources and websites may contain duplicates even after encryption, and these duplicates contribute heavily to model failure. This failure can be reduced by deduplication with ownership models [13]. Frequent patterns differ from duplicates; [29] improves their visualization in visual analytics through PyramidViz. [39] discusses various visual analytics techniques for urban areas, which pave the way for smart cities. [40] determines changes in passenger flow through different views of tweets and spatiotemporal data.

f. Public Sector:

In the public sector, Big Data is used to process the many claims that are collected as unstructured data. It can detect fraudulent claims more quickly, and illegal activities can be identified more easily. As in healthcare, transport holds a rich amount of data from which little information has been extracted. Big Data analysis is much needed in transport [19] to improve the management of resources such as manpower and vehicles.

Similarly, Big Data is applicable to insurance, transportation, retail, wholesale trade, and other sectors. Processing Big Data of different varieties from these kinds of sources is not easy with traditional technologies, so it can be processed with Hadoop and Spark. Because Hadoop processes data in batches and handles streaming data poorly, we move to Spark for such workloads. Spark works in memory with RDDs and takes less time than Hadoop.

Conclusion:

Thus, various parameters have been considered to understand the importance and ease of Big Data processing with core engines such as Hadoop and Spark. The different V's describe what is distinctive about Big Data and why it is needed in every field. Most applications generate data rich in information, but little of that information has been extracted so far. Hadoop or Spark can help to uncover the insight in the data. We have discussed the operation and functionality of Hadoop and Spark in detail.

REFERENCES

[1.] Carson K. Leung*, Fan Jiang, Hao Zhang, and Adam G.M. Pazdor, 2016. "A Data Science Model for Big Data Analytics of Frequent Patterns", IEEE 14th Intl Conference on Dependable, Autonomic and Secure Computing.

[2.] Julie Yixuan Zhu, Jialing Xu, Victor O.K. Li, 2016. "A Four-layer Architecture for Online and Historical Big Data Analytics." IEEE 14th Intl Conference on Dependable, Autonomic and Secure Computing.

[3.] Bo Wang, Jinlei Jiang, Yongwei Wu, Guangwen Yang, Keqin Li, 2016. "Accelerating MapReduce on Commodity Clusters: An SSD-Empowered Approach." DOI 10.1109/TBDATA.2599933, IEEE Transactions on Big Data.

[4.] Hong-Mei Chen, Rick Kazman, and Serge Haziyev, 2015. "Agile Big Data Analytics for Web-based Systems: An Architecture-centric Approach." IEEE TRANSACTIONS ON BIG DATA.

[5.] Tingshan Huang, Nagarajan Kandasamy, Harish Sethu, Matthew C. Stamm, "An Efficient Strategy for Online Performance Monitoring of Datacenters via Adaptive Sampling." IEEE Transactions on Cloud Computing.

[6.] Ignacio Gracia, Ruben Casado, Abdelhamid Bouchachia, 2016. "An incremental approach for real-time Big Data visual analytics." 4th International Conference on Future Internet of Things and Cloud Workshops.

[7.] Fan Liang, Weichang Du, 2016. "Analytics Toolkit for Business Big Data." IEEE International Congress on Big Data.

[8.] Yunus Yetis, Ruthvik Goud Sara, Berat A. Erol, Halid Kaplan, Abdurrahman Akuzum and Mo Jamshidi Ph.D. 2016. "Application of Big Data Analytics via Cloud Computing." WAC: 1570250270.

[9.] Abdulsalam Yassine, Ali Asghar Nazari Shirehjini, and Shervin Shirmohammadi, "Bandwidth On-demand for Multimedia Big Data Transfer across Geo-Distributed Cloud Data Centers. " IEEE Transactions on Cloud Computing.

[10.] Wenyun Dai, Longfei Qiu, Ana Wu, and Meikang Qiu, "Cloud Infrastructure Resource Allocation for Big Data Applications." IEEE Transactions on Big Data.

[11.] Wei Shao, Flora D. Salim, Andy Song, and Athman Bouguettaya, "Clustering Big Spatiotemporal-Interval Data." IEEE Transactions on Big Data.

[12.] Todor Ivascu, Marc Frincu and Viorel Negru, 2016. "Considerations Towards Security and Privacy in Internet of Things Based eHealth Applications." SISY 2016, IEEE 14th International Symposium on Intelligent Systems and Informatics, Subotica, Serbia.

[13.] Zheng Yan, Wenxiu Ding, Xixun Yu, Haiqi Zhu, and Robert H. Deng, 2016. "Deduplication on Encrypted Big Data in Cloud." IEEE TRANSACTIONS ON BIG DATA, 2(2).

[14.] Liang Zhao, Zhikui Chen, Yueming Hu, Geyong Min, and Zhaohua Jiang, 2014. "Distributed Feature Selection for Efficient Economic Big Data Analysis." JOURNAL OF LATEX CLASS FILES, 13(9).

[15.] Berekmeri, M., D. Serrano, S. Bouchenak, N. Marchand, B. Robu, 2015. "Feedback Autonomic Provisioning for Guaranteeing Performance in MapReduce Systems." IEEE TRANSACTIONS ON CLOUD COMPUTING.

[16.] Volker Tresp, J. Marc Overhage, Markus Bundschus, Shahrooz Rabizadeh, Peter A. Fasching, and Shipeng Yu, 2016. "Going Digital: A Survey on Digitalization and Large-Scale Data Analytics in Healthcare." Proceedings of the IEEE, 104(11).

[17.] Hamoud Alshammari, Jeongkyu Lee and Hassan Bajwa, 2015. "H2Hadoop: Improving Hadoop Performance using the Metadata of Related Jobs." IEEE TRANSACTIONS On Cloud Computing, manuscript ID.

[18.] Johnu George, Chien-An Chen, Radu Stoleru, Geoffrey G. Xie, 2014. "Hadoop MapReduce for Mobile Clouds." IEEE TRANSACTIONS ON CLOUD COMPUTING, 3(1).

[19.] Albert Nagy, Jazsef Tick, 2016. "Improving Transport Management with Big Data Analytics." SISY 2016, IEEE 14th International Symposium on Intelligent Systems and Informatics.

[20.] Dillon Chrimes, Belaid Moa, Hamid Zamani, Mu-Hsing Kuo, 2016. "Interactive Healthcare Big Data Analytics Platform under Simulated Performance." IEEE 14th Intl Conf on Dependable, Autonomic and Secure Computing.

[21.] Zhao Zhang, Kyle Barbary, Frank Austin Nothaft, Evan R. Sparks, "Kira: Processing Astronomy Imagery Using Big Data Technology." IEEE Transactions on Big Data.

[22.] Liang Wang, Sotiris Tasoulis, Teemu Roos, and Jussi Kangasharju. "Kvasir: Scalable Provision of Semantically Relevant Web Content on Big Data Framework." IEEE TRANSACTIONS ON BIG DATA, SPECIAL ISSUE ON BIG DATA ANALYTICS AND THE WEB.

[23.] Marcel Worring, Dennis Koelma, Jan Zahalka. "Multimedia Pivot Tables for Multimedia Analytics on Image Collections." IEEE Transactions on Multimedia.

[24.] Sun-Yuan Hsieh, Chi-Ting Chen, Chi-Hao Chen, Tzu-Hsiang Yen, Hung-Chang Hsiao, and Rajkumar Buyya. "Novel Scheduling Algorithms for Efficient Deployment of MapReduce Applications in Heterogeneous Computing Environments." IEEE Transactions on Cloud Computing.

[25.] Po-Yen Wu, Chih-Wen Cheng, Chanchala D. Kaddi, Janani Venugopalan, Ryan Hoffman and May D. Wang. "-Omic and Electronic Health Records Big Data Analytics for Precision Medicine." IEEE Transactions on Biomedical Engineering.

[26.] Horia-Nicolai Teodorescu, 2016. "On Models of "Having Friends" and SN Friends Distribution." CoDIT'16.

[27.] Lavanya, T., J.C. ce Pamila, K. Veningston, 2016. "Online Review Analytics using Word Alignment Model on Twitter Data." 2016 3rd International Conference on Advanced Computing and Communication Systems (ICACCS -2016), Coimbatore, INDIA.

[28.] Shuhui Jiang, Xueming Qian, Tao Mei, and Yun Fu, 2016. "Personalized Travel Sequence Recommendation on Multi-Source Big Social Media." IEEE TRANSACTIONS ON BIG DATA, X(X), SEPTEMBER.

[29.] Carson K. Leung, Vadim V. Kononov, Adam G.M. Pazdor and Fan Jiang, "PyramidViz: Visual Analytics and Big Data Visualization of Frequent Patterns." 14th Intl Conf on Dependable, Autonomic and Secure Computing.

[30.] Kun Wang*, Jun Mi*, Chenhan Xu*, Lei Shu, and Der-Jiunn Deng, "Real-time Big Data Analytics for Multimedia Transmission and Storage."

[31.] Long Cheng and Spyros Kotoulas. "Scale-Out Processing of Large RDF Datasets." IEEE Transactions on Big Data.

[32.] Feng Xia, Haifeng Liu, Ivan Lee, and Longbing Cao. "Scientific Article Recommendation: Exploiting Common Author Relations and Historical Preferences." IEEE Transactions on Big Data.

[33.] Guoshuai Zhao, Xueming Qian, Chen Kang. "Service Rating Prediction by Exploring Social Mobile Users Geographical Locations." IEEE TRANSACTIONS ON BIG DATA.

[34.] Sanja Maravic Cisar, Robert Pinter, Viktor Vojni, Vanja Tumbas, Petar Cisar, 2016. "Smartphone Application for Tracking Students Class Attendance." SISY 2016, IEEE 14th International Symposium on Intelligent Systems and Informatics, Subotica, Serbia.

[35.] Siyuan Liu, Qiang Qu, Lei Chen, and Lionel M. Ni, 2015. "SMC: A Practical Schema for PrivacyPreserved Data Sharing over Distributed Data Streams." IEEE TRANSACTIONS ON BIG DATA, 1: 2.

[36.] Lingfei Wu, Kesheng Wu, Alex Sim, Michael Churchill, Jong Y. Choi, Andreas Stathopoulos, CS Chang, and Scott Klasky, 2016. "Towards Real-Time Detection and Tracking of Spatio-Temporal Features: BlobFilaments in Fusion Plasma." IEEE TRANSACTIONS ON BIG DATA.

[37.] Alisa Sotsenko, Marc Jansen, Marcelo Milrad, Juwel Rana, 2016. "Using a Rich Context Model for RealTime Big Data Analytics in Twitter." 4th International Conference on Future Internet of Things and Cloud Workshops.

[38.] Daniel R. Harris, Adam D. Baus, Tamela J. Harper, Traci D. Jarrett, Cecil R. Pollard, and Jeffery C. Talbert. "Using i2b2 to Bootstrap Rural Health Analytics and Learning Networks."

[39.] Yixian Zheng, Wenchao Wu, Yuanzhe Chen, Huamin Qu, and Lionel M. Ni, 2015. "Visual Analytics in Urban Computing: An Overview." IEEE TRANSACTIONS ON BIG DATA.

[40.] Masahiko Itoh, Daisaku Yokoyama, Masashi Toyoda, Yoshimitsu Tomita, Satoshi Kawamura, and Masaru Kitsuregawa, 2016. "Visual Exploration of Changes in Passenger Flows and Tweets on MegaCity Metro Network." IEEE TRANSACTIONS ON BIG DATA, 2(1).

(1) Mr.M. BalaAnand, (2) Dr.N.Karthikeyan, (3) Dr.S.Karthik, (4) Mr.C.B.Sivaparthipan

(1) Asst. Professor, V.R.S. College of Engineering & Technology, Arasur, Villupuram.

(2) Professor SNS College of Engineering, Coimbatore.

(3) Dean & Professor SNS College of Technology, Coimbatore.

(4) Asst. Professor SNS College of Technology, Coimbatore.

Received 28 February 2017; Accepted 22 March 2017; Available online 25 April 2017

Address For Correspondence:

Mr.M. BalaAnand, Asst. Professor, V.R.S. College of Engineering & Technology, Arasur, Villupuram. E-mail: balavdy@gmail.com

Caption: Fig. 1: 10V'S OF BIG DATA

Caption: Table 2: Processing Time For Data In Disk

Table I: RDBMS vs MapReduce vs Spark

             Oracle / MySQL              MapReduce                     Spark

Data Size    Gigabytes                   Petabytes                     Even exabytes or zettabytes
Access       Interactive and batch       Batch                         Iterative and interactive
Update       Read and write many times   Write once, read many times   Reduces most read and write operations by RDD with in-memory
Structure    Static schema               Dynamic schema                Can support both
Integrity    High                        Low                           Very high because of RDD
Speed        x times                     (x+n) times                   100(x+n) times

COPYRIGHT 2017 American-Eurasian Network for Scientific Information
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2017 Gale, Cengage Learning. All rights reserved.

Author:BalaAnand, M.; Karthikeyan, N.; Karthik, S.; Sivaparthipan, C.B.
Publication:Advances in Natural and Applied Sciences
Article Type:Report
Date:Apr 1, 2017