Modern-era retrospective analysis for research and applications.
Earth science datasets, which capture a variety of information about the Earth's surface, are obtained using various acquisition methods, at varying domains of coverage (both in space and time), and with varying data uniqueness. For example, observational data about the Earth's surface can be collected by local sensor recordings (in situ data) or via instruments mounted on remote sensing satellites (remote sensing data). These observations are frequently available at non-uniform locations with scarce coverage in space and time, e.g. due to the uneven distribution of a limited number of in situ sensors across the world, which makes it necessary to process and convert them to fixed spatial grids using basic interpolation, aggregation and sampling techniques.
Getting programs on multiple machines to work together efficiently, so that each program knows which portion of the data to process, and then combining the results from all of the machines to make sense of a large pool of data, requires special software techniques. Since it is usually much faster for a program to access data stored locally than over a network, the allocation of data across a cluster and the way those machines are networked together are also important considerations in big data problems. The uses of big data are almost as varied as they are large. Familiar examples include social media networks analyzing their members' data to learn more about them and connect them with content and advertising relevant to their interests, or search engines looking at the relationship between queries and results to give better answers to users' questions. But the potential uses go much further! Two of the largest sources of data in large quantities are transactional data, including everything from stock prices to bank records to individual merchants' purchase histories; and sensor data, much of it coming from what is commonly called the Internet of Things (IoT).
This sensor data might be anything from measurements taken by robots on the manufacturing line of an automaker, to location data on a cell phone network, to electrical usage in homes and businesses, to customer boarding information recorded on a transit system. By analyzing this data, organizations learn trends about the data they are measuring, as well as the people generating it. The hope for this big data analysis is to provide more customized service and increased efficiency in whatever industry the data is collected from. One of the best-known methods for turning raw data into useful information is MapReduce. MapReduce is a method for taking a large data set and performing computations on it across multiple computers, in parallel. It serves as a model for how to program, and the term is often used to refer to actual implementations of this model. In essence, MapReduce consists of two parts.
The Map function does sorting and filtering, taking data and placing it into categories so that it can be analyzed. The Reduce function provides a summary of this data by combining it all together. While largely attributed to research which took place at Google, MapReduce is now a standard term and refers to a general model used by many technologies.
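The two parts described above can be illustrated with a minimal, single-process sketch. This is not Hadoop itself, and the function names and sample records are illustrative; real frameworks run many map and reduce tasks in parallel across machines, while here the phases are simulated sequentially for clarity.

```python
from collections import defaultdict

def map_phase(records):
    """Map: filter and categorize each record, emitting (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values into a summary per key."""
    return {key: sum(values) for key, values in groups.items()}

records = ["big data big clusters", "data moves to compute"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts["data"])   # -> 2: "data" appears twice across the records
```

Each phase is a pure function over its input, which is exactly what lets a framework distribute the map and reduce work independently.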
II. Related Works:
The problem of spatial smoothing using regression models is well studied, especially in spatial data mining, using two popular approaches: (a) Spatial Auto-Regression (SAR) and (b) Geographically Weighted Regression (GWR). In SAR models, the spatial dependency of the dependent variable is directly modeled in the regression objective. The original regression equation y = Xβ + ε is modified as y = ρWy + Xβ + ε, where W is the contiguity matrix denoting the neighborhood relationship and ρ is a constant parameter that decides the strength of spatial dependency amongst the elements of y, the dependent variable. Note that the SAR approach corresponds to a single regression model with smoothing over the entire space. In our case, we have separate models for each spatial point and we need to ensure smoothness amongst multiple model parameters; the SAR approach does not address this problem. One of the limitations of the SAR approach is that it does not account for the underlying spatial heterogeneity. Geographically Weighted Regression (GWR) was proposed to address this issue using a spatially varying model parameter β(s). It uses a weighted linear regression model where the observations around a geographic location are weighted using a distance decay function. The regression model at a specific location s is given by y = Xβ(s) + ε. In this approach, the calibration of model parameters in a neighborhood is pre-defined, such as by a distance decay function, and is not general enough for our case. In a parallel thread, there are several approaches considered by climate scientists to address the GCM combination problem. These models can be broadly grouped into three categories:
(1) Weighted averaging models
(2) Bayesian models and
(3) Online learning models
The main challenge in the weighted averaging approach is that choosing the decay function corresponding to each location is quite tedious. In Bayesian models, the current and future climate parameters are treated as random variables with a prior probability distribution. The likelihood component of the model specifies the conditional distribution of the data, given the parameters. The posterior is obtained by combining the prior and likelihood of the model. In the online learning models, each GCM is modeled as an expert, and for every time instance the experts give predictions. The online learning approach takes one data point at a time and updates its confidence (weights) for each expert based on the accuracy of the recent prediction. We note that none of these models addresses the problem of combining multiple GCM model outputs at each location, with spatial smoothing across the model parameters in neighbouring locations.
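The distance-decay weighting behind GWR can be sketched in a few lines. The following is a toy, pure-Python illustration with one predictor on a 1-D domain; the Gaussian kernel bandwidth and the synthetic data are assumptions for demonstration, not values from any paper. Each location s gets its own locally fitted parameters, with nearby observations weighted more heavily.

```python
import math

def gwr_fit(s, locations, x, y, bandwidth=1.0):
    """Fit y = a + b*x locally at location s, weighting each observation by
    a Gaussian distance-decay kernel (closer points influence the fit more)."""
    w = [math.exp(-((loc - s) ** 2) / (2 * bandwidth ** 2)) for loc in locations]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = num / den
    return ybar - b * xbar, b

# Toy 1-D domain with spatial heterogeneity: the true slope is ~2 near
# location 0 but ~3 near location 10, so the locally fitted slope varies.
locations = [0, 1, 2, 8, 9, 10]
x = [1, 2, 3, 1, 2, 3]
y = [2, 4, 6, 3, 6, 9]
a0, b0 = gwr_fit(0.5, locations, x, y)   # slope ~2 here
a1, b1 = gwr_fit(9.5, locations, x, y)   # slope ~3 here
print(round(b0, 2), round(b1, 2))
```

This captures the limitation noted above: the calibration is fixed by the chosen decay kernel and bandwidth, rather than learned jointly with smoothness constraints across neighboring parameters.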
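The online-learning combination of GCMs described above can also be sketched briefly. This is a generic Hedge-style multiplicative-weights update, not any specific published scheme; the learning rate and the toy per-GCM predictions are illustrative assumptions.

```python
import math

def update_weights(weights, predictions, observed, eta=1.0):
    """Down-weight each expert (GCM) exponentially in its squared error,
    then renormalize so the weights remain a probability distribution."""
    new = [w * math.exp(-eta * (p - observed) ** 2)
           for w, p in zip(weights, predictions)]
    total = sum(new)
    return [w / total for w in new]

weights = [1 / 3] * 3                 # three GCMs, initially equal trust
stream = [(0.0, [0.1, 1.0, -0.9]),    # (observed value, per-GCM predictions)
          (0.5, [0.4, 1.5, -0.5]),
          (1.0, [0.9, 2.0, 0.0])]
for observed, preds in stream:
    weights = update_weights(weights, preds, observed)

best = max(range(3), key=lambda i: weights[i])
print(best)   # -> 0: the first GCM tracked the observations best
```

Note that this update is purely temporal and per-location; it has no notion of spatial smoothing across neighboring locations, which is the gap identified above.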
A MapReduce job proceeds in three stages: 1. Map stage 2. Shuffle stage 3. Reduce stage
Hadoop Distributed File System:
Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is the Hadoop Distributed File System (HDFS). HDFS is based on the Google File System and provides a distributed file system that is designed to run on large clusters (thousands of computers) of small commodity machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data. A file in an HDFS namespace is divided into several blocks, and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to DataNodes. The DataNodes take care of read and write operations with the file system; they also take care of block creation, deletion and replication based on instructions given by the NameNode. HDFS provides a shell like any other file system, and a list of commands is available to interact with it. MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples. Secondly, the Reduce task takes the output from a Map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the Reduce task is always performed after the Map job.
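The block layout just described can be made concrete with a toy sketch: NameNode-style metadata maps each fixed-size block of a file to several DataNodes (replication), while the block contents live on the DataNodes. The tiny block size, node names, and round-robin placement here are illustrative assumptions; real HDFS uses much larger blocks (128 MB by default) and rack-aware placement.

```python
BLOCK_SIZE = 8          # bytes; deliberately tiny for demonstration
REPLICATION = 2
datanodes = ["dn1", "dn2", "dn3"]

def put_file(name, data, namespace, storage):
    """Split data into blocks, record the block->DataNode mapping in the
    (NameNode-style) namespace, and store replicas on the DataNodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    namespace[name] = []
    for i, block in enumerate(blocks):
        # place REPLICATION copies round-robin across the DataNodes
        homes = [datanodes[(i + r) % len(datanodes)] for r in range(REPLICATION)]
        namespace[name].append(homes)
        for dn in homes:
            storage.setdefault(dn, {})[(name, i)] = block

namespace, storage = {}, {}
put_file("temps.csv", b"city,temp\nDelhi,41\n", namespace, storage)
print(len(namespace["temps.csv"]))   # -> 3: the 19-byte file spans 3 blocks
print(namespace["temps.csv"][0])     # -> ['dn1', 'dn2']: replicas of block 0
```

Reads then consult the namespace (the NameNode's job) to find which DataNodes hold each block, and fetch the block contents from any live replica.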
III. Problem Statement:
MySQL is a relational database management system, generally used for OLTP processing. For any maintenance on storage or data records, downtime is needed in any available RDBMS. In standalone database systems, adding processing power such as more CPU or physical memory in a non-virtualized environment requires downtime for an RDBMS such as DB2, Oracle, or SQL Server. Although a database cluster can share the same data files in shared storage, performance tuning of an RDBMS can become a nightmare, data processing capacity is limited, and maintenance costs are high.
IV. Proposed System:
Even at fairly large scales of data, it is rare that people use Hadoop as a substitute for a parallel database. Hadoop has different design goals and consequently a different set of strengths and weaknesses relative to a parallel RDBMS. Hadoop is a lot more flexible: unlike a parallel RDBMS, it does not have to preprocess data before it can use it. There is no need to design a star schema, update a data dictionary, or manipulate the data with a separate ETL process. Moreover, the schema can be changed after the fact with very little cost or effort. There are workloads where this flexibility is very valuable, and those workloads are moving to Hadoop quickly.
Hadoop is also more scalable. The largest publicly discussed Hadoop cluster (Facebook's) was reported at 30 petabytes and has grown since then; to our knowledge, no parallel RDBMS comes close to those numbers. Because log, text and image data is often much bulkier than transactional data, it has often been kept out of parallel RDBMSs, since the economics simply would not make sense. The proposed concept provides a database using the Hadoop tool that can analyze data without size limitations: machines are simply added to the cluster, and results are obtained in less time, with high throughput and very low maintenance cost, while still supporting joins, partitions and bucketing techniques in Hadoop. The proposed system is built on the Hadoop tool. Hadoop is an open source framework overseen by the Apache Software Foundation, used for storing and processing huge datasets on a cluster of commodity hardware. The Hadoop tool contains two things: HDFS and MapReduce. The proposed system also uses Hadoop ecosystem tools such as Sqoop, Hive and Pig.
V. Performance Analysis:
1. Data Preprocessing Module:
In this module, a weather dataset is created: a table containing daily temperature details for twenty cities over the last 15 years. This data is first loaded into a MySQL database, and the project's analysis is carried out with the help of this dataset.
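The preprocessing step above can be sketched by synthesizing such a table. The city names, the 2002-2016 date range, the two sample days per month, and the uniform temperature model below are all illustrative stand-ins for the paper's actual MySQL data; only the shape (city, date, temperature) matches the description.

```python
import csv
import io
import random

random.seed(0)
cities = [f"city_{i:02d}" for i in range(20)]   # twenty cities
years = range(2002, 2017)                        # 15 years

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["city", "year", "month", "day", "temp_c"])
rows = 0
for city in cities:
    for year in years:
        for month in range(1, 13):
            for day in (1, 15):                  # 2 sample days/month, to stay small
                temp = round(random.uniform(5, 45), 1)
                writer.writerow([city, year, month, day, temp])
                rows += 1
print(rows)   # -> 7200: 20 cities * 15 years * 12 months * 2 days
```

A real run would write this CSV to disk (or `LOAD DATA` it into MySQL) instead of an in-memory buffer.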
2. Data Migration Module with Sqoop:
With the dataset ready, the aim is to transfer it into Hadoop (HDFS), which happens in this module. Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. In this module, the dataset is fetched into HDFS using the Sqoop tool. Sqoop supports many options: for example, fetching only particular columns, or fetching the dataset subject to a specific condition; the imported data is then stored in HDFS.
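Since Sqoop is driven from the command line, the import described above amounts to assembling an invocation like the one built below. This sketch only constructs the command as a string and does not run Sqoop; the JDBC URL, table name and target directory are illustrative assumptions, while `--connect`, `--table`, `--target-dir`, `--columns` and `--where` are standard `sqoop import` options.

```python
def sqoop_import_cmd(table, target_dir, where=None, columns=None):
    """Build a `sqoop import` command that copies a MySQL table into HDFS,
    optionally restricted to specific columns or a WHERE condition."""
    parts = ["sqoop", "import",
             "--connect", "jdbc:mysql://localhost/weatherdb",  # hypothetical URL
             "--table", table,
             "--target-dir", target_dir]
    if columns:
        parts += ["--columns", ",".join(columns)]
    if where:
        parts += ["--where", where]
    return " ".join(parts)

cmd = sqoop_import_cmd("temperatures", "/user/hadoop/temps",
                       where="year >= 2002",
                       columns=["city", "year", "temp_c"])
print(cmd)
```

In practice such a command would be executed on a cluster node with the MySQL JDBC driver on Sqoop's classpath.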
3. Data Analytic Module with Hive:
Hive is a data warehouse system for Hadoop. It runs SQL-like queries called HQL (Hive Query Language) which get internally converted into MapReduce jobs. Hive was developed by Facebook. Hive supports a Data Definition Language, a Data Manipulation Language and user-defined functions. In this module, the dataset stored in HDFS is analyzed using the Hive tool via the HQL language. Using Hive we perform table creation, joins, partitioning and bucketing. Hive analyzes only structured data.
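The year-wise analysis in this module would be expressed in HQL roughly as `SELECT city, year, MAX(temp_c) FROM temperatures GROUP BY city, year` (an illustrative query, assuming a `temperatures` table with those columns). The pure-Python sketch below computes the same grouped aggregate, which is essentially what Hive compiles such a query into as a MapReduce job; the sample rows are made up.

```python
from collections import defaultdict

rows = [("Delhi", 2015, 41.0), ("Delhi", 2015, 44.5),
        ("Delhi", 2016, 43.0), ("Chennai", 2015, 39.5)]

# GROUP BY (city, year), aggregating with MAX(temp_c)
max_temp = defaultdict(lambda: float("-inf"))
for city, year, temp in rows:
    key = (city, year)
    if temp > max_temp[key]:
        max_temp[key] = temp

print(max_temp[("Delhi", 2015)])   # -> 44.5
```

Partitioning the Hive table by year (and bucketing by city) would let such queries scan only the relevant slices of the data in HDFS.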
4. Data Analytic Module with Pig:
Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The language for Pig is Pig Latin. Pig handles both structured and unstructured data. It also runs on top of the MapReduce process in the background. In this module, the dataset is likewise analyzed through Pig using the Pig Latin data flow language, applying its operators, functions and joins to the data and inspecting the results.
5. Data Analytic Module with MapReduce:
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. In this module, the dataset is also analyzed using MapReduce directly, with the jobs written as Java programs.
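The temperature analysis of this module can be sketched end to end as a single-process simulation: the map task parses each record and emits ((city, year), temp), the shuffle groups temperatures by key, and the reduce task keeps the minimum and maximum. In Hadoop this would be a Java Mapper/Reducer pair; the records below are illustrative.

```python
from collections import defaultdict

records = ["Delhi,2015,41.0", "Delhi,2015,44.5",
           "Delhi,2016,43.0", "Chennai,2015,39.5"]

def mapper(line):
    """Map: parse one CSV record and emit ((city, year), temp)."""
    city, year, temp = line.split(",")
    yield (city, int(year)), float(temp)

# Shuffle: group all emitted values by key
shuffled = defaultdict(list)
for line in records:
    for key, value in mapper(line):
        shuffled[key].append(value)

# Reduce: lowest and highest temperature per (city, year)
result = {key: (min(values), max(values)) for key, values in shuffled.items()}
print(result[("Delhi", 2015)])   # -> (41.0, 44.5)
```

Running the same logic as a Hadoop job distributes the map calls across the cluster and streams each key's values to one reducer, so the code scales with the data.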
VI. Conclusion And Future Enhancement:
MapReduce is a framework for executing highly parallelizable and distributable algorithms across huge data sets using a large number of commodity computers. Using MapReduce with Hadoop, the high temperatures can be analyzed effectively. The scalability bottleneck is removed by using Hadoop with MapReduce: adding more systems to the distributed network gives faster processing of the data. The goal of this study was to analyze which city had the highest and lowest recorded temperatures, generating year-wise and month-wise reports over the previous 15 years as a basis for forecasting the following year. With the widespread deployment of these technologies throughout commercial industry and the interest within the open-source community, the capabilities of MapReduce and Hadoop will continue to grow and mature. The use of these types of technologies for large-scale data analysis has the potential to greatly enhance weather forecasting too.
B. Future Enhancement:
Using Spark, results can be obtained up to a hundred times faster than with Hadoop MapReduce. The secret is that Spark runs in memory on the cluster, and that it is not tied to Hadoop's two-stage MapReduce paradigm. This makes repeated access to the same data much faster. Spark can run standalone or on top of Hadoop YARN, where it can read data directly from HDFS.
(1) G. Ramadevi, (2) S. Parvathi, (3) A. Kumaresan, (4) K. Vijayakumar
(1,2,3,4) Department of Computer Science, SKP Engineering College Tiruvannamalai
Received 28 January 2017; Accepted 22 March 2017; Available online 28 April 2017
Address For Correspondence:
G. Ramadevi, Department of Computer Science, SKP Engineering College Tiruvannamalai
Author: Ramadevi, G.; Parvathi, S.; Kumaresan, A.; Vijayakumar, K.
Publication: Advances in Natural and Applied Sciences
Date: Apr 30, 2017