
Research on Computer Data Processing Modes in the Big Data Era

1. Introduction

A cluster runs parallel programs on all of its nodes at the same time in order to gain higher processing capacity and, at the same time, to improve system scalability. As the cluster grows, however, the probability of failure grows with it (Chen, 2012; Bennett, 2013), so fault tolerance is one of the problems that the parallel computing field must solve. Likewise, as data volumes increase, the ability of data mining to deal with massive amounts of data becomes a question that cannot be ignored (Demchenko, 2013; Dev, 2014). Parallel algorithms are an effective way to solve this essential problem, and the decision-tree classifier is one of the important methods used in data mining (Mora, 2014; Lupton, 2014; Goncalves et al., 2015). As far as the underlying computer architecture is concerned, we should also focus on the following aspects. (1) Replica consistency maintenance. During normal operation of a job, the main consideration is maintaining consistency between the primary copy and its backups (Swan, 2013); when an error occurs, the main consideration is how to mask the fault (Hu, 2014; Horton, 2015). In the original primary-backup protocol, when the primary node fails, the system runs a general view-change distributed algorithm and chooses, from the multiple backup replicas, a new primary copy to continue providing service to the system. (2) Rejoining after a node crash. From the system structure and the placement strategy of data copies on I/O nodes, it can be seen that the system will not place new files on the crashed node, so only changed and deleted files need to be handled. When an I/O node crashes, the consistency recovery protocol must be run; the recovering node is brought back to a consistent state from multiple nodes, rather than making a single node responsible for its recovery. (3) Data placement.
The metadata manager generally determines the I/O node on which a new file is placed. The client may select a specified I/O node to store the file. If the specified I/O node is working normally, the request is satisfied, while if the node the client specifies has crashed, the metadata manager selects another I/O node for the client's file according to the data placement strategy.
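The placement decision described above can be outlined as follows. This is an illustrative sketch under assumed names (`place_file`, a least-loaded fallback strategy), not the system's actual code:

```python
# Hypothetical sketch of the metadata manager's placement decision: honor
# the client's requested I/O node only if that node is alive; otherwise
# fall back to a placement strategy (least-loaded live node here).

def place_file(requested_node, alive_nodes, load):
    """Return the I/O node that should store a new file.

    requested_node -- node id the client asked for (or None)
    alive_nodes    -- set of node ids currently up
    load           -- dict mapping node id -> number of files stored
    """
    if requested_node is not None and requested_node in alive_nodes:
        return requested_node
    # Fallback placement strategy: pick the least-loaded live node.
    return min(alive_nodes, key=lambda n: load.get(n, 0))
```

Any load-aware or locality-aware strategy could replace the fallback; the essential property is that a crashed node is never chosen for new files.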

As shown in Figure 1, the computer data processing modes and flowcharts are demonstrated systematically. In the later sections, to address the challenges mentioned above, we conduct research from the perspectives of Hadoop, CUDA, and parallel architecture; experiments are then carried out to verify the effectiveness of the approach.

2. The Technique Review of the Data Processing Systems

With the rapid development of industrial informatization, enterprises have accumulated large amounts of data; these data are the basis of management and evaluation and have become the foundation of most organizations' core competence (Marz, 2015; Paten, 2015). The main tools in the data processing area are ETL tools such as Kettle, DataStage, Informatica, and Microsoft DTS, which typically provide graphical operation or scripted definition and management facilities that, to a certain extent, improve the efficiency of system development and operation as well as some aspects of generality (Russell, 2014; Rose, 2014). Figure 1 shows the architecture of a traditional data processing system, whose features and characteristics are as follows.

1. User transparency and high fault tolerance: data interaction between subtasks is organized and managed by the MapReduce framework, but the MapReduce interface can be modified appropriately according to the distribution characteristics of the nodes and the features of the data being processed, in order to improve data processing efficiency.

2. Efficient mining algorithms: on the WAMS platform, when a site fails, the basic algorithm can mine out the failure's impact on other sites and determine which sites are mainly affected.

3. Reliability, efficiency, and scalability: MapReduce can invoke the cluster's overall computing resources, the Hadoop distributed file system can invoke the cluster's overall storage resources, and cluster nodes can be added according to the computing and storage workload.

A middleware layer that provides transparent data processing services to users, realizes access to huge amounts of data, manages the data nodes effectively to achieve high-performance parallel processing, and makes full use of the considerable server resources in the grid, giving users a convenient and fast data service, is the central consideration in the middleware system's structural design. This paper mainly discusses I/O-intensive large-scale data processing tasks. Computation-intensive tasks are strongly influenced by the real-time running state of the processor and by differences in the regulating mechanisms provided by the hardware and the operating system, so we do not consider computation-intensive data processing tasks here. For the I/O-intensive case, the energy consumption of the system can be modeled as follows.

GeneralCost = max_i {T_i} x Σ_i p_i (1)

As Eq. (1) shows, using the fewest nodes to perform a task does not ensure the least energy consumption: the fewest nodes do not imply the lowest total power, nor do they imply the shortest maximum execution time. In addition, many clusters are not suited to adopting a new data storage strategy. The parameters involved are shown in Table 1.
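A small numerical illustration of Eq. (1), with invented figures, shows why the fewest nodes need not consume the least energy:

```python
# Numerical illustration of Eq. (1): GeneralCost = max_i{T_i} * sum_i p_i.
# The figures below are invented, purely to show that using fewer nodes
# does not always minimize energy: two slow nodes can cost more than
# four fast ones, because both the longest task time and the total power
# enter the product.

def general_cost(times, powers):
    """Energy model of Eq. (1): longest task time times total power."""
    return max(times) * sum(powers)

few_nodes = general_cost(times=[40, 40], powers=[100, 100])        # 40 * 200 = 8000
many_nodes = general_cost(times=[12, 12, 12, 12], powers=[100] * 4)  # 12 * 400 = 4800
assert many_nodes < few_nodes
```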

3. The Proposed Methodology

3.1. Hadoop and data processing

The MapReduce framework is widely used for mass data processing and has a natural advantage for TB- and PB-scale data. More and more ETL cleaning processes adopt the MapReduce programming model, which calls for customized study and design of MapReduce's operating mechanism, algorithms, performance optimization, and programming framework. The characteristics of MapReduce can be summarized as follows.

1. It has good scalability, which greatly improves development efficiency: through a well-organized structure, the use of filters, factory methods, strategies, and listener patterns, and well-designed XML configuration files, the framework achieves good extensibility.

2. It provides configuration files and multiple outputs, so that results can be assigned to different HDFS directories according to chosen criteria, which facilitates building Hive partition tables for later data analysis or computation.

3. Hadoop, as a distributed storage and computing architecture, can be deployed on commodity platforms and offers scalability, low cost, high efficiency, and reliability; it has therefore been widely adopted for distributed computing and has become the de facto standard for massive parallel data processing in both industry and academia.

The HDFS data management model lays the foundation for breaking a MapReduce job down into tasks for execution. For a map task, the jobtracker considers the tasktracker's data locality: in the best case the task is data-local, that is, it runs on the node where its input split is located. For a reduce task, the jobtracker simply selects the next one from the list of reduce tasks to be run, without taking data locality into account. Figure 2 demonstrates this issue in detail.
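The locality preference just described can be outlined as follows. This is a simplified sketch of the scheduling idea, not Hadoop's actual JobTracker implementation, and all names are illustrative:

```python
# Sketch of locality-aware map scheduling versus locality-blind reduce
# scheduling: a map task prefers a node that holds its input split; a
# reduce task is simply taken in order, with no locality consideration.

def assign_map_task(pending, node, split_locations):
    """Pick a map task whose input split lives on `node`, if any.

    pending         -- list of map task ids awaiting execution
    node            -- id of the node requesting work
    split_locations -- dict mapping task id -> set of nodes holding its split
    """
    for task in pending:
        if node in split_locations.get(task, ()):
            return task                      # data-local assignment
    return pending[0] if pending else None   # fall back to any task

def assign_reduce_task(pending):
    """Reduce tasks ignore locality: just take the next one."""
    return pending[0] if pending else None
```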

Hadoop adopts a static replication strategy: a specified number of copies is created when the data enter the system. A static replication strategy, however, cannot adjust dynamically to changes in the environment, and this easily wastes basic resources. Researchers have therefore proposed dynamic replication strategies that create or delete copies based on user demand, network conditions, and storage: when storage space is tight, some copies are deleted to save space, while when storage resources are plentiful, copies of frequently accessed data are added to improve load balancing across nodes. The complexity can be expressed as follows.

e_complexity(f, t) = Σ (f_i + f_j) / (I x D) (2)
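The dynamic replication policy described above can be sketched with illustrative thresholds; the 10% and 50% free-space cutoffs and access-rate bounds below are assumptions for the sketch, not values from the literature:

```python
# Hedged sketch of dynamic replica adjustment: under storage pressure,
# shed replicas of cold data; when space is plentiful, add replicas of
# hot data. All thresholds are illustrative assumptions.

def adjust_replicas(replicas, access_rate, free_ratio,
                    min_replicas=1, max_replicas=5):
    """Return the new replica count for one file.

    replicas    -- current number of copies
    access_rate -- accesses per unit time for this file
    free_ratio  -- fraction of cluster storage still free (0..1)
    """
    if free_ratio < 0.1 and access_rate < 1 and replicas > min_replicas:
        return replicas - 1   # storage is tight: delete a cold copy
    if free_ratio > 0.5 and access_rate > 100 and replicas < max_replicas:
        return replicas + 1   # space is plentiful: add a hot copy
    return replicas
```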

In a distributed environment, node I/O capacity and network bandwidth are the two key factors affecting the performance of the whole cluster. To take full advantage of cluster computing resources, the system may deploy tasks to all nodes in the cluster, so a task may need to process data that is not stored locally. Taking the cost of data transmission into account, when a traditional MapReduce cluster executes a map task it tries to assign the task to the node where the data is located, to avoid the I/O and communication costs of moving data, as shown in Figure 3.

Figure 4 -- The Configure Code for the Hadoop System

3.2. CUDA and data processing

CUDA is a parallel computing platform launched by Nvidia for its GPUs. CUDA C uses syntax similar to C for writing parallel programs, while adding proprietary instructions for accessing and using GPU hardware features. Current GPU hardware adopts a unified shader and general-purpose computing architecture; this architecture makes it possible to implement shader functionality with CUDA, because CUDA general-purpose computing code and shader code are eventually converted to the same hardware instructions and executed on the same hardware units. In a CUDA-based processing architecture, on the host side the CPU is responsible for task scheduling, human-computer interaction, and reading data from the data stream, continuously transferring data subsets to device memory. On the GPU device, CUDA kernel programs process the data flow and dynamically update the summary data structure in device memory. For the parallel data-stream quantile algorithm, this paper adopts a sliding-window model built from basic windows as the summary data structure, supporting fixed-length or periodic sliding windows. Sliding-window update and delete operations work in units of basic windows, and out-of-date basic windows in device memory are discarded directly, as follows.

1. In real-time data fields, high-density data with multidimensional attributes constitute huge data streams, and the computation model shifts toward kernel-based calculation. Because the computation becomes very complex, new ways must be found to improve computing speed: the processing performance demanded by these high-dimensional data streams exceeds what earlier processors could provide.

2. Using the GPU's image-block and pixel mapping techniques, the proposed model maps data-stream elements onto threads for parallel processing. Existing data-stream algorithms suited to parallel computing share the necessary qualities: the same instruction can be applied to different data, so parallel processing can accelerate the model.

3. For GPU-based high-dimensional data-stream processing, the biggest bottleneck is the data transmission overhead between host and device. The GPU cannot handle the core tasks alone, so a certain amount of high-dimensional data must first be buffered and then handed to the GPU for computation in order to achieve high data throughput. To meet the real-time demands of stream processing, shrinking the buffer reduces latency, but the buffer should remain large enough that each batch processed by a kernel invocation keeps the GPU working at full capacity.
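The buffering and basic-window mechanism above can be sketched on the host side as follows; this is an illustrative model of the summary structure, not the paper's CUDA implementation, and the class name is assumed:

```python
# Illustrative sketch of the sliding-window summary structure: the stream
# is buffered into fixed-size "basic windows"; the sliding window keeps
# only the most recent W basic windows (older ones expire automatically),
# and a full basic window is the batch that would be shipped to the GPU.

from collections import deque

class SlidingWindow:
    def __init__(self, basic_size, max_windows):
        self.basic_size = basic_size               # elements per basic window
        self.buffer = []                           # current, unfinished basic window
        self.windows = deque(maxlen=max_windows)   # old basic windows auto-expire

    def push(self, value):
        """Add one stream element; return a full basic window when one
        completes (the batch a kernel invocation would process)."""
        self.buffer.append(value)
        if len(self.buffer) == self.basic_size:
            batch, self.buffer = self.buffer, []
            self.windows.append(batch)   # deque drops the oldest window
            return batch
        return None
```

Choosing `basic_size` embodies the latency/throughput trade-off described in point 3: smaller basic windows cut latency, larger ones keep the GPU saturated.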

In this study, GPGPU programs are developed and run on the NVIDIA CUDA architecture, so the compiled code files use the .cuh extension for headers and .cu for sources; the architecture is shown in Figure 4. The Dogpu class directly uses the GPGPU program Dogpu.cuh, which is devoted to memory allocation routines; the memory configuration and data transfer functions do not belong to any one class. The Dogpu kernel is the core GPU computing program, called directly by the program, and the Dogpu class must be used to make the GPGPU calls. When the program starts to compute, the CPU first allocates the data to be computed on the GPU; in this process, a unit of work running on the GPU is called a Kernel, and each Kernel's work on the GPU is allocated to a Grid, so a GPU kernel is a core program operation. A Grid can contain several Blocks, and each Block in turn contains a number of Threads.
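The Grid/Block/Thread hierarchy fixes a simple launch-geometry calculation, sketched here outside CUDA for clarity (the block size of 256 is an illustrative choice, not a value from the paper):

```python
# For n stream elements and a chosen threads-per-block, the grid needs
# ceil(n / threads_per_block) blocks so that every element gets a thread;
# surplus threads in the last block simply test their index and do nothing.

def launch_geometry(n, threads_per_block=256):
    """Return (blocks_per_grid, threads_per_block) covering n elements."""
    blocks = (n + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block
```

For example, 1000 elements with 256-thread blocks require a grid of 4 blocks (1024 threads in total).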

4. The Proposed Methodology and the Experiment

4.1. Parallel data processing paradigm

In parallel and distributed computer systems, replication can improve both performance and availability: even if some nodes fail, users can still access the data, and at the same time data replication improves system performance. In constructing such a system, the following issues should be considered. (1) For a parallel storage system with limited resources, the time and space consumed by mapping logical block numbers to storage-node labels is crucial; if the mapping process takes up too much CPU time and memory, system performance degrades directly. (2) For a parallel storage system, the degree of parallelism is the key to performance, and the way data are distributed directly affects the degree of parallelism of data access. (3) Zero invalid migration is one of the design goals of the two distribution schemes in this article; since neither scheme distributes data through weight-balancing operations that generate invalid movement, both can achieve zero invalid migration completely.
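Issue (1) above asks that the logical-block-to-node mapping be cheap in both time and space. As a sketch of one way to satisfy that, a stateless striped mapping is O(1) and needs no lookup table; the function and its parameters are illustrative, not the paper's scheme:

```python
# Stateless block-to-node mapping: stripe `stripe_unit` consecutive
# logical blocks per node, dealing stripes round-robin across n_nodes.
# Constant time, constant space -- no per-block metadata is stored.

def block_to_node(logical_block, n_nodes, stripe_unit=1):
    """Map a logical block number to a storage-node label."""
    return (logical_block // stripe_unit) % n_nodes
```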

To address this concern, we integrate a clustering algorithm into the whole. The fast K-medoids clustering algorithm updates cluster centers using a local heuristic method, can obtain more stable clustering results, and runs with the simplicity and speed of K-means. However, because of how the algorithm selects initial cluster centers, the initial centers of different clusters may fall within the same true cluster, so the clustering effect is not optimal. We use Eq. (3) to modify the center selection.

c = { x_ij : min ‖ x_ij - (1/N) Σ_k x_ik ‖ } (3)
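A hedged sketch of this fast K-medoids procedure follows: density-based selection of the K initial medoids, then the usual local medoid update (each medoid becomes the point minimizing the in-cluster distance sum). This follows the text's outline, not the paper's exact code:

```python
# Fast K-medoids sketch: initial medoids are the K samples with the
# densest neighborhoods (smallest total distance to all other samples);
# each iteration reassigns points to the nearest medoid and replaces
# every medoid by its cluster's distance-sum minimizer, stopping when
# the medoids no longer change.

def kmedoids(points, k, dist, n_iter=20):
    # Density score: sum of distances to all other points (lower = denser).
    density = [sum(dist(p, q) for q in points) for p in points]
    medoids = [points[i] for i in sorted(range(len(points)),
                                         key=lambda i: density[i])[:k]]
    for _ in range(n_iter):
        clusters = [[] for _ in medoids]
        for p in points:                      # assign to nearest medoid
            j = min(range(k), key=lambda j: dist(p, medoids[j]))
            clusters[j].append(p)
        new = [min(c, key=lambda p: sum(dist(p, q) for q in c)) if c else m
               for c, m in zip(clusters, medoids)]
        if new == medoids:                    # distance sums unchanged: stop
            break
        medoids = new
    return medoids, clusters
```

On one-dimensional data with `dist = lambda a, b: abs(a - b)`, two well-separated groups recover one medoid each even when both initial medoids start in the same group, illustrating how the update escapes the poor initialization the text describes.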

The idea of the algorithm is as follows: compute, for each sample in the data set, the distribution density of the samples around it; choose the K samples with the highest surrounding density as the initial cluster centers; within each cluster, compute the distance from every sample to the other samples of the cluster and choose the sample with the minimum distance sum as the new cluster center; repeat until the sum of distances from all samples to their cluster centers no longer changes. Extensibility is an important goal pursued by designers of parallel algorithms and high-performance parallel computers. The scalability of a parallel system reflects how the problem size grows with the system size; through scalability analysis we can understand how well the system structure matches the application algorithm, and at the same time evaluate the performance of the parallel system. For this purpose, we define the similarity function as follows; the systematic flowchart architecture is shown in Figure 5.

Sim(x_i, x_j) = 1 / (1 + Σ_l w_l |x_il - x_jl|) (4)
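Eq. (4) translates directly into code: similarity is the reciprocal of one plus the weighted L1 distance between two samples.

```python
# Direct implementation of Eq. (4). Identical samples have similarity 1;
# similarity decays toward 0 as the weighted L1 distance grows.

def sim(x_i, x_j, w):
    """Sim(x_i, x_j) = 1 / (1 + sum_l w_l * |x_il - x_jl|)."""
    return 1.0 / (1.0 + sum(wl * abs(a - b)
                            for wl, a, b in zip(w, x_i, x_j)))
```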

4.2. Experimental analysis

In this part, we conduct an experimental analysis of the proposed algorithm. To test its validity, experiments were run on artificial simulation data sets and on standard data sets from the UCI machine learning repository, in a Matlab 9.0 programming environment under the Windows 7 operating system. The following experiments assume a distributed-memory large-scale parallel computer. Gaussian elimination is implemented under two data distributions (for ease of analysis, without pivoting), and the iso-efficiency function of scalable parallel systems is used to analyze scalability. The two distributions are block-cyclic distribution by columns and block-cyclic distribution by rows and columns. The visualized and numerical simulation results are shown in Figure 6 and Table 2.
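The ownership rule behind both layouts can be sketched in one function; this is a generic illustration of block-cyclic distribution, with block size and process count as assumed parameters:

```python
# Block-cyclic layout: deal blocks of `block_size` consecutive columns
# (or rows) round-robin to n_procs processes. During Gaussian elimination
# the active submatrix shrinks from the left, and the cyclic dealing
# keeps the remaining work spread across all processes.

def block_cyclic_owner(index, block_size, n_procs):
    """Process that owns column (or row) `index` under a block-cyclic
    layout."""
    return (index // block_size) % n_procs
```

With block size 2 and 3 processes, columns 0..7 map to processes 0, 0, 1, 1, 2, 2, 0, 0; a two-dimensional block-cyclic layout simply applies the same rule independently to rows and columns.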

5. Conclusion and Summary

In this paper, an efficient platform for large data is proposed that can effectively support the integration of huge amounts of data and data mining algorithms, together with visualization of the processing steps, and that can use a standard data analysis process to ensure the stability of the results. In today's era of big data, massive historical data are stored during on-site production processes; mining useful knowledge and information from them, and identifying multivariable system models through intelligent algorithms, has become a research hotspot. Because of the diversity of big data, acquired and integrated data usually cannot be fed directly to data mining algorithms: the data must be preprocessed in combination with the information structure of the specific application, semantic information must be abstracted from the data, and attributes must be selected from the many properties obtained, culling dependent attributes before application or introducing additional abstract measures. This paper provides a new paradigm for these issues, which is shown to be efficient at both the theoretical and technical levels.


Bennett, P., Giles, L., Halevy, A. (2013). Channeling the deluge: research challenges for big data and information systems. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, 2537-2538.

Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business Intelligence and Analytics: From Big Data to Big Impact. MIS quarterly, 36(4), 1165-1188.

Demchenko, Y., Grosso, P., De Laat, C., & Membrey, P. (2013). Addressing big data issues in scientific data infrastructure. Collaboration Technologies and Systems (CTS), 48-55.

Dev, D., & Patgiri, R. (2014). Performance evaluation of HDFS in big data management. High Performance Computing and Applications (ICHPCA), 1-7.

Goncalves, J., Faria, B. M., Reis, L. P., Carvalho, V., & Rocha, A. (2015). Data mining and electronic devices applied to quality of life related to health data. In 2015 10th Iberian Conference on Information Systems and Technologies (CISTI) (pp. 1-4). IEEE.

Hu, H., Wen, Y., Chua, T. S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.

Hu, C., Xu, Z., Liu, Y., Mei, L., Chen, L., & Luo, X. (2014). Semantic link network-based model for organizing multimedia big data. IEEE Transactions on Emerging Topics in Computing, 2(3), 376-387.

Horton, J. J., & Tambe, P. (2015). Labor Economists Get Their Microscope: Big Data and Labor Market Analysis. Big Data, 3(3), 130-137.

Lupton, D. (2014). The commodification of patient opinion: The digital patient experience economy in the age of big data. Sociology of health & illness, 36(6), 856-869.

Mora, A. D., & Fonseca, J. M. (2014). Metodologia para a detecao de artefactos luminosos em imagens de retinografia com aplicacao em rastreio oftalmologico. RISTI--Revista Iberica de Sistemas e Tecnologias de Informacao, (13), 51-63.

Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable real-time data systems. Manning Publications, 47-50.

Paten, B., Diekhans, M., Druker, B. J., Friend, S., Guinney, J. (2015). The nih bd2k center for big data in translational genomics. Journal of the American Medical Informatics Association, 22(6), 1143-1147.

Russell Neuman, W., Guggenheim, L., Mo Jang, S., & Bae, S. Y. (2014). The dynamics of public attention: Agenda-setting theory meets big data. Journal of Communication, 64(2), 193-214.

Rose, D. K., Cohen, M. M., Wigglesworth, D. F., & Yee, D. A. (2014). From the journal archives: improving patient outcomes in the era of big data. Can J Anesth/J Can Anesth, 61, 959-962.

Swan, M. (2013). The quantified self: Fundamental disruption in big data science and biological discovery. Big Data, 1(2), 85-99.

Schadt, E. E. (2012). The changing privacy landscape in the era of big data. Molecular Systems Biology, 8(1), 612-620.

Wang, H., & Wang, J. (2014). An effective image representation method using kernel classification. 2014 IEEE 26th International Conference on Tools with Artificial Intelligence, 853-858.

Ningning Li (1) *, Youqing Qiao (2)

* Ningning Li,

(1) Computer Teaching and Research Section, School of Sport Communication and Information Technology, Shandong Sport University, China

(2) Barracks Department, Jinan Military General Hospital, China

Table 1 -- The parameters involved by the data processing system

Structure of content     Meaning

Start of run             The start time of the data processing job,
                         recorded in the log.

Information recorded     Records of information produced while the
during execution         data processing operation runs.

Computational results    The results of the data processing job.

Environment settings     The input parameters of the data processing
                         run.

Table 2--The accuracy detection of different methodologies

NO.    Ours    Method 1  Method 2  Method 3  Method 4

1      97.86   95.12     96.77     95.66     95.66
2      97.55   95.20     96.85     95.45     95.63
3      97.26   95.22     96.65     95.47     95.69
4      97.53   95.59     96.66     95.45     95.86
5      97.74   95.36     96.30     95.42     95.84
6      97.12   95.33     96.28     95.53     95.75
7      97.55   95.37     96.79     95.26     95.23
8      97.66   95.41     96.56     95.25     95.56
9      97.53   95.15     96.22     95.28     95.33
10     97.58   95.16     96.25     95.89     95.62
COPYRIGHT 2016 AISTI (Iberian Association for Information Systems and Technologies)

Article Details
Author:Li, Ningning; Qiao, Youqing
Publication:RISTI (Revista Iberica de Sistemas e Tecnologias de Informacao)
Date:Oct 1, 2016
