
A region-wise health care service prediction using Hadoop.


Big data analytics is the process of handling and analyzing large amounts of data from varied sources to produce useful information. This information helps industries and other sectors to allocate their resources appropriately. Big data refers not only to large amounts of data but also to noisy, complex and heterogeneous data. Big data raises three main challenges: the data challenge, the processing challenge and the management challenge.

The data challenge deals with three dimensions of data: volume, velocity and variety. Volume refers to the amount of data, velocity to the speed at which the data is processed, and variety to the different sources of data. The processing challenge involves data collection, data preprocessing, data storage, data retrieval and data representation. The management challenge deals with data privacy, data security and governance.

Big data analytics is widely used in different sectors to analyze stored data and extract useful information from it. With this information, a company can improve its performance. Similarly, big data in the healthcare sector is used to provide better health care facilities to people.

Big Data Applications:

Organizations have worked on analyzing and interpreting their data to open up a new horizon of opportunities.

Big data is about data and the ways in which it can be analyzed. This gives an opportunity for learning and discovery at every stage. Big data offers opportunities in many sectors, as listed below:

* Banking and security

* Communication media and services

* Education

* Government

* Health sector

* Protection

* Insurance

* Transportation

* Other merchants, for example retailers and wholesalers

When it comes to healthcare analytics, hospitals and health systems benefit most from the information if they move towards understanding the analytic discoveries, rather than just focusing on the straight facts [11]. Using analytics for clinical decision support can significantly improve the way clinicians make decisions about their patients, while at the same time cutting costs for organizations across the country and improving patient wellness.

Handling Big Data:

A. Big data Handling Techniques:

Handling Big Data is another significant concern. The following are some emerging technologies that help users cope with and handle Big Data in a cost-effective way. Big data handling can be considered with respect to the following aspects:

* Processing big data: MapReduce; Hadoop is an integrated framework for processing big data

* Analysis and querying of data: WibiData, PLATFORA, PIG

* Business intelligence: Hive

* Storage: cloud storage, column-oriented databases, schema-less databases

* Machine learning: Apache Mahout, SkyTree

Some of the big data handling techniques are outlined below [5].

B. MapReduce:

MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster. Map function: a map transformation is provided to transform an input row of key and value into an output key/value:

map(key1, value1) -> list<key2, value2>

That is, for an input it returns a list containing zero or more (key, value) pairs. The output can have a different key from the input, and the output can have multiple entries with the same key. Reduce function: a reduce transformation is provided to take all values for a specific key and generate a new list of the reduced output:

reduce(key2, list<value2>) -> list<value3>
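The map and reduce signatures can be illustrated with a small single-process sketch in Python. It is not Hadoop code: the record layout, the region and diagnosis names and the helper run_mapreduce are assumptions made for illustration, and the grouping step only imitates the shuffle that a real cluster performs between the two phases.

```python
from collections import defaultdict

def map_record(key, value):
    """map(key1, value1) -> list<key2, value2>.

    key is a record offset (unused), value is a hypothetical
    (region, diagnosis) pair; one pair is emitted per record.
    """
    region, diagnosis = value
    return [((region, diagnosis), 1)]

def reduce_counts(key, values):
    """reduce(key2, list<value2>) -> list<value3>: sum the counts for one key."""
    return [sum(values)]

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):
            groups[key2].append(value2)  # shuffle: group values by key2
    return {key2: reducer(key2, values) for key2, values in groups.items()}

# Hypothetical patient records: (offset, (region, diagnosis)).
records = [
    (0, ("region_a", "diabetes")),
    (1, ("region_b", "diabetes")),
    (2, ("region_a", "diabetes")),
]
result = run_mapreduce(records, map_record, reduce_counts)
```

Here result maps each (region, diagnosis) key to a one-element list holding its count, which mirrors the list-valued output of the reduce signature.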

C. Hadoop:

Apache Hadoop is an open source framework that enables organizations to quickly gain insight from massive amounts of structured and unstructured data. It is used for maintaining, scaling and analyzing large volumes of data.

D. Pig:

Apache PIG is a platform for analyzing large data sets. PIG's language, PIG Latin, lets the user specify a sequence of transformation functions such as merge, filter and grouping. Apart from built-in functions, it also provides a facility for user-defined functions to do special-purpose processing. PIG's language allows queries to be executed over data stored on a Hadoop cluster, as an alternative to a "SQL-like" language.

E. Hive:

Apache Hive is a data warehouse system built on top of Hadoop for providing data summarization, query and analysis. Hive gives a SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java API without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based query languages, Hive supports easy portability of SQL-based applications to Hadoop.

H. Schema-Less Databases, or NoSQL:

Several database types fit into this category, for example key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured or even structured data. They achieve performance gains by doing away with some (or all) of the restrictions traditionally associated with conventional databases.

Challenges Of Big Data:

Big data, which is typically of terabyte or petabyte size, is bound to be confronted with many theoretical, technical, technological and practical challenges. Serious research efforts are being invested in order to improve the efficiency of storage, processing and analysis of big data. The following are the various challenges faced while handling big data.

A. Data Acquisition and Recording:

It is critical to capture the context in which data has been generated, and to be able to filter out the noise and compress the data during preprocessing. Preprocessing of data is complex and time-consuming, hence the real challenge is handling large volumes of unstructured and structured data continuously arriving from countless sources. A solution would therefore require the development of new technologies and architectures, designed to efficiently extract value from large volumes of a wide variety of data by enabling high-velocity capture, discovery and analysis.

B. Information Extraction and Cleaning:

Frequently, data must be transformed in order to extract information from it and to express this information in a form that is suitable for analysis. Data may also be of poor quality or uncertain. Extracting useful information from such huge amounts of low-quality data is one of the major challenges faced in big data. Data cleaning and data quality verification are therefore critical.

C. Information Integration, Conglomeration and Visualization:

Data might not be homogeneous and may have different metadata. Data integration therefore requires huge human effort. Manual approaches fail to scale to what is required for big data, so newer and better approaches are needed. In addition, different data aggregation and representation strategies may be needed for different data analysis tasks.

D. Query Processing and Analysis:

Methods suitable for big data need to be discovered and evaluated for efficiency so that they can deal with noisy, dynamic, heterogeneous and untrustworthy data. Despite these difficulties, big data, even if noisy and uncertain, can be more valuable for identifying reliable hidden patterns and knowledge than small samples of good data.

II. Related Work:

An application of information and communication technology in home care for communication between patients, family members and health care professionals is presented by B. Lindberg [1]. The aim of this study is to provide health care solutions to elderly people at home and to provide proper facilities to people at home via e-health. E-health provides peer-to-peer communication between patients and health service providers. Video conferencing was introduced for communication between patients and health care personnel, and can also be held between patient and nurse. Video conferencing helps patients because they need not go to the hospital in person.

Javier Andreu-Perez [2] carried out research on big data for health. This research provides an overview of recent technological innovation in big data. It considers data collected from varied sources, both structured and unstructured, and discusses privacy, security and ownership of data in health care. Big data can serve to boost the applicability of clinical research studies to real-world scenarios, where population heterogeneity is an obstacle. It equally provides the scope to enable effective and precision medicine by performing patient stratification. Another important factor to consider is rapid and seamless health data acquisition, which will contribute to the success of big data in medicine. Specifically, sensing provides a solid set of solutions to fill this gap.

A survey on the interpretability of linguistic fuzzy rule-based systems is contributed by M. Jose Gacto [3]; it addresses both accuracy and interpretability in fuzzy rule generation. It covers linguistic fuzzy modeling and precise fuzzy modeling, and it is necessary to point out the measures of the different quadrants. The problem faced here is that it cannot handle more than three objectives. An overview of the proposed interpretability measures and techniques for obtaining more interpretable linguistic Fuzzy Rule-Based Systems (FRBSs) was presented. It is organized along two axes: "complexity versus semantic interpretability", considering the two main kinds of measures; and "rule base versus fuzzy partitions", considering the different components of the Knowledge Base (KB) to which both kinds of measures can be applied. This leads to four different quadrants being analyzed: the complexity at the Rule Base (RB) level; the complexity at the fuzzy partition level; the semantics at the RB level; and the semantics at the fuzzy partition level. The main aim is to provide a well-established framework in order to facilitate a better understanding of the topic and well-founded future work.

Rule weight specification in fuzzy rule-based classification systems by H. Ishibuchi [4] assigns weights to the rules and classifies according to those weights, which makes the fuzzy rules more accurate. An unlimited number of training patterns is assumed to be available, so they are distributed within some particular interval. Four definitions of rule weight are compared with one another through computer simulation on the wine and glass data sets. The given samples are tested using Design and Test procedures.

A cancer prediction technique using a fuzzy logic system is presented by M. Poongodi and L. Manjula [5]. The study provides cancer prediction using fuzzy rules that are generated from behavior and symptoms. The system is applicable to predicting cancer for an individual person. The authors carried out risk analysis and pre-diagnosis for breast cancer, which was selected as the cancer type, by using the fuzzy logic model, and developed a system which gives suggestions to individuals as a guide to reduce or eliminate the risk on the basis of the risk status of the cancer type. The reason for selecting the fuzzy logic model in this study is that a system using it can give effective results from uncertain verbal information, just like human reasoning. Fuzzy logic models are used here to reach a general solution by doing only limited experiments. Other methods take a long time to apply to such a problem. Fuzzy logic gives the quickest solution to the problem and avoids this loss of time.

Valarmathi and Ayesha [6] presented a paper on the prediction of risk in breast cancer using the fuzzy logic toolbox in the MATLAB environment. The paper describes a system that predicts breast cancer using the ID3 algorithm, association rules and fuzzy logic. Attributes such as tumor size and number of nodes are considered. The linguistic variables Very Risk (VR), Very Risk Moderate (VRM), Risk (R) and Not Risk (NR) were used to give a breast cancer risk prognosis. A 20% criterion value in the association rule showed 18 taluks, including Coimbatore, to be risk regions for breast cancer.

Hardship financing of healthcare among the rural poor in Orissa, India was studied by Erika Binnendijk [7]. It mainly focuses on providing funds to people in rural areas, based on collecting data from an area, predicting disease and directing funds to that area. Poor households are subjected to considerable and protracted financial hardship due to the indirect and longer-term harmful effects of how they cope with out-of-pocket health care costs. The social network that households can access influences exposure to hardship financing. The findings point to the need to develop a policy solution that would limit that exposure both in quantum and in time. The authors therefore reason that policy interventions aiming to ensure health-related financial protection would need to show that they have reduced the frequency and the volume of hardship financing.

Hence, the objective of this work is to propose the design and implementation of a Hadoop-based system for processing large volumes of health care data from a rural hospital.

III. Architecture For Prediction Framework Using Hadoop:

The proposed system consists of an architecture for a prediction framework for health care services, using big data analytics tools to provide healthcare services to a group of people.

The architecture involves the following phases: data preprocessing, the quantifier and input slicer, the Hadoop framework and the analytical engine. The architecture is intended to provide better prediction and higher accuracy than other systems.

The raw medical data is given as input to the preprocessor phase, which removes all redundant and noisy data from the original data. The fuzzy rules and weights are specified, and the rules are coded and given as input to the Hadoop MapReduce phase. Hadoop computes in parallel across all the nodes in the loosely coupled distributed computing environment (DCE), and the result is stored as an XML (eXtensible Markup Language) instance. Such XML instances are analyzed in the analytical engine using Hive or Pig, which visualizes the data and gives a statistical report. Various predictions can then be made by the Medical Officers or DMOs (District Medical Officers) to determine whether there is a chance of any health shocks due to various reasons.

A. Data Preprocessing:

Data preprocessing is the initial stage in the architecture. It removes redundant data from the given data set and normalizes the data elements.
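The two preprocessing operations named here, removing redundant records and normalizing data elements, can be sketched in Python as follows. This is a minimal sketch under assumptions: the source does not say which normalization is applied, so min-max scaling to the range [0, 1] is used as an example, and the numeric rows and the function name preprocess are hypothetical.

```python
def preprocess(rows):
    """Drop exact duplicate rows, then min-max normalize each column.

    rows is a list of equal-length tuples of numbers. Min-max scaling
    is an assumed choice; the source does not name the normalization.
    """
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:  # keep only the first copy of each row
            seen.add(row)
            unique.append(row)

    columns = list(zip(*unique))  # transpose rows into columns
    scaled_columns = []
    for column in columns:
        low, high = min(column), max(column)
        span = high - low
        # A constant column has no spread, so map it to 0.0.
        scaled_columns.append(
            [(v - low) / span if span else 0.0 for v in column]
        )
    return [tuple(row) for row in zip(*scaled_columns)]

# Hypothetical readings: (attribute_1, attribute_2), one row duplicated.
raw = [(120.0, 80.0), (120.0, 80.0), (140.0, 90.0), (160.0, 100.0)]
clean = preprocess(raw)
```

After the call, clean holds three rows because the duplicated first row has been dropped, and every column is scaled so that its minimum becomes 0.0 and its maximum 1.0.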

B. Quantifier and input slicer:

The quantifier slices the input data into n-dimensional rows and columns. This sliced data is given as input to the Hadoop framework.
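The source does not describe the slicing rule in detail. One plausible reading, shown here purely as an assumption, is that the preprocessed rows are cut into fixed-size blocks so that each block can be handed to a separate worker. The function name slice_rows and the block size are hypothetical.

```python
def slice_rows(rows, block_size):
    """Split a list of rows into consecutive blocks of at most block_size rows."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]

# Five hypothetical row identifiers sliced into blocks of two.
blocks = slice_rows([0, 1, 2, 3, 4], 2)
```

The last block may be shorter than the others when the number of rows is not a multiple of the block size.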

C. Fuzzy Rule Extraction:

In the fuzzy rule extraction phase, the linguistic identifiers are defined, the fuzzy rules are extracted from the data and compressed, and the rule weights are calculated from the given data set.

D. Hadoop Framework:

The Hadoop framework allocates the job to several data nodes. The tasks are split among the appropriate data nodes, and each data node performs its operations in parallel.

E. Analytical Engine:

The analytical engine is used to visualize the data. The data instances are analyzed and visualized using the analytical tool Hive or Pig. The visualization gives a statistical report, and the predictive model is derived from this report. Apache Hive is the data warehousing component of the Hadoop ecosystem; it provides a SQL-like query interface (HiveQL) over data stored in Hadoop. Figure [4.7] describes the analytical engine, in which Apache Hive is an element of the cloud-based Hadoop ecosystem. Because HiveQL resembles SQL, it is approachable for users of SQL databases such as Oracle and IBM DB2.

The outcome of the proposed system is an assessment of the health of the public and of patterns of illness and injury. The following outcomes can also be obtained through the proposed health care analytics:

1. Identify the unmet regional health needs

2. Document patterns of health care expenditures on inappropriate, wasteful, or potentially harmful services.

3. Find cost-effective care providers

4. Improve the quality of care in hospitals, practitioners' offices, clinics, and various other health care settings.

IV. Methodologies:

Rapid Miner:

Various tools and techniques are used to implement the system, and each phase uses a different tool. The preprocessing step uses RapidMiner to process the data. The fuzzy rule extraction phase uses MATLAB to generate the rules. The analytical engine uses Hive or Pig to analyze the data.

* RapidMiner Studio is a visual workflow designer that makes it easy to build complete analytic workflows.

* It contains a large library of machine learning algorithms and functions to build the best possible model for any use case.

In the past, many data mining tools were available for preprocessing, but they were very expensive and complicated to install. More recently Rapid Miner has been used, which is very easy to install and use. Rapid Miner has two major areas: Repositories and Operators. The Repositories area is where you link the data set that you want to mine. The Operators area is where all the data mining tools are located. Rapid Miner Studio contains many operators for preprocessing the data, and an operator can be found using the search field above the list. Rapid Miner is mainly used here for preprocessing the data set. The data set can be in formats such as CSV, ARFF, Excel and XML. Missing values in the data set can be handled in the following ways:

1) Remove tuples with missing values

2) Replace missing values with most suitable constant values

3) Replace missing values with mode value for nominal attributes and mean for continuous numeric attributes

4) Replace missing values with random numbers

5) Replace missing values with estimated values using linear regression
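Options 1 and 3 in the list above can be sketched in plain Python. This is an illustration only and does not use RapidMiner: missing values are represented by None, and the function names and sample columns are assumptions.

```python
from statistics import mean, mode

def drop_incomplete(rows):
    """Option 1: remove tuples that contain any missing value (None)."""
    return [row for row in rows if None not in row]

def impute_column(values):
    """Option 3: fill missing entries with the mean of a numeric column
    or the mode of a nominal column."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate from
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)   # continuous numeric attribute
    else:
        fill = mode(present)   # nominal attribute
    return [fill if v is None else v for v in values]

# Hypothetical columns with one missing entry each.
numeric_filled = impute_column([1.0, None, 3.0])
nominal_filled = impute_column(["a", None, "a", "b"])
complete_rows = drop_incomplete([(1.0, "a"), (None, "b"), (3.0, "a")])
```

Dropping tuples is the simplest choice but discards data, whereas mean or mode replacement keeps every record at the cost of reducing the variance of the filled attribute.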

Fuzzy Logic:

Fuzzy logic is an extension of Boolean logic and is based on fuzzy set theory. The fuzzy rule summarization technique is used to generate the predictive model for health shock prediction and to generate the rules for causal factors.

Fuzzy logic aims to model human thinking and to apply the model to problems as needed. It tries to give computers the ability to process the particular knowledge of people and to work by making use of their experiences and insights. When human logic solves problems, it forms verbal rules such as "if <event realized> is this, then <result> is that". Fuzzy logic tries to transfer these verbal rules, and the human ability to make decisions, to machines. It uses verbal variables and terms together with verbal rules. The verbal rules and terms used in human decision making are fuzzy rather than exact. Verbal terms and variables are expressed mathematically as membership degrees and membership functions. Fuzzy decision-making systems use symbolic verbal expressions rather than numeric values. Transferring these symbolic verbal expressions to computers rests on a mathematical basis, and this mathematical basis is fuzzy logic. The objective of this study is to parallelize the processing of the medical dataset and to predict future health shocks. For predicting the disease, the fuzzy rule summarization technique is used.
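How a verbal term becomes a membership degree, and how an "if ... then ..." rule is evaluated, can be sketched in Python. The triangular membership shape, the use of min for "and", the rule weight and every numeric breakpoint below are assumptions chosen for illustration; they are not taken from the study and are not clinical thresholds.

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set.

    a and c are the feet of the triangle and b is its peak (a < b < c).
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def rule_strength(memberships, weight=1.0):
    """Firing strength of a rule whose conditions are joined by "and".

    The minimum models "and"; weight is a hypothetical rule weight.
    """
    return weight * min(memberships)

# Rule sketch: if attribute_1 is "high" and attribute_2 is "high"
# then risk is "high". All breakpoints are placeholders.
attr1_high = triangular(150.0, 100.0, 200.0, 300.0)
attr2_high = triangular(150.0, 120.0, 160.0, 200.0)
strength = rule_strength([attr1_high, attr2_high], weight=0.8)
```

The result is a degree between 0 and 1 rather than a yes or no answer, which is what lets a rule fire partially when a reading lies between two verbal terms.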

These attributes are used to generate the fuzzy rules that predict health shocks in a particular region. Each attribute has a particular range, so it is easy to detect when a value falls outside that range. Such occurrences are recorded for every person in the region.
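The range check described above can be sketched in Python as follows. The attribute names and the numeric ranges are placeholders invented for illustration, not values from the study and not clinical thresholds.

```python
# Hypothetical attribute ranges as (low, high), both inclusive.
RANGES = {
    "attribute_1": (90, 120),
    "attribute_2": (70, 100),
}

def out_of_range(record, ranges):
    """Return the sorted names of attributes whose value lies outside its range."""
    return sorted(
        name
        for name, (low, high) in ranges.items()
        if name in record and not low <= record[name] <= high
    )

# One hypothetical person record from a region.
flags = out_of_range({"attribute_1": 150, "attribute_2": 85}, RANGES)
```

Collecting these flags per person gives the per-region record of out-of-range attributes from which the rules are generated.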

The output of Rapid Miner is used as input to the fuzzy logic stage, where rules are generated according to the ranges, as shown in the figure.

The output of the fuzzy logic stage is given as input to Hadoop. In Hadoop, the master node assigns the data to the data nodes, whose processing reduces the size of the data.

Comparison Of Analytical Engines Within Rapidminer:

The output from Hadoop is the reduced data, which is given as input to the analytical engine. The engine produces the statistical report of the health shock prediction.


There is no publicly available dataset that can help to understand and monitor health care conditions. The aim was therefore to analyze and model such a dataset to understand the relationships between the data, the conditions and their effect on health. The availability and interpretability of this data can help governments to devise policies for public authorities and NGOs, so as to start community-based health programs. The preprocessed data was used to generate fuzzy rules to forecast health shocks, using a technique that produces an interpretable rule-based model to visualize and predict the extent of health shocks experienced by individuals. This study aims to predict the outcome of heart disease for diabetic patients. It is a first initiative to examine and understand the health care system and the occurrence of health shocks in rural areas.


[1.] Lindberg, B., C. Nilsson, D. Zotterman, S. Soderberg, L. Skar, 2013. Using information and communication technology in home care for communication between patients, family members, and healthcare professionals: A systematic review, Int. J. Telemed. 31.

[2.] Andreu-Perez, J., C.C.Y. Poon, R.D. Merrifield, S.T.C. Wong, G.Z. Yang, 2015. Big data for health, IEEE J. Biomed. Health Inform., 19(4): 1193-1208.

[3.] Gacto, M.J., R. Alcala, F. Herrera, 2011. Interpretability of linguistic fuzzy rule-based systems: An overview of interpretability measures, Inform. Sci., 181(20): 4340-4360.

[4.] Ishibuchi, H., T. Yamamoto, 2005. Rule weight specification in fuzzy rule-based classification systems, IEEE Trans. Fuzzy Syst. 13(4).

[5.] Poongodi, M., L. Manjula, S. Pradeepkumar, M. Umadevi, 2012. Cancer prediction technique using fuzzy logic, Int. J. Curr. Res., 4: 106-110.

[6.] Valarmathi, S., Ayesha Sulthana, Ramya Rathan, K.C. Latha, S. Balasubramanian, R. Sridhar, Prediction of risk in breast cancer using fuzzy logic tool box in MATLAB environment.

[7.] Binnendijk, E., R. Koren, D.M. Dror, 2012. Hardship financing of healthcare among rural poor in Orissa, India, BMC Health Serv. Res., 12: 23.

[8.] Kelsey Brimmer, 2013. "5 ways hospitals can use data analytics". Accessed online on 16th November 2016.

[9.] Priyanga, P., V.P. MuthuKumar, 2015. Cloud computing for healthcare organization, Int. J. Multidiscip. Res. Dev., 2: 487-493.

[10.] Mahmud, S., R. Iqbal, F. Doctor, 2014. An integrated framework for the prediction of health shocks, in: Proceedings of The 2nd International Conference on Applied Information and Communications Technology, ICAICT.

[11.] Chandrashekar, R., M. Kala, D. Mane, 2015. Integration of big data in cloud computing environments for enhanced data processing capabilities, Int. J. Eng. Res. Gen. Sci., 3: 240-245.

[12.] Doctor, F., R. Iqbal, R.N. Gorgui-Naguib, 2014. A fuzzy ambient intelligent agents approach for monitoring disease progression of dementia patients, J. Ambient Intell. Humanized Comput., 5(1): 147-158.

(1) Suganya N and (2) Dr.(Mrs). T. Hemalatha

(1) Computer Science and Engineering Psnacet Dindigul India.

(2) Associate Professor, Computer Science and Engineering Psnacet Dindigul, India.

Received 28 January 2017; Accepted 22 March 2017; Available online 28 April 2017

Address For Correspondence:

Suganya N, Computer Science and Engineering Psnacet Dindigul, India.


Caption: Fig. 1: Challenges of Big Data

Caption: Fig. 2: Big Data Challenges

Caption: Fig. 3: Prediction Framework Architecture

Caption: Fig. 4: Preprocessing

Caption: Fig. 5: Quantifier

Caption: Fig. 6: Fuzzy Rule Extraction

Caption: Fig. 7: Hadoop Framework

Caption: Fig. 8: Hadoop architecture

Caption: Fig. 9: Attributes

Caption: Fig. 10: Rapid Miner Computation Engine Runtimes
COPYRIGHT 2017 American-Eurasian Network for Scientific Information

Article Details
Author:Suganya, N.; Hemalatha, T.
Publication:Advances in Natural and Applied Sciences
Article Type:Report
Date:Apr 30, 2017