
Tweet segmentation and classification for short text in rumor based identification using KNN approach.


Big data is a term for data whose volume, processing demands, velocity and variety (both structured and unstructured) exceed what conventional storage and processing can handle. In most enterprise situations the amount of information is too large, changes too quickly, or exceeds current processing capability. Despite these problems, big data helps companies improve their operations and make more intelligent decisions. When handling larger datasets, organizations face difficulties in creating, handling and managing the data. Large datasets are hard to handle in business analytics because there are no standard tools and procedures designed to explore and analyze them. Big data is characterized by seven Vs, as shown in Figure 1.1.

Social Media Data:

Data on social interactions is an increasingly important set of data, mainly for sales, marketing support functions and user sentiment. These data are often unstructured or semi-structured, so besides the sheer size of the data, they pose a unique challenge when extracting and analyzing the information they contain. Examples of social media are Gmail, Facebook, Twitter etc. Section II presents the literature relevant to the technical implementation of the work. The proposed work is explained in Section III. Sections IV and V provide insights on the implementation aspects. Section VI concludes and gives directions for enhancing the work in the future.

Short Text Message:

Text messages are used by youth and adults for personal, family and social purposes, and in business, government and non-governmental organizations for communication between colleagues. Text messaging is most often used between private mobile phone users as a substitute for voice calls in situations where voice communication is impossible. Short message services are developing very rapidly throughout the world. SMS is hugely popular in India, where youngsters often exchange many text messages, and companies provide alerts, infotainment, news, cricket score updates, railway/airline booking, mobile billing and banking services over SMS. Short text messages are used to reduce typing when messaging on a cell phone, smartphone or computer keyboard. Some may call it "internet slang".

II. Literature Survey:

Tweet segmentation is a major part of identifying rumors in text. Many methodologies have been proposed for segmenting tweets, among them approaches based on hashtags, POS tags and named entities, and the hybrid approach HybridSeg.

Chenliang Li et al. [7] proposed HybridSeg, a hybrid tweet segmentation framework that combines local context with existing external knowledge bases. HybridSeg segments tweets in batches.

Xiaohua Liu et al. [16] proposed combining a K-Nearest Neighbors (KNN) classifier with a Conditional Random Fields (CRF) sequence model under a semi-supervised learning framework to address named entity recognition.

Xiangyang Zhou et al. [17] proposed jointly extracting social events from multiple related tweets using a new factor graph that exploits the redundancy in tweets, i.e., the repeated occurrence of a social event in several tweets.

Alan Ritter et al. [2] presented a novel approach to categorizing events in an open-domain text genre with unknown types. Their approach is based on a latent variable model that first discovers event types matching the data, which are then used to classify aggregate events without any annotated examples.

Xinfan Meng et al. [15] proposed an entity-centric, topic-oriented opinion summarization framework that is capable of producing opinion summaries organized by topic and of emphasizing the insight behind the opinions on Twitter. They decompose opinion summarization into three dimensions, namely topic, opinion and insight; the opinion summary is generated by integrating these dimensions.

Avirup Sil and Alexander Yates [4] proposed a discriminative re-ranking framework that allows features to be introduced into the model that capture the dependency between entity linking decisions and mention boundary decisions, which existing models do not handle. The framework ranks candidate mention-entity pairs together to make joint predictions.

Chenliang Li, Aixin Sun and Anwitaman Datta [8] presented a segment-based event detection system for tweets, referred to as Twevent. Twevent first detects bursty tweet segments as event segments, and then clusters the event segments into events, considering both their distribution and content similarity. Named entities recognized with high confidence enhance the performance of tweet segmentation.

Xiaolong Wang, Furu Wei et al. [19] addressed hashtag-level sentiment classification. This task aims to automatically generate the overall sentiment polarity for a given hashtag during a certain period, which differs markedly from standard sentence-level and document-level sentiment analysis. They propose a graph model to boost the results of the baseline, which effectively incorporates the sentiment information of tweets and the co-occurrence relationships of hashtags.

III. Proposed Work:

The proposed system uses the HybridSeg approach to find the optimal segmentation of tweets. HybridSeg is driven by named entities extracted from the posts of a user's followers and from the user's own posts. Because it is difficult to classify rumors in individual tweets, a K-Nearest Neighbor (K-NN) classifier is applied to eliminate short text in rumor-based tweets.
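To make the idea of an optimal segmentation concrete, the following is a minimal sketch, not the HybridSeg implementation. It assumes that each candidate segment has a numeric score and that the best segmentation is the one with the highest total score, found by dynamic programming. The function `segment_tweet`, the scorer `toy_score`, the `TOY_SCORES` values and the `max_len` limit are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, ...]


def segment_tweet(
    words: List[str],
    score: Callable[[Segment], float],
    max_len: int = 3,
) -> List[Segment]:
    """Split words into consecutive segments that maximize the summed score."""
    n = len(words)
    # best[i] is the best total score for words[:i];
    # back[i] is the start index of the last segment in that solution.
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            candidate = best[start] + score(tuple(words[start:end]))
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Follow the back-pointers from the end to recover the segments.
    segments: List[Segment] = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(words[start:end]))
        end = start
    return segments[::-1]


# Hypothetical scores: a real system would derive them from local and
# global context; here two phrases are simply given high values.
TOY_SCORES = {("new", "york"): 2.0, ("big", "data"): 1.5}


def toy_score(segment: Segment) -> float:
    if segment in TOY_SCORES:
        return TOY_SCORES[segment]
    # Single words get a small default; unknown multi-word segments get none.
    return 0.5 if len(segment) == 1 else 0.0


result = segment_tweet("big data conference in new york".split(), toy_score)
print(result)
```

With these toy scores, the two known phrases are kept together and the remaining words become single-word segments. Any scoring function can be passed in place of `toy_score`.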

Twitter is a micro-blogging social media platform with hundreds of millions of users. It is a social network where users can publish and exchange short messages of up to 140 characters, known as tweets. A rumor can be defined as an unverified assertion that starts from one or more sources and spreads over time from node to node in a network. Figure 1.2 shows the architectural design of tweet processing from the big data perspective. The short text datasets are collected in the data acquisition process. Data preprocessing then covers cleaning, normalization, transformation and stemming. The keywords are analyzed using a POS tagger. Next, hybrid segmentation is performed: HybridSeg learns from both global and local contexts, and has the ability to learn from pseudo feedback. The segments recognized with high confidence based on local context serve as good feedback for extracting more meaningful segments. After basic segmentation, a great number of named entities in the text, such as personal names, location names and organization names, are not yet segmented and recognized properly. The KNN approach is then used to classify the short text in rumor-based tweets. The k-NN algorithm is among the simplest machine learning algorithms. KNN classification is used to label each tweet, and the tweets labeled as rumors are eliminated. A system that can detect short text messages that are rumors and predict their veracity, and possibly their impact, would be a very valuable and useful tool.
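The preprocessing stage described above could look roughly like the following sketch. It is an illustration under stated assumptions, not the pipeline used in this work: the stop-word list, the suffix list and the function names `clean`, `stem` and `preprocess` are hypothetical, and the crude suffix stripping stands in for a real stemmer. POS tagging is not shown.

```python
import re
from typing import List

# Hypothetical stop-word list, including the Twitter-specific token "rt".
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "rt"}

# Hypothetical suffix list for a very crude stemmer (longest suffix first).
SUFFIXES = ("ing", "ed", "es", "s")


def clean(text: str) -> str:
    """Lowercase the tweet and remove URLs, mentions and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = text.replace("#", "")               # keep the hashtag word itself
    return re.sub(r"[^a-z0-9\s]", " ", text)   # drop remaining punctuation


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet: str) -> List[str]:
    """Clean the tweet, remove stop words and stem the remaining tokens."""
    tokens = clean(tweet).split()
    return [stem(token) for token in tokens if token not in STOP_WORDS]


tokens = preprocess(
    "RT @user: Rumors are spreading in the city! http://t.co/abc #breaking"
)
print(tokens)
```

The order of the steps matters in this sketch: URLs are removed before punctuation so that fragments of a link do not survive as tokens.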

IV. Implementation:

As defined earlier, a rumor is a statement whose truth value is unverifiable. Misinformation spreads through rumors on social media. Identifying rumors is critical in online social media, where huge amounts of information are easily spread across a large network by sources of unproven authority. This paper focuses on the development of the HybridSeg and KNN approach to the classification of tweets (posts on Twitter). HybridSeg learns from both global and local contexts and has the ability to learn from pseudo feedback. In order to analyze the textual content of the tweets, a summary of the top terms occurring in each type of topic is given to the classifiers. A filtering process eliminates irrelevant words by removing all the stop words contained in the tweets, including Twitter-specific words for the main languages in the dataset. Next, the Term Frequency (TF) is calculated for each word and each type of trending topic. This gives a list of words for each type of trending topic, ranked in descending order of TF value. These steps are implemented in the global and local contexts. Before extracting pseudo feedback, a POS tagger is applied to define features that categorize words as adverbs, adjectives and so on.
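The stop-word filtering and term-frequency ranking can be sketched as follows. This is a minimal illustration that assumes TF is the raw count of a word within a topic type; the stop-word list, the topic labels, the sample tweets and the function name `rank_terms_by_topic` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical stop-word list with one Twitter-specific token ("rt").
STOP_WORDS = {"a", "is", "the", "for", "rt"}


def rank_terms_by_topic(
    tweets: List[Tuple[str, str]]
) -> Dict[str, List[Tuple[str, int]]]:
    """Count words per topic type and rank them by descending frequency."""
    counts: Dict[str, Counter] = {}
    for topic, text in tweets:
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        counts.setdefault(topic, Counter()).update(words)
    # Sort by descending count; ties are broken alphabetically so the
    # ranking is deterministic.
    return {
        topic: sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        for topic, counter in counts.items()
    }


# Hypothetical (topic type, tweet text) pairs.
sample = [
    ("news", "the flood hit the city"),
    ("news", "flood warning for the city"),
    ("meme", "cat video rt"),
]
ranked = rank_terms_by_topic(sample)
for topic, terms in ranked.items():
    print(topic, terms)
```

The top entries of each ranked list are the kind of per-topic term summary that could be passed on to a classifier.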

Algorithm: KNN Classification--Rumor prediction:

1. Read the training data from a file <x, f(x)>

2. Read the testing data from a file <x, f(x)>

3. Set K to some value

4. Normalize the attribute values to the range 0 to 1: value = value / (1 + value)

5. Apply Backward Elimination

6. For each testing example in the testing data set

a) Find the K nearest neighbors within the training data set based on the Euclidean distance

b) Predict the class value by finding the maximum class represented in the K nearest neighbors

7. Calculate the accuracy as Accuracy = (# of correctly classified examples / # of testing examples) * 100
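The steps of the algorithm can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the data is given in memory instead of being read from files, the backward elimination step is omitted, attribute values are assumed to be non-negative so that value / (1 + value) stays between 0 and 1, and the feature vectors, labels and function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]


def normalize(row: Sequence[float]) -> List[float]:
    """Map each non-negative attribute into [0, 1) via value / (1 + value)."""
    return [value / (1.0 + value) for value in row]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train: List[Example], query: Sequence[float], k: int) -> str:
    """Return the majority class among the k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def evaluate(train: List[Example], test: List[Example], k: int) -> float:
    """Normalize both sets, classify every test example, return accuracy in %."""
    train_norm = [(normalize(x), y) for x, y in train]
    correct = sum(
        1 for x, y in test if knn_predict(train_norm, normalize(x), k) == y
    )
    return 100.0 * correct / len(test)


# Hypothetical two-attribute examples with hypothetical labels.
train_data: List[Example] = [
    ([0, 0], "not_rumor"), ([1, 0], "not_rumor"), ([0, 1], "not_rumor"),
    ([8, 9], "rumor"), ([9, 8], "rumor"), ([9, 9], "rumor"),
]
test_data: List[Example] = [([0.2, 0.1], "not_rumor"), ([7, 9], "rumor")]

accuracy = evaluate(train_data, test_data, k=3)
print(f"Accuracy: {accuracy:.1f}%")
```

Note that value / (1 + value) compresses large values toward 1, so differences between large attribute values contribute less to the Euclidean distance than differences between small ones.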


Twitter is an excellent case for investigating misinformation in social media. This work was implemented by collecting and annotating a large dataset comprising all the tweets containing rumors in a definite period of time. Tweets are classified by entity based on opinion, as positive, negative or neutral. Depending on the identification method, words in tweets can be similar words, mismatched words or accurate words. Those words are analyzed using the KNN classifier. Only the words with the highest accuracy are predicted as rumor. Table 1 shows sample words identified as rumor, obtained by splitting the tweets with regular expressions and classifying them with the KNN classifier.

V. Conclusion And Future Work:

In this work, we implemented the KNN classification algorithm for tweet segmentation and found it very effective. The main aim is to build a system that employs this work and the emerging patterns within the re-tweet network topology to determine whether a short text is a rumor or not. Further work involves using more advanced linguistic techniques to improve the rate of correct tense identification. In particular, improvements to the analysis of verb phrases and modifications of the marked parameters for sentences might be very useful. By making it simple to match news coverage to Twitter posts about an event, the system offers both up-to-the-minute information and valuable insight into past events. In future, we can extend our approach with various classification algorithms to predict attackers and eliminate them from Twitter datasets, and apply the approach to tweets in various languages.


[1.] Ritter, A., S. Clark, Mausam and O. Etzioni, 2011. Named entity recognition in tweets: An experimental study, in EMNLP, pp: 1524-1534.

[2.] Ritter, A., Mausam, O. Etzioni and S. Clark, 2012. Open domain event extraction from twitter, in KDD, pp: 1104-1112.

[3.] Cui, A., M. Zhang, Y. Liu, S. Ma, and K. Zhang, 2012. Discover breaking events with popular hashtags in twitter, in CIKM, pp: 1794-1798.

[4.] Sil and A. Yates, 2013. Re-ranking for joint named-entity recognition and linking, in CIKM, pp: 2369-2374.

[5.] Chenliang Li, Aixin Sun, Jianshu Weng, and Qi He, 2015. Tweet Segmentation and its Application to Named Entity Recognition, IEEE Transactions, 27: 2.

[6.] Li, J. Weng, Q. He, Y. Yao, A. Datta, A. Sun and B.-S. Lee, 2012. Twiner: Named entity recognition in targeted twitter stream, in SIGIR, pp: 721-730.

[7.] Li, A. Sun, J. Weng, and Q. He, 2013. Exploiting hybrid contexts for tweet segmentation, in SIGIR, pp: 523-532.

[8.] Li, A. Sun, and A. Datta, 2012. Twevent: segment-based event detection from tweets, in CIKM, pp: 155-164.

[9.] Kouloumpis, T. Wilson, and J. Moore, 2011. Twitter sentiment analysis: The good the bad and the omg, in ICWSM, pp: 538-541.

[10.] Zhou and J. Su, 2002. Named entity recognition using an hmm based chunk tagger, in ACL, pp: 473-480.

[11.] Liu, K.-L., W.-J. Li, and M. Guo, 2012. Emoticon smoothed language models for twitter sentiment analysis, in AAAI.

[12.] Gimpel, K., N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith, 2011. Part-of-speech tagging for twitter: annotation, features, and experiments, in ACL-HLT, pp: 42-47.

[13.] Jiang, L., M. Yu, M. Zhou, X. Liu, and T. Zhao, 2011. Target-dependent twitter sentiment classification, in ACL, pp: 151-160.

[14.] Jiang, W., L. Huang and Q. Liu, 2009. Automatic adaption of Annotation standards: Chinese word segmentation and pos tagging-a case study, in ACL, pp: 522-530.

[15.] Meng, X., F. Wei, X. Liu, M. Zhou, S. Li, and H. Wang, 2012. Entity centric topic-oriented opinion summarization in twitter, in KDD, pp: 379-387.

[16.] Liu, X., S. Zhang, F. Wei, and M. Zhou, 2011. Recognizing named entities in tweets, in ACL, pp: 359-367.

[17.] Liu, X., X. Zhou, Z. Fu, F. Wei, and M. Zhou, 2012. Exacting social events for tweets using a factor graph, in AAAI.

[18.] Zeng, X., D.F. Wong, L.S. Chao and I. Trancoso, 2013. Graph-based semi-supervised model for joint Chinese word segmentation and POS tagging, in ACL, pp: 770-779.

[19.] Wang, X., F. Wei, X. Liu, M. Zhou, and M. Zhang, 2011. Topic sentiment analysis in twitter: a graph-based hashtag sentiment classification approach, in CIKM, pp: 1031-1040.

[20.] Nithya, S., A.C. Kaladevi, 2015. Tweet segmentation and classification for rumor identification using KNN approach.

(1) Dr. A.C. Kaladevi and (2) R. Parkavi

(1) Professor, Department of Computer Science and Engineering, Sona College of Technology, Salem.

(2) PG Scholar, Department of Computer Science and Engineering, Sona College of Technology, Salem.

Received 28 March 2017; Accepted 7 June 2017; Available online 12 June 2017

Address For Correspondence:

Dr. A.C. Kaladevi, Professor, Department of Computer Science and Engineering, Sona College of Technology, Salem E-mail:

Caption: Fig. 1.1: Big Data three Vs

Caption: Fig. 1.2: Architectural design
Table 1: Examples of Classifying Word


Access     Good      Yes        No         No
Accident   Ok        Yes        No         No
Blog       Blog      No         No         Yes
Case       Viral     No         No         Yes
COPYRIGHT 2017 American-Eurasian Network for Scientific Information
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2017 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Kaladevi, A.C.; Parkavi, R.
Publication:Advances in Natural and Applied Sciences
Article Type:Report
Date:Jun 1, 2017