Study on evolution trends of network public opinion based on hyperlink analysis.
The rapid development of the Internet has driven a rapid expansion of its primary data, namely Web pages, and with it an exponential growth in Web page links. Such links are core elements in the growth of the Web. Web page links contain rich resources, are of great research value, and have application prospects in many sectors and fields. Analysis and applied research on Web page links started rather late in China and currently focuses on link analysis algorithms [1,2,3,4,5] and link analysis applications [6,7,8,9]. Meanwhile, some scholars have conducted further research on link structure analysis [10,11], link analysis tools, link descriptive text, visual link analysis, and so on. Among these scholars, Li Jiang has conducted a comprehensive review of relevant research on links. By combining link analysis with the features of network public opinion, this study examines the evolution characteristics of network public opinion and the application of link analysis to the quality evaluation of information sources of network public opinion. The results are beneficial for the detection, follow-up tracing, and monitoring of network public opinion hotspots.
2. Network Public Opinion
The modern concept of public opinion was proposed in 1762. This concept refers to the social and political attitudes held by the public, as the subject, toward state administrators, as the object, regarding the occurrence and development of certain intermediary social matters under particular social circumstances. No commonly accepted and authoritative definition of network public opinion exists, although the concept is often described as the emotions and opinions of the public expressed and communicated via the Internet. Network public opinion and public opinion in society at large tend to interact with each other and have eventually merged to some extent. Because of the openness and virtuality of the Internet, network public opinion is characterized by instantaneity, freedom, deviation, mutability, individuation, emotionality, group polarization, and so on. Because of emotionality and group polarization, individual opinions tend to produce group effects that result in widespread public opinion, which may eventually affect social reality. Because of the instantaneity of network public opinion, the timely guidance and correction of erroneous tendencies in online opinion are crucial.
The propagation of network public opinion generally consists of four stages, namely, the germination, outbreak, peaceful, and dormant periods, as shown in Figure 1. The germination period is when a public opinion starts to take shape and is not yet widespread, but is propagated only within a certain range with limited audience and influence. Therefore, the germination period is an ideal period for detecting the public opinion hotspot. However, not all network public opinions begin with a distinct germination period; for some, only a short time elapses between the generation of the opinion and the outbreak. The burst type opinion is a typical example. Nevertheless, if the period of time is broken down into small segments, the germination period can still be identified. The outbreak period is characterized by fast propagation, dramatic growth in the audience size, and distinct features of emotionality and multi-polarization. The outbreak period is the best time for guidance and intervention. The peaceful period is marked by a stable audience size and a shift from multi-polarization to polarization. The dormant period is when a network public opinion temporarily or permanently enters into a dormant state. The public opinion might be activated once again under certain conditions during the dormant period, which is why this period is referred to as neither the annihilation period nor the receding period. If network public opinion is once again activated, its evolution will still comply with the four stages shown in Figure 1.
3. Link Analysis
The Internet contains a vast number of Web pages. The mutually referenced links contained in a Web page make it distinct from general text. If a Web page is abstracted into a node and the links among Web pages are regarded as directed edges, then the whole Internet can be abstracted into a directed graph, which is referred to as the Web graph. Such a graph is composed of page nodes and the directed edges that link the nodes together. The World Wide Web (WWW) is likewise composed of nodes and edges. A node in the WWW is a linkage point, which usually corresponds to a certain Web page or sometimes a picture or an email. An edge is a link, which is the core element used for connecting related linkage points. A link usually connects two Web pages, one of which is called the linking Web page and the other the linked Web page. The link terminologies are illustrated in Figure 2, and the definitions of the relevant terminologies are given below.
(1) Link, hyperlink: Both refer to Internet links. The two words are often used when an inlink need not be differentiated from an outlink. These terms are occasionally used to refer to inlink and outlink.
(2) Inlink: Inlink is a link directed to a Web site. This link is generally supposed to come from a Web page outside a certain collection. "Inlink" is synonymous with "backward link," whereas "accepted inlink" has the same meaning as "being linked."
(3) Outlink: Outlink is a link directed from a Web page. This link is generally supposed to point to a Web page outside a certain collection.
(4) Selflink: Selflink is a link from a certain Web page and is directed to the page itself or to a different part of the same page. This link is generally supposed to lead to a page inside a certain collection.
(5) Interlink, reciprocal link: Interlink often refers to the link connecting two different Web sites or to an inter-site link.
(6) Co-linked: Two pages are co-linked if both receive a link from a third page.
(7) Co-linking: Two pages are co-linking if both contain outlinks leading to a third page.
(8) Co-link: Co-linked and co-linking are collectively referred to as co-link.
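The terminology above can be illustrated with a minimal Python sketch. The toy edge list and the helper names (`co_linked`, `co_linking`, `is_selflink`) are hypothetical and chosen only for illustration; they are not part of the original study.

```python
from collections import defaultdict

# Hypothetical toy Web graph: each pair (source, target) is one hyperlink.
edges = [("A", "C"), ("B", "C"), ("D", "A"), ("D", "B"), ("C", "C")]

outlinks = defaultdict(set)  # page -> pages it links to
inlinks = defaultdict(set)   # page -> pages that link to it
for src, dst in edges:
    outlinks[src].add(dst)
    inlinks[dst].add(src)


def is_selflink(src, dst):
    """A selflink starts and ends on the same page."""
    return src == dst


def co_linked(p, q):
    """True if some third page links to both p and q."""
    return bool((inlinks[p] & inlinks[q]) - {p, q})


def co_linking(p, q):
    """True if p and q both link to some third page."""
    return bool((outlinks[p] & outlinks[q]) - {p, q})
```

In this toy graph, pages A and B are co-linked (both are linked from D) and also co-linking (both link to C), and the edge from C to itself is a selflink.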
Hyperlink analysis, often referred to as structure analysis, studies the nature of the Web, especially its hidden macroscopic nature, with the hyperlink as its main input. The hyperlink analysis of the Web is typically based on the following two assumptions:
Assumption 1: If a hyperlink is directed from page A to page B, then page B is recommended by the author of page A.
Assumption 2: If pages A and B are linked together through a hyperlink, we can assume that they might be related to the same subject.
Many researchers have found that the hyperlink structure of the WWW contains vast and significant resources, which can significantly improve the quality of search results if fully utilized. On the basis of the concept of hyperlink analysis, Sergey Brin and Lawrence Page, the founders of Google, came up with the PageRank algorithm when building the early-stage prototype of their search system in 1998. The basic ideas of the PageRank algorithm are as follows: a Web page might be very important if many Web pages link to it; a Web page might also be important if a Web page of greater significance links to it; and the importance of a Web page is quantified and measured by a PageRank value. Thus, the PageRank value of a Web page depends on the PageRank values of all the Web pages linking to it, whereas the PageRank values of those pages are in turn determined by the PageRank values of the Web pages linking to them. Therefore, the PageRank value of a Web page can be obtained through iteration, and its value consequently affects the PageRank values of the Web pages to which it links.
The PageRank algorithm of a certain Web page A is based on the following two basic assumptions:
Quantity assumption: In the Web graph model, if a Web page node receives more inlinks directed from other Web pages, the page is presumed to be more important.
Quality assumption: Given that the inlinks directed to page A have varied qualities, the page of higher quality will transmit more weight to other pages through links. Therefore, page A will be more important if it is directed from higher quality pages.
On the basis of the two assumptions above, the PageRank algorithm initially assigns all Web pages equal importance scores and then updates the PageRank score of each page node through iterative calculation until the scores become stable.
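The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any search engine: the damping factor of 0.85, the even redistribution of rank from pages without outlinks, the convergence tolerance, and the toy graph are all assumptions introduced here.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # equal initial scores
    for _ in range(max_iter):
        # Assumption: rank held by pages without outlinks is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page passes an equal share of its score along each outlink.
                new_rank[q] += damping * rank[p] / len(targets)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # scores have become stable
            break
    return rank


# Hypothetical toy graph: C receives the most inlinks, D receives none.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_links)
```

On this toy graph, page C ends up with the highest score because it is linked from three pages, and page A ranks second because its single inlink comes from the highly ranked page C, which reflects the quantity and quality assumptions respectively.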
HITS, which was first proposed by Jon Kleinberg in the 1990s, is another classical Web page link analysis algorithm. HITS mainly calculates two values for a Web page: the authority value of its content and the authority value of its links. The authority value of content refers to the popularity of the Web page content itself. The hub value, which is the authority value of the links, refers to the capability of a Web page to link to other popular Web page resources. The PageRank and HITS algorithms differ from each other in two main aspects. First, the way that the PageRank algorithm assigns and maintains the ranking has nothing to do with any query, whereas the HITS algorithm compiles a different root set for each query and then determines the priority of Web pages in accordance with the specific conditions of the query. Second, the PageRank algorithm looks forward from one link to another, whereas the HITS algorithm checks backwards from an authoritative Web page and then determines the Web pages that link to that authoritative Web page. Some scholars have come up with other hyperlink analysis algorithms, including SALSA, PHITS, and Bayesian algorithms, some of which have been applied to actual systems with positive results.
Ahmadi-Abkenari highlights that a Web page with a high hub value links to Web pages with high authority values, whereas a Web page with a high authority value is linked from Web pages with high hub values. This relationship and interplay can be expressed by Formulas (1) and (2), in which a(i) represents the authority value of Web page i, h(i) represents the hub value of Web page i, and E represents the set of directed edges of the Web page graph, with (j, i) denoting a link from page j to page i.
$$a(i) = \sum_{(j,\,i) \in E} h(j) \tag{1}$$
$$h(i) = \sum_{(i,\,j) \in E} a(j) \tag{2}$$
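A minimal Python sketch of this mutual reinforcement follows, reading Formulas (1) and (2) in the usual HITS sense: the authority value of a page sums the hub values of the pages linking to it, and the hub value of a page sums the authority values of the pages it links to. The normalization step, the fixed number of iterations, and the toy edge list are assumptions added for illustration.

```python
import math


def hits(edges, iterations=50):
    """edges is a list of (source, target) pairs; returns (authority, hub) dicts."""
    pages = sorted({p for edge in edges for p in edge})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Formula (1): a(i) is the sum of h(j) over all links (j, i).
        new_auth = {p: 0.0 for p in pages}
        for j, i in edges:
            new_auth[i] += hub[j]
        # Formula (2): h(i) is the sum of a(j) over all links (i, j).
        new_hub = {p: 0.0 for p in pages}
        for i, j in edges:
            new_hub[i] += new_auth[j]
        # Assumption: rescale to unit length so the values stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub


# Hypothetical toy graph: three hub-like pages link to X, one of them also to Y.
toy_edges = [("H1", "X"), ("H1", "Y"), ("H2", "X"), ("H3", "X")]
authority, hub_score = hits(toy_edges)
```

Here X obtains the highest authority value because all three hub-like pages link to it, and H1 obtains the highest hub value because it links to both X and Y.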
4. Evolutionary Relationship between Hyperlink Analysis and Network Public Opinion
From the perspective of emotional inclination, network public opinion is often characterized by multi-polarization. Thus, public opinion cannot simply be classified with polarized thinking. In terms of propagation modes, public opinion is mainly propagated in two ways, namely, instant messaging tools and Web pages, with the latter being the major one. Web sources are diverse, and the common sources of network public opinion are BBS, fora, blogs, podcasts, microblogs, and news stations. The news channels of all the major domestic Chinese-language Web portals, with their rich manpower and material resources, authoritative and abundant sources, and extensive influence, have become a significant source of public opinion information (also referred to as information source). The quality of an information source is related to authority, timeliness, originality, and so on. A high-quality information source has a significant reference value in terms of the detection and follow-up monitoring of network public opinion hotspots.
As an indispensable tool in the electronic age, the search engine is also an important communication channel, although it does not in itself serve as a carrier of news. When a search engine searches for Web pages to meet a user's request, two major factors are considered: one is the score based on the similarity between the user request and the Web page content, that is, the relevance of the Web page to the user's query, and the other is the score obtained through the link analysis algorithm, that is, the importance of the Web page. After integrating the two factors, the search engine obtains a fitting function that is used to sort the search results. If the linkage point of a Web page is frequently linked, the Web page has a higher degree of recognition, a more extensive influence, and a higher reference value. As a result, such a linkage point is often considered a high-quality linkage point, and the public opinion corresponding to it is often a public opinion hotspot. By contrast, if the linkage point is rarely or never linked, then the Web page has a low degree of recognition, low impact, and low reference value. Consequently, the subject corresponding to such a linkage point is unlikely to be a high-quality public opinion hotspot.
The function of links in the evolution of network public opinion can be demonstrated by the following examples. Page A might contain a hyperlink directed to a certain page B, although this hyperlink may not have any direct effect on a keyword-based query. However, through hyperlinks the author of the Web page provides readers with important information beyond the content of the page itself, which the author considers useful for them. Another example is that some links guide the reader back to the home page of a Web site, which enables the reader to change the browsing route by re-selecting the entry point. Some other links guide readers to a page for commenting on the content of the current page. Such linked pages might have the same subject as the current page and might be of good quality.
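The text does not specify how the two factors are integrated. One simple possibility, shown purely as an assumption, is a weighted sum of a relevance score and a link-based importance score. The weight `alpha`, the function name `combined_score`, and the example scores are hypothetical.

```python
def combined_score(relevance, importance, alpha=0.7):
    """Weighted sum of query relevance and link-based importance (both in [0, 1]).

    alpha is an illustrative weight, not a value taken from any real search engine.
    """
    return alpha * relevance + (1.0 - alpha) * importance


# Hypothetical candidates: page -> (relevance to the query, link-based importance).
candidates = {
    "page_a": (0.90, 0.20),
    "page_b": (0.60, 0.80),
    "page_c": (0.30, 0.95),
}
ranking = sorted(
    candidates,
    key=lambda page: combined_score(*candidates[page]),
    reverse=True,
)
```

With these made-up numbers, the highly relevant `page_a` is ranked first, but the well-linked `page_b` follows closely, which shows how link analysis can lift a frequently linked page in the result list.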
However, the evaluation of the value of a linkage point during public opinion analysis cannot completely adopt ordinary evaluation standards because of the features of mutability and instantaneity of network public opinion. This condition indicates that the evaluation of the value of a linkage point cannot depend only on the frequency of being linked. Meanwhile, time should also be considered as another important evaluation indicator because predicting how a newly released page will be linked is difficult. The linkage points of a new page might remain in dormant state or can suddenly and unexpectedly become a public opinion hotspot and enter into the outbreak period. If the network public opinion is only detected after it enters into outbreak period, the follow-up guidance and intervention work will not be as effective as expected, which is why the germination period is the ideal period to detect the public opinion hotspot, as previously mentioned.
In another case, a Web page that was released long ago has limited value despite being linked frequently, because the public opinion corresponding to the page content might be in the peaceful or dormant period. These characteristics differentiate the evaluation of network public opinion from ordinary evaluation. Liu Yanshu et al. studied the feasibility of evaluating Internet information using link relations, and they believe that ordinary link analysis methods can be feasibly applied to the relevant evaluation of network public opinion as long as the two cases above are considered.
Given the characteristics of network public opinion, at least two features are required for a quality linkage point: a high frequency of being linked and a short time since release. If a Web site contains more linkage points with a higher ratio of quality linkage points, then it is considered a high-quality information source. Moreover, in the study of network public opinion, links can be used not only to study the quality of a single information source, but also to examine the collective evolution of information sources. All information sources will inevitably be affected once the network public opinion enters the outbreak period. Meanwhile, the entry of network public opinion into the outbreak period is often attributed to the combined effect of all information sources. In such a case, taking all information sources as a whole when studying the quantity and quality of linkage points within a certain period of time is of great value in understanding the evolution tendency of network public opinion, so that prompt guidance and intervention with a suitable persuasion strategy become possible.
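The two criteria for a quality linkage point, and the ratio used to judge an information source, can be sketched as follows. The thresholds (`min_inlinks`, `max_age`), the class and function names, and the sample data are illustrative assumptions; the study does not give concrete values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LinkagePoint:
    url: str
    site: str
    inlink_count: int   # how often the page is linked
    released: datetime  # release time of the page


def is_quality_point(point, now, min_inlinks=10, max_age=timedelta(days=3)):
    """Quality point: frequently linked AND recently released (assumed thresholds)."""
    return point.inlink_count >= min_inlinks and (now - point.released) <= max_age


def source_quality(points, site, now):
    """Share of a site's linkage points that qualify as quality linkage points."""
    own = [p for p in points if p.site == site]
    if not own:
        return 0.0
    return sum(is_quality_point(p, now) for p in own) / len(own)


# Made-up sample data for two hypothetical sites.
now = datetime(2010, 3, 25)
points = [
    LinkagePoint("u1", "site1", 50, datetime(2010, 3, 24)),  # linked often, recent
    LinkagePoint("u2", "site1", 50, datetime(2010, 3, 1)),   # linked often, old
    LinkagePoint("u3", "site1", 2, datetime(2010, 3, 24)),   # recent, rarely linked
    LinkagePoint("u4", "site2", 20, datetime(2010, 3, 23)),  # linked often, recent
]
```

In this sample, only one of the three pages of `site1` satisfies both criteria, whereas the single page of `site2` does; a fuller evaluation would also weigh the number of linkage points per site, as the text notes.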
5. Instance Verification and Analysis
5.1 Experimental Subjects
This experiment selects the news channels of the better-known domestic news Web sites or Web portals as information sources, that is, as experimental subjects. All the information sources are listed in Table 1. All the subdomains of these domains, such as http://focus.news.163.com/ and http://bbs.news.163.com/, are also included in the crawling list of the Web crawler.
5.2 Experimental Methods
A search for Web pages relevant to "The Murder Case of Nanping" under the domains shown in Table 1 was performed with the use of the independently developed BUT Web crawler tool. Certain information was extracted from each page. Such information includes title, release time, source, domain, and URL. The title refers to the news title on the Web page; the release time refers to the release time of the news on the Web page; the source refers to the source of the news notated on the Web page, which might come from paper media or Internet media; the domain is the domain containing the news; and the URL is the URL string of the Web page.
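The five extracted fields and the rule that subdomains of the listed domains are also crawled can be sketched as below. This is not the BUT crawler, whose code is not shown in the paper; the record type `PageRecord`, the function `in_crawl_scope`, and the two-entry `SEED_HOSTS` tuple (taken from Table 1) are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlparse


@dataclass
class PageRecord:
    title: str               # news title on the Web page
    release_time: datetime   # release time of the news
    source: str              # source notated on the page (paper or Internet media)
    domain: str              # domain containing the news
    url: str                 # URL string of the Web page


# Two of the domains from Table 1, used here as an example seed list.
SEED_HOSTS = ("news.163.com", "news.sina.com.cn")


def in_crawl_scope(url, seed_hosts=SEED_HOSTS):
    """True if the URL's host is a seed host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == seed or host.endswith("." + seed) for seed in seed_hosts)
```

For example, `in_crawl_scope("http://focus.news.163.com/")` is true because the host is a subdomain of `news.163.com`, whereas a host such as `sports.163.com` falls outside the scope.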
5.3 Experimental Results and Analysis
5.3.1. Evolution over Time
The statistics of the obtained data were compiled with the day as the unit. The time span of the statistics ranged from March 21, 2010 to May 2, 2010. The results are shown in Figure 3. No data were obtained for March 21 and 22. The horizontal axis represents the date, whereas the vertical axis represents the amount of network public opinion.
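Counting pages per day, and per hour within a single day for burst type opinions, can be sketched with the Python standard library. The function names and the four sample timestamps are invented for illustration and are not the data of this study.

```python
from collections import Counter
from datetime import date, datetime


def count_per_day(release_times):
    """Number of pages released on each calendar day."""
    return Counter(t.date() for t in release_times)


def count_per_hour(release_times, day):
    """Number of pages released in each hour of one given day."""
    return Counter(t.hour for t in release_times if t.date() == day)


# Invented sample release times, not taken from the crawl.
sample_times = [
    datetime(2010, 3, 23, 9, 40),
    datetime(2010, 3, 23, 11, 5),
    datetime(2010, 3, 23, 11, 30),
    datetime(2010, 3, 24, 8, 15),
]
daily = count_per_day(sample_times)
hourly = count_per_hour(sample_times, date(2010, 3, 23))
```

A `Counter` returns zero for days or hours without any page, so gaps in the series do not need special handling when the counts are plotted.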
Several feature points in the figure should be noted. These points are elaborated in the following text.
(1) March 23: A large amount of this network public opinion accumulated within a single day. This is a typical sudden and unexpected public opinion that is fully characterized by burstiness. For such burst type public opinions, we can break down the time into smaller segments and compile the statistics of the public opinion on March 23 with the hour as the unit. The result is shown in Figure 4.
To elaborate on Figure 4, the analysis and verification of the crawl data and the relevant URLs reveal that this public opinion first appeared on the Internet between 9 am and 10 am and was initially released by China Radio Network. The opinion was then reprinted and released by Sina News Center. Further checking of the page content shows that this network public opinion was first reported by News Coverage of Voice of China at 8:58 am.
The network public opinion reached its peak after 11 am, slightly declined at 12 o'clock, and was once again on the rise at 1 pm. The opinion then remained at a relatively low level until 5 pm. We can deduce what happened from the data and the daily routines of ordinary people. First, the public opinion rapidly propagated through online media and reached the peak at 11 am. The public opinion declined during the 12 o'clock lunch break. The opinion was once more on the rise at 1 pm. The amount of public opinion was relatively small from 2 pm to 5 pm, which was a work period.
The attention to the opinion was once again on the rise after 5 pm, probably because many people are fond of browsing news and fora after work.
The attention declined after 11 pm, which can be related to the work and rest schedule of the majority of people.
With the day as the unit, this public opinion has no apparent germination period. However, when the period is broken down into hours, as in Figure 4, the germination period can be identified as 9 am to 11 am on March 23.
Hourly statistics for several other dates with relatively large amounts of public opinion, such as March 24 and March 25, share many points in common with the above conclusion.
(2) From March 24 to March 26: This network public opinion reached a peak during this period, which can be deemed as the outbreak period.
(3) From March 26 to April 5: The attention decreased during this period, which can be deemed as the peaceful period. The period from April 4 to April 6 can be regarded as the dormant period.
(4) From April 4 to April 8: The amount of network public opinion increased dramatically, which is unexpected without background knowledge. After checking the relevant URLs in the crawl data, we can see that the court hearing of the murder case was held on April 8, which is the direct reason for the dramatic rise in attention on that day. After April 8, the amount of public opinion decreased until April 11.
(5) From April 12 to April 14: The amount of the public opinion was once again on the rise during this period. After checking the URL data, we can see that "Guangxi Hepu Case," a similar case, occurred on April 12, which activated the public opinion from its dormant state and brought back the public's attention.
(6) April 20: The amount of the public opinion significantly increased on this day. By checking the URL, we can see that this was the day of the final order of "The Murder Case of Nanping." Therefore, the public has directed considerable attention to the case and looked forward to the final order of the court.
(7) April 28 to April 30: The amount of the public opinion remained at a high level for three consecutive days. Unfortunately, the case was a real-life tragedy, and three similar cases occurred in succession in Leizhou of Guangdong Province, Taixing of Jiangsu Province, and Weifang of Shandong Province, which caused the whole society to reflect on the issue.
(8) May 1: The amount of public opinion decreased from this day onward.
5.3.2. Statistics of Information Source
Only the sources notated on the Web page were considered in the statistics. The statistical results are shown in Figure 5. The primary media that acted as information sources for the vast amount of network media are shown in Figure 5 and include Xinhua Network, China News Network, Beijing News, Yangcheng Evening News, Beijing Times, China Daily Network, China Radio Network, and Southeast Express. The sources from certain media, including Xinhua Network and China News Network, can therefore be regarded as high quality because they accounted for a very large proportion of the sources cited by well-known domestic news stations. Although the Web sites selected in this experiment comprise only a very small portion of the vast number of Web sites and thus cannot be used to answer all questions, the influence of these Web sites is significantly stronger than that of ordinary Web sites. In fact, when all the restrictions on the range of Web sites crawled by the crawler tool were removed, we found that the majority of the network public opinion on other sites was reprinted from these large, influential Web sites.
Therefore, we should prioritize high-quality information sources in the follow-up detection and monitoring of network public opinion hotspots. Moreover, if guidance and intervention in the network public opinion are necessary, we should pay attention to authoritative and influential media and make full use of their radiating influence to spread the measures of guidance and the policy of persuasion throughout the Internet. In this way, the long-term security of society can be maintained. Not all public opinions are initially released by authoritative Web sites because such Web sites, despite their great influence, are subject to strict management. Many network public opinions initially come from some BBS, but are still eventually pushed to the peak by authoritative Web sites.
Through the crawling and extraction of specific indicators of a specific network public opinion on several well-known domestic Web sites, the evolution of network public opinion over time was studied and analyzed using the link analysis method combined with the features of network public opinion. The findings show that the result of link analysis is consistent with the actual facts of public opinion. The four stages of network public opinion evolution were also verified, and the result of the analysis with practical examples is consistent with the actual situation. This finding supports the feasibility of applying link analysis to the evolution tendency of public opinion.
The statistics and analysis on information source, combined with the evolutionary features of the network public opinion over time, not only serve a guiding function in the detection and tracing of network public opinion hotspot, but also possess certain reference value for the government in formulating guidance measures and persuasion policies to maintain the long-term security of society.
Received: 19 July 2015, Revised 28 August 2015, Accepted 5 September 2015
This work was funded by the China Postdoctoral Science Foundation (2014 M560700), the Postdoctoral Special Fund of Chongqing (XM2014057), the Natural Science Foundation of Zhejiang Province (LY13F010005), the Science and Technology Support Program of Hubei Province (2014BKB068, 2014BDH124), and the Science and Technology Development Foundation of Xiangyang.
Bo, Yang., He-Chang, Cheng., Guan-Yu, Zhu., Zhao Xue-Hua. (2014). A Novel Page Ranking Algorithm Based on Analyzing the Diversity of Inbound Hyperlinks. Chinese Journal of Computers, 37 (4) 833-847. (In Chinese).
 Guan-hui, Yan., Xin, Shu., Zhi-cheng, Ma., Xiang, Li. (2013). Community discovery for microblog based on topic and link analysis. Application Research of Computers, 30 (7) 1953-1957. (In Chinese).
 Jianhui, Li., Lan, Jinsong., Shen, Zhihong., Teng Changyan., ZHOU Yuanchun. (2013). PageRank Algorithm for Scientific Data Ranking. Journal of Frontiers of Computer Science & Technology, 7 (6) 494-504. (In Chinese).
 Gelan, Yang., Li, Tu. (2012). Novel PageRank algorithm based on topic and link weighted. Journal of Huazhong University of Science and Technology (Natural Science Edition), 40 (S1) 300-303. (In Chinese).
 Xian-chao, Zhang., Wen, Xu., Liang, Gao., Liang., Wen xin. (2012).Combining Content and Link Analysis for Local Web Community Extraction. Journal of Computer Research and Development, 49 (11) 2352-2358. (In Chinese).
 Yan, Guo., Chun-yang, Liu., Zhi-hua, Yu., Jin, Zhang, Yuan, Dai. (2011). Research on the Impact Evaluation of Web Information Sources of Public Opinion. Journal of Chinese Information Processing, 25 (03) 64-71. (In Chinese).
 Hong-wei, Wang., Yuan-kai, Li., Pei, Yin. (2013). A Study on Anti-Cheating in Web Search Ranking Based on Link Analysis. Journal of Systems & Management, 22 (1) 107-113. (In Chinese).
 Wei, Yu., Shi-jun, Li., Li-juan, Wen., Jian-wei, Tian. (2010). Ranking of Deep Web Sources Based on Data Quality. Journal of Chinese Computer Systems, 31 (4) 641-646. (In Chinese).
 Spertus, E., ParaSite: Mining structural information on the Web. (1997). Computer Networks and ISDN Systems, 29 (8-13) 1205-1215
 Xiao-Yu, Wang., Ao-Ying, Zhou. (2003). Linkage Analysis for the World Wide Web and Its Application: A Survey. Journal of Software, 14 (10) 1768-1780. (In Chinese).
Weiss, R., Velez, B., Sheldon, M., Namprempre, C., Szilagyi, P., Gifford, D. K. (1996). HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering. In: Proceedings of the Seventh ACM Conference on Hypertext (HYPERTEXT '96), p. 180-193.
 Jun-ping, Qiu., Jiang, Li., (2007). Schemes Against the Defects of Link Analysis Tools. Information Science, 25 (5) 641-647.
 Min, Zhang., Jian-feng, Gao Zhang., Shao-ping, MA (2004). Anchor Text and Its Context Based Web Information Retrieval. Journal of Computer Research And Development, 41 (1) 221-226. (In Chinese).
 Tianbo, Tang., Gao Feng. (2009). The Application of Visualization Technology in Link Analysis. New Technology of Library and Information Service, (2) 78-82. (In Chinese).
 Jiang, Li., Zhi-ming, Yin. (2008). A Review on Link Analysis. Journal of Academic Libraries, 26 (2) 51-58. (In Chinese).
 Lai-hua, Wang. (2003). Overview of public opinion research. Tianjin: Tianjin academy of social sciences press.
Brin, S., Page, L. (1998). Anatomy of a large-scale hypertextual web search engine. In: Proc. 7th Intl. World-Wide Web Conference (WWW7), p. 107-117.
Page, Lawrence., Brin, Sergey., Motwani, Rajeev., Winograd, Terry. (1998). The PageRank citation ranking: bringing order to the web. Manuscript in Progress. http://google.stanford.edu/~backrub/Pageranksub.ps.
Ahmadi-Abkenari, F., Selamat, A. (2012). An architecture for a focused trend parallel Web crawler with the application of clickstream analysis. Information Sciences, 184 (1) 266-281.
 Yan-shu, Liu., Pin, Fang. (2002). Study on the Reliability of Link Popularity in Web Information Evaluation. Journal of the China Society for Scientific and Technical Information, 21 (4) 401-406. (In Chinese).
Qiong Gu (1,2), Xiangdong He (1), Xianming Wang (3)
(1) Institute of Logic and Intelligence, Southwest University, Chongqing 400715, China
(2) School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang Hubei 441053, China
(3) Oujiang College, Wenzhou University, Wenzhou, Zhejiang, 325035, China
Table 1 Domain names and descriptions of the information sources

Number  Domain Name                  Description
1       http://news.xinhuanet.com/   Xinhua net
2       http://news.sina.com.cn/     Sina News
3       http://news.163.com/         Netease News
4       http://news.sohu.com/        Sohu News
5       http://news.qq.com/          Tencent News
6       http://news.tom.com/         TOM News Channel
7       http://news.21cn.com/        21CN News
8       http://news.ifeng.com/       Phoenix IT
9       http://news.people.com.cn/   People News
10      http://news.cntv.cn/         Chinese network television news station
|Author:||Gu, Qiong; He, Xiangdong; Wang, Xianming|
|Publication:||Journal of Digital Information Management|
|Date:||Dec 1, 2014|