Bibliometric analysis of big data research in China.
With the rise of social networks such as Facebook and QQ, of social media such as Twitter and microblogs, and of location-based services (LBS), new channels of information dissemination continue to emerge. Together with the rapid development of cloud computing, networking and related technologies, data are growing and accumulating at an unprecedented speed, and the era of big data has arrived. McKinsey & Company published "Big Data: The Next Frontier for Innovation" in May 2011; the report was the first to use the concept of "big data", which subsequently attracted the attention of industry, the technology community and government. "Big data" refers to volumes of data that general-purpose software tools have difficulty capturing, managing and analyzing, generally measured in terabytes. The "big" in "big data" means more than large capacity: its greater significance is that exchanging, integrating and analyzing massive data leads to the discovery of new knowledge and the creation of new value, bringing "big knowledge", "big profits" and "big development".
2. Data sources and data processing methods
2.1 Data sources
This paper used the main library of Chinese academic journals on CNKI (China National Knowledge Infrastructure) as the search database. The search field was "subject" and the search terms were "big data" or "Big data". We retrieved all records from 2002 to 2012, covering every article on the topic of "big data" up to and including 2012; the data for 2013 were incomplete at the time of retrieval and are therefore not analyzed here. The journal range was set to "All journals" to improve the recall ratio, and matching was set to "exact" to improve the precision ratio. The data were retrieved on 10 April 2013, yielding 2898 papers in total.
2.2 Data Processing Methods
Data were analyzed and processed with the multivariate statistical package SPSS 17.0 and Excel, and visualized with Ucinet 6. Combining bibliometric summary statistics with visualization methods, we analyze, both qualitatively and quantitatively, the patterns in the development of China's big data papers over the past 11 years.
3. Article growth law analysis
The growth of scientific knowledge and the growth of the scientific literature, together with the laws that govern them, are closely linked. The quantity of scientific literature directly reflects changes in scientific knowledge, so it is an important yardstick for measuring the amount of scientific knowledge. Domestic big data research started late and developed relatively slowly in its early years, but it is now on a growth trend (see Table 1).
Table 1 gives the number of big data research papers published in each of the 11 years from 2002 to 2012. Output has grown overall since 2002 (it dipped slightly to 194 papers in 2006, according to Table 1), and the growth in 2012 is especially marked, so 2012 can be called a great year for the development of big data research. We also used SPSS 17.0 to plot the exponential growth curve of the literature, the Price curve (see Figure 1).
Price, one of the founders of bibliometrics, carried out extensive statistical analysis of a variety of scientific indicators and proposed a four-stage theory of the growth of scientific literature. As Figure 1 shows, between 2002 and 2012 big data research gradually entered a boom phase, and the number of research papers grew rapidly, following an exponential pattern. According to Price's logistic-curve theory of growth, domestic research in the field of big data is therefore in a period of major development.
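The exponential growth pattern described above can be illustrated with a small log-linear fit to the yearly counts in Table 1. This is only a sketch under stated assumptions: it is not the SPSS 17.0 procedure used in the paper, and the function name `fit_exponential` and the model form y = a * exp(b * t) are choices made here for illustration.

```python
import math

# Annual paper counts from Table 1 (2002-2012).
years = list(range(2002, 2013))
papers = [102, 113, 130, 225, 194, 221, 216, 248, 275, 334, 840]

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * t) by least squares on ln(y), with t = x - xs[0].

    Illustrative helper (an assumption of this sketch, not the paper's method).
    """
    ts = [x - xs[0] for x in xs]
    logs = [math.log(y) for y in ys]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_l = sum(logs) / n
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(ts, logs))
    sxx = sum((t - mean_t) ** 2 for t in ts)
    b = sxy / sxx                       # growth rate per year
    a = math.exp(mean_l - b * mean_t)   # fitted level in the first year
    return a, b

a, b = fit_exponential(years, papers)
total = sum(papers)
doubling_time = math.log(2) / b  # years for output to double under the fit

print(f"total papers = {total}")
print(f"a = {a:.1f}, b = {b:.3f}, doubling time = {doubling_time:.1f} years")
```

A positive `b` corresponds to the upward trend discussed in this section; the 2012 count lies well above any smooth curve through the earlier years, so the fitted rate should be read as a rough summary only.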
4. Author distribution law
4.1 Lotka's empirical law of information producers
Lotka, an American statistician, found that the productivity of scientific authors follows a regular pattern. He first proposed it in 1926, and it is now known as Lotka's law of scientific productivity. We apply this principle to the big data research papers in order to test Lotka's empirical law and to fit the Lotka formula for this research topic.
Lotka's law is usually expressed as $y(x) = C/x^{n}$, with $C > 0$ and $x = 1, 2, 3, \ldots, x_{\max}$, where $x_{\max}$ is the maximum number of papers published by a single author in the period. In this study, the 2898 papers were written by a total of 4766 authors. We use the method of least squares to calculate $n$: $n = \dfrac{N\sum XY - \sum X \sum Y}{N\sum X^{2} - \left(\sum X\right)^{2}}$, where $X = \lg x$, $Y = \lg y_x$, and $N$ is the total number of authors, here 4766. The calculated value is $n = 2.290834339$, which is basically the same as Lotka's result, that is, consistent with the inverse square law. With $C = \sum \left(1/x^{2.290834339}\right)$, the calculated value of $C$ is 1.400143544. The Lotka formula for the big data field is therefore $y_x = 1.400143544/x^{2.290834339}$. The relevant statistics are shown in Table 2.
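The exponent calculation can be checked by evaluating the least-squares expression with the column sums from Table 2. This is a minimal sketch: the function name `lotka_exponent` is introduced here for illustration, and N is set to 4766 because that is the value the text states.

```python
# Column sums from Table 2, where X = lg x and Y = lg y_x.
sum_x = 2.8573325
sum_y = 11.89348443
sum_xy = 4.06902112
sum_xx = 1.774818419

def lotka_exponent(n_obs, sx, sy, sxy, sxx):
    """Evaluate n = (N*sum(XY) - sum(X)*sum(Y)) / (N*sum(X^2) - (sum(X))^2)."""
    return (n_obs * sxy - sx * sy) / (n_obs * sxx - sx ** 2)

# The text takes N to be the total number of authors (4766).
n_paper = lotka_exponent(4766, sum_x, sum_y, sum_xy, sum_xx)
print(f"n = {n_paper:.6f}")
```

N is left as a parameter because conventions differ: a textbook regression over the six (X, Y) rows of Table 2 would use N = 6 and would not reproduce the value reported here.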
4.2 Prolific author analysis
The publication of scientific papers is an important indicator for evaluating the creative work of scientific and technical personnel, who treat published papers as a form of self-expression and as a way to confirm their position in the scientific community. The sample contains 4766 authors. We first tabulate the number of papers per author (see Table 3).
As Table 3 shows, Xu Cui-ping and Li Lu, with the highest numbers of published articles, are prominent in the field of big data and may well become leaders of domestic big data research in the future. According to Price's research on the distribution of the literature, 75% of scientists publish only one paper in their lifetime. Based on the statistical results of Table 3, authors with a single paper account for 77.63% of the authors in the big data field, which is very close to Price's result.
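The 77.63% figure follows directly from the author counts. The sketch below combines the counts for one to six papers from Table 2 with the counts for seven to eleven papers from Table 3; merging the two tables in this way is an assumption made here so that the total matches the 4766 authors reported in the text.

```python
# Number of authors at each productivity level (papers per author).
# Levels 1-6 are taken from Table 2, levels 7-11 from Table 3.
authors_by_papers = {
    1: 3700, 2: 862, 3: 136, 4: 41, 5: 11, 6: 4,
    7: 6, 8: 1, 9: 1, 10: 2, 11: 2,
}

total_authors = sum(authors_by_papers.values())
single_paper_share = authors_by_papers[1] / total_authors

print(f"total authors = {total_authors}")
print(f"single-paper authors = {single_paper_share:.2%}")
```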
5. Research Institutions
Table 4 lists the 21 institutions that published the most big data papers in 2002-2012; these institutions have strong academic and research capabilities in the field. The National University of Defense Technology tops the list (55 papers, 9.23% of the papers from these 21 institutions). Apart from the Chinese Academy of Sciences, all of them are institutions of higher education, which shows that colleges and universities hold a leading position in big data research. As the highest-level research institution in China, the Chinese Academy of Sciences also makes research efforts that should not be underestimated (see Table 4).
6. Journal Analysis
The sample papers were published in a total of 908 journals. The distribution of literature across journals is scattered; the British scholar Bradford found that this scattering follows certain rules, now known as Bradford's law. Table 5 lists the journals that published 16 or more articles, together with their paper counts.
As the journal distribution table shows, the core journals of the big data field are "Computer Engineering and Applications", "Computer Engineering", "Communication World", "Microcomputer Information", "Computer Applications", "Application Research of Computers" and so on. Of the 30 journals, only "Successful Marketing" is not an IT journal. IT journals account for as much as 97.99% of the total, so most domestic big data research is concentrated in the IT sector and mainly investigates algorithms and techniques for handling large amounts of data. Research on big data in the field of library and information science is relatively scarce; a small amount has focused on libraries and competitive intelligence, and academic research based on massive citation data has also begun to take off.
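The cumulative columns of Table 5, which are the quantities a Bradford plot is built from, can be recomputed from the per-journal paper counts. This is a sketch only: the journal names and counts are copied from Table 5, while the variable names are chosen here for illustration.

```python
import math

# (journal, papers) pairs in rank order, copied from Table 5.
journals = [
    ("Computer Engineering and Applications", 67),
    ("Computer Engineering", 63),
    ("Communication World", 53),
    ("Microcomputer", 53),
    ("Computer Applications", 49),
    ("Application Research of Computers", 38),
    ("Computer and Network", 37),
    ("Computer Engineering and Design", 33),
    ("Computer Science", 31),
    ("Silicon", 31),
]

# For each rank C, record lg C and the cumulative paper count R(n).
bradford_rows = []
running_total = 0
for rank, (name, count) in enumerate(journals, start=1):
    running_total += count
    bradford_rows.append((rank, math.log10(rank), running_total))

for rank, log_rank, cumulative in bradford_rows:
    print(f"C = {rank:2d}  lg C = {log_rank:.9f}  R(n) = {cumulative}")
```

Plotting R(n) against lg C gives the curve usually inspected when checking whether a journal distribution follows Bradford's law.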
7. Keywords analysis
7.1 Word frequency statistics
Keywords usually reflect the themes and concerns of a paper across disciplines; high-frequency keywords reflect the hot topics of a particular field, and the co-occurrence of keywords summarizes the professional focus of a subject. We therefore counted the keywords and their frequencies in the papers from 2002 to 2012, and summarize the research focus using multidimensional scaling, a modern statistical technique (see Table 7 and Figure 2).
Using Excel, we counted a total of 9147 keywords across all the journal articles. As Table 7 and Figure 2 show, "data mining", "database", "FPGA", "big data" and "clustering" will be the focus of future research in the big data field. The prominence of FPGA is consistent with the large number of journal articles on the design of FPGA-based data acquisition systems. The keyword table and chart reveal hot spots such as "data mining" and "big data", but they do not reflect the relationships among the keywords, so a co-word analysis of the keywords is needed, with the results presented visually. Next, we construct a co-word matrix of the high-frequency keywords and use cluster analysis and multidimensional scaling in SPSS 17.0 to analyze the relationships among the high-frequency keywords and their degree of association.
7.2 Constructing the co-word matrix of high-frequency keywords
Co-word analysis takes a set of words, counts the number of times each pair of them appears in the same document, and applies cluster analysis and multidimensional scaling to these pairwise counts, so as to reflect how closely the words are related and then to analyze structural changes in the disciplines and topics that the words represent. Following this principle, we processed the high-frequency keywords further by counting, for each pair, the number of papers in which both appear. The more often two keywords occur together, the closer the relationship between them. The final matrix is shown in Table 8.
In the co-word matrix, keyword co-occurrence has no direction, so the matrix is symmetric, that is, elements mirrored across the main diagonal are equal. The co-occurrence frequencies are generally low, and a large share of them are zero. This indicates that big data, as an independent discipline, is not yet mature or stable. It is also consistent with the fact that big data involves many intersecting disciplines: different disciplines focus on different aspects of data and therefore have little contact with one another. For the same reason, big data research papers are scattered across journals in management, economics, computer science and other disciplines.
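The pairwise counting behind a co-word matrix can be sketched in a few lines. The keyword lists below are hypothetical toy data, not the paper's sample, and the function name `coword_matrix` is introduced here for illustration. The diagonal is left at zero because the text does not define how the diagonal of Table 8 was filled.

```python
from itertools import combinations

def coword_matrix(papers, vocabulary):
    """Count, for each keyword pair, the papers in which both keywords appear."""
    index = {kw: i for i, kw in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    for keywords in papers:
        # Use a set so a keyword repeated within one paper is counted once.
        present = sorted({index[kw] for kw in keywords if kw in index})
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1  # co-occurrence has no direction
    return matrix

# Hypothetical keyword lists, one per paper.
toy_papers = [
    ["data mining", "clustering"],
    ["data mining", "clustering", "big data"],
    ["FPGA", "data collection"],
    ["data mining", "data warehouse"],
]
vocab = ["data mining", "clustering", "big data",
         "FPGA", "data collection", "data warehouse"]

m = coword_matrix(toy_papers, vocab)
for kw, row in zip(vocab, m):
    print(f"{kw:16s} {row}")
```

Because each pair is written to both `matrix[i][j]` and `matrix[j][i]`, the result is symmetric by construction, which matches the property of Table 8 described above.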
7.3 Cluster Analysis
Cluster analysis automatically classifies a number of samples (or variables) according to their degree of closeness across many features, without prior knowledge, and produces multiple classification results. Individuals within the same class are highly similar, while individuals in different classes differ greatly. We used hierarchical (system) clustering with between-groups linkage as the clustering method, squared Euclidean distance as the measure, and Z-scores as the standardization method to draw the dendrogram shown in Figure 3.
In the dendrogram, horizontal distance indicates the size of the differences, and the process by which the variables are clustered can be seen clearly. From the clustering process, the keywords can be divided into five categories: database, data centers, large amount of data, FPGA and data collection form the first class; big data and support vector machine form the second class; and data warehouse, data mining and clustering each form a class of their own.
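The clustering settings named above (Z-score standardization, squared Euclidean distance, between-groups linkage) can be illustrated with a small pure-Python sketch. It is not the SPSS 17.0 implementation: between-groups linkage is taken here to mean average linkage over all cross-cluster pairs, the Z-scores use the sample standard deviation, and the data points are hypothetical.

```python
from statistics import mean, stdev

def zscore_columns(rows):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_linkage(rows, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster distance is the mean squared Euclidean distance over all
    pairs with one member in each cluster (between-groups linkage).
    """
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean(sq_euclidean(rows[i], rows[j])
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical observations with two features each.
points = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]]
groups = average_linkage(zscore_columns(points), n_clusters=3)
print(groups)
```

Choosing `n_clusters` plays the role of cutting the dendrogram at a given height; the five keyword categories reported above correspond to one such cut of Figure 3.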
7.4 Multidimensional Scaling Analysis
Multidimensional scaling is used to study the degree of similarity (or dissimilarity) between multiple objects: through an appropriate dimensionality reduction method, it represents this similarity (dissimilarity) as distances between points in a low-dimensional space. The results of the analysis are shown in Figure 5. In such a map, the position of each investigator reflects the relationships among the investigators and their strength: highly similar investigators cluster together to form an academic community, and the closer the distance, the closer their research directions, whereas investigators who are marginalized or not assigned to any community have narrow research directions or are in transition to another direction.
The multidimensional scaling map of the keywords shows their distribution and the degree of association between them. Highly similar keywords cluster together to form a community, and the closer the distance, the stronger their association, as with data acquisition and data centers. Keywords that are marginalized or not classified within a group, such as clustering and data warehouse, have very few associations or are in transition toward a certain direction. The multidimensional scaling results in Figure 4 are consistent with the cluster analysis results in Figure 3.
As Table 9 shows, all the data were used in the analysis, with no missing values. Stress and RSQ are two estimates of the reliability and validity of a multidimensional scaling analysis; the larger the RSQ the better, and a value of about 0.60 or above is usually acceptable. For this analysis, Stress = 0.26697 and RSQ = 0.65665, which indicates that the model fits the data acceptably. In Figure 6, the points are distributed roughly along a straight line and show a clear linear trend, which also indicates a good fit. Overall, the multidimensional scaling map shows the focus of domestic big data research and development.
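To make the two fit measures concrete, the sketch below computes a stress value and RSQ for a set of map distances and their target disparities. It assumes Kruskal's stress formula 1 and takes RSQ as the squared correlation between distances and disparities; the exact definitions used by SPSS 17.0 may differ, and the input values are hypothetical.

```python
import math

def kruskal_stress(distances, disparities):
    """Kruskal's stress formula 1: sqrt(sum((d - dhat)^2) / sum(d^2))."""
    num = sum((d - h) ** 2 for d, h in zip(distances, disparities))
    den = sum(d ** 2 for d in distances)
    return math.sqrt(num / den)

def rsq(distances, disparities):
    """Squared Pearson correlation between distances and disparities."""
    n = len(distances)
    mean_d = sum(distances) / n
    mean_h = sum(disparities) / n
    cov = sum((d - mean_d) * (h - mean_h) for d, h in zip(distances, disparities))
    var_d = sum((d - mean_d) ** 2 for d in distances)
    var_h = sum((h - mean_h) ** 2 for h in disparities)
    return cov ** 2 / (var_d * var_h)

# Hypothetical map distances and the disparities they should reproduce.
toy_distances = [1.0, 2.0, 3.0, 4.0]
toy_disparities = [1.1, 1.9, 3.2, 3.8]

stress_value = kruskal_stress(toy_distances, toy_disparities)
rsq_value = rsq(toy_distances, toy_disparities)
print(f"stress = {stress_value:.4f}, RSQ = {rsq_value:.4f}")

# The RSQ reported in the text, checked against the 0.60 rule of thumb.
reported_rsq = 0.65665
print("reported RSQ acceptable:", reported_rsq >= 0.60)
```

Lower stress and higher RSQ both indicate that the low-dimensional map reproduces the original dissimilarities more faithfully.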
In summary, big data research in 2002-2012 has made some achievements, specifically as follows:
* Growth: The number of big data research papers has grown overall. According to the statistics in Table 1, big data research papers have been on a growth trend since 2002, and more than 800 papers were published in 2012. These data indicate that big data research has passed its exploration and preliminary development stages and is now in a boom period.
* Regularity: The volume of domestic big data research papers follows Price's law of growth. The distribution of authors accords with Lotka's law, and the distribution of journals accords with Bradford's law. The Lotka distribution formula is $y_x = 1.400143544/x^{2.290834339}$.
* Concentration: The productive authors in the big data field include Xu Cui-ping, Li Lu, Jiang Qi, Zhang Peng and Li Xin. The most influential research institutions are the National University of Defense Technology, the Chinese Academy of Sciences, Huazhong University of Science and Technology, the University of Electronic Science and Technology and Zhejiang University. The core journals include "Computer Engineering and Applications", "Computer Engineering", "Communication World", "Microcomputer Information" and "Computer Applications".
This study shows that domestic big data research has focused on algorithms and techniques in IT applications, while the field of library and information science (LIS) is still at an initial, exploratory stage. In line with the big data trend, big data research in LIS has attracted the attention of a number of research institutions and researchers in recent years. It has been funded as a research topic by the National Social Science Foundation and the National Natural Science Foundation, and research based on big data has begun to take shape. At present, however, there are few LIS studies related to big data, and most of these papers remain at the stage of theoretical exploration and lack adequate practical support. To this end, the LIS field must intensify its efforts on big data in both theory and practice, and must continuously bring in new researchers so as to form a core group that can lead the research community.
Categories and Subject Descriptors: H. 2.8 [Database Applications]; I.2.7 [Natural Language Processing]; G.3 [Probability and Statistics]: Multivariate Statistics
General Terms: Big Data Research, Text Analysis, Multivariate Analysis
Received: 3 July 2013, Revised 14 August 2013, Accepted 19 August 2013
I am very grateful to my student Hu Yu for computing the data in this article, and to Qi Feiei for reviewing the English of the manuscript.
 Big Data: The next frontier for innovation, competition, and productivity [EB/OL]. http://www.mckinsey.com/ Insights/MGI/Research/Technologu_and_Innovation/ Big_data_the_next_frontier for innovation. 2013-04-18.
 Zipei, Xu. (2012). Large data: data revolution is coming, and how to reform government, business and our lives. Guilin: Guangxi Normal University Press, 7, p. 57.
 Junping, Qiu., Jinyan, Su., Zunyan, Xiong (2008). Information Resource Management: a Comparative Study Based on Bibliometrics. Journal of Library Science in China, (5) 37-38.
 Jing'an, Pang. (1999). Scientometrical Research Methodology. Beijing: Scientific and Technical Documentation Press, p. 299-301.
 Feicheng, Ma. (2002). Information Management Science [M]. Wuhan: Wuhan University Press, 12, p. 8085.
 Junping, Qiu. (2007). Informatics. Wuhan: Wuhan University Press, p. 45-55.
 Derek de Solla Price. (1963). Little Science, Big Science. New York: Columbia Press.
 Bradford, S. C. (1934). Sources of Information on Specific Subjects. Engineering, 26 (1) 85-86.
 Lu, Feng., Fuhai, Leng (2006). Development of Theoretical Studies of Co-word Analysis. Journal of Library Science in China, (2) 88-92.
 Wei, Xue. (2009). SPSS Statistical Analysis Method and Application. Beijing: Publishing House of Electronics Industry.
 Qiang, Du., Liyan, Jia. (2009). Breakthrough of Understanding SPSS statistical analysis from entry to master. Beijing: People's Posts and Telecommunications Press.
 Jiming, Hu. (2009). Author Co-Citation Analysis of Information Service Research in China. Journal of Intelligence, (10) 170-174.
Department of Information Management
Zhengzhou 450001, China
Table 1. Number of research papers published per year, 2002-2012

Year    2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
Papers   102   113   130   225   194   221   216   248   275   334   840

Table 2. Distribution data for authors with six or fewer papers

Papers (x)  Authors (y_x)  X = lg x    Y = lg y_x   XY           XX
1           3700           0           3.568201724  0            0
2           862            0.30103     2.935507266  0.88367574   0.090619058
3           136            0.47712125  2.133538908  1.017956761  0.227644692
4           41             0.60205999  1.612783857  0.970992635  0.362476233
5           11             0.69897     1.041392685  0.72790225   0.488559067
6           4              0.77815125  0.602059991  0.468493735  0.605519368
Sum         4754           2.8573325   11.89348443  4.06902112   1.774818419

Table 3. Distribution of prolific authors, 2002-2012

Published articles  Authors  Cumulative  Percentage   Names
11                  2        2           0.076923077  Xu Cuiping/Li Lu
10                  2        4           0.076923077  Jiang Qi Ping/Zhang Peng
9                   1        5           0.038461538  Li Xin
8                   1        6           0.038461538  Li Yan
7                   6        12          0.230769231  Liu Jie/Wang Lei/Zhang Jian/Yang Jie/Ren-yi/Nan Liu
6                   4        16          0.153846154  Baiyun Chuan/Xudong Qi/Wang Wei/Xu Xiangyang
5                   10       26          0.384615385  Chen Hua/Li Yunfei/Hu Feng/Wang Suihua/Cheng Yimin/Yang Chuanjian/Wang Hua/Sun Zhihui/Zhu Dehai/Li Jun/Zhou Shu

* Note: The data collected in this table do not account for authors who share the same name; that is, it is assumed that no two authors in the big data field have the same name.

Table 4. Top 21 research institutions by number of papers published, 2002-2012

No.  Name                                                  Published articles
1    University of Defense Technology                      55
2    Graduate School of the Chinese Academy of Sciences    36
3    Huazhong University of Science and Technology         36
4    University of Electronic Science and Technology       33
5    Zhejiang University                                   33
6    Tsinghua University                                   32
7    Northwestern Polytechnical University                 31
8    Shanghai Jiao Tong University                         29
9    Beijing University of Aeronautics and Astronautics    28
10   Wuhan University                                      28
11   Tongji University                                     27
12   Chinese University of Science and Technology          26
13   South China University                                25
14   Southeast University                                  24
15   Tianjin University                                    24
16   China University of Geosciences                       23
17   Xi'an University of Electronic Science and Technology 23
18   Xi'an Jiaotong University                             22
19   PLA Information Engineering University                21
20   Central South University                              20
21   Harbin Institute of Technology                        20

Table 5. Distribution of papers across journals

Name                                   Papers  Cumulative journals C  lg C         Cumulative papers R(n)
Computer Engineering and Applications  67      1                      0            67
Computer Engineering                   63      2                      0.301029996  130
Communication World                    53      3                      0.477121255  183
Microcomputer                          53      4                      0.602059991  236
Computer Applications                  49      5                      0.698970004  285
Application Research of Computers      38      6                      0.77815125   323
Computer and Network                   37      7                      0.84509804   360
Computer Engineering and Design        33      8                      0.903089987  393
Computer Science                       31      9                      0.954242509  424
Silicon                                31      10                     1            455

Table 7. High-frequency keywords (top 20)

No.  Keyword                  Frequency
1    Data Mining              110
2    Database                 52
3    FPGA                     51
4    Big Data                 51
5    Clustering               45
6    Support Vector Machine   37
7    Data Collection          35
8    Large Amount of Data     30
9    Data Warehouse           29
10   Data Centers             28
11   Rough Set                26
12   cloud computing          24
13   Data Processing          24
14   DSP                      23
15   association rules        20
16   algorithm                20
17   huge amounts of data     20
18   neural network           19
19   Attribute Reduction      19
20   multi-threaded           18

Table 8. Co-word matrix of high-frequency keywords in big data research papers, 2002-2012 (columns are numbered in the same order as the rows)

No.  Keyword                  1   2   3   4   5   6   7   8   9   10
1    Data Mining              11  2   0   5   10  5   2   0   6   2
2    Database                 2   4   0   1   0   0   0   1   2   3
3    FPGA                     0   0   4   0   0   0   3   0   0   0
4    Big Data                 5   1   0   6   0   0   0   0   0   1
5    clustering               10  0   0   0   11  0   0   1   0   0
6    support vector machine   5   0   0   0   0   6   0   0   0   0
7    Data Collecting          2   0   3   0   0   0   4   0   0   0
8    large amount of data     0   1   0   0   1   0   0   2   1   0
9    data warehouse           6   2   0   0   0   0   0   1   7   1
10   data centers             2   3   0   1   0   0   0   0   1   4

Figure 2. High-frequency keywords PivotChart

Keyword                Frequency
data mining            110
database               52
FPGA                   51
big data               51
cluster                45
SVM                    37
data acquisition       35
large amount of data   29
data warehouse         30
data center            28
rough sets             26
clouds computing       24
data process           23
DSP                    20
association rules      20
algorithm              20
massive data           19
neural networks        19
attributes reduction   18
multithread            18
data analysis          24

Note: Table made from bar graph.
Publication: Journal of Digital Information Management
Date: Dec 1, 2013