
Research on Chinese word segmentation based on matrix constraint.

1. Introduction

According to a report presented at the 5th annual search engine meeting, held in Boston in April 2000, the number of web pages had already exceeded 10 billion. The Internet has become the largest information repository, and how to use computers to extract useful information from the web effectively and quickly remains an open problem. Although search engines help users find the information they need within this vast body of networked information, the ever-changing nature of language makes natural language hard for computers to understand and hampers effective retrieval. Chinese automatic word segmentation is one of the key technologies of search engines, especially for processing large volumes of information. Building on existing Chinese word segmentation technology, this paper proposes a Chinese word segmentation algorithm based on a constraint matrix of syntax and semantics, and designs and implements a segmentation system.

2. Background Knowledge

2.1. The Concept of Chinese Word Segmentation and Words

The Rules of Information Processing of Chinese Word Segmentation define a word as "the basic unit with a determined semantic or grammatical function used for Chinese information processing, including words and phrases defined by the specification". A word is also the smallest language component capable of independent use, as well as an important knowledge carrier and the basic operating unit of a natural language processing system. Chinese word segmentation is the process by which a computer automatically recognizes word boundaries in text; it is the most important part of Chinese information processing. Segmentation regroups a continuous character sequence into a sequence of words according to certain specifications (Nazzi, 2014). Chinese word segmentation belongs to natural language processing and forms part of the process of understanding semantics, supplying the core words on which the semantic analysis module operates. How to improve the accuracy and speed of segmentation is a key issue in the field of information processing.

2.2. Methods of Chinese Word Segmentation

Existing segmentation algorithms fall into three main categories: (1) mechanical segmentation, based on matching against a dictionary or thesaurus; (2) statistical segmentation, based on lexical frequency statistics; and (3) intelligent segmentation, based on knowledge and understanding. The basic idea of mechanical segmentation is: first create a thesaurus; given a Chinese character string, cut off a substring in some way; if the substring matches a dictionary word, accept it as a word and continue cutting the rest of the string; otherwise the substring is not a word, and the character string is re-cut to obtain a new substring to match. According to the direction of matching, mechanical segmentation is divided into the maximum matching method, the reverse maximum matching method, the minimum segmentation method, bidirectional matching, the marker segmentation method, and so on. In general, the accuracy of reverse maximum matching is higher than that of forward maximum matching: the error rate of forward maximum matching is about 1/132, while that of reverse maximum matching is about 1/245. A bidirectional matching method can take advantage of both and obtain the best result. At present, this algorithm is used in most segmentation systems.
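
As an illustration, the following Python sketch (our own, not the paper's implementation; the window size and tie-breaking rule are assumptions) implements forward, backward and bidirectional maximum matching:

def forward_max_match(text, dictionary, max_len=4):
    # Forward maximum matching: repeatedly take the longest dictionary
    # prefix; an unmatched single character is emitted as-is.
    words, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if j == 1 or text[i:i + j] in dictionary:
                words.append(text[i:i + j])
                i += j
                break
    return words

def backward_max_match(text, dictionary, max_len=4):
    # Reverse maximum matching: take the longest dictionary suffix,
    # scanning from the end of the string toward the beginning.
    words, i = [], len(text)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            if j == 1 or text[i - j:i] in dictionary:
                words.insert(0, text[i - j:i])
                i -= j
                break
    return words

def bidirectional_max_match(text, dictionary):
    # Bidirectional matching: prefer the result with fewer words; on a
    # tie, prefer the backward result, whose error rate is lower
    # (about 1/245 versus 1/132 for forward matching).
    fwd = forward_max_match(text, dictionary)
    bwd = backward_max_match(text, dictionary)
    return fwd if len(fwd) < len(bwd) else bwd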

Word-frequency statistical segmentation is built on probability theory. Its basic idea is to reflect the credibility of a word by the probability of adjacency between characters: tabulate the probabilities of adjacent character combinations and calculate their mutual information. Mutual information is defined by the following formula:

I(A, B) = log [P(A, B) / (P(A)P(B))] (1)

where P(A, B) is the probability that the character strings A and B appear adjacently, P(A) is the probability of string A, and P(B) is the probability of string B. Assuming the numbers of occurrences of A, B and AB in the text are n(A), n(B) and n(A, B) respectively, and n is the total number of counted items, then

P(A, B) = n(A, B)/n, P(A) = n(A)/n, P(B) = n(B)/n (2)

A threshold is set in advance: if I(A, B) is greater than the threshold, A and B are combined into a word; otherwise they are not cut into a word. Because no dictionary is needed, this method is also called dictionary-free segmentation.
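
As a hedged sketch of formulas (1) and (2) (the corpus handling and threshold value here are illustrative assumptions, not the paper's settings), mutual information can be estimated from raw counts as follows:

import math
from collections import Counter

def mutual_information(corpus, a, b):
    # Estimate I(A, B) = log [P(A, B) / (P(A) P(B))] with
    # P(A) = n(A)/n, P(B) = n(B)/n, P(A, B) = n(A, B)/n (formula 2).
    n = len(corpus)
    unigrams = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(n - 1))
    if unigrams[a] == 0 or unigrams[b] == 0 or bigrams[a + b] == 0:
        return float("-inf")  # never co-occur: certainly not a word
    p_a, p_b = unigrams[a] / n, unigrams[b] / n
    p_ab = bigrams[a + b] / n
    return math.log(p_ab / (p_a * p_b))

def forms_word(corpus, a, b, threshold=3.0):
    # A and B are joined into a word when I(A, B) exceeds the pre-set
    # threshold; 3.0 is an illustrative value, not from the paper.
    return mutual_information(corpus, a, b) > threshold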

Besides these mainstream algorithms, there are other segmentation algorithms with their own strengths, such as syntactic analysis, artificial neural network methods, and hybrid dictionary-based methods.

2.3. Difficulties of Chinese Word Segmentation

Although modern Chinese word segmentation has made great progress, Chinese remains genuinely difficult, and segmentation systems cannot yet adapt easily to human patterns of thought. We should therefore take a full view of the difficulties, which include the standard of what counts as a word, ambiguity resolution, and the recognition of unlisted words.

The standard of what counts as a word: the biggest problem in Chinese word segmentation is that the concept of a word is confusing, and which character sequences should be treated as words is hard to answer. In phonographic writing systems, words are delimited by tradition, and the question of a word standard does not arise. Chinese writing, by contrast, is ideographic: characters are written continuously, with no strict delimitation or form change. This creates special difficulties for Chinese word segmentation.

Ambiguity resolution: ambiguity arises when a sentence admits several segmentations, each with a different meaning, leading to combination ambiguity and cross (overlapping) ambiguity. An important problem of Chinese word segmentation is how to select the correct result among the legal word sequences of a sentence; this is called ambiguity resolution.

Unlisted word recognition: unlisted words are words not included in the dictionary that should nevertheless be treated as words, including new general words, technical terms, and so on. The Chinese lexicon is an open system that changes continuously over time, so new words keep being produced. To segment Chinese sentences correctly, a segmentation system must be able to recognize unlisted words.

3. Identification of Chinese Word Segmentation Ambiguity

Chinese is a very complex language and difficult for a computer to understand, and unambiguous identification is one of the keys. Chinese word segmentation ambiguity is mainly of two types: combination ambiguity and overlapping ambiguity. Combination ambiguity means that a part of a word is itself a complete word; for example, in "the People's Republic of China", "China", "People" and "Republic" are separate words, but their combination is also a word. Overlapping ambiguity refers to the overlapping parts of two adjacent words; for example, in "he studies the way of life", "graduate student" is a word and "life" is a word, and they share the same character. Studies show that overlapping ambiguity is the key source of ambiguity, accounting for about 90% of all word ambiguity.

3.1. Segmentation Field of Overlapping Ambiguity

Definition 1 (overlapping ambiguity segmentation field, or intersection field): let T = a_1 a_2 ... a_n be a string that is not itself a word, where each a_i is a single Chinese character. Suppose there exist integers i_1, i_2, ..., i_m and j_1, j_2, ..., j_m (m >= 2) such that: (1) Y_1 = a_{i_1} ... a_{j_1}, Y_2 = a_{i_2} ... a_{j_2}, ..., Y_m = a_{i_m} ... a_{j_m} are each words, and T contains no word that properly contains any of Y_1, Y_2, ..., Y_m; (2) Y_1, Y_2, ..., Y_m cross one another, i.e., 1 = i_1 < i_2 <= j_1 < j_2, i_2 < i_3 <= j_2 < j_3, i_3 < i_4 <= j_3 < j_4, ..., i_{m-1} < i_m <= j_{m-1} < j_m = n. Then T is called an overlapping (intersection-type) ambiguity segmentation field, referred to as an intersection field.

Definition 2: in an intersection field T = a_1 a_2 ... a_n, let Y = a_i ... a_j be a substring of T satisfying: (A) Y is a word; (B) T contains no word that properly contains Y. Then Y is called an intersection factor of T.

Definition 3 (maximal intersection field): let T = a_1 a_2 ... a_n be any string and T1 = a_i ... a_j (1 <= i < j <= n) a substring of T that is an intersection field. If T contains no intersection field larger than T1, then T1 is called a maximal intersection field of T. Among intersection fields there also exist true ambiguities, defined next.

Definition 4: in an intersection field T = a_1 a_2 ... a_n, if Y_i = a_1 ... a_j / a_{j+1} ... a_k / a_{k+1} ... a_l and Y_j = a_1 ... a_{j'} / a_{j'+1} ... a_l are both segmentations of T, each clause of which is a word, then T is a true ambiguity, and Y_i, Y_j are true ambiguity fields of T. For example, "the way of life" can be cut both as "graduate / live / mode" and as "research / life / way"; without related context, "life" here cannot be judged to be a word. In real text, most intersection fields arise only from the machine's point of view and are false ambiguities from a human reader's point of view.
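
The crossing condition of Definition 1 can be checked mechanically. The sketch below (an illustration under assumed inputs, not the paper's detector) enumerates dictionary words inside a span and reports pairs that properly overlap:

def intersection_fields(text, dictionary, max_len=4):
    # Collect all dictionary words of length >= 2 as half-open spans
    # [i, j), then report pairs that cross per Definition 1:
    # i1 < i2 < j1 < j2 (overlap without containment).
    spans = [(i, j)
             for i in range(len(text))
             for j in range(i + 2, min(i + max_len, len(text)) + 1)
             if text[i:j] in dictionary]
    crossings = []
    for (i1, j1) in spans:
        for (i2, j2) in spans:
            if i1 < i2 < j1 < j2:
                crossings.append((text[i1:j1], text[i2:j2], (i1, j2)))
    return crossings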

3.2. Classification of Maximal Intersection Fields

Definition 5: let the maximal intersection field T = a_1 a_2 ... a_n have intersection factors a_{i_1} ... a_{j_1}, a_{i_2} ... a_{j_2}, ..., a_{i_m} ... a_{j_m} (1 = i_1 < i_2 < ... < i_m, j_1 < j_2 < ... < j_m = n). Then T can be classified according to the positions of its intersection factors, giving the macro structure of T, denoted (i_1, j_1 - i_1 + 1)(i_2, j_2 - i_2 + 1) ... (i_m, j_m - i_m + 1). The first number in each bracket is the starting position of the corresponding intersection factor in T, and the second number is the length of that factor. For example, the macro structure of "what is the character set" is denoted (0, 2)(2, 2)(3, 4)(4, 2), and the macro structure of "life" is denoted (0, 2)(1, 2).

Definition 6: for two macro structures W_1: (i_1, k_1)(i_2, k_2) ... (i_m, k_m) and W_2: (j_1, l_1)(j_2, l_2) ... (j_n, l_n), we say W_1 > W_2 if there exists x (1 <= x <= m, 1 <= x <= n) such that k_1 = l_1, i_2 + k_2 = j_2 + l_2, ..., i_x + k_x = j_x + l_x, and i_{x+1} + k_{x+1} > j_{x+1} + l_{x+1}.

To illustrate the rationality of classifying and processing ambiguity types, two concepts are needed: the static frequency and the dynamic frequency of maximal intersection fields.

Definition 7: let the complete set of maximal intersection fields be I = {T_1, ..., T_i, ..., T_n}, where the number of occurrences of field T_i in the corpus is Freq(T_i), and let C = {T_{i_1}, ..., T_{i_m}} be a collection of some maximal intersection fields. The static frequency and dynamic frequency of C with respect to I are defined as:

static frequency = |C| / |I|, dynamic frequency = [Freq(T_{i_1}) + ... + Freq(T_{i_m})] / [Freq(T_1) + ... + Freq(T_n)]

where |C| and |I| denote the sizes of the respective sets. According to Perruchet (2014), the distribution of intersection field types shows that macro structures are fairly concentrated, which makes classification processing possible. Collecting the maximal intersection fields by cutting the Chinese character string is the premise of classification processing: the string is tested gradually from its head, divided into single maximal intersection fields, and the maximal intersection fields are then extracted one by one.
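
A minimal sketch of Definition 7 (assuming the field sets and a corpus frequency table are already available) computes the two quantities directly:

def static_frequency(C, I):
    # Static frequency of the subset C within the full set I: |C| / |I|.
    return len(C) / len(I)

def dynamic_frequency(C, I, freq):
    # Dynamic frequency: corpus occurrences of the fields in C over the
    # occurrences of all fields in I; freq maps a field to Freq(T_i).
    return sum(freq[t] for t in C) / sum(freq[t] for t in I)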

3.3. Principles of Chinese Word Segmentation Algorithm Design

Prefer larger segmentation granularity: text segmentation used for semantic analysis should have larger granularity, because a larger unit contains more characters and expresses its meaning more exactly (Katharine, 2015). For instance, "the president of the student union" can be divided into "the student union" and "president", but can also be divided into "student", "union" and "president". Both segmentations are correct, but from the point of view of semantic analysis the first is the best.

The segmentation result should contain as few non-dictionary words as possible, and among single characters it should prefer those that are independent dictionary words (Li, 2015). Non-dictionary words here are words not included in the dictionary, while single-character dictionary words are characters that can be used independently, such as "owned", "did", "and", "you", "I" and "he" (Qiu, 2015). For instance, "technique and service" can be divided as "technique / and-serve / task" or as "technique / and / service". The former contains a fragment not found in the dictionary, while in the latter the single character "and" can be used independently and "technique" and "service" are both dictionary words, so the latter is used.

The total number of words in the segmentation result should be as small as possible; for the same text, fewer words means each word carries more of the semantic content, making the result more important and exact (Kurumada, 2013). A ranking sketch that combines these principles is given below.
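
The three principles above can be read as a lexicographic ranking of candidate segmentations. The following sketch (the key function and its ordering are our own reading of the principles, not the paper's scorer) prefers fewer words, then fewer single characters that are not independent dictionary words:

def rank_candidates(candidates, dictionary):
    # Order candidate segmentations by: (1) fewer words overall, i.e.
    # coarser granularity; (2) fewer single-character tokens that are
    # not independent dictionary words. The best candidate comes first.
    def key(segmentation):
        n_words = len(segmentation)
        n_bad_singles = sum(1 for w in segmentation
                            if len(w) == 1 and w not in dictionary)
        return (n_words, n_bad_singles)
    return sorted(candidates, key=key)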

3.4. Evaluation Standards of Chinese Word Segmentation Algorithms

The aim is to establish an open and general segmentation algorithm system for modern written Chinese. The standards for evaluating an automatic segmentation system are its correct rate and its speed.

Correct rate of segmentation:

α = (number of correctly segmented words / total number of words in the corpus) x 100% (3)

Correct rate of identifying unlisted words:

β = (number of correctly identified unlisted words / total number of identified unlisted words) x 100% (4)

Recall rate of unlisted words:

γ = (number of identified unlisted words / total number of unlisted words) x 100% (5)

Among these, the correct rate is of the greatest importance. It is affected by the problems of ambiguity and of identifying unlisted words, so selecting a proper algorithm model that addresses ambiguity and unlisted-word identification is necessary to improve the correct rate of the segmentation algorithm.
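
Formulas (3)-(5) translate directly into evaluation code; the sketch below assumes the counts have already been produced by comparing system output against a gold-standard segmentation:

def segmentation_accuracy(correct_words, total_words):
    # Formula (3): correctly segmented words over all corpus words.
    return correct_words / total_words * 100

def oov_precision(correct_oov, identified_oov):
    # Formula (4), read as precision: correctly identified unlisted
    # words over all words the system flagged as unlisted.
    return correct_oov / identified_oov * 100

def oov_recall(identified_oov, total_oov):
    # Formula (5): identified unlisted words over all unlisted words.
    return identified_oov / total_oov * 100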

At present, several algorithm models are commonly used: the word-based trigram language model, the hidden Markov model, the expectation-maximization algorithm, the noisy channel model, and so on.

4. Text Word Matrix

4.1. Definition of a Text Word Matrix

Definition 8: given a text of length m (in words), from which n-gram features of order n are to be selected, build an n x (m+n-1) matrix whose elements are distributed as follows: in the kth row, set the first n-k elements and the last k-1 elements to 0, and fill the m elements in the middle with all words or punctuation marks of the text in order. The matrix built by this rule is the text word matrix, with the concrete structure shown below (Grigori, 2014):

M =
| 0    0    ...  0    w_1  w_2  ...  w_m |
| 0    ...  0    w_1  w_2  ...  w_m  0   |
|                 ...                    |
| w_1  w_2  ...  w_m  0    0    ...  0   |

The structure of the text word matrix is shown in algorithm 1:

Algorithm 1: text word matrix construction algorithm

Input: text t{w_1, w_2, ..., w_m} and n, the order of the Chinese n-gram model to be used.

Output: the matrix as an array arr[n][m+n-1], subscripts starting from 1.

1:  begin
2:  for each i: 1 to n
3:    for each j: 1 to n-i
4:      arr[i][j] = 0;
5:    end for
6:    for each j: m+n-i+1 to m+n-1
7:      arr[i][j] = 0;
8:    end for
9:    for each j: n-i+1 to m+n-i
10:     arr[i][j] = t[j-(n-i)];
11:   end for
12: end for
13: return arr;
14: end
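
A direct Python rendering of Algorithm 1 (our sketch; it builds each row as a shifted copy of the token list, so each interior column read downward is an n-gram):

def text_word_matrix(tokens, n):
    # Definition 8: an n x (m+n-1) matrix whose kth row (1-based)
    # carries n-k leading zeros, the m tokens in order, and k-1
    # trailing zeros.
    m = len(tokens)
    matrix = []
    for k in range(1, n + 1):
        row = [0] * (n - k) + list(tokens) + [0] * (k - 1)
        matrix.append(row)
    return matrix

# Example: text_word_matrix(["w1", "w2", "w3"], 2) yields
# [[0, "w1", "w2", "w3"],
#  ["w1", "w2", "w3", 0]], whose interior columns are the 2-grams.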


According to the experimental study of Zhou and others, the Chinese text features with the best classification effect are 2-grams, 3-grams and 4-grams. In this paper, the generated word matrices are accordingly divided into three categories (Liu, 2013). The basic form of the 2-grams matrix:

| 0    w_1  w_2  ...  w_m |
| w_1  w_2  ...  w_m  0   |

The basic form of 3-grams matrix:

| 0    0    w_1  w_2  ...  w_m |
| 0    w_1  w_2  ...  w_m  0   |
| w_1  w_2  ...  w_m  0    0   |

The basic form of 4-grams matrix:

| 0    0    0    w_1  w_2  ...  w_m |
| 0    0    w_1  w_2  ...  w_m  0   |
| 0    w_1  w_2  ...  w_m  0    0   |
| w_1  w_2  ...  w_m  0    0    0   |

According to the theory of Chinese single- and double-character word recognition, more than 95% of the words in Chinese text are single- or double-character words. By giving up a small number of multi-character words and using the large number of single- and double-character words to construct n-gram features, the basic form of the above word matrix is deformed (Sun, 2013).

4.2. Transformation of the Text Word Matrix

Definition 9: the double words in a text T are formed from two adjacent single words arranged in text order (Goldberg, 2013). Assume a text t{w_1, w_2, ..., w_m}.

There are two ways to form double words in T: word_1{(w_1 w_2), (w_3 w_4), ..., (w_{m-1} w_m)} and word_2{w_1, (w_2 w_3), (w_4 w_5), ..., (w_{m-2} w_{m-1}), w_m}.

Because redundant words and double words without semantic meaning arise in the combination process (such as "de ren" and "le mei"), we filter the redundant elements and non-semantic words by calculating the correlation of each element of T, word_1 and word_2 with the categories, and in this way determine how the single words of the text are finally combined. Correlation is defined using the mutual information between elements and categories, computed with formulas (6), (7) and (8):

r(w, c_i) = p(w, c_i) log [p(w, c_i) / (p(w) p(c_i))] (6)

avg(w) = (1/m) Σ_{i=1}^{m} r(w, c_i) (7)

σ(w) = sqrt((1/m) Σ_{i=1}^{m} (r(w, c_i) - avg(w))^2) (8)

r(w, c_i) is the correlation between the single word (or combined double word) w and category c_i; avg(w) is the average correlation between w and the classes; and σ(w) is the final result. If σ(w) is large, w is associated with only one or a few categories and contributes strongly to classification, so it is retained; if σ(w) is small, below the threshold, then w is related to every class or most classes, and it is attributed to the redundant, unrelated elements.
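
A compact sketch of formulas (6)-(8); the probabilities are taken as given, since estimating them from a labeled corpus is outside this fragment:

import math

def correlation(p_w_ci, p_w, p_ci):
    # Formula (6): r(w, c_i) = p(w, c_i) * log[p(w, c_i) / (p(w) p(c_i))].
    if p_w_ci == 0:
        return 0.0
    return p_w_ci * math.log(p_w_ci / (p_w * p_ci))

def sigma(r_values):
    # Formulas (7)-(8): standard deviation of r(w, c_i) over the m
    # categories; a large value means w discriminates few categories well.
    m = len(r_values)
    avg = sum(r_values) / m
    return math.sqrt(sum((r - avg) ** 2 for r in r_values) / m)

def keep_element(r_values, delta):
    # Retain w when sigma(w) reaches the threshold delta; otherwise it is
    # treated as a redundant, category-neutral element and zeroed out.
    return sigma(r_values) >= delta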

In the process of text word matrix transformation, we obtain the correlation value of every element of T, word_1 and word_2: V_t{v_1, v_2, ...}, V_w1{v_1, v_2, ...}, V_w2{v_1, v_2, ...}. Elements whose correlation is smaller than the threshold δ are set to 0, and the three collections are then merged.

If the correlation value of an element of V_w1 or V_w2 is greater than that of the corresponding element of V_t, the single words of T are replaced by the corresponding double word of word_1 or word_2. When V_w1 and V_w2 are merged into V_t, conflicts can occur at the same position, so before the merge the corresponding values in V_w1 and V_w2 are compared and the larger correlation value is taken.

In the merge, the non-zero elements v_k of V_w1 and V_w2 are consolidated into V_t, replacing the zero elements of V_t or the elements whose correlation value is smaller. The merged position of v_k in V_t is calculated as shown in formula (9):

pos(v_k) = 2k - 1, if v_k ∈ V_w1;  pos(v_k) = 2k - 2, if v_k ∈ V_w2 (9)

The specific transformation of the word matrix is shown in algorithm 2:

Algorithm 2: word matrix transform algorithm

Input: text t{w_1, w_2, ..., w_m}, correlation threshold δ, and n (the n-gram order to be used).

Output: arr (the matrix of elements after the transformation) and arr_value (the matrix of the corresponding correlation values), subscripts starting from 1.

1:  begin
2:  Combine the adjacent single words of t{w_1, w_2, ..., w_m} from front to back to obtain word_1{(w_1 w_2), (w_3 w_4), ..., (w_{m-1} w_m)} and word_2{w_1, (w_2 w_3), (w_4 w_5), ..., (w_{m-2} w_{m-1}), w_m}
3:  Calculate the correlation value σ of each element of t, word_1 and word_2.
4:  for each i: 1 to m
5:    if σ(t[i]) < δ
6:      V_t[i] = 0;
7:    else V_t[i] = σ(t[i]);
8:  end for
9:  for each i: 1 to ceil(m/2)
10:   if σ(word_1[i]) < δ
11:     V_w1[i] = 0;
12:   else V_w1[i] = σ(word_1[i]);
13: end for
14: for each i: 1 to ceil(m/2)
15:   if σ(word_2[i]) < δ
16:     V_w2[i] = 0;
17:   else V_w2[i] = σ(word_2[i]);
18: end for
19: for each i: 1 to ceil(m/2)
20:   if V_w1[i] > V_w2[i]
21:     if V_w1[i] > V_t[2i-1]
22:       t[2i-1] = word_1[i];
23:       V_t[2i-1] = V_w1[i];
24:     end if
25:   else
26:     if V_w2[i] > V_t[2i-2]
27:       t[2i-2] = word_2[i];
28:       V_t[2i-2] = V_w2[i];
29:     end if
30:   end if
31: end for
32: Using Algorithm 1, build the word matrix from t and V_t to obtain arr and arr_value.
33: end


The transformed set is an irregular sequence of single and double words, merge{(w_1 w_2), (w_3 w_4), w_5, ...}, whose values are the correlation values interspersed with 0: V_m{v_1, v_2, v_5, 0, ...}. Through this combination of text word matrix elements we obtain a mixed matrix of single and double words. For 2-grams, 3-grams and 4-grams, the deformed matrices are represented as follows:

After the transformation of the 2-grams text word matrix, the representation form of the matrix elements:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

After the transformation of the 2-grams text word matrix, the representation form of the corresponding element values:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

After the transformation of the 3-grams text word matrix, the representation form of the matrix elements:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

After the transformation of the 3-grams text word matrix, the representation form of the corresponding element values:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

After the transformation of the 4-grams text word matrix, the representation form of the matrix elements:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

After the transformation of the 4-grams text word matrix, the representation form of the corresponding element values:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

5. Matrix Constraint Method of Word Segmentation Technology

5.1. Matrix Constraint Method

The basic idea of the matrix constraint method (Bhatti, 2014) is: first create a grammatical constraint matrix and a semantic constraint matrix, whose elements indicate, respectively, whether a word of one syntactic category may be adjacent to a word of another syntactic category under the rules of grammar, and whether a word of one semantic class may logically be adjacent to a word of another class. The machine applies these constraints while performing segmentation.

Let TCN = (T, C) be a string function, where T = {g_0, g_1, ..., g_{n-1}}, the constraint matrix P = (p_ij) is an n-order matrix, and C is the Modern Chinese Grammar Information Dictionary. By definition, the constraint matrix is a Boolean matrix expressing the constraint conditions of the constraint system. For an element p_ij of the matrix, if category i may stand in the corresponding adjacency relation to category j, the element's value is 1; otherwise the element's value is 0. The constraint matrix can be generated from the constraint network.
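
For illustration, checking a segmentation against such a Boolean matrix might look as follows (the category map, index map and matrix contents are assumed inputs; the real system draws them from the grammar dictionary):

def satisfies_constraints(words, category_of, index_of, P):
    # P[i][j] == 1 means a word of category i may be directly followed
    # by a word of category j; any 0 entry rejects the segmentation
    # and triggers re-segmentation.
    for left, right in zip(words, words[1:]):
        i = index_of[category_of[left]]
        j = index_of[category_of[right]]
        if P[i][j] != 1:
            return False
    return True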

The grammar rules in the matrix are taken from the Modern Chinese Grammar Information Dictionary, an electronic dictionary prepared for computer-based Chinese analysis and generation. The dictionary has 32 database files, comprising one head bank of all words and 23 word banks for the various parts of speech; in addition, the verb library is divided into six depots and the pronoun library into two. The head bank has 13 attribute fields, the noun library 27 attribute fields, and the verb library 46 attribute fields; the total number of information items in the whole library is up to 2520026, requiring about 16681B of storage. The electronic dictionary is widely applied in the field of language information processing.

The semantic constraint matrix is constructed by the same method, and the details are omitted here. The semantic rules adopt the SKCC, a semantic knowledge base for Chinese information processing completed at the end of 1998 and containing 48835 words; it provides strong support for computer semantic analysis (Freixo, 2014). The grammar and semantic constraint matrices P are used to constrain every phrase produced by the segmentation system: a phrase is kept unchanged if the corresponding element is 1; otherwise re-segmentation is performed until all words satisfy the constraint matrices.

5.2. Principles of Chinese Word Segmentation

Principle 1 Make words as long as possible: if the whole intersection field is a word, do not segment it; otherwise make each part of the segmentation result a multi-character word where possible, and avoid results composed of many single characters.

Principle 2 Phrase and idiom first. If the waiting segmentation field contains phrases and idioms, try to ensure to make this part into words.

Principle 3 Conform to the grammar rules. Segmentation results must comply with the rules of grammar and the situation such as "adjective + verb" are not allowed.

Principle 4 Conform to the semantic rules. Segmentation results must meet the semantic rules and the obvious semantic errors are not allowed.

Principle 5 Forward maximum matching first, applied when several reasonable segmentations appear.

Although these principles are not strict and must sometimes depend on conditions, and several principles may apply at the same time when dealing with an ambiguous field, this makes the system more flexible to implement and easier to improve by refining the rules for handling uncertainty in the future.

6. Process of the System Segmentation

The basic process of the word segmentation system is shown in Figure 1. First, the text stream is pre-processed with a finite-state automaton (FSA) to identify English and Chinese numerals with obvious characteristics (including cardinal numbers, ordinal numbers, fractions, decimals, percentages, etc.), as well as place names, dates, person names, and so on. The pre-processed text is then segmented: cross-ambiguity detection against the core dictionary produces preliminary segmentation results that go into the next round of testing. The grammar constraint matrix is applied: if the result passes, it goes to the next round; otherwise it is re-segmented. In the next round, the semantic constraint matrix is applied in the same way: if the result passes, it continues; otherwise it is re-segmented. Once the whole result conforms to the rules, unknown words are recognized against the new-word dictionary, segmentation is completed, and the results are obtained. Practice shows that the effect of this treatment is remarkable: in the simulation results, accuracy increases by about 6.7%.
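
The control flow of the system can be summarized in a few lines. In this hedged sketch every stage is passed in as a callable, since the paper gives no code for them; re-segmentation is modeled by moving to the next candidate whenever either constraint matrix rejects the current one:

def segment(text, preprocess, candidates, grammar_ok, semantic_ok,
            recognize_new_words):
    # preprocess: FSA pass for numerals, dates, names, addresses.
    # candidates: segmentations from cross-ambiguity detection against
    #   the core dictionary, best-first.
    # grammar_ok / semantic_ok: the two constraint-matrix checks.
    # recognize_new_words: unknown-word recognition via the new-word
    #   dictionary, producing the final result.
    tokens = preprocess(text)
    for seg in candidates(tokens):
        if grammar_ok(seg) and semantic_ok(seg):
            return recognize_new_words(seg)
    return None  # no candidate satisfied both constraint matrices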

7. Conclusion

Word segmentation is the basis of computer information processing, and it is indispensable both in information retrieval and in text classification. This paper performs Chinese word segmentation based on the Modern Chinese Grammar Information Dictionary and the Modern Chinese Semantic Dictionary, inspects segmentation results on the basis of grammatical and semantic understanding, and improves the ability to eliminate ambiguity. In addition, the system introduces the constraint matrix, which improves the accuracy of the segmentation system and has reference value for further Chinese information processing and information mining.

Recebido/Submission: 07/04/2016

Aceitacao/Acceptance: 22/07/2016

References

Francisco V., Liliana CH. (2013). Syntactic Dependency-based N-grams as Classification Features. Lecture Notes in Computer Science, 7630, 1-11.

Freixo, J., & Rocha, A. (2014). Arquitetura de Informacao de Suporte a Gestao da Qualidade em Unidades Hospitalares. RISTI--Revista Iberica de Sistemas e Tecnologias de Informacao, (14), 1-15.

Goldberg Y., Elhadad M. (2013). Word segmentation, unknown-word resolution, and morphological agreement in a Hebrew parsing system. Computational Linguistics, 39(1): 111-120.

Grigori S., Francisco V., Alexander G. (2014). Syntactic N-grams as Machine Learning Features for Natural Language Processing. Expert Systems with Applications, 41, 853-860.

Katharine G.E., Casey L.W. (2015). Listening through voices: Infant statistical word segmentation across multiple speakers. Developmental Psychology, 51(11): 17-22.

Kurumada C., Meylan S.C., Frank M.C. (2013). Zipfian frequency distributions facilitate word segmentation in context. Cognition, 127(3): 49-53.

Li X., Zong C., Su K.Y. (2015). A Unified Model for Solving the OOV Problem of Chinese Word Segmentation. ACM Transactions on Asian and Low-Resource Language Information Processing, 14(3): 24-26.

Liu P.P., Li W.J., Lin N. (2013). Do Chinese readers follow the national standard rules for word segmentation during reading? Plos One, 8(2): 39-47.

Nazzi T., Mersad K., Sundara M. (2014). Early word segmentation in infants acquiring Parisian French: task-dependent and dialect-specific aspects. Journal of Child Language, 41(3): 25-29.

Qiu Y.F., Liu S.X., Shao L.S. (2015). N-grams feature selection and weighting algorithm based on single-word matrix intersection. Computer Engineering and Applications, 8(19): 4-9.

Sun X., Zhang Y., Matsuzaki T. (2013). Probabilistic Chinese word segmentation with non-local information and stochastic training. Information Processing & Management, 49(3): 210-231.

Fangmei Liu, Pengyuan Wang, Zengyu Cai

mailczy@163.com

Zhengzhou University of Light Industry, 450002, Zhengzhou, China