Mining and Correlation Analysis of Association Rules between Properties and Therapeutic Efficacy of Chinese Materia Medica Based on Strategy Pattern.
To the Editor: Chinese physicians often emphasize the combination of the properties and therapeutic efficacy of Chinese materia medica (CMM). They hold that the properties and therapeutic efficacy of traditional Chinese medicines (TCMs) should be considered as an 'organic whole.' 'Use based on therapeutic efficacy' alone can lead to neglect of the properties of CMM. It also fails to meet the requirements of 'differential diagnosis and treatment,' thus restricting flexibility in the use of CMM. The four natures, five flavors, channel tropism, and therapeutic efficacy of CMM correspond to visceral syndromes, which reflects the concept of 'wholeness' in TCM. Many studies have been conducted to mine association rules between the properties and therapeutic efficacy of CMM, and some researchers have mined text data for correlations between them. Although some results have been obtained, they have limitations, and it remains undetermined whether the obtained association rules are significant. In this paper, we designed and developed a system, based on the strategy pattern, for mining association rules between the properties and therapeutic efficacy of CMM. The strategy pattern defines a family of algorithms, encapsulates each one, and makes them interchangeable. It allows different algorithms to be implemented or swapped according to the user's choice, so the appropriate algorithm can be chosen on the basis of the characteristics of the data set. The strategy pattern was used to select the proper mining algorithm, which makes the association mining process user-friendly, intelligent, and fast.
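The strategy pattern described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the class names `MiningStrategy`, `HorizontalScan`, `VerticalTidLists`, and `Miner` are hypothetical, and the two placeholder strategies only count single items so that the interchange itself stays visible.

```python
from abc import ABC, abstractmethod


class MiningStrategy(ABC):
    """Common interface shared by every interchangeable mining algorithm."""

    @abstractmethod
    def mine(self, transactions, min_support):
        """Return the set of frequent 1-itemsets as frozensets."""


class HorizontalScan(MiningStrategy):
    """Placeholder strategy: counts items while scanning each transaction."""

    def mine(self, transactions, min_support):
        n = len(transactions)
        counts = {}
        for t in transactions:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        return {frozenset([i]) for i, c in counts.items() if c / n >= min_support}


class VerticalTidLists(MiningStrategy):
    """Placeholder strategy: builds item -> transaction-id sets first."""

    def mine(self, transactions, min_support):
        n = len(transactions)
        tids = {}
        for tid, t in enumerate(transactions):
            for item in t:
                tids.setdefault(item, set()).add(tid)
        return {frozenset([i]) for i, s in tids.items() if len(s) / n >= min_support}


class Miner:
    """Context object: delegates to whichever strategy is currently set."""

    def __init__(self, strategy):
        self.strategy = strategy

    def run(self, transactions, min_support):
        return self.strategy.mine(transactions, min_support)


# Hypothetical toy data: each transaction is the set of items of one medicine.
data = [{"cold", "bitter"}, {"cold", "sweet"}, {"warm", "sweet"}]
miner = Miner(HorizontalScan())
result_horizontal = miner.run(data, 0.5)
miner.strategy = VerticalTidLists()  # swap the algorithm, keep the caller
result_vertical = miner.run(data, 0.5)
```

Because both strategies honor the same interface, the caller does not change when the algorithm does, which is the property the system relies on.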
First, the Chinese medicine data need to be quantized. Quantization means that, through reasonable analysis, the complex CMM data are divided into independent and mutually exclusive minimum information units that consist of several Chinese characters and numbers and cannot be subdivided further. For research subject selection, the authoritative Pharmacopoeia of the People's Republic of China, 2015 edition (hereinafter referred to as the Pharmacopoeia), was taken as the reference, and the information on the 619 CMM recorded in it was extracted. The extracted information included the natures, flavors, channel tropisms, toxicities, therapeutic efficacies, and indications of CMM. Since the characteristics of ascending, descending, floating, and sinking of CMM are not included in the Pharmacopoeia, they were not considered in this paper. During information extraction, 13 CMM with incomplete information were excluded; thus, a total of 606 CMM were studied. In the initial extraction of CMM information, Chinese punctuation marks such as the comma (，), the enumeration comma (、), and the full stop (。) were used as split points. If the toxicity of a medicine was not given, the medicine was considered nontoxic. In the descriptions of the therapeutic efficacies and indications of CMM, different words or phrases are often used to mean the same thing; therefore, one of them was selected and used consistently. Information about the therapeutic efficacy and indications of CMM when used externally was retained.
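The initial splitting step can be sketched in a few lines of Python. This is an illustration only: the function names `split_units` and `toxicity_or_default` and the sample string are hypothetical, and only the three punctuation marks named above are used as split points.

```python
import re

# Split points named in the text: comma, enumeration comma, full stop.
SPLIT_PATTERN = re.compile("[，、。]")


def split_units(text):
    """Split a description at Chinese punctuation and drop empty pieces."""
    return [part.strip() for part in SPLIT_PATTERN.split(text) if part.strip()]


def toxicity_or_default(toxicity):
    """Treat a missing toxicity entry as nontoxic, as described in the text."""
    return toxicity if toxicity else "nontoxic"


# Hypothetical sample: remove heat (清热), eliminate toxicity (解毒),
# reduce swelling (消肿).
units = split_units("清热，解毒、消肿。")
```

Further marks covered by 'such as' in the text could be added to the character class in `SPLIT_PATTERN`.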
According to the definition of quantization, vague descriptions should be processed to ensure the exclusivity and uniqueness of CMM data. Data mining can then be performed on this basis. In the Pharmacopoeia, the descriptions of the natures, flavors, channel tropism, and indications of each CMM already meet the requirements of quantization, so these data need no further quantization. However, the descriptions of the toxicity and therapeutic efficacy of each CMM are relatively vague and not fully exclusive. Therefore, the toxicity and therapeutic efficacy information was quantized. In the information table, toxicity was classified into no toxicity, little toxicity, having toxicity, and high toxicity. 'Having toxicity' means that the toxicity lies between 'little toxicity' and 'high toxicity.' In order to compare the toxicity of each CMM directly, we used 'medium toxicity' to replace 'having toxicity' in this quantization research. In the Pharmacopoeia, the therapeutic efficacy of CMM is described in natural language, so there is a certain degree of vagueness. For example, the therapeutic efficacy of Folium Callicarpae Formosanae is described as 'cool blood/astringe/stop bleeding, remove blood stasis/eliminate toxicity/reduce swelling.' However, 'cool blood/astringe/stop bleeding' includes seven meanings: cool blood; astringe; stop bleeding; astringe by cooling blood; stop bleeding by cooling blood; astringe and stop bleeding; astringe by cooling blood and finally stop bleeding. 'Remove blood stasis/eliminate toxicity/reduce swelling' also includes seven meanings: remove blood stasis; eliminate toxicity; reduce swelling; remove blood stasis and eliminate toxicity; remove blood stasis and reduce swelling; eliminate toxicity and reduce swelling; remove blood stasis, eliminate toxicity, and reduce swelling. In order to guarantee the exclusivity of therapeutic efficacy, a total of 692 therapeutic efficacies involved in this study were quantized.
For example, 'remove heat/eliminate toxicity' was quantized into 'remove heat,' 'eliminate toxicity,' and 'eliminate toxicity by removing heat.' 'Dispel wind-dampness' was quantized into 'dispel wind,' 'dispel dampness,' and 'dispel wind and dampness.'
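In the examples above, the quantized meanings coincide with the non-empty combinations of the component terms (seven for three terms, three for two terms). A minimal Python sketch under that reading follows; the function name `quantize` is hypothetical, and the wording of each combined meaning ('by', 'and') would still be assigned manually.

```python
from itertools import combinations


def quantize(terms):
    """Return every non-empty combination of the component terms, shortest first."""
    return [
        combo
        for size in range(1, len(terms) + 1)
        for combo in combinations(terms, size)
    ]


three_terms = quantize(["cool blood", "astringe", "stop bleeding"])
two_terms = quantize(["remove heat", "eliminate toxicity"])
```

A phrase with n component terms yields 2^n - 1 units under this reading, which matches the counts of seven and three given in the text.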
Then, we designed the association mining system for the properties and therapeutic efficacy of CMM. The system was developed on the MATLAB R2014a platform under the Windows 10 operating system. To achieve an interactive design, the system integrated MATLAB object-oriented programming and graphical user interface design.
Four algorithms, namely Apriori, Eclat, DF-FIMBII, and CBM-Eclat, were used for frequent pattern mining of association rules between the properties and therapeutic efficacy of TCMs in the database of quantized TCM information. After frequent patterns were obtained, strong association rules needed to be found. The four algorithms generated frequent pattern sets of the same form: [INSIDE:1], where L_k denotes the set of frequent k-itemsets. Each frequent itemset was ordered. The simulation was performed in MATLAB. Different minimum support thresholds (0-1) were set for the quantized TCM dataset. With each minimum support threshold, each of the four algorithms was run three times, and the average of the three running times (rounded to three significant figures) was reported as the running time of the algorithm. After comparing the running times, Apriori and DF-FIMBII turned out to be the two most efficient algorithms. When the minimum support threshold was ≥0.07, Apriori was more efficient than DF-FIMBII. When the minimum support threshold was <0.07, however, DF-FIMBII was clearly more efficient than Apriori. At large values of the minimum support threshold, very few frequent patterns met the requirements, and DF-FIMBII spent much of its time transforming the data into a vertical data format. At small values of the minimum support threshold, many frequent patterns met the requirements. In this circumstance, Apriori scanned the database several times and generated a large number of candidate itemsets, which led to a long running time. Because of their high efficiency, Apriori and DF-FIMBII were integrated into the system for CMM dataset mining. Based on the strategy pattern, the user can choose the appropriate algorithm according to the characteristics of the data set.
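To make the repeated scanning and candidate generation of Apriori concrete, a compact Python sketch of the level-wise procedure follows. It is a textbook-style illustration rather than the code used in the system; the function name `apriori` and the toy transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {k: {itemset: support}} for every level of frequent k-itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # One pass over the database per level to count candidate supports.
        result = {}
        for c in candidates:
            support = sum(c <= t for t in transactions) / n
            if support >= min_support:
                result[c] = support
        return result

    levels = {}
    current = frequent({frozenset([i]) for t in transactions for i in t})
    k = 1
    while current:
        levels[k] = current
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = frequent(candidates)
    return levels


# Hypothetical toy transactions (one set of items per medicine).
toy = [
    {"cold", "bitter", "remove heat"},
    {"cold", "bitter", "remove heat"},
    {"cold", "sweet"},
    {"warm", "sweet"},
]
levels = apriori(toy, 0.5)
```

Each level requires another pass over the transactions, which is why a low minimum support threshold, with its many candidates and levels, lengthens the running time.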
After algorithm selection, we designed the system modules. The system consists of three modules: the data import module, the parameter setting module, and the result display module. In the data import module, the database to be mined is imported by clicking the data import button. The data should be in XLS or XLSX format, with no TID markers or column markers. After the data are imported, their properties are displayed automatically: the number of transactions, the number of items, and the average length of transactions (the total number of items in all transactions divided by the total number of transactions). In the parameter setting module, users input the minimum support threshold (0-1) and the minimum confidence threshold (0-1). Then, additional correlation measures can be selected. If the Chi-square test is chosen, the level of significance (α) should also be selected. For the Chi-square test, value 1 was defined as being relevant, and value 0 was defined as being irrelevant. Subsequently, the mining method should be chosen. If users do not select a mining method themselves, the system automatically chooses a proper one according to the minimum support threshold. For TCM text data, Apriori will be selected if the minimum support threshold is >0.07; otherwise, DF-FIMBII will be selected. After the parameters are set, clicking the mining button makes the system automatically mine association rules from the dataset. The progress of the mining process is shown, which is conducive to interaction between users and the system. When the association mining task is finished, the results are displayed automatically. Nonempty results are saved as an XLSX file. The name of the saved file is the name of the original XLSX file + minimum support threshold + minimum confidence threshold, which makes it convenient for users to check the results later.
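The displayed dataset properties and the automatic choice of mining method can be sketched as follows. The function names are hypothetical, 'number of items' is assumed to mean the number of distinct items, and the selection rule is coded exactly as stated for the system (strictly greater than 0.07 selects Apriori).

```python
def dataset_properties(transactions):
    """Return (number of transactions, number of distinct items, average length)."""
    n_transactions = len(transactions)
    distinct_items = set().union(*transactions)
    total_items = sum(len(t) for t in transactions)
    average_length = total_items / n_transactions if n_transactions else 0.0
    return n_transactions, len(distinct_items), average_length


def choose_algorithm(min_support, user_choice=None, boundary=0.07):
    """Honor an explicit user choice; otherwise pick by minimum support."""
    if user_choice is not None:
        return user_choice
    return "Apriori" if min_support > boundary else "DF-FIMBII"


# Hypothetical toy dataset with three transactions.
props = dataset_properties([{"a", "b"}, {"a"}, {"b", "c", "d"}])
auto_high = choose_algorithm(0.08)
auto_low = choose_algorithm(0.05)
manual = choose_algorithm(0.05, user_choice="Apriori")
```

Keeping the boundary as a parameter reflects that the value 0.07 was determined for this TCM dataset and may differ for other data.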
After the system was designed, we ran the mining of the properties and therapeutic efficacy of CMM and analyzed the results. First, the quantized TCM information database was imported into the association rule mining system. According to the running-time results of the algorithms, only a small number of association rules can be obtained when the minimum support threshold is ≥0.2. Therefore, the minimum support threshold was set at 0.08, and the minimum confidence threshold was set at 0.6. All additional correlation measures were selected. For the Chi-square test, the level of significance was α = 0.025. Ultimately, a total of 133 association rules were obtained; the results of association rule mining are listed in Supplementary Table 1. As for rule correlation analysis, different evaluation methods produce different results. The mining system designed in this paper contains six additional evaluation methods, and the meanings of their results are as follows:
All confidence (A, B) refers to the minimum confidence of the association rules 'A=>B' and 'B=>A' related to the two itemsets A and B. If the value of all confidence (A, B) is <0.5, A and B are negatively correlated. If the value is equal to 0.5, A and B are neutral with no obvious positive or negative correlation. If the value is >0.5, A and B are positively correlated. When all confidence was used to evaluate the association rules, 14 rules were positively correlated and 119 were negatively correlated.
Max confidence (A, B) refers to the maximum confidence of association rules 'A=>B' and 'B=>A' related to the two itemsets of A and B. If the value of max confidence (A, B) is <0.5, A and B are negatively correlated. If the value is equal to 0.5, A and B are neutral with no obvious positive or negative correlation. If the value is >0.5, A and B are positively correlated. The final results showed that all the rules were positively correlated using the maximum confidence evaluation.
Lift (A, B) denotes the ratio of the probability of containing B under the condition of containing A to the overall (unconditional) probability of containing B. Lift (A=>B) >1 indicates that the antecedent and consequent of the rule are positively correlated, Lift (A=>B) <1 indicates that they are negatively correlated, and Lift (A=>B) = 1 indicates that they are not correlated. The final results showed that 104 rules were positively correlated and 29 rules were negatively correlated.
The Chi-square test measures the degree of deviation between the actual observed values and the theoretically expected values of the statistical sample. The Chi-square value is compared with the threshold determined by the chosen level of significance. If it is greater than or equal to the threshold, the rule is significant; otherwise, the rule is not significant. The final results showed that the Chi-square values of all rules were 0 and below the threshold, so none of the rules was significant.
Kulc (A, B) refers to the mean value of two conditional probabilities (the probability of containing B under the condition of containing A, and the probability of containing A under the condition of containing B). If the value of Kulc (A, B) is <0.5, A and B are negatively correlated. If the value is equal to 0.5, A and B are neutral. If the value is >0.5, A and B are positively correlated. The final results showed that 82 rules were positively correlated and 51 rules were negatively correlated.
Cosine (A, B) can be viewed as a harmonized lift measure. It is similar to Lift except that the cosine divides by the square root of the product of the probabilities of A and B rather than by the product itself. If the value of Cosine (A, B) is <0.5, A and B are negatively correlated. If the value is equal to 0.5, A and B are neutral. If the value is >0.5, A and B are positively correlated. The final results showed that 38 rules were positively correlated and 95 rules were negatively correlated.
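The six measures can be computed directly from four counts: the number of transactions n, the numbers containing A and B, and the number containing both. The Python sketch below uses the standard definitions of these measures; the function names and the toy counts are hypothetical, and the Chi-square part returns only the statistic for the 2 x 2 contingency table (it assumes 0 < n_a < n and 0 < n_b < n so that no expected count is zero).

```python
from math import sqrt


def correlation_measures(n, n_a, n_b, n_ab):
    """Compute the six measures from transaction counts."""
    conf_ab = n_ab / n_a  # P(B|A), confidence of A=>B
    conf_ba = n_ab / n_b  # P(A|B), confidence of B=>A
    # Observed and expected counts for the cells AB, A not-B, not-A B, neither.
    observed = [n_ab, n_a - n_ab, n_b - n_ab, n - n_a - n_b + n_ab]
    expected = [
        n_a * n_b / n,
        n_a * (n - n_b) / n,
        (n - n_a) * n_b / n,
        (n - n_a) * (n - n_b) / n,
    ]
    return {
        "all_confidence": min(conf_ab, conf_ba),
        "max_confidence": max(conf_ab, conf_ba),
        "lift": n_ab * n / (n_a * n_b),       # P(AB) / (P(A) P(B))
        "kulc": (conf_ab + conf_ba) / 2,
        "cosine": n_ab / sqrt(n_a * n_b),     # P(AB) / sqrt(P(A) P(B))
        "chi_square": sum((o - e) ** 2 / e for o, e in zip(observed, expected)),
    }


def verdict(value, neutral):
    """Compare a measure with its neutral point (0.5, or 1 for lift)."""
    if value > neutral:
        return "positive"
    if value < neutral:
        return "negative"
    return "neutral"


# Hypothetical toy counts: 10 transactions, 4 contain A, 5 contain B, 3 both.
m = correlation_measures(n=10, n_a=4, n_b=5, n_ab=3)
```

Here `verdict(m["lift"], 1)` and `verdict(m["kulc"], 0.5)` both give a positive correlation for the toy counts; with other counts the measures can disagree, as they do for the mined rules.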
According to the experimental results, the system, designed with proper mining algorithm selection, obtains good association rule mining results, and its operation is intelligent and fast. In addition, correlation measures are integrated into the system. During operation, users are free to choose among these measures to evaluate the correlation between the antecedents and consequents of rules. For example, for the rule 'cool blood => no toxicity' listed in Supplementary Table 1, the support is 8.5809%, i.e., 8.5809% of the 606 CMM (52 of 606) are nontoxic and at the same time able to cool blood. The confidence of 100% shows that all CMM that can cool blood are nontoxic. All confidence and cosine show a negative correlation between the antecedent and consequent of this rule. However, max confidence, lift, and Kulc reveal a positive correlation between them. The rule analysis results of the system are intuitive, diverse, and user-friendly. Correlation measures can be selected in accordance with mining targets. Alternatively, 'voting principles' can be used. For example, if four out of the six correlation measures reveal a positive correlation between the antecedent and consequent of a certain rule, a positive correlation is recorded as the evaluation result. In other cases, such as when three correlation measures reveal a positive correlation and the other three reveal a negative correlation, 'support-confidence' is used as the standard to determine whether the rule is significant.
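The 'voting principle' can be sketched as a small Python function. The function name `vote`, the verdict labels, and the reading of 'four out of six' as 'at least four' are assumptions made for illustration; the example verdicts are those reported above for the rule 'cool blood => no toxicity', with the Chi-square test counted as not significant.

```python
def vote(verdicts, required=4):
    """Record a positive correlation when enough measures agree.

    Otherwise fall back to the support-confidence standard, as described
    in the text. `required=4` assumes 'at least four of six'.
    """
    positives = sum(v == "positive" for v in verdicts.values())
    if positives >= required:
        return "positive"
    return "support-confidence"


cool_blood_rule = {
    "all_confidence": "negative",
    "max_confidence": "positive",
    "lift": "positive",
    "chi_square": "not significant",
    "kulc": "positive",
    "cosine": "negative",
}
decision = vote(cool_blood_rule)
```

With three positive verdicts, this rule falls back to the support-confidence standard under the assumptions above.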
Many results have been obtained from CMM data mining, especially association mining of the properties and therapeutic efficacy of TCMs. Before association mining, researchers often need to preprocess the data. However, during preprocessing, they may simply split phrases into several shorter ones. For example, 'remove heat and eliminate toxicity' is split into 'remove heat' and 'eliminate toxicity.' They may also combine phrases with similar meanings into one phrase or word. For instance, 'extremely cold' and 'slightly cold' are combined as 'cold.' Such pretreatments may ignore the fact that some CMM have similar properties or therapeutic efficacy but differ in strength. They may also fail to include other possible meanings of phrases describing the therapeutic efficacy of CMM. These shortcomings suggest that current CMM data preprocessing methods often lead to loss of information. Therefore, the concept of quantization was introduced here, and vague descriptions of the therapeutic efficacy of CMM were quantized. On this basis, a database of quantized CMM information was constructed, which is conducive to subsequent CMM data mining. With this database, more information can be retained, and the data can be described clearly and in more detail. The mining results are thus more accurate.
Generally speaking, researchers often use a single method for data mining of the properties and therapeutic efficacy of CMM (the 'one to one' pattern, i.e., one method solves one problem). In fact, many mining algorithms are available. These algorithms have different efficiencies under different conditions, and the best choice depends on the circumstances. If only one algorithm is used, mining efficiency can suffer, and the mining method is also harder to improve. Moreover, many studies neglect to evaluate the obtained association rules. Therefore, even if strong rules are obtained, valuable conclusions still cannot be drawn, since the rules might be insignificant. To address these problems, we designed and developed a system based on the strategy pattern for mining association rules between the properties and therapeutic efficacy of CMM. Users are free to choose among several mining algorithms, and the proper mining algorithm can be selected through the strategy pattern. This makes the association mining process more intelligent, fast, and convenient. The results are also more intuitive, so correlation analysis can easily be performed to determine whether the association rules are significant.
Although the database can retain as much information as possible, the quantization process is not tailored to specific CMM, so limitations may exist. Moreover, the boundary value of the minimum support threshold used for algorithm selection is fixed in the system, and it may not be suitable for other datasets. Therefore, our future research will focus on how to overcome these limitations.
Supplementary information is linked to the online version of the paper on the Chinese Medical Journal website.
Financial support and sponsorship
This study was supported by a grant from the National Natural Science Foundation of China (No. 81660727).
Conflicts of interest
There are no conflicts of interest.
|Author:||Wu, Di-Yao; Zhang, Xin-You; Zhou, Xiao-Ling|
|Publication:||Chinese Medical Journal|
|Article Type:||Letter to the editor|
|Date:||Nov 20, 2018|