Words that often occur together form collocations. Collocations are important language components and have been used to facilitate many natural language processing tasks, including natural language generation, machine translation, information retrieval, sentiment analysis and language learning. At the same time, collocations are difficult to capture, especially for second language learners, and new collocations develop quickly nowadays, especially with the help of the abundant user-generated content on the Web. In this paper we present an automatic collocation extraction and exploration system for the Chinese language: the DACE system. We identify collocations using three measures: frequency, mutual information and the χ²-test. The system was built upon distributed computing frameworks so as to efficiently process large-scale corpora. Empirical evaluation and analysis of the system showed the effectiveness of the collocation measures and the efficiency of the distributed computing processes.
Collocations, also known as multi-word expressions [1] or compound words [2], are important language units. From the computational point of view, a collocation is a set of words that occur together more often than by chance [3]. These include, for example, compound nouns like train station, phrasal verbs like follow up, proper nouns like New Zealand, and common syntactic patterns like adjective+noun, as in heavy rain. Collocations have been used widely in many natural language processing tasks, for example to help natural language generation [4], improve machine translation quality [1], impact search result ranking [5], disambiguate word senses [6], and assist sentiment analysis [7] and second language learning [8].
Though important, collocations are difficult to capture and learn, especially for second language learners, simply because there are so many of them and their forms are extremely diverse. Previous researchers proposed to exploit the enormous web resources to discover collocations for language learning [8]. This motivates us to tap into the Chinese Web 5-gram corpus [9] for identifying Chinese collocations. The chosen corpus consists of over 39 billion Chinese phrases, each associated with its number of occurrences across over 800 million web pages. Even after filtering out phrases that occur fewer than five thousand times, the remaining phrases still generated 40 million collocation candidates. The sheer amount of data presents a new challenge to the conventional standalone extraction process. Therefore, we employed the Hadoop distributed computing platform, parallelizing the extraction process with the MapReduce framework, and storing and retrieving the extracted collocations with the distributed database HBase. We named the system DACE, for Distributed Automatic Collocation Extraction. Using distributed computing, DACE can complete the extraction process within four hours for all 40 million collocation candidates, and retrieve any collocation within a second.
The rest of this paper is organized as follows. Next we review related work on automatic collocation extraction, with a specific focus on the Chinese language. Section 3 explains DACE's system architecture, and Sections 4 and 5 describe the distributed extraction and indexing phases respectively. Section 6 presents the experimental setup and discusses empirical results. Section 7 concludes the study.
Most automatic collocation extraction methods rely on a measure that quantifies the association strength between words in a phrase, so as to determine whether the co-occurrence of two or more words is indeed statistically more frequent than by chance. In general, these measures fall into three types: frequency, information-theoretic measures and hypothesis test scores. Early studies used co-occurrence frequency to identify collocations [10]. Later, mutual information was commonly employed [11]. By comparing the observed number of co-occurrences with the expected co-occurrence frequency under the assumption that the component words are independent, mutual information recognizes word combinations whose co-occurrence probability is greater than the expected value as collocations. Similarly, hypothesis tests discover associated events (i.e. words) by comparing against the null hypothesis (i.e. assuming independent events). Such tests include the log-likelihood ratio [12], the t-test [13] and the χ²-test [8]. In addition, position, span and syntactic rules can also be incorporated into the association measure [14].
More recent studies have proposed several new ways for collocation extraction. For example, linear regression was applied with features covering 84 collocation rules and three linguistic patterns to quantify the association strength between words in valid collocations [15]. The bilingual word alignment method commonly used in machine translation was adapted to the monolingual scenario to extract collocations that co-occur in similar contexts [16]. Multilingual contexts and multilingual corpora have also received increasing attention in recent studies on automatic collocation extraction [17].
For the Chinese language, previous studies have investigated the distinct properties of Chinese collocations [19] and methods for extracting them [1, 2, 16, 19, 20, 21, 22, 23]. Table 1 summarizes the recent studies from three aspects: the corpus in use and its scale in terms of the number of characters, the association strength measure employed, and the target application if there is one.
Table 1 shows that news corpora were by far the major resource for most automatic collocation extraction systems. Recently, in the English language domain, large-scale web resources and multilingual corpora have also been used [8, 17]. They provide unprecedentedly rich resources for the task. Such resources for the Chinese language are yet to be exploited.
The proposed DACE system adopts a pipeline architecture, as shown in Figure 1. Given a corpus consisting of phrases and their numbers of occurrences, the corpus is first filtered, for example to remove phrases that contain non-Chinese characters and thus rarely form valid Chinese collocations. The phrases can then be subjected to syntactic analysis: for example, a part-of-speech (POS) tagger can be employed and phrases can be filtered based on their syntactic patterns. This step is optional though.
The filtered phrases are then stored into a distributed database table (see Table 2), upon which the association strength of each phrase as a collocation is calculated using the different measures. This calculation step is also distributed. The resulting collocations and their associated scores are stored into the index tables.
During the exploration stage, i.e. retrieval of the extracted collocations, the user typically submits a keyword, which is then matched against the three index tables. Collocations have directions: a collocation can either start or end with the keyword, yielding right and left collocations respectively. DACE provides options to choose the collocation direction, the collocation measure, and the number of hits returned. Finally, the matched collocations are ranked in descending order of their associated scores and returned.
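As a hedged sketch of this lookup-and-rank logic (the function and table names are illustrative, not DACE's actual implementation), the left/right row keys and descending-score ranking can be outlined as:

```python
def lookup(index_table, keyword, direction="R", limit=10):
    """Retrieve ranked collocations for a keyword.

    index_table maps row keys like '学习_L' / '学习_R' to lists of
    (partner_word, score) pairs. direction 'R' matches collocations
    that start with the keyword, 'L' those that end with it.
    """
    row_key = f"{keyword}_{direction}"
    hits = index_table.get(row_key, [])
    # Rank matches in descending order of their association score.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:limit]

# Toy index with hypothetical scores.
index = {
    "学习_R": [("方法", 812.0), ("英语", 455.0), ("资料", 1204.5)],
}
top = lookup(index, "学习", direction="R", limit=2)
```

A query for 学习 in the right direction thus returns its highest-scoring partners first, mirroring how the matched collocations are ranked before being returned.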
Implementing the DACE system in a distributed environment allows us to perform comparative studies much more efficiently. We selected and implemented three measures for quantifying the salience of a phrase as a collocation: frequency, mutual information and the χ²-test. As for the distributed computing platform, we chose the most well-known off-the-shelf framework, Hadoop. This section first explains the collocation measures (i.e. mutual information and the χ²-test) and then describes the parallelized extraction process.
Given two words $w_1$ and $w_2$, their mutual information is calculated as follows:

$$I(w_1, w_2) = \log_2 \frac{P(w_1 w_2)}{P(w_1)\,P(w_2)} \qquad (1)$$

where $P(w) = c(w)/N$ is the probability of word $w$, $c(w)$ is $w$'s number of occurrences in a corpus, and $N$ is the total sum of occurrences of all words in the corpus. Mutual information in this form is also known as pointwise mutual information (PMI).
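As an illustrative sketch, Equation (1) can be computed directly from raw frequency counts; the counts below are made-up toy values, not figures from our corpus:

```python
import math

def pmi(c_w1w2, c_w1, c_w2, n):
    """Pointwise mutual information of a bigram, per Equation (1):
    log2( P(w1 w2) / (P(w1) * P(w2)) ), with P(w) = c(w) / n."""
    p_joint = c_w1w2 / n
    p_w1 = c_w1 / n
    p_w2 = c_w2 / n
    return math.log2(p_joint / (p_w1 * p_w2))

# Toy counts: the bigram occurs 8 times in a corpus of 10,000 tokens,
# while its component words occur 40 and 50 times respectively.
score = pmi(c_w1w2=8, c_w1=40, c_w2=50, n=10_000)
```

A positive score means the pair co-occurs more often than the independence assumption predicts, which is exactly the condition under which the pair is recognized as a collocation.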
The χ²-test is also known as Pearson's chi-square test. It extracts collocations by comparing the actual and the expected numbers of occurrences. Given two words $w_1$ and $w_2$, their χ²-test score is calculated as follows:

$$\chi^2 = \frac{N\,(O_{11}O_{22} - O_{12}O_{21})^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})} \qquad (2)$$

where $O_{11} = c(w_1 w_2)$, $O_{12} = c(w_1 \neg w_2)$, $O_{21} = c(\neg w_1 w_2)$, $O_{22} = c(\neg w_1 \neg w_2)$, and $N$ is again the total sum of occurrences of all words in the corpus. In words, $O_{11}$ is the number of times $w_1$ and $w_2$ co-occur, $O_{12}$ is the number of times $w_1$ occurs without $w_2$, and $O_{21}$ is the number of times $w_2$ occurs without $w_1$.
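The χ²-test score of Equation (2) can be computed from the four contingency counts; the sketch below uses hypothetical toy counts for illustration:

```python
def chi_square(o11, o12, o21, o22):
    """Pearson chi-square score for a bigram, per Equation (2).

    o11: times w1 and w2 co-occur; o12: w1 without w2;
    o21: w2 without w1; o22: neither word occurs.
    """
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

# Toy contingency table for a corpus of 10,000 bigram positions:
# the pair co-occurs 8 times, w1 appears 40 times and w2 50 times.
score = chi_square(o11=8, o12=32, o21=42, o22=9918)
```

The larger the score, the stronger the evidence against the null hypothesis of independence between the two words.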
We used the Hadoop MapReduce framework to distribute the entire extraction process to a cluster of computing nodes. Each distributed calculation step in Figure 1 corresponds to one Mapper class and one Reducer class in the MapReduce programming framework. In general, the Mapper class processes a phrase and outputs a <key, value> pair to represent it, for example <collocation, score>, while the Reducer class mainly sorts the collocations by their scores and stores them into the backend database. It is worth noting that when calculating the mutual information (MI) and χ²-test (CHI) scores, since both require the counts $c(w_1 w_2)$, $c(w_1)$ and $c(w_2)$, an intermediate step was designed to collect these statistics (see Section 6.2).
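To make the mapper and reducer roles concrete, here is a plain-Python simulation of one such pass over bigram records. This is not Hadoop's actual Java API; the record values and helper names are illustrative only:

```python
from collections import defaultdict

def mapper(record):
    """Emit a <key, value> pair per record, keyed by the leading word."""
    phrase, count = record
    w1, w2 = phrase.split()
    yield (w1, (w2, count))  # <key, value> = <word, (partner, score)>

def reducer(key, values):
    """Sort one key's collocations in descending score order."""
    return key, sorted(values, key=lambda v: v[1], reverse=True)

def run_job(records):
    # The shuffle/sort phase groups mapper output by key before reducing.
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return dict(reducer(k, vs) for k, vs in shuffled.items())

index = run_job([("天气 预报", 120), ("天气 很好", 340)])
```

In the real system the reducer would write its sorted output into the backend database rather than return an in-memory dictionary.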
Efficient indexes are fundamental for responsive retrieval and exploration, especially given large amounts of data. In accordance with the extraction process, we used HBase, the distributed data storage framework associated with Hadoop, as the backend database system.
In contrast to the relational data structure of traditional SQL databases, HBase adopts a column-based data structure. Tables consist of column families, and a column family consists of columns. Both the number and the data types of columns in one column family can vary on the fly. This dynamic structure is due to HBase's sparse key-value format for physical data storage, as shown in Figure 2. Such a structure is well suited to storing the collocation data: the number of collocated words varies dramatically for different key words.
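As a rough illustration (plain Python, not the HBase client API), the sparse key-value layout can be pictured as a mapping from (row key, column qualifier) pairs to cell values, where absent columns simply have no entry:

```python
# Sparse key-value view of one index row; qualifiers and values are
# illustrative. Only cells that exist are stored, so different rows can
# have wildly different numbers of columns at no extra storage cost.
row = {
    ("学习_R", "collocations:word1"): "方法",
    ("学习_R", "collocations:score1"): "812.0",
    ("学习_R", "freq:total"): "1204",
}

# A word with zero collocations would simply have no collocation cells.
columns = [qual for (key, qual) in row
           if key == "学习_R" and qual.startswith("collocations:")]
```

This is why a row with 119,942 collocation columns and a row with none can coexist in the same table without wasted space.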
Two table structures were designed: one for storing the filtered phrases (i.e. the CollocationCandidates table) and the other for storing the extracted collocations (i.e. the index tables e.g. FreqCollocations). Table 2 and Table 3 illustrate their column-based structures with examples from our dataset.
The CollocationCandidates table consists of two column families, namely phrases and freq. The phrases family records the words in a phrase and their associated POS tags if available, resulting in two (i.e. for unigrams) to ten (i.e. for 5-grams) columns. The freq family has only one column, which records the number of occurrences of the phrase as a whole. We took the Unicode encoding of the phrase as the row key.
The three index tables share the same structure, as shown in Table 3. Each word corresponds to two row keys, i.e. two data rows, for phrases that start and end with the word respectively. For example, in Table 3, the keyword 学习 corresponds to the two rows 学习_L and 学习_R, for phrases that end (i.e. left collocations) and start (i.e. right collocations) with that key word.
Each index table also has two column families: collocations and freq. The collocations family stores bigrams that either start or end with the word. Phrases are organized with three columns as a set: the following (or preceding) term, the collocation score of the phrase (i.e. its frequency, mutual information value or χ²-test value), and its syntactic pattern if available (e.g. n+v). Phrases are sorted in descending order of their scores. The last column records the total number of occurrences of the indexing key word in the specified collocation direction. The number of columns in the collocations family is huge, and it also varies; in our experiment it varied from zero to 119,942. As explained above, HBase's sparse storage format can handle such data efficiently.
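The left/right row-key scheme of Table 3 can be sketched as follows. This is a simplified in-memory model with hypothetical scores, not the actual HBase schema code:

```python
from collections import defaultdict

def build_index(bigrams):
    """Build left/right index rows from scored bigrams.

    For a bigram (w1, w2, score), w1 receives a right-collocation row
    ('w1_R') and w2 a left-collocation row ('w2_L'), mirroring the
    row-key scheme of the index tables.
    """
    rows = defaultdict(list)
    for w1, w2, score in bigrams:
        rows[f"{w1}_R"].append((w2, score))
        rows[f"{w2}_L"].append((w1, score))
    # Within each row, columns are kept in descending score order.
    for key in rows:
        rows[key].sort(key=lambda col: col[1], reverse=True)
    return dict(rows)

rows = build_index([("学习", "方法", 812.0), ("认真", "学习", 95.5)])
```

Here 学习 ends up with both a 学习_R row (it starts the bigram 学习 方法) and a 学习_L row (it ends the bigram 认真 学习), just as in the Table 3 example.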
Our dataset is the Chinese Web 5-gram corpus, which contains Chinese word n-grams and their observed frequency counts generated from over 800 million tokens of Web text, resulting in over 30 GB of gzip files and 39 billion n-grams [9]. The length of the n-grams ranges from unigrams (single words) to 5-grams. The corpus is huge, and efficient exploration of such a dataset is challenging.
Considering that non-Chinese characters, such as numbers and English letters or words, rarely occur in real Chinese collocations, phrases that contain these characters were removed from the corpus. We also filtered out phrases that occur fewer than 5,000 times in the 800 million tokens.
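A minimal sketch of this filtering rule, assuming phrases are whitespace-separated words and approximating "Chinese characters" with the CJK Unified Ideographs range (the real filter may use a broader character set):

```python
import re

# CJK Unified Ideographs block; an approximation of "Chinese characters".
CJK = re.compile(r"^[\u4e00-\u9fff]+$")

def keep(phrase, count, min_count=5000):
    """Keep a candidate only if every word is all-Chinese and the
    phrase occurs at least min_count times."""
    return count >= min_count and all(CJK.match(w) for w in phrase.split())

candidates = [("天气 预报", 120000),  # all Chinese, frequent: kept
              ("iPhone 手机", 80000),  # contains Latin letters: dropped
              ("天气 很好", 900)]      # below the frequency threshold: dropped
kept = [phrase for phrase, count in candidates if keep(phrase, count)]
```

Applying both conditions in one pass keeps the filter a single cheap scan over the corpus, which matters at this data scale.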
We deployed a Hadoop cloud to perform the extraction processes and to support the DACE system. The cloud consists of five computing nodes: two master nodes and three core nodes. Each node was equipped with a 64-bit 16-core CPU, 32 GB of RAM, Huawei's Euler OS 2.2 (an adapted OS based on CentOS), and 40 GB and 2 TB of SAS disk space for system and data files respectively.
Distributed computing services installed on each node included JDK 1.6, Hadoop 2.7.2 and HBase 1.0.2, and ZooKeeper 3.5.1. The topology structure of the cluster is shown in Figure 3. Node Master1 is the master node and major access point of the cloud.
As explained previously, the DACE system mainly consists of two phases: filtering and indexing, as shown in Figure 1. Table 4 compares the data scale and the time cost of each stage.
Output of the filtering process, the Candidate Collocations table in Figure 1, had 20 million rows. Each row corresponds to a phrase. When broken down into words in the indexing stage, the phrases generated 14 million distinct words; that is, the index tables had 14 million rows. The number of columns in each row varied from zero to 119,942, resulting in over 40 million distinct expressions. Yet only a small portion of these expressions were valid collocations.
As Table 4 shows, the MI and CHI measures took more time than the frequency measure. This is because they involve three frequency counts: the number of occurrences of a phrase and of each of its component words. In practice, we performed a separate step to compute these intermediate statistics. This step took about 170 minutes, and since its result was shared by the two measures, computing the actual MI and CHI scores took only 25 and 20 minutes respectively.
6.3. Comparing Collocation Measures

Table 5 lists the top collocations extracted by the different measures. Compared to a previous study on the English language [24], similar behavior of the three measures was observed. In general, the χ²-test and mutual information tend to favor expressions that have low frequency and a repetitive pattern. The frequency measure, despite its simplicity, finds meaningful and effective collocations. Therefore, Table 5 lists the top 40 collocations for the frequency measure, and only the top 10 for the other two measures.
We also implemented a web-based information retrieval system to provide efficient exploration of the extracted collocations, as shown in Figure 4. The search interface provides options to choose the direction of a collocation, the measure, and the number of hits returned. Searching and ranking were also based on HBase queries. We tested ten queries, and the average retrieval time was 258 ms.
Collocations are important yet difficult to capture. The abundant text on the Web provides a natural, up-to-date and valuable resource for the automatic extraction of collocations. In this paper we designed and implemented the DACE system for automatic collocation extraction and exploration. Empirical experimental results showed that DACE is efficient and the extracted collocations are effective. The search interface of the DACE system is quite simple at the moment, and we plan to improve it with more flexible and user-friendly search options in future.
[1] Piao, S. S. L., Sun, G., Rayson, P., Yuan, Q. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), 2006, pp. 17-24.
[2] Zhang, J., Gao, J., Zhou, M. Extraction of Chinese compound words: an experimental study on a very large corpus. In Proceedings of the 2nd Chinese Language Processing Workshop, ACL 2000, 2000.
[3] McKeown, K. R., Radev, D. R. Collocations. In A Handbook of Natural Language Processing, R. Dale, H. Moisl, and H. Somers, Eds. Marcel Dekker, New York, 2000, pp. 507-523.
[4] Smadja, F., McKeown, K. Automatically extracting and representing collocations for language generation. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, 1990, pp. 252-259.
[5] Liu, Z. Y., Wang, H., Wu, H., Liu, T., Li, S. Reordering with source language collocations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 1035-1044.
[6] Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 1995, pp. 189-196.
[7] Xu, R. F., Xu, J., Kit, C. HITSZ_CITYU: Combine collocation, context words and neighboring sentence sentiment in sentiment adjectives disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pp. 448-451.
[8] Wu, S. Q., Franken, M., Witten, I. H. Supporting collocation learning with a digital library. Computer Assisted Language Learning, 2010, 23(1), pp. 87-110.
[9] Liu, F., Yang, M., Lin, D. Chinese Web 5-gram Version 1 LDC2010T06. Web Download. Philadelphia: Linguistic Data Consortium, 2010. https://catalog.ldc.upenn.edu/LDC2010T06.
[10] Choueka, Y., Klein, S. T., Neuwitz, E. Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal of the Association for Literary and Linguistic Computing, 1983, 4, pp. 34-38.
[11] Church, K., Hanks, P. Word association norms, mutual information, and lexicography. Computational Linguistics, 1990, 16, pp. 22-29.
[12] Dunning, T. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 1993, 19, pp. 61-74.
[13] Manning, C., Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[14] Smadja, F. Retrieving collocations from text: Xtract. Computational Linguistics, 1993, 19, pp. 143-177.
[15] Pecina, P. An Extensive Empirical Study of Collocation Extraction Methods. In Proceedings of the ACL Student Research Workshop, 2005, pp. 13-18.
[16] Liu, Z. Y., Wang, H., Wu, H., Li, S. Two-word collocation extraction using monolingual word alignment method. ACM Transactions on Intelligent Systems and Technology, 2011, 3(1), 16.
[17] Seretan, V., Wehrli, E. Multilingual collocation extraction: issues and solutions. In Proceedings of the Workshop on Multilingual Language Resources and Interoperability, 2006, pp. 40-49.
[18] Sun, M. S., Huang, C. N., Fang, J. A Quantitative Analysis of Chinese Collocation. Studies of the Chinese Language, 1997(1), pp. 29-38. (in Chinese)
[19] Lu, Q., Li, Y., Xu, R. Improving Xtract for Chinese collocation extraction. In Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering, 2003, pp. 333-338.
[20] Qu, W. G., Chen, X. H., Ji, G. L. Automatic Extraction of Word Collocation Based on Frame. Computer Engineering, 2004, 30(23), pp. 22-24. (in Chinese)
[21] Li, W., Lu, Q., Xu, R. Similarity based Chinese synonym collocation extraction. International Journal of Computational Linguistics and Chinese Language Processing, 2005, 10, pp. 123-144.
[22] Wang, S. G., Yang, J. L., Zhang, W. Chinese Verbs and Verbs Matching Based on Maximum Entropy Model and Voting Method. Journal of Chinese Computer Systems, 2007, 28(7), pp. 1306-1309. (in Chinese)
[23] Xu, R. F., Lu, Q., Wong, K. F., Li, W. J. Building a Chinese collocation bank. International Journal of Computer Processing of Languages, 2009, 22(1), pp. 21-47.
[24] Wu, S. Q. Supporting Collocation Learning. PhD thesis, 2010.
This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit
https://creativecommons.org/licenses/by/4.0/