DACE: Extracting and Exploring Large Scale Chinese Web Collocations with Distributed Computing

Words that often occur together form collocations. Collocations are important language components and have been used to facilitate many natural language processing tasks, including natural language generation, machine translation, information retrieval, sentiment analysis and language learning. Meanwhile, collocations are difficult to capture, especially for second language learners, and new collocations develop quickly nowadays, especially with the help of the affluent user-generated content on the Web. In this paper we present an automatic collocation extraction and exploration system for the Chinese language: the DACE system. We identify collocations using three measures: frequency, mutual information and the χ²-test. The system was built upon distributed computing frameworks so as to efficiently process large-scale corpora. Empirical evaluation and analysis of the system showed the effectiveness of the collocation measures and the efficiency of the distributed computing processes.


Introduction
Collocations, also known as multi-word expressions [1] or compound words [2], are important language units. From the computational point of view, a collocation is a set of words that occur together more often than by chance [3]. These include, for example, compound nouns like train station, phrasal verbs like follow up, proper nouns like New Zealand, and common syntactic patterns like adjective+noun, as in heavy rain. Collocations have been used widely in many natural language processing tasks, for example to help natural language generation [4], improve machine translation quality [1], impact search result ranking [5], disambiguate word senses [6] and assist sentiment analysis [7] and second language learning [8].
Though important, collocations are difficult to capture and learn, especially for second language learners, simply because there are so many of them and their forms are extremely diverse. Previous researchers proposed to exploit the enormous web resources to discover collocations for language learning [8]. This motivates us to tap into the Chinese web 5-gram corpus [9] for identifying Chinese collocations. The chosen corpus consists of over 39 billion Chinese phrases, each associated with its number of occurrences across over 800 million web pages. Even after filtering out phrases that occur fewer than five thousand times, the remaining phrases still generated 40 million collocation candidates. The sheer amount of data presents a new challenge to a conventional standalone extraction process. Therefore, we employed the Hadoop distributed computing platform, parallelizing the extraction process with the MapReduce framework and storing and retrieving the extracted collocations with the distributed database HBase. We name the system DACE, for distributed automatic collocation extraction. Using distributed computing, DACE completes the extraction process within four hours for all 40 million collocation candidates, and retrieves any collocation within a second.
The rest of this paper is organized as follows. Next we review related work on automatic collocation extraction with a specific focus on the Chinese language domain. Section 3 explains DACE's system architecture, and Sections 4 and 5 describe the distributed extraction and indexing phases respectively. Section 6 presents the experimental setup and discusses empirical results. Section 7 concludes the study.

Related Work
Most automatic collocation extraction methods rely on a measure that quantifies the association strength between words in a phrase, so as to determine whether the co-occurrence of two or more words is indeed statistically more frequent than by chance. In general, these measures can be categorized into three types: frequency, information-theoretic measures and hypothesis test scores. Early studies used co-occurrence frequency to identify collocations [10]. Later, mutual information was commonly employed [11]. By comparing the observed number of co-occurrences with the expected co-occurrence frequency assuming that the component words were independent, mutual information recognizes phrases with a co-occurrence probability greater than the expected value as collocations. Similarly, hypothesis tests discover associated events (i.e. words) by comparing against the null hypothesis (i.e. assuming independent events). Such tests include the log-likelihood ratio [12], the t-test [13] and the χ²-test [8]. In addition, position, span and syntactic rules can also be considered in the association measure [14].
More recent studies have proposed several new ways for collocation extraction. For example, linear regression was applied with features covering 84 collocation rules and three linguistic patterns, to quantify the association strength between words in valid collocations [15]. The bilingual word alignment method commonly used in the machine translation field was adapted to the monolingual scenario to extract collocations that co-occur in similar contexts [16]. Multilingual context and multilingual corpora have also received increasing attention in recent studies on automatic collocation extraction [17].
For the Chinese language, previous studies have investigated the distinct properties of Chinese collocations [19] and methods for extracting them [1,2,16,19,20,21,22,23]. Table 1 summarizes the recent studies from three aspects: the corpus in use and its scale in terms of number of characters, the association strength measure employed, and the target application if there is one. Table 1 shows that news corpora were by far the major resource for most automatic collocation extraction systems. Recently in the English language domain, large-scale web resources and multilingual corpora have also been used [8,17]. They provide unprecedentedly affluent resources for the task. Such resources in the Chinese language are yet to be exploited.

System Architecture
The proposed DACE system adopts a pipeline architecture, as shown in Figure 1. Given a corpus consisting of phrases and their numbers of occurrences, the corpus is first filtered, for example to remove phrases that contain non-Chinese characters and thus rarely form valid Chinese collocations. Then the phrases can be subjected to syntactic analysis: for example, a part-of-speech (POS) tagger can be employed, and phrases can then be filtered based on their syntactic patterns. This step is optional, though.
Filtered phrases are then stored into a distributed database table (see Table 2), upon which the association strength of each phrase as a collocation is calculated using different measures. The calculation step is also distributed. The resulting collocations and their associated scores are stored into the index tables.
During the exploration stage, i.e. retrieval of the extracted collocations, the user usually submits a keyword. It is then matched against the three index tables. Collocations have directions: a collocation can either start or end with the keyword, yielding right and left collocations respectively. DACE provides options to choose the collocation direction, the collocation measure, and the number of hits returned. Finally, the matched collocations are ranked in descending order of their associated scores and returned.

Distributed Extraction of Collocations
Implementing the DACE system in a distributed environment allows us to perform comparative studies much more efficiently. We selected and implemented three measures for quantifying the salience of a phrase as a collocation: frequency, mutual information and the χ²-test. As for the distributed computing platform, we chose the well-known off-the-shelf framework Hadoop. This section first explains the collocation measures (i.e. mutual information and the χ²-test) and then describes the parallelized extraction process.

Mutual Information
Given two words w1 and w2, their mutual information is calculated as follows:

MI(w1, w2) = log ( P(w1, w2) / (P(w1) P(w2)) ), with P(w) = C(w) / N,

where P(w) is the probability of word w, C(w) is w's number of occurrences in the corpus, and N is the total sum of occurrences of all words in the corpus. Mutual information in this form is also known as pointwise mutual information (PMI).
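As a minimal sketch, PMI can be computed directly from the three raw counts (the counts in the example are made up for illustration):

```python
import math

def pmi(c_w1w2, c_w1, c_w2, n):
    """Pointwise mutual information from raw corpus counts.

    c_w1w2: co-occurrence count C(w1, w2)
    c_w1, c_w2: individual word counts C(w1), C(w2)
    n: total sum of occurrences of all words in the corpus
    """
    p_joint = c_w1w2 / n               # P(w1, w2)
    p_w1, p_w2 = c_w1 / n, c_w2 / n    # P(w1), P(w2)
    return math.log2(p_joint / (p_w1 * p_w2))

# Words co-occurring far more often than chance score high:
print(pmi(1000, 2000, 3000, 10**6))    # ≈ 7.38
# Under independence (expected co-occurrences = 2000*3000/10**6 = 6), PMI ≈ 0:
print(pmi(6, 2000, 3000, 10**6))
```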

χ²-test
The χ²-test compares the observed counts in the 2×2 contingency table of w1 and w2 against the counts expected under independence:

χ² = N (O11 O22 − O12 O21)² / ( (O11 + O12)(O11 + O21)(O12 + O22)(O21 + O22) ),

where O11 = C(w1, w2), O12 = C(w2) − C(w1, w2), O21 = C(w1) − C(w1, w2), O22 = N − O11 − O12 − O21, and N is again the total sum of occurrences of all words in the corpus. In words, O11 is the number of times w1 and w2 co-occur, O12 is the number of times w2 occurs without w1, O21 is the number of times w1 occurs without w2, and O22 is the number of times neither occurs.
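A direct implementation from the same raw counts (Pearson's χ² on the 2×2 table; the example values are illustrative):

```python
def chi_square(c_w1w2, c_w1, c_w2, n):
    """Pearson chi-square score of (w1, w2) from the 2x2 contingency table."""
    o11 = c_w1w2                   # w1 and w2 co-occur
    o12 = c_w2 - c_w1w2            # w2 occurs without w1
    o21 = c_w1 - c_w1w2            # w1 occurs without w2
    o22 = n - o11 - o12 - o21      # neither occurs
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

# Strongly associated words get a large score ...
print(chi_square(1000, 2000, 3000, 10**6))
# ... while independent words score zero (6 is exactly the expected count):
print(chi_square(6, 2000, 3000, 10**6))   # 0.0
```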

Parallelized Extraction
We used the Hadoop MapReduce framework to distribute the entire extraction process to a cluster of computing nodes. Each distributed calculation step in Figure 1 corresponds to one Mapper class and one Reducer class in the MapReduce programming framework. In general, the Mapper class processes a phrase and outputs a <key, value> pair to represent it, for example <collocation, score>, while the Reducer class mainly sorts the collocations by their scores and stores them into the backend database. It is worth noting that when calculating mutual information (MI) and χ²-test (CHI) scores, since both require the counts C(w1), C(w2) and C(w1, w2), an intermediate step was designed to collect these statistics (see Section 6.2).
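The division of labor between the Mapper and Reducer classes can be sketched in plain Python as a single-machine stand-in (the actual DACE jobs run as Hadoop MapReduce classes; the sample records and the use of raw frequency as the score are illustrative):

```python
from collections import defaultdict

def mapper(phrase, count):
    """Emit <collocation, score> pairs; here the score is the raw frequency."""
    words = phrase.split()
    if len(words) == 2:                       # keep bigram candidates only
        yield (words[0], words[1]), count

def reducer(pairs):
    """Aggregate the scores per collocation and sort in descending order."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items(), key=lambda kv: -kv[1])

records = [("天气 很好", 120), ("天气 预报", 300), ("天气 预报", 50)]
pairs = [kv for phrase, c in records for kv in mapper(phrase, c)]
print(reducer(pairs))   # [(('天气', '预报'), 350), (('天气', '很好'), 120)]
```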

Distributed Indexing of Collocations
Efficient indexes are fundamental for responsive retrieval and exploration, especially given a large amount of data. In accordance with the extraction process, we used HBase, the distributed data storage framework associated with Hadoop, as the backend database system.
In contrast to the relational data structure in traditional SQL databases, HBase adopts a column-based data structure. Tables consist of column families, and a column family consists of columns. Both the number and the data types of columns in one column family can vary on the fly. Such a dynamic structure is due to HBase's sparse key-value format for physical data storage, as shown in Figure 2. This structure is perfect for storing the collocation data: the number of collocated words varies dramatically across keywords. Two table structures were designed: one for storing the filtered phrases (i.e. the CollocationCandidates table) and the other for storing the extracted collocations (i.e. the index tables, e.g. FreqCollocations). Table 2 and Table 3 illustrate their column-based structures with examples from our dataset.
The CollocationCandidates table consists of two column families, namely phrases and freq. The phrases family records the words in a phrase and their associated POS tags if available, resulting in two (i.e. unigram) to ten (i.e. 5-gram) columns. The freq family has only one column, which records the number of occurrences of the phrase as a whole. We took the Unicode encoding of the phrase as the row key. Translation of Table 2: egg (term1); is good for (term2); health (term3). The three index tables share the same structure, as shown in Table 3. Each word corresponds to two row keys, i.e. two data rows, for phrases starting and ending with the word respectively. For example, in Table 3, the keyword 学习 ("to learn") corresponds to two rows, 学习_L and 学习_R, for phrases ending (i.e. left collocations) and starting (i.e. right collocations) with that keyword.
It also has two column families: collocations and freq. The collocations family stores bigrams that either start or end with the word. Phrases are organized with three columns as a set: the following (or preceding) term, the collocation score of the phrase (i.e. its frequency, mutual information value or χ²-test value) and its syntactic pattern if available (e.g. n+v). Phrases are sorted in descending order of their scores. The last column records the total number of occurrences of the indexing keyword in the specified collocation direction. The number of columns in the collocations family is huge, and it also varies; for example, in our experiment it varied from zero to 119,942. As explained above, HBase's sparse storage format can handle such data efficiently.
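The row-key convention and the sparse per-row column layout of Table 3 can be mimicked with plain dictionaries (a hypothetical stand-in for the real HBase schema; the column naming term0/score0/pattern0 and the sample scores are illustrative):

```python
def index_row_key(word, direction):
    """Build the Table 3 row key: '<word>_L' for left collocations
    (phrases ending with the word), '<word>_R' for right collocations."""
    assert direction in ("L", "R")
    return f"{word}_{direction}"

def make_index_row(scored_collocations, total_freq):
    """Lay out one index row: three columns per collocated term
    (term, score, syntactic pattern), sorted by descending score,
    plus one freq column for the keyword's total occurrences."""
    row = {}
    ranked = sorted(scored_collocations, key=lambda t: -t[1])
    for i, (term, score, pattern) in enumerate(ranked):
        row[f"collocations:term{i}"] = term
        row[f"collocations:score{i}"] = score
        row[f"collocations:pattern{i}"] = pattern
    row["freq:total"] = total_freq
    return row

key = index_row_key("学习", "R")    # '学习_R': right collocations of 学习
row = make_index_row([("机会", 9.1, "v+n"), ("方法", 12.4, "v+n")], 500)
print(row["collocations:term0"])    # 方法 (highest-scoring term comes first)
```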

Dataset and Preprocesses
Our dataset is the Chinese web 5-gram corpus, which contains Chinese word n-grams and their observed frequency counts generated from over 800 million tokens of Web text, resulting in over 30 GB of gzip-compressed files and 39 billion n-grams [9]. The length of the n-grams ranges from unigrams (single words) to 5-grams. The corpus is huge, and efficient exploration of such a dataset is challenging.
Considering that non-Chinese characters, such as numbers and English letters or words, rarely occur in real Chinese collocations, phrases that contain these characters were removed from the corpus. We also filtered out phrases that occur fewer than 5,000 times in the 800 million tokens.
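A minimal sketch of this filtering step (assumptions: phrases are word-segmented and space-separated as in the n-gram corpus, and "Chinese characters" means the CJK Unified Ideographs block; the system's actual filter may differ in detail):

```python
import re

CJK_ONLY = re.compile(r"^[\u4e00-\u9fff]+$")   # CJK Unified Ideographs only
MIN_FREQ = 5000                                # occurrence threshold

def keep(phrase, freq):
    """True if the phrase survives filtering: frequent enough and all-Chinese."""
    words = phrase.split()
    return freq >= MIN_FREQ and all(CJK_ONLY.match(w) for w in words)

print(keep("天气 预报", 12000))    # True
print(keep("iPhone 手机", 12000))  # False: contains Latin letters
print(keep("天气 预报", 300))      # False: below the frequency threshold
```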

Distributed Computing Platform
We deployed a Hadoop cloud to perform the extraction processes and to support the DACE system. The cloud consists of five computing nodes: two master nodes and three core nodes. Each node was equipped with a 64-bit 16-core CPU, 32G RAM, Huawei's Euler OS 2.2 (an adapted OS based on CentOS), and 40G and 2T of SAS disk space for system and data files respectively.
Distributed computing services installed on each node included JDK 1.6, Hadoop 2.7.2, HBase 1.0.2 and ZooKeeper 3.5.1. The topology of the cluster is shown in Figure 3. Node Master1 is the master node and the major access point of the cloud.

Stagewise Analysis
As explained previously, the DACE system mainly consists of two phases: filtering and indexing, as shown in Figure 1. Table 4 compares the data scale and the time cost of each stage. The output of the filtering process, the Candidate Collocations table in Figure 1, had 20 million rows, each corresponding to a phrase. When broken down into words in the indexing stage, the phrases generated 14 million distinct words; that is, the index tables had 14 million rows. The number of columns in each row varied from zero to 119,942, resulting in over 40 million distinct expressions. Yet only a small portion of these expressions were valid collocations.
As Table 4 shows, the MI and CHI measures took more time than the frequency measure. This is because they involve three frequency counts: the number of occurrences of a phrase and of each of its component words. In practice, we performed a separate step to compute these intermediate statistics. This step took about 170 minutes, and since its result was shared by the two measures, computing the actual MI and CHI scores took only 25 and 20 minutes respectively. Table 5 lists the top collocations extracted by the different measures. Compared to a previous study on the English language [24], similar behavior of the three measures was observed. In general, the χ²-test and mutual information tend to favor low-frequency expressions with repetitive patterns. The frequency measure, despite its simplicity, finds meaningful and effective collocations. Therefore, Table 5 lists the top 40 collocations for the frequency measure, and only 10 for each of the other two measures.

Collocation Retrieval System
We also implemented a web-based information retrieval system to provide efficient exploration of the extracted collocations, as shown in Figure 4. The search interface provides options to choose the direction of a collocation, the measure, and the number of hits returned. Searching and ranking were also based on HBase queries. We tested ten queries, and the average retrieval time was 258 ms.
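With the "keyword_L/_R" row keys of Section 5, a query reduces to a single row lookup followed by a top-n cut; a dict-based sketch (not the actual HBase client code; the sample scores are made up):

```python
def retrieve(index_table, keyword, direction="R", top_n=10):
    """Return the top-n collocations for a keyword and direction.

    index_table maps row keys such as '学习_R' to lists of
    (collocated term, score) pairs already sorted by descending score.
    """
    row_key = f"{keyword}_{direction}"
    return index_table.get(row_key, [])[:top_n]

table = {"学习_R": [("方法", 12.4), ("机会", 9.1)]}
print(retrieve(table, "学习", "R", top_n=1))   # [('方法', 12.4)]
print(retrieve(table, "学习", "L"))            # [] (no left collocations indexed)
```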

Conclusions
Collocations are important yet difficult to capture. The affluent text on the Web provides natural, up-to-date and valuable resources for the automatic extraction of collocations. In this paper we designed and implemented the DACE system for automatic collocation extraction and exploration. Empirical experimental results showed that DACE is efficient and that the extracted collocations are effective. The search interface of the DACE system is quite simple at the moment, and we plan to improve it with more flexible and user-friendly search options in the future.