Automatic Domain-Relevant Collocation Extraction from Arabic Corpus

Rebhi S Baraka, Manar S Fayyad


We propose an approach for automatic extraction of domain-relevant collocations from an Arabic text corpus. It uses naïve linguistic and statistical methods to extract collocations and relates them to specific domains through a collocation ranking mechanism based on prevalence and tendency. To realize the approach, we use a corpus separated into ten domains. The approach first preprocesses this corpus and then extracts candidate collocations. Next, it ranks each candidate according to its distributional behavior within a domain and across the rest of the corpus. The candidates are then distributed over the domains according to their rank values, yielding a domain-term matrix. Finally, we evaluate the resulting collocation matrix by using it to classify the domain of a number of documents. The results are encouraging: in most domains, the achieved accuracy exceeded 90%.
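
The pipeline outlined above (candidate extraction, ranking by distribution inside a domain versus the rest of the corpus, assignment to domains, and classification) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the method as implemented in the paper: candidates are taken to be adjacent word pairs, the score is a smoothed ratio of relative frequencies standing in for the prevalence and tendency ranking (which the abstract does not define), and the function names, the `top_k` cutoff, and the toy tokens are hypothetical. Preprocessing of Arabic text is omitted, so the input is assumed to be already tokenized.

```python
from collections import Counter


def bigrams(tokens):
    """Adjacent word pairs, used here as candidate collocations (assumption)."""
    return list(zip(tokens, tokens[1:]))


def rank_candidates(corpus):
    """Score each candidate per domain.

    corpus: dict mapping domain name -> list of tokenized documents.
    The score is the candidate's relative frequency inside the domain divided
    by its add-one-smoothed relative frequency in the rest of the corpus.
    This ratio is an illustrative stand-in, not the paper's ranking formula.
    """
    counts = {
        domain: Counter(bg for doc in docs for bg in bigrams(doc))
        for domain, docs in corpus.items()
    }
    totals = {domain: sum(c.values()) for domain, c in counts.items()}
    scores = {}
    for domain, c in counts.items():
        rest_total = sum(t for other, t in totals.items() if other != domain)
        scores[domain] = {}
        for bg, n in c.items():
            inside = n / totals[domain]
            rest_n = sum(counts[other][bg] for other in counts if other != domain)
            outside = (rest_n + 1) / (rest_total + 1)  # smoothing avoids division by zero
            scores[domain][bg] = inside / outside
    return scores


def build_domain_matrix(scores, top_k=100):
    """Keep the top_k highest-scoring candidates for each domain."""
    return {
        domain: dict(sorted(s.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
        for domain, s in scores.items()
    }


def classify(tokens, matrix):
    """Assign a document to the domain whose collocations it matches most strongly."""
    doc_bigrams = bigrams(tokens)
    totals = {
        domain: sum(terms.get(bg, 0.0) for bg in doc_bigrams)
        for domain, terms in matrix.items()
    }
    return max(totals, key=totals.get)


# Toy corpus with two hypothetical domains (placeholder tokens, not real data).
corpus = {
    "sports": [["football", "match", "ended", "football", "match"],
               ["football", "match", "today"]],
    "economy": [["stock", "market", "fell", "stock", "market"],
                ["stock", "market", "today"]],
}
matrix = build_domain_matrix(rank_candidates(corpus))
label = classify(["the", "football", "match"], matrix)
```

In this sketch a document that matches no stored collocation falls back to the first domain in the matrix; a real system would need an explicit policy for such cases.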


