Arabic Retrieval Revisited: Morphological Hole Filling

Kareem Darwish, Ahmed M. Ali
Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar
kdarwish@qf.org.qa, amali@qf.org.qa

Abstract

Due to Arabic's morphological complexity, Arabic retrieval benefits greatly from morphological analysis, particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper, we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.

1. Introduction

Arabic exhibits rich morphological phenomena that complicate retrieval. Arabic nouns and verbs are typically derived from a set of 10,000 roots that are cast into stems using templates that may add infixes, double letters, or remove letters. Stems can accept the attachment of clitics in the form of prefixes or suffixes, such as prepositions, determiners, pronouns, etc. Orthographic rules can cause the addition, deletion, or substitution of letters during suffix and prefix attachment. Further, stems can be inflected to obtain plural forms via the addition of suffixes or through using a different stem form altogether, producing so-called broken (aka irregular) plurals. For retrieval, we would ideally like to match related stem forms regardless of inflected form or attached clitic.
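To make the matching problem concrete, the following is a minimal Python sketch of naive affix stripping over Buckwalter-style transliterations. The affix lists, the `light_stem` function, and the `min_len` threshold are assumptions chosen for illustration; this is not the stemmer evaluated in this paper. It shows that clitic stripping conflates ktAb (book), AlktAb (the book), and wktAbhm (and their book), while the broken plural ktb (books) keeps a different stem form and goes unmatched.

```python
# Hypothetical affix lists in Buckwalter-style transliteration (illustration only).
PREFIXES = ("wAl", "Al", "w", "b", "l")
SUFFIXES = ("hm", "At", "wn", "yn", "h", "y")


def light_stem(word, min_len=3):
    """Strip at most one listed prefix and one listed suffix,
    never shortening the word below min_len characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for form in ("ktAb", "AlktAb", "wktAbhm", "ktb"):
        print(form, "->", light_stem(form))
    # ktAb, AlktAb and wktAbhm all reduce to "ktAb";
    # the broken plural "ktb" stays "ktb" and so fails to match.
```

Affix stripping of this kind only removes material at the edges of a word, which is why a plural formed by changing the stem itself falls outside its reach.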
Tolerating some form of derivational morphology, where nouns are transformed into adjectives via the attachment of the suffix ي (y) (ex. مصر (mSr) → مصري (mSry)), is desirable, as they are semantically related. Matching all stems that are cast from the same root would introduce undesired ambiguity, because a single root can produce up to 1,000 stems. Two general approaches have been shown to improve Arabic retrieval. The first approach