Đang chuẩn bị nút TẢI XUỐNG, xin hãy chờ
Tải xuống
With the availability of large treebanks, retrieval techniques for highly structured data now become essential. In this contribution, we investigate the efficient retrieval of MT structures at the cost of a complex index--the Treegram Index. We illustrate our approach with the VENONA retrieval system, which handles the BH t (Biblia Hebraica transeripta) treebank comprising 508,650 phrase structure trees with maximum degree eight and maximum height 17, containing altogether 3.3 million Old-Hebrew words. 1 Multiway-tree retrieval based on treegrams To cope with this tree-retrieval problem, we generalize the well-known n-gram indexing technique for text databases: In place of substrings. | Proceedings of EACL 99 The Treegram Index An Efficient Technique for Retrieval in Linguistic Treebanks Hans Argenton and Anke Feldhaus Infineon Technologies DAT CIF Postbox 801709 D-81617 Miinchen hans. ar genton@infineon. com University of Tubingen SfS Kleine Wilhelmstr.113 D-72074 Tubingen feldhaus@sfs.nphil.uni-tuebingen.de Multiway trees MT henceforth are a common and well-understood data structure for describing hierarchical linguistic information. With the availability of large treebanks retrieval techniques for highly structured data now become essential. In this contribution we investigate the efficient retrieval of MT structures at the cost of a complex index the Treegram Index. We illustrate our approach with the Venona retrieval system which handles the BH Biblia Hebraica transcripta treebank comprising 508 650 phrase structure trees with maximum degree eight and maximum height 17 containing altogether 3.3 million Old-Hebrew words. 1 Multiway-tree retrieval based on treegrams The base entities of the tree-retrieval problem for positional MTs are labeled rooted MTs where children are distinguished by their position. Let s and t be two MTs t contains s written as 5 i if there exists an injective embedding such that 1 nodes are mapped to nodes with identical labels and 2 a root of a child with position i is mapped to a root of a child with the same position. Retrieval problem Let DB be a set of labeled positional MTs and let q be a query tree having the same label alphabet. The problem is to find efficiently all trees t G DB that contain q. To cope with this tree-retrieval problem we generalize the well-known n-gram indexing technique for text databases In place of substrings with fixed length we use subtrees with fixed maximal height treegrams. Let TG t h denote the set of all treegrams of height h contained in the MT t and let T DB denote the set of all database trees that contain the treegram g. Assume that g has the height h and that T DB can be .