Đang chuẩn bị nút TẢI XUỐNG, xin hãy chờ
Tải xuống
This paper introduces to the finite-state calculus a family of directed replace operators. In contrast to the simple replace expression, UPPER - LOWER,defined in K a r t t u n e n (1995), the new directed version, UPPER ©- LOWER, yields an unambiguous transducer if the lower language consists of a single string. It transduces the input string from left to right, making only the longest possible replacement at each point. A new type of replacement expression, UPPER @- PREFIX . . . SUFFIX, yields a transducer that inserts text around strings that are instances of UPPER. . | Directed Replacement Lauri Karttunen Rank Xerox Research Centre Grenoble 6 chemin de Maupertuis F-38240 MEYLAN FRANCE lauri.karttunenSxerox.fr Abstract This paper introduces to the finite-state calculus a family of directed replace operators. In contrast to the simple replace expression UPPER - LOWER defined in Karttunen 1995 the new directed version UPPER fl- LOWER yields an unambiguous transducer if the lower language consists of a single string. It transduces the input string from left to right making only the longest possible replacement at each point. A new type of replacement expression UPPER a- PREFIX . SUFFIX yields a transducer that inserts text around strings that are instances of UPPER. The symbol . denotes the matching part of the input which itself remains unchanged. PREFIX and SUFFIX are regular expressions describing the insertions. Expressions of the type UPPER 3- PREFIX . . . SUFFIX may be used to compose a deterministic parser for a local grammar in the sense of Gross 1989 . Other useful applications of directed replacement include tokenization and filtering of text streams. 1 Introduction Transducers compiled from simple replace expressions UPPER - LOWER Karttunen 1995 Kempe and Karttunen 1996 are generally nondeterministic in the sense that they may yield multiple results even if the lower language consists of a single string. For example let US consider the transducer in Figure 1 representing a b I b I b a I a b a - X.1 1The regular expression formalism and other notational conventions used in the paper are explained in the Appendix at the end. Figure 1 ab b ba aba- x. The four paths with aba on the upper side are 0 a 0 b x 2 a 0 0 a 0 b x 2 a 0 0 0 a x 1 b 0 2 a 0 and 0 a x 1 b 0 2 a 0 0 . The application of this transducer to the input aba produces four alternate results axa ax xa and x as shown in Figure 1 since there are four paths in the network that contain aba on the upper side with different strings on the lower side. This .