43 |
43 |
44 \title{{POSIX} {L}exing with {B}itcoded {D}erivatives} |
44 \title{{POSIX} {L}exing with {B}itcoded {D}erivatives} |
45 \titlerunning{POSIX Lexing with Bitcoded Derivatives} |
45 \titlerunning{POSIX Lexing with Bitcoded Derivatives} |
46 \author{Chengsong Tan}{King's College London}{chengsong.tan@kcl.ac.uk}{}{} |
46 \author{Chengsong Tan}{King's College London}{chengsong.tan@kcl.ac.uk}{}{} |
47 \author{Christian Urban}{King's College London}{christian.urban@kcl.ac.uk}{}{} |
47 \author{Christian Urban}{King's College London}{christian.urban@kcl.ac.uk}{}{} |
|
48 \authorrunning{C.~Tan and C.~Urban} |
|
49 \keywords{POSIX matching, Derivatives of Regular Expressions, Isabelle/HOL} |
|
50 \category{} |
|
51 \ccsdesc[100]{Design and analysis of algorithms} |
|
52 \ccsdesc[100]{Formal languages and automata theory} |
|
53 \Copyright{\mbox{}} |
|
54 \nolinenumbers |
48 |
55 |
49 |
56 |
50 \begin{document} |
57 \begin{document} |
51 \maketitle |
58 \maketitle |
52 |
59 |
53 \begin{abstract} |
60 \begin{abstract} |
54 Brzozowski introduced the notion of derivatives for regular |
61 Sulzmann and Lu described a lexing algorithm that calculates |
55 expressions. They can be used for a very simple regular expression |
62 Brzozowski derivatives using bitcodes annotated to regular |
56 matching algorithm. Sulzmann and Lu cleverly extended this algorithm |
63 expressions. Their algorithm generates POSIX values which encode |
57 in order to deal with POSIX matching, which is the underlying |
64 the information of \emph{how} a regular expression matches a |
58 disambiguation strategy for regular expressions needed in lexers. |
65 string---that is, which part of the string is matched by which part |
59 Their algorithm generates POSIX values which encode the information of |
66 of the regular expression. The purpose of the bitcodes is to generate POSIX values incrementally while |
60 \emph{how} a regular expression matches a string---that is, which part |
67 derivatives are calculated. They also help with designing |
61 of the string is matched by which part of the regular expression. In |
68 an `aggressive' simplification function that keeps the size of |
62 this paper we give our inductive definition of what a POSIX value is |
69 derivatives small. Without simplification derivatives can grow |
63 and show $(i)$ that such a value is unique (for given regular |
70 exponentially resulting in an extremely slow lexing algorithm. In this |
64 expression and string being matched) and $(ii)$ that Sulzmann and Lu's |
71 paper we describe a variant of Sulzmann and Lu's algorithm: Our |
65 algorithm always generates such a value (provided that the regular |
72 algorithm is a recursive functional program, whereas Sulzmann |
66 expression matches the string). We show that $(iii)$ our inductive |
73 and Lu's version involves a fixpoint construction. We \textit{(i)} |
67 definition of a POSIX value is equivalent to an alternative definition |
74 prove in Isabelle/HOL that our program is correct and generates |
68 by Okui and Suzuki which identifies POSIX values as least elements |
75 unique POSIX values; we also \textit{(ii)} establish a polynomial |
69 according to an ordering of values. We also prove the correctness of |
76 bound for the size of the derivatives. The size can be seen as a |
70 Sulzmann's bitcoded version of the POSIX matching algorithm and extend the |
77 proxy measure for the efficiency of the lexing algorithm: because of |
71 results to additional constructors for regular expressions. \smallskip |
78 the polynomial bound our algorithm does not suffer from |
72 |
79 the exponential blowup in earlier works. |
73 {\bf Keywords:} POSIX matching, Derivatives of Regular Expressions, |
80 |
74 Isabelle/HOL |
81 % Brzozowski introduced the notion of derivatives for regular |
|
82 % expressions. They can be used for a very simple regular expression |
|
83 % matching algorithm. Sulzmann and Lu cleverly extended this |
|
84 % algorithm in order to deal with POSIX matching, which is the |
|
85 % underlying disambiguation strategy for regular expressions needed |
|
86 % in lexers. Their algorithm generates POSIX values which encode |
|
87 % the information of \emph{how} a regular expression matches a |
|
88 % string---that is, which part of the string is matched by which |
|
89 % part of the regular expression. In this paper we give our |
|
90 % inductive definition of what a POSIX value is and show $(i)$ that |
|
91 % such a value is unique (for given regular expression and string |
|
92 % being matched) and $(ii)$ that Sulzmann and Lu's algorithm always |
|
93 % generates such a value (provided that the regular expression |
|
94 % matches the string). We show that $(iii)$ our inductive definition |
|
95 % of a POSIX value is equivalent to an alternative definition by |
|
96 % Okui and Suzuki which identifies POSIX values as least elements |
|
97 % according to an ordering of values. We also prove the correctness |
|
98 % of Sulzmann's bitcoded version of the POSIX matching algorithm and |
|
99 % extend the results to additional constructors for regular |
|
100 % expressions. \smallskip |
75 \end{abstract} |
101 \end{abstract} |
76 |
102 |
77 |
103 |
78 |
104 |
79 \input{session} |
105 \input{session} |