61 \title{{POSIX} {L}exing with {B}itcoded {D}erivatives} |
61 \title{{POSIX} {L}exing with {B}itcoded {D}erivatives} |
62 \titlerunning{POSIX Lexing with Bitcoded Derivatives} |
62 \titlerunning{POSIX Lexing with Bitcoded Derivatives} |
63 \author{Chengsong Tan}{King's College London}{chengsong.tan@kcl.ac.uk}{}{} |
63 \author{Chengsong Tan}{King's College London}{chengsong.tan@kcl.ac.uk}{}{} |
64 \author{Christian Urban}{King's College London}{christian.urban@kcl.ac.uk}{}{} |
64 \author{Christian Urban}{King's College London}{christian.urban@kcl.ac.uk}{}{} |
65 \authorrunning{C.~Tan and C.~Urban} |
65 \authorrunning{C.~Tan and C.~Urban} |
66 \keywords{POSIX matching, Derivatives of Regular Expressions, Isabelle/HOL} |
66 \keywords{POSIX matching and lexing, Derivatives of Regular Expressions, Isabelle/HOL} |
67 \category{} |
67 \category{} |
68 \ccsdesc[100]{Design and analysis of algorithms} |
68 \ccsdesc[100]{Design and analysis of algorithms} |
69 \ccsdesc[100]{Formal languages and automata theory} |
69 \ccsdesc[100]{Formal languages and automata theory} |
70 \Copyright{\mbox{}} |
70 \Copyright{\mbox{}} |
71 \nolinenumbers |
71 \nolinenumbers |
73 |
73 |
74 \begin{document} |
74 \begin{document} |
75 \maketitle |
75 \maketitle |
76 |
76 |
77 \begin{abstract} |
77 \begin{abstract} |
78 Sulzmann and Lu described a lexing algorithm that calculates |
78 Sulzmann and Lu describe a lexing algorithm that calculates |
79 Brzozowski derivatives using bitcodes annotated to regular |
79 Brzozowski derivatives using bitcodes annotated to regular |
80 expressions. Their algorithm generates POSIX values which encode |
80 expressions. Their algorithm generates POSIX values which encode |
81 the information of \emph{how} a regular expression matches a |
81 the information of \emph{how} a regular expression matches a |
82 string---that is, which part of the string is matched by which part |
82 string---that is, which part of the string is matched by which part |
83 of the regular expression. The purpose of the bitcodes is to generate POSIX values incrementally while |
83 of the regular expression. This information is needed in the |
|
84 context of lexing in order to extract and to classify tokens. |
|
85 The purpose of the bitcodes is to generate POSIX values incrementally while |
84 derivatives are calculated. They also help with designing |
86 derivatives are calculated. They also help with designing |
85 an ``aggressive'' simplification function that keeps the size of |
87 an ``aggressive'' simplification function that keeps the size of |
86 derivatives finite. Without simplification the size derivatives can grow |
88 derivatives finite. Without simplification the size of some derivatives can grow |
87 arbitrarily big resulting in an extremely slow lexing algorithm. In this |
89 arbitrarily big resulting in an extremely slow lexing algorithm. In this |
88 paper we describe a variant of Sulzmann and Lu's algorithm: Our |
90 paper we describe a variant of Sulzmann and Lu's algorithm: Our |
89 algorithm is a recursive functional program, whereas Sulzmann |
91 algorithm is a recursive functional program, whereas Sulzmann |
90 and Lu's version involves a fixpoint construction. We \textit{(i)} |
92 and Lu's version involves a fixpoint construction. We \textit{(i)} |
91 prove in Isabelle/HOL that our algorithm is correct and generates |
93 prove in Isabelle/HOL that our algorithm is correct and generates |