1 % Chapter Template |
1 % Chapter Template |
2 |
2 |
3 \chapter{Regular Expressions and Sulzmann and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title |
3 \chapter{Regular Expressions and POSIX Lexing} % Main chapter title |
4 |
4 |
5 \label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts |
5 \label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts |
6 %and notations we |
6 %and notations we |
7 %use for describing the lexing algorithm by Sulzmann and Lu, |
7 %use for describing the lexing algorithm by Sulzmann and Lu, |
8 %and then give the algorithm and its variant, and discuss |
8 %and then give the algorithm and its variant, and discuss |
9 %why more aggressive simplifications are needed. |
9 %why more aggressive simplifications are needed. |
10 |
10 |
11 |
11 |
12 \section{Preliminaries} |
12 \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions} |
13 |
13 We have a built-in datatype char, made up of characters, which we do not define |
|
14 on top of anything else. |
|
15 \begin{center} |
|
16 \begin{tabular}{lcl} |
|
17 $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\ |
|
18 \end{tabular} |
|
19 \end{center} |
|
20 They can form strings by lists: |
|
21 \begin{center} |
|
22 \begin{tabular}{lcl} |
|
23 $\textit{string}$ & $\dn$ & $[] | c :: cs$\\ |
|
24 & & $(c\; \text{has char type})$ |
|
25 \end{tabular} |
|
26 \end{center} |
|
27 And strings can be concatenated to form longer strings: |
|
28 \begin{center} |
|
29 \begin{tabular}{lcl} |
|
30 $s_1 @ s_2$ & $\rightarrow$ & $s'$\\ |
|
31 \end{tabular} |
|
32 \end{center} |
|
33 |
|
34 A set of strings can operate with another set of strings: |
|
35 \begin{center} |
|
36 \begin{tabular}{lcl} |
|
37 $A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\ |
|
38 \end{tabular} |
|
39 \end{center} |
|
40 We also call the above "language concatenation". |
|
41 The power of a language is defined recursively, using the |
|
42 concatenation operator $@$: |
|
43 \begin{center} |
|
44 \begin{tabular}{lcl} |
|
45 $A^0 $ & $\dn$ & $\{ [] \}$\\ |
|
46 $A^{n+1}$ & $\dn$ & $A^n @ A$ |
|
47 \end{tabular} |
|
48 \end{center} |
|
49 The infinite set of all the power of a language unioned together |
|
50 is defined using the power operator, also in recursive function: |
|
51 \begin{center} |
|
52 \begin{tabular}{lcl} |
|
53 $A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\ |
|
54 \end{tabular} |
|
55 \end{center} |
|
56 We also define an operation of chopping off a character from all the strings |
|
57 in a set: |
|
58 \begin{center} |
|
59 \begin{tabular}{lcl} |
|
60 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\ |
|
61 \end{tabular} |
|
62 \end{center} |
|
63 With the above definitions, it becomes natural to define regular expressions |
|
64 as a concise way for expressing the languages. |
|
65 \section{Regular Expressions and Their Language Interpretation} |
14 Suppose we have an alphabet $\Sigma$, the strings whose characters |
66 Suppose we have an alphabet $\Sigma$, the strings whose characters |
15 are from $\Sigma$ |
67 are from $\Sigma$ |
16 can be expressed as $\Sigma^*$. |
68 can be expressed as $\Sigma^*$. |
17 |
69 |
18 We use patterns to define a set of strings concisely. Regular expressions |
70 We use patterns to define a set of strings concisely. Regular expressions |
47 \begin{center} |
99 \begin{center} |
48 $\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$ |
100 $\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$ |
49 \end{center} |
101 \end{center} |
50 Mathematically, it can be expressed as the |
102 Mathematically, it can be expressed as the |
51 |
103 |
52 If the $\textit{StringSet}$ happen to have some structure, for example, |
104 If the $\textit{A}$ happen to have some structure, for example, |
53 if it is regular, then we have that it |
105 if it is regular, then we have that it |
54 |
106 |
55 % Derivatives of a |
107 % Derivatives of a |
56 %regular expression, written $r \backslash c$, give a simple solution |
108 %regular expression, written $r \backslash c$, give a simple solution |
57 %to the problem of matching a string $s$ with a regular |
109 %to the problem of matching a string $s$ with a regular |
58 %expression $r$: if the derivative of $r$ w.r.t.\ (in |
110 %expression $r$: if the derivative of $r$ w.r.t.\ (in |
59 %succession) all the characters of the string matches the empty string, |
111 %succession) all the characters of the string matches the empty string, |
60 %then $r$ matches $s$ (and {\em vice versa}). |
112 %then $r$ matches $s$ (and {\em vice versa}). |
61 |
113 |
|
114 |
|
115 \section{Brzozowski Derivatives of Regular Expressions} |
62 The the derivative of regular expression, denoted as |
116 The the derivative of regular expression, denoted as |
63 $r \backslash c$, is a function that takes parameters |
117 $r \backslash c$, is a function that takes parameters |
64 $r$ and $c$, and returns another regular expression $r'$, |
118 $r$ and $c$, and returns another regular expression $r'$, |
65 which is computed by the following recursive function: |
119 which is computed by the following recursive function: |
66 |
120 |
76 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
130 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
77 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
131 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
78 \end{tabular} |
132 \end{tabular} |
79 \end{center} |
133 \end{center} |
80 \noindent |
134 \noindent |
81 \noindent |
135 The function derivative, written $r\backslash c$, |
82 |
|
83 The $\nullable$ function tests whether the empty string $""$ |
|
84 is in the language of $r$: |
|
85 |
|
86 |
|
87 \begin{center} |
|
88 \begin{tabular}{lcl} |
|
89 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
90 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
91 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
92 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
93 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
94 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
95 \end{tabular} |
|
96 \end{center} |
|
97 \noindent |
|
98 The empty set does not contain any string and |
|
99 therefore not the empty string, the empty string |
|
100 regular expression contains the empty string |
|
101 by definition, the character regular expression |
|
102 is the singleton that contains character only, |
|
103 and therefore does not contain the empty string, |
|
104 the alternative regular expression(or "or" expression) |
|
105 might have one of its children regular expressions |
|
106 being nullable and any one of its children being nullable |
|
107 would suffice. The sequence regular expression |
|
108 would require both children to have the empty string |
|
109 to compose an empty string and the Kleene star |
|
110 operation naturally introduced the empty string. |
|
111 |
|
112 We can give the meaning of regular expressions derivatives |
|
113 by language interpretation: |
|
114 |
|
115 |
|
116 |
|
117 |
|
118 \begin{center} |
|
119 \begin{tabular}{lcl} |
|
120 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
121 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
122 $d \backslash c$ & $\dn$ & |
|
123 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
124 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
125 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\ |
|
126 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
127 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
128 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
129 \end{tabular} |
|
130 \end{center} |
|
131 \noindent |
|
132 \noindent |
|
133 The function derivative, written $\backslash c$, |
|
134 defines how a regular expression evolves into |
136 defines how a regular expression evolves into |
135 a new regular expression after all the string it contains |
137 a new regular expression after all the string it contains |
136 is chopped off a certain head character $c$. |
138 is chopped off a certain head character $c$. |
137 The most involved cases are the sequence |
139 The most involved cases are the sequence |
138 and star case. |
140 and star case. |
144 regular expression and attaches the star regular expression |
146 regular expression and attaches the star regular expression |
145 to the sequence's second element to make sure a copy is retained |
147 to the sequence's second element to make sure a copy is retained |
146 for possible more iterations in later phases of lexing. |
148 for possible more iterations in later phases of lexing. |
147 |
149 |
148 |
150 |
|
151 The $\nullable$ function tests whether the empty string $""$ |
|
152 is in the language of $r$: |
|
153 |
|
154 |
|
155 \begin{center} |
|
156 \begin{tabular}{lcl} |
|
157 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
158 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
159 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
160 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
161 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
162 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
163 \end{tabular} |
|
164 \end{center} |
|
165 \noindent |
|
166 The empty set does not contain any string and |
|
167 therefore not the empty string, the empty string |
|
168 regular expression contains the empty string |
|
169 by definition, the character regular expression |
|
170 is the singleton that contains character only, |
|
171 and therefore does not contain the empty string, |
|
172 the alternative regular expression (or "or" expression) |
|
173 might have one of its children regular expressions |
|
174 being nullable and any one of its children being nullable |
|
175 would suffice. The sequence regular expression |
|
176 would require both children to have the empty string |
|
177 to compose an empty string and the Kleene star |
|
178 operation naturally introduced the empty string. |
|
179 |
|
180 We have the following property where the derivative on regular |
|
181 expressions coincides with the derivative on a set of strings: |
|
182 |
|
183 \begin{lemma} |
|
184 $\textit{Der} \; c \; L(r) = L (r\backslash c)$ |
|
185 \end{lemma} |
|
186 |
|
187 \noindent |
|
188 |
|
189 |
149 The main property of the derivative operation |
190 The main property of the derivative operation |
150 that enables us to reason about the correctness of |
191 that enables us to reason about the correctness of |
151 an algorithm using derivatives is |
192 an algorithm using derivatives is |
152 |
193 |
153 \begin{center} |
194 \begin{center} |
232 w.r.t.~the character $a$, one obtains a derivative regular expression |
273 w.r.t.~the character $a$, one obtains a derivative regular expression |
233 with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}. |
274 with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}. |
234 The reason why $(a + aa) ^*$ explodes so drastically is that without |
275 The reason why $(a + aa) ^*$ explodes so drastically is that without |
235 pruning, the algorithm will keep records of all possible ways of matching: |
276 pruning, the algorithm will keep records of all possible ways of matching: |
236 \begin{center} |
277 \begin{center} |
237 $(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$ |
278 $(a + aa) ^* \backslash [aa] = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$ |
238 \end{center} |
279 \end{center} |
239 |
280 |
240 \noindent |
281 \noindent |
241 Each of the above alternative branches correspond to the match |
282 Each of the above alternative branches correspond to the match |
242 $aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete). |
283 $aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete). |
258 More simplifications are possible, such as deleting duplicates |
299 More simplifications are possible, such as deleting duplicates |
259 and opening up nested alternatives to trigger even more simplifications. |
300 and opening up nested alternatives to trigger even more simplifications. |
260 And suppose we apply simplification after each derivative step, and compose |
301 And suppose we apply simplification after each derivative step, and compose |
261 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn |
302 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn |
262 \textit{simp}(a \backslash c)$. Then we can build |
303 \textit{simp}(a \backslash c)$. Then we can build |
263 a matcher without having cumbersome regular expressions. |
304 a matcher with simpler regular expressions. |
264 |
|
265 |
305 |
266 If we want the size of derivatives in the algorithm to |
306 If we want the size of derivatives in the algorithm to |
267 stay even lower, we would need more aggressive simplifications. |
307 stay even lower, we would need more aggressive simplifications. |
268 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
308 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
269 deleting duplicates whenever possible. For example, the parentheses in |
309 deleting duplicates whenever possible. For example, the parentheses in |
270 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
310 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
271 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
311 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
272 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
312 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
273 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us |
313 $a^*+a+\ONE$. These more aggressive simplification rules are for |
274 to achieve a very tight size bound, namely, |
314 a very tight size bound, possibly as low |
275 the same size bound as that of the \emph{partial derivatives}. |
315 as that of the \emph{partial derivatives}\parencite{Antimirov1995}. |
276 |
316 |
277 Building derivatives and then simplify them. |
317 Building derivatives and then simplify them. |
278 So far so good. But what if we want to |
318 So far so good. But what if we want to |
279 do lexing instead of just a YES/NO answer? |
319 do lexing instead of just getting a YES/NO answer? |
280 This requires us to go back again to the world |
320 This requires us to go back again to the world |
281 without simplification first for a moment. |
321 without simplification first for a moment. |
282 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and |
322 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and |
283 elegant(arguably as beautiful as the original |
323 elegant(arguably as beautiful as the original |
284 derivatives definition) solution for this. |
324 derivatives definition) solution for this. |
340 $\ldots$, and vice versa. |
380 $\ldots$, and vice versa. |
341 Even for the regular expression matching a certain string, there could |
381 Even for the regular expression matching a certain string, there could |
342 still be more than one value corresponding to it. |
382 still be more than one value corresponding to it. |
343 Take the example where $r= (a^*\cdot a^*)^*$ and the string |
383 Take the example where $r= (a^*\cdot a^*)^*$ and the string |
344 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
384 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
345 The number of different ways of matching |
385 If we do not allow any empty iterations in its lexical values, |
346 without allowing any value under a star to be flattened |
386 there will be $n - 1$ "splitting points" on $s$ we can choose to |
347 to an empty string can be given by the following formula: |
387 split or not so that each sub-string |
348 \begin{equation} |
388 segmented by those chosen splitting points will form different iterations: |
349 C_n = (n+1)+n C_1+\ldots + 2 C_{n-1} |
389 \begin{center} |
350 \end{equation} |
390 \begin{tabular}{lcr} |
351 and a closed form formula can be calculated to be |
391 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$\\ |
352 \begin{equation} |
392 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$\\ |
353 C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}} |
393 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$\\ |
354 \end{equation} |
394 & $\textit{etc}.$ & |
355 which is clearly in exponential order. |
395 \end{tabular} |
|
396 \end{center} |
|
397 |
|
398 And for each iteration, there are still multiple ways to split |
|
399 between the two $a^*$s. |
|
400 It is not surprising there are exponentially many lexical values |
|
401 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and |
|
402 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
356 |
403 |
357 A lexer aimed at getting all the possible values has an exponential |
404 A lexer aimed at getting all the possible values has an exponential |
358 worst case runtime. Therefore it is impractical to try to generate |
405 worst case runtime. Therefore it is impractical to try to generate |
359 all possible matches in a run. In practice, we are usually |
406 all possible matches in a run. In practice, we are usually |
360 interested about POSIX values, which by intuition always |
407 interested about POSIX values, which by intuition always |
907 $\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$ |
954 $\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$ |
908 \end{tabular} |
955 \end{tabular} |
909 \end{center} |
956 \end{center} |
910 |
957 |
911 |
958 |
912 %---------------------------------------------------------------------------------------- |
|
913 % SECTION correctness proof |
|
914 %---------------------------------------------------------------------------------------- |
|
915 \section{Correctness of Bit-coded Algorithm (Without Simplification)} |
|
916 We now give the proof the correctness of the algorithm with bit-codes. |
|
917 |
|
918 Ausaf and Urban cleverly defined an auxiliary function called $\flex$, |
|
919 defined as |
|
920 \[ |
|
921 \flex \; r \; f \; [] \; v \; = \; f\; v |
|
922 \flex \; r \; f \; c :: s \; v = \flex r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v |
|
923 \] |
|
924 which accumulates the characters that needs to be injected back, |
|
925 and does the injection in a stack-like manner (last taken derivative first injected). |
|
926 $\flex$ is connected to the $\lexer$: |
|
927 \begin{lemma} |
|
928 $\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$ |
|
929 \end{lemma} |
|
930 $\flex$ provides us a bridge between $\lexer$ and $\blexer$. |
|
931 What is even better about $\flex$ is that it allows us to |
|
932 directly operate on the value $\mkeps (r\backslash v)$, |
|
933 which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument. |
|
934 When the value created by $\mkeps$ becomes available, one can |
|
935 prove some stepwise properties of lexing nicely: |
|
936 \begin{lemma}\label{flexStepwise} |
|
937 $\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $ |
|
938 \end{lemma} |
|
939 |
|
940 And for $\blexer$ we have a function with stepwise properties like $\flex$ as well, |
|
941 called $\retrieve$\ref{retrieveDef}. |
|
942 $\retrieve$ takes bit-codes from annotated regular expressions |
|
943 guided by a value. |
|
944 $\retrieve$ is connected to the $\blexer$ in the following way: |
|
945 \begin{lemma}\label{blexer_retrieve} |
|
946 $\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$ |
|
947 \end{lemma} |
|
948 If you take derivative of an annotated regular expression, |
|
949 you can $\retrieve$ the same bit-codes as before the derivative took place, |
|
950 provided that you use the corresponding value: |
|
951 |
|
952 \begin{lemma}\label{retrieveStepwise} |
|
953 $\retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
|
954 \end{lemma} |
|
955 The other good thing about $\retrieve$ is that it can be connected to $\flex$: |
|
956 %centralLemma1 |
|
957 \begin{lemma}\label{flex_retrieve} |
|
958 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
|
959 \end{lemma} |
|
960 \begin{proof} |
|
961 By induction on $s$. The induction tactic is reverse induction on strings. |
|
962 $v$ is allowed to be arbitrary. |
|
963 The crucial point is to rewrite |
|
964 \[ |
|
965 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) |
|
966 \] |
|
967 as |
|
968 \[ |
|
969 \retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash s@[c])) |
|
970 \]. |
|
971 This enables us to equate |
|
972 \[ |
|
973 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) |
|
974 \] |
|
975 with |
|
976 \[ |
|
977 \flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c]))) |
|
978 \], |
|
979 which in turn can be rewritten as |
|
980 \[ |
|
981 \flex \; r \; \textit{id} \; s@[c] \; (\mkeps (r\backslash s@[c])) |
|
982 \]. |
|
983 \end{proof} |
|
984 |
|
985 With the above lemma we can now link $\flex$ and $\blexer$. |
|
986 |
|
987 \begin{lemma}\label{flex_blexer} |
|
988 $\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$ |
|
989 \end{lemma} |
|
990 \begin{proof} |
|
991 Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}. |
|
992 \end{proof} |
|
993 Finally |
|
994 |
|
995 |
|
996 |
959 |
997 |
960 |