% Chapter Template

\chapter{Regular Expressions and Sulzmann and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title

\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
%use for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant, and discuss
%why more aggressive simplifications are needed.


\section{Preliminaries}

Suppose we have an alphabet $\Sigma$. The set of all strings over this
alphabet is written $\Sigma^*$.

We use patterns to define sets of strings concisely. Regular expressions
are one such pattern system.
The basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
         \mid c
         \mid r_1 \cdot r_2
         \mid r_1 + r_2
         \mid r^*
\]

The language, or set of strings, denoted by a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\emptyset$\\
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\
$L \; (r^*)$ & $\dn$ & $ \bigcup_{i \geq 0} L \; (r)^i$\\
\end{tabular}
\end{center}
This mapping $L$ is also called the ``language interpretation'' of regular expressions.
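
For readers who prefer code, the following is a small sketch of how these basic
regular expressions might be represented as a Scala datatype (the constructor
names are our own choice for illustration and not part of any formal development):

\begin{verbatim}
// Sketch: basic regular expressions as a Scala algebraic datatype.
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*
\end{verbatim}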



The Brzozowski derivative of a regular expression w.r.t.\ a character $c$
is an operation that transforms the regular expression into a new one denoting
the strings of the original language that start with $c$, but with that
head character $c$ removed.

Formally, we first define such a transformation on arbitrary sets of strings,
which we call the semantic derivative:
\begin{center}
$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
\end{center}
\noindent
If the set $A$ happens to have some structure, for example
if it is regular (that is, definable by a regular expression), then
its semantic derivative is also regular and can be computed
syntactically on the regular expression itself.

% Derivatives of a
%regular expression, written $r \backslash c$, give a simple solution
%to the problem of matching a string $s$ with a regular
%expression $r$: if the derivative of $r$ w.r.t.\ (in
%succession) all the characters of the string matches the empty string,
%then $r$ matches $s$ (and {\em vice versa}).

The derivative of a regular expression, written
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$, and returns another regular expression $r'$,
which is computed by the following recursive function:

\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
$d \backslash c$     & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The $\nullable$ function tests whether the empty string $[]$
is in the language of $r$:


\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The empty set does not contain any string and
therefore does not contain the empty string. The regular expression
$\ONE$ contains the empty string
by definition. The character regular expression $c$
denotes the singleton set containing only the one-character string $[c]$,
and therefore does not contain the empty string. An alternative regular
expression (or ``or'' expression) is nullable
if at least one of its children is nullable. A sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
naturally contains the empty string (zero iterations).
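
As an informal illustration, the two functions above can be transcribed almost
clause by clause into Scala, building on the \texttt{Rexp} datatype sketched
earlier (again only a sketch, not the formalised definition):

\begin{verbatim}
// Sketch: nullable and the Brzozowski derivative over Rexp.
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}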



We can relate the derivative of a regular expression to the semantic derivative
via the language interpretation:

\begin{center}
$L \; (r \backslash c) = \Der \; c \; (L \; (r))$
\end{center}
\noindent
The function $\backslash c$
describes how a regular expression evolves into
a new regular expression once the head character $c$
has been chopped off all the strings it denotes.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
may also be the one whose head character gets chopped
off.
The derivative of a star regular expression unwraps one iteration of
the regular expression and attaches the star regular expression
to the second element of the sequence, making sure that a copy is retained
for possibly more iterations in later phases of lexing.


The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's  regular-expression matching algorithm as:

\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]
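
\noindent
A corresponding Scala sketch of the string derivative and the matcher, reusing
the \texttt{der} and \texttt{nullable} sketches from above, is:

\begin{verbatim}
// Sketch: derivative w.r.t. a string and Brzozowski's matcher.
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(s: String, r: Rexp): Boolean =
  nullable(ders(s.toList, r))

// e.g. matcher("ab", SEQ(CHAR('a'), CHAR('b'))) evaluates to true
\end{verbatim}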

\noindent
Assuming that a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with  a regular expression  $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively  easily shown that this matcher is correct (that is, given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
This is a beautifully simple definition.

If we implement the above algorithm naively, however,
it can be excruciatingly slow.

\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=10000,
    ytick={0,1000,...,10000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\caption{EightThousandNodes} \label{fig:EightThousandNodes}
\end{figure}

%(8000 node data to be added here)
For example, when starting with the regular
expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree); see Figure~\ref{fig:EightThousandNodes}.
The reason why $(a + aa) ^*$ explodes so drastically is that without
pruning, the algorithm will keep records of all possible ways of matching:
\begin{center}
$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
\end{center}

\noindent
Each of the above alternative branches corresponds to a (possibly partial) match:
$aa $, $a \quad a$ and $a \quad a \cdot (a)$ (the last one being incomplete).
These different ways of matching will grow exponentially with the string length,
and without simplifications that throw away some of these very similar matchings,
it is no surprise that these expressions grow so quickly.
Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative, the slower the
algorithm.

Brzozowski was quick in finding that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, and therefore the algorithm as stated is not optimal.
He also introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regular expressions and thereby further decrease the size
of the intermediate regular expressions.

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher that avoids the intermediate regular expressions becoming cumbersome.

If we want the size of derivatives in the algorithm to
stay even lower, we would need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve a very tight size bound, namely,
the same size bound as that of the \emph{partial derivatives}.

Building derivatives and then simplifying them works nicely so far.
But what if we want to
do lexing instead of just obtaining a YES/NO answer?
This requires us to go back again, for a moment, to the world
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}


They first defined the datatypes for storing the
lexing information, called \emph{values} or
sometimes also \emph{lexical values}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}
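
\noindent
These values can be represented, for instance, by the following Scala datatype
(a sketch mirroring the grammar on the right; the names are illustrative only):

\begin{verbatim}
// Sketch: lexical values as a Scala algebraic datatype.
abstract class Val
case object Empty extends Val                   // for ONE
case class Chr(c: Char) extends Val             // for a character
case class Sequ(v1: Val, v2: Val) extends Val   // for a sequence
case class Left(v: Val) extends Val             // left branch of +
case class Right(v: Val) extends Val            // right branch of +
case class Stars(vs: List[Val]) extends Val     // iterations of a star
\end{verbatim}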

\noindent

Building on top of Sulzmann and Lu's attempt to formalize the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally-specified rules for what a POSIX matching is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But having a correct result is still not enough;
we want at least some degree of \emph{efficiency}.



One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and vice versa.
Even with respect to one particular string, a regular expression can
still match it with more than one value.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching
without allowing any value under a star to be flattened
to an empty string can be given by the following recurrence:
\begin{equation}
C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
\end{equation}
and a closed-form formula can be calculated to be
\begin{equation}
C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which clearly grows exponentially.

A lexer aimed at producing all possible values therefore has an exponential
worst-case runtime, so it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which by intuition always
\begin{itemize}
\item
match the leftmost regular expression when multiple options for matching
are available, and
\item
always match a subpart as much as possible before proceeding
to the next token.
\end{itemize}

For example, the above example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output we want from a lexing algorithm is such a POSIX match,
encoded as a value.
The reason why we are interested in $\POSIX$ values is that they can
be practically used in the lexing phase of a compiler front end.
For instance, when lexing a code snippet
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword.

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated in case the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$	& $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}
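
\noindent
A Scala rendering of $\mkeps$ over the sketched datatypes might look as follows
(a sketch; the $\ZERO$ and character cases are omitted because $\mkeps$ is only
ever applied to nullable regular expressions):

\begin{verbatim}
// Sketch: mkeps builds the POSIX value for how a nullable
// regular expression matches the empty string.
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}
\end{verbatim}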


\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. A POSIX value is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses do essentially one thing: they identify the ``hole'' in the
value into which the character has to be injected back.
For instance, consider the last clause, which injects back into a value
that corresponds to a star. That value
must be a sequence value whose first
component corresponds to the body of the star
with the first character chopped off---an iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations collected before. So we inject the character
back into the first value and form a new value with this new iteration
added to the front of the previous list of iterations, all under a $\Stars$
constructor at the top level.
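
For concreteness, here is a Scala sketch of $\textit{inj}$ over the datatypes
introduced earlier, together with the resulting two-phase lexer (pattern matches
not listed are never reached for well-formed inputs):

\begin{verbatim}
// Sketch: inj reverses one derivative step by injecting the
// character c back into the value v.
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty)                 => Chr(c)
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
}

// Sketch: derivatives first, then mkeps followed by repeated injections.
def lex(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil     => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lex(der(c, r), cs).map(v => inj(r, c, v))
}
\end{verbatim}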


We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well---by definition they reflect
the size of the regular expressions they are built from.

One can introduce simplification on the regular expressions and values, but one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications
during the injection phase.
One can then prove that the POSIX value for how a
regular expression matches a string is not affected---although this is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.



%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification or he may well be
%stuck on an efficient simplificaiton with a proof.
%He went on to explore the use of derivatives together with
%automaton, and did not try lexing using derivatives.

We want to get rid of the complex and fragile rectification of values.
Can we avoid creating the intermediate values $v_1,\ldots, v_n$ altogether,
and obtain the lexing information that should already be there while
doing derivatives, in one pass, without a second injection phase?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding additional information to the
regular expressions, called \emph{bitcodes}.

\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:

\begin{center}
$b ::=   1 \mid  0 \qquad
bs ::= [] \mid b::bs
$
\end{center}

\noindent
The $1$ and $0$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}
\end{center}

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:


%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       (\Left\,v, bs_1)$\\
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       (\Right\,v, bs_1)$\\
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\

  $\textit{decode}\,bs\,r$ & $\dn$ &
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
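
A Scala sketch of the coding and decoding functions, over the \texttt{Val} and
\texttt{Rexp} sketches from before and with bits represented as integers, is
given below (\texttt{decodeAux} corresponds to $\textit{decode}'$; malformed
bitcodes simply fail the pattern match):

\begin{verbatim}
// Sketch: encoding values as bitcodes and decoding them back
// with the help of the original regular expression.
type Bits = List[Int]

def code(v: Val): Bits = v match {
  case Empty           => Nil
  case Chr(_)          => Nil
  case Left(v1)        => 0 :: code(v1)
  case Right(v1)       => 1 :: code(v1)
  case Sequ(v1, v2)    => code(v1) ++ code(v2)
  case Stars(Nil)      => List(0)
  case Stars(v1 :: vs) => 1 :: code(v1) ++ code(Stars(vs))
}

def decodeAux(r: Rexp, bs: Bits): (Val, Bits) = (r, bs) match {
  case (ONE, _)                => (Empty, bs)
  case (CHAR(c), _)            => (Chr(c), bs)
  case (ALT(r1, _), 0 :: rest) =>
    val (v, bs1) = decodeAux(r1, rest); (Left(v), bs1)
  case (ALT(_, r2), 1 :: rest) =>
    val (v, bs1) = decodeAux(r2, rest); (Right(v), bs1)
  case (SEQ(r1, r2), _)        =>
    val (v1, bs1) = decodeAux(r1, bs)
    val (v2, bs2) = decodeAux(r2, bs1)
    (Sequ(v1, v2), bs2)
  case (STAR(_), 0 :: rest)    => (Stars(Nil), rest)
  case (STAR(r1), 1 :: rest)   =>
    val (v, bs1) = decodeAux(r1, rest)
    val (Stars(vs), bs2) = decodeAux(STAR(r1), bs1)
    (Stars(v :: vs), bs2)
}

def decode(bs: Bits, r: Rexp): Option[Val] = decodeAux(r, bs) match {
  case (v, Nil) => Some(v)
  case _        => None
}
\end{verbatim}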

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\ZERO$\\
                  & $\mid$ & $_{bs}\ONE$\\
                  & $\mid$ & $_{bs}{\bf c}$\\
                  & $\mid$ & $_{bs}\sum\,as$\\
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
                  & $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})

\noindent
where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.


To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
  $(r^*)^\uparrow$ & $\dn$ &
         $_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
%\end{definition}

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
     $_{bs @ bs'}\ONE$\\
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
     $_{bs@bs'}{\bf c}$\\
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
     $_{bs@bs'}\sum\textit{as}$\\
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
     $_{bs@bs'}a_1 \cdot a_2$\\
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
     $_{bs @ bs'}a^*$
\end{tabular}
\end{center}
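
\noindent
As a sketch in Scala, annotated regular expressions, $\textit{fuse}$ and
internalisation can be written as follows (bitcodes as in the earlier sketch;
the constructor names are our own):

\begin{verbatim}
// Sketch: annotated regular expressions carrying bitcodes.
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp  // generalised +
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

// fuse attaches bits to the front of an annotated regular expression.
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO             => AZERO
  case AONE(bs1)         => AONE(bs ++ bs1)
  case ACHAR(bs1, c)     => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as)    => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1)    => ASTAR(bs ++ bs1, a1)
}

// internalisation of ordinary regular expressions.
def internalise(r: Rexp): ARexp = r match {
  case ZERO        => AZERO
  case ONE         => AONE(Nil)
  case CHAR(c)     => ACHAR(Nil, c)
  case ALT(r1, r2) => AALTS(Nil, List(fuse(List(0), internalise(r1)),
                                      fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1)    => ASTAR(Nil, internalise(r1))
}
\end{verbatim}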


\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:


\iffalse
 %\begin{definition}{bder}
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
        $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
  & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       (\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\
  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
        $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
  & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
       (_{bs}\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}
\fi

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\ONE\;\textit{else}\;\ZERO$\\
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
        $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
  & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}(\textit{fuse}\, [1] \; (a\,\backslash c))\cdot
       (_{[]}a^*)$
\end{tabular}
\end{center}

%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $1$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The sequence clause
is also more subtle. Here \textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions (therefore we omit the
definition). When $a_1$ is $\textit{bnullable}$, and assuming that $\textit{bmkeps}$
correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
the right branch of the alternative, which is $\textit{fuse} \; (\bmkeps \; a_1)\, (a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it there. The
bitsequence $\textit{bs}$, which was initially attached to the
first element of the sequence $a_1 \cdot a_2$, has
now been elevated to the top-level of $\sum$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $ a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:


%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{[]}\sum \textit{as})$\\
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
     $bs \,@\, [0]$
\end{tabular}
\end{center}
%\end{definition}

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation  of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
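
Putting the pieces together, a Scala sketch of the bit-coded derivative,
$\textit{bmkeps}$ and $\textit{blexer}$ over the sketched \texttt{ARexp} is
given below (it follows the convention above in which a non-empty star
iteration contributes a $1$ and the end of a star a $0$):

\begin{verbatim}
// Sketch: nullability, bmkeps, derivative and blexer for
// annotated regular expressions.
def bnullable(a: ARexp): Boolean = a match {
  case AZERO           => false
  case AONE(_)         => true
  case ACHAR(_, _)     => false
  case AALTS(_, as)    => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _)     => true
}

def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs)            => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ++ bmkeps(a1) else bs ++ bmkeps(AALTS(Nil, as))
  case ASEQ(bs, a1, a2)    => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _)        => bs ++ List(0)  // 0 marks the end of the star
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO            => AZERO
  case AONE(_)          => AZERO
  case ACHAR(bs, d)     => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as)    => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1)    =>
    ASEQ(bs, fuse(List(1), bder(c, a1)), ASTAR(Nil, a1))  // 1 marks one more iteration
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}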

Now we introduce the simplification rules, which are the reason for
introducing the bitcodes in the first place.

\subsection*{Simplification Rules}

This section introduces aggressive (in terms of size) simplification rules
on annotated regular expressions
in order to keep derivatives small. Such simplifications are promising,
as we have
generated test data that suggest
that a good tight size bound can be achieved. We could only
partially cover the search space, as there are infinitely many regular
expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:
%\comment{Use $\ZERO$, $\ONE$ and so on.
%Is it $ALTS$ or $ALTS$?}\\

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\
   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
The simplification function pattern-matches on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its child regular expressions
recursively and then checks whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all later ones
when it detects duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; ((_{bs}\sum \textit{as}) :: \textit{as'})$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  $\textit{flatten} \; (\ZERO :: as')$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
  $\textit{flatten} \; (a :: as')$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. In terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
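
In Scala, the simplification function together with $\textit{flatten}$ can be
sketched as follows; here $\textit{distinct}$ is Scala's built-in
\texttt{List.distinct}, which keeps only the first occurrence of each duplicate,
and we omit the $a_1 \cdot \ONE$ clause, which needs extra care with the bits
attached to the $\ONE$:

\begin{verbatim}
// Sketch: flatten opens up nested alternatives and drops AZERO;
// simp applies the simplification rules described above.
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil                    => Nil
  case AZERO :: rest          => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ++ flatten(rest)
  case a :: rest              => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _)       => AZERO
    case (_, AZERO)       => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)  // keep the ONE's bits as well
    case (a1s, a2s)       => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil       => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1       => AALTS(bs, as1)
  }
  case _ => a
}
\end{verbatim}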

Having defined the $\simp$ function,
we can use the previous notation of natural
extension from derivative w.r.t.~character to derivative
w.r.t.~string:%\comment{simp in the [] case?}

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
to obtain an optimised version of the algorithm:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small: for example,
with this simplification our previous $(a + aa)^*$ example's more than 8000 nodes
will be reduced to just 6, and the size stays constant no matter how long the
input string is.
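
The optimised lexer is then only a small change to the earlier \texttt{blexer}
sketch, interleaving \texttt{simp} with each derivative step:

\begin{verbatim}
// Sketch: simplify after every derivative step.
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bdersSimp(cs, simp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}

// e.g. blexerSimp(STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a')))), "aaaa")
// keeps the intermediate annotated regular expressions small.
\end{verbatim}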




%-----------------------------------
%	SUBSECTION 1
%-----------------------------------
\section{Specifications of Certain Functions to be Used}
Here we give the definitions of some functions
which we will use later.
\begin{center}
\begin{tabular}{ccc}
$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$
\end{tabular}
\end{center}
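
Only the character clause of $\retrieve$ is shown above. For reference, a Scala
sketch of the full function, following the same $0$/$1$ conventions as
$\textit{code}$ and $\textit{bmkeps}$, could look as follows (this is our own
reconstruction for illustration, not a clause-by-clause copy of the formal
definition):

\begin{verbatim}
// Sketch: retrieve collects the bitcode of an annotated regular
// expression along the path described by a value.
def retrieve(a: ARexp, v: Val): Bits = (a, v) match {
  case (AONE(bs), Empty)                => bs
  case (ACHAR(bs, _), Chr(_))           => bs
  case (AALTS(bs, a1 :: as), Left(v1))  => bs ++ retrieve(a1, v1)
  case (AALTS(bs, _ :: as), Right(v1))  => bs ++ retrieve(AALTS(Nil, as), v1)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) => bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
  case (ASTAR(bs, _), Stars(Nil))       => bs ++ List(0)
  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
    bs ++ List(1) ++ retrieve(a1, v1) ++ retrieve(ASTAR(Nil, a1), Stars(vs))
}
\end{verbatim}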


%----------------------------------------------------------------------------------------
%	SECTION  correctness proof
%----------------------------------------------------------------------------------------

\section{Correctness of Bit-coded Algorithm (Without Simplification)}
We now give the proof of the correctness of the algorithm with bit-codes.

Ausaf and Urban cleverly defined an auxiliary function called $\flex$:
\[
\begin{array}{lcl}
\flex \; r \; f \; [] \; v & \dn & f\; v\\
\flex \; r \; f \; (c :: s) \; v & \dn & \flex \; (r\backslash c) \; (\lambda v. \, f \, (\inj \; r\; c\; v))\; s \; v
\end{array}
\]
It accumulates the characters that need to be injected back,
and does the injection in a stack-like manner (the last derivative taken is the first character injected).
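
In code, $\flex$ could be sketched as follows, using the earlier
\texttt{Rexp}/\texttt{Val} sketches (it is primarily a device for the proofs,
but is executable nonetheless):

\begin{verbatim}
// Sketch: flex accumulates the pending injections as a function
// and applies it to the final value.
def flex(r: Rexp, f: Val => Val, s: List[Char], v: Val): Val = s match {
  case Nil     => f(v)
  case c :: cs => flex(der(c, r), (v1: Val) => f(inj(r, c, v1)), cs, v)
}

// flex(r, identity, s.toList, mkeps(ders(s.toList, r)))
// recovers the lexer's value when the string matches.
\end{verbatim}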
$\flex$ is connected to the $\lexer$ function as follows:
\begin{lemma}
$\flex \; r \; \textit{id}\; s \; (\mkeps \; (r\backslash s)) = \lexer \; r \; s$
\end{lemma}
$\flex$ provides us with a bridge between $\lexer$ and $\blexer$.
What is even better about $\flex$ is that it allows us to
directly operate on the value $\mkeps \; (r\backslash s)$,
which is pivotal in the definitions of $\lexer$ and $\blexer$, but not visible as an argument.
When the value created by $\mkeps$ becomes available, one can
prove some stepwise properties of lexing nicely:
\begin{lemma}\label{flexStepwise}
$\textit{flex} \; r \; f \; (s@[c]) \; v = \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
\end{lemma}

For $\blexer$ we have a function with stepwise properties like $\flex$ as well,
called $\retrieve$ (\ref{retrieveDef}).
$\retrieve$ collects bit-codes from an annotated regular expression,
guided by a value.
$\retrieve$ is connected to $\blexer$ in the following way:
\begin{lemma}\label{blexer_retrieve}
$\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
\end{lemma}
If we take the derivative of an annotated regular expression,
we can $\retrieve$ the same bit-codes as before the derivative took place,
provided that we use the corresponding value:

\begin{lemma}\label{retrieveStepwise}
$\retrieve \; (r \backslash c) \;  v = \retrieve \; r \; (\inj \; r\; c\; v)$
\end{lemma}
The other good thing about $\retrieve$ is that it can be connected to $\flex$:
%centralLemma1
\begin{lemma}\label{flex_retrieve}
$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
\end{lemma}
\begin{proof}
By reverse induction on the string $s$, with
$v$ allowed to be arbitrary.
The crucial point is to rewrite
\[
\retrieve \; (r \backslash (s@[c])) \; (\mkeps \; (r \backslash (s@[c])))
\]
as
\[
\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\;  (\mkeps \; (r \backslash (s@[c])))).
\]
This enables us to equate
\[
\retrieve \; (r \backslash (s@[c])) \; (\mkeps \; (r \backslash (s@[c])))
\]
with
\[
\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps \; (r\backslash (s@[c])))),
\]
which in turn can be rewritten as
\[
\flex \; r \; \textit{id} \; (s@[c]) \; (\mkeps \; (r\backslash (s@[c]))).
\]
\end{proof}

With the above lemma we can now link $\flex$ and $\blexer$.

\begin{lemma}\label{flex_blexer}
$\textit{flex} \; r \; \textit{id} \; s \; (\mkeps \; (r \backslash s)) = \blexer \; r \; s$
\end{lemma}
\begin{proof}
By combining the two lemmas above, \ref{flex_retrieve} and \ref{blexer_retrieve}.
\end{proof}
Finally, combining Lemma \ref{flex_blexer} with the connection between
$\flex$ and $\lexer$ stated at the beginning of this section gives us the
correctness of the bit-coded algorithm:
\begin{lemma}
$\lexer \; r \; s = \blexer \; r \; s$
\end{lemma}
