% Chapter Template

\chapter{Regular Expressions and POSIX Lexing} % Main chapter title

\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
% used for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed.

In this chapter, we define the basic notions
for regular languages and regular expressions.
This is essentially a description in ``English''
of our formalisation in Isabelle/HOL.
We also give the definition of what $\POSIX$ lexing means,
followed by an algorithm by Sulzmann and Lu\parencite{Sulzmann2014}
that produces an output conforming
to the $\POSIX$ standard.
It is also worth mentioning that
we choose to use the ML-style notation
for function applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g.\ $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.
\section{Basic Concepts}
Usually, formal language theory starts with an alphabet
denoting a set of characters.
Here we just use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII characters.
In what follows, we shall leave the information about the alphabet
implicit.
Then using the usual bracket notation for lists,
we can define strings made up of characters:
\begin{center}
\begin{tabular}{lcl}
$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$
\end{tabular}
\end{center}
where $c$ is a variable ranging over characters.
Strings can be concatenated to form longer strings in the same
way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
We omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
\end{tabular}
\end{center}
We also call the above \emph{language concatenation}.
The power of a language is defined recursively, using the
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A @ A^n$
\end{tabular}
\end{center}
The union of all powers of a language
can be used to define the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
\end{tabular}
\end{center}

\noindent
However, to obtain a more convenient induction principle
in Isabelle/HOL,
we instead define the Kleene star
as an inductive set:

\begin{center}
\begin{mathpar}
\inferrule{\mbox{}}{[] \in A*}

\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
\end{mathpar}
\end{center}
\noindent
We also define an operation of ``chopping off'' a character from
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to ``chopping off'' a string from all strings within a set $A$,
namely:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
\end{tabular}
\end{center}
\noindent
which is essentially the left quotient of $A$ by
the singleton language $L = \{s\}$
in formal language theory.
However, for the purposes here, the $\textit{Ders}$ definition with
a single string is sufficient.
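
Since languages are in general infinite sets, $\Der$ and $\textit{Ders}$ are mathematical
definitions rather than programs; for \emph{finite} sets of strings, however, they can be
computed directly. The following small Scala sketch (ours, purely illustrative and not part
of the Isabelle/HOL formalisation) does exactly that, representing strings as \texttt{List[Char]}:
\begin{verbatim}
// Illustrative sketch only: Der and Ders on *finite* sets of strings.
object LangDer {
  type Str  = List[Char]
  type Lang = Set[Str]

  // Der c A = { s | c::s is in A }
  def Der(c: Char, A: Lang): Lang =
    A.collect { case d :: s if d == c => s }

  // Ders s A = { s' | s @ s' is in A }, obtained by iterating Der
  def Ders(s: Str, A: Lang): Lang =
    s.foldLeft(A)((acc, c) => Der(c, acc))

  def main(args: Array[String]): Unit = {
    val A: Lang = Set("abc".toList, "abd".toList, "b".toList)
    println(Der('a', A))           // Set(List(b, c), List(b, d))
    println(Ders("ab".toList, A))  // Set(List(c), List(d))
  }
}
\end{verbatim}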

With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
we have a few properties of how the language derivative can be defined using
sub-languages.
\begin{lemma}
\[
\Der \; c \; (A @ B) =
\begin{cases}
((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B), & \text{if} \;\, [] \in A \\
(\Der \; c \; A) \, @ \, B, & \text{otherwise}
\end{cases}
\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.
The derivative of the language $A*$ can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c\; A) @ (A*)$
\end{lemma}
\begin{proof}
There are two inclusions to prove:
\begin{itemize}
\item{$\subseteq$}:\\
The set
\[ \{s \mid c :: s \in A*\} \]
is included in the set
\[ \{s_1 @ s_2 \mid s_1,\, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
because whenever a string in the language of the Kleene star $A*$
starts with a character, that character together with some sub-string
immediately after it forms the first iteration,
and the rest of the string is still in $A*$.
\item{$\supseteq$}:\\
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
holds.
Moreover,
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
holds, where the right-hand side can be rewritten (using the previous lemma)
as \[ ((\Der \; c\; A) @ A*) \cup (\Der \; c \; (A*)), \]
which of course contains $(\Der \; c\; A) @ A*$.
\end{itemize}
\end{proof}

\noindent
Before we can define the counterparts of $\textit{Der}$ and $\textit{Ders}$
for regular expressions, we first need to give the definition of regular expressions themselves.

\subsection{Regular Expressions and Their Meaning}
The \emph{basic regular expressions} are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]
\noindent
We call them basic because we will introduce
additional constructors, such as negation
and bounded repetitions, in later chapters.
We use $\ZERO$ for the regular expression that
matches no string, and $\ONE$ for the regular
expression that matches only the empty string\footnote{
Some authors
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$,
but we prefer our notation.}.
The sequence regular expression is written $r_1\cdot r_2$
and sometimes we omit the dot if it is clear which
regular expression is meant; the alternative
is written $r_1 + r_2$.
The \emph{language} or meaning of
a regular expression is defined recursively as
a set of strings:
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\emptyset$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
$L \; r^*$ & $\dn$ & $ (L\;r)*$
\end{tabular}
\end{center}
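
As an aside, throughout this chapter we will occasionally give small Scala sketches of the
functions being defined. These sketches are ours, purely illustrative, and not part of the
Isabelle/HOL formalisation; all constructor and function names are our own choice. The basic
regular expressions, for instance, can be represented as an algebraic datatype:
\begin{verbatim}
// Illustrative sketch: the basic regular expressions as a Scala ADT.
sealed trait Rexp
case object ZERO extends Rexp                    // matches no string
case object ONE  extends Rexp                    // matches only the empty string
case class CHAR(c: Char) extends Rexp            // matches the string [c]
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class STAR(r: Rexp) extends Rexp            // r*
\end{verbatim}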
\noindent
Now with the semantic derivative of a language and regular expressions together with
their language interpretation in place, we are ready to define derivatives on regular expressions.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
%Recall, the language derivative acts on a set of strings
%and essentially chops off a particular character from
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
%so that after derivative $L(r\backslash c)$
%will look as if it was obtained by doing a language derivative on $L(r)$:
Recall that the semantic derivative acts on a set of
strings. Brzozowski noticed that this operation
can be ``mirrored'' on regular expressions, which
he called the derivative of a regular expression $r$
with respect to a character $c$, written
$r \backslash c$.
He defined this operation such that the following property holds:
\begin{center}
\[
L(r \backslash c) = \Der \; c \; L(r)
\]
\end{center}
\noindent
For example, in the sequence case we have
\begin{center}
\begin{tabular}{lcl}
$\Der \; c \; (A @ B)$ & $\dn$ &
$ \textit{if} \;\, [] \in A \;
\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
\Der \; c\; B$\\
& & $\textit{else}\; (\Der \; c \; A) @ B$\\
\end{tabular}
\end{center}
\noindent
This can be translated to
regular expressions in the following
manner:
\begin{center}
\begin{tabular}{lcl}
$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,[] \in L(r_1)\;\textit{then}\;(r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
\end{tabular}
\end{center}

\noindent
Similarly, the Kleene star's semantic derivative
can be expressed as
\[
\textit{Der} \;c \;(A*) = (\textit{Der}\; c\; A) @ (A*),
\]
which translates to
\[
(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
\]
In the above definition of $(r_1\cdot r_2) \backslash c$,
the $\textit{if}$ clause's
boolean condition
$[] \in L(r_1)$ needs to be
somehow recursively computed.
We call such a function that checks
whether the empty string $[]$ is
in the language of a regular expression $\nullable$:
\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The $\ZERO$ regular expression
does not contain any string and
therefore is not \emph{nullable}.
$\ONE$ is \emph{nullable}
by definition.
The character regular expression $c$
corresponds to the singleton language $\{[c]\}$,
and therefore does not contain the empty string.
The alternative regular expression is nullable
if at least one of its children is nullable.
The sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
is always nullable because it naturally
contains the empty string.
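
In the illustrative Scala sketch from above, $\nullable$ corresponds to the following
function (again ours, not the formalisation itself):
\begin{verbatim}
// Illustrative sketch: does r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}
\end{verbatim}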

The derivative function, written $r\backslash c$,
defines how a regular expression evolves into
a new one after all the strings in its language have a character ``chopped off'':
if a string starts with $c$, then the character is chopped off;
if not, that string is removed.
\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The most involved cases are the sequence case
and the star case.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
needs to be considered as well, as its derivative will contribute to the
result of this derivative.
The derivative of the star regular expression $r^*$
unwraps one iteration of $r$, turns it into $r\backslash c$,
and attaches the original $r^*$
after $r\backslash c$, so that
we can further unfold it as many times as needed.
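
Continuing the illustrative Scala sketch, the derivative operation corresponds to the
following function (ours, not the formalisation):
\begin{verbatim}
// Illustrative sketch: Brzozowski derivative of r with respect to c.
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}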
We have the following correspondence between
derivatives on regular expressions and
derivatives on a set of strings:
\begin{lemma}\label{derDer}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}

\noindent
The main property of the derivative operation
(that enables us to reason about the correctness of
derivative-based matching)
is

\begin{lemma}\label{derStepwise}
$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
\end{lemma}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity, we will
omit the subscript and use $\backslash$ instead
of $\backslash_s$ to denote
string derivatives for brevity.
Brzozowski's regular-expression matching algorithm can then be described as:

\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
\end{definition}

\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1\ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
It can be
relatively easily shown that this matcher is correct:
\begin{lemma}
$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
\end{lemma}
\begin{proof}
By the stepwise property of derivatives (lemma \ref{derStepwise})
and lemma \ref{derDer}.
\end{proof}
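
A direct (and, as we shall see, naive) Scala rendering of this matcher simply folds the
single-character derivative over the input string and tests nullability at the end
(illustrative sketch only):
\begin{verbatim}
// Illustrative sketch: the naive derivative-based matcher.
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(s: List[Char], r: Rexp): Boolean =
  nullable(ders(s, r))

// e.g. matching (a*)*b against "aaab":
// matches("aaab".toList, SEQ(STAR(STAR(CHAR('a'))), CHAR('b')))  returns true
\end{verbatim}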
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    legend entries={Naive Matcher},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
\end{figure}
\noindent
If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow, as shown in
Figure~\ref{NaiveMatcher}.
Note that the time axis is in logarithmic scale.
Around two dozen characters
already cause the matcher to explode on the regular expression
$(a^*)^*b$.
To alleviate this, we need to introduce certain
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
One simple-minded simplification function
that achieves these requirements is given below:
\begin{center}
\begin{tabular}{lcl}
$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
$\simp \; r$ & $\dn$ & $r \qquad (\textit{otherwise})$
\end{tabular}
\end{center}
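
In the Scala sketch, this simplification function might be written as follows
(ours, not the formalisation):
\begin{verbatim}
// Illustrative sketch: simple-minded simplification removing ZEROs and ONEs,
// without changing the language of the regular expression.
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, r2s) => r2s
    case (r1s, ONE) => r1s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s
    case (r1s, ZERO) => r1s
    case (r1s, r2s)  => ALT(r1s, r2s)
  }
  case _ => r
}
\end{verbatim}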
If we repeatedly apply this simplification
function during the matching algorithm,
we have a matcher with simplification:
\begin{center}
\begin{tabular}{lcl}
$\derssimp \; [] \; r$ & $\dn$ & $r$\\
$\derssimp \; (c :: cs) \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
\end{tabular}
\end{center}
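
In the Scala sketch, this corresponds to interleaving \texttt{simp} with the derivative
steps (again only an illustration):
\begin{verbatim}
// Illustrative sketch: derivatives interleaved with simplification.
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}

def matcherSimp(s: List[Char], r: Rexp): Boolean =
  nullable(dersSimp(s, r))
\end{verbatim}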
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    xmode = log,
    grid = both,
    legend entries={Matcher With Simp},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{$(a^*)^*b$
against
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ using $\textit{matcher}_{simp}$}\label{BetterMatcher}
\end{figure}
\noindent
The running time of $\textit{ders}\_\textit{simp}$
on the same example as in Figure~\ref{NaiveMatcher}
is now very tame with respect to the length of the input,
as shown in Figure~\ref{BetterMatcher}.

The matcher thus works by building derivatives and then testing whether
the empty string is in the resulting regular expression's language,
applying simplifications where necessary.
So far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
In this section, we present a two-phase regular expression lexing
algorithm.
The first phase takes successive derivatives with
respect to the input string,
and the second phase does the reverse, \emph{injecting} back
characters, while constructing a lexing result.
We will introduce the injection phase in detail slightly
later, but as a preliminary we have to first define
the datatype for lexing results,
called \emph{value} or
sometimes also \emph{lexical value}. Values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}
\noindent
A value has an underlying string, which
can be calculated by the ``flatten'' function $|\_|$:
\begin{center}
\begin{tabular}{lcl}
$|\Empty|$ & $\dn$ & $[]$\\
$|\Char \; c|$ & $ \dn$ & $ [c]$\\
$|\Seq(v_1, v_2)|$ & $ \dn$ & $ |v_1| @ |v_2|$\\
$|\Left(v)|$ & $ \dn$ & $ |v|$\\
$|\Right(v)|$ & $ \dn$ & $ |v|$\\
$|\Stars([])|$ & $\dn$ & $[]$\\
$|\Stars(v::vs)|$ & $\dn$ & $ |v| @ |\Stars(vs)|$
\end{tabular}
\end{center}
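
In the Scala sketch, values and the flatten function can be represented as follows. The
constructor names \texttt{Chr} and \texttt{Sequ} are chosen so as not to clash with Scala's
built-in \texttt{Char} and \texttt{Seq}; as before, none of this is part of the formalisation:
\begin{verbatim}
// Illustrative sketch: lexical values and the flatten function |_|.
sealed trait Val
case object Empty extends Val
case class Chr(c: Char) extends Val            // written Char(c) in the text
case class Sequ(v1: Val, v2: Val) extends Val  // written Seq(v1, v2) in the text
case class Left(v: Val) extends Val            // shadows Scala's Either constructors here
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// flatten: the underlying string of a value
def flatten(v: Val): List[Char] = v match {
  case Empty        => Nil
  case Chr(c)       => List(c)
  case Sequ(v1, v2) => flatten(v1) ++ flatten(v2)
  case Left(v1)     => flatten(v1)
  case Right(v1)    => flatten(v1)
  case Stars(vs)    => vs.flatMap(flatten)
}
\end{verbatim}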
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
to indicate that a value $v$ could be generated from a lexing algorithm
with input $r$. They call it the value inhabitation relation.
\begin{mathpar}
\inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}

\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}

\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}

\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}

\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}

\inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars(vs):r^*}
\end{mathpar}
\noindent
The condition $|v| \neq []$ in the premise of the star rule
is there to make sure that for a given pair of regular
expression $r$ and string $s$, the number of values
satisfying $|v| = s$ and $\vdash v:r$ is finite.
Given the same string and regular expression, there can be
multiple values for it. For example, both
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
and the values both flatten to $abc$.
Lexers therefore have to disambiguate and choose only
one of the values to output. $\POSIX$ is one of the
disambiguation strategies that is widely adopted.

Ausaf and Urban\parencite{AusafDyckhoffUrban2016}
formalised the property
as a ternary relation.
The $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
in the following set of rules\footnote{The names of the rules are used
as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
\noindent
\begin{figure}
\begin{mathpar}
\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}

\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}

\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}

\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}

\inferrule[PS]{(s_1, r_1) \rightarrow v_1 \\ (s_2, r_2)\rightarrow v_2\\
\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
\Seq \; v_1 \; v_2}

\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}

\inferrule[P*]{(s_1, r) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
(v::vs)}
\end{mathpar}
\caption{POSIX Lexing Rules}
\end{figure}
\noindent
The above $\POSIX$ rules follow the intuition described below:
\begin{itemize}
\item (Left Priority)\\
Match the leftmost regular expression when multiple options for matching
are available.
\item (Maximum Munch)\\
Always match a subpart as much as possible before proceeding
to the next token.
\end{itemize}
\noindent
These disambiguation strategies can be
quite practical.
For instance, when lexing the code snippet
\[
\textit{iffoo} = 3
\]
using the regular expression
\[
\textit{keyword} + \textit{identifier}
\]
(with keyword and identifier having their
usual definitions from any formal
language textbook; for instance,
an identifier is a nonempty string that starts with a letter
and is followed by alphanumeric characters or underscores),
we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword (if)
followed by
an identifier (foo).
POSIX lexing achieves this.

We know that a $\POSIX$ value for a regular expression $r$ is also an inhabitant of $r$:
\begin{lemma}
$(s, r) \rightarrow v \implies \vdash v: r$
\end{lemma}
\noindent
A good property of a $\POSIX$ value is that,
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for them:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
\end{lemma}
\begin{proof}
By induction on $s$, $r$ and $v_1$. The inductive cases
are all the POSIX rules.
Each case is proven by a combination of
the induction rules for $\POSIX$
values and the inductive hypothesis.
Probably the most cumbersome cases are
the sequence and star with non-empty iterations.

We give the reasoning about the sequence case as follows:
when we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$,
we know that there could not be a longer string $s_1'$ such that $(s_1', r_1) \rightarrow v_1'$
and $(s_2', r_2) \rightarrow v_2'$ and $s_1' @ s_2' = s$ all hold.
For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot
possibly form a $\POSIX$ value for $s$.
If we have some other values $v_1'$ and $v_2'$ such that
$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
then by the induction hypothesis $v_1' = v_1$ and $v_2'= v_2$,
which means this ``different'' $\POSIX$ value $\Seq(v_1', v_2')$
is the same as $\Seq(v_1, v_2)$.
\end{proof}
Now we know what a $\POSIX$ value is; the problem is how to
compute such a value in a lexing algorithm using derivatives.

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
& & & v_n
\end{tikzcd}
\end{equation}
\end{ceqn}

It tells us how an empty string can be matched by a
regular expression, in a $\POSIX$ way:

\begin{center}
\begin{tabular}{lcl}
$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\
$\mkeps \; r_{1}+r_{2}$ & $\dn$
& \textit{if} ($\nullable \; r_{1})$\\
& & \textit{then} $\Left (\mkeps \; r_{1})$\\
& & \textit{else} $\Right(\mkeps \; r_{2})$\\
$\mkeps \; r_1\cdot r_2$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
$\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$
\end{tabular}
\end{center}


\noindent
We favour the left child when matching the empty string, if there is a choice.
When a star is used to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.
The result of a call to $\mkeps$ on a $\nullable$ regular expression $r$ is
a $\POSIX$ value for $r$ and the empty string:
\begin{lemma}\label{mePosix}
$\nullable\; r \implies ([], r) \rightarrow (\mkeps\; r)$
\end{lemma}
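
In the Scala sketch, $\mkeps$ can be written as follows (ours, not the formalisation;
the match is deliberately partial, since $\mkeps$ is only ever applied to nullable
regular expressions):
\begin{verbatim}
// Illustrative sketch: mkeps builds the POSIX value for a nullable
// regular expression matching the empty string.
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}
\end{verbatim}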


After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$.
To do this, Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value.
\begin{ceqn}
\begin{equation}\label{graph:inj}
\begin{tikzcd}
r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent
This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses do one thing---identify the ``hole'' in a
value into which the character is injected back.
For instance, in the last clause, for injecting back into a value
that will turn into a new star value corresponding to a star regular expression,
we know it must be a sequence value. And we know that the first
value of that sequence corresponds to the child regex of the star
with the first character chopped off---an iteration of the star
that had just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value with this latest iteration
being added to the previous list of iterations, all under the $\Stars$
top level.
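
In the Scala sketch, $\textit{inj}$ could look like this (again only an illustration;
the match is partial by design, because it is only applied to well-formed
regular-expression/value pairs arising from the derivative phase):
\begin{verbatim}
// Illustrative sketch: inj reverses one derivative step by injecting
// the character c back into the value v.
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty)                 => Chr(d)
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
}
\end{verbatim}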
The POSIX value is maintained throughout the process:
\begin{lemma}\label{injPosix}
$(s, r \backslash c) \rightarrow v \implies (c :: s, r) \rightarrow (\inj\; r \; c\; v)$
\end{lemma}


Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
and taking into consideration the possibility of a non-match,
we have a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if}\; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;(c::s)$ & $=$ & $\textit{case}\; (\lexer\; (r\backslash c)\; s)\; \textit{of}$\\
& & $\quad \None \implies \None$\\
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
\noindent
The central property of the $\lexer$ is that it gives the correct result
according to the $\POSIX$ standard:
\begin{theorem}
The $\lexer$ based on derivatives and injections is both sound and complete:
\begin{tabular}{lcl}
$\lexer \; r \; s = \Some(v)$ & $\Longleftrightarrow$ & $(s, r) \rightarrow v$\\
$\lexer \; r \; s = \None $ & $\Longleftrightarrow$ & $\neg(\exists v.\; (s, r) \rightarrow v)$
\end{tabular}
\end{theorem}


\begin{proof}
By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}
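
In the Scala sketch, the lexer and a small usage example look as follows (ours, not the
formalisation; \texttt{Option} plays the role of $\Some$/$\None$):
\begin{verbatim}
// Illustrative sketch: the injection-based lexer.
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil     => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs).map(v => inj(r, c, v))
}

// e.g. lexing "abc" with (ab + a)(bc + c) yields the POSIX value that
// flattens to "abc" and takes the Left alternative in the first component,
// i.e. Sequ(Left(...), Right(...)) rather than Sequ(Right(...), Left(...)).
\end{verbatim}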


We now give a pictorial view of the algorithm.
For convenience, we employ the following notation: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. The
values built incrementally by \emph{injecting} back the characters into the
earlier values are $v_n, \ldots, v_0$. Corresponding values and regular expressions
always carry the same subscript, i.e.\ $\vdash v_i : r_i$:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
As we did earlier in this chapter for the matcher, one can
introduce simplification on the regular expressions.
However, now we need to do a backward phase and make sure
the values align with the regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications in
each run during the injection phase.

\ChristianComment{Do I introduce the lexer with rectification here?}
We can prove that the POSIX value of how
regular expressions match strings will not be affected---although this is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

However, even with these simplification rules, we could still end up in
trouble, when we encounter cases that require more involved and aggressive
simplifications.
\section{A Case Requiring More Aggressive Simplification}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 9000 nodes (when viewed as a tree)
even with simplification.
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={size},
    legend entries={Size of Derivative},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture}
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
\end{figure}

That is because our lexing algorithm currently keeps a lot of
``useless'' values that are never needed for the final result.
These different ways of matching will grow exponentially with the string length.

For $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
if we do not allow any empty iterations in its lexical values,
there will be $n - 1$ ``splitting points'' on $s$ that we can independently choose to
split or not, so that each sub-string
segmented by those chosen splitting points will form a different iteration.
For example, when $n=4$ we list a few of the many possible splittings:
\begin{center}
\begin{tabular}{lcr}
$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (one iteration)\\
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$ (two iterations)\\
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$ (two iterations)\\
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, a},\, v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
& $\textit{etc}.$ &
\end{tabular}
\end{center}
\noindent
And for each iteration, there are still multiple ways to split
between the two $a^*$s.
It is not surprising that there are exponentially many lexical values
that are distinct for the regular expression and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
A lexer that keeps all possible values will naturally
have an exponential runtime on ambiguous regular expressions.
With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
of a match. This means Sulzmann and Lu's injection-based algorithm
is exponential by nature.
Somehow one has to work out which
lexical values are $\POSIX$ and keep only those in a lexing algorithm.


For example, the above $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
We want to keep this value only, and remove all the regular expression subparts
not corresponding to this value during lexing.
To do this, a two-phase algorithm with rectification is a bit too fragile.
Can we not create those intermediate values $v_1,\ldots v_n$,
and get the lexing information that should be already there while
doing derivatives in one pass, without a second injection phase?
In the meantime, can we ensure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information to the
regular expressions called \emph{bitcodes}.