532
|
1 |
% Chapter Template
|
|
2 |
|
|
3 |
\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
|
|
4 |
|
|
5 |
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
|
|
6 |
%and notations we
|
564
|
7 |
% used for describing the lexing algorithm by Sulzmann and Lu,
|
|
8 |
%and then give the algorithm and its variant and discuss
|
532
|
9 |
%why more aggressive simplifications are needed.
|
|
10 |
|
538
|
11 |
In this chapter, we define the basic notions
|
|
12 |
for regular languages and regular expressions.
|
583
|
13 |
This is essentially a description in ``English''
|
564
|
14 |
of our formalisation in Isabelle/HOL.
|
|
15 |
We also give the definition of what $\POSIX$ lexing means,
|
622
|
16 |
followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
|
564
|
17 |
that produces the output conforming
|
622
|
18 |
to the $\POSIX$ standard\footnote{In what follows
|
583
|
19 |
we choose to use the Isabelle-style notation
|
564
|
20 |
for function applications, where
|
583
|
21 |
the parameters of a function are not enclosed
|
564
|
22 |
inside a pair of parentheses (e.g. $f \;x \;y$
|
|
23 |
instead of $f(x,\;y)$). This is mainly
|
622
|
24 |
to make the text visually more concise.}.
|
532
|
25 |
|
538
|
26 |
\section{Basic Concepts}
|
622
|
27 |
Formal language theory usually starts with an alphabet
|
538
|
28 |
denoting a set of characters.
|
541
|
29 |
Here we just use the datatype of characters from Isabelle,
|
|
30 |
which roughly corresponds to the ASCII characters.
|
564
|
31 |
In what follows, we shall leave the information about the alphabet
|
541
|
32 |
implicit.
|
|
33 |
Then using the usual bracket notation for lists,
|
622
|
34 |
we can define strings made up of characters as:
|
532
|
35 |
\begin{center}
|
|
36 |
\begin{tabular}{lcl}
|
541
|
37 |
$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$
|
532
|
38 |
\end{tabular}
|
|
39 |
\end{center}
|
583
|
40 |
where $c$ is a variable ranging over characters.
|
622
|
41 |
The $::$ stands for list cons and $[]$ for the empty
|
|
42 |
list.
|
|
43 |
A singleton list is sometimes written as $[c]$ for brevity.
|
541
|
44 |
Strings can be concatenated to form longer strings in the same
|
564
|
45 |
way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
|
541
|
46 |
We omit the precise
|
538
|
47 |
recursive definition here.
|
|
48 |
We overload this concatenation operator for two sets of strings:
|
532
|
49 |
\begin{center}
|
|
50 |
\begin{tabular}{lcl}
|
541
|
51 |
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
|
532
|
52 |
\end{tabular}
|
|
53 |
\end{center}
|
538
|
54 |
We also call the above \emph{language concatenation}.
|
532
|
55 |
The power of a language is defined recursively, using the
|
|
56 |
concatenation operator $@$:
|
|
57 |
\begin{center}
|
|
58 |
\begin{tabular}{lcl}
|
|
59 |
$A^0 $ & $\dn$ & $\{ [] \}$\\
|
541
|
60 |
$A^{n+1}$ & $\dn$ & $A @ A^n$
|
532
|
61 |
\end{tabular}
|
|
62 |
\end{center}
|
564
|
63 |
The union of all powers of a language
|
|
64 |
can be used to define the Kleene star operator:
|
532
|
65 |
\begin{center}
|
|
66 |
\begin{tabular}{lcl}
|
536
|
67 |
$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
|
532
|
68 |
\end{tabular}
|
|
69 |
\end{center}
|
|
70 |
|
538
|
71 |
\noindent
|
564
|
72 |
However, to obtain a more convenient induction principle
|
538
|
73 |
in Isabelle/HOL,
|
536
|
74 |
we instead define the Kleene star
|
532
|
75 |
as an inductive set:
|
538
|
76 |
|
532
|
77 |
\begin{center}
|
538
|
78 |
\begin{mathpar}
|
564
|
79 |
\inferrule{\mbox{}}{[] \in A*\\}
|
538
|
80 |
|
564
|
81 |
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
|
538
|
82 |
\end{mathpar}
|
532
|
83 |
\end{center}
|
564
|
84 |
\noindent
|
541
|
85 |
We also define an operation of "chopping off" a character from
|
|
86 |
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
|
532
|
87 |
\begin{center}
|
|
88 |
\begin{tabular}{lcl}
|
|
89 |
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
|
|
90 |
\end{tabular}
|
|
91 |
\end{center}
|
538
|
92 |
\noindent
|
583
|
93 |
This can be generalised to ``chopping off'' a string
|
|
94 |
from all strings within a set $A$,
|
541
|
95 |
namely:
|
532
|
96 |
\begin{center}
|
|
97 |
\begin{tabular}{lcl}
|
541
|
98 |
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
|
532
|
99 |
\end{tabular}
|
|
100 |
\end{center}
|
538
|
101 |
\noindent
|
541
|
102 |
which is essentially the left quotient $A \backslash L$ of $A$ against
|
622
|
103 |
the singleton language with $L = \{s\}$.
|
|
104 |
However, for our purposes here, the $\textit{Ders}$ definition with
|
541
|
105 |
a single string is sufficient.
|
532
|
106 |
|
577
|
107 |
The reason for defining derivatives
|
622
|
108 |
is that they provide another approach
|
577
|
109 |
to test membership of a string in
|
|
110 |
a set of strings.
|
|
111 |
For example, to test whether the string
|
|
112 |
$bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
|
|
113 |
respect to the string $bar$:
|
|
114 |
\begin{center}
|
622
|
115 |
\begin{tabular}{lll}
|
577
|
116 |
$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
|
622
|
117 |
$\{ar, rak\}$ \\
|
|
118 |
& $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
|
|
119 |
& $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
|
|
120 |
%& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
|
577
|
121 |
\end{tabular}
|
|
122 |
\end{center}
|
|
123 |
\noindent
|
|
124 |
and in the end test whether the set
|
622
|
125 |
has the empty string.\footnote{We use the infix notation $A\backslash c$
|
577
|
126 |
instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
|
622
|
127 |
on languages rather than regular expressions.}
|
|
128 |
|
|
129 |
In general, if we have a language $S$,
|
|
130 |
then we can test whether $s$ is in $S$
|
577
|
131 |
by testing whether $[] \in S \backslash s$.
|
564
|
132 |
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
|
532
|
133 |
we have a few properties of how the language derivative can be defined using
|
|
134 |
sub-languages.
|
577
|
135 |
For example, for the sequence operator, we have
|
622
|
136 |
something similar to a ``chain rule'':
|
532
|
137 |
\begin{lemma}
|
536
|
138 |
\[
|
|
139 |
\Der \; c \; (A @ B) =
|
|
140 |
\begin{cases}
|
538
|
141 |
((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\
|
|
142 |
(\Der \; c \; A) \, @ \, B, & \text{otherwise}
|
536
|
143 |
\end{cases}
|
|
144 |
\]
|
532
|
145 |
\end{lemma}
|
|
146 |
\noindent
|
|
147 |
This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
|
|
148 |
and get to $B$.
|
583
|
149 |
The language derivative for $A*$ can be described using the language derivative
|
532
|
150 |
of $A$:
|
|
151 |
\begin{lemma}
|
538
|
152 |
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
|
532
|
153 |
\end{lemma}
|
|
154 |
\begin{proof}
|
583
|
155 |
There are two inclusions to prove:
|
532
|
156 |
\begin{itemize}
|
564
|
157 |
\item{$\subseteq$}:\\
|
532
|
158 |
The set
|
536
|
159 |
\[ \{s \mid c :: s \in A*\} \]
|
532
|
160 |
is enclosed in the set
|
564
|
161 |
\[ \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
|
532
|
162 |
because whenever you have a string starting with a character
|
538
|
163 |
in the language of a Kleene star $A*$,
|
|
164 |
then that character together with some sub-string
|
|
165 |
immediately after it will form the first iteration,
|
|
166 |
and the rest of the string will
|
536
|
167 |
be still in $A*$.
|
564
|
168 |
\item{$\supseteq$}:\\
|
532
|
169 |
Note that
|
538
|
170 |
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
|
583
|
171 |
holds.
|
|
172 |
Also the following holds:
|
536
|
173 |
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
|
564
|
174 |
where the $\textit{RHS}$ can be rewritten
|
|
175 |
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
|
|
176 |
which of course contains $\Der \; c \; A @ A*$.
|
532
|
177 |
\end{itemize}
|
|
178 |
\end{proof}
|
538
|
179 |
|
|
180 |
\noindent
|
622
|
181 |
The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
|
|
182 |
for regular expressions.
|
|
183 |
To introduce them, we need to first give definitions for regular expressions,
|
|
184 |
which we shall do next.
|
532
|
185 |
|
536
|
186 |
\subsection{Regular Expressions and Their Meaning}
|
564
|
187 |
The \emph{basic regular expressions} are defined inductively
|
532
|
188 |
by the following grammar:
|
|
189 |
\[ r ::= \ZERO \mid \ONE
|
|
190 |
\mid c
|
|
191 |
\mid r_1 \cdot r_2
|
|
192 |
\mid r_1 + r_2
|
|
193 |
\mid r^*
|
|
194 |
\]
|
538
|
195 |
\noindent
|
564
|
196 |
We call them basic because we will introduce
|
|
197 |
additional constructors in later chapters such as negation
|
538
|
198 |
and bounded repetitions.
|
564
|
199 |
We use $\ZERO$ for the regular expression that
|
|
200 |
matches no string, and $\ONE$ for the regular
|
622
|
201 |
expression that matches only the empty string.\footnote{
|
|
202 |
Some authors
|
564
|
203 |
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
|
622
|
204 |
but we prefer our notation.}
|
564
|
205 |
The sequence regular expression is written $r_1\cdot r_2$
|
|
206 |
and sometimes we omit the dot if it is clear which
|
|
207 |
regular expression is meant; the alternative
|
|
208 |
is written $r_1 + r_2$.
|
|
209 |
The \emph{language} or meaning of
|
|
210 |
a regular expression is defined recursively as
|
|
211 |
a set of strings:
|
532
|
212 |
%TODO: FILL in the other defs
|
|
213 |
\begin{center}
|
|
214 |
\begin{tabular}{lcl}
|
564
|
215 |
$L \; \ZERO$ & $\dn$ & $\phi$\\
|
|
216 |
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
|
|
217 |
$L \; c$ & $\dn$ & $\{[c]\}$\\
|
622
|
218 |
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
|
|
219 |
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
|
|
220 |
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
|
532
|
221 |
\end{tabular}
|
|
222 |
\end{center}
|
536
|
223 |
\noindent
|
622
|
224 |
%Now with language derivatives of a language and regular expressions and
|
|
225 |
%their language interpretations in place, we are ready to define derivatives on regular expressions.
|
|
226 |
With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.
|
|
227 |
We do so by first introducing what properties it should satisfy.
|
|
228 |
|
532
|
229 |
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
|
564
|
230 |
%Recall, the language derivative acts on a set of strings
|
|
231 |
%and essentially chops off a particular character from
|
|
232 |
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
|
|
233 |
%so that after derivative $L(r\backslash c)$
|
|
234 |
%will look as if it was obtained by doing a language derivative on $L(r)$:
|
622
|
235 |
%Recall that the language derivative acts on a
|
|
236 |
%language (set of strings).
|
|
237 |
%One can decide whether a string $s$ belongs
|
|
238 |
%to a language $S$ by taking derivative with respect to
|
|
239 |
%that string and then checking whether the empty
|
|
240 |
%string is in the derivative:
|
|
241 |
%\begin{center}
|
|
242 |
%\parskip \baselineskip
|
|
243 |
%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
|
|
244 |
%\def\rlwd{.5pt}
|
|
245 |
%\newcommand\notate[3]{%
|
|
246 |
% \unskip\def\useanchorwidth{T}%
|
|
247 |
% \setbox0=\hbox{#1}%
|
|
248 |
% \def\stackalignment{c}\stackunder[-6pt]{%
|
|
249 |
% \def\stackalignment{c}\stackunder[-1.5pt]{%
|
|
250 |
% \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
|
|
251 |
% \rule{\rlwd}{#2\baselineskip}}}{%
|
|
252 |
% \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
|
|
253 |
%}
|
|
254 |
%\Longstack{
|
|
255 |
%\notate{$\{ \ldots ,\;$
|
|
256 |
% \notate{s}{1}{$(c_1 :: s_1)$}
|
|
257 |
% $, \; \ldots \}$
|
|
258 |
%}{1}{$S_{start}$}
|
|
259 |
%}
|
|
260 |
%\Longstack{
|
|
261 |
% $\stackrel{\backslash c_1}{\longrightarrow}$
|
|
262 |
%}
|
|
263 |
%\Longstack{
|
|
264 |
% $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$}
|
|
265 |
% $,\; \ldots \}$
|
|
266 |
%}
|
|
267 |
%\Longstack{
|
|
268 |
% $\stackrel{\backslash c_2}{\longrightarrow}$
|
|
269 |
%}
|
|
270 |
%\Longstack{
|
|
271 |
% $\{ \ldots,\; s_2
|
|
272 |
% ,\; \ldots \}$
|
|
273 |
%}
|
|
274 |
%\Longstack{
|
|
275 |
% $ \xdashrightarrow{\backslash c_3\ldots\ldots} $
|
|
276 |
%}
|
|
277 |
%\Longstack{
|
|
278 |
% \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
|
|
279 |
% S_{start}\backslash s$}
|
|
280 |
%}
|
|
281 |
%\end{center}
|
|
282 |
%\begin{center}
|
|
283 |
% $s \in S_{start} \iff [] \in S_{end}$
|
|
284 |
%\end{center}
|
|
285 |
%\noindent
|
|
286 |
Brzozowski noticed that $\Der$
|
579
|
287 |
can be ``mirrored'' on regular expressions which
|
564
|
288 |
he calls the derivative of a regular expression $r$
|
|
289 |
with respect to a character $c$, written
|
577
|
290 |
$r \backslash c$. This infix operator
|
|
291 |
takes an original regular expression $r$ as input
|
|
292 |
and a character as a right operand and
|
|
293 |
outputs a result, which is a new regular expression.
|
|
294 |
The derivative operation on regular expression
|
|
295 |
is defined such that the language of the derivative result
|
|
296 |
coincides with the language of the original
|
579
|
297 |
regular expression being taken
|
577
|
298 |
derivative with respect to the same character:
|
579
|
299 |
\begin{property}
|
|
300 |
|
|
301 |
\[
|
|
302 |
L \; (r \backslash c) = \Der \; c \; (L \; r)
|
|
303 |
\]
|
|
304 |
\end{property}
|
|
305 |
\noindent
|
622
|
306 |
Pictorially, this looks as follows:\\
|
|
307 |
\vspace{3mm}
|
579
|
308 |
|
577
|
309 |
\parskip \baselineskip
|
|
310 |
\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
|
|
311 |
\def\rlwd{.5pt}
|
|
312 |
\newcommand\notate[3]{%
|
|
313 |
\unskip\def\useanchorwidth{T}%
|
|
314 |
\setbox0=\hbox{#1}%
|
|
315 |
\def\stackalignment{c}\stackunder[-6pt]{%
|
|
316 |
\def\stackalignment{c}\stackunder[-1.5pt]{%
|
|
317 |
\stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
|
|
318 |
\rule{\rlwd}{#2\baselineskip}}}{%
|
|
319 |
\strut\kern8pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
|
|
320 |
}
|
|
321 |
\Longstack{
|
|
322 |
\notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1,
|
|
323 |
\;\ldots\}$}
|
|
324 |
}
|
|
325 |
\Longstack{
|
579
|
326 |
$\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$
|
577
|
327 |
}
|
|
328 |
\Longstack{
|
579
|
329 |
\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
|
577
|
330 |
\{\ldots,\;s_1,\;\ldots\}$}
|
622
|
331 |
}\\
|
|
332 |
\vspace{ 3mm }
|
|
333 |
|
579
|
334 |
The derivatives on regular expression can again be
|
|
335 |
generalised to a string.
|
622
|
336 |
One could compute $r_{start} \backslash s$ and test membership of $s$
|
|
337 |
in $L \; r_{start}$ by checking
|
579
|
338 |
whether the empty string is in the language of
|
622
|
339 |
$r_{end}$ ($r_{start}\backslash s$).\\
|
579
|
340 |
|
622
|
341 |
\vspace{2mm}
|
579
|
342 |
\Longstack{
|
|
343 |
\notate{$r_{start}$}{4}{
|
|
344 |
\Longstack{$L \; r_{start} = \{\ldots, \;$
|
|
345 |
\notate{$s$}{1}{$c_1::s_1$}
|
|
346 |
$, \ldots\} $
|
|
347 |
}
|
|
348 |
}
|
|
349 |
}
|
|
350 |
\Longstack{
|
|
351 |
$\stackrel{\backslash c_1}{ \xrightarrow{\hspace*{1.8cm}} }$
|
|
352 |
}
|
|
353 |
\Longstack{
|
|
354 |
\notate{$r_1$}{3}{
|
|
355 |
$r_1 = r_{start}\backslash c_1$,
|
|
356 |
$L \; r_1 = $
|
|
357 |
\Longstack{
|
|
358 |
$\{ \ldots,\;$ \notate{$s_1$}{1}{$c_2::s_2$}
|
|
359 |
$,\; \ldots \}$
|
|
360 |
}
|
|
361 |
}
|
|
362 |
}
|
|
363 |
\Longstack{
|
|
364 |
$\stackrel{\backslash c_2}{\xrightarrow{\hspace*{1.8cm}}}$
|
|
365 |
}
|
|
366 |
\Longstack{
|
|
367 |
$r_2$
|
|
368 |
}
|
|
369 |
\Longstack{
|
|
370 |
$ \xdashrightarrow{\hspace*{0.3cm} \backslash c_3 \ldots \ldots \ldots \hspace*{0.3cm}} $
|
|
371 |
}
|
|
372 |
\Longstack{
|
|
373 |
\notate{$r_{end}$}{1}{
|
|
374 |
$L \; r_{end} = \{\ldots, \; [], \ldots\}$}
|
|
375 |
}
|
|
376 |
|
|
377 |
|
|
378 |
|
|
379 |
\begin{property}
|
|
380 |
$s \in L \; r_{start} \iff [] \in L \; r_{end}$
|
|
381 |
\end{property}
|
|
382 |
\noindent
|
622
|
383 |
Next we give the recursive definition of derivative on
|
579
|
384 |
regular expressions, so that it satisfies the properties above.
|
|
385 |
The derivative function, written $r\backslash c$,
|
583
|
386 |
takes a regular expression $r$ and character $c$, and
|
|
387 |
returns a new regular expression representing
|
|
388 |
the original regular expression's language $L \; r$
|
|
389 |
being taken the language derivative with respect to $c$.
|
626
|
390 |
\begin{table}
|
|
391 |
\begin{center}
|
579
|
392 |
\begin{tabular}{lcl}
|
|
393 |
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
|
|
394 |
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
|
|
395 |
$d \backslash c$ & $\dn$ &
|
|
396 |
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
|
|
397 |
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
|
|
398 |
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
|
|
399 |
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
|
|
400 |
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
|
|
401 |
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
|
|
402 |
\end{tabular}
|
626
|
403 |
\end{center}
|
|
404 |
\caption{Derivative on Regular Expressions}
|
|
405 |
\label{table:der}
|
|
406 |
\end{table}
|
564
|
407 |
\noindent
|
579
|
408 |
The most involved cases are the sequence case
|
|
409 |
and the star case.
|
|
410 |
The sequence case says that if the first regular expression
|
|
411 |
contains an empty string, then the second component of the sequence
|
|
412 |
needs to be considered, as its derivative will contribute to the
|
|
413 |
result of this derivative:
|
|
414 |
\begin{center}
|
|
415 |
\begin{tabular}{lcl}
|
583
|
416 |
$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ &
|
|
417 |
$\textit{if}\;\,([] \in L(r_1))\;
|
|
418 |
\textit{then} \; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
|
579
|
419 |
& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
|
|
420 |
\end{tabular}
|
|
421 |
\end{center}
|
|
422 |
\noindent
|
|
423 |
Notice how this closely resembles
|
|
424 |
the language derivative operation $\Der$:
|
564
|
425 |
\begin{center}
|
|
426 |
\begin{tabular}{lcl}
|
|
427 |
$\Der \; c \; (A @ B)$ & $\dn$ &
|
|
428 |
$ \textit{if} \;\, [] \in A \;
|
|
429 |
\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
|
|
430 |
\Der \; c\; B$\\
|
|
431 |
& & $\textit{else}\; (\Der \; c \; A) @ B$\\
|
|
432 |
\end{tabular}
|
|
433 |
\end{center}
|
|
434 |
\noindent
|
583
|
435 |
The derivative of the star regular expression $r^*$
|
579
|
436 |
unwraps one iteration of $r$, turns it into $r\backslash c$,
|
|
437 |
and attaches the original $r^*$
|
|
438 |
after $r\backslash c$, so that
|
|
439 |
we can further unfold it as many times as needed:
|
|
440 |
\[
|
|
441 |
(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
|
|
442 |
\]
|
|
443 |
Again,
|
583
|
444 |
the structure is the same as the language derivative of Kleene star:
|
532
|
445 |
\[
|
564
|
446 |
\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
|
532
|
447 |
\]
|
564
|
448 |
In the above definition of $(r_1\cdot r_2) \backslash c$,
|
|
449 |
the $\textit{if}$ clause's
|
|
450 |
boolean condition
|
|
451 |
$[] \in L(r_1)$ needs to be
|
|
452 |
somehow recursively computed.
|
|
453 |
We call such a function that checks
|
|
454 |
whether the empty string $[]$ is
|
|
455 |
in the language of a regular expression $\nullable$:
|
|
456 |
\begin{center}
|
|
457 |
\begin{tabular}{lcl}
|
|
458 |
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
|
|
459 |
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
|
|
460 |
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
|
|
461 |
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
|
|
462 |
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
|
|
463 |
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
|
|
464 |
\end{tabular}
|
|
465 |
\end{center}
|
|
466 |
\noindent
|
|
467 |
The $\ZERO$ regular expression
|
|
468 |
does not contain any string and
|
|
469 |
therefore is not \emph{nullable}.
|
|
470 |
$\ONE$ is \emph{nullable}
|
|
471 |
by definition.
|
|
472 |
The character regular expression $c$
|
|
473 |
corresponds to the singleton set $\{c\}$,
|
|
474 |
and therefore does not contain the empty string.
|
|
475 |
The alternative regular expression is nullable
|
|
476 |
if at least one of its children is nullable.
|
|
477 |
The sequence regular expression
|
|
478 |
would require both children to have the empty string
|
|
479 |
to compose an empty string, and the Kleene star
|
|
480 |
is always nullable because it naturally
|
579
|
481 |
contains the empty string.
|
532
|
482 |
\noindent
|
564
|
483 |
We have the following correspondence between
|
|
484 |
derivatives on regular expressions and
|
|
485 |
derivatives on a set of strings:
|
|
486 |
\begin{lemma}\label{derDer}
|
608
|
487 |
\mbox{}
|
579
|
488 |
\begin{itemize}
|
|
489 |
\item
|
532
|
490 |
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
|
579
|
491 |
\item
|
|
492 |
$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
|
|
493 |
\end{itemize}
|
532
|
494 |
\end{lemma}
|
579
|
495 |
\begin{proof}
|
|
496 |
By induction on $r$.
|
|
497 |
\end{proof}
|
532
|
498 |
\noindent
|
622
|
499 |
which is the main property of derivatives
|
|
500 |
that enables us to reason about the correctness of
|
|
501 |
derivative-based matching.
|
532
|
502 |
We can generalise the derivative operation shown above for single characters
|
|
503 |
to strings as follows:
|
|
504 |
|
|
505 |
\begin{center}
|
|
506 |
\begin{tabular}{lcl}
|
|
507 |
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
|
583
|
508 |
$r \backslash_s [\,] $ & $\dn$ & $r$
|
532
|
509 |
\end{tabular}
|
|
510 |
\end{center}
|
|
511 |
|
|
512 |
\noindent
|
564
|
513 |
When there is no ambiguity, we will
|
|
514 |
omit the subscript and use $\backslash$ instead
|
583
|
515 |
of $\backslash_s$ to denote
|
532
|
516 |
string derivatives for brevity.
|
622
|
517 |
Brzozowski's derivative-based
|
|
518 |
regular-expression matching algorithm can then be described as:
|
532
|
519 |
|
|
520 |
\begin{definition}
|
564
|
521 |
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
|
532
|
522 |
\end{definition}
|
|
523 |
|
|
524 |
\noindent
|
|
525 |
Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$,
|
|
526 |
this algorithm presented graphically is as follows:
|
|
527 |
|
601
|
528 |
\begin{equation}\label{matcher}
|
532
|
529 |
\begin{tikzcd}
|
583
|
530 |
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] &
|
|
531 |
r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] &
|
|
532 |
\;\textrm{true}/\textrm{false}
|
532
|
533 |
\end{tikzcd}
|
|
534 |
\end{equation}
|
|
535 |
|
|
536 |
\noindent
|
622
|
537 |
It is relatively
|
|
538 |
easy to show that this matcher is correct, namely
|
539
|
539 |
\begin{lemma}
|
564
|
540 |
$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
|
539
|
541 |
\end{lemma}
|
|
542 |
\begin{proof}
|
583
|
543 |
By induction on $s$ using property of derivatives:
|
|
544 |
lemma \ref{derDer}.
|
539
|
545 |
\end{proof}
|
601
|
546 |
\begin{figure}
|
564
|
547 |
\begin{center}
|
539
|
548 |
\begin{tikzpicture}
|
|
549 |
\begin{axis}[
|
|
550 |
xlabel={$n$},
|
|
551 |
ylabel={time in secs},
|
601
|
552 |
%ymode = log,
|
539
|
553 |
legend entries={Naive Matcher},
|
|
554 |
legend pos=north west,
|
|
555 |
legend cell align=left]
|
|
556 |
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
|
|
557 |
\end{axis}
|
|
558 |
\end{tikzpicture}
|
583
|
559 |
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
|
|
560 |
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
|
|
561 |
$ using Brzozowski's original algorithm}\label{NaiveMatcher}
|
601
|
562 |
\end{center}
|
539
|
563 |
\end{figure}
|
|
564 |
\noindent
|
564
|
565 |
If we implement the above algorithm naively, however,
|
|
566 |
the algorithm can be excruciatingly slow, as shown in
|
|
567 |
\ref{NaiveMatcher}.
|
|
568 |
Note that both axes are in logarithmic scale.
|
|
569 |
Around two dozens characters
|
622
|
570 |
would already ``explode'' the matcher with the regular expression
|
564
|
571 |
$(a^*)^*b$.
|
622
|
572 |
To improve this situation, we need to introduce simplification
|
|
573 |
rules for the intermediate results,
|
539
|
574 |
such as $r + r \rightarrow r$,
|
|
575 |
and make sure those rules do not change the
|
|
576 |
language of the regular expression.
|
564
|
577 |
One simpled-minded simplification function
|
622
|
578 |
that achieves these requirements
|
|
579 |
is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
|
564
|
580 |
\begin{center}
|
|
581 |
\begin{tabular}{lcl}
|
|
582 |
$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
|
|
583 |
$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
|
|
584 |
& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
|
|
585 |
& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
|
|
586 |
& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
|
|
587 |
& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
|
|
588 |
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
|
|
589 |
$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
|
|
590 |
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
|
|
591 |
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
|
|
592 |
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
|
583
|
593 |
$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
|
564
|
594 |
|
|
595 |
\end{tabular}
|
|
596 |
\end{center}
|
|
597 |
If we repeatedly apply this simplification
|
|
598 |
function during the matching algorithm,
|
|
599 |
we have a matcher with simplification:
|
|
600 |
\begin{center}
|
|
601 |
\begin{tabular}{lcl}
|
|
602 |
$\derssimp \; [] \; r$ & $\dn$ & $r$\\
|
|
603 |
$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
|
|
604 |
$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
|
|
605 |
\end{tabular}
|
|
606 |
\end{center}
|
|
607 |
\begin{figure}
|
539
|
608 |
\begin{tikzpicture}
|
|
609 |
\begin{axis}[
|
|
610 |
xlabel={$n$},
|
|
611 |
ylabel={time in secs},
|
601
|
612 |
%ymode = log,
|
|
613 |
%xmode = log,
|
564
|
614 |
grid = both,
|
539
|
615 |
legend entries={Matcher With Simp},
|
|
616 |
legend pos=north west,
|
|
617 |
legend cell align=left]
|
|
618 |
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
|
|
619 |
\end{axis}
|
564
|
620 |
\end{tikzpicture}
|
|
621 |
\caption{$(a^*)^*b$
|
|
622 |
against
|
|
623 |
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
|
|
624 |
\end{figure}
|
|
625 |
\noindent
|
|
626 |
The running time of $\textit{ders}\_\textit{simp}$
|
583
|
627 |
on the same example of Figure \ref{NaiveMatcher}
|
|
628 |
is now ``tame'' in terms of the length of inputs,
|
|
629 |
as shown in Figure \ref{BetterMatcher}.
|
539
|
630 |
|
622
|
631 |
So far the story is use Brzozowski derivatives and
|
|
632 |
simplify as much as possible and at the end test
|
|
633 |
whether the empty string is recognised
|
|
634 |
by the final derivative.
|
583
|
635 |
But what if we want to
|
|
636 |
do lexing instead of just getting a true/false answer?
|
622
|
637 |
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
|
|
638 |
elegant (arguably as beautiful as the definition of the
|
|
639 |
Brzozowski derivative) solution for this.
|
538
|
640 |
|
539
|
641 |
\section{Values and the Lexing Algorithm by Sulzmann and Lu}
|
564
|
642 |
In this section, we present a two-phase regular expression lexing
|
|
643 |
algorithm.
|
|
644 |
The first phase takes successive derivatives with
|
|
645 |
respect to the input string,
|
|
646 |
and the second phase does the reverse, \emph{injecting} back
|
|
647 |
characters, in the meantime constructing a lexing result.
|
|
648 |
We will introduce the injection phase in detail slightly
|
|
649 |
later, but as a preliminary we have to first define
|
|
650 |
the datatype for lexing results,
|
|
651 |
called \emph{value} or
|
608
|
652 |
sometimes also \emph{lexical value}.
|
|
653 |
Values and regular
|
|
654 |
expressions correspond to each other
|
|
655 |
as illustrated in the following
|
538
|
656 |
table:
|
|
657 |
|
|
658 |
\begin{center}
|
|
659 |
\begin{tabular}{c@{\hspace{20mm}}c}
|
|
660 |
\begin{tabular}{@{}rrl@{}}
|
|
661 |
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
|
|
662 |
$r$ & $::=$ & $\ZERO$\\
|
|
663 |
& $\mid$ & $\ONE$ \\
|
|
664 |
& $\mid$ & $c$ \\
|
|
665 |
& $\mid$ & $r_1 \cdot r_2$\\
|
|
666 |
& $\mid$ & $r_1 + r_2$ \\
|
|
667 |
\\
|
|
668 |
& $\mid$ & $r^*$ \\
|
|
669 |
\end{tabular}
|
|
670 |
&
|
|
671 |
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
|
|
672 |
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
|
|
673 |
$v$ & $::=$ & \\
|
|
674 |
& & $\Empty$ \\
|
591
|
675 |
& $\mid$ & $\Char \; c$ \\
|
538
|
676 |
& $\mid$ & $\Seq\,v_1\, v_2$\\
|
591
|
677 |
& $\mid$ & $\Left \;v$ \\
|
|
678 |
& $\mid$ & $\Right\;v$ \\
|
538
|
679 |
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
|
|
680 |
\end{tabular}
|
|
681 |
\end{tabular}
|
|
682 |
\end{center}
|
|
683 |
\noindent
|
564
|
684 |
A value has an underlying string, which
|
|
685 |
can be calculated by the ``flatten" function $|\_|$:
|
|
686 |
\begin{center}
|
|
687 |
\begin{tabular}{lcl}
|
|
688 |
$|\Empty|$ & $\dn$ & $[]$\\
|
|
689 |
$|\Char \; c|$ & $ \dn$ & $ [c]$\\
|
591
|
690 |
$|\Seq \; v_1, \;v_2|$ & $ \dn$ & $ v_1| @ |v_2|$\\
|
|
691 |
$|\Left \; v|$ & $ \dn$ & $ |v|$\\
|
|
692 |
$|\Right \; v|$ & $ \dn$ & $ |v|$\\
|
|
693 |
$|\Stars \; []|$ & $\dn$ & $[]$\\
|
|
694 |
$|\Stars \; v::vs|$ & $\dn$ & $ |v| @ |\Stars(vs)|$
|
564
|
695 |
\end{tabular}
|
|
696 |
\end{center}
|
|
697 |
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
|
|
698 |
to indicate that a value $v$ could be generated from a lexing algorithm
|
622
|
699 |
with input $r$. They call it the value inhabitation relation,
|
|
700 |
defined by the rules.
|
628
|
701 |
\begin{figure}[H]
|
538
|
702 |
\begin{mathpar}
|
591
|
703 |
\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
|
564
|
704 |
|
|
705 |
\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}
|
|
706 |
|
591
|
707 |
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}
|
564
|
708 |
|
591
|
709 |
\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
|
564
|
710 |
|
591
|
711 |
\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
|
538
|
712 |
|
591
|
713 |
\inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars \; vs : r^*}
|
564
|
714 |
\end{mathpar}
|
628
|
715 |
\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
|
626
|
716 |
\end{figure}
|
564
|
717 |
\noindent
|
|
718 |
The condition $|v| \neq []$ in the premise of star's rule
|
|
719 |
is to make sure that for a given pair of regular
|
|
720 |
expression $r$ and string $s$, the number of values
|
|
721 |
satisfying $|v| = s$ and $\vdash v:r$ is finite.
|
601
|
722 |
This additional condition was
|
|
723 |
imposed by Ausaf and Urban to make their proofs easier.
|
622
|
724 |
Given a string and a regular expression, there can be
|
564
|
725 |
multiple values for it. For example, both
|
|
726 |
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
|
|
727 |
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
|
|
728 |
and the values both flatten to $abc$.
|
|
729 |
Lexers therefore have to disambiguate and choose only
|
622
|
730 |
one of the values be generated. $\POSIX$ is one of the
|
564
|
731 |
disambiguation strategies that is widely adopted.
|
|
732 |
|
622
|
733 |
Ausaf et al.\parencite{AusafDyckhoffUrban2016}
|
564
|
734 |
formalised the property
|
|
735 |
as a ternary relation.
|
|
736 |
The $\POSIX$ value $v$ for a regular expression
|
538
|
737 |
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
|
622
|
738 |
in the following rules\footnote{The names of the rules are used
|
|
739 |
as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
|
|
740 |
\begin{figure}[p]
|
564
|
741 |
\begin{mathpar}
|
|
742 |
\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
|
|
743 |
|
|
744 |
\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
|
|
745 |
|
|
746 |
\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
|
|
747 |
|
|
748 |
\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
|
|
749 |
|
|
750 |
\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
|
|
751 |
\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
|
|
752 |
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
|
|
753 |
\Seq \; v_1 \; v_2}
|
|
754 |
|
|
755 |
\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
|
|
756 |
|
|
757 |
\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
|
|
758 |
|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
|
|
759 |
s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
|
|
760 |
(v::vs)}
|
|
761 |
\end{mathpar}
|
622
|
762 |
\caption{The inductive POSIX rules given by Ausaf et al.
|
|
763 |
\cite{AusafDyckhoffUrban2016}.
|
|
764 |
This ternary relation, written $(s, r) \rightarrow v$,
|
|
765 |
formalises the POSIX constraints on the
|
601
|
766 |
value $v$ given a string $s$ and
|
|
767 |
regular expression $r$.
|
|
768 |
}
|
622
|
769 |
\end{figure}\afterpage{\clearpage}
|
538
|
770 |
\noindent
|
579
|
771 |
|
622
|
772 |
%\begin{figure}
|
|
773 |
%\begin{tikzpicture}[]
|
|
774 |
% \node [minimum width = 6cm, rectangle split, rectangle split horizontal,
|
|
775 |
% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
|
|
776 |
% (node1)
|
|
777 |
% {$r_{token1}$
|
|
778 |
% \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ };
|
|
779 |
% %\node [left = 6.0cm of node1] (start1) {hi};
|
|
780 |
% \node [left = 0.2cm of node1] (middle) {$v.s.$};
|
|
781 |
% \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal,
|
|
782 |
% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
|
|
783 |
% (node2)
|
|
784 |
% {$\quad\;\;\;r_{token1}\quad\;\;\;$
|
|
785 |
% \nodepart{two} $r_{token2}$ };
|
|
786 |
% \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
|
|
787 |
% \node [above = 1.5cm of middle, minimum width = 6cm,
|
|
788 |
% rectangle, style={draw, rounded corners, inner sep=10pt}]
|
|
789 |
% (topNode) {$s$};
|
|
790 |
% \path[->,draw]
|
|
791 |
% (topNode) edge node {split $A$} (node2)
|
|
792 |
% (topNode) edge node {split $B$} (node1)
|
|
793 |
% ;
|
|
794 |
%
|
|
795 |
%
|
|
796 |
%\end{tikzpicture}
|
|
797 |
%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
|
|
798 |
%\end{figure}
|
564
|
799 |
The above $\POSIX$ rules follows the intuition described below:
|
538
|
800 |
\begin{itemize}
|
564
|
801 |
\item (Left Priority)\\
|
|
802 |
Match the leftmost regular expression when multiple options of matching
|
622
|
803 |
are available. See P+L and P+R where in P+R $s$ cannot
|
|
804 |
be in the language of $L \; r_1$.
|
564
|
805 |
\item (Maximum munch)\\
|
|
806 |
Always match a subpart as much as possible before proceeding
|
622
|
807 |
to the next part of the string.
|
579
|
808 |
For example, when the string $s$ matches
|
622
|
809 |
$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
|
|
810 |
Then the split that matches a longer string for the first part
|
|
811 |
$r_{part1}$ is preferred by this maximum munch rule.
|
|
812 |
This is caused by the side-condition
|
|
813 |
\begin{center}
|
|
814 |
$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
|
|
815 |
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
|
|
816 |
\end{center}
|
|
817 |
in PS.
|
|
818 |
%(See
|
|
819 |
%\ref{munch} for an illustration).
|
538
|
820 |
\end{itemize}
|
564
|
821 |
\noindent
|
|
822 |
These disambiguation strategies can be
|
|
823 |
quite practical.
|
538
|
824 |
For instance, when lexing a code snippet
|
564
|
825 |
\[
|
|
826 |
\textit{iffoo} = 3
|
|
827 |
\]
|
622
|
828 |
using a regular expression
|
|
829 |
for keywords and
|
|
830 |
identifiers:
|
|
831 |
%(for example, keyword is a nonempty string starting with letters
|
|
832 |
%followed by alphanumeric characters or underscores):
|
564
|
833 |
\[
|
622
|
834 |
r_{keyword} + r_{identifier}.
|
564
|
835 |
\]
|
622
|
836 |
If we want $\textit{iffoo}$ to be recognized
|
|
837 |
as an identifier
|
|
838 |
where identifiers are defined as usual (letters
|
|
839 |
followed by letters, numbers or underscores),
|
|
840 |
then a match with a keyword (if)
|
564
|
841 |
followed by
|
622
|
842 |
an identifier (foo) would be incorrect.
|
|
843 |
POSIX lexing would generate what we want.
|
564
|
844 |
|
622
|
845 |
\noindent
|
|
846 |
We know that a POSIX
|
|
847 |
value for regular expression $r$ is inhabited by $r$.
|
541
|
848 |
\begin{lemma}
|
|
849 |
$(r, s) \rightarrow v \implies \vdash v: r$
|
|
850 |
\end{lemma}
|
|
851 |
\noindent
|
622
|
852 |
The main property about a $\POSIX$ value is that
|
538
|
853 |
given the same regular expression $r$ and string $s$,
|
|
854 |
one can always uniquely determine the $\POSIX$ value for it:
|
|
855 |
\begin{lemma}
|
|
856 |
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
|
|
857 |
\end{lemma}
|
539
|
858 |
\begin{proof}
|
564
|
859 |
By induction on $s$, $r$ and $v_1$. The inductive cases
|
|
860 |
are all the POSIX rules.
|
|
861 |
Probably the most cumbersome cases are
|
|
862 |
the sequence and star with non-empty iterations.
|
567
|
863 |
We shall give the details for proving the sequence case here.
|
539
|
864 |
|
567
|
865 |
When we have
|
|
866 |
\[
|
|
867 |
(s_1, r_1) \rightarrow v_1 \;\, and \;\,
|
601
|
868 |
(s_2, r_2) \rightarrow v_2 \;\, and \;\,
|
567
|
869 |
\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
|
|
870 |
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
|
|
871 |
\]
|
|
872 |
we know that the last condition
|
|
873 |
excludes the possibility of a
|
|
874 |
string $s_1'$ longer than $s_1$ such that
|
|
875 |
\[
|
|
876 |
(s_1', r_1) \rightarrow v_1' \;\;
|
|
877 |
and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s
|
|
878 |
\]
|
|
879 |
hold.
|
|
880 |
A shorter string $s_1''$ with $s_2''$ satisfying
|
|
881 |
\[
|
|
882 |
(s_1'', r_1) \rightarrow v_1''
|
|
883 |
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s
|
|
884 |
\]
|
|
885 |
cannot possibly form a $\POSIX$ value either, because
|
|
886 |
by definition there is a candidate
|
|
887 |
with longer initial string
|
|
888 |
$s_1$. Therefore, we know that the POSIX
|
|
889 |
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
|
|
890 |
$s$ must have the
|
|
891 |
property that
|
|
892 |
\[
|
|
893 |
|a| = s_1 \;\; and \;\; |b| = s_2.
|
|
894 |
\]
|
|
895 |
The goal is to prove that $a = v_1 $ and $b = v_2$.
|
|
896 |
If we have some other POSIX values $v_{10}$ and $v_{20}$ such that
|
|
897 |
$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
|
|
898 |
then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$,
|
|
899 |
which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
|
539
|
900 |
is the same as $\Seq(v_1, v_2)$.
|
|
901 |
\end{proof}
|
567
|
902 |
\noindent
|
622
|
903 |
We have now defined what a POSIX value is and shown that it is unique,
|
|
904 |
the problem is to generate
|
567
|
905 |
such a value in a lexing algorithm using derivatives.
|
538
|
906 |
|
|
907 |
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
|
|
908 |
|
567
|
909 |
Sulzmann and Lu extended Brzozowski's
|
|
910 |
derivative-based matching
|
622
|
911 |
to a lexing algorithm by a second phase
|
567
|
912 |
after the initial phase of successive derivatives.
|
|
913 |
This second phase generates a POSIX value
|
|
914 |
if the regular expression matches the string.
|
538
|
915 |
Two functions are involved: $\inj$ and $\mkeps$.
|
622
|
916 |
The function $\mkeps$ constructs a POSIX value from the last
|
567
|
917 |
derivative $r_n$:
|
538
|
918 |
\begin{ceqn}
|
|
919 |
\begin{equation}\label{graph:mkeps}
|
|
920 |
\begin{tikzcd}
|
567
|
921 |
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
|
538
|
922 |
& & & v_n
|
|
923 |
\end{tikzcd}
|
|
924 |
\end{equation}
|
|
925 |
\end{ceqn}
|
567
|
926 |
\noindent
|
|
927 |
In the above diagram, again we assume that
|
|
928 |
the input string $s$ is made of $n$ characters
|
622
|
929 |
$c_0c_1 \ldots c_{n-1}$
|
|
930 |
The last derivative operation
|
|
931 |
$\backslash c_{n-1}$ generates the derivative $r_n$, for which
|
|
932 |
$\mkeps$ produces the value $v_n$. This value
|
|
933 |
tells us how the empty string is matched by the (nullable)
|
|
934 |
regular expression $r_n$, in a POSIX way.
|
567
|
935 |
The definition of $\mkeps$ is
|
538
|
936 |
\begin{center}
|
|
937 |
\begin{tabular}{lcl}
|
564
|
938 |
$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\
|
567
|
939 |
$\mkeps \; (r_{1}+r_{2})$ & $\dn$
|
|
940 |
& $\textit{if}\; (\nullable \; r_{1}) \;\,
|
|
941 |
\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\
|
|
942 |
& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
|
|
943 |
$\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
|
564
|
944 |
$\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$
|
538
|
945 |
\end{tabular}
|
|
946 |
\end{center}
|
|
947 |
|
|
948 |
|
|
949 |
\noindent
|
622
|
950 |
The function prefers the left child $r_1$ of $r_1 + r_2$
|
567
|
951 |
to match an empty string if there is a choice.
|
622
|
952 |
When there is a star to match the empty string,
|
538
|
953 |
we give the $\Stars$ constructor an empty list, meaning
|
567
|
954 |
no iteration is taken.
|
622
|
955 |
The result of $\mkeps$ on a $\nullable$ $r$
|
|
956 |
is a POSIX value for $r$ and the empty string:
|
567
|
957 |
\begin{lemma}\label{mePosix}
|
|
958 |
$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
|
|
959 |
\end{lemma}
|
|
960 |
\begin{proof}
|
622
|
961 |
By induction on $r$.
|
567
|
962 |
\end{proof}
|
|
963 |
\noindent
|
622
|
964 |
After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
|
567
|
965 |
in reverse order as they were chopped off in the derivative phase.
|
622
|
966 |
The fucntion for this is called $\inj$. This function
|
|
967 |
operates on values, unlike $\backslash$ which operates on regular expressions.
|
567
|
968 |
In the diagram below, $v_i$ stands for the (POSIX) value
|
|
969 |
for how the regular expression
|
|
970 |
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
|
|
971 |
of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
|
538
|
972 |
After injecting back $n$ characters, we get the lexical value for how $r_0$
|
|
973 |
matches $s$.
|
601
|
974 |
\begin{figure}[H]
|
|
975 |
\begin{center}
|
538
|
976 |
\begin{ceqn}
|
|
977 |
\begin{tikzcd}
|
567
|
978 |
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
|
|
979 |
v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
|
538
|
980 |
\end{tikzcd}
|
|
981 |
\end{ceqn}
|
601
|
982 |
\end{center}
|
|
983 |
\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
|
|
984 |
matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
|
|
985 |
The first phase involves taking successive derivatives w.r.t the characters $c_0$,
|
|
986 |
$c_1$, and so on. These are the same operations as they have appeared in the matcher
|
|
987 |
\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
|
|
988 |
then the second phase starts. First $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
|
|
989 |
the empty string , by always selecting the leftmost
|
|
990 |
nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they
|
|
991 |
appeared in the string, always preserving POSIXness.}\label{graph:inj}
|
|
992 |
\end{figure}
|
538
|
993 |
\noindent
|
623
|
994 |
The function $\textit{inj}$ as defined by Sulzmann and Lu
|
|
995 |
takes three arguments: a regular
|
567
|
996 |
expression ${r_{i}}$, before the character is chopped off,
|
623
|
997 |
a character ${c_{i}}$ (the character we want to inject back) and
|
567
|
998 |
the third argument $v_{i+1}$ the value we want to inject into.
|
568
|
999 |
The result of an application
|
|
1000 |
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
|
|
1001 |
\[
|
|
1002 |
(s_i, r_i) \rightarrow v_i
|
|
1003 |
\]
|
|
1004 |
holds.
|
567
|
1005 |
The definition of $\textit{inj}$ is as follows:
|
538
|
1006 |
\begin{center}
|
568
|
1007 |
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
|
|
1008 |
$\textit{inj}\;(c)\;c\,Empty$ & $\dn$ & $\Char\,c$\\
|
|
1009 |
$\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\
|
|
1010 |
$\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\
|
|
1011 |
$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ &
|
|
1012 |
$\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
|
|
1013 |
$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ &
|
|
1014 |
$\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
|
|
1015 |
$\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
|
|
1016 |
$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
|
538
|
1017 |
\end{tabular}
|
|
1018 |
\end{center}
|
|
1019 |
|
|
1020 |
\noindent
|
623
|
1021 |
The function recurses on
|
568
|
1022 |
the shape of regular
|
623
|
1023 |
expressionsw and values.
|
568
|
1024 |
Intuitively, each clause analyses
|
|
1025 |
how $r_i$ could have transformed when being
|
|
1026 |
derived by $c$, identifying which subpart
|
|
1027 |
of $v_{i+1}$ has the ``hole''
|
|
1028 |
to inject the character back into.
|
|
1029 |
Once the character is
|
|
1030 |
injected back to that sub-value;
|
623
|
1031 |
$\inj$ assembles all parts together
|
568
|
1032 |
to form a new value.
|
|
1033 |
|
|
1034 |
For instance, the last clause is an
|
|
1035 |
injection into a sequence value $v_{i+1}$
|
|
1036 |
whose second child
|
|
1037 |
value is a star, and the shape of the
|
|
1038 |
regular expression $r_i$ before injection
|
|
1039 |
is a star.
|
|
1040 |
We therefore know
|
|
1041 |
the derivative
|
|
1042 |
starts on a star and ends as a sequence:
|
|
1043 |
\[
|
|
1044 |
(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
|
|
1045 |
\]
|
|
1046 |
during which an iteration of the star
|
|
1047 |
had just been unfolded, giving the below
|
|
1048 |
value inhabitation relation:
|
|
1049 |
\[
|
|
1050 |
\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
|
|
1051 |
\]
|
|
1052 |
The value list $vs$ corresponds to
|
|
1053 |
matched star iterations,
|
|
1054 |
and the ``hole'' lies in $v$ because
|
|
1055 |
\[
|
|
1056 |
\vdash v: r\backslash c.
|
|
1057 |
\]
|
|
1058 |
Finally,
|
|
1059 |
$\inj \; r \;c \; v$ is prepended
|
|
1060 |
to the previous list of iterations, and then
|
|
1061 |
wrapped under the $\Stars$
|
|
1062 |
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
|
538
|
1063 |
|
568
|
1064 |
Recall that lemma
|
623
|
1065 |
\ref{mePosix} tells us that
|
|
1066 |
$\mkeps$ always generates the POSIX value.
|
|
1067 |
The function $\inj$ preserves the POSIXness, provided
|
|
1068 |
the value before injection is POSIX, namely
|
568
|
1069 |
\begin{lemma}\label{injPosix}
|
623
|
1070 |
If$(r \backslash c, s) \rightarrow v $,
|
|
1071 |
then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
|
568
|
1072 |
\end{lemma}
|
|
1073 |
\begin{proof}
|
|
1074 |
By induction on $r$.
|
623
|
1075 |
The non-trivial cases are sequence and star.
|
|
1076 |
When $r = a \cdot b$, there can be
|
568
|
1077 |
three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
|
|
1078 |
We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
|
|
1079 |
case.
|
|
1080 |
\begin{itemize}
|
|
1081 |
\item
|
|
1082 |
$v = \Seq \; v_a \; v_b$.\\
|
|
1083 |
The ``not nullable'' clause of the $\inj$ function is taken:
|
|
1084 |
\begin{center}
|
|
1085 |
\begin{tabular}{lcl}
|
|
1086 |
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
|
|
1087 |
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
|
|
1088 |
\end{tabular}
|
|
1089 |
\end{center}
|
|
1090 |
We know that there exists a unique pair of
|
|
1091 |
$s_a$ and $s_b$ satisfaying
|
|
1092 |
$(a \backslash c, s_a) \rightarrow v_a$,
|
|
1093 |
$(b , s_b) \rightarrow v_b$, and
|
|
1094 |
$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in
|
|
1095 |
L \; (a\backslash c) \land
|
|
1096 |
s_4 \in L \; b$.
|
|
1097 |
The last condition gives us
|
|
1098 |
$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in
|
|
1099 |
L \; a \land
|
|
1100 |
s_4 \in L \; b$.
|
|
1101 |
By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
|
|
1102 |
and this gives us
|
|
1103 |
\[
|
|
1104 |
(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
|
|
1105 |
\]
|
|
1106 |
\item
|
|
1107 |
$v = \Left \; (\Seq \; v_a \; v_b)$\\
|
|
1108 |
The argument is almost identical to the above case,
|
|
1109 |
except that a different clause of $\inj$ is taken:
|
|
1110 |
\begin{center}
|
|
1111 |
\begin{tabular}{lcl}
|
|
1112 |
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
|
|
1113 |
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
|
|
1114 |
\end{tabular}
|
|
1115 |
\end{center}
|
|
1116 |
With a similar reasoning,
|
538
|
1117 |
|
568
|
1118 |
\[
|
|
1119 |
(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
|
|
1120 |
\]
|
|
1121 |
again holds.
|
|
1122 |
\item
|
|
1123 |
$v = \Right \; v_b$\\
|
|
1124 |
Again the injection result would be
|
|
1125 |
\begin{center}
|
|
1126 |
\begin{tabular}{lcl}
|
|
1127 |
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
|
|
1128 |
& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
|
|
1129 |
\end{tabular}
|
|
1130 |
\end{center}
|
|
1131 |
We know that $a$ must be nullable,
|
|
1132 |
allowing us to call $\mkeps$ and get
|
|
1133 |
\[
|
|
1134 |
(a, []) \rightarrow \mkeps \; a.
|
|
1135 |
\]
|
|
1136 |
Also by inductive hypothesis
|
|
1137 |
\[
|
|
1138 |
(b, c::s) \rightarrow \inj\; b \; c \; v_b
|
|
1139 |
\]
|
|
1140 |
holds.
|
|
1141 |
In addition, as
|
|
1142 |
$\Right \;v_b$ instead of $\Left \ldots$ is
|
|
1143 |
the POSIX value for $v$, it must be the case
|
|
1144 |
that $s \notin L \;( (a\backslash c)\cdot b)$.
|
|
1145 |
This tells us that
|
|
1146 |
\[
|
|
1147 |
\nexists s_3 \; s_4.
|
|
1148 |
s_3 @s_4 = s \land s_3 \in L \; (a\backslash c)
|
|
1149 |
\land s_4 \in L \; b
|
|
1150 |
\]
|
|
1151 |
which translates to
|
|
1152 |
\[
|
|
1153 |
\nexists s_3 \; s_4. \; s_3 \neq [] \land
|
|
1154 |
s_3 @s_4 = c::s \land s_3 \in L \; a
|
|
1155 |
\land s_4 \in L \; b.
|
|
1156 |
\]
|
|
1157 |
(Which basically says there cannot be a longer
|
|
1158 |
initial split for $s$ other than the empty string.)
|
|
1159 |
Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
|
|
1160 |
as the POSIX value for $a\cdot b$.
|
|
1161 |
\end{itemize}
|
|
1162 |
The star case can be proven similarly.
|
|
1163 |
\end{proof}
|
|
1164 |
\noindent
|
623
|
1165 |
Putting all together, Sulzmann and Lu obtained the following algorithm
|
|
1166 |
outlined in the diagram \ref{graph:inj}:
|
538
|
1167 |
\begin{center}
|
539
|
1168 |
\begin{tabular}{lcl}
|
568
|
1169 |
$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
|
|
1170 |
$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
|
|
1171 |
& & $\quad \phantom{\mid}\; \None \implies \None$\\
|
|
1172 |
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
|
538
|
1173 |
\end{tabular}
|
|
1174 |
\end{center}
|
568
|
1175 |
\noindent
|
623
|
1176 |
The central property of the $\lexer$ is that it gives the correct result
|
|
1177 |
according to
|
|
1178 |
POSIX rules.
|
573
|
1179 |
\begin{theorem}\label{lexerCorrectness}
|
568
|
1180 |
The $\lexer$ based on derivatives and injections is correct:
|
|
1181 |
\begin{center}
|
|
1182 |
\begin{tabular}{lcl}
|
|
1183 |
$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
|
|
1184 |
$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
|
|
1185 |
\end{tabular}
|
|
1186 |
\end{center}
|
|
1187 |
\end{theorem}
|
|
1188 |
\begin{proof}
|
623
|
1189 |
By induction on $s$. $r$ generalising over an arbitrary regular expression.
|
|
1190 |
The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
|
568
|
1191 |
by lemma \ref{injPosix}.
|
|
1192 |
\end{proof}
|
538
|
1193 |
\noindent
|
623
|
1194 |
As we did earlier in this chapter with the matcher, one can
|
|
1195 |
introduce simplification on the regular expression in each derivative step.
|
568
|
1196 |
However, now one needs to do a backward phase and make sure
|
623
|
1197 |
the values align with the regular expression.
|
539
|
1198 |
Therefore one has to
|
538
|
1199 |
be careful not to break the correctness, as the injection
|
623
|
1200 |
function heavily relies on the structure of
|
|
1201 |
the regular expressions and values being aligned.
|
|
1202 |
This can be achieved by recording some extra rectification functions
|
538
|
1203 |
during the derivatives step, and applying these rectifications in
|
|
1204 |
each run during the injection phase.
|
568
|
1205 |
With extra care
|
623
|
1206 |
one can show that POSIXness will not be affected
|
|
1207 |
by the simplifications listed here \cite{AusafDyckhoffUrban2016}.
|
|
1208 |
\begin{center}
|
|
1209 |
\begin{tabular}{lcl}
|
|
1210 |
$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
|
|
1211 |
$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
|
|
1212 |
& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
|
|
1213 |
& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
|
|
1214 |
& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
|
|
1215 |
& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
|
|
1216 |
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
|
|
1217 |
$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
|
|
1218 |
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
|
|
1219 |
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
|
|
1220 |
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
|
|
1221 |
$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
|
|
1222 |
|
|
1223 |
\end{tabular}
|
|
1224 |
\end{center}
|
538
|
1225 |
|
623
|
1226 |
However, with the simple-minded simplification rules allowed
|
|
1227 |
in an injection-based lexer, one can still end up
|
|
1228 |
with exploding derivatives.
|
568
|
1229 |
\section{A Case Requring More Aggressive Simplifications}
|
539
|
1230 |
For example, when starting with the regular
|
585
|
1231 |
expression $(a^* \cdot a^*)^*$ and building just over
|
|
1232 |
a dozen successive derivatives
|
539
|
1233 |
w.r.t.~the character $a$, one obtains a derivative regular expression
|
585
|
1234 |
with millions of nodes (when viewed as a tree)
|
623
|
1235 |
even with the simple-minded simplification.
|
585
|
1236 |
\begin{figure}[H]
|
601
|
1237 |
\begin{center}
|
539
|
1238 |
\begin{tikzpicture}
|
|
1239 |
\begin{axis}[
|
|
1240 |
xlabel={$n$},
|
|
1241 |
ylabel={size},
|
585
|
1242 |
legend entries={Simple-Minded Simp, Naive Matcher},
|
539
|
1243 |
legend pos=north west,
|
|
1244 |
legend cell align=left]
|
|
1245 |
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
|
585
|
1246 |
\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
|
539
|
1247 |
\end{axis}
|
|
1248 |
\end{tikzpicture}
|
601
|
1249 |
\end{center}
|
539
|
1250 |
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
|
|
1251 |
\end{figure}\label{fig:BetterWaterloo}
|
|
1252 |
|
568
|
1253 |
That is because Sulzmann and Lu's
|
|
1254 |
injection-based lexing algorithm keeps a lot of
|
541
|
1255 |
"useless" values that will not be used.
|
539
|
1256 |
These different ways of matching will grow exponentially with the string length.
|
623
|
1257 |
Consider the case
|
568
|
1258 |
\[
|
|
1259 |
r= (a^*\cdot a^*)^* \quad and \quad
|
|
1260 |
s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
|
|
1261 |
\]
|
|
1262 |
as an example.
|
|
1263 |
This is a highly ambiguous regular expression, with
|
|
1264 |
many ways to split up the string into multiple segments for
|
|
1265 |
different star iteratioins,
|
573
|
1266 |
and for each segment
|
|
1267 |
multiple ways of splitting between
|
568
|
1268 |
the two $a^*$ sub-expressions.
|
573
|
1269 |
When $n$ is equal to $1$, there are two lexical values for
|
|
1270 |
the match:
|
|
1271 |
\[
|
623
|
1272 |
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
|
573
|
1273 |
\]
|
|
1274 |
and
|
|
1275 |
\[
|
623
|
1276 |
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
|
573
|
1277 |
\]
|
|
1278 |
The derivative of $\derssimp \;s \; r$ is
|
|
1279 |
\[
|
|
1280 |
(a^*a^* + a^*)\cdot(a^*a^*)^*.
|
|
1281 |
\]
|
|
1282 |
The $a^*a^*$ and $a^*$ in the first child of the above sequence
|
|
1283 |
correspond to value 1 and value 2, respectively.
|
|
1284 |
When $n=2$, the number goes up to 7:
|
|
1285 |
\[
|
|
1286 |
\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
|
|
1287 |
\]
|
623
|
1288 |
|
573
|
1289 |
\[
|
|
1290 |
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
|
|
1291 |
\]
|
623
|
1292 |
|
573
|
1293 |
\[
|
|
1294 |
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
|
|
1295 |
\]
|
623
|
1296 |
|
573
|
1297 |
\[
|
|
1298 |
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
|
|
1299 |
\]
|
623
|
1300 |
|
573
|
1301 |
\[
|
|
1302 |
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []),
|
|
1303 |
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
|
|
1304 |
]
|
|
1305 |
\]
|
623
|
1306 |
|
573
|
1307 |
\[
|
|
1308 |
\Stars \; [
|
|
1309 |
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
|
|
1310 |
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
|
|
1311 |
]
|
|
1312 |
\]
|
|
1313 |
and
|
|
1314 |
\[
|
|
1315 |
\Stars \; [
|
|
1316 |
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
|
|
1317 |
\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
|
|
1318 |
]
|
|
1319 |
\]
|
|
1320 |
And $\derssimp \; aa \; (a^*a^*)^*$ would be
|
|
1321 |
\[
|
|
1322 |
((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* +
|
|
1323 |
(a^*a^* + a^*)\cdot(a^*a^*)^*.
|
|
1324 |
\]
|
|
1325 |
which removes two out of the seven terms corresponding to the
|
|
1326 |
seven distinct lexical values.
|
|
1327 |
|
|
1328 |
It is not surprising that there are exponentially many
|
|
1329 |
distinct lexical values that cannot be eliminated by
|
|
1330 |
the simple-minded simplification of $\derssimp$.
|
568
|
1331 |
A lexer without a good enough strategy to
|
|
1332 |
deduplicate will naturally
|
623
|
1333 |
have an exponential runtime on highly
|
|
1334 |
ambiguous regular expressions, because there
|
|
1335 |
are exponentially many matches.
|
|
1336 |
For this particular example, it seems
|
|
1337 |
that the number of distinct matches growth
|
|
1338 |
speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
|
538
|
1339 |
|
573
|
1340 |
On the other hand, the
|
|
1341 |
$\POSIX$ value for $r= (a^*\cdot a^*)^*$ and
|
|
1342 |
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is
|
|
1343 |
\[
|
|
1344 |
\Stars\,
|
|
1345 |
[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]
|
|
1346 |
\]
|
|
1347 |
and at any moment the subterms in a regular expression
|
|
1348 |
that will result in a POSIX value is only
|
|
1349 |
a minority among the many other terms,
|
|
1350 |
and one can remove ones that are absolutely not possible to
|
|
1351 |
be POSIX.
|
|
1352 |
In the above example,
|
579
|
1353 |
\begin{equation}\label{eqn:growth2}
|
573
|
1354 |
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
|
|
1355 |
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
|
579
|
1356 |
\end{equation}
|
573
|
1357 |
can be further simplified by
|
|
1358 |
removing the underlined term first,
|
|
1359 |
which would open up possibilities
|
|
1360 |
of further simplification that removes the
|
|
1361 |
underbraced part.
|
|
1362 |
The result would be
|
|
1363 |
\[
|
|
1364 |
(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*.
|
|
1365 |
\]
|
|
1366 |
with corresponding values
|
|
1367 |
\begin{center}
|
|
1368 |
\begin{tabular}{lr}
|
|
1369 |
$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\
|
|
1370 |
$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$
|
|
1371 |
\end{tabular}
|
|
1372 |
\end{center}
|
|
1373 |
Other terms with an underlying value such as
|
|
1374 |
\[
|
|
1375 |
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
|
|
1376 |
\]
|
|
1377 |
is simply too hopeless to contribute a POSIX lexical value,
|
|
1378 |
and is therefore thrown away.
|
538
|
1379 |
|
623
|
1380 |
Ausaf et al. \cite{AusafDyckhoffUrban2016}
|
573
|
1381 |
have come up with some simplification steps, however those steps
|
|
1382 |
are not yet sufficiently strong so that they achieve the above effects.
|
|
1383 |
And even with these relatively mild simplifications the proof
|
|
1384 |
is already quite a bit complicated than the theorem \ref{lexerCorrectness}.
|
|
1385 |
One would prove something like:
|
|
1386 |
\[
|
|
1387 |
\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\;
|
|
1388 |
\textit{then}\;\; (r, c::s) \rightarrow
|
|
1389 |
\inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v)
|
|
1390 |
\]
|
|
1391 |
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
|
|
1392 |
not only has to return a simplified regular expression,
|
|
1393 |
but also what specific simplifications
|
|
1394 |
has been done as a function on values
|
|
1395 |
showing how one can transform the value
|
|
1396 |
underlying the simplified regular expression
|
|
1397 |
to the unsimplified one.
|
|
1398 |
|
623
|
1399 |
We therefore choose a slightly different approach
|
|
1400 |
also described by Sulzmann and Lu to
|
573
|
1401 |
get better simplifications, which uses
|
|
1402 |
some augmented data structures compared to
|
|
1403 |
plain regular expressions.
|
|
1404 |
We call them \emph{annotated}
|
|
1405 |
regular expressions.
|
|
1406 |
With annotated regular expressions,
|
|
1407 |
we can avoid creating the intermediate values $v_1,\ldots v_n$ and a
|
|
1408 |
second phase altogether.
|
|
1409 |
We introduce this new datatype and the
|
|
1410 |
corresponding algorithm in the next chapter.
|
538
|
1411 |
|
|
1412 |
|
|
1413 |
|
|
1414 |
|