author | Chengsong |
Wed, 23 Aug 2023 03:02:31 +0100 | |
changeset 668 | 3831621d7b14 |
parent 667 | 660cf698eb26 |
permissions | -rwxr-xr-x |
532 | 1 |
% Chapter Template |
2 |
||
666 | 3 |
\chapter{Regular Expressions and POSIX Lexing-Preliminaries} % Main chapter title |
532 | 4 |
|
5 |
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts |
|
6 |
%and notations we |
|
564 | 7 |
% used for describing the lexing algorithm by Sulzmann and Lu, |
8 |
%and then give the algorithm and its variant and discuss |
|
532 | 9 |
%why more aggressive simplifications are needed. |
649 | 10 |
|
11 |
||
668
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
12 |
\marginpar{Gerog comment: addressed attributing work correctly problem, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
13 |
showed clearly these definitions come from Ausaf et al.} |
649 | 14 |
|
668
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
15 |
This is a preliminary chapter which describes the results of |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
16 |
Sulzmann and Lu \cite{Sulzmann2014} and part of the formalisations by |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
17 |
Ausaf et al \ref{AusafDyckhoffUrban2016} (mainly the definitions). |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
18 |
These results are not part of this PhD work, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
19 |
but the details are included to provide necessary context |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
20 |
for our work on the correctness proof of $\blexersimp$, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
21 |
as we show in chapter \ref{Bitcoded2} how the proofs break down when |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
22 |
simplifications are applied. |
657
00171b627b8d
Fixed some annotated/unannotated a/r notation inconsistencies.
Chengsong
parents:
651
diff
changeset
|
23 |
|
668
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
24 |
In the coming section, the definitions of basic notions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
25 |
for regular languages and regular expressions are given. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
26 |
This is essentially a description in ``English'' |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
27 |
the functions and datatypes used by Ausaf et al. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
28 |
\cite{AusafDyckhoffUrban2016} \cite{Ausaf} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
29 |
in their formalisation in Isabelle/HOL. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
30 |
We include them as we build on their formalisation, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
31 |
and therefore inherently use these definitions. |
649 | 32 |
|
33 |
||
668
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
34 |
% In the next section, we will briefly |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
35 |
% introduce Brzozowski derivatives and Sulzmann |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
36 |
% and Lu's algorithm, which the main point of this thesis builds on. |
649 | 37 |
%We give a taste of what they |
38 |
%are like and why they are suitable for regular expression |
|
39 |
%matching and lexing. |
|
40 |
\section{Formal Specification of POSIX Matching |
|
41 |
and Brzozowski Derivatives} |
|
42 |
%Now we start with the central topic of the thesis: Brzozowski derivatives. |
|
43 |
Brzozowski \cite{Brzozowski1964} first introduced the |
|
44 |
concept of a \emph{derivative} of regular expression in 1964. |
|
45 |
The derivative of a regular expression $r$ |
|
46 |
with respect to a character $c$, is written as $r \backslash c$. |
|
47 |
This operation tells us what $r$ transforms into |
|
48 |
if we ``chop'' off a particular character $c$ |
|
49 |
from all strings in the language of $r$ (defined |
|
50 |
later as $L \; r$). |
|
51 |
%To give a flavour of Brzozowski derivatives, we present |
|
52 |
%two straightforward clauses from it: |
|
53 |
%\begin{center} |
|
54 |
% \begin{tabular}{lcl} |
|
55 |
% $d \backslash c$ & $\dn$ & |
|
56 |
% $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
57 |
%$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
58 |
% \end{tabular} |
|
59 |
%\end{center} |
|
60 |
%\noindent |
|
61 |
%The first clause says that for the regular expression |
|
62 |
%denoting a singleton set consisting of a single-character string $\{ d \}$, |
|
63 |
%we check the derivative character $c$ against $d$, |
|
64 |
%returning a set containing only the empty string $\{ [] \}$ |
|
65 |
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise. |
|
66 |
%The second clause states that to obtain the regular expression |
|
67 |
%representing all strings' head character $c$ being chopped off |
|
68 |
%from $r_1 + r_2$, one simply needs to recursively take derivative |
|
69 |
%of $r_1$ and $r_2$ and then put them together. |
|
70 |
Derivatives have the property |
|
71 |
that $s \in L \; (r\backslash c)$ if and only if |
|
72 |
$c::s \in L \; r$ where $::$ stands for list prepending. |
|
73 |
%This property can be used on regular expressions |
|
74 |
%matching and lexing--to test whether a string $s$ is in $L \; r$, |
|
75 |
%one simply takes derivatives of $r$ successively with |
|
76 |
%respect to the characters (in the correct order) in $s$, |
|
77 |
%and then test whether the empty string is in the last regular expression. |
|
78 |
With this property, derivatives can give a simple solution |
|
79 |
to the problem of matching a string $s$ with a regular |
|
80 |
expression $r$: if the derivative of $r$ w.r.t.\ (in |
|
81 |
succession) all the characters of the string matches the empty string, |
|
82 |
then $r$ matches $s$ (and {\em vice versa}). |
|
83 |
%This makes formally reasoning about these properties such |
|
84 |
%as correctness and complexity smooth and intuitive. |
|
85 |
There are several mechanised proofs of this property in various theorem |
|
86 |
provers, |
|
87 |
for example one by Owens and Slind \cite{Owens2008} in HOL4, |
|
88 |
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and |
|
89 |
yet another in Coq by Coquand and Siles \cite{Coquand2012}. |
|
90 |
||
91 |
In addition, one can extend derivatives to bounded repetitions |
|
92 |
relatively straightforwardly. For example, the derivative for |
|
93 |
this can be defined as: |
|
94 |
\begin{center} |
|
95 |
\begin{tabular}{lcl} |
|
96 |
$r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot |
|
97 |
r^{\{n-1\}} (\text{when} n > 0)$\\ |
|
98 |
\end{tabular} |
|
99 |
\end{center} |
|
100 |
\noindent |
|
101 |
Experimental results suggest that unlike DFA-based solutions |
|
102 |
for bounded regular expressions, |
|
103 |
derivatives can cope |
|
104 |
large counters |
|
105 |
quite well. |
|
106 |
||
107 |
There have also been |
|
108 |
extensions of derivatives to other regex constructs. |
|
109 |
For example, Owens et al include the derivatives |
|
110 |
for the \emph{NOT} regular expression, which is |
|
111 |
able to concisely express C-style comments of the form |
|
112 |
$/* \ldots */$ (see \cite{Owens2008}). |
|
113 |
Another extension for derivatives is |
|
114 |
regular expressions with look-aheads, done |
|
115 |
by Miyazaki and Minamide |
|
116 |
\cite{Takayuki2019}. |
|
117 |
%We therefore use Brzozowski derivatives on regular expressions |
|
118 |
%lexing |
|
119 |
||
120 |
||
121 |
||
122 |
Given the above definitions and properties of |
|
123 |
Brzozowski derivatives, one quickly realises their potential |
|
124 |
in generating a formally verified algorithm for lexing: the clauses and property |
|
125 |
can be easily expressed in a functional programming language |
|
126 |
or converted to theorem prover |
|
127 |
code, with great ease. |
|
128 |
Perhaps this is the reason why derivatives have sparked quite a bit of interest |
|
129 |
in the functional programming and theorem prover communities in the last |
|
130 |
fifteen or so years ( |
|
131 |
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18}, |
|
132 |
\cite{Chen12} and \cite{Coquand2012} |
|
133 |
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008} |
|
134 |
after they were first published by Brzozowski. |
|
135 |
||
136 |
||
137 |
However, there are two difficulties with derivative-based matchers: |
|
138 |
First, Brzozowski's original matcher only generates a yes/no answer |
|
139 |
for whether a regular expression matches a string or not. This is too |
|
140 |
little information in the context of lexing where separate tokens must |
|
141 |
be identified and also classified (for example as keywords |
|
142 |
or identifiers). |
|
143 |
Second, derivative-based matchers need to be more efficient in terms |
|
144 |
of the sizes of derivatives. |
|
145 |
Elegant and beautiful |
|
146 |
as many implementations are, |
|
666 | 147 |
they can be still quite slow. |
649 | 148 |
For example, Sulzmann and Lu |
149 |
claim a linear running time of their proposed algorithm, |
|
150 |
but that was falsified by our experiments. The running time |
|
151 |
is actually $\Omega(2^n)$ in the worst case. |
|
152 |
A similar claim about a theoretical runtime of $O(n^2)$ |
|
153 |
is made for the Verbatim \cite{Verbatim} |
|
154 |
%TODO: give references |
|
155 |
lexer, which calculates POSIX matches and is based on derivatives. |
|
156 |
They formalized the correctness of the lexer, but not their complexity result. |
|
157 |
In the performance evaluation section, they analyzed the run time |
|
158 |
of matching $a$ with the string |
|
159 |
\begin{center} |
|
160 |
$\underbrace{a \ldots a}_{\text{n a's}}$. |
|
161 |
\end{center} |
|
162 |
\noindent |
|
163 |
They concluded that the algorithm is quadratic in terms of |
|
164 |
the length of the input string. |
|
165 |
When we tried out their extracted OCaml code with the example $(a+aa)^*$, |
|
166 |
the time it took to match a string of 40 $a$'s was approximately 5 minutes. |
|
167 |
||
168 |
||
169 |
\subsection{Sulzmann and Lu's Algorithm} |
|
170 |
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first |
|
171 |
problem with the yes/no answer |
|
172 |
by cleverly extending Brzozowski's matching |
|
173 |
algorithm. Their extended version generates additional information on |
|
174 |
\emph{how} a regular expression matches a string following the POSIX |
|
175 |
rules for regular expression matching. They achieve this by adding a |
|
176 |
second ``phase'' to Brzozowski's algorithm involving an injection |
|
177 |
function. This injection function in a sense undoes the ``damage'' |
|
178 |
of the derivatives chopping off characters. |
|
179 |
In earlier work, Ausaf et al provided the formal |
|
180 |
specification of what POSIX matching means and proved in Isabelle/HOL |
|
181 |
the correctness |
|
182 |
of this extended algorithm accordingly |
|
183 |
\cite{AusafDyckhoffUrban2016}. |
|
184 |
||
185 |
The version of the algorithm proven correct |
|
186 |
suffers however heavily from a |
|
187 |
second difficulty, where derivatives can |
|
188 |
grow to arbitrarily big sizes. |
|
189 |
For example if we start with the |
|
190 |
regular expression $(a+aa)^*$ and take |
|
191 |
successive derivatives according to the character $a$, we end up with |
|
192 |
a sequence of ever-growing derivatives like |
|
193 |
||
194 |
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}} |
|
195 |
\begin{center} |
|
196 |
\begin{tabular}{rll} |
|
197 |
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
198 |
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
199 |
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\ |
|
200 |
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
201 |
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) |
|
202 |
\end{tabular} |
|
203 |
\end{center} |
|
204 |
||
205 |
\noindent where after around 35 steps we usually run out of memory on a |
|
206 |
typical computer. Clearly, the |
|
207 |
notation involving $\ZERO$s and $\ONE$s already suggests |
|
208 |
simplification rules that can be applied to regular regular |
|
209 |
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r |
|
210 |
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
|
211 |
r$. While such simple-minded simplifications have been proved in |
|
212 |
the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's |
|
213 |
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do |
|
214 |
\emph{not} help with limiting the growth of the derivatives shown |
|
215 |
above: the growth is slowed, but the derivatives can still grow rather |
|
216 |
quickly beyond any finite bound. |
|
217 |
||
218 |
Therefore we want to look in this thesis at a second |
|
219 |
algorithm by Sulzmann and Lu where they |
|
220 |
overcame this ``growth problem'' \cite{Sulzmann2014}. |
|
221 |
In this version, POSIX values are |
|
222 |
represented as bit sequences and such sequences are incrementally generated |
|
223 |
when derivatives are calculated. The compact representation |
|
224 |
of bit sequences and regular expressions allows them to define a more |
|
225 |
``aggressive'' simplification method that keeps the size of the |
|
226 |
derivatives finite no matter what the length of the string is. |
|
227 |
They make some informal claims about the correctness and linear behaviour |
|
228 |
of this version, but do not provide any supporting proof arguments, not |
|
229 |
even ``pencil-and-paper'' arguments. They write about their bit-coded |
|
230 |
\emph{incremental parsing method} (that is the algorithm to be formalised |
|
231 |
in this dissertation) |
|
232 |
||
233 |
||
234 |
||
235 |
\begin{quote}\it |
|
236 |
``Correctness Claim: We further claim that the incremental parsing |
|
237 |
method [..] in combination with the simplification steps [..] |
|
238 |
yields POSIX parse trees. We have tested this claim |
|
239 |
extensively [..] but yet |
|
240 |
have to work out all proof details.'' \cite[Page 14]{Sulzmann2014} |
|
241 |
\end{quote} |
|
242 |
Ausaf and Urban made some initial progress towards the |
|
243 |
full correctness proof but still had to leave out the optimisation |
|
244 |
Sulzmann and Lu proposed. |
|
245 |
Ausaf wrote \cite{Ausaf}, |
|
246 |
\begin{quote}\it |
|
247 |
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.'' |
|
248 |
\end{quote} |
|
249 |
This thesis implements the aggressive simplifications envisioned |
|
250 |
by Ausaf and Urban, |
|
251 |
together with a formal proof of the correctness of those simplifications. |
|
252 |
||
253 |
||
254 |
One of the most recent work in the context of lexing |
|
255 |
%with this issue |
|
256 |
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}. |
|
257 |
This is relevant work for us and we will compare it later with |
|
258 |
our derivative-based matcher we are going to present. |
|
259 |
There is also some newer work called |
|
260 |
Verbatim++~\cite{Verbatimpp}, which does not use derivatives, |
|
261 |
but deterministic finite automaton instead. |
|
262 |
We will also study this work in a later section. |
|
263 |
%An example that gives problem to automaton approaches would be |
|
264 |
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$. |
|
265 |
%It requires at least $2^{n+1}$ states to represent |
|
266 |
%as a DFA. |
|
267 |
||
268 |
||
538 | 269 |
\section{Basic Concepts} |
622 | 270 |
Formal language theory usually starts with an alphabet |
538 | 271 |
denoting a set of characters. |
637 | 272 |
Here we use the datatype of characters from Isabelle, |
541 | 273 |
which roughly corresponds to the ASCII characters. |
564 | 274 |
In what follows, we shall leave the information about the alphabet |
541 | 275 |
implicit. |
276 |
Then using the usual bracket notation for lists, |
|
622 | 277 |
we can define strings made up of characters as: |
532 | 278 |
\begin{center} |
279 |
\begin{tabular}{lcl} |
|
541 | 280 |
$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$ |
532 | 281 |
\end{tabular} |
282 |
\end{center} |
|
583 | 283 |
where $c$ is a variable ranging over characters. |
622 | 284 |
The $::$ stands for list cons and $[]$ for the empty |
285 |
list. |
|
637 | 286 |
For brevity, a singleton list is sometimes written as $[c]$. |
541 | 287 |
Strings can be concatenated to form longer strings in the same |
637 | 288 |
way we concatenate two lists, which we shall write as $s_1 @ s_2$. |
541 | 289 |
We omit the precise |
538 | 290 |
recursive definition here. |
291 |
We overload this concatenation operator for two sets of strings: |
|
532 | 292 |
\begin{center} |
293 |
\begin{tabular}{lcl} |
|
541 | 294 |
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\ |
532 | 295 |
\end{tabular} |
296 |
\end{center} |
|
538 | 297 |
We also call the above \emph{language concatenation}. |
532 | 298 |
The power of a language is defined recursively, using the |
299 |
concatenation operator $@$: |
|
300 |
\begin{center} |
|
301 |
\begin{tabular}{lcl} |
|
302 |
$A^0 $ & $\dn$ & $\{ [] \}$\\ |
|
541 | 303 |
$A^{n+1}$ & $\dn$ & $A @ A^n$ |
532 | 304 |
\end{tabular} |
305 |
\end{center} |
|
564 | 306 |
The union of all powers of a language |
307 |
can be used to define the Kleene star operator: |
|
532 | 308 |
\begin{center} |
309 |
\begin{tabular}{lcl} |
|
536 | 310 |
$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\ |
532 | 311 |
\end{tabular} |
312 |
\end{center} |
|
313 |
||
538 | 314 |
\noindent |
564 | 315 |
However, to obtain a more convenient induction principle |
538 | 316 |
in Isabelle/HOL, |
536 | 317 |
we instead define the Kleene star |
532 | 318 |
as an inductive set: |
538 | 319 |
|
532 | 320 |
\begin{center} |
538 | 321 |
\begin{mathpar} |
564 | 322 |
\inferrule{\mbox{}}{[] \in A*\\} |
538 | 323 |
|
564 | 324 |
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*} |
538 | 325 |
\end{mathpar} |
532 | 326 |
\end{center} |
564 | 327 |
\noindent |
541 | 328 |
We also define an operation of "chopping off" a character from |
329 |
a language, which we call $\Der$, meaning \emph{Derivative} (for a language): |
|
532 | 330 |
\begin{center} |
331 |
\begin{tabular}{lcl} |
|
332 |
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\ |
|
333 |
\end{tabular} |
|
334 |
\end{center} |
|
538 | 335 |
\noindent |
583 | 336 |
This can be generalised to ``chopping off'' a string |
337 |
from all strings within a set $A$, |
|
541 | 338 |
namely: |
532 | 339 |
\begin{center} |
340 |
\begin{tabular}{lcl} |
|
541 | 341 |
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\ |
532 | 342 |
\end{tabular} |
343 |
\end{center} |
|
538 | 344 |
\noindent |
541 | 345 |
which is essentially the left quotient $A \backslash L$ of $A$ against |
622 | 346 |
the singleton language with $L = \{s\}$. |
347 |
However, for our purposes here, the $\textit{Ders}$ definition with |
|
541 | 348 |
a single string is sufficient. |
532 | 349 |
|
577 | 350 |
The reason for defining derivatives |
622 | 351 |
is that they provide another approach |
577 | 352 |
to test membership of a string in |
353 |
a set of strings. |
|
354 |
For example, to test whether the string |
|
638 | 355 |
$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with |
577 | 356 |
respect to the string $bar$: |
357 |
\begin{center} |
|
622 | 358 |
\begin{tabular}{lll} |
577 | 359 |
$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & |
622 | 360 |
$\{ar, rak\}$ \\ |
361 |
& $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\ |
|
362 |
& $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\ |
|
363 |
%& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\ |
|
577 | 364 |
\end{tabular} |
365 |
\end{center} |
|
366 |
\noindent |
|
637 | 367 |
and in the end, test whether the set |
638 | 368 |
contains the empty string.\footnote{We use the infix notation $A\backslash c$ |
369 |
instead of $\Der \; c \; A$ for brevity, as it will always be |
|
370 |
clear from the context that we are operating |
|
622 | 371 |
on languages rather than regular expressions.} |
372 |
||
373 |
In general, if we have a language $S$, |
|
374 |
then we can test whether $s$ is in $S$ |
|
577 | 375 |
by testing whether $[] \in S \backslash s$. |
564 | 376 |
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages, |
532 | 377 |
we have a few properties of how the language derivative can be defined using |
378 |
sub-languages. |
|
577 | 379 |
For example, for the sequence operator, we have |
622 | 380 |
something similar to a ``chain rule'': |
532 | 381 |
\begin{lemma} |
536 | 382 |
\[ |
383 |
\Der \; c \; (A @ B) = |
|
384 |
\begin{cases} |
|
538 | 385 |
((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\ |
386 |
(\Der \; c \; A) \, @ \, B, & \text{otherwise} |
|
536 | 387 |
\end{cases} |
388 |
\] |
|
532 | 389 |
\end{lemma} |
390 |
\noindent |
|
391 |
This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it |
|
392 |
and get to $B$. |
|
583 | 393 |
The language derivative for $A*$ can be described using the language derivative |
532 | 394 |
of $A$: |
395 |
\begin{lemma} |
|
538 | 396 |
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\ |
532 | 397 |
\end{lemma} |
398 |
\begin{proof} |
|
583 | 399 |
There are two inclusions to prove: |
532 | 400 |
\begin{itemize} |
564 | 401 |
\item{$\subseteq$}:\\ |
532 | 402 |
The set |
637 | 403 |
\[ S_1 = \{s \mid c :: s \in A*\} \] |
532 | 404 |
is enclosed in the set |
637 | 405 |
\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \] |
406 |
This is because for any string $c::s$ satisfying $c::s \in A*$, |
|
407 |
%whenever you have a string starting with a character |
|
408 |
%in the language of a Kleene star $A*$, |
|
409 |
%then that |
|
410 |
the character $c$, together with a prefix of $s$ |
|
411 |
%immediately after $c$ |
|
412 |
forms the first iteration of $A*$, |
|
413 |
and the rest of the $s$ is also $A*$. |
|
414 |
This coincides with the definition of $S_2$. |
|
564 | 415 |
\item{$\supseteq$}:\\ |
532 | 416 |
Note that |
538 | 417 |
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \] |
583 | 418 |
holds. |
419 |
Also the following holds: |
|
536 | 420 |
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \] |
564 | 421 |
where the $\textit{RHS}$ can be rewritten |
422 |
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \] |
|
423 |
which of course contains $\Der \; c \; A @ A*$. |
|
532 | 424 |
\end{itemize} |
425 |
\end{proof} |
|
538 | 426 |
|
427 |
\noindent |
|
622 | 428 |
The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$ |
429 |
for regular expressions. |
|
430 |
To introduce them, we need to first give definitions for regular expressions, |
|
431 |
which we shall do next. |
|
532 | 432 |
|
536 | 433 |
\subsection{Regular Expressions and Their Meaning} |
564 | 434 |
The \emph{basic regular expressions} are defined inductively |
532 | 435 |
by the following grammar: |
436 |
\[ r ::= \ZERO \mid \ONE |
|
437 |
\mid c |
|
438 |
\mid r_1 \cdot r_2 |
|
439 |
\mid r_1 + r_2 |
|
440 |
\mid r^* |
|
441 |
\] |
|
538 | 442 |
\noindent |
564 | 443 |
We call them basic because we will introduce |
637 | 444 |
additional constructors in later chapters, such as negation |
538 | 445 |
and bounded repetitions. |
564 | 446 |
We use $\ZERO$ for the regular expression that |
447 |
matches no string, and $\ONE$ for the regular |
|
622 | 448 |
expression that matches only the empty string.\footnote{ |
449 |
Some authors |
|
564 | 450 |
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$ |
638 | 451 |
but we prefer this notation.} |
564 | 452 |
The sequence regular expression is written $r_1\cdot r_2$ |
453 |
and sometimes we omit the dot if it is clear which |
|
454 |
regular expression is meant; the alternative |
|
455 |
is written $r_1 + r_2$. |
|
456 |
The \emph{language} or meaning of |
|
457 |
a regular expression is defined recursively as |
|
458 |
a set of strings: |
|
532 | 459 |
%TODO: FILL in the other defs |
460 |
\begin{center} |
|
461 |
\begin{tabular}{lcl} |
|
667 | 462 |
$L \; \ZERO$ & $\dn$ & $\varnothing$\\ |
564 | 463 |
$L \; \ONE$ & $\dn$ & $\{[]\}$\\ |
464 |
$L \; c$ & $\dn$ & $\{[c]\}$\\ |
|
622 | 465 |
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\ |
466 |
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\ |
|
467 |
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$ |
|
532 | 468 |
\end{tabular} |
469 |
\end{center} |
|
536 | 470 |
\noindent |
622 | 471 |
%Now with language derivatives of a language and regular expressions and |
472 |
%their language interpretations in place, we are ready to define derivatives on regular expressions. |
|
637 | 473 |
With $L$, we are ready to introduce Brzozowski derivatives on regular expressions. |
638 | 474 |
We do so by first introducing what properties they should satisfy. |
622 | 475 |
|
532 | 476 |
\subsection{Brzozowski Derivatives and a Regular Expression Matcher} |
564 | 477 |
%Recall, the language derivative acts on a set of strings |
478 |
%and essentially chops off a particular character from |
|
479 |
%all strings in that set, Brzozowski defined a derivative operation on regular expressions |
|
480 |
%so that after derivative $L(r\backslash c)$ |
|
481 |
%will look as if it was obtained by doing a language derivative on $L(r)$: |
|
622 | 482 |
%Recall that the language derivative acts on a |
483 |
%language (set of strings). |
|
484 |
%One can decide whether a string $s$ belongs |
|
485 |
%to a language $S$ by taking derivative with respect to |
|
486 |
%that string and then checking whether the empty |
|
487 |
%string is in the derivative: |
|
488 |
%\begin{center} |
|
489 |
%\parskip \baselineskip |
|
490 |
%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}} |
|
491 |
%\def\rlwd{.5pt} |
|
492 |
%\newcommand\notate[3]{% |
|
493 |
% \unskip\def\useanchorwidth{T}% |
|
494 |
% \setbox0=\hbox{#1}% |
|
495 |
% \def\stackalignment{c}\stackunder[-6pt]{% |
|
496 |
% \def\stackalignment{c}\stackunder[-1.5pt]{% |
|
497 |
% \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{% |
|
498 |
% \rule{\rlwd}{#2\baselineskip}}}{% |
|
499 |
% \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces% |
|
500 |
%} |
|
501 |
%\Longstack{ |
|
502 |
%\notate{$\{ \ldots ,\;$ |
|
503 |
% \notate{s}{1}{$(c_1 :: s_1)$} |
|
504 |
% $, \; \ldots \}$ |
|
505 |
%}{1}{$S_{start}$} |
|
506 |
%} |
|
507 |
%\Longstack{ |
|
508 |
% $\stackrel{\backslash c_1}{\longrightarrow}$ |
|
509 |
%} |
|
510 |
%\Longstack{ |
|
511 |
% $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$} |
|
512 |
% $,\; \ldots \}$ |
|
513 |
%} |
|
514 |
%\Longstack{ |
|
515 |
% $\stackrel{\backslash c_2}{\longrightarrow}$ |
|
516 |
%} |
|
517 |
%\Longstack{ |
|
518 |
% $\{ \ldots,\; s_2 |
|
519 |
% ,\; \ldots \}$ |
|
520 |
%} |
|
521 |
%\Longstack{ |
|
522 |
% $ \xdashrightarrow{\backslash c_3\ldots\ldots} $ |
|
523 |
%} |
|
524 |
%\Longstack{ |
|
525 |
% \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = |
|
526 |
% S_{start}\backslash s$} |
|
527 |
%} |
|
528 |
%\end{center} |
|
529 |
%\begin{center} |
|
530 |
% $s \in S_{start} \iff [] \in S_{end}$ |
|
531 |
%\end{center} |
|
532 |
%\noindent |
|
533 |
Brzozowski noticed that $\Der$ |
|
579 | 534 |
can be ``mirrored'' on regular expressions which |
564 | 535 |
he calls the derivative of a regular expression $r$ |
536 |
with respect to a character $c$, written |
|
577 | 537 |
$r \backslash c$. This infix operator |
638 | 538 |
takes regular expression $r$ as input |
539 |
and a character as a right operand. |
|
577 | 540 |
The derivative operation on regular expression |
541 |
is defined such that the language of the derivative result |
|
542 |
coincides with the language of the original |
|
579 | 543 |
regular expression being taken |
638 | 544 |
derivative with respect to the same characters, namely |
579 | 545 |
\begin{property} |
546 |
||
547 |
\[ |
|
548 |
L \; (r \backslash c) = \Der \; c \; (L \; r) |
|
549 |
\] |
|
550 |
\end{property} |
|
551 |
\noindent |
|
637 | 552 |
Next, we give the recursive definition of derivative on |
553 |
regular expressions so that it satisfies the properties above. |
|
638 | 554 |
%The derivative function, written $r\backslash c$, |
555 |
%takes a regular expression $r$ and character $c$, and |
|
556 |
%returns a new regular expression representing |
|
557 |
%the original regular expression's language $L \; r$ |
|
558 |
%being taken the language derivative with respect to $c$. |
|
626 | 559 |
\begin{table} |
560 |
\begin{center} |
|
579 | 561 |
\begin{tabular}{lcl} |
562 |
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
563 |
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
564 |
$d \backslash c$ & $\dn$ & |
|
565 |
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
566 |
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
567 |
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\ |
|
568 |
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
569 |
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
570 |
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
571 |
\end{tabular} |
|
626 | 572 |
\end{center} |
573 |
\caption{Derivative on Regular Expressions} |
|
574 |
\label{table:der} |
|
575 |
\end{table} |
|
564 | 576 |
\noindent |
579 | 577 |
The most involved cases are the sequence case |
578 |
and the star case. |
|
579 |
The sequence case says that if the first regular expression |
|
580 |
contains an empty string, then the second component of the sequence |
|
581 |
needs to be considered, as its derivative will contribute to the |
|
582 |
result of this derivative: |
|
583 |
\begin{center} |
|
584 |
\begin{tabular}{lcl} |
|
583 | 585 |
$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & |
586 |
$\textit{if}\;\,([] \in L(r_1))\; |
|
638 | 587 |
\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\ |
579 | 588 |
& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$ |
589 |
\end{tabular} |
|
590 |
\end{center} |
|
591 |
\noindent |
|
592 |
Notice how this closely resembles |
|
593 |
the language derivative operation $\Der$: |
|
564 | 594 |
\begin{center} |
595 |
\begin{tabular}{lcl} |
|
596 |
$\Der \; c \; (A @ B)$ & $\dn$ & |
|
597 |
$ \textit{if} \;\, [] \in A \; |
|
598 |
\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup |
|
599 |
\Der \; c\; B$\\ |
|
600 |
& & $\textit{else}\; (\Der \; c \; A) @ B$\\ |
|
601 |
\end{tabular} |
|
602 |
\end{center} |
|
603 |
\noindent |
|
583 | 604 |
The derivative of the star regular expression $r^*$ |
579 | 605 |
unwraps one iteration of $r$, turns it into $r\backslash c$, |
606 |
and attaches the original $r^*$ |
|
607 |
after $r\backslash c$, so that |
|
608 |
we can further unfold it as many times as needed: |
|
609 |
\[ |
|
610 |
(r^*) \backslash c \dn (r \backslash c)\cdot r^*. |
|
611 |
\] |
|
612 |
Again, |
|
637 | 613 |
the structure is the same as the language derivative of the Kleene star: |
532 | 614 |
\[ |
564 | 615 |
\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*) |
532 | 616 |
\] |
564 | 617 |
In the above definition of $(r_1\cdot r_2) \backslash c$, |
618 |
the $\textit{if}$ clause's |
|
619 |
boolean condition |
|
620 |
$[] \in L(r_1)$ needs to be |
|
621 |
somehow recursively computed. |
|
622 |
We call such a function that checks |
|
623 |
whether the empty string $[]$ is |
|
624 |
in the language of a regular expression $\nullable$: |
|
625 |
\begin{center} |
|
626 |
\begin{tabular}{lcl} |
|
627 |
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
628 |
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
629 |
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
630 |
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
631 |
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
632 |
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
633 |
\end{tabular} |
|
634 |
\end{center} |
|
635 |
\noindent |
|
636 |
The $\ZERO$ regular expression |
|
637 |
does not contain any string and |
|
638 |
therefore is not \emph{nullable}. |
|
639 |
$\ONE$ is \emph{nullable} |
|
640 |
by definition. |
|
641 |
The character regular expression $c$ |
|
642 |
corresponds to the singleton set $\{c\}$, |
|
643 |
and therefore does not contain the empty string. |
|
644 |
The alternative regular expression is nullable |
|
645 |
if at least one of its children is nullable. |
|
646 |
The sequence regular expression |
|
647 |
would require both children to have the empty string |
|
648 |
to compose an empty string, and the Kleene star |
|
649 |
is always nullable because it naturally |
|
579 | 650 |
contains the empty string. |
532 | 651 |
\noindent |
638 | 652 |
We have the following two correspondences between |
564 | 653 |
derivatives on regular expressions and |
654 |
derivatives on a set of strings: |
|
655 |
\begin{lemma}\label{derDer} |
|
608 | 656 |
\mbox{} |
579 | 657 |
\begin{itemize} |
658 |
\item |
|
532 | 659 |
$\textit{Der} \; c \; L(r) = L (r\backslash c)$ |
579 | 660 |
\item |
661 |
$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$. |
|
662 |
\end{itemize} |
|
532 | 663 |
\end{lemma} |
579 | 664 |
\begin{proof} |
665 |
By induction on $r$. |
|
666 |
\end{proof} |
|
532 | 667 |
\noindent |
638 | 668 |
which are the main properties of derivatives |
669 |
that enables us later to reason about the correctness of |
|
622 | 670 |
derivative-based matching. |
532 | 671 |
We can generalise the derivative operation shown above for single characters |
672 |
to strings as follows: |
|
673 |
||
674 |
\begin{center} |
|
675 |
\begin{tabular}{lcl} |
|
676 |
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\ |
|
583 | 677 |
$r \backslash_s [\,] $ & $\dn$ & $r$ |
532 | 678 |
\end{tabular} |
679 |
\end{center} |
|
680 |
||
681 |
\noindent |
|
564 | 682 |
When there is no ambiguity, we will |
683 |
omit the subscript and use $\backslash$ instead |
|
583 | 684 |
of $\backslash_s$ to denote |
532 | 685 |
string derivatives for brevity. |
622 | 686 |
Brzozowski's derivative-based |
687 |
regular-expression matching algorithm can then be described as: |
|
532 | 688 |
|
689 |
\begin{definition} |
|
564 | 690 |
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$ |
532 | 691 |
\end{definition} |
692 |
||
693 |
\noindent |
|
637 | 694 |
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$, |
695 |
this algorithm, presented graphically, is as follows: |
|
532 | 696 |
|
601 | 697 |
\begin{equation}\label{matcher} |
532 | 698 |
\begin{tikzcd} |
583 | 699 |
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & |
700 |
r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & |
|
701 |
\;\textrm{true}/\textrm{false} |
|
532 | 702 |
\end{tikzcd} |
703 |
\end{equation} |
|
704 |
||
705 |
\noindent |
|
622 | 706 |
It is relatively |
707 |
easy to show that this matcher is correct, namely |
|
539 | 708 |
\begin{lemma} |
564 | 709 |
$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$ |
539 | 710 |
\end{lemma} |
711 |
\begin{proof} |
|
637 | 712 |
By induction on $s$ using the property of derivatives: |
583 | 713 |
lemma \ref{derDer}. |
539 | 714 |
\end{proof} |
601 | 715 |
\begin{figure} |
564 | 716 |
\begin{center} |
539 | 717 |
\begin{tikzpicture} |
718 |
\begin{axis}[ |
|
719 |
xlabel={$n$}, |
|
720 |
ylabel={time in secs}, |
|
601 | 721 |
%ymode = log, |
539 | 722 |
legend entries={Naive Matcher}, |
723 |
legend pos=north west, |
|
724 |
legend cell align=left] |
|
725 |
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data}; |
|
726 |
\end{axis} |
|
727 |
\end{tikzpicture} |
|
583 | 728 |
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form |
729 |
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s} |
|
730 |
$ using Brzozowski's original algorithm}\label{NaiveMatcher} |
|
601 | 731 |
\end{center} |
539 | 732 |
\end{figure} |
733 |
\noindent |
|
564 | 734 |
If we implement the above algorithm naively, however, |
666 | 735 |
the algorithm can be as slow as backtracking lexers, as shown in |
564 | 736 |
\ref{NaiveMatcher}. |
737 |
Note that both axes are in logarithmic scale. |
|
637 | 738 |
Around two dozen characters |
638 | 739 |
this algorithm already ``explodes'' with the regular expression |
564 | 740 |
$(a^*)^*b$. |
622 | 741 |
To improve this situation, we need to introduce simplification |
742 |
rules for the intermediate results, |
|
638 | 743 |
such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$, |
539 | 744 |
and make sure those rules do not change the |
745 |
language of the regular expression. |
|
638 | 746 |
One simple-minded simplification function |
622 | 747 |
that achieves these requirements |
748 |
is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}): |
|
564 | 749 |
\begin{center} |
750 |
\begin{tabular}{lcl} |
|
751 |
$\simp \; r_1 \cdot r_2 $ & $ \dn$ & |
|
752 |
$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\ |
|
753 |
& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\ |
|
754 |
& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\ |
|
755 |
& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\ |
|
756 |
& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\ |
|
757 |
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\ |
|
758 |
$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\ |
|
759 |
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\ |
|
760 |
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\ |
|
761 |
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\ |
|
583 | 762 |
$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$ |
564 | 763 |
|
764 |
\end{tabular} |
|
765 |
\end{center} |
|
766 |
If we repeatedly apply this simplification |
|
767 |
function during the matching algorithm, |
|
768 |
we have a matcher with simplification: |
|
769 |
\begin{center} |
|
770 |
\begin{tabular}{lcl} |
|
771 |
$\derssimp \; [] \; r$ & $\dn$ & $r$\\ |
|
772 |
$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\ |
|
773 |
$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$ |
|
774 |
\end{tabular} |
|
775 |
\end{center} |
|
776 |
\begin{figure} |
|
539 | 777 |
\begin{tikzpicture} |
778 |
\begin{axis}[ |
|
779 |
xlabel={$n$}, |
|
780 |
ylabel={time in secs}, |
|
601 | 781 |
%ymode = log, |
782 |
%xmode = log, |
|
564 | 783 |
grid = both, |
539 | 784 |
legend entries={Matcher With Simp}, |
785 |
legend pos=north west, |
|
786 |
legend cell align=left] |
|
787 |
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data}; |
|
788 |
\end{axis} |
|
564 | 789 |
\end{tikzpicture} |
790 |
\caption{$(a^*)^*b$ |
|
791 |
against |
|
792 |
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher} |
|
793 |
\end{figure} |
|
794 |
\noindent |
|
795 |
The running time of $\textit{ders}\_\textit{simp}$ |
|
583 | 796 |
on the same example of Figure \ref{NaiveMatcher} |
797 |
is now ``tame'' in terms of the length of inputs, |
|
798 |
as shown in Figure \ref{BetterMatcher}. |
|
539 | 799 |
|
637 | 800 |
So far, the story is use Brzozowski derivatives and |
801 |
simplify as much as possible, and at the end test |
|
622 | 802 |
whether the empty string is recognised |
803 |
by the final derivative. |
|
583 | 804 |
But what if we want to |
805 |
do lexing instead of just getting a true/false answer? |
|
622 | 806 |
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and |
807 |
elegant (arguably as beautiful as the definition of the |
|
808 |
Brzozowski derivative) solution for this. |
|
668
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
809 |
\marginpar{explicitly attribute the work} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
810 |
For the same reason described |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
811 |
at the beginning of this chapter, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
812 |
we introduce the formal |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
813 |
semantics of $\POSIX$ lexing by |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
814 |
Ausaf et al.\cite{AusafDyckhoffUrban2016}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
815 |
followed by the first lexing algorithm by |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
816 |
Sulzmanna and Lu \cite{Sulzmann2014} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
817 |
that produces the output conforming |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
818 |
to the $\POSIX$ standard. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
819 |
%TODO: Actually show this in chapter 4. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
820 |
In what follows |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
821 |
we choose to use the Isabelle-style notation |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
822 |
for function and datatype constructor applications, where |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
823 |
the parameters of a function are not enclosed |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
824 |
inside a pair of parentheses (e.g. $f \;x \;y$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
825 |
instead of $f(x,\;y)$). This is mainly |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
667
diff
changeset
|
826 |
to make the text visually more concise. |
538 | 827 |
|
539 | 828 |
\section{Values and the Lexing Algorithm by Sulzmann and Lu} |
564 | 829 |
In this section, we present a two-phase regular expression lexing |
830 |
algorithm. |
|
831 |
The first phase takes successive derivatives with |
|
832 |
respect to the input string, |
|
833 |
and the second phase does the reverse, \emph{injecting} back |
|
834 |
characters, in the meantime constructing a lexing result. |
|
835 |
We will introduce the injection phase in detail slightly |
|
836 |
later, but as a preliminary we have to first define |
|
837 |
the datatype for lexing results, |
|
838 |
called \emph{value} or |
|
608 | 839 |
sometimes also \emph{lexical value}. |
840 |
Values and regular |
|
841 |
expressions correspond to each other |
|
842 |
as illustrated in the following |
|
538 | 843 |
table: |
844 |
||
845 |
\begin{center} |
|
846 |
\begin{tabular}{c@{\hspace{20mm}}c} |
|
847 |
\begin{tabular}{@{}rrl@{}} |
|
848 |
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ |
|
849 |
$r$ & $::=$ & $\ZERO$\\ |
|
850 |
& $\mid$ & $\ONE$ \\ |
|
851 |
& $\mid$ & $c$ \\ |
|
852 |
& $\mid$ & $r_1 \cdot r_2$\\ |
|
853 |
& $\mid$ & $r_1 + r_2$ \\ |
|
854 |
\\ |
|
855 |
& $\mid$ & $r^*$ \\ |
|
856 |
\end{tabular} |
|
857 |
& |
|
858 |
\begin{tabular}{@{\hspace{0mm}}rrl@{}} |
|
859 |
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\ |
|
860 |
$v$ & $::=$ & \\ |
|
861 |
& & $\Empty$ \\ |
|
591 | 862 |
& $\mid$ & $\Char \; c$ \\ |
538 | 863 |
& $\mid$ & $\Seq\,v_1\, v_2$\\ |
591 | 864 |
& $\mid$ & $\Left \;v$ \\ |
865 |
& $\mid$ & $\Right\;v$ \\ |
|
538 | 866 |
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\ |
867 |
\end{tabular} |
|
868 |
\end{tabular} |
|
869 |
\end{center} |
|
870 |
\noindent |
|
564 | 871 |
A value has an underlying string, which |
872 |
can be calculated by the ``flatten" function $|\_|$: |
|
873 |
\begin{center} |
|
874 |
\begin{tabular}{lcl} |
|
875 |
$|\Empty|$ & $\dn$ & $[]$\\ |
|
876 |
$|\Char \; c|$ & $ \dn$ & $ [c]$\\ |
|
591 | 877 |
$|\Seq \; v_1, \;v_2|$ & $ \dn$ & $ v_1| @ |v_2|$\\ |
878 |
$|\Left \; v|$ & $ \dn$ & $ |v|$\\ |
|
879 |
$|\Right \; v|$ & $ \dn$ & $ |v|$\\ |
|
880 |
$|\Stars \; []|$ & $\dn$ & $[]$\\ |
|
881 |
$|\Stars \; v::vs|$ & $\dn$ & $ |v| @ |\Stars(vs)|$ |
|
564 | 882 |
\end{tabular} |
883 |
\end{center} |
|
884 |
Sulzmann and Lu used a binary predicate, written $\vdash v:r $, |
|
885 |
to indicate that a value $v$ could be generated from a lexing algorithm |
|
622 | 886 |
with input $r$. They call it the value inhabitation relation, |
887 |
defined by the rules. |
|
628 | 888 |
\begin{figure}[H] |
538 | 889 |
\begin{mathpar} |
591 | 890 |
\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em} |
564 | 891 |
|
892 |
\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em} |
|
893 |
||
591 | 894 |
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)} |
564 | 895 |
|
591 | 896 |
\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2} |
564 | 897 |
|
591 | 898 |
\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2} |
538 | 899 |
|
591 | 900 |
\inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars \; vs : r^*} |
564 | 901 |
\end{mathpar} |
628 | 902 |
\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab} |
626 | 903 |
\end{figure} |
564 | 904 |
\noindent |
905 |
The condition $|v| \neq []$ in the premise of star's rule |
|
906 |
is to make sure that for a given pair of regular |
|
907 |
expression $r$ and string $s$, the number of values |
|
908 |
satisfying $|v| = s$ and $\vdash v:r$ is finite. |
|
601 | 909 |
This additional condition was |
910 |
imposed by Ausaf and Urban to make their proofs easier. |
|
622 | 911 |
Given a string and a regular expression, there can be |
564 | 912 |
multiple values for it. For example, both |
913 |
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and |
|
914 |
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold |
|
915 |
and the values both flatten to $abc$. |
|
916 |
Lexers therefore have to disambiguate and choose only |
|
637 | 917 |
one of the values to be generated. $\POSIX$ is one of the |
564 | 918 |
disambiguation strategies that is widely adopted. |
919 |
||
638 | 920 |
Ausaf et al. \cite{AusafDyckhoffUrban2016} |
564 | 921 |
formalised the property |
922 |
as a ternary relation. |
|
923 |
The $\POSIX$ value $v$ for a regular expression |
|
538 | 924 |
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified |
622 | 925 |
in the following rules\footnote{The names of the rules are used |
926 |
as they were originally given in \cite{AusafDyckhoffUrban2016}.}: |
|
927 |
\begin{figure}[p] |
|
564 | 928 |
\begin{mathpar} |
929 |
\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty} |
|
930 |
||
931 |
\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c} |
|
932 |
||
933 |
\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1} |
|
934 |
||
935 |
\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2} |
|
936 |
||
937 |
\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\ |
|
938 |
\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land |
|
939 |
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow |
|
940 |
\Seq \; v_1 \; v_2} |
|
941 |
||
942 |
\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])} |
|
943 |
||
944 |
\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\ |
|
945 |
|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land |
|
946 |
s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \; |
|
947 |
(v::vs)} |
|
948 |
\end{mathpar} |
|
622 | 949 |
\caption{The inductive POSIX rules given by Ausaf et al. |
950 |
\cite{AusafDyckhoffUrban2016}. |
|
951 |
This ternary relation, written $(s, r) \rightarrow v$, |
|
952 |
formalises the POSIX constraints on the |
|
601 | 953 |
value $v$ given a string $s$ and |
954 |
regular expression $r$. |
|
955 |
} |
|
646 | 956 |
\label{fig:POSIXDef} |
622 | 957 |
\end{figure}\afterpage{\clearpage} |
538 | 958 |
\noindent |
579 | 959 |
|
622 | 960 |
%\begin{figure} |
961 |
%\begin{tikzpicture}[] |
|
962 |
% \node [minimum width = 6cm, rectangle split, rectangle split horizontal, |
|
963 |
% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}] |
|
964 |
% (node1) |
|
965 |
% {$r_{token1}$ |
|
966 |
% \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ }; |
|
967 |
% %\node [left = 6.0cm of node1] (start1) {hi}; |
|
968 |
% \node [left = 0.2cm of node1] (middle) {$v.s.$}; |
|
969 |
% \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, |
|
970 |
% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}] |
|
971 |
% (node2) |
|
972 |
% {$\quad\;\;\;r_{token1}\quad\;\;\;$ |
|
973 |
% \nodepart{two} $r_{token2}$ }; |
|
974 |
% \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX}; |
|
975 |
% \node [above = 1.5cm of middle, minimum width = 6cm, |
|
976 |
% rectangle, style={draw, rounded corners, inner sep=10pt}] |
|
977 |
% (topNode) {$s$}; |
|
978 |
% \path[->,draw] |
|
979 |
% (topNode) edge node {split $A$} (node2) |
|
980 |
% (topNode) edge node {split $B$} (node1) |
|
981 |
% ; |
|
982 |
% |
|
983 |
% |
|
984 |
%\end{tikzpicture} |
|
985 |
%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch} |
|
986 |
%\end{figure} |
|
637 | 987 |
The above $\POSIX$ rules follow the intuition described below: |
538 | 988 |
\begin{itemize} |
564 | 989 |
\item (Left Priority)\\ |
637 | 990 |
Match the leftmost regular expression when multiple options for matching |
622 | 991 |
are available. See P+L and P+R where in P+R $s$ cannot |
992 |
be in the language of $L \; r_1$. |
|
564 | 993 |
\item (Maximum munch)\\ |
994 |
Always match a subpart as much as possible before proceeding |
|
622 | 995 |
to the next part of the string. |
579 | 996 |
For example, when the string $s$ matches |
622 | 997 |
$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split: |
998 |
Then the split that matches a longer string for the first part |
|
999 |
$r_{part1}$ is preferred by this maximum munch rule. |
|
637 | 1000 |
The side-condition |
622 | 1001 |
\begin{center} |
1002 |
$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land |
|
1003 |
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$ |
|
1004 |
\end{center} |
|
637 | 1005 |
in PS causes this. |
622 | 1006 |
%(See |
1007 |
%\ref{munch} for an illustration). |
|
538 | 1008 |
\end{itemize} |
564 | 1009 |
\noindent |
1010 |
These disambiguation strategies can be |
|
1011 |
quite practical. |
|
538 | 1012 |
For instance, when lexing a code snippet |
564 | 1013 |
\[ |
1014 |
\textit{iffoo} = 3 |
|
1015 |
\] |
|
622 | 1016 |
using a regular expression |
1017 |
for keywords and |
|
1018 |
identifiers: |
|
1019 |
%(for example, keyword is a nonempty string starting with letters |
|
1020 |
%followed by alphanumeric characters or underscores): |
|
564 | 1021 |
\[ |
622 | 1022 |
r_{keyword} + r_{identifier}. |
564 | 1023 |
\] |
622 | 1024 |
If we want $\textit{iffoo}$ to be recognized |
1025 |
as an identifier |
|
1026 |
where identifiers are defined as usual (letters |
|
1027 |
followed by letters, numbers or underscores), |
|
1028 |
then a match with a keyword (if) |
|
564 | 1029 |
followed by |
622 | 1030 |
an identifier (foo) would be incorrect. |
638 | 1031 |
POSIX lexing generates what is included by lexing. |
564 | 1032 |
|
622 | 1033 |
\noindent |
1034 |
We know that a POSIX |
|
1035 |
value for regular expression $r$ is inhabited by $r$. |
|
541 | 1036 |
\begin{lemma} |
1037 |
$(r, s) \rightarrow v \implies \vdash v: r$ |
|
1038 |
\end{lemma} |
|
1039 |
\noindent |
|
622 | 1040 |
The main property about a $\POSIX$ value is that |
538 | 1041 |
given the same regular expression $r$ and string $s$, |
1042 |
one can always uniquely determine the $\POSIX$ value for it: |
|
1043 |
\begin{lemma} |
|
1044 |
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$ |
|
1045 |
\end{lemma} |
|
539 | 1046 |
\begin{proof} |
564 | 1047 |
By induction on $s$, $r$ and $v_1$. The inductive cases |
1048 |
are all the POSIX rules. |
|
1049 |
Probably the most cumbersome cases are |
|
1050 |
the sequence and star with non-empty iterations. |
|
567 | 1051 |
We shall give the details for proving the sequence case here. |
539 | 1052 |
|
567 | 1053 |
When we have |
1054 |
\[ |
|
1055 |
(s_1, r_1) \rightarrow v_1 \;\, and \;\, |
|
601 | 1056 |
(s_2, r_2) \rightarrow v_2 \;\, and \;\, |
567 | 1057 |
\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land |
1058 |
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2 |
|
1059 |
\] |
|
1060 |
we know that the last condition |
|
1061 |
excludes the possibility of a |
|
1062 |
string $s_1'$ longer than $s_1$ such that |
|
1063 |
\[ |
|
1064 |
(s_1', r_1) \rightarrow v_1' \;\; |
|
1065 |
and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s |
|
1066 |
\] |
|
1067 |
hold. |
|
1068 |
A shorter string $s_1''$ with $s_2''$ satisfying |
|
1069 |
\[ |
|
1070 |
(s_1'', r_1) \rightarrow v_1'' |
|
1071 |
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s |
|
1072 |
\] |
|
1073 |
cannot possibly form a $\POSIX$ value either, because |
|
637 | 1074 |
by definition, there is a candidate |
1075 |
with a longer initial string |
|
567 | 1076 |
$s_1$. Therefore, we know that the POSIX |
1077 |
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching |
|
1078 |
$s$ must have the |
|
1079 |
property that |
|
1080 |
\[ |
|
1081 |
|a| = s_1 \;\; and \;\; |b| = s_2. |
|
1082 |
\] |
|
1083 |
The goal is to prove that $a = v_1 $ and $b = v_2$. |
|
1084 |
If we have some other POSIX values $v_{10}$ and $v_{20}$ such that |
|
1085 |
$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold, |
|
1086 |
then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, |
|
1087 |
which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$ |
|
539 | 1088 |
is the same as $\Seq(v_1, v_2)$. |
1089 |
\end{proof} |
|
567 | 1090 |
\noindent |
637 | 1091 |
We have now defined what a POSIX value is and shown that it is unique. |
1092 |
The problem is to generate |
|
567 | 1093 |
such a value in a lexing algorithm using derivatives. |
538 | 1094 |
|
1095 |
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm} |
|
1096 |
||
567 | 1097 |
Sulzmann and Lu extended Brzozowski's |
1098 |
derivative-based matching |
|
622 | 1099 |
to a lexing algorithm by a second phase |
567 | 1100 |
after the initial phase of successive derivatives. |
1101 |
This second phase generates a POSIX value |
|
1102 |
if the regular expression matches the string. |
|
638 | 1103 |
The algorithm uses two functions called $\inj$ and $\mkeps$. |
622 | 1104 |
The function $\mkeps$ constructs a POSIX value from the last |
567 | 1105 |
derivative $r_n$: |
538 | 1106 |
\begin{ceqn} |
1107 |
\begin{equation}\label{graph:mkeps} |
|
1108 |
\begin{tikzcd} |
|
567 | 1109 |
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\ |
538 | 1110 |
& & & v_n |
1111 |
\end{tikzcd} |
|
1112 |
\end{equation} |
|
1113 |
\end{ceqn} |
|
567 | 1114 |
\noindent |
1115 |
In the above diagram, again we assume that |
|
1116 |
the input string $s$ is made of $n$ characters |
|
622 | 1117 |
$c_0c_1 \ldots c_{n-1}$ |
1118 |
The last derivative operation |
|
1119 |
$\backslash c_{n-1}$ generates the derivative $r_n$, for which |
|
1120 |
$\mkeps$ produces the value $v_n$. This value |
|
1121 |
tells us how the empty string is matched by the (nullable) |
|
1122 |
regular expression $r_n$, in a POSIX way. |
|
567 | 1123 |
The definition of $\mkeps$ is |
538 | 1124 |
\begin{center} |
1125 |
\begin{tabular}{lcl} |
|
564 | 1126 |
$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\ |
567 | 1127 |
$\mkeps \; (r_{1}+r_{2})$ & $\dn$ |
1128 |
& $\textit{if}\; (\nullable \; r_{1}) \;\, |
|
1129 |
\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ |
|
1130 |
& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\ |
|
1131 |
$\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\ |
|
564 | 1132 |
$\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$ |
538 | 1133 |
\end{tabular} |
1134 |
\end{center} |
|
1135 |
||
1136 |
||
1137 |
\noindent |
|
622 | 1138 |
The function prefers the left child $r_1$ of $r_1 + r_2$ |
567 | 1139 |
to match an empty string if there is a choice. |
622 | 1140 |
When there is a star to match the empty string, |
538 | 1141 |
we give the $\Stars$ constructor an empty list, meaning |
567 | 1142 |
no iteration is taken. |
622 | 1143 |
The result of $\mkeps$ on a $\nullable$ $r$ |
1144 |
is a POSIX value for $r$ and the empty string: |
|
567 | 1145 |
\begin{lemma}\label{mePosix} |
638 | 1146 |
$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$ |
567 | 1147 |
\end{lemma} |
1148 |
\begin{proof} |
|
622 | 1149 |
By induction on $r$. |
567 | 1150 |
\end{proof} |
1151 |
\noindent |
|
622 | 1152 |
After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one |
567 | 1153 |
in reverse order as they were chopped off in the derivative phase. |
637 | 1154 |
The function for this is called $\inj$. This function |
622 | 1155 |
operates on values, unlike $\backslash$ which operates on regular expressions. |
567 | 1156 |
In the diagram below, $v_i$ stands for the (POSIX) value |
1157 |
for how the regular expression |
|
1158 |
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters |
|
1159 |
of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
|
538 | 1160 |
After injecting back $n$ characters, we get the lexical value for how $r_0$ |
1161 |
matches $s$. |
|
601 | 1162 |
\begin{figure}[H] |
1163 |
\begin{center} |
|
538 | 1164 |
\begin{ceqn} |
1165 |
\begin{tikzcd} |
|
567 | 1166 |
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
1167 |
v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] |
|
538 | 1168 |
\end{tikzcd} |
1169 |
\end{ceqn} |
|
601 | 1170 |
\end{center} |
1171 |
\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016}, |
|
1172 |
matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$. |
|
1173 |
The first phase involves taking successive derivatives w.r.t the characters $c_0$, |
|
1174 |
$c_1$, and so on. These are the same operations as they have appeared in the matcher |
|
1175 |
\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string), |
|
637 | 1176 |
then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches |
1177 |
the empty string, by always selecting the leftmost |
|
1178 |
nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they |
|
601 | 1179 |
appeared in the string, always preserving POSIXness.}\label{graph:inj} |
1180 |
\end{figure} |
|
538 | 1181 |
\noindent |
623 | 1182 |
The function $\textit{inj}$ as defined by Sulzmann and Lu |
1183 |
takes three arguments: a regular |
|
567 | 1184 |
expression ${r_{i}}$, before the character is chopped off, |
623 | 1185 |
a character ${c_{i}}$ (the character we want to inject back) and |
567 | 1186 |
the third argument $v_{i+1}$ the value we want to inject into. |
568 | 1187 |
The result of an application |
1188 |
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that |
|
1189 |
\[ |
|
1190 |
(s_i, r_i) \rightarrow v_i |
|
1191 |
\] |
|
1192 |
holds. |
|
567 | 1193 |
The definition of $\textit{inj}$ is as follows: |
538 | 1194 |
\begin{center} |
568 | 1195 |
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l} |
1196 |
$\textit{inj}\;(c)\;c\,Empty$ & $\dn$ & $\Char\,c$\\ |
|
1197 |
$\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\ |
|
1198 |
$\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\ |
|
1199 |
$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ & |
|
1200 |
$\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\ |
|
1201 |
$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & |
|
1202 |
$\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\ |
|
1203 |
$\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\ |
|
1204 |
$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\ |
|
538 | 1205 |
\end{tabular} |
1206 |
\end{center} |
|
1207 |
\noindent |
|
623 | 1208 |
The function recurses on |
568 | 1209 |
the shape of regular |
637 | 1210 |
expressions and values. |
568 | 1211 |
Intuitively, each clause analyses |
1212 |
how $r_i$ could have transformed when being |
|
1213 |
derived by $c$, identifying which subpart |
|
1214 |
of $v_{i+1}$ has the ``hole'' |
|
1215 |
to inject the character back into. |
|
1216 |
Once the character is |
|
1217 |
injected back to that sub-value; |
|
637 | 1218 |
$\inj$ assembles all parts |
568 | 1219 |
to form a new value. |
1220 |
||
1221 |
For instance, the last clause is an |
|
1222 |
injection into a sequence value $v_{i+1}$ |
|
1223 |
whose second child |
|
637 | 1224 |
value is a star and the shape of the |
568 | 1225 |
regular expression $r_i$ before injection |
1226 |
is a star. |
|
1227 |
We therefore know |
|
1228 |
the derivative |
|
1229 |
starts on a star and ends as a sequence: |
|
1230 |
\[ |
|
1231 |
(r^*) \backslash c \longrightarrow r\backslash c \cdot r^* |
|
1232 |
\] |
|
1233 |
during which an iteration of the star |
|
1234 |
had just been unfolded, giving the below |
|
1235 |
value inhabitation relation: |
|
1236 |
\[ |
|
1237 |
\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*. |
|
1238 |
\] |
|
1239 |
The value list $vs$ corresponds to |
|
1240 |
matched star iterations, |
|
1241 |
and the ``hole'' lies in $v$ because |
|
1242 |
\[ |
|
1243 |
\vdash v: r\backslash c. |
|
1244 |
\] |
|
1245 |
Finally, |
|
1246 |
$\inj \; r \;c \; v$ is prepended |
|
637 | 1247 |
to the previous list of iterations and then |
568 | 1248 |
wrapped under the $\Stars$ |
1249 |
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$. |
|
538 | 1250 |
|
667 | 1251 |
Putting all together, Sulzmann and Lu obtained the following algorithm |
1252 |
outlined in the injection-based lexing diagram \ref{graph:inj}: |
|
1253 |
\begin{center} |
|
1254 |
\begin{tabular}{lcl} |
|
1255 |
$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ |
|
1256 |
$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\ |
|
1257 |
& & $\quad \phantom{\mid}\; \None \implies \None$\\ |
|
1258 |
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$ |
|
1259 |
\end{tabular} |
|
1260 |
\end{center} |
|
1261 |
\noindent |
|
1262 |
||
1263 |
\subsection{Examples on How Injection and Lexer Works} |
|
1264 |
We will provide a few examples on how $\inj$ and $\lexer$ |
|
1265 |
works by showing their values in each recursive call on some |
|
1266 |
concrete examples. |
|
1267 |
We start with the call $\lexer \; (a+aa)^*\cdot c\; aac$ |
|
1268 |
on the lexer, note that the value's character constructor |
|
1269 |
$\Char \;c$ is abbreviated as $c$ for readability. |
|
1270 |
Similarly the last derivative's sub-expression |
|
1271 |
is abbreviated as $r_{=0}$\footnote{which is equal to |
|
1272 |
$((\ZERO + (\ZERO a+ \ZERO))\cdot(a+aa)^* + (\ZERO + \ZERO a)\cdot (a+aa)^*)+ |
|
1273 |
((\ZERO+\ZERO a)\cdot(a+aa)^* + (\ZERO+\ZERO a)\cdot(a+aa)^*), |
|
1274 |
$.} |
|
1275 |
whose language interpretation is equivalent to that of |
|
1276 |
$\ZERO$ and therefore not crucial to be displayed fully expanded, |
|
1277 |
as they will not be injected into.%our example run. |
|
1278 |
\begin{center} |
|
1279 |
\begin{tabular}{lcl} |
|
1280 |
$(a+\textcolor{magenta}{a}\textcolor{blue}{a})^* \cdot \textcolor{red}{c}$ & $\stackrel{\backslash \textcolor{magenta}{a}}{\rightarrow}$ & $((\ONE + \textcolor{magenta}{\ONE} \textcolor{blue}{a})\cdot (a+aa)^*)\cdot \textcolor{red}{c}+\ZERO$\\ |
|
1281 |
%& $\stackrel{\rightarrow}{\backslash a}$ & $((\ONE + \ONE a)\cdot (a+aa)^*)\cdot c + \ZERO$\\ |
|
1282 |
& $\stackrel{\backslash \textcolor{blue}{a}}{\rightarrow}$ & |
|
1283 |
$(((\ZERO+(\ZERO a+ \textcolor{blue}{\ONE}))\cdot (a+aa)^* + (\ONE+\ONE a)\cdot (a+aa)^* )\cdot \textcolor{red}{\mathbf{c}} + \ZERO)+\ZERO$\\ |
|
1284 |
& $\stackrel{\backslash \textcolor{red}{c}}{\rightarrow}$ & |
|
1285 |
$((r_{=0}\cdot c + \textcolor{red}{\ONE})+\ZERO)+\ZERO$\\ |
|
1286 |
& $\stackrel{\mkeps}{\rightarrow}$ & $\Left (\Left \; (\Right \; \textcolor{red}{\Empty}))$ \\ |
|
1287 |
& $\stackrel{\inj \;\textcolor{red}{c} }{\rightarrow}$ & |
|
1288 |
$\Left \; (\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; \textcolor{blue}{\Empty})) \; \Stars \, [])) \; \textcolor{red}{c}))$\\ |
|
1289 |
& $\stackrel{\inj \;\textcolor{blue}{a}}{\rightarrow}$ & |
|
1290 |
$\Left\; (\Seq \; (\Seq\; (\Right \; (\Seq \; \textcolor{magenta}{\Empty }\; \textcolor{blue}{ \mathbf{a}}) ) |
|
1291 |
\;\Stars \,[]) \; \textcolor{red}{c})$\\ |
|
1292 |
& $\stackrel{\inj \;\textcolor{magenta}{a}}{\rightarrow}$ & $\Seq \; (\Stars \; [\Right \; (\Seq \; \mathbf{\textcolor{magenta}{a}} \; \textcolor{blue}{a})]) \; \textcolor{red}{ c}$ |
|
1293 |
||
1294 |
%$\inj \; r \; c\A$ |
|
1295 |
%$\inj \; (a\cdot b)\cdot c \; \Seq \; \ONE \; b$ & $=$ & $(a+e)$\\ |
|
1296 |
\end{tabular} |
|
1297 |
\end{center} |
|
1298 |
\noindent |
|
1299 |
We have assigned different colours for each character, |
|
1300 |
as well as their corresponding locations in values and regular expressions. |
|
1301 |
The most recently injected character is marked with a bold font. |
|
1302 |
%Their corresponding derivative steps have been marked with the same |
|
1303 |
%colour. We also mark the specific location where the coloured character |
|
1304 |
%was injected into with the same colour, before and after the injection occurred. |
|
1305 |
%TODO: can be also used to motivate injection->blexer in Bitcoded1 |
|
1306 |
To show the details of how $\inj$ works, |
|
1307 |
we zoom in to the second injection above, |
|
1308 |
illustrating the recursive calls involved: |
|
1309 |
\begin{center} |
|
1310 |
\begin{tabular}{lcl} |
|
1311 |
$\inj \;\quad ((\ONE + {\ONE} |
|
1312 |
\textcolor{blue}{a})\cdot (a+aa)^*)\cdot |
|
1313 |
c + \ZERO \; \quad \textcolor{blue}{a} \; $ & &\\ |
|
1314 |
$\quad \Left \; |
|
1315 |
(\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; |
|
1316 |
\textcolor{blue}{\Empty})) \; \Stars \, [])) \; c))$ & \\$=$\\ |
|
1317 |
$\Left \; (\inj \; ((\ONE + \ONE |
|
1318 |
\textcolor{blue}{a})\cdot (a+aa)^*)\cdot |
|
1319 |
c \quad \textcolor{blue}{a} \quad (\Left \; (\Seq\; ( \Left \; (\Seq \; (\Right \; (\Right\; |
|
1320 |
\textcolor{blue}{\Empty})) \; \Stars \, []) ) \; c ) )\;\;)$ &\\ $=$\\ |
|
1321 |
$\Left \; (\Seq \; (\inj \; (\ONE + \ONE \textcolor{blue}{a})\cdot(a+aa)^* \quad \textcolor{blue}{a} \quad |
|
1322 |
(\Left \; (\Seq \; (\Right \; (\Right\;\textcolor{blue}{\Empty})))) ) \; c ) $ & \\$=$\\ |
|
1323 |
$\Left \; (\Seq \; (\Seq \; (\inj \quad (\ONE + \ONE \textcolor{blue}{a}) \quad \textcolor{blue}{a}\quad \Right \;(\Right \; \textcolor{blue}{\Empty}) ) \; \Stars \,[])\; c)$ &\\ $=$ \\ |
|
1324 |
$\Left \; (\Seq \; (\Seq \; (\Right \; (\inj \; \ONE \textcolor{blue}{a} \quad \textcolor{blue}{a}\quad (\Right \;\textcolor{blue}{\Empty})))) \; \Stars \, [])$ |
|
1325 |
& \\ $=$\\ |
|
1326 |
$\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; (\mkeps \; \ONE)\;(\inj \;\textcolor{blue}{a} \; \textcolor{blue}{a} \; \textcolor{blue}{\Empty})) ) ) \; \Stars \, [] )$ |
|
1327 |
& \\ $=$\\ |
|
1328 |
$\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; \Empty \; \textcolor{blue}{a}) ) ) \; \Stars \, [] )$ |
|
1329 |
\end{tabular} |
|
1330 |
\end{center} |
|
1331 |
\noindent |
|
1332 |
We will now introduce the proofs related to $\inj$ and |
|
1333 |
$\lexer$. These proofs have originally been found by Ausaf et al. |
|
1334 |
in their 2016 work \cite{AusafDyckhoffUrban2016}. |
|
1335 |
Proofs to some of these lemmas have also been discussed in more detail in |
|
1336 |
Ausaf's thesis \cite{Ausaf}. |
|
1337 |
Nevertheless, we still introduce these proofs, to make this thesis |
|
1338 |
self-contained as we show inductive variants used in these proofs break |
|
1339 |
after more simplifications are introduced. |
|
1340 |
%Also note that lemmas like \ref{}described in this thesis is a more faithful |
|
1341 |
%representation of the actual accompanying Isabelle formalisation, and |
|
1342 |
%therefore |
|
1343 |
||
1344 |
||
1345 |
||
568 | 1346 |
Recall that lemma |
623 | 1347 |
\ref{mePosix} tells us that |
1348 |
$\mkeps$ always generates the POSIX value. |
|
1349 |
The function $\inj$ preserves the POSIXness, provided |
|
1350 |
the value before injection is POSIX, namely |
|
568 | 1351 |
\begin{lemma}\label{injPosix} |
623 | 1352 |
If$(r \backslash c, s) \rightarrow v $, |
1353 |
then $(r, c :: s) \rightarrow (\inj r \; c\; v)$. |
|
568 | 1354 |
\end{lemma} |
1355 |
\begin{proof} |
|
1356 |
By induction on $r$. |
|
623 | 1357 |
The non-trivial cases are sequence and star. |
1358 |
When $r = a \cdot b$, there can be |
|
568 | 1359 |
three cases for the value $v$ satisfying $\vdash v:a\backslash c$. |
1360 |
We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each |
|
1361 |
case. |
|
1362 |
\begin{itemize} |
|
1363 |
\item |
|
1364 |
$v = \Seq \; v_a \; v_b$.\\ |
|
1365 |
The ``not nullable'' clause of the $\inj$ function is taken: |
|
1366 |
\begin{center} |
|
1367 |
\begin{tabular}{lcl} |
|
1368 |
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\ |
|
1369 |
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$ |
|
1370 |
\end{tabular} |
|
1371 |
\end{center} |
|
1372 |
We know that there exists a unique pair of |
|
637 | 1373 |
$s_a$ and $s_b$ satisfying |
568 | 1374 |
$(a \backslash c, s_a) \rightarrow v_a$, |
1375 |
$(b , s_b) \rightarrow v_b$, and |
|
1376 |
$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in |
|
1377 |
L \; (a\backslash c) \land |
|
1378 |
s_4 \in L \; b$. |
|
1379 |
The last condition gives us |
|
1380 |
$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in |
|
1381 |
L \; a \land |
|
1382 |
s_4 \in L \; b$. |
|
1383 |
By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds, |
|
1384 |
and this gives us |
|
1385 |
\[ |
|
1386 |
(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b. |
|
1387 |
\] |
|
1388 |
\item |
|
1389 |
$v = \Left \; (\Seq \; v_a \; v_b)$\\ |
|
1390 |
The argument is almost identical to the above case, |
|
1391 |
except that a different clause of $\inj$ is taken: |
|
1392 |
\begin{center} |
|
1393 |
\begin{tabular}{lcl} |
|
1394 |
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\ |
|
1395 |
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$ |
|
1396 |
\end{tabular} |
|
1397 |
\end{center} |
|
637 | 1398 |
With similar reasoning, |
538 | 1399 |
|
568 | 1400 |
\[ |
1401 |
(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b. |
|
1402 |
\] |
|
1403 |
again holds. |
|
1404 |
\item |
|
1405 |
$v = \Right \; v_b$\\ |
|
1406 |
Again the injection result would be |
|
1407 |
\begin{center} |
|
1408 |
\begin{tabular}{lcl} |
|
1409 |
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\ |
|
1410 |
& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$ |
|
1411 |
\end{tabular} |
|
1412 |
\end{center} |
|
1413 |
We know that $a$ must be nullable, |
|
1414 |
allowing us to call $\mkeps$ and get |
|
1415 |
\[ |
|
1416 |
(a, []) \rightarrow \mkeps \; a. |
|
1417 |
\] |
|
637 | 1418 |
Also, by inductive hypothesis |
568 | 1419 |
\[ |
1420 |
(b, c::s) \rightarrow \inj\; b \; c \; v_b |
|
1421 |
\] |
|
1422 |
holds. |
|
1423 |
In addition, as |
|
1424 |
$\Right \;v_b$ instead of $\Left \ldots$ is |
|
1425 |
the POSIX value for $v$, it must be the case |
|
1426 |
that $s \notin L \;( (a\backslash c)\cdot b)$. |
|
1427 |
This tells us that |
|
1428 |
\[ |
|
1429 |
\nexists s_3 \; s_4. |
|
1430 |
s_3 @s_4 = s \land s_3 \in L \; (a\backslash c) |
|
1431 |
\land s_4 \in L \; b |
|
1432 |
\] |
|
1433 |
which translates to |
|
1434 |
\[ |
|
1435 |
\nexists s_3 \; s_4. \; s_3 \neq [] \land |
|
1436 |
s_3 @s_4 = c::s \land s_3 \in L \; a |
|
1437 |
\land s_4 \in L \; b. |
|
1438 |
\] |
|
637 | 1439 |
(Which says there cannot be a longer |
568 | 1440 |
initial split for $s$ other than the empty string.) |
1441 |
Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$ |
|
1442 |
as the POSIX value for $a\cdot b$. |
|
1443 |
\end{itemize} |
|
1444 |
The star case can be proven similarly. |
|
1445 |
\end{proof} |
|
1446 |
\noindent |
|
667 | 1447 |
|
1448 |
||
1449 |
||
1450 |
||
1451 |
||
1452 |
||
1453 |
||
1454 |
%TODO: Cut from previous lexer def, need to make coherent |
|
623 | 1455 |
The central property of the $\lexer$ is that it gives the correct result |
1456 |
according to |
|
1457 |
POSIX rules. |
|
573 | 1458 |
\begin{theorem}\label{lexerCorrectness} |
568 | 1459 |
The $\lexer$ based on derivatives and injections is correct: |
1460 |
\begin{center} |
|
1461 |
\begin{tabular}{lcl} |
|
1462 |
$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\ |
|
1463 |
$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$ |
|
1464 |
\end{tabular} |
|
1465 |
\end{center} |
|
1466 |
\end{theorem} |
|
1467 |
\begin{proof} |
|
623 | 1468 |
By induction on $s$. $r$ generalising over an arbitrary regular expression. |
637 | 1469 |
The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case |
568 | 1470 |
by lemma \ref{injPosix}. |
1471 |
\end{proof} |
|
538 | 1472 |
\noindent |
623 | 1473 |
As we did earlier in this chapter with the matcher, one can |
1474 |
introduce simplification on the regular expression in each derivative step. |
|
637 | 1475 |
However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase) |
1476 |
and ensure that |
|
1477 |
the values align with the regular expression at each step. |
|
539 | 1478 |
Therefore one has to |
538 | 1479 |
be careful not to break the correctness, as the injection |
623 | 1480 |
function heavily relies on the structure of |
1481 |
the regular expressions and values being aligned. |
|
1482 |
This can be achieved by recording some extra rectification functions |
|
637 | 1483 |
during the derivatives step and applying these rectifications in |
538 | 1484 |
each run during the injection phase. |
568 | 1485 |
With extra care |
623 | 1486 |
one can show that POSIXness will not be affected |
1487 |
by the simplifications listed here \cite{AusafDyckhoffUrban2016}. |
|
1488 |
\begin{center} |
|
1489 |
\begin{tabular}{lcl} |
|
1490 |
$\simp \; r_1 \cdot r_2 $ & $ \dn$ & |
|
1491 |
$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\ |
|
1492 |
& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\ |
|
1493 |
& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\ |
|
1494 |
& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\ |
|
1495 |
& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\ |
|
1496 |
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\ |
|
1497 |
$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\ |
|
1498 |
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\ |
|
1499 |
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\ |
|
1500 |
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\ |
|
1501 |
$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$ |
|
1502 |
||
1503 |
\end{tabular} |
|
1504 |
\end{center} |
|
538 | 1505 |
|
637 | 1506 |
However, one can still end up |
1507 |
with exploding derivatives, |
|
1508 |
even with the simple-minded simplification rules allowed |
|
1509 |
in an injection-based lexer. |
|
1510 |
\section{A Case Requiring More Aggressive Simplifications} |
|
539 | 1511 |
For example, when starting with the regular |
585 | 1512 |
expression $(a^* \cdot a^*)^*$ and building just over |
1513 |
a dozen successive derivatives |
|
539 | 1514 |
w.r.t.~the character $a$, one obtains a derivative regular expression |
585 | 1515 |
with millions of nodes (when viewed as a tree) |
637 | 1516 |
even with the mentioned simplifications. |
585 | 1517 |
\begin{figure}[H] |
601 | 1518 |
\begin{center} |
539 | 1519 |
\begin{tikzpicture} |
1520 |
\begin{axis}[ |
|
1521 |
xlabel={$n$}, |
|
1522 |
ylabel={size}, |
|
585 | 1523 |
legend entries={Simple-Minded Simp, Naive Matcher}, |
539 | 1524 |
legend pos=north west, |
1525 |
legend cell align=left] |
|
1526 |
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data}; |
|
585 | 1527 |
\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data}; |
539 | 1528 |
\end{axis} |
1529 |
\end{tikzpicture} |
|
601 | 1530 |
\end{center} |
539 | 1531 |
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$} |
1532 |
\end{figure}\label{fig:BetterWaterloo} |
|
1533 |
||
568 | 1534 |
That is because Sulzmann and Lu's |
1535 |
injection-based lexing algorithm keeps a lot of |
|
541 | 1536 |
"useless" values that will not be used. |
539 | 1537 |
These different ways of matching will grow exponentially with the string length. |
623 | 1538 |
Consider the case |
568 | 1539 |
\[ |
1540 |
r= (a^*\cdot a^*)^* \quad and \quad |
|
1541 |
s=\underbrace{aa\ldots a}_\text{n \textit{a}s} |
|
1542 |
\] |
|
1543 |
as an example. |
|
1544 |
This is a highly ambiguous regular expression, with |
|
1545 |
many ways to split up the string into multiple segments for |
|
637 | 1546 |
different star iterations, |
573 | 1547 |
and for each segment |
1548 |
multiple ways of splitting between |
|
568 | 1549 |
the two $a^*$ sub-expressions. |
573 | 1550 |
When $n$ is equal to $1$, there are two lexical values for |
1551 |
the match: |
|
1552 |
\[ |
|
623 | 1553 |
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1) |
573 | 1554 |
\] |
1555 |
and |
|
1556 |
\[ |
|
623 | 1557 |
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2) |
573 | 1558 |
\] |
1559 |
The derivative of $\derssimp \;s \; r$ is |
|
1560 |
\[ |
|
1561 |
(a^*a^* + a^*)\cdot(a^*a^*)^*. |
|
1562 |
\] |
|
1563 |
The $a^*a^*$ and $a^*$ in the first child of the above sequence |
|
1564 |
correspond to value 1 and value 2, respectively. |
|
1565 |
When $n=2$, the number goes up to 7: |
|
1566 |
\[ |
|
1567 |
\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])] |
|
1568 |
\] |
|
623 | 1569 |
|
573 | 1570 |
\[ |
1571 |
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] |
|
1572 |
\] |
|
623 | 1573 |
|
573 | 1574 |
\[ |
1575 |
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])] |
|
1576 |
\] |
|
623 | 1577 |
|
573 | 1578 |
\[ |
1579 |
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ] |
|
1580 |
\] |
|
623 | 1581 |
|
573 | 1582 |
\[ |
1583 |
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), |
|
1584 |
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]) |
|
1585 |
] |
|
1586 |
\] |
|
623 | 1587 |
|
573 | 1588 |
\[ |
1589 |
\Stars \; [ |
|
1590 |
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]), |
|
1591 |
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]) |
|
1592 |
] |
|
1593 |
\] |
|
1594 |
and |
|
1595 |
\[ |
|
1596 |
\Stars \; [ |
|
1597 |
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]), |
|
1598 |
\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []) |
|
1599 |
] |
|
1600 |
\] |
|
638 | 1601 |
And $\derssimp \; aa \; (a^*a^*)^*$ is |
573 | 1602 |
\[ |
1603 |
((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + |
|
1604 |
(a^*a^* + a^*)\cdot(a^*a^*)^*. |
|
1605 |
\] |
|
1606 |
which removes two out of the seven terms corresponding to the |
|
1607 |
seven distinct lexical values. |
|
1608 |
||
1609 |
It is not surprising that there are exponentially many |
|
1610 |
distinct lexical values that cannot be eliminated by |
|
1611 |
the simple-minded simplification of $\derssimp$. |
|
568 | 1612 |
A lexer without a good enough strategy to |
1613 |
deduplicate will naturally |
|
623 | 1614 |
have an exponential runtime on highly |
637 | 1615 |
ambiguous regular expressions because there |
623 | 1616 |
are exponentially many matches. |
1617 |
For this particular example, it seems |
|
1618 |
that the number of distinct matches growth |
|
1619 |
speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length). |
|
538 | 1620 |
|
573 | 1621 |
On the other hand, the |
1622 |
$\POSIX$ value for $r= (a^*\cdot a^*)^*$ and |
|
1623 |
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is |
|
1624 |
\[ |
|
1625 |
\Stars\, |
|
637 | 1626 |
[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]. |
573 | 1627 |
\] |
637 | 1628 |
At any moment, the subterms in a regular expression |
1629 |
that will potentially result in a POSIX value is only |
|
573 | 1630 |
a minority among the many other terms, |
638 | 1631 |
and one can remove the ones that are not possible to |
573 | 1632 |
be POSIX. |
1633 |
In the above example, |
|
579 | 1634 |
\begin{equation}\label{eqn:growth2} |
573 | 1635 |
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
1636 |
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}. |
|
579 | 1637 |
\end{equation} |
573 | 1638 |
can be further simplified by |
1639 |
removing the underlined term first, |
|
1640 |
which would open up possibilities |
|
1641 |
of further simplification that removes the |
|
1642 |
underbraced part. |
|
1643 |
The result would be |
|
1644 |
\[ |
|
1645 |
(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*. |
|
1646 |
\] |
|
1647 |
with corresponding values |
|
1648 |
\begin{center} |
|
1649 |
\begin{tabular}{lr} |
|
1650 |
$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\ |
|
1651 |
$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$ |
|
1652 |
\end{tabular} |
|
1653 |
\end{center} |
|
637 | 1654 |
Other terms with an underlying value, such as |
573 | 1655 |
\[ |
1656 |
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])] |
|
1657 |
\] |
|
638 | 1658 |
do not to contribute a POSIX lexical value, |
1659 |
and therefore can be thrown away. |
|
538 | 1660 |
|
623 | 1661 |
Ausaf et al. \cite{AusafDyckhoffUrban2016} |
639 | 1662 |
have come up with some simplification steps, |
1663 |
and incorporated the simplification into $\lexer$. |
|
1664 |
They call this lexer $\textit{lexer}_{simp}$ and proved that |
|
1665 |
\[ |
|
1666 |
\lexer \; r\; s = \textit{lexer}_{simp} \; r \; s |
|
1667 |
\] |
|
1668 |
The function $\textit{lexer}_{simp}$ |
|
1669 |
involves some fiddly manipulation of value rectification, |
|
1670 |
which we omit here. |
|
1671 |
however those steps |
|
638 | 1672 |
are not yet sufficiently strong, to achieve the above effects. |
637 | 1673 |
And even with these relatively mild simplifications, the proof |
1674 |
is already quite a bit more complicated than the theorem \ref{lexerCorrectness}. |
|
638 | 1675 |
One would need to prove something like this: |
573 | 1676 |
\[ |
1677 |
\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\; |
|
1678 |
\textit{then}\;\; (r, c::s) \rightarrow |
|
637 | 1679 |
\inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v). |
573 | 1680 |
\] |
1681 |
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$ |
|
1682 |
not only has to return a simplified regular expression, |
|
1683 |
but also what specific simplifications |
|
638 | 1684 |
have been done as a function on values |
573 | 1685 |
showing how one can transform the value |
1686 |
underlying the simplified regular expression |
|
1687 |
to the unsimplified one. |
|
1688 |
||
623 | 1689 |
We therefore choose a slightly different approach |
1690 |
also described by Sulzmann and Lu to |
|
573 | 1691 |
get better simplifications, which uses |
1692 |
some augmented data structures compared to |
|
1693 |
plain regular expressions. |
|
1694 |
We call them \emph{annotated} |
|
1695 |
regular expressions. |
|
1696 |
With annotated regular expressions, |
|
1697 |
we can avoid creating the intermediate values $v_1,\ldots v_n$ and a |
|
1698 |
second phase altogether. |
|
1699 |
We introduce this new datatype and the |
|
1700 |
corresponding algorithm in the next chapter. |
|
538 | 1701 |
|
1702 |
||
1703 |
||
1704 |