% Chapter Template

\chapter{Regular Expressions and POSIX Lexing} % Main chapter title

\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
%use for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant, and discuss
%why more aggressive simplifications are needed.

In this chapter, we define the basic notions
for regular languages and regular expressions.
We also give the definition of what $\POSIX$ lexing means.

\section{Basic Concepts}
Usually in formal language theory there is an alphabet
denoting a set of characters.
Here we only use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII character set.
Then using the usual $[]$ notation for lists,
we can define strings using chars:
\begin{center}
\begin{tabular}{lcl}
$\textit{string}$ & $\dn$ & $[] \mid c :: cs$\\
& & $(c\; \text{has char type})$
\end{tabular}
\end{center}
Strings can be concatenated to form longer strings,
in the same way as we concatenate two lists,
which we denote as $@$. We omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
\end{tabular}
\end{center}
We also call the above \emph{language concatenation}.
The power of a language is defined recursively, using the
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A^n @ A$
\end{tabular}
\end{center}
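\noindent
As a small illustration of these two definitions (the concrete sets are chosen only for this example),
take $A = \{[a],\, [a,b]\}$ and $B = \{[],\, [b]\}$. Unfolding the definitions gives
\begin{center}
\begin{tabular}{lcl}
$A @ B$ & $=$ & $\{[a],\; [a,b],\; [a,b,b]\}$\\
$B^0$ & $=$ & $\{[]\}$\\
$B^1$ & $=$ & $B^0 @ B = \{[],\; [b]\}$\\
$B^2$ & $=$ & $B^1 @ B = \{[],\; [b],\; [b,b]\}$
\end{tabular}
\end{center}
\noindent
The string $[a,b]$ arises twice in $A @ B$, once as $[a] @ [b]$ and once as $[a,b] @ []$,
but it is only one element of the resulting set.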
The union of all the natural number powers of a language
is defined as the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
\end{tabular}
\end{center}

\noindent
However, to obtain a convenient induction principle
in Isabelle/HOL,
we instead define the Kleene star
as an inductive set:

\begin{center}
\begin{mathpar}
\inferrule{}{[] \in A*}

\inferrule{s_1 \in A \\ s_2 \in A*}{s_1 @ s_2 \in A*}
\end{mathpar}
\end{center}

We also define an operation of ``chopping off'' a character from
a language, which we call $\Der$, meaning ``derivative of a language'':
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to ``chopping off'' a string from all strings within a set $A$,
with the help of the concatenation operator:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
\end{tabular}
\end{center}
\noindent
which is essentially the left quotient $A \backslash L'$ of $A$ against
the singleton language $L' = \{w\}$
in formal language theory.
For this dissertation the $\textit{Ders}$ definition with
a single string suffices.
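
\noindent
To see the two operations at work, consider the language
$A = \{[a,b],\, [a,c],\, [b]\}$ (a set picked only for illustration). Then
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;a \;A$ & $=$ & $\{[b],\; [c]\}$\\
$\textit{Der} \;b \;A$ & $=$ & $\{[]\}$\\
$\textit{Der} \;c \;A$ & $=$ & $\{\}$\\
$\textit{Ders} \;[a,b] \;A$ & $=$ & $\{[]\}$
\end{tabular}
\end{center}
\noindent
In this example the last line agrees with taking character derivatives one after another:
$\textit{Der} \;b \;(\textit{Der} \;a \;A) = \textit{Der} \;b \;\{[b],\, [c]\} = \{[]\}$.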

With concatenation, Kleene star and $\textit{Der}$ defined on languages,
the derivative of a compound language can be expressed in terms of
the derivatives of its sub-languages.
\begin{lemma}
\[
\Der \; c \; (A @ B) =
\begin{cases}
((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\
(\Der \; c \; A) \, @ \, B, & \text{otherwise}
\end{cases}
\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.
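
\noindent
A concrete instance may help here (the sets are our own choice for the example).
Let $A = \{[],\, [a]\}$ and $B = \{[a,b]\}$, so that $A @ B = \{[a,b],\, [a,a,b]\}$ and
\[
\Der \; a \; (A @ B) = \{[b],\; [a,b]\}.
\]
Since $[] \in A$, the first case of the lemma applies, and indeed
\[
((\Der \; a \; A) \, @ \, B) \cup (\Der \; a \; B)
= (\{[]\} \, @ \, \{[a,b]\}) \cup \{[b]\}
= \{[a,b],\; [b]\}.
\]
Without the second component $\Der \; a \; B$ the string $[b]$ would be lost:
it comes from the case where $A$ contributes the empty string and the character $a$ is chopped off from $B$ directly.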
The derivative of the language $A*$ can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c \; A) @ (A*)$
\end{lemma}
\begin{proof}
\begin{itemize}
\item{$\subseteq$}
\noindent
The set
\[ \{s \mid c :: s \in A*\} \]
is enclosed in the set
\[ \{s_1 @ s_2 \mid s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
because whenever a string in the language of a Kleene star $A*$
starts with a character,
that character together with some sub-string
immediately after it forms the first (non-empty) iteration,
and the rest of the string is
still in $A*$.
\item{$\supseteq$}
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
and
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
where the right-hand side of the second equation can be rewritten
as $((\Der \; c\; A) @ A*) \cup A'$, with $A'$ being a possibly empty set.
Hence $(\Der \; c\; A) @ A* \subseteq \Der \; c \; (A*)$.
\end{itemize}
\end{proof}

\noindent
Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
for regular languages, we need to first give definitions for regular expressions.

\subsection{Regular Expressions and Their Meaning}
The basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]
\noindent
We call them basic because we might introduce
more constructs later such as negation
and bounded repetitions.
We write $\ZERO$ for the regular expression that matches
no string; note that some authors
also use $\phi$ for that.
Similarly, the regular expression denoting the
singleton set with only $[]$ is sometimes
denoted by $\epsilon$, but we use $\ONE$ here.

The language or set of strings denoted
by a regular expression is defined as
%TODO: FILL in the other defs
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\phi$\\
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
$L \; (r^*)$ & $\dn$ & $ (L(r))^*$
\end{tabular}
\end{center}
\noindent
This is also called the ``language interpretation'' of
a regular expression.
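
\noindent
As a quick check of the language interpretation, reading the sequence case as
language concatenation $@$, consider the regular expression $(a + \ONE) \cdot b$
(chosen only as an example):
\begin{center}
\begin{tabular}{lcl}
$L \; ((a + \ONE) \cdot b)$ & $=$ & $L \; (a + \ONE) \; @ \; L \; (b)$\\
& $=$ & $(\{[a]\} \cup \{[]\}) \; @ \; \{[b]\}$\\
& $=$ & $\{[a,b],\; [b]\}$
\end{tabular}
\end{center}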

Now with semantic derivatives of a language and regular expressions and
their language interpretations in place, we are ready to define derivatives on regexes.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}

\ChristianComment{Hi this part I want to keep the ordering as is, so that it keeps the
readers engaged with a story how we got to the definition of $\backslash$, rather
than first ``overwhelming'' them with the definition of $\nullable$.}

The language derivative acts on a set of strings and chops off a character from
all strings in that set. We want to define a derivative operation on regular expressions
such that the language $L(r\backslash c)$
looks as if it was obtained by taking the language derivative of $L(r)$:
\begin{center}
\[
r\backslash c \dn ?
\]
so that
\[
L(r \backslash c) = \Der \; c \; L(r) ?
\]
\end{center}
So we mimic the equalities we have for $\Der$ on language concatenation
\[
\Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then} \; ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B
\]
to get the derivative for sequence regular expressions:
\[
(r_1 \cdot r_2 ) \backslash c = \textit{if}\; [] \in L(r_1) \; \textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c \quad \textit{else} \; (r_1 \backslash c) \cdot r_2
\]

\noindent
and language Kleene star:
\[
\textit{Der} \;c \;(A*) = (\textit{Der}\; c \; A) @ (A*)
\]
to get the derivative of the Kleene star regular expression:
\[
r^* \backslash c = (r \backslash c)\cdot r^*
\]
Note that although we can formalise the boolean predicate
$[] \in L(r_1)$ without problems, if we want a function that works
computationally, then we would have to define a function that tests
whether an empty string is in the language of a regular expression.
We call such a function $\nullable$ and define it after the derivative
itself, which is given by the following clauses:

\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The derivative function, written $r\backslash c$,
defines how a regular expression evolves into
a new regular expression after the head character $c$
is chopped off from all the strings it contains.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
might also be chosen as the target regular expression to be chopped
off its head character.
The star regular expression's derivative unwraps one iteration of the
regular expression and attaches the star regular expression
as the sequence's second element, to make sure a copy is retained
for possible further iterations in later phases of lexing.
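
\noindent
The following calculation illustrates the sequence and star clauses on the
regular expression $a^* \cdot b$ (an example of our own choosing).
Since $[] \in L(a^*)$, the \textit{then}-branch of the sequence clause is taken:
\begin{center}
\begin{tabular}{lcl}
$(a^* \cdot b) \backslash a$ & $=$ & $(a^* \backslash a) \cdot b \,+\, b \backslash a$\\
& $=$ & $((a \backslash a) \cdot a^*) \cdot b \,+\, \ZERO$\\
& $=$ & $((\ONE \cdot a^*) \cdot b) \,+\, \ZERO$
\end{tabular}
\end{center}
\noindent
The result is not syntactically equal to $a^* \cdot b$, but it denotes the same language,
which is what one expects after chopping off a leading $a$ from the strings $a \ldots a b$.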


To test whether $[] \in L(r_1)$, we need the $\nullable$ function,
which tests whether the empty string $[]$
is in the language of $r$:


\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The empty set does not contain any string and
therefore not the empty string. The empty string
regular expression contains the empty string
by definition. The character regular expression
denotes the singleton set containing only a one-character string,
and therefore does not contain the empty string.
For the alternative regular expression (or ``or'' expression),
it suffices that one of its children is nullable.
The sequence regular expression
requires both children to contain the empty string
in order to compose an empty string, and the Kleene star
always contains the empty string.
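
\noindent
For example (the two regular expressions are picked only for illustration),
unfolding the clauses of $\nullable$ gives
\begin{center}
\begin{tabular}{lcl}
$\nullable((a + \ONE) \cdot b^*)$ & $=$ & $(\nullable(a) \vee \nullable(\ONE)) \wedge \nullable(b^*)$\\
& $=$ & $(\mathit{false} \vee \mathit{true}) \wedge \mathit{true} \;=\; \mathit{true}$\\
$\nullable(a \cdot b^*)$ & $=$ & $\nullable(a) \wedge \nullable(b^*)$\\
& $=$ & $\mathit{false} \wedge \mathit{true} \;=\; \mathit{false}$
\end{tabular}
\end{center}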

We have the following property where the derivative on regular
expressions coincides with the derivative on a set of strings:

\begin{lemma}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}

\noindent
The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity we will use $\backslash$ to denote
string derivatives for brevity.
We then define Brzozowski's regular-expression matching algorithm as:

\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
\end{definition}

\noindent
Assuming the string is given as a sequence of characters, say $c_0 c_1 \ldots c_{n-1}$,
this algorithm presented graphically is as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
That is, we build successive derivatives and then test whether the
empty string is in the language of the resulting regular expression.
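
\noindent
To make this concrete, here is a run of the matcher on $r_0 = (a \cdot b)^*$
and the string $[a,b]$ (both chosen by us as an example, with no simplification applied):
\begin{center}
\begin{tabular}{lcl}
$r_1 = r_0 \backslash a$ & $=$ & $((a \cdot b) \backslash a) \cdot r_0 \;=\; (\ONE \cdot b) \cdot r_0$\\
$r_2 = r_1 \backslash b$ & $=$ & $((\ONE \cdot b) \backslash b) \cdot r_0 \;=\; (\ZERO \cdot b + \ONE) \cdot r_0$\\
$\nullable(r_2)$ & $=$ & $(\nullable(\ZERO \cdot b) \vee \nullable(\ONE)) \wedge \nullable(r_0) \;=\; \mathit{true}$
\end{tabular}
\end{center}
\noindent
So the answer for $[a,b]$ is YES. For the string $[a]$ alone the matcher stops at $r_1$,
and $\nullable(r_1) = \nullable(\ONE \cdot b) \wedge \nullable(r_0) = \mathit{false}$, so the answer is NO.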
So far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
Here we present the hybrid phases of a regular expression lexing
algorithm using the function $\inj$, as given by Sulzmann and Lu.
They first defined a datatype for storing the
lexing information, called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent
We have a formal binary relation for telling whether the structure
of a regular expression agrees with the value.
\begin{mathpar}
\inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
\inferrule{}{\vdash \Empty : \ONE} \hspace{2em}
\inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
\end{mathpar}
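
\noindent
As a minimal instance of these rules, the value $\Seq(\Char(a), \Empty)$ agrees with
the regular expression $a \cdot \ONE$ (an example chosen only for illustration).
The derivation combines the two axioms with the sequence rule:
\begin{mathpar}
\inferrule{\vdash \Char(a) : a \\ \vdash \Empty : \ONE}{\vdash \Seq(\Char(a), \Empty) : (a \cdot \ONE)}
\end{mathpar}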

Building on top of Sulzmann and Lu's attempt to formalise the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf, Dyckhoff and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
The formal definition of a $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
in the following set of rules:
\ChristianComment{Will complete later}
\newcommand*{\inference}[3][t]{%
\begingroup
\def\and{\\}%
\begin{tabular}[#1]{@{\enspace}c@{\enspace}}
#2 \\
\hline
#3
\end{tabular}%
\endgroup
}
\begin{center}
\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$ }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
\end{center}
\noindent
The above $\POSIX$ rules can be explained intuitively as
\begin{itemize}
\item
match the leftmost regular expression when multiple options of matching
are available
\item
always match a subpart as much as possible before proceeding
to the next token.
\end{itemize}
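
\noindent
The side condition of the sequence rule can be seen at work on the regular expression
$(a + a \cdot b) \cdot (b + \ONE)$ and the string $[a,b]$ (an example of our own choosing).
There are two ways of splitting the string between the two components:
\begin{center}
\begin{tabular}{lcl}
$[a] \;@\; [b]$ & with value & $\Seq(\Left(\Char(a)), \Left(\Char(b)))$\\
$[a,b] \;@\; []$ & with value & $\Seq(\Right(\Seq(\Char(a), \Char(b))), \Right(\Empty))$
\end{tabular}
\end{center}
\noindent
The first split is ruled out: with $s_1 = [a]$ and $s_5 = [b]$ the longer prefix $s_3 = [a,b]$
is matched by $a + a \cdot b$ and the remainder $s_4 = []$ is matched by $b + \ONE$.
Assuming the usual rules for alternatives, which are not spelled out above,
the second value is therefore the $\POSIX$ one.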

The reason why we are interested in $\POSIX$ values is that they can
be practically used in the lexing phase of a compiler front end.
For instance, when lexing a code snippet
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword.

The good property about a $\POSIX$ value is that
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for it:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
\end{lemma}
Now that we know what a $\POSIX$ value is, the problem is how to compute
such a value in a lexing algorithm using derivatives.

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives, see \eqref{graph:successive_ders}). In this second phase, a POSIX value
is generated if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
& & & v_n
\end{tikzcd}
\end{equation}
\end{ceqn}

It tells us how an empty string can be matched by a
regular expression, in a $\POSIX$ way:

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}


\noindent
We favour the left to match an empty string if there is a choice.
When there is a star for us to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.
The result of a call to $\mkeps$ on a $\nullable$ $r$ is
a $\POSIX$ value for $r$ and the empty string:
\begin{lemma}\label{mePosix}
$\nullable(r) \implies ([], r) \rightarrow (\mkeps\; r)$
\end{lemma}
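
\noindent
Two small calculations illustrate the clauses of $\mkeps$
(the regular expressions are our own examples, both nullable):
\begin{center}
\begin{tabular}{lcl}
$\mkeps((\ONE + a) \cdot b^*)$ & $=$ & $\Seq(\mkeps(\ONE + a), \mkeps(b^*))$\\
& $=$ & $\Seq(\Left(\Empty), \Stars\,[])$\\
$\mkeps(a^* + \ONE)$ & $=$ & $\Left(\mkeps(a^*)) \;=\; \Left(\Stars\,[])$
\end{tabular}
\end{center}
\noindent
In the second calculation $\Right(\Empty)$ would also describe a match of the empty string,
but the left alternative is nullable and is therefore preferred.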


After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$.
To do this, Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value.
\begin{ceqn}
\begin{equation}\label{graph:inj}
\begin{tikzcd}
r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent
This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses all do one thing: they identify the ``hole'' in a
value into which the character is injected back.
For instance, in the last clause, where the result is a star value
corresponding to a star regular expression,
we know that the value we inject into must be a sequence value. And we know that the first
value of that sequence corresponds to the child regex of the star
with the first character chopped off, that is, an iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value with this latest iteration
being added to the previous list of iterations, all under the $\Stars$
top level.
The POSIX value is maintained throughout the process.
\begin{lemma}\label{injPosix}
$(s, r \backslash c) \rightarrow v \implies (c :: s, r) \rightarrow (\inj \; r \; c\; v)$
\end{lemma}
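
\noindent
The star clause can be traced on the smallest possible example, $r = a^*$ and the character $a$
(our own choice for illustration). The derivative is $a^* \backslash a = (a \backslash a) \cdot a^* = \ONE \cdot a^*$,
and $\mkeps$ produces for it the value $\Seq(\Empty, \Stars\,[])$. Injecting $a$ back gives
\begin{center}
\begin{tabular}{lcl}
$\textit{inj}\,(a^*)\,a\,\Seq(\Empty, \Stars\,[])$ & $=$ & $\Stars((\textit{inj}\,a\,a\,\Empty)\,::\,[])$\\
& $=$ & $\Stars\,[\Char(a)]$
\end{tabular}
\end{center}
\noindent
which records that the string $[a]$ is matched by $a^*$ with exactly one iteration.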


Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
and taking into consideration the possibility of a non-match,
we have a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcr}
$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;(c::s)$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \; \textit{of} $\\
& & $\None \implies \None$\\
& & $\mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
\noindent
The central property of the $\lexer$ is that it gives the correct result by
$\POSIX$ standards:
\begin{lemma}
\begin{tabular}{l}
$s \in L(r) \Longleftrightarrow (\exists v. \; \lexer \; r \; s = \Some(v) \land (s, \; r) \rightarrow v)$\\
$s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$
\end{tabular}
\end{lemma}


\begin{proof}
By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
The $[]$ case is proven by Lemma~\ref{mePosix}, and the inductive case
by Lemma~\ref{injPosix}.
\end{proof}
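
\noindent
As a worked example of the whole algorithm, consider $\lexer \; r_0 \; [a,b]$ with
$r_0 = (a + \ONE) \cdot b$ (regular expression and string are chosen by us; the names
$r_1, r_2$ and $v_2, v_1, v_0$ stand for the intermediate derivatives and values, and no simplification is applied).
The derivative phase yields
\begin{center}
\begin{tabular}{lcl}
$r_1 = r_0 \backslash a$ & $=$ & $((\ONE + \ZERO) \cdot b) + \ZERO$\\
$r_2 = r_1 \backslash b$ & $=$ & $(((\ZERO + \ZERO) \cdot b) + \ONE) + \ZERO$
\end{tabular}
\end{center}
\noindent
and $r_2$ is nullable. The second phase then computes
\begin{center}
\begin{tabular}{lcl}
$v_2 = \mkeps \; r_2$ & $=$ & $\Left(\Right(\Empty))$\\
$v_1 = \inj \; r_1 \; b \; v_2$ & $=$ & $\Left(\Seq(\Left(\Empty), \Char(b)))$\\
$v_0 = \inj \; r_0 \; a \; v_1$ & $=$ & $\Seq(\Left(\Char(a)), \Char(b))$
\end{tabular}
\end{center}
\noindent
so that $\lexer \; r_0 \; [a,b] = \Some(\Seq(\Left(\Char(a)), \Char(b)))$.
The step from $v_2$ to $v_1$ uses the $\Right$ clause for sequences, where $\mkeps(\ONE + \ZERO) = \Left(\Empty)$
fills in the first component, and the step from $v_1$ to $v_0$ uses the $\Left(\Seq(\ldots))$ clause.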

For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r_0$, and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$.
Pictorially, the algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
This is the second phase of the
algorithm from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$, as defined above.

We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well, since by definition
they reflect the size of the regular expression.

One can introduce simplification on the regexes and values but has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regexes and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivatives step, and applying these rectifications in
each run during the injection phase.
One can prove that the POSIX value of how
regular expressions match strings is not affected, although this is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
|
|
618 |
|
|
619 |
|
|
620 |
|
|
621 |
%Brzozowski, after giving the derivatives and simplification,
|
|
622 |
%did not explore lexing with simplification, or he may well be
|
|
623 |
%stuck on an efficient simplification with proof.
|
|
624 |
%He went on to examine the use of derivatives together with
|
|
625 |
%automaton, and did not try lexing using products.
|
|
626 |
|
|
627 |
We want to get rid of the complex and fragile rectification of values.
|
|
628 |
Can we not create those intermediate values $v_1,\ldots v_n$,
|
|
629 |
and get the lexing information that should be already there while
|
|
630 |
doing derivatives in one pass, without a second injection phase?
|
|
631 |
In the meantime, can we make sure that simplifications
|
|
632 |
are easily handled without breaking the correctness of the algorithm?
|
|
633 |
|
|
634 |
Sulzmann and Lu solved this problem by
|
|
635 |
introducing additional information to the
|
|
636 |
regular expressions called \emph{bitcodes}.
|
|
637 |
|
|
638 |
|
|
639 |
|
|
640 |
|
|
641 |
|
|
642 |
With the formally-specified rules for what a POSIX matching is,
|
|
643 |
they proved in Isabelle/HOL that the algorithm gives correct results.
|
|
644 |
But having a correct result is still not enough,
|
|
645 |
we want at least some degree of $\mathbf{efficiency}$.
|
|
646 |
|
|
647 |
|
|
648 |
A pair of regular expression and string can have multiple lexical values.
|
|
649 |
Take the example where $r= (a^*\cdot a^*)^*$ and the string
|
|
650 |
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
|
|
651 |
If we do not allow any empty iterations in its lexical values,
|
|
652 |
there will be $n - 1$ "splitting points" on $s$ we can choose to
|
|
653 |
split or not so that each sub-string
|
|
654 |
segmented by those chosen splitting points will form different iterations:
|
|
655 |
\begin{center}
|
|
656 |
\begin{tabular}{lcr}
|
|
657 |
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$\\
|
|
658 |
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$\\
|
|
659 |
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$\\
|
|
660 |
& $\textit{etc}.$ &
|
|
661 |
\end{tabular}
|
|
662 |
\end{center}
\noindent
For each iteration, there are in turn multiple ways to split
the sub-string between the two $a^*$s.
It is therefore not surprising that there are exponentially many distinct lexical values
for the regex and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{$n$ \textit{a}s}$.
A lexer that keeps all possible values will naturally
have an exponential runtime on ambiguous regular expressions.
With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
of a match. This means Sulzmann and Lu's injection-based algorithm
will be exponential by nature.
Somehow one has to decide which lexical value to keep and
output in a lexing algorithm.
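To make the counting argument above concrete, here is a small worked instance
(an illustration only, not taken from the formalisation).
Choosing independently at each of the $n - 1$ splitting points gives $2^{n-1}$ segmentations,
and an iteration covering $k$ characters can be divided between the two $a^*$s in $k + 1$ ways.
For $n = 2$ this already yields $3 + 2 \cdot 2 = 7$ values without empty iterations:
\begin{center}
\begin{tabular}{lcl}
$aa$ & $\rightarrow$ & $\Stars\,[\Seq(\Stars\,[\Char(a), \Char(a)],\, \Stars\,[])]$\\
 & & $\Stars\,[\Seq(\Stars\,[\Char(a)],\, \Stars\,[\Char(a)])]$\\
 & & $\Stars\,[\Seq(\Stars\,[],\, \Stars\,[\Char(a), \Char(a)])]$\\
$a \mid a$ & $\rightarrow$ & $\Stars\,[v_1,\, v_2]$ where each of $v_1, v_2$ is either\\
 & & $\Seq(\Stars\,[\Char(a)],\, \Stars\,[])$ or $\Seq(\Stars\,[],\, \Stars\,[\Char(a)])$
\end{tabular}
\end{center}
\noindent
In general the number of values is at least $2^{n-1}$, which is exponential in the length of $s$.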
|
For example, the above $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{$n$ \textit{a}s}$ example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{$n$ iterations}], \Stars\,[])]$.
The output of the algorithm we want is a POSIX matching
encoded as a value.
|
%kind of redundant material
|
Recall the matcher based on derivatives,
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
shown relatively easily that this matcher is correct (that is, given
an $s = c_0 \ldots c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
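As a small worked instance of this matcher (an illustrative example, assuming the usual
clauses for $\backslash$ and $\nullable$ on sequences and alternatives),
take $r_0 = a \cdot b$ and $s = ab$:
\begin{center}
\begin{tabular}{lcl}
$(a \cdot b) \backslash a$ & $=$ & $(a \backslash a) \cdot b \;=\; \ONE \cdot b$\\
$(\ONE \cdot b) \backslash b$ & $=$ & $(\ONE \backslash b) \cdot b + b \backslash b \;=\; \ZERO \cdot b + \ONE$\\
$\nullable(\ZERO \cdot b + \ONE)$ & $=$ & $\nullable(\ZERO \cdot b) \vee \nullable(\ONE) \;=\; \textit{true}$
\end{tabular}
\end{center}
\noindent
The second step produces two summands because $\ONE$ is nullable.
The final test succeeds, so the matcher answers YES, in agreement with $ab \in L(a \cdot b)$.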
|
This is a beautiful and simple definition.

If we implement the above algorithm naively, however,
it can be excruciatingly slow.
|
\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=10000,
    ytick={0,1000,...,10000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa\ldots a}_{n}$.}
\end{tabular}
\caption{EightThousandNodes} \label{fig:EightThousandNodes}
\end{figure}
|
(8000 node data to be added here)
For example, when starting with the regular
expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree; see Figure~\ref{fig:EightThousandNodes}).
The reason why $(a + aa) ^*$ explodes so drastically is that without
pruning, the algorithm will keep records of all possible ways of matching:
\begin{center}
$(a + aa) ^* \backslash [aa] = (\ZERO + (\ZERO \cdot a + \ONE))\cdot(a + aa)^* + (\ONE + \ONE \cdot a) \cdot (a + aa)^*$
\end{center}
|
\noindent
The alternative branches above correspond to the matches
$aa $, $a \quad a$ and $a \quad a \cdot (a)$ (incomplete).
The number of these different ways of matching grows exponentially with the string length,
and without simplifications that throw away some of these very similar matchings,
it is no surprise that these expressions grow so quickly.
Operations like
$\backslash$ and $\nullable$ need to traverse such trees, and
consequently the bigger the derivative, the slower the
algorithm.
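The derivative of $(a + aa)^*$ w.r.t.~$[aa]$ discussed above can be unfolded one character at a time.
The following sketch assumes the standard clauses $r^* \backslash c = (r \backslash c) \cdot r^*$ and,
for nullable $r_1$, $(r_1 \cdot r_2) \backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$,
and abbreviates $(a + aa)^*$ as $R$:
\begin{center}
\begin{tabular}{lcl}
$R \backslash a$ & $=$ & $(a \backslash a + (a \cdot a) \backslash a) \cdot R$\\
 & $=$ & $(\ONE + \ONE \cdot a) \cdot R$\\
$((\ONE + \ONE \cdot a) \cdot R) \backslash a$ & $=$ & $((\ONE + \ONE \cdot a) \backslash a) \cdot R + R \backslash a$\\
 & $=$ & $(\ZERO + (\ZERO \cdot a + \ONE)) \cdot R + (\ONE + \ONE \cdot a) \cdot R$
\end{tabular}
\end{center}
\noindent
Up to the trivial simplifications $\ZERO \cdot r \Rightarrow \ZERO$, $\ZERO + r \Rightarrow r$ and
$\ONE \cdot r \Rightarrow r$, this agrees with the expression displayed above.
No step removes anything, so every partial match survives as a separate summand.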
|
Brzozowski was quick to notice that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, so that the derivatives are not optimal in size.
He also introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regexes and thereby further decrease the size
of the intermediate regexes.
|
More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations into an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher that works with simpler regular expressions.
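To illustrate the effect of $\backslash_{simp}$, consider again $(a + aa)^*$, abbreviated as $R$.
The sketch below assumes, purely for illustration, that $\textit{simp}$ applies only the rules
$\ONE \cdot r \Rightarrow r$, $\ZERO + r \Rightarrow r$ and $\ZERO \cdot r \Rightarrow \ZERO$;
the actual simplification function may do more:
\begin{center}
\begin{tabular}{lcl}
$R \backslash_{simp}\, a$ & $=$ & $\textit{simp}((\ONE + \ONE \cdot a) \cdot R)$\\
 & $=$ & $(\ONE + a) \cdot R$\\
$((\ONE + a) \cdot R) \backslash_{simp}\, a$ & $=$ & $\textit{simp}((\ZERO + \ONE) \cdot R + (\ONE + \ONE \cdot a) \cdot R)$\\
 & $=$ & $R + (\ONE + a) \cdot R$
\end{tabular}
\end{center}
\noindent
Even these few rules give a result that is visibly smaller than the unsimplified
second derivative $(\ZERO + (\ZERO \cdot a + \ONE)) \cdot R + (\ONE + \ONE \cdot a) \cdot R$.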
|
If we want the size of derivatives in the algorithm to
stay even lower, we need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. These more aggressive simplification rules aim at
a very tight size bound, possibly as low
as that of the \emph{partial derivatives}~\parencite{Antimirov1995}.
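Spelled out step by step, the two examples above proceed as follows
(a sketch of the intended rewriting; the order in which the rules are applied,
and the choice to keep the leftmost copy of a duplicate, are our assumptions):
\begin{center}
\begin{tabular}{lcll}
$(a+b) \cdot c + b\cdot c$ & $\Rightarrow$ & $(a\cdot c + b \cdot c) + b \cdot c$ & (open up the parentheses)\\
 & $\Rightarrow$ & $a\cdot c + b \cdot c + b \cdot c$ & (flatten nested alternatives)\\
 & $\Rightarrow$ & $a \cdot c + b \cdot c$ & (delete the duplicate)\\[1ex]
$(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ & $\Rightarrow$ & $a^* + a + a^* + \ONE + a + \ONE$ & (flatten nested alternatives)\\
 & $\Rightarrow$ & $a^* + a + \ONE$ & (delete the duplicates)
\end{tabular}
\end{center}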
|