% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1}

%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\mathit{mkeps}}

\def\DFA{\textit{DFA}}
\def\bmkeps{\textit{bmkeps}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like


Regular expressions are widely used in computer science:
be it in IDEs with syntax highlighting and auto-completion,
command line tools like $\mathit{grep}$ that facilitate easy
text processing, network intrusion
detection systems that reject suspicious traffic, or compiler
front ends--there is always an important phase of the
task that involves matching a regular
expression with a string.
Given its usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.

If you go to a popular programming language's
regex engine,
you can supply it with a regex and strings of your own,
and get matching/lexing information such as how a
sub-part of the regex matches a sub-part of the string.
These lexing libraries are on average quite fast.
%TODO: get source for SNORT/BRO's regex matching engine/speed
For example, the regex engines some network intrusion detection
systems use are able to process
megabytes or even gigabytes of network traffic per second.

Why do we need our own version, if the existing algorithms are
blindingly fast already? It turns out that this is not always the case.


Take $(a^*)^*\,b$ and ask whether
strings of the form $aa\ldots a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings like 28
$a$'s, one discovers that this decision takes around 30 seconds and
takes considerably longer when adding a few more $a$'s, as the graphs
below show:
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={JavaScript},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Python},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Java 8},
legend pos=north west,
legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
of the form $\underbrace{aa\ldots a}_{n}$.}
\end{tabular}
\end{center}


This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns.
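
To make this concrete, here is a small experiment (our own sketch, not
taken from any cited source) that reproduces the blow-up using Java's
backtracking regex engine from Scala; the exact timings are
machine-dependent:
\begin{verbatim}
// time the evil regex (a*)*b against strings a^n
import java.util.regex.Pattern

object Catastrophic extends App {
  val p = Pattern.compile("(a*)*b")
  for (n <- 20 to 28) {
    val s = "a" * n
    val t0 = System.nanoTime()
    p.matcher(s).matches()   // always false: the final b is missing
    val t1 = System.nanoTime()
    println(s"n = $n: ${(t1 - t0) / 1e9} seconds")
  }
}
\end{verbatim}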


This superlinear blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was submitted,
but importantly the white spaces were neither at the beginning nor at
the end. As a result, the regular expression matching engine needed to
backtrack over many choices. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site become unavailable.
A more
recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the
outage had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
%TODO: data points for some new versions of languages
These problems with regular expressions
are not isolated events that happen
very occasionally, but are actually quite widespread.
They occur so often that they have been given a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\parencite{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, and npm and pypi.
They therefore concluded that evil regular expressions
are more than ``a parlour trick'': they are a problem that
requires
more research attention.

\section{Why are current algorithms for regexes slow?}

%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two phases:
the construction phase, in which the algorithm builds some
suitable data structure from the input regex $r$ (we denote
its time cost by $P_1(r)$), and the lexing
phase, in which the input string $s$ is read and the data structure
representing the regex $r$ is operated on (we denote the time
this takes by $P_2(r, s)$).\\
In the case of a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition--
a deterministic finite automaton
by definition has at most one state active and takes at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later. \\
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
The cost $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $O(|r| \cdot 2^{|s|})$.
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.
\subsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{JAVA}$ that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to such lexer generators, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.\\

Bounded repetitions, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denote a regular expression accepting strings
that can be divided into $c$ substrings, where each
substring can be matched by $r$.
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
\node[state, red] (q_1) [right=of q_0] {$q_1$};
\node[state, red] (q_2) [right=of q_1] {$q_2$};
\node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
\path[->]
(q_0) edge node {a} (q_1)
edge [loop below] node {a,b} ()
(q_1) edge node {a,b} (q_2)
(q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
The red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
$q_3$ is the last state, requiring 0 more characters, and is accepting.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
would require at least $2^{n+1}$ states, if $r$ contains
more than 1 string.
This is to represent all the different
scenarios in which ``countdown'' states are active.
For those regexes, tools such as $\mathit{JFLEX}$
would generate gigantic $\mathit{DFA}$s or
out-of-memory errors.
For this reason, regex libraries that support
bounded repetitions often choose to use the $\mathit{NFA}$
approach.
\subsection{The $\mathit{NFA}$ approach to regex matching}
One can simulate an $\mathit{NFA}$ running in two ways:
one is to keep track of all active states after consuming
a character, and update that set of states iteratively.
This can be viewed as a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation, which guarantees a linear runtime
in terms of input string length.
%TODO:try out these lexers
The other way to use an $\mathit{NFA}$ for matching is to choose
a single transition each time, keep all the other options in
a queue or stack, and backtrack if that choice eventually
fails. This method, often called ``depth-first search'',
is efficient in a lot of cases, but could end up
with exponential run time.\\
%TODO:COMPARE java python lexer speed with Rust and Go
The reason behind the backtracking algorithms in languages like
Java and Python is that they support back-references.
\subsection{Back References in Regexes--a Non-Regular Part}
If we have a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
$r_1(r_2(r_3r_4))$
\end{center}
we could label the sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
Here $r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4: $1$ refers to the entire expression
$(r_1(r_2(r_3)(r_4)))$, $2$ refers to $r_2(r_3)(r_4)$, etc.
These sub-expressions are called ``capturing groups''.
We can use the following syntax to denote that we want a string just matched by a
sub-expression (capturing group) to appear at a certain location again,
exactly as it was:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
The backslash and number $i$ are used to denote such
so-called ``back-references''.
Let $e$ be an expression made of regular expressions
and back-references, where $e$ contains the expression $e_i$
as its $i$-th capturing group.
The semantics of back-references can be recursively
written as:
\begin{center}
\begin{tabular}{c}
$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e),\quad s_i \in L(r_i),$\\
$s_i\; \text{is the string matched by the $i$-th capturing group of ($e$, $s$)}\}$
\end{tabular}
\end{center}
The concrete example
$((a|b|c|\ldots|z)^*)\backslash 1$
would match strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.\\
Back-references are a construct in the ``regex'' standard
that programmers find quite useful, but they are not exactly
regular any more.
In fact, back-references allow regexes to express
languages that are not even contained in the context-free
languages.
For example, the back-reference expression $(a^*)b\backslash 1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
Such a language sits in the context-sensitive level
of the formal language hierarchy.
Solving the matching problem for back-reference expressions
is NP-complete\parencite{alfred2014algorithms} and a non-backtracking,
efficient solution is not known to exist.
%TODO:read a bit more about back reference algorithms
It seems that languages like Java and Python made the trade-off
to support back-references at the expense of having to backtrack,
even in the case of regexes not involving back-references.\\
Summing these up, we can categorise existing
practical regex libraries into ones with linear
time guarantees, like Go and Rust, which impose restrictions
on the user input (not allowing back-references,
bounded repetitions cannot exceed 1000, etc.), and ones
that allow the programmer much freedom, but grind to a halt
in some non-negligible portion of cases.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
\section{Buggy Regex Engines}


Another problem with these regex libraries is that there
is no correctness guarantee.
In some cases they either fail to generate a lexing result when there is a match,
or give a wrong match.


It turns out that regex libraries not only suffer from
exponential backtracking problems,
but also from undesired (or even buggy) outputs.
%TODO: comment from who
Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.
This experience is echoed by the author's
trials with a few online regex testers.
A concrete example would be
the regex
\begin{verbatim}
(((((a*a*)b*)b){20})*)c
\end{verbatim}
and the string
\begin{verbatim}
baabaabababaabaaaaaaaaababaa
aababababaaaabaaabaaaaaabaab
aabababaababaaaaaaaaababaaaa
babababaaaaaaaaaaaaac
\end{verbatim}

This seemingly complex regex simply says ``some $a$'s
followed by some $b$'s then followed by one single $b$,
and this iterates 20 times, finally followed by a $c$''.
A POSIX match would involve the entire string, ``eating up''
all the $b$'s in it.
%TODO: give a coloured example of how this matches POSIXly

This regex would trigger catastrophic backtracking in
languages like Python and Java,
whereas it gives a non-POSIX and uninformative
match in languages like Go or .NET--namely the match consisting of only
the character $c$.

As Grathwohl\parencite{grathwohl2014crash} commented,
\begin{center}
``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
\end{center}

%\section{How people solve problems with regexes}


When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form,
or they try to avoid the possibly problematic patterns completely\parencite{Davis18},
a process in which there are many false positives.
Animated tools to ``debug'' regular expressions
are also quite popular, regexploit\parencite{regexploit2021} and regex101\parencite{regex101}
to name a few.
There is also static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-time estimates
for regexes, attack strings that exploit the worst-time
scenario, and ``attack automata'' that generate
attack strings.
%Arguably these methods limits the programmers' freedom
%or productivity when all they want is to come up with a regex
%that solves the text processing problem.

%TODO:also the regex101 debugger
\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
Is it possible to have a regex lexing algorithm with proven correctness and
time complexity, which allows easy extensions to
constructs like
bounded repetitions, negation, lookarounds, and even back-references?

We propose Brzozowski derivatives on regular expressions as
a solution to this.

In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities. The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions.
An algorithm based on them by
Sulzmann and Lu \parencite{Sulzmann2014} allows easy extension
to include extended regular expressions and
simplification of internal data structures,
eliminating the exponential behaviours.
%----------------------------------------------------------------------------------------

\section{Our Contribution}


This work addresses the vulnerability of super-linear and
buggy regex implementations by the combination
of Brzozowski's derivatives and interactive theorem proving.
We give an
improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives, which comes with a formal guarantee in terms of correctness and
running time as an Isabelle/HOL proof.
We then improve the algorithm with an even stronger version of
simplification, and prove a time bound linear in the length of the input and
cubic in the size of the regular expression, using a technique by
Antimirov.

The main contribution of this thesis is a proven correct lexing algorithm
with formalized time bounds.
To the best of our knowledge, there are no lexing libraries using Brzozowski derivatives
that come with a provable time guarantee,
and claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
For example, Sulzmann and Lu had proposed an algorithm for which they
claim a linear running time.
But that claim was falsified by our experiments: the running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalized the correctness of the lexer, but not the complexity.
In the performance evaluation section, they simply analyzed the run time
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to lex only 40 $a$'s was 5 minutes.

We believe our proof of performance on general
inputs, rather than on specific examples, is a novel contribution.\\


\subsection{Related Work}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.

%We propose Brzozowski's derivatives as a solution to this problem.
% about Lexing Using Brzozowski derivatives
\section{Preliminaries}

Suppose we have an alphabet $\Sigma$; the set of all strings whose characters
are drawn from $\Sigma$
is written $\Sigma^*$.

We use patterns to define sets of strings concisely. Regular expressions
are one such pattern system:
the basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]

The language, or set of strings, defined by a regular expression
is given by the following clauses:
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\{\}$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; r_1 \cdot r_2$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; r_1,\, s_2 \in L \; r_2\}$\\
$L \; r^*$ & $\dn$ & $\bigcup_{n \geq 0} (L \; r)^n$\\
\end{tabular}
\end{center}
This is also called the ``language interpretation'' of regular expressions.
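For example, under this interpretation we have
$L\,((a + b)\cdot c) = \{ac,\, bc\}$ and $L\,(a^*) = \{[],\, a,\, aa,\, \ldots\}$.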



The Brzozowski derivative w.r.t.\ a character $c$ is an operation on a regex;
it transforms the regex into a new one describing what is left of
its strings after the head character $c$ is chopped off.

Formally, we first define such a transformation on any string set, which
we call the semantic derivative:
\begin{center}
$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
\end{center}
If the $\textit{StringSet}$ happens to have some structure, for example,
if it is regular, then the set $\Der\;c\;\textit{StringSet}$ is regular as
well, and can be computed by a corresponding operation on a regular expression.

% Derivatives of a
%regular expression, written $r \backslash c$, give a simple solution
%to the problem of matching a string $s$ with a regular
%expression $r$: if the derivative of $r$ w.r.t.\ (in
%succession) all the characters of the string matches the empty string,
%then $r$ matches $s$ (and {\em vice versa}).

The derivative of a regular expression, denoted
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$, and returns another regular expression $r'$,
which is computed by the following recursive function:

\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}

\noindent
The $\nullable$ function tests whether the empty string $""$
is in the language of $r$:


\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The empty set does not contain any string, and
therefore does not contain the empty string. The empty string
regular expression contains the empty string
by definition. The character regular expression
is the singleton containing that character only,
and therefore does not contain the empty string.
The alternative regular expression (or ``or'' expression)
is nullable if any one of its children is nullable.
The sequence regular expression
requires both children to contain the empty string
in order to compose an empty string, and the Kleene star
naturally contains the empty string.
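For example, $\nullable((a + \ONE)\cdot b^*)$ is true: the alternative
$a + \ONE$ is nullable because of its right child $\ONE$, and $b^*$ is
always nullable; on the other hand, $\nullable(a\cdot b^*)$ is false,
because the character regular expression $a$ is not nullable.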

We can relate the derivative operation to the language interpretation:
taking the derivative of $r$ w.r.t.\ a character $c$ corresponds exactly to
taking the semantic derivative of its language, that is,
$L \, (r\backslash c) = \Der \; c \; (L \, r)$.

\noindent
The function derivative, written $\backslash c$,
defines how a regular expression evolves into
a new regular expression after the head character $c$
is chopped off all the strings it contains.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
might be chosen as the target regular expression to have its head
character chopped off.
The star case unwraps one iteration of
the regular expression and attaches the star regular expression
to its back again, to make sure there can be 0 or more iterations
following this unfolded iteration.
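
As a small worked example of these definitions, take $r = (a\cdot b) + b$.
Then
\begin{center}
$r\backslash a = (\ONE\cdot b) + \ZERO$ \qquad and \qquad
$(r\backslash a)\backslash b = ((\ZERO\cdot b) + \ONE) + \ZERO$,
\end{center}
and the latter is nullable, confirming that the string $ab$ is in $L(r)$.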


The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's regular-expression matching algorithm as:

\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]

\noindent
Assuming a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm presented graphically is as follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
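
These definitions transcribe almost verbatim into a functional language.
Below is a minimal sketch in Scala (all names are ours and purely
illustrative; we use Scala~3-style top-level definitions):
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the Brzozowski derivative of r w.r.t. the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a whole string, and the matcher itself
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}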

This is a beautiful and simple definition.

If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow. For example, when starting with the regular
expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm.

Brzozowski was quick in finding that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, making the naive process far from optimal.
He therefore introduced some ``similarity rules'' such
as $P+(Q+R) = (P+Q)+R$ to merge syntactically
different but language-equivalent sub-regexes and further decrease the size
of the intermediate regexes.

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher without having cumbersome regular expressions.
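
A basic version of such a $\textit{simp}$ function can be sketched in
our running Scala code as follows (it removes $\ZERO$s, $\ONE$s and
directly adjacent duplicates only; this is illustrative, not the full
set of rules discussed below):
\begin{verbatim}
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2                       // 0 + r = r
    case (s1, ZERO) => s1                       // r + 0 = r
    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO                     // 0 . r = 0
    case (_, ZERO)  => ZERO                     // r . 0 = 0
    case (ONE, s2)  => s2                       // 1 . r = r
    case (s1, ONE)  => s1                       // r . 1 = r
    case (s1, s2)   => SEQ(s1, s2)
  }
  case _ => r                                   // chars, 0, 1, stars
}

// simplify after each derivative step
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}
\end{verbatim}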


If we want the size of derivatives in the algorithm to
stay even lower, we would need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve a very tight size bound, namely,
the same size bound as that of the \emph{partial derivatives}.

Building derivatives and then simplifying them:
so far so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
This requires us to go back again to the world
without simplification first for a moment.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}


They first defined the datatypes for storing the
lexing information called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent
Building on top of Sulzmann and Lu's attempt to formalize the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally-specified rules for what a POSIX matching is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But having a correct result is still not enough, we want $\mathbf{efficiency}$.



One regular expression can have multiple lexical values. For example
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and vice versa.
Even when we fix the regular expression and a string it matches, there could
still be more than one value corresponding to the match.
Take as an example $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching
without allowing any value under a star to be flattened
to an empty string can be given by the following formula:
\begin{center}
$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
\end{center}
and a closed form formula can be calculated to be
\begin{equation}
C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which clearly grows exponentially.

A lexer aimed at producing all the possible values has an exponential
worst case runtime. Therefore it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which by intuition always
match the leftmost regular expression when there is a choice
and always match a sub-part as much as possible before proceeding
to the next token. For example, the above example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output we want from a lexing algorithm is such a POSIX match
encoded as a value.
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated in case the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}


\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The POSIX value is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses basically do one thing: identify the ``hole'' in the
value into which the character should be injected back.
For instance, consider the last clause, which injects back into a value
corresponding to a star. We know the incoming value must be a
sequence value, whose first component corresponds to the child regex
of the star with the first character chopped off--an iteration of the star
that has just been unfolded--followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value, with this new iteration
added to the previous list of iterations, all under the $\Stars$
top level.
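
For concreteness, the following Scala sketch transcribes the tables for
values, $\mkeps$ and $\inj$ above into our running code (the names are
ours; the pattern matches are partial, which is fine here since the
missing cases cannot arise for well-formed inputs):
\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// a POSIX value for how a nullable regex matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

// inject the character c back into the value v
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty) => Chr(d)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// the two-phase lexer: derivatives forward, injections backward
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil =>
    if (nullable(r)) mkeps(r) else throw new Exception("no match")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}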

We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well--they reflect
the size of the regular expression by definition.

One can introduce simplification on the regexes and values, but one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regexes and values
being consistent and matching each other.
This can be achieved by recording some extra rectification functions
during the derivatives step, and applying these rectifications in
each run during the injection phase.
One can then prove that the POSIX value of how
regular expressions match strings will not be affected---although this is
much harder to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.



%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification or he may well be
%stuck on an efficient simplificaiton with a proof.
%He went on to explore the use of derivatives together with
%automaton, and did not try lexing using derivatives.

We want to get rid of the complex and fragile rectification of values.
Can we not create those intermediate values $v_1,\ldots v_n$,
and get the lexing information that should be already there while
doing derivatives in one pass, without a second phase of injection?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding additional information to the
regular expressions called \emph{bitcodes}.
\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:

\begin{center}
$b ::= 1 \mid 0 \qquad
bs ::= [] \mid b::bs
$
\end{center}

\noindent
The $1$ and $0$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
$\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
$\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
$\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
code(\Stars\,vs)$
\end{tabular}
\end{center}

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:


%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
$\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
(\Left\,v, bs_1)$\\
$\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
(\Right\,v, bs_1)$\\
$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
$\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
& & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\

$\textit{decode}\,bs\,r$ & $\dn$ &
$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
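
A transcription of $\textit{code}$ and $\textit{decode}$ into our running
Scala sketch is given below (bits are modelled as \code{Int}s and error
handling is simplified; the names are ours):
\begin{verbatim}
def code(v: Val): List[Int] = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case Left(v1) => 0 :: code(v1)
  case Right(v1) => 1 :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(0)
  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
}

// decode a bitcode back into a value, guided by the regex r
def decodeAux(bs: List[Int], r: Rexp): (Val, List[Int]) = (r, bs) match {
  case (ONE, _) => (Empty, bs)
  case (CHAR(c), _) => (Chr(c), bs)
  case (ALT(r1, _), 0 :: rest) =>
    val (v, bs1) = decodeAux(rest, r1); (Left(v), bs1)
  case (ALT(_, r2), 1 :: rest) =>
    val (v, bs1) = decodeAux(rest, r2); (Right(v), bs1)
  case (SEQ(r1, r2), _) =>
    val (v1, bs1) = decodeAux(bs, r1)
    val (v2, bs2) = decodeAux(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (STAR(_), 0 :: rest) => (Stars(Nil), rest)
  case (STAR(r1), 1 :: rest) =>
    val (v, bs1) = decodeAux(rest, r1)
    decodeAux(bs1, r) match {
      case (Stars(vs), bs2) => (Stars(v :: vs), bs2)
      case _ => throw new Exception("bad encoding")
    }
  case _ => throw new Exception("bad encoding")
}

def decode(bs: List[Int], r: Rexp): Option[Val] = decodeAux(bs, r) match {
  case (v, Nil) => Some(v)
  case _ => None
}
\end{verbatim}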

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}

\begin{center}
\begin{tabular}{lcl}
$\textit{a}$ & $::=$ & $\ZERO$\\
& $\mid$ & $_{bs}\ONE$\\
& $\mid$ & $_{bs}{\bf c}$\\
& $\mid$ & $_{bs}\sum\,as$\\
& $\mid$ & $_{bs}a_1\cdot a_2$\\
& $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})

\noindent
where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalized to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.


To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
$(r_1 + r_2)^\uparrow$ & $\dn$ &
$_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
\textit{fuse}\,[1]\,r_2^\uparrow]$\\
$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
$(r^*)^\uparrow$ & $\dn$ &
$_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
%\end{definition}
|
|
1118 |
|
|
1119 |
\noindent
|
|
1120 |
We use up arrows here to indicate that the basic un-annotated regular
|
|
1121 |
expressions are ``lifted up'' into something slightly more complex. In the
|
|
1122 |
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
|
|
1123 |
attach bits to the front of an annotated regular expression. Its
|
|
1124 |
definition is as follows:
|
|

\begin{center}
\begin{tabular}{lcl}
$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
$_{bs @ bs'}\ONE$\\
$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
$_{bs@bs'}{\bf c}$\\
$\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
$_{bs@bs'}\sum\textit{as}$\\
$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
$_{bs@bs'}a_1 \cdot a_2$\\
$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
$_{bs @ bs'}a^*$
\end{tabular}
\end{center}
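
\noindent
Both $\textit{fuse}$ and internalisation have direct Scala
counterparts. The sketch below builds on the \code{Rexp} and
\code{ARexp} datatypes from the earlier sketches; as in the
mathematical definitions, the left alternative is marked with bit $0$
and the right one with bit $1$:

{\small
\begin{verbatim}
// fuse attaches the bits bs to the front of an annotated regular
// expression; AZERO carries no bits, so it is left alone
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO             => AZERO
  case AONE(bs1)         => AONE(bs ++ bs1)
  case ACHAR(bs1, c)     => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as)    => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1)    => ASTAR(bs ++ bs1, a1)
}

// internalise is the lifting operation written r^ in the text
def internalise(r: Rexp): ARexp = r match {
  case ZERO        => AZERO
  case ONE         => AONE(Nil)
  case CHAR(c)     => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(0), internalise(r1)),
                    fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1)    => ASTAR(Nil, internalise(r1))
}
\end{verbatim}}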

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:

\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\;\textit{then}\;
_{bs}\ONE\;\textit{else}\;\ZERO$\\
$(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
$_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
& &$\phantom{\textit{then}\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
$_{bs}((\textit{fuse}\,[0]\,(a\,\backslash c))\cdot
(_{[]}a^*))$
\end{tabular}
\end{center}

%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $0$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The sequence clause
is more subtle: the interesting case is when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions, so we omit the
definition). Assuming that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
the right branch of the alternative, which is $\textit{fuse}\;(\bmkeps\;a_1)\,(a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the lexing information at the
head of the regular expression $a_2 \backslash c$ by fusing it there. The
bitsequence $\textit{bs}$, which was initially attached to the
sequence $a_1 \cdot a_2$, has
now been elevated to the top-level of the $\sum$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:

%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
$\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
$\textit{bmkeps}\,(_{bs}\sum\,(a::\textit{as}))$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a$\\
& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
& &$\textit{else}\;\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
$bs \,@\, [1]$
\end{tabular}
\end{center}
%\end{definition}
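
For reference, here is a Scala sketch of $\textit{bmkeps}$ together
with $\textit{bnullable}$ (the latter being the obvious adaptation of
$\textit{nullable}$ whose mathematical definition we omitted); as
before, the names and the error handling are our own choices:

{\small
\begin{verbatim}
// bnullable: nullable lifted to annotated regular expressions
def bnullable(a: ARexp): Boolean = a match {
  case AZERO           => false
  case AONE(_)         => true
  case ACHAR(_, _)     => false
  case AALTS(_, as)    => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _)     => true
}

// bmkeps collects the bits along the POSIX path through a
// bnullable annotated regular expression; a star is closed
// off with the end-marker bit 1
def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs)            => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ++ bmkeps(a1)
    else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2)    => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _)        => bs ++ List(1)
  case _ => sys.error("bmkeps: not bnullable")
}
\end{verbatim}}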

\noindent
The function $\textit{bmkeps}$ completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using the bit $1$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:

\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
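
A Scala sketch of the bitcoded derivative and of $\textit{blexer}$,
assuming the functions \code{fuse}, \code{internalise},
\code{bnullable}, \code{bmkeps} and \code{decode} from the sketches
above:

{\small
\begin{verbatim}
// the derivative on annotated regular expressions; in the star
// case the new iteration is recorded by fusing the bit 0
def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO         => AZERO
  case AONE(_)       => AZERO
  case ACHAR(bs, d)  => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(0), bder(c, a1)), ASTAR(Nil, a1))
}

// bders extends bder from characters to strings
def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}}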

Recall that one of the important reasons we introduced bitcodes
is that they make simplification more structured, and thereby
help guarantee correctness.

\subsection*{Our Simplification Rules}

In this section we introduce aggressive (in terms of size) simplification rules
on annotated regular expressions
in order to keep derivatives small. Such simplifications are promising,
as we have
generated test data showing
that a tight size bound can be achieved. Obviously such testing can only
partially cover the search space, as there are infinitely many regular
expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:

\begin{center}
\begin{tabular}{@{}lcl@{}}

$\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (_{bs'}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs\,@\,bs') \; a_2'$ \\
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\

$\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse}\;bs\;a$ \\
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\

$\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
The simplification performs a pattern match on the annotated regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its children regular expressions
recursively and then checks whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurrence of a regular expression
and removes all later duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:

\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{flatten} \; ((_{bs}\sum \textit{as}) :: \textit{as'})$ & $\dn$ & $(\textit{map} \;
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; \textit{as'} $ \\
$\textit{flatten} \; (\ZERO :: \textit{as'})$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\
$\textit{flatten} \; (a :: \textit{as'})$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. In terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
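
In Scala, $\textit{flatten}$ and $\textit{simp}$ can be sketched as
below, building on the earlier sketches; for $\textit{distinct}$ we can
simply use Scala's built-in \code{List.distinct}, which keeps the first
occurrence of each element and removes all later ones:

{\small
\begin{verbatim}
// flatten opens up nested alternatives, fusing their bits into
// the children, and drops AZEROs along the way
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil                   => Nil
  case AZERO :: as1          => flatten(as1)
  case AALTS(bs, as1) :: as2 => as1.map(fuse(bs, _)) ++ flatten(as2)
  case a :: as1              => a :: flatten(as1)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _)       => AZERO
    case (_, AZERO)       => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)
    case (a1s, a2s)       => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil       => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1       => AALTS(bs, as1)
  }
  case other => other
}
\end{verbatim}}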

Having defined the $\simp$ function,
we can use the same natural
extension from derivatives w.r.t.~characters to derivatives
w.r.t.~strings as before, this time simplifying after each step:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(\textit{simp}\,(r \backslash c)) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
to obtain an optimised version of the algorithm:

\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small: for example,
with this simplification the 8000 nodes of our previous $(a + aa)^*$
example are reduced to just 6, and the size stays constant no matter
how long the input string is.
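
The simplifying lexer then differs from \code{blexer} only in
interleaving \code{simp} with the derivative steps. A sketch, with
function names of our own choosing:

{\small
\begin{verbatim}
// derivative w.r.t. a string, simplifying after each step
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bdersSimp(cs, simp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}

// e.g. for (a + aa)* the simplified derivatives stay small:
val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
val v = blexerSimp(r, "aaaa")   // Some(Stars(...))
\end{verbatim}}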


Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).


However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this
difficulty by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function. In our own earlier work we provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of Sulzmann and Lu's extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.

The second difficulty is that Brzozowski's derivatives can
grow to arbitrarily big sizes. For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we run out of memory on a
typical computer (we shall define shortly the precise details of our
regular expressions and the derivative operation). Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in our
earlier work to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.


Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bitcoded
regular expressions. In this version, POSIX values are
represented as bitsequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bitsequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bitcoded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this thesis):

\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method [..] in combination with the simplification steps [..]
yields POSIX parse trees. We have tested this claim
extensively [..] but yet
have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}


%----------------------------------------------------------------------------------------