17
|
1 |
\documentclass[a4paper,UKenglish]{lipics}
|
|
2 |
\usepackage{graphic}
|
|
3 |
\usepackage{data}
|
|
4 |
% \documentclass{article}
|
|
5 |
%\usepackage[utf8]{inputenc}
|
|
6 |
%\usepackage[english]{babel}
|
|
7 |
%\usepackage{listings}
|
|
8 |
% \usepackage{amsthm}
|
|
9 |
% \usepackage{hyperref}
|
|
10 |
% \usepackage[margin=0.5in]{geometry}
|
|
11 |
%\usepackage{pmboxdraw}
|
|
12 |
|
|
13 |
\title{POSIX Regular Expression Matching and Lexing}
|
|
14 |
\author[1]{Annonymous}
|
|
15 |
|
|
16 |
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
|
|
17 |
\newcommand{\ZERO}{\mbox{\bf 0}}
|
|
18 |
\newcommand{\ONE}{\mbox{\bf 1}}
|
|
19 |
\def\lexer{\mathit{lexer}}
|
|
20 |
\def\mkeps{\mathit{mkeps}}
|
|
21 |
\def\inj{\mathit{inj}}
|
|
22 |
\def\Empty{\mathit{Empty}}
|
|
23 |
\def\Left{\mathit{Left}}
|
|
24 |
\def\Right{\mathit{Right}}
|
|
25 |
\def\Stars{\mathit{Stars}}
|
|
26 |
\def\Char{\mathit{Char}}
|
|
27 |
\def\Seq{\mathit{Seq}}
|
|
28 |
\def\Der{\mathit{Der}}
|
|
29 |
\def\nullable{\mathit{nullable}}
|
|
30 |
\def\Z{\mathit{Z}}
|
|
31 |
\def\S{\mathit{S}}
|
|
32 |
|
|
33 |
%\theoremstyle{theorem}
|
|
34 |
%\newtheorem{theorem}{Theorem}
|
|
35 |
%\theoremstyle{lemma}
|
|
36 |
%\newtheorem{lemma}{Lemma}
|
|
37 |
%\newcommand{\lemmaautorefname}{Lemma}
|
|
38 |
%\theoremstyle{definition}
|
|
39 |
%\newtheorem{definition}{Definition}
|
|
40 |
|
|
41 |
|
|
42 |
\begin{document}
|
|
43 |
|
|
44 |
\maketitle
|
|
45 |
\begin{abstract}
|
|
46 |
Brzozowski introduced in 1964 a beautifully simple algorithm for
|
|
47 |
regular expression matching based on the notion of derivatives of
|
|
48 |
regular expressions. In 2014, Sulzmann and Lu extended this
|
|
49 |
algorithm to not just give a YES/NO answer for whether or not a regular
|
|
50 |
expression matches a string, but in case it matches also \emph{how}
|
|
51 |
it matches the string. This is important for applications such as
|
|
52 |
lexing (tokenising a string). The problem is to make the algorithm
|
|
53 |
by Sulzmann and Lu fast on all inputs without breaking its
|
|
54 |
correctness.
|
|
55 |
\end{abstract}
|
|
56 |
|
|
57 |
\section{Introduction}
|
|
58 |
|
|
59 |
This PhD-project is about regular expression matching and
|
|
60 |
lexing. Given the maturity of this topic, the reader might wonder:
|
|
61 |
Surely, regular expressions must have already been studied to death?
|
|
62 |
What could possibly be \emph{not} known in this area? And surely all
|
|
63 |
implemented algorithms for regular expression matching are blindingly
|
|
64 |
fast?
|
|
65 |
|
|
66 |
|
|
67 |
|
|
68 |
Unfortunately these preconceptions are not supported by evidence: Take
|
|
69 |
for example the regular expression $(a^*)^*\,b$ and ask whether
|
|
70 |
strings of the form $aa..a$ match this regular
|
|
71 |
expression. Obviously they do not match---the expected $b$ in the last
|
|
72 |
position is missing. One would expect that modern regular expression
|
|
73 |
matching engines can find this out very quickly. Alas, if one tries
|
|
74 |
this example in JavaScript, Python or Java 8 with strings like 28
|
|
75 |
$a$'s, one discovers that this decision takes around 30 seconds and
|
|
76 |
takes considerably longer when adding a few more $a$'s, as the graphs
|
|
77 |
below show:
|
|
78 |
|
|
79 |
\begin{center}
|
|
80 |
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
|
|
81 |
\begin{tikzpicture}
|
|
82 |
\begin{axis}[
|
|
83 |
xlabel={$n$},
|
|
84 |
x label style={at={(1.05,-0.05)}},
|
|
85 |
ylabel={time in secs},
|
|
86 |
enlargelimits=false,
|
|
87 |
xtick={0,5,...,30},
|
|
88 |
xmax=33,
|
|
89 |
ymax=35,
|
|
90 |
ytick={0,5,...,30},
|
|
91 |
scaled ticks=false,
|
|
92 |
axis lines=left,
|
|
93 |
width=5cm,
|
|
94 |
height=4cm,
|
|
95 |
legend entries={JavaScript},
|
|
96 |
legend pos=north west,
|
|
97 |
legend cell align=left]
|
|
98 |
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
|
|
99 |
\end{axis}
|
|
100 |
\end{tikzpicture}
|
|
101 |
&
|
|
102 |
\begin{tikzpicture}
|
|
103 |
\begin{axis}[
|
|
104 |
xlabel={$n$},
|
|
105 |
x label style={at={(1.05,-0.05)}},
|
|
106 |
%ylabel={time in secs},
|
|
107 |
enlargelimits=false,
|
|
108 |
xtick={0,5,...,30},
|
|
109 |
xmax=33,
|
|
110 |
ymax=35,
|
|
111 |
ytick={0,5,...,30},
|
|
112 |
scaled ticks=false,
|
|
113 |
axis lines=left,
|
|
114 |
width=5cm,
|
|
115 |
height=4cm,
|
|
116 |
legend entries={Python},
|
|
117 |
legend pos=north west,
|
|
118 |
legend cell align=left]
|
|
119 |
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
|
|
120 |
\end{axis}
|
|
121 |
\end{tikzpicture}
|
|
122 |
&
|
|
123 |
\begin{tikzpicture}
|
|
124 |
\begin{axis}[
|
|
125 |
xlabel={$n$},
|
|
126 |
x label style={at={(1.05,-0.05)}},
|
|
127 |
%ylabel={time in secs},
|
|
128 |
enlargelimits=false,
|
|
129 |
xtick={0,5,...,30},
|
|
130 |
xmax=33,
|
|
131 |
ymax=35,
|
|
132 |
ytick={0,5,...,30},
|
|
133 |
scaled ticks=false,
|
|
134 |
axis lines=left,
|
|
135 |
width=5cm,
|
|
136 |
height=4cm,
|
|
137 |
legend entries={Java 8},
|
|
138 |
legend pos=north west,
|
|
139 |
legend cell align=left]
|
|
140 |
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
|
|
141 |
\end{axis}
|
|
142 |
\end{tikzpicture}\\
|
|
143 |
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
|
|
144 |
of the form $\underbrace{aa..a}_{n}$.}
|
|
145 |
\end{tabular}
|
|
146 |
\end{center}
|
|
147 |
|
|
148 |
\noindent These are clearly abysmal and possibly surprising results.
|
|
149 |
One would expect these systems doing much better than that---after
|
|
150 |
all, given a DFA and a string, whether a string is matched by this DFA
|
|
151 |
should be linear.
|
|
152 |
|
|
153 |
Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
|
|
154 |
exhibit this ``exponential behaviour''. Unfortunately, such regular
|
|
155 |
expressions are not just a few ``outliers'', but actually they are
|
|
156 |
frequent enough that a separate name has been created for
|
|
157 |
them---\emph{evil regular expressions}. In empiric work, Davis et al
|
|
158 |
report that they have found thousands of such evil regular expressions
|
|
159 |
in the JavaScript and Python ecosystems \cite{Davis18}.
|
|
160 |
|
|
161 |
This exponential blowup sometimes causes real pain in ``real life'':
|
|
162 |
for example one evil regular expression brought on 20 July 2016 the
|
|
163 |
webpage \href{http://stackexchange.com}{Stack Exchange} to its knees.
|
|
164 |
In this instance, a regular expression intended to just trim white
|
|
165 |
spaces from the beginning and the end of a line actually consumed
|
|
166 |
massive amounts of CPU-resources and because of this the web servers
|
|
167 |
ground to a halt. This happened when a post with 20,000 white spaces
|
|
168 |
was submitted, but importantly the white spaces were neither at the
|
|
169 |
beginning nor at the end. As a result, the regular expression matching
|
|
170 |
engine needed to backtrack over many choices.
|
|
171 |
|
|
172 |
The underlying problem is that many ``real life'' regular expression
|
|
173 |
matching engines do not use DFAs for matching. This is because they
|
|
174 |
support regular expressions that are not covered by the classical
|
|
175 |
automata theory, and in this more general setting there are quite a
|
|
176 |
few research questions still unanswered and fast algorithms still need
|
|
177 |
to be developed.
|
|
178 |
|
|
179 |
There is also another under-researched problem to do with regular
|
|
180 |
expressions and lexing, i.e.~the process of breaking up strings into
|
|
181 |
sequences of tokens according to some regular expressions. In this
|
|
182 |
setting one is not just interested in whether or not a regular
|
|
183 |
expression matches a string, but if it matches also in \emph{how} it
|
|
184 |
matches the string. Consider for example a regular expression
|
|
185 |
$r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
|
|
186 |
and so on; and a regular expression $r_{id}$ for recognising
|
|
187 |
identifiers (say, a single character followed by characters or
|
|
188 |
numbers). One can then form the compound regular expression
|
|
189 |
$(r_{key} + r_{id})^*$ and use it to tokenise strings. But then how
|
|
190 |
should the string \textit{iffoo} be tokenised? It could be tokenised
|
|
191 |
as a keyword followed by an identifier, or the entire string as a
|
|
192 |
single identifier. Similarly, how should the string \textit{if} be
|
|
193 |
tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
|
|
194 |
``fire''---so is it an identifier or a keyword? While in applications
|
|
195 |
there is a well-known strategy to decide these questions, called POSIX
|
|
196 |
matching, only relatively recently precise definitions of what POSIX
|
|
197 |
matching actually means have been formalised
|
|
198 |
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
|
|
199 |
POSIX matching means to match the longest initial substring and
|
|
200 |
possible ties are solved according to some priorities attached to the
|
|
201 |
regular expressions (e.g.~keywords have a higher priority than
|
|
202 |
identifiers). This sounds rather simple, but according to Grathwohl et
|
|
203 |
al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
|
|
204 |
|
|
205 |
\begin{quote}
|
|
206 |
\it{}``The POSIX strategy is more complicated than the greedy because of
|
|
207 |
the dependence on information about the length of matched strings in the
|
|
208 |
various subexpressions.''
|
|
209 |
\end{quote}
|
|
210 |
|
|
211 |
\noindent
|
|
212 |
This is also supported by evidence collected by Kuklewicz
|
|
213 |
\cite{Kuklewicz} who noticed that a number of POSIX regular expression
|
|
214 |
matchers calculate incorrect results.
|
|
215 |
|
|
216 |
Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
|
|
217 |
regular expression matching according to the POSIX strategy
|
|
218 |
\cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
|
|
219 |
Brzozowski from 1964 where he introduced the notion of derivatives of
|
|
220 |
regular expressions \cite{Brzozowski1964}. We shall briefly explain
|
|
221 |
the algorithms next.
|
|
222 |
|
|
223 |
\section{The Algorithms by Brzozowski, and Sulzmann and Lu}
|
|
224 |
|
|
225 |
Suppose regular expressions are given by the following grammar (for
|
|
226 |
the moment ignore the grammar for values on the right-hand side):
|
|
227 |
|
|
228 |
|
|
229 |
\begin{center}
|
|
230 |
\begin{tabular}{c@{\hspace{20mm}}c}
|
|
231 |
\begin{tabular}{@{}rrl@{}}
|
|
232 |
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
|
|
233 |
$r$ & $::=$ & $\ZERO$\\
|
|
234 |
& $\mid$ & $\ONE$ \\
|
|
235 |
& $\mid$ & $c$ \\
|
|
236 |
& $\mid$ & $r_1 \cdot r_2$\\
|
|
237 |
& $\mid$ & $r_1 + r_2$ \\
|
|
238 |
\\
|
|
239 |
& $\mid$ & $r^*$ \\
|
|
240 |
\end{tabular}
|
|
241 |
&
|
|
242 |
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
|
|
243 |
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
|
|
244 |
$v$ & $::=$ & \\
|
|
245 |
& & $\Empty$ \\
|
|
246 |
& $\mid$ & $\Char(c)$ \\
|
|
247 |
& $\mid$ & $\Seq\,v_1\, v_2$\\
|
|
248 |
& $\mid$ & $\Left(v)$ \\
|
|
249 |
& $\mid$ & $\Right(v)$ \\
|
|
250 |
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
|
|
251 |
\end{tabular}
|
|
252 |
\end{tabular}
|
|
253 |
\end{center}
|
|
254 |
|
|
255 |
\noindent
|
|
256 |
The intended meaning of the regular expressions is as usual: $\ZERO$
|
|
257 |
cannot match any string, $\ONE$ can match the empty string, the
|
|
258 |
character regular expression $c$ can match the character $c$, and so
|
|
259 |
on. The brilliant contribution by Brzozowski is the notion of
|
|
260 |
\emph{derivatives} of regular expressions. The idea behind this
|
|
261 |
notion is as follows: suppose a regular expression $r$ can match a
|
|
262 |
string of the form $c\!::\! s$ (that is a list of characters starting
|
|
263 |
with $c$), what does the regular expression look like that can match
|
|
264 |
just $s$? Brzozowski gave a neat answer to this question. He defined
|
|
265 |
the following operation on regular expressions, written
|
|
266 |
$r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
|
|
267 |
|
|
268 |
\begin{center}
|
|
269 |
\begin{tabular}{lcl}
|
|
270 |
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
|
|
271 |
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
|
|
272 |
$d \backslash c$ & $\dn$ &
|
|
273 |
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
|
|
274 |
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
|
|
275 |
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\
|
|
276 |
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
|
|
277 |
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
|
|
278 |
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
|
|
279 |
\end{tabular}
|
|
280 |
\end{center}
|
|
281 |
|
|
282 |
\noindent
|
|
283 |
In this definition $\nullable(\_)$ stands for a simple recursive
|
|
284 |
function that tests whether a regular expression can match the empty
|
|
285 |
string (its definition is omitted). Assuming the classic notion of a
|
|
286 |
\emph{language} of a regular expression, written $L(\_)$, the main
|
|
287 |
property of the derivative operation is that
|
|
288 |
|
|
289 |
\begin{center}
|
|
290 |
$c\!::\!s \in L(r)$ holds
|
|
291 |
if and only if $s \in L(r\backslash c)$.
|
|
292 |
\end{center}
|
|
293 |
|
|
294 |
\noindent
|
|
295 |
The beauty of derivatives is that they lead to a really simple regular
|
|
296 |
expression matching algorithm: To find out whether a string $s$
|
|
297 |
matches with a regular expression $r$, build the derivatives of $r$
|
|
298 |
w.r.t.\ (in succession) all the characters of the string $s$. Finally,
|
|
299 |
test whether the resulting regular expression can match the empty
|
|
300 |
string. If yes, then $r$ matches $s$, and no in the negative
|
|
301 |
case. For us the main advantage is that derivatives can be
|
|
302 |
straightforwardly implemented in any functional programming language,
|
|
303 |
and are easily definable and reasoned about in theorem provers---the
|
|
304 |
definitions just consist of inductive datatypes and simple recursive
|
|
305 |
functions. Moreover, the notion of derivatives can be easily
|
|
306 |
generalised to cover extended regular expression constructors such as
|
|
307 |
the not-regular expression, written $\neg\,r$, or bounded repetitions
|
|
308 |
(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
|
|
309 |
straightforwardly realised within the classic automata approach.
|
|
310 |
|
|
311 |
|
|
312 |
One limitation, however, of Brzozowski's algorithm is that it only
|
|
313 |
produces a YES/NO answer for whether a string is being matched by a
|
|
314 |
regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this
|
|
315 |
algorithm to allow generation of an actual matching, called a
|
|
316 |
\emph{value}---see the grammar above for its definition. Assuming a
|
|
317 |
regular expression matches a string, values encode the information of
|
|
318 |
\emph{how} the string is matched by the regular expression---that is,
|
|
319 |
which part of the string is matched by which part of the regular
|
|
320 |
expression. To illustrate this consider the string $xy$ and the
|
|
321 |
regular expression $(x + (y + xy))^*$. We can view this regular
|
|
322 |
expression as a tree and if the string $xy$ is matched by two Star
|
|
323 |
``iterations'', then the $x$ is matched by the left-most alternative
|
|
324 |
in this tree and the $y$ by the right-left alternative. This suggests
|
|
325 |
to record this matching as
|
|
326 |
|
|
327 |
\begin{center}
|
|
328 |
$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
|
|
329 |
\end{center}
|
|
330 |
|
|
331 |
\noindent
|
|
332 |
where $\Stars$ records how many
|
|
333 |
iterations were used; and $\Left$, respectively $\Right$, which
|
|
334 |
alternative is used. The value for
|
|
335 |
matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
|
|
336 |
would look as follows
|
|
337 |
|
|
338 |
\begin{center}
|
|
339 |
$\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
|
|
340 |
\end{center}
|
|
341 |
|
|
342 |
\noindent
|
|
343 |
where $\Stars$ has only a single-element list for the single iteration
|
|
344 |
and $\Seq$ indicates that $xy$ is matched by a sequence regular
|
|
345 |
expression.
|
|
346 |
|
|
347 |
The contribution of Sulzmann and Lu is an extension of Brzozowski's
|
|
348 |
algorithm by a second phase (the first phase being building successive
|
|
349 |
derivatives). In this second phase, for every successful match the
|
|
350 |
corresponding POSIX value is computed. As mentioned before, from this
|
|
351 |
value we can extract the information \emph{how} a regular expression
|
|
352 |
matched a string. We omit the details here on how Sulzmann and Lu
|
|
353 |
achieved this~(see \cite{Sulzmann2014}). Rather, we shall focus next on the
|
|
354 |
process of simplification of regular expressions, which is needed in
|
|
355 |
order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
|
|
356 |
and Lu's algorithms. This is where the PhD-project hopes to advance
|
|
357 |
the state-of-the-art.
|
|
358 |
|
|
359 |
|
|
360 |
\section{Simplification of Regular Expressions}
|
|
361 |
|
|
362 |
The main drawback of building successive derivatives according to
|
|
363 |
Brzozowski's definition is that they can grow very quickly in size.
|
|
364 |
This is mainly due to the fact that the derivative operation generates
|
|
365 |
often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result,
|
|
366 |
if implemented naively both algorithms by Brzozowski and by Sulzmann
|
|
367 |
and Lu are excruciatingly slow. For example when starting with the
|
|
368 |
regular expression $(a + aa)^*$ and building 12 successive derivatives
|
|
369 |
w.r.t.~the character $a$, one obtains a derivative regular expression
|
|
370 |
with more than 8000 nodes (when viewed as a tree). Operations like
|
|
371 |
derivative and $\nullable$ need to traverse such trees and
|
|
372 |
consequently the bigger the size of the derivative the slower the
|
|
373 |
algorithm. Fortunately, one can simplify regular expressions after
|
|
374 |
each derivative step. Various simplifications of regular expressions
|
|
375 |
are possible, such as the simplifications of $\ZERO + r$,
|
|
376 |
$r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
|
|
377 |
$r$. These simplifications do not affect the answer for whether a
|
|
378 |
regular expression matches a string or not, but fortunately also do
|
|
379 |
not affect the POSIX strategy of how regular expressions match
|
|
380 |
strings---although the latter is much harder to establish. Some
|
|
381 |
initial results in this regard have been obtained in
|
|
382 |
\cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
|
|
383 |
is a very tight bound for the size. Such a tight bound is suggested by
|
|
384 |
work of Antimirov who proved that (partial) derivatives can be bound
|
|
385 |
by the number of characters contained in the initial regular
|
|
386 |
expression \cite{Antimirov95}. We believe, and have generated test
|
|
387 |
data, that a similar bound can be obtained for the derivatives in
|
|
388 |
Sulzmann and Lu's algorithm. Let us give some details about this next.
|
|
389 |
|
|
390 |
We first followed Sulzmann and Lu's idea of introducing
|
|
391 |
\emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
|
|
392 |
defined by the following grammar:
|
|
393 |
|
|
394 |
\begin{center}
|
|
395 |
\begin{tabular}{lcl}
|
|
396 |
$\textit{a}$ & $::=$ & $\textit{ZERO}$\\
|
|
397 |
& $\mid$ & $\textit{ONE}\;\;bs$\\
|
|
398 |
& $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
|
|
399 |
& $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
|
|
400 |
& $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
|
|
401 |
& $\mid$ & $\textit{STAR}\;\;bs\,a$
|
|
402 |
\end{tabular}
|
|
403 |
\end{center}
|
|
404 |
|
|
405 |
\noindent
|
|
406 |
where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
|
|
407 |
list of annotated regular expressions. These bitsequences encode
|
|
408 |
information about the (POSIX) value that should be generated by the
|
|
409 |
Sulzmann and Lu algorithm. There are operations that can transform the
|
|
410 |
usual (un-annotated) regular expressions into annotated regular
|
|
411 |
expressions, and there are operations for encoding/decoding values to
|
|
412 |
or from bitsequences. For example the encoding function for values is
|
|
413 |
defined as follows:
|
|
414 |
|
|
415 |
\begin{center}
|
|
416 |
\begin{tabular}{lcl}
|
|
417 |
$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
|
|
418 |
$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
|
|
419 |
$\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
|
|
420 |
$\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
|
|
421 |
$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
|
|
422 |
$\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
|
|
423 |
$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\;
|
|
424 |
code(\Stars\,vs)$
|
|
425 |
\end{tabular}
|
|
426 |
\end{center}
|
|
427 |
|
|
428 |
\noindent
|
|
429 |
where $\Z$ and $\S$ are arbitrary names for the ``bits'' in the
|
|
430 |
bitsequences. Although this encoding is ``lossy'' in the sense of not
|
|
431 |
recording all details of a value, Sulzmann and Lu have defined the
|
|
432 |
decoding function such that with additional information (namely the
|
|
433 |
corresponding regular expression) a value can be precisely extracted
|
|
434 |
from a bitsequence.
|
|
435 |
|
|
436 |
The main point of the bitsequences and annotated regular expressions
|
|
437 |
is that we can apply rather aggressive (in terms of size)
|
|
438 |
simplification rules in order to keep derivatives small. We have
|
|
439 |
developed such ``aggressive'' simplification rules and generated test
|
|
440 |
data that show that the expected bound can be achieved. Obviously we
|
|
441 |
could only cover partially the search space as there are infinitely
|
|
442 |
many regular expressions and strings. One modification we introduced
|
|
443 |
is to allow a list of annotated regular expressions in the
|
|
444 |
\textit{ALTS} constructor. This allows us to not just delete
|
|
445 |
unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
|
|
446 |
unnecessary ``copies'' of regular expressions (very similar to
|
|
447 |
simplifying $r + r$ to just $r$, but in a more general
|
|
448 |
setting). Another modification is that we use simplification rules
|
|
449 |
inspired by Antimirov's work on partial derivatives. They maintain the
|
|
450 |
idea that only the first ``copy'' of a regular expression in an
|
|
451 |
alternative contributes to the calculation of a POSIX value. All
|
|
452 |
subsequent copies can be prunned from the regular expression.
|
|
453 |
|
|
454 |
We are currently engaged with proving that our simplification rules
|
|
455 |
actually do not affect the POSIX value that should be generated by the
|
|
456 |
algorithm according to the specification of a POSIX value and
|
|
457 |
furthermore that our derivatives stay small for all derivatives. For
|
|
458 |
this proof we use the theorem prover Isabelle. Once completed, this
|
|
459 |
result will advance the state-of-the-art: Sulzmann and Lu wrote in
|
|
460 |
their paper \cite{Sulzmann2014} about the bitcoded ``incremental
|
|
461 |
parsing method'' (that is the matching algorithm outlined in this
|
|
462 |
section):
|
|
463 |
|
|
464 |
\begin{quote}\it
|
|
465 |
``Correctness Claim: We further claim that the incremental parsing
|
|
466 |
method in Figure~5 in combination with the simplification steps in
|
|
467 |
Figure 6 yields POSIX parse trees. We have tested this claim
|
|
468 |
extensively by using the method in Figure~3 as a reference but yet
|
|
469 |
have to work out all proof details.''
|
|
470 |
\end{quote}
|
|
471 |
|
|
472 |
\noindent
|
|
473 |
We would settle the correctness claim and furthermore obtain a much
|
|
474 |
tighter bound on the sizes of derivatives. The result is that our
|
|
475 |
algorithm should be correct and faster on all inputs. The original
|
|
476 |
blow-up, as observed in JavaScript, Python and Java, would be excluded
|
|
477 |
from happening in our algorithm.
|
|
478 |
|
|
479 |
\section{Conclusion}
|
|
480 |
|
|
481 |
In this PhD-project we are interested in fast algorithms for regular
|
|
482 |
expression matching. While this seems to be a ``settled'' area, in
|
|
483 |
fact interesting research questions are popping up as soon as one steps
|
|
484 |
outside the classic automata theory (for example in terms of what kind
|
|
485 |
of regular expressions are supported). The reason why it is
|
|
486 |
interesting for us to look at the derivative approach introduced by
|
|
487 |
Brzozowski for regular expression matching, and then much further
|
|
488 |
developed by Sulzmann and Lu, is that derivatives can elegantly deal
|
|
489 |
with some of the regular expressions that are of interest in ``real
|
|
490 |
life''. This includes the not-regular expression, written $\neg\,r$
|
|
491 |
(that is all strings that are not recognised by $r$), but also bounded
|
|
492 |
regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$). There is
|
|
493 |
also hope that the derivatives can provide another angle for how to
|
|
494 |
deal more efficiently with back-references, which are one of the
|
|
495 |
reasons why regular expression engines in JavaScript, Python and Java
|
|
496 |
choose to not implement the classic automata approach of transforming
|
|
497 |
regular expressions into NFAs and then DFAs---because we simply do not
|
|
498 |
know how such back-references can be represented by DFAs.
|
|
499 |
|
|
500 |
|
|
501 |
\bibliographystyle{plain}
|
|
502 |
\bibliography{root}
|
|
503 |
|
|
504 |
|
|
505 |
\end{document}
|