|
1 \documentclass[a4paper,UKenglish]{lipics} |
|
2 \usepackage{graphic} |
|
3 \usepackage{data} |
|
4 % \documentclass{article} |
|
5 %\usepackage[utf8]{inputenc} |
|
6 %\usepackage[english]{babel} |
|
7 %\usepackage{listings} |
|
8 % \usepackage{amsthm} |
|
9 % \usepackage{hyperref} |
|
10 % \usepackage[margin=0.5in]{geometry} |
|
11 %\usepackage{pmboxdraw} |
|
12 |
|
13 \title{POSIX Regular Expression Matching and Lexing} |
|
14 \author{Chengsong Tan} |
|
15 \affil{King's College London\\ |
|
16 London, UK\\ |
|
17 \texttt{chengsong.tan@kcl.ac.uk}} |
|
18 \authorrunning{Chengsong Tan} |
|
19 \Copyright{Chengsong Tan} |
|
20 |
|
21 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}% |
|
22 \newcommand{\ZERO}{\mbox{\bf 0}} |
|
23 \newcommand{\ONE}{\mbox{\bf 1}} |
|
24 \def\lexer{\mathit{lexer}} |
|
25 \def\mkeps{\mathit{mkeps}} |
|
26 \def\inj{\mathit{inj}} |
|
27 \def\Empty{\mathit{Empty}} |
|
28 \def\Left{\mathit{Left}} |
|
29 \def\Right{\mathit{Right}} |
|
30 \def\Stars{\mathit{Stars}} |
|
31 \def\Char{\mathit{Char}} |
|
32 \def\Seq{\mathit{Seq}} |
|
33 \def\Der{\mathit{Der}} |
|
34 \def\nullable{\mathit{nullable}} |
|
35 \def\Z{\mathit{Z}} |
|
36 \def\S{\mathit{S}} |
|
37 |
|
38 %\theoremstyle{theorem} |
|
39 %\newtheorem{theorem}{Theorem} |
|
40 %\theoremstyle{lemma} |
|
41 %\newtheorem{lemma}{Lemma} |
|
42 %\newcommand{\lemmaautorefname}{Lemma} |
|
43 %\theoremstyle{definition} |
|
44 %\newtheorem{definition}{Definition} |
|
45 |
|
46 |
|
47 \begin{document} |
|
48 |
|
49 \maketitle |
|
50 |
|
51 \begin{abstract} |
|
52 Brzozowski introduced in 1964 a beautifully simple algorithm for |
|
53 regular expression matching based on the notion of derivatives of |
|
54 regular expressions. In 2014, Sulzmann and Lu extended this |
|
55 algorithm to not just give a YES/NO answer for whether or not a regular |
|
56 expression matches a string, but in case it matches also \emph{how} |
|
57 it matches the string. This is important for applications such as |
|
58 lexing (tokenising a string). The problem is to make the algorithm |
|
59 by Sulzmann and Lu fast on all inputs without breaking its |
|
60 correctness. |
|
61 \end{abstract} |
|
62 |
|
63 \section{Introduction} |
|
64 |
|
65 This PhD-project is about regular expression matching and |
|
66 lexing. Given the maturity of this topic, the reader might wonder: |
|
67 Surely, regular expressions must have already been studied to death? |
|
68 What could possibly be \emph{not} known in this area? And surely all |
|
69 implemented algorithms for regular expression matching are blindingly |
|
70 fast? |
|
71 |
|
72 |
|
73 |
|
74 Unfortunately these preconceptions are not supported by evidence: Take |
|
75 for example the regular expression $(a^*)^*\,b$ and ask whether |
|
76 strings of the form $aa..a$ match this regular |
|
77 expression. Obviously they do not match---the expected $b$ in the last |
|
78 position is missing. One would expect that modern regular expression |
|
79 matching engines can find this out very quickly. Alas, if one tries |
|
80 this example in JavaScript, Python or Java 8 with strings like 28 |
|
81 $a$'s, one discovers that this decision takes around 30 seconds and |
|
82 takes considerably longer when adding a few more $a$'s, as the graphs |
|
83 below show: |
|
84 |
|
85 \begin{center} |
|
86 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
87 \begin{tikzpicture} |
|
88 \begin{axis}[ |
|
89 xlabel={$n$}, |
|
90 x label style={at={(1.05,-0.05)}}, |
|
91 ylabel={time in secs}, |
|
92 enlargelimits=false, |
|
93 xtick={0,5,...,30}, |
|
94 xmax=33, |
|
95 ymax=35, |
|
96 ytick={0,5,...,30}, |
|
97 scaled ticks=false, |
|
98 axis lines=left, |
|
99 width=5cm, |
|
100 height=4cm, |
|
101 legend entries={JavaScript}, |
|
102 legend pos=north west, |
|
103 legend cell align=left] |
|
104 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
|
105 \end{axis} |
|
106 \end{tikzpicture} |
|
107 & |
|
108 \begin{tikzpicture} |
|
109 \begin{axis}[ |
|
110 xlabel={$n$}, |
|
111 x label style={at={(1.05,-0.05)}}, |
|
112 %ylabel={time in secs}, |
|
113 enlargelimits=false, |
|
114 xtick={0,5,...,30}, |
|
115 xmax=33, |
|
116 ymax=35, |
|
117 ytick={0,5,...,30}, |
|
118 scaled ticks=false, |
|
119 axis lines=left, |
|
120 width=5cm, |
|
121 height=4cm, |
|
122 legend entries={Python}, |
|
123 legend pos=north west, |
|
124 legend cell align=left] |
|
125 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
126 \end{axis} |
|
127 \end{tikzpicture} |
|
128 & |
|
129 \begin{tikzpicture} |
|
130 \begin{axis}[ |
|
131 xlabel={$n$}, |
|
132 x label style={at={(1.05,-0.05)}}, |
|
133 %ylabel={time in secs}, |
|
134 enlargelimits=false, |
|
135 xtick={0,5,...,30}, |
|
136 xmax=33, |
|
137 ymax=35, |
|
138 ytick={0,5,...,30}, |
|
139 scaled ticks=false, |
|
140 axis lines=left, |
|
141 width=5cm, |
|
142 height=4cm, |
|
143 legend entries={Java 8}, |
|
144 legend pos=north west, |
|
145 legend cell align=left] |
|
146 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
147 \end{axis} |
|
148 \end{tikzpicture}\\ |
|
149 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings |
|
150 of the form $\underbrace{aa..a}_{n}$.} |
|
151 \end{tabular} |
|
152 \end{center} |
|
153 |
|
154 \noindent These are clearly abysmal and possibly surprising results. |
|
155 One would expect these systems doing much better than that---after |
|
156 all, given a DFA and a string, whether a string is matched by this DFA |
|
157 should be linear. |
|
158 |
|
159 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to |
|
160 exhibit this ``exponential behaviour''. Unfortunately, such regular |
|
161 expressions are not just a few ``outliers'', but actually they are |
|
162 frequent enough that a separate name has been created for |
|
163 them---\emph{evil regular expressions}. In empiric work, Davis et al |
|
164 report that they have found thousands of such evil regular expressions |
|
165 in the JavaScript and Python ecosystems \cite{Davis18}. |
|
166 |
|
167 This exponential blowup sometimes causes real pain in real life: |
|
168 for example on 20 July 2016 one evil regular expression brought the |
|
169 webpage \href{http://stackexchange.com}{Stack Exchange} to its knees \cite{SE16}. |
|
170 In this instance, a regular expression intended to just trim white |
|
171 spaces from the beginning and the end of a line actually consumed |
|
172 massive amounts of CPU-resources and because of this the web servers |
|
173 ground to a halt. This happened when a post with 20,000 white spaces |
|
174 was submitted, but importantly the white spaces were neither at the |
|
175 beginning nor at the end. As a result, the regular expression matching |
|
176 engine needed to backtrack over many choices. |
|
177 |
|
178 The underlying problem is that many ``real life'' regular expression |
|
179 matching engines do not use DFAs for matching. This is because they |
|
180 support regular expressions that are not covered by the classical |
|
181 automata theory, and in this more general setting there are quite a |
|
182 few research questions still unanswered and fast algorithms still need |
|
183 to be developed. |
|
184 |
|
185 There is also another under-researched problem to do with regular |
|
186 expressions and lexing, i.e.~the process of breaking up strings into |
|
187 sequences of tokens according to some regular expressions. In this |
|
188 setting one is not just interested in whether or not a regular |
|
189 expression matches a string, but if it matches also in \emph{how} it |
|
190 matches the string. Consider for example a regular expression |
|
191 $r_{key}$ for recognising keywords such as \textit{if}, \textit{then} |
|
192 and so on; and a regular expression $r_{id}$ for recognising |
|
193 identifiers (say, a single character followed by characters or |
|
194 numbers). One can then form the compound regular expression |
|
195 $(r_{key} + r_{id})^*$ and use it to tokenise strings. But then how |
|
196 should the string \textit{iffoo} be tokenised? It could be tokenised |
|
197 as a keyword followed by an identifier, or the entire string as a |
|
198 single identifier. Similarly, how should the string \textit{if} be |
|
199 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would |
|
200 ``fire''---so is it an identifier or a keyword? While in applications |
|
201 there is a well-known strategy to decide these questions, called POSIX |
|
202 matching, only relatively recently precise definitions of what POSIX |
|
203 matching actually means have been formalised |
|
204 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly, |
|
205 POSIX matching means matching the longest initial substring and |
|
206 in the case of a tie, the initial submatch is chosen according to some priorities attached to the |
|
207 regular expressions (e.g.~keywords have a higher priority than |
|
208 identifiers). This sounds rather simple, but according to Grathwohl et |
|
209 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote: |
|
210 |
|
211 \begin{quote} |
|
212 \it{}``The POSIX strategy is more complicated than the greedy because of |
|
213 the dependence on information about the length of matched strings in the |
|
214 various subexpressions.'' |
|
215 \end{quote} |
|
216 |
|
217 \noindent |
|
218 This is also supported by evidence collected by Kuklewicz |
|
219 \cite{Kuklewicz} who noticed that a number of POSIX regular expression |
|
220 matchers calculate incorrect results. |
|
221 |
|
222 Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for |
|
223 regular expression matching according to the POSIX strategy |
|
224 \cite{Sulzmann2014}. Their algorithm is based on an older algorithm by |
|
225 Brzozowski from 1964 where he introduced the notion of derivatives of |
|
226 regular expressions \cite{Brzozowski1964}. We shall briefly explain |
|
227 the algorithms next. |
|
228 |
|
229 \section{The Algorithms by Brzozowski, and Sulzmann and Lu} |
|
230 |
|
231 Suppose regular expressions are given by the following grammar (for |
|
232 the moment ignore the grammar for values on the right-hand side): |
|
233 |
|
234 |
|
235 \begin{center} |
|
236 \begin{tabular}{c@{\hspace{20mm}}c} |
|
237 \begin{tabular}{@{}rrl@{}} |
|
238 \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ |
|
239 $r$ & $::=$ & $\ZERO$\\ |
|
240 & $\mid$ & $\ONE$ \\ |
|
241 & $\mid$ & $c$ \\ |
|
242 & $\mid$ & $r_1 \cdot r_2$\\ |
|
243 & $\mid$ & $r_1 + r_2$ \\ |
|
244 \\ |
|
245 & $\mid$ & $r^*$ \\ |
|
246 \end{tabular} |
|
247 & |
|
248 \begin{tabular}{@{\hspace{0mm}}rrl@{}} |
|
249 \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\ |
|
250 $v$ & $::=$ & \\ |
|
251 & & $\Empty$ \\ |
|
252 & $\mid$ & $\Char(c)$ \\ |
|
253 & $\mid$ & $\Seq\,v_1\, v_2$\\ |
|
254 & $\mid$ & $\Left(v)$ \\ |
|
255 & $\mid$ & $\Right(v)$ \\ |
|
256 & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\ |
|
257 \end{tabular} |
|
258 \end{tabular} |
|
259 \end{center} |
|
260 |
|
261 \noindent |
|
262 The intended meaning of the regular expressions is as usual: $\ZERO$ |
|
263 cannot match any string, $\ONE$ can match the empty string, the |
|
264 character regular expression $c$ can match the character $c$, and so |
|
265 on. The brilliant contribution by Brzozowski is the notion of |
|
266 \emph{derivatives} of regular expressions. The idea behind this |
|
267 notion is as follows: suppose a regular expression $r$ can match a |
|
268 string of the form $c\!::\! s$ (that is a list of characters starting |
|
269 with $c$), what does the regular expression look like that can match |
|
270 just $s$? Brzozowski gave a neat answer to this question. He defined |
|
271 the following operation on regular expressions, written |
|
272 $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$): |
|
273 |
|
274 \begin{center} |
|
275 \begin{tabular}{lcl} |
|
276 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
277 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
278 $d \backslash c$ & $\dn$ & |
|
279 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
280 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
281 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\ |
|
282 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
283 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
284 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
285 \end{tabular} |
|
286 \end{center} |
|
287 |
|
288 \noindent |
|
289 In this definition $\nullable(\_)$ stands for a simple recursive |
|
290 function that tests whether a regular expression can match the empty |
|
291 string (its definition is omitted). Assuming the classic notion of a |
|
292 \emph{language} of a regular expression, written $L(\_)$, the main |
|
293 property of the derivative operation is that |
|
294 |
|
295 \begin{center} |
|
296 $c\!::\!s \in L(r)$ holds |
|
297 if and only if $s \in L(r\backslash c)$. |
|
298 \end{center} |
|
299 |
|
300 \noindent |
|
301 The beauty of derivatives is that they lead to a really simple regular |
|
302 expression matching algorithm: To find out whether a string $s$ |
|
303 matches with a regular expression $r$, build the derivatives of $r$ |
|
304 w.r.t.\ (in succession) all the characters of the string $s$. Finally, |
|
305 test whether the resulting regular expression can match the empty |
|
306 string. If yes, then $r$ matches $s$, and no in the negative |
|
307 case. For us the main advantage is that derivatives can be |
|
308 straightforwardly implemented in any functional programming language, |
|
309 and are easily definable and reasoned about in theorem provers---the |
|
310 definitions just consist of inductive datatypes and simple recursive |
|
311 functions. Moreover, the notion of derivatives can be easily |
|
312 generalised to cover extended regular expression constructors such as |
|
313 the not-regular expression, written $\neg\,r$, or bounded repetitions |
|
314 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so |
|
315 straightforwardly realised within the classic automata approach. |
|
316 |
|
317 |
|
318 One limitation, however, of Brzozowski's algorithm is that it only |
|
319 produces a YES/NO answer for whether a string is being matched by a |
|
320 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
|
321 algorithm to allow generation of an actual matching, called a |
|
322 \emph{value}---see the grammar above for its definition. Assuming a |
|
323 regular expression matches a string, values encode the information of |
|
324 \emph{how} the string is matched by the regular expression---that is, |
|
325 which part of the string is matched by which part of the regular |
|
326 expression. To illustrate this consider the string $xy$ and the |
|
327 regular expression $(x + (y + xy))^*$. We can view this regular |
|
328 expression as a tree and if the string $xy$ is matched by two Star |
|
329 ``iterations'', then the $x$ is matched by the left-most alternative |
|
330 in this tree and the $y$ by the right-left alternative. This suggests |
|
331 to record this matching as |
|
332 |
|
333 \begin{center} |
|
334 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$ |
|
335 \end{center} |
|
336 |
|
337 \noindent |
|
338 where $\Stars$ records how many |
|
339 iterations were used; and $\Left$, respectively $\Right$, which |
|
340 alternative is used. The value for |
|
341 matching $xy$ in a single ``iteration'', i.e.~the POSIX value, |
|
342 would look as follows |
|
343 |
|
344 \begin{center} |
|
345 $\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$ |
|
346 \end{center} |
|
347 |
|
348 \noindent |
|
349 where $\Stars$ has only a single-element list for the single iteration |
|
350 and $\Seq$ indicates that $xy$ is matched by a sequence regular |
|
351 expression. |
|
352 |
|
353 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
|
354 algorithm by a second phase (the first phase being building successive |
|
355 derivatives). In this second phase, for every successful match the |
|
356 corresponding POSIX value is computed. As mentioned before, from this |
|
357 value we can extract the information \emph{how} a regular expression |
|
358 matched a string. We omit the details here on how Sulzmann and Lu |
|
359 achieved this~(see \cite{Sulzmann2014}). Rather, we shall focus next on the |
|
360 process of simplification of regular expressions, which is needed in |
|
361 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
|
362 and Lu's algorithms. This is where the PhD-project hopes to advance |
|
363 the state-of-the-art. |
|
364 |
|
365 |
|
366 \section{Simplification of Regular Expressions} |
|
367 |
|
368 The main drawback of building successive derivatives according to |
|
369 Brzozowski's definition is that they can grow very quickly in size. |
|
370 This is mainly due to the fact that the derivative operation generates |
|
371 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, |
|
372 if implemented naively both algorithms by Brzozowski and by Sulzmann |
|
373 and Lu are excruciatingly slow. For example when starting with the |
|
374 regular expression $(a + aa)^*$ and building 12 successive derivatives |
|
375 w.r.t.~the character $a$, one obtains a derivative regular expression |
|
376 with more than 8000 nodes (when viewed as a tree). Operations like |
|
377 derivative and $\nullable$ need to traverse such trees and |
|
378 consequently the bigger the size of the derivative the slower the |
|
379 algorithm. Fortunately, one can simplify regular expressions after |
|
380 each derivative step. Various simplifications of regular expressions |
|
381 are possible, such as the simplifications of $\ZERO + r$, |
|
382 $r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just |
|
383 $r$. These simplifications do not affect the answer for whether a |
|
384 regular expression matches a string or not, but fortunately also do |
|
385 not affect the POSIX strategy of how regular expressions match |
|
386 strings---although the latter is much harder to establish. Some |
|
387 initial results in this regard have been obtained in |
|
388 \cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet |
|
389 is a very tight bound for the size. Such a tight bound is suggested by |
|
390 work of Antimirov who proved that (partial) derivatives can be bound |
|
391 by the number of characters contained in the initial regular |
|
392 expression \cite{Antimirov95}. We believe, and have generated test |
|
393 data, that a similar bound can be obtained for the derivatives in |
|
394 Sulzmann and Lu's algorithm. Let us give some details about this next. |
|
395 |
|
396 We first followed Sulzmann and Lu's idea of introducing |
|
397 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are |
|
398 defined by the following grammar: |
|
399 |
|
400 \begin{center} |
|
401 \begin{tabular}{lcl} |
|
402 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\ |
|
403 & $\mid$ & $\textit{ONE}\;\;bs$\\ |
|
404 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\ |
|
405 & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\ |
|
406 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\ |
|
407 & $\mid$ & $\textit{STAR}\;\;bs\,a$ |
|
408 \end{tabular} |
|
409 \end{center} |
|
410 |
|
411 \noindent |
|
412 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a |
|
413 list of annotated regular expressions. These bitsequences encode |
|
414 information about the (POSIX) value that should be generated by the |
|
415 Sulzmann and Lu algorithm. There are operations that can transform the |
|
416 usual (un-annotated) regular expressions into annotated regular |
|
417 expressions, and there are operations for encoding/decoding values to |
|
418 or from bitsequences. For example the encoding function for values is |
|
419 defined as follows: |
|
420 |
|
421 \begin{center} |
|
422 \begin{tabular}{lcl} |
|
423 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
|
424 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
|
425 $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\ |
|
426 $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\ |
|
427 $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ |
|
428 $\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\ |
|
429 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\; |
|
430 code(\Stars\,vs)$ |
|
431 \end{tabular} |
|
432 \end{center} |
|
433 |
|
434 \noindent |
|
435 where $\Z$ and $\S$ are arbitrary names for the ``bits'' in the |
|
436 bitsequences. Although this encoding is ``lossy'' in the sense of not |
|
437 recording all details of a value, Sulzmann and Lu have defined the |
|
438 decoding function such that with additional information (namely the |
|
439 corresponding regular expression) a value can be precisely extracted |
|
440 from a bitsequence. |
|
441 |
|
442 The main point of the bitsequences and annotated regular expressions |
|
443 is that we can apply rather aggressive (in terms of size) |
|
444 simplification rules in order to keep derivatives small. We have |
|
445 developed such ``aggressive'' simplification rules and generated test |
|
446 data that show that the expected bound can be achieved. Obviously we |
|
447 could only partially cover the search space as there are infinitely |
|
448 many regular expressions and strings. One modification we introduced |
|
449 is to allow a list of annotated regular expressions in the |
|
450 \textit{ALTS} constructor. This allows us to not just delete |
|
451 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also |
|
452 unnecessary ``copies'' of regular expressions (very similar to |
|
453 simplifying $r + r$ to just $r$, but in a more general |
|
454 setting). Another modification is that we use simplification rules |
|
455 inspired by Antimirov's work on partial derivatives. They maintain the |
|
456 idea that only the first ``copy'' of a regular expression in an |
|
457 alternative contributes to the calculation of a POSIX value. All |
|
458 subsequent copies can be prunned from the regular expression. |
|
459 |
|
460 We are currently engaged with proving that our simplification rules |
|
461 actually do not affect the POSIX value that should be generated by the |
|
462 algorithm according to the specification of a POSIX value and |
|
463 furthermore that our derivatives stay small for all derivatives. For |
|
464 this proof we use the theorem prover Isabelle. Once completed, this |
|
465 result will advance the state-of-the-art: Sulzmann and Lu wrote in |
|
466 their paper \cite{Sulzmann2014} about the bitcoded ``incremental |
|
467 parsing method'' (that is the matching algorithm outlined in this |
|
468 section): |
|
469 |
|
470 \begin{quote}\it |
|
471 ``Correctness Claim: We further claim that the incremental parsing |
|
472 method in Figure~5 in combination with the simplification steps in |
|
473 Figure 6 yields POSIX parse trees. We have tested this claim |
|
474 extensively by using the method in Figure~3 as a reference but yet |
|
475 have to work out all proof details.'' |
|
476 \end{quote} |
|
477 |
|
478 \noindent |
|
479 We would settle the correctness claim and furthermore obtain a much |
|
480 tighter bound on the sizes of derivatives. The result is that our |
|
481 algorithm should be correct and faster on all inputs. The original |
|
482 blow-up, as observed in JavaScript, Python and Java, would be excluded |
|
483 from happening in our algorithm. |
|
484 |
|
485 \section{Conclusion} |
|
486 |
|
487 In this PhD-project we are interested in fast algorithms for regular |
|
488 expression matching. While this seems to be a ``settled'' area, in |
|
489 fact interesting research questions are popping up as soon as one steps |
|
490 outside the classic automata theory (for example in terms of what kind |
|
491 of regular expressions are supported). The reason why it is |
|
492 interesting for us to look at the derivative approach introduced by |
|
493 Brzozowski for regular expression matching, and then much further |
|
494 developed by Sulzmann and Lu, is that derivatives can elegantly deal |
|
495 with some of the regular expressions that are of interest in ``real |
|
496 life''. This includes the not-regular expression, written $\neg\,r$ |
|
497 (that is all strings that are not recognised by $r$), but also bounded |
|
498 regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$). There is |
|
499 also hope that the derivatives can provide another angle for how to |
|
500 deal more efficiently with back-references, which are one of the |
|
501 reasons why regular expression engines in JavaScript, Python and Java |
|
502 choose to not implement the classic automata approach of transforming |
|
503 regular expressions into NFAs and then DFAs---because we simply do not |
|
504 know how such back-references can be represented by DFAs. |
|
505 |
|
506 |
|
507 \bibliographystyle{plain} |
|
508 \bibliography{root} |
|
509 |
|
510 |
|
511 \end{document} |