for function applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g. $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.}.
|
\section{Technical Overview}

Consider for example the regular expression $(a^*)^*\,b$ and
strings of the form $aa..a$. These strings cannot be matched by this regular
expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can detect this very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of length around 30 $a$'s,
the decision takes an absurd amount of time to finish (see the graphs in Figure~\ref{fig:aStarStarb}).
The matching algorithms clearly show exponential behaviour, which, as can be seen,
is triggered by some relatively simple regular expressions.
Java 9 and newer
versions improve this behaviour somewhat, but are still slow compared
with the approach we are going to use in this thesis.
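
The blowup is easy to reproduce with a few lines of Python (a quick illustration using the built-in \texttt{re} module, not the benchmark setup behind Figure~\ref{fig:aStarStarb}; absolute timings vary by machine):

```python
import re
import time

# Python's built-in backtracking matcher on the "evil" pattern (a*)*b.
# The match must fail (the final b is missing), but the time to discover
# this roughly doubles with every extra 'a' in the input.
evil = re.compile(r'(a*)*b')

for n in [12, 14, 16]:
    s = 'a' * n
    start = time.perf_counter()
    m = evil.match(s)          # None: no 'b' anywhere in the string
    elapsed = time.perf_counter() - start
    print(n, m, f'{elapsed:.4f}s')
```

Increasing $n$ beyond about 30 makes the script effectively hang, matching the graphs below.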
|
|
This superlinear blowup in regular expression engines
has caused grief in ``real life'' where it is
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was
$O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.
|
\begin{figure}[p]
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={JavaScript},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Python},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Java 8},
legend pos=north west,
legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Dart},
legend pos=north west,
legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Swift},
legend pos=north west,
legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=true,
%xtick={0,5000,...,40000},
%xmax=40000,
%ymax=35,
restrict x to domain*=0:40000,
restrict y to domain*=0:35,
%ytick={0,5,...,30},
%scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Java9+},
legend pos=north west,
legend cell align=left]
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs}
\end{tabular}
\end{center}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
The reason for their superlinear behaviour is that they do a depth-first search
using NFAs.
If the string does not match, the regular expression matching
engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
|
A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited catastrophic backtracking
and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very rarely,
but occur often enough that they have a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\cite{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions
are a real problem rather than just ``a parlour trick''.
|
The work in this thesis aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness
and a finiteness property for the size of derivatives
(all carried out in Isabelle/HOL).
Such properties are an important step towards preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in Figure~\ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our
approach based on Brzozowski derivatives and formal proofs.
|
|
\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have clearly been studied for many, many years.
Theoretical results in automata theory state
that basic regular expression matching should be linear
with respect to the length of the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
r ::= c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*
\]
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions,
negations,
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at basic regular expressions
and bounded repetitions.
|
|
%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%has the back-references
Regexes come with a number of constructs
that make it more convenient for
programmers to write regular expressions.
Depending on the types of constructs,
the task of matching and lexing with them
will have different levels of complexity.
Some of those constructs are syntactic sugar, that is,
simply shorthand notations
that save the programmers a few keystrokes.
These will not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
(a | b | c) \stackrel{means}{=} ((a + b) + c)
\]
Similarly, the range operator
is just a concise way
of expressing an alternative of consecutive characters:
\[
[0-9] \stackrel{means}{=} (0 | 1 | \ldots | 9)
\]
The
wildcard character '$.$' matches any single character,
\[
. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]
except the newline.
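
This desugaring can be sanity-checked with Python's \texttt{re} module (illustration only):

```python
import re

# [0-9] and its desugared alternative (0|1|...|9) accept
# exactly the same single characters.
range_form = re.compile(r'[0-9]')
alt_form   = re.compile(r'(0|1|2|3|4|5|6|7|8|9)')

for c in '0597x ?a':
    assert bool(range_form.fullmatch(c)) == bool(alt_form.fullmatch(c))
print('range and alternative agree on all tested characters')
```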
|
\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can
make regular expressions much
more compact.
Normally there are four kinds of bounded repetitions:
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
Like the Kleene star, the set of strings or language
a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
\begin{tabular}{lcl}
$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
\end{tabular}
\end{center}
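
For a finite language these definitions are directly executable. The following Python sketch (helper names are ours, not from the formalisation) implements the power operation and the language of $r^{\{\ldots m\}}$:

```python
# L^n for a finite language L, following the definition above:
# L^0 = {""}, L^{i+1} = { s1 ++ s2 | s1 in L, s2 in L^i }.
def lang_pow(L, n):
    result = {''}
    for _ in range(n):
        result = {s1 + s2 for s1 in L for s2 in result}
    return result

# Language of r^{..m}: the union of L^i for 0 <= i <= m.
def lang_upto(L, m):
    return set().union(*(lang_pow(L, i) for i in range(m + 1)))

L = {'a', 'b'}
print(sorted(lang_pow(L, 2)))   # ['aa', 'ab', 'ba', 'bb']
print(len(lang_upto(L, 3)))     # 1 + 2 + 4 + 8 = 15
```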
|
The attraction of bounded repetitions is that they can be
used to avoid a size blowup: for example $r^{\{n\}}$
is a shorthand for
the much longer regular expression
\[
\underbrace{r\ldots r}_{\text{$n$ copies of $r$}}.
\]
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.
|
335 such bounded repetitions |
|
336 is that tools based on the classic notion of |
|
337 automata need to expand $r^{\{n\}}$ into $n$ connected |
|
338 copies of the automaton for $r$. This leads to very inefficient matching |
|
339 algorithms or algorithms that consume large amounts of memory. |
|
340 Implementations using $\DFA$s will |
|
341 in such situations |
|
342 either become excruciatingly slow |
|
343 (for example Verbatim++ \cite{Verbatimpp}) or run |
|
344 out of memory (for example $\mathit{LEX}$ and |
|
345 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators |
|
346 in C and JAVA that generate $\mathit{DFA}$-based |
|
347 lexers. The user provides a set of regular expressions |
|
348 and configurations, and then |
|
349 gets an output program encoding a minimized $\mathit{DFA}$ |
|
350 that can be compiled and run. |
|
351 When given the above countdown regular expression, |
|
352 a small $n$ (say 20) would result in a program representing a |
|
353 DFA |
|
354 with millions of states.}) for large counters. |
|
355 A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$ |
|
356 where the minimal DFA requires at least $2^{n+1}$ states. |
|
357 For example, when $n$ is equal to 2, |
|
358 the corresponding $\mathit{NFA}$ looks like: |
|
359 \vspace{6mm} |
|
360 \begin{center} |
|
361 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
|
362 \node[state,initial] (q_0) {$q_0$}; |
|
363 \node[state, red] (q_1) [right=of q_0] {$q_1$}; |
|
364 \node[state, red] (q_2) [right=of q_1] {$q_2$}; |
|
365 \node[state, accepting, red](q_3) [right=of q_2] {$q_3$}; |
|
366 \path[->] |
|
367 (q_0) edge node {a} (q_1) |
|
368 edge [loop below] node {a,b} () |
|
369 (q_1) edge node {a,b} (q_2) |
|
370 (q_2) edge node {a,b} (q_3); |
|
371 \end{tikzpicture} |
|
372 \end{center} |
|
373 and when turned into a DFA by the subset construction |
|
374 requires at least $2^3$ states.\footnote{The |
|
375 red states are "countdown states" which count down |
|
376 the number of characters needed in addition to the current |
|
377 string to make a successful match. |
|
378 For example, state $q_1$ indicates a match that has |
|
379 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, |
|
380 and just consumed the "delimiter" $a$ in the middle, and |
|
381 needs to match 2 more iterations of $(a|b)$ to complete. |
|
382 State $q_2$ on the other hand, can be viewed as a state |
|
383 after $q_1$ has consumed 1 character, and just waits |
|
384 for 1 more character to complete. |
|
385 The state $q_3$ is the last (accepting) state, requiring 0 |
|
386 more characters. |
|
387 Depending on the suffix of the |
|
388 input string up to the current read location, |
|
389 the states $q_1$ and $q_2$, $q_3$ |
|
390 may or may |
|
391 not be active. |
|
392 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would |
|
393 contain at least $2^3$ non-equivalent states that cannot be merged, |
|
394 because the subset construction during determinisation will generate |
|
395 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. |
|
396 Generalizing this to regular expressions with larger |
|
397 bounded repetitions number, we have that |
|
398 regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s |
|
399 would require at least $2^{n+1}$ states, if $r$ itself contains |
|
400 more than 1 string. |
|
401 This is to represent all different |
|
402 scenarios in which "countdown" states are active.} |
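
The $2^{n+1}$ lower bound can be checked mechanically. The Python sketch below (our reconstruction, not code from the literature) builds the NFA for $(a|b)^*a(a|b)^{\{n\}}$ and counts the subsets reached by the subset construction:

```python
# NFA for (a|b)*a(a|b)^{n}: states 0..n+1, where state 0 loops on a,b and
# moves to 1 on a; states 1..n step to the next state on a,b; n+1 accepts.
def nfa_step(states, char, n):
    nxt = set()
    for q in states:
        if q == 0:
            nxt.add(0)               # loop on (a|b)*
            if char == 'a':
                nxt.add(1)           # the "delimiter" a
        elif q <= n:
            nxt.add(q + 1)           # count down the n repetitions
    return frozenset(nxt)

def dfa_state_count(n):
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                       # explore all reachable subsets
        s = todo.pop()
        for c in 'ab':
            t = nfa_step(s, c, n)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

print([dfa_state_count(n) for n in [1, 2, 3, 4]])   # [4, 8, 16, 32]
```

The counts are exactly $2^{n+1}$, because membership of each of the states $1,\ldots,n+1$ is determined independently by the last $n+1$ characters read.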
|
|
Bounded repetitions are important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.},
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
more than half of the
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
close to ten million.
A smaller sample XSD they gave
is:
\lstset{
basicstyle=\fontsize{8.5}{9}\ttfamily,
language=XML,
morekeywords={encoding,
xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
}
\begin{lstlisting}
<sequence minOccurs="0" maxOccurs="65535">
 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{lstlisting}
This can be seen as the regex
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).
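
To see the regex shape in action, one can try it with Python's bounded repetitions (with $a$ and $b$ simplified to single characters; this simplification is ours):

```python
import re

# (a b^{2..12})^{0..65535} with a, b simplified to single characters.
pat = re.compile(r'(?:ab{2,12}){0,65535}')

assert pat.fullmatch('')                  # zero outer repetitions allowed
assert pat.fullmatch('abb' + 'abbbb')     # two outer repetitions
assert pat.fullmatch('ab') is None        # inner b{2,12} needs two or more b's
print('bounded repetitions behave as the XSD prescribes')
```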
|
It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
on bounded regular expressions:
for example, in the regular expression matching library of the Go
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000; and in the built-in Rust regular expression library,
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.
As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that such libraries simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states can
grow with the size of the counters.
When the counters are large,
the memory requirement can become
infeasible, and a regex engine
like Go's will reject such patterns straight away.
|
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]

    \node (q0) [state, initial] {$0$};
    \node (q1) [state, right = of q0] {$1$};
    \node (qdots) [right = of q1] {$\ldots$};
    \node (qn) [state, right = of qdots] {$n$};
    \node (qn1) [state, right = of qn] {$n+1$};
    \node (qn2) [state, right = of qn1] {$n+2$};
    \node (qn3) [state, accepting, right = of qn2] {$n+3$};

\path [-stealth, thick]
    (q0) edge [loop above] node {a} ()
    (q0) edge node {a} (q1)
    (q1) edge node {.} (qdots)
    (qdots) edge node {.} (qn)
    (qn) edge node {.} (qn1)
    (qn1) edge node {b} (qn2)
    (qn2) edge node {$c$} (qn3);
\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
showing that NFA simulation can consume large
amounts of memory: $.^*a.^{\{n\}}bc$ matching
strings of the form $aaa\ldots aaaabc$.
When traversing in a breadth-first manner,
all states from $0$ up to $n+1$ become active.}
\end{figure}
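
The breadth-first simulation in the figure can be sketched as follows (our reconstruction of the example, not Becchi and Crawley's code); on a long block of $a$'s all states from $0$ to $n+1$ are active at once:

```python
# Breadth-first simulation of the NFA for .*a.{n}bc from the figure:
# state 0 loops on any character and moves to 1 on 'a'; states 1..n
# accept any character (the .{n} part); n+1 expects 'b', n+2 expects 'c'.
def step(states, char, n):
    nxt = set()
    for q in states:
        if q == 0:
            nxt.add(0)                    # .* loops forever
            if char == 'a':
                nxt.add(1)
        elif q <= n:
            nxt.add(q + 1)                # .{n} accepts anything
        elif q == n + 1 and char == 'b':
            nxt.add(n + 2)
        elif q == n + 2 and char == 'c':
            nxt.add(n + 3)                # accepting state
    return nxt

n = 5
states = {0}
for char in 'a' * 20:                     # a long block of a's
    states = step(states, char, n)
print(sorted(states))                     # all of 0..n+1 are active
```

With a counter of, say, ten million, this active set alone is already prohibitively large.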
|
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabits of input per second
even with large counters \cite{Becchi08}.
These practical solutions do not come with
formal guarantees, however, and as pointed out by
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
|
In the work reported in \cite{FoSSaCS2023} and here,
we add better support using derivatives
for the bounded regular expression $r^{\{n\}}$.
Our results
extend straightforwardly to
repetitions with intervals such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only small
inductive datatypes and recursive functions,
making them handy to deal with in theorem provers.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
|
|
\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is to choose
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails.
This method, often called a ``depth-first search'',
is efficient in many cases, but can end up
with exponential runtime.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
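
The exponential behaviour of this strategy is easy to provoke with a deliberately naive backtracking matcher (a Python sketch of our own, written in continuation-passing style); counting recursive calls for $(a^*)^*\,b$ on strings of $a$'s shows the blowup without any timing:

```python
def backtrack_match(r, s, k, stats):
    # r: ('char', c) | ('seq', r1, r2) | ('star', r1); k is the continuation.
    stats['calls'] += 1
    kind = r[0]
    if kind == 'char':
        return bool(s) and s[0] == r[1] and k(s[1:])
    if kind == 'seq':
        return backtrack_match(
            r[1], s,
            lambda rest: backtrack_match(r[2], rest, k, stats), stats)
    if kind == 'star':
        # try one (progress-making) iteration first, then the empty match
        one = backtrack_match(
            r[1], s,
            lambda rest: len(rest) < len(s) and backtrack_match(r, rest, k, stats),
            stats)
        return one or k(s)
    raise ValueError(kind)

evil = ('seq', ('star', ('star', ('char', 'a'))), ('char', 'b'))

def calls_for(n):
    stats = {'calls': 0}
    ok = backtrack_match(evil, 'a' * n, lambda rest: rest == '', stats)
    assert not ok                     # the final b is missing
    return stats['calls']

print([calls_for(n) for n in (6, 8, 10, 12)])   # grows exponentially
```

The matcher explores every way of splitting the $a$'s between the two stars before admitting failure, exactly the behaviour behind the graphs in Figure~\ref{fig:aStarStarb}.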
|
|
Consider the following regular expression where the sequence
operator is omitted for brevity:
\begin{center}
	$r_1r_2r_3r_4$
\end{center}
In this example,
one could label sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions
$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' to by their respective numbers.
These sub-expressions are usually called ``capturing groups''.
To refer back to a group, one uses the syntax $\backslash i$
to denote that we want the sub-string
of the input matched by the $i$-th
sub-expression to appear again,
exactly as it first appeared:
\begin{center}
$\ldots\underset{\text{$i$-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
\end{center}
Once the sub-string $s_i$ for the sub-expression $r_i$
has been fixed, there is no variability in what the back-reference
label $\backslash i$ can match---it is tied to $s_i$.
The overall string will look like $\ldots s_i \ldots s_i \ldots$
|
|
A concrete example
of a back-reference is
\begin{center}
	$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from
repeating the sub-expression verbatim like
\begin{center}
	$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
Another example of a back-reference is
\begin{center}
	$(.)(.)\backslash 2\backslash 1$
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.
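
Both examples can be checked directly with Python's backtracking engine:

```python
import re

# (.*)\1 : strings made of two identical halves
assert re.fullmatch(r'(.*)\1', 'foofoo')
assert re.fullmatch(r'(.*)\1', 'ww')
assert re.fullmatch(r'(.*)\1', 'foobar') is None

# (.)(.)\2\1 : four-character palindromes
assert re.fullmatch(r'(.)(.)\2\1', 'abba')
assert re.fullmatch(r'(.)(.)\2\1', 'x??x')
assert re.fullmatch(r'(.)(.)\2\1', 'abab') is None

# (.*)(.*), by contrast, places no constraint on the second half
assert re.fullmatch(r'(.*)(.*)', 'foobar')
print('back-reference examples behave as described')
```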
|
639 |
|
640 Back-references are a regex construct |
|
641 that programmers find quite useful. |
|
642 According to Becchi and Crawley \cite{Becchi08}, |
|
643 6\% of Snort rules (up until 2008) use them. |
|
644 The most common use of back-references |
|
645 is to express well-formed html files, |
|
646 where back-references are convenient for matching |
|
647 opening and closing tags like |
|
648 \begin{center} |
|
649 $\langle html \rangle \ldots \langle / html \rangle$ |
|
650 \end{center} |
|
651 A regex describing such a format |
|
652 is |
|
653 \begin{center} |
|
654 $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$ |
|
655 \end{center} |
|
656 Despite being useful, the expressive power of regexes |
|
657 go beyond regular languages |
|
658 once back-references are included. |
|
659 In fact, they allow the regex construct to express |
|
660 languages that cannot be contained in context-free |
|
661 languages either. |
|
662 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$ |
|
663 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
|
664 which cannot be expressed by |
|
665 context-free grammars \cite{campeanu2003formal}. |
|
666 Such a language is contained in the context-sensitive hierarchy |
|
667 of formal languages. |
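
This language, too, can be demonstrated with Python's engine (illustration only):

```python
import re

# (a*)b\1b\1 recognises { a^n b a^n b a^n | n >= 0 },
# a language that is not context-free.
pat = re.compile(r'(a*)b\1b\1')

assert pat.fullmatch('aabaabaa')          # n = 2
assert pat.fullmatch('bb')                # n = 0
assert pat.fullmatch('aabaaba') is None   # third block of a's too short
print("the back-reference enforces equal numbers of a's")
```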
|
Moreover, solving the matching problem involving back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
revert to a depth-first search algorithm with backtracking.
What is unexpected is that even in cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in Figure~\ref{fig:aStarStarb}.
|
Summing up, we can categorise existing
practical regex libraries into two kinds:
(i) those with linear
time guarantees like Go's and Rust's, whose downside is that
they impose restrictions
on the regular expressions (not allowing back-references,
bounded repetitions cannot exceed an ad hoc limit, etc.);
and (ii) those
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
The latter can potentially ``grind to a halt''
on some very simple cases, resulting
in ReDoS attacks if exposed to the internet.

The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.
|
|
|
\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is widely adopted in many regular expression matchers
because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
	$(aba + ab + a)^*$ \quad and the string \quad $ababa$
\end{center}
The correct POSIX match for the above
involves the entire string $ababa$,
split into two Kleene star iterations, namely $[ab]$ and $[aba]$ at positions
$[0, 2)$ and $[2, 5)$
respectively.
|
But feeding this example to the different engines
listed at regex101\footnote{
regex101 is an online regular expression matcher which
provides an API for trying out the regular expression
engines of multiple popular programming languages such as
Java, Python, Go, etc.} \parencite{regex101}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
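This behaviour is easy to reproduce. The following small Python snippet (using \texttt{+} rather than \texttt{*} to avoid zero-length matches) shows that Python's backtracking engine, one of the engines listed at regex101, splits the string in exactly this non-POSIX way:

```python
import re

# Python's re module is greedy rather than POSIX: in each Kleene-star
# iteration it commits to the first alternative that succeeds, instead of
# maximising the overall match length.
matches = re.findall(r'(?:aba|ab|a)+', 'ababa')

# The first match greedily consumes "aba" at [0, 3); the remaining "ba"
# cannot start another iteration, so scanning resumes and finds "a" at [4, 5).
print(matches)  # ['aba', 'a']
# A POSIX-compliant matcher would instead return the single match "ababa".
```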
|
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries are not
correctly implementing the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}\it
``The POSIX strategy is more complicated than the
greedy because of the dependence on information about
the length of matched strings in the various subexpressions.''
\end{quote}
|
%\noindent
People have recognised that the implementation complexity of the POSIX rules also comes from
the specification not being very precise.
The developers of the regexp package of Go
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
``The POSIX rule is computationally prohibitive
and not even well-defined.''
\end{quote}
|
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as follows (section 1, last paragraph):
\begin{quote}
\begin{itemize}
\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there
\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
\item
REs have right associative concatenation which can be changed with parenthesis\\
\item
parenthesized subexpressions return the match from their last usage\\
\item
text of component subexpressions must be contained in the text of the
higher-level subexpressions\\
\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
|
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
Ribeiro and Du Bois \cite{RibeiroAgda2017} have
formalised the notion of bit-coded regular expressions
and proved their relation to simple regular expressions in
the dependently typed proof assistant Agda.
They also proved the soundness and completeness of a matching algorithm
based on the bit-coded regular expressions.
Ausaf et al.\ \cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and also to prove
that their specification coincides with an earlier (unformalised)
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu needs to be substantially
revised in order for the correctness proof to go through.
Their original specification and proof were unfixable
according to Ausaf et al.
|


In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, on which the main part of this thesis builds.
%We give a taste of what they
%are like and why they are suitable for regular expression
%matching and lexing.
|
\section{Formal Specification of POSIX Matching
and Brzozowski Derivatives}
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Brzozowski \cite{Brzozowski1964} first introduced the
concept of a \emph{derivative} of a regular expression in 1964.
The derivative of a regular expression $r$
with respect to a character $c$ is written as $r \backslash c$.
This operation tells us what $r$ transforms into
if we ``chop off'' a particular character $c$
from all strings in the language of $r$ (defined
later as $L \; r$).
|
%To give a flavour of Brzozowski derivatives, we present
%two straightforward clauses from it:
%\begin{center}
% \begin{tabular}{lcl}
% $d \backslash c$ & $\dn$ &
% $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
%$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
% \end{tabular}
%\end{center}
%\noindent
%The first clause says that for the regular expression
%denoting a singleton set consisting of a single-character string $\{ d \}$,
%we check the derivative character $c$ against $d$,
%returning a set containing only the empty string $\{ [] \}$
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
%The second clause states that to obtain the regular expression
%representing all strings' head character $c$ being chopped off
%from $r_1 + r_2$, one simply needs to recursively take derivative
%of $r_1$ and $r_2$ and then put them together.
|
Derivatives have the property
that $s \in L \; (r\backslash c)$ if and only if
$c::s \in L \; r$, where $::$ stands for list prepending.
%This property can be used on regular expressions
%matching and lexing--to test whether a string $s$ is in $L \; r$,
%one simply takes derivatives of $r$ successively with
%respect to the characters (in the correct order) in $s$,
%and then test whether the empty string is in the last regular expression.
With this property, derivatives can give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
%This makes formally reasoning about these properties such
%as correctness and complexity smooth and intuitive.
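This matching procedure can be written down in a few lines of functional-style code. The following Python sketch uses a tuple encoding of regular expressions of our own choosing; it illustrates the technique rather than any particular implementation from the literature:

```python
# Regular expressions as tuples: ('ZERO',), ('ONE',), ('CHAR', c),
# ('ALT', r1, r2), ('SEQ', r1, r2), ('STAR', r)

def nullable(r):
    """Does r match the empty string?"""
    t = r[0]
    if t in ('ZERO', 'CHAR'): return False
    if t in ('ONE', 'STAR'):  return True
    if t == 'ALT':            return nullable(r[1]) or nullable(r[2])
    if t == 'SEQ':            return nullable(r[1]) and nullable(r[2])

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    t = r[0]
    if t in ('ZERO', 'ONE'): return ('ZERO',)
    if t == 'CHAR':          return ('ONE',) if r[1] == c else ('ZERO',)
    if t == 'ALT':           return ('ALT', der(c, r[1]), der(c, r[2]))
    if t == 'SEQ':
        left = ('SEQ', der(c, r[1]), r[2])
        return ('ALT', left, der(c, r[2])) if nullable(r[1]) else left
    if t == 'STAR':          return ('SEQ', der(c, r[1]), r)

def matches(r, s):
    """r matches s iff the iterated derivative is nullable."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For example, with $(a+aa)^*$ encoded as `('STAR', ('ALT', ('CHAR', 'a'), ('SEQ', ('CHAR', 'a'), ('CHAR', 'a'))))`, the call `matches(r, 'aaa')` evaluates to `True`.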
|
There are several mechanised proofs of this property in various theorem
provers,
for example one by Owens and Slind \cite{Owens2008} in HOL4,
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
yet another in Coq by Coquand and Siles \cite{Coquand2012}.

|
In addition, one can extend derivatives to bounded repetitions
relatively straightforwardly. For example, the derivative of
$r^{\{n\}}$ can be defined as:
\begin{center}
	\begin{tabular}{lcl}
		$r^{\{n\}} \backslash c$ & $\dn$ & $(r \backslash c) \cdot
		r^{\{n-1\}} \quad (\text{when } n > 0)$\\
	\end{tabular}
\end{center}
|
\noindent
Experimental results suggest that unlike DFA-based solutions
for bounded regular expressions,
derivatives can cope with
large counters
quite well.
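The bounded-repetition clause slots directly into a derivative-based matcher. Below is a minimal sketch, again with an illustrative tuple encoding of our own choosing, where `('NTIMES', r, n)` stands for $r^{\{n\}}$; the counter is simply decremented when a derivative is taken, so no automaton with $n$ states is built up front:

```python
# Regexes as tuples; ('NTIMES', r, n) stands for the bounded repetition r^{n}.

def nullable(r):
    t = r[0]
    if t in ('ZERO', 'CHAR'): return False
    if t == 'ONE':            return True
    if t == 'ALT':            return nullable(r[1]) or nullable(r[2])
    if t == 'SEQ':            return nullable(r[1]) and nullable(r[2])
    if t == 'NTIMES':         return r[2] == 0 or nullable(r[1])

def der(c, r):
    t = r[0]
    if t in ('ZERO', 'ONE'): return ('ZERO',)
    if t == 'CHAR':          return ('ONE',) if r[1] == c else ('ZERO',)
    if t == 'ALT':           return ('ALT', der(c, r[1]), der(c, r[2]))
    if t == 'SEQ':
        left = ('SEQ', der(c, r[1]), r[2])
        return ('ALT', left, der(c, r[2])) if nullable(r[1]) else left
    if t == 'NTIMES':        # the clause from the text: r^{n} \ c = (r\c) . r^{n-1}
        if r[2] == 0: return ('ZERO',)
        return ('SEQ', der(c, r[1]), ('NTIMES', r[1], r[2] - 1))

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For instance, `matches(('NTIMES', ('CHAR', 'a'), 50), 'a' * 50)` evaluates to `True`, while strings of 49 or 51 $a$'s are rejected. (Without simplification the derivatives still accumulate dead $\ZERO$ branches, a point we return to below.)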
|

There have also been
extensions of derivatives to other regex constructs.
For example, Owens et al.\ include the derivative
for the \emph{NOT} regular expression, which is
able to concisely express C-style comments of the form
$/* \ldots */$ (see \cite{Owens2008}).
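The complement construct is a one-line extension of a derivative matcher: the derivative goes under the negation and nullability flips. A sketch, with an illustrative tuple encoding of our own choosing, where `('NOT', r)` matches exactly the strings that `r` does not match:

```python
def nullable(r):
    t = r[0]
    if t in ('ZERO', 'CHAR'): return False
    if t in ('ONE', 'STAR'):  return True
    if t == 'ALT':            return nullable(r[1]) or nullable(r[2])
    if t == 'SEQ':            return nullable(r[1]) and nullable(r[2])
    if t == 'NOT':            return not nullable(r[1])   # nullability flips

def der(c, r):
    t = r[0]
    if t in ('ZERO', 'ONE'): return ('ZERO',)
    if t == 'CHAR':          return ('ONE',) if r[1] == c else ('ZERO',)
    if t == 'ALT':           return ('ALT', der(c, r[1]), der(c, r[2]))
    if t == 'SEQ':
        left = ('SEQ', der(c, r[1]), r[2])
        return ('ALT', left, der(c, r[2])) if nullable(r[1]) else left
    if t == 'STAR':          return ('SEQ', der(c, r[1]), r)
    if t == 'NOT':           return ('NOT', der(c, r[1]))  # derivative goes under NOT

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For example, `('NOT', ('STAR', ('CHAR', 'a')))` matches any string that is not a run of $a$'s: it accepts `'ab'` but rejects `'aa'` and the empty string.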
|
Another extension for derivatives is
regular expressions with look-aheads, done
by Miyazaki and Minamide
\cite{Takayuki2019}.
%We therefore use Brzozowski derivatives on regular expressions
%lexing

|
Given the above definitions and properties of
Brzozowski derivatives, one quickly realises their potential
for generating a formally verified lexing algorithm: the clauses and the property
can be expressed in a functional programming language
or converted to theorem prover code with great ease.
Perhaps this is the reason why derivatives have sparked quite a bit of interest
in the functional programming and theorem prover communities in the last
fifteen or so years
(\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
\cite{Chen12} and \cite{Coquand2012}
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
after they were first published by Brzozowski.

|
However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).
Second, derivative-based matchers have a problem with the sizes
of derivatives, which can grow very quickly.
Elegant and beautiful
as many implementations are,
they can be excruciatingly slow.
For example, Sulzmann and Lu
claim a linear running time for their proposed algorithm,
but that claim was falsified by our experiments. The running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$
is made for the Verbatim \cite{Verbatim}
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalised the correctness of the lexer, but not their complexity result.
In the performance evaluation section, they analysed the run time
of matching $a$ with the string
\begin{center}
	$\underbrace{a \ldots a}_{\text{$n$ $a$'s}}$.
\end{center}
\noindent
They concluded that the algorithm is quadratic in terms of
the length of the input string.
When we tried out their extracted OCaml code with the example $(a+aa)^*$,
the time it took to match a string of 40 $a$'s was approximately 5 minutes.

|
\subsection{Sulzmann and Lu's Algorithm}
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
problem with the yes/no answer
by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function. This injection function in a sense undoes the ``damage''
of the derivatives chopping off characters.
In earlier work, Ausaf et al.\ provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of this extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.
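The two phases can be sketched concretely. The following Python transcription of Sulzmann and Lu's algorithm (the tuple encoding of regular expressions and values is our own illustrative choice) first builds derivatives, then uses `mkeps` to construct a value for the nullable final derivative, and finally injects the chopped-off characters back in reverse order:

```python
# Regexes: ('ZERO',), ('ONE',), ('CHAR', c), ('ALT', r1, r2),
#          ('SEQ', r1, r2), ('STAR', r)
# Values:  ('Empty',), ('Chr', c), ('Left', v), ('Right', v),
#          ('Sequ', v1, v2), ('Stars', [v, ...])

def nullable(r):
    t = r[0]
    if t in ('ZERO', 'CHAR'): return False
    if t in ('ONE', 'STAR'):  return True
    if t == 'ALT': return nullable(r[1]) or nullable(r[2])
    if t == 'SEQ': return nullable(r[1]) and nullable(r[2])

def der(c, r):
    t = r[0]
    if t in ('ZERO', 'ONE'): return ('ZERO',)
    if t == 'CHAR': return ('ONE',) if r[1] == c else ('ZERO',)
    if t == 'ALT':  return ('ALT', der(c, r[1]), der(c, r[2]))
    if t == 'SEQ':
        left = ('SEQ', der(c, r[1]), r[2])
        return ('ALT', left, der(c, r[2])) if nullable(r[1]) else left
    if t == 'STAR': return ('SEQ', der(c, r[1]), r)

def mkeps(r):
    """A POSIX value witnessing that a nullable r matches the empty string."""
    t = r[0]
    if t == 'ONE':  return ('Empty',)
    if t == 'ALT':
        return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))
    if t == 'SEQ':  return ('Sequ', mkeps(r[1]), mkeps(r[2]))
    if t == 'STAR': return ('Stars', [])

def inj(r, c, v):
    """Undo one derivative step: turn a value for r\\c into a value for r."""
    t = r[0]
    if t == 'CHAR': return ('Chr', r[1])
    if t == 'ALT':
        if v[0] == 'Left':  return ('Left', inj(r[1], c, v[1]))
        else:               return ('Right', inj(r[2], c, v[1]))
    if t == 'SEQ':
        if v[0] == 'Sequ':  return ('Sequ', inj(r[1], c, v[1]), v[2])
        if v[0] == 'Left':  # v = Left(Sequ v1 v2): c went into r1
            return ('Sequ', inj(r[1], c, v[1][1]), v[1][2])
        if v[0] == 'Right': # c went into r2; r1 matched the empty string
            return ('Sequ', mkeps(r[1]), inj(r[2], c, v[1]))
    if t == 'STAR':         # v = Sequ(v1, Stars vs)
        return ('Stars', [inj(r[1], c, v[1])] + v[2][1])

def lexer(r, s):
    """POSIX value for r matching s, or None if there is no match."""
    if not s:
        return mkeps(r) if nullable(r) else None
    v = lexer(der(s[0], r), s[1:])
    return None if v is None else inj(r, s[0], v)

def flat(v):
    """The string a value stands for."""
    t = v[0]
    if t == 'Empty': return ''
    if t == 'Chr':   return v[1]
    if t in ('Left', 'Right'): return flat(v[1])
    if t == 'Sequ':  return flat(v[1]) + flat(v[2])
    if t == 'Stars': return ''.join(flat(u) for u in v[1])
```

On the earlier example $(aba + ab + a)^*$ and the string $ababa$, this computes a star value whose iterations flatten to $[ab]$ and $[aba]$, the POSIX split described above.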
|

The version of the algorithm proven correct
suffers, however, heavily from the
second difficulty, namely that derivatives can
grow to arbitrarily big sizes.
For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

|
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

|
\noindent where after around 35 steps we usually run out of memory on a
typical computer. Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in
the work by Ausaf et al.\ to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.
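Both phenomena can be observed directly. The sketch below (tuple encoding ours, for illustration) measures the sizes of successive derivatives of $(a+aa)^*$ with and without the simple-minded rules: the unsimplified sizes explode, and even with the rules applied after every step the sizes keep growing:

```python
# Regexes as tuples: ('ZERO',), ('ONE',), ('CHAR', c), ('ALT', r1, r2),
# ('SEQ', r1, r2), ('STAR', r)

def nullable(r):
    t = r[0]
    if t in ('ZERO', 'CHAR'): return False
    if t in ('ONE', 'STAR'):  return True
    if t == 'ALT': return nullable(r[1]) or nullable(r[2])
    if t == 'SEQ': return nullable(r[1]) and nullable(r[2])

def der(c, r):
    t = r[0]
    if t in ('ZERO', 'ONE'): return ('ZERO',)
    if t == 'CHAR': return ('ONE',) if r[1] == c else ('ZERO',)
    if t == 'ALT':  return ('ALT', der(c, r[1]), der(c, r[2]))
    if t == 'SEQ':
        left = ('SEQ', der(c, r[1]), r[2])
        return ('ALT', left, der(c, r[2])) if nullable(r[1]) else left
    if t == 'STAR': return ('SEQ', der(c, r[1]), r)

def size(r):
    """Number of nodes in the regular expression tree."""
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def simp(r):
    """The simple-minded rules: 0.r => 0, 1.r => r, 0+r => r, r+r => r."""
    t = r[0]
    if t == 'SEQ':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ('ZERO',) or r2 == ('ZERO',): return ('ZERO',)
        if r1 == ('ONE',): return r2
        if r2 == ('ONE',): return r1
        return ('SEQ', r1, r2)
    if t == 'ALT':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ('ZERO',): return r2
        if r2 == ('ZERO',): return r1
        if r1 == r2: return r1
        return ('ALT', r1, r2)
    return r

a  = ('CHAR', 'a')
r0 = ('STAR', ('ALT', a, ('SEQ', a, a)))   # (a + aa)*

plain, simplified = r0, r0
plain_sizes, simp_sizes = [], []
for _ in range(12):
    plain      = der('a', plain)
    simplified = simp(der('a', simplified))
    plain_sizes.append(size(plain))
    simp_sizes.append(size(simplified))

# plain_sizes grows roughly exponentially; simp_sizes grows more slowly,
# but is still strictly increasing with no finite bound in sight.
print(plain_sizes)
print(simp_sizes)
```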
|

Therefore we want to look in this thesis at a second
algorithm by Sulzmann and Lu with which they
overcame this ``growth problem'' \cite{Sulzmann2014}.
In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):

|
\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method [..] in combination with the simplification steps [..]
yields POSIX parse trees. We have tested this claim
extensively [..] but yet
have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
Ausaf and Urban made some initial progress towards the
full correctness proof but still had to leave out the optimisation
Sulzmann and Lu proposed.
Ausaf wrote \cite{Ausaf}:
\begin{quote}\it
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
\end{quote}
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
together with a formal proof of the correctness of those simplifications.
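To give a flavour of the bit-coded representation: a POSIX value can be serialised into a bit sequence that records only the disambiguation choices (which alternative was taken, how many star iterations there are), and can be decoded back with the regular expression as a guide. The sketch below follows the spirit of Sulzmann and Lu's coding scheme; the concrete 0/1 bits and the tuple representation are our own illustrative choices:

```python
# Values: ('Empty',), ('Chr', c), ('Left', v), ('Right', v),
#         ('Sequ', v1, v2), ('Stars', [v, ...])
# Bits: 0 marks a Left / a further star iteration, 1 a Right / end of star.

def code(v):
    """Serialise a value; characters are not stored -- the regex has them."""
    t = v[0]
    if t in ('Empty', 'Chr'): return []
    if t == 'Left':  return [0] + code(v[1])
    if t == 'Right': return [1] + code(v[1])
    if t == 'Sequ':  return code(v[1]) + code(v[2])
    if t == 'Stars':
        bs = []
        for u in v[1]:
            bs += [0] + code(u)
        return bs + [1]

def decode_aux(r, bs):
    """Rebuild a value for r from bits, returning the leftover bits."""
    t = r[0]
    if t == 'ONE':  return ('Empty',), bs
    if t == 'CHAR': return ('Chr', r[1]), bs
    if t == 'ALT':
        if bs[0] == 0:
            v, rest = decode_aux(r[1], bs[1:]); return ('Left', v), rest
        v, rest = decode_aux(r[2], bs[1:]); return ('Right', v), rest
    if t == 'SEQ':
        v1, bs1 = decode_aux(r[1], bs)
        v2, bs2 = decode_aux(r[2], bs1)
        return ('Sequ', v1, v2), bs2
    if t == 'STAR':
        vs = []
        while bs[0] == 0:            # a 0-bit announces one more iteration
            v, bs = decode_aux(r[1], bs[1:])
            vs.append(v)
        return ('Stars', vs), bs[1:]

def decode(r, bs):
    v, rest = decode_aux(r, bs)
    return v if rest == [] else None
```

For the POSIX value of $(aba + ab + a)^*$ matching $ababa$ (iterations $[ab]$ and $[aba]$), the entire value compresses to the seven bits `[0,0,1,0,0,0,1]`, and decoding against the regular expression restores it exactly. This compactness is what makes the ``aggressive'' simplification of annotated regular expressions possible.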
|


One of the most recent works in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work for us and we will compare it later with
the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
but deterministic finite automata instead.
We will also study this work in a later section.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.


\section{Basic Concepts}
Formal language theory usually starts with an alphabet
denoting a set of characters.
Here we use the datatype of characters from Isabelle,