\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
% used for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed.
We start with a technical overview of the catastrophic backtracking problem,
motivating rigorous approaches to regular expression matching and lexing.
Then we introduce basic notations and definitions of our problem.

\section{Technical Overview}
15 Consider for example the regular expression $(a^*)^*\,b$ and |
|
16 strings of the form $aa..a$. These strings cannot be matched by this regular |
|
17 expression: obviously the expected $b$ in the last |
|
18 position is missing. One would assume that modern regular expression |
|
19 matching engines can find this out very quickly. Surprisingly, if one tries |
|
20 this example in JavaScript, Python or Java 8, even with small strings, |
|
say of length around 30 $a$'s,
|
22 the decision takes large amounts of time to finish. |
|
This is disproportionate
to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}).
|
The algorithms clearly show exponential behaviour, which as can be seen
is triggered by some relatively simple regular expressions.
|
Java 9 and newer
versions mitigate this behaviour by several orders of magnitude,
but they are still slow compared
|
30 with the approach we are going to use in this thesis. |
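The exponential growth can be made concrete with a short counting argument: every way of chopping $a^n$ into non-empty chunks, each consumed by one iteration of the inner $a^*$ of $(a^*)^*$, is a path a depth-first engine may explore before concluding that the trailing $b$ is missing. The following Python sketch (our own illustration, not the code of any of the engines; the name \texttt{splits} is made up) counts these paths:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def splits(n: int) -> int:
    """Number of ways (a*)* can consume n a's as a sequence of
    non-empty iterations of the inner a* (compositions of n)."""
    if n == 0:
        return 1  # exactly one way: use no iterations at all
    return sum(splits(n - k) for k in range(1, n + 1))

# the number of paths doubles with every extra character: 2^(n-1)
assert [splits(n) for n in range(1, 6)] == [1, 2, 4, 8, 16]
```

At $n = 30$ this already amounts to $2^{29}$, roughly half a billion paths, which is consistent with the runtimes shown in figure \ref{fig:aStarStarb}.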
|
31 |
|
32 |
|
33 |
|
34 This superlinear blowup in regular expression engines |
|
35 has caused grief in ``real life'' where it is |
|
36 given the name ``catastrophic backtracking'' or ``evil'' regular expressions. |
|
37 For example, on 20 July 2016 one evil |
|
38 regular expression brought the webpage |
|
39 \href{http://stackexchange.com}{Stack Exchange} to its |
|
40 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |
|
41 In this instance, a regular expression intended to just trim white |
|
42 spaces from the beginning and the end of a line actually consumed |
|
43 massive amounts of CPU resources---causing the web servers to grind to a |
|
44 halt. In this example, the time needed to process |
|
45 the string was |
|
46 $O(n^2)$ with respect to the string length $n$. This |
|
47 quadratic overhead was enough for the homepage of Stack Exchange to |
|
48 respond so slowly that the load balancer assumed a $\mathit{DoS}$ |
|
49 attack and therefore stopped the servers from responding to any |
|
requests. This made the whole site unavailable.
|
51 |
|
52 \begin{figure}[p] |
|
53 \begin{center} |
|
54 \begin{tabular}{@{}c@{\hspace{0mm}}c@{}} |
|
55 \begin{tikzpicture} |
|
56 \begin{axis}[ |
|
57 xlabel={$n$}, |
|
58 x label style={at={(1.05,-0.05)}}, |
|
59 ylabel={time in secs}, |
|
60 enlargelimits=false, |
|
61 xtick={0,5,...,30}, |
|
62 xmax=33, |
|
63 ymax=35, |
|
64 ytick={0,5,...,30}, |
|
65 scaled ticks=false, |
|
66 axis lines=left, |
|
67 width=5cm, |
|
68 height=4cm, |
|
69 legend entries={JavaScript}, |
|
70 legend pos=north west, |
|
71 legend cell align=left] |
|
72 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
|
73 \end{axis} |
|
74 \end{tikzpicture} |
|
75 & |
|
76 \begin{tikzpicture} |
|
77 \begin{axis}[ |
|
78 xlabel={$n$}, |
|
79 x label style={at={(1.05,-0.05)}}, |
|
80 %ylabel={time in secs}, |
|
81 enlargelimits=false, |
|
82 xtick={0,5,...,30}, |
|
83 xmax=33, |
|
84 ymax=35, |
|
85 ytick={0,5,...,30}, |
|
86 scaled ticks=false, |
|
87 axis lines=left, |
|
88 width=5cm, |
|
89 height=4cm, |
|
90 legend entries={Python}, |
|
91 legend pos=north west, |
|
92 legend cell align=left] |
|
93 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
94 \end{axis} |
|
95 \end{tikzpicture}\\ |
|
96 \begin{tikzpicture} |
|
97 \begin{axis}[ |
|
98 xlabel={$n$}, |
|
99 x label style={at={(1.05,-0.05)}}, |
|
100 ylabel={time in secs}, |
|
101 enlargelimits=false, |
|
102 xtick={0,5,...,30}, |
|
103 xmax=33, |
|
104 ymax=35, |
|
105 ytick={0,5,...,30}, |
|
106 scaled ticks=false, |
|
107 axis lines=left, |
|
108 width=5cm, |
|
109 height=4cm, |
|
110 legend entries={Java 8}, |
|
111 legend pos=north west, |
|
112 legend cell align=left] |
|
113 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
114 \end{axis} |
|
115 \end{tikzpicture} |
|
116 & |
|
117 \begin{tikzpicture} |
|
118 \begin{axis}[ |
|
119 xlabel={$n$}, |
|
120 x label style={at={(1.05,-0.05)}}, |
|
121 %ylabel={time in secs}, |
|
122 enlargelimits=false, |
|
123 xtick={0,5,...,30}, |
|
124 xmax=33, |
|
125 ymax=35, |
|
126 ytick={0,5,...,30}, |
|
127 scaled ticks=false, |
|
128 axis lines=left, |
|
129 width=5cm, |
|
130 height=4cm, |
|
131 legend entries={Dart}, |
|
132 legend pos=north west, |
|
133 legend cell align=left] |
|
134 \addplot[green,mark=*, mark options={fill=white}] table {re-dart.data}; |
|
135 \end{axis} |
|
136 \end{tikzpicture}\\ |
|
137 \begin{tikzpicture} |
|
138 \begin{axis}[ |
|
139 xlabel={$n$}, |
|
140 x label style={at={(1.05,-0.05)}}, |
|
141 ylabel={time in secs}, |
|
142 enlargelimits=false, |
|
143 xtick={0,5,...,30}, |
|
144 xmax=33, |
|
145 ymax=35, |
|
146 ytick={0,5,...,30}, |
|
147 scaled ticks=false, |
|
148 axis lines=left, |
|
149 width=5cm, |
|
150 height=4cm, |
|
151 legend entries={Swift}, |
|
152 legend pos=north west, |
|
153 legend cell align=left] |
|
154 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data}; |
|
155 \end{axis} |
|
156 \end{tikzpicture} |
|
157 & |
|
158 \begin{tikzpicture} |
|
159 \begin{axis}[ |
|
160 xlabel={$n$}, |
|
161 x label style={at={(1.05,-0.05)}}, |
|
162 %ylabel={time in secs}, |
|
163 enlargelimits=true, |
|
164 %xtick={0,5000,...,40000}, |
|
165 %xmax=40000, |
|
166 %ymax=35, |
|
167 restrict x to domain*=0:40000, |
|
168 restrict y to domain*=0:35, |
|
169 %ytick={0,5,...,30}, |
|
170 %scaled ticks=false, |
|
171 axis lines=left, |
|
172 width=5cm, |
|
173 height=4cm, |
|
174 legend entries={Java9+}, |
|
175 legend pos=north west, |
|
176 legend cell align=left] |
|
177 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; |
|
178 \end{axis} |
|
179 \end{tikzpicture}\\ |
|
180 \begin{tikzpicture} |
|
181 \begin{axis}[ |
|
182 xlabel={$n$}, |
|
183 x label style={at={(1.05,-0.05)}}, |
|
184 ylabel={time in secs}, |
|
185 enlargelimits=true, |
|
186 %xtick={0,5,...,30}, |
|
187 %xmax=33, |
|
188 %ymax=35, |
|
189 restrict x to domain*=0:60000, |
|
190 restrict y to domain*=0:35, |
|
191 %ytick={0,5,...,30}, |
|
192 %scaled ticks=false, |
|
193 axis lines=left, |
|
194 width=5cm, |
|
195 height=4cm, |
|
196 legend entries={Scala}, |
|
197 legend pos=north west, |
|
198 legend cell align=left] |
|
199 \addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data}; |
|
200 \end{axis} |
|
201 \end{tikzpicture} |
|
202 & |
|
203 \begin{tikzpicture} |
|
204 \begin{axis}[ |
|
205 xlabel={$n$}, |
|
206 x label style={at={(1.05,-0.05)}}, |
|
207 %ylabel={time in secs}, |
|
208 enlargelimits=true, |
|
209 %xtick={0,5000,...,40000}, |
|
210 %xmax=40000, |
|
211 %ymax=35, |
|
212 restrict x to domain*=0:60000, |
|
213 restrict y to domain*=0:45, |
|
214 %ytick={0,5,...,30}, |
|
215 %scaled ticks=false, |
|
216 axis lines=left, |
|
217 width=5cm, |
|
218 height=4cm, |
|
219 legend style={cells={align=left}}, |
|
220 legend entries={Isabelle \\ Extracted}, |
|
221 legend pos=north west, |
|
222 legend cell align=left] |
|
223 \addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data}; |
|
224 \end{axis} |
|
225 \end{tikzpicture}\\ |
|
226 \multicolumn{2}{c}{Graphs} |
|
227 \end{tabular} |
|
228 \end{center} |
|
229 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
|
230 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
|
231 The reason for their fast growth %superlinear behaviour |
|
232 is that they do a depth-first-search |
|
233 using NFAs. |
|
234 If the string does not match, the regular expression matching |
|
235 engine starts to explore all possibilities. |
|
236 The last two graphs are for our implementation in Scala, one manual |
|
237 and one extracted from the verified lexer in Isabelle by $\textit{codegen}$. |
|
238 Our lexer |
|
239 performs better in this case, and |
|
240 is formally verified. |
|
241 Despite being almost identical, the codegen-generated lexer |
|
242 % is generated directly from Isabelle using $\textit{codegen}$, |
|
243 is slower than the manually written version since the code synthesised by |
|
244 $\textit{codegen}$ does not use native integer or string |
|
245 types of the target language. |
|
246 %Note that we are comparing our \emph{lexer} against other languages' matchers. |
|
247 }\label{fig:aStarStarb} |
|
248 \end{figure}\afterpage{\clearpage} |
|
249 |
|
250 A more recent example is a global outage of all Cloudflare servers on 2 July |
|
251 2019. A poorly written regular expression exhibited catastrophic backtracking |
|
252 and exhausted CPUs that serve HTTP traffic. Although the outage |
|
253 had several causes, at the heart was a regular expression that |
|
254 was used to monitor network |
|
255 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |
|
256 These problems with regular expressions |
|
257 are not isolated events that happen |
|
258 very rarely, |
|
259 %but actually widespread. |
|
260 %They occur so often that they have a |
|
but they actually occur often enough that they have a
|
262 name: Regular-Expression-Denial-Of-Service (ReDoS) |
|
263 attacks. |
|
264 Davis et al. \cite{Davis18} detected more |
|
265 than 1000 evil regular expressions |
|
266 in Node.js, Python core libraries, npm and pypi. |
|
267 They therefore concluded that evil regular expressions |
|
are a real problem rather than just ``a parlour trick''.
|
269 |
|
270 The work in this thesis aims to address this issue |
|
271 with the help of formal proofs. |
|
272 We describe a lexing algorithm based |
|
273 on Brzozowski derivatives with verified correctness |
|
274 and a finiteness property for the size of derivatives |
|
275 (which are all done in Isabelle/HOL). |
|
276 Such properties %guarantee the absence of |
|
277 are an important step in preventing |
|
278 catastrophic backtracking once and for all. |
|
279 We will give more details in the next sections |
|
on (i) why the slow cases in figure \ref{fig:aStarStarb}
|
281 can occur in traditional regular expression engines |
|
282 and (ii) why we choose our |
|
283 approach based on Brzozowski derivatives and formal proofs. |
|
284 |
|
285 |
|
286 |
|
287 |
|
This is a preliminary chapter which describes the results of
Sulzmann and Lu \cite{Sulzmann2014} and part of the formalisations by
Ausaf et al.\ \cite{AusafDyckhoffUrban2016} (mainly the definitions).
These results are not part of this PhD work,
but the details are included to provide necessary context
for our work on the correctness proof of $\blexersimp$,
as we show in chapter \ref{Bitcoded2} how the proofs break down when
simplifications are applied.

In the coming section, the definitions of basic notions
for regular languages and regular expressions are given.
This is essentially a description in ``English'' of
the functions and datatypes used by Ausaf et al.\
\cite{AusafDyckhoffUrban2016} \cite{Ausaf}
in their formalisation in Isabelle/HOL.
We include them as we build on their formalisation,
and therefore inherently use these definitions.
We also define what $\POSIX$ lexing means,
followed by the first lexing algorithm by Sulzmann and Lu \parencite{Sulzmann2014}
that produces the output conforming
to the $\POSIX$ standard.

In what follows
we choose to use the Isabelle-style notation
for function and datatype constructor applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g.\ $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.

\section{Preliminaries}
315 Regular expressions and regular expression matchers |
|
have been studied for many decades.
|
317 Theoretical results in automata theory state |
|
318 that basic regular expression matching should be linear |
|
w.r.t.\ the length of the input.
|
320 This assumes that the regular expression |
|
321 $r$ was pre-processed and turned into a |
|
322 deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}. |
|
323 By basic we mean textbook definitions such as the one |
|
324 below, involving only regular expressions for characters, alternatives, |
|
325 sequences, and Kleene stars: |
|
326 \[ |
|
327 r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^* |
|
328 \] |
|
329 Modern regular expression matchers used by programmers, |
|
330 however, |
|
331 support much richer constructs, such as bounded repetitions, |
|
332 negations, |
|
333 and back-references. |
|
334 To differentiate, we use the word \emph{regex} to refer |
|
335 to those expressions with richer constructs while reserving the |
|
336 term \emph{regular expression} |
|
337 for the more traditional meaning in formal languages theory. |
|
338 We follow this convention |
|
339 in this thesis. |
|
340 In the future, we aim to support all the popular features of regexes, |
|
341 but for this work we mainly look at basic regular expressions |
|
342 and bounded repetitions. |
|
343 |
|
344 |
|
345 |
|
346 %Most modern regex libraries |
|
347 %the so-called PCRE standard (Peral Compatible Regular Expressions) |
|
348 %has the back-references |
|
349 Regexes come with a number of constructs |
|
350 that make it more convenient for |
|
351 programmers to write regular expressions. |
|
352 Depending on the types of constructs |
|
353 the task of matching and lexing with them |
|
354 will have different levels of complexity. |
|
Some of those constructs are syntactic sugar, that is,
simply shorthand notations
that save the programmers a few keystrokes.
|
358 These will not cause problems for regex libraries. |
|
359 For example the |
|
360 non-binary alternative involving three or more choices just means: |
|
361 \[ |
|
362 (a | b | c) \stackrel{means}{=} ((a + b)+ c) |
|
363 \] |
|
364 Similarly, the range operator |
|
365 %used to express the alternative |
|
366 %of all characters between its operands, |
|
367 is just a concise way |
|
368 of expressing an alternative of consecutive characters: |
|
369 \[ |
|
[0-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
|
371 \] |
|
372 for an alternative. The |
|
wildcard character ``$.$'' is used to refer to any single character,
|
374 \[ |
|
375 . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots] |
|
376 \] |
|
377 except the newline. |
|
378 |
|
379 \subsection{Bounded Repetitions} |
|
380 More interesting are bounded repetitions, which can |
|
381 make the regular expressions much |
|
382 more compact. |
|
383 Normally there are four kinds of bounded repetitions: |
|
384 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$ |
|
385 (where $n$ and $m$ are constant natural numbers). |
|
386 Like the star regular expressions, the set of strings or language |
|
387 a bounded regular expression can match |
|
388 is defined using the power operation on sets: |
|
389 \begin{center} |
|
390 \begin{tabular}{lcl} |
|
391 $L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\ |
|
392 $L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\ |
|
393 $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\ |
|
394 $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$ |
|
395 \end{tabular} |
|
396 \end{center} |
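Representing (finite) languages as Python sets of strings, the four clauses above can be sketched directly; this is only an illustration, and the helper names \texttt{lang\_pow} and \texttt{lang\_bounded} are made up rather than taken from the formalisation:

```python
def concat(A: set, B: set) -> set:
    """Language concatenation: all strings a @ b with a in A, b in B."""
    return {a + b for a in A for b in B}

def lang_pow(A: set, n: int) -> set:
    """(L r)^n: all concatenations of n strings drawn from A."""
    result = {""}                      # A^0 = {[]}
    for _ in range(n):
        result = concat(result, A)
    return result

def lang_bounded(A: set, n: int, m: int) -> set:
    """L(r^{n..m}) = union of (L r)^i for n <= i <= m."""
    out = set()
    for i in range(n, m + 1):
        out |= lang_pow(A, i)
    return out

A = {"a", "bb"}
assert lang_pow(A, 2) == {"aa", "abb", "bba", "bbbb"}
assert lang_bounded(A, 0, 1) == {"", "a", "bb"}
```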
|
397 The attraction of bounded repetitions is that they can be |
|
398 used to avoid a size blow up: for example $r^{\{n\}}$ |
|
399 is a shorthand for |
|
400 the much longer regular expression: |
|
401 \[ |
|
\underbrace{r\ldots r}_{\text{$n$ copies of $r$}}.
|
403 \] |
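The size saving is easy to quantify: unfolding $r^{\{n\}}$ multiplies the size of the expression by $n$, and nested counters multiply these factors together. A small sketch, treating regular expressions as plain strings for simplicity (the helper \texttt{desugar} is made up for illustration):

```python
def desugar(r: str, n: int) -> str:
    """Unfold r{n} naively into n explicit copies of r."""
    return r * n

# a single counter multiplies the size by n ...
assert len(desugar("(ab)", 1000)) == 1000 * len("(ab)")
# ... and nested counters multiply: a{1000}{100}{5} yields 500000 copies
assert len(desugar(desugar(desugar("a", 1000), 100), 5)) == 500_000
```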
|
404 %Therefore, a naive algorithm that simply unfolds |
|
405 %them into their desugared forms |
|
406 %will suffer from at least an exponential runtime increase. |
|
407 |
|
408 |
|
409 The problem with matching |
|
410 such bounded repetitions |
|
411 is that tools based on the classic notion of |
|
412 automata need to expand $r^{\{n\}}$ into $n$ connected |
|
413 copies of the automaton for $r$. This leads to very inefficient matching |
|
414 algorithms or algorithms that consume large amounts of memory. |
|
415 Implementations using $\DFA$s will |
|
416 in such situations |
|
417 either become very slow |
|
418 (for example Verbatim++ \cite{Verbatimpp}) or run |
|
419 out of memory (for example $\mathit{LEX}$ and |
|
420 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators |
|
421 in C and JAVA that generate $\mathit{DFA}$-based |
|
422 lexers. The user provides a set of regular expressions |
|
423 and configurations, and then |
|
424 gets an output program encoding a minimized $\mathit{DFA}$ |
|
425 that can be compiled and run. |
|
426 When given the above countdown regular expression, |
|
427 a small $n$ (say 20) would result in a program representing a |
|
428 DFA |
|
429 with millions of states.}) for large counters. |
|
430 A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$ |
|
431 where the minimal DFA requires at least $2^{n+1}$ states. |
|
432 For example, when $n$ is equal to 2, |
|
433 the corresponding $\mathit{NFA}$ looks like: |
|
434 \vspace{6mm} |
|
435 \begin{center} |
|
436 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
|
437 \node[state,initial] (q_0) {$q_0$}; |
|
438 \node[state, red] (q_1) [right=of q_0] {$q_1$}; |
|
439 \node[state, red] (q_2) [right=of q_1] {$q_2$}; |
|
440 \node[state, accepting, red](q_3) [right=of q_2] {$q_3$}; |
|
441 \path[->] |
|
442 (q_0) edge node {a} (q_1) |
|
443 edge [loop below] node {a,b} () |
|
444 (q_1) edge node {a,b} (q_2) |
|
445 (q_2) edge node {a,b} (q_3); |
|
446 \end{tikzpicture} |
|
447 \end{center} |
|
448 and when turned into a DFA by the subset construction |
|
449 requires at least $2^3$ states.\footnote{The |
|
red states are ``countdown states'' which count down
|
451 the number of characters needed in addition to the current |
|
452 string to make a successful match. |
|
453 For example, state $q_1$ indicates a match that has |
|
454 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, |
|
and just consumed the ``delimiter'' $a$ in the middle, and
|
456 needs to match 2 more iterations of $(a|b)$ to complete. |
|
457 State $q_2$ on the other hand, can be viewed as a state |
|
458 after $q_1$ has consumed 1 character, and just waits |
|
459 for 1 more character to complete. |
|
460 The state $q_3$ is the last (accepting) state, requiring 0 |
|
461 more characters. |
|
462 Depending on the suffix of the |
|
463 input string up to the current read location, |
|
464 the states $q_1$ and $q_2$, $q_3$ |
|
465 may or may |
|
466 not be active. |
|
467 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would |
|
468 contain at least $2^3$ non-equivalent states that cannot be merged, |
|
469 because the subset construction during determinisation will generate |
|
470 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. |
|
Generalising this to regular expressions with larger
bounded repetition counts, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if the language of $r$ itself
contains more than one string.
|
476 This is to represent all different |
|
scenarios in which ``countdown'' states are active.}
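This blow-up can also be confirmed computationally. The sketch below (our own illustration; the function names are invented) encodes the NFA for $(a+b)^*a(a+b)^{\{n\}}$ with states $0$ to $n+1$ and determinises it with the subset construction, counting the reachable subsets:

```python
def nfa_step(n: int, state: int, c: str) -> set:
    """Successor states of the NFA for (a|b)* a (a|b)^{n}.
    State 0 loops on a and b; an 'a' may also start the countdown;
    states 1..n count down; state n+1 is accepting."""
    succ = set()
    if state == 0:
        succ.add(0)          # the (a|b)* self-loop
        if c == "a":
            succ.add(1)      # guess: this 'a' is the delimiter
    elif state <= n:
        succ.add(state + 1)  # consume one more (a|b)
    return succ

def dfa_states(n: int) -> set:
    """All subsets reachable from {0} via the subset construction."""
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        S = todo.pop()
        for c in "ab":
            T = frozenset(q for p in S for q in nfa_step(n, p, c))
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return seen

assert len(dfa_states(2)) == 2 ** 3   # 8 states for n = 2
assert len(dfa_states(4)) == 2 ** 5   # the growth is exponential in n
```

Every reachable subset has the form $\{0\} \cup S$ for some $S \subseteq \{1,\ldots,n+1\}$, which gives the $2^{n+1}$ states mentioned above.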
|
478 |
|
479 |
|
480 Bounded repetitions are important because they |
|
481 tend to occur frequently in practical use, |
|
482 for example in the regex library RegExLib, in |
|
483 the rules library of Snort \cite{Snort1999}\footnote{ |
|
484 Snort is a network intrusion detection (NID) tool |
|
485 for monitoring network traffic. |
|
486 The network security community curates a list |
|
487 of malicious patterns written as regexes, |
|
488 which is used by Snort's detection engine |
|
489 to match against network traffic for any hostile |
|
490 activities such as buffer overflow attacks.}, |
|
491 as well as in XML Schema definitions (XSDs). |
|
492 According to Bj\"{o}rklund et al \cite{xml2015}, |
|
493 more than half of the |
|
494 XSDs they found on the Maven.org central repository |
|
495 have bounded regular expressions in them. |
|
496 Often the counters are quite large, with the largest being |
|
497 close to ten million. |
|
498 A smaller sample XSD they gave |
|
499 is: |
|
500 \lstset{ |
|
501 basicstyle=\fontsize{8.5}{9}\ttfamily, |
|
502 language=XML, |
|
503 morekeywords={encoding, |
|
504 xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute} |
|
505 } |
|
506 \begin{lstlisting} |
|
507 <sequence minOccurs="0" maxOccurs="65535"> |
|
508 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
|
509 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
|
510 </sequence> |
|
511 \end{lstlisting} |
|
512 This can be seen as the regex |
|
513 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
|
514 regular expressions |
|
515 satisfying certain constraints (such as |
|
516 satisfying the floating point number format). |
|
517 It is therefore quite unsatisfying that |
|
518 some regular expressions matching libraries |
|
impose ad hoc limits
|
520 for bounded regular expressions: |
|
521 For example, in the regular expression matching library in the Go |
|
522 language the regular expression $a^{1001}$ is not permitted, because no counter |
|
523 can be above 1000, and in the built-in Rust regular expression library |
|
524 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
|
525 for being too big. |
|
526 As Becchi and Crawley \cite{Becchi08} have pointed out, |
|
527 the reason for these restrictions |
|
is that these libraries simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states can
be proportional to the counter value.
|
532 When the counters are large, |
|
533 the memory requirement could become |
|
534 infeasible, and a regex engine |
|
535 like in Go will reject this pattern straight away. |
|
536 \begin{figure}[H] |
|
537 \begin{center} |
|
538 \begin{tikzpicture} [node distance = 2cm, on grid, auto] |
|
539 |
|
540 \node (q0) [state, initial] {$0$}; |
|
541 \node (q1) [state, right = of q0] {$1$}; |
|
542 %\node (q2) [state, right = of q1] {$2$}; |
|
543 \node (qdots) [right = of q1] {$\ldots$}; |
|
544 \node (qn) [state, right = of qdots] {$n$}; |
|
545 \node (qn1) [state, right = of qn] {$n+1$}; |
|
546 \node (qn2) [state, right = of qn1] {$n+2$}; |
|
547 \node (qn3) [state, accepting, right = of qn2] {$n+3$}; |
|
548 |
|
549 \path [-stealth, thick] |
|
550 (q0) edge [loop above] node {a} () |
|
551 (q0) edge node {a} (q1) |
|
552 %(q1) edge node {.} (q2) |
|
553 (q1) edge node {.} (qdots) |
|
554 (qdots) edge node {.} (qn) |
|
555 (qn) edge node {.} (qn1) |
|
556 (qn1) edge node {b} (qn2) |
|
557 (qn2) edge node {$c$} (qn3); |
|
558 \end{tikzpicture} |
|
559 %\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
|
560 % \node[state,initial] (q_0) {$0$}; |
|
561 % \node[state, ] (q_1) [right=of q_0] {$1$}; |
|
562 % \node[state, ] (q_2) [right=of q_1] {$2$}; |
|
563 % \node[state, |
|
564 % \node[state, accepting, ](q_3) [right=of q_2] {$3$}; |
|
565 % \path[->] |
|
566 % (q_0) edge node {a} (q_1) |
|
567 % edge [loop below] node {a,b} () |
|
568 % (q_1) edge node {a,b} (q_2) |
|
569 % (q_2) edge node {a,b} (q_3); |
|
570 %\end{tikzpicture} |
|
571 \end{center} |
|
572 \caption{The example given by Becchi and Crawley |
|
573 that NFA simulation can consume large |
|
574 amounts of memory: $.^*a.^{\{n\}}bc$ matching |
|
575 strings of the form $aaa\ldots aaaabc$. |
|
576 When traversing in a breadth-first manner, |
|
577 all states from 0 till $n+1$ will become active.}\label{fig:inj} |
|
578 |
|
579 \end{figure} |
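The situation in figure \ref{fig:inj} can be reproduced with a small breadth-first simulation; this sketch of the NFA for $.^*a.^{\{n\}}bc$ is our own illustration, with the state numbering taken from the figure:

```python
def step(n: int, active: set, c: str) -> set:
    """One breadth-first step of the NFA for .* a .{n} b c."""
    nxt = set()
    for q in active:
        if q == 0:
            nxt.add(0)                 # the .* self-loop
            if c == "a":
                nxt.add(1)             # guess the delimiting 'a'
        elif q <= n:
            nxt.add(q + 1)             # .{n} accepts any character
        elif q == n + 1 and c == "b":
            nxt.add(n + 2)
        elif q == n + 2 and c == "c":
            nxt.add(n + 3)             # accepting state
    return nxt

def active_after(n: int, s: str) -> set:
    active = {0}
    for c in s:
        active = step(n, active, c)
    return active

# with n = 5, a long run of a's keeps states 0..n+1 active simultaneously
assert active_after(5, "a" * 20) == set(range(0, 7))
```

After at least $n+1$ characters $a$, all of the states $0,\ldots,n+1$ are active at once, so for large counters the active set, and hence the memory requirement, grows with $n$.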
|
580 %Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this |
|
581 %type of $\mathit{NFA}$ simulation and guarantees a linear runtime |
|
582 %in terms of input string length. |
|
583 %TODO:try out these lexers |
|
584 These problems can of course be solved in matching algorithms where |
|
585 automata go beyond the classic notion and for instance include explicit |
|
586 counters \cite{Turo_ov__2020}. |
|
587 These solutions can be quite efficient, |
|
588 with the ability to process |
|
589 gigabits of strings input per second |
|
590 even with large counters \cite{Becchi08}. |
|
591 These practical solutions do not come with |
|
592 formal guarantees, and as pointed out by |
|
593 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone. |
|
594 %But formal reasoning about these automata especially in Isabelle |
|
595 %can be challenging |
|
596 %and un-intuitive. |
|
597 %Therefore, we take correctness and runtime claims made about these solutions |
|
598 %with a grain of salt. |
|
599 |
|
600 In the work reported in \cite{FoSSaCS2023} and here, |
|
601 we add better support using derivatives |
|
for bounded regular expressions such as $r^{\{n\}}$.
|
603 Our results |
|
604 extend straightforwardly to |
|
605 repetitions with intervals such as |
|
606 $r^{\{n\ldots m\}}$. |
|
The merit of Brzozowski derivatives (more on this later)
for this problem is that
they can be naturally extended to support bounded repetitions.
|
610 Moreover these extensions are still made up of only small |
|
611 inductive datatypes and recursive functions, |
|
612 making it handy to deal with them in theorem provers. |
|
613 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be |
|
614 %straightforwardly extended to deal with bounded regular expressions |
|
615 %and moreover the resulting code still consists of only simple |
|
616 %recursive functions and inductive datatypes. |
|
617 Finally, bounded regular expressions do not destroy our finite |
|
618 boundedness property, which we shall prove later on. |
|
619 |
|
620 |
|
621 |
|
622 |
|
623 |
|
624 \subsection{Back-References} |
|
625 The other way to simulate an $\mathit{NFA}$ for matching is choosing |
|
626 a single transition each time, keeping all the other options in |
|
627 a queue or stack, and backtracking if that choice eventually |
|
628 fails. |
|
This method, often called a ``depth-first search'',
|
630 is efficient in many cases, but could end up |
|
631 with exponential run time. |
|
632 The backtracking method is employed in regex libraries |
|
633 that support \emph{back-references}, for example |
|
634 in Java and Python. |
|
635 %\section{Back-references and The Terminology Regex} |
|
636 |
|
637 %When one constructs an $\NFA$ out of a regular expression |
|
638 %there is often very little to be done in the first phase, one simply |
|
639 %construct the $\NFA$ states based on the structure of the input regular expression. |
|
640 |
|
641 %In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways: |
|
642 %one by keeping track of all active states after consuming |
|
643 %a character, and update that set of states iteratively. |
|
644 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$ |
|
645 %for a path terminating |
|
646 %at an accepting state. |
|
647 |
|
648 |
|
649 |
|
650 Consider the following regular expression where the sequence |
|
651 operator is omitted for brevity: |
|
652 \begin{center} |
|
653 $r_1r_2r_3r_4$ |
|
654 \end{center} |
|
655 In this example, |
|
656 one could label sub-expressions of interest |
|
657 by parenthesizing them and giving |
|
658 them a number in the order in which their opening parentheses appear. |
|
659 One possible way of parenthesizing and labelling is: |
|
660 \begin{center} |
|
661 $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ |
|
662 \end{center} |
|
663 The sub-expressions |
|
664 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled |
|
665 by 1 to 4, and can be ``referred back'' by their respective numbers. |
|
666 %These sub-expressions are called "capturing groups". |
|
667 To do so, one uses the syntax $\backslash i$ |
|
668 to denote that we want the sub-string |
|
of the input matched by the $i$-th
|
670 sub-expression to appear again, |
|
671 exactly the same as it first appeared: |
|
672 \begin{center} |
|
673 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots |
|
674 \underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$ |
|
675 \end{center} |
|
676 Once the sub-string $s_i$ for the sub-expression $r_i$ |
|
677 has been fixed, there is no variability on what the back-reference |
|
678 label $\backslash i$ can be---it is tied to $s_i$. |
|
The overall string will look like $\ldots s_i \ldots s_i \ldots$.
|
680 %The backslash and number $i$ are the |
|
681 %so-called "back-references". |
|
682 %Let $e$ be an expression made of regular expressions |
|
683 %and back-references. $e$ contains the expression $e_i$ |
|
684 %as its $i$-th capturing group. |
|
685 %The semantics of back-reference can be recursively |
|
686 %written as: |
|
687 %\begin{center} |
|
688 % \begin{tabular}{c} |
|
689 % $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ |
|
690 % $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ |
|
691 % \end{tabular} |
|
692 %\end{center} |
|
693 A concrete example |
|
694 for back-references is |
|
695 \begin{center} |
|
696 $(.^*)\backslash 1$, |
|
697 \end{center} |
|
698 which matches |
|
699 strings that can be split into two identical halves, |
|
700 for example $\mathit{foofoo}$, $\mathit{ww}$ and so on. |
|
701 Note that this is different from |
|
702 repeating the sub-expression verbatim like |
|
703 \begin{center} |
|
704 $(.^*)(.^*)$, |
|
705 \end{center} |
|
706 which does not impose any restrictions on what strings the second |
|
707 sub-expression $.^*$ |
|
708 might match. |
|
709 Another example for back-references is |
|
710 \begin{center} |
|
711 $(.)(.)\backslash 2\backslash 1$ |
|
712 \end{center} |
|
713 which matches four-character palindromes |
|
714 like $abba$, $x??x$ and so on. |
|
715 |
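Both examples can be checked against an existing backtracking
engine; the following quick sketch uses Python's \texttt{re} module,
whose back-references behave as described above:

```python
import re

# (.*)\1 matches exactly the strings made of two identical halves.
assert re.fullmatch(r'(.*)\1', 'foofoo') is not None
assert re.fullmatch(r'(.*)\1', 'foobar') is None

# (.*)(.*) places no constraint on the second group, so it matches
# any (newline-free) string.
assert re.fullmatch(r'(.*)(.*)', 'foobar') is not None

# (.)(.)\2\1 matches four-character palindromes such as "abba".
assert re.fullmatch(r'(.)(.)\2\1', 'abba') is not None
assert re.fullmatch(r'(.)(.)\2\1', 'abab') is None
```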
|
Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
The most common use of back-references
is to describe well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like
\begin{center}
$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format is
\begin{center}
$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
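A runnable version of this tag-matching idiom can again be given with
Python's \texttt{re} module (a simplified sketch of the idea, not a
full treatment of HTML):

```python
import re

# \1 forces the closing tag to repeat whatever tag name the first
# group captured; a simplified sketch, not a full HTML grammar.
tag_pair = r'<(.+)>.*</\1>'
assert re.fullmatch(tag_pair, '<html>some content</html>') is not None
# A mismatched closing tag is rejected:
assert re.fullmatch(tag_pair, '<html>some content</body>') is None
```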
|
Despite being useful, back-references lift the expressive power of
regexes beyond the regular languages.
In fact, regexes with back-references can express
languages that are not even context-free.
For example, the regex $(a^*)b\backslash 1 b \backslash 1$
denotes the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be generated by any
context-free grammar \cite{campeanu2003formal}.
This language belongs to the context-sensitive level
of the formal language hierarchy.
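This can be observed concretely with a backtracking engine; assuming
Python's \texttt{re} semantics:

```python
import re

# (a*)b\1b\1 denotes { a^n b a^n b a^n | n >= 0 }, a language that is
# not context-free: the back-references force all three blocks of a's
# to have the same length.
pattern = r'(a*)b\1b\1'
for n in range(6):
    assert re.fullmatch(pattern, 'a' * n + 'b' + 'a' * n + 'b' + 'a' * n)
# Blocks of unequal length are rejected:
assert re.fullmatch(pattern, 'aab' + 'ab' + 'aa') is None
```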
|
Moreover, the matching problem for regexes with back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references, such as
PCRE \cite{pcre}, therefore have to
resort to depth-first search algorithms with backtracking.
What is unexpected is that even in cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown by the graphs in figure \ref{fig:aStarStarb}.
|
Summing up, we can categorise existing
practical regex libraries into two kinds:
(i) those with linear-time
guarantees, like the engines of Go and Rust; their downside is that
they impose restrictions
on the regular expressions (back-references are not allowed,
bounded repetitions cannot exceed an ad hoc limit, and so on);
and (ii) those
that allow large bounded repetitions and back-references,
at the expense of using backtracking algorithms.
The latter can potentially ``grind to a halt''
on some very simple inputs, resulting
in ReDoS attacks when exposed to the internet.

The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.
|
%We would like to have regex engines that can
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps

%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.
|
\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The most widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is adopted by many regular expression matchers
because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit-test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
$(aba + ab + a)^*$ \quad and the string \quad $ababa$.
\end{center}
The correct POSIX match for the above
involves the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
|
But feeding this example to the different engines
listed at regex101\footnote{
regex101 is an online regular expression matcher which
provides an API for trying out the regular expression
engines of multiple popular programming languages like
Java, Python, Go, etc.} \parencite{regex101}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $[a]$ at $[4, 5)$.
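Python's engine, one of those available through regex101, reproduces
this greedy (non-POSIX) behaviour; a small check, assuming the
backtracking semantics of Python's \texttt{re} module:

```python
import re

# A POSIX matcher must return the longest initial match, 'ababa'
# (split as 'ab' + 'aba').  Python's greedy backtracking engine
# instead commits to 'aba' in the first star iteration; at the
# remaining input 'ba' no alternative matches, so it stops there.
m = re.match(r'(aba|ab|a)*', 'ababa')
assert m.group(0) == 'aba'  # not the POSIX answer 'ababa'
```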
|
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries do not
correctly implement the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}\it
``The POSIX strategy is more complicated than the
greedy because of the dependence on information about
the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
People have recognised that the implementation complexity of the POSIX rules
also comes from the specification not being very precise.
The developers of the regexp package of
Go\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
``The POSIX rule is computationally prohibitive
and not even well-defined.''
\end{quote}
|
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as follows (section 1, last paragraph):
\begin{quote}
\begin{itemize}
\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there
\item
earlier subpatterns have leftmost-longest priority over later subpatterns
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns
\item
REs have right associative concatenation which can be changed with parenthesis
\item
parenthesized subexpressions return the match from their last usage
\item
text of component subexpressions must be contained in the text of the
higher-level subexpressions
\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions
\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string
\end{itemize}
\end{quote}
|
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
Ribeiro and Du Bois \cite{RibeiroAgda2017} have
formalised the notion of bit-coded regular expressions
and proved their relation to simple regular expressions in
the dependently-typed proof assistant Agda.
They also proved the soundness and completeness of a matching algorithm
based on the bit-coded regular expressions.
Ausaf et al.~\cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and to prove
that their specification coincides with an earlier (unformalised)
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu needed to be substantially
revised in order for the correctness proof to go through;
according to Ausaf et al., the original specification and
proof were unfixable.


In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, on which the main results of this thesis build.
|
%We give a taste of what they
%are like and why they are suitable for regular expression
%matching and lexing.
\section{Formal Specification of POSIX Matching
and Brzozowski Derivatives}