%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

% \marginpar{Totally rewritten introduction}

%TODO: look up snort rules to use here--give readers idea of what regexes look like

\marginpar{rephrasing following Christian comment}
Regular expressions, since their inception in the 1950s
\cite{RegularExpressions},
have been subject to extensive study and implementation.
Their primary application lies in text processing--finding
matches and identifying patterns in a string.
%It is often used to match strings that comprises of numerous fields,
%where certain fields may recur or be omitted.
For example, a simple regular expression that tries
to recognise email addresses is
\marginpar{rephrased from "the regex for recognising" to "a simple regex that tries to match email"}
\begin{center}
\verb|[a-z0-9._]+@[a-z0-9.-]+\.[a-z]{2,6}|
%$[a-z0-9._]^+@[a-z0-9.-]^+\.[a-z]{2,6}$.
\end{center}
\marginpar{Simplified example, but the distinction between . and escaped . is correct
and therefore left unchanged. Also verbatim package does not straightforwardly support superscripts so + kept as they are.}
%Using this, regular expression matchers and lexers are able to extract
%the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
\marginpar{Rewrote explanation for the expression.}
This expression assumes that all letters in the email have been converted to lower-case.
The local-part is recognised by the first
bracketed expression $[a-z0-9.\_]^+$,
where the
operator ``+'' (to be viewed as a superscript)
means that the expression must match some non-empty string
made of the characters listed in the brackets.
After the ``@'' sign
is the sub-expression that recognises the domain of that email,
where $[a-z]^{\{2, 6\}}$ specifically
matches the top-level domain, such as ``org'', ``com'' or ``uk''.
The counters $2, 6$ in the superscript
specify that all top-level domains
should have between two and six characters.
This regular expression does not represent all possible email addresses
(e.g.~those involving ``-'' in the local-part cannot be recognised), but patterns
of a similar shape, using roughly the same syntax, are used
everywhere in daily computing.
% Indeed regular expressions are used everywhere
% in a computer user's daily life--for example searching for a string
% in a text document using ctrl-F.
%Consequently, they are an indispensible component in text processing tools
%of software systems such as compilers, IDEs, and firewalls.
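As a small illustration (this snippet is ours and not part of the
thesis's formal development; the addresses are made up), the pattern
can be tried directly with the regex engine exposed by Scala's
standard library:
\begin{verbatim}
// Hypothetical usage of the email pattern above (Scala 2.13+).
val email = raw"[a-z0-9._]+@[a-z0-9.-]+\.[a-z]{2,6}".r

println(email.matches("jane.doe@example.org"))  // true
println(email.matches("Jane.Doe@example.org"))  // false: upper-case letters
println(email.matches("jane-doe@example.org"))  // false: "-" in the local-part
\end{verbatim}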

% The picture gets more blurry when we start to ask questions like why
%TODO: Insert corresponding bibliography
Regular expressions are used, for example, in compilers \cite{LEX}, networking \cite{Snort1999},
software engineering (IDEs) \cite{Atom}
and operating systems \cite{GREP}, where the correctness
of matching can be crucially important.

Implementations of regular expression matching out in the wild, however,
are surprisingly unreliable.
%TODO: double check this is true
An example is the regular expression $(a^*)^*b$ and the string
$aa\ldots a$. The expectation is that any regex engine should
be able to solve this (return a non-match) in no time.
However, if this
%regular expression and string pair
is tried out in a
regex engine like that of Java 8, the runtime quickly grows beyond 100 seconds
with just dozens of characters
in the string.
Such behaviour can result in Regular expression Denial-of-Service (ReDoS) attacks
mounted with minimal resources--just the pair of ``evil''
regular expression and string.
The outage of the web service provider Cloudflare \cite{Cloudflare}
in 2019 \cite{CloudflareOutage} is a recent example
where such issues caused serious negative impact.

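To make this concrete (the snippet below is our own illustration, not
part of the formal development; exact timings depend on the machine and
JVM version), the blow-up can be reproduced by driving Java's built-in
backtracking engine from Scala:
\begin{verbatim}
import java.util.regex.Pattern

// Measures how long Java's engine takes to *reject* aa...a
// against the evil pattern (a*)*b; expect the runtime to
// roughly double with every additional character.
object CatastrophicBacktracking {
  def main(args: Array[String]): Unit = {
    val evil = Pattern.compile("(a*)*b")
    for (n <- 20 to 28 by 2) {
      val input = "a" * n
      val start  = System.nanoTime()
      evil.matcher(input).matches()   // always false, but exponentially slow
      val millis = (System.nanoTime() - start) / 1e6
      println(f"n = $n%2d: $millis%10.2f ms")
    }
  }
}
\end{verbatim}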
The reason why these regex matching engines can get so slow
is that they use backtracking, i.e.~a depth-first search (DFS) of the
space of possible matches. They employ backtracking algorithms in order to
support back-references, a mechanism for
expressing languages that repeat an arbitrarily long string,
such as
$\{ww \mid w\in \Sigma^*\}$. Such constructs make matching
NP-complete, for which no known non-backtracking algorithms exist.
The regex engines of more modern programming languages, such as those
of Rust and Go, do promise linear-time behaviour
with respect to the input string,
but they do not support back-references, and often
impose ad-hoc restrictions on the input patterns.
More importantly, these promises are purely empirical,
making the engines prone to future ReDoS attacks and other types of errors.

Another unreliability of industry regex engines
is that they do not always produce the desired output.
Kuklewicz \cite{KuklewiczHaskell} found during his testing
of practical regex engines that almost all of them are incorrect with respect
to the POSIX standard, the disambiguation strategy most
widely adopted in computer science. Roughly speaking, POSIX lexing means to always
go for the longest initial submatch possible;
for instance a single iteration
$aa$ is preferred over two iterations $a,a$
when matching $(a|aa)^*$ with the string $aa$.
The POSIX strategy as summarised by Kuklewicz goes
as follows:
\marginpar{\em Deleted some peripheral specifications.}
\begin{quote}
\begin{itemize}
\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
\item
$\ldots$
%REs have right associative concatenation which can be changed with parenthesis\\
%\item
%parenthesized subexpressions return the match from their last usage\\
%\item
%text of component subexpressions must be contained in the text of the
%higher-level subexpressions\\
%\item
%if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
%\item
%if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
The author noted that various lexers that come with POSIX operating systems
are rarely correct according to this standard.
A test case that exposes the bugs
is the regular expression $(aba|ab|a)^*$ and the string $ababa$.
A correct match would split the string into $ab, aba$, involving
two repetitions of the Kleene star $(aba|ab|a)^*$.
Most regex engines would instead return two (partial) matches
$aba$ and $a$\footnote{Try it out here: \url{https://regex101.com/r/c5hga5/1}, last accessed 22-Aug-2023}.
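The behaviour is easy to observe first-hand (the snippet is our own
illustration: greedy leftmost matching stops after the first iteration
$aba$, even though the POSIX-longest answer would consume the whole
string):
\begin{verbatim}
import java.util.regex.Pattern

// Java's engine is greedy-leftmost rather than POSIX-longest.
val m = Pattern.compile("(aba|ab|a)*").matcher("ababa")
if (m.lookingAt())
  println(m.group())  // prints "aba", not the POSIX answer "ababa"
\end{verbatim}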
There are numerous occasions where programmers have realised the subtlety and
difficulty of implementing POSIX correctly; one such quote is from the author of
Go's regexp library:\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}, last accessed 22-Aug-2023}
\begin{quote}\it
``The POSIX rule is computationally prohibitive
and not even well-defined.''
\end{quote}
% Being able to formally define and capture the idea of
% POSIX rules and prove
% the correctness of regular expression matching/lexing
% algorithms against the POSIX semantics definitions
% is valuable.
These all point towards a formal treatment of
POSIX lexing to clear up inaccuracies and errors
in the understanding and implementation of regexes. The work presented
in this thesis uses formal proofs to ensure the correctness
and runtime properties
of POSIX regular expression lexing.

Formal proofs, or mechanised proofs, are
computer-checked programs
%such as the ones written in Isabelle/HOL, is a powerful means
that certify the correctness of facts
with respect to a set of axioms and definitions.
% This is done by
% recursively checking that every fact in a proof script
% is either an axiom or a fact that is derived from
% known axioms or verified facts.
%The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such
%methods as their implementation and complexity analysis tend to be error-prone.
They provide an unprecedented level of assurance
that an algorithm will perform as expected on all inputs.
We believe such proofs can help eliminate catastrophic backtracking
once and for all.
The software systems that help people interactively build and check
formal proofs are called proof assistants.
% Many theorem-provers have been developed, such as Mizar,
% Isabelle/HOL, HOL-Light, HOL4,
% Coq, Agda, Idris, Lean and so on.
Isabelle/HOL \cite{Isabelle} is a widely-used proof assistant with a simple type theory
and powerful automated proof tools like Sledgehammer.
We chose Isabelle/HOL for its powerful automation
and for the ease and simplicity of expressing regular expressions and
regular languages in it.
%Some of those employ
%dependent types like Mizar, Coq, Agda, Lean and Idris.
%Some support a constructivism approach, such as Coq.


% Formal proofs on regular expression matching and lexing
% complement the efforts in
% industry which tend to focus on overall speed
% with techniques like parallelization (FPGA paper), tackling
% the problem of catastrophic backtracking
% in an ad-hoc manner (cloudflare and stackexchange article).
The algorithm we work on is based on Brzozowski derivatives.
Brzozowski invented the notion of ``derivatives'' of
regular expressions \cite{Brzozowski1964}; the idea is
that we can reason about what regular expressions can match
by taking derivatives of them.
A derivative operation takes a regular expression $r$ and a character
$c$,
and returns a new regular expression $r\backslash c$.
The derivative is taken with respect to $c$:
it transforms $r$ in such a way that the resulting derivative
$r\backslash c$ contains all strings in the language of $r$ that
start with $c$, but now with the head character $c$ thrown away.
For example, for $r$ equal to $(aba | ab | a)^*$
as discussed earlier, its derivative with respect to the character $a$ is
\[
r\backslash a = (ba | b| \ONE)(aba | ab | a)^*.
\]
% By applying derivatives repeatedly one can c
Taking derivatives repeatedly with respect to the sequence
of characters in a string $s$, one obtains a regular expression
containing the information of how $r$ matched $s$.
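In particular, whether $r$ matches $s$ at all can be read off the final
derivative by testing whether it accepts the empty string (we write
$\mathit{nullable}$ for this test; the equations below state the
standard correctness property of derivative-based matching):
\[
s \in L(r) \;\Longleftrightarrow\; \mathit{nullable}(r\backslash s),
\qquad
r\backslash [] \dn r,
\qquad
r\backslash (c::s) \dn (r\backslash c)\backslash s.
\]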

Brzozowski derivatives are purely algebraic operations
that can be implemented as a recursive function
that pattern-matches on the structure of the regular expression.
For example, the derivatives of character regular expressions,
Kleene star and bounded repetitions can be described by the following
code equations:
% For example some clauses
\begin{center}
\begin{tabular}{lcl}
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
% $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
% $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
% & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
% & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
$r^{\{n\}} \backslash c$ & $\dn$ & $(r \backslash c) \cdot
r^{\{n-1\}} \quad (\text{when} \; n > 0)$\\
\end{tabular}
\end{center}
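These equations translate almost verbatim into a functional program.
The sketch below (in Scala; the datatype and function names are ours
and not those of the formal development) implements the derivative,
the nullability test and the resulting matcher:
\begin{verbatim}
// Regular expressions as an inductive datatype.
sealed trait Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp
case class NTIMES(r: Rexp, n: Int) extends Rexp  // bounded repetition r{n}

// Can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO          => false
  case ONE           => true
  case CHAR(_)       => false
  case ALT(r1, r2)   => nullable(r1) || nullable(r2)
  case SEQ(r1, r2)   => nullable(r1) && nullable(r2)
  case STAR(_)       => true
  case NTIMES(r1, n) => n == 0 || nullable(r1)
}

// The Brzozowski derivative of r with respect to the character c.
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO          => ZERO
  case ONE           => ZERO
  case CHAR(d)       => if (c == d) ONE else ZERO
  case ALT(r1, r2)   => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2)   =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)      => SEQ(der(c, r1), STAR(r1))
  case NTIMES(r1, n) => if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
}

// Matching = repeated differentiation + nullability of the result.
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{verbatim}
For instance \verb|matches(SEQ(CHAR('a'), STAR(CHAR('b'))), "abb")|
evaluates to \verb|true|.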
%(end of ready work, rest construction site)
In particular, we are working on an improvement of the work
by Ausaf et al. \cite{Ausaf,AusafDyckhoffUrban2016}
and by Sulzmann and Lu \cite{Sulzmann2014}.

The variant of the problem we are looking at centers around
an algorithm (which we call $\blexer$) developed by Sulzmann and Lu \cite{Sulzmann2014}.
The reason we chose to look at $\blexer$ and its simplifications
is that it allows a lexical tree to be generated
by an elegant and subtle procedure based on Brzozowski derivatives.
The procedure is made of recursive functions and inductive datatypes, just like derivatives,
allowing intuitive and concise formal reasoning with theorem provers.
Most importantly, $\blexer$ opens up a path to an optimised version,
which we call $\blexersimp$, that improves
performance with simplifications that aggressively change the structure of regular expressions,
whereas most other derivative-based methods
rely on the structure of regular expressions being maintained
in order for the induction in the correctness proof to go through.
For example, Egolf et al. \cite{Verbatim} have developed a verified lexer
with derivatives, but as soon as they started introducing
optimizations such as memoization, they reverted to constructing
DFAs first.
Edelmann \cite{Edelmann2020} explored similar optimizations in his
work on verified LL(1) parsing, with additional enhancements based on data structures like
zippers.

%Sulzmann and Lu have worked out an algorithm
%that is especially suited for verification
%which utilized the fact
%that whenever ambiguity occurs one may duplicate a sub-expression and use
%different copies to describe different matching choices.
The idea behind the correctness of $\blexer$ is simple: during a derivative step,
multiple matches might be possible, in which case an alternative with multiple children,
each corresponding to a
different match, is created. At the end of
the lexing process one always picks the leftmost alternative, which is guaranteed
to be a POSIX value.
This is done by consistently keeping sub-regular expressions in an alternative
with longer submatches
to the left of other copies
(remember that POSIX values are roughly the values with the longest initial
submatch).
The idea behind the optimised version of $\blexer$, which we call $\blexersimp$,
is that since we only ever take the leftmost copy, all re-occurring copies can be
eliminated without losing the POSIX property; this can be done even for
children of alternatives at different levels by merging them together.
Proving $\blexersimp$ correct requires a different
proof strategy from the one used by Ausaf \cite{FahadThesis}.
We invent a rewriting relation, defined as an
inductive predicate, which captures
an invariant strong enough to ensure correctness,
and which commutes with the derivative operation.
This predicate allows a simple
induction on the input string to go through.

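Schematically (the constructor names below are ours, and we elide the
bit-codes that the verified algorithm in later chapters carries
around), the duplicate-removal idea can be sketched as follows,
assuming an n-ary alternative constructor:
\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp   // n-ary alternative
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// Flatten nested alternatives so that copies at different levels
// become comparable, then keep only the leftmost occurrence of each
// sub-expression; the leftmost alternative is the POSIX-preferred
// one, so dropping later duplicates preserves the POSIX value.
def simp(r: Rexp): Rexp = r match {
  case ALTS(rs) =>
    val flat = rs.map(simp).flatMap {
      case ALTS(rs1) => rs1
      case ZERO      => Nil
      case r1        => List(r1)
    }
    flat.distinct match {            // distinct keeps leftmost copies
      case Nil      => ZERO
      case List(r1) => r1
      case rs1      => ALTS(rs1)
    }
  case SEQ(r1, r2) => SEQ(simp(r1), simp(r2))
  case r1          => r1
}
\end{verbatim}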
%This idea has been repeatedly used in different variants of lexing
%algorithms in their paper, one of which involves bit-codes. The bit-coded
%derivative-based algorithm even allows relatively aggressive
%simplification rules which cause
%structural changes that render the induction used in the correctness
%proofs unusable.
%More details will be given in \ref{Bitcoded2} including the
%new correctness proof which involves a new inductive predicate which allows
%rule induction to go through.

\marginpar{20230821 Addition: high-level idea in context}
To summarise, we expect modern regex matching and lexing engines
to be reliable, fast and correct, and to support rich syntactic constructs
like bounded repetitions and back-references.
Backtracking regex engines have exhibited exponential
worst-case behaviours (catastrophic backtracking)
because they employ a depth-first search on the search tree of possible matches,
and complexity analyses of them also take worst-case exponential
time (ref Minamide and Birmingham work).
%Expand notes: JFLEX generates DFAs, cannot support backrefs
%Expand notes: Java 8, python supports pcre, but is exponential
%Expand notes: more modern prog langs like GO, Rust claims linear, and forbids large counters and backrefs
%The work described in this thesis is part of an ongoing
%project which aims to build a formally verified lexer satisfying all these
To ensure the correctness and predictable behaviour of lexical analysis,
we propose to
build a formally verified lexer that satisfies the correctness and non-backtracking
requirements in a bottom-up manner using Brzozowski derivatives.
We build on the line of work by Ausaf et al.
A derivative-based matching algorithm avoids backtracking by explicitly
representing possible matches on the same level of a search tree
as regular expression terms,
in a breadth-first manner.
The efficiency of such algorithms relies on limiting the size
of the derivatives during the computation, for example by
eliminating redundant
terms that lead to identical matches.


The end goal is a lexer that comes with additional formal guarantees
on computational complexity, and whose actual speed is competitive with other,
unverified regex engines.
We will report in the next section the outcome of this project
so far and its contribution with respect to other works in
lexical analysis and theorem proving.


Derivatives of regular expressions are popular in the theorem-proving community
because their algebraic features are amenable to formal reasoning.
As a testimony to this, there exists a sizable number of formalisations of Brzozowski derivatives and
regular expressions in different theorem-proving languages.
%Expand notes: verified lexers: proof-of-concept: Ruberio, Egolf, fast but no formal proof in a proof assistant: Magnus,


% The study of regular expressions is ongoing due to an
% issue known as catastrophic backtracking.
% This phenomenon refers to scenarios in which the regular expression
% matching or lexing engine exhibits a disproportionately long
% runtime despite the simplicity of the input and expression.

% One cause of catastrophic backtracking lies within the
% ambiguities of lexing.
% In the process of matching a multi-character string with
% a regular expression that encompasses several sub-expressions,
% different positions can be designated to mark
% the boundaries of corresponding substrings of the sub-expressions.
% For instance, in matching the string $aaa$ with the
% regular expression $(a+aa)^*$, the divide between
% the initial match and the subsequent iteration could either be
% set between the first and second characters ($a | aa$) or
% between the second and third characters ($aa | a$).
% As both the length of the input string and the structural complexity
% of the regular expression increase, the number of potential delimiter
% combinations can grow exponentially, leading to a corresponding increase
% in complexity for algorithms that do not optimally navigate these possibilities.

% Catastrophic backtracking facilitates a category of computationally inexpensive attacks known as Regular Expression Denial of Service (ReDoS) attacks. Here, an attacker merely dispatches a small attack string to a server, provoking high-complexity behaviours in the server's regular expression engine. Such attacks, whether intentional or accidental, have led to the failure of real-world systems (additional details can be found in the practical overview section in chapter \ref{Introduction}).

% Various disambiguation strategies are
% employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:


%Regular expressions
%have been extensively studied and
%implemented since their invention in 1940s.
%There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
%The widely adopted strategy in practice is POSIX, which always goes for the longest sub-match possible.
%There have been prose definitions like the following
%by Kuklewicz \cite{KuklewiczHaskell}:
%described the POSIX rule as (section 1, last paragraph):

%first character is removed
%state of the automaton after matching that character
%where nodes are represented as
%a sub-expression (for example tagged NFA
%Working on regular expressions
%Because of these problems with automata, we prefer regular expressions
%and derivatives rather than an automata (or graph-based) approach which explicitly converts between
%the regular expression and a different data structure.
%
%
%The key idea

%Regular expressions are widely used in computer science:
%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
%command-line tools like $\mathit{grep}$ that facilitate easy
%text-processing \cite{grep}; network intrusion
%detection systems that inspect suspicious traffic; or compiler
%front ends.
%Given their usefulness and ubiquity, one would assume that
%modern regular expression matching implementations
%are mature and fully studied.
%Indeed, in many popular programming languages' regular expression engines,
%one can supply a regular expression and a string, and in most cases
%get the matching information in a very short time.
%Those engines can sometimes be blindingly fast--some
%network intrusion detection systems
%use regular expression engines that are able to process
%hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
%However, those engines can sometimes also exhibit surprising security vulnerabilities
%under a certain class of inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed

%----------------------------------------------------------------------------------------
\section{Contribution}
%{\color{red} \rule{\linewidth}{0.5mm}}
%\textbf{issue with this part: way too short, not enough details of what I have done.}
%{\color{red} \rule{\linewidth}{0.5mm}}
\marginpar{\em Gerog: No sufficient context on related work making contribution hard to digest.}

\marginpar{introducing formalisations on regex matching}
There have been many formalisations in the theorem-proving community
of regular expressions and lexing.
One flavour is to build from the regular expression an automaton, and to determine
acceptance in terms of the resulting
state after executing the input string on that automaton.
Automata formalisations are in general harder and more cumbersome to deal
with for theorem provers than working directly on regular expressions.
lexical values.
%X also formalised derivatives and regular expressions, producing "parse trees".
%(Some person who's a big name in formal methods)


%In this thesis,
%we propose a solution to catastrophic
%backtracking and error-prone matchers: a formally verified
%regular expression lexing algorithm
%Package \verb`dingbat`: \rightpointright
We have made mainly two contributions in this thesis: %the
%lexer we developed based on Brzozowski derivatives and
%Sulzmanna and Lu's developments called
proving that the lexer $\blexersimp$ is both i) correct and ii) fast.
It is correct w.r.t.~a formalisation of POSIX lexing by Ausaf et al.~\cite{AusafDyckhoffUrban2016}.
It is fast compared with un-optimised implementations like Sulzmann and Lu's original
development, by our metric of internal data structures not growing unboundedly.

\subsection{Theorem Proving with Derivatives}

Ribeiro and Du Bois \cite{RibeiroAgda2017} have
formalised the notion of bit-coded regular expressions
and proved their relation with simple regular expressions in
the dependently-typed proof assistant Agda.
They also proved the soundness and completeness of a matching algorithm
based on the bit-coded regular expressions.
Ausaf et al. \cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and to prove
that their specification coincides with an earlier (unformalised)
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu needed to be substantially
revised in order for the correctness proof to go through.
Their original specification and proof were unfixable
according to Ausaf et al.



\subsection{Complexity Results}
Our formalisation of complexity is unique among similar works in the sense that,
%is about the size of internal data structures.
to our knowledge, %we don't know of a
there are no other certified
lexing/parsing algorithms with similar data-structure size-bound theorems.
Common practice involves making an empirical analysis of the complexity of the algorithm
in question (\cite{Verbatim}, \cite{Verbatimpp}), or relying
on prior (unformalised) complexity analyses of well-known algorithms (\cite{ValiantParsing}),
making them prone to complexity bugs.

%TODO: add citation
%, for example in the Verbatim work \cite{Verbatim}

%While formalised proofs have not included
%Issues known as "performance bugs" can
Whilst formalised complexity theorems
have not yet appeared in other certified lexers/parsers,
%while this work is done,
they do find themselves in the broader theorem-proving literature:
\emph{time credits} have been formalised for separation logic in Coq
\cite{atkey2010amortised}
%not used in
to characterise the runtime complexity of union-find.
%where integer values are recorded %from the beginning of an execution
%as part of the program state
%and decremented in each step.
The idea is that the total number of instructions executed
is equal to the time credits consumed
during the execution of an algorithm.
%each time a basic step is executed.
%The way to use a time credit Time credit is an integer
%is assigned to a program prior to execution, and each step in the program consumes
%one credit.
Arma{\"e}l et al. \cite{bigOImperative} have extended the framework to allow expressing time
credits using big-O notation,
so one can prove both the functional correctness and the asymptotic complexity
of higher-order imperative algorithms.
A number of formal complexity proofs also exist for other
algorithms in Coq \cite{subtypingFormalComplexity}.

The big-O notation has also been formalised in Isabelle
\cite{bigOIsabelle}.
%for example the complexity of
%the Quicksort algorithm
%is $\Theta(n \ln n)$
\marginpar{more work on formalising complexity}
%Our next step is to leverage these frameworks
%It is a precursor to our
Our work focuses on the space complexity of the algorithm under our notion of the size of
a regular expression.
Despite not being a direct asymptotic time complexity proof,
our result is an important stepping stone towards one.



Brzozowski showed that there are finitely many similar derivatives,
where similarity is defined in terms of ACI equations.