ChengsongTanPhdThesis/Chapters/Introduction.tex
For example, the regular expression for recognising email addresses is
\begin{center}
\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
\end{center}
Using this, regular expression matchers and lexers are able to extract,
for example, the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
Consequently, they are an indispensable component in text processing tools
of software systems such as compilers, IDEs, and firewalls.

The study of regular expressions is ongoing due to an
issue known as catastrophic backtracking.
This phenomenon refers to scenarios in which the regular expression
   301 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
   301 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
   302 \item
   302 \item
   303 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
   303 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
   304 \end{itemize}
   304 \end{itemize}
   305 \end{quote}
   305 \end{quote}
   306 However, the author noted that lexers are rarely correct according to this standard.
   306 However, the author noted that various lexers that claim to be POSIX 
   307 Being able to formally define and capture the idea of POSIX rules and matching/lexing algorithms and prove 
   307 are rarely correct according to this standard.
   308 the correctness of such algorithms against the POSIX semantics definitions
   308 There are numerous occasions where programmers realised the subtlety and
   309 would be valuable.
   309 difficulty to implement correctly, one such quote from Go's regexp library author 
   310 
   310 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
   311 Formal proofs are machine checked programs
   311 \begin{quote}\it
   312 such as the ones written in Isabelle/HOL, is a powerful means 
   312 ``
   313 for computer scientists to be certain about correctness of their algorithms and correctness properties.
   313 The POSIX rule is computationally prohibitive
   314 This is done by automatically checking that the end goal is derived solely from a list of axioms, definitions
   314 and not even well-defined.
   315 and proof rules that are known to be correct.
   315 ``
   316 The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such
   316 \end{quote}
   317 methods as their implementation and complexity analysis tend to be error-prone.
   317 Being able to formally define and capture the idea of 
   318 
   318 POSIX rules and prove 
   319 There have been many interesting steps taken in the direction of formal proofs and regular expressions
   319 the correctness of regular expression matching/lexing 
   320 lexing and matching.
   320 algorithms against the POSIX semantics definitions
   321 
   321 is valuable.
   322 
   322 
       

Formal proofs are
machine-checked programs
%such as the ones written in Isabelle/HOL, is a powerful means 
that allow computer scientists to be certain
about the correctness of their algorithms.
This is done by
recursively checking that every fact in a proof script
is either an axiom or a fact that is derived from
known axioms or verified facts.
%The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such
%methods as their implementation and complexity analysis tend to be error-prone.
Formal proofs provide an unprecedented level of assurance
that an algorithm will perform as expected on all inputs.
The software systems that help people interactively build and check
such proofs are called theorem provers or proof assistants.
Many theorem provers have been developed, such as Mizar,
Isabelle/HOL, HOL-Light, HOL4,
Coq, Agda, Idris, Lean and so on.
Isabelle/HOL is a theorem prover with a simple type theory
and powerful automated proof tools like Sledgehammer.
We chose Isabelle/HOL for its powerful automation
and the ease and simplicity with which regular expressions and
regular languages can be expressed in it.
%Some of those employ
%dependent types like Mizar, Coq, Agda, Lean and Idris.
%Some support a constructivism approach, such as Coq.

       
Formal proofs on regular expression matching and lexing
complement the efforts in
industry, which tend to focus on overall speed
with techniques like parallelization (FPGA paper), or tackle
the problem of catastrophic backtracking
in an ad-hoc manner (cloudflare and stackexchange article).

       
There have been many interesting steps in the theorem-proving community
about formalising regular expressions and lexing.
One flavour is to build from the regular expression an automaton, and determine
acceptance in terms of the resulting
state after executing the input string on that automaton.
Automata formalisations are in general harder and more cumbersome to deal
with for theorem provers than working directly on regular expressions.
One such example is by Nipkow \cite{Nipkow1998}.
%They 
%made everything recursive (for example the next state function),
As a core idea, he
used a list of booleans to name each state so that
after composing sub-automata together, renaming the states to maintain
their distinctness is recursive and simple.
As a result, even the obvious lemmas required
``a painful amount of detail'' in the formalisation.
Sometimes the automata are represented as graphs.
But graphs are not inductive datatypes.
Having to base the induction principle on the number of nodes
in a graph makes formal reasoning non-intuitive and convoluted,
resulting in large formalisations \cite{Lammich2012}.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct, which almost always involves
renaming of the nodes.
A theorem prover which provides dependent types, such as Coq,
can alleviate the issue of representing graph nodes
\cite{Doczkal2013}. There the representation of nodes is made
easier by the use of $\textit{FinType}$.
Another representation for automata are matrices.
But the induction for them is not as straightforward either.
There are some more clever representations, for example one by Paulson
using hereditarily finite sets \cite{Paulson2015}.
There the problem with combining graphs can be handled better.
%but we believe that such clever tricks are not very obvious for 
%the John-average-Isabelle-user.

       
The approach that operates directly on regular expressions circumvents the problem of
conversion between a regular expression and an automaton, thereby avoiding representation
problems altogether, even though regular expressions may be seen as a variant of a
non-deterministic finite automaton (ref Laurikari tagged NFA paper).
To match a string, a sequence of algebraic transformations called
(Brzozowski) $\textit{derivatives}$ \cite{Brzozowski1964} is carried out on that regular expression.
Each derivative takes a character and a regular expression,
and returns a new regular expression whose language is closely related to
the original regular expression's language:
strings prefixed with that input character will have their head removed
and strings not prefixed
with that character will be eliminated.
After taking derivatives with respect to all the characters, the string is
exhausted; an algorithm then checks whether the empty string is in that final
regular expression's language.
If so, a match exists and the string is in the language of the input regular expression.
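To make this process concrete, the following is a minimal sketch of such a
matcher in Scala. The datatype and the function names ($\textit{nullable}$,
$\textit{der}$, $\textit{matches}$) are our own choices for illustration;
the formal definitions used in this thesis are given later.
\begin{lstlisting}[language={}, basicstyle=\small\ttfamily]
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the Brzozowski derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivatives w.r.t. the whole string, then the final nullable test
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{lstlisting}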
       
Again this process can be seen as the simulation of an NFA running on a string,
but the recursive nature of the datatypes and functions involved makes
derivatives a perfect object of study for theorem provers.
That is why there have been numerous formalisations of regular expressions
and Brzozowski derivatives in the functional programming and
theorem proving communities (a large list of refs to derivatives formalisation publications).
Ribeiro and Du Bois \cite{RibeiroAgda2017} have
formalised the notion of bit-coded regular expressions
and proved their relations with simple regular expressions in
the dependently-typed proof assistant Agda.
They also proved the soundness and completeness of a matching algorithm
based on the bit-coded regular expressions. Their algorithm is a decision procedure
that gives a Yes/No answer, which does not produce
lexical values.
%X also formalised derivatives and regular expressions, producing "parse trees".
%(Some person who's a big name in formal methods)

       
The variant of the problem we are looking at centers around
an algorithm (which we call $\blexer$) developed by Sulzmann and Lu \cite{Sulzmann2014}.
The reason we chose to look at $\blexer$ and its simplifications
is because it allows a lexical value to be generated
by an elegant and subtle procedure based on Brzozowski derivatives.
The procedure is made up of recursive functions and inductive datatypes, just like derivatives,
allowing intuitive and concise formal reasoning with theorem provers.
Most importantly, $\blexer$ opens up a path to an optimized version,
which we call $\blexersimp$, that improves
performance with simplifications that aggressively change the structure of regular expressions,
whereas most other derivative-based methods
rely on the structure being maintained to allow the induction to
go through.
For example, Egolf et al. \cite{Verbatim} have developed a verified lexer
with derivatives, but as soon as they started introducing
optimizations such as memoization, they reverted to constructing
DFAs first.
Edelmann \cite{Edelmann2020} explored similar optimizations in his
work on verified LL(1) parsing, with additional enhancements based on data structures like
zippers.
       
%Sulzmann and Lu have worked out an algorithm
%that is especially suited for verification
%which utilized the fact
%that whenever ambiguity occurs one may duplicate a sub-expression and use
%different copies to describe different matching choices.
The idea behind the correctness of $\blexer$ is simple: during a derivative step,
multiple matches might be possible, in which case an alternative with multiple children,
each corresponding to a
different match, is created. At the end of
the lexing process one always picks the leftmost alternative, which is guaranteed
to be a POSIX value.
This is done by consistently keeping sub-regular expressions in an alternative
with longer submatches
to the left of other copies
(remember that POSIX values are roughly the values with the longest initial
submatch).
The idea behind the optimized version of $\blexer$, which we call $\blexersimp$,
is that since we only take the leftmost copy, all re-occurring copies can be
eliminated without losing the POSIX property, and this can be done with
children of alternatives at different levels by merging them together.
Proving $\blexersimp$ correct requires a different
proof strategy compared to that by Ausaf \cite{FahadThesis}.
We invent a rewriting relation as an
inductive predicate which captures
a strong enough invariant that ensures correctness
and which commutes with the derivative operation.
This predicate allows a simple
induction on the input string to go through.
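The following Scala sketch illustrates this merging and de-duplication idea.
The names $\textit{flts}$, $\textit{distinctBy}$ and $\textit{simp}$, and the
list-based alternative $\textit{ALTS}$, are our own for this illustration;
the actual simplification in $\blexersimp$ operates on bit-coded regular
expressions and is defined later in the thesis.
\begin{lstlisting}[language={}, basicstyle=\small\ttfamily]
abstract class Rexp
case class CHAR(c: Char) extends Rexp
// a list-based alternative, generalising the binary one
case class ALTS(rs: List[Rexp]) extends Rexp

// flatten nested alternatives so that copies sitting at
// different levels become siblings at the same level
def flts(rs: List[Rexp]): List[Rexp] = rs.flatMap {
  case ALTS(rs1) => flts(rs1)
  case r         => List(r)
}

// keep only the leftmost occurrence of each duplicate;
// later (non-POSIX) copies are dropped
def distinctBy[A, B](xs: List[A], f: A => B,
                     acc: Set[B] = Set()): List[A] = xs match {
  case Nil => Nil
  case x :: rest =>
    if (acc(f(x))) distinctBy(rest, f, acc)
    else x :: distinctBy(rest, f, acc + f(x))
}

def simp(r: Rexp): Rexp = r match {
  case ALTS(rs) => ALTS(distinctBy(flts(rs.map(simp)), identity[Rexp]))
  case r1       => r1
}
\end{lstlisting}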
       

%This idea has been repeatedly used in different variants of lexing
%algorithms in their paper, one of which involves bit-codes. The bit-coded
%derivative-based algorithm even allows relatively aggressive 
%%simplification rules which cause
%structural changes that render the induction used in the correctness
%proofs unusable.
%More details will be given in \ref{Bitcoded2} including the
%new correctness proof which involves a new inductive predicate which allows 
%rule induction to go through.

%first character is removed 
%state of the automaton after matching that character 
%where nodes are represented as 
%a sub-expression (for example tagged NFA
%Working on regular expressions 
%Because of these problems with automata, we prefer regular expressions
%and derivatives rather than an automata (or graph-based) approach which explicitly converts between
%the regular expression and a different data structure.
%
%
%The key idea 

(ends)

%Regular expressions are widely used in computer science: 
%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
%hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
%However, those engines can sometimes also exhibit a surprising security vulnerability
%under a certain class of inputs.
%However, this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed

Consider for example the regular expression $(a^*)^*\,b$ and
strings of the form $aa..a$. These strings cannot be matched by this regular
expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of length around 30 $a$'s,
the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
The algorithms clearly show exponential behaviour, which, as can be seen,
is triggered by some relatively simple regular expressions.
Java 9 and newer
versions improve this behaviour somewhat, but are still slow compared
with the approach we are going to use in this thesis.
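This exponential behaviour is easy to observe for oneself. The following is a
small sketch in Scala using Java's built-in backtracking engine; the concrete
timings will of course vary between machines and JDK versions.
\begin{lstlisting}[language={}, basicstyle=\small\ttfamily]
import java.util.regex.Pattern

val evil = Pattern.compile("(a*)*b")

// match strings of n a's (and no b) against (a*)*b
for (n <- 20 to 30) {
  val input = "a" * n
  val start = System.nanoTime()
  evil.matcher(input).matches()       // always false
  val secs = (System.nanoTime() - start) / 1e9
  println(s"n = $n: $secs seconds")   // roughly doubles with each n
}
\end{lstlisting}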
       

This superlinear blowup in regular expression engines
has caused grief in ``real life'' where it is
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was
$O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.

       
\begin{figure}[p]
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Dart},
    legend pos=north west,
    legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Swift},
    legend pos=north west,
    legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=true,
    %xtick={0,5000,...,40000},
    %xmax=40000,
    %ymax=35,
    restrict x to domain*=0:40000,
    restrict y to domain*=0:35,
    %ytick={0,5,...,30},
    %scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java9+},
    legend pos=north west,
    legend cell align=left]
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs}
\end{tabular}
\end{center}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
           of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   The reason for their superlinear behaviour is that they do a depth-first-search
   using NFAs.
   If the string does not match, the regular expression matching
   engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
       
   521 
       
   522 A more recent example is a global outage of all Cloudflare servers on 2 July
       
   523 2019. A poorly written regular expression exhibited catastrophic backtracking
       
   524 and exhausted CPUs that serve HTTP traffic. Although the outage
       
   525 had several causes, at the heart was a regular expression that
       
   526 was used to monitor network
       
   527 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
       
   528 These problems with regular expressions 
       
   529 are not isolated events that happen
       
   530 very rarely, 
       
   531 %but actually widespread.
       
   532 %They occur so often that they have a 
       
   533 but they occur actually often enough that they have a
       
   534 name: Regular-Expression-Denial-Of-Service (ReDoS)
       
   535 attacks.
       
   536 Davis et al. \cite{Davis18} detected more
       
   537 than 1000 evil regular expressions
       
   538 in Node.js, Python core libraries, npm and pypi. 
       
   539 They therefore concluded that evil regular expressions
       
   540 are a real problem rather than just "a parlour trick".
       

The work in this thesis aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness
and a finiteness property for the size of derivatives
(which are all done in Isabelle/HOL).
Such properties %guarantee the absence of 
are an important step in preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in figure \ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our
approach based on Brzozowski derivatives and formal proofs.

       
\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have clearly been studied for many, many years.
Theoretical results in automata theory state
that basic regular expression matching should be linear
w.r.t.\ the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions,
negations,
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal languages theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at basic regular expressions
and bounded repetitions.

       
%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%has the back-references
Regexes come with a number of constructs
that make it more convenient for
programmers to write regular expressions.
Depending on the types of constructs,
the task of matching and lexing with them
will have different levels of complexity.
Some of those constructs are syntactic sugar, that is,
simply shorthand notations
that save the programmers a few keystrokes.
These will not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
Similarly, the range operator
%used to express the alternative
%of all characters between its operands, 
is just a concise way
of expressing an alternative of consecutive characters:
\[
	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
\]
The wildcard character '$.$' is used to refer to any single character
except the newline:
\[
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]
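Such sugar can be eliminated by a simple desugaring step. For instance, a
character range can be compiled into nested alternatives; the hypothetical
helper below reuses the $\textit{Rexp}$ constructors from the Scala sketch
given earlier.
\begin{lstlisting}[language={}, basicstyle=\small\ttfamily]
// [c1 c2 ... cn]  ~~>  c1 + (c2 + (... + cn))
def charRange(cs: List[Char]): Rexp = cs match {
  case Nil       => ZERO            // an empty range matches nothing
  case c :: Nil  => CHAR(c)
  case c :: rest => ALT(CHAR(c), charRange(rest))
}

// the range [0-9] becomes 0 + (1 + (... + 9))
val digit: Rexp = charRange(('0' to '9').toList)
\end{lstlisting}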
       

\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can
make the regular expressions much
more compact.
Normally there are four kinds of bounded repetitions:
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
Like the star regular expression, the set of strings or language
a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
	\begin{tabular}{lcl}
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
	\end{tabular}
\end{center}
The attraction of bounded repetitions is that they can be
used to avoid a size blow up: for example $r^{\{n\}}$
is a shorthand for
the much longer regular expression:
\[
	\underbrace{r\ldots r}_\text{n copies of r}.
\]
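The derivative-based approach introduced later can treat such counters
symbolically rather than unfolding them. Anticipating the formal definitions
given later, one possible pair of clauses for the exact repetition
$r^{\{n\}}$ is (writing $\mathbf{0}$ for the regular expression matching no
strings at all):
\begin{center}
	\begin{tabular}{lcl}
		$\textit{nullable}\,(r^{\{n\}})$ & $\dn$ & $\textit{if}\;n = 0\;\textit{then}\;\textit{true}\;\textit{else}\;\textit{nullable}\,(r)$\\
		$(r^{\{n\}})\backslash c$ & $\dn$ & $\textit{if}\;n = 0\;\textit{then}\;\mathbf{0}\;\textit{else}\;(r\backslash c)\cdot r^{\{n-1\}}$
	\end{tabular}
\end{center}
The counter $n$ only ever decreases, so the regular expression never needs
to be unfolded into $n$ copies.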
       
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.

The problem with matching
such bounded repetitions
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
in such situations
either become excruciatingly slow
(for example Verbatim++ \cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
in C and JAVA that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given a bounded repetition like the one above,
a small $n$ (say 20) would result in a program representing a
DFA
with millions of states.}) for large counters.
       
A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
the corresponding $\mathit{NFA}$ looks like:
\vspace{6mm}
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0)   {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
   \node[state, red] (q_2) [right=of q_1] {$q_2$};
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->]
    (q_0) edge  node {a} (q_1)
    	  edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
and when turned into a DFA by the subset construction
requires at least $2^3$ states.\footnote{The
red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
and just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$ on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
The state $q_3$ is the last (accepting) state, requiring 0
more characters.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all different
scenarios in which ``countdown'' states are active.}
       

Bounded repetitions are important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.},
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al \cite{xml2015},
more than half of the
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
close to ten million.
A smaller sample XSD they gave
is:
\lstset{
	basicstyle=\fontsize{8.5}{9}\ttfamily,
  language=XML,
  morekeywords={encoding,
    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
}
\begin{lstlisting}
<sequence minOccurs="0" maxOccurs="65535">
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
 	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{lstlisting}
This can be seen as the regex
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).
It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
for bounded regular expressions:
for example, in the regular expression matching library in the Go
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.
As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that they simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states could
be equal to the counter number.
When the counters are large,
the memory requirement could become
infeasible, and a regex engine
like in Go will reject this pattern straight away.
       
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]

    	\node (q0) [state, initial] {$0$};
	\node (q1) [state, right = of q0] {$1$};
	%\node (q2) [state, right = of q1] {$2$};
	\node (qdots) [right = of q1] {$\ldots$};
	\node (qn) [state, right = of qdots] {$n$};
	\node (qn1) [state, right = of qn] {$n+1$};
	\node (qn2) [state, right = of qn1] {$n+2$};
	\node (qn3) [state, accepting, right = of qn2] {$n+3$};

\path [-stealth, thick]
	(q0) edge [loop above] node {a} ()
    (q0) edge node {a}   (q1)
    %(q1) edge node {.}   (q2)
    (q1) edge node {.}   (qdots)
    (qdots) edge node {.} (qn)
    (qn) edge node {.} (qn1)
    (qn1) edge node {b} (qn2)
    (qn2) edge node {$c$} (qn3);
\end{tikzpicture}
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
%   \node[state,initial] (q_0)   {$0$}; 
%   \node[state, ] (q_1) [right=of q_0] {$1$}; 
%   \node[state, ] (q_2) [right=of q_1] {$2$}; 
%   \node[state,
%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
%    \path[->] 
%    (q_0) edge  node {a} (q_1)
%    	  edge [loop below] node {a,b} ()
%    (q_1) edge  node  {a,b} (q_2)
%    (q_2) edge  node  {a,b} (q_3);
%\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
	that NFA simulation can consume large
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
	strings of the form $aaa\ldots aaaabc$.
	When traversing in a breadth-first manner,
all states from 0 till $n+1$ will become active.}
\end{figure}
       
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
%in terms of input string length.
%TODO:try out these lexers
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabits of string input per second
even with large counters \cite{Becchi08}.
But these practical solutions do not come with
formal guarantees, and as pointed out by
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
%But formal reasoning about these automata especially in Isabelle 
%can be challenging
%and un-intuitive. 
%Therefore, we take correctness and runtime claims made about these solutions
%with a grain of salt.
       

In the work reported in \cite{FoSSaCS2023} and here,
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
Our results
extend straightforwardly to
repetitions with intervals such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only small
inductive datatypes and recursive functions,
making it handy to deal with them in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.

       
\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is choosing
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails.
This method, often called a ``depth-first-search'',
is efficient in many cases, but could end up
with exponential run time.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
%\section{Back-references and The Terminology Regex}

%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply 
%construct the $\NFA$ states based on the structure of the input regular expression.

%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming 
%a character, and update that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.

       
Consider the following regular expression where the sequence
operator is omitted for brevity:
\begin{center}
	$r_1r_2r_3r_4$
\end{center}
In this example,
one could label sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions
$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' by their respective numbers.
%These sub-expressions are called "capturing groups".
To do so, one uses the syntax $\backslash i$
to denote that we want the sub-string
of the input matched by the $i$-th
sub-expression to appear again,
exactly the same as it first appeared:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
\end{center}
Once the sub-string $s_i$ for the sub-expression $r_i$
has been fixed, there is no variability on what the back-reference
label $\backslash i$ can be---it is tied to $s_i$.
The overall string will look like $\ldots s_i \ldots s_i \ldots$
%The backslash and number $i$ are the
%so-called "back-references".
%Let $e$ be an expression made of regular expressions 
%and back-references. $e$ contains the expression $e_i$
%as its $i$-th capturing group.
%The semantics of back-reference can be recursively
%written as:
%\begin{center}
%	\begin{tabular}{c}
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
%	\end{tabular}
%\end{center}
A concrete example
for back-references is
\begin{center}
$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from
repeating the sub-expression verbatim like
\begin{center}
	$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
Another example for back-references is
\begin{center}
$(.)(.)\backslash 2\backslash 1$
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.
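These two examples can be tried out directly with a backtracking engine, for
instance Java's, here called from Scala (a small sketch for illustration):
\begin{lstlisting}[language={}, basicstyle=\small\ttfamily]
import java.util.regex.Pattern

val halves = Pattern.compile("(.*)\\1")
halves.matcher("foofoo").matches()   // true:  "foo" ++ "foo"
halves.matcher("foobar").matches()   // false: no identical halves

val palin4 = Pattern.compile("(.)(.)\\2\\1")
palin4.matcher("abba").matches()     // true
palin4.matcher("abab").matches()     // false
\end{lstlisting}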
       

Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
The most common use of back-references
is to express well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like
\begin{center}
	$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
is
\begin{center}
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
Despite being useful, the expressive power of regexes
goes beyond regular languages
once back-references are included.
In fact, they allow regexes to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by
context-free grammars \cite{campeanu2003formal}.
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Also, solving the matching problem involving back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
revert to a depth-first search algorithm including backtracking.
What is unexpected is that even in the cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in figure \ref{fig:aStarStarb}.

Summing up, we can categorise existing
practical regex libraries into two kinds:
(i) the ones with linear
time guarantees like Go and Rust. The downside with them is that
they impose restrictions
on the regular expressions (not allowing back-references,
bounded repetitions cannot exceed an ad hoc limit etc.).
And (ii) those
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
They can potentially ``grind to a halt''
on some very simple cases, resulting in
ReDoS attacks if exposed to the internet.
       

The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.

%We would like to have regex engines that can 
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
% For example, the Rust regex engine claims to be linear, 
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.  
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps

%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on 
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.

       
\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is widely adopted in many regular expression matchers
because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
	$(aba + ab + a)^*$ and the string $ababa$.
\end{center}
The correct POSIX match for the above
involves the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
But feeding this example to the different engines
listed at regex101\footnote{
	regex101 is an online regular expression matcher which
	provides an API for trying out regular expression
	engines of multiple popular programming languages like
Java, Python, Go, etc.} \parencite{regex101}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
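This non-POSIX behaviour can be reproduced with Java's backtracking engine,
here called from Scala (a small sketch for illustration):
\begin{lstlisting}[language={}, basicstyle=\small\ttfamily]
import java.util.regex.Pattern

val m = Pattern.compile("(aba|ab|a)*").matcher("ababa")
m.lookingAt()   // true: some prefix matches
m.group(0)      // "aba" -- a POSIX matcher would return "ababa"
\end{lstlisting}
The greedy backtracking search commits to the first alternative $aba$, after
which no further iteration fits, and it never reconsiders the choice that
would have produced the longer POSIX match.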
       
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries are not
correctly implementing the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}\it
	``The POSIX strategy is more complicated than the
	greedy because of the dependence on information about
	the length of matched strings in the various subexpressions.''
\end{quote}
       
%\noindent
People have recognised that the implementation complexity of POSIX rules also comes from
the specification being not very precise.
The developers of the regexp package of
Go\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
``The POSIX rule is computationally prohibitive
and not even well-defined.''
\end{quote}
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as (section 1, last paragraph):
\begin{quote}
	\begin{itemize}
		\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
\item
REs have right associative concatenation which can be changed with parenthesis\\
\item
parenthesized subexpressions return the match from their last usage\\
\item
text of component subexpressions must be contained in the text of the
higher-level subexpressions\\
\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
       
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
Ribeiro and Du Bois \cite{RibeiroAgda2017} have
formalised the notion of bit-coded regular expressions
and proved their relation with simple regular expressions in
the dependently-typed proof assistant Agda.
They also proved the soundness and completeness of a matching algorithm
based on the bit-coded regular expressions.
Ausaf et al.~\cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and also proved
that their specification coincides with an earlier (unformalised)
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu needed to be substantially
revised in order for the correctness proof to go through.
Their original specification and proof were unfixable
according to Ausaf et al.

       
In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, on which the main part of this thesis builds.
%We give a taste of what they
%are like and why they are suitable for regular expression
%matching and lexing.
\section{Formal Specification of POSIX Matching
and Brzozowski Derivatives}
       
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Brzozowski \cite{Brzozowski1964} first introduced the
concept of a \emph{derivative} of a regular expression in 1964.
The derivative of a regular expression $r$
with respect to a character $c$ is written as $r \backslash c$.
This operation tells us what $r$ transforms into
if we ``chop'' off a particular character $c$
from all strings in the language of $r$ (defined
later as $L \; r$).
%To give a flavour of Brzozowski derivatives, we present
%two straightforward clauses from it:
%\begin{center}
%	\begin{tabular}{lcl}
%		$d \backslash c$     & $\dn$ &
%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
%	\end{tabular}
%\end{center}
%\noindent
%The first clause says that for the regular expression
%denoting a singleton set consisting of a single-character string $\{ d \}$,
%we check the derivative character $c$ against $d$,
%returning a set containing only the empty string $\{ [] \}$
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
%The second clause states that to obtain the regular expression
%representing all strings' head character $c$ being chopped off
%from $r_1 + r_2$, one simply needs to recursively take derivative
%of $r_1$ and $r_2$ and then put them together.
Derivatives have the property
that $s \in L \; (r\backslash c)$ if and only if
$c::s \in L \; r$, where $::$ stands for list prepending.
%This property can be used on regular expressions
%matching and lexing--to test whether a string $s$ is in $L \; r$,
%one simply takes derivatives of $r$ successively with
%respect to the characters (in the correct order) in $s$,
%and then test whether the empty string is in the last regular expression.
With this property, derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string can match the empty string,
then $r$ matches $s$ (and {\em vice versa}).
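These definitions translate almost directly into a functional
program. The following Scala sketch (our own formulation for
illustration; the formal definitions used in this thesis appear in
later chapters) implements the matcher just described:
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                   // matches nothing
case object ONE extends Rexp                    // matches only []
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// nullable(r) is true iff [] is in L(r)
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// der(c, r) computes the derivative r \ c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// s is in L(r) iff the derivative w.r.t. all of s is nullable
def matches(s: List[Char], r: Rexp): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{verbatim}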
       
%This makes formally reasoning about these properties such
%as correctness and complexity smooth and intuitive.
There are several mechanised proofs of this property in various theorem
provers,
for example one by Owens and Slind \cite{Owens2008} in HOL4,
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
yet another in Coq by Coquand and Siles \cite{Coquand2012}.

       
In addition, one can extend derivatives to bounded repetitions
relatively straightforwardly. For example, the derivative of the
bounded repetition $r^{\{n\}}$ can be defined as:
\begin{center}
	\begin{tabular}{lcl}
		$r^{\{n\}} \backslash c$ & $\dn$ & $(r \backslash c) \cdot
		r^{\{n-1\}} \quad (\text{when } n > 0)$\\
	\end{tabular}
\end{center}
\noindent
Experimental results suggest that, unlike DFA-based solutions
for bounded regular expressions,
derivatives can cope with
large counters
quite well.

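In the Scala sketch from above, this extension amounts to one more
constructor and two extra clauses (again our own formulation,
mirroring the clause just given). Note that the counter is kept as a
number rather than being expanded into $n$ copies, which is
presumably why large counters pose fewer problems:
\begin{verbatim}
// bounded repetition r{n}
case class NTIMES(r: Rexp, n: Int) extends Rexp

// extra clause for nullable:
//   case NTIMES(r1, n) => n == 0 || nullable(r1)
// extra clause for der, mirroring r{n} \ c = (r \ c) . r{n-1}:
//   case NTIMES(r1, n) =>
//     if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
\end{verbatim}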
       
There have also been
extensions of derivatives to other regex constructs.
For example, Owens et al.\ include the derivatives
for the \emph{NOT} regular expression, which is
able to concisely express C-style comments of the form
$/* \ldots */$ (see \cite{Owens2008}).
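Their clauses for \emph{NOT} are pleasantly simple; in the style of
the sketch above they read (our transcription):
\begin{verbatim}
// complement regular expression ~r
case class NOT(r: Rexp) extends Rexp

// extra clause for nullable:
//   case NOT(r1) => !nullable(r1)
// extra clause for der:
//   case NOT(r1) => NOT(der(c, r1))
\end{verbatim}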
       
Another extension of derivatives is to
regular expressions with look-aheads, done
by Miyazaki and Minamide
\cite{Takayuki2019}.
%We therefore use Brzozowski derivatives on regular expressions
%lexing

       
Given the above definitions and properties of
Brzozowski derivatives, one quickly realises their potential
for generating a formally verified lexing algorithm: the clauses and
the key property can be easily expressed in a functional programming
language or converted to theorem prover code.
Perhaps this is the reason why derivatives have sparked quite a bit of interest
in the functional programming and theorem prover communities in the last
fifteen or so years (
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
\cite{Chen12} and \cite{Coquand2012}
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
after they were first published by Brzozowski.

       
However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not.  This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).
Second, derivative-based matchers need to keep the sizes of derivatives
under control in order to be efficient.
Elegant and beautiful
as many implementations are,
they can be excruciatingly slow.
For example, Sulzmann and Lu
claim a linear running time for their proposed algorithm,
but that claim was falsified by our experiments. The running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$
is made for the Verbatim \cite{Verbatim}
lexer, which calculates POSIX matches and is based on derivatives.
They formalised the correctness of the lexer, but not their complexity result.
In the performance evaluation section, they analysed the running time
of matching $a$ with the string
\begin{center}
	$\underbrace{a \ldots a}_{\text{n a's}}$
\end{center}
\noindent
and concluded that the algorithm is quadratic in terms of
the length of the input string.
When we tried out their extracted OCaml code with the example $(a+aa)^*$,
the time it took to match a string of 40 $a$'s was approximately 5 minutes.

       
\subsection{Sulzmann and Lu's Algorithm}
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
problem with the yes/no answer
by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function. This injection function in a sense undoes the ``damage''
of the derivatives chopping off characters.
In earlier work, Ausaf et al.\ provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of this extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.

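The overall shape of this two-phase algorithm can be sketched as
follows. This is only a sketch: \texttt{Val} is the type of values
recording \emph{how} a regular expression matched, and \texttt{mkeps}
and \texttt{inj} are Sulzmann and Lu's functions for building the
value for the empty string and for injecting a character back into a
value; their definitions are given in later chapters and are left as
stubs here.
\begin{verbatim}
abstract class Val  // values: how did r match s? (defined later)

// mkeps: the POSIX value for how r matches the empty string
def mkeps(r: Rexp): Val = ???

// inj: undoes one derivative step, re-inserting the chopped-off
// character c into the value computed for r \ c
def inj(r: Rexp, c: Char, v: Val): Val = ???

// phase 1 builds the derivatives, phase 2 walks back with inj
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil     => if (nullable(r)) mkeps(r)
                  else throw new Exception("no match")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}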
       
The version of the algorithm proven correct
suffers, however, heavily from a
second difficulty: derivatives can
grow to arbitrarily big sizes.
For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

       
\noindent where after around 35 steps we usually run out of memory on a
typical computer.  Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in
the work by Ausaf et al.\ to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.

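Expressed over the Scala sketch from earlier, these rules amount to a
bottom-up simplifier along the following lines (our own formulation,
including the mirror images of the listed rules; the simplification
functions actually used in this thesis come later):
\begin{verbatim}
// one bottom-up pass of: ZERO.r => ZERO, ONE.r => r,
// ZERO + r => r, r + r => r (and their mirror images)
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2
    case (s1, ZERO) => s1
    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, s2)  => s2
    case (s1, ONE)  => s1
    case (s1, s2)   => SEQ(s1, s2)
  }
  case _ => r
}

// interleave simplification with the derivative steps
def matchesSimp(s: List[Char], r: Rexp): Boolean =
  nullable(s.foldLeft(r)((r1, c) => simp(der(c, r1))))
\end{verbatim}
\noindent
Even with \texttt{simp} interleaved after every derivative step, the
derivatives of $(a+aa)^*$ keep growing, in line with the remark above.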
       
Therefore we want to look in this thesis at a second
algorithm by Sulzmann and Lu with which they
overcame this ``growth problem'' \cite{Sulzmann2014}.
In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):

\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method [..] in combination with the simplification steps [..]
  yields POSIX parse trees. We have tested this claim
  extensively [..] but yet
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
       
Ausaf and Urban made some initial progress towards the
full correctness proof but still had to leave out the optimisation
Sulzmann and Lu proposed.
Ausaf wrote \cite{Ausaf}:
\begin{quote}\it
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
\end{quote}
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
together with a formal proof of the correctness of those simplifications.

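To give a first impression of the bit-coded representation mentioned
above, values can be coded as bit sequences roughly as follows (a
sketch following Sulzmann and Lu; the precise definitions appear in
later chapters):
\begin{verbatim}
abstract class Bit
case object Z extends Bit
case object S extends Bit

// the value constructors, extending the Val type from above
// (Left/Right here shadow Scala's built-in Either constructors)
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// Z/S record which alternative was taken and where a star
// iteration list ends; characters are recoverable from the regex
def code(v: Val): List[Bit] = v match {
  case Empty          => Nil
  case Chr(_)         => Nil
  case Left(v)        => Z :: code(v)
  case Right(v)       => S :: code(v)
  case Sequ(v1, v2)   => code(v1) ::: code(v2)
  case Stars(Nil)     => List(S)
  case Stars(v :: vs) => Z :: code(v) ::: code(Stars(vs))
}
\end{verbatim}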
       
One of the most recent works in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work for us and we will compare it later with
the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
but deterministic finite automata instead.
We will also study this work in a later section.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.

%----------------------------------------------------------------------------------------
\section{Contribution}
{\color{red} \rule{\linewidth}{0.5mm}}