% ChengsongTanPhdThesis/Chapters/Inj.tex
for function applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g.\ $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.}.
       

\section{Technical Overview}

Consider for example the regular expression $(a^*)^*\,b$ and 
strings of the form $aa..a$. These strings cannot be matched by this regular
expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings, 
say of length around 30 $a$'s,
the decision takes an absurd amount of time to finish (see graphs in Figure~\ref{fig:aStarStarb}).
The algorithms clearly show exponential behaviour, which, as can be seen,
is triggered by some relatively simple regular expressions.
Java 9 and newer
versions improve this behaviour somewhat, but they are still slow compared 
with the approach we are going to use in this thesis.

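The effect is easy to reproduce. The following small Python script (our own illustration, not part of the thesis artefacts) times Python's backtracking \texttt{re} engine on $(a^*)^*\,b$; the measured time roughly doubles with each additional $a$:

```python
import re
import time

# The "evil" regular expression from the text: (a*)*b, matched against
# strings consisting of a's only, so the expected b is always missing.
pattern = re.compile(r'(a*)*b')

for n in [5, 10, 15, 18]:
    s = 'a' * n
    start = time.perf_counter()
    result = pattern.match(s)
    elapsed = time.perf_counter() - start
    # The match always fails, but the time grows exponentially: the
    # backtracking engine retries every way of splitting the a's
    # between the inner and the outer star.
    print(f"n = {n:2d}  match = {result}  time = {elapsed:.6f}s")
```

Running the same pattern in Java 8 or JavaScript produces the exponential curves plotted in Figure~\ref{fig:aStarStarb}.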
       
This superlinear blowup in regular expression engines
has caused grief in ``real life'' where it is 
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was 
$O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable. 

       
\begin{figure}[p]
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={JavaScript},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Python},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}\\ 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Java 8},  
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Dart},  
    legend pos=north west,
    legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}\\ 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Swift},  
    legend pos=north west,
    legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
  & 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=true,
    %xtick={0,5000,...,40000},
    %xmax=40000,
    %ymax=35,
    restrict x to domain*=0:40000,
    restrict y to domain*=0:35,
    %ytick={0,5,...,30},
    %scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Java9+},  
    legend pos=north west,
    legend cell align=left]
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\ 
\multicolumn{2}{c}{Graphs}
\end{tabular}    
\end{center}
\caption{Graphs showing the runtime for matching $(a^*)^*\,b$ with strings 
           of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   The reason for their superlinear behaviour is that they do a depth-first search
   using NFAs.
   If the string does not match, the regular expression matching
   engine starts to explore all possibilities. 
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
       

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited catastrophic backtracking
and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
These problems with regular expressions 
are not isolated events that happen
very rarely, 
%but actually widespread.
%They occur so often that they have a 
but they actually occur often enough that they have a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\cite{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and pypi. 
They therefore concluded that evil regular expressions
are a real problem rather than just ``a parlour trick''.

The work in this thesis aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness 
and a finiteness property for the size of derivatives
(which are all done in Isabelle/HOL).
Such properties %guarantee the absence of 
are an important step in preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in Figure~\ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our 
approach based on Brzozowski derivatives and formal proofs.

       
\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers 
have clearly been studied for many, many years.
Theoretical results in automata theory state 
that basic regular expression matching should be linear
w.r.t.\ the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*
\]
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions,
negations,
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention 
in this thesis.
In the future, we aim to support all the popular features of regexes, 
but for this work we mainly look at basic regular expressions
and bounded repetitions.

       
%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%have back-references
Regexes come with a number of constructs
that make it more convenient for 
programmers to write regular expressions.
Depending on the types of constructs
the task of matching and lexing with them
will have different levels of complexity.
Some of those constructs are only syntactic sugar, that is,
shorthand notations
that save the programmers a few keystrokes.
These will not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
Similarly, the range operator
%used to express the alternative
%of all characters between its operands, 
is just a concise way
of expressing an alternative of consecutive characters:
\[
	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )  
\]
The
wildcard character `$.$' refers to any single character
except the newline:
\[
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]

       
\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can 
make regular expressions much
more compact.
Normally there are four kinds of bounded repetitions:
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
Like the star regular expression, the set of strings or language
a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
	\begin{tabular}{lcl}
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
	\end{tabular}
\end{center}
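For finite languages the definitions above can be computed directly. The following Python sketch (our own illustration; the helper names \texttt{lang\_pow} and \texttt{lang\_bounded} are ours) implements the power operation and the language of $r^{\{n\ldots m\}}$:

```python
# Language power on finite sets of strings, following the definitions
# above: L^0 = {""} and L^(i+1) = {s1 ++ s2 | s1 in L, s2 in L^i}.
def lang_pow(L, n):
    result = {""}
    for _ in range(n):
        result = {s1 + s2 for s1 in L for s2 in result}
    return result

# Language of the bounded repetition r^{n..m}: the union of the
# powers L^i for n <= i <= m.
def lang_bounded(L, n, m):
    out = set()
    for i in range(n, m + 1):
        out |= lang_pow(L, i)
    return out

L = {"a", "b"}
print(sorted(lang_pow(L, 2)))        # the 4 two-letter strings over {a,b}
print(sorted(lang_bounded(L, 0, 1))) # the empty string plus "a" and "b"
```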
       
The attraction of bounded repetitions is that they can be
used to avoid a size blow-up: for example $r^{\{n\}}$
is a shorthand for
the much longer regular expression:
\[
	\underbrace{r\ldots r}_{\text{$n$ copies of $r$}}.
\]
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.

       
The problem with matching 
such bounded repetitions
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected 
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
in such situations
either become excruciatingly slow 
(for example Verbatim++ \cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and 
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
in C and Java that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations, and then 
gets an output program encoding a minimised $\mathit{DFA}$
that can be compiled and run. 
When given a countdown regular expression like the one below,
a small $n$ (say 20) would result in a program representing a
DFA
with millions of states.}) for large counters.
A classic example of this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
for which the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
the corresponding $\mathit{NFA}$ looks like:
\vspace{6mm}
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
   \node[state,initial] (q_0)   {$q_0$}; 
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->] 
    (q_0) edge  node {a} (q_1)
    	  edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
which, when turned into a DFA by the subset construction,
requires at least $2^3$ states.\footnote{The 
red states are ``countdown states'' which count down 
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and 
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
The state $q_3$ is the last (accepting) state, requiring 0 
more characters.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged, 
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalising this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself contains
more than one string.
This is to represent all different 
scenarios in which ``countdown'' states are active.}
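The $2^{n+1}$ blow-up can be observed concretely by running the subset construction on the three-countdown-state NFA above. The following Python sketch (our own illustration, not part of the thesis artefacts) counts the reachable DFA states for $(a+b)^*a(a+b)^{\{2\}}$:

```python
# NFA for (a+b)^* a (a+b)^{2}: state 0 loops on a and b and moves to 1
# on a; states 1 and 2 step forward on either letter; state 3 accepts.
nfa = {
    (0, 'a'): {0, 1}, (0, 'b'): {0},
    (1, 'a'): {2},    (1, 'b'): {2},
    (2, 'a'): {3},    (2, 'b'): {3},
}

def subset_construction(start):
    # Worklist exploration of the subsets of NFA states reachable
    # from the start subset; each subset is one DFA state.
    dfa_states = {frozenset(start)}
    todo = [frozenset(start)]
    while todo:
        current = todo.pop()
        for c in ('a', 'b'):
            nxt = frozenset(q for s in current for q in nfa.get((s, c), ()))
            if nxt and nxt not in dfa_states:
                dfa_states.add(nxt)
                todo.append(nxt)
    return dfa_states

states = subset_construction({0})
print(len(states))  # 8 reachable subsets = 2^{2+1}
```

Every subset contains state $0$ (the Kleene-star loop never dies), and the remaining states $1$, $2$, $3$ occur in all $2^3$ combinations, matching the power-set argument in the footnote.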
       


Bounded repetitions are important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.}, 
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
more than half of the 
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
close to ten million. 
A smaller sample XSD they gave
is:
\lstset{
	basicstyle=\fontsize{8.5}{9}\ttfamily,
  language=XML,
  morekeywords={encoding,
    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
}
\begin{lstlisting}
<sequence minOccurs="0" maxOccurs="65535">
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
 	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{lstlisting}
This can be seen as the regex
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions 
satisfying certain constraints (such as 
satisfying the floating point number format).
It is therefore quite unsatisfying that 
some regular expression matching libraries
impose ad hoc limits
on bounded regular expressions:
for example, in the regular expression matching library of the Go
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000; and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big. 
As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that such libraries simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states can
be equal to the counter number.
When the counters are large, 
the memory requirement can become
infeasible, and a regex engine
like the one in Go will reject such a pattern straight away.
       
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]
 
    	\node (q0) [state, initial] {$0$};
	\node (q1) [state, right = of q0] {$1$};
	%\node (q2) [state, right = of q1] {$2$};
	\node (qdots) [right = of q1] {$\ldots$};
	\node (qn) [state, right = of qdots] {$n$};
	\node (qn1) [state, right = of qn] {$n+1$};
	\node (qn2) [state, right = of qn1] {$n+2$};
	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
 
\path [-stealth, thick]
	(q0) edge [loop above] node {a} ()
    (q0) edge node {a}   (q1) 
    %(q1) edge node {.}   (q2)
    (q1) edge node {.}   (qdots)
    (qdots) edge node {.} (qn)
    (qn) edge node {.} (qn1)
    (qn1) edge node {b} (qn2)
    (qn2) edge node {$c$} (qn3);
\end{tikzpicture}
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
%   \node[state,initial] (q_0)   {$0$}; 
%   \node[state, ] (q_1) [right=of q_0] {$1$}; 
%   \node[state, ] (q_2) [right=of q_1] {$2$}; 
%   \node[state,
%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
%    \path[->] 
%    (q_0) edge  node {a} (q_1)
%    	  edge [loop below] node {a,b} ()
%    (q_1) edge  node  {a,b} (q_2)
%    (q_2) edge  node  {a,b} (q_3);
%\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
	showing that NFA simulation can consume large
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
	strings of the form $aaa\ldots aaaabc$.
	When traversing in a breadth-first manner,
all states from 0 up to $n+1$ will become active.}
\end{figure}
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
%in terms of input string length.
%TODO:try out these lexers
These problems can of course be solved in matching algorithms where 
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabits of input per second
even with large counters \cite{Becchi08}.
These practical solutions do not come with
formal guarantees, however, and as pointed out by
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
%But formal reasoning about these automata especially in Isabelle 
%can be challenging
%and un-intuitive. 
%Therefore, we take correctness and runtime claims made about these solutions
%with a grain of salt.

       
In the work reported in \cite{FoSSaCS2023} and here, 
we add better support using derivatives
for the bounded regular expressions $r^{\{n\}}$.
Our results
extend straightforwardly to
repetitions with intervals such as 
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only small
inductive datatypes and recursive functions,
making it handy to deal with them in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
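As an informal preview of this extension (the precise definitions and the Isabelle/HOL formalisation come later), the following Python sketch shows a Brzozowski-style matcher where the derivative of $r^{\{n\}}$ simply counts down, without ever unfolding the repetition into $n$ explicit copies:

```python
# Regular expressions as tuples: ("zero",), ("one",), ("char", c),
# ("alt", r1, r2), ("seq", r1, r2), ("star", r), ("ntimes", r, n).
def nullable(r):
    tag = r[0]
    if tag == "zero": return False
    if tag == "one": return True
    if tag == "char": return False
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    if tag == "seq": return nullable(r[1]) and nullable(r[2])
    if tag == "star": return True
    if tag == "ntimes": return r[2] == 0 or nullable(r[1])

def der(c, r):
    tag = r[0]
    if tag in ("zero", "one"):
        return ("zero",)
    if tag == "char":
        return ("one",) if r[1] == c else ("zero",)
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        r1, r2 = r[1], r[2]
        if nullable(r1):
            return ("alt", ("seq", der(c, r1), r2), der(c, r2))
        return ("seq", der(c, r1), r2)
    if tag == "star":
        return ("seq", der(c, r[1]), r)
    if tag == "ntimes":
        r1, n = r[1], r[2]
        if n == 0:
            return ("zero",)
        # Countdown: one derivative step of r1, then n-1 copies remain.
        return ("seq", der(c, r1), ("ntimes", r1, n - 1))

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)

a2 = ("ntimes", ("char", "a"), 2)             # a^{2}
print(matches(a2, "aa"), matches(a2, "aaa"))  # True False
```

Note that the datatype and the two recursive functions are all that is needed; this smallness is what makes the extension pleasant to reason about in a theorem prover.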
       



\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is to choose  
a single transition each time, keeping all the other options in 
a queue or stack, and backtracking if that choice eventually 
fails. 
This method, often called a ``depth-first search'', 
is efficient in many cases, but can end up
with exponential runtime.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
%\section{Back-references and The Terminology Regex}

%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply 
%constructs the $\NFA$ states based on the structure of the input regular expression.

%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming 
%a character, and updating that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.

       
   574 Consider the following regular expression where the sequence
       
   575 operator is omitted for brevity:
       
   576 \begin{center}
       
   577 	$r_1r_2r_3r_4$
       
   578 \end{center}
       
   579 In this example,
       
   580 one could label sub-expressions of interest 
       
   581 by parenthesizing them and giving 
       
   582 them a number in the order in which their opening parentheses appear.
       
   583 One possible way of parenthesizing and labelling is: 
       
   584 \begin{center}
       
   585 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
       
   586 \end{center}
       
   587 The sub-expressions
       
   588 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
       
   589 by 1 to 4, and can be ``referred back'' by their respective numbers. 
       
   590 %These sub-expressions are called "capturing groups".
       
   591 To do so, one uses the syntax $\backslash i$ 
       
   592 to denote that we want the sub-string 
       
   593 of the input matched by the i-th
       
   594 sub-expression to appear again, 
       
   595 exactly the same as it first appeared: 
       
   596 \begin{center}
       
   597 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
       
   598 \underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
       
   599 \end{center}
       
   600 Once the sub-string $s_i$ for the sub-expression $r_i$
       
   601 has been fixed, there is no variability on what the back-reference
       
   602 label $\backslash i$ can be---it is tied to $s_i$.
       
   603 The overall string will look like $\ldots s_i \ldots s_i \ldots $
       
   604 %The backslash and number $i$ are the
       
   605 %so-called "back-references".
       
   606 %Let $e$ be an expression made of regular expressions 
       
   607 %and back-references. $e$ contains the expression $e_i$
       
   608 %as its $i$-th capturing group.
       
   609 %The semantics of back-reference can be recursively
       
   610 %written as:
       
   611 %\begin{center}
       
   612 %	\begin{tabular}{c}
       
   613 %		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
       
   614 %		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
       
   615 %	\end{tabular}
       
   616 %\end{center}
       
A concrete example
of a back-reference is
\begin{center}
$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from 
repeating the sub-expression verbatim as in
\begin{center}
	$(.^*)(.^*)$,
\end{center}
which does not impose any restriction on what strings the second 
sub-expression $.^*$
might match.
Another example of a back-reference is
\begin{center}
$(.)(.)\backslash 2\backslash 1$
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.
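Both examples can be checked directly in a backtracking engine. The following quick sanity check uses Python's \texttt{re} module; it is our own illustration and not part of the formal development of this thesis:

```python
import re

# (.*)\1 matches strings made of two identical halves.
assert re.fullmatch(r'(.*)\1', 'foofoo') is not None
assert re.fullmatch(r'(.*)\1', 'ww') is not None
assert re.fullmatch(r'(.*)\1', 'foobar') is None

# (.)(.)\2\1 matches four-character palindromes.
assert re.fullmatch(r'(.)(.)\2\1', 'abba') is not None
assert re.fullmatch(r'(.)(.)\2\1', 'x??x') is not None
assert re.fullmatch(r'(.)(.)\2\1', 'abab') is None
```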
       
Back-references are a regex construct 
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
A very common use of back-references
is to recognise well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like 
\begin{center}
	$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
is
\begin{center}
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
       
Despite being useful, back-references take the expressive power 
of regexes beyond the regular languages.
In fact, they allow regexes to express 
languages that are not even contained in the class of context-free
languages.
For example, the regex $(a^*)b\backslash1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by 
context-free grammars \cite{campeanu2003formal}.
Such a language is, however, contained in the context-sensitive level
of the formal language hierarchy. 
Moreover, the matching problem for regexes with back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references such as 
PCRE \cite{pcre} therefore have to
resort to depth-first search algorithms with backtracking.
What is unexpected is that even in cases 
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in figure \ref{fig:aStarStarb}.
       
Summing up, we can categorise existing 
practical regex libraries into two kinds:
(i) those with linear
time guarantees like Go and Rust. The downside is that
they impose restrictions
on the regular expressions (not allowing back-references, 
bounded repetitions cannot exceed an ad hoc limit, etc.);
and (ii) those 
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
The latter can potentially ``grind to a halt''
on some very simple cases, resulting 
in ReDoS attacks if exposed to the internet.

The problems with both approaches are the motivation for us 
to look again at the regular expression matching problem. 
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs, 
as we shall explain next.
       
%We would like to have regex engines that can 
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
% For example, the Rust regex engine claims to be linear, 
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.  
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
 %TODO: verify the fact Rust does not allow 1000+ reps

%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on 
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.
       
\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The most widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is adopted by many regular expression matchers
because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when a match exists,
or give results that are inconsistent with the POSIX standard.
       
A concrete example is the regex
\begin{center}
	$(aba + ab + a)^*$ \quad and the string \quad $ababa$.
\end{center}
The correct POSIX match for the above
involves the entire string $ababa$, 
split into two Kleene star iterations, namely $[ab]$ and $[aba]$ at positions
$[0, 2)$ and $[2, 5)$
respectively.
But feeding this example to the different engines
listed at regex101\footnote{
	regex101 is an online regular expression matcher which
	provides an API for trying out the regular expression
	engines of multiple popular programming languages like
Java, Python, Go, etc.} \parencite{regex101}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
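This behaviour is easy to reproduce with a backtracking engine. For instance, Python's \texttt{re} module (which uses greedy backtracking rather than POSIX disambiguation) commits to $aba$ in the first iteration of the star and then stops, reproducing the incomplete $[0, 3)$ match; this check is our own illustration:

```python
import re

# POSIX leftmost-longest matching would consume the whole string ababa,
# i.e. span (0, 5). A greedy backtracking engine instead picks aba in the
# first star iteration, finds no alternative matching at position 3
# (only 'ba' remains), and stops there.
m = re.match(r'(aba|ab|a)*', 'ababa')
assert m.span() == (0, 3)
```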
       
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
commented that most regex libraries do not
correctly implement the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}\it
	``The POSIX strategy is more complicated than the 
	greedy because of the dependence on information about 
	the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
People have recognised that the implementation complexity of the POSIX rules also comes from
the specification being not very precise.
The developers of the regexp package of Go 
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
``The POSIX rule is computationally prohibitive
and not even well-defined.''
\end{quote}
       
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell} 
described the POSIX rule as (section 1, last paragraph):
\begin{quote}
	\begin{itemize}
		\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there
\item
earlier subpatterns have leftmost-longest priority over later subpatterns
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns
\item
REs have right associative concatenation which can be changed with parenthesis
\item
parenthesized subexpressions return the match from their last usage
\item
text of component subexpressions must be contained in the text of the 
higher-level subexpressions
\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions
\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string
\end{itemize}
\end{quote}
       
%The text above 
%is trying to capture something very precise,
%and is crying out for formalising.
Ribeiro and Du Bois \cite{RibeiroAgda2017} have 
formalised the notion of bit-coded regular expressions
and proved their relation with simple regular expressions in
the dependently-typed proof assistant Agda.
They also proved the soundness and completeness of a matching algorithm
based on the bit-coded regular expressions.
Ausaf et al.\ \cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and also to prove 
that their specification coincides with an earlier (unformalised) 
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu needed to be substantially 
revised in order for the correctness proof to go through.
The original specification and proof were,
according to Ausaf et al., not fixable.
       
In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, on which the main results of this thesis build.
%We give a taste of what they 
%are like and why they are suitable for regular expression
%matching and lexing.
\section{Formal Specification of POSIX Matching 
and Brzozowski Derivatives}
       
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Brzozowski \cite{Brzozowski1964} first introduced the 
concept of a \emph{derivative} of a regular expression in 1964.
The derivative of a regular expression $r$
with respect to a character $c$ is written as $r \backslash c$.
This operation tells us what $r$ transforms into
if we ``chop'' off a particular character $c$ 
from all strings in the language of $r$ (defined
later as $L \; r$).
%To give a flavour of Brzozowski derivatives, we present
%two straightforward clauses from it:
%\begin{center}
%	\begin{tabular}{lcl}
%		$d \backslash c$     & $\dn$ & 
%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
%	\end{tabular}
%\end{center}
%\noindent
%The first clause says that for the regular expression
%denoting a singleton set consisting of a single-character string $\{ d \}$,
%we check the derivative character $c$ against $d$,
%returning a set containing only the empty string $\{ [] \}$
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
%The second clause states that to obtain the regular expression
%representing all strings' head character $c$ being chopped off
%from $r_1 + r_2$, one simply needs to recursively take derivative
%of $r_1$ and $r_2$ and then put them together.
Derivatives have the property
that $s \in L \; (r\backslash c)$ if and only if 
$c::s \in L \; r$, where $::$ stands for list prepending.
%This property can be used on regular expressions
%matching and lexing--to test whether a string $s$ is in $L \; r$,
%one simply takes derivatives of $r$ successively with
%respect to the characters (in the correct order) in $s$,
%and then test whether the empty string is in the last regular expression.
With this property, derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).  
%This makes formally reasoning about these properties such
%as correctness and complexity smooth and intuitive.
There are several mechanised proofs of this property in various theorem
provers,
for example one by Owens and Slind \cite{Owens2008} in HOL4,
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
yet another in Coq by Coquand and Siles \cite{Coquand2012}.
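To give a feel for how little machinery this requires, the following is a minimal, unverified matcher built from the derivative clauses. The tuple encoding of regular expressions is our own choice for this sketch; it is not the formalised algorithm developed in this thesis:

```python
# Regular expressions as nested tuples:
# ('0',) = ZERO, ('1',) = ONE, ('c', ch) = character,
# ('+', r1, r2) = alternative, ('.', r1, r2) = sequence, ('*', r) = star
ZERO, ONE = ('0',), ('1',)

def nullable(r):          # does r match the empty string?
    tag = r[0]
    if tag in ('0', 'c'): return False
    if tag in ('1', '*'): return True
    if tag == '+': return nullable(r[1]) or nullable(r[2])
    if tag == '.': return nullable(r[1]) and nullable(r[2])

def der(c, r):            # Brzozowski derivative of r w.r.t. character c
    tag = r[0]
    if tag in ('0', '1'): return ZERO
    if tag == 'c': return ONE if r[1] == c else ZERO
    if tag == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if tag == '.':
        d = ('.', der(c, r[1]), r[2])
        return ('+', d, der(c, r[2])) if nullable(r[1]) else d
    if tag == '*': return ('.', der(c, r[1]), r)

def matches(r, s):        # r matches s iff the iterated derivative is nullable
    for c in s:
        r = der(c, r)
    return nullable(r)

# (a*)* . b from the start of this chapter: matches aa...ab but not aa...a
r = ('.', ('*', ('*', ('c', 'a'))), ('c', 'b'))
assert matches(r, 'aaab')
assert not matches(r, 'aaa')
```

Note that without simplification the intermediate derivatives of such a matcher grow quickly, which is exactly the problem discussed later in this chapter.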
       
In addition, one can extend derivatives to bounded repetitions
relatively straightforwardly. For example, the derivative of 
$r^{\{n\}}$ can be defined as:
\begin{center}
	\begin{tabular}{lcl}
		$r^{\{n\}} \backslash c$     & $\dn$ & 
		$(r \backslash c) \cdot r^{\{n-1\}}$ \quad (when $n > 0$)\\
	\end{tabular}
\end{center}
\noindent
Experimental results suggest that, unlike DFA-based solutions
for bounded regular expressions,
derivatives can cope with
large counters
quite well.
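The clause above can be sketched in a small, unverified derivative matcher as follows; the tuple encoding (with a `('rep', r, n)` node for $r^{\{n\}}$) is our own choice for illustration:

```python
# A bounded-repetition node ('rep', r, n) on top of tuple-encoded
# derivatives: r{n} is nullable iff n = 0 or r is nullable, and its
# derivative unfolds one copy of r and decrements the counter.
ZERO, ONE = ('0',), ('1',)

def nullable(r):
    tag = r[0]
    if tag in ('0', 'c'): return False
    if tag == '1': return True
    if tag == '+': return nullable(r[1]) or nullable(r[2])
    if tag == '.': return nullable(r[1]) and nullable(r[2])
    if tag == 'rep': return r[2] == 0 or nullable(r[1])

def der(c, r):
    tag = r[0]
    if tag in ('0', '1'): return ZERO
    if tag == 'c': return ONE if r[1] == c else ZERO
    if tag == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if tag == '.':
        d = ('.', der(c, r[1]), r[2])
        return ('+', d, der(c, r[2])) if nullable(r[1]) else d
    if tag == 'rep':
        if r[2] == 0: return ZERO                    # r{0} \ c = ZERO
        return ('.', der(c, r[1]), ('rep', r[1], r[2] - 1))

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)

# a{3} matches exactly three a's
a3 = ('rep', ('c', 'a'), 3)
assert matches(a3, 'aaa')
assert not matches(a3, 'aa') and not matches(a3, 'aaaa')
```

The counter is stored directly in the syntax tree, so unlike a DFA construction no state explosion happens up front when the bound is large.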
       
There have also been 
extensions of derivatives to other regex constructs.
For example, Owens et al.\ include the derivative
for the \emph{NOT} regular expression, which is
able to concisely express C-style comments of the form
$/* \ldots */$ (see \cite{Owens2008}).
Another extension of derivatives covers
regular expressions with look-aheads, done
by Miyazaki and Minamide
\cite{Takayuki2019}.
       
%We therefore use Brzozowski derivatives on regular expressions 
%lexing 

Given the above definitions and properties of
Brzozowski derivatives, one quickly realises their potential
for generating a formally verified lexing algorithm: the clauses and the property
can be expressed in a functional programming language 
or converted to theorem prover
code with great ease.
Perhaps this is the reason why derivatives have sparked quite a bit of interest
in the functional programming and theorem prover communities in the last
fifteen or so years (see
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
\cite{Chen12} and \cite{Coquand2012},
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
after they were first published by Brzozowski.
       
However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not.  This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers). 
Second, derivative-based matchers need to keep the sizes
of the derivatives under control in order to be efficient.
Elegant and beautiful
as many implementations are,
they can be excruciatingly slow. 
For example, Sulzmann and Lu
claim a linear running time for their proposed algorithm,
but that claim was falsified by our experiments. The running time 
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ 
is made for the Verbatim \cite{Verbatim}
lexer, which calculates POSIX matches and is based on derivatives.
They formalised the correctness of the lexer, but not their complexity result.
In the performance evaluation section, they analysed the run time
of matching $a$ with the string 
\begin{center}
	$\underbrace{a \ldots a}_{\text{n a's}}$.
\end{center}
\noindent
They concluded that the algorithm is quadratic in terms of 
the length of the input string.
When we tried out their extracted OCaml code with the example $(a+aa)^*$,
the time it took to match a string of 40 $a$'s was approximately 5 minutes.
       
\subsection{Sulzmann and Lu's Algorithm}
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
problem with the yes/no answer 
by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function. This injection function in a sense undoes the ``damage''
of the derivatives chopping off characters.
In earlier work, Ausaf et al.\ provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of this extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.
       
The version of the algorithm proven correct
suffers however heavily from a 
second difficulty, namely that derivatives can
grow arbitrarily large. 
For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like 

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we usually run out of memory on a
typical computer.  Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in 
the work by Ausaf et al.\ to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.
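The growth is easy to observe empirically with a naive, unsimplified derivative implementation. The following is a tuple-encoded sketch of our own; the node counts depend on the encoding and differ from the sizes quoted above, but the trend is the same:

```python
ZERO, ONE = ('0',), ('1',)

def nullable(r):
    tag = r[0]
    if tag in ('0', 'c'): return False
    if tag in ('1', '*'): return True
    if tag == '+': return nullable(r[1]) or nullable(r[2])
    if tag == '.': return nullable(r[1]) and nullable(r[2])

def der(c, r):   # Brzozowski derivative, no simplification
    tag = r[0]
    if tag in ('0', '1'): return ZERO
    if tag == 'c': return ONE if r[1] == c else ZERO
    if tag == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if tag == '.':
        d = ('.', der(c, r[1]), r[2])
        return ('+', d, der(c, r[2])) if nullable(r[1]) else d
    if tag == '*': return ('.', der(c, r[1]), r)

def size(r):     # number of nodes in the syntax tree
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

# (a + aa)*, derived repeatedly by the character a
r = ('*', ('+', ('c', 'a'), ('.', ('c', 'a'), ('c', 'a'))))
sizes = []
for _ in range(12):
    r = der('a', r)
    sizes.append(size(r))

# Without simplification, each derivative contains its predecessor
# as a subterm, so the sizes grow strictly at every step.
assert all(s1 < s2 for s1, s2 in zip(sizes, sizes[1:]))
```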
       
Therefore we want to look in this thesis at a second
algorithm by Sulzmann and Lu in which they
overcame this ``growth problem'' \cite{Sulzmann2014}.
In this version, POSIX values are 
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):

\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method [..] in combination with the simplification steps [..]
  yields POSIX parse trees. We have tested this claim
  extensively [..] but yet
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}  
Ausaf and Urban made some initial progress towards the 
full correctness proof but still had to leave out the optimisation
Sulzmann and Lu proposed.
Ausaf wrote \cite{Ausaf}:
\begin{quote}\it
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
\end{quote}  
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
together with a formal proof of the correctness of those simplifications.
       
One of the most recent works in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work for us and we will compare it later with 
the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives, 
but deterministic finite automata instead.
We will also study this work in a later section.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.
\section{Basic Concepts}
Formal language theory usually starts with an alphabet 
denoting a set of characters.
Here we use the datatype of characters from Isabelle,