% ChengsongTanPhdThesis/Chapters/Overview.tex
       
\chapter{Technical Overview}
\label{Overview}

We start with a technical overview of the
catastrophic backtracking problem,
motivating rigorous approaches to regular expression matching and lexing.
In doing so we also briefly
introduce common terminology such as
bounded repetitions and back-references.

       
\section{Motivating Examples}
Consider for example the regular expression $(a^*)^*\,b$ and
strings of the form $aa..a$. These strings cannot be matched by this regular
expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of length of around 30 $a$'s,
the decision takes a large amount of time to finish,
which is out of all proportion
to the simplicity of the input (see graphs in Figure~\ref{fig:aStarStarb}).
The algorithms clearly show exponential behaviour, which, as can be seen,
is triggered by some relatively simple regular expressions.
Java 9 and newer
versions mitigate this behaviour by several orders of magnitude,
but are still slow compared
with the approach we are going to use in this thesis.
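The blowup can be observed directly. The following sketch (ours, not part of the thesis artefacts) uses Python's backtracking \texttt{re} module to time the rejection of strings $a^n$ against the evil pattern $(a^*)^*\,b$; the measured time roughly doubles with each additional $a$:

```python
# A small demonstration (ours, not from the thesis) of catastrophic
# backtracking: Python's backtracking re engine rejecting strings of
# a's against the evil pattern (a*)*b.  The expected b is missing, but
# ruling out a match explores exponentially many ways of splitting the
# a's between the inner and the outer star.
import re
import time

PATTERN = re.compile(r'(a*)*b')

def rejection_time(n):
    """Time how long the engine needs to reject the string a^n."""
    start = time.perf_counter()
    result = PATTERN.match('a' * n)
    elapsed = time.perf_counter() - start
    assert result is None        # the expected b is missing
    return elapsed

# The rejection time grows roughly like 2^n.
t_small, t_large = rejection_time(10), rejection_time(18)
print(t_small, t_large)
```

With $n$ around 30 the call effectively no longer terminates, which is the behaviour plotted in Figure~\ref{fig:aStarStarb}.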
       

This superlinear blowup in regular expression engines
has caused grief in ``real life'' where it is
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
A less serious example is a bug in the Atom editor:
a user found out that when writing the following code
\begin{verbatim}
   vVar.Type().Name() == "" && vVar.Kind() == reflect.Ptr
    && vVar.Type().Elem().Name() == "" && vVar.Type().Elem().Kind() ==
    reflect.Slice
\end{verbatim}
it took the editor half an hour to finish computing.
This was subsequently fixed by Galbraith \cite{fixedAtom},
and the issue was due to the regex engine of Atom.
But evil regular expressions can be more than a nuisance in a text editor:
on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was
$O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site become unavailable.

       
\begin{figure}[p]
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Dart},
    legend pos=north west,
    legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Swift},
    legend pos=north west,
    legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=true,
    %xtick={0,5000,...,40000},
    %xmax=40000,
    %ymax=35,
    restrict x to domain*=0:40000,
    restrict y to domain*=0:35,
    %ytick={0,5,...,30},
    %scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java9+},
    legend pos=north west,
    legend cell align=left]
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=true,
    %xtick={0,5,...,30},
    %xmax=33,
    %ymax=35,
    restrict x to domain*=0:60000,
    restrict y to domain*=0:35,
    %ytick={0,5,...,30},
    %scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Scala},
    legend pos=north west,
    legend cell align=left]
\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=true,
    %xtick={0,5000,...,40000},
    %xmax=40000,
    %ymax=35,
    restrict x to domain*=0:60000,
    restrict y to domain*=0:45,
    %ytick={0,5,...,30},
    %scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend style={cells={align=left}},
    legend entries={Isabelle \\ Extracted},
    legend pos=north west,
    legend cell align=left]
\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs}
\end{tabular}
\end{center}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
   of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   The reason for their fast growth %superlinear behaviour
   is that they do a depth-first search
   using NFAs.
   If the string does not match, the regular expression matching
   engine starts to explore all possibilities.
   The last two graphs are for our implementation in Scala, one manual
   and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
   Our lexer
   performs better in this case, and
   is formally verified.
   Despite being almost identical, the codegen-generated lexer
   %is generated directly from Isabelle using $\textit{codegen}$,
   is slower than the manually written version since the code synthesised by
   $\textit{codegen}$ does not use the native integer or string
   types of the target language.
   %Note that we are comparing our \emph{lexer} against other languages' matchers.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
       

A more recent example is a global outage of all Cloudflare servers on
2 July 2019.
The traffic Cloudflare services each day is more than
Twitter, Amazon, Instagram, Apple, Bing and Wikipedia combined, yet
it became completely unavailable for half an hour because
a poorly-written regular expression roughly of the form $.^*.^*=.^*$
exhausted the CPU resources that serve HTTP traffic.
Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very rarely,
%but actually widespread.
%They occur so often that they have a
but they actually occur often enough that they have a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\cite{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions
are a real problem rather than just ``a parlour trick''.

The work in this thesis aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness
and a finiteness property for the size of derivatives
(all carried out in Isabelle/HOL).
Such properties %guarantee the absence of
are an important step in preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in Figure~\ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our
approach based on Brzozowski derivatives and formal proofs.
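As a first taste of the derivative-based approach, the following plain-Python sketch (our own illustration with a few standard simplification rules, not the verified Isabelle/Scala code of this thesis) matches by repeatedly taking Brzozowski derivatives. It rejects the strings $a^n$ against $(a^*)^*\,b$ without any backtracking:

```python
# A minimal Brzozowski-derivative matcher for the basic regular
# expressions c, r1 + r2, r1 . r2 and r* (our sketch, not the thesis's
# verified implementation).  Frozen dataclasses give us structural
# equality, which the simplification rules below rely on.
from dataclasses import dataclass

@dataclass(frozen=True)
class Zero: pass                  # matches nothing
@dataclass(frozen=True)
class One: pass                   # matches only the empty string
@dataclass(frozen=True)
class Chr:
    c: str
@dataclass(frozen=True)
class Alt:
    r1: object
    r2: object
@dataclass(frozen=True)
class Seq:
    r1: object
    r2: object
@dataclass(frozen=True)
class Star:
    r: object

def nullable(r):
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False                  # Zero and Chr

# Smart constructors: a few simplifications keeping derivatives small.
def alt(r1, r2):
    if isinstance(r1, Zero): return r2
    if isinstance(r2, Zero): return r1
    return r1 if r1 == r2 else Alt(r1, r2)

def seq(r1, r2):
    if isinstance(r1, Zero) or isinstance(r2, Zero): return Zero()
    if isinstance(r1, One): return r2
    if isinstance(r2, One): return r1
    return Seq(r1, r2)

def der(ch, r):
    """The Brzozowski derivative of r with respect to character ch."""
    if isinstance(r, Chr):
        return One() if r.c == ch else Zero()
    if isinstance(r, Alt):
        return alt(der(ch, r.r1), der(ch, r.r2))
    if isinstance(r, Seq):
        d = seq(der(ch, r.r1), r.r2)
        return alt(d, der(ch, r.r2)) if nullable(r.r1) else d
    if isinstance(r, Star):
        return seq(der(ch, r.r), r)
    return Zero()                 # Zero and One

def matches(r, s):
    for ch in s:
        r = der(ch, r)
    return nullable(r)

# The evil expression (a*)*b from the motivating example.
evil = Seq(Star(Star(Chr('a'))), Chr('b'))
print(matches(evil, 'a' * 1000))        # rejected quickly
print(matches(evil, 'a' * 1000 + 'b'))  # accepted
```

In contrast to the backtracking engines above, each input character causes exactly one derivative step, so runtime is linear in the string length as long as the derivatives are kept small by simplification---which is precisely what the finiteness property of this thesis is about.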
       

\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have been studied for many years.
Theoretical results in automata theory state
that basic regular expression matching should be linear
with respect to the length of the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*
\]
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions,
negations,
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs, while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at basic regular expressions
and bounded repetitions.
       

%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%has the back-references
Regexes come with a number of constructs
that make it more convenient for
programmers to write regular expressions.
Depending on the type of construct,
the task of matching and lexing with them
will have different levels of complexity.
Some of those constructs are syntactic sugar, that is,
simply shorthand notations
that save the programmers a few keystrokes.
These will not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
Similarly, the range operator
%used to express the alternative
%of all characters between its operands,
is just a concise way
of expressing an alternative of consecutive characters:
\[
	[0-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
\]
The
wildcard character '$.$' refers to any single character except the newline:
\[
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]
       

\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can
make regular expressions much
more compact.
Normally there are four kinds of bounded repetitions:
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
As with the Kleene star, the set of strings, or language,
that a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
	\begin{tabular}{lcl}
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
	\end{tabular}
\end{center}
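The power operation above can be made concrete with a small illustration (a plain-Python sketch of the set definitions, not part of the thesis's formalisation):

```python
# Computing the language of r^{n..m} directly from the set definitions,
# for a finite language L_r of the body r.
def lang_pow(L, n):
    """(L)^n: all concatenations of n strings drawn from L."""
    result = {''}
    for _ in range(n):
        result = {u + v for u in result for v in L}
    return result

def lang_bounded(L, n, m):
    """Language of r^{n..m}: the union of the powers from n to m."""
    out = set()
    for i in range(n, m + 1):
        out |= lang_pow(L, i)
    return out

L_r = {'a', 'b'}                       # pretend L r = {a, b}
print(sorted(lang_bounded(L_r, 0, 2)))  # 7 strings of length <= 2
```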
       
The attraction of bounded repetitions is that they can be
used to avoid a size blow up: for example $r^{\{n\}}$
is a shorthand for
the much longer regular expression:
\[
	\underbrace{r\ldots r}_\text{$n$ copies of $r$}.
\]
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.

       
The problem with matching
such bounded repetitions
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
in such situations
either become very slow
(for example Verbatim++ \cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
in C and Java that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given a regular expression containing bounded repetitions,
even a small counter $n$ (say 20) can result in a program representing a
DFA
with millions of states.}) for large counters.
A classic example of this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
the corresponding $\mathit{NFA}$ looks like:
\vspace{6mm}
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0)   {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
   \node[state, red] (q_2) [right=of q_1] {$q_2$};
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->]
    (q_0) edge  node {a} (q_1)
    	  edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
and when turned into a DFA by the subset construction
requires at least $2^3$ states.\footnote{The
red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a+b)^*$ part of $(a+b)^*a(a+b)^{\{2\}}$,
and just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a+b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
The state $q_3$ is the last (accepting) state, requiring 0
more characters.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
counters, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if the language of $r$ contains
more than one string.
This is to represent all the different
scenarios in which ``countdown'' states are active.}
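This state blowup can be reproduced with a few lines of code. The sketch below (ours, with a hand-coded NFA for $(a+b)^*a(a+b)^{n}$ whose states are numbered $0$ to $n+1$ as in the picture above) performs the subset construction and counts the reachable DFA states:

```python
# Subset construction for the NFA of (a|b)* a (a|b)^n:
# state 0 loops on a,b and moves to 1 on a; the "countdown" states
# 1..n advance on any character; state n+1 is accepting.
def dfa_states(n):
    """Count the DFA states reachable by the subset construction."""
    start = frozenset({0})

    def step(S, c):
        T = {0}                    # state 0 always loops on a and b
        if c == 'a':
            T.add(1)               # reading a can also leave the loop
        for q in S:
            if 1 <= q <= n:        # countdown states advance on any char
                T.add(q + 1)
        return frozenset(T)

    seen, todo = {start}, [start]
    while todo:
        S = todo.pop()
        for c in 'ab':
            T = step(S, c)
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return len(seen)

print([dfa_states(n) for n in (1, 2, 3)])
```

The counts come out as $2^{n+1}$, confirming that the DFA must remember, for each of the last $n+1$ characters, whether it was an $a$.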
       

Bounded repetitions are important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.},
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
more than half of the
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
close to ten million.
A smaller sample XSD they gave
is:
\lstset{
	basicstyle=\fontsize{8.5}{9}\ttfamily,
  language=XML,
  morekeywords={encoding,
    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
}
\begin{lstlisting}
<sequence minOccurs="0" maxOccurs="65535">
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
 	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{lstlisting}
This can be seen as the regex
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).
It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
for bounded regular expressions:
for example, in the regular expression matching library in the Go
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.\footnote{Try it out here: \url{https://rustexp.lpil.uk}}
As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that they simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states could
be equal to the counter number.
When the counters are large,
the memory requirement could become
infeasible, and a regex engine
like Go's will reject such a pattern straight away.
       
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]

    	\node (q0) [state, initial] {$0$};
	\node (q1) [state, right = of q0] {$1$};
	%\node (q2) [state, right = of q1] {$2$};
	\node (qdots) [right = of q1] {$\ldots$};
	\node (qn) [state, right = of qdots] {$n$};
	\node (qn1) [state, right = of qn] {$n+1$};
	\node (qn2) [state, right = of qn1] {$n+2$};
	\node (qn3) [state, accepting, right = of qn2] {$n+3$};

\path [-stealth, thick]
	(q0) edge [loop above] node {a} ()
    (q0) edge node {a}   (q1)
    %(q1) edge node {.}   (q2)
    (q1) edge node {.}   (qdots)
    (qdots) edge node {.} (qn)
    (qn) edge node {.} (qn1)
    (qn1) edge node {b} (qn2)
    (qn2) edge node {$c$} (qn3);
\end{tikzpicture}
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
%   \node[state,initial] (q_0)   {$0$};
%   \node[state, ] (q_1) [right=of q_0] {$1$};
%   \node[state, ] (q_2) [right=of q_1] {$2$};
%   \node[state,
%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
%    \path[->]
%    (q_0) edge  node {a} (q_1)
%    	  edge [loop below] node {a,b} ()
%    (q_1) edge  node  {a,b} (q_2)
%    (q_2) edge  node  {a,b} (q_3);
%\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley,
	showing that NFA simulation can consume large
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
	strings of the form $aaa\ldots aaaabc$.
	When traversing in a breadth-first manner,
	all states from $0$ up to $n+1$ will become active.}\label{fig:inj}

\end{figure}
       
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
%in terms of input string length.
%TODO:try out these lexers
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabits of input per second
even with large counters \cite{Becchi08}.
However, these practical solutions do not come with
formal guarantees, and as pointed out by
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
%But formal reasoning about these automata especially in Isabelle
%can be challenging
%and un-intuitive.
%Therefore, we take correctness and runtime claims made about these solutions
%with a grain of salt.

In the work reported in \cite{ITP2023} and here,
we add better support using derivatives
for bounded regular expressions such as $r^{\{n\}}$.
Our results
extend straightforwardly to
repetitions with intervals such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover, these extensions are still made up of only small
inductive datatypes and recursive functions,
making it handy to deal with them in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
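As a sketch of why derivatives extend so naturally to bounded repetitions (this is one standard clause, simplified; the precise definitions, including the case $n = 0$, are given later in the thesis), the derivative of $r^{\{n\}}$ simply decrements the counter rather than unfolding $n$ copies of $r$:
\[
	\textit{der}\;c\;(r^{\{n\}}) \;\dn\; (\textit{der}\;c\;r)\cdot r^{\{n-1\}} \qquad (n > 0)
\]
No automaton with $n$ copies of states is ever constructed; the counter lives inside the regular expression datatype itself.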
       
       
\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is to choose
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails.
This method, often called a ``depth-first search'',
is efficient in many cases, but could end up
with exponential runtime.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
       
   623 %\section{Back-references and The Terminology Regex}
       
   624 
       
   625 %When one constructs an $\NFA$ out of a regular expression
       
   626 %there is often very little to be done in the first phase, one simply 
       
   627 %construct the $\NFA$ states based on the structure of the input regular expression.
       
   628 
       
   629 %In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
       
   630 %one by keeping track of all active states after consuming 
       
   631 %a character, and update that set of states iteratively.
       
   632 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$
       
   633 %for a path terminating
       
   634 %at an accepting state.
       
   635 
       
   636 
       
   637 
       
Consider the following regular expression where the sequence
operator is omitted for brevity:
\begin{center}
	$r_1r_2r_3r_4$
\end{center}
In this example,
one could label sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions
$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' to by their respective numbers.
%These sub-expressions are called "capturing groups".
To do so, one uses the syntax $\backslash i$
to denote that we want the sub-string
of the input matched by the $i$-th
sub-expression to appear again,
exactly the same as it first appeared:
\begin{center}
$\ldots\underset{\text{$i$-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
\end{center}
Once the sub-string $s_i$ for the sub-expression $r_i$
has been fixed, there is no variability in what the back-reference
label $\backslash i$ can match---it is tied to $s_i$.
The overall string will look like $\ldots s_i \ldots s_i \ldots$
       
%The backslash and number $i$ are the
%so-called "back-references".
%Let $e$ be an expression made of regular expressions
%and back-references. $e$ contains the expression $e_i$
%as its $i$-th capturing group.
%The semantics of back-reference can be recursively
%written as:
%\begin{center}
%	\begin{tabular}{c}
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
%	\end{tabular}
%\end{center}
A concrete example
of a back-reference is
\begin{center}
$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from
repeating the sub-expression verbatim as in
\begin{center}
	$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
Another example of a back-reference is
\begin{center}
$(.)(.)\backslash 2\backslash 1$
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.
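Both examples can be replayed against a realistic backtracking engine; the following is a quick illustration using Python's \texttt{re} library (not part of our formal development):

```python
import re

# (.*)\1 : the back-reference forces the second half to repeat the first
double = re.compile(r"(.*)\1")
assert double.fullmatch("foofoo") is not None
assert double.fullmatch("ww") is not None
assert double.fullmatch("foobar") is None    # the two halves differ

# (.)(.)\2\1 : four-character palindromes
palin = re.compile(r"(.)(.)\2\1")
assert palin.fullmatch("abba") is not None
assert palin.fullmatch("x??x") is not None
assert palin.fullmatch("abab") is None       # not a palindrome
```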
       

Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
A common use of back-references
is to express well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like
\begin{center}
	$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
is
\begin{center}
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
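In Python's regex dialect, for instance, such a tag pattern can be written and tested as follows (a simplified illustration; matching real HTML of course needs more than this):

```python
import re

# \1 requires the closing tag to repeat the captured opening tag name
tag = re.compile(r"<(.+)>.*</\1>")
assert tag.fullmatch("<html>some text</html>") is not None
assert tag.fullmatch("<html>some text</body>") is None   # mismatched tags
```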
       
Despite being useful, the expressive power of regexes
goes beyond regular languages
once back-references are included.
In fact, back-references allow regexes to express
languages that are not contained in the class of context-free
languages either.
For example, the back-reference $(a^*)b\backslash 1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n \mid n \in \mathbb{N}\}$,
which cannot be expressed by
context-free grammars \cite{campeanu2003formal}.
Such a language is contained in the context-sensitive level
of the formal language hierarchy.
Also, solving the matching problem involving back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
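The context-sensitive example above can again be checked with Python's \texttt{re} engine (an illustration only):

```python
import re

# (a*)b\1b\1 : the captured block of a's must occur three times in total,
# giving the non-context-free language {a^n b a^n b a^n | n >= 0}
anban = re.compile(r"(a*)b\1b\1")
assert anban.fullmatch("aabaabaa") is not None   # n = 2
assert anban.fullmatch("bb") is not None         # n = 0
assert anban.fullmatch("aabaaba") is None        # blocks of unequal length
```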
       
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
revert to a depth-first search algorithm that includes backtracking.
What is unexpected is that even in cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in figure \ref{fig:aStarStarb}.
       

Summing up, we can categorise existing
practical regex libraries into two kinds:
(i) those with linear
time guarantees, like Go and Rust. The downside is that
they impose restrictions
on the regular expressions (not allowing back-references,
bounded repetitions cannot exceed an ad hoc limit, etc.);
and (ii) those
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
The latter can potentially ``grind to a halt''
on some very simple cases, resulting in
ReDoS attacks if exposed to the internet.

The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.
       

%We would like to have regex engines that can
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps

%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.

       
\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is widely adopted in many regular expression matchers
because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
       
A concrete example is the regex
\begin{center}
	$(aba + ab + a)^* \;\text{and the string}\; ababa$
\end{center}
The correct POSIX match for the above
involves the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
But feeding this example to the different engines
listed at regex101\footnote{
	regex101 is an online regular expression matcher which
	provides an API for trying out the regular expression
	engines of multiple popular programming languages like
	Java, Python, Go, etc.} \parencite{regex101}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
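This incorrect behaviour is easy to reproduce. For example, Python's backtracking \texttt{re} engine (one of those available via regex101) returns the short first match; a quick check, not part of our formal development, with the alternation written in Python's \texttt{|} syntax:

```python
import re

# A POSIX-respecting matcher would match the whole string "ababa"
# (star iterations [ab][aba]); Python's backtracking engine stops early:
m = re.match(r"(aba|ab|a)*", "ababa")
assert m.span() == (0, 3)      # only [aba] is matched, not all of ababa
assert m.group(0) == "aba"
```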
       
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries do not
correctly implement the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}\it
	``The POSIX strategy is more complicated than the
	greedy because of the dependence on information about
	the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
People have recognised that the implementation complexity of POSIX rules also comes from
the specification being not very precise.
The developers of the regexp package of Go%
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
``The POSIX rule is computationally prohibitive
and not even well-defined.''
\end{quote}
       
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as (section 1, last paragraph):
\begin{quote}
	\begin{itemize}
		\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
\item
REs have right associative concatenation which can be changed with parenthesis\\
\item
parenthesized subexpressions return the match from their last usage\\
\item
text of component subexpressions must be contained in the text of the
higher-level subexpressions\\
\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.