% ChengsongTanPhdThesis/Chapters/Inj.tex
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we 
% used for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed. 
We start with a technical overview of the catastrophic backtracking problem,
motivating rigorous approaches to regular expression matching and lexing.
Then we introduce the basic notations and definitions of our problem.

\section{Technical Overview}
Consider for example the regular expression $(a^*)^*\,b$ and 
strings of the form $aa..a$. These strings cannot be matched by this regular
expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can detect this very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings, 
say of length around 30 $a$'s,
the decision takes a large amount of time to finish.
This is out of all proportion
to the simplicity of the input (see graphs in Figure~\ref{fig:aStarStarb}).
The algorithms clearly show exponential behaviour, which, as can be seen,
is triggered by relatively simple regular expressions.
Java 9 and newer
versions mitigate this behaviour by several orders of magnitude, 
but are still slow compared 
with the approach we are going to use in this thesis.
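To make the blow-up concrete, the following is a minimal Python sketch (our own illustration; the implementations discussed in this thesis are written in Scala and Isabelle/HOL, and all names here are ours) of a naive depth-first backtracking matcher specialised to $(a^*)^*\,b$, counting how many continuation attempts it makes:

```python
# Naive backtracking matcher, specialised to the regex (a*)* b.
# `steps` counts matching attempts; on strings of n a's with no
# final b it makes more than 2^n attempts -- the catastrophic case.

def match_astarstar_b(s):
    steps = 0

    def inner_star(i, k):
        # match a* starting at position i, then try continuation k;
        # backtracks over every possible number of a's consumed
        nonlocal steps
        j = i
        while True:
            steps += 1
            if k(j):                    # try stopping the a* here
                return True
            if j < len(s) and s[j] == 'a':
                j += 1                  # consume one more 'a' and retry
            else:
                return False            # no more choices: backtrack

    def outer_star(i):
        # match (a*)* starting at i, followed by the final 'b'
        nonlocal steps
        steps += 1
        if i < len(s) and s[i] == 'b':
            return i + 1 == len(s)
        # one more iteration of a*, which must consume at least one char
        return inner_star(i, lambda j: j > i and outer_star(j))

    return outer_star(0), steps
```

On pure-$a$ inputs the attempt count roughly quadruples with every two additional characters, mirroring the exponential curves in the graphs below.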
       
    31 
       
    32 
       
    33 
       
    34 This superlinear blowup in regular expression engines
       
    35 has caused grief in ``real life'' where it is 
       
    36 given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
       
    37 For example, on 20 July 2016 one evil
       
    38 regular expression brought the webpage
       
    39 \href{http://stackexchange.com}{Stack Exchange} to its
       
    40 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
       
    41 In this instance, a regular expression intended to just trim white
       
    42 spaces from the beginning and the end of a line actually consumed
       
    43 massive amounts of CPU resources---causing the web servers to grind to a
       
    44 halt. In this example, the time needed to process
       
    45 the string was 
       
    46 $O(n^2)$ with respect to the string length $n$. This
       
    47 quadratic overhead was enough for the homepage of Stack Exchange to
       
    48 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
       
    49 attack and therefore stopped the servers from responding to any
       
requests. This made the whole site unavailable. 
       
    51 
       
    52 \begin{figure}[p]
       
    53 \begin{center}
       
    54 \begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
       
    55 \begin{tikzpicture}
       
    56 \begin{axis}[
       
    57     xlabel={$n$},
       
    58     x label style={at={(1.05,-0.05)}},
       
    59     ylabel={time in secs},
       
    60     enlargelimits=false,
       
    61     xtick={0,5,...,30},
       
    62     xmax=33,
       
    63     ymax=35,
       
    64     ytick={0,5,...,30},
       
    65     scaled ticks=false,
       
    66     axis lines=left,
       
    67     width=5cm,
       
    68     height=4cm, 
       
    69     legend entries={JavaScript},  
       
    70     legend pos=north west,
       
    71     legend cell align=left]
       
    72 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
       
    73 \end{axis}
       
    74 \end{tikzpicture}
       
    75   &
       
    76 \begin{tikzpicture}
       
    77 \begin{axis}[
       
    78     xlabel={$n$},
       
    79     x label style={at={(1.05,-0.05)}},
       
    80     %ylabel={time in secs},
       
    81     enlargelimits=false,
       
    82     xtick={0,5,...,30},
       
    83     xmax=33,
       
    84     ymax=35,
       
    85     ytick={0,5,...,30},
       
    86     scaled ticks=false,
       
    87     axis lines=left,
       
    88     width=5cm,
       
    89     height=4cm, 
       
    90     legend entries={Python},  
       
    91     legend pos=north west,
       
    92     legend cell align=left]
       
    93 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
    94 \end{axis}
       
    95 \end{tikzpicture}\\ 
       
    96 \begin{tikzpicture}
       
    97 \begin{axis}[
       
    98     xlabel={$n$},
       
    99     x label style={at={(1.05,-0.05)}},
       
   100     ylabel={time in secs},
       
   101     enlargelimits=false,
       
   102     xtick={0,5,...,30},
       
   103     xmax=33,
       
   104     ymax=35,
       
   105     ytick={0,5,...,30},
       
   106     scaled ticks=false,
       
   107     axis lines=left,
       
   108     width=5cm,
       
   109     height=4cm, 
       
   110     legend entries={Java 8},  
       
   111     legend pos=north west,
       
   112     legend cell align=left]
       
   113 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   114 \end{axis}
       
   115 \end{tikzpicture}
       
   116   &
       
   117 \begin{tikzpicture}
       
   118 \begin{axis}[
       
   119     xlabel={$n$},
       
   120     x label style={at={(1.05,-0.05)}},
       
   121     %ylabel={time in secs},
       
   122     enlargelimits=false,
       
   123     xtick={0,5,...,30},
       
   124     xmax=33,
       
   125     ymax=35,
       
   126     ytick={0,5,...,30},
       
   127     scaled ticks=false,
       
   128     axis lines=left,
       
   129     width=5cm,
       
   130     height=4cm, 
       
   131     legend entries={Dart},  
       
   132     legend pos=north west,
       
   133     legend cell align=left]
       
   134 \addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
       
   135 \end{axis}
       
   136 \end{tikzpicture}\\ 
       
   137 \begin{tikzpicture}
       
   138 \begin{axis}[
       
   139     xlabel={$n$},
       
   140     x label style={at={(1.05,-0.05)}},
       
   141     ylabel={time in secs},
       
   142     enlargelimits=false,
       
   143     xtick={0,5,...,30},
       
   144     xmax=33,
       
   145     ymax=35,
       
   146     ytick={0,5,...,30},
       
   147     scaled ticks=false,
       
   148     axis lines=left,
       
   149     width=5cm,
       
   150     height=4cm, 
       
   151     legend entries={Swift},  
       
   152     legend pos=north west,
       
   153     legend cell align=left]
       
   154 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
       
   155 \end{axis}
       
   156 \end{tikzpicture}
       
   157   & 
       
   158 \begin{tikzpicture}
       
   159 \begin{axis}[
       
   160     xlabel={$n$},
       
   161     x label style={at={(1.05,-0.05)}},
       
   162     %ylabel={time in secs},
       
   163     enlargelimits=true,
       
   164     %xtick={0,5000,...,40000},
       
   165     %xmax=40000,
       
   166     %ymax=35,
       
   167     restrict x to domain*=0:40000,
       
   168     restrict y to domain*=0:35,
       
   169     %ytick={0,5,...,30},
       
   170     %scaled ticks=false,
       
   171     axis lines=left,
       
   172     width=5cm,
       
   173     height=4cm, 
       
   174     legend entries={Java9+},  
       
   175     legend pos=north west,
       
   176     legend cell align=left]
       
   177 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
       
   178 \end{axis}
       
   179 \end{tikzpicture}\\ 
       
   180 \begin{tikzpicture}
       
   181 \begin{axis}[
       
   182     xlabel={$n$},
       
   183     x label style={at={(1.05,-0.05)}},
       
   184     ylabel={time in secs},
       
   185     enlargelimits=true,
       
   186     %xtick={0,5,...,30},
       
   187     %xmax=33,
       
   188     %ymax=35,
       
   189     restrict x to domain*=0:60000,
       
   190     restrict y to domain*=0:35,
       
   191     %ytick={0,5,...,30},
       
   192     %scaled ticks=false,
       
   193     axis lines=left,
       
   194     width=5cm,
       
   195     height=4cm, 
       
   196     legend entries={Scala},  
       
   197     legend pos=north west,
       
   198     legend cell align=left]
       
   199 \addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};
       
   200 \end{axis}
       
   201 \end{tikzpicture}
       
   202   & 
       
   203 \begin{tikzpicture}
       
   204 \begin{axis}[
       
   205     xlabel={$n$},
       
   206     x label style={at={(1.05,-0.05)}},
       
   207     %ylabel={time in secs},
       
   208     enlargelimits=true,
       
   209     %xtick={0,5000,...,40000},
       
   210     %xmax=40000,
       
   211     %ymax=35,
       
   212     restrict x to domain*=0:60000,
       
   213     restrict y to domain*=0:45,
       
   214     %ytick={0,5,...,30},
       
   215     %scaled ticks=false,
       
   216     axis lines=left,
       
   217     width=5cm,
       
   218     height=4cm, 
       
   219     legend style={cells={align=left}},
       
   220     legend entries={Isabelle \\ Extracted},
       
   221     legend pos=north west,
       
   222     legend cell align=left]
       
   223 \addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
       
   224 \end{axis}
       
   225 \end{tikzpicture}\\ 
       
   226 \multicolumn{2}{c}{Graphs}
       
   227 \end{tabular}    
       
   228 \end{center}
       
   229 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
       
   230            of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
       
   231    The reason for their fast growth %superlinear behaviour 
       
   is that they do a depth-first search
   using NFAs.
       
   234    If the string does not match, the regular expression matching
       
   235    engine starts to explore all possibilities. 
       
   236    The last two graphs are for our implementation in Scala, one manual
       
   237    and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
       
   238    Our lexer  
       
   239    performs better in this case, and
       
   240    is formally verified.
       
   241    Despite being almost identical, the codegen-generated lexer
       
   242    % is generated directly from Isabelle using $\textit{codegen}$,
       
   243    is slower than the manually written version since the code synthesised by
       
   244    $\textit{codegen}$ does not use native integer or string
       
   245    types of the target language.
       
   246    %Note that we are comparing our \emph{lexer} against other languages' matchers.
       
   247 }\label{fig:aStarStarb}
       
   248 \end{figure}\afterpage{\clearpage}
       
   249 
       
   250 A more recent example is a global outage of all Cloudflare servers on 2 July
       
   251 2019. A poorly written regular expression exhibited catastrophic backtracking
       
   252 and exhausted CPUs that serve HTTP traffic. Although the outage
       
   253 had several causes, at the heart was a regular expression that
       
   254 was used to monitor network
       
   255 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
       
   256 These problems with regular expressions 
       
   257 are not isolated events that happen
       
   258 very rarely, 
       
   259 %but actually widespread.
       
   260 %They occur so often that they have a 
       
but actually occur often enough that they have a
       
   262 name: Regular-Expression-Denial-Of-Service (ReDoS)
       
   263 attacks.
       
   264 Davis et al. \cite{Davis18} detected more
       
   265 than 1000 evil regular expressions
       
   266 in Node.js, Python core libraries, npm and pypi. 
       
   267 They therefore concluded that evil regular expressions
       
are a real problem rather than just ``a parlour trick''.
       
   269 
       
   270 The work in this thesis aims to address this issue
       
   271 with the help of formal proofs.
       
   272 We describe a lexing algorithm based
       
   273 on Brzozowski derivatives with verified correctness 
       
   274 and a finiteness property for the size of derivatives
       
   275 (which are all done in Isabelle/HOL).
       
   276 Such properties %guarantee the absence of 
       
   277 are an important step in preventing
       
   278 catastrophic backtracking once and for all.
       
   279 We will give more details in the next sections
       
on (i) why the slow cases in Figure~\ref{fig:aStarStarb}
       
   281 can occur in traditional regular expression engines
       
   282 and (ii) why we choose our 
       
   283 approach based on Brzozowski derivatives and formal proofs.
       
   284 
       
   285 
       
   286 
       
   287 
       
   288 In this chapter, we define the basic notions 
       
   289 for regular languages and regular expressions.
       
This is essentially a description in ``English'' of
the functions and datatypes of our formalisation in Isabelle/HOL.
We also define what $\POSIX$ lexing means, 
followed by the first lexing algorithm by Sulzmann and Lu \parencite{Sulzmann2014} 
that produces output conforming
to the $\POSIX$ standard. 
       
This is a preliminary chapter which describes the results of
Sulzmann and Lu \cite{Sulzmann2014} and part of the formalisations by
Ausaf et al \cite{AusafDyckhoffUrban2016} (mainly the definitions).
These results are not part of this PhD work,
but the details are included to provide necessary context
for our work on the correctness proof of $\blexersimp$,
as we show in chapter \ref{Bitcoded2} how the proofs break down when
simplifications are applied.
We include these definitions because we build on the formalisation
by Ausaf et al and therefore inherently use them.

In what follows 
we choose to use the Isabelle-style notation
for function and datatype constructor applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g.\ $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.

\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers 
have been studied for many decades.
Theoretical results in automata theory state 
that basic regular expression matching should be linear
w.r.t.\ the length of the input.
       
   320 This assumes that the regular expression
       
   321 $r$ was pre-processed and turned into a
       
   322 deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
       
   323 By basic we mean textbook definitions such as the one
       
   324 below, involving only regular expressions for characters, alternatives,
       
   325 sequences, and Kleene stars:
       
   326 \[
       
   327 	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
       
   328 \]
       
   329 Modern regular expression matchers used by programmers,
       
   330 however,
       
   331 support much richer constructs, such as bounded repetitions,
       
   332 negations,
       
   333 and back-references.
       
   334 To differentiate, we use the word \emph{regex} to refer
       
   335 to those expressions with richer constructs while reserving the
       
   336 term \emph{regular expression}
       
   337 for the more traditional meaning in formal languages theory.
       
   338 We follow this convention 
       
   339 in this thesis.
       
   340 In the future, we aim to support all the popular features of regexes, 
       
   341 but for this work we mainly look at basic regular expressions
       
   342 and bounded repetitions.
       
   343 
       
   344 
       
   345 
       
   346 %Most modern regex libraries
       
%the so-called PCRE standard (Perl Compatible Regular Expressions)
       
   348 %has the back-references
       
   349 Regexes come with a number of constructs
       
   350 that make it more convenient for 
       
   351 programmers to write regular expressions.
       
   352 Depending on the types of constructs
       
   353 the task of matching and lexing with them
       
   354 will have different levels of complexity.
       
Some of those constructs are syntactic sugar, that is
simply shorthand notations
that save programmers a few keystrokes.
       
   358 These will not cause problems for regex libraries.
       
   359 For example the
       
   360 non-binary alternative involving three or more choices just means:
       
   361 \[
       
   362 	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
       
   363 \]
       
   364 Similarly, the range operator
       
   365 %used to express the alternative
       
   366 %of all characters between its operands, 
       
   367 is just a concise way
       
   368 of expressing an alternative of consecutive characters:
       
   369 \[
       
   370 	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )  
       
   371 \]
       
The
       
   373 wildcard character '$.$' is used to refer to any single character,
       
   374 \[
       
   375 	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
       
   376 \]
       
   377 except the newline.
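The desugarings above are mechanical, which a short Python sketch can illustrate (a sketch for illustration only; the constructor encoding is our own, not the thesis's Isabelle datatype):

```python
# Desugaring regex conveniences into basic regular expressions:
# non-binary alternatives and character ranges reduce to nested
# binary alternatives.

from functools import reduce

def CH(c):     return ('c', c)        # character regular expression
def ALT(r, s): return ('+', r, s)     # binary alternative

def alts(rs):
    # (a | b | c)  ~>  ((a + b) + c): fold the binary ALT left-to-right
    return reduce(ALT, rs)

def char_range(lo, hi):
    # [0-9]  ~>  (0 | 1 | ... | 9): an alternative of consecutive characters
    return alts([CH(chr(k)) for k in range(ord(lo), ord(hi) + 1)])
```

So a range is nothing more than a compact notation; a library that desugars it eagerly pays only a linear size increase, unlike the bounded repetitions discussed next.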
       
   378 
       
   379 \subsection{Bounded Repetitions}
       
   380 More interesting are bounded repetitions, which can 
       
   381 make the regular expressions much
       
   382 more compact.
       
   383 Normally there are four kinds of bounded repetitions:
       
   384 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
       
   385 (where $n$ and $m$ are constant natural numbers).
       
   386 Like the star regular expressions, the set of strings or language
       
   387 a bounded regular expression can match
       
   388 is defined using the power operation on sets:
       
   389 \begin{center}
       
   390 	\begin{tabular}{lcl}
       
   391 		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
       
   392 		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
       
   393 		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
       
   394 		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
       
   395 	\end{tabular}
       
   396 \end{center}
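The power operation on languages used in these definitions can be sketched directly on finite sets, e.g.\ in Python (an illustration of the semantics only; the thesis works with the Isabelle/HOL definitions):

```python
# The language operations behind bounded repetitions:
# L(r^{n}) = (L r)^n and L(r^{..m}) = union of (L r)^i for 0 <= i <= m.

def concat(A, B):
    # language concatenation: every u in A followed by every v in B
    return {u + v for u in A for v in B}

def power(A, n):
    # (L r)^n, with (L r)^0 = {""}
    result = {""}
    for _ in range(n):
        result = concat(result, A)
    return result

def upto(A, m):
    # union of powers 0..m, the language of r^{..m}
    out = set()
    for i in range(m + 1):
        out |= power(A, i)
    return out
```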
       
   397 The attraction of bounded repetitions is that they can be
       
   398 used to avoid a size blow up: for example $r^{\{n\}}$
       
   399 is a shorthand for
       
   400 the much longer regular expression:
       
   401 \[
       
   402 	\underbrace{r\ldots r}_\text{n copies of r}.
       
   403 \]
       
   404 %Therefore, a naive algorithm that simply unfolds
       
   405 %them into their desugared forms
       
   406 %will suffer from at least an exponential runtime increase.
       
   407 
       
   408 
       
   409 The problem with matching 
       
   410 such bounded repetitions
       
   411 is that tools based on the classic notion of
       
   412 automata need to expand $r^{\{n\}}$ into $n$ connected 
       
   413 copies of the automaton for $r$. This leads to very inefficient matching
       
   414 algorithms  or algorithms that consume large amounts of memory.
       
   415 Implementations using $\DFA$s will
       
   416 in such situations
       
   417 either become very slow 
       
   418 (for example Verbatim++ \cite{Verbatimpp}) or run
       
   419 out of memory (for example $\mathit{LEX}$ and 
       
   420 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
       
   421 in C and JAVA that generate $\mathit{DFA}$-based
       
   422 lexers. The user provides a set of regular expressions
       
   423 and configurations, and then 
       
   424 gets an output program encoding a minimized $\mathit{DFA}$
       
   425 that can be compiled and run. 
       
   426 When given the above countdown regular expression,
       
   427 a small $n$ (say 20) would result in a program representing a
       
   428 DFA
       
   429 with millions of states.}) for large counters.
       
   430 A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
       
   431 where the minimal DFA requires at least $2^{n+1}$ states.
       
   432 For example, when $n$ is equal to 2,
       
   433 the corresponding $\mathit{NFA}$ looks like:
       
   434 \vspace{6mm}
       
   435 \begin{center}
       
   436 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
       
   437    \node[state,initial] (q_0)   {$q_0$}; 
       
   438    \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
       
   439    \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
       
   440    \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
       
   441     \path[->] 
       
   442     (q_0) edge  node {a} (q_1)
       
   443     	  edge [loop below] node {a,b} ()
       
   444     (q_1) edge  node  {a,b} (q_2)
       
   445     (q_2) edge  node  {a,b} (q_3);
       
   446 \end{tikzpicture}
       
   447 \end{center}
       
   448 and when turned into a DFA by the subset construction
       
   449 requires at least $2^3$ states.\footnote{The 
       
red states are ``countdown states'' which count down 
       
   451 the number of characters needed in addition to the current
       
   452 string to make a successful match.
       
   453 For example, state $q_1$ indicates a match that has
       
   454 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
       
and just consumed the ``delimiter'' $a$ in the middle, and 
       
   456 needs to match 2 more iterations of $(a|b)$ to complete.
       
   457 State $q_2$ on the other hand, can be viewed as a state
       
   458 after $q_1$ has consumed 1 character, and just waits
       
   459 for 1 more character to complete.
       
   460 The state $q_3$ is the last (accepting) state, requiring 0 
       
   461 more characters.
       
   462 Depending on the suffix of the
       
   463 input string up to the current read location,
       
   464 the states $q_1$ and $q_2$, $q_3$
       
   465 may or may
       
   466 not be active.
       
   467 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
       
   468 contain at least $2^3$ non-equivalent states that cannot be merged, 
       
   469 because the subset construction during determinisation will generate
       
   470 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
       
Generalising this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself matches
more than one string.
       
   476 This is to represent all different 
       
scenarios in which ``countdown'' states are active.}
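The $2^{n+1}$ state count for $n = 2$ can be checked mechanically. The following Python sketch (our own illustration, not part of the thesis's formalisation) runs the subset construction on the NFA drawn above and counts the reachable DFA states:

```python
# Subset construction for the NFA of (a|b)^* a (a|b)^{2} drawn above.
# NFA states: 0 (initial, loops on a and b), 1, 2, 3 (accepting).

from itertools import chain

delta = {                      # NFA transition function
    (0, 'a'): {0, 1},          # loop on a, or guess this a is the delimiter
    (0, 'b'): {0},
    (1, 'a'): {2}, (1, 'b'): {2},
    (2, 'a'): {3}, (2, 'b'): {3},
}

def step(S, c):
    # move a subset of NFA states simultaneously on character c
    return frozenset(chain.from_iterable(delta.get((q, c), set()) for q in S))

def reachable_dfa_states(start):
    # breadth-first exploration of all reachable subsets (= DFA states)
    seen, todo = {start}, [start]
    while todo:
        S = todo.pop()
        for c in 'ab':
            T = step(S, c)
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return seen

states = reachable_dfa_states(frozenset({0}))
```

Every reachable subset contains $q_0$, and all $2^3$ combinations of the countdown states $q_1, q_2, q_3$ indeed occur.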
       
   478 
       
   479 
       
   480 Bounded repetitions are important because they
       
   481 tend to occur frequently in practical use,
       
   482 for example in the regex library RegExLib, in
       
   483 the rules library of Snort \cite{Snort1999}\footnote{
       
   484 Snort is a network intrusion detection (NID) tool
       
   485 for monitoring network traffic.
       
   486 The network security community curates a list
       
   487 of malicious patterns written as regexes,
       
   488 which is used by Snort's detection engine
       
   489 to match against network traffic for any hostile
       
   490 activities such as buffer overflow attacks.}, 
       
   491 as well as in XML Schema definitions (XSDs).
       
   492 According to Bj\"{o}rklund et al \cite{xml2015},
       
   493 more than half of the 
       
   494 XSDs they found on the Maven.org central repository
       
   495 have bounded regular expressions in them.
       
   496 Often the counters are quite large, with the largest being
       
   497 close to ten million. 
       
   498 A smaller sample XSD they gave
       
   499 is:
       
   500 \lstset{
       
   501 	basicstyle=\fontsize{8.5}{9}\ttfamily,
       
   502   language=XML,
       
   503   morekeywords={encoding,
       
   504     xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
       
   505 }
       
   506 \begin{lstlisting}
       
   507 <sequence minOccurs="0" maxOccurs="65535">
       
   508 	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
       
   509  	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
       
   510 </sequence>
       
   511 \end{lstlisting}
       
   512 This can be seen as the regex
       
   513 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
       
   514 regular expressions 
       
   515 satisfying certain constraints (such as 
       
   516 satisfying the floating point number format).
       
It is therefore quite unsatisfying that 
some regular expression matching libraries
impose ad hoc limits
for bounded regular expressions:
       
   521 For example, in the regular expression matching library in the Go
       
   522 language the regular expression $a^{1001}$ is not permitted, because no counter
       
   523 can be above 1000, and in the built-in Rust regular expression library
       
   524 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
       
   525 for being too big. 
       
   526 As Becchi and Crawley \cite{Becchi08}  have pointed out,
       
the reason for these restrictions
is that they simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states can
grow as large as the counter value.
When the counters are large, 
the memory requirement can become
infeasible, and a regex engine
like the one in Go will reject such a pattern straight away.
       
   536 \begin{figure}[H]
       
   537 \begin{center}
       
   538 \begin{tikzpicture} [node distance = 2cm, on grid, auto]
       
   539  
       
   540     	\node (q0) [state, initial] {$0$};
       
   541 	\node (q1) [state, right = of q0] {$1$};
       
   542 	%\node (q2) [state, right = of q1] {$2$};
       
   543 	\node (qdots) [right = of q1] {$\ldots$};
       
   544 	\node (qn) [state, right = of qdots] {$n$};
       
   545 	\node (qn1) [state, right = of qn] {$n+1$};
       
   546 	\node (qn2) [state, right = of qn1] {$n+2$};
       
   547 	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
       
   548  
       
   549 \path [-stealth, thick]
       
   550 	(q0) edge [loop above] node {a} ()
       
   551     (q0) edge node {a}   (q1) 
       
   552     %(q1) edge node {.}   (q2)
       
   553     (q1) edge node {.}   (qdots)
       
   554     (qdots) edge node {.} (qn)
       
   555     (qn) edge node {.} (qn1)
       
   556     (qn1) edge node {b} (qn2)
       
   557     (qn2) edge node {$c$} (qn3);
       
   558 \end{tikzpicture}
       
   559 %\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
       
   560 %   \node[state,initial] (q_0)   {$0$}; 
       
   561 %   \node[state, ] (q_1) [right=of q_0] {$1$}; 
       
   562 %   \node[state, ] (q_2) [right=of q_1] {$2$}; 
       
   563 %   \node[state,
       
   564 %   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
       
   565 %    \path[->] 
       
   566 %    (q_0) edge  node {a} (q_1)
       
   567 %    	  edge [loop below] node {a,b} ()
       
   568 %    (q_1) edge  node  {a,b} (q_2)
       
   569 %    (q_2) edge  node  {a,b} (q_3);
       
   570 %\end{tikzpicture}
       
   571 \end{center}
       
   572 \caption{The example given by Becchi and Crawley
       
   573 	that NFA simulation can consume large
       
   574 	amounts of memory: $.^*a.^{\{n\}}bc$ matching
       
   575 	strings of the form $aaa\ldots aaaabc$.
       
	When traversing in a breadth-first manner,
	all states from $0$ up to $n+1$ become active.}\label{fig:inj}
       
   578 
       
   579 \end{figure}
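Becchi and Crawley's observation can be replayed with a small breadth-first simulation of the automaton in the figure (a Python sketch of our own; state numbering follows the figure):

```python
# Breadth-first simulation of the NFA for .* a .{n} b c from the figure,
# recording how many states are simultaneously active.
# States: 0 (loops on any char), 1..n+1 (the .{n} countdown chain),
# n+2 after 'b', n+3 (accepting) after 'c'.

def max_active_states(n, s):
    active = {0}
    max_active = 1
    for ch in s:
        nxt = set()
        for q in active:
            if q == 0:
                nxt.add(0)                 # the .* loop
                if ch == 'a':
                    nxt.add(1)             # guess this a is the delimiter
            elif q <= n:
                nxt.add(q + 1)             # advance along the . chain
            elif q == n + 1 and ch == 'b':
                nxt.add(n + 2)
            elif q == n + 2 and ch == 'c':
                nxt.add(n + 3)
        active = nxt
        max_active = max(max_active, len(active))
    return max_active
```

On inputs of the form $aaa\ldots a$ every $a$ spawns a fresh countdown, so all of states $0$ to $n+1$ become active at once: the memory footprint scales with the counter, which is exactly why engines cap the counter value.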
       
   580 %Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
       
   581 %type of $\mathit{NFA}$ simulation and guarantees a linear runtime
       
   582 %in terms of input string length.
       
   583 %TODO:try out these lexers
       
   584 These problems can of course be solved in matching algorithms where 
       
   585 automata go beyond the classic notion and for instance include explicit
       
   586 counters \cite{Turo_ov__2020}.
       
These solutions can be quite efficient,
with the ability to process
gigabits of input per second
even with large counters \cite{Becchi08}.
       
   591 These practical solutions do not come with
       
   592 formal guarantees, and as pointed out by
       
   593 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
       
   594 %But formal reasoning about these automata especially in Isabelle 
       
   595 %can be challenging
       
   596 %and un-intuitive. 
       
   597 %Therefore, we take correctness and runtime claims made about these solutions
       
   598 %with a grain of salt.
       
   599 
       
In the work reported in \cite{FoSSaCS2023} and here, 
we add better support using derivatives
for bounded regular expressions such as $r^{\{n\}}$.
Our results
extend straightforwardly to
repetitions with intervals such as 
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover, these extensions are still made up of only small
inductive datatypes and recursive functions,
making them easy to handle in theorem provers.
       
   613 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
       
   614 %straightforwardly extended to deal with bounded regular expressions
       
   615 %and moreover the resulting code still consists of only simple
       
   616 %recursive functions and inductive datatypes.
       
   617 Finally, bounded regular expressions do not destroy our finite
       
   618 boundedness property, which we shall prove later on.
       
   619 
       
   620 
       
   621 
       
   622 
       
   623 
       
   624 \subsection{Back-References}
       
   625 The other way to simulate an $\mathit{NFA}$ for matching is choosing  
       
   626 a single transition each time, keeping all the other options in 
       
   627 a queue or stack, and backtracking if that choice eventually 
       
   628 fails. 
       
This method, often called ``depth-first search'',
is efficient in many cases, but can end up
with exponential run time.
       
   632 The backtracking method is employed in regex libraries
       
   633 that support \emph{back-references}, for example
       
   634 in Java and Python.
       
   635 %\section{Back-references and The Terminology Regex}
       
   636 
       
   637 %When one constructs an $\NFA$ out of a regular expression
       
   638 %there is often very little to be done in the first phase, one simply 
       
   639 %construct the $\NFA$ states based on the structure of the input regular expression.
       
   640 
       
   641 %In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
       
   642 %one by keeping track of all active states after consuming 
       
   643 %a character, and update that set of states iteratively.
       
   644 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$
       
   645 %for a path terminating
       
   646 %at an accepting state.
       
   647 
       
   648 
       
   649 
       
   650 Consider the following regular expression where the sequence
       
   651 operator is omitted for brevity:
       
   652 \begin{center}
       
   653 	$r_1r_2r_3r_4$
       
   654 \end{center}
       
   655 In this example,
       
   656 one could label sub-expressions of interest 
       
   657 by parenthesizing them and giving 
       
   658 them a number in the order in which their opening parentheses appear.
       
   659 One possible way of parenthesizing and labelling is: 
       
   660 \begin{center}
       
   661 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
       
   662 \end{center}
       
The sub-expressions
$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' to by their respective numbers. 
       
   666 %These sub-expressions are called "capturing groups".
       
   667 To do so, one uses the syntax $\backslash i$ 
       
to denote that we want the sub-string 
of the input matched by the $i$-th
sub-expression to appear again, 
exactly the same as it first appeared: 
       
   672 \begin{center}
       
   673 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
       
   674 \underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
       
   675 \end{center}
       
   676 Once the sub-string $s_i$ for the sub-expression $r_i$
       
   677 has been fixed, there is no variability on what the back-reference
       
   678 label $\backslash i$ can be---it is tied to $s_i$.
       
The overall string will look like $\ldots s_i \ldots s_i \ldots$.
       
   680 %The backslash and number $i$ are the
       
   681 %so-called "back-references".
       
   682 %Let $e$ be an expression made of regular expressions 
       
   683 %and back-references. $e$ contains the expression $e_i$
       
   684 %as its $i$-th capturing group.
       
   685 %The semantics of back-reference can be recursively
       
   686 %written as:
       
   687 %\begin{center}
       
   688 %	\begin{tabular}{c}
       
   689 %		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
       
   690 %		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
       
   691 %	\end{tabular}
       
   692 %\end{center}
       
   693 A concrete example
       
   694 for back-references is
       
   695 \begin{center}
       
   696 $(.^*)\backslash 1$,
       
   697 \end{center}
       
   698 which matches
       
   699 strings that can be split into two identical halves,
       
   700 for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
       
   701 Note that this is different from 
       
   702 repeating the  sub-expression verbatim like
       
   703 \begin{center}
       
   704 	$(.^*)(.^*)$,
       
   705 \end{center}
       
   706 which does not impose any restrictions on what strings the second 
       
   707 sub-expression $.^*$
       
   708 might match.
       
   709 Another example for back-references is
       
   710 \begin{center}
       
   711 $(.)(.)\backslash 2\backslash 1$
       
   712 \end{center}
       
   713 which matches four-character palindromes
       
   714 like $abba$, $x??x$ and so on.
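Both behaviours can be observed in any backtracking engine that supports back-references; the following is a small illustrative check using Python's \texttt{re} module (the test strings are our own):

```python
import re

# (.*)\1 : the back-reference forces the second half of the
# string to repeat the first capture group verbatim.
assert re.fullmatch(r'(.*)\1', 'foofoo') is not None
assert re.fullmatch(r'(.*)\1', 'foobar') is None

# (.*)(.*) : without the back-reference the second group is
# unconstrained, so any string can be split and matched.
assert re.fullmatch(r'(.*)(.*)', 'foobar') is not None

# (.)(.)\2\1 : four-character palindromes.
assert re.fullmatch(r'(.)(.)\2\1', 'abba') is not None
assert re.fullmatch(r'(.)(.)\2\1', 'abcd') is None
```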
       
   715 
       
   716 Back-references are a regex construct 
       
   717 that programmers find quite useful.
       
   718 According to Becchi and Crawley \cite{Becchi08},
       
   719 6\% of Snort rules (up until 2008) use them.
       
   720 The most common use of back-references
       
is to express well-formed HTML files,
       
   722 where back-references are convenient for matching
       
   723 opening and closing tags like 
       
   724 \begin{center}
       
   725 	$\langle html \rangle \ldots \langle / html \rangle$
       
   726 \end{center}
       
   727 A regex describing such a format
       
   728 is
       
   729 \begin{center}
       
   730 	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
       
   731 \end{center}
       
Despite being useful, the expressive power of regexes 
goes beyond regular languages 
once back-references are included.
In fact, back-references allow regexes to express 
languages that are not contained in the context-free
languages either.
       
   738 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
       
   739 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
       
   740 which cannot be expressed by 
       
   741 context-free grammars \cite{campeanu2003formal}.
       
Such a language lies in the context-sensitive level
of the Chomsky hierarchy of formal languages. 
Moreover, solving the matching problem in the presence of back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
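As a quick sanity check of this claim, a backtracking engine with back-reference support, such as Python's \texttt{re}, accepts exactly the strings of the form $a^n b a^n b a^n$ for this regex (the test strings below are our own):

```python
import re

# (a*)b\1b\1 : all three blocks of a's must have the same length n
p = re.compile(r'(a*)b\1b\1')
assert p.fullmatch('aabaabaa') is not None   # n = 2
assert p.fullmatch('bb') is not None         # n = 0
assert p.fullmatch('aabaabab') is None       # unequal blocks of a's
```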
       
   746 Regex libraries supporting back-references such as 
       
   747 PCRE \cite{pcre} therefore have to
       
   748 revert to a depth-first search algorithm including backtracking.
       
What is unexpected is that even in cases 
not involving back-references, there is still
a (non-negligible) chance these libraries might backtrack super-linearly,
as shown in the graphs in figure \ref{fig:aStarStarb}.
       
   753 
       
   754 Summing up, we can categorise existing 
       
   755 practical regex libraries into two kinds:
       
(i) the ones with linear
time guarantees like Go and Rust, whose downside is that
they impose restrictions
on the regular expressions (not allowing back-references, 
bounded repetitions cannot exceed an ad hoc limit, etc.);
and (ii) those 
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
The latter can potentially ``grind to a halt''
on some very simple cases, resulting 
in ReDoS attacks if exposed to the internet.
       
   767 
       
   768 The problems with both approaches are the motivation for us 
       
   769 to look again at the regular expression matching problem. 
       
   770 Another motivation is that regular expression matching algorithms
       
   771 that follow the POSIX standard often contain errors and bugs 
       
   772 as we shall explain next.
       
   773 
       
   774 %We would like to have regex engines that can 
       
   775 %deal with the regular part (e.g.
       
   776 %bounded repetitions) of regexes more
       
   777 %efficiently.
       
   778 %Also we want to make sure that they do it correctly.
       
   779 %It turns out that such aim is not so easy to achieve.
       
   780  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
       
   781 % For example, the Rust regex engine claims to be linear, 
       
   782 % but does not support lookarounds and back-references.
       
   783 % The GoLang regex library does not support over 1000 repetitions.  
       
   784 % Java and Python both support back-references, but shows
       
   785 %catastrophic backtracking behaviours on inputs without back-references(
       
   786 %when the language is still regular).
       
   787  %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
       
   788  %TODO: verify the fact Rust does not allow 1000+ reps
       
   789 
       
   790 
       
   791 
       
   792 
       
   793 %The time cost of regex matching algorithms in general
       
   794 %involve two different phases, and different things can go differently wrong on 
       
   795 %these phases.
       
   796 %$\DFA$s usually have problems in the first (construction) phase
       
   797 %, whereas $\NFA$s usually run into trouble
       
   798 %on the second phase.
       
   799 
       
   800 
       
   801 \section{Error-prone POSIX Implementations}
       
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
       
   806 The more widely used strategy is called POSIX,
       
   807 which roughly speaking always chooses the longest initial match.
       
   808 The POSIX strategy is widely adopted in many regular expression matchers
       
   809 because it is a natural strategy for lexers.
       
   810 However, many implementations (including the C libraries
       
   811 used by Linux and OS X distributions) contain bugs
       
   812 or do not meet the specification they claim to adhere to.
       
   813 Kuklewicz maintains a unit test repository which lists some
       
   814 problems with existing regular expression engines \cite{KuklewiczHaskell}.
       
   815 In some cases, they either fail to generate a
       
   816 result when there exists a match,
       
   817 or give results that are inconsistent with the POSIX standard.
       
   818 A concrete example is the regex:
       
\begin{center}
	$(aba + ab + a)^*$ \quad and the string \quad $ababa$
\end{center}
       
   822 The correct POSIX match for the above
       
   823 involves the entire string $ababa$, 
       
   824 split into two Kleene star iterations, namely $[ab], [aba]$ at positions
       
   825 $[0, 2), [2, 5)$
       
   826 respectively.
       
But feeding this example to the different engines
listed at regex101\footnote{
	regex101 is an online regular expression matcher which
	provides an API for trying out the regular expression
	engines of multiple popular programming languages like
	Java, Python, Go, etc.} \parencite{regex101}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
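This behaviour is easy to reproduce with, for instance, Python's backtracking engine, which commits to the greedy first iteration instead of searching for the longest overall initial match (a sketch; other engines may differ in detail):

```python
import re

# POSIX leftmost-longest would match the whole string 'ababa'
# (star iterations 'ab' then 'aba'); a backtracking engine
# instead picks 'aba' greedily in the first iteration, cannot
# extend at 'ba', and reports the shorter match.
m = re.match(r'(aba|ab|a)*', 'ababa')
assert m.group(0) == 'aba'   # not the POSIX answer 'ababa'
```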
       
   836 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
       
   837 commented that most regex libraries are not
       
   838 correctly implementing the central POSIX
       
   839 rule, called the maximum munch rule.
       
   840 Grathwohl \parencite{grathwohl2014crash} wrote:
       
   841 \begin{quote}\it
       
   842 	``The POSIX strategy is more complicated than the 
       
   843 	greedy because of the dependence on information about 
       
   844 	the length of matched strings in the various subexpressions.''
       
   845 \end{quote}
       
   846 %\noindent
       
People have recognised that the implementation complexity of POSIX rules also comes from
the specification not being very precise.
       
   849 The developers of the regexp package of Go 
       
   850 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
       
   851 commented that
       
   852 \begin{quote}\it
       
``The POSIX rule is computationally prohibitive
and not even well-defined.''
       
   857 \end{quote}
       
   858 There are many informal summaries of this disambiguation
       
   859 strategy, which are often quite long and delicate.
       
   860 For example Kuklewicz \cite{KuklewiczHaskell} 
       
   861 described the POSIX rule as (section 1, last paragraph):
       
   862 \begin{quote}
       
   863 	\begin{itemize}
       
   864 		\item
       
regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
       
   867 \item
       
   868 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
       
   869 \item
       
   870 REs have right associative concatenation which can be changed with parenthesis\\
       
   871 \item
       
   872 parenthesized subexpressions return the match from their last usage\\
       
   873 \item
       
   874 text of component subexpressions must be contained in the text of the 
       
   875 higher-level subexpressions\\
       
   876 \item
       
   877 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
       
   878 \item
       
   879 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
       
   880 \end{itemize}
       
   881 \end{quote}
       
   882 %The text above 
       
   883 %is trying to capture something very precise,
       
   884 %and is crying out for formalising.
       
   885 Ribeiro and Du Bois \cite{RibeiroAgda2017} have 
       
   886 formalised the notion of bit-coded regular expressions
       
   887 and proved their relations with simple regular expressions in
       
   888 the dependently-typed proof assistant Agda.
       
   889 They also proved the soundness and completeness of a matching algorithm
       
   890 based on the bit-coded regular expressions.
       
Ausaf et al.~\cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and also to prove 
that their specification coincides with an earlier (unformalised) 
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu needed to be substantially 
revised in order for the correctness proof to go through;
according to Ausaf et al., the original specification and proof
were unfixable.
       
   905 
       
   906 
       
In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, on which the main part of this thesis builds.
       
%We give a taste of what they 
%are like and why they are suitable for regular expression
%matching and lexing.
\section{Formal Specification of POSIX Matching 
and Brzozowski Derivatives}
But what if we want to 
do lexing instead of just getting a true/false answer?
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
elegant (arguably as beautiful as the definition of the 
Brzozowski derivative) solution for this.
       
   809 \marginpar{explicitly attribute the work}
       
For the same reason described
at the beginning of this chapter, 
we introduce the formal
semantics of $\POSIX$ lexing by 
Ausaf et al.~\cite{AusafDyckhoffUrban2016}, 
followed by the first lexing algorithm by 
Sulzmann and Lu \cite{Sulzmann2014} 
that produces output conforming
to the $\POSIX$ standard. 
       
   819 %TODO: Actually show this in chapter 4.
       
   820 In what follows 
       
   821 we choose to use the Isabelle-style notation
       
   822 for function and datatype constructor applications, where
       
   823 the parameters of a function are not enclosed
       
   824 inside a pair of parentheses (e.g. $f \;x \;y$
       
   825 instead of $f(x,\;y)$). This is mainly
       
   826 to make the text visually more concise.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
In this section, we present a two-phase regular expression lexing 
algorithm.
The first phase takes successive derivatives with