ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 538 8016a2480704
parent 537 50e590823220
child 542 a7344c9afbaf
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}

\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}

\def\Some{\textit{Some}}
\def\None{\textit{None}}
\def\code{\textit{code}}
\def\decode{\textit{decode}}
\def\internalise{\textit{internalise}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\textit{mkeps}}

\def\AZERO{\textit{AZERO}}
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}

\def\fuse{\textit{fuse}}
\def\bder{\textit{bder}}
\def\POSIX{\textit{POSIX}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
\def\NFA{\textit{NFA}}
\def\bmkeps{\textit{bmkeps}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\mathit{inj}}
\def\rrexp{\textit{rrexp}}
\newcommand\rnullable[1]{\textit{rnullable}(#1)}
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)\downarrow_r}

\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}

\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}

\pgfplotsset{
    myplotstyle/.style={
    legend style={draw=none, font=\small},
    legend cell align=left,
    legend pos=north east,
    ylabel style={align=center, font=\bfseries\boldmath},
    xlabel style={align=center, font=\bfseries\boldmath},
    x tick label style={font=\bfseries\boldmath},
    y tick label style={font=\bfseries\boldmath},
    scaled ticks=true,
    every axis plot/.append style={thick},
    },
}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like
       
\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
%% plot bodies elided in this changeset view
\end{tikzpicture}
\end{tabular}
%% caption reconstructed from the surrounding text; the original plot code is elided here
\caption{Matching times for the regular expression $(a^*)^*\,b$ on strings $aa..a$ of length $n$.}\label{fig:aStarStarb}
\end{figure}
Regular expressions are widely used in computer science: 
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy 
text-processing; network intrusion
detection systems that reject suspicious traffic; or compiler
front ends. The majority of the solutions to these tasks 
involve lexing with regular 
expressions.
Given their usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
Indeed, supplying a popular programming language's regex engine 
with regular expressions and strings, one can
get rich matching information in a very short time.
Some network intrusion detection systems
use regex engines that are able to process 
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
Unfortunately, this is not the case for \emph{all} inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed


Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8, even with strings of a small
length, say around 30 $a$'s, one discovers that 
this decision takes an absurdly long time given the simplicity of the problem.
This is clearly exponential behaviour, and 
is triggered by some relatively simple regex patterns, as the graphs
in Figure~\ref{fig:aStarStarb} show.
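The blow-up is easy to reproduce. The following is a minimal sketch using Python's built-in \texttt{re} module (a backtracking engine); the timings only illustrate the exponential growth and are not precise measurements:

```python
import re
import time

# Matching (a*)*b against strings of only a's: a backtracking engine
# tries every way of splitting the a's between the inner and outer star
# before it can conclude that the required b is missing.
evil = re.compile(r'(a*)*b')

for n in [6, 10, 14, 18]:
    s = 'a' * n
    t0 = time.perf_counter()
    m = evil.match(s)
    elapsed = time.perf_counter() - t0
    assert m is None  # there is no b, so the match must fail
    print(f'n = {n:2d}: {elapsed:.4f}s')  # time roughly doubles per extra a
```

Already at around $n = 30$ the matching time becomes unbearable on such engines.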
       

\ChristianComment{Superlinear: I just leave out the explanation, 
which I find would distract the flow if used. Plus, if I just said ``exponential''
here it would be inaccurate: the 2016 Stack Exchange event was not exponential, 
but just quadratic.}

This superlinear blowup in regular expression engines
has repeatedly caused grief in real life.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was submitted,
but importantly the white spaces were neither at the beginning nor at
the end. As a result, the regular expression matching engine needed to
backtrack over many choices. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
attack and therefore stopped the servers from responding to any
requests. This made the whole site become unavailable. 

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at its heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
%TODO: data points for some new versions of languages
These problems with regular expressions 
are not isolated events that happen
only occasionally, but are actually widespread.
They occur so often that they have been given a 
name: Regular-expression Denial-of-Service (ReDoS)
attacks.
\citeauthor{Davis18} \parencite{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, and npm and pypi. 
They therefore concluded that evil regular expressions
are ``more than a parlour trick'', but a problem that
requires more research attention.

But the problems are not limited to slowness on certain 
cases. 
Another problem with these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be 
the regex
\begin{verbatim}
(aba|ab|a)*
\end{verbatim}
and the string
\begin{verbatim}
ababa
\end{verbatim}
The correct $\POSIX$ match for the above is 
the entire string $ababa$, 
split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
But trying this out in regex101 \parencite{regex101}
with different language engines would yield 
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
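This behaviour can be observed concretely. Below is a small sketch using Python's backtracking \texttt{re} engine, which exhibits exactly the fragmented first match described above:

```python
import re

# POSIX longest-match would match the whole of "ababa" against
# (aba|ab|a)*, split as [ab][aba].  A backtracking engine instead tries
# the alternatives in priority order: the first star iteration greedily
# picks "aba", the next iteration fails on the remaining "ba", and the
# star simply stops, leaving a shorter overall match.
m = re.match(r'(aba|ab|a)*', 'ababa')
print(m.group(0))  # "aba" at span (0, 3), not the POSIX answer "ababa"
assert m.group(0) == 'aba'
assert m.span() == (0, 3)
```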
       

Kuklewicz \parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.

As Grathwohl \parencite{grathwohl2014crash} commented,
\begin{center}
	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
\end{center}


To summarise the above: regular expressions are important.
They are popular, and programming languages' library functions
for them are very fast on non-catastrophic cases.
But there are problems with current practical implementations.
The first is that the running time might blow up.
The second is that they can be error-prone on
cases whose correct matches are easily spotted by a human.
In the next part of the chapter, we will look into the reasons why 
certain regex engines run horribly slowly on the ``catastrophic''
cases, and propose a solution that addresses both of these problems,
based on the work of Brzozowski and of Sulzmann and Lu.


\section{Why are current regex engines slow?}

%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the 
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two different phases, and different things can go wrong in 
each of them:
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second phase.

\subsection{Different Phases of a Matching/Lexing Algorithm}

Most lexing algorithms can be roughly divided into 
two phases during their run.
The first phase is the ``construction'' phase,
in which the algorithm builds some  
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost of such a phase by $P_1(r)$.
The second phase is the lexing phase, in which the input string 
$s$ is read and the data structure
representing the regex $r$ is operated on. 
We represent the time
it takes by $P_2(r, s)$.

For a $\mathit{DFA}$,
we have $P_2(r, s) = O( |s| )$,
because we take at most $|s|$ steps, 
and each step takes
at most one transition---a
deterministic finite automaton
by definition has at most one state active and makes at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later. 
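The $P_2(r, s) = O(|s|)$ bound can be illustrated with a small table-driven sketch; the $\DFA$ below, hand-written for the regex $ab^*$, is purely illustrative:

```python
# Hand-written DFA for the regex ab*: state 0 is the start state,
# state 1 is accepting, and state 2 is a dead (sink) state.
TRANSITIONS = {
    (0, 'a'): 1, (0, 'b'): 2,
    (1, 'a'): 2, (1, 'b'): 1,
    (2, 'a'): 2, (2, 'b'): 2,
}
ACCEPTING = {1}

def dfa_match(s):
    state, steps = 0, 0
    for c in s:
        state = TRANSITIONS[(state, c)]  # exactly one transition per symbol
        steps += 1
    return state in ACCEPTING, steps

ok, steps = dfa_match('abbb')
assert ok is True and steps == 4  # P2 is exactly |s| table lookups
assert dfa_match('ba') == (False, 2)
```

Each input symbol causes exactly one table lookup, which is what makes the lexing phase of a $\DFA$ linear in $|s|$.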
       

For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
The cost $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $|r| \cdot 2^{|s|}$.
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of 
%states active and transitions taken each time is bounded by a
%such as negation, intersection, 
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions 
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring. 

\subsection{Why $\mathit{DFA}$s can be slow in the first phase}

The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms. 
However, they do not scale well with bounded repetitions.

\subsubsection{Problems with Bounded Repetitions}
Bounded repetitions, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denote a regular expression accepting strings
that can be divided into $c$ substrings, where each 
substring is in $r$. 
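For instance, in the concrete syntax of Python's \texttt{re} module (used here only as an illustrative sketch), $(ab)^{\{3\}}$ accepts exactly three copies of $ab$:

```python
import re

# (ab){3} accepts strings that split into exactly 3 substrings,
# each of which is in the language of ab.
assert re.fullmatch(r'(ab){3}', 'ababab') is not None
assert re.fullmatch(r'(ab){3}', 'abab') is None      # too few repetitions
assert re.fullmatch(r'(ab){3}', 'abababab') is None  # too many repetitions
```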
regexes shaped like $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
would require at least $2^{n+1}$ states if $r$ contains
more than one string.
This is to represent all the different 
scenarios in which ``countdown'' states are active.
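This lower bound can be checked experimentally with a small subset-construction sketch. The code below hand-codes the $\NFA$ for $(a|b)^*\,a\,(a|b)^{\{n\}}$, an instance of $r^*ar^{\{n\}}$ with $r = a|b$, and counts the reachable $\DFA$ states; all names are illustrative:

```python
from collections import deque

def count_dfa_states(n):
    # NFA for (a|b)*a(a|b){n}: state 0 loops on a/b; on an 'a' it may also
    # start the "countdown" chain 1 -> 2 -> ... -> n+1 (accepting).
    def step(states, c):
        nxt = set()
        for q in states:
            if q == 0:
                nxt.add(0)                  # stay in the loop
                if c == 'a':
                    nxt.add(1)              # guess: this 'a' starts the countdown
            elif q <= n:
                nxt.add(q + 1)              # count down on any symbol
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, deque([start])
    while todo:                             # subset construction, by BFS
        s = todo.popleft()
        for c in 'ab':
            t = step(s, c)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

# the DFA must remember, for each of the last n+1 positions, whether an
# 'a' occurred there -- hence 2^(n+1) reachable states
for n in range(1, 7):
    assert count_dfa_states(n) == 2 ** (n + 1)
```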
For those regexes, tools that use $\DFA$s will get
out-of-memory errors.

\subsubsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{JAVA}$ that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to such lexer generators, and then 
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run. 
When given the above countdown regular expression,
even a small number $n$ would result in a determinised automaton
with millions of states.

For this reason, regex libraries that support 
bounded repetitions often choose to use the $\mathit{NFA}$ 
approach.

\subsection{Why $\mathit{NFA}$s can be slow in the second phase}
When one constructs an $\NFA$ out of a regular expression
there is often very little to be done in the first phase: one simply 
constructs the $\NFA$ states based on the structure of the input regular expression.

In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
one is by keeping track of all active states after consuming 
a character, and updating that set of states iteratively.
This can be viewed as a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation and guarantee a linear runtime
in terms of input string length.
%TODO:try out these lexers
The other way to use an $\mathit{NFA}$ for matching is to choose  
a single transition each time, keeping all the other options in 
a queue or stack, and backtracking if that choice eventually 
fails. This method
is efficient in a lot of cases, but could end up
with exponential run time.\\
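The first, set-of-states style of simulation can be sketched as follows. The $\varepsilon$-NFA below is hand-written for the evil regex $(a^*)^*\,b$ (state numbering and names are illustrative); one pass over the input updates the whole set of active states, so the running time is linear in the string length:

```python
# Hand-coded epsilon-NFA recognising (a*)*b:
#   0 -eps-> 1, 0 -eps-> 3, 1 -a-> 2, 2 -eps-> 1, 2 -eps-> 3, 3 -b-> 4
EPS = {0: {1, 3}, 2: {1, 3}}
DELTA = {(1, 'a'): {2}, (3, 'b'): {4}}
ACCEPT = {4}

def eps_closure(states):
    stack, seen = list(states), set(states)
    while stack:                       # follow epsilon transitions
        q = stack.pop()
        for r in EPS.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def nfa_match(s):
    active = eps_closure({0})
    for c in s:                        # one set-update per input character
        step = set()
        for q in active:
            step |= DELTA.get((q, c), set())
        active = eps_closure(step)
    return bool(active & ACCEPT)

assert nfa_match('aaab') is True
assert nfa_match('a' * 10000) is False   # linear time, no exponential blowup
```

In contrast to the backtracking simulation, the failing input of $10000$ $a$'s is rejected after a single linear pass.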
%TODO:COMPARE java python lexer speed with Rust and Go
The reason behind the use of backtracking algorithms in languages like
Java and Python is that they support back-references.
\subsubsection{Back References}
If we have a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
	$r_1(r_2(r_3r_4))$
\end{center}
of formal languages. 
Solving the back-reference expressions matching problem
is NP-complete \parencite{alfred2014algorithms} and a non-backtracking,
efficient solution is not known to exist.
%TODO:read a bit more about back reference algorithms
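Back-references push matching beyond the regular languages. A small sketch with Python's \texttt{re} engine (where \verb|\1| must repeat whatever group 1 matched) recognises the non-regular language $\{a^n b a^n \mid n \geq 0\}$:

```python
import re

# (a*)b\1 matches a^n b a^n: the back-reference \1 must repeat exactly
# the string captured by group 1 -- something no finite automaton can do.
pat = re.compile(r'(a*)b\1')

assert pat.fullmatch('aaabaaa') is not None  # n = 3
assert pat.fullmatch('aabaaa') is None       # unequal numbers of a's
assert pat.fullmatch('b') is not None        # n = 0
```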
       

It seems that languages like Java and Python made the trade-off
to support back-references at the expense of having to backtrack,
even in the case of regexes not involving back-references.\\
Summing these up, we can categorise existing 
practical regex libraries into ones with linear
time guarantees, like Go and Rust, which impose restrictions
on the user input (not allowing back-references, 
bounded repetitions cannot exceed 1000, etc.), and ones  
that allow the programmer much freedom, but grind to a halt
on some non-negligible portion of cases.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
% For example, the Rust regex engine claims to be linear, 
% but does not support lookarounds and back-references.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
\ChristianComment{Comment required: Java 17 updated graphs? Is it ok to still use Java 8 graphs?}

So we have practical implementations 
of regular expression matching/lexing which are fast
but do not come with any guarantees that they will not grind to a halt
or give wrong answers.
Our goal is to have a regex lexing algorithm that comes with 
\begin{itemize}
\item
proven correctness, 
\item 
proven non-catastrophic properties, and
\item
easy extensions to
constructs like 
bounded repetitions, negation, lookarounds, and even back-references.
\end{itemize}

\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
We propose Brzozowski derivatives on regular expressions as
a solution to these problems.
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.

\subsection{Motivation}

Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).  
       
The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions. 
An algorithm based on them by 
Sulzmann and Lu \parencite{Sulzmann2014} allows easy extension
to include extended regular expressions and 
simplification of internal data structures, 
eliminating the exponential behaviours.
However, two difficulties with derivative-based matchers exist:
\subsubsection{Problems with Current Brzozowski Matchers}
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not.  This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
difficulty by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function.  In our own earlier work, we provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of Sulzmann and Lu's extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.

above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.

Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bit-coded
regular expressions. In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):


\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
and gives a formal proof of the correctness of the algorithm with those simplifications.

%----------------------------------------------------------------------------------------

\section{Contribution}


This work addresses the vulnerability of super-linear and
   676 We give an 
   722 We give an 
   677 improved version of  Sulzmann and Lu's bit-coded algorithm using 
   723 improved version of  Sulzmann and Lu's bit-coded algorithm using 
   678 derivatives, which come with a formal guarantee in terms of correctness and 
   724 derivatives, which come with a formal guarantee in terms of correctness and 
   679 running time as an Isabelle/HOL proof.
   725 running time as an Isabelle/HOL proof.
   680 Then we improve the algorithm with an even stronger version of 
   726 Further improvements to the algorithm with an even stronger version of 
   681 simplification, and prove a time bound linear to input and
   727 simplification is made.
       
   728 We have not yet come up with one, but believe that it leads to a 
       
   729 formalised proof with a time bound linear to input and
   682 cubic to regular expression size using a technique by
   730 cubic to regular expression size using a technique by
   683 Antimirov.
   731 Antimirov\cite{Antimirov}.
   684 
   732 
   685  
   733  
   686 The main contribution of this thesis is a proven correct lexing algorithm
   734 The main contribution of this thesis is 
   687 with formalized time bounds.
   735 \begin{itemize}
       
   736 \item
       
   737 a proven correct lexing algorithm
       
   738 \item
       
   739 with formalized finite bounds on internal data structures' sizes.
       
   740 \end{itemize}
       
   741 
   688 To our best knowledge, no lexing libraries using Brzozowski derivatives
   742 To our best knowledge, no lexing libraries using Brzozowski derivatives
   689 have a provable time guarantee, 
   743 have a provable time guarantee, 
   690 and claims about running time are usually speculative and backed by thin empirical
   744 and claims about running time are usually speculative and backed by thin empirical
   691 evidence.
   745 evidence.
   692 %TODO: give references
   746 %TODO: give references
   702 of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
   756 of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
   703 and concluded that the algorithm is quadratic in terms of input length.
   757 and concluded that the algorithm is quadratic in terms of input length.
   704 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
   758 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
   705 the time it took to lex only 40 $a$'s was 5 minutes.
   759 the time it took to lex only 40 $a$'s was 5 minutes.
   706 
   760 
   707 We  believe our results of a proof of performance on general
       
   708 inputs rather than specific examples a novel contribution.\\
       
   709 
   761 
   710 
   762 
   711 \subsection{Related Work}
   763 \subsection{Related Work}
   712 We are aware
   764 We are aware
   713 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
   765 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
   714 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
   766 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
   715 of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
   767 of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
   716 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
   768 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
   717 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
   769 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
   718  
   770  
   719  %We propose Brzozowski's derivatives as a solution to this problem.
   771  
   720 % about Lexing Using Brzozowski derivatives
   772  When a regular expression does not behave as intended,
       
   773 people usually try to rewrite the regex to some equivalent form
       
   774 or they try to avoid the possibly problematic patterns completely,
       
   775 for which many false positives exist\parencite{Davis18}.
       
   776 Animated tools to "debug" regular expressions such as
       
   777  \parencite{regexploit2021} \parencite{regex101} are also popular.
       
   778 We are also aware of static analysis work on regular expressions that
       
   779 aims to detect potentially expoential regex patterns. Rathnayake and Thielecke 
       
   780 \parencite{Rathnayake2014StaticAF} proposed an algorithm
       
   781 that detects regular expressions triggering exponential
       
   782 behavious on backtracking matchers.
       
   783 Weideman \parencite{Weideman2017Static} came up with 
       
   784 non-linear polynomial worst-time estimates
       
   785 for regexes, attack string that exploit the worst-time 
       
   786 scenario, and "attack automata" that generates
       
   787 attack strings.
       
   788 
       
   789 
   721 
   790 
   722 
   791 
\section{Structure of the thesis}
In Chapter~\ref{Inj} we will introduce the concepts
and notations we 
use for describing the lexing algorithm by Sulzmann and Lu,
and then give the lexing algorithm.
We will give its variant in Chapter~\ref{Bitcoded1}.
Then we illustrate in Chapter~\ref{Bitcoded2}
how the algorithm without bitcodes falls short for such aggressive 
simplifications and therefore introduce our version of the
bit-coded algorithm and 
its correctness proof.  
In Chapter~\ref{Finite} we give the second guarantee
of our bit-coded algorithm, that is a finite bound on the size of any 
regex's derivatives.
In Chapter~\ref{Cubic} we discuss stronger simplifications to improve the finite bound
in Chapter~\ref{Finite} to a polynomial one, and demonstrate how one can extend the
algorithm to include constructs such as bounded repetitions and negations.
