% ChengsongTanPhdThesis/Chapters/Chapter1.tex
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.
Regular expressions are widely used in computer science:
be it in IDEs with syntax highlighting and auto-completion,
command line tools like $\mathit{grep}$ that facilitate easy
text processing, network intrusion
detection systems that reject suspicious traffic, or compiler
front ends--there is always an important phase of the
task that involves matching a regular
expression with a string.
Given its usefulness and ubiquity, one would imagine that
requires
more research attention.

\section{Why are current algorithms for regexes slow?}

%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two phases:
the construction phase, in which the algorithm builds a
suitable data structure from the input regex $r$; we denote
the time this costs by $P_1(r)$.
And the lexing
phase, in which the input string $s$ is read and the data structure
representing the regex $r$ is operated on; we denote the time
it takes by $P_2(r, s)$.\\
In the case of a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition--
a deterministic finite automaton
by definition has at most one state active and at most one
transition upon receiving an input symbol.
       
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later. \\
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
$P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $|r| \cdot 2^{|s|}$.
       
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of 
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern  regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection, 
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions 
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring. 
       
\subsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{JAVA}$ that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to such lexer generators, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.\\

A bounded repetition, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denotes a regular expression accepting strings
that can be divided into $c$ substrings, where each
substring is in $r$.
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
   \node[state,initial] (q_0)   {$q_0$}; 
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->] 
    (q_0) edge  node {a} (q_1)
    	  edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
The red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
and just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
$q_3$ is the last state, requiring 0 more characters and is accepting.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ contains
more than 1 string.
This is to represent all the different
scenarios in which ``countdown'' states are active.
For those regexes, tools such as $\mathit{JFLEX}$
would generate gigantic $\mathit{DFA}$s or
run out of memory.
For this reason, regex libraries that support
bounded repetitions often choose to use the $\mathit{NFA}$
approach.
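The $2^{n+1}$ count can be checked mechanically. Below is a small Python sketch (our own illustration for this chapter; the function names are invented for the example) that runs the textbook subset construction on the countdown-style $\mathit{NFA}$ for $(a|b)^*a(a|b)^{\{n\}}$ and counts the reachable $\mathit{DFA}$ states:

```python
# Subset construction on the NFA for (a|b)^* a (a|b)^{n}.
# State 0 loops on a,b and additionally moves to state 1 on a;
# states 1..n move to the next state on either a or b; state n+1 accepts.
from collections import deque

def nfa_delta(state, char, n):
    """Transitions of the NFA for (a|b)^* a (a|b)^{n}."""
    succs = set()
    if state == 0:
        succs.add(0)              # the (a|b)^* self-loop
        if char == 'a':
            succs.add(1)          # consuming the 'delimiter' a
    elif state <= n:
        succs.add(state + 1)      # the countdown states
    return succs

def dfa_states(n):
    """Count the subsets reachable from {0} during determinisation."""
    start = frozenset({0})
    seen, todo = {start}, deque([start])
    while todo:
        S = todo.popleft()
        for c in 'ab':
            T = frozenset(t for s in S for t in nfa_delta(s, c, n))
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return len(seen)

for n in range(1, 6):
    print(n, dfa_states(n))   # prints n and 2^(n+1)
```

Every reachable subset contains the looping start state, so the subsets are exactly $\{q_0\} \cup X$ for $X \subseteq \{q_1,\ldots,q_{n+1}\}$, and all $2^{n+1}$ of them are reachable.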
       
\subsection{The $\mathit{NFA}$ approach to regex matching}
One can simulate an $\mathit{NFA}$ in two ways:
one is to keep track of all active states after consuming
a character, updating that set of states iteratively.
This can be viewed as a breadth-first-search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation, which guarantees a linear runtime
in terms of input string length.
%TODO:try out these lexers
The other way to use an $\mathit{NFA}$ for matching is to choose
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails. This method, often called a ``depth-first-search'',
is efficient in a lot of cases, but could end up
with exponential run time.\\
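As a minimal sketch of the first (breadth-first) style, assuming nothing beyond the textbook set-of-states construction, the following Python snippet simulates the $\mathit{NFA}$ from the earlier picture; it performs exactly one set update per input character, so the scan is linear in $|s|$:

```python
# Breadth-first NFA simulation: keep the set of active states and
# update it once per input character -- never backtracking.
def nfa_match(delta, starts, finals, s):
    active = set(starts)
    for c in s:
        active = {t for q in active for t in delta(q, c)}
    return bool(active & finals)

# NFA for (a|b)^* a (a|b)^{2}: the same shape as the picture above.
def delta(q, c):
    if q == 0:
        return {0, 1} if c == 'a' else {0}   # loop, maybe start countdown
    if q in (1, 2):
        return {q + 1} if c in 'ab' else set()
    return set()                              # q_3 has no outgoing edges

print(nfa_match(delta, {0}, {3}, 'babaab'))   # True: 'bab' + 'a' + 'ab'
print(nfa_match(delta, {0}, {3}, 'bb'))       # False
```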
       
%TODO:COMPARE java python lexer speed with Rust and Go
The reason behind backtracking algorithms in languages like
Java and Python is that they support back-references.
       
\subsection{Back References in Regex--Non-Regular part}
If we have a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
	$r_1(r_2(r_3r_4))$
\end{center}
we could label sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions $r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4: $1$ refers to the entire expression
$(r_1(r_2(r_3)(r_4)))$, $2$ to $r_2(r_3)(r_4)$, etc.
These sub-expressions are called ``capturing groups''.
We can use the following syntax to denote that we want a string just matched by a
sub-expression (capturing group) to appear at a certain location again,
exactly as it was:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
       
The backslash and number $i$ are used to denote such
so-called ``back-references''.
Let $e$ be an expression made of regular expressions
and back-references, and let $e$ contain the expression $e_i$
as its $i$-th capturing group.
The semantics of a back-reference can be recursively
written as:
\begin{center}
	\begin{tabular}{c}
		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e),\quad s_i \in L(e_i),$\\
		$s_i\; \text{the match of ($e$, $s$)'s $i$-th capturing group}\}$
	\end{tabular}
\end{center}
The concrete example
$((a|b|c|\ldots|z)^*)\backslash 1$
would match strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.\\
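For instance, Python's `re` module supports this construct; the snippet below is only an illustration of the semantics above (the variable names are ours), not part of the formal development:

```python
import re

# \1 must repeat exactly the string captured by the first group.
pattern = re.compile(r'([a-z]*)\1')
print(bool(pattern.fullmatch('bobo')))      # True:  group 1 = 'bo'
print(bool(pattern.fullmatch('weewee')))    # True:  group 1 = 'wee'
print(bool(pattern.fullmatch('bob')))       # False: no half-half split

# The non-context-free language {a^n b a^n b a^n | n in N}:
csg = re.compile(r'(a*)b\1b\1')
print(bool(csg.fullmatch('aabaabaa')))      # True (n = 2)
print(bool(csg.fullmatch('aabaaba')))       # False
```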
       
Back-references are a construct in the ``regex'' standard
that programmers find quite useful, but that is not exactly
regular any more.
In fact, it allows regexes to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference expression $(a^*)b\backslash 1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Solving the back-reference expression matching problem
is NP-complete \parencite{alfred2014algorithms} and a non-backtracking,
efficient solution is not known to exist.
       
%TODO:read a bit more about back reference algorithms
It seems that languages like Java and Python made the trade-off
to support back-references at the expense of having to backtrack,
even in the case of regexes not involving back-references.\\
Summing these up, we can categorise existing
practical regex libraries into ones with linear
time guarantees like Go and Rust, which impose restrictions
on the user input (not allowing back-references,
bounded repetitions cannot exceed 1000, etc.), and ones
that allow the programmer much freedom, but grind to a halt
in some non-negligible portion of cases.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
% For example, the Rust regex engine claims to be linear, 
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.  
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
       
\section{Buggy Regex Engines}

Another problem with these libraries is that there
is no correctness guarantee.
In some cases they either fail to generate a lexing result when there is a match,
or give the wrong way of matching.


It turns out that regex libraries not only suffer from
exponential backtracking problems,
but also undesired (or even buggy) outputs.
%TODO: comment from who
Kuklewicz \parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.
       
This experience is echoed by the author's
tryout of a few online regex testers:
A concrete example would be
the regex
\begin{verbatim}
(((((a*a*)b*)b){20})*)c
\end{verbatim}
all the $b$'s in it.
%TODO: give a coloured example of how this matches POSIXly

This regex would trigger catastrophic backtracking in
languages like Python and Java,
whereas it gives a non-POSIX and uninformative
match in languages like Go or .NET--the match with only
character $c$.

As Grathwohl \parencite{grathwohl2014crash} commented,
\begin{center}
	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
\end{center}

\section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking}

There is also static analysis work on regular expressions that
have potential exponential behaviours. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviours on backtracking matchers.
People have also developed static analysis methods for
generating non-linear polynomial worst-time estimates
for regexes, attack strings that exploit the worst-case
scenario, and ``attack automata'' that generate
attack strings \parencite{Weideman2017Static}.
There are also tools to ``debug'' regular expressions
that allow people to see why a match failed or was especially slow
by showing the steps a back-tracking regex engine took \parencite{regexploit2021}.
%TODO:also the regex101 debugger
\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
Is it possible to have a regex lexing algorithm with proven correctness and
time complexity, which allows easy extensions to
constructs like
bounded repetitions, negation, lookarounds, and even back-references?
       
Building on top of Sulzmann and Lu's attempt to formalize the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban \parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally-specified rules for what a POSIX matching is,
they designed a regex matching algorithm based on Brzozowski derivatives, and
proved in Isabelle/HOL that the algorithm gives correct results.



%----------------------------------------------------------------------------------------

       
\section{Our Approach}
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.  The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions.  Derivatives of a
regular expression, written $r \backslash c$, give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
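As a quick illustration of how compact these definitions are, here is a minimal derivative-based matcher sketched in Python (our own transliteration of the standard equations, with regexes encoded as tuples; not the thesis' formalised algorithm):

```python
# A tiny Brzozowski-derivative matcher: nullable(r) tests whether r
# matches the empty string, and deriv(r, c) computes r \ c.
# Regexes: ('zero',), ('one',), ('chr', c),
#          ('alt', r1, r2), ('seq', r1, r2), ('star', r).

def nullable(r):
    tag = r[0]
    if tag in ('zero', 'chr'):
        return False
    if tag in ('one', 'star'):
        return True
    if tag == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # seq

def deriv(r, c):
    tag = r[0]
    if tag in ('zero', 'one'):
        return ('zero',)
    if tag == 'chr':
        return ('one',) if r[1] == c else ('zero',)
    if tag == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    if tag == 'seq':
        d = ('seq', deriv(r[1], c), r[2])
        return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d
    return ('seq', deriv(r[1], c), r)               # star

def matches(r, s):
    for c in s:                # derive w.r.t. each character in turn
        r = deriv(r, c)
    return nullable(r)         # does the final derivative match ""?

ab = ('alt', ('chr', 'a'), ('chr', 'b'))
# (a|b)^* a (a|b)^{2}, the running example from earlier
r = ('seq', ('star', ab), ('seq', ('chr', 'a'), ('seq', ab, ab)))
print(matches(r, 'babaab'))   # True
print(matches(r, 'bb'))       # False
```

Note that this naive version lets the derivatives grow with each step; the simplification machinery discussed in this thesis is precisely what keeps their size under control.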
       


This work aims to address the above vulnerability by the combination
of Brzozowski's derivatives and interactive theorem proving. We give an
improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives, which comes with a formal guarantee in terms of correctness and
running time as an Isabelle/HOL proof.
Then we improve the algorithm with an even stronger version of
simplification, and prove a time bound linear in the input size and
cubic in the regular expression size using a technique by
Antimirov.

\subsection{Existing Work}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.

%We propose Brzozowski's derivatives as a solution to this problem.

The main contribution of this thesis is a proven correct lexing algorithm
with formalized time bounds.
To the best of our knowledge, there are no lexing libraries using Brzozowski derivatives
that have a provable time guarantee,
allowing a ReDoS (regular expression denial-of-service) attack.


%----------------------------------------------------------------------------------------


%----------------------------------------------------------------------------------------

\section{What this Template Includes}
