% ChengsongTanPhdThesis/Chapters/Introduction.tex


%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

% \marginpar{Totally rewritten introduction}
% Formal proofs, 

%TODO: look up snort rules to use here--give readers idea of what regexes look like

\marginpar{rephrasing following Christian comment}
Regular expressions, since their inception in the 1950s
\cite{RegularExpressions}, 
have been subject to extensive study and implementation. 
Their primary application lies in text processing--finding
matches and identifying patterns in a string.
%It is often used to match strings that comprises of numerous fields, 
%where certain fields may recur or be omitted. 
For example, a simple regular expression that tries 
to recognise email addresses is
\marginpar{rephrased from "the regex for recognising" to "a simple regex that tries to match email"}
\begin{center}
\verb|[a-z0-9._]+@[a-z0-9.-]+\.[a-z]{2,6}|
%$[a-z0-9._]^+@[a-z0-9.-]^+\.[a-z]{2,6}$.
\end{center}
\marginpar{Simplified example, but the distinction between . and escaped . is correct
and therefore left unchanged. Also verbatim package does not straightforwardly support superscripts so + kept as they are.}
%Using this, regular expression matchers and lexers are able to extract 
%the domain names by the use of \verb|[a-zA-Z0-9.-]+|. 
\marginpar{Rewrote explanation for the expression.}
This expression assumes all letters in the email have been converted into lower-case.
The local-part is recognised by the first 
bracketed expression $[a-z0-9.\_]^+$, 
where the operator ``+'' (to be viewed as a superscript) 
means that it must be a non-empty string
made of lower-case letters, digits, dots and underscores.
After the ``@'' sign
is the sub-expression that recognises the domain of that email, 
where $[a-z]^{\{2, 6\}}$ specifically
matches the top-level domain, such as ``org'', ``com'' or ``uk''.
The counters $2, 6$ in the superscript 
specify that all top-level domains
should have between two and six characters.
This regular expression does not represent all possible email addresses 
(e.g.\ those with a ``-'' in the local-part cannot be recognised), but patterns
of similar shape and using roughly the same syntax are used
everywhere in our daily life,
       
% Indeed regular expressions are used everywhere 
% in a computer user's daily life--for example searching for a string
% in a text document using ctrl-F.
% When such regex searching 
% a useful and simple way to handle email-processing when the task does
% not require absolute precision. 
%The same can be said about
%applications like .
%Consequently, they are an indispensible components in text processing tools 
%of software systems such as compilers, IDEs, and firewalls.
% The picture gets more blurry when we start to ask questions like 
% why 
%TODO: Insert corresponding bibliography
%There are many more scenarios in which regular exp% %use cases for regular expressions in computer science,
for example in compilers \cite{LEX}, networking \cite{Snort1999}, 
software engineering (IDEs) \cite{Atom}
and operating systems \cite{GREP}, where the correctness
of matching can be crucially important.
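
As a concrete illustration, the email pattern above can be tried out
directly with Scala's standard regex support; the snippet below is our
own unverified sketch and the sample addresses are made up:
\begin{verbatim}
// Regex.matches requires Scala 2.13+
val email = """[a-z0-9._]+@[a-z0-9.-]+\.[a-z]{2,6}""".r
println(email.matches("someone@example.com"))   // true
println(email.matches("some-one@example.com"))  // false: "-" is not
                                                // allowed in the local-part
\end{verbatim}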

Implementations of regular expression matching out in the wild, however, 
are surprisingly unreliable.
%are not always reliable as %people think
%they
%should be. %Given the importance of the correct functioning
% Indeed they have been heavily relied on and
% extensively implemented and studied in modern computer systems--whenever you do
% a 
%TODO: double check this is true
An example is the regular expression $(a^*)^*b$ and the string
$aa\ldots a$. The expectation is that any regex engine should
be able to solve this (return a non-match) in no time.
However, if this 
%regular expression and string pair 
is tried out in a
regex engine like that of Java 8, the runtime quickly grows beyond 100 seconds
with just dozens of characters in the string.
Such behaviour can result in Regular expression Denial-of-Service (ReDoS) attacks
with minimal resources--just the pair of ``evil'' 
regular expression and string.
The outage of the web service provider Cloudflare \cite{Cloudflare} 
in 2019 \cite{CloudflareOutage} is a recent example
where such issues caused serious negative impact.
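
The blow-up is easy to reproduce; the following is a minimal sketch of
ours in Scala, exercising the JVM's backtracking engine (the same engine
underlying Java 8's regex library):
\begin{verbatim}
// each extra 'a' roughly doubles the running time
val evil = "(a*)*b".r.pattern        // compiles to java.util.regex
for (n <- 10 to 30 by 5) {
  val input = "a" * n                // no 'b', so matching must fail
  val start = System.nanoTime()
  evil.matcher(input).matches()
  val ms = (System.nanoTime() - start) / 1e6
  println(f"n = $n%2d  time = $ms%8.2f ms")
}
\end{verbatim}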
       


The reason why these regex matching engines get so slow
is that they use backtracking, that is, a depth-first search (DFS) on the 
search space of possible matches. They employ backtracking algorithms to 
support back-references, a mechanism for
expressing languages in which an arbitrarily long string is repeated, 
such as 
$\{ww \mid w\in \Sigma^*\}$. Such a construct makes matching
NP-complete, for which no known non-backtracking algorithms exist.
More modern programming languages' regex engines, such as
Rust's and Go's, do promise linear-time behaviour 
with respect to the input string,
but they do not support back-references, and often 
impose ad-hoc restrictions on the input patterns.
More importantly, these promises are purely empirical,
making them prone to future ReDoS attacks and other types of errors.
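
Back-references can be illustrated concretely; below is a small sketch of
ours using the JVM engine, where \verb|\1| refers back to whatever the
first group matched:
\begin{verbatim}
// matches the (context-sensitive) language { ww | w in Sigma* }
val pat = "(.*)\\1".r.pattern   // group 1, followed by a copy of itself
println(pat.matcher("abcabc").matches())  // true:  w = "abc"
println(pat.matcher("abcabd").matches())  // false: halves differ
\end{verbatim}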
       

The other unreliability of industry regex engines
is that they do not produce the desired output. 
During his testing of practical regex engines, Kuklewicz found
that almost all of them are incorrect with respect
to the POSIX standard, the disambiguation strategy adopted most
widely in computer science. Roughly speaking, POSIX lexing means always
going for the longest initial submatch possible; 
for instance, a single iteration
$aa$ is preferred over two iterations $a,a$ 
when matching $(a|aa)^*$ with the string $aa$.
The POSIX strategy as summarised by Kuklewicz goes
as follows:
       
\marginpar{\em Deleted some peripheral specifications.}
\begin{quote}
	\begin{itemize}
		\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
\item
	$\ldots$
%REs have right associative concatenation which can be changed with parenthesis\\
%\item
%parenthesized subexpressions return the match from their last usage\\
%\item
%text of component subexpressions must be contained in the text of the 
%higher-level subexpressions\\
%\item
%if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
%\item
%if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
	\end{itemize}
\end{quote}
       
The author noted that various lexers that come with POSIX operating systems
are rarely correct according to this standard.
A test case that exposes the bugs 
is the regular expression $(aba|ab|a)^*$ together with the string $ababa$.
A correct match would split the string into $ab, aba$, involving
two repetitions of the Kleene star $(aba|ab|a)^*$.
Most regex engines would instead return two (partial) matches
$aba$ and $a$.\footnote{Try it out here: \url{https://regex101.com/r/c5hga5/1}, last accessed 22-Aug-2023}
There are numerous occasions where programmers have realised the subtlety and
difficulty of implementing POSIX correctly; one such quote is from the author of
Go's regexp library:\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}, last accessed 22-Aug-2023}
\begin{quote}\it
``The POSIX rule is computationally prohibitive
and not even well-defined.''
\end{quote}
% Being able to formally define and capture the idea of 
% POSIX rules and prove 
% the correctness of regular expression matching/lexing 
% algorithms against the POSIX semantics definitions
% is valuable.
These all point towards a formal treatment of 
POSIX lexing to clear up inaccuracies and errors
in the understanding and implementation of regexes. The work presented
in this thesis uses formal proofs to ensure the correctness 
and runtime properties
of POSIX regular expression lexing.
       

Formal proofs, or mechanised proofs, are
computer-checked programs
%such as the ones written in Isabelle/HOL, is a powerful means 
that certify the correctness of facts
with respect to a set of axioms and definitions.
% This is done by 
% recursively checking that every fact in a proof script 
% is either an axiom or a fact that is derived from
% known axioms or verified facts.
%The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such
%methods as their implementation and complexity analysis tend to be error-prone.
They provide an unprecedented level of assurance
that an algorithm will perform as expected under all inputs.
We believe such proofs can help eliminate catastrophic backtracking
once and for all.
The software systems that help people interactively build and check
formal proofs are called proof assistants.
% Many  theorem-provers have been developed, such as Mizar,
% Isabelle/HOL, HOL-Light, HOL4,
% Coq, Agda, Idris, Lean and so on.
Isabelle/HOL \cite{Isabelle} is a widely-used proof assistant with a simple type theory
and powerful automated proof tools like Sledgehammer.
We chose Isabelle/HOL for its powerful automation
and the ease and simplicity with which regular expressions and 
regular languages can be expressed.
%Some of those employ
%dependent types like Mizar, Coq, Agda, Lean and Idris.
%Some support a constructivism approach, such as Coq.
       


% Formal proofs on regular expression matching and lexing
% complements the efforts in
% industry which tend to focus on overall speed
% with techniques like parallelization (FPGA paper), tackling 
% the problem of catastrophic backtracking 
% in an ad-hoc manner (cloudflare and stackexchange article).
The algorithm we work on is based on Brzozowski derivatives.
Brzozowski invented the notion of ``derivatives'' of 
regular expressions \cite{Brzozowski1964}, and the idea is
that we can reason about what regular expressions can match
by taking derivatives of them. 
A derivative operation takes a regular expression $r$ and a character
$c$,
and returns a new regular expression $r\backslash c$.
The derivative is taken with respect to $c$:
it transforms $r$ in such a way that the language of the resulting derivative
$r\backslash c$ contains all strings in the language of $r$ that
start with $c$, but with the head character $c$ thrown away.
For example, for $r$ equal to $(aba | ab | a)^*$
as discussed earlier, its derivative with respect to the character $a$ is
\[
  r\backslash a = (ba | b | \ONE)(aba | ab | a)^*.
\]
% By applying derivatives repeatedly one can c
Taking derivatives repeatedly with respect to the sequence
of characters in a string $s$, one obtains a regular expression
containing the information of how $r$ matched $s$.
       

Brzozowski derivatives are purely algebraic operations
that can be implemented as a recursive function
that pattern-matches on the structure of the regular expression.
For example, the derivatives of character regular expressions,
Kleene star and bounded repetitions can be described by the following
code equations:
% For example some clauses
\begin{center}
	\begin{tabular}{lcl}
	$d \backslash c$     & $\dn$ & 
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
% $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
% $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
% 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
% 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$          & $\dn$ & $(r\backslash c) \cdot r^*$\\
	$r^{\{n\}} \backslash c$     & $\dn$ & $(r \backslash c) \cdot
		r^{\{n-1\}} \quad (\text{when} \; n > 0)$\\
	\end{tabular}
\end{center}
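
To give a feel for how such equations turn into a runnable program, the
following is a minimal, unverified Scala sketch of the full derivative
function and the resulting matcher; the datatype and function names are
ours (the verified development in this thesis is in Isabelle/HOL):
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(d: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*
case class NTIMES(r: Rexp, n: Int) extends Rexp  // r{n}

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
  case NTIMES(r1, n) => n == 0 || nullable(r1)
}

// the derivative r\c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  case NTIMES(r1, n) =>
    if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
}

// r matches s iff the derivative of r w.r.t. all of s is nullable
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{verbatim}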
       
%(end of ready work, rest construction site)
In particular, we are working on an improvement of the work
by Ausaf et al. \cite{Ausaf,AusafDyckhoffUrban2016} 
and by Sulzmann and Lu \cite{Sulzmann2014}.


The variant of the problem we are looking at centres around
an algorithm (which we call $\blexer$) developed by Sulzmann and Lu \cite{Sulzmann2014}.
The reason we chose to look at $\blexer$ and its simplifications 
is that it allows a lexical tree to be generated
by an elegant and subtle procedure based on Brzozowski derivatives.
The procedure is made of recursive functions and inductive datatypes just like derivatives, 
allowing intuitive and concise formal reasoning with theorem provers.
Most importantly, $\blexer$ opens up a path to an optimised version,
$\blexersimp$, which improves
performance with simplifications that aggressively change the structure of regular expressions.
This is in contrast to most derivative-based methods,
which rely on the structure of regular expressions being maintained
so that the induction can go through.
For example, Egolf et al. \cite{Verbatim} have developed a verified lexer
with derivatives, but as soon as they started introducing
optimisations such as memoization, they reverted to constructing
DFAs first.
Edelmann \cite{Edelmann2020} explored similar optimisations in his
work on verified LL(1) parsing, with additional enhancements based on data structures like
zippers.
       

%Sulzmann and Lu have worked out an algorithm
%that is especially suited for verification
%which utilized the fact
%that whenever ambiguity occurs one may duplicate a sub-expression and use
%different copies to describe different matching choices.
The idea behind the correctness of $\blexer$ is simple: during a derivative step,
multiple matches might be possible, in which case an alternative with multiple children,
each corresponding to a 
different match, is created. At the end of
the lexing process one always picks the leftmost alternative, which is guaranteed 
to be a POSIX value.
This is done by consistently keeping sub-expressions in an alternative
with longer submatches
to the left of other copies
(remember that POSIX values are roughly the values with the longest initial
submatch).
The idea behind the optimised version of $\blexer$, which we call $\blexersimp$,
is that since only the leftmost copy is ever picked, all re-occurring copies can be
eliminated without losing the POSIX property; this can be done even for
children of alternatives at different levels by merging them together.
Proving $\blexersimp$ correct requires a different
proof strategy from the one by Ausaf \cite{FahadThesis}.
We invent a rewriting relation as an
inductive predicate which captures 
an invariant strong enough to ensure correctness
and which commutes with the derivative operation. 
This predicate allows a simple
induction on the input string to go through.
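
The flavour of these simplifications can be conveyed by a small,
unverified sketch of ours, reusing the Rexp datatype from above (the
actual $\blexersimp$ works on bit-coded regular expressions, which we
omit here): nested alternatives are flattened and only the leftmost copy
of any duplicate is kept.
\begin{verbatim}
// collect the children of nested alternatives, dropping ZEROs
def flatten(r: Rexp): List[Rexp] = r match {
  case ALT(r1, r2) => flatten(r1) ++ flatten(r2)
  case ZERO => Nil
  case _ => List(r)
}

// keep only the leftmost occurrence of each duplicate child
def simp(r: Rexp): Rexp = flatten(r).distinct match {
  case Nil => ZERO
  case List(r1) => r1
  case r1 :: rs => rs.foldLeft(r1)(ALT(_, _))
}
\end{verbatim}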
       

%This idea has been repeatedly used in different variants of lexing
%algorithms in their paper, one of which involves bit-codes. The bit-coded
%derivative-based algorithm even allows relatively aggressive 
%simplification rules which cause
%structural changes that render the induction used in the correctness
%proofs unusable.
%More details will be given in \ref{Bitcoded2} including the
%new correctness proof which involves a new inductive predicate which allows 
%rule induction to go through.


\marginpar{20230821 Addition: high-level idea in context}
To summarise, we expect modern regex matching and lexing engines
to be reliable, fast and correct, and to support rich syntax constructs
like bounded repetitions and back-references.
Backtracking regex engines have exhibited exponential 
worst-case behaviours (catastrophic backtracking)
because they employ a depth-first search on the search tree of possible matches,
and complexity analyses of them also take worst-case exponential 
time (ref Minamide and Birmingham work).
%Expand notes: JFLEX generates DFAs, cannot support backrefs
%Expand notes: Java 8, python supports pcre, but is exponential
%Expand notes: more modern prog langs like GO, Rust claims linear, and forbids large counters and backrefs
%The work described in this thesis is part of an ongoing
%project which aims to build a formally verified lexer satisfying all these
To ensure the correctness and predictable behaviour of lexical analysis, 
we propose to
build a formally verified lexer that satisfies the correctness and non-backtracking
requirements in a bottom-up manner using Brzozowski derivatives.
We build on the line of work by Ausaf et al.
A derivative-based matching algorithm avoids backtracking by explicitly
representing possible matches on the same level of the search tree 
as regular expression terms,
in a breadth-first manner.
The efficiency of such algorithms relies on limiting the size
of the derivatives during the computation, for example by
eliminating redundant 
terms that lead to identical matches. 
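
Continuing the unverified sketch from above, interleaving the
simplification with every derivative step is what keeps the terms small;
the size measure below (ours) counts constructors:
\begin{verbatim}
// size of a regular expression, counting constructors
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
  case NTIMES(r1, _) => 1 + size(r1)
}

// matcher with simplification interleaved after every derivative;
// bounding the size of these intermediate terms is the kind of
// property studied in this thesis
def matchesSimp(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => simp(der(c, r1))))
\end{verbatim}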
       


The end goal would be a lexer that comes with additional formal guarantees
on computational complexity, and whose actual speed is competitive with other,
unverified regex engines.
We will report in the next section the outcome of this project
so far and the contribution with respect to other work in
lexical analysis and theorem proving.


Derivatives of regular expressions are popular in the theorem-proving community
because their algebraic features are amenable to formal reasoning.
As a testament to this, there exists a sizable number of formalisations of Brzozowski derivatives and
regular expressions in different theorem-proving languages.
%Expand notes: verified lexers: proof-of-concept: Ruberio, Egolf, fast but no formal proof in a proof assistant: Magnus, 
       
   560 
       
   561 
       
   562 % The study of regular expressions is ongoing due to an
       
   563 % issue known as catastrophic backtracking. 
       
   564 % This phenomenon refers to scenarios in which the regular expression 
       
   565 
       
   566 
       
   567 % One cause of catastrophic backtracking lies within the 
       
   568 % ambiguities of lexing.\marginpar{rephrased "the origin of catastrophic ...} 
       
   569 % In the process of matching a multi-character string with 
       
   570 % a regular expression that encompasses several sub-expressions, 
       
   571 % different positions can be designated to mark 
       
   572 % the boundaries of corresponding substrings of the sub-expressions. 
       
   573 % For instance, in matching the string $aaa$ with the 
       
   574 % regular expression $(a+aa)^*$, the divide between 
       
   575 % the initial match and the subsequent iteration could either be 
       
   576 % set between the first and second characters ($a | aa$) or 
       
   577 % between the second and third characters ($aa | a$). 
       
   578 % As both the length of the input string and the structural complexity 
       
   579 % of the regular expression increase, the number of potential delimiter 
       
   580 % combinations can grow exponentially, leading to a corresponding increase 
       
   581 % in complexity for algorithms that do not optimally navigate these possibilities.
       
   582 
       
   583 % Catastrophic backtracking facilitates a category of computationally inexpensive attacks known as Regular Expression Denial of Service (ReDoS) attacks. Here, an attacker merely dispatches a small attack string to a server, provoking high-complexity behaviours in the server's regular expression engine. Such attacks, whether intentional or accidental, have led to the failure of real-world systems (additional details can be found in the practical overview section in chapter \ref{Introduction}).
       
   584 
       
   585 % Various disambiguation strategies are 
       
   586 % employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:


%Regular expressions 
%have been extensively studied and
%implemented since their invention in 1940s.
%There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
%The widely adopted strategy in practice is POSIX, which always go for the longest sub-match possible.
%There have been prose definitions like the following
%by Kuklewicz \cite{KuklewiczHaskell}: 
%described the POSIX rule as (section 1, last paragraph):

%first character is removed 
%state of the automaton after matching that character 
%where nodes are represented as 
%a sub-expression (for example tagged NFA
%Working on regular expressions 
%Because of these problems with automata, we prefer regular expressions
%and derivatives rather than an automata (or graph-based) approach which explicitly converts between
%the regular expression and a different data structure.
%
%
%The key idea 

%Regular expressions are widely used in computer science: 
%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
%command-line tools like $\mathit{grep}$ that facilitate easy 
%text-processing \cite{grep}; network intrusion
%detection systems that inspect suspicious traffic; or compiler
%front ends.
%Given their usefulness and ubiquity, one would assume that
%modern regular expression matching implementations
%are mature and fully studied.
%Indeed, in many popular programming languages' regular expression engines, 
%one can supply a regular expression and a string, and in most cases get
%get the matching information in a very short time.
%Those engines can sometimes be blindingly fast--some 
%network intrusion detection systems
%use regular expression engines that are able to process 
%hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
%However, those engines can sometimes also exhibit a surprising security vulnerability
%under a certain class of inputs.
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed


%----------------------------------------------------------------------------------------
\section{Contribution}
%{\color{red} \rule{\linewidth}{0.5mm}}
%\textbf{issue with this part: way too short, not enough details of what I have done.}
%{\color{red} \rule{\linewidth}{0.5mm}}
\marginpar{\em Gerog: No sufficient context on related work making contribution hard to digest.}

\marginpar{introducing formalisations on regex matching}
There have been many formalisations in the theorem-proving community 
about regular expressions and lexing.
One flavour is to build from the regular expression an automaton, and determine
acceptance in terms of the resulting 
state after executing the input string on that automaton.
Automata formalisations are in general harder and more cumbersome to deal
with for theorem provers than working directly on regular expressions.
lexical values.
%X also formalised derivatives and regular expressions, producing "parse trees".
%(Some person who's a big name in formal methods)


%In this thesis,
%we propose a solution to catastrophic
%backtracking and error-prone matchers: a formally verified
%regular expression lexing algorithm
%Package \verb`dingbat`: \rightpointright 
We have made mainly two contributions in this thesis: %the
%lexer we developed based on Brzozowski derivatives and 
%Sulzmanna and Lu's developments called 
proving that the lexer $\blexersimp$ is both i) correct and ii) fast.
It is correct w.r.t.\ a formalisation of POSIX lexing by Ausaf et al.~\cite{AusafDyckhoffUrban2016}.
It is fast compared with un-optimised implementations like Sulzmann and Lu's original
development, by our metric of internal data structures not growing unboundedly.

\subsection{Theorem Proving with Derivatives}

Ribeiro and Du Bois \cite{RibeiroAgda2017} have 
formalised the notion of bit-coded regular expressions
and proved their relationship with simple regular expressions in
the dependently-typed proof assistant Agda.
They also proved the soundness and completeness of a matching algorithm
based on the bit-coded regular expressions.
Ausaf et al. \cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and also to prove 
that their specification coincides with an earlier (unformalised) 
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu needed to be substantially 
revised in order for the correctness proof to go through;
their original specification and proof were unfixable
according to Ausaf et al.

       
\subsection{Complexity Results}
Our formalisation of complexity is unique among similar works in the sense that,
%is about the size of internal data structures.
to our knowledge, %we don't know of a 
there are no other certified 
lexing/parsing algorithms with similar size-bound theorems for their data structures.
Common practice involves empirical analysis of the complexity of the algorithm
in question (\cite{Verbatim}, \cite{Verbatimpp}), or relying 
on prior (unformalised) complexity analyses of well-known algorithms (\cite{ValiantParsing}),
making such work prone to complexity bugs.
       

%TODO: add citation
%, for example in the Verbatim work \cite{Verbatim}

%While formalised proofs have not included 
%Issues known as "performance bugs" can
Whilst formalised complexity theorems 
have not yet appeared in other certified lexers/parsers,
%while this work is done,
they do appear in the broader theorem-proving literature:
\emph{time credits} have been formalised for separation logic in Coq 
\cite{atkey2010amortised}
%not used in 
to characterise the runtime complexity of union-find.
%where integer values are recorded %from the beginning of an execution
%as part of the program state
%and decremented in each step. 
The idea is that the total number of instructions executed 
is equal to the time credits consumed
during the execution of an algorithm.
%each time a basic step is executed. 
%The way to use a time credit Time credit is an integer
%is assigned to a program prior to execution, and each step in the program consumes
%one credit.
Arma{\"e}l et al. \cite{bigOImperative} have extended the framework to allow expressing time
credits using big-O notation,
so one can prove both the functional correctness and the asymptotic complexity
of higher-order imperative algorithms.
A number of formal complexity proofs also exist for other
algorithms in Coq \cite{subtypingFormalComplexity}.

The big-O notation has also been formalised in Isabelle
\cite{bigOIsabelle}.
%for example the complexity of
%the Quicksort algorithm 
%is $\Theta n\ln n$ 
\marginpar{more work on formalising complexity}
%Our next step is to leverage these frameworks
%It is a precursor to our 
Our work focuses on the space complexity of the algorithm, under our notion of the size of 
a regular expression.
Despite not being a direct asymptotic time complexity proof,
our result is an important stepping stone towards one.



Brzozowski showed that there are finitely many similar derivatives, 
where similarity is defined in terms of the ACI equations (associativity,
commutativity and idempotence of the alternative constructor).