ChengsongTanPhdThesis/Chapters/Chapter1.tex
changeset 519 856d025dbc15
parent 518 ff7945a988a3
child 528 28751de4b4ba
equal deleted inserted replaced
518:ff7945a988a3 519:856d025dbc15
    31 \newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
    31 \newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
    32 \newcommand\createdByStar[1]{\textit{createdByStar}(#1)}
    32 \newcommand\createdByStar[1]{\textit{createdByStar}(#1)}
    33 
    33 
    34 \newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
    34 \newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
    35 
    35 
       
    36 \def\decode{\textit{decode}}
       
    37 \def\internalise{\textit{internalise}}
    36 \def\lexer{\mathit{lexer}}
    38 \def\lexer{\mathit{lexer}}
    37 \def\mkeps{\mathit{mkeps}}
    39 \def\mkeps{\textit{mkeps}}
    38 \newcommand{\rder}[2]{#2 \backslash #1}
    40 \newcommand{\rder}[2]{#2 \backslash #1}
    39 
    41 
    40 \def\AZERO{\textit{AZERO}}
    42 \def\AZERO{\textit{AZERO}}
    41 \def\AONE{\textit{AONE}}
    43 \def\AONE{\textit{AONE}}
    42 \def\ACHAR{\textit{ACHAR}}
    44 \def\ACHAR{\textit{ACHAR}}
   550 to include  extended regular expressions and 
   552 to include  extended regular expressions and 
   551  simplification of internal data structures 
   553  simplification of internal data structures 
   552  eliminating the exponential behaviours.
   554  eliminating the exponential behaviours.
   553  
   555  
   554 
   556 
       
   557   \section{Motivation}
   555   
   558   
   556 
       
   557  
       
   558 
       
   559 
       
   560 
       
   561 %----------------------------------------------------------------------------------------
       
   562 
       
   563 \section{Contribution}
       
   564 
       
   565 
       
   566 
       
   567 This work addresses the vulnerability of super-linear and
       
   568 buggy regex implementations by the combination
       
   569 of Brzozowski's derivatives and interactive theorem proving. 
       
   570 We give an 
       
   571 improved version of  Sulzmann and Lu's bit-coded algorithm using 
       
   572 derivatives, which come with a formal guarantee in terms of correctness and 
       
   573 running time as an Isabelle/HOL proof.
       
   574 Then we improve the algorithm with an even stronger version of 
       
   575 simplification, and prove a time bound linear to input and
       
   576 cubic to regular expression size using a technique by
       
   577 Antimirov.
       
   578 
       
   579  
       
   580 The main contribution of this thesis is a proven correct lexing algorithm
       
   581 with formalized time bounds.
       
   582 To our best knowledge, no lexing libraries using Brzozowski derivatives
       
   583 have a provable time guarantee, 
       
   584 and claims about running time are usually speculative and backed by thin empirical
       
   585 evidence.
       
   586 %TODO: give references
       
   587 For example, Sulzmann and Lu had proposed an algorithm  in which they
       
   588 claim a linear running time.
       
   589 But that was falsified by our experiments and the running time 
       
   590 is actually $\Omega(2^n)$ in the worst case.
       
   591 A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
       
   592 %TODO: give references
       
   593 lexer, which calculates POSIX matches and is based on derivatives.
       
   594 They formalized the correctness of the lexer, but not the complexity.
       
   595 In the performance evaluation section, they simply analyzed the run time
       
   596 of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
       
   597 and concluded that the algorithm is quadratic in terms of input length.
       
   598 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
       
   599 the time it took to lex only 40 $a$'s was 5 minutes.
       
   600 
       
   601 We  believe our results of a proof of performance on general
       
   602 inputs rather than specific examples a novel contribution.\\
       
   603 
       
   604 
       
   605 \subsection{Related Work}
       
   606 We are aware
       
   607 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
       
   608 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
       
   609 of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
       
   610 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
       
   611 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
       
   612  
       
   613  %We propose Brzozowski's derivatives as a solution to this problem.
       
   614 % about Lexing Using Brzozowski derivatives
       
   615  \section{Preliminaries}
       
   616 
       
   617 Suppose we have an alphabet $\Sigma$, the strings  whose characters
       
   618 are from $\Sigma$
       
   619 can be expressed as $\Sigma^*$.
       
   620 
       
   621 We use patterns to define a set of strings concisely. Regular expressions
       
   622 are one of such patterns systems:
       
   623 The basic regular expressions  are defined inductively
       
   624  by the following grammar:
       
   625 \[			r ::=   \ZERO \mid  \ONE
       
   626 			 \mid  c  
       
   627 			 \mid  r_1 \cdot r_2
       
   628 			 \mid  r_1 + r_2   
       
   629 			 \mid r^*         
       
   630 \]
       
   631 
       
   632 The language or set of strings defined by regular expressions are defined as
       
   633 %TODO: FILL in the other defs
       
   634 \begin{center}
       
   635 \begin{tabular}{lcl}
       
   636 $L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
       
   637 $L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \cap L \; (r_2)$\\
       
   638 \end{tabular}
       
   639 \end{center}
       
   640 Which are also called the "language interpretation".
       
   641 
       
   642 
       
   643 
       
   644 The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
       
   645 where the operation transforms the regex to a new one containing
       
   646 strings without the head character $c$.
       
   647 
       
   648 Formally, we define first such a transformation on any string set, which
       
   649 we call semantic derivative:
       
   650 \begin{center}
       
   651 $\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
       
   652 \end{center}
       
   653 Mathematically, it can be expressed as the 
       
   654 
       
   655 If the $\textit{StringSet}$ happen to have some structure, for example,
       
   656 if it is regular, then we have that it
       
   657 
       
   658 % Derivatives of a
       
   659 %regular expression, written $r \backslash c$, give a simple solution
       
   660 %to the problem of matching a string $s$ with a regular
       
   661 %expression $r$: if the derivative of $r$ w.r.t.\ (in
       
   662 %succession) all the characters of the string matches the empty string,
       
   663 %then $r$ matches $s$ (and {\em vice versa}).  
       
   664 
       
   665 The the derivative of regular expression, denoted as
       
   666 $r \backslash c$, is a function that takes parameters
       
   667 $r$ and $c$, and returns another regular expression $r'$,
       
   668 which is computed by the following recursive function:
       
   669 
       
   670 \begin{center}
       
   671 \begin{tabular}{lcl}
       
   672 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   673 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   674 		$d \backslash c$     & $\dn$ & 
       
   675 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   676 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   677 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
   678 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   679 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   680 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   681 \end{tabular}
       
   682 \end{center}
       
   683 \noindent
       
   684 \noindent
       
   685 
       
   686 The $\nullable$ function tests whether the empty string $""$ 
       
   687 is in the language of $r$:
       
   688 
       
   689 
       
   690 \begin{center}
       
   691 		\begin{tabular}{lcl}
       
   692 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   693 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   694 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   695 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   696 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   697 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   698 		\end{tabular}
       
   699 \end{center}
       
   700 \noindent
       
   701 The empty set does not contain any string and
       
   702 therefore not the empty string, the empty string 
       
   703 regular expression contains the empty string
       
   704 by definition, the character regular expression
       
   705 is the singleton that contains character only,
       
   706 and therefore does not contain the empty string,
       
   707 the alternative regular expression(or "or" expression)
       
   708 might have one of its children regular expressions
       
   709 being nullable and any one of its children being nullable
       
   710 would suffice. The sequence regular expression
       
   711 would require both children to have the empty string
       
   712 to compose an empty string and the Kleene star
       
   713 operation naturally introduced the empty string.
       
   714 
       
   715 We can give the meaning of regular expressions derivatives
       
   716 by language interpretation:
       
   717 
       
   718 
       
   719  
       
   720   
       
   721 \begin{center}
       
   722 \begin{tabular}{lcl}
       
   723 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   724 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   725 		$d \backslash c$     & $\dn$ & 
       
   726 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   727 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   728 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
   729 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   730 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   731 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   732 \end{tabular}
       
   733 \end{center}
       
   734 \noindent
       
   735 \noindent
       
   736 The function derivative, written $\backslash c$, 
       
   737 defines how a regular expression evolves into
       
   738 a new regular expression after all the string it contains
       
   739 is chopped off a certain head character $c$.
       
   740 The most involved cases are the sequence 
       
   741 and star case.
       
   742 The sequence case says that if the first regular expression
       
   743 contains an empty string then the second component of the sequence
       
   744 might be chosen as the target regular expression to be chopped
       
   745 off its head character.
       
   746 The star regular expression's derivative unwraps the iteration of
       
   747 regular expression and attaches the star regular expression
       
   748 to the sequence's second element to make sure a copy is retained
       
   749 for possible more iterations in later phases of lexing.
       
   750 
       
   751 
       
   752 The main property of the derivative operation
       
   753 that enables us to reason about the correctness of
       
   754 an algorithm using derivatives is 
       
   755 
       
   756 \begin{center}
       
   757 $c\!::\!s \in L(r)$ holds
       
   758 if and only if $s \in L(r\backslash c)$.
       
   759 \end{center}
       
   760 
       
   761 \noindent
       
   762 We can generalise the derivative operation shown above for single characters
       
   763 to strings as follows:
       
   764 
       
   765 \begin{center}
       
   766 \begin{tabular}{lcl}
       
   767 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
       
   768 $r \backslash [\,] $ & $\dn$ & $r$
       
   769 \end{tabular}
       
   770 \end{center}
       
   771 
       
   772 \noindent
       
   773 and then define Brzozowski's  regular-expression matching algorithm as:
       
   774 
       
   775 \[
       
   776 match\;s\;r \;\dn\; nullable(r\backslash s)
       
   777 \]
       
   778 
       
   779 \noindent
       
   780 Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, 
       
   781 this algorithm presented graphically is as follows:
       
   782 
       
   783 \begin{equation}\label{graph:*}
       
   784 \begin{tikzcd}
       
   785 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
       
   786 \end{tikzcd}
       
   787 \end{equation}
       
   788 
       
   789 \noindent
       
   790 where we start with  a regular expression  $r_0$, build successive
       
   791 derivatives until we exhaust the string and then use \textit{nullable}
       
   792 to test whether the result can match the empty string. It can  be
       
   793 relatively  easily shown that this matcher is correct  (that is given
       
   794 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
       
   795 
       
   796 Beautiful and simple definition.
       
   797 
       
   798 If we implement the above algorithm naively, however,
       
   799 the algorithm can be excruciatingly slow. 
       
   800 
       
   801 
       
   802 \begin{figure}
       
   803 \centering
       
   804 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   805 \begin{tikzpicture}
       
   806 \begin{axis}[
       
   807     xlabel={$n$},
       
   808     x label style={at={(1.05,-0.05)}},
       
   809     ylabel={time in secs},
       
   810     enlargelimits=false,
       
   811     xtick={0,5,...,30},
       
   812     xmax=33,
       
   813     ymax=10000,
       
   814     ytick={0,1000,...,10000},
       
   815     scaled ticks=false,
       
   816     axis lines=left,
       
   817     width=5cm,
       
   818     height=4cm, 
       
   819     legend entries={JavaScript},  
       
   820     legend pos=north west,
       
   821     legend cell align=left]
       
   822 \addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
       
   823 \end{axis}
       
   824 \end{tikzpicture}\\
       
   825 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   826            of the form $\underbrace{aa..a}_{n}$.}
       
   827 \end{tabular}    
       
   828 \caption{EightThousandNodes} \label{fig:EightThousandNodes}
       
   829 \end{figure}
       
   830 
       
   831 
       
   832 (8000 node data to be added here)
       
   833 For example, when starting with the regular
       
   834 expression $(a + aa)^*$ and building a few successive derivatives (around 10)
       
   835 w.r.t.~the character $a$, one obtains a derivative regular expression
       
   836 with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}.
       
   837 The reason why $(a + aa) ^*$ explodes so drastically is that without
       
   838 pruning, the algorithm will keep records of all possible ways of matching:
       
   839 \begin{center}
       
   840 $(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
       
   841 \end{center}
       
   842 
       
   843 \noindent
       
   844 Each of the above alternative branches correspond to the match 
       
   845 $aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete).
       
   846 These different ways of matching will grow exponentially with the string length,
       
   847 and without simplifications that throw away some of these very similar matchings,
       
   848 it is no surprise that these expressions grow so quickly.
       
   849 Operations like
       
   850 $\backslash$ and $\nullable$ need to traverse such trees and
       
   851 consequently the bigger the size of the derivative the slower the
       
   852 algorithm. 
       
   853 
       
   854 Brzozowski was quick in finding that during this process a lot useless
       
   855 $\ONE$s and $\ZERO$s are generated and therefore not optimal.
       
   856 He also introduced some "similarity rules" such
       
   857 as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
       
   858 different but language-equivalent sub-regexes to further decrease the size
       
   859 of the intermediate regexes. 
       
   860 
       
   861 More simplifications are possible, such as deleting duplicates
       
   862 and opening up nested alternatives to trigger even more simplifications.
       
   863 And suppose we apply simplification after each derivative step, and compose
       
   864 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
       
   865 \textit{simp}(a \backslash c)$. Then we can build
       
   866 a matcher without having  cumbersome regular expressions.
       
   867 
       
   868 
       
   869 If we want the size of derivatives in the algorithm to
       
   870 stay even lower, we would need more aggressive simplifications.
       
   871 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
       
   872 deleting duplicates whenever possible. For example, the parentheses in
       
   873 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
       
   874 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
       
   875 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
       
   876 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us
       
   877 to achieve a very tight size bound, namely,
       
   878  the same size bound as that of the \emph{partial derivatives}. 
       
   879 
       
   880 Building derivatives and then simplify them.
       
   881 So far so good. But what if we want to 
       
   882 do lexing instead of just a YES/NO answer?
       
   883 This requires us to go back again to the world 
       
   884 without simplification first for a moment.
       
   885 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
       
   886 elegant(arguably as beautiful as the original
       
   887 derivatives definition) solution for this.
       
   888 
       
   889 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
       
   890 
       
   891 
       
   892 They first defined the datatypes for storing the 
       
   893 lexing information called a \emph{value} or
       
   894 sometimes also \emph{lexical value}.  These values and regular
       
   895 expressions correspond to each other as illustrated in the following
       
   896 table:
       
   897 
       
   898 \begin{center}
       
   899 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   900 		\begin{tabular}{@{}rrl@{}}
       
   901 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   902 			$r$ & $::=$  & $\ZERO$\\
       
   903 			& $\mid$ & $\ONE$   \\
       
   904 			& $\mid$ & $c$          \\
       
   905 			& $\mid$ & $r_1 \cdot r_2$\\
       
   906 			& $\mid$ & $r_1 + r_2$   \\
       
   907 			\\
       
   908 			& $\mid$ & $r^*$         \\
       
   909 		\end{tabular}
       
   910 		&
       
   911 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   912 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   913 			$v$ & $::=$  & \\
       
   914 			&        & $\Empty$   \\
       
   915 			& $\mid$ & $\Char(c)$          \\
       
   916 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   917 			& $\mid$ & $\Left(v)$   \\
       
   918 			& $\mid$ & $\Right(v)$  \\
       
   919 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   920 		\end{tabular}
       
   921 	\end{tabular}
       
   922 \end{center}
       
   923 
       
   924 \noindent
       
   925 
       
   926 Building on top of Sulzmann and Lu's attempt to formalize the 
       
   927 notion of POSIX lexing rules \parencite{Sulzmann2014}, 
       
   928 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
       
   929 POSIX matching as a ternary relation recursively defined in a
       
   930 natural deduction style.
       
   931 With the formally-specified rules for what a POSIX matching is,
       
   932 they proved in Isabelle/HOL that the algorithm gives correct results.
       
   933 
       
   934 But having a correct result is still not enough, 
       
   935 we want at least some degree of $\mathbf{efficiency}$.
       
   936 
       
   937 
       
   938 
       
   939 One regular expression can have multiple lexical values. For example
       
   940 for the regular expression $(a+b)^*$, it has a infinite list of
       
   941 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
       
   942 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
       
   943 $\ldots$, and vice versa.
       
   944 Even for the regular expression matching a certain string, there could 
       
   945 still be more than one value corresponding to it.
       
   946 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
       
   947 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   948 The number of different ways of matching 
       
   949 without allowing any value under a star to be flattened
       
   950 to an empty string can be given by the following formula:
       
   951 \begin{equation}
       
   952 	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
       
   953 \end{equation}
       
   954 and a closed form formula can be calculated to be
       
   955 \begin{equation}
       
   956 	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
       
   957 \end{equation}
       
   958 which is clearly in exponential order.
       
   959 
       
   960 A lexer aimed at getting all the possible values has an exponential
       
   961 worst case runtime. Therefore it is impractical to try to generate
       
   962 all possible matches in a run. In practice, we are usually 
       
   963 interested about POSIX values, which by intuition always
       
   964 \begin{itemize}
       
   965 \item
       
   966 match the leftmost regular expression when multiple options of matching
       
   967 are available  
       
   968 \item 
       
   969 always match a subpart as much as possible before proceeding
       
   970 to the next token.
       
   971 \end{itemize}
       
   972 
       
   973 
       
   974  For example, the above example has the POSIX value
       
   975 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
       
   976 The output of an algorithm we want would be a POSIX matching
       
   977 encoded as a value.
       
   978 The reason why we are interested in $\POSIX$ values is that they can
       
   979 be practically used in the lexing phase of a compiler front end.
       
   980 For instance, when lexing a code snippet 
       
   981 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
       
   982 as an identifier rather than a keyword.
       
   983 
       
   984 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   985 algorithm by a second phase (the first phase being building successive
       
   986 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
       
   987 is generated in case the regular expression matches  the string. 
       
   988 Pictorially, the Sulzmann and Lu algorithm is as follows:
       
   989 
       
   990 \begin{ceqn}
       
   991 \begin{equation}\label{graph:2}
       
   992 \begin{tikzcd}
       
   993 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
   994 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   995 \end{tikzcd}
       
   996 \end{equation}
       
   997 \end{ceqn}
       
   998 
       
   999 
       
  1000 \noindent
       
  1001 For convenience, we shall employ the following notations: the regular
       
  1002 expression we start with is $r_0$, and the given string $s$ is composed
       
  1003 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
       
  1004 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
       
  1005 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
       
  1006 the derivative $r_n$. We test whether this derivative is
       
  1007 $\textit{nullable}$ or not. If not, we know the string does not match
       
  1008 $r$ and no value needs to be generated. If yes, we start building the
       
  1009 values incrementally by \emph{injecting} back the characters into the
       
  1010 earlier values $v_n, \ldots, v_0$. This is the second phase of the
       
  1011 algorithm from the right to left. For the first value $v_n$, we call the
       
  1012 function $\textit{mkeps}$, which builds a POSIX lexical value
       
  1013 for how the empty string has been matched by the (nullable) regular
       
  1014 expression $r_n$. This function is defined as
       
  1015 
       
  1016 	\begin{center}
       
  1017 		\begin{tabular}{lcl}
       
  1018 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
  1019 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
  1020 			& \textit{if} $\nullable(r_{1})$\\ 
       
  1021 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
  1022 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
  1023 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
  1024 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
  1025 		\end{tabular}
       
  1026 	\end{center}
       
  1027 
       
  1028 
       
  1029 \noindent 
       
  1030 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
  1031 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
  1032 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
  1033 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
  1034 matches $s$. The POSIX value is maintained throught out the process.
       
  1035 For this Sulzmann and Lu defined a function that reverses
       
  1036 the ``chopping off'' of characters during the derivative phase. The
       
  1037 corresponding function is called \emph{injection}, written
       
  1038 $\textit{inj}$; it takes three arguments: the first one is a regular
       
  1039 expression ${r_{i-1}}$, before the character is chopped off, the second
       
  1040 is a character ${c_{i-1}}$, the character we want to inject and the
       
  1041 third argument is the value ${v_i}$, into which one wants to inject the
       
  1042 character (it corresponds to the regular expression after the character
       
  1043 has been chopped off). The result of this function is a new value. The
       
  1044 definition of $\textit{inj}$ is as follows: 
       
  1045 
       
  1046 \begin{center}
       
  1047 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
  1048   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
  1049   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
  1050   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
  1051   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
  1052   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
  1053   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
  1054   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
  1055 \end{tabular}
       
  1056 \end{center}
       
  1057 
       
  1058 \noindent This definition is by recursion on the ``shape'' of regular
       
  1059 expressions and values. 
       
  1060 The clauses basically do one thing--identifying the ``holes'' on 
       
  1061 value to inject the character back into.
       
  1062 For instance, in the last clause for injecting back to a value
       
  1063 that would turn into a new star value that corresponds to a star,
       
  1064 we know it must be a sequence value. And we know that the first 
       
  1065 value of that sequence corresponds to the child regex of the star
       
  1066 with the first character being chopped off--an iteration of the star
       
  1067 that had just been unfolded. This value is followed by the already
       
  1068 matched star iterations we collected before. So we inject the character 
       
  1069 back to the first value and form a new value with this new iteration
       
  1070 being added to the previous list of iterations, all under the $Stars$
       
  1071 top level.
       
  1072 
       
  1073 We have mentioned before that derivatives without simplification 
       
  1074 can get clumsy, and this is true for values as well--they reflect
       
  1075 the regular expressions size by definition.
       
  1076 
       
  1077 One can introduce simplification on the regex and values, but have to
       
  1078 be careful in not breaking the correctness as the injection 
       
  1079 function heavily relies on the structure of the regexes and values
       
  1080 being correct and match each other.
       
  1081 It can be achieved by recording some extra rectification functions
       
  1082 during the derivatives step, and applying these rectifications in 
       
  1083 each run during the injection phase.
       
  1084 And we can prove that the POSIX value of how
       
  1085 regular expressions match strings will not be affected---although is much harder
       
  1086 to establish. 
       
  1087 Some initial results in this regard have been
       
  1088 obtained in \cite{AusafDyckhoffUrban2016}. 
       
  1089 
       
  1090 
       
  1091 
       
  1092 %Brzozowski, after giving the derivatives and simplification,
       
  1093 %did not explore lexing with simplification or he may well be 
       
  1094 %stuck on an efficient simplificaiton with a proof.
       
  1095 %He went on to explore the use of derivatives together with 
       
  1096 %automaton, and did not try lexing using derivatives.
       
  1097 
       
  1098 We want to get rid of complex and fragile rectification of values.
       
  1099 Can we not create those intermediate values $v_1,\ldots v_n$,
       
  1100 and get the lexing information that should be already there while
       
  1101 doing derivatives in one pass, without a second phase of injection?
       
  1102 In the meantime, can we make sure that simplifications
       
  1103 are easily handled without breaking the correctness of the algorithm?
       
  1104 
       
  1105 Sulzmann and Lu solved this problem by
       
  1106 introducing additional informtaion to the 
       
  1107 regular expressions called \emph{bitcodes}.
       
  1108 
       
  1109 \subsection*{Bit-coded Algorithm}
       
  1110 Bits and bitcodes (lists of bits) are defined as:
       
  1111 
       
  1112 \begin{center}
       
  1113 		$b ::=   1 \mid  0 \qquad
       
  1114 bs ::= [] \mid b::bs    
       
  1115 $
       
  1116 \end{center}
       
  1117 
       
  1118 \noindent
       
  1119 The $1$ and $0$ are not in bold in order to avoid 
       
  1120 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
       
  1121 bit-lists) can be used to encode values (or potentially incomplete values) in a
       
  1122 compact form. This can be straightforwardly seen in the following
       
  1123 coding function from values to bitcodes: 
       
  1124 
       
  1125 \begin{center}
       
  1126 \begin{tabular}{lcl}
       
  1127   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
       
  1128   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
       
  1129   $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
       
  1130   $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
       
  1131   $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
       
  1132   $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
       
  1133   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
       
  1134                                                  code(\Stars\,vs)$
       
  1135 \end{tabular}    
       
  1136 \end{center} 
       
  1137 
       
  1138 \noindent
       
  1139 Here $\textit{code}$ encodes a value into a bitcodes by converting
       
  1140 $\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
       
  1141 star iteration by $1$. The border where a local star terminates
       
  1142 is marked by $0$. This coding is lossy, as it throws away the information about
       
  1143 characters, and also does not encode the ``boundary'' between two
       
  1144 sequence values. Moreover, with only the bitcode we cannot even tell
       
  1145 whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
       
  1146 reason for choosing this compact way of storing information is that the
       
  1147 relatively small size of bits can be easily manipulated and ``moved
       
  1148 around'' in a regular expression. In order to recover values, we will 
       
  1149 need the corresponding regular expression as an extra information. This
       
  1150 means the decoding function is defined as:
       
  1151 
       
  1152 
       
  1153 %\begin{definition}[Bitdecoding of Values]\mbox{}
       
  1154 \begin{center}
       
  1155 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
  1156   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
       
  1157   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
       
  1158   $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
  1159      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       
  1160        (\Left\,v, bs_1)$\\
       
  1161   $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
  1162      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       
  1163        (\Right\,v, bs_1)$\\                           
       
  1164   $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
       
  1165         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
       
  1166   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
       
  1167   & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
       
  1168   $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
       
  1169   $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
       
  1170          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
  1171   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
       
  1172   & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
       
  1173   
       
  1174   $\textit{decode}\,bs\,r$ & $\dn$ &
       
  1175      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
  1176   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       
  1177        \textit{else}\;\textit{None}$                       
       
  1178 \end{tabular}    
       
  1179 \end{center}    
       
  1180 %\end{definition}
       
  1181 
       
  1182 Sulzmann and Lu's integrated the bitcodes into regular expressions to
       
  1183 create annotated regular expressions \cite{Sulzmann2014}.
       
  1184 \emph{Annotated regular expressions} are defined by the following
       
  1185 grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
       
  1186 
       
  1187 \begin{center}
       
  1188 \begin{tabular}{lcl}
       
  1189   $\textit{a}$ & $::=$  & $\ZERO$\\
       
  1190                   & $\mid$ & $_{bs}\ONE$\\
       
  1191                   & $\mid$ & $_{bs}{\bf c}$\\
       
  1192                   & $\mid$ & $_{bs}\sum\,as$\\
       
  1193                   & $\mid$ & $_{bs}a_1\cdot a_2$\\
       
  1194                   & $\mid$ & $_{bs}a^*$
       
  1195 \end{tabular}    
       
  1196 \end{center}  
       
  1197 %(in \textit{ALTS})
       
  1198 
       
  1199 \noindent
       
  1200 where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
       
  1201 expressions and $as$ for a list of annotated regular expressions.
       
  1202 The alternative constructor($\sum$) has been generalized to 
       
  1203 accept a list of annotated regular expressions rather than just 2.
       
  1204 We will show that these bitcodes encode information about
       
  1205 the (POSIX) value that should be generated by the Sulzmann and Lu
       
  1206 algorithm.
       
  1207 
       
  1208 
       
  1209 To do lexing using annotated regular expressions, we shall first
       
  1210 transform the usual (un-annotated) regular expressions into annotated
       
  1211 regular expressions. This operation is called \emph{internalisation} and
       
  1212 defined as follows:
       
  1213 
       
  1214 %\begin{definition}
       
  1215 \begin{center}
       
  1216 \begin{tabular}{lcl}
       
  1217   $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
       
  1218   $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
       
  1219   $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
       
  1220   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
  1221   $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
       
  1222   \textit{fuse}\,[1]\,r_2^\uparrow]$\\
       
  1223   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
  1224          $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
       
  1225   $(r^*)^\uparrow$ & $\dn$ &
       
  1226          $_{[]}(r^\uparrow)^*$\\
       
  1227 \end{tabular}    
       
  1228 \end{center}    
       
  1229 %\end{definition}
       
  1230 
       
  1231 \noindent
       
  1232 We use up arrows here to indicate that the basic un-annotated regular
       
  1233 expressions are ``lifted up'' into something slightly more complex. In the
       
  1234 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
       
  1235 attach bits to the front of an annotated regular expression. Its
       
  1236 definition is as follows:
       
  1237 
       
  1238 \begin{center}
       
  1239 \begin{tabular}{lcl}
       
  1240   $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
       
  1241   $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
       
  1242      $_{bs @ bs'}\ONE$\\
       
  1243   $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
       
  1244      $_{bs@bs'}{\bf c}$\\
       
  1245   $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
       
  1246      $_{bs@bs'}\sum\textit{as}$\\
       
  1247   $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
       
  1248      $_{bs@bs'}a_1 \cdot a_2$\\
       
  1249   $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
       
  1250      $_{bs @ bs'}a^*$
       
  1251 \end{tabular}    
       
  1252 \end{center}  
       
  1253 
       
  1254 \noindent
       
  1255 After internalising the regular expression, we perform successive
       
  1256 derivative operations on the annotated regular expressions. This
       
  1257 derivative operation is the same as what we had previously for the
       
  1258 basic regular expressions, except that we beed to take care of
       
  1259 the bitcodes:
       
  1260 
       
  1261 
       
  1262 \iffalse
       
  1263  %\begin{definition}{bder}
       
  1264 \begin{center}
       
  1265   \begin{tabular}{@{}lcl@{}}
       
  1266   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
  1267   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
  1268   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
       
  1269         $\textit{if}\;c=d\; \;\textit{then}\;
       
  1270          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
       
  1271   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
       
  1272   $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
       
  1273   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
       
  1274      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
  1275 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
       
  1276 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
  1277   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
       
  1278   $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
       
  1279       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       
  1280        (\textit{STAR}\,[]\,r)$
       
  1281 \end{tabular}    
       
  1282 \end{center}    
       
  1283 %\end{definition}
       
  1284 
       
  1285 \begin{center}
       
  1286   \begin{tabular}{@{}lcl@{}}
       
  1287   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
  1288   $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
  1289   $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
       
  1290         $\textit{if}\;c=d\; \;\textit{then}\;
       
  1291          _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
       
  1292   $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
       
  1293   $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
       
  1294   $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
       
  1295      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
  1296 					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
       
  1297 					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
  1298   & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
       
  1299   $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
       
  1300       $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
       
  1301        (_{bs}\textit{STAR}\,[]\,r)$
       
  1302 \end{tabular}    
       
  1303 \end{center}    
       
  1304 %\end{definition}
       
  1305 \fi
       
  1306 
       
  1307 \begin{center}
       
  1308   \begin{tabular}{@{}lcl@{}}
       
  1309   $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
  1310   $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
  1311   $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
       
  1312         $\textit{if}\;c=d\; \;\textit{then}\;
       
  1313          _{bs}\ONE\;\textit{else}\;\ZERO$\\  
       
  1314   $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
       
  1315   $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
       
  1316   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
       
  1317      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
  1318 					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
       
  1319 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
       
  1320   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
       
  1321   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
       
  1322       $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
       
  1323        (_{[]}r^*))$
       
  1324 \end{tabular}    
       
  1325 \end{center}    
       
  1326 
       
  1327 %\end{definition}
       
  1328 \noindent
       
  1329 For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
       
  1330 we need to unfold it into a sequence,
       
  1331 and attach an additional bit $0$ to the front of $r \backslash c$
       
  1332 to indicate one more star iteration. Also the sequence clause
       
  1333 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
       
  1334 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
       
  1335 that it is for annotated regular expressions, therefore we omit the
       
  1336 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
       
  1337 $a_1$ matches the string prior to character $c$ (more on this later),
       
  1338 then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
       
  1339 \backslash c)$ will collapse the regular expression $a_1$(as it has
       
  1340 already been fully matched) and store the parsing information at the
       
  1341 head of the regular expression $a_2 \backslash c$ by fusing to it. The
       
  1342 bitsequence $\textit{bs}$, which was initially attached to the
       
  1343 first element of the sequence $a_1 \cdot a_2$, has
       
  1344 now been elevated to the top-level of $\sum$, as this information will be
       
  1345 needed whichever way the sequence is matched---no matter whether $c$ belongs
       
  1346 to $a_1$ or $ a_2$. After building these derivatives and maintaining all
       
  1347 the lexing information, we complete the lexing by collecting the
       
  1348 bitcodes using a generalised version of the $\textit{mkeps}$ function
       
  1349 for annotated regular expressions, called $\textit{bmkeps}$:
       
  1350 
       
  1351 
       
  1352 %\begin{definition}[\textit{bmkeps}]\mbox{}
       
  1353 \begin{center}
       
  1354 \begin{tabular}{lcl}
       
  1355   $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
       
  1356   $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
       
  1357      $\textit{if}\;\textit{bnullable}\,a$\\
       
  1358   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
       
  1359   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
       
  1360   $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
       
  1361      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
       
  1362   $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
       
  1363      $bs \,@\, [0]$
       
  1364 \end{tabular}    
       
  1365 \end{center}    
       
  1366 %\end{definition}
       
  1367 
       
  1368 \noindent
       
  1369 This function completes the value information by travelling along the
       
  1370 path of the regular expression that corresponds to a POSIX value and
       
  1371 collecting all the bitcodes, and using $S$ to indicate the end of star
       
  1372 iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
       
  1373 decode them, we get the value we expect. The corresponding lexing
       
  1374 algorithm looks as follows:
       
  1375 
       
  1376 \begin{center}
       
  1377 \begin{tabular}{lcl}
       
  1378   $\textit{blexer}\;r\,s$ & $\dn$ &
       
  1379       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
  1380   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1381   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1382   & & $\;\;\textit{else}\;\textit{None}$
       
  1383 \end{tabular}
       
  1384 \end{center}
       
  1385 
       
  1386 \noindent
       
  1387 In this definition $\_\backslash s$ is the  generalisation  of the derivative
       
  1388 operation from characters to strings (just like the derivatives for un-annotated
       
  1389 regular expressions).
       
  1390 
       
  1391 Now we introduce the simplifications, which is why we introduce the 
       
  1392 bitcodes in the first place.
       
  1393 
       
  1394 \subsection*{Simplification Rules}
       
  1395 
       
  1396 This section introduces aggressive (in terms of size) simplification rules
       
  1397 on annotated regular expressions
       
  1398 to keep derivatives small. Such simplifications are promising
       
  1399 as we have
       
  1400 generated test data that show
       
  1401 that a good tight bound can be achieved. We could only
       
  1402 partially cover the search space as there are infinitely many regular
       
  1403 expressions and strings. 
       
  1404 
       
  1405 One modification we introduced is to allow a list of annotated regular
       
  1406 expressions in the $\sum$ constructor. This allows us to not just
       
  1407 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
  1408 also unnecessary ``copies'' of regular expressions (very similar to
       
  1409 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
  1410 modification is that we use simplification rules inspired by Antimirov's
       
  1411 work on partial derivatives. They maintain the idea that only the first
       
  1412 ``copy'' of a regular expression in an alternative contributes to the
       
  1413 calculation of a POSIX value. All subsequent copies can be pruned away from
       
  1414 the regular expression. A recursive definition of our  simplification function 
       
  1415 that looks somewhat similar to our Scala code is given below:
       
  1416 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
  1417 %Is it $ALTS$ or $ALTS$?}\\
       
  1418 
       
  1419 \begin{center}
       
  1420   \begin{tabular}{@{}lcl@{}}
       
  1421    
       
  1422   $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
       
  1423    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
       
  1424    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
       
  1425    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
       
  1426    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
       
  1427    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
       
  1428 
       
  1429   $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
       
  1430   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
       
  1431    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
       
  1432    &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
       
  1433 
       
  1434    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
       
  1435 \end{tabular}    
       
  1436 \end{center}    
       
  1437 
       
  1438 \noindent
       
  1439 The simplification does a pattern matching on the regular expression.
       
  1440 When it detected that the regular expression is an alternative or
       
  1441 sequence, it will try to simplify its child regular expressions
       
  1442 recursively and then see if one of the children turns into $\ZERO$ or
       
  1443 $\ONE$, which might trigger further simplification at the current level.
       
  1444 The most involved part is the $\sum$ clause, where we use two
       
  1445 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
       
  1446 alternatives and reduce as many duplicates as possible. Function
       
  1447 $\textit{distinct}$  keeps the first occurring copy only and removes all later ones
       
  1448 when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
       
  1449 Its recursive definition is given below:
       
  1450 
       
  1451  \begin{center}
       
  1452   \begin{tabular}{@{}lcl@{}}
       
  1453   $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
       
  1454      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
       
  1455   $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
       
  1456     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
       
  1457 \end{tabular}    
       
  1458 \end{center}  
       
  1459 
       
  1460 \noindent
       
  1461 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
  1462 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
       
  1463 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
       
  1464 
       
  1465 Having defined the $\simp$ function,
       
  1466 we can use the previous notation of  natural
       
  1467 extension from derivative w.r.t.~character to derivative
       
  1468 w.r.t.~string:%\comment{simp in  the [] case?}
       
  1469 
       
  1470 \begin{center}
       
  1471 \begin{tabular}{lcl}
       
  1472 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
  1473 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
  1474 \end{tabular}
       
  1475 \end{center}
       
  1476 
       
  1477 \noindent
       
  1478 to obtain an optimised version of the algorithm:
       
  1479 
       
  1480  \begin{center}
       
  1481 \begin{tabular}{lcl}
       
  1482   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
  1483       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
  1484   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1485   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1486   & & $\;\;\textit{else}\;\textit{None}$
       
  1487 \end{tabular}
       
  1488 \end{center}
       
  1489 
       
  1490 \noindent
       
  1491 This algorithm keeps the regular expression size small, for example,
       
  1492 with this simplification our previous $(a + aa)^*$ example's 8000 nodes
       
  1493 will be reduced to just 6 and stays constant, no matter how long the
       
  1494 input string is.
       
  1495 
       
  1496 
       
  1497  
       
  1498 Derivatives give a simple solution
   559 Derivatives give a simple solution
  1499 to the problem of matching a string $s$ with a regular
   560 to the problem of matching a string $s$ with a regular
  1500 expression $r$: if the derivative of $r$ w.r.t.\ (in
   561 expression $r$: if the derivative of $r$ w.r.t.\ (in
  1501 succession) all the characters of the string matches the empty string,
   562 succession) all the characters of the string matches the empty string,
  1502 then $r$ matches $s$ (and {\em vice versa}).  
   563 then $r$ matches $s$ (and {\em vice versa}).  
  1564 of this version, but do not provide any supporting proof arguments, not
   625 of this version, but do not provide any supporting proof arguments, not
  1565 even ``pencil-and-paper'' arguments. They write about their bitcoded
   626 even ``pencil-and-paper'' arguments. They write about their bitcoded
  1566 \emph{incremental parsing method} (that is the algorithm to be formalised
   627 \emph{incremental parsing method} (that is the algorithm to be formalised
  1567 in this paper):
   628 in this paper):
  1568 
   629 
  1569 %motivation part
   630 
  1570 \begin{quote}\it
   631   
       
   632   \begin{quote}\it
  1571   ``Correctness Claim: We further claim that the incremental parsing
   633   ``Correctness Claim: We further claim that the incremental parsing
  1572   method [..] in combination with the simplification steps [..]
   634   method [..] in combination with the simplification steps [..]
  1573   yields POSIX parse trees. We have tested this claim
   635   yields POSIX parse trees. We have tested this claim
  1574   extensively [..] but yet
   636   extensively [..] but yet
  1575   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
   637   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
  1576 \end{quote}  
   638 \end{quote}  
  1577 
   639 
  1578 
   640 Ausaf and Urban were able to back this correctness claim with
  1579 
   641 a formal proof.
       
   642 
       
   643 But as they stated,
       
   644   \begin{quote}\it
       
   645 The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.
       
   646 \end{quote}  
       
   647 
       
   648 This thesis implements the aggressive simplifications envisioned
       
   649 by Ausaf and Urban,
       
   650 and gives a formal proof of the correctness with those simplifications.
       
   651 
       
   652 
       
   653  
  1580 
   654 
  1581 
   655 
  1582 
   656 
  1583 %----------------------------------------------------------------------------------------
   657 %----------------------------------------------------------------------------------------
  1584 
   658 
       
   659 \section{Contribution}
       
   660 
       
   661 
       
   662 
       
   663 This work addresses the vulnerability of super-linear and
       
   664 buggy regex implementations by the combination
       
   665 of Brzozowski's derivatives and interactive theorem proving. 
       
   666 We give an 
       
   667 improved version of  Sulzmann and Lu's bit-coded algorithm using 
       
   668 derivatives, which come with a formal guarantee in terms of correctness and 
       
   669 running time as an Isabelle/HOL proof.
       
   670 Then we improve the algorithm with an even stronger version of 
       
   671 simplification, and prove a time bound linear to input and
       
   672 cubic to regular expression size using a technique by
       
   673 Antimirov.
       
   674 
       
   675  
       
   676 The main contribution of this thesis is a proven correct lexing algorithm
       
   677 with formalized time bounds.
       
   678 To our best knowledge, no lexing libraries using Brzozowski derivatives
       
   679 have a provable time guarantee, 
       
   680 and claims about running time are usually speculative and backed by thin empirical
       
   681 evidence.
       
   682 %TODO: give references
       
   683 For example, Sulzmann and Lu had proposed an algorithm  in which they
       
   684 claim a linear running time.
       
   685 But that was falsified by our experiments and the running time 
       
   686 is actually $\Omega(2^n)$ in the worst case.
       
   687 A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
       
   688 %TODO: give references
       
   689 lexer, which calculates POSIX matches and is based on derivatives.
       
   690 They formalized the correctness of the lexer, but not the complexity.
       
   691 In the performance evaluation section, they simply analyzed the run time
       
   692 of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
       
   693 and concluded that the algorithm is quadratic in terms of input length.
       
   694 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
       
   695 the time it took to lex only 40 $a$'s was 5 minutes.
       
   696 
       
   697 We  believe our results of a proof of performance on general
       
   698 inputs rather than specific examples a novel contribution.\\
       
   699 
       
   700 
       
   701 \subsection{Related Work}
       
   702 We are aware
       
   703 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
       
   704 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
       
   705 of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
       
   706 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
       
   707 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
       
   708  
       
   709  %We propose Brzozowski's derivatives as a solution to this problem.
       
   710 % about Lexing Using Brzozowski derivatives
       
   711 
       
   712 
       
   713 \section{Structure of the thesis}
       
   714 In chapter 2 \ref{Chapter2} we will introduce the concepts
       
   715 and notations we 
       
   716 use for describing the lexing algorithm by Sulzmann and Lu,
       
   717 and then give the algorithm and its variant, and discuss
       
   718 why more aggressive simplifications are needed. 
       
   719 Then we illustrate in Chapter 3\ref{Chapter3}
       
   720 how the algorithm without bitcodes falls short for such aggressive 
       
   721 simplifications and therefore introduce our version of the
       
   722  bitcoded algorithm and 
       
   723 its correctness proof .  
       
   724 In Chapter 4 \ref{Chapter4} we give the second guarantee
       
   725 of our bitcoded algorithm, that is a finite bound on the size of any 
       
   726 regex's derivatives.
       
   727 In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
       
   728 in Chapter 4 to a polynomial one, and demonstrate how one can extend the
       
   729 algorithm to include constructs such as bounded repetitions and negations.
       
   730  
       
   731 
       
   732 
       
   733 
  1585 
   734 
  1586 %----------------------------------------------------------------------------------------
   735 %----------------------------------------------------------------------------------------
  1587 
   736 
       
   737 
  1588 %----------------------------------------------------------------------------------------
   738 %----------------------------------------------------------------------------------------
  1589 
   739 
  1590 %----------------------------------------------------------------------------------------
   740 %----------------------------------------------------------------------------------------
  1591 
   741 
  1592 
   742 %----------------------------------------------------------------------------------------
       
   743 
       
   744