intro and chap2
authorChengsong
Thu, 09 Jun 2022 12:57:53 +0100
changeset 538 8016a2480704
parent 537 50e590823220
child 539 7cf9f17aa179
intro and chap2
ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
ChengsongTanPhdThesis/Chapters/Bitcoded2.tex
ChengsongTanPhdThesis/Chapters/Cubic.tex
ChengsongTanPhdThesis/Chapters/Inj.tex
ChengsongTanPhdThesis/Chapters/Introduction.tex
ChengsongTanPhdThesis/main.tex
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Mon Jun 06 23:17:45 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Thu Jun 09 12:57:53 2022 +0100
@@ -1,17 +1,17 @@
 % Chapter Template
 
 % Main chapter title
-\chapter{Correctness of Bit-coded Algorithm without Simplification}
+\chapter{Bit-coded Algorithm of Sulzmann and Lu}
 
 \label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
 %simplifications and therefore introduce our version of the bitcoded algorithm and 
 %its correctness proof in 
 %Chapter 3\ref{Chapter3}. 
-
+In this chapter, we introduce the bit-coded algorithm
+of Sulzmann and Lu, which addresses the problem of
+under-simplified regular expressions.
 \section{Bit-coded Algorithm}
-
-
 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
 stores information of previous lexing steps
 on a stack, in the form of regular expressions
@@ -135,8 +135,9 @@
 \end{envForCaption}
 
 \noindent
-The $S$ and $Z$ are not in bold in order to avoid 
-confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
+We use $S$ and $Z$ rather than $1$ and $0$ to avoid
+confusion with the regular expressions $\ZERO$ and $\ONE$.
+Bitcodes (or
 bit-lists) can be used to encode values (or potentially incomplete values) in a
 compact form. This can be straightforwardly seen in the following
 coding function from values to bitcodes: 
@@ -203,15 +204,31 @@
   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
        \textit{else}\;\textit{None}$                       
 \end{tabular}    
-\end{center}
-\caption{Bit-decoding of values}    
+\end{center} 
 \end{envForCaption}
 %\end{definition}
 
-Sulzmann and Lu's integrated the bitcodes into regular expressions to
-create annotated regular expressions \cite{Sulzmann2014}.
-\emph{Annotated regular expressions} are defined by the following
-grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
+\noindent
+$\decode'$ does most of the work, while $\decode$ throws
+away the leftover bit-codes and returns the value only.
+$\decode$ is terminating because $\decode'$ is terminating.
+We have the property that $\decode$ and $\code$ are
+inverse operations of one another:
+\begin{lemma}
+\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \]
+\end{lemma}
+\begin{proof}
+By proving a more general version of the lemma, on $\decode'$:
+\[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
+Then setting $ds$ to $[]$ and unfolding the definition of $\decode$
+yields the lemma.
+\end{proof}
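+\noindent
+To make this pair of functions concrete, here is a sketch of how
+$\code$ and $\decode$ might be rendered in Scala, the language of our
+prototype implementations. We assume Scala datatypes for regular
+expressions and values as sketched in Chapter \ref{Inj} (with
+constructors ZERO, ONE, CHAR, ALT, SEQ, STAR and Empty, Chr, Sequ,
+Left, Right, Stars); the names are our own choices and this is an
+illustration, not the verified Isabelle definition.
+\begin{verbatim}
+abstract class Bit
+case object Z extends Bit
+case object S extends Bit
+
+def code(v: Val): List[Bit] = v match {
+  case Empty => Nil
+  case Chr(_) => Nil
+  case Left(v) => Z :: code(v)
+  case Right(v) => S :: code(v)
+  case Sequ(v1, v2) => code(v1) ::: code(v2)
+  case Stars(Nil) => List(S)                   // S ends a star
+  case Stars(v :: vs) => Z :: code(v) ::: code(Stars(vs))
+}
+
+// decode' : does the work, returning a value and the leftover bits
+def decode_aux(r: Rexp, bs: List[Bit]): (Val, List[Bit]) =
+  (r, bs) match {
+    case (ONE, bs) => (Empty, bs)
+    case (CHAR(c), bs) => (Chr(c), bs)
+    case (ALT(r1, _), Z :: bs) =>
+      { val (v, bs1) = decode_aux(r1, bs); (Left(v), bs1) }
+    case (ALT(_, r2), S :: bs) =>
+      { val (v, bs1) = decode_aux(r2, bs); (Right(v), bs1) }
+    case (SEQ(r1, r2), bs) => {
+      val (v1, bs1) = decode_aux(r1, bs)
+      val (v2, bs2) = decode_aux(r2, bs1)
+      (Sequ(v1, v2), bs2)
+    }
+    case (STAR(r1), Z :: bs) => {               // one more iteration
+      val (v, bs1) = decode_aux(r1, bs)
+      val (Stars(vs), bs2) = decode_aux(STAR(r1), bs1)
+      (Stars(v :: vs), bs2)
+    }
+    case (STAR(_), S :: bs) => (Stars(Nil), bs) // star ends here
+  }
+
+// decode : succeeds only if no bits are left over
+def decode(r: Rexp, bs: List[Bit]): Option[Val] =
+  decode_aux(r, bs) match {
+    case (v, Nil) => Some(v)
+    case _ => None
+  }
+\end{verbatim}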
+With the $\code$ and $\decode$ functions in hand, we know how to 
+switch between bit-codes and values, the two different representations of 
+lexing information. 
+The next step is to integrate this information into the regular expression
+we are working with.
+Attaching bits to the front of regular expressions is the solution Sulzmann and Lu
+gave for storing partial values on the fly:
 
 \begin{center}
 \begin{tabular}{lcl}
@@ -226,13 +243,130 @@
 %(in \textit{ALTS})
 
 \noindent
-where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
+We call these regular expressions carrying bit-codes \emph{annotated regular expressions}.
+$bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
 expressions and $as$ for a list of annotated regular expressions.
-The alternative constructor($\sum$) has been generalized to 
+The alternative constructor ($\sum$) has been generalised to 
+accept a list of annotated regular expressions rather than just two.
-We will show that these bitcodes encode information about
-the (POSIX) value that should be generated by the Sulzmann and Lu
-algorithm.
+%We will show that these bitcodes encode information about
+%the ($\POSIX$) value that should be generated by the Sulzmann and Lu
+%algorithm.
+The central question is how this partial lexing information,
+represented as bit-codes, is augmented and carried around 
+when a derivative is taken.
+
+This is done by adding bit-codes to the 
+derivatives, for example when one more star iteration is taken (we
+call the derivative operation on annotated regular expressions $\bder$
+because it acts on regular expressions with bit-codes):
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $\bder \; c\; (_{bs}a^*) $ & $\dn$ &
+      $_{bs}(\textit{fuse}\, [Z] \, (\bder \; c \; a))\cdot
+       (_{[]}a^*)$
+\end{tabular}    
+\end{center}    
+
+\noindent
+Most of the time we use the infix notation $\backslash$ for $\bder$, for brevity, when
+there is no danger of confusion with derivatives on plain regular expressions;
+for example, the clause above can be written as
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
+      $_{bs}(\textit{fuse}\, [Z] \, (a\,\backslash c))\cdot
+       (_{[]}a^*)$
+\end{tabular}    
+\end{center}   
+
+
+Using the same kind of pictures as earlier, the transformation when
+taking a derivative w.r.t.~a star can be depicted as follows:
+\begin{center}
+\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {$bs$
+         \nodepart{two} $a^*$ };
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture} 
+&
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {$v_{\text{previous iterations}}$
+         \nodepart{two} $a^*$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\\
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        { $bs$ + [Z]
+         \nodepart{two}  $(a\backslash c )\cdot a^*$ };
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+&
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {$v_{\text{previous iterations}}$ + 1 more iteration
+         \nodepart{two} $(a\backslash c )\cdot a^*$ };
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
+\noindent
+The operation $\fuse$ simply attaches bit-codes
+to the front of an annotated regular expression:
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
+  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
+     $_{bs @ bs'}\ONE$\\
+  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
+     $_{bs@bs'}{\bf c}$\\
+  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
+     $_{bs@bs'}\sum\textit{as}$\\
+  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
+     $_{bs@bs'}a_1 \cdot a_2$\\
+  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
+     $_{bs @ bs'}a^*$
+\end{tabular}    
+\end{center} 
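+\noindent
+As a concrete illustration, annotated regular expressions and $\fuse$
+might be rendered in Scala as follows; the constructor names
+(AZERO, AONE, etc.) are our own choices and the Bit datatype is the
+one sketched earlier:
+\begin{verbatim}
+abstract class ARexp
+case object AZERO extends ARexp
+case class AONE(bs: List[Bit]) extends ARexp
+case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
+case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
+case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp
+
+// attach bs in front of the bits already sitting on a
+def fuse(bs: List[Bit], a: ARexp): ARexp = a match {
+  case AZERO => AZERO
+  case AONE(bs1) => AONE(bs ::: bs1)
+  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
+  case AALTS(bs1, as) => AALTS(bs ::: bs1, as)
+  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
+  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
+}
+\end{verbatim}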
+
+\noindent
+Another place in the $\bder$ function where it differs
+from normal derivatives on un-annotated regular expressions
+is the sequence case:
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+
+  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
+					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
+  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
+\end{tabular}    
+\end{center}    
+Here, when $a_1$ is $\textit{bnullable}$ (that is, nullable as an annotated
+regular expression), the bit-codes recording how $a_1$ matched the empty
+string are extracted by $\textit{bmkeps}$ and fused onto the derivative of
+$a_2$, so that no lexing information is lost when the derivative
+``moves into'' $a_2$.
+The complete definition of $\bder$ is as follows:
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
+  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
+  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
+        $\textit{if}\;c=d\; \;\textit{then}\;
+         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
+  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
+  $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\
+  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
+					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
+  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
+  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
+      $_{bs}(\textit{fuse}\, [Z] \, (a\,\backslash c))\cdot
+       (_{[]}a^*)$
+\end{tabular}    
+\end{center}    
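+\noindent
+The whole of $\bder$ can be sketched in Scala as below. The helpers
+$\textit{bnullable}$ (the obvious lifting of $\nullable$ to annotated
+regular expressions) and $\textit{bmkeps}$ (which collects the bits
+along a nullable path) are also sketched, following Sulzmann and Lu's
+definitions:
+\begin{verbatim}
+def bnullable(a: ARexp): Boolean = a match {
+  case AZERO => false
+  case AONE(_) => true
+  case ACHAR(_, _) => false
+  case AALTS(_, as) => as.exists(bnullable)
+  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
+  case ASTAR(_, _) => true
+}
+
+def bmkeps(a: ARexp): List[Bit] = a match {
+  case AONE(bs) => bs
+  case AALTS(bs, as) =>        // first nullable alternative wins
+    bs ::: bmkeps(as.find(bnullable).get)
+  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
+  case ASTAR(bs, _) => bs ::: List(S)
+}
+
+def bder(c: Char, a: ARexp): ARexp = a match {
+  case AZERO => AZERO
+  case AONE(_) => AZERO
+  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
+  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
+  case ASEQ(bs, a1, a2) =>
+    if (bnullable(a1))
+      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
+                     fuse(bmkeps(a1), bder(c, a2))))
+    else ASEQ(bs, bder(c, a1), a2)
+  case ASTAR(bs, a1) =>
+    ASEQ(bs, fuse(List(Z), bder(c, a1)), ASTAR(Nil, a1))
+}
+\end{verbatim}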
 
 
 To do lexing using annotated regular expressions, we shall first
@@ -264,21 +398,7 @@
-attach bits to the front of an annotated regular expression. Its
-definition is as follows:
+attach bits to the front of an annotated regular expression,
+as defined earlier in this chapter.
 
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
-  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
-     $_{bs @ bs'}\ONE$\\
-  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
-     $_{bs@bs'}{\bf c}$\\
-  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
-     $_{bs@bs'}\sum\textit{as}$\\
-  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
-     $_{bs@bs'}a_1 \cdot a_2$\\
-  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
-     $_{bs @ bs'}a^*$
-\end{tabular}    
-\end{center}  
+ 
 
 \noindent
 After internalising the regular expression, we perform successive
@@ -288,70 +408,8 @@
-the bitcodes:
+the bitcodes, as given above.
 
 
-\iffalse
- %\begin{definition}{bder}
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
-  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
-  $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
-  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
-					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
-  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
-  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
-      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
-       (\textit{STAR}\,[]\,r)$
-\end{tabular}    
-\end{center}    
-%\end{definition}
 
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
-  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
-  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
-  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
-					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
-  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
-  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
-      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [Z] \; r\,\backslash c )\,
-       (_{bs}\textit{STAR}\,[]\,r)$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-\fi
 
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
-  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
-  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
-  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
-  $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\
-  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
-					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
-  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
-  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
-      $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
-       (_{[]}r^*))$
-\end{tabular}    
-\end{center}    
 
 %\end{definition}
 \noindent
@@ -417,119 +475,6 @@
 operation from characters to strings (just like the derivatives for un-annotated
 regular expressions).
 
-Now we introduce the simplifications, which is why we introduce the 
-bitcodes in the first place.
-
-\subsection*{Simplification Rules}
-
-This section introduces aggressive (in terms of size) simplification rules
-on annotated regular expressions
-to keep derivatives small. Such simplifications are promising
-as we have
-generated test data that show
-that a good tight bound can be achieved. We could only
-partially cover the search space as there are infinitely many regular
-expressions and strings. 
-
-One modification we introduced is to allow a list of annotated regular
-expressions in the $\sum$ constructor. This allows us to not just
-delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
-also unnecessary ``copies'' of regular expressions (very similar to
-simplifying $r + r$ to just $r$, but in a more general setting). Another
-modification is that we use simplification rules inspired by Antimirov's
-work on partial derivatives. They maintain the idea that only the first
-``copy'' of a regular expression in an alternative contributes to the
-calculation of a POSIX value. All subsequent copies can be pruned away from
-the regular expression. A recursive definition of our  simplification function 
-that looks somewhat similar to our Scala code is given below:
-%\comment{Use $\ZERO$, $\ONE$ and so on. 
-%Is it $ALTS$ or $ALTS$?}\\
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-   
-  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
-   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
-   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
-   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
-
-  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\
-  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
-   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
-
-   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
-\end{tabular}    
-\end{center}    
-
-\noindent
-The simplification does a pattern matching on the regular expression.
-When it detected that the regular expression is an alternative or
-sequence, it will try to simplify its child regular expressions
-recursively and then see if one of the children turns into $\ZERO$ or
-$\ONE$, which might trigger further simplification at the current level.
-The most involved part is the $\sum$ clause, where we use two
-auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
-alternatives and reduce as many duplicates as possible. Function
-$\textit{distinct}$  keeps the first occurring copy only and removes all later ones
-when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
-Its recursive definition is given below:
-
- \begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
-     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
-  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
-    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
-\end{tabular}    
-\end{center}  
-
-\noindent
-Here $\textit{flatten}$ behaves like the traditional functional programming flatten
-function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
-removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
-
-Having defined the $\simp$ function,
-we can use the previous notation of  natural
-extension from derivative w.r.t.~character to derivative
-w.r.t.~string:%\comment{simp in  the [] case?}
-
-\begin{center}
-\begin{tabular}{lcl}
-$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
-$r \backslash_{simp} [\,] $ & $\dn$ & $r$
-\end{tabular}
-\end{center}
-
-\noindent
-to obtain an optimised version of the algorithm:
-
- \begin{center}
-\begin{tabular}{lcl}
-  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
-  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
-  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
-  & & $\;\;\textit{else}\;\textit{None}$
-\end{tabular}
-\end{center}
-
-\noindent
-This algorithm keeps the regular expression size small, for example,
-with this simplification our previous $(a + aa)^*$ example's 8000 nodes
-will be reduced to just 17 and stays constant, no matter how long the
-input string is.
-
-
-
-
-
-
-
-
-
 
 
 %-----------------------------------
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Mon Jun 06 23:17:45 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Thu Jun 09 12:57:53 2022 +0100
@@ -11,6 +11,119 @@
 
 
 
+Now we introduce the simplifications, which are the reason why we
+introduced the bitcodes in the first place.
+
+\subsection*{Simplification Rules}
+
+This section introduces aggressive (in terms of size) simplification rules
+on annotated regular expressions
+in order to keep derivatives small. Such simplifications are promising,
+as we have generated test data showing
+that a good tight size bound can be achieved. We could only
+partially cover the search space, as there are infinitely many regular
+expressions and strings. 
+
+One modification we introduced is to allow a list of annotated regular
+expressions in the $\sum$ constructor. This allows us to not just
+delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
+also unnecessary ``copies'' of regular expressions (very similar to
+simplifying $r + r$ to just $r$, but in a more general setting). Another
+modification is that we use simplification rules inspired by Antimirov's
+work on partial derivatives. They maintain the idea that only the first
+``copy'' of a regular expression in an alternative contributes to the
+calculation of a POSIX value. All subsequent copies can be pruned away from
+the regular expression. A recursive definition of our  simplification function 
+that looks somewhat similar to our Scala code is given below:
+%\comment{Use $\ZERO$, $\ONE$ and so on. 
+%Is it $ALTS$ or $ALTS$?}\\
+
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+   
+  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
+   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
+   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
+   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
+
+  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\
+  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse} \; bs \; a$ \\
+   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
+
+   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
+\end{tabular}    
+\end{center}    
+
+\noindent
+The simplification function does a pattern match on the regular expression.
+When it detects that the regular expression is an alternative or a
+sequence, it tries to simplify the child regular expressions
+recursively and then checks whether one of the children has turned into
+$\ZERO$ or $\ONE$, which might trigger further simplification at the
+current level.
+The most involved part is the $\sum$ clause, where we use two
+auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
+alternatives and remove as many duplicates as possible. Function
+$\textit{distinct}$ keeps only the first occurring copy and removes all later
+ones when it detects duplicates. Function $\textit{flatten}$ opens up nested
+$\sum$s; its recursive definition is given below:
+
+ \begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $\textit{flatten} \; ((_{bs}\sum \textit{as}) :: \textit{as'})$ & $\dn$ & $(\textit{map} \;
+     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
+  $\textit{flatten} \; (\ZERO :: as')$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
+    $\textit{flatten} \; (a :: as')$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
+\end{tabular}    
+\end{center}  
+
+\noindent
+Here $\textit{flatten}$ behaves like the traditional functional programming flatten
+function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
+removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
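+
+A Scala sketch of $\simp$ and $\textit{flatten}$ (called flts here) is
+given below, reusing the ARexp datatype and fuse from the sketches in
+the previous chapter. Two caveats: in the $(\ONE, a_2')$ clause we also
+fuse the bits sitting on the $\ONE$ itself, which the informal clause
+above leaves implicit, and we omit the $(a_1', \ONE)$ clause, since
+there the $\ONE$'s bits would have to be inserted behind the bits
+inside $a_1'$ rather than fused at the front. Scala's built-in
+distinct keeps the first occurring copy, exactly as $\textit{distinct}$
+above:
+\begin{verbatim}
+def flts(as: List[ARexp]): List[ARexp] = as match {
+  case Nil => Nil
+  case AZERO :: as1 => flts(as1)              // remove ZEROs
+  case AALTS(bs, as1) :: as2 =>               // open up nested sums
+    as1.map(fuse(bs, _)) ::: flts(as2)
+  case a :: as1 => a :: flts(as1)
+}
+
+def simp(a: ARexp): ARexp = a match {
+  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
+    case (AZERO, _) => AZERO
+    case (_, AZERO) => AZERO
+    case (AONE(bs1), a2s) => fuse(bs ::: bs1, a2s)
+    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
+  }
+  case AALTS(bs, as) => flts(as.map(simp)).distinct match {
+    case Nil => AZERO
+    case a1 :: Nil => fuse(bs, a1)
+    case as1 => AALTS(bs, as1)
+  }
+  case a => a                                 // otherwise unchanged
+}
+\end{verbatim}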
+
+Having defined the $\simp$ function,
+we can, as before, extend the derivative w.r.t.~a character to
+a derivative w.r.t.~a string:%\comment{simp in  the [] case?}
+
+\begin{center}
+\begin{tabular}{lcl}
+$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
+$r \backslash_{simp} [\,] $ & $\dn$ & $r$
+\end{tabular}
+\end{center}
+
+\noindent
+to obtain an optimised version of the algorithm:
+
+ \begin{center}
+\begin{tabular}{lcl}
+  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
+      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
+  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+  & & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+
+\noindent
+This algorithm keeps the regular expression size small: for example,
+with this simplification the 8000 nodes of our previous $(a + aa)^*$ example
+are reduced to just 17, and the size stays constant no matter how long the
+input string is.
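+
+A Scala sketch of the optimised lexer is given below. The function
+internalise (written $r^\uparrow$ above) translates a plain regular
+expression into an annotated one and follows the standard definition
+of Sulzmann and Lu, while decode, bder, bnullable and bmkeps are the
+ones sketched in the previous chapter:
+\begin{verbatim}
+def internalise(r: Rexp): ARexp = r match {
+  case ZERO => AZERO
+  case ONE => AONE(Nil)
+  case CHAR(c) => ACHAR(Nil, c)
+  case ALT(r1, r2) =>
+    AALTS(Nil, List(fuse(List(Z), internalise(r1)),
+                    fuse(List(S), internalise(r2))))
+  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
+  case STAR(r1) => ASTAR(Nil, internalise(r1))
+}
+
+// derivatives w.r.t. a string, simplifying after each step
+def bders_simp(s: List[Char], a: ARexp): ARexp = s match {
+  case Nil => a
+  case c :: cs => bders_simp(cs, simp(bder(c, a)))
+}
+
+def blexer_simp(r: Rexp, s: String): Option[Val] = {
+  val a = bders_simp(s.toList, internalise(r))
+  if (bnullable(a)) decode(r, bmkeps(a)) else None
+}
+\end{verbatim}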
+
+
+
+
+
+
+
+
+
 
 %----------------------------------------------------------------------------------------
 %	SECTION common identities
--- a/ChengsongTanPhdThesis/Chapters/Cubic.tex	Mon Jun 06 23:17:45 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Cubic.tex	Thu Jun 09 12:57:53 2022 +0100
@@ -245,19 +245,6 @@
 \end{figure}
 
 
-\pgfplotsset{
-    myplotstyle/.style={
-    legend style={draw=none, font=\small},
-    legend cell align=left,
-    legend pos=north east,
-    ylabel style={align=center, font=\bfseries\boldmath},
-    xlabel style={align=center, font=\bfseries\boldmath},
-    x tick label style={font=\bfseries\boldmath},
-    y tick label style={font=\bfseries\boldmath},
-    scaled ticks=true,
-    every axis plot/.append style={thick},
-    },
-}
 
 \begin{figure}
 \centering
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Mon Jun 06 23:17:45 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Thu Jun 09 12:57:53 2022 +0100
@@ -8,43 +8,34 @@
 %and then give the algorithm and its variant, and discuss
 %why more aggressive simplifications are needed. 
 
+In this chapter, we define the basic notions 
+for regular languages and regular expressions.
+We also give the definition of what $\POSIX$ lexing means.
 
-\section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
-We have a primitive datatype char, denoting characters.
-\[			char ::=  a
-			 \mid b
-			 \mid c
-			 \mid  \ldots
-			 \mid z       
-\]
-(which one is better?)
-\begin{center}
-\begin{tabular}{lcl}
-$\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
-\end{tabular}
-\end{center}
-They can form strings by lists:
+\section{Basic Concepts}
+Usually, formal language theory assumes an alphabet, that is,
+a set of characters.
+Here we only use the datatype of characters from Isabelle,
+which roughly corresponds to the ASCII characters.
+Using the usual $[]$ notation for lists,
+we can define strings as lists of characters:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
 & & $(c\; \text{has char type})$
 \end{tabular}
 \end{center}
-And strings can be concatenated to form longer strings:
-\begin{center}
-\begin{tabular}{lcl}
-$[] @ s_2$ & $\dn$ & $s_2$\\
-$(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
-\end{tabular}
-\end{center}
-
-A set of strings can operate with another set of strings:
+Strings can be concatenated to form longer strings,
+in the same way as we concatenate two lists;
+we denote this operator by $@$ and omit its precise
+recursive definition here.
+We overload this concatenation operator for sets of strings:
 \begin{center}
 \begin{tabular}{lcl}
 $A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
 \end{tabular}
 \end{center}
-We also call the above "language concatenation".
+We also call the above \emph{language concatenation}.
 The power of a language is defined recursively, using the 
 concatenation operator $@$:
 \begin{center}
@@ -54,38 +45,43 @@
 \end{tabular}
 \end{center}
 The union of all the natural number powers of a language   
-is denoted by the Kleene star operator:
+is the Kleene star of that language:
 \begin{center}
 \begin{tabular}{lcl}
  $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
 \end{tabular}
 \end{center}
 
-To get an induction principle in Isabelle/HOL, 
+\noindent
+However, to obtain a convenient induction principle 
+in Isabelle/HOL, 
 we instead define the Kleene star
 as an inductive set: 
+
 \begin{center}
-\begin{tabular}{lcl}
-$[] \in A*$  & &\\
-$s_1 \in A \land \; s_2 \in A* $ & $\implies$ & $s_1 @ s_2 \in A*$\\
-\end{tabular}
+\begin{mathpar}
+\inferrule{ }{[] \in A*}
+
+\inferrule{s_1 \in A \\ s_2 \in A*}{s_1 @ s_2 \in A*}
+\end{mathpar}
 \end{center}
-\subsection{Language Derivatives}
-We also define an operation of chopping off a character from
-a language:
+
+We also define an operation of ``chopping off'' a character from
+a language, which we call $\Der$, standing for the \emph{Derivative} of a language:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
 \end{tabular}
 \end{center}
-
-This can be generalised to chopping off a string from all strings within set $A$:
+\noindent
+This can be generalised to ``chopping off'' a string from all strings within a set $A$,
+with the help of the concatenation operator:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
 \end{tabular}
 \end{center}
-
+\noindent
 which is essentially the left quotient $A \backslash L'$ of $A$ against 
 the singleton language $L' = \{w\}$
 in formal language theory.
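+
+These operations can be made concrete by approximating a language as a
+finite set of strings in Scala (infinite languages, of course, cannot
+be represented this way); the following is just to illustrate the
+definitions:
+\begin{verbatim}
+// language concatenation on finite approximations
+def concat(A: Set[String], B: Set[String]): Set[String] =
+  for (s1 <- A; s2 <- B) yield s1 + s2
+
+// Der: chop off c from all strings in A starting with c
+def Der(c: Char, A: Set[String]): Set[String] =
+  for (s <- A if s.nonEmpty && s.head == c) yield s.tail
+
+// Ders: chop off the whole string w instead
+def Ders(w: String, A: Set[String]): Set[String] =
+  for (s <- A if s.startsWith(w)) yield s.drop(w.length)
+
+// example: Der('a', Set("abc", "bc")) == Set("bc")
+\end{verbatim}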
@@ -99,50 +95,48 @@
 \[
 	\Der \; c \; (A @ B) =
 	\begin{cases}
-	((\Der \; c \; A) @ B ) \cup \Der \; c\; B, &  \text{if} \;  [] \in A  \\
-	 (\Der \; c \; A) @ B, & \text{otherwise}
+	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
+	 (\Der \; c \; A) \,  @ \, B, & \text{otherwise}
 	 \end{cases}	
 \]
 \end{lemma}
 \noindent
 This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
 and get to $B$.
-
 The language $A*$'s derivative can be described using the language derivative
 of $A$:
 \begin{lemma}
-$\textit{Der} \;c \;A* = (\textit{Der}\; c A) @ (A*)$\\
+$\textit{Der} \;c \;(A*) = (\textit{Der}\; c\; A) @ (A*)$\\
 \end{lemma}
 \begin{proof}
 \begin{itemize}
 \item{$\subseteq$}
+\noindent
 The set 
 \[ \{s \mid c :: s \in A*\} \]
 is enclosed in the set
 \[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
 because whenever you have a string starting with a character 
-in the language of a Kleene star $A*$, then that character together with some sub-string
-immediately after it will form the first iteration, and the rest of the string will 
+in the language of a Kleene star $A*$, 
+then that character together with some sub-string
+immediately after it will form the first iteration, 
+and the rest of the string will 
 be still in $A*$.
 \item{$\supseteq$}
 Note that
-\[ \Der \; c \; A* = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
+\[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
 and 
 \[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
-where the $\textit{RHS}$ of the above equatioin can be rewritten
+where the $\textit{RHS}$ of the above equation can be rewritten
 as \[ (\Der \; c\; A) @ A* \cup A' \], $A'$ being a possibly empty set.
 \end{itemize}
 \end{proof}
+
+\noindent
 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
 for regular languages, we need to first give definitions for regular expressions.
 
 \subsection{Regular Expressions and Their Meaning}
-Suppose we have an alphabet $\Sigma$, the strings  whose characters
-are from $\Sigma$
-can be expressed as $\Sigma^*$.
-
-We use patterns to define a set of strings concisely. Regular expressions
-are one of such patterns systems:
 The basic regular expressions  are defined inductively
  by the following grammar:
 \[			r ::=   \ZERO \mid  \ONE
@@ -151,6 +145,16 @@
 			 \mid  r_1 + r_2   
 			 \mid r^*         
 \]
+\noindent
+We call them basic because we might introduce
+more constructs later, such as negation
+and bounded repetitions.
+We write $\ZERO$ for the regular expression denoting
+the empty language; note that some authors
+use $\phi$ for it instead.
+Similarly, the regular expression denoting the 
+singleton set containing only $[]$ is sometimes 
+written $\epsilon$, but we use $\ONE$ here.
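+
+In Scala, this grammar corresponds to a simple algebraic datatype;
+a sketch (the constructor names are our own choice) is:
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp                    // empty language
+case object ONE extends Rexp                     // language {[]}
+case class CHAR(c: Char) extends Rexp            // single character
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative
+case class STAR(r: Rexp) extends Rexp            // Kleene star
+\end{verbatim}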
 
 The language or set of strings denoted 
 by regular expressions are defined as
@@ -172,13 +176,24 @@
 Now with semantic derivatives of a language and regular expressions and
 their language interpretations in place, we are ready to define derivatives on regexes.
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
+
+\ChristianComment{Hi this part I want to keep the ordering as is, so that it keeps the 
+readers engaged with a story how we got to the definition of $\backslash$, rather 
+than first "overwhelming" them with the definition of $\nullable$.}
+
 The language derivative acts on a string set and chops off a character from
 all strings in that set, we want to define a derivative operation on regular expressions
 so that after derivative $L(r\backslash c)$ 
 will look as if it was obtained by doing a language derivative on $L(r)$:
+\begin{center}
 \[
-L(r \backslash c) = \Der \; c \; L(r)
+r\backslash c \dn ?
 \]
+so that
+\[
+L(r \backslash c) = \Der \; c \; L(r) ?
+\]
+\end{center}
 So we mimic the equalities we have for $\Der$ on language concatenation
 
 \[
@@ -213,7 +228,7 @@
 		$d \backslash c$     & $\dn$ & 
 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
+$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
@@ -236,7 +251,8 @@
 for possible more iterations in later phases of lexing.
 
 
-The $\nullable$ function tests whether the empty string $""$ 
+To test whether $[] \in L(r_1)$, we need the $\nullable$ function,
+which tests whether the empty string
 is in the language of $r$:
 
 
@@ -302,7 +318,7 @@
 and then define Brzozowski's  regular-expression matching algorithm as:
 
 \begin{definition}
-$match\;s\;r \;\dn\; nullable(r\backslash s)$
+$\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
 \end{definition}
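+
+\noindent
+Using the Rexp datatype sketched earlier, $\nullable$, the derivative
+and the matcher can be transcribed into Scala directly; again this is
+an illustration rather than the verified definition:
+\begin{verbatim}
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// build all derivatives, then test for nullability
+def matcher(r: Rexp, s: String): Boolean =
+  nullable(s.toList.foldLeft(r)((r1, c) => der(c, r1)))
+\end{verbatim}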
 
 \noindent
@@ -316,6 +332,362 @@
 \end{equation}
 
 \noindent
+
+
+The algorithm so far builds derivatives and then tests whether
+the empty string is in the language of the resulting regular expression.
+So far, so good. But what if we want to 
+do lexing instead of just getting a YES/NO answer?
+Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
+elegant (arguably as beautiful as the definition of the original derivative) solution for this.
+\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
+Here we present the hybrid phases of a regular expression lexing 
+algorithm using the function $\inj$, as given by Sulzmann and Lu.
+They first defined the datatypes for storing the 
+lexing information called a \emph{value} or
+sometimes also \emph{lexical value}.  These values and regular
+expressions correspond to each other as illustrated in the following
+table:
+
+\begin{center}
+	\begin{tabular}{c@{\hspace{20mm}}c}
+		\begin{tabular}{@{}rrl@{}}
+			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
+			$r$ & $::=$  & $\ZERO$\\
+			& $\mid$ & $\ONE$   \\
+			& $\mid$ & $c$          \\
+			& $\mid$ & $r_1 \cdot r_2$\\
+			& $\mid$ & $r_1 + r_2$   \\
+			\\
+			& $\mid$ & $r^*$         \\
+		\end{tabular}
+		&
+		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
+			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
+			$v$ & $::=$  & \\
+			&        & $\Empty$   \\
+			& $\mid$ & $\Char(c)$          \\
+			& $\mid$ & $\Seq\,v_1\, v_2$\\
+			& $\mid$ & $\Left(v)$   \\
+			& $\mid$ & $\Right(v)$  \\
+			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
+		\end{tabular}
+	\end{tabular}
+\end{center}
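+
+\noindent
+As with regular expressions, the values can be rendered as a Scala
+datatype; Chr and Sequ are our names, chosen to avoid clashes with the
+regular-expression constructors:
+\begin{verbatim}
+abstract class Val
+case object Empty extends Val                    // for ONE
+case class Chr(c: Char) extends Val              // for CHAR(c)
+case class Sequ(v1: Val, v2: Val) extends Val    // for SEQ
+case class Left(v: Val) extends Val              // left ALT branch
+case class Right(v: Val) extends Val             // right ALT branch
+case class Stars(vs: List[Val]) extends Val      // STAR iterations
+\end{verbatim}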
+
+\noindent
+We have a formal binary relation that tells us whether the structure
+of a value agrees with a given regular expression:
+\begin{mathpar}
+\inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
+\inferrule{}{\vdash \Empty :  \ONE} \hspace{2em}
+\inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
+\end{mathpar}
+
+Building on top of Sulzmann and Lu's attempt to formalise the 
+notion of POSIX lexing rules \parencite{Sulzmann2014}, 
+Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
+POSIX matching as a ternary relation recursively defined in a
+natural deduction style.
+The formal definition of a $\POSIX$ value $v$ for a regular expression
+$r$ and string $s$, written $(s, r) \rightarrow v$, is given
+by the following set of rules:
+\ChristianComment{Will complete later}
+\newcommand*{\inference}[3][t]{%
+   \begingroup
+   \def\and{\\}%
+   \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
+   #2 \\
+   \hline
+   #3
+   \end{tabular}%
+   \endgroup
+}
+\begin{center}
+\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
+\end{center}
+\noindent
+The above $\POSIX$ rules can be explained intuitively as follows:
+\begin{itemize}
+\item
+match the leftmost regular expression when multiple options for matching
+are available;
+\item 
+always match a subpart as much as possible before proceeding
+to the next token.
+\end{itemize}
+
+The reason why we are interested in $\POSIX$ values is that they can
+be practically used in the lexing phase of a compiler front end.
+For instance, when lexing a code snippet 
+$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
+as an identifier rather than a keyword.
+
+A good property of a $\POSIX$ value is that,
+given the same regular expression $r$ and string $s$,
+it is uniquely determined:
+\begin{lemma}
+$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
+\end{lemma}
+Now that we know what a $\POSIX$ value is, the problem is how to compute
+such a value in a lexing algorithm using derivatives.
+
+\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
+
+The contribution of Sulzmann and Lu is an extension of Brzozowski's
+algorithm by a second phase (the first phase being building successive
+derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
+is generated if the regular expression matches the string.
+Two functions are involved: $\inj$ and $\mkeps$.
+The function $\mkeps$ constructs a value from the last
+one of all the successive derivatives:
+\begin{ceqn}
+\begin{equation}\label{graph:mkeps}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
+	        & 	              & 	            & v_n       
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+
+It tells us how an empty string can be matched by a
+regular expression, in a $\POSIX$ way:
+
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
+			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
+			& \textit{if} $\nullable(r_{1})$\\ 
+			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
+			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
+			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
+			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
+		\end{tabular}
+	\end{center}
+
+
+\noindent 
+We favour the left to match an empty string if there is a choice.
+When there is a star for us to match the empty string,
+we give the $\Stars$ constructor an empty list, meaning
+no iterations are taken.
+The result of a call to $\mkeps$ on a $\nullable$ regular expression $r$ is
+a $\POSIX$ value for $r$ and the empty string:
+\begin{lemma}\label{mePosix}
+$\nullable(r) \implies (r, []) \rightarrow (\mkeps\; r)$
+\end{lemma}
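+
+\noindent
+In Scala, $\mkeps$ is a direct transcription on the datatypes sketched
+above; the match is partial because $\mkeps$ is only ever called on
+nullable regular expressions:
+\begin{verbatim}
+def mkeps(r: Rexp): Val = r match {
+  case ONE => Empty
+  case ALT(r1, r2) =>
+    // favour the left alternative if it is nullable
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_) => Stars(Nil)   // zero iterations
+}
+\end{verbatim}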
+
+
+After the $\mkeps$-call, we inject back the characters one by one in order to build
+the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
+($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
+After injecting back $n$ characters, we get the lexical value for how $r_0$
+matches $s$. 
+To do this, Sulzmann and Lu defined a function that reverses
+the ``chopping off'' of characters during the derivative phase. The
+corresponding function is called \emph{injection}, written
+$\textit{inj}$; it takes three arguments: the first one is a regular
+expression ${r_{i-1}}$, before the character is chopped off, the second
+is a character ${c_{i-1}}$, the character we want to inject and the
+third argument is the value ${v_i}$, into which one wants to inject the
+character (it corresponds to the regular expression after the character
+has been chopped off). The result of this function is a new value. 
+\begin{ceqn}
+\begin{equation}\label{graph:inj}
+\begin{tikzcd}
+r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
+v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+
+
+\noindent
+The
+definition of $\textit{inj}$ is as follows: 
+
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char(c)$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\end{tabular}
+\end{center}
+
+\noindent 
+This definition is by recursion on the ``shape'' of regular
+expressions and values.
+Each clause does one thing: it identifies the ``hole'' in the
+value into which the character is injected back.
+For instance, consider the last clause, where we inject back into a
+value for a star. We know the given value must be a sequence value:
+its first component corresponds to the star's child regular expression
+with the first character chopped off, namely the iteration of the star
+that has just been unfolded, and its second component collects the
+already matched star iterations. So we inject the character back into
+the first value and add this latest iteration to the front of the
+previous list of iterations, all under a $\Stars$ constructor.
+The POSIX value is maintained throughout this process:
+\begin{lemma}\label{injPosix}
+$(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$
+\end{lemma}
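+
+\noindent
+A Scala sketch of $\inj$, mirroring the clauses above (the match is
+partial by design, as $\inj$ is only applied to well-formed
+regex/value pairs):
+\begin{verbatim}
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (CHAR(d), Empty) => Chr(d)
+  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
+  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
+}
+\end{verbatim}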
+
+
+Putting the functions $\inj$, $\mkeps$ and $\backslash$ together,
+and taking into consideration the possibility of a non-match,
+we obtain a lexer with the following recursive definition:
+\begin{center}
+\begin{tabular}{lcl}
+$\lexer \; r \; [] $ & $\dn$ & $\textit{if}\; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
+$\lexer \; r \;(c::s)$ & $\dn$ & $\textit{case}\; \lexer \; (r\backslash c) \; s\; \textit{of}$\\
+& & $\quad\None \implies \None$\\
+& & $\quad\mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
+\end{tabular}
+\end{center}
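+
+\noindent
+A Scala sketch of this lexer, using der, nullable, mkeps and inj from
+the sketches above:
+\begin{verbatim}
+def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
+  case Nil => if (nullable(r)) Some(mkeps(r)) else None
+  case c :: cs => lexer(der(c, r), cs) match {
+    case None => None
+    case Some(v) => Some(inj(r, c, v))
+  }
+}
+\end{verbatim}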
+ \noindent
+ The central property of $\lexer$ is that it gives the correct result
+ according to
+ $\POSIX$ rules:
+ \begin{lemma}
+ \begin{tabular}{l}
+ $s \in L(r) \Longleftrightarrow  (\exists v. \; \lexer \; r \; s = \Some(v) \land (r, \; s) \rightarrow v)$\\
+ $s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$
+ \end{tabular}
+ \end{lemma}
+ 
+ 
+ \begin{proof}
+ By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
+ The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
+ by lemma \ref{injPosix}.
+ \end{proof}
+
+For convenience, we shall employ the following notations: the regular
+expression we start with is $r_0$, and the given string $s$ is composed
+of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
+left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
+to the characters $c_0$, $c_1$  until we exhaust the string and obtain
+the derivative $r_n$. We test whether this derivative is
+$\textit{nullable}$ or not. If not, we know the string does not match
+$r$, and no value needs to be generated. If yes, we start building the
+values incrementally by \emph{injecting} back the characters into the
+earlier values $v_n, \ldots, v_0$.
+Pictorially, the algorithm is as follows:
+
+\begin{ceqn}
+\begin{equation}\label{graph:2}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
+v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+
+
+\noindent
+ This is the second phase of the
+algorithm, from right to left. For the first value $v_n$, we call the
+function $\textit{mkeps}$, which builds a POSIX lexical value
+for how the empty string has been matched by the (nullable) regular
+expression $r_n$, as defined earlier.
+
+
+
+We mentioned before that derivatives without simplification 
+can get clumsy, and this is true for values as well: by definition they
+reflect the size of the regular expression.
+
+One can introduce simplification on the regular expressions and values,
+but one has to be careful not to break the correctness, as the injection 
+function heavily relies on the structure of the regular expressions and
+values being correct and matching each other.
+This can be achieved by recording some extra rectification functions
+during the derivative steps and applying these rectifications in 
+each run during the injection phase.
+One can then prove that the POSIX value of how
+regular expressions match strings is not affected, although this is
+much harder to establish. 
+Some initial results in this regard have been
+obtained in \cite{AusafDyckhoffUrban2016}. 
+
+
+
+%Brzozowski, after giving the derivatives and simplification,
+%did not explore lexing with simplification, or he may well be 
+%stuck on an efficient simplification with proof.
+%He went on to examine the use of derivatives together with 
+%automaton, and did not try lexing using products.
+
+We want to get rid of the complex and fragile rectification of values.
+Can we avoid creating the intermediate values $v_1,\ldots, v_n$ altogether
+and obtain the lexing information that should already be there while
+doing derivatives, in one pass, without a second injection phase?
+In the meantime, can we make sure that simplifications
+are easily handled without breaking the correctness of the algorithm?
+
+Sulzmann and Lu solved this problem by
+adding additional information to the 
+regular expressions called \emph{bitcodes}.
+
+
+
+
+
+With the formally-specified rules for what a POSIX matching is,
+Ausaf and Urban proved in Isabelle/HOL that the algorithm gives correct results.
+But having a correct result is still not enough;
+we want at least some degree of \emph{efficiency}.
+
+
+A pair of a regular expression and a string can have multiple lexical values.
+Take the example where $r= (a^*\cdot a^*)^*$ and the string 
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
+If we do not allow any empty iterations in its lexical values,
+there will be $n - 1$ ``splitting points'' in $s$ where we can choose whether to
+split, so that each sub-string
+segmented by the chosen splitting points forms a different iteration:
+\begin{center}
+\begin{tabular}{lcr}
+$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
+$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
+$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
+ & $\textit{etc}.$ &
+ \end{tabular}
+\end{center}
+\noindent
+And for each iteration, there are still multiple ways to split
+between the two $a^*$s.
+It is not surprising that there are exponentially many distinct lexical values
+for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
+A lexer that keeps all the possible values will naturally 
+have an exponential runtime on ambiguous regular expressions.
+With just $\inj$ and $\mkeps$, the lexing algorithm would keep track of all the
+different values of a match. This means Sulzmann and Lu's injection-based
+algorithm would be exponential by nature.
+Somehow one has to decide which lexical value to keep and
+output in a lexing algorithm.
+
+
+ For example, the above $r= (a^*\cdot a^*)^*$  and 
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
+$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
+The output we want from a lexing algorithm is such a POSIX matching
+encoded as a value.
+
+
+
+
+
+%kind of redundant material
+
+
 where we start with  a regular expression  $r_0$, build successive
 derivatives until we exhaust the string and then use \textit{nullable}
 to test whether the result can match the empty string. It can  be
@@ -405,314 +777,5 @@
  a very tight size bound, possibly as low
   as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 
 
-Building derivatives and then simplifying them.
-So far, so good. But what if we want to 
-do lexing instead of just getting a YES/NO answer?
-This requires us to go back again to the world 
-without simplification first for a moment.
-Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
-elegant(arguably as beautiful as the original
-derivatives definition) solution for this.
-
-\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
-
-
-They first defined the datatypes for storing the 
-lexing information called a \emph{value} or
-sometimes also \emph{lexical value}.  These values and regular
-expressions correspond to each other as illustrated in the following
-table:
-
-\begin{center}
-	\begin{tabular}{c@{\hspace{20mm}}c}
-		\begin{tabular}{@{}rrl@{}}
-			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
-			$r$ & $::=$  & $\ZERO$\\
-			& $\mid$ & $\ONE$   \\
-			& $\mid$ & $c$          \\
-			& $\mid$ & $r_1 \cdot r_2$\\
-			& $\mid$ & $r_1 + r_2$   \\
-			\\
-			& $\mid$ & $r^*$         \\
-		\end{tabular}
-		&
-		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
-			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
-			$v$ & $::=$  & \\
-			&        & $\Empty$   \\
-			& $\mid$ & $\Char(c)$          \\
-			& $\mid$ & $\Seq\,v_1\, v_2$\\
-			& $\mid$ & $\Left(v)$   \\
-			& $\mid$ & $\Right(v)$  \\
-			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
-		\end{tabular}
-	\end{tabular}
-\end{center}
-
-\noindent
-
-Building on top of Sulzmann and Lu's attempt to formalise the 
-notion of POSIX lexing rules \parencite{Sulzmann2014}, 
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
-POSIX matching as a ternary relation recursively defined in a
-natural deduction style.
-With the formally-specified rules for what a POSIX matching is,
-they proved in Isabelle/HOL that the algorithm gives correct results.
-
-But having a correct result is still not enough, 
-we want at least some degree of $\mathbf{efficiency}$.
-
-
-
-One regular expression can have multiple lexical values. For example
-for the regular expression $(a+b)^*$, it has a infinite list of
-values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
-$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
-$\ldots$, and vice versa.
-Even for the regular expression matching a particular string, there could 
-be more than one value corresponding to it.
-Take the example where $r= (a^*\cdot a^*)^*$ and the string 
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-If we do not allow any empty iterations in its lexical values,
-there will be $n - 1$ "splitting points" on $s$ we can choose to 
-split or not so that each sub-string
-segmented by those chosen splitting points will form different iterations:
-\begin{center}
-\begin{tabular}{lcr}
-$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
-$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
-$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
- & $\textit{etc}.$ &
- \end{tabular}
-\end{center}
-
-And for each iteration, there are still multiple ways to split
-between the two $a^*$s.
-It is not surprising there are exponentially many lexical values
-that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-
-A lexer to keep all the possible values will naturally 
-have an exponential runtime on ambiguous regular expressions.
-Somehow one has to decide which lexical value to keep and
-output in a lexing algorithm.
-In practice, we are usually 
-interested in POSIX values, which by intuition always
-\begin{itemize}
-\item
-match the leftmost regular expression when multiple options of matching
-are available  
-\item 
-always match a subpart as much as possible before proceeding
-to the next token.
-\end{itemize}
-The formal definition of a $\POSIX$ value $v$ for a regular expression
-$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
-in the following set of rules:
-(TODO: write the entire set of inference rules for POSIX )
-\newcommand*{\inference}[3][t]{%
-   \begingroup
-   \def\and{\\}%
-   \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
-   #2 \\
-   \hline
-   #3
-   \end{tabular}%
-   \endgroup
-}
-\begin{center}
-\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
-\end{center}
-
-The reason why we are interested in $\POSIX$ values is that they can
-be practically used in the lexing phase of a compiler front end.
-For instance, when lexing a code snippet 
-$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
-as an identifier rather than a keyword.
-
- For example, the above $r= (a^*\cdot a^*)^*$  and 
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
-$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
-The output of an algorithm we want would be a POSIX matching
-encoded as a value.
-
-
-
-
-The contribution of Sulzmann and Lu is an extension of Brzozowski's
-algorithm by a second phase (the first phase being building successive
-derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
-is generated if the regular expression matches the string.
-How can we construct a value out of regular expressions and character
-sequences only?
-Two functions are involved: $\inj$ and $\mkeps$.
-The function $\mkeps$ constructs a value from the last
-one of all the successive derivatives:
-\begin{ceqn}
-\begin{equation}\label{graph:mkeps}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
-	        & 	              & 	            & v_n       
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
-
-It tells us how can an empty string be matched by a 
-regular expression, in a $\POSIX$ way:
-
-	\begin{center}
-		\begin{tabular}{lcl}
-			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
-			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
-			& \textit{if} $\nullable(r_{1})$\\ 
-			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
-			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
-			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
-			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
-		\end{tabular}
-	\end{center}
-
-
-\noindent 
-We favour the left to match an empty string if there is a choice.
-When there is a star for us to match the empty string,
-we give the $\Stars$ constructor an empty list, meaning
-no iterations are taken.
-
-
-After the $\mkeps$-call, we inject back the characters one by one in order to build
-the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
-After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. The POSIX value is maintained throughout the process.
-For this Sulzmann and Lu defined a function that reverses
-the ``chopping off'' of characters during the derivative phase. The
-corresponding function is called \emph{injection}, written
-$\textit{inj}$; it takes three arguments: the first one is a regular
-expression ${r_{i-1}}$, before the character is chopped off, the second
-is a character ${c_{i-1}}$, the character we want to inject and the
-third argument is the value ${v_i}$, into which one wants to inject the
-character (it corresponds to the regular expression after the character
-has been chopped off). The result of this function is a new value. 
-\begin{ceqn}
-\begin{equation}\label{graph:inj}
-\begin{tikzcd}
-r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
-v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
-
-
-\noindent
-The
-definition of $\textit{inj}$ is as follows: 
-
-\begin{center}
-\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
-  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\mkeps(r_1),\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
-\end{center}
-
-\noindent This definition is by recursion on the ``shape'' of regular
-expressions and values. 
-Each clause does one thing: it identifies the ``hole'' in the
-value into which the character has to be injected back.
-For instance, consider the last clause, which builds a new star value.
-We know the given value must be a sequence value: its first 
-component corresponds to the child regex of the star
-with the first character chopped off---the iteration of the star
-that has just been unfolded---while its second component collects the
-already matched star iterations. So we inject the character 
-back into the first component and add this latest iteration
-to the front of the previous list of iterations, all under the $\Stars$
-constructor.
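-
-As a small worked instance of this star clause (the regular expression
-is chosen by us, with the derivative computed by the rules from
-Chapter \ref{Inj}): for $r = a^*$ we have
-$r\backslash a = (a\backslash a)\cdot a^* = \ONE\cdot a^*$, whose only
-value for the empty string is $\Seq(\Empty, \Stars\,[])$.
-Injecting the character $a$ back gives
-\begin{center}
-$\inj\,(a^*)\,a\,(\Seq(\Empty, \Stars\,[])) \;=\; \Stars\,[\inj\,a\,a\,\Empty] \;=\; \Stars\,[\Char(a)]$
-\end{center}
-that is, the character $a$ matched as a single iteration of the star.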
-
-Putting the functions $\inj$, $\mkeps$ and $\backslash$ together,
-and using $\Some$/$\None$ to record whether a match exists at all,
-we obtain a lexer with the following recursive definition:
-\begin{center}
-\begin{tabular}{lcl}
-$\lexer \; r \; [] $ & $\dn$ & $\textit{if} \; \nullable \; r \; \textit{then} \; \Some\,(\mkeps \; r) \; \textit{else} \; \None$\\
-$\lexer \; r \; (c::s)$ & $\dn$ & $\textit{case} \; \lexer \; (r\backslash c) \; s \; \textit{of}$\\
-                        &       & $\qquad \None \Rightarrow \None$\\
-                        &       & $\qquad \Some\,v \Rightarrow \Some\,(\inj \; r \; c \; v)$
-\end{tabular}
-\end{center}
- 
-Pictorially, the algorithm is as follows:
-
-\begin{ceqn}
-\begin{equation}\label{graph:2}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
-v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
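-
-To see all three functions at work on a tiny example (the regular
-expression and string are chosen by us; the derivatives are computed
-with the rules from Chapter \ref{Inj}), consider lexing the string
-$ab$ with $r_0 = a\cdot b$:
-\begin{center}
-\begin{tabular}{lcl}
-$r_1 = r_0\backslash a$ & $=$ & $\ONE\cdot b$\\
-$r_2 = r_1\backslash b$ & $=$ & $(\ZERO\cdot b) + \ONE$\\
-$v_2 = \mkeps(r_2)$ & $=$ & $\Right(\Empty)$\\
-$v_1 = \inj\,r_1\,b\,v_2$ & $=$ & $\Seq(\Empty, \Char(b))$\\
-$v_0 = \inj\,r_0\,a\,v_1$ & $=$ & $\Seq(\Char(a), \Char(b))$
-\end{tabular}
-\end{center}
-The nullability test on $r_2$ succeeds through the right alternative
-$\ONE$, and the two injections then rebuild the characters in reverse
-order.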
-
-
-\noindent
-For convenience, we shall employ the following notations: the regular
-expression we start with is $r_0$, and the given string $s$ is composed
-of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
-left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
-to the characters $c_0$, $c_1$  until we exhaust the string and obtain
-the derivative $r_n$. We test whether this derivative is
-$\textit{nullable}$ or not. If not, we know the string does not match
-$r$, and no value needs to be generated. If yes, we start building the
-values incrementally by \emph{injecting} back the characters into the
-earlier values $v_n, \ldots, v_0$. This is the second phase of the
-algorithm from right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds a POSIX lexical value
-for how the empty string has been matched by the (nullable) regular
-expression $r_n$. This function was defined earlier in this section.
-
-We have mentioned before that derivatives without simplification 
-can get clumsy, and this is true for values as well--they reflect
-the size of the regular expression by definition.
-
-One can introduce simplification on the regexes and values but has to
-be careful not to break the correctness, as the injection 
-function heavily relies on the structure of the regexes and values
-matching each other.
-Simplification can be achieved by recording some extra rectification functions
-during the derivative steps and applying these rectifications 
-during the injection phase.
-One can then prove that the POSIX value of how
-regular expressions match strings is not affected---although this is much harder
-to establish. 
-Some initial results in this regard have been
-obtained in \cite{AusafDyckhoffUrban2016}. 
-
-
-
-%Brzozowski, after giving the derivatives and simplification,
-%did not explore lexing with simplification, or he may well be 
-%stuck on an efficient simplification with proof.
-%He went on to examine the use of derivatives together with 
-%automaton, and did not try lexing using products.
-
-We want to get rid of the complex and fragile rectification of values.
-Can we avoid creating those intermediate values $v_1,\ldots, v_n$
-altogether, and gather the lexing information that is already there while
-doing derivatives, in one pass and without a second injection phase?
-And can we at the same time make sure that simplifications
-are easily handled without breaking the correctness of the algorithm?
-
-Sulzmann and Lu solved this problem by
-introducing additional information to the 
-regular expressions called \emph{bitcodes}.
-
-
-
-
-
 
 	
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Mon Jun 06 23:17:45 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Jun 09 12:57:53 2022 +0100
@@ -37,6 +37,8 @@
 
 \newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
 
+\def\Some{\textit{Some}}
+\def\None{\textit{None}}
 \def\code{\textit{code}}
 \def\decode{\textit{decode}}
 \def\internalise{\textit{internalise}}
@@ -48,10 +50,13 @@
 \def\AONE{\textit{AONE}}
 \def\ACHAR{\textit{ACHAR}}
 
+\def\fuse{\textit{fuse}}
+\def\bder{\textit{bder}}
 \def\POSIX{\textit{POSIX}}
 \def\ALTS{\textit{ALTS}}
 \def\ASTAR{\textit{ASTAR}}
 \def\DFA{\textit{DFA}}
+\def\NFA{\textit{NFA}}
 \def\bmkeps{\textit{bmkeps}}
 \def\retrieve{\textit{retrieve}}
 \def\blexer{\textit{blexer}}
@@ -91,7 +96,7 @@
 \newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
 \newcommand\asize[1]{\llbracket #1 \rrbracket}
 \newcommand\rerase[1]{ (#1)\downarrow_r}
-
+\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}
 
 \def\erase{\textit{erase}}
 \def\STAR{\textit{STAR}}
@@ -106,46 +111,28 @@
 \newcommand\RSTAR[1]{#1^*}
 \newcommand\vsuf[2]{\textit{vsuf} \;#1\;#2}
 
+
+
+\pgfplotsset{
+    myplotstyle/.style={
+    legend style={draw=none, font=\small},
+    legend cell align=left,
+    legend pos=north east,
+    ylabel style={align=center, font=\bfseries\boldmath},
+    xlabel style={align=center, font=\bfseries\boldmath},
+    x tick label style={font=\bfseries\boldmath},
+    y tick label style={font=\bfseries\boldmath},
+    scaled ticks=true,
+    every axis plot/.append style={thick},
+    },
+}
+
 %----------------------------------------------------------------------------------------
 %This part is about regular expressions, Brzozowski derivatives,
 %and a bit-coded lexing algorithm with proven correctness and time bounds.
 
 %TODO: look up snort rules to use here--give readers idea of what regexes look like
 
-
-Regular expressions are widely used in computer science: 
-be it in text-editors\parencite{atomEditor} with syntax highlighting and auto-completion, 
-command-line tools like $\mathit{grep}$ that facilitate easy 
-text processing, network intrusion
-detection systems that reject suspicious traffic, or compiler
-front ends--the majority of the solutions to these tasks 
-involve lexing with regular 
-expressions.
-Given its usefulness and ubiquity, one would imagine that
-modern regular expression matching implementations
-are mature and fully studied.
-Indeed, in a popular programming language' regex engine, 
-supplying it with regular expressions and strings, one can
-get rich matching information in a very short time.
-Some network intrusion detection systems
-use regex engines that are able to process 
-megabytes or even gigabytes of data per second\parencite{Turo_ov__2020}.
-Unfortunately, this is not the case for $\mathbf{all}$ inputs.
-%TODO: get source for SNORT/BRO's regex matching engine/speed
-
-
-Take $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
-expression. Obviously this is not the case---the expected $b$ in the last
-position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8, even with strings of a small
-length, say around 30 $a$'s, one discovers that 
-this decision takes crazy time to finish given the simplicity of the problem.
-This is clearly exponential behaviour, and 
-is triggered by some relatively simple regex patterns, as the graphs
-below show:
-
 \begin{figure}
 \centering
 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
@@ -220,29 +207,70 @@
 
 
 
-This superlinear blowup in matching algorithms sometimes cause
-considerable grief in real life: for example on 20 July 2016 one evil
+
+Regular expressions are widely used in computer science: 
+be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
+command-line tools like $\mathit{grep}$ that facilitate easy 
+text-processing; network intrusion
+detection systems that reject suspicious traffic; or compiler
+front ends--the majority of the solutions to these tasks 
+involve lexing with regular 
+expressions.
+Given their usefulness and ubiquity, one would imagine that
+modern regular expression matching implementations
+are mature and fully studied.
+Indeed, supplying a popular programming language's regex engine 
+with regular expressions and strings, one can
+usually get rich matching information in a very short time.
+Some network intrusion detection systems
+use regex engines that are able to process 
+megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
+Unfortunately, this is not the case for $\mathbf{all}$ inputs.
+%TODO: get source for SNORT/BRO's regex matching engine/speed
+
+
+Take $(a^*)^*\,b$ and ask whether
+strings of the form $aa\ldots a$ match this regular
+expression. Obviously this is not the case---the expected $b$ in the last
+position is missing. One would expect that modern regular expression
+matching engines can find this out very quickly. Alas, if one tries
+this example in JavaScript, Python or Java 8, even with strings of a small
+length, say around 30 $a$'s, one discovers that 
+this decision takes an unreasonably long time given the simplicity of the problem.
+This is clearly exponential behaviour, and 
+is triggered by some relatively simple regex patterns, as the graphs
+in Figure~\ref{fig:aStarStarb} show.
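+
+One can reproduce this behaviour with a few lines of Python (a small
+experiment on our side; absolute timings are machine-dependent, the
+point is the growth):
+\begin{verbatim}
+import re, time
+
+for n in range(20, 29):
+    s = 'a' * n                 # no trailing 'b': cannot match
+    start = time.time()
+    re.match(r'(a*)*b', s)      # backtracks catastrophically
+    print(n, round(time.time() - start, 2))
+\end{verbatim}
+The running time grows exponentially with $n$.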
+
+
+
+
+\ChristianComment{Superlinear: I just leave out the explanation, 
+which I find would distract the flow if used. Plus, if I just say exponential
+here, the 2016 event in StackExchange was not exponential, but just quadratic, so it would be 
+inaccurate}
+
+This superlinear blowup in regular expression engines
+has repeatedly caused grief in real life.
+For example, on 20 July 2016 one evil
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
-knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
+knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU-resources---causing web servers to grind to a
-halt. This happened when a post with 20,000 white spaces was submitted,
-but importantly the white spaces were neither at the beginning nor at
-the end. As a result, the regular expression matching engine needed to
-backtrack over many choices. In this example, the time needed to process
+massive amounts of CPU resources---causing web servers to grind to a
+halt. The time needed to process
 the string was $O(n^2)$ with respect to the string length. This
 quadratic overhead was enough for the homepage of Stack Exchange to
 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
 attack and therefore stopped the servers from responding to any
 requests. This made the whole site become unavailable. 
+
 A more recent example is a global outage of all Cloudflare servers on 2 July
 2019. A poorly written regular expression exhibited exponential
 behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
 had several causes, at the heart was a regular expression that
 was used to monitor network
-traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
+traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
 %TODO: data points for some new versions of languages
 These problems with regular expressions 
 are not isolated events that happen
@@ -250,15 +278,64 @@
 They occur so often that they get a 
 name--Regular-Expression-Denial-Of-Service (ReDoS)
 attack.
-Davis et al. \parencite{Davis18} detected more
+\citeauthor{Davis18} detected more
 than 1000 super-linear (SL) regular expressions
 in Node.js, Python core libraries, and npm and pypi. 
 They therefore concluded that evil regular expressions
-are problems more than "a parlour trick", but one that
+are problems "more than a parlour trick", but one that
 requires
 more research attention.
 
- \section{The Problem Behind Slow Cases}
+
+But the problems are not limited to slowness in certain 
+cases. 
+Another problem with these libraries is that there
+is no correctness guarantee.
+In some cases, they either fail to generate a lexing result when there exists a match,
+or give results that are inconsistent with the $\POSIX$ standard.
+A concrete example would be 
+the regex
+\begin{verbatim}
+(aba|ab|a)*
+\end{verbatim}
+and the string
+\begin{verbatim}
+ababa
+\end{verbatim}
+The correct $\POSIX$ match for the above is 
+the entire string $ababa$, 
+split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
+$[0, 2)$ and $[2, 5)$
+respectively.
+But trying this out in regex101 \parencite{regex101}
+with different language engines would yield 
+the same two fragmented matches: $[aba]$ at $[0, 3)$
+and $a$ at $[4, 5)$.
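+
+This deviation can be observed directly with a backtracking engine,
+for example Python's \textit{re} module (a small illustrative check on
+our side, not part of the formal development):
+\begin{verbatim}
+import re
+
+# POSIX semantics would match the whole string "ababa"
+# as the two star iterations [ab][aba]; the backtracking
+# engine commits to "aba" in the first iteration and stops.
+m = re.match(r'(aba|ab|a)*', 'ababa')
+print(m.group(0))    # prints 'aba', not 'ababa'
+\end{verbatim}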
+
+Kuklewicz \parencite{KuklewiczHaskell} commented that most regex libraries are not
+correctly implementing the POSIX (maximum-munch)
+rule of regular expression matching.
+
+As Grathwohl \parencite{grathwohl2014crash} commented,
+\begin{center}
+	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
+\end{center}
+
+
+To summarise the above: regular expressions are important,
+they are popular, and programming languages' library functions
+for them are very fast on non-catastrophic cases.
+But there are two problems with current practical implementations.
+The first is that the running time might blow up.
+The second is that they can be error-prone on certain
+cases that are easily spotted by a human.
+In the next part of the chapter, we will look into the reasons why 
+certain regex engines run horribly slowly on the "catastrophic"
+cases, and propose a solution that addresses both of these problems,
+based on the work of Brzozowski, and of Sulzmann and Lu.
+
+
+ \section{Why are current regex engines slow?}
 
 %find literature/find out for yourself that REGEX->DFA on basic regexes
 %does not blow up the size
@@ -266,15 +343,30 @@
 How can one explain the super-linear behaviour of the 
 regex matching engines we have?
 The time cost of regex matching algorithms in general
-involve two phases: 
-the construction phase, in which the algorithm builds some  
-suitable data structure from the input regex $r$, we denote
-the time cost by $P_1(r)$.
-The lexing
-phase, when the input string $s$ is read and the data structure
-representing that regex $r$ is being operated on. We represent the time
+involves two different phases, and different things can go wrong in 
+each of these phases:
+$\DFA$s usually have problems in the first (construction) phase,
+whereas $\NFA$s usually run into trouble
+in the second phase.
+
+\subsection{Different Phases of a Matching/Lexing Algorithm}
+
+
+Most lexing algorithms can be roughly divided into 
+two phases during their run.
+The first phase is the "construction" phase,
+in which the algorithm builds some  
+suitable data structure from the input regex $r$, so that
+it can be easily operated on later.
+We denote
+the time cost for such a phase by $P_1(r)$.
+The second phase is the lexing phase, when the input string 
+$s$ is read and the data structure
+representing that regex $r$ is being operated on. 
+We represent the time
 it takes by $P_2(r, s)$.\\
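+
+The overall cost of one match is then the sum of the two:
+\begin{center}
+$P_1(r) + P_2(r, s)$
+\end{center}
+and, as the next two subsections show, $\DFA$s and $\NFA$s trade one
+summand against the other.
+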
-In the case of a $\mathit{DFA}$,
+
+For $\mathit{DFA}$s,
 we have $P_2(r, s) = O( |s| )$,
 because we take at most $|s|$ steps, 
 and each step takes
@@ -283,12 +375,14 @@
 by definition has at most one state active and at most one
 transition upon receiving an input symbol.
 But unfortunately in the  worst case
-$P_1(r) = O(exp^{|r|})$. An example will be given later. \\
+$P_1(r) = O(2^{|r|})$. An example will be given later. 
+
+
 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
 expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
 The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
 On the other hand, if backtracking is used, the worst-case time bound bloats
-to $|r| * 2^|s|$ .
+to $|r| \cdot 2^{|s|}$.
 %on the input
 %And when calculating the time complexity of the matching algorithm,
 %we are assuming that each input reading step requires constant time.
@@ -303,22 +397,16 @@
 %And de-sugaring these "extended" regular expressions 
 %into basic ones might bloat the size exponentially.
 %TODO: more reference for exponential size blowup on desugaring. 
-\subsection{Tools that uses $\mathit{DFA}$s}
-%TODO:more tools that use DFAs?
-$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
-in $C$ and $\mathit{JAVA}$ that generates $\mathit{DFA}$-based
-lexers. The user provides a set of regular expressions
-and configurations to such lexer generators, and then 
-gets an output program encoding a minimized $\mathit{DFA}$
-that can be compiled and run. 
+
+\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
+
+
 The good things about $\mathit{DFA}$s is that once
 generated, they are fast and stable, unlike
 backtracking algorithms. 
-However, they do not scale well with bounded repetitions.\\
-
+However, they do not scale well with bounded repetitions.
 
-
-
+\subsubsection{Problems with Bounded Repetitions}
 Bounded repetitions, usually written in the form
 $r^{\{c\}}$ (where $c$ is a constant natural number),
 denotes a regular expression accepting strings
@@ -366,21 +454,45 @@
 more than 1 string.
 This is to represent all the different 
 scenarios in which "countdown" states are active.
-For those regexes, tools such as $\mathit{JFLEX}$ 
-would generate gigantic $\mathit{DFA}$'s or
+For those regexes, tools that use $\DFA$s will get
 out of memory errors.
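+
+A classic way to see this blowup (our illustration, not the countdown
+regular expression itself): for $(a+b)^* \cdot a \cdot (a+b)^{\{n\}}$
+any $\DFA$ has to remember which of the last $n+1$ characters read
+were $a$'s, and therefore needs at least $2^{n+1}$ states, even though
+the regular expression itself is tiny.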
+
+\subsubsection{Tools that use $\mathit{DFA}$s}
+%TODO:more tools that use DFAs?
+$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
+in $C$ and $\mathit{JAVA}$ that generate $\mathit{DFA}$-based
+lexers. The user provides a set of regular expressions
+and configurations to such lexer generators, and then 
+gets an output program encoding a minimized $\mathit{DFA}$
+that can be compiled and run. 
+When given the above countdown regular expression,
+even a small number $n$ would result in a determinised automaton
+with millions of states.
+
 For this reason, regex libraries that support 
 bounded repetitions often choose to use the $\mathit{NFA}$ 
 approach.
-\subsection{The $\mathit{NFA}$ approach to regex matching}
-One can simulate the $\mathit{NFA}$ running in two ways:
+
+
+
+
+
+
+
+
+\subsection{Why $\mathit{NFA}$s can be slow in the second phase}
+When one constructs an $\NFA$ out of a regular expression
+there is often very little to be done in the first phase: one simply 
+constructs the $\NFA$ states based on the structure of the input regular expression.
+
+In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
 one by keeping track of all active states after consuming 
 a character, and update that set of states iteratively.
 This can be viewed as a breadth-first-search of the $\mathit{NFA}$
 for a path terminating
 at an accepting state.
 Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
-type of $\mathit{NFA}$ simulation, and guarantees a linear runtime
+type of $\mathit{NFA}$ simulation and guarantee a linear runtime
 in terms of input string length.
 %TODO:try out these lexers
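+
+A small sketch of this breadth-first simulation in Python (the
+representation of the transition relation is our own choice for
+illustration; real engines use more compact data structures):
+\begin{verbatim}
+# transitions: dict mapping (state, char) to a set of states
+def nfa_match(transitions, start, accepting, s):
+    states = {start}             # all currently active states
+    for c in s:                  # one set update per character
+        states = {t for q in states
+                    for t in transitions.get((q, c), set())}
+    return bool(states & accepting)
+\end{verbatim}
+Since at any point at most as many states are active as the automaton
+has, each update is bounded by the size of the $\NFA$, giving the
+$|r|\cdot|s|$ bound mentioned earlier (for automata without
+$\epsilon$-transitions).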
 The other way to use $\mathit{NFA}$ for matching is choosing  
@@ -392,7 +504,7 @@
 %TODO:COMPARE java python lexer speed with Rust and Go
 The reason behind backtracking algorithms in languages like
 Java and Python is that they support back-references.
-\subsection{Back References in Regex--Non-Regular part}
+\subsubsection{Back References}
 If we have a regular expression like this (the sequence
 operator is omitted for brevity):
 \begin{center}
@@ -447,6 +559,7 @@
 is NP-complete \parencite{alfred2014algorithms} and a non-backtracking,
 efficient solution is not known to exist.
 %TODO:read a bit more about back reference algorithms
+
 It seems that languages like Java and Python made the trade-off
 to support back-references at the expense of having to backtrack,
 even in the case of regexes not involving back-references.\\
@@ -454,7 +567,7 @@
 practical regex libraries into the ones  with  linear
 time guarantees like Go and Rust, which impose restrictions
 on the user input (not allowing back-references, 
-bounded repetitions canno exceed 1000 etc.), and ones  
+bounded repetitions cannot exceed 1000 etc.), and ones  
 that allow the programmer much freedom, but grind to a halt
  in some non-negligible portion of cases.
  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
@@ -466,93 +579,41 @@
 %when the language is still regular).
  %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
  %TODO: verify the fact Rust does not allow 1000+ reps
- %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
-\section{Buggy Regex Engines} 
+\ChristianComment{Comment required: Java 17 updated graphs? Is it ok to still use Java 8 graphs?}
 
 
- Another thing about these libraries is that there
- is no correctness guarantee.
- In some cases, they either fail to generate a lexing result when there exists a match,
- or give the wrong way of matching.
+So we have practical implementations 
+of regular expression matching/lexing which are fast
+but do not come with any guarantees that they will not grind to a halt
+or give wrong answers.
+Our goal is to have a regex lexing algorithm that comes with 
+\begin{itemize}
+\item
+proven correctness 
+\item 
+proven non-catastrophic properties
+\item
+easy extensions to
+constructs like 
+ bounded repetitions, negation,  lookarounds, and even back-references.
+ \end{itemize}
  
-
-It turns out that regex libraries not only suffer from 
-exponential backtracking problems, 
-but also undesired (or even buggy) outputs.
-%TODO: comment from who
-Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
-correctly implementing the POSIX (maximum-munch)
-rule of regular expression matching.
-This experience is echoed by the writer's
-tryout of a few online regex testers:
-A concrete example would be 
-the regex
-\begin{verbatim}
-(((((a*a*)b*)b){20})*)c
-\end{verbatim}
-and the string
-\begin{verbatim}
-baabaabababaabaaaaaaaaababaa
-aababababaaaabaaabaaaaaabaab
-aabababaababaaaaaaaaababaaaa
-babababaaaaaaaaaaaaac
-\end{verbatim}
-
-This seemingly complex regex simply says "some $a$'s
-followed by some $b$'s then followed by 1 single $b$,
-and this iterates 20 times, finally followed by a $c$.
-And a POSIX match would involve the entire string,"eating up"
-all the $b$'s in it.
-%TODO: give a coloured example of how this matches POSIXly
-
-This regex would trigger catastrophic backtracking in 
-languages like Python and Java,
-whereas it gives a non-POSIX  and uninformative 
-match in languages like Go or .NET--The match with only 
-character $c$.
+\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
+We propose Brzozowski derivatives on regular expressions as
+  a solution to this.
+In the last fifteen or so years, Brzozowski's derivatives of regular
+expressions have sparked quite a bit of interest in the functional
+programming and theorem prover communities.   
 
-As Grathwohl\parencite{grathwohl2014crash} commented,
-\begin{center}
-	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
-\end{center}
-
-%\section{How people solve problems with regexes}
-
+\subsection{Motivation}
+  
+Derivatives give a simple solution
+to the problem of matching a string $s$ with a regular
+expression $r$: if the derivative of $r$ w.r.t.\ (in
+succession) all the characters of the string matches the empty string,
+then $r$ matches $s$ (and {\em vice versa}).  
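+
+As a concrete illustration, the following is a minimal sketch of such
+a matcher in Python (the constructor names are ours; no simplification
+is performed, so derivatives can grow quickly):
+\begin{verbatim}
+from dataclasses import dataclass
+
+class Rexp: pass
+
+@dataclass
+class Zero(Rexp): pass              # matches no string
+@dataclass
+class One(Rexp): pass               # matches only ""
+@dataclass
+class Chr(Rexp): c: str             # matches the character c
+@dataclass
+class Alt(Rexp): r1: Rexp; r2: Rexp
+@dataclass
+class Seq(Rexp): r1: Rexp; r2: Rexp
+@dataclass
+class Star(Rexp): r: Rexp
+
+def nullable(r):                    # can r match the empty string?
+    match r:
+        case Zero() | Chr(_): return False
+        case One() | Star(_): return True
+        case Alt(r1, r2): return nullable(r1) or nullable(r2)
+        case Seq(r1, r2): return nullable(r1) and nullable(r2)
+
+def der(c, r):                      # derivative of r w.r.t. c
+    match r:
+        case Zero() | One(): return Zero()
+        case Chr(d): return One() if d == c else Zero()
+        case Alt(r1, r2): return Alt(der(c, r1), der(c, r2))
+        case Seq(r1, r2):
+            if nullable(r1):
+                return Alt(Seq(der(c, r1), r2), der(c, r2))
+            return Seq(der(c, r1), r2)
+        case Star(r1): return Seq(der(c, r1), Star(r1))
+
+def matches(r, s):                  # derive by each character, then test
+    for c in s:
+        r = der(c, r)
+    return nullable(r)
+
+print(matches(Star(Alt(Chr('a'), Chr('b'))), "abba"))   # True
+\end{verbatim}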
 
-When a regular expression does not behave as intended,
-people usually try to rewrite the regex to some equivalent form
-or they try to avoid the possibly problematic patterns completely,
-for which many false positives exist\parencite{Davis18}.
-Animated tools to "debug" regular expressions
-are also popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101} 
-to name a few.
-We are also aware of static analysis work on regular expressions that
-aims to detect potentially expoential regex patterns. Rathnayake and Thielecke 
-\parencite{Rathnayake2014StaticAF} proposed an algorithm
-that detects regular expressions triggering exponential
-behavious on backtracking matchers.
-Weideman \parencite{Weideman2017Static} came up with 
-non-linear polynomial worst-time estimates
-for regexes, attack string that exploit the worst-time 
-scenario, and "attack automata" that generates
-attack strings.
-%Arguably these methods limits the programmers' freedom
-%or productivity when all they want is to come up with a regex
-%that solves the text processing problem.
-
-%TODO:also the regex101 debugger
-\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
- Is it possible to have a regex lexing algorithm with proven correctness and 
- time complexity, which allows easy extensions to
-  constructs like 
- bounded repetitions, negation,  lookarounds, and even back-references? 
-  
-  We propose Brzozowski derivatives on regular expressions as
-  a solution to this.
-  
-  In the last fifteen or so years, Brzozowski's derivatives of regular
-expressions have sparked quite a bit of interest in the functional
-programming and theorem prover communities.  The beauty of
+The beauty of
 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
 expressible in any functional language, and easily definable and
 reasoned about in theorem provers---the definitions just consist of
@@ -562,19 +623,9 @@
 to include  extended regular expressions and 
  simplification of internal data structures 
  eliminating the exponential behaviours.
- 
-
-  \section{Motivation}
-  
-Derivatives give a simple solution
-to the problem of matching a string $s$ with a regular
-expression $r$: if the derivative of $r$ w.r.t.\ (in
-succession) all the characters of the string matches the empty string,
-then $r$ matches $s$ (and {\em vice versa}).  
-
-
 
-However, two difficulties with derivative-based matchers exist:
+\subsubsection{Problems with Current Brzozowski Matchers}
+However, two difficulties with derivative-based matchers exist:
 First, Brzozowski's original matcher only generates a yes/no answer
 for whether a regular expression matches a string or not.  This is too
 little information in the context of lexing where separate tokens must
@@ -585,7 +636,7 @@
 \emph{how} a regular expression matches a string following the POSIX
 rules for regular expression matching. They achieve this by adding a
 second ``phase'' to Brzozowski's algorithm involving an injection
-function.  In our own earlier work we provided the formal
+function.  In our own earlier work, we provided the formal
 specification of what POSIX matching means and proved in Isabelle/HOL
 the correctness
 of Sulzmann and Lu's extended algorithm accordingly
@@ -624,18 +675,18 @@
 
 
 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
-\cite{Sulzmann2014} where they introduce bitcoded
+\cite{Sulzmann2014} where they introduce bit-coded
 regular expressions. In this version, POSIX values are
-represented as bitsequences and such sequences are incrementally generated
+represented as bit sequences and such sequences are incrementally generated
 when derivatives are calculated. The compact representation
-of bitsequences and regular expressions allows them to define a more
+of bit sequences and regular expressions allows them to define a more
 ``aggressive'' simplification method that keeps the size of the
 derivatives finite no matter what the length of the string is.
 They make some informal claims about the correctness and linear behaviour
 of this version, but do not provide any supporting proof arguments, not
-even ``pencil-and-paper'' arguments. They write about their bitcoded
+even ``pencil-and-paper'' arguments. They write about their bit-coded
 \emph{incremental parsing method} (that is the algorithm to be formalised
-in this paper):
+in this dissertation):
 
 
   
@@ -660,12 +711,7 @@
 and gives a formal proof of the correctness with those simplifications.
 
 
- 
-
-
-
 %----------------------------------------------------------------------------------------
-
 \section{Contribution}
 
 
@@ -677,14 +723,22 @@
 improved version of  Sulzmann and Lu's bit-coded algorithm using 
 derivatives, which come with a formal guarantee in terms of correctness and 
 running time as an Isabelle/HOL proof.
-Then we improve the algorithm with an even stronger version of 
-simplification, and prove a time bound linear to input and
+Further improvements to the algorithm with an even stronger version of 
+simplification are under way.
+We have not yet completed this work, but believe that it will lead to a 
+formalised proof with a time bound linear to input and
 cubic to regular expression size using a technique by
-Antimirov.
+Antimirov \cite{Antimirov}.
 
  
-The main contribution of this thesis is a proven correct lexing algorithm
-with formalized time bounds.
+The main contributions of this thesis are 
+\begin{itemize}
+\item
+a proven correct lexing algorithm, and
+\item
+formalised finite bounds on the sizes of its internal data structures.
+\end{itemize}
+
 To our best knowledge, no lexing libraries using Brzozowski derivatives
 have a provable time guarantee, 
 and claims about running time are usually speculative and backed by thin empirical
@@ -704,8 +758,6 @@
 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
 the time it took to lex only 40 $a$'s was 5 minutes.
 
-We  believe our results of a proof of performance on general
-inputs rather than specific examples a novel contribution.\\
 
 
 \subsection{Related Work}
@@ -716,26 +768,43 @@
 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
  
- %We propose Brzozowski's derivatives as a solution to this problem.
-% about Lexing Using Brzozowski derivatives
+ 
+ When a regular expression does not behave as intended,
+people usually try to rewrite the regex to some equivalent form,
+or they try to avoid the possibly problematic patterns completely,
+for which many false positives exist \parencite{Davis18}.
+Animated tools to "debug" regular expressions such as
+regexploit \parencite{regexploit2021} and regex101 \parencite{regex101} are also popular.
+We are also aware of static analysis work on regular expressions that
+aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
+\parencite{Rathnayake2014StaticAF} proposed an algorithm
+that detects regular expressions triggering exponential
+behaviour on backtracking matchers.
+Weideman \parencite{Weideman2017Static} came up with 
+non-linear polynomial worst-time estimates
+for regexes, attack strings that exploit the worst-case 
+scenario, and "attack automata" that generate
+attack strings.
+
+
 
 
 \section{Structure of the thesis}
-In chapter 2 \ref{Chapter2} we will introduce the concepts
+In Chapter~\ref{Inj} we will introduce the concepts
 and notations we 
 use for describing the lexing algorithm by Sulzmann and Lu,
-and then give the algorithm and its variant, and discuss
-why more aggressive simplifications are needed. 
-Then we illustrate in Chapter 3\ref{Chapter3}
+and then give the lexing algorithm.
+We will give its variant in Chapter~\ref{Bitcoded1}.
+Then we illustrate in Chapter~\ref{Bitcoded2}
 how the algorithm without bitcodes falls short for such aggressive 
 simplifications and therefore introduce our version of the
- bitcoded algorithm and 
+ bit-coded algorithm and 
 its correctness proof.
-In Chapter 4 \ref{Chapter4} we give the second guarantee
+In Chapter~\ref{Finite} we give the second guarantee
 of our bitcoded algorithm, that is a finite bound on the size of any 
 regex's derivatives.
-In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
-in Chapter 4 to a polynomial one, and demonstrate how one can extend the
+In Chapter~\ref{Cubic} we discuss stronger simplifications to improve the finite bound
+in Chapter~\ref{Finite} to a polynomial one, and demonstrate how one can extend the
 algorithm to include constructs such as bounded repetitions and negations.
  
 
--- a/ChengsongTanPhdThesis/main.tex	Mon Jun 06 23:17:45 2022 +0100
+++ b/ChengsongTanPhdThesis/main.tex	Thu Jun 09 12:57:53 2022 +0100
@@ -68,7 +68,7 @@
 \usepackage{tikz-cd}
 \usepackage{tikz}
 \usetikzlibrary{automata, positioning, calc}
-
+\usepackage{mathpartir}
 
 
 \DeclareCaptionType{mytype}[Illustration][]