lexing: changeset 566:94604a5fd271

--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Sat Jul 09 14:11:07 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Wed Jul 13 08:35:45 2022 +0100
@@ -8,96 +8,105 @@
 %simplifications and therefore introduce our version of the bitcoded algorithm and 
 %its correctness proof in 
 %Chapter 3\ref{Chapter3}. 
-In this chapter, we are going to introduce the bit-coded algorithm
-introduced by Sulzmann and Lu to address the problem of 
-under-simplified regular expressions. 
+In this chapter, we are going to describe the bit-coded algorithm
+introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of 
+regular expressions. 
 \section{Bit-coded Algorithm}
 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
 stores information of previous lexing steps
 on a stack, in the form of regular expressions
 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
-\begin{envForCaption}
 \begin{ceqn}
 \begin{equation}%\label{graph:injLexer}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
-v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
+	\begin{tikzcd}[ampersand replacement=\&, execute at end picture={
+			\begin{scope}[on background layer]
+				\node[rectangle, fill={red!30},
+					pattern=north east lines, pattern color=red,
+					fit={(-3,-1) (-3, 1) (1, -1) 
+						(1, 1)}
+				     ] 
+				     {}; ,
+				\node[rectangle, fill={blue!20},
+					pattern=north east lines, pattern color=blue,
+					fit= {(1, -1) (1, 1) (3, -1) (3, 1)}
+					]
+					{};
+				\end{scope}}
+					]
+r_0 \arrow[r, "\backslash c_0"]  \arrow[d] \& r_1 \arrow[r, "\backslash c_1"] \arrow[d] \& r_2 \arrow[r, dashed] \arrow[d] \& r_n \arrow[d, "mkeps" description] \\
+v_0           \& v_1 \arrow[l,"inj_{r_0} c_0"]                \& v_2 \arrow[l, "inj_{r_1} c_1"]              \& v_n \arrow[l, dashed]         \\
 \end{tikzcd}
 \end{equation}
 \end{ceqn}
-\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure}
-\end{envForCaption}
 \noindent
+The red part represents what we already know during the first
+derivative phase,
+and the blue part represents the unknown part of input.
+The red area expands as we move towards $r_n$, 
+indicating an increasing stack size during lexing.
+Despite having some partial lexing information during
+the forward derivative phase, we choose to store them
+temporarily, only to convert the information to lexical
+values at a later stage. In essence we are repeating work we
+have already done.
 This is both inefficient and prone to stack overflow.
 A natural question arises as to whether we can store lexing
 information on the fly, while still using regular expression 
-derivatives?
+derivatives.
 
-In a lexing algorithm's run, split by the current input position,
-we have a sub-string that has been consumed,
-and the sub-string that has yet to come.
-We already know what was before, and this should be reflected in the value 
-and the regular expression at that step as well. But this is not the 
-case for injection-based regular expression derivatives.
-Take the regex $(aa)^* \cdot bc$ matching the string $aabc$
-as an example, if we have just read the two former characters $aa$:
-
+If we remove the details of the individual 
+lexing steps, and use red and blue areas as before
+to indicate consumed (seen) input and constructed
+partial value (before recovering the rest of the stack),
+one could see that the seen part's lexical information
+is stored in the form of a regular expression.
+Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$
+and assume we have just read the two characters $aa$:
 \begin{center}
-\begin{envForCaption}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-        {Consumed: $aa$
-         \nodepart{two} Not Yet Reached: $bc$ };
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\caption{Partially matched String} 
-\end{envForCaption}
-\end{center}
-%\caption{Input String}\label{StringPartial}
-%\end{figure}
-
-\noindent
-We have the value that has already been partially calculated,
-and the part that has yet to come:
-\begin{center}
-\begin{envForCaption}
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-        {$\Seq(\Stars[\Char(a), \Char(a)], ???)$
+	    {Partial lexing info: $\ONE \cdot a \cdot (aa)^* \cdot bc$ etc.
          \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
 \end{tikzpicture}
-\caption{Partially constructed Value} 
-\end{envForCaption}
 \end{center}
-
-In the regex derivative part , (after simplification)
-all we have is just what is about to come:
+\noindent
+In the injection-based lexing algorithm, we ``neglect" the red area
+by putting all the characters we have consumed and
+intermediate regular expressions on the stack when 
+we go from left to right in the derivative phase.
+The red area grows till the string is exhausted.
+During the injection phase, the value in the blue area
+is built up incrementally, while the red area shrinks.
+Before we have recovered all characters and intermediate
+derivative regular expressions from the stack,
+what values these characters and regular expressions correspond 
+to are unknown: 
 \begin{center}
-\begin{envForCaption}
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
-        {$???$
-         \nodepart{two} To Come: $b c$};
+	    {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$
+         \nodepart{two}  $b c$ corresponds to  $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
 %\caption{term 1 \ref{term:1}'s matching configuration}
 \end{tikzpicture}
-\caption{Derivative} 
-\end{envForCaption}
 \end{center}
 \noindent
-The previous part is missing.
-How about keeping the partially constructed value 
-attached to the front of the regular expression?
+However, they should be calculable,
+as characters and regular expression shapes
+after taking derivative w.r.t those characters
+have already been known, therefore in our example,
+we know that the value starts with two $a$s,
+and makes up to an iteration in a Kleene star:
+(We have put the injection-based lexing's partial 
+result in the right part of the split rectangle
+to contrast it with the partial valued produced
+in a forward manner)
 \begin{center}
-\begin{envForCaption}
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-        {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
-         \nodepart{two} To Come: $b c$};
+	    {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
+	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$  $\stackrel{Inj}{\longleftarrow}$};
 %\caption{term 1 \ref{term:1}'s matching configuration}
 \end{tikzpicture}
-\caption{Derivative} 
-\end{envForCaption}
 \end{center}
 \noindent
 If we do this kind of "attachment"
@@ -105,19 +114,23 @@
 constructed value when taking off a 
 character:
 \begin{center}
-\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
+        {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
+         \nodepart{two} Remaining: $b c$};
+\end{tikzpicture}\\
+$\downarrow$\\
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
         {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
-         \nodepart{two} To Come: $c$};
+         \nodepart{two} Remaining: $c$};
 \end{tikzpicture}\\
+$\downarrow$\\
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
         {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
          \nodepart{two} EOF};
 \end{tikzpicture}
-\caption{After $\backslash b$ and $\backslash c$} 
-\end{envForCaption}
 \end{center}
 \noindent
 In the end we could recover the value without a backward phase.
@@ -125,14 +138,11 @@
 we instead use bit-codes to encode them.
 
 Bits and bitcodes (lists of bits) are defined as:
-\begin{envForCaption}
 \begin{center}
 		$b ::=   S \mid  Z \qquad
 bs ::= [] \mid b::bs    
 $
 \end{center}
-\caption{Bit-codes datatype}
-\end{envForCaption}
 
 \noindent
 Using $S$ and $Z$ rather than $1$ and $0$ is to avoid
@@ -141,7 +151,6 @@
 bit-lists) can be used to encode values (or potentially incomplete values) in a
 compact form. This can be straightforwardly seen in the following
 coding function from values to bitcodes: 
-\begin{envForCaption}
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
@@ -154,9 +163,6 @@
                                                  code(\Stars\,vs)$
 \end{tabular}    
 \end{center} 
-\caption{Coding Function for Values}
-\end{envForCaption}
-
 \noindent
 Here $\textit{code}$ encodes a value into a bit-code by converting
 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
@@ -168,9 +174,10 @@
 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
 reason for choosing this compact way of storing information is that the
 relatively small size of bits can be easily manipulated and ``moved
-around'' in a regular expression. 
+around" in a regular expression. 
 
-
+Because of the lossiness, the process of decoding a bitlist requires additionally 
+a regular expression. The function $\decode$ is defined as:
 We define the reverse operation of $\code$, which is $\decode$.
 As expected, $\decode$ not only requires the bit-codes,
 but also a regular expression to guide the decoding and 
@@ -178,7 +185,6 @@
 
 
 %\begin{definition}[Bitdecoding of Values]\mbox{}
-\begin{envForCaption}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
@@ -205,10 +211,10 @@
        \textit{else}\;\textit{None}$                       
 \end{tabular}    
 \end{center} 
-\end{envForCaption}
 %\end{definition}
 
 \noindent
+The function $\decode'$ returns a pair consisting of a partially decoded value and some leftover:
 $\decode'$ does most of the job while $\decode$ throws
 away leftover bit-codes and returns the value only.
 $\decode$ is terminating as $\decode'$ is terminating.
@@ -251,7 +257,7 @@
 
 
 The first thing we define related to bit-coded regular expressions
-is how we move bits, for instance pasting it at the front of an annotated regex.
+is how we move bits, for instance pasting it at the front of an annotated regular expression.
 The operation $\fuse$ is just to attach bit-codes 
 to the front of an annotated regular expression:
 \begin{center}
@@ -364,7 +370,7 @@
 This is done by adding bitcodes to the 
 derivatives, for example when one more star iteratoin is taken (we
 call the operation of derivatives on annotated regular expressions $\bder$
-because it is derivatives on regexes with \emph{b}itcodes),
+because it is derivatives on regular expressiones with \emph{b}itcodes),
 we need to unfold it into a sequence,
 and attach an additional bit $Z$ to the front of $r \backslash c$
 to indicate one more star iteration. 
@@ -652,7 +658,7 @@
 $\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
 \end{lemma}
 \noindent
-$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regexes of $\blexer$.
+$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regular expressiones of $\blexer$.
 For plain regular expressions something similar is required as well.
 
 \subsection{$\flex$}

--- a/ChengsongTanPhdThesis/Chapters/Chapter1.tex	Sat Jul 09 14:11:07 2022 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,750 +0,0 @@
-% Chapter 1
-
-\chapter{Introduction} % Main chapter title
-
-\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1} 
-
-%----------------------------------------------------------------------------------------
-
-% Define some commands to keep the formatting separated from the content 
-\newcommand{\keyword}[1]{\textbf{#1}}
-\newcommand{\tabhead}[1]{\textbf{#1}}
-\newcommand{\code}[1]{\texttt{#1}}
-\newcommand{\file}[1]{\texttt{\bfseries#1}}
-\newcommand{\option}[1]{\texttt{\itshape#1}}
-
-%\newcommand{\sflataux}[1]{\textit{sflat}\_\textit{aux} \, #1}
-\newcommand\sflat[1]{\llparenthesis #1 \rrparenthesis }
-\newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
-\newcommand{\bderssimp}[2]{#1 \backslash_{bsimp} #2}
-\newcommand{\rderssimp}[2]{#1 \backslash_{rsimp} #2}
-\newcommand{\bders}[2]{#1 \backslash #2}
-\newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
-\newcommand{\rsimp}[1]{\textit{rsimp}(#1)}
-\newcommand{\sflataux}[1]{\llparenthesis #1 \rrparenthesis'}
-\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
-\newcommand{\denote}{\stackrel{\mbox{\scriptsize denote}}{=}}%
-\newcommand{\ZERO}{\mbox{\bf 0}}
-\newcommand{\ONE}{\mbox{\bf 1}}
-\newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2}
-\newcommand{\rdistinct}[2]{\textit{distinct} \; \textit{#1} \; #2}
-\newcommand\hflat[1]{\llparenthesis  #1 \rrparenthesis_*}
-\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
-\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}
-
-\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
-
-\def\decode{\textit{decode}}
-\def\internalise{\textit{internalise}}
-\def\lexer{\mathit{lexer}}
-\def\mkeps{\textit{mkeps}}
-\newcommand{\rder}[2]{#2 \backslash #1}
-
-\def\AZERO{\textit{AZERO}}
-\def\AONE{\textit{AONE}}
-\def\ACHAR{\textit{ACHAR}}
-
-\def\POSIX{\textit{POSIX}}
-\def\ALTS{\textit{ALTS}}
-\def\ASTAR{\textit{ASTAR}}
-\def\DFA{\textit{DFA}}
-\def\bmkeps{\textit{bmkeps}}
-\def\retrieve{\textit{retrieve}}
-\def\blexer{\textit{blexer}}
-\def\flex{\textit{flex}}
-\def\inj{\mathit{inj}}
-\def\Empty{\mathit{Empty}}
-\def\Left{\mathit{Left}}
-\def\Right{\mathit{Right}}
-\def\Stars{\mathit{Stars}}
-\def\Char{\mathit{Char}}
-\def\Seq{\mathit{Seq}}
-\def\Der{\textit{Der}}
-\def\Ders{\textit{Ders}}
-\def\nullable{\mathit{nullable}}
-\def\Z{\mathit{Z}}
-\def\S{\mathit{S}}
-\def\rup{r^\uparrow}
-%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
-\def\distinctWith{\textit{distinctWith}}
-\def\lf{\textit{lf}}
-\def\PD{\textit{PD}}
-\def\suffix{\textit{Suffix}}
-
-
-\def\size{\mathit{size}}
-\def\rexp{\mathbf{rexp}}
-\def\simp{\mathit{simp}}
-\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
-\def\map{\mathit{map}}
-\def\distinct{\mathit{distinct}}
-\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
-\def\map{\textit{map}}
-%\def\vsuf{\textit{vsuf}}
-%\def\sflataux{\textit{sflat}\_\textit{aux}}
-\def\rrexp{\textit{rrexp}}
-\newcommand\rnullable[1]{\textit{rnullable}(#1)}
-\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
-\newcommand\asize[1]{\llbracket #1 \rrbracket}
-\newcommand\rerase[1]{ (#1)\downarrow_r}
-
-
-\def\erase{\textit{erase}}
-\def\STAR{\textit{STAR}}
-\def\flts{\textit{flts}}
-
-
-\def\RZERO{\mathbf{0}_r }
-\def\RONE{\mathbf{1}_r}
-\newcommand\RCHAR[1]{\mathbf{#1}_r}
-\newcommand\RSEQ[2]{#1 \cdot #2}
-\newcommand\RALTS[1]{\oplus #1}
-\newcommand\RSTAR[1]{#1^*}
-\newcommand\vsuf[2]{\textit{vsuf} \;#1\;#2}
-
-%----------------------------------------------------------------------------------------
-%This part is about regular expressions, Brzozowski derivatives,
-%and a bit-coded lexing algorithm with proven correctness and time bounds.
-
-%TODO: look up snort rules to use here--give readers idea of what regexes look like
-
-
-Regular expressions are widely used in computer science: 
-be it in text-editors\parencite{atomEditor} with syntax highlighting and auto-completion, 
-command-line tools like $\mathit{grep}$ that facilitate easy 
-text processing, network intrusion
-detection systems that reject suspicious traffic, or compiler
-front ends--the majority of the solutions to these tasks 
-involve lexing with regular 
-expressions.
-Given its usefulness and ubiquity, one would imagine that
-modern regular expression matching implementations
-are mature and fully studied.
-Indeed, in a popular programming language' regex engine, 
-supplying it with regular expressions and strings, one can
-get rich matching information in a very short time.
-Some network intrusion detection systems
-use regex engines that are able to process 
-megabytes or even gigabytes of data per second\parencite{Turo_ov__2020}.
-Unfortunately, this is not the case for $\mathbf{all}$ inputs.
-%TODO: get source for SNORT/BRO's regex matching engine/speed
-
-
-Take $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
-expression. Obviously this is not the case---the expected $b$ in the last
-position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8, even with strings of a small
-length, say around 30 $a$'s, one discovers that 
-this decision takes crazy time to finish given the simplicity of the problem.
-This is clearly exponential behaviour, and 
-is triggered by some relatively simple regex patterns, as the graphs
-below show:
-
-\begin{figure}
-\centering
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={JavaScript},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
-\end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={Python},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={Java 8},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
-           of the form $\underbrace{aa..a}_{n}$.}
-\end{tabular}    
-\caption{aStarStarb} \label{fig:aStarStarb}
-\end{figure}
-
-
-
-
-This superlinear blowup in matching algorithms sometimes cause
-considerable grief in real life: for example on 20 July 2016 one evil
-regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its
-knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
-In this instance, a regular expression intended to just trim white
-spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU-resources---causing web servers to grind to a
-halt. This happened when a post with 20,000 white spaces was submitted,
-but importantly the white spaces were neither at the beginning nor at
-the end. As a result, the regular expression matching engine needed to
-backtrack over many choices. In this example, the time needed to process
-the string was $O(n^2)$ with respect to the string length. This
-quadratic overhead was enough for the homepage of Stack Exchange to
-respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
-attack and therefore stopped the servers from responding to any
-requests. This made the whole site become unavailable. 
-A more recent example is a global outage of all Cloudflare servers on 2 July
-2019. A poorly written regular expression exhibited exponential
-behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
-had several causes, at the heart was a regular expression that
-was used to monitor network
-traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
-%TODO: data points for some new versions of languages
-These problems with regular expressions 
-are not isolated events that happen
-very occasionally, but actually widespread.
-They occur so often that they get a 
-name--Regular-Expression-Denial-Of-Service (ReDoS)
-attack.
-Davis et al. \parencite{Davis18} detected more
-than 1000 super-linear (SL) regular expressions
-in Node.js, Python core libraries, and npm and pypi. 
-They therefore concluded that evil regular expressions
-are problems more than "a parlour trick", but one that
-requires
-more research attention.
-
- \section{The Problem Behind Slow Cases}
-
-%find literature/find out for yourself that REGEX->DFA on basic regexes
-%does not blow up the size
-Shouldn't regular expression matching be linear?
-How can one explain the super-linear behaviour of the 
-regex matching engines we have?
-The time cost of regex matching algorithms in general
-involve two phases: 
-the construction phase, in which the algorithm builds some  
-suitable data structure from the input regex $r$, we denote
-the time cost by $P_1(r)$.
-The lexing
-phase, when the input string $s$ is read and the data structure
-representing that regex $r$ is being operated on. We represent the time
-it takes by $P_2(r, s)$.\\
-In the case of a $\mathit{DFA}$,
-we have $P_2(r, s) = O( |s| )$,
-because we take at most $|s|$ steps, 
-and each step takes
-at most one transition--
-a deterministic-finite-automata
-by definition has at most one state active and at most one
-transition upon receiving an input symbol.
-But unfortunately in the  worst case
-$P_1(r) = O(exp^{|r|})$. An example will be given later. \\
-For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
-expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
-The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
-On the other hand, if backtracking is used, the worst-case time bound bloats
-to $|r| * 2^|s|$ .
-%on the input
-%And when calculating the time complexity of the matching algorithm,
-%we are assuming that each input reading step requires constant time.
-%which translates to that the number of 
-%states active and transitions taken each time is bounded by a
-%constant $C$.
-%But modern  regex libraries in popular language engines
-% often want to support much richer constructs than just
-% sequences and Kleene stars,
-%such as negation, intersection, 
-%bounded repetitions and back-references.
-%And de-sugaring these "extended" regular expressions 
-%into basic ones might bloat the size exponentially.
-%TODO: more reference for exponential size blowup on desugaring. 
-\subsection{Tools that uses $\mathit{DFA}$s}
-%TODO:more tools that use DFAs?
-$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
-in $C$ and $\mathit{JAVA}$ that generates $\mathit{DFA}$-based
-lexers. The user provides a set of regular expressions
-and configurations to such lexer generators, and then 
-gets an output program encoding a minimized $\mathit{DFA}$
-that can be compiled and run. 
-The good things about $\mathit{DFA}$s is that once
-generated, they are fast and stable, unlike
-backtracking algorithms. 
-However, they do not scale well with bounded repetitions.\\
-
-
-
-
-Bounded repetitions, usually written in the form
-$r^{\{c\}}$ (where $c$ is a constant natural number),
-denotes a regular expression accepting strings
-that can be divided into $c$ substrings, where each 
-substring is in $r$. 
-For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
-an $\mathit{NFA}$ describing it would look like:
-\begin{center}
-\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
-   \node[state,initial] (q_0)   {$q_0$}; 
-   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
-   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
-   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
-    \path[->] 
-    (q_0) edge  node {a} (q_1)
-    	  edge [loop below] node {a,b} ()
-    (q_1) edge  node  {a,b} (q_2)
-    (q_2) edge  node  {a,b} (q_3);
-\end{tikzpicture}
-\end{center}
-The red states are "countdown states" which counts down 
-the number of characters needed in addition to the current
-string to make a successful match.
-For example, state $q_1$ indicates a match that has
-gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
-and just consumed the "delimiter" $a$ in the middle, and 
-need to match 2 more iterations of $(a|b)$ to complete.
-State $q_2$ on the other hand, can be viewed as a state
-after $q_1$ has consumed 1 character, and just waits
-for 1 more character to complete.
-$q_3$ is the last state, requiring 0 more character and is accepting.
-Depending on the suffix of the
-input string up to the current read location,
-the states $q_1$ and $q_2$, $q_3$
-may or may
-not be active, independent from each other.
-A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
-contain at least $2^3$ non-equivalent states that cannot be merged, 
-because the subset construction during determinisation will generate
-all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
-Generalizing this to regular expressions with larger
-bounded repetitions number, we have that
-regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
-would require at least $2^{n+1}$ states, if $r$ contains
-more than 1 string.
-This is to represent all different 
-scenarios which "countdown" states are active.
-For those regexes, tools such as $\mathit{JFLEX}$ 
-would generate gigantic $\mathit{DFA}$'s or
-out of memory errors.
-For this reason, regex libraries that support 
-bounded repetitions often choose to use the $\mathit{NFA}$ 
-approach.
-\subsection{The $\mathit{NFA}$ approach to regex matching}
-One can simulate the $\mathit{NFA}$ running in two ways:
-one by keeping track of all active states after consuming 
-a character, and update that set of states iteratively.
-This can be viewed as a breadth-first-search of the $\mathit{NFA}$
-for a path terminating
-at an accepting state.
-Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
-type of $\mathit{NFA}$ simulation, and guarantees a linear runtime
-in terms of input string length.
-%TODO:try out these lexers
-The other way to use $\mathit{NFA}$ for matching is choosing  
-a single transition each time, keeping all the other options in 
-a queue or stack, and backtracking if that choice eventually 
-fails. This method, often called a  "depth-first-search", 
-is efficient in a lot of cases, but could end up
-with exponential run time.\\
-%TODO:COMPARE java python lexer speed with Rust and Go
-The reason behind backtracking algorithms in languages like
-Java and Python is that they support back-references.
-\subsection{Back References in Regex--Non-Regular part}
-If we have a regular expression like this (the sequence
-operator is omitted for brevity):
-\begin{center}
-	$r_1(r_2(r_3r_4))$
-\end{center}
-We could label sub-expressions of interest 
-by parenthesizing them and giving 
-them a number by the order in which their opening parentheses appear.
-One possible way of parenthesizing and labelling is given below:
-\begin{center}
-	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
-\end{center}
-$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$, $r_4$ are labelled
-by 1 to 4. $1$ would refer to the entire expression 
-$(r_1(r_2(r_3)(r_4)))$, $2$ referring to $r_2(r_3)(r_4)$, etc.
-These sub-expressions are called "capturing groups".
-We can use the following syntax to denote that we want a string just matched by a 
-sub-expression (capturing group) to appear at a certain location again, 
-exactly as it was:
-\begin{center}
-$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
-\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
-\end{center}
-The backslash and number $i$ are used to denote such 
-so-called "back-references".
-Let $e$ be an expression made of regular expressions 
-and back-references. $e$ contains the expression $e_i$
-as its $i$-th capturing group.
-The semantics of back-reference can be recursively
-written as:
-\begin{center}
-	\begin{tabular}{c}
-		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
-		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
-	\end{tabular}
-\end{center}
-The concrete example
-$((a|b|c|\ldots|z)^*)\backslash 1$
-would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\
-Back-reference is a construct in the "regex" standard
-that programmers found useful, but not exactly 
-regular any more.
-In fact, that allows the regex construct to express 
-languages that cannot be contained in context-free
-languages either.
-For example, the back-reference $((a^*)b\backslash1 b \backslash 1$
-expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
-which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
-Such a language is contained in the context-sensitive hierarchy
-of formal languages. 
-Solving the back-reference expressions matching problem
-is NP-complete\parencite{alfred2014algorithms} and a non-bactracking,
-efficient solution is not known to exist.
-%TODO:read a bit more about back reference algorithms
-It seems that languages like Java and Python made the trade-off
-to support back-references at the expense of having to backtrack,
-even in the case of regexes not involving back-references.\\
-Summing these up, we can categorise existing 
-practical regex libraries into the ones  with  linear
-time guarantees like Go and Rust, which impose restrictions
-on the user input (not allowing back-references, 
-bounded repetitions canno exceed 1000 etc.), and ones  
- that allows the programmer much freedom, but grinds to a halt
- in some non-negligible portion of cases.
- %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
-% For example, the Rust regex engine claims to be linear, 
-% but does not support lookarounds and back-references.
-% The GoLang regex library does not support over 1000 repetitions.  
-% Java and Python both support back-references, but shows
-%catastrophic backtracking behaviours on inputs without back-references(
-%when the language is still regular).
- %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
- %TODO: verify the fact Rust does not allow 1000+ reps
- %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
-\section{Buggy Regex Engines} 
-
-
- Another thing about these libraries is that there
- is no correctness guarantee.
- In some cases, they either fail to generate a lexing result when there exists a match,
- or give the wrong way of matching.
- 
-
-It turns out that regex libraries not only suffer from 
-exponential backtracking problems, 
-but also undesired (or even buggy) outputs.
-%TODO: comment from who
-Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
-correctly implementing the POSIX (maximum-munch)
-rule of regular expression matching.
-This experience is echoed by the writer's
-tryout of a few online regex testers:
-A concrete example would be 
-the regex
-\begin{verbatim}
-(((((a*a*)b*)b){20})*)c
-\end{verbatim}
-and the string
-\begin{verbatim}
-baabaabababaabaaaaaaaaababaa
-aababababaaaabaaabaaaaaabaab
-aabababaababaaaaaaaaababaaaa
-babababaaaaaaaaaaaaac
-\end{verbatim}
-
-This seemingly complex regex simply says "some $a$'s
-followed by some $b$'s then followed by 1 single $b$,
-and this iterates 20 times, finally followed by a $c$.
-And a POSIX match would involve the entire string,"eating up"
-all the $b$'s in it.
-%TODO: give a coloured example of how this matches POSIXly
-
-This regex would trigger catastrophic backtracking in 
-languages like Python and Java,
-whereas it gives a non-POSIX  and uninformative 
-match in languages like Go or .NET--The match with only 
-character $c$.
-
-As Grathwohl\parencite{grathwohl2014crash} commented,
-\begin{center}
-	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
-\end{center}
-
-%\section{How people solve problems with regexes}
-
-
-When a regular expression does not behave as intended,
-people usually try to rewrite the regex to some equivalent form
-or they try to avoid the possibly problematic patterns completely,
-for which many false positives exist\parencite{Davis18}.
-Animated tools to "debug" regular expressions
-are also popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101} 
-to name a few.
-We are also aware of static analysis work on regular expressions that
-aims to detect potentially expoential regex patterns. Rathnayake and Thielecke 
-\parencite{Rathnayake2014StaticAF} proposed an algorithm
-that detects regular expressions triggering exponential
-behavious on backtracking matchers.
-Weideman \parencite{Weideman2017Static} came up with 
-non-linear polynomial worst-time estimates
-for regexes, attack string that exploit the worst-time 
-scenario, and "attack automata" that generates
-attack strings.
-%Arguably these methods limits the programmers' freedom
-%or productivity when all they want is to come up with a regex
-%that solves the text processing problem.
-
-%TODO:also the regex101 debugger
-\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
- Is it possible to have a regex lexing algorithm with proven correctness and 
- time complexity, which allows easy extensions to
-  constructs like 
- bounded repetitions, negation,  lookarounds, and even back-references? 
-  
-  We propose Brzozowski derivatives on regular expressions as
-  a solution to this.
-  
-  In the last fifteen or so years, Brzozowski's derivatives of regular
-expressions have sparked quite a bit of interest in the functional
-programming and theorem prover communities.  The beauty of
-Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
-expressible in any functional language, and easily definable and
-reasoned about in theorem provers---the definitions just consist of
-inductive datatypes and simple recursive functions. 
-And an algorithms based on it by 
-Suzmann and Lu  \parencite{Sulzmann2014} allows easy extension
-to include  extended regular expressions and 
- simplification of internal data structures 
- eliminating the exponential behaviours.
- 
-
-  \section{Motivation}
-  
-Derivatives give a simple solution
-to the problem of matching a string $s$ with a regular
-expression $r$: if the derivative of $r$ w.r.t.\ (in
-succession) all the characters of the string matches the empty string,
-then $r$ matches $s$ (and {\em vice versa}).  
-
-
-
-However, two difficulties with derivative-based matchers exist:
-First, Brzozowski's original matcher only generates a yes/no answer
-for whether a regular expression matches a string or not.  This is too
-little information in the context of lexing where separate tokens must
-be identified and also classified (for example as keywords
-or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
-difficulty by cleverly extending Brzozowski's matching
-algorithm. Their extended version generates additional information on
-\emph{how} a regular expression matches a string following the POSIX
-rules for regular expression matching. They achieve this by adding a
-second ``phase'' to Brzozowski's algorithm involving an injection
-function.  In our own earlier work we provided the formal
-specification of what POSIX matching means and proved in Isabelle/HOL
-the correctness
-of Sulzmann and Lu's extended algorithm accordingly
-\cite{AusafDyckhoffUrban2016}.
-
-The second difficulty is that Brzozowski's derivatives can 
-grow to arbitrarily big sizes. For example if we start with the
-regular expression $(a+aa)^*$ and take
-successive derivatives according to the character $a$, we end up with
-a sequence of ever-growing derivatives like 
-
-\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
-\begin{center}
-\begin{tabular}{rll}
-$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
-& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
-& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
-& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
-& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
-\end{tabular}
-\end{center}
- 
-\noindent where after around 35 steps we run out of memory on a
-typical computer (we shall define shortly the precise details of our
-regular expressions and the derivative operation).  Clearly, the
-notation involving $\ZERO$s and $\ONE$s already suggests
-simplification rules that can be applied to regular regular
-expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
-\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
-r$. While such simple-minded simplifications have been proved in our
-earlier work to preserve the correctness of Sulzmann and Lu's
-algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
-\emph{not} help with limiting the growth of the derivatives shown
-above: the growth is slowed, but the derivatives can still grow rather
-quickly beyond any finite bound.
-
-
-Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
-\cite{Sulzmann2014} where they introduce bitcoded
-regular expressions. In this version, POSIX values are
-represented as bitsequences and such sequences are incrementally generated
-when derivatives are calculated. The compact representation
-of bitsequences and regular expressions allows them to define a more
-``aggressive'' simplification method that keeps the size of the
-derivatives finite no matter what the length of the string is.
-They make some informal claims about the correctness and linear behaviour
-of this version, but do not provide any supporting proof arguments, not
-even ``pencil-and-paper'' arguments. They write about their bitcoded
-\emph{incremental parsing method} (that is the algorithm to be formalised
-in this paper):
-
-
-  
-  \begin{quote}\it
-  ``Correctness Claim: We further claim that the incremental parsing
-  method [..] in combination with the simplification steps [..]
-  yields POSIX parse trees. We have tested this claim
-  extensively [..] but yet
-  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
-\end{quote}  
-
-Ausaf and Urban were able to back this correctness claim with
-a formal proof.
-
-But as they stated,
-  \begin{quote}\it
-The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.
-\end{quote}  
-
-This thesis implements the aggressive simplifications envisioned
-by Ausaf and Urban,
-and gives a formal proof of the correctness with those simplifications.
-
-
- 
-
-
-
-%----------------------------------------------------------------------------------------
-
-\section{Contribution}
-
-
-
-This work addresses the vulnerability of super-linear and
-buggy regex implementations by the combination
-of Brzozowski's derivatives and interactive theorem proving. 
-We give an 
-improved version of  Sulzmann and Lu's bit-coded algorithm using 
-derivatives, which come with a formal guarantee in terms of correctness and 
-running time as an Isabelle/HOL proof.
-Then we improve the algorithm with an even stronger version of 
-simplification, and prove a time bound linear to input and
-cubic to regular expression size using a technique by
-Antimirov.
-
- 
-The main contribution of this thesis is a proven correct lexing algorithm
-with formalized time bounds.
-To our best knowledge, no lexing libraries using Brzozowski derivatives
-have a provable time guarantee, 
-and claims about running time are usually speculative and backed by thin empirical
-evidence.
-%TODO: give references
-For example, Sulzmann and Lu had proposed an algorithm  in which they
-claim a linear running time.
-But that was falsified by our experiments and the running time 
-is actually $\Omega(2^n)$ in the worst case.
-A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
-%TODO: give references
-lexer, which calculates POSIX matches and is based on derivatives.
-They formalized the correctness of the lexer, but not the complexity.
-In the performance evaluation section, they simply analyzed the run time
-of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
-and concluded that the algorithm is quadratic in terms of input length.
-When we tried out their extracted OCaml code with our example $(a+aa)^*$,
-the time it took to lex only 40 $a$'s was 5 minutes.
-
-We  believe our results of a proof of performance on general
-inputs rather than specific examples a novel contribution.\\
-
-
-\subsection{Related Work}
-We are aware
-of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
-Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
-of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
-in Coq is given by Coquand and Siles \parencite{Coquand2012}.
-Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
- 
- %We propose Brzozowski's derivatives as a solution to this problem.
-% about Lexing Using Brzozowski derivatives
-
-
-\section{Structure of the thesis}
-In chapter 2 \ref{Chapter2} we will introduce the concepts
-and notations we 
-use for describing the lexing algorithm by Sulzmann and Lu,
-and then give the algorithm and its variant, and discuss
-why more aggressive simplifications are needed. 
-Then we illustrate in Chapter 3\ref{Chapter3}
-how the algorithm without bitcodes falls short for such aggressive 
-simplifications and therefore introduce our version of the
- bitcoded algorithm and 
-its correctness proof .  
-In Chapter 4 \ref{Chapter4} we give the second guarantee
-of our bitcoded algorithm, that is a finite bound on the size of any 
-regex's derivatives.
-In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
-in Chapter 4 to a polynomial one, and demonstrate how one can extend the
-algorithm to include constructs such as bounded repetitions and negations.
- 
-
-
-
-
-%----------------------------------------------------------------------------------------
-
-
-%----------------------------------------------------------------------------------------
-
-%----------------------------------------------------------------------------------------
-
-%----------------------------------------------------------------------------------------
-
-

--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Sat Jul 09 14:11:07 2022 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,689 +0,0 @@
-% Chapter Template
-
-\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
-
-\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
-%and notations we 
-%use for describing the lexing algorithm by Sulzmann and Lu,
-%and then give the algorithm and its variant, and discuss
-%why more aggressive simplifications are needed. 
-
-
-\section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
-We have a primitive datatype char, denoting characters.
-\[			char ::=  a
-			 \mid b
-			 \mid c
-			 \mid  \ldots
-			 \mid z       
-\]
-(which one is better?)
-\begin{center}
-\begin{tabular}{lcl}
-$\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
-\end{tabular}
-\end{center}
-They can form strings by lists:
-\begin{center}
-\begin{tabular}{lcl}
-$\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
-& & $(c\; \text{has char type})$
-\end{tabular}
-\end{center}
-And strings can be concatenated to form longer strings:
-\begin{center}
-\begin{tabular}{lcl}
-$[] @ s_2$ & $\dn$ & $s_2$\\
-$(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
-\end{tabular}
-\end{center}
-
-A set of strings can operate with another set of strings:
-\begin{center}
-\begin{tabular}{lcl}
-$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
-\end{tabular}
-\end{center}
-We also call the above "language concatenation".
-The power of a language is defined recursively, using the 
-concatenation operator $@$:
-\begin{center}
-\begin{tabular}{lcl}
-$A^0 $ & $\dn$ & $\{ [] \}$\\
-$A^{n+1}$ & $\dn$ & $A^n @ A$
-\end{tabular}
-\end{center}
-The union of all the natural number powers of a language   
-is denoted by the Kleene star operator:
-\begin{center}
-\begin{tabular}{lcl}
-$\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\
-\end{tabular}
-\end{center}
-
-In Isabelle of course we cannot easily get a counterpart of
-the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star
-as an inductive set: 
-\begin{center}
-\begin{tabular}{lcl}
-$[] \in A^*$  & &\\
-$s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\
-\end{tabular}
-\end{center}
-
-We also define an operation of chopping off a character from
-a language:
-\begin{center}
-\begin{tabular}{lcl}
-$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
-\end{tabular}
-\end{center}
-
-This can be generalised to chopping off a string from all strings within set $A$:
-\begin{center}
-\begin{tabular}{lcl}
-$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
-\end{tabular}
-\end{center}
-
-which is essentially the left quotient $A \backslash L'$ of $A$ against 
-the singleton language $L' = \{w\}$
-in formal language theory.
-For this dissertation the $\textit{Ders}$ notation would suffice, there is
-no need for a more general derivative definition.
-
-With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
-we have a  few properties of how the language derivative can be defined using 
-sub-languages.
-\begin{lemma}
-$\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$
-\end{lemma}
-\noindent
-This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
-and get to $B$.
-
-The language $A^*$'s derivative can be described using the language derivative
-of $A$:
-\begin{lemma}
-$\textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)$\\
-\end{lemma}
-\begin{proof}
-\begin{itemize}
-\item{$\subseteq$}
-The set 
-\[ \{s \mid c :: s \in A^*\} \]
-is enclosed in the set
-\[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \]
-because whenever you have a string starting with a character 
-in the language of a Kleene star $A^*$, then that character together with some sub-string
-immediately after it will form the first iteration, and the rest of the string will 
-be still in $A^*$.
-\item{$\supseteq$}
-Note that
-\[ \Der \; c \; A^* = \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) \]
-and 
-\[ \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \]
-where the $\textit{RHS}$ of the above equatioin can be rewritten
-as \[ (\Der \; c\; A) @ A^* \cup A' \], $A'$ being a possibly empty set.
-\end{itemize}
-\end{proof}
-Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
-for regular languages, we need to first give definitions for regular expressions.
-
-\section{Regular Expressions and Their Language Interpretation}
-Suppose we have an alphabet $\Sigma$, the strings  whose characters
-are from $\Sigma$
-can be expressed as $\Sigma^*$.
-
-We use patterns to define a set of strings concisely. Regular expressions
-are one of such patterns systems:
-The basic regular expressions  are defined inductively
- by the following grammar:
-\[			r ::=   \ZERO \mid  \ONE
-			 \mid  c  
-			 \mid  r_1 \cdot r_2
-			 \mid  r_1 + r_2   
-			 \mid r^*         
-\]
-
-The language or set of strings defined by regular expressions are defined as
-%TODO: FILL in the other defs
-\begin{center}
-\begin{tabular}{lcl}
-$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
-$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \cap L \; (r_2)$\\
-\end{tabular}
-\end{center}
-Which are also called the "language interpretation".
-
-
-
-
-% Derivatives of a
-%regular expression, written $r \backslash c$, give a simple solution
-%to the problem of matching a string $s$ with a regular
-%expression $r$: if the derivative of $r$ w.r.t.\ (in
-%succession) all the characters of the string matches the empty string,
-%then $r$ matches $s$ (and {\em vice versa}).  
-
-
-\section{Brzozowski Derivatives of Regular Expressions}
-Now with semantic derivatives of a language and regular expressions and
-their language interpretations, we are ready to define derivatives on regexes.
-
-The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
-where the operation transforms the regex to a new one containing
-strings without the head character $c$.
-
-The  derivative of regular expression, denoted as
-$r \backslash c$, is a function that takes parameters
-$r$ and $c$, and returns another regular expression $r'$,
-which is computed by the following recursive function:
-
-\begin{center}
-\begin{tabular}{lcl}
-		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
-		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
-		$d \backslash c$     & $\dn$ & 
-		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
-	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
-	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
-	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
-\end{tabular}
-\end{center}
-\noindent
-The function derivative, written $r\backslash c$, 
-defines how a regular expression evolves into
-a new regular expression after all the string it contains
-is chopped off a certain head character $c$.
-The most involved cases are the sequence 
-and star case.
-The sequence case says that if the first regular expression
-contains an empty string then the second component of the sequence
-might be chosen as the target regular expression to be chopped
-off its head character.
-The star regular expression's derivative unwraps the iteration of
-regular expression and attaches the star regular expression
-to the sequence's second element to make sure a copy is retained
-for possible more iterations in later phases of lexing.
-
-
-The $\nullable$ function tests whether the empty string $""$ 
-is in the language of $r$:
-
-
-\begin{center}
-		\begin{tabular}{lcl}
-			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
-			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
-			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
-			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
-			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
-			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
-		\end{tabular}
-\end{center}
-\noindent
-The empty set does not contain any string and
-therefore not the empty string, the empty string 
-regular expression contains the empty string
-by definition, the character regular expression
-is the singleton that contains character only,
-and therefore does not contain the empty string,
-the alternative regular expression (or "or" expression)
-might have one of its children regular expressions
-being nullable and any one of its children being nullable
-would suffice. The sequence regular expression
-would require both children to have the empty string
-to compose an empty string and the Kleene star
-operation naturally introduced the empty string. 
-  
-We have the following property where the derivative on regular 
-expressions coincides with the derivative on a set of strings:
-
-\begin{lemma}
-$\textit{Der} \; c \; L(r) = L (r\backslash c)$
-\end{lemma}
-
-\noindent
-
-
-The main property of the derivative operation
-that enables us to reason about the correctness of
-an algorithm using derivatives is 
-
-\begin{center}
-$c\!::\!s \in L(r)$ holds
-if and only if $s \in L(r\backslash c)$.
-\end{center}
-
-\noindent
-We can generalise the derivative operation shown above for single characters
-to strings as follows:
-
-\begin{center}
-\begin{tabular}{lcl}
-$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
-$r \backslash [\,] $ & $\dn$ & $r$
-\end{tabular}
-\end{center}
-
-\noindent
-and then define Brzozowski's  regular-expression matching algorithm as:
-
-\begin{definition}
-$match\;s\;r \;\dn\; nullable(r\backslash s)$
-\end{definition}
-
-\noindent
-Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
-this algorithm presented graphically is as follows:
-
-\begin{equation}\label{graph:successive_ders}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
-\end{tikzcd}
-\end{equation}
-
-\noindent
-where we start with  a regular expression  $r_0$, build successive
-derivatives until we exhaust the string and then use \textit{nullable}
-to test whether the result can match the empty string. It can  be
-relatively  easily shown that this matcher is correct  (that is given
-an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
-
-Beautiful and simple definition.
-
-If we implement the above algorithm naively, however,
-the algorithm can be excruciatingly slow. 
-
-
-\begin{figure}
-\centering
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=10000,
-    ytick={0,1000,...,10000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={JavaScript},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
-           of the form $\underbrace{aa..a}_{n}$.}
-\end{tabular}    
-\caption{EightThousandNodes} \label{fig:EightThousandNodes}
-\end{figure}
-
-
-(8000 node data to be added here)
-For example, when starting with the regular
-expression $(a + aa)^*$ and building a few successive derivatives (around 10)
-w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}.
-The reason why $(a + aa) ^*$ explodes so drastically is that without
-pruning, the algorithm will keep records of all possible ways of matching:
-\begin{center}
-$(a + aa) ^* \backslash [aa] = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
-\end{center}
-
-\noindent
-Each of the above alternative branches correspond to the match 
-$aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete).
-These different ways of matching will grow exponentially with the string length,
-and without simplifications that throw away some of these very similar matchings,
-it is no surprise that these expressions grow so quickly.
-Operations like
-$\backslash$ and $\nullable$ need to traverse such trees and
-consequently the bigger the size of the derivative the slower the
-algorithm. 
-
-Brzozowski was quick in finding that during this process a lot useless
-$\ONE$s and $\ZERO$s are generated and therefore not optimal.
-He also introduced some "similarity rules", such
-as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
-different but language-equivalent sub-regexes to further decrease the size
-of the intermediate regexes. 
-
-More simplifications are possible, such as deleting duplicates
-and opening up nested alternatives to trigger even more simplifications.
-And suppose we apply simplification after each derivative step, and compose
-these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
-\textit{simp}(a \backslash c)$. Then we can build
-a matcher with simpler regular expressions.
-
-If we want the size of derivatives in the algorithm to
-stay even lower, we would need more aggressive simplifications.
-Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
-delete duplicates whenever possible. For example, the parentheses in
-$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
-\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
-example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
-$a^*+a+\ONE$.  These more aggressive simplification rules are for
- a very tight size bound, possibly as low
-  as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 
-
-Building derivatives and then simplifying them.
-So far, so good. But what if we want to 
-do lexing instead of just getting a YES/NO answer?
-This requires us to go back again to the world 
-without simplification first for a moment.
-Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
-elegant(arguably as beautiful as the original
-derivatives definition) solution for this.
-
-\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
-
-
-They first defined the datatypes for storing the 
-lexing information called a \emph{value} or
-sometimes also \emph{lexical value}.  These values and regular
-expressions correspond to each other as illustrated in the following
-table:
-
-\begin{center}
-	\begin{tabular}{c@{\hspace{20mm}}c}
-		\begin{tabular}{@{}rrl@{}}
-			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
-			$r$ & $::=$  & $\ZERO$\\
-			& $\mid$ & $\ONE$   \\
-			& $\mid$ & $c$          \\
-			& $\mid$ & $r_1 \cdot r_2$\\
-			& $\mid$ & $r_1 + r_2$   \\
-			\\
-			& $\mid$ & $r^*$         \\
-		\end{tabular}
-		&
-		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
-			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
-			$v$ & $::=$  & \\
-			&        & $\Empty$   \\
-			& $\mid$ & $\Char(c)$          \\
-			& $\mid$ & $\Seq\,v_1\, v_2$\\
-			& $\mid$ & $\Left(v)$   \\
-			& $\mid$ & $\Right(v)$  \\
-			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
-		\end{tabular}
-	\end{tabular}
-\end{center}
-
-\noindent
-
-Building on top of Sulzmann and Lu's attempt to formalise the 
-notion of POSIX lexing rules \parencite{Sulzmann2014}, 
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
-POSIX matching as a ternary relation recursively defined in a
-natural deduction style.
-With the formally-specified rules for what a POSIX matching is,
-they proved in Isabelle/HOL that the algorithm gives correct results.
-
-But having a correct result is still not enough, 
-we want at least some degree of $\mathbf{efficiency}$.
-
-
-
-One regular expression can have multiple lexical values. For example
-for the regular expression $(a+b)^*$, it has a infinite list of
-values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
-$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
-$\ldots$, and vice versa.
-Even for the regular expression matching a particular string, there could 
-be more than one value corresponding to it.
-Take the example where $r= (a^*\cdot a^*)^*$ and the string 
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-If we do not allow any empty iterations in its lexical values,
-there will be $n - 1$ "splitting points" on $s$ we can choose to 
-split or not so that each sub-string
-segmented by those chosen splitting points will form different iterations:
-\begin{center}
-\begin{tabular}{lcr}
-$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
-$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
-$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
- & $\textit{etc}.$ &
- \end{tabular}
-\end{center}
-
-And for each iteration, there are still multiple ways to split
-between the two $a^*$s.
-It is not surprising there are exponentially many lexical values
-that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-
-A lexer to keep all the possible values will naturally 
-have an exponential runtime on ambiguous regular expressions.
-Somehow one has to decide which lexical value to keep and
-output in a lexing algorithm.
-In practice, we are usually 
-interested in POSIX values, which by intuition always
-\begin{itemize}
-\item
-match the leftmost regular expression when multiple options of matching
-are available  
-\item 
-always match a subpart as much as possible before proceeding
-to the next token.
-\end{itemize}
-The formal definition of a $\POSIX$ value $v$ for a regular expression
-$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
-in the following set of rules:
-(TODO: write the entire set of inference rules for POSIX )
-\newcommand*{\inference}[3][t]{%
-   \begingroup
-   \def\and{\\}%
-   \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
-   #2 \\
-   \hline
-   #3
-   \end{tabular}%
-   \endgroup
-}
-\begin{center}
-\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
-\end{center}
-
-The reason why we are interested in $\POSIX$ values is that they can
-be practically used in the lexing phase of a compiler front end.
-For instance, when lexing a code snippet 
-$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
-as an identifier rather than a keyword.
-
- For example, the above $r= (a^*\cdot a^*)^*$  and 
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
-$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
-The output of an algorithm we want would be a POSIX matching
-encoded as a value.
-
-
-
-
-The contribution of Sulzmann and Lu is an extension of Brzozowski's
-algorithm by a second phase (the first phase being building successive
-derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
-is generated if the regular expression matches the string.
-How can we construct a value out of regular expressions and character
-sequences only?
-Two functions are involved: $\inj$ and $\mkeps$.
-The function $\mkeps$ constructs a value from the last
-one of all the successive derivatives:
-\begin{ceqn}
-\begin{equation}\label{graph:mkeps}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
-	        & 	              & 	            & v_n       
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
-
-It tells us how can an empty string be matched by a 
-regular expression, in a $\POSIX$ way:
-
-	\begin{center}
-		\begin{tabular}{lcl}
-			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
-			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
-			& \textit{if} $\nullable(r_{1})$\\ 
-			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
-			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
-			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
-			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
-		\end{tabular}
-	\end{center}
-
-
-\noindent 
-We favour the left to match an empty string if there is a choice.
-When there is a star for us to match the empty string,
-we give the $\Stars$ constructor an empty list, meaning
-no iterations are taken.
-
-
-After the $\mkeps$-call, we inject back the characters one by one in order to build
-the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
-After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. The POSIX value is maintained throughout the process.
-For this Sulzmann and Lu defined a function that reverses
-the ``chopping off'' of characters during the derivative phase. The
-corresponding function is called \emph{injection}, written
-$\textit{inj}$; it takes three arguments: the first one is a regular
-expression ${r_{i-1}}$, before the character is chopped off, the second
-is a character ${c_{i-1}}$, the character we want to inject and the
-third argument is the value ${v_i}$, into which one wants to inject the
-character (it corresponds to the regular expression after the character
-has been chopped off). The result of this function is a new value. 
-\begin{ceqn}
-\begin{equation}\label{graph:inj}
-\begin{tikzcd}
-r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
-v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
-
-
-\noindent
-The
-definition of $\textit{inj}$ is as follows: 
-
-\begin{center}
-\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
-  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
-\end{tabular}
-\end{center}
-
-\noindent This definition is by recursion on the ``shape'' of regular
-expressions and values. 
-The clauses do one thing--identifying the ``hole'' on a
-value to inject the character back into.
-For instance, in the last clause for injecting back to a value
-that would turn into a new star value that corresponds to a star,
-we know it must be a sequence value. And we know that the first 
-value of that sequence corresponds to the child regex of the star
-with the first character being chopped off--an iteration of the star
-that had just been unfolded. This value is followed by the already
-matched star iterations we collected before. So we inject the character 
-back to the first value and form a new value with this latest iteration
-being added to the previous list of iterations, all under the $\Stars$
-top level.
-
-Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
-we have a lexer with the following recursive definition:
-\begin{center}
-\begin{tabular}{lcr}
-$\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\
-$\lexer \; r \;c::s$ & $=$ & $\inj \; r \; c (\lexer (r\backslash c) s)$
-\end{tabular}
-\end{center}
- 
-Pictorially, the algorithm is as follows:
-
-\begin{ceqn}
-\begin{equation}\label{graph:2}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
-v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
-
-
-\noindent
-For convenience, we shall employ the following notations: the regular
-expression we start with is $r_0$, and the given string $s$ is composed
-of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
-left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
-to the characters $c_0$, $c_1$  until we exhaust the string and obtain
-the derivative $r_n$. We test whether this derivative is
-$\textit{nullable}$ or not. If not, we know the string does not match
-$r$, and no value needs to be generated. If yes, we start building the
-values incrementally by \emph{injecting} back the characters into the
-earlier values $v_n, \ldots, v_0$. This is the second phase of the
-algorithm from right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds a POSIX lexical value
-for how the empty string has been matched by the (nullable) regular
-expression $r_n$. This function is defined as
-
-
-
-We have mentioned before that derivatives without simplification 
-can get clumsy, and this is true for values as well--they reflect
-the size of the regular expression by definition.
-
-One can introduce simplification on the regex and values but have to
-be careful not to break the correctness, as the injection 
-function heavily relies on the structure of the regexes and values
-being correct and matching each other.
-It can be achieved by recording some extra rectification functions
-during the derivatives step, and applying these rectifications in 
-each run during the injection phase.
-And we can prove that the POSIX value of how
-regular expressions match strings will not be affected---although it is much harder
-to establish. 
-Some initial results in this regard have been
-obtained in \cite{AusafDyckhoffUrban2016}. 
-
-
-
-%Brzozowski, after giving the derivatives and simplification,
-%did not explore lexing with simplification, or he may well be 
-%stuck on an efficient simplification with proof.
-%He went on to examine the use of derivatives together with 
-%automaton, and did not try lexing using products.
-
-We want to get rid of the complex and fragile rectification of values.
-Can we not create those intermediate values $v_1,\ldots v_n$,
-and get the lexing information that should be already there while
-doing derivatives in one pass, without a second injection phase?
-In the meantime, can we make sure that simplifications
-are easily handled without breaking the correctness of the algorithm?
-
-Sulzmann and Lu solved this problem by
-introducing additional information to the 
-regular expressions called \emph{bitcodes}.
-
-
-
-
-
-
-

--- a/ChengsongTanPhdThesis/Chapters/Chapter3.tex	Sat Jul 09 14:11:07 2022 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,63 +0,0 @@
-% Chapter Template
-
-% Main chapter title
-
-\label{Chapter3} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
-%Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
-%simplifications and therefore introduce our version of the bitcoded algorithm and 
-%its correctness proof in 
-%Chapter 3\ref{Chapter3}. 
-
-
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION common identities
-%----------------------------------------------------------------------------------------
-\section{Common Identities In Simplification-Related Functions} 
-Need to prove these before starting on the big correctness proof.
-
-%-----------------------------------
-%	SUBSECTION 
-%-----------------------------------
-\subsection{Idempotency of $\simp$}
-
-\begin{equation}
-	\simp \;r = \simp\; \simp \; r 
-\end{equation}
-This property means we do not have to repeatedly
-apply simplification in each step, which justifies
-our definition of $\blexersimp$.
-It will also be useful in future proofs where properties such as 
-closed forms are needed.
-The proof is by structural induction on $r$.
-
-%-----------------------------------
-%	SUBSECTION 
-%-----------------------------------
-\subsection{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
-\begin{center}
-$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
-\end{center}
-
-
-
-
-
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION corretness proof
-%----------------------------------------------------------------------------------------
-\section{Correctness of Bit-coded Algorithm with Simplification}
-The non-trivial part of proving the correctness of the algorithm with simplification
-compared with not having simplification is that we can no longer use the argument 
-in \cref{flex_retrieve}.
-The function \retrieve needs the structure of the annotated regular expression to 
-agree with the structure of the value, but simplification will always mess with the 
-structure:
-%TODO: after simp does not agree with each other: (a + 0) --> a v.s. Left(Char(a))
\ No newline at end of file

--- a/ChengsongTanPhdThesis/Chapters/Chapter4.tex	Sat Jul 09 14:11:07 2022 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,791 +0,0 @@
-% Chapter Template
-
-\chapter{Finiteness Bound} % Main chapter title
-
-\label{Chapter4} 
-%  In Chapter 4 \ref{Chapter4} we give the second guarantee
-%of our bitcoded algorithm, that is a finite bound on the size of any 
-%regex's derivatives. 
-
-
-%-----------------------------------
-%	SECTION ?
-%-----------------------------------
-\section{preparatory properties for getting a finiteness bound}
-Before we get to the proof that says the intermediate result of our lexer will
-remain finitely bounded, which is an important efficiency/liveness guarantee,
-we shall first develop a few preparatory properties and definitions to 
-make the process of proving that a breeze.
-
-We define rewriting relations for $\rrexp$s, which allows us to do the 
-same trick as we did for the correctness proof,
-but this time we will have stronger equalities established.
-\subsection{"hrewrite" relation}
-List of 1-step rewrite rules for regular expressions simplification without bitcodes:
-\begin{center}
-\begin{tabular}{lcl}
-$r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$
-\end{tabular}
-\end{center}
-
-And we define an "grewrite" relation that works on lists:
-\begin{center}
-\begin{tabular}{lcl}
-$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
-\end{tabular}
-\end{center}
-
-
-
-With these relations we prove
-\begin{lemma}
-$rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
-\end{lemma}
-which enables us to prove "closed forms" equalities such as
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
-
-But the most involved part of the above lemma is proving the following:
-\begin{lemma}
-$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ 
-\end{lemma}
-which is needed in proving things like 
-\begin{lemma}
-$r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
-\end{lemma}
-
-Fortunately this is provable by induction on the list $rs$
-
-
-
-%-----------------------------------
-%	SECTION 2
-%-----------------------------------
-
-\section{Finiteness Property}
-In this section let us describe our argument for why the size of the simplified
-derivatives with the aggressive simplification function is finitely bounded.
- Suppose
-we have a size function for bitcoded regular expressions, written
-$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
-
-\begin{center}
-\begin{tabular}{ccc}
-$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
-\end{tabular}
-\end{center}
-(TODO: COMPLETE this defn and for $rs$)
-
-The size is based on a recursive function on the structure of the regex,
-not the bitcodes.
-Therefore we may as well talk about size of an annotated regular expression 
-in their un-annotated form:
-\begin{center}
-$\asize(a) = \size(\erase(a))$. 
-\end{center}
-(TODO: turn equals to roughly equals)
-
-But there is a minor nuisance here:
-the erase function actually messes with the structure of the regular expression:
-\begin{center}
-\begin{tabular}{ccc}
-$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
-\end{tabular}
-\end{center}
-(TODO: complete this definition with singleton r in alts)
-
-An alternative regular expression with an empty list of children
- is turned into an $\ZERO$ during the
-$\erase$ function, thereby changing the size and structure of the regex.
-These will likely be fixable if we really want to use plain $\rexp$s for dealing
-with size, but we choose a more straightforward (or stupid) method by 
-defining a new datatype that is similar to plain $\rexp$s but can take
-non-binary arguments for its alternative constructor,
- which we call $\rrexp$ to denote
-the difference between it and plain regular expressions. 
-\[			\rrexp ::=   \RZERO \mid  \RONE
-			 \mid  \RCHAR{c}  
-			 \mid  \RSEQ{r_1}{r_2}
-			 \mid  \RALTS{rs}
-			 \mid \RSTAR{r}        
-\]
-For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
-but keep everything else intact.
-It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
-$\ALTS$).
-We denote the operation of erasing the bits and turning an annotated regular expression 
-into an $\rrexp{}$ as $\rerase{}$.
-\begin{center}
-\begin{tabular}{lcl}
-$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
-$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
-\end{tabular}
-\end{center}
-%TODO: FINISH definition of rerase
-Similarly we could define the derivative  and simplification on 
-$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
-except that now they can operate on alternatives taking multiple arguments.
-
-\begin{center}
-\begin{tabular}{lcr}
-$\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
-(other clauses omitted)
-\end{tabular}
-\end{center}
-
-Now that $\rrexp$s do not have bitcodes on them, we can do the 
-duplicate removal without  an equivalence relation:
-\begin{center}
-\begin{tabular}{lcl}
-$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
-           			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
-\end{tabular}
-\end{center}
-%TODO: definition of rsimp (maybe only the alternative clause)
-
-
-The reason why these definitions  mirror precisely the corresponding operations
-on annotated regualar expressions is that we can calculate sizes more easily:
-
-\begin{lemma}
-$\rsize{\rerase a} = \asize a$
-\end{lemma}
-This lemma says that the size of an r-erased regex is the same as the annotated regex.
-this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
-\begin{lemma}
-$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
-\end{lemma}
-Putting these two together we get a property that allows us to estimate the 
-size of an annotated regular expression derivative using its un-annotated counterpart:
-\begin{lemma}
-$\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
-\end{lemma}
-Unless stated otherwise in this chapter all $\textit{rexp}$s without
- bitcodes are seen as $\rrexp$s.
- We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
- as the former suits people's intuitive way of stating a binary alternative
- regular expression.
-
-
-\begin{theorem}
-For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
-\end{theorem}
-
-\noindent
-\begin{proof}
-We prove this by induction on $r$. The base cases for $\AZERO$,
-$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
-for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
-hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
-$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
-%
-\begin{center}
-\begin{tabular}{lcll}
-& & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
-& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
-    [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
-& $\leq$ &
-    $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
-    [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
-& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
-             \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
-      \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
-\end{tabular}
-\end{center}
-
-
-\noindent
-where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
-The reason why we could write the derivatives of a sequence this way is described in section 2.
-The term (2) is used to control (1). 
-That is because one can obtain an overall
-smaller regex list
-by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
-Section 3 is dedicated to its proof.
-In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
-bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
-than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
-for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
-We reason similarly for  $\STAR$.\medskip
-\end{proof}
-
-What guarantee does this bound give us?
-
-Whatever the regex is, it will not grow indefinitely.
-Take our previous example $(a + aa)^*$ as an example:
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={number of $a$'s},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={regex size},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax= 40,
-    ytick={0,10,...,40},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={$(a + aa)^*$},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}
-We are able to limit the size of the regex $(a + aa)^*$'s derivatives
- with our simplification
-rules very effectively.
-
-
-As the graphs of  some more randomly generated regexes show, the size of 
-the derivative might grow quickly at the start of the input,
- but after a sufficiently long  input string, the trend will stop.
-
-
-%a few sample regular experessions' derivatives
-%size change
-%TODO: giving graphs showing a few regexes' size changes 
-%w;r;t the input characters number
-%a*, aa*, aaa*, .....
-%randomly generated regexes
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={number of $a$'s},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={regex size},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=1000,
-    ytick={0,100,...,1000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={regex1},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
-\end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=1000,
-    ytick={0,100,...,1000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={regex2},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
-\end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=1000,
-    ytick={0,100,...,1000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={regex3},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
-\end{tabular}    
-\end{center}  
-
-
-
-
-
-\noindent
-Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
-far from the actual bound we can expect. 
-In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
-is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
-Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
-inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
-$f(x) = x * 2^x$.
-This means the bound we have will surge up at least
-tower-exponentially with a linear increase of the depth.
-For a regex of depth $n$, the bound
-would be approximately $4^n$.
-%TODO: change this to tower exponential!
-
-We can do better than this, but this does not improve
-the finiteness property we are proving. If we are interested in a polynomial bound,
-one would hope to obtain a similar tight bound as for partial
-derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
- $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
-partial derivatives). 
-To obtain the exact same bound would mean
-we need to introduce simplifications, such as
- $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
-which exist for partial derivatives. 
-
-However, if we introduce them in our
-setting we would lose the POSIX property of our calculated values. 
-A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
-If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer 
-would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of
-what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information
-in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
-occurs, and apply them in the right order once we get 
-a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.
-This is unlike the simplification we had before, where the rewriting rules 
-such  as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.
-We will discuss better
-bounds in the last section of this chapter.\\[-6.5mm]
-
-
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION ??
-%----------------------------------------------------------------------------------------
-
-\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
-To embark on getting the "closed forms" of regexes, we first
-define a few auxiliary definitions to make expressing them smoothly.
-
- \begin{center}  
- \begin{tabular}{ccc}
- $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
-$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
-$\sflataux r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-The intuition behind $\sflataux{\_}$ is to break up nested regexes 
-of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
-into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
-It will return the singleton list $[r]$ otherwise.
-
-$\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
-the output type a regular expression, not a list:
- \begin{center} 
- \begin{tabular}{ccc}
- $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
-$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
-$\sflat r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-$\sflataux{\_}$  and $\sflat{\_}$ is only recursive in terms of the
- first element of the list of children of
-an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for  dealing with the regular expression
-$r_1 \cdot r_2 \backslash s$.
-
-With $\sflat{\_}$ and $\sflataux{\_}$ we make 
-precise what  "closed forms" we have for the sequence derivatives and their simplifications,
-in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
-and $\bderssimp{(r_1\cdot r_2)}{s}$,
-if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
-and  $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$  ranges over 
-the substring of $s$?
-First let's look at a series of derivatives steps on a sequence 
-regular expression,  (assuming) that each time the first
-component of the sequence is always nullable):
-\begin{center}
-
-$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
-$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
- \ldots$
-
-\end{center}
-%TODO: cite indian paper
-Indianpaper have  come up with a slightly more formal way of putting the above process:
-\begin{center}
-$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
-\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
-+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
-\end{center}
-where  $\delta(b, r)$ will produce $r$
-if $b$ evaluates to true, 
-and $\ZERO$ otherwise.
-
- But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
- equivalence. To make this intuition useful 
- for a formal proof, we need something
-stronger than language equivalence.
-With the help of $\sflat{\_}$ we can state the equation in Indianpaper
-more rigorously:
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
-
-The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
-
-\begin{center}
-\begin{tabular}{lcl}
-$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
-$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
-                                     && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
-\end{tabular}
-\end{center}                   
-It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
-and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
-the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
-In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
-the entire result of  $(r_1 \cdot r_2) \backslash s$, 
-it only stores all the $s''$ such that $r_2 \backslash s''$  are occurring terms in $(r_1\cdot r_2)\backslash s$.
-
-With this we can also add simplifications to both sides and get
-\begin{lemma}
-$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
-\end{lemma}
-Together with the idempotency property of $\rsimp$,
-%TODO: state the idempotency property of rsimp
-\begin{lemma}
-$\rsimp{r} = \rsimp{\rsimp{r}}$
-\end{lemma}
-it is possible to convert the above lemma to obtain a "closed form"
-for  derivatives nested with simplification:
-\begin{lemma}
-$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
-\end{lemma}
-And now the reason we have (1) in section 1 is clear:
-$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, 
-is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
-    as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
-
-%----------------------------------------------------------------------------------------
-%	SECTION 3
-%----------------------------------------------------------------------------------------
-
-\section{interaction between $\distinctWith$ and $\flts$}
-Note that it is not immediately obvious that 
-\begin{center}
-$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
-\end{center}
-The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
-duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
-
-
-%-----------------------------------
-%	SECTION syntactic equivalence under simp
-%-----------------------------------
-\section{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
-\begin{center}
-$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
-\end{center}
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION ALTS CLOSED FORM
-%----------------------------------------------------------------------------------------
-\section{A Closed Form for \textit{ALTS}}
-Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
-
-
-There are a few key steps, one of these steps is
-\[
-rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {})))
-= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))
-\]
-
-
-One might want to prove this by something a simple statement like: 
-$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$.
-
-For this to hold we want the $\textit{distinct}$ function to pick up
-the elements before and after derivatives correctly:
-$r \in rset \equiv (rder x r) \in (rder x rset)$.
-which essentially requires that the function $\backslash$ is an injective mapping.
-
-Unfortunately the function $\backslash c$ is not an injective mapping.
-
-\subsection{function $\backslash c$ is not injective (1-to-1)}
-\begin{center}
-The derivative $w.r.t$ character $c$ is not one-to-one.
-Formally,
-	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
-\end{center}
-This property is trivially true for the
-character regex example:
-\begin{center}
-	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
-\end{center}
-But apart from the cases where the derivative
-output is $\ZERO$, are there non-trivial results
-of derivatives which contain strings?
-The answer is yes.
-For example,
-\begin{center}
-	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
-	where $a$ is not nullable.\\
-	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
-	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
-\end{center}
-We start with two syntactically different regexes,
-and end up with the same derivative result.
-This is not surprising as we have such 
-equality as below in the style of Arden's lemma:\\
-\begin{center}
-	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
-\end{center}
-
-%----------------------------------------------------------------------------------------
-%	SECTION 4
-%----------------------------------------------------------------------------------------
-\section{A Bound for the Star Regular Expression}
-We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
-the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
-the property of the $\distinct$ function.
-Now we try to get a bound on $r^* \backslash s$ as well.
-Again, we first look at how a star's derivatives evolve, if they grow maximally: 
-\begin{center}
-
-$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
-r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
-(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
-\quad \ldots$
-
-\end{center}
-When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
-$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
-the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
-of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
-count the possible size explosions of $r \backslash c$ themselves.
-
-Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
-$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
-into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
-and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
-For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
-%TODO: definitions of  and \hflataux \hflat
- \begin{center}  
- \begin{tabular}{ccc}
- $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
-$\hflataux r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-
- \begin{center}  
- \begin{tabular}{ccc}
- $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
-$\hflat r$ & $=$ & $ r$
-\end{tabular}
-\end{center}s
-Again these definitions are tailor-made for dealing with alternatives that have
-originated from a star's derivatives, so we don't attempt to open up all possible 
-regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
-elements.
-We give a predicate for such "star-created" regular expressions:
-\begin{center}
-\begin{tabular}{lcr}
-         &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
- $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
- \end{tabular}
- \end{center}
- 
- These definitions allows us the flexibility to talk about 
- regular expressions in their most convenient format,
- for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
- instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
- These definitions help express that certain classes of syntatically 
- distinct regular expressions are actually the same under simplification.
- This is not entirely true for annotated regular expressions: 
- %TODO: bsimp bders \neq bderssimp
- \begin{center}
- $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
- \end{center}
- For bit-codes, the order in which simplification is applied
- might cause a difference in the location they are placed.
- If we want something like
- \begin{center}
- $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
- \end{center}
- Some "canonicalization" procedure is required,
- which either pushes all the common bitcodes to nodes
-  as senior as possible:
-  \begin{center}
-  $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
-  \end{center}
- or does the reverse. However bitcodes are not of interest if we are talking about
- the $\llbracket r \rrbracket$ size of a regex.
- Therefore for the ease and simplicity of producing a
- proof for a size bound, we are happy to restrict ourselves to 
- unannotated regular expressions, and obtain such equalities as
- \begin{lemma}
- $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
- \end{lemma}
- 
- \begin{proof}
- By using the rewriting relation $\rightsquigarrow$
- \end{proof}
- %TODO: rsimp sflat
-And from this we obtain a proof that a star's derivative will be the same
-as if it had all its nested alternatives created during deriving being flattened out:
- For example,
- \begin{lemma}
- $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
- \end{lemma}
- \begin{proof}
- By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
- \end{proof}
-% The simplification of a flattened out regular expression, provided it comes
-%from the derivative of a star, is the same as the one nested.
- 
- 
- 
- 
- 
- 
- 
- 
- 
-One might wonder the actual bound rather than the loose bound we gave
-for the convenience of an easier proof.
-How much can the regex $r^* \backslash s$ grow? 
-As  earlier graphs have shown,
-%TODO: reference that graph where size grows quickly
- they can grow at a maximum speed
-  exponential $w.r.t$ the number of characters, 
-but will eventually level off when the string $s$ is long enough.
-If they grow to a size exponential $w.r.t$ the original regex, our algorithm
-would still be slow.
-And unfortunately, we have concrete examples
-where such regexes grew exponentially large before levelling off:
-$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
-(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
- size that is  exponential on the number $n$ 
-under our current simplification rules:
-%TODO: graph of a regex whose size increases exponentially.
-\begin{center}
-\begin{tikzpicture}
-    \begin{axis}[
-        height=0.5\textwidth,
-        width=\textwidth,
-        xlabel=number of a's,
-        xtick={0,...,9},
-        ylabel=maximum size,
-        ymode=log,
-       log basis y={2}
-]
-        \addplot[mark=*,blue] table {re-chengsong.data};
-    \end{axis}
-\end{tikzpicture}
-\end{center}
-
-For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
-to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
-(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
-The exponential size is triggered by that the regex
-$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
-inside the $(\ldots) ^*$ having exponentially many
-different derivatives, despite those difference being minor.
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
-will therefore contain the following terms (after flattening out all nested 
-alternatives):
-\begin{center}
-$(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
-$(1 \leq m' \leq m )$
-\end{center}
-These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
- With each new input character taking the derivative against the intermediate result, more and more such distinct
- terms will accumulate, 
-until the length reaches $L.C.M.(1, \ldots, n)$.
-$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
-$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
-
-$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
- where $m' \neq m''$ \\
- as they are slightly different.
- This means that with our current simplification methods,
- we will not be able to control the derivative so that
- $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
- as there are already exponentially many terms.
- These terms are similar in the sense that the head of those terms
- are all consisted of sub-terms of the form: 
- $(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
- For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
- $n * (n + 1) / 2$ such terms. 
- For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
- can be described by 6 terms:
- $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
- $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
-The total number of different "head terms",  $n * (n + 1) / 2$,
- is proportional to the number of characters in the regex 
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
-This suggests a slightly different notion of size, which we call the 
-alphabetic width:
-%TODO:
-(TODO: Alphabetic width def.)
-
- 
-Antimirov\parencite{Antimirov95} has proven that 
-$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
-where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
-created by doing derivatives of $r$ against all possible strings.
-If we can make sure that at any moment in our lexing algorithm our 
-intermediate result hold at most one copy of each of the 
-subterms then we can get the same bound as Antimirov's.
-This leads to the algorithm in the next chapter.
-
-
-
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION 1
-%----------------------------------------------------------------------------------------
-
-\section{Idempotency of $\simp$}
-
-\begin{equation}
-	\simp \;r = \simp\; \simp \; r 
-\end{equation}
-This property means we do not have to repeatedly
-apply simplification in each step, which justifies
-our definition of $\blexersimp$.
-It will also be useful in future proofs where properties such as 
-closed forms are needed.
-The proof is by structural induction on $r$.
-
-%-----------------------------------
-%	SUBSECTION 1
-%-----------------------------------
-\subsection{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
-\begin{center}
-$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
-\end{center}
-

--- a/ChengsongTanPhdThesis/Chapters/Chapter5.tex	Sat Jul 09 14:11:07 2022 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,87 +0,0 @@
-% Chapter Template
-
-\chapter{A Better Bound and Other Extensions} % Main chapter title
-
-\label{Chapter5} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
-%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
-%algorithm to include constructs such as bounded repetitions and negations.
- 
-%----------------------------------------------------------------------------------------
-%	SECTION strongsimp
-%----------------------------------------------------------------------------------------
-\section{A Stronger Version of Simplification Inspired by Antimirov}
-%TODO: search for isabelle proofs of algorithms that check equivalence 
-
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION 1
-%----------------------------------------------------------------------------------------
-
-\section{Adding Support for the Negation Construct, and its Correctness Proof}
-We now add support for the negation regular expression:
-\[			r ::=   \ZERO \mid  \ONE
-			 \mid  c  
-			 \mid  r_1 \cdot r_2
-			 \mid  r_1 + r_2   
-			 \mid r^*    
-			 \mid \sim r
-\]
-The $\textit{nullable}$ function's clause for it would be 
-\[
-\textit{nullable}(~r) = \neg \nullable(r)
-\]
-The derivative would be
-\[
-~r \backslash c = ~ (r \backslash c)
-\]
- 
-The most tricky part of lexing for the $~r$ regex
- is creating a value for it.
- For other regular expressions, the value aligns with the 
- structure of the regex:
- \[
- \vdash \Seq(\Char(a), \Char(b)) : a \cdot b
- \]
-But for the $~r$ regex, $s$ is a member of it if and only if
-$s$ does not belong to $L(r)$. 
-That means when there
-is a match for the not regex, it is not possible to generate how the string $s$ matched
-with $r$. 
-What we can do is preserve the information of how $s$ was not matched by $r$,
-and there are a number of options to do this.
-
-We could give a partial value when there is a partial match for the regex inside
-the $\mathbf{not}$ construct.
-For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
-A value for it could be 
- \[
- \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
- \]
- The above example demonstrates what value to construct 
- when the string $s$ is at most a real prefix
- of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
- in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
- constructor.
- 
- Another option would be to either store the string $s$ that resulted in 
- a mis-match for $r$ or a dummy value as a placeholder:
- \[
- \vdash \textit{Not}(abcd) : ~(a^*)
- \]
-or
- \[
- \vdash \textit{Not}(\textit{Dummy}) : ~(a^*)
- \] 
- We choose to implement this as it is most straightforward:
- \[
- \mkeps(~(r))  = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
- \]
- 
-%----------------------------------------------------------------------------------------
-%	SECTION 2
-%----------------------------------------------------------------------------------------
-
-\section{Bounded Repetitions}
-
-

--- a/ChengsongTanPhdThesis/Chapters/Chapter6.tex	Sat Jul 09 14:11:07 2022 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,19 +0,0 @@
-% Chapter Template
-
-\chapter{Further Improvements} % Main chapter title
-
-\label{Chapter6} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
-%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
-%algorithm to include constructs such as bounded repetitions and negations.
- 
-%----------------------------------------------------------------------------------------
-%	SECTION strongsimp
-%----------------------------------------------------------------------------------------
-\section{Zippers}
-%TODO: search for isabelle proofs of algorithms that check equivalence 
-
-Unfortunately using zippers causes a similar effect like the "break up" rule
-and does not preserve the $\POSIX$ property.
-
-
-

--- a/ChengsongTanPhdThesis/Chapters/ChapterBitcoded1.tex	Sat Jul 09 14:11:07 2022 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,506 +0,0 @@
-% Chapter Template
-
-% Main chapter title
-\chapter{Correctness of Bit-coded Algorithm without Simplification}
-
-\label{ChapterBitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
-%Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
-%simplifications and therefore introduce our version of the bitcoded algorithm and 
-%its correctness proof in 
-%Chapter 3\ref{Chapter3}. 
-
-\section*{Bit-coded Algorithm}
-Bits and bitcodes (lists of bits) are defined as:
-
-\begin{center}
-		$b ::=   1 \mid  0 \qquad
-bs ::= [] \mid b::bs    
-$
-\end{center}
-
-\noindent
-The $1$ and $0$ are not in bold in order to avoid 
-confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
-bit-lists) can be used to encode values (or potentially incomplete values) in a
-compact form. This can be straightforwardly seen in the following
-coding function from values to bitcodes: 
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
-  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
-  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
-  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
-  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
-  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
-  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
-                                                 code(\Stars\,vs)$
-\end{tabular}    
-\end{center} 
-
-\noindent
-Here $\textit{code}$ encodes a value into a bitcodes by converting
-$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
-star iteration by $1$. The border where a local star terminates
-is marked by $0$. This coding is lossy, as it throws away the information about
-characters, and also does not encode the ``boundary'' between two
-sequence values. Moreover, with only the bitcode we cannot even tell
-whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
-reason for choosing this compact way of storing information is that the
-relatively small size of bits can be easily manipulated and ``moved
-around'' in a regular expression. In order to recover values, we will 
-need the corresponding regular expression as an extra information. This
-means the decoding function is defined as:
-
-
-%\begin{definition}[Bitdecoding of Values]\mbox{}
-\begin{center}
-\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
-  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
-  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
-  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
-     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
-       (\Left\,v, bs_1)$\\
-  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
-     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
-       (\Right\,v, bs_1)$\\                           
-  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
-        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
-  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
-  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
-  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
-  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
-         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
-  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
-  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
-  
-  $\textit{decode}\,bs\,r$ & $\dn$ &
-     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
-  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
-       \textit{else}\;\textit{None}$                       
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-Sulzmann and Lu's integrated the bitcodes into regular expressions to
-create annotated regular expressions \cite{Sulzmann2014}.
-\emph{Annotated regular expressions} are defined by the following
-grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{a}$ & $::=$  & $\ZERO$\\
-                  & $\mid$ & $_{bs}\ONE$\\
-                  & $\mid$ & $_{bs}{\bf c}$\\
-                  & $\mid$ & $_{bs}\sum\,as$\\
-                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
-                  & $\mid$ & $_{bs}a^*$
-\end{tabular}    
-\end{center}  
-%(in \textit{ALTS})
-
-\noindent
-where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
-expressions and $as$ for a list of annotated regular expressions.
-The alternative constructor($\sum$) has been generalized to 
-accept a list of annotated regular expressions rather than just 2.
-We will show that these bitcodes encode information about
-the (POSIX) value that should be generated by the Sulzmann and Lu
-algorithm.
-
-
-To do lexing using annotated regular expressions, we shall first
-transform the usual (un-annotated) regular expressions into annotated
-regular expressions. This operation is called \emph{internalisation} and
-defined as follows:
-
-%\begin{definition}
-\begin{center}
-\begin{tabular}{lcl}
-  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
-  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
-  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
-  $(r_1 + r_2)^\uparrow$ & $\dn$ &
-  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
-  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
-  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
-         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
-  $(r^*)^\uparrow$ & $\dn$ &
-         $_{[]}(r^\uparrow)^*$\\
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-\noindent
-We use up arrows here to indicate that the basic un-annotated regular
-expressions are ``lifted up'' into something slightly more complex. In the
-fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
-attach bits to the front of an annotated regular expression. Its
-definition is as follows:
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
-  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
-     $_{bs @ bs'}\ONE$\\
-  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
-     $_{bs@bs'}{\bf c}$\\
-  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
-     $_{bs@bs'}\sum\textit{as}$\\
-  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
-     $_{bs@bs'}a_1 \cdot a_2$\\
-  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
-     $_{bs @ bs'}a^*$
-\end{tabular}    
-\end{center}  
-
-\noindent
-After internalising the regular expression, we perform successive
-derivative operations on the annotated regular expressions. This
-derivative operation is the same as what we had previously for the
-basic regular expressions, except that we beed to take care of
-the bitcodes:
-
-
-\iffalse
- %\begin{definition}{bder}
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
-  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
-  $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
-  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
-					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
-  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
-  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
-      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
-       (\textit{STAR}\,[]\,r)$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
-  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
-  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
-  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
-					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
-  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
-  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
-      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
-       (_{bs}\textit{STAR}\,[]\,r)$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-\fi
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
-  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
-  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
-  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
-  $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\
-  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
-					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
-  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
-  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
-      $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
-       (_{[]}r^*))$
-\end{tabular}    
-\end{center}    
-
-%\end{definition}
-\noindent
-For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
-we need to unfold it into a sequence,
-and attach an additional bit $0$ to the front of $r \backslash c$
-to indicate one more star iteration. Also the sequence clause
-is more subtle---when $a_1$ is $\textit{bnullable}$ (here
-\textit{bnullable} is exactly the same as $\textit{nullable}$, except
-that it is for annotated regular expressions, therefore we omit the
-definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
-$a_1$ matches the string prior to character $c$ (more on this later),
-then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
-\backslash c)$ will collapse the regular expression $a_1$(as it has
-already been fully matched) and store the parsing information at the
-head of the regular expression $a_2 \backslash c$ by fusing to it. The
-bitsequence $\textit{bs}$, which was initially attached to the
-first element of the sequence $a_1 \cdot a_2$, has
-now been elevated to the top-level of $\sum$, as this information will be
-needed whichever way the sequence is matched---no matter whether $c$ belongs
-to $a_1$ or $ a_2$. After building these derivatives and maintaining all
-the lexing information, we complete the lexing by collecting the
-bitcodes using a generalised version of the $\textit{mkeps}$ function
-for annotated regular expressions, called $\textit{bmkeps}$:
-
-
-%\begin{definition}[\textit{bmkeps}]\mbox{}
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
-  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a$\\
-  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
-  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
-  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
-     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
-  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
-     $bs \,@\, [0]$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-\noindent
-This function completes the value information by travelling along the
-path of the regular expression that corresponds to a POSIX value and
-collecting all the bitcodes, and using $S$ to indicate the end of star
-iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
-decode them, we get the value we expect. The corresponding lexing
-algorithm looks as follows:
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{blexer}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
-  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
-  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
-  & & $\;\;\textit{else}\;\textit{None}$
-\end{tabular}
-\end{center}
-
-\noindent
-In this definition $\_\backslash s$ is the  generalisation  of the derivative
-operation from characters to strings (just like the derivatives for un-annotated
-regular expressions).
-
-Now we introduce the simplifications, which is why we introduce the 
-bitcodes in the first place.
-
-\subsection*{Simplification Rules}
-
-This section introduces aggressive (in terms of size) simplification rules
-on annotated regular expressions
-to keep derivatives small. Such simplifications are promising
-as we have
-generated test data that show
-that a good tight bound can be achieved. We could only
-partially cover the search space as there are infinitely many regular
-expressions and strings. 
-
-One modification we introduced is to allow a list of annotated regular
-expressions in the $\sum$ constructor. This allows us to not just
-delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
-also unnecessary ``copies'' of regular expressions (very similar to
-simplifying $r + r$ to just $r$, but in a more general setting). Another
-modification is that we use simplification rules inspired by Antimirov's
-work on partial derivatives. They maintain the idea that only the first
-``copy'' of a regular expression in an alternative contributes to the
-calculation of a POSIX value. All subsequent copies can be pruned away from
-the regular expression. A recursive definition of our  simplification function 
-that looks somewhat similar to our Scala code is given below:
-%\comment{Use $\ZERO$, $\ONE$ and so on. 
-%Is it $ALTS$ or $ALTS$?}\\
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-   
-  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
-   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
-   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
-   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
-
-  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\
-  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
-   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
-
-   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
-\end{tabular}    
-\end{center}    
-
-\noindent
-The simplification does a pattern matching on the regular expression.
-When it detected that the regular expression is an alternative or
-sequence, it will try to simplify its child regular expressions
-recursively and then see if one of the children turns into $\ZERO$ or
-$\ONE$, which might trigger further simplification at the current level.
-The most involved part is the $\sum$ clause, where we use two
-auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
-alternatives and reduce as many duplicates as possible. Function
-$\textit{distinct}$  keeps the first occurring copy only and removes all later ones
-when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
-Its recursive definition is given below:
-
- \begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
-     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
-  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
-    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
-\end{tabular}    
-\end{center}  
-
-\noindent
-Here $\textit{flatten}$ behaves like the traditional functional programming flatten
-function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
-removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
-
-Having defined the $\simp$ function,
-we can use the previous notation of  natural
-extension from derivative w.r.t.~character to derivative
-w.r.t.~string:%\comment{simp in  the [] case?}
-
-\begin{center}
-\begin{tabular}{lcl}
-$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
-$r \backslash_{simp} [\,] $ & $\dn$ & $r$
-\end{tabular}
-\end{center}
-
-\noindent
-to obtain an optimised version of the algorithm:
-
- \begin{center}
-\begin{tabular}{lcl}
-  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
-  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
-  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
-  & & $\;\;\textit{else}\;\textit{None}$
-\end{tabular}
-\end{center}
-
-\noindent
-This algorithm keeps the regular expression size small, for example,
-with this simplification our previous $(a + aa)^*$ example's 8000 nodes
-will be reduced to just 6 and stays constant, no matter how long the
-input string is.
-
-
-
-
-
-
-
-
-
-
-
-%-----------------------------------
-%	SUBSECTION 1
-%-----------------------------------
-\section{Specifications of Some Helper Functions}
-Here we give some functions' definitions, 
-which we will use later.
-\begin{center}
-\begin{tabular}{ccc}
-$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$
-\end{tabular}
-\end{center}
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION  correctness proof
-%----------------------------------------------------------------------------------------
-\section{Correctness of Bit-coded Algorithm (Without Simplification)}
-We now give the proof the correctness of the algorithm with bit-codes.
-
-Ausaf and Urban cleverly defined an auxiliary function called $\flex$,
-defined as
-\[
-\flex \; r \; f \; [] \; v \; = \; f\; v
-\flex \; r \; f \; c :: s \; v =  \flex r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v
-\]
-which accumulates the characters that needs to be injected back, 
-and does the injection in a stack-like manner (last taken derivative first injected).
-$\flex$ is connected to the $\lexer$:
-\begin{lemma}
-$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
-\end{lemma}
-$\flex$ provides us a bridge between $\lexer$ and $\blexer$.
-What is even better about $\flex$ is that it allows us to 
-directly operate on the value $\mkeps (r\backslash v)$,
-which is pivotal in the definition of  $\lexer $ and $\blexer$, but not visible as an argument.
-When the value created by $\mkeps$ becomes available, one can 
-prove some stepwise properties of lexing nicely:
-\begin{lemma}\label{flexStepwise}
-$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
-\end{lemma}
-
-And for $\blexer$ we have a function with stepwise properties like $\flex$ as well,
-called $\retrieve$\ref{retrieveDef}.
-$\retrieve$ takes bit-codes from annotated regular expressions
-guided by a value.
-$\retrieve$ is connected to the $\blexer$ in the following way:
-\begin{lemma}\label{blexer_retrieve}
-$\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
-\end{lemma}
-If you take derivative of an annotated regular expression, 
-you can $\retrieve$ the same bit-codes as before the derivative took place,
-provided that you use the corresponding value:
-
-\begin{lemma}\label{retrieveStepwise}
-$\retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
-\end{lemma}
-The other good thing about $\retrieve$ is that it can be connected to $\flex$:
-%centralLemma1
-\begin{lemma}\label{flex_retrieve}
-$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
-\end{lemma}
-\begin{proof}
-By induction on $s$. The induction tactic is reverse induction on strings.
-$v$ is allowed to be arbitrary.
-The crucial point is to rewrite 
-\[
-\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
-\]
-as
-\[
-\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\;  \mkeps (r \backslash s@[c]))
-\].
-This enables us to equate 
-\[
-\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
-\] 
-with 
-\[
-\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))
-\],
-which in turn can be rewritten as
-\[
-\flex \; r \; \textit{id} \; s@[c] \;  (\mkeps (r\backslash s@[c]))
-\].
-\end{proof}
-
-With the above lemma we can now link $\flex$ and $\blexer$.
-
-\begin{lemma}\label{flex_blexer}
-$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$
-\end{lemma}
-\begin{proof}
-Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
-\end{proof}
-Finally 
-
-
-

--- a/ChengsongTanPhdThesis/Chapters/ChapterBitcoded2.tex	Sat Jul 09 14:11:07 2022 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,64 +0,0 @@
-% Chapter Template
-
-% Main chapter title
-\chapter{Correctness of Bit-coded Algorithm with Simplification}
-
-\label{ChapterBitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
-%Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
-%simplifications and therefore introduce our version of the bitcoded algorithm and 
-%its correctness proof in 
-%Chapter 3\ref{Chapter3}. 
-
-
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION common identities
-%----------------------------------------------------------------------------------------
-\section{Common Identities In Simplification-Related Functions} 
-Need to prove these before starting on the big correctness proof.
-
-%-----------------------------------
-%	SUBSECTION 
-%-----------------------------------
-\subsection{Idempotency of $\simp$}
-
-\begin{equation}
-	\simp \;r = \simp\; \simp \; r 
-\end{equation}
-This property means we do not have to repeatedly
-apply simplification in each step, which justifies
-our definition of $\blexersimp$.
-It will also be useful in future proofs where properties such as 
-closed forms are needed.
-The proof is by structural induction on $r$.
-
-%-----------------------------------
-%	SUBSECTION 
-%-----------------------------------
-\subsection{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
-\begin{center}
-$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
-\end{center}
-
-
-
-
-
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION corretness proof
-%----------------------------------------------------------------------------------------
-\section{Proof Technique of Correctness of Bit-coded Algorithm with Simplification}
-The non-trivial part of proving the correctness of the algorithm with simplification
-compared with not having simplification is that we can no longer use the argument 
-in \cref{flex_retrieve}.
-The function \retrieve needs the structure of the annotated regular expression to 
-agree with the structure of the value, but simplification will always mess with the 
-structure:
-%TODO: after simp does not agree with each other: (a + 0) --> a v.s. Left(Char(a))
\ No newline at end of file

--- a/ChengsongTanPhdThesis/Chapters/ChapterFinite.tex	Sat Jul 09 14:11:07 2022 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,877 +0,0 @@
-% Chapter Template
-
-\chapter{Finiteness Bound} % Main chapter title
-
-\label{ChapterFinite} 
-%  In Chapter 4 \ref{Chapter4} we give the second guarantee
-%of our bitcoded algorithm, that is a finite bound on the size of any 
-%regex's derivatives. 
-
-In this chapter we give a guarantee in terms of time complexity:
-given a regular expression $r$, for any string $s$ 
-our algorithm's internal data structure is finitely bounded.
-To obtain such a proof, we need to 
-\begin{itemize}
-\item
-Define an new datatype for regular expressions that makes it easy
-to reason about the size of an annotated regular expression.
-\item
-A set of equalities for this new datatype that enables one to
-rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc.
-by their children regexes $r_1$, $r_2$, and $r$.
-\item
-Using those equalities to actually get those rewriting equations, which we call
-"closed forms".
-\item
-Bound the closed forms, thereby bounding the original
-$\blexersimp$'s internal data structures.
-\end{itemize}
-
-\section{the $\mathbf{r}$-rexp datatype and the size functions}
-
-We have a size function for bitcoded regular expressions, written
-$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
-
-\begin{center}
-\begin{tabular}{ccc}
-$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
-\end{tabular}
-\end{center}
-(TODO: COMPLETE this defn and for $rs$)
-
-The size is based on a recursive function on the structure of the regex,
-not the bitcodes.
-Therefore we may as well talk about size of an annotated regular expression 
-in their un-annotated form:
-\begin{center}
-$\asize(a) = \size(\erase(a))$. 
-\end{center}
-(TODO: turn equals to roughly equals)
-
-But there is a minor nuisance here:
-the erase function actually messes with the structure of the regular expression:
-\begin{center}
-\begin{tabular}{ccc}
-$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
-\end{tabular}
-\end{center}
-(TODO: complete this definition with singleton r in alts)
-
-An alternative regular expression with an empty list of children
- is turned into an $\ZERO$ during the
-$\erase$ function, thereby changing the size and structure of the regex.
-These will likely be fixable if we really want to use plain $\rexp$s for dealing
-with size, but we choose a more straightforward (or stupid) method by 
-defining a new datatype that is similar to plain $\rexp$s but can take
-non-binary arguments for its alternative constructor,
- which we call $\rrexp$ to denote
-the difference between it and plain regular expressions. 
-\[			\rrexp ::=   \RZERO \mid  \RONE
-			 \mid  \RCHAR{c}  
-			 \mid  \RSEQ{r_1}{r_2}
-			 \mid  \RALTS{rs}
-			 \mid \RSTAR{r}        
-\]
-For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
-but keep everything else intact.
-It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
-$\ALTS$).
-We denote the operation of erasing the bits and turning an annotated regular expression 
-into an $\rrexp{}$ as $\rerase{}$.
-\begin{center}
-\begin{tabular}{lcl}
-$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
-$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
-\end{tabular}
-\end{center}
-%TODO: FINISH definition of rerase
-Similarly we could define the derivative  and simplification on 
-$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
-except that now they can operate on alternatives taking multiple arguments.
-
-\begin{center}
-\begin{tabular}{lcr}
-$\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
-(other clauses omitted)
-\end{tabular}
-\end{center}
-
-Now that $\rrexp$s do not have bitcodes on them, we can do the 
-duplicate removal without  an equivalence relation:
-\begin{center}
-\begin{tabular}{lcl}
-$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
-           			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
-\end{tabular}
-\end{center}
-%TODO: definition of rsimp (maybe only the alternative clause)
-
-
-The reason why these definitions  mirror precisely the corresponding operations
-on annotated regualar expressions is that we can calculate sizes more easily:
-
-\begin{lemma}
-$\rsize{\rerase a} = \asize a$
-\end{lemma}
-This lemma says that the size of an r-erased regex is the same as the annotated regex.
-this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
-\begin{lemma}
-$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
-\end{lemma}
-Putting these two together we get a property that allows us to estimate the 
-size of an annotated regular expression derivative using its un-annotated counterpart:
-\begin{lemma}
-$\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
-\end{lemma}
-Unless stated otherwise in this chapter all $\textit{rexp}$s without
- bitcodes are seen as $\rrexp$s.
- We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
- as the former suits people's intuitive way of stating a binary alternative
- regular expression.
-
-
-
-%-----------------------------------
-%	SECTION ?
-%-----------------------------------
-\section{preparatory properties for getting a finiteness bound}
-Before we get to the proof that says the intermediate result of our lexer will
-remain finitely bounded, which is an important efficiency/liveness guarantee,
-we shall first develop a few preparatory properties and definitions to 
-make the process of proving that a breeze.
-
-We define rewriting relations for $\rrexp$s, which allows us to do the 
-same trick as we did for the correctness proof,
-but this time we will have stronger equalities established.
-\subsection{"hrewrite" relation}
-List of 1-step rewrite rules for regular expressions simplification without bitcodes:
-\begin{figure}
-\caption{the "h-rewrite" rules}
-\[
-r_1 \cdot \ZERO \rightarrow_h \ZERO \]
-
-\[
-\ZERO \cdot r_2 \rightarrow \ZERO 
-\]
-\end{figure}
-And we define an "grewrite" relation that works on lists:
-\begin{center}
-\begin{tabular}{lcl}
-$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
-\end{tabular}
-\end{center}
-
-
-
-With these relations we prove
-\begin{lemma}
-$rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
-\end{lemma}
-which enables us to prove "closed forms" equalities such as
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\suffix \; s \; r_1 ))}$
-\end{lemma}
-
-But the most involved part of the above lemma is proving the following:
-\begin{lemma}
-$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ 
-\end{lemma}
-which is needed in proving things like 
-\begin{lemma}
-$r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
-\end{lemma}
-
-Fortunately this is provable by induction on the list $rs$
-
-
-
-%-----------------------------------
-%	SECTION 2
-%-----------------------------------
-
-\begin{theorem}
-For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
-\end{theorem}
-
-\noindent
-\begin{proof}
-We prove this by induction on $r$. The base cases for $\AZERO$,
-$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
-for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
-hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
-$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
-%
-\begin{center}
-\begin{tabular}{lcll}
-& & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
-& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
-    [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
-& $\leq$ &
-    $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
-    [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
-& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
-             \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
-      \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
-\end{tabular}
-\end{center}
-
-
-\noindent
-where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
-The reason why we could write the derivatives of a sequence this way is described in section 2.
-The term (2) is used to control (1). 
-That is because one can obtain an overall
-smaller regex list
-by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
-Section 3 is dedicated to its proof.
-In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
-bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
-than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
-for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
-We reason similarly for  $\STAR$.\medskip
-\end{proof}
-
-What guarantee does this bound give us?
-
-Whatever the regex is, it will not grow indefinitely.
-Take our previous example $(a + aa)^*$ as an example:
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={number of $a$'s},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={regex size},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax= 40,
-    ytick={0,10,...,40},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={$(a + aa)^*$},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}
-We are able to limit the size of the regex $(a + aa)^*$'s derivatives
- with our simplification
-rules very effectively.
-
-
-In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
-is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
-Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
-inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
-$f(x) = x * 2^x$.
-This means the bound we have will surge up at least
-tower-exponentially with a linear increase of the depth.
-For a regex of depth $n$, the bound
-would be approximately $4^n$.
-
-Test data in the graphs from randomly generated regular expressions
-shows that the giant bounds are far from being hit.
-%a few sample regular experessions' derivatives
-%size change
-%TODO: giving regex1_size_change.data showing a few regexes' size changes 
-%w;r;t the input characters number, where the size is usually cubic in terms of original size
-%a*, aa*, aaa*, .....
-%randomly generated regexes
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={number of $a$'s},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={regex size},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=1000,
-    ytick={0,100,...,1000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={regex1},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
-\end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=1000,
-    ytick={0,100,...,1000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={regex2},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
-\end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=1000,
-    ytick={0,100,...,1000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={regex3},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
-\end{tabular}    
-\end{center}  
-
-
-
-
-
-\noindent
-Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the 
-original size.
-This suggests a link towrads "partial derivatives"
- introduced by Antimirov \cite{Antimirov95}.
- 
- \section{Antimirov's partial derivatives}
- The idea behind Antimirov's partial derivatives
-is to do derivatives in a similar way as suggested by Brzozowski, 
-but maintain a set of regular expressions instead of a single one:
-
-%TODO: antimirov proposition 3.1, needs completion
- \begin{center}  
- \begin{tabular}{ccc}
- $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\
-$\partial_x(\ONE)$ & $=$ & $\phi$
-\end{tabular}
-\end{center}
-
-Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together
-using the alternatives constructor, Antimirov cleverly chose to put them into
-a set instead.  This breaks the terms in a derivative regular expression up, 
-allowing us to understand what are the "atomic" components of it.
-For example, To compute what regular expression $x^*(xx + y)^*$'s 
-derivative against $x$ is made of, one can do a partial derivative
-of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$
-from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.
-To get all the "atomic" components of a regular expression's possible derivatives,
-there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes
-whatever character is available at the head of the string inside the language of a
-regular expression, and gives back the character and the derivative regular expression
-as a pair (which he called "monomial"):
- \begin{center}  
- \begin{tabular}{ccc}
- $\lf(\ONE)$ & $=$ & $\phi$\\
-$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
- $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
- $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\
-\end{tabular}
-\end{center}
-%TODO: completion
-
-There is a slight difference in the last three clauses compared
-with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular 
-expression $r$ with every element inside $\textit{rset}$ to create a set of 
-sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates 
-on a set of monomials (which Antimirov called "linear form") and a regular 
-expression, and returns a linear form:
- \begin{center}  
- \begin{tabular}{ccc}
- $l \bigodot (\ZERO)$ & $=$ & $\phi$\\
- $l \bigodot (\ONE)$ & $=$ & $l$\\
- $\phi \bigodot t$ & $=$ & $\phi$\\
- $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
- $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
-  $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
- $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
- $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\
-\end{tabular}
-\end{center}
-%TODO: completion
-
- Some degree of simplification is applied when doing $\bigodot$, for example,
- $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
- and  $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and
-  $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,
-  and so on.
-  
-  With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regex $r$ with 
-  an iterative procedure:
-   \begin{center}  
- \begin{tabular}{llll}
-$\textit{while}$ & $(\Delta_i \neq \phi)$  &                &          \\
- 		       &  $\Delta_{i+1}$           &        $ =$ & $\lf(\Delta_i) - \PD_i$ \\
-		       &  $\PD_{i+1}$              &         $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
-$\partial_{UNIV}(r)$ & $=$ & $\PD$ &		     
-\end{tabular}
-\end{center}
-
-
- $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
-
-
-However, if we introduce them in our
-setting we would lose the POSIX property of our calculated values. 
-A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
-If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer 
-would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of
-what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information
-in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
-occurs, and apply them in the right order once we get 
-a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.
-This is unlike the simplification we had before, where the rewriting rules 
-such  as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.
-We will discuss better
-bounds in the last section of this chapter.\\[-6.5mm]
-
-
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION ??
-%----------------------------------------------------------------------------------------
-
-\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
-To embark on getting the "closed forms" of regexes, we first
-define a few auxiliary definitions to make expressing them smoothly.
-
- \begin{center}  
- \begin{tabular}{ccc}
- $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
-$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
-$\sflataux r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-The intuition behind $\sflataux{\_}$ is to break up nested regexes 
-of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
-into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
-It will return the singleton list $[r]$ otherwise.
-
-$\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
-the output type a regular expression, not a list:
- \begin{center} 
- \begin{tabular}{ccc}
- $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
-$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
-$\sflat r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-$\sflataux{\_}$  and $\sflat{\_}$ is only recursive in terms of the
- first element of the list of children of
-an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for  dealing with the regular expression
-$r_1 \cdot r_2 \backslash s$.
-
-With $\sflat{\_}$ and $\sflataux{\_}$ we make 
-precise what  "closed forms" we have for the sequence derivatives and their simplifications,
-in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
-and $\bderssimp{(r_1\cdot r_2)}{s}$,
-if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
-and  $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$  ranges over 
-the substring of $s$?
-First let's look at a series of derivatives steps on a sequence 
-regular expression,  (assuming) that each time the first
-component of the sequence is always nullable):
-\begin{center}
-
-$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
-$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
- \ldots$
-
-\end{center}
-%TODO: cite indian paper
-Indianpaper have  come up with a slightly more formal way of putting the above process:
-\begin{center}
-$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
-\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
-+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
-\end{center}
-where  $\delta(b, r)$ will produce $r$
-if $b$ evaluates to true, 
-and $\ZERO$ otherwise.
-
- But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
- equivalence. To make this intuition useful 
- for a formal proof, we need something
-stronger than language equivalence.
-With the help of $\sflat{\_}$ we can state the equation in Indianpaper
-more rigorously:
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
-
-The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
-
-\begin{center}
-\begin{tabular}{lcl}
-$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
-$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
-                                     && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
-\end{tabular}
-\end{center}                   
-It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
-and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
-the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
-In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
-the entire result of  $(r_1 \cdot r_2) \backslash s$, 
-it only stores all the $s''$ such that $r_2 \backslash s''$  are occurring terms in $(r_1\cdot r_2)\backslash s$.
-
-With this we can also add simplifications to both sides and get
-\begin{lemma}
-$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
-\end{lemma}
-Together with the idempotency property of $\rsimp$,
-%TODO: state the idempotency property of rsimp
-\begin{lemma}
-$\rsimp{r} = \rsimp{\rsimp{r}}$
-\end{lemma}
-it is possible to convert the above lemma to obtain a "closed form"
-for  derivatives nested with simplification:
-\begin{lemma}
-$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
-\end{lemma}
-And now the reason we have (1) in section 1 is clear:
-$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, 
-is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
-    as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
-
-%----------------------------------------------------------------------------------------
-%	SECTION 3
-%----------------------------------------------------------------------------------------
-
-\section{interaction between $\distinctWith$ and $\flts$}
-Note that it is not immediately obvious that 
-\begin{center}
-$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
-\end{center}
-The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
-duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
-
-
-%-----------------------------------
-%	SECTION syntactic equivalence under simp
-%-----------------------------------
-\section{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
-\begin{center}
-$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
-\end{center}
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION ALTS CLOSED FORM
-%----------------------------------------------------------------------------------------
-\section{A Closed Form for \textit{ALTS}}
-Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
-
-
-There are a few key steps, one of these steps is
-\[
-rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {})))
-= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))
-\]
-
-
-One might want to prove this by something a simple statement like: 
-$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$.
-
-For this to hold we want the $\textit{distinct}$ function to pick up
-the elements before and after derivatives correctly:
-$r \in rset \equiv (rder x r) \in (rder x rset)$.
-which essentially requires that the function $\backslash$ is an injective mapping.
-
-Unfortunately the function $\backslash c$ is not an injective mapping.
-
-\subsection{function $\backslash c$ is not injective (1-to-1)}
-\begin{center}
-The derivative $w.r.t$ character $c$ is not one-to-one.
-Formally,
-	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
-\end{center}
-This property is trivially true for the
-character regex example:
-\begin{center}
-	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
-\end{center}
-But apart from the cases where the derivative
-output is $\ZERO$, are there non-trivial results
-of derivatives which contain strings?
-The answer is yes.
-For example,
-\begin{center}
-	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
-	where $a$ is not nullable.\\
-	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
-	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
-\end{center}
-We start with two syntactically different regexes,
-and end up with the same derivative result.
-This is not surprising as we have such 
-equality as below in the style of Arden's lemma:\\
-\begin{center}
-	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
-\end{center}
-
-%----------------------------------------------------------------------------------------
-%	SECTION 4
-%----------------------------------------------------------------------------------------
-\section{A Bound for the Star Regular Expression}
-We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
-the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
-the property of the $\distinct$ function.
-Now we try to get a bound on $r^* \backslash s$ as well.
-Again, we first look at how a star's derivatives evolve, if they grow maximally: 
-\begin{center}
-
-$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
-r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
-(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
-\quad \ldots$
-
-\end{center}
-When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
-$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
-the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
-of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
-count the possible size explosions of $r \backslash c$ themselves.
-
-Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
-$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
-into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
-and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
-For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
-%TODO: definitions of  and \hflataux \hflat
- \begin{center}  
- \begin{tabular}{ccc}
- $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
-$\hflataux r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-
- \begin{center}  
- \begin{tabular}{ccc}
- $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
-$\hflat r$ & $=$ & $ r$
-\end{tabular}
-\end{center}s
-Again these definitions are tailor-made for dealing with alternatives that have
-originated from a star's derivatives, so we don't attempt to open up all possible 
-regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
-elements.
-We give a predicate for such "star-created" regular expressions:
-\begin{center}
-\begin{tabular}{lcr}
-         &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
- $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
- \end{tabular}
- \end{center}
- 
- These definitions allows us the flexibility to talk about 
- regular expressions in their most convenient format,
- for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
- instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
- These definitions help express that certain classes of syntatically 
- distinct regular expressions are actually the same under simplification.
- This is not entirely true for annotated regular expressions: 
- %TODO: bsimp bders \neq bderssimp
- \begin{center}
- $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
- \end{center}
- For bit-codes, the order in which simplification is applied
- might cause a difference in the location they are placed.
- If we want something like
- \begin{center}
- $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
- \end{center}
- Some "canonicalization" procedure is required,
- which either pushes all the common bitcodes to nodes
-  as senior as possible:
-  \begin{center}
-  $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
-  \end{center}
- or does the reverse. However bitcodes are not of interest if we are talking about
- the $\llbracket r \rrbracket$ size of a regex.
- Therefore for the ease and simplicity of producing a
- proof for a size bound, we are happy to restrict ourselves to 
- unannotated regular expressions, and obtain such equalities as
- \begin{lemma}
- $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
- \end{lemma}
- 
- \begin{proof}
- By using the rewriting relation $\rightsquigarrow$
- \end{proof}
- %TODO: rsimp sflat
-And from this we obtain a proof that a star's derivative will be the same
-as if it had all its nested alternatives created during deriving being flattened out:
- For example,
- \begin{lemma}
- $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
- \end{lemma}
- \begin{proof}
- By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
- \end{proof}
-% The simplification of a flattened out regular expression, provided it comes
-%from the derivative of a star, is the same as the one nested.
- 
- 
- 
- 
- 
- 
- 
- 
- 
-One might wonder the actual bound rather than the loose bound we gave
-for the convenience of an easier proof.
-How much can the regex $r^* \backslash s$ grow? 
-As  earlier graphs have shown,
-%TODO: reference that graph where size grows quickly
- they can grow at a maximum speed
-  exponential $w.r.t$ the number of characters, 
-but will eventually level off when the string $s$ is long enough.
-If they grow to a size exponential $w.r.t$ the original regex, our algorithm
-would still be slow.
-And unfortunately, we have concrete examples
-where such regexes grew exponentially large before levelling off:
-$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
-(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
- size that is  exponential on the number $n$ 
-under our current simplification rules:
-%TODO: graph of a regex whose size increases exponentially.
-\begin{center}
-\begin{tikzpicture}
-    \begin{axis}[
-        height=0.5\textwidth,
-        width=\textwidth,
-        xlabel=number of a's,
-        xtick={0,...,9},
-        ylabel=maximum size,
-        ymode=log,
-       log basis y={2}
-]
-        \addplot[mark=*,blue] table {re-chengsong.data};
-    \end{axis}
-\end{tikzpicture}
-\end{center}
-
-For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
-to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
-(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
-The exponential size is triggered by that the regex
-$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
-inside the $(\ldots) ^*$ having exponentially many
-different derivatives, despite those difference being minor.
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
-will therefore contain the following terms (after flattening out all nested 
-alternatives):
-\begin{center}
-$(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
-$(1 \leq m' \leq m )$
-\end{center}
-These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
- With each new input character taking the derivative against the intermediate result, more and more such distinct
- terms will accumulate, 
-until the length reaches $L.C.M.(1, \ldots, n)$.
-$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
-$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
-
-$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
- where $m' \neq m''$ \\
- as they are slightly different.
- This means that with our current simplification methods,
- we will not be able to control the derivative so that
- $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
- as there are already exponentially many terms.
- These terms are similar in the sense that the head of those terms
- are all consisted of sub-terms of the form: 
- $(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
- For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
- $n * (n + 1) / 2$ such terms. 
- For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
- can be described by 6 terms:
- $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
- $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
-The total number of different "head terms",  $n * (n + 1) / 2$,
- is proportional to the number of characters in the regex 
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
-This suggests a slightly different notion of size, which we call the 
-alphabetic width:
-%TODO:
-(TODO: Alphabetic width def.)
-
- 
-Antimirov\parencite{Antimirov95} has proven that 
-$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
-where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
-created by doing derivatives of $r$ against all possible strings.
-If we can make sure that at any moment in our lexing algorithm our 
-intermediate result hold at most one copy of each of the 
-subterms then we can get the same bound as Antimirov's.
-This leads to the algorithm in the next chapter.
-
-
-
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION 1
-%----------------------------------------------------------------------------------------
-
-\section{Idempotency of $\simp$}
-
-\begin{equation}
-	\simp \;r = \simp\; \simp \; r 
-\end{equation}
-This property means we do not have to repeatedly
-apply simplification in each step, which justifies
-our definition of $\blexersimp$.
-It will also be useful in future proofs where properties such as 
-closed forms are needed.
-The proof is by structural induction on $r$.
-
-%-----------------------------------
-%	SUBSECTION 1
-%-----------------------------------
-\subsection{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
-\begin{center}
-$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
-\end{center}
-

--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Sat Jul 09 14:11:07 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Wed Jul 13 08:35:45 2022 +0100
@@ -209,7 +209,7 @@
 With $\rrexp$ the size caclulation of annotated regular expressions'
 simplification and derivatives can be done by the size of their unlifted 
 counterpart with the unlifted version of simplification and derivatives applied.
-\begin{lemma}
+\begin{lemma}\label{sizeRelations}
 	The following equalities hold:
 	\begin{itemize}
 		\item
@@ -1244,8 +1244,35 @@
 \end{corollary}
 \noindent
 \subsection{Closed Forms for Star Regular Expressions}
-We use a similar technique as $r_1 \cdot r_2$ case,
-generating 
+We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
+the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
+the property of the $\distinct$ function.
+Now we try to get a bound on $r^* \backslash s$ as well.
+Again, we first look at how a star's derivatives evolve, if they grow maximally: 
+\begin{center}
+
+$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
+r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
+(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
+\quad \ldots$
+
+\end{center}
+When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
+$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
+the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
+of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
+count the possible size explosions of $r \backslash c$ themselves.
+
+Thanks to $\rflts$ and $\rDistinct$, we are able to open up regexes like
+$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
+(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
+into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', 
+r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
+and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
+This allows us to use a similar technique as $r_1 \cdot r_2$ case,
+where the crux is to get an equivalent form of 
+$\rderssimp{r^*}{s}$ with shape $\rsimp{\sum rs}$.
+This requires generating 
 all possible sub-strings $s'$ of $s$
 such that $r\backslash s' \cdot r^*$ will appear 
 as a term in $(r^*) \backslash s$.
@@ -1304,6 +1331,68 @@
 \end{center}
 \noindent
 %MAYBE TODO: introduce createdByStar
+Again these definitions are tailor-made for dealing with alternatives that have
+originated from a star's derivatives, so we do not attempt to open up all possible 
+regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
+elements.
+We give a predicate for such "star-created" regular expressions:
+\begin{center}
+\begin{tabular}{lcr}
+         &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
+ $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
+ \end{tabular}
+ \end{center}
+ 
+ These definitions allows us the flexibility to talk about 
+ regular expressions in their most convenient format,
+ for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
+ instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
+ These definitions help express that certain classes of syntatically 
+ distinct regular expressions are actually the same under simplification.
+ This is not entirely true for annotated regular expressions: 
+ %TODO: bsimp bders \neq bderssimp
+ \begin{center}
+ $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
+ \end{center}
+ For bit-codes, the order in which simplification is applied
+ might cause a difference in the location they are placed.
+ If we want something like
+ \begin{center}
+ $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
+ \end{center}
+ Some "canonicalization" procedure is required,
+ which either pushes all the common bitcodes to nodes
+  as senior as possible:
+  \begin{center}
+  $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
+  \end{center}
+ or does the reverse. However bitcodes are not of interest if we are talking about
+ the $\llbracket r \rrbracket$ size of a regex.
+ Therefore for the ease and simplicity of producing a
+ proof for a size bound, we are happy to restrict ourselves to 
+ unannotated regular expressions, and obtain such equalities as
+ \begin{lemma}
+ $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
+ \end{lemma}
+ 
+ \begin{proof}
+ By using the rewriting relation $\rightsquigarrow$
+ \end{proof}
+ %TODO: rsimp sflat
+And from this we obtain a proof that a star's derivative will be the same
+as if it had all its nested alternatives created during deriving being flattened out:
+ For example,
+ \begin{lemma}
+ $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
+ \end{lemma}
+ \begin{proof}
+ By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
+ \end{proof}
+% The simplification of a flattened out regular expression, provided it comes
+%from the derivative of a star, is the same as the one nested.
+ 
+
+
 We first introduce an inductive property
 for $\starupdate$ and $\hflataux{\_}$, 
 it says if we do derivatives of $r^*$
@@ -1516,7 +1605,7 @@
 We reason similarly for  $\STAR$.
 The inductive hypothesis is
 $\exists N. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N$.
-Let $\n_r = \llbracket r^* \rrbracket_r$.
+Let $n_r = \llbracket r^* \rrbracket_r$.
 When $s = c :: cs$ is not empty,
 \begin{center}
 \begin{tabular}{lcll}
@@ -1572,6 +1661,13 @@
 (4), (8), and (12) are all the inductive cases proven.
 \end{proof}
 
+
+\begin{corollary}
+For any regex $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$
+\end{corollary}
+\begin{proof}
+	By \ref{sizeRelations}.
+\end{proof}
 \noindent
 
 %-----------------------------------
@@ -1891,105 +1987,6 @@
 %----------------------------------------------------------------------------------------
 %	SECTION 4
 %----------------------------------------------------------------------------------------
-\section{A Bound for the Star Regular Expression}
-We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
-the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
-the property of the $\distinct$ function.
-Now we try to get a bound on $r^* \backslash s$ as well.
-Again, we first look at how a star's derivatives evolve, if they grow maximally: 
-\begin{center}
-
-$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
-r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
-(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
-\quad \ldots$
-
-\end{center}
-When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
-$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
-the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
-of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
-count the possible size explosions of $r \backslash c$ themselves.
-
-Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
-$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
-into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
-and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
-For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
-%TODO: definitions of  and \hflataux \hflat
- \begin{center}  
- \begin{tabular}{ccc}
- $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
-$\hflataux r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-
- \begin{center}  
- \begin{tabular}{ccc}
- $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
-$\hflat r$ & $=$ & $ r$
-\end{tabular}
-\end{center}s
-Again these definitions are tailor-made for dealing with alternatives that have
-originated from a star's derivatives, so we don't attempt to open up all possible 
-regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
-elements.
-We give a predicate for such "star-created" regular expressions:
-\begin{center}
-\begin{tabular}{lcr}
-         &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
- $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
- \end{tabular}
- \end{center}
- 
- These definitions allows us the flexibility to talk about 
- regular expressions in their most convenient format,
- for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
- instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
- These definitions help express that certain classes of syntatically 
- distinct regular expressions are actually the same under simplification.
- This is not entirely true for annotated regular expressions: 
- %TODO: bsimp bders \neq bderssimp
- \begin{center}
- $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
- \end{center}
- For bit-codes, the order in which simplification is applied
- might cause a difference in the location they are placed.
- If we want something like
- \begin{center}
- $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
- \end{center}
- Some "canonicalization" procedure is required,
- which either pushes all the common bitcodes to nodes
-  as senior as possible:
-  \begin{center}
-  $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
-  \end{center}
- or does the reverse. However bitcodes are not of interest if we are talking about
- the $\llbracket r \rrbracket$ size of a regex.
- Therefore for the ease and simplicity of producing a
- proof for a size bound, we are happy to restrict ourselves to 
- unannotated regular expressions, and obtain such equalities as
- \begin{lemma}
- $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
- \end{lemma}
- 
- \begin{proof}
- By using the rewriting relation $\rightsquigarrow$
- \end{proof}
- %TODO: rsimp sflat
-And from this we obtain a proof that a star's derivative will be the same
-as if it had all its nested alternatives created during deriving being flattened out:
- For example,
- \begin{lemma}
- $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
- \end{lemma}
- \begin{proof}
- By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
- \end{proof}
-% The simplification of a flattened out regular expression, provided it comes
-%from the derivative of a star, is the same as the one nested.
-

--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Sat Jul 09 14:11:07 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Wed Jul 13 08:35:45 2022 +0100
@@ -4,22 +4,32 @@
 
 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
 %and notations we 
-%use for describing the lexing algorithm by Sulzmann and Lu,
-%and then give the algorithm and its variant, and discuss
+% used for describing the lexing algorithm by Sulzmann and Lu,
+%and then give the algorithm and its variant and discuss
 %why more aggressive simplifications are needed. 
 
 In this chapter, we define the basic notions 
 for regular languages and regular expressions.
-This is essentially a description in "English"
-of your formalisation in Isabelle/HOL.
-We also give the definition of what $\POSIX$ lexing means.
+This is essentially a description in ``English"
+of our formalisation in Isabelle/HOL.
+We also give the definition of what $\POSIX$ lexing means, 
+followed by an algorithm by Sulzmanna and Lu\parencite{Sulzmann2014} 
+that produces the output conforming
+to the $\POSIX$ standard.
+It is also worth mentioning that
+we choose to use the ML-style notation
+for function applications, where
+the parameters of a function is not enclosed
+inside a pair of parentheses (e.g. $f \;x \;y$
+instead of $f(x,\;y)$). This is mainly
+to make the text visually more concise.
 
 \section{Basic Concepts}
-Usually formal language theory starts with an alphabet 
+Usually, formal language theory starts with an alphabet 
 denoting a set of characters.
 Here we just use the datatype of characters from Isabelle,
 which roughly corresponds to the ASCII characters.
-In what follows we shall leave the information about the alphabet
+In what follows, we shall leave the information about the alphabet
 implicit.
 Then using the usual bracket notation for lists,
 we can define strings made up of characters: 
@@ -30,7 +40,7 @@
 \end{center}
 Where $c$ is a variable ranging over characters.
 Strings can be concatenated to form longer strings in the same
-way as we concatenate two lists, which we write as @.
+way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
 We omit the precise 
 recursive definition here.
 We overload this concatenation operator for two sets of strings:
@@ -48,8 +58,8 @@
 $A^{n+1}$ & $\dn$ & $A @ A^n$
 \end{tabular}
 \end{center}
-The union of all the natural number powers of a language   
-is usually defined as the Kleene star operator:
+The union of all powers of a language   
+can be used to define the Kleene star operator:
 \begin{center}
 \begin{tabular}{lcl}
  $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
@@ -57,19 +67,19 @@
 \end{center}
 
 \noindent
-However, to obtain a convenient induction principle 
+However, to obtain a more convenient induction principle 
 in Isabelle/HOL, 
 we instead define the Kleene star
 as an inductive set: 
 
 \begin{center}
 \begin{mathpar}
-\inferrule{}{[] \in A*\\}
+	\inferrule{\mbox{}}{[] \in A*\\}
 
-\inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
+\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
 \end{mathpar}
 \end{center}
-\ChristianComment{Yes, used the inferrule command in mathpar}
+\noindent
 We also define an operation of "chopping off" a character from
 a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
 \begin{center}
@@ -89,10 +99,10 @@
 which is essentially the left quotient $A \backslash L$ of $A$ against 
 the singleton language with $L = \{w\}$
 in formal language theory.
-However for the purposes here, the $\textit{Ders}$ definition with 
+However, for the purposes here, the $\textit{Ders}$ definition with 
 a single string is sufficient.
 
-With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
+With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
 we have a  few properties of how the language derivative can be defined using 
 sub-languages.
 \begin{lemma}
@@ -113,26 +123,28 @@
 $\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
 \end{lemma}
 \begin{proof}
+There are too inclusions to prove:
 \begin{itemize}
-\item{$\subseteq$}
-\noindent
+\item{$\subseteq$}:\\
 The set 
 \[ \{s \mid c :: s \in A*\} \]
 is enclosed in the set
-\[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
+\[ \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
 because whenever you have a string starting with a character 
 in the language of a Kleene star $A*$, 
 then that character together with some sub-string
 immediately after it will form the first iteration, 
 and the rest of the string will 
 be still in $A*$.
-\item{$\supseteq$}
+\item{$\supseteq$}:\\
 Note that
 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
-and 
+hold.
+Also this holds:
 \[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
-where the $\textit{RHS}$ of the above equatioin can be rewritten
-as \[ (\Der \; c\; A) @ A* \cup A' \], $A'$ being a possibly empty set.
+where the $\textit{RHS}$ can be rewritten
+as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
+which of course contains $\Der \; c \; A @ A*$.
 \end{itemize}
 \end{proof}
 
@@ -141,7 +153,7 @@
 for regular languages, we need to first give definitions for regular expressions.
 
 \subsection{Regular Expressions and Their Meaning}
-The basic regular expressions  are defined inductively
+The \emph{basic regular expressions} are defined inductively
  by the following grammar:
 \[			r ::=   \ZERO \mid  \ONE
 			 \mid  c  
@@ -150,81 +162,127 @@
 			 \mid r^*         
 \]
 \noindent
-We call them basic because we might introduce
-more constructs later such as negation
+We call them basic because we will introduce
+additional constructors in later chapters such as negation
 and bounded repetitions.
-We defined the regular expression containing
-nothing as $\ZERO$, note that some authors
-also use $\phi$ for that.
-Similarly, the regular expression denoting the 
-singleton set with only $[]$ is sometimes 
-denoted by $\epsilon$, but we use $\ONE$ here.
-
-The language or set of strings denoted 
-by regular expressions are defined as
+We use $\ZERO$ for the regular expression that
+matches no string, and $\ONE$ for the regular
+expression that matches only the empty string\footnote{
+some authors
+also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
+but we prefer our notation}. 
+The sequence regular expression is written $r_1\cdot r_2$
+and sometimes we omit the dot if it is clear which
+regular expression is meant; the alternative
+is written $r_1 + r_2$.
+The \emph{language} or meaning of 
+a regular expression is defined recursively as
+a set of strings:
 %TODO: FILL in the other defs
 \begin{center}
 \begin{tabular}{lcl}
-$L \; (\ZERO)$ & $\dn$ & $\phi$\\
-$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
-$L \; (c)$ & $\dn$ & $\{[c]\}$\\
-$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
-$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
-$L \; (r^*)$ & $\dn$ & $ (L(r))^*$
+$L \; \ZERO$ & $\dn$ & $\phi$\\
+$L \; \ONE$ & $\dn$ & $\{[]\}$\\
+$L \; c$ & $\dn$ & $\{[c]\}$\\
+$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
+$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
+$L \; r^*$ & $\dn$ & $ (L\;r)*$
 \end{tabular}
 \end{center}
 \noindent
-Which is also called the "language interpretation" of
-a regular expression.
-
 Now with semantic derivatives of a language and regular expressions and
 their language interpretations in place, we are ready to define derivatives on regexes.
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
-
-\ChristianComment{Hi this part I want to keep the ordering as is, so that it keeps the 
-readers engaged with a story how we got to the definition of $\backslash$, rather 
-than first "overwhelming" them with the definition of $\nullable$.}
-
-The language derivative acts on a string set and chops off a character from
-all strings in that set, we want to define a derivative operation on regular expressions
-so that after derivative $L(r\backslash c)$ 
-will look as if it was obtained by doing a language derivative on $L(r)$:
+%Recall, the language derivative acts on a set of strings
+%and essentially chops off a particular character from
+%all strings in that set, Brzozowski defined a derivative operation on regular expressions
+%so that after derivative $L(r\backslash c)$ 
+%will look as if it was obtained by doing a language derivative on $L(r)$:
+Recall that the semantic derivative acts on a set of 
+strings. Brzozowski noticed that this operation
+can be ``mirrored" on regular expressions which
+he calls the derivative of a regular expression $r$
+with respect to a character $c$, written
+$r \backslash c$.
+He defined this operation such that the following property holds:
 \begin{center}
-\[
-r\backslash c \dn ?
-\]
-so that
-\[
-L(r \backslash c) = \Der \; c \; L(r) ?
-\]
-\end{center}
-So we mimic the equalities we have for $\Der$ on language concatenation
 
 \[
-\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B\\
+L(r \backslash c) = \Der \; c \; L(r)
 \]
-to get the derivative for sequence regular expressions:
-\[
-(r_1 \cdot r_2 ) \backslash c = \textit{if}\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c \textit{else} (r_1 \backslash c) \cdot r_2
-\]
+\end{center}
+\noindent
+For example in the sequence case we have 
+\begin{center}
+	\begin{tabular}{lcl}
+		$\Der \; c \; (A @ B)$ & $\dn$ & 
+		$ \textit{if} \;\,  [] \in A \; 
+		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
+		\Der \; c\; B$\\
+		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
+	\end{tabular}
+\end{center}
+\noindent
+This can be translated to  
+regular expressions in the following 
+manner:
+\begin{center}
+	\begin{tabular}{lcl}
+		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
+		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
+	\end{tabular}
+\end{center}
 
 \noindent
-and language Kleene star:
+And similarly, the Kleene star's semantic derivative
+can be expressed as
 \[
-\textit{Der} \;c \;A* = (\textit{Der}\; c A) @ (A*)
+	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
 \]
-to get derivative of the Kleene star regular expression:
+which translates to
 \[
-r^* \backslash c = (r \backslash c)\cdot r^*
+	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
 \]
-Note that although we can formalise the boolean predicate
-$[] \in L(r_1)$ without problems, if we want a function that works
-computationally, then we would have to define a function that tests
-whether an empty string is in the language of a regular expression.
-We call such a function $\nullable$:
-
-
-
+In the above definition of $(r_1\cdot r_2) \backslash c$,
+the $\textit{if}$ clause's
+boolean condition 
+$[] \in L(r_1)$ needs to be 
+somehow recursively computed.
+We call such a function that checks
+whether the empty string $[]$ is 
+in the language of a regular expression $\nullable$:
+\begin{center}
+		\begin{tabular}{lcl}
+			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
+			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
+			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
+			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
+			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
+			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
+		\end{tabular}
+\end{center}
+\noindent
+The $\ZERO$ regular expression
+does not contain any string and
+therefore is not \emph{nullable}.
+$\ONE$ is \emph{nullable} 
+by definition. 
+The character regular expression $c$
+corresponds to the singleton set $\{c\}$, 
+and therefore does not contain the empty string.
+The alternative regular expression is nullable
+if at least one of its children is nullable.
+The sequence regular expression
+would require both children to have the empty string
+to compose an empty string, and the Kleene star
+is always nullable because it naturally
+contains the empty string. 
+  
+The derivative function, written $r\backslash c$, 
+defines how a regular expression evolves into
+a new one after all the string it contains is acted on: 
+if it starts with $c$, then the character is chopped of,
+if not, that string is removed.
 \begin{center}
 \begin{tabular}{lcl}
 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
@@ -239,67 +297,32 @@
 \end{tabular}
 \end{center}
 \noindent
-The function derivative, written $r\backslash c$, 
-defines how a regular expression evolves into
-a new regular expression after all the string it contains
-is chopped off a certain head character $c$.
-The most involved cases are the sequence 
-and star case.
+The most involved cases are the sequence case
+and the star case.
 The sequence case says that if the first regular expression
-contains an empty string then the second component of the sequence
-might be chosen as the target regular expression to be chopped
-off its head character.
-The star regular expression's derivative unwraps the iteration of
-regular expression and attaches the star regular expression
-to the sequence's second element to make sure a copy is retained
-for possible more iterations in later phases of lexing.
-
-
-To test whether $[] \in L(r_1)$, we need the $\nullable$ function,
-which tests whether the empty string $""$ 
-is in the language of $r$:
-
-
-\begin{center}
-		\begin{tabular}{lcl}
-			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
-			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
-			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
-			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
-			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
-			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
-		\end{tabular}
-\end{center}
-\noindent
-The empty set does not contain any string and
-therefore not the empty string, the empty string 
-regular expression contains the empty string
-by definition, the character regular expression
-is the singleton that contains character only,
-and therefore does not contain the empty string,
-the alternative regular expression (or "or" expression)
-might have one of its children regular expressions
-being nullable and any one of its children being nullable
-would suffice. The sequence regular expression
-would require both children to have the empty string
-to compose an empty string and the Kleene star
-operation naturally introduced the empty string. 
-  
-We have the following property where the derivative on regular 
-expressions coincides with the derivative on a set of strings:
-
-\begin{lemma}
+contains an empty string, then the second component of the sequence
+needs to be considered, as its derivative will contribute to the
+result of this derivative.
+The star regular expression $r^*$'s derivative 
+unwraps one iteration of $r$, turns it into $r\backslash c$,
+and attaches the original $r^*$
+after $r\backslash c$, so that 
+we can further unfold it as many times as needed.
+We have the following correspondence between 
+derivatives on regular expressions and
+derivatives on a set of strings:
+\begin{lemma}\label{derDer}
 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
 \end{lemma}
 
 \noindent
 The main property of the derivative operation
-that enables us to reason about the correctness of
-an algorithm using derivatives is 
+(that enables us to reason about the correctness of
+derivative-based matching)
+is 
 
 \begin{lemma}\label{derStepwise}
-$c\!::\!s \in L(r)$ holds
-if and only if $s \in L(r\backslash c)$.
+	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
 \end{lemma}
 
 \noindent
@@ -314,12 +337,14 @@
 \end{center}
 
 \noindent
-When there is no ambiguity we will use  $\backslash$ to denote
+When there is no ambiguity, we will 
+omit the subscript and use $\backslash$ instead
+of $\backslash_r$ to denote
 string derivatives for brevity.
 Brzozowski's  regular-expression matcher algorithm can then be described as:
 
 \begin{definition}
-$\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
+$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
 \end{definition}
 
 \noindent
@@ -336,17 +361,15 @@
  It can  be
 relatively  easily shown that this matcher is correct:
 \begin{lemma}
-$\textit{match} \; s\; r  = \textit{true} \Longleftrightarrow s \in L(r)$
+	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
 \end{lemma}
 \begin{proof}
-By the stepwise property of $\backslash$ (\ref{derStepwise})
+	By the stepwise property of derivatives (lemma \ref{derStepwise})
+	and lemma \ref{derDer}. 
 \end{proof}
 \noindent
-If we implement the above algorithm naively, however,
-the algorithm can be excruciatingly slow.
-
-
-\begin{figure}
+\begin{center}
+	\begin{figure}
 \begin{tikzpicture}
 \begin{axis}[
     xlabel={$n$},
@@ -360,84 +383,93 @@
 \end{tikzpicture} 
 \caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
 \end{figure}
-   
+\end{center} 
 \noindent
-For this we need to introduce certain 
+If we implement the above algorithm naively, however,
+the algorithm can be excruciatingly slow, as shown in 
+\ref{NaiveMatcher}.
+Note that both axes are in logarithmic scale.
+Around two dozens characters
+would already explode the matcher on regular expression 
+$(a^*)^*b$.
+For this, we need to introduce certain 
 rewrite rules for the intermediate results,
 such as $r + r \rightarrow r$,
 and make sure those rules do not change the 
 language of the regular expression.
-We have a simplification function (that is as simple as possible
-while having much power on making a regex simpler):
-\begin{verbatim}
-def simp(r: Rexp) : Rexp = r match {
-  case SEQ(r1, r2) => 
-    (simp(r1), simp(r2)) match {
-      case (ZERO, _) => ZERO
-      case (_, ZERO) => ZERO
-      case (ONE, r2s) => r2s
-      case (r1s, ONE) => r1s
-      case (r1s, r2s) => SEQ(r1s, r2s)
-    }
-  case ALTS(r1, r2) => {
-    (simp(r1), simp(r2)) match {
-      case (ZERO, r2s) => r2s
-      case (r1s, ZERO) => r1s
-      case (r1s, r2s) =>
-        if(r1s == r2s) r1s else ALTS(r1s, r2s)
-    }
-  }
-  case r => r
-}
-\end{verbatim}
-If we repeatedly incorporate these 
-rules during the matching algorithm, 
-we have a lexer with simplification:
-\begin{verbatim}
-def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
-  case Nil => simp(r)
-  case c :: cs => ders_simp(cs, simp(der(c, r)))
-}
-
-def simp_matcher(s: String, r: Rexp) : Boolean = 
-  nullable(ders_simp(s.toList, r))
-
-\end{verbatim}
-\noindent
-After putting in those rules, the example of \ref{NaiveMatcher}
-is now very tame in the length of inputs:
-
-
+One simpled-minded simplification function
+that achieves these requirements is given below:
+\begin{center}
+	\begin{tabular}{lcl}
+		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
+		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
+					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
+					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
+					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
+					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
+					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
+		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
+				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
+				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
+				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
+		$\simp \; r$ & $\dn$ & $r$
+				   
+	\end{tabular}
+\end{center}
+If we repeatedly apply this simplification  
+function during the matching algorithm, 
+we have a matcher with simplification:
+\begin{center}
+	\begin{tabular}{lcl}
+		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
+		$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
+		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
+	\end{tabular}
+\end{center}
+\begin{figure}
 \begin{tikzpicture}
 \begin{axis}[
     xlabel={$n$},
     ylabel={time in secs},
     ymode = log,
     xmode = log,
+    grid = both,
     legend entries={Matcher With Simp},  
     legend pos=north west,
     legend cell align=left]
 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
 \end{axis}
-\end{tikzpicture} \label{fig:BetterMatcher}
-
+\end{tikzpicture} 
+\caption{$(a^*)^*b$ 
+against 
+$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
+\end{figure}
+\noindent
+The running time of $\textit{ders}\_\textit{simp}$
+on the same example of \ref{NaiveMatcher}
+is now very tame in terms of the length of inputs,
+as shown in \ref{BetterMatcher}.
 
-\noindent
-Note how the x-axis is in logarithmic scale.
 Building derivatives and then testing the existence
 of empty string in the resulting regular expression's language,
-and add simplification rules when necessary.
+adding simplifications when necessary.
 So far, so good. But what if we want to 
 do lexing instead of just getting a YES/NO answer?
-\citeauthor{Sulzmann2014} first came up with a nice and 
+Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and 
 elegant (arguably as beautiful as the definition of the original derivative) solution for this.
 
 \section{Values and the Lexing Algorithm by Sulzmann and Lu}
-Here we present the hybrid phases of a regular expression lexing 
-algorithm using the function $\inj$, as given by Sulzmann and Lu.
-They first defined the datatypes for storing the 
-lexing information called a \emph{value} or
-sometimes also \emph{lexical value}.  These values and regular
+In this section, we present a two-phase regular expression lexing 
+algorithm.
+The first phase takes successive derivatives with 
+respect to the input string,
+and the second phase does the reverse, \emph{injecting} back
+characters, in the meantime constructing a lexing result.
+We will introduce the injection phase in detail slightly
+later, but as a preliminary we have to first define 
+the datatype for lexing results, 
+called \emph{value} or
+sometimes also \emph{lexical value}.  Values and regular
 expressions correspond to each other as illustrated in the following
 table:
 
@@ -466,56 +498,116 @@
 		\end{tabular}
 	\end{tabular}
 \end{center}
-
 \noindent
-We have a formal binary relation for telling whether the structure
-of a regular expression agrees with the value.
+A value has an underlying string, which 
+can be calculated by the ``flatten" function $|\_|$:
+\begin{center}
+	\begin{tabular}{lcl}
+		$|\Empty|$ & $\dn$ &  $[]$\\
+		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
+		$|\Seq(v_1, v_2)|$ & $ \dn$ & $ v_1| @ |v_2|$\\
+		$|\Left(v)|$ & $ \dn$ & $ |v|$\\
+		$|\Right(v)|$ & $ \dn$ & $ |v|$\\
+		$|\Stars([])|$ & $\dn$ & $[]$\\
+		$|\Stars(v::vs)|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
+	\end{tabular}
+\end{center}
+Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
+to indicate that a value $v$ could be generated from a lexing algorithm
+with input $r$. They call it the value inhabitation relation. 
 \begin{mathpar}
-\inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
-\inferrule{}{\vdash \Empty :  \ONE} \hspace{2em}
-\inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
-\end{mathpar}
+	\inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
+
+	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
+
+\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
+
+\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}
+
+\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}
 
-Building on top of Sulzmann and Lu's attempt to formalise the 
-notion of POSIX lexing rules \parencite{Sulzmann2014}, 
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
-POSIX matching as a ternary relation recursively defined in a
-natural deduction style.
-The formal definition of a $\POSIX$ value $v$ for a regular expression
+\inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars(vs):r^*}
+\end{mathpar}
+\noindent
+The condition $|v| \neq []$ in the premise of star's rule
+is to make sure that for a given pair of regular 
+expression $r$ and string $s$, the number of values 
+satisfying $|v| = s$ and $\vdash v:r$ is finite.
+Given the same string and regular expression, there can be
+multiple values for it. For example, both
+$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
+$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
+and the values both flatten to $abc$.
+Lexers therefore have to disambiguate and choose only
+one of the values to output. $\POSIX$ is one of the
+disambiguation strategies that is widely adopted.
+
+Ausaf and Urban\parencite{AusafDyckhoffUrban2016} 
+formalised the property 
+as a ternary relation.
+The $\POSIX$ value $v$ for a regular expression
 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
-in the following set of rules:
-\ChristianComment{Will complete later}
-\newcommand*{\inference}[3][t]{%
-   \begingroup
-   \def\and{\\}%
-   \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
-   #2 \\
-   \hline
-   #3
-   \end{tabular}%
-   \endgroup
-}
-\begin{center}
-\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
-\end{center}
+in the following set of rules\footnote{The names of the rules are used
+as they were originally given in \cite{AusafDyckhoffUrban2016}}:
+\noindent
+\begin{figure}
+\begin{mathpar}
+	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
+		
+	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
+
+	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
+
+	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
+
+	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
+		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
+		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
+	\Seq \; v_1 \; v_2}
+
+	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
+
+	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
+		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
+		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
+	(v::vs)}
+\end{mathpar}
+\caption{POSIX Lexing Rules}
+\end{figure}
 \noindent
-The above $\POSIX$ rules could be explained intuitionally as
+The above $\POSIX$ rules follows the intuition described below: 
 \begin{itemize}
-\item
-match the leftmost regular expression when multiple options of matching
-are available  
-\item 
-always match a subpart as much as possible before proceeding
-to the next token.
+	\item (Left Priority)\\
+		Match the leftmost regular expression when multiple options of matching
+		are available.
+	\item (Maximum munch)\\
+		Always match a subpart as much as possible before proceeding
+		to the next token.
 \end{itemize}
-
-The reason why we are interested in $\POSIX$ values is that they can
-be practically used in the lexing phase of a compiler front end.
+\noindent
+These disambiguation strategies can be 
+quite practical.
 For instance, when lexing a code snippet 
-$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
-as an identifier rather than a keyword.
-\ChristianComment{Do I also introduce lexical values $LV$ here?}
-We know that $\POSIX$ values are also part of the normal values:
+\[ 
+	\textit{iffoo} = 3
+\]
+using the regular expression (with 
+keyword and identifier having their 
+usualy definitions on any formal
+language textbook, for instance
+keyword is a nonempty string starting with letters 
+followed by alphanumeric characters or underscores):
+\[
+	\textit{keyword} + \textit{identifier},
+\]
+we want $\textit{iffoo}$ to be recognized
+as an identifier rather than a keyword (if)
+followed by
+an identifier (foo).
+POSIX lexing achieves this.
+
+We know that a $\POSIX$ value is also a normal underlying
+value:
 \begin{lemma}
 $(r, s) \rightarrow v \implies \vdash v: r$
 \end{lemma}
@@ -527,10 +619,13 @@
 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
 \end{lemma}
 \begin{proof}
-By induction on $s$, $r$ and $v_1$. The induction principle is
-the \POSIX rules. Each case is proven by a combination of
-the induction rules for $\POSIX$ values and the inductive hypothesis.
-Probably the most cumbersome cases are the sequence and star with non-empty iterations.
+By induction on $s$, $r$ and $v_1$. The inductive cases
+are all the POSIX rules. 
+Each case is proven by a combination of
+the induction rules for $\POSIX$ 
+values and the inductive hypothesis.
+Probably the most cumbersome cases are 
+the sequence and star with non-empty iterations.
 
 We give the reasoning about the sequence case as follows:
 When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$, 
@@ -544,15 +639,14 @@
 which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$
 is the same as $\Seq(v_1, v_2)$. 
 \end{proof}
-Now we know what a $\POSIX$ value is, the problem is how do we achieve 
+Now we know what a $\POSIX$ value is; the problem is how do we achieve 
 such a value in a lexing algorithm, using derivatives?
 
 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
 
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
-derivatives---see \ref{graph:successive_ders}). In this second phase, a POSIX value 
-is generated if the regular expression matches the string.
+derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string.
 Two functions are involved: $\inj$ and $\mkeps$.
 The function $\mkeps$ constructs a value from the last
 one of all the successive derivatives:
@@ -570,13 +664,13 @@
 
 	\begin{center}
 		\begin{tabular}{lcl}
-			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
-			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
-			& \textit{if} $\nullable(r_{1})$\\ 
-			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
-			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
-			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
-			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
+			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
+			$\mkeps \; r_{1}+r_{2}$	& $\dn$ 
+						& \textit{if} ($\nullable \; r_{1})$\\ 
+			& & \textit{then} $\Left (\mkeps \; r_{1})$\\ 
+			& & \textit{else} $\Right(\mkeps \; r_{2})$\\
+			$\mkeps \; r_1\cdot r_2$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
+			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
 		\end{tabular}
 	\end{center}
 
@@ -669,9 +763,10 @@
  The central property of the $\lexer$ is that it gives the correct result by
  $\POSIX$ standards:
  \begin{theorem}
- \begin{tabular}{l}
- $\lexer \; r \; s = \Some(v) \Longleftrightarrow (r, \; s) \rightarrow v$\\
- $\lexer \;r \; s = \None \Longleftrightarrow \neg(\exists v. (r, s) \rightarrow v)$
+	 The $\lexer$ based on derivatives and injections is both sound and complete:
+ \begin{tabular}{lcl}
+	 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
+	 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
  \end{tabular}
  \end{theorem}
  
@@ -776,7 +871,7 @@
 of a match. This means Sulzmann and Lu's injection-based algorithm 
  exponential by nature.
 Somehow one has to make sure which
- lexical values are $\POSIX$ and need to be kept in a lexing algorithm.
+ lexical values are $\POSIX$ and must be kept in a lexing algorithm.
 
 
  For example, the above $r= (a^*\cdot a^*)^*$  and 
@@ -788,7 +883,7 @@
 Can we not create those intermediate values $v_1,\ldots v_n$,
 and get the lexing information that should be already there while
 doing derivatives in one pass, without a second injection phase?
-In the meantime, can we make sure that simplifications
+In the meantime, can we ensure that simplifications
 are easily handled without breaking the correctness of the algorithm?
 
 Sulzmann and Lu solved this problem by

--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sat Jul 09 14:11:07 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Wed Jul 13 08:35:45 2022 +0100
@@ -21,6 +21,7 @@
 \newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
 \newcommand{\bderssimp}[2]{#1 \backslash_{bsimps} #2}
 \newcommand{\rderssimp}[2]{#1 \backslash_{rsimp} #2}
+\def\derssimp{\textit{ders}\_\textit{simp}}
 \def\rders{\textit{rders}}
 \newcommand{\bders}[2]{#1 \backslash #2}
 \newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
@@ -39,6 +40,7 @@
 
 \newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
 
+\def\case{\textit{case}}
 \def\sequal{\stackrel{\mbox{\scriptsize rsimp}}{=}}
 \def\rsimpalts{\textit{rsimp}_{ALTS}}
 \def\good{\textit{good}}
@@ -82,7 +84,7 @@
 \def\blexer{\textit{blexer}}
 \def\flex{\textit{flex}}
 \def\inj{\mathit{inj}}
-\def\Empty{\mathit{Empty}}
+\def\Empty{\textit{Empty}}
 \def\Left{\mathit{Left}}
 \def\Right{\mathit{Right}}
 \def\Stars{\mathit{Stars}}

--- a/ChengsongTanPhdThesis/main.tex	Sat Jul 09 14:11:07 2022 +0100
+++ b/ChengsongTanPhdThesis/main.tex	Wed Jul 13 08:35:45 2022 +0100
@@ -46,7 +46,8 @@
 \usepackage{mathpazo} % Use the Palatino font by default
 \usepackage{hyperref}
 \usepackage{lipsum}
-\usepackage[backend=bibtex,style=authoryear,natbib=true]{biblatex} % Use the bibtex backend with the authoryear citation style (which resembles APA)
+\usepackage[backend=bibtex]{biblatex} % Use the bibtex backend with the authoryear citation style (which resembles APA)
+%style=authoryear, natbib=true 
 \usepackage{stmaryrd}
 \usepackage{caption}
 
@@ -71,7 +72,11 @@
 \usepackage{tikz}
 \usetikzlibrary{automata, positioning, calc}
 \usetikzlibrary{fit,
-                shapes.geometric}
+                shapes.geometric,
+		patterns,
+		backgrounds,
+		graphs}
+\usetikzlibrary{babel}
 \usepackage{mathpartir}
 \usepackage{stackrel}

--- a/thys2/blexer2.sc	Sat Jul 09 14:11:07 2022 +0100
+++ b/thys2/blexer2.sc	Wed Jul 13 08:35:45 2022 +0100
@@ -781,7 +781,7 @@
 
 // fun pAKC_aux :: "arexp list ⇒ arexp ⇒ rexp ⇒ arexp"
 //   where
-// "pAKC_aux rsa r ctx = (if (regConcat (SEQ (erase r) ( ctx) )) ⊆ (regConcat (erase (AALTs [] rsa))) then AZERO else
+// "pAKC_aux rsa r ctx = (if (seqFold (SEQ (erase r) ( ctx) )) ⊆ (seqFold (erase (AALTs [] rsa))) then AZERO else
 //                           case r of (ASEQ bs r1 r2) ⇒ 
 //                             bsimp_ASEQ bs (pAKC_aux rsa r1 (SEQ  (erase r2) ( ctx) )) r2   |
 //                                     (AALTs bs rs) ⇒ 
@@ -796,13 +796,13 @@
 
 
 //the "fake" Language interpretation: just concatenates!
-def regConcat(acc: Rexp, rs: List[Rexp]) : Rexp = rs match {
+def seqFold(acc: Rexp, rs: List[Rexp]) : Rexp = rs match {
   case Nil => acc
   case r :: rs1 => 
     // if(acc == ONE) 
-    //   regConcat(r, rs1) 
+    //   seqFold(r, rs1) 
     // else
-      regConcat(SEQ(acc, r), rs1)
+      seqFold(SEQ(acc, r), rs1)
 }
 
 def rprint(r: Rexp) : Unit = println(shortRexpOutput(r))
@@ -819,9 +819,9 @@
   // println("ctx---------")
   // rsprint(ctx)
   // println("ctx---------end")
-  // rsprint(breakIntoTerms(regConcat(erase(r), ctx)).map(oneSimp))
+  // rsprint(breakIntoTerms(seqFold(erase(r), ctx)).map(oneSimp))
 
-  if (breakIntoTerms(regConcat(erase(r), ctx)).map(oneSimp).forall(acc.contains)) {//acc.flatMap(breakIntoTerms
+  if (breakIntoTerms(seqFold(erase(r), ctx)).map(oneSimp).forall(acc.contains)) {//acc.flatMap(breakIntoTerms
     AZERO
   }
   else{
@@ -928,12 +928,12 @@
 }
 
 def attachCtx(r: ARexp, ctx: List[Rexp]) : Set[Rexp] = {
-  turnIntoTerms((regConcat(erase(r), ctx)))
+  turnIntoTerms((seqFold(erase(r), ctx)))
     .toSet
 }
 
 def attachCtxcc(r: Rexp, ctx: List[Rexp]) : Set[Rexp] =
-  turnIntoTerms(regConcat(r, ctx)).toSet
+  turnIntoTerms(seqFold(r, ctx)).toSet
 
 def ABIncludedByC[A, B, C](a: A, b: B, c: C, f: (A, B) => C, 
 subseteqPred: (C, C) => Boolean) : Boolean = {

--- a/thys3/BlexerSimp.thy	Sat Jul 09 14:11:07 2022 +0100
+++ b/thys3/BlexerSimp.thy	Wed Jul 13 08:35:45 2022 +0100
@@ -77,23 +77,27 @@
 | "turnIntoTerms (SEQ r1 r2) = concat (map (\<lambda>r11. furtherSEQ r11 r2) (turnIntoTerms r1))"
 | "turnIntoTerms r = [r]"
 
-fun regConcat :: "rexp \<Rightarrow> rexp list \<Rightarrow> rexp" where
-  "regConcat acc [] = acc"
-| "regConcat ONE (r # rs1) = regConcat r rs1"
-| "regConcat acc (r # rs1) = regConcat (SEQ acc r) rs1"
+abbreviation "tint \<equiv> turnIntoTerms"
+
+fun seqFold :: "rexp \<Rightarrow> rexp list \<Rightarrow> rexp" where
+  "seqFold acc [] = acc"
+| "seqFold ONE (r # rs1) = seqFold r rs1"
+| "seqFold acc (r # rs1) = seqFold (SEQ acc r) rs1"
+
 
 fun attachCtx :: "arexp \<Rightarrow> rexp list \<Rightarrow> rexp set" where
-  "attachCtx r ctx = set (turnIntoTerms (regConcat (erase r) ctx))"
-
+  "attachCtx r ctx = set (turnIntoTerms (seqFold (erase r) ctx))"
 
 
 fun rs1_subseteq_rs2 :: "rexp set \<Rightarrow> rexp set \<Rightarrow> bool" where
   "rs1_subseteq_rs2 rs1 rs2 = (rs1 \<subseteq> rs2)"
 
+
 fun notZero :: "arexp \<Rightarrow> bool" where
   "notZero AZERO = True"
 | "notZero _ = False"
 
+
 fun tset :: "arexp list \<Rightarrow> rexp set" where
   "tset rs = set (concat (map turnIntoTerms (map erase rs)))"
 
@@ -101,10 +105,14 @@
 fun prune6 :: "rexp set \<Rightarrow> arexp \<Rightarrow> rexp list \<Rightarrow> arexp" where
   "prune6 acc a ctx = (if (ABIncludedByC a ctx acc attachCtx rs1_subseteq_rs2) then AZERO else 
                         (case a of (ASEQ bs r1 r2) \<Rightarrow> bsimp_ASEQ bs (prune6 acc r1 (erase r2 # ctx)) r2
-                                 | AALTs bs rs0 \<Rightarrow> bsimp_AALTs bs (filter notZero (map (\<lambda>r.(prune6 acc r ctx)) rs0)))  )"
+                                 | AALTs bs rs0 \<Rightarrow> bsimp_AALTs bs (filter notZero (map (\<lambda>r.(prune6 acc r ctx)) rs0))
+                                 | r \<Rightarrow> r
+                        )
+                      )
+            "
 
 abbreviation
-  "p acc r \<equiv> prune6 (set (concat (map turnIntoTerms (map erase acc)) ) ) r Nil"
+  "p6 acc r \<equiv> prune6 (tset acc) r Nil"
 
 
 fun dB6 :: "arexp list \<Rightarrow> rexp set \<Rightarrow> arexp list" where
@@ -154,8 +162,9 @@
 | ss4:  "(AZERO # rs) s\<leadsto> rs"
 | ss5:  "(AALTs bs1 rs1 # rsb) s\<leadsto> ((map (fuse bs1) rs1) @ rsb)"
 | ss6:  "L (erase a2) \<subseteq> L (erase a1) \<Longrightarrow> (rsa@[a1]@rsb@[a2]@rsc) s\<leadsto> (rsa@[a1]@rsb@rsc)"
-| ss7:  " (as @ [a] @ as1) s\<leadsto> (as @ [prune6 (set (concat (map (\<lambda>r. turnIntoTerms (erase r)) as))) a Nil] @ as1)"
+| ss7:  " (as @ [a] @ as1) s\<leadsto> (as @ [p6 as a] @ as1)"
 
+thm tset.simps
 
 inductive 
   rrewrites:: "arexp \<Rightarrow> arexp \<Rightarrow> bool" ("_ \<leadsto>* _" [100, 100] 100)
@@ -291,10 +300,10 @@
   apply(subgoal_tac "(cc @ a # cc') s\<leadsto>* (cc @ a # dB6 cc' (tset (cc @ [a])))")
    prefer 2
   apply (metis append.assoc append.left_neutral append_Cons)
-  apply(subgoal_tac "(cc @ a # dB6 cc' (tset (cc @ [a]))) s\<leadsto>* (cc @ (p cc a) # dB6 cc' (tset (cc @ [a])))") 
+  apply(subgoal_tac "(cc @ a # dB6 cc' (tset (cc @ [a]))) s\<leadsto>* (cc @ (p6 cc a) # dB6 cc' (tset (cc @ [a])))") 
   sorry
 
-
+(*
 lemma ss6_stronger_aux:
   shows "(rs1 @ rs2) s\<leadsto>* (rs1 @ dB6 rs2 (set (map erase rs1)))"
   apply(induct rs2 arbitrary: rs1)
@@ -313,22 +322,83 @@
   prefer 2
   apply(drule_tac x="rs1 @ [a]" in meta_spec)
   apply(simp)
-  apply(drule_tac x="rs1" in meta_spec)
-  apply(subgoal_tac "(rs1 @ a # rs2) s\<leadsto>* (rs1 @ rs2)")
-  using srewrites_trans apply blast
-  apply(subgoal_tac "\<exists>rs1a rs1b. rs1 = rs1a @ [x] @ rs1b")
-  prefer 2
-  apply (simp add: split_list)
-  apply(erule exE)+
-  apply(simp)
-  using eq1_L rs_in_rstar ss
   sorry
+*)
 
 
 lemma ss6_stronger:
   shows "rs1 s\<leadsto>* dB6 rs1 {}"
+  using ss6
+  by (metis append_Nil concat.simps(1) list.set(1) list.simps(8) ss6_realistic tset.simps)
+
+ 
+lemma tint_fuse:
+  shows "tint (erase a) = tint (erase (fuse bs a))"
+  by (simp add: erase_fuse)
+
+lemma tint_fuse2:
+  shows " set (tint (erase a)) =
+     set (tint (erase (fuse bs a)))"
+  using tint_fuse by auto
+
+lemma fused_bits_at_head:
+  shows "fuse bs a = ASEQ bs1 a1 a2 \<Longrightarrow> \<exists>bs2. bs1 = bs @ bs2"
+  
+(* by (smt (verit) arexp.distinct(13) arexp.distinct(19) arexp.distinct(25) arexp.distinct(27) arexp.distinct(5) arexp.inject(3) fuse.elims)
+harmless sorry
+*)
+
   sorry
 
+thm seqFold.simps
+
+lemma seqFold_concats:
+  shows "r \<noteq> ONE \<Longrightarrow> seqFold r [r1] = SEQ r r1"
+  apply(case_tac r)
+       apply simp+
+done
+
+
+lemma "set (tint (seqFold (erase x42) [erase x43])) = 
+           set (tint (SEQ (erase x42) (erase x43)))"
+  apply(case_tac "erase x42 = ONE")
+   apply simp
+  apply simp
+
+lemma prune6_preserves_fuse:
+  shows "fuse bs (p6 as a) = p6 as (fuse bs a)"
+  using tint_fuse2
+  apply simp
+  apply(case_tac a)
+       apply simp
+  apply simp
+     apply simp
+
+  using fused_bits_at_head
+
+    apply simp
+  apply(case_tac "erase x42 = ONE")
+  apply simp
+
+  sorry
+
+corollary prune6_preserves_fuse_srewrite:
+  shows "(as @ map (fuse bs) [a] @ as2) s\<leadsto>* (as @ map (fuse bs) [p6 as a] @ as2)"
+  apply(subgoal_tac "map (fuse bs) [a] = [fuse bs a]")
+  apply(subgoal_tac "(as @ [fuse bs a] @ as2) s\<leadsto>* (as @ [ (p6 as (fuse bs a))] @ as2)")
+  using prune6_preserves_fuse apply auto[1]
+  using rs_in_rstar ss7 apply presburger
+  by simp
+
+lemma prune6_invariant_fuse:
+  shows "p6 as a = p6 (map (fuse bs) as) a"
+  apply simp
+  using erase_fuse by presburger
+
+
+lemma p6pfs_cor:
+  shows "(map (fuse bs) as @ map (fuse bs) [a] @ map (fuse bs) as1) s\<leadsto>* (map (fuse bs) as @ map (fuse bs) [p6 as a] @ map (fuse bs) as1)"
+  by (metis prune6_invariant_fuse prune6_preserves_fuse_srewrite)
 
 lemma rewrite_preserves_fuse: 
   shows "r2 \<leadsto> r3 \<Longrightarrow> fuse bs r2 \<leadsto> fuse bs r3"
@@ -358,7 +428,13 @@
   then show ?case 
     apply(simp only: map_append)
     by (smt (verit, best) erase_fuse list.simps(8) list.simps(9) rrewrite_srewrite.ss6 srewrites.simps)
+next
+  case (ss7 as a as1)
+  then show ?case
+    apply(simp only: map_append)
+    using p6pfs_cor by presburger
 qed (auto intro: rrewrite_srewrite.intros)
+  
 
 
 lemma rewrites_fuse:  
@@ -424,11 +500,14 @@
 lemma bnullable0:
 shows "r1 \<leadsto> r2 \<Longrightarrow> bnullable r1 = bnullable r2" 
   and "rs1 s\<leadsto> rs2 \<Longrightarrow> bnullables rs1 = bnullables rs2" 
-  apply(induct rule: rrewrite_srewrite.inducts)
-  apply(auto simp add:  bnullable_fuse)
-   apply (meson UnCI bnullable_fuse imageI)
-  using bnullable_correctness nullable_correctness by blast
-
+   apply(induct rule: rrewrite_srewrite.inducts)
+              apply simp
+             apply simp
+              apply (simp add: bnullable_fuse)
+  using bnullable.simps(5) apply presburger
+          apply simp
+         apply simp
+  sorry   
 
 
 
@@ -438,7 +517,7 @@
 using assms 
   apply(induction r1 r2 rule: rrewrites.induct)
   apply simp
-  using bnullable0(1) by auto
+  using bnullable0(1) by presburger
 
 lemma rewrite_bmkeps_aux: 
   shows "r1 \<leadsto> r2 \<Longrightarrow> (bnullable r1 \<and> bnullable r2 \<Longrightarrow> bmkeps r1 = bmkeps r2)"

author	Chengsong
	Wed, 13 Jul 2022 08:35:45 +0100
changeset 566	94604a5fd271
parent 565	0497408a3598 (diff)
parent 563	c92a41d9c4da (current diff)
child 567	28cb8089ec36