lexing: changeset 530:823d9b19d21c

--- a/ChengsongTanPhdThesis/Chapters/Chapter1.tex	Mon May 30 17:24:52 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter1.tex	Mon May 30 20:36:15 2022 +0100
@@ -69,6 +69,7 @@
 \def\distinctWith{\textit{distinctWith}}
 \def\lf{\textit{lf}}
 \def\PD{\textit{PD}}
+\def\suffix{\textit{Suffix}}
 
 
 \def\size{\mathit{size}}

--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Mon May 30 17:24:52 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Mon May 30 20:36:15 2022 +0100
@@ -277,10 +277,10 @@
 \end{definition}
 
 \noindent
-Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, 
+Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
 this algorithm presented graphically is as follows:
 
-\begin{equation}\label{graph:*}
+\begin{equation}\label{graph:successive_ders}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
 \end{tikzcd}
@@ -477,26 +477,145 @@
 always match a subpart as much as possible before proceeding
 to the next token.
 \end{itemize}
-The formal definition of a $\POSIX$ value can be described 
+The formal definition of a $\POSIX$ value $v$ for a regular expression
+$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
 in the following set of rules:
-\
-
+(TODO: write the entire set of inference rules for POSIX )
+\newcommand*{\inference}[3][t]{%
+   \begingroup
+   \def\and{\\}%
+   \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
+   #2 \\
+   \hline
+   #3
+   \end{tabular}%
+   \endgroup
+}
+\begin{center}
+\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
+\end{center}
 
- For example, the above example has the POSIX value
-$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
-The output of an algorithm we want would be a POSIX matching
-encoded as a value.
 The reason why we are interested in $\POSIX$ values is that they can
 be practically used in the lexing phase of a compiler front end.
 For instance, when lexing a code snippet 
 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
 as an identifier rather than a keyword.
 
+ For example, the above $r= (a^*\cdot a^*)^*$  and 
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
+$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
+The output of an algorithm we want would be a POSIX matching
+encoded as a value.
+
+
+
+
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
-derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
-is generated in case the regular expression matches  the string. 
-Pictorially, the Sulzmann and Lu algorithm is as follows:
+derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
+is generated in case the regular expression matches  the string.
+How can we construct a value out of regular expressions and character
+sequences only?
+Two functions are involved: $\inj$ and $\mkeps$.
+The function $\mkeps$ constructs a value from the last
+one of all the successive derivatives:
+\begin{ceqn}
+\begin{equation}\label{graph:mkeps}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
+	        & 	              & 	            & v_n       
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+
+It tells us how can an empty string be matched by a 
+regular expression, in a $\POSIX$ way:
+
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
+			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
+			& \textit{if} $\nullable(r_{1})$\\ 
+			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
+			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
+			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
+			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
+		\end{tabular}
+	\end{center}
+
+
+\noindent 
+We favour the left to match an empty string in case there is a choice.
+When there is a star for us to match the empty string,
+we simply give the $\Stars$ constructor an empty list, meaning
+no iterations are taken.
+
+
+After the $\mkeps$-call, we inject back the characters one by one in order to build
+the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
+($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
+After injecting back $n$ characters, we get the lexical value for how $r_0$
+matches $s$. The POSIX value is maintained throught out the process.
+For this Sulzmann and Lu defined a function that reverses
+the ``chopping off'' of characters during the derivative phase. The
+corresponding function is called \emph{injection}, written
+$\textit{inj}$; it takes three arguments: the first one is a regular
+expression ${r_{i-1}}$, before the character is chopped off, the second
+is a character ${c_{i-1}}$, the character we want to inject and the
+third argument is the value ${v_i}$, into which one wants to inject the
+character (it corresponds to the regular expression after the character
+has been chopped off). The result of this function is a new value. 
+\begin{ceqn}
+\begin{equation}\label{graph:inj}
+\begin{tikzcd}
+r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
+v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+
+
+\noindent
+The
+definition of $\textit{inj}$ is as follows: 
+
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\end{tabular}
+\end{center}
+
+\noindent This definition is by recursion on the ``shape'' of regular
+expressions and values. 
+The clauses basically do one thing--identifying the ``holes'' on 
+value to inject the character back into.
+For instance, in the last clause for injecting back to a value
+that would turn into a new star value that corresponds to a star,
+we know it must be a sequence value. And we know that the first 
+value of that sequence corresponds to the child regex of the star
+with the first character being chopped off--an iteration of the star
+that had just been unfolded. This value is followed by the already
+matched star iterations we collected before. So we inject the character 
+back to the first value and form a new value with this new iteration
+being added to the previous list of iterations, all under the $\Stars$
+top level.
+
+Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
+we have a lexer with the following recursive definition:
+\begin{center}
+\begin{tabular}{lcr}
+$\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\
+$\lexer \; r \;c::s$ & $=$ & $\inj \; r \; c (\lexer (r\backslash c) s)$
+\end{tabular}
+\end{center}
+ 
+Pictorially, the algorithm is as follows:
 
 \begin{ceqn}
 \begin{equation}\label{graph:2}
@@ -524,63 +643,8 @@
 for how the empty string has been matched by the (nullable) regular
 expression $r_n$. This function is defined as
 
-	\begin{center}
-		\begin{tabular}{lcl}
-			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
-			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
-			& \textit{if} $\nullable(r_{1})$\\ 
-			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
-			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
-			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
-			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
-		\end{tabular}
-	\end{center}
 
 
-\noindent 
-After the $\mkeps$-call, we inject back the characters one by one in order to build
-the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
-After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. The POSIX value is maintained throught out the process.
-For this Sulzmann and Lu defined a function that reverses
-the ``chopping off'' of characters during the derivative phase. The
-corresponding function is called \emph{injection}, written
-$\textit{inj}$; it takes three arguments: the first one is a regular
-expression ${r_{i-1}}$, before the character is chopped off, the second
-is a character ${c_{i-1}}$, the character we want to inject and the
-third argument is the value ${v_i}$, into which one wants to inject the
-character (it corresponds to the regular expression after the character
-has been chopped off). The result of this function is a new value. The
-definition of $\textit{inj}$ is as follows: 
-
-\begin{center}
-\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
-  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
-\end{tabular}
-\end{center}
-
-\noindent This definition is by recursion on the ``shape'' of regular
-expressions and values. 
-The clauses basically do one thing--identifying the ``holes'' on 
-value to inject the character back into.
-For instance, in the last clause for injecting back to a value
-that would turn into a new star value that corresponds to a star,
-we know it must be a sequence value. And we know that the first 
-value of that sequence corresponds to the child regex of the star
-with the first character being chopped off--an iteration of the star
-that had just been unfolded. This value is followed by the already
-matched star iterations we collected before. So we inject the character 
-back to the first value and form a new value with this new iteration
-being added to the previous list of iterations, all under the $Stars$
-top level.
-
 We have mentioned before that derivatives without simplification 
 can get clumsy, and this is true for values as well--they reflect
 the regular expressions size by definition.
@@ -617,411 +681,9 @@
 introducing additional informtaion to the 
 regular expressions called \emph{bitcodes}.
 
-\subsection*{Bit-coded Algorithm}
-Bits and bitcodes (lists of bits) are defined as:
-
-\begin{center}
-		$b ::=   1 \mid  0 \qquad
-bs ::= [] \mid b::bs    
-$
-\end{center}
-
-\noindent
-The $1$ and $0$ are not in bold in order to avoid 
-confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
-bit-lists) can be used to encode values (or potentially incomplete values) in a
-compact form. This can be straightforwardly seen in the following
-coding function from values to bitcodes: 
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
-  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
-  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
-  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
-  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
-  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
-  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
-                                                 code(\Stars\,vs)$
-\end{tabular}    
-\end{center} 
-
-\noindent
-Here $\textit{code}$ encodes a value into a bitcodes by converting
-$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
-star iteration by $1$. The border where a local star terminates
-is marked by $0$. This coding is lossy, as it throws away the information about
-characters, and also does not encode the ``boundary'' between two
-sequence values. Moreover, with only the bitcode we cannot even tell
-whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
-reason for choosing this compact way of storing information is that the
-relatively small size of bits can be easily manipulated and ``moved
-around'' in a regular expression. In order to recover values, we will 
-need the corresponding regular expression as an extra information. This
-means the decoding function is defined as:
-
-
-%\begin{definition}[Bitdecoding of Values]\mbox{}
-\begin{center}
-\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
-  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
-  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
-  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
-     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
-       (\Left\,v, bs_1)$\\
-  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
-     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
-       (\Right\,v, bs_1)$\\                           
-  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
-        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
-  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
-  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
-  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
-  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
-         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
-  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
-  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
-  
-  $\textit{decode}\,bs\,r$ & $\dn$ &
-     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
-  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
-       \textit{else}\;\textit{None}$                       
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-Sulzmann and Lu's integrated the bitcodes into regular expressions to
-create annotated regular expressions \cite{Sulzmann2014}.
-\emph{Annotated regular expressions} are defined by the following
-grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{a}$ & $::=$  & $\ZERO$\\
-                  & $\mid$ & $_{bs}\ONE$\\
-                  & $\mid$ & $_{bs}{\bf c}$\\
-                  & $\mid$ & $_{bs}\sum\,as$\\
-                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
-                  & $\mid$ & $_{bs}a^*$
-\end{tabular}    
-\end{center}  
-%(in \textit{ALTS})
-
-\noindent
-where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
-expressions and $as$ for a list of annotated regular expressions.
-The alternative constructor($\sum$) has been generalized to 
-accept a list of annotated regular expressions rather than just 2.
-We will show that these bitcodes encode information about
-the (POSIX) value that should be generated by the Sulzmann and Lu
-algorithm.
-
-
-To do lexing using annotated regular expressions, we shall first
-transform the usual (un-annotated) regular expressions into annotated
-regular expressions. This operation is called \emph{internalisation} and
-defined as follows:
-
-%\begin{definition}
-\begin{center}
-\begin{tabular}{lcl}
-  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
-  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
-  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
-  $(r_1 + r_2)^\uparrow$ & $\dn$ &
-  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
-  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
-  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
-         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
-  $(r^*)^\uparrow$ & $\dn$ &
-         $_{[]}(r^\uparrow)^*$\\
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-\noindent
-We use up arrows here to indicate that the basic un-annotated regular
-expressions are ``lifted up'' into something slightly more complex. In the
-fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
-attach bits to the front of an annotated regular expression. Its
-definition is as follows:
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
-  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
-     $_{bs @ bs'}\ONE$\\
-  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
-     $_{bs@bs'}{\bf c}$\\
-  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
-     $_{bs@bs'}\sum\textit{as}$\\
-  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
-     $_{bs@bs'}a_1 \cdot a_2$\\
-  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
-     $_{bs @ bs'}a^*$
-\end{tabular}    
-\end{center}  
-
-\noindent
-After internalising the regular expression, we perform successive
-derivative operations on the annotated regular expressions. This
-derivative operation is the same as what we had previously for the
-basic regular expressions, except that we beed to take care of
-the bitcodes:
-
-
-\iffalse
- %\begin{definition}{bder}
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
-  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
-  $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
-  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
-					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
-  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
-  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
-      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
-       (\textit{STAR}\,[]\,r)$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
-  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
-  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
-  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
-					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
-  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
-  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
-      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
-       (_{bs}\textit{STAR}\,[]\,r)$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-\fi
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
-  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
-  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
-  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
-  $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
-  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
-					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
-  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
-  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
-      $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
-       (_{[]}r^*))$
-\end{tabular}    
-\end{center}    
-
-%\end{definition}
-\noindent
-For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
-we need to unfold it into a sequence,
-and attach an additional bit $0$ to the front of $r \backslash c$
-to indicate one more star iteration. Also the sequence clause
-is more subtle---when $a_1$ is $\textit{bnullable}$ (here
-\textit{bnullable} is exactly the same as $\textit{nullable}$, except
-that it is for annotated regular expressions, therefore we omit the
-definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
-$a_1$ matches the string prior to character $c$ (more on this later),
-then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
-\backslash c)$ will collapse the regular expression $a_1$(as it has
-already been fully matched) and store the parsing information at the
-head of the regular expression $a_2 \backslash c$ by fusing to it. The
-bitsequence $\textit{bs}$, which was initially attached to the
-first element of the sequence $a_1 \cdot a_2$, has
-now been elevated to the top-level of $\sum$, as this information will be
-needed whichever way the sequence is matched---no matter whether $c$ belongs
-to $a_1$ or $ a_2$. After building these derivatives and maintaining all
-the lexing information, we complete the lexing by collecting the
-bitcodes using a generalised version of the $\textit{mkeps}$ function
-for annotated regular expressions, called $\textit{bmkeps}$:
-
-
-%\begin{definition}[\textit{bmkeps}]\mbox{}
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
-  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a$\\
-  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
-  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
-  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
-     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
-  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
-     $bs \,@\, [0]$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-\noindent
-This function completes the value information by travelling along the
-path of the regular expression that corresponds to a POSIX value and
-collecting all the bitcodes, and using $S$ to indicate the end of star
-iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
-decode them, we get the value we expect. The corresponding lexing
-algorithm looks as follows:
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{blexer}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
-  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
-  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
-  & & $\;\;\textit{else}\;\textit{None}$
-\end{tabular}
-\end{center}
-
-\noindent
-In this definition $\_\backslash s$ is the  generalisation  of the derivative
-operation from characters to strings (just like the derivatives for un-annotated
-regular expressions).
-
-Now we introduce the simplifications, which is why we introduce the 
-bitcodes in the first place.
-
-\subsection*{Simplification Rules}
-
-This section introduces aggressive (in terms of size) simplification rules
-on annotated regular expressions
-to keep derivatives small. Such simplifications are promising
-as we have
-generated test data that show
-that a good tight bound can be achieved. We could only
-partially cover the search space as there are infinitely many regular
-expressions and strings. 
-
-One modification we introduced is to allow a list of annotated regular
-expressions in the $\sum$ constructor. This allows us to not just
-delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
-also unnecessary ``copies'' of regular expressions (very similar to
-simplifying $r + r$ to just $r$, but in a more general setting). Another
-modification is that we use simplification rules inspired by Antimirov's
-work on partial derivatives. They maintain the idea that only the first
-``copy'' of a regular expression in an alternative contributes to the
-calculation of a POSIX value. All subsequent copies can be pruned away from
-the regular expression. A recursive definition of our  simplification function 
-that looks somewhat similar to our Scala code is given below:
-%\comment{Use $\ZERO$, $\ONE$ and so on. 
-%Is it $ALTS$ or $ALTS$?}\\
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-   
-  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
-   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
-   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
-   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
-
-  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
-  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
-   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
-
-   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
-\end{tabular}    
-\end{center}    
-
-\noindent
-The simplification does a pattern matching on the regular expression.
-When it detected that the regular expression is an alternative or
-sequence, it will try to simplify its child regular expressions
-recursively and then see if one of the children turns into $\ZERO$ or
-$\ONE$, which might trigger further simplification at the current level.
-The most involved part is the $\sum$ clause, where we use two
-auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
-alternatives and reduce as many duplicates as possible. Function
-$\textit{distinct}$  keeps the first occurring copy only and removes all later ones
-when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
-Its recursive definition is given below:
-
- \begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
-     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
-  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
-    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
-\end{tabular}    
-\end{center}  
-
-\noindent
-Here $\textit{flatten}$ behaves like the traditional functional programming flatten
-function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
-removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
-
-Having defined the $\simp$ function,
-we can use the previous notation of  natural
-extension from derivative w.r.t.~character to derivative
-w.r.t.~string:%\comment{simp in  the [] case?}
-
-\begin{center}
-\begin{tabular}{lcl}
-$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
-$r \backslash_{simp} [\,] $ & $\dn$ & $r$
-\end{tabular}
-\end{center}
-
-\noindent
-to obtain an optimised version of the algorithm:
-
- \begin{center}
-\begin{tabular}{lcl}
-  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
-  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
-  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
-  & & $\;\;\textit{else}\;\textit{None}$
-\end{tabular}
-\end{center}
-
-\noindent
-This algorithm keeps the regular expression size small, for example,
-with this simplification our previous $(a + aa)^*$ example's 8000 nodes
-will be reduced to just 6 and stays constant, no matter how long the
-input string is.
 
 
 
 
 
-
-
-%-----------------------------------
-%	SUBSECTION 1
-%-----------------------------------
-\section{Specifications of Certain Functions to be Used}
-Here we give some functions' definitions, 
-which we will use later.
-\begin{center}
-\begin{tabular}{ccc}
-$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$
-\end{tabular}
-\end{center}
-
-
-

--- a/ChengsongTanPhdThesis/Chapters/ChapterBitcoded1.tex	Mon May 30 17:24:52 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/ChapterBitcoded1.tex	Mon May 30 20:36:15 2022 +0100
@@ -9,47 +9,416 @@
 %its correctness proof in 
 %Chapter 3\ref{Chapter3}. 
 
+\section*{Bit-coded Algorithm}
+Bits and bitcodes (lists of bits) are defined as:
 
+\begin{center}
+		$b ::=   1 \mid  0 \qquad
+bs ::= [] \mid b::bs    
+$
+\end{center}
+
+\noindent
+The $1$ and $0$ are not in bold in order to avoid 
+confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
+bit-lists) can be used to encode values (or potentially incomplete values) in a
+compact form. This can be straightforwardly seen in the following
+coding function from values to bitcodes: 
+
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
+  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
+  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
+  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
+  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
+  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
+  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
+                                                 code(\Stars\,vs)$
+\end{tabular}    
+\end{center} 
+
+\noindent
+Here $\textit{code}$ encodes a value into a bitcodes by converting
+$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
+star iteration by $1$. The border where a local star terminates
+is marked by $0$. This coding is lossy, as it throws away the information about
+characters, and also does not encode the ``boundary'' between two
+sequence values. Moreover, with only the bitcode we cannot even tell
+whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
+reason for choosing this compact way of storing information is that the
+relatively small size of bits can be easily manipulated and ``moved
+around'' in a regular expression. In order to recover values, we will 
+need the corresponding regular expression as an extra information. This
+means the decoding function is defined as:
+
+
+%\begin{definition}[Bitdecoding of Values]\mbox{}
+\begin{center}
+\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
+  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
+  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
+  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
+       (\Left\,v, bs_1)$\\
+  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
+       (\Right\,v, bs_1)$\\                           
+  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
+        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
+  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
+  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
+  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
+  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
+         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
+  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
+  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
+  
+  $\textit{decode}\,bs\,r$ & $\dn$ &
+     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
+  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
+       \textit{else}\;\textit{None}$                       
+\end{tabular}    
+\end{center}    
+%\end{definition}
+
+Sulzmann and Lu's integrated the bitcodes into regular expressions to
+create annotated regular expressions \cite{Sulzmann2014}.
+\emph{Annotated regular expressions} are defined by the following
+grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
+
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{a}$ & $::=$  & $\ZERO$\\
+                  & $\mid$ & $_{bs}\ONE$\\
+                  & $\mid$ & $_{bs}{\bf c}$\\
+                  & $\mid$ & $_{bs}\sum\,as$\\
+                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
+                  & $\mid$ & $_{bs}a^*$
+\end{tabular}    
+\end{center}  
+%(in \textit{ALTS})
+
+\noindent
+where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
+expressions and $as$ for a list of annotated regular expressions.
+The alternative constructor($\sum$) has been generalized to 
+accept a list of annotated regular expressions rather than just 2.
+We will show that these bitcodes encode information about
+the (POSIX) value that should be generated by the Sulzmann and Lu
+algorithm.
+
+
+To do lexing using annotated regular expressions, we shall first
+transform the usual (un-annotated) regular expressions into annotated
+regular expressions. This operation is called \emph{internalisation} and
+defined as follows:
+
+%\begin{definition}
+\begin{center}
+\begin{tabular}{lcl}
+  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
+  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
+  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
+  $(r_1 + r_2)^\uparrow$ & $\dn$ &
+  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
+  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
+  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
+         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
+  $(r^*)^\uparrow$ & $\dn$ &
+         $_{[]}(r^\uparrow)^*$\\
+\end{tabular}    
+\end{center}    
+%\end{definition}
+
+\noindent
+We use up arrows here to indicate that the basic un-annotated regular
+expressions are ``lifted up'' into something slightly more complex. In the
+fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
+attach bits to the front of an annotated regular expression. Its
+definition is as follows:
+
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
+  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
+     $_{bs @ bs'}\ONE$\\
+  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
+     $_{bs@bs'}{\bf c}$\\
+  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
+     $_{bs@bs'}\sum\textit{as}$\\
+  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
+     $_{bs@bs'}a_1 \cdot a_2$\\
+  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
+     $_{bs @ bs'}a^*$
+\end{tabular}    
+\end{center}  
+
+\noindent
+After internalising the regular expression, we perform successive
+derivative operations on the annotated regular expressions. This
+derivative operation is the same as what we had previously for the
+basic regular expressions, except that we beed to take care of
+the bitcodes:
 
 
-%----------------------------------------------------------------------------------------
-%	SECTION common identities
-%----------------------------------------------------------------------------------------
-\section{Common Identities In Simplification-Related Functions} 
-Need to prove these before starting on the big correctness proof.
+\iffalse
+ %\begin{definition}{bder}
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
+        $\textit{if}\;c=d\; \;\textit{then}\;
+         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
+  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
+  $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
+  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
+					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
+  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
+  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
+      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
+       (\textit{STAR}\,[]\,r)$
+\end{tabular}    
+\end{center}    
+%\end{definition}
+
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
+        $\textit{if}\;c=d\; \;\textit{then}\;
+         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
+  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
+  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
+  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
+					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
+  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
+  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
+      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
+       (_{bs}\textit{STAR}\,[]\,r)$
+\end{tabular}    
+\end{center}    
+%\end{definition}
+\fi
 
-%-----------------------------------
-%	SUBSECTION 
-%-----------------------------------
-\subsection{Idempotency of $\simp$}
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
+  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
+  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
+        $\textit{if}\;c=d\; \;\textit{then}\;
+         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
+  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
+  $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\
+  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
+					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
+  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
+  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
+      $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
+       (_{[]}r^*))$
+\end{tabular}    
+\end{center}    
+
+%\end{definition}
+\noindent
+For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
+we need to unfold it into a sequence,
+and attach an additional bit $0$ to the front of $r \backslash c$
+to indicate one more star iteration. Also the sequence clause
+is more subtle---when $a_1$ is $\textit{bnullable}$ (here
+\textit{bnullable} is exactly the same as $\textit{nullable}$, except
+that it is for annotated regular expressions, therefore we omit the
+definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
+$a_1$ matches the string prior to character $c$ (more on this later),
+then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
+\backslash c)$ will collapse the regular expression $a_1$(as it has
+already been fully matched) and store the parsing information at the
+head of the regular expression $a_2 \backslash c$ by fusing to it. The
+bitsequence $\textit{bs}$, which was initially attached to the
+first element of the sequence $a_1 \cdot a_2$, has
+now been elevated to the top-level of $\sum$, as this information will be
+needed whichever way the sequence is matched---no matter whether $c$ belongs
+to $a_1$ or $ a_2$. After building these derivatives and maintaining all
+the lexing information, we complete the lexing by collecting the
+bitcodes using a generalised version of the $\textit{mkeps}$ function
+for annotated regular expressions, called $\textit{bmkeps}$:
+
+
+%\begin{definition}[\textit{bmkeps}]\mbox{}
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
+  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a$\\
+  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
+  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
+  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
+     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
+  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
+     $bs \,@\, [0]$
+\end{tabular}    
+\end{center}    
+%\end{definition}
+
+\noindent
+This function completes the value information by travelling along the
+path of the regular expression that corresponds to a POSIX value and
+collecting all the bitcodes, and using $S$ to indicate the end of star
+iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
+decode them, we get the value we expect. The corresponding lexing
+algorithm looks as follows:
 
-\begin{equation}
-	\simp \;r = \simp\; \simp \; r 
-\end{equation}
-This property means we do not have to repeatedly
-apply simplification in each step, which justifies
-our definition of $\blexersimp$.
-It will also be useful in future proofs where properties such as 
-closed forms are needed.
-The proof is by structural induction on $r$.
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{blexer}\;r\,s$ & $\dn$ &
+      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
+  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+  & & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+
+\noindent
+In this definition $\_\backslash s$ is the  generalisation  of the derivative
+operation from characters to strings (just like the derivatives for un-annotated
+regular expressions).
+
+Now we introduce the simplifications, which is why we introduce the 
+bitcodes in the first place.
+
+\subsection*{Simplification Rules}
+
+This section introduces aggressive (in terms of size) simplification rules
+on annotated regular expressions
+to keep derivatives small. Such simplifications are promising
+as we have
+generated test data that show
+that a good tight bound can be achieved. We could only
+partially cover the search space as there are infinitely many regular
+expressions and strings. 
+
+One modification we introduced is to allow a list of annotated regular
+expressions in the $\sum$ constructor. This allows us to not just
+delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
+also unnecessary ``copies'' of regular expressions (very similar to
+simplifying $r + r$ to just $r$, but in a more general setting). Another
+modification is that we use simplification rules inspired by Antimirov's
+work on partial derivatives. They maintain the idea that only the first
+``copy'' of a regular expression in an alternative contributes to the
+calculation of a POSIX value. All subsequent copies can be pruned away from
+the regular expression. A recursive definition of our  simplification function 
+that looks somewhat similar to our Scala code is given below:
+%\comment{Use $\ZERO$, $\ONE$ and so on. 
+%Is it $ALTS$ or $ALTS$?}\\
+
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+   
+  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
+   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
+   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
+   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
+
+  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\
+  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
+   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
 
-%-----------------------------------
-%	SUBSECTION 
-%-----------------------------------
-\subsection{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
+   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
+\end{tabular}    
+\end{center}    
+
+\noindent
+The simplification does a pattern matching on the regular expression.
+When it detected that the regular expression is an alternative or
+sequence, it will try to simplify its child regular expressions
+recursively and then see if one of the children turns into $\ZERO$ or
+$\ONE$, which might trigger further simplification at the current level.
+The most involved part is the $\sum$ clause, where we use two
+auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
+alternatives and reduce as many duplicates as possible. Function
+$\textit{distinct}$  keeps the first occurring copy only and removes all later ones
+when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
+Its recursive definition is given below:
+
+ \begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
+     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
+  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
+    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
+\end{tabular}    
+\end{center}  
+
+\noindent
+Here $\textit{flatten}$ behaves like the traditional functional programming flatten
+function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
+removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
+
+Having defined the $\simp$ function,
+we can use the previous notation of  natural
+extension from derivative w.r.t.~character to derivative
+w.r.t.~string:%\comment{simp in  the [] case?}
+
 \begin{center}
-$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
+\begin{tabular}{lcl}
+$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
+$r \backslash_{simp} [\,] $ & $\dn$ & $r$
+\end{tabular}
 \end{center}
 
+\noindent
+to obtain an optimised version of the algorithm:
+
+ \begin{center}
+\begin{tabular}{lcl}
+  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
+      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
+  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+  & & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+
+\noindent
+This algorithm keeps the regular expression size small, for example,
+with this simplification our previous $(a + aa)^*$ example's 8000 nodes
+will be reduced to just 6 and stays constant, no matter how long the
+input string is.
+
 
 
 
 
 
+
+
+
+
+
+%-----------------------------------
+%	SUBSECTION 1
+%-----------------------------------
+\section{Specifications of Some Helper Functions}
+Here we give some functions' definitions, 
+which we will use later.
+\begin{center}
+\begin{tabular}{ccc}
+$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$
+\end{tabular}
+\end{center}
+
+
 %----------------------------------------------------------------------------------------
 %	SECTION  correctness proof
 %----------------------------------------------------------------------------------------
@@ -134,3 +503,4 @@
 Finally 
 
 
+

--- a/ChengsongTanPhdThesis/Chapters/ChapterFinite.tex	Mon May 30 17:24:52 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/ChapterFinite.tex	Mon May 30 20:36:15 2022 +0100
@@ -7,67 +7,29 @@
 %of our bitcoded algorithm, that is a finite bound on the size of any 
 %regex's derivatives. 
 
-
-%-----------------------------------
-%	SECTION ?
-%-----------------------------------
-\section{preparatory properties for getting a finiteness bound}
-Before we get to the proof that says the intermediate result of our lexer will
-remain finitely bounded, which is an important efficiency/liveness guarantee,
-we shall first develop a few preparatory properties and definitions to 
-make the process of proving that a breeze.
-
-We define rewriting relations for $\rrexp$s, which allows us to do the 
-same trick as we did for the correctness proof,
-but this time we will have stronger equalities established.
-\subsection{"hrewrite" relation}
-List of 1-step rewrite rules for regular expressions simplification without bitcodes:
-\begin{center}
-\begin{tabular}{lcl}
-$r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$
-\end{tabular}
-\end{center}
-
-And we define an "grewrite" relation that works on lists:
-\begin{center}
-\begin{tabular}{lcl}
-$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
-\end{tabular}
-\end{center}
-
-
+In this chapter we give a guarantee in terms of time complexity:
+given a regular expression $r$, for any string $s$ 
+our algorithm's internal data structure is finitely bounded.
+To obtain such a proof, we need to 
+\begin{itemize}
+\item
+Define an new datatype for regular expressions that makes it easy
+to reason about the size of an annotated regular expression.
+\item
+A set of equalities for this new datatype that enables one to
+rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc.
+by their children regexes $r_1$, $r_2$, and $r$.
+\item
+Using those equalities to actually get those rewriting equations, which we call
+"closed forms".
+\item
+Bound the closed forms, thereby bounding the original
+$\blexersimp$'s internal data structures.
+\end{itemize}
 
-With these relations we prove
-\begin{lemma}
-$rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
-\end{lemma}
-which enables us to prove "closed forms" equalities such as
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
+\section{the $\mathbf{r}$-rexp datatype and the size functions}
 
-But the most involved part of the above lemma is proving the following:
-\begin{lemma}
-$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ 
-\end{lemma}
-which is needed in proving things like 
-\begin{lemma}
-$r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
-\end{lemma}
-
-Fortunately this is provable by induction on the list $rs$
-
-
-
-%-----------------------------------
-%	SECTION 2
-%-----------------------------------
-
-\section{Finiteness Property}
-In this section let us describe our argument for why the size of the simplified
-derivatives with the aggressive simplification function is finitely bounded.
- Suppose
-we have a size function for bitcoded regular expressions, written
+We have a size function for bitcoded regular expressions, written
 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
 
 \begin{center}
@@ -170,6 +132,65 @@
  regular expression.
 
 
+
+%-----------------------------------
+%	SECTION ?
+%-----------------------------------
+\section{preparatory properties for getting a finiteness bound}
+Before we get to the proof that says the intermediate result of our lexer will
+remain finitely bounded, which is an important efficiency/liveness guarantee,
+we shall first develop a few preparatory properties and definitions to 
+make the process of proving that a breeze.
+
+We define rewriting relations for $\rrexp$s, which allows us to do the 
+same trick as we did for the correctness proof,
+but this time we will have stronger equalities established.
+\subsection{"hrewrite" relation}
+List of 1-step rewrite rules for regular expressions simplification without bitcodes:
+\begin{figure}
+\caption{the "h-rewrite" rules}
+\[
+r_1 \cdot \ZERO \rightarrow_h \ZERO \]
+
+\[
+\ZERO \cdot r_2 \rightarrow \ZERO 
+\]
+\end{figure}
+And we define an "grewrite" relation that works on lists:
+\begin{center}
+\begin{tabular}{lcl}
+$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
+\end{tabular}
+\end{center}
+
+
+
+With these relations we prove
+\begin{lemma}
+$rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
+\end{lemma}
+which enables us to prove "closed forms" equalities such as
+\begin{lemma}
+$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\suffix \; s \; r_1 ))}$
+\end{lemma}
+
+But the most involved part of the above lemma is proving the following:
+\begin{lemma}
+$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ 
+\end{lemma}
+which is needed in proving things like 
+\begin{lemma}
+$r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
+\end{lemma}
+
+Fortunately this is provable by induction on the list $rs$
+
+
+
+%-----------------------------------
+%	SECTION 2
+%-----------------------------------
+
 \begin{theorem}
 For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
 \end{theorem}

--- a/thys2/blexer2.sc	Mon May 30 17:24:52 2022 +0100
+++ b/thys2/blexer2.sc	Mon May 30 20:36:15 2022 +0100
@@ -521,6 +521,7 @@
         case (AZERO, _) => AZERO
         case (_, AZERO) => AZERO
         case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)
+        case (r1s, AONE(bs2)) => fuse(bs1, r1s) //asserted bs2 == Nil
         case (r1s, r2s) => ASEQ(bs1, r1s, r2s)
     }
     case AALTS(bs1, rs) => {
@@ -732,9 +733,9 @@
   res.toSet
 }
 
-def ABIncludedByC[A, B, C](a: A, b: B, c: C, f: (A, B) => C, inclusionPred: (C, C) => Boolean) : Boolean = {
+def ABIncludedByC[A, B, C](a: A, b: B, c: C, f: (A, B) => C, subseteqPred: (C, C) => Boolean) : Boolean = {
   
-  inclusionPred(f(a, b), c)
+  subseteqPred(f(a, b), c)
 }
 def rexpListInclusion(rs1: Set[Rexp], rs2: Set[Rexp]) : Boolean = {
   // println("r+ctx---------")
@@ -747,7 +748,7 @@
   // println("end------------------")
   res
 }
-def pAKC6(acc: Set[Rexp], r: ARexp, ctx: List[Rexp]) : ARexp = {
+def prune6(acc: Set[Rexp], r: ARexp, ctx: List[Rexp]) : ARexp = {
   // println("pakc--------r")
   // println(shortRexpOutput(erase(r)))
   //   println("ctx---------")
@@ -766,7 +767,7 @@
   else{
     r match {
       case ASEQ(bs, r1, r2) => 
-      (pAKC6(acc, r1, erase(r2) :: ctx)) match{
+      (prune6(acc, r1, erase(r2) :: ctx)) match{
         case AZERO => 
           AZERO
         case AONE(bs1) => 
@@ -782,7 +783,7 @@
         // rs0.foreach(r => println(shortRexpOutput(erase(r))))
         // println(s"acc is ")
         // acc.foreach(r => println(shortRexpOutput(r)))
-        rs0.map(r => pAKC6(acc, r, ctx)).filter(_ != AZERO) match {
+        rs0.map(r => prune6(acc, r, ctx)).filter(_ != AZERO) match {
           case Nil => 
             // println("after pruning Nil")
             AZERO
@@ -809,7 +810,7 @@
       distinctBy6(xs, acc)
     }
     else{
-      val pruned = pAKC6(acc, x, Nil)
+      val pruned = prune6(acc, x, Nil)
       val newTerms = strongBreakIntoTerms(erase(pruned))
       pruned match {
         case AZERO => 
@@ -1236,7 +1237,7 @@
   //STAR(SEQ(ALTS(STAR(STAR(STAR(STAR(CHAR(c))))),ALTS(CHAR(c),CHAR(b))),ALTS(ZERO,SEQ(ALTS(ALTS(STAR(CHAR(c)),SEQ(CHAR(b),CHAR(a))),CHAR(c)),STAR(ALTS(ALTS(ONE,CHAR(a)),STAR(CHAR(c))))))))
   //(depth5)
   //STAR(SEQ(ALTS(ALTS(STAR(CHAR(b)),SEQ(ONE,CHAR(b))),SEQ(STAR(CHAR(a)),CHAR(b))),ALTS(ZERO,ALTS(STAR(CHAR(b)),STAR(CHAR(a))))))
-  test(rexp(4), 100000) { (r: Rexp) => 
+  test(rexp(8), 10000) { (r: Rexp) => 
   // ALTS(SEQ(SEQ(ONE,CHAR('a')),STAR(CHAR('a'))),SEQ(ALTS(CHAR('c'),ONE),STAR(ZERO))))))), 1) { (r: Rexp) => 
     val ss = stringsFromRexp(r)
     val boolList = ss.filter(s => s != "").map(s => {
@@ -1256,7 +1257,7 @@
 
 }
 // small()
-// generator_test()
+generator_test()
 
 def counterexample_check() {
   val r = SEQ(STAR(CHAR('c')),STAR(SEQ(STAR(CHAR('c')),ONE)))//STAR(SEQ(ALTS(STAR(CHAR('c')),CHAR('c')),SEQ(ALTS(CHAR('c'),ONE),ONE)))
@@ -1275,11 +1276,13 @@
 }
 // counterexample_check()
 def linform_test() {
-  val r = STAR(SEQ(STAR(CHAR('c')), ONE))
+  val r = STAR(SEQ(STAR(CHAR('c')),ONE))//SEQ(STAR(CHAR('c')),STAR(SEQ(STAR(CHAR('c')),ONE))) //
   val r_linforms = lf(r)
   println(r_linforms.size)
+  val pderuniv = pderUNIV(r)
+  println(pderuniv.size)
 }
-linform_test()
+// linform_test()
 // 1
 def newStrong_test() {
   val r2 = (CHAR('b') | ONE)

author	Chengsong
	Mon, 30 May 2022 20:36:15 +0100
changeset 530	823d9b19d21c
parent 529	96e93df60954
child 531	c334f0b3ef52

ChengsongTanPhdThesis/Chapters/Chapter1.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/Chapter2.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/ChapterBitcoded1.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/ChapterFinite.tex		file \| annotate \| diff \| comparison \| revisions
thys2/blexer2.sc		file \| annotate \| diff \| comparison \| revisions