lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded1.tex


-:692911c0b981
+:3178f0e948ac
 relatively small size of bits can be easily manipulated and ``moved
 around'' in a regular expression.
 Because of this lossiness, decoding a bit list additionally requires
 a regular expression. The function $\decode$ is defined as:
-We define the reverse operation of $\code$, which is $\decode$.
-As expected, $\decode$ not only requires the bit-codes,
-but also a regular expression to guide the decoding and
-fill the gaps of characters:
 %\begin{definition}[Bitdecoding of Values]\mbox{}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
 \end{tabular}
 \end{center}
 %\end{definition}
 \noindent
-The function $\decode'$ returns a pair consisting of a partially decoded value and some leftover:
+The function $\decode'$ returns a pair consisting of
-$\decode'$ does most of the job while $\decode$ throws
+a partially decoded value and some leftover bit list that cannot
-away leftover bit-codes and returns the value only.
+be decoded yet.
+The function $\decode$ succeeds if the left-over
+bit-sequence is empty.
 $\decode$ is terminating as $\decode'$ is terminating.
-We have the property that $\decode$ and $\code$ are
+$\decode'$ is terminating
+because the size of at least one of its arguments decreases
+in every recursive call.
+Assuming we have a value $v$ and regular expression $r$
+with $\vdash v:r$,
+then we have the property that $\decode$ and $\code$ are
 reverse operations of one another:
 \begin{lemma}
-\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \]
+\[\text{If } \vdash v : r \text{ then } \decode \; (\code \; v) \; r = \textit{Some}(v) \]
 \end{lemma}
 \begin{proof}
 By proving a more general version of the lemma, on $\decode'$:
 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
-Then setting $ds$ to be $[]$ and unfolding $\decode$ definition
+Then setting $ds$ to be $[]$ and unfolding $\decode$ definition,
-we get the lemma.
+we obtain the property.
 \end{proof}
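To make the round-trip property concrete, the following is a minimal Python sketch of $\code$ and $\decode$. The tuple encoding of regular expressions and values, the names \texttt{decode\_aux} (for $\decode'$) and \texttt{Z}, \texttt{S}, and the use of \texttt{None} in place of the option type are our assumptions. Clauses not shown in this excerpt follow the usual Sulzmann and Lu style bit-encoding and should be read as an illustration, not as the formal definition.

```python
# Assumed encoding (not from the text): regexes and values are tagged tuples.
#   regex: ("ZERO",) ("ONE",) ("CHAR", c) ("ALT", r1, r2) ("SEQ", r1, r2) ("STAR", r)
#   value: ("Empty",) ("Char", c) ("Left", v) ("Right", v) ("Seq", v1, v2) ("Stars", [v, ...])
Z, S = 0, 1  # the two bits


def code(v):
    """Flatten a value into a bit list, dropping the characters (lossy)."""
    tag = v[0]
    if tag in ("Empty", "Char"):
        return []
    if tag == "Left":
        return [Z] + code(v[1])
    if tag == "Right":
        return [S] + code(v[1])
    if tag == "Seq":
        return code(v[1]) + code(v[2])
    if tag == "Stars":
        bits = []
        for item in v[1]:
            bits += [Z] + code(item)  # Z announces one more iteration
        return bits + [S]             # S marks the end of the star
    raise ValueError(f"not a value: {v!r}")


def decode_aux(bs, r):
    """decode': return (value, leftover bits), guided by the regex r."""
    tag = r[0]
    if tag == "ONE":
        return ("Empty",), bs
    if tag == "CHAR":
        return ("Char", r[1]), bs  # the regex supplies the missing character
    if tag == "ALT" and bs:
        side, branch = ("Left", r[1]) if bs[0] == Z else ("Right", r[2])
        v, rest = decode_aux(bs[1:], branch)
        return (side, v), rest
    if tag == "SEQ":
        v1, rest = decode_aux(bs, r[1])
        v2, rest = decode_aux(rest, r[2])
        return ("Seq", v1, v2), rest
    if tag == "STAR" and bs:
        vs = []
        while bs and bs[0] == Z:
            v, bs = decode_aux(bs[1:], r[1])
            vs.append(v)
        if bs:  # the next bit must be S, which closes the star
            return ("Stars", vs), bs[1:]
    raise ValueError("bits do not fit the regular expression")


def decode(bs, r):
    """Return the value if decoding uses up all bits, otherwise None."""
    try:
        v, rest = decode_aux(bs, r)
    except ValueError:
        return None
    return v if not rest else None


# (a + b)* . c  matched against "abc"
r = ("SEQ", ("STAR", ("ALT", ("CHAR", "a"), ("CHAR", "b"))), ("CHAR", "c"))
v = ("Seq", ("Stars", [("Left", ("Char", "a")), ("Right", ("Char", "b"))]), ("Char", "c"))
roundtrip = decode(code(v), r)
```

In this sketch \texttt{roundtrip} is again the value \texttt{v}, while a bit list with unused trailing bits is rejected because the leftover is not empty.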
 With the $\code$ and $\decode$ functions in hand, we know how to
-switch between bit-codes and value--the two different representations of
+switch between bit-codes and values.
-lexing information.
+The next step is to integrate this information into regular expressions.
-The next step is to integrate this information into the working regular expression.
 Attaching bits to the front of regular expressions is the solution Sulzmann and Lu
-gave for storing partial values on the fly:
+gave for storing partial values in regular expressions.
+Annotated regular expressions are therefore defined as the Isabelle
+datatype:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{a}$ & $::=$  & $\ZERO$\\
 & $\mid$ & $_{bs}\ONE$\\
 \end{tabular}
 \end{center}
 %(in \textit{ALTS})
 \noindent
-We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}.
+where $bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
-$bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
+expressions and $as$ for lists of annotated regular expressions.
-expressions and $as$ for a list of annotated regular expressions.
+The alternative constructor, written $\sum$, has been generalised to
-The alternative constructor ($\sum$) has been generalised to
+accept a list of annotated regular expressions rather than just two.
-accept a list of annotated regular expressions rather than just 2.
+Why is it generalised? This is because when we open up nested
+alternatives, there could be more than two elements at the same level
+after de-duplication, which can no longer be stored in a binary
-The first thing we define related to bit-coded regular expressions
+constructor.
-is how we move bits, for instance pasting it at the front of an annotated regular expression.
-The operation $\fuse$ is just to attach bit-codes
+The first operation we define related to bit-coded regular expressions
+is how we move bits to the inside of regular expressions.
+Called $\fuse$, this operation attaches bit-codes
 to the front of an annotated regular expression:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
 $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
 $_{bs @ bs'}a^*$
 \end{tabular}
 \end{center}
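As a small illustration, $\fuse$ can be sketched in Python as follows. The tuple encoding of annotated regular expressions (tag first, bit list second) is our assumption and is chosen only to keep the example short.

```python
# Assumed encoding (not from the text): annotated regexes as tagged tuples,
# each carrying its bit list right after the tag.
#   ("AZERO",) ("AONE", bs) ("ACHAR", bs, c)
#   ("AALTS", bs, [a, ...]) ("ASEQ", bs, a1, a2) ("ASTAR", bs, a)


def fuse(bs, a):
    """Prepend bs to the bit list at the root of a; AZERO carries no bits."""
    if a[0] == "AZERO":
        return a
    return (a[0], bs + a[1]) + a[2:]


star = ("ASTAR", [1], ("ACHAR", [], "a"))
fused = fuse([0], star)  # only the root annotation changes
```

Only the outermost annotation is touched: the bits stored inside the sub-expressions stay where they are.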
 \noindent
-With that we are able to define $\internalise$.
+With \emph{fuse} we are able to define the $\internalise$ function
+that translates a ``standard'' regular expression into an
-To do lexing using annotated regular expressions, we shall first
+annotated regular expression.
-transform the usual (un-annotated) regular expressions into annotated
+This function will be applied before we start
-regular expressions. This operation is called \emph{internalisation} and
+with the derivative phase of the algorithm.
-defined as follows:
-%\begin{definition}
 \begin{center}
 \begin{tabular}{lcl}
 $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
 $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
 $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
 \end{center}
 %\end{definition}
 \noindent
 We use an up arrow with postfix notation
-to denote the operation,
+to denote this operation
 for convenience. The $\textit{internalise} \; r$
 notation is more cumbersome.
 The opposite of $\textit{internalise}$ is
 $\erase$, where all the bit-codes are removed,
 and the alternative operator $\sum$ for annotated
 regular expressions is transformed to the binary alternatives
 for plain regular expressions.
 \begin{center}
-	\begin{tabular}{lcr}
+	\begin{tabular}{lcl}
-		$\erase \; \ZERO$ & $\dn$ & $\ZERO$\\
+		$\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\
-		$\erase \; _{bs}\ONE$ & $\dn$ & $\ONE$\\
+		$( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\
-		$\erase \; _{bs}\mathbf{c}$ & $\dn$ & $\mathbf{c}$\\
+		$( _{bs}\mathbf{c} )_\downarrow$ & $\dn$ & $\mathbf{c}$\\
-		$\erase \; _{bs} a_1 \cdot a_2$ & $\dn$ & $\erase \; r_1\cdot\erase\; r_2$\\
+		$( _{bs} a_1 \cdot a_2 )_\downarrow$ & $\dn$ &
-		$\erase \; _{bs} [] $ & $\dn$ & $\ZERO$\\
+		$ (a_1) _\downarrow \cdot  (a_2) _\downarrow$\\
-		$\erase \; _{bs} [a] $ & $\dn$ & $\erase \; a$\\
+		$( _{bs} \sum [])_\downarrow $ & $\dn$ & $\ZERO $\\
-		$\erase \; _{bs} \sum [a_1, \; a_2]$ & $\dn$ & $\erase \; a_1 +\erase \; a_2$\\
+		$( _{bs} \sum [a]  )_\downarrow$ & $\dn$ & $a_\downarrow$\\
-		$\erase \; _{bs} \sum (a :: as)$ & $\dn$ & $\erase \; a + \erase \; _{[]} \sum as$\\
+		$( _{bs} \sum [a_1, \; a_2] )_\downarrow$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\
-		$\erase \; _{bs} a^*$ & $\dn$ & $(\erase \; a)^*$
+		$(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\
+		$( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$
 	\end{tabular}
 \end{center}
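The relationship between internalisation and erasure can be sketched in Python. The tuple encoding is our assumption, and the internalisation clauses for alternatives, sequences and stars are not shown in this excerpt, so the ones below (tagging the two branches of an alternative with $Z$ and $S$) are the commonly used ones, given as an assumption.

```python
# Assumed encoding (not from the text).
#   plain:     ("ZERO",) ("ONE",) ("CHAR", c) ("ALT", r1, r2) ("SEQ", r1, r2) ("STAR", r)
#   annotated: ("AZERO",) ("AONE", bs) ("ACHAR", bs, c)
#              ("AALTS", bs, [a, ...]) ("ASEQ", bs, a1, a2) ("ASTAR", bs, a)
Z, S = 0, 1


def fuse(bs, a):
    return a if a[0] == "AZERO" else (a[0], bs + a[1]) + a[2:]


def internalise(r):
    """Plain regex -> annotated regex with (almost) empty annotations."""
    tag = r[0]
    if tag == "ZERO":
        return ("AZERO",)
    if tag == "ONE":
        return ("AONE", [])
    if tag == "CHAR":
        return ("ACHAR", [], r[1])
    if tag == "ALT":  # assumed clause: remember which branch was taken
        return ("AALTS", [], [fuse([Z], internalise(r[1])),
                              fuse([S], internalise(r[2]))])
    if tag == "SEQ":  # assumed clause
        return ("ASEQ", [], internalise(r[1]), internalise(r[2]))
    if tag == "STAR":  # assumed clause
        return ("ASTAR", [], internalise(r[1]))
    raise ValueError(f"not a regex: {r!r}")


def erase(a):
    """Annotated regex -> plain regex: drop bits, make alternatives binary."""
    tag = a[0]
    if tag == "AZERO":
        return ("ZERO",)
    if tag == "AONE":
        return ("ONE",)
    if tag == "ACHAR":
        return ("CHAR", a[2])
    if tag == "ASEQ":
        return ("SEQ", erase(a[2]), erase(a[3]))
    if tag == "ASTAR":
        return ("STAR", erase(a[2]))
    if tag == "AALTS":
        items = a[2]
        if not items:
            return ("ZERO",)
        if len(items) == 1:
            return erase(items[0])
        # covers both the two-element and the longer-list clause
        return ("ALT", erase(items[0]), erase(("AALTS", [], items[1:])))
    raise ValueError(f"not an annotated regex: {a!r}")


r = ("SEQ", ("STAR", ("ALT", ("CHAR", "a"), ("CHAR", "b"))), ("ONE",))
back = erase(internalise(r))
```

In this sketch erasing an internalised regular expression gives back the regular expression we started with.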
 \noindent
 We also abbreviate the $\erase\; a$ operation
 as $a_\downarrow$, for conciseness.
 Since bit-coded regular expressions form a different datatype,
 testing whether their language contains the empty string requires
 a dedicated function $\bnullable$,
 which simply calls $\erase$ and then tests whether the result is $\nullable$.
-\begin{center}
+\begin{definition}
-	\begin{tabular}{lcr}
+		$\bnullable \; a \dn  \nullable \; (a_\downarrow)$
-		$\bnullable \; a $ & $\dn$ & $\nullable \; (a_\downarrow)$
+\end{definition}
-	\end{tabular}
-\end{center}
 The function for collecting the
-bitcodes of a $\bnullable$ annotated regular expression
+bitcodes at the end of the derivative
+phase from a (b)nullable regular expression
 is a generalised version of the $\textit{mkeps}$ function
 for annotated regular expressions, called $\textit{bmkeps}$:
 %\begin{definition}[\textit{bmkeps}]\mbox{}
 \end{tabular}
 \end{center}
 %\end{definition}
 \noindent
-This function completes the value information by travelling along the
+$\bmkeps$ completes the value information by travelling along the
 path of the regular expression that corresponds to a POSIX value and
-collecting all the bitcodes, and using $S$ to indicate the end of star
+collecting all the bitcodes, and attaching $S$ to indicate the end of star
 iterations.
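The following Python sketch illustrates how $\bmkeps$ walks along the nullable path. The clauses of $\bmkeps$ are not shown in this excerpt, so the ones below are the commonly used ones and are an assumption, as is the tuple encoding. For brevity \texttt{bnullable} is computed directly on the annotated structure, which should agree with calling $\nullable$ on the erased regular expression.

```python
# Assumed encoding (not from the text): ("AZERO",) ("AONE", bs) ("ACHAR", bs, c)
# ("AALTS", bs, [a, ...]) ("ASEQ", bs, a1, a2) ("ASTAR", bs, a); bits Z = 0, S = 1.
Z, S = 0, 1


def bnullable(a):
    """Structural nullability test on annotated regexes."""
    tag = a[0]
    if tag in ("AONE", "ASTAR"):
        return True
    if tag == "AALTS":
        return any(bnullable(x) for x in a[2])
    if tag == "ASEQ":
        return bnullable(a[2]) and bnullable(a[3])
    return False  # AZERO and ACHAR


def bmkeps(a):
    """Collect the bits along the leftmost nullable path (a must be bnullable)."""
    tag = a[0]
    if tag == "AONE":
        return a[1]
    if tag == "AALTS":
        first = next(x for x in a[2] if bnullable(x))  # leftmost nullable branch
        return a[1] + bmkeps(first)
    if tag == "ASEQ":
        return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    if tag == "ASTAR":
        return a[1] + [S]  # S closes the star: no further iteration
    raise ValueError("bmkeps needs a bnullable argument")


seq = ("ASEQ", [], ("AONE", [Z]), ("ASTAR", [], ("ACHAR", [], "a")))
alts = ("AALTS", [S], [("ACHAR", [Z], "a"), ("AONE", [S, Z])])
seq_bits = bmkeps(seq)
alts_bits = bmkeps(alts)
```

In the alternative example the first branch is a character and therefore skipped, so the collected bits come from the annotation of the alternative followed by those of its second branch.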
-The most central question is how these partial lexing information
+Now we give the central part of this lexing algorithm,
-represented as bit-codes is augmented and carried around
+the $\bder$ function (which stands for \emph{b}itcoded derivative).
-during a derivative is taken.
+Most of the time we use the infix notation $(\_\backslash\_)$
-This is done by adding bitcodes to the
+to mean $\bder$ for brevity when
-derivatives, for example when one more star iteratoin is taken (we
+there is no danger of confusion with derivatives on plain regular expressions.
-call the operation of derivatives on annotated regular expressions $\bder$
+For example, we write $( _{[]}r^* ) \backslash c$ instead of $\bder \;c \; _{[]}r^*$,
-because it is derivatives on regular expressiones with \emph{b}itcodes),
+as the bitcodes at the front of $r^*$ indicate that it is
-we need to unfold it into a sequence,
+a bit-coded regular expression, not a plain one.
-and attach an additional bit $Z$ to the front of $r \backslash c$
+$\bder$ tells us how regular expressions can be recursively traversed,
-to indicate one more star iteration.
+where the bitcodes are augmented and carried around
-\begin{center}
+when a derivative is taken.
-\begin{tabular}{@{}lcl@{}}
-$\bder \; c\; (_{bs}a^*) $ & $\dn$ &
-$_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot
-(_{[]}a^*))$
-\end{tabular}
-\end{center}
-\noindent
-For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when
-there is no danger of confusion with derivatives on plain regular expressions,
-for example, the above can be expressed as
-\begin{center}
-\begin{tabular}{@{}lcl@{}}
-$(_{bs}a^*)\,\backslash c$ & $\dn$ &
-$_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot
-(_{[]}a^*))$
-\end{tabular}
-\end{center}
-\noindent
-Using the picture we used earlier to depict this, the transformation when
-taking a derivative w.r.t a star is like below:
-\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-{$bs$
-\nodepart{two} $a^*$ };
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-&
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-{$v_{\text{previous iterations}}$
-\nodepart{two} $a^*$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\\
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-{ $bs$ + [Z]
-\nodepart{two}  $(a\backslash c )\cdot a^*$ };
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-&
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-{$v_{\text{previous iterations}}$ + 1 more iteration
-\nodepart{two} $(a\backslash c )\cdot a^*$ };
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\end{tabular}
-\noindent
-Another place in the $\bder$ function where it differs
-from normal derivatives (on un-annotated regular expressions)
-is the sequence case:
-\begin{center}
-\begin{tabular}{@{}lcl@{}}
-$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
-$\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
-					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
-& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
-\end{tabular}
-\end{center}
-The difference is that (when $a_1$ is $\bnullable$)
-we use $\bmkeps$ to store the lexing information
-in $a_1$ before collapsing it (as it has been fully matched by string prior to $c$,
-and attach the collected bit-codes to the front of $a_2$
-before throwing away $a_1$. We assume that $\bmkeps$ correctly extracts the bitcode for how $a_1$
-matches the string prior to $c$ (more on this later).
-The bitsequence $\textit{bs}$ which was initially attached to the first element of the sequence
-$a_1 \cdot a_2$, has now been elevated to the top level of teh $\sum$.
-This is because this piece of information will be needed whichever way the sequence is matched,
-regardless of whether $c$ belongs to $a_1$ or $a_2$.
-In the injection-based lexing, $r_1$ is immediately thrown away in
-subsequent derivatives on the right branch (when $r_1$ is $\nullable$),
-\begin{center}
-	$(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$
-\end{center}
-\noindent
-as it knows $r_1$ is stored on stack and available once the recursive
-call to later derivatives finish.
-Therefore, if the $\Right$ branch is taken in a $\POSIX$ match,
-we construct back the sequence value once step back by
-calling a on $\mkeps(r_1)$.
-\begin{center}
-	\begin{tabular}{lcr}
-		$\ldots r_1 \cdot r_2$ & $\rightarrow$ & $r_1\cdot r_2 + r_2 \backslash c \ldots $\\
-		$\ldots \Seq(v_1, v_2) (\Seq(\mkeps(r1), (\inj \; r_2 \; c\; v_{2c})))$ & $\leftarrow$ & $\Right(v_{2c})\ldots$
-	\end{tabular}
-\end{center}
-\noindent
-The rest of the clauses of $\bder$ is rather similar to
-$\der$, and is put together here as a wholesome definition
-for $\bder$:
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
 $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
 $(_{bs}a^*)\,\backslash c$ & $\dn$ &
 $_{bs}((\textit{fuse}\, [Z] \; (a\,\backslash c))\cdot
 (_{[]}a^*))$
 \end{tabular}
 \end{center}
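The clauses of $\bder$ can be tried out with the following Python sketch. The sequence and star clauses follow the definitions in this section. The remaining clauses, the clauses of \texttt{bmkeps}, the tuple encoding and the helper \texttt{bders} for strings are our assumptions and are meant as an illustration only.

```python
# Assumed encoding (not from the text): ("AZERO",) ("AONE", bs) ("ACHAR", bs, c)
# ("AALTS", bs, [a, ...]) ("ASEQ", bs, a1, a2) ("ASTAR", bs, a); bits Z = 0, S = 1.
Z, S = 0, 1


def fuse(bs, a):
    return a if a[0] == "AZERO" else (a[0], bs + a[1]) + a[2:]


def bnullable(a):
    tag = a[0]
    if tag in ("AONE", "ASTAR"):
        return True
    if tag == "AALTS":
        return any(bnullable(x) for x in a[2])
    if tag == "ASEQ":
        return bnullable(a[2]) and bnullable(a[3])
    return False  # AZERO and ACHAR


def bmkeps(a):
    # Assumed clauses: follow the leftmost nullable branch, close stars with S.
    tag = a[0]
    if tag == "AONE":
        return a[1]
    if tag == "AALTS":
        return a[1] + bmkeps(next(x for x in a[2] if bnullable(x)))
    if tag == "ASEQ":
        return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    if tag == "ASTAR":
        return a[1] + [S]
    raise ValueError("bmkeps needs a bnullable argument")


def bder(c, a):
    """Derivative of the annotated regex a with respect to the character c."""
    tag = a[0]
    if tag in ("AZERO", "AONE"):
        return ("AZERO",)
    if tag == "ACHAR":
        return ("AONE", a[1]) if a[2] == c else ("AZERO",)
    if tag == "AALTS":
        return ("AALTS", a[1], [bder(c, x) for x in a[2]])
    if tag == "ASEQ":
        bs, a1, a2 = a[1], a[2], a[3]
        if bnullable(a1):
            # bs moves to the top; the bits of a1 are saved in front of a2 \ c
            return ("AALTS", bs, [("ASEQ", [], bder(c, a1), a2),
                                  fuse(bmkeps(a1), bder(c, a2))])
        return ("ASEQ", bs, bder(c, a1), a2)
    if tag == "ASTAR":
        # unfold one iteration and record it with the bit Z
        return ("ASEQ", a[1], fuse([Z], bder(c, a[2])), ("ASTAR", [], a[2]))
    raise ValueError(f"not an annotated regex: {a!r}")


def bders(s, a):
    """Derivative with respect to a string, one character at a time."""
    for c in s:
        a = bder(c, a)
    return a


star = ("ASTAR", [], ("ACHAR", [], "a"))
once = bder("a", star)
bits_once = bmkeps(once)
bits_twice = bmkeps(bders("aa", star))
```

In this sketch each derivative step on the star adds one $Z$, and \texttt{bmkeps} adds the closing $S$ at the end, so one iteration yields $[Z, S]$ and two iterations yield $[Z, Z, S]$.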
+\noindent
+We give the intuition behind some of the more involved cases in
+$\bder$. For example,
+in the \emph{star} case,
+a derivative on $_{bs}a^*$ means
+that one more star iteration needs to be taken.
+We need to unfold the star into a sequence
+and attach an additional bit $Z$ to the front of $a \backslash c$
+as a record that one new star iteration has been unfolded.
+\noindent
+\begin{center}
+\begin{tabular}{@{}lcl@{}}
+$(_{bs}a^*)\,\backslash c$ & $\dn$ &
+$_{bs}((\underbrace{\textit{fuse}\, [Z] \; (a\,\backslash c)}_{\text{one more iteration}})\cdot
+(_{[]}a^*))$
+\end{tabular}
+\end{center}
+\noindent
+This information will be recovered later by the $\decode$ function.
+The intuition is that the bit $Z$ will be decoded at the right location,
+because we accumulate bits from left to right (a rigorous proof will be given
+later).
+\begin{tikzpicture}[ > = stealth, % arrow head style
+shorten > = 1pt, % don't touch arrow head to node
+semithick % line style
+]
+\tikzstyle{every state}=[
+draw = black,
+thin,
+fill = cyan!29,
+minimum size = 7mm
+]
+\begin{scope}[node distance=1cm and 0cm, every node/.style=state]
+		\node (k) [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+{$bs$
+\nodepart{two} $a^*$ };
+	 \node (l) [below =of k, rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+{ $bs$ + [Z]
+\nodepart{two}  $(a\backslash c )\cdot a^*$ };
+\end{scope}
+\path[->]
+	      (k) edge (l);
+\end{tikzpicture}
+Pictorially, the process looks as follows.
+Like before, the red region denotes
+previous lexing information (stored as bitcodes in $bs$).
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+	\begin{scope}[node distance=1cm]
+		\node (a) [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+{$bs$
+\nodepart{two} $a^*$ };
+	 \node (b) [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+{ $bs$ + [Z]
+\nodepart{two}  $(a\backslash c )\cdot a^*$ };
+%\caption{term 1 \ref{term:1}'s matching configuration}
+	\end{scope}
+\end{tikzpicture}
+\noindent
+Another place in the $\bder$ function where it differs
+from normal derivatives (on un-annotated regular expressions)
+is the sequence case:
+\begin{center}
+\begin{tabular}{@{}lcl@{}}
+$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
+$\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
+					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
+& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
+\end{tabular}
+\end{center}
+The difference is that, when $a_1$ is $\bnullable$,
+we use $\bmkeps$ to extract the lexing information stored
+in $a_1$ (which has been fully matched by the string prior to $c$)
+and attach the collected bit-codes to the front of $a_2 \backslash c$
+before throwing away $a_1$. We assume that $\bmkeps$ correctly extracts the bitcodes for how $a_1$
+matches the string prior to $c$ (more on this later).
+The bit-sequence $\textit{bs}$, which was initially attached to the front of the sequence
+$a_1 \cdot a_2$, has now been elevated to the top level of the $\sum$.
+This is because this piece of information will be needed whichever way the sequence is matched,
+regardless of whether $c$ belongs to $a_1$ or $a_2$.
+In the injection-based lexing, $r_1$ is immediately thrown away in
+subsequent derivatives on the right branch (when $r_1$ is $\nullable$),
+\begin{center}
+	$(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$
+\end{center}
+\noindent
+as $r_1$ is kept on the stack and is available again once the recursive
+calls for the later derivatives have finished.
+Therefore, if the $\Right$ branch is taken in a $\POSIX$ match,
+we reconstruct the sequence value one step back by
+calling $\mkeps$ on $r_1$.
+\begin{center}
+	\begin{tabular}{lcr}
+		$\ldots r_1 \cdot r_2$ & $\rightarrow$ & $(r_1 \backslash c)\cdot r_2 + r_2 \backslash c \ldots $\\
+		$\ldots \Seq(v_1, v_2) (\Seq(\mkeps(r_1), (\inj \; r_2 \; c\; v_{2c})))$ & $\leftarrow$ & $\Right(v_{2c})\ldots$
+	\end{tabular}
+\end{center}
+\noindent
+The remaining clauses of $\bder$ are similar to those of
+$\der$; together with the two cases above they form the complete definition
+of $\bder$.
 \noindent
 Generalising the derivative operation with bitcodes to strings, we have
 \begin{center}
 	\begin{tabular}{@{}lcl@{}}
 		$a\backslash_s [] $ & $\dn$ & $a$\\

changeset 575	3178f0e948ac
parent 564	3cbcd7cda0a9
child 576	3e1b699696b6