# HG changeset patch # User Chengsong # Date 1661187509 -3600 # Node ID 3e19073e91f4d549fef7799e9e39831393381d44 # Parent 9db2500629be353842773c0ce2cdbcac0b286e78 chap3 done diff -r 9db2500629be -r 3e19073e91f4 ChengsongTanPhdThesis/Chapters/Bitcoded1.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Sat Aug 20 23:49:17 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Mon Aug 22 17:58:29 2022 +0100 @@ -1104,16 +1104,21 @@ \vspace{20mm} \noindent We encircle in the diagram all the pairs $v_i, a_i$ to show that these values -and regular expressions correspond to each other. -For the leftmost pair, we have that the +and regular expressions share the same structure. +These pairs all contain adequate information to +obtain the final lexing result. +For example, in the leftmost pair the lexical information is condensed in -$v_0$--the value part, whereas for the rightmost pair, -the lexing result is in the bitcodes of $a_n$. +$v_0$, whereas for the rightmost pair, +the lexing result hides is in the bitcodes of $a_n$.\\ $\bmkeps$ is able to extract from $a_n$ the result -by looking for nullable parts of the regular expression, -however for regular expressions $a_i$ in general -they might not necessarily be nullable and therefore -needs some ``help'' finding the POSIX bit-encoding. +by looking for nullable parts of the regular expression. +However for regular expressions $a_i$ in general, +they might not necessarily be nullable. +For those regular expressions that are not nullable, +$\bmkeps$ will not work. +Therefore they need some additional ``help'' +finding the POSIX bit-encoding. The most straightforward ``help'' comes from $a_i$'s corresponding value $v_i$, and this suggests a function $f$ satisfying the following properties: @@ -1121,57 +1126,84 @@ \item $f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$ \item - $f \; a_i\;v_i = f \; a_0 \; v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$ + $f \; a_i\;v_i = f \; a_0 \; v_0 = \code \; v_0$ %\decode \;(\code \; v_0) \; (\erase \; a_0)$ \end{itemize} \noindent -If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$, -The core of the function $f$ is something that produces the bitcodes -$\code \; v_0$. -It is unclear how, but Sulzmann and Lu came up with a function satisfying all the above +Sulzmann and Lu came up with a function satisfying +the above requirements, named \emph{retrieve}: - - - \begin{center} -\begin{tabular}{lcr} - $\retrieve \; \, (_{bs} \ONE) \; \, (\Empty)$ & $\dn$ & $\textit{bs}$\\ - $\retrieve \; \, (_{bs} \mathbf{c} ) \; \Char(c)$ & $\dn$ & $ \textit{bs}$\\ - $\retrieve \; \, (_{bs} a_1 \cdot a_2) \; \Seq(v_1, v_2)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a_1\; v_1) @ (\retrieve \; a_2 \; v_2)$\\ - $\retrieve \; \, (_{bs} \Sigma (a :: \textit{as}) \; \,\Left(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v)$\\ - $\retrieve \; \, (_{bs} \Sigma (a :: \textit{as} \; \, \Right(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; (_{[]}\Sigma \textit{as}) \; v)$\\ - $\retrieve \; \, (_{bs} a^*) \; \, (\Stars(v :: vs)) $ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v) @ (\retrieve \; (_{[]} a^*) \; (\Stars(vs)))$\\ - $\retrieve \; \, (_{bs} a^*) \; \, (\Stars([]) $ & $\dn$ & $\textit{bs} @ [Z]$ +\begin{tabular}{llcl} + $\retrieve \; \, _{bs} \ONE$ & $ \Empty$ & $\dn$ & $\textit{bs}$\\ + + $\retrieve \; \, _{bs} \mathbf{c}$ & $ (\Char \; c) $ & $\dn$ & $ \textit{bs}$\\ + + $\retrieve \; \, _{bs} a_1 \cdot a_2$ & $ (\Seq \; v_1 \; v_2) $ & + $\dn$ & $\textit{bs} \; @\; (\retrieve \; a_1\; v_1)\; @ \;(\retrieve \; a_2 \; v_2)$\\ + + $\retrieve \; \, _{bs} \Sigma (a :: \textit{as})$ & $ (\Left \; v) $ & + $\dn$ & $\textit{bs}\; @\; (\retrieve \; a \; v)$\\ + + $\retrieve \; \, _{bs} \Sigma (a :: \textit{as})$ & $ (\Right \; v) $ & + $\dn$ & $\textit{bs}\; @\; (\retrieve \; (_{[]}\Sigma \textit{as}) \; v)$\\ + + $\retrieve \; \, _{bs} a^* $ & $ (\Stars \; (v :: vs)) $ & $\dn$ & + $\textit{bs}\; @\; (\retrieve \; a \; v)\; @ \; + (\retrieve \; (_{[]} a^*) \; (\Stars \; vs) )$\\ + + $\retrieve \; \, _{bs} a^* $ & $ (\Stars \; []) $ & $\dn$ & $\textit{bs} \; @ \; [Z]$ \end{tabular} \end{center} \noindent As promised, $\retrieve$ collects the right bit-codes from the -final derivative $a_n$: -\begin{lemma} +final derivative $a_n$, guided by $v_n$: +\begin{lemma}\label{bmkepsRetrieve} $\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$ \end{lemma} \begin{proof} By a routine induction on $a$. \end{proof} -The design of $\retrieve$ enables extraction of bit-codes -from not only $\bnullable$ (annotated) regular expressions, -but also those that are not $\bnullable$. -For example, if we have the regular expression just internalised +\noindent +The design of $\retrieve$ enables us to extract bitcodes +from both annotated regular expressions and values. +$\retrieve$ always ``sucks up'' all the information. +When more information is stored in the value, we would be able to +extract more from the value, and vice versa. +For example in star iterations, $\retrieve$ will be able to extract from $\Stars \; vs$ +exactly the same bitcodes as $\code \; (\Stars \; vs)$: +\begin{lemma} + If $\forall v \in vs. \vdash v : r$, and $\code \; v = \retrieve \; (\internalise\; r) \; v$\\ + then $\code \; (\Stars \; vs) = \retrieve \; _{[]}((\internalise \; r)^*) \; (\Stars \; vs)$ +\end{lemma}\label{retrieveEncodeSTARS} +\begin{proof} + By induction on the value list $vs$. +\end{proof} +\noindent +Similarly, when the value list does not hold information, +only the bitcodes plus some delimiter will be recorded: +\begin{center} + $\retrieve \; _{bs}((\internalise \; r)^*) \; (\Stars \; [] ) = bs @ [S]$. +\end{center} +In general, if we have a regular expression just internalised and the lexing result value, we could $\retrieve$ the bitcdoes -that look as if we have en$\code$-ed the value: +that look as if we have en$\code$-ed the value as bitcodes: \begin{lemma} $\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$ \end{lemma} \begin{proof} By induction on $r$. + The star case uses lemma \ref{retrieveEncodeSTARS}. \end{proof} \noindent The following property is more interesting, as -it provides a "bridge" between $a_0, v_0$ and $a_n, v_n$ in the -lexing diagram. -If you take derivative of an annotated regular expression, -you can $\retrieve$ the same bit-codes as before the derivative took place, -provided that you use the corresponding value: - +it provides a "bridge" between $a_0$ and $a_n $ in the +lexing diagram by linking adjacent regular expressions $a_i$ and +$a_{i+1}$. +The property says that one +can retrieve the same bits from a derivative +regular expression as those from +before the derivative took place, +provided that the right values are used respectively: \begin{lemma}\label{retrieveStepwise} $\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ \end{lemma} @@ -1180,108 +1212,127 @@ The induction principle is function $\erase$'s cases. \end{proof} \noindent -$\retrieve$ is connected to the $\blexer$ in the following way: +Before we move on to the next +helper function, we rewrite $\blexer$ in +the following way which makes later proofs more convenient: \begin{lemma}\label{blexer_retrieve} $\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$ \end{lemma} +\begin{proof} + Using lemma \ref{bmkepsRetrieve}. +\end{proof} \noindent -$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regular expressiones of $\blexer$. -For plain regular expressions something similar is required as well. +$\retrieve$ allows connecting +between the intermediate +results $a_i$ and $a_{i+1}$ in $\blexer$. +For plain regular expressions something similar is required. \subsection{$\flex$} Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$, defined as \begin{center} -\begin{tabular}{lcr} +\begin{tabular}{lcl} $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ -$\flex \; r \; f \; c :: s \; v$ & $=$ & $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$ +$\flex \; r \; f \; (c :: s) \; v$ & $=$ & $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$ \end{tabular} \end{center} -which accumulates the characters that needs to be injected back, +which accumulates the characters that need to be injected back, and does the injection in a stack-like manner (last taken derivative first injected). -$\flex$ is connected to the $\lexer$: +$\flex$ can calculate what $\lexer$ calculates, given the input regular expression +$r$, the identity function $id$, the input string $s$ and the value +$v_n= \mkeps \; (r\backslash s)$: \begin{lemma} -$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$ + $\flex \;\; r \;\; \textit{id}\;\; s \;\; (\mkeps \;(r\backslash s)) = \lexer \; r \; s$ \end{lemma} \begin{proof} By reverse induction on $s$. \end{proof} -$\flex$ provides us a bridge between $\lexer$'s intermediate steps. -What is even better about $\flex$ is that it allows us to -directly operate on the value $\mkeps (r\backslash v)$, -which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument. -When the value created by $\mkeps$ becomes available, one can -prove some stepwise properties of lexing nicely: +\noindent +The main advantage of using $\flex$ +in addition to $\lexer$ is that +$\flex$ makes the value $v$ and function $f$ +that manipulates the value explicit parameters, +which ``exposes'' $\lexer$'s recursive call +arguments and allows us to ``set breakpoints'' and ``resume'' +at any point during $\lexer$'s recursive calls.\\ +The following stepwise property holds. \begin{lemma}\label{flexStepwise} -$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $ -\end{lemma} -\begin{proof} - By induction on the shape of $r\backslash s$ -\end{proof} -\noindent -With $\flex$ and $\retrieve$ ready, we are ready to connect $\lexer$ and $\blexer$ . - -\subsection{Correctness Proof of Bit-coded Algorithm} -\begin{lemma}\label{flex_retrieve} -$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ + $\textit{flex} \;\; r \;\; f \;\; (s@[c]) \;\; v = \flex \;\; r \;\; f \;\; s \;\; (\inj \; (r\backslash s) \; c \; v) $ \end{lemma} \begin{proof} -By induction on $s$. The induction tactic is reverse induction on strings. -$v$ is allowed to be arbitrary. -The crucial point is to rewrite -\[ -\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) -\] -as -\[ -\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash s@[c])) -\]. -This enables us to equate -\[ -\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) -\] -with -\[ -\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c]))) -\], -which in turn can be rewritten as -\[ -\flex \; r \; \textit{id} \; s@[c] \; (\mkeps (r\backslash s@[c])) -\]. + By induction on the shape of $r\backslash s$. \end{proof} - -With the above lemma we can now link $\flex$ and $\blexer$. +\noindent +With $\flex$ and $\retrieve$, +we are ready to connect $\lexer$ and $\blexer$, +giving the correctness proof. %---------------------------------------------------------------------------------------- % SECTION correctness proof %---------------------------------------------------------------------------------------- \section{Correctness of Bit-coded Algorithm (Without Simplification)} -We now give the proof the correctness of the algorithm with bit-codes. - -\begin{lemma}\label{flex_blexer} -$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$ +$\flex$ and $\blexer$ essentially calculates the same thing. +\begin{lemma}\label{flex_retrieve} + If $\vdash v: (r\backslash s)$, then\\ +$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ \end{lemma} \begin{proof} -Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}. +By induction on $s$. +We prove the interesting case where +both $\flex$ and $\decode$ successfully terminates +with some value. +We take advantage of the stepwise properties +both sides. +The induction tactic is reverse induction on string $s$. +The inductive hypothesis says that $\flex \; r \; id \;s \; v = +\decode \; (\retrieve \; (r\backslash s) \; v) \; r$ holds, +where $v$ can be any value satisfying +the assumption $\vdash v: (r\backslash s)$. +The crucial point is to rewrite +\[ + \retrieve \;\; (r \backslash (s@[c])) \;\; (\mkeps\; (r \backslash (s@[c]) )) +\] +as +\[ + \retrieve \;\; (r \backslash s) + \;\; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash (s@[c]))) +\] +using lemma \ref{retrieveStepwise}. +This enables us to equate +\[ + \retrieve \; (r \backslash (s@[c])) \; (\mkeps \; (r \backslash (s@[c]) )) +\] +with +\[ +\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c]))) +\] +using IH, +which in turn can be rewritten as +\[ + \flex \; r \; \textit{id} \; (s@[c]) \; (\mkeps \; (r \backslash (s@[c]))) +\]. \end{proof} -Finally the correctness of $\blexer$ is given as it outputs the same result as $\lexer$: +\noindent +With this pivotal lemma we can now link $\flex$ and $\blexer$ +and finally give the correctness of $\blexer$--it outputs the same result as $\lexer$: \begin{theorem} $\blexer\; r \; s = \lexer \; r \; s$ \end{theorem} \begin{proof} - Straightforward corollary of \ref{flex_blexer}. + From \ref{flex_retrieve} we have that + $\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$, + which immediately implies this theorem. \end{proof} \noindent -To piece things together and spell out the exact correctness -of the bitcoded lexer -in terms of producing POSIX values, +To piece things together and spell out the correctness +result of the bitcoded lexer more explicitly, we use the fact from the previous chapter that \[ - If \; (r, s) \rightarrow v \; then \; \lexer \; r \; s = v + (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s = v \] to obtain this corollary: \begin{corollary}\label{blexerPOSIX} - $If \; (r, s) \rightarrow v \; then \blexer \; r \; s = v$ + $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = v$ \end{corollary} Our main reason for wanting a bit-coded algorithm over the injection-based one is for its capabilities of allowing diff -r 9db2500629be -r 3e19073e91f4 ChengsongTanPhdThesis/Chapters/Bitcoded2.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Sat Aug 20 23:49:17 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Mon Aug 22 17:58:29 2022 +0100 @@ -659,3 +659,4 @@ some sense, for exampe, not grinding to a halt at certain cases. In the next chapter we shall prove that for a given $r$, the internal derivative size is always finitely bounded by a constant. +we would expect in the