lexing: comparison ChengsongTanPhdThesis/Chapters/Finite.tex


-:83fab852d72d
+:62f8fa03863e
 	\item
 		We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
 		The key observation is that $\distinctBy$'s output is
 		a list with a constant length bound.
 \end{itemize}
-We give details of these steps in the next sections.
+We will expand on these steps in the next sections.\\
-The first step is to use
-$\textit{rrexp}$s,
-something simpler than
-annotated regular expressions.
 \section{The $\textit{Rrexp}$ Datatype and Its Lexing-Related Functions}
-We first recap the definition of the new datatype $\rrexp$, called
+The first step is to define
+$\textit{rrexp}$s.
+They carry no bitcodes,
+which allows a much simpler size bound proof.
+Of course, the bits which encode the lexing information
+would grow linearly with respect
+to the input, which must be taken into account when we tackle the runtime complexity.
+But for the purpose of bounding the structural size
+we can safely ignore them.\\
+To recapitulate, the datatype
+definition of the $\rrexp$, called
 \emph{r-regular expressions},
-which we first defined in \ref{rrexpDef}.
+was initially defined in \ref{rrexpDef}.
-R-regular expressions are similar to basic regular expressions.
+The reason for the prefix $r$ is
-We call them \emph{r}-regular expressions
 to make a distinction
-with plain regular expressions.
+with basic regular expressions.
 \[			\rrexp ::=   \RZERO \mid  \RONE
 	\mid  \RCHAR{c}
 	\mid  \RSEQ{r_1}{r_2}
 	\mid  \RALTS{rs}
 	\mid \RSTAR{r}
 \end{center}
 \noindent
 The $r$ in the subscript of $\llbracket \rrbracket_r$ is to
 distinguish it from the same operation on annotated regular expressions.
 The subscript $r$ will be used in
-other operations as well.
+other operations as well.\\
 The transformation from an annotated regular expression
 to an r-regular expression is straightforward.
 \begin{center}
 	\begin{tabular}{lcl}
 		$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
 		$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
 		$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$
 	\end{tabular}
 \end{center}
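 \noindent
 A small illustrative instance, using only the three clauses shown above
 and writing $bs$ and $bs'$ for arbitrary bitcodes:
 \begin{center}
 	$\rerase{_{bs}\sum [\ZERO,\; _{bs'}\ZERO^*]}
 	\;=\; \RALTS{[\rerase{\ZERO},\; \rerase{_{bs'}\ZERO^*}]}
 	\;=\; \RALTS{[\RZERO,\; \RZERO^*]}$
 \end{center}
 The bitcodes $bs$ and $bs'$ disappear, while the list structure of the alternative is preserved.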
 \noindent
-$\textit{Rerase}$ throws away the bitcodes on the annotated regular expressions,
+$\textit{Rerase}$ throws away the bitcodes
+on the annotated regular expressions,
 but keeps everything else intact.
+Therefore it does not change the size
-Before we introduce more functions related to r-regular expressions,
+of an annotated regular expression:
-we first give out the reason why we take all the trouble
+\begin{lemma}\label{rsizeAsize}
-defining a new datatype in the first place.
-We could calculate the size of annotated regular expressions in terms of
-their un-annotated $\rrexp$ counterpart:
-\begin{lemma}
 	$\rsize{\rerase a} = \asize a$
 \end{lemma}
 \begin{proof}
 	By routine structural induction on $a$.
 \end{proof}
 \noindent
 \subsection{Motivation Behind a New Datatype}
-The main difference between a plain regular expression
+The reason we take all the trouble
-and an r-regular expression is that it can take
+defining a new datatype is that $\erase$ makes things harder.
-non-binary arguments for its alternative constructor.
-This turned out to be necessary if we want our proofs to be
-simple.
 We initially started by using
 plain regular expressions and tried to prove
-equalities like
+the lemma \ref{rsizeAsize},
-\begin{center}
+however the $\erase$ function unavoidably changes the structure of the
-	$\llbracket a \rrbracket = \llbracket a\downarrow \rrbracket_p$
+annotated regular expression.
-\end{center}
+The $+$ constructor
-and
+of basic regular expressions is binary whereas $\sum$
-\[
+takes a list, and one has to convert between them:
-	\llbracket a \backslash_{bsimps} s \rrbracket =
+\begin{center}
-	\llbracket a \downarrow \backslash_s s
+	\begin{tabular}{ccc}
-\]
+		$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
-One might be able to prove that
+		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
-$\llbracket a \downarrow \rrbracket_p \leq \llbracket a \rrbracket$.
+		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ is non-empty}$
-$\rrexp$ give the exact correspondence between an annotated regular expression
+	\end{tabular}
-and its (r-)erased version:
+\end{center}
+\noindent
-This does not hold for plain $\rexp$s.
+An alternative regular expression with an empty argument list
+will be turned into a $\ZERO$.
+The singleton alternative $\sum [r]$ is turned into $r$ by the
-These operations are
+$\erase$ function.
-almost identical to those of the annotated regular expressions,
+The annotated regular expression $\sum[a, b, c]$ turns into
-except that no bitcodes are attached.
+$(a+(b+c))$.
-Of course, the bits which encode the lexing information would grow linearly with respect
+All these operations change the size and structure of
-to the input, which should be taken into account when we wish to tackle the runtime comlexity.
+an annotated regular expression, adding unnecessary
-But at the current stage
+complexity to the size bound proof.\\
-we can safely ignore them.
+For example, if we define the size of a basic regular expression
-Similarly there is a size function for plain regular expressions:
+in the usual way,
 \begin{center}
 	\begin{tabular}{ccc}
 		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
 		$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
 		$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
 		$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
 		$\llbracket r_1 + r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
 		$\llbracket r^* \rrbracket_p $ & $\dn$ & $\llbracket r \rrbracket_p + 1$
 	\end{tabular}
 \end{center}
+\noindent
-\noindent
+Then the property
-The idea of obatining a bound for $\llbracket \bderssimp{a}{s} \rrbracket$
+\begin{center}
-is to get an equivalent form
+	$\llbracket a \rrbracket = \llbracket a_\downarrow \rrbracket_p$
-of something like $\llbracket \bderssimp{a}{s}\rrbracket = f(a, s)$, where $f(a, s)$
+\end{center}
-is easier to estimate than $\llbracket \bderssimp{a}{s}\rrbracket$.
+does not hold.
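 \noindent
 A small instance illustrates the mismatch. Assume, as a sketch, that the size of
 an annotated alternative is the sum of its children's sizes plus one, that
 characters have size one, and that the binary alternative of basic regular
 expressions has size
 $\llbracket r_1 + r_2 \rrbracket_p = \llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$.
 Then for three characters $a$, $b$ and $c$ (bitcodes omitted):
 \begin{center}
 	\begin{tabular}{lcl}
 		$\llbracket \sum[a, b, c] \rrbracket$ & $=$ & $1 + 1 + 1 + 1 = 4$\\
 		$\llbracket a + (b + c) \rrbracket_p$ & $=$ & $1 + (1 + 1 + 1) + 1 = 5$
 	\end{tabular}
 \end{center}
 The flat list needs a single alternative node, whereas the nested form needs two $+$ nodes.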
-We notice that while it is not so clear how to obtain
+One might be able to prove an inequality such as
-a metamorphic representation of $\bderssimp{a}{s}$ (as we argued in chapter \ref{Bitcoded2},
+$\llbracket a \rrbracket  \leq \llbracket  a_\downarrow \rrbracket_p $
-not interleaving the application of the functions $\backslash$ and $\bsimp{\_}$
+and then estimate $\llbracket  a_\downarrow \rrbracket_p$,
-in the order as our lexer will result in the bit-codes dispensed differently),
+but we found our approach more straightforward.\\
-it is possible to get an slightly different representation of the unlifted versions:
-$ (\bderssimp{a}{s})_\downarrow = (\erase \; \bsimp{a \backslash s})_\downarrow$.
+\subsection{Lexing Related Functions for $\rrexp$ such as $\backslash_r$, $\rdistincts$, and $\rsimp$}
-This suggest setting the bounding function $f(a, s)$ as
+The operations on r-regular expressions are
-$\llbracket  (a \backslash s)_\downarrow \rrbracket_p$, the plain size
+almost identical to those of the annotated regular expressions,
-of the erased annotated regular expression.
+except that no bitcodes are used. For example,
-This requires the the regular expression accompanied by bitcodes
+the derivative operation becomes simpler:\\
-to have the same size as its plain counterpart after erasure:
-\begin{center}
-	$\asize{a} \stackrel{?}{=} \llbracket \erase(a)\rrbracket_p$.
-\end{center}
-\noindent
-But there is a minor nuisance:
-the erase function unavoidbly messes with the structure of the regular expression,
-due to the discrepancy between annotated regular expression's $\sum$ constructor
-and plain regular expression's $+$ constructor having different arity.
-\begin{center}
-	\begin{tabular}{ccc}
-		$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
-		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
-		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
-	\end{tabular}
-\end{center}
-\noindent
-An alternative regular expression with an empty list of children
-is turned into a $\ZERO$ during the
-$\erase$ function, thereby changing the size and structure of the regex.
-Therefore the equality in question does not hold.
-These will likely be fixable if we really want to use plain $\rexp$s for dealing
-with size, but we choose a more straightforward (or stupid) method by
-Similarly we could define the derivative  and simplification on
-$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
-except that now they can operate on alternatives taking multiple arguments.
-\begin{center}
-	\begin{tabular}{lcr}
-		$(\RALTS{rs})\; \backslash c$ & $\dn$ &  $\RALTS{\map\; (\_ \backslash c) \;rs}$\\
-		(other clauses omitted)
-		With the new $\rrexp$ datatype in place, one can define its size function,
-		which precisely mirrors that of the annotated regular expressions:
-	\end{tabular}
-\end{center}
-\noindent
-\begin{center}
-	\begin{tabular}{ccc}
-		$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
-		$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
-		$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
-		$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
-		$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as   + 1$\\
-		$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$
-	\end{tabular}
-\end{center}
-\noindent
-\subsection{Lexing Related Functions for $\rrexp$}
-Everything else for $\rrexp$ will be precisely the same for annotated expressions,
-except that they do not involve rectifying and augmenting bit-encoded tokenization information.
-As expected, most functions are simpler, such as the derivative:
 \begin{center}
 	\begin{tabular}{@{}lcl@{}}
 		$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
 		$(\ONE)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
 		$(\mathbf{d})\,\backslash_r c$ & $\dn$ &
 		$\textit{if}\;c=d\; \;\textit{then}\;
 		\ONE\;\textit{else}\;\ZERO$\\
 		$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
 		$\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
 		$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
-		$\textit{if}\;\textit{rnullable}\,r_1$\\
+		$\textit{if}\;(\textit{rnullable}\,r_1)$\\
 						 & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
 						 & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
 						 & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
 		$(r^*)\,\backslash_r c$ & $\dn$ &
 		$(r\,\backslash_r c)\cdot
 		r^*$
 	\end{tabular}
 \end{center}
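 \noindent
 As a worked instance (a sketch assuming the standard clauses
 $\ONE\,\backslash_r c = \ZERO$ and $\mathbf{c}\,\backslash_r c = \ONE$,
 and that $\textit{rnullable}\,\ONE$ holds),
 the sequence clause unfolds as follows:
 \begin{center}
 	$(\ONE \cdot \mathbf{c})\,\backslash_r c
 	\;=\; \sum\,[(\ONE\,\backslash_r c)\cdot \mathbf{c},\; \mathbf{c}\,\backslash_r c]
 	\;=\; \sum\,[\ZERO \cdot \mathbf{c},\; \ONE]$
 \end{center}
 No bitcodes need to be attached or moved at any step; simplification is not applied here.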
 \noindent
-The simplification function is simplified without annotation causing superficial differences.
+Similarly, $\distinctBy$ does not need
-Duplicate removal without  an equivalence relation:
+a function checking equivalence because
+there are no bit annotations causing superficial differences
+between syntactically equal terms.
 \begin{center}
 	\begin{tabular}{lcl}
 		$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
-		$\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
+		$\rdistinct{r :: rs}{rset}$ & $\dn$ &
-					    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
+		$\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
+					    &        & $\textit{else}\; \;
+					    r::\rdistinct{rs}{(rset \cup \{r\})}$
 	\end{tabular}
 \end{center}
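 \noindent
 For instance, assuming $r_1 \neq r_2$ and starting from the empty accumulator set
 (an illustrative unfolding of the two clauses above):
 \begin{center}
 	\begin{tabular}{lcl}
 		$\rdistinct{[r_1, r_2, r_1]}{\emptyset}$ & $=$ & $r_1 :: \rdistinct{[r_2, r_1]}{\{r_1\}}$\\
 		& $=$ & $r_1 :: r_2 :: \rdistinct{[r_1]}{\{r_1, r_2\}}$\\
 		& $=$ & $r_1 :: r_2 :: \rdistinct{[]}{\{r_1, r_2\}}$\\
 		& $=$ & $[r_1, r_2]$
 	\end{tabular}
 \end{center}
 The second occurrence of $r_1$ is dropped because it is already in the accumulator.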
 %TODO: definition of rsimp (maybe only the alternative clause)
 \noindent
-The prefix $r$ in front of $\rdistinct{}{}$ is used mainly to
+Notice there is a difference between our $\rdistincts$ and
-differentiate with $\textit{distinct}$, which is a built-in predicate
+the Isabelle $\textit {distinct}$ function.
-in Isabelle that says all the elements of a list are unique.
+In Isabelle $\textit{distinct}$ is a predicate
+that tests if all the elements of a list are unique.\\
 With $\textit{rdistinct}$ one can chain together all the other modules
 of $\bsimp{\_}$ (removing the functionalities related to bit-sequences)
 and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$.
 We omit these functions, as they are routine. Please refer to the formalisation
 (in file BasicIdentities.thy) for the exact definition.

changeset 594	62f8fa03863e
parent 593	83fab852d72d
child 595	fa92124d1fb7