lexing: comparison ChengsongTanPhdThesis/Chapters/Finite.tex

equal deleted inserted replaced

-:a7344c9afbaf
+:b2bea5968b89
 %regex's derivatives.
 In this chapter we give a guarantee in terms of time complexity:
 given a regular expression $r$, for any string $s$
 our algorithm's internal data structure is finitely bounded.
+Note that it is not immediately obvious that $\llbracket \bderssimp{r}{s} \rrbracket$ (the internal
+data structure used in our $\blexer$)
+is bounded by a constant $N_r$, where $N$ only depends on the regular expression
+$r$, not the string $s$.
+When doing a time complexity analysis of any
+lexer/parser based on Brzozowski derivatives, one needs to take into account that
+not only the "derivative steps".
+%TODO: get a grpah for internal data structure growing arbitrary large
 To obtain such a proof, we need to
 \begin{itemize}
 \item
 Define an new datatype for regular expressions that makes it easy
 to reason about the size of an annotated regular expression.
 We have a size function for bitcoded regular expressions, written
 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
 \begin{center}
 \begin{tabular}{ccc}
-$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
+	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
-\end{tabular}
+	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
-\end{center}
+	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
-(TODO: COMPLETE this defn and for $rs$)
+$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
-The size is based on a recursive function on the structure of the regex,
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
-not the bitcodes.
+\end{tabular}
-Therefore we may as well talk about size of an annotated regular expression
+\end{center}
-in their un-annotated form:
-\begin{center}
+\noindent
-$\asize(a) = \size(\erase(a))$.
+Similarly there is a size function for plain regular expressions:
-\end{center}
-(TODO: turn equals to roughly equals)
-But there is a minor nuisance here:
-the erase function actually messes with the structure of the regular expression:
 \begin{center}
 \begin{tabular}{ccc}
-$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
+	$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
-\end{tabular}
+	$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
-\end{center}
+	$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
-(TODO: complete this definition with singleton r in alts)
+$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
+$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
+$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$
+\end{tabular}
+\end{center}
+\noindent
+The idea of obatining a bound for $\llbracket \bderssimp{a}{s} \rrbracket$
+is to get an equivalent form
+of something like $\llbracket \bderssimp{a}{s}\rrbracket = f(a, s)$, where $f(a, s)$
+is easier to estimate than $\llbracket \bderssimp{a}{s}\rrbracket$.
+We notice that while it is not so clear how to obtain
+a metamorphic representation of $\bderssimp{a}{s}$ (as we argued in chapter \ref{Bitcoded2},
+not interleaving the application of the functions $\backslash$ and $\bsimp{\_}$
+in the order as our lexer will result in the bit-codes dispensed differently),
+it is possible to get an slightly different representation of the unlifted versions:
+$ (\bderssimp{a}{s})_\downarrow = (\erase \; \bsimp{a \backslash s})_\downarrow$.
+This suggest setting the bounding function $f(a, s)$ as
+$\llbracket  (a \backslash s)_\downarrow \rrbracket_p$, the plain size
+of the erased annotated regular expression.
+This requires the the regular expression accompanied by bitcodes
+to have the same size as its plain counterpart after erasure:
+\begin{center}
+	$\asize{a} \stackrel{?}{=} \llbracket \erase(a)\rrbracket_p$.
+\end{center}
+\noindent
+But there is a minor nuisance:
+the erase function unavoidbly messes with the structure of the regular expression,
+due to the discrepancy between annotated regular expression's $\sum$ constructor
+and plain regular expression's $+$ constructor having different arity.
+\begin{center}
+\begin{tabular}{ccc}
+$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
+$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
+$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
+\end{tabular}
+\end{center}
+\noindent
 An alternative regular expression with an empty list of children
-is turned into an $\ZERO$ during the
+is turned into a $\ZERO$ during the
 $\erase$ function, thereby changing the size and structure of the regex.
+Therefore the equality in question does not hold.
 These will likely be fixable if we really want to use plain $\rexp$s for dealing
 with size, but we choose a more straightforward (or stupid) method by
 defining a new datatype that is similar to plain $\rexp$s but can take
 non-binary arguments for its alternative constructor,
 which we call $\rrexp$ to denote
 			 \mid \RSTAR{r}
 \]
 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
 but keep everything else intact.
 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
-$\ALTS$).
 We denote the operation of erasing the bits and turning an annotated regular expression
 into an $\rrexp{}$ as $\rerase{}$.
 \begin{center}
 \begin{tabular}{lcl}
-$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
+$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
-$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
+$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
-$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
+	$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
-\end{tabular}
+$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-\end{center}
+$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
-%TODO: FINISH definition of rerase
+$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
+\end{tabular}
+\end{center}
+\noindent
+$\rrexp$ give the exact correspondence between an annotated regular expression
+and its (r-)erased version:
+\begin{lemma}
+$\rsize{\rerase a} = \asize a$
+\end{lemma}
+This does not hold for plain $\rexp$s.
 Similarly we could define the derivative  and simplification on
 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
 except that now they can operate on alternatives taking multiple arguments.
 \begin{center}
 \begin{tabular}{lcr}
-$\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
+	$(\RALTS{rs})\; \backslash c$ & $\dn$ &  $\RALTS{\map\; (\_ \backslash c) \;rs}$\\
 (other clauses omitted)
-\end{tabular}
+With the new $\rrexp$ datatype in place, one can define its size function,
-\end{center}
+which precisely mirrors that of the annotated regular expressions:
+\end{tabular}
-Now that $\rrexp$s do not have bitcodes on them, we can do the
+\end{center}
-duplicate removal without  an equivalence relation:
+\noindent
+\begin{center}
+\begin{tabular}{ccc}
+	$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
+	$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
+	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as   + 1$\\
+$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$
+\end{tabular}
+\end{center}
+\noindent
+\subsection{Lexing Related Functions for $\rrexp$}
+Everything else for $\rrexp$ will be precisely the same for annotated expressions,
+except that they do not involve rectifying and augmenting bit-encoded tokenization information.
+As expected, most functions are simpler, such as the derivative:
+\begin{center}
+\begin{tabular}{@{}lcl@{}}
+$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
+$(\ONE)\,\backslash_r c$ & $\dn$ &
+$\textit{if}\;c=d\; \;\textit{then}\;
+\ONE\;\textit{else}\;\ZERO$\\
+$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
+$\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
+$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
+$\textit{if}\;\textit{rnullable}\,r_1$\\
+					       & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
+					       & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
+& &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
+$(r^*)\,\backslash_r c$ & $\dn$ &
+$( r\,\backslash_r c)\cdot
+(_{[]}r^*))$
+\end{tabular}
+\end{center}
+\noindent
+The simplification function is simplified without annotation causing superficial differences.
+Duplicate removal without  an equivalence relation:
 \begin{center}
 \begin{tabular}{lcl}
-$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
+	$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
+$\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
 			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
 \end{tabular}
 \end{center}
 %TODO: definition of rsimp (maybe only the alternative clause)
+\noindent
+With $\textit{rdistinct}$ one can chain together all the other modules
-The reason why these definitions  mirror precisely the corresponding operations
+of $\bsimp{\_}$ (removing the functionalities related to bit-sequences)
-on annotated regualar expressions is that we can calculate sizes more easily:
+and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$.
+We omit these functions, as they are routine. Please refer to the formalisation
-\begin{lemma}
+(in file BasicIdentities.thy) for the exact definition.
-$\rsize{\rerase a} = \asize a$
+With $\rrexp$ the size caclulation of annotated regular expressions'
-\end{lemma}
+simplification and derivatives can be done by the size of their unlifted
-This lemma says that the size of an r-erased regex is the same as the annotated regex.
+counterpart with the unlifted version of simplification and derivatives applied.
-this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
+\begin{lemma}
-\begin{lemma}
+	\begin{itemize}
+		\item
 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
-\end{lemma}
+\item
-Putting these two together we get a property that allows us to estimate the
-size of an annotated regular expression derivative using its un-annotated counterpart:
-\begin{lemma}
 $\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
-\end{lemma}
+\end{itemize}
+\end{lemma}
+\noindent
+In the following content, we will focus on $\rrexp$'s size bound.
+We will piece together this bound and show the same bound for annotated regular
+expressions in the end.
 Unless stated otherwise in this chapter all $\textit{rexp}$s without
 bitcodes are seen as $\rrexp$s.
 We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
 as the former suits people's intuitive way of stating a binary alternative
 regular expression.
 %-----------------------------------
 %	SECTION ?
 %-----------------------------------
-\section{preparatory properties for getting a finiteness bound}
+\section{Roadmap to a Finiteness Proof}
+Now that we have defined the $\rrexp$ datatype, and proven that its size changes
+w.r.t derivatives and simplifications mirrors precisely those of annotated regular expressions,
+we aim to bound the size of $r \backslash s$ for any $\rrexp$  $r$.
+The way we do it is by two steps:
+\begin{itemize}
+	\item
+		First, we rewrite $r\backslash s$ into something else that is easier
+		to bound. This step is especially important for the inductive case
+		$r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
+		but after simplification they will always be equal or smaller to a form consisting of an alternative
+		list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
+		The functions $f$ and $g$ are usually from $\flts$ $\distinctBy$ and other appearing in $\bsimp{_}$.
+		This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
+		right hand side the "closed form" of $r\backslash s$.
+	\item
+		Then, for such a sum  list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
+		by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only
+		reduce the size of a regular expression, not adding to it. The closed form $f\; (g\; (\sum rs)) $ is controlled
+		by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$.
+\end{itemize}
+\section{Closed Forms}
+\subsection{Basic Properties needed for Closed Forms}
+\subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
+The $\textit{rdistinct}$ function, as its name suggests, will
+remove duplicates according to the accumulator
+and leave only one of each different element in a list:
+\begin{lemma}
+	The function $\textit{rdistinct}$ satisfies the following
+	properties:
+	\begin{itemize}
+		\item
+			If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
+		\item
+			If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
+			All the elements in $rs'$ are unique.
+	\end{itemize}
+\end{lemma}
+\begin{proof}
+	The first part is by an induction on $rs$.
+	The second part can be proven by using the
+	induction rules of $\rdistinct{\_}{\_}$.
+\end{proof}
+\noindent
+$\rdistinct{\_}{\_}$ will cancel out all regular expression terms
+that are in the accumulator, therefore prepending a list $rs_a$ with an arbitrary
+list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$
+on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$
+without the prepending of $rs$:
+\begin{lemma}
+	If $rs \subseteq rset$, then
+	$\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$.
+\end{lemma}
+\begin{proof}
+	By induction on $rs$.
+\end{proof}
+\noindent
+On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
+then expanding the accumulator to include that element will not cause the output list to change:
+\begin{lemma}
+	The accumulator can be augmented to include elements not appearing in the input list,
+	and the output will not change.
+	\begin{itemize}
+		\item
+		If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{\{r\} \cup acc}$.
+		\item
+			Particularly, when $acc = \varnothing$ and $rs$ de-duplicated, we have\\
+			\[ \rdistinct{rs}{\varnothing} = rs \]
+	\end{itemize}
+\end{lemma}
+\begin{proof}
+	The first half is by induction on $rs$. The second half is a corollary of the first.
+\end{proof}
+\noindent
+The next property gives the condition for
+when $\rdistinct{\_}{\_}$ becomes an identical mapping
+for any prefix of an input list, in other words, when can
+we ``push out" the arguments of $\rdistinct{\_}{\_}$:
+\begin{lemma}
+	If $rs_1$'s elements are all unique, and not appearing in the accumulator $acc$,
+	then it can be taken out and left untouched in the output:
+	\[\textit{rdistinct}\;  rs_1 @ rsa\;\, acc
+	= rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1)\]
+\end{lemma}
+\begin{proof}
+By an induction on $rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.
+\end{proof}
+\subsubsection{The properties of $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS}$}
+We give in this subsection some properties of how $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS} $ interact with each other and  with $@$, the concatenation operator.
+These will be helpful in later closed form proofs, when
+we want to transform the ways in which multiple functions involving
+those are composed together
+in interleaving derivative and  simplification steps.
+When the function $\textit{Rflts}$
+is applied to the concatenation of two lists, the output can be calculated by first applying the
+functions on two lists separately, and then concatenating them together.
+\begin{lemma}
+	The function $\rflts$ has the below properties:\\
+	\begin{itemize}
+		\item
+	$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
+\item
+	If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
+	\end{itemize}
+\end{lemma}
+\noindent
+\begin{proof}
+	By induction on $rs_1$.
+\end{proof}
+\subsection{A Closed Form for the Sequence Regular Expression}
+\begin{quote}\it
+	Claim: For regular expressions $r_1 \cdot r_2$, we claim that
+	\begin{center}
+		$ \rderssimp{r_1 \cdot r_2}{s} =
+	\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
+	\end{center}
+\end{quote}
+\noindent
 Before we get to the proof that says the intermediate result of our lexer will
 remain finitely bounded, which is an important efficiency/liveness guarantee,
 we shall first develop a few preparatory properties and definitions to
 make the process of proving that a breeze.

changeset 543	b2bea5968b89
parent 532	cc54ce075db5
child 553	0f00d440f484