lexing: comparison ChengsongTanPhdThesis/Chapters/Cubic.tex

equal deleted inserted replaced

-:ae6010c14e49
+:17c7611fb0a9
 Animirov's partial derivatives.\\
 Secondly, we extend our $\blexersimp$
 to support bounded repetitions ($r^{\{n\}}$).
 We update our formalisation of
 the correctness and finiteness properties to
-include this new construct. With bounded repetitions
+include this new construct.
-we are able to out-compete other verified lexers such as
+we can out-compete other verified lexers such as
-Verbatim++ on regular expressions which involve a lot of
+Verbatim++ on bounded regular expressions.
-repetitions. %We also present the idempotency property proof
+%We also present the idempotency property proof
 %of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
 %This reinforces our claim that the fixpoint construction
 %originally required by Sulzmann and Lu can be removed in $\blexersimp$.
-\\
-Last but not least, we present our efforts and challenges we met
+%Last but not least, we present our efforts and challenges we met
-in further improving the algorithm by data
+%in further improving the algorithm by data
-structures such as zippers.
+%structures such as zippers.
 %----------------------------------------------------------------------------------------
 %	SECTION strongsimp
 \caption{Desired simplification, but not done in $\blexersimp$}\label{partialDedup}
 \end{figure}
 \noindent
 A simplification like this actually
 cannot be omitted,
-as without it the size could blow up even with our $\simp$ function:
+as without it the size could blow up even with our $\textit{bsimp}$
+function: for the chapter \ref{Finite} example
+$\protect((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$,
+by just setting n to a small number,
+we get exponential growth that does not stop before it becomes huge:
 \begin{figure}[H]
 \centering
 \begin{tikzpicture}
 \begin{axis}[
 %xlabel={$n$},
 \caption{Desired simplification, but not done in $\blexersimp$}\label{desiredSimp}
 \end{figure}
 \noindent
 in our $\simp$ function,
 so that it makes the simplification in \ref{partialDedup} possible.
-For example,
+Translating the rule into our $\textit{bsimp}$ function simply
-can a function like the below one be used?
+involves adding a new clause to the $\textit{bsimp}_{ASEQ}$ function:
-%TODO: simp' that spills things
+\begin{center}
+	\begin{tabular}{@{}lcl@{}}
+		$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
+						       && $\ldots$\\
+&&$\quad\textit{case} \; (_{bs1}\sum as, a_2') \Rightarrow _{bs1}\sum (
+\map \; (_{[]}\textit{ASEQ} \; \_ \; a_2') \; as)$\\
+&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
+	\end{tabular}
+\end{center}
 Unfortunately,
 if we introduce them in our
 setting we would lose the POSIX property of our calculated values.
 For example given the regular expression
 \end{center}
 using  a function similar to $\distinctBy$,
 but this time
 we allow a more general list rewrite:
 \begin{mathpar}\label{cubicRule}
-		\inferrule{\vspace{0mm} }{rs_a@[a]@rs_c
+	\inferrule * [Right = cubicRule]{\vspace{0mm} }{rs_a@[a]@rs_c
 			\stackrel{s}{\rightsquigarrow }
 		rs_a@[\textit{prune} \; a \; rs_a]@rs_c }
 \end{mathpar}
 %L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a)
 where $\textit{prune} \;a \; acc$ traverses $a$
 function $\textit{prune}$ in Scala,
 and incorporated into our lexer,
 by replacing the $\simp$ function
 with a stronger version called $\bsimpStrong$
 that prunes regular expressions.
-We call this lexer $\blexerStrong$.
-$\blexerStrong$ is able to drastically reduce the
-internal data structure size which could
-trigger exponential behaviours in
-$\blexersimp$.
 \begin{figure}[H]
-\centering
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-%xlabel={$n$},
-myplotstyle,
-xlabel={input length},
-ylabel={size},
-]
-\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
-\end{axis}
-\end{tikzpicture}
-&
-\begin{tikzpicture}
-\begin{axis}[
-%xlabel={$n$},
-myplotstyle,
-xlabel={input length},
-ylabel={size},
-]
-\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings
-	   of the form $\underbrace{aa..a}_{n}$.}
-\end{tabular}
-\caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
-\end{figure}
-\begin{figure}[H]
 \begin{lstlisting}
 def atMostEmpty(r: Rexp) : Boolean = r match {
 case ZERO => true
 case ONE => true
 case STAR(r) => atMostEmpty(r)
 }
 }
 \end{lstlisting}
 \end{figure}
 \noindent
+We call this lexer $\blexerStrong$.
+$\blexerStrong$ is able to drastically reduce the
+internal data structure size which could
+trigger exponential behaviours in
+$\blexersimp$.
+\begin{figure}[H]
+\centering
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+%xlabel={$n$},
+myplotstyle,
+xlabel={input length},
+ylabel={size},
+width = 7cm,
+height = 5cm,
+]
+\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+%xlabel={$n$},
+myplotstyle,
+xlabel={input length},
+ylabel={size},
+width = 7cm,
+height = 5cm,
+]
+\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{2}{l}{}
+\end{tabular}
+\caption{Runtime for matching
+	$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings
+	   of the form $\protect\underbrace{aa..a}_{n}$.}\label{fig:aaaaaStarStar}
+\end{figure}
+\noindent
 We would like to preserve the correctness like the one
 we had for $\blexersimp$:
 \begin{conjecture}\label{cubicConjecture}
 	$\blexerStrong \;r \; s = \blexer\; r\;s$
 \end{conjecture}
 \noindent
-We introduce the new rule \ref{cubicRule}
+The idea is to maintain key lemmas in
-and augment our proofs in chapter \ref{Bitcoded2}.
+chapter \ref{Bitcoded2} like
-The idea is to maintain the properties like
+$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$
-$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ etc.
+with the new rewriting rule \ref{cubicRule} .
-In the next section,
+In the next sub-section,
 we will describe why we
 believe a cubic bound can be achieved.
 We give an introduction to the
 partial derivatives,
 which was invented by Antimirov \cite{Antimirov95},
 and then link it with the result of the function
 $\bdersStrongs$.
-\section{Antimirov's partial derivatives}
+\subsection{Antimirov's partial derivatives}
-This suggests a link towrads "partial derivatives"
+Partial derivatives were first introduced by
-introduced  The idea behind Antimirov's partial derivatives
+Antimirov \cite{Antimirov95}.
-is to do derivatives in a similar way as suggested by Brzozowski,
+It does derivatives in a similar way as suggested by Brzozowski,
-but maintain a set of regular expressions instead of a single one:
+but splits children of alternative regular expressions into
+multiple independent terms, causing the output to become a
-%TODO: antimirov proposition 3.1, needs completion
+set of regular expressions:
 \begin{center}
-\begin{tabular}{ccc}
+	\begin{tabular}{lcl}
-$\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\
+		$\partial_x \; (a \cdot b)$ &
-$\partial_x(\ONE)$ & $=$ & $\phi$
+		$\dn$ & $\partial_x \; a\cdot b \cup
-\end{tabular}
+		\partial_x \; b \; \textit{if} \; \; \nullable\; a$\\
-\end{center}
+		      & & $\partial_x \; a\cdot b \quad\quad
+		      \textit{otherwise}$\\
+		$\partial_x \; r^*$ & $\dn$ & $\partial_x \; r \cdot r^*$\\
+		$\partial_x \; c $ & $\dn$ & $\textit{if} \; x = c \;
+		\textit{then} \;
+		\{ \ONE\} \;\;\textit{else} \; \varnothing$\\
+		$\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\
+		$\partial_x(\ONE)$ & $=$ & $\varnothing$\\
+		$\partial_x(\ZERO)$ & $\dn$ & $\varnothing$\\
+	\end{tabular}
+\end{center}
+\noindent
+The $\cdot$ between for example
+$\partial_x \; a\cdot b $
+is a shorthand notation for the cartesian product
+$\partial_x \; a \times \{ b\}$.
+%Each element in the set generated by a partial derivative
+%corresponds to a (potentially partial) match
+%TODO: define derivatives w.r.t string s
 Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together
-using the alternatives constructor, Antimirov cleverly chose to put them into
+using the $\sum$ constructor, Antimirov put them into
-a set instead.  This breaks the terms in a derivative regular expression up,
+a set.  This causes maximum de-duplication to happen,
 allowing us to understand what are the "atomic" components of it.
 For example, To compute what regular expression $x^*(xx + y)^*$'s
 derivative against $x$ is made of, one can do a partial derivative
 of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$
 from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.
-To get all the "atomic" components of a regular expression's possible derivatives,
-there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes
+The set of all possible partial derivatives is defined
-whatever character is available at the head of the string inside the language of a
+as the union of derivatives w.r.t all the strings in the universe:
-regular expression, and gives back the character and the derivative regular expression
+\begin{center}
-as a pair (which he called "monomial"):
+	\begin{tabular}{lcl}
-\begin{center}
+		$\textit{PDER}_{UNIV} \; r $ & $\dn $ & $\bigcup_{w \in A^*}\partial_w \; r$
-\begin{tabular}{ccc}
+	\end{tabular}
-$\lf(\ONE)$ & $=$ & $\phi$\\
+\end{center}
-$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
+\noindent
-$\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
-$\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\
+Back to our
-\end{tabular}
+\begin{center}
-\end{center}
+	$((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$
-%TODO: completion
+\end{center}
+example, if we denote this regular expression as $A$,
-There is a slight difference in the last three clauses compared
+we have that
-with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular
+\begin{center}
-expression $r$ with every element inside $\textit{rset}$ to create a set of
+$\textit{PDER}_{UNIV} \; A =
-sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates
+\bigcup_{i=1}^{n}\bigcup_{j=0}^{i-1} \{
-on a set of monomials (which Antimirov called "linear form") and a regular
+	(\underbrace{a \ldots a}_{\text{j a's}}\cdot
-expression, and returns a linear form:
+(\underbrace{a \ldots a}_{\text{i a's}})^*)\cdot A \}$,
-\begin{center}
+\end{center}
-\begin{tabular}{ccc}
+with exactly $n * (n + 1) / 2$ terms.
-$l \bigodot (\ZERO)$ & $=$ & $\phi$\\
+This is in line with our speculation that only $n*(n+1)/2$ terms are
-$l \bigodot (\ONE)$ & $=$ & $l$\\
+needed. We conjecture that $\bsimpStrong$ is also able to achieve this
-$\phi \bigodot t$ & $=$ & $\phi$\\
+upper limit in general
-$\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
+\begin{conjecture}\label{bsimpStrongInclusionPder}
-$\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
+	Using a suitable transformation $f$, we have
-$\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
+	\begin{center}
-$\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
+		$\forall s.\; f \; (r \bdersStrong  \; s) \subseteq
-$\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\
+		 \textit{PDER}_{UNIV} \; r$
-\end{tabular}
+	\end{center}
-\end{center}
+\end{conjecture}
-%TODO: completion
+\noindent
+because our \ref{cubicRule} will keep only one copy of each term,
-Some degree of simplification is applied when doing $\bigodot$, for example,
+where the function $\textit{prune}$ takes care of maintaining
-$l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
+a set like structure similar to partial derivatives.
-and  $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and
+It is anticipated we might need to adjust $\textit{prune}$
-$\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,
+slightly to make sure all duplicate terms are eliminated,
-and so on.
+which should be doable.
-With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with
+Antimirov had proven that the sum of all the partial derivative
-an iterative procedure:
+terms' sizes is bounded by the cubic of the size of that regular
-\begin{center}
+expression:
-\begin{tabular}{llll}
+\begin{property}\label{pderBound}
-$\textit{while}$ & $(\Delta_i \neq \phi)$  &                &          \\
+	$\llbracket \textit{PDER}_{UNIV} \; r \rrbracket \leq O((\llbracket r \rrbracket)^3)$
-		       &  $\Delta_{i+1}$           &        $ =$ & $\lf(\Delta_i) - \PD_i$ \\
+\end{property}
-		       &  $\PD_{i+1}$              &         $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
+This property was formalised by Urban, and the details are in the PDERIVS.thy file
-$\partial_{UNIV}(r)$ & $=$ & $\PD$ &
+in our repository.
-\end{tabular}
+Once conjecture \ref{bsimpStrongInclusionPder} is proven, then property \ref{pderBound}
-\end{center}
+would yield us a cubic bound for our $\blexerStrong$ algorithm:
+\begin{conjecture}\label{strongCubic}
+	$\llbracket r \bdersStrong\; s \rrbracket \leq \llbracket r \rrbracket^3$
-$(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
+\end{conjecture}
+%To get all the "atomic" components of a regular expression's possible derivatives,
+%there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes
+%whatever character is available at the head of the string inside the language of a
+%regular expression, and gives back the character and the derivative regular expression
+%as a pair (which he called "monomial"):
+% \begin{center}
+% \begin{tabular}{ccc}
+% $\lf(\ONE)$ & $=$ & $\phi$\\
+%$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
+% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
+% $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\
+%\end{tabular}
+%\end{center}
+%%TODO: completion
+%
+%There is a slight difference in the last three clauses compared
+%with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular
+%expression $r$ with every element inside $\textit{rset}$ to create a set of
+%sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates
+%on a set of monomials (which Antimirov called "linear form") and a regular
+%expression, and returns a linear form:
+% \begin{center}
+% \begin{tabular}{ccc}
+% $l \bigodot (\ZERO)$ & $=$ & $\phi$\\
+% $l \bigodot (\ONE)$ & $=$ & $l$\\
+% $\phi \bigodot t$ & $=$ & $\phi$\\
+% $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
+% $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
+%  $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
+% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
+% $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\
+%\end{tabular}
+%\end{center}
+%%TODO: completion
+%
+% Some degree of simplification is applied when doing $\bigodot$, for example,
+% $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
+% and  $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and
+%  $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,
+%  and so on.
+%
+%  With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with
+%  an iterative procedure:
+%   \begin{center}
+% \begin{tabular}{llll}
+%$\textit{while}$ & $(\Delta_i \neq \phi)$  &                &          \\
+% 		       &  $\Delta_{i+1}$           &        $ =$ & $\lf(\Delta_i) - \PD_i$ \\
+%		       &  $\PD_{i+1}$              &         $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
+%$\partial_{UNIV}(r)$ & $=$ & $\PD$ &
+%\end{tabular}
+%\end{center}
+%
+%
+% $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
 %----------------------------------------------------------------------------------------
 the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$.
 We add clauses in our derivatives-based lexing algorithms (with simplifications)
 introduced in chapter \ref{Bitcoded2}.
 \subsection{Augmented Definitions}
-There are a number of definitions that needs to be augmented.
+There are a number of definitions that need to be augmented.
 The most notable one would be the POSIX rules for $r^{\{n\}}$:
-\begin{mathpar}
+\begin{center}
-	\inferrule{v \in vs}{\textit{Stars} place holder}
+	\begin{mathpar}
-\end{mathpar}
+		\inferrule{\forall v \in vs_1. \vdash v:r \land
+		|v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\
+		\textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \;
+		(vs_1 @ vs_2) : r^{\{n\}} }
+	\end{mathpar}
+\end{center}
+As Ausaf had pointed out \cite{Ausaf},
+sometimes empty iterations have to be taken to get
+a match with exactly $n$ repetitions,
+and hence the $vs_2$ part.
+Another important definition would be the size:
+\begin{center}
+	\begin{tabular}{lcl}
+		$\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ &
+		$\llbracket r \rrbracket_r + n$\\
+	\end{tabular}
+\end{center}
+\noindent
+Arguably we should use $\log \; n$ for the size because
+the number of digits increase logarithmically w.r.t $n$.
+For simplicity we choose to add the counter directly to the size.
+The derivative w.r.t a bounded regular expression
+is given as
+\begin{center}
+	\begin{tabular}{lcl}
+		$r^{\{n\}} \backslash_r c$ & $\dn$ &
+		$r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\
+					   & & $\RZERO \;\quad \quad\quad \quad
+					   \textit{otherwise}$\\
+	\end{tabular}
+\end{center}
+\noindent
+For brevity, we sometimes use NTIMES to refer to bounded
+regular expressions.
+The $\mkeps$ function clause for NTIMES would be
+\begin{center}
+	\begin{tabular}{lcl}
+		$\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \;
+		(\textit{replicate} \; n\; (\mkeps \; r))$
+	\end{tabular}
+\end{center}
+\noindent
+The injection looks like
+\begin{center}
+	\begin{tabular}{lcl}
+		$\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ &
+		$\dn$ & $\Stars \;
+		((\inj \; r \;c \;v ) :: vs)$
+	\end{tabular}
+\end{center}
+\noindent
 \subsection{Proofs for the Augmented Lexing Algorithm}
 We need to maintain two proofs with the additional $r^{\{n\}}$
 construct: the
 \begin{proof}
 	By a reverse induction on $s$.
 	For the inductive case, note that if $\cbn \; r$ holds,
 	then $\cbn \; (r\backslash_r c)$ holds.
 \end{proof}
+\noindent
 In addition, for $\cbn$-shaped regular expressioins, one can flatten
 them:
 \begin{lemma}\label{ntimesHfauPushin}
 	If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} =
 	\textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \;
 	(\hflataux{r})})$
 \end{lemma}
 \begin{proof}
 	By an induction on the inductive cases of $\cbn$.
 \end{proof}
+\noindent
 This time we do not need to define the flattening functions for NTIMES only,
-because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already:
+because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already.
 \begin{lemma}\label{ntimesHfauInduct}
 $\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} =
 \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$
 \end{lemma}
 \begin{proof}
 	By a reverse induction on $s$.
-	The lemma \ref{ntimesDersCbn} is used.
+	The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used.
 \end{proof}
 \noindent
 We have a recursive property for NTIMES with $\nupdate$
 similar to that for STAR,
 and one for $\nupdates $ as well:
 \begin{lemma}\label{nupdateInduct1}
-	(i) $\textit{concat} \; (\map \; (\hflataux{\_} \circ (
+	\mbox{}
+	\begin{itemize}
+		\item
+			\begin{center}
+	 $\textit{concat} \; (\map \; (\hflataux{\_} \circ (
 	\opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \;
 	c \; r \; Ss)$\\
-	(ii) $\textit{concat} \; (\map \; \hflataux{\_}\;
+	\end{center}
+	holds.
+\item
+	\begin{center}
+	 $\textit{concat} \; (\map \; \hflataux{\_}\;
 	\map \; (\_\backslash_r x) \;
-		(\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss))) =
+		(\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\
-	\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$ holds.
+		$=$\\
+	$\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$
+	\end{center}
+	holds.
+	\end{itemize}
 \end{lemma}
 \begin{proof}
 	(i) is by an induction on $Ss$.
 	(ii) is by an induction on $xs$.
 \end{proof}
+\noindent
 The $\nString$ predicate is defined for conveniently
 expressing that there are no empty strings in the
 $\Some \;(s, n)$ elements generated by $\nupdate$:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\
 		$\nString \; (\Some \; (c::s, n))$  & $\dn$ & $ \textit{true}$\\
 	\end{tabular}
 \end{center}
 \begin{lemma}\label{nupdatesNonempty}
-	If for all elements $opt$ in $\textit{set} \; Ss$,
+	If for all elements $o \in \textit{set} \; Ss$,
-	$\nString \; opt$ holds, the we have that
+	$\nString \; o$ holds, the we have that
-	for all elements $opt$ in $\textit{set} \; (\nupdates \; s \; r \; Ss)$,
+	for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$,
-	$\nString \; opt$ holds.
+	$\nString \; o'$ holds.
 \end{lemma}
 \begin{proof}
 	By an induction on $s$, where $Ss$ is set to vary over all possible values.
 \end{proof}
 	which is from lemma \ref{nupdatesNonempty}.
 	Once the string in $o = \Some \; (s, n)$ is
 	nonempty, $\optermsimp \; r \;o$,
 	$\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed
 	to be equal.
+	(v) uses \ref{nupdateInduct1}.
 \end{proof}
+\noindent
+Now we are ready to present the closed form for NTIMES:
 \begin{theorem}\label{ntimesClosedForm}
 	The derivative of $r^{\{n+1\}}$ can be described as an alternative
 	containing a list
 	of terms:\\
 	$r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; (
 	[\Some \; ([c], n)])))$
 \end{theorem}
 \begin{proof}
 	By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}.
 \end{proof}
+\noindent
+The key observation for bounding this closed form
+is that the counter on $r^{\{n\}}$ will
+only decrement during derivatives:
 \begin{lemma}\label{nupdatesNLeqN}
 	For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \;
 	[\Some \; ([c], n)])$, either $o = \None$, or $o = \Some
 	\; (s', m)$ for some string $s'$ and number $m \leq n$.
 \end{lemma}
+\noindent
+The proof is routine and therefore omitted.
+This allows us to say what kind of terms
+are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+\nupdates \; s \; r \; [\Some \; ([c], n)]))$:
+only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$
+with a small $m$:
 \begin{lemma}\label{ntimesClosedFormListElemShape}
 	For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
 	\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
 	we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot
 	r^{\{m\}}$ for some string $s'$ and number $m \leq n$.
 In addition, we know that the list
 $\map \; (\optermsimp \; r) \; (
 \nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most
 $c_N = \textit{card} \;
 (\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$.
-This gives us $\llbracket \llbracket r \backslash_{rsimps} \;s \llbracket_r
+This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r
 \leq N * c_N$.
 \end{proof}
+We aim to formalise the correctness and size bound
-Assuming we have that
+for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
-\begin{center}
+and so on, which is still work in progress.
-	$\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$.
+They should more or less follow the same recipe described in this section.
-\end{center}
+Once we know about how to deal with them recursively using suitable auxiliary
-holds.
+definitions, we are able to routinely establish the proofs.
-The idea of $\starupdate$ and $\starupdates$
-is to update $Ss$ when another
-derivative is taken on $\rderssimp{r^*}{s}$
+%The closed form for them looks like:
-w.r.t a character $c$ and a string $s'$
+%%\begin{center}
-respectively.
+%%	\begin{tabular}{llrclll}
-Both $\starupdate$ and $\starupdates$ take three arguments as input:
+%%		$r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\
-the new character $c$ or string $s$ to take derivative with,
+%%		$\textit{rsimp}$ & $($ & $
-the regular expression
+%%		\sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\
-$r$ under the star $r^*$, and the
+%%			 & & & & $\textit{nupdates} \;$ &
-list of strings $Ss$ for the derivative $r^* \backslash s$
+%%			 $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\
-up until this point
+%%			 & & & & $)$ &\\
-such that
+%%			 & &  $)$ & & &\\
-\begin{center}
+%%			 & $)$ & & & &\\
-$(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$
+%%	\end{tabular}
-\end{center}
+%%\end{center}
-is satisfied.
-Functions $\starupdate$ and $\starupdates$ characterise what the
-star derivatives will look like once ``straightened out'' into lists.
-The helper functions for such operations will be similar to
-$\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence.
-We use similar symbols to denote them, with a $*$ subscript to mark the difference.
-\begin{center}
-	\begin{tabular}{lcl}
-		$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
-		$\hflataux{r}$ & $\dn$ & $[r]$
-	\end{tabular}
-\end{center}
-\begin{center}
-	\begin{tabular}{lcl}
-		$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
-		$\hflat{r}$ & $\dn$ & $r$
-	\end{tabular}
-\end{center}
-\noindent
-These definitions are tailor-made for dealing with alternatives that have
-originated from a star's derivatives.
-A typical star derivative always has the structure of a balanced binary tree:
-\begin{center}
-	$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
-	(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
-\end{center}
-All of the nested structures of alternatives
-generated from derivatives are binary, and therefore
-$\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives.
-$\hflat{\_}$ ``untangles'' like the following:
-\[
-	\sum ((r_1 + r_2) + (r_3 + r_4))  + \ldots \;
-	\stackrel{\hflat{\_}}{\longrightarrow} \;
-	\RALTS{[r_1, r_2, \ldots, r_n]}
-\]
-Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$,
-with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists
-			and merges each of the element lists to form a flattened list.}:
-\begin{lemma}\label{stupdateInduct1}
-	\mbox
-	For a list of strings $Ss$, the following hold.
-	\begin{itemize}
-		\item
-			If we do a derivative on the terms
-			$r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$),
-			the result will be the same as if we apply $\starupdate$ to $Ss$.
-			\begin{center}
-				\begin{tabular}{c}
-			$\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x)
-			\circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\;
-			$\\
-			\\
-			$=$ \\
-			\\
-			$\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \;
-			(\starupdate \; x \; r \; Ss)$
-				\end{tabular}
-			\end{center}
-		\item
-			$\starupdates$ is ``composable'' w.r.t a derivative.
-			It piggybacks the character $x$ to the tail of the string
-			$xs$.
-			\begin{center}
-				\begin{tabular}{c}
-					$\textit{concat} \; (\map \; \hflataux{\_} \;
-					(\map \; (\_\backslash_r x) \;
-					(\map \; (\lambda s.\;\; (r \backslash_r s) \cdot
-					(r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\
-					\\
-					$=$\\
-					\\
-					$\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \;
-					(\starupdates \; (xs @ [x]) \; r \; Ss)$
-				\end{tabular}
-			\end{center}
-	\end{itemize}
-\end{lemma}
-\begin{proof}
-	Part 1 is by induction on $Ss$.
-	Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values.
-\end{proof}
-Like $\textit{createdBySequence}$, we need
-a predicate for ``star-created'' regular expressions:
-\begin{center}
-	\begin{mathpar}
-		\inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} }
-		\inferrule{  \textit{createdByStar} \; r_1\; \land  \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) }
-	\end{mathpar}
-\end{center}
-\noindent
-All regular expressions created by taking derivatives of
-$r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate:
-\begin{lemma}\label{starDersCbs}
-	$\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds.
-\end{lemma}
-\begin{proof}
-	By a reverse induction on $s$.
-\end{proof}
-If a regular expression conforms to the shape of a star's derivative,
-then we can push an application of $\hflataux{\_}$ inside a derivative of it:
-\begin{lemma}\label{hfauPushin}
-	If $\textit{createdByStar} \; r$ holds, then
-	\begin{center}
-		$\hflataux{r \backslash_r c} = \textit{concat} \; (
-		\map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$
-	\end{center}
-	holds.
-\end{lemma}
-\begin{proof}
-	By an induction on the inductivev cases of $\textit{createdByStar}$.
-\end{proof}
-%This is not entirely true for annotated regular expressions:
-%%TODO: bsimp bders \neq bderssimp
 %\begin{center}
-%	$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
+%	\begin{tabular}{llrcllrllll}
-%\end{center}
+%		$r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\
-%For bit-codes, the order in which simplification is applied
+%			      &&&&$\textit{rsimp}$ & $($ & $
-%might cause a difference in the location they are placed.
+%			      \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\
-%If we want something like
+%					  &&&& & & & & $\;\; \textit{nupdates} \;$ &
-%\begin{center}
+%			 		  $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\
-%	$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
+%					  &&&& & & & & $)$ &\\
-%\end{center}
+%					  &&&& & &  $)$ & & &\\
-%Some "canonicalization" procedure is required,
+%					  &&&& & $)$ & & & &\\
-%which either pushes all the common bitcodes to nodes
-%as senior as possible:
-%\begin{center}
-%	$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
-%\end{center}
-%or does the reverse. However bitcodes are not of interest if we are talking about
-%the $\llbracket r \rrbracket$ size of a regex.
-%Therefore for the ease and simplicity of producing a
-%proof for a size bound, we are happy to restrict ourselves to
-%unannotated regular expressions, and obtain such equalities as
-%TODO: rsimp sflat
-% The simplification of a flattened out regular expression, provided it comes
-%from the derivative of a star, is the same as the one nested.
-Now we introduce an inductive property
-for $\starupdate$ and $\hflataux{\_}$.
-\begin{lemma}\label{starHfauInduct}
-	If we do derivatives of $r^*$
-	with a string that starts with $c$,
-	then flatten it out,
-	we obtain a list
-	of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$,
-	where $sS = \starupdates \; s \; r \; [[c]]$. Namely,
-	\begin{center}
-		$\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} =
-		\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \;
-		(\starupdates \; s \; r_0 \; [[c]])$
-	\end{center}
-holds.
-\end{lemma}
-\begin{proof}
-	By an induction on $s$, the inductive cases
-	being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.
-\end{proof}
-\noindent
-$\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
-\begin{lemma}\label{hflatauxGrewrites}
-	$a :: rs \grewrites \hflataux{a} @ rs$
-\end{lemma}
-\begin{proof}
-	By induction on $a$. $rs$ is set to take arbitrary values.
-\end{proof}
-It is also not surprising that $\textit{rsimp}$ will cover
-the effect of $\hflataux{\_}$:
-\begin{lemma}\label{cbsHfauRsimpeq1}
-	$\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$
-\end{lemma}
-\begin{proof}
-	By using the rewriting relation $\rightsquigarrow$
-\end{proof}
-And from this we obtain a proof that a star's derivative will be the same
-as if it had all its nested alternatives created during deriving being flattened out:
-For example,
-\begin{lemma}\label{hfauRsimpeq2}
-	$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
-\end{lemma}
-\begin{proof}
-	By structural induction on $r$, where the induction rules
-	are these of $\createdByStar{\_}$.
-	Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.
-\end{proof}
-%Here is a corollary that states the lemma in
-%a more intuitive way:
-%\begin{corollary}
-%	$\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
-%	(r^*))\; (\starupdates \; c\; r\; [[c]])$
-%\end{corollary}
-%\noindent
-%Note that this is also agnostic of the simplification
-%function we defined, and is therefore of more general interest.
-Together with the rewriting relation
-\begin{lemma}\label{starClosedForm6Hrewrites}
-	We have the following set of rewriting relations or equalities:
-	\begin{itemize}
-		\item
-			$\textit{rsimp} \; (r^* \backslash_r (c::s))
-			\sequal
-			\sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; (
-			\starupdates \; s \; r \; [ c::[]] ) ) )$
-		\item
-			$r \backslash_{rsimp} (c::s) = \textit{rsimp} \; ( (
-			\sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \;
-			(\starupdates \;s \; r \; [ c::[] ])))))$
-		\item
-			$\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss))
-			\sequal
-			 \sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \;
-			 r^*) \; Ss) )$
-		\item
-			$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
-			\scfrewrites
-			\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
-		\item
-			$( ( \sum ( ( \map \ (\lambda s. \;\;
-			(\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \;
-			s \; r \; [ c::[] ])))))$\\
-			$\sequal$\\
-			$( ( \sum ( ( \map \ (\lambda s. \;\;
-			( r \backslash_{rsimp} s)) \cdot r^*) \; (\starupdates \;
-			s \; r \; [ c::[] ]))))$\\
-	\end{itemize}
-\end{lemma}
-\begin{proof}
-	Part 1 leads to part 2.
-	The rest of them are routine.
-\end{proof}
-\noindent
-Next the closed form for star regular expressions can be derived:
-\begin{theorem}\label{starClosedForm}
-	$\rderssimp{r^*}{c::s} =
-	\rsimp{
-		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \;
-		(\starupdates \; s\; r \; [[c]])
-		)
-		)
-	}
-	$
-\end{theorem}
-\begin{proof}
-	By an induction on $s$.
-	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites}
-	and \ref{hfauRsimpeq2}
-	are used.
-	In \ref{starClosedForm6Hrewrites}, the equalities are
-	used to link the LHS and RHS.
-\end{proof}
-The closed form for them looks like:
-%\begin{center}
-%	\begin{tabular}{llrclll}
-%		$r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\
-%		$\textit{rsimp}$ & $($ & $
-%		\sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\
-%			 & & & & $\textit{nupdates} \;$ &
-%			 $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\
-%			 & & & & $)$ &\\
-%			 & &  $)$ & & &\\
-%			 & $)$ & & & &\\
 %	\end{tabular}
 %\end{center}
-\begin{center}
+%The $\textit{optermsimp}$ function with the argument $r$
-	\begin{tabular}{llrcllrllll}
+%chooses from two options: $\ZERO$ or
-		$r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\
+%We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$
-			      &&&&$\textit{rsimp}$ & $($ & $
+%and $\starupdates$:
-			      \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\
+%\begin{center}
-					  &&&& & & & & $\;\; \textit{nupdates} \;$ &
+%	\begin{tabular}{lcl}
-			 		  $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\
+%		$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
-					  &&&& & & & & $)$ &\\
+%		$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
-					  &&&& & &  $)$ & & &\\
+%						     & & $\textit{if} \;
-					  &&&& & $)$ & & & &\\
+%						     (\rnullable \; (\rders \; r \; s))$ \\
-	\end{tabular}
+%						     & & $\textit{then} \;\; (s @ [c]) :: [c] :: (
-\end{center}
+%						     \starupdate \; c \; r \; Ss)$ \\
-The $\textit{optermsimp}$ function with the argument $r$
+%						     & & $\textit{else} \;\; (s @ [c]) :: (
-chooses from two options: $\ZERO$ or
+%						     \starupdate \; c \; r \; Ss)$
-We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$
+%	\end{tabular}
-and $\starupdates$:
+%\end{center}
-\begin{center}
+%\noindent
-	\begin{tabular}{lcl}
+%As a generalisation from characters to strings,
-		$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
+%$\starupdates$ takes a string instead of a character
-		$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
+%as the first input argument, and is otherwise the same
-						     & & $\textit{if} \;
+%as $\starupdate$.
-						     (\rnullable \; (\rders \; r \; s))$ \\
+%\begin{center}
-						     & & $\textit{then} \;\; (s @ [c]) :: [c] :: (
+%	\begin{tabular}{lcl}
-						     \starupdate \; c \; r \; Ss)$ \\
+%		$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
-						     & & $\textit{else} \;\; (s @ [c]) :: (
+%		$\starupdates \; (c :: cs) \; r \; Ss$ &  $=$ &  $\starupdates \; cs \; r \; (
-						     \starupdate \; c \; r \; Ss)$
+%		\starupdate \; c \; r \; Ss)$
-	\end{tabular}
+%	\end{tabular}
-\end{center}
+%\end{center}
-\noindent
+%\noindent
-As a generalisation from characters to strings,
-$\starupdates$ takes a string instead of a character
-as the first input argument, and is otherwise the same
-as $\starupdate$.
+%\section{Zippers}
-\begin{center}
+%Zipper is a data structure designed to operate on
-	\begin{tabular}{lcl}
+%and navigate between local parts of a tree.
-		$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
+%It was first formally described by Huet \cite{HuetZipper}.
-		$\starupdates \; (c :: cs) \; r \; Ss$ &  $=$ &  $\starupdates \; cs \; r \; (
+%Typical applications of zippers involve text editor buffers
-		\starupdate \; c \; r \; Ss)$
+%and proof system databases.
-	\end{tabular}
+%In our setting, the idea is to compactify the representation
-\end{center}
+%of derivatives with zippers, thereby making our algorithm faster.
-\noindent
+%Some initial results
+%We first give a brief introduction to what zippers are,
+%and other works
+%that apply zippers to derivatives
+%When dealing with large trees, it would be a waste to
+%traverse the entire tree if
+%the operation only
+%involves a small fraction of it.
+%The idea is to put the focus on that subtree, turning other parts
+%of the tree into a context
+%
+%
+%One observation about our derivative-based lexing algorithm is that
+%the derivative operation sometimes traverses the entire regular expression
+%unnecessarily:
 %----------------------------------------------------------------------------------------
 %	SECTION 1
 %----------------------------------------------------------------------------------------

changeset 621	17c7611fb0a9
parent 620	ae6010c14e49
child 625	b797c9a709d9