thys2/Paper/Paper.thy
changeset 424 2416fdec6396
parent 423 b7199d6c672d
child 425 14c558ae0b09
    62 
    63   code ("code _" [79] 74) and
    64   intern ("_\<^latex>\<open>\\mbox{$^\\uparrow$}\<close>" [900] 80) and
    65   erase ("_\<^latex>\<open>\\mbox{$^\\downarrow$}\<close>" [1000] 74) and
    66   bnullable ("bnullable _" [1000] 80) and
    67   bsimp_AALTs ("bsimpALT _ _" [10,1000] 80) and
    68   bsimp_ASEQ ("bsimpSEQ _ _ _" [10,1000,1000] 80) and
    69   bmkeps ("bmkeps _" [1000] 80) and
    70 
    71   srewrite ("_\<^latex>\<open>\\mbox{$\\,\\stackrel{s}{\\leadsto}$}\<close> _" [71, 71] 80) and
    72   rrewrites ("_ \<^latex>\<open>\\mbox{$\\,\\leadsto^*$}\<close> _" [71, 71] 80) and
    73   blexer_simp ("blexer\<^sup>+" 1000) 
   147 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
   148 r$. While such simple-minded simplifications have been proved in our
   149 earlier work to preserve the correctness of Sulzmann and Lu's
   150 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
   151 \emph{not} help with limiting the growth of the derivatives shown
   152 above: the growth is slowed, but the derivatives can still grow rather
   153 quickly beyond any finite bound.
   154 
   155 
   156 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
   157 \cite{Sulzmann2014} where they introduce bitcoded
   158 regular expressions. In this version, POSIX values are
   213   consisting of just a single character $c$ is written $[c]$.   
   214   Our regular expressions are defined as usual as the elements of the following inductive
   215   datatype:
   216 
   217   \begin{center}
   218   @{text "r ::="} \;
   219   @{const "ZERO"} $\mid$
   220   @{const "ONE"} $\mid$
   221   @{term "CH c"} $\mid$
   222   @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
   223   @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
   301 
   302 The novel idea of Sulzmann and Lu is to extend this algorithm for 
   303 lexing, where it is important to find out which part of the string
   304 is matched by which part of the regular expression.
   305 For this Sulzmann and Lu presented two lexing algorithms in their paper
   306   \cite{Sulzmann2014}. The first algorithm consists of two phases: first a
   307   matching phase (which is Brzozowski's algorithm) and then a value
   308   construction phase. The values encode \emph{how} a regular expression
   309   matches a string. \emph{Values} are defined as the inductive datatype
   310 
   311   \begin{center}
   422   \end{tabular}
   423   \end{tabular}
   424   \end{center}
   425 
   426   \noindent
   427   The function @{text mkeps} is run when the last derivative is nullable, that is,
   428   the string to be matched is in the language of the regular expression. It generates
   429   a value for how the last derivative can match the empty string. The injection function
   430   then calculates the corresponding value for each intermediate derivative until
   431   a value for the original regular expression is generated.
   432   Graphically the algorithm by
   433   Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
   434   where the path from the left to the right involving @{term derivatives}/@{const
   435   nullable} is the first phase of the algorithm (calculating successive
   436   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
   437   left, the second phase. The picture above shows the steps required when a
   438   regular expression, say @{text "r\<^sub>1"}, matches the string @{term
   439   "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as:
   440 
   441   \begin{figure}[t]
   442 \begin{center}
   487                      & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   488   \end{tabular}
   489   \end{center}
   490 
   491 
   492 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
   493 this algorithm is correct, that is, it generates POSIX values. The
   494 central property we established relates the derivative operation to the
   495 injection function.
   496 
   497   \begin{proposition}\label{Posix2}
   498 	\textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. 
   499 \end{proposition}
   500 
   511 
   512   \noindent
   513   In fact we have shown that in the success case the generated POSIX value $v$ is
   514   unique and in the failure case that there is no POSIX value $v$ that satisfies
   515   $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
   516   slow in cases where the derivatives grow arbitrarily (see the example from the
   517   Introduction). However, it can be used as a convenient reference point for the correctness
   518   proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
   519   
   520 *}
   521 
   522 section {* Bitcoded Regular Expressions and Derivatives *}
   541 
   542   \noindent where @{text bs} stands for bitsequences; @{text r},
   543   @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
   544   expressions; and @{text rs} for lists of bitcoded regular
   545   expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
   546   is just an abbreviation for \mbox{@{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}}. 
   547   For bitsequences we just use lists made up of the
   548   constants @{text Z} and @{text S}.  The idea with bitcoded regular
   549   expressions is to incrementally generate the value information (for
   550   example @{text Left} and @{text Right}) as bitsequences. For this 
   551   Sulzmann and Lu define a coding
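For example, assuming the clauses of this coding emit a @{text Z} for each
@{text Left} and an @{text S} for each @{text Right} (with the ``leaf''
values contributing no bits), the value @{text "Left (Left v)"} would be
coded by prefixing @{text "[Z, Z]"} to the coding of @{text v}.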
   668          $\textit{STAR}\;[]\,r^\uparrow$\\
   669   \end{tabular}    
   670   \end{center}    
   671 
   672   \noindent
   673   There is also an \emph{erase}-function, written $r^\downarrow$, which
   674   transforms a bitcoded regular expression into a (standard) regular
   675   expression by just erasing the annotated bitsequences. We omit the
   676   straightforward definition. For defining the algorithm, we also need
   677   the functions \textit{bnullable} and \textit{bmkeps}, which are the
   678   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
   706   \end{tabular}
   707   \end{center}    
   708  
   709 
   710   \noindent
   711   The key function in the bitcoded algorithm is the derivative of a
   712   bitcoded regular expression. This function calculates the
   713   derivative but at the same time also the incremental part of the bitsequences
   714   that contribute to constructing a POSIX value.
   715 
   716   \begin{center}
   753 builds the bitcoded derivative according to $s$. If the derivative is
   754 (b)nullable, the string is in the language of $r$ and the algorithm extracts the bitsequence using the
   755 $\textit{bmkeps}$ function. Finally, it decodes the bitsequence into a value.  If
   756 the derivative is \emph{not} nullable, then $\textit{None}$ is
   757 returned. We can show that this way of calculating a value
   758 generates the same result as \textit{lexer}.
   759 
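In outline, and only as a sketch paraphrasing the definition just described,
the bitcoded lexer can be summarised as
%
\[
\textit{blexer}\;r\,s \;\dn\;
\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}\;
\textit{if}\;\textit{bnullable}\,a\;
\textit{then}\;\textit{Some}\,(\textit{decode}\,(\textit{bmkeps}\,a)\,r)\;
\textit{else}\;\textit{None}
\]
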
   760 Before we can proceed we need to define a helper function, called
   761 \textit{retrieve}, which Sulzmann and Lu introduced for the correctness proof.
   762 
   763 \begin{center}
   774   \end{tabular}
   775   \end{center}
   776 
   777 \noindent
   778 The idea behind this function is to retrieve a possibly partial
   779 bitsequence from a bitcoded regular expression, where the retrieval is
   780 guided by a value.  For example, if the value is $\Left$ then we
   781 descend into the left-hand side of an alternative in order to
   782 assemble the bitcode. Similarly for
   783 $\Right$. The property we can show is that for a given $v$ and $r$
   784 with $\vdash v : r$, the retrieved bitsequence from the internalised
   807   All properties are proved by induction on annotated regular expressions. There are no
   808   interesting cases.
   809 \end{proof}
   810 
   811 \noindent
   812 The only difficulty left for the correctness proof is that the bitcoded algorithm
   813 has only a ``forward phase'' where POSIX values are generated incrementally.
   814 We can achieve the same effect with @{text lexer} by stacking up injection
   815 functions during the forward phase. An auxiliary function, called $\textit{flex}$,
   816 allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single
   817 phase and stacked-up injection functions.
   818 
   819 \begin{center}
   820 \begin{tabular}{lcl}
   821   $\textit{flex}\;r\,f\,[]$ & $\dn$ & $f$\\
   822   $\textit{flex}\;r\,f\,(c\!::\!s)$ & $\dn$ &
   825 \end{center}    
   823 
   826 
   824 \noindent
   827 \noindent
   825 The point of this function is that when
   828 The point of this function is that when
   826 reaching the end of the string, we just need to apply the stacked
   829 reaching the end of the string, we just need to apply the stacked
   827 injection functions to the value generated by $\mkeps$.
   830 injection functions to the value generated by @{text mkeps}.
   828 Using this function we can recast the success case in @{text lexer} 
   831 Using this function we can recast the success case in @{text lexer} 
   829 as follows:
   832 as follows:
   830 
   833 
   831 \begin{proposition}\label{flex}
   834 \begin{proposition}\label{flex}
   832 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
   835 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
   837 Note we did not redefine \textit{lexer}, we just established that the
   840 Note we did not redefine \textit{lexer}, we just established that the
   838 value generated by \textit{lexer} can also be obtained by a different
   841 value generated by \textit{lexer} can also be obtained by a different
   839 method. While this different method is not efficient (we essentially
   842 method. While this different method is not efficient (we essentially
   840 need to traverse the string $s$ twice, once for building the
   843 need to traverse the string $s$ twice, once for building the
   841 derivative $r\backslash s$ and another time for stacking up injection
   844 derivative $r\backslash s$ and another time for stacking up injection
   842 functions using \textit{flex}), it helped us with proving
   845 functions using \textit{flex}), it helps us with proving
   843 that incrementally building up values.
   846 that incrementally building up values generates the same result.
   844 
   847 
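To see the stacking at work, consider a short unfolding of $\textit{flex}$
on a two-character string (assuming the recursive clause is
$\textit{flex}\,(r\backslash c)\,(\lambda v.\,f\,(\textit{inj}\,r\,c\,v))\,s$):
%
\[
\textit{flex}\;r\,\textit{id}\,[a,b] \;=\;
\textit{flex}\;(r\backslash a)\,(\lambda v.\,\textit{inj}\,r\,a\,v)\,[b] \;=\;
\lambda v.\,\textit{inj}\,r\,a\,(\textit{inj}\,(r\backslash a)\,b\,v)
\]

\noindent
Applying this stacked-up function to $\textit{mkeps}\,((r\backslash a)\backslash b)$
yields exactly the value \textit{lexer} computes for $r$ and $[a,b]$.
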
   848 This brings us to our main lemma in this section: if we calculate a
   849 derivative, say $r\backslash s$, and have a value, say $v$, inhabited
   850 by this derivative, then we can produce the result $\lexer$ generates
   851 by applying this value to the stacked-up injection functions
   852 that $\textit{flex}$ assembles. The lemma establishes that this is the same
   853 value as if we build the annotated derivative $r^\uparrow\backslash s$
   854 and then retrieve the corresponding bitcoded version, followed by a
   855 decoding step.
   856 
   857 \begin{lemma}[Main Lemma]\label{mainlemma}\it
   894   (s\,@\,[c]))\,v)\,r$ as required. The last rewrite step is possible
   895   because we generalised over $v$ in our induction.
   896 \end{proof}  
   897 
   898 \noindent
   899 With this lemma in place, we can prove the correctness of \textit{blexer}---it indeed
   900 produces the same result as \textit{lexer}.
   901 
   902 
   903 \begin{theorem}
   904 $\textit{lexer}\,r\,s = \textit{blexer}\,r\,s$
   905 \end{theorem}  
   914     \nullable(r\backslash s)
   915   \]
   916 
   917   \noindent
   918   For the \textit{if}-branch suppose $r_d \dn r^\uparrow\backslash s$ and
   919   $d \dn r\backslash s$. We have (*) @{text "nullable d"}. We can then show
   920   by Lemma~\ref{bnullable}\textit{(3)} that
   921   %
   922   \[
   923     \textit{decode}(\textit{bmkeps}\,r_d)\,r =
   924     \textit{decode}(\textit{retrieve}\,a\,(\textit{mkeps}\,d))\,r
   932   \textit{if}-branches return the same value. In the
   933   \textit{else}-branches both \textit{lexer} and \textit{blexer} return
   934   \textit{None}. Therefore we can conclude the proof.
   935 \end{proof}  
   936 
   937 \noindent This establishes that the bitcoded algorithm by Sulzmann and
   938 Lu \emph{without} simplification produces correct results. This was
   939 only conjectured in their paper
   940 \cite{Sulzmann2014}. The next step is to add simplifications.
   941 
   942 *}
   943 
   944 
   945 section {* Simplification *}
   946 
   947 text {*
   948 
   949      Derivatives as calculated by Brzozowski's method are usually more
   950      complex regular expressions than the initial one; the result is
   951      that derivative-based matching and lexing algorithms are
   952      often abysmally slow if the ``growth problem'' is not addressed. As Sulzmann and Lu wrote, various
   953      optimisations are possible, such as the simplifications
   954      $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r \Rightarrow r$,
   955      $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow r$. While these
   956      simplifications can considerably speed up the two algorithms in many
   957      cases, they do not fundamentally solve the growth problem with
   958      derivatives. To see this let us return to the example from the
       
   959      Introduction that shows the derivatives for \mbox{@{text "(a + aa)\<^sup>*"}}.
   960      If, in the 3rd step, we delete all $\ZERO$s and $\ONE$s according to
   961      the simplification rules shown above, we obtain
   962      %
   963      \def\xll{\xrightarrow{\_\backslash{} [a, a, a]}}%%
   964      %
   965      \begin{equation}\label{derivex}
   966      (a + aa)^* \quad\xll\quad
   967       \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r} \;+\;
   968      ((a + aa)^* + \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r})
   969      \end{equation}
   970 
   971      \noindent This is a simpler derivative, but unfortunately we
   972      cannot make any further simplifications. This is a problem because
   973      the outermost alternative contains two copies of the same
   974      regular expression (marked with $r$). The copies will
   975      spawn new copies in later steps and these in turn further copies. This
   976      destroys any hope of taming the size of the derivatives.  But the
   977      second copy of $r$ in \eqref{derivex} will never contribute to a
   978      value, because POSIX lexing will always prefer matching a string
   979      with the first copy. So in principle it could be removed.
   980      The dilemma with the simple-minded
   981      simplification rules above is that the rule $r + r \Rightarrow r$
   982      will never be applicable because, as can be seen in this example, the
   983      two copies of $r$ are separated by another sub-regular expression.
   984 
       
   985      But here is where Sulzmann and Lu's representation of generalised
   986      alternatives in the bitcoded algorithm shines: in @{term
   987      "ALTs bs rs"} we can define a more aggressive simplification by
   988      recursively simplifying all regular expressions in @{text rs},
   989      then analysing the resulting list and removing any duplicates.
   990      Another advantage is that the bitsequences in bitcoded regular
   991      expressions can be easily modified such that simplification does not
   992      interfere with the value constructions. For example, we can ``flatten'', or
   993      de-nest, @{text ALTs} as follows
   994      %
   995      \[
   996      @{term "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) # rs\<^sub>1)"}
   997      \quad\xrightarrow{bsimp}\quad
   998      @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
   999      \]
  1000 
  1001      \noindent
  1002      where we just need to fuse the bitsequence that has accumulated in @{text "bs\<^sub>2"}
  1003      to the alternatives in @{text "rs\<^sub>2"}. As we shall show below, this will
  1004      ensure that the correct value corresponding to the original (unsimplified)
  1005      regular expression can still be extracted. %In this way the value construction
  1006      %is not affected by simplification. 
  1007 
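As a small instance of this rule (our own example),
$\textit{ALTs}\,[Z]\,[\textit{ALTs}\,[S]\,[r_1, r_2],\; r_3]$ de-nests in one
step to $\textit{ALTs}\,[Z]\,[\textit{fuse}\,[S]\,r_1,\; \textit{fuse}\,[S]\,r_2,\; r_3]$,
where the bitsequence $[S]$ accumulated in the inner alternative is fused
onto its children.
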
       
  1008      However, there is one problem with the definition for the more
  1009      aggressive simplification rules by Sulzmann and Lu. Recasting
  1010      their definition in our syntax, they define the step of removing
  1011      duplicates as
  1012      %
  1013      \[ @{text "bsimp (ALTs bs rs)"} \dn @{text "ALTs
  1014      bs (nub (map bsimp rs))"}
  1015      \]
  1016 
  1017      \noindent where they first recursively simplify the regular
  1018      expressions in @{text rs} (using @{text map}) and then use
  1019      Haskell's @{text nub}-function to remove potential
  1020      duplicates. While this makes sense when considering the example
  1021      shown in \eqref{derivex}, @{text nub} is the wrong
  1022      function in the case of bitcoded regular expressions. The reason
  1023      is that in general the elements in @{text rs} will have
  1024      different bitsequences annotated to them and in this way @{text nub}
  1025      will never find a duplicate to be removed. The correct way to
  1026      handle this situation is to first \emph{erase} the regular
  1027      expressions when comparing potential duplicates. This is inspired
  1028      by Scala's list functions of the form \mbox{@{text "distinctBy rs f
  1029      acc"}} where a function is applied first before two elements
  1030      are compared. We define this function in Isabelle/HOL as
  1031 
  1032      \begin{center}
  1033      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1034      @{thm (lhs) distinctBy.simps(1)} & $\dn$ & @{thm (rhs) distinctBy.simps(1)}\\
  1035      @{thm (lhs) distinctBy.simps(2)} & $\dn$ & @{thm (rhs) distinctBy.simps(2)}
  1036      \end{tabular}
  1037      \end{center}
  1038 
  1039      \noindent where we scan the list from left to right (because we
  1040      have to remove later copies). In this function, @{text f} is a
  1041      function and @{text acc} is an accumulator for regular
  1042      expressions---essentially a set of elements we have already seen
  1043      while scanning the list. Therefore we delete an element, say @{text x},
  1044      from the list provided @{text "f x"} is already in the accumulator;
  1045      otherwise we keep @{text x} and scan the rest of the list but now
  1046      add @{text "f x"} as another element to @{text acc}. We will use
  1047      @{term distinctBy} where @{text f} is our erase function, @{term "erase (DUMMY)"},
  1048      which deletes bitsequences from bitcoded regular expressions.
  1049      This is clearly a computationally more expensive operation than @{text nub},
  1050      but it is needed in order to make the removal of unnecessary copies
  1051      work.
  1052 
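To make the description concrete, here is a small executable sketch of this
function in Scala (our own illustration, not the Isabelle definition; in
particular, the default empty accumulator is an assumption of this sketch):

\begin{verbatim}
// keep the first occurrence of each element; duplicates are
// detected only after applying f (in our case the function that
// erases the bitsequences of a bitcoded regular expression)
def distinctBy[A, B](xs: List[A], f: A => B,
                     acc: Set[B] = Set()): List[A] = xs match {
  case Nil => Nil
  case x :: rest =>
    if (acc.contains(f(x))) distinctBy(rest, f, acc)
    else x :: distinctBy(rest, f, acc + f(x))
}
\end{verbatim}

\noindent
For example, \texttt{distinctBy(List(1, 2, -1, -2), (x: Int) => x.abs)}
evaluates to \texttt{List(1, 2)}: the two later elements are deleted because
their images under \texttt{abs} have been seen before.
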
       
  1053      Our simplification function depends on three helper functions; one is called
  1054      @{text flts} and is defined as follows:
  1055 
  1056      \begin{center}
  1057      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1058      @{thm (lhs) flts.simps(1)} & $\dn$ & @{thm (rhs) flts.simps(1)}\\
  1059      @{thm (lhs) flts.simps(2)} & $\dn$ & @{thm (rhs) flts.simps(2)}\\
  1060      @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
  1061      \end{tabular}
  1062      \end{center}
  1063 
  1064      \noindent
  1065      The second clause removes all instances of @{text ZERO} in alternatives and
  1066      the third ``spills'' out nested alternatives (but retains the
  1067      bitsequence @{text "bs'"} accumulated in the inner alternative). There are
  1068      some corner cases to be considered when the resulting list inside an alternative is
  1069      empty or a singleton list. We take care of those cases in the
  1070      @{text "bsimpALTs"} function; similarly we define a helper function that simplifies
  1071      sequences according to the usual rules about @{text ZERO}s and @{text ONE}s:
  1072 
  1073      \begin{center}
  1074      \begin{tabular}{c@ {\hspace{5mm}}c}
  1075      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1076      @{text "bsimpALTs bs []"}  & $\dn$ & @{text "ZERO"}\\
  1077      @{text "bsimpALTs bs [r]"} & $\dn$ & @{text "fuse bs r"}\\
  1078      @{text "bsimpALTs bs rs"}  & $\dn$ & @{text "ALTs bs rs"}\\
  1079      \mbox{}\\
  1080      \end{tabular}
  1081      &
  1082      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1083      @{text "bsimpSEQ bs _ ZERO"}  & $\dn$ & @{text "ZERO"}\\
  1084      @{text "bsimpSEQ bs ZERO _"} & $\dn$ & @{text "ZERO"}\\
  1085      @{text "bsimpSEQ bs\<^sub>1 (ONE bs\<^sub>2) r\<^sub>2"}
  1086          & $\dn$ & @{text "fuse (bs\<^sub>1 @ bs\<^sub>2) r\<^sub>2"}\\
  1087      @{text "bsimpSEQ bs r\<^sub>1 r\<^sub>2"} & $\dn$ &  @{text "SEQ bs r\<^sub>1 r\<^sub>2"}
  1088      \end{tabular}
  1089      \end{tabular}
  1090      \end{center}
  1091 
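For example, given the clauses just described,
$\textit{flts}\;[\textit{ZERO},\; \textit{ALTs}\;bs'\,[r_1, r_2]]$ evaluates to
$[\textit{fuse}\;bs'\,r_1,\; \textit{fuse}\;bs'\,r_2]$: the @{text ZERO} is
removed and the inner alternative is spilled out into the surrounding list.
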
       
  1092      \noindent
  1093      With this in place we can define our simplification function as
  1094 
  1095      \begin{center}
  1096      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1097      @{thm (lhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]} & $\dn$ &
  1098          @{thm (rhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]}\\
  1099      @{thm (lhs) bsimp.simps(2)[of "bs" _]} & $\dn$ & @{thm (rhs) bsimp.simps(2)[of "bs" _]}\\
  1100      @{text "bsimp r"} & $\dn$ & @{text r}
  1101      \end{tabular}
  1102      \end{center}
  1103 
  1104      \noindent
  1105      As far as we can see, our recursive function @{term bsimp} simplifies regular
  1106      expressions as intended by Sulzmann and Lu. There is no point in applying the
  1107      @{text bsimp}
  1108      function repeatedly (like the simplification in their paper, which is applied
  1109      until a fixpoint is reached), because we can show that it is idempotent, that is
  1110 
  1111      \begin{proposition}
  1112      $\textit{bsimp}\,(\textit{bsimp}\,r) = \textit{bsimp}\,r$
  1113      \end{proposition}
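\noindent
Returning to the example in \eqref{derivex}: after flattening, the outermost
alternative consists of the list of the three regular expressions $r$,
$(a + aa)^*$ and $r$ (each with its own bitsequence annotation). Erasing the
bitsequences makes the first and third element equal, and therefore
@{term distinctBy} deletes the second copy of $r$---exactly the copy that, as
argued above, can never contribute to a POSIX value.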
  1114 
  1115 
  1116      \begin{lemma}
  1117      @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
  1118      \end{lemma}
  1175   \end{figure}
  1176 *}
  1177 
  1178 section {* Bound - NO *}
  1179 
  1180 
  1181 section {* Conclusion *}
  1182 
  1183 text {*
  1184 
  1185 
  1186 
  1187 %%\bibliographystyle{plain}
  1188 \bibliography{root}
  1189 *}
  1190 