diff -r b7199d6c672d -r 2416fdec6396 thys2/Paper/Paper.thy --- a/thys2/Paper/Paper.thy Tue Feb 08 14:29:41 2022 +0000 +++ b/thys2/Paper/Paper.thy Wed Feb 09 00:29:04 2022 +0000 @@ -64,6 +64,8 @@ intern ("_\<^latex>\\\mbox{$^\\uparrow$}\" [900] 80) and erase ("_\<^latex>\\\mbox{$^\\downarrow$}\" [1000] 74) and bnullable ("bnullable _" [1000] 80) and + bsimp_AALTs ("bsimpALT _ _" [10,1000] 80) and + bsimp_ASEQ ("bsimpSEQ _ _ _" [10,1000,1000] 80) and bmkeps ("bmkeps _" [1000] 80) and srewrite ("_\<^latex>\\\mbox{$\\,\\stackrel{s}{\\leadsto}$}\ _" [71, 71] 80) and @@ -147,9 +149,8 @@ earlier work to preserve the correctness of Sulzmann and Lu's algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do \emph{not} help with limiting the growth of the derivatives shown -above: the growth is slowed, but the derivatives can still grow quickly -beyond any finite bound. - +above: the growth is slowed, but the derivatives can still grow rather +quickly beyond any finite bound. Sulzmann and Lu overcome this ``growth problem'' in a second algorithm @@ -214,7 +215,7 @@ datatype: \begin{center} - @{text "r :="} + @{text "r ::="} \; @{const "ZERO"} $\mid$ @{const "ONE"} $\mid$ @{term "CH c"} $\mid$ @@ -302,7 +303,7 @@ lexing, where it is important to find out which part of the string is matched by which part of the regular expression. For this Sulzmann and Lu presented two lexing algorithms in their paper - \cite{Sulzmann2014}. This first algorithm consists of two phases: first a + \cite{Sulzmann2014}. The first algorithm consists of two phases: first a matching phase (which is Brzozowski's algorithm) and then a value construction phase. The values encode \emph{how} a regular expression matches a string. \emph{Values} are defined as the inductive datatype @@ -423,7 +424,7 @@ \end{center} \noindent - The function @{text mkeps} is called when the last derivative is nullable, that is + The function @{text mkeps} is run when the last derivative is nullable, that is the string to be matched is in the language of the regular expression. It generates a value for how the last derivative can match the empty string. The injection function then calculates the corresponding value for each intermediate derivative until @@ -433,7 +434,7 @@ where the path from the left to the right involving @{term derivatives}/@{const nullable} is the first phase of the algorithm (calculating successive \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to - left, the second phase. This picture shows the steps required when a + left, the second phase. The picture above shows the steps required when a regular expression, say @{text "r\<^sub>1"}, matches the string @{term "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as: @@ -488,8 +489,10 @@ \end{center} -We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that the algorithm by Sulzmann and Lu -is correct. The cenral property we established relates the derivative operation to the injection function. +We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that +this algorithm is correct, that is it generates POSIX values. The +cenral property we established relates the derivative operation to the +injection function. \begin{proposition}\label{Posix2} \textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. @@ -510,8 +513,8 @@ In fact we have shown that in the success case the generated POSIX value $v$ is unique and in the failure case that there is no POSIX value $v$ that satisfies $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly - slow in examples where the derivatives grow arbitrarily (see example from the - Introduction). However it can be used as a conveninet reference point for the correctness + slow in cases where the derivatives grow arbitrarily (see example from the + Introduction). However it can be used as a convenient reference point for the correctness proof of the second algorithm by Sulzmann and Lu, which we shall describe next. *} @@ -540,7 +543,7 @@ @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular expressions; and @{text rs} for lists of bitcoded regular expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"} - is just an abbreviation for @{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}. + is just an abbreviation for \mbox{@{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}}. For bitsequences we just use lists made up of the constants @{text Z} and @{text S}. The idea with bitcoded regular expressions is to incrementally generate the value information (for @@ -667,7 +670,7 @@ \end{center} \noindent - There is also an \emph{erase}-function, written $a^\downarrow$, which + There is also an \emph{erase}-function, written $r^\downarrow$, which transforms a bitcoded regular expression into a (standard) regular expression by just erasing the annotated bitsequences. We omit the straightforward definition. For defining the algorithm, we also need @@ -705,7 +708,7 @@ \noindent - The key function in the bitcoded algorithm is the derivative of an + The key function in the bitcoded algorithm is the derivative of a bitcoded regular expression. This derivative calculates the derivative but at the same time also the incremental part of bitsequences that contribute to constructing a POSIX value. @@ -752,7 +755,7 @@ $\textit{bmkeps}$ function. Finally it decodes the bitsequence into a value. If the derivative is \emph{not} nullable, then $\textit{None}$ is returned. We can show that this way of calculating a value -generates the same result as with \textit{lexer}. +generates the same result as \textit{lexer}. Before we can proceed we need to define a helper function, called \textit{retrieve}, which Sulzmann and Lu introduced for the correctness proof. @@ -773,7 +776,7 @@ \noindent The idea behind this function is to retrieve a possibly partial -bitcode from a bitcoded regular expression, where the retrieval is +bitsequence from a bitcoded regular expression, where the retrieval is guided by a value. For example if the value is $\Left$ then we descend into the left-hand side of an alternative in order to assemble the bitcode. Similarly for @@ -806,12 +809,12 @@ \end{proof} \noindent -The only problem left for the correctness proof is that the bitcoded algorithm +The only difficulty left for the correctness proof is that the bitcoded algorithm has only a ``forward phase'' where POSIX values are generated incrementally. -We can achive the same effect with @{text lexer} by stacking up injection -functions in the during forward phase. An auxiliary function, called $\textit{flex}$, +We can achieve the same effect with @{text lexer} by stacking up injection +functions during the forward phase. An auxiliary function, called $\textit{flex}$, allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single -phase. +phase and stacked up injection functions. \begin{center} \begin{tabular}{lcl} @@ -824,7 +827,7 @@ \noindent The point of this function is that when reaching the end of the string, we just need to apply the stacked -injection functions to the value generated by $\mkeps$. +injection functions to the value generated by @{text mkeps}. Using this function we can recast the success case in @{text lexer} as follows: @@ -839,14 +842,14 @@ method. While this different method is not efficient (we essentially need to traverse the string $s$ twice, once for building the derivative $r\backslash s$ and another time for stacking up injection -functions using \textit{flex}), it helped us with proving -that incrementally building up values. +functions using \textit{flex}), it helps us with proving +that incrementally building up values generates the same result. -This brings us to our main lemma in this section: if we build a +This brings us to our main lemma in this section: if we calculate a derivative, say $r\backslash s$ and have a value, say $v$, inhabited by this derivative, then we can produce the result $\lexer$ generates by applying this value to the stacked-up injection functions -$\textit{flex}$ assembles. The lemma establishes that this is the same +that $\textit{flex}$ assembles. The lemma establishes that this is the same value as if we build the annotated derivative $r^\uparrow\backslash s$ and then retrieve the corresponding bitcoded version, followed by a decoding step. @@ -893,8 +896,8 @@ \end{proof} \noindent -With this lemma in place, we can prove the correctness of \textit{blexer} such -that it produces the same result as \textit{lexer}. +With this lemma in place, we can prove the correctness of \textit{blexer}---it indeed +produces the same result as \textit{lexer}. \begin{theorem} @@ -913,7 +916,7 @@ \noindent For the \textit{if}-branch suppose $r_d \dn r^\uparrow\backslash s$ and - $d \dn r\backslash s$. We have (*) $\nullable\,d$. We can then show + $d \dn r\backslash s$. We have (*) @{text "nullable d"}. We can then show by Lemma~\ref{bnullable}\textit{(3)} that % \[ @@ -931,11 +934,10 @@ \textit{None}. Therefore we can conclude the proof. \end{proof} -\noindent -This establishes that the bitcoded algorithm by Sulzmann -and Lu \emph{without} simplification produces correct results. This was -only conjectured in their paper \cite{Sulzmann2014}. The next step -is to add simplifications. +\noindent This establishes that the bitcoded algorithm by Sulzmann and +Lu \emph{without} simplification produces correct results. This was +only conjectured by Sulzmann and Lu in their paper +\cite{Sulzmann2014}. The next step is to add simplifications. *} @@ -946,14 +948,169 @@ Derivatives as calculated by Brzozowski’s method are usually more complex regular expressions than the initial one; the result is - that the derivative-based matching and lexing algorithms are - often abysmally slow. However, as Sulzmann and Lu wrote, various + that derivative-based matching and lexing algorithms are + often abysmally slow if the ``growth problem'' is not addressed. As Sulzmann and Lu wrote, various optimisations are possible, such as the simplifications $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow r$. While these - simplifications can speed up the algorithms considerably in many - cases, they do not solve fundamentally the ``growth problem'' with - derivatives. To see this let us return to the example + simplifications can considerably speed up the two algorithms in many + cases, they do not solve fundamentally the growth problem with + derivatives. To see this let us return to the example from the + Introduction that shows the derivatives for \mbox{@{text "(a + aa)\<^sup>*"}}. + If we delete in the 3rd step all $\ZERO{}s$ and $\ONE$s according to + the simplification rules shown above we obtain + % + \def\xll{\xrightarrow{\_\backslash{} [a, a, a]}}%% + % + \begin{equation}\label{derivex} + (a + aa)^* \quad\xll\quad + \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r} \;+\; + ((a + aa)^* + \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r}) + \end{equation} + + \noindent This is a simpler derivative, but unfortunately we + cannot make further simplifications. This is a problem because + the outermost alternatives contains two copies of the same + regular expression (underlined with $r$). The copies will + spawn new copies in later steps and they in turn further copies. This + destroys an hope of taming the size of the derivatives. But the + second copy of $r$ in \eqref{derivex} will never contribute to a + value, because POSIX lexing will always prefer matching a string + with the first copy. So in principle it could be removed. + The dilemma with the simple-minded + simplification rules above is that the rule $r + r \Rightarrow r$ + will never be applicable because as can be seen in this example the + regular expressions are separated by another sub-regular expression. + + But here is where Sulzmann and Lu's representation of generalised + alternatives in the bitcoded algorithm shines: in @{term + "ALTs bs rs"} we can define a more aggressive simplification by + recursively simplifying all regular expressions in @{text rs} and + then analyse the resulting list and remove any duplicates. + Another advantage is that the bitsequences in bitcoded regular + expressions can be easily modified such that simplification does not + interfere with the value constructions. For example we can ``flatten'', or + de-nest, @{text ALTs} as follows + % + \[ + @{term "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) # rs\<^sub>1)"} + \quad\xrightarrow{bsimp}\quad + @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) # rs\<^sub>1)"} + \] + + \noindent + where we just need to fuse the bitsequence that has accumulated in @{text "bs\<^sub>2"} + to the alternatives in @{text "rs\<^sub>2"}. As we shall show below this will + ensure that the correct value corresponding to the original (unsimplified) + regular expression can still be extracted. %In this way the value construction + %is not affected by simplification. + + However there is one problem with the definition for the more + aggressive simlification rules by Sulzmann and Lu. Recasting + their definition with our syntax they define the step of removing + duplicates as + % + \[ @{text "bsimp (ALTs bs rs)"} \dn @{text "ALTs + bs (nup (map bsimp rs))"} + \] + + \noindent where they first recursively simplify the regular + expressions in @{text rs} (using @{text map}) and then use + Haskell's @{text nub}-function to remove potential + duplicates. While this makes sense when considering the example + shown in \eqref{derivex}, @{text nub} is the inappropriate + function in the case of bitcoded regular expressions. The reason + is that in general the n elements in @{text rs} will have a + different bitsequence annotated to it and in this way @{text nub} + will never find a duplicate to be removed. The correct way to + handle this situation is to first \emph{erase} the regular + expressions when comparing potential duplicates. This is inspired + by Scala's list functions of the form \mbox{@{text "distinctBy rs f + acc"}} where a function is applied first before two elements + are compared. We define this function in Isabelle/HOL as + + \begin{center} + \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} + @{thm (lhs) distinctBy.simps(1)} & $\dn$ & @{thm (rhs) distinctBy.simps(1)}\\ + @{thm (lhs) distinctBy.simps(2)} & $\dn$ & @{thm (rhs) distinctBy.simps(2)} + \end{tabular} + \end{center} + + \noindent where we scan the list from left to right (because we + have to remove later copies). In this function, @{text f} is a + function and @{text acc} is an accumulator for regular + expressions---essentially a set of elements we have already seen + while scanning the list. Therefore we delete an element, say @{text x}, + from the list provided @{text "f x"} is already in the accumulator; + otherwise we keep @{text x} and scan the rest of the list but now + add @{text "f x"} as another element to @{text acc}. We will use + @{term distinctBy} where @{text f} is our erase functions, @{term "erase (DUMMY)"}, + that deletes bitsequences from bitcoded regular expressions. + This is clearly a computationally more expensive operation, than @{text nub}, + but is needed in order to make the removal of unnecessary copies + to work. + + Our simplification function depends on three helper functions, one is called + @{text flts} and defined as follows: + + \begin{center} + \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} + @{thm (lhs) flts.simps(1)} & $\dn$ & @{thm (rhs) flts.simps(1)}\\ + @{thm (lhs) flts.simps(2)} & $\dn$ & @{thm (rhs) flts.simps(2)}\\ + @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\ + \end{tabular} + \end{center} + + \noindent + The second clause removes all instances of @{text ZERO} in alternatives and + the second ``spills'' out nested alternatives (but retaining the + bitsequence @{text "bs'"} accumulated in the inner alternative). There are + some corner cases to be considered when the resulting list inside an alternative is + empty or a singleton list. We take care of those cases in the + @{text "bsimpALTs"} function; similarly we define a helper function that simplifies + sequences according to the usual rules about @{text ZERO}s and @{text ONE}s: + + \begin{center} + \begin{tabular}{c@ {\hspace{5mm}}c} + \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} + @{text "bsimpALTs bs []"} & $\dn$ & @{text "ZERO"}\\ + @{text "bsimpALTs bs [r]"} & $\dn$ & @{text "fuse bs r"}\\ + @{text "bsimpALTs bs rs"} & $\dn$ & @{text "ALTs bs rs"}\\ + \mbox{}\\ + \end{tabular} + & + \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} + @{text "bsimpSEQ bs _ ZERO"} & $\dn$ & @{text "ZERO"}\\ + @{text "bsimpSEQ bs ZERO _"} & $\dn$ & @{text "ZERO"}\\ + @{text "bsimpSEQ bs\<^sub>1 (ONE bs\<^sub>2) r\<^sub>2"} + & $\dn$ & @{text "fuse (bs\<^sub>1 @ bs\<^sub>2) r\<^sub>2"}\\ + @{text "bsimpSEQ bs r\<^sub>1 r\<^sub>2"} & $\dn$ & @{text "SEQ bs r\<^sub>1 r\<^sub>2"} + \end{tabular} + \end{tabular} + \end{center} + + \noindent + With this in place we can define our simlification function as + + \begin{center} + \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} + @{thm (lhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & + @{thm (rhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]}\\ + @{thm (lhs) bsimp.simps(2)[of "bs" _]} & $\dn$ & @{thm (rhs) bsimp.simps(2)[of "bs" _]}\\ + @{text "bsimp r"} & $\dn$ & @{text r} + \end{tabular} + \end{center} + + \noindent + As far as we can see, our recursive function @{term bsimp} simplifies regular + expressions as intended by Sulzmann and Lu. There is no point to apply the + @{text bsimp} + function repeatedly (like the simplification in their paper which is applied + until a fixpoint is reached), because we can show that it is idempotent, that is + + \begin{proposition} + ??? + \end{proposition} \begin{lemma} @@ -1020,13 +1177,12 @@ section {* Bound - NO *} -section {* Bounded Regex / Not *} section {* Conclusion *} text {* -\cite{AusafDyckhoffUrban2016} + %%\bibliographystyle{plain} \bibliography{root}