# HG changeset patch # User Chengsong # Date 1644430875 0 # Node ID afd8eff20402b046b749012c9f355397c62aef29 # Parent 5dcecc92608e1535a8289c102ada7d17a14edd25# Parent 5b77220fdf01565ae7338282873a3d2e1199cae4 other diff -r 5dcecc92608e -r afd8eff20402 thys2/Paper/Paper.thy --- a/thys2/Paper/Paper.thy Wed Feb 09 18:21:01 2022 +0000 +++ b/thys2/Paper/Paper.thy Wed Feb 09 18:21:15 2022 +0000 @@ -27,7 +27,7 @@ ders_syn ("_\\_" [79, 1000] 76) and bder_syn ("_\\_" [79, 1000] 76) and bders ("_\\_" [79, 1000] 76) and - bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and + bders_simp ("_\\\<^sub>b\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and ZERO ("\<^bold>0" 81) and ONE ("\<^bold>1" 81) and @@ -70,6 +70,7 @@ srewrite ("_\<^latex>\\\mbox{$\\,\\stackrel{s}{\\leadsto}$}\ _" [71, 71] 80) and rrewrites ("_ \<^latex>\\\mbox{$\\,\\leadsto^*$}\ _" [71, 71] 80) and + srewrites ("_ \<^latex>\\\mbox{$\\,\\stackrel{s}{\\leadsto}^*$}\ _" [71, 71] 80) and blexer_simp ("blexer\<^sup>+" 1000) @@ -172,14 +173,14 @@ method [..] in combination with the simplification steps [..] yields POSIX parse trees. We have tested this claim extensively [..] but yet - have to work out all proof details.'' + have to work out all proof details.'' \cite[Page 14]{Sulzmann2014} \end{quote} \noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL the derivative-based lexing algorithm of Sulzmann and Lu \cite{Sulzmann2014} where regular expressions are annotated with bitsequences. We define the crucial simplification function as a -recursive function, instead of a fix-point operation. One objective of +recursive function, without the need of a fix-point operation. One objective of the simplification function is to remove duplicates of regular expressions. For this Sulzmann and Lu use in their paper the standard @{text nub} function from Haskell's list library, but this function @@ -188,7 +189,7 @@ different bitcode annotation---so @{text nub} would never ``fire''. Inspired by Scala's library for lists, we shall instead use a @{text distinctBy} function that finds duplicates under an erasing function -that deletes bitcodes. +which deletes bitcodes. We shall also introduce our own argument and definitions for establishing the correctness of the bitcoded algorithm when simplifications are included.\medskip @@ -297,7 +298,7 @@ \end{proposition} \noindent -It is a fun exercise to formaly prove this property in a theorem prover. +It is a fun exercise to formally prove this property in a theorem prover. The novel idea of Sulzmann and Lu is to extend this algorithm for lexing, where it is important to find out which part of the string @@ -358,7 +359,7 @@ Unix-like operating systems \cite{POSIX}.} Sometimes these informal rules are called \emph{maximal much rule} and \emph{rule priority}. One contribution of our earlier paper is to give a convenient - specification for what a POSIX value is (the inductive rules are shown in + specification for what POSIX values are (the inductive rules are shown in Figure~\ref{POSIXrules}). \begin{figure}[t] @@ -436,7 +437,7 @@ \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to left, the second phase. The picture above shows the steps required when a regular expression, say @{text "r\<^sub>1"}, matches the string @{term - "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as: + "[a,b,c]"}. 
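\noindent
For readers who want to experiment with the definitions so far, the derivative-based
matcher from the proposition above can be sketched in a few lines of Scala. This is a
minimal sketch using our own names (\texttt{Rexp}, \texttt{der}, \texttt{ders},
\texttt{matcher}); it is not part of our Isabelle formalisation.

\begin{verbatim}
// plain (unannotated) regular expressions
abstract class Rexp
case object ZERO extends Rexp                  // matches nothing
case object ONE extends Rexp                   // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// Brzozowski derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a string; the matcher from the proposition
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean =
  nullable(ders(s.toList, r))
\end{verbatim}

\noindent
For example \texttt{matcher(SEQ(CHAR('a'), STAR(CHAR('b'))), "abb")} yields
\texttt{true}, since the string is in the language of $a\cdot b^*$.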
The first lexing algorithm by Sulzmann and Lu can be defined as: \begin{figure}[t] \begin{center} @@ -491,7 +492,7 @@ We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that this algorithm is correct, that is it generates POSIX values. The -cenral property we established relates the derivative operation to the +central property we established relates the derivative operation to the injection function. \begin{proposition}\label{Posix2} @@ -512,8 +513,8 @@ \noindent In fact we have shown that in the success case the generated POSIX value $v$ is unique and in the failure case that there is no POSIX value $v$ that satisfies - $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly - slow in cases where the derivatives grow arbitrarily (see example from the + $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly + slow in cases where the derivatives grow arbitrarily (recall the example from the Introduction). However it can be used as a convenient reference point for the correctness proof of the second algorithm by Sulzmann and Lu, which we shall describe next. @@ -525,7 +526,7 @@ In the second part of their paper \cite{Sulzmann2014}, Sulzmann and Lu describe another algorithm that also generates POSIX - values but dispences with the second phase where characters are + values but dispenses with the second phase where characters are injected ``back'' into values. For this they annotate bitcodes to regular expressions, which we define in Isabelle/HOL as the datatype @@ -544,7 +545,7 @@ expressions; and @{text rs} for lists of bitcoded regular expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"} is just an abbreviation for \mbox{@{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}}. - For bitsequences we just use lists made up of the + For bitsequences we use lists made up of the constants @{text Z} and @{text S}. The idea with bitcoded regular expressions is to incrementally generate the value information (for example @{text Left} and @{text Right}) as bitsequences. For this @@ -629,7 +630,7 @@ \end{proof} Sulzmann and Lu define the function \emph{internalise} - in order to transform standard regular expressions into annotated + in order to transform (standard) regular expressions into annotated regular expressions. We write this operation as $r^\uparrow$. This internalisation uses the following \emph{fuse} function. @@ -652,7 +653,7 @@ \noindent A regular expression can then be \emph{internalised} into a bitcoded - regular expression as follows. + regular expression as follows: \begin{center} \begin{tabular}{lcl} @@ -676,7 +677,7 @@ straightforward definition. For defining the algorithm, we also need the functions \textit{bnullable} and \textit{bmkeps}, which are the ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on - bitcoded regular expressions, instead of regular expressions. + bitcoded regular expressions. % \begin{center} \begin{tabular}{@ {}c@ {}c@ {}} @@ -710,7 +711,7 @@ \noindent The key function in the bitcoded algorithm is the derivative of a bitcoded regular expression. This derivative calculates the - derivative but at the same time also the incremental part of bitsequences + derivative but at the same time also the incremental part of the bitsequences that contribute to constructing a POSIX value. \begin{center} @@ -811,9 +812,9 @@ \noindent The only difficulty left for the correctness proof is that the bitcoded algorithm has only a ``forward phase'' where POSIX values are generated incrementally. 
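\noindent
Before we come back to this difficulty, it might be helpful to see the bitcoded
machinery as code. The following Scala sketch mirrors the definitions above; the
names are again our own, bitsequences are represented as lists of booleans
(\texttt{false} for @{text Z}, \texttt{true} for @{text S}), and \texttt{decode}
and \texttt{internalise} are omitted.

\begin{verbatim}
// bitcoded regular expressions; bits are booleans here
type Bits = List[Boolean]
val Z = false; val S = true

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, rs: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: Bits, r: ARexp) extends ARexp

// attach a bitsequence to the front of an annotated regex
def fuse(bs: Bits, r: ARexp): ARexp = r match {
  case AZERO => AZERO
  case AONE(cs) => AONE(bs ++ cs)
  case ACHAR(cs, c) => ACHAR(bs ++ cs, c)
  case AALTS(cs, rs) => AALTS(bs ++ cs, rs)
  case ASEQ(cs, r1, r2) => ASEQ(bs ++ cs, r1, r2)
  case ASTAR(cs, r1) => ASTAR(bs ++ cs, r1)
}

def bnullable(r: ARexp): Boolean = r match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, rs) => rs.exists(bnullable)
  case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
  case ASTAR(_, _) => true
}

// bits of the empty-string value; only defined for bnullable r,
// the first nullable alternative determines the bits
def bmkeps(r: ARexp): Bits = r match {
  case AONE(bs) => bs
  case AALTS(bs, r1 :: rs) =>
    if (bnullable(r1)) bs ++ bmkeps(r1) else bmkeps(AALTS(bs, rs))
  case ASEQ(bs, r1, r2) => bs ++ bmkeps(r1) ++ bmkeps(r2)
  case ASTAR(bs, _) => bs ++ List(S)
}

// derivative that also assembles the incremental bits
def bder(c: Char, r: ARexp): ARexp = r match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, rs) => AALTS(bs, rs.map(bder(c, _)))
  case ASEQ(bs, r1, r2) =>
    if (bnullable(r1))
      AALTS(bs, List(ASEQ(Nil, bder(c, r1), r2),
                     fuse(bmkeps(r1), bder(c, r2))))
    else ASEQ(bs, bder(c, r1), r2)
  case ASTAR(bs, r1) =>
    ASEQ(bs, fuse(List(Z), bder(c, r1)), ASTAR(Nil, r1))
}
\end{verbatim}

\noindent
The sequence case of \texttt{bder} shows where the incremental bits come from:
when the first component is nullable, the bits of its empty-string value are
fused onto the derivative of the second component. We now return to the problem
that the bitcoded algorithm has only a forward phase.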
-We can achieve the same effect with @{text lexer} by stacking up injection +We can achieve the same effect with @{text lexer} (which has two phases) by stacking up injection functions during the forward phase. An auxiliary function, called $\textit{flex}$, -allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single +allows us to recast the rules of $\lexer$ in terms of a single phase and stacked up injection functions. \begin{center} @@ -826,7 +827,7 @@ \noindent The point of this function is that when -reaching the end of the string, we just need to apply the stacked +reaching the end of the string, we just need to apply the stacked up injection functions to the value generated by @{text mkeps}. Using this function we can recast the success case in @{text lexer} as follows: @@ -843,11 +844,11 @@ need to traverse the string $s$ twice, once for building the derivative $r\backslash s$ and another time for stacking up injection functions using \textit{flex}), it helps us with proving -that incrementally building up values generates the same result. +that incrementally building up values in @{text blexer} generates the same result. This brings us to our main lemma in this section: if we calculate a -derivative, say $r\backslash s$ and have a value, say $v$, inhabited -by this derivative, then we can produce the result $\lexer$ generates +derivative, say $r\backslash s$, and have a value, say $v$, inhabited +by this derivative, then we can produce the result @{text lexer} generates by applying this value to the stacked-up injection functions that $\textit{flex}$ assembles. The lemma establishes that this is the same value as if we build the annotated derivative $r^\uparrow\backslash s$ @@ -900,7 +901,7 @@ produces the same result as \textit{lexer}. -\begin{theorem} +\begin{theorem}\label{thmone} $\textit{lexer}\,r\,s = \textit{blexer}\,r\,s$ \end{theorem} @@ -969,26 +970,26 @@ \end{equation} \noindent This is a simpler derivative, but unfortunately we - cannot make further simplifications. This is a problem because + cannot make any further simplifications. This is a problem because the outermost alternatives contains two copies of the same - regular expression (underlined with $r$). The copies will - spawn new copies in later steps and they in turn further copies. This - destroys an hope of taming the size of the derivatives. But the + regular expression (underlined with $r$). These copies will + spawn new copies in later derivative steps and they in turn even more copies. This + destroys any hope of taming the size of the derivatives. But the second copy of $r$ in \eqref{derivex} will never contribute to a value, because POSIX lexing will always prefer matching a string - with the first copy. So in principle it could be removed. + with the first copy. So it could be safely removed without affecting the correctness of the algorithm. The dilemma with the simple-minded simplification rules above is that the rule $r + r \Rightarrow r$ will never be applicable because as can be seen in this example the - regular expressions are separated by another sub-regular expression. + regular expressions are not next to each other but separated by another regular expression. But here is where Sulzmann and Lu's representation of generalised alternatives in the bitcoded algorithm shines: in @{term "ALTs bs rs"} we can define a more aggressive simplification by recursively simplifying all regular expressions in @{text rs} and then analyse the resulting list and remove any duplicates. 
- Another advantage is that the bitsequences in bitcoded regular
-  expressions can be easily modified such that simplification does not
+  Another advantage with the bitsequences in bitcoded regular
+  expressions is that they can be easily modified such that simplification does not
   interfere with the value constructions. For example we can ``flatten'', or
   de-nest, @{text ALTs} as follows
   %
@@ -1006,7 +1007,7 @@
   %is not affected by simplification. 

   However there is one problem with the definition for the more
-  aggressive simlification rules by Sulzmann and Lu. Recasting
+  aggressive simplification rules described by Sulzmann and Lu. Recasting
   their definition with our syntax they define the step of removing
   duplicates as
   %
@@ -1020,8 +1021,8 @@
   duplicates. While this makes sense when considering the example
   shown in \eqref{derivex}, @{text nub} is the inappropriate
   function in the case of bitcoded regular expressions. The reason
-  is that in general the n elements in @{text rs} will have a
-  different bitsequence annotated to it and in this way @{text nub}
+  is that in general the elements in @{text rs} will have a
+  different annotated bitsequence and in this way @{text nub}
   will never find a duplicate to be removed. The correct way to
   handle this situation is to first \emph{erase} the regular
   expressions when comparing potential duplicates. This is inspired
@@ -1037,21 +1038,22 @@
   \end{center}

   \noindent where we scan the list from left to right (because we
-  have to remove later copies). In this function, @{text f} is a
+  have to remove later copies). In @{text distinctBy}, @{text f} is a
   function and @{text acc} is an accumulator for regular
-  expressions---essentially a set of elements we have already seen
+  expressions---essentially a set of regular expressions that we have already seen
   while scanning the list. Therefore we delete an element, say @{text x},
   from the list provided @{text "f x"} is already in the accumulator;
-  otherwise we keep @{text x} and scan the rest of the list but now
-  add @{text "f x"} as another element to @{text acc}. We will use
-  @{term distinctBy} where @{text f} is our erase functions, @{term "erase (DUMMY)"},
+  otherwise we keep @{text x} and scan the rest of the list but
+  add @{text "f x"} as another ``seen'' element to @{text acc}. We will use
+  @{term distinctBy} where @{text f} is the erase function, @{term "erase (DUMMY)"},
   that deletes bitsequences from bitcoded regular expressions.
   This is clearly a computationally more expensive operation than @{text nub},
   but is needed in order to make the removal of unnecessary copies
-  to work.
+  to work properly.

   Our simplification function depends on three helper functions; one is called
-  @{text flts} and defined as follows:
+  @{text flts} and analyses lists of regular expressions coming from alternatives.
+  It is defined as follows:

   \begin{center}
   \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -1062,7 +1064,7 @@
   \end{center}

   \noindent
-  The second clause removes all instances of @{text ZERO} in alternatives and
+  The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and
   the third ``spills'' out nested alternatives (but retaining the
   bitsequence @{text "bs'"} accumulated in the inner alternative). 
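\noindent
Both functions are easily rendered as code. The following Scala sketch mirrors
the two definitions just given; it assumes the annotated datatype \texttt{ARexp}
and the \texttt{fuse} function from the earlier sketch, plus an \texttt{erase}
function that deletes bitcodes.

\begin{verbatim}
// keep only the first copy of elements that agree under f
def distinctBy[A, B](xs: List[A], f: A => B,
                     acc: Set[B] = Set[B]()): List[A] = xs match {
  case Nil => Nil
  case x :: rest =>
    if (acc.contains(f(x))) distinctBy(rest, f, acc)   // later copy: drop
    else x :: distinctBy(rest, f, acc + f(x))          // first copy: keep
}

// flatten lists of alternatives: drop ZEROs, spill nested ALTs
def flts(rs: List[ARexp]): List[ARexp] = rs match {
  case Nil => Nil
  case AZERO :: rest => flts(rest)
  case AALTS(bs, rs1) :: rest =>
    rs1.map(fuse(bs, _)) ++ flts(rest)  // retain bs by fusing it onto members
  case r :: rest => r :: flts(rest)
}
\end{verbatim}

\noindent
A call such as \texttt{distinctBy(rs, erase)} then removes all later copies of
bitcoded regular expressions that erase to the same unannotated regular
expression, exactly as described above.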
There are some corner cases to be considered when the resulting list inside an alternative is @@ -1090,7 +1092,7 @@ \end{center} \noindent - With this in place we can define our simlification function as + With this in place we can define our simplification function as \begin{center} \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} @@ -1103,85 +1105,328 @@ \noindent As far as we can see, our recursive function @{term bsimp} simplifies regular - expressions as intended by Sulzmann and Lu. There is no point to apply the - @{text bsimp} - function repeatedly (like the simplification in their paper which is applied - until a fixpoint is reached), because we can show that it is idempotent, that is + expressions as intended by Sulzmann and Lu. There is no point in applying the + @{text bsimp} function repeatedly (like the simplification in their paper which needs to be + applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent, + that is \begin{proposition} - ??? + @{term "bsimp (bsimp r) = bsimp r"} \end{proposition} + \noindent + This can be proved by induction on @{text r} but requires a detailed analysis + that the de-nesting of alternatives always results in a flat list of regular + expressions. We omit the details since it does not concern the correctness proof. + + Next we can include simplification after each derivative step leading to the + following notion of bitcoded derivatives: + + \begin{center} + \begin{tabular}{cc} + \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} + @{thm (lhs) bders_simp.simps(1)} & $\dn$ & @{thm (rhs) bders_simp.simps(1)} + \end{tabular} + & + \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} + @{thm (lhs) bders_simp.simps(2)} & $\dn$ & @{thm (rhs) bders_simp.simps(2)} + \end{tabular} + \end{tabular} + \end{center} + + \noindent + and use it in the improved lexing algorithm defined as + + \begin{center} +\begin{tabular}{lcl} + $\textit{blexer}^+\;r\,s$ & $\dn$ & + $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\ + & & $\;\;\;\;\textit{if}\; \textit{bnullable}(r_{der}) \;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r + \;\;\textit{else}\;\textit{None}$ +\end{tabular} +\end{center} + + \noindent The remaining task is to show that @{term blexer} and + @{term "blexer_simp"} generate the same answers. + + When we first + attempted this proof we encountered a problem with the idea + in Sulzmann and Lu's paper where the argument seems to be to appeal + again to the @{text retrieve}-function defined for the unsimplified version + of the algorithm. But + this does not work, because desirable properties such as + % + \[ + @{text "retrieve r v = retrieve (bsimp r) v"} + \] + + \noindent do not hold under simplification---this property + essentially purports that we can retrieve the same value from a + simplified version of the regular expression. To start with @{text retrieve} + depends on the fact that the value @{text v} correspond to the + structure of the regular expressions---but the whole point of simplification + is to ``destroy'' this structure by making the regular expression simpler. + To see this consider the regular expression @{text "r = r' + 0"} and a corresponding + value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then + we can use @{text retrieve} and @{text v} in order to extract a corresponding + bitsequence. The reason that this works is that @{text r} is an alternative + regular expression and @{text v} a corresponding value. 
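\noindent
How strongly @{text retrieve} depends on this correspondence can be seen in its
clauses for alternatives, sketched here in Scala for the binary case (the
\texttt{Val} datatype fragment and the remaining clauses, which follow the same
pattern, are assumptions of this sketch):

\begin{verbatim}
// values recording how a regular expression matched (fragment)
abstract class Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val

// the value's shape steers where the bits are collected from
def retrieve(r: ARexp, v: Val): Bits = (r, v) match {
  case (AALTS(bs, List(r1, _)), Left(v1))  => bs ++ retrieve(r1, v1)
  case (AALTS(bs, List(_, r2)), Right(v2)) => bs ++ retrieve(r2, v2)
  // ...clauses for AONE, ACHAR, ASEQ and ASTAR elided...
  case _ => sys.error("value does not correspond to the regex")
}
\end{verbatim}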
However, if we simplify
+  @{text r}, then @{text v} does not correspond to the shape of the regular
+  expression anymore. So unless one can somehow
+  synchronise the change in the simplified regular expressions with
+  the original POSIX value, there is no hope of appealing to @{text retrieve} in the
+  correctness argument for @{term blexer_simp}.
+
+  We found it more helpful to introduce the rewrite system shown in
+  Figure~\ref{SimpRewrites}. The idea is to generate
+  simplified regular expressions in small steps (unlike the @{text bsimp}-function which
+  does the same in a big step), and show that each of
+  the small steps preserves the bitcodes that lead to the final POSIX value.
+  The rewrite system is organised such that $\leadsto$ is for bitcoded regular
+  expressions and $\stackrel{s}{\leadsto}$ for lists of bitcoded regular
+  expressions. The former essentially implements the simplifications of
+  @{text "bsimpSEQ"} and @{text flts}, while the latter implements the
+  simplifications in @{text "bsimpALTs"}. We can show that any bitcoded
+  regular expression reduces in one or more steps to the simplified
+  regular expression generated by @{text bsimp}:
+
+  \begin{lemma}\label{lemone}
+  @{thm[mode=IfThen] rewrites_to_bsimp}
+  \end{lemma}
+
+  \begin{proof}
+  By induction on @{text r}. For this we can use the properties
+  @{thm fltsfrewrites} and @{thm ss6_stronger}. The latter uses
+  repeated applications of the $LD$ rule which allows the removal
+  of duplicates that can recognise the same strings.
+  \end{proof}
+
+  \noindent
+  We can show that this rewrite system preserves @{term bnullable}, that
+  is, simplification essentially does not affect nullability:

   \begin{lemma}
   @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
   \end{lemma}

+  \begin{proof}
+  Straightforward mutual induction on the definition of $\leadsto$ and $\stackrel{s}{\leadsto}$.
+  The only interesting case is the rule $LD$ where the property holds since by the side-conditions of that rule the empty string will
+  be in both @{text "L (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ [r\<^sub>2] @ rs\<^sub>c)"} and
+  @{text "L (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ rs\<^sub>c)"}.
+  \end{proof}

-  \begin{lemma}
+  \noindent
+  From this, we can show that @{text bmkeps} will produce the same bitsequence
+  as long as one of the bitcoded regular expressions related by $\leadsto$ is nullable (this lemma
+  establishes the missing fact we were not able to prove using @{text retrieve}, as suggested
+  in the paper by Sulzmann and Lu).
+
+
+  \begin{lemma}\label{lemthree}
   @{thm[mode=IfThen] rewrite_bmkeps_aux(1)[of "r\<^sub>1" "r\<^sub>2"]}
   \end{lemma}

-  \begin{lemma}
-  @{thm[mode=IfThen] rewrites_to_bsimp}
-  \end{lemma}
+  \begin{proof}
+  By straightforward mutual induction on the definition of $\leadsto$ and $\stackrel{s}{\leadsto}$.
+  Again the only interesting case is the rule $LD$ where we need to ensure that
+  \[
+  @{text "bmkeps (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ [r\<^sub>2] @ rs\<^sub>c) =
+          bmkeps (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ rs\<^sub>c)"}
+  \]

+  \noindent holds. This is indeed the case because according to the POSIX rules the
+  generated bitsequence is determined by the first alternative that can match the
+  string (in this case being nullable).
+  \end{proof}
+
+  \noindent
+  Crucial is also the fact that derivative steps and simplification steps can be interleaved,
+  which is shown by the fact that $\leadsto$ is preserved under derivatives. 
\begin{lemma} @{thm[mode=IfThen] rewrite_preserves_bder(1)[of "r\<^sub>1" "r\<^sub>2"]} \end{lemma} - \begin{lemma} + \begin{proof} + By straightforward mutual induction on the definition of $\leadsto$ and $\stackrel{s}{\leadsto}$. + The case for $LD$ holds because @{term "L (erase (bder c r\<^sub>2)) \ L (erase (bder c r\<^sub>1))"} + if and only if @{term "L (erase (r\<^sub>2)) \ L (erase (r\<^sub>1))"}. + \end{proof} + + + \noindent + Using this fact together with Lemma~\ref{lemone} allows us to prove the central lemma + that the unsimplified + derivative (with a string @{term s}) reduces to the simplified derivative (with the same string). + + \begin{lemma}\label{lemtwo} @{thm[mode=IfThen] central} \end{lemma} + \begin{proof} + By reverse induction on @{term s} generalising over @{text r}. + \end{proof} + + \noindent + With these lemmas in place we can finally establish that @{term "blexer_simp"} and @{term "blexer"} + generate the same value, and using Theorem~\ref{thmone} from the previous section that this value + is indeed the POSIX value. + \begin{theorem} @{thm[mode=IfThen] main_blexer_simp} \end{theorem} - Sulzmann \& Lu apply simplification via a fixpoint operation - - ; also does not use erase to filter out duplicates. - - not direct correspondence with PDERs, because of example - problem with retrieve - - correctness - - - - - + \begin{proof} + By unfolding the definitions and using Lemmas~\ref{lemtwo} and \ref{lemthree}. + \end{proof} + + \noindent + This completes the correctness proof for the second POSIX lexing algorithm by Sulzmann and Lu. + The interesting point of this algorithm is that the sizes of derivatives do not grow arbitrarily, which + we shall show next. \begin{figure}[t] \begin{center} \begin{tabular}{c} - @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}\qquad - @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}\qquad - @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}\\ - @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}\qquad - @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}\\ - @{thm[mode=Axiom] bs6}\qquad - @{thm[mode=Axiom] bs7}\\ - @{thm[mode=Rule] bs8[of "rs\<^sub>1" "rs\<^sub>2"]}\\ - %@ { t hm[mode=Axiom] ss1}\qquad - @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}\qquad - @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}\\ - @{thm[mode=Axiom] ss4}\qquad - @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}\\ - @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}\\ + @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\qquad + @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\\ + @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\ + @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad + @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\ + @{thm[mode=Axiom] bs6}$A0$\qquad + @{thm[mode=Axiom] bs7}$A1$\\ + @{thm[mode=Rule] bs8[of "rs\<^sub>1" "rs\<^sub>2"]}$AL$\\ + @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LH$\qquad + @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}$LT$\\ + @{thm[mode=Axiom] ss4}$L\ZERO$\qquad + @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}$LS$\medskip\\ + @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}$LD$\\ \end{tabular} \end{center} - \caption{???}\label{SimpRewrites} + \caption{The rewrite rules that generate simplified regular expressions + in small steps: @{term "rrewrite r\<^sub>1 r\<^sub>2"} is for bitcoded regular + expressions and @{term "rrewrites 
rs\<^sub>1 rs\<^sub>2"} for \emph{lists} of bitcoded
+  regular expressions. Interesting is the $LD$ rule that allows copies of regular
+  expressions to be removed provided a regular expression earlier in the list can
+  match the same strings.}\label{SimpRewrites}
+  \end{figure}
*}

-section {* Bound - NO *}
+section {* Finiteness of Derivatives *}
+
+text {*
+
+In this section let us sketch our argument for why the size of the simplified
+derivatives with the aggressive simplification function is finite. Suppose
+we have a size function for bitcoded regular expressions, written
+@{text "|r|"}, which counts the number of nodes if we regard $r$ as a tree
+(we omit the precise definition). For this we show that for every $r$
+there exists a bound $N$
+such that 
+
+\begin{center}
+$\forall s. \; |@{term "bders_simp r s"}| < N$
+\end{center}
+
+\noindent
+We prove this by induction on $r$. The base cases for @{term AZERO},
+@{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is
+for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction
+hypotheses state $\forall s. \; |@{term "bders_simp r\<^sub>1 s"}| < N_1$ and
+$\forall s. \; |@{term "bders_simp r\<^sub>2 s"}| < N_2$. We can reason as follows
+
+\begin{center}
+\begin{tabular}{lcll}
+& & $ |@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}|$\\
+& $ = $ & $|bsimp(ALTs\;bs\;((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) ::
+    [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| $ & (1) \\
+& $\leq$ &
+    $|distinctBy\,(flts\,((@{term "bders_simp r\<^sub>1 s "}) \cdot r_2) ::
+    [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (2) \\
+& $\leq$ & $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2| +
+             |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (3) \\
+& $\leq$ & $N_1 + |r_2| + 2 + |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])|$ & (4)\\
+& $\leq$ & $N_1 + |r_2| + 2 + l_{N_{2}} * N_{2}$ & (5)
+\end{tabular}
+\end{center}
+
+% tell Chengsong about Indian paper of closed forms of derivatives
+
+\noindent
+where in (1) $Suf\!fix(s)$ stands for the set of suffixes $s'$ of $s$ for which
+@{term "bders_simp r\<^sub>1 s''"} is nullable, where @{text "s = s'' @ s'"}. The $+1$ in (2)
+accounts for the outer alternatives-node. In (3) we know that $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2|$ is 
+bounded by $N_1 + |r_2|$. In (5) we know the list comprehension contains only regular expressions of size smaller
+than $N_2$. The list length after @{text distinctBy} is bounded by a number, which we call $l_{N_2}$. It stands
+for the number of distinct regular expressions with a maximum size $N_2$ (there can only be finitely many of them).
+We reason similarly in the @{text Star}-case.\medskip
+
+\noindent
+Clearly, in this finiteness argument we give (in step (5)) a very loose bound that is
+far from the actual bound we can expect. We can do better than this, but a tighter bound does not improve
+the finiteness property we are proving. If we are interested in a polynomial bound,
+we would hope to obtain a similarly tight bound as for partial
+derivatives introduced by Antimirov \cite{Antimirov95}. After all, the idea with
+@{text distinctBy} is to maintain a ``set'' of alternatives (like the sets in
+partial derivatives). Unfortunately, to obtain the exact same bound we would
+need to introduce simplifications such as
+%
+\[ (r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)
+\]
+
+\noindent
+which exist for partial derivatives. 
However, if we introduce them in our
+setting we would lose the POSIX property of our calculated values. We leave better
+bounds for future work.
+
+*}


section {* Conclusion *}

text {*

+   We set out in this work to prove in Isabelle/HOL the correctness of
+   the second POSIX lexing algorithm by Sulzmann and Lu
+   \cite{Sulzmann2014}. This follows earlier work where we established
+   the correctness of the first algorithm
+   \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
+   introduce our own specification about what POSIX values are,
+   because the informal definition given by Sulzmann and Lu did not
+   stand up to a formal proof. Also for the second algorithm we needed
+   to introduce our own definitions and proof ideas in order to establish the
+   correctness. Our interest in the second algorithm
+   lies in the fact that by using bitcoded regular expressions and an aggressive
+   simplification method there is a chance that the derivatives
+   can be kept universally small (we established in this paper that
+   they can be kept finite for any string). This is important if one is after
+   an efficient POSIX lexing algorithm.
+
+   Having proved the correctness of the POSIX lexing algorithm, what
+   lessons have we learned? Well, we feel this is a very good example
+   where formal proofs give further insight into the matter at
+   hand. For example it is very hard to see a problem with @{text nub}
+   vs @{text distinctBy} with only experimental data---one would still
+   see the correct result, but would find that simplification does not simplify in well-chosen, but not
+   obscure, examples. We found that from an implementation
+   point-of-view it is really important to have the formal proofs of
+   the corresponding properties at hand. We have also developed a
+   healthy suspicion when experimental data is used to back up
+   efficiency claims. For example Sulzmann and Lu write about their
+   equivalent of @{term blexer_simp} ``...we can incrementally compute
+   bitcoded parse trees in linear time in the size of the input''
+   \cite[Page 14]{Sulzmann2014}.
+   Given the growth of the
+   derivatives in some cases even after aggressive simplification, this
+   is a claim that is hard to believe. A similar claim about a theoretical runtime
+   of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates POSIX matches and is based on
+   derivatives \cite{verbatim}. In this case derivatives are not simplified.
+   Clearly our result of having finite
+   derivatives is rather weak in this context, but we think such efficiency claims
+   require further scrutiny.\medskip
+
+   \noindent
+   Our Isabelle/HOL code is available at \url{https://github.com/urbanchr/posix}. 
%%\bibliographystyle{plain} diff -r 5dcecc92608e -r afd8eff20402 thys2/Paper/document/root.bib --- a/thys2/Paper/document/root.bib Wed Feb 09 18:21:01 2022 +0000 +++ b/thys2/Paper/document/root.bib Wed Feb 09 18:21:15 2022 +0000 @@ -351,3 +351,11 @@ } +@INPROCEEDINGS{verbatim, + author={D.~Egolf and S.~Lasser and K.~Fisher}, + booktitle={2021 IEEE Security and Privacy Workshops (SPW)}, + title={{V}erbatim: {A} {V}erified {L}exer {G}enerator}, + year={2021}, + volume={}, + number={}, + pages={92--100}} diff -r 5dcecc92608e -r afd8eff20402 thys2/SizeBound4.thy --- a/thys2/SizeBound4.thy Wed Feb 09 18:21:01 2022 +0000 +++ b/thys2/SizeBound4.thy Wed Feb 09 18:21:15 2022 +0000 @@ -467,7 +467,6 @@ | "flts (r1 # rs) = r1 # flts rs" - fun bsimp_ASEQ :: "bit list \ arexp \ arexp \ arexp" where "bsimp_ASEQ _ AZERO _ = AZERO" @@ -695,7 +694,7 @@ using rs_in_rstar by force lemma ss6_stronger: - shows "rs1 s\* distinctBy rs1 erase {}" + shows "rs s\* distinctBy rs erase {}" using ss6_stronger_aux[of "[]" _] by auto lemma rewrite_preserves_fuse: diff -r 5dcecc92608e -r afd8eff20402 thys2/paper.pdf Binary file thys2/paper.pdf has changed