thys2/Paper/Paper.thy
changeset 426 5b77220fdf01
parent 425 14c558ae0b09
child 436 222333d2bdc2
\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method [..] in combination with the simplification steps [..]
  yields POSIX parse trees. We have tested this claim
  extensively [..] but yet
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}

\noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
the derivative-based lexing algorithm of Sulzmann and Lu
\cite{Sulzmann2014} where regular expressions are annotated with
bitsequences. We define the crucial simplification function as a
recursive function, without the need of a fix-point operation. One objective of
the simplification function is to remove duplicates of regular
expressions.  For this Sulzmann and Lu use in their paper the standard
@{text nub} function from Haskell's list library, but this function
does not achieve the intended objective with bitcoded regular expressions.  The
reason is that in the bitcoded setting, each copy generally has a
different bit-annotation, so @{text nub} never recognises two copies
as equal and consequently never removes them.
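
To illustrate the point, consider the following small Haskell experiment
(a toy datatype of our own, for illustration only, not Sulzmann and Lu's
actual code): two copies of the same regular expression that differ only
in their bit-annotations are not merged by @{text nub}, whereas comparing
them modulo erased annotations merges them as intended.

\begin{verbatim}
import Data.List (nub, nubBy)

data Bit   = Z | S deriving (Eq, Show)

-- a cut-down stand-in for bitcoded regular expressions
data ARexp = ACHAR [Bit] Char deriving (Eq, Show)

-- plain regular expressions, i.e. annotations erased
data Rexp  = CHAR Char deriving (Eq, Show)

erase :: ARexp -> Rexp
erase (ACHAR _ c) = CHAR c

-- two copies of the "same" regular expression, different bitcodes
rs :: [ARexp]
rs = [ACHAR [Z] 'a', ACHAR [S] 'a']

main :: IO ()
main = do
  print (nub rs)          -- keeps both copies: the bitcodes differ
  print (nubBy (\r1 r2 -> erase r1 == erase r2) rs)  -- keeps one
\end{verbatim}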
\begin{proposition}\label{matchcorr} 
@{text "match s r"} \;\;\text{if and only if}\;\; $s \in L(r)$
\end{proposition}

\noindent
It is a fun exercise to formally prove this property in a theorem prover.

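For readers who want to experiment, the underlying derivative-based
matcher can be rendered in Haskell as the following minimal sketch
(restricted to the basic regular expressions; the Isabelle definitions
are analogous):

\begin{verbatim}
data Rexp = ZERO | ONE | CHAR Char
          | ALT Rexp Rexp | SEQ Rexp Rexp | STAR Rexp

-- can the regular expression match the empty string?
nullable :: Rexp -> Bool
nullable ZERO        = False
nullable ONE         = True
nullable (CHAR _)    = False
nullable (ALT r1 r2) = nullable r1 || nullable r2
nullable (SEQ r1 r2) = nullable r1 && nullable r2
nullable (STAR _)    = True

-- Brzozowski derivative of r with respect to the character c
der :: Char -> Rexp -> Rexp
der _ ZERO        = ZERO
der _ ONE         = ZERO
der c (CHAR d)    = if c == d then ONE else ZERO
der c (ALT r1 r2) = ALT (der c r1) (der c r2)
der c (SEQ r1 r2)
  | nullable r1   = ALT (SEQ (der c r1) r2) (der c r2)
  | otherwise     = SEQ (der c r1) r2
der c (STAR r)    = SEQ (der c r) (STAR r)

-- match s r iff s is in L(r)
match :: String -> Rexp -> Bool
match s r = nullable (foldl (flip der) r s)

main :: IO ()
main = print (match "aaa" (STAR (CHAR 'a')))   -- True
\end{verbatim}
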
The novel idea of Sulzmann and Lu is to extend this algorithm for 
lexing, where it is important to find out which part of the string
is matched by which part of the regular expression.
For this, Sulzmann and Lu presented two lexing algorithms in their paper
  \end{center}


We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
this algorithm is correct, that is, it generates POSIX values. The
central property we established relates the derivative operation to the
injection function.

  \begin{proposition}\label{Posix2}
	\textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. 
\end{proposition}
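
For concreteness, here is a Haskell sketch of the standard @{text mkeps}
and @{text inj} functions the proposition refers to, reusing @{text Rexp}
and @{text nullable} from the matcher sketch above (the value
constructors carry a @{text V} prefix merely to avoid clashes with
Haskell's Prelude):

\begin{verbatim}
-- values recording how a regular expression matches a string
data Val = VEmpty | VChar Char | VSeq Val Val
         | VLeft Val | VRight Val | VStars [Val]
         deriving Show

-- the value for how a nullable r matches the empty string
-- (partial: only defined for nullable r)
mkeps :: Rexp -> Val
mkeps ONE         = VEmpty
mkeps (ALT r1 r2) =
  if nullable r1 then VLeft (mkeps r1) else VRight (mkeps r2)
mkeps (SEQ r1 r2) = VSeq (mkeps r1) (mkeps r2)
mkeps (STAR _)    = VStars []

-- inject c back into a value for r\c, yielding a value for r
-- (partial: defined only for well-formed value/regex pairs)
inj :: Rexp -> Char -> Val -> Val
inj (CHAR d)    _ VEmpty               = VChar d
inj (ALT r1 _)  c (VLeft v)            = VLeft  (inj r1 c v)
inj (ALT _ r2)  c (VRight v)           = VRight (inj r2 c v)
inj (SEQ r1 r2) c (VLeft (VSeq v1 v2)) = VSeq (inj r1 c v1) v2
inj (SEQ r1 r2) c (VRight v)           = VSeq (mkeps r1) (inj r2 c v)
inj (SEQ r1 r2) c (VSeq v1 v2)         = VSeq (inj r1 c v1) v2
inj (STAR r)    c (VSeq v (VStars vs)) = VStars (inj r c v : vs)
\end{verbatim}

Note that both functions are partial; in the formalisation the
corresponding lemmas carry nullability and well-formedness
side-conditions.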
  \end{proposition}

  \noindent
  In fact we have shown that in the success case the generated POSIX value $v$ is
  unique and in the failure case that there is no POSIX value $v$ that satisfies
  $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
  slow in cases where the derivatives grow arbitrarily (recall the example from the
  Introduction). However, it can be used as a convenient reference point for the correctness
  proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
  
*}

text {*

  In the second part of their paper \cite{Sulzmann2014},
  Sulzmann and Lu describe another algorithm that also generates POSIX
  values but dispenses with the second phase where characters are
  injected ``back'' into values. For this they annotate regular
  expressions with bitcodes, which we define in Isabelle/HOL as the datatype

  \begin{center}
  \begin{tabular}{lcl}
     ensure that the correct value corresponding to the original (unsimplified)
     regular expression can still be extracted. %In this way the value construction
     %is not affected by simplification. 

     However, there is one problem with the definition for the more
     aggressive simplification rules described by Sulzmann and Lu. Recasting
     their definition with our syntax, they define the step of removing
     duplicates as
     %
     \[ @{text "bsimp (ALTs bs rs)"} \dn @{text "ALTs
     bs (nub (map bsimp rs))"}
     \end{tabular}
     \end{tabular}
     \end{center}

     \noindent
     With this in place we can define our simplification function as

     \begin{center}
     \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
     @{thm (lhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]} & $\dn$ &
         @{thm (rhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]}\\

     \noindent do not hold under simplification---this property
     essentially purports that we can retrieve the same value from a
     simplified version of the regular expression. To start with, @{text retrieve}
     depends on the fact that the value @{text v} corresponds to the
     structure of the regular expression---but the whole point of simplification
     is to ``destroy'' this structure by making the regular expression simpler.
     To see this consider the regular expression @{text "r = r' + 0"} and a corresponding
     value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then 
     we can use @{text retrieve} and @{text v} in order to extract a corresponding
     bitsequence. The reason that this works is that @{text r} is an alternative
     be in both @{text "L (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ [r\<^sub>2] @ rs\<^sub>c)"} and
     @{text "L (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ rs\<^sub>c)"}.
     \end{proof}

     \noindent
     From this, we can show that @{text bmkeps} will produce the same bitsequence
     as long as one of the bitcoded regular expressions in $\leadsto$ is nullable (this lemma
     establishes the missing fact we were not able to establish using @{text retrieve}, as suggested
     in the paper by Sulzmann and Lu). 


     \begin{proof}
     By reverse induction on @{term s} generalising over @{text r}.
     \end{proof}

     \noindent
     With these lemmas in place we can finally establish that @{term "blexer_simp"} and @{term "blexer"}
     generate the same value, and using Theorem~\ref{thmone} from the previous section that this value
     is indeed the POSIX value.
     
     \begin{theorem}
     @{thm[mode=IfThen] main_blexer_simp}
  expressions be removed provided a regular expression earlier in the list can
  match the same strings.}\label{SimpRewrites}
  \end{figure}
*}

section {* Finiteness of Derivatives *}

text {*

In this section let us sketch our argument for why the size of the simplified
derivatives with the aggressive simplification function is finite. Suppose
we have a size function for bitcoded regular expressions, written
@{text "|r|"}, which counts the number of nodes if we regard $r$ as a tree
(we omit the precise definition). For this we show that for every $r$
there exists a bound $N$
such that 

\begin{center}
$\forall s. \; |@{term "bders_simp r s"}| < N$
\end{center}

\noindent
We prove this by induction on $r$. The base cases for @{term AZERO},
@{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is
for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction
hypotheses state $\forall s. \; |@{term "bders_simp r\<^sub>1 s"}| < N_1$ and
$\forall s. \; |@{term "bders_simp r\<^sub>2 s"}| < N_2$. We can reason as follows

       
\begin{center}
\begin{tabular}{lcll}
& & $ |@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}|$\\
& $ = $ & $|bsimp(ALTs\;bs\;((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) ::
    [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| $ & (1) \\
& $\leq$ &
    $|distinctBy\,(flts\,((@{term "bders_simp r\<^sub>1 s "}) \cdot r_2) ::
    [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (2) \\
& $\leq$ & $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2| +
             |distinctBy\,(flts\,   [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (3) \\
& $\leq$ & $N_1 + |r_2| + 2 + |distinctBy\,(flts\,   [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])|$ & (4)\\
& $\leq$ & $N_1 + |r_2| + 2 + l_{N_{2}} * N_{2}$ & (5)
\end{tabular}
\end{center}

% tell Chengsong about Indian paper of closed forms of derivatives

       
\noindent
where in (1) $Suf\!fix(s)$ contains the suffixes $s'$ of $s$ for which @{term "bders_simp r\<^sub>1 s''"} is nullable and
@{text "s = s'' @ s'"}. In (3) we know that $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2|$ is 
bounded by $N_1 + |r_2|$. In (5) we know the list comprehension contains only regular expressions of size smaller
than $N_2$. The list length after @{text distinctBy} is bounded by a number, which we call $l_{N_2}$. It stands
for the number of distinct regular expressions with a maximum size $N_2$ (there can only be finitely many of them).
We reason similarly in the @{text Star}-case.\medskip
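
As an aside, @{text distinctBy} itself is short; a Haskell rendering of
the idea behind our Isabelle definition is as follows (in our
formalisation the projection @{text f} is instantiated with the
bitcode-erasing function, so duplicates are detected modulo annotations):

\begin{verbatim}
-- like nub, but compares list elements through a projection f
distinctBy :: Eq b => (a -> b) -> [a] -> [a]
distinctBy f = go []
  where
    go _    []     = []
    go seen (x:xs)
      | f x `elem` seen = go seen xs                -- duplicate: drop
      | otherwise       = x : go (f x : seen) xs    -- first one: keep
\end{verbatim}

\noindent
Since the output contains at most one element per distinct projection,
its length is indeed bounded by the number $l_{N_2}$ of distinct erased
regular expressions of size at most $N_2$.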
       

\noindent
Clearly the bound we give in step (5) of this finiteness argument is very loose and
far from the actual bound we can expect. We can do better than this, but this does not improve
the finiteness property we are proving. If we are interested in a polynomial bound,
one would hope to obtain a similarly tight bound as for partial
derivatives introduced by Antimirov \cite{Antimirov95}. After all, the idea behind
@{text distinctBy} is to maintain a ``set'' of alternatives (like the sets in
partial derivatives). Unfortunately, obtaining the exact same bound would mean
we need to introduce simplifications such as
%
\[ (r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)
\]

\noindent
which exist for partial derivatives. However, if we introduce them in our
setting we would lose the POSIX property of our calculated values. We leave better
bounds for future work.

*}


section {* Conclusion *}

text {*

   We set out in this work to prove in Isabelle/HOL the correctness of
   the second POSIX lexing algorithm by Sulzmann and Lu
   \cite{Sulzmann2014}. This follows earlier work where we established
   the correctness of the first algorithm
   \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
   introduce our own specification about what POSIX values are,
   because the informal definition given by Sulzmann and Lu did not
   stand up to a formal proof. Also for the second algorithm we needed
   to introduce our own definitions and proof ideas in order to establish the
   correctness.  Our interest in the second algorithm 
   lies in the fact that by using bitcoded regular expressions and an aggressive
   simplification method there is a chance that the derivatives
   can be kept universally small  (we established in this paper that
   they can be kept finite for any string). This is important if one is after
   an efficient POSIX lexing algorithm.

   Having proved the correctness of the POSIX lexing algorithm, what
   lessons have we learned? Well, we feel this is a very good example
   where formal proofs give further insight into the matter at
   hand. For example, it is very hard to see a problem with @{text nub}
   vs @{text distinctBy} with only experimental data---one would still
   see the correct result but find that simplification does not simplify in well-chosen, but not
   obscure, examples. We found that from an implementation
   point-of-view it is really important to have the formal proofs of
   the corresponding properties at hand. We have also developed a
   healthy suspicion when experimental data is used to back up
   efficiency claims. For example Sulzmann and Lu write about their
   equivalent of @{term blexer_simp} ``...we can incrementally compute
   bitcoded parse trees in linear time in the size of the input''
   \cite[Page 14]{Sulzmann2014}. 
   Given the growth of the
   derivatives in some cases even after aggressive simplification, this
   claim is hard to believe. A similar claim about a theoretical runtime
   of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates POSIX matches and is based on
   derivatives \cite{verbatim}. In this case derivatives are not simplified.
   Clearly our result of having finite
   derivatives is rather weak in this context, but we think such efficiency claims
   require further scrutiny.\medskip

   \noindent
   Our Isabelle/HOL code is available under \url{https://github.com/urbanchr/posix}.


%%\bibliographystyle{plain}
\bibliography{root}
*}