--- a/thys2/Paper/Paper.thy Wed Feb 09 11:21:28 2022 +0000
+++ b/thys2/Paper/Paper.thy Wed Feb 09 15:14:58 2022 +0000
@@ -173,14 +173,14 @@
method [..] in combination with the simplification steps [..]
yields POSIX parse trees. We have tested this claim
extensively [..] but yet
- have to work out all proof details.''
+ have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
\noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
the derivative-based lexing algorithm of Sulzmann and Lu
\cite{Sulzmann2014} where regular expressions are annotated with
bitsequences. We define the crucial simplification function as a
-recursive function, instead of a fix-point operation. One objective of
+recursive function, without the need for a fix-point operation. One objective of
the simplification function is to remove duplicates of regular
expressions. For this Sulzmann and Lu use in their paper the standard
@{text nub} function from Haskell's list library, but this function
@@ -298,7 +298,7 @@
\end{proposition}
\noindent
-It is a fun exercise to formaly prove this property in a theorem prover.
+It is a fun exercise to formally prove this property in a theorem prover.
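+
+To make this proposition concrete, the matcher can be rendered in Haskell
+roughly as follows (a sketch in our own notation; the constructor names are
+ours, not code from \cite{Sulzmann2014}):
+
+\begin{verbatim}
+-- unannotated regular expressions
+data Rexp = Zero | One | Ch Char
+          | Alt Rexp Rexp | Seq Rexp Rexp | Star Rexp
+
+nullable :: Rexp -> Bool
+nullable Zero        = False
+nullable One         = True
+nullable (Ch _)      = False
+nullable (Alt r1 r2) = nullable r1 || nullable r2
+nullable (Seq r1 r2) = nullable r1 && nullable r2
+nullable (Star _)    = True
+
+-- Brzozowski derivative of a regular expression w.r.t. a character
+der :: Char -> Rexp -> Rexp
+der _ Zero        = Zero
+der _ One         = Zero
+der c (Ch d)      = if c == d then One else Zero
+der c (Alt r1 r2) = Alt (der c r1) (der c r2)
+der c (Seq r1 r2) = if nullable r1
+                    then Alt (Seq (der c r1) r2) (der c r2)
+                    else Seq (der c r1) r2
+der c (Star r)    = Seq (der c r) (Star r)
+
+-- derivative w.r.t. a string, and the resulting matcher
+ders :: String -> Rexp -> Rexp
+ders s r = foldl (flip der) r s
+
+matcher :: Rexp -> String -> Bool
+matcher r s = nullable (ders s r)
+\end{verbatim}
+
+\noindent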
The novel idea of Sulzmann and Lu is to extend this algorithm for
lexing, where it is important to find out which part of the string
@@ -492,7 +492,7 @@
We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
-this algorithm is correct, that is it generates POSIX values. The
+this algorithm is correct, that is, it generates POSIX values. The
-cenral property we established relates the derivative operation to the
+central property we established relates the derivative operation to the
injection function.
\begin{proposition}\label{Posix2}
@@ -513,7 +513,7 @@
\noindent
In fact we have shown that in the success case the generated POSIX value $v$ is
unique and in the failure case that there is no POSIX value $v$ that satisfies
- $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly
+ $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
slow in cases where the derivatives grow arbitrarily (recall the example from the
Introduction). However it can be used as a convenient reference point for the correctness
proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
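+
+ For illustration, this first algorithm can be rendered in Haskell roughly as
+ follows, reusing @{text Rexp}, @{text nullable} and @{text der} from the
+ sketch above (our own rendering of the functions from
+ \cite{AusafDyckhoffUrban2016}, with constructor names that are ours):
+
+\begin{verbatim}
+-- values encode how a regular expression matched a string
+data Val = Empty | Chr Char | L Val | R Val
+         | Sq Val Val | Stars [Val]
+
+-- mkeps picks the POSIX value for a nullable regular expression
+mkeps :: Rexp -> Val
+mkeps One         = Empty
+mkeps (Alt r1 r2) = if nullable r1 then L (mkeps r1) else R (mkeps r2)
+mkeps (Seq r1 r2) = Sq (mkeps r1) (mkeps r2)
+mkeps (Star _)    = Stars []
+
+-- inj turns a value for (der c r) into a value for r by
+-- injecting the character c "back" into the value
+inj :: Rexp -> Char -> Val -> Val
+inj (Ch d)      _ Empty             = Chr d
+inj (Alt r1 _)  c (L v)             = L (inj r1 c v)
+inj (Alt _ r2)  c (R v)             = R (inj r2 c v)
+inj (Seq r1 _)  c (Sq v1 v2)        = Sq (inj r1 c v1) v2
+inj (Seq r1 _)  c (L (Sq v1 v2))    = Sq (inj r1 c v1) v2
+inj (Seq r1 r2) c (R v)             = Sq (mkeps r1) (inj r2 c v)
+inj (Star r)    c (Sq v (Stars vs)) = Stars (inj r c v : vs)
+
+-- the lexer: derivatives in the first phase, injections in the second
+lexer :: Rexp -> String -> Maybe Val
+lexer r []     = if nullable r then Just (mkeps r) else Nothing
+lexer r (c:cs) = fmap (inj r c) (lexer (der c r) cs)
+\end{verbatim}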
@@ -526,7 +526,7 @@
In the second part of their paper \cite{Sulzmann2014},
Sulzmann and Lu describe another algorithm that also generates POSIX
- values but dispences with the second phase where characters are
+ values but dispenses with the second phase where characters are
injected ``back'' into values. For this they annotate bitcodes to
regular expressions, which we define in Isabelle/HOL as the datatype
@@ -1007,7 +1007,7 @@
%is not affected by simplification.
However there is one problem with the definition for the more
- aggressive simlification rules described by Sulzmann and Lu. Recasting
+ aggressive simplification rules described by Sulzmann and Lu. Recasting
their definition with our syntax they define the step of removing
duplicates as
%
@@ -1092,7 +1092,7 @@
\end{center}
\noindent
- With this in place we can define our simlification function as
+ With this in place we can define our simplification function as
\begin{center}
\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -1164,7 +1164,7 @@
essentially purports that we can retrieve the same value from a
simplified version of the regular expression. To start with @{text retrieve}
- depends on the fact that the value @{text v} correspond to the
- structure of the regular exprssions---but the whole point of simplification
+ depends on the fact that the value @{text v} corresponds to the
+ structure of the regular expressions---but the whole point of simplification
is to ``destroy'' this structure by making the regular expression simpler.
To see this consider the regular expression @{text "r = r' + 0"} and a corresponding
value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then
@@ -1217,7 +1217,7 @@
\end{proof}
\noindent
- From this, we can show that @{text bmkeps} will producte the same bitsequence
+ From this, we can show that @{text bmkeps} will produce the same bitsequence
as long as one of the bitcoded regular expressions in $\leadsto$ is nullable (this lemma
establishes the missing fact we were not able to establish using @{text retrieve}, as suggested
- in the paper by Sulzmannn and Lu).
+ in the paper by Sulzmann and Lu).
@@ -1269,7 +1269,7 @@
\end{proof}
\noindent
- With these lemmas in place we can finaly establish that @{term "blexer_simp"} and @{term "blexer"}
+ With these lemmas in place we can finally establish that @{term "blexer_simp"} and @{term "blexer"}
generate the same value, and using Theorem~\ref{thmone} from the previous section that this value
is indeed the POSIX value.
@@ -1313,26 +1313,120 @@
\end{figure}
*}
-section {* Bound - NO *}
+section {* Finiteness of Derivatives *}
+
+text {*
+
+In this section let us sketch our argument for why the size of the simplified
+derivatives with the aggressive simplification function remains finite. Suppose
+we have a size function for bitcoded regular expressions, written
+@{text "|r|"}, which counts the number of nodes if we regard $r$ as a tree
+(we omit the precise definition). We then show that for every $r$
+there exists a bound $N$
+such that
+
+\begin{center}
+$\forall s. \; |@{term "bders_simp r s"}| < N$
+\end{center}
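+
+\noindent
+For concreteness, the size function we have in mind could be rendered in
+Haskell as follows (a sketch only; the constructor names mirror the bitcoded
+regular expressions used in this paper, but the datatype below is our own
+rendering, not the Isabelle definition):
+
+\begin{verbatim}
+data Bit   = Z | S
+
+-- bitcoded regular expressions with list-shaped alternatives
+data ARexp = AZERO | AONE [Bit] | ACHAR [Bit] Char
+           | AALTs [Bit] [ARexp] | ASEQ [Bit] ARexp ARexp
+           | ASTAR [Bit] ARexp
+
+-- |r|: the number of nodes when r is regarded as a tree
+size :: ARexp -> Int
+size AZERO          = 1
+size (AONE _)       = 1
+size (ACHAR _ _)    = 1
+size (AALTs _ rs)   = 1 + sum (map size rs)
+size (ASEQ _ r1 r2) = 1 + size r1 + size r2
+size (ASTAR _ r)    = 1 + size r
+\end{verbatim}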
+
+\noindent
+We prove this by induction on $r$. The base cases for @{term AZERO},
+@{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is
+for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction
+hypotheses state $\forall s. \; |@{term "bders_simp r\<^sub>1 s"}| < N_1$ and
+$\forall s. \; |@{term "bders_simp r\<^sub>2 s"}| < N_2$. We can reason as follows
+
+\begin{center}
+\begin{tabular}{lcll}
+& & $ |@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}|$\\
+& $ = $ & $|bsimp(ALTs\;bs\;((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) ::
+ [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| $ & (1) \\
+& $\leq$ &
+ $|distinctBy\,(flts\,((@{term "bders_simp r\<^sub>1 s "}) \cdot r_2) ::
+ [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (2) \\
+& $\leq$ & $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2| +
+ |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (3) \\
+& $\leq$ & $N_1 + |r_2| + 2 + |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])|$ & (4)\\
+& $\leq$ & $N_1 + |r_2| + 2 + l_{N_{2}} * N_{2}$ & (5)
+\end{tabular}
+\end{center}
+
+% tell Chengsong about Indian paper of closed forms of derivatives
+
+\noindent
+where in (1) $Suf\!fix(s)$ stands for the set of suffixes $s'$ of $s$ for which @{term "bders_simp r\<^sub>1 s''"}
+is nullable with @{text "s = s'' @ s'"}. In (3) we know that $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2|$ is
+bounded by $N_1 + |r_2|$. In (5) we know the list comprehension contains only regular expressions of size smaller
+than $N_2$. The length of the list after @{text distinctBy} is bounded by a number, which we call $l_{N_2}$. It stands
+for the number of distinct regular expressions of size at most $N_2$ (there can only be finitely many of them).
+We reason similarly in the @{text Star}-case.\medskip
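+
+\noindent
+The function @{text distinctBy} used in the derivation above can be pictured
+as the following Haskell function, which removes duplicates modulo a
+projection @{text f}; for bitcoded regular expressions @{text f} would erase
+the bitcodes, so that two copies differing only in their bits count as
+duplicates. This is a sketch of the idea, not our exact Isabelle definition:
+
+\begin{verbatim}
+-- keep the first occurrence of each element, where two elements count
+-- as equal when the projection f maps them to the same image; the
+-- accumulator acc records the images seen so far
+distinctBy :: Eq b => [a] -> (a -> b) -> [b] -> [a]
+distinctBy []     _ _ = []
+distinctBy (x:xs) f acc
+  | f x `elem` acc = distinctBy xs f acc
+  | otherwise      = x : distinctBy xs f (f x : acc)
+\end{verbatim}
+\medskip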
+
+\noindent
+Clearly, the bound we give in this finiteness argument (Step (5)) is very loose and
+far from the bound one can expect in practice. We could do better, but a tighter bound
+does not strengthen the finiteness property we are proving. If one is interested in a
+polynomial bound, one would hope to obtain a similarly tight bound as the one for partial
+derivatives introduced by Antimirov \cite{Antimirov95}. After all, the idea behind
+@{text distinctBy} is to maintain a ``set'' of alternatives (like the sets in
+partial derivatives). Unfortunately, obtaining the exact same bound would mean
+introducing simplifications such as
+%
+\[ (r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)
+\]
+
+\noindent
+which exist for partial derivatives. However, if we introduced them in our
+setting, we would lose the POSIX property of our calculated values. We leave better
+bounds for future work.
+
+*}
section {* Conclusion *}
text {*
-We set out in this work to implement in Isabelle/HOL the secod lexing
- algorithm by Sulzmann and Lu \cite{Sulzmann2014}. This algorithm is
- interesting because it includes an ``aggressive'' simplification
- function that keeps the sizes of derivatives finite.
+ We set out in this work to prove in Isabelle/HOL the correctness of
+ the second POSIX lexing algorithm by Sulzmann and Lu
+ \cite{Sulzmann2014}. This follows earlier work where we established
+ the correctness of the first algorithm
+ \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
+ introduce our own specification of what POSIX values are,
+ because the informal definition given by Sulzmann and Lu did not
+ stand up to a formal proof. Also for the second algorithm we needed
+ to introduce our own definitions and proof ideas in order to establish its
+ correctness. Our interest in the second algorithm
+ lies in the fact that by using bitcoded regular expressions and an aggressive
+ simplification method there is a chance that the derivatives
+ can be kept universally small (we established in this paper that
+ they can be kept finite for any string). This is important if one is after
+ an efficient POSIX lexing algorithm.
- Having proved the correctness of the POSIX lexing algorithm in
- [53], which lessons have we learned? Well, we feel this is a
- perfect example where formal proofs give further insight into the
- matter at hand. Our proofs were done both done by hand and
- checked in Isabelle/HOL. The experience of doing our proofs in this
- way has been that the mechanical checking was absolutely essential:
- despite the apparent maturity, this subject area has hidden
- snares.
+ Having proved the correctness of the POSIX lexing algorithm, what
+ lessons have we learned? Well, we feel this is a very good example
+ where formal proofs give further insight into the matter at
+ hand. For example, it is very hard to see a problem with @{text nub}
+ versus @{text distinctBy} from experimental data alone: one would still
+ obtain the correct result, but find that simplification fails to simplify
+ well-chosen, yet by no means obscure, examples (see the sketch at the
+ end of this section). We found that from an implementation
+ point of view it is really important to have the formal proofs of
+ the corresponding properties at hand. We have also developed a
+ healthy suspicion when experimental data is used to back up
+ efficiency claims. For example Sulzmann and Lu write about their
+ equivalent of @{term blexer_simp} ``...we can incrementally compute
+ bitcoded parse trees in linear time in the size of the input''
+ \cite[Page 14]{Sulzmann2014}.
+ Given the growth of the
+ derivatives in some cases even after aggressive simplification, this
+ claim is hard to believe. A similar claim about a theoretical runtime
+ of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates POSIX matches and is based on
+ derivatives \cite{verbatim}; in this case derivatives are not simplified.
+ Clearly our result of having finite
+ derivatives is rather weak in this context, but we think such efficiency claims
+ require further scrutiny.\medskip
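+
+ \noindent
+ As a toy illustration of the @{text nub} point (the sketch promised
+ above): @{text nub} compares bitcoded regular expressions together with
+ their bitcodes, so two copies of the same regular expression that differ
+ only in their bits are not recognised as duplicates, whereas comparing
+ modulo bit-erasure does recognise them. The one-constructor datatype below
+ is deliberately minimal and hypothetical:
+
+\begin{verbatim}
+import Data.List (nub, nubBy)
+
+data Bit   = Z | S            deriving (Eq, Show)
+data ARexp = ACHAR [Bit] Char deriving (Eq, Show)
+
+-- erasing the bitcodes (here simply the underlying character)
+erase :: ARexp -> Char
+erase (ACHAR _ c) = c
+
+main :: IO ()
+main = do
+  -- nub keeps both copies, since the bitcodes differ:
+  print (nub [ACHAR [Z] 'a', ACHAR [S] 'a'])
+  -- comparing modulo erasure removes the duplicate:
+  print (nubBy (\x y -> erase x == erase y) [ACHAR [Z] 'a', ACHAR [S] 'a'])
+\end{verbatim}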
+
+ \noindent
+ Our Isabelle/HOL code is available under \url{https://github.com/urbanchr/posix}.
%%\bibliographystyle{plain}