# HG changeset patch
# User Christian Urban
# Date 1644419698 0
# Node ID 5b77220fdf01565ae7338282873a3d2e1199cae4
# Parent 14c558ae0b09fb71189716b7042e9828930022d6
updated

diff -r 14c558ae0b09 -r 5b77220fdf01 thys2/Paper/Paper.thy
--- a/thys2/Paper/Paper.thy Wed Feb 09 11:21:28 2022 +0000
+++ b/thys2/Paper/Paper.thy Wed Feb 09 15:14:58 2022 +0000
@@ -173,14 +173,14 @@
 method [..] in combination with the simplification steps [..]
 yields POSIX parse trees. We have tested this claim
 extensively [..] but yet
- have to work out all proof details.''
+ have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
 \end{quote}

 \noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
 the derivative-based lexing algorithm of Sulzmann and Lu
 \cite{Sulzmann2014} where regular expressions are annotated with
 bitsequences. We define the crucial simplification function as a
-recursive function, instead of a fix-point operation. One objective of
+recursive function, without the need for a fix-point operation. One objective of
 the simplification function is to remove duplicates of regular
 expressions. For this Sulzmann and Lu use in their paper the standard
 @{text nub} function from Haskell's list library, but this function
@@ -298,7 +298,7 @@
 \end{proposition}

 \noindent
-It is a fun exercise to formaly prove this property in a theorem prover.
+It is a fun exercise to formally prove this property in a theorem prover.

 The novel idea of Sulzmann and Lu is to extend this algorithm for
 lexing, where it is important to find out which part of the string
@@ -492,7 +492,7 @@

 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
 this algorithm is correct, that is, it generates POSIX values. The
-cenral property we established relates the derivative operation to the
+central property we established relates the derivative operation to the
 injection function.

 \begin{proposition}\label{Posix2}
@@ -513,7 +513,7 @@
 \noindent
 In fact we have shown that in the success case the generated POSIX value $v$ is
 unique and in the failure case that there is no POSIX value $v$ that satisfies
- $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly
+ $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
 slow in cases where the derivatives grow arbitrarily (recall the example from the
 Introduction). However it can be used as a convenient reference point for the correctness
 proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
@@ -526,7 +526,7 @@

 In the second part of their paper \cite{Sulzmann2014},
 Sulzmann and Lu describe another algorithm that also generates POSIX
- values but dispences with the second phase where characters are
+ values but dispenses with the second phase where characters are
 injected ``back'' into values. For this they annotate bitcodes to
 regular expressions, which we define in Isabelle/HOL as the datatype
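In Haskell terms the bitcoded regular expressions can be approximated as
follows. The constructor names mirror the terms @{term AZERO}, @{term "AONE bs"},
@{term "ACHAR bs c"} and @{term "ASEQ bs r\<^sub>1 r\<^sub>2"} used later in
the paper, but the concrete syntax, the modelling of bitsequences as boolean
lists, and the @{text erase} and @{text distinctBy} functions below are only
illustrative sketches, not the Isabelle definitions. The sketch also makes the
@{text nub} problem concrete: @{text nub} compares annotated expressions
directly and therefore misses duplicates that differ only in their bitcodes,
whereas a @{text distinctBy} keyed on erased expressions removes them.

\begin{verbatim}
import Data.List (nub)

type Bits = [Bool]

data ARexp = AZERO
           | AONE  Bits
           | ACHAR Bits Char
           | AALTs Bits [ARexp]
           | ASEQ  Bits ARexp ARexp
           | ASTAR Bits ARexp
  deriving (Eq, Show)

-- forget the bitcodes; zeroing out the annotations plays the role of
-- the paper's erasure for the purpose of duplicate detection
erase :: ARexp -> ARexp
erase AZERO          = AZERO
erase (AONE _)       = AONE []
erase (ACHAR _ c)    = ACHAR [] c
erase (AALTs _ rs)   = AALTs [] (map erase rs)
erase (ASEQ _ r1 r2) = ASEQ [] (erase r1) (erase r2)
erase (ASTAR _ r)    = ASTAR [] (erase r)

-- keep the first occurrence of each f-image; nub is the case f = id
distinctBy :: Eq b => (a -> b) -> [a] -> [a]
distinctBy f = go []
  where go _   []     = []
        go acc (x:xs)
          | f x `elem` acc = go acc xs
          | otherwise      = x : go (f x : acc) xs

main :: IO ()
main = do
  let rs = [ACHAR [True] 'a', ACHAR [False] 'a']
  print (length (nub rs))               -- 2: the bitcodes differ
  print (length (distinctBy erase rs))  -- 1: duplicate modulo bitcodes
\end{verbatim}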
@@ -1007,7 +1007,7 @@
 %is not affected by simplification.

 However there is one problem with the definition for the more
- aggressive simlification rules described by Sulzmann and Lu. Recasting
+ aggressive simplification rules described by Sulzmann and Lu. Recasting
 their definition with our syntax they define the step of removing
 duplicates as
 %
@@ -1092,7 +1092,7 @@
 \end{center}

 \noindent
- With this in place we can define our simlification function as
+ With this in place we can define our simplification function as

 \begin{center}
 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -1164,7 +1164,7 @@
 essentially purports that we can retrieve the same value from a
 simplified version of the regular expression. To start with @{text retrieve}
 depends on the fact that the value @{text v} corresponds to the
- structure of the regular exprssions---but the whole point of simplification
+ structure of the regular expressions---but the whole point of simplification
 is to ``destroy'' this structure by making the regular expression simpler.
 To see this consider the regular expression @{text "r = r' + 0"} and a
 corresponding value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then
@@ -1217,7 +1217,7 @@
 \end{proof}

 \noindent
- From this, we can show that @{text bmkeps} will producte the same bitsequence
+ From this, we can show that @{text bmkeps} will produce the same bitsequence
 as long as one of the bitcoded regular expressions in $\leadsto$ is nullable
 (this lemma establishes the missing fact we were not able to establish
 using @{text retrieve}, as suggested in the paper by Sulzmann and Lu).
@@ -1269,7 +1269,7 @@
 \end{proof}

 \noindent
- With these lemmas in place we can finaly establish that @{term "blexer_simp"} and @{term "blexer"}
+ With these lemmas in place we can finally establish that @{term "blexer_simp"} and @{term "blexer"}
 generate the same value, and using Theorem~\ref{thmone} from the previous
 section that this value is indeed the POSIX value.
@@ -1313,26 +1313,120 @@
 \end{figure}
 *}

-section {* Bound - NO *}
+section {* Finiteness of Derivatives *}
+
+text {*
+
+In this section let us sketch our argument for why the size of the simplified
+derivatives with the aggressive simplification function remains finite. Suppose
+we have a size function for bitcoded regular expressions, written
+@{text "|r|"}, which counts the number of nodes if we regard $r$ as a tree
+(we omit the precise definition). We then show that for every $r$
+there exists a bound $N$
+such that
+
+\begin{center}
+$\forall s. \; |@{term "bders_simp r s"}| < N$
+\end{center}
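+
+\noindent
+To see why such a bound is remarkable at all, recall that without
+simplification the derivatives can grow arbitrarily, for instance for
+$(a + aa)^*$. The following self-contained Haskell experiment---a toy
+sketch with plain, un-bitcoded regular expressions and the standard
+derivative operation, where all names are ours and not the
+formalisation's---prints the sizes of the iterated derivatives, which
+grow quickly:
+
+\begin{verbatim}
+data Rexp = Zero | One | Ch Char
+          | Seq Rexp Rexp | Alt Rexp Rexp | Star Rexp
+  deriving (Eq, Show)
+
+nullable :: Rexp -> Bool
+nullable Zero        = False
+nullable One         = True
+nullable (Ch _)      = False
+nullable (Seq r1 r2) = nullable r1 && nullable r2
+nullable (Alt r1 r2) = nullable r1 || nullable r2
+nullable (Star _)    = True
+
+der :: Char -> Rexp -> Rexp
+der _ Zero        = Zero
+der _ One         = Zero
+der c (Ch d)      = if c == d then One else Zero
+der c (Seq r1 r2)
+  | nullable r1   = Alt (Seq (der c r1) r2) (der c r2)
+  | otherwise     = Seq (der c r1) r2
+der c (Alt r1 r2) = Alt (der c r1) (der c r2)
+der c (Star r)    = Seq (der c r) (Star r)
+
+size :: Rexp -> Int
+size (Seq r1 r2) = 1 + size r1 + size r2
+size (Alt r1 r2) = 1 + size r1 + size r2
+size (Star r)    = 1 + size r
+size _           = 1
+
+-- sizes of the derivatives w.r.t. a, aa, aaa, ...: they keep growing
+main :: IO ()
+main = print [size (foldl (flip der) r (replicate n 'a')) | n <- [0..12]]
+  where r = Star (Alt (Ch 'a') (Seq (Ch 'a') (Ch 'a')))
+\end{verbatim}
+
+\noindent
+The bound claimed for @{term "bders_simp"} rules out this behaviour once
+@{term bsimp} is interleaved after every derivative step.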
+
+\noindent
+We prove this by induction on $r$. The base cases for @{term AZERO},
+@{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is
+for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction
+hypotheses state $\forall s. \; |@{term "bders_simp r\<^sub>1 s"}| < N_1$ and
+$\forall s. \; |@{term "bders_simp r\<^sub>2 s"}| < N_2$. We can reason as follows
+
+\begin{center}
+\begin{tabular}{lcll}
+& & $ |@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}|$\\
+& $ = $ & $|bsimp(ALTs\;bs\;((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) ::
+ [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| $ & (1) \\
+& $\leq$ &
+ $|distinctBy\,(flts\,((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) ::
+ [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (2) \\
+& $\leq$ & $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2| +
+ |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (3) \\
+& $\leq$ & $N_1 + |r_2| + 2 + |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])|$ & (4)\\
+& $\leq$ & $N_1 + |r_2| + 2 + l_{N_{2}} * N_{2}$ & (5)
+\end{tabular}
+\end{center}
+
+% tell Chengsong about Indian paper of closed forms of derivatives
+
+\noindent
+where in (1) $Suf\!fix(s)$ stands for the set of suffixes $s'$ of $s$ for which
+@{term "bders_simp r\<^sub>1 s''"} is nullable, where @{text "s = s'' @ s'"}.
+In (3) we know that $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2|$ is
+bounded by $N_1 + |r_2|$. In (5) we know the list comprehension contains only
+regular expressions of size smaller than $N_2$. The list length after
+@{text distinctBy} is bounded by a number, which we call $l_{N_2}$. It stands
+for the number of distinct regular expressions of size at most $N_2$
+(there can only be finitely many of them).
+We reason similarly in the @{text Star}-case.\medskip
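+
+\noindent
+That $l_{N_2}$ is indeed finite can be seen by enumeration: over a finite
+alphabet there are only finitely many regular expressions of size at most
+$N_2$. The small Haskell enumeration below---again a toy sketch of ours,
+over a one-letter alphabet, with sizes counted as for @{text "|r|"}---computes
+these numbers and also shows how fast they grow, which is one reason why the
+bound obtained in (5) is so loose:
+
+\begin{verbatim}
+data Rexp = Zero | One | Ch
+          | Seq Rexp Rexp | Alt Rexp Rexp | Star Rexp
+  deriving (Eq, Show)
+
+size :: Rexp -> Int
+size (Seq r1 r2) = 1 + size r1 + size r2
+size (Alt r1 r2) = 1 + size r1 + size r2
+size (Star r)    = 1 + size r
+size _           = 1
+
+-- all regular expressions of size <= n (distinct by construction)
+rexps :: Int -> [Rexp]
+rexps n | n < 1 = []
+rexps n = [Zero, One, Ch]
+       ++ [Star r | r <- rexps (n - 1)]
+       ++ [c r1 r2 | c  <- [Seq, Alt]
+                   , r1 <- rexps (n - 2)
+                   , r2 <- rexps (n - 1 - size r1)]
+
+main :: IO ()
+main = print [length (rexps n) | n <- [1..6]]   -- 3, 6, 27, 84, ...
+\end{verbatim}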
+
+\noindent
+Clearly, in step (5) of this finiteness argument we give a very loose bound that is
+far from the actual bound we can expect. We can do better than this, but a tighter
+bound does not improve the finiteness property we are proving. If one is interested
+in a polynomial bound, one would hope to obtain a tight bound similar to the one for
+the partial derivatives introduced by Antimirov \cite{Antimirov95}. After all, the idea behind
+@{text distinctBy} is to maintain a ``set'' of alternatives (like the sets in
+partial derivatives). Unfortunately, obtaining the exact same bound would mean
+we need to introduce simplifications, such as
+%
+\[ (r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)
+\]
+
+\noindent
+which exist for partial derivatives. However, if we introduced them in our
+setting we would lose the POSIX property of our calculated values. We leave better
+bounds for future work.
+
+*}

 section {* Conclusion *}

 text {*

-We set out in this work to implement in Isabelle/HOL the secod lexing
- algorithm by Sulzmann and Lu \cite{Sulzmann2014}. This algorithm is
- interesting because it includes an ``aggressive'' simplification
- function that keeps the sizes of derivatives finite.
+ We set out in this work to prove in Isabelle/HOL the correctness of
+ the second POSIX lexing algorithm by Sulzmann and Lu
+ \cite{Sulzmann2014}. This follows earlier work where we established
+ the correctness of the first algorithm
+ \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
+ introduce our own specification of what POSIX values are,
+ because the informal definition given by Sulzmann and Lu did not
+ stand up to a formal proof. Also for the second algorithm we needed
+ to introduce our own definitions and proof ideas in order to establish the
+ correctness. Our interest in the second algorithm
+ lies in the fact that by using bitcoded regular expressions and an aggressive
+ simplification method there is a chance that the derivatives
+ can be kept universally small (we established in this paper that
+ they can be kept finite for any string). This is important if one is after
+ an efficient POSIX lexing algorithm.

- Having proved the correctness of the POSIX lexing algorithm in
- [53], which lessons have we learned? Well, we feel this is a
- perfect example where formal proofs give further insight into the
- matter at hand. Our proofs were done both done by hand and
- checked in Isabelle/HOL. The experience of doing our proofs in this
- way has been that the mechanical checking was absolutely essential:
- despite the apparent maturity, this subject area has hidden
- snares.
+ Having proved the correctness of the POSIX lexing algorithm, what
+ lessons have we learned? Well, we feel this is a very good example
+ where formal proofs give further insight into the matter at
+ hand. For example, it is very hard to see a problem with @{text nub}
+ versus @{text distinctBy} from experimental data alone---one would
+ still see the correct result, but find that simplification fails to
+ simplify in well-chosen, but not obscure, examples. We found that
+ from an implementation point-of-view it is really important to have
+ the formal proofs of the corresponding properties at hand. We have
+ also developed a healthy suspicion of experimental data used to back
+ up efficiency claims. For example Sulzmann and Lu write about their
+ equivalent of @{term blexer_simp} ``...we can incrementally compute
+ bitcoded parse trees in linear time in the size of the input''
+ \cite[Page 14]{Sulzmann2014}. Given the growth of the derivatives in
+ some cases even after aggressive simplification, this claim is hard
+ to believe. A similar claim about a theoretical runtime of
+ @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which
+ calculates POSIX matches and is based on derivatives \cite{verbatim}.
+ In this case the derivatives are not simplified. Clearly our result
+ of having finite derivatives is rather weak in this context, but we
+ think such efficiency claims require further scrutiny.\medskip
+
+ \noindent
+ Our Isabelle/HOL code is available under \url{https://github.com/urbanchr/posix}.

 %%\bibliographystyle{plain}

diff -r 14c558ae0b09 -r 5b77220fdf01 thys2/Paper/document/root.bib
--- a/thys2/Paper/document/root.bib Wed Feb 09 11:21:28 2022 +0000
+++ b/thys2/Paper/document/root.bib Wed Feb 09 15:14:58 2022 +0000
@@ -351,3 +351,11 @@
 }

+@INPROCEEDINGS{verbatim,
+  author={D.~Egolf and S.~Lasser and K.~Fisher},
+  booktitle={2021 IEEE Security and Privacy Workshops (SPW)},
+  title={{V}erbatim: {A} {V}erified {L}exer {G}enerator},
+  year={2021},
+  volume={},
+  number={},
+  pages={92--100}}

diff -r 14c558ae0b09 -r 5b77220fdf01 thys2/paper.pdf
Binary file thys2/paper.pdf has changed