# HG changeset patch
# User Christian Urban
# Date 1644419698 0
# Node ID 5b77220fdf01565ae7338282873a3d2e1199cae4
# Parent 14c558ae0b09fb71189716b7042e9828930022d6
updated

diff -r 14c558ae0b09 -r 5b77220fdf01 thys2/Paper/Paper.thy
--- a/thys2/Paper/Paper.thy Wed Feb 09 11:21:28 2022 +0000
+++ b/thys2/Paper/Paper.thy Wed Feb 09 15:14:58 2022 +0000
@@ -173,14 +173,14 @@
 method [..] in combination with the simplification steps [..]
 yields POSIX parse trees. We have tested this claim
 extensively [..] but yet
- have to work out all proof details.''
+ have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
 \end{quote}

 \noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
 the derivative-based lexing algorithm of Sulzmann and Lu
 \cite{Sulzmann2014} where regular expressions are annotated with
 bitsequences. We define the crucial simplification function as a
-recursive function, instead of a fix-point operation. One objective of
+recursive function, without the need for a fix-point operation. One objective of
 the simplification function is to remove duplicates of regular
 expressions. For this Sulzmann and Lu use in their paper the standard
 @{text nub} function from Haskell's list library, but this function
@@ -298,7 +298,7 @@
 \end{proposition}

 \noindent
-It is a fun exercise to formaly prove this property in a theorem prover.
+It is a fun exercise to formally prove this property in a theorem prover.

 The novel idea of Sulzmann and Lu is to extend this algorithm for
 lexing, where it is important to find out which part of the string
@@ -492,7 +492,7 @@

 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
 this algorithm is correct, that is, it generates POSIX values. The
-cenral property we established relates the derivative operation to the
+central property we established relates the derivative operation to the
 injection function.

 \begin{proposition}\label{Posix2}
@@ -513,7 +513,7 @@
 \noindent
 In fact we have shown that in the success case the generated POSIX value $v$ is
 unique and in the failure case that there is no POSIX value $v$ that satisfies
- $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly
+ $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
 slow in cases where the derivatives grow arbitrarily (recall the example from the
 Introduction). However it can be used as a convenient reference point for the correctness
 proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
@@ -526,7 +526,7 @@

 In the second part of their paper \cite{Sulzmann2014},
 Sulzmann and Lu describe another algorithm that also generates POSIX
- values but dispences with the second phase where characters are
+ values but dispenses with the second phase where characters are
 injected ``back'' into values. For this they annotate bitcodes to
 regular expressions, which we define in Isabelle/HOL as the datatype
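In Haskell terms the bitcoded regular expressions can be approximated as
follows. The constructor names mirror the terms @{term AZERO}, @{term "AONE bs"},
@{term "ACHAR bs c"} and @{term "ASEQ bs r\<^sub>1 r\<^sub>2"} used later in
the paper, but the concrete syntax, the modelling of bitsequences as boolean
lists, and the @{text erase} and @{text distinctBy} functions below are only
illustrative sketches, not the Isabelle definitions. The sketch also makes the
@{text nub} problem concrete: @{text nub} compares annotated expressions
directly and therefore misses duplicates that differ only in their bitcodes,
whereas a @{text distinctBy} keyed on erased expressions removes them.

\begin{verbatim}
import Data.List (nub)

type Bits = [Bool]

data ARexp = AZERO
           | AONE  Bits
           | ACHAR Bits Char
           | AALTs Bits [ARexp]
           | ASEQ  Bits ARexp ARexp
           | ASTAR Bits ARexp
  deriving (Eq, Show)

-- forget the bitcodes; zeroing out the annotations plays the role of
-- the paper's erasure for the purpose of duplicate detection
erase :: ARexp -> ARexp
erase AZERO          = AZERO
erase (AONE _)       = AONE []
erase (ACHAR _ c)    = ACHAR [] c
erase (AALTs _ rs)   = AALTs [] (map erase rs)
erase (ASEQ _ r1 r2) = ASEQ [] (erase r1) (erase r2)
erase (ASTAR _ r)    = ASTAR [] (erase r)

-- keep the first occurrence of each f-image; nub is the case f = id
distinctBy :: Eq b => (a -> b) -> [a] -> [a]
distinctBy f = go []
  where go _   []     = []
        go acc (x:xs)
          | f x `elem` acc = go acc xs
          | otherwise      = x : go (f x : acc) xs

main :: IO ()
main = do
  let rs = [ACHAR [True] 'a', ACHAR [False] 'a']
  print (length (nub rs))               -- 2: the bitcodes differ
  print (length (distinctBy erase rs))  -- 1: duplicate modulo bitcodes
\end{verbatim}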
@@ -1007,7 +1007,7 @@
 %is not affected by simplification.

 However there is one problem with the definition for the more
- aggressive simlification rules described by Sulzmann and Lu. Recasting
+ aggressive simplification rules described by Sulzmann and Lu. Recasting
 their definition with our syntax they define the step of removing
 duplicates as
 %
@@ -1092,7 +1092,7 @@
 \end{center}

 \noindent
- With this in place we can define our simlification function as
+ With this in place we can define our simplification function as

 \begin{center}
 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -1164,7 +1164,7 @@
 essentially purports that we can retrieve the same value from a
 simplified version of the regular expression. To start with @{text retrieve}
 depends on the fact that the value @{text v} corresponds to the
- structure of the regular exprssions---but the whole point of simplification
+ structure of the regular expressions---but the whole point of simplification
 is to ``destroy'' this structure by making the regular expression simpler.
 To see this consider the regular expression @{text "r = r' + 0"} and a
 corresponding value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then
@@ -1217,7 +1217,7 @@
 \end{proof}

 \noindent
- From this, we can show that @{text bmkeps} will producte the same bitsequence
+ From this, we can show that @{text bmkeps} will produce the same bitsequence
 as long as one of the bitcoded regular expressions in $\leadsto$ is nullable
 (this lemma establishes the missing fact we were not able to establish
 using @{text retrieve}, as suggested in the paper by Sulzmann and Lu).
@@ -1269,7 +1269,7 @@
 \end{proof}

 \noindent
- With these lemmas in place we can finaly establish that @{term "blexer_simp"} and @{term "blexer"}
+ With these lemmas in place we can finally establish that @{term "blexer_simp"} and @{term "blexer"}
 generate the same value, and using Theorem~\ref{thmone} from the previous
 section that this value is indeed the POSIX value.
@@ -1313,26 +1313,120 @@
 \end{figure}
 *}

-section {* Bound - NO *}
+section {* Finiteness of Derivatives *}
+
+text {*
+
+In this section let us sketch our argument for why the size of the simplified
+derivatives with the aggressive simplification function remains finite. Suppose
+we have a size function for bitcoded regular expressions, written
+@{text "|r|"}, which counts the number of nodes if we regard $r$ as a tree
+(we omit the precise definition). We then show that for every $r$
+there exists a bound $N$
+such that
+
+\begin{center}
+$\forall s. \; |@{term "bders_simp r s"}| < N$
+\end{center}
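+
+\noindent
+To see why such a bound is remarkable at all, recall that without
+simplification the derivatives can grow arbitrarily, for instance for
+$(a + aa)^*$. The following self-contained Haskell experiment---a toy
+sketch with plain, un-bitcoded regular expressions and the standard
+derivative operation, where all names are ours and not the
+formalisation's---prints the sizes of the iterated derivatives, which
+grow quickly:
+
+\begin{verbatim}
+data Rexp = Zero | One | Ch Char
+          | Seq Rexp Rexp | Alt Rexp Rexp | Star Rexp
+  deriving (Eq, Show)
+
+nullable :: Rexp -> Bool
+nullable Zero        = False
+nullable One         = True
+nullable (Ch _)      = False
+nullable (Seq r1 r2) = nullable r1 && nullable r2
+nullable (Alt r1 r2) = nullable r1 || nullable r2
+nullable (Star _)    = True
+
+der :: Char -> Rexp -> Rexp
+der _ Zero        = Zero
+der _ One         = Zero
+der c (Ch d)      = if c == d then One else Zero
+der c (Seq r1 r2)
+  | nullable r1   = Alt (Seq (der c r1) r2) (der c r2)
+  | otherwise     = Seq (der c r1) r2
+der c (Alt r1 r2) = Alt (der c r1) (der c r2)
+der c (Star r)    = Seq (der c r) (Star r)
+
+size :: Rexp -> Int
+size (Seq r1 r2) = 1 + size r1 + size r2
+size (Alt r1 r2) = 1 + size r1 + size r2
+size (Star r)    = 1 + size r
+size _           = 1
+
+-- sizes of the derivatives w.r.t. a, aa, aaa, ...: they keep growing
+main :: IO ()
+main = print [size (foldl (flip der) r (replicate n 'a')) | n <- [0..12]]
+  where r = Star (Alt (Ch 'a') (Seq (Ch 'a') (Ch 'a')))
+\end{verbatim}
+
+\noindent
+The bound claimed for @{term "bders_simp"} rules out this behaviour once
+@{term bsimp} is interleaved after every derivative step.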
+
+\noindent
+We prove this by induction on $r$. The base cases for @{term AZERO},
+@{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is
+for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction
+hypotheses state $\forall s. \; |@{term "bders_simp r\<^sub>1 s"}| < N_1$ and
+$\forall s. \; |@{term "bders_simp r\<^sub>2 s"}| < N_2$. We can reason as follows
+
+\begin{center}
+\begin{tabular}{lcll}
+& & $ |@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}|$\\
+& $ = $ & $|bsimp(ALTs\;bs\;((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) ::
+ [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| $ & (1) \\
+& $\leq$ &
+ $|distinctBy\,(flts\,((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) ::
+ [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (2) \\
+& $\leq$ & $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2| +
+ |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (3) \\
+& $\leq$ & $N_1 + |r_2| + 2 + |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])|$ & (4)\\
+& $\leq$ & $N_1 + |r_2| + 2 + l_{N_{2}} * N_{2}$ & (5)
+\end{tabular}
+\end{center}
+
+% tell Chengsong about Indian paper of closed forms of derivatives
+
+\noindent
+where in (1) $Suf\!fix(s)$ stands for the set of suffixes $s'$ of $s$ for which
+@{term "bders_simp r\<^sub>1 s''"} is nullable, where @{text "s = s'' @ s'"}.
+In (3) we know that $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2|$ is
+bounded by $N_1 + |r_2|$. In (5) we know the list comprehension contains only
+regular expressions of size smaller than $N_2$. The list length after
+@{text distinctBy} is bounded by a number, which we call $l_{N_2}$. It stands
+for the number of distinct regular expressions of size at most $N_2$
+(there can only be finitely many of them).
+We reason similarly in the @{text Star}-case.\medskip
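+
+\noindent
+That $l_{N_2}$ is indeed finite can be seen by enumeration: over a finite
+alphabet there are only finitely many regular expressions of size at most
+$N_2$. The small Haskell enumeration below---again a toy sketch of ours,
+over a one-letter alphabet, with sizes counted as for @{text "|r|"}---computes
+these numbers and also shows how fast they grow, which is one reason why the
+bound obtained in (5) is so loose:
+
+\begin{verbatim}
+data Rexp = Zero | One | Ch
+          | Seq Rexp Rexp | Alt Rexp Rexp | Star Rexp
+  deriving (Eq, Show)
+
+size :: Rexp -> Int
+size (Seq r1 r2) = 1 + size r1 + size r2
+size (Alt r1 r2) = 1 + size r1 + size r2
+size (Star r)    = 1 + size r
+size _           = 1
+
+-- all regular expressions of size <= n (distinct by construction)
+rexps :: Int -> [Rexp]
+rexps n | n < 1 = []
+rexps n = [Zero, One, Ch]
+       ++ [Star r | r <- rexps (n - 1)]
+       ++ [c r1 r2 | c  <- [Seq, Alt]
+                   , r1 <- rexps (n - 2)
+                   , r2 <- rexps (n - 1 - size r1)]
+
+main :: IO ()
+main = print [length (rexps n) | n <- [1..6]]   -- 3, 6, 27, 84, ...
+\end{verbatim}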
+
+\noindent
+Clearly, in step (5) of this finiteness argument we give a very loose bound that is
+far from the actual bound we can expect. We can do better than this, but a tighter
+bound does not improve the finiteness property we are proving. If one is interested
+in a polynomial bound, one would hope to obtain a tight bound similar to the one for
+the partial derivatives introduced by Antimirov \cite{Antimirov95}. After all, the idea behind
+@{text distinctBy} is to maintain a ``set'' of alternatives (like the sets in
+partial derivatives). Unfortunately, obtaining the exact same bound would mean
+we need to introduce simplifications, such as
+%
+\[ (r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)
+\]
+
+\noindent
+which exist for partial derivatives. However, if we introduced them in our
+setting we would lose the POSIX property of our calculated values. We leave better
+bounds for future work.
+
+*}

 section {* Conclusion *}

 text {*

-We set out in this work to implement in Isabelle/HOL the secod lexing
- algorithm by Sulzmann and Lu \cite{Sulzmann2014}. This algorithm is
- interesting because it includes an ``aggressive'' simplification
- function that keeps the sizes of derivatives finite.
+ We set out in this work to prove in Isabelle/HOL the correctness of
+ the second POSIX lexing algorithm by Sulzmann and Lu
+ \cite{Sulzmann2014}. This follows earlier work where we established
+ the correctness of the first algorithm
+ \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
+ introduce our own specification of what POSIX values are,
+ because the informal definition given by Sulzmann and Lu did not
+ stand up to a formal proof. Also for the second algorithm we needed
+ to introduce our own definitions and proof ideas in order to establish the
+ correctness. Our interest in the second algorithm
+ lies in the fact that by using bitcoded regular expressions and an aggressive
+ simplification method there is a chance that the derivatives
+ can be kept universally small (we established in this paper that
+ they can be kept finite for any string). This is important if one is after
+ an efficient POSIX lexing algorithm.

- Having proved the correctness of the POSIX lexing algorithm in
- [53], which lessons have we learned? Well, we feel this is a
- perfect example where formal proofs give further insight into the
- matter at hand. Our proofs were done both done by hand and
- checked in Isabelle/HOL. The experience of doing our proofs in this
- way has been that the mechanical checking was absolutely essential:
- despite the apparent maturity, this subject area has hidden
- snares.
+ Having proved the correctness of the POSIX lexing algorithm, what
+ lessons have we learned? Well, we feel this is a very good example
+ where formal proofs give further insight into the matter at
+ hand. For example, it is very hard to see a problem with @{text nub}
+ versus @{text distinctBy} from experimental data alone---one would
+ still see the correct result, but find that simplification fails to
+ simplify in well-chosen, but not obscure, examples. We found that
+ from an implementation point-of-view it is really important to have
+ the formal proofs of the corresponding properties at hand. We have
+ also developed a healthy suspicion of experimental data used to back
+ up efficiency claims. For example Sulzmann and Lu write about their
+ equivalent of @{term blexer_simp} ``...we can incrementally compute
+ bitcoded parse trees in linear time in the size of the input''
+ \cite[Page 14]{Sulzmann2014}. Given the growth of the derivatives in
+ some cases even after aggressive simplification, this claim is hard
+ to believe. A similar claim about a theoretical runtime of
+ @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which
+ calculates POSIX matches and is based on derivatives \cite{verbatim}.
+ In this case the derivatives are not simplified. Clearly our result
+ of having finite derivatives is rather weak in this context, but we
+ think such efficiency claims require further scrutiny.\medskip
+
+ \noindent
+ Our Isabelle/HOL code is available under \url{https://github.com/urbanchr/posix}.

 %%\bibliographystyle{plain}

diff -r 14c558ae0b09 -r 5b77220fdf01 thys2/Paper/document/root.bib
--- a/thys2/Paper/document/root.bib Wed Feb 09 11:21:28 2022 +0000
+++ b/thys2/Paper/document/root.bib Wed Feb 09 15:14:58 2022 +0000
@@ -351,3 +351,11 @@
 }

+@INPROCEEDINGS{verbatim,
+  author={D.~Egolf and S.~Lasser and K.~Fisher},
+  booktitle={2021 IEEE Security and Privacy Workshops (SPW)},
+  title={{V}erbatim: {A} {V}erified {L}exer {G}enerator},
+  year={2021},
+  volume={},
+  number={},
+  pages={92--100}}

diff -r 14c558ae0b09 -r 5b77220fdf01 thys2/paper.pdf
Binary file thys2/paper.pdf has changed