lexing: comparison thys3/Paper.thy

equal deleted inserted replaced

-:6c13f76c070b
+:9580bae0500d
 \noindent In this paper, we shall first briefly introduce the basic notions
 of regular expressions and describe the definition
 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
 the correctness for the bitcoded algorithm without simplification, and
-after that extend the proof to include simplification.\mbox{}\\[-6mm]
+after that extend the proof to include simplification.
+Our Isabelle code including the results from Sec.~5 is available
+from \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}.
+\mbox{}\\[-6mm]
 *}
 section {* Background *}
 \cite{AusafDyckhoffUrban2016}).
 In our work here we also add to the usual ``basic'' regular
 expressions the \emph{bounded} regular expression @{term "NTIMES r
 n"} where the @{term n} specifies that @{term r} should match
-exactly @{term n}-times. Again for brevity we omit the other bounded
+exactly @{term n}-times (it is not included in Sulzmann and Lu's original work). For brevity we omit the other bounded
 regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$
 and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many
 times @{text r} should match. The results presented in this paper
 extend straightforwardly to them too. The importance of the bounded
 regular expressions is that they are often used in practical
 \cite{CountingSet2020}). Therefore regular expression matching
 libraries that rely on the classic notion of DFAs often impose
 adhoc limits for bounded regular expressions: For example in the
 regular expression matching library in the Go language and also in Google's RE2 library the regular expression
 @{term "NTIMES a 1001"} is not permitted, because no counter can be
-above 1000; and in the built-in regular expression library in Rust
+above 1000; and in the regular expression library in Rust
 expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
 message for being too big. Up until recently,\footnote{up until version 1.5.4 of the regex
 library in Rust; see also CVE-2022-24713.} Rust
-however happily generated an automaton for the regular expression
+however happily generated automata for regular expressions such as
 @{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug
 in the algorithm that decides when a regular expression is acceptable
-or too big according to Rust's classification. We shall come back to
+or too big according to Rust's classification (it did not account for the fact that @{text "a\<^bsup>{0}\<^esup>"} and similar examples can match the empty string). We shall come back to
 this example later in the paper.
 These problems can of course be solved in matching
 algorithms where automata go beyond the classic notion and for
-instance include explicit counters (see for example~\cite{CountingSet2020}).
+instance include explicit counters (e.g.~\cite{CountingSet2020}).
 The point here is that Brzozowski derivatives and the algorithms by
 Sulzmann and Lu can be straightforwardly extended to deal with
 bounded regular expressions and moreover the resulting code
 still consists of only simple recursive functions and inductive
 datatypes. Finally, bounded regular expressions
 the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the
 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
 bitcoded regular expressions.
 %
 \begin{center}
-\begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{6mm}}c@ {}}
+\begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{10mm}}c@ {}}
 \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
 $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
 $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
 $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
 $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\
 we shall show next.
 \begin{figure}[t]
 \begin{center}
 \begin{tabular}{@ {\hspace{-8mm}}c@ {}}
-\\[-7mm]
+\\[-8mm]
 @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad
 @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad
 @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\
 @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad
 @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\
 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
 Note that while Antimirov was able to give a bound on the \emph{size}
 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
 on the \emph{number} of derivatives, provided they are quotient via
-ACI rules \cite{Brzozowski1964}. Brozozowski's result is crucial when one
+ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one
 uses his derivatives for obtaining a DFA (it essentially bounds
 the number of states). However, this result does \emph{not}
 transfer to our setting where we are interested in the \emph{size} of the
 derivatives. For example it is \emph{not} true for our derivatives that the
 set of derivatives $r \backslash s$ for a given $r$ and all strings
 We set out in this work to prove in Isabelle/HOL the correctness of
 the second POSIX lexing algorithm by Sulzmann and Lu
 \cite{Sulzmann2014}. This follows earlier work where we established
 the correctness of the first algorithm
 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
-introduce our own specification about what POSIX values are,
+introduce our own specification for POSIX values,
 because the informal definition given by Sulzmann and Lu did not
 stand up to formal proof. Also for the second algorithm we needed
 to introduce our own definitions and proof ideas in order to
 establish the correctness.  Our interest in the second algorithm
 lies in the fact that by using bitcoded regular expressions and an
 aggressive simplification method there is a chance that the
 derivatives can be kept universally small (we established in this
 paper that for a given $r$ they can be kept finitely bounded for
-all strings).  Our formalisation is approximately 7500 lines of
+\emph{all} strings).  Our formalisation is approximately 7500 lines of
 Isabelle code. A little more than half of this code concerns the finiteness bound
 obtained in Section 5. This slight ``bloat'' in the latter part is because
 we had to introduce an intermediate datatype for annotated regular expressions and repeat many
-definitions for this intermediate datatype. But overall this made our
+definitions for this intermediate datatype. But overall we think this made our
-formalisation work smoother.
+formalisation work smoother. The code of our formalisation can be found at
+\textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}.
 %This is important if one is after
 %an efficient POSIX lexing algorithm based on derivatives.
 Having proved the correctness of the POSIX lexing algorithm, which
 lessons have we learned? Well, we feel this is a very good example
 more than 11 minutes to generate an automaton for this regular
 expression and then to determine that a string of just one(!)
 @{text a} does \emph{not} match this regular expression.  Therefore
 it is probably a wise choice that in newer versions of Rust's
 regular expression library such regular expressions are declared as
-``too big''. While this is clearly
+``too big'' and raise an error message. While this is clearly
 a contrived example, the safety guaranties Rust wants to provide necessitate
 this conservative approach.
 However, with the derivatives and simplifications we have shown
 in this paper, the example can be solved with ease:
 it essentially only involves operations on
 %
 %
 %Possible ideas are
 %zippers which have been used in the context of parsing of
 %context-free grammars \cite{zipperparser}.
-\\[-4mm]  %\smallskip
+%\\[-5mm]  %\smallskip
-\noindent
+%\noindent
-Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
+%Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
-\\[-10mm]
+%%\\[-10mm]
 %%\bibliographystyle{plain}
 \bibliography{root}
 *}

changeset 643	9580bae0500d
parent 642	6c13f76c070b
child 644	9f984ff20020