# HG changeset patch # User Christian Urban # Date 1676936430 0 # Node ID 9580bae0500d4f0a1b5795094890de5bf6d4aa79 # Parent 6c13f76c070bc76e1142913c4f8009f1fcfc6894 updated diff -r 6c13f76c070b -r 9580bae0500d thys3/Paper.thy --- a/thys3/Paper.thy Thu Feb 16 23:23:22 2023 +0000 +++ b/thys3/Paper.thy Mon Feb 20 23:40:30 2023 +0000 @@ -216,7 +216,10 @@ of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove the correctness for the bitcoded algorithm without simplification, and -after that extend the proof to include simplification.\mbox{}\\[-6mm] +after that extend the proof to include simplification. +Our Isabelle code including the results from Sec.~5 is available +from \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}. +\mbox{}\\[-6mm] *} @@ -256,7 +259,7 @@ In our work here we also add to the usual ``basic'' regular expressions the \emph{bounded} regular expression @{term "NTIMES r n"} where the @{term n} specifies that @{term r} should match - exactly @{term n}-times. Again for brevity we omit the other bounded + exactly @{term n}-times (it is not included in Sulzmann and Lu's original work). For brevity we omit the other bounded regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$ and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many times @{text r} should match. The results presented in this paper @@ -278,18 +281,18 @@ adhoc limits for bounded regular expressions: For example in the regular expression matching library in the Go language and also in Google's RE2 library the regular expression @{term "NTIMES a 1001"} is not permitted, because no counter can be - above 1000; and in the built-in regular expression library in Rust + above 1000; and in the regular expression library in Rust expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error message for being too big. Up until recently,\footnote{up until version 1.5.4 of the regex library in Rust; see also CVE-2022-24713.} Rust - however happily generated an automaton for the regular expression + however happily generated automata for regular expressions such as @{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug in the algorithm that decides when a regular expression is acceptable - or too big according to Rust's classification. We shall come back to + or too big according to Rust's classification (it did not account for the fact that @{text "a\<^bsup>{0}\<^esup>"} and similar examples can match the empty string). We shall come back to this example later in the paper. These problems can of course be solved in matching algorithms where automata go beyond the classic notion and for - instance include explicit counters (see for example~\cite{CountingSet2020}). + instance include explicit counters (e.g.~\cite{CountingSet2020}). The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be straightforwardly extended to deal with bounded regular expressions and moreover the resulting code @@ -874,7 +877,7 @@ bitcoded regular expressions. % \begin{center} - \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{6mm}}c@ {}} + \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{10mm}}c@ {}} \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l} $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\ $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\ @@ -1460,7 +1463,7 @@ \begin{figure}[t] \begin{center} \begin{tabular}{@ {\hspace{-8mm}}c@ {}} - \\[-7mm] + \\[-8mm] @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\ @@ -1569,7 +1572,7 @@ Note that while Antimirov was able to give a bound on the \emph{size} of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound on the \emph{number} of derivatives, provided they are quotient via -ACI rules \cite{Brzozowski1964}. Brozozowski's result is crucial when one +ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one uses his derivatives for obtaining a DFA (it essentially bounds the number of states). However, this result does \emph{not} transfer to our setting where we are interested in the \emph{size} of the @@ -1594,7 +1597,7 @@ \cite{Sulzmann2014}. This follows earlier work where we established the correctness of the first algorithm \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to - introduce our own specification about what POSIX values are, + introduce our own specification for POSIX values, because the informal definition given by Sulzmann and Lu did not stand up to formal proof. Also for the second algorithm we needed to introduce our own definitions and proof ideas in order to @@ -1603,12 +1606,13 @@ aggressive simplification method there is a chance that the derivatives can be kept universally small (we established in this paper that for a given $r$ they can be kept finitely bounded for - all strings). Our formalisation is approximately 7500 lines of + \emph{all} strings). Our formalisation is approximately 7500 lines of Isabelle code. A little more than half of this code concerns the finiteness bound obtained in Section 5. This slight ``bloat'' in the latter part is because we had to introduce an intermediate datatype for annotated regular expressions and repeat many - definitions for this intermediate datatype. But overall this made our - formalisation work smoother. + definitions for this intermediate datatype. But overall we think this made our + formalisation work smoother. The code of our formalisation can be found at + \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}. %This is important if one is after %an efficient POSIX lexing algorithm based on derivatives. @@ -1697,7 +1701,7 @@ @{text a} does \emph{not} match this regular expression. Therefore it is probably a wise choice that in newer versions of Rust's regular expression library such regular expressions are declared as - ``too big''. While this is clearly + ``too big'' and raise an error message. While this is clearly a contrived example, the safety guaranties Rust wants to provide necessitate this conservative approach. However, with the derivatives and simplifications we have shown @@ -1730,11 +1734,11 @@ %Possible ideas are %zippers which have been used in the context of parsing of %context-free grammars \cite{zipperparser}. - \\[-4mm] %\smallskip + %\\[-5mm] %\smallskip - \noindent - Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}. - \\[-10mm] + %\noindent + %Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}. + %%\\[-10mm] %%\bibliographystyle{plain} diff -r 6c13f76c070b -r 9580bae0500d thys3/document/root.tex --- a/thys3/document/root.tex Thu Feb 16 23:23:22 2023 +0000 +++ b/thys3/document/root.tex Mon Feb 20 23:40:30 2023 +0000 @@ -53,6 +53,7 @@ \newcommand{\ZERO}{\mbox{\bf 0}} \newcommand{\ONE}{\mbox{\bf 1}} \def\rs{\mathit{rs}} +\definecolor{darkblue}{rgb}{0,0,0.6} \def\Brz{Brzozowski} \def\der{\backslash}