thys3/Paper.thy
changeset 643 9580bae0500d
parent 642 6c13f76c070b
child 644 9f984ff20020
equal deleted inserted replaced
642:6c13f76c070b 643:9580bae0500d
   214 \noindent In this paper, we shall first briefly introduce the basic notions
   214 \noindent In this paper, we shall first briefly introduce the basic notions
   215 of regular expressions and describe the definition
   215 of regular expressions and describe the definition
   216 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
   216 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
   217 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
   217 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
   218 the correctness for the bitcoded algorithm without simplification, and
   218 the correctness for the bitcoded algorithm without simplification, and
   219 after that extend the proof to include simplification.\mbox{}\\[-6mm]
   219 after that extend the proof to include simplification.
       
   220 Our Isabelle code including the results from Sec.~5 is available
       
   221 from \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}.
       
   222 \mbox{}\\[-6mm]
   220 
   223 
   221 *}
   224 *}
   222 
   225 
   223 section {* Background *}
   226 section {* Background *}
   224 
   227 
   254   \cite{AusafDyckhoffUrban2016}).
   257   \cite{AusafDyckhoffUrban2016}).
   255   
   258   
   256   In our work here we also add to the usual ``basic'' regular
   259   In our work here we also add to the usual ``basic'' regular
   257   expressions the \emph{bounded} regular expression @{term "NTIMES r
   260   expressions the \emph{bounded} regular expression @{term "NTIMES r
   258   n"} where the @{term n} specifies that @{term r} should match
   261   n"} where the @{term n} specifies that @{term r} should match
   259   exactly @{term n}-times. Again for brevity we omit the other bounded
   262   exactly @{term n}-times (it is not included in Sulzmann and Lu's original work). For brevity we omit the other bounded
   260   regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$
   263   regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$
   261   and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many
   264   and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many
   262   times @{text r} should match. The results presented in this paper
   265   times @{text r} should match. The results presented in this paper
   263   extend straightforwardly to them too. The importance of the bounded
   266   extend straightforwardly to them too. The importance of the bounded
   264   regular expressions is that they are often used in practical
   267   regular expressions is that they are often used in practical
   276   \cite{CountingSet2020}). Therefore regular expression matching
   279   \cite{CountingSet2020}). Therefore regular expression matching
   277   libraries that rely on the classic notion of DFAs often impose 
   280   libraries that rely on the classic notion of DFAs often impose 
   278   adhoc limits for bounded regular expressions: For example in the
   281   adhoc limits for bounded regular expressions: For example in the
   279   regular expression matching library in the Go language and also in Google's RE2 library the regular expression
   282   regular expression matching library in the Go language and also in Google's RE2 library the regular expression
   280   @{term "NTIMES a 1001"} is not permitted, because no counter can be
   283   @{term "NTIMES a 1001"} is not permitted, because no counter can be
   281   above 1000; and in the built-in regular expression library in Rust
   284   above 1000; and in the regular expression library in Rust
   282   expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
   285   expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
   283   message for being too big. Up until recently,\footnote{up until version 1.5.4 of the regex
   286   message for being too big. Up until recently,\footnote{up until version 1.5.4 of the regex
   284   library in Rust; see also CVE-2022-24713.} Rust
   287   library in Rust; see also CVE-2022-24713.} Rust
   285   however happily generated an automaton for the regular expression
   288   however happily generated automata for regular expressions such as 
   286   @{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug
   289   @{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug
   287   in the algorithm that decides when a regular expression is acceptable
   290   in the algorithm that decides when a regular expression is acceptable
   288   or too big according to Rust's classification. We shall come back to
   291   or too big according to Rust's classification (it did not account for the fact that @{text "a\<^bsup>{0}\<^esup>"} and similar examples can match the empty string). We shall come back to
   289   this example later in the paper. 
   292   this example later in the paper. 
   290   These problems can of course be solved in matching
   293   These problems can of course be solved in matching
   291   algorithms where automata go beyond the classic notion and for
   294   algorithms where automata go beyond the classic notion and for
   292   instance include explicit counters (see for example~\cite{CountingSet2020}).
   295   instance include explicit counters (e.g.~\cite{CountingSet2020}).
   293   The point here is that Brzozowski derivatives and the algorithms by
   296   The point here is that Brzozowski derivatives and the algorithms by
   294   Sulzmann and Lu can be straightforwardly extended to deal with
   297   Sulzmann and Lu can be straightforwardly extended to deal with
   295   bounded regular expressions and moreover the resulting code
   298   bounded regular expressions and moreover the resulting code
   296   still consists of only simple recursive functions and inductive
   299   still consists of only simple recursive functions and inductive
   297   datatypes. Finally, bounded regular expressions 
   300   datatypes. Finally, bounded regular expressions 
   872   the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the
   875   the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the
   873   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
   876   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
   874   bitcoded regular expressions.
   877   bitcoded regular expressions.
   875   %
   878   %
   876   \begin{center}
   879   \begin{center}
   877   \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{6mm}}c@ {}}
   880   \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{10mm}}c@ {}}
   878   \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
   881   \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
   879   $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
   882   $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
   880   $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
   883   $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
   881   $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
   884   $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
   882   $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\
   885   $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\
  1458      we shall show next.
  1461      we shall show next.
  1459 
  1462 
  1460    \begin{figure}[t]
  1463    \begin{figure}[t]
  1461   \begin{center}
  1464   \begin{center}
  1462   \begin{tabular}{@ {\hspace{-8mm}}c@ {}}
  1465   \begin{tabular}{@ {\hspace{-8mm}}c@ {}}
  1463   \\[-7mm]
  1466   \\[-8mm]
  1464   @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad
  1467   @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad
  1465   @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad
  1468   @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad
  1466   @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\
  1469   @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\
  1467   @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad
  1470   @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad
  1468   @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\
  1471   @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\
  1567 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
  1570 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
  1568 
  1571 
  1569 Note that while Antimirov was able to give a bound on the \emph{size}
  1572 Note that while Antimirov was able to give a bound on the \emph{size}
  1570 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
  1573 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
  1571 on the \emph{number} of derivatives, provided they are quotient via
  1574 on the \emph{number} of derivatives, provided they are quotient via
  1572 ACI rules \cite{Brzozowski1964}. Brozozowski's result is crucial when one
  1575 ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one
  1573 uses his derivatives for obtaining a DFA (it essentially bounds
  1576 uses his derivatives for obtaining a DFA (it essentially bounds
  1574 the number of states). However, this result does \emph{not}
  1577 the number of states). However, this result does \emph{not}
  1575 transfer to our setting where we are interested in the \emph{size} of the
  1578 transfer to our setting where we are interested in the \emph{size} of the
  1576 derivatives. For example it is \emph{not} true for our derivatives that the
  1579 derivatives. For example it is \emph{not} true for our derivatives that the
  1577 set of derivatives $r \backslash s$ for a given $r$ and all strings
  1580 set of derivatives $r \backslash s$ for a given $r$ and all strings
  1592    We set out in this work to prove in Isabelle/HOL the correctness of
  1595    We set out in this work to prove in Isabelle/HOL the correctness of
  1593    the second POSIX lexing algorithm by Sulzmann and Lu
  1596    the second POSIX lexing algorithm by Sulzmann and Lu
  1594    \cite{Sulzmann2014}. This follows earlier work where we established
  1597    \cite{Sulzmann2014}. This follows earlier work where we established
  1595    the correctness of the first algorithm
  1598    the correctness of the first algorithm
  1596    \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
  1599    \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
  1597    introduce our own specification about what POSIX values are,
  1600    introduce our own specification for POSIX values,
  1598    because the informal definition given by Sulzmann and Lu did not
  1601    because the informal definition given by Sulzmann and Lu did not
  1599    stand up to formal proof. Also for the second algorithm we needed
  1602    stand up to formal proof. Also for the second algorithm we needed
  1600    to introduce our own definitions and proof ideas in order to
  1603    to introduce our own definitions and proof ideas in order to
  1601    establish the correctness.  Our interest in the second algorithm
  1604    establish the correctness.  Our interest in the second algorithm
  1602    lies in the fact that by using bitcoded regular expressions and an
  1605    lies in the fact that by using bitcoded regular expressions and an
  1603    aggressive simplification method there is a chance that the
  1606    aggressive simplification method there is a chance that the
  1604    derivatives can be kept universally small (we established in this
  1607    derivatives can be kept universally small (we established in this
  1605    paper that for a given $r$ they can be kept finitely bounded for
  1608    paper that for a given $r$ they can be kept finitely bounded for
  1606    all strings).  Our formalisation is approximately 7500 lines of
  1609    \emph{all} strings).  Our formalisation is approximately 7500 lines of
  1607    Isabelle code. A little more than half of this code concerns the finiteness bound
  1610    Isabelle code. A little more than half of this code concerns the finiteness bound
  1608    obtained in Section 5. This slight ``bloat'' in the latter part is because
  1611    obtained in Section 5. This slight ``bloat'' in the latter part is because
  1609    we had to introduce an intermediate datatype for annotated regular expressions and repeat many
  1612    we had to introduce an intermediate datatype for annotated regular expressions and repeat many
  1610    definitions for this intermediate datatype. But overall this made our
  1613    definitions for this intermediate datatype. But overall we think this made our
  1611    formalisation work smoother.
  1614    formalisation work smoother. The code of our formalisation can be found at
       
  1615    \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}.
  1612    %This is important if one is after
  1616    %This is important if one is after
  1613    %an efficient POSIX lexing algorithm based on derivatives.
  1617    %an efficient POSIX lexing algorithm based on derivatives.
  1614 
  1618 
  1615    Having proved the correctness of the POSIX lexing algorithm, which
  1619    Having proved the correctness of the POSIX lexing algorithm, which
  1616    lessons have we learned? Well, we feel this is a very good example
  1620    lessons have we learned? Well, we feel this is a very good example
  1695    more than 11 minutes to generate an automaton for this regular
  1699    more than 11 minutes to generate an automaton for this regular
  1696    expression and then to determine that a string of just one(!)
  1700    expression and then to determine that a string of just one(!)
  1697    @{text a} does \emph{not} match this regular expression.  Therefore
  1701    @{text a} does \emph{not} match this regular expression.  Therefore
  1698    it is probably a wise choice that in newer versions of Rust's
  1702    it is probably a wise choice that in newer versions of Rust's
  1699    regular expression library such regular expressions are declared as
  1703    regular expression library such regular expressions are declared as
  1700    ``too big''. While this is clearly
  1704    ``too big'' and raise an error message. While this is clearly
  1701    a contrived example, the safety guaranties Rust wants to provide necessitate
  1705    a contrived example, the safety guaranties Rust wants to provide necessitate
  1702    this conservative approach.
  1706    this conservative approach.
  1703    However, with the derivatives and simplifications we have shown
  1707    However, with the derivatives and simplifications we have shown
  1704    in this paper, the example can be solved with ease:
  1708    in this paper, the example can be solved with ease:
  1705    it essentially only involves operations on
  1709    it essentially only involves operations on
  1728    %
  1732    %
  1729    %
  1733    %
  1730    %Possible ideas are
  1734    %Possible ideas are
  1731    %zippers which have been used in the context of parsing of
  1735    %zippers which have been used in the context of parsing of
  1732    %context-free grammars \cite{zipperparser}.
  1736    %context-free grammars \cite{zipperparser}.
  1733    \\[-4mm]  %\smallskip
  1737    %\\[-5mm]  %\smallskip
  1734    
  1738    
  1735    \noindent
  1739    %\noindent
  1736    Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
  1740    %Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
  1737    \\[-10mm]
  1741    %%\\[-10mm]
  1738 
  1742 
  1739 
  1743 
  1740 %%\bibliographystyle{plain}
  1744 %%\bibliographystyle{plain}
  1741 \bibliography{root}
  1745 \bibliography{root}
  1742 *}
  1746 *}