thys2/Paper/Paper.thy
changeset 474 726f4e65c0fe
parent 464 e6248d2c20c2
equal deleted inserted replaced
473:37d14cbce020 474:726f4e65c0fe
    98 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
    98 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
    99 succession) all the characters of the string matches the empty string,
    99 succession) all the characters of the string matches the empty string,
   100 then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
   100 then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
   101 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
   101 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
   102 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
   102 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
   103 of the work by Krauss and Nipkow \cite{Krauss2011}.  And another one
   103 of the work by Krauss and Nipkow~\cite{Krauss2011}.  And another one
   104 in Coq is given by Coquand and Siles \cite{Coquand2012}.
   104 in Coq is given by Coquand and Siles \cite{Coquand2012}.
   105 Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}.
   105 Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}.
   106 
   106 
   107 
   107 
   108 However, there are two difficulties with derivative-based matchers:
   108 However, there are two difficulties with derivative-based matchers:
   159 regular expressions. In this version, POSIX values are
   159 regular expressions. In this version, POSIX values are
   160 represented as bitsequences and such sequences are incrementally generated
   160 represented as bitsequences and such sequences are incrementally generated
   161 when derivatives are calculated. The compact representation
   161 when derivatives are calculated. The compact representation
   162 of bitsequences and regular expressions allows them to define a more
   162 of bitsequences and regular expressions allows them to define a more
   163 ``aggressive'' simplification method that keeps the size of the
   163 ``aggressive'' simplification method that keeps the size of the
   164 derivatives finite no matter what the length of the string is.
   164 derivatives finitely bounded no matter what the length of the string is.
   165 They make some informal claims about the correctness and linear behaviour
   165 They make some informal claims about the correctness and linear behaviour
   166 of this version, but do not provide any supporting proof arguments, not
   166 of this version, but do not provide any supporting proof arguments, not
   167 even ``pencil-and-paper'' arguments. They write about their bitcoded
   167 even ``pencil-and-paper'' arguments. They write about their bitcoded
   168 \emph{incremental parsing method} (that is the algorithm to be formalised
   168 \emph{incremental parsing method} (that is the algorithm to be formalised
   169 in this paper):
   169 in this paper):
   193 We shall also introduce our own argument and definitions for
   193 We shall also introduce our own argument and definitions for
   194 establishing the correctness of the bitcoded algorithm when 
   194 establishing the correctness of the bitcoded algorithm when 
   195 simplifications are included.\medskip
   195 simplifications are included.\medskip
   196 
   196 
   197 \noindent In this paper, we shall first briefly introduce the basic notions
   197 \noindent In this paper, we shall first briefly introduce the basic notions
   198 of regular expressions and describe the basic definitions
   198 of regular expressions and describe the definition
   199 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
   199 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
   200 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
   200 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
   201 the correctness for the bitcoded algorithm without simplification, and
   201 the correctness for the bitcoded algorithm without simplification, and
   202 after that extend the proof to include simplification. 
   202 after that extend the proof to include simplification. 
   203 
   203 
   356   string that satisfies the informal POSIX rules (see
   356   string that satisfies the informal POSIX rules (see
   357   \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).\footnote{POSIX
   357   \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).\footnote{POSIX
   358 	lexing acquired its name from the fact that the corresponding
   358 	lexing acquired its name from the fact that the corresponding
   359 	rules were described as part of the POSIX specification for
   359 	rules were described as part of the POSIX specification for
   360 	Unix-like operating systems \cite{POSIX}.} Sometimes these
   360 	Unix-like operating systems \cite{POSIX}.} Sometimes these
   361   informal rules are called \emph{maximal much rule} and \emph{rule priority}.
   361   informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
   362   One contribution of our earlier paper is to give a convenient
   362   One contribution of our earlier paper is to give a convenient
   363  specification for what POSIX values are (the inductive rules are shown in
   363  specification for what POSIX values are (the inductive rules are shown in
   364   Figure~\ref{POSIXrules}).
   364   Figure~\ref{POSIXrules}).
   365 
   365 
   366 \begin{figure}[t]
   366 \begin{figure}[t]
   510   (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
   510   (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
   511   \end{tabular}
   511   \end{tabular}
   512   \end{proposition}
   512   \end{proposition}
   513 
   513 
   514   \noindent
   514   \noindent
   515   In fact we have shown that in the success case the generated POSIX value $v$ is
   515   In fact we have shown that, in the success case, the generated POSIX value $v$ is
   516   unique and in the failure case that there is no POSIX value $v$ that satisfies
   516   unique and in the failure case that there is no POSIX value $v$ that satisfies
   517   $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
   517   $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
   518   slow in cases where the derivatives grow arbitrarily (recall the example from the
   518   slow in cases where the derivatives grow arbitrarily (recall the example from the
   519   Introduction). However it can be used as a convenient reference point for the correctness
   519   Introduction). However it can be used as a convenient reference point for the correctness
   520   proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
   520   proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
  1164      \]
  1164      \]
  1165 
  1165 
  1166      \noindent do not hold under simplification---this property
  1166      \noindent do not hold under simplification---this property
  1167      essentially purports that we can retrieve the same value from a
  1167      essentially purports that we can retrieve the same value from a
  1168      simplified version of the regular expression. To start with @{text retrieve}
  1168      simplified version of the regular expression. To start with @{text retrieve}
  1169      depends on the fact that the value @{text v} correspond to the
  1169      depends on the fact that the value @{text v} corresponds to the
  1170      structure of the regular expression @{text r}---but the whole point of simplification
  1170      structure of the regular expression @{text r}---but the whole point of simplification
  1171      is to ``destroy'' this structure by making the regular expression simpler.
  1171      is to ``destroy'' this structure by making the regular expression simpler.
  1172      To see this consider the regular expression @{text "r = r' + 0"} and a corresponding
  1172      To see this consider the regular expression @{text "r = r' + 0"} and a corresponding
  1173      value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then 
  1173      value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then 
  1174      we can use @{text retrieve} with @{text r} and @{text v} in order to extract a corresponding
  1174      we can use @{text retrieve} with @{text r} and @{text v} in order to extract a corresponding
  1298   @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad
  1298   @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad
  1299   @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\
  1299   @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\
  1300   @{thm[mode=Axiom] bs6}$A0$\qquad
  1300   @{thm[mode=Axiom] bs6}$A0$\qquad
  1301   @{thm[mode=Axiom] bs7}$A1$\\
  1301   @{thm[mode=Axiom] bs7}$A1$\\
  1302   @{thm[mode=Rule] bs8[of "rs\<^sub>1" "rs\<^sub>2"]}$AL$\\
  1302   @{thm[mode=Rule] bs8[of "rs\<^sub>1" "rs\<^sub>2"]}$AL$\\
  1303   @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LH$\qquad
  1303   @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LT$\qquad
  1304   @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}$LT$\\
  1304   @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}$LH$\\
  1305   @{thm[mode=Axiom] ss4}$L\ZERO$\qquad
  1305   @{thm[mode=Axiom] ss4}$L\ZERO$\qquad
  1306   @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}$LS$\medskip\\
  1306   @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}$LS$\medskip\\
  1307   @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}$LD$\\
  1307   @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}$LD$\\
  1308   \end{tabular}
  1308   \end{tabular}
  1309   \end{center}
  1309   \end{center}
  1398    because the informal definition given by Sulzmann and Lu did not
  1398    because the informal definition given by Sulzmann and Lu did not
  1399    stand up to a formal proof. Also for the second algorithm we needed
  1399    stand up to a formal proof. Also for the second algorithm we needed
  1400    to introduce our own definitions and proof ideas in order to establish the
  1400    to introduce our own definitions and proof ideas in order to establish the
  1401    correctness.  Our interest in the second algorithm 
  1401    correctness.  Our interest in the second algorithm 
  1402    lies in the fact that by using bitcoded regular expressions and an aggressive
  1402    lies in the fact that by using bitcoded regular expressions and an aggressive
  1403    simplification method there is a chance that the the derivatives
  1403    simplification method there is a chance that the derivatives
  1404    can be kept universally small  (we established in this paper that
  1404    can be kept universally small  (we established in this paper that
  1405    they can be kept finite for any string). This is important if one is after
  1405    they can be kept finite for any string). This is important if one is after
  1406    an efficient POSIX lexing algorithm based on derivatives.
  1406    an efficient POSIX lexing algorithm based on derivatives.
  1407 
  1407 
  1408    Having proved the correctness of the POSIX lexing algorithm, which
  1408    Having proved the correctness of the POSIX lexing algorithm, which
  1423    \cite[Page 14]{Sulzmann2014}. 
  1423    \cite[Page 14]{Sulzmann2014}. 
  1424    Given the growth of the
  1424    Given the growth of the
  1425    derivatives in some cases even after aggressive simplification, this
  1425    derivatives in some cases even after aggressive simplification, this
  1426    is a hard to believe claim. A similar claim about a theoretical runtime
  1426    is a hard to believe claim. A similar claim about a theoretical runtime
  1427    of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates
  1427    of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates
  1428    tokens according to POSIX rules~\cite{verbatim}. For this it uses Brzozowski's
  1428    tokens according to POSIX rules~\cite{verbatim}. For this Verbatim uses Brzozowski's
  1429    derivatives like in our work. 
  1429    derivatives like in our work. 
  1430    They write: ``The results of our empirical tests [..] confirm that Verbatim has
  1430    The authors write: ``The results of our empirical tests [..] confirm that Verbatim has
  1431    @{text "O(n\<^sup>2)"} time complexity.'' \cite[Section~VII]{verbatim}.
  1431    @{text "O(n\<^sup>2)"} time complexity.'' \cite[Section~VII]{verbatim}.
  1432    While their correctness proof for Verbatim is formalised in Coq, the claim about
  1432    While their correctness proof for Verbatim is formalised in Coq, the claim about
  1433    the runtime complexity is only supported by some emperical evidence obtained
  1433    the runtime complexity is only supported by some emperical evidence obtained
  1434    by using the code extraction facilities of Coq.
  1434    by using the code extraction facilities of Coq.
  1435    Given our observation with the ``growth problem'' of derivatives,
  1435    Given our observation with the ``growth problem'' of derivatives,