diff -r fc22ca36325c -r f1d800062d4f thys/Paper/Paper.thy --- a/thys/Paper/Paper.thy Mon May 09 15:01:31 2016 +0100 +++ b/thys/Paper/Paper.thy Wed May 11 12:20:16 2016 +0100 @@ -102,50 +102,6 @@ of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given by Coquand and Siles \cite{Coquand2012}. -One limitation of Brzozowski's matcher is that it only generates a -YES/NO answer for whether a string is being matched by a regular -expression. Sulzmann and Lu \cite{Sulzmann2014} extended this matcher -to allow generation not just of a YES/NO answer but of an actual -matching, called a [lexical] {\em value}. They give a simple algorithm -to calculate a value that appears to be the value associated with -POSIX matching. The challenge then is to specify that value, in an -algorithm-independent fashion, and to show that Sulzmann and Lu's -derivative-based algorithm does indeed calculate a value that is -correct according to the specification. - -The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a -relation (called an ``order relation'') on the set of values of @{term -r}, and to show that (once a string to be matched is chosen) there is -a maximum element and that it is computed by their derivative-based -algorithm. This proof idea is inspired by work of Frisch and Cardelli -\cite{Frisch2004} on a GREEDY regular expression matching -algorithm. However, we were not able to establish transitivity and -totality for the ``order relation'' by Sulzmann and Lu. In Section -\ref{argu} we identify some inherent problems with their approach (of -which some of the proofs are not published in \cite{Sulzmann2014}); -perhaps more importantly, we give a simple inductive (and -algorithm-independent) definition of what we call being a {\em POSIX -value} for a regular expression @{term r} and a string @{term s}; we -show that the algorithm computes such a value and that such a value is -unique. Proofs are both done by hand and checked in Isabelle/HOL. The -experience of doing our proofs has been that this mechanical checking -was absolutely essential: this subject area has hidden snares. This -was also noted by Kuklewicz \cite{Kuklewicz} who found that nearly all -POSIX matching implementations are ``buggy'' \cite[Page -203]{Sulzmann2014}. - -%\footnote{The relation @{text "\\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} -%is a relation on the -%values for the regular expression @{term r}; but it only holds between -%@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have -%the same flattening (underlying string). So a counterexample to totality is -%given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that -%have different flattenings (see Section~\ref{posixsec}). A different -%relation @{text "\\<^bsub>r,s\<^esub>"} on the set of values for @{term r} -%with flattening @{term s} is definable by the same approach, and is indeed -%total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.} - - If a regular expression matches a string, then in general there is more than one way of how the string is matched. 
There are two commonly used disambiguation strategies to generate a unique answer: one is called GREEDY @@ -170,26 +126,76 @@ expressions: \begin{itemize} -\item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}\smallskip - +\item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``maximal munch rule''}): The longest initial substring matched by any regular expression is taken as next token.\smallskip -\item[$\bullet$] \underline{Priority Rule:}\smallskip - +\item[$\bullet$] \emph{Priority Rule:} For a particular longest initial substring, the first regular expression that can match determines the token. \end{itemize} - -\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as -@{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising -identifiers (say, a single character followed by characters or numbers). -Then we can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\"} and use -POSIX matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. -For @{text "iffoo"} we obtain by the longest match rule a single identifier -token, not a keyword followed by an identifier. For @{text "if"} we obtain by -the priority rule a keyword token, not an identifier token---even if @{text -"r\<^bsub>id\<^esub>"} matches also.\bigskip + +\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords +such as @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} +recognising identifiers (say, a single character followed by +characters or numbers). Then we can form the regular expression +@{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\"} and use POSIX matching to tokenise strings, +say @{text "iffoo"} and @{text "if"}. For @{text "iffoo"} we obtain +by the Longest Match Rule a single identifier token, not a keyword +followed by an identifier. For @{text "if"} we obtain by the Priority +Rule a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"} +also matches. + +One limitation of Brzozowski's matcher is that it only generates a +YES/NO answer for whether a string is being matched by a regular +expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher +to allow generation not just of a YES/NO answer but of an actual +matching, called a [lexical] {\em value}. They give a simple algorithm +to calculate a value that appears to be the value associated with +POSIX matching. The challenge then is to specify that value, in an +algorithm-independent fashion, and to show that Sulzmann and Lu's +derivative-based algorithm does indeed calculate a value that is +correct according to the specification. + +The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a +relation (called an ``order relation'') on the set of values of @{term +r}, and to show that (once a string to be matched is chosen) there is +a maximum element and that it is computed by their derivative-based +algorithm. This proof idea is inspired by the work of Frisch and Cardelli +\cite{Frisch2004} on a GREEDY regular expression matching +algorithm. However, we were not able to establish transitivity and +totality for the ``order relation'' of Sulzmann and Lu.
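
\noindent To keep the comparison concrete, the following small Haskell sketch shows one way the matcher referred to above, consisting of repeated derivatives followed by a nullability test, can be written. The datatype and function names (\texttt{Rexp}, \texttt{nullable}, \texttt{deriv}, \texttt{matches}) are ours and the code is not part of the Isabelle formalisation; it is only meant to illustrate that the answer produced is a plain Boolean, not a value.

\begin{verbatim}
data Rexp = Zero | One | Chr Char
          | Alt Rexp Rexp | Seq Rexp Rexp | Star Rexp

-- nullable r holds iff r can match the empty string
nullable :: Rexp -> Bool
nullable Zero        = False
nullable One         = True
nullable (Chr _)     = False
nullable (Alt r1 r2) = nullable r1 || nullable r2
nullable (Seq r1 r2) = nullable r1 && nullable r2
nullable (Star _)    = True

-- deriv c r is the derivative of r with respect to the character c
deriv :: Char -> Rexp -> Rexp
deriv _ Zero        = Zero
deriv _ One         = Zero
deriv c (Chr d)     = if c == d then One else Zero
deriv c (Alt r1 r2) = Alt (deriv c r1) (deriv c r2)
deriv c (Seq r1 r2)
  | nullable r1     = Alt (Seq (deriv c r1) r2) (deriv c r2)
  | otherwise       = Seq (deriv c r1) r2
deriv c (Star r)    = Seq (deriv c r) (Star r)

-- matches: derive by every character of s, then test nullability;
-- the result is only a YES/NO answer
matches :: Rexp -> String -> Bool
matches r s = nullable (foldl (flip deriv) r s)
\end{verbatim}

\noindent Sulzmann and Lu's extension keeps this structure of repeated derivatives, but in addition constructs a value that records \emph{how} the regular expression matched the string.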
In Section +\ref{argu} we identify some inherent problems with their approach (some of +whose proofs are not published in \cite{Sulzmann2014}); +perhaps more importantly, we give a simple inductive (and +algorithm-independent) definition of what we call being a {\em POSIX +value} for a regular expression @{term r} and a string @{term s}; we +show that the algorithm computes such a value and that such a value is +unique. Our proofs are both done by hand and checked in Isabelle/HOL. The +experience of doing our proofs has been that this mechanical checking +was absolutely essential: this subject area has hidden snares. This +was also noted by Kuklewicz \cite{Kuklewicz} who found that nearly all +POSIX matching implementations are ``buggy'' \cite[Page +203]{Sulzmann2014} and by Grathwohl et al.~\cite[Page 36]{CrashCourse2014} +who wrote: + +\begin{quote} +\it{}``The POSIX strategy is more complicated than the greedy because of +the dependence on information about the length of matched strings in the +various subexpressions.'' +\end{quote} + +%\footnote{The relation @{text "\\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} +%is a relation on the +%values for the regular expression @{term r}; but it only holds between +%@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have +%the same flattening (underlying string). So a counterexample to totality is +%given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that +%have different flattenings (see Section~\ref{posixsec}). A different +%relation @{text "\\<^bsub>r,s\<^esub>"} on the set of values for @{term r} +%with flattening @{term s} is definable by the same approach, and is indeed +%total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.} + \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the derivative-based regular expression matching algorithm of @@ -239,10 +245,13 @@ recursive function @{term L} with the clauses: \begin{center} - \begin{tabular}{l@ {\hspace{5mm}}rcl} + \begin{tabular}{l@ {\hspace{3mm}}rcl} (1) & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\ (2) & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\ (3) & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\ + \end{tabular} + \hspace{14mm} + \begin{tabular}{l@ {\hspace{3mm}}rcl} (4) & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ (5) & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ (6) & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ @@ -355,11 +364,11 @@ text {* - The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to introduce values for encoding - \emph{how} a regular expression matches a string and then define a - function on values that mirrors (but inverts) the construction of the - derivative on regular expressions. \emph{Values} are defined as the - inductive datatype + The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to define + values for encoding \emph{how} a regular expression matches a string + and then define a function on values that mirrors (but inverts) the + construction of the derivative on regular expressions.
\emph{Values} + are defined as the inductive datatype \begin{center} @{text "v :="} @@ -379,11 +388,13 @@ @{term "flat DUMMY"} and defined as: \begin{center} - \begin{tabular}{lcl} + \begin{tabular}[t]{lcl} @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\ @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\ @{thm (lhs) flat.simps(3)} & $\dn$ & @{thm (rhs) flat.simps(3)}\\ - @{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)}\\ + @{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)} + \end{tabular}\hspace{14mm} + \begin{tabular}[t]{lcl} @{thm (lhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\ @{thm (lhs) flat.simps(6)} & $\dn$ & @{thm (rhs) flat.simps(6)}\\ @{thm (lhs) flat.simps(7)} & $\dn$ & @{thm (rhs) flat.simps(7)}\\ @@ -395,11 +406,12 @@ \begin{center} \begin{tabular}{c} + \\[-8mm] @{thm[mode=Axiom] Prf.intros(4)} \qquad - @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\smallskip\\ + @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[3mm] @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad - @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\smallskip\\ - @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\smallskip\\ + @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[3mm] + @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[3mm] @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]} \end{tabular}
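
\noindent As an aside, the definitions above can be transcribed almost directly into a functional programming language. The following Haskell sketch is only an illustration: the names (\texttt{Rexp}, \texttt{Val}, \texttt{flat}, \texttt{inhab}) are ours and not part of the formalisation, and \texttt{inhab} is one possible reading of the inhabitation relation as a recursive check.

\begin{verbatim}
-- regular expressions as in the earlier sketch
data Rexp = Zero | One | Chr Char
          | Alt Rexp Rexp | Seq Rexp Rexp | Star Rexp

-- values recording how a regular expression matched a string
data Val = Void | Chrv Char | Seqv Val Val
         | Leftv Val | Rightv Val | Stars [Val]

-- flat v extracts the string underlying the value v
flat :: Val -> String
flat Void             = []
flat (Chrv c)         = [c]
flat (Leftv v)        = flat v
flat (Rightv v)       = flat v
flat (Seqv v1 v2)     = flat v1 ++ flat v2
flat (Stars [])       = []
flat (Stars (v : vs)) = flat v ++ flat (Stars vs)

-- inhab v r reads the inhabitation relation |- v : r as a check
inhab :: Val -> Rexp -> Bool
inhab Void         One         = True
inhab (Chrv c)     (Chr d)     = c == d
inhab (Leftv v)    (Alt r1 _)  = inhab v r1
inhab (Rightv v)   (Alt _ r2)  = inhab v r2
inhab (Seqv v1 v2) (Seq r1 r2) = inhab v1 r1 && inhab v2 r2
inhab (Stars vs)   (Star r)    = all (\v -> inhab v r) vs
inhab _            _           = False
\end{verbatim}

\noindent For example, \texttt{inhab (Seqv (Chrv 'a') Void) (Seq (Chr 'a') One)} evaluates to \texttt{True}, and \texttt{flat} applied to this value yields the string \texttt{"a"}.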