thys/Paper/Paper.thy
changeset 152 e3eb82ea2244
parent 151 5a1196466a9c
child 154 2de3cf684ba0
equal deleted inserted replaced
151:5a1196466a9c 152:e3eb82ea2244
    81 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
    81 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
    82 derivatives is that they are neatly expressible in any functional language,
    82 derivatives is that they are neatly expressible in any functional language,
    83 and easily definable and reasoned about in theorem provers---the definitions
    83 and easily definable and reasoned about in theorem provers---the definitions
    84 just consist of inductive datatypes and simple recursive functions. A
    84 just consist of inductive datatypes and simple recursive functions. A
    85 completely formalised correctness proof of this matcher in for example HOL4
    85 completely formalised correctness proof of this matcher in for example HOL4
    86 has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is part
    86 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
    87 of the work in \cite{Krauss2011}.
    87 of the work by Krauss and Nipkow \cite{Krauss2011}.
    88 
    88 
    89 One limitation of Brzozowski's matcher is that it only generates a YES/NO
    89 One limitation of Brzozowski's matcher is that it only generates a YES/NO
    90 answer for whether a string is being matched by a regular expression.
    90 answer for whether a string is being matched by a regular expression.
    91 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
    91 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
    92 generation not just of a YES/NO answer but of an actual matching, called a
    92 generation not just of a YES/NO answer but of an actual matching, called a
   104 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
   104 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
   105 GREEDY regular expression matching algorithm. Beginning with our
   105 GREEDY regular expression matching algorithm. Beginning with our
   106 observations that, without evidence that it is transitive, it cannot be
   106 observations that, without evidence that it is transitive, it cannot be
   107 called an ``order relation'', and that the relation is called a ``total
   107 called an ``order relation'', and that the relation is called a ``total
   108 order'' despite being evidently not total\footnote{The relation @{text
   108 order'' despite being evidently not total\footnote{The relation @{text
   109 "\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the
   109 "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} is a relation on the
   110 values for the regular expression @{term r}; but it only holds between
   110 values for the regular expression @{term r}; but it only holds between
   111 @{term v} and @{term "v'"} in cases where @{term v} and @{term "v'"} have
   111 @{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have
   112 the same flattening (underlying string). So a counterexample to totality is
   112 the same flattening (underlying string). So a counterexample to totality is
   113 given by taking two values @{term v} and @{term "v'"} for @{term r} that
   113 given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that
   114 have different flattenings (see Section~\ref{posixsec}). A different
   114 have different flattenings (see Section~\ref{posixsec}). A different
   115 relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
   115 relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
   116 with flattening @{term s} is definable by the same approach, and is indeed
   116 with flattening @{term s} is definable by the same approach, and is indeed
   117 total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, we
   117 total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, we
   118 identify problems with this approach (of which some of the proofs are not
   118 identify problems with this approach (of which some of the proofs are not
   121 being a {\em POSIX value} for a regular expression @{term r} and a string
   121 being a {\em POSIX value} for a regular expression @{term r} and a string
   122 @{term s}; we show that the algorithm computes such a value and that such a
   122 @{term s}; we show that the algorithm computes such a value and that such a
   123 value is unique. Proofs are both done by hand and checked in Isabelle/HOL.
   123 value is unique. Proofs are both done by hand and checked in Isabelle/HOL.
   124 The experience of doing our proofs has been that this mechanical checking
   124 The experience of doing our proofs has been that this mechanical checking
   125 was absolutely essential: this subject area has hidden snares. This was also
   125 was absolutely essential: this subject area has hidden snares. This was also
   126 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
   126 noted by Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching
   127 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
   127 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
   128 
   128 
   129 If a regular expression matches a string, then in general there is more than
   129 If a regular expression matches a string, then in general there is more than
   130 one way of how the string is matched. There are two commonly used
   130 one way of how the string is matched. There are two commonly used
   131 disambiguation strategies to generate a unique answer: one is called GREEDY
   131 disambiguation strategies to generate a unique answer: one is called GREEDY
   148 identifiers, respectively. There are two underlying (informal) rules behind
   148 identifiers, respectively. There are two underlying (informal) rules behind
   149 tokenising a string in a POSIX fashion according to a collection of regular
   149 tokenising a string in a POSIX fashion according to a collection of regular
   150 expressions:
   150 expressions:
   151 
   151 
   152 \begin{itemize} 
   152 \begin{itemize} 
   153 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}
   153 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}\smallskip
   154 
   154 
   155 The longest initial substring matched by any regular expression is taken as
   155 The longest initial substring matched by any regular expression is taken as
   156 next token.\smallskip
   156 next token.\smallskip
   157 
   157 
   158 \item[$\bullet$] \underline{Priority Rule:}
   158 \item[$\bullet$] \underline{Priority Rule:}\smallskip
   159 
   159 
   160 For a particular longest initial substring, the first regular expression
   160 For a particular longest initial substring, the first regular expression
   161 that can match determines the token.
   161 that can match determines the token.
   162 \end{itemize}
   162 \end{itemize}
   163  
   163  
   331 
   331 
   332 section {* POSIX Regular Expression Matching\label{posixsec} *}
   332 section {* POSIX Regular Expression Matching\label{posixsec} *}
   333 
   333 
   334 text {* 
   334 text {* 
   335 
   335 
   336   The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
   336   The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to introduce values for encoding
   337   \emph{how} a regular expression matches a string and then define a
   337   \emph{how} a regular expression matches a string and then define a
   338   function on values that mirrors (but inverts) the construction of the
   338   function on values that mirrors (but inverts) the construction of the
   339   derivative on regular expressions. \emph{Values} are defined as the
   339   derivative on regular expressions. \emph{Values} are defined as the
   340   inductive datatype
   340   inductive datatype
   341 
   341 
   347   @{term "Right v"} $\mid$
   347   @{term "Right v"} $\mid$
   348   @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
   348   @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
   349   @{term "Stars vs"} 
   349   @{term "Stars vs"} 
   350   \end{center}  
   350   \end{center}  
   351 
   351 
   352   \noindent where we use @{term vs} to stand for a list of values. (This is
   352   \noindent where we use @{term vs} to stand for a list of
   353   similar to the approach taken by Frisch and Cardelli for GREEDY matching
   353   values. (This is similar to the approach taken by Frisch and
   354   \cite{Frisch2004}, and Sulzmann and Lu \cite{Sulzmann2014} for POSIX
   354   Cardelli for GREEDY matching \cite{Frisch2004}, and Sulzmann and Lu
   355   matching). The string underlying a value can be calculated by the @{const
   355   for POSIX matching \cite{Sulzmann2014}). The string underlying a
   356   flat} function, written @{term "flat DUMMY"} and defined as:
   356   value can be calculated by the @{const flat} function, written
       
   357   @{term "flat DUMMY"} and defined as:
   357 
   358 
   358   \begin{center}
   359   \begin{center}
   359   \begin{tabular}{lcl}
   360   \begin{tabular}{lcl}
   360   @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
   361   @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
   361   @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
   362   @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
   584   without giving evidence that it is transitive.} between values and argue
   585   without giving evidence that it is transitive.} between values and argue
   585   that there is a maximum value, as given by the derivative-based algorithm.
   586   that there is a maximum value, as given by the derivative-based algorithm.
   586   In contrast, we shall introduce a simple inductive definition that
   587   In contrast, we shall introduce a simple inductive definition that
   587   specifies directly what a \emph{POSIX value} is, incorporating the
   588   specifies directly what a \emph{POSIX value} is, incorporating the
   588   POSIX-specific choices into the side-conditions of our rules. Our
   589   POSIX-specific choices into the side-conditions of our rules. Our
   589   definition is inspired by the matching relation given in
   590   definition is inspired by the matching relation given by Vansummeren
   590   \cite{Vansummeren2006}. The relation we define is ternary and written as
   591   \cite{Vansummeren2006}. The relation we define is ternary and written as
   591   \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings, regular expressions and
   592   \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings, regular expressions and
   592   values.
   593   values.
   593 
   594 
   594   \begin{center}
   595   \begin{center}
  1094 
  1095 
  1095 section {* Conclusion *}
  1096 section {* Conclusion *}
  1096 
  1097 
  1097 text {*
  1098 text {*
  1098 
  1099 
  1099   We have implemented the POSIX value calculation algorithm introduced in
  1100   We have implemented the POSIX value calculation algorithm introduced by
       
  1101   Sulzmann and Lu
  1100   \cite{Sulzmann2014}. Our implementation is nearly identical to the
  1102   \cite{Sulzmann2014}. Our implementation is nearly identical to the
  1101   original and all modifications we introduced are harmless (like our char-clause for
  1103   original and all modifications we introduced are harmless (like our char-clause for
  1102   @{text inj}). We have proved this algorithm to be correct, but correct
  1104   @{text inj}). We have proved this algorithm to be correct, but correct
  1103   according to our own specification of what POSIX values are. Our
  1105   according to our own specification of what POSIX values are. Our
  1104   specification (inspired from work in \cite{Vansummeren2006}) appears to be
  1106   specification (inspired from work by Vansummeren \cite{Vansummeren2006}) appears to be
  1105   much simpler than in \cite{Sulzmann2014} and our proofs are nearly always
  1107   much simpler than in \cite{Sulzmann2014} and our proofs are nearly always
  1106   straightforward. We have attempted to formalise the original proof
  1108   straightforward. We have attempted to formalise the original proof
  1107   by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
  1109   by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
  1108   unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
  1110   unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
  1109   already acknowledge some small problems, but our experience suggests
  1111   already acknowledge some small problems, but our experience suggests
  1127   Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
  1129   Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
  1128   initial substrings, but Nipkow's algorithm is not completely
  1130   initial substrings, but Nipkow's algorithm is not completely
  1129   computational. The algorithm by Sulzmann and Lu, in contrast, can be
  1131   computational. The algorithm by Sulzmann and Lu, in contrast, can be
  1130   implemented with easy in any functional language. A bespoke lexer for the
  1132   implemented with easy in any functional language. A bespoke lexer for the
  1131   Imp-language is formalised in Coq as part of the Software Foundations book
  1133   Imp-language is formalised in Coq as part of the Software Foundations book
  1132   \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
  1134   by Pierce et al\cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
  1133   do not generalise easily to more advanced features.
  1135   do not generalise easily to more advanced features.
  1134   Our formalisation is available from
  1136   Our formalisation is available from
  1135   \url{http://www.inf.kcl.ac.uk/staff/urbanc/lex}.
  1137   \url{http://www.inf.kcl.ac.uk/staff/urbanc/lex}.\medskip
  1136 
  1138 
  1137   %\noindent
  1139   \noindent
  1138   %{\bf Acknowledgements:}
  1140   {\bf Acknowledgements:}
  1139   %We are grateful for the comments we received from anonymous
  1141   We are very grateful to Martin Sulzmann for his comments on our work and 
  1140   %referees.
  1142   also patiently explaining to us the details in \cite{Sulzmann2014}.
  1141   \small
  1143   \small
  1142   \bibliographystyle{plain}
  1144   \bibliographystyle{plain}
  1143   \bibliography{root}
  1145   \bibliography{root}
  1144 
  1146 
  1145 *}
  1147 *}