thys/Paper/Paper.thy
changeset 176 f1d800062d4f
parent 175 fc22ca36325c
child 177 e85d10b238d0
equal deleted inserted replaced
175:fc22ca36325c 176:f1d800062d4f
   100 completely formalised correctness proof of this matcher in for example HOL4
   100 completely formalised correctness proof of this matcher in for example HOL4
   101 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
   101 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
   102 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given
   102 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given
   103 by Coquand and Siles \cite{Coquand2012}.
   103 by Coquand and Siles \cite{Coquand2012}.
   104 
   104 
       
   105 If a regular expression matches a string, then in general there is more than
       
   106 one way of how the string is matched. There are two commonly used
       
   107 disambiguation strategies to generate a unique answer: one is called GREEDY
       
   108 matching \cite{Frisch2004} and the other is POSIX
       
   109 matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider
       
   110 the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
       
   111 (ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
       
   112 the single letter-regular expressions @{term x} and @{term y}, or directly
       
   113 in one iteration by @{term xy}. The first case corresponds to GREEDY
       
   114 matching, which first matches with the left-most symbol and only matches the
       
   115 next symbol in case of a mismatch (this is greedy in the sense of preferring
       
   116 instant gratification to delayed repletion). The second case is POSIX
       
   117 matching, which prefers the longest match.
       
   118 
       
   119 In the context of lexing, where an input string needs to be split up into a
       
   120 sequence of tokens, POSIX is the more natural disambiguation strategy for
       
   121 what programmers consider basic syntactic building blocks in their programs.
       
   122 These building blocks are often specified by some regular expressions, say
       
   123 @{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and
       
   124 identifiers, respectively. There are two underlying (informal) rules behind
       
   125 tokenising a string in a POSIX fashion according to a collection of regular
       
   126 expressions:
       
   127 
       
   128 \begin{itemize} 
       
   129 \item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``maximal munch rule''}):
       
   130 The longest initial substring matched by any regular expression is taken as
       
   131 next token.\smallskip
       
   132 
       
   133 \item[$\bullet$] \emph{Priority Rule:}
       
   134 For a particular longest initial substring, the first regular expression
       
   135 that can match determines the token.
       
   136 \end{itemize}
       
   137 
       
   138 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords
       
   139 such as @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"}
       
   140 recognising identifiers (say, a single character followed by
       
   141 characters or numbers).  Then we can form the regular expression
       
   142 @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX matching to tokenise strings,
       
   143 say @{text "iffoo"} and @{text "if"}.  For @{text "iffoo"} we obtain
       
   144 by the Longest Match Rule a single identifier token, not a keyword
       
   145 followed by an identifier. For @{text "if"} we obtain by the Priority
       
   146 Rule a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
       
   147 matches also.
       
   148 
   105 One limitation of Brzozowski's matcher is that it only generates a
   149 One limitation of Brzozowski's matcher is that it only generates a
   106 YES/NO answer for whether a string is being matched by a regular
   150 YES/NO answer for whether a string is being matched by a regular
   107 expression.  Sulzmann and Lu \cite{Sulzmann2014} extended this matcher
   151 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher
   108 to allow generation not just of a YES/NO answer but of an actual
   152 to allow generation not just of a YES/NO answer but of an actual
   109 matching, called a [lexical] {\em value}. They give a simple algorithm
   153 matching, called a [lexical] {\em value}. They give a simple algorithm
   110 to calculate a value that appears to be the value associated with
   154 to calculate a value that appears to be the value associated with
   111 POSIX matching. The challenge then is to specify that value, in an
   155 POSIX matching.  The challenge then is to specify that value, in an
   112 algorithm-independent fashion, and to show that Sulzmann and Lu's
   156 algorithm-independent fashion, and to show that Sulzmann and Lu's
   113 derivative-based algorithm does indeed calculate a value that is
   157 derivative-based algorithm does indeed calculate a value that is
   114 correct according to the specification.
   158 correct according to the specification.
   115 
   159 
   116 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
   160 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
   125 which some of the proofs are not published in \cite{Sulzmann2014});
   169 which some of the proofs are not published in \cite{Sulzmann2014});
   126 perhaps more importantly, we give a simple inductive (and
   170 perhaps more importantly, we give a simple inductive (and
   127 algorithm-independent) definition of what we call being a {\em POSIX
   171 algorithm-independent) definition of what we call being a {\em POSIX
   128 value} for a regular expression @{term r} and a string @{term s}; we
   172 value} for a regular expression @{term r} and a string @{term s}; we
   129 show that the algorithm computes such a value and that such a value is
   173 show that the algorithm computes such a value and that such a value is
   130 unique. Proofs are both done by hand and checked in Isabelle/HOL.  The
   174 unique. Our proofs are both done by hand and checked in Isabelle/HOL.  The
   131 experience of doing our proofs has been that this mechanical checking
   175 experience of doing our proofs has been that this mechanical checking
   132 was absolutely essential: this subject area has hidden snares. This
   176 was absolutely essential: this subject area has hidden snares. This
   133 was also noted by Kuklewicz \cite{Kuklewicz} who found that nearly all
   177 was also noted by Kuklewicz \cite{Kuklewicz} who found that nearly all
   134 POSIX matching implementations are ``buggy'' \cite[Page
   178 POSIX matching implementations are ``buggy'' \cite[Page
   135 203]{Sulzmann2014}.
   179 203]{Sulzmann2014} and by Grathwohl et al \cite[Page 36]{CrashCourse2014}
       
   180 who wrote:
       
   181 
       
   182 \begin{quote}
       
   183 \it{}``The POSIX strategy is more complicated than the greedy because of 
       
   184 the dependence on information about the length of matched strings in the 
       
   185 various subexpressions.''
       
   186 \end{quote}
   136 
   187 
   137 %\footnote{The relation @{text "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} 
   188 %\footnote{The relation @{text "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} 
   138 %is a relation on the
   189 %is a relation on the
   139 %values for the regular expression @{term r}; but it only holds between
   190 %values for the regular expression @{term r}; but it only holds between
   140 %@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have
   191 %@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have
   143 %have different flattenings (see Section~\ref{posixsec}). A different
   194 %have different flattenings (see Section~\ref{posixsec}). A different
   144 %relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
   195 %relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
   145 %with flattening @{term s} is definable by the same approach, and is indeed
   196 %with flattening @{term s} is definable by the same approach, and is indeed
   146 %total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}
   197 %total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}
   147 
   198 
   148 
       
   149 If a regular expression matches a string, then in general there is more than
       
   150 one way of how the string is matched. There are two commonly used
       
   151 disambiguation strategies to generate a unique answer: one is called GREEDY
       
   152 matching \cite{Frisch2004} and the other is POSIX
       
   153 matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider
       
   154 the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
       
   155 (ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
       
   156 the single letter-regular expressions @{term x} and @{term y}, or directly
       
   157 in one iteration by @{term xy}. The first case corresponds to GREEDY
       
   158 matching, which first matches with the left-most symbol and only matches the
       
   159 next symbol in case of a mismatch (this is greedy in the sense of preferring
       
   160 instant gratification to delayed repletion). The second case is POSIX
       
   161 matching, which prefers the longest match.
       
   162 
       
   163 In the context of lexing, where an input string needs to be split up into a
       
   164 sequence of tokens, POSIX is the more natural disambiguation strategy for
       
   165 what programmers consider basic syntactic building blocks in their programs.
       
   166 These building blocks are often specified by some regular expressions, say
       
   167 @{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and
       
   168 identifiers, respectively. There are two underlying (informal) rules behind
       
   169 tokenising a string in a POSIX fashion according to a collection of regular
       
   170 expressions:
       
   171 
       
   172 \begin{itemize} 
       
   173 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}\smallskip
       
   174 
       
   175 The longest initial substring matched by any regular expression is taken as
       
   176 next token.\smallskip
       
   177 
       
   178 \item[$\bullet$] \underline{Priority Rule:}\smallskip
       
   179 
       
   180 For a particular longest initial substring, the first regular expression
       
   181 that can match determines the token.
       
   182 \end{itemize}
       
   183  
       
   184 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
       
   185 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
       
   186 identifiers (say, a single character followed by characters or numbers).
       
   187 Then we can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use
       
   188 POSIX matching to tokenise strings, say @{text "iffoo"} and @{text "if"}.
       
   189 For @{text "iffoo"} we obtain by the longest match rule a single identifier
       
   190 token, not a keyword followed by an identifier. For @{text "if"} we obtain by
       
   191 the priority rule a keyword token, not an identifier token---even if @{text
       
   192 "r\<^bsub>id\<^esub>"} matches also.\bigskip
       
   193 
   199 
   194 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
   200 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
   195 derivative-based regular expression matching algorithm of
   201 derivative-based regular expression matching algorithm of
   196 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
   202 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
   197 algorithm according to our specification of what a POSIX value is. Sulzmann
   203 algorithm according to our specification of what a POSIX value is. Sulzmann
   237   only the empty string and @{term c} for matching a character literal. The
   243   only the empty string and @{term c} for matching a character literal. The
   238   language of a regular expression is also defined as usual by the
   244   language of a regular expression is also defined as usual by the
   239   recursive function @{term L} with the clauses:
   245   recursive function @{term L} with the clauses:
   240 
   246 
   241   \begin{center}
   247   \begin{center}
   242   \begin{tabular}{l@ {\hspace{5mm}}rcl}
   248   \begin{tabular}{l@ {\hspace{3mm}}rcl}
   243   (1) & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
   249   (1) & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
   244   (2) & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
   250   (2) & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
   245   (3) & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
   251   (3) & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
       
   252   \end{tabular}
       
   253   \hspace{14mm}
       
   254   \begin{tabular}{l@ {\hspace{3mm}}rcl}
   246   (4) & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
   255   (4) & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
   247   (5) & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
   256   (5) & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
   248   (6) & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
   257   (6) & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
   249   \end{tabular}
   258   \end{tabular}
   250   \end{center}
   259   \end{center}
   353 
   362 
   354 section {* POSIX Regular Expression Matching\label{posixsec} *}
   363 section {* POSIX Regular Expression Matching\label{posixsec} *}
   355 
   364 
   356 text {* 
   365 text {* 
   357 
   366 
   358   The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to introduce values for encoding
   367   The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to define
   359   \emph{how} a regular expression matches a string and then define a
   368   values for encoding \emph{how} a regular expression matches a string
   360   function on values that mirrors (but inverts) the construction of the
   369   and then define a function on values that mirrors (but inverts) the
   361   derivative on regular expressions. \emph{Values} are defined as the
   370   construction of the derivative on regular expressions. \emph{Values}
   362   inductive datatype
   371   are defined as the inductive datatype
   363 
   372 
   364   \begin{center}
   373   \begin{center}
   365   @{text "v :="}
   374   @{text "v :="}
   366   @{const "Void"} $\mid$
   375   @{const "Void"} $\mid$
   367   @{term "val.Char c"} $\mid$
   376   @{term "val.Char c"} $\mid$
   377   for POSIX matching \cite{Sulzmann2014}). The string underlying a
   386   for POSIX matching \cite{Sulzmann2014}). The string underlying a
   378   value can be calculated by the @{const flat} function, written
   387   value can be calculated by the @{const flat} function, written
   379   @{term "flat DUMMY"} and defined as:
   388   @{term "flat DUMMY"} and defined as:
   380 
   389 
   381   \begin{center}
   390   \begin{center}
   382   \begin{tabular}{lcl}
   391   \begin{tabular}[t]{lcl}
   383   @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
   392   @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
   384   @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
   393   @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
   385   @{thm (lhs) flat.simps(3)} & $\dn$ & @{thm (rhs) flat.simps(3)}\\
   394   @{thm (lhs) flat.simps(3)} & $\dn$ & @{thm (rhs) flat.simps(3)}\\
   386   @{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)}\\
   395   @{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)}
       
   396   \end{tabular}\hspace{14mm}
       
   397   \begin{tabular}[t]{lcl}
   387   @{thm (lhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\
   398   @{thm (lhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\
   388   @{thm (lhs) flat.simps(6)} & $\dn$ & @{thm (rhs) flat.simps(6)}\\
   399   @{thm (lhs) flat.simps(6)} & $\dn$ & @{thm (rhs) flat.simps(6)}\\
   389   @{thm (lhs) flat.simps(7)} & $\dn$ & @{thm (rhs) flat.simps(7)}\\
   400   @{thm (lhs) flat.simps(7)} & $\dn$ & @{thm (rhs) flat.simps(7)}\\
   390   \end{tabular}
   401   \end{tabular}
   391   \end{center}
   402   \end{center}
   393   \noindent Sulzmann and Lu also define inductively an inhabitation relation
   404   \noindent Sulzmann and Lu also define inductively an inhabitation relation
   394   that associates values to regular expressions:
   405   that associates values to regular expressions:
   395 
   406 
   396   \begin{center}
   407   \begin{center}
   397   \begin{tabular}{c}
   408   \begin{tabular}{c}
       
   409   \\[-8mm]
   398   @{thm[mode=Axiom] Prf.intros(4)} \qquad
   410   @{thm[mode=Axiom] Prf.intros(4)} \qquad
   399   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\smallskip\\
   411   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[3mm]
   400   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
   412   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
   401   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\smallskip\\
   413   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[3mm]
   402   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\smallskip\\ 
   414   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[3mm]
   403   @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad  
   415   @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad  
   404   @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}
   416   @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}
   405   \end{tabular}
   417   \end{tabular}
   406   \end{center}
   418   \end{center}
   407 
   419