lexing: comparison thys/Paper/Paper.thy

equal deleted inserted replaced

-:fc22ca36325c
+:f1d800062d4f
 completely formalised correctness proof of this matcher in for example HOL4
 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given
 by Coquand and Siles \cite{Coquand2012}.
+If a regular expression matches a string, then in general there is more than
+one way of how the string is matched. There are two commonly used
+disambiguation strategies to generate a unique answer: one is called GREEDY
+matching \cite{Frisch2004} and the other is POSIX
+matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider
+the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
+(ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
+the single letter-regular expressions @{term x} and @{term y}, or directly
+in one iteration by @{term xy}. The first case corresponds to GREEDY
+matching, which first matches with the left-most symbol and only matches the
+next symbol in case of a mismatch (this is greedy in the sense of preferring
+instant gratification to delayed repletion). The second case is POSIX
+matching, which prefers the longest match.
+In the context of lexing, where an input string needs to be split up into a
+sequence of tokens, POSIX is the more natural disambiguation strategy for
+what programmers consider basic syntactic building blocks in their programs.
+These building blocks are often specified by some regular expressions, say
+@{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and
+identifiers, respectively. There are two underlying (informal) rules behind
+tokenising a string in a POSIX fashion according to a collection of regular
+expressions:
+\begin{itemize}
+\item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``maximal munch rule''}):
+The longest initial substring matched by any regular expression is taken as
+next token.\smallskip
+\item[$\bullet$] \emph{Priority Rule:}
+For a particular longest initial substring, the first regular expression
+that can match determines the token.
+\end{itemize}
+\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords
+such as @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"}
+recognising identifiers (say, a single character followed by
+characters or numbers).  Then we can form the regular expression
+@{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX matching to tokenise strings,
+say @{text "iffoo"} and @{text "if"}.  For @{text "iffoo"} we obtain
+by the Longest Match Rule a single identifier token, not a keyword
+followed by an identifier. For @{text "if"} we obtain by the Priority
+Rule a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
+matches also.
 One limitation of Brzozowski's matcher is that it only generates a
 YES/NO answer for whether a string is being matched by a regular
-expression.  Sulzmann and Lu \cite{Sulzmann2014} extended this matcher
+expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher
 to allow generation not just of a YES/NO answer but of an actual
 matching, called a [lexical] {\em value}. They give a simple algorithm
 to calculate a value that appears to be the value associated with
-POSIX matching. The challenge then is to specify that value, in an
+POSIX matching.  The challenge then is to specify that value, in an
 algorithm-independent fashion, and to show that Sulzmann and Lu's
 derivative-based algorithm does indeed calculate a value that is
 correct according to the specification.
 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
 which some of the proofs are not published in \cite{Sulzmann2014});
 perhaps more importantly, we give a simple inductive (and
 algorithm-independent) definition of what we call being a {\em POSIX
 value} for a regular expression @{term r} and a string @{term s}; we
 show that the algorithm computes such a value and that such a value is
-unique. Proofs are both done by hand and checked in Isabelle/HOL.  The
+unique. Our proofs are both done by hand and checked in Isabelle/HOL.  The
 experience of doing our proofs has been that this mechanical checking
 was absolutely essential: this subject area has hidden snares. This
 was also noted by Kuklewicz \cite{Kuklewicz} who found that nearly all
 POSIX matching implementations are ``buggy'' \cite[Page
-203]{Sulzmann2014}.
+203]{Sulzmann2014} and by Grathwohl et al \cite[Page 36]{CrashCourse2014}
+who wrote:
+\begin{quote}
+\it{}``The POSIX strategy is more complicated than the greedy because of
+the dependence on information about the length of matched strings in the
+various subexpressions.''
+\end{quote}
 %\footnote{The relation @{text "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014}
 %is a relation on the
 %values for the regular expression @{term r}; but it only holds between
 %@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have
 %have different flattenings (see Section~\ref{posixsec}). A different
 %relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
 %with flattening @{term s} is definable by the same approach, and is indeed
 %total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}
-If a regular expression matches a string, then in general there is more than
-one way of how the string is matched. There are two commonly used
-disambiguation strategies to generate a unique answer: one is called GREEDY
-matching \cite{Frisch2004} and the other is POSIX
-matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider
-the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
-(ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
-the single letter-regular expressions @{term x} and @{term y}, or directly
-in one iteration by @{term xy}. The first case corresponds to GREEDY
-matching, which first matches with the left-most symbol and only matches the
-next symbol in case of a mismatch (this is greedy in the sense of preferring
-instant gratification to delayed repletion). The second case is POSIX
-matching, which prefers the longest match.
-In the context of lexing, where an input string needs to be split up into a
-sequence of tokens, POSIX is the more natural disambiguation strategy for
-what programmers consider basic syntactic building blocks in their programs.
-These building blocks are often specified by some regular expressions, say
-@{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and
-identifiers, respectively. There are two underlying (informal) rules behind
-tokenising a string in a POSIX fashion according to a collection of regular
-expressions:
-\begin{itemize}
-\item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}\smallskip
-The longest initial substring matched by any regular expression is taken as
-next token.\smallskip
-\item[$\bullet$] \underline{Priority Rule:}\smallskip
-For a particular longest initial substring, the first regular expression
-that can match determines the token.
-\end{itemize}
-\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
-@{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
-identifiers (say, a single character followed by characters or numbers).
-Then we can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use
-POSIX matching to tokenise strings, say @{text "iffoo"} and @{text "if"}.
-For @{text "iffoo"} we obtain by the longest match rule a single identifier
-token, not a keyword followed by an identifier. For @{text "if"} we obtain by
-the priority rule a keyword token, not an identifier token---even if @{text
-"r\<^bsub>id\<^esub>"} matches also.\bigskip
 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
 derivative-based regular expression matching algorithm of
 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
 algorithm according to our specification of what a POSIX value is. Sulzmann
 only the empty string and @{term c} for matching a character literal. The
 language of a regular expression is also defined as usual by the
 recursive function @{term L} with the clauses:
 \begin{center}
-\begin{tabular}{l@ {\hspace{5mm}}rcl}
+\begin{tabular}{l@ {\hspace{3mm}}rcl}
 (1) & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
 (2) & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
 (3) & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
+\end{tabular}
+\hspace{14mm}
+\begin{tabular}{l@ {\hspace{3mm}}rcl}
 (4) & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
 (5) & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
 (6) & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
 \end{tabular}
 \end{center}
 section {* POSIX Regular Expression Matching\label{posixsec} *}
 text {*
-The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to introduce values for encoding
+The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to define
-\emph{how} a regular expression matches a string and then define a
+values for encoding \emph{how} a regular expression matches a string
-function on values that mirrors (but inverts) the construction of the
+and then define a function on values that mirrors (but inverts) the
-derivative on regular expressions. \emph{Values} are defined as the
+construction of the derivative on regular expressions. \emph{Values}
-inductive datatype
+are defined as the inductive datatype
 \begin{center}
 @{text "v :="}
 @{const "Void"} $\mid$
 @{term "val.Char c"} $\mid$
 for POSIX matching \cite{Sulzmann2014}). The string underlying a
 value can be calculated by the @{const flat} function, written
 @{term "flat DUMMY"} and defined as:
 \begin{center}
-\begin{tabular}{lcl}
+\begin{tabular}[t]{lcl}
 @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
 @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
 @{thm (lhs) flat.simps(3)} & $\dn$ & @{thm (rhs) flat.simps(3)}\\
-@{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)}\\
+@{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)}
+\end{tabular}\hspace{14mm}
+\begin{tabular}[t]{lcl}
 @{thm (lhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\
 @{thm (lhs) flat.simps(6)} & $\dn$ & @{thm (rhs) flat.simps(6)}\\
 @{thm (lhs) flat.simps(7)} & $\dn$ & @{thm (rhs) flat.simps(7)}\\
 \end{tabular}
 \end{center}
 \noindent Sulzmann and Lu also define inductively an inhabitation relation
 that associates values to regular expressions:
 \begin{center}
 \begin{tabular}{c}
+\\[-8mm]
 @{thm[mode=Axiom] Prf.intros(4)} \qquad
-@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\smallskip\\
+@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[3mm]
 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad
-@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\smallskip\\
+@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[3mm]
-@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\smallskip\\
+@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[3mm]
 @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad
 @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}
 \end{tabular}
 \end{center}

changeset 176	f1d800062d4f
parent 175	fc22ca36325c
child 177	e85d10b238d0