lexing: comparison thys/Paper/Paper.thy

equal deleted inserted replaced

-:73f7dc60c285
+:2c38f10643ae
 "der_syn r c \<equiv> der c r"
 notation (latex output)
 If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
 Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
-ZERO ("\<^raw:\textrm{0}>" 78) and
+ZERO ("\<^bold>0" 80) and
-ONE ("\<^raw:\textrm{1}>" 78) and
+ONE ("\<^bold>1" 80) and
-CHAR ("_" [1000] 10) and
+CHAR ("_" [1000] 80) and
 ALT ("_ + _" [77,77] 78) and
 SEQ ("_ \<cdot> _" [77,77] 78) and
 STAR ("_\<^sup>\<star>" [1000] 78) and
 val.Char ("Char _" [1000] 78) and
 val.Left ("Left _" [1000] 78) and
 val.Right ("Right _" [1000] 78) and
 L ("L'(_')" [10] 78) and
-der_syn ("_\\_" [1000, 1000] 78) and
+der_syn ("_\\_" [79, 1000] 76) and
 flat ("|_|" [70] 73) and
 Sequ ("_ @ _" [78,77] 63) and
 injval ("inj _ _ _" [1000,77,1000] 77) and
 projval ("proj _ _ _" [1000,77,1000] 77) and
 length ("len _" [78] 73)
 the string matches the empty string $\mts$, then @{term r} matches @{term s}
 (and {\em vice versa}). The derivative has the property (which may be
 regarded as its specification) that, for every string @{term s} and regular
 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
-derivatives is that they are neatly expressible in any functional language, and
+derivatives is that they are neatly expressible in any functional language,
-very easy to be defined and reasoned about in a theorem prover---the
+and easily definable and reasoned about in theorem provers---the definitions
-definitions consist just of inductive datatypes and recursive functions. A
+just consist of inductive datatypes and simple recursive functions. A
-completely formalised proof of this matcher has for example been given in
+completely formalised correctness proof of this matcher in for example HOL4
-\cite{Owens2008}.
+has been given in~\cite{Owens2008}.
 One limitation of Brzozowski's matcher is that it only generates a YES/NO
-answer for a string being matched by a regular expression. Sulzmann and Lu
+answer for whether a string is being matched by a regular expression.
-\cite{Sulzmann2014} extended this matcher to allow generation not just of a
+Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
-YES/NO answer but of an actual matching, called a [lexical] {\em value}.
+generation not just of a YES/NO answer but of an actual matching, called a
-They give a still very simple algorithm to calculate a value that appears to
+[lexical] {\em value}. They give a simple algorithm to calculate a value
-be the value associated with POSIX lexing (posix) %%\cite{posix}. The
+that appears to be the value associated with POSIX lexing
-challenge then is to specify that value, in an algorithm-independent
+\cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that
-fashion, and to show that Sulzamman and Lu's derivative-based algorithm does
+value, in an algorithm-independent fashion, and to show that Sulzamann and
-indeed calculate a value that is correct according to the specification.
+Lu's derivative-based algorithm does indeed calculate a value that is
+correct according to the specification.
-Inspired by work of Frisch and Cardelli \cite{Frisch2004} on a GREEDY
-regular expression matching algorithm, the answer given in
+The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
-\cite{Sulzmann2014} is to define a relation (called an ``Order Relation'')
+relation (called an ``Order Relation'') on the set of values of @{term r},
-on the set of values of @{term r}, and to show that (once a string to be
+and to show that (once a string to be matched is chosen) there is a maximum
-matched is chosen) there is a maximum element and that it is computed by the
+element and that it is computed by their derivative-based algorithm. This
-derivative-based algorithm. Beginning with our observations that, without
+proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
-evidence that it is transitive, it cannot be called an ``order relation'',
+GREEDY regular expression matching algorithm. Beginning with our
-and that the relation is called a ``total order'' despite being evidently
+observations that, without evidence that it is transitive, it cannot be
-not total\footnote{\textcolor{green}{Why is it not total?}}, we identify
+called an ``order relation'', and that the relation is called a ``total
-problems with this approach (of which some of the proofs are not published
+order'' despite being evidently not total\footnote{\textcolor{green}{We
-in \cite{Sulzmann2014}); perhaps more importantly, we give a simple
+should give an argument as footnote}}, we identify problems with this
-inductive (and algorithm-independent) definition of what we call being a
+approach (of which some of the proofs are not published in
-{\em POSIX value} for a regular expression @{term r} and a string @{term s};
+\cite{Sulzmann2014}); perhaps more importantly, we give a simple inductive
-we show that the algorithm computes such a value and that such a value is
+(and algorithm-independent) definition of what we call being a {\em POSIX
-unique. Proofs are both done by hand and checked in {\em Isabelle}
+value} for a regular expression @{term r} and a string @{term s}; we show
-%\cite{isabelle}. Our experience of doing our proofs has been that this
+that the algorithm computes such a value and that such a value is unique.
-mechanical checking was absolutely essential: this subject area has hidden
+Proofs are both done by hand and checked in {\em Isabelle/HOL}. The
-snares. This was also noted by Kuklewitz \cite{Kuklewicz} who found that
+experience of doing our proofs has been that this mechanical checking was
-nearly all POSIX matching implementations are ``buggy'' \cite[Page
+absolutely essential: this subject area has hidden snares. This was also
-203]{Sulzmann2014}.
+noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
+implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
-\textcolor{green}{Say something about POSIX lexing}
+If a regular expression matches a string, then in general there are more
+than one possibility of how the string is matched. There are two commonly
+used disambiguation strategies to generate a unique answer: one is called
+GREEDY matching \cite{Frisch2004} and the other is POSIX
+matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string
+@{term xy} and the regular expression @{term "STAR (ALT (ALT x y) xy)"}.
+Either the string can be matched in two `iterations' by the single
+letter-regular expressions @{term x} and @{term y}, or directly in one
+iteration by @{term xy}. The first case corresponds to GREEDY matching,
+which first matches with the left-most symbol and only matches the next
+symbol in case of a mismatch. The second case is POSIX matching, which
+prefers the longest match.
+In the context of lexing, where an input string is separated into a sequence
+of tokens, POSIX is the more natural disambiguation strategy for what
+programmers consider basic syntactic building blocks in their programs.
+There are two underlying rules behind tokenising using POSIX matching:
+\begin{itemize}
+\item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}
+The longest initial substring matched by any regular expression is taken as
+next token.\smallskip
+\item[$\bullet$] \underline{Rule Priority:}
+For a particular longest initial substring, the first regular expression
+that can match determines the token.
+\end{itemize}
+\noindent Consider for example a regular expression @{text
+"r\<^bsub>key\<^esub>"} that can recognise keywords, like @{text "if"},
+@{text "then"} and so on; and another regular expression @{text
+"r\<^bsub>id\<^esub>"} that can recognise identifiers (such as single
+characters followed by characters or numbers). Then we can form the regular
+expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and
+use POSIX matching to, for example, tokenise the strings @{text "iffoo"} and
+@{text "if"}. In the first case we obtain by the longest match rule a
+single identifier token, not a keyword and identifier. In the second case we
+obtain by rule priority a keyword token, not an identifier token.
+\bigskip
+\noindent\textcolor{green}{Not Done Yet}
+\medskip\noindent
+{\bf Contributions:}
+Derivatives as calculated by Brzozowski's method are usually more complex
+regular expressions than the initial one; various optimisations are
+possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
+ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
+advantages of having a simple specification and correctness proof is that
+the latter can be refined to allow for such optimisations and simple
+correctness proof.
 An extended version of \cite{Sulzmann2014} is available at the website
 of its first author; this includes some ``proofs'', claimed in
 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
 final form, we make no comment thereon, preferring to give general
 reasons for our belief that the approach of \cite{Sulzmann2014} is
 problematic rather than to discuss details of unpublished work.
-Derivatives as calculated by Brzozowski's method are usually more
-complex regular expressions than the initial one; various optimisations
-are possible, such as the simplifications of @{term "ALT ZERO r"},
-@{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
-@{term r}. One of the advantages of having a simple specification and
-correctness proof is that the latter can be refined to allow for such
-optimisations and simple correctness proof.
-Sulzmann and Lu \cite{Sulzmann2014}
-\cite{Brzozowski1964}
-there are two commonly used
-disambiguation strategies to create a unique matching tree:
-one is called \emph{greedy} matching \cite{Frisch2004} and the
-other is \emph{POSIX} matching~\cite{Kuklewicz,Sulzmann2014}.
-For the latter there are two rough rules:
-\begin{itemize}
-\item The Longest Match Rule (or ``maximal munch rule''):\smallskip\\ The
-longest initial substring matched by any regular
-expression is taken as next token.
-\item Rule Priority:\smallskip\\ For a particular longest initial
-substring, the first regular expression that can match
-determines the token.
-\end{itemize}
-\noindent In the context of lexing, POSIX is the more
-interesting disambiguation strategy as it produces longest
-matches, which is necessary for tokenising programs. For
-example the string \textit{iffoo} should not match the keyword
-\textit{if} and the rest, but as one string \textit{iffoo},
-which might be a variable name in a program. As another
-example consider the string $xy$ and the regular expression
-\mbox{$(x + y + xy)^*$}. Either the input string can be
-matched in two `iterations' by the single letter-regular
-expressions $x$ and $y$, or directly in one iteration by $xy$.
-The first case corresponds to greedy matching, which first
-matches with the left-most symbol and only matches the next
-symbol in case of a mismatch. The second case is POSIX
-matching, which prefers the longest match. In case more than
-one (longest) matches exist, only then it prefers the
-left-most match. While POSIX matching seems natural, it turns
-out to be much more subtle than greedy matching in terms of
-implementations and in terms of proving properties about it.
-If POSIX matching is implemented using automata, then one has
-to follow transitions (according to the input string) until
-one finds an accepting state, record this state and look for
-further transition which might lead to another accepting state
-that represents a longer input initial substring to be
-matched. Only if none can be found, the last accepting state
-is returned.
-Sulzmann and Lu's paper \cite{Sulzmann2014} targets POSIX
-regular expression matching. They write that it is known to be
-to be a non-trivial problem and nearly all POSIX matching
-implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
-For this they cite a study by Kuklewicz \cite{Kuklewicz}. My
-current work is about formalising the proofs in the paper by
-Sulzmann and Lu. Specifically, they propose in this paper a
-POSIX matching algorithm and give some details of a
-correctness proof for this algorithm inside the paper and some
-more details in an appendix. This correctness proof is
-unformalised, meaning it is just a ``pencil-and-paper'' proof,
-not done in a theorem prover. Though, the paper and presumably
-the proof have been peer-reviewed. Unfortunately their proof
-does not give enough details such that it can be
-straightforwardly implemented in a theorem prover, say
-Isabelle. In fact, the purported proof they outline does not
-work in central places. We discovered this when filling in
-many gaps and attempting to formalise the proof in Isabelle.
-\medskip\noindent
-{\bf Contributions:}
 *}
 section {* Preliminaries *}

changeset 109	2c38f10643ae
parent 108	73f7dc60c285
child 110	267afb7fb700