lexing: comparison thys/Paper/Paper.thy

equal deleted inserted replaced

-:6adda4a667b1
+:73f7dc60c285
 theory Paper
 imports "../ReStar" "~~/src/HOL/Library/LaTeXsugar"
 begin
 declare [[show_question_marks = false]]
+abbreviation
+"der_syn r c \<equiv> der c r"
 notation (latex output)
 If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
 Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
 ZERO ("\<^raw:\textrm{0}>" 78) and
 ONE ("\<^raw:\textrm{1}>" 78) and
 CHAR ("_" [1000] 10) and
-ALT ("_ + _" [1000,1000] 78) and
+ALT ("_ + _" [77,77] 78) and
-SEQ ("_ \<cdot> _" [1000,1000] 78) and
+SEQ ("_ \<cdot> _" [77,77] 78) and
 STAR ("_\<^sup>\<star>" [1000] 78) and
 val.Char ("Char _" [1000] 78) and
 val.Left ("Left _" [1000] 78) and
 val.Right ("Right _" [1000] 78) and
-L ("L _" [1000] 0) and
+L ("L'(_')" [10] 78) and
+der_syn ("_\\_" [1000, 1000] 78) and
 flat ("|_|" [70] 73) and
 Sequ ("_ @ _" [78,77] 63) and
 injval ("inj _ _ _" [1000,77,1000] 77) and
 projval ("proj _ _ _" [1000,77,1000] 77) and
 length ("len _" [78] 73)
 (*>*)
 section {* Introduction *}
 text {*
+Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
+derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
+character~@{text c}, and showed that it gave a simple solution to the
+problem of matching a string @{term s} with a regular expression @{term r}:
+if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
+the string matches the empty string $\mts$, then @{term r} matches @{term s}
+(and {\em vice versa}). The derivative has the property (which may be
+regarded as its specification) that, for every string @{term s} and regular
+expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
+and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
+derivatives is that they are neatly expressible in any functional language, and
+very easy to be defined and reasoned about in a theorem prover---the
+definitions consist just of inductive datatypes and recursive functions. A
+completely formalised proof of this matcher has for example been given in
+\cite{Owens2008}.
+One limitation of Brzozowski's matcher is that it only generates a YES/NO
+answer for a string being matched by a regular expression. Sulzmann and Lu
+\cite{Sulzmann2014} extended this matcher to allow generation not just of a
+YES/NO answer but of an actual matching, called a [lexical] {\em value}.
+They give a still very simple algorithm to calculate a value that appears to
+be the value associated with POSIX lexing (posix) %%\cite{posix}. The
+challenge then is to specify that value, in an algorithm-independent
+fashion, and to show that Sulzamman and Lu's derivative-based algorithm does
+indeed calculate a value that is correct according to the specification.
+Inspired by work of Frisch and Cardelli \cite{Frisch2004} on a GREEDY
+regular expression matching algorithm, the answer given in
+\cite{Sulzmann2014} is to define a relation (called an ``Order Relation'')
+on the set of values of @{term r}, and to show that (once a string to be
+matched is chosen) there is a maximum element and that it is computed by the
+derivative-based algorithm. Beginning with our observations that, without
+evidence that it is transitive, it cannot be called an ``order relation'',
+and that the relation is called a ``total order'' despite being evidently
+not total\footnote{\textcolor{green}{Why is it not total?}}, we identify
+problems with this approach (of which some of the proofs are not published
+in \cite{Sulzmann2014}); perhaps more importantly, we give a simple
+inductive (and algorithm-independent) definition of what we call being a
+{\em POSIX value} for a regular expression @{term r} and a string @{term s};
+we show that the algorithm computes such a value and that such a value is
+unique. Proofs are both done by hand and checked in {\em Isabelle}
+%\cite{isabelle}. Our experience of doing our proofs has been that this
+mechanical checking was absolutely essential: this subject area has hidden
+snares. This was also noted by Kuklewitz \cite{Kuklewicz} who found that
+nearly all POSIX matching implementations are ``buggy'' \cite[Page
+203]{Sulzmann2014}.
+\textcolor{green}{Say something about POSIX lexing}
+An extended version of \cite{Sulzmann2014} is available at the website
+of its first author; this includes some ``proofs'', claimed in
+\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
+final form, we make no comment thereon, preferring to give general
+reasons for our belief that the approach of \cite{Sulzmann2014} is
+problematic rather than to discuss details of unpublished work.
+Derivatives as calculated by Brzozowski's method are usually more
+complex regular expressions than the initial one; various optimisations
+are possible, such as the simplifications of @{term "ALT ZERO r"},
+@{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
+@{term r}. One of the advantages of having a simple specification and
+correctness proof is that the latter can be refined to allow for such
+optimisations and simple correctness proof.
 Sulzmann and Lu \cite{Sulzmann2014}
+\cite{Brzozowski1964}
 there are two commonly used
 disambiguation strategies to create a unique matching tree:
 one is called \emph{greedy} matching \cite{Frisch2004} and the
 other is \emph{POSIX} matching~\cite{Kuklewicz,Sulzmann2014}.
 Isabelle. In fact, the purported proof they outline does not
 work in central places. We discovered this when filling in
 many gaps and attempting to formalise the proof in Isabelle.
+\medskip\noindent
 {\bf Contributions:}
 *}
 section {* Preliminaries *}
 text {* \noindent Strings in Isabelle/HOL are lists of characters with
 the empty string being represented by the empty list, written @{term
-"[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. By
+"[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}.
-using the type char we have a supply of finitely many characters
+Often we use the usual bracket notation for strings; for example a
-roughly corresponding to the ASCII character set.
+string consiting of a single character is written @{term "[c]"}.  By
-Regular exprtessions
+using the type @{type char} for characters we have a supply of
+finitely many characters roughly corresponding to the ASCII
+character set.  Regular expression are as usual and defined as the
+following inductive datatype:
 \begin{center}
 @{text "r :="}
 @{const "ZERO"} $\mid$
 @{const "ONE"} $\mid$
 @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
 @{term "STAR r"}
 \end{center}
+\noindent where @{const ZERO} stands for the regular expression that
+does not macth any string and @{const ONE} for the regular
+expression that matches only the empty string. The language of a regular expression
+is again defined as usual by the following clauses
+\begin{center}
+\begin{tabular}{rcl}
+@{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
+@{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
+@{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
+@{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
+@{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
+@{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
+\end{tabular}
+\end{center}
+\noindent We use the star-notation for regular expressions and sets of strings.
+The Kleene-star on sets is defined inductively.
+\emph{Semantic derivatives} of sets of strings are defined as
+\begin{center}
+\begin{tabular}{lcl}
+@{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
+@{thm (lhs) Ders_def} & $\dn$ & @{thm (rhs) Ders_def}\\
+\end{tabular}
+\end{center}
+\noindent where the second definitions lifts the notion of semantic
+derivatives from characters to strings.
+\noindent
+The nullable function
+\begin{center}
+\begin{tabular}{lcl}
+@{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\
+@{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
+@{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
+@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
+@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
+@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
+\end{tabular}
+\end{center}
+\noindent
+The derivative function for characters and strings
+\begin{center}
+\begin{tabular}{lcp{7.5cm}}
+@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
+@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
+@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
+@{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
+@{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
+@{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\
+@{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\
+@{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\
+\end{tabular}
+\end{center}
+It is a relatively easy exercise to prove that
+\begin{center}
+\begin{tabular}{l}
+@{thm[mode=IfThen] nullable_correctness}\\
+@{thm[mode=IfThen] der_correctness}\\
+\end{tabular}
+\end{center}
 *}
 section {* POSIX Regular Expression Matching *}
 section {* The Argument by Sulzmmann and Lu *}
 @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
 \end{tabular}
 \end{center}
-\noindent
-The nullable function
-\begin{center}
-\begin{tabular}{lcl}
-@{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\
-@{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
-@{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
-@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
-@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
-@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
-\end{tabular}
-\end{center}
-\noindent
-The derivative function for characters and strings
-\begin{center}
-\begin{tabular}{lcp{7.5cm}}
-@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
-@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
-@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
-@{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
-@{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
-@{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\
-@{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\
-@{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\
-\end{tabular}
-\end{center}
 \noindent
 The @{const flat} function for values
 \begin{center}

changeset 108	73f7dc60c285
parent 107	6adda4a667b1
child 109	2c38f10643ae