--- a/thys/Paper/Paper.thy Thu Mar 03 08:22:39 2016 +0000
+++ b/thys/Paper/Paper.thy Sat Mar 05 04:01:10 2016 +0000
@@ -19,23 +19,27 @@
SEQ ("_ \<cdot> _" [77,77] 78) and
STAR ("_\<^sup>\<star>" [1000] 78) and
- val.Void ("'(')" 78) and
- val.Char ("Char _" [1000] 78) and
+ val.Void ("'(')" 79) and
+ val.Char ("Char _" [1000] 79) and
val.Left ("Left _" [1000] 78) and
val.Right ("Right _" [1000] 78) and
L ("L'(_')" [10] 78) and
der_syn ("_\\_" [79, 1000] 76) and
- flat ("|_|" [70] 73) and
+ flat ("|_|" [75] 73) and
Sequ ("_ @ _" [78,77] 63) and
- injval ("inj _ _ _" [1000,77,1000] 77) and
+ injval ("inj _ _ _" [78,77,78] 77) and
projval ("proj _ _ _" [1000,77,1000] 77) and
- length ("len _" [78] 73)
+ length ("len _" [78] 73) and
+
+ Prf ("\<triangleright> _ : _" [75,75] 75) and
+ PMatch ("_ : _ \<rightarrow> _" [75,75,75] 75)
(* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)
(*>*)
section {* Introduction *}
+
text {*
Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
@@ -59,7 +63,7 @@
Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
generation not just of a YES/NO answer but of an actual matching, called a
[lexical] {\em value}. They give a simple algorithm to calculate a value
-that appears to be the value associated with POSIX lexing
+that appears to be the value associated with POSIX matching
\cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that
value, in an algorithm-independent fashion, and to show that Sulzamann and
Lu's derivative-based algorithm does indeed calculate a value that is
@@ -86,17 +90,17 @@
noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
-If a regular expression matches a string, then in general there is more
-than one way of how the string is matched. There are two commonly used
+If a regular expression matches a string, then in general there is more than
+one way of how the string is matched. There are two commonly used
disambiguation strategies to generate a unique answer: one is called GREEDY
matching \cite{Frisch2004} and the other is POSIX
-matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string
-@{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y)
-xy)"}}. Either the string can be matched in two `iterations' by the single
-letter-regular expressions @{term x} and @{term y}, or directly in one
-iteration by @{term xy}. The first case corresponds to GREEDY matching,
-which first matches with the left-most symbol and only matches the next
-symbol in case of a mismatch (this is greedy in the sense of preferring
+matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider
+the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
+(ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
+the single letter-regular expressions @{term x} and @{term y}, or directly
+in one iteration by @{term xy}. The first case corresponds to GREEDY
+matching, which first matches with the left-most symbol and only matches the
+next symbol in case of a mismatch (this is greedy in the sense of preferring
instant gratification to delayed repletion). The second case is POSIX
matching, which prefers the longest match.
@@ -120,38 +124,37 @@
that can match determines the token.
\end{itemize}
-\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising
-keywords such as @{text "if"}, @{text "then"} and so on; and @{text
-"r\<^bsub>id\<^esub>"} recognising identifiers (a single character followed
-by characters or numbers). Then we can form the regular expression @{text
-"(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
+\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
+@{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
+identifiers (a single character followed by characters or numbers). Then we
+can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
first case we obtain by the longest match rule a single identifier token,
-not a keyword followed by identifier. In the second case we obtain by rule
-priority a keyword token, not an identifier token---even if @{text
-"r\<^bsub>id\<^esub>"} matches also.\bigskip
-
-\noindent\textcolor{green}{Not Done Yet}
-
-
-\medskip\noindent
-{\bf Contributions:}
+not a keyword followed by an identifier. In the second case we obtain by rule
+priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
+matches also.\bigskip
-Derivatives as calculated by Brzozowski's method are usually more complex
-regular expressions than the initial one; various optimisations are
-possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
-ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
-advantages of having a simple specification and correctness proof is that
-the latter can be refined to allow for such optimisations and simple
-correctness proof.
+\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
+derivative-based regular expression matching algorithm as described by
+Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
+algorithms according to our specification what a POSIX value is. The
+informal correctness proof given in \cite{Sulzmann2014} is in final
+form\footnote{} and to us contains unfillable gaps. Our specification of a
+POSIX value consists of a simple inductive definition that given a string
+and a regular expression uniquely determines this value. Derivatives as
+calculated by Brzozowski's method are usually more complex regular
+expressions than the initial one; various optimisations are possible, such
+as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r ZERO"}, @{term
+"SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the advantages of
+having a simple specification and correctness proof is that the latter can
+be refined to allow for such optimisations and simple correctness proof.
-An extended version of \cite{Sulzmann2014} is available at the website
-of its first author; this includes some ``proofs'', claimed in
+An extended version of \cite{Sulzmann2014} is available at the website of
+its first author; this includes some ``proofs'', claimed in
\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
-final form, we make no comment thereon, preferring to give general
-reasons for our belief that the approach of \cite{Sulzmann2014} is
-problematic rather than to discuss details of unpublished work.
-
+final form, we make no comment thereon, preferring to give general reasons
+for our belief that the approach of \cite{Sulzmann2014} is problematic
+rather than to discuss details of unpublished work.
*}
@@ -160,11 +163,11 @@
text {* \noindent Strings in Isabelle/HOL are lists of characters with the
empty string being represented by the empty list, written @{term "[]"}, and
list-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usual
-bracket notation for strings; for example a string consisting of just a
-single character is written @{term "[c]"}. By using the type @{type char}
-for characters we have a supply of finitely many characters roughly
-corresponding to the ASCII character set. Regular expressions are defined as
-usual as the following inductive datatype:
+bracket notation for lists also for strings; for example a string consisting
+of just a single character @{term c} is written @{term "[c]"}. By using the
+type @{type char} for characters we have a supply of finitely many
+characters roughly corresponding to the ASCII character set. Regular
+expressions are defined as usual as the following inductive datatype:
\begin{center}
@{text "r :="}
@@ -176,11 +179,11 @@
@{term "STAR r"}
\end{center}
- \noindent where @{const ZERO} stands for the regular expression that
- does not match any string and @{const ONE} for the regular
- expression that matches only the empty string. The language of a regular expression
- is again defined routinely by the recursive function @{term L} with the
- clauses:
+ \noindent where @{const ZERO} stands for the regular expression that does
+ not match any string, @{const ONE} for the regular expression that matches
+ only the empty string and @{term c} for matching a character literal. The
+ language of a regular expression is again defined as usual by the
+ recursive function @{term L} with the clauses:
\begin{center}
\begin{tabular}{rcl}
@@ -195,13 +198,14 @@
\noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the
concatenation of two languages (it is also list-append for strings). We
- use the star-notation for regular expressions and also for languages (in
- the last clause). The star for languages is defined inductively as usual
- by two clauses for the empty string being in the star of a language and if
- @{term "s\<^sub>1"} is in a language and @{term "s\<^sub>2"} in the star of this
- language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in the star of this language.
-
- \emph{Semantic derivatives} of sets of strings are defined as
+ use the star-notation for regular expressions and languages (in the last
+ clause above). The star on languages is defined inductively by two
+ clauses: @{text "(i)"} for the empty string being in the star of a
+ language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term
+ "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in
+ the star of this language. It will also be convenient to use the following
+ notion of a \emph{semantic derivative} (or left quotient) of a language,
+ say @{text A}, defined as:
\begin{center}
\begin{tabular}{lcl}
@@ -209,10 +213,23 @@
\end{tabular}
\end{center}
-
+ \noindent
+ For semantic derivatives we have the following equations (for example
+ \cite{Krauss2011}):
- \noindent
- The nullable function
+ \begin{equation}
+ \begin{array}{lcl}
+ @{thm (lhs) Der_null} & \dn & @{thm (rhs) Der_null}\\
+ \end{array}
+ \end{equation}
+
+
+ \noindent \emph{\Brz's derivatives} of regular expressions
+ \cite{Brzozowski1964} can be easily defined by two recursive functions:
+ the first is from regular expressions to booleans (implementing a test
+ when a regular expression can match the empty string), and the second
+ takes a regular expression and a character to a (derivative) regular
+ expression:
\begin{center}
\begin{tabular}{lcl}
@@ -221,15 +238,7 @@
@{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
- @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
- \end{tabular}
- \end{center}
-
- \noindent
- The derivative function for characters and strings
-
- \begin{center}
- \begin{tabular}{lcp{8cm}}
+ @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
@@ -239,27 +248,78 @@
\end{tabular}
\end{center}
- It is a relatively easy exercise to prove that
+ \noindent
+ Given the equations in ???,
+ it is a relatively easy exercise in mechanical reasoning to establish that
+
+ \begin{proposition}
+ \begin{tabular}{ll}
+ @{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if
+ @{thm (rhs) nullable_correctness} \\
+ @{text "(2)"} & @{thm[mode=IfThen] der_correctness}
+ \end{tabular}
+ \end{proposition}
+
+ \noindent With this in place it is also very routine to prove that the
+ regular expression matcher defined as
\begin{center}
- \begin{tabular}{l}
- @{thm[mode=IfThen] nullable_correctness}\\
- @{thm[mode=IfThen] der_correctness}\\
- \end{tabular}
+
\end{center}
+
+ While the matcher above calculates a provably correct a YES/NO answer for
+ whether a regular expression matches a string, the novel idea of Sulzmann
+ and Lu \cite{Sulzmann2014} is to append another phase to calculate a
+ value.
+
*}
section {* POSIX Regular Expression Matching *}
text {*
-The clever idea in \cite{Sulzmann2014} is to define a function on values that mirrors
-(but inverts) the construction of the derivative on regular expressions. We
-begin with the case of a nullable regular expression: from the nullability
-we need to construct a value that witnesses the nullability. This is as
-follows. The @{const mkeps} function (from \cite{Sulzmann2014}) is a partial (but
-unambiguous) function from regular expressions to values, total on exactly
-the set of nullable regular expressions.
+The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
+\emph{how} a regular expression matches a string and then define a function
+on values that mirrors (but inverts) the construction of the derivative on
+regular expressions. Values are defined as the inductive datatype
+
+ \begin{center}
+ @{text "v :="}
+ @{const "Void"} $\mid$
+ @{term "val.Char c"} $\mid$
+ @{term "Left v"} $\mid$
+ @{term "Right v"} $\mid$
+ @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$
+ @{term "Stars vs"}
+ \end{center}
+
+The inhabitation relation:
+
+ \begin{center}
+ \begin{tabular}{c}
+ @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
+ @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad
+ @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
+ @{thm[mode=Axiom] Prf.intros(4)} \qquad
+ @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
+ @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad
+ @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\
+ \end{tabular}
+ \end{center}
+
+
+Note that no values are associated with the regular expression $Zero$, and
+that the only value associated with the regular expression $One$ is $Void$,
+pronounced (if one must) as {\em ``Void''}. We use $\in$ for the usual
+membership relation from set theory and take $[]$ as the standard name for
+the empty string (rather than, as in \cite{Sulzmann2014}, the regular
+expression that we call $One$).
+
+We begin with the case of a nullable regular expression: from
+the nullability we need to construct a value that witnesses the nullability.
+The @{const mkeps} function (from \cite{Sulzmann2014})
+is a partial (but unambiguous) function from regular expressions to values,
+total on exactly the set of nullable regular expressions.
\begin{center}
\begin{tabular}{lcl}
@@ -323,15 +383,7 @@
\noindent
Values
- \begin{center}
- @{text "v :="}
- @{const "Void"} $\mid$
- @{term "val.Char c"} $\mid$
- @{term "Left v"} $\mid$
- @{term "Right v"} $\mid$
- @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$
- @{term "Stars vs"}
- \end{center}
+