lexing: comparison thys/Paper/Paper.thy

equal deleted inserted replaced

-:698967eceaf1
+:90fe1a1d7d0e
 CHAR ("_" [1000] 80) and
 ALT ("_ + _" [77,77] 78) and
 SEQ ("_ \<cdot> _" [77,77] 78) and
 STAR ("_\<^sup>\<star>" [1000] 78) and
-val.Void ("'(')" 78) and
+val.Void ("'(')" 79) and
-val.Char ("Char _" [1000] 78) and
+val.Char ("Char _" [1000] 79) and
 val.Left ("Left _" [1000] 78) and
 val.Right ("Right _" [1000] 78) and
 L ("L'(_')" [10] 78) and
 der_syn ("_\\_" [79, 1000] 76) and
-flat ("|_|" [70] 73) and
+flat ("|_|" [75] 73) and
 Sequ ("_ @ _" [78,77] 63) and
-injval ("inj _ _ _" [1000,77,1000] 77) and
+injval ("inj _ _ _" [78,77,78] 77) and
 projval ("proj _ _ _" [1000,77,1000] 77) and
-length ("len _" [78] 73)
+length ("len _" [78] 73) and
+Prf ("\<triangleright> _ : _" [75,75] 75) and
+PMatch ("_ : _ \<rightarrow> _" [75,75,75] 75)
 (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)
 (*>*)
 section {* Introduction *}
 text {*
 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
 One limitation of Brzozowski's matcher is that it only generates a YES/NO
 answer for whether a string is being matched by a regular expression.
 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
 generation not just of a YES/NO answer but of an actual matching, called a
 [lexical] {\em value}. They give a simple algorithm to calculate a value
-that appears to be the value associated with POSIX lexing
+that appears to be the value associated with POSIX matching
 \cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that
 value, in an algorithm-independent fashion, and to show that Sulzamann and
 Lu's derivative-based algorithm does indeed calculate a value that is
 correct according to the specification.
 experience of doing our proofs has been that this mechanical checking was
 absolutely essential: this subject area has hidden snares. This was also
 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
-If a regular expression matches a string, then in general there is more
+If a regular expression matches a string, then in general there is more than
-than one way of how the string is matched. There are two commonly used
+one way of how the string is matched. There are two commonly used
 disambiguation strategies to generate a unique answer: one is called GREEDY
 matching \cite{Frisch2004} and the other is POSIX
-matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string
+matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider
-@{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y)
+the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
-xy)"}}. Either the string can be matched in two `iterations' by the single
+(ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
-letter-regular expressions @{term x} and @{term y}, or directly in one
+the single letter-regular expressions @{term x} and @{term y}, or directly
-iteration by @{term xy}. The first case corresponds to GREEDY matching,
+in one iteration by @{term xy}. The first case corresponds to GREEDY
-which first matches with the left-most symbol and only matches the next
+matching, which first matches with the left-most symbol and only matches the
-symbol in case of a mismatch (this is greedy in the sense of preferring
+next symbol in case of a mismatch (this is greedy in the sense of preferring
 instant gratification to delayed repletion). The second case is POSIX
 matching, which prefers the longest match.
 In the context of lexing, where an input string needs to be split up into a
 sequence of tokens, POSIX is the more natural disambiguation strategy for
 For a particular longest initial substring, the first regular expression
 that can match determines the token.
 \end{itemize}
-\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising
+\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
-keywords such as @{text "if"}, @{text "then"} and so on; and @{text
+@{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
-"r\<^bsub>id\<^esub>"} recognising identifiers (a single character followed
+identifiers (a single character followed by characters or numbers). Then we
-by characters or numbers). Then we can form the regular expression @{text
+can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
-"(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
 matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
 first case we obtain by the longest match rule a single identifier token,
-not a keyword followed by identifier. In the second case we obtain by rule
+not a keyword followed by an identifier. In the second case we obtain by rule
-priority a keyword token, not an identifier token---even if @{text
+priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
-"r\<^bsub>id\<^esub>"} matches also.\bigskip
+matches also.\bigskip
-\noindent\textcolor{green}{Not Done Yet}
+\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
+derivative-based regular expression matching algorithm as described by
+Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
-\medskip\noindent
+algorithms according to our specification what a POSIX value is. The
-{\bf Contributions:}
+informal correctness proof given in \cite{Sulzmann2014} is in final
+form\footnote{} and to us contains unfillable gaps. Our specification of a
-Derivatives as calculated by Brzozowski's method are usually more complex
+POSIX value consists of a simple inductive definition that given a string
-regular expressions than the initial one; various optimisations are
+and a regular expression uniquely determines this value. Derivatives as
-possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
+calculated by Brzozowski's method are usually more complex regular
-ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
+expressions than the initial one; various optimisations are possible, such
-advantages of having a simple specification and correctness proof is that
+as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r ZERO"}, @{term
-the latter can be refined to allow for such optimisations and simple
+"SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the advantages of
-correctness proof.
+having a simple specification and correctness proof is that the latter can
+be refined to allow for such optimisations and simple correctness proof.
-An extended version of \cite{Sulzmann2014} is available at the website
-of its first author; this includes some ``proofs'', claimed in
+An extended version of \cite{Sulzmann2014} is available at the website of
+its first author; this includes some ``proofs'', claimed in
 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
-final form, we make no comment thereon, preferring to give general
+final form, we make no comment thereon, preferring to give general reasons
-reasons for our belief that the approach of \cite{Sulzmann2014} is
+for our belief that the approach of \cite{Sulzmann2014} is problematic
-problematic rather than to discuss details of unpublished work.
+rather than to discuss details of unpublished work.
 *}
 section {* Preliminaries *}
 text {* \noindent Strings in Isabelle/HOL are lists of characters with the
 empty string being represented by the empty list, written @{term "[]"}, and
 list-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usual
-bracket notation for strings; for example a string consisting of just a
+bracket notation for lists also for strings; for example a string consisting
-single character is written @{term "[c]"}. By using the type @{type char}
+of just a single character @{term c} is written @{term "[c]"}. By using the
-for characters we have a supply of finitely many characters roughly
+type @{type char} for characters we have a supply of finitely many
-corresponding to the ASCII character set. Regular expressions are defined as
+characters roughly corresponding to the ASCII character set. Regular
-usual as the following inductive datatype:
+expressions are defined as usual as the following inductive datatype:
 \begin{center}
 @{text "r :="}
 @{const "ZERO"} $\mid$
 @{const "ONE"} $\mid$
 @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
 @{term "STAR r"}
 \end{center}
-\noindent where @{const ZERO} stands for the regular expression that
+\noindent where @{const ZERO} stands for the regular expression that does
-does not match any string and @{const ONE} for the regular
+not match any string, @{const ONE} for the regular expression that matches
-expression that matches only the empty string. The language of a regular expression
+only the empty string and @{term c} for matching a character literal. The
-is again defined routinely by the recursive function @{term L} with the
+language of a regular expression is again defined as usual by the
-clauses:
+recursive function @{term L} with the clauses:
 \begin{center}
 \begin{tabular}{rcl}
 @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
 @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
 \end{tabular}
 \end{center}
 \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the
 concatenation of two languages (it is also list-append for strings). We
-use the star-notation for regular expressions and also for languages (in
+use the star-notation for regular expressions and languages (in the last
-the last clause). The star for languages is defined inductively as usual
+clause above). The star on languages is defined inductively by two
-by two clauses for the empty string being in the star of a language and if
+clauses: @{text "(i)"} for the empty string being in the star of a
-@{term "s\<^sub>1"} is in a language and @{term "s\<^sub>2"} in the star of this
+language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term
-language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in the star of this language.
+"s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in
+the star of this language. It will also be convenient to use the following
-\emph{Semantic derivatives} of sets of strings are defined as
+notion of a \emph{semantic derivative} (or left quotient) of a language,
+say @{text A}, defined as:
 \begin{center}
 \begin{tabular}{lcl}
 @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
 \end{tabular}
 \end{center}
+\noindent
+For semantic derivatives we have the following equations (for example
-\noindent
+\cite{Krauss2011}):
-The nullable function
+\begin{equation}
+\begin{array}{lcl}
+@{thm (lhs) Der_null}  & \dn & @{thm (rhs) Der_null}\\
+\end{array}
+\end{equation}
+\noindent \emph{\Brz's derivatives} of regular expressions
+\cite{Brzozowski1964} can be easily defined by two recursive functions:
+the first is from regular expressions to booleans (implementing a test
+when a regular expression can match the empty string), and the second
+takes a regular expression and a character to a (derivative) regular
+expression:
 \begin{center}
 \begin{tabular}{lcl}
 @{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\
 @{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
 @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
 @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
-@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
+@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
-\end{tabular}
-\end{center}
-\noindent
-The derivative function for characters and strings
-\begin{center}
-\begin{tabular}{lcp{8cm}}
 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
 @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
 @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
 @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}
 \end{tabular}
 \end{center}
-It is a relatively easy exercise to prove that
+\noindent
+Given the equations in ???,
-\begin{center}
+it is a relatively easy exercise in mechanical reasoning to establish that
-\begin{tabular}{l}
-@{thm[mode=IfThen] nullable_correctness}\\
+\begin{proposition}
-@{thm[mode=IfThen] der_correctness}\\
+\begin{tabular}{ll}
-\end{tabular}
+@{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if
-\end{center}
+@{thm (rhs) nullable_correctness} \\
+@{text "(2)"} & @{thm[mode=IfThen] der_correctness}
+\end{tabular}
+\end{proposition}
+\noindent With this in place it is also very routine to prove that the
+regular expression matcher defined as
+\begin{center}
+\end{center}
+While the matcher above calculates a provably correct a YES/NO answer for
+whether a regular expression matches a string, the novel idea of Sulzmann
+and Lu \cite{Sulzmann2014} is to append another phase to calculate a
+value.
 *}
 section {* POSIX Regular Expression Matching *}
 text {*
-The clever idea in \cite{Sulzmann2014} is to define a function on values that mirrors
+The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
-(but inverts) the construction of the derivative on regular expressions. We
+\emph{how} a regular expression matches a string and then define a function
-begin with the case of a nullable regular expression: from the nullability
+on values that mirrors (but inverts) the construction of the derivative on
-we need to construct a value that witnesses the nullability. This is as
+regular expressions. Values are defined as the inductive datatype
-follows. The @{const mkeps} function (from \cite{Sulzmann2014}) is a partial (but
-unambiguous) function from regular expressions to values, total on exactly
+\begin{center}
-the set of nullable regular expressions.
+@{text "v :="}
+@{const "Void"} $\mid$
+@{term "val.Char c"} $\mid$
+@{term "Left v"} $\mid$
+@{term "Right v"} $\mid$
+@{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$
+@{term "Stars vs"}
+\end{center}
+The inhabitation relation:
+\begin{center}
+\begin{tabular}{c}
+@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
+@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad
+@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
+@{thm[mode=Axiom] Prf.intros(4)} \qquad
+@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
+@{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad
+@{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\
+\end{tabular}
+\end{center}
+Note that no values are associated with the regular expression $Zero$, and
+that the only value associated with the regular expression $One$ is $Void$,
+pronounced (if one must) as {\em ``Void''}. We use $\in$ for the usual
+membership relation from set theory and take $[]$ as the standard name for
+the empty string (rather than, as in \cite{Sulzmann2014}, the regular
+expression that we call $One$).
+We begin with the case of a nullable regular expression: from
+the nullability we need to construct a value that witnesses the nullability.
+The @{const mkeps} function (from \cite{Sulzmann2014})
+is a partial (but unambiguous) function from regular expressions to values,
+total on exactly the set of nullable regular expressions.
 \begin{center}
 \begin{tabular}{lcl}
 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
 \noindent
 Values
-\begin{center}
-@{text "v :="}
-@{const "Void"} $\mid$
-@{term "val.Char c"} $\mid$
-@{term "Left v"} $\mid$
-@{term "Right v"} $\mid$
-@{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$
-@{term "Stars vs"}
-\end{center}
 \noindent
 The @{const flat} function for values

changeset 113	90fe1a1d7d0e
parent 112	698967eceaf1
child 114	8b41d01b5e5d