lexing: comparison thys/Journal/Paper.thy

equal deleted inserted replaced

-:fff2e1b40dfc
+:32b222d77fa0
 declare [[show_question_marks = false]]
+syntax (latex output)
+"_Collect" :: "pttrn => bool => 'a set"              ("(1{_ \<^raw:\mbox{\boldmath$\mid$}> _})")
+"_CollectIn" :: "pttrn => 'a set => bool => 'a set"   ("(1{_ \<in> _ |e _})")
 abbreviation
 "der_syn r c \<equiv> der c r"
 abbreviation
 "ders_syn r s \<equiv> ders s r"
 abbreviation
 "nprec v1 v2 \<equiv> \<not>(v1 :\<sqsubset>val v2)"
 notation (latex output)
 If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
 Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,73] 73) and
 ZERO ("\<^bold>0" 78) and
-ONE ("\<^bold>1" 78) and
+ONE ("\<^bold>1" 1000) and
 CHAR ("_" [1000] 80) and
 ALT ("_ + _" [77,77] 78) and
 SEQ ("_ \<cdot> _" [77,77] 78) and
 STAR ("_\<^sup>\<star>" [1000] 78) and
 Sequ ("_ @ _" [78,77] 63) and
 injval ("inj _ _ _" [79,77,79] 76) and
 mkeps ("mkeps _" [79] 76) and
 length ("len _" [73] 73) and
 intlen ("len _" [73] 73) and
+set ("_" [73] 73) and
 Prf ("_ : _" [75,75] 75) and
+CPrf ("_ \<^raw:\mbox{\textbf{\textlengthmark}}> _" [75,75] 75) and
 Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
 lexer ("lexer _ _" [78,78] 77) and
 F_RIGHT ("F\<^bsub>Right\<^esub> _") and
 F_LEFT ("F\<^bsub>Left\<^esub> _") and
 *)
 definition
 "match r s \<equiv> nullable (ders s r)"
+lemma CV_STAR_ONE_empty:
+shows "CV (STAR ONE) [] = {Stars []}"
+by(auto simp add: CV_def elim: CPrf.cases intro: CPrf.intros)
 (*
 comments not implemented
 p9. The condtion "not exists s3 s4..." appears often enough (in particular in
 the proof of Lemma 3) to warrant a definition.
 *)
 (*>*)
 section {* Introduction *}
 text {*
 \item[$\bullet$] \emph{Star Rule:} A subexpression repeated by ${}^\star$ shall
 not match an empty string unless this is the only match for the repetition.\smallskip
 \item[$\bullet$] \emph{Empty String Rule:} An empty string shall be considered to
-be longer than no match at all.
+be longer than no match at all.\marginpar{Explain its purpose}
 \end{itemize}
 \noindent Consider for example a regular expression @{text "r\<^bsub>key\<^esub>"} for recognising keywords
 such as @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"}
 recognising identifiers (say, a single character followed by
 @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX matching to tokenise strings,
 say @{text "iffoo"} and @{text "if"}.  For @{text "iffoo"} we obtain
 by the Longest Match Rule a single identifier token, not a keyword
 followed by an identifier. For @{text "if"} we obtain by the Priority
 Rule a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
-matches also.
+matches also. By the Star Rule we know @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} matches @{text "iffoo"}, respectively @{text "if"}, in exactly one
+`iteration' of the star.
 One limitation of Brzozowski's matcher is that it only generates a
 YES/NO answer for whether a string is being matched by a regular
 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher
 to allow generation not just of a YES/NO answer but of an actual
 r}, and to show that (once a string to be matched is chosen) there is
 a maximum element and that it is computed by their derivative-based
 algorithm. This proof idea is inspired by work of Frisch and Cardelli
 \cite{Frisch2004} on a GREEDY regular expression matching
 algorithm. However, we were not able to establish transitivity and
-totality for the ``order relation'' by Sulzmann and Lu. ??In Section
+totality for the ``order relation'' by Sulzmann and Lu. \marginpar{We probably drop this section}
+??In Section
 \ref{argu} we identify some inherent problems with their approach (of
 which some of the proofs are not published in \cite{Sulzmann2014});
 perhaps more importantly, we give a simple inductive (and
 algorithm-independent) definition of what we call being a {\em POSIX
 value} for a regular expression @{term r} and a string @{term s}; we
 \cite{Sulzmann2014} is available at the website of its first author; this
 extended version already includes remarks in the appendix that their
 informal proof contains gaps, and possible fixes are not fully worked out.}
 Our specification of a POSIX value consists of a simple inductive definition
 that given a string and a regular expression uniquely determines this value.
+We also show that our definition is equivalent to an ordering
+of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2013}.
 Derivatives as calculated by Brzozowski's method are usually more complex
 regular expressions than the initial one; various optimisations are
 possible. We prove the correctness when simplifications of @{term "ALT ZERO
 r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
 @{term r} are applied.
 only the empty string and @{term c} for matching a character literal. The
 language of a regular expression is also defined as usual by the
 recursive function @{term L} with the six clauses:
 \begin{center}
-\begin{tabular}{l@ {\hspace{3mm}}rcl}
+\begin{tabular}{l@ {\hspace{4mm}}rcl}
 (1) & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
 (2) & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
 (3) & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
-\end{tabular}
-\hspace{14mm}
-\begin{tabular}{l@ {\hspace{3mm}}rcl}
 (4) & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
 (5) & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
 (6) & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
 \end{tabular}
 \end{center}
 @{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\
 @{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
 @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
 @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
-@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
+@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}%\medskip\\
+\end{tabular}
-%\end{tabular}
+\end{center}
-%\end{center}
+\begin{center}
-%\begin{center}
+\begin{tabular}{lcl}
-%\begin{tabular}{lcl}
 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
 @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
 @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
 @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
 section {* POSIX Regular Expression Matching\label{posixsec} *}
 text {*
-The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to define
+The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to use
 values for encoding \emph{how} a regular expression matches a string
 and then define a function on values that mirrors (but inverts) the
 construction of the derivative on regular expressions. \emph{Values}
 are defined as the inductive datatype
 @{thm (lhs) flat.simps(7)} & $\dn$ & @{thm (rhs) flat.simps(7)}\\
 \end{tabular}
 \end{center}
 \noindent Sulzmann and Lu also define inductively an inhabitation relation
-that associates values to regular expressions:
+that associates values to regular expressions
 \begin{center}
 \begin{tabular}{c}
 \\[-8mm]
 @{thm[mode=Axiom] Prf.intros(4)} \qquad
 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]} \qquad
 @{thm[mode=Rule] Prf.intros(6)[of "vs"]}
 \end{tabular}
 \end{center}
-\noindent Note that no values are associated with the regular expression
+\noindent
+where in the clause for @{const "Stars"} we use the notation @{term "v \<in> set vs"}
+for indicating that @{text v} is a member in the list @{text vs}.
+Note that no values are associated with the regular expression
 @{term ZERO}, and that the only value associated with the regular
 expression @{term ONE} is @{term Void}, pronounced (if one must) as @{text
 "Void"}. It is routine to establish how values ``inhabiting'' a regular
 expression correspond to the language of a regular expression, namely
 \begin{proposition}
 @{thm L_flat_Prf}
 \end{proposition}
-In general there is more than one value associated with a regular
+\noindent
-expression. In case of POSIX matching the problem is to calculate the
+Given a regular expression @{text r} and a string @{text s}, we can define the
-unique value that satisfies the (informal) POSIX rules from the
+set of all \emph{Lexical Values} inhabited by @{text r} with the underlying string
-Introduction. Graphically the POSIX value calculation algorithm by
+being @{text s} by
+\begin{center}
+@{thm LV_def}
+\end{center}
+\noindent However, later on it will sometimes be necessary to
+restrict the set of lexical values to a subset called
+\emph{Canonical Values}. The idea of canonical values is that they
+satisfy the Star Rule (see Introduction) where the $^\star$ does not
+match the empty string unless this is the only match for the
+repetition.  One way to define canonical values formally is to use a
+stronger inhabitation relation, written @{term "\<Turnstile> DUMMY : DUMMY"}, which has the same rules as @{term
+"\<turnstile> DUMMY : DUMMY"} shown above, except that the rule for
+@{term Stars} has
+the additional side-condition of flattened values not being the
+empty string, namely
+\begin{center}
+@{thm [mode=Rule] CPrf.intros(6)}
+\end{center}
+\noindent
+With this we can define
+\begin{center}
+@{thm CV_def}
+\end{center}
+\noindent
+Clearly we have @{thm LV_CV_subset}.
+The main point of canonical values is that for every regular expression @{text r} and every
+string @{text s}, the set @{term "CV r s"} is finite.
+\begin{lemma}
+@{thm CV_finite}
+\end{lemma}
+\noindent This finiteness property does not generally hold for lexical values where
+for example @{term "LV (STAR ONE) []"} contains infinitely many
+values, but @{thm CV_STAR_ONE_empty}. However, if a regular
+expression @{text r} matches a string @{text s}, then in general the
+set @{term "CV r s"} is not just a
+singleton set.  In case of POSIX matching the problem is to
+calculate the unique value that satisfies the (informal) POSIX rules
+from the Introduction. It will turn out that this POSIX value is in fact a
+canonical value.
+Graphically the POSIX value calculation algorithm by
 Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
 where the path from the left to the right involving @{term derivatives}/@{const
 nullable} is the first phase of the algorithm (calculating successive
 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
 left, the second phase. This picture shows the steps required when a
 be proved by induction over the definition of @{term derivatives}; the second by
 an induction on @{term r}. There are no interesting cases.\qed
 \end{proof}
 Having defined the @{const mkeps} and @{text inj} function we can extend
-\Brz's matcher so that a [lexical] value is constructed (assuming the
+\Brz's matcher so that a value is constructed (assuming the
 regular expression matches the string). The clauses of the Sulzmann and Lu lexer are
 \begin{center}
 \begin{tabular}{lcl}
 @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\
 match the string, then @{const Some} value is returned. One important
 virtue of this algorithm is that it can be implemented with ease in any
 functional programming language and also in Isabelle/HOL. In the remaining
 part of this section we prove that this algorithm is correct.
-The well-known idea of POSIX matching is informally defined by the longest
+The well-known idea of POSIX matching is informally defined by some
-match and priority rule (see Introduction); as correctly argued in \cite{Sulzmann2014}, this
+rules such as the longest match and priority rule (see
+Introduction); as correctly argued in \cite{Sulzmann2014}, this
 needs formal specification. Sulzmann and Lu define an ``ordering
-relation'' between values and argue
+relation'' between values and argue that there is a maximum value,
-that there is a maximum value, as given by the derivative-based algorithm.
+as given by the derivative-based algorithm.  In contrast, we shall
-In contrast, we shall introduce a simple inductive definition that
+introduce a simple inductive definition that specifies directly what
-specifies directly what a \emph{POSIX value} is, incorporating the
+a \emph{POSIX value} is, incorporating the POSIX-specific choices
-POSIX-specific choices into the side-conditions of our rules. Our
+into the side-conditions of our rules. Our definition is inspired by
-definition is inspired by the matching relation given by Vansummeren
+the matching relation given by Vansummeren
-\cite{Vansummeren2006}. The relation we define is ternary and written as
+\cite{Vansummeren2006}. The relation we define is ternary and
-\mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings, regular expressions and
+written as \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating
-values.
+strings, regular expressions and values; the inductive rules are given in
+Figure~\ref{POSIXrules}.
+We can prove that given a string @{term s} and regular expression @{term
+r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow> v"}.
 %
+\begin{figure}[t]
 \begin{center}
 \begin{tabular}{c}
 @{thm[mode=Axiom] Posix.intros(1)}@{text "P"}@{term "ONE"} \qquad
 @{thm[mode=Axiom] Posix.intros(2)}@{text "P"}@{term "c"}\medskip\\
 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}@{text "P+L"}\qquad
 @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
 @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
 {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$@{text "P\<star>"}
 \end{tabular}
 \end{center}
+\caption{Our inductive definition of POSIX values.}\label{POSIXrules}
-\noindent
+\end{figure}
-We can prove that given a string @{term s} and regular expression @{term
-r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow> v"}.
 \begin{theorem}\mbox{}\smallskip\\\label{posixdeterm}
 \begin{tabular}{ll}
 @{text "(1)"} & If @{thm (prem 1) Posix1(1)} then @{thm (concl)
 Posix1(1)} and @{thm (concl) Posix1(2)}.\\
 The second parts follows by a case analysis of @{term "s \<in> r \<rightarrow> v'"} and
 the first part.\qed
 \end{proof}
 \noindent
-We claim that our @{term "s \<in> r \<rightarrow> v"} relation captures the idea behind the two
+We claim that our @{term "s \<in> r \<rightarrow> v"} relation captures the idea behind the four
 informal POSIX rules shown in the Introduction: Consider for example the
 rules @{text "P+L"} and @{text "P+R"} where the POSIX value for a string
 and an alternative regular expression, that is @{term "(s, ALT r\<^sub>1 r\<^sub>2)"},
 is specified---it is always a @{text "Left"}-value, \emph{except} when the
 string to be matched is not in the language of @{term "r\<^sub>1"}; only then it
 "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
 split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
 @{term v} cannot be flattened to the empty string. In effect, we require
 that in each ``iteration'' of the star, some non-empty substring needs to
 be ``chipped'' away; only in case of the empty string we accept @{term
-"Stars []"} as the POSIX value.
+"Stars []"} as the POSIX value. Indeed we can show that our POSIX value
+is a canonical value which excludes those @{text Stars} containing values
+that flatten to the empty string.
+\begin{lemma}
+@{thm [mode=IfThen] Posix_CV}
+\end{lemma}
+\begin{proof}
+By routine induction on @{thm (prem 1) Posix_CV}.\qed
+\end{proof}
+\noindent
 Next is the lemma that shows the function @{term "mkeps"} calculates
 the POSIX value for the empty string and a nullable regular expression.
 \begin{lemma}\label{lemmkeps}
 @{thm[mode=IfThen] Posix_mkeps}
 *}
 text {*
-Theorems:
+Theorem 1:
-@{thm [mode=IfThen] Posix_CPT}
 @{thm [mode=IfThen] Posix_PosOrd[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
-Corrollary from the last one
+Theorem 2:
-@{thm [mode=IfThen] Posix_PosOrd_stronger[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
-Theorem
 @{thm [mode=IfThen] PosOrd_Posix[where ?v1.0="v\<^sub>1"]}
 *}

changeset 267	32b222d77fa0
parent 266	fff2e1b40dfc
child 268	6746f5e1f1f8