lexing: comparison thys/Paper/Paper.thy

equal deleted inserted replaced

-:4c85af262ee7
+:7c6c907660d8
 In general there is more than one value associated with a regular
 expression. In case of POSIX matching the problem is to calculate the
 unique value that satisfies the (informal) POSIX rules from the
 Introduction. Graphically the POSIX value calculation algorithm by
 Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
-where the path from the left to the right involving @{const der}/@{const
+where the path from the left to the right involving @{term derivatives}/@{const
 nullable} is the first phase of the algorithm (calculating successive
 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
 left, the second phase. This picture shows the steps required when a
 regular expression, say @{text "r\<^sub>1"}, matches the string @{term
 "[a,b,c]"}. We first build the three derivatives (according to @{term a},
 \noindent To better understand what is going on in this definition it
 might be instructive to look first at the three sequence cases (clauses
 (4)--(6)). In each case we need to construct an ``injected value'' for
 @{term "SEQ r\<^sub>1 r\<^sub>2"}. This must be a value of the form @{term
-"Seq DUMMY DUMMY"}. Recall the clause of the @{const der}-function
+"Seq DUMMY DUMMY"}. Recall the clause of the @{text derivative}-function
 for sequence regular expressions:
 \begin{center}
 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}
 \end{center}
 but our deviation is harmless.}
 The idea of @{term inj} to ``inject back'' a character into a value can
 be made precise by the first part of the following lemma; the second
 part shows that the underlying string of an @{const mkeps}-value is
-the empty string.
+always the empty string.
-\begin{lemma}\mbox{}\\\label{Prf_injval_flat}
+\begin{lemma}\mbox{}\smallskip\\\label{Prf_injval_flat}
 \begin{tabular}{ll}
 (1) & @{thm[mode=IfThen] Prf_injval_flat}\\
 (2) & @{thm[mode=IfThen] mkeps_flat}
 \end{tabular}
 \end{lemma}
+\begin{proof}
+Both properties are by routine inductions: the first one, for example,
+by an induction over the definition of @{term derivatives}; the second by
+induction on @{term r}. There are no interesting cases.
+\end{proof}
 Having defined the @{const mkeps} and @{text inj} function we can extend
 \Brz's matcher so that a [lexical] value is constructed (assuming the
 regular expression matches the string). The clauses of the lexer are
 & & \phantom{$|$} @{term "None"}  @{text "\<Rightarrow>"} @{term None}\\
 & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}
 \end{tabular}
 \end{center}
-\noindent If the regular expression does not match, @{const None} is
+\noindent If the regular expression does not match the string, @{const None} is
 returned, indicating an error is raised. If the regular expression does
 match the string, then @{const Some} value is returned. One important
 virtue of this algorithm is that it can be implemented with ease in a
 functional programming language and also in Isabelle/HOL. In the remaining
 part of this section we prove that this algorithm is correct.
 The well-known idea of POSIX matching is informally defined by the longest
 match and priority rule; as correctly argued in \cite{Sulzmann2014}, this
 needs formal specification. Sulzmann and Lu define a \emph{dominance}
 relation\footnote{Sulzmann and Lu call it an ordering relation, but
-without giving evidence that it is transitive.} between values and argue that
+without giving evidence that it is transitive.} between values and argue
-there is a maximum value, as given by the derivative-based algorithm. In
+that there is a maximum value, as given by the derivative-based algorithm.
-contrast, we shall next introduce a simple inductive definition to specify
+In contrast, we shall introduce a simple inductive definition that
-what a \emph{POSIX value} is, incorporating the POSIX-specific choices
+specifies directly what a \emph{POSIX value} is, incorporating the
-into the side-conditions of our rules. Our definition is inspired by the
+POSIX-specific choices into the side-conditions of our rules. Our
-matching relation given in \cite{Vansummeren2006}. The relation we define
+definition is inspired by the matching relation given in
-is ternary and written as \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings,
+\cite{Vansummeren2006}. The relation we define is ternary and written as
-regular expressions and values.
+\mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings, regular expressions and
+values.
 \begin{center}
 \begin{tabular}{c}
-@{thm[mode=Axiom] PMatch.intros(1)} \qquad
+@{thm[mode=Axiom] PMatch.intros(1)}@{text "P"}@{term "ONE"} \qquad
-@{thm[mode=Axiom] PMatch.intros(2)}\bigskip\\
+@{thm[mode=Axiom] PMatch.intros(2)}@{text "P"}@{term "c"}\bigskip\\
-@{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad
+@{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}@{text "P+L"}\qquad
-@{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\bigskip\\
+@{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}@{text "P+R"}\bigskip\\
 $\mprset{flushleft}
 \inferrule
 {@{thm (prem 1) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
 @{thm (prem 2) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
 @{thm (prem 3) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
-{@{thm (concl) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\\
+{@{thm (concl) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$@{text "PS"}\\
-@{thm[mode=Axiom] PMatch.intros(7)}\bigskip\\
+@{thm[mode=Axiom] PMatch.intros(7)}@{text "P[]"}\bigskip\\
 $\mprset{flushleft}
 \inferrule
 {@{thm (prem 1) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
 @{thm (prem 2) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
 @{thm (prem 3) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
 @{thm (prem 4) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
-{@{thm (concl) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$
+{@{thm (concl) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$@{text "P\<star>"}
 \end{tabular}
 \end{center}
 \noindent We claim that this relation captures the idea behind the two
-informal POSIX rules shown in the Introduction: Consider the second line
+informal POSIX rules shown in the Introduction: Consider for example the
-where the POSIX values for an alternative regular expression is
+rules @{text "P+L"} and @{text "P+R"} where the POSIX value for an
-specified---it is always a @{text "Left"}-value, \emph{except} when the
+alternative regular expression is specified---it is always a @{text
-string to be matched is not in the language of @{term "r\<^sub>1"}; only then it
+"Left"}-value, \emph{except} when the string to be matched is not in the
-is a @{text Right}-value (see the side-condition in the rule on the
+language of @{term "r\<^sub>1"}; only then it is a @{text Right}-value (see the
-right). Interesting is also the rule for sequence regular expressions
+side-condition in @{text "P+R"}). Interesting is also the rule for
-(third line). The first two premises state that @{term "v\<^sub>1"} and
+sequence regular expressions (@{text "PS"}). The first two premises state
-@{term "v\<^sub>2"} are the POSIX values for @{term "(s\<^sub>1,
+that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} are the POSIX values for
-r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} respectively.
+@{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
+respectively. Consider now the third premise and note that the POSIX value
+of this rule should match the string @{term "s\<^sub>1 @ s\<^sub>2"}. According to the
+longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial
+split of @{term "s\<^sub>1 @ s\<^sub>2"} such that @{term "s\<^sub>2"} is still recognised
+by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there
+\emph{exists} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
+can be split up into a non-empty @{term "s\<^sub>3"} and @{term "s\<^sub>4"}. Moreover
+the longer @{term "s\<^sub>1 @ s\<^sub>3"} can be matched by @{text "r\<^sub>1"} and the
+shorter @{term "s\<^sub>4"} can still be matched by @{term "r\<^sub>2"}. In this case
+@{term "s\<^sub>1"} would not be the longest initial split of @{term "s\<^sub>1 @
+s\<^sub>2"} and therefore @{term "Seq v\<^sub>1 v\<^sub>2"} cannot be a POSIX value
+for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. A similar condition is imposed
+onto the POSIX value in the @{text "P\<star>"}-rule. Also there we want that
+@{term "s\<^sub>1"} is the longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and
+furthermore the corresponding value @{term v} cannot be flatten to
+the empty string. In effect, we require that in each ``iteration''
+of the star, some parts of the string need to be ``nibbled'' away; only
+in case of the empty string weBy accept @{term "Stars []"} as the
+POSIX value.
+We can prove that given a string @{term s} and regular expression @{term
+r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow>
+v"}.
 \begin{theorem}
 @{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]}
 \end{theorem}
+\begin{proof}
+By induction on the definition of @{term "s \<in> r \<rightarrow> v\<^sub>1"} and a case
+analysis of @{term "s \<in> r \<rightarrow> v\<^sub>2"}.
+\end{proof}
 \begin{lemma}
 @{thm[mode=IfThen] PMatch_mkeps}
 \end{lemma}

changeset 122	7c6c907660d8
parent 121	4c85af262ee7
child 123	1bee7006b92b