--- a/thys/Journal/Paper.thy Sun Aug 27 00:03:31 2017 +0300
+++ b/thys/Journal/Paper.thy Wed Sep 06 00:52:08 2017 +0100
@@ -19,6 +19,10 @@
"_Collect" :: "pttrn => bool => 'a set" ("(1{_ \<^raw:\mbox{\boldmath$\mid$}> _})")
"_CollectIn" :: "pttrn => 'a set => bool => 'a set" ("(1{_ \<in> _ |e _})")
+syntax
+ "_Not_Ex" :: "idts \<Rightarrow> bool \<Rightarrow> bool" ("(3\<nexists>_.a./ _)" [0, 10] 10)
+ "_Not_Ex1" :: "pttrn \<Rightarrow> bool \<Rightarrow> bool" ("(3\<nexists>!_.a./ _)" [0, 10] 10)
+
abbreviation
"der_syn r c \<equiv> der c r"
@@ -37,12 +41,12 @@
If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,73] 73) and
- ZERO ("\<^bold>0" 78) and
- ONE ("\<^bold>1" 1000) and
+ ZERO ("\<^bold>0" 81) and
+ ONE ("\<^bold>1" 81) and
CHAR ("_" [1000] 80) and
ALT ("_ + _" [77,77] 78) and
SEQ ("_ \<cdot> _" [77,77] 78) and
- STAR ("_\<^sup>\<star>" [1000] 78) and
+ STAR ("_\<^sup>\<star>" [78] 78) and
val.Void ("Empty" 78) and
val.Char ("Char _" [1000] 78) and
@@ -56,6 +60,7 @@
der_syn ("_\\_" [79, 1000] 76) and
ders_syn ("_\\_" [79, 1000] 76) and
flat ("|_|" [75] 74) and
+ flats ("|_|" [72] 74) and
Sequ ("_ @ _" [78,77] 63) and
injval ("inj _ _ _" [79,77,79] 76) and
mkeps ("mkeps _" [79] 76) and
@@ -106,6 +111,7 @@
*)
+
(*>*)
@@ -187,8 +193,8 @@
matches also. By the Star Rule we know @{text "(r\<^bsub>key\<^esub> +
r\<^bsub>id\<^esub>)\<^sup>\<star>"} matches @{text "iffoo"},
respectively @{text "if"}, in exactly one `iteration' of the star. The
-Empty String Rule is for cases where, for example, @{text
-"(a\<^sup>\<star>)\<^sup>\<star>"} matches against the
+Empty String Rule is for cases where, for example, the regular expression
+@{text "(a\<^sup>\<star>)\<^sup>\<star>"} matches against the
string @{text "bc"}. Then the longest initial matched substring is the
empty string, which is matched by both the whole regular expression
and the parenthesised subexpression.
@@ -202,13 +208,34 @@
expression matches a string, values encode the information of
\emph{how} the string is matched by the regular expression---that is,
which part of the string is matched by which part of the regular
-expression. For this consider again the the string @{text "xy"} and
-the regular expression \mbox{@{text "(x + (y +
-xy))\<^sup>\<star>"}}. The POSIX value, which corresponds to using the
-star in only one repetition,
+expression. For this consider again the string @{text "xy"} and
+the regular expression \mbox{@{text "(x + (y + xy))\<^sup>\<star>"}}
+(this time fully parenthesised). We can view this regular expression
+as tree and if the string @{text xy} is matched by two Star
+`iterations', then the @{text x} is matched by the left-most
+alternative in this tree and the @{text y} by the right-left alternative. This
+suggests to record this matching as
+
+\begin{center}
+@{term "Stars [Left(Char x), Right(Left(Char y))]"}
+\end{center}
+\noindent where @{const Stars}, @{text Left}, @{text Right} and @{text
+Char} are constructors for values. @{text Stars} records how many
+iterations were used; @{text Left}, respectively @{text Right}, which
+alternative is used. This `tree view' leads naturally to the
+idea that regular expressions act as types and values as inhabiting
+those types. This view was first put forward by ???. The value for the
+single `iteration', i.e.~the POSIX value, would look as follows
-\marginpar{explain values; who introduced them}
+\begin{center}
+@{term "Stars [Seq (Char x) (Char y)]"}
+\end{center}
+
+\noindent where @{const Stars} has only a single-element list for the
+single iteration and @{const Seq} indicates that @{term xy} is matched
+by a sequence regular expression, which we will in what follows
+write more formally as @{term "SEQ x y"}.
Sulzmann and Lu give a simple algorithm to calculate a value that
@@ -280,15 +307,17 @@
section {* Preliminaries *}
-text {* \noindent Strings in Isabelle/HOL are lists of characters with the
-empty string being represented by the empty list, written @{term "[]"}, and
-list-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usual
-bracket notation for lists also for strings; for example a string consisting
-of just a single character @{term c} is written @{term "[c]"}. By using the
+text {* \noindent Strings in Isabelle/HOL are lists of characters with
+the empty string being represented by the empty list, written @{term
+"[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. Often
+we use the usual bracket notation for lists also for strings; for
+example a string consisting of just a single character @{term c} is
+written @{term "[c]"}. We use the usual definitions for
+\emph{prefixes} and \emph{strict prefixes} of strings. By using the
type @{type char} for characters we have a supply of finitely many
characters roughly corresponding to the ASCII character set. Regular
-expressions are defined as usual as the elements of the following inductive
-datatype:
+expressions are defined as usual as the elements of the following
+inductive datatype:
\begin{center}
@{text "r :="}
@@ -308,16 +337,18 @@
\begin{center}
\begin{tabular}{l@ {\hspace{4mm}}rcl}
- (1) & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
- (2) & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
- (3) & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
- (4) & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
- (5) & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
- (6) & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
+ \textit{(1)} & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
+ \textit{(2)} & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
+ \textit{(3)} & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
+ \textit{(4)} & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ &
+ @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
+ \textit{(5)} & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ &
+ @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
+ \textit{(6)} & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
\end{tabular}
\end{center}
- \noindent In clause (4) we use the operation @{term "DUMMY ;;
+ \noindent In clause \textit{(4)} we use the operation @{term "DUMMY ;;
DUMMY"} for the concatenation of two languages (it is also list-append for
strings). We use the star-notation for regular expressions and for
languages (in the last clause above). The star for languages is defined
@@ -362,12 +393,14 @@
@{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
- @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}%\medskip\\
- \end{tabular}
- \end{center}
+ @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
- \begin{center}
- \begin{tabular}{lcl}
+% \end{tabular}
+% \end{center}
+
+% \begin{center}
+% \begin{tabular}{lcl}
+
@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
@@ -392,9 +425,9 @@
\begin{proposition}\label{derprop}\mbox{}\\
\begin{tabular}{ll}
- @{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if
+ \textit{(1)} & @{thm (lhs) nullable_correctness} if and only if
@{thm (rhs) nullable_correctness}, and \\
- @{text "(2)"} & @{thm[mode=IfThen] der_correctness}.
+ \textit{(2)} & @{thm[mode=IfThen] der_correctness}.
\end{tabular}
\end{proposition}
@@ -457,15 +490,21 @@
\end{tabular}
\end{center}
- \noindent Sulzmann and Lu also define inductively an inhabitation relation
- that associates values to regular expressions. We define this relation as
- follows:\footnote{Note that the rule for @{term Stars} differs from our
- earlier paper \cite{AusafDyckhoffUrban2016}. There we used the original
- definition by Sulzmann and Lu which does not require that the values @{term "v \<in> set vs"}
- flatten to a non-empty string. The reason for introducing the
- more restricted version of lexical values is convenience later
- on when reasoning about
- an ordering relation for values.}
+ \noindent We will sometimes refer to the underlying string of a
+ value as \emph{flattened value}. We will also overload our notation and
+ use @{term "flats vs"} for flattening a list of values and concatenating
+ the resulting strings.
+
+ Sulzmann and Lu define
+ inductively an \emph{inhabitation relation} that associates values to
+ regular expressions. We define this relation as
+ follows:\footnote{Note that the rule for @{term Stars} differs from
+ our earlier paper \cite{AusafDyckhoffUrban2016}. There we used the
+ original definition by Sulzmann and Lu which does not require that
+ the values @{term "v \<in> set vs"} flatten to a non-empty
+ string. The reason for introducing the more restricted version of
+ lexical values is convenience later on when reasoning about an
+ ordering relation for values.}
\begin{center}
\begin{tabular}{c@ {\hspace{12mm}}c}
@@ -501,7 +540,7 @@
set of all \emph{Lexical Values} inhabited by @{text r} with the underlying string
being @{text s}:\footnote{Okui and Suzuki refer to our lexical values
as \emph{canonical values} in \cite{OkuiSuzuki2010}. The notion of \emph{non-problematic
- values} by Cardelli and Frisch \cite{Frisch2004} is similar, but not identical
+ values} by Cardelli and Frisch \cite{Frisch2004} is related, but not identical
to our lexical values.}
\begin{center}
@@ -519,7 +558,7 @@
@{term Stars}-rule above. For example using Sulzmann and Lu's
less restrictive definition, @{term "LV (STAR ONE) []"} would contain
infinitely many values, but according to our more restricted
- definition @{thm LV_STAR_ONE_empty}.
+ definition only a single value, namely @{thm LV_STAR_ONE_empty}.
If a regular expression @{text r} matches a string @{text s}, then
generally the set @{term "LV r s"} is not just a singleton set. In
@@ -618,25 +657,25 @@
%
\begin{center}
\begin{tabular}{l@ {\hspace{5mm}}lcl}
- (1) & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
- (2) & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ &
+ \textit{(1)} & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
+ \textit{(2)} & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ &
@{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
- (3) & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ &
+ \textit{(3)} & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ &
@{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
- (4) & @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
+ \textit{(4)} & @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
& @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
- (5) & @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
+ \textit{(5)} & @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
& @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
- (6) & @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$
+ \textit{(6)} & @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$
& @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
- (7) & @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$
+ \textit{(7)} & @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$
& @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
\end{tabular}
\end{center}
\noindent To better understand what is going on in this definition it
might be instructive to look first at the three sequence cases (clauses
- (4)--(6)). In each case we need to construct an ``injected value'' for
+ \textit{(4)} -- \textit{(6)}). In each case we need to construct an ``injected value'' for
@{term "SEQ r\<^sub>1 r\<^sub>2"}. This must be a value of the form @{term
"Seq DUMMY DUMMY"}\,. Recall the clause of the @{text derivative}-function
for sequence regular expressions:
@@ -648,22 +687,22 @@
\noindent Consider first the @{text "else"}-branch where the derivative is @{term
"SEQ (der c r\<^sub>1) r\<^sub>2"}. The corresponding value must therefore
be of the form @{term "Seq v\<^sub>1 v\<^sub>2"}, which matches the left-hand
- side in clause~(4) of @{term inj}. In the @{text "if"}-branch the derivative is an
+ side in clause~\textit{(4)} of @{term inj}. In the @{text "if"}-branch the derivative is an
alternative, namely @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c
r\<^sub>2)"}. This means we either have to consider a @{text Left}- or
@{text Right}-value. In case of the @{text Left}-value we know further it
must be a value for a sequence regular expression. Therefore the pattern
- we match in the clause (5) is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"},
- while in (6) it is just @{term "Right v\<^sub>2"}. One more interesting
- point is in the right-hand side of clause (6): since in this case the
+ we match in the clause \textit{(5)} is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"},
+ while in \textit{(6)} it is just @{term "Right v\<^sub>2"}. One more interesting
+ point is in the right-hand side of clause \textit{(6)}: since in this case the
regular expression @{text "r\<^sub>1"} does not ``contribute'' to
matching the string, that means it only matches the empty string, we need to
call @{const mkeps} in order to construct a value for how @{term "r\<^sub>1"}
can match this empty string. A similar argument applies for why we can
- expect in the left-hand side of clause (7) that the value is of the form
+ expect in the left-hand side of clause \textit{(7)} that the value is of the form
@{term "Seq v (Stars vs)"}---the derivative of a star is @{term "SEQ (der c r)
(STAR r)"}. Finally, the reason for why we can ignore the second argument
- in clause (1) of @{term inj} is that it will only ever be called in cases
+ in clause \textit{(1)} of @{term inj} is that it will only ever be called in cases
where @{term "c=d"}, but the usual linearity restrictions in patterns do
not allow us to build this constraint explicitly into our function
definition.\footnote{Sulzmann and Lu state this clause as @{thm (lhs)
@@ -712,7 +751,7 @@
part of this section we prove that this algorithm is correct.
The well-known idea of POSIX matching is informally defined by some
- rules such as the longest match and priority rule (see
+ rules such as the Longest Match and Priority Rules (see
Introduction); as correctly argued in \cite{Sulzmann2014}, this
needs formal specification. Sulzmann and Lu define an ``ordering
relation'' between values and argue that there is a maximum value,
@@ -720,8 +759,8 @@
introduce a simple inductive definition that specifies directly what
a \emph{POSIX value} is, incorporating the POSIX-specific choices
into the side-conditions of our rules. Our definition is inspired by
- the matching relation given by Vansummeren
- \cite{Vansummeren2006}. The relation we define is ternary and
+ the matching relation given by Vansummeren~\cite{Vansummeren2006}.
+ The relation we define is ternary and
written as \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating
strings, regular expressions and values; the inductive rules are given in
Figure~\ref{POSIXrules}.
@@ -802,8 +841,8 @@
@{term v} cannot be flattened to the empty string. In effect, we require
that in each ``iteration'' of the star, some non-empty substring needs to
be ``chipped'' away; only in case of the empty string we accept @{term
- "Stars []"} as the POSIX value. Indeed we can show that our POSIX value
- is a lexical value which excludes those @{text Stars} containing subvalues
+ "Stars []"} as the POSIX value. Indeed we can show that our POSIX values
+ are lexical values which exclude those @{text Stars} that contain subvalues
that flatten to the empty string.
\begin{lemma}\label{LVposix}
@@ -909,7 +948,7 @@
By induction on @{term s} using Lemma~\ref{lemmkeps} and \ref{Posix2}.\qed
\end{proof}
- \noindent In (2) we further know by Theorem~\ref{posixdeterm} that the
+ \noindent In \textit{(2)} we further know by Theorem~\ref{posixdeterm} that the
value returned by the lexer must be unique. A simple corollary
of our two theorems is:
@@ -942,20 +981,20 @@
details of their proofs, but which are evidently not in final form
yet. Unfortunately, we were not able to verify claims that their
ordering has properties such as being transitive or having maximal
- elements.
+ elements.
Okui and Suzuki \cite{OkuiSuzuki2010,OkuiSuzukiTech} described
another ordering of values, which they use to establish the
correctness of their automata-based algorithm for POSIX matching.
Their ordering resembles some aspects of the one given by Sulzmann
- and Lu, but is quite different. To begin with, Okui and Suzuki
- identify POSIX values as minimal, rather than maximal, elements in
- their ordering. A more substantial difference is that the ordering
- by Okui and Suzuki uses \emph{positions} in order to identify and
- compare subvalues. Positions are lists of natural numbers. This
- allows them to quite naturally formalise the Longest Match and
- Priority rules of the informal POSIX standard. Consider for example
- the value @{term v}
+ and Lu, but overall is quite different. To begin with, Okui and
+ Suzuki identify POSIX values as minimal, rather than maximal,
+ elements in their ordering. A more substantial difference is that
+ the ordering by Okui and Suzuki uses \emph{positions} in order to
+ identify and compare subvalues. Positions are lists of natural
+ numbers. This allows them to quite naturally formalise the Longest
+ Match and Priority rules of the informal POSIX standard. Consider
+ for example the value @{term v}
\begin{center}
@{term "v == Stars [Seq (Char x) (Char y), Char z]"}
@@ -965,8 +1004,8 @@
At position @{text "[0,1]"} of this value is the
subvalue @{text "Char y"} and at position @{text "[1]"} the
subvalue @{term "Char z"}. At the `root' position, or empty list
- @{term "[]"}, is the whole value @{term v}. The positions @{text
- "[0,1,0]"} and @{text "[2]"}, for example, are outside of @{text
+ @{term "[]"}, is the whole value @{term v}. Positions such as @{text
+ "[0,1,0]"} or @{text "[2]"} are outside of @{text
v}. If it exists, the subvalue of @{term v} at a position @{text
p}, written @{term "at v p"}, can be recursively defined by
@@ -987,7 +1026,7 @@
\noindent In the last clause we use Isabelle's notation @{term "vs ! n"} for the
@{text n}th element in a list. The set of positions inside a value @{text v},
- written @{term "Pos v"}, is given by the clauses
+ written @{term "Pos v"}, is given by
\begin{center}
\begin{tabular}{lcl}
@@ -1003,7 +1042,7 @@
\end{center}
\noindent
- whereby @{text len} stands for the length of a list. Clearly
+ whereby @{text len} in the last clause stands for the length of a list. Clearly
for every position inside a value there exists a subvalue at that position.
@@ -1019,7 +1058,7 @@
\end{center}
\noindent Both values match the string @{text "xyz"}, that means if
- we flatten the values at their respective root position, we obtain
+ we flatten these values at their respective root position, we obtain
@{text "xyz"}. However, at position @{text "[0]"}, @{text v} matches
@{text xy} whereas @{text w} matches only the shorter @{text x}. So
according to the Longest Match Rule, we should prefer @{text v},
@@ -1046,7 +1085,7 @@
@{term "v == Left (Char x)"} \qquad and \qquad @{term "w == Right (Char x)"}
\end{center}
- \noindent Both values match @{text x}, but at position @{text "[0]"}
+ \noindent Both values match @{text x}. At position @{text "[0]"}
the norm of @{term v} is @{text 1} (the subvalue matches @{text x}),
but the norm of @{text w} is @{text "-1"} (the position is outside
@{text w} according to how we defined the `inside' positions of
@@ -1055,7 +1094,8 @@
"pflat_len w [1]"} are reversed, but the point is that subvalues
will be analysed according to lexicographically ordered
positions. According to this ordering, the position @{text "[0]"}
- takes precedence. The lexicographic ordering of positions, written
+ takes precedence over @{text "[1]"} and thus also @{text v} will be
+ preferred over @{text w}. The lexicographic ordering of positions, written
@{term "DUMMY \<sqsubset>lex DUMMY"}, can be conveniently formalised
by three inference rules
@@ -1070,8 +1110,9 @@
With the norm and lexicographic order in place,
we can state the key definition of Okui and Suzuki
- \cite{OkuiSuzuki2010}: a value @{term "v\<^sub>1"} is \emph{smaller} than
- @{term "v\<^sub>2"} if and only if $(i)$ the norm at position @{text p} is
+ \cite{OkuiSuzuki2010}: a value @{term "v\<^sub>1"} is \emph{smaller at position @{text p}} than
+ @{term "v\<^sub>2"}, written @{term "v\<^sub>1 \<sqsubset>val p v\<^sub>2"},
+ if and only if $(i)$ the norm at position @{text p} is
greater in @{term "v\<^sub>1"} (that is the string @{term "flat (at v\<^sub>1 p)"} is longer
than @{term "flat (at v\<^sub>2 p)"}) and $(ii)$ all subvalues at
positions that are inside @{term "v\<^sub>1"} or @{term "v\<^sub>2"} and that are
@@ -1106,7 +1147,9 @@
While we encountered a number of obstacles for establishing properties like
transitivity for the ordering of Sulzmann and Lu (and which we failed
to overcome), it is relatively straightforward to establish this
- property for the ordering by Okui and Suzuki.
+ property for the orderings
+ @{term "DUMMY :\<sqsubset>val DUMMY"} and @{term "DUMMY :\<sqsubseteq>val DUMMY"}
+ by Okui and Suzuki.
\begin{lemma}[Transitivity]\label{transitivity}
@{thm [mode=IfThen] PosOrd_trans[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2" and ?v3.0="v\<^sub>3"]}
@@ -1118,42 +1161,103 @@
"v\<^sub>3"}) are `distinct'. Since @{text
"\<prec>\<^bsub>lex\<^esub>"} is trichotomous, we need to consider
three cases, namely @{term "p = q"}, @{term "p \<sqsubset>lex q"} and
- @{term "q \<sqsubset>lex p"}. Let us look at the first case.
- Clearly @{term "pflat_len v\<^sub>2 p < pflat_len v\<^sub>1 p"}
- and @{term "pflat_len v\<^sub>3 p < pflat_len v\<^sub>2 p"}
- imply @{term "pflat_len v\<^sub>3 p < pflat_len v\<^sub>1 p"}.
- It remains to show for a @{term "p' \<in> Pos v\<^sub>1 \<union> Pos v\<^sub>3"}
- with @{term "p' \<sqsubset>lex p"} that
- @{term "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>3 p'"} holds.
- Suppose @{term "p' \<in> Pos v\<^sub>1"}, then we can infer from the
- first assumption that @{term "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>2 p'"}.
- But this means that @{term "p'"} must be in @{term "Pos v\<^sub>2"} too.
- Hence we can use the second assumption and infer @{term "pflat_len v\<^sub>2 p' = pflat_len v\<^sub>3 p'"}, which concludes
- this case with @{term "v\<^sub>1 :\<sqsubset>val v\<^sub>3"}.
- The reasoning in the other cases is similar.\qed
+ @{term "q \<sqsubset>lex p"}. Let us look at the first case. Clearly
+ @{term "pflat_len v\<^sub>2 p < pflat_len v\<^sub>1 p"} and @{term
+ "pflat_len v\<^sub>3 p < pflat_len v\<^sub>2 p"} imply @{term
+ "pflat_len v\<^sub>3 p < pflat_len v\<^sub>1 p"}. It remains to show
+ that for a @{term "p' \<in> Pos v\<^sub>1 \<union> Pos v\<^sub>3"}
+ with @{term "p' \<sqsubset>lex p"} that @{term "pflat_len v\<^sub>1
+ p' = pflat_len v\<^sub>3 p'"} holds. Suppose @{term "p' \<in> Pos
+ v\<^sub>1"}, then we can infer from the first assumption that @{term
+ "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>2 p'"}. But this means
+ that @{term "p'"} must be in @{term "Pos v\<^sub>2"} too (the norm
+ cannot be @{text "-1"} given @{term "p' \<in> Pos v\<^sub>1"}).
+ Hence we can use the second assumption and
+ infer @{term "pflat_len v\<^sub>2 p' = pflat_len v\<^sub>3 p'"},
+ which concludes this case with @{term "v\<^sub>1 :\<sqsubset>val
+ v\<^sub>3"}. The reasoning in the other cases is similar.\qed
\end{proof}
- \noindent It is straightforward to show that @{text "\<prec>"} and
- $\preccurlyeq$ are partial orders. Okui and Suzuki also show that it
- is a linear order for lexical values \cite{OkuiSuzuki2010} of a given
- regular expression and given string, but we have not done this. It is
+ \noindent
+ The proof for $\preccurlyeq$ is similar and omitted.
+ It is also straightforward to show that @{text "\<prec>"} and
+ $\preccurlyeq$ are partial orders. Okui and Suzuki furthermore show that they
+ are linear orderings for lexical values \cite{OkuiSuzuki2010} of a given
+ regular expression and given string, but we have not formalised this in Isabelle. It is
not essential for our results. What we are going to show below is
- that for a given @{text r} and @{text s}, the ordering has a unique
+ that for a given @{text r} and @{text s}, the orderings have a unique
minimal element on the set @{term "LV r s"}, which is the POSIX value
- we defined in the previous section.
+ we defined in the previous section. We start with two properties that
+ show how the length of a flattened value relates to the @{text "\<prec>"}-ordering.
+
+ \begin{proposition}\mbox{}\smallskip\\\label{ordlen}
+ \begin{tabular}{@ {}ll}
+ (1) &
+ @{thm [mode=IfThen] PosOrd_shorterE[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\
+ (2) &
+ @{thm [mode=IfThen] PosOrd_shorterI[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
+ \end{tabular}
+ \end{proposition}
+
+ \noindent Both properties follow from the definition of the ordering. Note that
+ \textit{(2)} entails that a value, say @{term "v\<^sub>2"}, whose underlying
+ string is a strict prefix of another flattened value, say @{term "v\<^sub>1"}, then
+ @{term "v\<^sub>1"} must be smaller than @{term "v\<^sub>2"}. For our proofs it
+ will be useful to have the following properties---in each case the underlying strings
+ of the compared values are the same:
-
- Lemma 1
-
- @{thm [mode=IfThen] PosOrd_shorterE[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
+ \begin{proposition}\mbox{}\smallskip\\\label{ordintros}
+ \begin{tabular}{ll}
+ \textit{(1)} &
+ @{thm [mode=IfThen] PosOrd_Left_Right[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\
+ \textit{(2)} & If
+ @{thm (prem 1) PosOrd_Left_eq[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} \;then\;
+ @{thm (lhs) PosOrd_Left_eq[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} \;iff\;
+ @{thm (rhs) PosOrd_Left_eq[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\
+ \textit{(3)} & If
+ @{thm (prem 1) PosOrd_Right_eq[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} \;then\;
+ @{thm (lhs) PosOrd_Right_eq[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} \;iff\;
+ @{thm (rhs) PosOrd_Right_eq[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\
+ \textit{(4)} & If
+ @{thm (prem 1) PosOrd_Seq_eq[where ?v2.0="v\<^sub>2" and ?w2.0="w\<^sub>2"]} \;then\;
+ @{thm (lhs) PosOrd_Seq_eq[where ?v2.0="v\<^sub>2" and ?w2.0="w\<^sub>2"]} \;iff\;
+ @{thm (rhs) PosOrd_Seq_eq[where ?v2.0="v\<^sub>2" and ?w2.0="w\<^sub>2"]}\\
+ \textit{(5)} & If
+ @{thm (prem 2) PosOrd_SeqI1[simplified, where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2" and
+ ?w1.0="w\<^sub>1" and ?w2.0="w\<^sub>2"]} \;and\;
+ @{thm (prem 1) PosOrd_SeqI1[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2" and
+ ?w1.0="w\<^sub>1" and ?w2.0="w\<^sub>2"]} \;then\;
+ @{thm (concl) PosOrd_SeqI1[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2" and
+ ?w1.0="w\<^sub>1" and ?w2.0="w\<^sub>2"]}\\
+ \textit{(6)} & If
+ @{thm (prem 1) PosOrd_Stars_append_eq[where ?vs1.0="vs\<^sub>1" and ?vs2.0="vs\<^sub>2"]} \;then\;
+ @{thm (lhs) PosOrd_Stars_append_eq[where ?vs1.0="vs\<^sub>1" and ?vs2.0="vs\<^sub>2"]} \;iff\;
+ @{thm (rhs) PosOrd_Stars_append_eq[where ?vs1.0="vs\<^sub>1" and ?vs2.0="vs\<^sub>2"]}\\
+
+ \textit{(7)} & If
+ @{thm (prem 2) PosOrd_StarsI[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2" and
+ ?vs1.0="vs\<^sub>1" and ?vs2.0="vs\<^sub>2"]} \;and\;
+ @{thm (prem 1) PosOrd_StarsI[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2" and
+ ?vs1.0="vs\<^sub>1" and ?vs2.0="vs\<^sub>2"]} \;then\;
+ @{thm (concl) PosOrd_StarsI[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2" and
+ ?vs1.0="vs\<^sub>1" and ?vs2.0="vs\<^sub>2"]}\\
+ \end{tabular}
+ \end{proposition}
- but in the other direction only
-
- @{thm [mode=IfThen] PosOrd_shorterI[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
-
+ \noindent One might prefer that statements \textit{(4)} and \textit{(5)}
+ (respectively \textit{(6)} and \textit{(7)})
+ are combined into a single \textit{iff}-statement (like the ones for @{text
+ Left} and @{text Right}). Unfortunately this cannot be done easily: such
+ a single statement would require an additional assumption about the
+ two values @{term "Seq v\<^sub>1 v\<^sub>2"} and @{term "Seq w\<^sub>1 w\<^sub>2"}
+ being inhabited by the same regular expression. The
+ complexity of the proofs involved seems to not justify such a
+ `cleaner' single statement. The statements given are just the properties that
+ allow us to establish our theorems. The proofs for Proposition~\ref{ordintros}
+ are routine.
- Next we establish how Okui and Suzuki's ordering relates to our
+ Next we establish how Okui and Suzuki's orderings relate to our
definition of POSIX values. Given a @{text POSIX} value @{text "v\<^sub>1"}
for @{text r} and @{text s}, then any other lexical value @{text
"v\<^sub>2"} in @{term "LV r s"} is greater or equal than @{text
@@ -1179,47 +1283,55 @@
\begin{itemize}
\item[$\bullet$] Case @{text "P+L"} with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
- \<rightarrow> (Left w\<^sub>1)"}: In this case @{term "v\<^sub>1 =
- Left w\<^sub>1"} and the value @{term "v\<^sub>2"} is either of the
+ \<rightarrow> (Left w\<^sub>1)"}: In this case the value
+ @{term "v\<^sub>2"} is either of the
form @{term "Left w\<^sub>2"} or @{term "Right w\<^sub>2"}. In the
- latter case we can immediately conclude with @{term "v\<^sub>1
- :\<sqsubseteq>val v\<^sub>2"} since a @{text Left}-value with the
+ latter case we can immediately conclude with \mbox{@{term "v\<^sub>1
+ :\<sqsubseteq>val v\<^sub>2"}} since a @{text Left}-value with the
same underlying string @{text s} is always smaller than a
- @{text Right}-value. In the former case we have @{term "w\<^sub>2
+ @{text Right}-value by Proposition~\ref{ordintros}\textit{(1)}.
+ In the former case we have @{term "w\<^sub>2
\<in> LV r\<^sub>1 s"} and can use the induction hypothesis to infer
@{term "w\<^sub>1 :\<sqsubseteq>val w\<^sub>2"}. Because @{term
"w\<^sub>1"} and @{term "w\<^sub>2"} have the same underlying string
@{text s}, we can conclude with @{term "Left w\<^sub>1
- :\<sqsubseteq>val Left w\<^sub>2"} by Prop ???.\smallskip
+ :\<sqsubseteq>val Left w\<^sub>2"} using
+ Proposition~\ref{ordintros}\textit{(2)}.\smallskip
\item[$\bullet$] Case @{text "P+R"} with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
\<rightarrow> (Right w\<^sub>1)"}: This case similar to the previous
case, except that we additionally know @{term "s \<notin> L
r\<^sub>1"}. This is needed when @{term "v\<^sub>2"} is of the form
- @{term "Left w\<^sub>2"}. Since \mbox{@{term "flat v\<^sub>2 = flat
+ \mbox{@{term "Left w\<^sub>2"}}. Since \mbox{@{term "flat v\<^sub>2 = flat
w\<^sub>2"} @{text "= s"}} and @{term "\<Turnstile> w\<^sub>2 :
r\<^sub>1"}, we can derive a contradiction for \mbox{@{term "s \<notin> L
r\<^sub>1"}} using
Proposition~\ref{inhabs}. So also in this case \mbox{@{term "v\<^sub>1
:\<sqsubseteq>val v\<^sub>2"}}.\smallskip
- \item[$\bullet$] Case @{text "PS"} with @{term "(s\<^sub>1 @ s\<^sub>2) \<in> (SEQ
- r\<^sub>1 r\<^sub>2) \<rightarrow> (Seq w\<^sub>1 w\<^sub>2)"}: We
- can assume @{term "v\<^sub>2 = Seq (u\<^sub>1) (u\<^sub>2)"} with
- @{term "\<Turnstile> u\<^sub>1 : r\<^sub>1"} and \mbox{@{term
- "\<Turnstile> u\<^sub>2 : r\<^sub>2"}}. We have @{term "s\<^sub>1 @
- s\<^sub>2 = (flat u\<^sub>1) @ (flat u\<^sub>2)"}. By the
- side-condition of the @{text PS}-rule we know that either @{term
- "s\<^sub>1 = flat u\<^sub>1"} or that @{term "flat u\<^sub>1"} is a
- strict prefix ??? of @{term "s\<^sub>1"}. In the latter case we can
- infer @{term "w\<^sub>1 :\<sqsubset>val u\<^sub>1"} by ??? and from
- this @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} by ???. In the
- former case we know @{term "u\<^sub>1 \<in> LV r\<^sub>1 s\<^sub>1"}
- and @{term "u\<^sub>2 \<in> LV r\<^sub>2 s\<^sub>2"}. With this we
- can use the induction hypotheses to infer @{term "w\<^sub>1
- :\<sqsubseteq>val u\<^sub>1"} and @{term "w\<^sub>2
- :\<sqsubseteq>val u\<^sub>2"}. By ??? we can again infer @{term
- "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}.
+ \item[$\bullet$] Case @{text "PS"} with @{term "(s\<^sub>1 @
+ s\<^sub>2) \<in> (SEQ r\<^sub>1 r\<^sub>2) \<rightarrow> (Seq
+ w\<^sub>1 w\<^sub>2)"}: We can assume @{term "v\<^sub>2 = Seq
+ (u\<^sub>1) (u\<^sub>2)"} with @{term "\<Turnstile> u\<^sub>1 :
+ r\<^sub>1"} and \mbox{@{term "\<Turnstile> u\<^sub>2 :
+ r\<^sub>2"}}. We have @{term "s\<^sub>1 @ s\<^sub>2 = (flat
+ u\<^sub>1) @ (flat u\<^sub>2)"}. By the side-condition of the
+ @{text PS}-rule we know that either @{term "s\<^sub>1 = flat
+ u\<^sub>1"} or that @{term "flat u\<^sub>1"} is a strict prefix of
+ @{term "s\<^sub>1"}. In the latter case we can infer @{term
+ "w\<^sub>1 :\<sqsubset>val u\<^sub>1"} by
+ Proposition~\ref{ordlen}\textit{(2)} and from this @{term "v\<^sub>1
+ :\<sqsubseteq>val v\<^sub>2"} by Proposition~\ref{ordintros}\textit{(5)}
+ (as noted above @{term "v\<^sub>1"} and @{term "v\<^sub>2"} must have the
+ same underlying string).
+ In the former case we know
+ @{term "u\<^sub>1 \<in> LV r\<^sub>1 s\<^sub>1"} and @{term
+ "u\<^sub>2 \<in> LV r\<^sub>2 s\<^sub>2"}. With this we can use the
+ induction hypotheses to infer @{term "w\<^sub>1 :\<sqsubseteq>val
+ u\<^sub>1"} and @{term "w\<^sub>2 :\<sqsubseteq>val u\<^sub>2"}. By
+ Proposition~\ref{ordintros}\textit{(4,5)} we can again infer
+ @{term "v\<^sub>1 :\<sqsubseteq>val
+ v\<^sub>2"}.
\end{itemize}
@@ -1228,14 +1340,16 @@
\noindent This theorem shows that our @{text POSIX} value for a
regular expression @{text r} and string @{term s} is in fact a
- minimal element of the values in @{text "LV r s"}. By ??? we also
- know that any value in @{text "LV s' r"}, with @{term "s'"} being a
- prefix, cannot be smaller than @{text "v\<^sub>1"}. The next theorem
- shows the opposite---namely any minimal element in @{term "LV r s"}
- must be a @{text POSIX} value. For this it helps that we proved in
- the previous section that whenever a string @{term "s \<in> L r"} then
- a corresponding @{text POSIX} value exists and is a lexical value,
- see Theorem ??? and Lemma ???.
+ minimal element of the values in @{text "LV r s"}. By
+ Proposition~\ref{ordlen}\textit{(2)} we also know that any value in
+ @{text "LV s' r"}, with @{term "s'"} being a strict prefix, cannot be
+ smaller than @{text "v\<^sub>1"}. The next theorem shows the
+ opposite---namely any minimal element in @{term "LV r s"} must be a
+ @{text POSIX} value. This can be established by induction on @{text
+ r}, but the proof can be drastically simplified by using the fact
+ from the previous section about the existence of a @{text POSIX} value
+ whenever a string @{term "s \<in> L r"}.
+
\begin{theorem}
@{thm [mode=IfThen] PosOrd_Posix[where ?v1.0="v\<^sub>1"]}
@@ -1246,23 +1360,31 @@
@{term "s \<in> L r"} by Proposition~\ref{inhabs}. Hence by Theorem~\ref{lexercorrect}(2)
there exists a
@{text POSIX} value @{term "v\<^sub>P"} with @{term "s \<in> r \<rightarrow> v\<^sub>P"}
- and by Lemma~\ref{LVposix} we also have @{term "v\<^sub>P \<in> LV r s"}.
+ and by Lemma~\ref{LVposix} we also have \mbox{@{term "v\<^sub>P \<in> LV r s"}}.
By Theorem~\ref{orderone} we therefore have
@{term "v\<^sub>P :\<sqsubseteq>val v\<^sub>1"}. If @{term "v\<^sub>P = v\<^sub>1"} then
- we are done. Otherwise we have @{term "v\<^sub>P :\<sqsubset>val v\<^sub>1"} which
- however contradicts the second assumption and we are done too.\qed
+ we are done. Otherwise we have @{term "v\<^sub>P :\<sqsubset>val v\<^sub>1"}, which
+ however contradicts the second assumption about @{term "v\<^sub>1"} being the smallest
+ element in @{term "LV r s"}. So we are done in this case too.\qed
\end{proof}
+ \noindent
+ From this we can also show
+ that if @{term "LV r s"} is non-empty (or equivalently @{term "s \<in> L r"}) then
+ it has a unique minimal element:
+
\begin{corollary}
@{thm [mode=IfThen] Least_existence1}
\end{corollary}
- \noindent To sum up, we have shown that minimal elements according
- to the ordering by Okui and Suzuki are exactly the @{text POSIX}
- values we defined inductively in Section~\ref{posixsec}
- IS THE minimal element unique? We have not shown totality.
+ \noindent To sum up, we have shown that the (unique) minimal elements
+ of the ordering by Okui and Suzuki are exactly the @{text POSIX}
+ values we defined inductively in Section~\ref{posixsec}. This provides
+ an independent confirmation that our ternary relation formalise the
+ informal POSIX rules.
+
*}
section {* Optimisations *}
@@ -1436,9 +1558,7 @@
A strong point in favour of
Sulzmann and Lu's algorithm is that it can be extended in various
- ways.
-
- If we are interested in tokenising a string, then we need to not just
+ ways. If we are interested in tokenising a string, then we need to not just
split up the string into tokens, but also ``classify'' the tokens (for
example whether it is a keyword or an identifier). This can be done with
only minor modifications to the algorithm by introducing \emph{record