(*<*)
theory Paper
imports "../ReStar" "~~/src/HOL/Library/LaTeXsugar"
begin
declare [[show_question_marks = false]]
abbreviation
"der_syn r c \<equiv> der c r"
abbreviation
"ders_syn r s \<equiv> ders s r"
notation (latex output)
If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,73] 73) and
ZERO ("\<^bold>0" 78) and
ONE ("\<^bold>1" 78) and
CHAR ("_" [1000] 80) and
ALT ("_ + _" [77,77] 78) and
SEQ ("_ \<cdot> _" [77,77] 78) and
STAR ("_\<^sup>\<star>" [1000] 78) and
val.Void ("'(')" 79) and
val.Char ("Char _" [1000] 79) and
val.Left ("Left _" [79] 78) and
val.Right ("Right _" [79] 78) and
val.Seq ("Seq _ _" [79,79] 78) and
val.Stars ("Stars _" [79] 78) and
L ("L'(_')" [10] 78) and
der_syn ("_\\_" [79, 1000] 76) and
ders_syn ("_\\_" [79, 1000] 76) and
flat ("|_|" [75] 74) and
Sequ ("_ @ _" [78,77] 63) and
injval ("inj _ _ _" [79,77,79] 76) and
mkeps ("mkeps _" [79] 76) and
projval ("proj _ _ _" [1000,77,1000] 77) and
length ("len _" [78] 73) and
matcher ("lexer _ _" [78,78] 77) and
Prf ("_ : _" [75,75] 75) and
PMatch ("'(_, _') \<rightarrow> _" [63,75,75] 75)
(* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)
definition
"match r s \<equiv> nullable (ders s r)"
(*>*)
section {* Introduction *}
text {*
Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
character~@{text c}, and showed that it gave a simple solution to the
problem of matching a string @{term s} with a regular expression @{term r}:
if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
the string matches the empty string, then @{term r} matches @{term s} (and
{\em vice versa}). The derivative has the property (which may almost be
regarded as its specification) that, for every string @{term s} and regular
expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
derivatives is that they are neatly expressible in any functional language,
and easily definable and reasoned about in theorem provers---the definitions
just consist of inductive datatypes and simple recursive functions. A
completely formalised correctness proof of this matcher in for example HOL4
has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is part
of the work in \cite{Krauss2011}.
One limitation of Brzozowski's matcher is that it only generates a YES/NO
answer for whether a string is being matched by a regular expression.
Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
generation not just of a YES/NO answer but of an actual matching, called a
[lexical] {\em value}. They give a simple algorithm to calculate a value
that appears to be the value associated with POSIX matching
\cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that
value, in an algorithm-independent fashion, and to show that Sulzmann and
Lu's derivative-based algorithm does indeed calculate a value that is
correct according to the specification.
The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
relation (called an ``Order Relation'') on the set of values of @{term r},
and to show that (once a string to be matched is chosen) there is a maximum
element and that it is computed by their derivative-based algorithm. This
proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
GREEDY regular expression matching algorithm. Beginning with our
observations that, without evidence that it is transitive, it cannot be
called an ``order relation'', and that the relation is called a ``total
order'' despite being evidently not total\footnote{The relation @{text
"\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the
values for the regular expression @{term r}; but it only holds between
@{term v} and @{term "v'"} in cases where @{term v} and @{term "v'"} have
the same flattening (underlying string). So a counterexample to totality is
given by taking two values @{term v} and @{term "v'"} for @{term r} that
have different flattenings (see Section~\ref{posixsec}). A different
relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
with flattening @{term s} is definable by the same approach, and is indeed
total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, we
identify problems with this approach (of which some of the proofs are not
published in \cite{Sulzmann2014}); perhaps more importantly, we give a
simple inductive (and algorithm-independent) definition of what we call
being a {\em POSIX value} for a regular expression @{term r} and a string
@{term s}; we show that the algorithm computes such a value and that such a
value is unique. Proofs are both done by hand and checked in Isabelle/HOL.
The experience of doing our proofs has been that this mechanical checking
was absolutely essential: this subject area has hidden snares. This was also
noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
If a regular expression matches a string, then in general there is more than
one way of how the string is matched. There are two commonly used
disambiguation strategies to generate a unique answer: one is called GREEDY
matching \cite{Frisch2004} and the other is POSIX
matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider
the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
(ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
the single letter-regular expressions @{term x} and @{term y}, or directly
in one iteration by @{term xy}. The first case corresponds to GREEDY
matching, which first matches with the left-most symbol and only matches the
next symbol in case of a mismatch (this is greedy in the sense of preferring
instant gratification to delayed repletion). The second case is POSIX
matching, which prefers the longest match.
In the context of lexing, where an input string needs to be split up into a
sequence of tokens, POSIX is the more natural disambiguation strategy for
what programmers consider basic syntactic building blocks in their programs.
These building blocks are often specified by some regular expressions, say
@{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and
identifiers, respectively. There are two underlying (informal) rules behind
tokenising a string in a POSIX fashion:
\begin{itemize}
\item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}
The longest initial substring matched by any regular expression is taken as
next token.\smallskip
\item[$\bullet$] \underline{Priority Rule:}
For a particular longest initial substring, the first regular expression
that can match determines the token.
\end{itemize}
\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
@{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
identifiers (say, a single character followed by characters or numbers).
Then we can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use
POSIX matching to tokenise strings, say @{text "iffoo"} and @{text "if"}.
For @{text "iffoo"} we obtain by the longest match rule a single identifier
token, not a keyword followed by an identifier. For @{text "if"} we obtain by
rule priority a keyword token, not an identifier token---even if @{text
"r\<^bsub>id\<^esub>"} matches also.\bigskip
\noindent {\bf Contributions:} (NOT DONE YET) We have implemented in
Isabelle/HOL the derivative-based regular expression matching algorithm as
described by Sulzmann and Lu \cite{Sulzmann2014}. We have proved the
correctness of this algorithm according to our specification of what a POSIX
value is. Sulzmann and Lu sketch in \cite{Sulzmann2014} an informal
correctness proof: but to us it contains unfillable gaps.
informal correctness proof given in \cite{Sulzmann2014} is in final
form\footnote{} and to us contains unfillable gaps.
Our specification of a POSIX value consists of a simple inductive definition
that given a string and a regular expression uniquely determines this value.
Derivatives as calculated by Brzozowski's method are usually more complex
regular expressions than the initial one; various optimisations are
possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
advantages of having a simple specification and correctness proof is that
the latter can be refined to allow for such optimisations and simple
correctness proof.
An extended version of \cite{Sulzmann2014} is available at the website of
its first author; this includes some ``proofs'', claimed in
\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
final form, we make no comment thereon, preferring to give general reasons
for our belief that the approach of \cite{Sulzmann2014} is problematic
rather than to discuss details of unpublished work.
*}
section {* Preliminaries *}
text {* \noindent Strings in Isabelle/HOL are lists of characters with the
empty string being represented by the empty list, written @{term "[]"}, and
list-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usual
bracket notation for lists also for strings; for example a string consisting
of just a single character @{term c} is written @{term "[c]"}. By using the
type @{type char} for characters we have a supply of finitely many
characters roughly corresponding to the ASCII character set. Regular
expressions are defined as usual as the elements of the following inductive
datatype:
\begin{center}
@{text "r :="}
@{const "ZERO"} $\mid$
@{const "ONE"} $\mid$
@{term "CHAR c"} $\mid$
@{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
@{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
@{term "STAR r"}
\end{center}
\noindent where @{const ZERO} stands for the regular expression that does
not match any string, @{const ONE} for the regular expression that matches
only the empty string and @{term c} for matching a character literal. The
language of a regular expression is also defined as usual by the
recursive function @{term L} with the clauses:
\begin{center}
\begin{tabular}{rcl}
@{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
@{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
@{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
@{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
\end{tabular}
\end{center}
\noindent In the fourth clause we use the operation @{term "DUMMY ;;
DUMMY"} for the concatenation of two languages (it is also list-append for
strings). We use the star-notation for regular expressions and for
languages (in the last clause above). The star for languages is defined
inductively by two clauses: @{text "(i)"} the empty string being in
the star of a language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a
language and @{term "s\<^sub>2"} in the star of this language, then also @{term
"s\<^sub>1 @ s\<^sub>2"} is in the star of this language. It will also be convenient
to use the following notion of a \emph{semantic derivative} (or \emph{left
quotient}) of a language defined as:
\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
\end{tabular}
\end{center}
\noindent
For semantic derivatives we have the following equations (for example
mechanically proved in \cite{Krauss2011}):
\begin{equation}\label{SemDer}
\begin{array}{lcl}
@{thm (lhs) Der_null} & \dn & @{thm (rhs) Der_null}\\
@{thm (lhs) Der_empty} & \dn & @{thm (rhs) Der_empty}\\
@{thm (lhs) Der_char} & \dn & @{thm (rhs) Der_char}\\
@{thm (lhs) Der_union} & \dn & @{thm (rhs) Der_union}\\
@{thm (lhs) Der_Sequ} & \dn & @{thm (rhs) Der_Sequ}\\
@{thm (lhs) Der_star} & \dn & @{thm (rhs) Der_star}
\end{array}
\end{equation}
\noindent \emph{\Brz's derivatives} of regular expressions
\cite{Brzozowski1964} can be easily defined by two recursive functions:
the first is from regular expressions to booleans (implementing a test
when a regular expression can match the empty string), and the second
takes a regular expression and a character to a (derivative) regular
expression:
\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\
@{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
@{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
@{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}
\end{tabular}
\end{center}
\noindent
We may extend this definition to give derivatives w.r.t.~strings:
\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\
@{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\
\end{tabular}
\end{center}
\noindent Given the equations in \eqref{SemDer}, it is a relatively easy
exercise in mechanical reasoning to establish that
\begin{proposition}\mbox{}\\
\begin{tabular}{ll}
@{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if
@{thm (rhs) nullable_correctness}, and \\
@{text "(2)"} & @{thm[mode=IfThen] der_correctness}.
\end{tabular}
\end{proposition}
\noindent With this in place it is also very routine to prove that the
regular expression matcher defined as
\begin{center}
@{thm match_def}
\end{center}
\noindent gives a positive answer if and only if @{term "s \<in> L r"}.
Consequently, this regular expression matching algorithm satisfies the
usual specification. While the matcher above calculates a provably correct
YES/NO answer for whether a regular expression matches a string, the novel
idea of Sulzmann and Lu \cite{Sulzmann2014} is to append another phase to
this algorithm in order to calculate a [lexical] value. We will explain
the details next.
*}
section {* POSIX Regular Expression Matching\label{posixsec} *}
text {*
The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
\emph{how} a regular expression matches a string and then define a
function on values that mirrors (but inverts) the construction of the
derivative on regular expressions. \emph{Values} are defined as the
inductive datatype
\begin{center}
@{text "v :="}
@{const "Void"} $\mid$
@{term "val.Char c"} $\mid$
@{term "Left v"} $\mid$
@{term "Right v"} $\mid$
@{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$
@{term "Stars vs"}
\end{center}
\noindent where we use @{term vs} standing for a list of values. (This is
similar to the approach taken by Frisch and Cardelli for GREEDY matching
\cite{Frisch2004}, and Sulzmann and Lu \cite{Sulzmann2014} for POSIX matching).
The string underlying a value can be calculated by the @{const flat}
function, written @{term "flat DUMMY"} and defined as:
\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
@{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
@{thm (lhs) flat.simps(3)} & $\dn$ & @{thm (rhs) flat.simps(3)}\\
@{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)}\\
@{thm (lhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\
@{thm (lhs) flat.simps(6)} & $\dn$ & @{thm (rhs) flat.simps(6)}\\
@{thm (lhs) flat.simps(7)} & $\dn$ & @{thm (rhs) flat.simps(7)}\\
\end{tabular}
\end{center}
\noindent Sulzmann and Lu also define inductively an inhabitation relation
that associates values to regular expressions:
\begin{center}
\begin{tabular}{c}
@{thm[mode=Axiom] Prf.intros(4)} \qquad
@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad
@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
@{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad
@{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\
\end{tabular}
\end{center}
\noindent Note that no values are associated with the regular expression
@{term ZERO}, and that the only value associated with the regular
expression @{term ONE} is @{term Void}, pronounced (if one must) as @{text
"Void"}. It is routine to establish how values ``inhabiting'' a regular
expression correspond to the language of a regular expression, namely
\begin{proposition}
@{thm L_flat_Prf}
\end{proposition}
In general there is more than one value associated with a regular
expression. In case of POSIX matching the problem is to calculate the
unique value that satisfies the (informal) POSIX rules from the
Introduction. Graphically the POSIX value calculation algorithm by
Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
where the path from the left to the right involving @{const der}/@{const
nullable} is the first phase of the algorithm (calculating successive
\Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
left, the second phase. This picture shows the steps required when a
regular expression, say @{text "r\<^sub>1"}, matches the string @{term
"[a,b,c]"}. We first build the three derivatives (according to @{term a},
@{term b} and @{term c}). We then use @{const nullable} to find out
whether the resulting derivative regular expression @{term "r\<^sub>4"}
can match the empty string. If yes, we call the function @{const mkeps}
that produces a value @{term "v\<^sub>4"} for how @{term "r\<^sub>4"} can
match the empty string (taking into account the POSIX rules in case
there are several ways). This functions is defined by the clauses:
\begin{figure}[t]
\begin{center}
\begin{tikzpicture}[scale=2,node distance=1.3cm,
every node/.style={minimum size=7mm}]
\node (r1) {@{term "r\<^sub>1"}};
\node (r2) [right=of r1]{@{term "r\<^sub>2"}};
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};
\node (r3) [right=of r2]{@{term "r\<^sub>3"}};
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {@{term "der b DUMMY"}};
\node (r4) [right=of r3]{@{term "r\<^sub>4"}};
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {@{term "der c DUMMY"}};
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{@{term nullable}}};
\node (v4) [below=of r4]{@{term "v\<^sub>4"}};
\draw[->,line width=1mm](r4) -- (v4);
\node (v3) [left=of v4] {@{term "v\<^sub>3"}};
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {@{text "inj r\<^sub>3 c"}};
\node (v2) [left=of v3]{@{term "v\<^sub>2"}};
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {@{text "inj r\<^sub>2 b"}};
\node (v1) [left=of v2] {@{term "v\<^sub>1"}};
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {@{text "inj r\<^sub>1 a"}};
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
\end{tikzpicture}
\end{center}
\caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014}
matching the string @{term "[a,b,c]"}. The first phase (the arrows from
left to right) is \Brz's matcher building succesive derivatives. If at the
last regular expression is @{term nullable}, then functions of the
second phase are called: first @{term mkeps} calculates a value witnessing
how the empty string has been recognised by @{term "r\<^sub>4"}. After
that the function @{term inj} `injects back' the characters of the string into
the values (the arrows from right to left).
\label{Sulz}}
\end{figure}
\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
@{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
\end{tabular}
\end{center}
\noindent Note that this function needs only to be partially defined,
namely only for regular expressions that are nullable. In case @{const
nullable} fails, the string @{term "[a,b,c]"} cannot be matched by @{term
"r\<^sub>1"} and an error is raised instead. Note also how this function
makes some subtle choices leading to a POSIX value: for example if an
alternative regular expression, say @{term "ALT r\<^sub>1 r\<^sub>2"}, can
match the empty string and furthermore @{term "r\<^sub>1"} can match the
empty string, then we return a @{text Left}-value. The @{text
Right}-value will only be returned if @{term "r\<^sub>1"} is not nullable.
The most interesting novelty from Sulzmann and Lu \cite{Sulzmann2014} is
the construction of a value for how @{term "r\<^sub>1"} can match the
string @{term "[a,b,c]"} from the value how the last derivative, @{term
"r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and
Lu achieve this by stepwise ``injecting back'' the characters into the
values thus inverting the operation of building derivatives on the level
of values. The corresponding function, called @{term inj}, takes three
arguments, a regular expression, a character and a value. For example in
the first (or right-most) @{term inj}-step in Fig~\ref{Sulz} the regular
expression @{term "r\<^sub>3"}, the character @{term c} from the last
derivative step and @{term "v\<^sub>4"}, which is the value corresponding
to the derivative regular expression @{term "r\<^sub>4"}. The result is
the new value @{term "v\<^sub>3"}. The final result of the algorithm is
the value @{term "v\<^sub>1"} corresponding to the input regular
expression. The @{term inj} function is by recursion on the regular
expressions and by analysing the shape of values (corresponding to
the derivative regular expressions).
\begin{center}
\begin{tabular}{l@ {\hspace{5mm}}lcl}
(1) & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
(2) & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ &
@{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
(3) & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ &
@{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
(4) & @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
& @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
(5) & @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
& @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
(6) & @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$
& @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
(7) & @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$
& @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
\end{tabular}
\end{center}
\noindent To better understand what is going on in this definition it
might be instructive to look first at the three sequence cases (clauses
(4)--(6)). In each case we need to construct an ``injected value'' for
@{term "SEQ r\<^sub>1 r\<^sub>2"}. This must be a value of the form @{term
"Seq DUMMY DUMMY"}. Recall the clause of the @{const der}-function
for sequence regular expressions:
\begin{center}
@{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}
\end{center}
\noindent Consider first the else-branch where the derivative is @{term
"SEQ (der c r\<^sub>1) r\<^sub>2"}. The corresponding value must therefore
be the form @{term "Seq v\<^sub>1 v\<^sub>2"}, which matches the left-hand
side in clause (4) of @{term inj}. In the if-branch the derivative is an
alternative, namely @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c
r\<^sub>2)"}. This means we either have to consider a @{text Left}- or
@{text Right}-value. In case of the @{text Left}-value we know further it
must be a value for a sequence regular expression. Therefore the pattern
we match in the clause (5) is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"},
while in (6) it is just @{term "Right v\<^sub>2"}. One more interesting
point is in the right-hand side of clause (6): since in this case the
regular expression @{text "r\<^sub>1"} does not ``contribute'' for
matching the string, that is only matches the empty string, we need to
call @{const mkeps} in order to construct a value how @{term "r\<^sub>1"}
can match this empty string. A similar argument applies for why we can
expect in the left-hand side of clause (7) that the value is of the form
@{term "Seq v (Stars vs)"}---the derivative of a star is @{term "SEQ r
(STAR r)"}. Finally, the reason for why we can ignore the second argument
in clause (1) of @{term inj} is that it will only ever be called in cases
where @{term "c=d"}, but the usual linearity restrictions in patterns do
not allow is to build this constraint explicitly into our function
definition.\footnote{Sulzmann and Lu state this clause as @{thm (lhs)
injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]},
but our deviation is harmless.}
Having defined the @{const mkeps} and @{text inj} function we can extend
\Brz's matcher so that a [lexical] value is constructed (assuming the
regular expression matches the string). The clauses of the lexer are
\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) matcher.simps(1)} & $\dn$ & @{thm (rhs) matcher.simps(1)}\\
@{thm (lhs) matcher.simps(2)} & $\dn$ & @{text "case"} @{term "matcher (der c r) s"} @{text of}\\
& & \phantom{$|$} @{term "None"} @{text "\<Rightarrow>"} @{term None}\\
& & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}
\end{tabular}
\end{center}
\noindent If the regular expression does not match, @{const None} is
returned, indicating an error is raised. If the regular expression does
match the string, then @{const Some} value is returned. One important
virtue of this algorithm is that it can be implemented with ease in a
functional programming language and also in Isabelle/HOL. In the remaining
part of this section we prove that this algorithm is correct.
The well-known idea of POSIX matching is informally defined by the longest
match and priority rule; as correctly argued in \cite{Sulzmann2014}, this
needs formal specification. Sulzmann and Lu define a \emph{dominance}
relation\footnote{Sulzmann and Lu call it an ordering relation, but
without giving evidence that it is transitive.} between values and argue that
there is a maximum value, as given by the derivative-based algorithm. In
contrast, we shall next introduce a simple inductive definition to specify
what a \emph{POSIX value} is, incorporating the POSIX-specific choices
into the side-conditions of our rules. Our definition is inspired by the
matching relation given in \cite{Vansummeren2006}. The relation we define
is ternary and written as \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings,
regular expressions and values.
\begin{center}
\begin{tabular}{c}
@{thm[mode=Axiom] PMatch.intros(1)} \qquad
@{thm[mode=Axiom] PMatch.intros(2)}\bigskip\\
@{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad
@{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\bigskip\\
$\mprset{flushleft}
\inferrule
{@{thm (prem 1) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
@{thm (prem 2) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
@{thm (prem 3) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
{@{thm (concl) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\\
@{thm[mode=Axiom] PMatch.intros(7)}\bigskip\\
$\mprset{flushleft}
\inferrule
{@{thm (prem 1) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
@{thm (prem 2) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
@{thm (prem 3) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
@{thm (prem 4) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
{@{thm (concl) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$
\end{tabular}
\end{center}
\noindent We claim that this relation captures the idea behind the two
informal POSIX rules shown in the Introduction: Consider the second line
where the POSIX values for an alternative regular expression is
specified---it is always a @{text "Left"}-value, \emph{except} when the
string to be matched is not in the language of @{term "r\<^sub>1"}; only then it
is a @{text Right}-value (see the side-condition in the rule on the
right). Interesting is also the rule for sequence regular expressions
(third line). The first two premises state that @{term "v\<^sub>1"} and
@{term "v\<^sub>2"} are the POSIX values for @{term "(s\<^sub>1,
r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} respectively.
\begin{theorem}
@{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]}
\end{theorem}
\begin{lemma}
@{thm[mode=IfThen] PMatch_mkeps}
\end{lemma}
\begin{lemma}
@{thm[mode=IfThen] Prf_injval_flat}
\end{lemma}
\begin{lemma}
@{thm[mode=IfThen] PMatch2_roy_version}
\end{lemma}
\begin{theorem}\mbox{}\smallskip\\
\begin{tabular}{ll}
(1) & @{thm (lhs) lex_correct1a} if and only if @{thm (rhs) lex_correct1a}\\
(2) & @{thm (lhs) lex_correct3a} if and only if @{thm (rhs) lex_correct3a}\\
\end{tabular}
\end{theorem}
*}
section {* The Argument by Sulzmmann and Lu *}
section {* Conclusion *}
text {*
Nipkow lexer from 2000
*}
text {*
\noindent
We have also introduced a slightly restricted version of this relation
where the last rule is restricted so that @{term "flat v \<noteq> []"}.
This relation for \emph{non-problematic} is written @{term "\<Turnstile> v : r"}.
\bigskip
\noindent
\noindent
Our version of Sulzmann's ordering relation
\begin{center}
\begin{tabular}{c}
@{thm[mode=Rule] ValOrd.intros(2)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>1'" "v\<^sub>2" "r\<^sub>2" "v\<^sub>2'"]} \qquad
@{thm[mode=Rule] ValOrd.intros(1)[of "v\<^sub>2" "r\<^sub>2" "v\<^sub>2'" "v\<^sub>1" "r\<^sub>1"]}\medskip\\
@{thm[mode=Rule] ValOrd.intros(3)[of "v\<^sub>1" "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]} \qquad
@{thm[mode=Rule] ValOrd.intros(4)[of "v\<^sub>2" "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
@{thm[mode=Rule] ValOrd.intros(5)[of "v\<^sub>2" "r\<^sub>2" "v\<^sub>2'" "r\<^sub>1"]} \qquad
@{thm[mode=Rule] ValOrd.intros(6)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>1'" "r\<^sub>2"]} \medskip\\
@{thm[mode=Axiom] ValOrd.intros(7)}\qquad
@{thm[mode=Axiom] ValOrd.intros(8)[of "c"]}\medskip\\
@{thm[mode=Rule] ValOrd.intros(9)[of "v" "vs" "r"]}\qquad
@{thm[mode=Rule] ValOrd.intros(10)[of "v" "vs" "r"]}\medskip\\
@{thm[mode=Rule] ValOrd.intros(11)[of "v\<^sub>1" "r" "v\<^sub>2" "vs\<^sub>1" "vs\<^sub>2"]}\medskip\\
@{thm[mode=Rule] ValOrd.intros(12)[of "vs\<^sub>1" "r" "vs\<^sub>2" "v"]}\qquad
@{thm[mode=Axiom] ValOrd.intros(13)[of "r"]}\medskip\\
\end{tabular}
\end{center}
\noindent
A prefix of a string s
\begin{center}
\begin{tabular}{c}
@{thm prefix_def[of "s\<^sub>1" "s\<^sub>2"]}
\end{tabular}
\end{center}
\noindent
Values and non-problematic values
\begin{center}
\begin{tabular}{c}
@{thm Values_def}\medskip\\
\end{tabular}
\end{center}
\noindent
The point is that for a given @{text s} and @{text r} there are only finitely many
non-problematic values.
*}
text {*
\noindent
Some lemmas we have proved:\bigskip
@{thm L_flat_Prf}
@{thm L_flat_NPrf}
@{thm[mode=IfThen] mkeps_nullable}
@{thm[mode=IfThen] mkeps_flat}
@{thm[mode=IfThen] Prf_injval}
@{thm[mode=IfThen] Prf_injval_flat}
@{thm[mode=IfThen] PMatch_mkeps}
@{thm[mode=IfThen] PMatch1(2)}
@{thm[mode=IfThen] PMatch1N}
@{thm[mode=IfThen] PMatch_determ(1)[of "s" "r" "v\<^sub>1" "v\<^sub>2"]}
\medskip
\noindent
This is the main theorem that lets us prove that the algorithm is correct according to
@{term "s \<in> r \<rightarrow> v"}:
@{thm[mode=IfThen] PMatch2}
\mbox{}\bigskip
\noindent {\bf Proof} The proof is by induction on the definition of
@{const der}. Other inductions would go through as well. The
interesting case is for @{term "SEQ r\<^sub>1 r\<^sub>2"}. First we analyse the
case where @{term "nullable r\<^sub>1"}. We have by induction hypothesis
\[
\begin{array}{l}
(IH1)\quad @{text "\<forall>s v."} \text{\;if\;} @{term "s \<in> der c r\<^sub>1 \<rightarrow> v"}
\text{\;then\;} @{term "(c # s) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v"}\\
(IH2)\quad @{text "\<forall>s v."} \text{\;if\;} @{term "s \<in> der c r\<^sub>2 \<rightarrow> v"}
\text{\;then\;} @{term "(c # s) \<in> r\<^sub>2 \<rightarrow> injval r\<^sub>2 c v"}
\end{array}
\]
\noindent
and have
\[
@{term "s \<in> ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2) \<rightarrow> v"}
\]
\noindent
There are two cases what @{term v} can be: (1) @{term "Left v'"} and (2) @{term "Right v'"}.
\begin{itemize}
\item[(1)] We know @{term "s \<in> SEQ (der c r\<^sub>1) r\<^sub>2 \<rightarrow> v'"} holds, from which we
can infer that there are @{text "s\<^sub>1"}, @{term "s\<^sub>2"}, @{text "v\<^sub>1"}, @{term "v\<^sub>2"}
with
\[
@{term "s\<^sub>1 \<in> der c r\<^sub>1 \<rightarrow> v\<^sub>1"} \qquad\text{and}\qquad @{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"}
\]
and also
\[
@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) \<and> s\<^sub>4 \<in> L r\<^sub>2)"}
\]
\noindent
and have to prove
\[
@{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"}
\]
\noindent
The two requirements @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} and
@{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"} can be proved by the induction hypothese (IH1) and the
fact above.
\noindent
This leaves to prove
\[
@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}
\]
\noindent
which holds because @{term "(c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 "} implies @{term "s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) "}
\item[(2)] This case is similar.
\end{itemize}
\noindent
The final case is that @{term " \<not> nullable r\<^sub>1"} holds. This case again similar
to the cases above.
*}
text {*
%\noindent
%{\bf Acknowledgements:}
%We are grateful for the comments we received from anonymous
%referees.
\bibliographystyle{plain}
\bibliography{root}
\section{Roy's Rules}
\newcommand{\abs}[1]{\mid\!\! #1\!\! \mid}
%%\newcommand{\mts}{\textit{``''}
\newcommand{\tl}{\ \triangleleft\ }
$$\inferrule[]{Void \tl \epsilon}{}
\quad\quad
\inferrule[]{Char\ c \tl Lit\ c}{}
$$
$$\inferrule
{v_1 \tl r_1}
{Left\ v_1 \tl r_1 + r_2}
\quad\quad
\inferrule[]
{ v_2 \tl r_2 \\ \abs{v_2}\ \not\in\ L(r_1)}
{Right\ v_2 \tl r_1 + r_2}
$$
$$
\inferrule
{v_1 \tl r_1\\
v_2 \tl r_2\\
s \in\ L(r_1\backslash\! \abs{v_1}) \ \land\
\abs{v_2}\!\backslash s\ \epsilon\ L(r_2)
\ \Rightarrow\ s = []
}
{(v_1, v_2) \tl r_1 \cdot r_2}
$$
$$\inferrule
{ v \tl r \\ vs \tl r^* \\ \abs{v}\ \not=\ []}
{ (v :: vs) \tl r^* }
\quad\quad
\inferrule{}
{ [] \tl r^* }
$$
*}
(*<*)
end
(*>*)