(*<*)
theory Paper
imports "../ReStar" "~~/src/HOL/Library/LaTeXsugar"
begin

declare [[show_question_marks = false]]

abbreviation "der_syn r c \<equiv> der c r"
abbreviation "ders_syn r s \<equiv> ders s r"

notation (latex output)
  If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
  Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
  ZERO ("\<^bold>0" 80) and
  ONE ("\<^bold>1" 80) and
  CHAR ("_" [1000] 80) and
  ALT ("_ + _" [77,77] 78) and
  SEQ ("_ \<cdot> _" [77,77] 78) and
  STAR ("_\<^sup>\<star>" [1000] 78) and
  val.Void ("'(')" 79) and
  val.Char ("Char _" [1000] 79) and
  val.Left ("Left _" [1000] 78) and
  val.Right ("Right _" [1000] 78) and
  L ("L'(_')" [10] 78) and
  der_syn ("_\\_" [79, 1000] 76) and
  ders_syn ("_\\_" [79, 1000] 76) and
  flat ("|_|" [75] 73) and
  Sequ ("_ @ _" [78,77] 63) and
  injval ("inj _ _ _" [78,77,78] 77) and
  projval ("proj _ _ _" [1000,77,1000] 77) and
  length ("len _" [78] 73) and
  Prf ("\<triangleright> _ : _" [75,75] 75) and
  PMatch ("_ : _ \<rightarrow> _" [75,75,75] 75)
  (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)

definition
  "match r s \<equiv> nullable (ders s r)"
(*>*)

section {* Introduction *}

text {*

Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
character~@{text c}, and showed that it gave a simple solution to the
problem of matching a string @{term s} with a regular expression @{term r}:
if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
the string matches the empty string, then @{term r} matches @{term s} (and
{\em vice versa}). The derivative has the property (which may be regarded as
its specification) that, for every string @{term s} and regular expression
@{term r} and character @{term c}, one has @{term "(c # s) \<in> L(r)"} if and
only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
derivatives is that they are neatly expressible in any functional language,
and easily definable and reasoned about in theorem provers---the definitions
just consist of inductive datatypes and simple recursive functions. A
completely formalised correctness proof of this matcher in, for example,
HOL4 has been mentioned in~\cite{Owens2008}.

One limitation of Brzozowski's matcher is that it only generates a YES/NO
answer for whether a string is matched by a regular expression. Sulzmann and
Lu \cite{Sulzmann2014} extended this matcher to allow generation not just of
a YES/NO answer but of an actual matching, called a [lexical] {\em value}.
They give a simple algorithm to calculate a value that appears to be the
value associated with POSIX matching \cite{Kuklewicz,Vansummeren2006}. The
challenge then is to specify that value, in an algorithm-independent
fashion, and to show that Sulzmann and Lu's derivative-based algorithm does
indeed calculate a value that is correct according to the specification.

The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
relation (called an ``Order Relation'') on the set of values of @{term r},
and to show that (once a string to be matched is chosen) there is a maximum
element and that it is computed by their derivative-based algorithm. This
proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
GREEDY regular expression matching algorithm.
Beginning with our observations that, without evidence that it is
transitive, it cannot be called an ``order relation'', and that the relation
is called a ``total order'' despite being evidently not total\footnote{The
relation @{text "\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the
values for the regular expression @{term r}; but it only holds between
@{term v} and @{term "v'"} in cases where @{term v} and @{term "v'"} have
the same flattening (underlying string). So a counterexample to totality is
given by taking two values @{term v} and @{term "v'"} for @{term r} that
have different flattenings. A different relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set
of values for @{term r} with flattening @{term s} is definable by the same
approach, and is indeed total; but that is not what Proposition 1 of
\cite{Sulzmann2014} does.}, we identify problems with this approach (some of
whose proofs are not published in \cite{Sulzmann2014}); perhaps more
importantly, we give a simple inductive (and algorithm-independent)
definition of what we call being a {\em POSIX value} for a regular
expression @{term r} and a string @{term s}; we show that the algorithm
computes such a value and that such a value is unique. Proofs are both done
by hand and checked in Isabelle/HOL. The experience of doing our proofs has
been that this mechanical checking was absolutely essential: this subject
area has hidden snares. This was also noted by Kuklewicz \cite{Kuklewicz},
who found that nearly all POSIX matching implementations are ``buggy''
\cite[Page 203]{Sulzmann2014}.

If a regular expression matches a string, then in general there is more than
one way in which the string is matched. There are two commonly used
disambiguation strategies to generate a unique answer: one is called GREEDY
matching \cite{Frisch2004} and the other is POSIX
matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example,
consider the string @{term xy} and the regular expression
\mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be matched
in two `iterations' by the single-letter regular expressions @{term x} and
@{term y}, or directly in one iteration by @{term xy}. The first case
corresponds to GREEDY matching, which first matches with the left-most
symbol and only matches the next symbol in case of a mismatch (this is
greedy in the sense of preferring instant gratification to delayed
repletion). The second case is POSIX matching, which prefers the longest
match.

In the context of lexing, where an input string needs to be split up into a
sequence of tokens, POSIX is the more natural disambiguation strategy for
what programmers consider basic syntactic building blocks in their programs.
These building blocks are often specified by some regular expressions, say
@{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and
identifiers, respectively.
There are two underlying (informal) rules behind tokenising a string in a
POSIX fashion:

\begin{itemize} 
\item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}

The longest initial substring matched by any regular expression is taken as
the next token.\smallskip

\item[$\bullet$] \underline{Rule Priority:}

For a particular longest initial substring, the first regular expression
that can match determines the token.
\end{itemize}

\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
@{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
identifiers (a single character followed by characters or numbers). Then we
can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
first case we obtain by the longest match rule a single identifier token,
not a keyword followed by an identifier. In the second case we obtain by
rule priority a keyword token, not an identifier token---even if
@{text "r\<^bsub>id\<^esub>"} also matches.\bigskip

\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
derivative-based regular expression matching algorithm as described by
Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
algorithm according to our specification of what a POSIX value is. The
informal correctness proof given in \cite{Sulzmann2014} is in final form
and, to us, contains unfillable gaps. Our specification of a POSIX value
consists of a simple inductive definition that, given a string and a regular
expression, uniquely determines this value. Derivatives as calculated by
Brzozowski's method are usually more complex regular expressions than the
initial one; various optimisations are possible, such as the simplifications
of @{term "ALT ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and
@{term "SEQ r ONE"} to @{term r}. One of the advantages of having a simple
specification and correctness proof is that the latter can be refined to
accommodate such optimisations.

An extended version of \cite{Sulzmann2014} is available at the website of
its first author; this includes some ``proofs'', claimed in
\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
final form, we make no comment thereon, preferring to give general reasons
for our belief that the approach of \cite{Sulzmann2014} is problematic
rather than to discuss details of unpublished work.

*}

section {* Preliminaries *}

text {* \noindent Strings in Isabelle/HOL are lists of characters with the
empty string being represented by the empty list, written @{term "[]"}, and
list-cons being written as @{term "DUMMY # DUMMY"}. We often use the usual
bracket notation for lists also for strings; for example a string consisting
of just a single character @{term c} is written @{term "[c]"}. By using the
type @{type char} for characters we have a supply of finitely many
characters roughly corresponding to the ASCII character set.
Regular expressions are defined as usual as the following inductive
datatype:

\begin{center}
@{text "r :="}
@{const "ZERO"} $\mid$
@{const "ONE"} $\mid$
@{term "CHAR c"} $\mid$
@{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
@{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
@{term "STAR r"} 
\end{center}

\noindent where @{const ZERO} stands for the regular expression that does
not match any string, @{const ONE} for the regular expression that matches
only the empty string and @{term c} for matching a character literal. The
language of a regular expression is again defined as usual by the recursive
function @{term L} with the clauses:

\begin{center}
\begin{tabular}{rcl}
@{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
@{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
@{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
@{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
\end{tabular}
\end{center}

\noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the
concatenation of two languages (it is also list-append for strings). We use
the star-notation for regular expressions and languages (in the last clause
above). The star on languages is defined inductively by two clauses:
@{text "(i)"} for the empty string being in the star of a language and
@{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term "s\<^sub>2"} in the star
of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in the star of this
language. It will also be convenient to use the following notion of a
\emph{semantic derivative} (or left quotient) of a language, say @{text A},
defined as:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
\end{tabular}
\end{center}

\noindent For semantic derivatives we have the following equations (for
example \cite{Krauss2011}):

\begin{equation}\label{SemDer}
\begin{array}{lcl}
@{thm (lhs) Der_null} & \dn & @{thm (rhs) Der_null}\\
@{thm (lhs) Der_empty} & \dn & @{thm (rhs) Der_empty}\\
@{thm (lhs) Der_char} & \dn & @{thm (rhs) Der_char}\\
@{thm (lhs) Der_union} & \dn & @{thm (rhs) Der_union}\\
@{thm (lhs) Der_Sequ} & \dn & @{thm (rhs) Der_Sequ}\\
@{thm (lhs) Der_star} & \dn & @{thm (rhs) Der_star}
\end{array}
\end{equation}

\noindent \emph{\Brz's derivatives} of regular expressions
\cite{Brzozowski1964} can be easily defined by two recursive functions: the
first is from regular expressions to booleans (implementing a test for
whether a regular expression can match the empty string), and the second
takes a regular expression and a character to a (derivative) regular
expression:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\
@{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
@{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
@{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}
\end{tabular}
\end{center}

\noindent We may extend this definition to give derivatives w.r.t.~strings:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\
@{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\
\end{tabular}
\end{center}

\noindent Given the equations in \eqref{SemDer}, it is a relatively easy
exercise in mechanical reasoning to establish that

\begin{proposition}\mbox{}\\ 
\begin{tabular}{ll}
@{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if
@{thm (rhs) nullable_correctness} \\ 
@{text "(2)"} & @{thm[mode=IfThen] der_correctness}
\end{tabular}
\end{proposition}

\noindent With this in place it is also very routine to prove that the
regular expression matcher defined as

\begin{center}
@{thm match_def}
\end{center}

\noindent gives a positive answer if and only if @{term "s \<in> L r"}. While
the matcher above calculates a provably correct YES/NO answer for whether a
regular expression matches a string, the novel idea of Sulzmann and Lu
\cite{Sulzmann2014} is to append another phase to calculate a value. We
will explain the details next.

*}

section {* POSIX Regular Expression Matching *}

text {* 

The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
\emph{how} a regular expression matches a string and then define a function
on values that mirrors (but inverts) the construction of the derivative on
regular expressions. Values are defined as the inductive datatype

\begin{center}
@{text "v :="}
@{const "Void"} $\mid$
@{term "val.Char c"} $\mid$
@{term "Left v"} $\mid$
@{term "Right v"} $\mid$
@{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
@{term "Stars vs"} 
\end{center}

\noindent where we use @{term vs} to stand for a list of values.
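
\medskip\noindent As remarked in the Introduction, all of these definitions
are neatly expressible in a functional language. For concreteness, the
following Haskell sketch (ours and purely illustrative; the definitive
versions are the Isabelle/HOL definitions above, and the constructor names
\texttt{Chr}, \texttt{Char'}, \texttt{Left'} and \texttt{Right'} are merely
chosen to avoid clashes with the Haskell Prelude) mirrors the regular
expressions, @{const nullable}, @{const der}, @{const ders}, the matcher
@{const match} and the values:

{\small\begin{verbatim}
-- illustrative sketch only, not the Isabelle definitions
data Rexp = Zero | One | Chr Char
          | Alt Rexp Rexp | Seq Rexp Rexp | Star Rexp

data Val = Void | Char' Char | Seq' Val Val
         | Left' Val | Right' Val | Stars [Val]

-- test whether a regular expression can match the empty string
nullable :: Rexp -> Bool
nullable Zero        = False
nullable One         = True
nullable (Chr _)     = False
nullable (Alt r1 r2) = nullable r1 || nullable r2
nullable (Seq r1 r2) = nullable r1 && nullable r2
nullable (Star _)    = True

-- Brzozowski derivative of a regular expression w.r.t. a character
der :: Char -> Rexp -> Rexp
der _ Zero        = Zero
der _ One         = Zero
der c (Chr d)     = if c == d then One else Zero
der c (Alt r1 r2) = Alt (der c r1) (der c r2)
der c (Seq r1 r2)
  | nullable r1   = Alt (Seq (der c r1) r2) (der c r2)
  | otherwise     = Seq (der c r1) r2
der c (Star r)    = Seq (der c r) (Star r)

-- derivative w.r.t. a string, and the matcher itself
ders :: String -> Rexp -> Rexp
ders []     r = r
ders (c:cs) r = ders cs (der c r)

match :: Rexp -> String -> Bool
match r s = nullable (ders s r)
\end{verbatim}}

\noindent For instance, \texttt{match (Star (Chr 'a')) "aaa"} evaluates to
\texttt{True}.
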
The string underlying a value can be calculated by the @{const flat}
function, written @{term "flat DUMMY"} and defined as:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
@{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
@{thm (lhs) flat.simps(3)} & $\dn$ & @{thm (rhs) flat.simps(3)}\\
@{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)}\\
@{thm (lhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\
@{thm (lhs) flat.simps(6)} & $\dn$ & @{thm (rhs) flat.simps(6)}\\
@{thm (lhs) flat.simps(7)} & $\dn$ & @{thm (rhs) flat.simps(7)}\\
\end{tabular}
\end{center}

\noindent Sulzmann and Lu also define inductively an inhabitation relation
that associates values to regular expressions:

\begin{center}
\begin{tabular}{c}
@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ 
@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
@{thm[mode=Axiom] Prf.intros(4)} \qquad 
@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
@{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad 
@{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\
\end{tabular}
\end{center}

\noindent Note that no values are associated with the regular expression
@{term ZERO}, and that the only value associated with the regular expression
@{term ONE} is @{term Void}, pronounced (if one must) as {\em ``Void''}. It
is routine to establish how values `inhabited' by a regular expression
correspond to the language of a regular expression, namely

\begin{proposition}
@{thm L_flat_Prf}
\end{proposition}

Graphically the algorithm by Sulzmann \& Lu can be illustrated by the
picture in Figure~\ref{Sulz} where the path from the left to the right
involving $der/nullable$ is the first phase of the algorithm and
$mkeps/inj$, the path from right to left, the second phase. This picture
shows the steps required when a regular expression, say $r_1$, matches the
string $abc$. We first build the three derivatives (according to $a$, $b$
and $c$). We then use $nullable$ to find out whether the resulting regular
expression can match the empty string.
If yes, we call the function $mkeps$.

\begin{figure}[t]
\begin{center}
\begin{tikzpicture}[scale=2,node distance=1.3cm,
                    every node/.style={minimum size=7mm}]
\node (r1) {@{term "r\<^sub>1"}};
\node (r2) [right=of r1]{@{term "r\<^sub>2"}};
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};
\node (r3) [right=of r2]{@{term "r\<^sub>3"}};
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {@{term "der b DUMMY"}};
\node (r4) [right=of r3]{@{term "r\<^sub>4"}};
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {@{term "der c DUMMY"}};
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{@{term nullable}}};
\node (v4) [below=of r4]{@{term "v\<^sub>4"}};
\draw[->,line width=1mm](r4) -- (v4);
\node (v3) [left=of v4] {@{term "v\<^sub>3"}};
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {@{text "inj r\<^sub>3 c"}};
\node (v2) [left=of v3]{@{term "v\<^sub>2"}};
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {@{text "inj r\<^sub>2 b"}};
\node (v1) [left=of v2] {@{term "v\<^sub>1"}};
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {@{text "inj r\<^sub>1 a"}};
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
\end{tikzpicture}
\end{center}
\caption{The two phases of the algorithm by Sulzmann \& Lu
\cite{Sulzmann2014} matching the string @{term "[a,b,c]"}. The first phase
(the arrows from left to right) is \Brz's matcher building successive
derivatives. If the last regular expression is @{term nullable}, then
functions of the second phase are called: first @{term mkeps} calculates a
value witnessing how the empty string has been recognised by @{term "r\<^sub>4"}.
After that the function @{term inj} `injects back' the characters of the
string into the values (the arrows from right to left).\label{Sulz}}
\end{figure}

NOT DONE YET

We begin with the case of a nullable regular expression: from the
nullability we need to construct a value that witnesses the nullability.
The @{const mkeps} function (from \cite{Sulzmann2014}) is a partial (but
unambiguous) function from regular expressions to values, total on exactly
the set of nullable regular expressions.

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
@{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
\end{tabular}
\end{center}

The well-known idea of POSIX lexing is informally defined in (for example)
\cite{posix}; as correctly argued in \cite{Sulzmann2014}, this needs formal
specification. The rough idea is that, in contrast to the so-called GREEDY
algorithm, POSIX lexing chooses to match more deeply and to use left choices
rather than right choices. For example, note that to match the string
@{term "[a, b]"} with the regular expression $(a + \mts)\circ (b+ab)$ the
matching will return $(Void, Right(ab))$ rather than $(Left\ a, Left\ b)$.
[The regular expression $ab$ is short for $(Lit\ a) \circ (Lit\ b)$.]
Similarly, to match {\em ``a''} with $(a+a)$ the leftmost $a$ will be
chosen.

We use a simple inductive definition to specify this notion, incorporating
the POSIX-specific choices into the side-conditions for the rules $R tl+_2$,
$R tl\circ$ and $R tl*$ (as they are now called).
By contrast, \cite{Sulzmann2014} defines a relation between values and
argues that there is a maximum value, as given by the derivative-based
algorithm yet to be spelt out. The relation we define is ternary, relating
strings, values and regular expressions.

Our POSIX relation @{term "s \<in> r \<rightarrow> v"} is defined by the following rules:

\begin{center}
\begin{tabular}{c}
@{thm[mode=Axiom] PMatch.intros(1)} \qquad
@{thm[mode=Axiom] PMatch.intros(2)}\medskip\\
@{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad
@{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\medskip\\
\multicolumn{1}{p{5cm}}{@{thm[mode=Rule] PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}\medskip\\
@{thm[mode=Rule] PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}\medskip\\
@{thm[mode=Axiom] PMatch.intros(7)}\medskip\\
\end{tabular}
\end{center}

*}

section {* The Argument by Sulzmann and Lu *}

section {* Conclusion *}

text {* 

Nipkow lexer from 2000

*}

text {*

\noindent Values

\noindent The @{const mkeps} function

\noindent The @{text inj} function

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
@{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
@{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
@{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
@{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
@{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
@{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
\end{tabular}
\end{center}

\noindent The inhabitation relation:

\begin{center}
\begin{tabular}{c}
@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ 
@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
@{thm[mode=Axiom] Prf.intros(4)} \qquad 
@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
@{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad 
@{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\
\end{tabular}
\end{center}

\noindent We have also introduced a slightly restricted version of this
relation where the last rule is restricted so that @{term "flat v \<noteq> []"}.
This relation for \emph{non-problematic} values is written @{term "\<Turnstile> v : r"}.
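
\bigskip\noindent Continuing the purely illustrative Haskell sketch from
earlier (again ours, not taken from \cite{Sulzmann2014}; it reuses the
\texttt{Rexp} and \texttt{Val} datatypes and the functions
\texttt{nullable} and \texttt{der} given there, and the name \texttt{lexer}
for the top-level function is our choice), the second phase of the
algorithm in Figure~\ref{Sulz} can be sketched as:

{\small\begin{verbatim}
-- mkeps is deliberately partial: it is defined exactly on the
-- nullable regular expressions
mkeps :: Rexp -> Val
mkeps One         = Void
mkeps (Alt r1 r2) = if nullable r1 then Left' (mkeps r1)
                                   else Right' (mkeps r2)
mkeps (Seq r1 r2) = Seq' (mkeps r1) (mkeps r2)
mkeps (Star _)    = Stars []

-- inj r c v injects the character c back into the value v,
-- turning a value for (der c r) into a value for r
inj :: Rexp -> Char -> Val -> Val
inj (Chr d)     _ Void                 = Char' d
inj (Alt r1 _)  c (Left' v1)           = Left' (inj r1 c v1)
inj (Alt _ r2)  c (Right' v2)          = Right' (inj r2 c v2)
inj (Seq r1 _)  c (Seq' v1 v2)         = Seq' (inj r1 c v1) v2
inj (Seq r1 _)  c (Left' (Seq' v1 v2)) = Seq' (inj r1 c v1) v2
inj (Seq r1 r2) c (Right' v2)          = Seq' (mkeps r1) (inj r2 c v2)
inj (Star r)    c (Seq' v (Stars vs))  = Stars (inj r c v : vs)

-- the two phases together: derivatives on the way in,
-- mkeps at the end, injections on the way out
lexer :: Rexp -> String -> Maybe Val
lexer r []     = if nullable r then Just (mkeps r) else Nothing
lexer r (c:cs) = fmap (inj r c) (lexer (der c r) cs)
\end{verbatim}}
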
\bigskip
\noindent Our version of Sulzmann's ordering relation:

\begin{center}
\begin{tabular}{c}
@{thm[mode=Rule] ValOrd.intros(2)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>1'" "v\<^sub>2" "r\<^sub>2" "v\<^sub>2'"]} \qquad
@{thm[mode=Rule] ValOrd.intros(1)[of "v\<^sub>2" "r\<^sub>2" "v\<^sub>2'" "v\<^sub>1" "r\<^sub>1"]}\medskip\\
@{thm[mode=Rule] ValOrd.intros(3)[of "v\<^sub>1" "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]} \qquad
@{thm[mode=Rule] ValOrd.intros(4)[of "v\<^sub>2" "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
@{thm[mode=Rule] ValOrd.intros(5)[of "v\<^sub>2" "r\<^sub>2" "v\<^sub>2'" "r\<^sub>1"]} \qquad
@{thm[mode=Rule] ValOrd.intros(6)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>1'" "r\<^sub>2"]} \medskip\\
@{thm[mode=Axiom] ValOrd.intros(7)}\qquad
@{thm[mode=Axiom] ValOrd.intros(8)[of "c"]}\medskip\\
@{thm[mode=Rule] ValOrd.intros(9)[of "v" "vs" "r"]}\qquad
@{thm[mode=Rule] ValOrd.intros(10)[of "v" "vs" "r"]}\medskip\\
@{thm[mode=Rule] ValOrd.intros(11)[of "v\<^sub>1" "r" "v\<^sub>2" "vs\<^sub>1" "vs\<^sub>2"]}\medskip\\
@{thm[mode=Rule] ValOrd.intros(12)[of "vs\<^sub>1" "r" "vs\<^sub>2" "v"]}\qquad
@{thm[mode=Axiom] ValOrd.intros(13)[of "r"]}\medskip\\
\end{tabular}
\end{center}

\noindent A prefix of a string @{text s}:

\begin{center}
\begin{tabular}{c}
@{thm prefix_def[of "s\<^sub>1" "s\<^sub>2"]}
\end{tabular}
\end{center}

\noindent Values and non-problematic values:

\begin{center}
\begin{tabular}{c}
@{thm Values_def}\medskip\\
\end{tabular}
\end{center}

\noindent The point is that for a given @{text s} and @{text r} there are
only finitely many non-problematic values.

*} 

text {* 

\noindent Some lemmas we have proved:\bigskip

@{thm L_flat_Prf}

@{thm L_flat_NPrf}

@{thm[mode=IfThen] mkeps_nullable}

@{thm[mode=IfThen] mkeps_flat}

@{thm[mode=IfThen] Prf_injval}

@{thm[mode=IfThen] Prf_injval_flat}

@{thm[mode=IfThen] PMatch_mkeps}

@{thm[mode=IfThen] PMatch1(2)}

@{thm[mode=IfThen] PMatch1N}

@{thm[mode=IfThen] PMatch_determ(1)[of "s" "r" "v\<^sub>1" "v\<^sub>2"]}

\medskip
\noindent
This is the main theorem that lets us prove that the algorithm is correct
according to @{term "s \<in> r \<rightarrow> v"}:

@{thm[mode=IfThen] PMatch2}

\mbox{}\bigskip

\noindent {\bf Proof} The proof is by induction on the definition of
@{const der}. Other inductions would go through as well. The interesting
case is for @{term "SEQ r\<^sub>1 r\<^sub>2"}. First we analyse the case where
@{term "nullable r\<^sub>1"}. We have by induction hypotheses

\[
\begin{array}{l}
(IH1)\quad @{text "\<forall>s v."} \text{\;if\;} @{term "s \<in> der c r\<^sub>1 \<rightarrow> v"}
\text{\;then\;} @{term "(c # s) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v"}\\
(IH2)\quad @{text "\<forall>s v."} \text{\;if\;} @{term "s \<in> der c r\<^sub>2 \<rightarrow> v"}
\text{\;then\;} @{term "(c # s) \<in> r\<^sub>2 \<rightarrow> injval r\<^sub>2 c v"}
\end{array}
\]

\noindent and have

\[
@{term "s \<in> ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2) \<rightarrow> v"}
\]

\noindent There are two cases for what @{term v} can be: (1) @{term "Left v'"}
and (2) @{term "Right v'"}.

\begin{itemize}
\item[(1)] We know @{term "s \<in> SEQ (der c r\<^sub>1) r\<^sub>2 \<rightarrow> v'"} holds, from which
we can infer that there are @{text "s\<^sub>1"}, @{term "s\<^sub>2"}, @{text "v\<^sub>1"},
@{term "v\<^sub>2"} with

\[
@{term "s\<^sub>1 \<in> der c r\<^sub>1 \<rightarrow> v\<^sub>1"} \qquad\text{and}\qquad @{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"}
\]

and also

\[
@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4.
s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) \<and> s\<^sub>4 \<in> L r\<^sub>2)"}
\]

\noindent and have to prove

\[
@{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"}
\]

\noindent The two requirements @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} and
@{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"} can be proved by the induction hypothesis (IH1) and
the fact above.

\noindent This leaves us to prove

\[
@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}
\]

\noindent which holds because @{term "(c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 "} implies
@{term "s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) "}

\item[(2)] This case is similar.

\end{itemize}

\noindent The final case is that @{term " \<not> nullable r\<^sub>1"} holds. This case
is again similar to the cases above.

*}

text {*

%\noindent
%{\bf Acknowledgements:}
%We are grateful for the comments we received from anonymous
%referees.

\bibliographystyle{plain}
\bibliography{root}

\section{Roy's Rules}

\newcommand{\abs}[1]{\mid\!\! #1\!\! \mid}
%%\newcommand{\mts}{\textit{``''}
\newcommand{\tl}{\ \triangleleft\ }

$$\inferrule[]{\ }{Void \tl \epsilon}
  \quad\quad
  \inferrule[]{\ }{Char\ c \tl Lit\ c}
$$

$$\inferrule
   {v_1 \tl r_1}
   {Left\ v_1 \tl r_1 + r_2}
  \quad\quad
  \inferrule[]
   {v_2 \tl r_2 \\ \abs{v_2}\ \not\in\ L(r_1)}
   {Right\ v_2 \tl r_1 + r_2}
$$

$$\inferrule
   {v_1 \tl r_1 \\ v_2 \tl r_2 \\
    s \in\ L(r_1\backslash\! \abs{v_1}) \ \land\ \abs{v_2}\!\backslash s \in L(r_2) \ \Rightarrow\ s = []}
   {(v_1, v_2) \tl r_1 \cdot r_2}
$$

$$\inferrule
   {v \tl r \\ vs \tl r^* \\ \abs{v}\ \not=\ []}
   {(v :: vs) \tl r^*}
  \quad\quad
  \inferrule{\ }{[] \tl r^*}
$$

*}

(*<*)
end
(*>*)