lexing: comparison thys/Paper/Paper.thy

equal deleted inserted replaced

-:d74bfa11802c
+:4c85af262ee7
 value, in an algorithm-independent fashion, and to show that Sulzmann and
 Lu's derivative-based algorithm does indeed calculate a value that is
 correct according to the specification.
 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
-relation (called an ``Order Relation'') on the set of values of @{term r},
+relation (called an ``order relation'') on the set of values of @{term r},
 and to show that (once a string to be matched is chosen) there is a maximum
 element and that it is computed by their derivative-based algorithm. This
 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
 GREEDY regular expression matching algorithm. Beginning with our
 observations that, without evidence that it is transitive, it cannot be
 identifiers (say, a single character followed by characters or numbers).
 Then we can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use
 POSIX matching to tokenise strings, say @{text "iffoo"} and @{text "if"}.
 For @{text "iffoo"} we obtain by the longest match rule a single identifier
 token, not a keyword followed by an identifier. For @{text "if"} we obtain by
-rule priority a keyword token, not an identifier token---even if @{text
+the priority rule a keyword token, not an identifier token---even if @{text
 "r\<^bsub>id\<^esub>"} matches also.\bigskip
 \noindent {\bf Contributions:} (NOT DONE YET) We have implemented in
 Isabelle/HOL the derivative-based regular expression matching algorithm as
 described by Sulzmann and Lu \cite{Sulzmann2014}. We have proved the
 only the empty string and @{term c} for matching a character literal. The
 language of a regular expression is also defined as usual by the
 recursive function @{term L} with the clauses:
 \begin{center}
-\begin{tabular}{rcl}
+\begin{tabular}{l@ {\hspace{5mm}}rcl}
-@{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
+(1) & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
-@{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
+(2) & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
-@{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
+(3) & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
-@{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
+(4) & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
-@{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
+(5) & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
-@{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
+(6) & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
 \end{tabular}
 \end{center}
-\noindent In the fourth clause we use the operation @{term "DUMMY ;;
+\noindent In clause (4) we use the operation @{term "DUMMY ;;
 DUMMY"} for the concatenation of two languages (it is also list-append for
 strings). We use the star-notation for regular expressions and for
 languages (in the last clause above). The star for languages is defined
 inductively by two clauses: @{text "(i)"} the empty string being in
 the star of a language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a
 @{thm match_def}
 \end{center}
 \noindent gives a positive answer if and only if @{term "s \<in> L r"}.
 Consequently, this regular expression matching algorithm satisfies the
-usual specification. While the matcher above calculates a provably correct
+usual specification for regular expression matching. While the matcher
-YES/NO answer for whether a regular expression matches a string, the novel
+above calculates a provably correct YES/NO answer for whether a regular
-idea of Sulzmann and Lu \cite{Sulzmann2014} is to append another phase to
+expression matches a string or not, the novel idea of Sulzmann and Lu
-this algorithm in order to calculate a [lexical] value. We will explain
+\cite{Sulzmann2014} is to append another phase to this algorithm in order
-the details next.
+to calculate a [lexical] value. We will explain the details next.
 *}
 section {* POSIX Regular Expression Matching\label{posixsec} *}
 @{term "Right v"} $\mid$
 @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$
 @{term "Stars vs"}
 \end{center}
-\noindent where we use @{term vs} standing for a list of values. (This is
+\noindent where we use @{term vs} to stand for a list of values. (This is
 similar to the approach taken by Frisch and Cardelli for GREEDY matching
-\cite{Frisch2004}, and Sulzmann and Lu \cite{Sulzmann2014} for POSIX matching).
+\cite{Frisch2004}, and Sulzmann and Lu \cite{Sulzmann2014} for POSIX
-The string underlying a value can be calculated by the @{const flat}
+matching). The string underlying a value can be calculated by the @{const
-function, written @{term "flat DUMMY"} and defined as:
+flat} function, written @{term "flat DUMMY"} and defined as:
 \begin{center}
 \begin{tabular}{lcl}
 @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
 @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
 \node (v1) [left=of v2] {@{term "v\<^sub>1"}};
 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {@{text "inj r\<^sub>1 a"}};
 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
 \end{tikzpicture}
 \end{center}
-\caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014}
+\caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014},
 matching the string @{term "[a,b,c]"}. The first phase (the arrows from
-left to right) is \Brz's matcher building succesive derivatives. If at the
+left to right) is \Brz's matcher building succesive derivatives. If the
-last regular expression is @{term nullable}, then functions of the
+last regular expression is @{term nullable}, then the functions of the
-second phase are called: first @{term mkeps} calculates a value witnessing
+second phase are called (the top-down and right-to-left arrows): first
+@{term mkeps} calculates a value witnessing
 how the empty string has been recognised by @{term "r\<^sub>4"}. After
 that the function @{term inj} `injects back' the characters of the string into
-the values (the arrows from right to left).
+the values.
 \label{Sulz}}
 \end{figure}
 \begin{center}
 \begin{tabular}{lcl}
 "r\<^sub>1"} and an error is raised instead. Note also how this function
 makes some subtle choices leading to a POSIX value: for example if an
 alternative regular expression, say @{term "ALT r\<^sub>1 r\<^sub>2"}, can
 match the empty string and furthermore @{term "r\<^sub>1"} can match the
 empty string, then we return a @{text Left}-value. The @{text
-Right}-value will only be returned if @{term "r\<^sub>1"} is not nullable.
+Right}-value will only be returned if @{term "r\<^sub>1"} cannot match the empty
+string.
-The most interesting novelty from Sulzmann and Lu \cite{Sulzmann2014} is
+The most interesting idea from Sulzmann and Lu \cite{Sulzmann2014} is
 the construction of a value for how @{term "r\<^sub>1"} can match the
 string @{term "[a,b,c]"} from the value how the last derivative, @{term
 "r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and
 Lu achieve this by stepwise ``injecting back'' the characters into the
 values thus inverting the operation of building derivatives on the level
 @{text Right}-value. In case of the @{text Left}-value we know further it
 must be a value for a sequence regular expression. Therefore the pattern
 we match in the clause (5) is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"},
 while in (6) it is just @{term "Right v\<^sub>2"}. One more interesting
 point is in the right-hand side of clause (6): since in this case the
-regular expression @{text "r\<^sub>1"} does not ``contribute'' for
+regular expression @{text "r\<^sub>1"} does not ``contribute'' to
-matching the string, that is only matches the empty string, we need to
+matching the string, that means it only matches the empty string, we need to
-call @{const mkeps} in order to construct a value how @{term "r\<^sub>1"}
+call @{const mkeps} in order to construct a value for how @{term "r\<^sub>1"}
 can match this empty string. A similar argument applies for why we can
 expect in the left-hand side of clause (7) that the value is of the form
 @{term "Seq v (Stars vs)"}---the derivative of a star is @{term "SEQ r
 (STAR r)"}. Finally, the reason for why we can ignore the second argument
 in clause (1) of @{term inj} is that it will only ever be called in cases
 where @{term "c=d"}, but the usual linearity restrictions in patterns do
 not allow is to build this constraint explicitly into our function
 definition.\footnote{Sulzmann and Lu state this clause as @{thm (lhs)
 injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]},
 but our deviation is harmless.}
+The idea of @{term inj} to ``inject back'' a character into a value can
+be made precise by the first part of the following lemma; the second
+part shows that the underlying string of an @{const mkeps}-value is
+the empty string.
+\begin{lemma}\mbox{}\\\label{Prf_injval_flat}
+\begin{tabular}{ll}
+(1) & @{thm[mode=IfThen] Prf_injval_flat}\\
+(2) & @{thm[mode=IfThen] mkeps_flat}
+\end{tabular}
+\end{lemma}
 Having defined the @{const mkeps} and @{text inj} function we can extend
 \Brz's matcher so that a [lexical] value is constructed (assuming the
 regular expression matches the string). The clauses of the lexer are
 \begin{lemma}
 @{thm[mode=IfThen] PMatch_mkeps}
 \end{lemma}
-\begin{lemma}
-@{thm[mode=IfThen] Prf_injval_flat}
-\end{lemma}
 \begin{lemma}
 @{thm[mode=IfThen] PMatch2_roy_version}
 \end{lemma}

changeset 121	4c85af262ee7
parent 120	d74bfa11802c
child 122	7c6c907660d8