lexing: comparison thys/Paper/Paper.thy


-:ff0844860981
+:e866678c29cb
 For @{text "iffoo"} we obtain by the longest match rule a single identifier
 token, not a keyword followed by an identifier. For @{text "if"} we obtain by
 the priority rule a keyword token, not an identifier token---even if @{text
 "r\<^bsub>id\<^esub>"} matches also.\bigskip
-\noindent {\bf Contributions:} (NOT DONE YET) We have implemented in
+\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
-Isabelle/HOL the derivative-based regular expression matching algorithm as
+derivative-based regular expression matching algorithm as described by
-described by Sulzmann and Lu \cite{Sulzmann2014}. We have proved the
+Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
-correctness of this algorithm according to our specification of what a POSIX
+algorithm according to our specification of what a POSIX value is. Sulzmann
-value is. Sulzmann and Lu sketch in \cite{Sulzmann2014} an informal
+and Lu sketch in \cite{Sulzmann2014} an informal correctness proof, but to
-correctness proof: but to us it contains unfillable gaps.
+us it contains unfillable gaps.\footnote{An extended version of
+\cite{Sulzmann2014} is available at the website of its first author; this
-informal correctness proof given in \cite{Sulzmann2014} is in final
+extended version already includes remarks in the appendix that their
-form\footnote{} and to us contains unfillable gaps.
+informal proof contains gaps, and possible fixes are not fully worked out.}
 Our specification of a POSIX value consists of a simple inductive definition
 that given a string and a regular expression uniquely determines this value.
 Derivatives as calculated by Brzozowski's method are usually more complex
 regular expressions than the initial one; various optimisations are
-possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
+possible. We prove the correctness when simplifications of @{term "ALT ZERO
-ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
+r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
-advantages of having a simple specification and correctness proof is that
+@{term r} are applied.
-the latter can be refined to allow for such optimisations and simple
-correctness proof.
+%An extended version of \cite{Sulzmann2014} is available at the website of
+%its first author; this includes some ``proofs'', claimed in
-An extended version of \cite{Sulzmann2014} is available at the website of
+%\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
-its first author; this includes some ``proofs'', claimed in
+%final form, we make no comment thereon, preferring to give general reasons
-\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
+%for our belief that the approach of \cite{Sulzmann2014} is problematic
-final form, we make no comment thereon, preferring to give general reasons
+%rather than to discuss details of unpublished work.
-for our belief that the approach of \cite{Sulzmann2014} is problematic
-rather than to discuss details of unpublished work.
 *}
 section {* Preliminaries *}
 @{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
 @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
 @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
+\end{tabular}
+\end{center}
+\begin{center}
+\begin{tabular}{lcl}
 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
 @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
 @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
 @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
 not allow is to build this constraint explicitly into our function
 definition.\footnote{Sulzmann and Lu state this clause as @{thm (lhs)
 injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]},
 but our deviation is harmless.}
-The idea of @{term inj} to ``inject back'' a character into a value can
+The idea of the @{term inj}-function to ``inject'' a character, say
-be made precise by the first part of the following lemma; the second
+@{term c}, into a value can be made precise by the first part of the
-part shows that the underlying string of an @{const mkeps}-value is
+following lemma, which shows that the underlying string of an injected
-always the empty string.
+value has the character @{term c} prepended; the second part shows that the
+underlying string of an @{const mkeps}-value is always the empty string
+(given that the regular expression is nullable, since otherwise @{text mkeps}
+might not be defined).
 \begin{lemma}\mbox{}\smallskip\\\label{Prf_injval_flat}
 \begin{tabular}{ll}
 (1) & @{thm[mode=IfThen] Prf_injval_flat}\\
 (2) & @{thm[mode=IfThen] mkeps_flat}
 \end{tabular}
 \end{lemma}
 \begin{proof}
-Both properties are by routine inductions: the first one, for example,
+Both properties are by routine inductions: the first one can, for example,
-by an induction over the definition of @{term derivatives}; the second by
+be proved by an induction over the definition of @{term derivatives}; the second by
-induction on @{term r}. There are no interesting cases.\qed
+an induction on @{term r}. There are no interesting cases.\qed
 \end{proof}
 Having defined the @{const mkeps} and @{text inj} functions we can extend
 \Brz's matcher so that a [lexical] value is constructed (assuming the
 regular expression matches the string). The clauses of the lexer are
 & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}
 \end{tabular}
 \end{center}
 \noindent If the regular expression does not match the string, @{const None} is
-returned, indicating an error is raised. If the regular expression does
+returned, indicating an error is raised. If the regular expression \emph{does}
 match the string, then a @{const Some} value is returned. One important
 virtue of this algorithm is that it can be implemented with ease in a
 functional programming language and also in Isabelle/HOL. In the remaining
 part of this section we prove that this algorithm is correct.
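
The lexer described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tagged-tuple encoding of regular expressions and values and all function names below are ours, not part of the Isabelle/HOL formalisation, and returning Python's `None` stands in for @{const None}.

```python
# Assumed encoding (illustration only): tagged tuples.
#   r ::= ("ZERO",) | ("ONE",) | ("CHAR", c) | ("ALT", r1, r2)
#       | ("SEQ", r1, r2) | ("STAR", r)
#   v ::= ("Void",) | ("Char", c) | ("Left", v) | ("Right", v)
#       | ("Seq", v1, v2) | ("Stars", [v, ...])

def nullable(r):
    """True iff r matches the empty string."""
    tag = r[0]
    if tag in ("ONE", "STAR"):
        return True
    if tag == "ALT":
        return nullable(r[1]) or nullable(r[2])
    if tag == "SEQ":
        return nullable(r[1]) and nullable(r[2])
    return False  # ZERO and CHAR

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag == "CHAR":
        return ("ONE",) if r[1] == c else ("ZERO",)
    if tag == "ALT":
        return ("ALT", der(c, r[1]), der(c, r[2]))
    if tag == "SEQ":
        left = ("SEQ", der(c, r[1]), r[2])
        return ("ALT", left, der(c, r[2])) if nullable(r[1]) else left
    if tag == "STAR":
        return ("SEQ", der(c, r[1]), r)
    return ("ZERO",)  # ZERO and ONE

def mkeps(r):
    """Value for the empty string; only defined if r is nullable."""
    tag = r[0]
    if tag == "ONE":
        return ("Void",)
    if tag == "ALT":
        if nullable(r[1]):
            return ("Left", mkeps(r[1]))
        return ("Right", mkeps(r[2]))
    if tag == "SEQ":
        return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "STAR":
        return ("Stars", [])
    raise ValueError("mkeps is undefined for non-nullable regular expressions")

def inj(r, c, v):
    """Turn a value v for der(c, r) into a value for r by injecting c."""
    tag = r[0]
    if tag == "CHAR":                 # v is Void
        return ("Char", r[1])
    if tag == "ALT":
        if v[0] == "Left":
            return ("Left", inj(r[1], c, v[1]))
        return ("Right", inj(r[2], c, v[1]))
    if tag == "SEQ":
        if v[0] == "Seq":             # r1 was not nullable
            return ("Seq", inj(r[1], c, v[1]), v[2])
        if v[0] == "Left":            # v = Left (Seq v1 v2)
            return ("Seq", inj(r[1], c, v[1][1]), v[1][2])
        return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))  # v = Right v2
    if tag == "STAR":                 # v = Seq v1 (Stars vs)
        return ("Stars", [inj(r[1], c, v[1])] + v[2][1])
    raise ValueError("no value can be injected for this regular expression")

def lexer(r, s):
    """Return a value for (s, r), or None if r does not match s."""
    if s == "":
        return mkeps(r) if nullable(r) else None
    v = lexer(der(s[0], r), s[1:])
    return None if v is None else inj(r, s[0], v)

# a + ab matched against "ab": only the right alternative matches.
r = ("ALT", ("CHAR", "a"), ("SEQ", ("CHAR", "a"), ("CHAR", "b")))
print(lexer(r, "ab"))  # ('Right', ('Seq', ('Char', 'a'), ('Char', 'b')))
```

The sketch builds the derivative for each character on the way in and injects the characters back on the way out, which is the two-phase structure of the lexer clauses.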
 \end{tabular}
 \end{center}
 \noindent We claim that this relation captures the idea behind the two
 informal POSIX rules shown in the Introduction: Consider for example the
-rules @{text "P+L"} and @{text "P+R"} where the POSIX value for an
+rules @{text "P+L"} and @{text "P+R"} where the POSIX value for a string
-alternative regular expression is specified---it is always a @{text
+and an alternative regular expression, that is @{term "(s, ALT r\<^sub>1 r\<^sub>2)"},
-"Left"}-value, \emph{except} when the string to be matched is not in the
+is specified---it is always a @{text "Left"}-value, \emph{except} when the
-language of @{term "r\<^sub>1"}; only then it is a @{text Right}-value (see the
+string to be matched is not in the language of @{term "r\<^sub>1"}; only then it
-side-condition in @{text "P+R"}). Interesting is also the rule for
+is a @{text Right}-value (see the side-condition in @{text "P+R"}).
-sequence regular expressions (@{text "PS"}). The first two premises state
+Interesting is also the rule for sequence regular expressions (@{text
-that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} are the POSIX values for
+"PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"}
-@{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
+are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
 respectively. Consider now the third premise and note that the POSIX value
 of this rule should match the string @{term "s\<^sub>1 @ s\<^sub>2"}. According to the
 longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial
 split of @{term "s\<^sub>1 @ s\<^sub>2"} such that @{term "s\<^sub>2"} is still recognised
 by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there
-\emph{exists} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
+\emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
-can be split up into a non-empty @{term "s\<^sub>3"} and @{term "s\<^sub>4"}. Moreover
+can be split up into a non-empty string @{term "s\<^sub>3"} and possibly empty
-the longer @{term "s\<^sub>1 @ s\<^sub>3"} can be matched by @{text "r\<^sub>1"} and the
+string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be
-shorter @{term "s\<^sub>4"} can still be matched by @{term "r\<^sub>2"}. In this case
+matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be
-@{term "s\<^sub>1"} would not be the longest initial split of @{term "s\<^sub>1 @
+matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would not be the
-s\<^sub>2"} and therefore @{term "Seq v\<^sub>1 v\<^sub>2"} cannot be a POSIX value
+longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and therefore @{term "Seq v\<^sub>1
-for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. A similar condition is imposed
+v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}.
-onto the POSIX value in the @{text "P\<star>"}-rule. Also there we want that
+The main point is that this side-condition ensures the longest
-@{term "s\<^sub>1"} is the longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and
+match rule is satisfied.
-furthermore the corresponding value @{term v} cannot be flatten to
-the empty string. In effect, we require that in each ``iteration''
+A similar condition is imposed on the POSIX value in the @{text
-of the star, some parts of the string need to be ``nibbled'' away; only
+"P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
-in case of the empty string we accept @{term "Stars []"} as the
+split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
-POSIX value.
+@{term v} cannot be flattened to the empty string. In effect, we require
+that in each ``iteration'' of the star, some non-empty substring needs to
+be ``chipped'' away; only in case of the empty string we accept @{term
+"Stars []"} as the POSIX value.
 We can prove that given a string @{term s} and regular expression @{term
 r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow>
-v"}.
+v"} (albeit in an uncomputable fashion---for example rule @{term "P+R"}
+would require the calculation of the potentially infinite set @{term "L
+r\<^sub>1"}).
 \begin{theorem}
 @{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]}
 \end{theorem}
-\begin{proof}
+\begin{proof} By induction on the definition of @{term "s \<in> r \<rightarrow> v\<^sub>1"} and
-By induction on the definition of @{term "s \<in> r \<rightarrow> v\<^sub>1"} and a case
+a case analysis of @{term "s \<in> r \<rightarrow> v\<^sub>2"}. This proof requires the
-analysis of @{term "s \<in> r \<rightarrow> v\<^sub>2"}.\qed
+auxiliary lemma that @{thm (prem 1) PMatch1(1)} implies @{thm (concl)
-\end{proof}
+PMatch1(1)} and @{thm (concl) PMatch1(2)}, which are both easily
+established by inductions.\qed \end{proof}
+\noindent
+Next is the lemma that shows the function @{term "mkeps"} calculates
+the POSIX value for the empty string and a nullable regular expression.
 \begin{lemma}\label{lemmkeps}
 @{thm[mode=IfThen] PMatch_mkeps}
 \end{lemma}
 holds. Putting this all together, we can conclude with @{term "(c #
 s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}.
 Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
 sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1
-c v\<^sub>1) \<noteq> []"}. This follows by Lem.~\ref{posixbasic} from @{term "(c # s\<^sub>1)
+c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1)
-\<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \<in> der c
+\<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}  (which in turn follows from @{term "s\<^sub>1 \<in> der c
 r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
 \end{proof}
 \noindent
 With Lem.~\ref{PMatch2} in place, it is completely routine to establish
-that the Sulzmann and Lu lexer satisfies its specification (returning
+that the Sulzmann and Lu lexer satisfies our specification (returning
 an ``error'' iff the string is not in the language of the regular expression,
 and returning a unique POSIX value iff the string \emph{is} in the language):
 \begin{theorem}\mbox{}\smallskip\\
 \begin{tabular}{ll}
 (2) & @{thm (lhs) lex_correct3b} if and only if @{thm (rhs) lex_correct3b}\\
 \end{tabular}
 \end{theorem}
 \begin{proof}
-By induction on @{term s}.\qed
+By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{PMatch2}.\qed
 \end{proof}
-This concludes our correctness proof. Note that we have not changed the
+\noindent This concludes our correctness proof. Note that we have not
-algorithm by Sulzmann and Lu, but introduced our own specification for
+changed the algorithm by Sulzmann and Lu, but introduced our own
-what a correct result---a POSIX value---should be.
+specification for what a correct result---a POSIX value---should be.
+A strong point in favour of Sulzmann and Lu's algorithm is that it
+can be extended in various ways.
 *}
-section {* Extensions *}
+section {* Extensions and Optimisations *}
 text {*
+If we are interested in tokenising strings, then we need to not just
+split up the string into tokens, but also ``classify'' the tokens (for
+example whether it is a keyword or an identifier). This can be
+done with only minor modifications by introducing \emph{record regular
+expressions} and \emph{record values} (for example \cite{Sulzmann2014b}):
+\begin{center}
+@{text "r :="}
+@{text "..."} $\mid$
+@{text "(l : r)"} \qquad\qquad
+@{text "v :="}
+@{text "..."} $\mid$
+@{text "(l : v)"}
+\end{center}
+\noindent where @{text l} is a label, say a string, @{text r} a regular
+expression and @{text v} a value. All functions can be smoothly extended
+to these regular expressions and values. For example @{text "(l : r)"} is
+nullable iff @{term r} is, and so on. The purpose of the record regular
+expression is to mark certain parts of a regular expression and then
+record in the calculated value which parts of the string were matched by
+this part. The label can then serve for classifying tokens. Recall the
+regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for keywords and
+identifiers from the Introduction. With record regular expressions we can
+form @{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"} and then traverse the
+calculated value and only collect the underlying strings in record values.
+With this we obtain finite sequences of pairs of labels and strings, for
+example
+\[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
+\noindent from which tokens with classifications (keyword-token,
+identifier-token and so on) can be extracted.
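
The traversal that collects labelled strings from a value can be sketched as follows. The tagged-tuple encoding of values, the `("Rec", label, v)` constructor and the function names `flat` and `env` are assumptions made for this illustration; whether nested records should also be reported is a design choice, and the sketch reports them.

```python
# Assumed encoding (illustration only): values are tagged tuples,
# extended with ("Rec", label, v) for record values.

def flat(v):
    """Underlying string of a value."""
    tag = v[0]
    if tag == "Void":
        return ""
    if tag == "Char":
        return v[1]
    if tag in ("Left", "Right"):
        return flat(v[1])
    if tag == "Seq":
        return flat(v[1]) + flat(v[2])
    if tag == "Stars":
        return "".join(flat(u) for u in v[1])
    if tag == "Rec":
        return flat(v[2])
    raise ValueError("unknown value: %r" % (v,))

def env(v):
    """List of (label, string) pairs for all record values, left to right."""
    tag = v[0]
    if tag in ("Void", "Char"):
        return []
    if tag in ("Left", "Right"):
        return env(v[1])
    if tag == "Seq":
        return env(v[1]) + env(v[2])
    if tag == "Stars":
        return [pair for u in v[1] for pair in env(u)]
    if tag == "Rec":
        # Report this record first, then any records nested inside it.
        return [(v[1], flat(v[2]))] + env(v[2])
    raise ValueError("unknown value: %r" % (v,))

# A hand-built value, written only to exercise the traversal; it is
# not claimed to be the POSIX value computed by a lexer for "ifx".
v = ("Stars", [
    ("Left", ("Rec", "key", ("Seq", ("Char", "i"), ("Char", "f")))),
    ("Right", ("Rec", "id", ("Char", "x"))),
])
print(env(v))  # [('key', 'if'), ('id', 'x')]
```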
 Derivatives as calculated by \Brz's method are usually more complex
 regular expressions than the initial one; the result is that the matching
-and lexing algorithms are often absymally slow.
+and lexing algorithms are often abysmally slow. However, various
+optimisations are possible, such as the simplifications of @{term "ALT
+ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r
-various optimisations are
+ONE"} to @{term r}. One of the advantages of having a simple specification
-possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT
+and correctness proof is that the latter can be refined to allow for such
-r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of
+optimisations while the correctness proof stays simple.
-the advantages of having a simple specification and correctness proof is
-that the latter can be refined to allow for such optimisations and simple
+While the simplification of regular expressions according to
-correctness proof.
+rules like these is well understood, there is a problem with the POSIX
+value calculation algorithm by Sulzmann and Lu: if we build a derivative
+regular expression and then simplify it, we will calculate a POSIX value
+for this simplified regular expression, \emph{not} for the original
+(unsimplified) derivative regular expression. Sulzmann and Lu overcome
+this problem by not just calculating a simplified regular expression, but
+also calculating a \emph{rectification function} that ``repairs'' the
+incorrect value.
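
The idea of pairing each simplification with a rectification function can be sketched as below. This is a minimal illustration under our own assumptions (tagged-tuple encoding, the name `simp`, and coverage of only the four simplification rules mentioned above); it is not claimed to be the exact formulation used by Sulzmann and Lu.

```python
# Assumed encoding (illustration only): tagged tuples, as in
#   ("ZERO",) ("ONE",) ("CHAR", c) ("ALT", r1, r2) ("SEQ", r1, r2) ("STAR", r)
# for regular expressions, and
#   ("Void",) ("Char", c) ("Left", v) ("Right", v) ("Seq", v1, v2)
# for values.

ZERO, ONE = ("ZERO",), ("ONE",)

def simp(r):
    """Return (r', f): r' is r simplified, and f maps a value for r'
    back to a value for the original r (the rectification function)."""
    tag = r[0]
    if tag == "ALT":
        r1, f1 = simp(r[1])
        r2, f2 = simp(r[2])
        if r1 == ZERO:                      # ALT ZERO r  ->  r
            return r2, lambda v: ("Right", f2(v))
        if r2 == ZERO:                      # ALT r ZERO  ->  r
            return r1, lambda v: ("Left", f1(v))
        def rect_alt(v):
            if v[0] == "Left":
                return ("Left", f1(v[1]))
            return ("Right", f2(v[1]))
        return ("ALT", r1, r2), rect_alt
    if tag == "SEQ":
        r1, f1 = simp(r[1])
        r2, f2 = simp(r[2])
        if r1 == ONE:                       # SEQ ONE r  ->  r
            return r2, lambda v: ("Seq", f1(("Void",)), f2(v))
        if r2 == ONE:                       # SEQ r ONE  ->  r
            return r1, lambda v: ("Seq", f1(v), f2(("Void",)))
        return ("SEQ", r1, r2), lambda v: ("Seq", f1(v[1]), f2(v[2]))
    return r, lambda v: v                   # nothing to simplify

# SEQ ONE (ALT ZERO a) simplifies to a; the rectification function
# rebuilds the value expected for the unsimplified regular expression.
r = ("SEQ", ONE, ("ALT", ZERO, ("CHAR", "a")))
r_simp, rectify = simp(r)
print(r_simp)                    # ('CHAR', 'a')
print(rectify(("Char", "a")))    # ('Seq', ('Void',), ('Right', ('Char', 'a')))
```

In a lexer, such a function would be applied to the value calculated for the simplified derivative before the character is injected back.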
 *}
 section {* The Argument by Sulzmann and Lu *}

changeset 126	e866678c29cb
parent 125	ff0844860981
child 127	b208bc047eed