thys/Paper/Paper.thy
changeset 126 e866678c29cb
parent 125 ff0844860981
child 127 b208bc047eed
equal deleted inserted replaced
125:ff0844860981 126:e866678c29cb
   153 For @{text "iffoo"} we obtain by the longest match rule a single identifier
   153 For @{text "iffoo"} we obtain by the longest match rule a single identifier
   154 token, not a keyword followed by an identifier. For @{text "if"} we obtain by
   154 token, not a keyword followed by an identifier. For @{text "if"} we obtain by
   155 the priority rule a keyword token, not an identifier token---even if @{text
   155 the priority rule a keyword token, not an identifier token---even if @{text
   156 "r\<^bsub>id\<^esub>"} matches also.\bigskip
   156 "r\<^bsub>id\<^esub>"} matches also.\bigskip
   157 
   157 
   158 \noindent {\bf Contributions:} (NOT DONE YET) We have implemented in
   158 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
   159 Isabelle/HOL the derivative-based regular expression matching algorithm as
   159 derivative-based regular expression matching algorithm as described by
   160 described by Sulzmann and Lu \cite{Sulzmann2014}. We have proved the
   160 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
   161 correctness of this algorithm according to our specification of what a POSIX
   161 algorithm according to our specification of what a POSIX value is. Sulzmann
   162 value is. Sulzmann and Lu sketch in \cite{Sulzmann2014} an informal
   162 and Lu sketch in \cite{Sulzmann2014} an informal correctness proof: but to
   163 correctness proof: but to us it contains unfillable gaps.
   163 us it contains unfillable gaps.\footnote{An extended version of
   164 
   164 \cite{Sulzmann2014} is available at the website of its first author; this
   165 informal correctness proof given in \cite{Sulzmann2014} is in final
   165 extended version already includes remarks in the appendix that their
   166 form\footnote{} and to us contains unfillable gaps.
   166 informal proof contains gaps, and possible fixes are not fully worked out.}
   167 
       
   168 Our specification of a POSIX value consists of a simple inductive definition
   167 Our specification of a POSIX value consists of a simple inductive definition
   169 that given a string and a regular expression uniquely determines this value.
   168 that given a string and a regular expression uniquely determines this value.
   170 Derivatives as calculated by Brzozowski's method are usually more complex
   169 Derivatives as calculated by Brzozowski's method are usually more complex
   171 regular expressions than the initial one; various optimisations are
   170 regular expressions than the initial one; various optimisations are
   172 possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
   171 possible. We prove the correctness when simplifications of @{term "ALT ZERO
   173 ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
   172 r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
   174 advantages of having a simple specification and correctness proof is that
   173 @{term r} are applied.
   175 the latter can be refined to allow for such optimisations and simple
   174 
   176 correctness proof.
   175 %An extended version of \cite{Sulzmann2014} is available at the website of
   177 
   176 %its first author; this includes some ``proofs'', claimed in
   178 An extended version of \cite{Sulzmann2014} is available at the website of
   177 %\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
   179 its first author; this includes some ``proofs'', claimed in
   178 %final form, we make no comment thereon, preferring to give general reasons
   180 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
   179 %for our belief that the approach of \cite{Sulzmann2014} is problematic
   181 final form, we make no comment thereon, preferring to give general reasons
   180 %rather than to discuss details of unpublished work.
   182 for our belief that the approach of \cite{Sulzmann2014} is problematic
       
   183 rather than to discuss details of unpublished work.
       
   184 
   181 
   185 *}
   182 *}
   186 
   183 
   187 section {* Preliminaries *}
   184 section {* Preliminaries *}
   188 
   185 
   269   @{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
   266   @{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
   270   @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
   267   @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
   271   @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
   268   @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
   272   @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
   269   @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
   273   @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
   270   @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
       
   271   \end{tabular}
       
   272   \end{center}
       
   273 
       
   274   \begin{center}
       
   275   \begin{tabular}{lcl}
   274   @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
   276   @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
   275   @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
   277   @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
   276   @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
   278   @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
   277   @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
   279   @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
   278   @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
   280   @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
   524   not allow is to build this constraint explicitly into our function
   526   not allow is to build this constraint explicitly into our function
   525   definition.\footnote{Sulzmann and Lu state this clause as @{thm (lhs)
   527   definition.\footnote{Sulzmann and Lu state this clause as @{thm (lhs)
   526   injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]},
   528   injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]},
   527   but our deviation is harmless.}
   529   but our deviation is harmless.}
   528 
   530 
   529   The idea of @{term inj} to ``inject back'' a character into a value can
   531   The idea of the @{term inj}-function to ``inject'' a character, say
   530   be made precise by the first part of the following lemma; the second
   532   @{term c}, into a value can be made precise by the first part of the
   531   part shows that the underlying string of an @{const mkeps}-value is
   533   following lemma, which shows that the underlying string of an injected
   532   always the empty string.
   534   value has a prepend character @{term c}; the second part shows that the
       
   535   underlying string of an @{const mkeps}-value is always the empty string
       
   536   (given the regular expression is nullable since otherwise @{text mkeps}
       
   537   might not be defined).
   533 
   538 
   534   \begin{lemma}\mbox{}\smallskip\\\label{Prf_injval_flat}
   539   \begin{lemma}\mbox{}\smallskip\\\label{Prf_injval_flat}
   535   \begin{tabular}{ll}
   540   \begin{tabular}{ll}
   536   (1) & @{thm[mode=IfThen] Prf_injval_flat}\\
   541   (1) & @{thm[mode=IfThen] Prf_injval_flat}\\
   537   (2) & @{thm[mode=IfThen] mkeps_flat}
   542   (2) & @{thm[mode=IfThen] mkeps_flat}
   538   \end{tabular}
   543   \end{tabular}
   539   \end{lemma}
   544   \end{lemma}
   540 
   545 
   541   \begin{proof}
   546   \begin{proof}
   542   Both properties are by routine inductions: the first one, for example,
   547   Both properties are by routine inductions: the first one can, for example,
   543   by an induction over the definition of @{term derivatives}; the second by
   548   be proved by an induction over the definition of @{term derivatives}; the second by
   544   induction on @{term r}. There are no interesting cases.\qed
   549   an induction on @{term r}. There are no interesting cases.\qed
   545   \end{proof}
   550   \end{proof}
   546 
   551 
   547   Having defined the @{const mkeps} and @{text inj} function we can extend
   552   Having defined the @{const mkeps} and @{text inj} function we can extend
   548   \Brz's matcher so that a [lexical] value is constructed (assuming the
   553   \Brz's matcher so that a [lexical] value is constructed (assuming the
   549   regular expression matches the string). The clauses of the lexer are
   554   regular expression matches the string). The clauses of the lexer are
   556                      & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   561                      & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   557   \end{tabular}
   562   \end{tabular}
   558   \end{center}
   563   \end{center}
   559 
   564 
   560   \noindent If the regular expression does not match the string, @{const None} is
   565   \noindent If the regular expression does not match the string, @{const None} is
   561   returned, indicating an error is raised. If the regular expression does
   566   returned, indicating an error is raised. If the regular expression \emph{does}
   562   match the string, then @{const Some} value is returned. One important
   567   match the string, then @{const Some} value is returned. One important
   563   virtue of this algorithm is that it can be implemented with ease in a
   568   virtue of this algorithm is that it can be implemented with ease in a
   564   functional programming language and also in Isabelle/HOL. In the remaining
   569   functional programming language and also in Isabelle/HOL. In the remaining
   565   part of this section we prove that this algorithm is correct.
   570   part of this section we prove that this algorithm is correct.
   566 
   571 
   601   \end{tabular}
   606   \end{tabular}
   602   \end{center}
   607   \end{center}
   603 
   608 
   604   \noindent We claim that this relation captures the idea behind the two
   609   \noindent We claim that this relation captures the idea behind the two
   605   informal POSIX rules shown in the Introduction: Consider for example the
   610   informal POSIX rules shown in the Introduction: Consider for example the
   606   rules @{text "P+L"} and @{text "P+R"} where the POSIX value for an
   611   rules @{text "P+L"} and @{text "P+R"} where the POSIX value for a string
   607   alternative regular expression is specified---it is always a @{text
   612   and an alternative regular expression, that is @{term "(s, ALT r\<^sub>1 r\<^sub>2)"},
   608   "Left"}-value, \emph{except} when the string to be matched is not in the
   613   is specified---it is always a @{text "Left"}-value, \emph{except} when the
   609   language of @{term "r\<^sub>1"}; only then it is a @{text Right}-value (see the
   614   string to be matched is not in the language of @{term "r\<^sub>1"}; only then it
   610   side-condition in @{text "P+R"}). Interesting is also the rule for
   615   is a @{text Right}-value (see the side-condition in @{text "P+R"}).
   611   sequence regular expressions (@{text "PS"}). The first two premises state
   616   Interesting is also the rule for sequence regular expressions (@{text
   612   that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} are the POSIX values for
   617   "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"}
   613   @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
   618   are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
   614   respectively. Consider now the third premise and note that the POSIX value
   619   respectively. Consider now the third premise and note that the POSIX value
   615   of this rule should match the string @{term "s\<^sub>1 @ s\<^sub>2"}. According to the
   620   of this rule should match the string @{term "s\<^sub>1 @ s\<^sub>2"}. According to the
   616   longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial
   621   longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial
   617   split of @{term "s\<^sub>1 @ s\<^sub>2"} such that @{term "s\<^sub>2"} is still recognised
   622   split of @{term "s\<^sub>1 @ s\<^sub>2"} such that @{term "s\<^sub>2"} is still recognised
   618   by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there
   623   by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there
   619   \emph{exists} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
   624   \emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
   620   can be split up into a non-empty @{term "s\<^sub>3"} and @{term "s\<^sub>4"}. Moreover
   625   can be split up into a non-empty string @{term "s\<^sub>3"} and possibly empty
   621   the longer @{term "s\<^sub>1 @ s\<^sub>3"} can be matched by @{text "r\<^sub>1"} and the
   626   string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be
   622   shorter @{term "s\<^sub>4"} can still be matched by @{term "r\<^sub>2"}. In this case
   627   matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be
   623   @{term "s\<^sub>1"} would not be the longest initial split of @{term "s\<^sub>1 @
   628   matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would not be the
   624   s\<^sub>2"} and therefore @{term "Seq v\<^sub>1 v\<^sub>2"} cannot be a POSIX value
   629   longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and therefore @{term "Seq v\<^sub>1
   625   for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. A similar condition is imposed
   630   v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. 
   626   onto the POSIX value in the @{text "P\<star>"}-rule. Also there we want that
   631   The main point is that this side-condition ensures the longest 
   627    @{term "s\<^sub>1"} is the longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and
   632   match rule is satisfied.
   628    furthermore the corresponding value @{term v} cannot be flatten to
   633 
   629    the empty string. In effect, we require that in each ``iteration''
   634   A similar condition is imposed on the POSIX value in the @{text
   630    of the star, some parts of the string need to be ``nibbled'' away; only
   635   "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
   631    in case of the empty string weBy accept @{term "Stars []"} as the 
   636   split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
   632    POSIX value.
   637   @{term v} cannot be flatten to the empty string. In effect, we require
       
   638   that in each ``iteration'' of the star, some non-empty substring need to
       
   639   be ``chipped'' away; only in case of the empty string we accept @{term
       
   640   "Stars []"} as the POSIX value.
   633 
   641 
   634    We can prove that given a string @{term s} and regular expression @{term
   642    We can prove that given a string @{term s} and regular expression @{term
   635    r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow>
   643    r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow>
   636    v"}.
   644    v"} (albeilt in an uncomputable fashion---for example rule @{term "P+R"}
       
   645    would require the calculation of the potentially infinite set @{term "L
       
   646    r\<^sub>1"}).
   637 
   647 
   638   \begin{theorem}
   648   \begin{theorem}
   639   @{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]}
   649   @{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]}
   640   \end{theorem}
   650   \end{theorem}
   641 
   651 
   642   \begin{proof}
   652   \begin{proof} By induction on the definition of @{term "s \<in> r \<rightarrow> v\<^sub>1"} and
   643   By induction on the definition of @{term "s \<in> r \<rightarrow> v\<^sub>1"} and a case
   653   a case analysis of @{term "s \<in> r \<rightarrow> v\<^sub>2"}. This proof requires the
   644   analysis of @{term "s \<in> r \<rightarrow> v\<^sub>2"}.\qed
   654   auxiliary lemma that @{thm (prem 1) PMatch1(1)} implies @{thm (concl)
   645   \end{proof}
   655   PMatch1(1)} and @{thm (concl) PMatch1(2)}, which are both easily
       
   656   established by inductions.\qed \end{proof}
       
   657 
       
   658   \noindent
       
   659   Next is the lemma that shows the function @{term "mkeps"} calculates
       
   660   the posix value for the empty string and a nullable regular expression.
   646 
   661 
   647   \begin{lemma}\label{lemmkeps}
   662   \begin{lemma}\label{lemmkeps}
   648   @{thm[mode=IfThen] PMatch_mkeps}
   663   @{thm[mode=IfThen] PMatch_mkeps}
   649   \end{lemma}
   664   \end{lemma}
   650 
   665 
   708   holds. Putting this all together, we can conclude with @{term "(c #
   723   holds. Putting this all together, we can conclude with @{term "(c #
   709   s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}.
   724   s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}.
   710 
   725 
   711   Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
   726   Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
   712   sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1
   727   sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1
   713   c v\<^sub>1) \<noteq> []"}. This follows by Lem.~\ref{posixbasic} from @{term "(c # s\<^sub>1)
   728   c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1)
   714   \<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \<in> der c
   729   \<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}  (which in turn follows from @{term "s\<^sub>1 \<in> der c
   715   r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
   730   r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
   716   \end{proof}
   731   \end{proof}
   717 
   732 
   718   \noindent
   733   \noindent
   719   With Lem.~\ref{PMatch2} in place, it is completely routine to establish
   734   With Lem.~\ref{PMatch2} in place, it is completely routine to establish
   720   that the Sulzmann and Lu lexer satisfies its specification (returning
   735   that the Sulzmann and Lu lexer satisfies our specification (returning
   721   an ``error'' iff the string is not in the language of the regular expression,
   736   an ``error'' iff the string is not in the language of the regular expression,
   722   and returning a unique POSIX value iff the string \emph{is} in the language):
   737   and returning a unique POSIX value iff the string \emph{is} in the language):
   723 
   738 
   724   \begin{theorem}\mbox{}\smallskip\\
   739   \begin{theorem}\mbox{}\smallskip\\
   725   \begin{tabular}{ll}
   740   \begin{tabular}{ll}
   727   (2) & @{thm (lhs) lex_correct3b} if and only if @{thm (rhs) lex_correct3b}\\
   742   (2) & @{thm (lhs) lex_correct3b} if and only if @{thm (rhs) lex_correct3b}\\
   728   \end{tabular}
   743   \end{tabular}
   729   \end{theorem}
   744   \end{theorem}
   730 
   745 
   731   \begin{proof}
   746   \begin{proof}
   732   By induction on @{term s}.\qed  
   747   By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{PMatch2}.\qed  
   733   \end{proof}
   748   \end{proof}
   734 
   749 
   735   This concludes our correctness proof. Note that we have not changed the
   750   \noindent This concludes our correctness proof. Note that we have not
   736   algorithm by Sulzmann and Lu, but introduced our own specification for
   751   changed the algorithm by Sulzmann and Lu, but introduced our own
   737   what a correct result---a POSIX value---should be.
   752   specification for what a correct result---a POSIX value---should be.
       
   753   A strong point in favour of Sulzmann and Lu's algorithm is that it
       
   754   can be extended in various ways.
       
   755 
   738 *}
   756 *}
   739 
   757 
   740 section {* Extensions *}
   758 section {* Extensions and Optimisations*}
   741 
   759 
   742 text {*
   760 text {*
       
   761 
       
   762   If we are interested in tokenising string, then we need to not just
       
   763   split up the string into tokens, but also ``classify'' the tokens (for
       
   764   example whether it is a keyword or an identifier). This can be
       
   765   done with only minor modifications by introducing \emph{record regular
       
   766   expressions} and \emph{record values} (for example \cite{Sulzmann2014b}):
       
   767 
       
   768   \begin{center}  
       
   769   @{text "r :="}
       
   770   @{text "..."} $\mid$
       
   771   @{text "(l : r)"} \qquad\qquad
       
   772   @{text "v :="}
       
   773   @{text "..."} $\mid$
       
   774   @{text "(l : v)"}
       
   775   \end{center}
       
   776   
       
   777   \noindent where @{text l} is a label, say a string, @{text r} a regular
       
   778   expression and @{text v} a value. All functions can be smoothly extended
       
   779   to these regular expressions and values. For example @{text "(l : r)"} is
       
   780   nullable iff @{term r} is, and so on. The purpose of the record regular
       
   781   expression is to mark certain parts of a regular expression and then
       
   782   record in the calculated value which parts of the string were matched by
       
   783   this part. The label can then serve for classifying tokens. Recall the
       
   784   regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for keywords and
       
   785   identifiers from the Introduction. With record regular expression we can
       
   786   form @{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"} and then traverse the
       
   787   calculated value and only collect the underlying strings in record values.
       
   788   With this we obtain finite sequences of pairs of labels and strings, for 
       
   789   example
       
   790 
       
   791   \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
       
   792   
       
   793   \noindent from which tokens with classifications (keyword-token,
       
   794   identifier-token and so on) can be extracted.
   743 
   795 
   744   Derivatives as calculated by \Brz's method are usually more complex
   796   Derivatives as calculated by \Brz's method are usually more complex
   745   regular expressions than the initial one; the result is that the matching
   797   regular expressions than the initial one; the result is that the matching
   746   and lexing algorithms are often absymally slow. 
   798   and lexing algorithms are often abysmally slow. However, various
   747 
   799   optimisations are possible, such as the simplifications of @{term "ALT
   748 
   800   ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r
   749   various optimisations are
   801   ONE"} to @{term r}. One of the advantages of having a simple specification
   750   possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT
   802   and correctness proof is that the latter can be refined to allow for such
   751   r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of
   803   optimisations and simple correctness proof.
   752   the advantages of having a simple specification and correctness proof is
   804 
   753   that the latter can be refined to allow for such optimisations and simple
   805   While the simplification of regular expressions according to 
   754   correctness proof.
   806   simplification rules 
   755 
   807 
       
   808 
       
   809   \noindent
       
   810   is well understood, there is a problem with the POSIX
       
   811   value calculation algorithm by Sulzmann and Lu: if we build a derivative
       
   812   regular expression and then simplify it, we will calculate a POSIX value
       
   813   for this simplified regular expression, \emph{not} for the original
       
   814   (unsimplified) derivative regular expression. Sulzmann and Lu overcome
       
   815   this problem by not just calculating a simplified regular expression, but
       
   816   also calculating a \emph{rectification function} that ``repairs'' the
       
   817   incorrect value.
   756 
   818 
   757 *}
   819 *}
   758 
   820 
   759 section {* The Argument by Sulzmmann and Lu *}
   821 section {* The Argument by Sulzmmann and Lu *}
   760 
   822