# HG changeset patch
# User Christian Urban
# Date 1457436296 0
# Node ID 4178b7e718090e45779f2f3876725ade10a74aaf
# Parent  fa0d8aa5d7de2275126b9ede9c9977780e3a8d7f
updated

diff -r fa0d8aa5d7de -r 4178b7e71809 thys/Paper/Paper.thy
--- a/thys/Paper/Paper.thy	Tue Mar 08 11:09:12 2016 +0000
+++ b/thys/Paper/Paper.thy	Tue Mar 08 11:24:56 2016 +0000
@@ -554,7 +554,7 @@
 
   Having defined the @{const mkeps} and @{text inj} function we can extend
   \Brz's matcher so that a [lexical] value is constructed (assuming the
-  regular expression matches the string). The clauses of the lexer are
+  regular expression matches the string). The clauses of the Sulzmann and Lu lexer are
 
   \begin{center}
   \begin{tabular}{lcl}
@@ -638,7 +638,7 @@
   "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
   split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
   @{term v} cannot be flattened to the empty string. In effect, we require
-  that in each ``iteration'' of the star, some non-empty substring need to
+  that in each ``iteration'' of the star, some non-empty substring needs to
   be ``chipped'' away; only in case of the empty string we accept @{term
   "Stars []"} as the POSIX value.
 
@@ -709,7 +709,7 @@
   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
 
   \noindent We can use the induction hypothesis for @{text "r\<^sub>1"} to obtain
-  @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. This allows us to infer
+  @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. Putting this all together allows us to infer
   @{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"}.
   The case @{text "(c)"} is similarly.
 
@@ -722,12 +722,12 @@
 
   \noindent By Lem.~\ref{lemmkeps} we know @{term "[] \<in> r\<^sub>1 \<rightarrow> (mkeps r\<^sub>1)"}
   holds. Putting this all together, we can conclude with @{term "(c #
-  s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}.
+  s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}, as required.
 
   Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
   sequence case, except that we need to ensure that @{term "flat (injval
   r\<^sub>1 c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1)
-  \<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \<in> der c
+  \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \<in> der c
   r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
   \end{proof}
 
@@ -749,7 +749,8 @@
   \end{proof}
 
   \noindent This concludes our correctness proof. Note that we have not
-  changed the algorithm by Sulzmann and Lu, but introduced our own
+  changed the algorithm by Sulzmann and Lu,\footnote{Any deviation we introduced is harmless.}
+  but introduced our own
   specification for what a correct result---a POSIX value---should be. A
   strong point in favour of Sulzmann and Lu's algorithm is that it can be
   extended in various ways.
@@ -760,10 +761,10 @@
 
 text {*
 
-  If we are interested in tokenising string, then we need to not just
+  If we are interested in tokenising strings, then we need to not just
   split up the string into tokens, but also ``classify'' the tokens (for
   example whether it is a keyword or an identifier). This can be
-  done with only minor modifications by introducing \emph{record regular
+  done with only minor modifications to the algorithm by introducing \emph{record regular
   expressions} and \emph{record values} (for example \cite{Sulzmann2014b}):
 
   \begin{center}
@@ -777,14 +778,14 @@
 
   \noindent where @{text l} is a label, say a string, @{text r} a regular
   expression and @{text v} a value. All functions can be smoothly extended
-  to these regular expressions and values. For example @{text "(l : r)"} is
+  to these regular expressions and values. For example \mbox{@{text "(l : r)"}} is
   nullable iff @{term r} is, and so on. The purpose of the record regular
   expression is to mark certain parts of a regular expression and then
   record in the calculated value which parts of the string were matched by
-  this part. The label can then serve for classifying tokens. Recall the
+  this part. The label can then serve as classification of the tokens. For this recall the
   regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for keywords and
   identifiers from the Introduction. With record regular expression we can
-  form @{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"} and then traverse the
+  form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}} and then traverse the
   calculated value and only collect the underlying strings in record
   values. With this we obtain finite sequences of pairs of labels and
   strings, for example
@@ -799,9 +800,11 @@
   and lexing algorithms are often abysmally slow. However, various
   optimisations are possible, such as the simplifications of @{term "ALT
   ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r
-  ONE"} to @{term r}. One of the advantages of having a simple specification
-  and correctness proof is that the latter can be refined to allow for such
-  optimisations and simple correctness proof.
+  ONE"} to @{term r}. These simplifications can speed up the algorithms considerably,
+  as noted in \cite{Sulzmann2014}.
+  One of the advantages of having a simple specification
+  and correctness proof is that the latter can be refined to prove the
+  correctness of such simplification steps.
 
   While the simplification of regular expressions according to
   rules like
@@ -818,7 +821,7 @@
   \noindent is well understood, there is an obstacle with the POSIX value
   calculation algorithm by Sulzmann and Lu: if we build a derivative regular
   expression and then simplify it, we will calculate a POSIX value for this
-  simplified regular expression, \emph{not} for the original (unsimplified)
+  simplified derivative regular expression, \emph{not} for the original (unsimplified)
   derivative regular expression. Sulzmann and Lu overcome this obstacle by
   not just calculating a simplified regular expression, but also calculating
   a \emph{rectification function} that ``repairs'' the incorrect value.
@@ -1124,7 +1127,7 @@
   already acknowledge some small problems, but our experience suggests
   there are more serious problems.
 
-  Closesly related to our work is an automata-based lexer formalised by
+  Closely related to our work is an automata-based lexer formalised by
   Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
   initial substrings, but Nipkow's algorithm is not completely
   computational. The algorithm by Sulzmann and Lu, in contrast, can be
diff -r fa0d8aa5d7de -r 4178b7e71809 thys/paper.pdf
Binary file thys/paper.pdf has changed
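
The first Paper.thy hunk above refers to the clauses of the Sulzmann and Lu
lexer without quoting them. For readers without the paper to hand, the
following is a minimal executable sketch of that derivative-based lexer in
Haskell; it is an illustration, not the paper's verified Isabelle code, and
all names (Rexp, Val, nullable, der, mkeps, inj, lexer) are chosen for this
sketch only.

data Rexp = ZERO | ONE | CHAR Char
          | ALT Rexp Rexp | SEQ Rexp Rexp | STAR Rexp
          deriving Show

data Val = Empty | Chr Char | Sequ Val Val
         | Lft Val | Rht Val | Stars [Val]
         deriving Show

-- nullable r holds iff r matches the empty string
nullable :: Rexp -> Bool
nullable ZERO        = False
nullable ONE         = True
nullable (CHAR _)    = False
nullable (ALT r1 r2) = nullable r1 || nullable r2
nullable (SEQ r1 r2) = nullable r1 && nullable r2
nullable (STAR _)    = True

-- Brzozowski derivative of r with respect to the character c
der :: Char -> Rexp -> Rexp
der _ ZERO        = ZERO
der _ ONE         = ZERO
der c (CHAR d)    = if c == d then ONE else ZERO
der c (ALT r1 r2) = ALT (der c r1) (der c r2)
der c (SEQ r1 r2)
  | nullable r1   = ALT (SEQ (der c r1) r2) (der c r2)
  | otherwise     = SEQ (der c r1) r2
der c (STAR r)    = SEQ (der c r) (STAR r)

-- mkeps picks the POSIX value for a nullable r matching []
mkeps :: Rexp -> Val
mkeps ONE         = Empty
mkeps (ALT r1 r2) = if nullable r1 then Lft (mkeps r1) else Rht (mkeps r2)
mkeps (SEQ r1 r2) = Sequ (mkeps r1) (mkeps r2)
mkeps (STAR _)    = Stars []
mkeps r           = error ("mkeps: not nullable: " ++ show r)

-- inj r c v turns a value v for (der c r) into a value for r
inj :: Rexp -> Char -> Val -> Val
inj (CHAR d)    _ Empty               = Chr d
inj (ALT r1 _)  c (Lft v)             = Lft (inj r1 c v)
inj (ALT _ r2)  c (Rht v)             = Rht (inj r2 c v)
inj (SEQ r1 _)  c (Sequ v1 v2)        = Sequ (inj r1 c v1) v2
inj (SEQ r1 _)  c (Lft (Sequ v1 v2))  = Sequ (inj r1 c v1) v2
inj (SEQ r1 r2) c (Rht v2)            = Sequ (mkeps r1) (inj r2 c v2)
inj (STAR r)    c (Sequ v (Stars vs)) = Stars (inj r c v : vs)
inj r           _ v                   = error ("inj: " ++ show (r, v))

-- the two lexer clauses the hunk at line 554 refers to
lexer :: Rexp -> String -> Maybe Val
lexer r []    = if nullable r then Just (mkeps r) else Nothing
lexer r (c:s) = inj r c <$> lexer (der c r) s

For instance, for r = STAR (ALT (CHAR 'a') (SEQ (CHAR 'a') (CHAR 'a'))) the
call lexer r "aa" returns Just (Stars [Rht (Sequ (Chr 'a') (Chr 'a'))]): a
single star iteration matching "aa" through the longer alternative, rather
than two iterations matching "a" each, in line with the longest-match
behaviour of the star rule discussed in the hunk at line 638.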
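
The record regular expressions (l : r) from the tokenisation hunks can be
bolted onto the same sketch. Again this is only an illustration under names
of our choosing (RECD, Rec, flat, env); the hunk's claim that all functions
can be smoothly extended amounts to one extra clause per function, noted in
the comments below.

-- Repeats the value type of the sketch above, now with a Rec constructor
-- for record values. On the regex side one would add RECD String Rexp
-- with nullable (RECD _ r) = nullable r, der c (RECD _ r) = der c r,
-- mkeps (RECD l r) = Rec l (mkeps r) and
-- inj (RECD l r) c v = Rec l (inj r c v).
data Val = Empty | Chr Char | Sequ Val Val
         | Lft Val | Rht Val | Stars [Val]
         | Rec String Val
         deriving Show

-- flat recovers the substring a value stands for
flat :: Val -> String
flat Empty        = ""
flat (Chr c)      = [c]
flat (Lft v)      = flat v
flat (Rht v)      = flat v
flat (Sequ v1 v2) = flat v1 ++ flat v2
flat (Stars vs)   = concatMap flat vs
flat (Rec _ v)    = flat v

-- env collects the (label, substring) pairs under record values,
-- i.e. the classified token sequence
env :: Val -> [(String, String)]
env (Rec l v)    = (l, flat v) : env v
env (Lft v)      = env v
env (Rht v)      = env v
env (Sequ v1 v2) = env v1 ++ env v2
env (Stars vs)   = concatMap env vs
env _            = []

With the record regular expression ((key : r_key) + (id : r_id))* from the
text, applying env to the value returned by the lexer yields the label/string
pairs, e.g. a list of the form [("key","if"), ("id","x"), ...] for some
hypothetical keyword and identifier regexes r_key and r_id.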
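
Finally, the simplification and rectification idea from the last two
Paper.thy hunks can be sketched in the same style. The function simp below is
our own formulation, not Sulzmann and Lu's exact definition: it applies the
four rules named in the hunk at line 799 bottom-up and returns, alongside the
simplified regular expression, a function that "repairs" a value for the
simplified derivative into one for the unsimplified derivative (reusing Rexp,
Val, nullable, der, mkeps and inj from the first sketch).

simp :: Rexp -> (Rexp, Val -> Val)
simp (ALT r1 r2) =
  case (simp r1, simp r2) of
    ((ZERO, _ ), (r2', f2)) -> (r2', Rht . f2)   -- ALT ZERO r ~> r
    ((r1', f1), (ZERO, _ )) -> (r1', Lft . f1)   -- ALT r ZERO ~> r
    ((r1', f1), (r2', f2))  ->
      (ALT r1' r2', \v -> case v of
                            Lft w -> Lft (f1 w)
                            Rht w -> Rht (f2 w)
                            _     -> error "rectify: ALT")
simp (SEQ r1 r2) =
  case (simp r1, simp r2) of
    ((ONE, f1), (r2', f2)) -> (r2', \v -> Sequ (f1 Empty) (f2 v))  -- SEQ ONE r ~> r
    ((r1', f1), (ONE, f2)) -> (r1', \v -> Sequ (f1 v) (f2 Empty))  -- SEQ r ONE ~> r
    ((r1', f1), (r2', f2)) ->
      (SEQ r1' r2', \v -> case v of
                            Sequ v1 v2 -> Sequ (f1 v1) (f2 v2)
                            _          -> error "rectify: SEQ")
simp r = (r, id)

-- lexer that simplifies every derivative and rectifies the value
-- before injecting the character back in
lexerSimp :: Rexp -> String -> Maybe Val
lexerSimp r []    = if nullable r then Just (mkeps r) else Nothing
lexerSimp r (c:s) = let (r', rect) = simp (der c r)
                    in  inj r c . rect <$> lexerSimp r' s

The obstacle described in the hunk at line 818 is visible in lexerSimp: the
recursive call produces a value for the simplified derivative r', so rect
must be applied before inj; otherwise the character would be injected into a
value for the wrong (simplified) regular expression.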