# HG changeset patch
# User Christian Urban
# Date 1457731269 0
# Node ID 71f4ecc0884993ed1c3f154c58a30f34eb8fb60b
# Parent  da81ffac4b10b16930bb43205b8967dfc1bea206
updated

diff -r da81ffac4b10 -r 71f4ecc08849 thys/Paper/Paper.thy
--- a/thys/Paper/Paper.thy	Fri Mar 11 13:53:53 2016 +0000
+++ b/thys/Paper/Paper.thy	Fri Mar 11 21:21:09 2016 +0000
@@ -707,7 +707,7 @@
   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]

   \noindent We can use the induction hypothesis for @{text "r\<^sub>1"} to obtain
-  @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. This all put together allows us to infer
+  @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. Putting this all together allows us to infer
   @{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"}. The case @{text "(c)"}
   is similarly.

@@ -723,7 +723,7 @@
   s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}, as required.

   Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
-  sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1
+  sequence case, except that we need to also ensure that @{term "flat (injval r\<^sub>1
   c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}
   (which in turn follows from @{term "s\<^sub>1 \<in> der c r\<^sub>1 \<rightarrow> v\<^sub>1"} and the
   induction hypothesis).\qed
@@ -747,11 +747,10 @@

   \end{proof}

   \noindent This concludes our correctness proof. Note that we have not
-  changed the algorithm by Sulzmann and Lu,\footnote{Any deviation we introduced is harmless.}
-  but introduced our own
-  specification for what a correct result---a POSIX value---should be.
-  A strong point in favour of Sulzmann and Lu's algorithm is that it
-  can be extended in various ways.
+  changed the algorithm of Sulzmann and Lu,\footnote{Any deviation we
+  introduced is harmless.} but introduced our own specification for what a
+  correct result---a POSIX value---should be. A strong point in favour of
+  Sulzmann and Lu's algorithm is that it can be extended in various ways.

*}
@@ -759,11 +758,12 @@

text {*

-  If we are interested in tokenising strings, then we need to not just
-  split up the string into tokens, but also ``classify'' the tokens (for
-  example whether it is a keyword or an identifier). This can be
-  done with only minor modifications to the algorithm by introducing \emph{record regular
-  expressions} and \emph{record values} (for example \cite{Sulzmann2014b}):
+  If we are interested in tokenising a string, then we need to not just
+  split up the string into tokens, but also ``classify'' the tokens (for
+  example whether it is a keyword or an identifier). This can be done with
+  only minor modifications to the algorithm by introducing \emph{record
+  regular expressions} and \emph{record values} (for example
+  \cite{Sulzmann2014b}):

  \begin{center}
  @{text "r :="}
@@ -776,17 +776,17 @@

  \noindent where @{text l} is a label, say a string, @{text r} a regular
  expression and @{text v} a value. All functions can be smoothly extended
-  to these regular expressions and values. For example \mbox{@{text "(l : r)"}} is
-  nullable iff @{term r} is, and so on. The purpose of the record regular
-  expression is to mark certain parts of a regular expression and then
-  record in the calculated value which parts of the string were matched by
-  this part. The label can then serve as classification of the tokens. For this recall the
-  regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for keywords and
-  identifiers from the Introduction. With record regular expression we can
-  form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}} and then traverse the
-  calculated value and only collect the underlying strings in record values.
-  With this we obtain finite sequences of pairs of labels and strings, for
-  example
+  to these regular expressions and values. For example \mbox{@{text "(l :
+  r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
+  regular expression is to mark certain parts of a regular expression and
+  then record in the calculated value which parts of the string were matched
+  by this part. The label can then serve as classification for the tokens.
+  For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
+  keywords and identifiers from the Introduction. With record regular
+  expressions we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
+  and then traverse the calculated value and only collect the underlying
+  strings in record values. With this we obtain finite sequences of pairs of
+  labels and strings, for example

  \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]

@@ -794,15 +794,15 @@

  identifier-token and so on) can be extracted.

  Derivatives as calculated by \Brz's method are usually more complex
-  regular expressions than the initial one; the result is that the matching
-  and lexing algorithms are often abysmally slow. However, various
-  optimisations are possible, such as the simplifications of @{term "ALT
-  ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r
-  ONE"} to @{term r}. These simplifications can speed up the algorithms considerably,
-  as noted in \cite{Sulzmann2014}.
-  One of the advantages of having a simple specification
-  and correctness proof is that the latter can be refined to prove the
-  correctness of such simplification steps.
+  regular expressions than the initial one; the result is that the
+  derivative-based matching and lexing algorithms are often abysmally slow.
+  However, various optimisations are possible, such as the simplifications
+  of @{term "ALT ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and
+  @{term "SEQ r ONE"} to @{term r}. These simplifications can speed up the
+  algorithms considerably, as noted in \cite{Sulzmann2014}. One of the
+  advantages of having a simple specification and correctness proof is that
+  the latter can be refined to prove the correctness of such simplification
+  steps.

  While the simplification of regular expressions according to rules like
@@ -860,11 +860,11 @@
  \end{tabular}
  \end{center}

-  \noindent where @{term "id"} stands for the identity function.
-  This function returns a simplified regular expression and a rectification
-  function. Note that
-  we do not simplify under stars: this seems to slow down the algorithm,
-  rather than speed up. The optimised lexer is then given by the clauses:
+  \noindent where @{term "id"} stands for the identity function. This
+  function returns a simplified regular expression and a corresponding
+  rectification function. Note that we do not simplify under stars: this
+  seems to slow down the algorithm, rather than speed it up. The optimised
+  lexer is then given by the clauses:

  \begin{center}
  \begin{tabular}{lcl}

diff -r da81ffac4b10 -r 71f4ecc08849 thys/paper.pdf
Binary file thys/paper.pdf has changed
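
The last two hunks describe a simplification function that returns a simplified regular expression together with a rectification function mapping values of the simplified expression back to values of the original one. The following is a minimal Haskell sketch of that idea, for readers who want to experiment with it; it is not the authors' code (the paper is formalised in Isabelle/HOL) and all datatype, constructor and function names below are invented for illustration. It covers exactly the four simplifications named in the patch (ALT ZERO r, ALT r ZERO, SEQ ONE r and SEQ r ONE, each to r) and, as the patched text notes, it does not simplify under stars.

-- Sketch only: Rexp, Val and all names here are hypothetical, not the
-- definitions used in thys/Paper/Paper.thy.
data Rexp = Zero | One | Chr Char
          | Alt Rexp Rexp | Seq Rexp Rexp | Star Rexp

data Val = Empty | Chv Char | Lft Val | Rgt Val
         | Seqv Val Val | Stars [Val]

-- simp r returns a simplified regular expression and a rectification
-- function that turns a value for the simplified expression back into
-- a value for the original r.
simp :: Rexp -> (Rexp, Val -> Val)
simp (Alt r1 r2) = case (simp r1, simp r2) of
  ((Zero, _ ), (r2', f2)) -> (r2', Rgt . f2)            -- ALT ZERO r ~> r
  ((r1', f1), (Zero, _ )) -> (r1', Lft . f1)            -- ALT r ZERO ~> r
  ((r1', f1), (r2', f2))  -> (Alt r1' r2', fAlt f1 f2)
simp (Seq r1 r2) = case (simp r1, simp r2) of
  ((One, f1), (r2', f2)) -> (r2', \v -> Seqv (f1 Empty) (f2 v))  -- SEQ ONE r ~> r
  ((r1', f1), (One, f2)) -> (r1', \v -> Seqv (f1 v) (f2 Empty))  -- SEQ r ONE ~> r
  ((r1', f1), (r2', f2)) -> (Seq r1' r2', fSeq f1 f2)
simp r = (r, id)  -- in particular, no simplification under Star

-- Rectification for the cases where both subexpressions survive:
-- push the sub-rectifiers under the corresponding value constructor.
fAlt, fSeq :: (Val -> Val) -> (Val -> Val) -> Val -> Val
fAlt f1 _  (Lft v) = Lft (f1 v)
fAlt _  f2 (Rgt v) = Rgt (f2 v)
fAlt _  _  v       = v
fSeq f1 f2 (Seqv v1 v2) = Seqv (f1 v1) (f2 v2)
fSeq _  _  v            = v

An optimised lexer along the lines of the patched clauses would then, at each step, take the derivative, simplify it with simp, lex the rest of the string against the simplified expression, and apply the returned rectification function to the resulting value before injecting the character back with injval.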