diff -r 2c38f10643ae -r 267afb7fb700 thys/Paper/Paper.thy --- a/thys/Paper/Paper.thy Wed Mar 02 04:13:25 2016 +0000 +++ b/thys/Paper/Paper.thy Wed Mar 02 14:06:13 2016 +0000 @@ -39,7 +39,7 @@ character~@{text c}, and showed that it gave a simple solution to the problem of matching a string @{term s} with a regular expression @{term r}: if the derivative of @{term r} w.r.t.\ (in succession) all the characters of -the string matches the empty string $\mts$, then @{term r} matches @{term s} +the string matches the empty string, then @{term r} matches @{term s} (and {\em vice versa}). The derivative has the property (which may be regarded as its specification) that, for every string @{term s} and regular expression @{term r} and character @{term c}, one has @{term "cs \ L(r)"} if @@ -76,29 +76,33 @@ (and algorithm-independent) definition of what we call being a {\em POSIX value} for a regular expression @{term r} and a string @{term s}; we show that the algorithm computes such a value and that such a value is unique. -Proofs are both done by hand and checked in {\em Isabelle/HOL}. The +Proofs are both done by hand and checked in Isabelle/HOL. The experience of doing our proofs has been that this mechanical checking was absolutely essential: this subject area has hidden snares. This was also noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. If a regular expression matches a string, then in general there are more -than one possibility of how the string is matched. There are two commonly -used disambiguation strategies to generate a unique answer: one is called -GREEDY matching \cite{Frisch2004} and the other is POSIX +than one way in which the string is matched. There are two commonly used +disambiguation strategies to generate a unique answer: one is called GREEDY +matching \cite{Frisch2004} and the other is POSIX matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string -@{term xy} and the regular expression @{term "STAR (ALT (ALT x y) xy)"}. +@{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be matched in two `iterations' by the single letter-regular expressions @{term x} and @{term y}, or directly in one iteration by @{term xy}. The first case corresponds to GREEDY matching, which first matches with the left-most symbol and only matches the next -symbol in case of a mismatch. The second case is POSIX matching, which -prefers the longest match. +symbol in case of a mismatch (this is greedy in the sense of preferring +instant gratification to delayed repletion). The second case is POSIX +matching, which prefers the longest match. -In the context of lexing, where an input string is separated into a sequence -of tokens, POSIX is the more natural disambiguation strategy for what -programmers consider basic syntactic building blocks in their programs. -There are two underlying rules behind tokenising using POSIX matching: +In the context of lexing, where an input string needs to be separated into a +sequence of tokens, POSIX is the more natural disambiguation strategy for +what programmers consider basic syntactic building blocks in their programs. +These building blocks are often specified by some regular expressions, say @{text +"r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising +keywords and identifiers, respectively.
There are two underlying rules +behind tokenising a string in a POSIX fashion: \begin{itemize} \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):} @@ -112,17 +116,16 @@ that can match determines the token. \end{itemize} -\noindent Consider for example a regular expression @{text -"r\<^bsub>key\<^esub>"} that can recognise keywords, like @{text "if"}, -@{text "then"} and so on; and another regular expression @{text -"r\<^bsub>id\<^esub>"} that can recognise identifiers (such as single -characters followed by characters or numbers). Then we can form the regular -expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\"} and -use POSIX matching to, for example, tokenise the strings @{text "iffoo"} and -@{text "if"}. In the first case we obtain by the longest match rule a -single identifier token, not a keyword and identifier. In the second case we -obtain by rule priority a keyword token, not an identifier token. -\bigskip +\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising +keywords such as @{text "if"}, @{text "then"} and so on; and @{text +"r\<^bsub>id\<^esub>"} recognising identifiers (a single character followed +by characters or numbers). Then we can form the regular expression @{text +"(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\"} and use POSIX +matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the +first case we obtain by the longest match rule a single identifier token, +not a keyword followed by an identifier. In the second case we obtain by rule +priority a keyword token, not an identifier token---even if @{text +"r\<^bsub>id\<^esub>"} also matches.\bigskip \noindent\textcolor{green}{Not Done Yet} @@ -154,10 +157,10 @@ the empty string being represented by the empty list, written @{term "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usual bracket notation for strings; for example a - string consiting of a single character is written @{term "[c]"}. By + string consisting of a single character is written @{term "[c]"}. By using the type @{type char} for characters we have a supply of finitely many characters roughly corresponding to the ASCII - character set. Regular expression are as usual and defined as the + character set. Regular expressions are defined as usual as the following inductive datatype: \begin{center} @@ -171,9 +174,10 @@ \end{center} \noindent where @{const ZERO} stands for the regular expression that - does not macth any string and @{const ONE} for the regular + does not match any string and @{const ONE} for the regular expression that matches only the empty string. The language of a regular expression - is again defined as usual by the following clauses + is again defined routinely by the recursive function @{term L} with the + clauses: \begin{center} \begin{tabular}{rcl} @@ -186,15 +190,19 @@ \end{tabular} \end{center} - \noindent We use the star-notation for regular expressions and sets of strings. - The Kleene-star on sets is defined inductively. + \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the + concatenation of two languages. We use the star-notation for regular + expressions and sets of strings (in the last clause). The star on sets is + defined inductively as usual by two clauses: one for the empty string being + in the star of a language, and one stating that if @{term "s\<^sub>1"} is in a language + and @{term "s\<^sub>2"} is in the star of this language, then also @{term + "s\<^sub>1 @ s\<^sub>2"} is in the star of this language.
\emph{Semantic derivatives} of sets of strings are defined as \begin{center} \begin{tabular}{lcl} @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ - @{thm (lhs) Ders_def} & $\dn$ & @{thm (rhs) Ders_def}\\ \end{tabular} \end{center} @@ -225,10 +233,7 @@ @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\ @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\ - @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\ - - @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\ - @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\ + @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)} \end{tabular} \end{center}
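
To make the definitions behind the antiquotations above concrete, the following is a small, self-contained Isabelle/HOL sketch of the standard Brzozowski-style functions this section walks through: the language L of a regular expression (with ;; for language concatenation and an inductively defined star), the semantic derivative Der of a set of strings, and the syntactic derivatives der and ders together with the resulting matcher. It is an illustration only: the theory name, rule names, constructor and clause order, and the infix priority of ;; are assumptions made here and may differ from the definitions referred to by the actual Paper.thy and its accompanying theories.

theory Brzozowski_Sketch   (* hypothetical theory name *)
  imports Main
begin

datatype rexp =
    ZERO | ONE | CHAR char
  | SEQ rexp rexp | ALT rexp rexp | STAR rexp

(* language concatenation, written ;; as in the fourth clause of L *)
definition conc :: "string set \<Rightarrow> string set \<Rightarrow> string set"  (infixr ";;" 75)
  where "A ;; B = {s1 @ s2 | s1 s2. s1 \<in> A \<and> s2 \<in> B}"

(* the star of a language, defined inductively by the two clauses described in the text *)
inductive_set star :: "string set \<Rightarrow> string set" for A :: "string set"
  where
    start: "[] \<in> star A"
  | step:  "s1 \<in> A \<Longrightarrow> s2 \<in> star A \<Longrightarrow> s1 @ s2 \<in> star A"

(* the language of a regular expression *)
fun L :: "rexp \<Rightarrow> string set"
  where
    "L ZERO = {}"
  | "L ONE = {[]}"
  | "L (CHAR c) = {[c]}"
  | "L (SEQ r1 r2) = (L r1) ;; (L r2)"
  | "L (ALT r1 r2) = (L r1) \<union> (L r2)"
  | "L (STAR r) = star (L r)"

(* semantic derivative of a set of strings w.r.t. a character *)
definition Der :: "char \<Rightarrow> string set \<Rightarrow> string set"
  where "Der c A = {s. c # s \<in> A}"

(* does a regular expression match the empty string? *)
fun nullable :: "rexp \<Rightarrow> bool"
  where
    "nullable ZERO = False"
  | "nullable ONE = True"
  | "nullable (CHAR c) = False"
  | "nullable (SEQ r1 r2) = (nullable r1 \<and> nullable r2)"
  | "nullable (ALT r1 r2) = (nullable r1 \<or> nullable r2)"
  | "nullable (STAR r) = True"

(* Brzozowski derivative of a regular expression w.r.t. a character ... *)
fun der :: "char \<Rightarrow> rexp \<Rightarrow> rexp"
  where
    "der c ZERO = ZERO"
  | "der c ONE = ZERO"
  | "der c (CHAR d) = (if c = d then ONE else ZERO)"
  | "der c (SEQ r1 r2) =
       (if nullable r1 then ALT (SEQ (der c r1) r2) (der c r2)
        else SEQ (der c r1) r2)"
  | "der c (ALT r1 r2) = ALT (der c r1) (der c r2)"
  | "der c (STAR r) = SEQ (der c r) (STAR r)"

(* ... and w.r.t. a string, character by character *)
fun ders :: "string \<Rightarrow> rexp \<Rightarrow> rexp"
  where
    "ders [] r = r"
  | "ders (c # s) r = ders s (der c r)"

(* r matches s iff the derivative of r w.r.t. all characters of s is nullable *)
definition matcher :: "rexp \<Rightarrow> string \<Rightarrow> bool"
  where "matcher r s = nullable (ders s r)"

value "matcher (SEQ (CHAR (CHR ''a'')) (STAR (CHAR (CHR ''b'')))) ''abb''"  (* True *)

end

The executable part (nullable, der, ders, matcher) can be evaluated directly with the value command as shown; the set-based definitions (L, Der, star) only serve as the specification against which the derivative operations are proved correct.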