updated
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Wed, 02 Mar 2016 14:06:13 +0000
changeset 110 267afb7fb700
parent 109 2c38f10643ae
child 111 289728193164
updated
thys/Paper/Paper.thy
thys/ReStar.thy
thys/paper.pdf
--- a/thys/Paper/Paper.thy	Wed Mar 02 04:13:25 2016 +0000
+++ b/thys/Paper/Paper.thy	Wed Mar 02 14:06:13 2016 +0000
@@ -39,7 +39,7 @@
 character~@{text c}, and showed that it gave a simple solution to the
 problem of matching a string @{term s} with a regular expression @{term r}:
 if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
-the string matches the empty string $\mts$, then @{term r} matches @{term s}
+the string matches the empty string, then @{term r} matches @{term s}
 (and {\em vice versa}). The derivative has the property (which may be
 regarded as its specification) that, for every string @{term s} and regular
 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
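
  A minimal sketch of this derivative-based matcher in Isabelle/HOL is given
  below for orientation. The names der, ders and nullable occur in the hunks
  of this changeset; the constructors CHAR and SEQ and the function matcher
  are assumptions made only for this sketch and need not coincide with the
  definitions in ReStar.thy.

    datatype rexp =
      ZERO | ONE | CHAR char | SEQ rexp rexp | ALT rexp rexp | STAR rexp

    (* nullable r holds iff r matches the empty string *)
    fun nullable :: "rexp \<Rightarrow> bool" where
      "nullable ZERO = False"
    | "nullable ONE = True"
    | "nullable (CHAR c) = False"
    | "nullable (ALT r1 r2) = (nullable r1 \<or> nullable r2)"
    | "nullable (SEQ r1 r2) = (nullable r1 \<and> nullable r2)"
    | "nullable (STAR r) = True"

    (* Brzozowski derivative of r w.r.t. a single character c *)
    fun der :: "char \<Rightarrow> rexp \<Rightarrow> rexp" where
      "der c ZERO = ZERO"
    | "der c ONE = ZERO"
    | "der c (CHAR d) = (if c = d then ONE else ZERO)"
    | "der c (ALT r1 r2) = ALT (der c r1) (der c r2)"
    | "der c (SEQ r1 r2) =
         (if nullable r1 then ALT (SEQ (der c r1) r2) (der c r2)
          else SEQ (der c r1) r2)"
    | "der c (STAR r) = SEQ (der c r) (STAR r)"

    (* derivative w.r.t. a string: take derivatives character by character *)
    fun ders :: "string \<Rightarrow> rexp \<Rightarrow> rexp" where
      "ders [] r = r"
    | "ders (c # s) r = ders s (der c r)"

    (* r matches s iff the derivative of r w.r.t. s matches the empty string *)
    definition matcher :: "rexp \<Rightarrow> string \<Rightarrow> bool" where
      "matcher r s \<equiv> nullable (ders s r)"

  As a usage example, the expression (x + y + xy)* from the introduction
  applied to the string xy can be evaluated directly and yields True:

    value "matcher (STAR (ALT (ALT (CHAR (CHR ''x'')) (CHAR (CHR ''y'')))
                         (SEQ (CHAR (CHR ''x'')) (CHAR (CHR ''y''))))) ''xy''"
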
@@ -76,29 +76,33 @@
 (and algorithm-independent) definition of what we call being a {\em POSIX
 value} for a regular expression @{term r} and a string @{term s}; we show
 that the algorithm computes such a value and that such a value is unique.
-Proofs are both done by hand and checked in {\em Isabelle/HOL}. The
+Proofs are both done by hand and checked in Isabelle/HOL. The
 experience of doing our proofs has been that this mechanical checking was
 absolutely essential: this subject area has hidden snares. This was also
 noted by Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching
 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
 
 If a regular expression matches a string, then in general there are more
-than one possibility of how the string is matched. There are two commonly
-used disambiguation strategies to generate a unique answer: one is called
-GREEDY matching \cite{Frisch2004} and the other is POSIX
+than one way in which the string can be matched. There are two commonly used
+disambiguation strategies to generate a unique answer: one is called GREEDY
+matching \cite{Frisch2004} and the other is POSIX
 matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string
-@{term xy} and the regular expression @{term "STAR (ALT (ALT x y) xy)"}.
+@{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y) xy)"}}.
 Either the string can be matched in two `iterations' by the single
 letter-regular expressions @{term x} and @{term y}, or directly in one
 iteration by @{term xy}. The first case corresponds to GREEDY matching,
 which first matches with the left-most symbol and only matches the next
-symbol in case of a mismatch. The second case is POSIX matching, which
-prefers the longest match. 
+symbol in case of a mismatch (this is greedy in the sense of preferring
+instant gratification to delayed repletion). The second case is POSIX
+matching, which prefers the longest match.
 
-In the context of lexing, where an input string is separated into a sequence
-of tokens, POSIX is the more natural disambiguation strategy for what
-programmers consider basic syntactic building blocks in their programs.
-There are two underlying rules behind tokenising using POSIX matching:
+In the context of lexing, where an input string needs to be separated into a
+sequence of tokens, POSIX is the more natural disambiguation strategy for
+what programmers consider basic syntactic building blocks in their programs.
+These building blocks are often specified by some regular expressions, say @{text
+"r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising
+keywords and identifiers, respectively. There are two underlying rules
+behind tokenising a string in a POSIX fashion:
 
 \begin{itemize} 
 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}
@@ -112,17 +116,16 @@
 that can match determines the token.
 \end{itemize}
  
-\noindent Consider for example a regular expression @{text
-"r\<^bsub>key\<^esub>"} that can recognise keywords, like @{text "if"},
-@{text "then"} and so on; and another regular expression @{text
-"r\<^bsub>id\<^esub>"} that can recognise identifiers (such as single
-characters followed by characters or numbers). Then we can form the regular
-expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and
-use POSIX matching to, for example, tokenise the strings @{text "iffoo"} and
-@{text "if"}. In the first case we obtain by the longest match rule a
-single identifier token, not a keyword and identifier. In the second case we
-obtain by rule priority a keyword token, not an identifier token.
-\bigskip
+\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising
+keywords such as @{text "if"}, @{text "then"} and so on; and @{text
+"r\<^bsub>id\<^esub>"} recognising identifiers (a single character followed
+by characters or numbers). Then we can form the regular expression @{text
+"(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
+matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
+first case we obtain by the longest match rule a single identifier token,
+not a keyword followed by an identifier. In the second case we obtain by
+rule priority a keyword token, not an identifier token---even though
+@{text "r\<^bsub>id\<^esub>"} also matches.\bigskip
 
 \noindent\textcolor{green}{Not Done Yet}
 
@@ -154,10 +157,10 @@
   the empty string being represented by the empty list, written @{term
   "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}.
   Often we use the usual bracket notation for strings; for example a
-  string consiting of a single character is written @{term "[c]"}.  By
+  string consisting of a single character is written @{term "[c]"}.  By
   using the type @{type char} for characters we have a supply of
   finitely many characters roughly corresponding to the ASCII
-  character set.  Regular expression are as usual and defined as the
+  character set.  Regular expressions are defined as usual as the
   following inductive datatype:
 
   \begin{center}
@@ -171,9 +174,10 @@
   \end{center}
 
   \noindent where @{const ZERO} stands for the regular expression that
-  does not macth any string and @{const ONE} for the regular
+  does not match any string and @{const ONE} for the regular
   expression that matches only the empty string. The language of a regular expression
-  is again defined as usual by the following clauses
+  is again defined routinely by the recursive function @{term L} with the
+  clauses:
 
   \begin{center}
   \begin{tabular}{rcl}
@@ -186,15 +190,19 @@
   \end{tabular}
   \end{center}
   
-  \noindent We use the star-notation for regular expressions and sets of strings.
-  The Kleene-star on sets is defined inductively.
+  \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the
+  concatenation of two languages. We use the star-notation for regular
+  expressions and for sets of strings (in the last clause). The star of a
+  language is defined inductively as usual by two clauses: the empty string
+  is in the star of a language; and if @{term "s\<^sub>1"} is in a language
+  and @{term "s\<^sub>2"} is in the star of this language, then also
+  @{term "s\<^sub>1 @ s\<^sub>2"} is in the star of this language.
   
   \emph{Semantic derivatives} of sets of strings are defined as
 
   \begin{center}
   \begin{tabular}{lcl}
   @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
-  @{thm (lhs) Ders_def} & $\dn$ & @{thm (rhs) Ders_def}\\
   \end{tabular}
   \end{center}
   
@@ -225,10 +233,7 @@
   @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
   @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
   @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\
-
-  @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\
-  @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\
+  @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}
   \end{tabular}
   \end{center}
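
  To connect the clauses above with the semantic derivative, the following
  statement sketches continue the earlier sketch definitions (rexp,
  nullable, der, Sequ, Star) and add a language function L with the usual
  clauses. The lemma names echo nullable_correctness and der_correctness
  from ReStar.thy below, but the statements given here are assumptions of
  the sketch and the proofs are elided.

    (* language of a regular expression, with the standard clauses *)
    fun L :: "rexp \<Rightarrow> string set" where
      "L ZERO = {}"
    | "L ONE = {[]}"
    | "L (CHAR c) = {[c]}"
    | "L (SEQ r1 r2) = Sequ (L r1) (L r2)"
    | "L (ALT r1 r2) = L r1 \<union> L r2"
    | "L (STAR r) = Star (L r)"

    (* nullable decides membership of the empty string *)
    lemma nullable_correctness_sketch:
      shows "nullable r \<longleftrightarrow> [] \<in> L r"
      oops

    (* the syntactic derivative tracks the semantic one; the right-hand
       side is Der c (L r) unfolded, with Der as defined in ReStar.thy *)
    lemma der_correctness_sketch:
      shows "L (der c r) = {s. c # s \<in> L r}"
      oops
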
 
--- a/thys/ReStar.thy	Wed Mar 02 04:13:25 2016 +0000
+++ b/thys/ReStar.thy	Wed Mar 02 14:06:13 2016 +0000
@@ -24,18 +24,13 @@
 by (simp_all add: Sequ_def)
 
 
-section {* Semantic Derivatives of Languages *}
+section {* Semantic Derivative of Languages *}
 
 definition
   Der :: "char \<Rightarrow> string set \<Rightarrow> string set"
 where
   "Der c A \<equiv> {s. [c] @ s \<in> A}"
 
-definition 
-  Ders :: "string \<Rightarrow> string set \<Rightarrow> string set"
-where  
-  "Ders s A \<equiv> {s' | s'. s @ s' \<in> A}"
-
 lemma Der_null [simp]:
   shows "Der c {} = {}"
 unfolding Der_def
@@ -167,12 +162,6 @@
 apply(simp_all add: nullable_correctness)
 done
 
-lemma ders_correctness:
-  shows "L (ders s r) = Ders s (L r)"
-apply(induct s arbitrary: r) 
-apply(simp_all add: der_correctness Der_def Ders_def)
-done
-
 
 section {* Values *}
 
Binary file thys/paper.pdf has changed
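
  Finally, a small self-contained illustration of the semantic derivative
  retained by this change: the Der definition from the hunk above, the
  Der_null fact whose proof script is cut off at the hunk boundary, and one
  further closure property. The name Der_union and both proofs are
  assumptions of this sketch rather than claims about the contents of
  ReStar.thy.

    definition Der :: "char \<Rightarrow> string set \<Rightarrow> string set" where
      "Der c A \<equiv> {s. [c] @ s \<in> A}"

    (* the derivative of the empty language is empty *)
    lemma Der_null [simp]:
      shows "Der c {} = {}"
      unfolding Der_def by auto

    (* the derivative distributes over union, which is what makes the
       ALT clause of der go through *)
    lemma Der_union [simp]:
      shows "Der c (A \<union> B) = Der c A \<union> Der c B"
      unfolding Der_def by auto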