--- a/thys/Paper/Paper.thy Wed Mar 02 04:13:25 2016 +0000
+++ b/thys/Paper/Paper.thy Wed Mar 02 14:06:13 2016 +0000
@@ -39,7 +39,7 @@
character~@{text c}, and showed that it gave a simple solution to the
problem of matching a string @{term s} with a regular expression @{term r}:
if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
-the string matches the empty string $\mts$, then @{term r} matches @{term s}
+the string matches the empty string, then @{term r} matches @{term s}
(and {\em vice versa}). The derivative has the property (which may be
regarded as its specification) that, for every string @{term s} and regular
expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
@@ -76,29 +76,33 @@
(and algorithm-independent) definition of what we call being a {\em POSIX
value} for a regular expression @{term r} and a string @{term s}; we show
that the algorithm computes such a value and that such a value is unique.
-Proofs are both done by hand and checked in {\em Isabelle/HOL}. The
+Proofs are both done by hand and checked in Isabelle/HOL. The
experience of doing our proofs has been that this mechanical checking was
absolutely essential: this subject area has hidden snares. This was also
noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
If a regular expression matches a string, then in general there are more
-than one possibility of how the string is matched. There are two commonly
-used disambiguation strategies to generate a unique answer: one is called
-GREEDY matching \cite{Frisch2004} and the other is POSIX
+than one way of how the string is matched. There are two commonly used
+disambiguation strategies to generate a unique answer: one is called GREEDY
+matching \cite{Frisch2004} and the other is POSIX
matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string
-@{term xy} and the regular expression @{term "STAR (ALT (ALT x y) xy)"}.
+@{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y) xy)"}}.
Either the string can be matched in two `iterations' by the single
letter-regular expressions @{term x} and @{term y}, or directly in one
iteration by @{term xy}. The first case corresponds to GREEDY matching,
which first matches with the left-most symbol and only matches the next
-symbol in case of a mismatch. The second case is POSIX matching, which
-prefers the longest match.
+symbol in case of a mismatch (this is greedy in the sense of preferring
+instant gratification to delayed repletion). The second case is POSIX
+matching, which prefers the longest match.
-In the context of lexing, where an input string is separated into a sequence
-of tokens, POSIX is the more natural disambiguation strategy for what
-programmers consider basic syntactic building blocks in their programs.
-There are two underlying rules behind tokenising using POSIX matching:
+In the context of lexing, where an input string needs to be separated into a
+sequence of tokens, POSIX is the more natural disambiguation strategy for
+what programmers consider basic syntactic building blocks in their programs.
+These building blocks are often specified by some regular expressions, say @{text
+"r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising
+keywords and identifiers, respectively. There are two underlying rules
+behind tokenising a string in a POSIX fashion:
\begin{itemize}
\item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}
@@ -112,17 +116,16 @@
that can match determines the token.
\end{itemize}
-\noindent Consider for example a regular expression @{text
-"r\<^bsub>key\<^esub>"} that can recognise keywords, like @{text "if"},
-@{text "then"} and so on; and another regular expression @{text
-"r\<^bsub>id\<^esub>"} that can recognise identifiers (such as single
-characters followed by characters or numbers). Then we can form the regular
-expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and
-use POSIX matching to, for example, tokenise the strings @{text "iffoo"} and
-@{text "if"}. In the first case we obtain by the longest match rule a
-single identifier token, not a keyword and identifier. In the second case we
-obtain by rule priority a keyword token, not an identifier token.
-\bigskip
+\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising
+keywords such as @{text "if"}, @{text "then"} and so on; and @{text
+"r\<^bsub>id\<^esub>"} recognising identifiers (a single character followed
+by characters or numbers). Then we can form the regular expression @{text
+"(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
+matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
+first case we obtain by the longest match rule a single identifier token,
+not a keyword followed by identifier. In the second case we obtain by rule
+priority a keyword token, not an identifier token---even if @{text
+"r\<^bsub>id\<^esub>"} matches also.\bigskip
\noindent\textcolor{green}{Not Done Yet}
@@ -154,10 +157,10 @@
the empty string being represented by the empty list, written @{term
"[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}.
Often we use the usual bracket notation for strings; for example a
- string consiting of a single character is written @{term "[c]"}. By
+ string consisting of a single character is written @{term "[c]"}. By
using the type @{type char} for characters we have a supply of
finitely many characters roughly corresponding to the ASCII
- character set. Regular expression are as usual and defined as the
+ character set. Regular expressions are defined as usual as the
following inductive datatype:
\begin{center}
@@ -171,9 +174,10 @@
\end{center}
\noindent where @{const ZERO} stands for the regular expression that
- does not macth any string and @{const ONE} for the regular
+ does not match any string and @{const ONE} for the regular
expression that matches only the empty string. The language of a regular expression
- is again defined as usual by the following clauses
+ is again defined routinely by the recursive function @{term L} with the
+ clauses:
\begin{center}
\begin{tabular}{rcl}
@@ -186,15 +190,19 @@
\end{tabular}
\end{center}
- \noindent We use the star-notation for regular expressions and sets of strings.
- The Kleene-star on sets is defined inductively.
+ \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the
+ concatenation of two languages. We use the star-notation for regular
+ expressions and sets of strings (in the last clause). The star on sets is
+ defined inductively as usual by two clauses for the empty string being in
+ the star of a language and is @{term "s\<^sub>1"} is in a language and
+ @{term "s\<^sub>2"} and in the star of this language then also @{term
+ "s\<^sub>1 @ s\<^sub>2"} is in the star of this language.
\emph{Semantic derivatives} of sets of strings are defined as
\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
- @{thm (lhs) Ders_def} & $\dn$ & @{thm (rhs) Ders_def}\\
\end{tabular}
\end{center}
@@ -225,10 +233,7 @@
@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
@{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
- @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\
-
- @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\
- @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\
+ @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}
\end{tabular}
\end{center}
--- a/thys/ReStar.thy Wed Mar 02 04:13:25 2016 +0000
+++ b/thys/ReStar.thy Wed Mar 02 14:06:13 2016 +0000
@@ -24,18 +24,13 @@
by (simp_all add: Sequ_def)
-section {* Semantic Derivatives of Languages *}
+section {* Semantic Derivative of Languages *}
definition
Der :: "char \<Rightarrow> string set \<Rightarrow> string set"
where
"Der c A \<equiv> {s. [c] @ s \<in> A}"
-definition
- Ders :: "string \<Rightarrow> string set \<Rightarrow> string set"
-where
- "Ders s A \<equiv> {s' | s'. s @ s' \<in> A}"
-
lemma Der_null [simp]:
shows "Der c {} = {}"
unfolding Der_def
@@ -167,12 +162,6 @@
apply(simp_all add: nullable_correctness)
done
-lemma ders_correctness:
- shows "L (ders s r) = Ders s (L r)"
-apply(induct s arbitrary: r)
-apply(simp_all add: der_correctness Der_def Ders_def)
-done
-
section {* Values *}
Binary file thys/paper.pdf has changed