lexing: comparison thys/Paper/Paper.thy


-:2c38f10643ae
+:267afb7fb700
 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
 character~@{text c}, and showed that it gave a simple solution to the
 problem of matching a string @{term s} with a regular expression @{term r}:
 if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
-the string matches the empty string $\mts$, then @{term r} matches @{term s}
+the string matches the empty string, then @{term r} matches @{term s}
 (and {\em vice versa}). The derivative has the property (which may be
 regarded as its specification) that, for every string @{term s} and regular
 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
 derivatives is that they are neatly expressible in any functional language,
 approach (of which some of the proofs are not published in
 \cite{Sulzmann2014}); perhaps more importantly, we give a simple inductive
 (and algorithm-independent) definition of what we call being a {\em POSIX
 value} for a regular expression @{term r} and a string @{term s}; we show
 that the algorithm computes such a value and that such a value is unique.
-Proofs are both done by hand and checked in {\em Isabelle/HOL}. The
+Proofs are both done by hand and checked in Isabelle/HOL. The
 experience of doing our proofs has been that this mechanical checking was
 absolutely essential: this subject area has hidden snares. This was also
 noted by Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching
 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
 If a regular expression matches a string, then in general there is more
-than one possibility of how the string is matched. There are two commonly
+than one way in which the string can be matched. There are two commonly used
-used disambiguation strategies to generate a unique answer: one is called
+disambiguation strategies to generate a unique answer: one is called GREEDY
-GREEDY matching \cite{Frisch2004} and the other is POSIX
+matching \cite{Frisch2004} and the other is POSIX
 matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string
-@{term xy} and the regular expression @{term "STAR (ALT (ALT x y) xy)"}.
+@{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y) xy)"}}.
 Either the string can be matched in two `iterations' by the single
 letter-regular expressions @{term x} and @{term y}, or directly in one
 iteration by @{term xy}. The first case corresponds to GREEDY matching,
 which first matches with the left-most symbol and only matches the next
-symbol in case of a mismatch. The second case is POSIX matching, which
+symbol in case of a mismatch (this is greedy in the sense of preferring
-prefers the longest match.
+instant gratification to delayed repletion). The second case is POSIX
+matching, which prefers the longest match.
-In the context of lexing, where an input string is separated into a sequence
-of tokens, POSIX is the more natural disambiguation strategy for what
+In the context of lexing, where an input string needs to be separated into a
-programmers consider basic syntactic building blocks in their programs.
+sequence of tokens, POSIX is the more natural disambiguation strategy for
-There are two underlying rules behind tokenising using POSIX matching:
+what programmers consider basic syntactic building blocks in their programs.
+These building blocks are often specified by some regular expressions, say @{text
+"r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising
+keywords and identifiers, respectively. There are two underlying rules
+behind tokenising a string in a POSIX fashion:
 \begin{itemize}
 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}
 The longest initial substring matched by any regular expression is taken as
 For a particular longest initial substring, the first regular expression
 that can match determines the token.
 \end{itemize}
-\noindent Consider for example a regular expression @{text
+\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising
-"r\<^bsub>key\<^esub>"} that can recognise keywords, like @{text "if"},
+keywords such as @{text "if"}, @{text "then"} and so on; and @{text
-@{text "then"} and so on; and another regular expression @{text
+"r\<^bsub>id\<^esub>"} recognising identifiers (a single character followed
-"r\<^bsub>id\<^esub>"} that can recognise identifiers (such as single
+by characters or numbers). Then we can form the regular expression @{text
-characters followed by characters or numbers). Then we can form the regular
+"(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
-expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and
+matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
-use POSIX matching to, for example, tokenise the strings @{text "iffoo"} and
+first case we obtain by the longest match rule a single identifier token,
-@{text "if"}. In the first case we obtain by the longest match rule a
+not a keyword followed by an identifier. In the second case we obtain by rule
-single identifier token, not a keyword and identifier. In the second case we
+priority a keyword token, not an identifier token---even if @{text
-obtain by rule priority a keyword token, not an identifier token.
+"r\<^bsub>id\<^esub>"} matches also.\bigskip
-\bigskip
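The two rules can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: it uses Python's backtracking `re` module per rule and only compares match lengths across rules. The rule names `KEY` and `ID` and both patterns are assumptions chosen to mirror the keyword and identifier expressions in the example above; for these particular patterns Python's matcher happens to return the longest match of each rule.

```python
import re

# Hypothetical token rules in priority order: keywords come before identifiers.
RULES = [
    ("KEY", re.compile(r"if|then")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
]

def tokenise(s):
    """Split s into (rule name, lexeme) pairs using longest match, then rule priority."""
    tokens = []
    pos = 0
    while pos < len(s):
        best = None  # (rule name, end position) of the best match so far
        for name, pattern in RULES:
            m = pattern.match(s, pos)
            # Longest match rule: a later rule wins only if it matches strictly more.
            # Priority rule: on a tie the earlier rule is kept.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, end = best
        tokens.append((name, s[pos:end]))
        pos = end
    return tokens

print(tokenise("iffoo"))  # a single identifier token
print(tokenise("if"))     # a single keyword token
```

For `iffoo` the identifier rule matches five characters against two for the keyword rule, so the longest match rule selects it. For `if` both rules match two characters and the earlier rule, the keyword, is kept.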
 \noindent\textcolor{green}{Not Done Yet}
 \medskip\noindent
 text {* \noindent Strings in Isabelle/HOL are lists of characters with
 the empty string being represented by the empty list, written @{term
 "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}.
 Often we use the usual bracket notation for strings; for example a
-string consiting of a single character is written @{term "[c]"}.  By
+string consisting of a single character is written @{term "[c]"}.  By
 using the type @{type char} for characters we have a supply of
 finitely many characters roughly corresponding to the ASCII
-character set.  Regular expression are as usual and defined as the
+character set.  Regular expressions are defined as usual as the
 following inductive datatype:
 \begin{center}
 @{text "r :="}
 @{const "ZERO"} $\mid$
 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
 @{term "STAR r"}
 \end{center}
 \noindent where @{const ZERO} stands for the regular expression that
-does not macth any string and @{const ONE} for the regular
+does not match any string and @{const ONE} for the regular
 expression that matches only the empty string. The language of a regular expression
-is again defined as usual by the following clauses
+is again defined routinely by the recursive function @{term L} with the
+clauses:
 \begin{center}
 \begin{tabular}{rcl}
 @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
 @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
 @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
 \end{tabular}
 \end{center}
-\noindent We use the star-notation for regular expressions and sets of strings.
+\noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the
-The Kleene-star on sets is defined inductively.
+concatenation of two languages. We use the star-notation for regular
+expressions and sets of strings (in the last clause). The star on sets is
+defined inductively as usual by two clauses: the empty string is in
+the star of a language; and if @{term "s\<^sub>1"} is in a language and
+@{term "s\<^sub>2"} is in the star of this language, then @{term
+"s\<^sub>1 @ s\<^sub>2"} is also in the star of this language.
 \emph{Semantic derivatives} of sets of strings are defined as
 \begin{center}
 \begin{tabular}{lcl}
 @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
-@{thm (lhs) Ders_def} & $\dn$ & @{thm (rhs) Ders_def}\\
 \end{tabular}
 \end{center}
 \noindent where the second definition lifts the notion of semantic
 derivatives from characters to strings.
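As a rough illustration, semantic derivatives can be computed directly when a language is a finite set of strings. The right-hand sides of the definitions are not visible in this excerpt, so the sketch below assumes the usual reading: the derivative of a set `A` w.r.t. a character `c` keeps the tails of those strings in `A` that start with `c`, and the string version does the same for a whole prefix. The function names `Der` and `Ders` follow the theorem names; the Python encoding is hypothetical.

```python
def Der(c, A):
    """Semantic derivative of the finite language A w.r.t. the single character c."""
    return {s[1:] for s in A if s[:1] == c}

def Ders(s, A):
    """Semantic derivative of A w.r.t. the string s, taking one character at a time."""
    for c in s:
        A = Der(c, A)
    return A

A = {"ab", "a", "ba", "abc"}
print(Der("a", A))    # tails of the strings in A that start with "a"
print(Ders("ab", A))  # tails of the strings in A that start with "ab"
```

Real languages of regular expressions are in general infinite, so this only serves to make the definition concrete on small examples.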
 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
 @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
 @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
 @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
-@{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\
+@{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}
-@{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\
-@{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\
 \end{tabular}
 \end{center}
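The clauses rendered by the table are not spelled out in this excerpt, so the following Python sketch uses the standard textbook definitions of Brzozowski's derivative and of nullability; it is an assumption that these coincide with `der.simps`. The class names mirror the constructors mentioned in the text (`ZERO`, `ONE`, `ALT`, `SEQ`, `STAR`); `Char`, `nullable` and `matches` are names introduced here for illustration.

```python
from dataclasses import dataclass

class Rexp:
    """Base class for regular expressions."""

@dataclass(frozen=True)
class Zero(Rexp):      # matches no string
    pass

@dataclass(frozen=True)
class One(Rexp):       # matches only the empty string
    pass

@dataclass(frozen=True)
class Char(Rexp):      # matches the single character c
    c: str

@dataclass(frozen=True)
class Alt(Rexp):       # alternative r1 + r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):       # sequence r1 followed by r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):      # zero or more iterations of r
    r: Rexp

def nullable(r):
    """True if r matches the empty string."""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False  # Zero and Char

def der(c, r):
    """Derivative of r w.r.t. the character c."""
    if isinstance(r, Char):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    return Zero()  # Zero and One

def ders(s, r):
    """Derivative of r w.r.t. all characters of s in succession."""
    for c in s:
        r = der(c, r)
    return r

def matches(r, s):
    """r matches s iff the derivative w.r.t. s matches the empty string."""
    return nullable(ders(s, r))

# The regular expression STAR (ALT (ALT x y) xy) from the introduction.
x, y = Char("x"), Char("y")
r = Star(Alt(Alt(x, y), Seq(x, y)))
print(matches(r, "xy"), matches(r, "xz"))
```

This only decides whether a string is matched; it says nothing about how it is matched, which is the concern of the POSIX values discussed in the text.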
 It is a relatively easy exercise to prove that

changeset 110	267afb7fb700
parent 109	2c38f10643ae
child 111	289728193164