lexing: comparison ChengsongTanPhdThesis/Chapters/Chapter2.tex

equal deleted inserted replaced

-:28751de4b4ba
+:96e93df60954
 %and then give the algorithm and its variant, and discuss
 %why more aggressive simplifications are needed.
 \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
-We have a built-in datatype char, made up of characters, which we do not define
+We have a primitive datatype char, denoting characters.
-on top of anything else.
+\[			char ::=  a
+			 \mid b
+			 \mid c
+			 \mid  \ldots
+			 \mid z
+\]
+(which one is better?)
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
 \end{tabular}
 \end{center}
 \end{tabular}
 \end{center}
 And strings can be concatenated to form longer strings:
 \begin{center}
 \begin{tabular}{lcl}
-$s_1 @ s_2$ & $\rightarrow$ & $s'$\\
+$[] @ s_2$ & $\dn$ & $s_2$\\
+$(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
 \end{tabular}
 \end{center}
 A set of strings can operate with another set of strings:
 \begin{center}
 \begin{tabular}{lcl}
 $A^0 $ & $\dn$ & $\{ [] \}$\\
 $A^{n+1}$ & $\dn$ & $A^n @ A$
 \end{tabular}
 \end{center}
-The infinite set of all the power of a language unioned together
+The union of all the natural number powers of a language
-is defined using the power operator, also in recursive function:
+is denoted by the Kleene star operator:
 \begin{center}
 \begin{tabular}{lcl}
-$A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\
+$\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\
 \end{tabular}
 \end{center}
-We also define an operation of chopping off a character from all the strings
-in a set:
+In Isabelle of course we cannot easily get a counterpart of
+the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star
+as an inductive set:
+\begin{center}
+\begin{tabular}{lcl}
+$[] \in A^*$  & &\\
+$s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\
+\end{tabular}
+\end{center}
+We also define an operation of chopping off a character from
+a language:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
 \end{tabular}
 \end{center}
-With the above definitions, it becomes natural to define regular expressions
-as a concise way for expressing the languages.
+This can be generalised to chopping off a string from all strings within set $A$:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
+\end{tabular}
+\end{center}
+which is essentially the left quotient $A \backslash L'$ of $A$ against
+the singleton language $L' = \{w\}$
+in formal language theory.
+For this dissertation the $\textit{Ders}$ notation would suffice, there is
+no need for a more general derivative definition.
+With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
+we have a  few properties of how the language derivative can be defined using
+sub-languages.
+\begin{lemma}
+$\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$
+\end{lemma}
+\noindent
+This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
+and get to $B$.
+The language $A^*$'s derivative can be described using the language derivative
+of $A$:
+\begin{lemma}
+$\textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)$\\
+\end{lemma}
+\begin{proof}
+\begin{itemize}
+\item{$\subseteq$}
+The set
+\[ \{s \mid c :: s \in A^*\} \]
+is enclosed in the set
+\[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \]
+because whenever you have a string starting with a character
+in the language of a Kleene star $A^*$, then that character together with some sub-string
+immediately after it will form the first iteration, and the rest of the string will
+be still in $A^*$.
+\item{$\supseteq$}
+Note that
+\[ \Der \; c \; A^* = \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) \]
+and
+\[ \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \]
+where the $\textit{RHS}$ of the above equatioin can be rewritten
+as \[ (\Der \; c\; A) @ A^* \cup A' \], $A'$ being a possibly empty set.
+\end{itemize}
+\end{proof}
+Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
+for regular languages, we need to first give definitions for regular expressions.
 \section{Regular Expressions and Their Language Interpretation}
 Suppose we have an alphabet $\Sigma$, the strings  whose characters
 are from $\Sigma$
 can be expressed as $\Sigma^*$.
 \end{center}
 Which are also called the "language interpretation".
-The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
-where the operation transforms the regex to a new one containing
-strings without the head character $c$.
-Formally, we define first such a transformation on any string set, which
-we call semantic derivative:
-\begin{center}
-$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
-\end{center}
-Mathematically, it can be expressed as the
-If the $\textit{A}$ happen to have some structure, for example,
-if it is regular, then we have that it
 % Derivatives of a
 %regular expression, written $r \backslash c$, give a simple solution
 %to the problem of matching a string $s$ with a regular
 %expression $r$: if the derivative of $r$ w.r.t.\ (in
 %succession) all the characters of the string matches the empty string,
 %then $r$ matches $s$ (and {\em vice versa}).
 \section{Brzozowski Derivatives of Regular Expressions}
-The the derivative of regular expression, denoted as
+Now with semantic derivatives of a language and regular expressions and
+their language interpretations, we are ready to define derivatives on regexes.
+The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
+where the operation transforms the regex to a new one containing
+strings without the head character $c$.
+The  derivative of regular expression, denoted as
 $r \backslash c$, is a function that takes parameters
 $r$ and $c$, and returns another regular expression $r'$,
 which is computed by the following recursive function:
 \begin{center}
 between the two $a^*$s.
 It is not surprising there are exponentially many lexical values
 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-A lexer aimed at getting all the possible values has an exponential
+A lexer aimed at keeping all the possible values will naturally
-worst case runtime. Therefore it is impractical to try to generate
+have an exponential runtime on ambiguous regular expressions.
-all possible matches in a run. In practice, we are usually
+Somehow one has to decide which lexical value to keep and
+output in a lexing algorithm.
+In practice, we are usually
 interested about POSIX values, which by intuition always
 \begin{itemize}
 \item
 match the leftmost regular expression when multiple options of matching
 are available
 \item
 always match a subpart as much as possible before proceeding
 to the next token.
 \end{itemize}
+The formal definition of a $\POSIX$ value can be described
+in the following set of rules:
+\
 For example, the above example has the POSIX value
 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
 The output of an algorithm we want would be a POSIX matching

changeset 529	96e93df60954
parent 528	28751de4b4ba
child 530	823d9b19d21c