ChengsongTanPhdThesis/Chapters/Chapter2.tex
changeset 529 96e93df60954
parent 528 28751de4b4ba
child 530 823d9b19d21c
--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Mon May 30 14:41:09 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Mon May 30 17:24:52 2022 +0100
@@ -10,8 +10,14 @@
 
 
 \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
-We have a built-in datatype char, made up of characters, which we do not define
-on top of anything else.
+We have a primitive datatype char, denoting characters.
+\[			char ::=  a
+			 \mid b
+			 \mid c
+			 \mid  \ldots
+			 \mid z       
+\]
+(which one is better?)
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
@@ -27,7 +33,8 @@
 And strings can be concatenated to form longer strings:
 \begin{center}
 \begin{tabular}{lcl}
-$s_1 @ s_2$ & $\rightarrow$ & $s'$\\
+$[] @ s_2$ & $\dn$ & $s_2$\\
+$(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
 \end{tabular}
 \end{center}
 
@@ -46,22 +53,83 @@
 $A^{n+1}$ & $\dn$ & $A^n @ A$
 \end{tabular}
 \end{center}
-The infinite set of all the power of a language unioned together 
-is defined using the power operator, also in recursive function:
+The union of all the natural number powers of a language   
+is denoted by the Kleene star operator:
 \begin{center}
 \begin{tabular}{lcl}
-$A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\
+$\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\
 \end{tabular}
 \end{center}
-We also define an operation of chopping off a character from all the strings
-in a set:
+
+In Isabelle of course we cannot easily get a counterpart of
+the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star
+as an inductive set: 
+\begin{center}
+\begin{tabular}{lcl}
+$[] \in A^*$  & &\\
+$s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\
+\end{tabular}
+\end{center}
+
+We also define an operation of chopping off a character from
+a language:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
 \end{tabular}
 \end{center}
-With the above definitions, it becomes natural to define regular expressions
-as a concise way for expressing the languages.
+
+This can be generalised to chopping off a string from all strings within set $A$:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
+\end{tabular}
+\end{center}
+
+which is essentially the left quotient $A \backslash L'$ of $A$ against 
+the singleton language $L' = \{w\}$
+in formal language theory.
+For this dissertation the $\textit{Ders}$ notation would suffice, there is
+no need for a more general derivative definition.
+
+With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
+we have a  few properties of how the language derivative can be defined using 
+sub-languages.
+\begin{lemma}
+$\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$
+\end{lemma}
+\noindent
+This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
+and get to $B$.
+
+The language $A^*$'s derivative can be described using the language derivative
+of $A$:
+\begin{lemma}
+$\textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)$\\
+\end{lemma}
+\begin{proof}
+\begin{itemize}
+\item{$\subseteq$}
+The set 
+\[ \{s \mid c :: s \in A^*\} \]
+is enclosed in the set
+\[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \]
+because whenever you have a string starting with a character 
+in the language of a Kleene star $A^*$, then that character together with some sub-string
+immediately after it will form the first iteration, and the rest of the string will 
+be still in $A^*$.
+\item{$\supseteq$}
+Note that
+\[ \Der \; c \; A^* = \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) \]
+and 
+\[ \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \]
+where the $\textit{RHS}$ of the above equatioin can be rewritten
+as \[ (\Der \; c\; A) @ A^* \cup A' \], $A'$ being a possibly empty set.
+\end{itemize}
+\end{proof}
+Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
+for regular languages, we need to first give definitions for regular expressions.
+
 \section{Regular Expressions and Their Language Interpretation}
 Suppose we have an alphabet $\Sigma$, the strings  whose characters
 are from $\Sigma$
@@ -90,19 +158,6 @@
 
 
 
-The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
-where the operation transforms the regex to a new one containing
-strings without the head character $c$.
-
-Formally, we define first such a transformation on any string set, which
-we call semantic derivative:
-\begin{center}
-$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
-\end{center}
-Mathematically, it can be expressed as the 
-
-If the $\textit{A}$ happen to have some structure, for example,
-if it is regular, then we have that it
 
 % Derivatives of a
 %regular expression, written $r \backslash c$, give a simple solution
@@ -113,7 +168,14 @@
 
 
 \section{Brzozowski Derivatives of Regular Expressions}
-The the derivative of regular expression, denoted as
+Now with semantic derivatives of a language and regular expressions and
+their language interpretations, we are ready to define derivatives on regexes.
+
+The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
+where the operation transforms the regex to a new one containing
+strings without the head character $c$.
+
+The  derivative of regular expression, denoted as
 $r \backslash c$, is a function that takes parameters
 $r$ and $c$, and returns another regular expression $r'$,
 which is computed by the following recursive function:
@@ -401,9 +463,11 @@
 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
 
-A lexer aimed at getting all the possible values has an exponential
-worst case runtime. Therefore it is impractical to try to generate
-all possible matches in a run. In practice, we are usually 
+A lexer aimed at keeping all the possible values will naturally 
+have an exponential runtime on ambiguous regular expressions.
+Somehow one has to decide which lexical value to keep and
+output in a lexing algorithm.
+In practice, we are usually 
 interested about POSIX values, which by intuition always
 \begin{itemize}
 \item
@@ -413,6 +477,9 @@
 always match a subpart as much as possible before proceeding
 to the next token.
 \end{itemize}
+The formal definition of a $\POSIX$ value can be described 
+in the following set of rules:
+\
 
 
  For example, the above example has the POSIX value