--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex Mon May 30 14:41:09 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex Mon May 30 17:24:52 2022 +0100
@@ -10,8 +10,14 @@
\section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
-We have a built-in datatype char, made up of characters, which we do not define
-on top of anything else.
+We have a primitive datatype char, denoting characters.
+\[ char ::= a
+ \mid b
+ \mid c
+ \mid \ldots
+ \mid z
+\]
+(which one is better?)
\begin{center}
\begin{tabular}{lcl}
$\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
@@ -27,7 +33,8 @@
And strings can be concatenated to form longer strings:
\begin{center}
\begin{tabular}{lcl}
-$s_1 @ s_2$ & $\rightarrow$ & $s'$\\
+$[] @ s_2$ & $\dn$ & $s_2$\\
+$(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
\end{tabular}
\end{center}
@@ -46,22 +53,83 @@
$A^{n+1}$ & $\dn$ & $A^n @ A$
\end{tabular}
\end{center}
-The infinite set of all the power of a language unioned together
-is defined using the power operator, also in recursive function:
+The union of all the natural number powers of a language
+is denoted by the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
-$A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\
+$\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\
\end{tabular}
\end{center}
-We also define an operation of chopping off a character from all the strings
-in a set:
+
+In Isabelle of course we cannot easily get a counterpart of
+the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star
+as an inductive set:
+\begin{center}
+\begin{tabular}{lcl}
+$[] \in A^*$ & &\\
+$s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\
+\end{tabular}
+\end{center}
+
+We also define an operation of chopping off a character from
+a language:
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
-With the above definitions, it becomes natural to define regular expressions
-as a concise way for expressing the languages.
+
+This can be generalised to chopping off a string from all strings within set $A$:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
+\end{tabular}
+\end{center}
+
+which is essentially the left quotient $A \backslash L'$ of $A$ against
+the singleton language $L' = \{w\}$
+in formal language theory.
+For this dissertation the $\textit{Ders}$ notation would suffice, there is
+no need for a more general derivative definition.
+
+With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
+we have a few properties of how the language derivative can be defined using
+sub-languages.
+\begin{lemma}
+$\Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$
+\end{lemma}
+\noindent
+This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
+and get to $B$.
+
+The language $A^*$'s derivative can be described using the language derivative
+of $A$:
+\begin{lemma}
+$\textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)$\\
+\end{lemma}
+\begin{proof}
+\begin{itemize}
+\item{$\subseteq$}
+The set
+\[ \{s \mid c :: s \in A^*\} \]
+is enclosed in the set
+\[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \]
+because whenever you have a string starting with a character
+in the language of a Kleene star $A^*$, then that character together with some sub-string
+immediately after it will form the first iteration, and the rest of the string will
+be still in $A^*$.
+\item{$\supseteq$}
+Note that
+\[ \Der \; c \; A^* = \Der \; c \; (\{ [] \} \cup (A @ A^*) ) \]
+and
+\[ \Der \; c \; (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \]
+where the $\textit{RHS}$ of the above equatioin can be rewritten
+as \[ (\Der \; c\; A) @ A^* \cup A' \], $A'$ being a possibly empty set.
+\end{itemize}
+\end{proof}
+Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
+for regular languages, we need to first give definitions for regular expressions.
+
\section{Regular Expressions and Their Language Interpretation}
Suppose we have an alphabet $\Sigma$, the strings whose characters
are from $\Sigma$
@@ -90,19 +158,6 @@
-The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
-where the operation transforms the regex to a new one containing
-strings without the head character $c$.
-
-Formally, we define first such a transformation on any string set, which
-we call semantic derivative:
-\begin{center}
-$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
-\end{center}
-Mathematically, it can be expressed as the
-
-If the $\textit{A}$ happen to have some structure, for example,
-if it is regular, then we have that it
% Derivatives of a
%regular expression, written $r \backslash c$, give a simple solution
@@ -113,7 +168,14 @@
\section{Brzozowski Derivatives of Regular Expressions}
-The the derivative of regular expression, denoted as
+Now with semantic derivatives of a language and regular expressions and
+their language interpretations, we are ready to define derivatives on regexes.
+
+The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
+where the operation transforms the regex to a new one containing
+strings without the head character $c$.
+
+The derivative of regular expression, denoted as
$r \backslash c$, is a function that takes parameters
$r$ and $c$, and returns another regular expression $r'$,
which is computed by the following recursive function:
@@ -401,9 +463,11 @@
that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-A lexer aimed at getting all the possible values has an exponential
-worst case runtime. Therefore it is impractical to try to generate
-all possible matches in a run. In practice, we are usually
+A lexer aimed at keeping all the possible values will naturally
+have an exponential runtime on ambiguous regular expressions.
+Somehow one has to decide which lexical value to keep and
+output in a lexing algorithm.
+In practice, we are usually
interested about POSIX values, which by intuition always
\begin{itemize}
\item
@@ -413,6 +477,9 @@
always match a subpart as much as possible before proceeding
to the next token.
\end{itemize}
+The formal definition of a $\POSIX$ value can be described
+in the following set of rules:
+\
For example, the above example has the POSIX value