# HG changeset patch # User Chengsong # Date 1653927892 -3600 # Node ID 96e93df60954a90ee25338b9f99efce2e15b4980 # Parent 28751de4b4ba3b0c5e5eb3c989a0addeb959f5a9 more diff -r 28751de4b4ba -r 96e93df60954 ChengsongTanPhdThesis/Chapters/Chapter1.tex --- a/ChengsongTanPhdThesis/Chapters/Chapter1.tex Mon May 30 14:41:09 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Chapter1.tex Mon May 30 17:24:52 2022 +0100 @@ -23,6 +23,7 @@ \newcommand{\rsimp}[1]{\textit{rsimp}(#1)} \newcommand{\sflataux}[1]{\llparenthesis #1 \rrparenthesis'} \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}% +\newcommand{\denote}{\stackrel{\mbox{\scriptsize denote}}{=}}% \newcommand{\ZERO}{\mbox{\bf 0}} \newcommand{\ONE}{\mbox{\bf 1}} \newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2} @@ -58,7 +59,8 @@ \def\Stars{\mathit{Stars}} \def\Char{\mathit{Char}} \def\Seq{\mathit{Seq}} -\def\Der{\mathit{Der}} +\def\Der{\textit{Der}} +\def\Ders{\textit{Ders}} \def\nullable{\mathit{nullable}} \def\Z{\mathit{Z}} \def\S{\mathit{S}} diff -r 28751de4b4ba -r 96e93df60954 ChengsongTanPhdThesis/Chapters/Chapter2.tex --- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex Mon May 30 14:41:09 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex Mon May 30 17:24:52 2022 +0100 @@ -10,8 +10,14 @@ \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions} -We have a built-in datatype char, made up of characters, which we do not define -on top of anything else. +We have a primitive datatype char, denoting characters. +\[ char ::= a + \mid b + \mid c + \mid \ldots + \mid z +\] +(which one is better?) \begin{center} \begin{tabular}{lcl} $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\ @@ -27,7 +33,8 @@ And strings can be concatenated to form longer strings: \begin{center} \begin{tabular}{lcl} -$s_1 @ s_2$ & $\rightarrow$ & $s'$\\ +$[] @ s_2$ & $\dn$ & $s_2$\\ +$(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$ \end{tabular} \end{center} @@ -46,22 +53,83 @@ $A^{n+1}$ & $\dn$ & $A^n @ A$ \end{tabular} \end{center} -The infinite set of all the power of a language unioned together -is defined using the power operator, also in recursive function: +The union of all the natural number powers of a language +is denoted by the Kleene star operator: \begin{center} \begin{tabular}{lcl} -$A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\ +$\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\ \end{tabular} \end{center} -We also define an operation of chopping off a character from all the strings -in a set: + +In Isabelle of course we cannot easily get a counterpart of +the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star +as an inductive set: +\begin{center} +\begin{tabular}{lcl} +$[] \in A^*$ & &\\ +$s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\ +\end{tabular} +\end{center} + +We also define an operation of chopping off a character from +a language: \begin{center} \begin{tabular}{lcl} $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\ \end{tabular} \end{center} -With the above definitions, it becomes natural to define regular expressions -as a concise way for expressing the languages. + +This can be generalised to chopping off a string from all strings within set $A$: +\begin{center} +\begin{tabular}{lcl} +$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\ +\end{tabular} +\end{center} + +which is essentially the left quotient $A \backslash L'$ of $A$ against +the singleton language $L' = \{w\}$ +in formal language theory. +For this dissertation the $\textit{Ders}$ notation would suffice, there is +no need for a more general derivative definition. + +With the sequencing, Kleene star, and $\textit{Der}$ operator on languages, +we have a few properties of how the language derivative can be defined using +sub-languages. +\begin{lemma} +$\Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$ +\end{lemma} +\noindent +This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it +and get to $B$. + +The language $A^*$'s derivative can be described using the language derivative +of $A$: +\begin{lemma} +$\textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)$\\ +\end{lemma} +\begin{proof} +\begin{itemize} +\item{$\subseteq$} +The set +\[ \{s \mid c :: s \in A^*\} \] +is enclosed in the set +\[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \] +because whenever you have a string starting with a character +in the language of a Kleene star $A^*$, then that character together with some sub-string +immediately after it will form the first iteration, and the rest of the string will +be still in $A^*$. +\item{$\supseteq$} +Note that +\[ \Der \; c \; A^* = \Der \; c \; (\{ [] \} \cup (A @ A^*) ) \] +and +\[ \Der \; c \; (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \] +where the $\textit{RHS}$ of the above equatioin can be rewritten +as \[ (\Der \; c\; A) @ A^* \cup A' \], $A'$ being a possibly empty set. +\end{itemize} +\end{proof} +Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart +for regular languages, we need to first give definitions for regular expressions. + \section{Regular Expressions and Their Language Interpretation} Suppose we have an alphabet $\Sigma$, the strings whose characters are from $\Sigma$ @@ -90,19 +158,6 @@ -The Brzozowski derivative w.r.t character $c$ is an operation on the regex, -where the operation transforms the regex to a new one containing -strings without the head character $c$. - -Formally, we define first such a transformation on any string set, which -we call semantic derivative: -\begin{center} -$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$ -\end{center} -Mathematically, it can be expressed as the - -If the $\textit{A}$ happen to have some structure, for example, -if it is regular, then we have that it % Derivatives of a %regular expression, written $r \backslash c$, give a simple solution @@ -113,7 +168,14 @@ \section{Brzozowski Derivatives of Regular Expressions} -The the derivative of regular expression, denoted as +Now with semantic derivatives of a language and regular expressions and +their language interpretations, we are ready to define derivatives on regexes. + +The Brzozowski derivative w.r.t character $c$ is an operation on the regex, +where the operation transforms the regex to a new one containing +strings without the head character $c$. + +The derivative of regular expression, denoted as $r \backslash c$, is a function that takes parameters $r$ and $c$, and returns another regular expression $r'$, which is computed by the following recursive function: @@ -401,9 +463,11 @@ that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. -A lexer aimed at getting all the possible values has an exponential -worst case runtime. Therefore it is impractical to try to generate -all possible matches in a run. In practice, we are usually +A lexer aimed at keeping all the possible values will naturally +have an exponential runtime on ambiguous regular expressions. +Somehow one has to decide which lexical value to keep and +output in a lexing algorithm. +In practice, we are usually interested about POSIX values, which by intuition always \begin{itemize} \item @@ -413,6 +477,9 @@ always match a subpart as much as possible before proceeding to the next token. \end{itemize} +The formal definition of a $\POSIX$ value can be described +in the following set of rules: +\ For example, the above example has the POSIX value