--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Fri Aug 12 00:39:23 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Sun Aug 14 09:44:27 2022 +0100
@@ -9,20 +9,32 @@
In this chapter we give a guarantee in terms of size:
given an annotated regular expression $a$, for any string $s$
-our algorithm's internal data structure
-$\bderssimp{a}{s}$'s size
+our algorithm's internal annotated regular expression
+size
\begin{center}
$\llbracket \bderssimp{a}{s} \rrbracket$
\end{center}
\noindent
is finitely bounded
-by a constant $N_a$ that only depends on $a$.
+by a constant $N_a$ that only depends on $a$,
+where the size of an annotated regular expression is defined
+in terms of the number of nodes in its tree structure:
+\begin{center}
+\begin{tabular}{ccc}
+ $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
+ $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
+ $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
+\end{tabular}
+\end{center}
\section{Formalising About Size}
+\noindent
In our lexer $\blexersimp$,
The regular expression is repeatedly being taken derivative of
and then simplified.
-
\begin{center}
\begin{tikzpicture}[scale=2,node distance=1.4cm,
every node/.style={minimum size=9mm}]
@@ -37,36 +49,48 @@
\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
-\node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$};
+\node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$};
\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
\end{tikzpicture}
\end{center}
\noindent
As illustrated in the picture,
-each time the derivative is taken derivative of, it grows,
-and when it is being simplified, it shrinks.
-The blue ones are the regular expressions after simplification,
-which would be smaller than before.
+each time the regular expression
+is taken derivative of, it grows (the black nodes),
+and after simplification, it shrinks (the blue nodes).
+This intuition is depicted by the relative size
+difference between the black and blue nodes.
We give a mechanised proof that after simplification
the regular expression's size (the blue ones)
$\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
a fixed $a$.
-While it is not yet a direct formalisation of our lexer's complexity,
-it is a stepping stone towards a complexity proof because
-if the data structures out lexer has to traverse is large, the program
-will certainly be slow.
-The bound is not yet tight, and we seek to improve $N_a$ so that
+There are two problems with this finiteness proof, though.
+\begin{itemize}
+ \item
+First, It is not yet a direct formalisation of our lexer's complexity,
+as a complexity proof would require looking into
+the time it takes to execute {\bf all} the operations
+involved in the lexer (simp, collect, decode), not just the derivative.
+\item
+Second, the bound is not yet tight, and we seek to improve $N_a$ so that
it is polynomial on $\llbracket a \rrbracket$.
-We believe a formalisation of the complexity-related properties
-of the algorithm is important in our context, because we want to address
-catastrophic backtracking, which is not a correctness issue but
-in essence a performance issue, a formal proof can give
-the strongest assurance that such issues cannot arise
-regardless of what the input is.
-This level of certainty cannot come from a pencil and paper proof or
-eimpirical evidence.
-
+\end{itemize}
+Still, we believe this size formalisation
+of the algorithm is important in our context, because
+\begin{itemize}
+ \item
+ Derivatives are the most important phases of our lexer algorithm.
+ Size properties about derivatives covers the majority of the algorithm
+ and is therefore a good indication of complexity of the entire program.
+ \item
+ What the size bound proof does ensure is an absence of
+ catastrophic backtracking, which is prevalent in regular expression engines
+ in popular programming languages like Java.
+ We prove catastrophic backtracking cannot happen for {\bf all} inputs,
+ which is an advantage over work which
+ only gives empirical evidence on some test cases.
+\end{itemize}
For example, Sulzmann and Lu's made claimed that their algorithm
with bitcodes and simplifications can lex in linear time with respect
to the input string. This assumes that each
@@ -76,8 +100,9 @@
of each derivative step to grow arbitrarily large.
Here in our proof we state that such explosions cannot happen
with our simplification function.
-
-Here is a bird's eye view of how the proof of finiteness works:
+\subsection{Overview of the Proof}
+Here is a bird's eye view of how the proof of finiteness works,
+which involves three steps:
\begin{center}
\begin{tikzpicture}[scale=1,font=\bf,
node/.style={
@@ -98,45 +123,45 @@
\end{tikzpicture}
\end{center}
\noindent
-We explain the above steps one by one:
+We explain the steps one by one:
\begin{itemize}
\item
-We first use a new datatype, called $\textit{rrexp}$s, whose
-inductive type definition and derivative and simplification operations are
+We first define a new datatype
+that is more straightforward to tweak
+into the shape we want
+compared with an annotated regular expression,
+called $\textit{rrexp}$s.
+Its inductive type definition and
+derivative and simplification operations are
almost identical to those of the annotated regular expressions,
except that no bitcodes are attached.
-This new datatype is more straightforward to tweak
-compared with an annotated regular expression.
\item
We have a set of equalities for this new datatype that enables one to
-rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc.
+rewrite $\rderssimp{r_1 \cdot r_2}{s}$
+and $\rderssimp{r^*}{s}$ (the most involved
+inductive cases)
by a combinatioin of derivatives of their
-children regular expressions ($r_1$, $r_2$, and $r$, respectively).
-These equalities are chained together to get into a shape
-that is very easy to estimate, which we call the \emph{Closed Forms}
+children regular expressions ($r_1$, $r_2$, and $r$, respectively),
+which we call the \emph{Closed Forms}
of the derivatives.
\item
-This closed form is controlled by terms that
-are easier to deal with, wich are in turn bounded loosely
+The Closed Forms of the regular expressions
+are controlled by terms that
+are easier to deal with.
+Using inductive hypothesis, these terms
+are in turn bounded loosely
by a large yet constant number.
\end{itemize}
We give details of these steps in the next sections.
-The first step is to use something simpler than annotated regular expressions.
+The first step is to use
+$\textit{rrexp}$s,
+something simpler than
+annotated regular expressions.
-\section{the $\mathbf{r}$-rexp datatype and the size functions}
+\section{the $\textit{rrexp}$ Datatype and Its Size Functions}
We want to prove the size property about annotated regular expressions.
The size is
written $\llbracket r\rrbracket$, whose intuitive definition is as below
-\begin{center}
-\begin{tabular}{ccc}
- $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
- $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
- $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
-$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
-$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\
-$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
-\end{tabular}
-\end{center}
\noindent
We first note that $\llbracket \_ \rrbracket$
is unaware of bitcodes, since
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Fri Aug 12 00:39:23 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Sun Aug 14 09:44:27 2022 +0100
@@ -102,9 +102,37 @@
However, for the purposes here, the $\textit{Ders}$ definition with
a single string is sufficient.
+The reason for defining derivatives
+is that it provides a different approach
+to test membership of a string in
+a set of strings.
+For example, to test whether the string
+$bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
+respect to the string $bar$:
+\begin{center}
+\begin{tabular}{lclll}
+ $S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
+ $\{ar, rak\}$ &
+ $\stackrel{\backslash a}{\rightarrow}$ &
+ \\
+ $\{r \}$ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$ &
+ $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
+\end{tabular}
+\end{center}
+\noindent
+and in the end test whether the set
+has the empty string \footnote{ we use the infix notation $A\backslash c$
+ instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
+on languages rather than regular expressions }.
+In general, if we have a language $S_{start}$,
+then we can test whether $s$ is in $S_{start}$
+by testing whether $[] \in S \backslash s$.
+
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
we have a few properties of how the language derivative can be defined using
sub-languages.
+For example, for the sequence operator, we have
+something similar to the ``chain rule'' of the calculus derivative:
\begin{lemma}
\[
\Der \; c \; (A @ B) =
@@ -198,13 +226,95 @@
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
%so that after derivative $L(r\backslash c)$
%will look as if it was obtained by doing a language derivative on $L(r)$:
-Recall that the semantic derivative acts on a set of
-strings. Brzozowski noticed that this operation
+Recall that the semantic derivative acts on a
+language (set of strings).
+One can decide whether a string $s$ belongs
+to a language $S$ by taking derivative with respect to
+that string and then checking whether the empty
+string is in the derivative:
+\begin{center}
+\parskip \baselineskip
+\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
+\def\rlwd{.5pt}
+\newcommand\notate[3]{%
+ \unskip\def\useanchorwidth{T}%
+ \setbox0=\hbox{#1}%
+ \def\stackalignment{c}\stackunder[-6pt]{%
+ \def\stackalignment{c}\stackunder[-1.5pt]{%
+ \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
+ \rule{\rlwd}{#2\baselineskip}}}{%
+ \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
+}
+\Longstack{
+\notate{$\{ \ldots ,\;$
+ \notate{s}{1}{$(c_1 :: s_1)$}
+ $, \; \ldots \}$
+}{1}{$S_{start}$}
+}
+\Longstack{
+ $\stackrel{\backslash c_1}{\longrightarrow}$
+}
+\Longstack{
+ $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$}
+ $,\; \ldots \}$
+}
+\Longstack{
+ $\stackrel{\backslash c_2}{\longrightarrow}$
+}
+\Longstack{
+ $\{ \ldots,\; s_2
+ ,\; \ldots \}$
+}
+\Longstack{
+ $ \xdashrightarrow{\backslash c_3\ldots\ldots} $
+}
+\Longstack{
+ \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
+ S_{start}\backslash s$}
+}
+\end{center}
+\begin{center}
+ $s \in S_{start} \iff [] \in S_{end}$
+\end{center}
+\noindent
+Brzozowski noticed that this operation
can be ``mirrored" on regular expressions which
he calls the derivative of a regular expression $r$
with respect to a character $c$, written
-$r \backslash c$.
-He defined this operation such that the following property holds:
+$r \backslash c$. This infix operator
+takes an original regular expression $r$ as input
+and a character as a right operand and
+outputs a result, which is a new regular expression.
+The derivative operation on regular expression
+is defined such that the language of the derivative result
+coincides with the language of the original
+regular expression's language being taken the language
+derivative with respect to the same character:
+\begin{center}
+\parskip \baselineskip
+\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
+\def\rlwd{.5pt}
+\newcommand\notate[3]{%
+ \unskip\def\useanchorwidth{T}%
+ \setbox0=\hbox{#1}%
+ \def\stackalignment{c}\stackunder[-6pt]{%
+ \def\stackalignment{c}\stackunder[-1.5pt]{%
+ \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
+ \rule{\rlwd}{#2\baselineskip}}}{%
+ \strut\kern8pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
+}
+\Longstack{
+ \notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1,
+\;\ldots\}$}
+}
+\Longstack{
+ $\stackrel{\backslash c}{\longrightarrow}$
+}
+\Longstack{
+ \notate{$r\backslash c$}{2}{$L \; (r\backslash c)=
+ \{\ldots,\;s_1,\;\ldots\}$}
+}
+\end{center}
\begin{center}
\[
@@ -212,6 +322,10 @@
\]
\end{center}
\noindent
+where we do derivatives on the regular expression
+$r$ and test membership of $s$ by checking
+whether the empty string is in the language of
+$r\backslash s$.
For example in the sequence case we have
\begin{center}
\begin{tabular}{lcl}
--- a/ChengsongTanPhdThesis/main.tex Fri Aug 12 00:39:23 2022 +0100
+++ b/ChengsongTanPhdThesis/main.tex Sun Aug 14 09:44:27 2022 +0100
@@ -47,6 +47,11 @@
\usepackage{hyperref}
\usepackage{lipsum}
\usepackage[backend=bibtex]{biblatex} % Use the bibtex backend with the authoryear citation style (which resembles APA)
+
+\usepackage[usestackEOL]{stackengine}
+\usepackage{scalerel}
+\usepackage{graphicx}
+
%style=authoryear, natbib=true
\usepackage{stmaryrd}
\usepackage{caption}
@@ -61,6 +66,22 @@
%\usepackage{algorithm}
\usepackage{amsmath}
+\makeatletter
+\newcommand{\xleftrightarrow}[2][]{\ext@arrow 3359\leftrightarrowfill@{#1}{#2}}
+\newcommand{\xdashrightarrow}[2][]{\ext@arrow 0359\rightarrowfill@@{#1}{#2}}
+\newcommand{\xdashleftarrow}[2][]{\ext@arrow 3095\leftarrowfill@@{#1}{#2}}
+\newcommand{\xdashleftrightarrow}[2][]{\ext@arrow 3359\leftrightarrowfill@@{#1}{#2}}
+\def\rightarrowfill@@{\arrowfill@@\relax\relbar\rightarrow}
+\def\leftarrowfill@@{\arrowfill@@\leftarrow\relbar\relax}
+\def\leftrightarrowfill@@{\arrowfill@@\leftarrow\relbar\rightarrow}
+\def\arrowfill@@#1#2#3#4{%
+ $\m@th\thickmuskip0mu\medmuskip\thickmuskip\thinmuskip\thickmuskip
+ \relax#4#1
+ \xleaders\hbox{$#4#2$}\hfill
+ #3$%
+}
+\makeatother
+
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{cleveref}
@@ -82,6 +103,7 @@
\DeclareCaptionType{mytype}[Illustration][]
\newenvironment{envForCaption}{\captionsetup{type=mytype} }{}
+
%----------------------------------------------------------------------------------------
% MARGIN SETTINGS
%----------------------------------------------------------------------------------------