# HG changeset patch # User Chengsong # Date 1660466667 -3600 # Node ID f47fc48405798e9d6ef654d0345d253f98116197 # Parent 3e1b699696b61d44c45cdb1b8f313540ebb59eae thesis chap2 diff -r 3e1b699696b6 -r f47fc4840579 ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Fri Aug 12 00:39:23 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Sun Aug 14 09:44:27 2022 +0100 @@ -9,20 +9,32 @@ In this chapter we give a guarantee in terms of size: given an annotated regular expression $a$, for any string $s$ -our algorithm's internal data structure -$\bderssimp{a}{s}$'s size +our algorithm's internal annotated regular expression +size \begin{center} $\llbracket \bderssimp{a}{s} \rrbracket$ \end{center} \noindent is finitely bounded -by a constant $N_a$ that only depends on $a$. +by a constant $N_a$ that only depends on $a$, +where the size of an annotated regular expression is defined +in terms of the number of nodes in its tree structure: +\begin{center} +\begin{tabular}{ccc} + $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ + $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ + $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ +$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ +$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ +$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$ +\end{tabular} +\end{center} \section{Formalising About Size} +\noindent In our lexer $\blexersimp$, The regular expression is repeatedly being taken derivative of and then simplified. - \begin{center} \begin{tikzpicture}[scale=2,node distance=1.4cm, every node/.style={minimum size=9mm}] @@ -37,36 +49,48 @@ \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$}; \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$}; \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; -\node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$}; +\node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$}; \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode}; \end{tikzpicture} \end{center} \noindent As illustrated in the picture, -each time the derivative is taken derivative of, it grows, -and when it is being simplified, it shrinks. -The blue ones are the regular expressions after simplification, -which would be smaller than before. +each time the regular expression +is taken derivative of, it grows (the black nodes), +and after simplification, it shrinks (the blue nodes). +This intuition is depicted by the relative size +difference between the black and blue nodes. We give a mechanised proof that after simplification the regular expression's size (the blue ones) $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for a fixed $a$. -While it is not yet a direct formalisation of our lexer's complexity, -it is a stepping stone towards a complexity proof because -if the data structures out lexer has to traverse is large, the program -will certainly be slow. -The bound is not yet tight, and we seek to improve $N_a$ so that +There are two problems with this finiteness proof, though. +\begin{itemize} + \item +First, It is not yet a direct formalisation of our lexer's complexity, +as a complexity proof would require looking into +the time it takes to execute {\bf all} the operations +involved in the lexer (simp, collect, decode), not just the derivative. +\item +Second, the bound is not yet tight, and we seek to improve $N_a$ so that it is polynomial on $\llbracket a \rrbracket$. -We believe a formalisation of the complexity-related properties -of the algorithm is important in our context, because we want to address -catastrophic backtracking, which is not a correctness issue but -in essence a performance issue, a formal proof can give -the strongest assurance that such issues cannot arise -regardless of what the input is. -This level of certainty cannot come from a pencil and paper proof or -eimpirical evidence. - +\end{itemize} +Still, we believe this size formalisation +of the algorithm is important in our context, because +\begin{itemize} + \item + Derivatives are the most important phases of our lexer algorithm. + Size properties about derivatives covers the majority of the algorithm + and is therefore a good indication of complexity of the entire program. + \item + What the size bound proof does ensure is an absence of + catastrophic backtracking, which is prevalent in regular expression engines + in popular programming languages like Java. + We prove catastrophic backtracking cannot happen for {\bf all} inputs, + which is an advantage over work which + only gives empirical evidence on some test cases. +\end{itemize} For example, Sulzmann and Lu's made claimed that their algorithm with bitcodes and simplifications can lex in linear time with respect to the input string. This assumes that each @@ -76,8 +100,9 @@ of each derivative step to grow arbitrarily large. Here in our proof we state that such explosions cannot happen with our simplification function. - -Here is a bird's eye view of how the proof of finiteness works: +\subsection{Overview of the Proof} +Here is a bird's eye view of how the proof of finiteness works, +which involves three steps: \begin{center} \begin{tikzpicture}[scale=1,font=\bf, node/.style={ @@ -98,45 +123,45 @@ \end{tikzpicture} \end{center} \noindent -We explain the above steps one by one: +We explain the steps one by one: \begin{itemize} \item -We first use a new datatype, called $\textit{rrexp}$s, whose -inductive type definition and derivative and simplification operations are +We first define a new datatype +that is more straightforward to tweak +into the shape we want +compared with an annotated regular expression, +called $\textit{rrexp}$s. +Its inductive type definition and +derivative and simplification operations are almost identical to those of the annotated regular expressions, except that no bitcodes are attached. -This new datatype is more straightforward to tweak -compared with an annotated regular expression. \item We have a set of equalities for this new datatype that enables one to -rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc. +rewrite $\rderssimp{r_1 \cdot r_2}{s}$ +and $\rderssimp{r^*}{s}$ (the most involved +inductive cases) by a combinatioin of derivatives of their -children regular expressions ($r_1$, $r_2$, and $r$, respectively). -These equalities are chained together to get into a shape -that is very easy to estimate, which we call the \emph{Closed Forms} +children regular expressions ($r_1$, $r_2$, and $r$, respectively), +which we call the \emph{Closed Forms} of the derivatives. \item -This closed form is controlled by terms that -are easier to deal with, wich are in turn bounded loosely +The Closed Forms of the regular expressions +are controlled by terms that +are easier to deal with. +Using inductive hypothesis, these terms +are in turn bounded loosely by a large yet constant number. \end{itemize} We give details of these steps in the next sections. -The first step is to use something simpler than annotated regular expressions. +The first step is to use +$\textit{rrexp}$s, +something simpler than +annotated regular expressions. -\section{the $\mathbf{r}$-rexp datatype and the size functions} +\section{the $\textit{rrexp}$ Datatype and Its Size Functions} We want to prove the size property about annotated regular expressions. The size is written $\llbracket r\rrbracket$, whose intuitive definition is as below -\begin{center} -\begin{tabular}{ccc} - $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ - $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ - $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ -$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ -$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ -$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$ -\end{tabular} -\end{center} \noindent We first note that $\llbracket \_ \rrbracket$ is unaware of bitcodes, since diff -r 3e1b699696b6 -r f47fc4840579 ChengsongTanPhdThesis/Chapters/Inj.tex --- a/ChengsongTanPhdThesis/Chapters/Inj.tex Fri Aug 12 00:39:23 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Sun Aug 14 09:44:27 2022 +0100 @@ -102,9 +102,37 @@ However, for the purposes here, the $\textit{Ders}$ definition with a single string is sufficient. +The reason for defining derivatives +is that it provides a different approach +to test membership of a string in +a set of strings. +For example, to test whether the string +$bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with +respect to the string $bar$: +\begin{center} +\begin{tabular}{lclll} + $S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & + $\{ar, rak\}$ & + $\stackrel{\backslash a}{\rightarrow}$ & + \\ + $\{r \}$ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$ & + $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\ +\end{tabular} +\end{center} +\noindent +and in the end test whether the set +has the empty string \footnote{ we use the infix notation $A\backslash c$ + instead of $\Der \; c \; A$ for brevity, as it is clear we are operating +on languages rather than regular expressions }. +In general, if we have a language $S_{start}$, +then we can test whether $s$ is in $S_{start}$ +by testing whether $[] \in S \backslash s$. + With the sequencing, Kleene star, and $\textit{Der}$ operator on languages, we have a few properties of how the language derivative can be defined using sub-languages. +For example, for the sequence operator, we have +something similar to the ``chain rule'' of the calculus derivative: \begin{lemma} \[ \Der \; c \; (A @ B) = @@ -198,13 +226,95 @@ %all strings in that set, Brzozowski defined a derivative operation on regular expressions %so that after derivative $L(r\backslash c)$ %will look as if it was obtained by doing a language derivative on $L(r)$: -Recall that the semantic derivative acts on a set of -strings. Brzozowski noticed that this operation +Recall that the semantic derivative acts on a +language (set of strings). +One can decide whether a string $s$ belongs +to a language $S$ by taking derivative with respect to +that string and then checking whether the empty +string is in the derivative: +\begin{center} +\parskip \baselineskip +\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}} +\def\rlwd{.5pt} +\newcommand\notate[3]{% + \unskip\def\useanchorwidth{T}% + \setbox0=\hbox{#1}% + \def\stackalignment{c}\stackunder[-6pt]{% + \def\stackalignment{c}\stackunder[-1.5pt]{% + \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{% + \rule{\rlwd}{#2\baselineskip}}}{% + \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces% +} +\Longstack{ +\notate{$\{ \ldots ,\;$ + \notate{s}{1}{$(c_1 :: s_1)$} + $, \; \ldots \}$ +}{1}{$S_{start}$} +} +\Longstack{ + $\stackrel{\backslash c_1}{\longrightarrow}$ +} +\Longstack{ + $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$} + $,\; \ldots \}$ +} +\Longstack{ + $\stackrel{\backslash c_2}{\longrightarrow}$ +} +\Longstack{ + $\{ \ldots,\; s_2 + ,\; \ldots \}$ +} +\Longstack{ + $ \xdashrightarrow{\backslash c_3\ldots\ldots} $ +} +\Longstack{ + \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = + S_{start}\backslash s$} +} +\end{center} +\begin{center} + $s \in S_{start} \iff [] \in S_{end}$ +\end{center} +\noindent +Brzozowski noticed that this operation can be ``mirrored" on regular expressions which he calls the derivative of a regular expression $r$ with respect to a character $c$, written -$r \backslash c$. -He defined this operation such that the following property holds: +$r \backslash c$. This infix operator +takes an original regular expression $r$ as input +and a character as a right operand and +outputs a result, which is a new regular expression. +The derivative operation on regular expression +is defined such that the language of the derivative result +coincides with the language of the original +regular expression's language being taken the language +derivative with respect to the same character: +\begin{center} +\parskip \baselineskip +\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}} +\def\rlwd{.5pt} +\newcommand\notate[3]{% + \unskip\def\useanchorwidth{T}% + \setbox0=\hbox{#1}% + \def\stackalignment{c}\stackunder[-6pt]{% + \def\stackalignment{c}\stackunder[-1.5pt]{% + \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{% + \rule{\rlwd}{#2\baselineskip}}}{% + \strut\kern8pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces% +} +\Longstack{ + \notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1, +\;\ldots\}$} +} +\Longstack{ + $\stackrel{\backslash c}{\longrightarrow}$ +} +\Longstack{ + \notate{$r\backslash c$}{2}{$L \; (r\backslash c)= + \{\ldots,\;s_1,\;\ldots\}$} +} +\end{center} \begin{center} \[ @@ -212,6 +322,10 @@ \] \end{center} \noindent +where we do derivatives on the regular expression +$r$ and test membership of $s$ by checking +whether the empty string is in the language of +$r\backslash s$. For example in the sequence case we have \begin{center} \begin{tabular}{lcl} diff -r 3e1b699696b6 -r f47fc4840579 ChengsongTanPhdThesis/main.tex --- a/ChengsongTanPhdThesis/main.tex Fri Aug 12 00:39:23 2022 +0100 +++ b/ChengsongTanPhdThesis/main.tex Sun Aug 14 09:44:27 2022 +0100 @@ -47,6 +47,11 @@ \usepackage{hyperref} \usepackage{lipsum} \usepackage[backend=bibtex]{biblatex} % Use the bibtex backend with the authoryear citation style (which resembles APA) + +\usepackage[usestackEOL]{stackengine} +\usepackage{scalerel} +\usepackage{graphicx} + %style=authoryear, natbib=true \usepackage{stmaryrd} \usepackage{caption} @@ -61,6 +66,22 @@ %\usepackage{algorithm} \usepackage{amsmath} +\makeatletter +\newcommand{\xleftrightarrow}[2][]{\ext@arrow 3359\leftrightarrowfill@{#1}{#2}} +\newcommand{\xdashrightarrow}[2][]{\ext@arrow 0359\rightarrowfill@@{#1}{#2}} +\newcommand{\xdashleftarrow}[2][]{\ext@arrow 3095\leftarrowfill@@{#1}{#2}} +\newcommand{\xdashleftrightarrow}[2][]{\ext@arrow 3359\leftrightarrowfill@@{#1}{#2}} +\def\rightarrowfill@@{\arrowfill@@\relax\relbar\rightarrow} +\def\leftarrowfill@@{\arrowfill@@\leftarrow\relbar\relax} +\def\leftrightarrowfill@@{\arrowfill@@\leftarrow\relbar\rightarrow} +\def\arrowfill@@#1#2#3#4{% + $\m@th\thickmuskip0mu\medmuskip\thickmuskip\thinmuskip\thickmuskip + \relax#4#1 + \xleaders\hbox{$#4#2$}\hfill + #3$% +} +\makeatother + \usepackage{amsthm} \usepackage{amssymb} \usepackage{cleveref} @@ -82,6 +103,7 @@ \DeclareCaptionType{mytype}[Illustration][] \newenvironment{envForCaption}{\captionsetup{type=mytype} }{} + %---------------------------------------------------------------------------------------- % MARGIN SETTINGS %----------------------------------------------------------------------------------------