lexing: changeset 577:f47fc4840579

--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Fri Aug 12 00:39:23 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Sun Aug 14 09:44:27 2022 +0100
@@ -9,20 +9,32 @@
 
 In this chapter we give a guarantee in terms of size: 
 given an annotated regular expression $a$, for any string $s$ 
-our algorithm's internal data structure 
-$\bderssimp{a}{s}$'s size
+our algorithm's internal annotated regular expression 
+size
 \begin{center}
 $\llbracket \bderssimp{a}{s} \rrbracket$
 \end{center}
 \noindent
 is finitely bounded
-by a constant $N_a$ that only depends on $a$.
+by a constant $N_a$ that only depends on $a$,
+where the size of an annotated regular expression is defined
+in terms of the number of nodes in its tree structure:
+\begin{center}
+\begin{tabular}{ccc}
+	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
+	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
+	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
+\end{tabular}
+\end{center}
 
 \section{Formalising About Size}
+\noindent
 In our lexer $\blexersimp$,
 The regular expression is repeatedly being taken derivative of
 and then simplified.
-
 \begin{center}
 \begin{tikzpicture}[scale=2,node distance=1.4cm,
                     every node/.style={minimum size=9mm}]
@@ -37,36 +49,48 @@
 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
-\node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$};
+\node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$};
 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
 \end{tikzpicture}
 \end{center}
 \noindent
 As illustrated in the picture, 
-each time the derivative is taken derivative of, it grows,
-and when it is being simplified, it shrinks.
-The blue ones are the regular expressions after simplification,
-which would be smaller than before.
+each time the regular expression 
+is taken derivative of, it grows (the black nodes),
+and after simplification, it shrinks (the blue nodes).
+This intuition is depicted by the relative size
+difference between the black and blue nodes.
 We give a mechanised proof that after simplification 
 the regular expression's size (the blue ones)
 $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
 a fixed $a$. 
-While it is not yet a direct formalisation of our lexer's complexity,
-it is a stepping stone towards a complexity proof because
-if the data structures out lexer has to traverse is large, the program
-will certainly be slow.
 
-The bound is not yet tight, and we seek to improve $N_a$ so that
+There are two problems with this finiteness proof, though.
+\begin{itemize}
+	\item
+First, It is not yet a direct formalisation of our lexer's complexity,
+as a complexity proof would require looking into 
+the time it takes to execute {\bf all} the operations
+involved in the lexer (simp, collect, decode), not just the derivative.
+\item
+Second, the bound is not yet tight, and we seek to improve $N_a$ so that
 it is polynomial on $\llbracket a \rrbracket$.
-We believe a formalisation of the complexity-related properties
-of the algorithm is important in our context, because we want to address
-catastrophic backtracking, which is not a correctness issue but
-in essence a performance issue, a formal proof can give
-the strongest assurance that such issues cannot arise 
-regardless of what the input is.
-This level of certainty cannot come from a pencil and paper proof or
-eimpirical evidence.
-
+\end{itemize}
+Still, we believe this size formalisation 
+of the algorithm is important in our context, because 
+\begin{itemize}
+	\item
+		Derivatives are the most important phases of our lexer algorithm.
+		Size properties about derivatives covers the majority of the algorithm
+		and is therefore a good indication of complexity of the entire program.
+	\item
+		What the size bound proof does ensure is an absence of 
+		catastrophic backtracking, which is prevalent in regular expression engines
+		in popular programming languages like Java.
+		We prove catastrophic backtracking cannot happen for {\bf all} inputs,
+		which is an advantage over work which 
+		only gives empirical evidence on some test cases.
+\end{itemize}
 For example, Sulzmann and Lu's made claimed that their algorithm
 with bitcodes and simplifications can lex in linear time with respect
 to the input string. This assumes that each
@@ -76,8 +100,9 @@
 of each derivative step to grow arbitrarily large.
 Here in our proof we state that such explosions cannot happen 
 with our simplification function.
-
-Here is a bird's eye view of how the proof of finiteness works: 
+\subsection{Overview of the Proof}
+Here is a bird's eye view of how the proof of finiteness works,
+which involves three steps:
 \begin{center}
   \begin{tikzpicture}[scale=1,font=\bf,
                       node/.style={
@@ -98,45 +123,45 @@
   \end{tikzpicture}
 \end{center}
 \noindent
-We explain the above steps one by one:
+We explain the steps one by one:
 \begin{itemize}
 \item
-We first use a new datatype, called $\textit{rrexp}$s, whose 
-inductive type definition and derivative and simplification operations are
+We first define a new datatype
+that is more straightforward to tweak 
+into the shape we want 
+compared with an annotated regular expression,
+called $\textit{rrexp}$s.
+Its inductive type definition and 
+derivative and simplification operations are
 almost identical to those of the annotated regular expressions,
 except that no bitcodes are attached.
-This new datatype is more straightforward to tweak 
-compared with an annotated regular expression.
 \item
 We have a set of equalities for this new datatype that enables one to
-rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc.
+rewrite $\rderssimp{r_1 \cdot r_2}{s}$ 
+and $\rderssimp{r^*}{s}$ (the most involved
+inductive cases)
 by a combinatioin of derivatives of their 
-children regular expressions ($r_1$, $r_2$, and $r$, respectively).
-These equalities are chained together to get into a shape
-that is very easy to estimate, which we call the \emph{Closed Forms}
+children regular expressions ($r_1$, $r_2$, and $r$, respectively),
+which we call the \emph{Closed Forms}
 of the derivatives.
 \item
-This closed form is controlled by terms that
-are easier to deal with, wich are in turn bounded loosely
+The Closed Forms of the regular expressions
+are controlled by terms that
+are easier to deal with.
+Using inductive hypothesis, these terms
+are in turn bounded loosely
 by a large yet constant number.
 \end{itemize}
 We give details of these steps in the next sections.
-The first step is to use something simpler than annotated regular expressions. 
+The first step is to use 
+$\textit{rrexp}$s,
+something simpler than
+annotated regular expressions. 
 
-\section{the $\mathbf{r}$-rexp datatype and the size functions}
+\section{the $\textit{rrexp}$ Datatype and Its Size Functions}
 We want to prove the size property about annotated regular expressions.
 The size is 
 written $\llbracket r\rrbracket$, whose intuitive definition is as below
-\begin{center}
-\begin{tabular}{ccc}
-	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
-	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
-	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
-$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
-$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
-$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
-\end{tabular}
-\end{center}
 \noindent
 We first note that $\llbracket \_ \rrbracket$
 is unaware of bitcodes, since

--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Fri Aug 12 00:39:23 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Sun Aug 14 09:44:27 2022 +0100
@@ -102,9 +102,37 @@
 However, for the purposes here, the $\textit{Ders}$ definition with 
 a single string is sufficient.
 
+The reason for defining derivatives
+is that it provides a different approach
+to test membership of a string in 
+a set of strings. 
+For example, to test whether the string
+$bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
+respect to the string $bar$:
+\begin{center}
+\begin{tabular}{lclll}
+	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
+	$\{ar, rak\}$ & 
+	$\stackrel{\backslash a}{\rightarrow}$ &
+	\\
+	$\{r \}$ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$ &
+	$\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
+\end{tabular}	
+\end{center}
+\noindent
+and in the end test whether the set
+has the empty string \footnote{ we use the infix notation $A\backslash c$
+	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
+on languages rather than regular expressions }.
+In general, if we have a language $S_{start}$,
+then we can test whether $s$ is in $S_{start}$
+by testing whether $[] \in S \backslash s$.
+
 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
 we have a  few properties of how the language derivative can be defined using 
 sub-languages.
+For example, for the sequence operator, we have
+something similar to the ``chain rule'' of the calculus derivative:
 \begin{lemma}
 \[
 	\Der \; c \; (A @ B) =
@@ -198,13 +226,95 @@
 %all strings in that set, Brzozowski defined a derivative operation on regular expressions
 %so that after derivative $L(r\backslash c)$ 
 %will look as if it was obtained by doing a language derivative on $L(r)$:
-Recall that the semantic derivative acts on a set of 
-strings. Brzozowski noticed that this operation
+Recall that the semantic derivative acts on a 
+language (set of strings).
+One can decide whether a string $s$ belongs
+to a language $S$ by taking derivative with respect to
+that string and then checking whether the empty 
+string is in the derivative:
+\begin{center}
+\parskip \baselineskip
+\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
+\def\rlwd{.5pt}
+\newcommand\notate[3]{%
+  \unskip\def\useanchorwidth{T}%
+  \setbox0=\hbox{#1}%
+  \def\stackalignment{c}\stackunder[-6pt]{%
+    \def\stackalignment{c}\stackunder[-1.5pt]{%
+      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
+    \rule{\rlwd}{#2\baselineskip}}}{%
+  \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
+}
+\Longstack{
+\notate{$\{ \ldots ,\;$ 
+	\notate{s}{1}{$(c_1 :: s_1)$} 
+	$, \; \ldots \}$ 
+}{1}{$S_{start}$} 
+}
+\Longstack{
+	$\stackrel{\backslash c_1}{\longrightarrow}$
+}
+\Longstack{
+	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$} 
+	$,\; \ldots \}$
+}
+\Longstack{
+	$\stackrel{\backslash c_2}{\longrightarrow}$ 
+}
+\Longstack{
+	$\{ \ldots,\;  s_2
+	,\; \ldots \}$
+}
+\Longstack{
+	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $	
+}
+\Longstack{
+	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = 
+	S_{start}\backslash s$}
+}
+\end{center}
+\begin{center}
+	$s \in S_{start} \iff [] \in S_{end}$
+\end{center}
+\noindent
+Brzozowski noticed that this operation
 can be ``mirrored" on regular expressions which
 he calls the derivative of a regular expression $r$
 with respect to a character $c$, written
-$r \backslash c$.
-He defined this operation such that the following property holds:
+$r \backslash c$. This infix operator
+takes an original regular expression $r$ as input
+and a character as a right operand and
+outputs a result, which is a new regular expression.
+The derivative operation on regular expression
+is defined such that the language of the derivative result 
+coincides with the language of the original 
+regular expression's language being taken the language 
+derivative with respect to the same character:
+\begin{center}
+\parskip \baselineskip
+\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
+\def\rlwd{.5pt}
+\newcommand\notate[3]{%
+  \unskip\def\useanchorwidth{T}%
+  \setbox0=\hbox{#1}%
+  \def\stackalignment{c}\stackunder[-6pt]{%
+    \def\stackalignment{c}\stackunder[-1.5pt]{%
+      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
+    \rule{\rlwd}{#2\baselineskip}}}{%
+  \strut\kern8pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
+}
+\Longstack{
+	\notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1, 
+\;\ldots\}$}
+}
+\Longstack{
+	$\stackrel{\backslash c}{\longrightarrow}$
+}
+\Longstack{
+	\notate{$r\backslash c$}{2}{$L \; (r\backslash c)=
+	\{\ldots,\;s_1,\;\ldots\}$}
+}
+\end{center}
 \begin{center}
 
 \[
@@ -212,6 +322,10 @@
 \]
 \end{center}
 \noindent
+where we do derivatives on the regular expression
+$r$ and test membership of $s$ by checking
+whether the empty string is in the language of 
+$r\backslash s$.
 For example in the sequence case we have 
 \begin{center}
 	\begin{tabular}{lcl}

--- a/ChengsongTanPhdThesis/main.tex	Fri Aug 12 00:39:23 2022 +0100
+++ b/ChengsongTanPhdThesis/main.tex	Sun Aug 14 09:44:27 2022 +0100
@@ -47,6 +47,11 @@
 \usepackage{hyperref}
 \usepackage{lipsum}
 \usepackage[backend=bibtex]{biblatex} % Use the bibtex backend with the authoryear citation style (which resembles APA)
+
+\usepackage[usestackEOL]{stackengine}
+\usepackage{scalerel}
+\usepackage{graphicx}
+
 %style=authoryear, natbib=true 
 \usepackage{stmaryrd}
 \usepackage{caption}
@@ -61,6 +66,22 @@
 
 %\usepackage{algorithm}
 \usepackage{amsmath}
+\makeatletter
+\newcommand{\xleftrightarrow}[2][]{\ext@arrow 3359\leftrightarrowfill@{#1}{#2}}
+\newcommand{\xdashrightarrow}[2][]{\ext@arrow 0359\rightarrowfill@@{#1}{#2}}
+\newcommand{\xdashleftarrow}[2][]{\ext@arrow 3095\leftarrowfill@@{#1}{#2}}
+\newcommand{\xdashleftrightarrow}[2][]{\ext@arrow 3359\leftrightarrowfill@@{#1}{#2}}
+\def\rightarrowfill@@{\arrowfill@@\relax\relbar\rightarrow}
+\def\leftarrowfill@@{\arrowfill@@\leftarrow\relbar\relax}
+\def\leftrightarrowfill@@{\arrowfill@@\leftarrow\relbar\rightarrow}
+\def\arrowfill@@#1#2#3#4{%
+  $\m@th\thickmuskip0mu\medmuskip\thickmuskip\thinmuskip\thickmuskip
+   \relax#4#1
+   \xleaders\hbox{$#4#2$}\hfill
+   #3$%
+}
+\makeatother
+
 \usepackage{amsthm}
 \usepackage{amssymb}
 \usepackage{cleveref}
@@ -82,6 +103,7 @@
 
 \DeclareCaptionType{mytype}[Illustration][]
 \newenvironment{envForCaption}{\captionsetup{type=mytype} }{}
+
 %----------------------------------------------------------------------------------------
 %	MARGIN SETTINGS
 %----------------------------------------------------------------------------------------

author	Chengsong
	Sun, 14 Aug 2022 09:44:27 +0100
changeset 577	f47fc4840579
parent 576	3e1b699696b6
child 578	e71a6e2aca2d

ChengsongTanPhdThesis/Chapters/Finite.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/Inj.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/main.tex		file \| annotate \| diff \| comparison \| revisions