lexing: ChengsongTanPhdThesis/Chapters/Chapter2.tex@213220f54a6e


% Chapter Template

\chapter{Chapter Title Here} % Main chapter title

\label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}




%----------------------------------------------------------------------------------------
%	SECTION 1
%----------------------------------------------------------------------------------------

\section{Properties of $\backslash c$}

To have a clear idea of what we can and 
need to prove about the algorithms involving
Brzozowski's derivatives, there are a few 
properties we need to be clear about .
\subsection{function $\backslash c$ is not 1-to-1}
\begin{center}
The derivative $w.r.t$ character $c$ is not one-to-one.
Formally,
	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
\end{center}
This property is trivially true for the
character regex example:
\begin{center}
	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
\end{center}
But apart from the cases where the derivative
output is $\ZERO$, are there non-trivial results
of derivatives which contain strings?
The answer is yes.
For example,
\begin{center}
	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
	where $a$ is not nullable.\\
	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
\end{center}
We start with two syntactically different regexes,
and end up with the same derivative result, which is
a "meaningful" regex because it contains strings.
We have rediscovered Arden's lemma:\\
\begin{center}
	$A^*B = A\cdot A^* \cdot B + B$
\end{center}

	
%-----------------------------------
%	SUBSECTION 1
%-----------------------------------
\subsection{Subsection 1}
To be completed.







%-----------------------------------
%	SECTION 2
%-----------------------------------

\section{Finiteness Property}
In this section let us sketch our argument for why the size of the simplified
derivatives with the aggressive simplification function can be finitely bounded.

For convenience, we use a new datatype which we call $\rrexp$ to denote
the difference between it and annotated regular expressions. 
\[			\rrexp ::=   \RZERO \mid  \RONE
			 \mid  \RCHAR{c}  
			 \mid  \RSEQ{r_1}{r_2}
			 \mid  \RALTS{rs}
			 \mid \RSTAR{r}        
\]
For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
but keep everything else intact.
It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
$\ALTS$).
We denote the operation of erasing the bits and turning an annotated regular expression 
into an $\rrexp{}$ as $\rerase{}$.
\begin{center}
\begin{tabular}{lcr}
$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
\end{tabular}
\end{center}
%TODO: FINISH definition of rerase
Similarly we could define the derivative  and simplification on 
$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
except that now they can operate on alternatives taking multiple arguments.

\begin{center}
\begin{tabular}{lcr}
$\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
(other clauses omitted)
\end{tabular}
\end{center}

Now that $\rrexp$s do not have bitcodes on them, we can do the 
duplicate removal without  an equivalence relation:
\begin{center}
\begin{tabular}{lcl}
$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
           			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
\end{tabular}
\end{center}
%TODO: definition of rsimp (maybe only the alternative clause)


The reason why these definitions  mirror precisely the corresponding operations
on annotated regualar expressions is that we can calculate sizes more easily:

\begin{lemma}
$\rsize{\rerase a} = \asize a$
\end{lemma}
A similar equation holds for annotated regular expressions' simplification
and the plain regular expressions' simplification:
\begin{lemma}
$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
\end{lemma}
Putting these two together we get a property that allows us to estimate the 
size of an annotated regular expression derivative using its un-annotated counterpart:
\begin{lemma}
$\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
\end{lemma}
Unless stated otherwise in this chapter all $\textit{rexp}$s without
 bitcodes are seen as $\rrexp$s.
 We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
 as the former suits people's intuitive way of stating a binary alternative
 regular expression.

 Suppose
we have a size function for bitcoded regular expressions, written
$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
(we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
there exists a bound $N$
such that 

\begin{center}
$\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$
\end{center}

\noindent
We prove this by induction on $r$. The base cases for $\AZERO$,
$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
%
\begin{center}
\begin{tabular}{lcll}
& & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
    [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
& $\leq$ &
    $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
    [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
             \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
      \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
\end{tabular}
\end{center}

% tell Chengsong about Indian paper of closed forms of derivatives

\noindent
where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
The reason why we could write the derivatives of a sequence this way is described in section 2.
The term (2) is used to control (1). 
That is because one can obtain an overall
smaller regex list
by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
Section 3 is dedicated to its proof.
In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
We reason similarly for  $\STAR$.\medskip

\noindent
Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
far from the actual bound we can expect. We can do better than this, but this does not improve
the finiteness property we are proving. If we are interested in a polynomial bound,
one would hope to obtain a similar tight bound as for partial
derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
 $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
partial derivatives). Unfortunately to obtain the exact same bound would mean
we need to introduce simplifications, such as
 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
which exist for partial derivatives. However, if we introduce them in our
setting we would lose the POSIX property of our calculated values. We leave better
bounds for future work.\\[-6.5mm]



%-----------------------------------
%	SECTION ?
%-----------------------------------
\section{preparatory properties for getting a finiteness bound}
Before we get to the proof that says the intermediate result of our lexer will
remain finitely bounded, which is an important efficiency/liveness guarantee,
we shall first develop a few preparatory properties and definitions to 
make the process of proving that a breeze.

We define rewriting relations for $\rrexp$s, which allows us to do the 
same trick as we did for the correctness proof,
but this time we will have stronger equalities established.
\subsection{"hrewrite" relation}
List of simplifications for regular expressions simplification without bitcodes:
\




%----------------------------------------------------------------------------------------
%	SECTION ??
%----------------------------------------------------------------------------------------

\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
To embark on getting the "closed forms" of regexes, we first
define a few auxiliary definitions to make expressing them smoothly.

 \begin{center}  
 \begin{tabular}{ccc}
 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
$\sflataux r$ & $=$ & $ [r]$
\end{tabular}
\end{center}
The intuition behind $\sflataux{\_}$ is to break up nested regexes 
of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
It will return the singleton list $[r]$ otherwise.

$\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
the output type a regular expression, not a list:
 \begin{center} 
 \begin{tabular}{ccc}
 $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
$\sflat r$ & $=$ & $ [r]$
\end{tabular}
\end{center}
$\sflataux{\_}$  and $\sflat{\_}$ is only recursive in terms of the
 first element of the list of children of
an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for  dealing with the regular expression
$r_1 \cdot r_2 \backslash s$.

With $\sflat{\_}$ and $\sflataux{\_}$ we make 
precise what  "closed forms" we have for the sequence derivatives and their simplifications,
in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
and $\bderssimp{(r_1\cdot r_2)}{s}$,
if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
and  $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$  ranges over 
the substring of $s$?
First let's look at a series of derivatives steps on a sequence 
regular expression,  (assuming) that each time the first
component of the sequence is always nullable):
\begin{center}

$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
 \ldots$

\end{center}
%TODO: cite indian paper
Indianpaper have  come up with a slightly more formal way of putting the above process:
\begin{center}
$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
\end{center}
where  $\delta(b, r)$ will produce $r$
if $b$ evaluates to true, 
and $\ZERO$ otherwise.

 But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
 equivalence. To make this intuition useful 
 for a formal proof, we need something
stronger than language equivalence.
With the help of $\sflat{\_}$ we can state the equation in Indianpaper
more rigorously:
\begin{lemma}
$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
\end{lemma}

The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:

\begin{center}
\begin{tabular}{lcl}
$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
                                     && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
\end{tabular}
\end{center}                   
It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
the entire result of  $(r_1 \cdot r_2) \backslash s$, 
it only stores all the $s''$ such that $r_2 \backslash s''$  are occurring terms in $(r_1\cdot r_2)\backslash s$.

With this we can also add simplifications to both sides and get
\begin{lemma}
$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
\end{lemma}
Together with the idempotency property of $\rsimp$,
%TODO: state the idempotency property of rsimp
\begin{lemma}
$\rsimp(r) = \rsimp (\rsimp(r))$
\end{lemma}
it is possible to convert the above lemma to obtain a "closed form"
for  derivatives nested with simplification:
\begin{lemma}
$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
\end{lemma}
And now the reason we have (1) in section 1 is clear:
$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, 
is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
    as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.

%----------------------------------------------------------------------------------------
%	SECTION 3
%----------------------------------------------------------------------------------------

\section{interaction between $\distinctWith$ and $\flts$}
Note that it is not immediately obvious that 
\begin{center}
$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
\end{center}
The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 

%----------------------------------------------------------------------------------------
%	SECTION 4
%----------------------------------------------------------------------------------------
\section{a bound for the star regular expression}
We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
the property of the $\distinct$ function.
Now we try to get a bound on $r^* \backslash s$ as well.
Again, we first look at how a star's derivatives evolve, if they grow maximally: 
\begin{center}

$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
\quad \ldots$

\end{center}
When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
count the possible size explosions of $r \backslash c$ themselves.

Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
%TODO: definitions of  and \hflataux \hflat
 \begin{center}  
 \begin{tabular}{ccc}
 $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
$\hflataux r$ & $=$ & $ [r]$
\end{tabular}
\end{center}

 \begin{center}  
 \begin{tabular}{ccc}
 $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
$\hflat r$ & $=$ & $ r$
\end{tabular}
\end{center}
Again these definitions are tailor-made for dealing with alternatives that have
originated from a star's derivatives, so we don't attempt to open up all possible 
regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
elements.
We give a predicate for such "star-created" regular expressions:
\begin{center}
\begin{tabular}{lcr}
         &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
 \end{tabular}
 \end{center}
 
 These definitions allows us the flexibility to talk about 
 regular expressions in their most convenient format,
 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
 instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
 These definitions help express that certain classes of syntatically 
 distinct regular expressions are actually the same under simplification.
 This is not entirely true for annotated regular expressions: 
 %TODO: bsimp bders \neq bderssimp
 \begin{center}
 $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
 \end{center}
 For bit-codes, the order in which simplification is applied
 might cause a difference in the location they are placed.
 If we want something like
 \begin{center}
 $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
 \end{center}
 Some "canonicalization" procedure is required,
 which either pushes all the common bitcodes to nodes
  as senior as possible:
  \begin{center}
  $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
  \end{center}
 or does the reverse. However bitcodes are not of interest if we are talking about
 the $\llbracket r \rrbracket$ size of a regex.
 Therefore for the ease and simplicity of producing a
 proof for a size bound, we are happy to restrict ourselves to 
 unannotated regular expressions, and obtain such equalities as
 \begin{lemma}
 $\rsimp{(r_1 + r_2)} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
 \end{lemma}
 
 \begin{proof}
 By using the rewriting relation $\rightsquigarrow$
 \end{proof}
 %TODO: rsimp sflat
And from this we obtain a proof that a star's derivative will be the same
as if it had all its nested alternatives created during deriving being flattened out:
 For example,
 \begin{lemma}
 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
 \end{lemma}
 \begin{proof}
 By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
 \end{proof}
% The simplification of a flattened out regular expression, provided it comes
%from the derivative of a star, is the same as the one nested.
 
 
 
 
 
 
 
 
 
One might wonder the actual bound rather than the loose bound we gave
for the convenience of a easier proof.
How much can the regex $r^* \backslash s$ grow? As hinted earlier,
 they can grow at a maximum speed
  exponential $w.r.t$ the number of characters, 
but will eventually level off when the string $s$ is long enough,
as suggested by the finiteness bound proof.
And unfortunately we have concrete examples
where such regexes grew exponentially large before levelling off:
$(a ^ * + (a + aa) ^ * + (a + aa + aaa) ^ * + \ldots + 
(a+ aa + aaa+\ldots \underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
size that is  exponential on the number $n$.
%TODO: give out a graph showing how the size of the regex grows and levels off at a size exponential to the original regex


While the tight upper bound will definitely be a lot lower than 
the one we gave for the finiteness proof, 
it will still be at least expoential, under our current simplification rules.

This suggests stronger simplifications that keep the tight bound polynomial.

%----------------------------------------------------------------------------------------
%	SECTION 5
%----------------------------------------------------------------------------------------
\section{a stronger version of simplification}
%TODO: search for isabelle proofs of algorithms that check equivalence
author	Chengsong
	Sun, 08 May 2022 15:29:17 +0100
changeset 507	213220f54a6e
parent 506	69ad05398894
child 514	036600af4c30
permissions	-rwxr-xr-x