--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Wed Oct 12 14:08:06 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Sun Nov 06 23:05:47 2022 +0000
@@ -7,53 +7,68 @@
%of our bitcoded algorithm, that is a finite bound on the size of any
%regex's derivatives.
-\begin{figure}
-\begin{center}
- \begin{tabular}{ccc}
- $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
- $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
- $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
- $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
- $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\
- $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
- \end{tabular}
-\end{center}
-\caption{The size function of bitcoded regular expressions}\label{brexpSize}
-\end{figure}
-In this chapter we give a guarantee in terms of size:
+
+In this chapter we give a guarantee in terms of size of derivatives:
given an annotated regular expression $a$, for any string $s$
our algorithm $\blexersimp$'s internal annotated regular expression
size is finitely bounded
-by a constant $N_a$ that only depends on $a$:
+by a constant that only depends on $a$.
+Formally we show that there exists an $N_a$ such that
\begin{center}
$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{center}
\noindent
-where the size of an annotated regular expression is defined
-in terms of the number of nodes in its tree structure (see figure \ref{brexpSize}).
+where the size ($\llbracket \_ \rrbracket$) of
+an annotated regular expression is defined
+in terms of the number of nodes in its
+tree structure (its recursive definition given in the next page).
We believe this size bound
is important in the context of POSIX lexing, because
\begin{itemize}
\item
- It is a stepping stone towards an ``absence of catastrophic-backtracking''
- guarantee.
- If the internal data structures used by our algorithm cannot grow very large,
- then our algorithm (which traverses those structures) cannot be too slow.
+ It is a stepping stone towards the goal
+ of eliminating ``catastrophic backtracking''.
+ If the internal data structures used by our algorithm
+ grows beyond a finite bound, then clearly
+ the algorithm (which traverses these structures) will
+ be slow.
The next step would be to refine the bound $N_a$ so that it
is polynomial on $\llbracket a\rrbracket$.
\item
- Having it formalised gives us a higher confidence that
+ Having the finite bound formalised
+ gives us a higher confidence that
our simplification algorithm $\simp$ does not ``mis-behave''
like $\simpsulz$ does.
- The bound is universal, which is an advantage over work which
- only gives empirical evidence on some test cases.
+ The bound is universal for a given regular expression,
+ which is an advantage over work which
+ only gives empirical evidence on some test cases (see Verbatim++\cite{Verbatimpp}).
\end{itemize}
+In the next section we describe in more detail
+what the finite bound means in our algorithm
+and why the size of the internal data structures of
+a typical derivative-based lexer such as
+Sulzmann and Lu's need formal treatment.
+
\section{Formalising About Size}
\noindent
In our lexer ($\blexersimp$),
we take an annotated regular expression as input,
-and repeately take derivative of and simplify it:
-\begin{figure}[H]
+and repeately take derivative of and simplify it.
+\begin{figure}
+ \begin{center}
+ \begin{tabular}{lcl}
+ $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
+ $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
+ $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
+ $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+ $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\
+ $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
+ \end{tabular}
+ \end{center}
+ \caption{The size function of bitcoded regular expressions}\label{brexpSize}
+\end{figure}
+
+\begin{figure}
\begin{tikzpicture}[scale=2,
every node/.style={minimum size=11mm},
->,>=stealth',shorten >=1pt,auto,thick
@@ -79,15 +94,16 @@
\end{tikzpicture}
\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
\end{figure}
+
\noindent
Each time
a derivative is taken, the regular expression might grow.
However, the simplification that is immediately afterwards will always shrink it so that
-it stays small.
+it stays relatively small.
This intuition is depicted by the relative size
change between the black and blue nodes:
After $\simp$ the node always shrinks.
-Our proof says that all the blue nodes
+Our proof states that all the blue nodes
stay below a size bound $N_a$ determined by the input $a$.
\noindent
@@ -126,7 +142,7 @@
\noindent
The picture means that on certain cases their lexer (where they use $\simpsulz$
as the simplification function)
-will have an indefinite size explosion, causing the running time
+will have a size explosion, causing the running time
of each derivative step to grow continuously (for example
in \ref{SulzmannLuLexerTime}).
They tested out the run time of their
@@ -137,13 +153,13 @@
Before delving in to the details of the formalisation,
we are going to provide an overview of it.
-In the next subsection, we draw a picture of the bird's eye view
+In the next subsection, we give a high-level view
of the proof.
\subsection{Overview of the Proof}
-Here is a bird's eye view of the main components of the finiteness proof,
-which involves three steps:
+A high-level overview of the main components of the finiteness proof
+is as follows:
\begin{figure}[H]
\begin{tikzpicture}[scale=1,font=\bf,
node/.style={
@@ -179,21 +195,23 @@
\item
We first introduce the operations such as
derivatives, simplification, size calculation, etc.
- associated with $\rrexp$s, which we have given
- a very brief introduction to in chapter \ref{Bitcoded2}.
+ associated with $\rrexp$s, which we have introduced
+ in chapter \ref{Bitcoded2}.
The operations on $\rrexp$s are identical to those on
- annotated regular expressions except that they are unaware
- of bitcodes. This means that all proofs about size of $\rrexp$s will apply to
- annotated regular expressions.
+ annotated regular expressions except that they dispense with
+ bitcodes. This means that all proofs about size of $\rrexp$s will apply to
+ annotated regular expressions, because the size of a regular
+ expression is independent of the bitcodes.
\item
We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$,
where $\textit{ClosedForm}(r, s)$ is entirely
- written in the derivatives of their children regular
+ given as the derivatives of their children regular
expressions.
We call the right-hand-side the \emph{Closed Form}
of the derivative $\rderssimp{r}{s}$.
\item
- We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
+ Formally we give an estimate of
+ $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
The key observation is that $\distinctBy$'s output is
a list with a constant length bound.
\end{itemize}
@@ -202,14 +220,14 @@
\section{The $\textit{Rrexp}$ Datatype}
The first step is to define
$\textit{rrexp}$s.
-They are without bitcodes,
+They are annotated regular expressions without bitcodes,
allowing a much simpler size bound proof.
-Of course, the bits which encode the lexing information
-would grow linearly with respect
-to the input, which should be taken into account when we wish to tackle the runtime comlexity.
-But for the sake of the structural size
-we can safely ignore them.\\
-To recapitulate, the datatype
+%Of course, the bits which encode the lexing information
+%would grow linearly with respect
+%to the input, which should be taken into account when we wish to tackle the runtime comlexity.
+%But for the sake of the structural size
+%we can safely ignore them.\\
+The datatype
definition of the $\rrexp$, called
\emph{r-regular expressions},
was initially defined in \ref{rrexpDef}.
@@ -226,7 +244,7 @@
written $\llbracket r\rrbracket_r$,
whose definition mirrors that of an annotated regular expression.
\begin{center}
- \begin{tabular}{ccc}
+ \begin{tabular}{lcl}
$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
@@ -254,27 +272,31 @@
\end{center}
\subsection{Why a New Datatype?}
-The reason we take all the trouble
-defining a new datatype is that $\erase$ makes things harder.
+The reason we
+define a new datatype is that
+the $\erase$ function
+does not preserve the structure of annotated
+regular expressions.
We initially started by using
plain regular expressions and tried to prove
-the lemma \ref{rsizeAsize},
-however the $\erase$ function unavoidbly messes with the structure of the
+lemma \ref{rsizeAsize},
+however the $\erase$ function messes with the structure of the
annotated regular expression.
The $+$ constructor
-of basic regular expressions is binary whereas $\sum$
-takes a list, and one has to convert between them:
+of basic regular expressions is only binary, whereas $\sum$
+takes a list. Therefore we need to convert between
+annotated and normal regular expressions as follows:
\begin{center}
- \begin{tabular}{ccc}
+ \begin{tabular}{lcl}
$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
\end{tabular}
\end{center}
\noindent
-An alternative regular expression with an empty argument list
+As can be seen alternative regular expression with an empty argument list
will be turned into a $\ZERO$.
-The singleton alternative $\sum [r]$ would have $r$ during the
+The singleton alternative $\sum [r]$ becomes $r$ during the
$\erase$ function.
The annotated regular expression $\sum[a, b, c]$ would turn into
$(a+(b+c))$.
@@ -284,7 +306,7 @@
For example, if we define the size of a basic plain regular expression
in the usual way,
\begin{center}
- \begin{tabular}{ccc}
+ \begin{tabular}{lcl}
$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
@@ -303,7 +325,7 @@
only the bitcodes are thrown away.
Everything about the structure remains intact.
Therefore it does not change the size
-of an annotated regular expression:
+of an annotated regular expression and we have:
\begin{lemma}\label{rsizeAsize}
$\rsize{\rerase a} = \asize a$
\end{lemma}
@@ -317,17 +339,17 @@
but we found our approach more straightforward.\\
\subsection{Functions for R-regular Expressions}
-We shall define the r-regular expression version
-of $\blexer$ and $\blexersimp$ related functions.
+In this section we shall define the r-regular expression version
+of $\blexer$, and $\blexersimp$ related functions.
We use $r$ as the prefix or subscript to differentiate
with the bitcoded version.
-For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
-as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
-As promised, they are much simpler than their bitcoded counterparts.
+%For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
+%as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
+%As promised, they are much simpler than their bitcoded counterparts.
%The operations on r-regular expressions are
%almost identical to those of the annotated regular expressions,
%except that no bitcodes are used. For example,
-The derivative operation becomes simpler:\\
+The derivative operation for an r-regular expression is\\
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
@@ -347,10 +369,12 @@
\end{tabular}
\end{center}
\noindent
-Similarly, $\distinctBy$ does not need
+where we omit the definition of $\textit{rnullable}$.
+
+The function $\distinctBy$ for r-regular expressions does not need
a function checking equivalence because
-there are no bit annotations causing superficial differences
-between syntactically equal terms.
+there are no bit annotations.
+Therefore we have
\begin{center}
\begin{tabular}{lcl}
$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
@@ -362,14 +386,14 @@
\end{center}
%TODO: definition of rsimp (maybe only the alternative clause)
\noindent
-We would like to make clear
-a difference between our $\rdistincts$ and
-the Isabelle $\textit {distinct}$ predicate.
-In Isabelle $\textit{distinct}$ is a function that returns a boolean
-rather than a list.
-It tests if all the elements of a list are unique.\\
-With $\textit{rdistinct}$,
-and the flatten function for $\rrexp$s:
+%We would like to make clear
+%a difference between our $\rdistincts$ and
+%the Isabelle $\textit {distinct}$ predicate.
+%In Isabelle $\textit{distinct}$ is a function that returns a boolean
+%rather than a list.
+%It tests if all the elements of a list are unique.\\
+With $\textit{rdistinct}$ in place,
+the flatten function for $\rrexp$ is as follows:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{rflts} \; (\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $as \; @ \; \textit{rflts} \; as' $ \\
@@ -378,8 +402,8 @@
\end{tabular}
\end{center}
\noindent
-one can chain together all the other modules
-such as $\rsimpalts$:
+The function
+$\rsimpalts$ corresponds to $\textit{bsimp}_{ALTS}$:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\rsimpalts \;\; nil$ & $\dn$ & $\RZERO$\\
@@ -388,7 +412,7 @@
\end{tabular}
\end{center}
\noindent
-and $\rsimpseq$:
+Similarly, we have $\rsimpseq$ which corresponds to $\textit{bsimp}_{SEQ}$:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\rsimpseq \;\; \RZERO \; \_ $ & $=$ & $\RZERO$\\
@@ -420,8 +444,7 @@
\end{center}
\noindent
We do not define an r-regular expression version of $\blexersimp$,
-as our proof does not involve its use
-(and there is no bitcode to decode into a lexical value).
+as our proof does not depend on it.
Now we are ready to introduce how r-regular expressions allow
us to prove the size bound on bitcoded regular expressions.
@@ -431,7 +454,7 @@
can be calculated via the size of r-regular expressions after the application
of $\rsimp$ and $\backslash_{rsimps}$:
\begin{lemma}\label{sizeRelations}
- The following equalities hold:
+ The following two equalities hold:
\begin{itemize}
\item
$\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$
@@ -456,13 +479,13 @@
implies
$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$.
\end{center}
-From now on we
-Unless stated otherwise in the rest of this
-chapter all regular expressions without
-bitcodes are seen as r-regular expressions ($\rrexp$s).
-For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
-we use the notation $r_1 + r_2$
-for brevity.
+%From now on we
+%Unless stated otherwise in the rest of this
+%chapter all regular expressions without
+%bitcodes are seen as r-regular expressions ($\rrexp$s).
+%For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
+%we use the notation $r_1 + r_2$
+%for brevity.
%-----------------------------------
@@ -521,15 +544,16 @@
$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
and therefore $N_{r_1}$ and $N_{r_2}$ in the
inductive hypotheses cannot be directly used.
-We have already seen that $(r_1 \cdot r_2)\backslash s$
-and $(r^*)\backslash s$ can grow in a wild way.
+%We have already seen that $(r_1 \cdot r_2)\backslash s$
+%and $(r^*)\backslash s$ can grow in a wild way.
-The point is that they will be equivalent to a list of
+The point however, is that they will be equivalent to a list of
terms $\sum rs$, where each term in $rs$ will
be made of $r_1 \backslash s' $, $r_2\backslash s'$,
-and $r \backslash s'$ with $s' \in \textit{SubString} \; s$.
+and $r \backslash s'$ with $s' \in \textit{SubString} \; s$ (which stands
+for the set of substrings of $s$).
The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$
-in the simplification which saves $rs$ from growing indefinitely.
+in the simplification, which prevents the $rs$ from growing indefinitely.
Based on this idea, we develop a proof in two steps.
First, we show the equality (where
@@ -547,23 +571,45 @@
\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
Second, we will bound the closed form of r-regular expressions
using some estimation techniques
-and then piece it together
-with lemma \ref{sizeRelations} to show the bitcoded regular expressions
+and then apply
+lemma \ref{sizeRelations} to show that the bitcoded regular expressions
in our $\blexersimp$ are finitely bounded.
-We will flesh out the first step of the proof we
-sketched just now in the next section.
+We will describe in detail the first step of the proof
+in the next section.
\section{Closed Forms}
In this section we introduce in detail
-how the closed forms are obtained for regular expressions'
-derivatives.
+how we express the string derivatives
+of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string
+rather than a single character) in a different way than
+our previous definition.
+In previous chapters, the derivative of a
+regular expression $r$ w.r.t a string $s$
+was recursively defined on the string:
+\begin{center}
+ $r \backslash_s (c::s) \dn (r \backslash c) \backslash_s s$
+\end{center}
+The problem is that
+this definition does not provide much information
+on what $r \backslash_s s$ looks like.
+If we are interested in the size of a derivative like
+$(r_1 \cdot r_2)\backslash s$,
+we have to somehow get a more concrete form to begin.
+We call such more concrete representations the ``closed forms'' of
+string derivatives as opposed to their original definitions.
+The terminology ``closed form'' is borrowed from mathematics,
+which usually describe expressions that are solely comprised of
+well-known and easy-to-compute operations such as
+additions, multiplications, exponential functions.
+
We start by proving some basic identities
involving the simplification functions for r-regular expressions.
After that we introduce the rewrite relations
$\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$
$\rightsquigarrow_f$ and $\rightsquigarrow_g$.
-These relations involves similar techniques in chapter \ref{Bitcoded2}.
+These relations involves similar techniques as in chapter \ref{Bitcoded2}
+for annotated regular expressions.
Finally, we use these identities to establish the
closed forms of the alternative regular expression,
the sequence regular expression, and the star regular expression.
@@ -573,11 +619,9 @@
\subsection{Some Basic Identities}
-We now introduce lemmas
-that are repeatedly used in later proofs.
-Note that for the $\textit{rdistinct}$ function there
-will be a lot of conversion from lists to sets.
-We use $set$ to refere to the
+In what follows we will often convert between lists
+and sets.
+We use Isabelle's $set$ to refere to the
function that converts a list $rs$ to the set
containing all the elements in $rs$.
\subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication}
@@ -589,8 +633,7 @@
%The function $\textit{rdistinct}$ satisfies the following
%properties:
Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
- recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.}
- readily defined
+ recursive definition here. Its Isabelle counterpart would be $\textit{distinct}$.}
for testing
whether a list's elements are unique. Then the following
properties about $\textit{rdistinct}$ hold:
@@ -675,12 +718,13 @@
\begin{proof}
By induction on $rs$. All other variables are allowed to be arbitrary.
The second part of the lemma requires the first.
- Note that for each part, the two sub-propositions need to be proven concurrently,
+ Note that for each part, the two sub-propositions need to be proven
+ at the same time,
so that the induction goes through.
\end{proof}
\noindent
This allows us to prove a few more equivalence relations involving
-$\textit{rdistinct}$ (it will be useful later):
+$\textit{rdistinct}$ (they will be useful later):
\begin{lemma}\label{rdistinctConcatGeneral}
\mbox{}
\begin{itemize}
@@ -723,7 +767,7 @@
\end{proof}
\noindent
$\textit{rdistinct}$ needs to be applied only once, and
-applying it multiple times does not cause any difference:
+applying it multiple times does not make any difference:
\begin{corollary}\label{distinctOnceEnough}
$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; (rdistinct \;
rs \{ \} @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
@@ -748,7 +792,7 @@
%functions on two lists separately, and then concatenating them together.
$\textit{Rflts}$ is composable in terms of concatenation:
\begin{lemma}\label{rfltsProps}
- The function $\rflts$ has the below properties:\\
+ The function $\rflts$ has the properties below:\\
\begin{itemize}
\item
$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
@@ -779,15 +823,18 @@
\noindent
Now we introduce the property that the operations
derivative and $\rsimpalts$
-commute, this will be used later in deriving the closed form for
+commute, this will be used later on, when deriving the closed form for
the alternative regular expression:
\begin{lemma}\label{rderRsimpAltsCommute}
$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
\end{lemma}
+\begin{proof}
+ By induction on $rs$.
+\end{proof}
\noindent
\subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}
-Much like the definition of $L$ on plain regular expressions, one could also
+Much like the definition of $L$ on plain regular expressions, one can also
define the language interpretation of $\rrexp$s.
\begin{center}
\begin{tabular}{lcl}
@@ -820,10 +867,12 @@
\subsubsection{Simplified $\textit{Rrexp}$s are Good}
We formalise the notion of ``good" regular expressions,
which means regular expressions that
-are not fully simplified. For alternative regular expressions that means they
-do not contain any nested alternatives like
-\[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\]
-or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]:
+are fully simplified in terms of our $\textit{rsimp}$ function.
+For alternative regular expressions that means they
+do not contain any nested alternatives, un-eliminated $\RZERO$s
+or duplicate elements (for example,
+$r_1 + (r_2 + r_3)$, $\RZERO + r$ and $ \sum [r, r, \ldots]$).
+The clauses for $\good$ are:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\good\; \RZERO$ & $\dn$ & $\textit{false}$\\
@@ -833,7 +882,7 @@
$\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\
$\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ &
$\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\
- & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \textit{and}\; \, \textit{nonAlt}\; r')$\\
+ & & $\land \; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \land \; \, \textit{nonAlt}\; r')$\\
$\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\
$\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\
$\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\
@@ -842,7 +891,8 @@
\end{tabular}
\end{center}
\noindent
-The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an
+We omit the recursive definition of the predicate $\textit{nonAlt}$,
+which evaluates to true when the regular expression is not an
alternative, and false otherwise.
The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$,
@@ -879,10 +929,13 @@
which enables $\rsimp$ to be non-nested:
\begin{lemma}\label{nonnestedRsimp}
- $\nonnested \; (\rsimp{r})$
+ It is always the case that
+ \begin{center}
+ $\nonnested \; (\rsimp{r})$
+ \end{center}
\end{lemma}
\begin{proof}
- By an induction on $r$.
+ By induction on $r$.
\end{proof}
\noindent
With this we could prove that a regular expressions
@@ -896,13 +949,13 @@
By \ref{nonnestedRsimp}.
\end{proof}
\noindent
-The other thing we know is that once $\rsimp{}$ had finished
+The other fact we know is that once $\rsimp{}$ has finished
processing an alternative regular expression, it will not
-contain any $\RZERO$s, this is because all the recursive
+contain any $\RZERO$s. This is because all the recursive
calls to the simplification on the children regular expressions
-make the children good, and $\rflts$ will not take out
+make the children good, and $\rflts$ will not delete
any $\RZERO$s out of a good regular expression list,
-and $\rdistinct{}$ will not mess with the result.
+and $\rdistinct{}$ will not ``mess'' with the result.
\begin{lemma}\label{flts3Obv}
The following are true:
\begin{itemize}
@@ -931,7 +984,7 @@
\end{lemma}
\begin{proof}
By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$.
- Lemma \ref{rsimpSize} says that
+ Lemma \ref{rsimpMono} says that
$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
$\llbracket r \rrbracket_r$.
Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
@@ -953,7 +1006,7 @@
\end{proof}
\noindent
Now we are ready to prove that good regular expressions are invariant
-of $\rsimp{}$ application:
+with respect to $\rsimp{}$:
\begin{lemma}\label{test}
If $\good \;r$ then $\rsimp{r} = r$.
\end{lemma}
@@ -964,16 +1017,19 @@
case where 2 or more elements are present in the list.
\end{proof}
\noindent
-Given below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$,
-which requires $\ref{good1}$ to go through smoothly.
-It says that an application of $\rsimp_{ALTS}$ can be "absorbed",
-if it its output is concatenated with a list and then applied to $\rflts$.
+Given below is a property involving $\rflts$, $\textit{rdistinct}$,
+$\rsimp{}$ and $\rsimp_{ALTS}$,
+which requires $\ref{good1}$ to go through smoothly:
\begin{lemma}\label{flattenRsimpalts}
+An application of $\rsimp_{ALTS}$ can be ``absorbed'',
+if its output is concatenated with a list and then applied to $\rflts$.
+\begin{center}
$\rflts \; ( (\rsimp_{ALTS} \;
(\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) ::
\map \; \rsimp{} \; rs' ) =
\rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ (
\map \; \rsimp{rs'}))$
+\end{center}
\end{lemma}
@@ -993,7 +1049,7 @@
forms so that key steps allowing further rewriting to closed forms
are possible.
\begin{lemma}\label{rsimpIdem}
- $\rsimp{r} = \rsimp{\rsimp{r}}$
+ $\rsimp{r} = \rsimp{(\rsimp{r})}$
\end{lemma}
\begin{proof}
@@ -1005,7 +1061,7 @@
our definition of $\blexersimp$.
-On the other hand, we could repeat the same $\rsimp{}$ applications
+On the other hand, we can repeat the same $\rsimp{}$ applications
on regular expressions as many times as we want, if we have at least
one simplification applied to it, and apply it wherever we would like to:
\begin{corollary}\label{headOneMoreSimp}
@@ -1019,8 +1075,8 @@
\end{itemize}
\end{corollary}
\noindent
-This will be useful in later closed form proof's rewriting steps.
-Similarly, we point out the following useful facts below:
+This will be useful in the later closed form proof's rewriting steps.
+Similarly, we state the following useful facts below:
\begin{lemma}
The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
\begin{itemize}
@@ -1041,10 +1097,18 @@
we can start proving some key equalities leading to the
closed forms.
Now presented are a few equivalent terms under $\rsimp{}$.
-We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$.
+To make the notation more concise
+We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$:
+\begin{center}
+\begin{tabular}{lcl}
+ $a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
+\end{tabular}
+\end{center}
+\noindent
+%\vspace{0em}
\begin{lemma}
+ The following equivalence hold:
\begin{itemize}
- The following equivalence hold:
\item
$\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$
\item
@@ -1063,7 +1127,7 @@
\noindent
The above allows us to prove
two similar equalities (which are a bit more involved).
-It says that we could flatten out the elements
+It says that we could flatten the elements
before simplification and still get the same result.
\begin{lemma}\label{simpFlatten3}
One can flatten the inside $\sum$ of a $\sum$ if it is being
@@ -1099,7 +1163,7 @@
\noindent
-We need more equalities like the above to enable a closed form,
+We need more equalities like the above to enable a closed form lemma,
for which we need to introduce a few rewrite relations
to help
us obtain them.
@@ -1211,11 +1275,11 @@
operations}\label{gRewrite}
\end{figure}
\noindent
-We defined
+We define
two separate list rewriting relations $\frewrite$ and $\grewrite$.
The rewriting steps that take place during
flattening is characterised by $\frewrite$.
-$\grewrite$ characterises both flattening and de-duplicating.
+The rewrite relation $\grewrite$ characterises both flattening and de-duplicating.
Sometimes $\grewrites$ is slightly too powerful
so we would rather use $\frewrites$ to prove
%because we only
@@ -1236,19 +1300,24 @@
\end{lemma}
\noindent
and then the equivalence between two terms
-that can reduce in many steps to each other.
+that can reduce in many steps to each other:
\begin{lemma}\label{frewritesSimpeq}
If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal
\sum (\rDistinct \; rs_2 \; \varnothing)$.
\end{lemma}
\noindent
+These two lemmas can both be proven using a straightforward induction (and
+the proofs for them are therefore omitted).
+Now the above equalities can be derived like a breeze:
\begin{corollary}
$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
\sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
$
\end{corollary}
-
+\begin{proof}
+ By lemmas \ref{earlyLaterDerFrewrites} and \ref{frewritesSimpeq}.
+\end{proof}
But this trick will not work for $\grewrites$.
For example, a rewriting step in proving
closed forms is:
@@ -1258,7 +1327,7 @@
$\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $
\noindent
\end{center}
-For this one would hope to have a rewriting relation between the two lists involved,
+For this, one would hope to have a rewriting relation between the two lists involved,
similar to \ref{earlyLaterDerFrewrites}. However, it turns out that
\begin{center}
$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \;
@@ -1267,7 +1336,7 @@
\noindent
does $\mathbf{not}$ hold in general.
For this rewriting step we will introduce some slightly more cumbersome
-proof technique in later sections.
+proof technique later.
The point is that $\frewrite$
allows us to prove equivalence in a straightforward way that is
not possible for $\grewrite$.
@@ -1275,12 +1344,12 @@
\subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
In this part, we present lemmas stating
-pairs of r-regular expressions or r-regular expression lists
+pairs of r-regular expressions and r-regular expression lists
where one could rewrite from one in many steps to the other.
Most of the proofs to these lemmas are straightforward, using
-an induction on the inductive cases of the corresponding rewriting relations.
+an induction on the corresponding rewriting relations.
These proofs will therefore be omitted when this is the case.
-We present in the below lemma a few pairs of terms that are rewritable via
+We present in the following lemma a few pairs of terms that are rewritable via
$\grewrites$:
\begin{lemma}\label{gstarRdistinctGeneral}
\mbox{}
@@ -1302,6 +1371,7 @@
If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other,
then they are equivalent under $\rsimp{}$:
\begin{lemma}\label{grewritesSimpalts}
+ \mbox{}
If $rs_1 \grewrites rs_2$, then
we have the following equivalence:
\begin{itemize}
@@ -1330,7 +1400,7 @@
\end{itemize}
\end{lemma}
\noindent
-Here comes the meat of the proof,
+Now comes the core of the proof,
which says that once two lists are rewritable to each other,
then they are equivalent under $\rsimp{}$:
\begin{lemma}
@@ -1338,15 +1408,17 @@
\end{lemma}
\noindent
-And similar to \ref{Bitcoded2} one can preserve rewritability after taking derivative
-of two regular expressions on both sides:
+Similar to what we did in \ref{Bitcoded2},
+we prove that if one can rewrite from one r-regular expression ($r$)
+to the other ($r'$), after taking derivatives one could still rewrite
+the first ($r\backslash c$) to the other ($r'\backslash c$).
\begin{lemma}\label{interleave}
If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
\end{lemma}
\noindent
-This allows proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$ now.
+This allows us proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
\begin{lemma}\label{insideSimpRemoval}
- $\rsimp{\rder{c}{\rsimp{r}}} = \rsimp{\rder{c}{r}} $
+ $\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})} $
\end{lemma}
\noindent
\begin{proof}
@@ -1373,15 +1445,15 @@
\noindent
\subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$}
-\subsubsection{Closed Form for Alternative Regular Expression}
-Lemma \ref{Simpders} leads to the first closed form--
-\begin{lemma}\label{altsClosedForm}
+Lemma \ref{Simpders} leads to our first closed form,
+which is for the alternative regular expression:
+\begin{theorem}\label{altsClosedForm}
+ \mbox{}
\begin{center}
$\rderssimp{(\sum rs)}{s} \sequal
\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
\end{center}
-\end{lemma}
-
+\end{theorem}
\noindent
\begin{proof}
By a reverse induction on the string $s$.
@@ -1402,7 +1474,7 @@
\end{proof}
\noindent
This closed form has a variant which can be more convenient in later proofs:
-\begin{corollary}{altsClosedForm1}
+\begin{corollary}\label{altsClosedForm1}
If $s \neq []$ then
$\rderssimp \; (\sum \; rs) \; s =
\rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$.
@@ -1414,15 +1486,23 @@
\subsubsection{Closed Form for Sequence Regular Expressions}
-First let's look at a series of derivatives steps on a sequence
-regular expression, assuming that each time the first
-component of the sequence is always nullable):
+For the sequence regular expression,
+let's first look at a series of derivative steps on it
+(assuming that each time when a derivative is taken,
+the head of the sequence is always nullable):
\begin{center}
-
- $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
- $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad
- \ldots$
-
+ \begin{tabular}{llll}
+ $r_1 \cdot r_2$ &
+ $\longrightarrow_{\backslash c}$ &
+ $r_1\backslash c \cdot r_2 + r_2 \backslash c$ &
+ $ \longrightarrow_{\backslash c'} $ \\
+ \\
+ $(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$ &
+ $\longrightarrow_{\backslash c''} $ &
+ $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'')
+ + r_2 \backslash cc'c''$ &
+ $ \longrightarrow_{\backslash c''} \quad \ldots$\\
+ \end{tabular}
\end{center}
Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as
a giant alternative taking a list of terms
@@ -1430,32 +1510,50 @@
where the head of the list is always the term
representing a match involving only $r_1$, and the tail of the list consisting of
terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
-This intuition is also echoed by IndianPaper, where they gave
+This intuition is also echoed by Murugesan and Sundaram \cite{Murugesan2014},
+where they gave
a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
\begin{center}
- \begin{tabular}{c}
- $(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\
- \rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n)
- \myequiv$\\
- \rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2))
- + (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n)
- $
+ \begin{tabular}{lc}
+ $L \; [ (r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) ]$ & $ =$\\
+ \\
+ \rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1) \cdot r_2 +
+ (\delta\; (\nullable \; r_1) \; (r_2 \backslash_r c_1) )) \backslash_r
+ (c_2 :: \ldots c_n) ]$ &
+ $=$\\
+ \\
+ \rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1c_2 \cdot r_2 +
+ (\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2)))
+ $ & \\
+ \\
+ $\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) ))
+ \backslash_r (c_3 \ldots c_n) ]$ & $\ldots$ \\
\end{tabular}
\end{center}
\noindent
-The equality in above should be interpretated
-as language equivalence.
-The $\delta$ function works similarly to that of
-a Kronecker delta function:
-\[ \delta \; b\; r\]
-will produce $r$
-if $b$ evaluates to true,
-and $\RZERO$ otherwise.
-Note that their formulation
-\[
- ((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
- + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)
-\]
+The $\delta$ function
+returns $r$ when the boolean condition
+$b$ evaluates to true and
+$\ZERO$ otherwise:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\
+ & $\dn$ & $\ZERO \quad otherwise$
+ \end{tabular}
+\end{center}
+\noindent
+Note that the term
+\begin{center}
+ \begin{tabular}{lc}
+ \rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 +
+ (\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2)))
+ $ & \\
+ \\
+ $\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) ))
+ \backslash_r (c_3 \ldots c_n)$ &\\
+ \end{tabular}
+\end{center}
+\noindent
does not faithfully
represent what the intermediate derivatives would actually look like
when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not
@@ -1463,59 +1561,88 @@
For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
the regular expression would not look like
\[
- (r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO,
+ r_1 \backslash_r c_1c_2
\]
-but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
+instead of
+\[
+ (r_1 \backslash_r c_1c_2 + \ZERO ) + \ZERO.
+\]
+The redundant $\ZERO$s will not be created in the
first place.
-In a closed-form one would want to take into account this
-and generate the list of
-regular expressions $r_2 \backslash_r s''$ with
-string pairs $(s', s'')$ where $s'@s'' = s$ and
-$r_1 \backslash s'$ nullable.
-We denote the list consisting of such
-strings $s''$ as $\vsuf{s}{r_1}$.
-
-The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
+In a closed-form one needs to take into account this (because
+closed forms require exact equality rather than language equivalence)
+and only generate the
+$r_2 \backslash_r s''$ terms satisfying the property
+\begin{center}
+$\exists s'. such \; that \; s'@s'' = s \;\; \land \;\;
+r_1 \backslash s' \; is \; nullable$.
+\end{center}
+Given the arguments $s$ and $r_1$, we denote the list of strings
+$s''$ satisfying the above property as $\vsuf{s}{r_1}$.
+The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{
+ Perhaps a better name of it would be ``NullablePrefixSuffix''
+ to differentiate with the list of \emph{all} prefixes of $s$, but
+ that is a bit too long for a function name and we are yet to find
+a more concise and easy-to-understand name.}
\begin{center}
\begin{tabular}{lcl}
$\vsuf{[]}{\_} $ & $=$ & $[]$\\
- $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
+ $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
&& $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $
\end{tabular}
\end{center}
\noindent
-The list is sorted in the order $r_2\backslash s''$
-appears in $(r_1\cdot r_2)\backslash s$.
+The list starts with shorter suffixes
+and ends with longer ones (in other words, the string elements $s''$
+in the list $\vsuf{s}{r_1}$ are sorted
+in the same order as that of the terms $r_2\backslash s''$
+appearing in $(r_1\cdot r_2)\backslash s$).
In essence, $\vsuf{\_}{\_}$ is doing a
"virtual derivative" of $r_1 \cdot r_2$, but instead of producing
the entire result $(r_1 \cdot r_2) \backslash s$,
-it only stores all the strings $s''$ such that $r_2 \backslash s''$
-are occurring terms in $(r_1\cdot r_2)\backslash s$.
+it only stores strings,
+with each string $s''$ representing a term such that $r_2 \backslash s''$
+is occurring in $(r_1\cdot r_2)\backslash s$.
-To make the closed form representation
-more straightforward,
-the flattetning function $\sflat{\_}$ is used to enable the transformation from
+With $\textit{Suffix}$ we are ready to express the
+sequence regular expression's closed form,
+but before doing so
+more devices are needed.
+The first thing is the flattening function $\sflat{\_}$,
+which takes an alternative regular expression and produces a flattened version
+of that alternative regular expression.
+It is needed to convert
a left-associative nested sequence of alternatives into
a flattened list:
\[
- \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow}
- (\ldots ((r_1 + r_2) + r_3) + \ldots)
+ \sum(\ldots ((r_1 + r_2) + r_3) + \ldots)
+ \stackrel{\sflat{\_}}{\rightarrow}
+ \sum[r_1, r_2, r_3, \ldots]
\]
\noindent
-The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below.
+The definitions of $\sflat{\_}$ and helper functions
+$\sflataux{\_}$ and $\llparenthesis \_ \rrparenthesis''$ are given below.
\begin{center}
- \begin{tabular}{ccc}
- $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
- $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
- $\sflataux r$ & $=$ & $ [r]$
+ \begin{tabular}{lcl}
+ $\sflataux{\sum r :: rs}$ & $\dn$ & $\sflataux{r} @ rs$\\
+ $\sflataux{\sum []}$ & $ \dn $ & $ []$\\
+ $\sflataux r$ & $\dn$ & $ [r]$
\end{tabular}
\end{center}
\begin{center}
- \begin{tabular}{ccc}
- $\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\
- $\sflat{\sum []}$ & $ = $ & $ \sum []$\\
- $\sflat r$ & $=$ & $ r$
+ \begin{tabular}{lcl}
+ $\sflat{(\sum r :: rs)}$ & $\dn$ & $\sum (\sflataux{r} @ rs)$\\
+ $\sflat{\sum []}$ & $ \dn $ & $ \sum []$\\
+ $\sflat r$ & $\dn$ & $ r$
+ \end{tabular}
+\end{center}
+
+\begin{center}
+ \begin{tabular}{lcl}
+ $\sflataux{[]}'$ & $ \dn $ & $ []$\\
+ $\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\
+ $\sflataux{r :: rs}$ & $\dn$ & $ r::rs$
\end{tabular}
\end{center}
\noindent
@@ -1527,19 +1654,63 @@
the output type a regular expression, not a list.
$\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the
first element of the list.
+$\sflataux{\_}'$ takes a list of regular expressions as input, and outputs
+a list of regular expressions.
+The use of $\sflataux{\_}$ and $\sflataux{\_}'$ is clear once we have
+$\textit{createdBySequence}$ defined:
+\begin{center}
+ \begin{mathpar}
+ \inferrule{\mbox{}}{\textit{createdBySequence}\; (r_1 \cdot r_2)}
-With $\sflataux{}$ a preliminary to the closed form can be stated,
-where the derivative of $r_1 \cdot r_2 \backslash s$ can be
-flattened into a list whose head and tail meet the description
-we gave earlier.
+ \inferrule{\textit{createdBySequence} \; r_1}{\textit{createdBySequence} \;
+ (r_1 + r_2)}
+ \end{mathpar}
+\end{center}
+\noindent
+The predicate $\textit{createdBySequence}$ is used to describe the shape of
+the derivative regular expressions $(r_1\cdot r_2) \backslash s$:
+\begin{lemma}\label{recursivelyDerseq}
+ It is always the case that
+ \begin{center}
+ $\textit{createdBySequence} \; ( (r_1\cdot r_2) \backslash_r s) $
+ \end{center}
+ holds.
+\end{lemma}
+\begin{proof}
+ By a reverse induction on the string $s$, where the inductive cases are $[]$
+ and $xs @ [x]$.
+\end{proof}
+\noindent
+If we have a regular expression $r$ whose shpae
+fits into those described by $\textit{createdBySequence}$,
+then we can convert between
+$r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with
+$\sflataux{\_}'$:
+\begin{lemma}\label{sfauIdemDer}
+ If $\textit{createdBySequence} \; r$, then
+ \begin{center}
+ $\sflataux{ r \backslash_r c} =
+ \llparenthesis (\map \; (\_ \backslash_r c) \; (\sflataux{r}) ) \rrparenthesis''$
+ \end{center}
+ holds.
+\end{lemma}
+\begin{proof}
+ By a simple induction on the inductive cases of $\textit{createdBySequence}.
+ $
+\end{proof}
+
+Now we are ready to express
+the shape of $r_1 \cdot r_2 \backslash s$
\begin{lemma}\label{seqSfau0}
- $\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2
+ $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2
:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
\end{lemma}
\begin{proof}
- By an induction on the string $s$, where the inductive cases
- are split as $[]$ and $xs @ [x]$.
- Note the key identify holds:
+ By a reverse induction on the string $s$, where the inductive cases
+ are $[]$ and $xs @ [x]$.
+ For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2)
+ \backslash_r xs)$ holds from lemma \ref{recursivelyDerseq},
+ which can be used to prove
\[
\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1}))
@@ -1548,96 +1719,126 @@
\[
\map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
\]
- This enables the inductive case to go through.
+ using lemma \ref{sfauIdemDer}.
+ This equality enables the inductive case to go through.
\end{proof}
\noindent
-Note that this lemma does $\mathbf{not}$ depend on any
-specific definitions we used,
-allowing people investigating derivatives to get an alternative
-view of what $r_1 \cdot r_2$ is.
+This lemma says that $(r_1\cdot r_2)\backslash s$
+can be flattened into a list whose head and tail meet the description
+we gave earlier.
+%Note that this lemma does $\mathbf{not}$ depend on any
+%specific definitions we used,
+%allowing people investigating derivatives to get an alternative
+%view of what $r_1 \cdot r_2$ is.
-Now we are able to use this for the intuition that
-the different ways in which regular expressions are
-nested do not matter under $\rsimp{}$:
-\begin{center}
- $\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$
- and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
-\end{center}
-Simply wrap with $\sum$ constructor and add
-simplifications to both sides of \ref{seqSfau0}
-and one gets
-\begin{corollary}\label{seqClosedFormGeneral}
+We now use $\textit{createdBySequence}$ and
+$\sflataux{\_}$ to describe an intuition
+behind the closed form of the sequence closed form.
+If two regular expressions only differ in the way their
+alternatives are nested, then we should be able to get the same result
+once we apply simplification to both of them:
+\begin{lemma}\label{sflatRsimpeq}
+ If $r$ is created from a sequence through
+ a series of derivatives
+ (i.e. if $\textit{createdBySequence} \; r$ holds),
+ and that $\sflataux{r} = rs$,
+ then we have
+ that
+ \begin{center}
+ $\textit{rsimp} \; r = \textit{rsimp} \; (\sum \; rs)$
+ \end{center}
+ holds.
+\end{lemma}
+\begin{proof}
+ By an induction on the inductive cases of $\textit{createdBySequence}$.
+\end{proof}
+
+Now we are ready for the closed form
+for the sequence regular expressions (without the inner applications
+of simplifications):
+\begin{lemma}\label{seqClosedFormGeneral}
$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }
=\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 ::
\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
-\end{corollary}
-Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
-it is possible to convert the above lemma to obtain a "closed form"
-for derivatives nested with simplification:
-\begin{lemma}\label{seqClosedForm}
- $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 )
- :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
\end{lemma}
\begin{proof}
- By a case analysis of string $s$.
- When $s$ is empty list, the rewrite is straightforward.
- When $s$ is a list, one could use the corollary \ref{seqSfau0},
- and lemma \ref{Simpders} to rewrite the left-hand-side.
+ We know that
+ $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2
+ :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
+ holds
+ by lemma \ref{seqSfau0}.
+ This allows the theorem to go through because of lemma \ref{sflatRsimpeq}.
\end{proof}
-As a corollary for this closed form, one can estimate the size
-of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using
-an easier-to-handle expression:
-\begin{corollary}\label{seqEstimate1}
- \begin{center}
-
- $\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 )
- :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$
-
- \end{center}
-\end{corollary}
-\noindent
+Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
+it is possible to convert the above lemma to obtain the
+proper closed form for $\backslash_{rsimps}$ rather than $\backslash_r$:
+for derivatives nested with simplification:
+\begin{theorem}\label{seqClosedForm}
+ $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 )
+ :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
+\end{theorem}
+\begin{proof}
+ By a case analysis of the string $s$.
+ When $s$ is an empty list, the rewrite is straightforward.
+ When $s$ is a non-empty list, the
+ lemmas \ref{seqClosedFormGeneral} and \ref{Simpders} apply,
+ making the proof go through.
+\end{proof}
\subsubsection{Closed Forms for Star Regular Expressions}
-We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
-the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then
-the property of the $\distinct$ function.
-Now we try to get a bound on $r^* \backslash s$ as well.
-Again, we first look at how a star's derivatives evolve, if they grow maximally:
+The closed form for the star regular expression involves similar tricks
+for the sequence regular expression.
+The $\textit{Suffix}$ function is now replaced by something
+slightly more complex, because the growth pattern of star
+regular expressions' derivatives is a bit different:
\begin{center}
-
- $r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad
- r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad
- (r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''}
- \quad \ldots$
-
+ \begin{tabular}{lclc}
+ $r^* $ & $\longrightarrow_{\backslash c}$ &
+ $(r\backslash c) \cdot r^*$ & $\longrightarrow_{\backslash c'}$\\
+ \\
+ $r \backslash cc' \cdot r^* + r \backslash c' \cdot r^*$ &
+ $\longrightarrow_{\backslash c''}$ &
+ $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
+ (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)$ &
+ $\longrightarrow_{\backslash c'''}$ \\
+ \\
+ $\ldots$\\
+ \end{tabular}
\end{center}
When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$,
$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
-the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
-of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not
-count the possible size explosions of $r \backslash c$ themselves.
+the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly
+in the sequence case.
+The good news is that the function $\textit{rsimp}$ will again
+ignore the difference between differently nesting patterns of alternatives,
+and the exponentially growing star derivative like
+\begin{center}
+ $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
+ (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
+\end{center}
+can be treated as
+\begin{center}
+ $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'',
+ r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
+\end{center}
+which can be de-duplicated by $\rDistinct$
+and therefore bounded finitely.
-Thanks to $\rflts$ and $\rDistinct$, we are able to open up regular expressions like
-$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
-(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
-into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'',
-r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
-and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
+and then de-duplicate terms of the form ($s'$ being a substring of $s$).
This allows us to use a similar technique as $r_1 \cdot r_2$ case,
-where the crux is to get an equivalent form of
-$\rderssimp{r^*}{s}$ with shape $\rsimp{\sum rs}$.
-This requires generating
-all possible sub-strings $s'$ of $s$
-such that $r\backslash s' \cdot r^*$ will appear
-as a term in $(r^*) \backslash s$.
-The first function we define is a single-step
-updating function $\starupdate$, which takes three arguments as input:
-the new character $c$ to take derivative with,
-the regular expression
-$r$ directly under the star $r^*$, and the
-list of strings $sSet$ for the derivative $r^* \backslash s$
-up til this point
-such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$
-(the equality is not exact, more on this later).
+
+Now the crux of this section is finding a suitable description
+for $rs$ where
+\begin{center}
+ $\rderssimp{r^*}{s} = \rsimp{\sum rs}$.
+\end{center}
+holds.
+In addition, the list $rs$
+shall be in the form of
+$\map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss$.
+The $Ss$ is a list of strings, and for example in the sequence
+closed form it is specified as $\textit{Suffix} \; s \; r_1$.
+To get $Ss$ for the star regular expression,
+we need to introduce $\starupdate$ and $\starupdates$:
\begin{center}
\begin{tabular}{lcl}
$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
@@ -1650,11 +1851,6 @@
\starupdate \; c \; r \; Ss)$
\end{tabular}
\end{center}
-\noindent
-As a generalisation from characters to strings,
-$\starupdates$ takes a string instead of a character
-as the first input argument, and is otherwise the same
-as $\starupdate$.
\begin{center}
\begin{tabular}{lcl}
$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
@@ -1663,12 +1859,33 @@
\end{tabular}
\end{center}
\noindent
-For the star regular expression,
-its derivatives can be seen as a nested gigantic
-alternative similar to that of sequence regular expression's derivatives,
-and therefore need
-to be ``straightened out" as well.
-The function for this would be $\hflat{}$ and $\hflataux{}$.
+Assuming we have that
+\begin{center}
+ $\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$.
+\end{center}
+holds.
+The idea of $\starupdate$ and $\starupdates$
+is to update $Ss$ when another
+derivative is taken on $\rderssimp{r^*}{s}$
+w.r.t a character $c$ and a string $s'$
+respectively.
+Both $\starupdate$ and $\starupdates$ take three arguments as input:
+the new character $c$ or string $s$ to take derivative with,
+the regular expression
+$r$ under the star $r^*$, and the
+list of strings $Ss$ for the derivative $r^* \backslash s$
+up until this point
+such that
+\begin{center}
+$(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$
+\end{center}
+is satisfied.
+
+Functions $\starupdate$ and $\starupdates$ characterise what the
+star derivatives will look like once ``straightened out'' into lists.
+The helper functions for such operations will be similar to
+$\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence.
+We use similar symbols to denote them, with a $*$ subscript to mark the difference.
\begin{center}
\begin{tabular}{lcl}
$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
@@ -1683,122 +1900,230 @@
\end{tabular}
\end{center}
\noindent
-%MAYBE TODO: introduce createdByStar
-Again these definitions are tailor-made for dealing with alternatives that have
-originated from a star's derivatives, so we do not attempt to open up all possible
-regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
-elements.
-We give a predicate for such "star-created" regular expressions:
+These definitions are tailor-made for dealing with alternatives that have
+originated from a star's derivatives.
+A typical star derivative always has the structure of a balanced binary tree:
\begin{center}
- \begin{tabular}{lcr}
- & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
- $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
- \end{tabular}
+ $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
+ (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
\end{center}
+All of the nested structures of alternatives
+generated from derivatives are binary, and therefore
+$\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives.
+$\hflat{\_}$ ``untangles'' like the following:
+\[
+ \sum ((r_1 + r_2) + (r_3 + r_4)) + \ldots \;
+ \stackrel{\hflat{\_}}{\longrightarrow} \;
+ \RALTS{[r_1, r_2, \ldots, r_n]}
+\]
+Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$,
+with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists
+ and merges each of the element lists to form a flattened list.}:
+\begin{lemma}\label{stupdateInduct1}
+ \mbox
+ For a list of strings $Ss$, the following hold.
+ \begin{itemize}
+ \item
+ If we do a derivative on the terms
+ $r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$),
+ the result will be the same as if we apply $\starupdate$ to $Ss$.
+ \begin{center}
+ \begin{tabular}{c}
+ $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x)
+ \circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\;
+ $\\
+ \\
+ $=$ \\
+ \\
+ $\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \;
+ (\starupdate \; x \; r \; Ss)$
+ \end{tabular}
+ \end{center}
+ \item
+ $\starupdates$ is ``composable'' w.r.t a derivative.
+ It piggybacks the character $x$ to the tail of the string
+ $xs$.
+ \begin{center}
+ \begin{tabular}{c}
+ $\textit{concat} \; (\map \; \hflataux{\_} \;
+ (\map \; (\_\backslash_r x) \;
+ (\map \; (\lambda s.\;\; (r \backslash_r s) \cdot
+ (r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\
+ \\
+ $=$\\
+ \\
+ $\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \;
+ (\starupdates \; (xs @ [x]) \; r \; Ss)$
+ \end{tabular}
+ \end{center}
+ \end{itemize}
+\end{lemma}
+
+\begin{proof}
+ Part 1 is by induction on $Ss$.
+ Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values.
+\end{proof}
+
-These definitions allows us the flexibility to talk about
-regular expressions in their most convenient format,
-for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
-instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
-These definitions help express that certain classes of syntatically
-distinct regular expressions are actually the same under simplification.
-This is not entirely true for annotated regular expressions:
-%TODO: bsimp bders \neq bderssimp
+Like $\textit{createdBySequence}$, we need
+a predicate for ``star-created'' regular expressions:
\begin{center}
- $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
-\end{center}
-For bit-codes, the order in which simplification is applied
-might cause a difference in the location they are placed.
-If we want something like
-\begin{center}
- $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
+ \begin{mathpar}
+ \inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} }
+
+ \inferrule{ \textit{createdByStar} \; r_1\; \land \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) }
+ \end{mathpar}
\end{center}
-Some "canonicalization" procedure is required,
-which either pushes all the common bitcodes to nodes
-as senior as possible:
-\begin{center}
- $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
-\end{center}
-or does the reverse. However bitcodes are not of interest if we are talking about
-the $\llbracket r \rrbracket$ size of a regex.
-Therefore for the ease and simplicity of producing a
-proof for a size bound, we are happy to restrict ourselves to
-unannotated regular expressions, and obtain such equalities as
-\begin{lemma}
- $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
+\noindent
+All regular expressions created by taking derivatives of
+$r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate:
+\begin{lemma}\label{starDersCbs}
+ $\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds.
+\end{lemma}
+\begin{proof}
+ By a reverse induction on $s$.
+\end{proof}
+If a regular expression conforms to the shape of a star's derivative,
+then we can push an application of $\hflataux{\_}$ inside a derivative of it:
+\begin{lemma}\label{hfauPushin}
+ If $\textit{createdByStar} \; r$ holds, then
+ \begin{center}
+ $\hflataux{r \backslash_r c} = \textit{concat} \; (
+ \map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$
+ \end{center}
+ holds.
+\end{lemma}
+\begin{proof}
+ By an induction on the inductivev cases of $\textit{createdByStar}$.
+\end{proof}
+%This is not entirely true for annotated regular expressions:
+%%TODO: bsimp bders \neq bderssimp
+%\begin{center}
+% $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
+%\end{center}
+%For bit-codes, the order in which simplification is applied
+%might cause a difference in the location they are placed.
+%If we want something like
+%\begin{center}
+% $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
+%\end{center}
+%Some "canonicalization" procedure is required,
+%which either pushes all the common bitcodes to nodes
+%as senior as possible:
+%\begin{center}
+% $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
+%\end{center}
+%or does the reverse. However bitcodes are not of interest if we are talking about
+%the $\llbracket r \rrbracket$ size of a regex.
+%Therefore for the ease and simplicity of producing a
+%proof for a size bound, we are happy to restrict ourselves to
+%unannotated regular expressions, and obtain such equalities as
+%TODO: rsimp sflat
+% The simplification of a flattened out regular expression, provided it comes
+%from the derivative of a star, is the same as the one nested.
+
+
+
+Now we introduce an inductive property
+for $\starupdate$ and $\hflataux{\_}$.
+\begin{lemma}\label{starHfauInduct}
+ If we do derivatives of $r^*$
+ with a string that starts with $c$,
+ then flatten it out,
+ we obtain a list
+ of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$,
+ where $sS = \starupdates \; s \; r \; [[c]]$. Namely,
+ \begin{center}
+ $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} =
+ \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \;
+ (\starupdates \; s \; r_0 \; [[c]])$
+ \end{center}
+holds.
+\end{lemma}
+\begin{proof}
+ By an induction on $s$, the inductive cases
+ being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.
+\end{proof}
+\noindent
+
+$\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
+\begin{lemma}\label{hflatauxGrewrites}
+ $a :: rs \grewrites \hflataux{a} @ rs$
+\end{lemma}
+\begin{proof}
+ By induction on $a$. $rs$ is set to take arbitrary values.
+\end{proof}
+It is also not surprising that $\textit{rsimp}$ will cover
+the effect of $\hflataux{\_}$:
+\begin{lemma}\label{cbsHfauRsimpeq1}
+ $\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$
\end{lemma}
\begin{proof}
By using the rewriting relation $\rightsquigarrow$
\end{proof}
-%TODO: rsimp sflat
And from this we obtain a proof that a star's derivative will be the same
as if it had all its nested alternatives created during deriving being flattened out:
For example,
-\begin{lemma}
+\begin{lemma}\label{hfauRsimpeq2}
$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
\end{lemma}
\begin{proof}
- By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
+ By structural induction on $r$, where the induction rules
+ are these of $\createdByStar{\_}$.
+ Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.
\end{proof}
-% The simplification of a flattened out regular expression, provided it comes
-%from the derivative of a star, is the same as the one nested.
-
-We first introduce an inductive property
-for $\starupdate$ and $\hflataux{\_}$,
-it says if we do derivatives of $r^*$
-with a string that starts with $c$,
-then flatten it out,
-we obtain a list
-of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$,
-where $sSet = \starupdates \; s \; r \; [[c]]$.
-\begin{lemma}\label{starHfauInduct}
- $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} =
- \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \;
- (\starupdates \; s \; r_0 \; [[c]])$
+%Here is a corollary that states the lemma in
+%a more intuitive way:
+%\begin{corollary}
+% $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
+% (r^*))\; (\starupdates \; c\; r\; [[c]])$
+%\end{corollary}
+%\noindent
+%Note that this is also agnostic of the simplification
+%function we defined, and is therefore of more general interest.
+
+Together with the rewriting relation
+\begin{lemma}\label{starClosedForm6Hrewrites}
+ We have the following set of rewriting relations or equalities:
+ \begin{itemize}
+ \item
+ $\textit{rsimp} \; (r^* \backslash_r (c::s))
+ \sequal
+ \sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; (
+ \starupdates \; s \; r \; [ c::[]] ) ) )$
+ \item
+ $r \backslash_{rsimp} (c::s) = \textit{rsimp} \; ( (
+ \sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \;
+ (\starupdates \;s \; r \; [ c::[] ])))))$
+ \item
+ $\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss))
+ \sequal
+ \sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \;
+ r^*) \; Ss) )$
+ \item
+ $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
+ \scfrewrites
+ \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
+ \item
+ $( ( \sum ( ( \map \ (\lambda s. \;\;
+ (\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \;
+ s \; r \; [ c::[] ])))))$\\
+ $\sequal$\\
+ $( ( \sum ( ( \map \ (\lambda s. \;\;
+ ( r \backslash_{rsimp} s)) \cdot r^*) \; (\starupdates \;
+ s \; r \; [ c::[] ]))))$\\
+ \end{itemize}
\end{lemma}
\begin{proof}
- By an induction on $s$, the inductive cases
- being $[]$ and $s@[c]$.
+ Part 1 leads to part 2.
+ The rest of them are routine.
\end{proof}
\noindent
-Here is a corollary that states the lemma in
-a more intuitive way:
-\begin{corollary}
- $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
- (r^*))\; (\starupdates \; c\; r\; [[c]])$
-\end{corollary}
-\noindent
-Note that this is also agnostic of the simplification
-function we defined, and is therefore of more general interest.
-
-Now adding the $\rsimp{}$ bit for closed forms,
-we have
-\begin{lemma}
- $a :: rs \grewrites \hflataux{a} @ rs$
-\end{lemma}
-\noindent
-giving us
-\begin{lemma}\label{cbsHfauRsimpeq1}
- $\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$.
-\end{lemma}
-\noindent
-This yields
-\begin{lemma}\label{hfauRsimpeq2}
- $\rsimp{r} = \rsimp{(\sum \hflataux{r})}$
-\end{lemma}
-\noindent
-Together with the rewriting relation
-\begin{lemma}\label{starClosedForm6Hrewrites}
- $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
- \scfrewrites
- \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
-\end{lemma}
-\noindent
-We obtain the closed form for star regular expression:
-\begin{lemma}\label{starClosedForm}
+Next the closed form for star regular expressions can be derived:
+\begin{theorem}\label{starClosedForm}
$\rderssimp{r^*}{c::s} =
\rsimp{
(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \;
@@ -1807,11 +2132,14 @@
)
}
$
-\end{lemma}
+\end{theorem}
\begin{proof}
By an induction on $s$.
- The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2}
+ The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites}
+ and \ref{hfauRsimpeq2}
are used.
+ In \ref{starClosedForm6Hrewrites}, the equalities are
+ used to link the LHS and RHS.
\end{proof}
@@ -1881,13 +2209,56 @@
In this section, we introduce how we formalised the bound
on closed forms.
-We first prove that functions such as $\rflts$
+We first show that in general regular expressions up to a certain
+size are finite.
+Then we prove that functions such as $\rflts$
will not cause the size of r-regular expressions to grow.
Putting this together with a general bound
on the finiteness of distinct regular expressions
-smaller than a certain size, we obtain a bound on
+up to a certain size, we obtain a bound on
the closed forms.
+\subsection{Finiteness of Distinct Regular Expressions}
+We define the set of regular expressions whose size are no more than
+a certain size $N$ as $\textit{sizeNregex} \; N$:
+\[
+ \textit{sizeNregex} \; N \dn \{r\; \mid \; \llbracket r \rrbracket_r \leq N \}
+\]
+We have that $\textit{sizeNregex} \; N$ is always a finite set:
+\begin{lemma}\label{finiteSizeN}
+ $\textit{finite} \; (\textit{sizeNregex} \; N)$ holds.
+\end{lemma}
+\begin{proof}
+ By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
+ subsets by their categories:
+ $\{\ZERO, \ONE, c\}$, $\{* `` \textit{sizeNregex} \; N\}$,
+ and so on. Each of these subsets are finitely bounded.
+\end{proof}
+\noindent
+From this we get a corollary that
+if forall $r \in rs$, $\rsize{r} \leq N$, then the output of
+$\rdistinct{rs}{\varnothing}$ is a list of regular
+expressions of finite size depending on $N$ only.
+\begin{corollary}\label{finiteSizeNCorollary}
+ $\rsize{\rdistinct{rs}{\varnothing}} \leq c_N * N$ holds,
+ where the constant $c_N$ is equal to $\textit{card} \; (\textit{sizeNregex} \;
+ N)$.
+\end{corollary}
+\begin{proof}
+ For all $r$ in
+ $\textit{set} \; (\rdistinct{rs}{\varnothing})$,
+ it is always the case that $\rsize{r} \leq N$.
+ In addition, the list length is bounded by
+ $c_N$, yielding the desired bound.
+\end{proof}
+\noindent
+This fact will be handy in estimating the closed form sizes.
+%We have proven that the size of the
+%output of $\textit{rdistinct} \; rs' \; \varnothing$
+%is bounded by a constant $N * c_N$ depending only on $N$,
+%provided that each of $rs'$'s element
+%is bounded by $N$.
+
\subsection{$\textit{rsimp}$ Does Not Increment the Size}
Although it seems evident, we need a series
of non-trivial lemmas to establish that functions such as $\rflts$
@@ -1944,7 +2315,8 @@
\end{proof}
\subsection{Estimating the Closed Forms' sizes}
-We now summarize the closed forms below:
+We recap the closed forms we obtained
+earlier by putting them together down below:
\begin{itemize}
\item
$\rderssimp{(\sum rs)}{s} \sequal
@@ -1978,44 +2350,10 @@
We elaborate the above reasoning by a series of lemmas
below, where straightforward proofs are omitted.
-\begin{lemma}
- If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$,
- and $\textit{length} \; rs$ is less than or equal to $l$,
- then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$.
-\end{lemma}
-\noindent
-If we define all regular expressions with size no
-more than $N$ as $\sizeNregex \; N$:
-\[
- \sizeNregex \; N \dn \{r \mid \rsize{r} \leq N \}
-\]
-Then such set is finite:
-\begin{lemma}\label{finiteSizeN}
- $\textit{isFinite}\; (\sizeNregex \; N)$
-\end{lemma}
-\begin{proof}
- By overestimating the set $\sizeNregex \; N + 1$
- using union of sets like
- $\{r_1 \cdot r_2 \mid r_1 \in A
- \text{and}
- r_2 \in A\}
- $ where $A = \sizeNregex \; N$.
-\end{proof}
-\noindent
-From this we get a corollary that
-if forall $r \in rs$, $\rsize{r} \leq N$, then the output of
-$\rdistinct{rs}{\varnothing}$ is a list of regular
-expressions of finite size depending on $N$ only.
-\begin{corollary}\label{finiteSizeNCorollary}
- Assumes that for all $r \in rs. \rsize{r} \leq N$,
- and the cardinality of $\sizeNregex \; N$ is $c_N$
- then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$.
-\end{corollary}
-\noindent
-We have proven that the output of $\rdistinct{rs'}{\varnothing}$
-is bounded by a constant $c_N$ depending only on $N$,
-provided that each of $rs'$'s element
-is bounded by $N$.
+
+
+
+
We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
We show that $\rdistinct$ and $\rflts$
@@ -2041,20 +2379,20 @@
From
\begin{center}
$\llbracket \textit{rflts}\; rs \rrbracket_r \leq
- \llbracket \; \textit{rs} \rrbracket_r$
+ \llbracket \textit{rs} \rrbracket_r$
\end{center}
and
\begin{center}
$\llbracket \textit{rdistinct} \; rs \; \varnothing \leq
\llbracket rs \rrbracket_r$
\end{center}
-one cannot imply
+one cannot infer
\begin{center}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
\leq
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
\end{center}
-What we can imply is that
+What we can infer is that
\begin{center}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
\leq
@@ -2062,7 +2400,7 @@
\end{center}
but this estimate is too rough and $\llbracket rs \rrbracket_r$ is unbounded.
The way we
-get through this is by first proving a more general lemma
+get around this is by first proving a more general lemma
(so that the inductive case goes through):
\begin{lemma}\label{fltsSizeReductionAlts}
If we have three accumulator sets:
@@ -2087,11 +2425,17 @@
holds.
\end{lemma}
\noindent
-We need to split the accumulator into two parts: the part
+We split the accumulator into two parts: the part
which contains alternative regular expressions ($alts\_set$), and
the part without any of them($noalts\_set$).
+This is because $\rflts$ opens up the alternatives in $as$,
+causing the accumulators on both sides of the inequality
+to diverge slightly.
+If we want to compare the accumulators that are not
+perfectly in sync, we need to consider the alternatives and non-alternatives
+separately.
The set $corr\_set$ is the corresponding set
-of $alts\_set$ with all elements under the $\sum$ constructor
+of $alts\_set$ with all elements under the alternative constructor
spilled out.
\begin{proof}
By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
@@ -2106,7 +2450,8 @@
By using the lemma \ref{fltsSizeReductionAlts}.
\end{proof}
\noindent
-The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
+The intuition for why this is true
+is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
@@ -2114,14 +2459,14 @@
$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
\end{lemma}
\begin{proof}
- By using \ref{interactionFltsDB}.
+ By using corollary \ref{interactionFltsDB}.
\end{proof}
\noindent
This is a key lemma in establishing the bounds on all the
closed forms.
With this we are now ready to control the sizes of
-$(r_1 \cdot r_2 )\backslash s$, $r^* \backslash s$.
-\begin{theorem}
+$(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$.
+\begin{theorem}\label{rBound}
For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
\end{theorem}
\noindent
@@ -2146,7 +2491,7 @@
\end{tabular}
\end{center}
\noindent
-(1) is by the corollary \ref{seqEstimate1}.
+(1) is by theorem \ref{seqClosedForm}.
(2) is by \ref{altsSimpControl}.
(3) is by \ref{finiteSizeNCorollary}.
@@ -2183,9 +2528,9 @@
\end{tabular}
\end{center}
\noindent
-(5) is by the lemma \ref{starClosedForm}.
+(5) is by theorem \ref{starClosedForm}.
(6) is by \ref{altsSimpControl}.
-(7) is by \ref{finiteSizeNCorollary}.
+(7) is by corollary \ref{finiteSizeNCorollary}.
Combining with the case when $s = []$, one gets
\begin{center}
\begin{tabular}{lcll}
@@ -2208,7 +2553,7 @@
\end{tabular}
\end{center}
\noindent
-(9) is by \ref{altsClosedForm}, (10) by \ref{rsimpSize} and (11) by inductive hypothesis.
+(9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis.
Combining with the case when $s = []$, one gets
\begin{center}
@@ -2217,15 +2562,15 @@
& (12)\\
\end{tabular}
\end{center}
-(4), (8), and (12) are all the inductive cases proven.
+We have all the inductive cases proven.
\end{proof}
-
+This leads to our main result on the size bound:
\begin{corollary}
- For any regex $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$
+ For any annotated regular expression $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$
\end{corollary}
\begin{proof}
- By \ref{sizeRelations}.
+ By lemma \ref{sizeRelations} and theorem \ref{rBound}.
\end{proof}
\noindent
@@ -2243,22 +2588,9 @@
%----------------------------------------------------------------------------------------
-\subsection{A Closed Form for the Sequence Regular Expression}
-\noindent
-
-Before we get to the proof that says the intermediate result of our lexer will
-remain finitely bounded, which is an important efficiency/liveness guarantee,
-we shall first develop a few preparatory properties and definitions to
-make the process of proving that a breeze.
-
-We define rewriting relations for $\rrexp$s, which allows us to do the
-same trick as we did for the correctness proof,
-but this time we will have stronger equalities established.
-
-
-
+\section{Comments and Future Improvements}
+\subsection{Some Experimental Results}
What guarantee does this bound give us?
-
Whatever the regex is, it will not grow indefinitely.
Take our previous example $(a + aa)^*$ as an example:
\begin{center}
@@ -2278,7 +2610,7 @@
width=5cm,
height=4cm,
legend entries={$(a + aa)^*$},
- legend pos=north west,
+ legend pos=south east,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
\end{axis}
@@ -2297,22 +2629,16 @@
$f(x) = x * 2^x$.
This means the bound we have will surge up at least
tower-exponentially with a linear increase of the depth.
-For a regex of depth $n$, the bound
-would be approximately $4^n$.
-Test data in the graphs from randomly generated regular expressions
-shows that the giant bounds are far from being hit.
-%a few sample regular experessions' derivatives
-%size change
-%TODO: giving regex1_size_change.data showing a few regular expressions' size changes
-%w;r;t the input characters number, where the size is usually cubic in terms of original size
-%a*, aa*, aaa*, .....
-%randomly generated regular expressions
-\begin{figure}{H}
- \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+One might be quite skeptical about what this non-elementary
+bound can bring us.
+It turns out that the giant bounds are far from being hit.
+Here we have some test data from randomly generated regular expressions:
+\begin{figure}[H]
+ \begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
- xlabel={number of characters},
+ xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={regex size},
enlargelimits=false,
@@ -2322,15 +2648,15 @@
%ytick={0,100,...,1000},
scaled ticks=false,
axis lines=left,
- width=5cm,
- height=4cm,
+ width=4.75cm,
+ height=3.8cm,
legend entries={regex1},
- legend pos=north west,
+ legend pos=north east,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
\end{axis}
\end{tikzpicture}
- &
+ &
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
@@ -2343,15 +2669,15 @@
%ytick={0,100,...,1000},
scaled ticks=false,
axis lines=left,
- width=5cm,
- height=4cm,
+ width=4.75cm,
+ height=3.8cm,
legend entries={regex2},
- legend pos=north west,
+ legend pos=south east,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
\end{axis}
\end{tikzpicture}
- &
+ &
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
@@ -2364,16 +2690,19 @@
%ytick={0,100,...,1000},
scaled ticks=false,
axis lines=left,
- width=5cm,
- height=4cm,
+ width=4.75cm,
+ height=3.8cm,
legend entries={regex3},
- legend pos=north west,
+ legend pos=south east,
legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
\end{axis}
\end{tikzpicture}\\
- \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.}
+ \multicolumn{3}{c}{}
\end{tabular}
+ \caption{Graphs: size change of 3 randomly generated
+ regular expressions $w.r.t.$ input string length.
+ The x axis represents the length of input.}
\end{figure}
\noindent
Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the
@@ -2382,24 +2711,22 @@
-\section{Possible Further Improvements}
-There are two problems with this finiteness result, though.
-\begin{itemize}
- \item
- First, It is not yet a direct formalisation of our lexer's complexity,
+\subsection{Possible Further Improvements}
+There are two problems with this finiteness result, though.\\
+(i)
+ First, it is not yet a direct formalisation of our lexer's complexity,
as a complexity proof would require looking into
the time it takes to execute {\bf all} the operations
- involved in the lexer (simp, collect, decode), not just the derivative.
- \item
+ involved in the lexer (simp, collect, decode), not just the derivative.\\
+(ii)
Second, the bound is not yet tight, and we seek to improve $N_a$ so that
- it is polynomial on $\llbracket a \rrbracket$.
-\end{itemize}
-Still, we believe this contribution is fruitful,
+ it is polynomial on $\llbracket a \rrbracket$.\\
+Still, we believe this contribution is useful,
because
\begin{itemize}
\item
- The size proof can serve as a cornerstone for a complexity
+ The size proof can serve as a starting point for a complexity
formalisation.
Derivatives are the most important phases of our lexer algorithm.
Size properties about derivatives covers the majority of the algorithm
@@ -2435,8 +2762,11 @@
would still be slow.
And unfortunately, we have concrete examples
where such regular expressions grew exponentially large before levelling off:
+\begin{center}
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
-(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
+(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$
+\end{center}
+will already have a maximum
size that is exponential on the number $n$
under our current simplification rules:
%TODO: graph of a regex whose size increases exponentially.
@@ -2456,38 +2786,48 @@
\end{tikzpicture}
\end{center}
-For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
+For convenience we use $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
The exponential size is triggered by that the regex
-$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
+$\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
different derivatives, despite those difference being minor.
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
+$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
will therefore contain the following terms (after flattening out all nested
alternatives):
\begin{center}
- $(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
+$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
+[1mm]
$(1 \leq m' \leq m )$
\end{center}
-These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
+There at at least exponentially
+many such terms.\footnote{To be exact, these terms are
+distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted,
+but the point is that the number is exponential.
+}
With each new input character taking the derivative against the intermediate result, more and more such distinct
-terms will accumulate,
-until the length reaches $L.C.M.(1, \ldots, n)$.
-$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms
-$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
-
-$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
-where $m' \neq m''$ \\
+terms will accumulate.
+The function $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms
+\begin{center}
+$(\sum_{i = 1}^{n}
+(\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot
+(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot
+(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
+$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot
+(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot
+(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
+\end{center}
+\noindent
+where $m' \neq m''$
as they are slightly different.
This means that with our current simplification methods,
we will not be able to control the derivative so that
-$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
-as there are already exponentially many terms.
+$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial. %\leq O((\llbracket r\rrbacket)^c)$
These terms are similar in the sense that the head of those terms
are all consisted of sub-terms of the form:
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $.
-For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
+For $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
$n * (n + 1) / 2$ such terms.
For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
can be described by 6 terms:
@@ -2495,15 +2835,29 @@
$aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
The total number of different "head terms", $n * (n + 1) / 2$,
is proportional to the number of characters in the regex
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
-This suggests a slightly different notion of size, which we call the
+$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
+If we can improve our deduplication process so that it becomes smarter
+and only keep track of these $n * (n+1) /2$ terms, then we can keep
+the size growth polynomial again.
+This example also suggests a slightly different notion of size, which we call the
alphabetic width:
-%TODO:
-(TODO: Alphabetic width def.)
+\begin{center}
+ \begin{tabular}{lcl}
+ $\textit{awidth} \; \ZERO$ & $\dn$ & $0$\\
+ $\textit{awidth} \; \ONE$ & $\dn$ & $0$\\
+ $\textit{awidth} \; c$ & $\dn$ & $1$\\
+ $\textit{awidth} \; r_1 + r_2$ & $\dn$ & $\textit{awidth} \;
+ r_1 + \textit{awidth} \; r_2$\\
+ $\textit{awidth} \; r_1 \cdot r_2$ & $\dn$ & $\textit{awidth} \;
+ r_1 + \textit{awidth} \; r_2$\\
+ $\textit{awidth} \; r^*$ & $\dn$ & $\textit{awidth} \; r$\\
+ \end{tabular}
+\end{center}
+
Antimirov\parencite{Antimirov95} has proven that
-$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
+$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$,
where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
created by doing derivatives of $r$ against all possible strings.
If we can make sure that at any moment in our lexing algorithm our
@@ -2523,12 +2877,12 @@
%-----------------------------------
% SUBSECTION 1
%-----------------------------------
-\subsection{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
-\begin{center}
- $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) =
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
-\end{center}
+%\subsection{Syntactic Equivalence Under $\simp$}
+%We prove that minor differences can be annhilated
+%by $\simp$.
+%For example,
+%\begin{center}
+% $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) =
+% \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
+%\end{center}