lexing: ChengsongTanPhdThesis/Chapters/Chapter2.tex@6fecb7fe8cd0 (annotated)

468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	1	% Chapter Template
a0f27e21b42c all texrelated Chengsong parents: diff changeset	2
a0f27e21b42c all texrelated Chengsong parents: diff changeset	3	\chapter{Chapter Title Here} % Main chapter title
a0f27e21b42c all texrelated Chengsong parents: diff changeset	4
a0f27e21b42c all texrelated Chengsong parents: diff changeset	5	\label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	6
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	7
4d9eecfc936a sad Chengsong parents: 468 diff changeset	8
4d9eecfc936a sad Chengsong parents: 468 diff changeset	9
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	10	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	11	% SECTION 1
a0f27e21b42c all texrelated Chengsong parents: diff changeset	12	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	13
a0f27e21b42c all texrelated Chengsong parents: diff changeset	14	\section{Properties of $\backslash c$}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	15
a0f27e21b42c all texrelated Chengsong parents: diff changeset	16	To have a clear idea of what we can and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	17	need to prove about the algorithms involving
a0f27e21b42c all texrelated Chengsong parents: diff changeset	18	Brzozowski's derivatives, there are a few
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	19	properties we need to be clear about .
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	20	\subsection{function $\backslash c$ is not 1-to-1}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	21	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	22	The derivative $w.r.t$ character $c$ is not one-to-one.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	23	Formally,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	24	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	25	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	26	This property is trivially true for the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	27	character regex example:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	28	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	29	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	30	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	31	But apart from the cases where the derivative
a0f27e21b42c all texrelated Chengsong parents: diff changeset	32	output is $\ZERO$, are there non-trivial results
a0f27e21b42c all texrelated Chengsong parents: diff changeset	33	of derivatives which contain strings?
a0f27e21b42c all texrelated Chengsong parents: diff changeset	34	The answer is yes.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	35	For example,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	36	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	37	Let $r_1 = a^b\;\quad r_2 = (a\cdot a^)\cdot b + b$.\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	38	where $a$ is not nullable.\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	39	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	40	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	41	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	42	We start with two syntactically different regexes,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	43	and end up with the same derivative result, which is
a0f27e21b42c all texrelated Chengsong parents: diff changeset	44	a "meaningful" regex because it contains strings.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	45	We have rediscovered Arden's lemma:\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	46	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	47	$A^B = A\cdot A^ \cdot B + B$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	48	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	49
a0f27e21b42c all texrelated Chengsong parents: diff changeset	50
a0f27e21b42c all texrelated Chengsong parents: diff changeset	51	%-----------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	52	% SUBSECTION 1
a0f27e21b42c all texrelated Chengsong parents: diff changeset	53	%-----------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	54	\subsection{Subsection 1}
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	55	To be completed.
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	56
4d9eecfc936a sad Chengsong parents: 468 diff changeset	57
4d9eecfc936a sad Chengsong parents: 468 diff changeset	58
4d9eecfc936a sad Chengsong parents: 468 diff changeset	59
4d9eecfc936a sad Chengsong parents: 468 diff changeset	60
4d9eecfc936a sad Chengsong parents: 468 diff changeset	61
4d9eecfc936a sad Chengsong parents: 468 diff changeset	62
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	63	%-----------------------------------
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	64	% SECTION 2
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	65	%-----------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	66
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	67	\section{Finiteness Property}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	68	In this section let us sketch our argument for why the size of the simplified
4d9eecfc936a sad Chengsong parents: 468 diff changeset	69	derivatives with the aggressive simplification function can be finitely bounded.
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	70
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	71	For convenience, we use a new datatype which we call $\rrexp$ to denote
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	72	the difference between it and annotated regular expressions.
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	73	\[ \rrexp ::= \RZERO \mid \RONE
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	74	\mid \RCHAR{c}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	75	\mid \RSEQ{r_1}{r_2}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	76	\mid \RALTS{rs}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	77	\mid \RSTAR{r}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	78	\]
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	79	For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	80	but keep everything else intact.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	81	It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	82	(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	83	$\ALTS$).
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	84	We denote the operation of erasing the bits and turning an annotated regular expression
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	85	into an $\rrexp{}$ as $\rerase{}$.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	86	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	87	\begin{tabular}{lcr}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	88	$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	89	$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	90	$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	91	\end{tabular}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	92	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	93	%TODO: FINISH definition of rerase
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	94	Similarly we could define the derivative and simplification on
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	95	$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	96	except that now they can operate on alternatives taking multiple arguments.
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	97
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	98	\begin{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	99	\begin{tabular}{lcr}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	100	$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	101	(other clauses omitted)
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	102	\end{tabular}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	103	\end{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	104
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	105	Now that $\rrexp$s do not have bitcodes on them, we can do the
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	106	duplicate removal without an equivalence relation:
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	107	\begin{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	108	\begin{tabular}{lcl}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	109	$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	110	& & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	111	\end{tabular}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	112	\end{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	113	%TODO: definition of rsimp (maybe only the alternative clause)
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	114
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	115
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	116	The reason why these definitions mirror precisely the corresponding operations
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	117	on annotated regualar expressions is that we can calculate sizes more easily:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	118
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	119	\begin{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	120	$\rsize{\rerase a} = \asize a$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	121	\end{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	122	A similar equation holds for annotated regular expressions' simplification
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	123	and the plain regular expressions' simplification:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	124	\begin{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	125	$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	126	\end{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	127	Putting these two together we get a property that allows us to estimate the
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	128	size of an annotated regular expression derivative using its un-annotated counterpart:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	129	\begin{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	130	$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	131	\end{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	132	Unless stated otherwise in this chapter all $\textit{rexp}$s without
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	133	bitcodes are seen as $\rrexp$s.
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	134	We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	135	as the former suits people's intuitive way of stating a binary alternative
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	136	regular expression.
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	137
4d9eecfc936a sad Chengsong parents: 468 diff changeset	138	Suppose
4d9eecfc936a sad Chengsong parents: 468 diff changeset	139	we have a size function for bitcoded regular expressions, written
4d9eecfc936a sad Chengsong parents: 468 diff changeset	140	$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
4d9eecfc936a sad Chengsong parents: 468 diff changeset	141	(we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
4d9eecfc936a sad Chengsong parents: 468 diff changeset	142	there exists a bound $N$
4d9eecfc936a sad Chengsong parents: 468 diff changeset	143	such that
4d9eecfc936a sad Chengsong parents: 468 diff changeset	144
4d9eecfc936a sad Chengsong parents: 468 diff changeset	145	\begin{center}
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	146	$\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	147	\end{center}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	148
4d9eecfc936a sad Chengsong parents: 468 diff changeset	149	\noindent
4d9eecfc936a sad Chengsong parents: 468 diff changeset	150	We prove this by induction on $r$. The base cases for $\AZERO$,
4d9eecfc936a sad Chengsong parents: 468 diff changeset	151	$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	152	for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	153	hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	154	$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	155	%
4d9eecfc936a sad Chengsong parents: 468 diff changeset	156	\begin{center}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	157	\begin{tabular}{lcll}
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	158	& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	159	& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	160	[\bderssimp{ r_2}{s'} \;\|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	161	& $\leq$ &
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	162	$\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	163	[$\bderssimp{ r_2}{s'} \;\|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	164	& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	165	\llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;\|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	166	& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	167	\llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;\|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	168	& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
4d9eecfc936a sad Chengsong parents: 468 diff changeset	169	\end{tabular}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	170	\end{center}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	171
4d9eecfc936a sad Chengsong parents: 468 diff changeset	172	% tell Chengsong about Indian paper of closed forms of derivatives
4d9eecfc936a sad Chengsong parents: 468 diff changeset	173
4d9eecfc936a sad Chengsong parents: 468 diff changeset	174	\noindent
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	175	where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	176	The reason why we could write the derivatives of a sequence this way is described in section 2.
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	177	The term (2) is used to control (1).
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	178	That is because one can obtain an overall
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	179	smaller regex list
4d9eecfc936a sad Chengsong parents: 468 diff changeset	180	by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
4d9eecfc936a sad Chengsong parents: 468 diff changeset	181	Section 3 is dedicated to its proof.
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	182	In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	183	bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
4d9eecfc936a sad Chengsong parents: 468 diff changeset	184	than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
4d9eecfc936a sad Chengsong parents: 468 diff changeset	185	for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
4d9eecfc936a sad Chengsong parents: 468 diff changeset	186	We reason similarly for $\STAR$.\medskip
4d9eecfc936a sad Chengsong parents: 468 diff changeset	187
4d9eecfc936a sad Chengsong parents: 468 diff changeset	188	\noindent
4d9eecfc936a sad Chengsong parents: 468 diff changeset	189	Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
4d9eecfc936a sad Chengsong parents: 468 diff changeset	190	far from the actual bound we can expect. We can do better than this, but this does not improve
4d9eecfc936a sad Chengsong parents: 468 diff changeset	191	the finiteness property we are proving. If we are interested in a polynomial bound,
4d9eecfc936a sad Chengsong parents: 468 diff changeset	192	one would hope to obtain a similar tight bound as for partial
4d9eecfc936a sad Chengsong parents: 468 diff changeset	193	derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
4d9eecfc936a sad Chengsong parents: 468 diff changeset	194	$\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
4d9eecfc936a sad Chengsong parents: 468 diff changeset	195	partial derivatives). Unfortunately to obtain the exact same bound would mean
4d9eecfc936a sad Chengsong parents: 468 diff changeset	196	we need to introduce simplifications, such as
4d9eecfc936a sad Chengsong parents: 468 diff changeset	197	$(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
4d9eecfc936a sad Chengsong parents: 468 diff changeset	198	which exist for partial derivatives. However, if we introduce them in our
4d9eecfc936a sad Chengsong parents: 468 diff changeset	199	setting we would lose the POSIX property of our calculated values. We leave better
4d9eecfc936a sad Chengsong parents: 468 diff changeset	200	bounds for future work.\\[-6.5mm]
4d9eecfc936a sad Chengsong parents: 468 diff changeset	201
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	202
507 213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	203
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	204	%-----------------------------------
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	205	% SECTION ?
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	206	%-----------------------------------
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	207	\section{preparatory properties for getting a finiteness bound}
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	208	Before we get to the proof that says the intermediate result of our lexer will
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	209	remain finitely bounded, which is an important efficiency/liveness guarantee,
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	210	we shall first develop a few preparatory properties and definitions to
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	211	make the process of proving that a breeze.
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	212
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	213	We define rewriting relations for $\rrexp$s, which allows us to do the
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	214	same trick as we did for the correctness proof,
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	215	but this time we will have stronger equalities established.
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	216	\subsection{"hrewrite" relation}
514 036600af4c30 chapter2 Chengsong parents: 507 diff changeset	217	List of 1-step rewrite rules for regular expressions simplification without bitcodes:
515 84938708781d a bit Chengsong parents: 514 diff changeset	218	\begin{center}
84938708781d a bit Chengsong parents: 514 diff changeset	219	\begin{tabular}{lcl}
84938708781d a bit Chengsong parents: 514 diff changeset	220	$r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$
84938708781d a bit Chengsong parents: 514 diff changeset	221	\end{tabular}
84938708781d a bit Chengsong parents: 514 diff changeset	222	\end{center}
84938708781d a bit Chengsong parents: 514 diff changeset	223
84938708781d a bit Chengsong parents: 514 diff changeset	224	And we define an "grewrite" relation that works on lists:
84938708781d a bit Chengsong parents: 514 diff changeset	225	\begin{center}
84938708781d a bit Chengsong parents: 514 diff changeset	226	\begin{tabular}{lcl}
84938708781d a bit Chengsong parents: 514 diff changeset	227	$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
84938708781d a bit Chengsong parents: 514 diff changeset	228	\end{tabular}
84938708781d a bit Chengsong parents: 514 diff changeset	229	\end{center}
507 213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	230
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	231
213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	232
515 84938708781d a bit Chengsong parents: 514 diff changeset	233	With these relations we prove
84938708781d a bit Chengsong parents: 514 diff changeset	234	\begin{lemma}
84938708781d a bit Chengsong parents: 514 diff changeset	235	$rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
84938708781d a bit Chengsong parents: 514 diff changeset	236	\end{lemma}
84938708781d a bit Chengsong parents: 514 diff changeset	237	which enables us to prove "closed forms" equalities such as
84938708781d a bit Chengsong parents: 514 diff changeset	238	\begin{lemma}
84938708781d a bit Chengsong parents: 514 diff changeset	239	$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
84938708781d a bit Chengsong parents: 514 diff changeset	240	\end{lemma}
84938708781d a bit Chengsong parents: 514 diff changeset	241
84938708781d a bit Chengsong parents: 514 diff changeset	242	But the most involved part of the above lemma is proving the following:
84938708781d a bit Chengsong parents: 514 diff changeset	243	\begin{lemma}
84938708781d a bit Chengsong parents: 514 diff changeset	244	$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $
84938708781d a bit Chengsong parents: 514 diff changeset	245	\end{lemma}
84938708781d a bit Chengsong parents: 514 diff changeset	246	which is needed in proving things like
84938708781d a bit Chengsong parents: 514 diff changeset	247	\begin{lemma}
84938708781d a bit Chengsong parents: 514 diff changeset	248	$r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$
84938708781d a bit Chengsong parents: 514 diff changeset	249	\end{lemma}
84938708781d a bit Chengsong parents: 514 diff changeset	250	.
84938708781d a bit Chengsong parents: 514 diff changeset	251	Fortunately this is provable by induction on the list $rs$
507 213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	252
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	253	%----------------------------------------------------------------------------------------
507 213220f54a6e thesis section2.2 Chengsong parents: 506 diff changeset	254	% SECTION ??
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	255	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	256
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	257	\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	258	To embark on getting the "closed forms" of regexes, we first
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	259	define a few auxiliary definitions to make expressing them smoothly.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	260
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	261	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	262	\begin{tabular}{ccc}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	263	$\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	264	$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	265	$\sflataux r$ & $=$ & $ [r]$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	266	\end{tabular}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	267	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	268	The intuition behind $\sflataux{\_}$ is to break up nested regexes
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	269	of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	270	into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	271	It will return the singleton list $[r]$ otherwise.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	272
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	273	$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	274	the output type a regular expression, not a list:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	275	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	276	\begin{tabular}{ccc}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	277	$\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	278	$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	279	$\sflat r$ & $=$ & $ [r]$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	280	\end{tabular}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	281	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	282	$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	283	first element of the list of children of
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	284	an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	285	$r_1 \cdot r_2 \backslash s$.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	286
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	287	With $\sflat{\_}$ and $\sflataux{\_}$ we make
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	288	precise what "closed forms" we have for the sequence derivatives and their simplifications,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	289	in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	290	and $\bderssimp{(r_1\cdot r_2)}{s}$,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	291	if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	292	and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	293	the substring of $s$?
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	294	First let's look at a series of derivatives steps on a sequence
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	295	regular expression, (assuming) that each time the first
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	296	component of the sequence is always nullable):
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	297	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	298
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	299	$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	300	$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	301	\ldots$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	302
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	303	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	304	%TODO: cite indian paper
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	305	Indianpaper have come up with a slightly more formal way of putting the above process:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	306	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	307	$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	308	\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	309	+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	310	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	311	where $\delta(b, r)$ will produce $r$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	312	if $b$ evaluates to true,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	313	and $\ZERO$ otherwise.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	314
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	315	But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	316	equivalence. To make this intuition useful
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	317	for a formal proof, we need something
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	318	stronger than language equivalence.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	319	With the help of $\sflat{\_}$ we can state the equation in Indianpaper
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	320	more rigorously:
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	321	\begin{lemma}
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	322	$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	323	\end{lemma}
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	324
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	325	The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	326
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	327	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	328	\begin{tabular}{lcl}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	329	$\vsuf{[]}{\_} $ & $=$ & $[]$\\
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	330	$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	331	&& $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	332	\end{tabular}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	333	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	334	It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	335	and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	336	the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	337	In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	338	the entire result of $(r_1 \cdot r_2) \backslash s$,
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	339	it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$.
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	340
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	341	With this we can also add simplifications to both sides and get
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	342	\begin{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	343	$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	344	\end{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	345	Together with the idempotency property of $\rsimp$,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	346	%TODO: state the idempotency property of rsimp
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	347	\begin{lemma}
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	348	$\rsimp(r) = \rsimp (\rsimp(r))$
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	349	\end{lemma}
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	350	it is possible to convert the above lemma to obtain a "closed form"
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	351	for derivatives nested with simplification:
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	352	\begin{lemma}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	353	$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	354	\end{lemma}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	355	And now the reason we have (1) in section 1 is clear:
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	356	$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$,
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	357	is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	358	as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	359
4d9eecfc936a sad Chengsong parents: 468 diff changeset	360	%----------------------------------------------------------------------------------------
4d9eecfc936a sad Chengsong parents: 468 diff changeset	361	% SECTION 3
4d9eecfc936a sad Chengsong parents: 468 diff changeset	362	%----------------------------------------------------------------------------------------
4d9eecfc936a sad Chengsong parents: 468 diff changeset	363
4d9eecfc936a sad Chengsong parents: 468 diff changeset	364	\section{interaction between $\distinctWith$ and $\flts$}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	365	Note that it is not immediately obvious that
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	366	\begin{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	367	$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $.
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	368	\end{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	369	The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	370	duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	371
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	372	%----------------------------------------------------------------------------------------
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	373	% SECTION 4
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	374	%----------------------------------------------------------------------------------------
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	375	\section{a bound for the star regular expression}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	376	We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	377	the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	378	the property of the $\distinct$ function.
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	379	Now we try to get a bound on $r^* \backslash s$ as well.
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	380	Again, we first look at how a star's derivatives evolve, if they grow maximally:
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	381	\begin{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	382
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	383	$r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	384	r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	385	(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	386	\quad \ldots$
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	387
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	388	\end{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	389	When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$,
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	390	$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	391	the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	392	of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	393	count the possible size explosions of $r \backslash c$ themselves.
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	394
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	395	Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	396	$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	397	into $\RALTS{[r_1 \backslash cc'c'' \cdot r^, r \backslash c'', r \backslash c'c'' \cdot r^, r \backslash c'' \cdot r^*]}$
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	398	and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	399	For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	400	%TODO: definitions of and \hflataux \hflat
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	401	\begin{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	402	\begin{tabular}{ccc}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	403	$\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	404	$\hflataux r$ & $=$ & $ [r]$
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	405	\end{tabular}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	406	\end{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	407
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	408	\begin{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	409	\begin{tabular}{ccc}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	410	$\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	411	$\hflat r$ & $=$ & $ r$
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	412	\end{tabular}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	413	\end{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	414	Again these definitions are tailor-made for dealing with alternatives that have
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	415	originated from a star's derivatives, so we don't attempt to open up all possible
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	416	regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	417	elements.
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	418	We give a predicate for such "star-created" regular expressions:
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	419	\begin{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	420	\begin{tabular}{lcr}
506 69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	421	& & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	422	$\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	423	\end{tabular}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	424	\end{center}
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	425
506 69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	426	These definitions allows us the flexibility to talk about
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	427	regular expressions in their most convenient format,
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	428	for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	429	instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	430	These definitions help express that certain classes of syntatically
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	431	distinct regular expressions are actually the same under simplification.
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	432	This is not entirely true for annotated regular expressions:
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	433	%TODO: bsimp bders \neq bderssimp
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	434	\begin{center}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	435	$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	436	\end{center}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	437	For bit-codes, the order in which simplification is applied
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	438	might cause a difference in the location they are placed.
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	439	If we want something like
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	440	\begin{center}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	441	$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	442	\end{center}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	443	Some "canonicalization" procedure is required,
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	444	which either pushes all the common bitcodes to nodes
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	445	as senior as possible:
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	446	\begin{center}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	447	$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	448	\end{center}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	449	or does the reverse. However bitcodes are not of interest if we are talking about
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	450	the $\llbracket r \rrbracket$ size of a regex.
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	451	Therefore for the ease and simplicity of producing a
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	452	proof for a size bound, we are happy to restrict ourselves to
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	453	unannotated regular expressions, and obtain such equalities as
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	454	\begin{lemma}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	455	$\rsimp{(r_1 + r_2)} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	456	\end{lemma}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	457
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	458	\begin{proof}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	459	By using the rewriting relation $\rightsquigarrow$
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	460	\end{proof}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	461	%TODO: rsimp sflat
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	462	And from this we obtain a proof that a star's derivative will be the same
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	463	as if it had all its nested alternatives created during deriving being flattened out:
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	464	For example,
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	465	\begin{lemma}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	466	$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	467	\end{lemma}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	468	\begin{proof}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	469	By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	470	\end{proof}
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	471	% The simplification of a flattened out regular expression, provided it comes
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	472	%from the derivative of a star, is the same as the one nested.
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	473
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	474
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	475
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	476
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	477
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	478
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	479
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	480
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	481
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	482	One might wonder the actual bound rather than the loose bound we gave
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	483	for the convenience of a easier proof.
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	484	How much can the regex $r^* \backslash s$ grow? As hinted earlier,
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	485	they can grow at a maximum speed
506 69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	486	exponential $w.r.t$ the number of characters,
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	487	but will eventually level off when the string $s$ is long enough,
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	488	as suggested by the finiteness bound proof.
506 69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	489	And unfortunately we have concrete examples
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	490	where such regexes grew exponentially large before levelling off:
516 6fecb7fe8cd0 blexer2: modified for plotting Chengsong parents: 515 diff changeset	491	$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
6fecb7fe8cd0 blexer2: modified for plotting Chengsong parents: 515 diff changeset	492	(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
506 69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	493	size that is exponential on the number $n$.
505 5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	494	%TODO: give out a graph showing how the size of the regex grows and levels off at a size exponential to the original regex
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	495
5ce3bd8e5696 thesis chapter2 section 2.4 Chengsong parents: 503 diff changeset	496
506 69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	497	While the tight upper bound will definitely be a lot lower than
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	498	the one we gave for the finiteness proof,
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	499	it will still be at least expoential, under our current simplification rules.
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	500
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	501	This suggests stronger simplifications that keep the tight bound polynomial.
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	502
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	503	%----------------------------------------------------------------------------------------
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	504	% SECTION 5
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	505	%----------------------------------------------------------------------------------------
516 6fecb7fe8cd0 blexer2: modified for plotting Chengsong parents: 515 diff changeset	506	\section{A Stronger Version of Simplification}
506 69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	507	%TODO: search for isabelle proofs of algorithms that check equivalence
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	508
69ad05398894 thesis chapter 2 section 2.4 2.5 Chengsong parents: 505 diff changeset	509

author	Chengsong
	Fri, 20 May 2022 18:48:34 +0100
changeset 516	6fecb7fe8cd0
parent 515	84938708781d
child 518	ff7945a988a3
permissions	-rwxr-xr-x