lexing: ChengsongTanPhdThesis/Chapters/Chapter2.tex@79fdaad6f36b (annotated)

468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	1	% Chapter Template
a0f27e21b42c all texrelated Chengsong parents: diff changeset	2
a0f27e21b42c all texrelated Chengsong parents: diff changeset	3	\chapter{Chapter Title Here} % Main chapter title
a0f27e21b42c all texrelated Chengsong parents: diff changeset	4
a0f27e21b42c all texrelated Chengsong parents: diff changeset	5	\label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	6
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	7
4d9eecfc936a sad Chengsong parents: 468 diff changeset	8
4d9eecfc936a sad Chengsong parents: 468 diff changeset	9
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	10	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	11	% SECTION 1
a0f27e21b42c all texrelated Chengsong parents: diff changeset	12	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	13
a0f27e21b42c all texrelated Chengsong parents: diff changeset	14	\section{Properties of $\backslash c$}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	15
a0f27e21b42c all texrelated Chengsong parents: diff changeset	16	To have a clear idea of what we can and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	17	need to prove about the algorithms involving
a0f27e21b42c all texrelated Chengsong parents: diff changeset	18	Brzozowski's derivatives, there are a few
a0f27e21b42c all texrelated Chengsong parents: diff changeset	19	properties we need to be clear about
a0f27e21b42c all texrelated Chengsong parents: diff changeset	20	it.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	21	\subsection{function $\backslash c$ is not 1-to-1}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	22	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	23	The derivative $w.r.t$ character $c$ is not one-to-one.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	24	Formally,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	25	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	26	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	27	This property is trivially true for the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	28	character regex example:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	29	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	30	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	31	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	32	But apart from the cases where the derivative
a0f27e21b42c all texrelated Chengsong parents: diff changeset	33	output is $\ZERO$, are there non-trivial results
a0f27e21b42c all texrelated Chengsong parents: diff changeset	34	of derivatives which contain strings?
a0f27e21b42c all texrelated Chengsong parents: diff changeset	35	The answer is yes.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	36	For example,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	37	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	38	Let $r_1 = a^b\;\quad r_2 = (a\cdot a^)\cdot b + b$.\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	39	where $a$ is not nullable.\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	40	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	41	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	42	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	43	We start with two syntactically different regexes,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	44	and end up with the same derivative result, which is
a0f27e21b42c all texrelated Chengsong parents: diff changeset	45	a "meaningful" regex because it contains strings.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	46	We have rediscovered Arden's lemma:\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	47	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	48	$A^B = A\cdot A^ \cdot B + B$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	49	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	50
a0f27e21b42c all texrelated Chengsong parents: diff changeset	51
a0f27e21b42c all texrelated Chengsong parents: diff changeset	52	%-----------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	53	% SUBSECTION 1
a0f27e21b42c all texrelated Chengsong parents: diff changeset	54	%-----------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	55	\subsection{Subsection 1}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	56
a0f27e21b42c all texrelated Chengsong parents: diff changeset	57	Nunc posuere quam at lectus tristique eu ultrices augue venenatis. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Aliquam erat volutpat. Vivamus sodales tortor eget quam adipiscing in vulputate ante ullamcorper. Sed eros ante, lacinia et sollicitudin et, aliquam sit amet augue. In hac habitasse platea dictumst.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	58
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	59
4d9eecfc936a sad Chengsong parents: 468 diff changeset	60
4d9eecfc936a sad Chengsong parents: 468 diff changeset	61
4d9eecfc936a sad Chengsong parents: 468 diff changeset	62
4d9eecfc936a sad Chengsong parents: 468 diff changeset	63
4d9eecfc936a sad Chengsong parents: 468 diff changeset	64
4d9eecfc936a sad Chengsong parents: 468 diff changeset	65
4d9eecfc936a sad Chengsong parents: 468 diff changeset	66
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	67	%-----------------------------------
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	68	% SECTION 2
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	69	%-----------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	70
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	71	\section{Finiteness Property}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	72	In this section let us sketch our argument for why the size of the simplified
4d9eecfc936a sad Chengsong parents: 468 diff changeset	73	derivatives with the aggressive simplification function can be finitely bounded.
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	74
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	75	For convenience, we use a new datatype which we call $\rrexp$ to denote
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	76	the difference between it and annotated regular expressions.
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	77	\[ \rrexp ::= \RZERO \mid \RONE
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	78	\mid \RCHAR{c}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	79	\mid \RSEQ{r_1}{r_2}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	80	\mid \RALTS{rs}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	81	\mid \RSTAR{r}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	82	\]
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	83	For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	84	but keep everything else intact.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	85	It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	86	(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	87	$\ALTS$).
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	88	We denote the operation of erasing the bits and turning an annotated regular expression
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	89	into an $\rrexp{}$ as $\rerase{}$.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	90	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	91	\begin{tabular}{lcr}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	92	$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	93	$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	94	$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	95	\end{tabular}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	96	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	97	%TODO: FINISH definition of rerase
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	98	Similarly we could define the derivative and simplification on
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	99	$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	100	except that now they can operate on alternatives taking multiple arguments.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	101	%TODO: definition of rder rsimp (maybe only the alternative clause)
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	102	The reason why these definitions mirror precisely the corresponding operations
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	103	on annotated regualar expressions is that we can calculate sizes more easily:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	104
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	105	\begin{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	106	$\rsize{\rerase a} = \asize a$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	107	\end{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	108	A similar equation holds for annotated regular expressions' simplification
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	109	and the plain regular expressions' simplification:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	110	\begin{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	111	$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	112	\end{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	113	Putting these two together we get a property that allows us to estimate the
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	114	size of an annotated regular expression derivative using its un-annotated counterpart:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	115	\begin{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	116	$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	117	\end{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	118	Unless stated otherwise in this chapter all $\textit{rexp}$s without
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	119	bitcodes are seen as $\rrexp$s.
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	120
4d9eecfc936a sad Chengsong parents: 468 diff changeset	121	Suppose
4d9eecfc936a sad Chengsong parents: 468 diff changeset	122	we have a size function for bitcoded regular expressions, written
4d9eecfc936a sad Chengsong parents: 468 diff changeset	123	$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
4d9eecfc936a sad Chengsong parents: 468 diff changeset	124	(we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
4d9eecfc936a sad Chengsong parents: 468 diff changeset	125	there exists a bound $N$
4d9eecfc936a sad Chengsong parents: 468 diff changeset	126	such that
4d9eecfc936a sad Chengsong parents: 468 diff changeset	127
4d9eecfc936a sad Chengsong parents: 468 diff changeset	128	\begin{center}
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	129	$\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	130	\end{center}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	131
4d9eecfc936a sad Chengsong parents: 468 diff changeset	132	\noindent
4d9eecfc936a sad Chengsong parents: 468 diff changeset	133	We prove this by induction on $r$. The base cases for $\AZERO$,
4d9eecfc936a sad Chengsong parents: 468 diff changeset	134	$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	135	for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	136	hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	137	$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	138	%
4d9eecfc936a sad Chengsong parents: 468 diff changeset	139	\begin{center}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	140	\begin{tabular}{lcll}
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	141	& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	142	& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	143	[\bderssimp{ r_2}{s'} \;\|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	144	& $\leq$ &
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	145	$\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	146	[$\bderssimp{ r_2}{s'} \;\|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	147	& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	148	\llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;\|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	149	& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	150	\llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;\|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	151	& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
4d9eecfc936a sad Chengsong parents: 468 diff changeset	152	\end{tabular}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	153	\end{center}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	154
4d9eecfc936a sad Chengsong parents: 468 diff changeset	155	% tell Chengsong about Indian paper of closed forms of derivatives
4d9eecfc936a sad Chengsong parents: 468 diff changeset	156
4d9eecfc936a sad Chengsong parents: 468 diff changeset	157	\noindent
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	158	where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	159	The reason why we could write the derivatives of a sequence this way is described in section 2.
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	160	The term (2) is used to control (1).
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	161	That is because one can obtain an overall
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	162	smaller regex list
4d9eecfc936a sad Chengsong parents: 468 diff changeset	163	by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
4d9eecfc936a sad Chengsong parents: 468 diff changeset	164	Section 3 is dedicated to its proof.
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	165	In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	166	bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
4d9eecfc936a sad Chengsong parents: 468 diff changeset	167	than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
4d9eecfc936a sad Chengsong parents: 468 diff changeset	168	for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
4d9eecfc936a sad Chengsong parents: 468 diff changeset	169	We reason similarly for $\STAR$.\medskip
4d9eecfc936a sad Chengsong parents: 468 diff changeset	170
4d9eecfc936a sad Chengsong parents: 468 diff changeset	171	\noindent
4d9eecfc936a sad Chengsong parents: 468 diff changeset	172	Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
4d9eecfc936a sad Chengsong parents: 468 diff changeset	173	far from the actual bound we can expect. We can do better than this, but this does not improve
4d9eecfc936a sad Chengsong parents: 468 diff changeset	174	the finiteness property we are proving. If we are interested in a polynomial bound,
4d9eecfc936a sad Chengsong parents: 468 diff changeset	175	one would hope to obtain a similar tight bound as for partial
4d9eecfc936a sad Chengsong parents: 468 diff changeset	176	derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
4d9eecfc936a sad Chengsong parents: 468 diff changeset	177	$\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
4d9eecfc936a sad Chengsong parents: 468 diff changeset	178	partial derivatives). Unfortunately to obtain the exact same bound would mean
4d9eecfc936a sad Chengsong parents: 468 diff changeset	179	we need to introduce simplifications, such as
4d9eecfc936a sad Chengsong parents: 468 diff changeset	180	$(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
4d9eecfc936a sad Chengsong parents: 468 diff changeset	181	which exist for partial derivatives. However, if we introduce them in our
4d9eecfc936a sad Chengsong parents: 468 diff changeset	182	setting we would lose the POSIX property of our calculated values. We leave better
4d9eecfc936a sad Chengsong parents: 468 diff changeset	183	bounds for future work.\\[-6.5mm]
4d9eecfc936a sad Chengsong parents: 468 diff changeset	184
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	185
a0f27e21b42c all texrelated Chengsong parents: diff changeset	186	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	187	% SECTION 2
a0f27e21b42c all texrelated Chengsong parents: diff changeset	188	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	189
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	190	\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	191	To embark on getting the "closed forms" of regexes, we first
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	192	define a few auxiliary definitions to make expressing them smoothly.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	193
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	194	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	195	\begin{tabular}{ccc}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	196	$\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	197	$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	198	$\sflataux r$ & $=$ & $ [r]$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	199	\end{tabular}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	200	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	201	The intuition behind $\sflataux{\_}$ is to break up nested regexes
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	202	of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	203	into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	204	It will return the singleton list $[r]$ otherwise.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	205
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	206	$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	207	the output type a regular expression, not a list:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	208	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	209	\begin{tabular}{ccc}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	210	$\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	211	$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	212	$\sflat r$ & $=$ & $ [r]$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	213	\end{tabular}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	214	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	215	$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	216	first element of the list of children of
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	217	an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	218	$r_1 \cdot r_2 \backslash s$.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	219
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	220	With $\sflat{\_}$ and $\sflataux{\_}$ we make
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	221	precise what "closed forms" we have for the sequence derivatives and their simplifications,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	222	in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	223	and $\bderssimp{(r_1\cdot r_2)}{s}$,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	224	if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	225	and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	226	the substring of $s$?
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	227	First let's look at a series of derivatives steps on a sequence
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	228	regular expression, (assuming) that each time the first
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	229	component of the sequence is always nullable):
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	230	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	231
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	232	$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	233	$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	234	\ldots$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	235
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	236	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	237	%TODO: cite indian paper
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	238	Indianpaper have come up with a slightly more formal way of putting the above process:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	239	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	240	$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	241	\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	242	+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	243	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	244	where $\delta(b, r)$ will produce $r$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	245	if $b$ evaluates to true,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	246	and $\ZERO$ otherwise.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	247
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	248	But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	249	equivalence. To make this intuition useful
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	250	for a formal proof, we need something
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	251	stronger than language equivalence.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	252	With the help of $\sflat{\_}$ we can state the equation in Indianpaper
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	253	more rigorously:
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	254	\begin{lemma}
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	255	$\sflat{(r_1 \cdot r_2) \backslash s} = \AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	256	\end{lemma}
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	257	With this property we can prove the finiteness of the size of the regex $(r_1 \cdot r_2) \backslash s$,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	258	much like a recursive function's termination proof.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	259	The function $\vsuf{\_}{\_}$ is defined this way:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	260	%TODO: DEF of vsuf
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	261	\begin{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	262	\begin{tabular}{lcl}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	263	$\vsuf{[]}{\_} $ & $=$ & $[]$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	264	$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} (\vsuf{cs}{(rder c r1)}) @ [c :: cs]$\\
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	265	&& $\textit{else} (\vsuf{cs}{rder c r1}) $
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	266	\end{tabular}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	267	\end{center}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	268	It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	269	and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	270	the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	271	In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r$, but instead of producing
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	272	the entire result of $(r_1 \cdot r_2) \backslash s$,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	273	it only stores all the $r_2 \backslash s''$ terms.
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	274
503 35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	275	With this we can also add simplifications to both sides and get
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	276	\begin{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	277	$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	278	\end{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	279	Together with the idempotency property of $\rsimp$,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	280	%TODO: state the idempotency property of rsimp
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	281	it is possible to convert the above lemma to obtain a "closed form"
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	282	for our lexer's intermediate result without bitcodes:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	283	\begin{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	284	$\rderssimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	285	\end{lemma}
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	286	And now the reason we have (2) in section 1 is clear:
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	287	$\rsize{\rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}}$
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	288	is bounded by $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	289	[$\bderssimp{ r_2}{s'} \;\|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ ,
35b80e37dfe7 new writing Chengsong parents: 500 diff changeset	290	as $\vsuf{s}{r_1}$ is a subset of s $\textit{Suffix}( r_1, s)])$.
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	291
4d9eecfc936a sad Chengsong parents: 468 diff changeset	292	%----------------------------------------------------------------------------------------
4d9eecfc936a sad Chengsong parents: 468 diff changeset	293	% SECTION 3
4d9eecfc936a sad Chengsong parents: 468 diff changeset	294	%----------------------------------------------------------------------------------------
4d9eecfc936a sad Chengsong parents: 468 diff changeset	295
4d9eecfc936a sad Chengsong parents: 468 diff changeset	296	\section{interaction between $\distinctWith$ and $\flts$}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	297	Note that it is not immediately obvious that
4d9eecfc936a sad Chengsong parents: 468 diff changeset	298	$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	299
a0f27e21b42c all texrelated Chengsong parents: diff changeset	300	Sed ullamcorper quam eu nisl interdum at interdum enim egestas. Aliquam placerat justo sed lectus lobortis ut porta nisl porttitor. Vestibulum mi dolor, lacinia molestie gravida at, tempus vitae ligula. Donec eget quam sapien, in viverra eros. Donec pellentesque justo a massa fringilla non vestibulum metus vestibulum. Vestibulum in orci quis felis tempor lacinia. Vivamus ornare ultrices facilisis. Ut hendrerit volutpat vulputate. Morbi condimentum venenatis augue, id porta ipsum vulputate in. Curabitur luctus tempus justo. Vestibulum risus lectus, adipiscing nec condimentum quis, condimentum nec nisl. Aliquam dictum sagittis velit sed iaculis. Morbi tristique augue sit amet nulla pulvinar id facilisis ligula mollis. Nam elit libero, tincidunt ut aliquam at, molestie in quam. Aenean rhoncus vehicula hendrerit.

author	Christian Urban <christian.urban@kcl.ac.uk>
	Sun, 08 May 2022 09:58:50 +0100
changeset 504	79fdaad6f36b
parent 503	35b80e37dfe7
child 505	5ce3bd8e5696
permissions	-rwxr-xr-x