lexing: ChengsongTanPhdThesis/Chapters/Chapter2.tex@4d9eecfc936a (annotated)

468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	1	% Chapter Template
a0f27e21b42c all texrelated Chengsong parents: diff changeset	2
a0f27e21b42c all texrelated Chengsong parents: diff changeset	3	\chapter{Chapter Title Here} % Main chapter title
a0f27e21b42c all texrelated Chengsong parents: diff changeset	4
a0f27e21b42c all texrelated Chengsong parents: diff changeset	5	\label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	6
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	7
4d9eecfc936a sad Chengsong parents: 468 diff changeset	8
4d9eecfc936a sad Chengsong parents: 468 diff changeset	9
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	10	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	11	% SECTION 1
a0f27e21b42c all texrelated Chengsong parents: diff changeset	12	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	13
a0f27e21b42c all texrelated Chengsong parents: diff changeset	14	\section{Properties of $\backslash c$}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	15
a0f27e21b42c all texrelated Chengsong parents: diff changeset	16	To have a clear idea of what we can and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	17	need to prove about the algorithms involving
a0f27e21b42c all texrelated Chengsong parents: diff changeset	18	Brzozowski's derivatives, there are a few
a0f27e21b42c all texrelated Chengsong parents: diff changeset	19	properties we need to be clear about
a0f27e21b42c all texrelated Chengsong parents: diff changeset	20	it.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	21	\subsection{function $\backslash c$ is not 1-to-1}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	22	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	23	The derivative $w.r.t$ character $c$ is not one-to-one.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	24	Formally,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	25	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	26	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	27	This property is trivially true for the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	28	character regex example:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	29	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	30	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	31	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	32	But apart from the cases where the derivative
a0f27e21b42c all texrelated Chengsong parents: diff changeset	33	output is $\ZERO$, are there non-trivial results
a0f27e21b42c all texrelated Chengsong parents: diff changeset	34	of derivatives which contain strings?
a0f27e21b42c all texrelated Chengsong parents: diff changeset	35	The answer is yes.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	36	For example,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	37	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	38	Let $r_1 = a^b\;\quad r_2 = (a\cdot a^)\cdot b + b$.\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	39	where $a$ is not nullable.\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	40	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	41	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	42	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	43	We start with two syntactically different regexes,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	44	and end up with the same derivative result, which is
a0f27e21b42c all texrelated Chengsong parents: diff changeset	45	a "meaningful" regex because it contains strings.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	46	We have rediscovered Arden's lemma:\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	47	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	48	$A^B = A\cdot A^ \cdot B + B$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	49	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	50
a0f27e21b42c all texrelated Chengsong parents: diff changeset	51
a0f27e21b42c all texrelated Chengsong parents: diff changeset	52	%-----------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	53	% SUBSECTION 1
a0f27e21b42c all texrelated Chengsong parents: diff changeset	54	%-----------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	55	\subsection{Subsection 1}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	56
a0f27e21b42c all texrelated Chengsong parents: diff changeset	57	Nunc posuere quam at lectus tristique eu ultrices augue venenatis. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Aliquam erat volutpat. Vivamus sodales tortor eget quam adipiscing in vulputate ante ullamcorper. Sed eros ante, lacinia et sollicitudin et, aliquam sit amet augue. In hac habitasse platea dictumst.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	58
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	59
4d9eecfc936a sad Chengsong parents: 468 diff changeset	60
4d9eecfc936a sad Chengsong parents: 468 diff changeset	61
4d9eecfc936a sad Chengsong parents: 468 diff changeset	62
4d9eecfc936a sad Chengsong parents: 468 diff changeset	63
4d9eecfc936a sad Chengsong parents: 468 diff changeset	64
4d9eecfc936a sad Chengsong parents: 468 diff changeset	65
4d9eecfc936a sad Chengsong parents: 468 diff changeset	66
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	67	%-----------------------------------
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	68	% SECTION 2
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	69	%-----------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	70
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	71	\section{Finiteness Property}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	72	In this section let us sketch our argument for why the size of the simplified
4d9eecfc936a sad Chengsong parents: 468 diff changeset	73	derivatives with the aggressive simplification function can be finitely bounded.
4d9eecfc936a sad Chengsong parents: 468 diff changeset	74	For convenience, we use a new datatype which we call $\textit{rrexp}$ to denote
4d9eecfc936a sad Chengsong parents: 468 diff changeset	75	the difference between it and annotated regular expressions.
4d9eecfc936a sad Chengsong parents: 468 diff changeset	76	For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, but keeps
4d9eecfc936a sad Chengsong parents: 468 diff changeset	77	everything else intact.
4d9eecfc936a sad Chengsong parents: 468 diff changeset	78	It is similar to annotated regular expression being $\erase$ed, but with all its structure being intact
4d9eecfc936a sad Chengsong parents: 468 diff changeset	79	(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
4d9eecfc936a sad Chengsong parents: 468 diff changeset	80	$\AALTS$.
4d9eecfc936a sad Chengsong parents: 468 diff changeset	81	We denote the operation of erasing the bits and turning an annotated regular expression
4d9eecfc936a sad Chengsong parents: 468 diff changeset	82	into an $\rrexp$ as $\rerase$.
4d9eecfc936a sad Chengsong parents: 468 diff changeset	83	%TODO: definition of rerase
4d9eecfc936a sad Chengsong parents: 468 diff changeset	84	That we can bound the size of annotated regular expressions by
4d9eecfc936a sad Chengsong parents: 468 diff changeset	85	$\rrexp$ is that the bitcodes grow linearly w.r.t the input string, and don't contribute to the structural size of
4d9eecfc936a sad Chengsong parents: 468 diff changeset	86	an annotated regular expression:
4d9eecfc936a sad Chengsong parents: 468 diff changeset	87	$\rsize (\rerase a) = \asize a$
4d9eecfc936a sad Chengsong parents: 468 diff changeset	88
4d9eecfc936a sad Chengsong parents: 468 diff changeset	89	Suppose
4d9eecfc936a sad Chengsong parents: 468 diff changeset	90	we have a size function for bitcoded regular expressions, written
4d9eecfc936a sad Chengsong parents: 468 diff changeset	91	$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
4d9eecfc936a sad Chengsong parents: 468 diff changeset	92	(we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
4d9eecfc936a sad Chengsong parents: 468 diff changeset	93	there exists a bound $N$
4d9eecfc936a sad Chengsong parents: 468 diff changeset	94	such that
4d9eecfc936a sad Chengsong parents: 468 diff changeset	95
4d9eecfc936a sad Chengsong parents: 468 diff changeset	96	\begin{center}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	97	$\forall s. \; \llbracket{\bderssimp r s}\rrbracket \leq N$
4d9eecfc936a sad Chengsong parents: 468 diff changeset	98	\end{center}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	99
4d9eecfc936a sad Chengsong parents: 468 diff changeset	100	\noindent
4d9eecfc936a sad Chengsong parents: 468 diff changeset	101	We prove this by induction on $r$. The base cases for $\AZERO$,
4d9eecfc936a sad Chengsong parents: 468 diff changeset	102	$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
4d9eecfc936a sad Chengsong parents: 468 diff changeset	103	for sequences of the form $\ASEQ \textit{bs} r_1 r_2$. In this case our induction
4d9eecfc936a sad Chengsong parents: 468 diff changeset	104	hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp(r_1, s) \rrbracket \leq N_1$ and
4d9eecfc936a sad Chengsong parents: 468 diff changeset	105	$\exists N_2. \forall s. \; \llbracket \bderssimp(r_2, s) \rrbracket \leq N_2$. We can reason as follows
4d9eecfc936a sad Chengsong parents: 468 diff changeset	106	%
4d9eecfc936a sad Chengsong parents: 468 diff changeset	107	\begin{center}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	108	\begin{tabular}{lcll}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	109	& & $ \llbracket \bderssimp( (\ASEQ\, \textit{bs} \, r_1 \,r_2), s) \rrbracket$\\
4d9eecfc936a sad Chengsong parents: 468 diff changeset	110	& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;((ASEQ [] (\bderssimp r_1 s) r_2) ::
4d9eecfc936a sad Chengsong parents: 468 diff changeset	111	[\bderssimp r_2 s' \;\|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
4d9eecfc936a sad Chengsong parents: 468 diff changeset	112	& $\leq$ &
4d9eecfc936a sad Chengsong parents: 468 diff changeset	113	$\llbracket\textit{\distinctWith}\,((\ASEQ [] (\bderssimp r_1 s) r_2$) ::
4d9eecfc936a sad Chengsong parents: 468 diff changeset	114	[$\bderssimp r_2 s' \;\|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
4d9eecfc936a sad Chengsong parents: 468 diff changeset	115	& $\leq$ & $\llbracket\ASEQ [] (\bderssimp r_1 s) r_2\rrbracket +
4d9eecfc936a sad Chengsong parents: 468 diff changeset	116	\llbracket\textit{distinctWith}\,[\bderssimp r_2 s' \;\|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
4d9eecfc936a sad Chengsong parents: 468 diff changeset	117	& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
4d9eecfc936a sad Chengsong parents: 468 diff changeset	118	\llbracket \distinctWith\,[ \bderssimp r_2 s' \;\|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
4d9eecfc936a sad Chengsong parents: 468 diff changeset	119	& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
4d9eecfc936a sad Chengsong parents: 468 diff changeset	120	\end{tabular}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	121	\end{center}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	122
4d9eecfc936a sad Chengsong parents: 468 diff changeset	123	% tell Chengsong about Indian paper of closed forms of derivatives
4d9eecfc936a sad Chengsong parents: 468 diff changeset	124
4d9eecfc936a sad Chengsong parents: 468 diff changeset	125	\noindent
4d9eecfc936a sad Chengsong parents: 468 diff changeset	126	where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp r_1 s'$ is nullable ($s'$ being a suffix of $s$).
4d9eecfc936a sad Chengsong parents: 468 diff changeset	127	The reason why we could write the derivatives of a sequence this way is described in section 2.
4d9eecfc936a sad Chengsong parents: 468 diff changeset	128	The term (2) is used to control (1) since it we know that you can obtain an overall
4d9eecfc936a sad Chengsong parents: 468 diff changeset	129	smaller regex list
4d9eecfc936a sad Chengsong parents: 468 diff changeset	130	by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
4d9eecfc936a sad Chengsong parents: 468 diff changeset	131	Section 3 is dedicated to its proof.
4d9eecfc936a sad Chengsong parents: 468 diff changeset	132	In (3) we know that $\llbracket\ASEQ [] (\bderssimp r_1 s) r_2\rrbracket$ is
4d9eecfc936a sad Chengsong parents: 468 diff changeset	133	bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
4d9eecfc936a sad Chengsong parents: 468 diff changeset	134	than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
4d9eecfc936a sad Chengsong parents: 468 diff changeset	135	for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
4d9eecfc936a sad Chengsong parents: 468 diff changeset	136	We reason similarly for $\STAR$.\medskip
4d9eecfc936a sad Chengsong parents: 468 diff changeset	137
4d9eecfc936a sad Chengsong parents: 468 diff changeset	138	\noindent
4d9eecfc936a sad Chengsong parents: 468 diff changeset	139	Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
4d9eecfc936a sad Chengsong parents: 468 diff changeset	140	far from the actual bound we can expect. We can do better than this, but this does not improve
4d9eecfc936a sad Chengsong parents: 468 diff changeset	141	the finiteness property we are proving. If we are interested in a polynomial bound,
4d9eecfc936a sad Chengsong parents: 468 diff changeset	142	one would hope to obtain a similar tight bound as for partial
4d9eecfc936a sad Chengsong parents: 468 diff changeset	143	derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
4d9eecfc936a sad Chengsong parents: 468 diff changeset	144	$\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
4d9eecfc936a sad Chengsong parents: 468 diff changeset	145	partial derivatives). Unfortunately to obtain the exact same bound would mean
4d9eecfc936a sad Chengsong parents: 468 diff changeset	146	we need to introduce simplifications, such as
4d9eecfc936a sad Chengsong parents: 468 diff changeset	147	$(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
4d9eecfc936a sad Chengsong parents: 468 diff changeset	148	which exist for partial derivatives. However, if we introduce them in our
4d9eecfc936a sad Chengsong parents: 468 diff changeset	149	setting we would lose the POSIX property of our calculated values. We leave better
4d9eecfc936a sad Chengsong parents: 468 diff changeset	150	bounds for future work.\\[-6.5mm]
4d9eecfc936a sad Chengsong parents: 468 diff changeset	151
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	152
a0f27e21b42c all texrelated Chengsong parents: diff changeset	153	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	154	% SECTION 2
a0f27e21b42c all texrelated Chengsong parents: diff changeset	155	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	156
500 4d9eecfc936a sad Chengsong parents: 468 diff changeset	157	\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	158	There is a nice property about the compound regular expression
4d9eecfc936a sad Chengsong parents: 468 diff changeset	159	$r_1 \cdot r_2$ in general,
4d9eecfc936a sad Chengsong parents: 468 diff changeset	160	that the derivatives of it against a string $s$ can be described by
4d9eecfc936a sad Chengsong parents: 468 diff changeset	161	the derivatives w.r.t $r_1$ and $r_2$ over substrings of $s$:
4d9eecfc936a sad Chengsong parents: 468 diff changeset	162	\begin{lemma}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	163	$\textit{sflat}\_{aux} ( (r_1 \cdot r_2) \backslash s) = (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf s r_1))$
4d9eecfc936a sad Chengsong parents: 468 diff changeset	164	\end{lemma}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	165
4d9eecfc936a sad Chengsong parents: 468 diff changeset	166
4d9eecfc936a sad Chengsong parents: 468 diff changeset	167	%----------------------------------------------------------------------------------------
4d9eecfc936a sad Chengsong parents: 468 diff changeset	168	% SECTION 3
4d9eecfc936a sad Chengsong parents: 468 diff changeset	169	%----------------------------------------------------------------------------------------
4d9eecfc936a sad Chengsong parents: 468 diff changeset	170
4d9eecfc936a sad Chengsong parents: 468 diff changeset	171	\section{interaction between $\distinctWith$ and $\flts$}
4d9eecfc936a sad Chengsong parents: 468 diff changeset	172	Note that it is not immediately obvious that
4d9eecfc936a sad Chengsong parents: 468 diff changeset	173	$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	174
a0f27e21b42c all texrelated Chengsong parents: diff changeset	175	Sed ullamcorper quam eu nisl interdum at interdum enim egestas. Aliquam placerat justo sed lectus lobortis ut porta nisl porttitor. Vestibulum mi dolor, lacinia molestie gravida at, tempus vitae ligula. Donec eget quam sapien, in viverra eros. Donec pellentesque justo a massa fringilla non vestibulum metus vestibulum. Vestibulum in orci quis felis tempor lacinia. Vivamus ornare ultrices facilisis. Ut hendrerit volutpat vulputate. Morbi condimentum venenatis augue, id porta ipsum vulputate in. Curabitur luctus tempus justo. Vestibulum risus lectus, adipiscing nec condimentum quis, condimentum nec nisl. Aliquam dictum sagittis velit sed iaculis. Morbi tristique augue sit amet nulla pulvinar id facilisis ligula mollis. Nam elit libero, tincidunt ut aliquam at, molestie in quam. Aenean rhoncus vehicula hendrerit.

author	Chengsong
	Mon, 02 May 2022 00:23:39 +0100
changeset 500	4d9eecfc936a
parent 468	a0f27e21b42c
child 503	35b80e37dfe7
permissions	-rwxr-xr-x