lexing: ChengsongTanPhdThesis/Chapters/Cubic.tex@988e92a70704 (annotated)

532 cc54ce075db5 restructured Chengsong parents: diff changeset	1	% Chapter Template
cc54ce075db5 restructured Chengsong parents: diff changeset	2
cc54ce075db5 restructured Chengsong parents: diff changeset	3	\chapter{A Better Bound and Other Extensions} % Main chapter title
cc54ce075db5 restructured Chengsong parents: diff changeset	4
cc54ce075db5 restructured Chengsong parents: diff changeset	5	\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
cc54ce075db5 restructured Chengsong parents: diff changeset	6	%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
cc54ce075db5 restructured Chengsong parents: diff changeset	7	%algorithm to include constructs such as bounded repetitions and negations.
590 988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	8	\lstset{style=myScalastyle}
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	9
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	10
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	11	This chapter is a ``miscellaneous''
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	12	chapter which records various
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	13	extensions to our $\blexersimp$'s formalisations.\\
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	14	Firstly we present further improvements
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	15	made to our lexer algorithm $\blexersimp$.
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	16	We devise a stronger simplification algorithm,
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	17	called $\bsimpStrong$, which can prune away
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	18	similar components in two regular expressions at the same
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	19	alternative level,
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	20	even if these regular expressions are not exactly the same.
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	21	We call the lexer that uses this stronger simplification function
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	22	$\blexerStrong$.
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	23	We conjecture that both
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	24	\begin{center}
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	25	$\blexerStrong \;r \; s = \blexer\; r\;s$
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	26	\end{center}
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	27	and
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	28	\begin{center}
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	29	$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	30	\end{center}
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	31	hold, but formalising
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	32	them is still work in progress.
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	33	We give reasons why the correctness and cubic size bound proofs
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	34	can be achieved,
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	35	by exploring the connection between the internal
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	36	data structure of our $\blexerStrong$ and
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	37	Animirov's partial derivatives.\\
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	38	Secondly, we extend our $\blexersimp$
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	39	to support bounded repetitions ($r^{\{n\}}$).
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	40	We update our formalisation of
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	41	the correctness and finiteness properties to
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	42	include this new construct. With bounded repetitions
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	43	we are able to out-compete other verified lexers such as
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	44	Verbatim++ on regular expressions which involve a lot of
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	45	repetitions. We also present the idempotency property proof
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	46	of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	47	This reinforces our claim that the fixpoint construction
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	48	originally required by Sulzmann and Lu can be removed in $\blexersimp$.
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	49	\\
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	50	Last but not least, we present our efforts and challenges we met
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	51	in further improving the algorithm by data
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	52	structures such as zippers.
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	53
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	54
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	55
532 cc54ce075db5 restructured Chengsong parents: diff changeset	56	%----------------------------------------------------------------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	57	% SECTION strongsimp
cc54ce075db5 restructured Chengsong parents: diff changeset	58	%----------------------------------------------------------------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	59	\section{A Stronger Version of Simplification}
cc54ce075db5 restructured Chengsong parents: diff changeset	60	%TODO: search for isabelle proofs of algorithms that check equivalence
590 988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	61	In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	62	For example, the regular expression $aa \cdot a^+ a \cdot a^$ contains two terms,
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	63	which expresses two possibilities it will match future input.
533 6acbc939af6a more Chengsong parents: 532 diff changeset	64	It will either match 2 $a$'s, then 0 or more $a$'s, in other words, at least 2 more $a$'s
6acbc939af6a more Chengsong parents: 532 diff changeset	65	\begin{figure}\label{string_3parts1}
6acbc939af6a more Chengsong parents: 532 diff changeset	66	\begin{center}
6acbc939af6a more Chengsong parents: 532 diff changeset	67	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
6acbc939af6a more Chengsong parents: 532 diff changeset	68	\node [rectangle split, rectangle split horizontal, rectangle split parts=3]
6acbc939af6a more Chengsong parents: 532 diff changeset	69	{Consumed Input
6acbc939af6a more Chengsong parents: 532 diff changeset	70	\nodepart{two} Expects $aa$
6acbc939af6a more Chengsong parents: 532 diff changeset	71	\nodepart{three} Then expects $a^*$};
6acbc939af6a more Chengsong parents: 532 diff changeset	72
6acbc939af6a more Chengsong parents: 532 diff changeset	73	\end{tikzpicture}
6acbc939af6a more Chengsong parents: 532 diff changeset	74	\end{center}
6acbc939af6a more Chengsong parents: 532 diff changeset	75	\end{figure}
6acbc939af6a more Chengsong parents: 532 diff changeset	76	\noindent
6acbc939af6a more Chengsong parents: 532 diff changeset	77	Or it will match at least 1 more $a$:
6acbc939af6a more Chengsong parents: 532 diff changeset	78	\begin{figure}\label{string_3parts2}
6acbc939af6a more Chengsong parents: 532 diff changeset	79	\begin{center}
6acbc939af6a more Chengsong parents: 532 diff changeset	80	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
6acbc939af6a more Chengsong parents: 532 diff changeset	81	\node [rectangle split, rectangle split horizontal, rectangle split parts=3]
6acbc939af6a more Chengsong parents: 532 diff changeset	82	{Consumed
6acbc939af6a more Chengsong parents: 532 diff changeset	83	\nodepart{two} Expects $a$
6acbc939af6a more Chengsong parents: 532 diff changeset	84	\nodepart{three} Then expects $a^*$};
6acbc939af6a more Chengsong parents: 532 diff changeset	85
6acbc939af6a more Chengsong parents: 532 diff changeset	86	\end{tikzpicture}
6acbc939af6a more Chengsong parents: 532 diff changeset	87	\end{center}
6acbc939af6a more Chengsong parents: 532 diff changeset	88	\end{figure}
6acbc939af6a more Chengsong parents: 532 diff changeset	89	If these two possibilities are identical, we can eliminate
6acbc939af6a more Chengsong parents: 532 diff changeset	90	the second one as we know the second one corresponds to
6acbc939af6a more Chengsong parents: 532 diff changeset	91	a match that is not $\POSIX$.
6acbc939af6a more Chengsong parents: 532 diff changeset	92
6acbc939af6a more Chengsong parents: 532 diff changeset	93
6acbc939af6a more Chengsong parents: 532 diff changeset	94	If two identical regexes
6acbc939af6a more Chengsong parents: 532 diff changeset	95	happen to be grouped into different sequences, one cannot use
6acbc939af6a more Chengsong parents: 532 diff changeset	96	the $a + a \rightsquigarrow a$ rule to eliminate them, even if they
6acbc939af6a more Chengsong parents: 532 diff changeset	97	are "parallel" to each other:
6acbc939af6a more Chengsong parents: 532 diff changeset	98	\[
6acbc939af6a more Chengsong parents: 532 diff changeset	99	(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
6acbc939af6a more Chengsong parents: 532 diff changeset	100	\]
6acbc939af6a more Chengsong parents: 532 diff changeset	101	and that's because they are followed by possibly
6acbc939af6a more Chengsong parents: 532 diff changeset	102	different "suffixes" $r_1$ and $r_2$, and if
6acbc939af6a more Chengsong parents: 532 diff changeset	103	they do turn out to be different then doing
6acbc939af6a more Chengsong parents: 532 diff changeset	104	\[
6acbc939af6a more Chengsong parents: 532 diff changeset	105	(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
6acbc939af6a more Chengsong parents: 532 diff changeset	106	\]
6acbc939af6a more Chengsong parents: 532 diff changeset	107	might cause a possible match where the string is in $L(a \cdot r_2)$ to be lost.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	108
cc54ce075db5 restructured Chengsong parents: diff changeset	109	Here is an example for this.
cc54ce075db5 restructured Chengsong parents: diff changeset	110	Assume we have the derivative regex
533 6acbc939af6a more Chengsong parents: 532 diff changeset	111	\[( r_1 + r_2 + r_3)\cdot r+
6acbc939af6a more Chengsong parents: 532 diff changeset	112	( r_1 + r_5 + r_6)\cdot( r_1 + r_2 + r_3)^*
532 cc54ce075db5 restructured Chengsong parents: diff changeset	113	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	114
533 6acbc939af6a more Chengsong parents: 532 diff changeset	115	occurring in an intermediate step in our bit-coded lexing algorithm.
6acbc939af6a more Chengsong parents: 532 diff changeset	116
532 cc54ce075db5 restructured Chengsong parents: diff changeset	117	In this expression, there will be 6 "parallel" terms if we break up regexes
cc54ce075db5 restructured Chengsong parents: diff changeset	118	of shape $(a+b)\cdot c$ using the distributivity law (into $ac + bc$).
cc54ce075db5 restructured Chengsong parents: diff changeset	119	\begin{align}
cc54ce075db5 restructured Chengsong parents: diff changeset	120	(\nonumber \\
533 6acbc939af6a more Chengsong parents: 532 diff changeset	121	r_1 + & \label{term:1} \\
6acbc939af6a more Chengsong parents: 532 diff changeset	122	r_2 + & \label{term:2} \\
6acbc939af6a more Chengsong parents: 532 diff changeset	123	r_3 & \label{term:3} \\
6acbc939af6a more Chengsong parents: 532 diff changeset	124	& )\cdot r\; + \; (\nonumber \\
6acbc939af6a more Chengsong parents: 532 diff changeset	125	r_1 + & \label{term:4} \\
6acbc939af6a more Chengsong parents: 532 diff changeset	126	r_5 + & \label{term:5} \\
6acbc939af6a more Chengsong parents: 532 diff changeset	127	r_6 \label{term:6}\\
6acbc939af6a more Chengsong parents: 532 diff changeset	128	& )\cdot r\nonumber
532 cc54ce075db5 restructured Chengsong parents: diff changeset	129	\end{align}
cc54ce075db5 restructured Chengsong parents: diff changeset	130
cc54ce075db5 restructured Chengsong parents: diff changeset	131	They have been labelled, and each label denotes
cc54ce075db5 restructured Chengsong parents: diff changeset	132	one term, for example, \ref{term:1} denotes
cc54ce075db5 restructured Chengsong parents: diff changeset	133	\[
533 6acbc939af6a more Chengsong parents: 532 diff changeset	134	r_1 \cdot r
532 cc54ce075db5 restructured Chengsong parents: diff changeset	135	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	136	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	137	and \ref{term:3} denotes
cc54ce075db5 restructured Chengsong parents: diff changeset	138	\[
533 6acbc939af6a more Chengsong parents: 532 diff changeset	139	r_3\cdot r.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	140	\]
533 6acbc939af6a more Chengsong parents: 532 diff changeset	141	From a lexer's point of view, \ref{term:4} will never
6acbc939af6a more Chengsong parents: 532 diff changeset	142	be picked up in later phases of matching because there
6acbc939af6a more Chengsong parents: 532 diff changeset	143	is a term \ref{term:1} giving identical matching information.
6acbc939af6a more Chengsong parents: 532 diff changeset	144	The first term \ref{term:1} will match a string in $L(r_1 \cdot r)$,
6acbc939af6a more Chengsong parents: 532 diff changeset	145	and so on for term \ref{term:2}, \ref{term:3}, etc.
6acbc939af6a more Chengsong parents: 532 diff changeset	146
532 cc54ce075db5 restructured Chengsong parents: diff changeset	147	\mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$}
533 6acbc939af6a more Chengsong parents: 532 diff changeset	148	\begin{center}\label{string_2parts}
6acbc939af6a more Chengsong parents: 532 diff changeset	149
6acbc939af6a more Chengsong parents: 532 diff changeset	150	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
6acbc939af6a more Chengsong parents: 532 diff changeset	151	\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
6acbc939af6a more Chengsong parents: 532 diff changeset	152	{$r_{x1}$
6acbc939af6a more Chengsong parents: 532 diff changeset	153	\nodepart{two} $r_1\cdot r$ };
6acbc939af6a more Chengsong parents: 532 diff changeset	154	%\caption{term 1 \ref{term:1}'s matching configuration}
6acbc939af6a more Chengsong parents: 532 diff changeset	155	\end{tikzpicture}
6acbc939af6a more Chengsong parents: 532 diff changeset	156
6acbc939af6a more Chengsong parents: 532 diff changeset	157	\end{center}
6acbc939af6a more Chengsong parents: 532 diff changeset	158	For term 1 \ref{term:1}, whatever was before the current
6acbc939af6a more Chengsong parents: 532 diff changeset	159	input position was fully matched and the regex corresponding
6acbc939af6a more Chengsong parents: 532 diff changeset	160	to it reduced to $\ONE$,
6acbc939af6a more Chengsong parents: 532 diff changeset	161	and in the next input token, it will start on $r_1\cdot r$.
6acbc939af6a more Chengsong parents: 532 diff changeset	162	The resulting value will be something of a similar configuration:
6acbc939af6a more Chengsong parents: 532 diff changeset	163	\begin{center}\label{value_2parts}
6acbc939af6a more Chengsong parents: 532 diff changeset	164	%\caption{term 1 \ref{term:1}'s matching configuration}
6acbc939af6a more Chengsong parents: 532 diff changeset	165	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
6acbc939af6a more Chengsong parents: 532 diff changeset	166	\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
6acbc939af6a more Chengsong parents: 532 diff changeset	167	{$v_{x1}$
6acbc939af6a more Chengsong parents: 532 diff changeset	168	\nodepart{two} $v_{r_1\cdot r}$ };
6acbc939af6a more Chengsong parents: 532 diff changeset	169	\end{tikzpicture}
6acbc939af6a more Chengsong parents: 532 diff changeset	170	\end{center}
6acbc939af6a more Chengsong parents: 532 diff changeset	171	For term 2 \ref{term:2} we have a similar value partition:
6acbc939af6a more Chengsong parents: 532 diff changeset	172	\begin{center}\label{value_2parts2}
6acbc939af6a more Chengsong parents: 532 diff changeset	173	%\caption{term 1 \ref{term:1}'s matching configuration}
6acbc939af6a more Chengsong parents: 532 diff changeset	174	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
6acbc939af6a more Chengsong parents: 532 diff changeset	175	\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
6acbc939af6a more Chengsong parents: 532 diff changeset	176	{$v_{x2}$
6acbc939af6a more Chengsong parents: 532 diff changeset	177	\nodepart{two} $v_{r_2\cdot r}$ };
6acbc939af6a more Chengsong parents: 532 diff changeset	178	\end{tikzpicture}
6acbc939af6a more Chengsong parents: 532 diff changeset	179	\end{center}
6acbc939af6a more Chengsong parents: 532 diff changeset	180	\noindent
6acbc939af6a more Chengsong parents: 532 diff changeset	181	and so on.
6acbc939af6a more Chengsong parents: 532 diff changeset	182	We note that for term 4 \ref{term:4} its result value
6acbc939af6a more Chengsong parents: 532 diff changeset	183	after any position beyond the current one will always
6acbc939af6a more Chengsong parents: 532 diff changeset	184	be the same:
6acbc939af6a more Chengsong parents: 532 diff changeset	185	\begin{center}\label{value_2parts4}
6acbc939af6a more Chengsong parents: 532 diff changeset	186	%\caption{term 1 \ref{term:1}'s matching configuration}
6acbc939af6a more Chengsong parents: 532 diff changeset	187	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
6acbc939af6a more Chengsong parents: 532 diff changeset	188	\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
6acbc939af6a more Chengsong parents: 532 diff changeset	189	{$v_{x4}$
6acbc939af6a more Chengsong parents: 532 diff changeset	190	\nodepart{two} $v_{r_1\cdot r}$ };
6acbc939af6a more Chengsong parents: 532 diff changeset	191	\end{tikzpicture}
6acbc939af6a more Chengsong parents: 532 diff changeset	192	\end{center}
6acbc939af6a more Chengsong parents: 532 diff changeset	193	And $\POSIX$ rules says that we can eliminate at least one of them.
6acbc939af6a more Chengsong parents: 532 diff changeset	194	Our algorithm always puts the regex with the longest initial sub-match at the left of the
6acbc939af6a more Chengsong parents: 532 diff changeset	195	alternatives, so we safely throw away \ref{term:4}.
6acbc939af6a more Chengsong parents: 532 diff changeset	196	The fact that term 1 and 4 are not immediately in the same alternative
6acbc939af6a more Chengsong parents: 532 diff changeset	197	expression does not prevent them from being two duplicate matches at
6acbc939af6a more Chengsong parents: 532 diff changeset	198	the current point of view.
6acbc939af6a more Chengsong parents: 532 diff changeset	199	To implement this idea into an algorithm, we define a "pruning function"
6acbc939af6a more Chengsong parents: 532 diff changeset	200	$\textit{prune}$, that takes away parts of all terms in $r$ that belongs to
6acbc939af6a more Chengsong parents: 532 diff changeset	201	$\textit{acc}$, which acts as an element
6acbc939af6a more Chengsong parents: 532 diff changeset	202	is a stronger version of $\textit{distinctBy}$.
6acbc939af6a more Chengsong parents: 532 diff changeset	203	Here is a concise version of it (in the style of Scala):
590 988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	204	\begin{figure}[H]
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	205	\begin{lstlisting}
533 6acbc939af6a more Chengsong parents: 532 diff changeset	206	def distinctByPlus(rs: List[ARexp], acc: Set[Rexp] = Set()) :
6acbc939af6a more Chengsong parents: 532 diff changeset	207	List[ARexp] = rs match {
6acbc939af6a more Chengsong parents: 532 diff changeset	208	case Nil => Nil
6acbc939af6a more Chengsong parents: 532 diff changeset	209	case r :: rs =>
6acbc939af6a more Chengsong parents: 532 diff changeset	210	if(acc.contains(erase(r)))
6acbc939af6a more Chengsong parents: 532 diff changeset	211	distinctByPlus(rs, acc)
6acbc939af6a more Chengsong parents: 532 diff changeset	212	else
6acbc939af6a more Chengsong parents: 532 diff changeset	213	prune(r, acc) :: distinctByPlus(rs, prune(r, acc) +: acc)
6acbc939af6a more Chengsong parents: 532 diff changeset	214	}
6acbc939af6a more Chengsong parents: 532 diff changeset	215
590 988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	216	\end{lstlisting}
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	217	\caption{the distinctByPlus function}
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	218	\end{figure}
533 6acbc939af6a more Chengsong parents: 532 diff changeset	219	But for the function $\textit{prune}$ there is a difficulty:
6acbc939af6a more Chengsong parents: 532 diff changeset	220	how could one define the $L$ function in a "computable" way,
6acbc939af6a more Chengsong parents: 532 diff changeset	221	so that they generate a (lazy infinite) set of strings in $L(r)$.
6acbc939af6a more Chengsong parents: 532 diff changeset	222	We also need a function that tests whether $L(r_1) \subseteq L(r_2)$
6acbc939af6a more Chengsong parents: 532 diff changeset	223	is true.
6acbc939af6a more Chengsong parents: 532 diff changeset	224	For the moment we cut corners and do not define these two functions
6acbc939af6a more Chengsong parents: 532 diff changeset	225	as they are not used by Antimirov (and will probably not contribute
6acbc939af6a more Chengsong parents: 532 diff changeset	226	to a bound better than Antimirov's cubic bound).
6acbc939af6a more Chengsong parents: 532 diff changeset	227	Rather, we do not try to eliminate in every instance of regular expressions
6acbc939af6a more Chengsong parents: 532 diff changeset	228	that have "language duplicates", but only eliminate the "exact duplicates".
6acbc939af6a more Chengsong parents: 532 diff changeset	229	For this we need to break up terms $(a+b)\cdot c$ into two
6acbc939af6a more Chengsong parents: 532 diff changeset	230	terms $a\cdot c$ and $b\cdot c$ before we add them to the accumulator:
590 988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	231	\begin{figure}[H]
988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	232	\begin{lstlisting}
533 6acbc939af6a more Chengsong parents: 532 diff changeset	233	def distinctWith(rs: List[ARexp],
6acbc939af6a more Chengsong parents: 532 diff changeset	234	pruneFunction: (ARexp, Set[Rexp]) => ARexp,
6acbc939af6a more Chengsong parents: 532 diff changeset	235	acc: Set[Rexp] = Set()) : List[ARexp] =
6acbc939af6a more Chengsong parents: 532 diff changeset	236	rs match{
6acbc939af6a more Chengsong parents: 532 diff changeset	237	case Nil => Nil
6acbc939af6a more Chengsong parents: 532 diff changeset	238	case r :: rs =>
6acbc939af6a more Chengsong parents: 532 diff changeset	239	if(acc(erase(r)))
6acbc939af6a more Chengsong parents: 532 diff changeset	240	distinctWith(rs, pruneFunction, acc)
6acbc939af6a more Chengsong parents: 532 diff changeset	241	else {
6acbc939af6a more Chengsong parents: 532 diff changeset	242	val pruned_r = pruneFunction(r, acc)
6acbc939af6a more Chengsong parents: 532 diff changeset	243	pruned_r ::
6acbc939af6a more Chengsong parents: 532 diff changeset	244	distinctWith(rs,
6acbc939af6a more Chengsong parents: 532 diff changeset	245	pruneFunction,
6acbc939af6a more Chengsong parents: 532 diff changeset	246	turnIntoTerms(erase(pruned_r)) ++: acc
6acbc939af6a more Chengsong parents: 532 diff changeset	247	)
6acbc939af6a more Chengsong parents: 532 diff changeset	248	}
6acbc939af6a more Chengsong parents: 532 diff changeset	249	}
590 988e92a70704 more chap5 and chap6 bsimp_idem Chengsong parents: 538 diff changeset	250	\end{lstlisting}
535 ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	251	\caption{A Stronger Version of $\textit{distinctBy}$}
ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	252	\end{figure}
533 6acbc939af6a more Chengsong parents: 532 diff changeset	253	\noindent
6acbc939af6a more Chengsong parents: 532 diff changeset	254	This process is done by the function $\textit{turnIntoTerms}$:
535 ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	255	\begin{figure}
533 6acbc939af6a more Chengsong parents: 532 diff changeset	256	\begin{verbatim}
6acbc939af6a more Chengsong parents: 532 diff changeset	257	def turnIntoTerms(r: Rexp): List[Rexp] = r match {
6acbc939af6a more Chengsong parents: 532 diff changeset	258	case SEQ(r1, r2) => if(isOne(r1))
6acbc939af6a more Chengsong parents: 532 diff changeset	259	turnIntoTerms(r2)
6acbc939af6a more Chengsong parents: 532 diff changeset	260	else
6acbc939af6a more Chengsong parents: 532 diff changeset	261	turnIntoTerms(r1).map(r11 => SEQ(r11, r2))
6acbc939af6a more Chengsong parents: 532 diff changeset	262	case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
6acbc939af6a more Chengsong parents: 532 diff changeset	263	case ZERO => Nil
6acbc939af6a more Chengsong parents: 532 diff changeset	264	case _ => r :: Nil
6acbc939af6a more Chengsong parents: 532 diff changeset	265	}
6acbc939af6a more Chengsong parents: 532 diff changeset	266	\end{verbatim}
535 ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	267	\caption{function that breaks up regular expression into "atomic" terms}
ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	268	\end{figure}
ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	269
533 6acbc939af6a more Chengsong parents: 532 diff changeset	270	\noindent
6acbc939af6a more Chengsong parents: 532 diff changeset	271	The pruning function can be defined recursively:
535 ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	272	\begin{figure}
533 6acbc939af6a more Chengsong parents: 532 diff changeset	273	\begin{verbatim}
6acbc939af6a more Chengsong parents: 532 diff changeset	274	def prune7(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
6acbc939af6a more Chengsong parents: 532 diff changeset	275	case AALTS(bs, rs) => rs.map(r => prune7(r, acc)).filter(_ != ZERO) match
6acbc939af6a more Chengsong parents: 532 diff changeset	276	{
6acbc939af6a more Chengsong parents: 532 diff changeset	277	case Nil => AZERO
6acbc939af6a more Chengsong parents: 532 diff changeset	278	case r::Nil => fuse(bs, r)
6acbc939af6a more Chengsong parents: 532 diff changeset	279	case rs1 => AALTS(bs, rs1)
6acbc939af6a more Chengsong parents: 532 diff changeset	280	}
6acbc939af6a more Chengsong parents: 532 diff changeset	281	case ASEQ(bs, r1, r2) => prune7(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
6acbc939af6a more Chengsong parents: 532 diff changeset	282	case AZERO => AZERO
6acbc939af6a more Chengsong parents: 532 diff changeset	283	case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
6acbc939af6a more Chengsong parents: 532 diff changeset	284	case r1p => ASEQ(bs, r1p, r2)
6acbc939af6a more Chengsong parents: 532 diff changeset	285	}
6acbc939af6a more Chengsong parents: 532 diff changeset	286	case r => if(acc(erase(r))) AZERO else r
6acbc939af6a more Chengsong parents: 532 diff changeset	287	}
6acbc939af6a more Chengsong parents: 532 diff changeset	288	\end{verbatim}
535 ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	289	\caption{pruning function}
ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	290	\end{figure}
ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	291
ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	292
533 6acbc939af6a more Chengsong parents: 532 diff changeset	293
6acbc939af6a more Chengsong parents: 532 diff changeset	294	\begin{figure}
6acbc939af6a more Chengsong parents: 532 diff changeset	295	\centering
6acbc939af6a more Chengsong parents: 532 diff changeset	296	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
6acbc939af6a more Chengsong parents: 532 diff changeset	297	\begin{tikzpicture}
6acbc939af6a more Chengsong parents: 532 diff changeset	298	\begin{axis}[
535 ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	299	%xlabel={$n$},
ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	300	myplotstyle,
ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	301	xlabel={input length},
533 6acbc939af6a more Chengsong parents: 532 diff changeset	302	ylabel={size},
535 ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	303	]
533 6acbc939af6a more Chengsong parents: 532 diff changeset	304	\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
6acbc939af6a more Chengsong parents: 532 diff changeset	305	\end{axis}
6acbc939af6a more Chengsong parents: 532 diff changeset	306	\end{tikzpicture}
6acbc939af6a more Chengsong parents: 532 diff changeset	307	&
6acbc939af6a more Chengsong parents: 532 diff changeset	308	\begin{tikzpicture}
6acbc939af6a more Chengsong parents: 532 diff changeset	309	\begin{axis}[
535 ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	310	%xlabel={$n$},
ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	311	myplotstyle,
ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	312	xlabel={input length},
533 6acbc939af6a more Chengsong parents: 532 diff changeset	313	ylabel={size},
535 ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	314	]
533 6acbc939af6a more Chengsong parents: 532 diff changeset	315	\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
6acbc939af6a more Chengsong parents: 532 diff changeset	316	\end{axis}
6acbc939af6a more Chengsong parents: 532 diff changeset	317	\end{tikzpicture}\\
535 ce91c29d2885 fixed plotting error 6.1 Chengsong parents: 533 diff changeset	318	\multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^)^$ with strings
533 6acbc939af6a more Chengsong parents: 532 diff changeset	319	of the form $\underbrace{aa..a}_{n}$.}
6acbc939af6a more Chengsong parents: 532 diff changeset	320	\end{tabular}
6acbc939af6a more Chengsong parents: 532 diff changeset	321	\caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
6acbc939af6a more Chengsong parents: 532 diff changeset	322	\end{figure}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	323
cc54ce075db5 restructured Chengsong parents: diff changeset	324
cc54ce075db5 restructured Chengsong parents: diff changeset	325
cc54ce075db5 restructured Chengsong parents: diff changeset	326	%----------------------------------------------------------------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	327	% SECTION 1
cc54ce075db5 restructured Chengsong parents: diff changeset	328	%----------------------------------------------------------------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	329
cc54ce075db5 restructured Chengsong parents: diff changeset	330	\section{Adding Support for the Negation Construct, and its Correctness Proof}
cc54ce075db5 restructured Chengsong parents: diff changeset	331	We now add support for the negation regular expression:
cc54ce075db5 restructured Chengsong parents: diff changeset	332	\[ r ::= \ZERO \mid \ONE
cc54ce075db5 restructured Chengsong parents: diff changeset	333	\mid c
cc54ce075db5 restructured Chengsong parents: diff changeset	334	\mid r_1 \cdot r_2
cc54ce075db5 restructured Chengsong parents: diff changeset	335	\mid r_1 + r_2
cc54ce075db5 restructured Chengsong parents: diff changeset	336	\mid r^*
cc54ce075db5 restructured Chengsong parents: diff changeset	337	\mid \sim r
cc54ce075db5 restructured Chengsong parents: diff changeset	338	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	339	The $\textit{nullable}$ function's clause for it would be
cc54ce075db5 restructured Chengsong parents: diff changeset	340	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	341	\textit{nullable}(~r) = \neg \nullable(r)
cc54ce075db5 restructured Chengsong parents: diff changeset	342	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	343	The derivative would be
cc54ce075db5 restructured Chengsong parents: diff changeset	344	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	345	~r \backslash c = ~ (r \backslash c)
cc54ce075db5 restructured Chengsong parents: diff changeset	346	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	347
cc54ce075db5 restructured Chengsong parents: diff changeset	348	The most tricky part of lexing for the $~r$ regex
cc54ce075db5 restructured Chengsong parents: diff changeset	349	is creating a value for it.
cc54ce075db5 restructured Chengsong parents: diff changeset	350	For other regular expressions, the value aligns with the
cc54ce075db5 restructured Chengsong parents: diff changeset	351	structure of the regex:
cc54ce075db5 restructured Chengsong parents: diff changeset	352	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	353	\vdash \Seq(\Char(a), \Char(b)) : a \cdot b
cc54ce075db5 restructured Chengsong parents: diff changeset	354	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	355	But for the $~r$ regex, $s$ is a member of it if and only if
cc54ce075db5 restructured Chengsong parents: diff changeset	356	$s$ does not belong to $L(r)$.
cc54ce075db5 restructured Chengsong parents: diff changeset	357	That means when there
cc54ce075db5 restructured Chengsong parents: diff changeset	358	is a match for the not regex, it is not possible to generate how the string $s$ matched
cc54ce075db5 restructured Chengsong parents: diff changeset	359	with $r$.
cc54ce075db5 restructured Chengsong parents: diff changeset	360	What we can do is preserve the information of how $s$ was not matched by $r$,
cc54ce075db5 restructured Chengsong parents: diff changeset	361	and there are a number of options to do this.
cc54ce075db5 restructured Chengsong parents: diff changeset	362
cc54ce075db5 restructured Chengsong parents: diff changeset	363	We could give a partial value when there is a partial match for the regex inside
cc54ce075db5 restructured Chengsong parents: diff changeset	364	the $\mathbf{not}$ construct.
cc54ce075db5 restructured Chengsong parents: diff changeset	365	For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
cc54ce075db5 restructured Chengsong parents: diff changeset	366	A value for it could be
cc54ce075db5 restructured Chengsong parents: diff changeset	367	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	368	\vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
cc54ce075db5 restructured Chengsong parents: diff changeset	369	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	370	The above example demonstrates what value to construct
cc54ce075db5 restructured Chengsong parents: diff changeset	371	when the string $s$ is at most a real prefix
cc54ce075db5 restructured Chengsong parents: diff changeset	372	of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
cc54ce075db5 restructured Chengsong parents: diff changeset	373	in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
cc54ce075db5 restructured Chengsong parents: diff changeset	374	constructor.
cc54ce075db5 restructured Chengsong parents: diff changeset	375
cc54ce075db5 restructured Chengsong parents: diff changeset	376	Another option would be to either store the string $s$ that resulted in
cc54ce075db5 restructured Chengsong parents: diff changeset	377	a mis-match for $r$ or a dummy value as a placeholder:
cc54ce075db5 restructured Chengsong parents: diff changeset	378	\[
533 6acbc939af6a more Chengsong parents: 532 diff changeset	379	\vdash \textit{Not}(abcd) : ~( r_1 )
532 cc54ce075db5 restructured Chengsong parents: diff changeset	380	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	381	or
cc54ce075db5 restructured Chengsong parents: diff changeset	382	\[
533 6acbc939af6a more Chengsong parents: 532 diff changeset	383	\vdash \textit{Not}(\textit{Dummy}) : ~( r_1 )
532 cc54ce075db5 restructured Chengsong parents: diff changeset	384	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	385	We choose to implement this as it is most straightforward:
cc54ce075db5 restructured Chengsong parents: diff changeset	386	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	387	\mkeps(~(r)) = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
cc54ce075db5 restructured Chengsong parents: diff changeset	388	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	389
cc54ce075db5 restructured Chengsong parents: diff changeset	390	%----------------------------------------------------------------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	391	% SECTION 2
cc54ce075db5 restructured Chengsong parents: diff changeset	392	%----------------------------------------------------------------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	393
cc54ce075db5 restructured Chengsong parents: diff changeset	394	\section{Bounded Repetitions}
cc54ce075db5 restructured Chengsong parents: diff changeset	395
cc54ce075db5 restructured Chengsong parents: diff changeset	396

author	Chengsong
	Wed, 31 Aug 2022 23:57:42 +0100
changeset 590	988e92a70704
parent 538	8016a2480704
child 591	b2d0de6aee18
permissions	-rwxr-xr-x