% Chapter Template

\chapter{A Better Bound and Other Extensions} % Main chapter title

\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
%algorithm to include constructs such as bounded repetitions and negations.

%----------------------------------------------------------------------------------------
% SECTION strongsimp
%----------------------------------------------------------------------------------------

\section{A Stronger Version of Simplification}
%TODO: search for isabelle proofs of algorithms that check equivalence
In our bit-coded lexing algorithm, with or without simplification,
two distinct alternative sub-matches for a (sub-)string and (sub-)regex pair
are always expressed in the ``derivative regular expression'' as two
disjoint alternative terms at the current (sub-)regex level. The fact that they
are parallel tells us that when we retrieve the information from this (sub-)regex
there will always be a choice of which alternative term to take.
As an example, the regular expression $aa \cdot a^* + a \cdot a^*$ (omitting bit-codes)
expresses two possibilities for how it will match future input.
It will either match 2 $a$'s, then 0 or more $a$'s, in other words, at least 2 more $a$'s:
\begin{figure}\label{string_3parts1}
\begin{center}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
	\node [rectangle split, rectangle split horizontal, rectangle split parts=3]
	{Consumed Input
	\nodepart{two} Expects $aa$
	\nodepart{three} Then expects $a^*$};
\end{tikzpicture}
\end{center}
\end{figure}
\noindent
Or it will match at least 1 more $a$:
\begin{figure}\label{string_3parts2}
\begin{center}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
	\node [rectangle split, rectangle split horizontal, rectangle split parts=3]
	{Consumed
	\nodepart{two} Expects $a$
	\nodepart{three} Then expects $a^*$};
\end{tikzpicture}
\end{center}
\end{figure}
If these two possibilities are identical, we can eliminate
the second one, as we know it corresponds to
a match that is not $\POSIX$.

If two identical regexes
happen to be grouped into different sequences, one cannot use
the $a + a \rightsquigarrow a$ rule to eliminate them, even if they
are ``parallel'' to each other. For example, the transformation
\[
	(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
\]
is unsound, because the two occurrences of $a$ are followed by possibly
different ``suffixes'' $r_1$ and $r_2$; if these do turn out to be different,
the transformation might cause a possible match where the string is in $L(a \cdot r_2)$ to be lost.
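The lost match can be demonstrated concretely with a minimal Brzozowski-style matcher. The sketch below is ours, and the instantiations $r_1 = d$ and $r_2 = e$ are chosen only for illustration; the string $ae \in L(a \cdot r_2)$ is accepted before the rewrite but rejected after it.

```scala
// Minimal regular expressions and Brzozowski derivatives, just
// enough to test the (unsound) rewrite on concrete strings.
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO | ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
}

def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))

val a = CHAR('a'); val b = CHAR('b'); val c = CHAR('c')
val r1 = CHAR('d'); val r2 = CHAR('e')   // illustrative suffixes
// (a+b).r1 + (a+c).r2   versus the result of the unsound rewrite
val before = ALT(SEQ(ALT(a, b), r1), SEQ(ALT(a, c), r2))
val after  = ALT(SEQ(ALT(a, b), r1), SEQ(c, r2))
// matches(before, "ae") == true, matches(after, "ae") == false
```

Both regexes still agree on strings such as $bd$; only the match that goes through the duplicated head $a$ into the suffix $r_2$ disappears.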

Here is an example of this.
Assume we have the derivative regex
\[
	( r_1 + r_2 + r_3)\cdot r \;+\;
	( r_1 + r_5 + r_6)\cdot( r_1 + r_2 + r_3)^*
\]
occurring in an intermediate step of our bit-coded lexing algorithm.
In this expression, there will be 6 ``parallel'' terms if we break up regexes
of shape $(a+b)\cdot c$ using the distributivity law (into $a\cdot c + b\cdot c$).
\begin{align}
	(\nonumber \\
	r_1 + & \label{term:1} \\
	r_2 + & \label{term:2} \\
	r_3 & \label{term:3} \\
	& )\cdot r\; + \; (\nonumber \\
	r_1 + & \label{term:4} \\
	r_5 + & \label{term:5} \\
	r_6 & \label{term:6}\\
	& )\cdot r\nonumber
\end{align}
They have been labelled, and each label denotes
one term; for example, \ref{term:1} denotes
\[
	r_1 \cdot r
\]
\noindent
and \ref{term:3} denotes
\[
	r_3\cdot r.
\]
From a lexer's point of view, \ref{term:4} will never
be picked up in later phases of matching because
term \ref{term:1} gives identical matching information.
The first term \ref{term:1} will match a string in $L(r_1 \cdot r)$,
and so on for terms \ref{term:2}, \ref{term:3}, etc.
\mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$}
\begin{center}\label{string_2parts}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
	\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
	{$r_{x1}$
	\nodepart{two} $r_1\cdot r$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\end{center}
For term \ref{term:1}, whatever came before the current
input position has been fully matched and the regex corresponding
to it has reduced to $\ONE$;
on the next input token it will start on $r_1\cdot r$.
The resulting value will have a similar configuration:
\begin{center}\label{value_2parts}
%\caption{term 1 \ref{term:1}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
	\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
	{$v_{x1}$
	\nodepart{two} $v_{r_1\cdot r}$ };
\end{tikzpicture}
\end{center}
For term \ref{term:2} we have a similar value partition:
\begin{center}\label{value_2parts2}
%\caption{term 1 \ref{term:1}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
	\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
	{$v_{x2}$
	\nodepart{two} $v_{r_2\cdot r}$ };
\end{tikzpicture}
\end{center}
\noindent
and so on.
We note that for term \ref{term:4} the resulting value
after any position beyond the current one will always
be the same:
\begin{center}\label{value_2parts4}
%\caption{term 1 \ref{term:1}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
	\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
	{$v_{x4}$
	\nodepart{two} $v_{r_1\cdot r}$ };
\end{tikzpicture}
\end{center}
The $\POSIX$ rules say that we can eliminate at least one of them.
Our algorithm always puts the regex with the longest initial sub-match at the left of the
alternatives, so we can safely throw away \ref{term:4}.
The fact that terms \ref{term:1} and \ref{term:4} are not immediately in the same alternative
expression does not prevent them from being duplicate matches from
the current point of view.
To implement this idea as an algorithm, we define a ``pruning function''
$\textit{prune}$ that takes away the parts of each term in $r$ that already belong to
$\textit{acc}$, giving a stronger version of $\textit{distinctBy}$.
Here is a concise version of it (in the style of Scala):
\begin{verbatim}
def distinctByPlus(rs: List[ARexp], acc: Set[Rexp] = Set()) :
List[ARexp] = rs match {
  case Nil => Nil
  case r :: rs =>
    if(acc.contains(erase(r)))
      distinctByPlus(rs, acc)
    else {
      val pruned_r = prune(r, acc)
      pruned_r :: distinctByPlus(rs, acc + erase(pruned_r))
    }
}
\end{verbatim}
But for the function $\textit{prune}$ there is a difficulty:
how could one define the $L$ function in a ``computable'' way,
so that it generates a (lazily evaluated, potentially infinite) set of strings in $L(r)$?
We would also need a function that tests whether $L(r_1) \subseteq L(r_2)$
holds.
For the moment we cut corners and do not define these two functions,
as they are not used by Antimirov (and will probably not contribute
to a bound better than Antimirov's cubic bound).
Rather than eliminating every instance of regular expressions
that have ``language duplicates'', we only eliminate the ``exact duplicates''.
For this we need to break up terms $(a+b)\cdot c$ into the two
terms $a\cdot c$ and $b\cdot c$ before we add them to the accumulator:
\begin{verbatim}
def distinctWith(rs: List[ARexp],
                 pruneFunction: (ARexp, Set[Rexp]) => ARexp,
                 acc: Set[Rexp] = Set()) : List[ARexp] =
  rs match {
    case Nil => Nil
    case r :: rs =>
      if(acc(erase(r)))
        distinctWith(rs, pruneFunction, acc)
      else {
        val pruned_r = pruneFunction(r, acc)
        pruned_r ::
          distinctWith(rs,
            pruneFunction,
            acc ++ turnIntoTerms(erase(pruned_r)))
      }
  }
\end{verbatim}
\noindent
This breaking-up process is done by the function $\textit{turnIntoTerms}$:
\begin{verbatim}
def turnIntoTerms(r: Rexp): List[Rexp] = r match {
  case SEQ(r1, r2) =>
    if(isOne(r1)) turnIntoTerms(r2)
    else turnIntoTerms(r1).map(r11 => SEQ(r11, r2))
  case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
  case ZERO => Nil
  case _ => r :: Nil
}
\end{verbatim}
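As a self-contained illustration, the following sketch runs $\textit{turnIntoTerms}$ over a plain (bit-code-free) regex datatype; the datatype and the syntactic $\textit{isOne}$ check are minimal stand-ins for the ones used by the lexer:

```scala
// A plain regex datatype, standing in for the lexer's erased regexes.
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// syntactic check for ONE, a stand-in for the lexer's isOne
def isOne(r: Rexp): Boolean = r == ONE

// break a regex into its list of "parallel" terms
def turnIntoTerms(r: Rexp): List[Rexp] = r match {
  case SEQ(r1, r2) =>
    if (isOne(r1)) turnIntoTerms(r2)
    else turnIntoTerms(r1).map(r11 => SEQ(r11, r2))
  case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
  case ZERO => Nil
  case _ => r :: Nil
}

// (a+b).c is broken up into the two terms a.c and b.c
val terms = turnIntoTerms(SEQ(ALTS(CHAR('a'), CHAR('b')), CHAR('c')))
// terms == List(SEQ(CHAR('a'), CHAR('c')), SEQ(CHAR('b'), CHAR('c')))
```

The two resulting terms can then be inserted into the accumulator individually, so that later duplicates of either one are recognised.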
\noindent
The pruning function can be defined recursively:
\begin{verbatim}
def prune7(r: ARexp, acc: Set[Rexp]) : ARexp = r match {
  case AALTS(bs, rs) =>
    rs.map(r => prune7(r, acc)).filter(_ != AZERO) match {
      case Nil => AZERO
      case r :: Nil => fuse(bs, r)
      case rs1 => AALTS(bs, rs1)
    }
  case ASEQ(bs, r1, r2) =>
    prune7(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
      case AZERO => AZERO
      case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
      case r1p => ASEQ(bs, r1p, r2)
    }
  case r => if(acc(erase(r))) AZERO else r
}
\end{verbatim}
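Since $\textit{removeSeqTail}$ is not defined above, here is a bit-code-free sketch of the pruning idea under one possible reading of that helper: a term $r' \cdot t$ in the accumulator constrains the head $r'$ once the tail $t$ of the sequence being pruned is fixed. Both the toy datatype and this reading are our assumptions, not the thesis's definitions:

```scala
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// hypothetical reading of removeSeqTail: a term r' . t in the
// accumulator constrains the head r' once the tail t is fixed
def removeSeqTail(r: Rexp, tail: Rexp): Rexp = r match {
  case SEQ(r1, t) if t == tail => r1
  case _ => ZERO
}

// prune away sub-terms of r already covered by the accumulator
def prune(r: Rexp, acc: Set[Rexp]): Rexp = r match {
  case ALTS(rs) =>
    rs.map(prune(_, acc)).filter(_ != ZERO) match {
      case Nil => ZERO
      case r1 :: Nil => r1
      case rs1 => ALTS(rs1)
    }
  case SEQ(r1, r2) =>
    prune(r1, acc.map(removeSeqTail(_, r2))) match {
      case ZERO => ZERO
      case ONE => r2
      case r1p => SEQ(r1p, r2)
    }
  case r1 => if (acc(r1)) ZERO else r1
}

// pruning (x+y).r against an accumulator already holding x.r
// removes the duplicate head x and keeps y.r
val r = CHAR('r'); val x = CHAR('x'); val y = CHAR('y')
val pruned = prune(SEQ(ALTS(List(x, y)), r), Set(SEQ(x, r)))
// pruned == SEQ(CHAR('y'), CHAR('r'))
```

This mirrors the $\textit{ASEQ}$ clause of $\textit{prune7}$: the accumulator is rebased onto the heads via $\textit{removeSeqTail}$, the head is pruned, and the sequence is rebuilt unless the head collapsed to $\ZERO$ or $\ONE$.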

\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={size},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=30,
    ymax=800,
    ytick={0,200,...,800},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={$bsimpStrong$ size growth},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={size},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=30,
    ymax=42000,
    ytick={0,10000,...,40000},
    scaled ticks=true,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={$bsimp$ size growth},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs: Sizes of the derivatives when matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings
of the form $\underbrace{aa\ldots a}_{n}$.}
\end{tabular}
\caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
\end{figure}

%----------------------------------------------------------------------------------------
% SECTION 1
%----------------------------------------------------------------------------------------

\section{Adding Support for the Negation Construct, and its Correctness Proof}
We now add support for the negation regular expression:
\[ r ::= \ZERO \mid \ONE
	 \mid c
	 \mid r_1 \cdot r_2
	 \mid r_1 + r_2
	 \mid r^*
	 \mid \sim r
\]
The $\textit{nullable}$ function's clause for it is
\[
	\textit{nullable}(\sim r) = \neg \textit{nullable}(r)
\]
The derivative is
\[
	(\sim r) \backslash c = \sim (r \backslash c)
\]

The trickiest part of lexing for the $\sim r$ regex
is creating a value for it.
For other regular expressions, the value aligns with the
structure of the regex:
\[
	\vdash \Seq(\Char(a), \Char(b)) : a \cdot b
\]
But for the $\sim r$ regex, a string $s$ is a member of it if and only if
$s$ does not belong to $L(r)$.
That means when there is a match for the not regex,
it is not possible to generate a description of how the string $s$ matched $r$.
What we can do is preserve the information of how $s$ was \emph{not} matched by $r$,
and there are a number of options for doing this.

We could give a partial value when there is a partial match for the regex inside
the $\mathbf{not}$ construct.
For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$;
a value for it could be
\[
	\vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : \sim((a \cdot b ) \cdot c)
\]
The above example demonstrates what value to construct
when the string $s$ is a proper prefix
of some string in $L(r)$. When $s$ is not a prefix of any string
in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
constructor.

Another option would be to store either the string $s$ that resulted in
a mis-match for $r$ or a dummy value as a placeholder:
\[
	\vdash \textit{Not}(abcd) : \sim( r_1 )
\]
or
\[
	\vdash \textit{Not}(\textit{Dummy}) : \sim( r_1 )
\]
We choose to implement the latter as it is the most straightforward:
\[
	\mkeps(\sim r) = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
\]
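The three clauses for negation can be assembled into a small, runnable sketch (matching only, plus the $\mkeps$ clause for $\textit{Not}$); modelling $\textit{Error}$ as an exception is our implementation choice, not the thesis's:

```scala
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp
case class NOT(r: Rexp) extends Rexp

sealed trait Val
case object Dummy extends Val
case class Not(v: Val) extends Val
// ... other value constructors elided

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALTS(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
  case NOT(r1) => !nullable(r1)       // nullable(~r) = not nullable(r)
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO | ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALTS(r1, r2) => ALTS(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALTS(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  case NOT(r1) => NOT(der(c, r1))     // (~r)\c = ~(r\c)
}

// mkeps clause for negation: Error (here an exception) if r is
// nullable, otherwise the placeholder Not(Dummy)
def mkepsNot(r: Rexp): Val = r match {
  case NOT(r1) =>
    if (nullable(r1)) sys.error("Error") else Not(Dummy)
  case _ => sys.error("not a negation")
}

// "ab" is not in L(a), so deriving ~a by "ab" yields a nullable regex
val r = "ab".foldLeft(NOT(CHAR('a')): Rexp)((r1, c) => der(c, r1))
// nullable(r) == true, and mkepsNot(r) == Not(Dummy)
```

This matcher accepts exactly the complement of $L(r)$ under $\textit{NOT}$, and $\textit{mkepsNot}$ produces the placeholder value at the end of the input, as in the $\mkeps$ clause above.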

%----------------------------------------------------------------------------------------
% SECTION 2
%----------------------------------------------------------------------------------------

\section{Bounded Repetitions}