% Chapter Template

\chapter{Regular Expressions and Sulzmann and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title

\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
%use for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant, and discuss
%why more aggressive simplifications are needed.


\section{Preliminaries}

Suppose we have an alphabet $\Sigma$. The set of all strings over this
alphabet is written $\Sigma^*$.

We use patterns to define sets of strings concisely. Regular expressions
are one such pattern system.
The basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
         \mid c
         \mid r_1 \cdot r_2
         \mid r_1 + r_2
         \mid r^*
\]

The language, or set of strings, denoted by a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\emptyset$\\
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\
$L \; (r^*)$ & $\dn$ & $ \bigcup_{i \geq 0} L \; (r)^i$\\
\end{tabular}
\end{center}
This mapping $L$ is also called the ``language interpretation'' of regular expressions.
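
For readers who prefer code, the following is a small sketch of how these basic
regular expressions might be represented as a Scala datatype (the constructor
names are our own choice for illustration and not part of any formal development):

\begin{verbatim}
// Sketch: basic regular expressions as a Scala algebraic datatype.
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*
\end{verbatim}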



The Brzozowski derivative of a regular expression w.r.t.\ a character $c$
is an operation that transforms the regular expression into a new one denoting
the strings of the original language that start with $c$, but with that
head character $c$ removed.

Formally, we first define such a transformation on arbitrary sets of strings,
which we call the semantic derivative:
\begin{center}
$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
\end{center}
\noindent
If the set $A$ happens to have some structure, for example
if it is regular (that is, definable by a regular expression), then
its semantic derivative is also regular and can be computed
syntactically on the regular expression itself.

% Derivatives of a
%regular expression, written $r \backslash c$, give a simple solution
%to the problem of matching a string $s$ with a regular
%expression $r$: if the derivative of $r$ w.r.t.\ (in
%succession) all the characters of the string matches the empty string,
%then $r$ matches $s$ (and {\em vice versa}).

The derivative of a regular expression, written
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$, and returns another regular expression $r'$,
which is computed by the following recursive function:

\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
$d \backslash c$     & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The $\nullable$ function tests whether the empty string $[]$
is in the language of $r$:


\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The empty set does not contain any string and
therefore does not contain the empty string. The regular expression
$\ONE$ contains the empty string
by definition. The character regular expression $c$
denotes the singleton set containing only the one-character string $[c]$,
and therefore does not contain the empty string. An alternative regular
expression (or ``or'' expression) is nullable
if at least one of its children is nullable. A sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
naturally contains the empty string (zero iterations).
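
As an informal illustration, the two functions above can be transcribed almost
clause by clause into Scala, building on the \texttt{Rexp} datatype sketched
earlier (again only a sketch, not the formalised definition):

\begin{verbatim}
// Sketch: nullable and the Brzozowski derivative over Rexp.
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}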



We can relate the derivative of a regular expression to the semantic derivative
via the language interpretation:

\begin{center}
$L \; (r \backslash c) = \Der \; c \; (L \; (r))$
\end{center}
\noindent
The function $\backslash c$
describes how a regular expression evolves into
a new regular expression once the head character $c$
has been chopped off all the strings it denotes.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
may also be the one whose head character gets chopped
off.
The derivative of a star regular expression unwraps one iteration of
the regular expression and attaches the star regular expression
to the second element of the sequence, making sure that a copy is retained
for possibly more iterations in later phases of lexing.


The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's  regular-expression matching algorithm as:

\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]
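
\noindent
A corresponding Scala sketch of the string derivative and the matcher, reusing
the \texttt{der} and \texttt{nullable} sketches from above, is:

\begin{verbatim}
// Sketch: derivative w.r.t. a string and Brzozowski's matcher.
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(s: String, r: Rexp): Boolean =
  nullable(ders(s.toList, r))

// e.g. matcher("ab", SEQ(CHAR('a'), CHAR('b'))) evaluates to true
\end{verbatim}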

\noindent
Assuming that a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with  a regular expression  $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively  easily shown that this matcher is correct (that is, given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
This is a beautifully simple definition.

If we implement the above algorithm naively, however,
it can be excruciatingly slow.

\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=10000,
    ytick={0,1000,...,10000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\caption{EightThousandNodes} \label{fig:EightThousandNodes}
\end{figure}

%(8000 node data to be added here)
For example, when starting with the regular
expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree); see Figure~\ref{fig:EightThousandNodes}.
The reason why $(a + aa) ^*$ explodes so drastically is that without
pruning, the algorithm will keep records of all possible ways of matching:
\begin{center}
$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
\end{center}

\noindent
Each of the above alternative branches corresponds to a (possibly partial) match:
$aa $, $a \quad a$ and $a \quad a \cdot (a)$ (the last one being incomplete).
These different ways of matching will grow exponentially with the string length,
and without simplifications that throw away some of these very similar matchings,
it is no surprise that these expressions grow so quickly.
Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative, the slower the
algorithm.

Brzozowski was quick in finding that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, and therefore the algorithm as stated is not optimal.
He also introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regular expressions and thereby further decrease the size
of the intermediate regular expressions.

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher that avoids the intermediate regular expressions becoming cumbersome.

If we want the size of derivatives in the algorithm to
stay even lower, we would need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve a very tight size bound, namely,
the same size bound as that of the \emph{partial derivatives}.

Building derivatives and then simplifying them works nicely so far.
But what if we want to
do lexing instead of just obtaining a YES/NO answer?
This requires us to go back again, for a moment, to the world
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}


They first defined the datatypes for storing the
lexing information, called \emph{values} or
sometimes also \emph{lexical values}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}
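
\noindent
These values can be represented, for instance, by the following Scala datatype
(a sketch mirroring the grammar on the right; the names are illustrative only):

\begin{verbatim}
// Sketch: lexical values as a Scala algebraic datatype.
abstract class Val
case object Empty extends Val                   // for ONE
case class Chr(c: Char) extends Val             // for a character
case class Sequ(v1: Val, v2: Val) extends Val   // for a sequence
case class Left(v: Val) extends Val             // left branch of +
case class Right(v: Val) extends Val            // right branch of +
case class Stars(vs: List[Val]) extends Val     // iterations of a star
\end{verbatim}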

\noindent

Building on top of Sulzmann and Lu's attempt to formalize the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally-specified rules for what a POSIX matching is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But having a correct result is still not enough;
we want at least some degree of \emph{efficiency}.



One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and vice versa.
Even with respect to one particular string, a regular expression can
still match it with more than one value.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching
without allowing any value under a star to be flattened
to an empty string can be given by the following recurrence:
\begin{equation}
C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
\end{equation}
and a closed-form formula can be calculated to be
\begin{equation}
C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which clearly grows exponentially.

A lexer aimed at producing all possible values therefore has an exponential
worst-case runtime, so it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which by intuition always
\begin{itemize}
\item
match the leftmost regular expression when multiple options for matching
are available, and
\item
always match a subpart as much as possible before proceeding
to the next token.
\end{itemize}

For example, the above example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output we want from a lexing algorithm is such a POSIX match,
encoded as a value.
The reason why we are interested in $\POSIX$ values is that they can
be practically used in the lexing phase of a compiler front end.
For instance, when lexing a code snippet
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword.

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated in case the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$	& $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}
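
\noindent
A Scala rendering of $\mkeps$ over the sketched datatypes might look as follows
(a sketch; the $\ZERO$ and character cases are omitted because $\mkeps$ is only
ever applied to nullable regular expressions):

\begin{verbatim}
// Sketch: mkeps builds the POSIX value for how a nullable
// regular expression matches the empty string.
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}
\end{verbatim}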


\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. A POSIX value is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses do essentially one thing: they identify the ``hole'' in the
value into which the character has to be injected back.
For instance, consider the last clause, which injects back into a value
that corresponds to a star. That value
must be a sequence value whose first
component corresponds to the body of the star
with the first character chopped off---an iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations collected before. So we inject the character
back into the first value and form a new value with this new iteration
added to the front of the previous list of iterations, all under a $\Stars$
constructor at the top level.
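
For concreteness, here is a Scala sketch of $\textit{inj}$ over the datatypes
introduced earlier, together with the resulting two-phase lexer (pattern matches
not listed are never reached for well-formed inputs):

\begin{verbatim}
// Sketch: inj reverses one derivative step by injecting the
// character c back into the value v.
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty)                 => Chr(c)
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
}

// Sketch: derivatives first, then mkeps followed by repeated injections.
def lex(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil     => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lex(der(c, r), cs).map(v => inj(r, c, v))
}
\end{verbatim}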


We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well---by definition they reflect
the size of the regular expressions they are built from.

One can introduce simplification on the regular expressions and values, but one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications
during the injection phase.
One can then prove that the POSIX value for how a
regular expression matches a string is not affected---although this is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.



%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification or he may well be
%stuck on an efficient simplificaiton with a proof.
%He went on to explore the use of derivatives together with
%automaton, and did not try lexing using derivatives.

We want to get rid of the complex and fragile rectification of values.
Can we avoid creating the intermediate values $v_1,\ldots, v_n$ altogether,
and obtain the lexing information that should already be there while
doing derivatives, in one pass, without a second injection phase?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding additional information to the
regular expressions, called \emph{bitcodes}.

\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:

\begin{center}
$b ::=   1 \mid  0 \qquad
bs ::= [] \mid b::bs
$
\end{center}

\noindent
The $1$ and $0$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}
\end{center}

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:


%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       (\Left\,v, bs_1)$\\
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       (\Right\,v, bs_1)$\\
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\

  $\textit{decode}\,bs\,r$ & $\dn$ &
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
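
A Scala sketch of the coding and decoding functions, over the \texttt{Val} and
\texttt{Rexp} sketches from before and with bits represented as integers, is
given below (\texttt{decodeAux} corresponds to $\textit{decode}'$; malformed
bitcodes simply fail the pattern match):

\begin{verbatim}
// Sketch: encoding values as bitcodes and decoding them back
// with the help of the original regular expression.
type Bits = List[Int]

def code(v: Val): Bits = v match {
  case Empty           => Nil
  case Chr(_)          => Nil
  case Left(v1)        => 0 :: code(v1)
  case Right(v1)       => 1 :: code(v1)
  case Sequ(v1, v2)    => code(v1) ++ code(v2)
  case Stars(Nil)      => List(0)
  case Stars(v1 :: vs) => 1 :: code(v1) ++ code(Stars(vs))
}

def decodeAux(r: Rexp, bs: Bits): (Val, Bits) = (r, bs) match {
  case (ONE, _)                => (Empty, bs)
  case (CHAR(c), _)            => (Chr(c), bs)
  case (ALT(r1, _), 0 :: rest) =>
    val (v, bs1) = decodeAux(r1, rest); (Left(v), bs1)
  case (ALT(_, r2), 1 :: rest) =>
    val (v, bs1) = decodeAux(r2, rest); (Right(v), bs1)
  case (SEQ(r1, r2), _)        =>
    val (v1, bs1) = decodeAux(r1, bs)
    val (v2, bs2) = decodeAux(r2, bs1)
    (Sequ(v1, v2), bs2)
  case (STAR(_), 0 :: rest)    => (Stars(Nil), rest)
  case (STAR(r1), 1 :: rest)   =>
    val (v, bs1) = decodeAux(r1, rest)
    val (Stars(vs), bs2) = decodeAux(STAR(r1), bs1)
    (Stars(v :: vs), bs2)
}

def decode(bs: Bits, r: Rexp): Option[Val] = decodeAux(r, bs) match {
  case (v, Nil) => Some(v)
  case _        => None
}
\end{verbatim}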

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\ZERO$\\
                  & $\mid$ & $_{bs}\ONE$\\
                  & $\mid$ & $_{bs}{\bf c}$\\
                  & $\mid$ & $_{bs}\sum\,as$\\
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
                  & $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})

\noindent
where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.


To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
  $(r^*)^\uparrow$ & $\dn$ &
         $_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
%\end{definition}

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
     $_{bs @ bs'}\ONE$\\
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
     $_{bs@bs'}{\bf c}$\\
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
     $_{bs@bs'}\sum\textit{as}$\\
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
     $_{bs@bs'}a_1 \cdot a_2$\\
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
     $_{bs @ bs'}a^*$
\end{tabular}
\end{center}
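
\noindent
As a sketch in Scala, annotated regular expressions, $\textit{fuse}$ and
internalisation can be written as follows (bitcodes as in the earlier sketch;
the constructor names are our own):

\begin{verbatim}
// Sketch: annotated regular expressions carrying bitcodes.
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp  // generalised +
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

// fuse attaches bits to the front of an annotated regular expression.
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO             => AZERO
  case AONE(bs1)         => AONE(bs ++ bs1)
  case ACHAR(bs1, c)     => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as)    => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1)    => ASTAR(bs ++ bs1, a1)
}

// internalisation of ordinary regular expressions.
def internalise(r: Rexp): ARexp = r match {
  case ZERO        => AZERO
  case ONE         => AONE(Nil)
  case CHAR(c)     => ACHAR(Nil, c)
  case ALT(r1, r2) => AALTS(Nil, List(fuse(List(0), internalise(r1)),
                                      fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1)    => ASTAR(Nil, internalise(r1))
}
\end{verbatim}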


\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:


\iffalse
 %\begin{definition}{bder}
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
        $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
  & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       (\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\
  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
        $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
  & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
       (_{bs}\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}
\fi

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\ONE\;\textit{else}\;\ZERO$\\
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
        $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
  & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}(\textit{fuse}\, [1] \; (a\,\backslash c))\cdot
       (_{[]}a^*)$
\end{tabular}
\end{center}

%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $1$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The sequence clause
is also more subtle. Here \textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions (therefore we omit the
definition). When $a_1$ is $\textit{bnullable}$, and assuming that $\textit{bmkeps}$
correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
the right branch of the alternative, which is $\textit{fuse} \; (\bmkeps \; a_1)\, (a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it there. The
bitsequence $\textit{bs}$, which was initially attached to the
first element of the sequence $a_1 \cdot a_2$, has
now been elevated to the top-level of $\sum$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $ a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:


%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{[]}\sum \textit{as})$\\
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
     $bs \,@\, [0]$
\end{tabular}
\end{center}
%\end{definition}

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation  of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
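
Putting the pieces together, a Scala sketch of the bit-coded derivative,
$\textit{bmkeps}$ and $\textit{blexer}$ over the sketched \texttt{ARexp} is
given below (it follows the convention above in which a non-empty star
iteration contributes a $1$ and the end of a star a $0$):

\begin{verbatim}
// Sketch: nullability, bmkeps, derivative and blexer for
// annotated regular expressions.
def bnullable(a: ARexp): Boolean = a match {
  case AZERO           => false
  case AONE(_)         => true
  case ACHAR(_, _)     => false
  case AALTS(_, as)    => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _)     => true
}

def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs)            => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ++ bmkeps(a1) else bs ++ bmkeps(AALTS(Nil, as))
  case ASEQ(bs, a1, a2)    => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _)        => bs ++ List(0)  // 0 marks the end of the star
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO            => AZERO
  case AONE(_)          => AZERO
  case ACHAR(bs, d)     => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as)    => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1)    =>
    ASEQ(bs, fuse(List(1), bder(c, a1)), ASTAR(Nil, a1))  // 1 marks one more iteration
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}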

Now we introduce the simplification rules, which are the reason for
introducing the bitcodes in the first place.

\subsection*{Simplification Rules}

This section introduces aggressive (in terms of size) simplification rules
on annotated regular expressions
in order to keep derivatives small. Such simplifications are promising,
as we have
generated test data that suggest
that a good tight size bound can be achieved. We could only
partially cover the search space, as there are infinitely many regular
expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:
%\comment{Use $\ZERO$, $\ONE$ and so on.
%Is it $ALTS$ or $ALTS$?}\\

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\
   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
The simplification function pattern-matches on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its child regular expressions
recursively and then checks whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all later ones
when it detects duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; ((_{bs}\sum \textit{as}) :: \textit{as'})$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  $\textit{flatten} \; (\ZERO :: as')$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
  $\textit{flatten} \; (a :: as')$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. In terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
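
In Scala, the simplification function together with $\textit{flatten}$ can be
sketched as follows; here $\textit{distinct}$ is Scala's built-in
\texttt{List.distinct}, which keeps only the first occurrence of each duplicate,
and we omit the $a_1 \cdot \ONE$ clause, which needs extra care with the bits
attached to the $\ONE$:

\begin{verbatim}
// Sketch: flatten opens up nested alternatives and drops AZERO;
// simp applies the simplification rules described above.
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil                    => Nil
  case AZERO :: rest          => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ++ flatten(rest)
  case a :: rest              => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _)       => AZERO
    case (_, AZERO)       => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)  // keep the ONE's bits as well
    case (a1s, a2s)       => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil       => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1       => AALTS(bs, as1)
  }
  case _ => a
}
\end{verbatim}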

Having defined the $\simp$ function,
we can use the previous notation of natural
extension from derivative w.r.t.~character to derivative
w.r.t.~string:%\comment{simp in the [] case?}

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
to obtain an optimised version of the algorithm:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small: for example,
with this simplification our previous $(a + aa)^*$ example's more than 8000 nodes
will be reduced to just 6, and the size stays constant no matter how long the
input string is.
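
The optimised lexer is then only a small change to the earlier \texttt{blexer}
sketch, interleaving \texttt{simp} with each derivative step:

\begin{verbatim}
// Sketch: simplify after every derivative step.
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bdersSimp(cs, simp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}

// e.g. blexerSimp(STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a')))), "aaaa")
// keeps the intermediate annotated regular expressions small.
\end{verbatim}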




%-----------------------------------
%	SUBSECTION 1
%-----------------------------------
\section{Specifications of Certain Functions to be Used}
Here we give the definitions of some functions
which we will use later.
\begin{center}
\begin{tabular}{ccc}
$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$
\end{tabular}
\end{center}
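
Only the character clause of $\retrieve$ is shown above. For reference, a Scala
sketch of the full function, following the same $0$/$1$ conventions as
$\textit{code}$ and $\textit{bmkeps}$, could look as follows (this is our own
reconstruction for illustration, not a clause-by-clause copy of the formal
definition):

\begin{verbatim}
// Sketch: retrieve collects the bitcode of an annotated regular
// expression along the path described by a value.
def retrieve(a: ARexp, v: Val): Bits = (a, v) match {
  case (AONE(bs), Empty)                => bs
  case (ACHAR(bs, _), Chr(_))           => bs
  case (AALTS(bs, a1 :: as), Left(v1))  => bs ++ retrieve(a1, v1)
  case (AALTS(bs, _ :: as), Right(v1))  => bs ++ retrieve(AALTS(Nil, as), v1)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) => bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
  case (ASTAR(bs, _), Stars(Nil))       => bs ++ List(0)
  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
    bs ++ List(1) ++ retrieve(a1, v1) ++ retrieve(ASTAR(Nil, a1), Stars(vs))
}
\end{verbatim}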


%----------------------------------------------------------------------------------------
%	SECTION  correctness proof
%----------------------------------------------------------------------------------------

\section{Correctness of Bit-coded Algorithm (Without Simplification)}
We now give the proof of the correctness of the algorithm with bit-codes.

Ausaf and Urban cleverly defined an auxiliary function called $\flex$:
\[
\begin{array}{lcl}
\flex \; r \; f \; [] \; v & \dn & f\; v\\
\flex \; r \; f \; (c :: s) \; v & \dn & \flex \; (r\backslash c) \; (\lambda v. \, f \, (\inj \; r\; c\; v))\; s \; v
\end{array}
\]
It accumulates the characters that need to be injected back,
and does the injection in a stack-like manner (the last derivative taken is the first character injected).
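
In code, $\flex$ could be sketched as follows, using the earlier
\texttt{Rexp}/\texttt{Val} sketches (it is primarily a device for the proofs,
but is executable nonetheless):

\begin{verbatim}
// Sketch: flex accumulates the pending injections as a function
// and applies it to the final value.
def flex(r: Rexp, f: Val => Val, s: List[Char], v: Val): Val = s match {
  case Nil     => f(v)
  case c :: cs => flex(der(c, r), (v1: Val) => f(inj(r, c, v1)), cs, v)
}

// flex(r, identity, s.toList, mkeps(ders(s.toList, r)))
// recovers the lexer's value when the string matches.
\end{verbatim}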
$\flex$ is connected to the $\lexer$ function as follows:
\begin{lemma}
$\flex \; r \; \textit{id}\; s \; (\mkeps \; (r\backslash s)) = \lexer \; r \; s$
\end{lemma}
$\flex$ provides us with a bridge between $\lexer$ and $\blexer$.
What is even better about $\flex$ is that it allows us to
directly operate on the value $\mkeps \; (r\backslash s)$,
which is pivotal in the definitions of $\lexer$ and $\blexer$, but not visible as an argument.
When the value created by $\mkeps$ becomes available, one can
prove some stepwise properties of lexing nicely:
\begin{lemma}\label{flexStepwise}
$\textit{flex} \; r \; f \; (s@[c]) \; v = \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
\end{lemma}

For $\blexer$ we have a function with stepwise properties like $\flex$ as well,
called $\retrieve$ (\ref{retrieveDef}).
$\retrieve$ collects bit-codes from an annotated regular expression,
guided by a value.
$\retrieve$ is connected to $\blexer$ in the following way:
\begin{lemma}\label{blexer_retrieve}
$\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
\end{lemma}
If we take the derivative of an annotated regular expression,
we can $\retrieve$ the same bit-codes as before the derivative took place,
provided that we use the corresponding value:

\begin{lemma}\label{retrieveStepwise}
$\retrieve \; (r \backslash c) \;  v = \retrieve \; r \; (\inj \; r\; c\; v)$
\end{lemma}
The other good thing about $\retrieve$ is that it can be connected to $\flex$:
%centralLemma1
\begin{lemma}\label{flex_retrieve}
$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
\end{lemma}
\begin{proof}
By reverse induction on the string $s$, with
$v$ allowed to be arbitrary.
The crucial point is to rewrite
\[
\retrieve \; (r \backslash (s@[c])) \; (\mkeps \; (r \backslash (s@[c])))
\]
as
\[
\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\;  (\mkeps \; (r \backslash (s@[c])))).
\]
This enables us to equate
\[
\retrieve \; (r \backslash (s@[c])) \; (\mkeps \; (r \backslash (s@[c])))
\]
with
\[
\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps \; (r\backslash (s@[c])))),
\]
which in turn can be rewritten as
\[
\flex \; r \; \textit{id} \; (s@[c]) \; (\mkeps \; (r\backslash (s@[c]))).
\]
\end{proof}

With the above lemma we can now link $\flex$ and $\blexer$.

\begin{lemma}\label{flex_blexer}
$\textit{flex} \; r \; \textit{id} \; s \; (\mkeps \; (r \backslash s)) = \blexer \; r \; s$
\end{lemma}
\begin{proof}
By combining the two lemmas above, \ref{flex_retrieve} and \ref{blexer_retrieve}.
\end{proof}
Finally, combining Lemma \ref{flex_blexer} with the connection between
$\flex$ and $\lexer$ stated at the beginning of this section gives us the
correctness of the bit-coded algorithm:
\begin{lemma}
$\lexer \; r \; s = \blexer \; r \; s$
\end{lemma}
