
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	1	% Chapter Template
a0f27e21b42c all texrelated Chengsong parents: diff changeset	2
528 28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	3	\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
519 856d025dbc15 more Chengsong parents: 518 diff changeset	4
856d025dbc15 more Chengsong parents: 518 diff changeset	5	\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
856d025dbc15 more Chengsong parents: 518 diff changeset	6	%and notations we
856d025dbc15 more Chengsong parents: 518 diff changeset	7	%use for describing the lexing algorithm by Sulzmann and Lu,
856d025dbc15 more Chengsong parents: 518 diff changeset	8	%and then give the algorithm and its variant, and discuss
856d025dbc15 more Chengsong parents: 518 diff changeset	9	%why more aggressive simplifications are needed.
856d025dbc15 more Chengsong parents: 518 diff changeset	10
856d025dbc15 more Chengsong parents: 518 diff changeset	11
528 28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	12	\section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
We take characters as a primitive datatype $\textit{char}$, which is not defined
in terms of anything else.
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	15	\begin{center}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	16	\begin{tabular}{lcl}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	17	$\textit{char}$ & $\dn$ & $a \| b \| c \| \ldots$\\
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	18	\end{tabular}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	19	\end{center}
Strings are lists of characters:
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	21	\begin{center}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	22	\begin{tabular}{lcl}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	23	$\textit{string}$ & $\dn$ & $[] \| c :: cs$\\
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	24	& & $(c\; \text{has char type})$
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	25	\end{tabular}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	26	\end{center}
Strings can be concatenated to form longer strings:
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	28	\begin{center}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	29	\begin{tabular}{lcl}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	30	$s_1 @ s_2$ & $\rightarrow$ & $s'$\\
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	31	\end{tabular}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	32	\end{center}
519 856d025dbc15 more Chengsong parents: 518 diff changeset	33
Two sets of strings can be concatenated as well:
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	35	\begin{center}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	36	\begin{tabular}{lcl}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	37	$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	38	\end{tabular}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	39	\end{center}
We also call the above ``language concatenation''.
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	41	The power of a language is defined recursively, using the
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	42	concatenation operator $@$:
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	43	\begin{center}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	44	\begin{tabular}{lcl}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	45	$A^0 $ & $\dn$ & $\{ [] \}$\\
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	46	$A^{n+1}$ & $\dn$ & $A^n @ A$
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	47	\end{tabular}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	48	\end{center}
The union of all the powers of a language is its Kleene star:
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	51	\begin{center}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	52	\begin{tabular}{lcl}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	53	$A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	54	\end{tabular}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	55	\end{center}
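
\noindent
As a small illustration (the language $A$ is chosen only for this example), take
$A = \{[a], [bc]\}$. Unfolding the definitions gives
\begin{center}
\begin{tabular}{lcl}
$A^0$ & $=$ & $\{ [] \}$\\
$A^1$ & $=$ & $A^0 @ A = \{[a], [bc]\}$\\
$A^2$ & $=$ & $A^1 @ A = \{[aa], [abc], [bca], [bcbc]\}$\\
\end{tabular}
\end{center}
and $A^*$ contains all of these strings, together with those of every higher power.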
We also define the operation of chopping off a head character $c$ from all the strings
in a set that start with $c$ (strings that do not start with $c$ are discarded):
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	58	\begin{center}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	59	\begin{tabular}{lcl}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	60	$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	61	\end{tabular}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	62	\end{center}
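
\noindent
For instance (again with a language chosen only for illustration), for
$A = \{[ab], [ac], [b]\}$ we get
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;a \;A$ & $=$ & $\{[b], [c]\}$\\
$\textit{Der} \;b \;A$ & $=$ & $\{[]\}$\\
$\textit{Der} \;c \;A$ & $=$ & $\{\}$\\
\end{tabular}
\end{center}
The string $[b]$ does not start with $a$ and therefore contributes nothing to
$\textit{Der} \;a \;A$.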
With the above definitions, it becomes natural to define regular expressions
as a concise way of describing such languages.
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	65	\section{Regular Expressions and Their Language Interpretation}
Suppose we have an alphabet $\Sigma$. The set of all strings whose characters
are drawn from $\Sigma$ can be expressed as $\Sigma^*$.

We use patterns to define sets of strings concisely. Regular expressions
are one such system of patterns.
The basic regular expressions are defined inductively
by the following grammar:
856d025dbc15 more Chengsong parents: 518 diff changeset	74	\[ r ::= \ZERO \mid \ONE
856d025dbc15 more Chengsong parents: 518 diff changeset	75	\mid c
856d025dbc15 more Chengsong parents: 518 diff changeset	76	\mid r_1 \cdot r_2
856d025dbc15 more Chengsong parents: 518 diff changeset	77	\mid r_1 + r_2
856d025dbc15 more Chengsong parents: 518 diff changeset	78	\mid r^*
856d025dbc15 more Chengsong parents: 518 diff changeset	79	\]
856d025dbc15 more Chengsong parents: 518 diff changeset	80
The language, or set of strings, denoted by a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\{\}$\\
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
$L \; (r^*)$ & $\dn$ & $ (L \; (r))^*$\\
\end{tabular}
\end{center}
This is also called the ``language interpretation'' of regular expressions.
856d025dbc15 more Chengsong parents: 518 diff changeset	90
856d025dbc15 more Chengsong parents: 518 diff changeset	91
856d025dbc15 more Chengsong parents: 518 diff changeset	92
The Brzozowski derivative w.r.t.\ a character $c$ is an operation on a regular expression:
it transforms the regular expression into a new one whose language consists of
the strings of the original language that start with $c$, with that head character removed.

Formally, we first define this transformation on an arbitrary set of strings, where
we call it the semantic derivative:
856d025dbc15 more Chengsong parents: 518 diff changeset	99	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	100	$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
856d025dbc15 more Chengsong parents: 518 diff changeset	101	\end{center}
If $\textit{A}$ happens to have some structure, for example
if it is regular, then its derivative can be computed directly on the
regular expression, as described in the next section.
856d025dbc15 more Chengsong parents: 518 diff changeset	106
856d025dbc15 more Chengsong parents: 518 diff changeset	107	% Derivatives of a
856d025dbc15 more Chengsong parents: 518 diff changeset	108	%regular expression, written $r \backslash c$, give a simple solution
856d025dbc15 more Chengsong parents: 518 diff changeset	109	%to the problem of matching a string $s$ with a regular
856d025dbc15 more Chengsong parents: 518 diff changeset	110	%expression $r$: if the derivative of $r$ w.r.t.\ (in
856d025dbc15 more Chengsong parents: 518 diff changeset	111	%succession) all the characters of the string matches the empty string,
856d025dbc15 more Chengsong parents: 518 diff changeset	112	%then $r$ matches $s$ (and {\em vice versa}).
856d025dbc15 more Chengsong parents: 518 diff changeset	113
528 28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	114
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	115	\section{Brzozowski Derivatives of Regular Expressions}
The derivative of a regular expression, written
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$, and returns another regular expression $r'$.
It is defined by the following recursive function:
856d025dbc15 more Chengsong parents: 518 diff changeset	120
856d025dbc15 more Chengsong parents: 518 diff changeset	121	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	122	\begin{tabular}{lcl}
856d025dbc15 more Chengsong parents: 518 diff changeset	123	$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	124	$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	125	$d \backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	126	$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	127	$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	128	$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	129	& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	130	& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	132	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	133	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	134	\noindent
The derivative $r\backslash c$ describes how a regular expression evolves into
a new regular expression after the head character $c$ has been chopped off
from all the strings it contains.
The most involved cases are the sequence
and star cases.
In the sequence case, if the first regular expression
is nullable (that is, it can match the empty string), then the character $c$
may also be consumed by the second component of the sequence, so the
derivative of $r_2$ is added as an alternative.
The derivative of the star regular expression unfolds one iteration of
$r$, takes the derivative of that iteration, and keeps a copy of $r^*$
as the second component of the sequence, so that further iterations
remain available in later phases of lexing.
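
\noindent
As a worked instance of the sequence and star cases (the regular expression is
picked purely for illustration), consider $(a^* \cdot a)\backslash a$. Since
$a^*$ can match the empty string, the $\mathit{then}$-branch of the sequence case applies:
\begin{center}
\begin{tabular}{lcl}
$(a^* \cdot a)\backslash a$ & $=$ & $(a^*\backslash a) \cdot a \,+\, a\backslash a$\\
& $=$ & $((a\backslash a) \cdot a^*) \cdot a \,+\, \ONE$\\
& $=$ & $(\ONE \cdot a^*) \cdot a \,+\, \ONE$\\
\end{tabular}
\end{center}
The summand $\ONE$ records the possibility that the star performed zero iterations
and the character was consumed by the second component.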
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	149
519 856d025dbc15 more Chengsong parents: 518 diff changeset	150
The $\nullable$ function tests whether the empty string $[]$
is in the language of $r$:
856d025dbc15 more Chengsong parents: 518 diff changeset	153
856d025dbc15 more Chengsong parents: 518 diff changeset	154
856d025dbc15 more Chengsong parents: 518 diff changeset	155	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	156	\begin{tabular}{lcl}
856d025dbc15 more Chengsong parents: 518 diff changeset	157	$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	158	$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	159	$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	160	$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	161	$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	162	$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	163	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	164	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	165	\noindent
The empty set does not contain any string and
therefore not the empty string either. The empty-string
regular expression contains the empty string
by definition. The character regular expression
denotes the singleton set containing only that character,
and therefore does not contain the empty string.
The alternative regular expression (or ``or'' expression)
is nullable as soon as one of its children is nullable.
The sequence regular expression
requires both children to match the empty string
in order to compose the empty string, and the Kleene star
always contains the empty string, since zero iterations are allowed.
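
\noindent
For example (with a regular expression chosen only for illustration), the clauses
unfold as follows:
\begin{center}
\begin{tabular}{lcl}
$\nullable((a + \ONE) \cdot b^*)$ & $=$ & $\nullable(a + \ONE) \wedge \nullable(b^*)$\\
& $=$ & $(\nullable(a) \vee \nullable(\ONE)) \wedge \mathit{true}$\\
& $=$ & $(\mathit{false} \vee \mathit{true}) \wedge \mathit{true}$\\
& $=$ & $\mathit{true}$\\
\end{tabular}
\end{center}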
519 856d025dbc15 more Chengsong parents: 518 diff changeset	179
The derivative on regular expressions coincides with the semantic
derivative on the corresponding languages:
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	182
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	183	\begin{lemma}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	184	$\textit{Der} \; c \; L(r) = L (r\backslash c)$
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	185	\end{lemma}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	186
519 856d025dbc15 more Chengsong parents: 518 diff changeset	187	\noindent
856d025dbc15 more Chengsong parents: 518 diff changeset	188
856d025dbc15 more Chengsong parents: 518 diff changeset	189
856d025dbc15 more Chengsong parents: 518 diff changeset	190	The main property of the derivative operation
856d025dbc15 more Chengsong parents: 518 diff changeset	191	that enables us to reason about the correctness of
856d025dbc15 more Chengsong parents: 518 diff changeset	192	an algorithm using derivatives is
856d025dbc15 more Chengsong parents: 518 diff changeset	193
856d025dbc15 more Chengsong parents: 518 diff changeset	194	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	195	$c\!::\!s \in L(r)$ holds
856d025dbc15 more Chengsong parents: 518 diff changeset	196	if and only if $s \in L(r\backslash c)$.
856d025dbc15 more Chengsong parents: 518 diff changeset	197	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	198
856d025dbc15 more Chengsong parents: 518 diff changeset	199	\noindent
856d025dbc15 more Chengsong parents: 518 diff changeset	200	We can generalise the derivative operation shown above for single characters
856d025dbc15 more Chengsong parents: 518 diff changeset	201	to strings as follows:
856d025dbc15 more Chengsong parents: 518 diff changeset	202
856d025dbc15 more Chengsong parents: 518 diff changeset	203	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	204	\begin{tabular}{lcl}
856d025dbc15 more Chengsong parents: 518 diff changeset	205	$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	206	$r \backslash [\,] $ & $\dn$ & $r$
856d025dbc15 more Chengsong parents: 518 diff changeset	207	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	208	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	209
856d025dbc15 more Chengsong parents: 518 diff changeset	210	\noindent
856d025dbc15 more Chengsong parents: 518 diff changeset	211	and then define Brzozowski's regular-expression matching algorithm as:
856d025dbc15 more Chengsong parents: 518 diff changeset	212
528 28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	213	\begin{definition}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	214	$match\;s\;r \;\dn\; nullable(r\backslash s)$
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	215	\end{definition}
519 856d025dbc15 more Chengsong parents: 518 diff changeset	216
856d025dbc15 more Chengsong parents: 518 diff changeset	217	\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1\ldots c_{n-1}$,
this algorithm can be presented graphically as follows:
856d025dbc15 more Chengsong parents: 518 diff changeset	220
856d025dbc15 more Chengsong parents: 518 diff changeset	221	\begin{equation}\label{graph:*}
856d025dbc15 more Chengsong parents: 518 diff changeset	222	\begin{tikzcd}
856d025dbc15 more Chengsong parents: 518 diff changeset	223	r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
856d025dbc15 more Chengsong parents: 518 diff changeset	224	\end{tikzcd}
856d025dbc15 more Chengsong parents: 518 diff changeset	225	\end{equation}
856d025dbc15 more Chengsong parents: 518 diff changeset	226
856d025dbc15 more Chengsong parents: 518 diff changeset	227	\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string, and then use \textit{nullable}
to test whether the result can match the empty string. It can be
shown relatively easily that this matcher is correct (that is, given
a string $s = c_0\ldots c_{n-1}$ and a regular expression $r_0$, it generates YES if and only if $s \in L(r_0)$).
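
\noindent
The definitions above translate almost clause by clause into a program. The
following Python sketch is only an illustration and not part of the
formalisation; the class names (\texttt{Zero}, \texttt{One}, \texttt{Chr},
\texttt{Alt}, \texttt{Seq}, \texttt{Star}) and the function names
(\texttt{der}, \texttt{ders}, \texttt{matches}) are our own choices for this example.

```python
from dataclasses import dataclass


class Rexp:
    """Base class for the six regular-expression constructors."""


@dataclass(frozen=True)
class Zero(Rexp):
    pass


@dataclass(frozen=True)
class One(Rexp):
    pass


@dataclass(frozen=True)
class Chr(Rexp):
    c: str


@dataclass(frozen=True)
class Alt(Rexp):
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Seq(Rexp):
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Star(Rexp):
    r: Rexp


def nullable(r: Rexp) -> bool:
    """Test whether the language of r contains the empty string."""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False  # Zero and Chr


def der(c: str, r: Rexp) -> Rexp:
    """Derivative of r w.r.t. the character c (no simplification)."""
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        if nullable(r.r1):
            return Alt(Seq(der(c, r.r1), r.r2), der(c, r.r2))
        return Seq(der(c, r.r1), r.r2)
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    return Zero()  # Zero and One


def ders(s: str, r: Rexp) -> Rexp:
    """Derivative w.r.t. a string: one character at a time, left to right."""
    for c in s:
        r = der(c, r)
    return r


def matches(s: str, r: Rexp) -> bool:
    """Brzozowski matcher: build all derivatives, then test nullability."""
    return nullable(ders(s, r))
```

\noindent
With these definitions, a call such as
\texttt{matches("aab", Seq(Star(Chr("a")), Chr("b")))} follows exactly the
chain $r_0, r_1, \ldots, r_n$ in the diagram above and ends with the
\textit{nullable} test.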
856d025dbc15 more Chengsong parents: 518 diff changeset	233
This definition is both beautiful and simple.
856d025dbc15 more Chengsong parents: 518 diff changeset	235
856d025dbc15 more Chengsong parents: 518 diff changeset	236	If we implement the above algorithm naively, however,
856d025dbc15 more Chengsong parents: 518 diff changeset	237	the algorithm can be excruciatingly slow.
856d025dbc15 more Chengsong parents: 518 diff changeset	238
856d025dbc15 more Chengsong parents: 518 diff changeset	239
856d025dbc15 more Chengsong parents: 518 diff changeset	240	\begin{figure}
856d025dbc15 more Chengsong parents: 518 diff changeset	241	\centering
856d025dbc15 more Chengsong parents: 518 diff changeset	242	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
856d025dbc15 more Chengsong parents: 518 diff changeset	243	\begin{tikzpicture}
856d025dbc15 more Chengsong parents: 518 diff changeset	244	\begin{axis}[
856d025dbc15 more Chengsong parents: 518 diff changeset	245	xlabel={$n$},
856d025dbc15 more Chengsong parents: 518 diff changeset	246	x label style={at={(1.05,-0.05)}},
856d025dbc15 more Chengsong parents: 518 diff changeset	247	ylabel={time in secs},
856d025dbc15 more Chengsong parents: 518 diff changeset	248	enlargelimits=false,
856d025dbc15 more Chengsong parents: 518 diff changeset	249	xtick={0,5,...,30},
856d025dbc15 more Chengsong parents: 518 diff changeset	250	xmax=33,
856d025dbc15 more Chengsong parents: 518 diff changeset	251	ymax=10000,
856d025dbc15 more Chengsong parents: 518 diff changeset	252	ytick={0,1000,...,10000},
856d025dbc15 more Chengsong parents: 518 diff changeset	253	scaled ticks=false,
856d025dbc15 more Chengsong parents: 518 diff changeset	254	axis lines=left,
856d025dbc15 more Chengsong parents: 518 diff changeset	255	width=5cm,
856d025dbc15 more Chengsong parents: 518 diff changeset	256	height=4cm,
856d025dbc15 more Chengsong parents: 518 diff changeset	257	legend entries={JavaScript},
856d025dbc15 more Chengsong parents: 518 diff changeset	258	legend pos=north west,
856d025dbc15 more Chengsong parents: 518 diff changeset	259	legend cell align=left]
856d025dbc15 more Chengsong parents: 518 diff changeset	260	\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
856d025dbc15 more Chengsong parents: 518 diff changeset	261	\end{axis}
856d025dbc15 more Chengsong parents: 518 diff changeset	262	\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
of the form $\underbrace{aa..a}_{n}$.}
856d025dbc15 more Chengsong parents: 518 diff changeset	265	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	266	\caption{EightThousandNodes} \label{fig:EightThousandNodes}
856d025dbc15 more Chengsong parents: 518 diff changeset	267	\end{figure}
856d025dbc15 more Chengsong parents: 518 diff changeset	268
856d025dbc15 more Chengsong parents: 518 diff changeset	269
856d025dbc15 more Chengsong parents: 518 diff changeset	270	(8000 node data to be added here)
For example, when starting with the regular
expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree); see Figure~\ref{fig:EightThousandNodes}.
The reason why $(a + aa) ^*$ explodes so drastically is that, without
pruning, the algorithm keeps a record of all possible ways of matching:
856d025dbc15 more Chengsong parents: 518 diff changeset	277	\begin{center}
$(a + aa) ^* \backslash [aa] = (\ZERO + (\ZERO \cdot a + \ONE))\cdot(a + aa)^* + (\ONE + \ONE \cdot a) \cdot (a + aa)^*$
519 856d025dbc15 more Chengsong parents: 518 diff changeset	279	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	280
856d025dbc15 more Chengsong parents: 518 diff changeset	281	\noindent
The alternative branches above correspond to the matches
$aa$ (one iteration using $aa$), $a \quad a$ (two iterations using $a$) and
$a \quad a \cdot (a)$ (an incomplete second iteration using $aa$).
The number of such different ways of matching grows exponentially with the string length,
and without simplifications that throw away some of these very similar matchings,
it is no surprise that these expressions grow so quickly.
Operations like
$\backslash$ and $\nullable$ need to traverse such trees, and
consequently the bigger the derivative, the slower the
algorithm.
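
\noindent
The growth can be observed directly by counting the nodes of successive
derivatives. The Python sketch below is an illustration only; the class names
and the helper functions \texttt{size} and \texttt{derivative\_sizes} are our
own choices, and no simplification is performed.

```python
from dataclasses import dataclass


class Rexp:
    """Base class for the six regular-expression constructors."""


@dataclass(frozen=True)
class Zero(Rexp):
    pass


@dataclass(frozen=True)
class One(Rexp):
    pass


@dataclass(frozen=True)
class Chr(Rexp):
    c: str


@dataclass(frozen=True)
class Alt(Rexp):
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Seq(Rexp):
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Star(Rexp):
    r: Rexp


def nullable(r):
    """Test whether the language of r contains the empty string."""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False  # Zero and Chr


def der(c, r):
    """Derivative of r w.r.t. the character c (no simplification)."""
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        if nullable(r.r1):
            return Alt(Seq(der(c, r.r1), r.r2), der(c, r.r2))
        return Seq(der(c, r.r1), r.r2)
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    return Zero()  # Zero and One


def size(r):
    """Number of nodes of r when viewed as a tree."""
    if isinstance(r, (Alt, Seq)):
        return 1 + size(r.r1) + size(r.r2)
    if isinstance(r, Star):
        return 1 + size(r.r)
    return 1


def derivative_sizes(r, c, steps):
    """Sizes of r, r\\c, (r\\c)\\c, ... for the given number of steps."""
    sizes = [size(r)]
    for _ in range(steps):
        r = der(c, r)
        sizes.append(size(r))
    return sizes


if __name__ == "__main__":
    a = Chr("a")
    print(derivative_sizes(Star(Alt(a, Seq(a, a))), "a", 10))
```

\noindent
For $(a + aa)^*$ the first sizes produced by this sketch are 6, 12, 27 and 55,
so the tree roughly doubles with every derivative step.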
856d025dbc15 more Chengsong parents: 518 diff changeset	291
Brzozowski quickly noticed that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, so that the derivatives are larger than necessary.
He also introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regexes and thereby further decrease the size
of the intermediate regexes.
856d025dbc15 more Chengsong parents: 518 diff changeset	298
More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations into an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher that works with smaller regular expressions.
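
\noindent
To make the idea concrete, the Python sketch below interleaves a derivative
step with a deliberately small simplification function. The rules used here
(removing $\ZERO$ and $\ONE$ from sequences and alternatives, and collapsing
$r + r$ to $r$) are an assumption made for this illustration and are weaker
than the simplifications discussed in this thesis; the names \texttt{simp},
\texttt{der\_simp} and \texttt{matches\_simp} are likewise our own.

```python
from dataclasses import dataclass


class Rexp:
    """Base class for the six regular-expression constructors."""


@dataclass(frozen=True)
class Zero(Rexp):
    pass


@dataclass(frozen=True)
class One(Rexp):
    pass


@dataclass(frozen=True)
class Chr(Rexp):
    c: str


@dataclass(frozen=True)
class Alt(Rexp):
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Seq(Rexp):
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Star(Rexp):
    r: Rexp


def nullable(r):
    """Test whether the language of r contains the empty string."""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False  # Zero and Chr


def der(c, r):
    """Derivative of r w.r.t. the character c (no simplification)."""
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        if nullable(r.r1):
            return Alt(Seq(der(c, r.r1), r.r2), der(c, r.r2))
        return Seq(der(c, r.r1), r.r2)
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    return Zero()  # Zero and One


def simp(r):
    """Illustrative bottom-up simplification (assumed rules, see lead-in)."""
    if isinstance(r, Alt):
        r1, r2 = simp(r.r1), simp(r.r2)
        if r1 == Zero():
            return r2            # 0 + r  ->  r
        if r2 == Zero() or r1 == r2:
            return r1            # r + 0  ->  r   and   r + r  ->  r
        return Alt(r1, r2)
    if isinstance(r, Seq):
        r1, r2 = simp(r.r1), simp(r.r2)
        if r1 == Zero() or r2 == Zero():
            return Zero()        # 0 . r  ->  0   and   r . 0  ->  0
        if r1 == One():
            return r2            # 1 . r  ->  r
        if r2 == One():
            return r1            # r . 1  ->  r
        return Seq(r1, r2)
    return r                     # Zero, One, Chr and Star are left unchanged


def der_simp(c, r):
    """One derivative step followed immediately by simplification."""
    return simp(der(c, r))


def matches_simp(s, r):
    """Matcher that simplifies after every derivative step."""
    for c in s:
        r = der_simp(c, r)
    return nullable(r)
```

\noindent
Since each rule preserves the language of the regular expression,
\texttt{matches\_simp} gives the same answers as the unsimplified matcher while
working on smaller intermediate expressions.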
519 856d025dbc15 more Chengsong parents: 518 diff changeset	305
If we want the size of derivatives in the algorithm to
stay even lower, we would need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
duplicates, whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. These more aggressive simplification rules aim at
a very tight size bound, possibly as low
as that of the \emph{partial derivatives}~\parencite{Antimirov1995}.
519 856d025dbc15 more Chengsong parents: 518 diff changeset	316
Building derivatives and then simplifying them works well for matching.
But what if we want to
do lexing instead of just getting a YES/NO answer?
This requires us to go back, for a moment, to the world
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} came up with an
elegant solution for this (arguably as beautiful as the original
definition of derivatives).
856d025dbc15 more Chengsong parents: 518 diff changeset	325
856d025dbc15 more Chengsong parents: 518 diff changeset	326	\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
856d025dbc15 more Chengsong parents: 518 diff changeset	327
856d025dbc15 more Chengsong parents: 518 diff changeset	328
They first defined a datatype for storing the
lexing information, called a \emph{value} or
sometimes a \emph{lexical value}. Values and regular
expressions correspond to each other as illustrated in the following
table:
856d025dbc15 more Chengsong parents: 518 diff changeset	334
856d025dbc15 more Chengsong parents: 518 diff changeset	335	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	336	\begin{tabular}{c@{\hspace{20mm}}c}
856d025dbc15 more Chengsong parents: 518 diff changeset	337	\begin{tabular}{@{}rrl@{}}
856d025dbc15 more Chengsong parents: 518 diff changeset	338	\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
856d025dbc15 more Chengsong parents: 518 diff changeset	339	$r$ & $::=$ & $\ZERO$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	340	& $\mid$ & $\ONE$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	341	& $\mid$ & $c$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	342	& $\mid$ & $r_1 \cdot r_2$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	343	& $\mid$ & $r_1 + r_2$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	344	\\
856d025dbc15 more Chengsong parents: 518 diff changeset	345	& $\mid$ & $r^*$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	346	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	347	&
856d025dbc15 more Chengsong parents: 518 diff changeset	348	\begin{tabular}{@{\hspace{0mm}}rrl@{}}
856d025dbc15 more Chengsong parents: 518 diff changeset	349	\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
856d025dbc15 more Chengsong parents: 518 diff changeset	350	$v$ & $::=$ & \\
856d025dbc15 more Chengsong parents: 518 diff changeset	351	& & $\Empty$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	352	& $\mid$ & $\Char(c)$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	353	& $\mid$ & $\Seq\,v_1\, v_2$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	354	& $\mid$ & $\Left(v)$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	355	& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots,v_n]$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	357	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	358	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	359	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	360
\noindent
Building on Sulzmann and Lu's attempt to formalize the
notion of POSIX lexing rules~\parencite{Sulzmann2014},
Ausaf, Dyckhoff and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation defined recursively in a
natural deduction style.
Using these formally specified rules for what a POSIX match is,
they proved in Isabelle/HOL that the algorithm gives correct results.
856d025dbc15 more Chengsong parents: 518 diff changeset	370
But a correct result is still not enough:
we also want at least some degree of \textbf{efficiency}.
856d025dbc15 more Chengsong parents: 518 diff changeset	373
856d025dbc15 more Chengsong parents: 518 diff changeset	374
856d025dbc15 more Chengsong parents: 518 diff changeset	375
One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and vice versa.
Even for a regular expression matching a particular string, there can
still be more than one value corresponding to it.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
If we do not allow any empty iterations in its lexical values,
there are $n - 1$ ``splitting points'' on $s$ at which we can choose to
split or not, so that the sub-strings
segmented by the chosen splitting points form different iterations:
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	389	\begin{center}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	390	\begin{tabular}{lcr}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	391	$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$\\
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	392	$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$\\
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	393	$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$\\
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	394	& $\textit{etc}.$ &
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	395	\end{tabular}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	396	\end{center}
28751de4b4ba revised according to comments Chengsong parents: 526 diff changeset	397
Moreover, within each iteration there are still multiple ways to split
the sub-string between the two $a^*$s.
It is therefore not surprising that there are exponentially many distinct
lexical values for the regex and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
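
The growth can be made concrete with a small counting sketch. It assumes, as above, that every outer iteration consumes at least one character; an iteration of length $m$ can then be divided between the two $a^*$s in $m+1$ ways, and each $a^*$ has exactly one value for its share. The function name \texttt{count\_values} is chosen for this illustration only.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_values(n):
    """Count the lexical values of (a* . a*)* for the string a^n,
    assuming every outer star iteration is non-empty."""
    if n == 0:
        return 1  # the only value is Stars []
    # The first outer iteration consumes m >= 1 characters. These m
    # characters can be divided between the two a* in m + 1 ways; the
    # remaining n - m characters are counted recursively.
    return sum((m + 1) * count_values(n - m) for m in range(1, n + 1))


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, count_values(n))
```

Since the term for $m = 1$ alone contributes twice the count for $n-1$, the number of values at least doubles with every additional character.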
519 856d025dbc15 more Chengsong parents: 518 diff changeset	403
A lexer that aims at generating all possible values has an exponential
worst-case runtime. It is therefore impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which intuitively always
\begin{itemize}
\item
match the leftmost regular expression when multiple options for matching
are available, and
\item
match a subpart as much as possible before proceeding
to the next token.
\end{itemize}
856d025dbc15 more Chengsong parents: 518 diff changeset	416
856d025dbc15 more Chengsong parents: 518 diff changeset	417
For instance, the regular expression and string above have the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output of the algorithm we want is a POSIX match
encoded as a value.
The reason why we are interested in $\POSIX$ values is that they can
be used in practice in the lexing phase of a compiler front end.
For instance, when lexing a code snippet
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword.
856d025dbc15 more Chengsong parents: 518 diff changeset	427
856d025dbc15 more Chengsong parents: 518 diff changeset	428	The contribution of Sulzmann and Lu is an extension of Brzozowski's
856d025dbc15 more Chengsong parents: 518 diff changeset	429	algorithm by a second phase (the first phase being building successive
856d025dbc15 more Chengsong parents: 518 diff changeset	430	derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
856d025dbc15 more Chengsong parents: 518 diff changeset	431	is generated in case the regular expression matches the string.
856d025dbc15 more Chengsong parents: 518 diff changeset	432	Pictorially, the Sulzmann and Lu algorithm is as follows:
856d025dbc15 more Chengsong parents: 518 diff changeset	433
856d025dbc15 more Chengsong parents: 518 diff changeset	434	\begin{ceqn}
856d025dbc15 more Chengsong parents: 518 diff changeset	435	\begin{equation}\label{graph:2}
856d025dbc15 more Chengsong parents: 518 diff changeset	436	\begin{tikzcd}
856d025dbc15 more Chengsong parents: 518 diff changeset	437	r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
856d025dbc15 more Chengsong parents: 518 diff changeset	438	v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
856d025dbc15 more Chengsong parents: 518 diff changeset	439	\end{tikzcd}
856d025dbc15 more Chengsong parents: 518 diff changeset	440	\end{equation}
856d025dbc15 more Chengsong parents: 518 diff changeset	441	\end{ceqn}
856d025dbc15 more Chengsong parents: 518 diff changeset	442
856d025dbc15 more Chengsong parents: 518 diff changeset	443
856d025dbc15 more Chengsong parents: 518 diff changeset	444	\noindent
856d025dbc15 more Chengsong parents: 518 diff changeset	445	For convenience, we shall employ the following notations: the regular
856d025dbc15 more Chengsong parents: 518 diff changeset	446	expression we start with is $r_0$, and the given string $s$ is composed
856d025dbc15 more Chengsong parents: 518 diff changeset	447	of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
856d025dbc15 more Chengsong parents: 518 diff changeset	448	left to right, we build the derivatives $r_1$, $r_2$, \ldots according
856d025dbc15 more Chengsong parents: 518 diff changeset	449	to the characters $c_0$, $c_1$ until we exhaust the string and obtain
856d025dbc15 more Chengsong parents: 518 diff changeset	450	the derivative $r_n$. We test whether this derivative is
856d025dbc15 more Chengsong parents: 518 diff changeset	451	$\textit{nullable}$ or not. If not, we know the string does not match
$r_0$ and no value needs to be generated. If yes, we start building the
856d025dbc15 more Chengsong parents: 518 diff changeset	453	values incrementally by \emph{injecting} back the characters into the
856d025dbc15 more Chengsong parents: 518 diff changeset	454	earlier values $v_n, \ldots, v_0$. This is the second phase of the
856d025dbc15 more Chengsong parents: 518 diff changeset	455	algorithm from the right to left. For the first value $v_n$, we call the
856d025dbc15 more Chengsong parents: 518 diff changeset	456	function $\textit{mkeps}$, which builds a POSIX lexical value
856d025dbc15 more Chengsong parents: 518 diff changeset	457	for how the empty string has been matched by the (nullable) regular
856d025dbc15 more Chengsong parents: 518 diff changeset	458	expression $r_n$. This function is defined as
856d025dbc15 more Chengsong parents: 518 diff changeset	459
856d025dbc15 more Chengsong parents: 518 diff changeset	460	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	461	\begin{tabular}{lcl}
856d025dbc15 more Chengsong parents: 518 diff changeset	462	$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	463	$\mkeps(r_{1}+r_{2})$ & $\dn$
856d025dbc15 more Chengsong parents: 518 diff changeset	464	& \textit{if} $\nullable(r_{1})$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	465	& & \textit{then} $\Left(\mkeps(r_{1}))$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	466	& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps(r_1))\,(\mkeps(r_2))$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
856d025dbc15 more Chengsong parents: 518 diff changeset	469	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	470	\end{center}
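
As an illustration (the regular expression is chosen here purely as an example), unfolding the clauses for $(\ONE + a)\cdot b^*$ gives
\begin{center}
\begin{tabular}{lcl}
$\mkeps((\ONE + a)\cdot b^*)$ & $=$ & $\Seq\,(\mkeps(\ONE + a))\,(\mkeps(b^*))$\\
 & $=$ & $\Seq\,(\Left(\mkeps(\ONE)))\,(\Stars\,[])$\\
 & $=$ & $\Seq\,(\Left(\Empty))\,(\Stars\,[])$
\end{tabular}
\end{center}
\noindent
The clause for alternatives tests $r_1$ first, so whenever both branches are nullable the left one is chosen, in line with the POSIX preference for the leftmost alternative.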
856d025dbc15 more Chengsong parents: 518 diff changeset	471
856d025dbc15 more Chengsong parents: 518 diff changeset	472
856d025dbc15 more Chengsong parents: 518 diff changeset	473	\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The POSIX property of the value is maintained throughout the process.
For this, Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first is the regular
expression ${r_{i-1}}$ before the character is chopped off, the second
is the character ${c_{i-1}}$ we want to inject, and the
third is the value ${v_i}$ into which we want to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:
468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	489
519 856d025dbc15 more Chengsong parents: 518 diff changeset	490	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	491	\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\mkeps(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	499	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	500	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	501
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses essentially do one thing: they identify the ``hole'' in the
value into which the character is injected back.
For instance, consider the last clause, which injects a character back into a value
that then becomes a star value corresponding to a star regular expression.
We know the value we inject into must be a sequence value, and that the first
value of that sequence corresponds to the child regex of the star
with the first character chopped off, that is, an iteration of the star
that had just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value in which this new iteration
is added to the front of the previous list of iterations, all under the $\Stars$
top level.
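
The two phases can be put together in a short executable sketch. It is only an illustration under stated assumptions: regular expressions and values are encoded as tagged tuples, the derivative \texttt{der} is the standard Brzozowski derivative without simplification, and all function names are chosen for this example rather than taken from an existing implementation.

```python
# Regular expressions and values as tagged tuples (an encoding assumed
# for this sketch only).
ZERO, ONE = ("ZERO",), ("ONE",)
def CHAR(c): return ("CHAR", c)
def ALT(r1, r2): return ("ALT", r1, r2)
def SEQ(r1, r2): return ("SEQ", r1, r2)
def STAR(r): return ("STAR", r)

def nullable(r):
    tag = r[0]
    if tag in ("ONE", "STAR"): return True
    if tag == "ALT": return nullable(r[1]) or nullable(r[2])
    if tag == "SEQ": return nullable(r[1]) and nullable(r[2])
    return False  # ZERO and CHAR

def der(c, r):
    """Derivative of r with respect to the character c (no simplification)."""
    tag = r[0]
    if tag == "CHAR": return ONE if r[1] == c else ZERO
    if tag == "ALT": return ALT(der(c, r[1]), der(c, r[2]))
    if tag == "SEQ":
        left = SEQ(der(c, r[1]), r[2])
        return ALT(left, der(c, r[2])) if nullable(r[1]) else left
    if tag == "STAR": return SEQ(der(c, r[1]), r)
    return ZERO  # ZERO and ONE

def mkeps(r):
    """Value for how the nullable regex r matches the empty string."""
    tag = r[0]
    if tag == "ONE": return ("Empty",)
    if tag == "ALT":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if tag == "SEQ": return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "STAR": return ("Stars", [])
    raise ValueError("mkeps: regular expression is not nullable")

def inj(r, c, v):
    """Inject c back into v, a value for der(c, r); the result is a value for r."""
    tag = r[0]
    if tag == "CHAR": return ("Char", c)  # v is Empty here
    if tag == "ALT":
        if v[0] == "Left": return ("Left", inj(r[1], c, v[1]))
        return ("Right", inj(r[2], c, v[1]))
    if tag == "SEQ":
        if v[0] == "Seq": return ("Seq", inj(r[1], c, v[1]), v[2])
        if v[0] == "Left": return ("Seq", inj(r[1], c, v[1][1]), v[1][2])
        return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))  # v is Right(v')
    if tag == "STAR":  # v is Seq(v', Stars vs)
        return ("Stars", [inj(r[1], c, v[1])] + v[2][1])
    raise ValueError("inj: regex and value do not fit together")

def lexer(r, s):
    """Return a value for how r matches s, or None if s is not matched."""
    if s == "":
        return mkeps(r) if nullable(r) else None
    v = lexer(der(s[0], r), s[1:])
    return None if v is None else inj(r, s[0], v)

if __name__ == "__main__":
    print(lexer(STAR(CHAR("a")), "aa"))
```

The recursion in \texttt{lexer} mirrors the diagram: the recursive call descends through the derivatives from left to right, and the injections are applied while returning, from right to left.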
856d025dbc15 more Chengsong parents: 518 diff changeset	516
We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well, since by definition
they reflect the size of the regular expressions.

One can introduce simplification on the regexes and values, but one has to
be careful not to break correctness, as the injection
function relies heavily on the structure of the regexes and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications in
each run during the injection phase.
One can then prove that the POSIX value of how
regular expressions match strings is not affected, although this is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
856d025dbc15 more Chengsong parents: 518 diff changeset	533
856d025dbc15 more Chengsong parents: 518 diff changeset	534
856d025dbc15 more Chengsong parents: 518 diff changeset	535
856d025dbc15 more Chengsong parents: 518 diff changeset	536	%Brzozowski, after giving the derivatives and simplification,
856d025dbc15 more Chengsong parents: 518 diff changeset	537	%did not explore lexing with simplification or he may well be
856d025dbc15 more Chengsong parents: 518 diff changeset	538	%stuck on an efficient simplificaiton with a proof.
856d025dbc15 more Chengsong parents: 518 diff changeset	539	%He went on to explore the use of derivatives together with
856d025dbc15 more Chengsong parents: 518 diff changeset	540	%automaton, and did not try lexing using derivatives.
856d025dbc15 more Chengsong parents: 518 diff changeset	541
We want to get rid of the complex and fragile rectification of values.
Can we avoid creating the intermediate values $v_1,\ldots, v_n$
and instead obtain the lexing information, which should already be there while
taking derivatives, in one pass without a second phase of injection?
At the same time, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding information to the
regular expressions in the form of \emph{bitcodes}.
856d025dbc15 more Chengsong parents: 518 diff changeset	552
856d025dbc15 more Chengsong parents: 518 diff changeset	553	\subsection*{Bit-coded Algorithm}
856d025dbc15 more Chengsong parents: 518 diff changeset	554	Bits and bitcodes (lists of bits) are defined as:
856d025dbc15 more Chengsong parents: 518 diff changeset	555
856d025dbc15 more Chengsong parents: 518 diff changeset	556	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	557	$b ::= 1 \mid 0 \qquad
856d025dbc15 more Chengsong parents: 518 diff changeset	558	bs ::= [] \mid b::bs
856d025dbc15 more Chengsong parents: 518 diff changeset	559	$
856d025dbc15 more Chengsong parents: 518 diff changeset	560	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	561
856d025dbc15 more Chengsong parents: 518 diff changeset	562	\noindent
856d025dbc15 more Chengsong parents: 518 diff changeset	563	The $1$ and $0$ are not in bold in order to avoid
856d025dbc15 more Chengsong parents: 518 diff changeset	564	confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
856d025dbc15 more Chengsong parents: 518 diff changeset	565	bit-lists) can be used to encode values (or potentially incomplete values) in a
856d025dbc15 more Chengsong parents: 518 diff changeset	566	compact form. This can be straightforwardly seen in the following
856d025dbc15 more Chengsong parents: 518 diff changeset	567	coding function from values to bitcodes:
856d025dbc15 more Chengsong parents: 518 diff changeset	568
856d025dbc15 more Chengsong parents: 518 diff changeset	569	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	570	\begin{tabular}{lcl}
856d025dbc15 more Chengsong parents: 518 diff changeset	571	$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	572	$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	573	$\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	574	$\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	575	$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	576	$\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	577	$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
856d025dbc15 more Chengsong parents: 518 diff changeset	578	code(\Stars\,vs)$
856d025dbc15 more Chengsong parents: 518 diff changeset	579	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	580	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	581
856d025dbc15 more Chengsong parents: 518 diff changeset	582	\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small bitcodes can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we
need the corresponding regular expression as extra information. This
means the decoding function is defined as:
856d025dbc15 more Chengsong parents: 518 diff changeset	595
856d025dbc15 more Chengsong parents: 518 diff changeset	596
856d025dbc15 more Chengsong parents: 518 diff changeset	597	%\begin{definition}[Bitdecoding of Values]\mbox{}
856d025dbc15 more Chengsong parents: 518 diff changeset	598	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	599	\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
856d025dbc15 more Chengsong parents: 518 diff changeset	600	$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	601	$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	602	$\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	603	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
856d025dbc15 more Chengsong parents: 518 diff changeset	604	(\Left\,v, bs_1)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	605	$\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	606	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
856d025dbc15 more Chengsong parents: 518 diff changeset	607	(\Right\,v, bs_1)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	608	$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	609	$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	610	& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	611	& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	612	$\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	613	$\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	614	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	615	& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
& & \hspace{35mm}$\textit{in}\;(\Stars\,(v\!::\!vs), bs_2)$\bigskip\\
856d025dbc15 more Chengsong parents: 518 diff changeset	617
856d025dbc15 more Chengsong parents: 518 diff changeset	618	$\textit{decode}\,bs\,r$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	619	$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	620	& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
856d025dbc15 more Chengsong parents: 518 diff changeset	621	\textit{else}\;\textit{None}$
856d025dbc15 more Chengsong parents: 518 diff changeset	622	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	623	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	624	%\end{definition}
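
\noindent
As a small illustration (our own example, worked by hand from the clauses above),
take $r = (a + b)^*$ and the bitsequence $[1,0,1,1,0]$. Reading the bits
from left to right, $\textit{decode}'$ proceeds as follows:

\begin{center}
\begin{tabular}{lcl}
$\textit{decode}'\,[1,0,1,1,0]\,(a+b)^*$ & & the leading $1$ starts a star iteration\\
$\quad\textit{decode}'\,[0,1,1,0]\,(a+b)$ & $=$ & $(\Left\,(\Char\,a),\, [1,1,0])$\\
$\quad\textit{decode}'\,[1,1,0]\,(a+b)^*$ & & the leading $1$ starts another iteration\\
$\qquad\textit{decode}'\,[1,0]\,(a+b)$ & $=$ & $(\Right\,(\Char\,b),\, [0])$\\
$\qquad\textit{decode}'\,[0]\,(a+b)^*$ & $=$ & $(\Stars\,[],\, [])$\\
\end{tabular}
\end{center}

\noindent
Putting the results together gives
$\textit{decode}\,[1,0,1,1,0]\,(a+b)^* =
\textit{Some}\,(\Stars\,[\Left\,(\Char\,a),\,\Right\,(\Char\,b)])$,
since no bits are left over.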
856d025dbc15 more Chengsong parents: 518 diff changeset	625
Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
856d025dbc15 more Chengsong parents: 518 diff changeset	628	\emph{Annotated regular expressions} are defined by the following
856d025dbc15 more Chengsong parents: 518 diff changeset	629	grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}
856d025dbc15 more Chengsong parents: 518 diff changeset	630
856d025dbc15 more Chengsong parents: 518 diff changeset	631	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	632	\begin{tabular}{lcl}
856d025dbc15 more Chengsong parents: 518 diff changeset	633	$\textit{a}$ & $::=$ & $\ZERO$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	634	& $\mid$ & $_{bs}\ONE$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	635	& $\mid$ & $_{bs}{\bf c}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	636	& $\mid$ & $_{bs}\sum\,as$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	637	& $\mid$ & $_{bs}a_1\cdot a_2$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	638	& $\mid$ & $_{bs}a^*$
856d025dbc15 more Chengsong parents: 518 diff changeset	639	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	640	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	641	%(in \textit{ALTS})
856d025dbc15 more Chengsong parents: 518 diff changeset	642
856d025dbc15 more Chengsong parents: 518 diff changeset	643	\noindent
856d025dbc15 more Chengsong parents: 518 diff changeset	644	where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular
856d025dbc15 more Chengsong parents: 518 diff changeset	645	expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
856d025dbc15 more Chengsong parents: 518 diff changeset	648	We will show that these bitcodes encode information about
856d025dbc15 more Chengsong parents: 518 diff changeset	649	the (POSIX) value that should be generated by the Sulzmann and Lu
856d025dbc15 more Chengsong parents: 518 diff changeset	650	algorithm.
856d025dbc15 more Chengsong parents: 518 diff changeset	651
856d025dbc15 more Chengsong parents: 518 diff changeset	652
856d025dbc15 more Chengsong parents: 518 diff changeset	653	To do lexing using annotated regular expressions, we shall first
856d025dbc15 more Chengsong parents: 518 diff changeset	654	transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
is defined as follows:
856d025dbc15 more Chengsong parents: 518 diff changeset	657
856d025dbc15 more Chengsong parents: 518 diff changeset	658	%\begin{definition}
856d025dbc15 more Chengsong parents: 518 diff changeset	659	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	660	\begin{tabular}{lcl}
856d025dbc15 more Chengsong parents: 518 diff changeset	661	$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	662	$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	663	$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	664	$(r_1 + r_2)^\uparrow$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	665	$_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
856d025dbc15 more Chengsong parents: 518 diff changeset	666	\textit{fuse}\,[1]\,r_2^\uparrow]$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	667	$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	668	$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	669	$(r^*)^\uparrow$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	670	$_{[]}(r^\uparrow)^*$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	671	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	672	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	673	%\end{definition}
856d025dbc15 more Chengsong parents: 518 diff changeset	674
856d025dbc15 more Chengsong parents: 518 diff changeset	675	\noindent
856d025dbc15 more Chengsong parents: 518 diff changeset	676	We use up arrows here to indicate that the basic un-annotated regular
856d025dbc15 more Chengsong parents: 518 diff changeset	677	expressions are ``lifted up'' into something slightly more complex. In the
856d025dbc15 more Chengsong parents: 518 diff changeset	678	fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
856d025dbc15 more Chengsong parents: 518 diff changeset	679	attach bits to the front of an annotated regular expression. Its
856d025dbc15 more Chengsong parents: 518 diff changeset	680	definition is as follows:
856d025dbc15 more Chengsong parents: 518 diff changeset	681
856d025dbc15 more Chengsong parents: 518 diff changeset	682	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	683	\begin{tabular}{lcl}
856d025dbc15 more Chengsong parents: 518 diff changeset	684	$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	685	$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	686	$_{bs @ bs'}\ONE$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	687	$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	688	$_{bs@bs'}{\bf c}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	689	$\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	690	$_{bs@bs'}\sum\textit{as}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	691	$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	692	$_{bs@bs'}a_1 \cdot a_2$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	693	$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	694	$_{bs @ bs'}a^*$
856d025dbc15 more Chengsong parents: 518 diff changeset	695	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	696	\end{center}
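
\noindent
For example (an illustration of ours, obtained by unfolding the two
definitions above), internalising $(\ONE + a)\cdot b$ gives

\begin{center}
\begin{tabular}{lcl}
$((\ONE + a)\cdot b)^\uparrow$ & $=$ &
$_{[]}(\ONE + a)^\uparrow \cdot b^\uparrow$\\
& $=$ & $_{[]}\bigl({}_{[]}\sum[\textit{fuse}\,[0]\,(_{[]}\ONE),\;
\textit{fuse}\,[1]\,(_{[]}{\bf a})]\bigr)\cdot {}_{[]}{\bf b}$\\
& $=$ & $_{[]}\bigl({}_{[]}\sum[{}_{[0]}\ONE,\; {}_{[1]}{\bf a}]\bigr)
\cdot {}_{[]}{\bf b}$
\end{tabular}
\end{center}

\noindent
The bits $0$ and $1$ now record, inside the regular expression itself,
which branch of the alternative a match goes through.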
856d025dbc15 more Chengsong parents: 518 diff changeset	697
856d025dbc15 more Chengsong parents: 518 diff changeset	698	\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:
856d025dbc15 more Chengsong parents: 518 diff changeset	704
856d025dbc15 more Chengsong parents: 518 diff changeset	705
856d025dbc15 more Chengsong parents: 518 diff changeset	706	\iffalse
856d025dbc15 more Chengsong parents: 518 diff changeset	707	%\begin{definition}{bder}
856d025dbc15 more Chengsong parents: 518 diff changeset	708	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	709	\begin{tabular}{@{}lcl@{}}
856d025dbc15 more Chengsong parents: 518 diff changeset	710	$(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	711	$(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	712	$(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	713	$\textit{if}\;c=d\; \;\textit{then}\;
856d025dbc15 more Chengsong parents: 518 diff changeset	714	\textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	715	$(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	716	$\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	717	$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	718	$\textit{if}\;\textit{bnullable}\,a_1$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	719	& &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	720	& &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	721	& &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	722	$(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	723	$\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
856d025dbc15 more Chengsong parents: 518 diff changeset	724	(\textit{STAR}\,[]\,r)$
856d025dbc15 more Chengsong parents: 518 diff changeset	725	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	726	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	727	%\end{definition}
856d025dbc15 more Chengsong parents: 518 diff changeset	728
856d025dbc15 more Chengsong parents: 518 diff changeset	729	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	730	\begin{tabular}{@{}lcl@{}}
856d025dbc15 more Chengsong parents: 518 diff changeset	731	$(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	732	$(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	733	$(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	734	$\textit{if}\;c=d\; \;\textit{then}\;
856d025dbc15 more Chengsong parents: 518 diff changeset	735	_{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	736	$(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	737	$_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	738	$(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	739	$\textit{if}\;\textit{bnullable}\,a_1$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	740	& &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	741	& &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	742	& &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	743	$(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	744	$_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
856d025dbc15 more Chengsong parents: 518 diff changeset	745	(_{bs}\textit{STAR}\,[]\,r)$
856d025dbc15 more Chengsong parents: 518 diff changeset	746	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	747	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	748	%\end{definition}
856d025dbc15 more Chengsong parents: 518 diff changeset	749	\fi
856d025dbc15 more Chengsong parents: 518 diff changeset	750
856d025dbc15 more Chengsong parents: 518 diff changeset	751	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	752	\begin{tabular}{@{}lcl@{}}
856d025dbc15 more Chengsong parents: 518 diff changeset	753	$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	754	$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	755	$(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	756	$\textit{if}\;c=d\; \;\textit{then}\;
856d025dbc15 more Chengsong parents: 518 diff changeset	757	_{bs}\ONE\;\textit{else}\;\ZERO$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	758	$(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	759	$_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	760	$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	761	$\textit{if}\;\textit{bnullable}\,a_1$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	762	& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	763	& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	764	& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
$_{bs}(\textit{fuse}\, [1] \; (a\,\backslash c))\cdot
(_{[]}a^*)$
\end{tabular}
\end{center}

%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence
and attach an additional bit $1$ to the front of $a \backslash c$
to indicate one more star iteration; this is the bit on which
$\textit{decode}'$ starts decoding another iteration of a star.
The sequence clause
is more subtle. When $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions, therefore we omit the
definition), the character $c$ may be consumed either by $a_1$ or by
$a_2$, so the derivative becomes an alternative with two branches.
Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later).
Then the right branch of the alternative, which is
$\textit{fuse} \; (\textit{bmkeps} \; a_1)\; (a_2 \backslash c)$,
collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing to it. The
bitsequence $\textit{bs}$, which was initially attached to the
sequence $a_1 \cdot a_2$, has
now been elevated to the top level of $\sum$, as this information will be
needed whichever way the sequence is matched, no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:
856d025dbc15 more Chengsong parents: 518 diff changeset	794
856d025dbc15 more Chengsong parents: 518 diff changeset	795
856d025dbc15 more Chengsong parents: 518 diff changeset	796	%\begin{definition}[\textit{bmkeps}]\mbox{}
856d025dbc15 more Chengsong parents: 518 diff changeset	797	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	798	\begin{tabular}{lcl}
856d025dbc15 more Chengsong parents: 518 diff changeset	799	$\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	800	$\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	801	$\textit{if}\;\textit{bnullable}\,a$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	802	& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
& &$\textit{else}\;\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	804	$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	805	$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	806	$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	807	$bs \,@\, [0]$
856d025dbc15 more Chengsong parents: 518 diff changeset	808	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	809	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	810	%\end{definition}
856d025dbc15 more Chengsong parents: 518 diff changeset	811
856d025dbc15 more Chengsong parents: 518 diff changeset	812	\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value,
collecting all the bitcodes, and using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:
856d025dbc15 more Chengsong parents: 518 diff changeset	819
856d025dbc15 more Chengsong parents: 518 diff changeset	820	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	821	\begin{tabular}{lcl}
856d025dbc15 more Chengsong parents: 518 diff changeset	822	$\textit{blexer}\;r\,s$ & $\dn$ &
856d025dbc15 more Chengsong parents: 518 diff changeset	823	$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	824	& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	825	& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	826	& & $\;\;\textit{else}\;\textit{None}$
856d025dbc15 more Chengsong parents: 518 diff changeset	827	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	828	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	829
856d025dbc15 more Chengsong parents: 518 diff changeset	830	\noindent
856d025dbc15 more Chengsong parents: 518 diff changeset	831	In this definition $\_\backslash s$ is the generalisation of the derivative
856d025dbc15 more Chengsong parents: 518 diff changeset	832	operation from characters to strings (just like the derivatives for un-annotated
856d025dbc15 more Chengsong parents: 518 diff changeset	833	regular expressions).
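
\noindent
To see how the pieces fit together, here is a small run of
$\textit{blexer}$ that we worked out by hand from the definitions above
(the example is ours). Let $r = (\ONE + a)\cdot b$ and $s = b$, and
abbreviate $a_1 = {}_{[]}\sum[{}_{[0]}\ONE,\, {}_{[1]}{\bf a}]$ and
$a_2 = {}_{[]}{\bf b}$, so that $r^\uparrow = {}_{[]}a_1\cdot a_2$.
Since $a_1$ is $\textit{bnullable}$, the sequence clause of the
derivative produces an alternative:

\begin{center}
\begin{tabular}{lcl}
$a_1\backslash b$ & $=$ & $_{[]}\sum[\ZERO,\,\ZERO]$\\
$\textit{bmkeps}\,a_1$ & $=$ & $[0]$\\
$a_2\backslash b$ & $=$ & $_{[]}\ONE$\\
$r^\uparrow\backslash b$ & $=$ &
$_{[]}\sum[{}_{[]}(a_1\backslash b)\cdot a_2,\;
\textit{fuse}\,[0]\,(_{[]}\ONE)]$\\
& $=$ & $_{[]}\sum[{}_{[]}({}_{[]}\sum[\ZERO,\,\ZERO])\cdot {}_{[]}{\bf b},\;
{}_{[0]}\ONE]$
\end{tabular}
\end{center}

\noindent
The result is $\textit{bnullable}$ because of its second branch. The
first branch is not, so $\textit{bmkeps}$ skips it and returns $[0]$,
the bits stored on ${}_{[0]}\ONE$. Decoding then gives
$\textit{decode}\,[0]\,r =
\textit{Some}\,(\Seq\,(\Left\,\Empty)\,(\Char\,b))$: the bit $0$ selects
the left branch $\ONE$ of the alternative, and $b$ needs no bits.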
856d025dbc15 more Chengsong parents: 518 diff changeset	834
Now we introduce the simplifications, which are the reason we introduced
the bitcodes in the first place.
856d025dbc15 more Chengsong parents: 518 diff changeset	837
856d025dbc15 more Chengsong parents: 518 diff changeset	838	\subsection*{Simplification Rules}
856d025dbc15 more Chengsong parents: 518 diff changeset	839
This section introduces simplification rules on annotated regular
expressions that are aggressive in terms of size reduction
and keep derivatives small. Such simplifications are promising:
the test data we have generated suggest
that a tight bound can be achieved, although we could only
partially cover the search space, as there are infinitely many regular
expressions and strings.
856d025dbc15 more Chengsong parents: 518 diff changeset	848
856d025dbc15 more Chengsong parents: 518 diff changeset	849	One modification we introduced is to allow a list of annotated regular
856d025dbc15 more Chengsong parents: 518 diff changeset	850	expressions in the $\sum$ constructor. This allows us to not just
856d025dbc15 more Chengsong parents: 518 diff changeset	851	delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
856d025dbc15 more Chengsong parents: 518 diff changeset	852	also unnecessary ``copies'' of regular expressions (very similar to
856d025dbc15 more Chengsong parents: 518 diff changeset	853	simplifying $r + r$ to just $r$, but in a more general setting). Another
856d025dbc15 more Chengsong parents: 518 diff changeset	854	modification is that we use simplification rules inspired by Antimirov's
856d025dbc15 more Chengsong parents: 518 diff changeset	855	work on partial derivatives. They maintain the idea that only the first
856d025dbc15 more Chengsong parents: 518 diff changeset	856	``copy'' of a regular expression in an alternative contributes to the
856d025dbc15 more Chengsong parents: 518 diff changeset	857	calculation of a POSIX value. All subsequent copies can be pruned away from
856d025dbc15 more Chengsong parents: 518 diff changeset	858	the regular expression. A recursive definition of our simplification function
856d025dbc15 more Chengsong parents: 518 diff changeset	859	that looks somewhat similar to our Scala code is given below:
856d025dbc15 more Chengsong parents: 518 diff changeset	860	%\comment{Use $\ZERO$, $\ONE$ and so on.
856d025dbc15 more Chengsong parents: 518 diff changeset	861	%Is it $ALTS$ or $ALTS$?}\\
856d025dbc15 more Chengsong parents: 518 diff changeset	862
856d025dbc15 more Chengsong parents: 518 diff changeset	863	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	864	\begin{tabular}{@{}lcl@{}}
856d025dbc15 more Chengsong parents: 518 diff changeset	865
856d025dbc15 more Chengsong parents: 518 diff changeset	866	$\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	867	&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	868	&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	869	&&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	870	&&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	871	&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	872
856d025dbc15 more Chengsong parents: 518 diff changeset	873	$\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	874	&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse}\;bs\;a$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	876	&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
856d025dbc15 more Chengsong parents: 518 diff changeset	877
856d025dbc15 more Chengsong parents: 518 diff changeset	878	$\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
856d025dbc15 more Chengsong parents: 518 diff changeset	879	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	880	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	881
856d025dbc15 more Chengsong parents: 518 diff changeset	882	\noindent
The simplification function pattern-matches on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies the child regular expressions
recursively and then checks whether one of the children has turned into
$\ZERO$ or $\ONE$, which might trigger further simplification at the
current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurrence of each duplicated
regular expression and removes all later ones. Function
$\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:
856d025dbc15 more Chengsong parents: 518 diff changeset	894
856d025dbc15 more Chengsong parents: 518 diff changeset	895	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	896	\begin{tabular}{@{}lcl@{}}
856d025dbc15 more Chengsong parents: 518 diff changeset	897	$\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
856d025dbc15 more Chengsong parents: 518 diff changeset	898	(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	899	$\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\
$\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)\\
$\textit{flatten} \; []$ & $\dn$ & $[]$
856d025dbc15 more Chengsong parents: 518 diff changeset	901	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	902	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	903
856d025dbc15 more Chengsong parents: 518 diff changeset	904	\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s and fuses the bitcode of each
nested $\sum$ onto its children. In terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
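
\noindent
As an illustration (our own example, computed by hand from the clauses
above), consider $\textit{simp}$ applied to
${}_{[]}\sum[{}_{[0]}\sum[{}_{[0]}{\bf a},\, {}_{[1]}{\bf b}],\; \ZERO,\;
{}_{[0,0]}{\bf a}]$. Simplifying the three children leaves them
unchanged, and the remaining steps are:

\begin{center}
\begin{tabular}{lcl}
$\textit{flatten}\,[{}_{[0]}\sum[{}_{[0]}{\bf a},\, {}_{[1]}{\bf b}],\; \ZERO,\;
{}_{[0,0]}{\bf a}]$ & $=$ &
$[{}_{[0,0]}{\bf a},\; {}_{[0,1]}{\bf b},\; {}_{[0,0]}{\bf a}]$\\
$\textit{distinct}\,[{}_{[0,0]}{\bf a},\; {}_{[0,1]}{\bf b},\; {}_{[0,0]}{\bf a}]$
& $=$ & $[{}_{[0,0]}{\bf a},\; {}_{[0,1]}{\bf b}]$\\
result of $\textit{simp}$ & $=$ &
${}_{[]}\sum[{}_{[0,0]}{\bf a},\; {}_{[0,1]}{\bf b}]$
\end{tabular}
\end{center}

\noindent
The nested alternative is opened up with its bitcode $[0]$ fused onto
both children, the $\ZERO$ disappears, and the second copy of
${}_{[0,0]}{\bf a}$ is pruned because an identical copy occurs earlier
in the list.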
856d025dbc15 more Chengsong parents: 518 diff changeset	908
Having defined the $\textit{simp}$ function,
we can extend the derivative w.r.t.~a character to a derivative
w.r.t.~a string in the same way as before, simplifying after each
character:%\comment{simp in the [] case?}
856d025dbc15 more Chengsong parents: 518 diff changeset	913
856d025dbc15 more Chengsong parents: 518 diff changeset	914	\begin{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	915	\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(\textit{simp}\,(r \backslash c)) \backslash_{simp}\, s$ \\
856d025dbc15 more Chengsong parents: 518 diff changeset	917	$r \backslash_{simp} [\,] $ & $\dn$ & $r$
856d025dbc15 more Chengsong parents: 518 diff changeset	918	\end{tabular}
856d025dbc15 more Chengsong parents: 518 diff changeset	919	\end{center}
856d025dbc15 more Chengsong parents: 518 diff changeset	920
856d025dbc15 more Chengsong parents: 518 diff changeset	921	\noindent
856d025dbc15 more Chengsong parents: 518 diff changeset	922	to obtain an optimised version of the algorithm:
856d025dbc15 more Chengsong parents: 518 diff changeset	923
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the size of the regular expression small.
For example, with this simplification the 8000 nodes of our previous
$(a + aa)^*$ example are reduced to just 6, and the size stays constant
no matter how long the input string is.
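
To illustrate how the two clauses of the string derivative interact,
the definition can be unfolded for a two-character string.
The characters $c_1$ and $c_2$ are placeholders introduced only for this illustration:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c_1\!::\!c_2\!::\![\,])$ & $=$ & $(r \backslash_{simp}\, c_1) \backslash_{simp}\, (c_2\!::\![\,])$\\
& $=$ & $((r \backslash_{simp}\, c_1) \backslash_{simp}\, c_2) \backslash_{simp}\, [\,]$\\
& $=$ & $(r \backslash_{simp}\, c_1) \backslash_{simp}\, c_2$
\end{tabular}
\end{center}

\noindent
Assuming that each character-level step $\backslash_{simp}$ simplifies its own result,
this unfolding suggests that simplification is interleaved with every derivative step
rather than applied once at the end.
Similarly, for the empty string the $\textit{let}$-binding in $\textit{blexer\_simp}$
reduces to $a = r^\uparrow$, so the lexer only tests $\textit{bnullable}(r^\uparrow)$.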

%-----------------------------------
%	SUBSECTION 1
%-----------------------------------
\section{Specifications of Certain Functions to be Used}
Here we give the definitions of some functions
that we will use later.
\begin{center}
\begin{tabular}{ccc}
$\retrieve \; (\ACHAR \, \textit{bs} \, c) \; \Char(c)$ & $=$ & $\textit{bs}$
\end{tabular}
\end{center}

author	Chengsong
	Mon, 30 May 2022 14:41:09 +0100
changeset 528	28751de4b4ba