% Chapter Template

\chapter{Regular Expressions and POSIX Lexing} % Main chapter title

\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
% used for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed.

In this chapter, we define the basic notions
for regular languages and regular expressions.
This is essentially a description in ``English''
of our formalisation in Isabelle/HOL.
We also give the definition of what $\POSIX$ lexing means,
followed by an algorithm by Sulzmann and Lu\parencite{Sulzmann2014}
that produces an output conforming
to the $\POSIX$ standard.
It is also worth mentioning that
we choose to use the ML-style notation
for function applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g.\ $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.
\section{Basic Concepts}
Usually, formal language theory starts with an alphabet
denoting a set of characters.
Here we just use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII characters.
In what follows, we shall leave the information about the alphabet
implicit.
Then using the usual bracket notation for lists,
we can define strings made up of characters:
\begin{center}
\begin{tabular}{lcl}
$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$
\end{tabular}
\end{center}
where $c$ is a variable ranging over characters.
Strings can be concatenated to form longer strings in the same
way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
We omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
\end{tabular}
\end{center}
We also call the above \emph{language concatenation}.
The power of a language is defined recursively, using the
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A @ A^n$
\end{tabular}
\end{center}
The union of all powers of a language
can be used to define the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
\end{tabular}
\end{center}

\noindent
However, to obtain a more convenient induction principle
in Isabelle/HOL,
we instead define the Kleene star
as an inductive set:

\begin{center}
\begin{mathpar}
\inferrule{\mbox{}}{[] \in A*}

\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
\end{mathpar}
\end{center}
\noindent
We also define an operation of ``chopping off'' a character from
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to ``chopping off'' a string from all strings within a set $A$,
namely:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
\end{tabular}
\end{center}
\noindent
which is essentially the left quotient of $A$ by
the singleton language $L = \{s\}$
in formal language theory.
However, for the purposes here, the $\textit{Ders}$ definition with
a single string is sufficient.
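
Since languages are in general infinite sets, $\Der$ and $\textit{Ders}$ are mathematical
definitions rather than programs; for \emph{finite} sets of strings, however, they can be
computed directly. The following small Scala sketch (ours, purely illustrative and not part
of the Isabelle/HOL formalisation) does exactly that, representing strings as \texttt{List[Char]}:
\begin{verbatim}
// Illustrative sketch only: Der and Ders on *finite* sets of strings.
object LangDer {
  type Str  = List[Char]
  type Lang = Set[Str]

  // Der c A = { s | c::s is in A }
  def Der(c: Char, A: Lang): Lang =
    A.collect { case d :: s if d == c => s }

  // Ders s A = { s' | s @ s' is in A }, obtained by iterating Der
  def Ders(s: Str, A: Lang): Lang =
    s.foldLeft(A)((acc, c) => Der(c, acc))

  def main(args: Array[String]): Unit = {
    val A: Lang = Set("abc".toList, "abd".toList, "b".toList)
    println(Der('a', A))           // Set(List(b, c), List(b, d))
    println(Ders("ab".toList, A))  // Set(List(c), List(d))
  }
}
\end{verbatim}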

With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
we have a few properties of how the language derivative can be defined using
sub-languages.
\begin{lemma}
\[
\Der \; c \; (A @ B) =
\begin{cases}
((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B), & \text{if} \;\, [] \in A \\
(\Der \; c \; A) \, @ \, B, & \text{otherwise}
\end{cases}
\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.
The derivative of the language $A*$ can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c\; A) @ (A*)$
\end{lemma}
\begin{proof}
There are two inclusions to prove:
\begin{itemize}
\item{$\subseteq$}:\\
The set
\[ \{s \mid c :: s \in A*\} \]
is included in the set
\[ \{s_1 @ s_2 \mid s_1,\, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
because whenever a string in the language of the Kleene star $A*$
starts with a character, that character together with some sub-string
immediately after it forms the first iteration,
and the rest of the string is still in $A*$.
\item{$\supseteq$}:\\
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
holds.
Moreover,
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
holds, where the right-hand side can be rewritten (using the previous lemma)
as \[ ((\Der \; c\; A) @ A*) \cup (\Der \; c \; (A*)), \]
which of course contains $(\Der \; c\; A) @ A*$.
\end{itemize}
\end{proof}

\noindent
Before we can define the counterparts of $\textit{Der}$ and $\textit{Ders}$
for regular expressions, we first need to give the definition of regular expressions themselves.

\subsection{Regular Expressions and Their Meaning}
The \emph{basic regular expressions} are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]
\noindent
We call them basic because we will introduce
additional constructors, such as negation
and bounded repetitions, in later chapters.
We use $\ZERO$ for the regular expression that
matches no string, and $\ONE$ for the regular
expression that matches only the empty string\footnote{
Some authors
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$,
but we prefer our notation.}.
The sequence regular expression is written $r_1\cdot r_2$
and sometimes we omit the dot if it is clear which
regular expression is meant; the alternative
is written $r_1 + r_2$.
The \emph{language} or meaning of
a regular expression is defined recursively as
a set of strings:
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\emptyset$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
$L \; r^*$ & $\dn$ & $ (L\;r)*$
\end{tabular}
\end{center}
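
As an aside, throughout this chapter we will occasionally give small Scala sketches of the
functions being defined. These sketches are ours, purely illustrative, and not part of the
Isabelle/HOL formalisation; all constructor and function names are our own choice. The basic
regular expressions, for instance, can be represented as an algebraic datatype:
\begin{verbatim}
// Illustrative sketch: the basic regular expressions as a Scala ADT.
sealed trait Rexp
case object ZERO extends Rexp                    // matches no string
case object ONE  extends Rexp                    // matches only the empty string
case class CHAR(c: Char) extends Rexp            // matches the string [c]
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class STAR(r: Rexp) extends Rexp            // r*
\end{verbatim}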
\noindent
Now with the semantic derivative of a language and regular expressions together with
their language interpretation in place, we are ready to define derivatives on regular expressions.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
%Recall, the language derivative acts on a set of strings
%and essentially chops off a particular character from
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
%so that after derivative $L(r\backslash c)$
%will look as if it was obtained by doing a language derivative on $L(r)$:
Recall that the semantic derivative acts on a set of
strings. Brzozowski noticed that this operation
can be ``mirrored'' on regular expressions, which
he called the derivative of a regular expression $r$
with respect to a character $c$, written
$r \backslash c$.
He defined this operation such that the following property holds:
\begin{center}
\[
L(r \backslash c) = \Der \; c \; L(r)
\]
\end{center}
\noindent
For example, in the sequence case we have
\begin{center}
\begin{tabular}{lcl}
$\Der \; c \; (A @ B)$ & $\dn$ &
$ \textit{if} \;\, [] \in A \;
\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
\Der \; c\; B$\\
& & $\textit{else}\; (\Der \; c \; A) @ B$\\
\end{tabular}
\end{center}
\noindent
This can be translated to
regular expressions in the following
manner:
\begin{center}
\begin{tabular}{lcl}
$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,[] \in L(r_1)\;\textit{then}\;(r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
\end{tabular}
\end{center}

\noindent
Similarly, the Kleene star's semantic derivative
can be expressed as
\[
\textit{Der} \;c \;(A*) = (\textit{Der}\; c\; A) @ (A*),
\]
which translates to
\[
(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
\]
In the above definition of $(r_1\cdot r_2) \backslash c$,
the $\textit{if}$ clause's
boolean condition
$[] \in L(r_1)$ needs to be
somehow recursively computed.
We call such a function that checks
whether the empty string $[]$ is
in the language of a regular expression $\nullable$:
\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The $\ZERO$ regular expression
does not contain any string and
therefore is not \emph{nullable}.
$\ONE$ is \emph{nullable}
by definition.
The character regular expression $c$
corresponds to the singleton language $\{[c]\}$,
and therefore does not contain the empty string.
The alternative regular expression is nullable
if at least one of its children is nullable.
The sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
is always nullable because it naturally
contains the empty string.
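
In the illustrative Scala sketch from above, $\nullable$ corresponds to the following
function (again ours, not the formalisation itself):
\begin{verbatim}
// Illustrative sketch: does r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}
\end{verbatim}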

The derivative function, written $r\backslash c$,
defines how a regular expression evolves into
a new one after all the strings in its language have a character ``chopped off'':
if a string starts with $c$, then the character is chopped off;
if not, that string is removed.
\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The most involved cases are the sequence case
and the star case.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
needs to be considered as well, as its derivative will contribute to the
result of this derivative.
The derivative of the star regular expression $r^*$
unwraps one iteration of $r$, turns it into $r\backslash c$,
and attaches the original $r^*$
after $r\backslash c$, so that
we can further unfold it as many times as needed.
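
Continuing the illustrative Scala sketch, the derivative operation corresponds to the
following function (ours, not the formalisation):
\begin{verbatim}
// Illustrative sketch: Brzozowski derivative of r with respect to c.
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}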
We have the following correspondence between
derivatives on regular expressions and
derivatives on a set of strings:
\begin{lemma}\label{derDer}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}

\noindent
The main property of the derivative operation
(that enables us to reason about the correctness of
derivative-based matching)
is

\begin{lemma}\label{derStepwise}
$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
\end{lemma}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity, we will
omit the subscript and use $\backslash$ instead
of $\backslash_s$ to denote
string derivatives for brevity.
Brzozowski's regular-expression matching algorithm can then be described as:

\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
\end{definition}

\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1\ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
It can be
relatively easily shown that this matcher is correct:
\begin{lemma}
$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
\end{lemma}
\begin{proof}
By the stepwise property of derivatives (lemma \ref{derStepwise})
and lemma \ref{derDer}.
\end{proof}
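
A direct (and, as we shall see, naive) Scala rendering of this matcher simply folds the
single-character derivative over the input string and tests nullability at the end
(illustrative sketch only):
\begin{verbatim}
// Illustrative sketch: the naive derivative-based matcher.
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(s: List[Char], r: Rexp): Boolean =
  nullable(ders(s, r))

// e.g. matching (a*)*b against "aaab":
// matches("aaab".toList, SEQ(STAR(STAR(CHAR('a'))), CHAR('b')))  returns true
\end{verbatim}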
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    legend entries={Naive Matcher},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
\end{figure}
\noindent
If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow, as shown in
Figure~\ref{NaiveMatcher}.
Note that the time axis is in logarithmic scale.
Around two dozen characters
already cause the matcher to explode on the regular expression
$(a^*)^*b$.
To alleviate this, we need to introduce certain
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
One simple-minded simplification function
that achieves these requirements is given below:
\begin{center}
\begin{tabular}{lcl}
$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
$\simp \; r$ & $\dn$ & $r \qquad (\textit{otherwise})$
\end{tabular}
\end{center}
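
In the Scala sketch, this simplification function might be written as follows
(ours, not the formalisation):
\begin{verbatim}
// Illustrative sketch: simple-minded simplification removing ZEROs and ONEs,
// without changing the language of the regular expression.
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, r2s) => r2s
    case (r1s, ONE) => r1s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s
    case (r1s, ZERO) => r1s
    case (r1s, r2s)  => ALT(r1s, r2s)
  }
  case _ => r
}
\end{verbatim}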
If we repeatedly apply this simplification
function during the matching algorithm,
we have a matcher with simplification:
\begin{center}
\begin{tabular}{lcl}
$\derssimp \; [] \; r$ & $\dn$ & $r$\\
$\derssimp \; (c :: cs) \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
\end{tabular}
\end{center}
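
In the Scala sketch, this corresponds to interleaving \texttt{simp} with the derivative
steps (again only an illustration):
\begin{verbatim}
// Illustrative sketch: derivatives interleaved with simplification.
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}

def matcherSimp(s: List[Char], r: Rexp): Boolean =
  nullable(dersSimp(s, r))
\end{verbatim}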
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    xmode = log,
    grid = both,
    legend entries={Matcher With Simp},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{$(a^*)^*b$
against
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ using $\textit{matcher}_{simp}$}\label{BetterMatcher}
\end{figure}
\noindent
The running time of $\textit{ders}\_\textit{simp}$
on the same example as in Figure~\ref{NaiveMatcher}
is now very tame with respect to the length of the input,
as shown in Figure~\ref{BetterMatcher}.

The matcher thus works by building derivatives and then testing whether
the empty string is in the resulting regular expression's language,
applying simplifications where necessary.
So far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
In this section, we present a two-phase regular expression lexing
algorithm.
The first phase takes successive derivatives with
respect to the input string,
and the second phase does the reverse, \emph{injecting} back
characters, while constructing a lexing result.
We will introduce the injection phase in detail slightly
later, but as a preliminary we have to first define
the datatype for lexing results,
called \emph{value} or
sometimes also \emph{lexical value}. Values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}
\noindent
A value has an underlying string, which
can be calculated by the ``flatten'' function $|\_|$:
\begin{center}
\begin{tabular}{lcl}
$|\Empty|$ & $\dn$ & $[]$\\
$|\Char \; c|$ & $ \dn$ & $ [c]$\\
$|\Seq(v_1, v_2)|$ & $ \dn$ & $ |v_1| @ |v_2|$\\
$|\Left(v)|$ & $ \dn$ & $ |v|$\\
$|\Right(v)|$ & $ \dn$ & $ |v|$\\
$|\Stars([])|$ & $\dn$ & $[]$\\
$|\Stars(v::vs)|$ & $\dn$ & $ |v| @ |\Stars(vs)|$
\end{tabular}
\end{center}
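
In the Scala sketch, values and the flatten function can be represented as follows. The
constructor names \texttt{Chr} and \texttt{Sequ} are chosen so as not to clash with Scala's
built-in \texttt{Char} and \texttt{Seq}; as before, none of this is part of the formalisation:
\begin{verbatim}
// Illustrative sketch: lexical values and the flatten function |_|.
sealed trait Val
case object Empty extends Val
case class Chr(c: Char) extends Val            // written Char(c) in the text
case class Sequ(v1: Val, v2: Val) extends Val  // written Seq(v1, v2) in the text
case class Left(v: Val) extends Val            // shadows Scala's Either constructors here
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// flatten: the underlying string of a value
def flatten(v: Val): List[Char] = v match {
  case Empty        => Nil
  case Chr(c)       => List(c)
  case Sequ(v1, v2) => flatten(v1) ++ flatten(v2)
  case Left(v1)     => flatten(v1)
  case Right(v1)    => flatten(v1)
  case Stars(vs)    => vs.flatMap(flatten)
}
\end{verbatim}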
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
to indicate that a value $v$ could be generated from a lexing algorithm
with input $r$. They call it the value inhabitation relation.
\begin{mathpar}
\inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}

\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}

\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}

\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}

\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}

\inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars(vs):r^*}
\end{mathpar}
\noindent
The condition $|v| \neq []$ in the premise of the star rule
is there to make sure that for a given pair of regular
expression $r$ and string $s$, the number of values
satisfying $|v| = s$ and $\vdash v:r$ is finite.
Given the same string and regular expression, there can be
multiple values for it. For example, both
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
and the values both flatten to $abc$.
Lexers therefore have to disambiguate and choose only
one of the values to output. $\POSIX$ is one of the
disambiguation strategies that is widely adopted.

Ausaf and Urban\parencite{AusafDyckhoffUrban2016}
formalised the property
as a ternary relation.
The $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
in the following set of rules\footnote{The names of the rules are used
as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
\noindent
\begin{figure}
\begin{mathpar}
\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}

\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}

\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}

\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}

\inferrule[PS]{(s_1, r_1) \rightarrow v_1 \\ (s_2, r_2)\rightarrow v_2\\
\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
\Seq \; v_1 \; v_2}

\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}

\inferrule[P*]{(s_1, r) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
(v::vs)}
\end{mathpar}
\caption{POSIX Lexing Rules}
\end{figure}
\noindent
The above $\POSIX$ rules follow the intuition described below:
\begin{itemize}
\item (Left Priority)\\
Match the leftmost regular expression when multiple options for matching
are available.
\item (Maximum Munch)\\
Always match a subpart as much as possible before proceeding
to the next token.
\end{itemize}
\noindent
These disambiguation strategies can be
quite practical.
For instance, when lexing the code snippet
\[
\textit{iffoo} = 3
\]
using the regular expression
\[
\textit{keyword} + \textit{identifier}
\]
(with keyword and identifier having their
usual definitions from any formal
language textbook; for instance,
an identifier is a nonempty string that starts with a letter
and is followed by alphanumeric characters or underscores),
we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword (if)
followed by
an identifier (foo).
POSIX lexing achieves this.

We know that a $\POSIX$ value for a regular expression $r$ is also an inhabitant of $r$:
\begin{lemma}
$(s, r) \rightarrow v \implies \vdash v: r$
\end{lemma}
\noindent
A good property of a $\POSIX$ value is that,
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for them:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
\end{lemma}
\begin{proof}
By induction on $s$, $r$ and $v_1$. The inductive cases
are all the POSIX rules.
Each case is proven by a combination of
the induction rules for $\POSIX$
values and the inductive hypothesis.
Probably the most cumbersome cases are
the sequence and star with non-empty iterations.

We give the reasoning about the sequence case as follows:
when we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$,
we know that there could not be a longer string $s_1'$ such that $(s_1', r_1) \rightarrow v_1'$
and $(s_2', r_2) \rightarrow v_2'$ and $s_1' @ s_2' = s$ all hold.
For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot
possibly form a $\POSIX$ value for $s$.
If we have some other values $v_1'$ and $v_2'$ such that
$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
then by the induction hypothesis $v_1' = v_1$ and $v_2'= v_2$,
which means this ``different'' $\POSIX$ value $\Seq(v_1', v_2')$
is the same as $\Seq(v_1, v_2)$.
\end{proof}
Now we know what a $\POSIX$ value is; the problem is how to
compute such a value in a lexing algorithm using derivatives.

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
& & & v_n
\end{tikzcd}
\end{equation}
\end{ceqn}

It tells us how an empty string can be matched by a
regular expression, in a $\POSIX$ way:

\begin{center}
\begin{tabular}{lcl}
$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\
$\mkeps \; r_{1}+r_{2}$ & $\dn$
& \textit{if} ($\nullable \; r_{1})$\\
& & \textit{then} $\Left (\mkeps \; r_{1})$\\
& & \textit{else} $\Right(\mkeps \; r_{2})$\\
$\mkeps \; r_1\cdot r_2$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
$\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$
\end{tabular}
\end{center}


\noindent
We favour the left child when matching the empty string, if there is a choice.
When a star is used to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.
The result of a call to $\mkeps$ on a $\nullable$ regular expression $r$ is
a $\POSIX$ value for $r$ and the empty string:
\begin{lemma}\label{mePosix}
$\nullable\; r \implies ([], r) \rightarrow (\mkeps\; r)$
\end{lemma}
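
In the Scala sketch, $\mkeps$ can be written as follows (ours, not the formalisation;
the match is deliberately partial, since $\mkeps$ is only ever applied to nullable
regular expressions):
\begin{verbatim}
// Illustrative sketch: mkeps builds the POSIX value for a nullable
// regular expression matching the empty string.
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}
\end{verbatim}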


After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$.
To do this, Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value.
\begin{ceqn}
\begin{equation}\label{graph:inj}
\begin{tikzcd}
r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent
This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses do one thing---identify the ``hole'' in a
value into which the character is injected back.
For instance, in the last clause, for injecting back into a value
that will turn into a new star value corresponding to a star regular expression,
we know it must be a sequence value. And we know that the first
value of that sequence corresponds to the child regex of the star
with the first character chopped off---an iteration of the star
that had just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value with this latest iteration
being added to the previous list of iterations, all under the $\Stars$
top level.
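
In the Scala sketch, $\textit{inj}$ could look like this (again only an illustration;
the match is partial by design, because it is only applied to well-formed
regular-expression/value pairs arising from the derivative phase):
\begin{verbatim}
// Illustrative sketch: inj reverses one derivative step by injecting
// the character c back into the value v.
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty)                 => Chr(d)
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
}
\end{verbatim}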
The POSIX value is maintained throughout the process:
\begin{lemma}\label{injPosix}
$(s, r \backslash c) \rightarrow v \implies (c :: s, r) \rightarrow (\inj\; r \; c\; v)$
\end{lemma}


Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
and taking into consideration the possibility of a non-match,
we have a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if}\; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;(c::s)$ & $=$ & $\textit{case}\; (\lexer\; (r\backslash c)\; s)\; \textit{of}$\\
& & $\quad \None \implies \None$\\
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
\noindent
The central property of the $\lexer$ is that it gives the correct result
according to the $\POSIX$ standard:
\begin{theorem}
The $\lexer$ based on derivatives and injections is both sound and complete:
\begin{tabular}{lcl}
$\lexer \; r \; s = \Some(v)$ & $\Longleftrightarrow$ & $(s, r) \rightarrow v$\\
$\lexer \; r \; s = \None $ & $\Longleftrightarrow$ & $\neg(\exists v.\; (s, r) \rightarrow v)$
\end{tabular}
\end{theorem}


\begin{proof}
By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}
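
In the Scala sketch, the lexer and a small usage example look as follows (ours, not the
formalisation; \texttt{Option} plays the role of $\Some$/$\None$):
\begin{verbatim}
// Illustrative sketch: the injection-based lexer.
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil     => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs).map(v => inj(r, c, v))
}

// e.g. lexing "abc" with (ab + a)(bc + c) yields the POSIX value that
// flattens to "abc" and takes the Left alternative in the first component,
// i.e. Sequ(Left(...), Right(...)) rather than Sequ(Right(...), Left(...)).
\end{verbatim}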


We now give a pictorial view of the algorithm.
For convenience, we employ the following notation: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. The
values built incrementally by \emph{injecting} back the characters into the
earlier values are $v_n, \ldots, v_0$. Corresponding values and regular expressions
always carry the same subscript, i.e.\ $\vdash v_i : r_i$:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
As we did earlier in this chapter for the matcher, one can
introduce simplification on the regular expressions.
However, now we need to do a backward phase and make sure
the values align with the regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications in
each run during the injection phase.

\ChristianComment{Do I introduce the lexer with rectification here?}
We can prove that the POSIX value of how
regular expressions match strings will not be affected---although this is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

However, even with these simplification rules, we could still end up in
trouble, when we encounter cases that require more involved and aggressive
simplifications.
\section{A Case Requiring More Aggressive Simplification}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 9000 nodes (when viewed as a tree)
even with simplification.
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={size},
    legend entries={Size of Derivative},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture}
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
\end{figure}

That is because our lexing algorithm currently keeps a lot of
``useless'' values that are never needed for the final result.
These different ways of matching will grow exponentially with the string length.

For $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
if we do not allow any empty iterations in its lexical values,
there will be $n - 1$ ``splitting points'' on $s$ that we can independently choose to
split or not, so that each sub-string
segmented by those chosen splitting points will form a different iteration.
For example, when $n=4$ we list a few of the many possible splittings:
\begin{center}
\begin{tabular}{lcr}
$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (one iteration)\\
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$ (two iterations)\\
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$ (two iterations)\\
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, a},\, v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
& $\textit{etc}.$ &
\end{tabular}
\end{center}
\noindent
And for each iteration, there are still multiple ways to split
between the two $a^*$s.
It is not surprising that there are exponentially many lexical values
that are distinct for the regular expression and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
A lexer that keeps all possible values will naturally
have an exponential runtime on ambiguous regular expressions.
With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
of a match. This means Sulzmann and Lu's injection-based algorithm
is exponential by nature.
Somehow one has to work out which
lexical values are $\POSIX$ and keep only those in a lexing algorithm.


For example, the above $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
We want to keep this value only, and remove all the regular expression subparts
not corresponding to this value during lexing.
To do this, a two-phase algorithm with rectification is a bit too fragile.
Can we not create those intermediate values $v_1,\ldots v_n$,
and get the lexing information that should be already there while
doing derivatives in one pass, without a second injection phase?
In the meantime, can we ensure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information to the
regular expressions called \emph{bitcodes}.