lexing: ChengsongTanPhdThesis/Chapters/Inj.tex@80cc6dc4c98b (annotated)

532 cc54ce075db5 restructured Chengsong parents: diff changeset	1	% Chapter Template
cc54ce075db5 restructured Chengsong parents: diff changeset	2
cc54ce075db5 restructured Chengsong parents: diff changeset	3	\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
cc54ce075db5 restructured Chengsong parents: diff changeset	4
cc54ce075db5 restructured Chengsong parents: diff changeset	5	\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
cc54ce075db5 restructured Chengsong parents: diff changeset	6	%and notations we
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	7	% used for describing the lexing algorithm by Sulzmann and Lu,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	8	%and then give the algorithm and its variant and discuss
532 cc54ce075db5 restructured Chengsong parents: diff changeset	9	%why more aggressive simplifications are needed.
cc54ce075db5 restructured Chengsong parents: diff changeset	10
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	11	In this chapter, we define the basic notions
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	12	for regular languages and regular expressions.
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	13	This is essentially a description in ``English''
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	14	the functions and datatypes of our formalisation in Isabelle/HOL.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	15	We also define what $\POSIX$ lexing means,
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	16	followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	17	that produces the output conforming
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	18	to the $\POSIX$ standard\footnote{In what follows
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	19	we choose to use the Isabelle-style notation
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	20	for function applications, where
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	21	the parameters of a function are not enclosed
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	22	inside a pair of parentheses (e.g. $f \;x \;y$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	23	instead of $f(x,\;y)$). This is mainly
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	24	to make the text visually more concise.}.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	25
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	26	\section{Basic Concepts}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	27	Formal language theory usually starts with an alphabet
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	28	denoting a set of characters.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	29	Here we use the datatype of characters from Isabelle,
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	30	which roughly corresponds to the ASCII characters.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	31	In what follows, we shall leave the information about the alphabet
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	32	implicit.
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	33	Then using the usual bracket notation for lists,
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	34	we can define strings made up of characters as:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	35	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	36	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	37	$\textit{s}$ & $\dn$ & $[] \; \|\; c :: s$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	38	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	39	\end{center}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	40	where $c$ is a variable ranging over characters.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	41	The $::$ stands for list cons and $[]$ for the empty
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	42	list.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	43	For brevity, a singleton list is sometimes written as $[c]$.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	44	Strings can be concatenated to form longer strings in the same
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	45	way we concatenate two lists, which we shall write as $s_1 @ s_2$.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	46	We omit the precise
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	47	recursive definition here.
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	48	We overload this concatenation operator for two sets of strings:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	49	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	50	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	51	$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	52	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	53	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	54	We also call the above \emph{language concatenation}.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	55	The power of a language is defined recursively, using the
cc54ce075db5 restructured Chengsong parents: diff changeset	56	concatenation operator $@$:
cc54ce075db5 restructured Chengsong parents: diff changeset	57	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	58	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	59	$A^0 $ & $\dn$ & $\{ [] \}$\\
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	60	$A^{n+1}$ & $\dn$ & $A @ A^n$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	61	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	62	\end{center}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	63	The union of all powers of a language
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	64	can be used to define the Kleene star operator:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	65	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	66	\begin{tabular}{lcl}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	67	$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	68	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	69	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	70
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	71	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	72	However, to obtain a more convenient induction principle
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	73	in Isabelle/HOL,
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	74	we instead define the Kleene star
532 cc54ce075db5 restructured Chengsong parents: diff changeset	75	as an inductive set:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	76
532 cc54ce075db5 restructured Chengsong parents: diff changeset	77	\begin{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	78	\begin{mathpar}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	79	\inferrule{\mbox{}}{[] \in A*\\}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	80
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	81	\inferrule{s_1 \in A \;\; s_2 \in A}{s_1 @ s_2 \in A}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	82	\end{mathpar}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	83	\end{center}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	84	\noindent
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	85	We also define an operation of "chopping off" a character from
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	86	a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
532 cc54ce075db5 restructured Chengsong parents: diff changeset	87	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	88	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	89	$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	90	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	91	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	92	\noindent
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	93	This can be generalised to ``chopping off'' a string
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	94	from all strings within a set $A$,
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	95	namely:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	96	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	97	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	98	$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	99	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	100	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	101	\noindent
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	102	which is essentially the left quotient $A \backslash L$ of $A$ against
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	103	the singleton language with $L = \{s\}$.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	104	However, for our purposes here, the $\textit{Ders}$ definition with
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	105	a single string is sufficient.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	106
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	107	The reason for defining derivatives
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	108	is that they provide another approach
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	109	to test membership of a string in
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	110	a set of strings.
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	111	For example, to test whether the string
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	112	$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	113	respect to the string $bar$:
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	114	\begin{center}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	115	\begin{tabular}{lll}
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	116	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	117	$\{ar, rak\}$ \\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	118	& $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	119	& $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	120	%& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	121	\end{tabular}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	122	\end{center}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	123	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	124	and in the end, test whether the set
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	125	contains the empty string.\footnote{We use the infix notation $A\backslash c$
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	126	instead of $\Der \; c \; A$ for brevity, as it will always be
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	127	clear from the context that we are operating
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	128	on languages rather than regular expressions.}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	129
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	130	In general, if we have a language $S$,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	131	then we can test whether $s$ is in $S$
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	132	by testing whether $[] \in S \backslash s$.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	133	With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
532 cc54ce075db5 restructured Chengsong parents: diff changeset	134	we have a few properties of how the language derivative can be defined using
cc54ce075db5 restructured Chengsong parents: diff changeset	135	sub-languages.
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	136	For example, for the sequence operator, we have
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	137	something similar to a ``chain rule'':
532 cc54ce075db5 restructured Chengsong parents: diff changeset	138	\begin{lemma}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	139	\[
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	140	\Der \; c \; (A @ B) =
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	141	\begin{cases}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	142	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	143	(\Der \; c \; A) \, @ \, B, & \text{otherwise}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	144	\end{cases}
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	145	\]
532 cc54ce075db5 restructured Chengsong parents: diff changeset	146	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	147	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	148	This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
cc54ce075db5 restructured Chengsong parents: diff changeset	149	and get to $B$.
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	150	The language derivative for $A*$ can be described using the language derivative
532 cc54ce075db5 restructured Chengsong parents: diff changeset	151	of $A$:
cc54ce075db5 restructured Chengsong parents: diff changeset	152	\begin{lemma}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	153	$\textit{Der} \;c \;(A) = (\textit{Der}\; c A) @ (A)$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	154	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	155	\begin{proof}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	156	There are two inclusions to prove:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	157	\begin{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	158	\item{$\subseteq$}:\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	159	The set
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	160	\[ S_1 = \{s \mid c :: s \in A*\} \]
532 cc54ce075db5 restructured Chengsong parents: diff changeset	161	is enclosed in the set
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	162	\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
e3752aac8ec2 more Chengsong parents: 628 diff changeset	163	This is because for any string $c::s$ satisfying $c::s \in A*$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	164	%whenever you have a string starting with a character
e3752aac8ec2 more Chengsong parents: 628 diff changeset	165	%in the language of a Kleene star $A*$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	166	%then that
e3752aac8ec2 more Chengsong parents: 628 diff changeset	167	the character $c$, together with a prefix of $s$
e3752aac8ec2 more Chengsong parents: 628 diff changeset	168	%immediately after $c$
e3752aac8ec2 more Chengsong parents: 628 diff changeset	169	forms the first iteration of $A*$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	170	and the rest of the $s$ is also $A*$.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	171	This coincides with the definition of $S_2$.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	172	\item{$\supseteq$}:\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	173	Note that
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	174	\[ \Der \; c \; (A) = \Der \; c \; (\{ [] \} \cup (A @ A) ) \]
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	175	holds.
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	176	Also the following holds:
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	177	\[ \Der \; c \; (\{ [] \} \cup (A @ A) ) = \Der\; c \; (A @ A) \]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	178	where the $\textit{RHS}$ can be rewritten
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	179	as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	180	which of course contains $\Der \; c \; A @ A*$.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	181	\end{itemize}
cc54ce075db5 restructured Chengsong parents: diff changeset	182	\end{proof}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	183
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	184	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	185	The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	186	for regular expressions.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	187	To introduce them, we need to first give definitions for regular expressions,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	188	which we shall do next.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	189
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	190	\subsection{Regular Expressions and Their Meaning}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	191	The \emph{basic regular expressions} are defined inductively
532 cc54ce075db5 restructured Chengsong parents: diff changeset	192	by the following grammar:
cc54ce075db5 restructured Chengsong parents: diff changeset	193	\[ r ::= \ZERO \mid \ONE
cc54ce075db5 restructured Chengsong parents: diff changeset	194	\mid c
cc54ce075db5 restructured Chengsong parents: diff changeset	195	\mid r_1 \cdot r_2
cc54ce075db5 restructured Chengsong parents: diff changeset	196	\mid r_1 + r_2
cc54ce075db5 restructured Chengsong parents: diff changeset	197	\mid r^*
cc54ce075db5 restructured Chengsong parents: diff changeset	198	\]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	199	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	200	We call them basic because we will introduce
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	201	additional constructors in later chapters, such as negation
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	202	and bounded repetitions.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	203	We use $\ZERO$ for the regular expression that
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	204	matches no string, and $\ONE$ for the regular
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	205	expression that matches only the empty string.\footnote{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	206	Some authors
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	207	also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	208	but we prefer this notation.}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	209	The sequence regular expression is written $r_1\cdot r_2$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	210	and sometimes we omit the dot if it is clear which
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	211	regular expression is meant; the alternative
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	212	is written $r_1 + r_2$.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	213	The \emph{language} or meaning of
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	214	a regular expression is defined recursively as
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	215	a set of strings:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	216	%TODO: FILL in the other defs
cc54ce075db5 restructured Chengsong parents: diff changeset	217	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	218	\begin{tabular}{lcl}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	219	$L \; \ZERO$ & $\dn$ & $\phi$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	220	$L \; \ONE$ & $\dn$ & $\{[]\}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	221	$L \; c$ & $\dn$ & $\{[c]\}$\\
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	222	$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	223	$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	224	$L \; (r^)$ & $\dn$ & $ (L\;r)$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	225	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	226	\end{center}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	227	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	228	%Now with language derivatives of a language and regular expressions and
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	229	%their language interpretations in place, we are ready to define derivatives on regular expressions.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	230	With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	231	We do so by first introducing what properties they should satisfy.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	232
532 cc54ce075db5 restructured Chengsong parents: diff changeset	233	\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	234	%Recall, the language derivative acts on a set of strings
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	235	%and essentially chops off a particular character from
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	236	%all strings in that set, Brzozowski defined a derivative operation on regular expressions
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	237	%so that after derivative $L(r\backslash c)$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	238	%will look as if it was obtained by doing a language derivative on $L(r)$:
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	239	%Recall that the language derivative acts on a
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	240	%language (set of strings).
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	241	%One can decide whether a string $s$ belongs
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	242	%to a language $S$ by taking derivative with respect to
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	243	%that string and then checking whether the empty
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	244	%string is in the derivative:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	245	%\begin{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	246	%\parskip \baselineskip
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	247	%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	248	%\def\rlwd{.5pt}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	249	%\newcommand\notate[3]{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	250	% \unskip\def\useanchorwidth{T}%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	251	% \setbox0=\hbox{#1}%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	252	% \def\stackalignment{c}\stackunder[-6pt]{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	253	% \def\stackalignment{c}\stackunder[-1.5pt]{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	254	% \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	255	% \rule{\rlwd}{#2\baselineskip}}}{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	256	% \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	257	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	258	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	259	%\notate{$\{ \ldots ,\;$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	260	% \notate{s}{1}{$(c_1 :: s_1)$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	261	% $, \; \ldots \}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	262	%}{1}{$S_{start}$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	263	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	264	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	265	% $\stackrel{\backslash c_1}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	266	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	267	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	268	% $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	269	% $,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	270	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	271	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	272	% $\stackrel{\backslash c_2}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	273	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	274	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	275	% $\{ \ldots,\; s_2
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	276	% ,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	277	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	278	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	279	% $ \xdashrightarrow{\backslash c_3\ldots\ldots} $
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	280	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	281	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	282	% \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	283	% S_{start}\backslash s$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	284	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	285	%\end{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	286	%\begin{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	287	% $s \in S_{start} \iff [] \in S_{end}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	288	%\end{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	289	%\noindent
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	290	Brzozowski noticed that $\Der$
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	291	can be ``mirrored'' on regular expressions which
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	292	he calls the derivative of a regular expression $r$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	293	with respect to a character $c$, written
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	294	$r \backslash c$. This infix operator
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	295	takes regular expression $r$ as input
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	296	and a character as a right operand.
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	297	The derivative operation on regular expression
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	298	is defined such that the language of the derivative result
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	299	coincides with the language of the original
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	300	regular expression being taken
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	301	derivative with respect to the same characters, namely
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	302	\begin{property}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	303
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	304	\[
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	305	L \; (r \backslash c) = \Der \; c \; (L \; r)
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	306	\]
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	307	\end{property}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	308	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	309	Pictorially, this looks as follows:\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	310	\vspace{3mm}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	311
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	312	\parskip \baselineskip
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	313	\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	314	\def\rlwd{.5pt}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	315	\newcommand\notate[3]{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	316	\unskip\def\useanchorwidth{T}%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	317	\setbox0=\hbox{#1}%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	318	\def\stackalignment{c}\stackunder[-6pt]{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	319	\def\stackalignment{c}\stackunder[-1.5pt]{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	320	\stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	321	\rule{\rlwd}{#2\baselineskip}}}{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	322	\strut\kern8pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	323	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	324	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	325	\notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1,
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	326	\;\ldots\}$}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	327	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	328	\Longstack{
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	329	$\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	330	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	331	\Longstack{
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	332	\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	333	\{\ldots,\;s_1,\;\ldots\}$}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	334	}\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	335	\vspace{ 3mm }
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	336
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	337	The derivatives on regular expression can again be
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	338	generalised to strings.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	339	One could compute $r_{start} \backslash s$ and test membership of $s$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	340	in $L \; r_{start}$ by checking
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	341	whether the empty string is in the language of
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	342	$r_{end}$ (that is $r_{start}\backslash s$).\\
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	343
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	344	\vspace{2mm}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	345	\Longstack{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	346	\notate{$r_{start}$}{4}{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	347	\Longstack{$L \; r_{start} = \{\ldots, \;$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	348	\notate{$s$}{1}{$c_1::s_1$}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	349	$, \ldots\} $
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	350	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	351	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	352	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	353	\Longstack{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	354	$\stackrel{\backslash c_1}{ \xrightarrow{\hspace*{1.8cm}} }$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	355	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	356	\Longstack{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	357	\notate{$r_1$}{3}{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	358	$r_1 = r_{start}\backslash c_1$,
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	359	$L \; r_1 = $
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	360	\Longstack{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	361	$\{ \ldots,\;$ \notate{$s_1$}{1}{$c_2::s_2$}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	362	$,\; \ldots \}$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	363	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	364	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	365	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	366	\Longstack{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	367	$\stackrel{\backslash c_2}{\xrightarrow{\hspace*{1.8cm}}}$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	368	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	369	\Longstack{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	370	$r_2$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	371	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	372	\Longstack{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	373	$ \xdashrightarrow{\hspace{0.3cm} \backslash c_3 \ldots \ldots \ldots \hspace{0.3cm}} $
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	374	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	375	\Longstack{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	376	\notate{$r_{end}$}{1}{
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	377	$L \; r_{end} = \{\ldots, \; [], \ldots\}$}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	378	}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	379
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	380
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	381	We have the property that
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	382	\begin{property}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	383	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	384	\end{property}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	385	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	386	Next, we give the recursive definition of derivative on
e3752aac8ec2 more Chengsong parents: 628 diff changeset	387	regular expressions so that it satisfies the properties above.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	388	%The derivative function, written $r\backslash c$,
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	389	%takes a regular expression $r$ and character $c$, and
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	390	%returns a new regular expression representing
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	391	%the original regular expression's language $L \; r$
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	392	%being taken the language derivative with respect to $c$.
626 1c8525061545 finished! Chengsong parents: 623 diff changeset	393	\begin{table}
1c8525061545 finished! Chengsong parents: 623 diff changeset	394	\begin{center}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	395	\begin{tabular}{lcl}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	396	$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	397	$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	398	$d \backslash c$ & $\dn$ &
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	399	$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	400	$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	401	$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	402	& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	403	& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	404	$(r^)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	405	\end{tabular}
626 1c8525061545 finished! Chengsong parents: 623 diff changeset	406	\end{center}
1c8525061545 finished! Chengsong parents: 623 diff changeset	407	\caption{Derivative on Regular Expressions}
1c8525061545 finished! Chengsong parents: 623 diff changeset	408	\label{table:der}
1c8525061545 finished! Chengsong parents: 623 diff changeset	409	\end{table}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	410	\noindent
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	411	The most involved cases are the sequence case
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	412	and the star case.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	413	The sequence case says that if the first regular expression
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	414	contains an empty string, then the second component of the sequence
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	415	needs to be considered, as its derivative will contribute to the
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	416	result of this derivative:
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	417	\begin{center}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	418	\begin{tabular}{lcl}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	419	$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ &
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	420	$\textit{if}\;\,([] \in L(r_1))\;
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	421	\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	422	& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	423	\end{tabular}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	424	\end{center}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	425	\noindent
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	426	Notice how this closely resembles
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	427	the language derivative operation $\Der$:
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	428	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	429	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	430	$\Der \; c \; (A @ B)$ & $\dn$ &
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	431	$ \textit{if} \;\, [] \in A \;
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	432	\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	433	\Der \; c\; B$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	434	& & $\textit{else}\; (\Der \; c \; A) @ B$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	435	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	436	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	437	\noindent
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	438	The derivative of the star regular expression $r^*$
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	439	unwraps one iteration of $r$, turns it into $r\backslash c$,
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	440	and attaches the original $r^*$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	441	after $r\backslash c$, so that
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	442	we can further unfold it as many times as needed:
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	443	\[
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	444	(r^) \backslash c \dn (r \backslash c)\cdot r^.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	445	\]
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	446	Again,
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	447	the structure is the same as the language derivative of the Kleene star:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	448	\[
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	449	\textit{Der} \;c \;(A) \dn (\textit{Der}\; c A) @ (A)
532 cc54ce075db5 restructured Chengsong parents: diff changeset	450	\]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	451	In the above definition of $(r_1\cdot r_2) \backslash c$,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	452	the $\textit{if}$ clause's
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	453	boolean condition
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	454	$[] \in L(r_1)$ needs to be
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	455	somehow recursively computed.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	456	We call such a function that checks
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	457	whether the empty string $[]$ is
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	458	in the language of a regular expression $\nullable$:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	459	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	460	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	461	$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	462	$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	463	$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	464	$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	465	$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	466	$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	467	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	468	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	469	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	470	The $\ZERO$ regular expression
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	471	does not contain any string and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	472	therefore is not \emph{nullable}.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	473	$\ONE$ is \emph{nullable}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	474	by definition.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	475	The character regular expression $c$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	476	corresponds to the singleton set $\{c\}$,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	477	and therefore does not contain the empty string.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	478	The alternative regular expression is nullable
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	479	if at least one of its children is nullable.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	480	The sequence regular expression
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	481	would require both children to have the empty string
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	482	to compose an empty string, and the Kleene star
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	483	is always nullable because it naturally
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	484	contains the empty string.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	485	\noindent
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	486	We have the following two correspondences between
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	487	derivatives on regular expressions and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	488	derivatives on a set of strings:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	489	\begin{lemma}\label{derDer}
608 37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	490	\mbox{}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	491	\begin{itemize}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	492	\item
532 cc54ce075db5 restructured Chengsong parents: diff changeset	493	$\textit{Der} \; c \; L(r) = L (r\backslash c)$
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	494	\item
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	495	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	496	\end{itemize}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	497	\end{lemma}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	498	\begin{proof}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	499	By induction on $r$.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	500	\end{proof}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	501	\noindent
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	502	which are the main properties of derivatives
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	503	that enables us later to reason about the correctness of
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	504	derivative-based matching.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	505	We can generalise the derivative operation shown above for single characters
cc54ce075db5 restructured Chengsong parents: diff changeset	506	to strings as follows:
cc54ce075db5 restructured Chengsong parents: diff changeset	507
cc54ce075db5 restructured Chengsong parents: diff changeset	508	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	509	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	510	$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	511	$r \backslash_s [\,] $ & $\dn$ & $r$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	512	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	513	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	514
cc54ce075db5 restructured Chengsong parents: diff changeset	515	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	516	When there is no ambiguity, we will
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	517	omit the subscript and use $\backslash$ instead
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	518	of $\backslash_s$ to denote
532 cc54ce075db5 restructured Chengsong parents: diff changeset	519	string derivatives for brevity.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	520	Brzozowski's derivative-based
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	521	regular-expression matching algorithm can then be described as:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	522
cc54ce075db5 restructured Chengsong parents: diff changeset	523	\begin{definition}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	524	$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	525	\end{definition}
cc54ce075db5 restructured Chengsong parents: diff changeset	526
cc54ce075db5 restructured Chengsong parents: diff changeset	527	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	528	Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	529	this algorithm, presented graphically, is as follows:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	530
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	531	\begin{equation}\label{matcher}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	532	\begin{tikzcd}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	533	r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] &
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	534	r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] &
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	535	\;\textrm{true}/\textrm{false}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	536	\end{tikzcd}
cc54ce075db5 restructured Chengsong parents: diff changeset	537	\end{equation}
cc54ce075db5 restructured Chengsong parents: diff changeset	538
cc54ce075db5 restructured Chengsong parents: diff changeset	539	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	540	It is relatively
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	541	easy to show that this matcher is correct, namely
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	542	\begin{lemma}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	543	$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	544	\end{lemma}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	545	\begin{proof}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	546	By induction on $s$ using the property of derivatives:
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	547	lemma \ref{derDer}.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	548	\end{proof}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	549	\begin{figure}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	550	\begin{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	551	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	552	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	553	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	554	ylabel={time in secs},
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	555	%ymode = log,
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	556	legend entries={Naive Matcher},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	557	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	558	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	559	\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
7cf9f17aa179 more Chengsong parents: 538 diff changeset	560	\end{axis}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	561	\end{tikzpicture}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	562	\caption{Matching the regular expression $(a^)^b$ against strings of the form
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	563	$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	564	$ using Brzozowski's original algorithm}\label{NaiveMatcher}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	565	\end{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	566	\end{figure}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	567	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	568	If we implement the above algorithm naively, however,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	569	the algorithm can be excruciatingly slow, as shown in
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	570	\ref{NaiveMatcher}.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	571	Note that both axes are in logarithmic scale.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	572	Around two dozen characters
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	573	this algorithm already ``explodes'' with the regular expression
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	574	$(a^)^b$.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	575	To improve this situation, we need to introduce simplification
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	576	rules for the intermediate results,
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	577	such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	578	and make sure those rules do not change the
7cf9f17aa179 more Chengsong parents: 538 diff changeset	579	language of the regular expression.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	580	One simple-minded simplification function
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	581	that achieves these requirements
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	582	is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	583	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	584	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	585	$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	586	$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	587	& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	588	& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	589	& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	590	& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	591	& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	592	$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	593	& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	594	& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	595	& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	596	$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	597
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	598	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	599	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	600	If we repeatedly apply this simplification
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	601	function during the matching algorithm,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	602	we have a matcher with simplification:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	603	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	604	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	605	$\derssimp \; [] \; r$ & $\dn$ & $r$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	606	$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	607	$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	608	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	609	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	610	\begin{figure}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	611	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	612	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	613	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	614	ylabel={time in secs},
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	615	%ymode = log,
ce4e5151a836 more Chengsong parents: 591 diff changeset	616	%xmode = log,
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	617	grid = both,
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	618	legend entries={Matcher With Simp},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	619	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	620	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	621	\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
7cf9f17aa179 more Chengsong parents: 538 diff changeset	622	\end{axis}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	623	\end{tikzpicture}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	624	\caption{$(a^)^b$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	625	against
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	626	$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	627	\end{figure}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	628	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	629	The running time of $\textit{ders}\_\textit{simp}$
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	630	on the same example of Figure \ref{NaiveMatcher}
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	631	is now ``tame'' in terms of the length of inputs,
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	632	as shown in Figure \ref{BetterMatcher}.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	633
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	634	So far, the story is use Brzozowski derivatives and
e3752aac8ec2 more Chengsong parents: 628 diff changeset	635	simplify as much as possible, and at the end test
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	636	whether the empty string is recognised
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	637	by the final derivative.
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	638	But what if we want to
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	639	do lexing instead of just getting a true/false answer?
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	640	Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	641	elegant (arguably as beautiful as the definition of the
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	642	Brzozowski derivative) solution for this.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	643
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	644	\section{Values and the Lexing Algorithm by Sulzmann and Lu}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	645	In this section, we present a two-phase regular expression lexing
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	646	algorithm.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	647	The first phase takes successive derivatives with
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	648	respect to the input string,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	649	and the second phase does the reverse, \emph{injecting} back
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	650	characters, in the meantime constructing a lexing result.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	651	We will introduce the injection phase in detail slightly
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	652	later, but as a preliminary we have to first define
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	653	the datatype for lexing results,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	654	called \emph{value} or
608 37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	655	sometimes also \emph{lexical value}.
37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	656	Values and regular
37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	657	expressions correspond to each other
37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	658	as illustrated in the following
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	659	table:
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	660
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	661	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	662	\begin{tabular}{c@{\hspace{20mm}}c}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	663	\begin{tabular}{@{}rrl@{}}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	664	\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	665	$r$ & $::=$ & $\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	666	& $\mid$ & $\ONE$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	667	& $\mid$ & $c$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	668	& $\mid$ & $r_1 \cdot r_2$\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	669	& $\mid$ & $r_1 + r_2$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	670	\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	671	& $\mid$ & $r^*$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	672	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	673	&
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	674	\begin{tabular}{@{\hspace{0mm}}rrl@{}}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	675	\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	676	$v$ & $::=$ & \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	677	& & $\Empty$ \\
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	678	& $\mid$ & $\Char \; c$ \\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	679	& $\mid$ & $\Seq\,v_1\, v_2$\\
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	680	& $\mid$ & $\Left \;v$ \\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	681	& $\mid$ & $\Right\;v$ \\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	682	& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	683	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	684	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	685	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	686	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	687	A value has an underlying string, which
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	688	can be calculated by the ``flatten" function $\|\_\|$:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	689	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	690	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	691	$\|\Empty\|$ & $\dn$ & $[]$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	692	$\|\Char \; c\|$ & $ \dn$ & $ [c]$\\
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	693	$\|\Seq \; v_1, \;v_2\|$ & $ \dn$ & $ v_1\| @ \|v_2\|$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	694	$\|\Left \; v\|$ & $ \dn$ & $ \|v\|$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	695	$\|\Right \; v\|$ & $ \dn$ & $ \|v\|$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	696	$\|\Stars \; []\|$ & $\dn$ & $[]$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	697	$\|\Stars \; v::vs\|$ & $\dn$ & $ \|v\| @ \|\Stars(vs)\|$
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	698	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	699	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	700	Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	701	to indicate that a value $v$ could be generated from a lexing algorithm
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	702	with input $r$. They call it the value inhabitation relation,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	703	defined by the rules.
628 7af4e2420a8c ready to submit~~ Chengsong parents: 626 diff changeset	704	\begin{figure}[H]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	705	\begin{mathpar}
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	706	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	707
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	708	\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	709
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	710	\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	711
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	712	\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	713
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	714	\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	715
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	716	\inferrule{\forall v \in vs. \vdash v:r \land \|v\| \neq []}{\vdash \Stars \; vs : r^*}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	717	\end{mathpar}
628 7af4e2420a8c ready to submit~~ Chengsong parents: 626 diff changeset	718	\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
626 1c8525061545 finished! Chengsong parents: 623 diff changeset	719	\end{figure}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	720	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	721	The condition $\|v\| \neq []$ in the premise of star's rule
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	722	is to make sure that for a given pair of regular
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	723	expression $r$ and string $s$, the number of values
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	724	satisfying $\|v\| = s$ and $\vdash v:r$ is finite.
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	725	This additional condition was
ce4e5151a836 more Chengsong parents: 591 diff changeset	726	imposed by Ausaf and Urban to make their proofs easier.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	727	Given a string and a regular expression, there can be
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	728	multiple values for it. For example, both
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	729	$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	730	$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	731	and the values both flatten to $abc$.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	732	Lexers therefore have to disambiguate and choose only
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	733	one of the values to be generated. $\POSIX$ is one of the
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	734	disambiguation strategies that is widely adopted.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	735
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	736	Ausaf et al. \cite{AusafDyckhoffUrban2016}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	737	formalised the property
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	738	as a ternary relation.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	739	The $\POSIX$ value $v$ for a regular expression
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	740	$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	741	in the following rules\footnote{The names of the rules are used
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	742	as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	743	\begin{figure}[p]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	744	\begin{mathpar}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	745	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	746
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	747	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	748
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	749	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	750
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	751	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	752
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	753	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	754	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	755	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	756	\Seq \; v_1 \; v_2}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	757
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	758	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	759
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	760	\inferrule[P]{(s_1, v) \rightarrow v \\ (s_2, r^) \rightarrow \Stars \; vs \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	761	\|v\| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	762	s_1@s_3 \in L \; r \land s_4 \in L \; r^}{(s_1@s_2, r^)\rightarrow \Stars \;
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	763	(v::vs)}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	764	\end{mathpar}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	765	\caption{The inductive POSIX rules given by Ausaf et al.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	766	\cite{AusafDyckhoffUrban2016}.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	767	This ternary relation, written $(s, r) \rightarrow v$,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	768	formalises the POSIX constraints on the
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	769	value $v$ given a string $s$ and
ce4e5151a836 more Chengsong parents: 591 diff changeset	770	regular expression $r$.
ce4e5151a836 more Chengsong parents: 591 diff changeset	771	}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	772	\end{figure}\afterpage{\clearpage}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	773	\noindent
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	774
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	775	%\begin{figure}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	776	%\begin{tikzpicture}[]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	777	% \node [minimum width = 6cm, rectangle split, rectangle split horizontal,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	778	% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	779	% (node1)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	780	% {$r_{token1}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	781	% \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ };
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	782	% %\node [left = 6.0cm of node1] (start1) {hi};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	783	% \node [left = 0.2cm of node1] (middle) {$v.s.$};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	784	% \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	785	% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	786	% (node2)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	787	% {$\quad\;\;\;r_{token1}\quad\;\;\;$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	788	% \nodepart{two} $r_{token2}$ };
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	789	% \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	790	% \node [above = 1.5cm of middle, minimum width = 6cm,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	791	% rectangle, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	792	% (topNode) {$s$};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	793	% \path[->,draw]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	794	% (topNode) edge node {split $A$} (node2)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	795	% (topNode) edge node {split $B$} (node1)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	796	% ;
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	797	%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	798	%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	799	%\end{tikzpicture}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	800	%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	801	%\end{figure}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	802	The above $\POSIX$ rules follow the intuition described below:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	803	\begin{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	804	\item (Left Priority)\\
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	805	Match the leftmost regular expression when multiple options for matching
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	806	are available. See P+L and P+R where in P+R $s$ cannot
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	807	be in the language of $L \; r_1$.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	808	\item (Maximum munch)\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	809	Always match a subpart as much as possible before proceeding
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	810	to the next part of the string.
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	811	For example, when the string $s$ matches
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	812	$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	813	Then the split that matches a longer string for the first part
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	814	$r_{part1}$ is preferred by this maximum munch rule.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	815	The side-condition
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	816	\begin{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	817	$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	818	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	819	\end{center}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	820	in PS causes this.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	821	%(See
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	822	%\ref{munch} for an illustration).
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	823	\end{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	824	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	825	These disambiguation strategies can be
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	826	quite practical.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	827	For instance, when lexing a code snippet
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	828	\[
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	829	\textit{iffoo} = 3
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	830	\]
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	831	using a regular expression
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	832	for keywords and
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	833	identifiers:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	834	%(for example, keyword is a nonempty string starting with letters
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	835	%followed by alphanumeric characters or underscores):
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	836	\[
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	837	r_{keyword} + r_{identifier}.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	838	\]
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	839	If we want $\textit{iffoo}$ to be recognized
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	840	as an identifier
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	841	where identifiers are defined as usual (letters
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	842	followed by letters, numbers or underscores),
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	843	then a match with a keyword (if)
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	844	followed by
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	845	an identifier (foo) would be incorrect.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	846	POSIX lexing generates what is included by lexing.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	847
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	848	\noindent
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	849	We know that a POSIX
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	850	value for regular expression $r$ is inhabited by $r$.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	851	\begin{lemma}
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	852	$(r, s) \rightarrow v \implies \vdash v: r$
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	853	\end{lemma}
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	854	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	855	The main property about a $\POSIX$ value is that
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	856	given the same regular expression $r$ and string $s$,
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	857	one can always uniquely determine the $\POSIX$ value for it:
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	858	\begin{lemma}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	859	$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	860	\end{lemma}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	861	\begin{proof}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	862	By induction on $s$, $r$ and $v_1$. The inductive cases
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	863	are all the POSIX rules.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	864	Probably the most cumbersome cases are
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	865	the sequence and star with non-empty iterations.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	866	We shall give the details for proving the sequence case here.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	867
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	868	When we have
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	869	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	870	(s_1, r_1) \rightarrow v_1 \;\, and \;\,
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	871	(s_2, r_2) \rightarrow v_2 \;\, and \;\,
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	872	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	873	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	874	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	875	we know that the last condition
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	876	excludes the possibility of a
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	877	string $s_1'$ longer than $s_1$ such that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	878	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	879	(s_1', r_1) \rightarrow v_1' \;\;
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	880	and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	881	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	882	hold.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	883	A shorter string $s_1''$ with $s_2''$ satisfying
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	884	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	885	(s_1'', r_1) \rightarrow v_1''
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	886	\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	887	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	888	cannot possibly form a $\POSIX$ value either, because
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	889	by definition, there is a candidate
e3752aac8ec2 more Chengsong parents: 628 diff changeset	890	with a longer initial string
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	891	$s_1$. Therefore, we know that the POSIX
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	892	value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	893	$s$ must have the
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	894	property that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	895	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	896	\|a\| = s_1 \;\; and \;\; \|b\| = s_2.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	897	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	898	The goal is to prove that $a = v_1 $ and $b = v_2$.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	899	If we have some other POSIX values $v_{10}$ and $v_{20}$ such that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	900	$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	901	then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	902	which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	903	is the same as $\Seq(v_1, v_2)$.
7cf9f17aa179 more Chengsong parents: 538 diff changeset	904	\end{proof}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	905	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	906	We have now defined what a POSIX value is and shown that it is unique.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	907	The problem is to generate
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	908	such a value in a lexing algorithm using derivatives.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	909
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	910	\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	911
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	912	Sulzmann and Lu extended Brzozowski's
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	913	derivative-based matching
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	914	to a lexing algorithm by a second phase
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	915	after the initial phase of successive derivatives.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	916	This second phase generates a POSIX value
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	917	if the regular expression matches the string.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	918	The algorithm uses two functions called $\inj$ and $\mkeps$.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	919	The function $\mkeps$ constructs a POSIX value from the last
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	920	derivative $r_n$:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	921	\begin{ceqn}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	922	\begin{equation}\label{graph:mkeps}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	923	\begin{tikzcd}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	924	r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	925	& & & v_n
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	926	\end{tikzcd}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	927	\end{equation}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	928	\end{ceqn}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	929	\noindent
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	930	In the above diagram, again we assume that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	931	the input string $s$ is made of $n$ characters
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	932	$c_0c_1 \ldots c_{n-1}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	933	The last derivative operation
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	934	$\backslash c_{n-1}$ generates the derivative $r_n$, for which
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	935	$\mkeps$ produces the value $v_n$. This value
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	936	tells us how the empty string is matched by the (nullable)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	937	regular expression $r_n$, in a POSIX way.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	938	The definition of $\mkeps$ is
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	939	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	940	\begin{tabular}{lcl}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	941	$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	942	$\mkeps \; (r_{1}+r_{2})$ & $\dn$
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	943	& $\textit{if}\; (\nullable \; r_{1}) \;\,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	944	\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	945	& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	946	$\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	947	$\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	948	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	949	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	950
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	951
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	952	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	953	The function prefers the left child $r_1$ of $r_1 + r_2$
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	954	to match an empty string if there is a choice.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	955	When there is a star to match the empty string,
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	956	we give the $\Stars$ constructor an empty list, meaning
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	957	no iteration is taken.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	958	The result of $\mkeps$ on a $\nullable$ $r$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	959	is a POSIX value for $r$ and the empty string:
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	960	\begin{lemma}\label{mePosix}
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	961	$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	962	\end{lemma}
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	963	\begin{proof}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	964	By induction on $r$.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	965	\end{proof}
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	966	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	967	After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	968	in reverse order as they were chopped off in the derivative phase.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	969	The function for this is called $\inj$. This function
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	970	operates on values, unlike $\backslash$ which operates on regular expressions.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	971	In the diagram below, $v_i$ stands for the (POSIX) value
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	972	for how the regular expression
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	973	$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	974	of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	975	After injecting back $n$ characters, we get the lexical value for how $r_0$
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	976	matches $s$.
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	977	\begin{figure}[H]
ce4e5151a836 more Chengsong parents: 591 diff changeset	978	\begin{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	979	\begin{ceqn}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	980	\begin{tikzcd}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	981	r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	982	v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	983	\end{tikzcd}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	984	\end{ceqn}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	985	\end{center}
ce4e5151a836 more Chengsong parents: 591 diff changeset	986	\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
ce4e5151a836 more Chengsong parents: 591 diff changeset	987	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
ce4e5151a836 more Chengsong parents: 591 diff changeset	988	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
ce4e5151a836 more Chengsong parents: 591 diff changeset	989	$c_1$, and so on. These are the same operations as they have appeared in the matcher
ce4e5151a836 more Chengsong parents: 591 diff changeset	990	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	991	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
e3752aac8ec2 more Chengsong parents: 628 diff changeset	992	the empty string, by always selecting the leftmost
e3752aac8ec2 more Chengsong parents: 628 diff changeset	993	nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	994	appeared in the string, always preserving POSIXness.}\label{graph:inj}
ce4e5151a836 more Chengsong parents: 591 diff changeset	995	\end{figure}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	996	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	997	The function $\textit{inj}$ as defined by Sulzmann and Lu
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	998	takes three arguments: a regular
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	999	expression ${r_{i}}$, before the character is chopped off,
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1000	a character ${c_{i}}$ (the character we want to inject back) and
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1001	the third argument $v_{i+1}$ the value we want to inject into.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1002	The result of an application
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1003	$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1004	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1005	(s_i, r_i) \rightarrow v_i
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1006	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1007	holds.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1008	The definition of $\textit{inj}$ is as follows:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1009	\begin{center}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1010	\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1011	$\textit{inj}\;(c)\;c\,Empty$ & $\dn$ & $\Char\,c$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1012	$\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1013	$\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1014	$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ &
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1015	$\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1016	$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ &
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1017	$\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1018	$\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1019	$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1020	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1021	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1022
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1023	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1024	The function recurses on
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1025	the shape of regular
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1026	expressions and values.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1027	Intuitively, each clause analyses
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1028	how $r_i$ could have transformed when being
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1029	derived by $c$, identifying which subpart
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1030	of $v_{i+1}$ has the ``hole''
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1031	to inject the character back into.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1032	Once the character is
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1033	injected back to that sub-value;
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1034	$\inj$ assembles all parts
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1035	to form a new value.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1036
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1037	For instance, the last clause is an
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1038	injection into a sequence value $v_{i+1}$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1039	whose second child
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1040	value is a star and the shape of the
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1041	regular expression $r_i$ before injection
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1042	is a star.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1043	We therefore know
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1044	the derivative
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1045	starts on a star and ends as a sequence:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1046	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1047	(r^) \backslash c \longrightarrow r\backslash c \cdot r^
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1048	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1049	during which an iteration of the star
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1050	had just been unfolded, giving the below
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1051	value inhabitation relation:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1052	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1053	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1054	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1055	The value list $vs$ corresponds to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1056	matched star iterations,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1057	and the ``hole'' lies in $v$ because
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1058	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1059	\vdash v: r\backslash c.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1060	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1061	Finally,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1062	$\inj \; r \;c \; v$ is prepended
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1063	to the previous list of iterations and then
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1064	wrapped under the $\Stars$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1065	constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1066
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1067	Recall that lemma
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1068	\ref{mePosix} tells us that
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1069	$\mkeps$ always generates the POSIX value.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1070	The function $\inj$ preserves the POSIXness, provided
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1071	the value before injection is POSIX, namely
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1072	\begin{lemma}\label{injPosix}
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1073	If$(r \backslash c, s) \rightarrow v $,
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1074	then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1075	\end{lemma}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1076	\begin{proof}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1077	By induction on $r$.
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1078	The non-trivial cases are sequence and star.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1079	When $r = a \cdot b$, there can be
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1080	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1081	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1082	case.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1083	\begin{itemize}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1084	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1085	$v = \Seq \; v_a \; v_b$.\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1086	The ``not nullable'' clause of the $\inj$ function is taken:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1087	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1088	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1089	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1090	& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1091	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1092	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1093	We know that there exists a unique pair of
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1094	$s_a$ and $s_b$ satisfying
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1095	$(a \backslash c, s_a) \rightarrow v_a$,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1096	$(b , s_b) \rightarrow v_b$, and
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1097	$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1098	L \; (a\backslash c) \land
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1099	s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1100	The last condition gives us
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1101	$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1102	L \; a \land
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1103	s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1104	By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1105	and this gives us
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1106	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1107	(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1108	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1109	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1110	$v = \Left \; (\Seq \; v_a \; v_b)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1111	The argument is almost identical to the above case,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1112	except that a different clause of $\inj$ is taken:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1113	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1114	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1115	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1116	& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1117	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1118	\end{center}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1119	With similar reasoning,
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1120
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1121	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1122	(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1123	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1124	again holds.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1125	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1126	$v = \Right \; v_b$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1127	Again the injection result would be
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1128	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1129	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1130	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1131	& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1132	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1133	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1134	We know that $a$ must be nullable,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1135	allowing us to call $\mkeps$ and get
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1136	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1137	(a, []) \rightarrow \mkeps \; a.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1138	\]
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1139	Also, by inductive hypothesis
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1140	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1141	(b, c::s) \rightarrow \inj\; b \; c \; v_b
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1142	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1143	holds.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1144	In addition, as
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1145	$\Right \;v_b$ instead of $\Left \ldots$ is
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1146	the POSIX value for $v$, it must be the case
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1147	that $s \notin L \;( (a\backslash c)\cdot b)$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1148	This tells us that
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1149	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1150	\nexists s_3 \; s_4.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1151	s_3 @s_4 = s \land s_3 \in L \; (a\backslash c)
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1152	\land s_4 \in L \; b
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1153	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1154	which translates to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1155	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1156	\nexists s_3 \; s_4. \; s_3 \neq [] \land
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1157	s_3 @s_4 = c::s \land s_3 \in L \; a
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1158	\land s_4 \in L \; b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1159	\]
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1160	(Which says there cannot be a longer
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1161	initial split for $s$ other than the empty string.)
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1162	Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1163	as the POSIX value for $a\cdot b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1164	\end{itemize}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1165	The star case can be proven similarly.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1166	\end{proof}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1167	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1168	Putting all together, Sulzmann and Lu obtained the following algorithm
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1169	outlined in the diagram \ref{graph:inj}:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1170	\begin{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1171	\begin{tabular}{lcl}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1172	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1173	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1174	& & $\quad \phantom{\mid}\; \None \implies \None$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1175	& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1176	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1177	\end{center}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1178	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1179	The central property of the $\lexer$ is that it gives the correct result
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1180	according to
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1181	POSIX rules.
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1182	\begin{theorem}\label{lexerCorrectness}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1183	The $\lexer$ based on derivatives and injections is correct:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1184	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1185	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1186	$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1187	$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1188	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1189	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1190	\end{theorem}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1191	\begin{proof}
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1192	By induction on $s$. $r$ generalising over an arbitrary regular expression.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1193	The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1194	by lemma \ref{injPosix}.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1195	\end{proof}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1196	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1197	As we did earlier in this chapter with the matcher, one can
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1198	introduce simplification on the regular expression in each derivative step.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1199	However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1200	and ensure that
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1201	the values align with the regular expression at each step.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1202	Therefore one has to
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1203	be careful not to break the correctness, as the injection
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1204	function heavily relies on the structure of
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1205	the regular expressions and values being aligned.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1206	This can be achieved by recording some extra rectification functions
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1207	during the derivatives step and applying these rectifications in
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1208	each run during the injection phase.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1209	With extra care
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1210	one can show that POSIXness will not be affected
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1211	by the simplifications listed here \cite{AusafDyckhoffUrban2016}.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1212	\begin{center}
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1213	\begin{tabular}{lcl}
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1214	$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1215	$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1216	& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1217	& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1218	& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1219	& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1220	& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1221	$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1222	& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1223	& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1224	& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1225	$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1226
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1227	\end{tabular}
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1228	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1229
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1230	However, one can still end up
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1231	with exploding derivatives,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1232	even with the simple-minded simplification rules allowed
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1233	in an injection-based lexer.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1234	\section{A Case Requiring More Aggressive Simplifications}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1235	For example, when starting with the regular
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1236	expression $(a^* \cdot a^)^$ and building just over
4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1237	a dozen successive derivatives
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1238	w.r.t.~the character $a$, one obtains a derivative regular expression
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1239	with millions of nodes (when viewed as a tree)
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1240	even with the mentioned simplifications.
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1241	\begin{figure}[H]
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1242	\begin{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1243	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1244	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1245	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1246	ylabel={size},
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1247	legend entries={Simple-Minded Simp, Naive Matcher},
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1248	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1249	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1250	\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1251	\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1252	\end{axis}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1253	\end{tikzpicture}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1254	\end{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1255	\caption{Size of $(a^\cdot a^)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1256	\end{figure}\label{fig:BetterWaterloo}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1257
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1258	That is because Sulzmann and Lu's
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1259	injection-based lexing algorithm keeps a lot of
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1260	"useless" values that will not be used.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1261	These different ways of matching will grow exponentially with the string length.
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1262	Consider the case
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1263	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1264	r= (a^\cdot a^)^* \quad and \quad
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1265	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1266	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1267	as an example.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1268	This is a highly ambiguous regular expression, with
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1269	many ways to split up the string into multiple segments for
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1270	different star iterations,
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1271	and for each segment
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1272	multiple ways of splitting between
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1273	the two $a^*$ sub-expressions.
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1274	When $n$ is equal to $1$, there are two lexical values for
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1275	the match:
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1276	\[
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1277	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1278	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1279	and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1280	\[
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1281	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1282	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1283	The derivative of $\derssimp \;s \; r$ is
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1284	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1285	(a^a^ + a^)\cdot(a^a^)^.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1286	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1287	The $a^a^$ and $a^*$ in the first child of the above sequence
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1288	correspond to value 1 and value 2, respectively.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1289	When $n=2$, the number goes up to 7:
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1290	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1291	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1292	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1293
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1294	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1295	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1296	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1297
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1298	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1299	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1300	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1301
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1302	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1303	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1304	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1305
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1306	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1307	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []),
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1308	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1309	]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1310	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1311
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1312	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1313	\Stars \; [
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1314	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1315	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1316	]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1317	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1318	and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1319	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1320	\Stars \; [
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1321	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1322	\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1323	]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1324	\]
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1325	And $\derssimp \; aa \; (a^a^)^*$ is
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1326	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1327	((a^a^ + a^)+a^)\cdot(a^a^)^* +
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1328	(a^a^ + a^)\cdot(a^a^)^.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1329	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1330	which removes two out of the seven terms corresponding to the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1331	seven distinct lexical values.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1332
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1333	It is not surprising that there are exponentially many
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1334	distinct lexical values that cannot be eliminated by
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1335	the simple-minded simplification of $\derssimp$.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1336	A lexer without a good enough strategy to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1337	deduplicate will naturally
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1338	have an exponential runtime on highly
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1339	ambiguous regular expressions because there
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1340	are exponentially many matches.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1341	For this particular example, it seems
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1342	that the number of distinct matches growth
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1343	speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1344
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1345	On the other hand, the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1346	$\POSIX$ value for $r= (a^\cdot a^)^*$ and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1347	$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1348	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1349	\Stars\,
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1350	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1351	\]
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1352	At any moment, the subterms in a regular expression
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1353	that will potentially result in a POSIX value is only
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1354	a minority among the many other terms,
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1355	and one can remove the ones that are not possible to
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1356	be POSIX.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1357	In the above example,
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1358	\begin{equation}\label{eqn:growth2}
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1359	((a^a^ + \underbrace{a^}_\text{A})+\underbrace{a^}_\text{duplicate of A})\cdot(a^a^)^* +
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1360	\underbrace{(a^a^ + a^)\cdot(a^a^)^}_\text{further simp removes this}.
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1361	\end{equation}
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1362	can be further simplified by
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1363	removing the underlined term first,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1364	which would open up possibilities
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1365	of further simplification that removes the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1366	underbraced part.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1367	The result would be
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1368	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1369	(\underbrace{a^a^}_\text{term 1} + \underbrace{a^}_\text{term 2})\cdot(a^a^)^.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1370	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1371	with corresponding values
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1372	\begin{center}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1373	\begin{tabular}{lr}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1374	$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1375	$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1376	\end{tabular}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1377	\end{center}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1378	Other terms with an underlying value, such as
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1379	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1380	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1381	\]
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1382	do not to contribute a POSIX lexical value,
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1383	and therefore can be thrown away.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1384
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1385	Ausaf et al. \cite{AusafDyckhoffUrban2016}
639 80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1386	have come up with some simplification steps,
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1387	and incorporated the simplification into $\lexer$.
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1388	They call this lexer $\textit{lexer}_{simp}$ and proved that
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1389	\[
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1390	\lexer \; r\; s = \textit{lexer}_{simp} \; r \; s
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1391	\]
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1392	The function $\textit{lexer}_{simp}$
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1393	involves some fiddly manipulation of value rectification,
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1394	which we omit here.
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1395	however those steps
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1396	are not yet sufficiently strong, to achieve the above effects.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1397	And even with these relatively mild simplifications, the proof
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1398	is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1399	One would need to prove something like this:
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1400	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1401	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\;
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1402	\textit{then}\;\; (r, c::s) \rightarrow
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1403	\inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1404	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1405	instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1406	not only has to return a simplified regular expression,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1407	but also what specific simplifications
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1408	have been done as a function on values
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1409	showing how one can transform the value
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1410	underlying the simplified regular expression
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1411	to the unsimplified one.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1412
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1413	We therefore choose a slightly different approach
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1414	also described by Sulzmann and Lu to
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1415	get better simplifications, which uses
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1416	some augmented data structures compared to
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1417	plain regular expressions.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1418	We call them \emph{annotated}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1419	regular expressions.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1420	With annotated regular expressions,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1421	we can avoid creating the intermediate values $v_1,\ldots v_n$ and a
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1422	second phase altogether.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1423	We introduce this new datatype and the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1424	corresponding algorithm in the next chapter.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1425
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1426
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1427
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1428

author	Chengsong
	Fri, 30 Dec 2022 17:37:51 +0000
changeset 639	80cc6dc4c98b
parent 638	dd9dde2d902b
child 646	56057198e4f5
permissions	-rwxr-xr-x