lexing: ChengsongTanPhdThesis/Chapters/Inj.tex@f47fc4840579 (annotated)

532 cc54ce075db5 restructured Chengsong parents: diff changeset	1	% Chapter Template
cc54ce075db5 restructured Chengsong parents: diff changeset	2
cc54ce075db5 restructured Chengsong parents: diff changeset	3	\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
cc54ce075db5 restructured Chengsong parents: diff changeset	4
cc54ce075db5 restructured Chengsong parents: diff changeset	5	\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
cc54ce075db5 restructured Chengsong parents: diff changeset	6	%and notations we
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	7	% used for describing the lexing algorithm by Sulzmann and Lu,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	8	%and then give the algorithm and its variant and discuss
532 cc54ce075db5 restructured Chengsong parents: diff changeset	9	%why more aggressive simplifications are needed.
cc54ce075db5 restructured Chengsong parents: diff changeset	10
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	11	In this chapter, we define the basic notions
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	12	for regular languages and regular expressions.
This is essentially a description in ``English''
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	14	of our formalisation in Isabelle/HOL.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	15	We also give the definition of what $\POSIX$ lexing means,
followed by an algorithm by Sulzmann and Lu~\parencite{Sulzmann2014}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	17	that produces the output conforming
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	18	to the $\POSIX$ standard.
We use the ML-style notation
for function application, where
the arguments of a function are not enclosed
in parentheses (e.g.\ $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	26
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	27	\section{Basic Concepts}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	28	Usually, formal language theory starts with an alphabet
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	29	denoting a set of characters.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	30	Here we just use the datatype of characters from Isabelle,
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	31	which roughly corresponds to the ASCII characters.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	32	In what follows, we shall leave the information about the alphabet
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	33	implicit.
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	34	Then using the usual bracket notation for lists,
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	35	we can define strings made up of characters:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	36	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	37	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	38	$\textit{s}$ & $\dn$ & $[] \; \|\; c :: s$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	39	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	40	\end{center}
where $c$ is a variable ranging over characters.
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	42	Strings can be concatenated to form longer strings in the same
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	43	way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	44	We omit the precise
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	45	recursive definition here.
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	46	We overload this concatenation operator for two sets of strings:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	47	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	48	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	49	$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	50	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	51	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	52	We also call the above \emph{language concatenation}.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	53	The power of a language is defined recursively, using the
cc54ce075db5 restructured Chengsong parents: diff changeset	54	concatenation operator $@$:
cc54ce075db5 restructured Chengsong parents: diff changeset	55	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	56	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	57	$A^0 $ & $\dn$ & $\{ [] \}$\\
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	58	$A^{n+1}$ & $\dn$ & $A @ A^n$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	59	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	60	\end{center}
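\noindent
As a small illustration (the language $A = \{a, bc\}$ is chosen here purely as an example), unfolding the definition gives:
\begin{center}
\begin{tabular}{lcl}
$A^0$ & $=$ & $\{ [] \}$\\
$A^1$ & $=$ & $A @ A^0 = \{a, bc\}$\\
$A^2$ & $=$ & $A @ A^1 = \{aa, abc, bca, bcbc\}$
\end{tabular}
\end{center}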
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	61	The union of all powers of a language
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	62	can be used to define the Kleene star operator:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	63	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	64	\begin{tabular}{lcl}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	65	$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	66	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	67	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	68
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	69	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	70	However, to obtain a more convenient induction principle
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	71	in Isabelle/HOL,
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	72	we instead define the Kleene star
532 cc54ce075db5 restructured Chengsong parents: diff changeset	73	as an inductive set:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	74
532 cc54ce075db5 restructured Chengsong parents: diff changeset	75	\begin{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	76	\begin{mathpar}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	77	\inferrule{\mbox{}}{[] \in A*\\}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	78
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	80	\end{mathpar}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	81	\end{center}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	82	\noindent
We also define an operation of ``chopping off'' a character from
all strings in a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
532 cc54ce075db5 restructured Chengsong parents: diff changeset	85	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	86	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	87	$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	88	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	89	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	90	\noindent
This can be generalised to ``chopping off'' a string from all strings within the set $A$,
namely:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	93	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	94	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	95	$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	96	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	97	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	98	\noindent
which is essentially the left quotient $A \backslash L$ of $A$ against
the singleton language $L = \{s\}$
in formal language theory.
However, for our purposes, the $\textit{Ders}$ definition with
a single string is sufficient.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	104
The reason for defining derivatives
is that they provide a different approach
to testing membership of a string in
a set of strings.
For example, to test whether the string
$bar$ is contained in the set $\{foo, bar, brak\}$, one takes the derivative of the set with
respect to the string $bar$:
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	112	\begin{center}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	113	\begin{tabular}{lclll}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	114	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	115	$\{ar, rak\}$ &
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	116	$\stackrel{\backslash a}{\rightarrow}$ &
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	117	\\
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	118	$\{r \}$ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$ &
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	119	$\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	120	\end{tabular}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	121	\end{center}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	122	\noindent
and in the end tests whether the resulting set
contains the empty string\footnote{We use the infix notation $A\backslash c$
instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
on languages rather than regular expressions.}.
In general, if we have a language $S_{start}$,
then we can test whether $s$ is in $S_{start}$
by testing whether $[] \in S_{start} \backslash s$.
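Conversely, a string that is not in the language leaves a set without the empty string.
As a sketch using the same example set, testing the string $ba$ gives:
\begin{center}
\begin{tabular}{lclll}
$\{foo, bar, brak\}$ & $\stackrel{\backslash b}{\rightarrow}$ &
$\{ar, rak\}$ &
$\stackrel{\backslash a}{\rightarrow}$ &
$\{r\}$\\
\end{tabular}
\end{center}
\noindent
Since $[] \notin \{r\}$, we conclude that $ba$ is not in the set.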
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	130
For the sequencing and Kleene star operators on languages,
the language derivative $\textit{Der}$ can be expressed in terms of
the derivatives of the sub-languages.
For example, for the sequence operator, we have
something similar to the ``chain rule'' of the calculus derivative:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	136	\begin{lemma}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	137	\[
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	138	\Der \; c \; (A @ B) =
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	139	\begin{cases}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	140	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	141	(\Der \; c \; A) \, @ \, B, & \text{otherwise}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	142	\end{cases}
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	143	\]
532 cc54ce075db5 restructured Chengsong parents: diff changeset	144	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	145	\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.
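\noindent
To illustrate the first case with a concrete instance (the sets are chosen here only for the example), let $A = \{[], a\}$ and $B = \{ab, b\}$, so that $A @ B = \{ab, b, aab\}$. Taking the derivative with respect to $b$ gives:
\begin{center}
\begin{tabular}{lcl}
$\Der \; b \; (A @ B)$ & $=$ & $\{[]\}$\\
$(\Der \; b \; A) \, @ \, B$ & $=$ & $\{\} \, @ \, B = \{\}$\\
$\Der \; b \; B$ & $=$ & $\{[]\}$
\end{tabular}
\end{center}
\noindent
The only contribution comes from $\Der \; b \; B$, which is reachable because $[] \in A$.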
The derivative of the language $A*$ can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c \; A) @ (A*)$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	152	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	153	\begin{proof}
There are two inclusions to prove:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	155	\begin{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	156	\item{$\subseteq$}:\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	157	The set
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	158	\[ \{s \mid c :: s \in A*\} \]
532 cc54ce075db5 restructured Chengsong parents: diff changeset	159	is enclosed in the set
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	160	\[ \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
because whenever a string in the language
of a Kleene star $A*$ starts with a character,
that character together with some sub-string
immediately after it forms the first iteration,
and the rest of the string is
still in $A*$.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	167	\item{$\supseteq$}:\\
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
holds.
Also this holds:
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
where, if $[] \in A$, the $\textit{RHS}$ can be rewritten
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
which of course contains $(\Der \; c \; A) @ A*$;
if $[] \notin A$, the $\textit{RHS}$ is exactly $(\Der \; c \; A) @ A*$.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	176	\end{itemize}
cc54ce075db5 restructured Chengsong parents: diff changeset	177	\end{proof}
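\noindent
As a small instance of this lemma (the language is an assumption made only for illustration), take $A = \{ab\}$, so that $A* = \{[], ab, abab, \ldots\}$. Then
\[ \Der \; a \; (A*) = \{b, bab, babab, \ldots\} = \{b\} @ A* = (\Der \; a \; A) @ A* \]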
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	178
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	179	\noindent
Before we define the counterparts of $\textit{Der}$ and $\textit{Ders}$
on regular expressions, we first need to define regular expressions themselves.
cc54ce075db5 restructured Chengsong parents: diff changeset	182
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	183	\subsection{Regular Expressions and Their Meaning}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	184	The \emph{basic regular expressions} are defined inductively
532 cc54ce075db5 restructured Chengsong parents: diff changeset	185	by the following grammar:
cc54ce075db5 restructured Chengsong parents: diff changeset	186	\[ r ::= \ZERO \mid \ONE
cc54ce075db5 restructured Chengsong parents: diff changeset	187	\mid c
cc54ce075db5 restructured Chengsong parents: diff changeset	188	\mid r_1 \cdot r_2
cc54ce075db5 restructured Chengsong parents: diff changeset	189	\mid r_1 + r_2
cc54ce075db5 restructured Chengsong parents: diff changeset	190	\mid r^*
cc54ce075db5 restructured Chengsong parents: diff changeset	191	\]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	192	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	193	We call them basic because we will introduce
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	194	additional constructors in later chapters such as negation
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	195	and bounded repetitions.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	196	We use $\ZERO$ for the regular expression that
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	197	matches no string, and $\ONE$ for the regular
expression that matches only the empty string\footnote{Some authors
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$,
but we prefer our notation.}.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	202	The sequence regular expression is written $r_1\cdot r_2$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	203	and sometimes we omit the dot if it is clear which
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	204	regular expression is meant; the alternative
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	205	is written $r_1 + r_2$.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	206	The \emph{language} or meaning of
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	207	a regular expression is defined recursively as
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	208	a set of strings:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	209	%TODO: FILL in the other defs
cc54ce075db5 restructured Chengsong parents: diff changeset	210	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	211	\begin{tabular}{lcl}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	212	$L \; \ZERO$ & $\dn$ & $\phi$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	213	$L \; \ONE$ & $\dn$ & $\{[]\}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	214	$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	218	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	219	\end{center}
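\noindent
For instance, unfolding these clauses for the regular expression $a \cdot (b + \ONE)$, which is picked here only as an example, yields:
\begin{center}
\begin{tabular}{lcl}
$L \; (a \cdot (b + \ONE))$ & $=$ & $L \; a \; @ \; L \; (b + \ONE)$\\
& $=$ & $\{[a]\} \; @ \; (\{[b]\} \cup \{[]\})$\\
& $=$ & $\{[a, b], \; [a]\}$
\end{tabular}
\end{center}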
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	220	\noindent
With the semantic derivative on languages, and regular expressions together with
their language interpretation, in place, we are ready to define derivatives on regular expressions.
cc54ce075db5 restructured Chengsong parents: diff changeset	223	\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	224	%Recall, the language derivative acts on a set of strings
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	225	%and essentially chops off a particular character from
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	226	%all strings in that set, Brzozowski defined a derivative operation on regular expressions
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	227	%so that after derivative $L(r\backslash c)$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	228	%will look as if it was obtained by doing a language derivative on $L(r)$:
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	229	Recall that the semantic derivative acts on a
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	230	language (set of strings).
One can decide whether a string $s$ belongs
to a language $S$ by taking the derivative with respect to
that string and then checking whether the empty
string is in the derivative:
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	235	\begin{center}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	236	\parskip \baselineskip
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	237	\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	238	\def\rlwd{.5pt}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	239	\newcommand\notate[3]{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	240	\unskip\def\useanchorwidth{T}%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	241	\setbox0=\hbox{#1}%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	242	\def\stackalignment{c}\stackunder[-6pt]{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	243	\def\stackalignment{c}\stackunder[-1.5pt]{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	244	\stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	245	\rule{\rlwd}{#2\baselineskip}}}{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	246	\strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	247	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	248	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	249	\notate{$\{ \ldots ,\;$
\notate{$s$}{1}{$(c_1 :: s_1)$}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	251	$, \; \ldots \}$
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	252	}{1}{$S_{start}$}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	253	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	254	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	255	$\stackrel{\backslash c_1}{\longrightarrow}$
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	256	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	257	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	258	$\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	259	$,\; \ldots \}$
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	260	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	261	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	262	$\stackrel{\backslash c_2}{\longrightarrow}$
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	263	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	264	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	265	$\{ \ldots,\; s_2
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	266	,\; \ldots \}$
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	267	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	268	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	269	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	270	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	271	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	272	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	273	S_{start}\backslash s$}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	274	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	275	\end{center}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	276	\begin{center}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	277	$s \in S_{start} \iff [] \in S_{end}$
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	278	\end{center}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	279	\noindent
Brzozowski noticed that this operation
can be ``mirrored'' on regular expressions, which
he calls the derivative of a regular expression $r$
with respect to a character $c$, written
$r \backslash c$. This infix operator
takes a regular expression $r$ as its left operand
and a character $c$ as its right operand, and
outputs a new regular expression.
The derivative operation on regular expressions
is defined such that the language of the derivative
coincides with the language derivative of the original
regular expression's language, taken
with respect to the same character:
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	293	\begin{center}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	294	\parskip \baselineskip
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	295	\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	296	\def\rlwd{.5pt}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	297	\newcommand\notate[3]{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	298	\unskip\def\useanchorwidth{T}%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	299	\setbox0=\hbox{#1}%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	300	\def\stackalignment{c}\stackunder[-6pt]{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	301	\def\stackalignment{c}\stackunder[-1.5pt]{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	302	\stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	303	\rule{\rlwd}{#2\baselineskip}}}{%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	304	\strut\kern8pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	305	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	306	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	307	\notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1,
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	308	\;\ldots\}$}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	309	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	310	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	311	$\stackrel{\backslash c}{\longrightarrow}$
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	312	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	313	\Longstack{
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	314	\notate{$r\backslash c$}{2}{$L \; (r\backslash c)=
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	315	\{\ldots,\;s_1,\;\ldots\}$}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	316	}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	317	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	318	\begin{center}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	319
cc54ce075db5 restructured Chengsong parents: diff changeset	320	\[
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	321	L(r \backslash c) = \Der \; c \; L(r)
532 cc54ce075db5 restructured Chengsong parents: diff changeset	322	\]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	323	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	324	\noindent
where we take derivatives of the regular expression
$r$ and test membership of a string $s$ by checking
whether the empty string is in the language of
$r\backslash s$.
For example, in the sequence case we have
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	330	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	331	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	332	$\Der \; c \; (A @ B)$ & $\dn$ &
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	333	$ \textit{if} \;\, [] \in A \;
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	334	\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	335	\Der \; c\; B$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	336	& & $\textit{else}\; (\Der \; c \; A) @ B$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	337	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	338	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	339	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	340	This can be translated to
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	341	regular expressions in the following
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	342	manner:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	343	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	344	\begin{tabular}{lcl}
$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,[] \in L(r_1)\; \textit{then}\;\, (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	346	& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	347	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	348	\end{center}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	349
cc54ce075db5 restructured Chengsong parents: diff changeset	350	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	351	And similarly, the Kleene star's semantic derivative
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	352	can be expressed as
532 cc54ce075db5 restructured Chengsong parents: diff changeset	353	\[
\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c \; A) @ (A*)
532 cc54ce075db5 restructured Chengsong parents: diff changeset	355	\]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	356	which translates to
532 cc54ce075db5 restructured Chengsong parents: diff changeset	357	\[
(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	359	\]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	360	In the above definition of $(r_1\cdot r_2) \backslash c$,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	361	the $\textit{if}$ clause's
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	362	boolean condition
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	363	$[] \in L(r_1)$ needs to be
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	364	somehow recursively computed.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	365	We call such a function that checks
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	366	whether the empty string $[]$ is
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	367	in the language of a regular expression $\nullable$:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	368	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	369	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	370	$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	371	$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	372	$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	373	$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	374	$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	375	$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	376	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	377	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	378	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	379	The $\ZERO$ regular expression
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	380	does not contain any string and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	381	therefore is not \emph{nullable}.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	382	$\ONE$ is \emph{nullable}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	383	by definition.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	384	The character regular expression $c$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	385	corresponds to the singleton set $\{c\}$,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	386	and therefore does not contain the empty string.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	387	The alternative regular expression is nullable
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	388	if at least one of its children is nullable.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	389	The sequence regular expression
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	390	would require both children to have the empty string
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	391	to compose an empty string, and the Kleene star
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	392	is always nullable because it naturally
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	393	contains the empty string.
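
\noindent
As a small worked illustration of these clauses (the regular expression
$(a + \ONE)\cdot b^*$ is chosen here purely as an example), unfolding the
definition gives
\begin{center}
\begin{tabular}{lcl}
$\nullable((a + \ONE)\cdot b^*)$ & $=$ & $\nullable(a + \ONE) \wedge \nullable(b^*)$\\
& $=$ & $(\nullable(a) \vee \nullable(\ONE)) \wedge \mathit{true}$\\
& $=$ & $(\mathit{false} \vee \mathit{true}) \wedge \mathit{true}$\\
& $=$ & $\mathit{true}$
\end{tabular}
\end{center}
\noindent
which agrees with the fact that $[]$ is in the language of $(a + \ONE)\cdot b^*$.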
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	394
The derivative function, written $r\backslash c$,
defines how a regular expression evolves into
a new one after every string in its language is acted on:
if a string starts with $c$, then that character is chopped off;
if not, the string is removed.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	400	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	401	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	402	$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	403	$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	404	$d \backslash c$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	405	$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	406	$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	407	$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	408	& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	409	& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	411	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	412	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	413	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	414	The most involved cases are the sequence case
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	415	and the star case.
The sequence case says that if the language of the first regular expression
contains the empty string, then the second component of the sequence
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	418	needs to be considered, as its derivative will contribute to the
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	419	result of this derivative.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	420	The star regular expression $r^*$'s derivative
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	421	unwraps one iteration of $r$, turns it into $r\backslash c$,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	422	and attaches the original $r^*$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	423	after $r\backslash c$, so that
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	424	we can further unfold it as many times as needed.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	425	We have the following correspondence between
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	426	derivatives on regular expressions and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	427	derivatives on a set of strings:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	428	\begin{lemma}\label{derDer}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	429	$\textit{Der} \; c \; L(r) = L (r\backslash c)$
cc54ce075db5 restructured Chengsong parents: diff changeset	430	\end{lemma}
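
\noindent
To illustrate the sequence and star clauses together, consider the
(purely illustrative) regular expression $a^* \cdot b$ and the character $a$.
Since $[] \in L(a^*)$, unfolding the definition gives
\begin{center}
\begin{tabular}{lcl}
$(a^* \cdot b)\backslash a$ & $=$ & $(a^*\backslash a) \cdot b \,+\, b\backslash a$\\
& $=$ & $((a\backslash a) \cdot a^*) \cdot b \,+\, \ZERO$\\
& $=$ & $(\ONE \cdot a^*) \cdot b \,+\, \ZERO$
\end{tabular}
\end{center}
\noindent
The language of the result is again the set of strings consisting of any
number of $a$s followed by a single $b$, which is exactly
$\Der \; a \; L(a^* \cdot b)$, as Lemma~\ref{derDer} predicts.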
cc54ce075db5 restructured Chengsong parents: diff changeset	431
cc54ce075db5 restructured Chengsong parents: diff changeset	432	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	433	The main property of the derivative operation
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	434	(that enables us to reason about the correctness of
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	435	derivative-based matching)
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	436	is
532 cc54ce075db5 restructured Chengsong parents: diff changeset	437
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	438	\begin{lemma}\label{derStepwise}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	439	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	440	\end{lemma}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	441
cc54ce075db5 restructured Chengsong parents: diff changeset	442	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	443	We can generalise the derivative operation shown above for single characters
cc54ce075db5 restructured Chengsong parents: diff changeset	444	to strings as follows:
cc54ce075db5 restructured Chengsong parents: diff changeset	445
cc54ce075db5 restructured Chengsong parents: diff changeset	446	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	447	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	448	$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
cc54ce075db5 restructured Chengsong parents: diff changeset	450	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	451	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	452
cc54ce075db5 restructured Chengsong parents: diff changeset	453	\noindent
When there is no ambiguity, we will
omit the subscript and use $\backslash$ instead
of $\backslash_s$ to denote
string derivatives for brevity.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	458	Brzozowski's regular-expression matcher algorithm can then be described as:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	459
cc54ce075db5 restructured Chengsong parents: diff changeset	460	\begin{definition}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	461	$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	462	\end{definition}
cc54ce075db5 restructured Chengsong parents: diff changeset	463
cc54ce075db5 restructured Chengsong parents: diff changeset	464	\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1\ldots c_{n-1}$,
this algorithm can be presented graphically as follows:
cc54ce075db5 restructured Chengsong parents: diff changeset	467
cc54ce075db5 restructured Chengsong parents: diff changeset	468	\begin{equation}\label{graph:successive_ders}
cc54ce075db5 restructured Chengsong parents: diff changeset	469	\begin{tikzcd}
cc54ce075db5 restructured Chengsong parents: diff changeset	470	r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
cc54ce075db5 restructured Chengsong parents: diff changeset	471	\end{tikzcd}
cc54ce075db5 restructured Chengsong parents: diff changeset	472	\end{equation}
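
\noindent
For instance (a small example of our own choosing), matching the string
$ab$ against $r_0 = a \cdot b$ proceeds as
\begin{center}
\begin{tabular}{lclcl}
$r_1$ & $=$ & $(a \cdot b)\backslash a$ & $=$ & $\ONE \cdot b$\\
$r_2$ & $=$ & $(\ONE \cdot b)\backslash b$ & $=$ & $\ZERO \cdot b \,+\, \ONE$
\end{tabular}
\end{center}
\noindent
The first step uses the $\mathit{else}$ branch of the sequence clause
because $a$ is not nullable; the second uses the $\mathit{then}$ branch
because $\ONE$ is. Finally, $\nullable(\ZERO \cdot b + \ONE)$ evaluates to
$\mathit{true}$, so the answer is YES.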
cc54ce075db5 restructured Chengsong parents: diff changeset	473
cc54ce075db5 restructured Chengsong parents: diff changeset	474	\noindent
It can be shown
relatively easily that this matcher is correct:
7cf9f17aa179 more Chengsong parents: 538 diff changeset	477	\begin{lemma}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	478	$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	479	\end{lemma}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	480	\begin{proof}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	481	By the stepwise property of derivatives (lemma \ref{derStepwise})
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	482	and lemma \ref{derDer}.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	483	\end{proof}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	484	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	485	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	486	\begin{figure}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	487	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	488	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	489	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	490	ylabel={time in secs},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	491	ymode = log,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	492	legend entries={Naive Matcher},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	493	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	494	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	495	\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
7cf9f17aa179 more Chengsong parents: 538 diff changeset	496	\end{axis}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	497	\end{tikzpicture}
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	499	\end{figure}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	500	\end{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	501	\noindent
If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow, as shown in
Figure~\ref{NaiveMatcher}.
Note that the $y$-axis is in logarithmic scale.
Around two dozen characters
already make the matcher explode on the regular expression
$(a^*)^*b$.
To remedy this, we need to introduce certain
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
One simple-minded simplification function
that meets these requirements is given below:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	516	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	517	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	518	$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	519	$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	520	& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	521	& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	522	& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	523	& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	524	& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	525	$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	526	& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	527	& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	528	& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	529	$\simp \; r$ & $\dn$ & $r$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	530
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	531	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	532	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	533	If we repeatedly apply this simplification
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	534	function during the matching algorithm,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	535	we have a matcher with simplification:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	536	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	537	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	538	$\derssimp \; [] \; r$ & $\dn$ & $r$\\
$\derssimp \; (c :: cs) \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	540	$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	541	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	542	\end{center}
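
\noindent
As a small illustration of $\simp$ (the input below is our own example,
namely the unsimplified derivative of $a^* \cdot b$ with respect to $a$),
the clauses above give
\begin{center}
\begin{tabular}{lcl}
$\simp \; (\ONE \cdot a^*)$ & $=$ & $a^*$\\
$\simp \; ((\ONE \cdot a^*) \cdot b)$ & $=$ & $a^* \cdot b$\\
$\simp \; ((\ONE \cdot a^*) \cdot b + \ZERO)$ & $=$ & $a^* \cdot b$
\end{tabular}
\end{center}
\noindent
so in this case $\derssimp \; [a] \; (a^* \cdot b)$ returns $a^* \cdot b$
itself, and the intermediate regular expression does not grow.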
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	543	\begin{figure}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	544	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	545	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	546	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	547	ylabel={time in secs},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	548	ymode = log,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	549	xmode = log,
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	550	grid = both,
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	551	legend entries={Matcher With Simp},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	552	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	553	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	554	\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
7cf9f17aa179 more Chengsong parents: 538 diff changeset	555	\end{axis}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	556	\end{tikzpicture}
\caption{$(a^*)^*b$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	558	against
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	559	$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	560	\end{figure}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	561	\noindent
The running time of $\textit{ders}\_\textit{simp}$
on the same example as in Figure~\ref{NaiveMatcher}
is now very tame in terms of the length of inputs,
as shown in Figure~\ref{BetterMatcher}.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	566
To summarise, the matcher builds derivatives and then tests
whether the empty string is in the resulting regular expression's language,
applying simplifications when necessary.
So far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	574
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	575	\section{Values and the Lexing Algorithm by Sulzmann and Lu}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	576	In this section, we present a two-phase regular expression lexing
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	577	algorithm.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	578	The first phase takes successive derivatives with
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	579	respect to the input string,
and the second phase does the reverse, \emph{injecting} back
characters while constructing a lexing result.
We will introduce the injection phase in detail slightly
later, but as a preliminary we first have to define
the datatype for lexing results,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	585	called \emph{value} or
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	586	sometimes also \emph{lexical value}. Values and regular
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	587	expressions correspond to each other as illustrated in the following
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	588	table:
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	589
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	590	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	591	\begin{tabular}{c@{\hspace{20mm}}c}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	592	\begin{tabular}{@{}rrl@{}}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	593	\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	594	$r$ & $::=$ & $\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	595	& $\mid$ & $\ONE$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	596	& $\mid$ & $c$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	597	& $\mid$ & $r_1 \cdot r_2$\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	598	& $\mid$ & $r_1 + r_2$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	599	\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	600	& $\mid$ & $r^*$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	601	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	602	&
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	603	\begin{tabular}{@{\hspace{0mm}}rrl@{}}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	604	\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	605	$v$ & $::=$ & \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	606	& & $\Empty$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	607	& $\mid$ & $\Char(c)$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	608	& $\mid$ & $\Seq\,v_1\, v_2$\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	609	& $\mid$ & $\Left(v)$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	610	& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots,\,v_n]$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	612	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	613	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	614	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	615	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	616	A value has an underlying string, which
can be calculated by the ``flatten'' function $\|\_\|$:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	618	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	619	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	620	$\|\Empty\|$ & $\dn$ & $[]$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	621	$\|\Char \; c\|$ & $ \dn$ & $ [c]$\\
$\|\Seq(v_1, v_2)\|$ & $ \dn$ & $ \|v_1\| @ \|v_2\|$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	623	$\|\Left(v)\|$ & $ \dn$ & $ \|v\|$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	624	$\|\Right(v)\|$ & $ \dn$ & $ \|v\|$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	625	$\|\Stars([])\|$ & $\dn$ & $[]$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	626	$\|\Stars(v::vs)\|$ & $\dn$ & $ \|v\| @ \|\Stars(vs)\|$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	627	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	628	\end{center}
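\noindent
For example (the value below is chosen only for illustration), flattening
$\Seq(\Left(\Char \; a), \Stars([\Char \; b, \Char \; b]))$ proceeds as
\begin{center}
\begin{tabular}{lcl}
$\|\Seq(\Left(\Char \; a), \Stars([\Char \; b, \Char \; b]))\|$ & $=$ & $\|\Left(\Char \; a)\| \,@\, \|\Stars([\Char \; b, \Char \; b])\|$\\
& $=$ & $[a] \,@\, ([b] \,@\, ([b] \,@\, []))$\\
& $=$ & $[a, b, b]$
\end{tabular}
\end{center}
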
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	629	Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	630	to indicate that a value $v$ could be generated from a lexing algorithm
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	631	with input $r$. They call it the value inhabitation relation.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	632	\begin{mathpar}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	633	\inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	634
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	635	\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	636
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	637	\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	638
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	639	\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	640
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	641	\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	642
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	643	\inferrule{\forall v \in vs. \vdash v:r \land \|v\| \neq []}{\vdash \Stars(vs):r^*}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	644	\end{mathpar}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	645	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	646	The condition $\|v\| \neq []$ in the premise of star's rule
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	647	is to make sure that for a given pair of regular
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	648	expression $r$ and string $s$, the number of values
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	649	satisfying $\|v\| = s$ and $\vdash v:r$ is finite.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	650	Given the same string and regular expression, there can be
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	651	multiple values for it. For example, both
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	652	$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	653	$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	654	and the values both flatten to $abc$.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	655	Lexers therefore have to disambiguate and choose only
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	656	one of the values to output. $\POSIX$ is one of the
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	657	disambiguation strategies that is widely adopted.
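
To make the ambiguity concrete, the two values can be written out in full.
This is only an unfolding of the abbreviations $ab$, $bc$, $a$ and $c$ used above,
on the assumption that they stand for the evident sequence and character values;
the names $u_1$ and $u_2$ are introduced here purely for illustration:
\begin{center}
\begin{tabular}{lcl}
$u_1$ & $=$ & $\Seq \; (\Left \; (\Seq \; (\Char \; a) \; (\Char \; b))) \; (\Right \; (\Char \; c))$\\
$u_2$ & $=$ & $\Seq \; (\Right \; (\Char \; a)) \; (\Left \; (\Seq \; (\Char \; b) \; (\Char \; c)))$
\end{tabular}
\end{center}
\noindent
Flattening concatenates the characters at the leaves from left to right, so
$\|u_1\| = ab \,@\, c = abc$ and $\|u_2\| = a \,@\, bc = abc$.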
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	658
Ausaf, Dyckhoff and Urban~\parencite{AusafDyckhoffUrban2016}
formalised this property
as a ternary relation.
The $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, is specified
by the following set of rules\footnote{The names of the rules are used
as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	666	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	667	\begin{figure}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	668	\begin{mathpar}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	669	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	670
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	671	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	672
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	673	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	674
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	675	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	676
\inferrule[PS]{(s_1, r_1) \rightarrow v_1 \\ (s_2, r_2)\rightarrow v_2\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	678	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	679	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	680	\Seq \; v_1 \; v_2}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	681
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	682	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	683
\inferrule[P*]{(s_1, r) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
\|v\| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
(v::vs)}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	688	\end{mathpar}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	689	\caption{POSIX Lexing Rules}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	690	\end{figure}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	691	\noindent
The above $\POSIX$ rules follow the intuition described below:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	693	\begin{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	694	\item (Left Priority)\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	695	Match the leftmost regular expression when multiple options of matching
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	696	are available.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	697	\item (Maximum munch)\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	698	Always match a subpart as much as possible before proceeding
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	699	to the next token.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	700	\end{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	701	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	702	These disambiguation strategies can be
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	703	quite practical.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	704	For instance, when lexing a code snippet
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	705	\[
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	706	\textit{iffoo} = 3
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	707	\]
using the regular expression (with
keyword and identifier having their
usual definitions from any formal
language textbook, for instance an
identifier is a nonempty string starting with a letter
followed by alphanumeric characters or underscores):
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	714	\[
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	715	\textit{keyword} + \textit{identifier},
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	716	\]
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	717	we want $\textit{iffoo}$ to be recognized
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	718	as an identifier rather than a keyword (if)
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	719	followed by
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	720	an identifier (foo).
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	721	POSIX lexing achieves this.
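
The same preference for the longest initial match can be checked by hand on the
earlier ambiguous example $(ab+a)(bc+c)$ with the string $abc$.
The following is a sketch of how the side condition of rule PS separates the
two candidate splits, using the same abbreviations for values as before:
\begin{itemize}
\item
Split $s_1 = ab$, $s_2 = c$.
We have $(ab, ab+a) \rightarrow \Left \; ab$ by P+L, and
$(c, bc+c) \rightarrow \Right \; c$ by P+R because $c \notin L \; (bc)$.
The only nonempty prefix of $s_2$ is $s_3 = c$ (with $s_4 = []$), and
$s_1 @ s_3 = abc \notin L \; (ab+a)$, so the side condition holds and PS yields
\[
(abc, (ab+a)(bc+c)) \rightarrow \Seq \; (\Left \; ab) \; (\Right \; c).
\]
\item
Split $s_1 = a$, $s_2 = bc$.
Choosing $s_3 = b$ and $s_4 = c$ gives $s_1 @ s_3 = ab \in L \; (ab+a)$ and
$s_4 = c \in L \; (bc+c)$, which violates the side condition.
Hence $\Seq \; (\Right \; a) \; (\Left \; bc)$ is a value of the regular
expression, but not the $\POSIX$ one.
\end{itemize}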
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	722
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	723	We know that a $\POSIX$ value is also a normal underlying
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	724	value:
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	725	\begin{lemma}
$(s, r) \rightarrow v \implies \vdash v: r$
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	727	\end{lemma}
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	728	\noindent
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	729	The good property about a $\POSIX$ value is that
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	730	given the same regular expression $r$ and string $s$,
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	731	one can always uniquely determine the $\POSIX$ value for it:
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	732	\begin{lemma}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	733	$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	734	\end{lemma}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	735	\begin{proof}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	736	By induction on $s$, $r$ and $v_1$. The inductive cases
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	737	are all the POSIX rules.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	738	Probably the most cumbersome cases are
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	739	the sequence and star with non-empty iterations.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	740	We shall give the details for proving the sequence case here.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	741
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	742	When we have
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	743	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	744	(s_1, r_1) \rightarrow v_1 \;\, and \;\,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	745	(s_2, r_2) \rightarrow v_2 \;\, and \;\,\\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	746	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	747	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	748	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	749	we know that the last condition
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	750	excludes the possibility of a
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	751	string $s_1'$ longer than $s_1$ such that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	752	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	753	(s_1', r_1) \rightarrow v_1' \;\;
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	754	and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	755	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	756	hold.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	757	A shorter string $s_1''$ with $s_2''$ satisfying
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	758	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	759	(s_1'', r_1) \rightarrow v_1''
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	760	\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	761	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	762	cannot possibly form a $\POSIX$ value either, because
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	763	by definition there is a candidate
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	764	with longer initial string
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	765	$s_1$. Therefore, we know that the POSIX
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	766	value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	767	$s$ must have the
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	768	property that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	769	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	770	\|a\| = s_1 \;\; and \;\; \|b\| = s_2.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	771	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	772	The goal is to prove that $a = v_1 $ and $b = v_2$.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	773	If we have some other POSIX values $v_{10}$ and $v_{20}$ such that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	774	$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	775	then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$,
which means this ``other'' $\POSIX$ value $\Seq(v_{10}, v_{20})$
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	777	is the same as $\Seq(v_1, v_2)$.
7cf9f17aa179 more Chengsong parents: 538 diff changeset	778	\end{proof}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	779	\noindent
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	780	Now we know what a $\POSIX$ value is and why it is unique;
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	781	the problem is generating
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	782	such a value in a lexing algorithm using derivatives.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	783
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	784	\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	785
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	786	Sulzmann and Lu extended Brzozowski's
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	787	derivative-based matching
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	788	to a lexing algorithm by a second pass
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	789	after the initial phase of successive derivatives.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	790	This second phase generates a POSIX value
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	791	if the regular expression matches the string.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	792	Two functions are involved: $\inj$ and $\mkeps$.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	793	The first one used is $\mkeps$, which constructs a POSIX value from the last
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	794	derivative $r_n$:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	795	\begin{ceqn}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	796	\begin{equation}\label{graph:mkeps}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	797	\begin{tikzcd}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	798	r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	799	& & & v_n
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	800	\end{tikzcd}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	801	\end{equation}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	802	\end{ceqn}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	803	\noindent
In the above diagram, we again assume that
the input string $s$ consists of $n$ characters
$c_0c_1 \ldots c_{n-1}$. The input regular expression $r$
is labelled $r_0$, and after each character $c_i$ is taken off
by the derivative operation, the resulting derivative regular
expression is $r_{i+1}$. The last derivative operation
$\backslash c_{n-1}$ gives back $r_n$, which is transformed into
a value $v_n$ by $\mkeps$.
The value $v_n$ tells us how the empty string is matched by the (nullable)
regular expression $r_n$, in a $\POSIX$ way.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	814	The definition of $\mkeps$ is
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	815	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	816	\begin{tabular}{lcl}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	817	$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	818	$\mkeps \; (r_{1}+r_{2})$ & $\dn$
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	819	& $\textit{if}\; (\nullable \; r_{1}) \;\,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	820	\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	821	& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	822	$\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	823	$\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	824	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	825	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	826
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	827
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	828	\noindent
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	829	We favour the left child $r_1$ of $r_1 + r_2$
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	830	to match an empty string if there is a choice.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	831	When there is a star for us to match the empty string,
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	832	we give the $\Stars$ constructor an empty list, meaning
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	833	no iteration is taken.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	834	The result of a call to $\mkeps$ on a $\nullable$ $r$ would
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	835	be a $\POSIX$ value corresponding to $r$:
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	836	\begin{lemma}\label{mePosix}
$\nullable\; r \implies ([], r) \rightarrow (\mkeps\; r)$
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	838	\end{lemma}
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	839	\begin{proof}
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	840	By induction on the shape of $r$.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	841	\end{proof}
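
As a small worked instance of the clauses for $\mkeps$, consider the nullable
regular expression $(a^* + \ONE) \cdot b^*$; it is chosen here purely for
illustration and does not appear elsewhere in the development:
\begin{center}
\begin{tabular}{lcl}
$\mkeps \; ((a^* + \ONE) \cdot b^*)$ & $=$ & $\Seq \; (\mkeps \; (a^* + \ONE)) \; (\mkeps \; b^*)$\\
& $=$ & $\Seq \; (\Left \; (\mkeps \; a^*)) \; (\Stars \; [])$\\
& $=$ & $\Seq \; (\Left \; (\Stars \; [])) \; (\Stars \; [])$
\end{tabular}
\end{center}
\noindent
Both alternatives of $a^* + \ONE$ are nullable, and the $\textit{if}$-branch
picks the left one, so the result contains $\Left \; (\Stars \; [])$ rather
than $\Right \; \Empty$.
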
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	842	\noindent
After the $\mkeps$-call, we inject back the characters one by one,
in the reverse order in which they were chopped off in the derivative phase.
The function for this is called $\inj$. Note that $\inj$ and $\backslash$
are not exactly inverse operations of one another, as $\inj$
operates on values instead of regular
expressions.
In the diagram below, $v_i$ stands for the (POSIX) value
describing how the regular expression
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$); it is computed from the previous lexical value $v_{i+1}$.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	853	After injecting back $n$ characters, we get the lexical value for how $r_0$
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	854	matches $s$.
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	855	\begin{ceqn}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	856	\begin{equation}\label{graph:inj}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	857	\begin{tikzcd}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	858	r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	859	v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	860	\end{tikzcd}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	861	\end{equation}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	862	\end{ceqn}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	863	\noindent
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	864	$\textit{inj}$ takes three arguments: a regular
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	865	expression ${r_{i}}$, before the character is chopped off,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	866	a character ${c_{i}}$, the character we want to inject back and
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	867	the third argument $v_{i+1}$ the value we want to inject into.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	868	The result of an application
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	869	$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	870	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	871	(s_i, r_i) \rightarrow v_i
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	872	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	873	holds.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	874	The definition of $\textit{inj}$ is as follows:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	875	\begin{center}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	876	\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
$\textit{inj}\;(c)\;c\;\Empty$ & $\dn$ & $\Char\,c$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	878	$\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	879	$\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	880	$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ &
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	881	$\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	882	$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ &
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	883	$\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	884	$\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	885	$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	886	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	887	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	888
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	889	\noindent
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	890	The function does a recursion on
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	891	the shape of regular
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	892	expression $r_i$ and value $v_{i+1}$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	893	Intuitively, each clause analyses
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	894	how $r_i$ could have transformed when being
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	895	derived by $c$, identifying which subpart
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	896	of $v_{i+1}$ has the ``hole''
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	897	to inject the character back into.
Once the character is
injected back into that sub-value,
$\inj$ assembles everything
to form a new value.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	902
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	903	For instance, the last clause is an
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	904	injection into a sequence value $v_{i+1}$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	905	whose second child
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	906	value is a star, and the shape of the
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	907	regular expression $r_i$ before injection
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	908	is a star.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	909	We therefore know
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	910	the derivative
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	911	starts on a star and ends as a sequence:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	912	\[
(r^*) \backslash c \longrightarrow (r\backslash c) \cdot r^*
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	914	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	915	during which an iteration of the star
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	916	had just been unfolded, giving the below
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	917	value inhabitation relation:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	918	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	919	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	920	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	921	The value list $vs$ corresponds to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	922	matched star iterations,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	923	and the ``hole'' lies in $v$ because
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	924	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	925	\vdash v: r\backslash c.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	926	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	927	Finally,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	928	$\inj \; r \;c \; v$ is prepended
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	929	to the previous list of iterations, and then
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	930	wrapped under the $\Stars$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	931	constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
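
The star clause can be traced on a one-character run.
The following sketch assumes the usual derivative clauses
$a \backslash a = \ONE$ and $(r^*)\backslash c = (r \backslash c)\cdot r^*$,
with no simplification applied; the choice of $r_0 = a^*$ and $s = a$ is
only for illustration:
\begin{center}
\begin{tabular}{lcl}
$r_1$ & $=$ & $(a^*) \backslash a \;=\; (a \backslash a) \cdot a^* \;=\; \ONE \cdot a^*$\\
$v_1$ & $=$ & $\mkeps \; (\ONE \cdot a^*) \;=\; \Seq \; \Empty \; (\Stars \; [])$\\
$v_0$ & $=$ & $\inj \; (a^*) \; a \; (\Seq \; \Empty \; (\Stars \; []))$\\
& $=$ & $\Stars \; ((\inj \; (a) \; a \; \Empty) :: [])$\\
& $=$ & $\Stars \; [\Char \; a]$
\end{tabular}
\end{center}
\noindent
The ``hole'' is the $\Empty$ in the first component of $v_1$; filling it
with $a$ gives the single iteration $\Char \; a$, and indeed $\|v_0\| = a$.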
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	932
Recall that Lemma~\ref{mePosix}
tells us that
$\mkeps$ always selects the POSIX value among
multiple values that flatten to the empty string.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	937	Now $\inj$ preserves the POSIXness, provided
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	938	the value before injection is POSIX:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	939	\begin{lemma}\label{injPosix}
If
\[
(s, r \backslash c) \rightarrow v
\]
then
\[
(c :: s, r) \rightarrow (\inj \; r \; c\; v).
\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	948	\end{lemma}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	949	\begin{proof}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	950	By induction on $r$.
The interesting cases are sequence and star.
When $r = a \cdot b$, there are
three possible shapes for a value $v$ satisfying $\vdash v:(a \cdot b)\backslash c$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	954	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	955	case.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	956	\begin{itemize}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	957	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	958	$v = \Seq \; v_a \; v_b$.\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	959	The ``not nullable'' clause of the $\inj$ function is taken:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	960	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	961	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	962	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	963	& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	964	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	965	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	966	We know that there exists a unique pair of
$s_a$ and $s_b$ satisfying
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	968	$(a \backslash c, s_a) \rightarrow v_a$,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	969	$(b , s_b) \rightarrow v_b$, and
$\nexists s_3 \; s_4. \; s_3 \neq [] \land s_3 @ s_4 = s_b \land s_a @ s_3 \in
L \; (a\backslash c) \land
s_4 \in L \; b$.
The last condition gives us
$\nexists s_3 \; s_4. \; s_3 \neq [] \land s_3 @ s_4 = s_b \land (c :: s_a )@ s_3 \in
L \; a \land
s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	977	By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	978	and this gives us
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	979	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	980	(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	981	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	982	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	983	$v = \Left \; (\Seq \; v_a \; v_b)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	984	The argument is almost identical to the above case,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	985	except that a different clause of $\inj$ is taken:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	986	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	987	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	988	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	989	& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	990	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	991	\end{center}
By similar reasoning,
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	994	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	995	(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	996	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	997	again holds.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	998	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	999	$v = \Right \; v_b$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1000	Again the injection result would be
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1001	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1002	\begin{tabular}{lcl}
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Right \; v_b) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1004	& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1005	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1006	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1007	We know that $a$ must be nullable,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1008	allowing us to call $\mkeps$ and get
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1009	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1010	(a, []) \rightarrow \mkeps \; a.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1011	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1012	Also by inductive hypothesis
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1013	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1014	(b, c::s) \rightarrow \inj\; b \; c \; v_b
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1015	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1016	holds.
In addition, since the POSIX value $v$ is
$\Right \;v_b$ rather than $\Left \ldots$,
it must be the case
that $s \notin L \;( (a\backslash c)\cdot b)$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1021	This tells us that
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1022	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1023	\nexists s_3 \; s_4.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1024	s_3 @s_4 = s \land s_3 \in L \; (a\backslash c)
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1025	\land s_4 \in L \; b
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1026	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1027	which translates to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1028	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1029	\nexists s_3 \; s_4. \; s_3 \neq [] \land
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1030	s_3 @s_4 = c::s \land s_3 \in L \; a
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1031	\land s_4 \in L \; b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1032	\]
(In other words, $a$ cannot match any non-empty prefix of $c::s$ in such a way
that $b$ matches the remainder, so the first component must match the empty string.)
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1035	Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1036	as the POSIX value for $a\cdot b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1037	\end{itemize}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1038	The star case can be proven similarly.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1039	\end{proof}
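\noindent
As a small illustration of the $\Right$ case, consider the following hand computation.
It is a sketch only: the symbols $\mathbf{1}$ (the regular expression matching the empty string)
and $\textit{Empty}$ (its value) are notation assumed here.
Take $r = a^* \cdot a$, $c = a$ and $s = []$, so that the two components of $r$ are $a^*$ and $a$:
% Illustrative example, not part of the formal development;
% \mathbf{1} and \textit{Empty} are assumed notation.
\begin{center}
\begin{tabular}{lcl}
$r \backslash a$ & $=$ & $(\mathbf{1} \cdot a^*) \cdot a + \mathbf{1}$\\
$\mkeps \; (r \backslash a)$ & $=$ & $\Right \; \textit{Empty}$\\
$\inj \; r \; a \; (\Right \; \textit{Empty})$ & $=$ & $\Seq \; (\mkeps \; a^*) \; (\inj \; a \; a \; \textit{Empty})$\\
& $=$ & $\Seq \; (\Stars \; []) \; (\Char \; a)$
\end{tabular}
\end{center}
Here $[] \notin L \; ((\mathbf{1} \cdot a^*) \cdot a)$, so the POSIX value for the derivative
is a $\Right$-value.
The injected value lets $a^*$ match the empty string: if $a^*$ consumed the single
character, nothing would be left for the second component to match.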
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1040	\noindent
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1041	Putting all the functions $\inj$, $\mkeps$, $\backslash$ together
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1042	by following the procedure outlined in the diagram \ref{graph:inj},
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1043	and taking into consideration the possibility of a non-match,
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1044	a lexer can be built with the following recursive definition:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1045	\begin{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1046	\begin{tabular}{lcl}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1047	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;(c::s)$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1049	& & $\quad \phantom{\mid}\; \None \implies \None$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1050	& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1051	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1052	\end{center}
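\noindent
To see how the pieces fit together, the definition can be unfolded by hand on
$r = a^*$ and the one-character string $a$.
This is a sketch only: $\mathbf{1}$ and $\textit{Empty}$ are notation assumed here
for the empty-string regular expression and its value.
% Illustrative trace, not part of the formal development;
% \mathbf{1} and \textit{Empty} are assumed notation.
\begin{center}
\begin{tabular}{lcl}
$a^* \backslash a$ & $=$ & $\mathbf{1} \cdot a^*$\\
$\lexer \; (\mathbf{1} \cdot a^*) \; []$ & $=$ & $\Some(\mkeps \; (\mathbf{1} \cdot a^*))$\\
& $=$ & $\Some(\Seq \; \textit{Empty} \; (\Stars \; []))$\\
$\lexer \; a^* \; (a::[])$ & $=$ & $\Some(\inj \; a^* \; a \; (\Seq \; \textit{Empty} \; (\Stars \; [])))$\\
& $=$ & $\Some(\Stars \; [\inj \; a \; a \; \textit{Empty}])$\\
& $=$ & $\Some(\Stars \; [\Char \; a])$
\end{tabular}
\end{center}
The first phase takes the derivative and finds it nullable, $\mkeps$ builds the value for
the empty string, and the second phase injects the character $a$ back into that value.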
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1053	\noindent
The central property of the $\lexer$ is that it gives the correct result according to the
$\POSIX$ standard:
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1056	\begin{theorem}\label{lexerCorrectness}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1057	The $\lexer$ based on derivatives and injections is correct:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1058	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1059	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1060	$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1061	$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1062	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1063	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1064	\end{theorem}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1065	\begin{proof}
By induction on $s$, with $r$ ranging over arbitrary regular expressions.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1067	The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1068	by lemma \ref{injPosix}.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1069	\end{proof}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1070	\noindent
As we did earlier in this chapter with the matcher, one can
introduce simplification on the regular expressions.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1073	However, now one needs to do a backward phase and make sure
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1074	the values align with the regular expressions.
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1075	Therefore one has to
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1076	be careful not to break the correctness, as the injection
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1077	function heavily relies on the structure of the regexes and values
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1078	being correct and matching each other.
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1079	It can be achieved by recording some extra rectification functions
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1080	during the derivatives step, and applying these rectifications in
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1081	each run during the injection phase.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1082	With extra care
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1083	one can show that POSIXness will not be affected---although it is much harder
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1084	to establish.
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1085	Some initial results in this regard have been
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1086	obtained in \cite{AusafDyckhoffUrban2016}.
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1087
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1088	However, with all the simplification rules allowed
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1089	in an injection-based lexer, one could still end up in
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1090	trouble, when cases that require more involved and aggressive
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1091	simplifications arise.
\section{A Case Requiring More Aggressive Simplifications}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1093	For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1095	w.r.t.~the character $a$, one obtains a derivative regular expression
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1096	with more than 9000 nodes (when viewed as a tree)
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1097	even with simplification.
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1098	\begin{figure}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1099	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1100	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1101	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1102	ylabel={size},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1103	legend entries={Naive Matcher},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1104	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1105	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1106	\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1107	\end{axis}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1108	\end{tikzpicture}
\caption{Size of $(a^* \cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
\end{figure}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1111
That is because Sulzmann and Lu's
injection-based lexing algorithm keeps a lot of
``useless'' values that will never be part of the final result.
The number of these different ways of matching grows exponentially with the string length.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1116	Take
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1117	\[
r= (a^* \cdot a^*)^* \quad \text{and} \quad
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1119	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1120	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1121	as an example.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1122	This is a highly ambiguous regular expression, with
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1123	many ways to split up the string into multiple segments for
different star iterations,
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1125	and for each segment
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1126	multiple ways of splitting between
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1127	the two $a^*$ sub-expressions.
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1128	When $n$ is equal to $1$, there are two lexical values for
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1129	the match:
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1130	\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (\text{value 1})
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1132	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1133	and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1134	\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (\text{value 2})
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1136	\]
The derivative $\derssimp \;s \; r$ is
\[
(a^*a^* + a^*)\cdot(a^*a^*)^*.
\]
The $a^*a^*$ and $a^*$ in the first child of the above sequence
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1142	correspond to value 1 and value 2, respectively.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1143	When $n=2$, the number goes up to 7:
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])],
\]
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])],
\]
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])],
\]
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ],
\]
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []),
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
],
\]
\[
\Stars \; [
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
]
\]
and
\[
\Stars \; [
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
].
\]
And $\derssimp \; aa \; (a^*a^*)^*$ would be
\[
((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* +
(a^*a^* + a^*)\cdot(a^*a^*)^*,
\]
which removes two out of the seven terms corresponding to the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1185	seven distinct lexical values.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1186
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1187	It is not surprising that there are exponentially many
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1188	distinct lexical values that cannot be eliminated by
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1189	the simple-minded simplification of $\derssimp$.
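
To make the growth concrete, the lexical values can be counted by hand.
The following is a sketch under the assumption, consistent with the counts $2$ and $7$ above,
that every star iteration matches a non-empty string; the name $V$ is introduced
here only for illustration.
Write $V(n)$ for the number of lexical values of $(a^* \cdot a^*)^*$ for the string of $n$ $a$s.
The first iteration of the outer star consumes some $m \geq 1$ characters, which can be
distributed over the two $a^*$ sub-expressions in $m+1$ ways, and the remaining
$n-m$ characters can be matched in $V(n-m)$ ways:
% V is a hypothetical helper name used only in this illustration.
\[
V(0) = 1, \qquad V(n) = \sum_{m=1}^{n} (m+1) \cdot V(n-m).
\]
This gives $V(1) = 2$, $V(2) = 3 \cdot 1 + 2 \cdot 2 = 7$ and
$V(3) = 4 \cdot 1 + 3 \cdot 2 + 2 \cdot 7 = 24$.
Since the summand for $m = 1$ alone contributes $2 \cdot V(n-1)$, we have $V(n) \geq 2^n$.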
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1190
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1191	A lexer without a good enough strategy to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1192	deduplicate will naturally
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1193	have an exponential runtime on ambiguous regular expressions.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1194
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1195	On the other hand, the
$\POSIX$ value for $r= (a^* \cdot a^*)^*$ and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1197	$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1198	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1199	\Stars\,
[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]) \; (\Stars\,[])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1201	\]
and at any moment the subterms in a regular expression
that can result in a POSIX value are only
a minority among the many other terms,
so one can remove those that cannot possibly
be POSIX.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1207	In the above example,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1208	\[
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1211	\]
can be further simplified by
removing the duplicate of $A$ first,
which opens up the possibility
of a further simplification that removes the
second summand, as it has then become a duplicate of the first.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1217	The result would be
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1218	\[
(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1220	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1221	with corresponding values
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1222	\begin{center}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1223	\begin{tabular}{lr}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1224	$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1225	$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1226	\end{tabular}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1227	\end{center}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1228	Other terms with an underlying value such as
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1229	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1230	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1231	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1232	cannot contribute to a POSIX lexical value,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1233	and are therefore thrown away.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1234
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1235	Ausaf, Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1236	have come up with some simplification steps; however, those steps
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1237	are not yet strong enough to achieve the above effects.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1238	Even with these relatively mild simplifications, the correctness proof
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1239	is already considerably more complicated than that of theorem \ref{lexerCorrectness}.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1240	One would need to prove something like:
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1241	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1242	\textit{If}\; (\textit{fst} \; (\textit{simp} \; (r\backslash c)), s) \rightarrow v \;\;
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1243	\textit{then}\;\; (r, c::s) \rightarrow
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1244	\inj\;\, r\, \;c \;\, ((\textit{snd} \; (\textit{simp} \; (r \backslash c)))\; v)
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1245	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1246	instead of the simple lemma \ref{injPosix}, where $\textit{simp}$ now
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1247	has to return not only a simplified regular expression,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1248	but also a record of which simplifications
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1249	have been done, in the form of a function on values
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1250	that transforms a value
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1251	for the simplified regular expression
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1252	into one for the unsimplified regular expression.
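
To make the idea of a simplification that also returns a function on values more concrete, the following is a minimal Python sketch. It is not the formalised definition: the tuple encoding of regular expressions and values, the name \texttt{simp} as a Python function, and the choice of exactly four rules ($\ZERO + r$, $r + \ZERO$, $\ONE \cdot r$ and $r \cdot \ONE$) are assumptions made for illustration only. The first component of the result is the simplified regular expression and the second is the function that rebuilds a value for the unsimplified one.

```python
# Hypothetical tuple encoding, chosen only for this sketch:
#   regexes: ("ZERO",), ("ONE",), ("CHAR", c), ("ALT", r1, r2),
#            ("SEQ", r1, r2), ("STAR", r)
#   values:  ("Empty",), ("Char", c), ("Left", v), ("Right", v),
#            ("Seq", v1, v2), ("Stars", [v, ...])
ZERO, ONE = ("ZERO",), ("ONE",)


def simp(r):
    """Return (simplified regex, function mapping a value of the
    simplified regex back to a value of the original regex r)."""
    tag = r[0]
    if tag == "ALT":
        r1, f1 = simp(r[1])
        r2, f2 = simp(r[2])
        if r1 == ZERO:  # 0 + r2  ~>  r2; the value came from the right branch
            return r2, lambda v: ("Right", f2(v))
        if r2 == ZERO:  # r1 + 0  ~>  r1; the value came from the left branch
            return r1, lambda v: ("Left", f1(v))

        def f_alt(v):
            # No rule fired at this node: rectify inside the chosen branch.
            if v[0] == "Left":
                return ("Left", f1(v[1]))
            return ("Right", f2(v[1]))

        return ("ALT", r1, r2), f_alt
    if tag == "SEQ":
        r1, f1 = simp(r[1])
        r2, f2 = simp(r[2])
        if r1 == ONE:  # 1 . r2  ~>  r2; re-insert the Empty value on the left
            return r2, lambda v: ("Seq", f1(("Empty",)), f2(v))
        if r2 == ONE:  # r1 . 1  ~>  r1; re-insert the Empty value on the right
            return r1, lambda v: ("Seq", f1(v), f2(("Empty",)))
        return ("SEQ", r1, r2), lambda v: ("Seq", f1(v[1]), f2(v[2]))
    # ZERO, ONE, CHAR and STAR are left untouched in this sketch.
    return r, lambda v: v


# Usage: simplify 0 + (1 . a) and rectify a value of the simplified regex.
simplified, rectify = simp(("ALT", ZERO, ("SEQ", ONE, ("CHAR", "a"))))
print(simplified)                 # ('CHAR', 'a')
print(rectify(("Char", "a")))     # ('Right', ('Seq', ('Empty',), ('Char', 'a')))
```

Even in this small sketch, every simplification rule needs a matching case in the function on values, which hints at why the correctness proof grows with each additional rule.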
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1253
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1254	We therefore choose a slightly different approach to
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1255	obtain stronger simplifications, one that uses
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1256	data structures that augment
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1257	plain regular expressions.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1258	We call them \emph{annotated}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1259	regular expressions.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1260	With annotated regular expressions,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1261	we can avoid creating the intermediate values $v_1,\ldots, v_n$ and a
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1262	second phase altogether.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1263	At the same time, we can ensure that simplifications
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1264	are easily handled without breaking the correctness of the algorithm.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1265	We introduce this new datatype and the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1266	corresponding algorithm in the next chapter.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1267
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1268
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1269
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1270

author	Chengsong
	Sun, 14 Aug 2022 09:44:27 +0100
changeset 577	f47fc4840579
parent 573	454ced557605
child 579	35df9cdd36ca
permissions	-rwxr-xr-x