% Chapter Template

\chapter{Regular Expressions and POSIX Lexing} % Main chapter title

\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
%use for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant, and discuss
%why more aggressive simplifications are needed.

In this chapter, we define the basic notions
for regular languages and regular expressions.
We also give the definition of what $\POSIX$ lexing means.

\section{Basic Concepts}
Usually in formal language theory there is an alphabet
denoting a set of characters.
Here we only use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII characters.
Then, using the usual $[]$ notation for lists,
we can define strings made up of characters:
\begin{center}
\begin{tabular}{lcl}
$\textit{string}$ & $\dn$ & $[] \mid c :: cs$\\
& & $(c\; \text{has char type})$
\end{tabular}
\end{center}
Strings can be concatenated to form longer strings,
in the same way as we concatenate two lists,
which we denote by $@$. We omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
\end{tabular}
\end{center}
We also call the above \emph{language concatenation}.
The power of a language is defined recursively, using the
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A^n @ A$
\end{tabular}
\end{center}
The Kleene star of a language is the union of all its
natural number powers:
\begin{center}
\begin{tabular}{lcl}
$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
\end{tabular}
\end{center}

\noindent
However, to obtain a convenient induction principle
in Isabelle/HOL,
we instead define the Kleene star
as an inductive set:

\begin{center}
\begin{mathpar}
\inferrule{}{[] \in A*}

\inferrule{s_1 \in A \\ s_2 \in A*}{s_1 @ s_2 \in A*}
\end{mathpar}
\end{center}

We also define an operation of ``chopping off'' a character from
a language, which we call $\Der$, meaning ``derivative of a language'':
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to ``chopping off'' a string from all strings within a set $A$,
with the help of the concatenation operator:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
\end{tabular}
\end{center}
\noindent
which is essentially the left quotient $A \backslash L'$ of $A$ with respect to
the singleton language $L' = \{w\}$
in formal language theory.
For this dissertation the $\textit{Ders}$ definition with
a single string suffices.
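
\noindent
To make these operations concrete, here is a small sketch (not part of the
formal development) of how $\Der$ and $\Ders$ could be computed on explicit,
finite sets of strings in Scala; the names \texttt{Der} and \texttt{Ders}
simply mirror the definitions above:
\begin{verbatim}
// sketch only: derivative of a finite language w.r.t. a character
def Der(c: Char, A: Set[List[Char]]): Set[List[Char]] =
  A.collect { case `c` :: s => s }

// sketch only: derivative of a finite language w.r.t. a string
def Ders(w: List[Char], A: Set[List[Char]]): Set[List[Char]] =
  A.collect { case s if s.startsWith(w) => s.drop(w.length) }
\end{verbatim}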

With the sequencing, Kleene star, and $\textit{Der}$ operations on languages in place,
we can state a few properties describing how the derivative of a compound language
can be expressed in terms of its sub-languages.
\begin{lemma}
\[
\Der \; c \; (A @ B) =
\begin{cases}
((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\
(\Der \; c \; A) \, @ \, B, & \text{otherwise}
\end{cases}
\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.
The derivative of the language $A*$ can be described using the derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c\; A) @ (A*)$
\end{lemma}
\begin{proof}
\begin{itemize}
\item{$\subseteq$}
\noindent
The set
\[ \{s \mid c :: s \in A*\} \]
is included in the set
\[ \{s_1 @ s_2 \mid s_1,\, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
because whenever you have a string starting with a character
in the language of a Kleene star $A*$,
then that character together with some sub-string
immediately after it will form the first iteration,
and the rest of the string will
still be in $A*$.
\item{$\supseteq$}
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
and
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
where the $\textit{RHS}$ of the above equation can be rewritten
as \[ (\Der \; c\; A) @ A* \cup A' \] with $A'$ being a possibly empty set.
\end{itemize}
\end{proof}

\noindent
Before we define the $\textit{Der}$ and $\textit{Ders}$ counterparts
for regular expressions, we first need to give the definition of regular expressions themselves.

\subsection{Regular Expressions and Their Meaning}
The basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]
\noindent
We call them basic because we might introduce
further constructs later, such as negation
and bounded repetitions.
We define the regular expression containing
nothing as $\ZERO$; note that some authors
also use $\phi$ for it.
Similarly, the regular expression denoting the
singleton set containing only $[]$ is sometimes
denoted by $\epsilon$, but we use $\ONE$ here.
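
\noindent
As a concrete counterpart of this grammar, the following is a sketch of the
regular expressions as a Scala datatype; the constructor names \texttt{ZERO},
\texttt{ONE}, \texttt{SEQ} and \texttt{ALTS} match the ones used in the
simplification function shown later in this chapter, while \texttt{CHAR} and
\texttt{STAR} are names we assume for the remaining constructors:
\begin{verbatim}
// a sketch of the abstract syntax (CHAR and STAR are our assumed names)
abstract class Rexp
case object ZERO extends Rexp                      // matches nothing
case object ONE extends Rexp                       // matches the empty string
case class CHAR(c: Char) extends Rexp              // matches the character c
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp    // r1 followed by r2
case class ALTS(r1: Rexp, r2: Rexp) extends Rexp   // r1 or r2
case class STAR(r: Rexp) extends Rexp              // zero or more repetitions
\end{verbatim}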

The language, or set of strings, denoted
by a regular expression is defined as follows:
%TODO: FILL in the other defs
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\phi$\\
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
$L \; (r^*)$ & $\dn$ & $ (L(r))^*$
\end{tabular}
\end{center}
\noindent
This is also called the ``language interpretation'' of
a regular expression.

Now, with the semantic derivative of a language and with regular expressions and
their language interpretations in place, we are ready to define derivatives on regular expressions.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}

\ChristianComment{Hi this part I want to keep the ordering as is, so that it keeps the
readers engaged with a story how we got to the definition of $\backslash$, rather
than first "overwhelming" them with the definition of $\nullable$.}

The language derivative acts on a set of strings and chops off a character from
all strings in that set; we want to define a derivative operation on regular expressions
so that, after taking a derivative, $L(r\backslash c)$
looks as if it was obtained by taking a language derivative on $L(r)$:
\begin{center}
\[
r\backslash c \dn ?
\]
so that
\[
L(r \backslash c) = \Der \; c \; L(r) ?
\]
\end{center}
So we mimic the equality we have for $\Der$ on language concatenation
\[
\Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then}\; ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B
\]
to get the derivative for sequence regular expressions:
\[
(r_1 \cdot r_2 ) \backslash c = \textit{if}\,([] \in L(r_1))\; \textit{then}\; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c \;\textit{else}\; (r_1 \backslash c) \cdot r_2
\]

\noindent
and the equality for the language Kleene star
\[
\textit{Der} \;c \;(A*) = (\textit{Der}\; c\; A) @ (A*)
\]
to get the derivative of the Kleene star regular expression:
\[
r^* \backslash c = (r \backslash c)\cdot r^*
\]
Note that although we can formalise the boolean predicate
$[] \in L(r_1)$ without problems, if we want a function that works
computationally, then we have to define a function that tests
whether the empty string is in the language of a regular expression.
We call such a function $\nullable$ and define it after the
full set of derivative clauses, which are given below:


\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
  $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
  & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
  & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The derivative function, written $r\backslash c$,
defines how a regular expression evolves into
a new regular expression after the head character $c$ has been
chopped off all the strings it matches.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
can match the empty string, then the second component of the sequence
may also be chosen as the regular expression from which the head
character is chopped off.
The derivative of a star regular expression unwraps one iteration of
the regular expression and attaches the star regular expression
to the second element of the resulting sequence, to make sure a copy is retained
for possibly more iterations in later phases of lexing.


To test whether $[] \in L(r_1)$ holds, we use the function $\nullable$,
which checks whether the empty string $[]$
is in the language of $r$:


\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The empty set does not contain any string and
therefore does not contain the empty string.
The empty-string regular expression contains the empty string
by definition. The character regular expression
denotes a singleton set containing a single one-character string
and therefore does not contain the empty string.
The alternative regular expression (or ``or'' expression)
is nullable if either of its children is nullable.
The sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
naturally contains the empty string.

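
\noindent
As a sanity check of these two definitions, here is a sketch of how they can
be transcribed into Scala (using the \texttt{Rexp} datatype sketched earlier;
this code is ours and not verbatim from Sulzmann and Lu):
\begin{verbatim}
// sketch: direct transcription of the nullable and derivative clauses
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALTS(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALTS(r1, r2) => ALTS(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALTS(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}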

We have the following property where the derivative on regular
expressions coincides with the derivative on a set of strings:

\begin{lemma}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}

\noindent
The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{lemma}\label{derStepwise}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{lemma}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity, we will use $\backslash$ to denote
string derivatives for brevity.
Brzozowski's regular-expression matching algorithm can then be described as:

\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
\end{definition}
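
\noindent
As an illustration (again only a sketch in Scala, with names of our own
choosing), the string derivative and the matcher can be written as:
\begin{verbatim}
// sketch: string derivative and Brzozowski matcher
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(s: String, r: Rexp): Boolean = nullable(ders(s.toList, r))
\end{verbatim}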

\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
It can be
shown relatively easily that this matcher is correct:
\begin{lemma}
$\textit{match} \; s\; r = \textit{true} \Longleftrightarrow s \in L(r)$
\end{lemma}
\begin{proof}
By the stepwise property of $\backslash$ (Lemma~\ref{derStepwise}).
\end{proof}
\noindent
If we implement the above algorithm naively, however,
it can be excruciatingly slow.


\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    legend entries={Naive Matcher},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
\end{figure}

\noindent
To improve this situation, we need to introduce certain
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
We have a simplification function (which is kept as simple as possible
while still being effective at reducing the size of a regular expression):
\begin{verbatim}
def simp(r: Rexp) : Rexp = r match {
  case SEQ(r1, r2) =>
    (simp(r1), simp(r2)) match {
      case (ZERO, _) => ZERO
      case (_, ZERO) => ZERO
      case (ONE, r2s) => r2s
      case (r1s, ONE) => r1s
      case (r1s, r2s) => SEQ(r1s, r2s)
    }
  case ALTS(r1, r2) => {
    (simp(r1), simp(r2)) match {
      case (ZERO, r2s) => r2s
      case (r1s, ZERO) => r1s
      case (r1s, r2s) =>
        if(r1s == r2s) r1s else ALTS(r1s, r2s)
    }
  }
  case r => r
}
\end{verbatim}
If we repeatedly apply these
rules during the matching algorithm,
we obtain a matcher with simplification:
\begin{verbatim}
def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
  case Nil => simp(r)
  case c :: cs => ders_simp(cs, simp(der(c, r)))
}

def simp_matcher(s: String, r: Rexp) : Boolean =
  nullable(ders_simp(s.toList, r))
\end{verbatim}
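
\noindent
For example, with the definitions above one can write (again only a sketch;
the regular expression below encodes $(a^*)^* \cdot b$):
\begin{verbatim}
// sketch of a usage example
val r = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
simp_matcher("a" * 100 + "b", r)   // returns true quickly
\end{verbatim}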
\noindent
With these rules in place, the example from Figure~\ref{NaiveMatcher}
now remains tame even for long inputs:


\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    xmode = log,
    legend entries={Matcher With Simp},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
\end{axis}
\end{tikzpicture} \label{fig:BetterMatcher}


\noindent
Note how the x-axis is now in logarithmic scale.
The algorithm so far: build derivatives, test whether the empty string
is in the language of the resulting regular expression,
and apply simplification rules along the way.
So far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
\citeauthor{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
Here we present the hybrid phases of a regular expression lexing
algorithm using the function $\inj$, as given by Sulzmann and Lu.
They first defined the datatypes for storing the
lexing information, called a \emph{value} or
sometimes also a \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent
We have a formal binary relation expressing when the structure
of a regular expression agrees with that of a value:
\begin{mathpar}
\inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
\inferrule{}{\vdash \Empty : \ONE} \hspace{2em}
\inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
\end{mathpar}
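
\noindent
As a Scala sketch (our rendering, not Sulzmann and Lu's code; the names
\texttt{Chr} and \texttt{Sequ} are chosen here only to avoid clashes with
Scala's built-in \texttt{Char} and with \texttt{SEQ}), the values can be
represented as:
\begin{verbatim}
// sketch of the value datatype (constructor names partly our own)
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val
\end{verbatim}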

Building on top of Sulzmann and Lu's attempt to formalise the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban \parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
The formal definition of a $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
in the following set of rules:
\ChristianComment{Will complete later}
\newcommand*{\inference}[3][t]{%
  \begingroup
  \def\and{\\}%
  \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
  #2 \\
  \hline
  #3
  \end{tabular}%
  \endgroup
}
\begin{center}
\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$ }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
\end{center}
\noindent
The above $\POSIX$ rules can be explained intuitively as follows:
\begin{itemize}
\item
match the leftmost regular expression when multiple options for matching
are available;
\item
always match a subpart as much as possible before proceeding
to the next token.
\end{itemize}

The reason why we are interested in $\POSIX$ values is that they can
be used practically in the lexing phase of a compiler front end.
For instance, when lexing a code snippet
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword.

A good property of $\POSIX$ values is that,
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for them:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
\end{lemma}
\begin{proof}
By induction on $s$, $r$ and $v_1$. The induction principle is
the one for the $\POSIX$ rules. Each case is proven by a combination of
the rules for $\POSIX$ values and the inductive hypothesis.
Probably the most cumbersome cases are the sequence and star with non-empty iterations.

We give the reasoning about the sequence case as follows:
when we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$,
we know that there cannot be a longer string $s_1'$ such that $(s_1', r_1) \rightarrow v_1'$
and $(s_2', r_2) \rightarrow v_2'$ and $s_1' @ s_2' = s$ all hold.
For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot
possibly form a $\POSIX$ value for $s$.
If we have some other values $v_1'$ and $v_2'$ such that
$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
then by the induction hypothesis $v_1' = v_1$ and $v_2'= v_2$,
which means this ``different'' $\POSIX$ value $\Seq(v_1', v_2')$
is the same as $\Seq(v_1, v_2)$.
\end{proof}

Now that we know what a $\POSIX$ value is, the next question is how to
compute such a value within a lexing algorithm, using derivatives.

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \ref{graph:successive_ders}). In this second phase, a POSIX value
is generated if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
& & & v_n
\end{tikzcd}
\end{equation}
\end{ceqn}

It tells us how an empty string can be matched by a
regular expression, in a $\POSIX$ way:

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}


\noindent
We favour the left child to match an empty string if there is a choice.
When a star is used to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.
The result of a call to $\mkeps$ on a $\nullable$ regular expression $r$ is
a $\POSIX$ value corresponding to $r$:
\begin{lemma}\label{mePosix}
$\nullable(r) \implies ([], r) \rightarrow (\mkeps\; r)$
\end{lemma}
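
\noindent
In Scala, $\mkeps$ can be sketched as follows (using the datatypes from the
earlier sketches; again our own rendering, not the formalised definition):
\begin{verbatim}
// sketch: building a POSIX value for the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALTS(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}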

After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$.
To do this, Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value.
\begin{ceqn}
\begin{equation}\label{graph:inj}
\begin{tikzcd}
r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent
This definition is by recursion on the ``shape'' of regular
expressions and values.
Each clause does one thing: it identifies the ``hole'' in a
value into which the character has to be injected back.
For instance, consider the last clause, which injects back into a value
that will turn into a new star value corresponding to a star regular expression.
We know the value being injected into must be a sequence value, and that the first
component of that sequence corresponds to the child regex of the star
with its first character chopped off---the iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value with this latest iteration
added to the previous list of iterations, all under the $\Stars$
top level.
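
\noindent
In Scala (again only a sketch using the datatypes from before), the clauses
above translate to:
\begin{verbatim}
// sketch: injecting a character back into a value
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty) => Chr(d)
  case (ALTS(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALTS(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}
\end{verbatim}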
The POSIX value is maintained throughout the process:
\begin{lemma}\label{injPosix}
$(s, r \backslash c) \rightarrow v \implies (c :: s, r) \rightarrow (\inj\; r \; c\; v)$
\end{lemma}


Putting the functions $\inj$, $\mkeps$ and $\backslash$ together,
and taking into account the possibility of a non-match,
we obtain a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if}\; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;(c::s)$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}$\\
& & $\quad \None \implies \None$\\
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
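
\noindent
A Scala sketch of this lexer, with \texttt{Option} modelling the possibility
of a non-match, could look as follows (our rendering, not the formalised
definition):
\begin{verbatim}
// sketch: derivative-based lexer combining der, mkeps and inj
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs) match {
    case None => None
    case Some(v) => Some(inj(r, c, v))
  }
}
\end{verbatim}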
\noindent
The central property of the $\lexer$ is that it gives the correct result
according to the $\POSIX$ rules:
\begin{lemma}
\begin{tabular}{l}
$s \in L(r) \Longleftrightarrow (\exists v. \; \lexer \; r \; s = \Some(v) \land (s, r) \rightarrow v)$\\
$s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$
\end{tabular}
\end{lemma}


\begin{proof}
By induction on $s$; $r$ is allowed to be an arbitrary regular expression.
The $[]$ case is proven by Lemma~\ref{mePosix}, and the inductive case
by Lemma~\ref{injPosix}.
\end{proof}


Pictorially, the algorithm can be depicted as follows.
For convenience, we employ the following notation: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. The
values built incrementally by \emph{injecting} back the characters into the
earlier values are $v_n, \ldots, v_0$. Corresponding values and regular expressions
always carry the same subscript, i.e.\ $\vdash v_i : r_i$:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
As we did earlier in this chapter for the matcher, one can
introduce simplification on the regular expressions.
However, now we need a backward phase and have to make sure
the values align with the regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative step, and applying these rectifications in
each step of the injection phase.

\ChristianComment{Do I introduce the lexer with rectification here?}
We can prove that the POSIX value of how
regular expressions match strings is not affected---although this is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

However, even with these simplification rules, we could still end up in
trouble when we encounter cases that require more involved and aggressive
simplifications.
\section{A Case Requiring More Aggressive Simplification}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 9000 nodes (when viewed as a tree),
even with simplification.
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={size},
    legend entries={Naive Matcher},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture}
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
\end{figure}

That is because our lexing algorithm currently keeps a lot of
``useless'' values that will never be used.
The number of different ways of matching grows exponentially with the string length.

For $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
if we do not allow any empty iterations in its lexical values,
there will be $n - 1$ ``splitting points'' in $s$ at which we can independently choose
whether to split or not, so that each sub-string
segmented by the chosen splitting points forms a different iteration.
For example, when $n=4$,
\begin{center}
\begin{tabular}{lcr}
$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration, which is further divided between the two $a^*$s of the inner sequence $a^*\cdot a^*$)\\
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$ (two iterations)\\
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$ (two iterations)\\
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, a},\, v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
& $\textit{etc}.$ &
\end{tabular}
\end{center}
\noindent
And for each iteration, there are still multiple ways to split
between the two $a^*$s.
It is therefore not surprising that there are exponentially many distinct lexical values
for the regular expression and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
A lexer that keeps all these possible values will naturally
have an exponential runtime on ambiguous regular expressions.
With just $\inj$ and $\mkeps$, the lexing algorithm keeps track of all the different values
of a match. This means Sulzmann and Lu's injection-based algorithm is
exponential by nature.
Somehow one has to work out which
lexical values are $\POSIX$ and keep only those during lexing.


For example, the above $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
We want to keep this value only, and remove all the regular expression subparts
not corresponding to this value during lexing.
To do this, a two-phase algorithm with rectification is a bit too fragile.
Can we avoid creating those intermediate values $v_1,\ldots, v_n$ altogether,
and instead obtain the lexing information that is already there while
doing derivatives in one pass, without a second injection phase?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding additional information to the
regular expressions, called \emph{bitcodes}.


