lexing: ChengsongTanPhdThesis/Chapters/Inj.tex@3831621d7b14 (annotated)

532 cc54ce075db5 restructured Chengsong parents: diff changeset	1	% Chapter Template
cc54ce075db5 restructured Chengsong parents: diff changeset	2
666 6da4516ea87d more changes to figures & benchmarking Chengsong parents: 657 diff changeset	3	\chapter{Regular Expressions and POSIX Lexing-Preliminaries} % Main chapter title
532 cc54ce075db5 restructured Chengsong parents: diff changeset	4
cc54ce075db5 restructured Chengsong parents: diff changeset	5	\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
cc54ce075db5 restructured Chengsong parents: diff changeset	6	%and notations we
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	7	% used for describing the lexing algorithm by Sulzmann and Lu,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	8	%and then give the algorithm and its variant and discuss
532 cc54ce075db5 restructured Chengsong parents: diff changeset	9	%why more aggressive simplifications are needed.
649 ef2b8abcbc55 more Chengsong parents: 646 diff changeset	10
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	11
668 3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	12	\marginpar{Gerog comment: addressed attributing work correctly problem,
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	13	showed clearly these definitions come from Ausaf et al.}
649 ef2b8abcbc55 more Chengsong parents: 646 diff changeset	14
668 3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	15	This is a preliminary chapter which describes the results of
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	16	Sulzmann and Lu \cite{Sulzmann2014} and part of the formalisations by
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	17	Ausaf et al \ref{AusafDyckhoffUrban2016} (mainly the definitions).
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	18	These results are not part of this PhD work,
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	19	but the details are included to provide necessary context
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	20	for our work on the correctness proof of $\blexersimp$,
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	21	as we show in chapter \ref{Bitcoded2} how the proofs break down when
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	22	simplifications are applied.
657 00171b627b8d Fixed some annotated/unannotated a/r notation inconsistencies. Chengsong parents: 651 diff changeset	23
668 3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	24	In the coming section, the definitions of basic notions
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	25	for regular languages and regular expressions are given.
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	26	This is essentially a description in ``English''
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	27	the functions and datatypes used by Ausaf et al.
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	28	\cite{AusafDyckhoffUrban2016} \cite{Ausaf}
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	29	in their formalisation in Isabelle/HOL.
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	30	We include them as we build on their formalisation,
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	31	and therefore inherently use these definitions.
649 ef2b8abcbc55 more Chengsong parents: 646 diff changeset	32
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	33
668 3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	34	% In the next section, we will briefly
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	35	% introduce Brzozowski derivatives and Sulzmann
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	36	% and Lu's algorithm, which the main point of this thesis builds on.
649 ef2b8abcbc55 more Chengsong parents: 646 diff changeset	37	%We give a taste of what they
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	38	%are like and why they are suitable for regular expression
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	39	%matching and lexing.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	40	\section{Formal Specification of POSIX Matching
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	41	and Brzozowski Derivatives}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	42	%Now we start with the central topic of the thesis: Brzozowski derivatives.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	43	Brzozowski \cite{Brzozowski1964} first introduced the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	44	concept of a \emph{derivative} of regular expression in 1964.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	45	The derivative of a regular expression $r$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	46	with respect to a character $c$, is written as $r \backslash c$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	47	This operation tells us what $r$ transforms into
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	48	if we ``chop'' off a particular character $c$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	49	from all strings in the language of $r$ (defined
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	50	later as $L \; r$).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	51	%To give a flavour of Brzozowski derivatives, we present
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	52	%two straightforward clauses from it:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	53	%\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	54	% \begin{tabular}{lcl}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	55	% $d \backslash c$ & $\dn$ &
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	56	% $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	57	%$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	58	% \end{tabular}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	59	%\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	60	%\noindent
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	61	%The first clause says that for the regular expression
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	62	%denoting a singleton set consisting of a single-character string $\{ d \}$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	63	%we check the derivative character $c$ against $d$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	64	%returning a set containing only the empty string $\{ [] \}$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	65	%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	66	%The second clause states that to obtain the regular expression
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	67	%representing all strings' head character $c$ being chopped off
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	68	%from $r_1 + r_2$, one simply needs to recursively take derivative
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	69	%of $r_1$ and $r_2$ and then put them together.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	70	Derivatives have the property
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	71	that $s \in L \; (r\backslash c)$ if and only if
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	72	$c::s \in L \; r$ where $::$ stands for list prepending.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	73	%This property can be used on regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	74	%matching and lexing--to test whether a string $s$ is in $L \; r$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	75	%one simply takes derivatives of $r$ successively with
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	76	%respect to the characters (in the correct order) in $s$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	77	%and then test whether the empty string is in the last regular expression.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	78	With this property, derivatives can give a simple solution
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	79	to the problem of matching a string $s$ with a regular
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	80	expression $r$: if the derivative of $r$ w.r.t.\ (in
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	81	succession) all the characters of the string matches the empty string,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	82	then $r$ matches $s$ (and {\em vice versa}).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	83	%This makes formally reasoning about these properties such
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	84	%as correctness and complexity smooth and intuitive.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	85	There are several mechanised proofs of this property in various theorem
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	86	provers,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	87	for example one by Owens and Slind \cite{Owens2008} in HOL4,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	88	another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	89	yet another in Coq by Coquand and Siles \cite{Coquand2012}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	90
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	91	In addition, one can extend derivatives to bounded repetitions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	92	relatively straightforwardly. For example, the derivative for
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	93	this can be defined as:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	94	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	95	\begin{tabular}{lcl}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	96	$r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	97	r^{\{n-1\}} (\text{when} n > 0)$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	98	\end{tabular}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	99	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	100	\noindent
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	101	Experimental results suggest that unlike DFA-based solutions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	102	for bounded regular expressions,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	103	derivatives can cope
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	104	large counters
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	105	quite well.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	106
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	107	There have also been
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	108	extensions of derivatives to other regex constructs.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	109	For example, Owens et al include the derivatives
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	110	for the \emph{NOT} regular expression, which is
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	111	able to concisely express C-style comments of the form
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	112	$/* \ldots */$ (see \cite{Owens2008}).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	113	Another extension for derivatives is
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	114	regular expressions with look-aheads, done
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	115	by Miyazaki and Minamide
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	116	\cite{Takayuki2019}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	117	%We therefore use Brzozowski derivatives on regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	118	%lexing
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	119
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	120
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	121
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	122	Given the above definitions and properties of
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	123	Brzozowski derivatives, one quickly realises their potential
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	124	in generating a formally verified algorithm for lexing: the clauses and property
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	125	can be easily expressed in a functional programming language
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	126	or converted to theorem prover
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	127	code, with great ease.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	128	Perhaps this is the reason why derivatives have sparked quite a bit of interest
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	129	in the functional programming and theorem prover communities in the last
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	130	fifteen or so years (
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	131	\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	132	\cite{Chen12} and \cite{Coquand2012}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	133	to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	134	after they were first published by Brzozowski.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	135
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	136
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	137	However, there are two difficulties with derivative-based matchers:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	138	First, Brzozowski's original matcher only generates a yes/no answer
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	139	for whether a regular expression matches a string or not. This is too
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	140	little information in the context of lexing where separate tokens must
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	141	be identified and also classified (for example as keywords
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	142	or identifiers).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	143	Second, derivative-based matchers need to be more efficient in terms
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	144	of the sizes of derivatives.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	145	Elegant and beautiful
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	146	as many implementations are,
666 6da4516ea87d more changes to figures & benchmarking Chengsong parents: 657 diff changeset	147	they can be still quite slow.
649 ef2b8abcbc55 more Chengsong parents: 646 diff changeset	148	For example, Sulzmann and Lu
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	149	claim a linear running time of their proposed algorithm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	150	but that was falsified by our experiments. The running time
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	151	is actually $\Omega(2^n)$ in the worst case.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	152	A similar claim about a theoretical runtime of $O(n^2)$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	153	is made for the Verbatim \cite{Verbatim}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	154	%TODO: give references
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	155	lexer, which calculates POSIX matches and is based on derivatives.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	156	They formalized the correctness of the lexer, but not their complexity result.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	157	In the performance evaluation section, they analyzed the run time
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	158	of matching $a$ with the string
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	159	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	160	$\underbrace{a \ldots a}_{\text{n a's}}$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	161	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	162	\noindent
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	163	They concluded that the algorithm is quadratic in terms of
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	164	the length of the input string.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	165	When we tried out their extracted OCaml code with the example $(a+aa)^*$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	166	the time it took to match a string of 40 $a$'s was approximately 5 minutes.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	167
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	168
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	169	\subsection{Sulzmann and Lu's Algorithm}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	170	Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	171	problem with the yes/no answer
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	172	by cleverly extending Brzozowski's matching
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	173	algorithm. Their extended version generates additional information on
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	174	\emph{how} a regular expression matches a string following the POSIX
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	175	rules for regular expression matching. They achieve this by adding a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	176	second ``phase'' to Brzozowski's algorithm involving an injection
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	177	function. This injection function in a sense undoes the ``damage''
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	178	of the derivatives chopping off characters.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	179	In earlier work, Ausaf et al provided the formal
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	180	specification of what POSIX matching means and proved in Isabelle/HOL
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	181	the correctness
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	182	of this extended algorithm accordingly
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	183	\cite{AusafDyckhoffUrban2016}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	184
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	185	The version of the algorithm proven correct
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	186	suffers however heavily from a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	187	second difficulty, where derivatives can
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	188	grow to arbitrarily big sizes.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	189	For example if we start with the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	190	regular expression $(a+aa)^*$ and take
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	191	successive derivatives according to the character $a$, we end up with
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	192	a sequence of ever-growing derivatives like
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	193
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	194	\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	195	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	196	\begin{tabular}{rll}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	197	$(a + aa)^$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	198	& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	199	& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	200	& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	201	& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	202	\end{tabular}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	203	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	204
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	205	\noindent where after around 35 steps we usually run out of memory on a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	206	typical computer. Clearly, the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	207	notation involving $\ZERO$s and $\ONE$s already suggests
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	208	simplification rules that can be applied to regular regular
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	209	expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	210	\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	211	r$. While such simple-minded simplifications have been proved in
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	212	the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	213	algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	214	\emph{not} help with limiting the growth of the derivatives shown
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	215	above: the growth is slowed, but the derivatives can still grow rather
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	216	quickly beyond any finite bound.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	217
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	218	Therefore we want to look in this thesis at a second
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	219	algorithm by Sulzmann and Lu where they
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	220	overcame this ``growth problem'' \cite{Sulzmann2014}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	221	In this version, POSIX values are
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	222	represented as bit sequences and such sequences are incrementally generated
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	223	when derivatives are calculated. The compact representation
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	224	of bit sequences and regular expressions allows them to define a more
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	225	``aggressive'' simplification method that keeps the size of the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	226	derivatives finite no matter what the length of the string is.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	227	They make some informal claims about the correctness and linear behaviour
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	228	of this version, but do not provide any supporting proof arguments, not
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	229	even ``pencil-and-paper'' arguments. They write about their bit-coded
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	230	\emph{incremental parsing method} (that is the algorithm to be formalised
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	231	in this dissertation)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	232
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	233
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	234
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	235	\begin{quote}\it
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	236	``Correctness Claim: We further claim that the incremental parsing
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	237	method [..] in combination with the simplification steps [..]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	238	yields POSIX parse trees. We have tested this claim
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	239	extensively [..] but yet
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	240	have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	241	\end{quote}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	242	Ausaf and Urban made some initial progress towards the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	243	full correctness proof but still had to leave out the optimisation
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	244	Sulzmann and Lu proposed.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	245	Ausaf wrote \cite{Ausaf},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	246	\begin{quote}\it
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	247	``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	248	\end{quote}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	249	This thesis implements the aggressive simplifications envisioned
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	250	by Ausaf and Urban,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	251	together with a formal proof of the correctness of those simplifications.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	252
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	253
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	254	One of the most recent work in the context of lexing
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	255	%with this issue
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	256	is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	257	This is relevant work for us and we will compare it later with
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	258	our derivative-based matcher we are going to present.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	259	There is also some newer work called
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	260	Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	261	but deterministic finite automaton instead.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	262	We will also study this work in a later section.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	263	%An example that gives problem to automaton approaches would be
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	264	%the regular expression $(a\|b)^*a(a\|b)^{\{n\}}$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	265	%It requires at least $2^{n+1}$ states to represent
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	266	%as a DFA.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	267
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	268
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	269	\section{Basic Concepts}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	270	Formal language theory usually starts with an alphabet
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	271	denoting a set of characters.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	272	Here we use the datatype of characters from Isabelle,
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	273	which roughly corresponds to the ASCII characters.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	274	In what follows, we shall leave the information about the alphabet
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	275	implicit.
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	276	Then using the usual bracket notation for lists,
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	277	we can define strings made up of characters as:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	278	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	279	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	280	$\textit{s}$ & $\dn$ & $[] \; \|\; c :: s$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	281	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	282	\end{center}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	283	where $c$ is a variable ranging over characters.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	284	The $::$ stands for list cons and $[]$ for the empty
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	285	list.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	286	For brevity, a singleton list is sometimes written as $[c]$.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	287	Strings can be concatenated to form longer strings in the same
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	288	way we concatenate two lists, which we shall write as $s_1 @ s_2$.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	289	We omit the precise
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	290	recursive definition here.
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	291	We overload this concatenation operator for two sets of strings:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	292	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	293	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	294	$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	295	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	296	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	297	We also call the above \emph{language concatenation}.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	298	The power of a language is defined recursively, using the
cc54ce075db5 restructured Chengsong parents: diff changeset	299	concatenation operator $@$:
cc54ce075db5 restructured Chengsong parents: diff changeset	300	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	301	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	302	$A^0 $ & $\dn$ & $\{ [] \}$\\
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	303	$A^{n+1}$ & $\dn$ & $A @ A^n$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	304	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	305	\end{center}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	306	The union of all powers of a language
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	307	can be used to define the Kleene star operator:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	308	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	309	\begin{tabular}{lcl}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	310	$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	311	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	312	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	313
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	314	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	315	However, to obtain a more convenient induction principle
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	316	in Isabelle/HOL,
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	317	we instead define the Kleene star
532 cc54ce075db5 restructured Chengsong parents: diff changeset	318	as an inductive set:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	319
532 cc54ce075db5 restructured Chengsong parents: diff changeset	320	\begin{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	321	\begin{mathpar}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	322	\inferrule{\mbox{}}{[] \in A*\\}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	323
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	324	\inferrule{s_1 \in A \;\; s_2 \in A}{s_1 @ s_2 \in A}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	325	\end{mathpar}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	326	\end{center}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	327	\noindent
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	328	We also define an operation of "chopping off" a character from
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	329	a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
532 cc54ce075db5 restructured Chengsong parents: diff changeset	330	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	331	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	332	$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	333	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	334	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	335	\noindent
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	336	This can be generalised to ``chopping off'' a string
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	337	from all strings within a set $A$,
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	338	namely:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	339	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	340	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	341	$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	342	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	343	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	344	\noindent
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	345	which is essentially the left quotient $A \backslash L$ of $A$ against
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	346	the singleton language with $L = \{s\}$.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	347	However, for our purposes here, the $\textit{Ders}$ definition with
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	348	a single string is sufficient.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	349
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	350	The reason for defining derivatives
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	351	is that they provide another approach
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	352	to test membership of a string in
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	353	a set of strings.
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	354	For example, to test whether the string
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	355	$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	356	respect to the string $bar$:
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	357	\begin{center}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	358	\begin{tabular}{lll}
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	359	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	360	$\{ar, rak\}$ \\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	361	& $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	362	& $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	363	%& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	364	\end{tabular}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	365	\end{center}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	366	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	367	and in the end, test whether the set
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	368	contains the empty string.\footnote{We use the infix notation $A\backslash c$
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	369	instead of $\Der \; c \; A$ for brevity, as it will always be
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	370	clear from the context that we are operating
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	371	on languages rather than regular expressions.}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	372
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	373	In general, if we have a language $S$,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	374	then we can test whether $s$ is in $S$
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	375	by testing whether $[] \in S \backslash s$.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	376	With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
532 cc54ce075db5 restructured Chengsong parents: diff changeset	377	we have a few properties of how the language derivative can be defined using
cc54ce075db5 restructured Chengsong parents: diff changeset	378	sub-languages.
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	379	For example, for the sequence operator, we have
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	380	something similar to a ``chain rule'':
532 cc54ce075db5 restructured Chengsong parents: diff changeset	381	\begin{lemma}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	382	\[
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	383	\Der \; c \; (A @ B) =
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	384	\begin{cases}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	385	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	386	(\Der \; c \; A) \, @ \, B, & \text{otherwise}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	387	\end{cases}
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	388	\]
532 cc54ce075db5 restructured Chengsong parents: diff changeset	389	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	390	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	391	This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
cc54ce075db5 restructured Chengsong parents: diff changeset	392	and get to $B$.
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	393	The language derivative for $A*$ can be described using the language derivative
532 cc54ce075db5 restructured Chengsong parents: diff changeset	394	of $A$:
cc54ce075db5 restructured Chengsong parents: diff changeset	395	\begin{lemma}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	396	$\textit{Der} \;c \;(A) = (\textit{Der}\; c A) @ (A)$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	397	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	398	\begin{proof}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	399	There are two inclusions to prove:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	400	\begin{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	401	\item{$\subseteq$}:\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	402	The set
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	403	\[ S_1 = \{s \mid c :: s \in A*\} \]
532 cc54ce075db5 restructured Chengsong parents: diff changeset	404	is enclosed in the set
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	405	\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
e3752aac8ec2 more Chengsong parents: 628 diff changeset	406	This is because for any string $c::s$ satisfying $c::s \in A*$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	407	%whenever you have a string starting with a character
e3752aac8ec2 more Chengsong parents: 628 diff changeset	408	%in the language of a Kleene star $A*$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	409	%then that
e3752aac8ec2 more Chengsong parents: 628 diff changeset	410	the character $c$, together with a prefix of $s$
e3752aac8ec2 more Chengsong parents: 628 diff changeset	411	%immediately after $c$
e3752aac8ec2 more Chengsong parents: 628 diff changeset	412	forms the first iteration of $A*$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	413	and the rest of the $s$ is also $A*$.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	414	This coincides with the definition of $S_2$.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	415	\item{$\supseteq$}:\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	416	Note that
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	417	\[ \Der \; c \; (A) = \Der \; c \; (\{ [] \} \cup (A @ A) ) \]
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	418	holds.
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	419	Also the following holds:
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	420	\[ \Der \; c \; (\{ [] \} \cup (A @ A) ) = \Der\; c \; (A @ A) \]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	421	where the $\textit{RHS}$ can be rewritten
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	422	as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	423	which of course contains $\Der \; c \; A @ A*$.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	424	\end{itemize}
cc54ce075db5 restructured Chengsong parents: diff changeset	425	\end{proof}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	426
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	427	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	428	The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	429	for regular expressions.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	430	To introduce them, we need to first give definitions for regular expressions,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	431	which we shall do next.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	432
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	433	\subsection{Regular Expressions and Their Meaning}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	434	The \emph{basic regular expressions} are defined inductively
532 cc54ce075db5 restructured Chengsong parents: diff changeset	435	by the following grammar:
cc54ce075db5 restructured Chengsong parents: diff changeset	436	\[ r ::= \ZERO \mid \ONE
cc54ce075db5 restructured Chengsong parents: diff changeset	437	\mid c
cc54ce075db5 restructured Chengsong parents: diff changeset	438	\mid r_1 \cdot r_2
cc54ce075db5 restructured Chengsong parents: diff changeset	439	\mid r_1 + r_2
cc54ce075db5 restructured Chengsong parents: diff changeset	440	\mid r^*
cc54ce075db5 restructured Chengsong parents: diff changeset	441	\]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	442	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	443	We call them basic because we will introduce
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	444	additional constructors in later chapters, such as negation
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	445	and bounded repetitions.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	446	We use $\ZERO$ for the regular expression that
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	447	matches no string, and $\ONE$ for the regular
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	448	expression that matches only the empty string.\footnote{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	449	Some authors
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	450	also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	451	but we prefer this notation.}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	452	The sequence regular expression is written $r_1\cdot r_2$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	453	and sometimes we omit the dot if it is clear which
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	454	regular expression is meant; the alternative
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	455	is written $r_1 + r_2$.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	456	The \emph{language} or meaning of
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	457	a regular expression is defined recursively as
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	458	a set of strings:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	459	%TODO: FILL in the other defs
cc54ce075db5 restructured Chengsong parents: diff changeset	460	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	461	\begin{tabular}{lcl}
667 660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	462	$L \; \ZERO$ & $\dn$ & $\varnothing$\\
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	463	$L \; \ONE$ & $\dn$ & $\{[]\}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	464	$L \; c$ & $\dn$ & $\{[c]\}$\\
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	465	$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	466	$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	467	$L \; (r^)$ & $\dn$ & $ (L\;r)$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	468	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	469	\end{center}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	470	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	471	%Now with language derivatives of a language and regular expressions and
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	472	%their language interpretations in place, we are ready to define derivatives on regular expressions.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	473	With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	474	We do so by first introducing what properties they should satisfy.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	475
532 cc54ce075db5 restructured Chengsong parents: diff changeset	476	\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	477	%Recall, the language derivative acts on a set of strings
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	478	%and essentially chops off a particular character from
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	479	%all strings in that set, Brzozowski defined a derivative operation on regular expressions
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	480	%so that after derivative $L(r\backslash c)$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	481	%will look as if it was obtained by doing a language derivative on $L(r)$:
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	482	%Recall that the language derivative acts on a
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	483	%language (set of strings).
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	484	%One can decide whether a string $s$ belongs
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	485	%to a language $S$ by taking derivative with respect to
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	486	%that string and then checking whether the empty
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	487	%string is in the derivative:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	488	%\begin{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	489	%\parskip \baselineskip
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	490	%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	491	%\def\rlwd{.5pt}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	492	%\newcommand\notate[3]{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	493	% \unskip\def\useanchorwidth{T}%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	494	% \setbox0=\hbox{#1}%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	495	% \def\stackalignment{c}\stackunder[-6pt]{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	496	% \def\stackalignment{c}\stackunder[-1.5pt]{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	497	% \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	498	% \rule{\rlwd}{#2\baselineskip}}}{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	499	% \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	500	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	501	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	502	%\notate{$\{ \ldots ,\;$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	503	% \notate{s}{1}{$(c_1 :: s_1)$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	504	% $, \; \ldots \}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	505	%}{1}{$S_{start}$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	506	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	507	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	508	% $\stackrel{\backslash c_1}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	509	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	510	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	511	% $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	512	% $,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	513	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	514	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	515	% $\stackrel{\backslash c_2}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	516	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	517	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	518	% $\{ \ldots,\; s_2
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	519	% ,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	520	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	521	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	522	% $ \xdashrightarrow{\backslash c_3\ldots\ldots} $
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	523	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	524	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	525	% \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	526	% S_{start}\backslash s$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	527	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	528	%\end{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	529	%\begin{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	530	% $s \in S_{start} \iff [] \in S_{end}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	531	%\end{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	532	%\noindent
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	533	Brzozowski noticed that $\Der$
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	534	can be ``mirrored'' on regular expressions which
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	535	he calls the derivative of a regular expression $r$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	536	with respect to a character $c$, written
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	537	$r \backslash c$. This infix operator
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	538	takes regular expression $r$ as input
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	539	and a character as a right operand.
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	540	The derivative operation on regular expression
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	541	is defined such that the language of the derivative result
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	542	coincides with the language of the original
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	543	regular expression being taken
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	544	derivative with respect to the same characters, namely
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	545	\begin{property}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	546
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	547	\[
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	548	L \; (r \backslash c) = \Der \; c \; (L \; r)
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	549	\]
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	550	\end{property}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	551	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	552	Next, we give the recursive definition of derivative on
e3752aac8ec2 more Chengsong parents: 628 diff changeset	553	regular expressions so that it satisfies the properties above.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	554	%The derivative function, written $r\backslash c$,
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	555	%takes a regular expression $r$ and character $c$, and
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	556	%returns a new regular expression representing
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	557	%the original regular expression's language $L \; r$
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	558	%being taken the language derivative with respect to $c$.
626 1c8525061545 finished! Chengsong parents: 623 diff changeset	559	\begin{table}
1c8525061545 finished! Chengsong parents: 623 diff changeset	560	\begin{center}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	561	\begin{tabular}{lcl}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	562	$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	563	$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	564	$d \backslash c$ & $\dn$ &
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	565	$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	566	$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	567	$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	568	& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	569	& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	570	$(r^)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	571	\end{tabular}
626 1c8525061545 finished! Chengsong parents: 623 diff changeset	572	\end{center}
1c8525061545 finished! Chengsong parents: 623 diff changeset	573	\caption{Derivative on Regular Expressions}
1c8525061545 finished! Chengsong parents: 623 diff changeset	574	\label{table:der}
1c8525061545 finished! Chengsong parents: 623 diff changeset	575	\end{table}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	576	\noindent
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	577	The most involved cases are the sequence case
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	578	and the star case.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	579	The sequence case says that if the first regular expression
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	580	contains an empty string, then the second component of the sequence
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	581	needs to be considered, as its derivative will contribute to the
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	582	result of this derivative:
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	583	\begin{center}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	584	\begin{tabular}{lcl}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	585	$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ &
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	586	$\textit{if}\;\,([] \in L(r_1))\;
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	587	\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	588	& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	589	\end{tabular}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	590	\end{center}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	591	\noindent
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	592	Notice how this closely resembles
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	593	the language derivative operation $\Der$:
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	594	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	595	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	596	$\Der \; c \; (A @ B)$ & $\dn$ &
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	597	$ \textit{if} \;\, [] \in A \;
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	598	\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	599	\Der \; c\; B$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	600	& & $\textit{else}\; (\Der \; c \; A) @ B$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	601	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	602	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	603	\noindent
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	604	The derivative of the star regular expression $r^*$
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	605	unwraps one iteration of $r$, turns it into $r\backslash c$,
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	606	and attaches the original $r^*$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	607	after $r\backslash c$, so that
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	608	we can further unfold it as many times as needed:
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	609	\[
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	610	(r^) \backslash c \dn (r \backslash c)\cdot r^.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	611	\]
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	612	Again,
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	613	the structure is the same as the language derivative of the Kleene star:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	614	\[
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	615	\textit{Der} \;c \;(A) \dn (\textit{Der}\; c A) @ (A)
532 cc54ce075db5 restructured Chengsong parents: diff changeset	616	\]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	617	In the above definition of $(r_1\cdot r_2) \backslash c$,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	618	the $\textit{if}$ clause's
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	619	boolean condition
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	620	$[] \in L(r_1)$ needs to be
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	621	somehow recursively computed.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	622	We call such a function that checks
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	623	whether the empty string $[]$ is
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	624	in the language of a regular expression $\nullable$:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	625	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	626	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	627	$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	628	$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	629	$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	630	$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	631	$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	632	$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	633	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	634	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	635	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	636	The $\ZERO$ regular expression
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	637	does not contain any string and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	638	therefore is not \emph{nullable}.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	639	$\ONE$ is \emph{nullable}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	640	by definition.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	641	The character regular expression $c$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	642	corresponds to the singleton set $\{c\}$,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	643	and therefore does not contain the empty string.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	644	The alternative regular expression is nullable
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	645	if at least one of its children is nullable.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	646	The sequence regular expression
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	647	would require both children to have the empty string
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	648	to compose an empty string, and the Kleene star
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	649	is always nullable because it naturally
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	650	contains the empty string.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	651	\noindent
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	652	We have the following two correspondences between
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	653	derivatives on regular expressions and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	654	derivatives on a set of strings:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	655	\begin{lemma}\label{derDer}
608 37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	656	\mbox{}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	657	\begin{itemize}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	658	\item
532 cc54ce075db5 restructured Chengsong parents: diff changeset	659	$\textit{Der} \; c \; L(r) = L (r\backslash c)$
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	660	\item
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	661	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	662	\end{itemize}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	663	\end{lemma}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	664	\begin{proof}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	665	By induction on $r$.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	666	\end{proof}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	667	\noindent
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	668	which are the main properties of derivatives
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	669	that enables us later to reason about the correctness of
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	670	derivative-based matching.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	671	We can generalise the derivative operation shown above for single characters
cc54ce075db5 restructured Chengsong parents: diff changeset	672	to strings as follows:
cc54ce075db5 restructured Chengsong parents: diff changeset	673
cc54ce075db5 restructured Chengsong parents: diff changeset	674	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	675	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	676	$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	677	$r \backslash_s [\,] $ & $\dn$ & $r$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	678	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	679	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	680
cc54ce075db5 restructured Chengsong parents: diff changeset	681	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	682	When there is no ambiguity, we will
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	683	omit the subscript and use $\backslash$ instead
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	684	of $\backslash_s$ to denote
532 cc54ce075db5 restructured Chengsong parents: diff changeset	685	string derivatives for brevity.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	686	Brzozowski's derivative-based
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	687	regular-expression matching algorithm can then be described as:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	688
cc54ce075db5 restructured Chengsong parents: diff changeset	689	\begin{definition}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	690	$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	691	\end{definition}
cc54ce075db5 restructured Chengsong parents: diff changeset	692
cc54ce075db5 restructured Chengsong parents: diff changeset	693	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	694	Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	695	this algorithm, presented graphically, is as follows:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	696
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	697	\begin{equation}\label{matcher}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	698	\begin{tikzcd}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	699	r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] &
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	700	r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] &
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	701	\;\textrm{true}/\textrm{false}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	702	\end{tikzcd}
cc54ce075db5 restructured Chengsong parents: diff changeset	703	\end{equation}
cc54ce075db5 restructured Chengsong parents: diff changeset	704
cc54ce075db5 restructured Chengsong parents: diff changeset	705	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	706	It is relatively
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	707	easy to show that this matcher is correct, namely
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	708	\begin{lemma}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	709	$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	710	\end{lemma}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	711	\begin{proof}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	712	By induction on $s$ using the property of derivatives:
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	713	lemma \ref{derDer}.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	714	\end{proof}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	715	\begin{figure}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	716	\begin{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	717	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	718	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	719	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	720	ylabel={time in secs},
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	721	%ymode = log,
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	722	legend entries={Naive Matcher},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	723	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	724	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	725	\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
7cf9f17aa179 more Chengsong parents: 538 diff changeset	726	\end{axis}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	727	\end{tikzpicture}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	728	\caption{Matching the regular expression $(a^)^b$ against strings of the form
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	729	$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	730	$ using Brzozowski's original algorithm}\label{NaiveMatcher}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	731	\end{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	732	\end{figure}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	733	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	734	If we implement the above algorithm naively, however,
666 6da4516ea87d more changes to figures & benchmarking Chengsong parents: 657 diff changeset	735	the algorithm can be as slow as backtracking lexers, as shown in
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	736	\ref{NaiveMatcher}.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	737	Note that both axes are in logarithmic scale.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	738	Around two dozen characters
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	739	this algorithm already ``explodes'' with the regular expression
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	740	$(a^)^b$.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	741	To improve this situation, we need to introduce simplification
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	742	rules for the intermediate results,
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	743	such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	744	and make sure those rules do not change the
7cf9f17aa179 more Chengsong parents: 538 diff changeset	745	language of the regular expression.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	746	One simple-minded simplification function
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	747	that achieves these requirements
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	748	is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	749	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	750	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	751	$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	752	$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	753	& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	754	& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	755	& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	756	& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	757	& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	758	$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	759	& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	760	& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	761	& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	762	$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	763
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	764	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	765	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	766	If we repeatedly apply this simplification
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	767	function during the matching algorithm,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	768	we have a matcher with simplification:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	769	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	770	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	771	$\derssimp \; [] \; r$ & $\dn$ & $r$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	772	$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	773	$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	774	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	775	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	776	\begin{figure}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	777	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	778	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	779	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	780	ylabel={time in secs},
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	781	%ymode = log,
ce4e5151a836 more Chengsong parents: 591 diff changeset	782	%xmode = log,
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	783	grid = both,
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	784	legend entries={Matcher With Simp},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	785	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	786	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	787	\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
7cf9f17aa179 more Chengsong parents: 538 diff changeset	788	\end{axis}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	789	\end{tikzpicture}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	790	\caption{$(a^)^b$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	791	against
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	792	$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	793	\end{figure}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	794	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	795	The running time of $\textit{ders}\_\textit{simp}$
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	796	on the same example of Figure \ref{NaiveMatcher}
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	797	is now ``tame'' in terms of the length of inputs,
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	798	as shown in Figure \ref{BetterMatcher}.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	799
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	800	So far, the story is use Brzozowski derivatives and
e3752aac8ec2 more Chengsong parents: 628 diff changeset	801	simplify as much as possible, and at the end test
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	802	whether the empty string is recognised
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	803	by the final derivative.
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	804	But what if we want to
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	805	do lexing instead of just getting a true/false answer?
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	806	Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	807	elegant (arguably as beautiful as the definition of the
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	808	Brzozowski derivative) solution for this.
668 3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	809	\marginpar{explicitly attribute the work}
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	810	For the same reason described
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	811	at the beginning of this chapter,
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	812	we introduce the formal
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	813	semantics of $\POSIX$ lexing by
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	814	Ausaf et al.\cite{AusafDyckhoffUrban2016},
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	815	followed by the first lexing algorithm by
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	816	Sulzmanna and Lu \cite{Sulzmann2014}
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	817	that produces the output conforming
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	818	to the $\POSIX$ standard.
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	819	%TODO: Actually show this in chapter 4.
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	820	In what follows
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	821	we choose to use the Isabelle-style notation
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	822	for function and datatype constructor applications, where
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	823	the parameters of a function are not enclosed
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	824	inside a pair of parentheses (e.g. $f \;x \;y$
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	825	instead of $f(x,\;y)$). This is mainly
3831621d7b14 added technical Overview section, almost done introduction Chengsong parents: 667 diff changeset	826	to make the text visually more concise.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	827
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	828	\section{Values and the Lexing Algorithm by Sulzmann and Lu}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	829	In this section, we present a two-phase regular expression lexing
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	830	algorithm.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	831	The first phase takes successive derivatives with
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	832	respect to the input string,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	833	and the second phase does the reverse, \emph{injecting} back
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	834	characters, in the meantime constructing a lexing result.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	835	We will introduce the injection phase in detail slightly
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	836	later, but as a preliminary we have to first define
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	837	the datatype for lexing results,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	838	called \emph{value} or
608 37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	839	sometimes also \emph{lexical value}.
37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	840	Values and regular
37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	841	expressions correspond to each other
37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	842	as illustrated in the following
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	843	table:
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	844
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	845	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	846	\begin{tabular}{c@{\hspace{20mm}}c}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	847	\begin{tabular}{@{}rrl@{}}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	848	\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	849	$r$ & $::=$ & $\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	850	& $\mid$ & $\ONE$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	851	& $\mid$ & $c$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	852	& $\mid$ & $r_1 \cdot r_2$\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	853	& $\mid$ & $r_1 + r_2$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	854	\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	855	& $\mid$ & $r^*$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	856	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	857	&
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	858	\begin{tabular}{@{\hspace{0mm}}rrl@{}}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	859	\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	860	$v$ & $::=$ & \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	861	& & $\Empty$ \\
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	862	& $\mid$ & $\Char \; c$ \\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	863	& $\mid$ & $\Seq\,v_1\, v_2$\\
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	864	& $\mid$ & $\Left \;v$ \\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	865	& $\mid$ & $\Right\;v$ \\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	866	& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	867	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	868	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	869	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	870	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	871	A value has an underlying string, which
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	872	can be calculated by the ``flatten" function $\|\_\|$:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	873	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	874	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	875	$\|\Empty\|$ & $\dn$ & $[]$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	876	$\|\Char \; c\|$ & $ \dn$ & $ [c]$\\
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	877	$\|\Seq \; v_1, \;v_2\|$ & $ \dn$ & $ v_1\| @ \|v_2\|$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	878	$\|\Left \; v\|$ & $ \dn$ & $ \|v\|$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	879	$\|\Right \; v\|$ & $ \dn$ & $ \|v\|$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	880	$\|\Stars \; []\|$ & $\dn$ & $[]$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	881	$\|\Stars \; v::vs\|$ & $\dn$ & $ \|v\| @ \|\Stars(vs)\|$
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	882	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	883	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	884	Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	885	to indicate that a value $v$ could be generated from a lexing algorithm
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	886	with input $r$. They call it the value inhabitation relation,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	887	defined by the rules.
628 7af4e2420a8c ready to submit~~ Chengsong parents: 626 diff changeset	888	\begin{figure}[H]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	889	\begin{mathpar}
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	890	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	891
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	892	\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	893
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	894	\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	895
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	896	\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	897
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	898	\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	899
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	900	\inferrule{\forall v \in vs. \vdash v:r \land \|v\| \neq []}{\vdash \Stars \; vs : r^*}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	901	\end{mathpar}
628 7af4e2420a8c ready to submit~~ Chengsong parents: 626 diff changeset	902	\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
626 1c8525061545 finished! Chengsong parents: 623 diff changeset	903	\end{figure}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	904	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	905	The condition $\|v\| \neq []$ in the premise of star's rule
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	906	is to make sure that for a given pair of regular
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	907	expression $r$ and string $s$, the number of values
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	908	satisfying $\|v\| = s$ and $\vdash v:r$ is finite.
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	909	This additional condition was
ce4e5151a836 more Chengsong parents: 591 diff changeset	910	imposed by Ausaf and Urban to make their proofs easier.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	911	Given a string and a regular expression, there can be
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	912	multiple values for it. For example, both
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	913	$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	914	$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	915	and the values both flatten to $abc$.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	916	Lexers therefore have to disambiguate and choose only
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	917	one of the values to be generated. $\POSIX$ is one of the
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	918	disambiguation strategies that is widely adopted.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	919
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	920	Ausaf et al. \cite{AusafDyckhoffUrban2016}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	921	formalised the property
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	922	as a ternary relation.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	923	The $\POSIX$ value $v$ for a regular expression
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	924	$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	925	in the following rules\footnote{The names of the rules are used
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	926	as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	927	\begin{figure}[p]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	928	\begin{mathpar}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	929	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	930
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	931	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	932
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	933	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	934
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	935	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	936
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	937	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	938	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	939	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	940	\Seq \; v_1 \; v_2}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	941
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	942	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	943
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	944	\inferrule[P]{(s_1, v) \rightarrow v \\ (s_2, r^) \rightarrow \Stars \; vs \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	945	\|v\| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	946	s_1@s_3 \in L \; r \land s_4 \in L \; r^}{(s_1@s_2, r^)\rightarrow \Stars \;
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	947	(v::vs)}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	948	\end{mathpar}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	949	\caption{The inductive POSIX rules given by Ausaf et al.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	950	\cite{AusafDyckhoffUrban2016}.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	951	This ternary relation, written $(s, r) \rightarrow v$,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	952	formalises the POSIX constraints on the
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	953	value $v$ given a string $s$ and
ce4e5151a836 more Chengsong parents: 591 diff changeset	954	regular expression $r$.
ce4e5151a836 more Chengsong parents: 591 diff changeset	955	}
646 56057198e4f5 intro Chengsong parents: 639 diff changeset	956	\label{fig:POSIXDef}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	957	\end{figure}\afterpage{\clearpage}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	958	\noindent
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	959
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	960	%\begin{figure}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	961	%\begin{tikzpicture}[]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	962	% \node [minimum width = 6cm, rectangle split, rectangle split horizontal,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	963	% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	964	% (node1)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	965	% {$r_{token1}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	966	% \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ };
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	967	% %\node [left = 6.0cm of node1] (start1) {hi};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	968	% \node [left = 0.2cm of node1] (middle) {$v.s.$};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	969	% \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	970	% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	971	% (node2)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	972	% {$\quad\;\;\;r_{token1}\quad\;\;\;$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	973	% \nodepart{two} $r_{token2}$ };
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	974	% \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	975	% \node [above = 1.5cm of middle, minimum width = 6cm,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	976	% rectangle, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	977	% (topNode) {$s$};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	978	% \path[->,draw]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	979	% (topNode) edge node {split $A$} (node2)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	980	% (topNode) edge node {split $B$} (node1)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	981	% ;
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	982	%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	983	%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	984	%\end{tikzpicture}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	985	%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	986	%\end{figure}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	987	The above $\POSIX$ rules follow the intuition described below:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	988	\begin{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	989	\item (Left Priority)\\
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	990	Match the leftmost regular expression when multiple options for matching
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	991	are available. See P+L and P+R where in P+R $s$ cannot
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	992	be in the language of $L \; r_1$.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	993	\item (Maximum munch)\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	994	Always match a subpart as much as possible before proceeding
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	995	to the next part of the string.
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	996	For example, when the string $s$ matches
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	997	$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	998	Then the split that matches a longer string for the first part
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	999	$r_{part1}$ is preferred by this maximum munch rule.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1000	The side-condition
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1001	\begin{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1002	$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1003	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1004	\end{center}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1005	in PS causes this.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1006	%(See
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1007	%\ref{munch} for an illustration).
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1008	\end{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1009	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1010	These disambiguation strategies can be
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1011	quite practical.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1012	For instance, when lexing a code snippet
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1013	\[
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1014	\textit{iffoo} = 3
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1015	\]
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1016	using a regular expression
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1017	for keywords and
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1018	identifiers:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1019	%(for example, keyword is a nonempty string starting with letters
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1020	%followed by alphanumeric characters or underscores):
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1021	\[
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1022	r_{keyword} + r_{identifier}.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1023	\]
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1024	If we want $\textit{iffoo}$ to be recognized
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1025	as an identifier
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1026	where identifiers are defined as usual (letters
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1027	followed by letters, numbers or underscores),
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1028	then a match with a keyword (if)
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1029	followed by
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1030	an identifier (foo) would be incorrect.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1031	POSIX lexing generates what is included by lexing.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1032
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1033	\noindent
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1034	We know that a POSIX
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1035	value for regular expression $r$ is inhabited by $r$.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1036	\begin{lemma}
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1037	$(r, s) \rightarrow v \implies \vdash v: r$
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1038	\end{lemma}
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1039	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1040	The main property about a $\POSIX$ value is that
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1041	given the same regular expression $r$ and string $s$,
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1042	one can always uniquely determine the $\POSIX$ value for it:
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1043	\begin{lemma}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1044	$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1045	\end{lemma}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1046	\begin{proof}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1047	By induction on $s$, $r$ and $v_1$. The inductive cases
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1048	are all the POSIX rules.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1049	Probably the most cumbersome cases are
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1050	the sequence and star with non-empty iterations.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1051	We shall give the details for proving the sequence case here.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1052
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1053	When we have
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1054	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1055	(s_1, r_1) \rightarrow v_1 \;\, and \;\,
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1056	(s_2, r_2) \rightarrow v_2 \;\, and \;\,
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1057	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1058	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1059	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1060	we know that the last condition
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1061	excludes the possibility of a
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1062	string $s_1'$ longer than $s_1$ such that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1063	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1064	(s_1', r_1) \rightarrow v_1' \;\;
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1065	and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1066	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1067	hold.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1068	A shorter string $s_1''$ with $s_2''$ satisfying
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1069	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1070	(s_1'', r_1) \rightarrow v_1''
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1071	\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1072	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1073	cannot possibly form a $\POSIX$ value either, because
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1074	by definition, there is a candidate
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1075	with a longer initial string
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1076	$s_1$. Therefore, we know that the POSIX
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1077	value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1078	$s$ must have the
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1079	property that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1080	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1081	\|a\| = s_1 \;\; and \;\; \|b\| = s_2.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1082	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1083	The goal is to prove that $a = v_1 $ and $b = v_2$.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1084	If we have some other POSIX values $v_{10}$ and $v_{20}$ such that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1085	$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1086	then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1087	which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1088	is the same as $\Seq(v_1, v_2)$.
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1089	\end{proof}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1090	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1091	We have now defined what a POSIX value is and shown that it is unique.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1092	The problem is to generate
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1093	such a value in a lexing algorithm using derivatives.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1094
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1095	\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1096
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1097	Sulzmann and Lu extended Brzozowski's
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1098	derivative-based matching
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1099	to a lexing algorithm by a second phase
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1100	after the initial phase of successive derivatives.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1101	This second phase generates a POSIX value
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1102	if the regular expression matches the string.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1103	The algorithm uses two functions called $\inj$ and $\mkeps$.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1104	The function $\mkeps$ constructs a POSIX value from the last
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1105	derivative $r_n$:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1106	\begin{ceqn}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1107	\begin{equation}\label{graph:mkeps}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1108	\begin{tikzcd}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1109	r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1110	& & & v_n
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1111	\end{tikzcd}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1112	\end{equation}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1113	\end{ceqn}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1114	\noindent
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1115	In the above diagram, again we assume that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1116	the input string $s$ is made of $n$ characters
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1117	$c_0c_1 \ldots c_{n-1}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1118	The last derivative operation
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1119	$\backslash c_{n-1}$ generates the derivative $r_n$, for which
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1120	$\mkeps$ produces the value $v_n$. This value
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1121	tells us how the empty string is matched by the (nullable)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1122	regular expression $r_n$, in a POSIX way.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1123	The definition of $\mkeps$ is
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1124	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1125	\begin{tabular}{lcl}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1126	$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1127	$\mkeps \; (r_{1}+r_{2})$ & $\dn$
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1128	& $\textit{if}\; (\nullable \; r_{1}) \;\,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1129	\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1130	& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1131	$\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1132	$\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1133	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1134	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1135
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1136
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1137	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1138	The function prefers the left child $r_1$ of $r_1 + r_2$
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1139	to match an empty string if there is a choice.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1140	When there is a star to match the empty string,
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1141	we give the $\Stars$ constructor an empty list, meaning
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1142	no iteration is taken.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1143	The result of $\mkeps$ on a $\nullable$ $r$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1144	is a POSIX value for $r$ and the empty string:
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1145	\begin{lemma}\label{mePosix}
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1146	$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1147	\end{lemma}
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1148	\begin{proof}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1149	By induction on $r$.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1150	\end{proof}
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1151	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1152	After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1153	in reverse order as they were chopped off in the derivative phase.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1154	The function for this is called $\inj$. This function
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1155	operates on values, unlike $\backslash$ which operates on regular expressions.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1156	In the diagram below, $v_i$ stands for the (POSIX) value
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1157	for how the regular expression
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1158	$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1159	of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1160	After injecting back $n$ characters, we get the lexical value for how $r_0$
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1161	matches $s$.
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1162	\begin{figure}[H]
ce4e5151a836 more Chengsong parents: 591 diff changeset	1163	\begin{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1164	\begin{ceqn}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1165	\begin{tikzcd}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1166	r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1167	v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1168	\end{tikzcd}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1169	\end{ceqn}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1170	\end{center}
ce4e5151a836 more Chengsong parents: 591 diff changeset	1171	\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
ce4e5151a836 more Chengsong parents: 591 diff changeset	1172	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
ce4e5151a836 more Chengsong parents: 591 diff changeset	1173	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
ce4e5151a836 more Chengsong parents: 591 diff changeset	1174	$c_1$, and so on. These are the same operations as they have appeared in the matcher
ce4e5151a836 more Chengsong parents: 591 diff changeset	1175	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1176	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1177	the empty string, by always selecting the leftmost
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1178	nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1179	appeared in the string, always preserving POSIXness.}\label{graph:inj}
ce4e5151a836 more Chengsong parents: 591 diff changeset	1180	\end{figure}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1181	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1182	The function $\textit{inj}$ as defined by Sulzmann and Lu
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1183	takes three arguments: a regular
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1184	expression ${r_{i}}$, before the character is chopped off,
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1185	a character ${c_{i}}$ (the character we want to inject back) and
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1186	the third argument $v_{i+1}$ the value we want to inject into.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1187	The result of an application
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1188	$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1189	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1190	(s_i, r_i) \rightarrow v_i
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1191	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1192	holds.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1193	The definition of $\textit{inj}$ is as follows:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1194	\begin{center}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1195	\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1196	$\textit{inj}\;(c)\;c\,Empty$ & $\dn$ & $\Char\,c$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1197	$\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1198	$\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1199	$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ &
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1200	$\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1201	$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ &
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1202	$\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1203	$\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1204	$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1205	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1206	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1207	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1208	The function recurses on
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1209	the shape of regular
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1210	expressions and values.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1211	Intuitively, each clause analyses
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1212	how $r_i$ could have transformed when being
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1213	derived by $c$, identifying which subpart
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1214	of $v_{i+1}$ has the ``hole''
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1215	to inject the character back into.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1216	Once the character is
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1217	injected back to that sub-value;
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1218	$\inj$ assembles all parts
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1219	to form a new value.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1220
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1221	For instance, the last clause is an
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1222	injection into a sequence value $v_{i+1}$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1223	whose second child
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1224	value is a star and the shape of the
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1225	regular expression $r_i$ before injection
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1226	is a star.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1227	We therefore know
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1228	the derivative
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1229	starts on a star and ends as a sequence:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1230	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1231	(r^) \backslash c \longrightarrow r\backslash c \cdot r^
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1232	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1233	during which an iteration of the star
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1234	had just been unfolded, giving the below
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1235	value inhabitation relation:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1236	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1237	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1238	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1239	The value list $vs$ corresponds to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1240	matched star iterations,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1241	and the ``hole'' lies in $v$ because
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1242	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1243	\vdash v: r\backslash c.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1244	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1245	Finally,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1246	$\inj \; r \;c \; v$ is prepended
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1247	to the previous list of iterations and then
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1248	wrapped under the $\Stars$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1249	constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1250
667 660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1251	Putting all together, Sulzmann and Lu obtained the following algorithm
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1252	outlined in the injection-based lexing diagram \ref{graph:inj}:
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1253	\begin{center}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1254	\begin{tabular}{lcl}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1255	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1256	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1257	& & $\quad \phantom{\mid}\; \None \implies \None$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1258	& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1259	\end{tabular}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1260	\end{center}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1261	\noindent
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1262
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1263	\subsection{Examples on How Injection and Lexer Works}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1264	We will provide a few examples on how $\inj$ and $\lexer$
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1265	works by showing their values in each recursive call on some
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1266	concrete examples.
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1267	We start with the call $\lexer \; (a+aa)^*\cdot c\; aac$
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1268	on the lexer, note that the value's character constructor
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1269	$\Char \;c$ is abbreviated as $c$ for readability.
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1270	Similarly the last derivative's sub-expression
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1271	is abbreviated as $r_{=0}$\footnote{which is equal to
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1272	$((\ZERO + (\ZERO a+ \ZERO))\cdot(a+aa)^* + (\ZERO + \ZERO a)\cdot (a+aa)^*)+
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1273	((\ZERO+\ZERO a)\cdot(a+aa)^* + (\ZERO+\ZERO a)\cdot(a+aa)^*),
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1274	$.}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1275	whose language interpretation is equivalent to that of
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1276	$\ZERO$ and therefore not crucial to be displayed fully expanded,
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1277	as they will not be injected into.%our example run.
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1278	\begin{center}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1279	\begin{tabular}{lcl}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1280	$(a+\textcolor{magenta}{a}\textcolor{blue}{a})^* \cdot \textcolor{red}{c}$ & $\stackrel{\backslash \textcolor{magenta}{a}}{\rightarrow}$ & $((\ONE + \textcolor{magenta}{\ONE} \textcolor{blue}{a})\cdot (a+aa)^*)\cdot \textcolor{red}{c}+\ZERO$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1281	%& $\stackrel{\rightarrow}{\backslash a}$ & $((\ONE + \ONE a)\cdot (a+aa)^*)\cdot c + \ZERO$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1282	& $\stackrel{\backslash \textcolor{blue}{a}}{\rightarrow}$ &
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1283	$(((\ZERO+(\ZERO a+ \textcolor{blue}{\ONE}))\cdot (a+aa)^* + (\ONE+\ONE a)\cdot (a+aa)^* )\cdot \textcolor{red}{\mathbf{c}} + \ZERO)+\ZERO$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1284	& $\stackrel{\backslash \textcolor{red}{c}}{\rightarrow}$ &
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1285	$((r_{=0}\cdot c + \textcolor{red}{\ONE})+\ZERO)+\ZERO$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1286	& $\stackrel{\mkeps}{\rightarrow}$ & $\Left (\Left \; (\Right \; \textcolor{red}{\Empty}))$ \\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1287	& $\stackrel{\inj \;\textcolor{red}{c} }{\rightarrow}$ &
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1288	$\Left \; (\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; \textcolor{blue}{\Empty})) \; \Stars \, [])) \; \textcolor{red}{c}))$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1289	& $\stackrel{\inj \;\textcolor{blue}{a}}{\rightarrow}$ &
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1290	$\Left\; (\Seq \; (\Seq\; (\Right \; (\Seq \; \textcolor{magenta}{\Empty }\; \textcolor{blue}{ \mathbf{a}}) )
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1291	\;\Stars \,[]) \; \textcolor{red}{c})$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1292	& $\stackrel{\inj \;\textcolor{magenta}{a}}{\rightarrow}$ & $\Seq \; (\Stars \; [\Right \; (\Seq \; \mathbf{\textcolor{magenta}{a}} \; \textcolor{blue}{a})]) \; \textcolor{red}{ c}$
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1293
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1294	%$\inj \; r \; c\A$
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1295	%$\inj \; (a\cdot b)\cdot c \; \Seq \; \ONE \; b$ & $=$ & $(a+e)$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1296	\end{tabular}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1297	\end{center}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1298	\noindent
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1299	We have assigned different colours for each character,
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1300	as well as their corresponding locations in values and regular expressions.
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1301	The most recently injected character is marked with a bold font.
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1302	%Their corresponding derivative steps have been marked with the same
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1303	%colour. We also mark the specific location where the coloured character
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1304	%was injected into with the same colour, before and after the injection occurred.
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1305	%TODO: can be also used to motivate injection->blexer in Bitcoded1
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1306	To show the details of how $\inj$ works,
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1307	we zoom in to the second injection above,
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1308	illustrating the recursive calls involved:
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1309	\begin{center}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1310	\begin{tabular}{lcl}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1311	$\inj \;\quad ((\ONE + {\ONE}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1312	\textcolor{blue}{a})\cdot (a+aa)^*)\cdot
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1313	c + \ZERO \; \quad \textcolor{blue}{a} \; $ & &\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1314	$\quad \Left \;
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1315	(\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\;
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1316	\textcolor{blue}{\Empty})) \; \Stars \, [])) \; c))$ & \\$=$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1317	$\Left \; (\inj \; ((\ONE + \ONE
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1318	\textcolor{blue}{a})\cdot (a+aa)^*)\cdot
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1319	c \quad \textcolor{blue}{a} \quad (\Left \; (\Seq\; ( \Left \; (\Seq \; (\Right \; (\Right\;
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1320	\textcolor{blue}{\Empty})) \; \Stars \, []) ) \; c ) )\;\;)$ &\\ $=$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1321	$\Left \; (\Seq \; (\inj \; (\ONE + \ONE \textcolor{blue}{a})\cdot(a+aa)^* \quad \textcolor{blue}{a} \quad
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1322	(\Left \; (\Seq \; (\Right \; (\Right\;\textcolor{blue}{\Empty})))) ) \; c ) $ & \\$=$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1323	$\Left \; (\Seq \; (\Seq \; (\inj \quad (\ONE + \ONE \textcolor{blue}{a}) \quad \textcolor{blue}{a}\quad \Right \;(\Right \; \textcolor{blue}{\Empty}) ) \; \Stars \,[])\; c)$ &\\ $=$ \\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1324	$\Left \; (\Seq \; (\Seq \; (\Right \; (\inj \; \ONE \textcolor{blue}{a} \quad \textcolor{blue}{a}\quad (\Right \;\textcolor{blue}{\Empty})))) \; \Stars \, [])$
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1325	& \\ $=$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1326	$\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; (\mkeps \; \ONE)\;(\inj \;\textcolor{blue}{a} \; \textcolor{blue}{a} \; \textcolor{blue}{\Empty})) ) ) \; \Stars \, [] )$
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1327	& \\ $=$\\
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1328	$\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; \Empty \; \textcolor{blue}{a}) ) ) \; \Stars \, [] )$
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1329	\end{tabular}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1330	\end{center}
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1331	\noindent
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1332	We will now introduce the proofs related to $\inj$ and
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1333	$\lexer$. These proofs have originally been found by Ausaf et al.
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1334	in their 2016 work \cite{AusafDyckhoffUrban2016}.
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1335	Proofs to some of these lemmas have also been discussed in more detail in
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1336	Ausaf's thesis \cite{Ausaf}.
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1337	Nevertheless, we still introduce these proofs, to make this thesis
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1338	self-contained as we show inductive variants used in these proofs break
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1339	after more simplifications are introduced.
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1340	%Also note that lemmas like \ref{}described in this thesis is a more faithful
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1341	%representation of the actual accompanying Isabelle formalisation, and
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1342	%therefore
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1343
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1344
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1345
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1346	Recall that lemma
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1347	\ref{mePosix} tells us that
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1348	$\mkeps$ always generates the POSIX value.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1349	The function $\inj$ preserves the POSIXness, provided
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1350	the value before injection is POSIX, namely
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1351	\begin{lemma}\label{injPosix}
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1352	If$(r \backslash c, s) \rightarrow v $,
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1353	then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1354	\end{lemma}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1355	\begin{proof}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1356	By induction on $r$.
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1357	The non-trivial cases are sequence and star.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1358	When $r = a \cdot b$, there can be
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1359	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1360	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1361	case.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1362	\begin{itemize}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1363	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1364	$v = \Seq \; v_a \; v_b$.\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1365	The ``not nullable'' clause of the $\inj$ function is taken:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1366	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1367	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1368	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1369	& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1370	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1371	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1372	We know that there exists a unique pair of
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1373	$s_a$ and $s_b$ satisfying
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1374	$(a \backslash c, s_a) \rightarrow v_a$,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1375	$(b , s_b) \rightarrow v_b$, and
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1376	$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1377	L \; (a\backslash c) \land
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1378	s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1379	The last condition gives us
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1380	$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1381	L \; a \land
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1382	s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1383	By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1384	and this gives us
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1385	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1386	(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1387	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1388	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1389	$v = \Left \; (\Seq \; v_a \; v_b)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1390	The argument is almost identical to the above case,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1391	except that a different clause of $\inj$ is taken:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1392	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1393	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1394	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1395	& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1396	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1397	\end{center}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1398	With similar reasoning,
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1399
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1400	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1401	(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1402	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1403	again holds.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1404	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1405	$v = \Right \; v_b$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1406	Again the injection result would be
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1407	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1408	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1409	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1410	& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1411	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1412	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1413	We know that $a$ must be nullable,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1414	allowing us to call $\mkeps$ and get
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1415	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1416	(a, []) \rightarrow \mkeps \; a.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1417	\]
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1418	Also, by inductive hypothesis
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1419	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1420	(b, c::s) \rightarrow \inj\; b \; c \; v_b
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1421	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1422	holds.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1423	In addition, as
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1424	$\Right \;v_b$ instead of $\Left \ldots$ is
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1425	the POSIX value for $v$, it must be the case
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1426	that $s \notin L \;( (a\backslash c)\cdot b)$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1427	This tells us that
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1428	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1429	\nexists s_3 \; s_4.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1430	s_3 @s_4 = s \land s_3 \in L \; (a\backslash c)
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1431	\land s_4 \in L \; b
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1432	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1433	which translates to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1434	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1435	\nexists s_3 \; s_4. \; s_3 \neq [] \land
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1436	s_3 @s_4 = c::s \land s_3 \in L \; a
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1437	\land s_4 \in L \; b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1438	\]
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1439	(Which says there cannot be a longer
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1440	initial split for $s$ other than the empty string.)
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1441	Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1442	as the POSIX value for $a\cdot b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1443	\end{itemize}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1444	The star case can be proven similarly.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1445	\end{proof}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1446	\noindent
667 660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1447
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1448
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1449
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1450
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1451
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1452
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1453
660cf698eb26 added example of how inj and lexer works Chengsong parents: 666 diff changeset	1454	%TODO: Cut from previous lexer def, need to make coherent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1455	The central property of the $\lexer$ is that it gives the correct result
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1456	according to
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1457	POSIX rules.
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1458	\begin{theorem}\label{lexerCorrectness}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1459	The $\lexer$ based on derivatives and injections is correct:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1460	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1461	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1462	$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1463	$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1464	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1465	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1466	\end{theorem}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1467	\begin{proof}
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1468	By induction on $s$. $r$ generalising over an arbitrary regular expression.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1469	The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1470	by lemma \ref{injPosix}.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1471	\end{proof}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1472	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1473	As we did earlier in this chapter with the matcher, one can
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1474	introduce simplification on the regular expression in each derivative step.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1475	However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1476	and ensure that
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1477	the values align with the regular expression at each step.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1478	Therefore one has to
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1479	be careful not to break the correctness, as the injection
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1480	function heavily relies on the structure of
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1481	the regular expressions and values being aligned.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1482	This can be achieved by recording some extra rectification functions
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1483	during the derivatives step and applying these rectifications in
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1484	each run during the injection phase.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1485	With extra care
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1486	one can show that POSIXness will not be affected
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1487	by the simplifications listed here \cite{AusafDyckhoffUrban2016}.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1488	\begin{center}
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1489	\begin{tabular}{lcl}
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1490	$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1491	$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1492	& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1493	& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1494	& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1495	& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1496	& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1497	$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1498	& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1499	& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1500	& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1501	$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1502
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1503	\end{tabular}
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1504	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1505
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1506	However, one can still end up
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1507	with exploding derivatives,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1508	even with the simple-minded simplification rules allowed
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1509	in an injection-based lexer.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1510	\section{A Case Requiring More Aggressive Simplifications}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1511	For example, when starting with the regular
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1512	expression $(a^* \cdot a^)^$ and building just over
4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1513	a dozen successive derivatives
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1514	w.r.t.~the character $a$, one obtains a derivative regular expression
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1515	with millions of nodes (when viewed as a tree)
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1516	even with the mentioned simplifications.
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1517	\begin{figure}[H]
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1518	\begin{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1519	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1520	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1521	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1522	ylabel={size},
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1523	legend entries={Simple-Minded Simp, Naive Matcher},
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1524	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1525	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1526	\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	1527	\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1528	\end{axis}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1529	\end{tikzpicture}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1530	\end{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1531	\caption{Size of $(a^\cdot a^)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1532	\end{figure}\label{fig:BetterWaterloo}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1533
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1534	That is because Sulzmann and Lu's
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1535	injection-based lexing algorithm keeps a lot of
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1536	"useless" values that will not be used.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1537	These different ways of matching will grow exponentially with the string length.
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1538	Consider the case
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1539	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1540	r= (a^\cdot a^)^* \quad and \quad
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1541	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1542	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1543	as an example.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1544	This is a highly ambiguous regular expression, with
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1545	many ways to split up the string into multiple segments for
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1546	different star iterations,
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1547	and for each segment
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1548	multiple ways of splitting between
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1549	the two $a^*$ sub-expressions.
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1550	When $n$ is equal to $1$, there are two lexical values for
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1551	the match:
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1552	\[
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1553	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1554	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1555	and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1556	\[
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1557	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1558	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1559	The derivative of $\derssimp \;s \; r$ is
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1560	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1561	(a^a^ + a^)\cdot(a^a^)^.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1562	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1563	The $a^a^$ and $a^*$ in the first child of the above sequence
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1564	correspond to value 1 and value 2, respectively.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1565	When $n=2$, the number goes up to 7:
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1566	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1567	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1568	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1569
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1570	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1571	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1572	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1573
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1574	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1575	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1576	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1577
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1578	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1579	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1580	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1581
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1582	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1583	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []),
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1584	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1585	]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1586	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1587
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1588	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1589	\Stars \; [
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1590	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1591	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1592	]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1593	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1594	and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1595	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1596	\Stars \; [
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1597	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1598	\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1599	]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1600	\]
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1601	And $\derssimp \; aa \; (a^a^)^*$ is
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1602	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1603	((a^a^ + a^)+a^)\cdot(a^a^)^* +
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1604	(a^a^ + a^)\cdot(a^a^)^.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1605	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1606	which removes two out of the seven terms corresponding to the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1607	seven distinct lexical values.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1608
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1609	It is not surprising that there are exponentially many
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1610	distinct lexical values that cannot be eliminated by
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1611	the simple-minded simplification of $\derssimp$.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1612	A lexer without a good enough strategy to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1613	deduplicate will naturally
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1614	have an exponential runtime on highly
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1615	ambiguous regular expressions because there
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1616	are exponentially many matches.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1617	For this particular example, it seems
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1618	that the number of distinct matches growth
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1619	speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1620
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1621	On the other hand, the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1622	$\POSIX$ value for $r= (a^\cdot a^)^*$ and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1623	$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1624	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1625	\Stars\,
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1626	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1627	\]
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1628	At any moment, the subterms in a regular expression
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1629	that will potentially result in a POSIX value is only
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1630	a minority among the many other terms,
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1631	and one can remove the ones that are not possible to
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1632	be POSIX.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1633	In the above example,
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1634	\begin{equation}\label{eqn:growth2}
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1635	((a^a^ + \underbrace{a^}_\text{A})+\underbrace{a^}_\text{duplicate of A})\cdot(a^a^)^* +
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1636	\underbrace{(a^a^ + a^)\cdot(a^a^)^}_\text{further simp removes this}.
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1637	\end{equation}
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1638	can be further simplified by
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1639	removing the underlined term first,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1640	which would open up possibilities
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1641	of further simplification that removes the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1642	underbraced part.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1643	The result would be
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1644	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1645	(\underbrace{a^a^}_\text{term 1} + \underbrace{a^}_\text{term 2})\cdot(a^a^)^.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1646	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1647	with corresponding values
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1648	\begin{center}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1649	\begin{tabular}{lr}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1650	$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1651	$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1652	\end{tabular}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1653	\end{center}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1654	Other terms with an underlying value, such as
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1655	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1656	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1657	\]
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1658	do not to contribute a POSIX lexical value,
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1659	and therefore can be thrown away.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1660
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1661	Ausaf et al. \cite{AusafDyckhoffUrban2016}
639 80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1662	have come up with some simplification steps,
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1663	and incorporated the simplification into $\lexer$.
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1664	They call this lexer $\textit{lexer}_{simp}$ and proved that
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1665	\[
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1666	\lexer \; r\; s = \textit{lexer}_{simp} \; r \; s
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1667	\]
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1668	The function $\textit{lexer}_{simp}$
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1669	involves some fiddly manipulation of value rectification,
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1670	which we omit here.
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	1671	however those steps
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1672	are not yet sufficiently strong, to achieve the above effects.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1673	And even with these relatively mild simplifications, the proof
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1674	is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1675	One would need to prove something like this:
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1676	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1677	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\;
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1678	\textit{then}\;\; (r, c::s) \rightarrow
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1679	\inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1680	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1681	instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1682	not only has to return a simplified regular expression,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1683	but also what specific simplifications
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1684	have been done as a function on values
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1685	showing how one can transform the value
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1686	underlying the simplified regular expression
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1687	to the unsimplified one.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1688
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1689	We therefore choose a slightly different approach
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1690	also described by Sulzmann and Lu to
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1691	get better simplifications, which uses
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1692	some augmented data structures compared to
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1693	plain regular expressions.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1694	We call them \emph{annotated}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1695	regular expressions.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1696	With annotated regular expressions,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1697	we can avoid creating the intermediate values $v_1,\ldots v_n$ and a
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1698	second phase altogether.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1699	We introduce this new datatype and the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	1700	corresponding algorithm in the next chapter.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1701
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1702
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1703
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1704

author	Chengsong
	Wed, 23 Aug 2023 03:02:31 +0100
changeset 668	3831621d7b14
parent 667	660cf698eb26
permissions	-rwxr-xr-x