% Chapter Template

\chapter{Regular Expressions and POSIX Lexing} % Main chapter title

\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
%use for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant, and discuss
%why more aggressive simplifications are needed.

\section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}

We have a primitive datatype \textit{char}, denoting characters:
\begin{center}
\begin{tabular}{lcl}
$\textit{char}$ & $\dn$ & $a \mid b \mid c \mid \ldots \mid z$\\
\end{tabular}
\end{center}
Characters can be formed into strings, represented as lists:
\begin{center}
\begin{tabular}{lcl}
$\textit{string}$ & $\dn$ & $[] \mid c :: cs$\\
& & $(c\; \text{has char type})$
\end{tabular}
\end{center}
Strings can be concatenated to form longer strings:
\begin{center}
\begin{tabular}{lcl}
$[] @ s_2$ & $\dn$ & $s_2$\\
$(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
\end{tabular}
\end{center}

The concatenation operation can be lifted to sets of strings (\emph{languages}):
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
\end{tabular}
\end{center}
We call this operation \emph{language concatenation}.
The power of a language is defined recursively, using the
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A^n @ A$
\end{tabular}
\end{center}
The union of all natural-number powers of a language
is the Kleene star of that language:
\begin{center}
\begin{tabular}{lcl}
$A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\
\end{tabular}
\end{center}

In Isabelle we cannot easily define a counterpart of
the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star
as an inductive set:
\begin{center}
\begin{tabular}{lcl}
$[] \in A^*$ & &\\
$s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\
\end{tabular}
\end{center}

We also define an operation that chops off a character from
every string in a language:
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}

This can be generalised to chopping off an entire string from all strings in the set $A$:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
\end{tabular}
\end{center}

This is essentially the left quotient $A \backslash L'$ of $A$ with respect to
the singleton language $L' = \{w\}$
in formal language theory.
For this dissertation the $\textit{Ders}$ notation suffices; there is
no need for a more general derivative definition.

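To experiment with these operations, they can be transcribed for finite languages in Python. This is only an illustrative sketch with an encoding of our own choosing; real languages such as $A^*$ are usually infinite:

```python
# Language operations on *finite* sets of strings, mirroring the
# definitions above (for experimentation only).

def concat(A, B):                       # A @ B
    return {sa + sb for sa in A for sb in B}

def power(A, n):                        # A^n, with A^0 = { [] }
    result = {''}
    for _ in range(n):
        result = concat(result, A)
    return result

def Der(c, A):                          # chop the head character c off A
    return {s[1:] for s in A if s[:1] == c}

def Ders(w, A):                         # chop the prefix string w off A
    return {s[len(w):] for s in A if s.startswith(w)}

A = {'a', 'ab'}
assert concat(A, {'c'}) == {'ac', 'abc'}
assert power(A, 2) == {'aa', 'aab', 'aba', 'abab'}
assert Der('a', A) == {'', 'b'}
assert Ders('ab', A) == {''}
```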
With the sequence, Kleene star, and $\textit{Der}$ operator on languages,
we can state several properties describing how the derivative of a
composite language can be expressed in terms of its sub-languages.
\begin{lemma}
$\Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.

The derivative of the language $A^*$ can be described using the derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;A^* = (\textit{Der}\; c\; A) @ (A^*)$
\end{lemma}
\begin{proof}
\begin{itemize}
\item{$\subseteq$}
The set
\[ \{s \mid c :: s \in A^*\} \]
is included in the set
\[ \{s_1 @ s_2 \mid s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \]
because whenever a string in the language of the Kleene star $A^*$ starts with the character $c$,
then that character together with some substring
immediately after it forms the first iteration, and the rest of the string
is still in $A^*$.
\item{$\supseteq$}
Note that
\[ \Der \; c \; A^* = \Der \; c \; (\{ [] \} \cup (A @ A^*) ) \]
and
\[ \Der \; c \; (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \]
where the $\textit{RHS}$ of the above equation can be rewritten
as \[ (\Der \; c\; A) @ A^* \cup A', \] $A'$ being a possibly empty set.
\end{itemize}
\end{proof}
Before we define the $\textit{Der}$ and $\textit{Ders}$ counterparts
for regular expressions, we first give the definition of regular expressions themselves.

\section{Regular Expressions and Their Language Interpretation}

Suppose we have an alphabet $\Sigma$; the set of all strings
with characters from $\Sigma$
is written $\Sigma^*$.

We use patterns to define sets of strings concisely. Regular expressions
are one such pattern system.
The basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]

The language, or set of strings, denoted by a regular expression is defined as:
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\emptyset$\\
$L \; (\ONE)$ & $\dn$ & $\{ [] \}$\\
$L \; (c)$ & $\dn$ & $\{ [c] \}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
$L \; (r^*)$ & $\dn$ & $ (L \; (r))^*$\\
\end{tabular}
\end{center}
This mapping is also called the ``language interpretation'' of regular expressions.

% Derivatives of a
%regular expression, written $r \backslash c$, give a simple solution
%to the problem of matching a string $s$ with a regular
%expression $r$: if the derivative of $r$ w.r.t.\ (in
%succession) all the characters of the string matches the empty string,
%then $r$ matches $s$ (and {\em vice versa}).


\section{Brzozowski Derivatives of Regular Expressions}

Now that we have derivatives of languages, as well as regular expressions and
their language interpretations, we are ready to define derivatives on regular expressions.

The Brzozowski derivative w.r.t.\ a character $c$ is an operation on regular expressions:
it transforms a regular expression into a new one whose language contains
the original strings with the head character $c$ removed.

The derivative of a regular expression, written
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$, and returns another regular expression,
computed by the following recursive function:

\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The derivative function $r\backslash c$
describes how a regular expression evolves into
a new regular expression after the head character $c$
has been chopped off all the strings in its language.
The most involved cases are those for sequence
and star.
The sequence case says that if the first component of the sequence
can match the empty string, then the second component
may also be the one from which the head character
is chopped off.
The derivative of a star regular expression unwraps one iteration,
takes the derivative of it, and attaches the original star regular expression
to the second element of the resulting sequence, so that a copy is retained
for possible further iterations in later phases of lexing.


The $\nullable$ function tests whether the empty string $[]$
is in the language of $r$:

\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The empty set does not contain any string, and
therefore not the empty string. The empty-string
regular expression contains the empty string
by definition. The character regular expression
denotes a singleton language containing one single-character string,
and therefore does not contain the empty string.
The alternative regular expression (or ``or'' expression)
is nullable if either of its child regular expressions
is nullable. The sequence regular expression
requires both children to contain the empty string
in order to compose an empty string, and the Kleene star
naturally contains the empty string.

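The two functions above translate directly into executable form. The following Python fragment is an illustrative sketch with a hypothetical tuple encoding of regular expressions, not the thesis's Isabelle definitions; it mirrors the clauses of $\backslash$ and $\nullable$:

```python
# Regular expressions as nested tuples (a hypothetical encoding):
# ('zero',), ('one',), ('chr', c), ('alt', r1, r2), ('seq', r1, r2), ('star', r)
ZERO, ONE = ('zero',), ('one',)

def nullable(r):
    """Tests whether [] is in the language of r (mirrors the table above)."""
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag in ('zero', 'chr'): return False
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # seq

def der(c, r):
    """The Brzozowski derivative r\\c (mirrors the recursive clauses above)."""
    tag = r[0]
    if tag in ('zero', 'one'): return ZERO
    if tag == 'chr': return ONE if r[1] == c else ZERO
    if tag == 'alt': return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        head = ('seq', der(c, r[1]), r[2])
        return ('alt', head, der(c, r[2])) if nullable(r[1]) else head
    return ('seq', der(c, r[1]), r)                 # star: unwrap one iteration

# (a + b)* derived w.r.t. 'a' still matches the empty string:
r = ('star', ('alt', ('chr', 'a'), ('chr', 'b')))
assert nullable(der('a', r))
assert not nullable(der('a', ('chr', 'b')))
```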

We have the following property, namely that the derivative operation on regular
expressions coincides with the derivative operation on sets of strings:

\begin{lemma}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}

The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's regular-expression matching algorithm as:

\begin{definition}
$match\;s\;r \;\dn\; nullable(r\backslash s)$
\end{definition}

\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string, and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is, given
a string $s = c_0 \ldots c_{n-1}$ and a regular expression $r_0$, it generates YES if and only if $s \in L(r_0)$).
This is a beautifully simple definition.

If we implement the above algorithm naively, however,
it can be excruciatingly slow.

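The naive matcher is a direct transcription of the definition above. The sketch below uses the same hypothetical tuple encoding as before:

```python
# A naive transcription of the matcher: build successive derivatives,
# then test nullability of the final derivative.
ZERO, ONE = ('zero',), ('one',)

def nullable(r):
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag in ('zero', 'chr'): return False
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # seq

def der(c, r):
    tag = r[0]
    if tag in ('zero', 'one'): return ZERO
    if tag == 'chr': return ONE if r[1] == c else ZERO
    if tag == 'alt': return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        head = ('seq', der(c, r[1]), r[2])
        return ('alt', head, der(c, r[2])) if nullable(r[1]) else head
    return ('seq', der(c, r[1]), r)                 # star

def ders(s, r):                                     # r \ s, character by character
    for c in s:
        r = der(c, r)
    return r

def match(s, r):                                    # nullable(r \ s)
    return nullable(ders(s, r))

r = ('star', ('alt', ('chr', 'a'), ('seq', ('chr', 'a'), ('chr', 'a'))))  # (a+aa)*
assert match('aaa', r) and match('', r) and not match('ab', r)
```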
\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=10000,
    ytick={0,1000,...,10000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
of the form $\underbrace{aa\ldots a}_{n}$.}
\end{tabular}
\caption{Runtime for matching $(a^*)^*\,b$ with strings of the form $aa\ldots a$} \label{fig:EightThousandNodes}
\end{figure}

For example, when starting with the regular
expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree), see Figure~\ref{fig:EightThousandNodes}.
The reason why $(a + aa)^*$ explodes so drastically is that without
pruning, the algorithm keeps records of all possible ways of matching:
\begin{center}
$(a + aa) ^* \backslash [aa] = (\ZERO + (\ZERO \cdot a + \ONE))\cdot(a + aa)^* \,+\, (\ONE + \ONE \cdot a) \cdot (a + aa)^*$
\end{center}

\noindent
Each of the above alternative branches corresponds to one way of matching the
string $aa$: as one iteration $aa$, as two iterations $a$, $a$, or as part of a longer,
not-yet-completed iteration.
These different ways of matching grow exponentially with the string length,
and without simplifications that throw away some of these very similar matchings,
it is no surprise that these expressions grow so quickly.
Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative, the slower the
algorithm becomes.
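This growth can be observed directly. The sketch below, again with the hypothetical tuple encoding, counts tree nodes of successive derivatives of $(a+aa)^*$ w.r.t.\ the character $a$; in this encoding the term has already grown to thousands of nodes after ten steps:

```python
# Count the tree size of successive derivatives of (a + aa)* w.r.t. 'a'.
ZERO, ONE = ('zero',), ('one',)

def nullable(r):
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag in ('zero', 'chr'): return False
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # seq

def der(c, r):
    tag = r[0]
    if tag in ('zero', 'one'): return ZERO
    if tag == 'chr': return ONE if r[1] == c else ZERO
    if tag == 'alt': return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        head = ('seq', der(c, r[1]), r[2])
        return ('alt', head, der(c, r[2])) if nullable(r[1]) else head
    return ('seq', der(c, r[1]), r)                 # star

def size(r):                                        # number of nodes in the tree
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

r = ('star', ('alt', ('chr', 'a'), ('seq', ('chr', 'a'), ('chr', 'a'))))
sizes = [size(r)]
for _ in range(10):
    r = der('a', r)
    sizes.append(size(r))
assert all(a < b for a, b in zip(sizes, sizes[1:]))  # strictly growing
assert sizes[-1] > 1000                              # thousands of nodes already
```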

Brzozowski was quick to notice that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, making the result far from optimal.
He therefore introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regexes and further decrease the size
of the intermediate regexes.

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together as an atomic one: $r \backslash_{simp}\,c \dn
\textit{simp}(r \backslash c)$. Then we can build
a matcher with simpler regular expressions.
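A minimal version of such a simplification step can be sketched as follows. It applies only the local $\ZERO$/$\ONE$ rules and removes directly duplicated alternatives; the tuple encoding is again a hypothetical illustration, not the full simplification developed later:

```python
ZERO, ONE = ('zero',), ('one',)

def nullable(r):
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag in ('zero', 'chr'): return False
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])

def der(c, r):
    tag = r[0]
    if tag in ('zero', 'one'): return ZERO
    if tag == 'chr': return ONE if r[1] == c else ZERO
    if tag == 'alt': return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        head = ('seq', der(c, r[1]), r[2])
        return ('alt', head, der(c, r[2])) if nullable(r[1]) else head
    return ('seq', der(c, r[1]), r)

def size(r):
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def simp(r):
    """Local simplification: useless ZEROs/ONEs and direct duplicates."""
    tag = r[0]
    if tag == 'seq':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO or r2 == ZERO: return ZERO    # 0 . r = r . 0 = 0
        if r1 == ONE: return r2                     # 1 . r = r
        if r2 == ONE: return r1
        return ('seq', r1, r2)
    if tag == 'alt':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO: return r2                    # 0 + r = r
        if r2 == ZERO: return r1
        if r1 == r2:   return r1                    # r + r = r
        return ('alt', r1, r2)
    return r

def der_simp(c, r):                                 # r \_simp c = simp(r \ c)
    return simp(der(c, r))

r0 = ('star', ('alt', ('chr', 'a'), ('seq', ('chr', 'a'), ('chr', 'a'))))
naive, simplified = r0, r0
for _ in range(10):
    naive = der('a', naive)
    simplified = der_simp('a', simplified)
assert nullable(naive) == nullable(simplified)      # same YES/NO answer
assert size(simplified) < size(naive)               # but much smaller
```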

If we want the size of derivatives in the algorithm to
stay even lower, we need more aggressive simplifications:
essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. These more aggressive simplification rules allow for
a very tight size bound, possibly as low
as that of the \emph{partial derivatives}~\parencite{Antimirov1995}.
Building derivatives and then simplifying them:
so far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
This requires us to go back again, for a moment, to the world
without simplification.
Sulzmann and Lu~\parencite{Sulzmann2014} came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}

They first defined the datatype for storing the
lexing information, called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$  & $\ZERO$\\
& $\mid$ & $\ONE$   \\
& $\mid$ & $c$          \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$   \\
& $\mid$ & $r^*$         \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$  & \\
&        & $\Empty$   \\
& $\mid$ & $\Char(c)$          \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$   \\
& $\mid$ & $\Right(v)$  \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$  \\
\end{tabular}
\end{tabular}
\end{center}

Building on top of Sulzmann and Lu's attempt to formalise the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally-specified rules for what a POSIX matching is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But having a correct result is still not enough;
we want at least some degree of \emph{efficiency}.

One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and so on.
Even for a regular expression matching a particular string, there can
be more than one value corresponding to it.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
If we do not allow any empty iterations in its lexical values,
there will be $n - 1$ ``splitting points'' in $s$ where we can choose to
split or not, so that each substring
segmented by the chosen splitting points forms a different iteration:
\begin{center}
\begin{tabular}{lcr}
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$\\
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$\\
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$\\
& $\textit{etc}.$ &
\end{tabular}
\end{center}
And for each iteration, there are still multiple ways to split
between the two $a^*$s.
It is therefore not surprising that there are exponentially many distinct lexical values
for the regular expression and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.

A lexer that keeps all possible values will naturally
have an exponential runtime on ambiguous regular expressions.
Somehow one has to decide which lexical value to keep and
output in a lexing algorithm.
In practice, we are usually
interested in POSIX values, which intuitively always
\begin{itemize}
\item
match the leftmost regular expression when multiple options for matching
are available, and
\item
match a subpart as much as possible before proceeding
to the next token.
\end{itemize}
The formal definition of a $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
by the following set of rules (the axioms for $\ONE$, characters and the
empty star come first, followed by the rules for alternatives,
sequences and non-empty star iterations):
\newcommand*{\inference}[3][t]{%
\begingroup
\def\and{\\}%
\begin{tabular}[#1]{@{\enspace}c@{\enspace}}
#2 \\
\hline
#3
\end{tabular}%
\endgroup
}
\begin{center}
$([], \ONE) \rightarrow \Empty$
\qquad
$([c], c) \rightarrow \Char(c)$
\qquad
$([], r^*) \rightarrow \Stars\,[]$
\end{center}
\begin{center}
\inference{$(s, r_1) \rightarrow v$}{$(s, r_1 + r_2) \rightarrow \Left(v)$}
\qquad
\inference{$(s, r_2) \rightarrow v$ \and $s \notin L(r_1)$}{$(s, r_1 + r_2) \rightarrow \Right(v)$}
\end{center}
\begin{center}
\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$ }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
\end{center}
\begin{center}
\inference{$(s_1, r) \rightarrow v$ \and $(s_2, r^*) \rightarrow \Stars\,vs$ \and $s_1 \neq []$ \and $(\nexists s_3 s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land (s_1 @ s_3, r) \rightarrow v_3 \land (s_4, r^*) \rightarrow \Stars\,vs_4)$}{$(s_1 @ s_2, r^*) \rightarrow \Stars\,(v :: vs)$}
\end{center}

The reason why we are interested in $\POSIX$ values is that they can
be used practically in the lexing phase of a compiler front end.
For instance, when lexing a code snippet
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognised
as an identifier rather than a keyword.

For example, the above $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output we want from a lexing algorithm is such a POSIX matching,
encoded as a value.

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value
is generated if the regular expression matches the string.
How can we construct a value from regular expressions and character
sequences only?
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
& & & v_n
\end{tikzcd}
\end{equation}
\end{ceqn}

It tells us how an empty string can be matched by a
regular expression, in a $\POSIX$ way:

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}

\noindent
We favour the left child to match an empty string if there is a choice.
When a star is to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.

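The $\mkeps$ clauses translate directly into the hypothetical tuple encoding used in the earlier sketches, with values encoded as tuples such as \texttt{('Left', v)} and \texttt{('Stars', vs)}:

```python
def nullable(r):
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag in ('zero', 'chr'): return False
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])

def mkeps(r):
    """Builds the POSIX value for how a nullable r matches the empty string."""
    tag = r[0]
    if tag == 'one':
        return ('Empty',)
    if tag == 'alt':                      # favour the left alternative
        if nullable(r[1]):
            return ('Left', mkeps(r[1]))
        return ('Right', mkeps(r[2]))
    if tag == 'seq':
        return ('Seq', mkeps(r[1]), mkeps(r[2]))
    return ('Stars', [])                  # star: zero iterations

# a* + 1 : the left (star) branch is nullable, so it is preferred
assert mkeps(('alt', ('star', ('chr', 'a')), ('one',))) == ('Left', ('Stars', []))
```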
After the $\mkeps$-call, we inject the characters back one by one in order to build
the lexical value $v_i$ for how the regular expression $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The POSIX property is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value.
\begin{ceqn}
\begin{equation}\label{graph:inj}
\begin{tikzcd}
r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
The definition of $\textit{inj}$ is as follows:
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
Each clause does one thing: it identifies the ``hole'' in a
value into which the character is injected back.
For instance, consider the last clause, which injects back into a value
corresponding to the derivative of a star.
The value to be injected into must be a sequence value, whose first
component corresponds to the child regular expression of the star
with its first character chopped off, i.e.\ the iteration of the star
that has just been unfolded. This component is followed by the already
matched star iterations collected before. So we inject the character
back into the first component and form a new value with this latest iteration
added to the previous list of iterations, all under the $\Stars$
constructor at the top.
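The $\textit{inj}$ clauses can be sketched in the same hypothetical tuple encoding (values as tuples such as \texttt{('Seq', v1, v2)}); the sketch assumes the value really does have the shape dictated by the regular expression:

```python
def nullable(r):
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag in ('zero', 'chr'): return False
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])

def mkeps(r):
    tag = r[0]
    if tag == 'one':  return ('Empty',)
    if tag == 'alt':
        return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))
    if tag == 'seq':  return ('Seq', mkeps(r[1]), mkeps(r[2]))
    return ('Stars', [])

def inj(r, c, v):
    """Injects character c back into value v (built for r\\c), as above."""
    tag = r[0]
    if tag == 'chr':                                        # v = Empty
        return ('Char', c)
    if tag == 'alt':
        return (('Left', inj(r[1], c, v[1])) if v[0] == 'Left'
                else ('Right', inj(r[2], c, v[1])))
    if tag == 'seq':
        if v[0] == 'Seq':  return ('Seq', inj(r[1], c, v[1]), v[2])
        if v[0] == 'Left': return ('Seq', inj(r[1], c, v[1][1]), v[1][2])
        return ('Seq', mkeps(r[1]), inj(r[2], c, v[1]))     # v = Right(v')
    # star: v = Seq(v1, Stars vs) -- cons the repaired iteration onto vs
    return ('Stars', [inj(r[1], c, v[1])] + v[2][1])

# Injecting 'a' back into the value for (a*)\a matching "" yields one iteration:
r = ('star', ('chr', 'a'))
v = ('Seq', ('Empty',), ('Stars', []))   # mkeps of (1 . a*), i.e. r\a matching ""
assert inj(r, 'a', v) == ('Stars', [('Char', 'a')])
```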
Putting the functions $\inj$, $\mkeps$ and $\backslash$ together,
we have a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\
$\lexer \; r \; (c::s)$ & $=$ & $\inj \; r \; c \; (\lexer \; (r\backslash c) \; s)$
\end{tabular}
\end{center}
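The whole pipeline can be assembled into a runnable sketch with the same hypothetical tuple encoding; like the definition above, it assumes the string actually matches, i.e.\ the final derivative is nullable:

```python
ZERO, ONE = ('zero',), ('one',)

def nullable(r):
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag in ('zero', 'chr'): return False
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])

def der(c, r):
    tag = r[0]
    if tag in ('zero', 'one'): return ZERO
    if tag == 'chr': return ONE if r[1] == c else ZERO
    if tag == 'alt': return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        head = ('seq', der(c, r[1]), r[2])
        return ('alt', head, der(c, r[2])) if nullable(r[1]) else head
    return ('seq', der(c, r[1]), r)

def mkeps(r):
    tag = r[0]
    if tag == 'one':  return ('Empty',)
    if tag == 'alt':
        return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))
    if tag == 'seq':  return ('Seq', mkeps(r[1]), mkeps(r[2]))
    return ('Stars', [])

def inj(r, c, v):
    tag = r[0]
    if tag == 'chr': return ('Char', c)
    if tag == 'alt':
        return (('Left', inj(r[1], c, v[1])) if v[0] == 'Left'
                else ('Right', inj(r[2], c, v[1])))
    if tag == 'seq':
        if v[0] == 'Seq':  return ('Seq', inj(r[1], c, v[1]), v[2])
        if v[0] == 'Left': return ('Seq', inj(r[1], c, v[1][1]), v[1][2])
        return ('Seq', mkeps(r[1]), inj(r[2], c, v[1]))
    return ('Stars', [inj(r[1], c, v[1])] + v[2][1])        # star

def lexer(r, s):
    """Derivatives going right, injections coming back left."""
    if not s:
        return mkeps(r)
    return inj(r, s[0], lexer(der(s[0], r), s[1:]))

r = ('star', ('alt', ('chr', 'a'), ('chr', 'b')))           # (a + b)*
assert lexer(r, 'ab') == ('Stars', [('Left', ('Char', 'a')),
                                    ('Right', ('Char', 'b'))])
```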
Pictorially, the algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$, and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$ defined above, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$.

We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well---they reflect
the size of the regular expression by definition.

One can introduce simplification on the regular expressions and values, but one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative step, and applying these rectifications
during the injection phase.
One can then prove that the POSIX value of how
regular expressions match strings is not affected---although this is much harder
to establish.
Some initial results in this regard have been
obtained in \parencite{AusafDyckhoffUrban2016}.

%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification, or he may well be
%stuck on an efficient simplification with proof.
%He went on to examine the use of derivatives together with
%automaton, and did not try lexing using products.

We want to get rid of the complex and fragile rectification of values.
Can we avoid creating those intermediate values $v_1,\ldots, v_n$ altogether,
and obtain the lexing information, which is already there while
doing derivatives, in one pass, without a second injection phase?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
annotating the regular expressions with additional information,
called \emph{bitcodes}.