% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1}

%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\mathit{mkeps}}

\def\DFA{\textit{DFA}}
\def\bmkeps{\textit{bmkeps}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like


Regular expressions are widely used in computer science:
be it in IDEs with syntax highlighting and auto-completion,
command line tools like $\mathit{grep}$ that facilitate easy
text processing, network intrusion
detection systems that reject suspicious traffic, or compiler
front ends--there is always an important phase of the
task that involves matching a regular
expression with a string.
Given its usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.

If you go to a popular programming language's
regex engine,
you can supply it with a regex and strings of your own,
and get matching/lexing information such as how a
sub-part of the regex matches a sub-part of the string.
These lexing libraries are on average quite fast.
%TODO: get source for SNORT/BRO's regex matching engine/speed
For example, the regex engines some network intrusion detection
systems use are able to process
megabytes or even gigabytes of network traffic per second.

Why do we need our own version, if the existing algorithms are
blindingly fast already? It turns out that this is not always the case.


Take $(a^*)^*\,b$ and ask whether
strings of the form $aa\ldots a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings like 28
$a$'s, one discovers that this decision takes around 30 seconds and
takes considerably longer when adding a few more $a$'s, as the graphs
below show:
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={JavaScript},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Python},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Java 8},
legend pos=north west,
legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
of the form $\underbrace{aa\ldots a}_{n}$.}
\end{tabular}
\end{center}


This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns.
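
To make this concrete, here is a small experiment (our own sketch, not
taken from any cited source) that reproduces the blow-up using Java's
backtracking regex engine from Scala; the exact timings are
machine-dependent:
\begin{verbatim}
// time the evil regex (a*)*b against strings a^n
import java.util.regex.Pattern

object Catastrophic extends App {
  val p = Pattern.compile("(a*)*b")
  for (n <- 20 to 28) {
    val s = "a" * n
    val t0 = System.nanoTime()
    p.matcher(s).matches()   // always false: the final b is missing
    val t1 = System.nanoTime()
    println(s"n = $n: ${(t1 - t0) / 1e9} seconds")
  }
}
\end{verbatim}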


This superlinear blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was submitted,
but importantly the white spaces were neither at the beginning nor at
the end. As a result, the regular expression matching engine needed to
backtrack over many choices. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site become unavailable.
A more
recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the
outage had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
%TODO: data points for some new versions of languages
These problems with regular expressions
are not isolated events that happen
very occasionally, but are actually quite widespread.
They occur so often that they have been given a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\parencite{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, and npm and pypi.
They therefore concluded that evil regular expressions
are more than ``a parlour trick'': they are a problem that
requires
more research attention.

\section{Why are current algorithms for regexes slow?}

%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two phases:
the construction phase, in which the algorithm builds some
suitable data structure from the input regex $r$ (we denote
its time cost by $P_1(r)$), and the lexing
phase, in which the input string $s$ is read and the data structure
representing the regex $r$ is operated on (we denote the time
this takes by $P_2(r, s)$).\\
In the case of a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition--
a deterministic finite automaton
by definition has at most one state active and takes at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later. \\
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
The cost $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $O(|r| \cdot 2^{|s|})$.
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.
\subsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{JAVA}$ that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to such lexer generators, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.\\

Bounded repetitions, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denote a regular expression accepting strings
that can be divided into $c$ substrings, where each
substring can be matched by $r$.
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
\node[state, red] (q_1) [right=of q_0] {$q_1$};
\node[state, red] (q_2) [right=of q_1] {$q_2$};
\node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
\path[->]
(q_0) edge node {a} (q_1)
edge [loop below] node {a,b} ()
(q_1) edge node {a,b} (q_2)
(q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
The red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
$q_3$ is the last state, requiring 0 more characters, and is accepting.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
would require at least $2^{n+1}$ states, if $r$ contains
more than 1 string.
This is to represent all the different
scenarios in which ``countdown'' states are active.
For those regexes, tools such as $\mathit{JFLEX}$
would generate gigantic $\mathit{DFA}$s or
out-of-memory errors.
For this reason, regex libraries that support
bounded repetitions often choose to use the $\mathit{NFA}$
approach.
\subsection{The $\mathit{NFA}$ approach to regex matching}
One can simulate an $\mathit{NFA}$ running in two ways:
one is to keep track of all active states after consuming
a character, and update that set of states iteratively.
This can be viewed as a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation, which guarantees a linear runtime
in terms of input string length.
%TODO:try out these lexers
The other way to use an $\mathit{NFA}$ for matching is to choose
a single transition each time, keep all the other options in
a queue or stack, and backtrack if that choice eventually
fails. This method, often called ``depth-first search'',
is efficient in a lot of cases, but could end up
with exponential run time.\\
%TODO:COMPARE java python lexer speed with Rust and Go
The reason behind the backtracking algorithms in languages like
Java and Python is that they support back-references.
\subsection{Back References in Regexes--a Non-Regular Part}
If we have a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
$r_1(r_2(r_3r_4))$
\end{center}
we could label the sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
Here $r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4: $1$ refers to the entire expression
$(r_1(r_2(r_3)(r_4)))$, $2$ refers to $r_2(r_3)(r_4)$, etc.
These sub-expressions are called ``capturing groups''.
We can use the following syntax to denote that we want a string just matched by a
sub-expression (capturing group) to appear at a certain location again,
exactly as it was:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
The backslash and number $i$ are used to denote such
so-called ``back-references''.
Let $e$ be an expression made of regular expressions
and back-references, where $e$ contains the expression $e_i$
as its $i$-th capturing group.
The semantics of back-references can be recursively
written as:
\begin{center}
\begin{tabular}{c}
$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e),\quad s_i \in L(r_i),$\\
$s_i\; \text{is the string matched by the $i$-th capturing group of ($e$, $s$)}\}$
\end{tabular}
\end{center}
The concrete example
$((a|b|c|\ldots|z)^*)\backslash 1$
would match strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.\\
Back-references are a construct in the ``regex'' standard
that programmers find quite useful, but they are not exactly
regular any more.
In fact, back-references allow regexes to express
languages that are not even contained in the context-free
languages.
For example, the back-reference expression $(a^*)b\backslash 1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
Such a language sits in the context-sensitive level
of the formal language hierarchy.
Solving the matching problem for back-reference expressions
is NP-complete\parencite{alfred2014algorithms} and a non-backtracking,
efficient solution is not known to exist.
%TODO:read a bit more about back reference algorithms
It seems that languages like Java and Python made the trade-off
to support back-references at the expense of having to backtrack,
even in the case of regexes not involving back-references.\\
Summing these up, we can categorise existing
practical regex libraries into ones with linear
time guarantees, like Go and Rust, which impose restrictions
on the user input (not allowing back-references,
bounded repetitions cannot exceed 1000, etc.), and ones
that allow the programmer much freedom, but grind to a halt
in some non-negligible portion of cases.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
\section{Buggy Regex Engines}


Another problem with these regex libraries is that there
is no correctness guarantee.
In some cases they either fail to generate a lexing result when there is a match,
or give a wrong match.


It turns out that regex libraries not only suffer from
exponential backtracking problems,
but also from undesired (or even buggy) outputs.
%TODO: comment from who
Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.
This experience is echoed by the author's
trials with a few online regex testers.
A concrete example would be
the regex
\begin{verbatim}
(((((a*a*)b*)b){20})*)c
\end{verbatim}
and the string
\begin{verbatim}
baabaabababaabaaaaaaaaababaa
aababababaaaabaaabaaaaaabaab
aabababaababaaaaaaaaababaaaa
babababaaaaaaaaaaaaac
\end{verbatim}

This seemingly complex regex simply says ``some $a$'s
followed by some $b$'s then followed by one single $b$,
and this iterates 20 times, finally followed by a $c$''.
A POSIX match would involve the entire string, ``eating up''
all the $b$'s in it.
%TODO: give a coloured example of how this matches POSIXly

This regex would trigger catastrophic backtracking in
languages like Python and Java,
whereas it gives a non-POSIX and uninformative
match in languages like Go or .NET--namely the match consisting of only
the character $c$.

As Grathwohl\parencite{grathwohl2014crash} commented,
\begin{center}
``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
\end{center}

%\section{How people solve problems with regexes}


When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form,
or they try to avoid the possibly problematic patterns completely\parencite{Davis18},
a process in which there are many false positives.
Animated tools to ``debug'' regular expressions
are also quite popular, regexploit\parencite{regexploit2021} and regex101\parencite{regex101}
to name a few.
There is also static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-time estimates
for regexes, attack strings that exploit the worst-time
scenario, and ``attack automata'' that generate
attack strings.
%Arguably these methods limits the programmers' freedom
%or productivity when all they want is to come up with a regex
%that solves the text processing problem.

%TODO:also the regex101 debugger
\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
Is it possible to have a regex lexing algorithm with proven correctness and
time complexity, which allows easy extensions to
constructs like
bounded repetitions, negation, lookarounds, and even back-references?

We propose Brzozowski derivatives on regular expressions as
a solution to this.

In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities. The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions.
An algorithm based on them by
Sulzmann and Lu \parencite{Sulzmann2014} allows easy extension
to include extended regular expressions and
simplification of internal data structures,
eliminating the exponential behaviours.
%----------------------------------------------------------------------------------------

\section{Our Contribution}


This work addresses the vulnerability of super-linear and
buggy regex implementations by the combination
of Brzozowski's derivatives and interactive theorem proving.
We give an
improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives, which comes with a formal guarantee in terms of correctness and
running time as an Isabelle/HOL proof.
We then improve the algorithm with an even stronger version of
simplification, and prove a time bound linear in the length of the input and
cubic in the size of the regular expression, using a technique by
Antimirov.

The main contribution of this thesis is a proven correct lexing algorithm
with formalized time bounds.
To the best of our knowledge, there are no lexing libraries using Brzozowski derivatives
that come with a provable time guarantee,
and claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
For example, Sulzmann and Lu had proposed an algorithm for which they
claim a linear running time.
But that claim was falsified by our experiments: the running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalized the correctness of the lexer, but not the complexity.
In the performance evaluation section, they simply analyzed the run time
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to lex only 40 $a$'s was 5 minutes.

We believe our proof of performance on general
inputs, rather than on specific examples, is a novel contribution.\\


\subsection{Related Work}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.

%We propose Brzozowski's derivatives as a solution to this problem.
% about Lexing Using Brzozowski derivatives
\section{Preliminaries}

Suppose we have an alphabet $\Sigma$; the set of all strings whose characters
are drawn from $\Sigma$
is written $\Sigma^*$.

We use patterns to define sets of strings concisely. Regular expressions
are one such pattern system:
the basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]

The language, or set of strings, defined by a regular expression
is given by the following clauses:
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\{\}$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; r_1 \cdot r_2$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; r_1,\, s_2 \in L \; r_2\}$\\
$L \; r^*$ & $\dn$ & $\bigcup_{n \geq 0} (L \; r)^n$\\
\end{tabular}
\end{center}
This is also called the ``language interpretation'' of regular expressions.
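For example, under this interpretation we have
$L\,((a + b)\cdot c) = \{ac,\, bc\}$ and $L\,(a^*) = \{[],\, a,\, aa,\, \ldots\}$.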



The Brzozowski derivative w.r.t.\ a character $c$ is an operation on a regex;
it transforms the regex into a new one describing what is left of
its strings after the head character $c$ is chopped off.

Formally, we first define such a transformation on any string set, which
we call the semantic derivative:
\begin{center}
$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
\end{center}
If the $\textit{StringSet}$ happens to have some structure, for example,
if it is regular, then the set $\Der\;c\;\textit{StringSet}$ is regular as
well, and can be computed by a corresponding operation on a regular expression.

% Derivatives of a
%regular expression, written $r \backslash c$, give a simple solution
%to the problem of matching a string $s$ with a regular
%expression $r$: if the derivative of $r$ w.r.t.\ (in
%succession) all the characters of the string matches the empty string,
%then $r$ matches $s$ (and {\em vice versa}).

The derivative of a regular expression, denoted
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$, and returns another regular expression $r'$,
which is computed by the following recursive function:

\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}

\noindent
The $\nullable$ function tests whether the empty string $""$
is in the language of $r$:


\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The empty set does not contain any string, and
therefore does not contain the empty string. The empty string
regular expression contains the empty string
by definition. The character regular expression
is the singleton containing that character only,
and therefore does not contain the empty string.
The alternative regular expression (or ``or'' expression)
is nullable if any one of its children is nullable.
The sequence regular expression
requires both children to contain the empty string
in order to compose an empty string, and the Kleene star
naturally contains the empty string.
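For example, $\nullable((a + \ONE)\cdot b^*)$ is true: the alternative
$a + \ONE$ is nullable because of its right child $\ONE$, and $b^*$ is
always nullable; on the other hand, $\nullable(a\cdot b^*)$ is false,
because the character regular expression $a$ is not nullable.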

We can relate the derivative operation to the language interpretation:
taking the derivative of $r$ w.r.t.\ a character $c$ corresponds exactly to
taking the semantic derivative of its language, that is,
$L \, (r\backslash c) = \Der \; c \; (L \, r)$.

\noindent
The function derivative, written $\backslash c$,
defines how a regular expression evolves into
a new regular expression after the head character $c$
is chopped off all the strings it contains.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
might be chosen as the target regular expression to have its head
character chopped off.
The star case unwraps one iteration of
the regular expression and attaches the star regular expression
to its back again, to make sure there can be 0 or more iterations
following this unfolded iteration.
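
As a small worked example of these definitions, take $r = (a\cdot b) + b$.
Then
\begin{center}
$r\backslash a = (\ONE\cdot b) + \ZERO$ \qquad and \qquad
$(r\backslash a)\backslash b = ((\ZERO\cdot b) + \ONE) + \ZERO$,
\end{center}
and the latter is nullable, confirming that the string $ab$ is in $L(r)$.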


The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's regular-expression matching algorithm as:

\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]

\noindent
Assuming a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm presented graphically is as follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
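
These definitions transcribe almost verbatim into a functional language.
Below is a minimal sketch in Scala (all names are ours and purely
illustrative; we use Scala~3-style top-level definitions):
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the Brzozowski derivative of r w.r.t. the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a whole string, and the matcher itself
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}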

This is a beautiful and simple definition.

If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow. For example, when starting with the regular
expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm.

Brzozowski was quick in finding that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, making the naive process far from optimal.
He therefore introduced some ``similarity rules'' such
as $P+(Q+R) = (P+Q)+R$ to merge syntactically
different but language-equivalent sub-regexes and further decrease the size
of the intermediate regexes.

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher without having cumbersome regular expressions.
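
A basic version of such a $\textit{simp}$ function can be sketched in
our running Scala code as follows (it removes $\ZERO$s, $\ONE$s and
directly adjacent duplicates only; this is illustrative, not the full
set of rules discussed below):
\begin{verbatim}
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2                       // 0 + r = r
    case (s1, ZERO) => s1                       // r + 0 = r
    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO                     // 0 . r = 0
    case (_, ZERO)  => ZERO                     // r . 0 = 0
    case (ONE, s2)  => s2                       // 1 . r = r
    case (s1, ONE)  => s1                       // r . 1 = r
    case (s1, s2)   => SEQ(s1, s2)
  }
  case _ => r                                   // chars, 0, 1, stars
}

// simplify after each derivative step
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}
\end{verbatim}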


If we want the size of derivatives in the algorithm to
stay even lower, we would need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve a very tight size bound, namely,
the same size bound as that of the \emph{partial derivatives}.

Building derivatives and then simplifying them:
so far so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
This requires us to go back again to the world
without simplification first for a moment.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}


They first defined the datatypes for storing the
lexing information called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent
Building on top of Sulzmann and Lu's attempt to formalize the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally-specified rules for what a POSIX matching is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But having a correct result is still not enough, we want $\mathbf{efficiency}$.



One regular expression can have multiple lexical values. For example
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and vice versa.
Even when we fix the regular expression and a string it matches, there could
still be more than one value corresponding to the match.
Take as an example $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching
without allowing any value under a star to be flattened
to an empty string can be given by the following formula:
\begin{center}
$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
\end{center}
and a closed form formula can be calculated to be
\begin{equation}
C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which clearly grows exponentially.

A lexer aimed at producing all the possible values has an exponential
worst case runtime. Therefore it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which by intuition always
match the leftmost regular expression when there is a choice
and always match a sub-part as much as possible before proceeding
to the next token. For example, the above example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output we want from a lexing algorithm is such a POSIX match
encoded as a value.
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated in case the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}


\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The POSIX value is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses basically do one thing: identify the ``hole'' in the
value into which the character should be injected back.
For instance, consider the last clause, which injects back into a value
corresponding to a star. We know the incoming value must be a
sequence value, whose first component corresponds to the child regex
of the star with the first character chopped off--an iteration of the star
that has just been unfolded--followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value, with this new iteration
added to the previous list of iterations, all under the $\Stars$
top level.
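
For concreteness, the following Scala sketch transcribes the tables for
values, $\mkeps$ and $\inj$ above into our running code (the names are
ours; the pattern matches are partial, which is fine here since the
missing cases cannot arise for well-formed inputs):
\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// a POSIX value for how a nullable regex matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

// inject the character c back into the value v
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty) => Chr(d)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// the two-phase lexer: derivatives forward, injections backward
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil =>
    if (nullable(r)) mkeps(r) else throw new Exception("no match")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}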

We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well--they reflect
the size of the regular expression by definition.

One can introduce simplification on the regexes and values, but one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regexes and values
being consistent and matching each other.
This can be achieved by recording some extra rectification functions
during the derivatives step, and applying these rectifications in
each run during the injection phase.
One can then prove that the POSIX value of how
regular expressions match strings will not be affected---although this is
much harder to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.



%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification or he may well be
%stuck on an efficient simplificaiton with a proof.
%He went on to explore the use of derivatives together with
%automaton, and did not try lexing using derivatives.

We want to get rid of the complex and fragile rectification of values.
Can we not create those intermediate values $v_1,\ldots v_n$,
and get the lexing information that should be already there while
doing derivatives in one pass, without a second phase of injection?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding additional information to the
regular expressions called \emph{bitcodes}.
\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:

\begin{center}
$b ::= 1 \mid 0 \qquad
bs ::= [] \mid b::bs
$
\end{center}

\noindent
The $1$ and $0$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
$\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
$\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
$\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
code(\Stars\,vs)$
\end{tabular}
\end{center}

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:


%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
$\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
(\Left\,v, bs_1)$\\
$\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
(\Right\,v, bs_1)$\\
$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
$\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
& & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\

$\textit{decode}\,bs\,r$ & $\dn$ &
$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
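
A transcription of $\textit{code}$ and $\textit{decode}$ into our running
Scala sketch is given below (bits are modelled as \code{Int}s and error
handling is simplified; the names are ours):
\begin{verbatim}
def code(v: Val): List[Int] = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case Left(v1) => 0 :: code(v1)
  case Right(v1) => 1 :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(0)
  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
}

// decode a bitcode back into a value, guided by the regex r
def decodeAux(bs: List[Int], r: Rexp): (Val, List[Int]) = (r, bs) match {
  case (ONE, _) => (Empty, bs)
  case (CHAR(c), _) => (Chr(c), bs)
  case (ALT(r1, _), 0 :: rest) =>
    val (v, bs1) = decodeAux(rest, r1); (Left(v), bs1)
  case (ALT(_, r2), 1 :: rest) =>
    val (v, bs1) = decodeAux(rest, r2); (Right(v), bs1)
  case (SEQ(r1, r2), _) =>
    val (v1, bs1) = decodeAux(bs, r1)
    val (v2, bs2) = decodeAux(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (STAR(_), 0 :: rest) => (Stars(Nil), rest)
  case (STAR(r1), 1 :: rest) =>
    val (v, bs1) = decodeAux(rest, r1)
    decodeAux(bs1, r) match {
      case (Stars(vs), bs2) => (Stars(v :: vs), bs2)
      case _ => throw new Exception("bad encoding")
    }
  case _ => throw new Exception("bad encoding")
}

def decode(bs: List[Int], r: Rexp): Option[Val] = decodeAux(bs, r) match {
  case (v, Nil) => Some(v)
  case _ => None
}
\end{verbatim}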

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}

\begin{center}
\begin{tabular}{lcl}
$\textit{a}$ & $::=$ & $\ZERO$\\
& $\mid$ & $_{bs}\ONE$\\
& $\mid$ & $_{bs}{\bf c}$\\
& $\mid$ & $_{bs}\sum\,as$\\
& $\mid$ & $_{bs}a_1\cdot a_2$\\
& $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})

\noindent
where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalized to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.


To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
$(r_1 + r_2)^\uparrow$ & $\dn$ &
$_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
\textit{fuse}\,[1]\,r_2^\uparrow]$\\
$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
$(r^*)^\uparrow$ & $\dn$ &
$_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
%\end{definition}
|
|
1118 |
|
|
1119 |
\noindent
|
|
1120 |
We use up arrows here to indicate that the basic un-annotated regular
|
|
1121 |
expressions are ``lifted up'' into something slightly more complex. In the
|
|
1122 |
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
|
|
1123 |
attach bits to the front of an annotated regular expression. Its
|
|
1124 |
definition is as follows:
|
|

\begin{center}
\begin{tabular}{lcl}
$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
$_{bs @ bs'}\ONE$\\
$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
$_{bs@bs'}{\bf c}$\\
$\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
$_{bs@bs'}\sum\textit{as}$\\
$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
$_{bs@bs'}a_1 \cdot a_2$\\
$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
$_{bs @ bs'}a^*$
\end{tabular}
\end{center}
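
\noindent
Both $\textit{fuse}$ and internalisation have direct Scala
counterparts. The sketch below builds on the \code{Rexp} and
\code{ARexp} datatypes from the earlier sketches; as in the
mathematical definitions, the left alternative is marked with bit $0$
and the right one with bit $1$:

{\small
\begin{verbatim}
// fuse attaches the bits bs to the front of an annotated regular
// expression; AZERO carries no bits, so it is left alone
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO             => AZERO
  case AONE(bs1)         => AONE(bs ++ bs1)
  case ACHAR(bs1, c)     => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as)    => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1)    => ASTAR(bs ++ bs1, a1)
}

// internalise is the lifting operation written r^ in the text
def internalise(r: Rexp): ARexp = r match {
  case ZERO        => AZERO
  case ONE         => AONE(Nil)
  case CHAR(c)     => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(0), internalise(r1)),
                    fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1)    => ASTAR(Nil, internalise(r1))
}
\end{verbatim}}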

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:

\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\;\textit{then}\;
_{bs}\ONE\;\textit{else}\;\ZERO$\\
$(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
$_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
& &$\phantom{\textit{then}\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
$_{bs}((\textit{fuse}\,[0]\,(a\,\backslash c))\cdot
(_{[]}a^*))$
\end{tabular}
\end{center}

%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $0$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The sequence clause
is more subtle: the interesting case is when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions, so we omit the
definition). Assuming that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
the right branch of the alternative, which is $\textit{fuse}\;(\bmkeps\;a_1)\,(a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the lexing information at the
head of the regular expression $a_2 \backslash c$ by fusing it there. The
bitsequence $\textit{bs}$, which was initially attached to the
sequence $a_1 \cdot a_2$, has
now been elevated to the top-level of the $\sum$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:

%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
$\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
$\textit{bmkeps}\,(_{bs}\sum\,(a::\textit{as}))$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a$\\
& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
& &$\textit{else}\;\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
$bs \,@\, [1]$
\end{tabular}
\end{center}
%\end{definition}
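
For reference, here is a Scala sketch of $\textit{bmkeps}$ together
with $\textit{bnullable}$ (the latter being the obvious adaptation of
$\textit{nullable}$ whose mathematical definition we omitted); as
before, the names and the error handling are our own choices:

{\small
\begin{verbatim}
// bnullable: nullable lifted to annotated regular expressions
def bnullable(a: ARexp): Boolean = a match {
  case AZERO           => false
  case AONE(_)         => true
  case ACHAR(_, _)     => false
  case AALTS(_, as)    => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _)     => true
}

// bmkeps collects the bits along the POSIX path through a
// bnullable annotated regular expression; a star is closed
// off with the end-marker bit 1
def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs)            => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ++ bmkeps(a1)
    else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2)    => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _)        => bs ++ List(1)
  case _ => sys.error("bmkeps: not bnullable")
}
\end{verbatim}}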

\noindent
The function $\textit{bmkeps}$ completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using the bit $1$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:

\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
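
A Scala sketch of the bitcoded derivative and of $\textit{blexer}$,
assuming the functions \code{fuse}, \code{internalise},
\code{bnullable}, \code{bmkeps} and \code{decode} from the sketches
above:

{\small
\begin{verbatim}
// the derivative on annotated regular expressions; in the star
// case the new iteration is recorded by fusing the bit 0
def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO         => AZERO
  case AONE(_)       => AZERO
  case ACHAR(bs, d)  => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(0), bder(c, a1)), ASTAR(Nil, a1))
}

// bders extends bder from characters to strings
def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}}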

Recall that one of the important reasons we introduced bitcodes
is that they make simplification more structured, and thereby
help guarantee correctness.

\subsection*{Our Simplification Rules}

In this section we introduce aggressive (in terms of size) simplification rules
on annotated regular expressions
in order to keep derivatives small. Such simplifications are promising,
as we have
generated test data showing
that a tight size bound can be achieved. Obviously such testing can only
partially cover the search space, as there are infinitely many regular
expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:

\begin{center}
\begin{tabular}{@{}lcl@{}}

$\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (_{bs'}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs\,@\,bs') \; a_2'$ \\
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\

$\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse}\;bs\;a$ \\
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\

$\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
The simplification performs a pattern match on the annotated regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its children regular expressions
recursively and then checks whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurrence of a regular expression
and removes all later duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:

\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{flatten} \; ((_{bs}\sum \textit{as}) :: \textit{as'})$ & $\dn$ & $(\textit{map} \;
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; \textit{as'} $ \\
$\textit{flatten} \; (\ZERO :: \textit{as'})$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\
$\textit{flatten} \; (a :: \textit{as'})$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. In terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
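
In Scala, $\textit{flatten}$ and $\textit{simp}$ can be sketched as
below, building on the earlier sketches; for $\textit{distinct}$ we can
simply use Scala's built-in \code{List.distinct}, which keeps the first
occurrence of each element and removes all later ones:

{\small
\begin{verbatim}
// flatten opens up nested alternatives, fusing their bits into
// the children, and drops AZEROs along the way
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil                   => Nil
  case AZERO :: as1          => flatten(as1)
  case AALTS(bs, as1) :: as2 => as1.map(fuse(bs, _)) ++ flatten(as2)
  case a :: as1              => a :: flatten(as1)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _)       => AZERO
    case (_, AZERO)       => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)
    case (a1s, a2s)       => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil       => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1       => AALTS(bs, as1)
  }
  case other => other
}
\end{verbatim}}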

Having defined the $\simp$ function,
we can use the same natural
extension from derivatives w.r.t.~characters to derivatives
w.r.t.~strings as before, this time simplifying after each step:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(\textit{simp}\,(r \backslash c)) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
to obtain an optimised version of the algorithm:

\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small: for example,
with this simplification the 8000 nodes of our previous $(a + aa)^*$
example are reduced to just 6, and the size stays constant no matter
how long the input string is.
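
The simplifying lexer then differs from \code{blexer} only in
interleaving \code{simp} with the derivative steps. A sketch, with
function names of our own choosing:

{\small
\begin{verbatim}
// derivative w.r.t. a string, simplifying after each step
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bdersSimp(cs, simp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}

// e.g. for (a + aa)* the simplified derivatives stay small:
val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
val v = blexerSimp(r, "aaaa")   // Some(Stars(...))
\end{verbatim}}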


Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).


However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this
difficulty by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function. In our own earlier work we provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of Sulzmann and Lu's extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.

The second difficulty is that Brzozowski's derivatives can
grow to arbitrarily big sizes. For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we run out of memory on a
typical computer (we shall define shortly the precise details of our
regular expressions and the derivative operation). Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in our
earlier work to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.


Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bitcoded
regular expressions. In this version, POSIX values are
represented as bitsequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bitsequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bitcoded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this thesis):

\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method [..] in combination with the simplification steps [..]
yields POSIX parse trees. We have tested this claim
extensively [..] but yet
have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}


%----------------------------------------------------------------------------------------