lexing: ChengsongTanPhdThesis/Chapters/Bitcoded1.tex@3cbcd7cda0a9 (annotated)

532 cc54ce075db5 restructured Chengsong parents: diff changeset	1	% Chapter Template
cc54ce075db5 restructured Chengsong parents: diff changeset	2
cc54ce075db5 restructured Chengsong parents: diff changeset	3	% Main chapter title
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	4	\chapter{Bit-coded Algorithm of Sulzmann and Lu}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	5
cc54ce075db5 restructured Chengsong parents: diff changeset	6	\label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
cc54ce075db5 restructured Chengsong parents: diff changeset	7	%Then we illustrate how the algorithm without bitcodes falls short for such aggressive
cc54ce075db5 restructured Chengsong parents: diff changeset	8	%simplifications and therefore introduce our version of the bitcoded algorithm and
cc54ce075db5 restructured Chengsong parents: diff changeset	9	%its correctness proof in
cc54ce075db5 restructured Chengsong parents: diff changeset	10	%Chapter 3\ref{Chapter3}.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	11	In this chapter, we are going to describe the bit-coded algorithm
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	12	introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	13	regular expressions.
537 50e590823220 more Chengsong parents: 536 diff changeset	14	\section{Bit-coded Algorithm}
50e590823220 more Chengsong parents: 536 diff changeset	15	The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
50e590823220 more Chengsong parents: 536 diff changeset	16	stores information of previous lexing steps
50e590823220 more Chengsong parents: 536 diff changeset	17	on a stack, in the form of regular expressions
50e590823220 more Chengsong parents: 536 diff changeset	18	and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
50e590823220 more Chengsong parents: 536 diff changeset	19	\begin{ceqn}
50e590823220 more Chengsong parents: 536 diff changeset	20	\begin{equation}%\label{graph:injLexer}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	21	\begin{tikzcd}[ampersand replacement=\&, execute at end picture={
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	22	\begin{scope}[on background layer]
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	23	\node[rectangle, fill={red!30},
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	24	pattern=north east lines, pattern color=red,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	25	fit={(-3,-1) (-3, 1) (1, -1)
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	26	(1, 1)}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	27	]
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	28	{}; ,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	29	\node[rectangle, fill={blue!20},
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	30	pattern=north east lines, pattern color=blue,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	31	fit= {(1, -1) (1, 1) (3, -1) (3, 1)}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	32	]
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	33	{};
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	34	\end{scope}}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	35	]
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	36	r_0 \arrow[r, "\backslash c_0"] \arrow[d] \& r_1 \arrow[r, "\backslash c_1"] \arrow[d] \& r_2 \arrow[r, dashed] \arrow[d] \& r_n \arrow[d, "mkeps" description] \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	37	v_0 \& v_1 \arrow[l,"inj_{r_0} c_0"] \& v_2 \arrow[l, "inj_{r_1} c_1"] \& v_n \arrow[l, dashed] \\
537 50e590823220 more Chengsong parents: 536 diff changeset	38	\end{tikzcd}
50e590823220 more Chengsong parents: 536 diff changeset	39	\end{equation}
50e590823220 more Chengsong parents: 536 diff changeset	40	\end{ceqn}
50e590823220 more Chengsong parents: 536 diff changeset	41	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	42	The red part represents what we already know during the first
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	43	derivative phase,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	44	and the blue part represents the unknown part of input.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	45	The red area expands as we move towards $r_n$,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	46	indicating an increasing stack size during lexing.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	47	Despite having some partial lexing information during
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	48	the forward derivative phase, we choose to store them
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	49	temporarily, only to convert the information to lexical
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	50	values at a later stage. In essence we are repeating work we
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	51	have already done.
537 50e590823220 more Chengsong parents: 536 diff changeset	52	This is both inefficient and prone to stack overflow.
50e590823220 more Chengsong parents: 536 diff changeset	53	A natural question arises as to whether we can store lexing
50e590823220 more Chengsong parents: 536 diff changeset	54	information on the fly, while still using regular expression
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	55	derivatives.
537 50e590823220 more Chengsong parents: 536 diff changeset	56
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	57	If we remove the details of the individual
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	58	lexing steps, and use red and blue areas as before
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	59	to indicate consumed (seen) input and constructed
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	60	partial value (before recovering the rest of the stack),
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	61	one could see that the seen part's lexical information
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	62	is stored in the form of a regular expression.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	63	Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	64	and assume we have just read the two characters $aa$:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	65	\begin{center}
537 50e590823220 more Chengsong parents: 536 diff changeset	66	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	67	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	68	{Partial lexing info: $\ONE \cdot a \cdot (aa)^* \cdot bc$ etc.
537 50e590823220 more Chengsong parents: 536 diff changeset	69	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
50e590823220 more Chengsong parents: 536 diff changeset	70	\end{tikzpicture}
50e590823220 more Chengsong parents: 536 diff changeset	71	\end{center}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	72	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	73	In the injection-based lexing algorithm, we ``neglect" the red area
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	74	by putting all the characters we have consumed and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	75	intermediate regular expressions on the stack when
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	76	we go from left to right in the derivative phase.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	77	The red area grows till the string is exhausted.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	78	During the injection phase, the value in the blue area
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	79	is built up incrementally, while the red area shrinks.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	80	Before we have recovered all characters and intermediate
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	81	derivative regular expressions from the stack,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	82	what values these characters and regular expressions correspond
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	83	to are unknown:
537 50e590823220 more Chengsong parents: 536 diff changeset	84	\begin{center}
50e590823220 more Chengsong parents: 536 diff changeset	85	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	86	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	87	{$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	88	\nodepart{two} $b c$ corresponds to $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
537 50e590823220 more Chengsong parents: 536 diff changeset	89	%\caption{term 1 \ref{term:1}'s matching configuration}
50e590823220 more Chengsong parents: 536 diff changeset	90	\end{tikzpicture}
50e590823220 more Chengsong parents: 536 diff changeset	91	\end{center}
50e590823220 more Chengsong parents: 536 diff changeset	92	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	93	However, they should be calculable,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	94	as characters and regular expression shapes
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	95	after taking derivative w.r.t those characters
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	96	have already been known, therefore in our example,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	97	we know that the value starts with two $a$s,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	98	and makes up to an iteration in a Kleene star:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	99	(We have put the injection-based lexing's partial
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	100	result in the right part of the split rectangle
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	101	to contrast it with the partial valued produced
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	102	in a forward manner)
537 50e590823220 more Chengsong parents: 536 diff changeset	103	\begin{center}
50e590823220 more Chengsong parents: 536 diff changeset	104	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	105	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	106	{$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	107	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$ $\stackrel{Inj}{\longleftarrow}$};
537 50e590823220 more Chengsong parents: 536 diff changeset	108	%\caption{term 1 \ref{term:1}'s matching configuration}
50e590823220 more Chengsong parents: 536 diff changeset	109	\end{tikzpicture}
50e590823220 more Chengsong parents: 536 diff changeset	110	\end{center}
50e590823220 more Chengsong parents: 536 diff changeset	111	\noindent
50e590823220 more Chengsong parents: 536 diff changeset	112	If we do this kind of "attachment"
50e590823220 more Chengsong parents: 536 diff changeset	113	and each time augment the attached partially
50e590823220 more Chengsong parents: 536 diff changeset	114	constructed value when taking off a
50e590823220 more Chengsong parents: 536 diff changeset	115	character:
50e590823220 more Chengsong parents: 536 diff changeset	116	\begin{center}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	117	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	118	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	119	{$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	120	\nodepart{two} Remaining: $b c$};
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	121	\end{tikzpicture}\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	122	$\downarrow$\\
537 50e590823220 more Chengsong parents: 536 diff changeset	123	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	124	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
50e590823220 more Chengsong parents: 536 diff changeset	125	{$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	126	\nodepart{two} Remaining: $c$};
537 50e590823220 more Chengsong parents: 536 diff changeset	127	\end{tikzpicture}\\
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	128	$\downarrow$\\
537 50e590823220 more Chengsong parents: 536 diff changeset	129	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	130	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
50e590823220 more Chengsong parents: 536 diff changeset	131	{$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
50e590823220 more Chengsong parents: 536 diff changeset	132	\nodepart{two} EOF};
50e590823220 more Chengsong parents: 536 diff changeset	133	\end{tikzpicture}
50e590823220 more Chengsong parents: 536 diff changeset	134	\end{center}
50e590823220 more Chengsong parents: 536 diff changeset	135	\noindent
50e590823220 more Chengsong parents: 536 diff changeset	136	In the end we could recover the value without a backward phase.
50e590823220 more Chengsong parents: 536 diff changeset	137	But (partial) values are a bit clumsy to stick together with a regular expression, so
50e590823220 more Chengsong parents: 536 diff changeset	138	we instead use bit-codes to encode them.
50e590823220 more Chengsong parents: 536 diff changeset	139
50e590823220 more Chengsong parents: 536 diff changeset	140	Bits and bitcodes (lists of bits) are defined as:
50e590823220 more Chengsong parents: 536 diff changeset	141	\begin{center}
50e590823220 more Chengsong parents: 536 diff changeset	142	$b ::= S \mid Z \qquad
532 cc54ce075db5 restructured Chengsong parents: diff changeset	143	bs ::= [] \mid b::bs
cc54ce075db5 restructured Chengsong parents: diff changeset	144	$
cc54ce075db5 restructured Chengsong parents: diff changeset	145	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	146
cc54ce075db5 restructured Chengsong parents: diff changeset	147	\noindent
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	148	Using $S$ and $Z$ rather than $1$ and $0$ is to avoid
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	149	confusion with the regular expressions $\ZERO$ and $\ONE$.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	150	Bitcodes (or
532 cc54ce075db5 restructured Chengsong parents: diff changeset	151	bit-lists) can be used to encode values (or potentially incomplete values) in a
cc54ce075db5 restructured Chengsong parents: diff changeset	152	compact form. This can be straightforwardly seen in the following
cc54ce075db5 restructured Chengsong parents: diff changeset	153	coding function from values to bitcodes:
cc54ce075db5 restructured Chengsong parents: diff changeset	154	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	155	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	156	$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	157	$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
537 50e590823220 more Chengsong parents: 536 diff changeset	158	$\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
50e590823220 more Chengsong parents: 536 diff changeset	159	$\textit{code}(\Right\,v)$ & $\dn$ & $S :: code(v)$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	160	$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
537 50e590823220 more Chengsong parents: 536 diff changeset	161	$\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\
50e590823220 more Chengsong parents: 536 diff changeset	162	$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
532 cc54ce075db5 restructured Chengsong parents: diff changeset	163	code(\Stars\,vs)$
cc54ce075db5 restructured Chengsong parents: diff changeset	164	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	165	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	166	\noindent
537 50e590823220 more Chengsong parents: 536 diff changeset	167	Here $\textit{code}$ encodes a value into a bit-code by converting
50e590823220 more Chengsong parents: 536 diff changeset	168	$\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
50e590823220 more Chengsong parents: 536 diff changeset	169	star iteration by $S$. The border where a local star terminates
50e590823220 more Chengsong parents: 536 diff changeset	170	is marked by $Z$.
50e590823220 more Chengsong parents: 536 diff changeset	171	This coding is lossy, as it throws away the information about
532 cc54ce075db5 restructured Chengsong parents: diff changeset	172	characters, and also does not encode the ``boundary'' between two
cc54ce075db5 restructured Chengsong parents: diff changeset	173	sequence values. Moreover, with only the bitcode we cannot even tell
537 50e590823220 more Chengsong parents: 536 diff changeset	174	whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
532 cc54ce075db5 restructured Chengsong parents: diff changeset	175	reason for choosing this compact way of storing information is that the
cc54ce075db5 restructured Chengsong parents: diff changeset	176	relatively small size of bits can be easily manipulated and ``moved
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	177	around" in a regular expression.
537 50e590823220 more Chengsong parents: 536 diff changeset	178
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	179	Because of the lossiness, the process of decoding a bitlist requires additionally
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	180	a regular expression. The function $\decode$ is defined as:
537 50e590823220 more Chengsong parents: 536 diff changeset	181	We define the reverse operation of $\code$, which is $\decode$.
50e590823220 more Chengsong parents: 536 diff changeset	182	As expected, $\decode$ not only requires the bit-codes,
50e590823220 more Chengsong parents: 536 diff changeset	183	but also a regular expression to guide the decoding and
50e590823220 more Chengsong parents: 536 diff changeset	184	fill the gaps of characters:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	185
cc54ce075db5 restructured Chengsong parents: diff changeset	186
cc54ce075db5 restructured Chengsong parents: diff changeset	187	%\begin{definition}[Bitdecoding of Values]\mbox{}
cc54ce075db5 restructured Chengsong parents: diff changeset	188	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	189	\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
cc54ce075db5 restructured Chengsong parents: diff changeset	190	$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	191	$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
537 50e590823220 more Chengsong parents: 536 diff changeset	192	$\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
532 cc54ce075db5 restructured Chengsong parents: diff changeset	193	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
cc54ce075db5 restructured Chengsong parents: diff changeset	194	(\Left\,v, bs_1)$\\
537 50e590823220 more Chengsong parents: 536 diff changeset	195	$\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
532 cc54ce075db5 restructured Chengsong parents: diff changeset	196	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
cc54ce075db5 restructured Chengsong parents: diff changeset	197	(\Right\,v, bs_1)$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	198	$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	199	$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	200	& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	201	& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
537 50e590823220 more Chengsong parents: 536 diff changeset	202	$\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
50e590823220 more Chengsong parents: 536 diff changeset	203	$\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ &
532 cc54ce075db5 restructured Chengsong parents: diff changeset	204	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	205	& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	206	& & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
cc54ce075db5 restructured Chengsong parents: diff changeset	207
cc54ce075db5 restructured Chengsong parents: diff changeset	208	$\textit{decode}\,bs\,r$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	209	$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	210	& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
cc54ce075db5 restructured Chengsong parents: diff changeset	211	\textit{else}\;\textit{None}$
cc54ce075db5 restructured Chengsong parents: diff changeset	212	\end{tabular}
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	213	\end{center}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	214	%\end{definition}
cc54ce075db5 restructured Chengsong parents: diff changeset	215
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	216	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	217	The function $\decode'$ returns a pair consisting of a partially decoded value and some leftover:
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	218	$\decode'$ does most of the job while $\decode$ throws
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	219	away leftover bit-codes and returns the value only.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	220	$\decode$ is terminating as $\decode'$ is terminating.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	221	We have the property that $\decode$ and $\code$ are
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	222	reverse operations of one another:
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	223	\begin{lemma}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	224	\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	225	\end{lemma}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	226	\begin{proof}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	227	By proving a more general version of the lemma, on $\decode'$:
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	228	\[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	229	Then setting $ds$ to be $[]$ and unfolding $\decode$ definition
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	230	we get the lemma.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	231	\end{proof}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	232	With the $\code$ and $\decode$ functions in hand, we know how to
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	233	switch between bit-codes and value--the two different representations of
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	234	lexing information.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	235	The next step is to integrate this information into the working regular expression.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	236	Attaching bits to the front of regular expressions is the solution Sulzamann and Lu
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	237	gave for storing partial values on the fly:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	238
cc54ce075db5 restructured Chengsong parents: diff changeset	239	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	240	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	241	$\textit{a}$ & $::=$ & $\ZERO$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	242	& $\mid$ & $_{bs}\ONE$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	243	& $\mid$ & $_{bs}{\bf c}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	244	& $\mid$ & $_{bs}\sum\,as$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	245	& $\mid$ & $_{bs}a_1\cdot a_2$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	246	& $\mid$ & $_{bs}a^*$
cc54ce075db5 restructured Chengsong parents: diff changeset	247	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	248	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	249	%(in \textit{ALTS})
cc54ce075db5 restructured Chengsong parents: diff changeset	250
cc54ce075db5 restructured Chengsong parents: diff changeset	251	\noindent
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	252	We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	253	$bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular
532 cc54ce075db5 restructured Chengsong parents: diff changeset	254	expressions and $as$ for a list of annotated regular expressions.
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	255	The alternative constructor ($\sum$) has been generalised to
532 cc54ce075db5 restructured Chengsong parents: diff changeset	256	accept a list of annotated regular expressions rather than just 2.
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	257
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	258
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	259	The first thing we define related to bit-coded regular expressions
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	260	is how we move bits, for instance pasting it at the front of an annotated regular expression.
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	261	The operation $\fuse$ is just to attach bit-codes
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	262	to the front of an annotated regular expression:
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	263	\begin{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	264	\begin{tabular}{lcl}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	265	$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	266	$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	267	$_{bs @ bs'}\ONE$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	268	$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	269	$_{bs@bs'}{\bf c}$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	270	$\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	271	$_{bs@bs'}\sum\textit{as}$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	272	$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	273	$_{bs@bs'}a_1 \cdot a_2$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	274	$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	275	$_{bs @ bs'}a^*$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	276	\end{tabular}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	277	\end{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	278
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	279	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	280	With that we are able to define $\internalise$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	281
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	282	To do lexing using annotated regular expressions, we shall first
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	283	transform the usual (un-annotated) regular expressions into annotated
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	284	regular expressions. This operation is called \emph{internalisation} and
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	285	defined as follows:
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	286
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	287	%\begin{definition}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	288	\begin{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	289	\begin{tabular}{lcl}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	290	$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	291	$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	292	$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	293	$(r_1 + r_2)^\uparrow$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	294	$_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	295	\textit{fuse}\,[S]\,r_2^\uparrow]$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	296	$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	297	$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	298	$(r^*)^\uparrow$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	299	$_{[]}(r^\uparrow)^*$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	300	\end{tabular}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	301	\end{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	302	%\end{definition}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	303
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	304	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	305	We use an up arrow with postfix notation
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	306	to denote the operation,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	307	for convenience. The $\textit{internalise} \; r$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	308	notation is more cumbersome.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	309	The opposite of $\textit{internalise}$ is
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	310	$\erase$, where all the bit-codes are removed,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	311	and the alternative operator $\sum$ for annotated
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	312	regular expressions is transformed to the binary alternatives
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	313	for plain regular expressions.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	314	\begin{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	315	\begin{tabular}{lcr}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	316	$\erase \; \ZERO$ & $\dn$ & $\ZERO$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	317	$\erase \; _{bs}\ONE$ & $\dn$ & $\ONE$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	318	$\erase \; _{bs}\mathbf{c}$ & $\dn$ & $\mathbf{c}$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	319	$\erase \; _{bs} a_1 \cdot a_2$ & $\dn$ & $\erase \; r_1\cdot\erase\; r_2$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	320	$\erase \; _{bs} [] $ & $\dn$ & $\ZERO$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	321	$\erase \; _{bs} [a] $ & $\dn$ & $\erase \; a$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	322	$\erase \; _{bs} \sum [a_1, \; a_2]$ & $\dn$ & $\erase \; a_1 +\erase \; a_2$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	323	$\erase \; _{bs} \sum (a :: as)$ & $\dn$ & $\erase \; a + \erase \; _{[]} \sum as$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	324	$\erase \; _{bs} a^$ & $\dn$ & $(\erase \; a)^$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	325	\end{tabular}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	326	\end{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	327	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	328	We also abbreviate the $\erase\; a$ operation
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	329	as $a_\downarrow$, for conciseness.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	330	For bit-coded regular expressions, as a different datatype,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	331	testing whether they contain empty string in their lauguage requires
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	332	a dedicated function $\bnullable$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	333	which simply calls $\erase$ first before testing whether it is $\nullable$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	334	\begin{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	335	\begin{tabular}{lcr}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	336	$\bnullable \; a $ & $\dn$ & $\nullable \; (a_\downarrow)$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	337	\end{tabular}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	338	\end{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	339	The function for collecting the
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	340	bitcodes of a $\bnullable$ annotated regular expression
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	341	is a generalised version of the $\textit{mkeps}$ function
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	342	for annotated regular expressions, called $\textit{bmkeps}$:
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	343
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	344
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	345	%\begin{definition}[\textit{bmkeps}]\mbox{}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	346	\begin{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	347	\begin{tabular}{lcl}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	348	$\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	349	$\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	350	$\textit{if}\;\textit{bnullable}\,a$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	351	& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	352	& &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{[]}\sum \textit{as})$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	353	$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	354	$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	355	$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	356	$bs \,@\, [Z]$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	357	\end{tabular}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	358	\end{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	359	%\end{definition}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	360
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	361	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	362	This function completes the value information by travelling along the
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	363	path of the regular expression that corresponds to a POSIX value and
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	364	collecting all the bitcodes, and using $S$ to indicate the end of star
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	365	iterations.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	366
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	367	The most central question is how these partial lexing information
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	368	represented as bit-codes is augmented and carried around
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	369	during a derivative is taken.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	370	This is done by adding bitcodes to the
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	371	derivatives, for example when one more star iteratoin is taken (we
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	372	call the operation of derivatives on annotated regular expressions $\bder$
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	373	because it is derivatives on regular expressiones with \emph{b}itcodes),
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	374	we need to unfold it into a sequence,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	375	and attach an additional bit $Z$ to the front of $r \backslash c$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	376	to indicate one more star iteration.
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	377	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	378	\begin{tabular}{@{}lcl@{}}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	379	$\bder \; c\; (_{bs}a^*) $ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	380	$_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	381	(_{[]}a^*))$
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	382	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	383	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	384
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	385	\noindent
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	386	For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	387	there is no danger of confusion with derivatives on plain regular expressions,
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	388	for example, the above can be expressed as
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	389	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	390	\begin{tabular}{@{}lcl@{}}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	391	$(_{bs}a^*)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	392	$_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	393	(_{[]}a^*))$
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	394	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	395	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	396
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	397	\noindent
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	398	Using the picture we used earlier to depict this, the transformation when
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	399	taking a derivative w.r.t a star is like below:
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	400
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	401	\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	402	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	403	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	404	{$bs$
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	405	\nodepart{two} $a^*$ };
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	406	%\caption{term 1 \ref{term:1}'s matching configuration}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	407	\end{tikzpicture}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	408	&
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	409	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	410	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	411	{$v_{\text{previous iterations}}$
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	412	\nodepart{two} $a^*$};
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	413	%\caption{term 1 \ref{term:1}'s matching configuration}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	414	\end{tikzpicture}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	415	\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	416	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	417	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	418	{ $bs$ + [Z]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	419	\nodepart{two} $(a\backslash c )\cdot a^*$ };
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	420	%\caption{term 1 \ref{term:1}'s matching configuration}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	421	\end{tikzpicture}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	422	&
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	423	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	424	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	425	{$v_{\text{previous iterations}}$ + 1 more iteration
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	426	\nodepart{two} $(a\backslash c )\cdot a^*$ };
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	427	%\caption{term 1 \ref{term:1}'s matching configuration}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	428	\end{tikzpicture}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	429	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	430	\noindent
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	431	Another place in the $\bder$ function where it differs
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	432	from normal derivatives (on un-annotated regular expressions)
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	433	is the sequence case:
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	434	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	435	\begin{tabular}{@{}lcl@{}}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	436
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	437	$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	438	$\textit{if}\;\textit{bnullable}\,a_1$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	439	& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	440	& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	441	& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	442	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	443	\end{center}
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	444
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	445	The difference is that (when $a_1$ is $\bnullable$)
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	446	we use $\bmkeps$ to store the lexing information
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	447	in $a_1$ before collapsing it (as it has been fully matched by string prior to $c$,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	448	and attach the collected bit-codes to the front of $a_2$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	449	before throwing away $a_1$. We assume that $\bmkeps$ correctly extracts the bitcode for how $a_1$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	450	matches the string prior to $c$ (more on this later).
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	451	The bitsequence $\textit{bs}$ which was initially attached to the first element of the sequence
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	452	$a_1 \cdot a_2$, has now been elevated to the top level of teh $\sum$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	453	This is because this piece of information will be needed whichever way the sequence is matched,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	454	regardless of whether $c$ belongs to $a_1$ or $a_2$.
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	455
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	456	In the injection-based lexing, $r_1$ is immediately thrown away in
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	457	subsequent derivatives on the right branch (when $r_1$ is $\nullable$),
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	458	\begin{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	459	$(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	460	\end{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	461	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	462	as it knows $r_1$ is stored on stack and available once the recursive
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	463	call to later derivatives finish.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	464	Therefore, if the $\Right$ branch is taken in a $\POSIX$ match,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	465	we construct back the sequence value once step back by
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	466	calling a on $\mkeps(r_1)$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	467	\begin{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	468	\begin{tabular}{lcr}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	469	$\ldots r_1 \cdot r_2$ & $\rightarrow$ & $r_1\cdot r_2 + r_2 \backslash c \ldots $\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	470	$\ldots \Seq(v_1, v_2) (\Seq(\mkeps(r1), (\inj \; r_2 \; c\; v_{2c})))$ & $\leftarrow$ & $\Right(v_{2c})\ldots$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	471	\end{tabular}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	472	\end{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	473	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	474	The rest of the clauses of $\bder$ is rather similar to
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	475	$\der$, and is put together here as a wholesome definition
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	476	for $\bder$:
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	477	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	478	\begin{tabular}{@{}lcl@{}}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	479	$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	480	$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	481	$(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	482	$\textit{if}\;c=d\; \;\textit{then}\;
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	483	_{bs}\ONE\;\textit{else}\;\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	484	$(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	485	$_{bs}\sum\;(\textit{map} \; (\_\backslash c) \; as )$\\
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	486	$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	487	$\textit{if}\;\textit{bnullable}\,a_1$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	488	& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	489	& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	490	& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	491	$(_{bs}a^*)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	492	$_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	493	(_{[]}r^*))$
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	494	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	495	\end{center}
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	496	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	497	Generalising the derivative operation with bitcodes to strings, we have
532 cc54ce075db5 restructured Chengsong parents: diff changeset	498	\begin{center}
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	499	\begin{tabular}{@{}lcl@{}}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	500	$a\backslash_s [] $ & $\dn$ & $a$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	501	$a\backslash (c :: s) $ & $\dn$ & $(a \backslash c) \backslash_s s$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	502	\end{tabular}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	503	\end{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	504	As we did earlier, we omit the $s$ subscript at $\backslash_s$ when there is no danger
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	505	of confusion.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	506	Putting this all together, we obtain a lexer with bit-coded regular expressions
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	507	as its internal data structures, which we call $\blexer$:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	508
cc54ce075db5 restructured Chengsong parents: diff changeset	509	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	510	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	511	$\textit{blexer}\;r\,s$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	512	$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	513	& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	514	& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	515	& & $\;\;\textit{else}\;\textit{None}$
cc54ce075db5 restructured Chengsong parents: diff changeset	516	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	517	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	518
cc54ce075db5 restructured Chengsong parents: diff changeset	519	\noindent
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	520	Ausaf and Urban formally proved the correctness of the $\blexer$, namely
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	521	\begin{conjecture}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	522	$\blexer \;r \; s = \lexer \; r \; s$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	523	\end{conjecture}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	524	This was claimed but not formalised in Sulzmann and Lu's work.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	525	We introduce the proof later, after we give out all
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	526	the needed auxiliary functions and definitions
532 cc54ce075db5 restructured Chengsong parents: diff changeset	527
cc54ce075db5 restructured Chengsong parents: diff changeset	528
cc54ce075db5 restructured Chengsong parents: diff changeset	529	%-----------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	530	% SUBSECTION 1
cc54ce075db5 restructured Chengsong parents: diff changeset	531	%-----------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	532	\section{Specifications of Some Helper Functions}
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	533	The functions we introduce will give a more detailed glimpse into
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	534	the lexing process, which might not be possible
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	535	using $\lexer$ or $\blexer$ themselves.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	536	The first function we shall look at is $\retrieve$.
543 b2bea5968b89 thesis_thys Chengsong parents: 542 diff changeset	537	\subsection{$\textit{Retrieve}$}
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	538	Our bit-coded lexer "retrieve"s the bitcodes using $\bmkeps$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	539	after we finished doing all the derivatives:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	540	\begin{center}
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	541	\begin{tabular}{lcl}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	542	& & $\ldots$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	543	& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	544	& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	545	& & $\ldots$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	546	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	547	\end{center}
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	548	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	549	Recall that $\bmkeps$ looks for the leftmost branch of an alternative
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	550	and completes a star's iterations by attaching a $Z$ at the end of the bitcodes
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	551	extracted. It "retrieves" a sequence by visiting both children and then stitch together
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	552	two bitcodes using concatenation. After the entire tree structure of the regular
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	553	expression has been traversed using the above manner, we get a bitcode encoding the
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	554	lexing result.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	555	We know that this "retrieved" bitcode leads to the correct value after decoding,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	556	which is $v_0$ in the bird's eye view of the injection-based lexing diagram.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	557	Now assume we keep every other data structure in the diagram \ref{InjFigure},
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	558	and only replace all the plain regular expression by their annotated counterparts,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	559	computed during a $\blexer$ run.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	560	Then we obtain a diagram for the annotated regular expression derivatives and
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	561	their corresponding values, though the values are never calculated in $\blexer$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	562	We have that $a_n$ contains all the lexing result information.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	563	\vspace{20mm}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	564	\begin{center}%\label{graph:injLexer}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	565	\begin{tikzcd}[
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	566	every matrix/.append style = {name=p},
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	567	remember picture, overlay,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	568	]
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	569	a_0 \arrow[r, "\backslash c_0"] \arrow[d] & a_1 \arrow[r, "\backslash c_1"] \arrow[d] & a_2 \arrow[r, dashed] \arrow[d] & a_n \arrow[d] \\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	570	v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	571	\end{tikzcd}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	572	\begin{tikzpicture}[
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	573	remember picture, overlay,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	574	E/.style = {ellipse, draw=blue, dashed,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	575	inner xsep=4mm,inner ysep=-4mm, rotate=90, fit=#1}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	576	]
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	577	\node[E = (p-1-1) (p-2-1)] {};
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	578	\node[E = (p-1-4) (p-2-4)] {};
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	579	\end{tikzpicture}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	580
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	581	\end{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	582	\vspace{20mm}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	583	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	584	On the other hand, $v_0$ also encodes the correct lexing result, as we have proven for $\lexer$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	585	Encircled in the diagram are the two pairs $v_0, a_0$ and $v_n, a_n$, which both
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	586	encode the correct lexical result. Though for the leftmost pair, we have
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	587	the information condensed in $v_0$ the value part, whereas for the rightmost pair,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	588	the information is concentrated on $a_n$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	589	We know that in the intermediate steps the pairs $v_i, a_i$, must in some way encode the complete
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	590	lexing information as well. Therefore, we need a unified approach to extract such lexing result
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	591	from a value $v_i$ and its annotated regular expression $a_i$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	592	And the function $f$ must satisfy these requirements:
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	593	\begin{itemize}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	594	\item
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	595	$f \; a_i\;v_i = f \; a_n \; v_n = \decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	596	\item
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	597	$f \; a_i\;v_i = f \; a_0 \; v_0 = v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	598	\end{itemize}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	599	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	600	If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	601	The core of the function $f$ is something that produces the bitcodes
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	602	$\code \; v_0$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	603	It is unclear how, but Sulzmann and Lu came up with a function satisfying all the above
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	604	requirements, named \emph{retrieve}:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	605
cc54ce075db5 restructured Chengsong parents: diff changeset	606
cc54ce075db5 restructured Chengsong parents: diff changeset	607
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	608	\begin{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	609	\begin{tabular}{lcr}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	610	$\retrieve \; \, (_{bs} \ONE) \; \, (\Empty)$ & $\dn$ & $\textit{bs}$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	611	$\retrieve \; \, (_{bs} \mathbf{c} ) \; \Char(c)$ & $\dn$ & $ \textit{bs}$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	612	$\retrieve \; \, (_{bs} a_1 \cdot a_2) \; \Seq(v_1, v_2)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a_1\; v_1) @ (\retrieve \; a_2 \; v_2)$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	613	$\retrieve \; \, (_{bs} \Sigma (a :: \textit{as}) \; \,\Left(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v)$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	614	$\retrieve \; \, (_{bs} \Sigma (a :: \textit{as} \; \, \Right(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; (_{[]}\Sigma \textit{as}) \; v)$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	615	$\retrieve \; \, (_{bs} a^) \; \, (\Stars(v :: vs)) $ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v) @ (\retrieve \; (_{[]} a^) \; (\Stars(vs)))$\\
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	616	$\retrieve \; \, (_{bs} a^*) \; \, (\Stars([]) $ & $\dn$ & $\textit{bs} @ [Z]$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	617	\end{tabular}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	618	\end{center}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	619	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	620	As promised, $\retrieve$ collects the right bit-codes from the
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	621	final derivative $a_n$:
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	622	\begin{lemma}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	623	$\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	624	\end{lemma}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	625	\begin{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	626	By a routine induction on $a$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	627	\end{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	628	The design of $\retrieve$ enables extraction of bit-codes
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	629	from not only $\bnullable$ (annotated) regular expressions,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	630	but also those that are not $\bnullable$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	631	For example, if we have the regular expression just internalised
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	632	and the lexing result value, we could $\retrieve$ the bitcdoes
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	633	that look as if we have en$\code$-ed the value:
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	634	\begin{lemma}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	635	$\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	636	\end{lemma}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	637	\begin{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	638	By induction on $r$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	639	\end{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	640	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	641	The following property is more interesting, as
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	642	it provides a "bridge" between $a_0, v_0$ and $a_n, v_n$ in the
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	643	lexing diagram.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	644	If you take derivative of an annotated regular expression,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	645	you can $\retrieve$ the same bit-codes as before the derivative took place,
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	646	provided that you use the corresponding value:
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	647
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	648	\begin{lemma}\label{retrieveStepwise}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	649	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	650	\end{lemma}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	651	\begin{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	652	By induction on $r$, where $v$ is allowed to be arbitrary.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	653	The induction principle is function $\erase$'s cases.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	654	\end{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	655	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	656	$\retrieve$ is connected to the $\blexer$ in the following way:
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	657	\begin{lemma}\label{blexer_retrieve}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	658	$\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	659	\end{lemma}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	660	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	661	$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regular expressiones of $\blexer$.
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	662	For plain regular expressions something similar is required as well.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	663
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	664	\subsection{$\flex$}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	665	Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$,
532 cc54ce075db5 restructured Chengsong parents: diff changeset	666	defined as
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	667	\begin{center}
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	668	\begin{tabular}{lcr}
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	669	$\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	670	$\flex \; r \; f \; c :: s \; v$ & $=$ & $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	671	\end{tabular}
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	672	\end{center}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	673	which accumulates the characters that needs to be injected back,
cc54ce075db5 restructured Chengsong parents: diff changeset	674	and does the injection in a stack-like manner (last taken derivative first injected).
cc54ce075db5 restructured Chengsong parents: diff changeset	675	$\flex$ is connected to the $\lexer$:
cc54ce075db5 restructured Chengsong parents: diff changeset	676	\begin{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	677	$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
cc54ce075db5 restructured Chengsong parents: diff changeset	678	\end{lemma}
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	679	\begin{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	680	By reverse induction on $s$.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	681	\end{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	682	$\flex$ provides us a bridge between $\lexer$'s intermediate steps.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	683	What is even better about $\flex$ is that it allows us to
cc54ce075db5 restructured Chengsong parents: diff changeset	684	directly operate on the value $\mkeps (r\backslash v)$,
cc54ce075db5 restructured Chengsong parents: diff changeset	685	which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument.
cc54ce075db5 restructured Chengsong parents: diff changeset	686	When the value created by $\mkeps$ becomes available, one can
cc54ce075db5 restructured Chengsong parents: diff changeset	687	prove some stepwise properties of lexing nicely:
cc54ce075db5 restructured Chengsong parents: diff changeset	688	\begin{lemma}\label{flexStepwise}
cc54ce075db5 restructured Chengsong parents: diff changeset	689	$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
cc54ce075db5 restructured Chengsong parents: diff changeset	690	\end{lemma}
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	691	\begin{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	692	By induction on the shape of $r\backslash s$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	693	\end{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	694	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	695	With $\flex$ and $\retrieve$ ready, we are ready to connect $\lexer$ and $\blexer$ .
532 cc54ce075db5 restructured Chengsong parents: diff changeset	696
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	697	\subsection{Correctness Proof of Bit-coded Algorithm}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	698	\begin{lemma}\label{flex_retrieve}
cc54ce075db5 restructured Chengsong parents: diff changeset	699	$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
cc54ce075db5 restructured Chengsong parents: diff changeset	700	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	701	\begin{proof}
cc54ce075db5 restructured Chengsong parents: diff changeset	702	By induction on $s$. The induction tactic is reverse induction on strings.
cc54ce075db5 restructured Chengsong parents: diff changeset	703	$v$ is allowed to be arbitrary.
cc54ce075db5 restructured Chengsong parents: diff changeset	704	The crucial point is to rewrite
cc54ce075db5 restructured Chengsong parents: diff changeset	705	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	706	\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c])
cc54ce075db5 restructured Chengsong parents: diff changeset	707	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	708	as
cc54ce075db5 restructured Chengsong parents: diff changeset	709	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	710	\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash s@[c]))
cc54ce075db5 restructured Chengsong parents: diff changeset	711	\].
cc54ce075db5 restructured Chengsong parents: diff changeset	712	This enables us to equate
cc54ce075db5 restructured Chengsong parents: diff changeset	713	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	714	\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c])
cc54ce075db5 restructured Chengsong parents: diff changeset	715	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	716	with
cc54ce075db5 restructured Chengsong parents: diff changeset	717	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	718	\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))
cc54ce075db5 restructured Chengsong parents: diff changeset	719	\],
cc54ce075db5 restructured Chengsong parents: diff changeset	720	which in turn can be rewritten as
cc54ce075db5 restructured Chengsong parents: diff changeset	721	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	722	\flex \; r \; \textit{id} \; s@[c] \; (\mkeps (r\backslash s@[c]))
cc54ce075db5 restructured Chengsong parents: diff changeset	723	\].
cc54ce075db5 restructured Chengsong parents: diff changeset	724	\end{proof}
cc54ce075db5 restructured Chengsong parents: diff changeset	725
cc54ce075db5 restructured Chengsong parents: diff changeset	726	With the above lemma we can now link $\flex$ and $\blexer$.
cc54ce075db5 restructured Chengsong parents: diff changeset	727
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	728	%----------------------------------------------------------------------------------------
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	729	% SECTION correctness proof
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	730	%----------------------------------------------------------------------------------------
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	731	\section{Correctness of Bit-coded Algorithm (Without Simplification)}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	732	We now give the proof the correctness of the algorithm with bit-codes.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	733
532 cc54ce075db5 restructured Chengsong parents: diff changeset	734	\begin{lemma}\label{flex_blexer}
cc54ce075db5 restructured Chengsong parents: diff changeset	735	$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$
cc54ce075db5 restructured Chengsong parents: diff changeset	736	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	737	\begin{proof}
cc54ce075db5 restructured Chengsong parents: diff changeset	738	Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
cc54ce075db5 restructured Chengsong parents: diff changeset	739	\end{proof}
542 a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	740	Finally the correctness of $\blexer$ is given as it outputs the same result as $\lexer$:
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	741	\begin{theorem}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	742	$\blexer\; r \; s = \lexer \; r \; s$
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	743	\end{theorem}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	744	\begin{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	745	Straightforward corollary of \ref{flex_blexer}.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	746	\end{proof}
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	747	\noindent
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	748	Our main reason for wanting a bit-coded algorithm over
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	749	the injection-based one is for its capabilities of allowing
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	750	more aggressive simplifications.
a7344c9afbaf chapter3 finished Chengsong parents: 538 diff changeset	751	We will elaborate on this in the next chapter.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	752
cc54ce075db5 restructured Chengsong parents: diff changeset	753

author	Chengsong
	Wed, 13 Jul 2022 08:27:28 +0100
changeset 564	3cbcd7cda0a9
parent 543	b2bea5968b89
child 575	3178f0e948ac
permissions	-rwxr-xr-x