lexing: ChengsongTanPhdThesis/Chapters/Bitcoded1.tex@8016a2480704 (annotated)

532 cc54ce075db5 restructured Chengsong parents: diff changeset	1	% Chapter Template
cc54ce075db5 restructured Chengsong parents: diff changeset	2
cc54ce075db5 restructured Chengsong parents: diff changeset	3	% Main chapter title
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	4	\chapter{Bit-coded Algorithm of Sulzmann and Lu}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	5
cc54ce075db5 restructured Chengsong parents: diff changeset	6	\label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
cc54ce075db5 restructured Chengsong parents: diff changeset	7	%Then we illustrate how the algorithm without bitcodes falls short for such aggressive
cc54ce075db5 restructured Chengsong parents: diff changeset	8	%simplifications and therefore introduce our version of the bitcoded algorithm and
cc54ce075db5 restructured Chengsong parents: diff changeset	9	%its correctness proof in
cc54ce075db5 restructured Chengsong parents: diff changeset	10	%Chapter 3\ref{Chapter3}.
In this chapter, we present the bit-coded algorithm
of Sulzmann and Lu, which addresses the problem of
under-simplified regular expressions.
537 50e590823220 more Chengsong parents: 536 diff changeset	14	\section{Bit-coded Algorithm}
The lexer algorithm in Chapter~\ref{Inj}, as shown in Figure~\ref{InjFigure},
stores information from previous lexing steps
on a stack, in the form of regular expressions
and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
50e590823220 more Chengsong parents: 536 diff changeset	19	\begin{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	20	\begin{ceqn}
50e590823220 more Chengsong parents: 536 diff changeset	21	\begin{equation}%\label{graph:injLexer}
50e590823220 more Chengsong parents: 536 diff changeset	22	\begin{tikzcd}
50e590823220 more Chengsong parents: 536 diff changeset	23	r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
50e590823220 more Chengsong parents: 536 diff changeset	24	v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
50e590823220 more Chengsong parents: 536 diff changeset	25	\end{tikzcd}
50e590823220 more Chengsong parents: 536 diff changeset	26	\end{equation}
50e590823220 more Chengsong parents: 536 diff changeset	27	\end{ceqn}
\caption{Injection-based Lexing from Chapter~\ref{Inj}}\label{InjFigure}
50e590823220 more Chengsong parents: 536 diff changeset	29	\end{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	30	\noindent
Keeping all of this on a stack is inefficient and makes the lexer prone to stack overflow.
A natural question is whether we can record lexing
information on the fly, while still using regular expression
derivatives.
50e590823220 more Chengsong parents: 536 diff changeset	35
At any point during a lexing run, the current input position splits the input
into a sub-string that has already been consumed
and a sub-string that is yet to come.
We already know what has been matched so far, and this should be reflected in the value
and the regular expression at that step as well. But this is not the
case for the injection-based lexer, whose derivatives do not record what has been consumed.
Take the regex $(aa)^* \cdot bc$ matching the string $aabc$
as an example, and suppose we have just read the first two characters $aa$:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	44
cc54ce075db5 restructured Chengsong parents: diff changeset	45	\begin{center}
537 50e590823220 more Chengsong parents: 536 diff changeset	46	\begin{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	47	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	48	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
50e590823220 more Chengsong parents: 536 diff changeset	49	{Consumed: $aa$
50e590823220 more Chengsong parents: 536 diff changeset	50	\nodepart{two} Not Yet Reached: $bc$ };
50e590823220 more Chengsong parents: 536 diff changeset	51	%\caption{term 1 \ref{term:1}'s matching configuration}
50e590823220 more Chengsong parents: 536 diff changeset	52	\end{tikzpicture}
50e590823220 more Chengsong parents: 536 diff changeset	53	\caption{Partially matched String}
50e590823220 more Chengsong parents: 536 diff changeset	54	\end{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	55	\end{center}
50e590823220 more Chengsong parents: 536 diff changeset	56	%\caption{Input String}\label{StringPartial}
50e590823220 more Chengsong parents: 536 diff changeset	57	%\end{figure}
50e590823220 more Chengsong parents: 536 diff changeset	58
50e590823220 more Chengsong parents: 536 diff changeset	59	\noindent
50e590823220 more Chengsong parents: 536 diff changeset	60	We have the value that has already been partially calculated,
50e590823220 more Chengsong parents: 536 diff changeset	61	and the part that has yet to come:
50e590823220 more Chengsong parents: 536 diff changeset	62	\begin{center}
50e590823220 more Chengsong parents: 536 diff changeset	63	\begin{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	64	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	65	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
50e590823220 more Chengsong parents: 536 diff changeset	66	{$\Seq(\Stars[\Char(a), \Char(a)], ???)$
50e590823220 more Chengsong parents: 536 diff changeset	67	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
50e590823220 more Chengsong parents: 536 diff changeset	68	%\caption{term 1 \ref{term:1}'s matching configuration}
50e590823220 more Chengsong parents: 536 diff changeset	69	\end{tikzpicture}
50e590823220 more Chengsong parents: 536 diff changeset	70	\caption{Partially constructed Value}
50e590823220 more Chengsong parents: 536 diff changeset	71	\end{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	72	\end{center}
50e590823220 more Chengsong parents: 536 diff changeset	73
In the regular expression derivative (after simplification),
all we have is what is about to come:
50e590823220 more Chengsong parents: 536 diff changeset	76	\begin{center}
50e590823220 more Chengsong parents: 536 diff changeset	77	\begin{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	78	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	79	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
50e590823220 more Chengsong parents: 536 diff changeset	80	{$???$
50e590823220 more Chengsong parents: 536 diff changeset	81	\nodepart{two} To Come: $b c$};
50e590823220 more Chengsong parents: 536 diff changeset	82	%\caption{term 1 \ref{term:1}'s matching configuration}
50e590823220 more Chengsong parents: 536 diff changeset	83	\end{tikzpicture}
50e590823220 more Chengsong parents: 536 diff changeset	84	\caption{Derivative}
50e590823220 more Chengsong parents: 536 diff changeset	85	\end{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	86	\end{center}
50e590823220 more Chengsong parents: 536 diff changeset	87	\noindent
50e590823220 more Chengsong parents: 536 diff changeset	88	The previous part is missing.
50e590823220 more Chengsong parents: 536 diff changeset	89	How about keeping the partially constructed value
50e590823220 more Chengsong parents: 536 diff changeset	90	attached to the front of the regular expression?
50e590823220 more Chengsong parents: 536 diff changeset	91	\begin{center}
50e590823220 more Chengsong parents: 536 diff changeset	92	\begin{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	93	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	94	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
50e590823220 more Chengsong parents: 536 diff changeset	95	{$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
50e590823220 more Chengsong parents: 536 diff changeset	96	\nodepart{two} To Come: $b c$};
50e590823220 more Chengsong parents: 536 diff changeset	97	%\caption{term 1 \ref{term:1}'s matching configuration}
50e590823220 more Chengsong parents: 536 diff changeset	98	\end{tikzpicture}
50e590823220 more Chengsong parents: 536 diff changeset	99	\caption{Derivative}
50e590823220 more Chengsong parents: 536 diff changeset	100	\end{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	101	\end{center}
50e590823220 more Chengsong parents: 536 diff changeset	102	\noindent
If we do this kind of ``attachment''
and augment the attached partially
constructed value each time we take off a
character, we obtain:
50e590823220 more Chengsong parents: 536 diff changeset	107	\begin{center}
50e590823220 more Chengsong parents: 536 diff changeset	108	\begin{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	109	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	110	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
50e590823220 more Chengsong parents: 536 diff changeset	111	{$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
50e590823220 more Chengsong parents: 536 diff changeset	112	\nodepart{two} To Come: $c$};
50e590823220 more Chengsong parents: 536 diff changeset	113	\end{tikzpicture}\\
50e590823220 more Chengsong parents: 536 diff changeset	114	\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
50e590823220 more Chengsong parents: 536 diff changeset	115	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
50e590823220 more Chengsong parents: 536 diff changeset	116	{$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
50e590823220 more Chengsong parents: 536 diff changeset	117	\nodepart{two} EOF};
50e590823220 more Chengsong parents: 536 diff changeset	118	\end{tikzpicture}
50e590823220 more Chengsong parents: 536 diff changeset	119	\caption{After $\backslash b$ and $\backslash c$}
50e590823220 more Chengsong parents: 536 diff changeset	120	\end{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	121	\end{center}
50e590823220 more Chengsong parents: 536 diff changeset	122	\noindent
In the end, we can recover the value without a backward phase.
However, (partial) values are clumsy to attach to a regular expression, so
we use bit-codes to encode them instead.
50e590823220 more Chengsong parents: 536 diff changeset	126
50e590823220 more Chengsong parents: 536 diff changeset	127	Bits and bitcodes (lists of bits) are defined as:
50e590823220 more Chengsong parents: 536 diff changeset	128	\begin{envForCaption}
50e590823220 more Chengsong parents: 536 diff changeset	129	\begin{center}
50e590823220 more Chengsong parents: 536 diff changeset	130	$b ::= S \mid Z \qquad
532 cc54ce075db5 restructured Chengsong parents: diff changeset	131	bs ::= [] \mid b::bs
cc54ce075db5 restructured Chengsong parents: diff changeset	132	$
cc54ce075db5 restructured Chengsong parents: diff changeset	133	\end{center}
537 50e590823220 more Chengsong parents: 536 diff changeset	134	\caption{Bit-codes datatype}
50e590823220 more Chengsong parents: 536 diff changeset	135	\end{envForCaption}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	136
cc54ce075db5 restructured Chengsong parents: diff changeset	137	\noindent
We use $S$ and $Z$ rather than $1$ and $0$ to avoid
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	139	confusion with the regular expressions $\ZERO$ and $\ONE$.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	140	Bitcodes (or
532 cc54ce075db5 restructured Chengsong parents: diff changeset	141	bit-lists) can be used to encode values (or potentially incomplete values) in a
cc54ce075db5 restructured Chengsong parents: diff changeset	142	compact form. This can be straightforwardly seen in the following
cc54ce075db5 restructured Chengsong parents: diff changeset	143	coding function from values to bitcodes:
537 50e590823220 more Chengsong parents: 536 diff changeset	144	\begin{envForCaption}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	145	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	146	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	147	$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	148	$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
$\textit{code}(\Left\,v)$ & $\dn$ & $Z :: \textit{code}(v)$\\
$\textit{code}(\Right\,v)$ & $\dn$ & $S :: \textit{code}(v)$\\
$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $\textit{code}(v_1) \,@\, \textit{code}(v_2)$\\
$\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\
$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: \textit{code}(v) \;@\;
\textit{code}(\Stars\,vs)$
cc54ce075db5 restructured Chengsong parents: diff changeset	155	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	156	\end{center}
537 50e590823220 more Chengsong parents: 536 diff changeset	157	\caption{Coding Function for Values}
50e590823220 more Chengsong parents: 536 diff changeset	158	\end{envForCaption}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	159
cc54ce075db5 restructured Chengsong parents: diff changeset	160	\noindent
Here $\textit{code}$ encodes a value into a bit-code by converting
$\Left$ into $Z$ and $\Right$ into $S$, and by marking the start of each
star iteration with $S$. The point where a star stops iterating
is marked by $Z$.
This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that
bit-codes are small and can therefore be easily manipulated and ``moved
around'' in a regular expression.
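
\noindent
The coding function can be sketched in Python as follows. This is only an illustration under assumptions that are not part of the text: bits are represented by the strings \texttt{"S"} and \texttt{"Z"}, and the class names for values are hypothetical.

```python
from dataclasses import dataclass
from typing import List

Bits = List[str]  # each bit is "S" or "Z" (assumed representation)


# Values (hypothetical class names mirroring Empty, Char, Left, Right, Seq, Stars).
@dataclass(frozen=True)
class Empty:
    pass


@dataclass(frozen=True)
class Chr:
    c: str


@dataclass(frozen=True)
class Left:
    v: object


@dataclass(frozen=True)
class Right:
    v: object


@dataclass(frozen=True)
class Seq:
    v1: object
    v2: object


@dataclass(frozen=True)
class Stars:
    vs: tuple  # one value per star iteration


def code(v) -> Bits:
    """Encode a value as a bitcode, following the clauses of the coding function."""
    if isinstance(v, (Empty, Chr)):
        return []  # characters and the empty value carry no bits
    if isinstance(v, Left):
        return ["Z"] + code(v.v)
    if isinstance(v, Right):
        return ["S"] + code(v.v)
    if isinstance(v, Seq):
        return code(v.v1) + code(v.v2)  # no boundary marker between the parts
    if isinstance(v, Stars):
        if not v.vs:
            return ["Z"]  # the star stops iterating
        return ["S"] + code(v.vs[0]) + code(Stars(v.vs[1:]))
    raise TypeError(f"not a value: {v!r}")
```

For the value $\Stars\,[\Left\,(\Char\,a), \Right\,(\Char\,b)]$ this yields $[S, Z, S, S, Z]$. The second and third $S$ play different roles (a new star iteration and a $\Right$), which cannot be told apart from the bits alone.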
50e590823220 more Chengsong parents: 536 diff changeset	172
50e590823220 more Chengsong parents: 536 diff changeset	173
We define the reverse operation of $\code$, which we call $\decode$.
As expected, $\decode$ requires not only the bit-codes,
but also a regular expression to guide the decoding and
fill in the missing characters:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	178
cc54ce075db5 restructured Chengsong parents: diff changeset	179
cc54ce075db5 restructured Chengsong parents: diff changeset	180	%\begin{definition}[Bitdecoding of Values]\mbox{}
537 50e590823220 more Chengsong parents: 536 diff changeset	181	\begin{envForCaption}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	182	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	183	\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
cc54ce075db5 restructured Chengsong parents: diff changeset	184	$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	185	$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
537 50e590823220 more Chengsong parents: 536 diff changeset	186	$\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
532 cc54ce075db5 restructured Chengsong parents: diff changeset	187	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
cc54ce075db5 restructured Chengsong parents: diff changeset	188	(\Left\,v, bs_1)$\\
537 50e590823220 more Chengsong parents: 536 diff changeset	189	$\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
532 cc54ce075db5 restructured Chengsong parents: diff changeset	190	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
cc54ce075db5 restructured Chengsong parents: diff changeset	191	(\Right\,v, bs_1)$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	192	$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	193	$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	194	& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	195	& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
537 50e590823220 more Chengsong parents: 536 diff changeset	196	$\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
50e590823220 more Chengsong parents: 536 diff changeset	197	$\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ &
532 cc54ce075db5 restructured Chengsong parents: diff changeset	198	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	199	& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	200	& & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
cc54ce075db5 restructured Chengsong parents: diff changeset	201
cc54ce075db5 restructured Chengsong parents: diff changeset	202	$\textit{decode}\,bs\,r$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	203	$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	204	& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
cc54ce075db5 restructured Chengsong parents: diff changeset	205	\textit{else}\;\textit{None}$
cc54ce075db5 restructured Chengsong parents: diff changeset	206	\end{tabular}
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	207	\end{center}
537 50e590823220 more Chengsong parents: 536 diff changeset	208	\end{envForCaption}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	209	%\end{definition}
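
\noindent
The two decoding functions can be sketched in Python as follows. This is an illustration under assumptions that are not part of the text: bits are the strings \texttt{"S"} and \texttt{"Z"}, all class names are hypothetical, and the cases that the definition of $\decode'$ leaves open raise a \texttt{ValueError}.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Bits = List[str]  # each bit is "S" or "Z" (assumed representation)


# Plain regular expressions (hypothetical class names).
@dataclass(frozen=True)
class One:
    pass

@dataclass(frozen=True)
class RChar:
    c: str

@dataclass(frozen=True)
class Alt:
    r1: object
    r2: object

@dataclass(frozen=True)
class Cat:  # the sequence r1 . r2
    r1: object
    r2: object

@dataclass(frozen=True)
class Star:
    r: object


# Values (hypothetical class names).
@dataclass(frozen=True)
class Empty:
    pass

@dataclass(frozen=True)
class Chr:
    c: str

@dataclass(frozen=True)
class Left:
    v: object

@dataclass(frozen=True)
class Right:
    v: object

@dataclass(frozen=True)
class Seq:
    v1: object
    v2: object

@dataclass(frozen=True)
class Stars:
    vs: tuple  # one value per star iteration


def decode_aux(bs: Bits, r) -> Tuple[object, Bits]:
    """Counterpart of decode': returns the decoded value and the unused bits."""
    if isinstance(r, One):
        return Empty(), bs
    if isinstance(r, RChar):
        return Chr(r.c), bs
    if isinstance(r, Alt) and bs and bs[0] == "Z":
        v, rest = decode_aux(bs[1:], r.r1)
        return Left(v), rest
    if isinstance(r, Alt) and bs and bs[0] == "S":
        v, rest = decode_aux(bs[1:], r.r2)
        return Right(v), rest
    if isinstance(r, Cat):
        v1, bs1 = decode_aux(bs, r.r1)
        v2, bs2 = decode_aux(bs1, r.r2)
        return Seq(v1, v2), bs2
    if isinstance(r, Star) and bs and bs[0] == "Z":
        return Stars(()), bs[1:]
    if isinstance(r, Star) and bs and bs[0] == "S":
        v, bs1 = decode_aux(bs[1:], r.r)
        more, bs2 = decode_aux(bs1, r)  # decoding against a star gives a Stars value
        return Stars((v,) + more.vs), bs2
    # Cases the definition leaves open (assumption: signal an error).
    raise ValueError(f"cannot decode {bs} against {r}")


def decode(bs: Bits, r) -> Optional[object]:
    """Return the value if all bits are used up, otherwise None."""
    v, rest = decode_aux(bs, r)
    return v if not rest else None
```

Decoding $[S, Z, S, S, Z]$ against $(a + b)^*$ gives $\Stars\,[\Left\,(\Char\,a), \Right\,(\Char\,b)]$, whereas $[Z, Z]$ leaves one bit unused and therefore gives \textit{None}.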
cc54ce075db5 restructured Chengsong parents: diff changeset	210
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	211	\noindent
$\decode'$ does most of the work, while $\decode$ checks that
no bits are left over and, if so, returns the value only.
$\decode$ terminates because $\decode'$ terminates.
The functions $\decode$ and $\code$ are
inverse to each other in the following sense:
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	217	\begin{lemma}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	218	\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \]
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	219	\end{lemma}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	220	\begin{proof}
By proving a more general version of the lemma for $\decode'$:
\[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
Setting $ds$ to $[]$ and unfolding the definition of $\decode$
then gives the lemma.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	225	\end{proof}
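
\noindent
To see why the extra suffix $ds$ is needed, consider the sequence case.
The following sketch assumes that the general statement is proved by induction
on the derivation of $\vdash v : r$ with $ds$ arbitrary, which is not spelled out in the proof above:
\begin{center}
\begin{tabular}{lcl}
$(\code\,(\Seq\,v_1\,v_2)) \,@\, ds$ & $=$ & $(\code\,v_1) \,@\, ((\code\,v_2) \,@\, ds)$\\
$\decode'\,((\code\,v_1) \,@\, ((\code\,v_2) \,@\, ds))\,r_1$ & $=$ & $(v_1,\; (\code\,v_2) \,@\, ds)$\\
$\decode'\,((\code\,v_2) \,@\, ds)\,r_2$ & $=$ & $(v_2,\; ds)$\\
$\decode'\,((\code\,(\Seq\,v_1\,v_2)) \,@\, ds)\,(r_1\cdot r_2)$ & $=$ & $(\Seq\,v_1\,v_2,\; ds)$
\end{tabular}
\end{center}

\noindent
The second and third lines are the induction hypotheses for $v_1$ and $v_2$.
The hypothesis for $v_1$ is used with the suffix $(\code\,v_2) \,@\, ds$, which would not be
possible if the statement only covered the empty suffix $[]$.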
With the $\code$ and $\decode$ functions in hand, we know how to
switch between bit-codes and values, the two different representations of
lexing information.
The next step is to integrate this information into the working regular expression.
Attaching bits to the front of regular expressions is the solution Sulzmann and Lu
gave for storing partial values on the fly:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	232
cc54ce075db5 restructured Chengsong parents: diff changeset	233	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	234	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	235	$\textit{a}$ & $::=$ & $\ZERO$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	236	& $\mid$ & $_{bs}\ONE$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	237	& $\mid$ & $_{bs}{\bf c}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	238	& $\mid$ & $_{bs}\sum\,as$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	239	& $\mid$ & $_{bs}a_1\cdot a_2$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	240	& $\mid$ & $_{bs}a^*$
cc54ce075db5 restructured Chengsong parents: diff changeset	241	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	242	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	243	%(in \textit{ALTS})
cc54ce075db5 restructured Chengsong parents: diff changeset	244
cc54ce075db5 restructured Chengsong parents: diff changeset	245	\noindent
We call these regular expressions carrying bit-codes \emph{annotated regular expressions}.
Here $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
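
\noindent
As an illustration, the datatype can be sketched in Python as follows, together with the $\textit{fuse}$ operation that is defined later in this section. The class names and the representation of bits as the strings \texttt{"S"} and \texttt{"Z"} are assumptions. The sketch also assumes that $\textit{fuse}$ treats alternatives, sequences and stars in the same way as $\ONE$ and characters, by prepending to the outermost bit-code.

```python
from dataclasses import dataclass, replace
from typing import Tuple

Bits = Tuple[str, ...]  # each bit is "S" or "Z" (assumed representation)


# Annotated regular expressions (hypothetical class names).
@dataclass(frozen=True)
class AZero:  # ZERO is the only constructor without a bit-code
    pass


@dataclass(frozen=True)
class AOne:
    bs: Bits


@dataclass(frozen=True)
class AChar:
    bs: Bits
    c: str


@dataclass(frozen=True)
class AAlts:
    bs: Bits
    alts: tuple  # the list `as` of annotated regular expressions


@dataclass(frozen=True)
class ASeq:
    bs: Bits
    a1: object
    a2: object


@dataclass(frozen=True)
class AStar:
    bs: Bits
    a: object


def fuse(bs: Bits, a):
    """Prepend bs to the outermost bit-code of a; ZERO is left unchanged."""
    if isinstance(a, AZero):
        return a
    # Assumption: every other constructor just gets bs @ bs' at the top level.
    return replace(a, bs=bs + a.bs)
```

For example, fusing $[S]$ onto $_{[]}(_{[]}a)^*$ changes only the outermost annotation and leaves the inner $_{[]}a$ untouched.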
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	251	%We will show that these bitcodes encode information about
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	252	%the ($\POSIX$) value that should be generated by the Sulzmann and Lu
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	253	%algorithm.
The central question is how this partial lexing information,
represented as bit-codes, is augmented and carried around
when a derivative is taken.
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	257
This is done by adding bitcodes to the
derivatives, for example when one more star iteration is taken. We
call the derivative operation on annotated regular expressions $\bder$,
because it takes derivatives of regexes with bitcodes:
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	262	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	263	\begin{tabular}{@{}lcl@{}}
$\bder \; c\; (_{bs}a^*) $ & $\dn$ &
$_{bs}((\textit{fuse}\, [S] \; (\bder \; c \; a))\cdot
(_{[]}a^*))$
\end{tabular}
\end{center}

\noindent
Most of the time we use the infix notation $\backslash$ for $\bder$ for brevity, when
there is no danger of confusion with derivatives on plain regular expressions.
For example, the above can be written as
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
$_{bs}((\textit{fuse}\, [S] \; (a\,\backslash c))\cdot
(_{[]}a^*))$
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	279	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	280	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	281
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	282
Using the earlier picture, the transformation when
taking the derivative of a star looks as follows:
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$bs$
\nodepart{two} $a^*$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
&
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$v_{\text{previous iterations}}$
\nodepart{two} $a^*$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\\
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$bs \,@\, [S]$
\nodepart{two} $(a\backslash c )\cdot a^*$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
&
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$v_{\text{previous iterations}}$ + 1 more iteration
\nodepart{two} $(a\backslash c )\cdot a^*$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\end{tabular}
\end{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	315	\noindent
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	316	The operation $\fuse$ simply attaches bit-codes
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	317	to the front of an annotated regular expression:
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	318	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	319	\begin{tabular}{lcl}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	320	$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	321	$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	322	$_{bs @ bs'}\ONE$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	323	$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	324	$_{bs@bs'}{\bf c}$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	325	$\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	326	$_{bs@bs'}\sum\textit{as}$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	327	$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	328	$_{bs@bs'}a_1 \cdot a_2$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	329	$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	330	$_{bs @ bs'}a^*$
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	331	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	332	\end{center}
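
\noindent
As a small worked illustration (the concrete bit-lists below are chosen
purely for the example), fusing bits onto a character, a sum and $\ZERO$ gives
\begin{center}
\begin{tabular}{lcl}
$\textit{fuse}\;[Z]\;(_{[S]}{\bf c})$ & $=$ & $_{[Z,S]}{\bf c}$\\
$\textit{fuse}\;[Z,S]\;(_{[]}\sum[_{[Z]}{\bf a},\,_{[S]}{\bf b}])$ & $=$ &
$_{[Z,S]}\sum[_{[Z]}{\bf a},\,_{[S]}{\bf b}]$\\
$\textit{fuse}\;[Z]\;\ZERO$ & $=$ & $\ZERO$
\end{tabular}
\end{center}
\noindent
Only the outermost annotation changes; the bits stored on sub-expressions
are left untouched.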
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	333
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	334	\noindent
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	335	Another place in the $\bder$ function where it differs
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	336	from normal derivatives on un-annotated regular expressions
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	337	is the sequence case:
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	338	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	339	\begin{tabular}{@{}lcl@{}}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	340
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	341	$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	342	$\textit{if}\;\textit{bnullable}\,a_1$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	343	& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	344	& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	345	& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	346	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	347	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	348	The complete definition of the derivative on annotated regular expressions is as follows:
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	349
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	350
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	351	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	352	\begin{tabular}{@{}lcl@{}}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	353	$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	354	$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	355	$(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	356	$\textit{if}\;c=d\; \;\textit{then}\;
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	357	_{bs}\ONE\;\textit{else}\;\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	358	$(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	359	$_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	360	$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	361	$\textit{if}\;\textit{bnullable}\,a_1$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	362	& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	363	& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	364	& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	365	$(_{bs}a^*)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	366	$_{bs}(\textit{fuse}\, [Z] \; (a\,\backslash c))\cdot
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	367	(_{[]}a^*)$
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	368	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	369	\end{center}
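
\noindent
To make the clauses concrete, here is a small worked instance (the
annotated expressions are illustrative choices, not taken from the later
development). Taking the derivative of an annotated alternative and of an
annotated star with respect to $a$ gives
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(_{[]}\sum[_{[Z]}{\bf a},\,_{[S]}{\bf b}])\,\backslash a$ & $=$ &
$_{[]}\sum[_{[Z]}\ONE,\,\ZERO]$\\
$(_{[]}(_{[]}{\bf a})^*)\,\backslash a$ & $=$ &
$_{[]}(_{[Z]}\ONE)\cdot(_{[]}(_{[]}{\bf a})^*)$
\end{tabular}
\end{center}
\noindent
In the first line the bits $[Z]$ survive on the branch that matched. In the
second line the star is unfolded once, and $[Z]$ is fused onto the derivative
of its body, which is $_{[]}\ONE$ before fusing.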
532 cc54ce075db5 restructured Chengsong parents: diff changeset	370
cc54ce075db5 restructured Chengsong parents: diff changeset	371
cc54ce075db5 restructured Chengsong parents: diff changeset	372	To do lexing using annotated regular expressions, we shall first
cc54ce075db5 restructured Chengsong parents: diff changeset	373	transform the usual (un-annotated) regular expressions into annotated
cc54ce075db5 restructured Chengsong parents: diff changeset	374	regular expressions. This operation is called \emph{internalisation} and
cc54ce075db5 restructured Chengsong parents: diff changeset	375	is defined as follows:
cc54ce075db5 restructured Chengsong parents: diff changeset	376
cc54ce075db5 restructured Chengsong parents: diff changeset	377	%\begin{definition}
cc54ce075db5 restructured Chengsong parents: diff changeset	378	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	379	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	380	$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	381	$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	382	$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	383	$(r_1 + r_2)^\uparrow$ & $\dn$ &
537 50e590823220 more Chengsong parents: 536 diff changeset	384	$_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\,
50e590823220 more Chengsong parents: 536 diff changeset	385	\textit{fuse}\,[S]\,r_2^\uparrow]$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	386	$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	387	$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	388	$(r^*)^\uparrow$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	389	$_{[]}(r^\uparrow)^*$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	390	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	391	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	392	%\end{definition}
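
\noindent
As a brief worked illustration (the regular expressions are chosen only
for the example), the clauses above unfold as follows:
\begin{center}
\begin{tabular}{lcl}
$(a + b)^\uparrow$ & $=$ & $_{[]}\sum[_{[Z]}{\bf a},\,_{[S]}{\bf b}]$\\
$((a + b)\cdot c)^\uparrow$ & $=$ &
$_{[]}(_{[]}\sum[_{[Z]}{\bf a},\,_{[S]}{\bf b}])\cdot(_{[]}{\bf c})$\\
$(a^*)^\uparrow$ & $=$ & $_{[]}(_{[]}{\bf a})^*$
\end{tabular}
\end{center}
\noindent
Only the alternative introduces bits at this stage: $[Z]$ marks the left
branch and $[S]$ the right branch.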
cc54ce075db5 restructured Chengsong parents: diff changeset	393
cc54ce075db5 restructured Chengsong parents: diff changeset	394	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	395	We use up arrows here to indicate that the basic un-annotated regular
cc54ce075db5 restructured Chengsong parents: diff changeset	396	expressions are ``lifted up'' into something slightly more complex. In the
cc54ce075db5 restructured Chengsong parents: diff changeset	397	fourth clause, $\textit{fuse}$ is the auxiliary function defined
cc54ce075db5 restructured Chengsong parents: diff changeset	398	earlier in this section, which attaches bits to the front of an
cc54ce075db5 restructured Chengsong parents: diff changeset	399	annotated regular expression.
cc54ce075db5 restructured Chengsong parents: diff changeset	400
538 8016a2480704 intro and chap2 Chengsong parents: 537 diff changeset	401
532 cc54ce075db5 restructured Chengsong parents: diff changeset	402
cc54ce075db5 restructured Chengsong parents: diff changeset	403	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	404	After internalising the regular expression, we perform successive
cc54ce075db5 restructured Chengsong parents: diff changeset	405	derivative operations on the annotated regular expressions. This
cc54ce075db5 restructured Chengsong parents: diff changeset	406	derivative operation is the same as what we had previously for the
cc54ce075db5 restructured Chengsong parents: diff changeset	407	basic regular expressions, except that we need to take care of
cc54ce075db5 restructured Chengsong parents: diff changeset	408	the bitcodes, as in the clauses given above.
cc54ce075db5 restructured Chengsong parents: diff changeset	409
cc54ce075db5 restructured Chengsong parents: diff changeset	410
cc54ce075db5 restructured Chengsong parents: diff changeset	411
cc54ce075db5 restructured Chengsong parents: diff changeset	412
cc54ce075db5 restructured Chengsong parents: diff changeset	413
cc54ce075db5 restructured Chengsong parents: diff changeset	414	%\end{definition}
cc54ce075db5 restructured Chengsong parents: diff changeset	415	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	416	For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
cc54ce075db5 restructured Chengsong parents: diff changeset	417	we need to unfold it into a sequence,
537 50e590823220 more Chengsong parents: 536 diff changeset	418	and attach an additional bit $Z$ to the front of $a \backslash c$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	419	to indicate one more star iteration. The sequence clause
cc54ce075db5 restructured Chengsong parents: diff changeset	420	is more subtle when $a_1$ is $\textit{bnullable}$ (here
cc54ce075db5 restructured Chengsong parents: diff changeset	421	\textit{bnullable} is exactly the same as $\textit{nullable}$, except
cc54ce075db5 restructured Chengsong parents: diff changeset	422	that it is for annotated regular expressions, therefore we omit the
cc54ce075db5 restructured Chengsong parents: diff changeset	423	definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
cc54ce075db5 restructured Chengsong parents: diff changeset	424	$a_1$ matches the string prior to character $c$ (more on this later).
cc54ce075db5 restructured Chengsong parents: diff changeset	425	Then the right branch of the alternative, which is
cc54ce075db5 restructured Chengsong parents: diff changeset	426	$\textit{fuse} \; (\textit{bmkeps} \; a_1) \; (a_2 \backslash c)$, will collapse the regular expression $a_1$ (as it has
cc54ce075db5 restructured Chengsong parents: diff changeset	427	already been fully matched) and store the parsing information at the
cc54ce075db5 restructured Chengsong parents: diff changeset	428	head of the regular expression $a_2 \backslash c$ by fusing to it. The
cc54ce075db5 restructured Chengsong parents: diff changeset	429	bit-sequence $\textit{bs}$, which was initially attached to the
cc54ce075db5 restructured Chengsong parents: diff changeset	430	sequence $a_1 \cdot a_2$, has
cc54ce075db5 restructured Chengsong parents: diff changeset	431	now been elevated to the top level of $\sum$, as this information will be
cc54ce075db5 restructured Chengsong parents: diff changeset	432	needed whichever way the sequence is matched, no matter whether $c$ belongs
cc54ce075db5 restructured Chengsong parents: diff changeset	433	to $a_1$ or $a_2$. After building these derivatives and maintaining all
cc54ce075db5 restructured Chengsong parents: diff changeset	434	the lexing information, we complete the lexing by collecting the
cc54ce075db5 restructured Chengsong parents: diff changeset	435	bitcodes using a generalised version of the $\textit{mkeps}$ function
cc54ce075db5 restructured Chengsong parents: diff changeset	436	for annotated regular expressions, called $\textit{bmkeps}$:
cc54ce075db5 restructured Chengsong parents: diff changeset	437
cc54ce075db5 restructured Chengsong parents: diff changeset	438
cc54ce075db5 restructured Chengsong parents: diff changeset	439	%\begin{definition}[\textit{bmkeps}]\mbox{}
cc54ce075db5 restructured Chengsong parents: diff changeset	440	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	441	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	442	$\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	443	$\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	444	$\textit{if}\;\textit{bnullable}\,a$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	445	& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	446	& &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{[]}\sum \textit{as})$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	447	$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	448	$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	449	$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
537 50e590823220 more Chengsong parents: 536 diff changeset	450	$bs \,@\, [S]$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	451	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	452	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	453	%\end{definition}
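
\noindent
Two small worked instances may help (the annotated expressions are
illustrative choices). For a sum whose first element is nullable, and for a
sequence of two nullable expressions, the clauses give
\begin{center}
\begin{tabular}{lcl}
$\textit{bmkeps}\,(_{[]}\sum[_{[Z]}\ONE,\,\ZERO])$ & $=$ &
$[]\,@\,\textit{bmkeps}\,(_{[Z]}\ONE) \;=\; [Z]$\\
$\textit{bmkeps}\,(_{[S]}(_{[Z]}\ONE)\cdot(_{[]}\ONE))$ & $=$ &
$[S]\,@\,[Z]\,@\,[] \;=\; [S,Z]$
\end{tabular}
\end{center}
\noindent
In both cases the bits are read off from the outside in, following the
nullable path through the expression.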
cc54ce075db5 restructured Chengsong parents: diff changeset	454
cc54ce075db5 restructured Chengsong parents: diff changeset	455	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	456	This function completes the value information by travelling along the
cc54ce075db5 restructured Chengsong parents: diff changeset	457	path of the regular expression that corresponds to a POSIX value and
cc54ce075db5 restructured Chengsong parents: diff changeset	458	collecting all the bitcodes, and using $S$ to indicate the end of star
cc54ce075db5 restructured Chengsong parents: diff changeset	459	iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
cc54ce075db5 restructured Chengsong parents: diff changeset	460	decode them, we get the value we expect. The corresponding lexing
cc54ce075db5 restructured Chengsong parents: diff changeset	461	algorithm looks as follows:
cc54ce075db5 restructured Chengsong parents: diff changeset	462
cc54ce075db5 restructured Chengsong parents: diff changeset	463	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	464	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	465	$\textit{blexer}\;r\,s$ & $\dn$ &
cc54ce075db5 restructured Chengsong parents: diff changeset	466	$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	467	& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	468	& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	469	& & $\;\;\textit{else}\;\textit{None}$
cc54ce075db5 restructured Chengsong parents: diff changeset	470	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	471	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	472
cc54ce075db5 restructured Chengsong parents: diff changeset	473	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	474	In this definition $\_\backslash s$ is the generalisation of the derivative
cc54ce075db5 restructured Chengsong parents: diff changeset	475	operation from characters to strings (just like the derivatives for un-annotated
cc54ce075db5 restructured Chengsong parents: diff changeset	476	regular expressions).
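
\noindent
As a hedged end-to-end sketch, consider $r = a + b$ and the one-character
string $s = a$ (both chosen only for illustration). The steps of
$\textit{blexer}$ are then
\begin{center}
\begin{tabular}{lcl}
$r^\uparrow$ & $=$ & $_{[]}\sum[_{[Z]}{\bf a},\,_{[S]}{\bf b}]$\\
$(r^\uparrow)\backslash a$ & $=$ & $_{[]}\sum[_{[Z]}\ONE,\,\ZERO]$\\
$\textit{bmkeps}\,((r^\uparrow)\backslash a)$ & $=$ & $[Z]$
\end{tabular}
\end{center}
\noindent
The derivative is $\textit{bnullable}$, so the lexer succeeds and hands
$[Z]$ to $\textit{decode}$ together with $r$. Assuming $\textit{decode}$
reads $Z$ as the choice of the left alternative, in line with the
internalisation clause for $r_1 + r_2$, the result is the value for
matching $a$ with the left branch of $a + b$.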
cc54ce075db5 restructured Chengsong parents: diff changeset	477
cc54ce075db5 restructured Chengsong parents: diff changeset	478
cc54ce075db5 restructured Chengsong parents: diff changeset	479
cc54ce075db5 restructured Chengsong parents: diff changeset	480	%-----------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	481	% SUBSECTION 1
cc54ce075db5 restructured Chengsong parents: diff changeset	482	%-----------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	483	\section{Specifications of Some Helper Functions}
cc54ce075db5 restructured Chengsong parents: diff changeset	484	Here we give the definitions of some functions
cc54ce075db5 restructured Chengsong parents: diff changeset	485	that we will use later.
cc54ce075db5 restructured Chengsong parents: diff changeset	486	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	487	\begin{tabular}{ccc}
cc54ce075db5 restructured Chengsong parents: diff changeset	488	$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$
cc54ce075db5 restructured Chengsong parents: diff changeset	489	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	490	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	491
cc54ce075db5 restructured Chengsong parents: diff changeset	492
cc54ce075db5 restructured Chengsong parents: diff changeset	493	%----------------------------------------------------------------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	494	% SECTION correctness proof
cc54ce075db5 restructured Chengsong parents: diff changeset	495	%----------------------------------------------------------------------------------------
cc54ce075db5 restructured Chengsong parents: diff changeset	496	\section{Correctness of Bit-coded Algorithm (Without Simplification)}
cc54ce075db5 restructured Chengsong parents: diff changeset	497	We now give the proof of the correctness of the algorithm with bit-codes.
cc54ce075db5 restructured Chengsong parents: diff changeset	498
cc54ce075db5 restructured Chengsong parents: diff changeset	499	Ausaf and Urban cleverly introduced an auxiliary function called $\flex$,
cc54ce075db5 restructured Chengsong parents: diff changeset	500	defined as
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	501	\begin{center}
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	502	\begin{tabular}{lcr}
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	503	$\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	504	$\flex \; r \; f \; (c :: s) \; v$ & $=$ & $\flex \; (r\backslash c) \; (\lambda v. \, f (\inj \; r\; c\; v))\; s \; v$
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	505	\end{tabular}
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	506	\end{center}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	507	which accumulates the characters that need to be injected back,
cc54ce075db5 restructured Chengsong parents: diff changeset	508	and does the injection in a stack-like manner (last taken derivative first injected).
cc54ce075db5 restructured Chengsong parents: diff changeset	509	$\flex$ is connected to the $\lexer$:
cc54ce075db5 restructured Chengsong parents: diff changeset	510	\begin{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	511	$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
cc54ce075db5 restructured Chengsong parents: diff changeset	512	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	513	$\flex$ provides us with a bridge between $\lexer$ and $\blexer$.
cc54ce075db5 restructured Chengsong parents: diff changeset	514	What is even better about $\flex$ is that it allows us to
cc54ce075db5 restructured Chengsong parents: diff changeset	515	directly operate on the value $\mkeps (r\backslash s)$,
cc54ce075db5 restructured Chengsong parents: diff changeset	516	which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument.
cc54ce075db5 restructured Chengsong parents: diff changeset	517	When the value created by $\mkeps$ becomes available, one can
cc54ce075db5 restructured Chengsong parents: diff changeset	518	prove some stepwise properties of lexing nicely:
cc54ce075db5 restructured Chengsong parents: diff changeset	519	\begin{lemma}\label{flexStepwise}
cc54ce075db5 restructured Chengsong parents: diff changeset	520	$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
cc54ce075db5 restructured Chengsong parents: diff changeset	521	\end{lemma}
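
\noindent
To see the stack-like behaviour concretely, Lemma~\ref{flexStepwise} can be
applied twice to a two-character string $[c_1, c_2]$ (the names $c_1$ and
$c_2$ are introduced here only for illustration), followed by the base
clause of $\flex$:
\begin{center}
\begin{tabular}{lcl}
$\flex \; r \; \textit{id} \; [c_1, c_2] \; v$
& $=$ & $\flex \; r \; \textit{id} \; [c_1] \; (\inj \; (r\backslash c_1) \; c_2 \; v)$\\
& $=$ & $\flex \; r \; \textit{id} \; [] \; (\inj \; r \; c_1 \; (\inj \; (r\backslash c_1) \; c_2 \; v))$\\
& $=$ & $\inj \; r \; c_1 \; (\inj \; (r\backslash c_1) \; c_2 \; v)$
\end{tabular}
\end{center}
\noindent
The character $c_2$, whose derivative was taken last, is injected first.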
cc54ce075db5 restructured Chengsong parents: diff changeset	522
cc54ce075db5 restructured Chengsong parents: diff changeset	523	For $\blexer$ we also have a function with stepwise properties like $\flex$,
cc54ce075db5 restructured Chengsong parents: diff changeset	524	called $\retrieve$ (\ref{retrieveDef}).
cc54ce075db5 restructured Chengsong parents: diff changeset	525	$\retrieve$ takes bit-codes from annotated regular expressions
cc54ce075db5 restructured Chengsong parents: diff changeset	526	guided by a value.
cc54ce075db5 restructured Chengsong parents: diff changeset	527	$\retrieve$ is connected to the $\blexer$ in the following way:
cc54ce075db5 restructured Chengsong parents: diff changeset	528	\begin{lemma}\label{blexer_retrieve}
cc54ce075db5 restructured Chengsong parents: diff changeset	529	$\blexer \; r \; s = \decode \; (\retrieve \; ((\internalise \; r) \backslash s) \; (\mkeps \; (r \backslash s) )) \; r$
cc54ce075db5 restructured Chengsong parents: diff changeset	530	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	531	If we take the derivative of an annotated regular expression,
cc54ce075db5 restructured Chengsong parents: diff changeset	532	we can $\retrieve$ the same bit-codes as before the derivative took place,
cc54ce075db5 restructured Chengsong parents: diff changeset	533	provided that we use the corresponding value:
cc54ce075db5 restructured Chengsong parents: diff changeset	534
cc54ce075db5 restructured Chengsong parents: diff changeset	535	\begin{lemma}\label{retrieveStepwise}
cc54ce075db5 restructured Chengsong parents: diff changeset	536	$\retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$
cc54ce075db5 restructured Chengsong parents: diff changeset	537	\end{lemma}
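
\noindent
As an illustration, applying Lemma~\ref{retrieveStepwise} twice for two
characters $c_1$ and $c_2$ (names introduced here only for the example, with
$r$ read as in the lemma) gives
\begin{center}
\begin{tabular}{lcl}
$\retrieve \; ((r \backslash c_1) \backslash c_2) \; v$
& $=$ & $\retrieve \; (r \backslash c_1) \; (\inj \; (r\backslash c_1) \; c_2 \; v)$\\
& $=$ & $\retrieve \; r \; (\inj \; r \; c_1 \; (\inj \; (r\backslash c_1) \; c_2 \; v))$
\end{tabular}
\end{center}
\noindent
The value on the right is built by the same sequence of injections that
$\flex$ performs on the string $[c_1, c_2]$.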
cc54ce075db5 restructured Chengsong parents: diff changeset	538	The other good thing about $\retrieve$ is that it can be connected to $\flex$:
cc54ce075db5 restructured Chengsong parents: diff changeset	539	%centralLemma1
cc54ce075db5 restructured Chengsong parents: diff changeset	540	\begin{lemma}\label{flex_retrieve}
cc54ce075db5 restructured Chengsong parents: diff changeset	541	$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
cc54ce075db5 restructured Chengsong parents: diff changeset	542	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	543	\begin{proof}
cc54ce075db5 restructured Chengsong parents: diff changeset	544	By induction on $s$. The induction tactic is reverse induction on strings.
cc54ce075db5 restructured Chengsong parents: diff changeset	545	$v$ is allowed to be arbitrary.
cc54ce075db5 restructured Chengsong parents: diff changeset	546	The crucial point is to rewrite
cc54ce075db5 restructured Chengsong parents: diff changeset	547	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	548	\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c])
cc54ce075db5 restructured Chengsong parents: diff changeset	549	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	550	as
cc54ce075db5 restructured Chengsong parents: diff changeset	551	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	552	\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash s@[c])).
cc54ce075db5 restructured Chengsong parents: diff changeset	553	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	554	This enables us to equate
cc54ce075db5 restructured Chengsong parents: diff changeset	555	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	556	\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c])
cc54ce075db5 restructured Chengsong parents: diff changeset	557	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	558	with
cc54ce075db5 restructured Chengsong parents: diff changeset	559	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	560	\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c]))),
cc54ce075db5 restructured Chengsong parents: diff changeset	561	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	562	which in turn can be rewritten as
cc54ce075db5 restructured Chengsong parents: diff changeset	563	\[
cc54ce075db5 restructured Chengsong parents: diff changeset	564	\flex \; r \; \textit{id} \; s@[c] \; (\mkeps (r\backslash s@[c])).
cc54ce075db5 restructured Chengsong parents: diff changeset	565	\]
cc54ce075db5 restructured Chengsong parents: diff changeset	566	\end{proof}
cc54ce075db5 restructured Chengsong parents: diff changeset	567
cc54ce075db5 restructured Chengsong parents: diff changeset	568	With the above lemma we can now link $\flex$ and $\blexer$.
cc54ce075db5 restructured Chengsong parents: diff changeset	569
cc54ce075db5 restructured Chengsong parents: diff changeset	570	\begin{lemma}\label{flex_blexer}
cc54ce075db5 restructured Chengsong parents: diff changeset	571	$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$
cc54ce075db5 restructured Chengsong parents: diff changeset	572	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	573	\begin{proof}
cc54ce075db5 restructured Chengsong parents: diff changeset	574	Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
cc54ce075db5 restructured Chengsong parents: diff changeset	575	\end{proof}
cc54ce075db5 restructured Chengsong parents: diff changeset	576	Finally
cc54ce075db5 restructured Chengsong parents: diff changeset	577
cc54ce075db5 restructured Chengsong parents: diff changeset	578
cc54ce075db5 restructured Chengsong parents: diff changeset	579

author	Chengsong
	Thu, 09 Jun 2022 12:57:53 +0100
changeset 538	8016a2480704
parent 537	50e590823220
child 542	a7344c9afbaf
permissions	-rwxr-xr-x