ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
author Chengsong
Wed, 13 Jul 2022 08:27:28 +0100
changeset 564 3cbcd7cda0a9
parent 543 b2bea5968b89
child 575 3178f0e948ac
permissions -rwxr-xr-x
more
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     1
% Chapter Template
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     3
% Main chapter title
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
     4
\chapter{Bit-coded Algorithm of Sulzmann and Lu}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     5
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     6
\label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     7
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     8
%simplifications and therefore introduce our version of the bitcoded algorithm and 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     9
%its correctness proof in 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    10
%Chapter 3\ref{Chapter3}. 
564
Chengsong
parents: 543
diff changeset
    11
In this chapter, we are going to describe the bit-coded algorithm
Chengsong
parents: 543
diff changeset
    12
introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of 
Chengsong
parents: 543
diff changeset
    13
regular expressions. 
537
Chengsong
parents: 536
diff changeset
    14
\section{Bit-coded Algorithm}
Chengsong
parents: 536
diff changeset
    15
The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
Chengsong
parents: 536
diff changeset
    16
stores information of previous lexing steps
Chengsong
parents: 536
diff changeset
    17
on a stack, in the form of regular expressions
Chengsong
parents: 536
diff changeset
    18
and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
Chengsong
parents: 536
diff changeset
    19
\begin{ceqn}
Chengsong
parents: 536
diff changeset
    20
\begin{equation}%\label{graph:injLexer}
564
Chengsong
parents: 543
diff changeset
    21
	\begin{tikzcd}[ampersand replacement=\&, execute at end picture={
Chengsong
parents: 543
diff changeset
    22
			\begin{scope}[on background layer]
Chengsong
parents: 543
diff changeset
    23
				\node[rectangle, fill={red!30},
Chengsong
parents: 543
diff changeset
    24
					pattern=north east lines, pattern color=red,
Chengsong
parents: 543
diff changeset
    25
					fit={(-3,-1) (-3, 1) (1, -1) 
Chengsong
parents: 543
diff changeset
    26
						(1, 1)}
Chengsong
parents: 543
diff changeset
    27
				     ] 
Chengsong
parents: 543
diff changeset
    28
				     {}; ,
Chengsong
parents: 543
diff changeset
    29
				\node[rectangle, fill={blue!20},
Chengsong
parents: 543
diff changeset
    30
					pattern=north east lines, pattern color=blue,
Chengsong
parents: 543
diff changeset
    31
					fit= {(1, -1) (1, 1) (3, -1) (3, 1)}
Chengsong
parents: 543
diff changeset
    32
					]
Chengsong
parents: 543
diff changeset
    33
					{};
Chengsong
parents: 543
diff changeset
    34
				\end{scope}}
Chengsong
parents: 543
diff changeset
    35
					]
Chengsong
parents: 543
diff changeset
    36
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] \& r_1 \arrow[r, "\backslash c_1"] \arrow[d] \& r_2 \arrow[r, dashed] \arrow[d] \& r_n \arrow[d, "mkeps" description] \\
Chengsong
parents: 543
diff changeset
    37
v_0           \& v_1 \arrow[l,"inj_{r_0} c_0"]                \& v_2 \arrow[l, "inj_{r_1} c_1"]              \& v_n \arrow[l, dashed]         \\
537
Chengsong
parents: 536
diff changeset
    38
\end{tikzcd}
Chengsong
parents: 536
diff changeset
    39
\end{equation}
Chengsong
parents: 536
diff changeset
    40
\end{ceqn}
Chengsong
parents: 536
diff changeset
    41
\noindent
564
Chengsong
parents: 543
diff changeset
    42
The red part represents what we already know during the first
Chengsong
parents: 543
diff changeset
    43
derivative phase,
Chengsong
parents: 543
diff changeset
    44
and the blue part represents the unknown part of input.
Chengsong
parents: 543
diff changeset
    45
The red area expands as we move towards $r_n$, 
Chengsong
parents: 543
diff changeset
    46
indicating an increasing stack size during lexing.
Chengsong
parents: 543
diff changeset
    47
Despite having some partial lexing information during
Chengsong
parents: 543
diff changeset
    48
the forward derivative phase, we choose to store them
Chengsong
parents: 543
diff changeset
    49
temporarily, only to convert the information to lexical
Chengsong
parents: 543
diff changeset
    50
values at a later stage. In essence we are repeating work we
Chengsong
parents: 543
diff changeset
    51
have already done.
537
Chengsong
parents: 536
diff changeset
    52
This is both inefficient and prone to stack overflow.
Chengsong
parents: 536
diff changeset
    53
A natural question arises as to whether we can store lexing
Chengsong
parents: 536
diff changeset
    54
information on the fly, while still using regular expression 
564
Chengsong
parents: 543
diff changeset
    55
derivatives.
537
Chengsong
parents: 536
diff changeset
    56
564
Chengsong
parents: 543
diff changeset
    57
If we remove the details of the individual 
Chengsong
parents: 543
diff changeset
    58
lexing steps, and use red and blue areas as before
Chengsong
parents: 543
diff changeset
    59
to indicate consumed (seen) input and constructed
Chengsong
parents: 543
diff changeset
    60
partial value (before recovering the rest of the stack),
Chengsong
parents: 543
diff changeset
    61
one could see that the seen part's lexical information
Chengsong
parents: 543
diff changeset
    62
is stored in the form of a regular expression.
Chengsong
parents: 543
diff changeset
    63
Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$
Chengsong
parents: 543
diff changeset
    64
and assume we have just read the two characters $aa$:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    65
\begin{center}
537
Chengsong
parents: 536
diff changeset
    66
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 536
diff changeset
    67
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
564
Chengsong
parents: 543
diff changeset
    68
	    {Partial lexing info: $\ONE \cdot a \cdot (aa)^* \cdot bc$ etc.
537
Chengsong
parents: 536
diff changeset
    69
         \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
Chengsong
parents: 536
diff changeset
    70
\end{tikzpicture}
Chengsong
parents: 536
diff changeset
    71
\end{center}
564
Chengsong
parents: 543
diff changeset
    72
\noindent
Chengsong
parents: 543
diff changeset
    73
In the injection-based lexing algorithm, we ``neglect" the red area
Chengsong
parents: 543
diff changeset
    74
by putting all the characters we have consumed and
Chengsong
parents: 543
diff changeset
    75
intermediate regular expressions on the stack when 
Chengsong
parents: 543
diff changeset
    76
we go from left to right in the derivative phase.
Chengsong
parents: 543
diff changeset
    77
The red area grows till the string is exhausted.
Chengsong
parents: 543
diff changeset
    78
During the injection phase, the value in the blue area
Chengsong
parents: 543
diff changeset
    79
is built up incrementally, while the red area shrinks.
Chengsong
parents: 543
diff changeset
    80
Before we have recovered all characters and intermediate
Chengsong
parents: 543
diff changeset
    81
derivative regular expressions from the stack,
Chengsong
parents: 543
diff changeset
    82
what values these characters and regular expressions correspond 
Chengsong
parents: 543
diff changeset
    83
to are unknown: 
537
Chengsong
parents: 536
diff changeset
    84
\begin{center}
Chengsong
parents: 536
diff changeset
    85
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 536
diff changeset
    86
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
564
Chengsong
parents: 543
diff changeset
    87
	    {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$
Chengsong
parents: 543
diff changeset
    88
         \nodepart{two}  $b c$ corresponds to  $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
537
Chengsong
parents: 536
diff changeset
    89
%\caption{term 1 \ref{term:1}'s matching configuration}
Chengsong
parents: 536
diff changeset
    90
\end{tikzpicture}
Chengsong
parents: 536
diff changeset
    91
\end{center}
Chengsong
parents: 536
diff changeset
    92
\noindent
564
Chengsong
parents: 543
diff changeset
    93
However, they should be calculable,
Chengsong
parents: 543
diff changeset
    94
as characters and regular expression shapes
Chengsong
parents: 543
diff changeset
    95
after taking derivative w.r.t those characters
Chengsong
parents: 543
diff changeset
    96
have already been known, therefore in our example,
Chengsong
parents: 543
diff changeset
    97
we know that the value starts with two $a$s,
Chengsong
parents: 543
diff changeset
    98
and makes up to an iteration in a Kleene star:
Chengsong
parents: 543
diff changeset
    99
(We have put the injection-based lexing's partial 
Chengsong
parents: 543
diff changeset
   100
result in the right part of the split rectangle
Chengsong
parents: 543
diff changeset
   101
to contrast it with the partial valued produced
Chengsong
parents: 543
diff changeset
   102
in a forward manner)
537
Chengsong
parents: 536
diff changeset
   103
\begin{center}
Chengsong
parents: 536
diff changeset
   104
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 536
diff changeset
   105
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
564
Chengsong
parents: 543
diff changeset
   106
	    {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
Chengsong
parents: 543
diff changeset
   107
	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$  $\stackrel{Inj}{\longleftarrow}$};
537
Chengsong
parents: 536
diff changeset
   108
%\caption{term 1 \ref{term:1}'s matching configuration}
Chengsong
parents: 536
diff changeset
   109
\end{tikzpicture}
Chengsong
parents: 536
diff changeset
   110
\end{center}
Chengsong
parents: 536
diff changeset
   111
\noindent
Chengsong
parents: 536
diff changeset
   112
If we do this kind of "attachment"
Chengsong
parents: 536
diff changeset
   113
and each time augment the attached partially
Chengsong
parents: 536
diff changeset
   114
constructed value when taking off a 
Chengsong
parents: 536
diff changeset
   115
character:
Chengsong
parents: 536
diff changeset
   116
\begin{center}
564
Chengsong
parents: 543
diff changeset
   117
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 543
diff changeset
   118
	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
Chengsong
parents: 543
diff changeset
   119
        {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
Chengsong
parents: 543
diff changeset
   120
         \nodepart{two} Remaining: $b c$};
Chengsong
parents: 543
diff changeset
   121
\end{tikzpicture}\\
Chengsong
parents: 543
diff changeset
   122
$\downarrow$\\
537
Chengsong
parents: 536
diff changeset
   123
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 536
diff changeset
   124
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
Chengsong
parents: 536
diff changeset
   125
        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
564
Chengsong
parents: 543
diff changeset
   126
         \nodepart{two} Remaining: $c$};
537
Chengsong
parents: 536
diff changeset
   127
\end{tikzpicture}\\
564
Chengsong
parents: 543
diff changeset
   128
$\downarrow$\\
537
Chengsong
parents: 536
diff changeset
   129
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 536
diff changeset
   130
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
Chengsong
parents: 536
diff changeset
   131
        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
Chengsong
parents: 536
diff changeset
   132
         \nodepart{two} EOF};
Chengsong
parents: 536
diff changeset
   133
\end{tikzpicture}
Chengsong
parents: 536
diff changeset
   134
\end{center}
Chengsong
parents: 536
diff changeset
   135
\noindent
Chengsong
parents: 536
diff changeset
   136
In the end we could recover the value without a backward phase.
Chengsong
parents: 536
diff changeset
   137
But (partial) values are a bit clumsy to stick together with a regular expression, so 
Chengsong
parents: 536
diff changeset
   138
we instead use bit-codes to encode them.
Chengsong
parents: 536
diff changeset
   139
Chengsong
parents: 536
diff changeset
   140
Bits and bitcodes (lists of bits) are defined as:
Chengsong
parents: 536
diff changeset
   141
\begin{center}
Chengsong
parents: 536
diff changeset
   142
		$b ::=   S \mid  Z \qquad
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   143
bs ::= [] \mid b::bs    
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   144
$
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   145
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   146
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   147
\noindent
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   148
Using $S$ and $Z$ rather than $1$ and $0$ is to avoid
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   149
confusion with the regular expressions $\ZERO$ and $\ONE$.
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   150
Bitcodes (or
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   151
bit-lists) can be used to encode values (or potentially incomplete values) in a
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   152
compact form. This can be straightforwardly seen in the following
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   153
coding function from values to bitcodes: 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   154
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   155
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   156
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   157
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
537
Chengsong
parents: 536
diff changeset
   158
  $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
Chengsong
parents: 536
diff changeset
   159
  $\textit{code}(\Right\,v)$ & $\dn$ & $S :: code(v)$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   160
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
537
Chengsong
parents: 536
diff changeset
   161
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\
Chengsong
parents: 536
diff changeset
   162
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   163
                                                 code(\Stars\,vs)$
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   164
\end{tabular}    
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   165
\end{center} 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   166
\noindent
537
Chengsong
parents: 536
diff changeset
   167
Here $\textit{code}$ encodes a value into a bit-code by converting
Chengsong
parents: 536
diff changeset
   168
$\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
Chengsong
parents: 536
diff changeset
   169
star iteration by $S$. The border where a local star terminates
Chengsong
parents: 536
diff changeset
   170
is marked by $Z$. 
Chengsong
parents: 536
diff changeset
   171
This coding is lossy, as it throws away the information about
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   172
characters, and also does not encode the ``boundary'' between two
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   173
sequence values. Moreover, with only the bitcode we cannot even tell
537
Chengsong
parents: 536
diff changeset
   174
whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   175
reason for choosing this compact way of storing information is that the
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   176
relatively small size of bits can be easily manipulated and ``moved
564
Chengsong
parents: 543
diff changeset
   177
around" in a regular expression. 
537
Chengsong
parents: 536
diff changeset
   178
564
Chengsong
parents: 543
diff changeset
   179
Because of the lossiness, the process of decoding a bitlist requires additionally 
Chengsong
parents: 543
diff changeset
   180
a regular expression. The function $\decode$ is defined as:
537
Chengsong
parents: 536
diff changeset
   181
We define the reverse operation of $\code$, which is $\decode$.
Chengsong
parents: 536
diff changeset
   182
As expected, $\decode$ not only requires the bit-codes,
Chengsong
parents: 536
diff changeset
   183
but also a regular expression to guide the decoding and 
Chengsong
parents: 536
diff changeset
   184
fill the gaps of characters:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   185
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   186
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   187
%\begin{definition}[Bitdecoding of Values]\mbox{}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   188
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   189
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   190
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   191
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
537
Chengsong
parents: 536
diff changeset
   192
  $\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   193
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   194
       (\Left\,v, bs_1)$\\
537
Chengsong
parents: 536
diff changeset
   195
  $\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   196
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   197
       (\Right\,v, bs_1)$\\                           
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   198
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   199
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   200
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   201
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
537
Chengsong
parents: 536
diff changeset
   202
  $\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
Chengsong
parents: 536
diff changeset
   203
  $\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ & 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   204
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   205
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   206
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   207
  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   208
  $\textit{decode}\,bs\,r$ & $\dn$ &
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   209
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   210
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   211
       \textit{else}\;\textit{None}$                       
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   212
\end{tabular}    
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   213
\end{center} 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   214
%\end{definition}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   215
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   216
\noindent
564
Chengsong
parents: 543
diff changeset
   217
The function $\decode'$ returns a pair consisting of a partially decoded value and some leftover:
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   218
$\decode'$ does most of the job while $\decode$ throws
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   219
away leftover bit-codes and returns the value only.
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   220
$\decode$ is terminating as $\decode'$ is terminating.
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   221
We have the property that $\decode$ and $\code$ are
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   222
reverse operations of one another:
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   223
\begin{lemma}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   224
\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   225
\end{lemma}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   226
\begin{proof}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   227
By proving a more general version of the lemma, on $\decode'$:
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   228
\[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   229
Then setting $ds$ to be $[]$ and unfolding $\decode$ definition
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   230
we get the lemma.
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   231
\end{proof}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   232
With the $\code$ and $\decode$ functions in hand, we know how to 
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   233
switch between bit-codes and value--the two different representations of 
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   234
lexing information. 
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   235
The next step is to integrate this information into the working regular expression.
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   236
Attaching bits to the front of regular expressions is the solution Sulzamann and Lu
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   237
gave for storing partial values on the fly:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   238
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   239
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   240
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   241
  $\textit{a}$ & $::=$  & $\ZERO$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   242
                  & $\mid$ & $_{bs}\ONE$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   243
                  & $\mid$ & $_{bs}{\bf c}$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   244
                  & $\mid$ & $_{bs}\sum\,as$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   245
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   246
                  & $\mid$ & $_{bs}a^*$
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   247
\end{tabular}    
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   248
\end{center}  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   249
%(in \textit{ALTS})
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   250
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   251
\noindent
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   252
We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}.
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   253
$bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   254
expressions and $as$ for a list of annotated regular expressions.
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   255
The alternative constructor ($\sum$) has been generalised to 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   256
accept a list of annotated regular expressions rather than just 2.
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   257
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   258
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   259
The first thing we define related to bit-coded regular expressions
564
Chengsong
parents: 543
diff changeset
   260
is how we move bits, for instance pasting it at the front of an annotated regular expression.
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   261
The operation $\fuse$ is just to attach bit-codes 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   262
to the front of an annotated regular expression:
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   263
\begin{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   264
\begin{tabular}{lcl}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   265
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   266
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   267
     $_{bs @ bs'}\ONE$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   268
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   269
     $_{bs@bs'}{\bf c}$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   270
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   271
     $_{bs@bs'}\sum\textit{as}$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   272
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   273
     $_{bs@bs'}a_1 \cdot a_2$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   274
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   275
     $_{bs @ bs'}a^*$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   276
\end{tabular}    
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   277
\end{center} 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   278
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   279
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   280
With that we are able to define $\internalise$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   281
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   282
To do lexing using annotated regular expressions, we shall first
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   283
transform the usual (un-annotated) regular expressions into annotated
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   284
regular expressions. This operation is called \emph{internalisation} and
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   285
defined as follows:
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   286
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   287
%\begin{definition}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   288
\begin{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   289
\begin{tabular}{lcl}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   290
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   291
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   292
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   293
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   294
  $_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   295
  \textit{fuse}\,[S]\,r_2^\uparrow]$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   296
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   297
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   298
  $(r^*)^\uparrow$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   299
         $_{[]}(r^\uparrow)^*$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   300
\end{tabular}    
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   301
\end{center}    
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   302
%\end{definition}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   303
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   304
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   305
We use an up arrow with postfix notation
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   306
to denote the operation, 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   307
for convenience. The $\textit{internalise} \; r$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   308
notation is more cumbersome.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   309
The opposite of $\textit{internalise}$ is
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   310
$\erase$, where all the bit-codes are removed,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   311
and the alternative operator $\sum$ for annotated
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   312
regular expressions is transformed to the binary alternatives
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   313
for plain regular expressions.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   314
\begin{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   315
	\begin{tabular}{lcr}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   316
		$\erase \; \ZERO$ & $\dn$ & $\ZERO$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   317
		$\erase \; _{bs}\ONE$ & $\dn$ & $\ONE$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   318
		$\erase \; _{bs}\mathbf{c}$ & $\dn$ & $\mathbf{c}$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   319
		$\erase \; _{bs} a_1 \cdot a_2$ & $\dn$ & $\erase \; r_1\cdot\erase\; r_2$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   320
		$\erase \; _{bs} [] $ & $\dn$ & $\ZERO$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   321
		$\erase \; _{bs} [a] $ & $\dn$ & $\erase \; a$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   322
		$\erase \; _{bs} \sum [a_1, \; a_2]$ & $\dn$ & $\erase \; a_1 +\erase \; a_2$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   323
		$\erase \; _{bs} \sum (a :: as)$ & $\dn$ & $\erase \; a + \erase \; _{[]} \sum as$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   324
		$\erase \; _{bs} a^*$ & $\dn$ & $(\erase \; a)^*$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   325
	\end{tabular}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   326
\end{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   327
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   328
We also abbreviate the $\erase\; a$ operation
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   329
as $a_\downarrow$, for conciseness.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   330
For bit-coded regular expressions, as a different datatype, 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   331
testing whether they contain empty string in their lauguage requires
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   332
a dedicated function $\bnullable$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   333
which simply calls $\erase$ first before testing whether it is $\nullable$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   334
\begin{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   335
	\begin{tabular}{lcr}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   336
		$\bnullable \; a $ & $\dn$ & $\nullable \; (a_\downarrow)$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   337
	\end{tabular}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   338
\end{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   339
The function for collecting the
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   340
bitcodes of a $\bnullable$ annotated regular expression
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   341
is a generalised version of the $\textit{mkeps}$ function
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   342
for annotated regular expressions, called $\textit{bmkeps}$:
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   343
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   344
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   345
%\begin{definition}[\textit{bmkeps}]\mbox{}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   346
\begin{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   347
\begin{tabular}{lcl}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   348
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   349
  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   350
     $\textit{if}\;\textit{bnullable}\,a$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   351
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   352
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{[]}\sum \textit{as})$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   353
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   354
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   355
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   356
     $bs \,@\, [Z]$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   357
\end{tabular}    
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   358
\end{center}    
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   359
%\end{definition}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   360
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   361
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   362
This function completes the value information by travelling along the
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   363
path of the regular expression that corresponds to a POSIX value and
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   364
collecting all the bitcodes, and using $S$ to indicate the end of star
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   365
iterations. 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   366
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   367
The most central question is how these partial lexing information
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   368
represented as bit-codes is augmented and carried around 
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   369
during a derivative is taken.
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   370
This is done by adding bitcodes to the 
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   371
derivatives, for example when one more star iteratoin is taken (we
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   372
call the operation of derivatives on annotated regular expressions $\bder$
564
Chengsong
parents: 543
diff changeset
   373
because it is derivatives on regular expressiones with \emph{b}itcodes),
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   374
we need to unfold it into a sequence,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   375
and attach an additional bit $Z$ to the front of $r \backslash c$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   376
to indicate one more star iteration. 
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   377
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   378
  \begin{tabular}{@{}lcl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   379
  $\bder \; c\; (_{bs}a^*) $ & $\dn$ &
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   380
      $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   381
       (_{[]}a^*))$
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   382
\end{tabular}    
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   383
\end{center}    
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   384
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   385
\noindent
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   386
For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   387
there is no danger of confusion with derivatives on plain regular expressions, 
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   388
for example, the above can be expressed as
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   389
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   390
  \begin{tabular}{@{}lcl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   391
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   392
      $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   393
       (_{[]}a^*))$
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   394
\end{tabular}    
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   395
\end{center}   
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   396
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   397
\noindent
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   398
Using the picture we used earlier to depict this, the transformation when 
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   399
taking a derivative w.r.t a star is like below:
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   400
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   401
\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   402
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   403
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   404
        {$bs$
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   405
         \nodepart{two} $a^*$ };
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   406
%\caption{term 1 \ref{term:1}'s matching configuration}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   407
\end{tikzpicture} 
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   408
&
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   409
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   410
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   411
        {$v_{\text{previous iterations}}$
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   412
         \nodepart{two} $a^*$};
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   413
%\caption{term 1 \ref{term:1}'s matching configuration}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   414
\end{tikzpicture}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   415
\\
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   416
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   417
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   418
        { $bs$ + [Z]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   419
         \nodepart{two}  $(a\backslash c )\cdot a^*$ };
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   420
%\caption{term 1 \ref{term:1}'s matching configuration}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   421
\end{tikzpicture}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   422
&
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   423
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   424
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   425
        {$v_{\text{previous iterations}}$ + 1 more iteration
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   426
         \nodepart{two} $(a\backslash c )\cdot a^*$ };
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   427
%\caption{term 1 \ref{term:1}'s matching configuration}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   428
\end{tikzpicture}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   429
\end{tabular}    
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   430
\noindent
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   431
Another place in the $\bder$ function where it differs
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   432
from normal derivatives (on un-annotated regular expressions)
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   433
is the sequence case:
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   434
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   435
  \begin{tabular}{@{}lcl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   436
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   437
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   438
     $\textit{if}\;\textit{bnullable}\,a_1$\\
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   439
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   440
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   441
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   442
\end{tabular}    
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   443
\end{center}    
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   444
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   445
The difference is that (when $a_1$ is $\bnullable$)
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   446
we use $\bmkeps$ to store the lexing information
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   447
in $a_1$ before collapsing it (as it has been fully matched by string prior to $c$, 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   448
and attach the collected bit-codes to the front of $a_2$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   449
before throwing away $a_1$. We assume that $\bmkeps$ correctly extracts the bitcode for how $a_1$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   450
matches the string prior to $c$ (more on this later).
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   451
The bitsequence $\textit{bs}$ which was initially attached to the first element of the sequence
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   452
$a_1 \cdot a_2$, has now been elevated to the top level of teh $\sum$. 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   453
This is because this piece of information will be needed whichever way the sequence is matched,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   454
regardless of whether $c$ belongs to $a_1$ or $a_2$.
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   455
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   456
In the injection-based lexing, $r_1$ is immediately thrown away in 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   457
subsequent derivatives on the right branch (when $r_1$ is $\nullable$),
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   458
\begin{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   459
	$(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   460
\end{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   461
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   462
as it knows $r_1$ is stored on stack and available once the recursive 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   463
call to later derivatives finish.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   464
Therefore, if the $\Right$ branch is taken in a $\POSIX$ match,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   465
we construct back the sequence value once step back by
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   466
calling a on $\mkeps(r_1)$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   467
\begin{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   468
	\begin{tabular}{lcr}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   469
		$\ldots r_1 \cdot r_2$ & $\rightarrow$ & $r_1\cdot r_2 + r_2 \backslash c \ldots $\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   470
		$\ldots \Seq(v_1, v_2) (\Seq(\mkeps(r1), (\inj \; r_2 \; c\; v_{2c})))$ & $\leftarrow$ & $\Right(v_{2c})\ldots$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   471
	\end{tabular}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   472
\end{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   473
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   474
The rest of the clauses of $\bder$ is rather similar to
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   475
$\der$, and is put together here as a wholesome definition
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   476
for $\bder$:
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   477
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   478
  \begin{tabular}{@{}lcl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   479
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   480
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   481
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   482
        $\textit{if}\;c=d\; \;\textit{then}\;
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   483
         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   484
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   485
  $_{bs}\sum\;(\textit{map} \; (\_\backslash c) \; as )$\\
538
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   486
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   487
     $\textit{if}\;\textit{bnullable}\,a_1$\\
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   488
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   489
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   490
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   491
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   492
      $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   493
       (_{[]}r^*))$
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   494
\end{tabular}    
8016a2480704 intro and chap2
Chengsong
parents: 537
diff changeset
   495
\end{center}    
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   496
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   497
Generalising the derivative operation with bitcodes to strings, we have 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   498
\begin{center}
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   499
	\begin{tabular}{@{}lcl@{}}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   500
		$a\backslash_s [] $ & $\dn$ & $a$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   501
		$a\backslash (c :: s) $ & $\dn$ & $(a \backslash c) \backslash_s s$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   502
	\end{tabular}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   503
\end{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   504
As we did earlier, we omit the $s$ subscript at $\backslash_s$ when there is no danger
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   505
of confusion.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   506
Putting this all together, we obtain a lexer with bit-coded regular expressions
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   507
as its internal data structures, which we call $\blexer$:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   508
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   509
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   510
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   511
  $\textit{blexer}\;r\,s$ & $\dn$ &
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   512
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   513
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   514
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   515
  & & $\;\;\textit{else}\;\textit{None}$
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   516
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   517
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   518
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   519
\noindent
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   520
Ausaf and Urban formally proved the correctness of the $\blexer$, namely
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   521
\begin{conjecture}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   522
$\blexer \;r \; s = \lexer \; r \; s$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   523
\end{conjecture}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   524
This was claimed but not formalised in Sulzmann and Lu's work.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   525
We introduce the proof later, after we give out all
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   526
the needed auxiliary functions and definitions
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   527
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   528
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   529
%-----------------------------------
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   530
%	SUBSECTION 1
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   531
%-----------------------------------
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   532
\section{Specifications of Some Helper Functions}
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   533
The functions we introduce will give a more detailed glimpse into 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   534
the lexing process, which might not be possible
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   535
using $\lexer$ or $\blexer$ themselves.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   536
The first function we shall look at is $\retrieve$.
543
b2bea5968b89 thesis_thys
Chengsong
parents: 542
diff changeset
   537
\subsection{$\textit{Retrieve}$}
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   538
Our bit-coded lexer "retrieve"s the bitcodes using $\bmkeps$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   539
after we finished doing all the derivatives:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   540
\begin{center}
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   541
\begin{tabular}{lcl}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   542
	& & $\ldots$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   543
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   544
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   545
  & & $\ldots$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   546
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   547
\end{center}
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   548
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   549
Recall that $\bmkeps$ looks for the leftmost branch of an alternative
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   550
and completes a star's iterations by attaching a $Z$ at the end of the bitcodes
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   551
extracted. It "retrieves" a sequence by visiting both children and then stitch together 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   552
two bitcodes using concatenation. After the entire tree structure of the regular 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   553
expression has been traversed using the above manner, we get a bitcode encoding the 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   554
lexing result.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   555
We know that this "retrieved" bitcode leads to the correct value after decoding,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   556
which is $v_0$ in the bird's eye view of the injection-based lexing diagram.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   557
Now assume we keep every other data structure in the diagram \ref{InjFigure},
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   558
and only replace all the plain regular expression by their annotated counterparts,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   559
computed during a $\blexer$ run.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   560
Then we obtain a diagram for the annotated regular expression derivatives and
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   561
their corresponding values, though the values are never calculated in $\blexer$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   562
We have that $a_n$ contains all the lexing result information.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   563
\vspace{20mm}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   564
\begin{center}%\label{graph:injLexer}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   565
\begin{tikzcd}[
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   566
	every matrix/.append style = {name=p},
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   567
	remember picture, overlay,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   568
	]
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   569
	a_0 \arrow[r, "\backslash c_0"]  \arrow[d] & a_1 \arrow[r, "\backslash c_1"] \arrow[d] & a_2 \arrow[r, dashed] \arrow[d] & a_n \arrow[d] \\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   570
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   571
\end{tikzcd}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   572
\begin{tikzpicture}[
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   573
	remember picture, overlay,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   574
E/.style = {ellipse, draw=blue, dashed,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   575
            inner xsep=4mm,inner ysep=-4mm, rotate=90, fit=#1}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   576
                        ]
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   577
\node[E = (p-1-1) (p-2-1)] {};
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   578
\node[E = (p-1-4) (p-2-4)] {};
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   579
\end{tikzpicture}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   580
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   581
\end{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   582
\vspace{20mm}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   583
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   584
On the other hand, $v_0$ also encodes the correct lexing result, as we have proven for $\lexer$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   585
Encircled in the diagram  are the two pairs $v_0, a_0$ and $v_n, a_n$, which both 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   586
encode the correct lexical result. Though for the leftmost pair, we have
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   587
the information condensed in $v_0$ the value part, whereas for the rightmost pair,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   588
the information is concentrated on $a_n$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   589
We know that in the intermediate steps the pairs $v_i, a_i$, must in some way encode the complete
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   590
lexing information as well. Therefore, we need a unified approach to extract such lexing result
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   591
from a value $v_i$ and its annotated regular expression $a_i$. 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   592
And the function $f$ must satisfy these requirements:
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   593
\begin{itemize}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   594
	\item
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   595
		$f \; a_i\;v_i = f \; a_n \; v_n = \decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   596
	\item
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   597
		$f \; a_i\;v_i = f \; a_0 \; v_0 = v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   598
\end{itemize}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   599
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   600
If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   601
The core of the function $f$ is something that produces the bitcodes
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   602
$\code \; v_0$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   603
It is unclear how, but Sulzmann and Lu came up with a function satisfying all the above
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   604
requirements, named \emph{retrieve}:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   605
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   606
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   607
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   608
\begin{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   609
\begin{tabular}{lcr}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   610
	$\retrieve \; \, (_{bs} \ONE) \; \, (\Empty)$ & $\dn$ & $\textit{bs}$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   611
	$\retrieve \; \, (_{bs} \mathbf{c} ) \; \Char(c)$ & $\dn$ & $ \textit{bs}$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   612
	$\retrieve \; \, (_{bs} a_1 \cdot a_2) \; \Seq(v_1, v_2)$ & $\dn$ &  $\textit{bs} @ (\retrieve \; a_1\; v_1) @ (\retrieve \; a_2 \; v_2)$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   613
	$\retrieve \; \, (_{bs} \Sigma (a :: \textit{as}) \; \,\Left(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v)$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   614
	$\retrieve \; \, (_{bs} \Sigma (a :: \textit{as} \; \, \Right(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; (_{[]}\Sigma \textit{as}) \; v)$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   615
	$\retrieve \; \, (_{bs} a^*) \; \, (\Stars(v :: vs)) $ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v) @ (\retrieve \; (_{[]} a^*) \; (\Stars(vs)))$\\
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   616
	$\retrieve \; \, (_{bs} a^*) \; \, (\Stars([]) $ & $\dn$ & $\textit{bs} @ [Z]$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   617
\end{tabular}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   618
\end{center}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   619
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   620
As promised, $\retrieve$ collects the right bit-codes from the 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   621
final derivative $a_n$:
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   622
\begin{lemma}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   623
	$\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   624
\end{lemma}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   625
\begin{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   626
	By a routine induction on $a$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   627
\end{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   628
The design of $\retrieve$ enables extraction of bit-codes
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   629
from not only $\bnullable$ (annotated) regular expressions,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   630
but also those that are not $\bnullable$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   631
For example, if we have the regular expression just internalised
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   632
and the lexing result value, we could $\retrieve$ the bitcdoes
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   633
that look as if we have en$\code$-ed the value:
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   634
\begin{lemma}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   635
	$\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   636
\end{lemma}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   637
\begin{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   638
	By induction on $r$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   639
\end{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   640
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   641
The following property is more interesting, as
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   642
it provides a "bridge" between $a_0, v_0$ and $a_n, v_n$ in the
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   643
lexing diagram.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   644
If you take derivative of an annotated regular expression, 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   645
you can $\retrieve$ the same bit-codes as before the derivative took place,
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   646
provided that you use the corresponding value:
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   647
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   648
\begin{lemma}\label{retrieveStepwise}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   649
	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   650
\end{lemma}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   651
\begin{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   652
	By induction on $r$, where $v$ is allowed to be arbitrary.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   653
	The induction principle is function $\erase$'s cases.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   654
\end{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   655
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   656
$\retrieve$ is connected to the $\blexer$ in the following way:
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   657
\begin{lemma}\label{blexer_retrieve}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   658
$\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   659
\end{lemma}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   660
\noindent
564
Chengsong
parents: 543
diff changeset
   661
$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regular expressiones of $\blexer$.
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   662
For plain regular expressions something similar is required as well.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   663
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   664
\subsection{$\flex$}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   665
Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$,
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   666
defined as
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   667
\begin{center}
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   668
\begin{tabular}{lcr}
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   669
$\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   670
$\flex \; r \; f \; c :: s \; v$ &  $=$ &   $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   671
\end{tabular}
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   672
\end{center}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   673
which accumulates the characters that needs to be injected back, 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   674
and does the injection in a stack-like manner (last taken derivative first injected).
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   675
$\flex$ is connected to the $\lexer$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   676
\begin{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   677
$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   678
\end{lemma}
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   679
\begin{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   680
	By reverse induction on $s$.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   681
\end{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   682
$\flex$ provides us a bridge between $\lexer$'s intermediate steps.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   683
What is even better about $\flex$ is that it allows us to 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   684
directly operate on the value $\mkeps (r\backslash v)$,
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   685
which is pivotal in the definition of  $\lexer $ and $\blexer$, but not visible as an argument.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   686
When the value created by $\mkeps$ becomes available, one can 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   687
prove some stepwise properties of lexing nicely:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   688
\begin{lemma}\label{flexStepwise}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   689
$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   690
\end{lemma}
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   691
\begin{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   692
	By induction on the shape of $r\backslash s$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   693
\end{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   694
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   695
With $\flex$ and $\retrieve$ ready, we are ready to connect $\lexer$ and $\blexer$ .
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   696
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   697
\subsection{Correctness Proof of Bit-coded Algorithm}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   698
\begin{lemma}\label{flex_retrieve}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   699
$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   700
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   701
\begin{proof}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   702
By induction on $s$. The induction tactic is reverse induction on strings.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   703
$v$ is allowed to be arbitrary.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   704
The crucial point is to rewrite 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   705
\[
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   706
\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   707
\]
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   708
as
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   709
\[
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   710
\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\;  \mkeps (r \backslash s@[c]))
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   711
\].
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   712
This enables us to equate 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   713
\[
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   714
\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   715
\] 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   716
with 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   717
\[
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   718
\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   719
\],
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   720
which in turn can be rewritten as
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   721
\[
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   722
\flex \; r \; \textit{id} \; s@[c] \;  (\mkeps (r\backslash s@[c]))
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   723
\].
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   724
\end{proof}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   725
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   726
With the above lemma we can now link $\flex$ and $\blexer$.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   727
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   728
%----------------------------------------------------------------------------------------
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   729
%	SECTION  correctness proof
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   730
%----------------------------------------------------------------------------------------
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   731
\section{Correctness of Bit-coded Algorithm (Without Simplification)}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   732
We now give the proof the correctness of the algorithm with bit-codes.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   733
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   734
\begin{lemma}\label{flex_blexer}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   735
$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   736
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   737
\begin{proof}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   738
Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   739
\end{proof}
542
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   740
Finally the correctness of $\blexer$ is given as it outputs the same result as $\lexer$:
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   741
\begin{theorem}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   742
	$\blexer\; r \; s = \lexer \; r \; s$
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   743
\end{theorem}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   744
\begin{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   745
	Straightforward corollary of \ref{flex_blexer}.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   746
\end{proof}
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   747
\noindent
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   748
Our main reason for wanting a bit-coded algorithm over 
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   749
the injection-based one is for its capabilities of allowing
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   750
more aggressive simplifications.
a7344c9afbaf chapter3 finished
Chengsong
parents: 538
diff changeset
   751
We will elaborate on this in the next chapter.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   752
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   753