7 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive |
7 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive |
8 %simplifications and therefore introduce our version of the bitcoded algorithm and |
8 %simplifications and therefore introduce our version of the bitcoded algorithm and |
9 %its correctness proof in |
9 %its correctness proof in |
10 %Chapter 3\ref{Chapter3}. |
10 %Chapter 3\ref{Chapter3}. |
11 |
11 |
12 \section*{Bit-coded Algorithm} |
12 \section{Bit-coded Algorithm} |
|
13 |
|
14 |
|
15 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure}, |
|
16 stores information of previous lexing steps |
|
17 on a stack, in the form of regular expressions |
|
18 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. |
|
19 \begin{envForCaption} |
|
20 \begin{ceqn} |
|
21 \begin{equation}%\label{graph:injLexer} |
|
22 \begin{tikzcd} |
|
23 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
|
24 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
|
25 \end{tikzcd} |
|
26 \end{equation} |
|
27 \end{ceqn} |
|
28 \caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure} |
|
29 \end{envForCaption} |
|
30 \noindent |
|
31 This is both inefficient and prone to stack overflow. |
|
32 A natural question arises as to whether we can store lexing |
|
33 information on the fly, while still using regular expression |
|
34 derivatives? |
|
35 |
|
36 In a lexing algorithm's run, split by the current input position, |
|
37 we have a sub-string that has been consumed, |
|
38 and the sub-string that has yet to come. |
|
39 We already know what was before, and this should be reflected in the value |
|
40 and the regular expression at that step as well. But this is not the |
|
41 case for injection-based regular expression derivatives. |
|
42 Take the regex $(aa)^* \cdot bc$ matching the string $aabc$ |
|
43 as an example, if we have just read the two former characters $aa$: |
|
44 |
|
45 \begin{center} |
|
46 \begin{envForCaption} |
|
47 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
48 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
49 {Consumed: $aa$ |
|
50 \nodepart{two} Not Yet Reached: $bc$ }; |
|
51 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
52 \end{tikzpicture} |
|
53 \caption{Partially matched String} |
|
54 \end{envForCaption} |
|
55 \end{center} |
|
56 %\caption{Input String}\label{StringPartial} |
|
57 %\end{figure} |
|
58 |
|
59 \noindent |
|
60 We have the value that has already been partially calculated, |
|
61 and the part that has yet to come: |
|
62 \begin{center} |
|
63 \begin{envForCaption} |
|
64 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
65 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
66 {$\Seq(\Stars[\Char(a), \Char(a)], ???)$ |
|
67 \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$}; |
|
68 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
69 \end{tikzpicture} |
|
70 \caption{Partially constructed Value} |
|
71 \end{envForCaption} |
|
72 \end{center} |
|
73 |
|
74 In the regex derivative part , (after simplification) |
|
75 all we have is just what is about to come: |
|
76 \begin{center} |
|
77 \begin{envForCaption} |
|
78 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
79 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},] |
|
80 {$???$ |
|
81 \nodepart{two} To Come: $b c$}; |
|
82 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
83 \end{tikzpicture} |
|
84 \caption{Derivative} |
|
85 \end{envForCaption} |
|
86 \end{center} |
|
87 \noindent |
|
88 The previous part is missing. |
|
89 How about keeping the partially constructed value |
|
90 attached to the front of the regular expression? |
|
91 \begin{center} |
|
92 \begin{envForCaption} |
|
93 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
94 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
95 {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$ |
|
96 \nodepart{two} To Come: $b c$}; |
|
97 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
98 \end{tikzpicture} |
|
99 \caption{Derivative} |
|
100 \end{envForCaption} |
|
101 \end{center} |
|
102 \noindent |
|
103 If we do this kind of "attachment" |
|
104 and each time augment the attached partially |
|
105 constructed value when taking off a |
|
106 character: |
|
107 \begin{center} |
|
108 \begin{envForCaption} |
|
109 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
110 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
111 {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$ |
|
112 \nodepart{two} To Come: $c$}; |
|
113 \end{tikzpicture}\\ |
|
114 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
115 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
116 {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$ |
|
117 \nodepart{two} EOF}; |
|
118 \end{tikzpicture} |
|
119 \caption{After $\backslash b$ and $\backslash c$} |
|
120 \end{envForCaption} |
|
121 \end{center} |
|
122 \noindent |
|
123 In the end we could recover the value without a backward phase. |
|
124 But (partial) values are a bit clumsy to stick together with a regular expression, so |
|
125 we instead use bit-codes to encode them. |
|
126 |
13 Bits and bitcodes (lists of bits) are defined as: |
127 Bits and bitcodes (lists of bits) are defined as: |
14 |
128 \begin{envForCaption} |
15 \begin{center} |
129 \begin{center} |
16 $b ::= 1 \mid 0 \qquad |
130 $b ::= S \mid Z \qquad |
17 bs ::= [] \mid b::bs |
131 bs ::= [] \mid b::bs |
18 $ |
132 $ |
19 \end{center} |
133 \end{center} |
20 |
134 \caption{Bit-codes datatype} |
21 \noindent |
135 \end{envForCaption} |
22 The $1$ and $0$ are not in bold in order to avoid |
136 |
|
137 \noindent |
|
138 The $S$ and $Z$ are not in bold in order to avoid |
23 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or |
139 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or |
24 bit-lists) can be used to encode values (or potentially incomplete values) in a |
140 bit-lists) can be used to encode values (or potentially incomplete values) in a |
25 compact form. This can be straightforwardly seen in the following |
141 compact form. This can be straightforwardly seen in the following |
26 coding function from values to bitcodes: |
142 coding function from values to bitcodes: |
27 |
143 \begin{envForCaption} |
28 \begin{center} |
144 \begin{center} |
29 \begin{tabular}{lcl} |
145 \begin{tabular}{lcl} |
30 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
146 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
31 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
147 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
32 $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\ |
148 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ |
33 $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\ |
149 $\textit{code}(\Right\,v)$ & $\dn$ & $S :: code(v)$\\ |
34 $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ |
150 $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ |
35 $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\ |
151 $\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\ |
36 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\; |
152 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\; |
37 code(\Stars\,vs)$ |
153 code(\Stars\,vs)$ |
38 \end{tabular} |
154 \end{tabular} |
39 \end{center} |
155 \end{center} |
40 |
156 \caption{Coding Function for Values} |
41 \noindent |
157 \end{envForCaption} |
42 Here $\textit{code}$ encodes a value into a bitcodes by converting |
158 |
43 $\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty |
159 \noindent |
44 star iteration by $1$. The border where a local star terminates |
160 Here $\textit{code}$ encodes a value into a bit-code by converting |
45 is marked by $0$. This coding is lossy, as it throws away the information about |
161 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty |
|
162 star iteration by $S$. The border where a local star terminates |
|
163 is marked by $Z$. |
|
164 This coding is lossy, as it throws away the information about |
46 characters, and also does not encode the ``boundary'' between two |
165 characters, and also does not encode the ``boundary'' between two |
47 sequence values. Moreover, with only the bitcode we cannot even tell |
166 sequence values. Moreover, with only the bitcode we cannot even tell |
48 whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The |
167 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The |
49 reason for choosing this compact way of storing information is that the |
168 reason for choosing this compact way of storing information is that the |
50 relatively small size of bits can be easily manipulated and ``moved |
169 relatively small size of bits can be easily manipulated and ``moved |
51 around'' in a regular expression. In order to recover values, we will |
170 around'' in a regular expression. |
52 need the corresponding regular expression as an extra information. This |
171 |
53 means the decoding function is defined as: |
172 |
|
173 We define the reverse operation of $\code$, which is $\decode$. |
|
174 As expected, $\decode$ not only requires the bit-codes, |
|
175 but also a regular expression to guide the decoding and |
|
176 fill the gaps of characters: |
54 |
177 |
55 |
178 |
56 %\begin{definition}[Bitdecoding of Values]\mbox{} |
179 %\begin{definition}[Bitdecoding of Values]\mbox{} |
|
180 \begin{envForCaption} |
57 \begin{center} |
181 \begin{center} |
58 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
182 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
59 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
183 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
60 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
184 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
61 $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
185 $\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
62 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; |
186 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; |
63 (\Left\,v, bs_1)$\\ |
187 (\Left\,v, bs_1)$\\ |
64 $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
188 $\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
65 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; |
189 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; |
66 (\Right\,v, bs_1)$\\ |
190 (\Right\,v, bs_1)$\\ |
67 $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & |
191 $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & |
68 $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ |
192 $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ |
69 & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ |
193 & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ |
70 & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ |
194 & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ |
71 $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ |
195 $\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ |
72 $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & |
196 $\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ & |
73 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
197 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
74 & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ |
198 & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ |
75 & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ |
199 & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ |
76 |
200 |
77 $\textit{decode}\,bs\,r$ & $\dn$ & |
201 $\textit{decode}\,bs\,r$ & $\dn$ & |
78 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
202 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
79 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
203 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
80 \textit{else}\;\textit{None}$ |
204 \textit{else}\;\textit{None}$ |
81 \end{tabular} |
205 \end{tabular} |
82 \end{center} |
206 \end{center} |
|
207 \caption{Bit-decoding of values} |
|
208 \end{envForCaption} |
83 %\end{definition} |
209 %\end{definition} |
84 |
210 |
85 Sulzmann and Lu's integrated the bitcodes into regular expressions to |
211 Sulzmann and Lu's integrated the bitcodes into regular expressions to |
86 create annotated regular expressions \cite{Sulzmann2014}. |
212 create annotated regular expressions \cite{Sulzmann2014}. |
87 \emph{Annotated regular expressions} are defined by the following |
213 \emph{Annotated regular expressions} are defined by the following |