6 \label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} |
6 \label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} |
7 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive |
7 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive |
8 %simplifications and therefore introduce our version of the bitcoded algorithm and |
8 %simplifications and therefore introduce our version of the bitcoded algorithm and |
9 %its correctness proof in |
9 %its correctness proof in |
10 %Chapter 3\ref{Chapter3}. |
10 %Chapter 3\ref{Chapter3}. |
11 In this chapter, we are going to introduce the bit-coded algorithm |
11 In this chapter, we are going to describe the bit-coded algorithm |
12 introduced by Sulzmann and Lu to address the problem of |
12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of |
13 under-simplified regular expressions. |
13 regular expressions. |
14 \section{Bit-coded Algorithm} |
14 \section{Bit-coded Algorithm} |
15 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure}, |
15 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure}, |
16 stores information of previous lexing steps |
16 stores information of previous lexing steps |
17 on a stack, in the form of regular expressions |
17 on a stack, in the form of regular expressions |
18 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. |
18 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. |
19 \begin{envForCaption} |
|
20 \begin{ceqn} |
19 \begin{ceqn} |
21 \begin{equation}%\label{graph:injLexer} |
20 \begin{equation}%\label{graph:injLexer} |
22 \begin{tikzcd} |
21 \begin{tikzcd}[ampersand replacement=\&, execute at end picture={ |
23 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
22 \begin{scope}[on background layer] |
24 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
23 \node[rectangle, fill={red!30}, |
|
24 pattern=north east lines, pattern color=red, |
|
25 fit={(-3,-1) (-3, 1) (1, -1) |
|
26 (1, 1)} |
|
27 ] |
|
28 {}; , |
|
29 \node[rectangle, fill={blue!20}, |
|
30 pattern=north east lines, pattern color=blue, |
|
31 fit= {(1, -1) (1, 1) (3, -1) (3, 1)} |
|
32 ] |
|
33 {}; |
|
34 \end{scope}} |
|
35 ] |
|
36 r_0 \arrow[r, "\backslash c_0"] \arrow[d] \& r_1 \arrow[r, "\backslash c_1"] \arrow[d] \& r_2 \arrow[r, dashed] \arrow[d] \& r_n \arrow[d, "mkeps" description] \\ |
|
37 v_0 \& v_1 \arrow[l,"inj_{r_0} c_0"] \& v_2 \arrow[l, "inj_{r_1} c_1"] \& v_n \arrow[l, dashed] \\ |
25 \end{tikzcd} |
38 \end{tikzcd} |
26 \end{equation} |
39 \end{equation} |
27 \end{ceqn} |
40 \end{ceqn} |
28 \caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure} |
41 \noindent |
29 \end{envForCaption} |
42 The red part represents what we already know during the first |
30 \noindent |
43 derivative phase, |
|
44 and the blue part represents the unknown part of input. |
|
45 The red area expands as we move towards $r_n$, |
|
46 indicating an increasing stack size during lexing. |
|
47 Despite having some partial lexing information during |
|
48 the forward derivative phase, we choose to store them |
|
49 temporarily, only to convert the information to lexical |
|
50 values at a later stage. In essence we are repeating work we |
|
51 have already done. |
31 This is both inefficient and prone to stack overflow. |
52 This is both inefficient and prone to stack overflow. |
32 A natural question arises as to whether we can store lexing |
53 A natural question arises as to whether we can store lexing |
33 information on the fly, while still using regular expression |
54 information on the fly, while still using regular expression |
34 derivatives? |
55 derivatives. |
35 |
56 |
36 In a lexing algorithm's run, split by the current input position, |
57 If we remove the details of the individual |
37 we have a sub-string that has been consumed, |
58 lexing steps, and use red and blue areas as before |
38 and the sub-string that has yet to come. |
59 to indicate consumed (seen) input and constructed |
39 We already know what was before, and this should be reflected in the value |
60 partial value (before recovering the rest of the stack), |
40 and the regular expression at that step as well. But this is not the |
61 one could see that the seen part's lexical information |
41 case for injection-based regular expression derivatives. |
62 is stored in the form of a regular expression. |
42 Take the regex $(aa)^* \cdot bc$ matching the string $aabc$ |
63 Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$ |
43 as an example, if we have just read the two former characters $aa$: |
64 and assume we have just read the two characters $aa$: |
44 |
65 \begin{center} |
45 \begin{center} |
|
46 \begin{envForCaption} |
|
47 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
66 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
48 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
67 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
49 {Consumed: $aa$ |
68 {Partial lexing info: $\ONE \cdot a \cdot (aa)^* \cdot bc$ etc. |
50 \nodepart{two} Not Yet Reached: $bc$ }; |
69 \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$}; |
|
70 \end{tikzpicture} |
|
71 \end{center} |
|
72 \noindent |
|
73 In the injection-based lexing algorithm, we ``neglect" the red area |
|
74 by putting all the characters we have consumed and |
|
75 intermediate regular expressions on the stack when |
|
76 we go from left to right in the derivative phase. |
|
77 The red area grows till the string is exhausted. |
|
78 During the injection phase, the value in the blue area |
|
79 is built up incrementally, while the red area shrinks. |
|
80 Before we have recovered all characters and intermediate |
|
81 derivative regular expressions from the stack, |
|
82 what values these characters and regular expressions correspond |
|
83 to are unknown: |
|
84 \begin{center} |
|
85 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
86 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},] |
|
87 {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$ |
|
88 \nodepart{two} $b c$ corresponds to $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$}; |
51 %\caption{term 1 \ref{term:1}'s matching configuration} |
89 %\caption{term 1 \ref{term:1}'s matching configuration} |
52 \end{tikzpicture} |
90 \end{tikzpicture} |
53 \caption{Partially matched String} |
91 \end{center} |
54 \end{envForCaption} |
92 \noindent |
55 \end{center} |
93 However, they should be calculable, |
56 %\caption{Input String}\label{StringPartial} |
94 as characters and regular expression shapes |
57 %\end{figure} |
95 after taking derivative w.r.t those characters |
58 |
96 have already been known, therefore in our example, |
59 \noindent |
97 we know that the value starts with two $a$s, |
60 We have the value that has already been partially calculated, |
98 and makes up to an iteration in a Kleene star: |
61 and the part that has yet to come: |
99 (We have put the injection-based lexing's partial |
62 \begin{center} |
100 result in the right part of the split rectangle |
63 \begin{envForCaption} |
101 to contrast it with the partial valued produced |
|
102 in a forward manner) |
|
103 \begin{center} |
64 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
104 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
65 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
105 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
66 {$\Seq(\Stars[\Char(a), \Char(a)], ???)$ |
106 {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$ |
67 \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$}; |
107 \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$ $\stackrel{Inj}{\longleftarrow}$}; |
68 %\caption{term 1 \ref{term:1}'s matching configuration} |
108 %\caption{term 1 \ref{term:1}'s matching configuration} |
69 \end{tikzpicture} |
109 \end{tikzpicture} |
70 \caption{Partially constructed Value} |
|
71 \end{envForCaption} |
|
72 \end{center} |
|
73 |
|
74 In the regex derivative part , (after simplification) |
|
75 all we have is just what is about to come: |
|
76 \begin{center} |
|
77 \begin{envForCaption} |
|
78 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
79 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},] |
|
80 {$???$ |
|
81 \nodepart{two} To Come: $b c$}; |
|
82 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
83 \end{tikzpicture} |
|
84 \caption{Derivative} |
|
85 \end{envForCaption} |
|
86 \end{center} |
|
87 \noindent |
|
88 The previous part is missing. |
|
89 How about keeping the partially constructed value |
|
90 attached to the front of the regular expression? |
|
91 \begin{center} |
|
92 \begin{envForCaption} |
|
93 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
94 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
95 {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$ |
|
96 \nodepart{two} To Come: $b c$}; |
|
97 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
98 \end{tikzpicture} |
|
99 \caption{Derivative} |
|
100 \end{envForCaption} |
|
101 \end{center} |
110 \end{center} |
102 \noindent |
111 \noindent |
103 If we do this kind of "attachment" |
112 If we do this kind of "attachment" |
104 and each time augment the attached partially |
113 and each time augment the attached partially |
105 constructed value when taking off a |
114 constructed value when taking off a |
106 character: |
115 character: |
107 \begin{center} |
116 \begin{center} |
108 \begin{envForCaption} |
117 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
118 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint) |
|
119 {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$ |
|
120 \nodepart{two} Remaining: $b c$}; |
|
121 \end{tikzpicture}\\ |
|
122 $\downarrow$\\ |
109 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
123 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
110 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
124 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
111 {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$ |
125 {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$ |
112 \nodepart{two} To Come: $c$}; |
126 \nodepart{two} Remaining: $c$}; |
113 \end{tikzpicture}\\ |
127 \end{tikzpicture}\\ |
|
128 $\downarrow$\\ |
114 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
129 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
115 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
130 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
116 {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$ |
131 {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$ |
117 \nodepart{two} EOF}; |
132 \nodepart{two} EOF}; |
118 \end{tikzpicture} |
133 \end{tikzpicture} |
119 \caption{After $\backslash b$ and $\backslash c$} |
|
120 \end{envForCaption} |
|
121 \end{center} |
134 \end{center} |
122 \noindent |
135 \noindent |
123 In the end we could recover the value without a backward phase. |
136 In the end we could recover the value without a backward phase. |
124 But (partial) values are a bit clumsy to stick together with a regular expression, so |
137 But (partial) values are a bit clumsy to stick together with a regular expression, so |
125 we instead use bit-codes to encode them. |
138 we instead use bit-codes to encode them. |
126 |
139 |
127 Bits and bitcodes (lists of bits) are defined as: |
140 Bits and bitcodes (lists of bits) are defined as: |
128 \begin{envForCaption} |
|
129 \begin{center} |
141 \begin{center} |
130 $b ::= S \mid Z \qquad |
142 $b ::= S \mid Z \qquad |
131 bs ::= [] \mid b::bs |
143 bs ::= [] \mid b::bs |
132 $ |
144 $ |
133 \end{center} |
145 \end{center} |
134 \caption{Bit-codes datatype} |
|
135 \end{envForCaption} |
|
136 |
146 |
137 \noindent |
147 \noindent |
138 Using $S$ and $Z$ rather than $1$ and $0$ is to avoid |
148 Using $S$ and $Z$ rather than $1$ and $0$ is to avoid |
139 confusion with the regular expressions $\ZERO$ and $\ONE$. |
149 confusion with the regular expressions $\ZERO$ and $\ONE$. |
140 Bitcodes (or |
150 Bitcodes (or |
141 bit-lists) can be used to encode values (or potentially incomplete values) in a |
151 bit-lists) can be used to encode values (or potentially incomplete values) in a |
142 compact form. This can be straightforwardly seen in the following |
152 compact form. This can be straightforwardly seen in the following |
143 coding function from values to bitcodes: |
153 coding function from values to bitcodes: |
144 \begin{envForCaption} |
|
145 \begin{center} |
154 \begin{center} |
146 \begin{tabular}{lcl} |
155 \begin{tabular}{lcl} |
147 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
156 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
148 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
157 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
149 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ |
158 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ |