14 find out \emph{how} a regular expression matched a string? |
14 find out \emph{how} a regular expression matched a string? |
15 Answering this question will help us with the problem we are |
15 Answering this question will help us with the problem we are |
16 after, namely tokenising an input string, that is splitting it |
16 after, namely tokenising an input string, that is splitting it |
17 up into its ``word'' components. The algorithm we will be |
17 up into its ``word'' components. The algorithm we will be |
18 looking at was designed by Sulzmann \& Lu in a rather recent |
18 looking at was designed by Sulzmann \& Lu in a rather recent |
19 paper. A link to it is provided at KEATS, in case you are |
19 paper. A link to it is provided on KEATS, in case you are |
20 interested.\footnote{In my humble opinion this is an |
20 interested.\footnote{In my humble opinion this is an |
21 interesting instance of the research literature: it contains a |
21 interesting instance of the research literature: it contains a |
22 very neat idea, but its presentation is rather sloppy. |
22 very neat idea, but its presentation is rather sloppy. In |
23 Students and I found several rather annoying typos in their |
23 earlier versions of their paper, students and I found several |
24 examples and definitions.} |
24 rather annoying typos in their examples and definitions.} |
25 |
25 |
26 In order to give an answer for how a regular expression matched |
26 In order to give an answer for how a regular expression matched |
27 a string, Sulzmann and Lu introduce \emph{values}. A value |
27 a string, Sulzmann and Lu introduce \emph{values}. A value |
28 will be the output of the algorithm whenever the regular |
28 will be the output of the algorithm whenever the regular |
29 expression matches the string. If not an error will be raised. |
29 expression matches the string. If not, an error will be raised. |
30 Since the first phase of the algorithm is identical to the |
30 Since the first phase of the algorithm is identical to the |
31 derivative based matcher from the first coursework, the |
31 derivative based matcher from the first coursework, the |
32 function $nullable$ will be used to decide whether as string |
32 function $nullable$ will be used to decide whether as string |
33 is matched by a regular expression. |
33 is matched by a regular expression. If $nullable$ says |
34 |
34 yes, then values are constructed that reflect how the regular |
35 |
35 expression matched the string. The definitions for regular |
36 |
36 expressions $r$ and values $v$ is shown below: |
37 |
37 |
|
38 \begin{center} |
|
39 \begin{tabular}{cc} |
|
40 \begin{tabular}{@{}rrl@{}} |
|
41 $r$ & $::=$ & $\varnothing$\\ |
|
42 & $\mid$ & $\epsilon$ \\ |
|
43 & $\mid$ & $c$ \\ |
|
44 & $\mid$ & $r_1 \cdot r_2$\\ |
|
45 & $\mid$ & $r_1 + r_2$ \\ |
|
46 \\ |
|
47 & $\mid$ & $r^*$ \\ |
|
48 \end{tabular} |
|
49 & |
|
50 \begin{tabular}{@{\hspace{0mm}}rrl@{}} |
|
51 $v$ & $::=$ & \\ |
|
52 & & $Empty$ \\ |
|
53 & $\mid$ & $Char(c)$ \\ |
|
54 & $\mid$ & $Seq(v_1,v_2)$\\ |
|
55 & $\mid$ & $Left(v)$ \\ |
|
56 & $\mid$ & $Right(v)$ \\ |
|
57 & $\mid$ & $[v_1,\ldots\,v_n]$ \\ |
|
58 \end{tabular} |
|
59 \end{tabular} |
|
60 \end{center} |
|
61 |
|
62 \noindent As you can see there is a very strong correspondence |
|
63 between regular expressions and values. There is no value for |
|
64 the $\varnothing$ regular expression (since it does not match |
|
65 any string). Then there is exactly one value corresponding to |
|
66 each regular expression. with the exception of $r_1 + r_2$ |
|
67 where there are two values $Left(v)$ and $Right(v)$ |
|
68 corresponding to the two alternatives. Note that $r^*$ |
|
69 is associated with a list of values, one for each copy of $r$ |
|
70 that was needed to match the string. This mean we might also |
|
71 return the empty list $[]$, if no copy was needed. |
|
72 |
|
73 Graphically the algorithm can be represneted by the |
|
74 picture in Figure~\ref{Sulz} where the path involving $der/nullable$ is the first |
|
75 phase of the algorithm and $mkeps/inj$ the second phase. |
|
76 This picture shows the steps required when a regular |
|
77 expression, say $r_1$, matches the string $abc$. We first |
|
78 build the three derivatives (according to $a$, $b$ and $c$). |
|
79 We then use $nullable$ to find out whether the resulting |
|
80 regular expression can match the empty string. If yes |
|
81 we call the function $mkeps$. |
|
82 |
|
83 \begin{figure}[t] |
|
84 \begin{center} |
|
85 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
|
86 every node/.style={minimum size=7mm}] |
|
87 \node (r1) {$r_1$}; |
|
88 \node (r2) [right=of r1]{$r_2$}; |
|
89 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$}; |
|
90 \node (r3) [right=of r2]{$r_3$}; |
|
91 \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$}; |
|
92 \node (r4) [right=of r3]{$r_4$}; |
|
93 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$}; |
|
94 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}}; |
|
95 \node (v4) [below=of r4]{$v_4$}; |
|
96 \draw[->,line width=1mm](r4) -- (v4); |
|
97 \node (v3) [left=of v4] {$v_3$}; |
|
98 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$}; |
|
99 \node (v2) [left=of v3]{$v_2$}; |
|
100 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$}; |
|
101 \node (v1) [left=of v2] {$v_1$}; |
|
102 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$}; |
|
103 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}}; |
|
104 \end{tikzpicture} |
|
105 \end{center} |
|
106 \caption{The two phases of the algorithm by Sulzmann \& Lu. |
|
107 \label{Sulz}} |
|
108 \end{figure} |
|
109 |
|
110 The $mkeps$ function calculates a value for how a regular |
|
111 expression could have matched the empty string. Its definition |
|
112 is as follows: |
|
113 |
|
114 \begin{center} |
|
115 \begin{tabular}{lcl} |
|
116 $mkeps(\epsilon)$ & $\dn$ & $Empty$\\ |
|
117 $mkeps(r_1 + r_2)$ & $\dn$ & if $nullable(r_1)$ \\ |
|
118 & & then $Left(mkeps(r_1))$\\ |
|
119 & & else $Right(mkeps(r_2))$\\ |
|
120 $mkeps(r_1 \cdot r_2)$ & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\ |
|
121 $mkeps(r^*)$ & $\dn$ & $[]$ \\ |
|
122 \end{tabular} |
|
123 \end{center} |
|
124 |
|
125 \noindent There are no cases for $\epsilon$ and $c$, since |
|
126 these regular expression cannot match the empty string. |
|
127 Note that in case of alternatives we give preference to the |
|
128 regular expression on the left-hand side. This will become |
|
129 important later on. |
|
130 |
|
131 The algorithm is organised recursively such that it will |
|
132 calculate a value for how the derivative regular expression |
|
133 has matched a string where the first character has been |
|
134 chopped off. Now we need a function that reverses this |
|
135 ``chopping off'' for values. The corresponding function |
|
136 is called $inj$ for injection. This function takes |
|
137 three arguments: the first one is a regular expression |
|
138 for which we want to calculate the value, the second is the |
|
139 character we want to inject and the third argument is the |
|
140 value where we will inject the character. The result |
|
141 of this function is a new value. The definition of $inj$ |
|
142 is as follows: |
|
143 |
|
144 \begin{center} |
|
145 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
146 $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
|
147 $inj\,(r_1 + r_2)\,c\,Left(v)$ & $\dn$ & $Left(inj\,r_1\,c\,v)$\\ |
|
148 $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(inj\,r_2\,c\,v)$\\ |
|
149 $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ |
|
150 $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ |
|
151 $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\ |
|
152 $inj\,(r^*)\,c\,Seq(v,vs)$ & $\dn$ & $inj\,r\,c\,v\,::\,vs$\\ |
|
153 \end{tabular} |
|
154 \end{center} |
|
155 |
|
156 \noindent This definition is by recursion on the |
|
157 regular expression and by analysing the shape of the values. |
|
158 Therefore there are, for example, three cases for sequnece |
|
159 regular expressions. The last clause returns a list where |
|
160 the first element is $inj\,r\,c\,v$ and the other elements |
|
161 are $vs$. |
|
162 |
|
163 To understand what is going on, it might be best to do some |
|
164 example calculations and compare with Figure~\ref{Sulz}. For |
|
165 this note that we have not yet dealt with the need of |
|
166 simplifying regular expreesions (this will be a topic on its |
|
167 own later). Suppose the regular expression is $a \cdot (b |
|
168 \cdot c)$ and the input string is $abc$. The derivatives are |
|
169 as follows: |
|
170 |
|
171 \begin{center} |
|
172 \begin{tabular}{ll} |
|
173 $r_1$: & $a \cdot (b \cdot c)$\\ |
|
174 $r_2$: & $\epsilon \cdot (b \cdot c)$\\ |
|
175 $r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\ |
|
176 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\ |
|
177 \end{tabular} |
|
178 \end{center} |
|
179 |
|
180 \noindent According to the simple algorithm. we would test |
|
181 whether $r_4$ is nullable, which it is. This means we can |
|
182 use the function $mkeps$ to calculate a value for how $r_4$ |
|
183 was able to match the empty string. Remember that this |
|
184 function gives preference for alternatives on the left-hand |
|
185 side. However there is only $\epsilon$ on the very right-hand |
|
186 side of $r_4$ that matches the empty string. Therefore |
|
187 $mkeps$ returns the value |
|
188 |
|
189 \begin{center} |
|
190 $v_4:\;Right(Right(Empty))$ |
|
191 \end{center} |
|
192 |
|
193 \noindent The point is that from this value we can directly |
|
194 read off which part of $r_4$ matched the empty string. Next we |
|
195 have to ``inject'' the last character, that is $c$ in the |
|
196 running example, into this value in order to calculate how |
|
197 $r_3$ could have matched the string $c$. According to the |
|
198 definition of $inj$ we obtain |
|
199 |
|
200 \begin{center} |
|
201 $v_3:\;Right(Seq(Empty, Char(c)))$ |
|
202 \end{center} |
|
203 |
|
204 \noindent This is the correct result, because $r_3$ needs |
|
205 to use the right-hand alternative, and then $\epsilon$ needs |
|
206 to match the empty string and $c$ needs to match $c$. |
|
207 Next we need to inject back the letter $b$ into $v_3$. This |
|
208 gives |
|
209 |
|
210 \begin{center} |
|
211 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$ |
|
212 \end{center} |
|
213 |
|
214 \noindent Finally we need to inject back the letter $a$ into |
|
215 $v_2$ giving the final result |
|
216 |
|
217 \begin{center} |
|
218 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ |
|
219 \end{center} |
|
220 |
|
221 \noindent This now corresponds to how the regular |
|
222 expression $a \cdot (b \cdot c)$ matched the string $abc$. |
|
223 This is the expected result. So at least in this case the |
|
224 algorithm seems to calculate what it is supposed to. |
|
225 |
|
226 There are a few auxiliary function that are of interest |
|
227 in analysing this algorithm. One is called \emph{flatten}, |
|
228 written $|\_|$, which extracts the string ``underlying'' a |
|
229 value. It is defined as |
|
230 |
|
231 \begin{center} |
|
232 \begin{tabular}{lcl} |
|
233 $|Empty|$ & $\dn$ & $[]$\\ |
|
234 $|Char(c)|$ & $\dn$ & $[c]$\\ |
|
235 $|Left(v)|$ & $\dn$ & $|v|$\\ |
|
236 $|Right(v)|$ & $\dn$ & $|v|$\\ |
|
237 $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ |
|
238 $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\ |
|
239 \end{tabular} |
|
240 \end{center} |
|
241 |
|
242 \noindent Using flatten we can see what is the string behind |
|
243 the values calculated by $mkeps$ and $inj$ in our running |
|
244 example: |
|
245 |
|
246 \begin{center} |
|
247 \begin{tabular}{ll} |
|
248 $v_4$: & $[]$\\ |
|
249 $v_3$: & $c$\\ |
|
250 $v_2$: & $bc$\\ |
|
251 $v_1$: & $abc$ |
|
252 \end{tabular} |
|
253 \end{center} |
|
254 |
|
255 \noindent |
|
256 This indicates that $inj$ indeed is injecting, or adding, back |
|
257 a character into the value. |
|
258 |
|
259 |
|
260 \subsubsection*{Simplification} |
|
261 |
|
262 Generally the matching algorithms based on derivatives do |
|
263 poorly unless the regular expressions are simplified after |
|
264 each derivatives step. |
|
265 |
|
266 \newpage |
38 Algorithm by Sulzmann, Lexing |
267 Algorithm by Sulzmann, Lexing |
39 |
268 |
40 \end{document} |
269 \end{document} |
41 |
270 |
42 %%% Local Variables: |
271 %%% Local Variables: |