11 So far our algorithm based on derivatives was only able to say |
11 So far our algorithm based on derivatives was only able to say |
12 yes or no depending on whether a string was matched by regular |
12 yes or no depending on whether a string was matched by regular |
13 expression or not. Often a more interesting question is to |
13 expression or not. Often a more interesting question is to |
14 find out \emph{how} a regular expression matched a string? |
14 find out \emph{how} a regular expression matched a string? |
15 Answering this question will help us with the problem we are |
15 Answering this question will help us with the problem we are |
16 after, namely tokenising an input string, that is splitting it |
16 after, namely tokenising an input string. The algorithm we |
17 up into its ``word'' components. The algorithm we will be |
17 will be looking at for this was designed by Sulzmann \& Lu in |
18 looking at was designed by Sulzmann \& Lu in a rather recent |
18 a rather recent paper. A link to it is provided on KEATS, in |
19 paper. A link to it is provided on KEATS, in case you are |
19 case you are interested.\footnote{In my humble opinion this is |
20 interested.\footnote{In my humble opinion this is an |
20 an interesting instance of the research literature: it |
21 interesting instance of the research literature: it contains a |
21 contains a very neat idea, but its presentation is rather |
22 very neat idea, but its presentation is rather sloppy. In |
22 sloppy. In earlier versions of their paper, students and I |
23 earlier versions of their paper, students and I found several |
23 found several rather annoying typos in their examples and |
24 rather annoying typos in their examples and definitions.} |
24 definitions.} |
25 |
25 |
26 In order to give an answer for how a regular expression matched |
26 In order to give an answer for how a regular expression |
27 a string, Sulzmann and Lu introduce \emph{values}. A value |
27 matched a string, Sulzmann and Lu introduce \emph{values}. A |
28 will be the output of the algorithm whenever the regular |
28 value will be the output of the algorithm whenever the regular |
29 expression matches the string. If not, an error will be raised. |
29 expression matches the string. If not, an error will be |
30 Since the first phase of the algorithm is identical to the |
30 raised. Since the first phase of the algorithm by Sulzmann \& |
31 derivative based matcher from the first coursework, the |
31 Lu is identical to the derivative based matcher from the first |
32 function $nullable$ will be used to decide whether as string |
32 coursework, the function $nullable$ will be used to decide |
33 is matched by a regular expression. If $nullable$ says |
33 whether as string is matched by a regular expression. If |
34 yes, then values are constructed that reflect how the regular |
34 $nullable$ says yes, then values are constructed that reflect |
35 expression matched the string. The definitions for regular |
35 how the regular expression matched the string. The definitions |
36 expressions $r$ and values $v$ is shown below: |
36 for regular expressions $r$ and values $v$ is shown next to |
|
37 each other below: |
37 |
38 |
38 \begin{center} |
39 \begin{center} |
39 \begin{tabular}{cc} |
40 \begin{tabular}{cc} |
40 \begin{tabular}{@{}rrl@{}} |
41 \begin{tabular}{@{}rrl@{}} |
|
42 \multicolumn{3}{c}{regular expressions}\\ |
41 $r$ & $::=$ & $\varnothing$\\ |
43 $r$ & $::=$ & $\varnothing$\\ |
42 & $\mid$ & $\epsilon$ \\ |
44 & $\mid$ & $\epsilon$ \\ |
43 & $\mid$ & $c$ \\ |
45 & $\mid$ & $c$ \\ |
44 & $\mid$ & $r_1 \cdot r_2$\\ |
46 & $\mid$ & $r_1 \cdot r_2$\\ |
45 & $\mid$ & $r_1 + r_2$ \\ |
47 & $\mid$ & $r_1 + r_2$ \\ |
46 \\ |
48 \\ |
47 & $\mid$ & $r^*$ \\ |
49 & $\mid$ & $r^*$ \\ |
48 \end{tabular} |
50 \end{tabular} |
49 & |
51 & |
50 \begin{tabular}{@{\hspace{0mm}}rrl@{}} |
52 \begin{tabular}{@{\hspace{0mm}}rrl@{}} |
51 $v$ & $::=$ & \\ |
53 \multicolumn{3}{c}{values}\\ |
|
54 $v$ & $::=$ & \\ |
52 & & $Empty$ \\ |
55 & & $Empty$ \\ |
53 & $\mid$ & $Char(c)$ \\ |
56 & $\mid$ & $Char(c)$ \\ |
54 & $\mid$ & $Seq(v_1,v_2)$\\ |
57 & $\mid$ & $Seq(v_1,v_2)$\\ |
55 & $\mid$ & $Left(v)$ \\ |
58 & $\mid$ & $Left(v)$ \\ |
56 & $\mid$ & $Right(v)$ \\ |
59 & $\mid$ & $Right(v)$ \\ |
57 & $\mid$ & $[v_1,\ldots\,v_n]$ \\ |
60 & $\mid$ & $[v_1,\ldots\,v_n]$ \\ |
58 \end{tabular} |
61 \end{tabular} |
59 \end{tabular} |
62 \end{tabular} |
60 \end{center} |
63 \end{center} |
61 |
64 |
62 \noindent As you can see there is a very strong correspondence |
65 \noindent The point is that there is a very strong |
63 between regular expressions and values. There is no value for |
66 correspondence between them. There is no value for the |
64 the $\varnothing$ regular expression (since it does not match |
67 $\varnothing$ regular expression, since it does not match any |
65 any string). Then there is exactly one value corresponding to |
68 string. Otherwise there is exactly one value corresponding to |
66 each regular expression. with the exception of $r_1 + r_2$ |
69 each regular expression with the exception of $r_1 + r_2$ |
67 where there are two values $Left(v)$ and $Right(v)$ |
70 where there are two values, namely $Left(v)$ and $Right(v)$ |
68 corresponding to the two alternatives. Note that $r^*$ |
71 corresponding to the two alternatives. Note that $r^*$ is |
69 is associated with a list of values, one for each copy of $r$ |
72 associated with a list of values, one for each copy of $r$ |
70 that was needed to match the string. This mean we might also |
73 that was needed to match the string. This means we might also |
71 return the empty list $[]$, if no copy was needed. |
74 return the empty list $[]$, if no copy was needed. |
72 |
75 |
73 Graphically the algorithm can be represneted by the |
76 Graphically the algorithm by Sulzmann \& Lu can be represneted |
74 picture in Figure~\ref{Sulz} where the path involving $der/nullable$ is the first |
77 by the picture in Figure~\ref{Sulz} where the path from the |
75 phase of the algorithm and $mkeps/inj$ the second phase. |
78 left to the right involving $der/nullable$ is the first phase |
76 This picture shows the steps required when a regular |
79 of the algorithm and $mkeps/inj$, the path from right to left, |
77 expression, say $r_1$, matches the string $abc$. We first |
80 the second phase. This picture shows the steps required when a |
78 build the three derivatives (according to $a$, $b$ and $c$). |
81 regular expression, say $r_1$, matches the string $abc$. We |
79 We then use $nullable$ to find out whether the resulting |
82 first build the three derivatives (according to $a$, $b$ and |
80 regular expression can match the empty string. If yes |
83 $c$). We then use $nullable$ to find out whether the resulting |
81 we call the function $mkeps$. |
84 regular expression can match the empty string. If yes we call |
|
85 the function $mkeps$. |
82 |
86 |
83 \begin{figure}[t] |
87 \begin{figure}[t] |
84 \begin{center} |
88 \begin{center} |
85 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
89 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
86 every node/.style={minimum size=7mm}] |
90 every node/.style={minimum size=7mm}] |
121 $mkeps(r^*)$ & $\dn$ & $[]$ \\ |
125 $mkeps(r^*)$ & $\dn$ & $[]$ \\ |
122 \end{tabular} |
126 \end{tabular} |
123 \end{center} |
127 \end{center} |
124 |
128 |
125 \noindent There are no cases for $\epsilon$ and $c$, since |
129 \noindent There are no cases for $\epsilon$ and $c$, since |
126 these regular expression cannot match the empty string. |
130 these regular expression cannot match the empty string. Note |
127 Note that in case of alternatives we give preference to the |
131 also that in case of alternatives we give preference to the |
128 regular expression on the left-hand side. This will become |
132 regular expression on the left-hand side. This will become |
129 important later on. |
133 important later on. |
130 |
134 |
131 The algorithm is organised recursively such that it will |
135 The second phase of the algorithm is organised recursively |
132 calculate a value for how the derivative regular expression |
136 such that it will calculate a value for how the derivative |
133 has matched a string where the first character has been |
137 regular expression has matched a string whose first character |
134 chopped off. Now we need a function that reverses this |
138 has been chopped off. Now we need a function that reverses |
135 ``chopping off'' for values. The corresponding function |
139 this ``chopping off'' for values. The corresponding function |
136 is called $inj$ for injection. This function takes |
140 is called $inj$ for injection. This function takes three |
137 three arguments: the first one is a regular expression |
141 arguments: the first one is a regular expression for which we |
138 for which we want to calculate the value, the second is the |
142 want to calculate the value, the second is the character we |
139 character we want to inject and the third argument is the |
143 want to inject and the third argument is the value where we |
140 value where we will inject the character. The result |
144 will inject the character. The result of this function is a |
141 of this function is a new value. The definition of $inj$ |
145 new value. The definition of $inj$ is as follows: |
142 is as follows: |
|
143 |
146 |
144 \begin{center} |
147 \begin{center} |
145 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
148 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
146 $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
149 $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
147 $inj\,(r_1 + r_2)\,c\,Left(v)$ & $\dn$ & $Left(inj\,r_1\,c\,v)$\\ |
150 $inj\,(r_1 + r_2)\,c\,Left(v)$ & $\dn$ & $Left(inj\,r_1\,c\,v)$\\ |
151 $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\ |
154 $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\ |
152 $inj\,(r^*)\,c\,Seq(v,vs)$ & $\dn$ & $inj\,r\,c\,v\,::\,vs$\\ |
155 $inj\,(r^*)\,c\,Seq(v,vs)$ & $\dn$ & $inj\,r\,c\,v\,::\,vs$\\ |
153 \end{tabular} |
156 \end{tabular} |
154 \end{center} |
157 \end{center} |
155 |
158 |
156 \noindent This definition is by recursion on the |
159 \noindent This definition is by recursion on the regular |
157 regular expression and by analysing the shape of the values. |
160 expression and by analysing the shape of the values. Therefore |
158 Therefore there are, for example, three cases for sequnece |
161 there are, for example, three cases for sequnece regular |
159 regular expressions. The last clause returns a list where |
162 expressions. The last clause for the star regular expression |
160 the first element is $inj\,r\,c\,v$ and the other elements |
163 returns a list where the first element is $inj\,r\,c\,v$ and |
161 are $vs$. |
164 the other elements are $vs$. That mean $\_\,::\,\_$ should be |
|
165 read as list cons. |
162 |
166 |
163 To understand what is going on, it might be best to do some |
167 To understand what is going on, it might be best to do some |
164 example calculations and compare with Figure~\ref{Sulz}. For |
168 example calculations and compare with Figure~\ref{Sulz}. For |
165 this note that we have not yet dealt with the need of |
169 this note that we have not yet dealt with the need of |
166 simplifying regular expreesions (this will be a topic on its |
170 simplifying regular expressions (this will be a topic on its |
167 own later). Suppose the regular expression is $a \cdot (b |
171 own later). Suppose the regular expression is $a \cdot (b |
168 \cdot c)$ and the input string is $abc$. The derivatives are |
172 \cdot c)$ and the input string is $abc$. The derivatives from |
169 as follows: |
173 the first phase are as follows: |
170 |
174 |
171 \begin{center} |
175 \begin{center} |
172 \begin{tabular}{ll} |
176 \begin{tabular}{ll} |
173 $r_1$: & $a \cdot (b \cdot c)$\\ |
177 $r_1$: & $a \cdot (b \cdot c)$\\ |
174 $r_2$: & $\epsilon \cdot (b \cdot c)$\\ |
178 $r_2$: & $\epsilon \cdot (b \cdot c)$\\ |
175 $r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\ |
179 $r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\ |
176 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\ |
180 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\ |
177 \end{tabular} |
181 \end{tabular} |
178 \end{center} |
182 \end{center} |
179 |
183 |
180 \noindent According to the simple algorithm. we would test |
184 \noindent According to the simple algorithm, we would test |
181 whether $r_4$ is nullable, which it is. This means we can |
185 whether $r_4$ is nullable, which in this case it is. This |
182 use the function $mkeps$ to calculate a value for how $r_4$ |
186 means we can use the function $mkeps$ to calculate a value for |
183 was able to match the empty string. Remember that this |
187 how $r_4$ was able to match the empty string. Remember that |
184 function gives preference for alternatives on the left-hand |
188 this function gives preference for alternatives on the |
185 side. However there is only $\epsilon$ on the very right-hand |
189 left-hand side. However there is only $\epsilon$ on the very |
186 side of $r_4$ that matches the empty string. Therefore |
190 right-hand side of $r_4$ that matches the empty string. |
187 $mkeps$ returns the value |
191 Therefore $mkeps$ returns the value |
188 |
192 |
189 \begin{center} |
193 \begin{center} |
190 $v_4:\;Right(Right(Empty))$ |
194 $v_4:\;Right(Right(Empty))$ |
191 \end{center} |
195 \end{center} |
192 |
196 |
193 \noindent The point is that from this value we can directly |
197 \noindent The point is that from this value we can directly |
194 read off which part of $r_4$ matched the empty string. Next we |
198 read off which part of $r_4$ matched the empty string. Next we |
195 have to ``inject'' the last character, that is $c$ in the |
199 have to ``inject'' the last character, that is $c$ in the |
196 running example, into this value in order to calculate how |
200 running example, into this value $v_4$ in order to calculate |
197 $r_3$ could have matched the string $c$. According to the |
201 how $r_3$ could have matched the string $c$. According to the |
198 definition of $inj$ we obtain |
202 definition of $inj$ we obtain |
199 |
203 |
200 \begin{center} |
204 \begin{center} |
201 $v_3:\;Right(Seq(Empty, Char(c)))$ |
205 $v_3:\;Right(Seq(Empty, Char(c)))$ |
202 \end{center} |
206 \end{center} |
259 |
261 |
260 \subsubsection*{Simplification} |
262 \subsubsection*{Simplification} |
261 |
263 |
262 Generally the matching algorithms based on derivatives do |
264 Generally the matching algorithms based on derivatives do |
263 poorly unless the regular expressions are simplified after |
265 poorly unless the regular expressions are simplified after |
264 each derivatives step. |
266 each derivatives step. But this is a bit more involved in |
|
267 algorithm of Sulzmann \& Lu. Consider the last derivation |
|
268 step in our running example |
|
269 |
|
270 \begin{center} |
|
271 $r_4 = der\,c\,r_3$ |
|
272 \end{center} |
|
273 |
|
274 \noindent |
|
275 Simplifying the result would just give use $\epsilon$. |
|
276 Running $mkeps$ on this regular expression would provide |
|
277 us with $Empty$ instead of $Right(Right(Empty))$ that |
|
278 was obtained without the simplification. The problem is |
|
279 we need to recreate this value, rather than just $Empty$. |
|
280 |
|
281 This requires what I call \emph{rectification functions}. They |
|
282 need to be calculated whenever a regular expression gets |
|
283 simplified. Rectification functions take a value as argument |
|
284 and return a (rectified) value. Our simplification rules so |
|
285 far are |
|
286 |
|
287 \begin{center} |
|
288 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l} |
|
289 $r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ |
|
290 $\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ |
|
291 $r \cdot \epsilon$ & $\mapsto$ & $r$\\ |
|
292 $\epsilon \cdot r$ & $\mapsto$ & $r$\\ |
|
293 $r + \varnothing$ & $\mapsto$ & $r$\\ |
|
294 $\varnothing + r$ & $\mapsto$ & $r$\\ |
|
295 $r + r$ & $\mapsto$ & $r$\\ |
|
296 \end{tabular} |
|
297 \end{center} |
|
298 |
|
299 \noindent Applying them to $r_4$ will require several nested |
|
300 simplifications in order end up with just $\epsilon$. We can |
|
301 implement this by letting simp return not just a (simplified) |
|
302 regular expression, but also a rectification function. Let us |
|
303 consider the alternative case, say $r_1 + r_2$, first. We |
|
304 would first simplify the component regular expressions $r_1$ |
|
305 and $r_2.$ This will return simplified versions (if they can |
|
306 be simplified), say $r_{1s}$ and $r_{2s}$, but also two |
|
307 rectification functions $f_{1s}$ and $f_{2s}$. We need to |
|
308 assemble them in order to obtain a rectified value for |
|
309 $r_1 + r_2$. |
|
310 |
|
311 In case $r_{1s}$ simplified to $\varnothing$, we would |
|
312 continue the derivative calculation with $r_{2s}$. The |
|
313 Sulzmann \& Lu algorithm would return a corresponding value, |
|
314 say $v_{2s}$. |
|
315 |
|
316 |
|
317 \subsubsection*{Records and Tokenisation} |
265 |
318 |
266 \newpage |
319 \newpage |
267 Algorithm by Sulzmann, Lexing |
320 Algorithm by Sulzmann, Lexing |
268 |
321 |
269 \end{document} |
322 \end{document} |