21 provided on KEATS, in case you are interested.\footnote{In my |
22 provided on KEATS, in case you are interested.\footnote{In my |
22 humble opinion this is an interesting instance of the research |
23 humble opinion this is an interesting instance of the research |
23 literature: it contains a very neat idea, but its presentation |
24 literature: it contains a very neat idea, but its presentation |
24 is rather sloppy. In earlier versions of this paper, a King's |
25 is rather sloppy. In earlier versions of this paper, a King's |
25 undergraduate student and I found several rather annoying typos in the |
26 undergraduate student and I found several rather annoying typos in the |
26 examples and definitions.} My PhD student Fahad Ausaf and I even more recently |
27 examples and definitions.} My former PhD student Fahad Ausaf and I even more recently |
27 wrote a paper where we build on their result: we provided a |
28 wrote a paper where we build on their result: we provided a |
28 mathematical proof that their algorithm is really correct---the proof |
29 mathematical proof that their algorithm is really correct---the proof |
29 Sulzmann \& Lu had originally given contained major flaws. Such correctness |
30 Sulzmann \& Lu had originally given contained major flaws. Such correctness |
30 proofs are important: Kuklewicz maintains a unit-test library |
31 proofs are important: Kuklewicz maintains a unit-test library |
31 for the kind of algorithms we are interested in here and he showed |
32 for the kind of algorithms we are interested in here and he showed |
101 {../progs/app02.scala}} |
101 {../progs/app02.scala}} |
102 |
102 |
103 |
103 |
104 Graphically the algorithm by Sulzmann \& Lu can be illustrated |
104 Graphically the algorithm by Sulzmann \& Lu can be illustrated |
105 by the picture in Figure~\ref{Sulz} where the path from the |
105 by the picture in Figure~\ref{Sulz} where the path from the |
106 left to the right involving $\textit{der}/\textit{nullable}$ is the first phase |
106 left to the right involving $\der/\nullable$ is the first phase |
107 of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left, |
107 of the algorithm and $\textit{mkeps}/\inj$, the path from right to left, |
108 the second phase. This picture shows the steps required when a |
108 the second phase. This picture shows the steps required when a |
109 regular expression, say $r_1$, matches the string $abc$. We |
109 regular expression, say $r_1$, matches the string $abc$. We |
110 first build the three derivatives (according to $a$, $b$ and |
110 first build the three derivatives (according to $a$, $b$ and |
111 $c$). We then use $\textit{nullable}$ to find out whether the resulting |
111 $c$). We then use $\nullable$ to find out whether the resulting |
112 regular expression can match the empty string. If yes, we call |
112 regular expression can match the empty string. If yes, we call |
113 the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular |
113 the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular |
114 expression has matched the empty string. Its definition |
114 expression has matched the empty string. Its definition |
115 is as follows: |
115 is as follows: |
116 |
116 |
118 \begin{center} |
118 \begin{center} |
119 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
119 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
120 every node/.style={minimum size=7mm}] |
120 every node/.style={minimum size=7mm}] |
121 \node (r1) {$r_1$}; |
121 \node (r1) {$r_1$}; |
122 \node (r2) [right=of r1]{$r_2$}; |
122 \node (r2) [right=of r1]{$r_2$}; |
123 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$}; |
123 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\der\,a$}; |
124 \node (r3) [right=of r2]{$r_3$}; |
124 \node (r3) [right=of r2]{$r_3$}; |
125 \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$}; |
125 \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\der\,b$}; |
126 \node (r4) [right=of r3]{$r_4$}; |
126 \node (r4) [right=of r3]{$r_4$}; |
127 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$}; |
127 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\der\,c$}; |
128 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}}; |
128 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}}; |
129 \node (v4) [below=of r4]{$v_4$}; |
129 \node (v4) [below=of r4]{$v_4$}; |
130 \draw[->,line width=1mm](r4) -- (v4); |
130 \draw[->,line width=1mm](r4) -- (v4); |
131 \node (v3) [left=of v4] {$v_3$}; |
131 \node (v3) [left=of v4] {$v_3$}; |
132 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$}; |
132 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\inj\,c$}; |
133 \node (v2) [left=of v3]{$v_2$}; |
133 \node (v2) [left=of v3]{$v_2$}; |
134 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$}; |
134 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\inj\,b$}; |
135 \node (v1) [left=of v2] {$v_1$}; |
135 \node (v1) [left=of v2] {$v_1$}; |
136 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$}; |
136 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\inj\,a$}; |
137 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}}; |
137 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}}; |
138 \end{tikzpicture} |
138 \end{tikzpicture} |
139 \end{center} |
139 \end{center} |
140 \caption{The two phases of the algorithm by Sulzmann \& Lu. |
140 \caption{The two phases of the algorithm by Sulzmann \& Lu. |
141 \label{Sulz}} |
141 \label{Sulz}} |
163 The second phase of the algorithm is organised so that it will |
163 The second phase of the algorithm is organised so that it will |
164 calculate a value for how the derivative regular expression |
164 calculate a value for how the derivative regular expression |
165 has matched a string. For this we need a function that |
165 has matched a string. For this we need a function that |
166 reverses this ``chopping off'' for values which we did in the |
166 reverses this ``chopping off'' for values which we did in the |
167 first phase for derivatives. The corresponding function is |
167 first phase for derivatives. The corresponding function is |
168 called $\textit{inj}$ (short for injection). This function takes three |
168 called $\inj$ (short for injection). This function takes three |
169 arguments: the first one is a regular expression for which we |
169 arguments: the first one is a regular expression for which we |
170 want to calculate the value, the second is the character we |
170 want to calculate the value, the second is the character we |
171 want to inject and the third argument is the value where we |
171 want to inject and the third argument is the value where we |
172 will inject the character into. The result of this function is a |
172 will inject the character into. The result of this function is a |
173 new value. The definition of $\textit{inj}$ is as follows: |
173 new value. The definition of $\inj$ is as follows: |
174 |
174 |
175 \begin{center} |
175 \begin{center} |
176 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
176 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
177 $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
177 $\inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
178 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ |
178 $\inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\inj\,r_1\,c\,v)$\\ |
179 $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ |
179 $\inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\inj\,r_2\,c\,v)$\\ |
180 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ |
180 $\inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\ |
181 $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ |
181 $\inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\ |
182 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ |
182 $\inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\inj\,r_2\,c\,v)$\\ |
183 $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\ |
183 $\inj\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\ |
184 \end{tabular} |
184 \end{tabular} |
185 \end{center} |
185 \end{center} |
186 |
186 |
187 \noindent This definition is by recursion on the regular |
187 \noindent This definition is by recursion on the regular |
188 expression and by analysing the shape of the values. Therefore |
188 expression and by analysing the shape of the values. Therefore |
189 there are three cases for sequence regular |
189 there are three cases for sequence regular |
190 expressions (for all possible shapes of the value we can encounter). The last |
190 expressions (for all possible shapes of the value we can encounter). The last |
191 clause for the star regular expression returns a list where |
191 clause for the star regular expression returns a list where |
192 the first element is $\textit{inj}\,r\,c\,v$ and the other elements are |
192 the first element is $\inj\,r\,c\,v$ and the other elements are |
193 $vs$. That means $\_\,::\,\_$ should be read as list cons. |
193 $vs$. That means $\_\,::\,\_$ should be read as list cons. |
194 |
194 |
195 To understand what is going on, it might be best to do some |
195 To understand what is going on, it might be best to do some |
196 example calculations and compare them with Figure~\ref{Sulz}. |
196 example calculations and compare them with Figure~\ref{Sulz}. |
197 For this note that we have not yet dealt with the need of |
197 For this note that we have not yet dealt with the need of |
233 \noindent If there had been a $\ONE$ on the left, then |
233 \noindent If there had been a $\ONE$ on the left, then |
234 $\textit{mkeps}$ would have returned something of the form |
234 $\textit{mkeps}$ would have returned something of the form |
235 $\Left(\ldots)$. The point is that from this value we can |
235 $\Left(\ldots)$. The point is that from this value we can |
236 directly read off which part of $r_4$ matched the empty |
236 directly read off which part of $r_4$ matched the empty |
237 string: take the right-alternative first, and then the |
237 string: take the right-alternative first, and then the |
238 right-alternative again. |
238 right-alternative again, then you got to the $\ONE$. |
239 |
239 |
240 Next we have to ``inject'' the last character, that is $c$ in |
240 Next we have to ``inject'' the last character, that is $c$ in |
241 the running example, into this value $v_4$ in order to |
241 the running example, into this value $v_4$ in order to |
242 calculate how $r_3$ could have matched the string $c$. |
242 calculate how $r_3$ could have matched the string $c$. |
243 For this we call injection with $\textit{inj}(r_3, c, v_4)$. |
243 For this we call injection with $\inj(r_3, c, v_4)$. |
244 According to the definition of $\textit{inj}$ we obtain |
244 According to the definition of $\inj$ we obtain |
245 |
245 |
246 \begin{center} |
246 \begin{center} |
247 $v_3:\;Right(Seq(Empty, Char(c)))$ |
247 $v_3:\;Right(Seq(Empty, Char(c)))$ |
248 \end{center} |
248 \end{center} |
249 |
249 |
250 \noindent This is the correct result, because $r_3$ needs |
250 \noindent This is the correct result, because $r_3$ needs |
251 to use the right-hand alternative, and then $\ONE$ needs |
251 to use the right-hand alternative, and then $\ONE$ needs |
252 to match the empty string and $c$ needs to match $c$. |
252 to match the empty string and $c$ needs to match $c$. |
253 Next we need to inject back the letter $b$ into $v_3$. |
253 Next we need to inject back the letter $b$ into $v_3$. |
254 For this we call injection with $\textit{inj}(r_2, b, v_3)$. |
254 For this we call injection with $\inj(r_2, b, v_3)$. |
255 This gives |
255 This gives |
256 |
256 |
257 \begin{center} |
257 \begin{center} |
258 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$ |
258 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$ |
259 \end{center} |
259 \end{center} |
260 |
260 |
261 \noindent which is again the correct result for how $r_2$ |
261 \noindent which is again the correct result for how $r_2$ |
262 matched the string $bc$. Finally we need to inject back the |
262 matched the string $bc$. Finally we need to inject back the |
263 letter $a$ into $v_2$ giving the final result. |
263 letter $a$ into $v_2$ giving the final result. |
264 For this we call injection with $\textit{inj}(r_1, a, v_2)$ |
264 For this we call injection with $\inj(r_1, a, v_2)$ |
265 and obtain |
265 and obtain |
266 |
266 |
267 \begin{center} |
267 \begin{center} |
268 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ |
268 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ |
269 \end{center} |
269 \end{center} |
270 |
270 |
271 \noindent This value corresponds to how the regular expression $r_1$, |
271 \noindent This value corresponds to how the regular expression $r_1$, |
272 namely $a \cdot (b \cdot c)$, matched the string $abc$. |
272 namely $a \cdot (b \cdot c)$, matched the string $abc$. |
|
273 |
273 |
274 |
274 There are a few auxiliary functions that are of interest |
275 There are a few auxiliary functions that are of interest |
275 when analysing this algorithm. One is called \emph{flatten}, |
276 when analysing this algorithm. One is called \emph{flatten}, |
276 written $|\_|$, which extracts the string ``underlying'' a |
277 written $|\_|$, which extracts the string ``underlying'' a |
277 value. It is defined recursively as |
278 value. It is defined recursively as |
298 $|v_2|$: & $bc$\\ |
299 $|v_2|$: & $bc$\\ |
299 $|v_1|$: & $abc$ |
300 $|v_1|$: & $abc$ |
300 \end{tabular} |
301 \end{tabular} |
301 \end{center} |
302 \end{center} |
302 |
303 |
303 \noindent This indicates that $\textit{inj}$ indeed is injecting, or |
304 \noindent This indicates that $\inj$ indeed is injecting, or |
304 adding, back a character into the value. |
305 adding, back a character into the value. |
305 |
306 |
306 There is a problem, however, with the described algorithm |
307 The definition of $\inj$ might still be very puzzling and each clause |
307 so far: it is very slow. We need to include the simplification |
308 might look arbitrary, but there is in fact a relatively simple idea |
308 from Lecture 2. This is what we shall do next. |
309 behind it. Ultimately the $\inj$-functions is determined by the |
|
310 derivative functions. For this consider one of the ``squares'' from |
|
311 Figure~\ref{Sulz}: |
|
312 |
|
313 |
|
314 \begin{center} |
|
315 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
|
316 every node/.style={minimum size=7mm}] |
|
317 \node (r) {$r$}; |
|
318 \node (rd) [right=of r]{$r_{der}$}; |
|
319 \draw[->,line width=1mm](r)--(rd) node[above,midway] {$\der\,c$}; |
|
320 \node (vd) [below=of r2]{$v_{der}$}; |
|
321 \draw[->,line width=1mm](rd) -- (vd); |
|
322 \node (v) [left=of vd] {$v$}; |
|
323 \draw[->,line width=1mm](vd)--(v) node[below,midway] {$\inj\,c$}; |
|
324 \draw[->,line width=0.5mm,dotted](r) -- (v) node[left,midway,red] {\bf ?}; |
|
325 \end{tikzpicture} |
|
326 \end{center} |
|
327 |
|
328 \noindent |
|
329 The input to the $\inj$-function is $r$ (on the top left), $c$ (the |
|
330 character to be injected) and $v_{der}$ (bottom right). The output is |
|
331 the value $v$ for how the regular expression $r$ matched the |
|
332 corresponding string. How does $\inj$ calculate this value? Well, it has |
|
333 to analyse the value $v_{der}$ and transform it into the value $v$ for |
|
334 the regular expression $r$. The value $v_{der}$ corresponds to the |
|
335 $r_{der}$-regular expression which is the derivative of $r$. Remember |
|
336 that $v_{der}$ is the value for how $r_{der}$ matches the corresponding string |
|
337 where $c$ has been chopped off. |
|
338 |
|
339 Let $r$ be $r_1 + r_2$. Then $r_{der}$ |
|
340 is of the form $(\der\,c\,r_1) + (\der\,c\,r_2)$. What are the possible |
|
341 values corresponding to $r_{der}$? Well, they can be only of the form |
|
342 $\Left(\ldots)$ and $\Right(\ldots)$. Therefore you have two |
|
343 cases in the $\inj$ function -- one for $\Left$ and one for $\Right$. |
|
344 |
|
345 \begin{center} |
|
346 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l} |
|
347 $\inj\,(r_1 + r_2)\,c\,\,\Left(v)$ & $\dn$ & $\ldots$\\ |
|
348 $\inj\,(r_1 + r_2)\,c\,\,\Right(v)$ & $\dn$ & $\ldots$\\ |
|
349 \end{tabular} |
|
350 \end{center} |
|
351 |
|
352 \noindent |
|
353 Let's look next at the sequence case where $r = r_1 \cdot r_2$. What does the |
|
354 derivative of $r$ look like? According to the definition it is: |
|
355 |
|
356 \begin{center} |
|
357 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
358 $\der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\nullable (r_1)$\\ |
|
359 & & then $(\der\,c\,r_1) \cdot r_2 + \der\, c\, r_2$\\ |
|
360 & & else $(\der\, c\, r_1) \cdot r_2$\\ |
|
361 \end{tabular} |
|
362 \end{center} |
|
363 |
|
364 \noindent As you can see there is a derivative in the if-branch and another |
|
365 in the else-branch. In the if-branch we have an alternative of the form |
|
366 $\_ + \_$. We already know what the possible values are for such a regular |
|
367 expression, namely $\Left$ or $\Right$. Therefore we have in $\inj$ the |
|
368 two cases: |
|
369 |
|
370 \begin{center} |
|
371 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l} |
|
372 $\inj\,(r_1 \cdot r_2)\,c\,\,\Left(Seq(v_1,v_2))$ & $\dn$ & $\ldots$\\ |
|
373 $\inj\,(r_1 \cdot r_2)\,c\,\,Right(v)$ & $\dn$ & $\ldots$\\ |
|
374 \end{tabular} |
|
375 \end{center} |
|
376 |
|
377 \noindent |
|
378 In the first case we even know that it is not just a $\Left$-value, but |
|
379 $\Left(Seq(\ldots))$, because the corresponding regular expression |
|
380 in the derivation is \mbox{$(\der\,c\,r_1) \cdot r_2$}. Hence we know |
|
381 it must be a $Seq$-value enclosed inside $\Left$. |
|
382 The third clause for $r_1\cdot r_2$ in the $\inj$-function |
|
383 |
|
384 \begin{center} |
|
385 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
386 $\inj\,(r_1 \cdot r_2)\,c\,\,Seq(v_1,v_2)$ & $\dn$ & $\ldots$\\ |
|
387 \end{tabular} |
|
388 \end{center} |
|
389 |
|
390 \noindent corresponds to the else-branch in the derivative. In this |
|
391 case we know the derivative is of the form $(\der\,c\,r_1) \cdot r_2$ and |
|
392 therefore the value must be of the form $Seq(\ldots)$. |
|
393 |
|
394 Hopefully the $inj$-function makes now more sense. I let you explain |
|
395 the $r^*$ case. What do the derivative of $r^*$ and |
|
396 the corresponding value look like? Does this explain the shape of |
|
397 the clause? |
|
398 |
|
399 \begin{center} |
|
400 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
401 $\inj\,(r^*)\,c\,\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\ |
|
402 \end{tabular} |
|
403 \end{center} |
|
404 |
|
405 \noindent If yes, you made sense of the left-hand sides of |
|
406 the $\inj$-definition. |
|
407 |
|
408 How about the right-hand sides? Well, in the |
|
409 $r^*$ case for example we have according to the square in the picture |
|
410 above a value $v_{der}$ which says how the derivative $r_{der}$ |
|
411 matched the string. Since the derivative is of the form |
|
412 $(\der\,c\,r) \cdot (r^*)$ the corresponding value is of the |
|
413 form $Seq(v, Stars\,vs)$. But for $r^*$ we are looking for a value |
|
414 for the original (top left) regular expression --- so it cannot |
|
415 be a value of the shape $Seq(\ldots, Stars\ldots)$ because that is the |
|
416 shape for sequence regular expressions. What we need is a value |
|
417 of the form $Stars \ldots$ (remember the correspondence between |
|
418 values and regular expressions). This suggests to define the right hand |
|
419 side as |
|
420 |
|
421 \begin{center} |
|
422 $\ldots \dn Stars(\inj\,r\,c\,v\,::\,vs)$ |
|
423 \end{center} |
|
424 |
|
425 \noindent It is a value of the right shape, namely $Stars$. It injected |
|
426 $c$ into the first-value, which is in fact the value where we need to |
|
427 undo the derivative. Remember again the shape of the derivative of $r^*$. |
|
428 In place where we chopped off the $c$, we now need to do the $\inj$ of $c$. |
|
429 Therefore $\inj\,r\,c\,v$ in the definition above. That is the same with |
|
430 the other clauses in $\inj$. |
|
431 |
|
432 |
|
433 Phew\ldotsSweat\ldots Unfortunately, there is one more problem with the |
|
434 described algorithm so far: it is very slow. We need to include in all |
|
435 this the simplification from Lecture 2. This is what we shall do next. |
309 |
436 |
310 |
437 |
311 \subsubsection*{Simplification} |
438 \subsubsection*{Simplification} |
312 |
439 |
313 Generally the matching algorithms based on derivatives do |
440 Generally the matching algorithms based on derivatives do |