9 |
9 |
10 \section*{Handout 2 (Regular Expression Matching)} |
10 \section*{Handout 2 (Regular Expression Matching)} |
11 |
11 |
12 This lecture is about implementing a more efficient regular |
12 This lecture is about implementing a more efficient regular |
13 expression matcher (the plots on the right)---more efficient |
13 expression matcher (the plots on the right)---more efficient |
14 than the matchers from regular expression libraries in Ruby and |
14 than the matchers from regular expression libraries in Ruby |
15 Python (the plots on the left). These plots show the running |
15 and Python (the plots on the left). These plots show the |
16 time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$. |
16 running time for the evil regular expression |
17 Note the different scales of the $x$-axes. |
17 $a?^{\{n\}}a^{\{n\}}$ and string composed of $n$ \pcode{a}s. |
|
18 We will use this regular expression and strings as running |
|
19 example. To see the substantial differences in the two plots |
|
20 below, note the different scales of the $x$-axes. |
18 |
21 |
19 |
22 |
20 \begin{center} |
23 \begin{center} |
21 \begin{tabular}{@{}cc@{}} |
24 \begin{tabular}{@{}cc@{}} |
22 \begin{tikzpicture} |
25 \begin{tikzpicture} |
73 Clearly we cannot use the function $L$ directly for this, |
75 Clearly we cannot use the function $L$ directly for this, |
74 because in general the set of strings $L$ returns is infinite |
76 because in general the set of strings $L$ returns is infinite |
75 (recall what $L(a^*)$ is). In such cases there is no way we |
77 (recall what $L(a^*)$ is). In such cases there is no way we |
76 can implement an exhaustive test for whether a string is |
78 can implement an exhaustive test for whether a string is |
77 member of this set or not. In contrast our matching algorithm |
79 member of this set or not. In contrast our matching algorithm |
78 will mainly operate on the regular expression $r$ and string |
80 will operate on the regular expression $r$ and string $s$, |
79 $s$, which are both finite. Before we come to the matching |
81 only, which are both finite. Before we come to the matching |
80 algorithm, however, let us have a closer look at what it means |
82 algorithm, however, let us have a closer look at what it means |
81 when two regular expressions are equivalent. |
83 when two regular expressions are equivalent. |
82 |
84 |
83 \subsection*{Regular Expression Equivalences} |
85 \subsection*{Regular Expression Equivalences} |
84 |
86 |
147 $r + r$ & $\equiv$ & $r$ |
149 $r + r$ & $\equiv$ & $r$ |
148 \end{tabular} |
150 \end{tabular} |
149 \end{center} |
151 \end{center} |
150 |
152 |
151 \noindent which always hold no matter what the regular |
153 \noindent which always hold no matter what the regular |
152 expression $r$ looks like. The first two are easy to verify since |
154 expression $r$ looks like. The first two are easy to verify |
153 $L(\varnothing)$ is the empty set. The next two are also easy |
155 since $L(\varnothing)$ is the empty set. The next two are also |
154 to verify since $L(\epsilon) = \{[]\}$ and appending the empty |
156 easy to verify since $L(\epsilon) = \{[]\}$ and appending the |
155 string to every string of another set, leaves the set |
157 empty string to every string of another set, leaves the set |
156 unchanged. Be careful to fully comprehend the fifth and |
158 unchanged. Be careful to fully comprehend the fifth and sixth |
157 sixth equivalence: if you concatenate two sets of strings |
159 equivalence: if you concatenate two sets of strings and one is |
158 and one is the empty set, then the concatenation will also be |
160 the empty set, then the concatenation will also be the empty |
159 the empty set. Check the definition of \pcode{_ @ _}. |
161 set. To see this, check the definition of \pcode{_ @ _}. The |
160 The last equivalence is again trivial. |
162 last equivalence is again trivial. |
161 |
163 |
162 What will be important later on is that we can orient these |
164 What will be important later on is that we can orient these |
163 equivalences and read them from left to right. In this way we |
165 equivalences and read them from left to right. In this way we |
164 can view them as \emph{simplification rules}. Suppose for |
166 can view them as \emph{simplification rules}. Consider for |
165 example the regular expression |
167 example the regular expression |
166 |
168 |
167 \begin{equation} |
169 \begin{equation} |
168 (r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing) |
170 (r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing) |
169 \label{big} |
171 \label{big} |
170 \end{equation} |
172 \end{equation} |
171 |
173 |
172 \noindent If we can find an equivalent regular expression that |
174 \noindent If we can find an equivalent regular expression that |
173 is simpler (smaller for example), then this might potentially |
175 is simpler (smaller for example), then this might potentially |
174 make our matching algorithm run faster. The reason is that |
176 make our matching algorithm run faster. The reason is that |
175 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv r'$ |
177 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv |
176 will always give the same answer. In the example above you |
178 r'$ will always give the same answer. In the example above you |
177 will see that the regular expression is equivalent to $r_1$ |
179 will see that the regular expression is equivalent to $r_1$. |
178 if you iteratively apply the simplification rules from above: |
180 You can verify this by iteratively applying the simplification |
|
181 rules from above: |
179 |
182 |
180 \begin{center} |
183 \begin{center} |
181 \begin{tabular}{ll} |
184 \begin{tabular}{ll} |
182 & $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot |
185 & $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot |
183 (\underline{r_4 \cdot \varnothing})$\smallskip\\ |
186 (\underline{r_4 \cdot \varnothing})$\smallskip\\ |
220 |
223 |
221 \[ |
224 \[ |
222 nullable(r) \;\;\text{if and only if}\;\; []\in L(r) |
225 nullable(r) \;\;\text{if and only if}\;\; []\in L(r) |
223 \] |
226 \] |
224 |
227 |
225 \noindent Note on the left-hand side we have a function we can |
228 \noindent Note on the left-hand side of the if-and-only-if we |
226 implement; on the right we have its specification (which we |
229 have a function we can implement; on the right we have its |
227 cannot implement in a programming language). |
230 specification (which we cannot implement in a programming |
|
231 language). |
228 |
232 |
229 The other function of our matching algorithm calculates a |
233 The other function of our matching algorithm calculates a |
230 \emph{derivative} of a regular expression. This is a function |
234 \emph{derivative} of a regular expression. This is a function |
231 which will take a regular expression, say $r$, and a |
235 which will take a regular expression, say $r$, and a |
232 character, say $c$, as argument and return a new regular |
236 character, say $c$, as argument and return a new regular |
233 expression. Be careful that the intuition behind this function |
237 expression. Be careful that the intuition behind this function |
234 is not so easy to grasp on first reading. Essentially this |
238 is not so easy to grasp on first reading. Essentially this |
235 function solves the following problem: if $r$ can match a |
239 function solves the following problem: if $r$ can match a |
236 string of the form $c\!::\!s$, what does the regular |
240 string of the form $c\!::\!s$, what does the regular |
237 expression look like that can match just $s$. The definition |
241 expression look like that can match just $s$? The definition |
238 of this function is as follows: |
242 of this function is as follows: |
239 |
243 |
240 \begin{center} |
244 \begin{center} |
241 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} |
245 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} |
242 $der\, c\, (\varnothing)$ & $\dn$ & $\varnothing$\\ |
246 $der\, c\, (\varnothing)$ & $\dn$ & $\varnothing$\\ |
250 \end{tabular} |
254 \end{tabular} |
251 \end{center} |
255 \end{center} |
252 |
256 |
253 \noindent The first two clauses can be rationalised as |
257 \noindent The first two clauses can be rationalised as |
254 follows: recall that $der$ should calculate a regular |
258 follows: recall that $der$ should calculate a regular |
255 expression, if the ``input'' regular expression can match a |
259 expression so that given the ``input'' regular expression can |
256 string of the form $c\!::\!s$. Since neither $\varnothing$ nor |
260 match a string of the form $c\!::\!s$, we want a regular |
257 $\epsilon$ can match such a string we return $\varnothing$. In |
261 expression for $s$. Since neither $\varnothing$ nor $\epsilon$ |
258 the third case we have to make a case-distinction: In case the |
262 can match a string of the form $c\!::\!s$, we return |
259 regular expression is $c$, then clearly it can recognise a |
263 $\varnothing$. In the third case we have to make a |
260 string of the form $c\!::\!s$, just that $s$ is the empty |
264 case-distinction: In case the regular expression is $c$, then |
261 string. Therefore we return the $\epsilon$-regular expression. |
265 clearly it can recognise a string of the form $c\!::\!s$, just |
262 In the other case we again return $\varnothing$ since no |
266 that $s$ is the empty string. Therefore we return the |
263 string of the $c\!::\!s$ can be matched. Next come the |
267 $\epsilon$-regular expression. In the other case we again |
264 recursive cases. Fortunately, the $+$-case is still relatively |
268 return $\varnothing$ since no string of the $c\!::\!s$ can be |
|
269 matched. Next come the recursive cases, which are a bit more |
|
270 involved. Fortunately, the $+$-case is still relatively |
265 straightforward: all strings of the form $c\!::\!s$ are either |
271 straightforward: all strings of the form $c\!::\!s$ are either |
266 matched by the regular expression $r_1$ or $r_2$. So we just |
272 matched by the regular expression $r_1$ or $r_2$. So we just |
267 have to recursively call $der$ with these two regular |
273 have to recursively call $der$ with these two regular |
268 expressions and compose the results again with $+$. Yes, makes |
274 expressions and compose the results again with $+$. Yes, makes |
269 sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
275 sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
273 $r_1$ and ``appending'' $r_2$. There is however one exception |
279 $r_1$ and ``appending'' $r_2$. There is however one exception |
274 to this simple rule: if $r_1$ can match the empty string, then |
280 to this simple rule: if $r_1$ can match the empty string, then |
275 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is |
281 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is |
276 nullable (that is can match the empty string) we have to allow |
282 nullable (that is can match the empty string) we have to allow |
277 the choice $der\,c\,r_2$ for calculating the regular |
283 the choice $der\,c\,r_2$ for calculating the regular |
278 expression that can match $s$. Therefore we have to |
284 expression that can match $s$. Therefore we have to add the |
279 add the regular expression $der\,c\,r_2$. |
285 regular expression $der\,c\,r_2$ in the result. The $*$-case |
280 The $*$-case is again simple: |
286 is again simple: if $r^*$ matches a string of the form |
281 if $r^*$ matches a string of the form $c\!::\!s$, then the |
287 $c\!::\!s$, then the first part must be ``matched'' by a |
282 first part must be ``matched'' by a single copy of $r$. |
288 single copy of $r$. Therefore we call recursively $der\,c\,r$ |
283 Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$ |
289 and ``append'' $r^*$ in order to match the rest of $s$. |
284 in order to match the rest of $s$. |
|
285 |
290 |
286 If this did not make sense, here is another way to rationalise |
291 If this did not make sense, here is another way to rationalise |
287 the definition of $der$ by considering the following operation |
292 the definition of $der$ by considering the following operation |
288 on sets: |
293 on sets: |
289 |
294 |
331 \end{tabular} |
336 \end{tabular} |
332 \end{center} |
337 \end{center} |
333 |
338 |
334 \noindent Again the operation $Der$ might help to rationalise |
339 \noindent Again the operation $Der$ might help to rationalise |
335 this algorithm. We want to know whether $abc \in L(r_1)$. We |
340 this algorithm. We want to know whether $abc \in L(r_1)$. We |
336 do not know yet. But let us assume it is. Then $Der\,a\,L(r_1)$ |
341 do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$ |
337 builds the set where all the strings not starting with $a$ are |
342 builds the set where all the strings not starting with $a$ are |
338 filtered out. Of the remaining strings, the $a$ is stripped |
343 filtered out. Of the remaining strings, the $a$ is stripped |
339 off. Then we continue with filtering out all strings not |
344 off. Then we continue with filtering out all strings not |
340 starting with $b$ and stripping off the $b$ from the remaining |
345 starting with $b$ and stripping off the $b$ from the remaining |
341 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$. |
346 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$. |
342 Finally we filter out all strings not starting with $c$ and |
347 Finally we filter out all strings not starting with $c$ and |
343 strip off $c$ from the remaining string. This is |
348 strip off $c$ from the remaining string. This is |
344 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the |
349 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the |
345 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ |
350 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ |
346 must be the empty string. If not then $abc$ was not in the |
351 must be the empty string. If not, then $abc$ was not in the |
347 language we started with. |
352 language we started with. |
348 |
353 |
349 Our matching algorithm using $der$ and $nullable$ works |
354 Our matching algorithm using $der$ and $nullable$ works |
350 similarly, just using regular expression instead of sets. For |
355 similarly, just using regular expression instead of sets. For |
351 this we need to extend the notion of derivatives from |
356 this we need to extend the notion of derivatives from single |
352 characters to strings. This can be done using the following |
357 characters to strings. This can be done using the following |
353 function, taking a string and regular expression as input and |
358 function, taking a string and regular expression as input and |
354 a regular expression as output. |
359 a regular expression as output. |
355 |
360 |
356 \begin{center} |
361 \begin{center} |
358 $\textit{ders}\, []\, r$ & $\dn$ & $r$ & \\ |
363 $\textit{ders}\, []\, r$ & $\dn$ & $r$ & \\ |
359 $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\ |
364 $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\ |
360 \end{tabular} |
365 \end{tabular} |
361 \end{center} |
366 \end{center} |
362 |
367 |
363 \noindent This function essentially iterates $der$ taking one |
368 \noindent This function iterates $der$ taking one character at |
364 character at the time from the original string until it is |
369 the time from the original string until it is exhausted. |
365 exhausted. Having $ders$ in place, we can finally define our |
370 Having $ders$ in place, we can finally define our matching |
366 matching algorithm: |
371 algorithm: |
367 |
372 |
368 \[ |
373 \[ |
369 matches\,s\,r = nullable(ders\,s\,r) |
374 matches\,s\,r \dn nullable(ders\,s\,r) |
370 \] |
375 \] |
371 |
376 |
372 \noindent |
377 \noindent |
373 We can claim that |
378 and we can claim that |
374 |
379 |
375 \[ |
380 \[ |
376 matches\,s\,r\quad\text{if and only if}\quad s\in L(r) |
381 matches\,s\,r\quad\text{if and only if}\quad s\in L(r) |
377 \] |
382 \] |
378 |
383 |
396 for \pcode{matches} are shown in Figure~\ref{scala1}. |
401 for \pcode{matches} are shown in Figure~\ref{scala1}. |
397 |
402 |
398 \begin{figure}[p] |
403 \begin{figure}[p] |
399 \lstinputlisting{../progs/app5.scala} |
404 \lstinputlisting{../progs/app5.scala} |
400 \caption{Scala implementation of the nullable and |
405 \caption{Scala implementation of the nullable and |
401 derivatives functions.\label{scala1}} |
406 derivatives functions. These functions are easy to |
|
407 implement in functional languages, because pattern |
|
408 matching and recursion allow us to mimic the mathematical |
|
409 definitions very closely.\label{scala1}} |
402 \end{figure} |
410 \end{figure} |
403 |
411 |
404 For running the algorithm with our favourite example, the evil |
412 For running the algorithm with our favourite example, the evil |
405 regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement |
413 regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement |
406 the optional regular expression and the exactly $n$-times |
414 the optional regular expression and the exactly $n$-times |
456 \noindent Our algorithm traverses such regular expressions at |
464 \noindent Our algorithm traverses such regular expressions at |
457 least once every time a derivative is calculated. So having |
465 least once every time a derivative is calculated. So having |
458 large regular expressions will cause problems. This problem |
466 large regular expressions will cause problems. This problem |
459 is aggravated by $a?$ being represented as $a + \epsilon$. |
467 is aggravated by $a?$ being represented as $a + \epsilon$. |
460 |
468 |
461 We can fix this by having an explicit constructor for |
469 We can however fix this by having an explicit constructor for |
462 $r^{\{n\}}$. In Scala we would introduce a constructor like |
470 $r^{\{n\}}$. In Scala we would introduce a constructor like |
463 |
471 |
464 \begin{center} |
472 \begin{center} |
465 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp} |
473 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp} |
466 \end{center} |
474 \end{center} |
467 |
475 |
468 \noindent With this we have a constant ``size'' regular |
476 \noindent With this fix we have a constant ``size'' regular |
469 expression for our running example no matter how large $n$ |
477 expression for our running example no matter how large $n$ is. |
470 is. This means we have to also add cases for $nullable$ and |
478 This means we have to also add cases for \pcode{NTIMES} in the |
471 $der$. Does the change have any effect? |
479 functions $nullable$ and $der$. Does the change have any |
|
480 effect? |
472 |
481 |
473 \begin{center} |
482 \begin{center} |
474 \begin{tikzpicture} |
483 \begin{tikzpicture} |
475 \begin{axis}[ |
484 \begin{axis}[ |
476 xlabel={\pcode{a}s}, |
485 xlabel={\pcode{a}s}, |
503 can within 30 seconds handle regular expressions up to |
512 can within 30 seconds handle regular expressions up to |
504 $n = 950$ before a StackOverflow is raised. |
513 $n = 950$ before a StackOverflow is raised. |
505 |
514 |
506 The moral is that our algorithm is rather sensitive to the |
515 The moral is that our algorithm is rather sensitive to the |
507 size of regular expressions it needs to handle. This is of |
516 size of regular expressions it needs to handle. This is of |
508 course obvious because both $nullable$ and $der$ need to |
517 course obvious because both $nullable$ and $der$ frequently |
509 traverse the whole regular expression. There seems to be one |
518 need to traverse the whole regular expression. There seems, |
510 more issue for making the algorithm run faster. The derivative |
519 however, one more issue for making the algorithm run faster. |
511 function often produces ``useless'' $\varnothing$s and |
520 The derivative function often produces ``useless'' |
512 $\epsilon$s. To see this, consider $r = ((a \cdot b) + b)^*$ |
521 $\varnothing$s and $\epsilon$s. To see this, consider $r = ((a |
513 and the following two derivatives |
522 \cdot b) + b)^*$ and the following two derivatives |
514 |
523 |
515 \begin{center} |
524 \begin{center} |
516 \begin{tabular}{l} |
525 \begin{tabular}{l} |
517 $der\,a\,r = ((\epsilon \cdot b) + \varnothing) \cdot r$\\ |
526 $der\,a\,r = ((\epsilon \cdot b) + \varnothing) \cdot r$\\ |
518 $der\,b\,r = ((\varnothing \cdot b) + \epsilon)\cdot r$\\ |
527 $der\,b\,r = ((\varnothing \cdot b) + \epsilon)\cdot r$\\ |
540 regular expression and simplifies it according to the rules |
549 regular expression and simplifies it according to the rules |
541 given at the beginning. There are only rules for $+$, $\cdot$ |
550 given at the beginning. There are only rules for $+$, $\cdot$ |
542 and $n$-times (the latter because we added it in the second |
551 and $n$-times (the latter because we added it in the second |
543 version of our matcher). There is no rule for a star, because |
552 version of our matcher). There is no rule for a star, because |
544 empirical data and also a little thought showed that |
553 empirical data and also a little thought showed that |
545 simplifying under a star is waste of computation time. The |
554 simplifying under a star is a waste of computation time. The |
546 simplification function will be called after every derivation. |
555 simplification function will be called after every derivation. |
547 This additional step removes all the ``junk'' the derivative |
556 This additional step removes all the ``junk'' the derivative |
548 function introduced. Does this improve the speed? You bet!! |
557 function introduced. Does this improve the speed? You bet!! |
549 |
558 |
550 \begin{figure}[p] |
559 \begin{figure}[p] |
551 \lstinputlisting{../progs/app6.scala} |
560 \lstinputlisting{../progs/app6.scala} |
552 \caption{The simplification function and modified |
561 \caption{The simplification function and modified |
553 \texttt{ders}-function.\label{scala2}} |
562 \texttt{ders}-function; this function now |
|
563 calls \texttt{der} first, but then tries to simplify |
|
564 the resulting derivative regular expressions.\label{scala2}} |
554 \end{figure} |
565 \end{figure} |
555 |
566 |
556 \begin{center} |
567 \begin{center} |
557 \begin{tikzpicture} |
568 \begin{tikzpicture} |
558 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs}, |
569 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs}, |