handouts/ho02.tex
changeset 325 794c599cee53
parent 318 7975e4f0d4de
child 332 4755ad4b457b
equal deleted inserted replaced
324:6cb517754d8a 325:794c599cee53
     9 
     9 
    10 \section*{Handout 2 (Regular Expression Matching)}
    10 \section*{Handout 2 (Regular Expression Matching)}
    11 
    11 
    12 This lecture is about implementing a more efficient regular
    12 This lecture is about implementing a more efficient regular
    13 expression matcher (the plots on the right)---more efficient
    13 expression matcher (the plots on the right)---more efficient
    14 than the matchers from regular expression libraries in Ruby and
    14 than the matchers from regular expression libraries in Ruby
    15 Python (the plots on the left). These plots show the running 
    15 and Python (the plots on the left). These plots show the
    16 time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$.
    16 running time for the evil regular expression
    17 Note the different scales of the $x$-axes. 
    17 $a?^{\{n\}}a^{\{n\}}$ and string composed of $n$ \pcode{a}s.
       
    18 We will use this regular expression and strings as running
       
    19 example. To see the substantial differences in the two plots
       
    20 below, note the different scales of the $x$-axes. 
    18 
    21 
    19 
    22 
    20 \begin{center}
    23 \begin{center}
    21 \begin{tabular}{@{}cc@{}}
    24 \begin{tabular}{@{}cc@{}}
    22 \begin{tikzpicture}
    25 \begin{tikzpicture}
    58 \end{tabular}
    61 \end{tabular}
    59 \end{center}\medskip
    62 \end{center}\medskip
    60 
    63 
    61 
    64 
    62 \noindent Having specified in the previous lecture what
    65 \noindent Having specified in the previous lecture what
    63 problem our regular expression matcher, which we will call
    66 problem our regular expression matcher is supposed to solve,
    64 \pcode{matches}, is supposed to solve, namely for any given
    67 namely for any given regular expression $r$ and string $s$
    65 regular expression $r$ and string $s$ answer \textit{true} if
    68 answer \textit{true} if and only if
    66 and only if
       
    67 
    69 
    68 \[
    70 \[
    69 s \in L(r)
    71 s \in L(r)
    70 \]
    72 \]
    71 
    73 
    73 Clearly we cannot use the function $L$ directly for this,
    75 Clearly we cannot use the function $L$ directly for this,
    74 because in general the set of strings $L$ returns is infinite
    76 because in general the set of strings $L$ returns is infinite
    75 (recall what $L(a^*)$ is). In such cases there is no way we
    77 (recall what $L(a^*)$ is). In such cases there is no way we
    76 can implement an exhaustive test for whether a string is
    78 can implement an exhaustive test for whether a string is
    77 member of this set or not. In contrast our matching algorithm
    79 member of this set or not. In contrast our matching algorithm
    78 will mainly operate on the regular expression $r$ and string
    80 will operate on the regular expression $r$ and string $s$,
    79 $s$, which are both finite. Before we come to the matching
    81 only, which are both finite. Before we come to the matching
    80 algorithm, however, let us have a closer look at what it means
    82 algorithm, however, let us have a closer look at what it means
    81 when two regular expressions are equivalent.
    83 when two regular expressions are equivalent.
    82 
    84 
    83 \subsection*{Regular Expression Equivalences}
    85 \subsection*{Regular Expression Equivalences}
    84 
    86 
   147 $r + r$ & $\equiv$ & $r$
   149 $r + r$ & $\equiv$ & $r$
   148 \end{tabular}
   150 \end{tabular}
   149 \end{center}
   151 \end{center}
   150 
   152 
   151 \noindent which always hold no matter what the regular
   153 \noindent which always hold no matter what the regular
   152 expression $r$ looks like. The first two are easy to verify since
   154 expression $r$ looks like. The first two are easy to verify
   153 $L(\varnothing)$ is the empty set. The next two are also easy 
   155 since $L(\varnothing)$ is the empty set. The next two are also
   154 to verify since $L(\epsilon) = \{[]\}$ and appending the empty 
   156 easy to verify since $L(\epsilon) = \{[]\}$ and appending the
   155 string to every string of another set, leaves the set 
   157 empty string to every string of another set, leaves the set
   156 unchanged. Be careful to fully comprehend the fifth and 
   158 unchanged. Be careful to fully comprehend the fifth and sixth
   157 sixth equivalence: if you concatenate two sets of strings
   159 equivalence: if you concatenate two sets of strings and one is
   158 and one is the empty set, then the concatenation will also be
   160 the empty set, then the concatenation will also be the empty
   159 the empty set. Check the definition of \pcode{_ @ _}.
   161 set. To see this, check the definition of \pcode{_ @ _}. The
   160 The last equivalence is again trivial.
   162 last equivalence is again trivial.
   161 
   163 
   162 What will be important later on is that we can orient these
   164 What will be important later on is that we can orient these
   163 equivalences and read them from left to right. In this way we
   165 equivalences and read them from left to right. In this way we
   164 can view them as \emph{simplification rules}. Suppose for 
   166 can view them as \emph{simplification rules}. Consider for 
   165 example the regular expression 
   167 example the regular expression 
   166 
   168 
   167 \begin{equation}
   169 \begin{equation}
   168 (r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing)
   170 (r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing)
   169 \label{big}
   171 \label{big}
   170 \end{equation}
   172 \end{equation}
   171 
   173 
   172 \noindent If we can find an equivalent regular expression that
   174 \noindent If we can find an equivalent regular expression that
   173 is simpler (smaller for example), then this might potentially 
   175 is simpler (smaller for example), then this might potentially
   174 make our matching algorithm run faster. The reason is that 
   176 make our matching algorithm run faster. The reason is that
   175 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv r'$
   177 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv
   176 will always give the same answer. In the example above you 
   178 r'$ will always give the same answer. In the example above you
   177 will see that the regular expression is equivalent to $r_1$
   179 will see that the regular expression is equivalent to $r_1$.
   178 if you iteratively apply the simplification rules from above:
   180 You can verify this by iteratively applying the simplification
       
   181 rules from above:
   179 
   182 
   180 \begin{center}
   183 \begin{center}
   181 \begin{tabular}{ll}
   184 \begin{tabular}{ll}
   182  & $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot 
   185  & $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot 
   183 (\underline{r_4 \cdot \varnothing})$\smallskip\\
   186 (\underline{r_4 \cdot \varnothing})$\smallskip\\
   220 
   223 
   221 \[
   224 \[
   222 nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
   225 nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
   223 \]
   226 \]
   224 
   227 
   225 \noindent Note on the left-hand side we have a function we can
   228 \noindent Note on the left-hand side of the if-and-only-if we
   226 implement; on the right we have its specification (which we
   229 have a function we can implement; on the right we have its
   227 cannot implement in a programming language).
   230 specification (which we cannot implement in a programming
       
   231 language).
   228 
   232 
   229 The other function of our matching algorithm calculates a
   233 The other function of our matching algorithm calculates a
   230 \emph{derivative} of a regular expression. This is a function
   234 \emph{derivative} of a regular expression. This is a function
   231 which will take a regular expression, say $r$, and a
   235 which will take a regular expression, say $r$, and a
   232 character, say $c$, as argument and return a new regular
   236 character, say $c$, as argument and return a new regular
   233 expression. Be careful that the intuition behind this function
   237 expression. Be careful that the intuition behind this function
   234 is not so easy to grasp on first reading. Essentially this
   238 is not so easy to grasp on first reading. Essentially this
   235 function solves the following problem: if $r$ can match a
   239 function solves the following problem: if $r$ can match a
   236 string of the form $c\!::\!s$, what does the regular
   240 string of the form $c\!::\!s$, what does the regular
   237 expression look like that can match just $s$. The definition
   241 expression look like that can match just $s$? The definition
   238 of this function is as follows:
   242 of this function is as follows:
   239 
   243 
   240 \begin{center}
   244 \begin{center}
   241 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
   245 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
   242   $der\, c\, (\varnothing)$      & $\dn$ & $\varnothing$\\
   246   $der\, c\, (\varnothing)$      & $\dn$ & $\varnothing$\\
   250   \end{tabular}
   254   \end{tabular}
   251 \end{center}
   255 \end{center}
   252 
   256 
   253 \noindent The first two clauses can be rationalised as
   257 \noindent The first two clauses can be rationalised as
   254 follows: recall that $der$ should calculate a regular
   258 follows: recall that $der$ should calculate a regular
   255 expression, if the ``input'' regular expression can match a
   259 expression so that given the ``input'' regular expression can
   256 string of the form $c\!::\!s$. Since neither $\varnothing$ nor
   260 match a string of the form $c\!::\!s$, we want a regular
   257 $\epsilon$ can match such a string we return $\varnothing$. In
   261 expression for $s$. Since neither $\varnothing$ nor $\epsilon$
   258 the third case we have to make a case-distinction: In case the
   262 can match a string of the form $c\!::\!s$, we return
   259 regular expression is $c$, then clearly it can recognise a
   263 $\varnothing$. In the third case we have to make a
   260 string of the form $c\!::\!s$, just that $s$ is the empty
   264 case-distinction: In case the regular expression is $c$, then
   261 string. Therefore we return the $\epsilon$-regular expression.
   265 clearly it can recognise a string of the form $c\!::\!s$, just
   262 In the other case we again return $\varnothing$ since no
   266 that $s$ is the empty string. Therefore we return the
   263 string of the $c\!::\!s$ can be matched. Next come the
   267 $\epsilon$-regular expression. In the other case we again
   264 recursive cases. Fortunately, the $+$-case is still relatively
   268 return $\varnothing$ since no string of the $c\!::\!s$ can be
       
   269 matched. Next come the recursive cases, which are a bit more
       
   270 involved. Fortunately, the $+$-case is still relatively
   265 straightforward: all strings of the form $c\!::\!s$ are either
   271 straightforward: all strings of the form $c\!::\!s$ are either
   266 matched by the regular expression $r_1$ or $r_2$. So we just
   272 matched by the regular expression $r_1$ or $r_2$. So we just
   267 have to recursively call $der$ with these two regular
   273 have to recursively call $der$ with these two regular
   268 expressions and compose the results again with $+$. Yes, makes
   274 expressions and compose the results again with $+$. Yes, makes
   269 sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$
   275 sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$
   273 $r_1$ and ``appending'' $r_2$. There is however one exception
   279 $r_1$ and ``appending'' $r_2$. There is however one exception
   274 to this simple rule: if $r_1$ can match the empty string, then
   280 to this simple rule: if $r_1$ can match the empty string, then
   275 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is
   281 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is
   276 nullable (that is can match the empty string) we have to allow
   282 nullable (that is can match the empty string) we have to allow
   277 the choice $der\,c\,r_2$ for calculating the regular
   283 the choice $der\,c\,r_2$ for calculating the regular
   278 expression that can match $s$. Therefore we have to 
   284 expression that can match $s$. Therefore we have to add the
   279 add the regular expression $der\,c\,r_2$.
   285 regular expression $der\,c\,r_2$ in the result. The $*$-case
   280 The $*$-case is again simple:
   286 is again simple: if $r^*$ matches a string of the form
   281 if $r^*$ matches a string of the form $c\!::\!s$, then the
   287 $c\!::\!s$, then the first part must be ``matched'' by a
   282 first part must be ``matched'' by a single copy of $r$.
   288 single copy of $r$. Therefore we call recursively $der\,c\,r$
   283 Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$
   289 and ``append'' $r^*$ in order to match the rest of $s$.
   284 in order to match the rest of $s$.
       
   285 
   290 
   286 If this did not make sense, here is another way to rationalise
   291 If this did not make sense, here is another way to rationalise
   287 the definition of $der$ by considering the following operation
   292 the definition of $der$ by considering the following operation
   288 on sets:
   293 on sets:
   289 
   294 
   331 \end{tabular}
   336 \end{tabular}
   332 \end{center}
   337 \end{center}
   333 
   338 
   334 \noindent Again the operation $Der$ might help to rationalise
   339 \noindent Again the operation $Der$ might help to rationalise
   335 this algorithm. We want to know whether $abc \in L(r_1)$. We
   340 this algorithm. We want to know whether $abc \in L(r_1)$. We
   336 do not know yet. But let us assume it is. Then $Der\,a\,L(r_1)$
   341 do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$
   337 builds the set where all the strings not starting with $a$ are
   342 builds the set where all the strings not starting with $a$ are
   338 filtered out. Of the remaining strings, the $a$ is stripped
   343 filtered out. Of the remaining strings, the $a$ is stripped
   339 off. Then we continue with filtering out all strings not
   344 off. Then we continue with filtering out all strings not
   340 starting with $b$ and stripping off the $b$ from the remaining
   345 starting with $b$ and stripping off the $b$ from the remaining
   341 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$.
   346 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$.
   342 Finally we filter out all strings not starting with $c$ and
   347 Finally we filter out all strings not starting with $c$ and
   343 strip off $c$ from the remaining string. This is
   348 strip off $c$ from the remaining string. This is
   344 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the 
   349 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the 
   345 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ 
   350 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ 
   346 must be the empty string. If not then $abc$ was not in the 
   351 must be the empty string. If not, then $abc$ was not in the 
   347 language we started with.
   352 language we started with.
   348 
   353 
   349 Our matching algorithm using $der$ and $nullable$ works
   354 Our matching algorithm using $der$ and $nullable$ works
   350 similarly, just using regular expression instead of sets. For
   355 similarly, just using regular expression instead of sets. For
   351 this we need to extend the notion of derivatives from
   356 this we need to extend the notion of derivatives from single
   352 characters to strings. This can be done using the following
   357 characters to strings. This can be done using the following
   353 function, taking a string and regular expression as input and
   358 function, taking a string and regular expression as input and
   354 a regular expression as output.
   359 a regular expression as output.
   355 
   360 
   356 \begin{center}
   361 \begin{center}
   358   $\textit{ders}\, []\, r$     & $\dn$ & $r$ & \\
   363   $\textit{ders}\, []\, r$     & $\dn$ & $r$ & \\
   359   $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\
   364   $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\
   360   \end{tabular}
   365   \end{tabular}
   361 \end{center}
   366 \end{center}
   362 
   367 
   363 \noindent This function essentially iterates $der$ taking one
   368 \noindent This function iterates $der$ taking one character at
   364 character at the time from the original string until it is
   369 the time from the original string until it is exhausted.
   365 exhausted. Having $ders$ in place, we can finally define our
   370 Having $ders$ in place, we can finally define our matching
   366 matching algorithm:
   371 algorithm:
   367 
   372 
   368 \[
   373 \[
   369 matches\,s\,r = nullable(ders\,s\,r)
   374 matches\,s\,r \dn nullable(ders\,s\,r)
   370 \]
   375 \]
   371 
   376 
   372 \noindent
   377 \noindent
   373 We can claim that
   378 and we can claim that
   374 
   379 
   375 \[
   380 \[
   376 matches\,s\,r\quad\text{if and only if}\quad s\in L(r)
   381 matches\,s\,r\quad\text{if and only if}\quad s\in L(r)
   377 \]
   382 \]
   378 
   383 
   396 for \pcode{matches} are shown in Figure~\ref{scala1}.
   401 for \pcode{matches} are shown in Figure~\ref{scala1}.
   397 
   402 
   398 \begin{figure}[p]
   403 \begin{figure}[p]
   399 \lstinputlisting{../progs/app5.scala}
   404 \lstinputlisting{../progs/app5.scala}
   400 \caption{Scala implementation of the nullable and 
   405 \caption{Scala implementation of the nullable and 
   401   derivatives functions.\label{scala1}}
   406   derivatives functions. These functions are easy to
       
   407   implement in functional languages, because pattern 
       
   408   matching and recursion allow us to mimic the mathematical
       
   409   definitions very closely.\label{scala1}}
   402 \end{figure}
   410 \end{figure}
   403 
   411 
   404 For running the algorithm with our favourite example, the evil
   412 For running the algorithm with our favourite example, the evil
   405 regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement
   413 regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement
   406 the optional regular expression and the exactly $n$-times
   414 the optional regular expression and the exactly $n$-times
   437   table {re1.data};  
   445   table {re1.data};  
   438 \end{axis}
   446 \end{axis}
   439 \end{tikzpicture}
   447 \end{tikzpicture}
   440 \end{center}
   448 \end{center}
   441 
   449 
   442 \noindent Analysing this failure a bit we notice that 
   450 \noindent Analysing this failure we notice that for
   443 for $a^{\{n\}}$ we generate quite big regular expressions:
   451 $a^{\{n\}}$ we generate quite big regular expressions:
   444 
   452 
   445 \begin{center}
   453 \begin{center}
   446 \begin{tabular}{rl}
   454 \begin{tabular}{rl}
   447 1: & $a$\\
   455 1: & $a$\\
   448 2: & $a\cdot a$\\
   456 2: & $a\cdot a$\\
   456 \noindent Our algorithm traverses such regular expressions at
   464 \noindent Our algorithm traverses such regular expressions at
   457 least once every time a derivative is calculated. So having
   465 least once every time a derivative is calculated. So having
   458 large regular expressions will cause problems. This problem
   466 large regular expressions will cause problems. This problem
   459 is aggravated by $a?$ being represented as $a + \epsilon$.
   467 is aggravated by $a?$ being represented as $a + \epsilon$.
   460 
   468 
   461 We can fix this by having an explicit constructor for 
   469 We can however fix this by having an explicit constructor for
   462 $r^{\{n\}}$. In Scala we would introduce a constructor like
   470 $r^{\{n\}}$. In Scala we would introduce a constructor like
   463 
   471 
   464 \begin{center}
   472 \begin{center}
   465 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
   473 \code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
   466 \end{center}
   474 \end{center}
   467 
   475 
   468 \noindent With this we have a constant ``size'' regular
   476 \noindent With this fix we have a constant ``size'' regular
   469 expression for our running example no matter how large $n$ 
   477 expression for our running example no matter how large $n$ is.
   470 is. This means we have to also add cases for $nullable$ and
   478 This means we have to also add cases for \pcode{NTIMES} in the
   471 $der$. Does the change have any effect?
   479 functions $nullable$ and $der$. Does the change have any
       
   480 effect?
   472 
   481 
   473 \begin{center}
   482 \begin{center}
   474 \begin{tikzpicture}
   483 \begin{tikzpicture}
   475 \begin{axis}[
   484 \begin{axis}[
   476     xlabel={\pcode{a}s},
   485     xlabel={\pcode{a}s},
   503 can within 30 seconds handle regular expressions up to
   512 can within 30 seconds handle regular expressions up to
   504 $n = 950$ before a StackOverflow is raised.
   513 $n = 950$ before a StackOverflow is raised.
   505 
   514 
   506 The moral is that our algorithm is rather sensitive to the
   515 The moral is that our algorithm is rather sensitive to the
   507 size of regular expressions it needs to handle. This is of
   516 size of regular expressions it needs to handle. This is of
   508 course obvious because both $nullable$ and $der$ need to
   517 course obvious because both $nullable$ and $der$ frequently
   509 traverse the whole regular expression. There seems to be one
   518 need to traverse the whole regular expression. There seems,
   510 more issue for making the algorithm run faster. The derivative
   519 however, one more issue for making the algorithm run faster.
   511 function often produces ``useless'' $\varnothing$s and
   520 The derivative function often produces ``useless''
   512 $\epsilon$s. To see this, consider $r = ((a \cdot b) + b)^*$
   521 $\varnothing$s and $\epsilon$s. To see this, consider $r = ((a
   513 and the following two derivatives
   522 \cdot b) + b)^*$ and the following two derivatives
   514 
   523 
   515 \begin{center}
   524 \begin{center}
   516 \begin{tabular}{l}
   525 \begin{tabular}{l}
   517 $der\,a\,r = ((\epsilon \cdot b) + \varnothing) \cdot r$\\
   526 $der\,a\,r = ((\epsilon \cdot b) + \varnothing) \cdot r$\\
   518 $der\,b\,r = ((\varnothing \cdot b) + \epsilon)\cdot r$\\
   527 $der\,b\,r = ((\varnothing \cdot b) + \epsilon)\cdot r$\\
   540 regular expression and simplifies it according to the rules
   549 regular expression and simplifies it according to the rules
   541 given at the beginning. There are only rules for $+$, $\cdot$
   550 given at the beginning. There are only rules for $+$, $\cdot$
   542 and $n$-times (the latter because we added it in the second
   551 and $n$-times (the latter because we added it in the second
   543 version of our matcher). There is no rule for a star, because
   552 version of our matcher). There is no rule for a star, because
   544 empirical data and also a little thought showed that
   553 empirical data and also a little thought showed that
   545 simplifying under a star is waste of computation time. The
   554 simplifying under a star is a waste of computation time. The
   546 simplification function will be called after every derivation.
   555 simplification function will be called after every derivation.
   547 This additional step removes all the ``junk'' the derivative
   556 This additional step removes all the ``junk'' the derivative
   548 function introduced. Does this improve the speed? You bet!!
   557 function introduced. Does this improve the speed? You bet!!
   549 
   558 
   550 \begin{figure}[p]
   559 \begin{figure}[p]
   551 \lstinputlisting{../progs/app6.scala}
   560 \lstinputlisting{../progs/app6.scala}
   552 \caption{The simplification function and modified 
   561 \caption{The simplification function and modified 
   553 \texttt{ders}-function.\label{scala2}}
   562 \texttt{ders}-function; this function now
       
   563 calls \texttt{der} first, but then tries to simplify
       
   564 the resulting derivative regular expressions.\label{scala2}}
   554 \end{figure}
   565 \end{figure}
   555 
   566 
   556 \begin{center}
   567 \begin{center}
   557 \begin{tikzpicture}
   568 \begin{tikzpicture}
   558 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
   569 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},