handouts/ho02.tex
changeset 727 eb9343126625
parent 646 2abd285c66d1
child 757 ea0be0662be0
equal deleted inserted replaced
726:fba480bbc9f7 727:eb9343126625
     5 \usepackage{../graphics}
     5 \usepackage{../graphics}
     6 \usepackage{../data}
     6 \usepackage{../data}
     7 
     7 
     8 
     8 
     9 \begin{document}
     9 \begin{document}
    10 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}
    10 \fnote{\copyright{} Christian Urban, King's College London, 
       
    11   2014, 2015, 2016, 2017, 2018, 2019, 2020}
    11 
    12 
    12 
    13 
    13 \section*{Handout 2 (Regular Expression Matching)}
    14 \section*{Handout 2 (Regular Expression Matching)}
    14 
    15 
       
    16 \noindent
       
    17 {\bf Checklist}
       
    18 
       
    19 \begin{itemize}
       
    20   \item You have understood the derivative-based matching algorithm.
       
    21   \item You know how the derivative is related to the meaning of regular
       
    22   expressions.
       
    23   \item You can extend the algorithm to non-basic regular expressions.
       
    24 \end{itemize}\bigskip\bigskip\bigskip
       
    25 
       
    26 \noindent
    15 This lecture is about implementing a more efficient regular expression
    27 This lecture is about implementing a more efficient regular expression
    16 matcher (the plots on the right below)---more efficient than the
    28 matcher (the plots on the right below)---more efficient than the
    17 matchers from regular expression libraries in Ruby, Python, JavaScript
    29 matchers from regular expression libraries in Ruby, Python, JavaScript
    18 and Java (the plots on the left). For this consider some experimental
    30 and Java (the plots on the left). For this consider some experimental
    19 data: The first pair of plots shows the running time for the
    31 data: The first pair of plots shows the running time for the
    20 regular expression $(a^*)^*\cdot b$ and strings composed of $n$
    32 regular expression $(a^*)^*\cdot b$ and strings composed of $n$
    21 \pcode{a}s (meaning this regular expression actually does not match
    33 \pcode{a}s, like
    22 the strings). The second pair of plots shows the running time for the
    34 \[
    23 regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings also
    35 \pcode{"}\!\underbrace{\pcode{a}\ldots\pcode{a}}_{n}\!\pcode{"}  
    24 composed of $n$ \pcode{a}s (this time the regular expressions match
    36 \]
    25 the strings).  To see the substantial differences in the left and
    37 
       
    38 \noindent
       
    39 This means the regular expression actually does not match the strings.
       
    40 The second pair of plots shows the running time for the regular
       
    41 expressions of the form $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and corresponding
       
    42 strings composed of $n$ \pcode{a}s (this time the regular expressions
       
    43 match the strings).  To see the substantial differences in the left and
    26 right plots below, note the different scales of the $x$-axes.
    44 right plots below, note the different scales of the $x$-axes.
    27 
    45 
    28   
    46   
    29 \begin{center}
    47 \begin{center}
    30 Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
    48 Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
   128 \noindent
   146 \noindent
   129 In what follows we will use these regular expressions and strings as
   147 In what follows we will use these regular expressions and strings as
   130 running examples. There will be several versions (V1, V2, V3,\ldots)
   148 running examples. There will be several versions (V1, V2, V3,\ldots)
   131 of our matcher.\footnote{The corresponding files are
   149 of our matcher.\footnote{The corresponding files are
   132   \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can
   150   \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can
   133   find the code on KEATS.}\bigskip
   151   find the code on KEATS.}
   134 
   152 
   135 \noindent
       
   136 Having specified in the previous lecture what
   153 Having specified in the previous lecture what
   137 problem our regular expression matcher is supposed to solve,
   154 problem our regular expression matcher is supposed to solve,
   138 namely for any given regular expression $r$ and string $s$
   155 namely for any given regular expression $r$ and string $s$
   139 answer \textit{true} if and only if
   156 answer \textit{true} if and only if
   140 
   157 
   153 means when two regular expressions are equivalent.
   170 means when two regular expressions are equivalent.
   154 
   171 
   155 \subsection*{Regular Expression Equivalences}
   172 \subsection*{Regular Expression Equivalences}
   156 
   173 
   157 We already defined in Handout 1 what it means for two regular
   174 We already defined in Handout 1 what it means for two regular
   158 expressions to be equivalent, namely if their meaning is the
   175 expressions to be equivalent, namely whether their 
   159 same language:
   176 \emph{meaning} is the same language:
   160 
   177 
   161 \[
   178 \[
   162 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
   179 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
   163 \]
   180 \]
   164 
   181 
   218 $\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\
   235 $\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\
   219 $r + r$ & $\equiv$ & $r$
   236 $r + r$ & $\equiv$ & $r$
   220 \end{tabular}
   237 \end{tabular}
   221 \end{center}
   238 \end{center}
   222 
   239 
   223 \noindent which always hold no matter what the regular expression $r$
   240 \noindent They always hold no matter what the regular expression $r$
   224 looks like. The first two are easy to verify since $L(\ZERO)$ is the
   241 looks like. The first two are easy to verify since $L(\ZERO)$ is the
   225 empty set. The next two are also easy to verify since $L(\ONE) =
   242 empty set. The next two are also easy to verify since $L(\ONE) =
   226 \{[]\}$ and appending the empty string to every string of another set,
   243 \{[]\}$ and appending the empty string to every string of another set,
   227 leaves the set unchanged. Be careful to fully comprehend the fifth and
   244 leaves the set unchanged. Be careful to fully comprehend the fifth and
   228 sixth equivalence: if you concatenate two sets of strings and one is
   245 sixth equivalence: if you concatenate two sets of strings and one is
   229 the empty set, then the concatenation will also be the empty set. To
   246 the empty set, then the concatenation will also be the empty set. To
   230 see this, check the definition of $\_ @ \_$ for sets. The last
   247 see this, check the definition of $\_ @ \_$ for sets. The last
   231 equivalence is again trivial.
   248 equivalence is again trivial.
   232 
   249 
   233 What will be important later on is that we can orient these
   250 What will be critical later on is that we can orient these
   234 equivalences and read them from left to right. In this way we
   251 equivalences and read them from left to right. In this way we
   235 can view them as \emph{simplification rules}. Consider for 
   252 can view them as \emph{simplification rules}. Consider for 
   236 example the regular expression 
   253 example the regular expression 
   237 
   254 
   238 \begin{equation}
   255 \begin{equation}
   240 \label{big}
   257 \label{big}
   241 \end{equation}
   258 \end{equation}
   242 
   259 
   243 \noindent If we can find an equivalent regular expression that is
   260 \noindent If we can find an equivalent regular expression that is
   244 simpler (that usually means smaller), then this might potentially make
   261 simpler (that usually means smaller), then this might potentially make
   245 our matching algorithm run faster. We can look for such a simpler
   262 our matching algorithm run faster. We can look for such a simpler, but
   246 regular expression $r'$ because whether a string $s$ is in $L(r)$ or
   263 equivalent, regular expression $r'$ because whether a string $s$ is in
   247 in $L(r')$ with $r\equiv r'$ will always give the same answer. Yes?
   264 $L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes?
   248 
   265 
   249 In the example above you will see that the regular expression is
   266 In the example above you will see that the regular expression in
   250 equivalent to just $r_1$. You can verify this by iteratively applying
   267 \eqref{big} is equivalent to just $r_1$. You can verify this by
   251 the simplification rules from above:
   268 iteratively applying the simplification rules from above:
   252 
   269 
   253 \begin{center}
   270 \begin{center}
   254 \begin{tabular}{ll}
   271 \begin{tabular}{ll}
   255  & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot 
   272  & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot 
   256 (\underline{r_4 \cdot \ZERO})$\smallskip\\
   273 (\underline{r_4 \cdot \ZERO})$\smallskip\\
   272 Finally here are three equivalences between regular expressions which are
   289 Finally here are three equivalences between regular expressions which are
   273 not so obvious:
   290 not so obvious:
   274 
   291 
   275 \begin{center}
   292 \begin{center}
   276 \begin{tabular}{rcl}
   293 \begin{tabular}{rcl}
   277 $r^*$  & $\equiv$ & $1 + r\cdot r^*$\\
   294 $r^*$  & $\equiv$ & $\ONE + r\cdot r^*$\\
   278 $(r_1 + r_2)^*$  & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\
   295 $(r_1 + r_2)^*$  & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\
   279 $(r_1 \cdot r_2)^*$ & $\equiv$ & $1 + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\
   296 $(r_1 \cdot r_2)^*$ & $\equiv$ & $\ONE + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\
   280 \end{tabular}
   297 \end{tabular}
   281 \end{center}
   298 \end{center}
   282 
   299 
   283 \noindent
   300 \noindent
   284 We will not use them in our algorithm, but feel free to convince yourself
   301 We will not use them in our algorithm, but feel free to convince
   285 that they hold. As an aside, there has been a lot of research about
   302 yourself that they actually hold. As an aside, there has been a lot of
   286 questions like: Can one always decide when two regular expressions are
   303 research about questions like: Can one always decide when two regular
   287 equivalent or not? What does an algorithm look like to decide this
   304 expressions are equivalent or not? What does an algorithm look like to
   288 efficiently? So in general it is not a trivial problem.
   305 decide this efficiently? So in general it is not a trivial problem.
   289 
   306 
   290 \subsection*{The Matching Algorithm}
   307 \subsection*{The Matching Algorithm}
   291 
   308 
   292 The algorithm we will define below consists of two parts. One
   309 The algorithm we will introduce below consists of two parts. One is the
   293 is the function $\textit{nullable}$ which takes a regular expression as
   310 function $\textit{nullable}$ which takes a regular expression as an
   294 argument and decides whether it can match the empty string
   311 argument and decides whether it can match the empty string (this means
   295 (this means it returns a boolean in Scala). This can be easily
   312 it returns a boolean in Scala). This can be easily defined recursively
   296 defined recursively as follows:
   313 as follows:
   297 
   314 
   298 \begin{center}
   315 \begin{center}
   299 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
   316 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
   300 $\textit{nullable}(\ZERO)$      & $\dn$ & $\textit{false}$\\
   317 $\textit{nullable}(\ZERO)$      & $\dn$ & $\textit{false}$\\
   301 $\textit{nullable}(\ONE)$         & $\dn$ & $\textit{true}$\\
   318 $\textit{nullable}(\ONE)$         & $\dn$ & $\textit{true}$\\
   311 
   328 
   312 \[
   329 \[
   313 \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
   330 \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
   314 \]
   331 \]
   315 
   332 
   316 \noindent Note on the left-hand side of the if-and-only-if we
   333 \noindent Note on the left-hand side of the if-and-only-if we have a
   317 have a function we can implement; on the right we have its
   334 function we can implement, ofr example in Scala; on the right we have
   318 specification (which we cannot implement in a programming
   335 its specification (which we cannot implement in a programming language).
   319 language).
       
   320 
   336 
   321 The other function of our matching algorithm calculates a
   337 The other function of our matching algorithm calculates a
   322 \emph{derivative} of a regular expression. This is a function
   338 \emph{derivative} of a regular expression. This is a function
   323 which will take a regular expression, say $r$, and a
   339 which will take a regular expression, say $r$, and a
   324 character, say $c$, as arguments and returns a new regular
   340 character, say $c$, as arguments and returns a new regular
   340   & & else $(\textit{der}\, c\, r_1) \cdot r_2$\\
   356   & & else $(\textit{der}\, c\, r_1) \cdot r_2$\\
   341   $\textit{der}\, c\, (r^*)$          & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$
   357   $\textit{der}\, c\, (r^*)$          & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$
   342   \end{tabular}
   358   \end{tabular}
   343 \end{center}
   359 \end{center}
   344 
   360 
   345 \noindent The first two clauses can be rationalised as
   361 \noindent The first two clauses can be rationalised as follows: recall
   346 follows: recall that $\textit{der}$ should calculate a regular
   362 that $\textit{der}$ should calculate a regular expression so that
   347 expression so that given the ``input'' regular expression can
   363 provided  the ``input'' regular expression can match a string of the
   348 match a string of the form $c\!::\!s$, we want a regular
   364 form $c\!::\!s$, we want a regular expression for $s$. Since neither
   349 expression for $s$. Since neither $\ZERO$ nor $\ONE$
   365 $\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return
   350 can match a string of the form $c\!::\!s$, we return
   366 $\ZERO$. In the third case we have to make a case-distinction: In case
   351 $\ZERO$. In the third case we have to make a
   367 the regular expression is $c$, then clearly it can recognise a string of
   352 case-distinction: In case the regular expression is $c$, then
   368 the form $c\!::\!s$, just that $s$ is the empty string. Therefore we
   353 clearly it can recognise a string of the form $c\!::\!s$, just
   369 return the $\ONE$-regular expression, as it can match the empty string.
   354 that $s$ is the empty string. Therefore we return the
   370 In the other case we again return $\ZERO$ since no string of the
   355 $\ONE$-regular expression. In the other case we again
   371 $c\!::\!s$ can be matched. Next come the recursive cases, which are a
   356 return $\ZERO$ since no string of the $c\!::\!s$ can be
   372 bit more involved. Fortunately, the $+$-case is still relatively
   357 matched. Next come the recursive cases, which are a bit more
   373 straightforward: all strings of the form $c\!::\!s$ are either matched
   358 involved. Fortunately, the $+$-case is still relatively
   374 by the regular expression $r_1$ or $r_2$. So we just have to recursively
   359 straightforward: all strings of the form $c\!::\!s$ are either
   375 call $\textit{der}$ with these two regular expressions and compose the
   360 matched by the regular expression $r_1$ or $r_2$. So we just
   376 results again with $+$. Makes sense?
   361 have to recursively call $\textit{der}$ with these two regular
   377 
   362 expressions and compose the results again with $+$. Makes
       
   363 sense? 
       
   364 
   378 
   365 The $\cdot$-case is more complicated: if $r_1\cdot r_2$
   379 The $\cdot$-case is more complicated: if $r_1\cdot r_2$
   366 matches a string of the form $c\!::\!s$, then the first part
   380 matches a string of the form $c\!::\!s$, then the first part
   367 must be matched by $r_1$. Consequently, it makes sense to
   381 must be matched by $r_1$. Consequently, it makes sense to
   368 construct the regular expression for $s$ by calling $\textit{der}$ with
   382 construct the regular expression for $s$ by calling $\textit{der}$ with