diff -r fba480bbc9f7 -r eb9343126625 handouts/ho02.tex --- a/handouts/ho02.tex Mon Jun 29 21:13:49 2020 +0100 +++ b/handouts/ho02.tex Mon Jun 29 21:14:50 2020 +0100 @@ -7,22 +7,40 @@ \begin{document} -\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019} +\fnote{\copyright{} Christian Urban, King's College London, + 2014, 2015, 2016, 2017, 2018, 2019, 2020} \section*{Handout 2 (Regular Expression Matching)} +\noindent +{\bf Checklist} + +\begin{itemize} + \item You have understood the derivative-based matching algorithm. + \item You know how the derivative is related to the meaning of regular + expressions. + \item You can extend the algorithm to non-basic regular expressions. +\end{itemize}\bigskip\bigskip\bigskip + +\noindent This lecture is about implementing a more efficient regular expression matcher (the plots on the right below)---more efficient than the matchers from regular expression libraries in Ruby, Python, JavaScript and Java (the plots on the left). For this consider some experimental data: The first pair of plots shows the running time for the regular expression $(a^*)^*\cdot b$ and strings composed of $n$ -\pcode{a}s (meaning this regular expression actually does not match -the strings). The second pair of plots shows the running time for the -regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings also -composed of $n$ \pcode{a}s (this time the regular expressions match -the strings). To see the substantial differences in the left and +\pcode{a}s, like +\[ +\pcode{"}\!\underbrace{\pcode{a}\ldots\pcode{a}}_{n}\!\pcode{"} +\] + +\noindent +This means the regular expression actually does not match the strings. +The second pair of plots shows the running time for the regular +expressions of the form $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and corresponding +strings composed of $n$ \pcode{a}s (this time the regular expressions +match the strings). To see the substantial differences in the left and right plots below, note the different scales of the $x$-axes. @@ -130,9 +148,8 @@ running examples. There will be several versions (V1, V2, V3,\ldots) of our matcher.\footnote{The corresponding files are \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can - find the code on KEATS.}\bigskip + find the code on KEATS.} -\noindent Having specified in the previous lecture what problem our regular expression matcher is supposed to solve, namely for any given regular expression $r$ and string $s$ @@ -155,8 +172,8 @@ \subsection*{Regular Expression Equivalences} We already defined in Handout 1 what it means for two regular -expressions to be equivalent, namely if their meaning is the -same language: +expressions to be equivalent, namely whether their +\emph{meaning} is the same language: \[ r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) @@ -220,7 +237,7 @@ \end{tabular} \end{center} -\noindent which always hold no matter what the regular expression $r$ +\noindent They always hold no matter what the regular expression $r$ looks like. The first two are easy to verify since $L(\ZERO)$ is the empty set. The next two are also easy to verify since $L(\ONE) = \{[]\}$ and appending the empty string to every string of another set, @@ -230,7 +247,7 @@ see this, check the definition of $\_ @ \_$ for sets. The last equivalence is again trivial. -What will be important later on is that we can orient these +What will be critical later on is that we can orient these equivalences and read them from left to right. In this way we can view them as \emph{simplification rules}. Consider for example the regular expression @@ -242,13 +259,13 @@ \noindent If we can find an equivalent regular expression that is simpler (that usually means smaller), then this might potentially make -our matching algorithm run faster. We can look for such a simpler -regular expression $r'$ because whether a string $s$ is in $L(r)$ or -in $L(r')$ with $r\equiv r'$ will always give the same answer. Yes? +our matching algorithm run faster. We can look for such a simpler, but +equivalent, regular expression $r'$ because whether a string $s$ is in +$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? -In the example above you will see that the regular expression is -equivalent to just $r_1$. You can verify this by iteratively applying -the simplification rules from above: +In the example above you will see that the regular expression in +\eqref{big} is equivalent to just $r_1$. You can verify this by +iteratively applying the simplification rules from above: \begin{center} \begin{tabular}{ll} @@ -274,26 +291,26 @@ \begin{center} \begin{tabular}{rcl} -$r^*$ & $\equiv$ & $1 + r\cdot r^*$\\ +$r^*$ & $\equiv$ & $\ONE + r\cdot r^*$\\ $(r_1 + r_2)^*$ & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\ -$(r_1 \cdot r_2)^*$ & $\equiv$ & $1 + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\ +$(r_1 \cdot r_2)^*$ & $\equiv$ & $\ONE + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\ \end{tabular} \end{center} \noindent -We will not use them in our algorithm, but feel free to convince yourself -that they hold. As an aside, there has been a lot of research about -questions like: Can one always decide when two regular expressions are -equivalent or not? What does an algorithm look like to decide this -efficiently? So in general it is not a trivial problem. +We will not use them in our algorithm, but feel free to convince +yourself that they actually hold. As an aside, there has been a lot of +research about questions like: Can one always decide when two regular +expressions are equivalent or not? What does an algorithm look like to +decide this efficiently? So in general it is not a trivial problem. \subsection*{The Matching Algorithm} -The algorithm we will define below consists of two parts. One -is the function $\textit{nullable}$ which takes a regular expression as -argument and decides whether it can match the empty string -(this means it returns a boolean in Scala). This can be easily -defined recursively as follows: +The algorithm we will introduce below consists of two parts. One is the +function $\textit{nullable}$ which takes a regular expression as an +argument and decides whether it can match the empty string (this means +it returns a boolean in Scala). This can be easily defined recursively +as follows: \begin{center} \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} @@ -313,10 +330,9 @@ \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r) \] -\noindent Note on the left-hand side of the if-and-only-if we -have a function we can implement; on the right we have its -specification (which we cannot implement in a programming -language). +\noindent Note on the left-hand side of the if-and-only-if we have a +function we can implement, ofr example in Scala; on the right we have +its specification (which we cannot implement in a programming language). The other function of our matching algorithm calculates a \emph{derivative} of a regular expression. This is a function @@ -342,25 +358,23 @@ \end{tabular} \end{center} -\noindent The first two clauses can be rationalised as -follows: recall that $\textit{der}$ should calculate a regular -expression so that given the ``input'' regular expression can -match a string of the form $c\!::\!s$, we want a regular -expression for $s$. Since neither $\ZERO$ nor $\ONE$ -can match a string of the form $c\!::\!s$, we return -$\ZERO$. In the third case we have to make a -case-distinction: In case the regular expression is $c$, then -clearly it can recognise a string of the form $c\!::\!s$, just -that $s$ is the empty string. Therefore we return the -$\ONE$-regular expression. In the other case we again -return $\ZERO$ since no string of the $c\!::\!s$ can be -matched. Next come the recursive cases, which are a bit more -involved. Fortunately, the $+$-case is still relatively -straightforward: all strings of the form $c\!::\!s$ are either -matched by the regular expression $r_1$ or $r_2$. So we just -have to recursively call $\textit{der}$ with these two regular -expressions and compose the results again with $+$. Makes -sense? +\noindent The first two clauses can be rationalised as follows: recall +that $\textit{der}$ should calculate a regular expression so that +provided the ``input'' regular expression can match a string of the +form $c\!::\!s$, we want a regular expression for $s$. Since neither +$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return +$\ZERO$. In the third case we have to make a case-distinction: In case +the regular expression is $c$, then clearly it can recognise a string of +the form $c\!::\!s$, just that $s$ is the empty string. Therefore we +return the $\ONE$-regular expression, as it can match the empty string. +In the other case we again return $\ZERO$ since no string of the +$c\!::\!s$ can be matched. Next come the recursive cases, which are a +bit more involved. Fortunately, the $+$-case is still relatively +straightforward: all strings of the form $c\!::\!s$ are either matched +by the regular expression $r_1$ or $r_2$. So we just have to recursively +call $\textit{der}$ with these two regular expressions and compose the +results again with $+$. Makes sense? + The $\cdot$-case is more complicated: if $r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first part