afl-material: comparison handouts/ho02.tex

equal deleted inserted replaced

-:fba480bbc9f7
+:eb9343126625
 \usepackage{../graphics}
 \usepackage{../data}
 \begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}
+\fnote{\copyright{} Christian Urban, King's College London,
+2014, 2015, 2016, 2017, 2018, 2019, 2020}
 \section*{Handout 2 (Regular Expression Matching)}
+\noindent
+{\bf Checklist}
+\begin{itemize}
+\item You have understood the derivative-based matching algorithm.
+\item You know how the derivative is related to the meaning of regular
+expressions.
+\item You can extend the algorithm to non-basic regular expressions.
+\end{itemize}\bigskip\bigskip\bigskip
+\noindent
 This lecture is about implementing a more efficient regular expression
 matcher (the plots on the right below)---more efficient than the
 matchers from regular expression libraries in Ruby, Python, JavaScript
 and Java (the plots on the left). For this consider some experimental
 data: The first pair of plots shows the running time for the
 regular expression $(a^*)^*\cdot b$ and strings composed of $n$
-\pcode{a}s (meaning this regular expression actually does not match
+\pcode{a}s, like
-the strings). The second pair of plots shows the running time for the
+\[
-regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings also
+\pcode{"}\!\underbrace{\pcode{a}\ldots\pcode{a}}_{n}\!\pcode{"}
-composed of $n$ \pcode{a}s (this time the regular expressions match
+\]
-the strings).  To see the substantial differences in the left and
+\noindent
+This means the regular expression actually does not match the strings.
+The second pair of plots shows the running time for the regular
+expressions of the form $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and corresponding
+strings composed of $n$ \pcode{a}s (this time the regular expressions
+match the strings).  To see the substantial differences in the left and
 right plots below, note the different scales of the $x$-axes.
 \begin{center}
 Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
 \noindent
 In what follows we will use these regular expressions and strings as
 running examples. There will be several versions (V1, V2, V3,\ldots)
 of our matcher.\footnote{The corresponding files are
 \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can
-find the code on KEATS.}\bigskip
+find the code on KEATS.}
-\noindent
 Having specified in the previous lecture what
 problem our regular expression matcher is supposed to solve,
 namely for any given regular expression $r$ and string $s$
 answer \textit{true} if and only if
 means when two regular expressions are equivalent.
 \subsection*{Regular Expression Equivalences}
 We already defined in Handout 1 what it means for two regular
-expressions to be equivalent, namely if their meaning is the
+expressions to be equivalent, namely whether their
-same language:
+\emph{meaning} is the same language:
 \[
 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
 \]
 $\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\
 $r + r$ & $\equiv$ & $r$
 \end{tabular}
 \end{center}
-\noindent which always hold no matter what the regular expression $r$
+\noindent They always hold no matter what the regular expression $r$
 looks like. The first two are easy to verify since $L(\ZERO)$ is the
 empty set. The next two are also easy to verify since $L(\ONE) =
 \{[]\}$ and appending the empty string to every string of another set,
 leaves the set unchanged. Be careful to fully comprehend the fifth and
 sixth equivalence: if you concatenate two sets of strings and one is
 the empty set, then the concatenation will also be the empty set. To
 see this, check the definition of $\_ @ \_$ for sets. The last
 equivalence is again trivial.
-What will be important later on is that we can orient these
+What will be critical later on is that we can orient these
 equivalences and read them from left to right. In this way we
 can view them as \emph{simplification rules}. Consider for
 example the regular expression
 \begin{equation}
 \label{big}
 \end{equation}
 \noindent If we can find an equivalent regular expression that is
 simpler (that usually means smaller), then this might potentially make
-our matching algorithm run faster. We can look for such a simpler
+our matching algorithm run faster. We can look for such a simpler, but
-regular expression $r'$ because whether a string $s$ is in $L(r)$ or
+equivalent, regular expression $r'$ because whether a string $s$ is in
-in $L(r')$ with $r\equiv r'$ will always give the same answer. Yes?
+$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes?
-In the example above you will see that the regular expression is
+In the example above you will see that the regular expression in
-equivalent to just $r_1$. You can verify this by iteratively applying
+\eqref{big} is equivalent to just $r_1$. You can verify this by
-the simplification rules from above:
+iteratively applying the simplification rules from above:
 \begin{center}
 \begin{tabular}{ll}
 & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot
 (\underline{r_4 \cdot \ZERO})$\smallskip\\
 Finally here are three equivalences between regular expressions which are
 not so obvious:
 \begin{center}
 \begin{tabular}{rcl}
-$r^*$  & $\equiv$ & $1 + r\cdot r^*$\\
+$r^*$  & $\equiv$ & $\ONE + r\cdot r^*$\\
 $(r_1 + r_2)^*$  & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\
-$(r_1 \cdot r_2)^*$ & $\equiv$ & $1 + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\
+$(r_1 \cdot r_2)^*$ & $\equiv$ & $\ONE + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\
 \end{tabular}
 \end{center}
 \noindent
-We will not use them in our algorithm, but feel free to convince yourself
+We will not use them in our algorithm, but feel free to convince
-that they hold. As an aside, there has been a lot of research about
+yourself that they actually hold. As an aside, there has been a lot of
-questions like: Can one always decide when two regular expressions are
+research about questions like: Can one always decide when two regular
-equivalent or not? What does an algorithm look like to decide this
+expressions are equivalent or not? What does an algorithm look like to
-efficiently? So in general it is not a trivial problem.
+decide this efficiently? So in general it is not a trivial problem.
 \subsection*{The Matching Algorithm}
-The algorithm we will define below consists of two parts. One
+The algorithm we will introduce below consists of two parts. One is the
-is the function $\textit{nullable}$ which takes a regular expression as
+function $\textit{nullable}$ which takes a regular expression as an
-argument and decides whether it can match the empty string
+argument and decides whether it can match the empty string (this means
-(this means it returns a boolean in Scala). This can be easily
+it returns a boolean in Scala). This can be easily defined recursively
-defined recursively as follows:
+as follows:
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
 $\textit{nullable}(\ZERO)$      & $\dn$ & $\textit{false}$\\
 $\textit{nullable}(\ONE)$         & $\dn$ & $\textit{true}$\\
 \[
 \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
 \]
-\noindent Note on the left-hand side of the if-and-only-if we
+\noindent Note on the left-hand side of the if-and-only-if we have a
-have a function we can implement; on the right we have its
+function we can implement, ofr example in Scala; on the right we have
-specification (which we cannot implement in a programming
+its specification (which we cannot implement in a programming language).
-language).
 The other function of our matching algorithm calculates a
 \emph{derivative} of a regular expression. This is a function
 which will take a regular expression, say $r$, and a
 character, say $c$, as arguments and returns a new regular
 & & else $(\textit{der}\, c\, r_1) \cdot r_2$\\
 $\textit{der}\, c\, (r^*)$          & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$
 \end{tabular}
 \end{center}
-\noindent The first two clauses can be rationalised as
+\noindent The first two clauses can be rationalised as follows: recall
-follows: recall that $\textit{der}$ should calculate a regular
+that $\textit{der}$ should calculate a regular expression so that
-expression so that given the ``input'' regular expression can
+provided  the ``input'' regular expression can match a string of the
-match a string of the form $c\!::\!s$, we want a regular
+form $c\!::\!s$, we want a regular expression for $s$. Since neither
-expression for $s$. Since neither $\ZERO$ nor $\ONE$
+$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return
-can match a string of the form $c\!::\!s$, we return
+$\ZERO$. In the third case we have to make a case-distinction: In case
-$\ZERO$. In the third case we have to make a
+the regular expression is $c$, then clearly it can recognise a string of
-case-distinction: In case the regular expression is $c$, then
+the form $c\!::\!s$, just that $s$ is the empty string. Therefore we
-clearly it can recognise a string of the form $c\!::\!s$, just
+return the $\ONE$-regular expression, as it can match the empty string.
-that $s$ is the empty string. Therefore we return the
+In the other case we again return $\ZERO$ since no string of the
-$\ONE$-regular expression. In the other case we again
+$c\!::\!s$ can be matched. Next come the recursive cases, which are a
-return $\ZERO$ since no string of the $c\!::\!s$ can be
+bit more involved. Fortunately, the $+$-case is still relatively
-matched. Next come the recursive cases, which are a bit more
+straightforward: all strings of the form $c\!::\!s$ are either matched
-involved. Fortunately, the $+$-case is still relatively
+by the regular expression $r_1$ or $r_2$. So we just have to recursively
-straightforward: all strings of the form $c\!::\!s$ are either
+call $\textit{der}$ with these two regular expressions and compose the
-matched by the regular expression $r_1$ or $r_2$. So we just
+results again with $+$. Makes sense?
-have to recursively call $\textit{der}$ with these two regular
-expressions and compose the results again with $+$. Makes
-sense?
 The $\cdot$-case is more complicated: if $r_1\cdot r_2$
 matches a string of the form $c\!::\!s$, then the first part
 must be matched by $r_1$. Consequently, it makes sense to
 construct the regular expression for $s$ by calling $\textit{der}$ with

changeset 727	eb9343126625
parent 646	2abd285c66d1
child 757	ea0be0662be0