afl-material: comparison handouts/ho02.tex

-:c8ce95067c1a
+:5c1fbb39c93e
 \usepackage{../style}
 \usepackage{../langs}
 \usepackage{../graphics}
 \usepackage{../data}
-\pgfplotsset{compat=1.11}
 \begin{document}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
 \section*{Handout 2 (Regular Expression Matching)}
 This lecture is about implementing a more efficient regular
 expression matcher (the plots on the right)---more efficient
 than the matchers from regular expression libraries in Ruby
 and Python (the plots on the left). These plots show the
 running time for the evil regular expression
-$a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed of $n$ \pcode{a}s.
+$a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed of $n$
-We will use this regular expression and strings as running
+\pcode{a}s. We will use this regular expression and strings as
-example. To see the substantial differences in the two plots
+a running example. To see the substantial differences in the two
-below, note the different scales of the $x$-axes.
+plots below, note the different scales of the $x$-axes.
 \begin{center}
 \begin{tabular}{@{}cc@{}}
 \begin{tikzpicture}
-\begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
+\begin{axis}[
+xlabel={strings of \texttt{a}s},
+ylabel={time in secs},
 enlargelimits=false,
 xtick={0,5,...,30},
 xmax=33,
 ymax=35,
 ytick={0,5,...,30},
 table {re-ruby.data};
 \end{axis}
 \end{tikzpicture}
 &
 \begin{tikzpicture}
-\begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
+\begin{axis}[
+xlabel={strings of \texttt{a}s},
+ylabel={time in secs},
 enlargelimits=false,
 xtick={0,3000,...,12000},
 xmax=12500,
 ymax=35,
 ytick={0,5,...,30},
 \end{tabular}
 \end{center}
 \noindent I leave it to you to verify these equivalences and
 non-equivalences. It is also interesting to look at some
-corner cases involving $\epsilon$ and $\varnothing$:
+corner cases involving $\ONE$ and $\ZERO$:
 \begin{center}
 \begin{tabular}{rcl}
-$a \cdot \varnothing$ & $\not\equiv$ & $a$\\
+$a \cdot \ZERO$ & $\not\equiv$ & $a$\\
-$a + \epsilon$ & $\not\equiv$ & $a$\\
+$a + \ONE$ & $\not\equiv$ & $a$\\
-$\epsilon$ & $\equiv$ & $\varnothing^*$\\
+$\ONE$ & $\equiv$ & $\ZERO^*$\\
-$\epsilon^*$ & $\equiv$ & $\epsilon$\\
+$\ONE^*$ & $\equiv$ & $\ONE$\\
-$\varnothing^*$ & $\not\equiv$ & $\varnothing$\\
+$\ZERO^*$ & $\not\equiv$ & $\ZERO$\\
 \end{tabular}
 \end{center}
 \noindent Again I leave it to you to make sure you agree
 with these equivalences and non-equivalences.
 For our matching algorithm, however, the following seven
 equivalences will play an important role:
 \begin{center}
 \begin{tabular}{rcl}
-$r + \varnothing$  & $\equiv$ & $r$\\
+$r + \ZERO$  & $\equiv$ & $r$\\
-$\varnothing + r$  & $\equiv$ & $r$\\
+$\ZERO + r$  & $\equiv$ & $r$\\
-$r \cdot \epsilon$ & $\equiv$ & $r$\\
+$r \cdot \ONE$ & $\equiv$ & $r$\\
-$\epsilon \cdot r$     & $\equiv$ & $r$\\
+$\ONE \cdot r$     & $\equiv$ & $r$\\
-$r \cdot \varnothing$ & $\equiv$ & $\varnothing$\\
+$r \cdot \ZERO$ & $\equiv$ & $\ZERO$\\
-$\varnothing \cdot r$ & $\equiv$ & $\varnothing$\\
+$\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\
 $r + r$ & $\equiv$ & $r$
 \end{tabular}
 \end{center}
 \noindent which always hold no matter what the regular
 expression $r$ looks like. The first two are easy to verify
-since $L(\varnothing)$ is the empty set. The next two are also
+since $L(\ZERO)$ is the empty set. The next two are also
-easy to verify since $L(\epsilon) = \{[]\}$ and appending the
+easy to verify since $L(\ONE) = \{[]\}$ and appending the
 empty string to every string of another set leaves the set
 unchanged. Be careful to fully comprehend the fifth and sixth
 equivalence: if you concatenate two sets of strings and one is
 the empty set, then the concatenation will also be the empty
 set. To see this, check the definition of $\_ @ \_$. The
 equivalences and read them from left to right. In this way we
 can view them as \emph{simplification rules}. Consider for
 example the regular expression
 \begin{equation}
-(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing)
+(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)
 \label{big}
 \end{equation}
 \noindent If we can find an equivalent regular expression that
 is simpler (smaller for example), then this might potentially
 You can verify this by iteratively applying the simplification
 rules from above:
 \begin{center}
 \begin{tabular}{ll}
-& $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot
+& $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot
-(\underline{r_4 \cdot \varnothing})$\smallskip\\
+(\underline{r_4 \cdot \ZERO})$\smallskip\\
-$\equiv$ & $(r_1 + \varnothing) \cdot \epsilon + \underline{((\epsilon + r_2) + r_3) \cdot
+$\equiv$ & $(r_1 + \ZERO) \cdot \ONE + \underline{((\ONE + r_2) + r_3) \cdot
-\varnothing}$\smallskip\\
+\ZERO}$\smallskip\\
-$\equiv$ & $\underline{(r_1 + \varnothing) \cdot \epsilon} + \varnothing$\smallskip\\
+$\equiv$ & $\underline{(r_1 + \ZERO) \cdot \ONE} + \ZERO$\smallskip\\
-$\equiv$ & $(\underline{r_1 + \varnothing}) + \varnothing$\smallskip\\
+$\equiv$ & $(\underline{r_1 + \ZERO}) + \ZERO$\smallskip\\
-$\equiv$ & $\underline{r_1 + \varnothing}$\smallskip\\
+$\equiv$ & $\underline{r_1 + \ZERO}$\smallskip\\
 $\equiv$ & $r_1$\\
 \end{tabular}
 \end{center}
 \noindent In each step, I underlined where a simplification
 rule is applied. Our matching algorithm in the next section
-will often generate such ``useless'' $\epsilon$s and
+will often generate such ``useless'' $\ONE$s and
-$\varnothing$s, therefore simplifying them away will make the
+$\ZERO$s, therefore simplifying them away will make the
 algorithm quite a bit faster.
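 
 The fifth rule, used in the first step of the calculation above,
 can also be checked directly on the languages. The following is
 only a sketch: it assumes the usual definitions
 $L(r_1 \cdot r_2) \dn L(r_1) \,@\, L(r_2)$ and
 $A \,@\, B \dn \{s_1 @ s_2 \,|\, s_1 \in A \wedge s_2 \in B\}$,
 which are not restated in this part of the handout.
 \begin{center}
 \begin{tabular}{lcl}
 $L(r \cdot \ZERO)$ & $=$ & $L(r) \,@\, L(\ZERO)$\\
 & $=$ & $L(r) \,@\, \{\}$\\
 & $=$ & $\{s_1 @ s_2 \,|\, s_1 \in L(r) \wedge s_2 \in \{\}\}$\\
 & $=$ & $\{\}$\\
 & $=$ & $L(\ZERO)$
 \end{tabular}
 \end{center}
 \noindent The fourth line holds because no $s_2$ can be taken
 from the empty set, so no string $s_1 @ s_2$ can be formed.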
 \subsection*{The Matching Algorithm}
 The algorithm we will define below consists of two parts. One
 (this means it returns a boolean in Scala). This can be easily
 defined recursively as follows:
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
-$nullable(\varnothing)$      & $\dn$ & $\textit{false}$\\
+$nullable(\ZERO)$      & $\dn$ & $\textit{false}$\\
-$nullable(\epsilon)$         & $\dn$ & $true$\\
+$nullable(\ONE)$         & $\dn$ & $true$\\
 $nullable(c)$                & $\dn$ & $\textit{false}$\\
 $nullable(r_1 + r_2)$     & $\dn$ &  $nullable(r_1) \vee nullable(r_2)$\\
 $nullable(r_1 \cdot r_2)$ & $\dn$ &  $nullable(r_1) \wedge nullable(r_2)$\\
 $nullable(r^*)$              & $\dn$ & $true$ \\
 \end{tabular}
 expression look like that can match just $s$? The definition
 of this function is as follows:
 \begin{center}
 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
-$der\, c\, (\varnothing)$      & $\dn$ & $\varnothing$\\
+$der\, c\, (\ZERO)$      & $\dn$ & $\ZERO$\\
-$der\, c\, (\epsilon)$         & $\dn$ & $\varnothing$ \\
+$der\, c\, (\ONE)$         & $\dn$ & $\ZERO$ \\
-$der\, c\, (d)$                & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$\\
+$der\, c\, (d)$                & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\
 $der\, c\, (r_1 + r_2)$        & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\
 $der\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $nullable (r_1)$\\
 & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\
 & & else $(der\, c\, r_1) \cdot r_2$\\
 $der\, c\, (r^*)$          & $\dn$ & $(der\,c\,r) \cdot (r^*)$
 \noindent The first two clauses can be rationalised as
 follows: recall that $der$ should calculate a regular
 expression such that, if the ``input'' regular expression can
 match a string of the form $c\!::\!s$, the result is a regular
-expression for $s$. Since neither $\varnothing$ nor $\epsilon$
+expression for $s$. Since neither $\ZERO$ nor $\ONE$
 can match a string of the form $c\!::\!s$, we return
-$\varnothing$. In the third case we have to make a
+$\ZERO$. In the third case we have to make a
 case-distinction: In case the regular expression is $c$, then
 clearly it can recognise a string of the form $c\!::\!s$, just
 that $s$ is the empty string. Therefore we return the
-$\epsilon$-regular expression. In the other case we again
+$\ONE$-regular expression. In the other case we again
-return $\varnothing$ since no string of the $c\!::\!s$ can be
+return $\ZERO$ since no string of the form $c\!::\!s$ can be
 matched. Next come the recursive cases, which are a bit more
 involved. Fortunately, the $+$-case is still relatively
 straightforward: all strings of the form $c\!::\!s$ are either
 matched by the regular expression $r_1$ or $r_2$. So we just
 have to recursively call $der$ with these two regular
 If this did not make sense, here is another way to rationalise
 the definition of $der$ by considering the following operation
 on sets:
-\[
+\begin{equation}\label{Der}
 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
-\]
+\end{equation}
 \noindent This operation essentially transforms a set of
 strings $A$ by filtering out all strings that do not start
 with $c$ and then strips off the $c$ from all the remaining
 strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then
 for \pcode{matches} are shown in Figure~\ref{scala1}.
 \begin{figure}[p]
 \lstinputlisting{../progs/app5.scala}
 \caption{Scala implementation of the nullable and
-derivatives functions. These functions are easy to
+derivative functions. These functions are easy to
 implement in functional languages, because pattern
 matching and recursion allow us to mimic the mathematical
 definitions very closely.\label{scala1}}
 \end{figure}
 \end{center}
 \noindent Our algorithm traverses such regular expressions at
 least once every time a derivative is calculated. So having
 large regular expressions will cause problems. This problem
-is aggravated by $a^?$ being represented as $a + \epsilon$.
+is aggravated by $a^?$ being represented as $a + \ONE$.
 We can however fix this by having an explicit constructor for
 $r^{\{n\}}$. In Scala we would introduce a constructor like
 \begin{center}
 size of regular expressions it needs to handle. This is of
 course obvious because both $nullable$ and $der$ frequently
 need to traverse the whole regular expression. There seems to be,
 however, one more issue for making the algorithm run faster.
 The derivative function often produces ``useless''
-$\varnothing$s and $\epsilon$s. To see this, consider $r = ((a
+$\ZERO$s and $\ONE$s. To see this, consider $r = ((a
 \cdot b) + b)^*$ and the following two derivatives
 \begin{center}
 \begin{tabular}{l}
-$der\,a\,r = ((\epsilon \cdot b) + \varnothing) \cdot r$\\
+$der\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\
-$der\,b\,r = ((\varnothing \cdot b) + \epsilon)\cdot r$\\
+$der\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\
-$der\,c\,r = ((\varnothing \cdot b) + \varnothing)\cdot r$
+$der\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$
 \end{tabular}
 \end{center}
 \noindent
 If we simplify them according to the simple rules from the
 \begin{center}
 \begin{tabular}{l}
 $der\,a\,r \equiv b \cdot r$\\
 $der\,b\,r \equiv r$\\
-$der\,c\,r \equiv \varnothing$
+$der\,c\,r \equiv \ZERO$
 \end{tabular}
 \end{center}
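 \noindent As a sketch of how the first of these results comes
 about, one can unfold the clauses of $der$ for
 $r = ((a \cdot b) + b)^*$ and then apply the simplification
 rules from the beginning of this handout. The calculation uses
 that $nullable(a)$ is false, so the sequence clause takes its
 else-branch:
 \begin{center}
 \begin{tabular}{lcl}
 $der\,a\,r$ & $=$ & $(der\,a\,((a \cdot b) + b)) \cdot r$\\
 & $=$ & $((der\,a\,(a \cdot b)) + (der\,a\,b)) \cdot r$\\
 & $=$ & $(((der\,a\,a) \cdot b) + \ZERO) \cdot r$\\
 & $=$ & $((\ONE \cdot b) + \ZERO) \cdot r$\\
 & $\equiv$ & $(b + \ZERO) \cdot r$\\
 & $\equiv$ & $b \cdot r$
 \end{tabular}
 \end{center}
 \noindent The last two steps use $\ONE \cdot r \equiv r$ and
 $r + \ZERO \equiv r$, applied inside the larger expression.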
 \noindent I leave it to you to contemplate whether such a
 simplification can have any impact on the correctness of our
 \caption{The simplification function and modified
 \texttt{ders}-function; this function now
 calls \texttt{der} first, but then simplifies
 the resulting derivative regular expressions before
 building the next derivative, see
-Line~28.\label{scala2}}
+Line~\ref{simpline}.\label{scala2}}
 \end{figure}
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
 mostly by some form of induction. Remember that regular
 expressions are defined as
 \begin{center}
 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
-$r$ & $::=$ &   $\varnothing$         & null\\
+$r$ & $::=$ &   $\ZERO$         & null language\\
-& $\mid$ & $\epsilon$           & empty string / \texttt{""} / []\\
+& $\mid$ & $\ONE$           & empty string / \texttt{""} / []\\
 & $\mid$ & $c$                  & single character\\
 & $\mid$ & $r_1 + r_2$          & alternative / choice\\
 & $\mid$ & $r_1 \cdot r_2$      & sequence\\
 & $\mid$ & $r^*$                & star (zero or more)\\
 \end{tabular}
 \noindent If you want to show a property $P(r)$ for all
 regular expressions $r$, then you essentially have to follow
 this recipe:
 \begin{itemize}
-\item $P$ has to hold for $\varnothing$, $\epsilon$ and $c$
+\item $P$ has to hold for $\ZERO$, $\ONE$ and $c$
 (these are the base cases).
 \item $P$ has to hold for $r_1 + r_2$ under the assumption
 that $P$ already holds for $r_1$ and $r_2$.
 \item $P$ has to hold for $r_1 \cdot r_2$ under the
 assumption that $P$ already holds for $r_1$ and $r_2$.
 \label{nullableprop}
 \end{equation}
 \noindent
 Let us say that this property is $P(r)$, then the first case
-we need to check is whether $P(\varnothing)$ (see recipe
+we need to check is whether $P(\ZERO)$ holds (see recipe
 above). So we have to show that
 \[
-nullable(\varnothing) \;\;\text{if and only if}\;\;
+nullable(\ZERO) \;\;\text{if and only if}\;\;
-[]\in L(\varnothing)
+[]\in L(\ZERO)
 \]
-\noindent whereby $nullable(\varnothing)$ is by definition of
+\noindent whereby $nullable(\ZERO)$ is by definition of
 the function $nullable$ always $\textit{false}$. We also have
-that $L(\varnothing)$ is by definition $\{\}$. It is
+that $L(\ZERO)$ is by definition $\{\}$. It is
 impossible that the empty string $[]$ is in the empty set.
 Therefore also the right-hand side is false. Consequently we
 verified this case: both sides are false. We would still need
-to do this for $P(\varepsilon)$ and $P(c)$. I leave this to
+to do this for $P(\ONE)$ and $P(c)$. I leave this to
 you to verify.
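 
 As a hedged sketch of how these two remaining base cases might
 go (assuming, as is usual, that $L(c) = \{[c]\}$, and using
 $L(\ONE) = \{[]\}$ from earlier):
 \begin{center}
 \begin{tabular}{lcl}
 $nullable(\ONE) = true$ & and & $[] \in \{[]\} = L(\ONE)$\\
 $nullable(c) = \textit{false}$ & and & $[] \not\in \{[c]\} = L(c)$
 \end{tabular}
 \end{center}
 \noindent In the first case both sides of the property are
 true; in the second case both sides are false.
 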
 Next we need to check the inductive cases, for example
 $P(r_1 + r_2)$, which is
 \begin{equation}
 Ders\,s\,(L(r)) = L(ders\,s\,r)
 \label{dersprop}
 \end{equation}
-\noindent by induction on $s$. In this proof you can assume
+\noindent by induction on $s$. Recall $Der$ is defined for
-the following property for $der$ and $Der$ has already been
+characters (see \eqref{Der}); $Ders$ is similar, but for strings:
-proved, that is you can assume
+\[
+Ders\,s\,A\;\dn\;\{s'\,|\,s @ s' \in A\}
+\]
+\noindent In this proof you can assume that the following
+property for $der$ and $Der$ has already been proved, that is,
+you can assume
 \[
 L(der\,c\,r) = Der\,c\,(L(r))
 \]
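 \noindent With this property, the step case for a string of the
 form $c\!::\!s$ might go roughly as follows. This is only a
 sketch: it assumes that $ders$ satisfies
 $ders\,(c\!::\!s)\,r = ders\,s\,(der\,c\,r)$, which is not
 restated here, and that the induction hypothesis for $s$ is
 available for all regular expressions.
 \begin{center}
 \begin{tabular}{lcl}
 $Ders\,(c\!::\!s)\,(L(r))$ & $=$ & $Ders\,s\,(Der\,c\,(L(r)))$\\
 & $=$ & $Ders\,s\,(L(der\,c\,r))$\\
 & $=$ & $L(ders\,s\,(der\,c\,r))$\\
 & $=$ & $L(ders\,(c\!::\!s)\,r)$
 \end{tabular}
 \end{center}
 \noindent The first step unfolds the definitions of $Ders$ and
 $Der$, since $(c\!::\!s) @ s' = c\!::\!(s @ s')$; the second
 uses the assumed property; the third is the induction
 hypothesis applied to $der\,c\,r$; the last is the assumed
 clause for $ders$.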

changeset 399	5c1fbb39c93e
parent 394	2f9fe225ecc8
child 412	1cef3924f7a2