afl-material: comparison handouts/ho02.tex

equal deleted inserted replaced

-:ddb521b57e0c
+:42ecc3186944
 \usepackage{../data}
 \begin{document}
 \fnote{\copyright{} Christian Urban, King's College London,
-2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022}
+2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023}
 \section*{Handout 2 (Regular Expression Matching)}
 %\noindent
 %\end{itemize}\bigskip\bigskip\bigskip
 \noindent
 This lecture is about implementing a more efficient regular expression
 matcher (the plots on the right below)---more efficient than the
-matchers from regular expression libraries in Ruby, Python, JavaScript
+matchers from regular expression libraries in Ruby, Python, JavaScript, Swift
 and Java (the plots on the left).\footnote{Have a look at KEATS: students
-last year contributed also data for the Swift and Dart languages.}\medskip
+last year contributed also data for the Dart language.}\medskip
 \noindent
 To start with let us look more closely at the experimental data: The
 first pair of plots shows the running time for the regular expression
 $(a^*)^*\cdot b$ and strings composed of $n$ \pcode{a}s, like
 ymax=35,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=5cm,
-height=5cm,
+height=4.5cm,
-legend entries={Java 8, Python, JavaScript},
+legend entries={Java 8, Python, JavaScript, Swift},
 legend pos=north west,
 legend cell align=left]
 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
+\addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data};
 \end{axis}
 \end{tikzpicture}
 &
 \begin{tikzpicture}[baseline=(current bounding box.north)]
 \begin{axis}[
 ymax=35,
 ytick={0,5,...,30},
 axis lines=left,
 %scaled ticks=false,
 width=6.5cm,
-height=5cm,
+height=4.5cm,
 legend entries={Our matcher},
 legend pos=north east,
 legend cell align=left]
 %\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
 ymax=35,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=5cm,
-height=5cm,
+height=4.55cm,
 legend entries={Python,Ruby},
 legend pos=north west,
 legend cell align=left]
 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
 ymax=35,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=6.5cm,
-height=5cm,
+height=4.5cm,
 legend entries={Our matcher},
 legend pos=north east,
 legend cell align=left]
 %\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
 \noindent If we can find an equivalent regular expression that is
 simpler (that usually means smaller), then this might potentially make
 our matching algorithm run faster. We can look for such a simpler, but
 equivalent, regular expression $r'$ because whether a string $s$ is in
-$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes?
+$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? \footnote{You have checked this for yourself? Your friendly lecturer might talk rubbish\ldots{}one never knows.}
 In the example above you will see that the regular expression in
 \eqref{big} is equivalent to just $r_1$. You can verify this by
 iteratively applying the simplification rules from above:
 \end{center}
 \noindent
 We will not use them in our algorithm, but feel free to convince
 yourself that they actually hold. As an aside, there has been a lot of
-research about questions like: Can one always decide when two regular
+research on questions like: Can one always decide whether two regular
 expressions are equivalent or not? What does an algorithm look like to
 decide this efficiently? Surprisingly, many of such questions
 turn out to be non-trivial problems.
 \subsection*{The Matching Algorithm}
-The algorithm we will introduce below consists of two parts. One is the
+The regular expression matching algorithm we will introduce below
-function $\textit{nullable}$ which takes a regular expression as an
+consists of two parts: One is the function $\textit{nullable}$ which
-argument and decides whether it can match the empty string (this means
+takes a regular expression as an argument and decides whether it can
-it returns a boolean in Scala). This can be easily defined recursively
+match the empty string (this means it returns a boolean in
-as follows:
+Scala). This can be easily defined recursively as follows:
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
 $\textit{nullable}(\ZERO)$      & $\dn$ & $\textit{false}$\\
 $\textit{nullable}(\ONE)$         & $\dn$ & $\textit{true}$\\
 \end{tabular}
 \end{center}
 \noindent The first two clauses can be rationalised as follows: recall
 that $\textit{der}$ should calculate a regular expression so that
-provided  the ``input'' regular expression can match a string of the
+provided the ``input'' regular expression can match a string of the
 form $c\!::\!s$, we want a regular expression for $s$. Since neither
-$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return
+$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we
-$\ZERO$. In the third case we have to make a case-distinction: In case
+return $\ZERO$. In the third case we have to make a case-distinction:
-the regular expression is $c$, then clearly it can recognise a string of
+In case the regular expression is $c$, then clearly it can recognise a
-the form $c\!::\!s$, just that $s$ is the empty string. Therefore we
+string of the form $c\!::\!s$, just that $s$ is the empty
-return the $\ONE$-regular expression, as it can match the empty string.
+string. Therefore we return the $\ONE$-regular expression in this
-In the other case we again return $\ZERO$ since no string of the
+case, as it can match the empty string.  In the other case we again
-$c\!::\!s$ can be matched. Next come the recursive cases, which are a
+return $\ZERO$ since no string of the $c\!::\!s$ can be matched. Next
-bit more involved. Fortunately, the $+$-case is still relatively
+come the recursive cases, which are a bit more involved. Fortunately,
-straightforward: all strings of the form $c\!::\!s$ are either matched
+the $+$-case is still relatively straightforward: all strings of the
-by the regular expression $r_1$ or $r_2$. So we just have to recursively
+form $c\!::\!s$ are either matched by the regular expression $r_1$ or
-call $\textit{der}$ with these two regular expressions and compose the
+$r_2$. So we just have to recursively call $\textit{der}$ with these
-results again with $+$. Makes sense?
+two regular expressions and compose the results again with $+$. Makes
+sense?
 The $\cdot$-case is more complicated: if $r_1\cdot r_2$
 matches a string of the form $c\!::\!s$, then the first part
 must be matched by $r_1$. Consequently, it makes sense to
 s\in L(r)
 \]
 \noindent which is the property we set out to prove: our algorithm
 meets its specification. To have done so, requires a few induction
-proofs about strings and regular expressions. Following the \emph{induction
+proofs about strings and regular expressions. Following the
-recipes} is already a big step in actually performing these proofs.
+\emph{induction recipes} is already a big step in actually performing
-If you do not believe it, proofs have helped me to make sure my code
+these proofs.  If you do not believe it, proofs have helped me to make
-is correct and in several instances prevented me of letting slip
+sure my code is correct and in several instances prevented me of
-embarrassing mistakes into the `wild'.
+letting slip embarrassing mistakes into the `wild'. In fact I have
+found a number of mistakes in the brilliant work by Sulzmann and Lu,
+which we are going to have a look at in Lecture 4. But in Lecture 3 we
+should first find out why on earth are existing regular expression
+matchers so abysmally slow. Are the people in Python, Ruby, Swift,
+JavaScript, Java and also in Rust\footnote{Interestingly the regex
+engine in Rust says it guarantees a linear behaviour when deciding
+when a regular expression matches a
+string. \here{https://youtu.be/3N_ywhx6_K0?t=31}} just idiots? For
+example could it possibly be that what we have implemented here in
+Scala is faster than the regex engine that has been implemented in
+Rust? See you next week\ldots
 \end{document}

changeset 926	42ecc3186944
parent 874	ffe02fd574a5
child 932	5678414a3898