--- a/handouts/ho02.tex Sat Sep 23 21:22:17 2023 +0100
+++ b/handouts/ho02.tex Sat Sep 23 22:26:52 2023 +0100
@@ -8,7 +8,7 @@
\begin{document}
\fnote{\copyright{} Christian Urban, King's College London,
- 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022}
+ 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023}
\section*{Handout 2 (Regular Expression Matching)}
@@ -26,9 +26,9 @@
\noindent
This lecture is about implementing a more efficient regular expression
matcher (the plots on the right below)---more efficient than the
-matchers from regular expression libraries in Ruby, Python, JavaScript
+matchers from regular expression libraries in Ruby, Python, JavaScript, Swift
and Java (the plots on the left).\footnote{Have a look at KEATS: students
-last year contributed also data for the Swift and Dart languages.}\medskip
+last year contributed also data for the Dart language.}\medskip
\noindent
To start with let us look more closely at the experimental data: The
@@ -63,13 +63,14 @@
scaled ticks=false,
axis lines=left,
width=5cm,
- height=5cm,
- legend entries={Java 8, Python, JavaScript},
+ height=4.5cm,
+ legend entries={Java 8, Python, JavaScript, Swift},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
+\addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
&
@@ -85,7 +86,7 @@
axis lines=left,
%scaled ticks=false,
width=6.5cm,
- height=5cm,
+ height=4.5cm,
legend entries={Our matcher},
legend pos=north east,
legend cell align=left]
@@ -112,7 +113,7 @@
scaled ticks=false,
axis lines=left,
width=5cm,
- height=5cm,
+ height=4.55cm,
legend entries={Python,Ruby},
legend pos=north west,
legend cell align=left]
@@ -134,7 +135,7 @@
scaled ticks=false,
axis lines=left,
width=6.5cm,
- height=5cm,
+ height=4.5cm,
legend entries={Our matcher},
legend pos=north east,
legend cell align=left]
@@ -264,7 +265,7 @@
simpler (that usually means smaller), then this might potentially make
our matching algorithm run faster. We can look for such a simpler, but
equivalent, regular expression $r'$ because whether a string $s$ is in
-$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes?
+$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? \footnote{You have checked this for yourself? Your friendly lecturer might talk rubbish\ldots{}one never knows.}
In the example above you will see that the regular expression in
\eqref{big} is equivalent to just $r_1$. You can verify this by
@@ -303,7 +304,7 @@
\noindent
We will not use them in our algorithm, but feel free to convince
yourself that they actually hold. As an aside, there has been a lot of
-research about questions like: Can one always decide when two regular
+research on questions like: Can one always decide whether two regular
expressions are equivalent or not? What does an algorithm look like to
decide this efficiently? Surprisingly, many of such questions
turn out to be non-trivial problems.
@@ -311,11 +312,11 @@
\subsection*{The Matching Algorithm}
-The algorithm we will introduce below consists of two parts. One is the
-function $\textit{nullable}$ which takes a regular expression as an
-argument and decides whether it can match the empty string (this means
-it returns a boolean in Scala). This can be easily defined recursively
-as follows:
+The regular expression matching algorithm we will introduce below
+consists of two parts: One is the function $\textit{nullable}$ which
+takes a regular expression as an argument and decides whether it can
+match the empty string (this means it returns a boolean in
+Scala). This can be easily defined recursively as follows:
\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
@@ -365,20 +366,21 @@
\noindent The first two clauses can be rationalised as follows: recall
that $\textit{der}$ should calculate a regular expression so that
-provided the ``input'' regular expression can match a string of the
+provided the ``input'' regular expression can match a string of the
form $c\!::\!s$, we want a regular expression for $s$. Since neither
-$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return
-$\ZERO$. In the third case we have to make a case-distinction: In case
-the regular expression is $c$, then clearly it can recognise a string of
-the form $c\!::\!s$, just that $s$ is the empty string. Therefore we
-return the $\ONE$-regular expression, as it can match the empty string.
-In the other case we again return $\ZERO$ since no string of the
-$c\!::\!s$ can be matched. Next come the recursive cases, which are a
-bit more involved. Fortunately, the $+$-case is still relatively
-straightforward: all strings of the form $c\!::\!s$ are either matched
-by the regular expression $r_1$ or $r_2$. So we just have to recursively
-call $\textit{der}$ with these two regular expressions and compose the
-results again with $+$. Makes sense?
+$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we
+return $\ZERO$. In the third case we have to make a case-distinction:
+In case the regular expression is $c$, then clearly it can recognise a
+string of the form $c\!::\!s$, just that $s$ is the empty
+string. Therefore we return the $\ONE$-regular expression in this
+case, as it can match the empty string. In the other case we again
+return $\ZERO$ since no string of the $c\!::\!s$ can be matched. Next
+come the recursive cases, which are a bit more involved. Fortunately,
+the $+$-case is still relatively straightforward: all strings of the
+form $c\!::\!s$ are either matched by the regular expression $r_1$ or
+$r_2$. So we just have to recursively call $\textit{der}$ with these
+two regular expressions and compose the results again with $+$. Makes
+sense?
The $\cdot$-case is more complicated: if $r_1\cdot r_2$
@@ -1064,11 +1066,22 @@
\noindent which is the property we set out to prove: our algorithm
meets its specification. To have done so, requires a few induction
-proofs about strings and regular expressions. Following the \emph{induction
-recipes} is already a big step in actually performing these proofs.
-If you do not believe it, proofs have helped me to make sure my code
-is correct and in several instances prevented me of letting slip
-embarrassing mistakes into the `wild'.
+proofs about strings and regular expressions. Following the
+\emph{induction recipes} is already a big step in actually performing
+these proofs. If you do not believe it, proofs have helped me to make
+sure my code is correct and in several instances prevented me of
+letting slip embarrassing mistakes into the `wild'. In fact I have
+found a number of mistakes in the brilliant work by Sulzmann and Lu,
+which we are going to have a look at in Lecture 4. But in Lecture 3 we
+should first find out why on earth are existing regular expression
+matchers so abysmally slow. Are the people in Python, Ruby, Swift,
+JavaScript, Java and also in Rust\footnote{Interestingly the regex
+ engine in Rust says it guarantees a linear behaviour when deciding
+ when a regular expression matches a
+ string. \here{https://youtu.be/3N_ywhx6_K0?t=31}} just idiots? For
+example could it possibly be that what we have implemented here in
+Scala is faster than the regex engine that has been implemented in
+Rust? See you next week\ldots
\end{document}