updated
authorChristian Urban <christian.urban@kcl.ac.uk>
Sat, 23 Sep 2023 22:26:52 +0100
changeset 926 42ecc3186944
parent 925 ddb521b57e0c
child 927 ef54868a9226
updated
handouts/ho01.tex
handouts/ho02.pdf
handouts/ho02.tex
handouts/ho03.pdf
handouts/ho03.tex
--- a/handouts/ho01.tex	Sat Sep 23 21:22:17 2023 +0100
+++ b/handouts/ho01.tex	Sat Sep 23 22:26:52 2023 +0100
@@ -117,7 +117,7 @@
 such patterns. The idea behind regular expressions is that
 they are a simple method for describing languages (or sets of
 strings)\ldots at least languages we are interested in in
-computer science. For example there is no convenient regular
+Computer Science. For example there is no convenient regular
 expression for describing the English language short of
 enumerating all English words. But they seem useful for
 describing, for example, simple email addresses.\footnote{See
@@ -215,7 +215,7 @@
 
 Regular expressions were introduced by Kleene in the 1950s and they
 have been the object of intense study since then. They are nowadays pretty
-much ubiquitous in computer science. There are many libraries
+much ubiquitous in Computer Science. There are many libraries
 implementing regular expressions. I am sure you have come across them
 before (remember the PRA or PEP modules?). 
 
Binary file handouts/ho02.pdf has changed
--- a/handouts/ho02.tex	Sat Sep 23 21:22:17 2023 +0100
+++ b/handouts/ho02.tex	Sat Sep 23 22:26:52 2023 +0100
@@ -8,7 +8,7 @@
 
 \begin{document}
 \fnote{\copyright{} Christian Urban, King's College London, 
-  2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022}
+  2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023}
 
 
 \section*{Handout 2 (Regular Expression Matching)}
@@ -26,9 +26,9 @@
 \noindent
 This lecture is about implementing a more efficient regular expression
 matcher (the plots on the right below)---more efficient than the
-matchers from regular expression libraries in Ruby, Python, JavaScript
+matchers from regular expression libraries in Ruby, Python, JavaScript, Swift
 and Java (the plots on the left).\footnote{Have a look at KEATS: students
-last year contributed also data for the Swift and Dart languages.}\medskip
+last year also contributed data for the Dart language.}\medskip
 
 \noindent
 To start with let us look more closely at the experimental data: The
@@ -63,13 +63,14 @@
     scaled ticks=false,
     axis lines=left,
     width=5cm,
-    height=5cm, 
-    legend entries={Java 8, Python, JavaScript},  
+    height=4.5cm, 
+    legend entries={Java 8, Python, JavaScript, Swift},  
     legend pos=north west,
     legend cell align=left]
 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
+\addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data};
 \end{axis}
 \end{tikzpicture}
 &
@@ -85,7 +86,7 @@
     axis lines=left,
     %scaled ticks=false,
     width=6.5cm,
-    height=5cm,
+    height=4.5cm,
     legend entries={Our matcher},  
     legend pos=north east,
     legend cell align=left]
@@ -112,7 +113,7 @@
     scaled ticks=false,
     axis lines=left,
     width=5cm,
-    height=5cm, 
+    height=4.5cm, 
     legend entries={Python,Ruby},  
     legend pos=north west,
     legend cell align=left]
@@ -134,7 +135,7 @@
     scaled ticks=false,
     axis lines=left,
     width=6.5cm,
-    height=5cm,
+    height=4.5cm,
     legend entries={Our matcher},  
     legend pos=north east,
     legend cell align=left]
@@ -264,7 +265,7 @@
 simpler (that usually means smaller), then this might potentially make
 our matching algorithm run faster. We can look for such a simpler, but
 equivalent, regular expression $r'$ because whether a string $s$ is in
-$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes?
+$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes?\footnote{Have you checked this for yourself? Your friendly lecturer might talk rubbish\ldots{}one never knows.}
 
 In the example above you will see that the regular expression in
 \eqref{big} is equivalent to just $r_1$. You can verify this by
@@ -303,7 +304,7 @@
 \noindent
 We will not use them in our algorithm, but feel free to convince
 yourself that they actually hold. As an aside, there has been a lot of
-research about questions like: Can one always decide when two regular
+research on questions like: Can one always decide whether two regular
 expressions are equivalent or not? What does an algorithm look like to
 decide this efficiently? Surprisingly, many such questions
 turn out to be non-trivial problems.
@@ -311,11 +312,11 @@
 
 \subsection*{The Matching Algorithm}
 
-The algorithm we will introduce below consists of two parts. One is the
-function $\textit{nullable}$ which takes a regular expression as an
-argument and decides whether it can match the empty string (this means
-it returns a boolean in Scala). This can be easily defined recursively
-as follows:
+The regular expression matching algorithm we will introduce below
+consists of two parts: One is the function $\textit{nullable}$ which
+takes a regular expression as an argument and decides whether it can
+match the empty string (this means it returns a boolean in
+Scala). This can be easily defined recursively as follows:
 
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
@@ -365,20 +366,21 @@
 
 \noindent The first two clauses can be rationalised as follows: recall
 that $\textit{der}$ should calculate a regular expression so that
-provided  the ``input'' regular expression can match a string of the
+provided the ``input'' regular expression can match a string of the
 form $c\!::\!s$, we want a regular expression for $s$. Since neither
-$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return
-$\ZERO$. In the third case we have to make a case-distinction: In case
-the regular expression is $c$, then clearly it can recognise a string of
-the form $c\!::\!s$, just that $s$ is the empty string. Therefore we
-return the $\ONE$-regular expression, as it can match the empty string.
-In the other case we again return $\ZERO$ since no string of the
-$c\!::\!s$ can be matched. Next come the recursive cases, which are a
-bit more involved. Fortunately, the $+$-case is still relatively
-straightforward: all strings of the form $c\!::\!s$ are either matched
-by the regular expression $r_1$ or $r_2$. So we just have to recursively
-call $\textit{der}$ with these two regular expressions and compose the
-results again with $+$. Makes sense?
+$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we
+return $\ZERO$. In the third case we have to make a case-distinction:
+In case the regular expression is $c$, then clearly it can recognise a
+string of the form $c\!::\!s$, just that $s$ is the empty
+string. Therefore we return the $\ONE$-regular expression in this
+case, as it can match the empty string.  In the other case we again
+return $\ZERO$ since no string of the form $c\!::\!s$ can be matched. Next
+come the recursive cases, which are a bit more involved. Fortunately,
+the $+$-case is still relatively straightforward: all strings of the
+form $c\!::\!s$ are either matched by the regular expression $r_1$ or
+$r_2$. So we just have to recursively call $\textit{der}$ with these
+two regular expressions and compose the results again with $+$. Makes
+sense?
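The functions $\textit{nullable}$ and $\textit{der}$, together with the matcher built from them, can be sketched in Scala along the following lines (a sketch: the constructor names are illustrative, and the $\cdot$- and $*$-clauses shown are the standard derivative clauses):

```scala
// Regular expressions as a Scala datatype.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// nullable: can the regular expression match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// der: the derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>   // the more complicated sequence case
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// extend der from characters to strings, then match by
// checking whether the final derivative is nullable
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
```

For example, `matcher(SEQ(CHAR('a'), STAR(CHAR('b'))), "abb")` evaluates to `true`, while the same regular expression fails to match `"ba"`.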
 
 
 The $\cdot$-case is more complicated: if $r_1\cdot r_2$
@@ -1064,11 +1066,22 @@
 
 \noindent which is the property we set out to prove: our algorithm
 meets its specification. To have done so, requires a few induction
-proofs about strings and regular expressions. Following the \emph{induction
-recipes} is already a big step in actually performing these proofs.
-If you do not believe it, proofs have helped me to make sure my code
-is correct and in several instances prevented me of letting slip
-embarrassing mistakes into the `wild'. 
+proofs about strings and regular expressions. Following the
+\emph{induction recipes} is already a big step in actually performing
+these proofs. If you do not believe it: proofs have helped me to make
+sure my code is correct and in several instances prevented me from
+letting embarrassing mistakes slip into the `wild'. In fact I have
+found a number of mistakes in the brilliant work by Sulzmann and Lu,
+which we are going to have a look at in Lecture 4. But in Lecture 3 we
+should first find out why on earth existing regular expression
+matchers are so abysmally slow. Are the people behind Python, Ruby, Swift,
+JavaScript, Java and also Rust\footnote{Interestingly the regex
+  engine in Rust says it guarantees linear behaviour when deciding
+  whether a regular expression matches a
+  string. \here{https://youtu.be/3N_ywhx6_K0?t=31}} just idiots? For
+example, could it possibly be that what we have implemented here in
+Scala is faster than the regex engine implemented in
+Rust? See you next week\ldots
 
 \end{document}
 
Binary file handouts/ho03.pdf has changed
--- a/handouts/ho03.tex	Sat Sep 23 21:22:17 2023 +0100
+++ b/handouts/ho03.tex	Sat Sep 23 22:26:52 2023 +0100
@@ -19,9 +19,9 @@
 already quite old, only became more widely known rather
 recently. Still, let us in this lecture have a closer look at automata
 and their relation to regular expressions. This will help us with
-understanding why the regular expression matchers in Python, Ruby and
-Java are so slow with certain regular expressions. On the way we will
-also see what are the limitations of regular
+understanding why the regular expression matchers in Python, Ruby,
+Java and so on are so slow with certain regular expressions. On the
+way we will also see what the limitations of regular
 expressions are. Unfortunately, they cannot be used for \emph{everything}.
 
 
@@ -911,10 +911,13 @@
 \noindent
 OK\ldots now you know why regular expression matchers in those
 languages are sometimes so slow. A bit surprising, don't you agree?
+Also it is still a mystery why Rust, which for the reasons
+above avoids NFAs and uses DFAs instead, cannot compete in all cases
+with our simple derivative-based regular expression matcher in Scala.
 
 \subsection*{Subset Construction}
 
-Of course, some developers of regular expression matchers are aware of
+So of course, some clever developers of regular expression matchers are aware of
 these problems with sluggish NFAs and try to address them. One common
 technique for alleviating the problem I would like to show you in this
 section. This will also explain why we insisted on polymorphic types in
@@ -939,8 +942,8 @@
 that also recognises the same language. This might sound a bit
 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
-there is always a caveat---nothing ever is for free in life. Let's see what this
-actually means.
+there is always a caveat---nothing ever is for free in life. Let's see
+what this caveat is.
 
 There are actually a number of methods for transforming an NFA into
 an equivalent DFA, but the most famous one is the \emph{subset
@@ -1527,7 +1530,13 @@
 have an acceptable memory and runtime behaviour. But remember the
 subset construction in the worst case explodes the number of states by
 $2^n$. Effectively, the translation to DFAs can also incur a big
-runtime penalty.
+runtime penalty.\footnote{Therefore the clever people in Rust try to
+  \emph{not} do such calculations upfront, but rather delay them and
+  in this way can avoid much of the penalties\ldots{}in many practically
+  relevant places. As a result, they make the extraordinary claim that
+  their time complexity is in the worst case $O(m \times n)$ where $m$
+  is proportional to the size of the regex and $n$ is proportional to
+  the size of strings. Does this claim hold water?}
 
 But this does not mean that everything is bad with automata.  Recall
 the problem of finding a regular expressions for the language that is
@@ -1576,11 +1585,11 @@
 Namely, we can represent automata with infinitely many states, which is
 actually forbidden by the definition of what an automaton is. We can
 exploit this flaw as follows: Suppose our alphabet consists of the
-characters $c_1$ to $c_n$. Then we can generate an ``automaton''
-(it is not really one because it has infinite states) by taking
+characters $a$ to $z$. Then we can generate an ``automaton''
+(it is not really one because it has infinitely many states) by taking
 as starting state the regular expression $r$ for which we
 want to generate an automaton. There are 26 next-states, which
-corresponds to the derivatives of $r$ according to $c_1$ to $c_n$.
+correspond to the derivatives of $r$ according to $a$ to $z$.
 Implementing this in our slightly ``flawed'' representation is
 not too difficult. This will give a picture for the ``automaton''
 looking something like this, except that it extends infinitely
@@ -1590,23 +1599,26 @@
 \begin{center}
 \begin{tikzpicture}
  [level distance=25mm,very thick,auto,
-  level 1/.style={sibling distance=30mm},
+  level 1/.style={sibling distance=10mm},
   level 2/.style={sibling distance=15mm},
   every node/.style={minimum size=30pt,
                    inner sep=0pt,circle,draw=blue!50,very thick,
                    fill=blue!20}]
- \node {$r$} [grow=right]
- child[->] {node (cn) {$d_{c_n}$}
-   child { node {$dd_{c_nc_n}$}}
-   child { node {$dd_{c_nc_1}$}}
-   %edge from parent node[left] {$c_n$}
+ \node {$r$} [grow=right] {
+ child[->] {node (c1) {$d_{z}$}
+   child { node {$dd_{zz}$}}
+   child { node {$dd_{za}$}}
  }
- child[->] {node (c1) {$d_{c_1}$}
-   child { node {$dd_{c_1c_n}$}}
-   child { node {$dd_{c_1c_1}$}}
-   %edge from parent node[left] {$c_1$}
- };
- %%\draw (cn) -- (c1) node {\vdots}; 
+ child[->] {}
+ child[->] {}
+ child[->] {}
+ child[->] {node (cn) {$d_{a}$}
+   child { node {$dd_{az}$}}
+   child { node {$dd_{aa}$}}
+ }
+};
+\node[draw=none,fill=none] at (3,0.1) {\vdots};
+\node[draw=none,fill=none] at (7,0.1) {\Large\ldots};
 \end{tikzpicture}  
 \end{center}
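Generating such derivative ``states'' can be sketched in Scala as follows (a hypothetical sketch using the standard derivative definitions; the point is that without simplification the number of distinct derivatives can grow without bound, matching the infinite picture above):

```scala
// Regular expressions and their derivatives (standard definitions).
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// all "states" (derivatives) reachable from r in at most n steps;
// each derivative is a next-state of the "automaton" in the picture
def states(r: Rexp, alphabet: List[Char], n: Int): Set[Rexp] =
  if (n == 0) Set(r)
  else Set(r) ++ alphabet.flatMap(c => states(der(c, r), alphabet, n - 1))
```

Already for $a^*$ over the single-letter alphabet $\{a\}$ every derivative step produces a syntactically new expression, so the state set keeps growing step by step; restricting this to something finite is exactly what Brzozowski's observation below is about.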
 
@@ -1621,7 +1633,7 @@
 which Brzozowski already cleverly found out, because there is
 a way to restrict the number of states to something finite.
 Meaning it would give us a real automaton.
-However, this would lead us far, far away from what we should
+However, this would lead us far, far away from what we want
 to discuss here.