handouts/ho03.tex
changeset 926 42ecc3186944
parent 874 ffe02fd574a5
child 932 5678414a3898
--- a/handouts/ho03.tex	Sat Sep 23 21:22:17 2023 +0100
+++ b/handouts/ho03.tex	Sat Sep 23 22:26:52 2023 +0100
@@ -19,9 +19,9 @@
 already quite old, only became more widely known rather
 recently. Still, let us in this lecture have a closer look at automata
 and their relation to regular expressions. This will help us with
-understanding why the regular expression matchers in Python, Ruby and
-Java are so slow with certain regular expressions. On the way we will
-also see what are the limitations of regular
+understanding why the regular expression matchers in Python, Ruby,
+Java and so on are so slow with certain regular expressions. On the
+way we will also see the limitations of regular
 expressions. Unfortunately, they cannot be used for \emph{everything}.
 
 
@@ -911,10 +911,13 @@
 \noindent
 OK\ldots now you know why regular expression matchers in those
 languages are sometimes so slow. A bit surprising, don't you agree?
+Also, it is still a mystery why Rust, which for the reasons above
+avoids NFAs and uses DFAs instead, cannot in all cases compete with
+our simple derivative-based regular expression matcher in Scala.
 
 \subsection*{Subset Construction}
 
-Of course, some developers of regular expression matchers are aware of
+So of course, some clever developers of regular expression matchers are aware of
 these problems with sluggish NFAs and try to address them. One common
technique for alleviating the problem I would like to show you in this
 section. This will also explain why we insisted on polymorphic types in
@@ -939,8 +942,8 @@
 that also recognises the same language. This might sound a bit
 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
-there is always a caveat---nothing ever is for free in life. Let's see what this
-actually means.
+there is always a caveat---nothing ever is for free in life. Let's see
+what this caveat is.
 
There are actually a number of methods for transforming an NFA into
 an equivalent DFA, but the most famous one is the \emph{subset
@@ -1527,7 +1530,13 @@
 have an acceptable memory and runtime behaviour. But remember the
 subset construction in the worst case explodes the number of states by
 $2^n$.  Effectively also the translation to DFAs can incur a big
-runtime penalty.
+runtime penalty.\footnote{Therefore the clever people behind Rust's
+  regex engine try \emph{not} to do such calculations upfront, but
+  rather delay them and in this way avoid much of the penalty\ldots{}in
+  many practically relevant cases. As a result, they make the
+  extraordinary claim that their worst-case time complexity is
+  $O(m \times n)$, where $m$ is proportional to the size of the regex
+  and $n$ is proportional to the size of the string. Does this claim
+  hold water?}
 
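The delaying mentioned in the footnote can be pictured with a small
sketch (hypothetical code, not Rust's actual engine): DFA states are
sets of NFA states, as in the subset construction, but each transition
is only computed, and then memoised, the first time the input string
actually needs it.

```scala
import scala.collection.mutable

// Hypothetical sketch of an "online" subset construction: instead of
// building all 2^n DFA states upfront, compute a DFA transition only
// when it is first needed and cache it for later re-use.
type NState = Int          // an NFA state
type DState = Set[NState]  // a DFA state = set of NFA states

def lazyDfaMatch(delta: (NState, Char) => Set[NState],  // NFA transitions
                 starts: Set[NState],                   // NFA start states
                 fins: Set[NState],                     // NFA final states
                 s: String): Boolean = {
  val cache = mutable.Map[(DState, Char), DState]()
  // one DFA step: the union of the NFA successors, memoised
  def step(q: DState, c: Char): DState =
    cache.getOrElseUpdate((q, c), q.flatMap(delta(_, c)))
  val last = s.foldLeft(starts)(step)
  (last & fins).nonEmpty   // accept if some final NFA state is reached
}
```

Processing a string of length $n$ in this way touches at most $n$
transitions, which is where a bound proportional to the input size can
come from, even though the full DFA may have exponentially many states.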
 But this does not mean that everything is bad with automata.  Recall
 the problem of finding a regular expressions for the language that is
@@ -1576,11 +1585,11 @@
 Namely, we can represent automata with infinite states, which is
 actually forbidden by the definition of what an automaton is. We can
 exploit this flaw as follows: Suppose our alphabet consists of the
-characters $c_1$ to $c_n$. Then we can generate an ``automaton''
-(it is not really one because it has infinite states) by taking
+characters $a$ to $z$. Then we can generate an ``automaton''
+(it is not really one because it has infinitely many states) by taking
 as starting state the regular expression $r$ for which we
want to generate an automaton. There are then 26 next-states, which
-corresponds to the derivatives of $r$ according to $c_1$ to $c_n$.
+correspond to the derivatives of $r$ according to $a$ to $z$.
 Implementing this in our slightly ``flawed'' representation is
 not too difficult. This will give a picture for the ``automaton''
 looking something like this, except that it extends infinitely
@@ -1590,23 +1599,26 @@
 \begin{center}
 \begin{tikzpicture}
  [level distance=25mm,very thick,auto,
-  level 1/.style={sibling distance=30mm},
+  level 1/.style={sibling distance=10mm},
   level 2/.style={sibling distance=15mm},
   every node/.style={minimum size=30pt,
                    inner sep=0pt,circle,draw=blue!50,very thick,
                    fill=blue!20}]
- \node {$r$} [grow=right]
- child[->] {node (cn) {$d_{c_n}$}
-   child { node {$dd_{c_nc_n}$}}
-   child { node {$dd_{c_nc_1}$}}
-   %edge from parent node[left] {$c_n$}
+ \node {$r$} [grow=right] {
+ child[->] {node (c1) {$d_{z}$}
+   child { node {$dd_{zz}$}}
+   child { node {$dd_{za}$}}
  }
- child[->] {node (c1) {$d_{c_1}$}
-   child { node {$dd_{c_1c_n}$}}
-   child { node {$dd_{c_1c_1}$}}
-   %edge from parent node[left] {$c_1$}
- };
- %%\draw (cn) -- (c1) node {\vdots}; 
+ child[->] {}
+ child[->] {}
+ child[->] {}
+ child[->] {node (cn) {$d_{a}$}
+   child { node {$dd_{az}$}}
+   child { node {$dd_{aa}$}}
+ }
+};
+\node[draw=none,fill=none] at (3,0.1) {\vdots};
+\node[draw=none,fill=none] at (7,0.1) {\Large\ldots};
 \end{tikzpicture}  
 \end{center}
 
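The construction behind this picture can be sketched in Scala roughly
as follows (a minimal illustration, not the matcher from the lectures;
the datatype and the names \texttt{der} and \texttt{nexts} are
assumptions for this sketch).

```scala
// Minimal regular expressions and Brzozowski derivatives; each
// derivative of r is a "next state" of the (possibly infinite) automaton.
abstract class Rexp
case object ZERO extends Rexp                 // matches no string
case object ONE extends Rexp                  // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// Brzozowski derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// the 26 next-states of the "state" r, one per character 'a' to 'z'
def nexts(r: Rexp): List[Rexp] = ('a' to 'z').toList.map(der(_, r))
```

Without simplification the generated regular expressions keep growing,
which is exactly why the picture extends infinitely to the right; the
restriction Brzozowski found amounts to identifying enough equivalent
derivatives so that only finitely many states remain.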
@@ -1621,7 +1633,7 @@
 which Brzozowski already cleverly found out, because there is
 a way to restrict the number of states to something finite.
 Meaning it would give us a real automaton.
-However, this would lead us far, far away from what we should
+However, this would lead us far, far away from what we want to
 discuss here.