--- a/handouts/ho03.tex Sat Sep 23 21:22:17 2023 +0100
+++ b/handouts/ho03.tex Sat Sep 23 22:26:52 2023 +0100
@@ -19,9 +19,9 @@
already quite old, only became more widely known rather
recently. Still, let us in this lecture have a closer look at automata
and their relation to regular expressions. This will help us with
-understanding why the regular expression matchers in Python, Ruby and
-Java are so slow with certain regular expressions. On the way we will
-also see what are the limitations of regular
+understanding why the regular expression matchers in Python, Ruby,
+Java and so on are so slow with certain regular expressions. On the
+way we will also see the limitations of regular
expressions. Unfortunately, they cannot be used for \emph{everything}.
@@ -911,10 +911,13 @@
\noindent
OK\ldots now you know why regular expression matchers in those
languages are sometimes so slow. A bit surprising, don't you agree?
+Also, it is still a mystery why Rust, which for the reasons given
+above avoids NFAs and uses DFAs instead, cannot in all cases compete
+with our simple derivative-based regular expression matcher in Scala.
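For reference, the derivative-based matcher alluded to here can be sketched in a few lines of Scala. This is a minimal version without the simplification step discussed in the lectures (the constructor names below follow the usual convention, but may differ from the lecture code):

```scala
// Minimal Brzozowski-derivative matcher (no simplification step);
// a sketch only -- the lecture version adds simplification to keep
// the derivatives small.
abstract class Rexp
case object ZERO extends Rexp                     // matches nothing
case object ONE extends Rexp                      // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// match by taking derivatives character by character
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
```

No automaton is ever built here: the regular expression itself is rewritten step by step as the string is consumed.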
\subsection*{Subset Construction}
-Of course, some developers of regular expression matchers are aware of
+So of course, some clever developers of regular expression matchers are aware of
these problems with sluggish NFAs and try to address them. One common
technique for alleviating the problem I like to show you in this
section. This will also explain why we insisted on polymorphic types in
@@ -939,8 +942,8 @@
that also recognises the same language. This might sound a bit
paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
$\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
-there is always a caveat---nothing ever is for free in life. Let's see what this
-actually means.
+there is always a caveat---nothing in life is ever free. Let's see
+what this caveat is.
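The NFA-to-DFA transformation discussed in this subsection can be made concrete with a small sketch in Scala. The NFA representation below (a set of starting states, a transition function, a set of accepting states) is my own choice for this sketch, not necessarily the one used in the implementation:

```scala
// A sketch of the subset construction: DFA states are *sets* of NFA
// states. We only generate the DFA states actually reachable from
// the starting set.
type State = Int
case class NFA(starts: Set[State],
               delta: (State, Char) => Set[State],
               fins: Set[State])

// one DFA transition: follow c from every NFA state in qs
def next(nfa: NFA, qs: Set[State], c: Char): Set[State] =
  qs.flatMap(q => nfa.delta(q, c))

// collect all reachable DFA states by a work-list exploration
def subsets(nfa: NFA, alphabet: Set[Char]): Set[Set[State]] = {
  def explore(todo: List[Set[State]], seen: Set[Set[State]]): Set[Set[State]] =
    todo match {
      case Nil => seen
      case qs :: rest =>
        val fresh = alphabet.map(c => next(nfa, qs, c)).filterNot(seen).toList
        explore(fresh ::: rest, seen ++ fresh)
    }
  explore(List(nfa.starts), Set(nfa.starts))
}
```

A set of NFA states counts as accepting in the DFA iff it contains at least one accepting NFA state. Note that in the worst case the exploration can visit $2^n$ such sets, which is exactly the caveat mentioned above.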
There are actually a number of methods for transforming a NFA into
an equivalent DFA, but the most famous one is the \emph{subset
@@ -1527,7 +1530,13 @@
have an acceptable memory and runtime behaviour. But remember the
subset construction in the worst case explodes the number of states by
$2^n$. Effectively also the translation to DFAs can incur a big
-runtime penalty.
+runtime penalty.\footnote{Therefore the clever people behind Rust's
+  regex engine try \emph{not} to do such calculations upfront, but
+  rather delay them and in this way avoid much of the penalty\ldots{}in
+  many practically relevant cases. As a result, they make the
+  extraordinary claim that their time complexity is in the worst case
+  $O(m \times n)$, where $m$ is proportional to the size of the regex
+  and $n$ is proportional to the size of the string. Does this claim
+  hold water?}
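The idea behind delaying such calculations can be illustrated with a plain NFA simulation: rather than pre-computing any DFA, one threads the current *set* of reachable NFA states through the input string. Each character then costs at most one pass over the (at most $m$) states, which is where a bound of the shape $O(m \times n)$ comes from. A sketch, with the NFA representation again being my own choice for illustration:

```scala
// Simulating an NFA without any upfront DFA construction: thread the
// current set of states through the string. No backtracking occurs,
// so the running time is bounded by
// (number of NFA states) x (length of string) set operations.
type State = Int
case class NFA(starts: Set[State],
               delta: (State, Char) => Set[State],
               fins: Set[State])

def accepts(nfa: NFA, s: String): Boolean = {
  val last = s.foldLeft(nfa.starts) {
    (qs, c) => qs.flatMap(q => nfa.delta(q, c))
  }
  (last & nfa.fins).nonEmpty
}
```

Caching the state-sets as they arise would build the DFA lazily, only for the parts of it that the input actually exercises; something along these lines is, as far as I understand, what the delayed approach amounts to.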
But this does not mean that everything is bad with automata. Recall
the problem of finding a regular expressions for the language that is
@@ -1576,11 +1585,11 @@
Namely, we can represent automata with infinite states, which is
actually forbidden by the definition of what an automaton is. We can
exploit this flaw as follows: Suppose our alphabet consists of the
-characters $c_1$ to $c_n$. Then we can generate an ``automaton''
-(it is not really one because it has infinite states) by taking
+characters $a$ to $z$. Then we can generate an ``automaton''
+(it is not really one because it has infinitely many states) by taking
as starting state the regular expression $r$ for which we
want to generate an automaton. There are $n$ next-states which
-corresponds to the derivatives of $r$ according to $c_1$ to $c_n$.
+correspond to the derivatives of $r$ according to $a$ to $z$.
Implementing this in our slightly ``flawed'' representation is
not too difficult. This will give a picture for the ``automaton''
looking something like this, except that it extends infinitely
@@ -1590,23 +1599,26 @@
\begin{center}
\begin{tikzpicture}
[level distance=25mm,very thick,auto,
- level 1/.style={sibling distance=30mm},
+ level 1/.style={sibling distance=10mm},
level 2/.style={sibling distance=15mm},
every node/.style={minimum size=30pt,
inner sep=0pt,circle,draw=blue!50,very thick,
fill=blue!20}]
- \node {$r$} [grow=right]
- child[->] {node (cn) {$d_{c_n}$}
- child { node {$dd_{c_nc_n}$}}
- child { node {$dd_{c_nc_1}$}}
- %edge from parent node[left] {$c_n$}
+  \node {$r$} [grow=right]
+  child[->] {node (c1) {$d_{z}$}
+    child { node {$dd_{zz}$}}
+    child { node {$dd_{za}$}}
 }
-  child[->] {node (c1) {$d_{c_1}$}
-    child { node {$dd_{c_1c_n}$}}
-    child { node {$dd_{c_1c_1}$}}
-    %edge from parent node[left] {$c_1$}
-  };
-  %%\draw (cn) -- (c1) node {\vdots};
+  child[->] {}
+  child[->] {}
+  child[->] {}
+  child[->] {node (cn) {$d_{a}$}
+    child { node {$dd_{az}$}}
+    child { node {$dd_{aa}$}}
+  }
+  ;
+\node[draw=none,fill=none] at (3,0.1) {\vdots};
+\node[draw=none,fill=none] at (7,0.1) {\Large\ldots};
\end{tikzpicture}
\end{center}
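In code, this "automaton" needs nothing more than the derivative operation: states are regular expressions and the next-state function for a character $c$ is the derivative with respect to $c$. A sketch (repeating the usual datatype and derivative definition so the snippet is self-contained):

```scala
// States of the "automaton" are regular expressions; the next-state
// function for a character c is der(c, _). Without simplification
// the set of reachable states is in general infinite.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// one "layer" of successor states in the picture above
def successors(r: Rexp, alphabet: List[Char]): List[Rexp] =
  alphabet.map(c => der(c, r))
```

Taking successors repeatedly produces ever larger regular expressions, hence "infinitely many states". Brzozowski's observation was that identifying derivatives up to associativity, commutativity and idempotence of $+$ leaves only finitely many states, turning the picture into a real automaton.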
@@ -1621,7 +1633,7 @@
which Brzozowski already cleverly found out, because there is
a way to restrict the number of states to something finite.
Meaning it would give us a real automaton.
-However, this would lead us far, far away from what we should
+However, this would lead us far, far away from what we want to
discuss here.