diff -r ddb521b57e0c -r 42ecc3186944 handouts/ho03.tex
--- a/handouts/ho03.tex Sat Sep 23 21:22:17 2023 +0100
+++ b/handouts/ho03.tex Sat Sep 23 22:26:52 2023 +0100
@@ -19,9 +19,9 @@
 already quite old, only became more widely known rather
 recently. Still, let us in this lecture have a closer look at automata
 and their relation to regular expressions. This will help us with
-understanding why the regular expression matchers in Python, Ruby and
-Java are so slow with certain regular expressions. On the way we will
-also see what are the limitations of regular
+understanding why the regular expression matchers in Python, Ruby,
+Java and so on are so slow with certain regular expressions. On the
+way we will also see what are the limitations of regular
 expressions. Unfortunately, they cannot be used for
 \emph{everything}.
 
@@ -911,10 +911,13 @@
 \noindent
 OK\ldots now you know why regular expression matchers in those
 languages are sometimes so slow. A bit surprising, don't you agree?
+Also it is still a mystery why Rust, which, because of the reasons
+above, avoids NFAs and uses DFAs instead, cannot compete in all cases
+with our simple derivative-based regular expression matcher in Scala.
 
 \subsection*{Subset Construction}
 
-Of course, some developers of regular expression matchers are aware of
+So of course, some clever developers of regular expression matchers are aware of
 these problems with sluggish NFAs and try to address them. One common
 technique for alleviating the problem I like to show you in this
 section. This will also explain why we insisted on polymorphic types in
@@ -939,8 +942,8 @@
 that also recognises the same language. This might sound a bit
 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
-there is always a caveat---nothing ever is for free in life. Let's see what this
-actually means.
+there is always a caveat---nothing ever is for free in life. Let's see
+what this caveat is.
 
 There are actually a number of methods for transforming a NFA into
 an equivalent DFA, but the most famous one is the \emph{subset
@@ -1527,7 +1530,13 @@
 have an acceptable memory and runtime behaviour. But remember the
 subset construction in the worst case explodes the number of states
 by $2^n$. Effectively also the translation to DFAs can incur a big
-runtime penalty.
+runtime penalty.\footnote{Therefore the clever people in Rust try to
+  \emph{not} do such calculations upfront, but rather delay them and
+  in this way can avoid much of the penalty\ldots{}in many practically
+  relevant cases. As a result, they make the extraordinary claim that
+  their time complexity is in the worst case $O(m \times n)$ where $m$
+  is proportional to the size of the regex and $n$ is proportional to
+  the size of the string. Does this claim hold water?}
 
 But this does not mean that everything is bad with automata. Recall
 the problem of finding a regular expressions for the language that is
@@ -1576,11 +1585,11 @@
 Namely, we can represent automata with infinite states, which is
 actually forbidden by the definition of what an automaton is. We can
 exploit this flaw as follows: Suppose our alphabet consists of the
-characters $c_1$ to $c_n$. Then we can generate an ``automaton''
-(it is not really one because it has infinite states) by taking
+characters $a$ to $z$. Then we can generate an ``automaton''
+(it is not really one because it has infinitely many states) by taking
 as starting state the regular expression $r$ for which we want to
 generate an automaton. There are $n$ next-states which
-corresponds to the derivatives of $r$ according to $c_1$ to $c_n$.
+correspond to the derivatives of $r$ according to $a$ to $z$.
 Implementing this in our slightly ``flawed'' representation is not
 too difficult. This will give a picture for the ``automaton''
 looking something like this, except that it extends infinitely
@@ -1590,23 +1599,26 @@
 \begin{center}
 \begin{tikzpicture}
   [level distance=25mm,very thick,auto,
-   level 1/.style={sibling distance=30mm},
+   level 1/.style={sibling distance=10mm},
    level 2/.style={sibling distance=15mm},
    every node/.style={minimum size=30pt,
    inner sep=0pt,circle,draw=blue!50,very thick,
    fill=blue!20}]
-  \node {$r$} [grow=right]
-  child[->] {node (cn) {$d_{c_n}$}
-    child { node {$dd_{c_nc_n}$}}
-    child { node {$dd_{c_nc_1}$}}
-    %edge from parent node[left] {$c_n$}
+  \node {$r$} [grow=right] {
+  child[->] {node (c1) {$d_{z}$}
+    child { node {$dd_{zz}$}}
+    child { node {$dd_{za}$}}
   }
-  child[->] {node (c1) {$d_{c_1}$}
-    child { node {$dd_{c_1c_n}$}}
-    child { node {$dd_{c_1c_1}$}}
-    %edge from parent node[left] {$c_1$}
-  };
-  %%\draw (cn) -- (c1) node {\vdots};
+  child[->] {}
+  child[->] {}
+  child[->] {}
+  child[->] {node (cn) {$d_{a}$}
+    child { node {$dd_{az}$}}
+    child { node {$dd_{aa}$}}
+  }
+};
+\node[draw=none,fill=none] at (3,0.1) {\vdots};
+\node[draw=none,fill=none] at (7,0.1) {\Large\ldots};
 \end{tikzpicture}
 \end{center}
 
@@ -1621,7 +1633,7 @@
 which Brzozowski already cleverly found out, because there is a way
 to restrict the number of states to something finite. Meaning it
 would give us a real automaton.
-However, this would lead us far, far away from what we should
+However, this would lead us far, far away from what we want to
 discuss here.
 
 
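For review, the final hunks describe the ``automaton'' whose states are regular expressions and whose transitions are derivatives. That idea can be sketched in Scala as follows; this is a minimal sketch following the standard Brzozowski presentation, not necessarily the handout's exact code (the names `Rexp`, `der`, `nullable` and `matcher` are assumptions):

```scala
// Sketch of a Brzozowski-derivative matcher: the states of the
// infinite "automaton" are regular expressions, and the transition
// from state r on character c is the derivative der(c, r).
abstract class Rexp
case object ZERO extends Rexp                     // matches nothing
case object ONE extends Rexp                      // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// can r match the empty string? (is the state r accepting?)
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// derivative of r with respect to c: the "next state" reached on input c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// run the "automaton": fold the input through successive derivatives,
// then test whether the final state is accepting
def matcher(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
```

Each call to `der` computes one transition of the infinite ``automaton''; Brzozowski's observation that, up to suitable simplifications, only finitely many distinct derivatives arise is exactly what would turn this picture into a real automaton.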