handouts/ho03.tex
changeset 926 42ecc3186944
parent 874 ffe02fd574a5
child 932 5678414a3898
equal deleted inserted replaced
925:ddb521b57e0c 926:42ecc3186944
    17 regular expressions come first. The reason is that regular expressions
    17 regular expressions come first. The reason is that regular expressions
    18 are easier to reason about and the notion of derivatives, although
    18 are easier to reason about and the notion of derivatives, although
    19 already quite old, only became more widely known rather
    19 already quite old, only became more widely known rather
    20 recently. Still, let us in this lecture have a closer look at automata
    20 recently. Still, let us in this lecture have a closer look at automata
    21 and their relation to regular expressions. This will help us with
    21 and their relation to regular expressions. This will help us with
    22 understanding why the regular expression matchers in Python, Ruby and
    22 understanding why the regular expression matchers in Python, Ruby,
    23 Java are so slow with certain regular expressions. On the way we will
    23 Java and so on are so slow with certain regular expressions. On the
    24 also see what are the limitations of regular
    24 way we will also see what are the limitations of regular
    25 expressions. Unfortunately, they cannot be used for \emph{everything}.
    25 expressions. Unfortunately, they cannot be used for \emph{everything}.
    26 
    26 
    27 
    27 
    28 \subsection*{Deterministic Finite Automata}
    28 \subsection*{Deterministic Finite Automata}
    29 
    29 
   909 \end{center}
   909 \end{center}
   910 
   910 
   911 \noindent
   911 \noindent
   912 OK\ldots now you know why regular expression matchers in those
   912 OK\ldots now you know why regular expression matchers in those
   913 languages are sometimes so slow. A bit surprising, don't you agree?
   913 languages are sometimes so slow. A bit surprising, don't you agree?
       
   914 Also it is still a mystery why Rust, which because of the reasons
       
   915 above avoids NFAs and uses DFAs instead cannot compete in all cases
       
   916 with our simple derivative-based regular expression matcher in Scala.
   914 
   917 
   915 \subsection*{Subset Construction}
   918 \subsection*{Subset Construction}
   916 
   919 
   917 Of course, some developers of regular expression matchers are aware of
   920 So of course, some clever developers of regular expression matchers are aware of
   918 these problems with sluggish NFAs and try to address them. One common
   921 these problems with sluggish NFAs and try to address them. One common
   919 technique for alleviating the problem I like to show you in this
   922 technique for alleviating the problem I like to show you in this
   920 section. This will also explain why we insisted on polymorphic types in
   923 section. This will also explain why we insisted on polymorphic types in
   921 our DFA code (remember I used \texttt{A} and \texttt{C} for the types
   924 our DFA code (remember I used \texttt{A} and \texttt{C} for the types
   922 of states and the input, see Figure~\ref{dfa} on
   925 of states and the input, see Figure~\ref{dfa} on
   937 a single state. So with DFAs we do not have to search at all.  What is
   940 a single state. So with DFAs we do not have to search at all.  What is
   938 perhaps interesting is the fact that for every NFA we can find a DFA
   941 perhaps interesting is the fact that for every NFA we can find a DFA
   939 that also recognises the same language. This might sound a bit
   942 that also recognises the same language. This might sound a bit
   940 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
   943 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
   941 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
   944 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
   942 there is always a caveat---nothing ever is for free in life. Let's see what this
   945 there is always a caveat---nothing ever is for free in life. Let's see
   943 actually means.
   946 what this caveat is.
   944 
   947 
   945 There are actually a number of methods for transforming a NFA into
   948 There are actually a number of methods for transforming a NFA into
   946 an equivalent DFA, but the most famous one is the \emph{subset
   949 an equivalent DFA, but the most famous one is the \emph{subset
   947   construction}. Consider the following NFA where the states are
   950   construction}. Consider the following NFA where the states are
   948 labelled with $0$, $1$ and $2$.
   951 labelled with $0$, $1$ and $2$.
  1525 exponentially the number of states. Therefore when this route is
  1528 exponentially the number of states. Therefore when this route is
  1526 taken, we definitely need to minimise the resulting DFAs in order to
  1529 taken, we definitely need to minimise the resulting DFAs in order to
  1527 have an acceptable memory and runtime behaviour. But remember the
  1530 have an acceptable memory and runtime behaviour. But remember the
  1528 subset construction in the worst case explodes the number of states by
  1531 subset construction in the worst case explodes the number of states by
  1529 $2^n$.  Effectively also the translation to DFAs can incur a big
  1532 $2^n$.  Effectively also the translation to DFAs can incur a big
  1530 runtime penalty.
  1533 runtime penalty.\footnote{Therefore the clever people in Rust try to
       
  1534   \emph{not} do such calculations upfront, but rather delay them and
       
  1535   in this way can avoid much of the penalties\ldots{}in many practical
       
  1536   relevant places. As a result, they make the extraordinary claim that
       
  1537   their time complexity is in the worst case $O(m \times n)$ where $m$
       
  1538   is proportional to the size of the regex and $n$ is proportional to
       
  1539   the size of strings. Does this claim hold water?}
  1531 
  1540 
  1532 But this does not mean that everything is bad with automata.  Recall
  1541 But this does not mean that everything is bad with automata.  Recall
  1533 the problem of finding a regular expressions for the language that is
  1542 the problem of finding a regular expressions for the language that is
  1534 \emph{not} recognised by a regular expression. In our implementation
  1543 \emph{not} recognised by a regular expression. In our implementation
  1535 we added explicitly such a regular expressions because they are useful
  1544 we added explicitly such a regular expressions because they are useful
  1574 one flaw in our representation of automata and transitions as partial
  1583 one flaw in our representation of automata and transitions as partial
  1575 functions....remember I said something about fishy things.
  1584 functions....remember I said something about fishy things.
  1576 Namely, we can represent automata with infinite states, which is
  1585 Namely, we can represent automata with infinite states, which is
  1577 actually forbidden by the definition of what an automaton is. We can
  1586 actually forbidden by the definition of what an automaton is. We can
  1578 exploit this flaw as follows: Suppose our alphabet consists of the
  1587 exploit this flaw as follows: Suppose our alphabet consists of the
  1579 characters $c_1$ to $c_n$. Then we can generate an ``automaton''
  1588 characters $a$ to $z$. Then we can generate an ``automaton''
  1580 (it is not really one because it has infinite states) by taking
  1589 (it is not really one because it has infinitely many states) by taking
  1581 as starting state the regular expression $r$ for which we
  1590 as starting state the regular expression $r$ for which we
  1582 want to generate an automaton. There are $n$ next-states which
  1591 want to generate an automaton. There are $n$ next-states which
  1583 corresponds to the derivatives of $r$ according to $c_1$ to $c_n$.
  1592 corresponds to the derivatives of $r$ according to $a$ to $z$.
  1584 Implementing this in our slightly ``flawed'' representation is
  1593 Implementing this in our slightly ``flawed'' representation is
  1585 not too difficult. This will give a picture for the ``automaton''
  1594 not too difficult. This will give a picture for the ``automaton''
  1586 looking something like this, except that it extends infinitely
  1595 looking something like this, except that it extends infinitely
  1587 far to the right:
  1596 far to the right:
  1588 
  1597 
  1589 
  1598 
  1590 \begin{center}
  1599 \begin{center}
  1591 \begin{tikzpicture}
  1600 \begin{tikzpicture}
  1592  [level distance=25mm,very thick,auto,
  1601  [level distance=25mm,very thick,auto,
  1593   level 1/.style={sibling distance=30mm},
  1602   level 1/.style={sibling distance=10mm},
  1594   level 2/.style={sibling distance=15mm},
  1603   level 2/.style={sibling distance=15mm},
  1595   every node/.style={minimum size=30pt,
  1604   every node/.style={minimum size=30pt,
  1596                    inner sep=0pt,circle,draw=blue!50,very thick,
  1605                    inner sep=0pt,circle,draw=blue!50,very thick,
  1597                    fill=blue!20}]
  1606                    fill=blue!20}]
  1598  \node {$r$} [grow=right]
  1607  \node {$r$} [grow=right] {
  1599  child[->] {node (cn) {$d_{c_n}$}
  1608  child[->] {node (c1) {$d_{z}$}
  1600    child { node {$dd_{c_nc_n}$}}
  1609    child { node {$dd_{zz}$}}
  1601    child { node {$dd_{c_nc_1}$}}
  1610    child { node {$dd_{za}$}}
  1602    %edge from parent node[left] {$c_n$}
       
  1603  }
  1611  }
  1604  child[->] {node (c1) {$d_{c_1}$}
  1612  child[->] {}
  1605    child { node {$dd_{c_1c_n}$}}
  1613  child[->] {}
  1606    child { node {$dd_{c_1c_1}$}}
  1614  child[->] {}
  1607    %edge from parent node[left] {$c_1$}
  1615  child[->] {node (cn) {$d_{a}$}
  1608  };
  1616    child { node {$dd_{az}$}}
  1609  %%\draw (cn) -- (c1) node {\vdots}; 
  1617    child { node {$dd_{aa}$}}
       
  1618  }
       
  1619 };
       
  1620 \node[draw=none,fill=none] at (3,0.1) {\vdots};
       
  1621 \node[draw=none,fill=none] at (7,0.1) {\Large\ldots};
  1610 \end{tikzpicture}  
  1622 \end{tikzpicture}  
  1611 \end{center}
  1623 \end{center}
  1612 
  1624 
  1613 \noindent
  1625 \noindent
  1614 I let you implement this ``automaton''.
  1626 I let you implement this ``automaton''.
  1619 algorithm from Lecture 2. There \emph{is} however something interesting
  1631 algorithm from Lecture 2. There \emph{is} however something interesting
  1620 in this construction
  1632 in this construction
  1621 which Brzozowski already cleverly found out, because there is
  1633 which Brzozowski already cleverly found out, because there is
  1622 a way to restrict the number of states to something finite.
  1634 a way to restrict the number of states to something finite.
  1623 Meaning it would give us a real automaton.
  1635 Meaning it would give us a real automaton.
  1624 However, this would lead us far, far away from what we should
  1636 However, this would lead us far, far away from what we want
  1625 discuss here.
  1637 discuss here.
  1626 
  1638 
  1627 
  1639 
  1628 
  1640 
  1629 %\section*{Further Reading}
  1641 %\section*{Further Reading}