17 regular expressions come first. The reason is that regular expressions |
17 regular expressions come first. The reason is that regular expressions |
18 are easier to reason about and the notion of derivatives, although |
18 are easier to reason about and the notion of derivatives, although |
19 already quite old, only became more widely known rather |
19 already quite old, only became more widely known rather |
20 recently. Still, let us in this lecture have a closer look at automata |
20 recently. Still, let us in this lecture have a closer look at automata |
21 and their relation to regular expressions. This will help us with |
21 and their relation to regular expressions. This will help us with |
22 understanding why the regular expression matchers in Python, Ruby and |
22 understanding why the regular expression matchers in Python, Ruby, |
23 Java are so slow with certain regular expressions. On the way we will |
23 Java and so on are so slow with certain regular expressions. On the |
24 also see what are the limitations of regular |
24 way we will also see what are the limitations of regular |
25 expressions. Unfortunately, they cannot be used for \emph{everything}. |
25 expressions. Unfortunately, they cannot be used for \emph{everything}. |
26 |
26 |
27 |
27 |
28 \subsection*{Deterministic Finite Automata} |
28 \subsection*{Deterministic Finite Automata} |
29 |
29 |
909 \end{center} |
909 \end{center} |
910 |
910 |
911 \noindent |
911 \noindent |
912 OK\ldots now you know why regular expression matchers in those |
912 OK\ldots now you know why regular expression matchers in those |
913 languages are sometimes so slow. A bit surprising, don't you agree? |
913 languages are sometimes so slow. A bit surprising, don't you agree? |
|
914 Also it is still a mystery why Rust, which because of the reasons |
|
915 above avoids NFAs and uses DFAs instead cannot compete in all cases |
|
916 with our simple derivative-based regular expression matcher in Scala. |
914 |
917 |
915 \subsection*{Subset Construction} |
918 \subsection*{Subset Construction} |
916 |
919 |
917 Of course, some developers of regular expression matchers are aware of |
920 So of course, some clever developers of regular expression matchers are aware of |
918 these problems with sluggish NFAs and try to address them. One common |
921 these problems with sluggish NFAs and try to address them. One common |
919 technique for alleviating the problem I like to show you in this |
922 technique for alleviating the problem I like to show you in this |
920 section. This will also explain why we insisted on polymorphic types in |
923 section. This will also explain why we insisted on polymorphic types in |
921 our DFA code (remember I used \texttt{A} and \texttt{C} for the types |
924 our DFA code (remember I used \texttt{A} and \texttt{C} for the types |
922 of states and the input, see Figure~\ref{dfa} on |
925 of states and the input, see Figure~\ref{dfa} on |
937 a single state. So with DFAs we do not have to search at all. What is |
940 a single state. So with DFAs we do not have to search at all. What is |
938 perhaps interesting is the fact that for every NFA we can find a DFA |
941 perhaps interesting is the fact that for every NFA we can find a DFA |
939 that also recognises the same language. This might sound a bit |
942 that also recognises the same language. This might sound a bit |
940 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA |
943 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA |
941 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course |
944 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course |
942 there is always a caveat---nothing ever is for free in life. Let's see what this |
945 there is always a caveat---nothing ever is for free in life. Let's see |
943 actually means. |
946 what this caveat is. |
944 |
947 |
945 There are actually a number of methods for transforming a NFA into |
948 There are actually a number of methods for transforming a NFA into |
946 an equivalent DFA, but the most famous one is the \emph{subset |
949 an equivalent DFA, but the most famous one is the \emph{subset |
947 construction}. Consider the following NFA where the states are |
950 construction}. Consider the following NFA where the states are |
948 labelled with $0$, $1$ and $2$. |
951 labelled with $0$, $1$ and $2$. |
1525 exponentially the number of states. Therefore when this route is |
1528 exponentially the number of states. Therefore when this route is |
1526 taken, we definitely need to minimise the resulting DFAs in order to |
1529 taken, we definitely need to minimise the resulting DFAs in order to |
1527 have an acceptable memory and runtime behaviour. But remember the |
1530 have an acceptable memory and runtime behaviour. But remember the |
1528 subset construction in the worst case explodes the number of states by |
1531 subset construction in the worst case explodes the number of states by |
1529 $2^n$. Effectively also the translation to DFAs can incur a big |
1532 $2^n$. Effectively also the translation to DFAs can incur a big |
1530 runtime penalty. |
1533 runtime penalty.\footnote{Therefore the clever people in Rust try to |
|
1534 \emph{not} do such calculations upfront, but rather delay them and |
|
1535 in this way can avoid much of the penalties\ldots{}in many practical |
|
1536 relevant places. As a result, they make the extraordinary claim that |
|
1537 their time complexity is in the worst case $O(m \times n)$ where $m$ |
|
1538 is proportional to the size of the regex and $n$ is proportional to |
|
1539 the size of strings. Does this claim hold water?} |
1531 |
1540 |
1532 But this does not mean that everything is bad with automata. Recall |
1541 But this does not mean that everything is bad with automata. Recall |
1533 the problem of finding a regular expressions for the language that is |
1542 the problem of finding a regular expressions for the language that is |
1534 \emph{not} recognised by a regular expression. In our implementation |
1543 \emph{not} recognised by a regular expression. In our implementation |
1535 we added explicitly such a regular expressions because they are useful |
1544 we added explicitly such a regular expressions because they are useful |
1574 one flaw in our representation of automata and transitions as partial |
1583 one flaw in our representation of automata and transitions as partial |
1575 functions....remember I said something about fishy things. |
1584 functions....remember I said something about fishy things. |
1576 Namely, we can represent automata with infinite states, which is |
1585 Namely, we can represent automata with infinite states, which is |
1577 actually forbidden by the definition of what an automaton is. We can |
1586 actually forbidden by the definition of what an automaton is. We can |
1578 exploit this flaw as follows: Suppose our alphabet consists of the |
1587 exploit this flaw as follows: Suppose our alphabet consists of the |
1579 characters $c_1$ to $c_n$. Then we can generate an ``automaton'' |
1588 characters $a$ to $z$. Then we can generate an ``automaton'' |
1580 (it is not really one because it has infinite states) by taking |
1589 (it is not really one because it has infinitely many states) by taking |
1581 as starting state the regular expression $r$ for which we |
1590 as starting state the regular expression $r$ for which we |
1582 want to generate an automaton. There are $n$ next-states which |
1591 want to generate an automaton. There are $n$ next-states which |
1583 corresponds to the derivatives of $r$ according to $c_1$ to $c_n$. |
1592 corresponds to the derivatives of $r$ according to $a$ to $z$. |
1584 Implementing this in our slightly ``flawed'' representation is |
1593 Implementing this in our slightly ``flawed'' representation is |
1585 not too difficult. This will give a picture for the ``automaton'' |
1594 not too difficult. This will give a picture for the ``automaton'' |
1586 looking something like this, except that it extends infinitely |
1595 looking something like this, except that it extends infinitely |
1587 far to the right: |
1596 far to the right: |
1588 |
1597 |
1589 |
1598 |
1590 \begin{center} |
1599 \begin{center} |
1591 \begin{tikzpicture} |
1600 \begin{tikzpicture} |
1592 [level distance=25mm,very thick,auto, |
1601 [level distance=25mm,very thick,auto, |
1593 level 1/.style={sibling distance=30mm}, |
1602 level 1/.style={sibling distance=10mm}, |
1594 level 2/.style={sibling distance=15mm}, |
1603 level 2/.style={sibling distance=15mm}, |
1595 every node/.style={minimum size=30pt, |
1604 every node/.style={minimum size=30pt, |
1596 inner sep=0pt,circle,draw=blue!50,very thick, |
1605 inner sep=0pt,circle,draw=blue!50,very thick, |
1597 fill=blue!20}] |
1606 fill=blue!20}] |
1598 \node {$r$} [grow=right] |
1607 \node {$r$} [grow=right] { |
1599 child[->] {node (cn) {$d_{c_n}$} |
1608 child[->] {node (c1) {$d_{z}$} |
1600 child { node {$dd_{c_nc_n}$}} |
1609 child { node {$dd_{zz}$}} |
1601 child { node {$dd_{c_nc_1}$}} |
1610 child { node {$dd_{za}$}} |
1602 %edge from parent node[left] {$c_n$} |
|
1603 } |
1611 } |
1604 child[->] {node (c1) {$d_{c_1}$} |
1612 child[->] {} |
1605 child { node {$dd_{c_1c_n}$}} |
1613 child[->] {} |
1606 child { node {$dd_{c_1c_1}$}} |
1614 child[->] {} |
1607 %edge from parent node[left] {$c_1$} |
1615 child[->] {node (cn) {$d_{a}$} |
1608 }; |
1616 child { node {$dd_{az}$}} |
1609 %%\draw (cn) -- (c1) node {\vdots}; |
1617 child { node {$dd_{aa}$}} |
|
1618 } |
|
1619 }; |
|
1620 \node[draw=none,fill=none] at (3,0.1) {\vdots}; |
|
1621 \node[draw=none,fill=none] at (7,0.1) {\Large\ldots}; |
1610 \end{tikzpicture} |
1622 \end{tikzpicture} |
1611 \end{center} |
1623 \end{center} |
1612 |
1624 |
1613 \noindent |
1625 \noindent |
1614 I let you implement this ``automaton''. |
1626 I let you implement this ``automaton''. |
1619 algorithm from Lecture 2. There \emph{is} however something interesting |
1631 algorithm from Lecture 2. There \emph{is} however something interesting |
1620 in this construction |
1632 in this construction |
1621 which Brzozowski already cleverly found out, because there is |
1633 which Brzozowski already cleverly found out, because there is |
1622 a way to restrict the number of states to something finite. |
1634 a way to restrict the number of states to something finite. |
1623 Meaning it would give us a real automaton. |
1635 Meaning it would give us a real automaton. |
1624 However, this would lead us far, far away from what we should |
1636 However, this would lead us far, far away from what we want |
1625 discuss here. |
1637 discuss here. |
1626 |
1638 |
1627 |
1639 |
1628 |
1640 |
1629 %\section*{Further Reading} |
1641 %\section*{Further Reading} |