handouts/ho02.tex
changeset 926 42ecc3186944
parent 874 ffe02fd574a5
child 932 5678414a3898
equal deleted inserted replaced
925:ddb521b57e0c 926:42ecc3186944
     6 \usepackage{../data}
     6 \usepackage{../data}
     7 
     7 
     8 
     8 
     9 \begin{document}
     9 \begin{document}
    10 \fnote{\copyright{} Christian Urban, King's College London, 
    10 \fnote{\copyright{} Christian Urban, King's College London, 
    11   2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022}
    11   2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023}
    12 
    12 
    13 
    13 
    14 \section*{Handout 2 (Regular Expression Matching)}
    14 \section*{Handout 2 (Regular Expression Matching)}
    15 
    15 
    16 %\noindent
    16 %\noindent
    24 %\end{itemize}\bigskip\bigskip\bigskip
    24 %\end{itemize}\bigskip\bigskip\bigskip
    25 
    25 
    26 \noindent
    26 \noindent
    27 This lecture is about implementing a more efficient regular expression
    27 This lecture is about implementing a more efficient regular expression
    28 matcher (the plots on the right below)---more efficient than the
    28 matcher (the plots on the right below)---more efficient than the
    29 matchers from regular expression libraries in Ruby, Python, JavaScript
    29 matchers from regular expression libraries in Ruby, Python, JavaScript, Swift
    30 and Java (the plots on the left).\footnote{Have a look at KEATS: students
    30 and Java (the plots on the left).\footnote{Have a look at KEATS: students
    31 last year contributed also data for the Swift and Dart languages.}\medskip
    31 last year contributed also data for the Dart language.}\medskip
    32 
    32 
    33 \noindent
    33 \noindent
    34 To start with let us look more closely at the experimental data: The
    34 To start with let us look more closely at the experimental data: The
    35 first pair of plots shows the running time for the regular expression
    35 first pair of plots shows the running time for the regular expression
    36 $(a^*)^*\cdot b$ and strings composed of $n$ \pcode{a}s, like
    36 $(a^*)^*\cdot b$ and strings composed of $n$ \pcode{a}s, like
    61     ymax=35,
    61     ymax=35,
    62     ytick={0,5,...,30},
    62     ytick={0,5,...,30},
    63     scaled ticks=false,
    63     scaled ticks=false,
    64     axis lines=left,
    64     axis lines=left,
    65     width=5cm,
    65     width=5cm,
    66     height=5cm, 
    66     height=4.5cm, 
    67     legend entries={Java 8, Python, JavaScript},  
    67     legend entries={Java 8, Python, JavaScript, Swift},  
    68     legend pos=north west,
    68     legend pos=north west,
    69     legend cell align=left]
    69     legend cell align=left]
    70 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
    70 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
    71 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
    71 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
    72 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
    72 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
       
    73 \addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data};
    73 \end{axis}
    74 \end{axis}
    74 \end{tikzpicture}
    75 \end{tikzpicture}
    75 &
    76 &
    76 \begin{tikzpicture}[baseline=(current bounding box.north)]
    77 \begin{tikzpicture}[baseline=(current bounding box.north)]
    77   \begin{axis}[
    78   \begin{axis}[
    83     ymax=35,
    84     ymax=35,
    84     ytick={0,5,...,30},
    85     ytick={0,5,...,30},
    85     axis lines=left,
    86     axis lines=left,
    86     %scaled ticks=false,
    87     %scaled ticks=false,
    87     width=6.5cm,
    88     width=6.5cm,
    88     height=5cm,
    89     height=4.5cm,
    89     legend entries={Our matcher},  
    90     legend entries={Our matcher},  
    90     legend pos=north east,
    91     legend pos=north east,
    91     legend cell align=left]
    92     legend cell align=left]
    92 %\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};    
    93 %\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};    
    93 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
    94 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
   110     ymax=35,
   111     ymax=35,
   111     ytick={0,5,...,30},
   112     ytick={0,5,...,30},
   112     scaled ticks=false,
   113     scaled ticks=false,
   113     axis lines=left,
   114     axis lines=left,
   114     width=5cm,
   115     width=5cm,
   115     height=5cm, 
   116     height=4.55cm, 
   116     legend entries={Python,Ruby},  
   117     legend entries={Python,Ruby},  
   117     legend pos=north west,
   118     legend pos=north west,
   118     legend cell align=left]
   119     legend cell align=left]
   119 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
   120 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
   120 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};  
   121 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};  
   132     ymax=35,
   133     ymax=35,
   133     ytick={0,5,...,30},
   134     ytick={0,5,...,30},
   134     scaled ticks=false,
   135     scaled ticks=false,
   135     axis lines=left,
   136     axis lines=left,
   136     width=6.5cm,
   137     width=6.5cm,
   137     height=5cm,
   138     height=4.5cm,
   138     legend entries={Our matcher},  
   139     legend entries={Our matcher},  
   139     legend pos=north east,
   140     legend pos=north east,
   140     legend cell align=left]
   141     legend cell align=left]
   141 %\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
   142 %\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
   142 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
   143 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
   262 
   263 
   263 \noindent If we can find an equivalent regular expression that is
   264 \noindent If we can find an equivalent regular expression that is
   264 simpler (that usually means smaller), then this might potentially make
   265 simpler (that usually means smaller), then this might potentially make
   265 our matching algorithm run faster. We can look for such a simpler, but
   266 our matching algorithm run faster. We can look for such a simpler, but
   266 equivalent, regular expression $r'$ because whether a string $s$ is in
   267 equivalent, regular expression $r'$ because whether a string $s$ is in
   267 $L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes?
   268 $L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? \footnote{You have checked this for yourself? Your friendly lecturer might talk rubbish\ldots{}one never knows.}
   268 
   269 
   269 In the example above you will see that the regular expression in
   270 In the example above you will see that the regular expression in
   270 \eqref{big} is equivalent to just $r_1$. You can verify this by
   271 \eqref{big} is equivalent to just $r_1$. You can verify this by
   271 iteratively applying the simplification rules from above:
   272 iteratively applying the simplification rules from above:
   272 
   273 
   301 \end{center}
   302 \end{center}
   302 
   303 
   303 \noindent
   304 \noindent
   304 We will not use them in our algorithm, but feel free to convince
   305 We will not use them in our algorithm, but feel free to convince
   305 yourself that they actually hold. As an aside, there has been a lot of
   306 yourself that they actually hold. As an aside, there has been a lot of
   306 research about questions like: Can one always decide when two regular
   307 research on questions like: Can one always decide whether two regular
   307 expressions are equivalent or not? What does an algorithm look like to
   308 expressions are equivalent or not? What does an algorithm look like to
   308 decide this efficiently? Surprisingly, many of such questions
   309 decide this efficiently? Surprisingly, many of such questions
   309 turn out to be non-trivial problems.
   310 turn out to be non-trivial problems.
   310 
   311 
   311 
   312 
   312 \subsection*{The Matching Algorithm}
   313 \subsection*{The Matching Algorithm}
   313 
   314 
   314 The algorithm we will introduce below consists of two parts. One is the
   315 The regular expression matching algorithm we will introduce below
   315 function $\textit{nullable}$ which takes a regular expression as an
   316 consists of two parts: One is the function $\textit{nullable}$ which
   316 argument and decides whether it can match the empty string (this means
   317 takes a regular expression as an argument and decides whether it can
   317 it returns a boolean in Scala). This can be easily defined recursively
   318 match the empty string (this means it returns a boolean in
   318 as follows:
   319 Scala). This can be easily defined recursively as follows:
   319 
   320 
   320 \begin{center}
   321 \begin{center}
   321 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
   322 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
   322 $\textit{nullable}(\ZERO)$      & $\dn$ & $\textit{false}$\\
   323 $\textit{nullable}(\ZERO)$      & $\dn$ & $\textit{false}$\\
   323 $\textit{nullable}(\ONE)$         & $\dn$ & $\textit{true}$\\
   324 $\textit{nullable}(\ONE)$         & $\dn$ & $\textit{true}$\\
   363   \end{tabular}
   364   \end{tabular}
   364 \end{center}
   365 \end{center}
   365 
   366 
   366 \noindent The first two clauses can be rationalised as follows: recall
   367 \noindent The first two clauses can be rationalised as follows: recall
   367 that $\textit{der}$ should calculate a regular expression so that
   368 that $\textit{der}$ should calculate a regular expression so that
   368 provided  the ``input'' regular expression can match a string of the
   369 provided the ``input'' regular expression can match a string of the
   369 form $c\!::\!s$, we want a regular expression for $s$. Since neither
   370 form $c\!::\!s$, we want a regular expression for $s$. Since neither
   370 $\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return
   371 $\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we
   371 $\ZERO$. In the third case we have to make a case-distinction: In case
   372 return $\ZERO$. In the third case we have to make a case-distinction:
   372 the regular expression is $c$, then clearly it can recognise a string of
   373 In case the regular expression is $c$, then clearly it can recognise a
   373 the form $c\!::\!s$, just that $s$ is the empty string. Therefore we
   374 string of the form $c\!::\!s$, just that $s$ is the empty
   374 return the $\ONE$-regular expression, as it can match the empty string.
   375 string. Therefore we return the $\ONE$-regular expression in this
   375 In the other case we again return $\ZERO$ since no string of the
   376 case, as it can match the empty string.  In the other case we again
   376 $c\!::\!s$ can be matched. Next come the recursive cases, which are a
   377 return $\ZERO$ since no string of the $c\!::\!s$ can be matched. Next
   377 bit more involved. Fortunately, the $+$-case is still relatively
   378 come the recursive cases, which are a bit more involved. Fortunately,
   378 straightforward: all strings of the form $c\!::\!s$ are either matched
   379 the $+$-case is still relatively straightforward: all strings of the
   379 by the regular expression $r_1$ or $r_2$. So we just have to recursively
   380 form $c\!::\!s$ are either matched by the regular expression $r_1$ or
   380 call $\textit{der}$ with these two regular expressions and compose the
   381 $r_2$. So we just have to recursively call $\textit{der}$ with these
   381 results again with $+$. Makes sense?
   382 two regular expressions and compose the results again with $+$. Makes
       
   383 sense?
   382 
   384 
   383 
   385 
   384 The $\cdot$-case is more complicated: if $r_1\cdot r_2$
   386 The $\cdot$-case is more complicated: if $r_1\cdot r_2$
   385 matches a string of the form $c\!::\!s$, then the first part
   387 matches a string of the form $c\!::\!s$, then the first part
   386 must be matched by $r_1$. Consequently, it makes sense to
   388 must be matched by $r_1$. Consequently, it makes sense to
  1062 s\in L(r)
  1064 s\in L(r)
  1063 \]
  1065 \]
  1064 
  1066 
  1065 \noindent which is the property we set out to prove: our algorithm
  1067 \noindent which is the property we set out to prove: our algorithm
  1066 meets its specification. To have done so, requires a few induction
  1068 meets its specification. To have done so, requires a few induction
  1067 proofs about strings and regular expressions. Following the \emph{induction
  1069 proofs about strings and regular expressions. Following the
  1068 recipes} is already a big step in actually performing these proofs.
  1070 \emph{induction recipes} is already a big step in actually performing
  1069 If you do not believe it, proofs have helped me to make sure my code
  1071 these proofs.  If you do not believe it, proofs have helped me to make
  1070 is correct and in several instances prevented me of letting slip
  1072 sure my code is correct and in several instances prevented me of
  1071 embarrassing mistakes into the `wild'. 
  1073 letting slip embarrassing mistakes into the `wild'. In fact I have
       
  1074 found a number of mistakes in the brilliant work by Sulzmann and Lu,
       
  1075 which we are going to have a look at in Lecture 4. But in Lecture 3 we
       
  1076 should first find out why on earth are existing regular expression
       
  1077 matchers so abysmally slow. Are the people in Python, Ruby, Swift,
       
  1078 JavaScript, Java and also in Rust\footnote{Interestingly the regex
       
  1079   engine in Rust says it guarantees a linear behaviour when deciding
       
  1080   when a regular expression matches a
       
  1081   string. \here{https://youtu.be/3N_ywhx6_K0?t=31}} just idiots? For
       
  1082 example could it possibly be that what we have implemented here in
       
  1083 Scala is faster than the regex engine that has been implemented in
       
  1084 Rust? See you next week\ldots
  1072 
  1085 
  1073 \end{document}
  1086 \end{document}
  1074 
  1087 
  1075 
  1088 
  1076 
  1089