diff -r ddb521b57e0c -r 42ecc3186944 handouts/ho02.tex --- a/handouts/ho02.tex Sat Sep 23 21:22:17 2023 +0100 +++ b/handouts/ho02.tex Sat Sep 23 22:26:52 2023 +0100 @@ -8,7 +8,7 @@ \begin{document} \fnote{\copyright{} Christian Urban, King's College London, - 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022} + 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023} \section*{Handout 2 (Regular Expression Matching)} @@ -26,9 +26,9 @@ \noindent This lecture is about implementing a more efficient regular expression matcher (the plots on the right below)---more efficient than the -matchers from regular expression libraries in Ruby, Python, JavaScript +matchers from regular expression libraries in Ruby, Python, JavaScript, Swift and Java (the plots on the left).\footnote{Have a look at KEATS: students -last year contributed also data for the Swift and Dart languages.}\medskip +last year contributed also data for the Dart language.}\medskip \noindent To start with let us look more closely at the experimental data: The @@ -63,13 +63,14 @@ scaled ticks=false, axis lines=left, width=5cm, - height=5cm, - legend entries={Java 8, Python, JavaScript}, + height=4.5cm, + legend entries={Java 8, Python, JavaScript, Swift}, legend pos=north west, legend cell align=left] \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; +\addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data}; \end{axis} \end{tikzpicture} & @@ -85,7 +86,7 @@ axis lines=left, %scaled ticks=false, width=6.5cm, - height=5cm, + height=4.5cm, legend entries={Our matcher}, legend pos=north east, legend cell align=left] @@ -112,7 +113,7 @@ scaled ticks=false, axis lines=left, width=5cm, - height=5cm, + height=4.55cm, legend entries={Python,Ruby}, legend pos=north west, legend cell align=left] @@ -134,7 +135,7 @@ scaled ticks=false, axis lines=left, width=6.5cm, - height=5cm, + height=4.5cm, legend entries={Our matcher}, legend pos=north east, legend cell align=left] @@ -264,7 +265,7 @@ simpler (that usually means smaller), then this might potentially make our matching algorithm run faster. We can look for such a simpler, but equivalent, regular expression $r'$ because whether a string $s$ is in -$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? +$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? \footnote{You have checked this for yourself? Your friendly lecturer might talk rubbish\ldots{}one never knows.} In the example above you will see that the regular expression in \eqref{big} is equivalent to just $r_1$. You can verify this by @@ -303,7 +304,7 @@ \noindent We will not use them in our algorithm, but feel free to convince yourself that they actually hold. As an aside, there has been a lot of -research about questions like: Can one always decide when two regular +research on questions like: Can one always decide whether two regular expressions are equivalent or not? What does an algorithm look like to decide this efficiently? Surprisingly, many of such questions turn out to be non-trivial problems. @@ -311,11 +312,11 @@ \subsection*{The Matching Algorithm} -The algorithm we will introduce below consists of two parts. One is the -function $\textit{nullable}$ which takes a regular expression as an -argument and decides whether it can match the empty string (this means -it returns a boolean in Scala). This can be easily defined recursively -as follows: +The regular expression matching algorithm we will introduce below +consists of two parts: One is the function $\textit{nullable}$ which +takes a regular expression as an argument and decides whether it can +match the empty string (this means it returns a boolean in +Scala). This can be easily defined recursively as follows: \begin{center} \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} @@ -365,20 +366,21 @@ \noindent The first two clauses can be rationalised as follows: recall that $\textit{der}$ should calculate a regular expression so that -provided the ``input'' regular expression can match a string of the +provided the ``input'' regular expression can match a string of the form $c\!::\!s$, we want a regular expression for $s$. Since neither -$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return -$\ZERO$. In the third case we have to make a case-distinction: In case -the regular expression is $c$, then clearly it can recognise a string of -the form $c\!::\!s$, just that $s$ is the empty string. Therefore we -return the $\ONE$-regular expression, as it can match the empty string. -In the other case we again return $\ZERO$ since no string of the -$c\!::\!s$ can be matched. Next come the recursive cases, which are a -bit more involved. Fortunately, the $+$-case is still relatively -straightforward: all strings of the form $c\!::\!s$ are either matched -by the regular expression $r_1$ or $r_2$. So we just have to recursively -call $\textit{der}$ with these two regular expressions and compose the -results again with $+$. Makes sense? +$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we +return $\ZERO$. In the third case we have to make a case-distinction: +In case the regular expression is $c$, then clearly it can recognise a +string of the form $c\!::\!s$, just that $s$ is the empty +string. Therefore we return the $\ONE$-regular expression in this +case, as it can match the empty string. In the other case we again +return $\ZERO$ since no string of the $c\!::\!s$ can be matched. Next +come the recursive cases, which are a bit more involved. Fortunately, +the $+$-case is still relatively straightforward: all strings of the +form $c\!::\!s$ are either matched by the regular expression $r_1$ or +$r_2$. So we just have to recursively call $\textit{der}$ with these +two regular expressions and compose the results again with $+$. Makes +sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$ @@ -1064,11 +1066,22 @@ \noindent which is the property we set out to prove: our algorithm meets its specification. To have done so, requires a few induction -proofs about strings and regular expressions. Following the \emph{induction -recipes} is already a big step in actually performing these proofs. -If you do not believe it, proofs have helped me to make sure my code -is correct and in several instances prevented me of letting slip -embarrassing mistakes into the `wild'. +proofs about strings and regular expressions. Following the +\emph{induction recipes} is already a big step in actually performing +these proofs. If you do not believe it, proofs have helped me to make +sure my code is correct and in several instances prevented me of +letting slip embarrassing mistakes into the `wild'. In fact I have +found a number of mistakes in the brilliant work by Sulzmann and Lu, +which we are going to have a look at in Lecture 4. But in Lecture 3 we +should first find out why on earth are existing regular expression +matchers so abysmally slow. Are the people in Python, Ruby, Swift, +JavaScript, Java and also in Rust\footnote{Interestingly the regex + engine in Rust says it guarantees a linear behaviour when deciding + when a regular expression matches a + string. \here{https://youtu.be/3N_ywhx6_K0?t=31}} just idiots? For +example could it possibly be that what we have implemented here in +Scala is faster than the regex engine that has been implemented in +Rust? See you next week\ldots \end{document}