6 \usepackage{../data} |
6 \usepackage{../data} |
7 |
7 |
8 |
8 |
9 \begin{document} |
9 \begin{document} |
10 \fnote{\copyright{} Christian Urban, King's College London, |
10 \fnote{\copyright{} Christian Urban, King's College London, |
11 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022} |
11 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023} |
12 |
12 |
13 |
13 |
14 \section*{Handout 2 (Regular Expression Matching)} |
14 \section*{Handout 2 (Regular Expression Matching)} |
15 |
15 |
16 %\noindent |
16 %\noindent |
24 %\end{itemize}\bigskip\bigskip\bigskip |
24 %\end{itemize}\bigskip\bigskip\bigskip |
25 |
25 |
26 \noindent |
26 \noindent |
27 This lecture is about implementing a more efficient regular expression |
27 This lecture is about implementing a more efficient regular expression |
28 matcher (the plots on the right below)---more efficient than the |
28 matcher (the plots on the right below)---more efficient than the |
29 matchers from regular expression libraries in Ruby, Python, JavaScript |
29 matchers from regular expression libraries in Ruby, Python, JavaScript, Swift |
30 and Java (the plots on the left).\footnote{Have a look at KEATS: students |
30 and Java (the plots on the left).\footnote{Have a look at KEATS: students |
31 last year contributed also data for the Swift and Dart languages.}\medskip |
31 last year contributed also data for the Dart language.}\medskip |
32 |
32 |
33 \noindent |
33 \noindent |
34 To start with let us look more closely at the experimental data: The |
34 To start with let us look more closely at the experimental data: The |
35 first pair of plots shows the running time for the regular expression |
35 first pair of plots shows the running time for the regular expression |
36 $(a^*)^*\cdot b$ and strings composed of $n$ \pcode{a}s, like |
36 $(a^*)^*\cdot b$ and strings composed of $n$ \pcode{a}s, like |
61 ymax=35, |
61 ymax=35, |
62 ytick={0,5,...,30}, |
62 ytick={0,5,...,30}, |
63 scaled ticks=false, |
63 scaled ticks=false, |
64 axis lines=left, |
64 axis lines=left, |
65 width=5cm, |
65 width=5cm, |
66 height=5cm, |
66 height=4.5cm, |
67 legend entries={Java 8, Python, JavaScript}, |
67 legend entries={Java 8, Python, JavaScript, Swift}, |
68 legend pos=north west, |
68 legend pos=north west, |
69 legend cell align=left] |
69 legend cell align=left] |
70 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
70 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
71 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
71 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
72 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
72 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
|
73 \addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data}; |
73 \end{axis} |
74 \end{axis} |
74 \end{tikzpicture} |
75 \end{tikzpicture} |
75 & |
76 & |
76 \begin{tikzpicture}[baseline=(current bounding box.north)] |
77 \begin{tikzpicture}[baseline=(current bounding box.north)] |
77 \begin{axis}[ |
78 \begin{axis}[ |
83 ymax=35, |
84 ymax=35, |
84 ytick={0,5,...,30}, |
85 ytick={0,5,...,30}, |
85 axis lines=left, |
86 axis lines=left, |
86 %scaled ticks=false, |
87 %scaled ticks=false, |
87 width=6.5cm, |
88 width=6.5cm, |
88 height=5cm, |
89 height=4.5cm, |
89 legend entries={Our matcher}, |
90 legend entries={Our matcher}, |
90 legend pos=north east, |
91 legend pos=north east, |
91 legend cell align=left] |
92 legend cell align=left] |
92 %\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; |
93 %\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; |
93 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; |
94 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; |
110 ymax=35, |
111 ymax=35, |
111 ytick={0,5,...,30}, |
112 ytick={0,5,...,30}, |
112 scaled ticks=false, |
113 scaled ticks=false, |
113 axis lines=left, |
114 axis lines=left, |
114 width=5cm, |
115 width=5cm, |
115 height=5cm, |
116 height=4.55cm, |
116 legend entries={Python,Ruby}, |
117 legend entries={Python,Ruby}, |
117 legend pos=north west, |
118 legend pos=north west, |
118 legend cell align=left] |
119 legend cell align=left] |
119 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data}; |
120 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data}; |
120 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; |
121 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; |
132 ymax=35, |
133 ymax=35, |
133 ytick={0,5,...,30}, |
134 ytick={0,5,...,30}, |
134 scaled ticks=false, |
135 scaled ticks=false, |
135 axis lines=left, |
136 axis lines=left, |
136 width=6.5cm, |
137 width=6.5cm, |
137 height=5cm, |
138 height=4.5cm, |
138 legend entries={Our matcher}, |
139 legend entries={Our matcher}, |
139 legend pos=north east, |
140 legend pos=north east, |
140 legend cell align=left] |
141 legend cell align=left] |
141 %\addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
142 %\addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
142 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
143 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
262 |
263 |
263 \noindent If we can find an equivalent regular expression that is |
264 \noindent If we can find an equivalent regular expression that is |
264 simpler (that usually means smaller), then this might potentially make |
265 simpler (that usually means smaller), then this might potentially make |
265 our matching algorithm run faster. We can look for such a simpler, but |
266 our matching algorithm run faster. We can look for such a simpler, but |
266 equivalent, regular expression $r'$ because whether a string $s$ is in |
267 equivalent, regular expression $r'$ because whether a string $s$ is in |
267 $L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? |
268 $L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? \footnote{You have checked this for yourself? Your friendly lecturer might talk rubbish\ldots{}one never knows.} |
268 |
269 |
269 In the example above you will see that the regular expression in |
270 In the example above you will see that the regular expression in |
270 \eqref{big} is equivalent to just $r_1$. You can verify this by |
271 \eqref{big} is equivalent to just $r_1$. You can verify this by |
271 iteratively applying the simplification rules from above: |
272 iteratively applying the simplification rules from above: |
272 |
273 |
301 \end{center} |
302 \end{center} |
302 |
303 |
303 \noindent |
304 \noindent |
304 We will not use them in our algorithm, but feel free to convince |
305 We will not use them in our algorithm, but feel free to convince |
305 yourself that they actually hold. As an aside, there has been a lot of |
306 yourself that they actually hold. As an aside, there has been a lot of |
306 research about questions like: Can one always decide when two regular |
307 research on questions like: Can one always decide whether two regular |
307 expressions are equivalent or not? What does an algorithm look like to |
308 expressions are equivalent or not? What does an algorithm look like to |
308 decide this efficiently? Surprisingly, many of such questions |
309 decide this efficiently? Surprisingly, many of such questions |
309 turn out to be non-trivial problems. |
310 turn out to be non-trivial problems. |
310 |
311 |
311 |
312 |
312 \subsection*{The Matching Algorithm} |
313 \subsection*{The Matching Algorithm} |
313 |
314 |
314 The algorithm we will introduce below consists of two parts. One is the |
315 The regular expression matching algorithm we will introduce below |
315 function $\textit{nullable}$ which takes a regular expression as an |
316 consists of two parts: One is the function $\textit{nullable}$ which |
316 argument and decides whether it can match the empty string (this means |
317 takes a regular expression as an argument and decides whether it can |
317 it returns a boolean in Scala). This can be easily defined recursively |
318 match the empty string (this means it returns a boolean in |
318 as follows: |
319 Scala). This can be easily defined recursively as follows: |
319 |
320 |
320 \begin{center} |
321 \begin{center} |
321 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
322 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
322 $\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\ |
323 $\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\ |
323 $\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ |
324 $\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ |
363 \end{tabular} |
364 \end{tabular} |
364 \end{center} |
365 \end{center} |
365 |
366 |
366 \noindent The first two clauses can be rationalised as follows: recall |
367 \noindent The first two clauses can be rationalised as follows: recall |
367 that $\textit{der}$ should calculate a regular expression so that |
368 that $\textit{der}$ should calculate a regular expression so that |
368 provided the ``input'' regular expression can match a string of the |
369 provided the ``input'' regular expression can match a string of the |
369 form $c\!::\!s$, we want a regular expression for $s$. Since neither |
370 form $c\!::\!s$, we want a regular expression for $s$. Since neither |
370 $\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return |
371 $\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we |
371 $\ZERO$. In the third case we have to make a case-distinction: In case |
372 return $\ZERO$. In the third case we have to make a case-distinction: |
372 the regular expression is $c$, then clearly it can recognise a string of |
373 In case the regular expression is $c$, then clearly it can recognise a |
373 the form $c\!::\!s$, just that $s$ is the empty string. Therefore we |
374 string of the form $c\!::\!s$, just that $s$ is the empty |
374 return the $\ONE$-regular expression, as it can match the empty string. |
375 string. Therefore we return the $\ONE$-regular expression in this |
375 In the other case we again return $\ZERO$ since no string of the |
376 case, as it can match the empty string. In the other case we again |
376 $c\!::\!s$ can be matched. Next come the recursive cases, which are a |
377 return $\ZERO$ since no string of the $c\!::\!s$ can be matched. Next |
377 bit more involved. Fortunately, the $+$-case is still relatively |
378 come the recursive cases, which are a bit more involved. Fortunately, |
378 straightforward: all strings of the form $c\!::\!s$ are either matched |
379 the $+$-case is still relatively straightforward: all strings of the |
379 by the regular expression $r_1$ or $r_2$. So we just have to recursively |
380 form $c\!::\!s$ are either matched by the regular expression $r_1$ or |
380 call $\textit{der}$ with these two regular expressions and compose the |
381 $r_2$. So we just have to recursively call $\textit{der}$ with these |
381 results again with $+$. Makes sense? |
382 two regular expressions and compose the results again with $+$. Makes |
|
383 sense? |
382 |
384 |
383 |
385 |
384 The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
386 The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
385 matches a string of the form $c\!::\!s$, then the first part |
387 matches a string of the form $c\!::\!s$, then the first part |
386 must be matched by $r_1$. Consequently, it makes sense to |
388 must be matched by $r_1$. Consequently, it makes sense to |
1062 s\in L(r) |
1064 s\in L(r) |
1063 \] |
1065 \] |
1064 |
1066 |
1065 \noindent which is the property we set out to prove: our algorithm |
1067 \noindent which is the property we set out to prove: our algorithm |
1066 meets its specification. To have done so, requires a few induction |
1068 meets its specification. To have done so, requires a few induction |
1067 proofs about strings and regular expressions. Following the \emph{induction |
1069 proofs about strings and regular expressions. Following the |
1068 recipes} is already a big step in actually performing these proofs. |
1070 \emph{induction recipes} is already a big step in actually performing |
1069 If you do not believe it, proofs have helped me to make sure my code |
1071 these proofs. If you do not believe it, proofs have helped me to make |
1070 is correct and in several instances prevented me of letting slip |
1072 sure my code is correct and in several instances prevented me of |
1071 embarrassing mistakes into the `wild'. |
1073 letting slip embarrassing mistakes into the `wild'. In fact I have |
|
1074 found a number of mistakes in the brilliant work by Sulzmann and Lu, |
|
1075 which we are going to have a look at in Lecture 4. But in Lecture 3 we |
|
1076 should first find out why on earth are existing regular expression |
|
1077 matchers so abysmally slow. Are the people in Python, Ruby, Swift, |
|
1078 JavaScript, Java and also in Rust\footnote{Interestingly the regex |
|
1079 engine in Rust says it guarantees a linear behaviour when deciding |
|
1080 when a regular expression matches a |
|
1081 string. \here{https://youtu.be/3N_ywhx6_K0?t=31}} just idiots? For |
|
1082 example could it possibly be that what we have implemented here in |
|
1083 Scala is faster than the regex engine that has been implemented in |
|
1084 Rust? See you next week\ldots |
1072 |
1085 |
1073 \end{document} |
1086 \end{document} |
1074 |
1087 |
1075 |
1088 |
1076 |
1089 |