5 \usepackage{../graphics} |
5 \usepackage{../graphics} |
6 \usepackage{../data} |
6 \usepackage{../data} |
7 |
7 |
8 |
8 |
9 \begin{document} |
9 \begin{document} |
10 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019} |
10 \fnote{\copyright{} Christian Urban, King's College London, |
|
11 2014, 2015, 2016, 2017, 2018, 2019, 2020} |
11 |
12 |
12 |
13 |
13 \section*{Handout 2 (Regular Expression Matching)} |
14 \section*{Handout 2 (Regular Expression Matching)} |
14 |
15 |
|
16 \noindent |
|
17 {\bf Checklist} |
|
18 |
|
19 \begin{itemize} |
|
20 \item You have understood the derivative-based matching algorithm. |
|
21 \item You know how the derivative is related to the meaning of regular |
|
22 expressions. |
|
23 \item You can extend the algorithm to non-basic regular expressions. |
|
24 \end{itemize}\bigskip\bigskip\bigskip |
|
25 |
|
26 \noindent |
15 This lecture is about implementing a more efficient regular expression |
27 This lecture is about implementing a more efficient regular expression |
16 matcher (the plots on the right below)---more efficient than the |
28 matcher (the plots on the right below)---more efficient than the |
17 matchers from regular expression libraries in Ruby, Python, JavaScript |
29 matchers from regular expression libraries in Ruby, Python, JavaScript |
18 and Java (the plots on the left). For this consider some experimental |
30 and Java (the plots on the left). For this consider some experimental |
19 data: The first pair of plots shows the running time for the |
31 data: The first pair of plots shows the running time for the |
20 regular expression $(a^*)^*\cdot b$ and strings composed of $n$ |
32 regular expression $(a^*)^*\cdot b$ and strings composed of $n$ |
21 \pcode{a}s (meaning this regular expression actually does not match |
33 \pcode{a}s, like |
22 the strings). The second pair of plots shows the running time for the |
34 \[ |
23 regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings also |
35 \pcode{"}\!\underbrace{\pcode{a}\ldots\pcode{a}}_{n}\!\pcode{"} |
24 composed of $n$ \pcode{a}s (this time the regular expressions match |
36 \] |
25 the strings). To see the substantial differences in the left and |
37 |
|
38 \noindent |
|
39 This means the regular expression actually does not match the strings. |
|
40 The second pair of plots shows the running time for the regular |
|
41 expressions of the form $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and corresponding |
|
42 strings composed of $n$ \pcode{a}s (this time the regular expressions |
|
43 match the strings). To see the substantial differences in the left and |
26 right plots below, note the different scales of the $x$-axes. |
44 right plots below, note the different scales of the $x$-axes. |
27 |
45 |
28 |
46 |
29 \begin{center} |
47 \begin{center} |
30 Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$ |
48 Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$ |
128 \noindent |
146 \noindent |
129 In what follows we will use these regular expressions and strings as |
147 In what follows we will use these regular expressions and strings as |
130 running examples. There will be several versions (V1, V2, V3,\ldots) |
148 running examples. There will be several versions (V1, V2, V3,\ldots) |
131 of our matcher.\footnote{The corresponding files are |
149 of our matcher.\footnote{The corresponding files are |
132 \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can |
150 \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can |
133 find the code on KEATS.}\bigskip |
151 find the code on KEATS.} |
134 |
152 |
135 \noindent |
|
136 Having specified in the previous lecture what |
153 Having specified in the previous lecture what |
137 problem our regular expression matcher is supposed to solve, |
154 problem our regular expression matcher is supposed to solve, |
138 namely for any given regular expression $r$ and string $s$ |
155 namely for any given regular expression $r$ and string $s$ |
139 answer \textit{true} if and only if |
156 answer \textit{true} if and only if |
140 |
157 |
218 $\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\ |
235 $\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\ |
219 $r + r$ & $\equiv$ & $r$ |
236 $r + r$ & $\equiv$ & $r$ |
220 \end{tabular} |
237 \end{tabular} |
221 \end{center} |
238 \end{center} |
222 |
239 |
223 \noindent which always hold no matter what the regular expression $r$ |
240 \noindent They always hold no matter what the regular expression $r$ |
224 looks like. The first two are easy to verify since $L(\ZERO)$ is the |
241 looks like. The first two are easy to verify since $L(\ZERO)$ is the |
225 empty set. The next two are also easy to verify since $L(\ONE) = |
242 empty set. The next two are also easy to verify since $L(\ONE) = |
226 \{[]\}$ and appending the empty string to every string of another set, |
243 \{[]\}$ and appending the empty string to every string of another set, |
227 leaves the set unchanged. Be careful to fully comprehend the fifth and |
244 leaves the set unchanged. Be careful to fully comprehend the fifth and |
228 sixth equivalence: if you concatenate two sets of strings and one is |
245 sixth equivalence: if you concatenate two sets of strings and one is |
229 the empty set, then the concatenation will also be the empty set. To |
246 the empty set, then the concatenation will also be the empty set. To |
230 see this, check the definition of $\_ @ \_$ for sets. The last |
247 see this, check the definition of $\_ @ \_$ for sets. The last |
231 equivalence is again trivial. |
248 equivalence is again trivial. |
232 |
249 |
233 What will be important later on is that we can orient these |
250 What will be critical later on is that we can orient these |
234 equivalences and read them from left to right. In this way we |
251 equivalences and read them from left to right. In this way we |
235 can view them as \emph{simplification rules}. Consider for |
252 can view them as \emph{simplification rules}. Consider for |
236 example the regular expression |
253 example the regular expression |
237 |
254 |
238 \begin{equation} |
255 \begin{equation} |
240 \label{big} |
257 \label{big} |
241 \end{equation} |
258 \end{equation} |
242 |
259 |
243 \noindent If we can find an equivalent regular expression that is |
260 \noindent If we can find an equivalent regular expression that is |
244 simpler (that usually means smaller), then this might potentially make |
261 simpler (that usually means smaller), then this might potentially make |
245 our matching algorithm run faster. We can look for such a simpler |
262 our matching algorithm run faster. We can look for such a simpler, but |
246 regular expression $r'$ because whether a string $s$ is in $L(r)$ or |
263 equivalent, regular expression $r'$ because whether a string $s$ is in |
247 in $L(r')$ with $r\equiv r'$ will always give the same answer. Yes? |
264 $L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? |
248 |
265 |
249 In the example above you will see that the regular expression is |
266 In the example above you will see that the regular expression in |
250 equivalent to just $r_1$. You can verify this by iteratively applying |
267 \eqref{big} is equivalent to just $r_1$. You can verify this by |
251 the simplification rules from above: |
268 iteratively applying the simplification rules from above: |
252 |
269 |
253 \begin{center} |
270 \begin{center} |
254 \begin{tabular}{ll} |
271 \begin{tabular}{ll} |
255 & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot |
272 & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot |
256 (\underline{r_4 \cdot \ZERO})$\smallskip\\ |
273 (\underline{r_4 \cdot \ZERO})$\smallskip\\ |
272 Finally here are three equivalences between regular expressions which are |
289 Finally here are three equivalences between regular expressions which are |
273 not so obvious: |
290 not so obvious: |
274 |
291 |
275 \begin{center} |
292 \begin{center} |
276 \begin{tabular}{rcl} |
293 \begin{tabular}{rcl} |
277 $r^*$ & $\equiv$ & $1 + r\cdot r^*$\\ |
294 $r^*$ & $\equiv$ & $\ONE + r\cdot r^*$\\ |
278 $(r_1 + r_2)^*$ & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\ |
295 $(r_1 + r_2)^*$ & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\ |
279 $(r_1 \cdot r_2)^*$ & $\equiv$ & $1 + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\ |
296 $(r_1 \cdot r_2)^*$ & $\equiv$ & $\ONE + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\ |
280 \end{tabular} |
297 \end{tabular} |
281 \end{center} |
298 \end{center} |
282 |
299 |
283 \noindent |
300 \noindent |
284 We will not use them in our algorithm, but feel free to convince yourself |
301 We will not use them in our algorithm, but feel free to convince |
285 that they hold. As an aside, there has been a lot of research about |
302 yourself that they actually hold. As an aside, there has been a lot of |
286 questions like: Can one always decide when two regular expressions are |
303 research about questions like: Can one always decide when two regular |
287 equivalent or not? What does an algorithm look like to decide this |
304 expressions are equivalent or not? What does an algorithm look like to |
288 efficiently? So in general it is not a trivial problem. |
305 decide this efficiently? So in general it is not a trivial problem. |
289 |
306 |
290 \subsection*{The Matching Algorithm} |
307 \subsection*{The Matching Algorithm} |
291 |
308 |
292 The algorithm we will define below consists of two parts. One |
309 The algorithm we will introduce below consists of two parts. One is the |
293 is the function $\textit{nullable}$ which takes a regular expression as |
310 function $\textit{nullable}$ which takes a regular expression as an |
294 argument and decides whether it can match the empty string |
311 argument and decides whether it can match the empty string (this means |
295 (this means it returns a boolean in Scala). This can be easily |
312 it returns a boolean in Scala). This can be easily defined recursively |
296 defined recursively as follows: |
313 as follows: |
297 |
314 |
298 \begin{center} |
315 \begin{center} |
299 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
316 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
300 $\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\ |
317 $\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\ |
301 $\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ |
318 $\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ |
311 |
328 |
312 \[ |
329 \[ |
313 \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r) |
330 \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r) |
314 \] |
331 \] |
315 |
332 |
316 \noindent Note on the left-hand side of the if-and-only-if we |
333 \noindent Note on the left-hand side of the if-and-only-if we have a |
317 have a function we can implement; on the right we have its |
334 function we can implement, ofr example in Scala; on the right we have |
318 specification (which we cannot implement in a programming |
335 its specification (which we cannot implement in a programming language). |
319 language). |
|
320 |
336 |
321 The other function of our matching algorithm calculates a |
337 The other function of our matching algorithm calculates a |
322 \emph{derivative} of a regular expression. This is a function |
338 \emph{derivative} of a regular expression. This is a function |
323 which will take a regular expression, say $r$, and a |
339 which will take a regular expression, say $r$, and a |
324 character, say $c$, as arguments and returns a new regular |
340 character, say $c$, as arguments and returns a new regular |
340 & & else $(\textit{der}\, c\, r_1) \cdot r_2$\\ |
356 & & else $(\textit{der}\, c\, r_1) \cdot r_2$\\ |
341 $\textit{der}\, c\, (r^*)$ & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$ |
357 $\textit{der}\, c\, (r^*)$ & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$ |
342 \end{tabular} |
358 \end{tabular} |
343 \end{center} |
359 \end{center} |
344 |
360 |
345 \noindent The first two clauses can be rationalised as |
361 \noindent The first two clauses can be rationalised as follows: recall |
346 follows: recall that $\textit{der}$ should calculate a regular |
362 that $\textit{der}$ should calculate a regular expression so that |
347 expression so that given the ``input'' regular expression can |
363 provided the ``input'' regular expression can match a string of the |
348 match a string of the form $c\!::\!s$, we want a regular |
364 form $c\!::\!s$, we want a regular expression for $s$. Since neither |
349 expression for $s$. Since neither $\ZERO$ nor $\ONE$ |
365 $\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return |
350 can match a string of the form $c\!::\!s$, we return |
366 $\ZERO$. In the third case we have to make a case-distinction: In case |
351 $\ZERO$. In the third case we have to make a |
367 the regular expression is $c$, then clearly it can recognise a string of |
352 case-distinction: In case the regular expression is $c$, then |
368 the form $c\!::\!s$, just that $s$ is the empty string. Therefore we |
353 clearly it can recognise a string of the form $c\!::\!s$, just |
369 return the $\ONE$-regular expression, as it can match the empty string. |
354 that $s$ is the empty string. Therefore we return the |
370 In the other case we again return $\ZERO$ since no string of the |
355 $\ONE$-regular expression. In the other case we again |
371 $c\!::\!s$ can be matched. Next come the recursive cases, which are a |
356 return $\ZERO$ since no string of the $c\!::\!s$ can be |
372 bit more involved. Fortunately, the $+$-case is still relatively |
357 matched. Next come the recursive cases, which are a bit more |
373 straightforward: all strings of the form $c\!::\!s$ are either matched |
358 involved. Fortunately, the $+$-case is still relatively |
374 by the regular expression $r_1$ or $r_2$. So we just have to recursively |
359 straightforward: all strings of the form $c\!::\!s$ are either |
375 call $\textit{der}$ with these two regular expressions and compose the |
360 matched by the regular expression $r_1$ or $r_2$. So we just |
376 results again with $+$. Makes sense? |
361 have to recursively call $\textit{der}$ with these two regular |
377 |
362 expressions and compose the results again with $+$. Makes |
|
363 sense? |
|
364 |
378 |
365 The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
379 The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
366 matches a string of the form $c\!::\!s$, then the first part |
380 matches a string of the form $c\!::\!s$, then the first part |
367 must be matched by $r_1$. Consequently, it makes sense to |
381 must be matched by $r_1$. Consequently, it makes sense to |
368 construct the regular expression for $s$ by calling $\textit{der}$ with |
382 construct the regular expression for $s$ by calling $\textit{der}$ with |