9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016} |
9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016} |
10 |
10 |
11 |
11 |
12 \section*{Handout 2 (Regular Expression Matching)} |
12 \section*{Handout 2 (Regular Expression Matching)} |
13 |
13 |
14 This lecture is about implementing a more efficient regular |
14 This lecture is about implementing a more efficient regular expression |
15 expression matcher (the plots on the right)---more efficient |
15 matcher (the plots on the right)---more efficient than the matchers |
16 than the matchers from regular expression libraries in Ruby |
16 from regular expression libraries in Ruby, Python and Java (the plots |
17 and Python (the plots on the left). These plots show the |
17 on the left). The first pair of plots show the running time for the |
18 running time for the evil regular expression |
18 regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed |
19 $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings composed of $n$ |
19 of $n$ \pcode{a}s. The second pair of plots show the running time |
20 \pcode{a}s. We will use this regular expression and strings as |
20 for the regular expression $(a^*)^*\cdot b$ and also strings composed |
21 running example. To see the substantial differences in the two |
21 of $n$ \pcode{a}s (meaning this regular expression actually does not |
22 plots below, note the different scales of the $x$-axes. |
22 match the strings). To see the substantial differences in the left |
|
23 and right plots below, note the different scales of the $x$-axes. |
23 |
24 |
24 |
25 |
25 \begin{center} |
26 \begin{center} |
26 \begin{tabular}{@{}cc@{}} |
27 \begin{tabular}{@{}cc@{}} |
27 \begin{tikzpicture} |
28 \begin{tikzpicture} |
28 \begin{axis}[ |
29 \begin{axis}[ |
29 xlabel={strings of {\tt a}s}, |
30 xlabel={\small $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, |
30 ylabel={time in secs}, |
31 ylabel={\small time in secs}, |
31 enlargelimits=false, |
32 enlargelimits=false, |
32 xtick={0,5,...,30}, |
33 xtick={0,5,...,30}, |
33 xmax=33, |
34 xmax=33, |
34 ymax=35, |
35 ymax=35, |
35 ytick={0,5,...,30}, |
36 ytick={0,5,...,30}, |
47 \end{axis} |
48 \end{axis} |
48 \end{tikzpicture} |
49 \end{tikzpicture} |
49 & |
50 & |
50 \begin{tikzpicture} |
51 \begin{tikzpicture} |
51 \begin{axis}[ |
52 \begin{axis}[ |
52 xlabel={strings of \texttt{a}s}, |
53 xlabel={\small $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, |
|
54 ylabel={\small time in secs}, |
|
55 enlargelimits=false, |
|
56 xtick={0,3000,...,12000}, |
|
57 xmax=12500, |
|
58 ymax=35, |
|
59 ytick={0,5,...,30}, |
|
60 scaled ticks=false, |
|
61 axis lines=left, |
|
62 width=6.5cm, |
|
63 height=5cm] |
|
64 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data}; |
|
65 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
|
66 \end{axis} |
|
67 \end{tikzpicture} |
|
68 \end{tabular} |
|
69 \end{center} |
|
70 |
|
71 \begin{center} |
|
72 \begin{tabular}{@{}cc@{}} |
|
73 \begin{tikzpicture} |
|
74 \begin{axis}[ |
|
75 xlabel={$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, |
|
76 ylabel={time in secs}, |
|
77 enlargelimits=false, |
|
78 xtick={0,5,...,30}, |
|
79 xmax=33, |
|
80 ymax=35, |
|
81 ytick={0,5,...,30}, |
|
82 scaled ticks=false, |
|
83 axis lines=left, |
|
84 width=5cm, |
|
85 height=5cm, |
|
86 legend entries={Java}, |
|
87 legend pos=north west, |
|
88 legend cell align=left] |
|
89 \addplot[blue,mark=*, mark options={fill=white}] table {re-java.data}; |
|
90 \end{axis} |
|
91 \end{tikzpicture} |
|
92 & |
|
93 \begin{tikzpicture} |
|
94 \begin{axis}[ |
|
95 xlabel={$(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, |
53 ylabel={time in secs}, |
96 ylabel={time in secs}, |
54 enlargelimits=false, |
97 enlargelimits=false, |
55 xtick={0,3000,...,12000}, |
98 xtick={0,3000,...,12000}, |
56 xmax=12500, |
99 xmax=12500, |
57 ymax=35, |
100 ymax=35, |
65 \end{axis} |
108 \end{axis} |
66 \end{tikzpicture} |
109 \end{tikzpicture} |
67 \end{tabular} |
110 \end{tabular} |
68 \end{center}\medskip |
111 \end{center}\medskip |
69 |
112 |
70 |
113 \noindent |
71 \noindent Having specified in the previous lecture what |
114 We will use these regular expressions and strings |
|
115 as running examples. |
|
116 |
|
117 Having specified in the previous lecture what |
72 problem our regular expression matcher is supposed to solve, |
118 problem our regular expression matcher is supposed to solve, |
73 namely for any given regular expression $r$ and string $s$ |
119 namely for any given regular expression $r$ and string $s$ |
74 answer \textit{true} if and only if |
120 answer \textit{true} if and only if |
75 |
121 |
76 \[ |
122 \[ |
77 s \in L(r) |
123 s \in L(r) |
78 \] |
124 \] |
79 |
125 |
80 \noindent we can look at an algorithm to solve this problem. |
126 \noindent we can look at an algorithm to solve this problem. Clearly |
81 Clearly we cannot use the function $L$ directly for this, |
127 we cannot use the function $L$ directly for this, because in general |
82 because in general the set of strings $L$ returns is infinite |
128 the set of strings $L$ returns is infinite (recall what $L(a^*)$ is). |
83 (recall what $L(a^*)$ is). In such cases there is no way we |
129 In such cases there is no way we can implement an exhaustive test for |
84 can implement an exhaustive test for whether a string is |
130 whether a string is member of this set or not. In contrast our |
85 member of this set or not. In contrast our matching algorithm |
131 matching algorithm will operate on the regular expression $r$ and |
86 will operate on the regular expression $r$ and string $s$, |
132 string $s$, only, which are both finite objects. Before we explain to |
87 only, which are both finite objects. Before we come to the matching |
133 the matching algorithm, however, let us have a closer look at what it |
88 algorithm, however, let us have a closer look at what it means |
134 means when two regular expressions are equivalent. |
89 when two regular expressions are equivalent. |
|
90 |
135 |
91 \subsection*{Regular Expression Equivalences} |
136 \subsection*{Regular Expression Equivalences} |
92 |
137 |
93 We already defined in Handout 1 what it means for two regular |
138 We already defined in Handout 1 what it means for two regular |
94 expressions to be equivalent, namely if their meaning is the |
139 expressions to be equivalent, namely if their meaning is the |
154 $\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\ |
199 $\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\ |
155 $r + r$ & $\equiv$ & $r$ |
200 $r + r$ & $\equiv$ & $r$ |
156 \end{tabular} |
201 \end{tabular} |
157 \end{center} |
202 \end{center} |
158 |
203 |
159 \noindent which always hold no matter what the regular |
204 \noindent which always hold no matter what the regular expression $r$ |
160 expression $r$ looks like. The first two are easy to verify |
205 looks like. The first two are easy to verify since $L(\ZERO)$ is the |
161 since $L(\ZERO)$ is the empty set. The next two are also |
206 empty set. The next two are also easy to verify since $L(\ONE) = |
162 easy to verify since $L(\ONE) = \{[]\}$ and appending the |
207 \{[]\}$ and appending the empty string to every string of another set, |
163 empty string to every string of another set, leaves the set |
208 leaves the set unchanged. Be careful to fully comprehend the fifth and |
164 unchanged. Be careful to fully comprehend the fifth and sixth |
209 sixth equivalence: if you concatenate two sets of strings and one is |
165 equivalence: if you concatenate two sets of strings and one is |
210 the empty set, then the concatenation will also be the empty set. To |
166 the empty set, then the concatenation will also be the empty |
211 see this, check the definition of $\_ @ \_$ for sets. The last |
167 set. To see this, check the definition of $\_ @ \_$. The |
212 equivalence is again trivial. |
168 last equivalence is again trivial. |
|
169 |
213 |
170 What will be important later on is that we can orient these |
214 What will be important later on is that we can orient these |
171 equivalences and read them from left to right. In this way we |
215 equivalences and read them from left to right. In this way we |
172 can view them as \emph{simplification rules}. Consider for |
216 can view them as \emph{simplification rules}. Consider for |
173 example the regular expression |
217 example the regular expression |
175 \begin{equation} |
219 \begin{equation} |
176 (r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO) |
220 (r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO) |
177 \label{big} |
221 \label{big} |
178 \end{equation} |
222 \end{equation} |
179 |
223 |
180 \noindent If we can find an equivalent regular expression that |
224 \noindent If we can find an equivalent regular expression that is |
181 is simpler (smaller for example), then this might potentially |
225 simpler (smaller for example), then this might potentially make our |
182 make our matching algorithm run faster. The reason is that |
226 matching algorithm run faster. We can look for such a simpler regular |
183 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv |
227 expression $r'$ because whether a string $s$ is in $L(r)$ or in |
184 r'$ will always give the same answer. In the example above you |
228 $L(r')$ with $r\equiv r'$ will always give the same answer. In the |
185 will see that the regular expression is equivalent to just $r_1$. |
229 example above you will see that the regular expression is equivalent |
186 You can verify this by iteratively applying the simplification |
230 to just $r_1$. You can verify this by iteratively applying the |
187 rules from above: |
231 simplification rules from above: |
188 |
232 |
189 \begin{center} |
233 \begin{center} |
190 \begin{tabular}{ll} |
234 \begin{tabular}{ll} |
191 & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot |
235 & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot |
192 (\underline{r_4 \cdot \ZERO})$\smallskip\\ |
236 (\underline{r_4 \cdot \ZERO})$\smallskip\\ |
206 algorithm quite a bit faster. |
250 algorithm quite a bit faster. |
207 |
251 |
208 \subsection*{The Matching Algorithm} |
252 \subsection*{The Matching Algorithm} |
209 |
253 |
210 The algorithm we will define below consists of two parts. One |
254 The algorithm we will define below consists of two parts. One |
211 is the function $nullable$ which takes a regular expression as |
255 is the function $\textit{nullable}$ which takes a regular expression as |
212 argument and decides whether it can match the empty string |
256 argument and decides whether it can match the empty string |
213 (this means it returns a boolean in Scala). This can be easily |
257 (this means it returns a boolean in Scala). This can be easily |
214 defined recursively as follows: |
258 defined recursively as follows: |
215 |
259 |
216 \begin{center} |
260 \begin{center} |
217 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
261 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
218 $nullable(\ZERO)$ & $\dn$ & $\textit{false}$\\ |
262 $\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\ |
219 $nullable(\ONE)$ & $\dn$ & $true$\\ |
263 $\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ |
220 $nullable(c)$ & $\dn$ & $\textit{false}$\\ |
264 $\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\ |
221 $nullable(r_1 + r_2)$ & $\dn$ & $nullable(r_1) \vee nullable(r_2)$\\ |
265 $\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ |
222 $nullable(r_1 \cdot r_2)$ & $\dn$ & $nullable(r_1) \wedge nullable(r_2)$\\ |
266 $\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\ |
223 $nullable(r^*)$ & $\dn$ & $true$ \\ |
267 $\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$ \\ |
224 \end{tabular} |
268 \end{tabular} |
225 \end{center} |
269 \end{center} |
226 |
270 |
227 \noindent The idea behind this function is that the following |
271 \noindent The idea behind this function is that the following |
228 property holds: |
272 property holds: |
229 |
273 |
230 \[ |
274 \[ |
231 nullable(r) \;\;\text{if and only if}\;\; []\in L(r) |
275 \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r) |
232 \] |
276 \] |
233 |
277 |
234 \noindent Note on the left-hand side of the if-and-only-if we |
278 \noindent Note on the left-hand side of the if-and-only-if we |
235 have a function we can implement; on the right we have its |
279 have a function we can implement; on the right we have its |
236 specification (which we cannot implement in a programming |
280 specification (which we cannot implement in a programming |
237 language). |
281 language). |
238 |
282 |
239 The other function of our matching algorithm calculates a |
283 The other function of our matching algorithm calculates a |
240 \emph{derivative} of a regular expression. This is a function |
284 \emph{derivative} of a regular expression. This is a function |
241 which will take a regular expression, say $r$, and a |
285 which will take a regular expression, say $r$, and a |
242 character, say $c$, as argument and returns a new regular |
286 character, say $c$, as arguments and returns a new regular |
243 expression. Be careful that the intuition behind this function |
287 expression. Be careful that the intuition behind this function |
244 is not so easy to grasp on first reading. Essentially this |
288 is not so easy to grasp on first reading. Essentially this |
245 function solves the following problem: if $r$ can match a |
289 function solves the following problem: if $r$ can match a |
246 string of the form $c\!::\!s$, what does the regular |
290 string of the form $c\!::\!s$, what does the regular |
247 expression look like that can match just $s$? The definition |
291 expression look like that can match just $s$? The definition |
251 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} |
295 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} |
252 $der\, c\, (\ZERO)$ & $\dn$ & $\ZERO$\\ |
296 $der\, c\, (\ZERO)$ & $\dn$ & $\ZERO$\\ |
253 $der\, c\, (\ONE)$ & $\dn$ & $\ZERO$ \\ |
297 $der\, c\, (\ONE)$ & $\dn$ & $\ZERO$ \\ |
254 $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\ |
298 $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\ |
255 $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\ |
299 $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\ |
256 $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $nullable (r_1)$\\ |
300 $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\textit{nullable} (r_1)$\\ |
257 & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ |
301 & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ |
258 & & else $(der\, c\, r_1) \cdot r_2$\\ |
302 & & else $(der\, c\, r_1) \cdot r_2$\\ |
259 $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ |
303 $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ |
260 \end{tabular} |
304 \end{tabular} |
261 \end{center} |
305 \end{center} |
276 involved. Fortunately, the $+$-case is still relatively |
320 involved. Fortunately, the $+$-case is still relatively |
277 straightforward: all strings of the form $c\!::\!s$ are either |
321 straightforward: all strings of the form $c\!::\!s$ are either |
278 matched by the regular expression $r_1$ or $r_2$. So we just |
322 matched by the regular expression $r_1$ or $r_2$. So we just |
279 have to recursively call $der$ with these two regular |
323 have to recursively call $der$ with these two regular |
280 expressions and compose the results again with $+$. Makes |
324 expressions and compose the results again with $+$. Makes |
281 sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
325 sense? |
|
326 |
|
327 The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
282 matches a string of the form $c\!::\!s$, then the first part |
328 matches a string of the form $c\!::\!s$, then the first part |
283 must be matched by $r_1$. Consequently, it makes sense to |
329 must be matched by $r_1$. Consequently, it makes sense to |
284 construct the regular expression for $s$ by calling $der$ with |
330 construct the regular expression for $s$ by calling $der$ with |
285 $r_1$ and ``appending'' $r_2$. There is however one exception |
331 $r_1$ and ``appending'' $r_2$. There is however one exception |
286 to this simple rule: if $r_1$ can match the empty string, then |
332 to this simple rule: if $r_1$ can match the empty string, then |
292 is again simple: if $r^*$ matches a string of the form |
338 is again simple: if $r^*$ matches a string of the form |
293 $c\!::\!s$, then the first part must be ``matched'' by a |
339 $c\!::\!s$, then the first part must be ``matched'' by a |
294 single copy of $r$. Therefore we call recursively $der\,c\,r$ |
340 single copy of $r$. Therefore we call recursively $der\,c\,r$ |
295 and ``append'' $r^*$ in order to match the rest of $s$. |
341 and ``append'' $r^*$ in order to match the rest of $s$. |
296 |
342 |
297 If this did not make sense, here is another way to rationalise |
343 If this did not make sense yet, here is another way to rationalise |
298 the definition of $der$ by considering the following operation |
344 the definition of $der$ by considering the following operation |
299 on sets: |
345 on sets: |
300 |
346 |
301 \begin{equation}\label{Der} |
347 \begin{equation}\label{Der} |
302 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\} |
348 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\} |
336 \begin{tabular}{rll} |
382 \begin{tabular}{rll} |
337 Input: $r_1$, $abc$\medskip\\ |
383 Input: $r_1$, $abc$\medskip\\ |
338 Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\ |
384 Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\ |
339 Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\ |
385 Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\ |
340 Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\ |
386 Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\ |
341 Step 4: & the string is exhausted; test & ($nullable(r_4)$)\\ |
387 Step 4: & the string is exhausted; test & ($\textit{nullable}(r_4)$)\\ |
342 & whether $r_4$ can recognise the\\ |
388 & whether $r_4$ can recognise the\\ |
343 & empty string\smallskip\\ |
389 & empty string\smallskip\\ |
344 Output: & result of this test $\Rightarrow true \,\text{or}\, \textit{false}$\\ |
390 Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\ |
345 \end{tabular} |
391 \end{tabular} |
346 \end{center} |
392 \end{center} |
347 |
393 |
348 \noindent Again the operation $Der$ might help to rationalise |
394 \noindent Again the operation $Der$ might help to rationalise |
349 this algorithm. We want to know whether $abc \in L(r_1)$. We |
395 this algorithm. We want to know whether $abc \in L(r_1)$. We |
350 do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$ |
396 do not know yet---but let us assume it is. Then $Der\,a\,L(r_1)$ |
351 builds the set where all the strings not starting with $a$ are |
397 builds the set where all the strings not starting with $a$ are |
352 filtered out. Of the remaining strings, the $a$ is stripped |
398 filtered out. Of the remaining strings, the $a$ is stripped |
353 off. Then we continue with filtering out all strings not |
399 off. So we should still have $bc$ in the set. |
|
400 Then we continue with filtering out all strings not |
354 starting with $b$ and stripping off the $b$ from the remaining |
401 starting with $b$ and stripping off the $b$ from the remaining |
355 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$. |
402 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$. |
356 Finally we filter out all strings not starting with $c$ and |
403 Finally we filter out all strings not starting with $c$ and |
357 strip off $c$ from the remaining string. This is |
404 strip off $c$ from the remaining string. This is |
358 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the |
405 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the |
359 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ |
406 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ |
360 must be the empty string. If not, then $abc$ was not in the |
407 must contain the empty string. If not, then $abc$ was not in the |
361 language we started with. |
408 language we started with. |
362 |
409 |
363 Our matching algorithm using $der$ and $nullable$ works |
410 Our matching algorithm using $der$ and $\textit{nullable}$ works |
364 similarly, just using regular expression instead of sets. For |
411 similarly, just using regular expression instead of sets. For |
365 this we need to extend the notion of derivatives from single |
412 this we need to extend the notion of derivatives from single |
366 characters to strings. This can be done using the following |
413 characters to strings. This can be done using the following |
367 function, taking a string and regular expression as input and |
414 function, taking a string and regular expression as input and |
368 a regular expression as output. |
415 a regular expression as output. |
409 in the first lecture and handout, the functions and subfunctions |
456 in the first lecture and handout, the functions and subfunctions |
410 for \pcode{matches} are shown in Figure~\ref{scala1}. |
457 for \pcode{matches} are shown in Figure~\ref{scala1}. |
411 |
458 |
412 \begin{figure}[p] |
459 \begin{figure}[p] |
413 \lstinputlisting{../progs/app5.scala} |
460 \lstinputlisting{../progs/app5.scala} |
414 \caption{Scala implementation of the nullable and |
461 \caption{Scala implementation of the \textit{nullable} and |
415 derivative functions. These functions are easy to |
462 derivative functions. These functions are easy to |
416 implement in functional languages, because pattern |
463 implement in functional languages, because their built-in pattern |
417 matching and recursion allow us to mimic the mathematical |
464 matching and recursion allow us to mimic the mathematical |
418 definitions very closely.\label{scala1}} |
465 definitions very closely.\label{scala1}} |
419 \end{figure} |
466 \end{figure} |
420 |
467 |
421 For running the algorithm with our favourite example, the evil |
468 For running the algorithm with our favourite example, the evil |
521 can within 30 seconds handle regular expressions up to |
568 can within 30 seconds handle regular expressions up to |
522 $n = 950$ before a StackOverflow is raised. Python and Ruby |
569 $n = 950$ before a StackOverflow is raised. Python and Ruby |
523 (and our first version) could only handle $n = 27$ or so in 30 |
570 (and our first version) could only handle $n = 27$ or so in 30 |
524 second. |
571 second. |
525 |
572 |
|
573 SECOND EXAMPLE |
|
574 |
526 The moral is that our algorithm is rather sensitive to the |
575 The moral is that our algorithm is rather sensitive to the |
527 size of regular expressions it needs to handle. This is of |
576 size of regular expressions it needs to handle. This is of |
528 course obvious because both $nullable$ and $der$ frequently |
577 course obvious because both $\textit{nullable}$ and $der$ frequently |
529 need to traverse the whole regular expression. There seems, |
578 need to traverse the whole regular expression. There seems, |
530 however, one more issue for making the algorithm run faster. |
579 however, one more issue for making the algorithm run faster. |
531 The derivative function often produces ``useless'' |
580 The derivative function often produces ``useless'' |
532 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a |
581 $\ZERO$s and $\ONE$s. To see this, consider $r = ((a |
533 \cdot b) + b)^*$ and the following two derivatives |
582 \cdot b) + b)^*$ and the following two derivatives |
638 \noindent |
689 \noindent |
639 A simple proof is for example showing the following |
690 A simple proof is for example showing the following |
640 property: |
691 property: |
641 |
692 |
642 \begin{equation} |
693 \begin{equation} |
643 nullable(r) \;\;\text{if and only if}\;\; []\in L(r) |
694 \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r) |
644 \label{nullableprop} |
695 \label{nullableprop} |
645 \end{equation} |
696 \end{equation} |
646 |
697 |
647 \noindent |
698 \noindent |
648 Let us say that this property is $P(r)$, then the first case |
699 Let us say that this property is $P(r)$, then the first case |
649 we need to check is whether $P(\ZERO)$ (see recipe |
700 we need to check is whether $P(\ZERO)$ (see recipe |
650 above). So we have to show that |
701 above). So we have to show that |
651 |
702 |
652 \[ |
703 \[ |
653 nullable(\ZERO) \;\;\text{if and only if}\;\; |
704 \textit{nullable}(\ZERO) \;\;\text{if and only if}\;\; |
654 []\in L(\ZERO) |
705 []\in L(\ZERO) |
655 \] |
706 \] |
656 |
707 |
657 \noindent whereby $nullable(\ZERO)$ is by definition of |
708 \noindent whereby $\textit{nullable}(\ZERO)$ is by definition of |
658 the function $nullable$ always $\textit{false}$. We also have |
709 the function $\textit{nullable}$ always $\textit{false}$. We also have |
659 that $L(\ZERO)$ is by definition $\{\}$. It is |
710 that $L(\ZERO)$ is by definition $\{\}$. It is |
660 impossible that the empty string $[]$ is in the empty set. |
711 impossible that the empty string $[]$ is in the empty set. |
661 Therefore also the right-hand side is false. Consequently we |
712 Therefore also the right-hand side is false. Consequently we |
662 verified this case: both sides are false. We would still need |
713 verified this case: both sides are false. We would still need |
663 to do this for $P(\ONE)$ and $P(c)$. I leave this to |
714 to do this for $P(\ONE)$ and $P(c)$. I leave this to |
665 |
716 |
666 Next we need to check the inductive cases, for example |
717 Next we need to check the inductive cases, for example |
667 $P(r_1 + r_2)$, which is |
718 $P(r_1 + r_2)$, which is |
668 |
719 |
669 \begin{equation} |
720 \begin{equation} |
670 nullable(r_1 + r_2) \;\;\text{if and only if}\;\; |
721 \textit{nullable}(r_1 + r_2) \;\;\text{if and only if}\;\; |
671 []\in L(r_1 + r_2) |
722 []\in L(r_1 + r_2) |
672 \label{propalt} |
723 \label{propalt} |
673 \end{equation} |
724 \end{equation} |
674 |
725 |
675 \noindent The difference to the base cases is that in this |
726 \noindent The difference to the base cases is that in this |
676 case we can already assume we proved |
727 case we can already assume we proved |
677 |
728 |
678 \begin{center} |
729 \begin{center} |
679 \begin{tabular}{l} |
730 \begin{tabular}{l} |
680 $nullable(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\ |
731 $\textit{nullable}(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\ |
681 $nullable(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\ |
732 $\textit{nullable}(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\ |
682 \end{tabular} |
733 \end{tabular} |
683 \end{center} |
734 \end{center} |
684 |
735 |
685 \noindent These are the induction hypotheses. To check this |
736 \noindent These are the induction hypotheses. To check this |
686 case, we can start from $nullable(r_1 + r_2)$, which by |
737 case, we can start from $\textit{nullable}(r_1 + r_2)$, which by |
687 definition is |
738 definition is |
688 |
739 |
689 \[ |
740 \[ |
690 nullable(r_1) \vee nullable(r_2) |
741 \textit{nullable}(r_1) \vee \textit{nullable}(r_2) |
691 \] |
742 \] |
692 |
743 |
693 \noindent Using the two induction hypotheses from above, |
744 \noindent Using the two induction hypotheses from above, |
694 we can transform this into |
745 we can transform this into |
695 |
746 |
696 \[ |
747 \[ |
697 [] \in L(r_1) \vee []\in(r_2) |
748 [] \in L(r_1) \vee []\in(r_2) |
698 \] |
749 \] |
699 |
750 |
700 \noindent We just replaced the $nullable(\ldots)$ parts by |
751 \noindent We just replaced the $\textit{nullable}(\ldots)$ parts by |
701 the equivalent $[] \in L(\ldots)$ from the induction |
752 the equivalent $[] \in L(\ldots)$ from the induction |
702 hypotheses. A bit of thinking convinces you that if |
753 hypotheses. A bit of thinking convinces you that if |
703 $[] \in L(r_1) \vee []\in L(r_2)$ then the empty string |
754 $[] \in L(r_1) \vee []\in L(r_2)$ then the empty string |
704 must be in the union $L(r_1)\cup L(r_2)$, that is |
755 must be in the union $L(r_1)\cup L(r_2)$, that is |
705 |
756 |
708 \] |
759 \] |
709 |
760 |
710 \noindent but this is by definition of $L$ exactly $[] \in |
761 \noindent but this is by definition of $L$ exactly $[] \in |
711 L(r_1 + r_2)$, which we needed to establish according to |
762 L(r_1 + r_2)$, which we needed to establish according to |
712 \eqref{propalt}. What we have shown is that starting from |
763 \eqref{propalt}. What we have shown is that starting from |
713 $nullable(r_1 + r_2)$ we have done equivalent transformations |
764 $\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations |
714 to end up with $[] \in L(r_1 + r_2)$. Consequently we have |
765 to end up with $[] \in L(r_1 + r_2)$. Consequently we have |
715 established that $P(r_1 + r_2)$ holds. |
766 established that $P(r_1 + r_2)$ holds. |
716 |
767 |
717 In order to complete the proof we would now need to look |
768 In order to complete the proof we would now need to look |
718 at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you |
769 at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you |