1 \documentclass{article} |
1 \documentclass{article} |
2 \usepackage{../style} |
2 \usepackage{../style} |
3 \usepackage{../langs} |
3 \usepackage{../langs} |
4 |
4 \usepackage{../graphics} |
|
5 \usepackage{../data} |
5 |
6 |
6 \begin{document} |
7 \begin{document} |
7 |
8 |
8 \section*{Handout 2} |
9 \section*{Handout 2} |
9 |
10 |
10 Having specified what problem our matching algorithm, \pcode{match}, |
11 This lecture is about implementing a more efficient regular |
11 is supposed to solve, namely for a given regular expression $r$ and |
12 expression matcher (the plots on the right)---more efficient |
12 string $s$ answer \textit{true} if and only if |
13 than the matchers from regular expression libraries in Ruby and |
|
14 Python (the plots on the left). These plots show the running |
|
15 time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$. |
|
16 |
|
17 \begin{center} |
|
18 \begin{tabular}{@{}cc@{}} |
|
19 \begin{tikzpicture}[y=.072cm, x=.12cm] |
|
20 %axis |
|
21 \draw (0,0) -- coordinate (x axis mid) (30,0); |
|
22 \draw (0,0) -- coordinate (y axis mid) (0,30); |
|
23 %ticks |
|
24 \foreach \x in {0,5,...,30} |
|
25 \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; |
|
26 \foreach \y in {0,5,...,30} |
|
27 \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; |
|
28 %labels |
|
29 \node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; |
|
30 \node[rotate=90,left=0.9cm] at (y axis mid) {time in secs}; |
|
31 %plots |
|
32 \draw[color=blue] plot[mark=*] |
|
33 file {re-python.data}; |
|
34 \draw[color=brown] plot[mark=triangle*] |
|
35 file {re-ruby.data}; |
|
36 %legend |
|
37 \begin{scope}[shift={(4,20)}] |
|
38 \draw[color=blue] (0,0) -- |
|
39 plot[mark=*] (0.25,0) -- (0.5,0) |
|
40 node[right]{\small Python}; |
|
41 \draw[yshift=-4mm, color=brown] (0,0) -- |
|
42 plot[mark=triangle*] (0.25,0) -- (0.5,0) |
|
43 node[right]{\small Ruby}; |
|
44 \end{scope} |
|
45 \end{tikzpicture} |
|
46 & |
|
47 \begin{tikzpicture}[y=.072cm, x=.0004cm] |
|
48 %axis |
|
49 \draw (0,0) -- coordinate (x axis mid) (12000,0); |
|
50 \draw (0,0) -- coordinate (y axis mid) (0,30); |
|
51 %ticks |
|
52 \foreach \x in {0,3000,...,12000} |
|
53 \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; |
|
54 \foreach \y in {0,5,...,30} |
|
55 \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; |
|
56 %labels |
|
57 \node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; |
|
58 \node[rotate=90,left=0.9cm] at (y axis mid) {time in secs}; |
|
59 |
|
60 %plots |
|
61 \draw[color=green] plot[mark=square*, mark options={fill=white} ] |
|
62 file {re2b.data}; |
|
63 \draw[color=black] plot[mark=square*, mark options={fill=white} ] |
|
64 file {re3.data}; |
|
65 \end{tikzpicture} |
|
66 \end{tabular} |
|
67 \end{center}\medskip |
|
68 |
|
69 |
|
70 \noindent Having specified in the previous lecture what |
|
71 problem our regular expression matcher, which we will call |
|
72 \pcode{matches}, is supposed to solve, namely for any given |
|
73 regular expression $r$ and string $s$ answer \textit{true} if |
|
74 and only if |
13 |
75 |
14 \[ |
76 \[ |
15 s \in L(r) |
77 s \in L(r) |
16 \] |
78 \] |
17 |
79 |
18 \noindent we can look at an algorithm to solve this problem. |
80 \noindent we can look at an algorithm to solve this problem. |
19 Clearly we cannot use the function $L$ directly for this, |
81 Clearly we cannot use the function $L$ directly for this, |
20 because in general the set of strings $L$ returns is infinite |
82 because in general the set of strings $L$ returns is infinite |
21 (recall what $L(a^*)$ is). In such cases there is no way we |
83 (recall what $L(a^*)$ is). In such cases there is no way we |
22 can implement an exhaustive test for whether a string is |
84 can implement an exhaustive test for whether a string is |
23 member of this set or not. Before we come to the matching |
85 member of this set or not. In contrast our matching algorithm |
24 algorithm, lets have a closer look at what it means when |
86 will mainly operate on the regular expression $r$ and string |
25 two regular expressions are equivalent. |
87 $s$, which are both finite. Before we come to the matching |
|
88 algorithm, however, let us have a closer look at what it means |
|
89 when two regular expressions are equivalent. |
26 |
90 |
27 \subsection*{Regular Expression Equivalences} |
91 \subsection*{Regular Expression Equivalences} |
28 |
92 |
29 |
93 We already defined in Handout 1 what it means for two regular |
30 \subsection*{Matching Algorithm} |
94 expressions to be equivalent, namely if their meaning is the |
31 |
95 same language: |
32 The algorithm we will define below consists of two parts. One is the |
96 |
33 function $nullable$ which takes a regular expression as argument and |
97 \[ |
34 decides whether it can match the empty string (this means it returns a |
98 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) |
35 boolean). This can be easily defined recursively as follows: |
99 \] |
|
100 |
|
101 \noindent |
|
102 It is relatively easy to verify that some concrete equivalences |
|
103 hold, for example |
|
104 |
|
105 \begin{center} |
|
106 \begin{tabular}{rcl} |
|
107 $(a + b) + c$ & $\equiv$ & $a + (b + c)$\\ |
|
108 $a + a$ & $\equiv$ & $a$\\ |
|
109 $a + b$ & $\equiv$ & $b + a$\\ |
|
110 $(a \cdot b) \cdot c$ & $\equiv$ & $a \cdot (b \cdot c)$\\ |
|
111 $c \cdot (a + b)$ & $\equiv$ & $(c \cdot a) + (c \cdot b)$\\ |
|
112 \end{tabular} |
|
113 \end{center} |
|
114 |
|
115 \noindent |
|
116 but also easy to verify that the following regular expressions |
|
117 are \emph{not} equivalent |
|
118 |
|
119 \begin{center} |
|
120 \begin{tabular}{rcl} |
|
121 $a \cdot a$ & $\not\equiv$ & $a$\\ |
|
122 $a + (b \cdot c)$ & $\not\equiv$ & $(a + b) \cdot (a + c)$\\ |
|
123 \end{tabular} |
|
124 \end{center} |
|
125 |
|
126 \noindent I leave it to you to verify these equivalences and |
|
127 non-equivalences. It is also interesting to look at some |
|
128 corner cases involving $\epsilon$ and $\varnothing$: |
|
129 |
|
130 \begin{center} |
|
131 \begin{tabular}{rcl} |
|
132 $a \cdot \varnothing$ & $\not\equiv$ & $a$\\ |
|
133 $a + \epsilon$ & $\not\equiv$ & $a$\\ |
|
134 $\epsilon$ & $\equiv$ & $\varnothing^*$\\ |
|
135 $\epsilon^*$ & $\equiv$ & $\epsilon$\\ |
|
136 $\varnothing^*$ & $\not\equiv$ & $\varnothing$\\ |
|
137 \end{tabular} |
|
138 \end{center} |
|
139 |
|
140 \noindent Again I leave it to you to make sure you agree |
|
141 with these equivalences and non-equivalences. |
|
142 |
|
143 |
|
144 For our matching algorithm however the following six |
|
145 equivalences will play an important role: |
|
146 |
|
147 \begin{center} |
|
148 \begin{tabular}{rcl} |
|
149 $r + \varnothing$ & $\equiv$ & $r$\\ |
|
150 $\varnothing + r$ & $\equiv$ & $r$\\ |
|
151 $r \cdot \epsilon$ & $\equiv$ & $r$\\ |
|
152 $\epsilon \cdot r$ & $\equiv$ & $r$\\ |
|
153 $r \cdot \varnothing$ & $\equiv$ & $\varnothing$\\ |
|
154 $\varnothing \cdot r$ & $\equiv$ & $\varnothing$\\ |
|
155 $r + r$ & $\equiv$ & $r$ |
|
156 \end{tabular} |
|
157 \end{center} |
|
158 |
|
159 \noindent which always hold no matter what the regular |
|
160 expression $r$ looks like. The first are easy to verify since |
|
161 $L(\varnothing)$ is the empty set. The next two are also easy |
|
162 to verify since $L(\epsilon) = \{[]\}$ and appending the empty |
|
163 string to every string of another set, leaves the set |
|
164 unchanged. Be careful to fully comprehend the fifth and |
|
165 sixth equivalence: if you concatenate two sets of strings |
|
166 and one is the empty set, then the concatenation will also be |
|
167 the empty set. Check the definition of \pcode{_ @ _}. |
|
168 The last equivalence is again trivial. |
|
169 |
|
170 |
|
171 What will be important later on is that we can orient these |
|
172 equivalences and read them from left to right. In this way we |
|
173 can view them as \emph{simplification rules}. Suppose for |
|
174 example the regular expression |
|
175 |
|
176 \begin{equation} |
|
177 (r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing) |
|
178 \label{big} |
|
179 \end{equation} |
|
180 |
|
181 \noindent If we can find an equivalent regular expression that |
|
182 is simpler (smaller for example), then this might potentially |
|
183 make our matching algorithm is faster. The reason is that |
|
184 whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv r'$ |
|
185 will always give the same answer. In the example above you |
|
186 will see that the regular expression is equivalent to $r_1$ |
|
187 if you iteratively apply the simplification rules from above: |
|
188 |
|
189 \begin{center} |
|
190 \begin{tabular}{ll} |
|
191 & $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot |
|
192 (\underline{r_4 \cdot \varnothing})$\smallskip\\ |
|
193 $\equiv$ & $(r_1 + \varnothing) \cdot \epsilon + \underline{((\epsilon + r_2) + r_3) \cdot |
|
194 \varnothing}$\smallskip\\ |
|
195 $\equiv$ & $\underline{(r_1 + \varnothing) \cdot \epsilon} + \varnothing$\smallskip\\ |
|
196 $\equiv$ & $(\underline{r_1 + \varnothing}) + \varnothing$\smallskip\\ |
|
197 $\equiv$ & $\underline{r_1 + \varnothing}$\smallskip\\ |
|
198 $\equiv$ & $r_1$\ |
|
199 \end{tabular} |
|
200 \end{center} |
|
201 |
|
202 \noindent In each step I underlined where a simplification |
|
203 rule is applied. Our matching algorithm in the next section |
|
204 will often generate such ``useless'' $\epsilon$s and |
|
205 $\varnothing$s, therefore simplifying them away will make the |
|
206 algorithm quite a bit faster. |
|
207 |
|
208 \subsection*{The Matching Algorithm} |
|
209 |
|
210 The algorithm we will define below consists of two parts. One |
|
211 is the function $nullable$ which takes a regular expression as |
|
212 argument and decides whether it can match the empty string |
|
213 (this means it returns a boolean in Scala). This can be easily |
|
214 defined recursively as follows: |
36 |
215 |
37 \begin{center} |
216 \begin{center} |
38 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
217 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
39 $nullable(\varnothing)$ & $\dn$ & $f\!\/alse$\\ |
218 $nullable(\varnothing)$ & $\dn$ & $\textit{false}$\\ |
40 $nullable(\epsilon)$ & $\dn$ & $true$\\ |
219 $nullable(\epsilon)$ & $\dn$ & $true$\\ |
41 $nullable(c)$ & $\dn$ & $f\!alse$\\ |
220 $nullable(c)$ & $\dn$ & $\textit{false}$\\ |
42 $nullable(r_1 + r_2)$ & $\dn$ & $nullable(r_1) \vee nullable(r_2)$\\ |
221 $nullable(r_1 + r_2)$ & $\dn$ & $nullable(r_1) \vee nullable(r_2)$\\ |
43 $nullable(r_1 \cdot r_2)$ & $\dn$ & $nullable(r_1) \wedge nullable(r_2)$\\ |
222 $nullable(r_1 \cdot r_2)$ & $\dn$ & $nullable(r_1) \wedge nullable(r_2)$\\ |
44 $nullable(r^*)$ & $\dn$ & $true$ \\ |
223 $nullable(r^*)$ & $\dn$ & $true$ \\ |
45 \end{tabular} |
224 \end{tabular} |
46 \end{center} |
225 \end{center} |
47 |
226 |
48 \noindent |
227 \noindent The idea behind this function is that the following |
49 The idea behind this function is that the following property holds: |
228 property holds: |
50 |
229 |
51 \[ |
230 \[ |
52 nullable(r) \;\;\text{if and only if}\;\; ""\in L(r) |
231 nullable(r) \;\;\text{if and only if}\;\; []\in L(r) |
53 \] |
232 \] |
54 |
233 |
55 \noindent |
234 \noindent Note on the left-hand side we have a function we can |
56 Note on the left-hand side we have a function we can implement; on the |
235 implement; on the right we have its specification (which we |
57 right we have its specification. |
236 cannot implement in a programming language). |
58 |
237 |
59 The other function of our matching algorithm calculates a |
238 The other function of our matching algorithm calculates a |
60 \emph{derivative} of a regular expression. This is a function which |
239 \emph{derivative} of a regular expression. This is a function |
61 will take a regular expression, say $r$, and a character, say $c$, as |
240 which will take a regular expression, say $r$, and a |
62 argument and return a new regular expression. Be careful that the |
241 character, say $c$, as argument and return a new regular |
63 intuition behind this function is not so easy to grasp on first |
242 expression. Be careful that the intuition behind this function |
64 reading. Essentially this function solves the following problem: if |
243 is not so easy to grasp on first reading. Essentially this |
65 $r$ can match a string of the form $c\!::\!s$, what does the regular |
244 function solves the following problem: if $r$ can match a |
66 expression look like that can match just $s$. The definition of this |
245 string of the form $c\!::\!s$, what does the regular |
67 function is as follows: |
246 expression look like that can match just $s$. The definition |
68 |
247 of this function is as follows: |
69 \begin{center} |
248 |
70 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} |
249 \begin{center} |
71 $der\, c\, (\varnothing)$ & $\dn$ & $\varnothing$ & \\ |
250 \begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} |
72 $der\, c\, (\epsilon)$ & $\dn$ & $\varnothing$ & \\ |
251 $der\, c\, (\varnothing)$ & $\dn$ & $\varnothing$\\ |
73 $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$ & \\ |
252 $der\, c\, (\epsilon)$ & $\dn$ & $\varnothing$ \\ |
74 $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$ & \\ |
253 $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$\\ |
|
254 $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\ |
75 $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $nullable (r_1)$\\ |
255 $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $nullable (r_1)$\\ |
76 & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ |
256 & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ |
77 & & else $(der\, c\, r_1) \cdot r_2$\\ |
257 & & else $(der\, c\, r_1) \cdot r_2$\\ |
78 $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ & |
258 $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ |
79 \end{tabular} |
259 \end{tabular} |
80 \end{center} |
260 \end{center} |
81 |
261 |
82 \noindent |
262 \noindent The first two clauses can be rationalised as |
83 The first two clauses can be rationalised as follows: recall that |
263 follows: recall that $der$ should calculate a regular |
84 $der$ should calculate a regular expression, if the ``input'' regular |
264 expression, if the ``input'' regular expression can match a |
85 expression can match a string of the form $c\!::\!s$. Since neither |
265 string of the form $c\!::\!s$. Since neither $\varnothing$ nor |
86 $\varnothing$ nor $\epsilon$ can match such a string we return |
266 $\epsilon$ can match such a string we return $\varnothing$. In |
87 $\varnothing$. In the third case we have to make a case-distinction: |
267 the third case we have to make a case-distinction: In case the |
88 In case the regular expression is $c$, then clearly it can recognise a |
268 regular expression is $c$, then clearly it can recognise a |
89 string of the form $c\!::\!s$, just that $s$ is the empty |
269 string of the form $c\!::\!s$, just that $s$ is the empty |
90 string. Therefore we return the $\epsilon$-regular expression. In the |
270 string. Therefore we return the $\epsilon$-regular expression. |
91 other case we again return $\varnothing$ since no string of the |
271 In the other case we again return $\varnothing$ since no |
92 $c\!::\!s$ can be matched. The $+$-case is relatively |
272 string of the $c\!::\!s$ can be matched. Next come the |
93 straightforward: all strings of the form $c\!::\!s$ are either matched |
273 recursive cases. Fortunately, the $+$-case is still relatively |
94 by the regular expression $r_1$ or $r_2$. So we just have to |
274 straightforward: all strings of the form $c\!::\!s$ are either |
95 recursively call $der$ with these two regular expressions and compose |
275 matched by the regular expression $r_1$ or $r_2$. So we just |
96 the results again with $+$. The $\cdot$-case is more complicated: if |
276 have to recursively call $der$ with these two regular |
97 $r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first |
277 expressions and compose the results again with $+$. Yes, makes |
98 part must be matched by $r_1$. Consequently, it makes sense to |
278 sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
99 construct the regular expression for $s$ by calling $der$ with $r_1$ |
279 matches a string of the form $c\!::\!s$, then the first part |
100 and ``appending'' $r_2$. There is however one exception to this simple |
280 must be matched by $r_1$. Consequently, it makes sense to |
101 rule: if $r_1$ can match the empty string, then all of $c\!::\!s$ is |
281 construct the regular expression for $s$ by calling $der$ with |
102 matched by $r_2$. So in case $r_1$ is nullable (that is can match the |
282 $r_1$ and ``appending'' $r_2$. There is however one exception |
103 empty string) we have to allow the choice $der\,c\,r_2$ for |
283 to this simple rule: if $r_1$ can match the empty string, then |
104 calculating the regular expression that can match $s$. The $*$-case is |
284 all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is |
105 again simple: if $r^*$ matches a string of the form $c\!::\!s$, then |
285 nullable (that is can match the empty string) we have to allow |
106 the first part must be ``matched'' by a single copy of $r$. Therefore |
286 the choice $der\,c\,r_2$ for calculating the regular |
107 we call recursively $der\,c\,r$ and ``append'' $r^*$ in order to match |
287 expression that can match $s$. Therefore we have to |
108 the rest of $s$. |
288 add the regular expression $der\,c\,r_2$. |
109 |
289 The $*$-case is again simple: |
110 Another way to rationalise the definition of $der$ is to consider the |
290 if $r^*$ matches a string of the form $c\!::\!s$, then the |
111 following operation on sets: |
291 first part must be ``matched'' by a single copy of $r$. |
|
292 Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$ |
|
293 in order to match the rest of $s$. |
|
294 |
|
295 If this did not make sense, here is another way to rationalise |
|
296 the definition of $der$ by considering the following operation |
|
297 on sets: |
112 |
298 |
113 \[ |
299 \[ |
114 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\} |
300 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\} |
115 \] |
301 \] |
116 |
302 |
117 \noindent |
303 \noindent |
118 which essentially transforms a set of strings $A$ by filtering out all |
304 which essentially transforms a set of strings $A$ by filtering out all |
119 strings that do not start with $c$ and then strips off the $c$ from |
305 strings that do not start with $c$ and then strips off the $c$ from |
120 all the remaining strings. For example suppose $A = \{"f\!oo", "bar", |
306 all the remaining strings. For example suppose $A = \{f\!oo, bar, |
121 "f\!rak"\}$ then |
307 f\!rak\}$ then |
122 \[ |
308 \[ |
123 Der\,f\,A = \{"oo", "rak"\}\quad,\quad |
309 Der\,f\,A = \{oo, rak\}\quad,\quad |
124 Der\,b\,A = \{"ar"\} \quad \text{and} \quad |
310 Der\,b\,A = \{ar\} \quad \text{and} \quad |
125 Der\,a\,A = \varnothing |
311 Der\,a\,A = \varnothing |
126 \] |
312 \] |
127 |
313 |
128 \noindent |
314 \noindent |
129 Note that in the last case $Der$ is empty, because no string in $A$ |
315 Note that in the last case $Der$ is empty, because no string in $A$ |
139 namely take the set of strings that $r$ can match (that is $L(r)$), |
325 namely take the set of strings that $r$ can match (that is $L(r)$), |
140 filter out all strings not starting with $c$ and strip off the $c$ |
326 filter out all strings not starting with $c$ and strip off the $c$ |
141 from the remaining strings---this is exactly the language that |
327 from the remaining strings---this is exactly the language that |
142 $der\,c\,r$ can match. |
328 $der\,c\,r$ can match. |
143 |
329 |
144 If we want to find out whether the string $"abc"$ is matched by the |
330 If we want to find out whether the string $abc$ is matched by |
145 regular expression $r$ then we can iteratively apply $Der$ as follows |
331 the regular expression $r_1$ then we can iteratively apply $der$ |
146 |
332 as follows |
147 \begin{enumerate} |
333 |
148 \item $Der\,a\,(L(r))$ |
334 \begin{center} |
149 \item $Der\,b\,(Der\,a\,(L(r)))$ |
335 \begin{tabular}{rll} |
150 \item $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ |
336 Input: $r_1$, $abc$\medskip\\ |
151 \end{enumerate} |
337 Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\ |
152 |
338 Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\ |
153 \noindent |
339 Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\ |
154 In the last step we need to test whether the empty string is in the |
340 Step 4: & the string is exhausted; test & ($nullable(r_4)$)\\ |
155 set. Our matching algorithm will work similarly, just using regular |
341 & whether $r_4$ can recognise the\\ |
156 expression instead of sets. For this we need to lift the notion of |
342 & empty string\smallskip\\ |
157 derivatives from characters to strings. This can be done using the |
343 Output: & result of the test $\Rightarrow true \,\text{or}\, \textit{false}$\\ |
158 following function, taking a string and regular expression as input |
344 \end{tabular} |
159 and a regular expression as output. |
345 \end{center} |
|
346 |
|
347 \noindent Again the operation $Der$ might help to rationalise |
|
348 this algorithm. We want to know whether $abc \in L(r_1)$. We |
|
349 do not know yet. But lets assume it is. Then $Der\,a\,L(r_1)$ |
|
350 builds the set where all the strings not starting with $a$ are |
|
351 filtered out. Of the remaining strings, the $a$ is stripped |
|
352 off. Then we continue with filtering out all strings not |
|
353 starting with $b$ and stripping off the $b$ from the remaining |
|
354 strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$. |
|
355 Finally we filter out all strings not starting with $c$ and |
|
356 strip off $c$ from the remaining string. This is |
|
357 $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the |
|
358 original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ |
|
359 must be the empty string. If not then $abc$ was not in the |
|
360 language we started with. |
|
361 |
|
362 Our matching algorithm using $der$ and $nullable$ works |
|
363 similarly, just using regular expression instead of sets. For |
|
364 this we need to extend the notion of derivatives from |
|
365 characters to strings. This can be done using the following |
|
366 function, taking a string and regular expression as input and |
|
367 a regular expression as output. |
160 |
368 |
161 \begin{center} |
369 \begin{center} |
162 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} |
370 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} |
163 $der\!s\, []\, r$ & $\dn$ & $r$ & \\ |
371 $\textit{ders}\, []\, r$ & $\dn$ & $r$ & \\ |
164 $der\!s\, (c\!::\!s)\, r$ & $\dn$ & $der\!s\,s\,(der\,c\,r)$ & \\ |
372 $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\ |
165 \end{tabular} |
373 \end{tabular} |
166 \end{center} |
374 \end{center} |
167 |
375 |
168 \noindent |
376 \noindent This function essentially iterates $der$ taking one |
169 Having $ders$ in place, we can finally define our matching algorithm: |
377 character at the time from the original string until it is |
170 |
378 exhausted. Having $ders$ in place, we can finally define our |
171 \[ |
379 matching algorithm: |
172 match\,s\,r = nullable(ders\,s\,r) |
380 |
173 \] |
381 \[ |
174 |
382 matches\,s\,r = nullable(ders\,s\,r) |
175 \noindent |
383 \] |
176 We claim that |
384 |
177 |
385 \noindent |
178 \[ |
386 We can claim that |
179 match\,s\,r\quad\text{if and only if}\quad s\in L(r) |
387 |
180 \] |
388 \[ |
181 |
389 matches\,s\,r\quad\text{if and only if}\quad s\in L(r) |
182 \noindent |
390 \] |
183 holds, which means our algorithm satisfies the specification. This |
391 |
184 algorithm was introduced by Janus Brzozowski in 1964. Its main |
392 \noindent holds, which means our algorithm satisfies the |
185 attractions are simplicity and being fast, as well as being easily |
393 specification. Of course we can claim many things\ldots |
186 extendable for other regular expressions such as $r^{\{n\}}$, $r^?$, |
394 whether the claim holds any water is a different question, |
187 $\sim{}r$ and so on. |
395 which for example is the point of the Strand-2 Coursework. |
|
396 |
|
397 This algorithm was introduced by Janus Brzozowski in 1964. Its |
|
398 main attractions are simplicity and being fast, as well as |
|
399 being easily extendable for other regular expressions such as |
|
400 $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of |
|
401 Strand-1 Coursework 1). |
188 |
402 |
189 \subsection*{The Matching Algorithm in Scala} |
403 \subsection*{The Matching Algorithm in Scala} |
190 |
404 |
191 |
405 Another attraction of the algorithm is that it can be easily |
|
406 implemented in a functional programming language, like Scala. |
|
407 Given the implementation of regular expressions in Scala given |
|
408 in the first lecture and handout, the functions for |
|
409 \pcode{matches} are shown in Figure~\ref{scala1}. |
192 |
410 |
193 \begin{figure}[p] |
411 \begin{figure}[p] |
194 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app5.scala}}} |
412 \lstinputlisting{../progs/app5.scala} |
195 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app6.scala}}} |
413 \caption{Scala implementation of the nullable and |
196 \caption{Scala implementation of the nullable and derivatives functions.} |
414 derivatives functions.\label{scala1}} |
197 \end{figure} |
415 \end{figure} |
|
416 |
|
417 For running the algorithm with our favourite example, the evil |
|
418 regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement |
|
419 the optional regular expression and the exactly $n$-times |
|
420 regular expression. This can be done with the translations |
|
421 |
|
422 \lstinputlisting[numbers=none]{../progs/app51.scala} |
|
423 |
|
424 \noindent Running the matcher with the example, we find it is |
|
425 slightly worse then the matcher in Ruby and Python. |
|
426 Ooops\ldots\medskip |
|
427 |
|
428 \noindent Analysing this failure a bit we notice that |
|
429 for $a^{\{n\}}$ we generate quite big regular expressions: |
|
430 |
|
431 \begin{center} |
|
432 \begin{tabular}{rl} |
|
433 1: & $a$\\ |
|
434 2: & $a\cdot a$\\ |
|
435 3: & $a\cdot a\cdot a$\\ |
|
436 & \ldots\\ |
|
437 13: & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\ |
|
438 & \ldots\\ |
|
439 20: |
|
440 \end{tabular} |
|
441 \end{center} |
|
442 |
|
443 \noindent Our algorithm traverses such regular expressions at |
|
444 least once every time a derivative is calculated. So having |
|
445 large regular expressions, will cause problems. This problem |
|
446 is aggravated with $a?$ being represented as $a + \epsilon$. |
|
447 |
|
448 |
198 |
449 |
199 \end{document} |
450 \end{document} |
200 |
451 |
201 %%% Local Variables: |
452 %%% Local Variables: |
202 %%% mode: latex |
453 %%% mode: latex |