13 This module is about text processing, be it for web-crawlers, |
12 This module is about text processing, be it for web-crawlers, |
14 compilers, dictionaries, DNA-data and so on. When looking for |
13 compilers, dictionaries, DNA-data and so on. When looking for |
15 a particular string in a large text we can use the |
14 a particular string in a large text we can use the |
16 Knuth-Morris-Pratt algorithm, which is currently the most |
15 Knuth-Morris-Pratt algorithm, which is currently the most |
17 efficient general string search algorithm. But often we do |
16 efficient general string search algorithm. But often we do |
18 \emph{not} look for just a particular string, but for string |
17 \emph{not} just look for a particular string, but for string |
19 patterns. For example in programming code we need to identify |
18 patterns. For example in programming code we need to identify |
20 what are the keywords, what are the identifiers etc. Also |
19 what are the keywords, what are the identifiers etc. A pattern |
21 often we face the problem that we are given a string (for |
20 for identifiers could be that they start with a letter, |
|
21 followed by zero or more letters, numbers and the underscore. |
|
22 Also often we face the problem that we are given a string (for |
22 example some user input) and want to know whether it matches a |
23 example some user input) and want to know whether it matches a |
23 particular pattern. For example for excluding some user input |
24 particular pattern. In this way we can exclude user input that |
24 that would otherwise have nasty effects on our program |
25 would otherwise have nasty effects on our program (crashing it |
25 (crashing or going into an infinite loop, if not worse). |
26 or going into an infinite loop, if not worse). \defn{Regular |
26 \defn{Regular expressions} help with conveniently specifying |
27 expressions} help with conveniently specifying such patterns. |
27 such patterns. |
|
28 |
|
29 |
|
30 The idea behind regular expressions is that they are a simple |
28 The idea behind regular expressions is that they are a simple |
31 method for describing languages (or sets of strings)...at |
29 method for describing languages (or sets of strings)\ldots at |
32 least languages we are interested in in computer science. For |
30 least languages we are interested in in computer science. For |
33 example there is no convenient regular expression for |
31 example there is no convenient regular expression for |
34 describing the English language short of enumerating all |
32 describing the English language short of enumerating all |
35 English words. But they seem useful for describing for example |
33 English words. But they seem useful for describing for example |
36 email addresses.\footnote{See ``8 Regular Expressions You |
34 email addresses.\footnote{See ``8 Regular Expressions You |
66 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
64 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
67 user@localserver |
65 user@localserver |
68 disposable.style.email.with+symbol@example.com |
66 disposable.style.email.with+symbol@example.com |
69 \end{lstlisting} |
67 \end{lstlisting} |
70 |
68 |
|
69 Identifiers, or variables, in program text are often required |
|
70 to satisfy the constraints that they start with a letter and |
|
71 then can be followed by zero or more letters or numbers and |
|
72 also can include underscores, but not as the first character. |
|
73 Such identifiers can be recognised with the regular expression |
|
74 |
|
75 \begin{center} |
|
76 \pcode{[a-zA-Z] [a-zA-Z0-9_]*} |
|
77 \end{center} |
|
78 |
|
79 \noindent Possible identifiers that match this regular expression |
|
80 are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name}, |
|
81 but not \pcode{_i} and also not \pcode{4you}. |
|
82 |
71 Many programming language offer libraries that can be used to |
83 Many programming language offer libraries that can be used to |
72 validate such strings against regular expressions, like the |
84 validate such strings against regular expressions. Also there |
73 one for email addresses in \eqref{email}. There are some |
85 are some common, and I am sure very familiar, ways how to |
74 common, and I am sure very familiar, ways how to construct |
86 construct regular expressions. For example in Scala we have: |
75 regular expressions. For example in Scala we have |
87 |
76 |
88 \begin{center} |
77 \begin{center} |
89 \begin{tabular}{lp{9cm}} |
78 \begin{longtable}{lp{9cm}} |
|
79 \pcode{re*} & matches 0 or more occurrences of preceding |
90 \pcode{re*} & matches 0 or more occurrences of preceding |
80 expression\\ |
91 expression\\ |
81 \pcode{re+} & matches 1 or more occurrences of preceding |
92 \pcode{re+} & matches 1 or more occurrences of preceding |
82 expression\\ |
93 expression\\ |
83 \pcode{re?} & matches 0 or 1 occurrence of preceding |
94 \pcode{re?} & matches 0 or 1 occurrence of preceding |
84 expression\\ |
95 expression\\ |
85 \pcode{re\{n\}} & matches exactly \pcode{n} number of |
96 \pcode{re\{n\}} & matches exactly \pcode{n} number of |
86 occurrences\\ |
97 occurrences of preceding expression\\ |
87 \pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m} |
98 \pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m} |
88 occurences of the preceding expression\\ |
99 occurences of the preceding expression\\ |
89 \pcode{[...]} & matches any single character inside the |
100 \pcode{[...]} & matches any single character inside the |
90 brackets\\ |
101 brackets\\ |
91 \pcode{[^...]} & matches any single character not inside the |
102 \pcode{[^...]} & matches any single character not inside the |
92 brackets\\ |
103 brackets\\ |
93 \pcode{..-..} & character ranges\\ |
104 \pcode{..-..} & character ranges\\ |
94 \pcode{\\d} & matches digits; equivalent to \pcode{[0-9]} |
105 \pcode{\\d} & matches digits; equivalent to \pcode{[0-9]} |
95 \end{longtable} |
106 \end{tabular} |
96 \end{center} |
107 \end{center} |
97 |
108 |
98 \noindent With this you can figure out the purpose of the |
109 \noindent With this table you can figure out the purpose of |
99 regular expressions in the web-crawlers shown Figures |
110 the regular expressions in the web-crawlers shown Figures |
100 \ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note the |
111 \ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note, |
101 regular expression for http-addresses in web-pages: |
112 however, the regular expression for http-addresses in |
|
113 web-pages is meant to be |
102 |
114 |
103 \[ |
115 \[ |
104 \pcode{"https?://[^"]*"} |
116 \pcode{"https?://[^"]*"} |
105 \] |
117 \] |
106 |
118 |
158 \end{scope} |
172 \end{scope} |
159 \end{tikzpicture} |
173 \end{tikzpicture} |
160 \end{center} |
174 \end{center} |
161 |
175 |
162 \noindent This graph shows that Python needs approximately 29 |
176 \noindent This graph shows that Python needs approximately 29 |
163 seconds in order to find out that a string of 28 \texttt{a}s |
177 seconds for finding out whether a string of 28 \texttt{a}s |
164 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. |
178 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. |
165 Ruby is even slightly worse.\footnote{In this example Ruby |
179 Ruby is even slightly worse.\footnote{In this example Ruby |
166 uses the slightly different regular expression |
180 uses the slightly different regular expression |
167 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
181 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
168 \texttt{a} each occur $n$ times.} Admittedly, this regular |
182 \texttt{a} each occur $n$ times.} Admittedly, this regular |
169 expression is carefully chosen to exhibit this exponential |
183 expression is carefully chosen to exhibit this exponential |
170 behaviour, but similar ones occur more often than one wants in |
184 behaviour, but similar ones occur more often than one wants in |
171 ``real life''. They are sometimes called \emph{evil regular |
185 ``real life''. They are sometimes called \emph{evil regular |
172 expressions} because they have the potential to make regular |
186 expressions} because they have the potential to make regular |
173 expression matching engines topple over, like in Python and |
187 expression matching engines to topple over, like in Python and |
174 Ruby. The problem is that this can have some serious |
188 Ruby. The problem with evil regular expressions is that they |
175 consequences, for example, if you use them in your |
189 can have some serious consequences, for example, if you use |
176 web-application, because hackers can look for these instances |
190 them in your web-application. The reason is that hackers can |
177 where the matching engine behaves badly and mount a nice |
191 look for these instances where the matching engine behaves |
178 DoS-attack against your application. |
192 badly and then mount a nice DoS-attack against your |
179 |
193 application. |
180 It will be instructive to look behind the ``scenes''to find |
194 |
|
195 It will be instructive to look behind the ``scenes'' to find |
181 out why Python and Ruby (and others) behave so badly when |
196 out why Python and Ruby (and others) behave so badly when |
182 matching with evil regular expressions. But we will also look |
197 matching with evil regular expressions. But we will also look |
183 at a relatively simple algorithm that solves this problem much |
198 at a relatively simple algorithm that solves this problem much |
184 better than Python and Ruby do\ldots actually it will be two |
199 better than Python and Ruby do\ldots actually it will be two |
185 versions of the algorithm: the first one will be able to |
200 versions of the algorithm: the first one will be able to |
186 process strings of approximately 1,000 \texttt{a}s in 30 |
201 process strings of approximately 1,000 \texttt{a}s in 30 |
187 seconds, while the second version will even be able to |
202 seconds, while the second version will even be able to process |
188 process up to 12,000 in less than 10(!) seconds, see the graph |
203 up to 12,000 in less than 10(!) seconds, see the graph below: |
189 below: |
|
190 |
204 |
191 \begin{center} |
205 \begin{center} |
192 \begin{tikzpicture}[y=.1cm, x=.0006cm] |
206 \begin{tikzpicture}[y=.1cm, x=.0006cm] |
193 %axis |
207 %axis |
194 \draw (0,0) -- coordinate (x axis mid) (12000,0); |
208 \draw (0,0) -- coordinate (x axis mid) (12000,0); |
210 \end{tikzpicture} |
224 \end{tikzpicture} |
211 \end{center} |
225 \end{center} |
212 |
226 |
213 \subsection*{Basic Regular Expressions} |
227 \subsection*{Basic Regular Expressions} |
214 |
228 |
215 The regular expressions shown above we will call |
229 The regular expressions shown above, for example for Scala, we |
216 \defn{extended regular expressions}. The ones we will mainly |
230 will call \emph{extended regular expressions}. The ones we |
217 study are \emph{basic regular expressions}, which by |
231 will mainly study in this module are \emph{basic regular |
218 convention we will just call regular expressions, if it is |
232 expressions}, which by convention we will just call |
219 clear what we mean. The attraction of (basic) regular |
233 \emph{regular expressions}, if it is clear what we mean. The |
220 expressions is that many features of the extended one are just |
234 attraction of (basic) regular expressions is that many |
221 syntactic suggar. (Basic) regular expressions are defined by |
235 features of the extended ones are just syntactic sugar. |
222 the following grammar: |
236 (Basic) regular expressions are defined by the following |
|
237 grammar: |
223 |
238 |
224 \begin{center} |
239 \begin{center} |
225 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
240 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
226 $r$ & $::=$ & $\varnothing$ & null\\ |
241 $r$ & $::=$ & $\varnothing$ & null\\ |
227 & $\mid$ & $\epsilon$ & empty string / "" / []\\ |
242 & $\mid$ & $\epsilon$ & empty string / "" / []\\ |
228 & $\mid$ & $c$ & single character\\ |
243 & $\mid$ & $c$ & single character\\ |
|
244 & $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
229 & $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
245 & $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
230 & $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
|
231 & $\mid$ & $r^*$ & star (zero or more)\\ |
246 & $\mid$ & $r^*$ & star (zero or more)\\ |
232 \end{tabular} |
247 \end{tabular} |
233 \end{center} |
248 \end{center} |
234 |
249 |
235 \noindent Because we overload our notation, there are some |
250 \noindent Because we overload our notation, there are some |
236 subtleties you should be aware of. First, when regular |
251 subtleties you should be aware of. When regular expressions |
237 expressions are referred to then $\varnothing$ does not stand |
252 are referred to then $\varnothing$ does not stand for the |
238 for the empty set: it is a particular pattern that does not |
253 empty set: rather it is a particular pattern that does not |
239 match any string. Similarly, in the context of regular |
254 match any string. Similarly, in the context of regular |
240 expressions, $\epsilon$ does not stand for the empty string |
255 expressions, $\epsilon$ does not stand for the empty string |
241 (as in many places in the literature) but for a pattern that |
256 (as in many places in the literature) but for a regular |
242 matches the empty string. Second, the letter $c$ stands for |
257 expression that matches the empty string. The letter $c$ |
243 any character from the alphabet at hand. Again in the context |
258 stands for any character from the alphabet at hand. Again in |
244 of regular expressions, it is a particular pattern that can |
259 the context of regular expressions, it is a particular pattern |
245 match the specified string. Third, you should also be careful |
260 that can match the specified character. You should also be |
246 with the our overloading of the star: assuming you have read |
261 careful with our overloading of the star: assuming you have |
247 the handout about our basic mathematical notation, you will |
262 read the handout about our basic mathematical notation, you |
248 see that in the context of languages (sets of strings) the |
263 will see that in the context of languages (sets of strings) |
249 star stands for an operation on languages. While $r^*$ stands |
264 the star stands for an operation on languages. While here |
250 for a regular expression, the operation on sets is defined as |
265 $r^*$ stands for a regular expression, which is different from |
|
266 the operation on sets is defined as |
251 |
267 |
252 \[ |
268 \[ |
253 A^* \dn \bigcup_{0\le n} A^n |
269 A^* \dn \bigcup_{0\le n} A^n |
254 \] |
270 \] |
255 |
271 |
256 We will use parentheses to disambiguate regular expressions. |
272 We will use parentheses to disambiguate regular expressions. |
257 Parentheses are not really part of a regular expression, and |
273 Parentheses are not really part of a regular expression, and |
258 indeed we do not need them in our code because there the tree |
274 indeed we do not need them in our code because there the tree |
259 structure is always clear. But for writing them down in a more |
275 structure of regular expressions is always clear. But for |
260 mathematical fashion, parentheses will be helpful. For example |
276 writing them down in a more mathematical fashion, parentheses |
261 we will write $(r_1 + r_2)^*$, which is different from, say |
277 will be helpful. For example we will write $(r_1 + r_2)^*$, |
262 $r_1 + (r_2)^*$. The former means roughly zero or more times |
278 which is different from, say $r_1 + (r_2)^*$. The former means |
263 $r_1$ or $r_2$, while the latter means $r_1$ or zero or more |
279 roughly zero or more times $r_1$ or $r_2$, while the latter |
264 times $r_2$. This will turn out are two different pattern, |
280 means $r_1$ or zero or more times $r_2$. This will turn out |
265 which match in general different strings. We should also write |
281 are two different patterns, which match in general different |
266 $(r_1 + r_2) + r_3$, which is different from the regular |
282 strings. We should also write $(r_1 + r_2) + r_3$, which is |
267 expression $r_1 + (r_2 + r_3)$, but in case of $+$ and $\cdot$ |
283 different from the regular expression $r_1 + (r_2 + r_3)$, but |
268 we actually do not care about the order and just write $r_1 + |
284 in case of $+$ and $\cdot$ we actually do not care about the |
269 r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, respectively. The |
285 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 |
270 reasons for this will become clear shortly. In the literature |
286 \cdot r_3$, respectively. The reasons for this will become |
271 you will often find that the choice $r_1 + r_2$ is written as |
287 clear shortly. In the literature you will often find that the |
272 $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the |
288 choice $r_1 + r_2$ is written as $r_1\mid{}r_2$ or |
273 convention in the literature, we will often omit the $\cdot$ |
289 $r_1\mid\mid{}r_2$. Also following the convention in the |
274 all together. This is to make some concrete regular |
290 literature, we will often omit the $\cdot$ all together. This |
275 expressions more readable. For example the regular expression |
291 is to make some concrete regular expressions more readable. |
276 for email addresses shown in \eqref{email} would look like |
292 For example the regular expression for email addresses shown |
|
293 in \eqref{email} would look like |
277 |
294 |
278 \[ |
295 \[ |
279 \texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\; |
296 \texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\; |
280 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; |
297 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; |
281 \texttt{[...]\{2,6\}} |
298 \texttt{[...]\{2,6\}} |
310 |
327 |
311 A source of confusion might arise from the fact that we |
328 A source of confusion might arise from the fact that we |
312 use the term \emph{basic regular expression} for the regular |
329 use the term \emph{basic regular expression} for the regular |
313 expressions used in ``theory'' and defined above, and |
330 expressions used in ``theory'' and defined above, and |
314 \emph{extended regular expression} for the ones used in |
331 \emph{extended regular expression} for the ones used in |
315 ``practice'', for example Scala. If runtime is not of an |
332 ``practice'', for example in Scala. If runtime is not an |
316 issue, then the latter can be seen as some syntactic sugar of |
333 issue, then the latter can be seen as syntactic sugar of |
317 the former. Fo example we could replace |
334 the former. For example we could replace |
318 |
335 |
319 \begin{center} |
336 \begin{center} |
320 \begin{tabular}{rcl} |
337 \begin{tabular}{rcl} |
321 $r^+$ & $\mapsto$ & $r\cdot r^*$\\ |
338 $r+$ & $\mapsto$ & $r\cdot r^*$\\ |
322 $r?$ & $\mapsto$ & $\epsilon + r$\\ |
339 $r?$ & $\mapsto$ & $\epsilon + r$\\ |
323 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\ |
340 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\ |
324 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\ |
341 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\ |
325 \end{tabular} |
342 \end{tabular} |
326 \end{center} |
343 \end{center} |
327 |
344 |
|
345 |
328 \subsection*{The Meaning of Regular Expressions} |
346 \subsection*{The Meaning of Regular Expressions} |
329 |
347 |
330 |
|
331 So far we have only considered informally what the |
348 So far we have only considered informally what the |
332 \emph{meaning} of a regular expression is. This is no good for |
349 \emph{meaning} of a regular expression is. This is not good |
333 specifications of what algorithms are supposed to do or which |
350 enough for specifications of what algorithms are supposed to |
334 problems they are supposed to solve. |
351 do or which problems they are supposed to solve. |
335 |
352 |
336 To do so more formally we will associate with every regular |
353 To define the meaning of a regular expression we will |
337 expression a language, or set of strings, that is supposed to |
354 associate with every regular expression a language, or set of |
338 be matched by this regular expression. To understand what is |
355 strings. This language contains all the strings the regular |
339 going on here it is crucial that you have also read the |
356 expression is supposed to match. To understand what is going |
340 handout about our basic mathematical notations. |
357 on here it is crucial that you have read the handout |
341 |
358 about basic mathematical notations. |
342 The meaning of a regular expression can be defined recursively |
359 |
343 as follows |
360 The \defn{meaning of a regular expression} can be defined |
|
361 by a recursive function called $L$ (for language), which |
|
362 is defined as follows |
344 |
363 |
345 \begin{center} |
364 \begin{center} |
346 \begin{tabular}{rcl} |
365 \begin{tabular}{rcl} |
347 $L(\varnothing)$ & $\dn$ & $\varnothing$\\ |
366 $L(\varnothing)$ & $\dn$ & $\varnothing$\\ |
348 $L(\epsilon)$ & $\dn$ & $\{[]\}$\\ |
367 $L(\epsilon)$ & $\dn$ & $\{[]\}$\\ |
349 $L(c)$ & $\dn$ & $\{"c"\}$\\ |
368 $L(c)$ & $\dn$ & $\{"c"\}$\\ |
350 $L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\ |
369 $L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\ |
351 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) @ L(r_2)$\\ |
370 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\ |
352 $L(r^*)$ & $\dn$ & $(L(r))^*$\\ |
371 $L(r^*)$ & $\dn$ & $(L(r))^*$\\ |
353 \end{tabular} |
372 \end{tabular} |
354 \end{center} |
373 \end{center} |
355 |
374 |
356 \noindent |
375 \noindent As a result we can now precisely state what the |
357 As a result we can now precisely state what the meaning, for example, of the regular expression |
376 meaning, for example, of the regular expression $h \cdot |
358 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely |
377 e \cdot l \cdot l \cdot o$ is, namely |
359 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the |
378 |
360 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly |
379 \[ |
361 be matched by this choice. You can now also see why we do not make a difference |
380 L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot |
362 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they |
381 {\it o}) = \{"hello"\} |
363 are not the same regular expression, but have the same meaning. |
382 \] |
364 |
383 |
365 The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a |
384 \noindent This is expected because this regular expression |
366 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and |
385 is only supposed to match the ``$hello$''-string. Similarly if |
367 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no}, |
386 we have the choice-regular-expression $a + b$, its meaning is |
368 if $s \not\in L(r)$. We leave this for the next lecture. |
387 |
|
388 \[ |
|
389 L(a + b) = \{"a", "b"\} |
|
390 \] |
|
391 |
|
392 \noindent You can now also see why we do not make a difference |
|
393 between the different regular expressions $(r_1 + r_2) + r_3$ |
|
394 and $r_1 + (r_2 + r_3)$\ldots they are not the same regular |
|
395 expression, but have the same meaning. |
|
396 |
|
397 \begin{eqnarray*} |
|
398 L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\ |
|
399 & = & L(r_1) \cup L(r_2) \cup L(r_3)\\ |
|
400 & = & L(r_1) \cup L(r_2 + r_3)\\ |
|
401 & = & L(r_1 + (r_2 + r_3)) |
|
402 \end{eqnarray*} |
|
403 |
|
404 The point of the definition of $L$ is that we can use it to |
|
405 precisely specify when a string $s$ is matched by a regular |
|
406 expression $r$, namely if and only if $s \in L(r)$. In fact we |
|
407 will write a program \pcode{match} that takes any string $s$ |
|
408 and any regular expression $r$ as argument and returns |
|
409 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in |
|
410 L(r)$. We leave this for the next lecture. |
|
411 |
|
412 There is one more feature of regular expressions that is worth |
|
413 mentioning. Given some strings, there are in general many |
|
414 different regular expressions that can recognise these |
|
415 strings. This is obvious with the regular expression $a + b$ |
|
416 which can match the strings $a$ and $b$. But also the regular |
|
417 expression $b + a$ would match the same strings. However, |
|
418 sometimes it is not so obvious whether two regular expressions |
|
419 match the same strings: for example do $r^*$ and $\epsilon + r |
|
420 \cdot r^*$ match the same strings? What about $\varnothing^*$ |
|
421 and $\epsilon^*$? This suggests the following relation between |
|
422 \defn{equivalent regular expressions}: |
|
423 |
|
424 \[ |
|
425 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) |
|
426 \] |
|
427 |
|
428 \noindent That means two regular expressions are equivalent if |
|
429 they match the same set of strings. Therefore we do not really |
|
430 distinguish between the different regular expression $(r_1 + |
|
431 r_2) + r_3$ and $r_1 + (r_2 + r_3)$, because they are |
|
432 equivalent. I leave you to the question whether |
|
433 |
|
434 \[ |
|
435 \varnothing^* \equiv \epsilon^* |
|
436 \] |
|
437 |
|
438 \noindent holds. Such equivalences will be important for out |
|
439 matching algorithm, because we can use them to simplify |
|
440 regular expressions. |
|
441 |
|
442 \subsection*{My Fascination for Regular Expressions} |
|
443 |
|
444 Up until a few years ago I was not really interested in |
|
445 regular expressions. They have been studied for the last 60 |
|
446 years (by smarter people than me)---surely nothing new can be |
|
447 found out about them. I even have the vague recollection that |
|
448 I did not quite understand them during my study. If I remember |
|
449 correctly,\footnote{That was really a long time ago.} I got |
|
450 utterly confused about $\epsilon$ and the empty |
|
451 string.\footnote{Obviously the lecturer must have been bad.} |
|
452 Since my study, I have used regular expressions for |
|
453 implementing lexers and parsers as I have always been |
|
454 interested in all kinds of programming languages and |
|
455 compilers, which invariably need regular expression in some |
|
456 form or shape. |
|
457 |
|
458 To understand my fascination nowadays with regular |
|
459 expressions, you need to know that my main scientific interest |
|
460 for the last 14 years has been with in theorem provers. I am a |
|
461 core developer of one of |
|
462 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
|
463 provers are systems in which you can formally reason about |
|
464 mathematical concepts, but also about programs. In this way |
|
465 they can help with writing bug-free code. Theorem provers have |
|
466 proved already their value in a number of systems (even in |
|
467 terms of hard cash), but they are still clunky and difficult |
|
468 to use by average programmers. |
|
469 |
|
470 Anyway, in about 2011 I came across the notion of |
|
471 \defn{derivatives of regular expressions}. This notion allows |
|
472 one to do almost all calculations in regular language theory |
|
473 on the level of regular expressions, not needing any automata. |
|
474 This is important because automata are graphs and it is rather |
|
475 difficult to reason about graphs in theorem provers. In |
|
476 contrast, to reason about regular expressions is easy-peasy in |
|
477 theorem provers. Is this important? I think yes, because |
|
478 according to Kuklewicz nearly all POSIX-based regular |
|
479 expression matchers are |
|
480 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
|
481 With my PhD student Fahad Ausaf I am currently working on |
|
482 proving the correctness for one such algorithm that was |
|
483 proposed by Sulzmann and Lu in |
|
484 2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an |
|
485 attractive results since we will be able to prove that the |
|
486 algorithm is really correct, but also that the machine code(!) |
|
487 that implements this code efficiently is correct. Writing |
|
488 programs in this way does not leave any room for potential |
|
489 errors or bugs. How nice! |
|
490 |
|
491 What also helped with my fascination with regular expressions |
|
492 is that we could indeed find out new things about them that |
|
493 have surprised some experts in the field of regular |
|
494 expressions. Together with two colleagues from China, I was |
|
495 able to prove the Myhill-Nerode theorem by only using regular |
|
496 expressions and the notion of derivatives. Earlier versions of |
|
497 this theorem used always automata in the proof. Using this |
|
498 theorem we can show that regular languages are closed under |
|
499 complementation, something which Gasarch in his |
|
500 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be |
|
501 shown via automata. Even sombody who has written a 700+-page |
|
502 book\footnote{\url{http://goo.gl/fD0eHx}} on regular |
|
503 exprssions did not know better. Well, we showed it can also be |
|
504 done with regular expressions only. What a feeling if you |
|
505 are an outsider to the subject! |
|
506 |
|
507 To conclude: Despite my early ignorance about regular |
|
508 expressions, I find them now quite interesting. They have a |
|
509 beautiful mathematical theory behind them. They have practical |
|
510 importance (remember the shocking runtime of the regular |
|
511 expression matchers in Python and Ruby in some instances). |
|
512 People who are not very familiar with the mathematical |
|
513 background get them consistently wrong. The hope is that we |
|
514 can do better in the future---for example by proving that the |
|
515 algorithms actually satisfy their specification and that the |
|
516 corresponding implementations do not contain any bugs. We are |
|
517 close, but not yet quite there. |
369 |
518 |
370 \begin{figure}[p] |
519 \begin{figure}[p] |
371 \lstinputlisting{../progs/crawler1.scala} |
520 \lstinputlisting{../progs/crawler1.scala} |
372 \caption{The Scala code for a simple web-crawler that checks |
521 \caption{The Scala code for a simple web-crawler that checks |
373 for broken links in a web-page. It uses the regular expression |
522 for broken links in a web-page. It uses the regular expression |
398 \texttt{email\_pattern} in Line~16. The main change is in Line |
547 \texttt{email\_pattern} in Line~16. The main change is in Line |
399 32 where all email addresses that can be found in a page are |
548 32 where all email addresses that can be found in a page are |
400 printed.\label{crawler3}} |
549 printed.\label{crawler3}} |
401 \end{figure} |
550 \end{figure} |
402 |
551 |
403 \pagebreak |
|
404 Lets start |
|
405 with what we mean by \emph{strings}. Strings (they are also |
|
406 sometimes referred to as \emph{words}) are lists of characters |
|
407 drawn from an \emph{alphabet}. If nothing else is specified, |
|
408 we usually assume the alphabet consists of just the lower-case |
|
409 letters $a$, $b$, \ldots, $z$. Sometimes, however, we |
|
410 explicitly restrict strings to contain, for example, only the |
|
411 letters $a$ and $b$. In this case we say the alphabet is the |
|
412 set $\{a, b\}$. |
|
413 |
|
414 There are many ways how we can write down strings. In programming languages, they are usually |
|
415 written as {\it "hello"} where the double quotes indicate that we dealing with a string. |
|
416 Essentially, strings are lists of characters which can be written for example as follows |
|
417 |
|
418 \[ |
|
419 [\text{\it h, e, l, l, o}] |
|
420 \] |
|
421 |
|
422 \noindent |
|
423 The important point is that we can always decompose strings. For example, we will often consider the |
|
424 first character of a string, say $h$, and the ``rest'' of a string say {\it "ello"} when making definitions |
|
425 about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as |
|
426 the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated}, |
|
427 which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenation |
|
428 gives {\it "foobar"}. |
|
429 |
|
430 We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$ |
|
431 is |
|
432 |
|
433 \[ |
|
434 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\} |
|
435 \] |
|
436 |
|
437 \noindent |
|
438 Any set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behind |
|
439 this choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like |
|
440 |
|
441 \[ |
|
442 \{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\} |
|
443 \] |
|
444 |
|
445 \noindent |
|
446 then we have essentially described the English language, or more precisely all |
|
447 strings that can be used in a sentence of the English language. French would be a |
|
448 different set of strings, and so on. In the context of this course, a language might |
|
449 not necessarily make sense from a natural language point of view. For example |
|
450 the set of all strings shown above is a language, as is the empty set (of strings). The |
|
451 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a |
|
452 difference between the empty set, or empty language, and the set that |
|
453 contains only the empty string $\{\text{""}\}$: the former has no elements, whereas |
|
454 the latter has one element. |
|
455 |
|
456 |
|
457 |
|
458 Before we expand on the topic of regular expressions, let us review some operations on |
|
459 sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. |
|
460 The union of two sets is written as usual as $A \cup B$. We also need to define the |
|
461 operation of \emph{concatenating} two sets of strings. This can be defined as |
|
462 |
|
463 \[ |
|
464 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \} |
|
465 \] |
|
466 |
|
467 \noindent |
|
468 which essentially means take the first string from the set $A$ and concatenate it with every |
|
469 string in the set $B$, then take the second string from $A$ do the same and so on. You might |
|
470 like to think about what this definition means in case $A$ or $B$ is the empty set. |
|
471 |
|
472 We also need to define |
|
473 the power of a set of strings, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows |
|
474 |
|
475 \begin{center} |
|
476 \begin{tabular}{rcl} |
|
477 $A^0$ & $\dn$ & $\{[\,]\}$ \\ |
|
478 $A^{n+1}$ & $\dn$ & $A @ A^n$\\ |
|
479 \end{tabular} |
|
480 \end{center} |
|
481 |
|
482 \noindent |
|
483 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union |
|
484 of every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is |
|
485 |
|
486 \[ |
|
487 A^* \dn \bigcup_{0\le n} A^n |
|
488 \] |
|
489 |
|
490 \noindent |
|
491 This definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one |
|
492 copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore |
|
493 have |
|
494 |
|
495 \[ |
|
496 A^* = \{"", "a", "aa", "aaa", \ldots\} |
|
497 \] |
|
498 |
|
499 \noindent |
|
500 Be aware that these operations sometimes have quite non-intuitive properties, for example |
|
501 |
|
502 \begin{center} |
|
503 \begin{tabular}{@{}ccc@{}} |
|
504 \begin{tabular}{@{}r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
505 $A \cup \varnothing$ & $=$ & $A$\\ |
|
506 $A \cup A$ & $=$ & $A$\\ |
|
507 $A \cup B$ & $=$ & $B \cup A$\\ |
|
508 \end{tabular} & |
|
509 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
510 $A @ B$ & $\not =$ & $B @ A$\\ |
|
511 $A @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\ |
|
512 $A @ \{""\}$ & $=$ & $\{""\} @ A = A$\\ |
|
513 \end{tabular} & |
|
514 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
|
515 $\varnothing^*$ & $=$ & $\{""\}$\\ |
|
516 $\{""\}^*$ & $=$ & $\{""\}$\\ |
|
517 $A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\ |
|
518 \end{tabular} |
|
519 \end{tabular} |
|
520 \end{center} |
|
521 \bigskip |
|
522 |
|
523 |
|
524 \subsection*{My Fascination for Regular Expressions} |
|
525 |
552 |
526 |
553 |
527 \end{document} |
554 \end{document} |
528 |
555 |
529 %%% Local Variables: |
556 %%% Local Variables: |