1 \documentclass{article} |
1 \documentclass{article} |
2 \usepackage{../style} |
2 \usepackage{../style} |
3 \usepackage{../langs} |
3 \usepackage{../langs} |
4 |
4 \usepackage{../graphics} |
|
5 \usepackage{../data} |
|
6 \usepackage{longtable} |
5 |
7 |
6 |
8 |
7 \begin{document} |
9 \begin{document} |
8 |
10 |
9 \section*{Handout 1} |
11 \section*{Handout 1} |
10 |
12 |
11 This module is about the processing of strings. Lets start with what we mean by \emph{strings}. Strings |
13 This module is about text processing, be it for web-crawlers, |
12 (they are also sometimes referred to as \emph{words}) are lists of characters drawn from an \emph{alphabet}. |
14 compilers, dictionaries, DNA-data and so on. When looking for |
13 If nothing else is specified, we usually assume |
15 a particular string in a large text we can use the |
14 the alphabet consists of just the lower-case letters $a$, $b$, \ldots, $z$. Sometimes, however, we explicitly |
16 Knuth-Morris-Pratt algorithm, which is currently the most |
15 restrict strings to contain, for example, only the letters $a$ and $b$. In this case we say the alphabet is the set $\{a, b\}$. |
17 efficient general string search algorithm. But often we do |
|
18 \emph{not} look for just a particular string, but for string |
|
19 patterns. For example in programming code we need to identify |
|
20 what are the keywords, what are the identifiers etc. Also |
|
21 often we face the problem that we are given a string (for |
|
22 example some user input) and want to know whether it matches a |
|
23 particular pattern. For example for excluding some user input |
|
24 that would otherwise have nasty effects on our program |
|
25 (crashing or going into an infinite loop, if not worse). |
|
26 \defn{Regular expressions} help with conveniently specifying |
|
27 such patterns. |
|
28 |
|
29 |
|
30 The idea behind regular expressions is that they are a simple |
|
31 method for describing languages (or sets of strings)...at |
|
32 least languages we are interested in in computer science. For |
|
33 example there is no convenient regular expression for |
|
34 describing the English language short of enumerating all |
|
35 English words. But they seem useful for describing for example |
|
36 email addresses.\footnote{See ``8 Regular Expressions You |
|
37 Should Know'' \url{http://goo.gl/5LoVX7}} Consider the |
|
38 following regular expression |
|
39 |
|
40 \begin{equation}\label{email} |
|
41 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}} |
|
42 \end{equation} |
|
43 |
|
44 \noindent where the first part matches one or more lowercase |
|
45 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
|
46 or hyphens. The \pcode{+} ensures the ``one or more''. Then |
|
47 comes the \pcode{@}-sign, followed by the domain name which |
|
48 must be one or more lowercase letters, digits, underscores, |
|
49 dots or hyphens. Note there cannot be an underscore in the |
|
50 domain name. Finally there must be a dot followed by the |
|
51 toplevel domain. This toplevel domain must be 2 to 6 lowercase |
|
52 letters including the dot. Example strings which follow this |
|
53 pattern are: |
|
54 |
|
55 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
|
56 niceandsimple@example.com |
|
57 very.common@example.org |
|
58 a.little.lengthy.but.fine@dept.example.co.uk |
|
59 other.email-with-dash@example.ac.uk |
|
60 \end{lstlisting} |
|
61 |
|
62 |
|
63 \noindent |
|
64 But for example the following two do not: |
|
65 |
|
66 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
|
67 user@localserver |
|
68 disposable.style.email.with+symbol@example.com |
|
69 \end{lstlisting} |
|
70 |
|
71 Many programming language offer libraries that can be used to |
|
72 validate such strings against regular expressions, like the |
|
73 one for email addresses in \eqref{email}. There are some |
|
74 common, and I am sure very familiar, ways how to construct |
|
75 regular expressions. For example in Scala we have |
|
76 |
|
77 \begin{center} |
|
78 \begin{longtable}{lp{9cm}} |
|
79 \pcode{re*} & matches 0 or more occurrences of preceding |
|
80 expression\\ |
|
81 \pcode{re+} & matches 1 or more occurrences of preceding |
|
82 expression\\ |
|
83 \pcode{re?} & matches 0 or 1 occurrence of preceding |
|
84 expression\\ |
|
85 \pcode{re\{n\}} & matches exactly \pcode{n} number of |
|
86 occurrences\\ |
|
87 \pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m} |
|
88 occurences of the preceding expression\\ |
|
89 \pcode{[...]} & matches any single character inside the |
|
90 brackets\\ |
|
91 \pcode{[^...]} & matches any single character not inside the |
|
92 brackets\\ |
|
93 \pcode{..-..} & character ranges\\ |
|
94 \pcode{\\d} & matches digits; equivalent to \pcode{[0-9]} |
|
95 \end{longtable} |
|
96 \end{center} |
|
97 |
|
98 \noindent With this you can figure out the purpose of the |
|
99 regular expressions in the web-crawlers shown Figures |
|
100 \ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note the |
|
101 regular expression for http-addresses in web-pages: |
|
102 |
|
103 \[ |
|
104 \pcode{"https?://[^"]*"} |
|
105 \] |
|
106 |
|
107 \noindent It specifies that web-addresses need to start with a |
|
108 double quote, then comes \texttt{http} followed by an optional |
|
109 \texttt{s} and so on. Usually we would have to escape the |
|
110 double quotes in order to make sure we interpret the double |
|
111 quote as character, not as double quote for a string. But |
|
112 Scala's trick with triple quotes allows us to omit this kind |
|
113 of escaping. As a result we can just write: |
|
114 |
|
115 \[ |
|
116 \pcode{""""https?://[^"]*"""".r} |
|
117 \] |
|
118 |
|
119 \noindent Not also that the convention in Scala is that |
|
120 \texttt{.r} converts a string into a regular expression. I |
|
121 leave it to you to ponder whether this regular expression |
|
122 really captures all possible web-addresses.\bigskip |
|
123 |
|
124 Regular expressions were introduced by Kleene in the 1950ies |
|
125 and they have been object of intense study since then. They |
|
126 are nowadays pretty much ubiquitous in computer science. I am |
|
127 sure you have come across them before. Why on earth then is |
|
128 there any interest in studying them again in depth in this |
|
129 module? Well, one answer is in the following graph about |
|
130 regular expression matching in Python and in Ruby. |
|
131 |
|
132 \begin{center} |
|
133 \begin{tikzpicture}[y=.1cm, x=.15cm] |
|
134 %axis |
|
135 \draw (0,0) -- coordinate (x axis mid) (30,0); |
|
136 \draw (0,0) -- coordinate (y axis mid) (0,30); |
|
137 %ticks |
|
138 \foreach \x in {0,5,...,30} |
|
139 \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; |
|
140 \foreach \y in {0,5,...,30} |
|
141 \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; |
|
142 %labels |
|
143 \node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; |
|
144 \node[rotate=90,above=0.8cm] at (y axis mid) {time in secs}; |
|
145 %plots |
|
146 \draw[color=blue] plot[mark=*] |
|
147 file {re-python.data}; |
|
148 \draw[color=brown] plot[mark=triangle*] |
|
149 file {re-ruby.data}; |
|
150 %legend |
|
151 \begin{scope}[shift={(4,20)}] |
|
152 \draw[color=blue] (0,0) -- |
|
153 plot[mark=*] (0.25,0) -- (0.5,0) |
|
154 node[right]{\small Python}; |
|
155 \draw[yshift=-\baselineskip, color=brown] (0,0) -- |
|
156 plot[mark=triangle*] (0.25,0) -- (0.5,0) |
|
157 node[right]{\small Ruby}; |
|
158 \end{scope} |
|
159 \end{tikzpicture} |
|
160 \end{center} |
|
161 |
|
162 \noindent This graph shows that Python needs approximately 29 |
|
163 seconds in order to find out that a string of 28 \texttt{a}s |
|
164 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. |
|
165 Ruby is even slightly worse.\footnote{In this example Ruby |
|
166 uses the slightly different regular expression |
|
167 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
|
168 \texttt{a} each occur $n$ times.} Admittedly, this regular |
|
169 expression is carefully chosen to exhibit this exponential |
|
170 behaviour, but similar ones occur more often than one wants in |
|
171 ``real life''. They are sometimes called \emph{evil regular |
|
172 expressions} because they have the potential to make regular |
|
173 expression matching engines topple over, like in Python and |
|
174 Ruby. The problem is that this can have some serious |
|
175 consequences, for example, if you use them in your |
|
176 web-application, because hackers can look for these instances |
|
177 where the matching engine behaves badly and mount a nice |
|
178 DoS-attack against your application. |
|
179 |
|
180 It will be instructive to look behind the ``scenes''to find |
|
181 out why Python and Ruby (and others) behave so badly when |
|
182 matching with evil regular expressions. But we will also look |
|
183 at a relatively simple algorithm that solves this problem much |
|
184 better than Python and Ruby do\ldots actually it will be two |
|
185 versions of the algorithm: the first one will be able to |
|
186 process strings of approximately 1,000 \texttt{a}s in 30 |
|
187 seconds, while the second version will even be able to |
|
188 process up to 12,000 in less than 10(!) seconds, see the graph |
|
189 below: |
|
190 |
|
191 \begin{center} |
|
192 \begin{tikzpicture}[y=.1cm, x=.0006cm] |
|
193 %axis |
|
194 \draw (0,0) -- coordinate (x axis mid) (12000,0); |
|
195 \draw (0,0) -- coordinate (y axis mid) (0,30); |
|
196 %ticks |
|
197 \foreach \x in {0,2000,...,12000} |
|
198 \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; |
|
199 \foreach \y in {0,5,...,30} |
|
200 \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; |
|
201 %labels |
|
202 \node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; |
|
203 \node[rotate=90, above=0.8cm] at (y axis mid) {time in secs}; |
|
204 |
|
205 %plots |
|
206 \draw[color=green] plot[mark=square*, mark options={fill=white} ] |
|
207 file {re2b.data}; |
|
208 \draw[color=black] plot[mark=square*, mark options={fill=white} ] |
|
209 file {re3.data}; |
|
210 \end{tikzpicture} |
|
211 \end{center} |
|
212 |
|
213 \subsection*{Basic Regular Expressions} |
|
214 |
|
215 The regular expressions shown above we will call |
|
216 \defn{extended regular expressions}. The ones we will mainly |
|
217 study are \emph{basic regular expressions}, which by |
|
218 convention we will just call regular expressions, if it is |
|
219 clear what we mean. The attraction of (basic) regular |
|
220 expressions is that many features of the extended one are just |
|
221 syntactic suggar. (Basic) regular expressions are defined by |
|
222 the following grammar: |
|
223 |
|
224 \begin{center} |
|
225 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
|
226 $r$ & $::=$ & $\varnothing$ & null\\ |
|
227 & $\mid$ & $\epsilon$ & empty string / "" / []\\ |
|
228 & $\mid$ & $c$ & single character\\ |
|
229 & $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
|
230 & $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
|
231 & $\mid$ & $r^*$ & star (zero or more)\\ |
|
232 \end{tabular} |
|
233 \end{center} |
|
234 |
|
235 \noindent Because we overload our notation, there are some |
|
236 subtleties you should be aware of. First, when regular |
|
237 expressions are referred to then $\varnothing$ does not stand |
|
238 for the empty set: it is a particular pattern that does not |
|
239 match any string. Similarly, in the context of regular |
|
240 expressions, $\epsilon$ does not stand for the empty string |
|
241 (as in many places in the literature) but for a pattern that |
|
242 matches the empty string. Second, the letter $c$ stands for |
|
243 any character from the alphabet at hand. Again in the context |
|
244 of regular expressions, it is a particular pattern that can |
|
245 match the specified string. Third, you should also be careful |
|
246 with the our overloading of the star: assuming you have read |
|
247 the handout about our basic mathematical notation, you will |
|
248 see that in the context of languages (sets of strings) the |
|
249 star stands for an operation on languages. While $r^*$ stands |
|
250 for a regular expression, the operation on sets is defined as |
|
251 |
|
252 \[ |
|
253 A^* \dn \bigcup_{0\le n} A^n |
|
254 \] |
|
255 |
|
256 We will use parentheses to disambiguate regular expressions. |
|
257 Parentheses are not really part of a regular expression, and |
|
258 indeed we do not need them in our code because there the tree |
|
259 structure is always clear. But for writing them down in a more |
|
260 mathematical fashion, parentheses will be helpful. For example |
|
261 we will write $(r_1 + r_2)^*$, which is different from, say |
|
262 $r_1 + (r_2)^*$. The former means roughly zero or more times |
|
263 $r_1$ or $r_2$, while the latter means $r_1$ or zero or more |
|
264 times $r_2$. This will turn out are two different pattern, |
|
265 which match in general different strings. We should also write |
|
266 $(r_1 + r_2) + r_3$, which is different from the regular |
|
267 expression $r_1 + (r_2 + r_3)$, but in case of $+$ and $\cdot$ |
|
268 we actually do not care about the order and just write $r_1 + |
|
269 r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, respectively. The |
|
270 reasons for this will become clear shortly. In the literature |
|
271 you will often find that the choice $r_1 + r_2$ is written as |
|
272 $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the |
|
273 convention in the literature, we will often omit the $\cdot$ |
|
274 all together. This is to make some concrete regular |
|
275 expressions more readable. For example the regular expression |
|
276 for email addresses shown in \eqref{email} would look like |
|
277 |
|
278 \[ |
|
279 \texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\; |
|
280 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; |
|
281 \texttt{[...]\{2,6\}} |
|
282 \] |
|
283 |
|
284 \noindent |
|
285 which is much less readable than \eqref{email}. Similarly for |
|
286 the regular expression that matches the string $hello$ we |
|
287 should write |
|
288 |
|
289 \[ |
|
290 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o} |
|
291 \] |
|
292 |
|
293 \noindent |
|
294 but often just write {\it hello}. |
|
295 |
|
296 If you prefer to think in terms of the implementation |
|
297 of regular expressions in Scala, the constructors and |
|
298 classes relate as follows |
|
299 |
|
300 \begin{center} |
|
301 \begin{tabular}{rcl} |
|
302 $\varnothing$ & $\mapsto$ & \texttt{NULL}\\ |
|
303 $\epsilon$ & $\mapsto$ & \texttt{EMPTY}\\ |
|
304 $c$ & $\mapsto$ & \texttt{CHAR(c)}\\ |
|
305 $r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\ |
|
306 $r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\ |
|
307 $r^*$ & $\mapsto$ & \texttt{STAR(r)} |
|
308 \end{tabular} |
|
309 \end{center} |
|
310 |
|
311 A source of confusion might arise from the fact that we |
|
312 use the term \emph{basic regular expression} for the regular |
|
313 expressions used in ``theory'' and defined above, and |
|
314 \emph{extended regular expression} for the ones used in |
|
315 ``practice'', for example Scala. If runtime is not of an |
|
316 issue, then the latter can be seen as some syntactic sugar of |
|
317 the former. Fo example we could replace |
|
318 |
|
319 \begin{center} |
|
320 \begin{tabular}{rcl} |
|
321 $r^+$ & $\mapsto$ & $r\cdot r^*$\\ |
|
322 $r?$ & $\mapsto$ & $\epsilon + r$\\ |
|
323 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\ |
|
324 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\ |
|
325 \end{tabular} |
|
326 \end{center} |
|
327 |
|
328 \subsection*{The Meaning of Regular Expressions} |
|
329 |
|
330 |
|
331 So far we have only considered informally what the |
|
332 \emph{meaning} of a regular expression is. This is no good for |
|
333 specifications of what algorithms are supposed to do or which |
|
334 problems they are supposed to solve. |
|
335 |
|
336 To do so more formally we will associate with every regular |
|
337 expression a language, or set of strings, that is supposed to |
|
338 be matched by this regular expression. To understand what is |
|
339 going on here it is crucial that you have also read the |
|
340 handout about our basic mathematical notations. |
|
341 |
|
342 The meaning of a regular expression can be defined recursively |
|
343 as follows |
|
344 |
|
345 \begin{center} |
|
346 \begin{tabular}{rcl} |
|
347 $L(\varnothing)$ & $\dn$ & $\varnothing$\\ |
|
348 $L(\epsilon)$ & $\dn$ & $\{[]\}$\\ |
|
349 $L(c)$ & $\dn$ & $\{"c"\}$\\ |
|
350 $L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\ |
|
351 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) @ L(r_2)$\\ |
|
352 $L(r^*)$ & $\dn$ & $(L(r))^*$\\ |
|
353 \end{tabular} |
|
354 \end{center} |
|
355 |
|
356 \noindent |
|
357 As a result we can now precisely state what the meaning, for example, of the regular expression |
|
358 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely |
|
359 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the |
|
360 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly |
|
361 be matched by this choice. You can now also see why we do not make a difference |
|
362 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they |
|
363 are not the same regular expression, but have the same meaning. |
|
364 |
|
365 The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a |
|
366 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and |
|
367 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no}, |
|
368 if $s \not\in L(r)$. We leave this for the next lecture. |
|
369 |
|
370 \begin{figure}[p] |
|
371 \lstinputlisting{../progs/crawler1.scala} |
|
372 \caption{The Scala code for a simple web-crawler that checks |
|
373 for broken links in a web-page. It uses the regular expression |
|
374 \texttt{http\_pattern} in Line~15 for recognising URL-addresses. |
|
375 It finds all links using the library function |
|
376 \texttt{findAllIn} in Line~21.\label{crawler1}} |
|
377 \end{figure} |
|
378 |
|
379 |
|
380 \begin{figure}[p] |
|
381 \lstinputlisting{../progs/crawler2.scala} |
|
382 |
|
383 \caption{A version of the web-crawler that only follows links |
|
384 in ``my'' domain---since these are the ones I am interested in |
|
385 to fix. It uses the regular expression \texttt{my\_urls} in |
|
386 Line~16 to check for my name in the links. The main change is |
|
387 in Lines~26--29 where there is a test whether URL is in ``my'' |
|
388 domain or not.\label{crawler2}} |
|
389 |
|
390 \end{figure} |
|
391 |
|
392 \begin{figure}[p] |
|
393 \lstinputlisting{../progs/crawler3.scala} |
|
394 |
|
395 \caption{A small email harvester---whenever we download a |
|
396 web-page, we also check whether it contains any email |
|
397 addresses. For this we use the regular expression |
|
398 \texttt{email\_pattern} in Line~16. The main change is in Line |
|
399 32 where all email addresses that can be found in a page are |
|
400 printed.\label{crawler3}} |
|
401 \end{figure} |
|
402 |
|
403 \pagebreak |
|
404 Lets start |
|
405 with what we mean by \emph{strings}. Strings (they are also |
|
406 sometimes referred to as \emph{words}) are lists of characters |
|
407 drawn from an \emph{alphabet}. If nothing else is specified, |
|
408 we usually assume the alphabet consists of just the lower-case |
|
409 letters $a$, $b$, \ldots, $z$. Sometimes, however, we |
|
410 explicitly restrict strings to contain, for example, only the |
|
411 letters $a$ and $b$. In this case we say the alphabet is the |
|
412 set $\{a, b\}$. |
16 |
413 |
17 There are many ways how we can write down strings. In programming languages, they are usually |
414 There are many ways how we can write down strings. In programming languages, they are usually |
18 written as {\it "hello"} where the double quotes indicate that we dealing with a string. |
415 written as {\it "hello"} where the double quotes indicate that we dealing with a string. |
19 Essentially, strings are lists of characters which can be written for example as follows |
416 Essentially, strings are lists of characters which can be written for example as follows |
20 |
417 |
145 \end{tabular} |
518 \end{tabular} |
146 \end{tabular} |
519 \end{tabular} |
147 \end{center} |
520 \end{center} |
148 \bigskip |
521 \bigskip |
149 |
522 |
150 \noindent |
523 |
151 \emph{Regular expressions} are meant to conveniently describe languages...at least languages |
524 \subsection*{My Fascination for Regular Expressions} |
152 we are interested in in Computer Science. For example there is no convenient regular expression |
525 |
153 for describing the English language short of enumerating all English words. |
|
154 But they seem useful for describing all permitted email addresses, as seen above. |
|
155 |
|
156 Regular expressions are given by the following grammar: |
|
157 |
|
158 \begin{center} |
|
159 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
|
160 $r$ & $::=$ & $\varnothing$ & null\\ |
|
161 & $\mid$ & $\epsilon$ & empty string / "" / []\\ |
|
162 & $\mid$ & $c$ & single character\\ |
|
163 & $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
|
164 & $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
|
165 & $\mid$ & $r^*$ & star (zero or more)\\ |
|
166 \end{tabular} |
|
167 \end{center} |
|
168 |
|
169 \noindent |
|
170 Because we overload our notation, there are some subtleties you should be aware of. First, the letter |
|
171 $c$ stands for any character from the |
|
172 alphabet at hand. Second, we will use parentheses to disambiguate |
|
173 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$. |
|
174 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times |
|
175 $r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$, |
|
176 but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, |
|
177 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice |
|
178 $r_1 + r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even |
|
179 often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form |
|
180 |
|
181 \[ |
|
182 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots |
|
183 \] |
|
184 |
|
185 \noindent |
|
186 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then |
|
187 a domain name (specified by the regular expression $([\ldots])^+$), then a dot and then a top-level domain. Similarly if |
|
188 we want to specify the regular expression for the string {\it "hello"} we should write |
|
189 |
|
190 \[ |
|
191 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o} |
|
192 \] |
|
193 |
|
194 \noindent |
|
195 but often just write {\it hello}. |
|
196 |
|
197 Another source of confusion might arise from the fact that we use the term \emph{regular expression} for the regular expressions used in ``theory'' |
|
198 and also the ones used in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. |
|
199 In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regular |
|
200 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience |
|
201 as they can be seen as shorthand for |
|
202 |
|
203 \begin{center} |
|
204 \begin{tabular}{rcl} |
|
205 $r^+$ & $\mapsto$ & $r\cdot r^*$\\ |
|
206 $r^?$ & $\mapsto$ & $\epsilon + r$\\ |
|
207 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\ |
|
208 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\ |
|
209 \end{tabular} |
|
210 \end{center} |
|
211 |
|
212 |
|
213 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular |
|
214 expression is supposed to stand for every string \emph{not} matched by a regular expression. We will write |
|
215 such not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand for |
|
216 these kind of not-regular-expressions is. Try to write down the regular expression which can match any |
|
217 string except the two strings {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include |
|
218 $\sim{}r$ in the definition of regular expressions. Whenever we do so, we will state it explicitly. |
|
219 |
|
220 So far we have only considered informally what the \emph{meaning} of a regular expression is. |
|
221 To do so more formally we will associate with every regular expression a set of strings |
|
222 that is supposed to be matched by this |
|
223 regular expression. This can be defined recursively as follows |
|
224 |
|
225 \begin{center} |
|
226 \begin{tabular}{rcl} |
|
227 $L(\varnothing)$ & $\dn$ & $\{\,\}$\\ |
|
228 $L(\epsilon)$ & $\dn$ & $\{""\}$\\ |
|
229 $L(c)$ & $\dn$ & $\{"c"\}$\\ |
|
230 $L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\ |
|
231 $L(r_1 \cdot r_2)$ & $\dn$ & $\{s_1@ s_2 | s_1 \in L(r_1) \wedge s_2 \in L(r_2) \}$\\ |
|
232 $L(r^*)$ & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\ |
|
233 \end{tabular} |
|
234 \end{center} |
|
235 |
|
236 \noindent |
|
237 As a result we can now precisely state what the meaning, for example, of the regular expression |
|
238 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely |
|
239 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the |
|
240 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly |
|
241 be matched by this choice. You can now also see why we do not make a difference |
|
242 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they |
|
243 are not the same regular expression, but have the same meaning. |
|
244 |
|
245 The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a |
|
246 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and |
|
247 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no}, |
|
248 if $s \not\in L(r)$. We leave this for the next lecture. |
|
249 |
|
250 \begin{figure}[p] |
|
251 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler1.scala}}} |
|
252 \caption{Scala code for a web-crawler that can detect broken links in a web-page. It uses |
|
253 the regular expression {\tt http\_pattern} in Line~15 for recognising URL-addresses. It finds |
|
254 all links using the library function {\tt findAllIn} in Line~21.} |
|
255 \end{figure} |
|
256 |
|
257 \begin{figure}[p] |
|
258 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler2.scala}}} |
|
259 \caption{A version of the web-crawler which only follows links in ``my'' domain---since these are the |
|
260 ones I am interested in to fix. It uses the regular expression {\tt my\_urls} in Line~16. |
|
261 The main change is in Line~26 where there is a test whether URL is in ``my'' domain or not.} |
|
262 |
|
263 \end{figure} |
|
264 |
|
265 \begin{figure}[p] |
|
266 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler3.scala}}} |
|
267 \caption{A small email harvester---whenever we download a web-page, we also check whether |
|
268 it contains any email addresses. For this we use the regular expression {\tt email\_pattern} in |
|
269 Line~17. The main change is in Lines 33 and 34 where all email addresses that can be found in a page are printed.} |
|
270 \end{figure} |
|
271 |
526 |
272 \end{document} |
527 \end{document} |
273 |
528 |
274 %%% Local Variables: |
529 %%% Local Variables: |
275 %%% mode: latex |
530 %%% mode: latex |