8 %%https://jex.im/regulex/ |
8 %%https://jex.im/regulex/ |
9 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/ |
9 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/ |
10 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/ |
10 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/ |
11 |
11 |
12 \begin{document} |
12 \begin{document} |
13 \fnote{\copyright{} Christian Urban, 2014, 2015} |
13 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016} |
14 |
14 |
15 \section*{Handout 1} |
15 \section*{Handout 1} |
16 |
16 |
17 This module is about text processing, be it for web-crawlers, |
17 This module is about text processing, be it for web-crawlers, |
18 compilers, dictionaries, DNA-data and so on. When looking for |
18 compilers, dictionaries, DNA-data and so on. When looking for |
24 are the keywords, what are the identifiers etc. A pattern for |
24 are the keywords, what are the identifiers etc. A pattern for |
25 identifiers could be stated as: they start with a letter, |
25 identifiers could be stated as: they start with a letter, |
26 followed by zero or more letters, numbers and underscores. |
26 followed by zero or more letters, numbers and underscores. |
27 Also often we face the problem that we are given a string (for |
27 Also often we face the problem that we are given a string (for |
28 example some user input) and want to know whether it matches a |
28 example some user input) and want to know whether it matches a |
29 particular pattern. In this way we can, for example, exclude |
29 particular pattern---be it an email address, for example. In |
30 user input that would otherwise have nasty effects on our |
30 this way we can exclude user input that would otherwise have |
31 program (crashing it or making it go into an infinite loop, if |
31 nasty effects on our program (crashing it or making it go into |
32 not worse). |
32 an infinite loop, if not worse). |
33 |
33 |
34 \defn{Regular expressions} help with conveniently specifying |
34 \defn{Regular expressions} help with conveniently specifying |
35 such patterns. The idea behind regular expressions is that |
35 such patterns. The idea behind regular expressions is that |
36 they are a simple method for describing languages (or sets of |
36 they are a simple method for describing languages (or sets of |
37 strings)\ldots at least languages we are interested in in |
37 strings)\ldots at least languages we are interested in in |
38 computer science. For example there is no convenient regular |
38 computer science. For example there is no convenient regular |
39 expression for describing the English language short of |
39 expression for describing the English language short of |
40 enumerating all English words. But they seem useful for |
40 enumerating all English words. But they seem useful for |
41 describing for example email addresses.\footnote{See ``8 |
41 describing for example simple email addresses.\footnote{See |
42 Regular Expressions You Should Know'' |
42 ``8 Regular Expressions You Should Know'' |
43 \url{http://goo.gl/5LoVX7}} Consider the following regular |
43 \url{http://goo.gl/5LoVX7}} Consider the following regular |
44 expression |
44 expression |
45 |
45 |
46 \begin{equation}\label{email} |
46 \begin{equation}\label{email} |
47 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}} |
47 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}} |
65 other.email-with-dash@example.edu |
65 other.email-with-dash@example.edu |
66 \end{lstlisting} |
66 \end{lstlisting} |
67 |
67 |
68 |
68 |
69 \noindent |
69 \noindent |
70 But for example the following two do not: |
70 But for example the following two do not |
71 |
71 |
72 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
72 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
73 user@localserver |
73 user@localserver |
74 disposable.style.email.with+symbol@example.com |
74 disposable.style.email.with+symbol@example.com |
75 \end{lstlisting} |
75 \end{lstlisting} |
76 |
76 |
|
77 \noindent according to the regular expression we specified in |
|
78 \eqref{email}. Whether this is intended or not is a different |
|
79 question (the second email above is actually an acceptable |
|
80 email address acording to the RFC 5322 standard for email |
|
81 addresses). |
|
82 |
77 As mentioned above, identifiers, or variables, in program code |
83 As mentioned above, identifiers, or variables, in program code |
78 are often required to satisfy the constraints that they start |
84 are often required to satisfy the constraints that they start |
79 with a letter and then can be followed by zero or more letters |
85 with a letter and then can be followed by zero or more letters |
80 or numbers and also can include underscores, but not as the |
86 or numbers and also can include underscores, but not as the |
81 first character. Such identifiers can be recognised with the |
87 first character. Such identifiers can be recognised with the |
214 seconds, while the second version will even be able to process |
222 seconds, while the second version will even be able to process |
215 up to 12,000 in less than 10(!) seconds, see the graph below: |
223 up to 12,000 in less than 10(!) seconds, see the graph below: |
216 |
224 |
217 \begin{center} |
225 \begin{center} |
218 \begin{tikzpicture} |
226 \begin{tikzpicture} |
219 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs}, |
227 \begin{axis}[ |
|
228 xlabel={strings of \pcode{a}s}, |
|
229 ylabel={time in secs}, |
220 enlargelimits=false, |
230 enlargelimits=false, |
221 xtick={0,3000,...,12000}, |
231 xtick={0,3000,...,12000}, |
222 xmax=12500, |
232 xmax=12500, |
223 ymax=35, |
233 ymax=35, |
224 ytick={0,5,...,30}, |
234 ytick={0,5,...,30}, |
232 \end{tikzpicture} |
242 \end{tikzpicture} |
233 \end{center} |
243 \end{center} |
234 |
244 |
235 \subsection*{Basic Regular Expressions} |
245 \subsection*{Basic Regular Expressions} |
236 |
246 |
237 The regular expressions shown above for Scala, we |
247 The regular expressions shown earlier for Scala, we |
238 will call \emph{extended regular expressions}. The ones we |
248 will call \emph{extended regular expressions}. The ones we |
239 will mainly study in this module are \emph{basic regular |
249 will mainly study in this module are \emph{basic regular |
240 expressions}, which by convention we will just call |
250 expressions}, which by convention we will just call |
241 \emph{regular expressions}, if it is clear what we mean. The |
251 \emph{regular expressions}, if it is clear what we mean. The |
242 attraction of (basic) regular expressions is that many |
252 attraction of (basic) regular expressions is that many |
244 (Basic) regular expressions are defined by the following |
254 (Basic) regular expressions are defined by the following |
245 grammar: |
255 grammar: |
246 |
256 |
247 \begin{center} |
257 \begin{center} |
248 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
258 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
249 $r$ & $::=$ & $\varnothing$ & null\\ |
259 $r$ & $::=$ & $\ZERO$ & null language\\ |
250 & $\mid$ & $\epsilon$ & empty string / \texttt{""} / []\\ |
260 & $\mid$ & $\ONE$ & empty string / \texttt{""} / []\\ |
251 & $\mid$ & $c$ & single character\\ |
261 & $\mid$ & $c$ & single character\\ |
252 & $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
262 & $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
253 & $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
263 & $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
254 & $\mid$ & $r^*$ & star (zero or more)\\ |
264 & $\mid$ & $r^*$ & star (zero or more)\\ |
255 \end{tabular} |
265 \end{tabular} |
256 \end{center} |
266 \end{center} |
257 |
267 |
258 \noindent Because we overload our notation, there are some |
268 \noindent Because we overload our notation, there are some |
259 subtleties you should be aware of. When regular expressions |
269 subtleties you should be aware of. When regular expressions |
260 are referred to then $\varnothing$ does not stand for the |
270 are referred to then $\ZERO$ (in bold font) does not stand for |
261 empty set: rather it is a particular pattern that does not |
271 the number zero: rather it is a particular pattern that does |
262 match any string. Similarly, in the context of regular |
272 not match any string. Similarly, in the context of regular |
263 expressions, $\epsilon$ does not stand for the empty string |
273 expressions, $\ONE$ does not stand for the number one but for |
264 (as in many places in the literature) but for a regular |
274 a regular expression that matches the empty string. The letter |
265 expression that matches the empty string. The letter $c$ |
275 $c$ stands for any character from the alphabet at hand. Again |
266 stands for any character from the alphabet at hand. Again in |
276 in the context of regular expressions, it is a particular |
267 the context of regular expressions, it is a particular pattern |
277 pattern that can match the specified character. You should |
268 that can match the specified character. You should also be |
278 also be careful with our overloading of the star: assuming you |
269 careful with our overloading of the star: assuming you have |
279 have read the handout about our basic mathematical notation, |
270 read the handout about our basic mathematical notation, you |
280 you will see that in the context of languages (sets of |
271 will see that in the context of languages (sets of strings) |
281 strings) the star stands for an operation on languages. Here |
272 the star stands for an operation on languages. Here $r^*$ |
282 $r^*$ stands for a regular expression, which is different from |
273 stands for a regular expression, which is different from the |
283 the operation on sets is defined as |
274 operation on sets is defined as |
284 |
275 |
285 \[ |
276 \[ |
286 A\star\dn \bigcup_{0\le n} A^n |
277 A^* \dn \bigcup_{0\le n} A^n |
|
278 \] |
287 \] |
279 |
288 |
280 \noindent |
289 \noindent |
281 Note that this expands to |
290 Note that this expands to |
282 |
291 |
283 \[ |
292 \[ |
284 A^* \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots |
293 A\star \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots |
285 \] |
294 \] |
286 |
295 |
287 \noindent which is equivalent to |
296 \noindent which is equivalent to |
288 |
297 |
289 \[ |
298 \[ |
290 A^* \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots |
299 A\star \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots |
291 \] |
300 \] |
292 |
301 |
293 \noindent |
302 \noindent |
294 Remember that $A^0$ is always the set containing the empty |
303 Remember that $A^0$ is always the set containing the empty |
295 string. |
304 string. |
307 strings. We should also write $(r_1 + r_2) + r_3$, which is |
316 strings. We should also write $(r_1 + r_2) + r_3$, which is |
308 different from the regular expression $r_1 + (r_2 + r_3)$, but |
317 different from the regular expression $r_1 + (r_2 + r_3)$, but |
309 in case of $+$ and $\cdot$ we actually do not care about the |
318 in case of $+$ and $\cdot$ we actually do not care about the |
310 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 |
319 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 |
311 \cdot r_3$, respectively. The reasons for this will become |
320 \cdot r_3$, respectively. The reasons for this will become |
312 clear shortly. In the literature you will often find that the |
321 clear shortly. |
313 choice $r_1 + r_2$ is written as $r_1\mid{}r_2$ or |
322 |
314 $r_1\mid\mid{}r_2$. Also following the convention in the |
323 In the literature you will often find that the choice $r_1 + |
|
324 r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also, |
|
325 often our $\ZERO$ and $\ONE$ are written $\varnothing$ and |
|
326 $\epsilon$, respectively. Following the convention in the |
315 literature, we will often omit the $\cdot$ all together. This |
327 literature, we will often omit the $\cdot$ all together. This |
316 is to make some concrete regular expressions more readable. |
328 is to make some concrete regular expressions more readable. |
317 For example the regular expression for email addresses shown |
329 For example the regular expression for email addresses shown |
318 in \eqref{email} would look like |
330 in \eqref{email} would look like |
319 |
331 |
340 classes relate as follows\footnote{More about Scala is |
352 classes relate as follows\footnote{More about Scala is |
341 in the handout about A Crash-Course on Scala.} |
353 in the handout about A Crash-Course on Scala.} |
342 |
354 |
343 \begin{center} |
355 \begin{center} |
344 \begin{tabular}{rcl} |
356 \begin{tabular}{rcl} |
345 $\varnothing$ & $\mapsto$ & \texttt{NULL}\\ |
357 $\ZERO$ & $\mapsto$ & \texttt{ZERO}\\ |
346 $\epsilon$ & $\mapsto$ & \texttt{EMPTY}\\ |
358 $\ONE$ & $\mapsto$ & \texttt{ONE}\\ |
347 $c$ & $\mapsto$ & \texttt{CHAR(c)}\\ |
359 $c$ & $\mapsto$ & \texttt{CHAR(c)}\\ |
348 $r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\ |
360 $r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\ |
349 $r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\ |
361 $r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\ |
350 $r^*$ & $\mapsto$ & \texttt{STAR(r)} |
362 $r^*$ & $\mapsto$ & \texttt{STAR(r)} |
351 \end{tabular} |
363 \end{tabular} |
386 The \defn{meaning of a regular expression} can be defined |
398 The \defn{meaning of a regular expression} can be defined |
387 by a recursive function called $L$ (for language), which |
399 by a recursive function called $L$ (for language), which |
388 is defined as follows |
400 is defined as follows |
389 |
401 |
390 \begin{center} |
402 \begin{center} |
391 \begin{tabular}{rcl} |
403 \begin{tabular}{rcll} |
392 $L(\varnothing)$ & $\dn$ & $\{\}$\\ |
404 $L(\ZERO)$ & $\dn$ & $\{\}$\\ |
393 $L(\epsilon)$ & $\dn$ & $\{[]\}$\\ |
405 $L(\ONE)$ & $\dn$ & $\{[]\}$\\ |
394 $L(c)$ & $\dn$ & $\{[c]\}$\\ |
406 $L(c)$ & $\dn$ & $\{"c"\}$ & or equivalently $\dn \{[c]\}$\\ |
395 $L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\ |
407 $L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\ |
396 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\ |
408 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\ |
397 $L(r^*)$ & $\dn$ & $(L(r))^*$\\ |
409 $L(r^*)$ & $\dn$ & $(L(r))\star$\\ |
398 \end{tabular} |
410 \end{tabular} |
399 \end{center} |
411 \end{center} |
400 |
412 |
401 \noindent As a result we can now precisely state what the |
413 \noindent As a result we can now precisely state what the |
402 meaning, for example, of the regular expression $h \cdot |
414 meaning, for example, of the regular expression $h \cdot |
441 different regular expressions that can recognise these |
453 different regular expressions that can recognise these |
442 strings. This is obvious with the regular expression $a + b$ |
454 strings. This is obvious with the regular expression $a + b$ |
443 which can match the strings $a$ and $b$. But also the regular |
455 which can match the strings $a$ and $b$. But also the regular |
444 expression $b + a$ would match the same strings. However, |
456 expression $b + a$ would match the same strings. However, |
445 sometimes it is not so obvious whether two regular expressions |
457 sometimes it is not so obvious whether two regular expressions |
446 match the same strings: for example do $r^*$ and $\epsilon + r |
458 match the same strings: for example do $r^*$ and $\ONE + r |
447 \cdot r^*$ match the same strings? What about $\varnothing^*$ |
459 \cdot r^*$ match the same strings? What about $\ZERO^*$ |
448 and $\epsilon^*$? This suggests the following relation between |
460 and $\ONE^*$? This suggests the following relation between |
449 \defn{equivalent regular expressions}: |
461 \defn{equivalent regular expressions}: |
450 |
462 |
451 \[ |
463 \[ |
452 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) |
464 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) |
453 \] |
465 \] |
458 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$, |
470 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$, |
459 because they are equivalent. I leave you to the question |
471 because they are equivalent. I leave you to the question |
460 whether |
472 whether |
461 |
473 |
462 \[ |
474 \[ |
463 \varnothing^* \equiv \epsilon^* |
475 \ZERO^* \equiv \ONE^* |
464 \] |
476 \] |
465 |
477 |
466 \noindent holds? Such equivalences will be important for our |
478 \noindent holds or not? Such equivalences will be important |
467 matching algorithm, because we can use them to simplify |
479 for our matching algorithm, because we can use them to |
468 regular expressions. |
480 simplify regular expressions, which will mean we can speed |
|
481 up the calculations. |
469 |
482 |
470 \subsection*{My Fascination for Regular Expressions} |
483 \subsection*{My Fascination for Regular Expressions} |
471 |
484 |
472 Up until a few years ago I was not really interested in |
485 Up until a few years ago I was not really interested in |
473 regular expressions. They have been studied for the last 60 |
486 regular expressions. They have been studied for the last 60 |
474 years (by smarter people than me)---surely nothing new can be |
487 years (by smarter people than me)---surely nothing new can be |
475 found out about them. I even have the vague recollection that |
488 found out about them. I even have the vague recollection that |
476 I did not quite understand them during my study. If I remember |
489 I did not quite understand them during my study. If I remember |
477 correctly,\footnote{That was really a long time ago.} I got |
490 correctly,\footnote{That was really a long time ago.} I got |
478 utterly confused about $\epsilon$ and the empty |
491 utterly confused about $\ONE$ (which my lecturer wrote as |
479 string.\footnote{Obviously the lecturer must have been bad.} |
492 $\epsilon$) and the empty string.\footnote{Obviously the |
480 Since my study, I have used regular expressions for |
493 lecturer must have been bad.} Since my study, I have used |
481 implementing lexers and parsers as I have always been |
494 regular expressions for implementing lexers and parsers as I |
482 interested in all kinds of programming languages and |
495 have always been interested in all kinds of programming |
483 compilers, which invariably need regular expression in some |
496 languages and compilers, which invariably need regular |
484 form or shape. |
497 expression in some form or shape. |
485 |
498 |
486 To understand my fascination nowadays with regular |
499 To understand my fascination \emph{nowadays} with regular |
487 expressions, you need to know that my main scientific interest |
500 expressions, you need to know that my main scientific interest |
488 for the last 14 years has been with theorem provers. I am a |
501 for the last 14 years has been with theorem provers. I am a |
489 core developer of one of |
502 core developer of one of |
490 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
503 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
491 provers are systems in which you can formally reason about |
504 provers are systems in which you can formally reason about |
531 exprssions did not know better. Well, we showed it can also be |
544 exprssions did not know better. Well, we showed it can also be |
532 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} |
545 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} |
533 What a feeling if you are an outsider to the subject! |
546 What a feeling if you are an outsider to the subject! |
534 |
547 |
535 To conclude: Despite my early ignorance about regular |
548 To conclude: Despite my early ignorance about regular |
536 expressions, I find them now quite interesting. They have a |
549 expressions, I find them now very interesting. They have a |
537 beautiful mathematical theory behind them. They have practical |
550 beautiful mathematical theory behind them, which can be |
538 importance (remember the shocking runtime of the regular |
551 sometimes quite deep and contain hidden snares. They have |
539 expression matchers in Python and Ruby in some instances). |
552 practical importance (remember the shocking runtime of the |
540 People who are not very familiar with the mathematical |
553 regular expression matchers in Python and Ruby in some |
541 background of regular expressions get them consistently wrong. |
554 instances). People who are not very familiar with the |
542 The hope is that we can do better in the future---for example |
555 mathematical background of regular expressions get them |
543 by proving that the algorithms actually satisfy their |
556 consistently wrong. The hope is that we can do better in the |
544 specification and that the corresponding implementations do |
557 future---for example by proving that the algorithms actually |
545 not contain any bugs. We are close, but not yet quite there. |
558 satisfy their specification and that the corresponding |
|
559 implementations do not contain any bugs. We are close, but not |
|
560 yet quite there. |
546 |
561 |
547 Notwithstanding my fascination, I am also happy to admit that regular |
562 Notwithstanding my fascination, I am also happy to admit that regular |
548 expressions have their shortcomings. There are some well-known |
563 expressions have their shortcomings. There are some well-known |
549 ``theoretical'' shortcomings, for example recognising strings |
564 ``theoretical'' shortcomings, for example recognising strings |
550 of the form $a^{n}b^{n}$. I am not so bothered by them. What I |
565 of the form $a^{n}b^{n}$. I am not so bothered by them. What I |
569 that is claimed to be closer to the standard is shown in |
584 that is claimed to be closer to the standard is shown in |
570 Figure~\ref{monster}. Whether this claim is true or not, I |
585 Figure~\ref{monster}. Whether this claim is true or not, I |
571 would not know---the only thing I can say to this regular |
586 would not know---the only thing I can say to this regular |
572 expression is it is a monstrosity. However, this might |
587 expression is it is a monstrosity. However, this might |
573 actually be an argument against the RFC standard, rather than |
588 actually be an argument against the RFC standard, rather than |
574 against regular expressions. Still it is good to know that |
589 against regular expressions. A similar argument is made in |
575 some tasks in text processing just cannot be achieved by using |
590 |
576 regular expressions. |
591 \begin{center} |
|
592 \url{https://elliot.land/validating-an-email-address} |
|
593 \end{center} |
|
594 |
|
595 |
|
596 \noindent which explains some of the crazier parts of email |
|
597 addresses. Still it is good to know that some tasks in text |
|
598 processing just cannot be achieved by using regular |
|
599 expressions. |
577 |
600 |
578 |
601 |
579 \begin{figure}[p] |
602 \begin{figure}[p] |
580 \lstinputlisting{../progs/crawler1.scala} |
603 \lstinputlisting{../progs/crawler1.scala} |
|
604 |
581 \caption{The Scala code for a simple web-crawler that checks |
605 \caption{The Scala code for a simple web-crawler that checks |
582 for broken links in a web-page. It uses the regular expression |
606 for broken links in a web-page. It uses the regular expression |
583 \texttt{http\_pattern} in Line~15 for recognising URL-addresses. |
607 \texttt{http\_pattern} in Line~\ref{httpline} for recognising |
584 It finds all links using the library function |
608 URL-addresses. It finds all links using the library function |
585 \texttt{findAllIn} in Line~21.\label{crawler1}} |
609 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}} |
|
610 |
586 \end{figure} |
611 \end{figure} |
587 |
612 |
588 |
613 |
589 \begin{figure}[p] |
614 \begin{figure}[p] |
590 \lstinputlisting{../progs/crawler2.scala} |
615 \lstinputlisting{../progs/crawler2.scala} |
591 |
616 |
592 \caption{A version of the web-crawler that only follows links |
617 \caption{A version of the web-crawler that only follows links |
593 in ``my'' domain---since these are the ones I am interested in |
618 in ``my'' domain---since these are the ones I am interested in |
594 to fix. It uses the regular expression \texttt{my\_urls} in |
619 to fix. It uses the regular expression \texttt{my\_urls} in |
595 Line~16 to check for my name in the links. The main change is |
620 Line~\ref{myurlline} to check for my name in the links. The |
596 in Lines~24--28 where there is a test whether URL is in ``my'' |
621 main change is in |
597 domain or not.\label{crawler2}} |
622 Lines~\ref{changestartline}--\ref{changeendline} where there |
|
623 is a test whether URL is in ``my'' domain or |
|
624 not.\label{crawler2}} |
598 |
625 |
599 \end{figure} |
626 \end{figure} |
600 |
627 |
601 \begin{figure}[p] |
628 \begin{figure}[p] |
602 \lstinputlisting{../progs/crawler3.scala} |
629 \lstinputlisting{../progs/crawler3.scala} |
603 |
630 |
604 \caption{A small email harvester---whenever we download a |
631 \caption{A small email harvester---whenever we download a |
605 web-page, we also check whether it contains any email |
632 web-page, we also check whether it contains any email |
606 addresses. For this we use the regular expression |
633 addresses. For this we use the regular expression |
607 \texttt{email\_pattern} in Line~15. The main change is in Line |
634 \texttt{email\_pattern} in Line~\ref{emailline}. The main |
608 30 where all email addresses that can be found in a page are |
635 change is in Line~\ref{mainline} where all email addresses |
609 printed.\label{crawler3}} |
636 that can be found in a page are printed.\label{crawler3}} |
|
637 |
610 \end{figure} |
638 \end{figure} |
611 |
639 |
612 \begin{figure}[p] |
640 \begin{figure}[p] |
613 \tiny |
641 \tiny |
614 \begin{center} |
642 \begin{center} |
615 \begin{minipage}{0.8\textwidth} |
643 \begin{minipage}{0.8\textwidth} |
616 \lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp} |
644 \lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp} |
617 \end{minipage} |
645 \end{minipage} |
618 \end{center} |
646 \end{center} |
619 |
647 |
620 \caption{Nothing that can be said\ldots\label{monster}} |
648 \caption{Nothing that can be said this\ldots\label{monster}} |
621 \end{figure} |
649 \end{figure} |
622 |
650 |
623 |
651 |
624 \end{document} |
652 \end{document} |
625 |
653 |