zero or more letters, numbers and underscores. We also often face the
problem that we are given a string (for example some user input) and
want to know whether it matches a particular pattern, such as that of
an email address. In this way we can exclude user input that would
otherwise have nasty effects on our program (crashing it or making it
go into an infinite loop, if not worse). Scanning for computer viruses
or filtering out spam usually involves searching for some signature
(essentially a pattern). The point is that the fast
Knuth-Morris-Pratt algorithm for strings is not good enough for such
string \emph{patterns}.\smallskip

\defn{Regular expressions} help with conveniently specifying
such patterns. The idea behind regular expressions is that
they are a simple method for describing languages (or sets of
strings)\ldots at least languages we are interested in in

\noindent which explains some of the crazier parts of email
addresses. Still, it is good to know that some tasks in text
processing just cannot be achieved by using regular
expressions. But for what we want to use them for (lexing), they are
pretty good.\medskip

\noindent
Finally there is a joke about regular expressions:

\begin{quote}\it
``Sometimes you have a programming problem and it seems like the
best solution is to use regular expressions; now you have two
problems.''
\end{quote}


\begin{figure}[p]
\lstinputlisting{../progs/crawler1.scala}
