author | Christian Urban <christian dot urban at kcl dot ac dot uk> |
Sun, 12 Oct 2014 21:54:04 +0100 | |
changeset 273 | b56d5e4468c0 |
parent 268 | 18bef085a7ca |
child 291 | 201c2c6d8696 |
permissions | -rw-r--r-- |
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
1 |
\documentclass{article} |
237
370c0647a9bf
more material
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
217
diff
changeset
|
2 |
\usepackage{../style} |
217
cd6066f1056a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
3 |
\usepackage{../langs} |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
4 |
\usepackage{../graphics} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
5 |
\usepackage{../data} |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
6 |
|
112
95ee5cc5c05d
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
111
diff
changeset
|
7 |
|
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
8 |
\begin{document} |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
9 |
|
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
10 |
\section*{Handout 1} |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
11 |
|
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
12 |
This module is about text processing, be it for web-crawlers, |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
13 |
compilers, dictionaries, DNA-data and so on. When looking for |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
14 |
a particular string in a large text we can use the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
15 |
Knuth-Morris-Pratt algorithm, which is currently the most |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
16 |
efficient general string search algorithm. But often we do |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
17 |
\emph{not} just look for a particular string, but for string |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
18 |
patterns. For example in programming code we need to identify |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
19 |
what are the keywords, what are the identifiers etc. A pattern |
268
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
258
diff
changeset
|
20 |
for identifiers could be stated as: they start with a letter, |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
21 |
followed by zero or more letters, numbers and the underscore. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
22 |
Also often we face the problem that we are given a string (for |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
23 |
example some user input) and want to know whether it matches a |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
24 |
particular pattern. In this way we can exclude user input that |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
25 |
would otherwise have nasty effects on our program (crashing it |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
26 |
or going into an infinite loop, if not worse). \defn{Regular |
268
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
258
diff
changeset
|
27 |
expressions} help with conveniently specifying such patterns. |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
28 |
The idea behind regular expressions is that they are a simple |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
29 |
method for describing languages (or sets of strings)\ldots at |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
30 |
least languages we are interested in in computer science. For |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
31 |
example there is no convenient regular expression for |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
32 |
describing the English language short of enumerating all |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
33 |
English words. But they seem useful for describing for example |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
34 |
email addresses.\footnote{See ``8 Regular Expressions You |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
35 |
Should Know'' \url{http://goo.gl/5LoVX7}} Consider the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
36 |
following regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
37 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
38 |
\begin{equation}\label{email} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
39 |
\texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
40 |
\end{equation} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
41 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
42 |
\noindent where the first part matches one or more lowercase |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
43 |
letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
44 |
or hyphens. The \pcode{+} ensures the ``one or more''. Then |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
45 |
comes the \pcode{@}-sign, followed by the domain name which |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
46 |
must be one or more lowercase letters, digits, underscores, |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
47 |
dots or hyphens. Note there cannot be an underscore in the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
48 |
domain name. Finally there must be a dot followed by the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
49 |
toplevel domain. This toplevel domain must be 2 to 6 lowercase |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
50 |
letters including the dot. Example strings which follow this |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
51 |
pattern are: |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
52 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
53 |
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
54 |
niceandsimple@example.org |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
55 |
very.common@example.co.uk |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
56 |
a.little.lengthy.but.fine@dept.example.ac.uk |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
57 |
other.email-with-dash@example.edu |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
58 |
\end{lstlisting} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
59 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
60 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
61 |
\noindent |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
62 |
But for example the following two do not: |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
63 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
64 |
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
65 |
user@localserver |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
66 |
disposable.style.email.with+symbol@example.com |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
67 |
\end{lstlisting} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
68 |
|
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
69 |
Identifiers, or variables, in program text are often required |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
70 |
to satisfy the constraints that they start with a letter and |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
71 |
then can be followed by zero or more letters or numbers and |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
72 |
also can include underscores, but not as the first character. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
73 |
Such identifiers can be recognised with the regular expression |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
74 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
75 |
\begin{center} |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
76 |
\pcode{[a-zA-Z] [a-zA-Z0-9_]*} |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
77 |
\end{center} |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
78 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
79 |
\noindent Possible identifiers that match this regular expression |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
80 |
are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name}, |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
81 |
but not \pcode{_i} and also not \pcode{4you}. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
82 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
83 |
Many programming language offer libraries that can be used to |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
84 |
validate such strings against regular expressions. Also there |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
85 |
are some common, and I am sure very familiar, ways of how to |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
86 |
construct regular expressions. For example in Scala we have: |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
87 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
88 |
\begin{center} |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
89 |
\begin{tabular}{lp{9cm}} |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
90 |
\pcode{re*} & matches 0 or more occurrences of preceding |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
91 |
expression\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
92 |
\pcode{re+} & matches 1 or more occurrences of preceding |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
93 |
expression\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
94 |
\pcode{re?} & matches 0 or 1 occurrence of preceding |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
95 |
expression\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
96 |
\pcode{re\{n\}} & matches exactly \pcode{n} number of |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
97 |
occurrences of preceding expression\\ |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
98 |
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
99 |
occurences of the preceding expression\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
100 |
\pcode{[...]} & matches any single character inside the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
101 |
brackets\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
102 |
\pcode{[^...]} & matches any single character not inside the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
103 |
brackets\\ |
250
b79e704acb72
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
248
diff
changeset
|
104 |
\pcode{...-...} & character ranges\\ |
b79e704acb72
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
248
diff
changeset
|
105 |
\pcode{\\d} & matches digits; equivalent to \pcode{[0-9]}\\ |
b79e704acb72
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
248
diff
changeset
|
106 |
\pcode{.} & matches every character except newline\\ |
b79e704acb72
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
248
diff
changeset
|
107 |
\pcode{(re)} & groups regular expressions and remembers |
b79e704acb72
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
248
diff
changeset
|
108 |
matched text |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
109 |
\end{tabular} |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
110 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
111 |
|
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
112 |
\noindent With this table you can figure out the purpose of |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
113 |
the regular expressions in the web-crawlers shown Figures |
250
b79e704acb72
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
248
diff
changeset
|
114 |
\ref{crawler1}, \ref{crawler2} and |
b79e704acb72
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
248
diff
changeset
|
115 |
\ref{crawler3}.\footnote{There is an interesting twist in the |
b79e704acb72
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
248
diff
changeset
|
116 |
web-scraber where \pcode{re*?} is used instead of \pcode{re*}.} Note, |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
117 |
however, the regular expression for http-addresses in |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
118 |
web-pages is meant to be |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
119 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
120 |
\[ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
121 |
\pcode{"https?://[^"]*"} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
122 |
\] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
123 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
124 |
\noindent It specifies that web-addresses need to start with a |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
125 |
double quote, then comes \texttt{http} followed by an optional |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
126 |
\texttt{s} and so on. Usually we would have to escape the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
127 |
double quotes in order to make sure we interpret the double |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
128 |
quote as character, not as double quote for a string. But |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
129 |
Scala's trick with triple quotes allows us to omit this kind |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
130 |
of escaping. As a result we can just write: |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
131 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
132 |
\[ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
133 |
\pcode{""""https?://[^"]*"""".r} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
134 |
\] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
135 |
|
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
136 |
\noindent Note also that the convention in Scala is that |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
137 |
\texttt{.r} converts a string into a regular expression. I |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
138 |
leave it to you to ponder whether this regular expression |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
139 |
really captures all possible web-addresses. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
140 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
141 |
\subsection*{Why Study Regular Expressions?} |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
142 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
143 |
Regular expressions were introduced by Kleene in the 1950ies |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
144 |
and they have been object of intense study since then. They |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
145 |
are nowadays pretty much ubiquitous in computer science. I am |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
146 |
sure you have come across them before. Why on earth then is |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
147 |
there any interest in studying them again in depth in this |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
148 |
module? Well, one answer is in the following graph about |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
149 |
regular expression matching in Python and in Ruby. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
150 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
151 |
\begin{center} |
252
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
152 |
\begin{tikzpicture}[y=.09cm, x=.15cm] |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
153 |
%axis |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
154 |
\draw (0,0) -- coordinate (x axis mid) (30,0); |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
155 |
\draw (0,0) -- coordinate (y axis mid) (0,30); |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
156 |
%ticks |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
157 |
\foreach \x in {0,5,...,30} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
158 |
\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
159 |
\foreach \y in {0,5,...,30} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
160 |
\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
161 |
%labels |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
162 |
\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; |
258
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
252
diff
changeset
|
163 |
\node[rotate=90,left=0.9cm] at (y axis mid) {time in secs}; |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
164 |
%plots |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
165 |
\draw[color=blue] plot[mark=*] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
166 |
file {re-python.data}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
167 |
\draw[color=brown] plot[mark=triangle*] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
168 |
file {re-ruby.data}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
169 |
%legend |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
170 |
\begin{scope}[shift={(4,20)}] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
171 |
\draw[color=blue] (0,0) -- |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
172 |
plot[mark=*] (0.25,0) -- (0.5,0) |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
173 |
node[right]{\small Python}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
174 |
\draw[yshift=-\baselineskip, color=brown] (0,0) -- |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
175 |
plot[mark=triangle*] (0.25,0) -- (0.5,0) |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
176 |
node[right]{\small Ruby}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
177 |
\end{scope} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
178 |
\end{tikzpicture} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
179 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
180 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
181 |
\noindent This graph shows that Python needs approximately 29 |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
182 |
seconds for finding out whether a string of 28 \texttt{a}s |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
183 |
matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
184 |
Ruby is even slightly worse.\footnote{In this example Ruby |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
185 |
uses the slightly different regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
186 |
\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
252
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
187 |
\texttt{a} each occur $n$ times. More test results can be |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
188 |
found at \url{http://www.computerbytesman.com/redos/}.} |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
189 |
Admittedly, this regular expression is carefully chosen to |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
190 |
exhibit this exponential behaviour, but similar ones occur |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
191 |
more often than one wants in ``real life''. They are sometimes |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
192 |
called \emph{evil regular expressions} because they have the |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
193 |
potential to make regular expression matching engines to |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
194 |
topple over, like in Python and Ruby. The problem with evil |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
195 |
regular expressions is that they can have some serious |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
196 |
consequences, for example, if you use them in your |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
197 |
web-application. The reason is that hackers can look for these |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
198 |
instances where the matching engine behaves badly and then |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
199 |
mount a nice DoS-attack against your application. These |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
200 |
attacks are already have their own name: |
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
201 |
\emph{Regular Expression Denial of Servive Attack (ReDoS)}. |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
202 |
|
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
203 |
It will be instructive to look behind the ``scenes'' to find |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
204 |
out why Python and Ruby (and others) behave so badly when |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
205 |
matching with evil regular expressions. But we will also look |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
206 |
at a relatively simple algorithm that solves this problem much |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
207 |
better than Python and Ruby do\ldots actually it will be two |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
208 |
versions of the algorithm: the first one will be able to |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
209 |
process strings of approximately 1,000 \texttt{a}s in 30 |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
210 |
seconds, while the second version will even be able to process |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
211 |
up to 12,000 in less than 10(!) seconds, see the graph below: |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
212 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
213 |
\begin{center} |
252
e8ef8f38ca84
added style files
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
250
diff
changeset
|
214 |
\begin{tikzpicture}[y=.09cm, x=.0006cm] |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
215 |
%axis |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
216 |
\draw (0,0) -- coordinate (x axis mid) (12000,0); |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
217 |
\draw (0,0) -- coordinate (y axis mid) (0,30); |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
218 |
%ticks |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
219 |
\foreach \x in {0,2000,...,12000} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
220 |
\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
221 |
\foreach \y in {0,5,...,30} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
222 |
\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
223 |
%labels |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
224 |
\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; |
258
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
252
diff
changeset
|
225 |
\node[rotate=90,left=0.9cm] at (y axis mid) {time in secs}; |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
226 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
227 |
%plots |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
228 |
\draw[color=green] plot[mark=square*, mark options={fill=white} ] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
229 |
file {re2b.data}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
230 |
\draw[color=black] plot[mark=square*, mark options={fill=white} ] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
231 |
file {re3.data}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
232 |
\end{tikzpicture} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
233 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
234 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
235 |
\subsection*{Basic Regular Expressions} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
236 |
|
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
237 |
The regular expressions shown above, for example for Scala, we |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
238 |
will call \emph{extended regular expressions}. The ones we |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
239 |
will mainly study in this module are \emph{basic regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
240 |
expressions}, which by convention we will just call |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
241 |
\emph{regular expressions}, if it is clear what we mean. The |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
242 |
attraction of (basic) regular expressions is that many |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
243 |
features of the extended ones are just syntactic sugar. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
244 |
(Basic) regular expressions are defined by the following |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
245 |
grammar: |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
246 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
247 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
248 |
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
249 |
$r$ & $::=$ & $\varnothing$ & null\\ |
258
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
252
diff
changeset
|
250 |
& $\mid$ & $\epsilon$ & empty string / \texttt{""} / []\\ |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
251 |
& $\mid$ & $c$ & single character\\ |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
252 |
& $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
253 |
& $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
254 |
& $\mid$ & $r^*$ & star (zero or more)\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
255 |
\end{tabular} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
256 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
257 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
258 |
\noindent Because we overload our notation, there are some |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
259 |
subtleties you should be aware of. When regular expressions |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
260 |
are referred to then $\varnothing$ does not stand for the |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
261 |
empty set: rather it is a particular pattern that does not |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
262 |
match any string. Similarly, in the context of regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
263 |
expressions, $\epsilon$ does not stand for the empty string |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
264 |
(as in many places in the literature) but for a regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
265 |
expression that matches the empty string. The letter $c$ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
266 |
stands for any character from the alphabet at hand. Again in |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
267 |
the context of regular expressions, it is a particular pattern |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
268 |
that can match the specified character. You should also be |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
269 |
careful with our overloading of the star: assuming you have |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
270 |
read the handout about our basic mathematical notation, you |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
271 |
will see that in the context of languages (sets of strings) |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
272 |
the star stands for an operation on languages. Here $r^*$ |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
273 |
stands for a regular expression, which is different from the |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
274 |
operation on sets is defined as |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
275 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
276 |
\[ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
277 |
A^* \dn \bigcup_{0\le n} A^n |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
278 |
\] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
279 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
280 |
We will use parentheses to disambiguate regular expressions. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
281 |
Parentheses are not really part of a regular expression, and |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
282 |
indeed we do not need them in our code because there the tree |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
283 |
structure of regular expressions is always clear. But for |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
284 |
writing them down in a more mathematical fashion, parentheses |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
285 |
will be helpful. For example we will write $(r_1 + r_2)^*$, |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
286 |
which is different from, say $r_1 + (r_2)^*$. The former means |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
287 |
roughly zero or more times $r_1$ or $r_2$, while the latter |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
288 |
means $r_1$ or zero or more times $r_2$. This will turn out to |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
289 |
be two different patterns, which match in general different |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
290 |
strings. We should also write $(r_1 + r_2) + r_3$, which is |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
291 |
different from the regular expression $r_1 + (r_2 + r_3)$, but |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
292 |
in case of $+$ and $\cdot$ we actually do not care about the |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
293 |
order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
294 |
\cdot r_3$, respectively. The reasons for this will become |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
295 |
clear shortly. In the literature you will often find that the |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
296 |
choice $r_1 + r_2$ is written as $r_1\mid{}r_2$ or |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
297 |
$r_1\mid\mid{}r_2$. Also following the convention in the |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
298 |
literature, we will often omit the $\cdot$ all together. This |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
299 |
is to make some concrete regular expressions more readable. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
300 |
For example the regular expression for email addresses shown |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
301 |
in \eqref{email} would look like |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
302 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
303 |
\[ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
304 |
\texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
305 |
\texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
306 |
\texttt{[...]\{2,6\}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
307 |
\] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
308 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
309 |
\noindent |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
310 |
which is much less readable than \eqref{email}. Similarly for |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
311 |
the regular expression that matches the string $hello$ we |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
312 |
should write |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
313 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
314 |
\[ |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
315 |
h \cdot e \cdot l \cdot l \cdot o |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
316 |
\] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
317 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
318 |
\noindent |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
319 |
but often just write {\it hello}. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
320 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
321 |
If you prefer to think in terms of the implementation |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
322 |
of regular expressions in Scala, the constructors and |
245
a5fade10c207
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
244
diff
changeset
|
323 |
classes relate as follows\footnote{More about Scala is |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
324 |
in the handout about A Crash-Course on Scala.} |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
325 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
326 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
327 |
\begin{tabular}{rcl} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
328 |
$\varnothing$ & $\mapsto$ & \texttt{NULL}\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
329 |
$\epsilon$ & $\mapsto$ & \texttt{EMPTY}\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
330 |
$c$ & $\mapsto$ & \texttt{CHAR(c)}\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
331 |
$r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
332 |
$r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
333 |
$r^*$ & $\mapsto$ & \texttt{STAR(r)} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
334 |
\end{tabular} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
335 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
336 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
337 |
A source of confusion might arise from the fact that we |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
338 |
use the term \emph{basic regular expression} for the regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
339 |
expressions used in ``theory'' and defined above, and |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
340 |
\emph{extended regular expression} for the ones used in |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
341 |
``practice'', for example in Scala. If runtime is not an |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
342 |
issue, then the latter can be seen as syntactic sugar of |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
343 |
the former. For example we could replace |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
344 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
345 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
346 |
\begin{tabular}{rcl} |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
347 |
$r+$ & $\mapsto$ & $r\cdot r^*$\\ |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
348 |
$r?$ & $\mapsto$ & $\epsilon + r$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
349 |
$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
350 |
$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
351 |
\end{tabular} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
352 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
353 |
|
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
354 |
|
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
355 |
\subsection*{The Meaning of Regular Expressions} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
356 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
357 |
So far we have only considered informally what the |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
358 |
\emph{meaning} of a regular expression is. This is not good |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
359 |
enough for specifications of what algorithms are supposed to |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
360 |
do or which problems they are supposed to solve. |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
361 |
|
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
362 |
To define the meaning of a regular expression we will |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
363 |
associate with every regular expression a language, or set of |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
364 |
strings. This language contains all the strings the regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
365 |
expression is supposed to match. To understand what is going |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
366 |
on here it is crucial that you have read the handout |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
367 |
about basic mathematical notations. |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
368 |
|
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
369 |
The \defn{meaning of a regular expression} can be defined |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
370 |
by a recursive function called $L$ (for language), which |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
371 |
is defined as follows |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
372 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
373 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
374 |
\begin{tabular}{rcl} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
375 |
$L(\varnothing)$ & $\dn$ & $\varnothing$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
376 |
$L(\epsilon)$ & $\dn$ & $\{[]\}$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
377 |
$L(c)$ & $\dn$ & $\{"c"\}$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
378 |
$L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\ |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
379 |
$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\ |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
380 |
$L(r^*)$ & $\dn$ & $(L(r))^*$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
381 |
\end{tabular} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
382 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
383 |
|
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
384 |
\noindent As a result we can now precisely state what the |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
385 |
meaning, for example, of the regular expression $h \cdot |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
386 |
e \cdot l \cdot l \cdot o$ is, namely |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
387 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
388 |
\[ |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
389 |
L(h \cdot e \cdot l \cdot l \cdot o) = \{"hello"\} |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
390 |
\] |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
391 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
392 |
\noindent This is expected because this regular expression |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
393 |
is only supposed to match the ``$hello$''-string. Similarly if |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
394 |
we have the choice-regular-expression $a + b$, its meaning is |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
395 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
396 |
\[ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
397 |
L(a + b) = \{"a", "b"\} |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
398 |
\] |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
399 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
400 |
\noindent You can now also see why we do not make a difference |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
401 |
between the different regular expressions $(r_1 + r_2) + r_3$ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
402 |
and $r_1 + (r_2 + r_3)$\ldots they are not the same regular |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
403 |
expression, but they have the same meaning. For example |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
404 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
405 |
\begin{eqnarray*} |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
406 |
L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
407 |
& = & L(r_1) \cup L(r_2) \cup L(r_3)\\ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
408 |
& = & L(r_1) \cup L(r_2 + r_3)\\ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
409 |
& = & L(r_1 + (r_2 + r_3)) |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
410 |
\end{eqnarray*} |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
411 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
412 |
The point of the definition of $L$ is that we can use it to |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
413 |
precisely specify when a string $s$ is matched by a regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
414 |
expression $r$, namely if and only if $s \in L(r)$. In fact we |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
415 |
will write a program \pcode{match} that takes any string $s$ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
416 |
and any regular expression $r$ as argument and returns |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
417 |
\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
418 |
L(r)$. We leave this for the next lecture. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
419 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
420 |
There is one more feature of regular expressions that is worth |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
421 |
mentioning. Given some strings, there are in general many |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
422 |
different regular expressions that can recognise these |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
423 |
strings. This is obvious with the regular expression $a + b$ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
424 |
which can match the strings $a$ and $b$. But also the regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
425 |
expression $b + a$ would match the same strings. However, |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
426 |
sometimes it is not so obvious whether two regular expressions |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
427 |
match the same strings: for example do $r^*$ and $\epsilon + r |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
428 |
\cdot r^*$ match the same strings? What about $\varnothing^*$ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
429 |
and $\epsilon^*$? This suggests the following relation between |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
430 |
\defn{equivalent regular expressions}: |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
431 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
432 |
\[ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
433 |
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
434 |
\] |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
435 |
|
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
436 |
\noindent That means two regular expressions are said to be |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
437 |
equivalent if they match the same set of strings. Therefore we |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
438 |
do not really distinguish between the different regular |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
439 |
expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$, |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
440 |
because they are equivalent. I leave you to the question |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
441 |
whether |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
442 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
443 |
\[ |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
444 |
\varnothing^* \equiv \epsilon^* |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
445 |
\] |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
446 |
|
258
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
252
diff
changeset
|
447 |
\noindent holds? Such equivalences will be important for our |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
448 |
matching algorithm, because we can use them to simplify |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
449 |
regular expressions. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
450 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
451 |
\subsection*{My Fascination for Regular Expressions} |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
452 |
|
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
453 |
Up until a few years ago I was not really interested in |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
454 |
regular expressions. They have been studied for the last 60 |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
455 |
years (by smarter people than me)---surely nothing new can be |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
456 |
found out about them. I even have the vague recollection that |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
457 |
I did not quite understand them during my study. If I remember |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
458 |
correctly,\footnote{That was really a long time ago.} I got |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
459 |
utterly confused about $\epsilon$ and the empty |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
460 |
string.\footnote{Obviously the lecturer must have been bad.} |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
461 |
Since my study, I have used regular expressions for |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
462 |
implementing lexers and parsers as I have always been |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
463 |
interested in all kinds of programming languages and |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
464 |
compilers, which invariably need regular expression in some |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
465 |
form or shape. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
466 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
467 |
To understand my fascination nowadays with regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
468 |
expressions, you need to know that my main scientific interest |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
469 |
for the last 14 years has been with theorem provers. I am a |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
470 |
core developer of one of |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
471 |
them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
472 |
provers are systems in which you can formally reason about |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
473 |
mathematical concepts, but also about programs. In this way |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
474 |
they can help with writing bug-free code. Theorem provers have |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
475 |
proved already their value in a number of systems (even in |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
476 |
terms of hard cash), but they are still clunky and difficult |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
477 |
to use by average programmers. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
478 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
479 |
Anyway, in about 2011 I came across the notion of |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
480 |
\defn{derivatives of regular expressions}. This notion allows |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
481 |
one to do almost all calculations in regular language theory |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
482 |
on the level of regular expressions, not needing any automata. |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
483 |
This is important because automata are graphs and it is rather |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
484 |
difficult to reason about graphs in theorem provers. In |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
485 |
contrast, to reason about regular expressions is easy-peasy in |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
486 |
theorem provers. Is this important? I think yes, because |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
487 |
according to Kuklewicz nearly all POSIX-based regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
488 |
expression matchers are |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
489 |
buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
490 |
With my PhD student Fahad Ausaf I am currently working on |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
491 |
proving the correctness for one such matcher that was |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
492 |
proposed by Sulzmann and Lu in |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
493 |
2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
494 |
attractive results since we will be able to prove that the |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
495 |
algorithm is really correct, but also that the machine code(!) |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
496 |
that implements this code efficiently is correct. Writing |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
497 |
programs in this way does not leave any room for potential |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
498 |
errors or bugs. How nice! |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
499 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
500 |
What also helped with my fascination with regular expressions |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
501 |
is that we could indeed find out new things about them that |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
502 |
have surprised some experts in the field of regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
503 |
expressions. Together with two colleagues from China, I was |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
504 |
able to prove the Myhill-Nerode theorem by only using regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
505 |
expressions and the notion of derivatives. Earlier versions of |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
506 |
this theorem used always automata in the proof. Using this |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
507 |
theorem we can show that regular languages are closed under |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
508 |
complementation, something which Gasarch in his |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
509 |
blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
510 |
shown via automata. Even sombody who has written a 700+-page |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
511 |
book\footnote{\url{http://goo.gl/fD0eHx}} on regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
512 |
exprssions did not know better. Well, we showed it can also be |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
513 |
done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
514 |
What a feeling if you are an outsider to the subject! |
243
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
515 |
|
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
516 |
To conclude: Despite my early ignorance about regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
517 |
expressions, I find them now quite interesting. They have a |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
518 |
beautiful mathematical theory behind them. They have practical |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
519 |
importance (remember the shocking runtime of the regular |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
520 |
expression matchers in Python and Ruby in some instances). |
8d5aaf5b0031
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
242
diff
changeset
|
521 |
People who are not very familiar with the mathematical |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
522 |
background of regular expressions get them consistently wrong. |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
523 |
The hope is that we can do better in the future---for example |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
524 |
by proving that the algorithms actually satisfy their |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
525 |
specification and that the corresponding implementations do |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
526 |
not contain any bugs. We are close, but not yet quite there. |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
527 |
|
244
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
528 |
Despite my fascination, I am also happy to admit that regular |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
529 |
expressions have their shortcomings. There are some well-known |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
530 |
``theoretical'' shortcomings, for example recognising strings |
244
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
531 |
of the form $a^{n}b^{n}$. I am not so bothered by them. What I |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
532 |
am bothered about is when regular expressions are in the way |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
533 |
of practical programming. For example, it turns out that the |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
534 |
regular expression for email addresses shown in \eqref{email} |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
535 |
is hopelessly inadequate for recognising all of them (despite |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
536 |
being touted as something every computer scientist should know |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
537 |
about). The W3 Consortium (which standardises the Web) |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
538 |
proposes to use the following, already more complicated |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
539 |
regular expressions for email addresses: |
244
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
540 |
|
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
541 |
{\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
542 |
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
543 |
\end{lstlisting}} |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
544 |
|
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
545 |
\noindent But they admit that by using this regular expression |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
546 |
they wilfully violate the RFC 5322 standard which specifies |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
547 |
the syntax of email addresses. With their proposed regular |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
548 |
expression they are too strict in some cases and too lax in |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
549 |
others. Not a good situation to be in. A regular expression |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
550 |
that is claimed to be closer to the standard is shown in |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
551 |
Figure~\ref{monster}. Whether this claim is true or not, I |
248
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
552 |
would not know---the only thing I can say to this regular |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
553 |
expression is it is a monstrosity. However, this might |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
554 |
actually be an argument against the RFC standard, rather than |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
555 |
against regular expressions. Still it is good to know that |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
556 |
some tasks in text processing just cannot be achieved by using |
ce767ca23244
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
245
diff
changeset
|
557 |
regular expressions. |
244
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
558 |
|
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
559 |
|
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
560 |
\begin{figure}[p] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
561 |
\lstinputlisting{../progs/crawler1.scala} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
562 |
\caption{The Scala code for a simple web-crawler that checks |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
563 |
for broken links in a web-page. It uses the regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
564 |
\texttt{http\_pattern} in Line~15 for recognising URL-addresses. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
565 |
It finds all links using the library function |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
566 |
\texttt{findAllIn} in Line~21.\label{crawler1}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
567 |
\end{figure} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
568 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
569 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
570 |
\begin{figure}[p] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
571 |
\lstinputlisting{../progs/crawler2.scala} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
572 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
573 |
\caption{A version of the web-crawler that only follows links |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
574 |
in ``my'' domain---since these are the ones I am interested in |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
575 |
to fix. It uses the regular expression \texttt{my\_urls} in |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
576 |
Line~16 to check for my name in the links. The main change is |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
577 |
in Lines~26--29 where there is a test whether URL is in ``my'' |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
578 |
domain or not.\label{crawler2}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
579 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
580 |
\end{figure} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
581 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
582 |
\begin{figure}[p] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
583 |
\lstinputlisting{../progs/crawler3.scala} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
584 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
585 |
\caption{A small email harvester---whenever we download a |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
586 |
web-page, we also check whether it contains any email |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
587 |
addresses. For this we use the regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
588 |
\texttt{email\_pattern} in Line~16. The main change is in Line |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
589 |
32 where all email addresses that can be found in a page are |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
590 |
printed.\label{crawler3}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
591 |
\end{figure} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
592 |
|
244
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
593 |
\begin{figure}[p] |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
594 |
\tiny |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
595 |
\begin{center} |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
596 |
\begin{minipage}{0.8\textwidth} |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
597 |
\lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp} |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
598 |
\end{minipage} |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
599 |
\end{center} |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
600 |
|
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
601 |
\caption{Nothing that can be said\ldots\label{monster}} |
771042ac7c3f
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
243
diff
changeset
|
602 |
\end{figure} |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
603 |
|
112
95ee5cc5c05d
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
111
diff
changeset
|
604 |
|
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
605 |
\end{document} |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
606 |
|
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
607 |
%%% Local Variables: |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
608 |
%%% mode: latex |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
609 |
%%% TeX-master: t |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
610 |
%%% End: |