author | Christian Urban <christian dot urban at kcl dot ac dot uk> |
Sat, 13 Sep 2014 04:30:25 +0100 | |
changeset 242 | 35104ee14f87 |
parent 237 | 370c0647a9bf |
child 243 | 8d5aaf5b0031 |
permissions | -rw-r--r-- |
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
1 |
\documentclass{article} |
237
370c0647a9bf
more material
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
217
diff
changeset
|
2 |
\usepackage{../style} |
217
cd6066f1056a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
3 |
\usepackage{../langs} |
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
4 |
\usepackage{../graphics} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
5 |
\usepackage{../data} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
6 |
\usepackage{longtable} |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
7 |
|
112
95ee5cc5c05d
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
111
diff
changeset
|
8 |
|
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
9 |
\begin{document} |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
10 |
|
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
11 |
\section*{Handout 1} |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
12 |
|
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
13 |
This module is about text processing, be it for web-crawlers, |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
14 |
compilers, dictionaries, DNA-data and so on. When looking for |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
15 |
a particular string in a large text we can use the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
16 |
Knuth-Morris-Pratt algorithm, which is currently the most |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
17 |
efficient general string search algorithm. But often we do |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
18 |
\emph{not} look for just a particular string, but for string |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
19 |
patterns. For example in programming code we need to identify |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
20 |
what are the keywords, what are the identifiers etc. Also |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
21 |
often we face the problem that we are given a string (for |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
22 |
example some user input) and want to know whether it matches a |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
23 |
particular pattern. For example for excluding some user input |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
24 |
that would otherwise have nasty effects on our program |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
25 |
(crashing or going into an infinite loop, if not worse). |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
26 |
\defn{Regular expressions} help with conveniently specifying |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
27 |
such patterns. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
28 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
29 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
30 |
The idea behind regular expressions is that they are a simple |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
31 |
method for describing languages (or sets of strings)...at |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
32 |
least languages we are interested in in computer science. For |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
33 |
example there is no convenient regular expression for |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
34 |
describing the English language short of enumerating all |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
35 |
English words. But they seem useful for describing for example |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
36 |
email addresses.\footnote{See ``8 Regular Expressions You |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
37 |
Should Know'' \url{http://goo.gl/5LoVX7}} Consider the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
38 |
following regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
39 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
40 |
\begin{equation}\label{email} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
41 |
\texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
42 |
\end{equation} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
43 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
44 |
\noindent where the first part matches one or more lowercase |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
45 |
letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
46 |
or hyphens. The \pcode{+} ensures the ``one or more''. Then |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
47 |
comes the \pcode{@}-sign, followed by the domain name which |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
48 |
must be one or more lowercase letters, digits, underscores, |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
49 |
dots or hyphens. Note there cannot be an underscore in the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
50 |
domain name. Finally there must be a dot followed by the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
51 |
toplevel domain. This toplevel domain must be 2 to 6 lowercase |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
52 |
letters including the dot. Example strings which follow this |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
53 |
pattern are: |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
54 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
55 |
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
56 |
niceandsimple@example.com |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
57 |
very.common@example.org |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
58 |
a.little.lengthy.but.fine@dept.example.co.uk |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
59 |
other.email-with-dash@example.ac.uk |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
60 |
\end{lstlisting} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
61 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
62 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
63 |
\noindent |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
64 |
But for example the following two do not: |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
65 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
66 |
\begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
67 |
user@localserver |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
68 |
disposable.style.email.with+symbol@example.com |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
69 |
\end{lstlisting} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
70 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
71 |
Many programming language offer libraries that can be used to |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
72 |
validate such strings against regular expressions, like the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
73 |
one for email addresses in \eqref{email}. There are some |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
74 |
common, and I am sure very familiar, ways how to construct |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
75 |
regular expressions. For example in Scala we have |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
76 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
77 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
78 |
\begin{longtable}{lp{9cm}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
79 |
\pcode{re*} & matches 0 or more occurrences of preceding |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
80 |
expression\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
81 |
\pcode{re+} & matches 1 or more occurrences of preceding |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
82 |
expression\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
83 |
\pcode{re?} & matches 0 or 1 occurrence of preceding |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
84 |
expression\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
85 |
\pcode{re\{n\}} & matches exactly \pcode{n} number of |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
86 |
occurrences\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
87 |
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
88 |
occurences of the preceding expression\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
89 |
\pcode{[...]} & matches any single character inside the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
90 |
brackets\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
91 |
\pcode{[^...]} & matches any single character not inside the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
92 |
brackets\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
93 |
\pcode{..-..} & character ranges\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
94 |
\pcode{\\d} & matches digits; equivalent to \pcode{[0-9]} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
95 |
\end{longtable} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
96 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
97 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
98 |
\noindent With this you can figure out the purpose of the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
99 |
regular expressions in the web-crawlers shown Figures |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
100 |
\ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
101 |
regular expression for http-addresses in web-pages: |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
102 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
103 |
\[ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
104 |
\pcode{"https?://[^"]*"} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
105 |
\] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
106 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
107 |
\noindent It specifies that web-addresses need to start with a |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
108 |
double quote, then comes \texttt{http} followed by an optional |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
109 |
\texttt{s} and so on. Usually we would have to escape the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
110 |
double quotes in order to make sure we interpret the double |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
111 |
quote as character, not as double quote for a string. But |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
112 |
Scala's trick with triple quotes allows us to omit this kind |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
113 |
of escaping. As a result we can just write: |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
114 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
115 |
\[ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
116 |
\pcode{""""https?://[^"]*"""".r} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
117 |
\] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
118 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
119 |
\noindent Not also that the convention in Scala is that |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
120 |
\texttt{.r} converts a string into a regular expression. I |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
121 |
leave it to you to ponder whether this regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
122 |
really captures all possible web-addresses.\bigskip |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
123 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
124 |
Regular expressions were introduced by Kleene in the 1950ies |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
125 |
and they have been object of intense study since then. They |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
126 |
are nowadays pretty much ubiquitous in computer science. I am |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
127 |
sure you have come across them before. Why on earth then is |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
128 |
there any interest in studying them again in depth in this |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
129 |
module? Well, one answer is in the following graph about |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
130 |
regular expression matching in Python and in Ruby. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
131 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
132 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
133 |
\begin{tikzpicture}[y=.1cm, x=.15cm] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
134 |
%axis |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
135 |
\draw (0,0) -- coordinate (x axis mid) (30,0); |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
136 |
\draw (0,0) -- coordinate (y axis mid) (0,30); |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
137 |
%ticks |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
138 |
\foreach \x in {0,5,...,30} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
139 |
\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
140 |
\foreach \y in {0,5,...,30} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
141 |
\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
142 |
%labels |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
143 |
\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
144 |
\node[rotate=90,above=0.8cm] at (y axis mid) {time in secs}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
145 |
%plots |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
146 |
\draw[color=blue] plot[mark=*] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
147 |
file {re-python.data}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
148 |
\draw[color=brown] plot[mark=triangle*] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
149 |
file {re-ruby.data}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
150 |
%legend |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
151 |
\begin{scope}[shift={(4,20)}] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
152 |
\draw[color=blue] (0,0) -- |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
153 |
plot[mark=*] (0.25,0) -- (0.5,0) |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
154 |
node[right]{\small Python}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
155 |
\draw[yshift=-\baselineskip, color=brown] (0,0) -- |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
156 |
plot[mark=triangle*] (0.25,0) -- (0.5,0) |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
157 |
node[right]{\small Ruby}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
158 |
\end{scope} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
159 |
\end{tikzpicture} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
160 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
161 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
162 |
\noindent This graph shows that Python needs approximately 29 |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
163 |
seconds in order to find out that a string of 28 \texttt{a}s |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
164 |
matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
165 |
Ruby is even slightly worse.\footnote{In this example Ruby |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
166 |
uses the slightly different regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
167 |
\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
168 |
\texttt{a} each occur $n$ times.} Admittedly, this regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
169 |
expression is carefully chosen to exhibit this exponential |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
170 |
behaviour, but similar ones occur more often than one wants in |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
171 |
``real life''. They are sometimes called \emph{evil regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
172 |
expressions} because they have the potential to make regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
173 |
expression matching engines topple over, like in Python and |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
174 |
Ruby. The problem is that this can have some serious |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
175 |
consequences, for example, if you use them in your |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
176 |
web-application, because hackers can look for these instances |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
177 |
where the matching engine behaves badly and mount a nice |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
178 |
DoS-attack against your application. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
179 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
180 |
It will be instructive to look behind the ``scenes''to find |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
181 |
out why Python and Ruby (and others) behave so badly when |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
182 |
matching with evil regular expressions. But we will also look |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
183 |
at a relatively simple algorithm that solves this problem much |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
184 |
better than Python and Ruby do\ldots actually it will be two |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
185 |
versions of the algorithm: the first one will be able to |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
186 |
process strings of approximately 1,000 \texttt{a}s in 30 |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
187 |
seconds, while the second version will even be able to |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
188 |
process up to 12,000 in less than 10(!) seconds, see the graph |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
189 |
below: |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
190 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
191 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
192 |
\begin{tikzpicture}[y=.1cm, x=.0006cm] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
193 |
%axis |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
194 |
\draw (0,0) -- coordinate (x axis mid) (12000,0); |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
195 |
\draw (0,0) -- coordinate (y axis mid) (0,30); |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
196 |
%ticks |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
197 |
\foreach \x in {0,2000,...,12000} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
198 |
\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
199 |
\foreach \y in {0,5,...,30} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
200 |
\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
201 |
%labels |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
202 |
\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
203 |
\node[rotate=90, above=0.8cm] at (y axis mid) {time in secs}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
204 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
205 |
%plots |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
206 |
\draw[color=green] plot[mark=square*, mark options={fill=white} ] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
207 |
file {re2b.data}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
208 |
\draw[color=black] plot[mark=square*, mark options={fill=white} ] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
209 |
file {re3.data}; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
210 |
\end{tikzpicture} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
211 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
212 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
213 |
\subsection*{Basic Regular Expressions} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
214 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
215 |
The regular expressions shown above we will call |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
216 |
\defn{extended regular expressions}. The ones we will mainly |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
217 |
study are \emph{basic regular expressions}, which by |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
218 |
convention we will just call regular expressions, if it is |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
219 |
clear what we mean. The attraction of (basic) regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
220 |
expressions is that many features of the extended one are just |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
221 |
syntactic suggar. (Basic) regular expressions are defined by |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
222 |
the following grammar: |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
223 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
224 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
225 |
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
226 |
$r$ & $::=$ & $\varnothing$ & null\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
227 |
& $\mid$ & $\epsilon$ & empty string / "" / []\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
228 |
& $\mid$ & $c$ & single character\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
229 |
& $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
230 |
& $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
231 |
& $\mid$ & $r^*$ & star (zero or more)\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
232 |
\end{tabular} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
233 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
234 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
235 |
\noindent Because we overload our notation, there are some |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
236 |
subtleties you should be aware of. First, when regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
237 |
expressions are referred to then $\varnothing$ does not stand |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
238 |
for the empty set: it is a particular pattern that does not |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
239 |
match any string. Similarly, in the context of regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
240 |
expressions, $\epsilon$ does not stand for the empty string |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
241 |
(as in many places in the literature) but for a pattern that |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
242 |
matches the empty string. Second, the letter $c$ stands for |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
243 |
any character from the alphabet at hand. Again in the context |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
244 |
of regular expressions, it is a particular pattern that can |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
245 |
match the specified string. Third, you should also be careful |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
246 |
with the our overloading of the star: assuming you have read |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
247 |
the handout about our basic mathematical notation, you will |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
248 |
see that in the context of languages (sets of strings) the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
249 |
star stands for an operation on languages. While $r^*$ stands |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
250 |
for a regular expression, the operation on sets is defined as |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
251 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
252 |
\[ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
253 |
A^* \dn \bigcup_{0\le n} A^n |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
254 |
\] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
255 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
256 |
We will use parentheses to disambiguate regular expressions. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
257 |
Parentheses are not really part of a regular expression, and |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
258 |
indeed we do not need them in our code because there the tree |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
259 |
structure is always clear. But for writing them down in a more |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
260 |
mathematical fashion, parentheses will be helpful. For example |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
261 |
we will write $(r_1 + r_2)^*$, which is different from, say |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
262 |
$r_1 + (r_2)^*$. The former means roughly zero or more times |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
263 |
$r_1$ or $r_2$, while the latter means $r_1$ or zero or more |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
264 |
times $r_2$. This will turn out are two different pattern, |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
265 |
which match in general different strings. We should also write |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
266 |
$(r_1 + r_2) + r_3$, which is different from the regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
267 |
expression $r_1 + (r_2 + r_3)$, but in case of $+$ and $\cdot$ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
268 |
we actually do not care about the order and just write $r_1 + |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
269 |
r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, respectively. The |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
270 |
reasons for this will become clear shortly. In the literature |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
271 |
you will often find that the choice $r_1 + r_2$ is written as |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
272 |
$r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
273 |
convention in the literature, we will often omit the $\cdot$ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
274 |
all together. This is to make some concrete regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
275 |
expressions more readable. For example the regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
276 |
for email addresses shown in \eqref{email} would look like |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
277 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
278 |
\[ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
279 |
\texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
280 |
\texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
281 |
\texttt{[...]\{2,6\}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
282 |
\] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
283 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
284 |
\noindent |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
285 |
which is much less readable than \eqref{email}. Similarly for |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
286 |
the regular expression that matches the string $hello$ we |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
287 |
should write |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
288 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
289 |
\[ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
290 |
{\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
291 |
\] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
292 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
293 |
\noindent |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
294 |
but often just write {\it hello}. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
295 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
296 |
If you prefer to think in terms of the implementation |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
297 |
of regular expressions in Scala, the constructors and |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
298 |
classes relate as follows |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
299 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
300 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
301 |
\begin{tabular}{rcl} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
302 |
$\varnothing$ & $\mapsto$ & \texttt{NULL}\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
303 |
$\epsilon$ & $\mapsto$ & \texttt{EMPTY}\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
304 |
$c$ & $\mapsto$ & \texttt{CHAR(c)}\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
305 |
$r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)}\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
306 |
$r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
307 |
$r^*$ & $\mapsto$ & \texttt{STAR(r)} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
308 |
\end{tabular} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
309 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
310 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
311 |
A source of confusion might arise from the fact that we |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
312 |
use the term \emph{basic regular expression} for the regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
313 |
expressions used in ``theory'' and defined above, and |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
314 |
\emph{extended regular expression} for the ones used in |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
315 |
``practice'', for example Scala. If runtime is not of an |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
316 |
issue, then the latter can be seen as some syntactic sugar of |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
317 |
the former. Fo example we could replace |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
318 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
319 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
320 |
\begin{tabular}{rcl} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
321 |
$r^+$ & $\mapsto$ & $r\cdot r^*$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
322 |
$r?$ & $\mapsto$ & $\epsilon + r$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
323 |
$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
324 |
$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
325 |
\end{tabular} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
326 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
327 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
328 |
\subsection*{The Meaning of Regular Expressions} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
329 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
330 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
331 |
So far we have only considered informally what the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
332 |
\emph{meaning} of a regular expression is. This is no good for |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
333 |
specifications of what algorithms are supposed to do or which |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
334 |
problems they are supposed to solve. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
335 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
336 |
To do so more formally we will associate with every regular |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
337 |
expression a language, or set of strings, that is supposed to |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
338 |
be matched by this regular expression. To understand what is |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
339 |
going on here it is crucial that you have also read the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
340 |
handout about our basic mathematical notations. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
341 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
342 |
The meaning of a regular expression can be defined recursively |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
343 |
as follows |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
344 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
345 |
\begin{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
346 |
\begin{tabular}{rcl} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
347 |
$L(\varnothing)$ & $\dn$ & $\varnothing$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
348 |
$L(\epsilon)$ & $\dn$ & $\{[]\}$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
349 |
$L(c)$ & $\dn$ & $\{"c"\}$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
350 |
$L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
351 |
$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) @ L(r_2)$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
352 |
$L(r^*)$ & $\dn$ & $(L(r))^*$\\ |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
353 |
\end{tabular} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
354 |
\end{center} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
355 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
356 |
\noindent |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
357 |
As a result we can now precisely state what the meaning, for example, of the regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
358 |
${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
359 |
$L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
360 |
choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
361 |
be matched by this choice. You can now also see why we do not make a difference |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
362 |
between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
363 |
are not the same regular expression, but have the same meaning. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
364 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
365 |
The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
366 |
regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
367 |
any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no}, |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
368 |
if $s \not\in L(r)$. We leave this for the next lecture. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
369 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
370 |
\begin{figure}[p] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
371 |
\lstinputlisting{../progs/crawler1.scala} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
372 |
\caption{The Scala code for a simple web-crawler that checks |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
373 |
for broken links in a web-page. It uses the regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
374 |
\texttt{http\_pattern} in Line~15 for recognising URL-addresses. |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
375 |
It finds all links using the library function |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
376 |
\texttt{findAllIn} in Line~21.\label{crawler1}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
377 |
\end{figure} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
378 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
379 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
380 |
\begin{figure}[p] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
381 |
\lstinputlisting{../progs/crawler2.scala} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
382 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
383 |
\caption{A version of the web-crawler that only follows links |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
384 |
in ``my'' domain---since these are the ones I am interested in |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
385 |
to fix. It uses the regular expression \texttt{my\_urls} in |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
386 |
Line~16 to check for my name in the links. The main change is |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
387 |
in Lines~26--29 where there is a test whether URL is in ``my'' |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
388 |
domain or not.\label{crawler2}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
389 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
390 |
\end{figure} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
391 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
392 |
\begin{figure}[p] |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
393 |
\lstinputlisting{../progs/crawler3.scala} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
394 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
395 |
\caption{A small email harvester---whenever we download a |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
396 |
web-page, we also check whether it contains any email |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
397 |
addresses. For this we use the regular expression |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
398 |
\texttt{email\_pattern} in Line~16. The main change is in Line |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
399 |
32 where all email addresses that can be found in a page are |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
400 |
printed.\label{crawler3}} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
401 |
\end{figure} |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
402 |
|
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
403 |
\pagebreak |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
404 |
Lets start |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
405 |
with what we mean by \emph{strings}. Strings (they are also |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
406 |
sometimes referred to as \emph{words}) are lists of characters |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
407 |
drawn from an \emph{alphabet}. If nothing else is specified, |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
408 |
we usually assume the alphabet consists of just the lower-case |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
409 |
letters $a$, $b$, \ldots, $z$. Sometimes, however, we |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
410 |
explicitly restrict strings to contain, for example, only the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
411 |
letters $a$ and $b$. In this case we say the alphabet is the |
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
412 |
set $\{a, b\}$. |
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
413 |
|
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
414 |
There are many ways how we can write down strings. In programming languages, they are usually |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
415 |
written as {\it "hello"} where the double quotes indicate that we dealing with a string. |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
416 |
Essentially, strings are lists of characters which can be written for example as follows |
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
417 |
|
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
418 |
\[ |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
419 |
[\text{\it h, e, l, l, o}] |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
420 |
\] |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
421 |
|
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
422 |
\noindent |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
423 |
The important point is that we can always decompose strings. For example, we will often consider the |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
424 |
first character of a string, say $h$, and the ``rest'' of a string say {\it "ello"} when making definitions |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
425 |
about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
426 |
the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated}, |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
427 |
which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenation |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
428 |
gives {\it "foobar"}. |
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
429 |
|
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
430 |
We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$ |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
431 |
is |
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
432 |
|
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
433 |
\[ |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
434 |
\{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\} |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
435 |
\] |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
436 |
|
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
437 |
\noindent |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
438 |
Any set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behind |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
439 |
this choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like |
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
440 |
|
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
441 |
\[ |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
442 |
\{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\} |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
443 |
\] |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
444 |
|
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
445 |
\noindent |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
446 |
then we have essentially described the English language, or more precisely all |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
447 |
strings that can be used in a sentence of the English language. French would be a |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
448 |
different set of strings, and so on. In the context of this course, a language might |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
449 |
not necessarily make sense from a natural language point of view. For example |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
450 |
the set of all strings shown above is a language, as is the empty set (of strings). The |
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
451 |
empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
452 |
difference between the empty set, or empty language, and the set that |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
453 |
contains only the empty string $\{\text{""}\}$: the former has no elements, whereas |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
454 |
the latter has one element. |
106
93bf3182cf71
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
105
diff
changeset
|
455 |
|
93bf3182cf71
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
105
diff
changeset
|
456 |
|
93bf3182cf71
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
105
diff
changeset
|
457 |
|
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
458 |
Before we expand on the topic of regular expressions, let us review some operations on |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
459 |
sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
460 |
The union of two sets is written as usual as $A \cup B$. We also need to define the |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
461 |
operation of \emph{concatenating} two sets of strings. This can be defined as |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
462 |
|
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
463 |
\[ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
464 |
A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \} |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
465 |
\] |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
466 |
|
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
467 |
\noindent |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
468 |
which essentially means take the first string from the set $A$ and concatenate it with every |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
469 |
string in the set $B$, then take the second string from $A$ do the same and so on. You might |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
470 |
like to think about what this definition means in case $A$ or $B$ is the empty set. |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
471 |
|
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
472 |
We also need to define |
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
113
diff
changeset
|
473 |
the power of a set of strings, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
474 |
|
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
475 |
\begin{center} |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
476 |
\begin{tabular}{rcl} |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
477 |
$A^0$ & $\dn$ & $\{[\,]\}$ \\ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
478 |
$A^{n+1}$ & $\dn$ & $A @ A^n$\\ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
479 |
\end{tabular} |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
480 |
\end{center} |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
481 |
|
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
482 |
\noindent |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
483 |
Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
484 |
of every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
485 |
|
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
486 |
\[ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
487 |
A^* \dn \bigcup_{0\le n} A^n |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
488 |
\] |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
489 |
|
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
490 |
\noindent |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
491 |
This definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
492 |
copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore |
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
493 |
have |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
494 |
|
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
495 |
\[ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
496 |
A^* = \{"", "a", "aa", "aaa", \ldots\} |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
497 |
\] |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
498 |
|
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
499 |
\noindent |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
500 |
Be aware that these operations sometimes have quite non-intuitive properties, for example |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
501 |
|
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
502 |
\begin{center} |
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
113
diff
changeset
|
503 |
\begin{tabular}{@{}ccc@{}} |
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
113
diff
changeset
|
504 |
\begin{tabular}{@{}r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
505 |
$A \cup \varnothing$ & $=$ & $A$\\ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
506 |
$A \cup A$ & $=$ & $A$\\ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
507 |
$A \cup B$ & $=$ & $B \cup A$\\ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
508 |
\end{tabular} & |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
509 |
\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
510 |
$A @ B$ & $\not =$ & $B @ A$\\ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
511 |
$A @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
512 |
$A @ \{""\}$ & $=$ & $\{""\} @ A = A$\\ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
513 |
\end{tabular} & |
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
113
diff
changeset
|
514 |
\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
515 |
$\varnothing^*$ & $=$ & $\{""\}$\\ |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
516 |
$\{""\}^*$ & $=$ & $\{""\}$\\ |
110
9353308f1c6a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
108
diff
changeset
|
517 |
$A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\ |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
518 |
\end{tabular} |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
519 |
\end{tabular} |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
520 |
\end{center} |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
521 |
\bigskip |
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
522 |
|
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
523 |
|
242
35104ee14f87
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
237
diff
changeset
|
524 |
\subsection*{My Fascination for Regular Expressions} |
108
52ee218151f9
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
107
diff
changeset
|
525 |
|
112
95ee5cc5c05d
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
111
diff
changeset
|
526 |
|
105
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
527 |
\end{document} |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
528 |
|
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
529 |
%%% Local Variables: |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
530 |
%%% mode: latex |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
531 |
%%% TeX-master: t |
397ecdafefd8
added handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
532 |
%%% End: |