handouts/ho01.tex
changeset 242 35104ee14f87
parent 237 370c0647a9bf
child 243 8d5aaf5b0031
equal deleted inserted replaced
241:10f02605a46a 242:35104ee14f87
     1 \documentclass{article}
     1 \documentclass{article}
     2 \usepackage{../style}
     2 \usepackage{../style}
     3 \usepackage{../langs}
     3 \usepackage{../langs}
     4 
     4 \usepackage{../graphics}
       
     5 \usepackage{../data}
       
     6 \usepackage{longtable}
     5 
     7 
     6 
     8 
     7 \begin{document}
     9 \begin{document}
     8 
    10 
     9 \section*{Handout 1}
    11 \section*{Handout 1}
    10 
    12 
    11 This module is about the processing of strings. Lets start with what we mean by \emph{strings}. Strings
    13 This module is about text processing, be it for web-crawlers,
    12 (they are also sometimes referred to as \emph{words}) are lists of characters drawn from an \emph{alphabet}. 
    14 compilers, dictionaries, DNA-data and so on. When looking for
    13 If nothing else is specified, we usually assume 
    15 a particular string in a large text we can use the
    14 the alphabet consists of just the lower-case letters $a$, $b$, \ldots, $z$. Sometimes, however, we explicitly
    16 Knuth-Morris-Pratt algorithm, which is currently the most
    15 restrict strings to contain, for example, only the letters $a$ and $b$. In this case we say the alphabet is the set $\{a, b\}$.
    17 efficient general string search algorithm. But often we do
       
    18 \emph{not} look for just a particular string, but for string
       
    19 patterns. For example in programming code we need to identify
       
    20 what are the keywords, what are the identifiers etc. Also
       
    21 often we face the problem that we are given a string (for
       
    22 example some user input) and want to know whether it matches a
       
    23 particular pattern. For example for excluding some user input
       
    24 that would otherwise have nasty effects on our program
       
    25 (crashing or going into an infinite loop, if not worse).
       
    26 \defn{Regular expressions} help with conveniently specifying
       
    27 such patterns. 
       
    28 
       
    29 
       
    30 The idea behind regular expressions is that they are a simple
       
    31 method for describing languages (or sets of strings)...at
       
    32 least languages we are interested in in computer science. For
       
    33 example there is no convenient regular expression for
       
    34 describing the English language short of enumerating all
       
    35 English words. But they seem useful for describing for example
       
    36 email addresses.\footnote{See ``8 Regular Expressions You
       
    37 Should Know'' \url{http://goo.gl/5LoVX7}} Consider the
       
    38 following regular expression
       
    39 
       
    40 \begin{equation}\label{email}
       
    41 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
       
    42 \end{equation}
       
    43 
       
    44 \noindent where the first part matches one or more lowercase
       
    45 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
       
    46 or hyphens. The \pcode{+} ensures the ``one or more''. Then
       
    47 comes the \pcode{@}-sign, followed by the domain name which
       
    48 must be one or more lowercase letters, digits, underscores,
       
    49 dots or hyphens. Note there cannot be an underscore in the
       
    50 domain name. Finally there must be a dot followed by the
       
    51 toplevel domain. This toplevel domain must be 2 to 6 lowercase
       
    52 letters including the dot. Example strings which follow this
       
    53 pattern are:
       
    54 
       
    55 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
       
    56 niceandsimple@example.com
       
    57 very.common@example.org
       
    58 a.little.lengthy.but.fine@dept.example.co.uk
       
    59 other.email-with-dash@example.ac.uk
       
    60 \end{lstlisting}
       
    61 
       
    62 
       
    63 \noindent
       
    64 But for example the following two do not:
       
    65 
       
    66 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
       
    67 user@localserver
       
    68 disposable.style.email.with+symbol@example.com
       
    69 \end{lstlisting}
       
    70 
       
    71 Many programming language offer libraries that can be used to
       
    72 validate such strings against regular expressions, like the
       
    73 one for email addresses in \eqref{email}. There are some
       
    74 common, and I am sure very familiar, ways how to construct
       
    75 regular expressions. For example in Scala we have 
       
    76 
       
    77 \begin{center}
       
    78 \begin{longtable}{lp{9cm}}
       
    79 \pcode{re*} & matches 0 or more occurrences of preceding 
       
    80 expression\\
       
    81 \pcode{re+} & matches 1 or more occurrences of preceding
       
    82 expression\\
       
    83 \pcode{re?} &	 matches 0 or 1 occurrence of preceding 
       
    84 expression\\
       
    85 \pcode{re\{n\}}	& matches exactly \pcode{n} number of 
       
    86 occurrences\\
       
    87 \pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}
       
    88 occurences of the preceding expression\\
       
    89 \pcode{[...]} & matches any single character inside the 
       
    90 brackets\\
       
    91 \pcode{[^...]} & matches any single character not inside the 
       
    92 brackets\\
       
    93 \pcode{..-..} & character ranges\\
       
    94 \pcode{\\d} &	matches digits; equivalent to \pcode{[0-9]}
       
    95 \end{longtable}
       
    96 \end{center}
       
    97 
       
    98 \noindent With this you can figure out the purpose of the
       
    99 regular expressions in the web-crawlers shown Figures
       
   100 \ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note the
       
   101 regular expression for http-addresses in web-pages:
       
   102 
       
   103 \[
       
   104 \pcode{"https?://[^"]*"}
       
   105 \]
       
   106 
       
   107 \noindent It specifies that web-addresses need to start with a
       
   108 double quote, then comes \texttt{http} followed by an optional
       
   109 \texttt{s} and so on. Usually we would have to escape the
       
   110 double quotes in order to make sure we interpret the double
       
   111 quote as character, not as double quote for a string. But
       
   112 Scala's trick with triple quotes allows us to omit this kind
       
   113 of escaping. As a result we can just write:
       
   114 
       
   115 \[
       
   116 \pcode{""""https?://[^"]*"""".r}
       
   117 \]
       
   118 
       
   119 \noindent Not also that the convention in Scala is that
       
   120 \texttt{.r} converts a string into a regular expression. I
       
   121 leave it to you to ponder whether this regular expression
       
   122 really captures all possible web-addresses.\bigskip
       
   123 
       
   124 Regular expressions were introduced by Kleene in the 1950ies
       
   125 and they have been object of intense study since then. They
       
   126 are nowadays pretty much ubiquitous in computer science. I am
       
   127 sure you have come across them before. Why on earth then is
       
   128 there any interest in studying them again in depth in this
       
   129 module? Well, one answer is in the following graph about
       
   130 regular expression matching in Python and in Ruby.
       
   131 
       
   132 \begin{center}
       
   133 \begin{tikzpicture}[y=.1cm, x=.15cm]
       
   134  	%axis
       
   135 	\draw (0,0) -- coordinate (x axis mid) (30,0);
       
   136    \draw (0,0) -- coordinate (y axis mid) (0,30);
       
   137    %ticks
       
   138    \foreach \x in {0,5,...,30}
       
   139      \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
       
   140    \foreach \y in {0,5,...,30}
       
   141      \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; 
       
   142 	%labels      
       
   143 	\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
       
   144 	\node[rotate=90,above=0.8cm] at (y axis mid) {time in secs};
       
   145 	%plots
       
   146 	\draw[color=blue] plot[mark=*] 
       
   147 		file {re-python.data};
       
   148 	\draw[color=brown] plot[mark=triangle*] 
       
   149 		file {re-ruby.data};	 
       
   150    %legend
       
   151 	\begin{scope}[shift={(4,20)}] 
       
   152 	\draw[color=blue] (0,0) -- 
       
   153 		plot[mark=*] (0.25,0) -- (0.5,0) 
       
   154 		node[right]{\small Python};
       
   155 	\draw[yshift=-\baselineskip, color=brown] (0,0) -- 
       
   156 		plot[mark=triangle*] (0.25,0) -- (0.5,0)
       
   157 		node[right]{\small Ruby};
       
   158 	\end{scope}	
       
   159 \end{tikzpicture}
       
   160 \end{center}
       
   161 
       
   162 \noindent This graph shows that Python needs approximately 29
       
   163 seconds in order to find out that a string of 28 \texttt{a}s
       
   164 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
       
   165 Ruby is even slightly worse.\footnote{In this example Ruby
       
   166 uses the slightly different regular expression
       
   167 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
       
   168 \texttt{a} each occur $n$ times.} Admittedly, this regular
       
   169 expression is carefully chosen to exhibit this exponential
       
   170 behaviour, but similar ones occur more often than one wants in
       
   171 ``real life''. They are sometimes called \emph{evil regular
       
   172 expressions} because they have the potential to make regular
       
   173 expression matching engines topple over, like in Python and
       
   174 Ruby. The problem is that this can have some serious
       
   175 consequences, for example, if you use them in your
       
   176 web-application, because hackers can look for these instances
       
   177 where the matching engine behaves badly and mount a nice 
       
   178 DoS-attack against your application.
       
   179 
       
   180 It will be instructive to look behind the ``scenes''to find
       
   181 out why Python and Ruby (and others) behave so badly when
       
   182 matching with evil regular expressions. But we will also look
       
   183 at a relatively simple algorithm that solves this problem much
       
   184 better than Python and Ruby do\ldots actually it will be two
       
   185 versions of the algorithm: the first one will be able to
       
   186 process strings of approximately 1,000 \texttt{a}s in 30
       
   187 seconds, while the second version will even be able to
       
   188 process up to 12,000 in less than 10(!) seconds, see the graph
       
   189 below:
       
   190 
       
   191 \begin{center}
       
   192 \begin{tikzpicture}[y=.1cm, x=.0006cm]
       
   193  	%axis
       
   194 	\draw (0,0) -- coordinate (x axis mid) (12000,0);
       
   195    \draw (0,0) -- coordinate (y axis mid) (0,30);
       
   196    %ticks
       
   197    \foreach \x in {0,2000,...,12000}
       
   198      	\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
       
   199    \foreach \y in {0,5,...,30}
       
   200      	\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; 
       
   201 	%labels      
       
   202 	\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
       
   203 	\node[rotate=90, above=0.8cm] at (y axis mid) {time in secs};
       
   204 
       
   205 	%plots
       
   206    \draw[color=green] plot[mark=square*, mark options={fill=white} ] 
       
   207 		file {re2b.data};
       
   208 	\draw[color=black] plot[mark=square*, mark options={fill=white} ] 
       
   209 		file {re3.data};	 
       
   210 \end{tikzpicture}
       
   211 \end{center}
       
   212 
       
   213 \subsection*{Basic Regular Expressions}
       
   214 
       
   215 The regular expressions shown above we will call
       
   216 \defn{extended regular expressions}. The ones we will mainly
       
   217 study are \emph{basic regular expressions}, which by
       
   218 convention we will just call regular expressions, if it is
       
   219 clear what we mean. The attraction of (basic) regular
       
   220 expressions is that many features of the extended one are just
       
   221 syntactic suggar. (Basic) regular expressions are defined by
       
   222 the following grammar:
       
   223 
       
   224 \begin{center}
       
   225 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
       
   226   $r$ & $::=$ &   $\varnothing$         & null\\
       
   227         & $\mid$ & $\epsilon$           & empty string / "" / []\\
       
   228         & $\mid$ & $c$                  & single character\\
       
   229         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
       
   230         & $\mid$ & $r_1 + r_2$          & alternative / choice\\
       
   231         & $\mid$ & $r^*$                & star (zero or more)\\
       
   232   \end{tabular}
       
   233 \end{center}
       
   234 
       
   235 \noindent Because we overload our notation, there are some
       
   236 subtleties you should be aware of. First, when regular
       
   237 expressions are referred to then $\varnothing$ does not stand
       
   238 for the empty set: it is a particular pattern that does not
       
   239 match any string. Similarly, in the context of regular
       
   240 expressions, $\epsilon$ does not stand for the empty string
       
   241 (as in many places in the literature) but for a pattern that
       
   242 matches the empty string. Second, the letter $c$ stands for
       
   243 any character from the alphabet at hand. Again in the context
       
   244 of regular expressions, it is a particular pattern that can
       
   245 match the specified string. Third, you should also be careful
       
   246 with the our overloading of the star: assuming you have read
       
   247 the handout about our basic mathematical notation, you will
       
   248 see that in the context of languages (sets of strings) the
       
   249 star stands for an operation on languages. While $r^*$ stands
       
   250 for a regular expression, the operation on sets is defined as
       
   251 
       
   252 \[
       
   253 A^* \dn \bigcup_{0\le n} A^n
       
   254 \]
       
   255 
       
   256 We will use parentheses to disambiguate regular expressions.
       
   257 Parentheses are not really part of a regular expression, and
       
   258 indeed we do not need them in our code because there the tree
       
   259 structure is always clear. But for writing them down in a more
       
   260 mathematical fashion, parentheses will be helpful. For example
       
   261 we will write $(r_1 + r_2)^*$, which is different from, say
       
   262 $r_1 + (r_2)^*$. The former means roughly zero or more times
       
   263 $r_1$ or $r_2$, while the latter means $r_1$ or zero or more
       
   264 times $r_2$. This will turn out are two different pattern,
       
   265 which match in general different strings. We should also write
       
   266 $(r_1 + r_2) + r_3$, which is different from the regular
       
   267 expression $r_1 + (r_2 + r_3)$, but in case of $+$ and $\cdot$
       
   268 we actually do not care about the order and just write $r_1 +
       
   269 r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, respectively. The
       
   270 reasons for this will become clear shortly. In the literature
       
   271 you will often find that the choice $r_1 + r_2$ is written as
       
   272 $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the
       
   273 convention in the literature, we will often omit the $\cdot$
       
   274 all together. This is to make some concrete regular
       
   275 expressions more readable. For example the regular expression
       
   276 for email addresses shown in \eqref{email} would look like
       
   277 
       
   278 \[
       
   279 \texttt{[...]+} \;\cdot\;  \texttt{@} \;\cdot\; 
       
   280 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; 
       
   281 \texttt{[...]\{2,6\}}
       
   282 \]
       
   283 
       
   284 \noindent
       
   285 which is much less readable than \eqref{email}. Similarly for
       
   286 the regular expression that matches the string $hello$ we 
       
   287 should write
       
   288 
       
   289 \[
       
   290 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
       
   291 \]
       
   292 
       
   293 \noindent
       
   294 but often just write {\it hello}.
       
   295 
       
   296 If you prefer to think in terms of the implementation
       
   297 of regular expressions in Scala, the constructors and
       
   298 classes relate as follows
       
   299 
       
   300 \begin{center}
       
   301 \begin{tabular}{rcl}
       
   302 $\varnothing$ & $\mapsto$ & \texttt{NULL}\\
       
   303 $\epsilon$    & $\mapsto$ & \texttt{EMPTY}\\
       
   304 $c$           & $\mapsto$ & \texttt{CHAR(c)}\\
       
   305 $r_1 + r_2$   & $\mapsto$ & \texttt{ALT(r1, r2)}\\
       
   306 $r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\
       
   307 $r^*$         & $\mapsto$ & \texttt{STAR(r)}
       
   308 \end{tabular}
       
   309 \end{center}
       
   310 
       
   311 A source of confusion might arise from the fact that we
       
   312 use the term \emph{basic regular expression} for the regular
       
   313 expressions used in ``theory'' and defined above, and
       
   314 \emph{extended regular expression} for the ones used in
       
   315 ``practice'', for example Scala. If runtime is not of an
       
   316 issue, then the latter can be seen as some syntactic sugar of
       
   317 the former. Fo example we could replace
       
   318 
       
   319 \begin{center}
       
   320 \begin{tabular}{rcl}
       
   321 $r^+$ & $\mapsto$ & $r\cdot r^*$\\
       
   322 $r?$ & $\mapsto$ & $\epsilon + r$\\
       
   323 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
       
   324 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
       
   325 \end{tabular}
       
   326 \end{center}
       
   327 
       
   328 \subsection*{The Meaning of Regular Expressions}
       
   329 
       
   330 
       
   331 So far we have only considered informally what the
       
   332 \emph{meaning} of a regular expression is. This is no good for
       
   333 specifications of what algorithms are supposed to do or which
       
   334 problems they are supposed to solve.
       
   335 
       
   336 To do so more formally we will associate with every regular
       
   337 expression a language, or set of strings, that is supposed to
       
   338 be matched by this regular expression. To understand what is 
       
   339 going on here it is crucial that you have also read the 
       
   340 handout about our basic mathematical notations.
       
   341 
       
   342 The meaning of a regular expression can be defined recursively
       
   343 as follows
       
   344 
       
   345 \begin{center}
       
   346 \begin{tabular}{rcl}
       
   347 $L(\varnothing)$  & $\dn$ & $\varnothing$\\
       
   348 $L(\epsilon)$     & $\dn$ & $\{[]\}$\\
       
   349 $L(c)$            & $\dn$ & $\{"c"\}$\\
       
   350 $L(r_1+ r_2)$     & $\dn$ & $L(r_1) \cup L(r_2)$\\
       
   351 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) @ L(r_2)$\\
       
   352 $L(r^*)$           & $\dn$ & $(L(r))^*$\\
       
   353 \end{tabular}
       
   354 \end{center}
       
   355 
       
   356 \noindent
       
   357 As a result we can now precisely state what the meaning, for example, of the regular expression 
       
   358 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely 
       
   359 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the 
       
   360 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
       
   361 be matched by this choice. You can now also see why we do not make a difference
       
   362 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they 
       
   363 are not the same regular expression, but have the same meaning. 
       
   364 
       
   365 The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a
       
   366 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
       
   367 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
       
   368 if $s \not\in L(r)$. We leave this for the next lecture.
       
   369 
       
   370 \begin{figure}[p]
       
   371 \lstinputlisting{../progs/crawler1.scala}
       
   372 \caption{The Scala code for a simple web-crawler that checks
       
   373 for broken links in a web-page. It uses the regular expression
       
   374 \texttt{http\_pattern} in Line~15 for recognising URL-addresses.
       
   375 It finds all links using the library function
       
   376 \texttt{findAllIn} in Line~21.\label{crawler1}}
       
   377 \end{figure}
       
   378 
       
   379 
       
   380 \begin{figure}[p]
       
   381 \lstinputlisting{../progs/crawler2.scala}
       
   382 
       
   383 \caption{A version of the web-crawler that only follows links
       
   384 in ``my'' domain---since these are the ones I am interested in
       
   385 to fix. It uses the regular expression \texttt{my\_urls} in
       
   386 Line~16 to check for my name in the links. The main change is
       
   387 in Lines~26--29 where there is a test whether URL is in ``my''
       
   388 domain or not.\label{crawler2}}
       
   389 
       
   390 \end{figure}
       
   391 
       
   392 \begin{figure}[p]
       
   393 \lstinputlisting{../progs/crawler3.scala}
       
   394 
       
   395 \caption{A small email harvester---whenever we download a
       
   396 web-page, we also check whether it contains any email
       
   397 addresses. For this we use the regular expression
       
   398 \texttt{email\_pattern} in Line~16. The main change is in Line
       
   399 32 where all email addresses that can be found in a page are
       
   400 printed.\label{crawler3}}
       
   401 \end{figure}
       
   402 
       
   403 \pagebreak
       
   404 Lets start
       
   405 with what we mean by \emph{strings}. Strings (they are also
       
   406 sometimes referred to as \emph{words}) are lists of characters
       
   407 drawn from an \emph{alphabet}. If nothing else is specified,
       
   408 we usually assume the alphabet consists of just the lower-case
       
   409 letters $a$, $b$, \ldots, $z$. Sometimes, however, we
       
   410 explicitly restrict strings to contain, for example, only the
       
   411 letters $a$ and $b$. In this case we say the alphabet is the
       
   412 set $\{a, b\}$.
    16 
   413 
    17 There are many ways how we can write down strings. In programming languages, they are usually 
   414 There are many ways how we can write down strings. In programming languages, they are usually 
    18 written as {\it "hello"} where the double quotes indicate that we dealing with a string. 
   415 written as {\it "hello"} where the double quotes indicate that we dealing with a string. 
    19 Essentially, strings are lists of characters which can be written for example as follows
   416 Essentially, strings are lists of characters which can be written for example as follows
    20 
   417 
    54 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
   451 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
    55 difference between the empty set, or empty language, and the set that 
   452 difference between the empty set, or empty language, and the set that 
    56 contains only the empty string $\{\text{""}\}$: the former has no elements, whereas 
   453 contains only the empty string $\{\text{""}\}$: the former has no elements, whereas 
    57 the latter has one element.
   454 the latter has one element.
    58 
   455 
    59 As seen, there are languages which contain infinitely many strings, like the set of all strings.
   456 
    60 The ``natural'' languages like English, French and so on contain many but only finitely many 
       
    61 strings (namely the ones listed in a good dictionary). It might be therefore be surprising that the
       
    62 language consisting of all email addresses is infinite provided we assume it is defined by the
       
    63 regular expression\footnote{See \url{http://goo.gl/5LoVX7}}
       
    64 
       
    65 \[
       
    66 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
       
    67 \]
       
    68 
       
    69 \noindent
       
    70 One reason is that before the $@$-sign there can be any string you want assuming it 
       
    71 is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely many
       
    72 of those. Similarly the string after the $@$-sign can be any string. However, this does not mean 
       
    73 that every string is an email address. For example
       
    74 
       
    75 \[
       
    76 "\text{\it foo}@\text{\it bar}.\text{\it c}"
       
    77 \]
       
    78 
       
    79 \noindent
       
    80 is not, because the top-level-domains must be of length of at least two. (Note that there is
       
    81 the convention that uppercase letters are treated in email-addresses as if they were
       
    82 lower-case.)
       
    83 \bigskip
       
    84 
   457 
    85 Before we expand on the topic of regular expressions, let us review some operations on
   458 Before we expand on the topic of regular expressions, let us review some operations on
    86 sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. 
   459 sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. 
    87 The union of two sets is written as usual as $A \cup B$. We also need to define the
   460 The union of two sets is written as usual as $A \cup B$. We also need to define the
    88 operation of \emph{concatenating} two sets of strings. This can be defined as
   461 operation of \emph{concatenating} two sets of strings. This can be defined as
   145 \end{tabular} 
   518 \end{tabular} 
   146 \end{tabular}
   519 \end{tabular}
   147 \end{center}
   520 \end{center}
   148 \bigskip
   521 \bigskip
   149 
   522 
   150 \noindent
   523 
   151 \emph{Regular expressions} are meant to conveniently describe languages...at least languages
   524 \subsection*{My Fascination for Regular Expressions}
   152 we are interested in in Computer Science.  For example there is no convenient regular expression
   525 
   153 for describing the English language short of enumerating all English words. 
       
   154 But they seem useful for describing all permitted email addresses, as seen above. 
       
   155 
       
   156 Regular expressions are given by the following grammar:
       
   157 
       
   158 \begin{center}
       
   159 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
       
   160   $r$ & $::=$ &   $\varnothing$         & null\\
       
   161         & $\mid$ & $\epsilon$              & empty string / "" / []\\
       
   162         & $\mid$ & $c$                         & single character\\
       
   163         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
       
   164         & $\mid$ & $r_1 + r_2$            & alternative / choice\\
       
   165         & $\mid$ & $r^*$                      & star (zero or more)\\
       
   166   \end{tabular}
       
   167 \end{center}
       
   168 
       
   169 \noindent
       
   170 Because we overload our notation, there are some subtleties you should be aware of. First, the letter 
       
   171 $c$ stands for any character from the
       
   172 alphabet at hand. Second, we will use parentheses to disambiguate
       
   173 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$.
       
   174 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
       
   175 $r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$,
       
   176 but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
       
   177 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
       
   178 $r_1 + r_2$  is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even 
       
   179 often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form
       
   180 
       
   181 \[
       
   182 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
       
   183 \]
       
   184 
       
   185 \noindent
       
   186 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then
       
   187 a domain name (specified by the regular expression $([\ldots])^+$), then a dot and then a top-level domain. Similarly if
       
   188 we want to specify the regular expression for the string {\it "hello"} we should write
       
   189 
       
   190 \[
       
   191 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
       
   192 \]
       
   193 
       
   194 \noindent
       
   195 but often just write {\it hello}.
       
   196 
       
   197 Another source of confusion might arise from the fact that we use the term \emph{regular expression} for the regular expressions used in ``theory''
       
   198 and also the ones used in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. 
       
   199 In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regular
       
   200 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience 
       
   201 as they can be seen as shorthand for
       
   202 
       
   203 \begin{center}
       
   204 \begin{tabular}{rcl}
       
   205 $r^+$ & $\mapsto$ & $r\cdot r^*$\\
       
   206 $r^?$ & $\mapsto$ & $\epsilon + r$\\
       
   207 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
       
   208 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
       
   209 \end{tabular}
       
   210 \end{center}
       
   211 
       
   212 
       
   213 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
       
   214 expression is supposed to stand for every string \emph{not} matched by a regular expression. We will write
       
   215 such not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand for
       
   216 these kind of not-regular-expressions is. Try to write down the regular expression which can match any
       
   217 string except the two strings {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include
       
   218 $\sim{}r$ in the definition of regular expressions. Whenever we do so, we will state it explicitly.
       
   219 
       
   220 So far we have only considered informally what the \emph{meaning} of a regular expression is.  
       
   221 To do so more formally we will associate with every regular expression a set of strings 
       
   222 that is supposed to be matched by this
       
   223 regular expression. This can be defined recursively  as follows
       
   224 
       
   225 \begin{center}
       
   226 \begin{tabular}{rcl}
       
   227 $L(\varnothing)$  & $\dn$ & $\{\,\}$\\
       
   228 $L(\epsilon)$       & $\dn$ & $\{""\}$\\
       
   229 $L(c)$                  & $\dn$ & $\{"c"\}$\\
       
   230 $L(r_1+ r_2)$      & $\dn$ & $L(r_1) \cup L(r_2)$\\
       
   231 $L(r_1 \cdot r_2)$  & $\dn$ & $\{s_1@ s_2 | s_1 \in L(r_1) \wedge s_2 \in L(r_2) \}$\\
       
   232 $L(r^*)$                   & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\
       
   233 \end{tabular}
       
   234 \end{center}
       
   235 
       
   236 \noindent
       
   237 As a result we can now precisely state what the meaning, for example, of the regular expression 
       
   238 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely 
       
   239 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the 
       
   240 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
       
   241 be matched by this choice. You can now also see why we do not make a difference
       
   242 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they 
       
   243 are not the same regular expression, but have the same meaning. 
       
   244 
       
   245 The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a
       
   246 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
       
   247 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
       
   248 if $s \not\in L(r)$. We leave this for the next lecture.
       
   249 
       
   250 \begin{figure}[p]
       
   251 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler1.scala}}}
       
   252 \caption{Scala code for a web-crawler that can detect broken links in a web-page. It uses
       
   253 the regular expression {\tt http\_pattern} in Line~15 for recognising URL-addresses. It finds
       
   254 all links using the library function {\tt findAllIn} in Line~21.}
       
   255 \end{figure}
       
   256 
       
   257 \begin{figure}[p]
       
   258 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler2.scala}}}
       
   259 \caption{A version of the web-crawler which only follows links in ``my'' domain---since these are the
       
   260 ones I am interested in to fix. It uses the regular expression {\tt my\_urls} in Line~16.
       
   261 The main change is in Line~26 where there is a test whether URL is in ``my'' domain or not.}
       
   262 
       
   263 \end{figure}
       
   264 
       
   265 \begin{figure}[p]
       
   266 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler3.scala}}}
       
   267 \caption{A small email harvester---whenever we download a web-page, we also check whether
       
   268 it contains any email addresses. For this we use the regular expression {\tt email\_pattern} in
       
   269 Line~17.  The main change is in Lines 33 and 34 where all email addresses that can be found in a page are printed.}
       
   270 \end{figure}
       
   271 
   526 
   272 \end{document}
   527 \end{document}
   273 
   528 
   274 %%% Local Variables: 
   529 %%% Local Variables: 
   275 %%% mode: latex
   530 %%% mode: latex