handouts/ho01.tex
changeset 243 8d5aaf5b0031
parent 242 35104ee14f87
child 244 771042ac7c3f
equal deleted inserted replaced
242:35104ee14f87 243:8d5aaf5b0031
     1 \documentclass{article}
     1 \documentclass{article}
     2 \usepackage{../style}
     2 \usepackage{../style}
     3 \usepackage{../langs}
     3 \usepackage{../langs}
     4 \usepackage{../graphics}
     4 \usepackage{../graphics}
     5 \usepackage{../data}
     5 \usepackage{../data}
     6 \usepackage{longtable}
       
     7 
     6 
     8 
     7 
     9 \begin{document}
     8 \begin{document}
    10 
     9 
    11 \section*{Handout 1}
    10 \section*{Handout 1}
    13 This module is about text processing, be it for web-crawlers,
    12 This module is about text processing, be it for web-crawlers,
    14 compilers, dictionaries, DNA-data and so on. When looking for
    13 compilers, dictionaries, DNA-data and so on. When looking for
    15 a particular string in a large text we can use the
    14 a particular string in a large text we can use the
    16 Knuth-Morris-Pratt algorithm, which is currently the most
    15 Knuth-Morris-Pratt algorithm, which is currently the most
    17 efficient general string search algorithm. But often we do
    16 efficient general string search algorithm. But often we do
    18 \emph{not} look for just a particular string, but for string
    17 \emph{not} just look for a particular string, but for string
    19 patterns. For example in programming code we need to identify
    18 patterns. For example in programming code we need to identify
    20 what are the keywords, what are the identifiers etc. Also
    19 what are the keywords, what are the identifiers etc. A pattern
    21 often we face the problem that we are given a string (for
    20 for identifiers could be that they start with a letter,
       
    21 followed by zero or more letters, numbers and the underscore.
       
    22 Also often we face the problem that we are given a string (for
    22 example some user input) and want to know whether it matches a
    23 example some user input) and want to know whether it matches a
    23 particular pattern. For example for excluding some user input
    24 particular pattern. In this way we can exclude user input that
    24 that would otherwise have nasty effects on our program
    25 would otherwise have nasty effects on our program (crashing it
    25 (crashing or going into an infinite loop, if not worse).
    26 or going into an infinite loop, if not worse). \defn{Regular
    26 \defn{Regular expressions} help with conveniently specifying
    27 expressions} help with conveniently specifying such patterns. 
    27 such patterns. 
       
    28 
       
    29 
       
    30 The idea behind regular expressions is that they are a simple
    28 The idea behind regular expressions is that they are a simple
    31 method for describing languages (or sets of strings)...at
    29 method for describing languages (or sets of strings)\ldots at
    32 least languages we are interested in in computer science. For
    30 least languages we are interested in in computer science. For
    33 example there is no convenient regular expression for
    31 example there is no convenient regular expression for
    34 describing the English language short of enumerating all
    32 describing the English language short of enumerating all
    35 English words. But they seem useful for describing for example
    33 English words. But they seem useful for describing for example
    36 email addresses.\footnote{See ``8 Regular Expressions You
    34 email addresses.\footnote{See ``8 Regular Expressions You
    51 toplevel domain. This toplevel domain must be 2 to 6 lowercase
    49 toplevel domain. This toplevel domain must be 2 to 6 lowercase
    52 letters including the dot. Example strings which follow this
    50 letters including the dot. Example strings which follow this
    53 pattern are:
    51 pattern are:
    54 
    52 
    55 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    53 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    56 niceandsimple@example.com
    54 niceandsimple@example.org
    57 very.common@example.org
    55 very.common@example.co.uk
    58 a.little.lengthy.but.fine@dept.example.co.uk
    56 a.little.lengthy.but.fine@dept.example.ac.uk
    59 other.email-with-dash@example.ac.uk
    57 other.email-with-dash@example.edu
    60 \end{lstlisting}
    58 \end{lstlisting}
    61 
    59 
    62 
    60 
    63 \noindent
    61 \noindent
    64 But for example the following two do not:
    62 But for example the following two do not:
    66 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    64 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    67 user@localserver
    65 user@localserver
    68 disposable.style.email.with+symbol@example.com
    66 disposable.style.email.with+symbol@example.com
    69 \end{lstlisting}
    67 \end{lstlisting}
    70 
    68 
       
    69 Identifiers, or variables, in program text are often required
       
    70 to satisfy the constraints that they start with a letter and
       
    71 then can be followed by zero or more letters or numbers and
       
    72 also can include underscores, but not as the first character.
       
    73 Such identifiers can be recognised with the regular expression
       
    74 
       
    75 \begin{center}
       
    76 \pcode{[a-zA-Z] [a-zA-Z0-9_]*}
       
    77 \end{center}
       
    78 
       
    79 \noindent Possible identifiers that match this regular expression 
       
    80 are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},
       
    81 but not \pcode{_i} and also not \pcode{4you}.
       
    82 
    71 Many programming language offer libraries that can be used to
    83 Many programming language offer libraries that can be used to
    72 validate such strings against regular expressions, like the
    84 validate such strings against regular expressions. Also there
    73 one for email addresses in \eqref{email}. There are some
    85 are some common, and I am sure very familiar, ways how to
    74 common, and I am sure very familiar, ways how to construct
    86 construct regular expressions. For example in Scala we have: 
    75 regular expressions. For example in Scala we have 
    87 
    76 
    88 \begin{center}
    77 \begin{center}
    89 \begin{tabular}{lp{9cm}}
    78 \begin{longtable}{lp{9cm}}
       
    79 \pcode{re*} & matches 0 or more occurrences of preceding 
    90 \pcode{re*} & matches 0 or more occurrences of preceding 
    80 expression\\
    91 expression\\
    81 \pcode{re+} & matches 1 or more occurrences of preceding
    92 \pcode{re+} & matches 1 or more occurrences of preceding
    82 expression\\
    93 expression\\
    83 \pcode{re?} &	 matches 0 or 1 occurrence of preceding 
    94 \pcode{re?} &	 matches 0 or 1 occurrence of preceding 
    84 expression\\
    95 expression\\
    85 \pcode{re\{n\}}	& matches exactly \pcode{n} number of 
    96 \pcode{re\{n\}}	& matches exactly \pcode{n} number of 
    86 occurrences\\
    97 occurrences of preceding  expression\\
    87 \pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}
    98 \pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}
    88 occurences of the preceding expression\\
    99 occurences of the preceding expression\\
    89 \pcode{[...]} & matches any single character inside the 
   100 \pcode{[...]} & matches any single character inside the 
    90 brackets\\
   101 brackets\\
    91 \pcode{[^...]} & matches any single character not inside the 
   102 \pcode{[^...]} & matches any single character not inside the 
    92 brackets\\
   103 brackets\\
    93 \pcode{..-..} & character ranges\\
   104 \pcode{..-..} & character ranges\\
    94 \pcode{\\d} &	matches digits; equivalent to \pcode{[0-9]}
   105 \pcode{\\d} &	matches digits; equivalent to \pcode{[0-9]}
    95 \end{longtable}
   106 \end{tabular}
    96 \end{center}
   107 \end{center}
    97 
   108 
    98 \noindent With this you can figure out the purpose of the
   109 \noindent With this table you can figure out the purpose of
    99 regular expressions in the web-crawlers shown Figures
   110 the regular expressions in the web-crawlers shown Figures
   100 \ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note the
   111 \ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note,
   101 regular expression for http-addresses in web-pages:
   112 however, the regular expression for http-addresses in
       
   113 web-pages is meant to be
   102 
   114 
   103 \[
   115 \[
   104 \pcode{"https?://[^"]*"}
   116 \pcode{"https?://[^"]*"}
   105 \]
   117 \]
   106 
   118 
   114 
   126 
   115 \[
   127 \[
   116 \pcode{""""https?://[^"]*"""".r}
   128 \pcode{""""https?://[^"]*"""".r}
   117 \]
   129 \]
   118 
   130 
   119 \noindent Not also that the convention in Scala is that
   131 \noindent Note also that the convention in Scala is that
   120 \texttt{.r} converts a string into a regular expression. I
   132 \texttt{.r} converts a string into a regular expression. I
   121 leave it to you to ponder whether this regular expression
   133 leave it to you to ponder whether this regular expression
   122 really captures all possible web-addresses.\bigskip
   134 really captures all possible web-addresses.
       
   135 
       
   136 \subsection*{Why Study Regular Expressions?}
   123 
   137 
   124 Regular expressions were introduced by Kleene in the 1950ies
   138 Regular expressions were introduced by Kleene in the 1950ies
   125 and they have been object of intense study since then. They
   139 and they have been object of intense study since then. They
   126 are nowadays pretty much ubiquitous in computer science. I am
   140 are nowadays pretty much ubiquitous in computer science. I am
   127 sure you have come across them before. Why on earth then is
   141 sure you have come across them before. Why on earth then is
   158 	\end{scope}	
   172 	\end{scope}	
   159 \end{tikzpicture}
   173 \end{tikzpicture}
   160 \end{center}
   174 \end{center}
   161 
   175 
   162 \noindent This graph shows that Python needs approximately 29
   176 \noindent This graph shows that Python needs approximately 29
   163 seconds in order to find out that a string of 28 \texttt{a}s
   177 seconds for finding out whether a string of 28 \texttt{a}s
   164 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
   178 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
   165 Ruby is even slightly worse.\footnote{In this example Ruby
   179 Ruby is even slightly worse.\footnote{In this example Ruby
   166 uses the slightly different regular expression
   180 uses the slightly different regular expression
   167 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
   181 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
   168 \texttt{a} each occur $n$ times.} Admittedly, this regular
   182 \texttt{a} each occur $n$ times.} Admittedly, this regular
   169 expression is carefully chosen to exhibit this exponential
   183 expression is carefully chosen to exhibit this exponential
   170 behaviour, but similar ones occur more often than one wants in
   184 behaviour, but similar ones occur more often than one wants in
   171 ``real life''. They are sometimes called \emph{evil regular
   185 ``real life''. They are sometimes called \emph{evil regular
   172 expressions} because they have the potential to make regular
   186 expressions} because they have the potential to make regular
   173 expression matching engines topple over, like in Python and
   187 expression matching engines to topple over, like in Python and
   174 Ruby. The problem is that this can have some serious
   188 Ruby. The problem with evil regular expressions is that they
   175 consequences, for example, if you use them in your
   189 can have some serious consequences, for example, if you use
   176 web-application, because hackers can look for these instances
   190 them in your web-application. The reason is that hackers can
   177 where the matching engine behaves badly and mount a nice 
   191 look for these instances where the matching engine behaves
   178 DoS-attack against your application.
   192 badly and then mount a nice DoS-attack against your
   179 
   193 application.
   180 It will be instructive to look behind the ``scenes''to find
   194 
       
   195 It will be instructive to look behind the ``scenes'' to find
   181 out why Python and Ruby (and others) behave so badly when
   196 out why Python and Ruby (and others) behave so badly when
   182 matching with evil regular expressions. But we will also look
   197 matching with evil regular expressions. But we will also look
   183 at a relatively simple algorithm that solves this problem much
   198 at a relatively simple algorithm that solves this problem much
   184 better than Python and Ruby do\ldots actually it will be two
   199 better than Python and Ruby do\ldots actually it will be two
   185 versions of the algorithm: the first one will be able to
   200 versions of the algorithm: the first one will be able to
   186 process strings of approximately 1,000 \texttt{a}s in 30
   201 process strings of approximately 1,000 \texttt{a}s in 30
   187 seconds, while the second version will even be able to
   202 seconds, while the second version will even be able to process
   188 process up to 12,000 in less than 10(!) seconds, see the graph
   203 up to 12,000 in less than 10(!) seconds, see the graph below:
   189 below:
       
   190 
   204 
   191 \begin{center}
   205 \begin{center}
   192 \begin{tikzpicture}[y=.1cm, x=.0006cm]
   206 \begin{tikzpicture}[y=.1cm, x=.0006cm]
   193  	%axis
   207  	%axis
   194 	\draw (0,0) -- coordinate (x axis mid) (12000,0);
   208 	\draw (0,0) -- coordinate (x axis mid) (12000,0);
   210 \end{tikzpicture}
   224 \end{tikzpicture}
   211 \end{center}
   225 \end{center}
   212 
   226 
   213 \subsection*{Basic Regular Expressions}
   227 \subsection*{Basic Regular Expressions}
   214 
   228 
   215 The regular expressions shown above we will call
   229 The regular expressions shown above, for example for Scala, we
   216 \defn{extended regular expressions}. The ones we will mainly
   230 will call \emph{extended regular expressions}. The ones we
   217 study are \emph{basic regular expressions}, which by
   231 will mainly study in this module are \emph{basic regular
   218 convention we will just call regular expressions, if it is
   232 expressions}, which by convention we will just call
   219 clear what we mean. The attraction of (basic) regular
   233 \emph{regular expressions}, if it is clear what we mean. The
   220 expressions is that many features of the extended one are just
   234 attraction of (basic) regular expressions is that many
   221 syntactic suggar. (Basic) regular expressions are defined by
   235 features of the extended ones are just syntactic sugar.
   222 the following grammar:
   236 (Basic) regular expressions are defined by the following
       
   237 grammar:
   223 
   238 
   224 \begin{center}
   239 \begin{center}
   225 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
   240 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
   226   $r$ & $::=$ &   $\varnothing$         & null\\
   241   $r$ & $::=$ &   $\varnothing$         & null\\
   227         & $\mid$ & $\epsilon$           & empty string / "" / []\\
   242         & $\mid$ & $\epsilon$           & empty string / "" / []\\
   228         & $\mid$ & $c$                  & single character\\
   243         & $\mid$ & $c$                  & single character\\
       
   244         & $\mid$ & $r_1 + r_2$          & alternative / choice\\
   229         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
   245         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
   230         & $\mid$ & $r_1 + r_2$          & alternative / choice\\
       
   231         & $\mid$ & $r^*$                & star (zero or more)\\
   246         & $\mid$ & $r^*$                & star (zero or more)\\
   232   \end{tabular}
   247   \end{tabular}
   233 \end{center}
   248 \end{center}
   234 
   249 
   235 \noindent Because we overload our notation, there are some
   250 \noindent Because we overload our notation, there are some
   236 subtleties you should be aware of. First, when regular
   251 subtleties you should be aware of. When regular expressions
   237 expressions are referred to then $\varnothing$ does not stand
   252 are referred to then $\varnothing$ does not stand for the
   238 for the empty set: it is a particular pattern that does not
   253 empty set: rather it is a particular pattern that does not
   239 match any string. Similarly, in the context of regular
   254 match any string. Similarly, in the context of regular
   240 expressions, $\epsilon$ does not stand for the empty string
   255 expressions, $\epsilon$ does not stand for the empty string
   241 (as in many places in the literature) but for a pattern that
   256 (as in many places in the literature) but for a regular
   242 matches the empty string. Second, the letter $c$ stands for
   257 expression that matches the empty string. The letter $c$
   243 any character from the alphabet at hand. Again in the context
   258 stands for any character from the alphabet at hand. Again in
   244 of regular expressions, it is a particular pattern that can
   259 the context of regular expressions, it is a particular pattern
   245 match the specified string. Third, you should also be careful
   260 that can match the specified character. You should also be
   246 with the our overloading of the star: assuming you have read
   261 careful with our overloading of the star: assuming you have
   247 the handout about our basic mathematical notation, you will
   262 read the handout about our basic mathematical notation, you
   248 see that in the context of languages (sets of strings) the
   263 will see that in the context of languages (sets of strings)
   249 star stands for an operation on languages. While $r^*$ stands
   264 the star stands for an operation on languages. While here
   250 for a regular expression, the operation on sets is defined as
   265 $r^*$ stands for a regular expression, which is different from
       
   266 the operation on sets is defined as
   251 
   267 
   252 \[
   268 \[
   253 A^* \dn \bigcup_{0\le n} A^n
   269 A^* \dn \bigcup_{0\le n} A^n
   254 \]
   270 \]
   255 
   271 
   256 We will use parentheses to disambiguate regular expressions.
   272 We will use parentheses to disambiguate regular expressions.
   257 Parentheses are not really part of a regular expression, and
   273 Parentheses are not really part of a regular expression, and
   258 indeed we do not need them in our code because there the tree
   274 indeed we do not need them in our code because there the tree
   259 structure is always clear. But for writing them down in a more
   275 structure of regular expressions is always clear. But for
   260 mathematical fashion, parentheses will be helpful. For example
   276 writing them down in a more mathematical fashion, parentheses
   261 we will write $(r_1 + r_2)^*$, which is different from, say
   277 will be helpful. For example we will write $(r_1 + r_2)^*$,
   262 $r_1 + (r_2)^*$. The former means roughly zero or more times
   278 which is different from, say $r_1 + (r_2)^*$. The former means
   263 $r_1$ or $r_2$, while the latter means $r_1$ or zero or more
   279 roughly zero or more times $r_1$ or $r_2$, while the latter
   264 times $r_2$. This will turn out are two different pattern,
   280 means $r_1$ or zero or more times $r_2$. This will turn out
   265 which match in general different strings. We should also write
   281 are two different patterns, which match in general different
   266 $(r_1 + r_2) + r_3$, which is different from the regular
   282 strings. We should also write $(r_1 + r_2) + r_3$, which is
   267 expression $r_1 + (r_2 + r_3)$, but in case of $+$ and $\cdot$
   283 different from the regular expression $r_1 + (r_2 + r_3)$, but
   268 we actually do not care about the order and just write $r_1 +
   284 in case of $+$ and $\cdot$ we actually do not care about the
   269 r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, respectively. The
   285 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
   270 reasons for this will become clear shortly. In the literature
   286 \cdot r_3$, respectively. The reasons for this will become
   271 you will often find that the choice $r_1 + r_2$ is written as
   287 clear shortly. In the literature you will often find that the
   272 $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the
   288 choice $r_1 + r_2$ is written as $r_1\mid{}r_2$ or
   273 convention in the literature, we will often omit the $\cdot$
   289 $r_1\mid\mid{}r_2$. Also following the convention in the
   274 all together. This is to make some concrete regular
   290 literature, we will often omit the $\cdot$ all together. This
   275 expressions more readable. For example the regular expression
   291 is to make some concrete regular expressions more readable.
   276 for email addresses shown in \eqref{email} would look like
   292 For example the regular expression for email addresses shown
       
   293 in \eqref{email} would look like
   277 
   294 
   278 \[
   295 \[
   279 \texttt{[...]+} \;\cdot\;  \texttt{@} \;\cdot\; 
   296 \texttt{[...]+} \;\cdot\;  \texttt{@} \;\cdot\; 
   280 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; 
   297 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; 
   281 \texttt{[...]\{2,6\}}
   298 \texttt{[...]\{2,6\}}
   310 
   327 
   311 A source of confusion might arise from the fact that we
   328 A source of confusion might arise from the fact that we
   312 use the term \emph{basic regular expression} for the regular
   329 use the term \emph{basic regular expression} for the regular
   313 expressions used in ``theory'' and defined above, and
   330 expressions used in ``theory'' and defined above, and
   314 \emph{extended regular expression} for the ones used in
   331 \emph{extended regular expression} for the ones used in
   315 ``practice'', for example Scala. If runtime is not of an
   332 ``practice'', for example in Scala. If runtime is not an
   316 issue, then the latter can be seen as some syntactic sugar of
   333 issue, then the latter can be seen as syntactic sugar of
   317 the former. Fo example we could replace
   334 the former. For example we could replace
   318 
   335 
   319 \begin{center}
   336 \begin{center}
   320 \begin{tabular}{rcl}
   337 \begin{tabular}{rcl}
   321 $r^+$ & $\mapsto$ & $r\cdot r^*$\\
   338 $r+$ & $\mapsto$ & $r\cdot r^*$\\
   322 $r?$ & $\mapsto$ & $\epsilon + r$\\
   339 $r?$ & $\mapsto$ & $\epsilon + r$\\
   323 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
   340 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
   324 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
   341 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
   325 \end{tabular}
   342 \end{tabular}
   326 \end{center}
   343 \end{center}
   327 
   344 
       
   345 
   328 \subsection*{The Meaning of Regular Expressions}
   346 \subsection*{The Meaning of Regular Expressions}
   329 
   347 
   330 
       
   331 So far we have only considered informally what the
   348 So far we have only considered informally what the
   332 \emph{meaning} of a regular expression is. This is no good for
   349 \emph{meaning} of a regular expression is. This is not good
   333 specifications of what algorithms are supposed to do or which
   350 enough for specifications of what algorithms are supposed to
   334 problems they are supposed to solve.
   351 do or which problems they are supposed to solve.
   335 
   352 
   336 To do so more formally we will associate with every regular
   353 To define the meaning of a regular expression we will
   337 expression a language, or set of strings, that is supposed to
   354 associate with every regular expression a language, or set of
   338 be matched by this regular expression. To understand what is 
   355 strings. This language contains all the strings the regular
   339 going on here it is crucial that you have also read the 
   356 expression is supposed to match. To understand what is going
   340 handout about our basic mathematical notations.
   357 on here it is crucial that you have read the handout
   341 
   358 about basic mathematical notations.
   342 The meaning of a regular expression can be defined recursively
   359 
   343 as follows
   360 The \defn{meaning of a regular expression} can be defined
       
   361 by a recursive function called $L$ (for language), which
       
   362 is defined as follows
   344 
   363 
   345 \begin{center}
   364 \begin{center}
   346 \begin{tabular}{rcl}
   365 \begin{tabular}{rcl}
   347 $L(\varnothing)$  & $\dn$ & $\varnothing$\\
   366 $L(\varnothing)$  & $\dn$ & $\varnothing$\\
   348 $L(\epsilon)$     & $\dn$ & $\{[]\}$\\
   367 $L(\epsilon)$     & $\dn$ & $\{[]\}$\\
   349 $L(c)$            & $\dn$ & $\{"c"\}$\\
   368 $L(c)$            & $\dn$ & $\{"c"\}$\\
   350 $L(r_1+ r_2)$     & $\dn$ & $L(r_1) \cup L(r_2)$\\
   369 $L(r_1+ r_2)$     & $\dn$ & $L(r_1) \cup L(r_2)$\\
   351 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) @ L(r_2)$\\
   370 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\
   352 $L(r^*)$           & $\dn$ & $(L(r))^*$\\
   371 $L(r^*)$           & $\dn$ & $(L(r))^*$\\
   353 \end{tabular}
   372 \end{tabular}
   354 \end{center}
   373 \end{center}
   355 
   374 
   356 \noindent
   375 \noindent As a result we can now precisely state what the
   357 As a result we can now precisely state what the meaning, for example, of the regular expression 
   376 meaning, for example, of the regular expression $h \cdot
   358 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely 
   377 e \cdot l \cdot l \cdot o$ is, namely
   359 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the 
   378 
   360 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
   379 \[
   361 be matched by this choice. You can now also see why we do not make a difference
   380 L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot
   362 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they 
   381 {\it o}) = \{"hello"\}
   363 are not the same regular expression, but have the same meaning. 
   382 \]
   364 
   383 
   365 The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a
   384 \noindent This is expected because this regular expression 
   366 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
   385 is only supposed to match the ``$hello$''-string. Similarly if
   367 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
   386 we have the choice-regular-expression $a + b$, its meaning is
   368 if $s \not\in L(r)$. We leave this for the next lecture.
   387 
       
   388 \[
       
   389 L(a + b) = \{"a", "b"\}
       
   390 \]
       
   391 
       
   392 \noindent You can now also see why we do not make a difference
       
   393 between the different regular expressions $(r_1 + r_2) + r_3$
       
   394 and $r_1 + (r_2 + r_3)$\ldots they are not the same regular
       
   395 expression, but have the same meaning. 
       
   396 
       
   397 \begin{eqnarray*}
       
   398 L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\
       
   399                      & = & L(r_1) \cup L(r_2) \cup L(r_3)\\
       
   400                      & = & L(r_1) \cup L(r_2 + r_3)\\
       
   401                      & = & L(r_1 + (r_2 + r_3))
       
   402 \end{eqnarray*}
       
   403 
       
   404 The point of the definition of $L$ is that we can use it to
       
   405 precisely specify when a string $s$ is matched by a regular
       
   406 expression $r$, namely if and only if $s \in L(r)$. In fact we
       
   407 will write a program \pcode{match} that takes any string $s$
       
   408 and any regular expression $r$ as argument and returns
       
   409 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
       
   410 L(r)$. We leave this for the next lecture.
       
   411 
       
   412 There is one more feature of regular expressions that is worth
       
   413 mentioning. Given some strings, there are in general many
       
   414 different regular expressions that can recognise these
       
   415 strings. This is obvious with the regular expression $a + b$
       
   416 which can match the strings $a$ and $b$. But also the regular
       
   417 expression $b + a$ would match the same strings. However,
       
   418 sometimes it is not so obvious whether two regular expressions
       
   419 match the same strings: for example do $r^*$ and $\epsilon + r
       
   420 \cdot r^*$ match the same strings? What about $\varnothing^*$
       
   421 and $\epsilon^*$? This suggests the following relation between
       
   422 \defn{equivalent regular expressions}: 
       
   423 
       
   424 \[
       
   425 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
       
   426 \]
       
   427 
       
   428 \noindent That means two regular expressions are equivalent if
       
   429 they match the same set of strings. Therefore we do not really
       
   430 distinguish between the different regular expression $(r_1 +
       
   431 r_2) + r_3$ and $r_1 + (r_2 + r_3)$, because they are
       
   432 equivalent. I leave you to the question whether
       
   433 
       
   434 \[
       
   435 \varnothing^* \equiv \epsilon^*
       
   436 \]
       
   437 
       
   438 \noindent holds. Such equivalences will be important for out
       
   439 matching algorithm, because we can use them to simplify 
       
   440 regular expressions. 
       
   441 
       
   442 \subsection*{My Fascination for Regular Expressions}
       
   443 
       
   444 Up until a few years ago I was not really interested in
       
   445 regular expressions. They have been studied for the last 60
       
   446 years (by smarter people than me)---surely nothing new can be
       
   447 found out about them. I even have the vague recollection that
       
   448 I did not quite understand them during my study. If I remember
       
   449 correctly,\footnote{That was really a long time ago.} I got
       
   450 utterly confused about $\epsilon$ and the empty
       
   451 string.\footnote{Obviously the lecturer must have been bad.}
       
   452 Since my study, I have used regular expressions for
       
   453 implementing lexers and parsers as I have always been
       
   454 interested in all kinds of programming languages and
       
   455 compilers, which invariably need regular expression in some
       
   456 form or shape. 
       
   457 
       
   458 To understand my fascination nowadays with regular
       
   459 expressions, you need to know that my main scientific interest
       
   460 for the last 14 years has been with in theorem provers. I am a
       
   461 core developer of one of
       
   462 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
       
   463 provers are systems in which you can formally reason about
       
   464 mathematical concepts, but also about programs. In this way
       
   465 they can help with writing bug-free code. Theorem provers have
       
   466 proved already their value in a number of systems (even in
       
   467 terms of hard cash), but they are still clunky and difficult
       
   468 to use by average programmers.
       
   469 
       
   470 Anyway, in about 2011 I came across the notion of
       
   471 \defn{derivatives of regular expressions}. This notion allows
       
   472 one to do almost all calculations in regular language theory
       
   473 on the level of regular expressions, not needing any automata.
       
   474 This is important because automata are graphs and it is rather
       
   475 difficult to reason about graphs in theorem provers. In
       
   476 contrast, to reason about regular expressions is easy-peasy in
       
   477 theorem provers. Is this important? I think yes, because
       
   478 according to Kuklewicz nearly all POSIX-based regular
       
   479 expression matchers are
       
   480 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
       
   481 With my PhD student Fahad Ausaf I am currently working on
       
   482 proving the correctness for one such algorithm that was
       
   483 proposed by Sulzmann and Lu in
       
   484 2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an
       
   485 attractive results since we will be able to prove that the
       
   486 algorithm is really correct, but also that the machine code(!)
       
   487 that implements this code efficiently is correct. Writing
       
   488 programs in this way does not leave any room for potential
       
   489 errors or bugs. How nice!
       
   490 
       
   491 What also helped with my fascination with regular expressions
       
   492 is that we could indeed find out new things about them that
       
   493 have surprised some experts in the field of regular
       
   494 expressions. Together with two colleagues from China, I was
       
   495 able to prove the Myhill-Nerode theorem by only using regular
       
   496 expressions and the notion of derivatives. Earlier versions of
       
   497 this theorem used always automata in the proof. Using this
       
   498 theorem we can show that regular languages are closed under
       
   499 complementation, something which Gasarch in his
       
   500 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
       
   501 shown via automata. Even sombody who has written a 700+-page
       
   502 book\footnote{\url{http://goo.gl/fD0eHx}} on regular
       
   503 exprssions did not know better. Well, we showed it can also be
       
   504 done with regular expressions only. What a feeling if you
       
   505 are an outsider to the subject!
       
   506 
       
   507 To conclude: Despite my early ignorance about regular
       
   508 expressions, I find them now quite interesting. They have a
       
   509 beautiful mathematical theory behind them. They have practical
       
   510 importance (remember the shocking runtime of the regular
       
   511 expression matchers in Python and Ruby in some instances).
       
   512 People who are not very familiar with the mathematical
       
   513 background get them consistently wrong. The hope is that we
       
   514 can do better in the future---for example by proving that the
       
   515 algorithms actually satisfy their specification and that the
       
   516 corresponding implementations do not contain any bugs. We are
       
   517 close, but not yet quite there.
   369 
   518 
   370 \begin{figure}[p]
   519 \begin{figure}[p]
   371 \lstinputlisting{../progs/crawler1.scala}
   520 \lstinputlisting{../progs/crawler1.scala}
   372 \caption{The Scala code for a simple web-crawler that checks
   521 \caption{The Scala code for a simple web-crawler that checks
   373 for broken links in a web-page. It uses the regular expression
   522 for broken links in a web-page. It uses the regular expression
   398 \texttt{email\_pattern} in Line~16. The main change is in Line
   547 \texttt{email\_pattern} in Line~16. The main change is in Line
   399 32 where all email addresses that can be found in a page are
   548 32 where all email addresses that can be found in a page are
   400 printed.\label{crawler3}}
   549 printed.\label{crawler3}}
   401 \end{figure}
   550 \end{figure}
   402 
   551 
   403 \pagebreak
       
   404 Lets start
       
   405 with what we mean by \emph{strings}. Strings (they are also
       
   406 sometimes referred to as \emph{words}) are lists of characters
       
   407 drawn from an \emph{alphabet}. If nothing else is specified,
       
   408 we usually assume the alphabet consists of just the lower-case
       
   409 letters $a$, $b$, \ldots, $z$. Sometimes, however, we
       
   410 explicitly restrict strings to contain, for example, only the
       
   411 letters $a$ and $b$. In this case we say the alphabet is the
       
   412 set $\{a, b\}$.
       
   413 
       
   414 There are many ways how we can write down strings. In programming languages, they are usually 
       
   415 written as {\it "hello"} where the double quotes indicate that we dealing with a string. 
       
   416 Essentially, strings are lists of characters which can be written for example as follows
       
   417 
       
   418 \[
       
   419 [\text{\it h, e, l, l, o}]
       
   420 \]
       
   421 
       
   422 \noindent
       
   423 The important point is that we can always decompose strings. For example, we will often consider the
       
   424 first character of a string, say $h$, and the ``rest''  of a string say {\it "ello"} when making definitions 
       
   425 about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as 
       
   426 the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated}, 
       
   427 which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenation
       
   428 gives {\it "foobar"}.
       
   429 
       
   430 We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$
       
   431 is
       
   432 
       
   433 \[
       
   434 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
       
   435 \]
       
   436 
       
   437 \noindent
       
   438 Any set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behind
       
   439 this choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like 
       
   440 
       
   441 \[
       
   442 \{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\}
       
   443 \]
       
   444 
       
   445 \noindent
       
   446 then we have essentially described the English language, or more precisely all
       
   447 strings that can be used in a sentence of the English language. French would be a
       
   448 different set of strings, and so on. In the context of this course, a language might 
       
   449 not necessarily make sense from a natural language point of view. For example
       
   450 the set of all strings shown above is a language, as is the empty set (of strings). The
       
   451 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
       
   452 difference between the empty set, or empty language, and the set that 
       
   453 contains only the empty string $\{\text{""}\}$: the former has no elements, whereas 
       
   454 the latter has one element.
       
   455 
       
   456 
       
   457 
       
   458 Before we expand on the topic of regular expressions, let us review some operations on
       
   459 sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. 
       
   460 The union of two sets is written as usual as $A \cup B$. We also need to define the
       
   461 operation of \emph{concatenating} two sets of strings. This can be defined as
       
   462 
       
   463 \[
       
   464 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
       
   465 \]
       
   466 
       
   467 \noindent
       
   468 which essentially means take the first string from the set $A$ and concatenate it with every
       
   469 string in the set $B$, then take the second string from $A$ do the same and so on. You might
       
   470 like to think about what this definition means in case $A$ or $B$ is the empty set.
       
   471 
       
   472 We also need to define
       
   473 the power of a set of strings, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows
       
   474 
       
   475 \begin{center}
       
   476 \begin{tabular}{rcl}
       
   477 $A^0$ & $\dn$ & $\{[\,]\}$ \\
       
   478 $A^{n+1}$ & $\dn$ & $A @ A^n$\\
       
   479 \end{tabular}
       
   480 \end{center}
       
   481 
       
   482 \noindent
       
   483 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
       
   484 of every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is
       
   485 
       
   486 \[
       
   487 A^* \dn \bigcup_{0\le n} A^n
       
   488 \]
       
   489 
       
   490 \noindent
       
   491 This definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one 
       
   492 copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore 
       
   493 have 
       
   494 
       
   495 \[
       
   496 A^* = \{"", "a", "aa", "aaa", \ldots\}
       
   497 \]
       
   498 
       
   499 \noindent
       
   500 Be aware that these operations sometimes have quite non-intuitive properties, for example
       
   501 
       
   502 \begin{center}
       
   503 \begin{tabular}{@{}ccc@{}}
       
   504 \begin{tabular}{@{}r@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   505 $A \cup \varnothing$ & $=$ & $A$\\
       
   506 $A \cup A$ & $=$ & $A$\\
       
   507 $A \cup B$ & $=$ & $B \cup A$\\
       
   508 \end{tabular} &
       
   509 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   510 $A @ B$ & $\not =$ & $B @ A$\\
       
   511 $A  @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\
       
   512 $A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
       
   513 \end{tabular} &
       
   514 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
   515 $\varnothing^*$ & $=$ & $\{""\}$\\
       
   516 $\{""\}^*$ & $=$ & $\{""\}$\\
       
   517 $A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\
       
   518 \end{tabular} 
       
   519 \end{tabular}
       
   520 \end{center}
       
   521 \bigskip
       
   522 
       
   523 
       
   524 \subsection*{My Fascination for Regular Expressions}
       
   525 
   552 
   526 
   553 
   527 \end{document}
   554 \end{document}
   528 
   555 
   529 %%% Local Variables: 
   556 %%% Local Variables: