handouts/ho01.tex
changeset 404 245d302791c7
parent 403 564f7584eff1
child 407 4b454a6d1814
equal deleted inserted replaced
403:564f7584eff1 404:245d302791c7
    32 
    32 
    33 \section*{Handout 1}
    33 \section*{Handout 1}
    34 
    34 
    35 This module is about text processing, be it for web-crawlers,
    35 This module is about text processing, be it for web-crawlers,
    36 compilers, dictionaries, DNA-data and so on. When looking for
    36 compilers, dictionaries, DNA-data and so on. When looking for
    37 a particular string in a large text we can use the
    37 a particular string, like $abc$ in a large text we can use the
    38 Knuth-Morris-Pratt algorithm, which is currently the most
    38 Knuth-Morris-Pratt algorithm, which is currently the most
    39 efficient general string search algorithm. But often we do
    39 efficient general string search algorithm. But often we do
    40 \emph{not} just look for a particular string, but for string
    40 \emph{not} just look for a particular string, but for string
    41 patterns. For example in program code we need to identify what
    41 patterns. For example in program code we need to identify what
    42 are the keywords, what are the identifiers etc. A pattern for
    42 are the keywords, what are the identifiers etc. A pattern for
    65 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
    65 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
    66 \end{equation}
    66 \end{equation}
    67 
    67 
    68 \noindent where the first part matches one or more lowercase
    68 \noindent where the first part matches one or more lowercase
    69 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
    69 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
    70 or hyphens. The \pcode{+} ensures the ``one or more''. Then
    70 and hyphens. The \pcode{+} at the end of the brackets ensures
    71 comes the \pcode{@}-sign, followed by the domain name which
    71 the ``one or more''. Then comes the \pcode{@}-sign, followed
    72 must be one or more lowercase letters, digits, underscores,
    72 by the domain name which must be one or more lowercase
    73 dots or hyphens. Note there cannot be an underscore in the
    73 letters, digits, underscores, dots or hyphens. Note there
    74 domain name. Finally there must be a dot followed by the
    74 cannot be an underscore in the domain name. Finally there must
    75 toplevel domain. This toplevel domain must be 2 to 6 lowercase
    75 be a dot followed by the toplevel domain. This toplevel domain
    76 letters including the dot. Example strings which follow this
    76 must be 2 to 6 lowercase letters including the dot. Example
    77 pattern are:
    77 strings which follow this pattern are:
    78 
    78 
    79 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    79 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    80 niceandsimple@example.org
    80 niceandsimple@example.org
    81 very.common@example.co.uk
    81 very.common@example.co.uk
    82 a.little.lengthy.but.fine@dept.example.ac.uk
    82 a.little.lengthy.but.fine@dept.example.ac.uk
   111 
   111 
   112 \noindent Possible identifiers that match this regular expression 
   112 \noindent Possible identifiers that match this regular expression 
   113 are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},
   113 are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},
   114 but not \pcode{_i} and also not \pcode{4you}.
   114 but not \pcode{_i} and also not \pcode{4you}.
   115 
   115 
   116 Many programming language offer libraries that can be used to
   116 Many programming languages offer libraries that can be used to
   117 validate such strings against regular expressions. Also there
   117 validate such strings against regular expressions. Also there
   118 are some common, and I am sure very familiar, ways of how to
   118 are some common, and I am sure very familiar, ways of how to
   119 construct regular expressions. For example in Scala we have: 
   119 construct regular expressions. For example in Scala we have
       
   120 a library implementing the following regular expressions: 
   120 
   121 
   121 \begin{center}
   122 \begin{center}
   122 \begin{tabular}{lp{9cm}}
   123 \begin{tabular}{lp{9cm}}
   123 \pcode{re*} & matches 0 or more occurrences of preceding 
   124 \pcode{re*} & matches 0 or more occurrences of preceding 
   124 expression\\
   125 expression\\
   175 
   176 
   176 \subsection*{Why Study Regular Expressions?}
   177 \subsection*{Why Study Regular Expressions?}
   177 
   178 
   178 Regular expressions were introduced by Kleene in the 1950ies
   179 Regular expressions were introduced by Kleene in the 1950ies
   179 and they have been object of intense study since then. They
   180 and they have been object of intense study since then. They
   180 are nowadays pretty much ubiquitous in computer science. I am
   181 are nowadays pretty much ubiquitous in computer science. There
   181 sure you have come across them before. Why on earth then is
   182 are many libraries implementing regular expressions. I am sure
   182 there any interest in studying them again in depth in this
   183 you have come across them before (remember PRA?). Why on earth
   183 module? Well, one answer is in the following graph about
   184 then is there any interest in studying them again in depth in
       
   185 this module? Well, one answer is in the following graph about
   184 regular expression matching in Python and in Ruby.
   186 regular expression matching in Python and in Ruby.
   185 
   187 
   186 \begin{center}
   188 \begin{center}
   187 \begin{tikzpicture}
   189 \begin{tikzpicture}
   188 \begin{axis}[
   190 \begin{axis}[
   212 seconds for finding out whether a string of 28 \texttt{a}s
   214 seconds for finding out whether a string of 28 \texttt{a}s
   213 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
   215 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
   214 Ruby is even slightly worse.\footnote{In this example Ruby
   216 Ruby is even slightly worse.\footnote{In this example Ruby
   215 uses the slightly different regular expression
   217 uses the slightly different regular expression
   216 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
   218 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
   217 \texttt{a} each occur $n$ times. More test results can be
   219 \texttt{a} each occur $n$ times. More such test cases can be
   218 found at \url{http://www.computerbytesman.com/redos/}.}
   220 found at \url{http://www.computerbytesman.com/redos/}.}
   219 Admittedly, this regular expression is carefully chosen to
   221 Admittedly, this regular expression is carefully chosen to
   220 exhibit this exponential behaviour, but similar ones occur
   222 exhibit this exponential behaviour, but similar ones occur
   221 more often than one wants in ``real life''. They are sometimes
   223 more often than one wants in ``real life''. They are sometimes
   222 called \emph{evil regular expressions} because they have the
   224 called \emph{evil regular expressions} because they have the
   283   \end{tabular}
   285   \end{tabular}
   284 \end{center}
   286 \end{center}
   285 
   287 
   286 \noindent Because we overload our notation, there are some
   288 \noindent Because we overload our notation, there are some
   287 subtleties you should be aware of. When regular expressions
   289 subtleties you should be aware of. When regular expressions
   288 are referred to then $\ZERO$ (in bold font) does not stand for
   290 are referred to, then $\ZERO$ (in bold font) does not stand for
   289 the number zero: rather it is a particular pattern that does
   291 the number zero: rather it is a particular pattern that does
   290 not match any string. Similarly, in the context of regular
   292 not match any string. Similarly, in the context of regular
   291 expressions, $\ONE$ does not stand for the number one but for
   293 expressions, $\ONE$ does not stand for the number one but for
   292 a regular expression that matches the empty string. The letter
   294 a regular expression that matches the empty string. The letter
   293 $c$ stands for any character from the alphabet at hand. Again
   295 $c$ stands for any character from the alphabet at hand. Again
   366 but often just write {\it hello}.
   368 but often just write {\it hello}.
   367 
   369 
   368 If you prefer to think in terms of the implementation
   370 If you prefer to think in terms of the implementation
   369 of regular expressions in Scala, the constructors and
   371 of regular expressions in Scala, the constructors and
   370 classes relate as follows\footnote{More about Scala is 
   372 classes relate as follows\footnote{More about Scala is 
   371 in the handout about A Crash-Course on Scala.}
   373 in the handout about \emph{A Crash-Course on Scala}.}
   372 
   374 
   373 \begin{center}
   375 \begin{center}
   374 \begin{tabular}{rcl}
   376 \begin{tabular}{rcl}
   375 $\ZERO$       & $\mapsto$ & \texttt{ZERO}\\
   377 $\ZERO$       & $\mapsto$ & \texttt{ZERO}\\
   376 $\ONE$        & $\mapsto$ & \texttt{ONE}\\
   378 $\ONE$        & $\mapsto$ & \texttt{ONE}\\
   661 \begin{minipage}{0.8\textwidth}
   663 \begin{minipage}{0.8\textwidth}
   662 \lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
   664 \lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
   663 \end{minipage}
   665 \end{minipage}
   664 \end{center}
   666 \end{center}
   665 
   667 
   666 \caption{Nothing that can be said this\ldots\label{monster}}
   668 \caption{Nothing that can be said about this regular
       
   669 expression\ldots\label{monster}}
   667 \end{figure}
   670 \end{figure}
   668 
   671 
   669 
   672 
   670 \end{document}
   673 \end{document}
   671 
   674