32 |
32 |
33 \section*{Handout 1} |
33 \section*{Handout 1} |
34 |
34 |
35 This module is about text processing, be it for web-crawlers, |
35 This module is about text processing, be it for web-crawlers, |
36 compilers, dictionaries, DNA-data and so on. When looking for |
36 compilers, dictionaries, DNA-data and so on. When looking for |
37 a particular string in a large text we can use the |
37 a particular string, like $abc$ in a large text we can use the |
38 Knuth-Morris-Pratt algorithm, which is currently the most |
38 Knuth-Morris-Pratt algorithm, which is currently the most |
39 efficient general string search algorithm. But often we do |
39 efficient general string search algorithm. But often we do |
40 \emph{not} just look for a particular string, but for string |
40 \emph{not} just look for a particular string, but for string |
41 patterns. For example in program code we need to identify what |
41 patterns. For example in program code we need to identify what |
42 are the keywords, what are the identifiers etc. A pattern for |
42 are the keywords, what are the identifiers etc. A pattern for |
65 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}} |
65 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}} |
66 \end{equation} |
66 \end{equation} |
67 |
67 |
68 \noindent where the first part matches one or more lowercase |
68 \noindent where the first part matches one or more lowercase |
69 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
69 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots |
70 or hyphens. The \pcode{+} ensures the ``one or more''. Then |
70 and hyphens. The \pcode{+} at the end of the brackets ensures |
71 comes the \pcode{@}-sign, followed by the domain name which |
71 the ``one or more''. Then comes the \pcode{@}-sign, followed |
72 must be one or more lowercase letters, digits, underscores, |
72 by the domain name which must be one or more lowercase |
73 dots or hyphens. Note there cannot be an underscore in the |
73 letters, digits, underscores, dots or hyphens. Note there |
74 domain name. Finally there must be a dot followed by the |
74 cannot be an underscore in the domain name. Finally there must |
75 toplevel domain. This toplevel domain must be 2 to 6 lowercase |
75 be a dot followed by the toplevel domain. This toplevel domain |
76 letters including the dot. Example strings which follow this |
76 must be 2 to 6 lowercase letters including the dot. Example |
77 pattern are: |
77 strings which follow this pattern are: |
78 |
78 |
79 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
79 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] |
80 niceandsimple@example.org |
80 niceandsimple@example.org |
81 very.common@example.co.uk |
81 very.common@example.co.uk |
82 a.little.lengthy.but.fine@dept.example.ac.uk |
82 a.little.lengthy.but.fine@dept.example.ac.uk |
111 |
111 |
112 \noindent Possible identifiers that match this regular expression |
112 \noindent Possible identifiers that match this regular expression |
113 are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name}, |
113 are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name}, |
114 but not \pcode{_i} and also not \pcode{4you}. |
114 but not \pcode{_i} and also not \pcode{4you}. |
115 |
115 |
116 Many programming language offer libraries that can be used to |
116 Many programming languages offer libraries that can be used to |
117 validate such strings against regular expressions. Also there |
117 validate such strings against regular expressions. Also there |
118 are some common, and I am sure very familiar, ways of how to |
118 are some common, and I am sure very familiar, ways of how to |
119 construct regular expressions. For example in Scala we have: |
119 construct regular expressions. For example in Scala we have |
|
120 a library implementing the following regular expressions: |
120 |
121 |
121 \begin{center} |
122 \begin{center} |
122 \begin{tabular}{lp{9cm}} |
123 \begin{tabular}{lp{9cm}} |
123 \pcode{re*} & matches 0 or more occurrences of preceding |
124 \pcode{re*} & matches 0 or more occurrences of preceding |
124 expression\\ |
125 expression\\ |
175 |
176 |
176 \subsection*{Why Study Regular Expressions?} |
177 \subsection*{Why Study Regular Expressions?} |
177 |
178 |
178 Regular expressions were introduced by Kleene in the 1950ies |
179 Regular expressions were introduced by Kleene in the 1950ies |
179 and they have been object of intense study since then. They |
180 and they have been object of intense study since then. They |
180 are nowadays pretty much ubiquitous in computer science. I am |
181 are nowadays pretty much ubiquitous in computer science. There |
181 sure you have come across them before. Why on earth then is |
182 are many libraries implementing regular expressions. I am sure |
182 there any interest in studying them again in depth in this |
183 you have come across them before (remember PRA?). Why on earth |
183 module? Well, one answer is in the following graph about |
184 then is there any interest in studying them again in depth in |
|
185 this module? Well, one answer is in the following graph about |
184 regular expression matching in Python and in Ruby. |
186 regular expression matching in Python and in Ruby. |
185 |
187 |
186 \begin{center} |
188 \begin{center} |
187 \begin{tikzpicture} |
189 \begin{tikzpicture} |
188 \begin{axis}[ |
190 \begin{axis}[ |
212 seconds for finding out whether a string of 28 \texttt{a}s |
214 seconds for finding out whether a string of 28 \texttt{a}s |
213 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. |
215 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. |
214 Ruby is even slightly worse.\footnote{In this example Ruby |
216 Ruby is even slightly worse.\footnote{In this example Ruby |
215 uses the slightly different regular expression |
217 uses the slightly different regular expression |
216 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
218 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
217 \texttt{a} each occur $n$ times. More test results can be |
219 \texttt{a} each occur $n$ times. More such test cases can be |
218 found at \url{http://www.computerbytesman.com/redos/}.} |
220 found at \url{http://www.computerbytesman.com/redos/}.} |
219 Admittedly, this regular expression is carefully chosen to |
221 Admittedly, this regular expression is carefully chosen to |
220 exhibit this exponential behaviour, but similar ones occur |
222 exhibit this exponential behaviour, but similar ones occur |
221 more often than one wants in ``real life''. They are sometimes |
223 more often than one wants in ``real life''. They are sometimes |
222 called \emph{evil regular expressions} because they have the |
224 called \emph{evil regular expressions} because they have the |
283 \end{tabular} |
285 \end{tabular} |
284 \end{center} |
286 \end{center} |
285 |
287 |
286 \noindent Because we overload our notation, there are some |
288 \noindent Because we overload our notation, there are some |
287 subtleties you should be aware of. When regular expressions |
289 subtleties you should be aware of. When regular expressions |
288 are referred to then $\ZERO$ (in bold font) does not stand for |
290 are referred to, then $\ZERO$ (in bold font) does not stand for |
289 the number zero: rather it is a particular pattern that does |
291 the number zero: rather it is a particular pattern that does |
290 not match any string. Similarly, in the context of regular |
292 not match any string. Similarly, in the context of regular |
291 expressions, $\ONE$ does not stand for the number one but for |
293 expressions, $\ONE$ does not stand for the number one but for |
292 a regular expression that matches the empty string. The letter |
294 a regular expression that matches the empty string. The letter |
293 $c$ stands for any character from the alphabet at hand. Again |
295 $c$ stands for any character from the alphabet at hand. Again |
366 but often just write {\it hello}. |
368 but often just write {\it hello}. |
367 |
369 |
368 If you prefer to think in terms of the implementation |
370 If you prefer to think in terms of the implementation |
369 of regular expressions in Scala, the constructors and |
371 of regular expressions in Scala, the constructors and |
370 classes relate as follows\footnote{More about Scala is |
372 classes relate as follows\footnote{More about Scala is |
371 in the handout about A Crash-Course on Scala.} |
373 in the handout about \emph{A Crash-Course on Scala}.} |
372 |
374 |
373 \begin{center} |
375 \begin{center} |
374 \begin{tabular}{rcl} |
376 \begin{tabular}{rcl} |
375 $\ZERO$ & $\mapsto$ & \texttt{ZERO}\\ |
377 $\ZERO$ & $\mapsto$ & \texttt{ZERO}\\ |
376 $\ONE$ & $\mapsto$ & \texttt{ONE}\\ |
378 $\ONE$ & $\mapsto$ & \texttt{ONE}\\ |