handouts/ho01.tex
changeset 248 ce767ca23244
parent 245 a5fade10c207
child 250 b79e704acb72
equal deleted inserted replaced
247:84b4bc6e8554 248:ce767ca23244
    80 are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},
    80 are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},
    81 but not \pcode{_i} and also not \pcode{4you}.
    81 but not \pcode{_i} and also not \pcode{4you}.
    82 
    82 
    83 Many programming language offer libraries that can be used to
    83 Many programming language offer libraries that can be used to
    84 validate such strings against regular expressions. Also there
    84 validate such strings against regular expressions. Also there
    85 are some common, and I am sure very familiar, ways how to
    85 are some common, and I am sure very familiar, ways of how to
    86 construct regular expressions. For example in Scala we have: 
    86 construct regular expressions. For example in Scala we have: 
    87 
    87 
    88 \begin{center}
    88 \begin{center}
    89 \begin{tabular}{lp{9cm}}
    89 \begin{tabular}{lp{9cm}}
    90 \pcode{re*} & matches 0 or more occurrences of preceding 
    90 \pcode{re*} & matches 0 or more occurrences of preceding 
   259 the context of regular expressions, it is a particular pattern
   259 the context of regular expressions, it is a particular pattern
   260 that can match the specified character. You should also be
   260 that can match the specified character. You should also be
   261 careful with our overloading of the star: assuming you have
   261 careful with our overloading of the star: assuming you have
   262 read the handout about our basic mathematical notation, you
   262 read the handout about our basic mathematical notation, you
   263 will see that in the context of languages (sets of strings)
   263 will see that in the context of languages (sets of strings)
   264 the star stands for an operation on languages. While here
   264 the star stands for an operation on languages. Here $r^*$
   265 $r^*$ stands for a regular expression, which is different from
   265 stands for a regular expression, which is different from the
   266 the operation on sets is defined as
   266 operation on sets is defined as
   267 
   267 
   268 \[
   268 \[
   269 A^* \dn \bigcup_{0\le n} A^n
   269 A^* \dn \bigcup_{0\le n} A^n
   270 \]
   270 \]
   271 
   271 
   275 structure of regular expressions is always clear. But for
   275 structure of regular expressions is always clear. But for
   276 writing them down in a more mathematical fashion, parentheses
   276 writing them down in a more mathematical fashion, parentheses
   277 will be helpful. For example we will write $(r_1 + r_2)^*$,
   277 will be helpful. For example we will write $(r_1 + r_2)^*$,
   278 which is different from, say $r_1 + (r_2)^*$. The former means
   278 which is different from, say $r_1 + (r_2)^*$. The former means
   279 roughly zero or more times $r_1$ or $r_2$, while the latter
   279 roughly zero or more times $r_1$ or $r_2$, while the latter
   280 means $r_1$ or zero or more times $r_2$. This will turn out
   280 means $r_1$ or zero or more times $r_2$. This will turn out to
   281 are two different patterns, which match in general different
   281 be two different patterns, which match in general different
   282 strings. We should also write $(r_1 + r_2) + r_3$, which is
   282 strings. We should also write $(r_1 + r_2) + r_3$, which is
   283 different from the regular expression $r_1 + (r_2 + r_3)$, but
   283 different from the regular expression $r_1 + (r_2 + r_3)$, but
   284 in case of $+$ and $\cdot$ we actually do not care about the
   284 in case of $+$ and $\cdot$ we actually do not care about the
   285 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
   285 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
   286 \cdot r_3$, respectively. The reasons for this will become
   286 \cdot r_3$, respectively. The reasons for this will become
   302 which is much less readable than \eqref{email}. Similarly for
   302 which is much less readable than \eqref{email}. Similarly for
   303 the regular expression that matches the string $hello$ we 
   303 the regular expression that matches the string $hello$ we 
   304 should write
   304 should write
   305 
   305 
   306 \[
   306 \[
   307 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
   307 h \cdot e \cdot l \cdot l \cdot o
   308 \]
   308 \]
   309 
   309 
   310 \noindent
   310 \noindent
   311 but often just write {\it hello}.
   311 but often just write {\it hello}.
   312 
   312 
   313 If you prefer to think in terms of the implementation
   313 If you prefer to think in terms of the implementation
   314 of regular expressions in Scala, the constructors and
   314 of regular expressions in Scala, the constructors and
   315 classes relate as follows\footnote{More about Scala is 
   315 classes relate as follows\footnote{More about Scala is 
   316 in the handout about a crash-course on Scala.}
   316 in the handout about A Crash-Course on Scala.}
   317 
   317 
   318 \begin{center}
   318 \begin{center}
   319 \begin{tabular}{rcl}
   319 \begin{tabular}{rcl}
   320 $\varnothing$ & $\mapsto$ & \texttt{NULL}\\
   320 $\varnothing$ & $\mapsto$ & \texttt{NULL}\\
   321 $\epsilon$    & $\mapsto$ & \texttt{EMPTY}\\
   321 $\epsilon$    & $\mapsto$ & \texttt{EMPTY}\\
   376 \noindent As a result we can now precisely state what the
   376 \noindent As a result we can now precisely state what the
   377 meaning, for example, of the regular expression $h \cdot
   377 meaning, for example, of the regular expression $h \cdot
   378 e \cdot l \cdot l \cdot o$ is, namely
   378 e \cdot l \cdot l \cdot o$ is, namely
   379 
   379 
   380 \[
   380 \[
   381 L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot
   381 L(h \cdot e \cdot  l \cdot l \cdot o) = \{"hello"\}
   382 {\it o}) = \{"hello"\}
       
   383 \]
   382 \]
   384 
   383 
   385 \noindent This is expected because this regular expression 
   384 \noindent This is expected because this regular expression 
   386 is only supposed to match the ``$hello$''-string. Similarly if
   385 is only supposed to match the ``$hello$''-string. Similarly if
   387 we have the choice-regular-expression $a + b$, its meaning is
   386 we have the choice-regular-expression $a + b$, its meaning is
   391 \]
   390 \]
   392 
   391 
   393 \noindent You can now also see why we do not make a difference
   392 \noindent You can now also see why we do not make a difference
   394 between the different regular expressions $(r_1 + r_2) + r_3$
   393 between the different regular expressions $(r_1 + r_2) + r_3$
   395 and $r_1 + (r_2 + r_3)$\ldots they are not the same regular
   394 and $r_1 + (r_2 + r_3)$\ldots they are not the same regular
   396 expression, but have the same meaning. 
   395 expression, but they have the same meaning. For example
   397 
   396 
   398 \begin{eqnarray*}
   397 \begin{eqnarray*}
   399 L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\
   398 L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\
   400                      & = & L(r_1) \cup L(r_2) \cup L(r_3)\\
   399                      & = & L(r_1) \cup L(r_2) \cup L(r_3)\\
   401                      & = & L(r_1) \cup L(r_2 + r_3)\\
   400                      & = & L(r_1) \cup L(r_2 + r_3)\\
   424 
   423 
   425 \[
   424 \[
   426 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
   425 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
   427 \]
   426 \]
   428 
   427 
   429 \noindent That means two regular expressions are equivalent if
   428 \noindent That means two regular expressions are said to be
   430 they match the same set of strings. Therefore we do not really
   429 equivalent if they match the same set of strings. Therefore we
   431 distinguish between the different regular expression $(r_1 +
   430 do not really distinguish between the different regular
   432 r_2) + r_3$ and $r_1 + (r_2 + r_3)$, because they are
   431 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
   433 equivalent. I leave you to the question whether
   432 because they are equivalent. I leave you to the question
       
   433 whether
   434 
   434 
   435 \[
   435 \[
   436 \varnothing^* \equiv \epsilon^*
   436 \varnothing^* \equiv \epsilon^*
   437 \]
   437 \]
   438 
   438 
   439 \noindent holds. Such equivalences will be important for out
   439 \noindent holds. Such equivalences will be important for our
   440 matching algorithm, because we can use them to simplify 
   440 matching algorithm, because we can use them to simplify 
   441 regular expressions. 
   441 regular expressions. 
   442 
   442 
   443 \subsection*{My Fascination for Regular Expressions}
   443 \subsection*{My Fascination for Regular Expressions}
   444 
   444 
   456 compilers, which invariably need regular expression in some
   456 compilers, which invariably need regular expression in some
   457 form or shape. 
   457 form or shape. 
   458 
   458 
   459 To understand my fascination nowadays with regular
   459 To understand my fascination nowadays with regular
   460 expressions, you need to know that my main scientific interest
   460 expressions, you need to know that my main scientific interest
   461 for the last 14 years has been with in theorem provers. I am a
   461 for the last 14 years has been with theorem provers. I am a
   462 core developer of one of
   462 core developer of one of
   463 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
   463 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
   464 provers are systems in which you can formally reason about
   464 provers are systems in which you can formally reason about
   465 mathematical concepts, but also about programs. In this way
   465 mathematical concepts, but also about programs. In this way
   466 they can help with writing bug-free code. Theorem provers have
   466 they can help with writing bug-free code. Theorem provers have
   478 theorem provers. Is this important? I think yes, because
   478 theorem provers. Is this important? I think yes, because
   479 according to Kuklewicz nearly all POSIX-based regular
   479 according to Kuklewicz nearly all POSIX-based regular
   480 expression matchers are
   480 expression matchers are
   481 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
   481 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
   482 With my PhD student Fahad Ausaf I am currently working on
   482 With my PhD student Fahad Ausaf I am currently working on
   483 proving the correctness for one such algorithm that was
   483 proving the correctness for one such matcher that was
   484 proposed by Sulzmann and Lu in
   484 proposed by Sulzmann and Lu in
   485 2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an
   485 2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an
   486 attractive results since we will be able to prove that the
   486 attractive results since we will be able to prove that the
   487 algorithm is really correct, but also that the machine code(!)
   487 algorithm is really correct, but also that the machine code(!)
   488 that implements this code efficiently is correct. Writing
   488 that implements this code efficiently is correct. Writing
   500 complementation, something which Gasarch in his
   500 complementation, something which Gasarch in his
   501 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
   501 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
   502 shown via automata. Even sombody who has written a 700+-page
   502 shown via automata. Even sombody who has written a 700+-page
   503 book\footnote{\url{http://goo.gl/fD0eHx}} on regular
   503 book\footnote{\url{http://goo.gl/fD0eHx}} on regular
   504 exprssions did not know better. Well, we showed it can also be
   504 exprssions did not know better. Well, we showed it can also be
   505 done with regular expressions only. What a feeling if you
   505 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} 
   506 are an outsider to the subject!
   506 What a feeling if you are an outsider to the subject!
   507 
   507 
   508 To conclude: Despite my early ignorance about regular
   508 To conclude: Despite my early ignorance about regular
   509 expressions, I find them now quite interesting. They have a
   509 expressions, I find them now quite interesting. They have a
   510 beautiful mathematical theory behind them. They have practical
   510 beautiful mathematical theory behind them. They have practical
   511 importance (remember the shocking runtime of the regular
   511 importance (remember the shocking runtime of the regular
   512 expression matchers in Python and Ruby in some instances).
   512 expression matchers in Python and Ruby in some instances).
   513 People who are not very familiar with the mathematical
   513 People who are not very familiar with the mathematical
   514 background get them consistently wrong. The hope is that we
   514 background of regular expressions get them consistently wrong.
   515 can do better in the future---for example by proving that the
   515 The hope is that we can do better in the future---for example
   516 algorithms actually satisfy their specification and that the
   516 by proving that the algorithms actually satisfy their
   517 corresponding implementations do not contain any bugs. We are
   517 specification and that the corresponding implementations do
   518 close, but not yet quite there.
   518 not contain any bugs. We are close, but not yet quite there.
   519 
   519 
   520 Despite my fascination, I am also happy to admit that regular
   520 Despite my fascination, I am also happy to admit that regular
   521 expressions have their shortcomings. There are some well-known
   521 expressions have their shortcomings. There are some well-known
   522 ``theoretical shortcomings'', for example recognising strings
   522 ``theoretical'' shortcomings, for example recognising strings
   523 of the form $a^{n}b^{n}$. I am not so bothered by them. What I
   523 of the form $a^{n}b^{n}$. I am not so bothered by them. What I
   524 am bothered about is when regular expressions are in the way
   524 am bothered about is when regular expressions are in the way
   525 of practical programming. For example, it turns out that the
   525 of practical programming. For example, it turns out that the
   526 regular expression for email addresses shown in \eqref{email}
   526 regular expression for email addresses shown in \eqref{email}
   527 is hopelessly inadequate for recognising all of them (despite
   527 is hopelessly inadequate for recognising all of them (despite
   528 being touted as something every computer scientist should know
   528 being touted as something every computer scientist should know
   529 about). The W3 Consortium (which standardises the Web)
   529 about). The W3 Consortium (which standardises the Web)
   530 proposes to use the following, already more complicated
   530 proposes to use the following, already more complicated
   531 regular expressions:
   531 regular expressions for email addresses:
   532 
   532 
   533 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
   533 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
   534 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
   534 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
   535 \end{lstlisting}}
   535 \end{lstlisting}}
   536 
   536 
   539 the syntax of email addresses. With their proposed regular
   539 the syntax of email addresses. With their proposed regular
   540 expression they are too strict in some cases and too lax in
   540 expression they are too strict in some cases and too lax in
   541 others. Not a good situation to be in. A regular expression
   541 others. Not a good situation to be in. A regular expression
   542 that is claimed to be closer to the standard is shown in
   542 that is claimed to be closer to the standard is shown in
   543 Figure~\ref{monster}. Whether this claim is true or not, I
   543 Figure~\ref{monster}. Whether this claim is true or not, I
   544 would not know---the only thing I can say it is a monstrosity. 
   544 would not know---the only thing I can say to this regular
   545 However, this might actually be an argument against the 
   545 expression is it is a monstrosity. However, this might
   546 standard, rather than against regular expressions. Still it
   546 actually be an argument against the RFC standard, rather than
   547 is good to know that some tasks in text processing just 
   547 against regular expressions. Still it is good to know that
   548 cannot be achieved by using regular expressions.
   548 some tasks in text processing just cannot be achieved by using
       
   549 regular expressions.
   549 
   550 
   550 
   551 
   551 \begin{figure}[p]
   552 \begin{figure}[p]
   552 \lstinputlisting{../progs/crawler1.scala}
   553 \lstinputlisting{../progs/crawler1.scala}
   553 \caption{The Scala code for a simple web-crawler that checks
   554 \caption{The Scala code for a simple web-crawler that checks