handouts/ho01.tex
changeset 244 771042ac7c3f
parent 243 8d5aaf5b0031
child 245 a5fade10c207
equal deleted inserted replaced
243:8d5aaf5b0031 244:771042ac7c3f
   514 can do better in the future---for example by proving that the
   514 can do better in the future---for example by proving that the
   515 algorithms actually satisfy their specification and that the
   515 algorithms actually satisfy their specification and that the
   516 corresponding implementations do not contain any bugs. We are
   516 corresponding implementations do not contain any bugs. We are
   517 close, but not yet quite there.
   517 close, but not yet quite there.
   518 
   518 
       
   519 Despite my fascination, I am also happy to admit that regular
       
   520 expressions have their shortcomings. There are some well-known
       
   521 ``theoretical shortcomings'', for example recognising strings
       
   522 of the form $a^{n}b^{n}$. I am not so bothered by them. What I
       
   523 am bothered about is when regular expressions are in the way
       
   524 of practical programming. For example, it turns out that the
       
   525 regular expression for email addresses shown in \eqref{email}
       
   526 is hopelessly inadequate for recognising all of them (despite
       
   527 being touted as something every computer scientist should know
       
   528 about). The W3 Consortium (which standardises the Web)
       
   529 proposes to use the following, already more complicated
       
   530 regular expressions:
       
   531 
       
   532 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
       
   533 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
       
   534 \end{lstlisting}}
       
   535 
       
   536 \noindent But they admit that by using this regular expression
       
   537 they wilfully violate the RFC 5322 standard which specifies
       
   538 the syntax of email addresses. With their proposed regular
       
   539 expression they are too strict in some cases and too lax in
       
   540 others. Not a good situation to be in. A regular expression
       
   541 that is claimed to be closer to the standard is shown in
       
   542 Figure~\ref{monster}. Whether this claim is true or not, I
       
   543 would not know---the only thing I can say it is a monstrosity. 
       
   544 However, this might actually be an argument against the 
       
   545 standard, rather than against regular expressions. Still it
       
   546 is good to know that some tasks in text processing just 
       
   547 cannot be achieved by using regular expressions.
       
   548 
       
   549 
   519 \begin{figure}[p]
   550 \begin{figure}[p]
   520 \lstinputlisting{../progs/crawler1.scala}
   551 \lstinputlisting{../progs/crawler1.scala}
   521 \caption{The Scala code for a simple web-crawler that checks
   552 \caption{The Scala code for a simple web-crawler that checks
   522 for broken links in a web-page. It uses the regular expression
   553 for broken links in a web-page. It uses the regular expression
   523 \texttt{http\_pattern} in Line~15 for recognising URL-addresses.
   554 \texttt{http\_pattern} in Line~15 for recognising URL-addresses.
   547 \texttt{email\_pattern} in Line~16. The main change is in Line
   578 \texttt{email\_pattern} in Line~16. The main change is in Line
   548 32 where all email addresses that can be found in a page are
   579 32 where all email addresses that can be found in a page are
   549 printed.\label{crawler3}}
   580 printed.\label{crawler3}}
   550 \end{figure}
   581 \end{figure}
   551 
   582 
       
   583 \begin{figure}[p]
       
   584 \tiny
       
   585 \begin{center}
       
   586 \begin{minipage}{0.8\textwidth}
       
   587 \lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
       
   588 \end{minipage}
       
   589 \end{center}
       
   590 
       
   591 \caption{Nothing that can be said\ldots\label{monster}}
       
   592 \end{figure}
   552 
   593 
   553 
   594 
   554 \end{document}
   595 \end{document}
   555 
   596 
   556 %%% Local Variables: 
   597 %%% Local Variables: