514 can do better in the future---for example by proving that the |
514 can do better in the future---for example by proving that the |
515 algorithms actually satisfy their specification and that the |
515 algorithms actually satisfy their specification and that the |
516 corresponding implementations do not contain any bugs. We are |
516 corresponding implementations do not contain any bugs. We are |
517 close, but not yet quite there. |
517 close, but not yet quite there. |
518 |
518 |
|
519 Despite my fascination, I am also happy to admit that regular |
|
520 expressions have their shortcomings. There are some well-known |
|
521 ``theoretical shortcomings'', for example recognising strings |
|
522 of the form $a^{n}b^{n}$. I am not so bothered by them. What I |
|
523 am bothered about is when regular expressions are in the way |
|
524 of practical programming. For example, it turns out that the |
|
525 regular expression for email addresses shown in \eqref{email} |
|
526 is hopelessly inadequate for recognising all of them (despite |
|
527 being touted as something every computer scientist should know |
|
528 about). The W3 Consortium (which standardises the Web) |
|
529 proposes to use the following, already more complicated |
|
530 regular expressions: |
|
531 |
|
532 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] |
|
533 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* |
|
534 \end{lstlisting}} |
|
535 |
|
536 \noindent But they admit that by using this regular expression |
|
537 they wilfully violate the RFC 5322 standard which specifies |
|
538 the syntax of email addresses. With their proposed regular |
|
539 expression they are too strict in some cases and too lax in |
|
540 others. Not a good situation to be in. A regular expression |
|
541 that is claimed to be closer to the standard is shown in |
|
542 Figure~\ref{monster}. Whether this claim is true or not, I |
|
543 would not know---the only thing I can say it is a monstrosity. |
|
544 However, this might actually be an argument against the |
|
545 standard, rather than against regular expressions. Still it |
|
546 is good to know that some tasks in text processing just |
|
547 cannot be achieved by using regular expressions. |
|
548 |
|
549 |
519 \begin{figure}[p] |
550 \begin{figure}[p] |
520 \lstinputlisting{../progs/crawler1.scala} |
551 \lstinputlisting{../progs/crawler1.scala} |
521 \caption{The Scala code for a simple web-crawler that checks |
552 \caption{The Scala code for a simple web-crawler that checks |
522 for broken links in a web-page. It uses the regular expression |
553 for broken links in a web-page. It uses the regular expression |
523 \texttt{http\_pattern} in Line~15 for recognising URL-addresses. |
554 \texttt{http\_pattern} in Line~15 for recognising URL-addresses. |