117 the regular expressions in the web-crawlers shown Figures |
117 the regular expressions in the web-crawlers shown Figures |
118 \ref{crawler1}, \ref{crawler2} and |
118 \ref{crawler1}, \ref{crawler2} and |
119 \ref{crawler3}.\footnote{There is an interesting twist in the |
119 \ref{crawler3}.\footnote{There is an interesting twist in the |
120 web-scraper where \pcode{re*?} is used instead of |
120 web-scraper where \pcode{re*?} is used instead of |
121 \pcode{re*}.} Note, however, the regular expression for |
121 \pcode{re*}.} Note, however, the regular expression for |
122 http-addresses in web-pages in Figure~\ref{crawler1}, Line 15, is |
122 http-addresses in web-pages in Figure~\ref{crawler1}, Line 15, |
123 intended to be |
123 is intended to be |
124 |
124 |
125 \[ |
125 \[ |
126 \pcode{"https?://[^"]*"} |
126 \pcode{"https?://[^"]*"} |
127 \] |
127 \] |
128 |
128 |
129 \noindent It specifies that web-addresses need to start with a |
129 \noindent It specifies that web-addresses need to start with a |
130 double quote, then comes \texttt{http} followed by an optional |
130 double quote, then comes \texttt{http} followed by an optional |
131 \texttt{s} and so on. Usually we would have to escape the |
131 \texttt{s} and so on until the closing double quote comes. |
132 double quotes in order to make sure we interpret the double |
132 Usually we would have to escape the double quotes in order to |
133 quote as character, not as double quote for a string. But |
133 make sure we interpret the double quote as character, not as |
134 Scala's trick with triple quotes allows us to omit this kind |
134 double quote for a string. But Scala's trick with triple |
135 of escaping. As a result we can just write: |
135 quotes allows us to omit this kind of escaping. As a result we |
|
136 can just write: |
136 |
137 |
137 \[ |
138 \[ |
138 \pcode{""""https?://[^"]*"""".r} |
139 \pcode{""""https?://[^"]*"""".r} |
139 \] |
140 \] |
140 |
141 |
226 \end{tikzpicture} |
227 \end{tikzpicture} |
227 \end{center} |
228 \end{center} |
228 |
229 |
229 \subsection*{Basic Regular Expressions} |
230 \subsection*{Basic Regular Expressions} |
230 |
231 |
231 The regular expressions shown above, for example for Scala, we |
232 The regular expressions shown above for Scala, we |
232 will call \emph{extended regular expressions}. The ones we |
233 will call \emph{extended regular expressions}. The ones we |
233 will mainly study in this module are \emph{basic regular |
234 will mainly study in this module are \emph{basic regular |
234 expressions}, which by convention we will just call |
235 expressions}, which by convention we will just call |
235 \emph{regular expressions}, if it is clear what we mean. The |
236 \emph{regular expressions}, if it is clear what we mean. The |
236 attraction of (basic) regular expressions is that many |
237 attraction of (basic) regular expressions is that many |
519 The hope is that we can do better in the future---for example |
520 The hope is that we can do better in the future---for example |
520 by proving that the algorithms actually satisfy their |
521 by proving that the algorithms actually satisfy their |
521 specification and that the corresponding implementations do |
522 specification and that the corresponding implementations do |
522 not contain any bugs. We are close, but not yet quite there. |
523 not contain any bugs. We are close, but not yet quite there. |
523 |
524 |
524 My fascination non withstanding, I am also happy to admit that regular |
525 Notwithstanding my fascination, I am also happy to admit that regular |
525 expressions have their shortcomings. There are some well-known |
526 expressions have their shortcomings. There are some well-known |
526 ``theoretical'' shortcomings, for example recognising strings |
527 ``theoretical'' shortcomings, for example recognising strings |
527 of the form $a^{n}b^{n}$. I am not so bothered by them. What I |
528 of the form $a^{n}b^{n}$. I am not so bothered by them. What I |
528 am bothered about is when regular expressions are in the way |
529 am bothered about is when regular expressions are in the way |
529 of practical programming. For example, it turns out that the |
530 of practical programming. For example, it turns out that the |
568 |
569 |
569 \caption{A version of the web-crawler that only follows links |
570 \caption{A version of the web-crawler that only follows links |
570 in ``my'' domain---since these are the ones I am interested in |
571 in ``my'' domain---since these are the ones I am interested in |
571 to fix. It uses the regular expression \texttt{my\_urls} in |
572 to fix. It uses the regular expression \texttt{my\_urls} in |
572 Line~16 to check for my name in the links. The main change is |
573 Line~16 to check for my name in the links. The main change is |
573 in Lines~26--29 where there is a test whether URL is in ``my'' |
574 in Lines~24--28 where there is a test whether URL is in ``my'' |
574 domain or not.\label{crawler2}} |
575 domain or not.\label{crawler2}} |
575 |
576 |
576 \end{figure} |
577 \end{figure} |
577 |
578 |
578 \begin{figure}[p] |
579 \begin{figure}[p] |
579 \lstinputlisting{../progs/crawler3.scala} |
580 \lstinputlisting{../progs/crawler3.scala} |
580 |
581 |
581 \caption{A small email harvester---whenever we download a |
582 \caption{A small email harvester---whenever we download a |
582 web-page, we also check whether it contains any email |
583 web-page, we also check whether it contains any email |
583 addresses. For this we use the regular expression |
584 addresses. For this we use the regular expression |
584 \texttt{email\_pattern} in Line~16. The main change is in Line |
585 \texttt{email\_pattern} in Line~15. The main change is in Line |
585 32 where all email addresses that can be found in a page are |
586 30 where all email addresses that can be found in a page are |
586 printed.\label{crawler3}} |
587 printed.\label{crawler3}} |
587 \end{figure} |
588 \end{figure} |
588 |
589 |
589 \begin{figure}[p] |
590 \begin{figure}[p] |
590 \tiny |
591 \tiny |