handouts/ho01.tex
changeset 332 4755ad4b457b
parent 327 9470cd124667
child 334 fd89a63e9db3
equal deleted inserted replaced
331:a2c18456c6b7 332:4755ad4b457b
   117 the regular expressions in the web-crawlers shown Figures
   117 the regular expressions in the web-crawlers shown Figures
   118 \ref{crawler1}, \ref{crawler2} and
   118 \ref{crawler1}, \ref{crawler2} and
   119 \ref{crawler3}.\footnote{There is an interesting twist in the
   119 \ref{crawler3}.\footnote{There is an interesting twist in the
   120 web-scraper where \pcode{re*?} is used instead of
   120 web-scraper where \pcode{re*?} is used instead of
   121 \pcode{re*}.} Note, however, the regular expression for
   121 \pcode{re*}.} Note, however, the regular expression for
   122 http-addresses in web-pages in Figure~\ref{crawler1}, Line 15, is 
   122 http-addresses in web-pages in Figure~\ref{crawler1}, Line 15,
   123 intended to be
   123 is intended to be
   124 
   124 
   125 \[
   125 \[
   126 \pcode{"https?://[^"]*"}
   126 \pcode{"https?://[^"]*"}
   127 \]
   127 \]
   128 
   128 
   129 \noindent It specifies that web-addresses need to start with a
   129 \noindent It specifies that web-addresses need to start with a
   130 double quote, then comes \texttt{http} followed by an optional
   130 double quote, then comes \texttt{http} followed by an optional
   131 \texttt{s} and so on. Usually we would have to escape the
   131 \texttt{s} and so on until the closing double quote comes.
   132 double quotes in order to make sure we interpret the double
   132 Usually we would have to escape the double quotes in order to
   133 quote as character, not as double quote for a string. But
   133 make sure we interpret the double quote as character, not as
   134 Scala's trick with triple quotes allows us to omit this kind
   134 double quote for a string. But Scala's trick with triple
   135 of escaping. As a result we can just write:
   135 quotes allows us to omit this kind of escaping. As a result we
       
   136 can just write:
   136 
   137 
   137 \[
   138 \[
   138 \pcode{""""https?://[^"]*"""".r}
   139 \pcode{""""https?://[^"]*"""".r}
   139 \]
   140 \]
   140 
   141 
   226 \end{tikzpicture}
   227 \end{tikzpicture}
   227 \end{center}
   228 \end{center}
   228 
   229 
   229 \subsection*{Basic Regular Expressions}
   230 \subsection*{Basic Regular Expressions}
   230 
   231 
   231 The regular expressions shown above, for example for Scala, we
   232 The regular expressions shown above for Scala, we
   232 will call \emph{extended regular expressions}. The ones we
   233 will call \emph{extended regular expressions}. The ones we
   233 will mainly study in this module are \emph{basic regular
   234 will mainly study in this module are \emph{basic regular
   234 expressions}, which by convention we will just call
   235 expressions}, which by convention we will just call
   235 \emph{regular expressions}, if it is clear what we mean. The
   236 \emph{regular expressions}, if it is clear what we mean. The
   236 attraction of (basic) regular expressions is that many
   237 attraction of (basic) regular expressions is that many
   519 The hope is that we can do better in the future---for example
   520 The hope is that we can do better in the future---for example
   520 by proving that the algorithms actually satisfy their
   521 by proving that the algorithms actually satisfy their
   521 specification and that the corresponding implementations do
   522 specification and that the corresponding implementations do
   522 not contain any bugs. We are close, but not yet quite there.
   523 not contain any bugs. We are close, but not yet quite there.
   523 
   524 
   524 My fascination non withstanding, I am also happy to admit that regular
   525 Notwithstanding my fascination, I am also happy to admit that regular
   525 expressions have their shortcomings. There are some well-known
   526 expressions have their shortcomings. There are some well-known
   526 ``theoretical'' shortcomings, for example recognising strings
   527 ``theoretical'' shortcomings, for example recognising strings
   527 of the form $a^{n}b^{n}$. I am not so bothered by them. What I
   528 of the form $a^{n}b^{n}$. I am not so bothered by them. What I
   528 am bothered about is when regular expressions are in the way
   529 am bothered about is when regular expressions are in the way
   529 of practical programming. For example, it turns out that the
   530 of practical programming. For example, it turns out that the
   568 
   569 
   569 \caption{A version of the web-crawler that only follows links
   570 \caption{A version of the web-crawler that only follows links
   570 in ``my'' domain---since these are the ones I am interested in
   571 in ``my'' domain---since these are the ones I am interested in
   571 to fix. It uses the regular expression \texttt{my\_urls} in
   572 to fix. It uses the regular expression \texttt{my\_urls} in
   572 Line~16 to check for my name in the links. The main change is
   573 Line~16 to check for my name in the links. The main change is
   573 in Lines~26--29 where there is a test whether URL is in ``my''
   574 in Lines~24--28 where there is a test whether URL is in ``my''
   574 domain or not.\label{crawler2}}
   575 domain or not.\label{crawler2}}
   575 
   576 
   576 \end{figure}
   577 \end{figure}
   577 
   578 
   578 \begin{figure}[p]
   579 \begin{figure}[p]
   579 \lstinputlisting{../progs/crawler3.scala}
   580 \lstinputlisting{../progs/crawler3.scala}
   580 
   581 
   581 \caption{A small email harvester---whenever we download a
   582 \caption{A small email harvester---whenever we download a
   582 web-page, we also check whether it contains any email
   583 web-page, we also check whether it contains any email
   583 addresses. For this we use the regular expression
   584 addresses. For this we use the regular expression
   584 \texttt{email\_pattern} in Line~16. The main change is in Line
   585 \texttt{email\_pattern} in Line~15. The main change is in Line
   585 32 where all email addresses that can be found in a page are
   586 30 where all email addresses that can be found in a page are
   586 printed.\label{crawler3}}
   587 printed.\label{crawler3}}
   587 \end{figure}
   588 \end{figure}
   588 
   589 
   589 \begin{figure}[p]
   590 \begin{figure}[p]
   590 \tiny
   591 \tiny