handouts/ho01.tex
changeset 250 b79e704acb72
parent 248 ce767ca23244
child 252 e8ef8f38ca84
249:377c59df7297 250:b79e704acb72
    99 occurrences of the preceding expression\\
    99 occurrences of the preceding expression\\
   100 \pcode{[...]} & matches any single character inside the 
   100 \pcode{[...]} & matches any single character inside the 
   101 brackets\\
   101 brackets\\
   102 \pcode{[^...]} & matches any single character not inside the 
   102 \pcode{[^...]} & matches any single character not inside the 
   103 brackets\\
   103 brackets\\
   104 \pcode{..-..} & character ranges\\
   104 \pcode{...-...} & character ranges\\
   105 \pcode{\\d} &	matches digits; equivalent to \pcode{[0-9]}
   105 \pcode{\\d} & matches digits; equivalent to \pcode{[0-9]}\\
       
   106 \pcode{.} & matches every character except newline\\
       
   107 \pcode{(re)}	& groups regular expressions and remembers 
       
   108 matched text
   106 \end{tabular}
   109 \end{tabular}
   107 \end{center}
   110 \end{center}
   108 
   111 
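
The table entries above can be tried out directly. The following is a minimal sketch in Python's \pcode{re} module, which is an assumption here since the handout does not say which regex library the crawlers use; the sample strings and variable names are made up for illustration. Note that in Python \pcode{\\d} also matches non-ASCII digits, so the equivalence with \pcode{[0-9]} is shown only on ASCII input.

```python
import re

# Hypothetical sample input, not taken from the handout.
text = 'id=42, tag=<b>, code=x7'

# [...] matches one character from the set; a-c is a character range.
in_set = re.findall(r'[a-c]', 'abcdef')

# [^...] matches one character that is NOT in the set.
not_in_set = re.findall(r'[^a-c]', 'abcdef')

# \d matches a digit; on ASCII input it behaves like [0-9].
digits = re.findall(r'\d', text)
digits_range = re.findall(r'[0-9]', text)

# . matches every character except newline.
no_newline = re.findall(r'.', 'a\nb')

# (re) groups an expression and remembers the matched text.
match = re.search(r'id=(\d+)', text)
remembered = match.group(1)

print(in_set, not_in_set, digits, no_newline, remembered)
```

The remembered group is what lets a crawler pull out just the part of a match it is interested in, rather than the whole matched string.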
   109 \noindent With this table you can figure out the purpose of
   112 \noindent With this table you can figure out the purpose of
   110 the regular expressions in the web-crawlers shown in Figures
   113 the regular expressions in the web-crawlers shown in Figures
   111 \ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note,
   114 \ref{crawler1}, \ref{crawler2} and
       
   115 \ref{crawler3}.\footnote{There is an interesting twist in the
       
   116 web-scraper where \pcode{re*?} is used instead of \pcode{re*}.} Note,
   112 however, the regular expression for http-addresses in
   117 however, the regular expression for http-addresses in
   113 web-pages is meant to be
   118 web-pages is meant to be
   114 
   119 
   115 \[
   120 \[
   116 \pcode{"https?://[^"]*"}
   121 \pcode{"https?://[^"]*"}