occurrences of the preceding expression\\
\pcode{[...]} & matches any single character inside the
brackets\\
\pcode{[^...]} & matches any single character not inside the
brackets\\
\pcode{...-...} & character ranges\\
\pcode{\\d} & matches digits; equivalent to \pcode{[0-9]}\\
\pcode{.} & matches every character except newline\\
\pcode{(re)} & groups regular expressions and remembers
matched text
\end{tabular}
\end{center}
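
\noindent The table entries can be tried out directly. The
following is only a minimal sketch, assuming Python and its
standard \texttt{re} module, which need not be the language or
library used by the web-crawlers in the figures. The sample
strings and all variable names are made up for illustration.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = 'Room 42: "http://example.org/a" and "https://example.org/b"'

# [...]   any single character inside the brackets
inside = re.findall(r"[abc]", "cabbage")

# [^...]  any single character not inside the brackets
outside = re.findall(r"[^abc]", "cabbage")

# ...-... a character range (here: the letters a to c)
in_range = re.findall(r"[a-c]", "cabbage")

# \d      a digit; equivalent to [0-9]
digits = re.findall(r"\d", text)
same_digits = re.findall(r"[0-9]", text)

# .       every character except newline
no_newline = re.findall(r".", "a\nb")

# (re)    groups a regular expression and remembers the matched text
match = re.search(r"(\d)(\d)", text)
first, second = match.groups()

# Several entries combined: a double quote, then any number of
# characters that are not a double quote, then a double quote.
quoted = re.findall(r'"[^"]*"', text)

print(inside)         # ['c', 'a', 'b', 'b', 'a']
print(outside)        # ['g', 'e']
print(in_range)       # ['c', 'a', 'b', 'b', 'a']
print(digits)         # ['4', '2']
print(no_newline)     # ['a', 'b']
print(first, second)  # 4 2
print(quoted)         # ['"http://example.org/a"', '"https://example.org/b"']
```

\noindent Other regular expression libraries may differ in
details, for example in how a backslash has to be written
inside a string.
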

\noindent With this table you can figure out the purpose of
the regular expressions in the web-crawlers shown in Figures
\ref{crawler1}, \ref{crawler2} and
\ref{crawler3}.\footnote{There is an interesting twist in the
web-scraper where \pcode{re*?} is used instead of \pcode{re*}.} Note,
however, that the regular expression for http-addresses in
web-pages is meant to be

\[
\pcode{"https?://[^"]*"}