16 efficient general string search algorithm. But often we do |
16 efficient general string search algorithm. But often we do |
17 \emph{not} just look for a particular string, but for string |
17 \emph{not} just look for a particular string, but for string |
18 patterns. For example in programming code we need to identify |
18 patterns. For example in programming code we need to identify |
19 what are the keywords, what are the identifiers etc. A pattern |
19 what are the keywords, what are the identifiers etc. A pattern |
20 for identifiers could be stated as: they start with a letter, |
20 for identifiers could be stated as: they start with a letter, |
21 followed by zero or more letters, numbers and the underscore. |
21 followed by zero or more letters, numbers and underscores. |
22 Also often we face the problem that we are given a string (for |
22 Also often we face the problem that we are given a string (for |
23 example some user input) and want to know whether it matches a |
23 example some user input) and want to know whether it matches a |
24 particular pattern. In this way we can exclude user input that |
24 particular pattern. In this way we can exclude user input that |
25 would otherwise have nasty effects on our program (crashing it |
25 would otherwise have nasty effects on our program (crashing it |
26 or going into an infinite loop, if not worse). \defn{Regular |
26 or making it go into an infinite loop, if not worse). |
27 expressions} help with conveniently specifying such patterns. |
27 |
28 The idea behind regular expressions is that they are a simple |
28 \defn{Regular expressions} help with conveniently specifying |
29 method for describing languages (or sets of strings)\ldots at |
29 such patterns. The idea behind regular expressions is that |
30 least languages we are interested in in computer science. For |
30 they are a simple method for describing languages (or sets of |
31 example there is no convenient regular expression for |
31 strings)\ldots at least languages we are interested in in |
32 describing the English language short of enumerating all |
32 computer science. For example there is no convenient regular |
33 English words. But they seem useful for describing for example |
33 expression for describing the English language short of |
34 email addresses.\footnote{See ``8 Regular Expressions You |
34 enumerating all English words. But they seem useful for |
35 Should Know'' \url{http://goo.gl/5LoVX7}} Consider the |
35 describing for example email addresses.\footnote{See ``8 |
36 following regular expression |
36 Regular Expressions You Should Know'' |
|
37 \url{http://goo.gl/5LoVX7}} Consider the following regular |
|
38 expression |
37 |
39 |
38 \begin{equation}\label{email} |
40 \begin{equation}\label{email} |
39 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}} |
41 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}} |
40 \end{equation} |
42 \end{equation} |
41 |
43 |
111 |
113 |
112 \noindent With this table you can figure out the purpose of |
114 \noindent With this table you can figure out the purpose of |
113 the regular expressions in the web-crawlers shown Figures |
115 the regular expressions in the web-crawlers shown Figures |
114 \ref{crawler1}, \ref{crawler2} and |
116 \ref{crawler1}, \ref{crawler2} and |
115 \ref{crawler3}.\footnote{There is an interesting twist in the |
117 \ref{crawler3}.\footnote{There is an interesting twist in the |
116 web-scraber where \pcode{re*?} is used instead of \pcode{re*}.} Note, |
118 web-scraper where \pcode{re*?} is used instead of |
117 however, the regular expression for http-addresses in |
119 \pcode{re*}.} Note, however, the regular expression for |
118 web-pages is meant to be |
120 http-addresses in web-pages is meant to be |
119 |
121 |
120 \[ |
122 \[ |
121 \pcode{"https?://[^"]*"} |
123 \pcode{"https?://[^"]*"} |
122 \] |
124 \] |
123 |
125 |
147 there any interest in studying them again in depth in this |
149 there any interest in studying them again in depth in this |
148 module? Well, one answer is in the following graph about |
150 module? Well, one answer is in the following graph about |
149 regular expression matching in Python and in Ruby. |
151 regular expression matching in Python and in Ruby. |
150 |
152 |
151 \begin{center} |
153 \begin{center} |
152 \begin{tikzpicture}[y=.09cm, x=.15cm] |
154 \begin{tikzpicture} |
153 %axis |
155 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs}, |
154 \draw (0,0) -- coordinate (x axis mid) (30,0); |
156 enlargelimits=false, |
155 \draw (0,0) -- coordinate (y axis mid) (0,30); |
157 xtick={0,5,...,30}, |
156 %ticks |
158 xmax=33, |
157 \foreach \x in {0,5,...,30} |
159 ymax=35, |
158 \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; |
160 ytick={0,5,...,30}, |
159 \foreach \y in {0,5,...,30} |
161 scaled ticks=false, |
160 \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; |
162 axis lines=left, |
161 %labels |
163 width=7cm, |
162 \node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; |
164 height=5cm, |
163 \node[rotate=90,left=0.9cm] at (y axis mid) {time in secs}; |
165 legend entries={Python,Ruby}, |
164 %plots |
166 legend pos=north west, |
165 \draw[color=blue] plot[mark=*] |
167 legend cell align=left] |
166 file {re-python.data}; |
168 \addplot[blue,mark=*, mark options={fill=white}] |
167 \draw[color=brown] plot[mark=triangle*] |
169 table {re-python.data}; |
168 file {re-ruby.data}; |
170 \addplot[brown,mark=triangle*, mark options={fill=white}] |
169 %legend |
171 table {re-ruby.data}; |
170 \begin{scope}[shift={(4,20)}] |
172 \end{axis} |
171 \draw[color=blue] (0,0) -- |
|
172 plot[mark=*] (0.25,0) -- (0.5,0) |
|
173 node[right]{\small Python}; |
|
174 \draw[yshift=-\baselineskip, color=brown] (0,0) -- |
|
175 plot[mark=triangle*] (0.25,0) -- (0.5,0) |
|
176 node[right]{\small Ruby}; |
|
177 \end{scope} |
|
178 \end{tikzpicture} |
173 \end{tikzpicture} |
179 \end{center} |
174 \end{center} |
180 |
175 |
181 \noindent This graph shows that Python needs approximately 29 |
176 \noindent This graph shows that Python needs approximately 29 |
182 seconds for finding out whether a string of 28 \texttt{a}s |
177 seconds for finding out whether a string of 28 \texttt{a}s |
196 consequences, for example, if you use them in your |
191 consequences, for example, if you use them in your |
197 web-application. The reason is that hackers can look for these |
192 web-application. The reason is that hackers can look for these |
198 instances where the matching engine behaves badly and then |
193 instances where the matching engine behaves badly and then |
199 mount a nice DoS-attack against your application. These |
194 mount a nice DoS-attack against your application. These |
200 attacks are already have their own name: |
195 attacks are already have their own name: |
201 \emph{Regular Expression Denial of Servive Attack (ReDoS)}. |
196 \emph{Regular Expression Denial of Service Attacks (ReDoS)}. |
202 |
197 |
203 It will be instructive to look behind the ``scenes'' to find |
198 It will be instructive to look behind the ``scenes'' to find |
204 out why Python and Ruby (and others) behave so badly when |
199 out why Python and Ruby (and others) behave so badly when |
205 matching with evil regular expressions. But we will also look |
200 matching with evil regular expressions. But we will also look |
206 at a relatively simple algorithm that solves this problem much |
201 at a relatively simple algorithm that solves this problem much |
209 process strings of approximately 1,000 \texttt{a}s in 30 |
204 process strings of approximately 1,000 \texttt{a}s in 30 |
210 seconds, while the second version will even be able to process |
205 seconds, while the second version will even be able to process |
211 up to 12,000 in less than 10(!) seconds, see the graph below: |
206 up to 12,000 in less than 10(!) seconds, see the graph below: |
212 |
207 |
213 \begin{center} |
208 \begin{center} |
214 \begin{tikzpicture}[y=.09cm, x=.0006cm] |
209 \begin{tikzpicture} |
215 %axis |
210 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs}, |
216 \draw (0,0) -- coordinate (x axis mid) (12000,0); |
211 enlargelimits=false, |
217 \draw (0,0) -- coordinate (y axis mid) (0,30); |
212 xtick={0,3000,...,12000}, |
218 %ticks |
213 xmax=12500, |
219 \foreach \x in {0,2000,...,12000} |
214 ymax=35, |
220 \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x}; |
215 ytick={0,5,...,30}, |
221 \foreach \y in {0,5,...,30} |
216 scaled ticks=false, |
222 \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; |
217 axis lines=left, |
223 %labels |
218 width=9cm, |
224 \node[below=0.6cm] at (x axis mid) {number of \texttt{a}s}; |
219 height=5cm] |
225 \node[rotate=90,left=0.9cm] at (y axis mid) {time in secs}; |
220 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data}; |
226 |
221 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
227 %plots |
222 \end{axis} |
228 \draw[color=green] plot[mark=square*, mark options={fill=white} ] |
|
229 file {re2b.data}; |
|
230 \draw[color=black] plot[mark=square*, mark options={fill=white} ] |
|
231 file {re3.data}; |
|
232 \end{tikzpicture} |
223 \end{tikzpicture} |
233 \end{center} |
224 \end{center} |
234 |
225 |
235 \subsection*{Basic Regular Expressions} |
226 \subsection*{Basic Regular Expressions} |
236 |
227 |