handouts/ho01.tex
changeset 291 201c2c6d8696
parent 268 18bef085a7ca
child 306 fecffce112fa
equal deleted inserted replaced
290:3a2fa69ea675 291:201c2c6d8696
    16 efficient general string search algorithm. But often we do
    16 efficient general string search algorithm. But often we do
    17 \emph{not} just look for a particular string, but for string
    17 \emph{not} just look for a particular string, but for string
    18 patterns. For example in programming code we need to identify
    18 patterns. For example in programming code we need to identify
    19 what are the keywords, what are the identifiers etc. A pattern
    19 what are the keywords, what are the identifiers etc. A pattern
    20 for identifiers could be stated as: they start with a letter,
    20 for identifiers could be stated as: they start with a letter,
    21 followed by zero or more letters, numbers and the underscore.
    21 followed by zero or more letters, numbers and underscores.
    22 Also often we face the problem that we are given a string (for
    22 Also often we face the problem that we are given a string (for
    23 example some user input) and want to know whether it matches a
    23 example some user input) and want to know whether it matches a
    24 particular pattern. In this way we can exclude user input that
    24 particular pattern. In this way we can exclude user input that
    25 would otherwise have nasty effects on our program (crashing it
    25 would otherwise have nasty effects on our program (crashing it
    26 or going into an infinite loop, if not worse). \defn{Regular
    26 or making it go into an infinite loop, if not worse).
    27 expressions} help with conveniently specifying such patterns.
    27 
    28 The idea behind regular expressions is that they are a simple
    28 \defn{Regular expressions} help with conveniently specifying
    29 method for describing languages (or sets of strings)\ldots at
    29 such patterns. The idea behind regular expressions is that
    30 least languages we are interested in in computer science. For
    30 they are a simple method for describing languages (or sets of
    31 example there is no convenient regular expression for
    31 strings)\ldots at least languages we are interested in in
    32 describing the English language short of enumerating all
    32 computer science. For example there is no convenient regular
    33 English words. But they seem useful for describing for example
    33 expression for describing the English language short of
    34 email addresses.\footnote{See ``8 Regular Expressions You
    34 enumerating all English words. But they seem useful for
    35 Should Know'' \url{http://goo.gl/5LoVX7}} Consider the
    35 describing for example email addresses.\footnote{See ``8
    36 following regular expression
    36 Regular Expressions You Should Know''
       
    37 \url{http://goo.gl/5LoVX7}} Consider the following regular
       
    38 expression
    37 
    39 
    38 \begin{equation}\label{email}
    40 \begin{equation}\label{email}
    39 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
    41 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
    40 \end{equation}
    42 \end{equation}
    41 
    43 
   111 
   113 
   112 \noindent With this table you can figure out the purpose of
   114 \noindent With this table you can figure out the purpose of
   113 the regular expressions in the web-crawlers shown Figures
   115 the regular expressions in the web-crawlers shown Figures
   114 \ref{crawler1}, \ref{crawler2} and
   116 \ref{crawler1}, \ref{crawler2} and
   115 \ref{crawler3}.\footnote{There is an interesting twist in the
   117 \ref{crawler3}.\footnote{There is an interesting twist in the
   116 web-scraber where \pcode{re*?} is used instead of \pcode{re*}.} Note,
   118 web-scraper where \pcode{re*?} is used instead of
   117 however, the regular expression for http-addresses in
   119 \pcode{re*}.} Note, however, the regular expression for
   118 web-pages is meant to be
   120 http-addresses in web-pages is meant to be
   119 
   121 
   120 \[
   122 \[
   121 \pcode{"https?://[^"]*"}
   123 \pcode{"https?://[^"]*"}
   122 \]
   124 \]
   123 
   125 
   147 there any interest in studying them again in depth in this
   149 there any interest in studying them again in depth in this
   148 module? Well, one answer is in the following graph about
   150 module? Well, one answer is in the following graph about
   149 regular expression matching in Python and in Ruby.
   151 regular expression matching in Python and in Ruby.
   150 
   152 
   151 \begin{center}
   153 \begin{center}
   152 \begin{tikzpicture}[y=.09cm, x=.15cm]
   154 \begin{tikzpicture}
   153  	%axis
   155 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
   154 	\draw (0,0) -- coordinate (x axis mid) (30,0);
   156     enlargelimits=false,
   155    \draw (0,0) -- coordinate (y axis mid) (0,30);
   157     xtick={0,5,...,30},
   156    %ticks
   158     xmax=33,
   157    \foreach \x in {0,5,...,30}
   159     ymax=35,
   158      \draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
   160     ytick={0,5,...,30},
   159    \foreach \y in {0,5,...,30}
   161     scaled ticks=false,
   160      \draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; 
   162     axis lines=left,
   161 	%labels      
   163     width=7cm,
   162 	\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
   164     height=5cm, 
   163 	\node[rotate=90,left=0.9cm] at (y axis mid) {time in secs};
   165     legend entries={Python,Ruby},  
   164 	%plots
   166     legend pos=north west,
   165 	\draw[color=blue] plot[mark=*] 
   167     legend cell align=left]
   166 		file {re-python.data};
   168 \addplot[blue,mark=*, mark options={fill=white}] 
   167 	\draw[color=brown] plot[mark=triangle*] 
   169   table {re-python.data};
   168 		file {re-ruby.data};	 
   170 \addplot[brown,mark=triangle*, mark options={fill=white}] 
   169    %legend
   171   table {re-ruby.data};  
   170 	\begin{scope}[shift={(4,20)}] 
   172 \end{axis}
   171 	\draw[color=blue] (0,0) -- 
       
   172 		plot[mark=*] (0.25,0) -- (0.5,0) 
       
   173 		node[right]{\small Python};
       
   174 	\draw[yshift=-\baselineskip, color=brown] (0,0) -- 
       
   175 		plot[mark=triangle*] (0.25,0) -- (0.5,0)
       
   176 		node[right]{\small Ruby};
       
   177 	\end{scope}	
       
   178 \end{tikzpicture}
   173 \end{tikzpicture}
   179 \end{center}
   174 \end{center}
   180 
   175 
   181 \noindent This graph shows that Python needs approximately 29
   176 \noindent This graph shows that Python needs approximately 29
   182 seconds for finding out whether a string of 28 \texttt{a}s
   177 seconds for finding out whether a string of 28 \texttt{a}s
   196 consequences, for example, if you use them in your
   191 consequences, for example, if you use them in your
   197 web-application. The reason is that hackers can look for these
   192 web-application. The reason is that hackers can look for these
   198 instances where the matching engine behaves badly and then
   193 instances where the matching engine behaves badly and then
   199 mount a nice DoS-attack against your application. These
   194 mount a nice DoS-attack against your application. These
   200 attacks are already have their own name: 
   195 attacks are already have their own name: 
   201 \emph{Regular Expression Denial of Servive Attack (ReDoS)}.
   196 \emph{Regular Expression Denial of Service Attacks (ReDoS)}.
   202 
   197 
   203 It will be instructive to look behind the ``scenes'' to find
   198 It will be instructive to look behind the ``scenes'' to find
   204 out why Python and Ruby (and others) behave so badly when
   199 out why Python and Ruby (and others) behave so badly when
   205 matching with evil regular expressions. But we will also look
   200 matching with evil regular expressions. But we will also look
   206 at a relatively simple algorithm that solves this problem much
   201 at a relatively simple algorithm that solves this problem much
   209 process strings of approximately 1,000 \texttt{a}s in 30
   204 process strings of approximately 1,000 \texttt{a}s in 30
   210 seconds, while the second version will even be able to process
   205 seconds, while the second version will even be able to process
   211 up to 12,000 in less than 10(!) seconds, see the graph below:
   206 up to 12,000 in less than 10(!) seconds, see the graph below:
   212 
   207 
   213 \begin{center}
   208 \begin{center}
   214 \begin{tikzpicture}[y=.09cm, x=.0006cm]
   209 \begin{tikzpicture}
   215  	%axis
   210   \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
   216 	\draw (0,0) -- coordinate (x axis mid) (12000,0);
   211     enlargelimits=false,
   217    \draw (0,0) -- coordinate (y axis mid) (0,30);
   212     xtick={0,3000,...,12000},
   218    %ticks
   213     xmax=12500,
   219    \foreach \x in {0,2000,...,12000}
   214     ymax=35,
   220      	\draw (\x,1pt) -- (\x,-3pt) node[anchor=north] {\x};
   215     ytick={0,5,...,30},
   221    \foreach \y in {0,5,...,30}
   216     scaled ticks=false,
   222      	\draw (1pt,\y) -- (-3pt,\y) node[anchor=east] {\y}; 
   217     axis lines=left,
   223 	%labels      
   218     width=9cm,
   224 	\node[below=0.6cm] at (x axis mid) {number of \texttt{a}s};
   219     height=5cm]
   225 	\node[rotate=90,left=0.9cm] at (y axis mid) {time in secs};
   220 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
   226 
   221 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
   227 	%plots
   222 \end{axis}
   228    \draw[color=green] plot[mark=square*, mark options={fill=white} ] 
       
   229 		file {re2b.data};
       
   230 	\draw[color=black] plot[mark=square*, mark options={fill=white} ] 
       
   231 		file {re3.data};	 
       
   232 \end{tikzpicture}
   223 \end{tikzpicture}
   233 \end{center}
   224 \end{center}
   234 
   225 
   235 \subsection*{Basic Regular Expressions}
   226 \subsection*{Basic Regular Expressions}
   236 
   227