handouts/ho01.tex
changeset 618 f4818c95a32e
parent 570 617c3b0e0a81
child 621 cf287db8dc15
equal deleted inserted replaced
617:f7de0915fff2 618:f4818c95a32e
    34 % compiler explorer
    34 % compiler explorer
    35 % https://gcc.godbolt.org
    35 % https://gcc.godbolt.org
    36 
    36 
    37 %https://www.youtube.com/watch?v=gmhMQfFQu20
    37 %https://www.youtube.com/watch?v=gmhMQfFQu20
    38 \begin{document}
    38 \begin{document}
    39 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018}
    39 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}
    40 
    40 
    41 \section*{Handout 1}
    41 \section*{Handout 1}
    42 
    42 
    43 This module is about text processing, be it for web-crawlers,
    43 This module is about text processing, be it for web-crawlers,
    44 compilers, dictionaries, DNA-data, ad filters and so on.  When looking for a
    44 compilers, dictionaries, DNA-data, ad filters and so on.  When looking
    45 particular string, like $abc$ in a large text we can use the
    45 for a particular string, like \pcode{"foobar"} in a large text we can
    46 Knuth-Morris-Pratt algorithm, which is currently the most efficient
    46 use the Knuth-Morris-Pratt algorithm, which is currently the most
    47 general string search algorithm. But often we do \emph{not} just look
    47 efficient general string search algorithm. But often we do \emph{not}
    48 for a particular string, but for string patterns. For example, in
    48 just look for a particular string, but for string patterns. For
    49 program code we need to identify what are the keywords (\texttt{if}, \texttt{then},
    49 example, in program code we need to identify what are the keywords
    50 \texttt{while}, \texttt{for}, etc), what are the identifiers (variable names). A pattern for
    50 (\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc), what
    51 identifiers could be stated as: they start with a letter, followed by
    51 are the identifiers (variable names). A pattern for identifiers could
    52 zero or more letters, numbers and underscores.  Often we also face the
    52 be stated as: they start with a letter, followed by zero or more
    53 problem that we are given a string (for example some user input) and
    53 letters, numbers and underscores. You might also be surprised what
    54 want to know whether it matches a particular pattern---be it an email
    54 constraints programming languages impose about numbers: for example
    55 address, for example. In this way we can exclude user input that would
    55 123 in JSON is OK, but 0123 is not.
    56 otherwise have nasty effects on our program (crashing it or making it
    56 
    57 go into an infinite loop, if not worse). In tools like Snort, scanning
    57 Often we also face the problem that we are given a string (for example
    58 for computer viruses or filtering out spam usually involves scanning
    58 some user input) and want to know whether it matches a particular
    59 for some signature (essentially a string pattern).  The point is that
    59 pattern---be it an email address, for example. In this way we can
    60 the fast Knuth-Morris-Pratt algorithm for strings is not good enough
    60 exclude user input that would otherwise have nasty effects on our
    61 for such string \emph{patterns}.\smallskip
    61 program (crashing it or making it go into an infinite loop, if not
       
    62 worse). In tools like Snort, scanning for computer viruses or
       
    63 filtering out spam usually involves scanning for some signature
       
    64 (essentially a string pattern).  The point is that the fast
       
    65 Knuth-Morris-Pratt algorithm for strings is not good enough for such
       
    66 string \emph{patterns}.\smallskip
    62 
    67 
    63 \defn{Regular expressions} help with conveniently specifying
    68 \defn{Regular expressions} help with conveniently specifying
    64 such patterns. The idea behind regular expressions is that
    69 such patterns. The idea behind regular expressions is that
    65 they are a simple method for describing languages (or sets of
    70 they are a simple method for describing languages (or sets of
    66 strings)\ldots at least languages we are interested in in
    71 strings)\ldots at least languages we are interested in in
   192 much ubiquitous in computer science. There are many libraries
   197 much ubiquitous in computer science. There are many libraries
   193 implementing regular expressions. I am sure you have come across them
   198 implementing regular expressions. I am sure you have come across them
   194 before (remember the PRA module?). Why on earth then is there any
   199 before (remember the PRA module?). Why on earth then is there any
   195 interest in studying them again in depth in this module? Well, one
   200 interest in studying them again in depth in this module? Well, one
   196 answer is in the following two graphs about regular expression
   201 answer is in the following two graphs about regular expression
   197 matching in Python, Ruby and Java (Version 8).
   202 matching in Python, Ruby, JavaScript and Java (Version 8).
   198 
   203 
   199 \begin{center}
   204 \begin{center}
   200 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
   205 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
   201 \begin{tikzpicture}
   206 \begin{tikzpicture}
   202 \begin{axis}[
   207 \begin{axis}[
   212     ytick={0,5,...,30},
   217     ytick={0,5,...,30},
   213     scaled ticks=false,
   218     scaled ticks=false,
   214     axis lines=left,
   219     axis lines=left,
   215     width=5.5cm,
   220     width=5.5cm,
   216     height=4.5cm, 
   221     height=4.5cm, 
   217     legend entries={Python, Java 8},  
   222     legend entries={Python, Java 8, JavaScript},  
   218     legend pos=north west,
   223     legend pos=north west,
   219     legend cell align=left]
   224     legend cell align=left]
   220 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
   225 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
   221 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
   226 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   227 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
   222 \end{axis}
   228 \end{axis}
   223 \end{tikzpicture}
   229 \end{tikzpicture}
   224 &
   230 &
   225 \begin{tikzpicture}
   231 \begin{tikzpicture}
   226 \begin{axis}[
   232 \begin{axis}[
   246 \end{axis}
   252 \end{axis}
   247 \end{tikzpicture}
   253 \end{tikzpicture}
   248 \end{tabular}
   254 \end{tabular}
   249 \end{center}
   255 \end{center}
   250 
   256 
   251 \noindent This first graph shows that Python and Java 8 need
   257 \noindent This first graph shows that Python, JavaScript and Java 8 need
   252 approximately 30 seconds to find out that the regular expression
   258 approximately 30 seconds to find out that the regular expression
   253 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
   259 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
   254 Similarly, the second shows that Python needs approximately 29 seconds
   260 Similarly, the second shows that Python needs approximately 29 seconds
   255 for finding out whether a string of 28 \texttt{a}s matches the regular
   261 for finding out whether a string of 28 \texttt{a}s matches the regular
   256 expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly
   262 expression \texttt{a?\{28\}\,a\{28\}}.  Ruby is even slightly