handouts/ho01.tex
changeset 492 39b7ff2cf1bc
parent 477 b78664a24f5d
child 496 5c9de27a5b30
equal deleted inserted replaced
491:d5776c6018f0 492:39b7ff2cf1bc
    37 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
    37 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
    38 
    38 
    39 \section*{Handout 1}
    39 \section*{Handout 1}
    40 
    40 
    41 This module is about text processing, be it for web-crawlers,
    41 This module is about text processing, be it for web-crawlers,
    42 compilers, dictionaries, DNA-data and so on.  When looking for a
    42 compilers, dictionaries, DNA-data, ad filters and so on.  When looking for a
    43 particular string, like $abc$ in a large text we can use the
    43 particular string, like $abc$ in a large text we can use the
    44 Knuth-Morris-Pratt algorithm, which is currently the most efficient
    44 Knuth-Morris-Pratt algorithm, which is currently the most efficient
    45 general string search algorithm. But often we do \emph{not} just look
    45 general string search algorithm. But often we do \emph{not} just look
    46 for a particular string, but for string patterns. For example, in
    46 for a particular string, but for string patterns. For example, in
    47 program code we need to identify what are the keywords (if, then,
    47 program code we need to identify what are the keywords (if, then,
    48 while, etc), what are the identifiers (variable names). A pattern for
    48 while, for, etc), what are the identifiers (variable names). A pattern for
    49 identifiers could be stated as: they start with a letter, followed by
    49 identifiers could be stated as: they start with a letter, followed by
    50 zero or more letters, numbers and underscores.  Often we also face the
    50 zero or more letters, numbers and underscores.  Often we also face the
    51 problem that we are given a string (for example some user input) and
    51 problem that we are given a string (for example some user input) and
    52 want to know whether it matches a particular pattern---be it an email
    52 want to know whether it matches a particular pattern---be it an email
    53 address, for example. In this way we can exclude user input that would
    53 address, for example. In this way we can exclude user input that would
   534 
   534 
   535 The point of the definition of $L$ is that we can use it to
   535 The point of the definition of $L$ is that we can use it to
   536 precisely specify when a string $s$ is matched by a regular
   536 precisely specify when a string $s$ is matched by a regular
   537 expression $r$, namely if and only if $s \in L(r)$. In fact we
   537 expression $r$, namely if and only if $s \in L(r)$. In fact we
   538 will write a program \pcode{match} that takes any string $s$
   538 will write a program \pcode{match} that takes any string $s$
   539 and any regular expression $r$ as argument and returns
   539 and any regular expression $r$ as arguments and returns
   540 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
   540 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
   541 L(r)$. We leave this for the next lecture.
   541 L(r)$. We leave this for the next lecture.
   542 
   542 
   543 There is one more feature of regular expressions that is worth
   543 There is one more feature of regular expressions that is worth
   544 mentioning. Given some strings, there are in general many
   544 mentioning. Given some strings, there are in general many
   639 sometimes contains hidden snares. They have practical importance
   639 sometimes contains hidden snares. They have practical importance
   640 (remember the shocking runtime of the regular expression matchers in
   640 (remember the shocking runtime of the regular expression matchers in
   641 Python, Ruby and Java in some instances and the problems in Stack
   641 Python, Ruby and Java in some instances and the problems in Stack
   642 Exchange and the Atom editor).  People who are not very familiar with
   642 Exchange and the Atom editor).  People who are not very familiar with
   643 the mathematical background of regular expressions get them
   643 the mathematical background of regular expressions get them
   644 consistently wrong (surprising given they are a supposed to be core
   644 consistently wrong (this is surprising given they are a supposed to be
   645 skill for computer scientists). The hope is that we can do better in
   645 core skill for computer scientists). The hope is that we can do better
   646 the future---for example by proving that the algorithms actually
   646 in the future---for example by proving that the algorithms actually
   647 satisfy their specification and that the corresponding implementations
   647 satisfy their specification and that the corresponding implementations
   648 do not contain any bugs. We are close, but not yet quite there.
   648 do not contain any bugs. We are close, but not yet quite there.
   649 
   649 
   650 Notwithstanding my fascination, I am also happy to admit that regular
   650 Notwithstanding my fascination, I am also happy to admit that regular
   651 expressions have their shortcomings. There are some well-known
   651 expressions have their shortcomings. There are some well-known
   652 ``theoretical'' shortcomings, for example recognising strings of the
   652 ``theoretical'' shortcomings, for example recognising strings of the
   653 form $a^{n}b^{n}$ is not possible with regular expressions. This means
   653 form $a^{n}b^{n}$ is not possible with regular expressions. This means
   654 for example if we try to regognise whether parentheses are well-nested
   654 for example if we try to regognise whether parentheses are well-nested
   655 is impossible with (basic) regular expressions.  I am not so bothered
   655 in an expression is impossible with (basic) regular expressions.  I am
   656 by these shortcomings. What I am bothered about is when regular
   656 not so bothered by these shortcomings. What I am bothered about is
   657 expressions are in the way of practical programming. For example, it
   657 when regular expressions are in the way of practical programming. For
   658 turns out that the regular expression for email addresses shown in
   658 example, it turns out that the regular expression for email addresses
   659 \eqref{email} is hopelessly inadequate for recognising all of them
   659 shown in \eqref{email} is hopelessly inadequate for recognising all of
   660 (despite being touted as something every computer scientist should
   660 them (despite being touted as something every computer scientist
   661 know about). The W3 Consortium (which standardises the Web) proposes
   661 should know about). The W3 Consortium (which standardises the Web)
   662 to use the following, already more complicated regular expressions for
   662 proposes to use the following, already more complicated regular
   663 email addresses:
   663 expressions for email addresses:
   664 
   664 
   665 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
   665 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
   666 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
   666 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
   667 \end{lstlisting}}
   667 \end{lstlisting}}
   668 
   668