37 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} |
37 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} |
38 |
38 |
39 \section*{Handout 1} |
39 \section*{Handout 1} |
40 |
40 |
41 This module is about text processing, be it for web-crawlers, |
41 This module is about text processing, be it for web-crawlers, |
42 compilers, dictionaries, DNA-data and so on. When looking for a |
42 compilers, dictionaries, DNA-data, ad filters and so on. When looking for a |
43 particular string, like $abc$ in a large text we can use the |
43 particular string, like $abc$ in a large text we can use the |
44 Knuth-Morris-Pratt algorithm, which is currently the most efficient |
44 Knuth-Morris-Pratt algorithm, which is currently the most efficient |
45 general string search algorithm. But often we do \emph{not} just look |
45 general string search algorithm. But often we do \emph{not} just look |
46 for a particular string, but for string patterns. For example, in |
46 for a particular string, but for string patterns. For example, in |
47 program code we need to identify what are the keywords (if, then, |
47 program code we need to identify what are the keywords (if, then, |
48 while, etc), what are the identifiers (variable names). A pattern for |
48 while, for, etc), what are the identifiers (variable names). A pattern for |
49 identifiers could be stated as: they start with a letter, followed by |
49 identifiers could be stated as: they start with a letter, followed by |
50 zero or more letters, numbers and underscores. Often we also face the |
50 zero or more letters, numbers and underscores. Often we also face the |
51 problem that we are given a string (for example some user input) and |
51 problem that we are given a string (for example some user input) and |
52 want to know whether it matches a particular pattern---be it an email |
52 want to know whether it matches a particular pattern---be it an email |
53 address, for example. In this way we can exclude user input that would |
53 address, for example. In this way we can exclude user input that would |
534 |
534 |
535 The point of the definition of $L$ is that we can use it to |
535 The point of the definition of $L$ is that we can use it to |
536 precisely specify when a string $s$ is matched by a regular |
536 precisely specify when a string $s$ is matched by a regular |
537 expression $r$, namely if and only if $s \in L(r)$. In fact we |
537 expression $r$, namely if and only if $s \in L(r)$. In fact we |
538 will write a program \pcode{match} that takes any string $s$ |
538 will write a program \pcode{match} that takes any string $s$ |
539 and any regular expression $r$ as argument and returns |
539 and any regular expression $r$ as arguments and returns |
540 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in |
540 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in |
541 L(r)$. We leave this for the next lecture. |
541 L(r)$. We leave this for the next lecture. |
542 |
542 |
543 There is one more feature of regular expressions that is worth |
543 There is one more feature of regular expressions that is worth |
544 mentioning. Given some strings, there are in general many |
544 mentioning. Given some strings, there are in general many |
639 sometimes contains hidden snares. They have practical importance |
639 sometimes contains hidden snares. They have practical importance |
640 (remember the shocking runtime of the regular expression matchers in |
640 (remember the shocking runtime of the regular expression matchers in |
641 Python, Ruby and Java in some instances and the problems in Stack |
641 Python, Ruby and Java in some instances and the problems in Stack |
642 Exchange and the Atom editor). People who are not very familiar with |
642 Exchange and the Atom editor). People who are not very familiar with |
643 the mathematical background of regular expressions get them |
643 the mathematical background of regular expressions get them |
644 consistently wrong (surprising given they are a supposed to be core |
644 consistently wrong (this is surprising given they are a supposed to be |
645 skill for computer scientists). The hope is that we can do better in |
645 core skill for computer scientists). The hope is that we can do better |
646 the future---for example by proving that the algorithms actually |
646 in the future---for example by proving that the algorithms actually |
647 satisfy their specification and that the corresponding implementations |
647 satisfy their specification and that the corresponding implementations |
648 do not contain any bugs. We are close, but not yet quite there. |
648 do not contain any bugs. We are close, but not yet quite there. |
649 |
649 |
650 Notwithstanding my fascination, I am also happy to admit that regular |
650 Notwithstanding my fascination, I am also happy to admit that regular |
651 expressions have their shortcomings. There are some well-known |
651 expressions have their shortcomings. There are some well-known |
652 ``theoretical'' shortcomings, for example recognising strings of the |
652 ``theoretical'' shortcomings, for example recognising strings of the |
653 form $a^{n}b^{n}$ is not possible with regular expressions. This means |
653 form $a^{n}b^{n}$ is not possible with regular expressions. This means |
654 for example if we try to regognise whether parentheses are well-nested |
654 for example if we try to regognise whether parentheses are well-nested |
655 is impossible with (basic) regular expressions. I am not so bothered |
655 in an expression is impossible with (basic) regular expressions. I am |
656 by these shortcomings. What I am bothered about is when regular |
656 not so bothered by these shortcomings. What I am bothered about is |
657 expressions are in the way of practical programming. For example, it |
657 when regular expressions are in the way of practical programming. For |
658 turns out that the regular expression for email addresses shown in |
658 example, it turns out that the regular expression for email addresses |
659 \eqref{email} is hopelessly inadequate for recognising all of them |
659 shown in \eqref{email} is hopelessly inadequate for recognising all of |
660 (despite being touted as something every computer scientist should |
660 them (despite being touted as something every computer scientist |
661 know about). The W3 Consortium (which standardises the Web) proposes |
661 should know about). The W3 Consortium (which standardises the Web) |
662 to use the following, already more complicated regular expressions for |
662 proposes to use the following, already more complicated regular |
663 email addresses: |
663 expressions for email addresses: |
664 |
664 |
665 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] |
665 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] |
666 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* |
666 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* |
667 \end{lstlisting}} |
667 \end{lstlisting}} |
668 |
668 |