1 \documentclass{article} |
1 \documentclass{article} |
2 \usepackage{../style} |
2 \usepackage{../style} |
3 \usepackage{../langs} |
3 \usepackage{../langs} |
4 \usepackage{../graphics} |
4 \usepackage{../graphics} |
5 \usepackage{../data} |
5 \usepackage{../data} |
|
6 \usepackage{lstlinebgrd} |
|
7 \definecolor{capri}{rgb}{0.0, 0.75, 1.0} |
6 |
8 |
7 %%http://regexcrossword.com/challenges/cities/puzzles/1 |
9 %%http://regexcrossword.com/challenges/cities/puzzles/1 |
8 %%https://jex.im/regulex/ |
10 %%https://jex.im/regulex/ |
9 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/ |
11 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/ |
10 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/ |
12 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/ |
35 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} |
39 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} |
36 |
40 |
37 \section*{Handout 1} |
41 \section*{Handout 1} |
38 |
42 |
39 This module is about text processing, be it for web-crawlers, |
43 This module is about text processing, be it for web-crawlers, |
40 compilers, dictionaries, DNA-data and so on. When looking for a |
44 compilers, dictionaries, DNA-data, ad filters and so on. When looking for a |
41 particular string, like $abc$ in a large text we can use the |
45 particular string, like $abc$ in a large text we can use the |
42 Knuth-Morris-Pratt algorithm, which is currently the most efficient |
46 Knuth-Morris-Pratt algorithm, which is currently the most efficient |
43 general string search algorithm. But often we do \emph{not} just look |
47 general string search algorithm. But often we do \emph{not} just look |
44 for a particular string, but for string patterns. For example in |
48 for a particular string, but for string patterns. For example, in |
45 program code we need to identify what are the keywords (if, then, |
49 program code we need to identify what are the keywords (if, then, |
46 while, etc), what are the identifiers (variable names). A pattern for |
50 while, for, etc), what are the identifiers (variable names). A pattern for |
47 identifiers could be stated as: they start with a letter, followed by |
51 identifiers could be stated as: they start with a letter, followed by |
48 zero or more letters, numbers and underscores. Also often we face the |
52 zero or more letters, numbers and underscores. Often we also face the |
49 problem that we are given a string (for example some user input) and |
53 problem that we are given a string (for example some user input) and |
50 want to know whether it matches a particular pattern---be it an email |
54 want to know whether it matches a particular pattern---be it an email |
51 address, for example. In this way we can exclude user input that would |
55 address, for example. In this way we can exclude user input that would |
52 otherwise have nasty effects on our program (crashing it or making it |
56 otherwise have nasty effects on our program (crashing it or making it |
53 go into an infinite loop, if not worse). The point is that the fast |
57 go into an infinite loop, if not worse). In tools like Snort, scanning |
54 Knuth-Morris-Pratt algorithm for strings is not good enough for such |
58 for computer viruses or filtering out spam usually involves scanning |
55 string patterns.\smallskip |
59 for some signature (essentially a string pattern). The point is that |
|
60 the fast Knuth-Morris-Pratt algorithm for strings is not good enough |
|
61 for such string \emph{patterns}.\smallskip |
56 |
62 |
57 \defn{Regular expressions} help with conveniently specifying |
63 \defn{Regular expressions} help with conveniently specifying |
58 such patterns. The idea behind regular expressions is that |
64 such patterns. The idea behind regular expressions is that |
59 they are a simple method for describing languages (or sets of |
65 they are a simple method for describing languages (or sets of |
60 strings)\ldots at least languages we are interested in in |
66 strings)\ldots at least languages we are interested in in |
186 interest in studying them again in depth in this module? Well, one |
192 interest in studying them again in depth in this module? Well, one |
187 answer is in the following two graphs about regular expression |
193 answer is in the following two graphs about regular expression |
188 matching in Python, Ruby and Java. |
194 matching in Python, Ruby and Java. |
189 |
195 |
190 \begin{center} |
196 \begin{center} |
191 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}} |
197 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} |
|
198 \begin{tikzpicture} |
|
199 \begin{axis}[ |
|
200 title={Graph: $\texttt{(a*)*\,b}$ and strings |
|
201 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
|
202 xlabel={$n$}, |
|
203 x label style={at={(1.05,0.0)}}, |
|
204 ylabel={time in secs}, |
|
205 enlargelimits=false, |
|
206 xtick={0,5,...,30}, |
|
207 xmax=33, |
|
208 ymax=35, |
|
209 ytick={0,5,...,30}, |
|
210 scaled ticks=false, |
|
211 axis lines=left, |
|
212 width=5.5cm, |
|
213 height=4.5cm, |
|
214 legend entries={Python, Java}, |
|
215 legend pos=north west, |
|
216 legend cell align=left] |
|
217 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
218 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
219 \end{axis} |
|
220 \end{tikzpicture} |
|
221 & |
192 \begin{tikzpicture} |
222 \begin{tikzpicture} |
193 \begin{axis}[ |
223 \begin{axis}[ |
194 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
224 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
195 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
225 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
196 xlabel={$n$}, |
226 xlabel={$n$}, |
210 legend cell align=left] |
240 legend cell align=left] |
211 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data}; |
241 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data}; |
212 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; |
242 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; |
213 \end{axis} |
243 \end{axis} |
214 \end{tikzpicture} |
244 \end{tikzpicture} |
215 & |
|
216 \begin{tikzpicture} |
|
217 \begin{axis}[ |
|
218 title={Graph: $\texttt{(a*)*\,b}$ and strings |
|
219 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
|
220 xlabel={$n$}, |
|
221 x label style={at={(1.05,0.0)}}, |
|
222 ylabel={time in secs}, |
|
223 enlargelimits=false, |
|
224 xtick={0,5,...,30}, |
|
225 xmax=33, |
|
226 ymax=35, |
|
227 ytick={0,5,...,30}, |
|
228 scaled ticks=false, |
|
229 axis lines=left, |
|
230 width=5.5cm, |
|
231 height=4.5cm, |
|
232 legend entries={Python, Java}, |
|
233 legend pos=north west, |
|
234 legend cell align=left] |
|
235 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
236 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
237 \end{axis} |
|
238 \end{tikzpicture} |
|
239 \end{tabular} |
245 \end{tabular} |
240 \end{center} |
246 \end{center} |
241 |
247 |
242 \noindent This first graph shows that Python needs approximately 29 |
248 \noindent This first graph shows that Python and Java need |
243 seconds for finding out whether a string of 28 \texttt{a}s matches the |
249 approximately 30 seconds to find out that the regular expression |
244 regular expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly |
250 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. |
|
251 Similarly, the second shows that Python needs approximately 29 seconds |
|
252 for finding out whether a string of 28 \texttt{a}s matches the regular |
|
253 expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly |
245 worse.\footnote{In this example Ruby uses the slightly different |
254 worse.\footnote{In this example Ruby uses the slightly different |
246 regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the |
255 regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the |
247 \texttt{a?} and \texttt{a} each occur $n$ times. More such test |
256 \texttt{a?} and \texttt{a} each occur $n$ times. More such test |
248 cases can be found at \url{http://www.computerbytesman.com/redos/}.} |
257 cases can be found at \url{http://www.computerbytesman.com/redos/}.} |
249 Simlarly, Python and Java needs approximately 30 seconds to find out |
258 Admittedly, these regular expressions are carefully chosen to exhibit |
250 that the regular expression $\texttt{(a*)*\,b}$ does not match strings |
259 this exponential behaviour, but similar ones occur more often than one |
251 of 28 \texttt{a}s. Admittedly, these regular expressions are |
260 wants in ``real life''. For example, on 20 July 2016 a similar regular |
252 carefully chosen to exhibit this exponential behaviour, but similar |
261 expression brought the webpage \href{http://stackexchange.com}{Stack |
253 ones occur more often than one wants in ``real life''. For example, on |
262 Exchange} to its knees: |
254 20 July 2016 a similar regular expression brought the webpage |
|
255 \href{http://stackexchange.com}{Stack Exchange} to its knees: |
|
256 |
263 |
257 \begin{center} |
264 \begin{center} |
258 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
265 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
259 \end{center} |
266 \end{center} |
260 |
267 |
289 out why Python and Ruby (and others) behave so badly when |
296 out why Python and Ruby (and others) behave so badly when |
290 matching strings with evil regular expressions. But we will also look |
297 matching strings with evil regular expressions. But we will also look |
291 at a relatively simple algorithm that solves this problem much |
298 at a relatively simple algorithm that solves this problem much |
292 better than Python and Ruby do\ldots actually it will be two |
299 better than Python and Ruby do\ldots actually it will be two |
293 versions of the algorithm: the first one will be able to |
300 versions of the algorithm: the first one will be able to |
294 process strings of approximately 1,000 \texttt{a}s in 30 |
301 process strings of approximately 1,100 \texttt{a}s in 23 |
295 seconds, while the second version will even be able to process |
302 seconds, while the second version will even be able to process |
296 up to 7,000(!) in 30 seconds, see the graph below: |
303 up to 11,000(!) in 5 seconds, see the graph below: |
297 |
304 |
298 \begin{center} |
305 \begin{center} |
299 \begin{tikzpicture} |
306 \begin{tikzpicture} |
300 \begin{axis}[ |
307 \begin{axis}[ |
301 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
308 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
302 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
309 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
303 xlabel={$n$}, |
310 xlabel={$n$}, |
304 x label style={at={(1.05,0.0)}}, |
311 x label style={at={(1.05,0.0)}}, |
305 ylabel={time in secs}, |
312 ylabel={time in secs}, |
306 enlargelimits=false, |
313 enlargelimits=false, |
307 xtick={0,3000,...,9000}, |
314 xtick={0,3000,...,12000}, |
308 xmax=10000, |
315 xmax=13000, |
309 ymax=32, |
316 ymax=32, |
310 ytick={0,5,...,30}, |
317 ytick={0,5,...,30}, |
311 scaled ticks=false, |
318 scaled ticks=false, |
312 axis lines=left, |
319 axis lines=left, |
313 width=7cm, |
320 width=7cm, |
529 |
536 |
530 The point of the definition of $L$ is that we can use it to |
537 The point of the definition of $L$ is that we can use it to |
531 precisely specify when a string $s$ is matched by a regular |
538 precisely specify when a string $s$ is matched by a regular |
532 expression $r$, namely if and only if $s \in L(r)$. In fact we |
539 expression $r$, namely if and only if $s \in L(r)$. In fact we |
533 will write a program \pcode{match} that takes any string $s$ |
540 will write a program \pcode{match} that takes any string $s$ |
534 and any regular expression $r$ as argument and returns |
541 and any regular expression $r$ as arguments and returns |
535 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in |
542 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in |
536 L(r)$. We leave this for the next lecture. |
543 L(r)$. We leave this for the next lecture. |
537 |
544 |
538 There is one more feature of regular expressions that is worth |
545 There is one more feature of regular expressions that is worth |
539 mentioning. Given some strings, there are in general many |
546 mentioning. Given some strings, there are in general many |
634 sometimes contains hidden snares. They have practical importance |
641 sometimes contains hidden snares. They have practical importance |
635 (remember the shocking runtime of the regular expression matchers in |
642 (remember the shocking runtime of the regular expression matchers in |
636 Python, Ruby and Java in some instances and the problems in Stack |
643 Python, Ruby and Java in some instances and the problems in Stack |
637 Exchange and the Atom editor). People who are not very familiar with |
644 Exchange and the Atom editor). People who are not very familiar with |
638 the mathematical background of regular expressions get them |
645 the mathematical background of regular expressions get them |
639 consistently wrong (surprising given they are a supposed to be core |
646 consistently wrong (this is surprising given they are a supposed to be |
640 skill for computer scientists). The hope is that we can do better in |
647 core skill for computer scientists). The hope is that we can do better |
641 the future---for example by proving that the algorithms actually |
648 in the future---for example by proving that the algorithms actually |
642 satisfy their specification and that the corresponding implementations |
649 satisfy their specification and that the corresponding implementations |
643 do not contain any bugs. We are close, but not yet quite there. |
650 do not contain any bugs. We are close, but not yet quite there. |
644 |
651 |
645 Notwithstanding my fascination, I am also happy to admit that regular |
652 Notwithstanding my fascination, I am also happy to admit that regular |
646 expressions have their shortcomings. There are some well-known |
653 expressions have their shortcomings. There are some well-known |
647 ``theoretical'' shortcomings, for example recognising strings of the |
654 ``theoretical'' shortcomings, for example recognising strings of the |
648 form $a^{n}b^{n}$ is not possible with regular expressions. This means |
655 form $a^{n}b^{n}$ is not possible with regular expressions. This means |
649 for example if we try to regognise whether parentheses are well-nested |
656 for example if we try to regognise whether parentheses are well-nested |
650 is impossible with (basic) regular expressions. I am not so bothered |
657 in an expression is impossible with (basic) regular expressions. I am |
651 by these shortcomings. What I am bothered about is when regular |
658 not so bothered by these shortcomings. What I am bothered about is |
652 expressions are in the way of practical programming. For example, it |
659 when regular expressions are in the way of practical programming. For |
653 turns out that the regular expression for email addresses shown in |
660 example, it turns out that the regular expression for email addresses |
654 \eqref{email} is hopelessly inadequate for recognising all of them |
661 shown in \eqref{email} is hopelessly inadequate for recognising all of |
655 (despite being touted as something every computer scientist should |
662 them (despite being touted as something every computer scientist |
656 know about). The W3 Consortium (which standardises the Web) proposes |
663 should know about). The W3 Consortium (which standardises the Web) |
657 to use the following, already more complicated regular expressions for |
664 proposes to use the following, already more complicated regular |
658 email addresses: |
665 expressions for email addresses: |
659 |
666 |
660 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] |
667 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] |
661 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* |
668 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* |
662 \end{lstlisting}} |
669 \end{lstlisting}} |
663 |
670 |
680 |
687 |
681 \noindent which explains some of the crazier parts of email |
688 \noindent which explains some of the crazier parts of email |
682 addresses. Still it is good to know that some tasks in text |
689 addresses. Still it is good to know that some tasks in text |
683 processing just cannot be achieved by using regular |
690 processing just cannot be achieved by using regular |
684 expressions. But for what we want to use them (lexing) they are |
691 expressions. But for what we want to use them (lexing) they are |
685 pretty good. |
692 pretty good.\medskip |
686 |
693 |
687 |
694 \noindent |
688 \begin{figure}[p] |
695 Finally there is a joke about regular expressions: |
689 \lstinputlisting{../progs/crawler1.scala} |
696 |
|
697 \begin{quote}\it |
|
698 ``Sometimes you have a programming problem and it seems like the |
|
699 best solution is to use regular expressions; now you have two |
|
700 problems.'' |
|
701 \end{quote} |
|
702 |
|
703 |
|
704 \begin{figure}[p]\small |
|
705 \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
706 {../progs/crawler1.scala} |
690 |
707 |
691 \caption{The Scala code for a simple web-crawler that checks |
708 \caption{The Scala code for a simple web-crawler that checks |
692 for broken links in a web-page. It uses the regular expression |
709 for broken links in a web-page. It uses the regular expression |
693 \texttt{http\_pattern} in Line~\ref{httpline} for recognising |
710 \texttt{http\_pattern} in Line~\ref{httpline} for recognising |
694 URL-addresses. It finds all links using the library function |
711 URL-addresses. It finds all links using the library function |
695 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}} |
712 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}} |
696 |
713 |
697 \end{figure} |
714 \end{figure} |
698 |
715 |
699 |
716 |
700 \begin{figure}[p] |
717 |
701 \lstinputlisting{../progs/crawler2.scala} |
718 \begin{figure}[p]\small |
|
719 \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
720 {../progs/crawler2.scala} |
702 |
721 |
703 \caption{A version of the web-crawler that only follows links |
722 \caption{A version of the web-crawler that only follows links |
704 in ``my'' domain---since these are the ones I am interested in |
723 in ``my'' domain---since these are the ones I am interested in |
705 to fix. It uses the regular expression \texttt{my\_urls} in |
724 to fix. It uses the regular expression \texttt{my\_urls} in |
706 Line~\ref{myurlline} to check for my name in the links. The |
725 Line~\ref{myurlline} to check for my name in the links. The |
709 is a test whether URL is in ``my'' domain or |
728 is a test whether URL is in ``my'' domain or |
710 not.\label{crawler2}} |
729 not.\label{crawler2}} |
711 |
730 |
712 \end{figure} |
731 \end{figure} |
713 |
732 |
714 \begin{figure}[p] |
733 \begin{figure}[p]\small |
715 \lstinputlisting{../progs/crawler3.scala} |
734 \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
735 {../progs/crawler3.scala} |
716 |
736 |
717 \caption{A small email harvester---whenever we download a |
737 \caption{A small email harvester---whenever we download a |
718 web-page, we also check whether it contains any email |
738 web-page, we also check whether it contains any email |
719 addresses. For this we use the regular expression |
739 addresses. For this we use the regular expression |
720 \texttt{email\_pattern} in Line~\ref{emailline}. The main |
740 \texttt{email\_pattern} in Line~\ref{emailline}. The main |