1 \documentclass{article} |
1 \documentclass{article} |
2 \usepackage{../style} |
2 \usepackage{../style} |
3 \usepackage{../langs} |
3 \usepackage{../langs} |
4 \usepackage{../graphics} |
4 \usepackage{../graphics} |
5 \usepackage{../data} |
5 \usepackage{../data} |
|
6 \usepackage{lstlinebgrd} |
|
7 \definecolor{capri}{rgb}{0.0, 0.75, 1.0} |
6 |
8 |
7 %%http://regexcrossword.com/challenges/cities/puzzles/1 |
9 %%http://regexcrossword.com/challenges/cities/puzzles/1 |
8 %%https://jex.im/regulex/ |
10 %%https://jex.im/regulex/ |
9 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/ |
11 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/ |
10 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/ |
12 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/ |
35 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} |
37 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017} |
36 |
38 |
37 \section*{Handout 1} |
39 \section*{Handout 1} |
38 |
40 |
39 This module is about text processing, be it for web-crawlers, |
41 This module is about text processing, be it for web-crawlers, |
40 compilers, dictionaries, DNA-data and so on. When looking for a |
42 compilers, dictionaries, DNA-data and so on. When looking for a |
41 particular string, like $abc$ in a large text we can use the |
43 particular string, like $abc$ in a large text we can use the |
42 Knuth-Morris-Pratt algorithm, which is currently the most efficient |
44 Knuth-Morris-Pratt algorithm, which is currently the most efficient |
43 general string search algorithm. But often we do \emph{not} just look |
45 general string search algorithm. But often we do \emph{not} just look |
44 for a particular string, but for string patterns. For example in |
46 for a particular string, but for string patterns. For example, in |
45 program code we need to identify what are the keywords (if, then, |
47 program code we need to identify what are the keywords (if, then, |
46 while, etc), what are the identifiers (variable names). A pattern for |
48 while, etc), what are the identifiers (variable names). A pattern for |
47 identifiers could be stated as: they start with a letter, followed by |
49 identifiers could be stated as: they start with a letter, followed by |
48 zero or more letters, numbers and underscores. Also often we face the |
50 zero or more letters, numbers and underscores. Often we also face the |
49 problem that we are given a string (for example some user input) and |
51 problem that we are given a string (for example some user input) and |
50 want to know whether it matches a particular pattern---be it an email |
52 want to know whether it matches a particular pattern---be it an email |
51 address, for example. In this way we can exclude user input that would |
53 address, for example. In this way we can exclude user input that would |
52 otherwise have nasty effects on our program (crashing it or making it |
54 otherwise have nasty effects on our program (crashing it or making it |
53 go into an infinite loop, if not worse). Scanning for computer viruses |
55 go into an infinite loop, if not worse). In tools like Snort, scanning |
54 or filtering out spam usually involves scanning for some signature |
56 for computer viruses or filtering out spam usually involves scanning |
55 (essentially a pattern). The point is that the fast |
57 for some signature (essentially a string pattern). The point is that |
56 Knuth-Morris-Pratt algorithm for strings is not good enough for such |
58 the fast Knuth-Morris-Pratt algorithm for strings is not good enough |
57 string \emph{patterns}.\smallskip |
59 for such string \emph{patterns}.\smallskip |
58 |
60 |
59 \defn{Regular expressions} help with conveniently specifying |
61 \defn{Regular expressions} help with conveniently specifying |
60 such patterns. The idea behind regular expressions is that |
62 such patterns. The idea behind regular expressions is that |
61 they are a simple method for describing languages (or sets of |
63 they are a simple method for describing languages (or sets of |
62 strings)\ldots at least languages we are interested in in |
64 strings)\ldots at least languages we are interested in in |
188 interest in studying them again in depth in this module? Well, one |
190 interest in studying them again in depth in this module? Well, one |
189 answer is in the following two graphs about regular expression |
191 answer is in the following two graphs about regular expression |
190 matching in Python, Ruby and Java. |
192 matching in Python, Ruby and Java. |
191 |
193 |
192 \begin{center} |
194 \begin{center} |
193 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}} |
195 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} |
|
196 \begin{tikzpicture} |
|
197 \begin{axis}[ |
|
198 title={Graph: $\texttt{(a*)*\,b}$ and strings |
|
199 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
|
200 xlabel={$n$}, |
|
201 x label style={at={(1.05,0.0)}}, |
|
202 ylabel={time in secs}, |
|
203 enlargelimits=false, |
|
204 xtick={0,5,...,30}, |
|
205 xmax=33, |
|
206 ymax=35, |
|
207 ytick={0,5,...,30}, |
|
208 scaled ticks=false, |
|
209 axis lines=left, |
|
210 width=5.5cm, |
|
211 height=4.5cm, |
|
212 legend entries={Python, Java}, |
|
213 legend pos=north west, |
|
214 legend cell align=left] |
|
215 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
216 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
217 \end{axis} |
|
218 \end{tikzpicture} |
|
219 & |
194 \begin{tikzpicture} |
220 \begin{tikzpicture} |
195 \begin{axis}[ |
221 \begin{axis}[ |
196 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
222 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
197 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
223 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
198 xlabel={$n$}, |
224 xlabel={$n$}, |
212 legend cell align=left] |
238 legend cell align=left] |
213 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data}; |
239 \addplot[blue,mark=*, mark options={fill=white}] table {re-python.data}; |
214 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; |
240 \addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; |
215 \end{axis} |
241 \end{axis} |
216 \end{tikzpicture} |
242 \end{tikzpicture} |
217 & |
|
218 \begin{tikzpicture} |
|
219 \begin{axis}[ |
|
220 title={Graph: $\texttt{(a*)*\,b}$ and strings |
|
221 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
|
222 xlabel={$n$}, |
|
223 x label style={at={(1.05,0.0)}}, |
|
224 ylabel={time in secs}, |
|
225 enlargelimits=false, |
|
226 xtick={0,5,...,30}, |
|
227 xmax=33, |
|
228 ymax=35, |
|
229 ytick={0,5,...,30}, |
|
230 scaled ticks=false, |
|
231 axis lines=left, |
|
232 width=5.5cm, |
|
233 height=4.5cm, |
|
234 legend entries={Python, Java}, |
|
235 legend pos=north west, |
|
236 legend cell align=left] |
|
237 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
238 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
239 \end{axis} |
|
240 \end{tikzpicture} |
|
241 \end{tabular} |
243 \end{tabular} |
242 \end{center} |
244 \end{center} |
243 |
245 |
244 \noindent This first graph shows that Python needs approximately 29 |
246 \noindent This first graph shows that Python and Java need |
245 seconds for finding out whether a string of 28 \texttt{a}s matches the |
247 approximately 30 seconds to find out that the regular expression |
246 regular expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly |
248 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. |
|
249 Similarly, the second shows that Python needs approximately 29 seconds |
|
250 for finding out whether a string of 28 \texttt{a}s matches the regular |
|
251 expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly |
247 worse.\footnote{In this example Ruby uses the slightly different |
252 worse.\footnote{In this example Ruby uses the slightly different |
248 regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the |
253 regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the |
249 \texttt{a?} and \texttt{a} each occur $n$ times. More such test |
254 \texttt{a?} and \texttt{a} each occur $n$ times. More such test |
250 cases can be found at \url{http://www.computerbytesman.com/redos/}.} |
255 cases can be found at \url{http://www.computerbytesman.com/redos/}.} |
251 Simlarly, Python and Java needs approximately 30 seconds to find out |
256 Admittedly, these regular expressions are carefully chosen to exhibit |
252 that the regular expression $\texttt{(a*)*\,b}$ does not match strings |
257 this exponential behaviour, but similar ones occur more often than one |
253 of 28 \texttt{a}s. Admittedly, these regular expressions are |
258 wants in ``real life''. For example, on 20 July 2016 a similar regular |
254 carefully chosen to exhibit this exponential behaviour, but similar |
259 expression brought the webpage \href{http://stackexchange.com}{Stack |
255 ones occur more often than one wants in ``real life''. For example, on |
260 Exchange} to its knees: |
256 20 July 2016 a similar regular expression brought the webpage |
|
257 \href{http://stackexchange.com}{Stack Exchange} to its knees: |
|
258 |
261 |
259 \begin{center} |
262 \begin{center} |
260 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
263 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
261 \end{center} |
264 \end{center} |
262 |
265 |
291 out why Python and Ruby (and others) behave so badly when |
294 out why Python and Ruby (and others) behave so badly when |
292 matching strings with evil regular expressions. But we will also look |
295 matching strings with evil regular expressions. But we will also look |
293 at a relatively simple algorithm that solves this problem much |
296 at a relatively simple algorithm that solves this problem much |
294 better than Python and Ruby do\ldots actually it will be two |
297 better than Python and Ruby do\ldots actually it will be two |
295 versions of the algorithm: the first one will be able to |
298 versions of the algorithm: the first one will be able to |
296 process strings of approximately 1,000 \texttt{a}s in 30 |
299 process strings of approximately 1,100 \texttt{a}s in 23 |
297 seconds, while the second version will even be able to process |
300 seconds, while the second version will even be able to process |
298 up to 7,000(!) in 30 seconds, see the graph below: |
301 up to 11,000(!) in 5 seconds, see the graph below: |
299 |
302 |
300 \begin{center} |
303 \begin{center} |
301 \begin{tikzpicture} |
304 \begin{tikzpicture} |
302 \begin{axis}[ |
305 \begin{axis}[ |
303 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
306 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
304 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
307 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
305 xlabel={$n$}, |
308 xlabel={$n$}, |
306 x label style={at={(1.05,0.0)}}, |
309 x label style={at={(1.05,0.0)}}, |
307 ylabel={time in secs}, |
310 ylabel={time in secs}, |
308 enlargelimits=false, |
311 enlargelimits=false, |
309 xtick={0,3000,...,9000}, |
312 xtick={0,3000,...,12000}, |
310 xmax=10000, |
313 xmax=13000, |
311 ymax=32, |
314 ymax=32, |
312 ytick={0,5,...,30}, |
315 ytick={0,5,...,30}, |
313 scaled ticks=false, |
316 scaled ticks=false, |
314 axis lines=left, |
317 axis lines=left, |
315 width=7cm, |
318 width=7cm, |
694 best solution is to use regular expressions; now you have two |
697 best solution is to use regular expressions; now you have two |
695 problems.'' |
698 problems.'' |
696 \end{quote} |
699 \end{quote} |
697 |
700 |
698 |
701 |
699 \begin{figure}[p] |
702 \begin{figure}[p]\small |
700 \lstinputlisting{../progs/crawler1.scala} |
703 \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
704 {../progs/crawler1.scala} |
701 |
705 |
702 \caption{The Scala code for a simple web-crawler that checks |
706 \caption{The Scala code for a simple web-crawler that checks |
703 for broken links in a web-page. It uses the regular expression |
707 for broken links in a web-page. It uses the regular expression |
704 \texttt{http\_pattern} in Line~\ref{httpline} for recognising |
708 \texttt{http\_pattern} in Line~\ref{httpline} for recognising |
705 URL-addresses. It finds all links using the library function |
709 URL-addresses. It finds all links using the library function |
706 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}} |
710 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}} |
707 |
711 |
708 \end{figure} |
712 \end{figure} |
709 |
713 |
710 |
714 |
711 \begin{figure}[p] |
715 |
712 \lstinputlisting{../progs/crawler2.scala} |
716 \begin{figure}[p]\small |
|
717 \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
718 {../progs/crawler2.scala} |
713 |
719 |
714 \caption{A version of the web-crawler that only follows links |
720 \caption{A version of the web-crawler that only follows links |
715 in ``my'' domain---since these are the ones I am interested in |
721 in ``my'' domain---since these are the ones I am interested in |
716 to fix. It uses the regular expression \texttt{my\_urls} in |
722 to fix. It uses the regular expression \texttt{my\_urls} in |
717 Line~\ref{myurlline} to check for my name in the links. The |
723 Line~\ref{myurlline} to check for my name in the links. The |
720 is a test whether URL is in ``my'' domain or |
726 is a test whether URL is in ``my'' domain or |
721 not.\label{crawler2}} |
727 not.\label{crawler2}} |
722 |
728 |
723 \end{figure} |
729 \end{figure} |
724 |
730 |
725 \begin{figure}[p] |
731 \begin{figure}[p]\small |
726 \lstinputlisting{../progs/crawler3.scala} |
732 \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
733 {../progs/crawler3.scala} |
727 |
734 |
728 \caption{A small email harvester---whenever we download a |
735 \caption{A small email harvester---whenever we download a |
729 web-page, we also check whether it contains any email |
736 web-page, we also check whether it contains any email |
730 addresses. For this we use the regular expression |
737 addresses. For this we use the regular expression |
731 \texttt{email\_pattern} in Line~\ref{emailline}. The main |
738 \texttt{email\_pattern} in Line~\ref{emailline}. The main |