34 % compiler explorer |
34 % compiler explorer |
35 % https://gcc.godbolt.org |
35 % https://gcc.godbolt.org |
36 |
36 |
37 %https://www.youtube.com/watch?v=gmhMQfFQu20 |
37 %https://www.youtube.com/watch?v=gmhMQfFQu20 |
38 \begin{document} |
38 \begin{document} |
39 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018} |
39 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019} |
40 |
40 |
41 \section*{Handout 1} |
41 \section*{Handout 1} |
42 |
42 |
43 This module is about text processing, be it for web-crawlers, |
43 This module is about text processing, be it for web-crawlers, |
44 compilers, dictionaries, DNA-data, ad filters and so on. When looking for a |
44 compilers, dictionaries, DNA-data, ad filters and so on. When looking |
45 particular string, like $abc$ in a large text we can use the |
45 for a particular string, like \pcode{"foobar"} in a large text we can |
46 Knuth-Morris-Pratt algorithm, which is currently the most efficient |
46 use the Knuth-Morris-Pratt algorithm, which is currently the most |
47 general string search algorithm. But often we do \emph{not} just look |
47 efficient general string search algorithm. But often we do \emph{not} |
48 for a particular string, but for string patterns. For example, in |
48 just look for a particular string, but for string patterns. For |
49 program code we need to identify what are the keywords (\texttt{if}, \texttt{then}, |
49 example, in program code we need to identify what are the keywords |
50 \texttt{while}, \texttt{for}, etc), what are the identifiers (variable names). A pattern for |
50 (\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc), what |
51 identifiers could be stated as: they start with a letter, followed by |
51 are the identifiers (variable names). A pattern for identifiers could |
52 zero or more letters, numbers and underscores. Often we also face the |
52 be stated as: they start with a letter, followed by zero or more |
53 problem that we are given a string (for example some user input) and |
53 letters, numbers and underscores. You might also be surprised what |
54 want to know whether it matches a particular pattern---be it an email |
54 constraints programming languages impose about numbers: for example |
55 address, for example. In this way we can exclude user input that would |
55 123 in JSON is OK, but 0123 is not. |
56 otherwise have nasty effects on our program (crashing it or making it |
56 |
57 go into an infinite loop, if not worse). In tools like Snort, scanning |
57 Often we also face the problem that we are given a string (for example |
58 for computer viruses or filtering out spam usually involves scanning |
58 some user input) and want to know whether it matches a particular |
59 for some signature (essentially a string pattern). The point is that |
59 pattern---be it an email address, for example. In this way we can |
60 the fast Knuth-Morris-Pratt algorithm for strings is not good enough |
60 exclude user input that would otherwise have nasty effects on our |
61 for such string \emph{patterns}.\smallskip |
61 program (crashing it or making it go into an infinite loop, if not |
|
62 worse). In tools like Snort, scanning for computer viruses or |
|
63 filtering out spam usually involves scanning for some signature |
|
64 (essentially a string pattern). The point is that the fast |
|
65 Knuth-Morris-Pratt algorithm for strings is not good enough for such |
|
66 string \emph{patterns}.\smallskip |
62 |
67 |
63 \defn{Regular expressions} help with conveniently specifying |
68 \defn{Regular expressions} help with conveniently specifying |
64 such patterns. The idea behind regular expressions is that |
69 such patterns. The idea behind regular expressions is that |
65 they are a simple method for describing languages (or sets of |
70 they are a simple method for describing languages (or sets of |
66 strings)\ldots at least languages we are interested in in |
71 strings)\ldots at least languages we are interested in in |
192 much ubiquitous in computer science. There are many libraries |
197 much ubiquitous in computer science. There are many libraries |
193 implementing regular expressions. I am sure you have come across them |
198 implementing regular expressions. I am sure you have come across them |
194 before (remember the PRA module?). Why on earth then is there any |
199 before (remember the PRA module?). Why on earth then is there any |
195 interest in studying them again in depth in this module? Well, one |
200 interest in studying them again in depth in this module? Well, one |
196 answer is in the following two graphs about regular expression |
201 answer is in the following two graphs about regular expression |
197 matching in Python, Ruby and Java (Version 8). |
202 matching in Python, Ruby, JavaScript and Java (Version 8). |
198 |
203 |
199 \begin{center} |
204 \begin{center} |
200 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} |
205 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} |
201 \begin{tikzpicture} |
206 \begin{tikzpicture} |
202 \begin{axis}[ |
207 \begin{axis}[ |
212 ytick={0,5,...,30}, |
217 ytick={0,5,...,30}, |
213 scaled ticks=false, |
218 scaled ticks=false, |
214 axis lines=left, |
219 axis lines=left, |
215 width=5.5cm, |
220 width=5.5cm, |
216 height=4.5cm, |
221 height=4.5cm, |
217 legend entries={Python, Java 8}, |
222 legend entries={Python, Java 8, JavaScript}, |
218 legend pos=north west, |
223 legend pos=north west, |
219 legend cell align=left] |
224 legend cell align=left] |
220 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
225 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
221 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
226 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
227 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
222 \end{axis} |
228 \end{axis} |
223 \end{tikzpicture} |
229 \end{tikzpicture} |
224 & |
230 & |
225 \begin{tikzpicture} |
231 \begin{tikzpicture} |
226 \begin{axis}[ |
232 \begin{axis}[ |