6 \usepackage{../data} |
6 \usepackage{../data} |
7 |
7 |
8 |
8 |
9 \begin{document} |
9 \begin{document} |
10 \fnote{\copyright{} Christian Urban, King's College London, |
10 \fnote{\copyright{} Christian Urban, King's College London, |
11 2014, 2015, 2016, 2017, 2018, 2019, 2020} |
11 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021} |
12 |
12 |
13 |
13 |
14 \section*{Handout 2 (Regular Expression Matching)} |
14 \section*{Handout 2 (Regular Expression Matching)} |
15 |
15 |
16 %\noindent |
16 %\noindent |
25 |
25 |
26 \noindent |
26 \noindent |
27 This lecture is about implementing a more efficient regular expression |
27 This lecture is about implementing a more efficient regular expression |
28 matcher (the plots on the right below)---more efficient than the |
28 matcher (the plots on the right below)---more efficient than the |
29 matchers from regular expression libraries in Ruby, Python, JavaScript |
29 matchers from regular expression libraries in Ruby, Python, JavaScript |
30 and Java (the plots on the left). For this consider some experimental |
30 and Java (the plots on the left).\footnote{Have a look at KEATS: students |
31 data: The first pair of plots shows the running time for the |
31 last year contributed also date for the Swift and Dart languages.}\medskip |
32 regular expression $(a^*)^*\cdot b$ and strings composed of $n$ |
32 |
33 \pcode{a}s, like |
33 \noindent |
|
34 To start with let us look more closely at the experimental data: The |
|
35 first pair of plots shows the running time for the regular expression |
|
36 $(a^*)^*\cdot b$ and strings composed of $n$ \pcode{a}s, like |
34 \[ |
37 \[ |
35 \underbrace{\pcode{a}\ldots\pcode{a}}_{n} |
38 \underbrace{\pcode{a}\ldots\pcode{a}}_{n} |
36 \] |
39 \] |
37 |
40 |
38 \noindent |
41 \noindent |
145 |
148 |
146 \noindent |
149 \noindent |
147 In what follows we will use these regular expressions and strings as |
150 In what follows we will use these regular expressions and strings as |
148 running examples. There will be several versions (V1, V2, V3,\ldots) |
151 running examples. There will be several versions (V1, V2, V3,\ldots) |
149 of our matcher.\footnote{The corresponding files are |
152 of our matcher.\footnote{The corresponding files are |
150 \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can |
153 \texttt{re1.sc}, \texttt{re2.sc} and so on. As usual, you can |
151 find the code on KEATS.} |
154 find the code on KEATS.} |
152 |
155 |
153 Having specified in the previous lecture what |
156 Having specified in the previous lecture what |
154 problem our regular expression matcher is supposed to solve, |
157 problem our regular expression matcher is supposed to solve, |
155 namely for any given regular expression $r$ and string $s$ |
158 namely for any given regular expression $r$ and string $s$ |
300 \noindent |
303 \noindent |
301 We will not use them in our algorithm, but feel free to convince |
304 We will not use them in our algorithm, but feel free to convince |
302 yourself that they actually hold. As an aside, there has been a lot of |
305 yourself that they actually hold. As an aside, there has been a lot of |
303 research about questions like: Can one always decide when two regular |
306 research about questions like: Can one always decide when two regular |
304 expressions are equivalent or not? What does an algorithm look like to |
307 expressions are equivalent or not? What does an algorithm look like to |
305 decide this efficiently? So in general it is not a trivial problem. |
308 decide this efficiently? Surprisingly, many of such questions |
|
309 turn out to be non-trivial problems. |
|
310 |
306 |
311 |
307 \subsection*{The Matching Algorithm} |
312 \subsection*{The Matching Algorithm} |
308 |
313 |
309 The algorithm we will introduce below consists of two parts. One is the |
314 The algorithm we will introduce below consists of two parts. One is the |
310 function $\textit{nullable}$ which takes a regular expression as an |
315 function $\textit{nullable}$ which takes a regular expression as an |
490 \textit{matcher}\,r\,s\quad\text{if and only if}\quad s\in L(r) |
495 \textit{matcher}\,r\,s\quad\text{if and only if}\quad s\in L(r) |
491 \] |
496 \] |
492 |
497 |
493 \noindent holds, which means our algorithm satisfies the |
498 \noindent holds, which means our algorithm satisfies the |
494 specification. Of course we can claim many things\ldots |
499 specification. Of course we can claim many things\ldots |
495 whether the claim holds any water is a different question, |
500 whether the claim holds any water is a different question. |
496 which for example is the point of the Strand-2 Coursework. |
|
497 |
501 |
498 This algorithm was introduced by Janusz Brzozowski in 1964, but |
502 This algorithm was introduced by Janusz Brzozowski in 1964, but |
499 is more widely known only in the last 10 or so years. Its |
503 is more widely known only in the last 10 or so years. Its |
500 main attractions are simplicity and being fast, as well as |
504 main attractions are simplicity and being fast, as well as |
501 being easily extendible for other regular expressions such as |
505 being easily extendible for other regular expressions such as |
502 $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of |
506 $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of |
503 Strand-1 Coursework 1). |
507 Coursework 1). |
504 |
508 |
505 \subsection*{The Matching Algorithm in Scala} |
509 \subsection*{The Matching Algorithm in Scala} |
506 |
510 |
507 Another attraction of the algorithm is that it can be easily |
511 Another attraction of the algorithm is that it can be easily |
508 implemented in a functional programming language, like Scala. |
512 implemented in a functional programming language, like Scala. |