15 expression or not. Often a more interesting question is to |
15 expression or not. Often a more interesting question is to |
16 find out \emph{how} a regular expression matched a string? |
16 find out \emph{how} a regular expression matched a string? |
17 Answering this question will also help us with the problem we |
17 Answering this question will also help us with the problem we |
18 are after, namely tokenising an input string. |
18 are after, namely tokenising an input string. |
19 |
19 |
20 The algorithm we will be looking at in this lecture was designed by Sulzmann |
20 The algorithm we will be looking at in this lecture was designed by |
21 \& Lu in a rather recent research paper (from 2014). A link to it is |
21 Sulzmann \& Lu in a rather recent research paper (from 2014). A link |
22 provided on KEATS, in case you are interested.\footnote{In my |
22 to it is provided on KEATS, in case you are interested.\footnote{In my |
23 humble opinion this is an interesting instance of the research |
23 humble opinion this is an interesting instance of the research |
24 literature: it contains a very neat idea, but its presentation |
24 literature: it contains very clever ideas, but its presentation is |
25 is rather sloppy. In earlier versions of this paper, a King's |
25 rather sloppy. In earlier versions of this paper, students and I |
26 undergraduate student and I found several rather annoying typos in the |
26 found several rather annoying typos in the examples and definitions; |
27 examples and definitions.} My former PhD student Fahad Ausaf and I even more recently |
27 we even found downright errors in their work.} Together with my former PhD |
28 wrote a paper where we build on their result: we provided a |
28 students Fahad Ausaf and Chengsong Tan we wrote several papers where |
29 mathematical proof that their algorithm is really correct---the proof |
29 we build on their result: we provided a mathematical proof that their |
30 Sulzmann \& Lu had originally given contained major flaws. Such correctness |
30 algorithm is really correct---the proof Sulzmann \& Lu had originally |
31 proofs are important: Kuklewicz maintains a unit-test library |
31 given contained major flaws. Such correctness proofs are important: |
32 for the kind of algorithms we are interested in here and he showed |
32 Kuklewicz maintains a unit-test library for the kind of algorithms we |
33 that many implementations in the ``wild'' are buggy, that is not |
33 are interested in here and he showed that many implementations in the |
34 satisfy his unit tests: |
34 ``wild'' are buggy, that is not satisfy his unit tests: |
35 |
35 |
36 \begin{center} |
36 \begin{center} |
37 \url{http://www.haskell.org/haskellwiki/Regex_Posix} |
37 \url{http://www.haskell.org/haskellwiki/Regex_Posix} |
38 \end{center} |
38 \end{center} |
39 |
39 |
431 |
431 |
432 Phew\ldots{}Sweat\ldots!\#@$\skull$\%\ldots Unfortunately, there is |
432 Phew\ldots{}Sweat\ldots!\#@$\skull$\%\ldots Unfortunately, there is |
433 a gigantic problem with the described algorithm so far: it is very |
433 a gigantic problem with the described algorithm so far: it is very |
434 slow. To make it faster, we have to include in all this the simplification |
434 slow. To make it faster, we have to include in all this the simplification |
435 from Lecture 2\ldots{}and what rotten luck: simplification messes things |
435 from Lecture 2\ldots{}and what rotten luck: simplification messes things |
436 up and we need to rectify the mess. This is what we shall do next. |
436 up and we need to \emph{rectify} the mess. This is what we shall do next. |
437 |
437 |
438 |
438 |
439 \subsubsection*{Simplification} |
439 \subsubsection*{Simplification} |
440 |
440 |
441 Generally the matching algorithms based on derivatives do |
441 Generally the matching algorithms based on derivatives do |