handouts/ho04.tex
changeset 350 c4e7caa06c74
parent 326 94700593a2d5
child 352 1e1b0fe66107
--- a/handouts/ho04.tex	Thu Oct 08 22:44:16 2015 +0100
+++ b/handouts/ho04.tex	Thu Oct 08 23:03:28 2015 +0100
@@ -15,24 +15,28 @@
 Answering this question will also help us with the problem we
 are after, namely tokenising an input string. 
 
-The algorithm we will be looking at was designed by Sulzmann \& Lu in
-a rather recent paper. A link to it is provided on KEATS, in case you
-are interested.\footnote{In my humble opinion this is an interesting
-  instance of the research literature: it contains a very neat idea,
-  but its presentation is rather sloppy. In earlier versions of their
-  paper, students and I found several rather annoying typos in their
-  examples and definitions.}  In order to give an answer for how a
-regular expression matched a string, Sulzmann and Lu introduce
-\emph{values}. A value will be the output of the algorithm whenever
-the regular expression matches the string. If the string does not
-match the string, an error will be raised. Since the first phase of
-the algorithm by Sulzmann \& Lu is identical to the derivative based
-matcher from the first coursework, the function $nullable$ will be
-used to decide whether as string is matched by a regular
-expression. If $nullable$ says yes, then values are constructed that
-reflect how the regular expression matched the string. The definitions
-for regular expressions $r$ and values $v$ is shown next to each other
-below:
+The algorithm we will be looking at was designed by Sulzmann
+\& Lu in a rather recent paper (from 2014). A link to it is
+provided on KEATS, in case you are interested.\footnote{In my
+humble opinion this is an interesting instance of the research
+literature: it contains a very neat idea, but its presentation
+is rather sloppy. In earlier versions of their paper, a King's
+student and I found several rather annoying typos in their
+examples and definitions.} In order to give an answer for
+\emph{how} a regular expression matches a string, Sulzmann and
+Lu introduce \emph{values}. A value will be the output of the
+algorithm whenever the regular expression matches the string.
+If the string does not match the string, an error will be
+raised. Since the first phase of the algorithm by Sulzmann \&
+Lu is identical to the derivative based matcher from the first
+coursework, the function $nullable$ will be used to decide
+whether as string is matched by a regular expression. If
+$nullable$ says yes, then values are constructed that reflect
+how the regular expression matched the string. 
+
+The definitions for values is given below. They are shown 
+together with the regular expressions $r$ to which
+they correspond:
 
 \begin{center}
 \begin{tabular}{cc}
@@ -60,7 +64,7 @@
 \end{tabular}
 \end{center}
 
-\noindent The point is that there is a very strong
+\noindent The reason is that there is a very strong
 correspondence between them. There is no value for the
 $\varnothing$ regular expression, since it does not match any
 string. Otherwise there is exactly one value corresponding to
@@ -69,18 +73,26 @@
 corresponding to the two alternatives. Note that $r^*$ is
 associated with a list of values, one for each copy of $r$
 that was needed to match the string. This means we might also
-return the empty list $[]$, if no copy was needed.
+return the empty list $[]$, if no copy was needed in case
+of $r^*$. For sequence, there is exactly one value, composed 
+of two component values ($v_1$ and $v_2$).
 
 To emphasise the connection between regular expressions and
 values, I have in my implementation the convention that
+regular expressions and values have the same name, except that
 regular expressions are written entirely with upper-case
 letters, while values just start with a single upper-case
-character. My definition of values in Scala is below. I use 
-this in the REPL of Scala; when I use the Scala compiler
-I need to rename some constructors, because Scala on Macs
-does not like classes that are called \pcode{EMPTY} and
+character and the rest are lower-case letters. My definition
+of regular expressions and values in Scala is shown below. I use
+this in the REPL of Scala; when I use the Scala compiler I
+need to rename some constructors, because Scala on Macs does
+not like classes that are called \pcode{EMPTY} and
 \pcode{Empty}.
  
+ {\small\lstinputlisting[language=Scala,numbers=none]
+{../progs/app03.scala}}
+
+ 
 {\small\lstinputlisting[language=Scala,numbers=none]
 {../progs/app02.scala}}
 
@@ -146,15 +158,15 @@
 
 The second phase of the algorithm is organised so that it will
 calculate a value for how the derivative regular expression
-has matched a string whose first character has been chopped
-off. Now we need a function that reverses this ``chopping
-off'' for values. The corresponding function is called $inj$
-for injection. This function takes three arguments: the first
-one is a regular expression for which we want to calculate the
-value, the second is the character we want to inject and the
-third argument is the value where we will inject the
-character. The result of this function is a new value. The
-definition of $inj$ is as follows: 
+has matched a string. For this we need a function that
+reverses this ``chopping off'' for values which we did in the
+first phase for derivatives. The corresponding function is
+called $inj$ for injection. This function takes three
+arguments: the first one is a regular expression for which we
+want to calculate the value, the second is the character we
+want to inject and the third argument is the value where we
+will inject the character into. The result of this function is a
+new value. The definition of $inj$ is as follows: 
 
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
@@ -171,10 +183,10 @@
 \noindent This definition is by recursion on the regular
 expression and by analysing the shape of the values. Therefore
 there are, for example, three cases for sequence regular
-expressions. The last clause for the star regular expression
-returns a list where the first element is $inj\,r\,c\,v$ and
-the other elements are $vs$. That means $\_\,::\,\_$ should be 
-read as list cons.
+expressions (for all possible shapes of the value). The last
+clause for the star regular expression returns a list where
+the first element is $inj\,r\,c\,v$ and the other elements are
+$vs$. That means $\_\,::\,\_$ should be read as list cons.
 
 To understand what is going on, it might be best to do some
 example calculations and compare them with Figure~\ref{Sulz}.
@@ -194,22 +206,24 @@
 \end{center}
 
 \noindent According to the simple algorithm, we would test
-whether $r_4$ is nullable, which in this case it is. This
-means we can use the function $mkeps$ to calculate a value for
-how $r_4$ was able to match the empty string. Remember that
-this function gives preference for alternatives on the
-left-hand side. However there is only $\epsilon$ on the very
-right-hand side of $r_4$ that matches the empty string.
+whether $r_4$ is nullable, which in this case it indeed is.
+This means we can use the function $mkeps$ to calculate a
+value for how $r_4$ was able to match the empty string.
+Remember that this function gives preference for alternatives
+on the left-hand side. However there is only $\epsilon$ on the
+very right-hand side of $r_4$ that matches the empty string.
 Therefore $mkeps$ returns the value
 
 \begin{center}
 $v_4:\;Right(Right(Empty))$
 \end{center}
 
-\noindent The point is that from this value we can directly
-read off which part of $r_4$ matched the empty string: take
-the right-alternative first, and then the right-alternative
-again. 
+\noindent If there had been a $\epsilon$ on the left, then
+$mkeps$ would have returned something of the form
+$Left(\ldots)$. The point is that from this value we can
+directly read off which part of $r_4$ matched the empty
+string: take the right-alternative first, and then the
+right-alternative again. 
 
 Next we have to ``inject'' the last character, that is $c$ in
 the running example, into this value $v_4$ in order to
@@ -257,9 +271,9 @@
 \end{tabular}
 \end{center}
 
-\noindent Using flatten we can see what is the string behind 
-the values calculated by $mkeps$ and $inj$ in our running 
-example are:
+\noindent Using flatten we can see what are the strings behind 
+the values calculated by $mkeps$ and $inj$. In our running 
+example:
 
 \begin{center}
 \begin{tabular}{ll}
@@ -271,8 +285,13 @@
 \end{center}
 
 \noindent This indicates that $inj$ indeed is injecting, or
-adding, back a character into the value. Next we look at how
-simplification can be included into this algorithm.
+adding, back a character into the value. If we continue until
+all characters are injected back, we have a value that can 
+indeed say how the string $abc$ was matched.
+
+There is a problem, however, with the described algorithm
+so far: it is very slow. We need to include the simplification 
+from Lecture 2. This is what we shall do next.
 
 
 \subsubsection*{Simplification}