handouts/ho04.tex
changeset 284 0afe43616b6a
parent 283 c14e5ebf0c3b
child 285 8a222559278f
--- a/handouts/ho04.tex	Thu Oct 16 17:30:05 2014 +0100
+++ b/handouts/ho04.tex	Fri Oct 17 11:56:19 2014 +0100
@@ -13,31 +13,33 @@
 expression or not. Often a more interesting question is to
 find out \emph{how} a regular expression matched a string?
 Answering this question will help us with the problem we are
-after, namely tokenising an input string, that is splitting it
-up into its ``word'' components. The algorithm we will be
-looking at was designed by Sulzmann \& Lu in a rather recent
-paper. A link to it is provided on KEATS, in case you are
-interested.\footnote{In my humble opinion this is an
-interesting instance of the research literature: it contains a
-very neat idea, but its presentation is rather sloppy. In
-earlier versions of their paper, students and I found several
-rather annoying typos in their examples and definitions.}
+after, namely tokenising an input string. The algorithm we
+will be looking at for this was designed by Sulzmann \& Lu in
+a rather recent paper. A link to it is provided on KEATS, in
+case you are interested.\footnote{In my humble opinion this is
+an interesting instance of the research literature: it
+contains a very neat idea, but its presentation is rather
+sloppy. In earlier versions of their paper, students and I
+found several rather annoying typos in their examples and
+definitions.}
 
-In order to give an answer for how a regular expression matched
-a string, Sulzmann and Lu introduce \emph{values}. A value 
-will be the output of the algorithm whenever the regular 
-expression matches the string. If not, an error will be raised.
-Since the first phase of the algorithm is identical to the
-derivative based matcher from the first coursework, the 
-function $nullable$ will be used to decide whether as string
-is matched by a regular expression. If $nullable$ says
-yes, then values are constructed that reflect how the regular
-expression matched the string. The definitions for regular 
-expressions $r$ and values $v$ is shown below:
+In order to give an answer for how a regular expression
+matched a string, Sulzmann and Lu introduce \emph{values}. A
+value will be the output of the algorithm whenever the regular
+expression matches the string. If not, an error will be
+raised. Since the first phase of the algorithm by Sulzmann \&
+Lu is identical to the derivative based matcher from the first
+coursework, the function $nullable$ will be used to decide
+whether as string is matched by a regular expression. If
+$nullable$ says yes, then values are constructed that reflect
+how the regular expression matched the string. The definitions
+for regular expressions $r$ and values $v$ is shown next to
+each other below:
 
 \begin{center}
 \begin{tabular}{cc}
 \begin{tabular}{@{}rrl@{}}
+\multicolumn{3}{c}{regular expressions}\\
   $r$ & $::=$  & $\varnothing$\\
       & $\mid$ & $\epsilon$   \\
       & $\mid$ & $c$          \\
@@ -48,7 +50,8 @@
 \end{tabular}
 &
 \begin{tabular}{@{\hspace{0mm}}rrl@{}}
-  $v$ & $::=$  & \\
+\multicolumn{3}{c}{values}\\
+   $v$ & $::=$  & \\
       &        & $Empty$   \\
       & $\mid$ & $Char(c)$          \\
       & $\mid$ & $Seq(v_1,v_2)$\\
@@ -59,26 +62,27 @@
 \end{tabular}
 \end{center}
 
-\noindent As you can see there is a very strong correspondence
-between regular expressions and values. There is no value for
-the $\varnothing$ regular expression (since it does not match
-any string). Then there is exactly one value corresponding to 
-each regular expression. with the exception of $r_1 + r_2$ 
-where there are two values $Left(v)$ and $Right(v)$ 
-corresponding  to the two alternatives. Note that $r^*$
-is associated with a list of values, one for each copy of $r$ 
-that was needed to match the string. This mean we might also 
+\noindent The point is that there is a very strong
+correspondence between them. There is no value for the
+$\varnothing$ regular expression, since it does not match any
+string. Otherwise there is exactly one value corresponding to
+each regular expression with the exception of $r_1 + r_2$
+where there are two values, namely $Left(v)$ and $Right(v)$
+corresponding to the two alternatives. Note that $r^*$ is
+associated with a list of values, one for each copy of $r$
+that was needed to match the string. This means we might also
 return the empty list $[]$, if no copy was needed.
 
-Graphically the algorithm can be represneted by the 
-picture in Figure~\ref{Sulz} where the path involving $der/nullable$ is the first
-phase of the algorithm and $mkeps/inj$ the second phase.
-This picture shows the steps required when a regular
-expression, say $r_1$, matches the string $abc$. We first
-build the three derivatives (according to $a$, $b$ and $c$).
-We then use $nullable$ to find out whether the resulting
-regular expression can match the empty string. If yes
-we call the function $mkeps$.
+Graphically the algorithm by Sulzmann \& Lu can be represneted
+by the picture in Figure~\ref{Sulz} where the path from the
+left to the right involving $der/nullable$ is the first phase
+of the algorithm and $mkeps/inj$, the path from right to left,
+the second phase. This picture shows the steps required when a
+regular expression, say $r_1$, matches the string $abc$. We
+first build the three derivatives (according to $a$, $b$ and
+$c$). We then use $nullable$ to find out whether the resulting
+regular expression can match the empty string. If yes we call
+the function $mkeps$.
 
 \begin{figure}[t]
 \begin{center}
@@ -108,7 +112,7 @@
 \end{figure}
 
 The $mkeps$ function calculates a value for how a regular
-expression could have matched the empty string. Its definition
+expression has matched the empty string. Its definition
 is as follows:
 
 \begin{center}
@@ -123,23 +127,22 @@
 \end{center}
 
 \noindent There are no cases for $\epsilon$ and $c$, since
-these regular expression cannot match the empty string. 
-Note that in case of alternatives we give preference to the
+these regular expression cannot match the empty string. Note
+also that in case of alternatives we give preference to the
 regular expression on the left-hand side. This will become
 important later on.
 
-The algorithm is organised recursively such that it will 
-calculate a value for how the derivative regular expression
-has matched a string where the first character has been
-chopped off. Now we need a function that reverses this
-``chopping off'' for values. The corresponding function
-is called $inj$ for injection. This function takes
-three arguments: the first one is a regular expression
-for which we want to calculate the value, the second is the 
-character we want to inject and the third argument is the
-value where we will inject the character. The result
-of this function is a new value. The definition of $inj$ 
-is as follows: 
+The second phase of the algorithm is organised recursively
+such that it will calculate a value for how the derivative
+regular expression has matched a string whose first character
+has been chopped off. Now we need a function that reverses
+this ``chopping off'' for values. The corresponding function
+is called $inj$ for injection. This function takes three
+arguments: the first one is a regular expression for which we
+want to calculate the value, the second is the character we
+want to inject and the third argument is the value where we
+will inject the character. The result of this function is a
+new value. The definition of $inj$ is as follows: 
 
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
@@ -153,20 +156,21 @@
 \end{tabular}
 \end{center}
 
-\noindent This definition is by recursion on the
-regular expression and by analysing the shape of the values. 
-Therefore there are, for example, three cases for sequnece
-regular expressions. The last clause returns a list where 
-the first element is $inj\,r\,c\,v$ and the other elements 
-are $vs$.
+\noindent This definition is by recursion on the regular
+expression and by analysing the shape of the values. Therefore
+there are, for example, three cases for sequnece regular
+expressions. The last clause for the star regular expression
+returns a list where the first element is $inj\,r\,c\,v$ and
+the other elements are $vs$. That mean $\_\,::\,\_$ should be 
+read as list cons.
 
 To understand what is going on, it might be best to do some
 example calculations and compare with Figure~\ref{Sulz}. For
 this note that we have not yet dealt with the need of
-simplifying regular expreesions (this will be a topic on its
+simplifying regular expressions (this will be a topic on its
 own later). Suppose the regular expression is $a \cdot (b
-\cdot c)$ and the input string is $abc$. The derivatives are
-as follows:
+\cdot c)$ and the input string is $abc$. The derivatives from
+the first phase are as follows:
 
 \begin{center}
 \begin{tabular}{ll}
@@ -177,24 +181,24 @@
 \end{tabular}
 \end{center}
 
-\noindent According to the simple algorithm. we would test
-whether $r_4$ is nullable, which it is. This means we can
-use the function $mkeps$ to calculate a value for how $r_4$
-was able to match the empty string. Remember that this 
-function gives preference for alternatives on the left-hand
-side. However there is only $\epsilon$ on the very right-hand
-side of $r_4$ that matches the empty string. Therefore
-$mkeps$ returns the value
+\noindent According to the simple algorithm, we would test
+whether $r_4$ is nullable, which in this case it is. This
+means we can use the function $mkeps$ to calculate a value for
+how $r_4$ was able to match the empty string. Remember that
+this function gives preference for alternatives on the
+left-hand side. However there is only $\epsilon$ on the very
+right-hand side of $r_4$ that matches the empty string.
+Therefore $mkeps$ returns the value
 
 \begin{center}
 $v_4:\;Right(Right(Empty))$
 \end{center}
 
-\noindent The point is that from this value we can directly 
-read off which part of $r_4$ matched the empty string. Next we 
+\noindent The point is that from this value we can directly
+read off which part of $r_4$ matched the empty string. Next we
 have to ``inject'' the last character, that is $c$ in the
-running example, into this value in order to calculate how
-$r_3$ could have matched the string $c$. According to the 
+running example, into this value $v_4$ in order to calculate
+how $r_3$ could have matched the string $c$. According to the
 definition of $inj$ we obtain
 
 \begin{center}
@@ -220,13 +224,11 @@
 
 \noindent This now corresponds to how the regular
 expression $a \cdot (b \cdot c)$ matched the string $abc$.
-This is the expected result. So at least in this case the
-algorithm seems to calculate what it is supposed to.
 
-There are a few auxiliary function that are of interest
+There are a few auxiliary functions that are of interest
 in analysing this algorithm. One is called \emph{flatten},
 written $|\_|$, which extracts the string ``underlying'' a 
-value. It is defined as
+value. It is defined recursively as
 
 \begin{center}
 \begin{tabular}{lcl}
@@ -241,14 +243,14 @@
 
 \noindent Using flatten we can see what is the string behind 
 the values calculated by $mkeps$ and $inj$ in our running 
-example:
+example are:
 
 \begin{center}
 \begin{tabular}{ll}
-$v_4$: & $[]$\\
-$v_3$: & $c$\\
-$v_2$: & $bc$\\
-$v_1$: & $abc$
+$|v_4|$: & $[]$\\
+$|v_3|$: & $c$\\
+$|v_2|$: & $bc$\\
+$|v_1|$: & $abc$
 \end{tabular}
 \end{center}
 
@@ -261,7 +263,58 @@
 
 Generally the matching algorithms based on derivatives do
 poorly unless the regular expressions are simplified after
-each derivatives step.
+each derivatives step. But this is a bit more involved in 
+algorithm of Sulzmann \& Lu. Consider the last derivation
+step in our running example
+
+\begin{center}
+$r_4 = der\,c\,r_3$
+\end{center}
+
+\noindent 
+Simplifying the result would just give use $\epsilon$.
+Running $mkeps$ on this regular expression would provide 
+us with $Empty$ instead of $Right(Right(Empty))$ that
+was obtained without the simplification. The problem is
+we need to recreate this value, rather than just $Empty$.
+
+This requires what I call \emph{rectification functions}. They
+need to be calculated whenever a regular expression gets
+simplified. Rectification functions take a value as argument
+and return a (rectified) value. Our simplification rules so
+far are
+
+\begin{center}
+\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
+$r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ 
+$\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ 
+$r \cdot \epsilon$ & $\mapsto$ & $r$\\ 
+$\epsilon \cdot r$ & $\mapsto$ & $r$\\ 
+$r + \varnothing$ & $\mapsto$ & $r$\\ 
+$\varnothing + r$ & $\mapsto$ & $r$\\ 
+$r + r$ & $\mapsto$ & $r$\\ 
+\end{tabular}
+\end{center}
+
+\noindent Applying them to $r_4$ will require several nested
+simplifications in order end up with just $\epsilon$. We can
+implement this by letting simp return not just a (simplified)
+regular expression, but also a rectification function. Let us
+consider the alternative case, say $r_1 + r_2$, first. We
+would first simplify the component regular expressions $r_1$
+and $r_2.$ This will return simplified versions (if they can
+be simplified), say $r_{1s}$ and $r_{2s}$, but also two
+rectification functions $f_{1s}$ and $f_{2s}$. We need to
+assemble them in order to obtain a rectified value for 
+$r_1 + r_2$.
+
+In case $r_{1s}$ simplified to $\varnothing$, we would
+continue the derivative calculation with $r_{2s}$. The
+Sulzmann \& Lu algorithm would return a corresponding value,
+say $v_{2s}$.
+
+
+\subsubsection*{Records and Tokenisation}
 
 \newpage
 Algorithm by Sulzmann, Lexing