afl-material: comparison handouts/ho04.tex

equal deleted inserted replaced

-:c14e5ebf0c3b
+:0afe43616b6a
 So far our algorithm based on derivatives was only able to say
 yes or no depending on whether a string was matched by regular
 expression or not. Often a more interesting question is to
 find out \emph{how} a regular expression matched a string?
 Answering this question will help us with the problem we are
-after, namely tokenising an input string, that is splitting it
+after, namely tokenising an input string. The algorithm we
-up into its ``word'' components. The algorithm we will be
+will be looking at for this was designed by Sulzmann \& Lu in
-looking at was designed by Sulzmann \& Lu in a rather recent
+a rather recent paper. A link to it is provided on KEATS, in
-paper. A link to it is provided on KEATS, in case you are
+case you are interested.\footnote{In my humble opinion this is
-interested.\footnote{In my humble opinion this is an
+an interesting instance of the research literature: it
-interesting instance of the research literature: it contains a
+contains a very neat idea, but its presentation is rather
-very neat idea, but its presentation is rather sloppy. In
+sloppy. In earlier versions of their paper, students and I
-earlier versions of their paper, students and I found several
+found several rather annoying typos in their examples and
-rather annoying typos in their examples and definitions.}
+definitions.}
-In order to give an answer for how a regular expression matched
+In order to give an answer for how a regular expression
-a string, Sulzmann and Lu introduce \emph{values}. A value
+matched a string, Sulzmann and Lu introduce \emph{values}. A
-will be the output of the algorithm whenever the regular
+value will be the output of the algorithm whenever the regular
-expression matches the string. If not, an error will be raised.
+expression matches the string. If not, an error will be
-Since the first phase of the algorithm is identical to the
+raised. Since the first phase of the algorithm by Sulzmann \&
-derivative based matcher from the first coursework, the
+Lu is identical to the derivative based matcher from the first
-function $nullable$ will be used to decide whether as string
+coursework, the function $nullable$ will be used to decide
-is matched by a regular expression. If $nullable$ says
+whether as string is matched by a regular expression. If
-yes, then values are constructed that reflect how the regular
+$nullable$ says yes, then values are constructed that reflect
-expression matched the string. The definitions for regular
+how the regular expression matched the string. The definitions
-expressions $r$ and values $v$ is shown below:
+for regular expressions $r$ and values $v$ is shown next to
+each other below:
 \begin{center}
 \begin{tabular}{cc}
 \begin{tabular}{@{}rrl@{}}
+\multicolumn{3}{c}{regular expressions}\\
 $r$ & $::=$  & $\varnothing$\\
 & $\mid$ & $\epsilon$   \\
 & $\mid$ & $c$          \\
 & $\mid$ & $r_1 \cdot r_2$\\
 & $\mid$ & $r_1 + r_2$   \\
 \\
 & $\mid$ & $r^*$         \\
 \end{tabular}
 &
 \begin{tabular}{@{\hspace{0mm}}rrl@{}}
-$v$ & $::=$  & \\
+\multicolumn{3}{c}{values}\\
+$v$ & $::=$  & \\
 &        & $Empty$   \\
 & $\mid$ & $Char(c)$          \\
 & $\mid$ & $Seq(v_1,v_2)$\\
 & $\mid$ & $Left(v)$   \\
 & $\mid$ & $Right(v)$  \\
 & $\mid$ & $[v_1,\ldots\,v_n]$ \\
 \end{tabular}
 \end{tabular}
 \end{center}
-\noindent As you can see there is a very strong correspondence
+\noindent The point is that there is a very strong
-between regular expressions and values. There is no value for
+correspondence between them. There is no value for the
-the $\varnothing$ regular expression (since it does not match
+$\varnothing$ regular expression, since it does not match any
-any string). Then there is exactly one value corresponding to
+string. Otherwise there is exactly one value corresponding to
-each regular expression. with the exception of $r_1 + r_2$
+each regular expression with the exception of $r_1 + r_2$
-where there are two values $Left(v)$ and $Right(v)$
+where there are two values, namely $Left(v)$ and $Right(v)$
-corresponding  to the two alternatives. Note that $r^*$
+corresponding to the two alternatives. Note that $r^*$ is
-is associated with a list of values, one for each copy of $r$
+associated with a list of values, one for each copy of $r$
-that was needed to match the string. This mean we might also
+that was needed to match the string. This means we might also
 return the empty list $[]$, if no copy was needed.
-Graphically the algorithm can be represneted by the
+Graphically the algorithm by Sulzmann \& Lu can be represneted
-picture in Figure~\ref{Sulz} where the path involving $der/nullable$ is the first
+by the picture in Figure~\ref{Sulz} where the path from the
-phase of the algorithm and $mkeps/inj$ the second phase.
+left to the right involving $der/nullable$ is the first phase
-This picture shows the steps required when a regular
+of the algorithm and $mkeps/inj$, the path from right to left,
-expression, say $r_1$, matches the string $abc$. We first
+the second phase. This picture shows the steps required when a
-build the three derivatives (according to $a$, $b$ and $c$).
+regular expression, say $r_1$, matches the string $abc$. We
-We then use $nullable$ to find out whether the resulting
+first build the three derivatives (according to $a$, $b$ and
-regular expression can match the empty string. If yes
+$c$). We then use $nullable$ to find out whether the resulting
-we call the function $mkeps$.
+regular expression can match the empty string. If yes we call
+the function $mkeps$.
 \begin{figure}[t]
 \begin{center}
 \begin{tikzpicture}[scale=2,node distance=1.2cm,
 every node/.style={minimum size=7mm}]
 \caption{The two phases of the algorithm by Sulzmann \& Lu.
 \label{Sulz}}
 \end{figure}
 The $mkeps$ function calculates a value for how a regular
-expression could have matched the empty string. Its definition
+expression has matched the empty string. Its definition
 is as follows:
 \begin{center}
 \begin{tabular}{lcl}
 $mkeps(\epsilon)$       & $\dn$ & $Empty$\\
 $mkeps(r^*)$            & $\dn$ & $[]$  \\
 \end{tabular}
 \end{center}
 \noindent There are no cases for $\epsilon$ and $c$, since
-these regular expression cannot match the empty string.
+these regular expression cannot match the empty string. Note
-Note that in case of alternatives we give preference to the
+also that in case of alternatives we give preference to the
 regular expression on the left-hand side. This will become
 important later on.
-The algorithm is organised recursively such that it will
+The second phase of the algorithm is organised recursively
-calculate a value for how the derivative regular expression
+such that it will calculate a value for how the derivative
-has matched a string where the first character has been
+regular expression has matched a string whose first character
-chopped off. Now we need a function that reverses this
+has been chopped off. Now we need a function that reverses
-``chopping off'' for values. The corresponding function
+this ``chopping off'' for values. The corresponding function
-is called $inj$ for injection. This function takes
+is called $inj$ for injection. This function takes three
-three arguments: the first one is a regular expression
+arguments: the first one is a regular expression for which we
-for which we want to calculate the value, the second is the
+want to calculate the value, the second is the character we
-character we want to inject and the third argument is the
+want to inject and the third argument is the value where we
-value where we will inject the character. The result
+will inject the character. The result of this function is a
-of this function is a new value. The definition of $inj$
+new value. The definition of $inj$ is as follows:
-is as follows:
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
 $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
 $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
 $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
 \end{tabular}
 \end{center}
-\noindent This definition is by recursion on the
+\noindent This definition is by recursion on the regular
-regular expression and by analysing the shape of the values.
+expression and by analysing the shape of the values. Therefore
-Therefore there are, for example, three cases for sequnece
+there are, for example, three cases for sequnece regular
-regular expressions. The last clause returns a list where
+expressions. The last clause for the star regular expression
-the first element is $inj\,r\,c\,v$ and the other elements
+returns a list where the first element is $inj\,r\,c\,v$ and
-are $vs$.
+the other elements are $vs$. That mean $\_\,::\,\_$ should be
+read as list cons.
 To understand what is going on, it might be best to do some
 example calculations and compare with Figure~\ref{Sulz}. For
 this note that we have not yet dealt with the need of
-simplifying regular expreesions (this will be a topic on its
+simplifying regular expressions (this will be a topic on its
 own later). Suppose the regular expression is $a \cdot (b
-\cdot c)$ and the input string is $abc$. The derivatives are
+\cdot c)$ and the input string is $abc$. The derivatives from
-as follows:
+the first phase are as follows:
 \begin{center}
 \begin{tabular}{ll}
 $r_1$: & $a \cdot (b \cdot c)$\\
 $r_2$: & $\epsilon \cdot (b \cdot c)$\\
 $r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\
 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
 \end{tabular}
 \end{center}
-\noindent According to the simple algorithm. we would test
+\noindent According to the simple algorithm, we would test
-whether $r_4$ is nullable, which it is. This means we can
+whether $r_4$ is nullable, which in this case it is. This
-use the function $mkeps$ to calculate a value for how $r_4$
+means we can use the function $mkeps$ to calculate a value for
-was able to match the empty string. Remember that this
+how $r_4$ was able to match the empty string. Remember that
-function gives preference for alternatives on the left-hand
+this function gives preference for alternatives on the
-side. However there is only $\epsilon$ on the very right-hand
+left-hand side. However there is only $\epsilon$ on the very
-side of $r_4$ that matches the empty string. Therefore
+right-hand side of $r_4$ that matches the empty string.
-$mkeps$ returns the value
+Therefore $mkeps$ returns the value
 \begin{center}
 $v_4:\;Right(Right(Empty))$
 \end{center}
 \noindent The point is that from this value we can directly
 read off which part of $r_4$ matched the empty string. Next we
 have to ``inject'' the last character, that is $c$ in the
-running example, into this value in order to calculate how
+running example, into this value $v_4$ in order to calculate
-$r_3$ could have matched the string $c$. According to the
+how $r_3$ could have matched the string $c$. According to the
 definition of $inj$ we obtain
 \begin{center}
 $v_3:\;Right(Seq(Empty, Char(c)))$
 \end{center}
 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
 \end{center}
 \noindent This now corresponds to how the regular
 expression $a \cdot (b \cdot c)$ matched the string $abc$.
-This is the expected result. So at least in this case the
-algorithm seems to calculate what it is supposed to.
+There are a few auxiliary functions that are of interest
-There are a few auxiliary function that are of interest
 in analysing this algorithm. One is called \emph{flatten},
 written $|\_|$, which extracts the string ``underlying'' a
-value. It is defined as
+value. It is defined recursively as
 \begin{center}
 \begin{tabular}{lcl}
 $|Empty|$     & $\dn$ & $[]$\\
 $|Char(c)|$   & $\dn$ & $[c]$\\
 \end{tabular}
 \end{center}
 \noindent Using flatten we can see what is the string behind
 the values calculated by $mkeps$ and $inj$ in our running
-example:
+example are:
 \begin{center}
 \begin{tabular}{ll}
-$v_4$: & $[]$\\
+$|v_4|$: & $[]$\\
-$v_3$: & $c$\\
+$|v_3|$: & $c$\\
-$v_2$: & $bc$\\
+$|v_2|$: & $bc$\\
-$v_1$: & $abc$
+$|v_1|$: & $abc$
 \end{tabular}
 \end{center}
 \noindent
 This indicates that $inj$ indeed is injecting, or adding, back
 \subsubsection*{Simplification}
 Generally the matching algorithms based on derivatives do
 poorly unless the regular expressions are simplified after
-each derivatives step.
+each derivatives step. But this is a bit more involved in
+algorithm of Sulzmann \& Lu. Consider the last derivation
+step in our running example
+\begin{center}
+$r_4 = der\,c\,r_3$
+\end{center}
+\noindent
+Simplifying the result would just give use $\epsilon$.
+Running $mkeps$ on this regular expression would provide
+us with $Empty$ instead of $Right(Right(Empty))$ that
+was obtained without the simplification. The problem is
+we need to recreate this value, rather than just $Empty$.
+This requires what I call \emph{rectification functions}. They
+need to be calculated whenever a regular expression gets
+simplified. Rectification functions take a value as argument
+and return a (rectified) value. Our simplification rules so
+far are
+\begin{center}
+\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
+$r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\
+$\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\
+$r \cdot \epsilon$ & $\mapsto$ & $r$\\
+$\epsilon \cdot r$ & $\mapsto$ & $r$\\
+$r + \varnothing$ & $\mapsto$ & $r$\\
+$\varnothing + r$ & $\mapsto$ & $r$\\
+$r + r$ & $\mapsto$ & $r$\\
+\end{tabular}
+\end{center}
+\noindent Applying them to $r_4$ will require several nested
+simplifications in order end up with just $\epsilon$. We can
+implement this by letting simp return not just a (simplified)
+regular expression, but also a rectification function. Let us
+consider the alternative case, say $r_1 + r_2$, first. We
+would first simplify the component regular expressions $r_1$
+and $r_2.$ This will return simplified versions (if they can
+be simplified), say $r_{1s}$ and $r_{2s}$, but also two
+rectification functions $f_{1s}$ and $f_{2s}$. We need to
+assemble them in order to obtain a rectified value for
+$r_1 + r_2$.
+In case $r_{1s}$ simplified to $\varnothing$, we would
+continue the derivative calculation with $r_{2s}$. The
+Sulzmann \& Lu algorithm would return a corresponding value,
+say $v_{2s}$.
+\subsubsection*{Records and Tokenisation}
 \newpage
 Algorithm by Sulzmann, Lexing
 \end{document}

changeset 284	0afe43616b6a
parent 283	c14e5ebf0c3b
child 285	8a222559278f