afl-material: comparison handouts/ho04.tex

equal deleted inserted replaced

-:19020b75d75e
+:2c50b8b5886c
 So far our algorithm based on derivatives was only able to say
 yes or no depending on whether a string was matched by regular
 expression or not. Often a more interesting question is to
 find out \emph{how} a regular expression matched a string?
-Answering this question will help us with the problem we are
+Answering this question will also help us with the problem we
-after, namely tokenising an input string. The algorithm we
+are after, namely tokenising an input string. The algorithm we
 will be looking at for this was designed by Sulzmann \& Lu in
 a rather recent paper. A link to it is provided on KEATS, in
 case you are interested.\footnote{In my humble opinion this is
 an interesting instance of the research literature: it
 contains a very neat idea, but its presentation is rather
 corresponding to the two alternatives. Note that $r^*$ is
 associated with a list of values, one for each copy of $r$
 that was needed to match the string. This means we might also
 return the empty list $[]$, if no copy was needed.
-Graphically the algorithm by Sulzmann \& Lu can be represneted
+To emphasise the connection between regular expressions and
+values, I have in my implementation the convention that
+regular expressions are written entirely with upper-case
+letters, while values just start with a single upper-case
+character. My definition of values in Scala is below. I use
+this in the REPL of Scala; when I use the Scala compiler
+I need to rename some constructors, because Scala on Macs
+does not like classes that are called \pcode{EMPTY} and
+\pcode{Empty}.
+{\small\lstinputlisting[language=Scala,numbers=none]
+{../progs/app02.scala}}
+Graphically the algorithm by Sulzmann \& Lu can be illustrated
 by the picture in Figure~\ref{Sulz} where the path from the
 left to the right involving $der/nullable$ is the first phase
 of the algorithm and $mkeps/inj$, the path from right to left,
 the second phase. This picture shows the steps required when a
 regular expression, say $r_1$, matches the string $abc$. We
 these regular expression cannot match the empty string. Note
 also that in case of alternatives we give preference to the
 regular expression on the left-hand side. This will become
 important later on.
-The second phase of the algorithm is organised recursively
+The second phase of the algorithm is organised so that it will
-such that it will calculate a value for how the derivative
+calculate a value for how the derivative regular expression
-regular expression has matched a string whose first character
+has matched a string whose first character has been chopped
-has been chopped off. Now we need a function that reverses
+off. Now we need a function that reverses this ``chopping
-this ``chopping off'' for values. The corresponding function
+off'' for values. The corresponding function is called $inj$
-is called $inj$ for injection. This function takes three
+for injection. This function takes three arguments: the first
-arguments: the first one is a regular expression for which we
+one is a regular expression for which we want to calculate the
-want to calculate the value, the second is the character we
+value, the second is the character we want to inject and the
-want to inject and the third argument is the value where we
+third argument is the value where we will inject the
-will inject the character. The result of this function is a
+character. The result of this function is a new value. The
-new value. The definition of $inj$ is as follows:
+definition of $inj$ is as follows:
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
 $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
 \noindent This definition is by recursion on the regular
 expression and by analysing the shape of the values. Therefore
 there are, for example, three cases for sequnece regular
 expressions. The last clause for the star regular expression
 returns a list where the first element is $inj\,r\,c\,v$ and
-the other elements are $vs$. That mean $\_\,::\,\_$ should be
+the other elements are $vs$. That means $\_\,::\,\_$ should be
 read as list cons.
 To understand what is going on, it might be best to do some
-example calculations and compare with Figure~\ref{Sulz}. For
+example calculations and compare them with Figure~\ref{Sulz}.
-this note that we have not yet dealt with the need of
+For this note that we have not yet dealt with the need of
 simplifying regular expressions (this will be a topic on its
 own later). Suppose the regular expression is $a \cdot (b
 \cdot c)$ and the input string is $abc$. The derivatives from
 the first phase are as follows:
 \begin{center}
 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
 \end{center}
-\noindent Finally we need to inject back the letter $a$ into
+\noindent which is again the correct result for how $r_2$
-$v_2$ giving the final result
+matched the string $bc$. Finally we need to inject back the
+letter $a$ into $v_2$ giving the final result
 \begin{center}
 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
 \end{center}
 \noindent This now corresponds to how the regular
 expression $a \cdot (b \cdot c)$ matched the string $abc$.
 There are a few auxiliary functions that are of interest
-in analysing this algorithm. One is called \emph{flatten},
+when analysing this algorithm. One is called \emph{flatten},
 written $|\_|$, which extracts the string ``underlying'' a
 value. It is defined recursively as
 \begin{center}
 \begin{tabular}{lcl}
 $|v_2|$: & $bc$\\
 $|v_1|$: & $abc$
 \end{tabular}
 \end{center}
-\noindent
+\noindent This indicates that $inj$ indeed is injecting, or
-This indicates that $inj$ indeed is injecting, or adding, back
+adding, back a character into the value. Next we look at how
-a character into the value.
+simplification can be included into this algorithm.
 \subsubsection*{Simplification}
 Generally the matching algorithms based on derivatives do
 poorly unless the regular expressions are simplified after
-each derivatives step. But this is a bit more involved in
+each derivative step. But this is a bit more involved in the
-algorithm of Sulzmann \& Lu. Consider the last derivation
+algorithm of Sulzmann \& Lu. So what follows might require you
-step in our running example
+to read several times before it makes sense and also might
+require that you do some example calculations. As a first
-\begin{center}
+example consider the last derivation step in our earlier
-$r_4 = der\,c\,r_3 = (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
+example:
-\end{center}
+\begin{center}
-\noindent Simplifying the result would just give us
+$r_4 = der\,c\,r_3 =
-$\epsilon$. Running $mkeps$ on this regular expression would
+(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
-then provide us with $Empty$ instead of $Right(Right(Empty))$
+\end{center}
-that was obtained without the simplification. The problem is
-we need to recreate this more complicated value, rather than
+\noindent Simplifying this regular expression would just give
-just $Empty$.
+us $\epsilon$. Running $mkeps$ with this regular expression as
+input, however, would then provide us with $Empty$ instead of
-This requires what I call \emph{rectification functions}. They
+$Right(Right(Empty))$ that was obtained without the
-need to be calculated whenever a regular expression gets
+simplification. The problem is we need to recreate this more
+complicated value, rather than just $Empty$.
+This will require what I call \emph{rectification functions}.
+They need to be calculated whenever a regular expression gets
 simplified. Rectification functions take a value as argument
-and return a (rectified) value. Our simplification rules so
+and return a (rectified) value. Let us first take a look again
-far are
+at our simplification rules:
 \begin{center}
 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
 $r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\
 $\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\
 $r + r$ & $\mapsto$ & $r$\\
 \end{tabular}
 \end{center}
 \noindent Applying them to $r_4$ will require several nested
-simplifications in order end up with just $\epsilon$.
+simplifications in order end up with just $\epsilon$. However,
+it is possible to apply them in a depth-first, or inside-out,
-We can implement this by letting simp return not just a
+manner in order to calculate this simplified regular
-(simplified) regular expression, but also a rectification
+expression.
-function. Let us consider the alternative case, say $r_1 +
-r_2$, first. We would first simplify the component regular
+The rectification we can implement this by letting simp return
-expressions $r_1$ and $r_2.$ This will return simplified
+not just a (simplified) regular expression, but also a
-versions (if they can be simplified), say $r_{1s}$ and
+rectification function. Let us consider the alternative case,
-$r_{2s}$, but also two rectification functions $f_{1s}$ and
+$r_1 + r_2$, first. By going depth-first, we first simplify
-$f_{2s}$. We need to assemble them in order to obtain a
+the component regular expressions $r_1$ and $r_2.$ This will
-rectified value for $r_1 + r_2$. In case $r_{1s}$ simplified
+return simplified versions (if they can be simplified), say
-to $\varnothing$, we would continue the derivative calculation
+$r_{1s}$ and $r_{2s}$, but also two rectification functions
-with $r_{2s}$. The Sulzmann \& Lu algorithm would return a
+$f_{1s}$ and $f_{2s}$. We need to assemble them in order to
-corresponding value, say $v_{2s}$. But now this needs to
+obtain a rectified value for $r_1 + r_2$. In case $r_{1s}$
-be ``rectified'' to the value
+simplified to $\varnothing$, we continue the derivative
+calculation with $r_{2s}$. The Sulzmann \& Lu algorithm would
+return a corresponding value, say $v_{2s}$. But now this value
+needs to be ``rectified'' to the value
 \begin{center}
 $Right(v_{2s})$
 \end{center}
-\noindent Unfortunately, this is not enough because there
+\noindent The reason is that we look for the value that tells
-might be some simplifications that happened inside $r_2$
+us how $r_1 + r_2$ could have matched the string, not just
-and for which the simplification function retuned also
+$r_2$ or $r_{2s}$. Unfortunately, this is still not the right
-a rectification function $f_{2s}$. So in fact we need to
+value in general because there might be some simplifications
-apply this one too which gives
+that happened inside $r_2$ and for which the simplification
+function retuned also a rectification function $f_{2s}$. So in
+fact we need to apply this one too which gives
 \begin{center}
 $Right(f_{2s}(v_{2s}))$
 \end{center}
-\noindent So if we want to return this as function,
+\noindent This is now the correct, or rectified, value. Since
-we would need to return
+the simplification will be done in the first phase of the
+algorithm, but the rectification needs to be done to the
+values in the second phase, it is advantageous to calculate
+the rectification as a function, remember this function and
+then apply the value to this function during the second phase.
+So if we want to implement the rectification as function, we
+would need to return
 \begin{center}
 $\lambda v.\,Right(f_{2s}(v))$
 \end{center}
 \noindent which is the lambda-calculus notation for
 a function that expects a value $v$ and returns everything
 after the dot where $v$ is replaced by whatever value is
 given.
-Let us package these ideas into a single function (still only
+Let us package this idea with rectification functions
-considering the alternative case):
+into a single function (still only considering the alternative
+case):
 \begin{center}
 \begin{tabular}{l}
 $simp(r)$:\\
 \quad case $r = r_1 + r_2$\\
 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 \qquad case $r_{1s} = \varnothing$:
 return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
 \qquad case $r_{2s} = \varnothing$:
 return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
+\qquad case $r_{1s} = r_{2s}$:
+return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
 \qquad otherwise:
 return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
 \end{tabular}
 \end{center}
-\noindent We first recursively call the simlification with $r_1$
+\noindent We first recursively call the simplification with
-and $r_2$. This gives simplified regular expressions, $r_{1s}$
+$r_1$ and $r_2$. This gives simplified regular expressions,
-and $r_{2s}$, as well as two rectification functions $f_{1s}$
+$r_{1s}$ and $r_{2s}$, as well as two rectification functions
-and $f_{2s}$. We next need to test whether the simplified
+$f_{1s}$ and $f_{2s}$. We next need to test whether the
-regular expressions are $\varnothing$ so as to make further
+simplified regular expressions are $\varnothing$ so as to make
-simplifications. In case $r_{1s}$ is $\varnothing$ then
+further simplifications. In case $r_{1s}$ is $\varnothing$,
-we can return $r_{2s}$ (the other alternative). However
+then we can return $r_{2s}$ (the other alternative). However
-we need to now build a rectification function, which as
+for this we need to build a corresponding rectification
-said above is
+function, which as said above is
 \begin{center}
 $\lambda v.\,Right(f_{2s}(v))$
 \end{center}
-\noindent The case where $r_{2s} = \varnothing$ is similar.
+\noindent The case where $r_{2s} = \varnothing$ is similar:
-We return $r_{1s}$ but now have to rectify such that we return
+We return $r_{1s}$ and rectify with $Left(\_)$ and the
+other calculated rectification function $f_{1s}$. This gives
 \begin{center}
 $\lambda v.\,Left(f_{1s}(v))$
 \end{center}
-\noindent Note that in this case we have to apply $f_{1s}$,
+\noindent The next case where $r_{1s} = r_{2s}$ can be treated
-not $f_{2s}$, which is responsible to rectify the inner parts
+like the one where $r_{2s} = \varnothing$. We return $r_{1s}$
-of $v$. The otherwise-case is slightly interesting. In this
+and rectify with $Left(\_)$ and so on.
-case neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and no
-further simplification can be applied. Accordingly, we return
-$r_{1s} + r_{2s}$ as the simplified regular expression. In
+The otherwise-case is slightly more complicated. In this case
-principle we also do not have to do any rectification, because
+neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and also
-no simplification was done in this case. But this is actually
+$r_{1s} \not= r_{2s}$, which means no further simplification
-not true: There might have been simplifications inside
+can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
-$r_{1s}$ and $r_2s$. We therefore need to take into account
+the simplified regular expression. In principle we also do not
-the calculated rectification functions $f_{1s}$ and $f_{2s}$.
+have to do any rectification, because no simplification was
-We can do this by defining a rectification function $f_{alt}$
+done in this case. But this is actually not true: There might
-which takes two rectification functions as arguments
+have been simplifications inside $r_{1s}$ and $r_{2s}$. We
+therefore need to take into account the calculated
+rectification functions $f_{1s}$ and $f_{2s}$. We can do this
+by defining a rectification function $f_{alt}$ which takes two
+rectification functions as arguments and applies them
+according to whether the value is of the form $Left(\_)$ or
+$Right(\_)$:
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}l}
 $f_{alt}(f_1, f_2) \dn$\\
 \qquad $\lambda v.\,$ case $v = Left(v')$:
 \qquad \phantom{$\lambda v.\,$} case $v = Right(v')$:
 & return $Right(f_2(v'))$\\
 \end{tabular}
 \end{center}
-\noindent In essence we need to apply in this case
+The other interesting case with simplification is the sequence
-the appropriate rectification function to the inner part
+case. In this case the main simplification function is as
-of the value $v$, whereby $v$ can only be of the form
+follows
-$Right(\_)$ or $Left(\_)$.
-The other interesting case with simplification is the
-sequence case. Here the main simplification function
-is as follows
 \begin{center}
 \begin{tabular}{l}
 $simp(r)$:\qquad\qquad (continued)\\
 \quad case $r = r_1 \cdot r_2$\\
 \qquad otherwise:
 return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
 \end{tabular}
 \end{center}
-\noindent whereby $f_{seq}$ is again pushing the two
+\noindent whereby in the last line $f_{seq}$ is again pushing
-rectification functions into the two components of
+the two rectification functions into the two components of the
-the Seq-value:
+Seq-value:
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}l}
 $f_{seq}(f_1, f_2) \dn$\\
 \qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$:
 & return $Seq(f_1(v_1), f_2(v_2))$\\
 \end{tabular}
 \end{center}
-\noindent Note that in the case of $r_{1s}$ (similarly $r_{2s}$)
+\noindent Note that in the case of $r_{1s} = \varnothing$
-we use the function $f_{error}$ for rectification. If you
+(similarly $r_{2s}$) we use the function $f_{error}$ for
-think carefully, then you will see that this function will
+rectification. If you think carefully, then you will realise
-actually never been called. Because a sequence with
+that this function will actually never been called. This is
-$\varnothing$ will never recognise any string and therefore
+because a sequence with $\varnothing$ will never recognise any
-the second phase of the algorithm would never been called.
+string and therefore the second phase of the algorithm would
-The simplification function still expects us to give a
+never been called. The simplification function still expects
-function. So in my own implementation I just returned
+us to give a function. So in my own implementation I just
-a function which raises an error.
+returned a function which raises an error. In the case
+where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have
+to create a sequence where the first component is a rectified
+version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
+Since we only simplify regular expressions of the form $r_1 +
+r_2$ and $r_1 \cdot r_2$ we do not have to do anything else
+in the remaining cases. The rectification function will
+be just the identity, which in lambda-calculus terms is
+just
+\begin{center}
+$\lambda v.\,v$
+\end{center}
+\noindent This completes the high-level version of the
+simplification function, which is also shown again in
+Figure~\ref{simp}. This can now be used in a \emph{lexing
+function} as follows:
+\begin{figure}[t]
+\begin{center}
+\begin{tabular}{l}
+$simp(r)$:\\
+\quad case $r = r_1 + r_2$\\
+\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
+\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
+\qquad case $r_{1s} = \varnothing$:
+return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
+\qquad case $r_{2s} = \varnothing$:
+return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
+\qquad case $r_{1s} = r_{2s}$:
+return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
+\qquad otherwise:
+return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
+\medskip\\
+\quad case $r = r_1 \cdot r_2$\\
+\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
+\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
+\qquad case $r_{1s} = \varnothing$:
+return $(\varnothing, f_{error})$\\
+\qquad case $r_{2s} = \varnothing$:
+return $(\varnothing, f_{error})$\\
+\qquad case $r_{1s} = \epsilon$:
+return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
+\qquad case $r_{2s} = \epsilon$:
+return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
+\qquad otherwise:
+return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
+\medskip\\
+\quad otherwise:\\
+\qquad return $(r, \lambda v.\,v)$\\
+\end{tabular}
+\end{center}
+\caption{The simplification function that returns a simplified
+regular expression and a rectification function.\label{simp}}
+\end{figure}
+\begin{center}
+\begin{tabular}{lcl}
+$lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\
+&       & else $error$\\
+$lex\,r\,c\!::\!s$ & $\dn$ & let
+$(r_{simp}, f_{rect}) = simp(der(c, r))$\\
+& & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$
+\end{tabular}
+\end{center}
+\noindent This corresponds to the $matches$ function we
+have seen in earlier lectures. In the first clause we are
+given an empty string, $[]$, and need to test wether the
+regular expression is $nullable$. If yes we can proceed
+normally and just return the value calculated by $mkeps$.
+The second clause is for strings where the first character is
+$c$ and the rest of the string is $s$. We first build the
+derivative of $r$ with respect to $c$; simplify the resulting
+regulare expression. We continue lexing with the simplified
+regular expression and the string $s$. Whatever will be
+returned as value, we sill rectify using the $f_{rect}$
+from the simplification and finally inject $c$ back into
+the (rectified) value.
 \subsubsection*{Records and Tokenisation}
 \newpage

changeset 287	2c50b8b5886c
parent 286	19020b75d75e
child 288	39aeca14af8c