diff -r 19020b75d75e -r 2c50b8b5886c handouts/ho04.tex --- a/handouts/ho04.tex Fri Oct 17 14:18:15 2014 +0100 +++ b/handouts/ho04.tex Sat Oct 18 01:02:03 2014 +0100 @@ -12,8 +12,8 @@ yes or no depending on whether a string was matched by regular expression or not. Often a more interesting question is to find out \emph{how} a regular expression matched a string? -Answering this question will help us with the problem we are -after, namely tokenising an input string. The algorithm we +Answering this question will also help us with the problem we +are after, namely tokenising an input string. The algorithm we will be looking at for this was designed by Sulzmann \& Lu in a rather recent paper. A link to it is provided on KEATS, in case you are interested.\footnote{In my humble opinion this is @@ -73,7 +73,21 @@ that was needed to match the string. This means we might also return the empty list $[]$, if no copy was needed. -Graphically the algorithm by Sulzmann \& Lu can be represneted +To emphasise the connection between regular expressions and +values, I have in my implementation the convention that +regular expressions are written entirely with upper-case +letters, while values just start with a single upper-case +character. My definition of values in Scala is below. I use +this in the REPL of Scala; when I use the Scala compiler +I need to rename some constructors, because Scala on Macs +does not like classes that are called \pcode{EMPTY} and +\pcode{Empty}. + +{\small\lstinputlisting[language=Scala,numbers=none] +{../progs/app02.scala}} + + +Graphically the algorithm by Sulzmann \& Lu can be illustrated by the picture in Figure~\ref{Sulz} where the path from the left to the right involving $der/nullable$ is the first phase of the algorithm and $mkeps/inj$, the path from right to left, @@ -132,17 +146,17 @@ regular expression on the left-hand side. This will become important later on. -The second phase of the algorithm is organised recursively -such that it will calculate a value for how the derivative -regular expression has matched a string whose first character -has been chopped off. Now we need a function that reverses -this ``chopping off'' for values. The corresponding function -is called $inj$ for injection. This function takes three -arguments: the first one is a regular expression for which we -want to calculate the value, the second is the character we -want to inject and the third argument is the value where we -will inject the character. The result of this function is a -new value. The definition of $inj$ is as follows: +The second phase of the algorithm is organised so that it will +calculate a value for how the derivative regular expression +has matched a string whose first character has been chopped +off. Now we need a function that reverses this ``chopping +off'' for values. The corresponding function is called $inj$ +for injection. This function takes three arguments: the first +one is a regular expression for which we want to calculate the +value, the second is the character we want to inject and the +third argument is the value where we will inject the +character. The result of this function is a new value. The +definition of $inj$ is as follows: \begin{center} \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} @@ -161,12 +175,12 @@ there are, for example, three cases for sequnece regular expressions. The last clause for the star regular expression returns a list where the first element is $inj\,r\,c\,v$ and -the other elements are $vs$. That mean $\_\,::\,\_$ should be +the other elements are $vs$. That means $\_\,::\,\_$ should be read as list cons. To understand what is going on, it might be best to do some -example calculations and compare with Figure~\ref{Sulz}. For -this note that we have not yet dealt with the need of +example calculations and compare them with Figure~\ref{Sulz}. +For this note that we have not yet dealt with the need of simplifying regular expressions (this will be a topic on its own later). Suppose the regular expression is $a \cdot (b \cdot c)$ and the input string is $abc$. The derivatives from @@ -215,8 +229,9 @@ $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$ \end{center} -\noindent Finally we need to inject back the letter $a$ into -$v_2$ giving the final result +\noindent which is again the correct result for how $r_2$ +matched the string $bc$. Finally we need to inject back the +letter $a$ into $v_2$ giving the final result \begin{center} $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ @@ -226,7 +241,7 @@ expression $a \cdot (b \cdot c)$ matched the string $abc$. There are a few auxiliary functions that are of interest -in analysing this algorithm. One is called \emph{flatten}, +when analysing this algorithm. One is called \emph{flatten}, written $|\_|$, which extracts the string ``underlying'' a value. It is defined recursively as @@ -254,35 +269,39 @@ \end{tabular} \end{center} -\noindent -This indicates that $inj$ indeed is injecting, or adding, back -a character into the value. +\noindent This indicates that $inj$ indeed is injecting, or +adding, back a character into the value. Next we look at how +simplification can be included into this algorithm. \subsubsection*{Simplification} Generally the matching algorithms based on derivatives do poorly unless the regular expressions are simplified after -each derivatives step. But this is a bit more involved in -algorithm of Sulzmann \& Lu. Consider the last derivation -step in our running example +each derivative step. But this is a bit more involved in the +algorithm of Sulzmann \& Lu. So what follows might require you +to read several times before it makes sense and also might +require that you do some example calculations. As a first +example consider the last derivation step in our earlier +example: \begin{center} -$r_4 = der\,c\,r_3 = (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$ +$r_4 = der\,c\,r_3 = +(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$ \end{center} -\noindent Simplifying the result would just give us -$\epsilon$. Running $mkeps$ on this regular expression would -then provide us with $Empty$ instead of $Right(Right(Empty))$ -that was obtained without the simplification. The problem is -we need to recreate this more complicated value, rather than -just $Empty$. +\noindent Simplifying this regular expression would just give +us $\epsilon$. Running $mkeps$ with this regular expression as +input, however, would then provide us with $Empty$ instead of +$Right(Right(Empty))$ that was obtained without the +simplification. The problem is we need to recreate this more +complicated value, rather than just $Empty$. -This requires what I call \emph{rectification functions}. They -need to be calculated whenever a regular expression gets +This will require what I call \emph{rectification functions}. +They need to be calculated whenever a regular expression gets simplified. Rectification functions take a value as argument -and return a (rectified) value. Our simplification rules so -far are +and return a (rectified) value. Let us first take a look again +at our simplification rules: \begin{center} \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l} @@ -297,38 +316,49 @@ \end{center} \noindent Applying them to $r_4$ will require several nested -simplifications in order end up with just $\epsilon$. +simplifications in order end up with just $\epsilon$. However, +it is possible to apply them in a depth-first, or inside-out, +manner in order to calculate this simplified regular +expression. -We can implement this by letting simp return not just a -(simplified) regular expression, but also a rectification -function. Let us consider the alternative case, say $r_1 + -r_2$, first. We would first simplify the component regular -expressions $r_1$ and $r_2.$ This will return simplified -versions (if they can be simplified), say $r_{1s}$ and -$r_{2s}$, but also two rectification functions $f_{1s}$ and -$f_{2s}$. We need to assemble them in order to obtain a -rectified value for $r_1 + r_2$. In case $r_{1s}$ simplified -to $\varnothing$, we would continue the derivative calculation -with $r_{2s}$. The Sulzmann \& Lu algorithm would return a -corresponding value, say $v_{2s}$. But now this needs to -be ``rectified'' to the value +The rectification we can implement this by letting simp return +not just a (simplified) regular expression, but also a +rectification function. Let us consider the alternative case, +$r_1 + r_2$, first. By going depth-first, we first simplify +the component regular expressions $r_1$ and $r_2.$ This will +return simplified versions (if they can be simplified), say +$r_{1s}$ and $r_{2s}$, but also two rectification functions +$f_{1s}$ and $f_{2s}$. We need to assemble them in order to +obtain a rectified value for $r_1 + r_2$. In case $r_{1s}$ +simplified to $\varnothing$, we continue the derivative +calculation with $r_{2s}$. The Sulzmann \& Lu algorithm would +return a corresponding value, say $v_{2s}$. But now this value +needs to be ``rectified'' to the value \begin{center} $Right(v_{2s})$ \end{center} -\noindent Unfortunately, this is not enough because there -might be some simplifications that happened inside $r_2$ -and for which the simplification function retuned also -a rectification function $f_{2s}$. So in fact we need to -apply this one too which gives +\noindent The reason is that we look for the value that tells +us how $r_1 + r_2$ could have matched the string, not just +$r_2$ or $r_{2s}$. Unfortunately, this is still not the right +value in general because there might be some simplifications +that happened inside $r_2$ and for which the simplification +function retuned also a rectification function $f_{2s}$. So in +fact we need to apply this one too which gives \begin{center} $Right(f_{2s}(v_{2s}))$ \end{center} -\noindent So if we want to return this as function, -we would need to return +\noindent This is now the correct, or rectified, value. Since +the simplification will be done in the first phase of the +algorithm, but the rectification needs to be done to the +values in the second phase, it is advantageous to calculate +the rectification as a function, remember this function and +then apply the value to this function during the second phase. +So if we want to implement the rectification as function, we +would need to return \begin{center} $\lambda v.\,Right(f_{2s}(v))$ @@ -339,8 +369,9 @@ after the dot where $v$ is replaced by whatever value is given. -Let us package these ideas into a single function (still only -considering the alternative case): +Let us package this idea with rectification functions +into a single function (still only considering the alternative +case): \begin{center} \begin{tabular}{l} @@ -352,45 +383,54 @@ return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\ \qquad case $r_{2s} = \varnothing$: return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\ +\qquad case $r_{1s} = r_{2s}$: + return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\ \qquad otherwise: return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\ \end{tabular} \end{center} -\noindent We first recursively call the simlification with $r_1$ -and $r_2$. This gives simplified regular expressions, $r_{1s}$ -and $r_{2s}$, as well as two rectification functions $f_{1s}$ -and $f_{2s}$. We next need to test whether the simplified -regular expressions are $\varnothing$ so as to make further -simplifications. In case $r_{1s}$ is $\varnothing$ then -we can return $r_{2s}$ (the other alternative). However -we need to now build a rectification function, which as -said above is +\noindent We first recursively call the simplification with +$r_1$ and $r_2$. This gives simplified regular expressions, +$r_{1s}$ and $r_{2s}$, as well as two rectification functions +$f_{1s}$ and $f_{2s}$. We next need to test whether the +simplified regular expressions are $\varnothing$ so as to make +further simplifications. In case $r_{1s}$ is $\varnothing$, +then we can return $r_{2s}$ (the other alternative). However +for this we need to build a corresponding rectification +function, which as said above is \begin{center} $\lambda v.\,Right(f_{2s}(v))$ \end{center} -\noindent The case where $r_{2s} = \varnothing$ is similar. -We return $r_{1s}$ but now have to rectify such that we return +\noindent The case where $r_{2s} = \varnothing$ is similar: +We return $r_{1s}$ and rectify with $Left(\_)$ and the +other calculated rectification function $f_{1s}$. This gives \begin{center} $\lambda v.\,Left(f_{1s}(v))$ \end{center} -\noindent Note that in this case we have to apply $f_{1s}$, -not $f_{2s}$, which is responsible to rectify the inner parts -of $v$. The otherwise-case is slightly interesting. In this -case neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and no -further simplification can be applied. Accordingly, we return -$r_{1s} + r_{2s}$ as the simplified regular expression. In -principle we also do not have to do any rectification, because -no simplification was done in this case. But this is actually -not true: There might have been simplifications inside -$r_{1s}$ and $r_2s$. We therefore need to take into account -the calculated rectification functions $f_{1s}$ and $f_{2s}$. -We can do this by defining a rectification function $f_{alt}$ -which takes two rectification functions as arguments +\noindent The next case where $r_{1s} = r_{2s}$ can be treated +like the one where $r_{2s} = \varnothing$. We return $r_{1s}$ +and rectify with $Left(\_)$ and so on. + + +The otherwise-case is slightly more complicated. In this case +neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and also +$r_{1s} \not= r_{2s}$, which means no further simplification +can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as +the simplified regular expression. In principle we also do not +have to do any rectification, because no simplification was +done in this case. But this is actually not true: There might +have been simplifications inside $r_{1s}$ and $r_{2s}$. We +therefore need to take into account the calculated +rectification functions $f_{1s}$ and $f_{2s}$. We can do this +by defining a rectification function $f_{alt}$ which takes two +rectification functions as arguments and applies them +according to whether the value is of the form $Left(\_)$ or +$Right(\_)$: \begin{center} \begin{tabular}{l@{\hspace{1mm}}l} @@ -402,14 +442,9 @@ \end{tabular} \end{center} -\noindent In essence we need to apply in this case -the appropriate rectification function to the inner part -of the value $v$, whereby $v$ can only be of the form -$Right(\_)$ or $Left(\_)$. - -The other interesting case with simplification is the -sequence case. Here the main simplification function -is as follows +The other interesting case with simplification is the sequence +case. In this case the main simplification function is as +follows \begin{center} \begin{tabular}{l} @@ -430,9 +465,9 @@ \end{tabular} \end{center} -\noindent whereby $f_{seq}$ is again pushing the two -rectification functions into the two components of -the Seq-value: +\noindent whereby in the last line $f_{seq}$ is again pushing +the two rectification functions into the two components of the +Seq-value: \begin{center} \begin{tabular}{l@{\hspace{1mm}}l} @@ -442,16 +477,97 @@ \end{tabular} \end{center} -\noindent Note that in the case of $r_{1s}$ (similarly $r_{2s}$) -we use the function $f_{error}$ for rectification. If you -think carefully, then you will see that this function will -actually never been called. Because a sequence with -$\varnothing$ will never recognise any string and therefore -the second phase of the algorithm would never been called. -The simplification function still expects us to give a -function. So in my own implementation I just returned -a function which raises an error. +\noindent Note that in the case of $r_{1s} = \varnothing$ +(similarly $r_{2s}$) we use the function $f_{error}$ for +rectification. If you think carefully, then you will realise +that this function will actually never been called. This is +because a sequence with $\varnothing$ will never recognise any +string and therefore the second phase of the algorithm would +never been called. The simplification function still expects +us to give a function. So in my own implementation I just +returned a function which raises an error. In the case +where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have +to create a sequence where the first component is a rectified +version of $Empty$. Therefore we call $f_{1s}$ with $Empty$. + +Since we only simplify regular expressions of the form $r_1 + +r_2$ and $r_1 \cdot r_2$ we do not have to do anything else +in the remaining cases. The rectification function will +be just the identity, which in lambda-calculus terms is +just + +\begin{center} +$\lambda v.\,v$ +\end{center} + +\noindent This completes the high-level version of the +simplification function, which is also shown again in +Figure~\ref{simp}. This can now be used in a \emph{lexing +function} as follows: + +\begin{figure}[t] +\begin{center} +\begin{tabular}{l} +$simp(r)$:\\ +\quad case $r = r_1 + r_2$\\ +\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\ +\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ +\qquad case $r_{1s} = \varnothing$: + return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\ +\qquad case $r_{2s} = \varnothing$: + return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\ +\qquad case $r_{1s} = r_{2s}$: + return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\ +\qquad otherwise: + return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$ + \medskip\\ +\quad case $r = r_1 \cdot r_2$\\ +\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\ +\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ +\qquad case $r_{1s} = \varnothing$: + return $(\varnothing, f_{error})$\\ +\qquad case $r_{2s} = \varnothing$: + return $(\varnothing, f_{error})$\\ +\qquad case $r_{1s} = \epsilon$: +return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\ +\qquad case $r_{2s} = \epsilon$: +return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\ +\qquad otherwise: + return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$ + \medskip\\ + +\quad otherwise:\\ +\qquad return $(r, \lambda v.\,v)$\\ +\end{tabular} +\end{center} +\caption{The simplification function that returns a simplified +regular expression and a rectification function.\label{simp}} +\end{figure} + +\begin{center} +\begin{tabular}{lcl} +$lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\ + & & else $error$\\ +$lex\,r\,c\!::\!s$ & $\dn$ & let + $(r_{simp}, f_{rect}) = simp(der(c, r))$\\ +& & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ +\end{tabular} +\end{center} + +\noindent This corresponds to the $matches$ function we +have seen in earlier lectures. In the first clause we are +given an empty string, $[]$, and need to test wether the +regular expression is $nullable$. If yes we can proceed +normally and just return the value calculated by $mkeps$. +The second clause is for strings where the first character is +$c$ and the rest of the string is $s$. We first build the +derivative of $r$ with respect to $c$; simplify the resulting +regulare expression. We continue lexing with the simplified +regular expression and the string $s$. Whatever will be +returned as value, we sill rectify using the $f_{rect}$ +from the simplification and finally inject $c$ back into +the (rectified) value. \subsubsection*{Records and Tokenisation}