# HG changeset patch # User Christian Urban # Date 1413543379 -3600 # Node ID 0afe43616b6ad64a62149038d94c0d98775bbb6c # Parent c14e5ebf0c3bd9f58cd1b8ed99c499d1ba997bd8 updated diff -r c14e5ebf0c3b -r 0afe43616b6a handouts/ho04.pdf Binary file handouts/ho04.pdf has changed diff -r c14e5ebf0c3b -r 0afe43616b6a handouts/ho04.tex --- a/handouts/ho04.tex Thu Oct 16 17:30:05 2014 +0100 +++ b/handouts/ho04.tex Fri Oct 17 11:56:19 2014 +0100 @@ -13,31 +13,33 @@ expression or not. Often a more interesting question is to find out \emph{how} a regular expression matched a string? Answering this question will help us with the problem we are -after, namely tokenising an input string, that is splitting it -up into its ``word'' components. The algorithm we will be -looking at was designed by Sulzmann \& Lu in a rather recent -paper. A link to it is provided on KEATS, in case you are -interested.\footnote{In my humble opinion this is an -interesting instance of the research literature: it contains a -very neat idea, but its presentation is rather sloppy. In -earlier versions of their paper, students and I found several -rather annoying typos in their examples and definitions.} +after, namely tokenising an input string. The algorithm we +will be looking at for this was designed by Sulzmann \& Lu in +a rather recent paper. A link to it is provided on KEATS, in +case you are interested.\footnote{In my humble opinion this is +an interesting instance of the research literature: it +contains a very neat idea, but its presentation is rather +sloppy. In earlier versions of their paper, students and I +found several rather annoying typos in their examples and +definitions.} -In order to give an answer for how a regular expression matched -a string, Sulzmann and Lu introduce \emph{values}. A value -will be the output of the algorithm whenever the regular -expression matches the string. If not, an error will be raised. -Since the first phase of the algorithm is identical to the -derivative based matcher from the first coursework, the -function $nullable$ will be used to decide whether as string -is matched by a regular expression. If $nullable$ says -yes, then values are constructed that reflect how the regular -expression matched the string. The definitions for regular -expressions $r$ and values $v$ is shown below: +In order to give an answer for how a regular expression +matched a string, Sulzmann and Lu introduce \emph{values}. A +value will be the output of the algorithm whenever the regular +expression matches the string. If not, an error will be +raised. Since the first phase of the algorithm by Sulzmann \& +Lu is identical to the derivative based matcher from the first +coursework, the function $nullable$ will be used to decide +whether as string is matched by a regular expression. If +$nullable$ says yes, then values are constructed that reflect +how the regular expression matched the string. The definitions +for regular expressions $r$ and values $v$ is shown next to +each other below: \begin{center} \begin{tabular}{cc} \begin{tabular}{@{}rrl@{}} +\multicolumn{3}{c}{regular expressions}\\ $r$ & $::=$ & $\varnothing$\\ & $\mid$ & $\epsilon$ \\ & $\mid$ & $c$ \\ @@ -48,7 +50,8 @@ \end{tabular} & \begin{tabular}{@{\hspace{0mm}}rrl@{}} - $v$ & $::=$ & \\ +\multicolumn{3}{c}{values}\\ + $v$ & $::=$ & \\ & & $Empty$ \\ & $\mid$ & $Char(c)$ \\ & $\mid$ & $Seq(v_1,v_2)$\\ @@ -59,26 +62,27 @@ \end{tabular} \end{center} -\noindent As you can see there is a very strong correspondence -between regular expressions and values. There is no value for -the $\varnothing$ regular expression (since it does not match -any string). Then there is exactly one value corresponding to -each regular expression. with the exception of $r_1 + r_2$ -where there are two values $Left(v)$ and $Right(v)$ -corresponding to the two alternatives. Note that $r^*$ -is associated with a list of values, one for each copy of $r$ -that was needed to match the string. This mean we might also +\noindent The point is that there is a very strong +correspondence between them. There is no value for the +$\varnothing$ regular expression, since it does not match any +string. Otherwise there is exactly one value corresponding to +each regular expression with the exception of $r_1 + r_2$ +where there are two values, namely $Left(v)$ and $Right(v)$ +corresponding to the two alternatives. Note that $r^*$ is +associated with a list of values, one for each copy of $r$ +that was needed to match the string. This means we might also return the empty list $[]$, if no copy was needed. -Graphically the algorithm can be represneted by the -picture in Figure~\ref{Sulz} where the path involving $der/nullable$ is the first -phase of the algorithm and $mkeps/inj$ the second phase. -This picture shows the steps required when a regular -expression, say $r_1$, matches the string $abc$. We first -build the three derivatives (according to $a$, $b$ and $c$). -We then use $nullable$ to find out whether the resulting -regular expression can match the empty string. If yes -we call the function $mkeps$. +Graphically the algorithm by Sulzmann \& Lu can be represneted +by the picture in Figure~\ref{Sulz} where the path from the +left to the right involving $der/nullable$ is the first phase +of the algorithm and $mkeps/inj$, the path from right to left, +the second phase. This picture shows the steps required when a +regular expression, say $r_1$, matches the string $abc$. We +first build the three derivatives (according to $a$, $b$ and +$c$). We then use $nullable$ to find out whether the resulting +regular expression can match the empty string. If yes we call +the function $mkeps$. \begin{figure}[t] \begin{center} @@ -108,7 +112,7 @@ \end{figure} The $mkeps$ function calculates a value for how a regular -expression could have matched the empty string. Its definition +expression has matched the empty string. Its definition is as follows: \begin{center} @@ -123,23 +127,22 @@ \end{center} \noindent There are no cases for $\epsilon$ and $c$, since -these regular expression cannot match the empty string. -Note that in case of alternatives we give preference to the +these regular expression cannot match the empty string. Note +also that in case of alternatives we give preference to the regular expression on the left-hand side. This will become important later on. -The algorithm is organised recursively such that it will -calculate a value for how the derivative regular expression -has matched a string where the first character has been -chopped off. Now we need a function that reverses this -``chopping off'' for values. The corresponding function -is called $inj$ for injection. This function takes -three arguments: the first one is a regular expression -for which we want to calculate the value, the second is the -character we want to inject and the third argument is the -value where we will inject the character. The result -of this function is a new value. The definition of $inj$ -is as follows: +The second phase of the algorithm is organised recursively +such that it will calculate a value for how the derivative +regular expression has matched a string whose first character +has been chopped off. Now we need a function that reverses +this ``chopping off'' for values. The corresponding function +is called $inj$ for injection. This function takes three +arguments: the first one is a regular expression for which we +want to calculate the value, the second is the character we +want to inject and the third argument is the value where we +will inject the character. The result of this function is a +new value. The definition of $inj$ is as follows: \begin{center} \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} @@ -153,20 +156,21 @@ \end{tabular} \end{center} -\noindent This definition is by recursion on the -regular expression and by analysing the shape of the values. -Therefore there are, for example, three cases for sequnece -regular expressions. The last clause returns a list where -the first element is $inj\,r\,c\,v$ and the other elements -are $vs$. +\noindent This definition is by recursion on the regular +expression and by analysing the shape of the values. Therefore +there are, for example, three cases for sequnece regular +expressions. The last clause for the star regular expression +returns a list where the first element is $inj\,r\,c\,v$ and +the other elements are $vs$. That mean $\_\,::\,\_$ should be +read as list cons. To understand what is going on, it might be best to do some example calculations and compare with Figure~\ref{Sulz}. For this note that we have not yet dealt with the need of -simplifying regular expreesions (this will be a topic on its +simplifying regular expressions (this will be a topic on its own later). Suppose the regular expression is $a \cdot (b -\cdot c)$ and the input string is $abc$. The derivatives are -as follows: +\cdot c)$ and the input string is $abc$. The derivatives from +the first phase are as follows: \begin{center} \begin{tabular}{ll} @@ -177,24 +181,24 @@ \end{tabular} \end{center} -\noindent According to the simple algorithm. we would test -whether $r_4$ is nullable, which it is. This means we can -use the function $mkeps$ to calculate a value for how $r_4$ -was able to match the empty string. Remember that this -function gives preference for alternatives on the left-hand -side. However there is only $\epsilon$ on the very right-hand -side of $r_4$ that matches the empty string. Therefore -$mkeps$ returns the value +\noindent According to the simple algorithm, we would test +whether $r_4$ is nullable, which in this case it is. This +means we can use the function $mkeps$ to calculate a value for +how $r_4$ was able to match the empty string. Remember that +this function gives preference for alternatives on the +left-hand side. However there is only $\epsilon$ on the very +right-hand side of $r_4$ that matches the empty string. +Therefore $mkeps$ returns the value \begin{center} $v_4:\;Right(Right(Empty))$ \end{center} -\noindent The point is that from this value we can directly -read off which part of $r_4$ matched the empty string. Next we +\noindent The point is that from this value we can directly +read off which part of $r_4$ matched the empty string. Next we have to ``inject'' the last character, that is $c$ in the -running example, into this value in order to calculate how -$r_3$ could have matched the string $c$. According to the +running example, into this value $v_4$ in order to calculate +how $r_3$ could have matched the string $c$. According to the definition of $inj$ we obtain \begin{center} @@ -220,13 +224,11 @@ \noindent This now corresponds to how the regular expression $a \cdot (b \cdot c)$ matched the string $abc$. -This is the expected result. So at least in this case the -algorithm seems to calculate what it is supposed to. -There are a few auxiliary function that are of interest +There are a few auxiliary functions that are of interest in analysing this algorithm. One is called \emph{flatten}, written $|\_|$, which extracts the string ``underlying'' a -value. It is defined as +value. It is defined recursively as \begin{center} \begin{tabular}{lcl} @@ -241,14 +243,14 @@ \noindent Using flatten we can see what is the string behind the values calculated by $mkeps$ and $inj$ in our running -example: +example are: \begin{center} \begin{tabular}{ll} -$v_4$: & $[]$\\ -$v_3$: & $c$\\ -$v_2$: & $bc$\\ -$v_1$: & $abc$ +$|v_4|$: & $[]$\\ +$|v_3|$: & $c$\\ +$|v_2|$: & $bc$\\ +$|v_1|$: & $abc$ \end{tabular} \end{center} @@ -261,7 +263,58 @@ Generally the matching algorithms based on derivatives do poorly unless the regular expressions are simplified after -each derivatives step. +each derivatives step. But this is a bit more involved in +algorithm of Sulzmann \& Lu. Consider the last derivation +step in our running example + +\begin{center} +$r_4 = der\,c\,r_3$ +\end{center} + +\noindent +Simplifying the result would just give use $\epsilon$. +Running $mkeps$ on this regular expression would provide +us with $Empty$ instead of $Right(Right(Empty))$ that +was obtained without the simplification. The problem is +we need to recreate this value, rather than just $Empty$. + +This requires what I call \emph{rectification functions}. They +need to be calculated whenever a regular expression gets +simplified. Rectification functions take a value as argument +and return a (rectified) value. Our simplification rules so +far are + +\begin{center} +\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l} +$r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ +$\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ +$r \cdot \epsilon$ & $\mapsto$ & $r$\\ +$\epsilon \cdot r$ & $\mapsto$ & $r$\\ +$r + \varnothing$ & $\mapsto$ & $r$\\ +$\varnothing + r$ & $\mapsto$ & $r$\\ +$r + r$ & $\mapsto$ & $r$\\ +\end{tabular} +\end{center} + +\noindent Applying them to $r_4$ will require several nested +simplifications in order end up with just $\epsilon$. We can +implement this by letting simp return not just a (simplified) +regular expression, but also a rectification function. Let us +consider the alternative case, say $r_1 + r_2$, first. We +would first simplify the component regular expressions $r_1$ +and $r_2.$ This will return simplified versions (if they can +be simplified), say $r_{1s}$ and $r_{2s}$, but also two +rectification functions $f_{1s}$ and $f_{2s}$. We need to +assemble them in order to obtain a rectified value for +$r_1 + r_2$. + +In case $r_{1s}$ simplified to $\varnothing$, we would +continue the derivative calculation with $r_{2s}$. The +Sulzmann \& Lu algorithm would return a corresponding value, +say $v_{2s}$. + + +\subsubsection*{Records and Tokenisation} \newpage Algorithm by Sulzmann, Lexing