diff -r 434891622131 -r c4e7caa06c74 handouts/ho04.tex --- a/handouts/ho04.tex Thu Oct 08 22:44:16 2015 +0100 +++ b/handouts/ho04.tex Thu Oct 08 23:03:28 2015 +0100 @@ -15,24 +15,28 @@ Answering this question will also help us with the problem we are after, namely tokenising an input string. -The algorithm we will be looking at was designed by Sulzmann \& Lu in -a rather recent paper. A link to it is provided on KEATS, in case you -are interested.\footnote{In my humble opinion this is an interesting - instance of the research literature: it contains a very neat idea, - but its presentation is rather sloppy. In earlier versions of their - paper, students and I found several rather annoying typos in their - examples and definitions.} In order to give an answer for how a -regular expression matched a string, Sulzmann and Lu introduce -\emph{values}. A value will be the output of the algorithm whenever -the regular expression matches the string. If the string does not -match the string, an error will be raised. Since the first phase of -the algorithm by Sulzmann \& Lu is identical to the derivative based -matcher from the first coursework, the function $nullable$ will be -used to decide whether as string is matched by a regular -expression. If $nullable$ says yes, then values are constructed that -reflect how the regular expression matched the string. The definitions -for regular expressions $r$ and values $v$ is shown next to each other -below: +The algorithm we will be looking at was designed by Sulzmann +\& Lu in a rather recent paper (from 2014). A link to it is +provided on KEATS, in case you are interested.\footnote{In my +humble opinion this is an interesting instance of the research +literature: it contains a very neat idea, but its presentation +is rather sloppy. In earlier versions of their paper, a King's +student and I found several rather annoying typos in their +examples and definitions.} In order to give an answer for +\emph{how} a regular expression matches a string, Sulzmann and +Lu introduce \emph{values}. A value will be the output of the +algorithm whenever the regular expression matches the string. +If the string does not match the string, an error will be +raised. Since the first phase of the algorithm by Sulzmann \& +Lu is identical to the derivative based matcher from the first +coursework, the function $nullable$ will be used to decide +whether as string is matched by a regular expression. If +$nullable$ says yes, then values are constructed that reflect +how the regular expression matched the string. + +The definitions for values is given below. They are shown +together with the regular expressions $r$ to which +they correspond: \begin{center} \begin{tabular}{cc} @@ -60,7 +64,7 @@ \end{tabular} \end{center} -\noindent The point is that there is a very strong +\noindent The reason is that there is a very strong correspondence between them. There is no value for the $\varnothing$ regular expression, since it does not match any string. Otherwise there is exactly one value corresponding to @@ -69,18 +73,26 @@ corresponding to the two alternatives. Note that $r^*$ is associated with a list of values, one for each copy of $r$ that was needed to match the string. This means we might also -return the empty list $[]$, if no copy was needed. +return the empty list $[]$, if no copy was needed in case +of $r^*$. For sequence, there is exactly one value, composed +of two component values ($v_1$ and $v_2$). To emphasise the connection between regular expressions and values, I have in my implementation the convention that +regular expressions and values have the same name, except that regular expressions are written entirely with upper-case letters, while values just start with a single upper-case -character. My definition of values in Scala is below. I use -this in the REPL of Scala; when I use the Scala compiler -I need to rename some constructors, because Scala on Macs -does not like classes that are called \pcode{EMPTY} and +character and the rest are lower-case letters. My definition +of regular expressions and values in Scala is shown below. I use +this in the REPL of Scala; when I use the Scala compiler I +need to rename some constructors, because Scala on Macs does +not like classes that are called \pcode{EMPTY} and \pcode{Empty}. + {\small\lstinputlisting[language=Scala,numbers=none] +{../progs/app03.scala}} + + {\small\lstinputlisting[language=Scala,numbers=none] {../progs/app02.scala}} @@ -146,15 +158,15 @@ The second phase of the algorithm is organised so that it will calculate a value for how the derivative regular expression -has matched a string whose first character has been chopped -off. Now we need a function that reverses this ``chopping -off'' for values. The corresponding function is called $inj$ -for injection. This function takes three arguments: the first -one is a regular expression for which we want to calculate the -value, the second is the character we want to inject and the -third argument is the value where we will inject the -character. The result of this function is a new value. The -definition of $inj$ is as follows: +has matched a string. For this we need a function that +reverses this ``chopping off'' for values which we did in the +first phase for derivatives. The corresponding function is +called $inj$ for injection. This function takes three +arguments: the first one is a regular expression for which we +want to calculate the value, the second is the character we +want to inject and the third argument is the value where we +will inject the character into. The result of this function is a +new value. The definition of $inj$ is as follows: \begin{center} \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} @@ -171,10 +183,10 @@ \noindent This definition is by recursion on the regular expression and by analysing the shape of the values. Therefore there are, for example, three cases for sequence regular -expressions. The last clause for the star regular expression -returns a list where the first element is $inj\,r\,c\,v$ and -the other elements are $vs$. That means $\_\,::\,\_$ should be -read as list cons. +expressions (for all possible shapes of the value). The last +clause for the star regular expression returns a list where +the first element is $inj\,r\,c\,v$ and the other elements are +$vs$. That means $\_\,::\,\_$ should be read as list cons. To understand what is going on, it might be best to do some example calculations and compare them with Figure~\ref{Sulz}. @@ -194,22 +206,24 @@ \end{center} \noindent According to the simple algorithm, we would test -whether $r_4$ is nullable, which in this case it is. This -means we can use the function $mkeps$ to calculate a value for -how $r_4$ was able to match the empty string. Remember that -this function gives preference for alternatives on the -left-hand side. However there is only $\epsilon$ on the very -right-hand side of $r_4$ that matches the empty string. +whether $r_4$ is nullable, which in this case it indeed is. +This means we can use the function $mkeps$ to calculate a +value for how $r_4$ was able to match the empty string. +Remember that this function gives preference for alternatives +on the left-hand side. However there is only $\epsilon$ on the +very right-hand side of $r_4$ that matches the empty string. Therefore $mkeps$ returns the value \begin{center} $v_4:\;Right(Right(Empty))$ \end{center} -\noindent The point is that from this value we can directly -read off which part of $r_4$ matched the empty string: take -the right-alternative first, and then the right-alternative -again. +\noindent If there had been a $\epsilon$ on the left, then +$mkeps$ would have returned something of the form +$Left(\ldots)$. The point is that from this value we can +directly read off which part of $r_4$ matched the empty +string: take the right-alternative first, and then the +right-alternative again. Next we have to ``inject'' the last character, that is $c$ in the running example, into this value $v_4$ in order to @@ -257,9 +271,9 @@ \end{tabular} \end{center} -\noindent Using flatten we can see what is the string behind -the values calculated by $mkeps$ and $inj$ in our running -example are: +\noindent Using flatten we can see what are the strings behind +the values calculated by $mkeps$ and $inj$. In our running +example: \begin{center} \begin{tabular}{ll} @@ -271,8 +285,13 @@ \end{center} \noindent This indicates that $inj$ indeed is injecting, or -adding, back a character into the value. Next we look at how -simplification can be included into this algorithm. +adding, back a character into the value. If we continue until +all characters are injected back, we have a value that can +indeed say how the string $abc$ was matched. + +There is a problem, however, with the described algorithm +so far: it is very slow. We need to include the simplification +from Lecture 2. This is what we shall do next. \subsubsection*{Simplification}