afl-material: comparison handouts/ho04.tex


-:434891622131
+:c4e7caa06c74
 expression or not. Often a more interesting question is to
 find out \emph{how} a regular expression matched a string.
 Answering this question will also help us with the problem we
 are after, namely tokenising an input string.
-The algorithm we will be looking at was designed by Sulzmann \& Lu in
+The algorithm we will be looking at was designed by Sulzmann
-a rather recent paper. A link to it is provided on KEATS, in case you
+\& Lu in a rather recent paper (from 2014). A link to it is
-are interested.\footnote{In my humble opinion this is an interesting
+provided on KEATS, in case you are interested.\footnote{In my
-instance of the research literature: it contains a very neat idea,
+humble opinion this is an interesting instance of the research
-but its presentation is rather sloppy. In earlier versions of their
+literature: it contains a very neat idea, but its presentation
-paper, students and I found several rather annoying typos in their
+is rather sloppy. In earlier versions of their paper, a King's
-examples and definitions.}  In order to give an answer for how a
+student and I found several rather annoying typos in their
-regular expression matched a string, Sulzmann and Lu introduce
+examples and definitions.} In order to give an answer for
-\emph{values}. A value will be the output of the algorithm whenever
+\emph{how} a regular expression matches a string, Sulzmann and
-the regular expression matches the string. If the string does not
+Lu introduce \emph{values}. A value will be the output of the
-match the string, an error will be raised. Since the first phase of
+algorithm whenever the regular expression matches the string.
-the algorithm by Sulzmann \& Lu is identical to the derivative based
+If the regular expression does not match the string, an error will be
-matcher from the first coursework, the function $nullable$ will be
+raised. Since the first phase of the algorithm by Sulzmann \&
-used to decide whether as string is matched by a regular
+Lu is identical to the derivative-based matcher from the first
-expression. If $nullable$ says yes, then values are constructed that
+coursework, the function $nullable$ will be used to decide
-reflect how the regular expression matched the string. The definitions
+whether a string is matched by a regular expression. If
-for regular expressions $r$ and values $v$ is shown next to each other
+$nullable$ says yes, then values are constructed that reflect
-below:
+how the regular expression matched the string.
+The definition of values is given below. Values are shown
+together with the regular expressions $r$ to which
+they correspond:
 \begin{center}
 \begin{tabular}{cc}
 \begin{tabular}{@{}rrl@{}}
 \multicolumn{3}{c}{regular expressions}\medskip\\
 & $\mid$ & $[v_1,\ldots,v_n]$ \\
 \end{tabular}
 \end{tabular}
 \end{center}
-\noindent The point is that there is a very strong
+\noindent The reason is that there is a very strong
 correspondence between them. There is no value for the
 $\varnothing$ regular expression, since it does not match any
 string. Otherwise there is exactly one value corresponding to
 each regular expression with the exception of $r_1 + r_2$
 where there are two values, namely $Left(v)$ and $Right(v)$
 corresponding to the two alternatives. Note that $r^*$ is
 associated with a list of values, one for each copy of $r$
 that was needed to match the string. This means we might also
-return the empty list $[]$, if no copy was needed.
+return the empty list $[]$ for $r^*$, if no copy was needed.
+For sequences, there is exactly one value, composed
+of two component values ($v_1$ and $v_2$).
 To emphasise the connection between regular expressions and
 values, I have in my implementation the convention that
+regular expressions and values have the same name, except that
 regular expressions are written entirely with upper-case
 letters, while values just start with a single upper-case
-character. My definition of values in Scala is below. I use
+character and the rest are lower-case letters. My definition
-this in the REPL of Scala; when I use the Scala compiler
+of regular expressions and values in Scala is shown below. I use
-I need to rename some constructors, because Scala on Macs
+this in the REPL of Scala; when I use the Scala compiler I
-does not like classes that are called \pcode{EMPTY} and
+need to rename some constructors, because Scala on Macs does
+not like classes that are called \pcode{EMPTY} and
 \pcode{Empty}.
+{\small\lstinputlisting[language=Scala,numbers=none]
+{../progs/app03.scala}}
 {\small\lstinputlisting[language=Scala,numbers=none]
 {../progs/app02.scala}}
 regular expression on the left-hand side. This will become
 important later on.
 The second phase of the algorithm is organised so that it will
 calculate a value for how the derivative regular expression
-has matched a string whose first character has been chopped
+has matched a string. For this we need a function that
-off. Now we need a function that reverses this ``chopping
+reverses, for values, the ``chopping off'' of characters we did in the
-off'' for values. The corresponding function is called $inj$
+first phase for derivatives. The corresponding function is
-for injection. This function takes three arguments: the first
+called $inj$ for injection. This function takes three
-one is a regular expression for which we want to calculate the
+arguments: the first one is a regular expression for which we
-value, the second is the character we want to inject and the
+want to calculate the value, the second is the character we
-third argument is the value where we will inject the
+want to inject and the third argument is the value where we
-character. The result of this function is a new value. The
+will inject the character into. The result of this function is a
-definition of $inj$ is as follows:
+new value. The definition of $inj$ is as follows:
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
 $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
 \end{center}
 \noindent This definition is by recursion on the regular
 expression and by analysing the shape of the values. Therefore
 there are, for example, three cases for sequence regular
-expressions. The last clause for the star regular expression
+expressions (for all possible shapes of the value). The last
-returns a list where the first element is $inj\,r\,c\,v$ and
+clause for the star regular expression returns a list where
-the other elements are $vs$. That means $\_\,::\,\_$ should be
+the first element is $inj\,r\,c\,v$ and the other elements are
-read as list cons.
+$vs$. That means $\_\,::\,\_$ should be read as list cons.
 To understand what is going on, it might be best to do some
 example calculations and compare them with Figure~\ref{Sulz}.
 For this note that we have not yet dealt with the need for
 simplifying regular expressions (this will be a topic on its
 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
 \end{tabular}
 \end{center}
 \noindent According to the simple algorithm, we would test
-whether $r_4$ is nullable, which in this case it is. This
+whether $r_4$ is nullable, which in this case it indeed is.
-means we can use the function $mkeps$ to calculate a value for
+This means we can use the function $mkeps$ to calculate a
-how $r_4$ was able to match the empty string. Remember that
+value for how $r_4$ was able to match the empty string.
-this function gives preference for alternatives on the
+Remember that this function gives preference for alternatives
-left-hand side. However there is only $\epsilon$ on the very
+on the left-hand side. However there is only $\epsilon$ on the
-right-hand side of $r_4$ that matches the empty string.
+very right-hand side of $r_4$ that matches the empty string.
 Therefore $mkeps$ returns the value
 \begin{center}
 $v_4:\;Right(Right(Empty))$
 \end{center}
-\noindent The point is that from this value we can directly
+\noindent If there had been an $\epsilon$ on the left, then
-read off which part of $r_4$ matched the empty string: take
+$mkeps$ would have returned something of the form
-the right-alternative first, and then the right-alternative
+$Left(\ldots)$. The point is that from this value we can
-again.
+directly read off which part of $r_4$ matched the empty
+string: take the right-alternative first, and then the
+right-alternative again.
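\noindent As a minimal sketch of this step, the following Scala code builds $r_4$ and computes a value for it with $mkeps$. The constructor names (\pcode{ZERO}, \pcode{ONE}, \pcode{Chr}, \pcode{Sequ}, \pcode{Stars}) are assumptions made for this illustration; they were picked so that no two names differ only in case, and they need not agree with the listings included from \pcode{app02.scala} and \pcode{app03.scala}. The bodies of $nullable$ and $mkeps$ are written from the description in the text, not copied from the handout.

```scala
// Regular expressions (names are illustrative assumptions)
abstract class Rexp
case object ZERO extends Rexp                      // matches no string
case object ONE extends Rexp                       // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// Values: there is no value corresponding to ZERO
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// Can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// Value for how a nullable r matches the empty string;
// the left alternative is preferred whenever it is nullable.
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
  case _ => throw new IllegalArgumentException("mkeps: not nullable")
}

// r_4 from the running example
val r4: Rexp =
  ALT(SEQ(ZERO, SEQ(CHAR('b'), CHAR('c'))),
      ALT(SEQ(ZERO, CHAR('c')), ONE))
```

\noindent Evaluating \pcode{mkeps(r4)} in the REPL should give \pcode{Right(Right(Empty))}, in agreement with $v_4$, while \pcode{mkeps(ALT(ONE, ONE))} gives \pcode{Left(Empty)}, showing the preference for the left-hand side.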
 Next we have to ``inject'' the last character, that is $c$ in
 the running example, into this value $v_4$ in order to
 calculate how $r_3$ could have matched the string $c$.
 According to the definition of $inj$ we obtain
 $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
 $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
 \end{tabular}
 \end{center}
-\noindent Using flatten we can see what is the string behind
+\noindent Using flatten we can see which strings are behind
-the values calculated by $mkeps$ and $inj$ in our running
+the values calculated by $mkeps$ and $inj$. In our running
-example are:
+example:
 \begin{center}
 \begin{tabular}{ll}
 $|v_4|$: & $[]$\\
 $|v_3|$: & $c$\\
 $|v_1|$: & $abc$
 \end{tabular}
 \end{center}
 \noindent This indicates that $inj$ indeed is injecting, or
-adding, back a character into the value. Next we look at how
+adding, back a character into the value. If we continue until
-simplification can be included into this algorithm.
+all characters are injected back, we have a value that can
+indeed say how the string $abc$ was matched.
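\noindent The two phases can be sketched together in Scala as follows. This is an illustration under several assumptions, not the code used in the lecture: the constructor names and the helper \pcode{lex} are made up for this example; the starting regular expression is taken to be $r_1 = a \cdot (b \cdot c)$, which is consistent with the $r_4$ and $|v_1|$ shown above; and the clauses of $inj$ that are not visible in the table above follow my understanding of the formulation by Sulzmann \& Lu.

```scala
abstract class Rexp
case object ZERO extends Rexp                      // matches no string
case object ONE extends Rexp                       // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// First phase: derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
  case _ => throw new IllegalArgumentException("mkeps: not nullable")
}

// Second phase: turn a value for der(c, r) into a value for r
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty) => Chr(c)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
  case _ => throw new IllegalArgumentException("inj: unexpected shape")
}

// The string underlying a value
def flatten(v: Val): String = v match {
  case Empty => ""
  case Chr(c) => c.toString
  case Left(v1) => flatten(v1)
  case Right(v1) => flatten(v1)
  case Sequ(v1, v2) => flatten(v1) + flatten(v2)
  case Stars(vs) => vs.map(flatten).mkString
}

// Build derivatives on the way down, inject characters on the way back
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil =>
    if (nullable(r)) mkeps(r) else throw new Exception("not matched")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}

val r1: Rexp = SEQ(CHAR('a'), SEQ(CHAR('b'), CHAR('c')))   // assumed r_1
val v1: Val = lex(r1, "abc".toList)
```

\noindent With these definitions \pcode{v1} should be \pcode{Sequ(Chr(a),Sequ(Chr(b),Chr(c)))} and \pcode{flatten(v1)} the string $abc$. No simplification is performed here, so the derivatives grow with every character.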
+There is a problem, however, with the algorithm described
+so far: it is very slow. We need to include the simplification
+from Lecture 2. This is what we shall do next.
 \subsubsection*{Simplification}
 Generally the matching algorithms based on derivatives do

changeset 350	c4e7caa06c74
parent 326	94700593a2d5
child 352	1e1b0fe66107