afl-material: comparison handouts/ho04.tex


-:794c599cee53
+:94700593a2d5
 \begin{center}
 $v_4:\;Right(Right(Empty))$
 \end{center}
 \noindent The point is that from this value we can directly
-read off which part of $r_4$ matched the empty string. Next we
+read off which part of $r_4$ matched the empty string: take
-have to ``inject'' the last character, that is $c$ in the
+the right-alternative first, and then the right-alternative
-running example, into this value $v_4$ in order to calculate
+again.
-how $r_3$ could have matched the string $c$. According to the
-definition of $inj$ we obtain
+Next we have to ``inject'' the last character, that is $c$ in
+the running example, into this value $v_4$ in order to
+calculate how $r_3$ could have matched the string $c$.
+According to the definition of $inj$ we obtain
 \begin{center}
 $v_3:\;Right(Seq(Empty, Char(c)))$
 \end{center}
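 \noindent The value $v_4$ from above can also be recomputed mechanically. What follows is only a minimal Scala sketch under stated assumptions, not the code used in the lectures: the constructor names (\texttt{ZERO}, \texttt{ONE}, \texttt{Sequ} and so on) and the object name \texttt{MkepsDemo} are made up for this illustration, and $mkeps$ is assumed to prefer the left alternative whenever that alternative is nullable.
 \begin{lstlisting}[language=Scala]
 object MkepsDemo {
   // regular expressions (names are assumptions of this sketch)
   sealed trait Rexp
   case object ZERO extends Rexp                    // matches nothing
   case object ONE extends Rexp                     // matches only the empty string
   case class CHAR(c: Char) extends Rexp
   case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
   case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
   // values recording how a regular expression matched
   sealed trait Val
   case object Empty extends Val
   case class Chr(c: Char) extends Val
   case class Sequ(v1: Val, v2: Val) extends Val
   case class Left(v: Val) extends Val
   case class Right(v: Val) extends Val
   // can r match the empty string?
   def nullable(r: Rexp): Boolean = r match {
     case ZERO | CHAR(_) => false
     case ONE => true
     case ALT(r1, r2) => nullable(r1) || nullable(r2)
     case SEQ(r1, r2) => nullable(r1) && nullable(r2)
   }
   // value for how a nullable r matches the empty string
   def mkeps(r: Rexp): Val = r match {
     case ONE => Empty
     case ALT(r1, r2) =>
       if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
     case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
     case _ => throw new IllegalArgumentException("mkeps: not nullable")
   }
   // r4 = (0 . (b . c)) + ((0 . c) + 1)
   val r4: Rexp =
     ALT(SEQ(ZERO, SEQ(CHAR('b'), CHAR('c'))), ALT(SEQ(ZERO, CHAR('c')), ONE))
   val v4: Val = mkeps(r4)
 }
 \end{lstlisting}
 \noindent Neither of the two left alternatives in $r_4$ is nullable, since both contain $\varnothing$ in a sequence, so in this sketch \texttt{v4} comes out as \texttt{Right(Right(Empty))}.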
 Generally the matching algorithms based on derivatives do
 poorly unless the regular expressions are simplified after
 each derivative step. But this is a bit more involved in the
 algorithm of Sulzmann \& Lu. So what follows might require you
 to read it several times before it makes sense and also might
-require that you do some example calculations. As a first
+require that you do some example calculations yourself. As a
-example consider the last derivation step in our earlier
+first example consider the last derivation step in our earlier
 example:
 \begin{center}
 $r_4 = der\,c\,r_3 =
 (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
 \noindent Simplifying this regular expression would just give
 us $\epsilon$. Running $mkeps$ with this regular expression as
 input, however, would then provide us with $Empty$ instead of
 $Right(Right(Empty))$ that was obtained without the
 simplification. The problem is we need to recreate this more
-complicated value, rather than just $Empty$.
+complicated value, rather than just return $Empty$.
 This will require what I call \emph{rectification functions}.
 They need to be calculated whenever a regular expression gets
 simplified. Rectification functions take a value as argument
 and return a (rectified) value. Let us first take a look again
 simplifications in order to end up with just $\epsilon$. However,
 it is possible to apply them in a depth-first, or inside-out,
 manner in order to calculate this simplified regular
 expression.
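 \noindent To make the depth-first strategy concrete, here is a small Scala sketch of a simplification function that does not yet produce rectification functions. The particular rule set (dropping $\varnothing$ and $\epsilon$ in sequences, dropping $\varnothing$ and duplicates in alternatives) and all names are assumptions made for this illustration.
 \begin{lstlisting}[language=Scala]
 object SimpDemo {
   sealed trait Rexp
   case object ZERO extends Rexp                    // matches nothing
   case object ONE extends Rexp                     // matches only the empty string
   case class CHAR(c: Char) extends Rexp
   case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
   case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
   // simplify the components first, then the node itself
   def simp(r: Rexp): Rexp = r match {
     case ALT(r1, r2) => (simp(r1), simp(r2)) match {
       case (ZERO, r2s) => r2s
       case (r1s, ZERO) => r1s
       case (r1s, r2s) => if (r1s == r2s) r1s else ALT(r1s, r2s)
     }
     case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
       case (ZERO, _) => ZERO
       case (_, ZERO) => ZERO
       case (ONE, r2s) => r2s
       case (r1s, ONE) => r1s
       case (r1s, r2s) => SEQ(r1s, r2s)
     }
     case other => other
   }
   // r4 = (0 . (b . c)) + ((0 . c) + 1)
   val r4: Rexp =
     ALT(SEQ(ZERO, SEQ(CHAR('b'), CHAR('c'))), ALT(SEQ(ZERO, CHAR('c')), ONE))
   val r4s: Rexp = simp(r4)
 }
 \end{lstlisting}
 \noindent With these rules \texttt{r4s} is \texttt{ONE}, that is $\epsilon$. The information that the match took the right alternative twice is gone, which is exactly what a rectification function has to restore.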
-The rectification we can implement this by letting simp return
+We can implement the rectification by letting $simp$ return
 not just a (simplified) regular expression, but also a
 rectification function. Let us consider the alternative case,
 $r_1 + r_2$, first. By going depth-first, we first simplify
 the component regular expressions $r_1$ and $r_2$. This will
 return simplified versions (if they can be simplified), say
 that this function will actually never be called. This is
 because a sequence with $\varnothing$ will never recognise any
 string and therefore the second phase of the algorithm would
 never be called. The simplification function still expects
 us to give a function. So in my own implementation I just
-returned a function which raises an error. In the case
+returned a function that raises an error. In the case
 where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have
 to create a sequence where the first component is a rectified
 version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
 Since we only simplify regular expressions of the form $r_1 +
 $(r_{simp}, f_{rect}) = simp(der(c, r))$\\
 & & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$
 \end{tabular}
 \end{center}
-\noindent This corresponds to the $matches$ function we
+\noindent This corresponds to the $matches$ function we have
-have seen in earlier lectures. In the first clause we are
+seen in earlier lectures. In the first clause we are given an
-given an empty string, $[]$, and need to test wether the
+empty string, $[]$, and need to test whether the regular
-regular expression is $nullable$. If yes we can proceed
+expression is $nullable$. If yes, we can proceed normally and
-normally and just return the value calculated by $mkeps$.
+just return the value calculated by $mkeps$. The second clause
-The second clause is for strings where the first character is
+is for strings where the first character is $c$, say, and the
-$c$ and the rest of the string is $s$. We first build the
+rest of the string is $s$. We first build the derivative of
-derivative of $r$ with respect to $c$; simplify the resulting
+$r$ with respect to $c$; simplify the resulting regular
-regulare expression. We continue lexing with the simplified
+expression. We continue lexing with the simplified regular
-regular expression and the string $s$. Whatever will be
+expression and the string $s$. Whatever will be returned as
-returned as value, we sill rectify using the $f_{rect}$
+value, we still need to rectify using the $f_{rect}$ from the
-from the simplification and finally inject $c$ back into
+simplification and finally inject $c$ back into the
-the (rectified) value.
+(rectified) value.
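 \noindent The two clauses of $lex$, together with a $simp$ that returns a rectification function, can be sketched in Scala as follows. This is a hedged illustration rather than the implementation from the lectures: stars and records are left out, and all names and the exact simplification rules are assumptions of the sketch.
 \begin{lstlisting}[language=Scala]
 object LexDemo {
   sealed trait Rexp
   case object ZERO extends Rexp
   case object ONE extends Rexp
   case class CHAR(c: Char) extends Rexp
   case class ALT(r1: Rexp, r2: Rexp) extends Rexp
   case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
   sealed trait Val
   case object Empty extends Val
   case class Chr(c: Char) extends Val
   case class Sequ(v1: Val, v2: Val) extends Val
   case class Left(v: Val) extends Val
   case class Right(v: Val) extends Val
   def nullable(r: Rexp): Boolean = r match {
     case ZERO | CHAR(_) => false
     case ONE => true
     case ALT(r1, r2) => nullable(r1) || nullable(r2)
     case SEQ(r1, r2) => nullable(r1) && nullable(r2)
   }
   def der(c: Char, r: Rexp): Rexp = r match {
     case ZERO | ONE => ZERO
     case CHAR(d) => if (c == d) ONE else ZERO
     case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
     case SEQ(r1, r2) =>
       if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
       else SEQ(der(c, r1), r2)
   }
   def mkeps(r: Rexp): Val = r match {
     case ONE => Empty
     case ALT(r1, r2) =>
       if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
     case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
     case _ => sys.error("mkeps: not nullable")
   }
   // turn a value for der(c, r) into a value for r
   def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
     case (CHAR(_), Empty) => Chr(c)
     case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
     case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
     case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
     case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
     case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
     case _ => sys.error("inj: value does not fit")
   }
   // simplified regular expression plus its rectification function
   def simp(r: Rexp): (Rexp, Val => Val) = r match {
     case ALT(r1, r2) =>
       val (r1s, f1s) = simp(r1)
       val (r2s, f2s) = simp(r2)
       (r1s, r2s) match {
         case (ZERO, _) => (r2s, (v: Val) => Right(f2s(v)))
         case (_, ZERO) => (r1s, (v: Val) => Left(f1s(v)))
         case _ if r1s == r2s => (r1s, (v: Val) => Left(f1s(v)))
         case _ => (ALT(r1s, r2s), (v: Val) => v match {
           case Left(v1) => Left(f1s(v1))
           case Right(v2) => Right(f2s(v2))
           case _ => sys.error("rectification: unexpected value")
         })
       }
     case SEQ(r1, r2) =>
       val (r1s, f1s) = simp(r1)
       val (r2s, f2s) = simp(r2)
       (r1s, r2s) match {
         // nothing matches ZERO, so this function is never called
         case (ZERO, _) | (_, ZERO) =>
           (ZERO, (v: Val) => sys.error("rectification: unreachable"))
         case (ONE, _) => (r2s, (v: Val) => Sequ(f1s(Empty), f2s(v)))
         case (_, ONE) => (r1s, (v: Val) => Sequ(f1s(v), f2s(Empty)))
         case _ => (SEQ(r1s, r2s), (v: Val) => v match {
           case Sequ(v1, v2) => Sequ(f1s(v1), f2s(v2))
           case _ => sys.error("rectification: unexpected value")
         })
       }
     case _ => (r, (v: Val) => v)
   }
   def lex(r: Rexp, s: List[Char]): Val = s match {
     case Nil => if (nullable(r)) mkeps(r) else sys.error("lexing error")
     case c :: cs =>
       val (rSimp, fRect) = simp(der(c, r))
       inj(r, c, fRect(lex(rSimp, cs)))
   }
   // a . b + a . c
   val r: Rexp = ALT(SEQ(CHAR('a'), CHAR('b')), SEQ(CHAR('a'), CHAR('c')))
   val v: Val = lex(r, "ac".toList)
 }
 \end{lstlisting}
 \noindent In this sketch, lexing the string $ac$ with $a\cdot b + a\cdot c$ yields \texttt{Right(Sequ(Chr('a'), Chr('c')))}, even though the intermediate derivatives are simplified along the way. A string that cannot be matched, such as $ad$, ends in the error branch of the first clause.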
 \subsubsection*{Records}
-Remember we want to tokenize input strings, that means
+Remember we wanted to tokenize input strings, that means
-splitting strings into their ``word'' components. But
+splitting strings into their ``word'' components. Furthermore
-furthermore we want to classify each token as being a keyword
+we want to classify each token as being a keyword or
-or identifier and so on. For this one more feature will be
+identifier and so on. For this one more feature will be
-required, which I call \emph{record}. While values record
+required, which I call a \emph{record} regular expression.
-precisely how a regular expression matches a string,
+While values encode how a regular expression matches a string,
-records can be used to focus on some particular
+records can be used to ``focus'' on some particular parts of
-parts of the regular expression and forget about others.
+the regular expression and ``forget'' about others.
-Let us look at an example.
+Let us look at an example. Suppose you have the regular
-Suppose you have the regular expression $ab + ac$. Clearly
+expression $a\cdot b + a\cdot c$. Clearly this regular expression can only
-this regular expression can only recognise two strings. But
+recognise two strings. But suppose you are not interested
-suppose you are not interested whether it can recognise $ab$
+whether it can recognise $ab$ or $ac$, but rather if it
-or $ac$, but rather if it matched, then what was the last
+matched, then what was the last character of the matched
-character of the matched string\ldots either $b$ or $c$.
+string\ldots either $b$ or $c$. You can do this by annotating
-You can do this by annotating the regular expression with
+the regular expression with a record, written in general
-a record, written $(x:r)$, where $x$ is just an identifier
+$(x:r)$, where $x$ is just an identifier (in my implementation
-(in my implementation a plain string) and $r$ is a regular
+a plain string) and $r$ is a regular expression. A record will
-expression. A record will be regarded as a regular expression.
+be regarded as a regular expression. The extended definition
-The extended definition in Scala looks as follows:
+in Scala therefore looks as follows:
 {\small\lstinputlisting[language=Scala]
 {../progs/app03.scala}}
-\noindent Since we regard records as regular expressions
+\noindent Since we regard records as regular expressions we
-we need to extend the functions $nullable$ and $der$.
+need to extend the functions $nullable$ and $der$. Similarly
-Similarly $mkeps$ and $inj$ need to be extended and they
+$mkeps$ and $inj$ need to be extended. This means we also need
-sometimes can return a particular value for records.
+to extend the definition of values, which in Scala looks as
-This means we also need to extend the definition of values.
+follows:
-The extended definition in Scala looks as follows:
 {\small\lstinputlisting[language=Scala]
 {../progs/app04.scala}}
 \noindent Let us now look at the purpose of records more
-closely and lets return to our question whether the string
+closely and let us return to our question whether the string
 terminated in a $b$ or $c$. We can do this as follows: we
 annotate the regular expression $ab + ac$ with a record
 as follows
 \begin{center}
-$a(x:b) + a(x:c)$
+$a\cdot (x:b) + a\cdot (x:c)$
 \end{center}
 \noindent This regular expression can still only recognise
 the strings $ab$ and $ac$, but we can now use a function
 that takes a value and returns all records. I call this
 function \emph{env} for environment\ldots it builds a list
-of identifiers associated with their string. This function
+of identifiers associated with a string. This function
 can be defined as follows:
 \begin{center}
 \begin{tabular}{lcl}
 $env(Empty)$     & $\dn$ & $[]$\\
 $env(v_1) \,@\ldots @\, env(v_n)$\\
 $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
 \end{tabular}
 \end{center}
 \noindent where in the last clause we use the flatten function
-defined earlier. The function $env$ ``picks'' out all
+defined earlier. As can be seen, the function $env$ ``picks''
-underlying strings where a record is given. Since there can be
+out all underlying strings where a record is given. Since
-more than one, the environment will potentially contain
+there can be more than one, the environment will potentially
-many ``recordings''. If we now postprocess the value
+contain many ``records''. If we now postprocess the value
-calculated by $lex$ extracting all recordings using $env$,
+calculated by $lex$ extracting all records using $env$, we can
-we can answer the question whether the last element in the
+answer the question whether the last element in the string was
-string was an $b$ or a $c$. Lets see this in action: if
+a $b$ or a $c$. Let us see this in action: if we use $a\cdot b
-we use $ab + ac$ and $ac$ the calculated value will be
++ a\cdot c$ and the string $ac$, the calculated value will be
 \begin{center}
 $Right(Seq(Char(a), Char(c)))$
 \end{center}
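 \noindent The $env$ function defined above can be sketched in Scala along the following lines. The names \texttt{Sequ}, \texttt{Chr} and \texttt{Stars}, as well as the shape of the value with a \texttt{Rec} node, are assumptions of this illustration; the actual definitions are the ones in the listings included above.
 \begin{lstlisting}[language=Scala]
 object EnvDemo {
   sealed trait Val
   case object Empty extends Val
   case class Chr(c: Char) extends Val
   case class Sequ(v1: Val, v2: Val) extends Val
   case class Left(v: Val) extends Val
   case class Right(v: Val) extends Val
   case class Stars(vs: List[Val]) extends Val
   case class Rec(x: String, v: Val) extends Val
   // the string underlying a value, written |v| in the text
   def flatten(v: Val): String = v match {
     case Empty => ""
     case Chr(c) => c.toString
     case Left(v1) => flatten(v1)
     case Right(v1) => flatten(v1)
     case Sequ(v1, v2) => flatten(v1) + flatten(v2)
     case Stars(vs) => vs.map(flatten).mkString
     case Rec(_, v1) => flatten(v1)
   }
   // collect (identifier, matched string) pairs for every record
   def env(v: Val): List[(String, String)] = v match {
     case Empty => Nil
     case Chr(_) => Nil
     case Left(v1) => env(v1)
     case Right(v1) => env(v1)
     case Sequ(v1, v2) => env(v1) ::: env(v2)
     case Stars(vs) => vs.flatMap(env)
     case Rec(x, v1) => (x, flatten(v1)) :: env(v1)
   }
   // the value from above, without any record
   val plain: Val = Right(Sequ(Chr('a'), Chr('c')))
   // assumed value when the second character sits inside a record x
   val recorded: Val = Right(Sequ(Chr('a'), Rec("x", Chr('c'))))
 }
 \end{lstlisting}
 \noindent Here \texttt{env(plain)} is the empty list, because the value contains no record, while \texttt{env(recorded)} is \texttt{List(("x", "c"))}.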
-\noindent If we use instead $a(x:b) + a(x:c)$ and
+\noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and
 use the $env$ function to extract the record for
 $x$ we obtain
 \begin{center}
 $[(x:c)]$
 \noindent If we had given the string $ab$ instead, then the
 record would have been $[(x:b)]$. The fun starts if we
 iterate this. Consider the regular expression
 \begin{center}
-$(a(x:b) + a(y:c))^*$
+$(a\cdot (x:b) + a\cdot (y:c))^*$
 \end{center}
 \noindent and the string $ababacabacab$. This string is
 clearly matched by the regular expression, but we are only
 interested in the sequence of $b$s and $c$s. Using $env$

changeset 326	94700593a2d5
parent 319	e7b110f93697
child 350	c4e7caa06c74