# HG changeset patch # User Christian Urban # Date 1430512213 -3600 # Node ID 94700593a2d511ba79f8f4092537418b5d48d85a # Parent 794c599cee537ab3ee23ebbee9b5c20f10a1f70a updated diff -r 794c599cee53 -r 94700593a2d5 handouts/ho04.pdf Binary file handouts/ho04.pdf has changed diff -r 794c599cee53 -r 94700593a2d5 handouts/ho04.tex --- a/handouts/ho04.tex Fri May 01 19:19:31 2015 +0100 +++ b/handouts/ho04.tex Fri May 01 21:30:13 2015 +0100 @@ -207,11 +207,14 @@ \end{center} \noindent The point is that from this value we can directly -read off which part of $r_4$ matched the empty string. Next we -have to ``inject'' the last character, that is $c$ in the -running example, into this value $v_4$ in order to calculate -how $r_3$ could have matched the string $c$. According to the -definition of $inj$ we obtain +read off which part of $r_4$ matched the empty string: take +the right-alternative first, and then the right-alternative +again. + +Next we have to ``inject'' the last character, that is $c$ in +the running example, into this value $v_4$ in order to +calculate how $r_3$ could have matched the string $c$. +According to the definition of $inj$ we obtain \begin{center} $v_3:\;Right(Seq(Empty, Char(c)))$ @@ -279,8 +282,8 @@ each derivative step. But this is a bit more involved in the algorithm of Sulzmann \& Lu. So what follows might require you to read several times before it makes sense and also might -require that you do some example calculations. As a first -example consider the last derivation step in our earlier +require that you do some example calculations yourself. As a +first example consider the last derivation step in our earlier example: \begin{center} @@ -293,7 +296,7 @@ input, however, would then provide us with $Empty$ instead of $Right(Right(Empty))$ that was obtained without the simplification. The problem is we need to recreate this more -complicated value, rather than just $Empty$. +complicated value, rather than just return $Empty$. This will require what I call \emph{rectification functions}. They need to be calculated whenever a regular expression gets @@ -319,7 +322,7 @@ manner in order to calculate this simplified regular expression. -The rectification we can implement this by letting simp return +The rectification we can implement by letting simp return not just a (simplified) regular expression, but also a rectification function. Let us consider the alternative case, $r_1 + r_2$, first. By going depth-first, we first simplify @@ -483,7 +486,7 @@ string and therefore the second phase of the algorithm would never been called. The simplification function still expects us to give a function. So in my own implementation I just -returned a function which raises an error. In the case +returned a function that raises an error. In the case where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have to create a sequence where the first component is a rectified version of $Empty$. Therefore we call $f_{1s}$ with $Empty$. @@ -553,72 +556,71 @@ \end{tabular} \end{center} -\noindent This corresponds to the $matches$ function we -have seen in earlier lectures. In the first clause we are -given an empty string, $[]$, and need to test wether the -regular expression is $nullable$. If yes we can proceed -normally and just return the value calculated by $mkeps$. -The second clause is for strings where the first character is -$c$ and the rest of the string is $s$. We first build the -derivative of $r$ with respect to $c$; simplify the resulting -regulare expression. We continue lexing with the simplified -regular expression and the string $s$. Whatever will be -returned as value, we sill rectify using the $f_{rect}$ -from the simplification and finally inject $c$ back into -the (rectified) value. +\noindent This corresponds to the $matches$ function we have +seen in earlier lectures. In the first clause we are given an +empty string, $[]$, and need to test wether the regular +expression is $nullable$. If yes, we can proceed normally and +just return the value calculated by $mkeps$. The second clause +is for strings where the first character is $c$, say, and the +rest of the string is $s$. We first build the derivative of +$r$ with respect to $c$; simplify the resulting regular +expression. We continue lexing with the simplified regular +expression and the string $s$. Whatever will be returned as +value, we sill need to rectify using the $f_{rect}$ from the +simplification and finally inject $c$ back into the +(rectified) value. \subsubsection*{Records} -Remember we want to tokenize input strings, that means -splitting strings into their ``word'' components. But -furthermore we want to classify each token as being a keyword -or identifier and so on. For this one more feature will be -required, which I call \emph{record}. While values record -precisely how a regular expression matches a string, -records can be used to focus on some particular -parts of the regular expression and forget about others. -Let us look at an example. +Remember we wanted to tokenize input strings, that means +splitting strings into their ``word'' components. Furthermore +we want to classify each token as being a keyword or +identifier and so on. For this one more feature will be +required, which I call a \emph{record} regular expression. +While values encode how a regular expression matches a string, +records can be used to ``focus'' on some particular parts of +the regular expression and ``forget'' about others. -Suppose you have the regular expression $ab + ac$. Clearly -this regular expression can only recognise two strings. But -suppose you are not interested whether it can recognise $ab$ -or $ac$, but rather if it matched, then what was the last -character of the matched string\ldots either $b$ or $c$. -You can do this by annotating the regular expression with -a record, written $(x:r)$, where $x$ is just an identifier -(in my implementation a plain string) and $r$ is a regular -expression. A record will be regarded as a regular expression. -The extended definition in Scala looks as follows: +Let us look at an example. Suppose you have the regular +expression $a\cdot b + a\cdot c$. Clearly this regular expression can only +recognise two strings. But suppose you are not interested +whether it can recognise $ab$ or $ac$, but rather if it +matched, then what was the last character of the matched +string\ldots either $b$ or $c$. You can do this by annotating +the regular expression with a record, written in general +$(x:r)$, where $x$ is just an identifier (in my implementation +a plain string) and $r$ is a regular expression. A record will +be regarded as a regular expression. The extended definition +in Scala therefore looks as follows: {\small\lstinputlisting[language=Scala] {../progs/app03.scala}} -\noindent Since we regard records as regular expressions -we need to extend the functions $nullable$ and $der$. -Similarly $mkeps$ and $inj$ need to be extended and they -sometimes can return a particular value for records. -This means we also need to extend the definition of values. -The extended definition in Scala looks as follows: +\noindent Since we regard records as regular expressions we +need to extend the functions $nullable$ and $der$. Similarly +$mkeps$ and $inj$ need to be extended. This means we also need +to extend the definition of values, which in Scala looks as +follows: {\small\lstinputlisting[language=Scala] {../progs/app04.scala}} \noindent Let us now look at the purpose of records more -closely and lets return to our question whether the string +closely and let us return to our question whether the string terminated in a $b$ or $c$. We can do this as follows: we annotate the regular expression $ab + ac$ with a record as follows \begin{center} -$a(x:b) + a(x:c)$ +$a\cdot (x:b) + a\cdot (x:c)$ \end{center} \noindent This regular expression can still only recognise the strings $ab$ and $ac$, but we can now use a function that takes a value and returns all records. I call this function \emph{env} for environment\ldots it builds a list -of identifiers associated with their string. This function +of identifiers associated with a string. This function can be defined as follows: \begin{center} @@ -634,21 +636,21 @@ \end{tabular} \end{center} -\noindent where in the last clause we use the flatten function -defined earlier. The function $env$ ``picks'' out all -underlying strings where a record is given. Since there can be -more than one, the environment will potentially contain -many ``recordings''. If we now postprocess the value -calculated by $lex$ extracting all recordings using $env$, -we can answer the question whether the last element in the -string was an $b$ or a $c$. Lets see this in action: if -we use $ab + ac$ and $ac$ the calculated value will be +\noindent where in the last clause we use the flatten function +defined earlier. As can be seen, the function $env$ ``picks'' +out all underlying strings where a record is given. Since +there can be more than one, the environment will potentially +contain many ``records''. If we now postprocess the value +calculated by $lex$ extracting all records using $env$, we can +answer the question whether the last element in the string was +an $b$ or a $c$. Lets see this in action: if we use $a\cdot b ++ a\cdot c$ and $ac$ the calculated value will be \begin{center} $Right(Seq(Char(a), Char(c)))$ \end{center} -\noindent If we use instead $a(x:b) + a(x:c)$ and +\noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and use the $env$ function to extract the recording for $x$ we obtain @@ -661,7 +663,7 @@ iterate this. Consider the regular expression \begin{center} -$(a(x:b) + a(y:c))^*$ +$(a\cdot (x:b) + a\cdot (y:c))^*$ \end{center} \noindent and the string $ababacabacab$. This string is