handouts/ho04.tex
changeset 422 5deefcc8cffa
parent 417 e74c696821a2
child 444 3056a4c071b0
equal deleted inserted replaced
417:e74c696821a2 422:5deefcc8cffa
     8 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
     8 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
     9 
     9 
    10 \section*{Handout 4 (Sulzmann \& Lu Algorithm)}
    10 \section*{Handout 4 (Sulzmann \& Lu Algorithm)}
    11 
    11 
    12 So far our algorithm based on derivatives was only able to say
    12 So far our algorithm based on derivatives was only able to say
    13 yes or no depending on whether a string was matched by regular
    13 yes or no depending on whether a string was matched by a regular
    14 expression or not. Often a more interesting question is to
    14 expression or not. Often a more interesting question is to
    15 find out \emph{how} a regular expression matched a string?
    15 find out \emph{how} a regular expression matched a string?
    16 Answering this question will also help us with the problem we
    16 Answering this question will also help us with the problem we
    17 are after, namely tokenising an input string. 
    17 are after, namely tokenising an input string. 
    18 
    18 
    19 The algorithm we will be looking at was designed by Sulzmann
    19 The algorithm we will be looking at in this lecture was designed by Sulzmann
    20 \& Lu in a rather recent paper (from 2014). A link to it is
    20 \& Lu in a rather recent research paper (from 2014). A link to it is
    21 provided on KEATS, in case you are interested.\footnote{In my
    21 provided on KEATS, in case you are interested.\footnote{In my
    22 humble opinion this is an interesting instance of the research
    22 humble opinion this is an interesting instance of the research
    23 literature: it contains a very neat idea, but its presentation
    23 literature: it contains a very neat idea, but its presentation
    24 is rather sloppy. In earlier versions of their paper, a King's
    24 is rather sloppy. In earlier versions of this paper, a King's
    25 student and I found several rather annoying typos in their
    25 undergraduate student and I found several rather annoying typos in the
    26 examples and definitions.} In order to give an answer for
    26 examples and definitions.} My PhD student Fahad Ausaf and I even more recently 
       
    27 wrote a paper where we build on their result: we provided a 
       
    28 mathematical proof that their algorithm is really correct---the proof 
       
    29 Sulzmann \& Lu had originally given contained major flaws. Such correctness
       
    30 proofs are important: Kuklewicz maintains a unit-test library
       
    31 for the kind of algorithma we are interested in here and he showed 
       
    32 that many implementations in the ``wild'' are buggy, that is not
       
    33 satisfy his unit tests:
       
    34 
       
    35 \begin{center}
       
    36 \url{http://www.haskell.org/haskellwiki/Regex_Posix}
       
    37 \end{center}
       
    38 
       
    39 
       
    40 In order to give an answer for
    27 \emph{how} a regular expression matches a string, Sulzmann and
    41 \emph{how} a regular expression matches a string, Sulzmann and
    28 Lu use \emph{values}. A value will be the output of the
    42 Lu use \emph{values}. A value will be the output of the
    29 algorithm whenever the regular expression matches the string.
    43 algorithm whenever the regular expression matches the string.
    30 If the string does not match the string, an error will be
    44 If the string does not match the string, an error will be
    31 raised. 
    45 raised. 
    52       &        & $Empty$   \\
    66       &        & $Empty$   \\
    53       & $\mid$ & $Char(c)$          \\
    67       & $\mid$ & $Char(c)$          \\
    54       & $\mid$ & $Seq(v_1,v_2)$\\
    68       & $\mid$ & $Seq(v_1,v_2)$\\
    55       & $\mid$ & $\Left(v)$   \\
    69       & $\mid$ & $\Left(v)$   \\
    56       & $\mid$ & $Right(v)$  \\
    70       & $\mid$ & $Right(v)$  \\
    57       & $\mid$ & $Stars [v_1,\ldots\,v_n]$ \\
    71       & $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\
    58 \end{tabular}
    72 \end{tabular}
    59 \end{tabular}
    73 \end{tabular}
    60 \end{center}
    74 \end{center}
    61 
    75 
    62 \noindent There is no value for the
    76 \noindent There is no value for the
    69 that was needed to match the string. This means we might also
    83 that was needed to match the string. This means we might also
    70 return the empty list $Stars []$, if no copy was needed in case
    84 return the empty list $Stars []$, if no copy was needed in case
    71 of $r^*$. For sequence, there is exactly one value, composed 
    85 of $r^*$. For sequence, there is exactly one value, composed 
    72 of two component values ($v_1$ and $v_2$).
    86 of two component values ($v_1$ and $v_2$).
    73 
    87 
    74 My definition of regular expressions and values in Scala is
    88 My implementation of regular expressions and values in Scala is
    75 shown below. I have in my implementation the convention that
    89 shown below. I have in my implementation the convention that
    76 regular expressions are written entirely with upper-case
    90 regular expressions are written entirely with upper-case
    77 letters, while values just start with a single upper-case
    91 letters, while values just start with a single upper-case
    78 character and the rest are lower-case letters.
    92 character and the rest are lower-case letters.
    79  
    93  
    85 {../progs/app02.scala}}
    99 {../progs/app02.scala}}
    86 
   100 
    87 
   101 
    88 Graphically the algorithm by Sulzmann \& Lu can be illustrated
   102 Graphically the algorithm by Sulzmann \& Lu can be illustrated
    89 by the picture in Figure~\ref{Sulz} where the path from the
   103 by the picture in Figure~\ref{Sulz} where the path from the
    90 left to the right involving $der/nullable$ is the first phase
   104 left to the right involving $\textit{der}/\textit{nullable}$ is the first phase
    91 of the algorithm and $mkeps/inj$, the path from right to left,
   105 of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left,
    92 the second phase. This picture shows the steps required when a
   106 the second phase. This picture shows the steps required when a
    93 regular expression, say $r_1$, matches the string $abc$. We
   107 regular expression, say $r_1$, matches the string $abc$. We
    94 first build the three derivatives (according to $a$, $b$ and
   108 first build the three derivatives (according to $a$, $b$ and
    95 $c$). We then use $nullable$ to find out whether the resulting
   109 $c$). We then use $\textit{nullable}$ to find out whether the resulting
    96 regular expression can match the empty string. If yes, we call
   110 regular expression can match the empty string. If yes, we call
    97 the function $mkeps$.
   111 the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
       
   112 expression has matched the empty string. Its definition
       
   113 is as follows:
    98 
   114 
    99 \begin{figure}[t]
   115 \begin{figure}[t]
   100 \begin{center}
   116 \begin{center}
   101 \begin{tikzpicture}[scale=2,node distance=1.2cm,
   117 \begin{tikzpicture}[scale=2,node distance=1.2cm,
   102                     every node/.style={minimum size=7mm}]
   118                     every node/.style={minimum size=7mm}]
   103 \node (r1)  {$r_1$};
   119 \node (r1)  {$r_1$};
   104 \node (r2) [right=of r1]{$r_2$};
   120 \node (r2) [right=of r1]{$r_2$};
   105 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};
   121 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$};
   106 \node (r3) [right=of r2]{$r_3$};
   122 \node (r3) [right=of r2]{$r_3$};
   107 \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};
   123 \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$};
   108 \node (r4) [right=of r3]{$r_4$};
   124 \node (r4) [right=of r3]{$r_4$};
   109 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};
   125 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$};
   110 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};
   126 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}};
   111 \node (v4) [below=of r4]{$v_4$};
   127 \node (v4) [below=of r4]{$v_4$};
   112 \draw[->,line width=1mm](r4) -- (v4);
   128 \draw[->,line width=1mm](r4) -- (v4);
   113 \node (v3) [left=of v4] {$v_3$};
   129 \node (v3) [left=of v4] {$v_3$};
   114 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};
   130 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$};
   115 \node (v2) [left=of v3]{$v_2$};
   131 \node (v2) [left=of v3]{$v_2$};
   116 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};
   132 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$};
   117 \node (v1) [left=of v2] {$v_1$};
   133 \node (v1) [left=of v2] {$v_1$};
   118 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};
   134 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$};
   119 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};
   135 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
   120 \end{tikzpicture}
   136 \end{tikzpicture}
   121 \end{center}
   137 \end{center}
   122 \caption{The two phases of the algorithm by Sulzmann \& Lu.
   138 \caption{The two phases of the algorithm by Sulzmann \& Lu.
   123 \label{Sulz}}
   139 \label{Sulz}}
   124 \end{figure}
   140 \end{figure}
   125 
   141 
   126 The $mkeps$ function calculates a value for how a regular
   142 
   127 expression has matched the empty string. Its definition
       
   128 is as follows:
       
   129 
   143 
   130 \begin{center}
   144 \begin{center}
   131 \begin{tabular}{lcl}
   145 \begin{tabular}{lcl}
   132   $mkeps(\ONE)$       & $\dn$ & $Empty$\\
   146   $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
   133   $mkeps(r_1 + r_2)$      & $\dn$ & if $nullable(r_1)$  \\
   147   $\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\textit{nullable}(r_1)$  \\
   134                           &       & then $\Left(mkeps(r_1))$\\
   148                           &       & then $\Left(\textit{mkeps}(r_1))$\\
   135                           &       & else $Right(mkeps(r_2))$\\
   149                           &       & else $Right(\textit{mkeps}(r_2))$\\
   136   $mkeps(r_1 \cdot r_2)$  & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\
   150   $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
   137   $mkeps(r^*)$            & $\dn$ & $[]$  \\
   151   $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
   138 \end{tabular}
   152 \end{tabular}
   139 \end{center}
   153 \end{center}
   140 
   154 
   141 \noindent There are no cases for $\ZERO$ and $c$, since
   155 \noindent There is are no cases for $\ZERO$ and $c$, since
   142 these regular expression cannot match the empty string. Note
   156 these regular expression cannot match the empty string. Note
   143 also that in case of alternatives we give preference to the
   157 also that in case of alternatives we give preference to the
   144 regular expression on the left-hand side. This will become
   158 regular expression on the left-hand side. This will become
   145 important later on.
   159 important later on.
   146 
   160 
   147 The second phase of the algorithm is organised so that it will
   161 The second phase of the algorithm is organised so that it will
   148 calculate a value for how the derivative regular expression
   162 calculate a value for how the derivative regular expression
   149 has matched a string. For this we need a function that
   163 has matched a string. For this we need a function that
   150 reverses this ``chopping off'' for values which we did in the
   164 reverses this ``chopping off'' for values which we did in the
   151 first phase for derivatives. The corresponding function is
   165 first phase for derivatives. The corresponding function is
   152 called $inj$ for injection. This function takes three
   166 called $\textit{inj}$ (short for injection). This function takes three
   153 arguments: the first one is a regular expression for which we
   167 arguments: the first one is a regular expression for which we
   154 want to calculate the value, the second is the character we
   168 want to calculate the value, the second is the character we
   155 want to inject and the third argument is the value where we
   169 want to inject and the third argument is the value where we
   156 will inject the character into. The result of this function is a
   170 will inject the character into. The result of this function is a
   157 new value. The definition of $inj$ is as follows: 
   171 new value. The definition of $\textit{inj}$ is as follows: 
   158 
   172 
   159 \begin{center}
   173 \begin{center}
   160 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   174 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   161   $inj\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
   175   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
   162   $inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(inj\,r_1\,c\,v)$\\
   176   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
   163   $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(inj\,r_2\,c\,v)$\\
   177   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
   164   $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
   178   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
   165   $inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
   179   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
   166   $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
   180   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
   167   $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
   181   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\
   168 \end{tabular}
   182 \end{tabular}
   169 \end{center}
   183 \end{center}
   170 
   184 
   171 \noindent This definition is by recursion on the regular
   185 \noindent This definition is by recursion on the regular
   172 expression and by analysing the shape of the values. Therefore
   186 expression and by analysing the shape of the values. Therefore
   173 there are, for example, three cases for sequence regular
   187 there are three cases for sequence regular
   174 expressions (for all possible shapes of the value). The last
   188 expressions (for all possible shapes of the value we can encounter). The last
   175 clause for the star regular expression returns a list where
   189 clause for the star regular expression returns a list where
   176 the first element is $inj\,r\,c\,v$ and the other elements are
   190 the first element is $\textit{inj}\,r\,c\,v$ and the other elements are
   177 $vs$. That means $\_\,::\,\_$ should be read as list cons.
   191 $vs$. That means $\_\,::\,\_$ should be read as list cons.
   178 
   192 
   179 To understand what is going on, it might be best to do some
   193 To understand what is going on, it might be best to do some
   180 example calculations and compare them with Figure~\ref{Sulz}.
   194 example calculations and compare them with Figure~\ref{Sulz}.
   181 For this note that we have not yet dealt with the need of
   195 For this note that we have not yet dealt with the need of
   193 \end{tabular}
   207 \end{tabular}
   194 \end{center}
   208 \end{center}
   195 
   209 
   196 \noindent According to the simple algorithm, we would test
   210 \noindent According to the simple algorithm, we would test
   197 whether $r_4$ is nullable, which in this case it indeed is.
   211 whether $r_4$ is nullable, which in this case it indeed is.
   198 This means we can use the function $mkeps$ to calculate a
   212 This means we can use the function $\textit{mkeps}$ to calculate a
   199 value for how $r_4$ was able to match the empty string.
   213 value for how $r_4$ was able to match the empty string.
   200 Remember that this function gives preference for alternatives
   214 Remember that this function gives preference for alternatives
   201 on the left-hand side. However there is only $\ONE$ on the
   215 on the left-hand side. However there is only $\ONE$ on the
   202 very right-hand side of $r_4$ that matches the empty string.
   216 very right-hand side of $r_4$ (underlined)
   203 Therefore $mkeps$ returns the value
   217 
       
   218 \begin{center}
       
   219 $r_4$:\;$(\ZERO \cdot (b \cdot c)) + 
       
   220          ((\ZERO \cdot c) + \underline{\ONE})$\\
       
   221 \end{center}
       
   222 
       
   223 \noindent
       
   224 that matches the empty string.
       
   225 Therefore $\textit{mkeps}$ returns the value
   204 
   226 
   205 \begin{center}
   227 \begin{center}
   206 $v_4:\;Right(Right(Empty))$
   228 $v_4:\;Right(Right(Empty))$
   207 \end{center}
   229 \end{center}
   208 
   230 
   209 \noindent If there had been a $\ONE$ on the left, then
   231 \noindent If there had been a $\ONE$ on the left, then
   210 $mkeps$ would have returned something of the form
   232 $\textit{mkeps}$ would have returned something of the form
   211 $\Left(\ldots)$. The point is that from this value we can
   233 $\Left(\ldots)$. The point is that from this value we can
   212 directly read off which part of $r_4$ matched the empty
   234 directly read off which part of $r_4$ matched the empty
   213 string: take the right-alternative first, and then the
   235 string: take the right-alternative first, and then the
   214 right-alternative again. Remember $r_4$ is of the form
   236 right-alternative again. 
   215 
       
   216 \begin{center}
       
   217 $r_4$:\;$(\ZERO \cdot (b \cdot c)) + 
       
   218          ((\ZERO \cdot c) + \underline{\ONE})$\\
       
   219 \end{center}
       
   220 
       
   221 \noindent the value tells us that the underlined $\ONE$
       
   222 is responsible for matching the empty string.
       
   223 
   237 
   224 Next we have to ``inject'' the last character, that is $c$ in
   238 Next we have to ``inject'' the last character, that is $c$ in
   225 the running example, into this value $v_4$ in order to
   239 the running example, into this value $v_4$ in order to
   226 calculate how $r_3$ could have matched the string $c$.
   240 calculate how $r_3$ could have matched the string $c$.
   227 According to the definition of $inj$ we obtain
   241 According to the definition of $\textit{inj}$ we obtain
   228 
   242 
   229 \begin{center}
   243 \begin{center}
   230 $v_3:\;Right(Seq(Empty, Char(c)))$
   244 $v_3:\;Right(Seq(Empty, Char(c)))$
   231 \end{center}
   245 \end{center}
   232 
   246 
   261   $|Empty|$     & $\dn$ & $[]$\\
   275   $|Empty|$     & $\dn$ & $[]$\\
   262   $|Char(c)|$   & $\dn$ & $[c]$\\
   276   $|Char(c)|$   & $\dn$ & $[c]$\\
   263   $|\Left(v)|$   & $\dn$ & $|v|$\\
   277   $|\Left(v)|$   & $\dn$ & $|v|$\\
   264   $|Right(v)|$  & $\dn$ & $|v|$\\
   278   $|Right(v)|$  & $\dn$ & $|v|$\\
   265   $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
   279   $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
   266   $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
   280   $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
   267 \end{tabular}
   281 \end{tabular}
   268 \end{center}
   282 \end{center}
   269 
   283 
   270 \noindent Using flatten we can see what are the strings behind 
   284 \noindent Using flatten we can see what are the strings behind 
   271 the values calculated by $mkeps$ and $inj$. In our running 
   285 the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running 
   272 example:
   286 example:
   273 
   287 
   274 \begin{center}
   288 \begin{center}
   275 \begin{tabular}{ll}
   289 \begin{tabular}{ll}
   276 $|v_4|$: & $[]$\\
   290 $|v_4|$: & $[]$\\
   278 $|v_2|$: & $bc$\\
   292 $|v_2|$: & $bc$\\
   279 $|v_1|$: & $abc$
   293 $|v_1|$: & $abc$
   280 \end{tabular}
   294 \end{tabular}
   281 \end{center}
   295 \end{center}
   282 
   296 
   283 \noindent This indicates that $inj$ indeed is injecting, or
   297 \noindent This indicates that $\textit{inj}$ indeed is injecting, or
   284 adding, back a character into the value. If we continue until
   298 adding, back a character into the value. 
   285 all characters are injected back, we have a value that can 
       
   286 indeed say how the string $abc$ was matched.
       
   287 
   299 
   288 There is a problem, however, with the described algorithm
   300 There is a problem, however, with the described algorithm
   289 so far: it is very slow. We need to include the simplification 
   301 so far: it is very slow. We need to include the simplification 
   290 from Lecture 2. This is what we shall do next.
   302 from Lecture 2. This is what we shall do next.
   291 
   303 
   299 to read several times before it makes sense and also might
   311 to read several times before it makes sense and also might
   300 require that you do some example calculations yourself. As a
   312 require that you do some example calculations yourself. As a
   301 first example consider the last derivation step in our earlier
   313 first example consider the last derivation step in our earlier
   302 example:
   314 example:
   303 
   315 
   304 \begin{center}
   316 \begin{equation}\label{regexfour}
   305 $r_4 = der\,c\,r_3 = 
   317 r_4 = \textit{der}\,c\,r_3 = 
   306 (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$
   318 (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)
   307 \end{center}
   319 \end{equation}
   308 
   320 
   309 \noindent Simplifying this regular expression would just give
   321 \noindent Simplifying this regular expression would just give
   310 us $\ONE$. Running $mkeps$ with this regular expression as
   322 us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
   311 input, however, would then provide us with $Empty$ instead of
   323 input, however, would give us with the value $Empty$ instead of
   312 $Right(Right(Empty))$ that was obtained without the
   324 
   313 simplification. The problem is we need to recreate this more
   325 \[Right(Right(Empty))\]
   314 complicated value, rather than just return $Empty$.
   326 
   315 
   327 \noindent
   316 This will require what I call \emph{rectification functions}.
   328 that was obtained  without the simplification. The problem  is we need
   317 They need to be calculated whenever a regular expression gets
   329 to  recreate this  more  complicated value,  rather  than just  return
   318 simplified. Rectification functions take a value as argument
   330 $Empty$. This will require what I call \emph{rectification functions}.
   319 and return a (rectified) value. Let us first take a look again
   331 They  need  to  be  calculated  whenever  a  regular  expression  gets
   320 at our simplification rules:
   332 simplified.
       
   333 
       
   334 Rectification functions take a value as argument and return a
       
   335 (rectified) value. In the example above the argument would be $Empty$
       
   336 and the output $Right(Right(Empty))$.  Before we define these
       
   337 rectification functions in general, let us first take a look again at
       
   338 our simplification rules:
   321 
   339 
   322 \begin{center}
   340 \begin{center}
   323 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
   341 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
   324 $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ 
   342 $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ 
   325 $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ 
   343 $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ 
   329 $\ZERO + r$ & $\mapsto$ & $r$\\ 
   347 $\ZERO + r$ & $\mapsto$ & $r$\\ 
   330 $r + r$ & $\mapsto$ & $r$\\ 
   348 $r + r$ & $\mapsto$ & $r$\\ 
   331 \end{tabular}
   349 \end{tabular}
   332 \end{center}
   350 \end{center}
   333 
   351 
   334 \noindent Applying them to $r_4$ will require several nested
   352 \noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested
   335 simplifications in order end up with just $\ONE$. However,
   353 simplifications in order end up with just $\ONE$. However,
   336 it is possible to apply them in a depth-first, or inside-out,
   354 it is possible to apply them in a depth-first, or inside-out,
   337 manner in order to calculate this simplified regular
   355 manner in order to calculate this simplified regular
   338 expression.
   356 expression.
   339 
   357 
   389 Let us package this idea with rectification functions
   407 Let us package this idea with rectification functions
   390 into a single function (still only considering the alternative
   408 into a single function (still only considering the alternative
   391 case):
   409 case):
   392 
   410 
   393 \begin{center}
   411 \begin{center}
   394 \begin{tabular}{l}
   412 \begin{tabular}{ll}
   395 $simp(r)$:\\
   413 $_1$ & $simp(r)$:\\
   396 \quad case $r = r_1 + r_2$\\
   414 $_2$ & \quad case $r = r_1 + r_2$\\
   397 \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
   415 $_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
   398 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
   416 $_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
   399 \qquad case $r_{1s} = \ZERO$: 
   417 $_5$ & \qquad case $r_{1s} = \ZERO$: 
   400        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
   418        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
   401 \qquad case $r_{2s} = \ZERO$: 
   419 $_6$ & \qquad case $r_{2s} = \ZERO$: 
   402        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
   420        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
   403 \qquad case $r_{1s} = r_{2s}$:
   421 $_7$ & \qquad case $r_{1s} = r_{2s}$:
   404        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
   422        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
   405 \qquad otherwise: 
   423 $_8$ & \qquad otherwise: 
   406        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
   424        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
   407 \end{tabular}
   425 \end{tabular}
   408 \end{center}
   426 \end{center}
   409 
   427 
   410 \noindent We first recursively call the simplification with
   428 \noindent We first recursively call the simplification with
   411 $r_1$ and $r_2$. This gives simplified regular expressions,
   429 $r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions,
   412 $r_{1s}$ and $r_{2s}$, as well as two rectification functions
   430 $r_{1s}$ and $r_{2s}$, as well as two rectification functions
   413 $f_{1s}$ and $f_{2s}$. We next need to test whether the
   431 $f_{1s}$ and $f_{2s}$. We next need to test whether the
   414 simplified regular expressions are $\ZERO$ so as to make
   432 simplified regular expressions are $\ZERO$ so as to make
   415 further simplifications. In case $r_{1s}$ is $\ZERO$,
   433 further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),
   416 then we can return $r_{2s}$ (the other alternative). However
   434 then we can return $r_{2s}$ (the other alternative). However
   417 for this we need to build a corresponding rectification 
   435 for this we need to build a corresponding rectification 
   418 function, which as said above is
   436 function, which as said above is
   419 
   437 
   420 \begin{center}
   438 \begin{center}
   463 case. In this case the main simplification function is as
   481 case. In this case the main simplification function is as
   464 follows
   482 follows
   465 
   483 
   466 \begin{center}
   484 \begin{center}
   467 \begin{tabular}{l}
   485 \begin{tabular}{l}
   468 $simp(r)$:\qquad\qquad (continued)\\
   486 $simp(r)$:\hspace{5cm} (continued)\\
   469 \quad case $r = r_1 \cdot r_2$\\
   487 \quad case $r = r_1 \cdot r_2$\\
   470 \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
   488 \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
   471 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
   489 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
   472 \qquad case $r_{1s} = \ZERO$: 
   490 \qquad case $r_{1s} = \ZERO$: 
   473        return $(\ZERO, f_{error})$\\
   491        return $(\ZERO, f_{error})$\\
   515 
   533 
   516 \begin{center}
   534 \begin{center}
   517 $\lambda v.\,v$
   535 $\lambda v.\,v$
   518 \end{center}
   536 \end{center}
   519 
   537 
   520 \noindent This completes the high-level version of the
   538 \noindent This completes the high-level version of the simplification
   521 simplification function, which is also shown again in 
   539 function, which is shown again in Figure~\ref{simp}. The Scala code
   522 Figure~\ref{simp}. This can now be used in a \emph{lexing
   540 for the simplifaction function is in Figure~\ref{simprect}.
   523 function} as follows:
       
   524 
   541 
   525 \begin{figure}[t]
   542 \begin{figure}[t]
   526 \begin{center}
   543 \begin{center}
   527 \begin{tabular}{l}
   544 \begin{tabular}{l}
   528 $simp(r)$:\\
   545 $simp(r)$:\\
   560 \end{center}
   577 \end{center}
   561 \caption{The simplification function that returns a simplified 
   578 \caption{The simplification function that returns a simplified 
   562 regular expression and a rectification function.\label{simp}}
   579 regular expression and a rectification function.\label{simp}}
   563 \end{figure}
   580 \end{figure}
   564 
   581 
       
   582 \begin{figure}[p]
       
   583 \lstinputlisting{../progs/app61.scala}
       
   584 
       
   585 \caption{The Scala code for the simplification function. The
       
   586 first part defines some auxillary functions for the rectification.
       
   587 The second part give the simplification function.
       
   588 \label{simprect}}
       
   589 \end{figure}
       
   590 
       
   591 
       
   592 We are now in the position to define a \emph{lexing function} as follows:
       
   593 
   565 \begin{center}
   594 \begin{center}
   566 \begin{tabular}{lcl}
   595 \begin{tabular}{lcl}
   567 $lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\
   596 $lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\
   568              &       & else $error$\\
   597              &       & else $error$\\
   569 $lex\,r\,c\!::\!s$ & $\dn$ & let 
   598 $lex\,r\,c\!::\!s$ & $\dn$ & let 
   570    $(r_{simp}, f_{rect}) = simp(der(c, r))$\\
   599    $(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\
   571 & & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
   600 & & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
   572 \end{tabular}
   601 \end{tabular}
   573 \end{center}
   602 \end{center}
   574 
   603 
   575 \noindent This corresponds to the $matches$ function we have
   604 \noindent This corresponds to the $matches$ function we have
   576 seen in earlier lectures. In the first clause we are given an
   605 seen in earlier lectures. In the first clause we are given an
   577 empty string, $[]$, and need to test wether the regular
   606 empty string, $[]$, and need to test wether the regular
   578 expression is $nullable$. If yes, we can proceed normally and
   607 expression is $nullable$. If yes, we can proceed normally and
   579 just return the value calculated by $mkeps$. The second clause
   608 just return the value calculated by $\textit{mkeps}$. The second clause
   580 is for strings where the first character is $c$, say, and the
   609 is for strings where the first character is $c$, say, and the
   581 rest of the string is $s$. We first build the derivative of
   610 rest of the string is $s$. We first build the derivative of
   582 $r$ with respect to $c$; simplify the resulting regular
   611 $r$ with respect to $c$; simplify the resulting regular
   583 expression. We continue lexing with the simplified regular
   612 expression. We continue lexing with the simplified regular
   584 expression and the string $s$. Whatever will be returned as
   613 expression and the string $s$. Whatever will be returned as
   612 
   641 
   613 {\small\lstinputlisting[language=Scala]
   642 {\small\lstinputlisting[language=Scala]
   614 {../progs/app03.scala}}
   643 {../progs/app03.scala}}
   615 
   644 
   616 \noindent Since we regard records as regular expressions we
   645 \noindent Since we regard records as regular expressions we
   617 need to extend the functions $nullable$ and $der$. Similarly
   646 need to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly
   618 $mkeps$ and $inj$ need to be extended. This means we also need
   647 $\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also need
   619 to extend the definition of values, which in Scala looks as
   648 to extend the definition of values, which in Scala looks as
   620 follows:
   649 follows:
   621 
   650 
   622 {\small\lstinputlisting[language=Scala]
   651 {\small\lstinputlisting[language=Scala]
   623 {../progs/app04.scala}}
   652 {../progs/app04.scala}}
   644   $env(Empty)$     & $\dn$ & $[]$\\
   673   $env(Empty)$     & $\dn$ & $[]$\\
   645   $env(Char(c))$   & $\dn$ & $[]$\\
   674   $env(Char(c))$   & $\dn$ & $[]$\\
   646   $env(\Left(v))$   & $\dn$ & $env(v)$\\
   675   $env(\Left(v))$   & $\dn$ & $env(v)$\\
   647   $env(Right(v))$  & $\dn$ & $env(v)$\\
   676   $env(Right(v))$  & $\dn$ & $env(v)$\\
   648   $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\
   677   $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\
   649   $env([v_1,\ldots ,v_n])$ & $\dn$ & 
   678   $env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & 
   650      $env(v_1) \,@\ldots @\, env(v_n)$\\
   679      $env(v_1) \,@\ldots @\, env(v_n)$\\
   651   $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
   680   $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
   652 \end{tabular}
   681 \end{tabular}
   653 \end{center}
   682 \end{center}
   654 
   683