handouts/ho04.tex
changeset 356 d9c784c71305
parent 352 1e1b0fe66107
child 357 603e171a7b48
equal deleted inserted replaced
355:a259eec25156 356:d9c784c71305
    27 Lu introduce \emph{values}. A value will be the output of the
    27 Lu introduce \emph{values}. A value will be the output of the
    28 algorithm whenever the regular expression matches the string.
    28 algorithm whenever the regular expression matches the string.
    29 If the string does not match the string, an error will be
    29 If the string does not match the string, an error will be
    30 raised. Since the first phase of the algorithm by Sulzmann \&
    30 raised. Since the first phase of the algorithm by Sulzmann \&
    31 Lu is identical to the derivative based matcher from the first
    31 Lu is identical to the derivative based matcher from the first
    32 coursework, the function $nullable$ will be used to decide
    32 coursework, the function \textit{nullable} will be used to
    33 whether as string is matched by a regular expression. If
    33 decide whether as string is matched by a regular expression.
    34 $nullable$ says yes, then values are constructed that reflect
    34 If \textit{nullable} says yes, then values are constructed
    35 how the regular expression matched the string. 
    35 that reflect how the regular expression matched the string. 
    36 
    36 
    37 The definitions for values is given below. They are shown 
    37 The definitions for values is given below. They are shown 
    38 together with the regular expressions $r$ to which
    38 together with the regular expressions $r$ to which
    39 they correspond:
    39 they correspond:
    40 
    40 
    81 values, I have in my implementation the convention that
    81 values, I have in my implementation the convention that
    82 regular expressions and values have the same name, except that
    82 regular expressions and values have the same name, except that
    83 regular expressions are written entirely with upper-case
    83 regular expressions are written entirely with upper-case
    84 letters, while values just start with a single upper-case
    84 letters, while values just start with a single upper-case
    85 character and the rest are lower-case letters. My definition
    85 character and the rest are lower-case letters. My definition
    86 of regular expressions and values in Scala is shown below. I use
    86 of regular expressions and values in Scala is shown below. I
    87 this in the REPL of Scala; when I use the Scala compiler I
    87 use this in the REPL of Scala; when I use the Scala compiler I
    88 need to rename some constructors, because Scala on Macs does
    88 unfortunately need to rename some constructors, because Scala
    89 not like classes that are called \pcode{EMPTY} and
    89 on Macs does not like classes that are called \pcode{EMPTY}
    90 \pcode{Empty}.
    90 and \pcode{Empty}.
    91  
    91  
    92  {\small\lstinputlisting[language=Scala,numbers=none]
    92  {\small\lstinputlisting[language=Scala,numbers=none]
    93 {../progs/app01.scala}}
    93 {../progs/app01.scala}}
    94 
    94 
    95  
    95  
   221 \noindent If there had been a $\epsilon$ on the left, then
   221 \noindent If there had been a $\epsilon$ on the left, then
   222 $mkeps$ would have returned something of the form
   222 $mkeps$ would have returned something of the form
   223 $Left(\ldots)$. The point is that from this value we can
   223 $Left(\ldots)$. The point is that from this value we can
   224 directly read off which part of $r_4$ matched the empty
   224 directly read off which part of $r_4$ matched the empty
   225 string: take the right-alternative first, and then the
   225 string: take the right-alternative first, and then the
   226 right-alternative again. 
   226 right-alternative again. Remember $r_4$ is of the form
       
   227 
       
   228 \begin{center}
       
   229 $r_4$:\;$(\varnothing \cdot (b \cdot c)) + 
       
   230          ((\varnothing \cdot c) + \underline{\epsilon})$\\
       
   231 \end{center}
       
   232 
       
   233 \noindent the value tells us that the underlined $\epsilon$
       
   234 is responsible for matching the empty string.
   227 
   235 
   228 Next we have to ``inject'' the last character, that is $c$ in
   236 Next we have to ``inject'' the last character, that is $c$ in
   229 the running example, into this value $v_4$ in order to
   237 the running example, into this value $v_4$ in order to
   230 calculate how $r_3$ could have matched the string $c$.
   238 calculate how $r_3$ could have matched the string $c$.
   231 According to the definition of $inj$ we obtain
   239 According to the definition of $inj$ we obtain
   339 simplifications in order end up with just $\epsilon$. However,
   347 simplifications in order end up with just $\epsilon$. However,
   340 it is possible to apply them in a depth-first, or inside-out,
   348 it is possible to apply them in a depth-first, or inside-out,
   341 manner in order to calculate this simplified regular
   349 manner in order to calculate this simplified regular
   342 expression.
   350 expression.
   343 
   351 
   344 The rectification we can implement by letting simp return
   352 The rectification we can implement by letting simp return not
   345 not just a (simplified) regular expression, but also a
   353 just a (simplified) regular expression, but also a
   346 rectification function. Let us consider the alternative case,
   354 rectification function. Let us consider the alternative case,
   347 $r_1 + r_2$, first. By going depth-first, we first simplify
   355 $r_1 + r_2$, first. By going depth-first, we first simplify
   348 the component regular expressions $r_1$ and $r_2.$ This will
   356 the component regular expressions $r_1$ and $r_2.$ This will
   349 return simplified versions (if they can be simplified), say
   357 return simplified versions, say $r_{1s}$ and $r_{2s}$, of the
   350 $r_{1s}$ and $r_{2s}$, but also two rectification functions
   358 component regular expressions (if they can be simplified) but
   351 $f_{1s}$ and $f_{2s}$. We need to assemble them in order to
   359 also two rectification functions $f_{1s}$ and $f_{2s}$. We
   352 obtain a rectified value for $r_1 + r_2$. In case $r_{1s}$
   360 need to assemble them in order to obtain a rectified value for
   353 simplified to $\varnothing$, we continue the derivative
   361 $r_1 + r_2$. In case $r_{1s}$ simplified to $\varnothing$, we
   354 calculation with $r_{2s}$. The Sulzmann \& Lu algorithm would
   362 continue the derivative calculation with $r_{2s}$. The
   355 return a corresponding value, say $v_{2s}$. But now this value
   363 Sulzmann \& Lu algorithm would return a corresponding value,
   356 needs to be ``rectified'' to the value 
   364 say $v_{2s}$. But now this value needs to be ``rectified'' to
       
   365 the value 
   357 
   366 
   358 \begin{center}
   367 \begin{center}
   359 $Right(v_{2s})$
   368 $Right(v_{2s})$
   360 \end{center}
   369 \end{center}
   361 
   370 
   718 $[(name:\texttt{christian.urban}), 
   727 $[(name:\texttt{christian.urban}), 
   719   (domain:\texttt{kcl}), 
   728   (domain:\texttt{kcl}), 
   720   (top\_level:\texttt{ac.uk})]$
   729   (top\_level:\texttt{ac.uk})]$
   721 \end{center}
   730 \end{center}
   722 
   731 
   723 \noindent As you will see in the next lecture, this is now all
   732 Recall that we want to lex a little programming language,
   724 we need to tokenise an input string and classify each token.
   733 called the \emph{While}-language. The main keywords in this
       
   734 language are \pcode{while}, \pcode{if}, \pcode{then} and
       
   735 \pcode{else}. As usual we have identifiers, operators, numbers
       
   736 and so on. For this we would need to design the corresponding
       
   737 regular expressions to recognise these syntactic categories. I
       
   738 let you do this design task. Having these regular expressions
       
   739 at our disposal we can form the regular expression
       
   740 
       
   741 \begin{center}
       
   742 \begin{tabular}{rcl}
       
   743 \textit{WhileRegs} & $\dn$ & (($k$, KEYWORD) +\\
       
   744                    &     & ($i$, ID) +\\
       
   745                    &     & ($o$, OP) + \\
       
   746                    &     & ($n$, NUM) + \\
       
   747                    &     & ($s$, SEMI) + \\
       
   748                    &     & ($p$, (LPAREN + RPAREN)) +\\ 
       
   749                    &     & ($b$, (BEGIN + END)) + \\
       
   750                    &     & ($w$, WHITESPACE))$^*$
       
   751 \end{tabular}
       
   752 \end{center}
       
   753 
       
   754 
   725 \end{document}
   755 \end{document}
   726 
   756 
   727 %%% Local Variables: 
   757 %%% Local Variables: 
   728 %%% mode: latex
   758 %%% mode: latex
   729 %%% TeX-master: t
   759 %%% TeX-master: t