27 Lu introduce \emph{values}. A value will be the output of the |
27 Lu introduce \emph{values}. A value will be the output of the |
28 algorithm whenever the regular expression matches the string. |
28 algorithm whenever the regular expression matches the string. |
29 If the string does not match the string, an error will be |
29 If the string does not match the string, an error will be |
30 raised. Since the first phase of the algorithm by Sulzmann \& |
30 raised. Since the first phase of the algorithm by Sulzmann \& |
31 Lu is identical to the derivative based matcher from the first |
31 Lu is identical to the derivative based matcher from the first |
32 coursework, the function $nullable$ will be used to decide |
32 coursework, the function \textit{nullable} will be used to |
33 whether as string is matched by a regular expression. If |
33 decide whether as string is matched by a regular expression. |
34 $nullable$ says yes, then values are constructed that reflect |
34 If \textit{nullable} says yes, then values are constructed |
35 how the regular expression matched the string. |
35 that reflect how the regular expression matched the string. |
36 |
36 |
37 The definitions for values is given below. They are shown |
37 The definitions for values is given below. They are shown |
38 together with the regular expressions $r$ to which |
38 together with the regular expressions $r$ to which |
39 they correspond: |
39 they correspond: |
40 |
40 |
81 values, I have in my implementation the convention that |
81 values, I have in my implementation the convention that |
82 regular expressions and values have the same name, except that |
82 regular expressions and values have the same name, except that |
83 regular expressions are written entirely with upper-case |
83 regular expressions are written entirely with upper-case |
84 letters, while values just start with a single upper-case |
84 letters, while values just start with a single upper-case |
85 character and the rest are lower-case letters. My definition |
85 character and the rest are lower-case letters. My definition |
86 of regular expressions and values in Scala is shown below. I use |
86 of regular expressions and values in Scala is shown below. I |
87 this in the REPL of Scala; when I use the Scala compiler I |
87 use this in the REPL of Scala; when I use the Scala compiler I |
88 need to rename some constructors, because Scala on Macs does |
88 unfortunately need to rename some constructors, because Scala |
89 not like classes that are called \pcode{EMPTY} and |
89 on Macs does not like classes that are called \pcode{EMPTY} |
90 \pcode{Empty}. |
90 and \pcode{Empty}. |
91 |
91 |
92 {\small\lstinputlisting[language=Scala,numbers=none] |
92 {\small\lstinputlisting[language=Scala,numbers=none] |
93 {../progs/app01.scala}} |
93 {../progs/app01.scala}} |
94 |
94 |
95 |
95 |
221 \noindent If there had been a $\epsilon$ on the left, then |
221 \noindent If there had been a $\epsilon$ on the left, then |
222 $mkeps$ would have returned something of the form |
222 $mkeps$ would have returned something of the form |
223 $Left(\ldots)$. The point is that from this value we can |
223 $Left(\ldots)$. The point is that from this value we can |
224 directly read off which part of $r_4$ matched the empty |
224 directly read off which part of $r_4$ matched the empty |
225 string: take the right-alternative first, and then the |
225 string: take the right-alternative first, and then the |
226 right-alternative again. |
226 right-alternative again. Remember $r_4$ is of the form |
|
227 |
|
228 \begin{center} |
|
229 $r_4$:\;$(\varnothing \cdot (b \cdot c)) + |
|
230 ((\varnothing \cdot c) + \underline{\epsilon})$\\ |
|
231 \end{center} |
|
232 |
|
233 \noindent the value tells us that the underlined $\epsilon$ |
|
234 is responsible for matching the empty string. |
227 |
235 |
228 Next we have to ``inject'' the last character, that is $c$ in |
236 Next we have to ``inject'' the last character, that is $c$ in |
229 the running example, into this value $v_4$ in order to |
237 the running example, into this value $v_4$ in order to |
230 calculate how $r_3$ could have matched the string $c$. |
238 calculate how $r_3$ could have matched the string $c$. |
231 According to the definition of $inj$ we obtain |
239 According to the definition of $inj$ we obtain |
339 simplifications in order end up with just $\epsilon$. However, |
347 simplifications in order end up with just $\epsilon$. However, |
340 it is possible to apply them in a depth-first, or inside-out, |
348 it is possible to apply them in a depth-first, or inside-out, |
341 manner in order to calculate this simplified regular |
349 manner in order to calculate this simplified regular |
342 expression. |
350 expression. |
343 |
351 |
344 The rectification we can implement by letting simp return |
352 The rectification we can implement by letting simp return not |
345 not just a (simplified) regular expression, but also a |
353 just a (simplified) regular expression, but also a |
346 rectification function. Let us consider the alternative case, |
354 rectification function. Let us consider the alternative case, |
347 $r_1 + r_2$, first. By going depth-first, we first simplify |
355 $r_1 + r_2$, first. By going depth-first, we first simplify |
348 the component regular expressions $r_1$ and $r_2.$ This will |
356 the component regular expressions $r_1$ and $r_2.$ This will |
349 return simplified versions (if they can be simplified), say |
357 return simplified versions, say $r_{1s}$ and $r_{2s}$, of the |
350 $r_{1s}$ and $r_{2s}$, but also two rectification functions |
358 component regular expressions (if they can be simplified) but |
351 $f_{1s}$ and $f_{2s}$. We need to assemble them in order to |
359 also two rectification functions $f_{1s}$ and $f_{2s}$. We |
352 obtain a rectified value for $r_1 + r_2$. In case $r_{1s}$ |
360 need to assemble them in order to obtain a rectified value for |
353 simplified to $\varnothing$, we continue the derivative |
361 $r_1 + r_2$. In case $r_{1s}$ simplified to $\varnothing$, we |
354 calculation with $r_{2s}$. The Sulzmann \& Lu algorithm would |
362 continue the derivative calculation with $r_{2s}$. The |
355 return a corresponding value, say $v_{2s}$. But now this value |
363 Sulzmann \& Lu algorithm would return a corresponding value, |
356 needs to be ``rectified'' to the value |
364 say $v_{2s}$. But now this value needs to be ``rectified'' to |
|
365 the value |
357 |
366 |
358 \begin{center} |
367 \begin{center} |
359 $Right(v_{2s})$ |
368 $Right(v_{2s})$ |
360 \end{center} |
369 \end{center} |
361 |
370 |
718 $[(name:\texttt{christian.urban}), |
727 $[(name:\texttt{christian.urban}), |
719 (domain:\texttt{kcl}), |
728 (domain:\texttt{kcl}), |
720 (top\_level:\texttt{ac.uk})]$ |
729 (top\_level:\texttt{ac.uk})]$ |
721 \end{center} |
730 \end{center} |
722 |
731 |
723 \noindent As you will see in the next lecture, this is now all |
732 Recall that we want to lex a little programming language, |
724 we need to tokenise an input string and classify each token. |
733 called the \emph{While}-language. The main keywords in this |
|
734 language are \pcode{while}, \pcode{if}, \pcode{then} and |
|
735 \pcode{else}. As usual we have identifiers, operators, numbers |
|
736 and so on. For this we would need to design the corresponding |
|
737 regular expressions to recognise these syntactic categories. I |
|
738 let you do this design task. Having these regular expressions |
|
739 at our disposal we can form the regular expression |
|
740 |
|
741 \begin{center} |
|
742 \begin{tabular}{rcl} |
|
743 \textit{WhileRegs} & $\dn$ & (($k$, KEYWORD) +\\ |
|
744 & & ($i$, ID) +\\ |
|
745 & & ($o$, OP) + \\ |
|
746 & & ($n$, NUM) + \\ |
|
747 & & ($s$, SEMI) + \\ |
|
748 & & ($p$, (LPAREN + RPAREN)) +\\ |
|
749 & & ($b$, (BEGIN + END)) + \\ |
|
750 & & ($w$, WHITESPACE))$^*$ |
|
751 \end{tabular} |
|
752 \end{center} |
|
753 |
|
754 |
725 \end{document} |
755 \end{document} |
726 |
756 |
727 %%% Local Variables: |
757 %%% Local Variables: |
728 %%% mode: latex |
758 %%% mode: latex |
729 %%% TeX-master: t |
759 %%% TeX-master: t |