8 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016} |
8 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016} |
9 |
9 |
10 \section*{Handout 4 (Sulzmann \& Lu Algorithm)} |
10 \section*{Handout 4 (Sulzmann \& Lu Algorithm)} |
11 |
11 |
12 So far our algorithm based on derivatives was only able to say |
12 So far our algorithm based on derivatives was only able to say |
13 yes or no depending on whether a string was matched by regular |
13 yes or no depending on whether a string was matched by a regular |
14 expression or not. Often a more interesting question is to |
14 expression or not. Often a more interesting question is to |
15 find out \emph{how} a regular expression matched a string? |
15 find out \emph{how} a regular expression matched a string? |
16 Answering this question will also help us with the problem we |
16 Answering this question will also help us with the problem we |
17 are after, namely tokenising an input string. |
17 are after, namely tokenising an input string. |
18 |
18 |
19 The algorithm we will be looking at was designed by Sulzmann |
19 The algorithm we will be looking at in this lecture was designed by Sulzmann |
20 \& Lu in a rather recent paper (from 2014). A link to it is |
20 \& Lu in a rather recent research paper (from 2014). A link to it is |
21 provided on KEATS, in case you are interested.\footnote{In my |
21 provided on KEATS, in case you are interested.\footnote{In my |
22 humble opinion this is an interesting instance of the research |
22 humble opinion this is an interesting instance of the research |
23 literature: it contains a very neat idea, but its presentation |
23 literature: it contains a very neat idea, but its presentation |
24 is rather sloppy. In earlier versions of their paper, a King's |
24 is rather sloppy. In earlier versions of this paper, a King's |
25 student and I found several rather annoying typos in their |
25 undergraduate student and I found several rather annoying typos in the |
26 examples and definitions.} In order to give an answer for |
26 examples and definitions.} My PhD student Fahad Ausaf and I even more recently |
|
27 wrote a paper where we build on their result: we provided a |
|
28 mathematical proof that their algorithm is really correct---the proof |
|
29 Sulzmann \& Lu had originally given contained major flaws. Such correctness |
|
30 proofs are important: Kuklewicz maintains a unit-test library |
|
31 for the kind of algorithma we are interested in here and he showed |
|
32 that many implementations in the ``wild'' are buggy, that is not |
|
33 satisfy his unit tests: |
|
34 |
|
35 \begin{center} |
|
36 \url{http://www.haskell.org/haskellwiki/Regex_Posix} |
|
37 \end{center} |
|
38 |
|
39 |
|
40 In order to give an answer for |
27 \emph{how} a regular expression matches a string, Sulzmann and |
41 \emph{how} a regular expression matches a string, Sulzmann and |
28 Lu use \emph{values}. A value will be the output of the |
42 Lu use \emph{values}. A value will be the output of the |
29 algorithm whenever the regular expression matches the string. |
43 algorithm whenever the regular expression matches the string. |
30 If the string does not match the string, an error will be |
44 If the string does not match the string, an error will be |
31 raised. |
45 raised. |
85 {../progs/app02.scala}} |
99 {../progs/app02.scala}} |
86 |
100 |
87 |
101 |
88 Graphically the algorithm by Sulzmann \& Lu can be illustrated |
102 Graphically the algorithm by Sulzmann \& Lu can be illustrated |
89 by the picture in Figure~\ref{Sulz} where the path from the |
103 by the picture in Figure~\ref{Sulz} where the path from the |
90 left to the right involving $der/nullable$ is the first phase |
104 left to the right involving $\textit{der}/\textit{nullable}$ is the first phase |
91 of the algorithm and $mkeps/inj$, the path from right to left, |
105 of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left, |
92 the second phase. This picture shows the steps required when a |
106 the second phase. This picture shows the steps required when a |
93 regular expression, say $r_1$, matches the string $abc$. We |
107 regular expression, say $r_1$, matches the string $abc$. We |
94 first build the three derivatives (according to $a$, $b$ and |
108 first build the three derivatives (according to $a$, $b$ and |
95 $c$). We then use $nullable$ to find out whether the resulting |
109 $c$). We then use $\textit{nullable}$ to find out whether the resulting |
96 regular expression can match the empty string. If yes, we call |
110 regular expression can match the empty string. If yes, we call |
97 the function $mkeps$. |
111 the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular |
|
112 expression has matched the empty string. Its definition |
|
113 is as follows: |
98 |
114 |
99 \begin{figure}[t] |
115 \begin{figure}[t] |
100 \begin{center} |
116 \begin{center} |
101 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
117 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
102 every node/.style={minimum size=7mm}] |
118 every node/.style={minimum size=7mm}] |
103 \node (r1) {$r_1$}; |
119 \node (r1) {$r_1$}; |
104 \node (r2) [right=of r1]{$r_2$}; |
120 \node (r2) [right=of r1]{$r_2$}; |
105 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$}; |
121 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$}; |
106 \node (r3) [right=of r2]{$r_3$}; |
122 \node (r3) [right=of r2]{$r_3$}; |
107 \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$}; |
123 \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$}; |
108 \node (r4) [right=of r3]{$r_4$}; |
124 \node (r4) [right=of r3]{$r_4$}; |
109 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$}; |
125 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$}; |
110 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}}; |
126 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}}; |
111 \node (v4) [below=of r4]{$v_4$}; |
127 \node (v4) [below=of r4]{$v_4$}; |
112 \draw[->,line width=1mm](r4) -- (v4); |
128 \draw[->,line width=1mm](r4) -- (v4); |
113 \node (v3) [left=of v4] {$v_3$}; |
129 \node (v3) [left=of v4] {$v_3$}; |
114 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$}; |
130 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$}; |
115 \node (v2) [left=of v3]{$v_2$}; |
131 \node (v2) [left=of v3]{$v_2$}; |
116 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$}; |
132 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$}; |
117 \node (v1) [left=of v2] {$v_1$}; |
133 \node (v1) [left=of v2] {$v_1$}; |
118 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$}; |
134 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$}; |
119 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}}; |
135 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}}; |
120 \end{tikzpicture} |
136 \end{tikzpicture} |
121 \end{center} |
137 \end{center} |
122 \caption{The two phases of the algorithm by Sulzmann \& Lu. |
138 \caption{The two phases of the algorithm by Sulzmann \& Lu. |
123 \label{Sulz}} |
139 \label{Sulz}} |
124 \end{figure} |
140 \end{figure} |
125 |
141 |
126 The $mkeps$ function calculates a value for how a regular |
142 |
127 expression has matched the empty string. Its definition |
|
128 is as follows: |
|
129 |
143 |
130 \begin{center} |
144 \begin{center} |
131 \begin{tabular}{lcl} |
145 \begin{tabular}{lcl} |
132 $mkeps(\ONE)$ & $\dn$ & $Empty$\\ |
146 $\textit{mkeps}(\ONE)$ & $\dn$ & $Empty$\\ |
133 $mkeps(r_1 + r_2)$ & $\dn$ & if $nullable(r_1)$ \\ |
147 $\textit{mkeps}(r_1 + r_2)$ & $\dn$ & if $\textit{nullable}(r_1)$ \\ |
134 & & then $\Left(mkeps(r_1))$\\ |
148 & & then $\Left(\textit{mkeps}(r_1))$\\ |
135 & & else $Right(mkeps(r_2))$\\ |
149 & & else $Right(\textit{mkeps}(r_2))$\\ |
136 $mkeps(r_1 \cdot r_2)$ & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\ |
150 $\textit{mkeps}(r_1 \cdot r_2)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\ |
137 $mkeps(r^*)$ & $\dn$ & $[]$ \\ |
151 $\textit{mkeps}(r^*)$ & $\dn$ & $Stars\,[]$ \\ |
138 \end{tabular} |
152 \end{tabular} |
139 \end{center} |
153 \end{center} |
140 |
154 |
141 \noindent There are no cases for $\ZERO$ and $c$, since |
155 \noindent There is are no cases for $\ZERO$ and $c$, since |
142 these regular expression cannot match the empty string. Note |
156 these regular expression cannot match the empty string. Note |
143 also that in case of alternatives we give preference to the |
157 also that in case of alternatives we give preference to the |
144 regular expression on the left-hand side. This will become |
158 regular expression on the left-hand side. This will become |
145 important later on. |
159 important later on. |
146 |
160 |
147 The second phase of the algorithm is organised so that it will |
161 The second phase of the algorithm is organised so that it will |
148 calculate a value for how the derivative regular expression |
162 calculate a value for how the derivative regular expression |
149 has matched a string. For this we need a function that |
163 has matched a string. For this we need a function that |
150 reverses this ``chopping off'' for values which we did in the |
164 reverses this ``chopping off'' for values which we did in the |
151 first phase for derivatives. The corresponding function is |
165 first phase for derivatives. The corresponding function is |
152 called $inj$ for injection. This function takes three |
166 called $\textit{inj}$ (short for injection). This function takes three |
153 arguments: the first one is a regular expression for which we |
167 arguments: the first one is a regular expression for which we |
154 want to calculate the value, the second is the character we |
168 want to calculate the value, the second is the character we |
155 want to inject and the third argument is the value where we |
169 want to inject and the third argument is the value where we |
156 will inject the character into. The result of this function is a |
170 will inject the character into. The result of this function is a |
157 new value. The definition of $inj$ is as follows: |
171 new value. The definition of $\textit{inj}$ is as follows: |
158 |
172 |
159 \begin{center} |
173 \begin{center} |
160 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
174 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
161 $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
175 $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
162 $inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(inj\,r_1\,c\,v)$\\ |
176 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ |
163 $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(inj\,r_2\,c\,v)$\\ |
177 $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ |
164 $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ |
178 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ |
165 $inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ |
179 $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ |
166 $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\ |
180 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ |
167 $inj\,(r^*)\,c\,Seq(v,vs)$ & $\dn$ & $inj\,r\,c\,v\,::\,vs$\\ |
181 $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\ |
168 \end{tabular} |
182 \end{tabular} |
169 \end{center} |
183 \end{center} |
170 |
184 |
171 \noindent This definition is by recursion on the regular |
185 \noindent This definition is by recursion on the regular |
172 expression and by analysing the shape of the values. Therefore |
186 expression and by analysing the shape of the values. Therefore |
173 there are, for example, three cases for sequence regular |
187 there are three cases for sequence regular |
174 expressions (for all possible shapes of the value). The last |
188 expressions (for all possible shapes of the value we can encounter). The last |
175 clause for the star regular expression returns a list where |
189 clause for the star regular expression returns a list where |
176 the first element is $inj\,r\,c\,v$ and the other elements are |
190 the first element is $\textit{inj}\,r\,c\,v$ and the other elements are |
177 $vs$. That means $\_\,::\,\_$ should be read as list cons. |
191 $vs$. That means $\_\,::\,\_$ should be read as list cons. |
178 |
192 |
179 To understand what is going on, it might be best to do some |
193 To understand what is going on, it might be best to do some |
180 example calculations and compare them with Figure~\ref{Sulz}. |
194 example calculations and compare them with Figure~\ref{Sulz}. |
181 For this note that we have not yet dealt with the need of |
195 For this note that we have not yet dealt with the need of |
193 \end{tabular} |
207 \end{tabular} |
194 \end{center} |
208 \end{center} |
195 |
209 |
196 \noindent According to the simple algorithm, we would test |
210 \noindent According to the simple algorithm, we would test |
197 whether $r_4$ is nullable, which in this case it indeed is. |
211 whether $r_4$ is nullable, which in this case it indeed is. |
198 This means we can use the function $mkeps$ to calculate a |
212 This means we can use the function $\textit{mkeps}$ to calculate a |
199 value for how $r_4$ was able to match the empty string. |
213 value for how $r_4$ was able to match the empty string. |
200 Remember that this function gives preference for alternatives |
214 Remember that this function gives preference for alternatives |
201 on the left-hand side. However there is only $\ONE$ on the |
215 on the left-hand side. However there is only $\ONE$ on the |
202 very right-hand side of $r_4$ that matches the empty string. |
216 very right-hand side of $r_4$ (underlined) |
203 Therefore $mkeps$ returns the value |
217 |
|
218 \begin{center} |
|
219 $r_4$:\;$(\ZERO \cdot (b \cdot c)) + |
|
220 ((\ZERO \cdot c) + \underline{\ONE})$\\ |
|
221 \end{center} |
|
222 |
|
223 \noindent |
|
224 that matches the empty string. |
|
225 Therefore $\textit{mkeps}$ returns the value |
204 |
226 |
205 \begin{center} |
227 \begin{center} |
206 $v_4:\;Right(Right(Empty))$ |
228 $v_4:\;Right(Right(Empty))$ |
207 \end{center} |
229 \end{center} |
208 |
230 |
209 \noindent If there had been a $\ONE$ on the left, then |
231 \noindent If there had been a $\ONE$ on the left, then |
210 $mkeps$ would have returned something of the form |
232 $\textit{mkeps}$ would have returned something of the form |
211 $\Left(\ldots)$. The point is that from this value we can |
233 $\Left(\ldots)$. The point is that from this value we can |
212 directly read off which part of $r_4$ matched the empty |
234 directly read off which part of $r_4$ matched the empty |
213 string: take the right-alternative first, and then the |
235 string: take the right-alternative first, and then the |
214 right-alternative again. Remember $r_4$ is of the form |
236 right-alternative again. |
215 |
|
216 \begin{center} |
|
217 $r_4$:\;$(\ZERO \cdot (b \cdot c)) + |
|
218 ((\ZERO \cdot c) + \underline{\ONE})$\\ |
|
219 \end{center} |
|
220 |
|
221 \noindent the value tells us that the underlined $\ONE$ |
|
222 is responsible for matching the empty string. |
|
223 |
237 |
224 Next we have to ``inject'' the last character, that is $c$ in |
238 Next we have to ``inject'' the last character, that is $c$ in |
225 the running example, into this value $v_4$ in order to |
239 the running example, into this value $v_4$ in order to |
226 calculate how $r_3$ could have matched the string $c$. |
240 calculate how $r_3$ could have matched the string $c$. |
227 According to the definition of $inj$ we obtain |
241 According to the definition of $\textit{inj}$ we obtain |
228 |
242 |
229 \begin{center} |
243 \begin{center} |
230 $v_3:\;Right(Seq(Empty, Char(c)))$ |
244 $v_3:\;Right(Seq(Empty, Char(c)))$ |
231 \end{center} |
245 \end{center} |
232 |
246 |
299 to read several times before it makes sense and also might |
311 to read several times before it makes sense and also might |
300 require that you do some example calculations yourself. As a |
312 require that you do some example calculations yourself. As a |
301 first example consider the last derivation step in our earlier |
313 first example consider the last derivation step in our earlier |
302 example: |
314 example: |
303 |
315 |
304 \begin{center} |
316 \begin{equation}\label{regexfour} |
305 $r_4 = der\,c\,r_3 = |
317 r_4 = \textit{der}\,c\,r_3 = |
306 (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$ |
318 (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE) |
307 \end{center} |
319 \end{equation} |
308 |
320 |
309 \noindent Simplifying this regular expression would just give |
321 \noindent Simplifying this regular expression would just give |
310 us $\ONE$. Running $mkeps$ with this regular expression as |
322 us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as |
311 input, however, would then provide us with $Empty$ instead of |
323 input, however, would give us with the value $Empty$ instead of |
312 $Right(Right(Empty))$ that was obtained without the |
324 |
313 simplification. The problem is we need to recreate this more |
325 \[Right(Right(Empty))\] |
314 complicated value, rather than just return $Empty$. |
326 |
315 |
327 \noindent |
316 This will require what I call \emph{rectification functions}. |
328 that was obtained without the simplification. The problem is we need |
317 They need to be calculated whenever a regular expression gets |
329 to recreate this more complicated value, rather than just return |
318 simplified. Rectification functions take a value as argument |
330 $Empty$. This will require what I call \emph{rectification functions}. |
319 and return a (rectified) value. Let us first take a look again |
331 They need to be calculated whenever a regular expression gets |
320 at our simplification rules: |
332 simplified. |
|
333 |
|
334 Rectification functions take a value as argument and return a |
|
335 (rectified) value. In the example above the argument would be $Empty$ |
|
336 and the output $Right(Right(Empty))$. Before we define these |
|
337 rectification functions in general, let us first take a look again at |
|
338 our simplification rules: |
321 |
339 |
322 \begin{center} |
340 \begin{center} |
323 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l} |
341 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l} |
324 $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ |
342 $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ |
325 $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ |
343 $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ |
389 Let us package this idea with rectification functions |
407 Let us package this idea with rectification functions |
390 into a single function (still only considering the alternative |
408 into a single function (still only considering the alternative |
391 case): |
409 case): |
392 |
410 |
393 \begin{center} |
411 \begin{center} |
394 \begin{tabular}{l} |
412 \begin{tabular}{ll} |
395 $simp(r)$:\\ |
413 $_1$ & $simp(r)$:\\ |
396 \quad case $r = r_1 + r_2$\\ |
414 $_2$ & \quad case $r = r_1 + r_2$\\ |
397 \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\ |
415 $_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\ |
398 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ |
416 $_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ |
399 \qquad case $r_{1s} = \ZERO$: |
417 $_5$ & \qquad case $r_{1s} = \ZERO$: |
400 return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\ |
418 return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\ |
401 \qquad case $r_{2s} = \ZERO$: |
419 $_6$ & \qquad case $r_{2s} = \ZERO$: |
402 return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\ |
420 return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\ |
403 \qquad case $r_{1s} = r_{2s}$: |
421 $_7$ & \qquad case $r_{1s} = r_{2s}$: |
404 return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\ |
422 return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\ |
405 \qquad otherwise: |
423 $_8$ & \qquad otherwise: |
406 return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\ |
424 return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\ |
407 \end{tabular} |
425 \end{tabular} |
408 \end{center} |
426 \end{center} |
409 |
427 |
410 \noindent We first recursively call the simplification with |
428 \noindent We first recursively call the simplification with |
411 $r_1$ and $r_2$. This gives simplified regular expressions, |
429 $r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions, |
412 $r_{1s}$ and $r_{2s}$, as well as two rectification functions |
430 $r_{1s}$ and $r_{2s}$, as well as two rectification functions |
413 $f_{1s}$ and $f_{2s}$. We next need to test whether the |
431 $f_{1s}$ and $f_{2s}$. We next need to test whether the |
414 simplified regular expressions are $\ZERO$ so as to make |
432 simplified regular expressions are $\ZERO$ so as to make |
415 further simplifications. In case $r_{1s}$ is $\ZERO$, |
433 further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5), |
416 then we can return $r_{2s}$ (the other alternative). However |
434 then we can return $r_{2s}$ (the other alternative). However |
417 for this we need to build a corresponding rectification |
435 for this we need to build a corresponding rectification |
418 function, which as said above is |
436 function, which as said above is |
419 |
437 |
420 \begin{center} |
438 \begin{center} |
560 \end{center} |
577 \end{center} |
561 \caption{The simplification function that returns a simplified |
578 \caption{The simplification function that returns a simplified |
562 regular expression and a rectification function.\label{simp}} |
579 regular expression and a rectification function.\label{simp}} |
563 \end{figure} |
580 \end{figure} |
564 |
581 |
|
582 \begin{figure}[p] |
|
583 \lstinputlisting{../progs/app61.scala} |
|
584 |
|
585 \caption{The Scala code for the simplification function. The |
|
586 first part defines some auxillary functions for the rectification. |
|
587 The second part give the simplification function. |
|
588 \label{simprect}} |
|
589 \end{figure} |
|
590 |
|
591 |
|
592 We are now in the position to define a \emph{lexing function} as follows: |
|
593 |
565 \begin{center} |
594 \begin{center} |
566 \begin{tabular}{lcl} |
595 \begin{tabular}{lcl} |
567 $lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\ |
596 $lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\ |
568 & & else $error$\\ |
597 & & else $error$\\ |
569 $lex\,r\,c\!::\!s$ & $\dn$ & let |
598 $lex\,r\,c\!::\!s$ & $\dn$ & let |
570 $(r_{simp}, f_{rect}) = simp(der(c, r))$\\ |
599 $(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\ |
571 & & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ |
600 & & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ |
572 \end{tabular} |
601 \end{tabular} |
573 \end{center} |
602 \end{center} |
574 |
603 |
575 \noindent This corresponds to the $matches$ function we have |
604 \noindent This corresponds to the $matches$ function we have |
576 seen in earlier lectures. In the first clause we are given an |
605 seen in earlier lectures. In the first clause we are given an |
577 empty string, $[]$, and need to test wether the regular |
606 empty string, $[]$, and need to test wether the regular |
578 expression is $nullable$. If yes, we can proceed normally and |
607 expression is $nullable$. If yes, we can proceed normally and |
579 just return the value calculated by $mkeps$. The second clause |
608 just return the value calculated by $\textit{mkeps}$. The second clause |
580 is for strings where the first character is $c$, say, and the |
609 is for strings where the first character is $c$, say, and the |
581 rest of the string is $s$. We first build the derivative of |
610 rest of the string is $s$. We first build the derivative of |
582 $r$ with respect to $c$; simplify the resulting regular |
611 $r$ with respect to $c$; simplify the resulting regular |
583 expression. We continue lexing with the simplified regular |
612 expression. We continue lexing with the simplified regular |
584 expression and the string $s$. Whatever will be returned as |
613 expression and the string $s$. Whatever will be returned as |