13 expression or not. Often a more interesting question is to |
13 expression or not. Often a more interesting question is to |
14 find out \emph{how} a regular expression matched a string? |
14 find out \emph{how} a regular expression matched a string? |
15 Answering this question will also help us with the problem we |
15 Answering this question will also help us with the problem we |
16 are after, namely tokenising an input string. |
16 are after, namely tokenising an input string. |
17 |
17 |
18 The algorithm we will be looking at was designed by Sulzmann \& Lu in |
18 The algorithm we will be looking at was designed by Sulzmann |
19 a rather recent paper. A link to it is provided on KEATS, in case you |
19 \& Lu in a rather recent paper (from 2014). A link to it is |
20 are interested.\footnote{In my humble opinion this is an interesting |
20 provided on KEATS, in case you are interested.\footnote{In my |
21 instance of the research literature: it contains a very neat idea, |
21 humble opinion this is an interesting instance of the research |
22 but its presentation is rather sloppy. In earlier versions of their |
22 literature: it contains a very neat idea, but its presentation |
23 paper, students and I found several rather annoying typos in their |
23 is rather sloppy. In earlier versions of their paper, a King's |
24 examples and definitions.} In order to give an answer for how a |
24 student and I found several rather annoying typos in their |
25 regular expression matched a string, Sulzmann and Lu introduce |
25 examples and definitions.} In order to give an answer for |
26 \emph{values}. A value will be the output of the algorithm whenever |
26 \emph{how} a regular expression matches a string, Sulzmann and |
27 the regular expression matches the string. If the string does not |
27 Lu introduce \emph{values}. A value will be the output of the |
28 match the string, an error will be raised. Since the first phase of |
28 algorithm whenever the regular expression matches the string. |
29 the algorithm by Sulzmann \& Lu is identical to the derivative based |
29 If the string does not match the string, an error will be |
30 matcher from the first coursework, the function $nullable$ will be |
30 raised. Since the first phase of the algorithm by Sulzmann \& |
31 used to decide whether as string is matched by a regular |
31 Lu is identical to the derivative based matcher from the first |
32 expression. If $nullable$ says yes, then values are constructed that |
32 coursework, the function $nullable$ will be used to decide |
33 reflect how the regular expression matched the string. The definitions |
33 whether as string is matched by a regular expression. If |
34 for regular expressions $r$ and values $v$ is shown next to each other |
34 $nullable$ says yes, then values are constructed that reflect |
35 below: |
35 how the regular expression matched the string. |
|
36 |
|
37 The definitions for values is given below. They are shown |
|
38 together with the regular expressions $r$ to which |
|
39 they correspond: |
36 |
40 |
37 \begin{center} |
41 \begin{center} |
38 \begin{tabular}{cc} |
42 \begin{tabular}{cc} |
39 \begin{tabular}{@{}rrl@{}} |
43 \begin{tabular}{@{}rrl@{}} |
40 \multicolumn{3}{c}{regular expressions}\medskip\\ |
44 \multicolumn{3}{c}{regular expressions}\medskip\\ |
58 & $\mid$ & $[v_1,\ldots\,v_n]$ \\ |
62 & $\mid$ & $[v_1,\ldots\,v_n]$ \\ |
59 \end{tabular} |
63 \end{tabular} |
60 \end{tabular} |
64 \end{tabular} |
61 \end{center} |
65 \end{center} |
62 |
66 |
63 \noindent The point is that there is a very strong |
67 \noindent The reason is that there is a very strong |
64 correspondence between them. There is no value for the |
68 correspondence between them. There is no value for the |
65 $\varnothing$ regular expression, since it does not match any |
69 $\varnothing$ regular expression, since it does not match any |
66 string. Otherwise there is exactly one value corresponding to |
70 string. Otherwise there is exactly one value corresponding to |
67 each regular expression with the exception of $r_1 + r_2$ |
71 each regular expression with the exception of $r_1 + r_2$ |
68 where there are two values, namely $Left(v)$ and $Right(v)$ |
72 where there are two values, namely $Left(v)$ and $Right(v)$ |
69 corresponding to the two alternatives. Note that $r^*$ is |
73 corresponding to the two alternatives. Note that $r^*$ is |
70 associated with a list of values, one for each copy of $r$ |
74 associated with a list of values, one for each copy of $r$ |
71 that was needed to match the string. This means we might also |
75 that was needed to match the string. This means we might also |
72 return the empty list $[]$, if no copy was needed. |
76 return the empty list $[]$, if no copy was needed in case |
|
77 of $r^*$. For sequence, there is exactly one value, composed |
|
78 of two component values ($v_1$ and $v_2$). |
73 |
79 |
74 To emphasise the connection between regular expressions and |
80 To emphasise the connection between regular expressions and |
75 values, I have in my implementation the convention that |
81 values, I have in my implementation the convention that |
|
82 regular expressions and values have the same name, except that |
76 regular expressions are written entirely with upper-case |
83 regular expressions are written entirely with upper-case |
77 letters, while values just start with a single upper-case |
84 letters, while values just start with a single upper-case |
78 character. My definition of values in Scala is below. I use |
85 character and the rest are lower-case letters. My definition |
79 this in the REPL of Scala; when I use the Scala compiler |
86 of regular expressions and values in Scala is shown below. I use |
80 I need to rename some constructors, because Scala on Macs |
87 this in the REPL of Scala; when I use the Scala compiler I |
81 does not like classes that are called \pcode{EMPTY} and |
88 need to rename some constructors, because Scala on Macs does |
|
89 not like classes that are called \pcode{EMPTY} and |
82 \pcode{Empty}. |
90 \pcode{Empty}. |
|
91 |
|
92 {\small\lstinputlisting[language=Scala,numbers=none] |
|
93 {../progs/app03.scala}} |
|
94 |
83 |
95 |
84 {\small\lstinputlisting[language=Scala,numbers=none] |
96 {\small\lstinputlisting[language=Scala,numbers=none] |
85 {../progs/app02.scala}} |
97 {../progs/app02.scala}} |
86 |
98 |
87 |
99 |
144 regular expression on the left-hand side. This will become |
156 regular expression on the left-hand side. This will become |
145 important later on. |
157 important later on. |
146 |
158 |
147 The second phase of the algorithm is organised so that it will |
159 The second phase of the algorithm is organised so that it will |
148 calculate a value for how the derivative regular expression |
160 calculate a value for how the derivative regular expression |
149 has matched a string whose first character has been chopped |
161 has matched a string. For this we need a function that |
150 off. Now we need a function that reverses this ``chopping |
162 reverses this ``chopping off'' for values which we did in the |
151 off'' for values. The corresponding function is called $inj$ |
163 first phase for derivatives. The corresponding function is |
152 for injection. This function takes three arguments: the first |
164 called $inj$ for injection. This function takes three |
153 one is a regular expression for which we want to calculate the |
165 arguments: the first one is a regular expression for which we |
154 value, the second is the character we want to inject and the |
166 want to calculate the value, the second is the character we |
155 third argument is the value where we will inject the |
167 want to inject and the third argument is the value where we |
156 character. The result of this function is a new value. The |
168 will inject the character into. The result of this function is a |
157 definition of $inj$ is as follows: |
169 new value. The definition of $inj$ is as follows: |
158 |
170 |
159 \begin{center} |
171 \begin{center} |
160 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
172 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
161 $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
173 $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
162 $inj\,(r_1 + r_2)\,c\,Left(v)$ & $\dn$ & $Left(inj\,r_1\,c\,v)$\\ |
174 $inj\,(r_1 + r_2)\,c\,Left(v)$ & $\dn$ & $Left(inj\,r_1\,c\,v)$\\ |
169 \end{center} |
181 \end{center} |
170 |
182 |
171 \noindent This definition is by recursion on the regular |
183 \noindent This definition is by recursion on the regular |
172 expression and by analysing the shape of the values. Therefore |
184 expression and by analysing the shape of the values. Therefore |
173 there are, for example, three cases for sequence regular |
185 there are, for example, three cases for sequence regular |
174 expressions. The last clause for the star regular expression |
186 expressions (for all possible shapes of the value). The last |
175 returns a list where the first element is $inj\,r\,c\,v$ and |
187 clause for the star regular expression returns a list where |
176 the other elements are $vs$. That means $\_\,::\,\_$ should be |
188 the first element is $inj\,r\,c\,v$ and the other elements are |
177 read as list cons. |
189 $vs$. That means $\_\,::\,\_$ should be read as list cons. |
178 |
190 |
179 To understand what is going on, it might be best to do some |
191 To understand what is going on, it might be best to do some |
180 example calculations and compare them with Figure~\ref{Sulz}. |
192 example calculations and compare them with Figure~\ref{Sulz}. |
181 For this note that we have not yet dealt with the need of |
193 For this note that we have not yet dealt with the need of |
182 simplifying regular expressions (this will be a topic on its |
194 simplifying regular expressions (this will be a topic on its |
192 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\ |
204 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\ |
193 \end{tabular} |
205 \end{tabular} |
194 \end{center} |
206 \end{center} |
195 |
207 |
196 \noindent According to the simple algorithm, we would test |
208 \noindent According to the simple algorithm, we would test |
197 whether $r_4$ is nullable, which in this case it is. This |
209 whether $r_4$ is nullable, which in this case it indeed is. |
198 means we can use the function $mkeps$ to calculate a value for |
210 This means we can use the function $mkeps$ to calculate a |
199 how $r_4$ was able to match the empty string. Remember that |
211 value for how $r_4$ was able to match the empty string. |
200 this function gives preference for alternatives on the |
212 Remember that this function gives preference for alternatives |
201 left-hand side. However there is only $\epsilon$ on the very |
213 on the left-hand side. However there is only $\epsilon$ on the |
202 right-hand side of $r_4$ that matches the empty string. |
214 very right-hand side of $r_4$ that matches the empty string. |
203 Therefore $mkeps$ returns the value |
215 Therefore $mkeps$ returns the value |
204 |
216 |
205 \begin{center} |
217 \begin{center} |
206 $v_4:\;Right(Right(Empty))$ |
218 $v_4:\;Right(Right(Empty))$ |
207 \end{center} |
219 \end{center} |
208 |
220 |
209 \noindent The point is that from this value we can directly |
221 \noindent If there had been a $\epsilon$ on the left, then |
210 read off which part of $r_4$ matched the empty string: take |
222 $mkeps$ would have returned something of the form |
211 the right-alternative first, and then the right-alternative |
223 $Left(\ldots)$. The point is that from this value we can |
212 again. |
224 directly read off which part of $r_4$ matched the empty |
|
225 string: take the right-alternative first, and then the |
|
226 right-alternative again. |
213 |
227 |
214 Next we have to ``inject'' the last character, that is $c$ in |
228 Next we have to ``inject'' the last character, that is $c$ in |
215 the running example, into this value $v_4$ in order to |
229 the running example, into this value $v_4$ in order to |
216 calculate how $r_3$ could have matched the string $c$. |
230 calculate how $r_3$ could have matched the string $c$. |
217 According to the definition of $inj$ we obtain |
231 According to the definition of $inj$ we obtain |
255 $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ |
269 $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ |
256 $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\ |
270 $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\ |
257 \end{tabular} |
271 \end{tabular} |
258 \end{center} |
272 \end{center} |
259 |
273 |
260 \noindent Using flatten we can see what is the string behind |
274 \noindent Using flatten we can see what are the strings behind |
261 the values calculated by $mkeps$ and $inj$ in our running |
275 the values calculated by $mkeps$ and $inj$. In our running |
262 example are: |
276 example: |
263 |
277 |
264 \begin{center} |
278 \begin{center} |
265 \begin{tabular}{ll} |
279 \begin{tabular}{ll} |
266 $|v_4|$: & $[]$\\ |
280 $|v_4|$: & $[]$\\ |
267 $|v_3|$: & $c$\\ |
281 $|v_3|$: & $c$\\ |
269 $|v_1|$: & $abc$ |
283 $|v_1|$: & $abc$ |
270 \end{tabular} |
284 \end{tabular} |
271 \end{center} |
285 \end{center} |
272 |
286 |
273 \noindent This indicates that $inj$ indeed is injecting, or |
287 \noindent This indicates that $inj$ indeed is injecting, or |
274 adding, back a character into the value. Next we look at how |
288 adding, back a character into the value. If we continue until |
275 simplification can be included into this algorithm. |
289 all characters are injected back, we have a value that can |
|
290 indeed say how the string $abc$ was matched. |
|
291 |
|
292 There is a problem, however, with the described algorithm |
|
293 so far: it is very slow. We need to include the simplification |
|
294 from Lecture 2. This is what we shall do next. |
276 |
295 |
277 |
296 |
278 \subsubsection*{Simplification} |
297 \subsubsection*{Simplification} |
279 |
298 |
280 Generally the matching algorithms based on derivatives do |
299 Generally the matching algorithms based on derivatives do |