1 % Chapter Template |
1 % Chapter Template |
2 |
2 |
3 % Main chapter title |
3 % Main chapter title |
4 \chapter{Correctness of Bit-coded Algorithm without Simplification} |
4 \chapter{Bit-coded Algorithm of Sulzmann and Lu} |
5 |
5 |
6 \label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} |
6 \label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} |
7 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive |
7 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive |
8 %simplifications and therefore introduce our version of the bitcoded algorithm and |
8 %simplifications and therefore introduce our version of the bitcoded algorithm and |
9 %its correctness proof in |
9 %its correctness proof in |
10 %Chapter 3\ref{Chapter3}. |
10 %Chapter 3\ref{Chapter3}. |
11 |
11 In this chapter, we are going to introduce the bit-coded algorithm |
|
12 introduced by Sulzmann and Lu to address the problem of |
|
13 under-simplified regular expressions. |
12 \section{Bit-coded Algorithm} |
14 \section{Bit-coded Algorithm} |
13 |
|
14 |
|
15 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure}, |
15 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure}, |
16 stores information of previous lexing steps |
16 stores information of previous lexing steps |
17 on a stack, in the form of regular expressions |
17 on a stack, in the form of regular expressions |
18 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. |
18 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. |
19 \begin{envForCaption} |
19 \begin{envForCaption} |
201 $\textit{decode}\,bs\,r$ & $\dn$ & |
202 $\textit{decode}\,bs\,r$ & $\dn$ & |
202 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
203 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
203 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
204 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
204 \textit{else}\;\textit{None}$ |
205 \textit{else}\;\textit{None}$ |
205 \end{tabular} |
206 \end{tabular} |
206 \end{center} |
207 \end{center} |
207 \caption{Bit-decoding of values} |
|
208 \end{envForCaption} |
208 \end{envForCaption} |
209 %\end{definition} |
209 %\end{definition} |
210 |
210 |
211 Sulzmann and Lu's integrated the bitcodes into regular expressions to |
211 \noindent |
212 create annotated regular expressions \cite{Sulzmann2014}. |
212 $\decode'$ does most of the job while $\decode$ throws |
213 \emph{Annotated regular expressions} are defined by the following |
213 away leftover bit-codes and returns the value only. |
214 grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$} |
214 $\decode$ is terminating as $\decode'$ is terminating. |
|
215 We have the property that $\decode$ and $\code$ are |
|
216 reverse operations of one another: |
|
217 \begin{lemma} |
|
218 \[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \] |
|
219 \end{lemma} |
|
220 \begin{proof} |
|
221 By proving a more general version of the lemma, on $\decode'$: |
|
222 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \] |
|
223 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition |
|
224 we get the lemma. |
|
225 \end{proof} |
|
226 With the $\code$ and $\decode$ functions in hand, we know how to |
|
227 switch between bit-codes and value--the two different representations of |
|
228 lexing information. |
|
229 The next step is to integrate this information into the working regular expression. |
|
230 Attaching bits to the front of regular expressions is the solution Sulzamann and Lu |
|
231 gave for storing partial values on the fly: |
215 |
232 |
216 \begin{center} |
233 \begin{center} |
217 \begin{tabular}{lcl} |
234 \begin{tabular}{lcl} |
218 $\textit{a}$ & $::=$ & $\ZERO$\\ |
235 $\textit{a}$ & $::=$ & $\ZERO$\\ |
219 & $\mid$ & $_{bs}\ONE$\\ |
236 & $\mid$ & $_{bs}\ONE$\\ |
224 \end{tabular} |
241 \end{tabular} |
225 \end{center} |
242 \end{center} |
226 %(in \textit{ALTS}) |
243 %(in \textit{ALTS}) |
227 |
244 |
228 \noindent |
245 \noindent |
229 where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular |
246 We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}. |
|
247 $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular |
230 expressions and $as$ for a list of annotated regular expressions. |
248 expressions and $as$ for a list of annotated regular expressions. |
231 The alternative constructor($\sum$) has been generalized to |
249 The alternative constructor ($\sum$) has been generalised to |
232 accept a list of annotated regular expressions rather than just 2. |
250 accept a list of annotated regular expressions rather than just 2. |
233 We will show that these bitcodes encode information about |
251 %We will show that these bitcodes encode information about |
234 the (POSIX) value that should be generated by the Sulzmann and Lu |
252 %the ($\POSIX$) value that should be generated by the Sulzmann and Lu |
235 algorithm. |
253 %algorithm. |
236 |
254 The most central question is how these partial lexing information |
237 |
255 represented as bit-codes is augmented and carried around |
238 To do lexing using annotated regular expressions, we shall first |
256 during a derivative is taken. |
239 transform the usual (un-annotated) regular expressions into annotated |
257 |
240 regular expressions. This operation is called \emph{internalisation} and |
258 This is done by adding bitcodes to the |
241 defined as follows: |
259 derivatives, for example when one more star iteratoin is taken (we |
242 |
260 call the operation of derivatives on annotated regular expressions $\bder$ |
243 %\begin{definition} |
261 because it is derivatives on regexes with bitcodes): |
244 \begin{center} |
262 \begin{center} |
245 \begin{tabular}{lcl} |
263 \begin{tabular}{@{}lcl@{}} |
246 $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ |
264 $\bder \; c\; (_{bs}a^*) $ & $\dn$ & |
247 $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ |
265 $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot |
248 $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ |
266 (_{[]}a^*))$ |
249 $(r_1 + r_2)^\uparrow$ & $\dn$ & |
|
250 $_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\, |
|
251 \textit{fuse}\,[S]\,r_2^\uparrow]$\\ |
|
252 $(r_1\cdot r_2)^\uparrow$ & $\dn$ & |
|
253 $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ |
|
254 $(r^*)^\uparrow$ & $\dn$ & |
|
255 $_{[]}(r^\uparrow)^*$\\ |
|
256 \end{tabular} |
267 \end{tabular} |
257 \end{center} |
268 \end{center} |
258 %\end{definition} |
269 |
259 |
270 \noindent |
260 \noindent |
271 For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when |
261 We use up arrows here to indicate that the basic un-annotated regular |
272 there is no danger of confusion with derivatives on plain regular expressions, |
262 expressions are ``lifted up'' into something slightly more complex. In the |
273 for example, the above can be expressed as |
263 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to |
274 \begin{center} |
264 attach bits to the front of an annotated regular expression. Its |
275 \begin{tabular}{@{}lcl@{}} |
265 definition is as follows: |
276 $(_{bs}a^*)\,\backslash c$ & $\dn$ & |
266 |
277 $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot |
|
278 (_{[]}a^*))$ |
|
279 \end{tabular} |
|
280 \end{center} |
|
281 |
|
282 |
|
283 Using the picture we used earlier to depict this, the transformation when |
|
284 taking a derivative w.r.t a star is like below: |
|
285 \centering |
|
286 \begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}} |
|
287 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
288 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
289 {$bs$ |
|
290 \nodepart{two} $a^*$ }; |
|
291 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
292 \end{tikzpicture} |
|
293 & |
|
294 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
295 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
296 {$v_{\text{previous iterations}}$ |
|
297 \nodepart{two} $a^*$}; |
|
298 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
299 \end{tikzpicture} |
|
300 \\ |
|
301 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
302 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
303 { $bs$ + [Z] |
|
304 \nodepart{two} $(a\backslash c )\cdot a^*$ }; |
|
305 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
306 \end{tikzpicture} |
|
307 & |
|
308 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
309 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
310 {$v_{\text{previous iterations}}$ + 1 more iteration |
|
311 \nodepart{two} $(a\backslash c )\cdot a^*$ }; |
|
312 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
313 \end{tikzpicture} |
|
314 \end{tabular} |
|
315 \noindent |
|
316 The operation $\fuse$ is just to attach bit-codes |
|
317 to the front of an annotated regular expression: |
267 \begin{center} |
318 \begin{center} |
268 \begin{tabular}{lcl} |
319 \begin{tabular}{lcl} |
269 $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ |
320 $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ |
270 $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & |
321 $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & |
271 $_{bs @ bs'}\ONE$\\ |
322 $_{bs @ bs'}\ONE$\\ |
276 $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & |
327 $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & |
277 $_{bs@bs'}a_1 \cdot a_2$\\ |
328 $_{bs@bs'}a_1 \cdot a_2$\\ |
278 $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & |
329 $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & |
279 $_{bs @ bs'}a^*$ |
330 $_{bs @ bs'}a^*$ |
280 \end{tabular} |
331 \end{tabular} |
281 \end{center} |
332 \end{center} |
282 |
333 |
283 \noindent |
334 \noindent |
284 After internalising the regular expression, we perform successive |
335 Another place in the $\bder$ function where it differs |
285 derivative operations on the annotated regular expressions. This |
336 from normal derivatives on un-annotated regular expressions |
286 derivative operation is the same as what we had previously for the |
337 is the sequence case: |
287 basic regular expressions, except that we beed to take care of |
|
288 the bitcodes: |
|
289 |
|
290 |
|
291 \iffalse |
|
292 %\begin{definition}{bder} |
|
293 \begin{center} |
338 \begin{center} |
294 \begin{tabular}{@{}lcl@{}} |
339 \begin{tabular}{@{}lcl@{}} |
295 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
340 |
296 $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
341 $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & |
297 $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & |
|
298 $\textit{if}\;c=d\; \;\textit{then}\; |
|
299 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ |
|
300 $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & |
|
301 $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ |
|
302 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
|
303 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
342 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
304 & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ |
343 & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ |
305 & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ |
344 & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ |
306 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ |
345 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$ |
307 $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & |
|
308 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, |
|
309 (\textit{STAR}\,[]\,r)$ |
|
310 \end{tabular} |
346 \end{tabular} |
311 \end{center} |
347 \end{center} |
312 %\end{definition} |
348 Here |
313 |
349 |
314 \begin{center} |
|
315 \begin{tabular}{@{}lcl@{}} |
|
316 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
317 $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
318 $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ & |
|
319 $\textit{if}\;c=d\; \;\textit{then}\; |
|
320 _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\ |
|
321 $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ & |
|
322 $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\ |
|
323 $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ & |
|
324 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
325 & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\ |
|
326 & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ |
|
327 & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\ |
|
328 $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ & |
|
329 $_{bs}\textit{SEQ}\;(\textit{fuse}\, [Z] \; r\,\backslash c )\, |
|
330 (_{bs}\textit{STAR}\,[]\,r)$ |
|
331 \end{tabular} |
|
332 \end{center} |
|
333 %\end{definition} |
|
334 \fi |
|
335 |
350 |
336 \begin{center} |
351 \begin{center} |
337 \begin{tabular}{@{}lcl@{}} |
352 \begin{tabular}{@{}lcl@{}} |
338 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
353 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
339 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
354 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
350 $(_{bs}a^*)\,\backslash c$ & $\dn$ & |
365 $(_{bs}a^*)\,\backslash c$ & $\dn$ & |
351 $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot |
366 $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot |
352 (_{[]}r^*))$ |
367 (_{[]}r^*))$ |
353 \end{tabular} |
368 \end{tabular} |
354 \end{center} |
369 \end{center} |
|
370 |
|
371 |
|
372 To do lexing using annotated regular expressions, we shall first |
|
373 transform the usual (un-annotated) regular expressions into annotated |
|
374 regular expressions. This operation is called \emph{internalisation} and |
|
375 defined as follows: |
|
376 |
|
377 %\begin{definition} |
|
378 \begin{center} |
|
379 \begin{tabular}{lcl} |
|
380 $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ |
|
381 $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ |
|
382 $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ |
|
383 $(r_1 + r_2)^\uparrow$ & $\dn$ & |
|
384 $_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\, |
|
385 \textit{fuse}\,[S]\,r_2^\uparrow]$\\ |
|
386 $(r_1\cdot r_2)^\uparrow$ & $\dn$ & |
|
387 $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ |
|
388 $(r^*)^\uparrow$ & $\dn$ & |
|
389 $_{[]}(r^\uparrow)^*$\\ |
|
390 \end{tabular} |
|
391 \end{center} |
|
392 %\end{definition} |
|
393 |
|
394 \noindent |
|
395 We use up arrows here to indicate that the basic un-annotated regular |
|
396 expressions are ``lifted up'' into something slightly more complex. In the |
|
397 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to |
|
398 attach bits to the front of an annotated regular expression. Its |
|
399 definition is as follows: |
|
400 |
|
401 |
|
402 |
|
403 \noindent |
|
404 After internalising the regular expression, we perform successive |
|
405 derivative operations on the annotated regular expressions. This |
|
406 derivative operation is the same as what we had previously for the |
|
407 basic regular expressions, except that we beed to take care of |
|
408 the bitcodes: |
|
409 |
|
410 |
|
411 |
|
412 |
355 |
413 |
356 %\end{definition} |
414 %\end{definition} |
357 \noindent |
415 \noindent |
358 For instance, when we do derivative of $_{bs}a^*$ with respect to c, |
416 For instance, when we do derivative of $_{bs}a^*$ with respect to c, |
359 we need to unfold it into a sequence, |
417 we need to unfold it into a sequence, |
414 |
472 |
415 \noindent |
473 \noindent |
416 In this definition $\_\backslash s$ is the generalisation of the derivative |
474 In this definition $\_\backslash s$ is the generalisation of the derivative |
417 operation from characters to strings (just like the derivatives for un-annotated |
475 operation from characters to strings (just like the derivatives for un-annotated |
418 regular expressions). |
476 regular expressions). |
419 |
|
420 Now we introduce the simplifications, which is why we introduce the |
|
421 bitcodes in the first place. |
|
422 |
|
423 \subsection*{Simplification Rules} |
|
424 |
|
425 This section introduces aggressive (in terms of size) simplification rules |
|
426 on annotated regular expressions |
|
427 to keep derivatives small. Such simplifications are promising |
|
428 as we have |
|
429 generated test data that show |
|
430 that a good tight bound can be achieved. We could only |
|
431 partially cover the search space as there are infinitely many regular |
|
432 expressions and strings. |
|
433 |
|
434 One modification we introduced is to allow a list of annotated regular |
|
435 expressions in the $\sum$ constructor. This allows us to not just |
|
436 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but |
|
437 also unnecessary ``copies'' of regular expressions (very similar to |
|
438 simplifying $r + r$ to just $r$, but in a more general setting). Another |
|
439 modification is that we use simplification rules inspired by Antimirov's |
|
440 work on partial derivatives. They maintain the idea that only the first |
|
441 ``copy'' of a regular expression in an alternative contributes to the |
|
442 calculation of a POSIX value. All subsequent copies can be pruned away from |
|
443 the regular expression. A recursive definition of our simplification function |
|
444 that looks somewhat similar to our Scala code is given below: |
|
445 %\comment{Use $\ZERO$, $\ONE$ and so on. |
|
446 %Is it $ALTS$ or $ALTS$?}\\ |
|
447 |
|
448 \begin{center} |
|
449 \begin{tabular}{@{}lcl@{}} |
|
450 |
|
451 $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
|
452 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ |
|
453 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
|
454 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
|
455 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
|
456 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ |
|
457 |
|
458 $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\ |
|
459 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
|
460 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
|
461 &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ |
|
462 |
|
463 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
|
464 \end{tabular} |
|
465 \end{center} |
|
466 |
|
467 \noindent |
|
468 The simplification does a pattern matching on the regular expression. |
|
469 When it detected that the regular expression is an alternative or |
|
470 sequence, it will try to simplify its child regular expressions |
|
471 recursively and then see if one of the children turns into $\ZERO$ or |
|
472 $\ONE$, which might trigger further simplification at the current level. |
|
473 The most involved part is the $\sum$ clause, where we use two |
|
474 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested |
|
475 alternatives and reduce as many duplicates as possible. Function |
|
476 $\textit{distinct}$ keeps the first occurring copy only and removes all later ones |
|
477 when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. |
|
478 Its recursive definition is given below: |
|
479 |
|
480 \begin{center} |
|
481 \begin{tabular}{@{}lcl@{}} |
|
482 $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; |
|
483 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
|
484 $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ |
|
485 $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) |
|
486 \end{tabular} |
|
487 \end{center} |
|
488 |
|
489 \noindent |
|
490 Here $\textit{flatten}$ behaves like the traditional functional programming flatten |
|
491 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it |
|
492 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. |
|
493 |
|
494 Having defined the $\simp$ function, |
|
495 we can use the previous notation of natural |
|
496 extension from derivative w.r.t.~character to derivative |
|
497 w.r.t.~string:%\comment{simp in the [] case?} |
|
498 |
|
499 \begin{center} |
|
500 \begin{tabular}{lcl} |
|
501 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ |
|
502 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
|
503 \end{tabular} |
|
504 \end{center} |
|
505 |
|
506 \noindent |
|
507 to obtain an optimised version of the algorithm: |
|
508 |
|
509 \begin{center} |
|
510 \begin{tabular}{lcl} |
|
511 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
512 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
|
513 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
514 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
515 & & $\;\;\textit{else}\;\textit{None}$ |
|
516 \end{tabular} |
|
517 \end{center} |
|
518 |
|
519 \noindent |
|
520 This algorithm keeps the regular expression size small, for example, |
|
521 with this simplification our previous $(a + aa)^*$ example's 8000 nodes |
|
522 will be reduced to just 17 and stays constant, no matter how long the |
|
523 input string is. |
|
524 |
|
525 |
|
526 |
|
527 |
|
528 |
|
529 |
|
530 |
|
531 |
|
532 |
477 |
533 |
478 |
534 |
479 |
535 %----------------------------------- |
480 %----------------------------------- |
536 % SUBSECTION 1 |
481 % SUBSECTION 1 |