246 We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}. |
246 We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}. |
247 $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular |
247 $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular |
248 expressions and $as$ for a list of annotated regular expressions. |
248 expressions and $as$ for a list of annotated regular expressions. |
249 The alternative constructor ($\sum$) has been generalised to |
249 The alternative constructor ($\sum$) has been generalised to |
250 accept a list of annotated regular expressions rather than just 2. |
250 accept a list of annotated regular expressions rather than just 2. |
251 %We will show that these bitcodes encode information about |
251 |
252 %the ($\POSIX$) value that should be generated by the Sulzmann and Lu |
252 |
253 %algorithm. |
253 The first thing we define related to bit-coded regular expressions |
254 The most central question is how these partial lexing information |
254 is how we move bits, for instance pasting it at the front of an annotated regex. |
255 represented as bit-codes is augmented and carried around |
|
256 during a derivative is taken. |
|
257 |
|
258 This is done by adding bitcodes to the |
|
259 derivatives, for example when one more star iteratoin is taken (we |
|
260 call the operation of derivatives on annotated regular expressions $\bder$ |
|
261 because it is derivatives on regexes with bitcodes): |
|
262 \begin{center} |
|
263 \begin{tabular}{@{}lcl@{}} |
|
264 $\bder \; c\; (_{bs}a^*) $ & $\dn$ & |
|
265 $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot |
|
266 (_{[]}a^*))$ |
|
267 \end{tabular} |
|
268 \end{center} |
|
269 |
|
270 \noindent |
|
271 For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when |
|
272 there is no danger of confusion with derivatives on plain regular expressions, |
|
273 for example, the above can be expressed as |
|
274 \begin{center} |
|
275 \begin{tabular}{@{}lcl@{}} |
|
276 $(_{bs}a^*)\,\backslash c$ & $\dn$ & |
|
277 $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot |
|
278 (_{[]}a^*))$ |
|
279 \end{tabular} |
|
280 \end{center} |
|
281 |
|
282 |
|
283 Using the picture we used earlier to depict this, the transformation when |
|
284 taking a derivative w.r.t a star is like below: |
|
285 \centering |
|
286 \begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}} |
|
287 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
288 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
289 {$bs$ |
|
290 \nodepart{two} $a^*$ }; |
|
291 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
292 \end{tikzpicture} |
|
293 & |
|
294 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
295 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
296 {$v_{\text{previous iterations}}$ |
|
297 \nodepart{two} $a^*$}; |
|
298 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
299 \end{tikzpicture} |
|
300 \\ |
|
301 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
302 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
303 { $bs$ + [Z] |
|
304 \nodepart{two} $(a\backslash c )\cdot a^*$ }; |
|
305 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
306 \end{tikzpicture} |
|
307 & |
|
308 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
309 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
310 {$v_{\text{previous iterations}}$ + 1 more iteration |
|
311 \nodepart{two} $(a\backslash c )\cdot a^*$ }; |
|
312 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
313 \end{tikzpicture} |
|
314 \end{tabular} |
|
315 \noindent |
|
316 The operation $\fuse$ is just to attach bit-codes |
255 The operation $\fuse$ is just to attach bit-codes |
317 to the front of an annotated regular expression: |
256 to the front of an annotated regular expression: |
318 \begin{center} |
257 \begin{center} |
319 \begin{tabular}{lcl} |
258 \begin{tabular}{lcl} |
320 $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ |
259 $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ |
330 $_{bs @ bs'}a^*$ |
269 $_{bs @ bs'}a^*$ |
331 \end{tabular} |
270 \end{tabular} |
332 \end{center} |
271 \end{center} |
333 |
272 |
334 \noindent |
273 \noindent |
335 Another place in the $\bder$ function where it differs |
274 With that we are able to define $\internalise$. |
336 from normal derivatives on un-annotated regular expressions |
|
337 is the sequence case: |
|
338 \begin{center} |
|
339 \begin{tabular}{@{}lcl@{}} |
|
340 |
|
341 $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & |
|
342 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
343 & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ |
|
344 & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ |
|
345 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$ |
|
346 \end{tabular} |
|
347 \end{center} |
|
348 Here |
|
349 |
|
350 |
|
351 \begin{center} |
|
352 \begin{tabular}{@{}lcl@{}} |
|
353 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
|
354 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
|
355 $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & |
|
356 $\textit{if}\;c=d\; \;\textit{then}\; |
|
357 _{bs}\ONE\;\textit{else}\;\ZERO$\\ |
|
358 $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & |
|
359 $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\ |
|
360 $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & |
|
361 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
362 & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ |
|
363 & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ |
|
364 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ |
|
365 $(_{bs}a^*)\,\backslash c$ & $\dn$ & |
|
366 $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot |
|
367 (_{[]}r^*))$ |
|
368 \end{tabular} |
|
369 \end{center} |
|
370 |
|
371 |
275 |
372 To do lexing using annotated regular expressions, we shall first |
276 To do lexing using annotated regular expressions, we shall first |
373 transform the usual (un-annotated) regular expressions into annotated |
277 transform the usual (un-annotated) regular expressions into annotated |
374 regular expressions. This operation is called \emph{internalisation} and |
278 regular expressions. This operation is called \emph{internalisation} and |
375 defined as follows: |
279 defined as follows: |
390 \end{tabular} |
294 \end{tabular} |
391 \end{center} |
295 \end{center} |
392 %\end{definition} |
296 %\end{definition} |
393 |
297 |
394 \noindent |
298 \noindent |
395 We use up arrows here to indicate that the basic un-annotated regular |
299 We use an up arrow with postfix notation |
396 expressions are ``lifted up'' into something slightly more complex. In the |
300 to denote the operation, |
397 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to |
301 for convenience. The $\textit{internalise} \; r$ |
398 attach bits to the front of an annotated regular expression. Its |
302 notation is more cumbersome. |
399 definition is as follows: |
303 The opposite of $\textit{internalise}$ is |
400 |
304 $\erase$, where all the bit-codes are removed, |
401 |
305 and the alternative operator $\sum$ for annotated |
402 |
306 regular expressions is transformed to the binary alternatives |
403 \noindent |
307 for plain regular expressions. |
404 After internalising the regular expression, we perform successive |
308 \begin{center} |
405 derivative operations on the annotated regular expressions. This |
309 \begin{tabular}{lcr} |
406 derivative operation is the same as what we had previously for the |
310 $\erase \; \ZERO$ & $\dn$ & $\ZERO$\\ |
407 basic regular expressions, except that we beed to take care of |
311 $\erase \; _{bs}\ONE$ & $\dn$ & $\ONE$\\ |
408 the bitcodes: |
312 $\erase \; _{bs}\mathbf{c}$ & $\dn$ & $\mathbf{c}$\\ |
409 |
313 $\erase \; _{bs} a_1 \cdot a_2$ & $\dn$ & $\erase \; r_1\cdot\erase\; r_2$\\ |
410 |
314 $\erase \; _{bs} [] $ & $\dn$ & $\ZERO$\\ |
411 |
315 $\erase \; _{bs} [a] $ & $\dn$ & $\erase \; a$\\ |
412 |
316 $\erase \; _{bs} \sum [a_1, \; a_2]$ & $\dn$ & $\erase \; a_1 +\erase \; a_2$\\ |
413 |
317 $\erase \; _{bs} \sum (a :: as)$ & $\dn$ & $\erase \; a + \erase \; _{[]} \sum as$\\ |
414 %\end{definition} |
318 $\erase \; _{bs} a^*$ & $\dn$ & $(\erase \; a)^*$ |
415 \noindent |
319 \end{tabular} |
416 For instance, when we do derivative of $_{bs}a^*$ with respect to c, |
320 \end{center} |
417 we need to unfold it into a sequence, |
321 \noindent |
418 and attach an additional bit $Z$ to the front of $r \backslash c$ |
322 We also abbreviate the $\erase\; a$ operation |
419 to indicate one more star iteration. Also the sequence clause |
323 as $a_\downarrow$, for conciseness. |
420 is more subtle---when $a_1$ is $\textit{bnullable}$ (here |
324 For bit-coded regular expressions, as a different datatype, |
421 \textit{bnullable} is exactly the same as $\textit{nullable}$, except |
325 testing whether they contain empty string in their lauguage requires |
422 that it is for annotated regular expressions, therefore we omit the |
326 a dedicated function $\bnullable$ |
423 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how |
327 which simply calls $\erase$ first before testing whether it is $\nullable$. |
424 $a_1$ matches the string prior to character $c$ (more on this later), |
328 \begin{center} |
425 then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \; a_1 (a_2 |
329 \begin{tabular}{lcr} |
426 \backslash c)$ will collapse the regular expression $a_1$(as it has |
330 $\bnullable \; a $ & $\dn$ & $\nullable \; (a_\downarrow)$ |
427 already been fully matched) and store the parsing information at the |
331 \end{tabular} |
428 head of the regular expression $a_2 \backslash c$ by fusing to it. The |
332 \end{center} |
429 bitsequence $\textit{bs}$, which was initially attached to the |
333 The function for collecting the |
430 first element of the sequence $a_1 \cdot a_2$, has |
334 bitcodes of a $\bnullable$ annotated regular expression |
431 now been elevated to the top-level of $\sum$, as this information will be |
335 is a generalised version of the $\textit{mkeps}$ function |
432 needed whichever way the sequence is matched---no matter whether $c$ belongs |
|
433 to $a_1$ or $ a_2$. After building these derivatives and maintaining all |
|
434 the lexing information, we complete the lexing by collecting the |
|
435 bitcodes using a generalised version of the $\textit{mkeps}$ function |
|
436 for annotated regular expressions, called $\textit{bmkeps}$: |
336 for annotated regular expressions, called $\textit{bmkeps}$: |
437 |
337 |
438 |
338 |
439 %\begin{definition}[\textit{bmkeps}]\mbox{} |
339 %\begin{definition}[\textit{bmkeps}]\mbox{} |
440 \begin{center} |
340 \begin{center} |
441 \begin{tabular}{lcl} |
341 \begin{tabular}{lcl} |
442 $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ |
342 $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ |
443 $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & |
343 $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & |
444 $\textit{if}\;\textit{bnullable}\,a$\\ |
344 $\textit{if}\;\textit{bnullable}\,a$\\ |
445 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\ |
345 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\ |
446 & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\ |
346 & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{[]}\sum \textit{as})$\\ |
447 $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ & |
347 $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ & |
448 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ |
348 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ |
449 $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & |
349 $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & |
450 $bs \,@\, [Z]$ |
350 $bs \,@\, [Z]$ |
451 \end{tabular} |
351 \end{tabular} |
454 |
354 |
455 \noindent |
355 \noindent |
456 This function completes the value information by travelling along the |
356 This function completes the value information by travelling along the |
457 path of the regular expression that corresponds to a POSIX value and |
357 path of the regular expression that corresponds to a POSIX value and |
458 collecting all the bitcodes, and using $S$ to indicate the end of star |
358 collecting all the bitcodes, and using $S$ to indicate the end of star |
459 iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and |
359 iterations. |
460 decode them, we get the value we expect. The corresponding lexing |
360 |
461 algorithm looks as follows: |
361 The most central question is how these partial lexing information |
|
362 represented as bit-codes is augmented and carried around |
|
363 during a derivative is taken. |
|
364 This is done by adding bitcodes to the |
|
365 derivatives, for example when one more star iteratoin is taken (we |
|
366 call the operation of derivatives on annotated regular expressions $\bder$ |
|
367 because it is derivatives on regexes with \emph{b}itcodes), |
|
368 we need to unfold it into a sequence, |
|
369 and attach an additional bit $Z$ to the front of $r \backslash c$ |
|
370 to indicate one more star iteration. |
|
371 \begin{center} |
|
372 \begin{tabular}{@{}lcl@{}} |
|
373 $\bder \; c\; (_{bs}a^*) $ & $\dn$ & |
|
374 $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot |
|
375 (_{[]}a^*))$ |
|
376 \end{tabular} |
|
377 \end{center} |
|
378 |
|
379 \noindent |
|
380 For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when |
|
381 there is no danger of confusion with derivatives on plain regular expressions, |
|
382 for example, the above can be expressed as |
|
383 \begin{center} |
|
384 \begin{tabular}{@{}lcl@{}} |
|
385 $(_{bs}a^*)\,\backslash c$ & $\dn$ & |
|
386 $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot |
|
387 (_{[]}a^*))$ |
|
388 \end{tabular} |
|
389 \end{center} |
|
390 |
|
391 \noindent |
|
392 Using the picture we used earlier to depict this, the transformation when |
|
393 taking a derivative w.r.t a star is like below: |
|
394 |
|
395 \begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}} |
|
396 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
397 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
398 {$bs$ |
|
399 \nodepart{two} $a^*$ }; |
|
400 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
401 \end{tikzpicture} |
|
402 & |
|
403 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
404 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
405 {$v_{\text{previous iterations}}$ |
|
406 \nodepart{two} $a^*$}; |
|
407 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
408 \end{tikzpicture} |
|
409 \\ |
|
410 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
411 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
412 { $bs$ + [Z] |
|
413 \nodepart{two} $(a\backslash c )\cdot a^*$ }; |
|
414 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
415 \end{tikzpicture} |
|
416 & |
|
417 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
|
418 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
|
419 {$v_{\text{previous iterations}}$ + 1 more iteration |
|
420 \nodepart{two} $(a\backslash c )\cdot a^*$ }; |
|
421 %\caption{term 1 \ref{term:1}'s matching configuration} |
|
422 \end{tikzpicture} |
|
423 \end{tabular} |
|
424 \noindent |
|
425 Another place in the $\bder$ function where it differs |
|
426 from normal derivatives (on un-annotated regular expressions) |
|
427 is the sequence case: |
|
428 \begin{center} |
|
429 \begin{tabular}{@{}lcl@{}} |
|
430 |
|
431 $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & |
|
432 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
433 & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ |
|
434 & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ |
|
435 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$ |
|
436 \end{tabular} |
|
437 \end{center} |
|
438 |
|
439 The difference is that (when $a_1$ is $\bnullable$) |
|
440 we use $\bmkeps$ to store the lexing information |
|
441 in $a_1$ before collapsing it (as it has been fully matched by string prior to $c$, |
|
442 and attach the collected bit-codes to the front of $a_2$ |
|
443 before throwing away $a_1$. We assume that $\bmkeps$ correctly extracts the bitcode for how $a_1$ |
|
444 matches the string prior to $c$ (more on this later). |
|
445 The bitsequence $\textit{bs}$ which was initially attached to the first element of the sequence |
|
446 $a_1 \cdot a_2$, has now been elevated to the top level of teh $\sum$. |
|
447 This is because this piece of information will be needed whichever way the sequence is matched, |
|
448 regardless of whether $c$ belongs to $a_1$ or $a_2$. |
|
449 |
|
450 In the injection-based lexing, $r_1$ is immediately thrown away in |
|
451 subsequent derivatives on the right branch (when $r_1$ is $\nullable$), |
|
452 \begin{center} |
|
453 $(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ |
|
454 \end{center} |
|
455 \noindent |
|
456 as it knows $r_1$ is stored on stack and available once the recursive |
|
457 call to later derivatives finish. |
|
458 Therefore, if the $\Right$ branch is taken in a $\POSIX$ match, |
|
459 we construct back the sequence value once step back by |
|
460 calling a on $\mkeps(r_1)$. |
|
461 \begin{center} |
|
462 \begin{tabular}{lcr} |
|
463 $\ldots r_1 \cdot r_2$ & $\rightarrow$ & $r_1\cdot r_2 + r_2 \backslash c \ldots $\\ |
|
464 $\ldots \Seq(v_1, v_2) (\Seq(\mkeps(r1), (\inj \; r_2 \; c\; v_{2c})))$ & $\leftarrow$ & $\Right(v_{2c})\ldots$ |
|
465 \end{tabular} |
|
466 \end{center} |
|
467 \noindent |
|
468 The rest of the clauses of $\bder$ is rather similar to |
|
469 $\der$, and is put together here as a wholesome definition |
|
470 for $\bder$: |
|
471 \begin{center} |
|
472 \begin{tabular}{@{}lcl@{}} |
|
473 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
|
474 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
|
475 $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & |
|
476 $\textit{if}\;c=d\; \;\textit{then}\; |
|
477 _{bs}\ONE\;\textit{else}\;\ZERO$\\ |
|
478 $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & |
|
479 $_{bs}\sum\;(\textit{map} \; (\_\backslash c) \; as )$\\ |
|
480 $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & |
|
481 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
482 & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ |
|
483 & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ |
|
484 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ |
|
485 $(_{bs}a^*)\,\backslash c$ & $\dn$ & |
|
486 $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot |
|
487 (_{[]}r^*))$ |
|
488 \end{tabular} |
|
489 \end{center} |
|
490 \noindent |
|
491 Generalising the derivative operation with bitcodes to strings, we have |
|
492 \begin{center} |
|
493 \begin{tabular}{@{}lcl@{}} |
|
494 $a\backslash_s [] $ & $\dn$ & $a$\\ |
|
495 $a\backslash (c :: s) $ & $\dn$ & $(a \backslash c) \backslash_s s$ |
|
496 \end{tabular} |
|
497 \end{center} |
|
498 As we did earlier, we omit the $s$ subscript at $\backslash_s$ when there is no danger |
|
499 of confusion. |
|
500 Putting this all together, we obtain a lexer with bit-coded regular expressions |
|
501 as its internal data structures, which we call $\blexer$: |
462 |
502 |
463 \begin{center} |
503 \begin{center} |
464 \begin{tabular}{lcl} |
504 \begin{tabular}{lcl} |
465 $\textit{blexer}\;r\,s$ & $\dn$ & |
505 $\textit{blexer}\;r\,s$ & $\dn$ & |
466 $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ |
506 $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ |
469 & & $\;\;\textit{else}\;\textit{None}$ |
509 & & $\;\;\textit{else}\;\textit{None}$ |
470 \end{tabular} |
510 \end{tabular} |
471 \end{center} |
511 \end{center} |
472 |
512 |
473 \noindent |
513 \noindent |
474 In this definition $\_\backslash s$ is the generalisation of the derivative |
514 Ausaf and Urban formally proved the correctness of the $\blexer$, namely |
475 operation from characters to strings (just like the derivatives for un-annotated |
515 \begin{conjecture} |
476 regular expressions). |
516 $\blexer \;r \; s = \lexer \; r \; s$. |
477 |
517 \end{conjecture} |
|
518 This was claimed but not formalised in Sulzmann and Lu's work. |
|
519 We introduce the proof later, after we give out all |
|
520 the needed auxiliary functions and definitions |
478 |
521 |
479 |
522 |
480 %----------------------------------- |
523 %----------------------------------- |
481 % SUBSECTION 1 |
524 % SUBSECTION 1 |
482 %----------------------------------- |
525 %----------------------------------- |
483 \section{Specifications of Some Helper Functions} |
526 \section{Specifications of Some Helper Functions} |
484 Here we give some functions' definitions, |
527 The functions we introduce will give a more detailed glimpse into |
485 which we will use later. |
528 the lexing process, which might not be possible |
486 \begin{center} |
529 using $\lexer$ or $\blexer$ themselves. |
487 \begin{tabular}{ccc} |
530 The first function we shall look at is $\retrieve$. |
488 $\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$ |
531 \subsection{$\retrieve$} |
|
532 Our bit-coded lexer "retrieve"s the bitcodes using $\bmkeps$ |
|
533 after we finished doing all the derivatives: |
|
534 \begin{center} |
|
535 \begin{tabular}{lcl} |
|
536 & & $\ldots$\\ |
|
537 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
538 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
539 & & $\ldots$ |
489 \end{tabular} |
540 \end{tabular} |
490 \end{center} |
541 \end{center} |
491 |
542 \noindent |
492 |
543 Recall that $\bmkeps$ looks for the leftmost branch of an alternative |
493 %---------------------------------------------------------------------------------------- |
544 and completes a star's iterations by attaching a $Z$ at the end of the bitcodes |
494 % SECTION correctness proof |
545 extracted. It "retrieves" a sequence by visiting both children and then stitch together |
495 %---------------------------------------------------------------------------------------- |
546 two bitcodes using concatenation. After the entire tree structure of the regular |
496 \section{Correctness of Bit-coded Algorithm (Without Simplification)} |
547 expression has been traversed using the above manner, we get a bitcode encoding the |
497 We now give the proof the correctness of the algorithm with bit-codes. |
548 lexing result. |
498 |
549 We know that this "retrieved" bitcode leads to the correct value after decoding, |
499 Ausaf and Urban cleverly defined an auxiliary function called $\flex$, |
550 which is $v_0$ in the bird's eye view of the injection-based lexing diagram. |
|
551 Now assume we keep every other data structure in the diagram \ref{InjFigure}, |
|
552 and only replace all the plain regular expression by their annotated counterparts, |
|
553 computed during a $\blexer$ run. |
|
554 Then we obtain a diagram for the annotated regular expression derivatives and |
|
555 their corresponding values, though the values are never calculated in $\blexer$. |
|
556 We have that $a_n$ contains all the lexing result information. |
|
557 \vspace{20mm} |
|
558 \begin{center}%\label{graph:injLexer} |
|
559 \begin{tikzcd}[ |
|
560 every matrix/.append style = {name=p}, |
|
561 remember picture, overlay, |
|
562 ] |
|
563 a_0 \arrow[r, "\backslash c_0"] \arrow[d] & a_1 \arrow[r, "\backslash c_1"] \arrow[d] & a_2 \arrow[r, dashed] \arrow[d] & a_n \arrow[d] \\ |
|
564 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
|
565 \end{tikzcd} |
|
566 \begin{tikzpicture}[ |
|
567 remember picture, overlay, |
|
568 E/.style = {ellipse, draw=blue, dashed, |
|
569 inner xsep=4mm,inner ysep=-4mm, rotate=90, fit=#1} |
|
570 ] |
|
571 \node[E = (p-1-1) (p-2-1)] {}; |
|
572 \node[E = (p-1-4) (p-2-4)] {}; |
|
573 \end{tikzpicture} |
|
574 |
|
575 \end{center} |
|
576 \vspace{20mm} |
|
577 \noindent |
|
578 On the other hand, $v_0$ also encodes the correct lexing result, as we have proven for $\lexer$. |
|
579 Encircled in the diagram are the two pairs $v_0, a_0$ and $v_n, a_n$, which both |
|
580 encode the correct lexical result. Though for the leftmost pair, we have |
|
581 the information condensed in $v_0$ the value part, whereas for the rightmost pair, |
|
582 the information is concentrated on $a_n$. |
|
583 We know that in the intermediate steps the pairs $v_i, a_i$, must in some way encode the complete |
|
584 lexing information as well. Therefore, we need a unified approach to extract such lexing result |
|
585 from a value $v_i$ and its annotated regular expression $a_i$. |
|
586 And the function $f$ must satisfy these requirements: |
|
587 \begin{itemize} |
|
588 \item |
|
589 $f \; a_i\;v_i = f \; a_n \; v_n = \decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$ |
|
590 \item |
|
591 $f \; a_i\;v_i = f \; a_0 \; v_0 = v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$ |
|
592 \end{itemize} |
|
593 \noindent |
|
594 If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$, |
|
595 The core of the function $f$ is something that produces the bitcodes |
|
596 $\code \; v_0$. |
|
597 It is unclear how, but Sulzmann and Lu came up with a function satisfying all the above |
|
598 requirements, named \emph{retrieve}: |
|
599 |
|
600 |
|
601 |
|
602 \begin{center} |
|
603 \begin{tabular}{lcr} |
|
604 $\retrieve \; \, (_{bs} \ONE) \; \, (\Empty)$ & $\dn$ & $\textit{bs}$\\ |
|
605 $\retrieve \; \, (_{bs} \mathbf{c} ) \; \Char(c)$ & $\dn$ & $ \textit{bs}$\\ |
|
606 $\retrieve \; \, (_{bs} a_1 \cdot a_2) \; \Seq(v_1, v_2)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a_1\; v_1) @ (\retrieve \; a_2 \; v_2)$\\ |
|
607 $\retrieve \; \, (_{bs} \Sigma (a :: \textit{as}) \; \,\Left(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v)$\\ |
|
608 $\retrieve \; \, (_{bs} \Sigma (a :: \textit{as} \; \, \Right(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; (_{[]}\Sigma \textit{as}) \; v)$\\ |
|
609 $\retrieve \; \, (_{bs} a^*) \; \, (\Stars(v :: vs)) $ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v) @ (\retrieve \; (_{[]} a^*) \; (\Stars(vs)))$\\ |
|
610 $\retrieve \; \, (_{bs} a^*) \; \, (\Stars([]) $ & $\dn$ & $\textit{bs} @ [Z]$ |
|
611 \end{tabular} |
|
612 \end{center} |
|
613 \noindent |
|
614 As promised, $\retrieve$ collects the right bit-codes from the |
|
615 final derivative $a_n$: |
|
616 \begin{lemma} |
|
617 $\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$ |
|
618 \end{lemma} |
|
619 \begin{proof} |
|
620 By a routine induction on $a$. |
|
621 \end{proof} |
|
622 The design of $\retrieve$ enables extraction of bit-codes |
|
623 from not only $\bnullable$ (annotated) regular expressions, |
|
624 but also those that are not $\bnullable$. |
|
625 For example, if we have the regular expression just internalised |
|
626 and the lexing result value, we could $\retrieve$ the bitcdoes |
|
627 that look as if we have en$\code$-ed the value: |
|
628 \begin{lemma} |
|
629 $\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$ |
|
630 \end{lemma} |
|
631 \begin{proof} |
|
632 By induction on $r$. |
|
633 \end{proof} |
|
634 \noindent |
|
635 The following property is more interesting, as |
|
636 it provides a "bridge" between $a_0, v_0$ and $a_n, v_n$ in the |
|
637 lexing diagram. |
|
638 If you take derivative of an annotated regular expression, |
|
639 you can $\retrieve$ the same bit-codes as before the derivative took place, |
|
640 provided that you use the corresponding value: |
|
641 |
|
642 \begin{lemma}\label{retrieveStepwise} |
|
643 $\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
|
644 \end{lemma} |
|
645 \begin{proof} |
|
646 By induction on $r$, where $v$ is allowed to be arbitrary. |
|
647 The induction principle is function $\erase$'s cases. |
|
648 \end{proof} |
|
649 \noindent |
|
650 $\retrieve$ is connected to the $\blexer$ in the following way: |
|
651 \begin{lemma}\label{blexer_retrieve} |
|
652 $\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$ |
|
653 \end{lemma} |
|
654 \noindent |
|
655 $\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regexes of $\blexer$. |
|
656 For plain regular expressions something similar is required as well. |
|
657 |
|
658 \subsection{$\flex$} |
|
659 Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$, |
500 defined as |
660 defined as |
501 \begin{center} |
661 \begin{center} |
502 \begin{tabular}{lcr} |
662 \begin{tabular}{lcr} |
503 $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ |
663 $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ |
504 $\flex \; r \; f \; c :: s \; v$ & $=$ & $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$ |
664 $\flex \; r \; f \; c :: s \; v$ & $=$ & $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$ |
508 and does the injection in a stack-like manner (last taken derivative first injected). |
668 and does the injection in a stack-like manner (last taken derivative first injected). |
509 $\flex$ is connected to the $\lexer$: |
669 $\flex$ is connected to the $\lexer$: |
510 \begin{lemma} |
670 \begin{lemma} |
511 $\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$ |
671 $\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$ |
512 \end{lemma} |
672 \end{lemma} |
513 $\flex$ provides us a bridge between $\lexer$ and $\blexer$. |
673 \begin{proof} |
|
674 By reverse induction on $s$. |
|
675 \end{proof} |
|
676 $\flex$ provides us a bridge between $\lexer$'s intermediate steps. |
514 What is even better about $\flex$ is that it allows us to |
677 What is even better about $\flex$ is that it allows us to |
515 directly operate on the value $\mkeps (r\backslash v)$, |
678 directly operate on the value $\mkeps (r\backslash v)$, |
516 which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument. |
679 which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument. |
517 When the value created by $\mkeps$ becomes available, one can |
680 When the value created by $\mkeps$ becomes available, one can |
518 prove some stepwise properties of lexing nicely: |
681 prove some stepwise properties of lexing nicely: |
519 \begin{lemma}\label{flexStepwise} |
682 \begin{lemma}\label{flexStepwise} |
520 $\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $ |
683 $\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $ |
521 \end{lemma} |
684 \end{lemma} |
522 |
685 \begin{proof} |
523 And for $\blexer$ we have a function with stepwise properties like $\flex$ as well, |
686 By induction on the shape of $r\backslash s$ |
524 called $\retrieve$\ref{retrieveDef}. |
687 \end{proof} |
525 $\retrieve$ takes bit-codes from annotated regular expressions |
688 \noindent |
526 guided by a value. |
689 With $\flex$ and $\retrieve$ ready, we are ready to connect $\lexer$ and $\blexer$ . |
527 $\retrieve$ is connected to the $\blexer$ in the following way: |
690 |
528 \begin{lemma}\label{blexer_retrieve} |
691 \subsection{Correctness Proof of Bit-coded Algorithm} |
529 $\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$ |
|
530 \end{lemma} |
|
531 If you take derivative of an annotated regular expression, |
|
532 you can $\retrieve$ the same bit-codes as before the derivative took place, |
|
533 provided that you use the corresponding value: |
|
534 |
|
535 \begin{lemma}\label{retrieveStepwise} |
|
536 $\retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
|
537 \end{lemma} |
|
538 The other good thing about $\retrieve$ is that it can be connected to $\flex$: |
|
539 %centralLemma1 |
|
540 \begin{lemma}\label{flex_retrieve} |
692 \begin{lemma}\label{flex_retrieve} |
541 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
693 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
542 \end{lemma} |
694 \end{lemma} |
543 \begin{proof} |
695 \begin{proof} |
544 By induction on $s$. The induction tactic is reverse induction on strings. |
696 By induction on $s$. The induction tactic is reverse induction on strings. |