8 %simplifications and therefore introduce our version of the bitcoded algorithm and |
8 %simplifications and therefore introduce our version of the bitcoded algorithm and |
9 %its correctness proof in |
9 %its correctness proof in |
10 %Chapter 3\ref{Chapter3}. |
10 %Chapter 3\ref{Chapter3}. |
11 In this chapter, we are going to describe the bit-coded algorithm |
11 In this chapter, we are going to describe the bit-coded algorithm |
12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of |
12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of |
|
13 derivatives of |
13 regular expressions. |
14 regular expressions. |
14 We have implemented their algorithm in Scala, and found out inefficiencies |
15 We have implemented their algorithm in Scala and Isabelle, |
|
16 and found problems |
15 in their algorithm such as de-duplication not working properly and redundant |
17 in their algorithm such as de-duplication not working properly and redundant |
16 fixpoint construction. Their algorithm is improved and verified with the help of |
18 fixpoint construction. |
17 formal proofs. |
|
18 \section{The Motivation Behind Using Bitcodes} |
19 \section{The Motivation Behind Using Bitcodes} |
19 We first do a recap of what was going on |
20 Let us give again the definition of $\lexer$ from Chapter \ref{Inj}: |
20 in the lexer algorithm in Chapter \ref{Inj}, |
|
21 \begin{center} |
21 \begin{center} |
22 \begin{tabular}{lcl} |
22 \begin{tabular}{lcl} |
23 $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ |
23 $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ |
24 $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\ |
24 $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\ |
25 & & $\quad \phantom{\mid}\; \None \implies \None$\\ |
25 & & $\quad \phantom{\mid}\; \None \implies \None$\\ |
26 & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$ |
26 & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$ |
27 \end{tabular} |
27 \end{tabular} |
28 \end{center} |
28 \end{center} |
29 \noindent |
29 \noindent |
30 The algorithm recursively calls $\lexer$ on |
30 The first problem with this algorithm is that |
|
31 for the function $\inj$ to work properly |
|
32 we cannot destroy the structure of a regular expression, |
|
33 and therefore canno simplify aggressively. |
|
34 For example, |
|
35 \[ |
|
36 r + (r + a) \rightarrow r + a |
|
37 \] |
|
38 cannot be done because that would require |
|
39 breaking up the inner alternative. |
|
40 The $\lexer$ function therefore only enables |
|
41 same-level de-duplications like |
|
42 \[ |
|
43 r + r \rightarrow r. |
|
44 \] |
|
45 Secondly, the algorithm recursively calls $\lexer$ on |
31 each new character input, |
46 each new character input, |
32 and before starting a child call |
47 and before starting a child call |
33 it stores information of previous lexing steps |
48 it stores information of previous lexing steps |
34 on a stack, in the form of regular expressions |
49 on a stack, in the form of regular expressions |
35 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. |
50 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. |
294 \caption{Stored $r_i, c_{i+1}$ Used by $\inj$} |
309 \caption{Stored $r_i, c_{i+1}$ Used by $\inj$} |
295 \end{figure} |
310 \end{figure} |
296 \noindent |
311 \noindent |
297 Storing all $r_i, c_{i+1}$ pairs recursively |
312 Storing all $r_i, c_{i+1}$ pairs recursively |
298 allows the algorithm to work in an elegant way, at the expense of |
313 allows the algorithm to work in an elegant way, at the expense of |
299 storing quite a bit of verbose information. |
314 storing quite a bit of verbose information on the stack. |
300 The stack seems to grow at least quadratically fast with respect |
315 The stack seems to grow at least quadratically with respect |
301 to the input (not taking into account the size bloat of $r_i$), |
316 to the input (not taking into account the size bloat of $r_i$), |
302 which can be inefficient and prone to stack overflow. |
317 which can be inefficient and prone to stack overflows. |
303 \section{Bitcoded Algorithm} |
318 \section{Bitcoded Algorithm} |
304 To address this, |
319 To address this, |
305 Sulzmann and Lu chose to define a new datatype |
320 Sulzmann and Lu defined a new datatype |
306 called \emph{annotated regular expression}, |
321 called \emph{annotated regular expression}, |
307 which condenses all the partial lexing information |
322 which condenses all the partial lexing information |
308 (that was originally stored in $r_i, c_{i+1}$ pairs) |
323 (that was originally stored in $r_i, c_{i+1}$ pairs) |
309 into bitcodes. |
324 into bitcodes. |
310 The bitcodes are then carried with the regular |
325 The bitcodes are then carried with the regular |
331 \end{center} |
346 \end{center} |
332 \noindent |
347 \noindent |
333 where $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular |
348 where $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular |
334 expressions and $as$ for lists of annotated regular expressions. |
349 expressions and $as$ for lists of annotated regular expressions. |
335 The alternative constructor, written, $\sum$, has been generalised to |
350 The alternative constructor, written, $\sum$, has been generalised to |
336 accept a list of annotated regular expressions rather than just two. |
351 take a list of annotated regular expressions rather than just two. |
337 Why is it generalised? This is because when we open up nested |
352 Why is it generalised? This is because when we analyse nested |
338 alternatives, there could be more than two elements at the same level |
353 alternatives, there can be more than two elements at the same level |
339 after de-duplication, which can no longer be stored in a binary |
354 after de-duplication, which can no longer be stored in a binary |
340 constructor. |
355 constructor. |
341 Bits and bitcodes (lists of bits) are defined as: |
356 Bits and bitcodes (lists of bits) are defined as: |
342 \begin{center} |
357 \begin{center} |
343 $b ::= S \mid Z \qquad |
358 $b ::= S \mid Z \qquad |
345 $ |
360 $ |
346 \end{center} |
361 \end{center} |
347 \noindent |
362 \noindent |
348 We use $S$ and $Z$ rather than $1$ and $0$ is to avoid |
363 We use $S$ and $Z$ rather than $1$ and $0$ is to avoid |
349 confusion with the regular expressions $\ZERO$ and $\ONE$. |
364 confusion with the regular expressions $\ZERO$ and $\ONE$. |
350 The idea is to use the attached bitcodes |
365 The idea is to use the bitcodes |
351 to indicate which choice was made at a certain point |
366 to indicate which choice was made at a certain point |
352 during lexing. |
367 during lexing. |
353 For example, |
368 For example, |
354 $(_{Z}a+_{S}b) \backslash a$ gives us |
369 $(_{Z}a+_{S}b) \backslash a$ gives us |
355 $_{Z}\ONE + \ZERO$, this $Z$ bitcode will |
370 $_{Z}\ONE + \ZERO$, where the $Z$ bitcode will |
356 later be used to decode that a left branch was |
371 later be used to decode that a left branch was |
357 selected at the time when the part $a+b$ is being taken |
372 selected at the time when the part $a+b$ was anallysed by |
358 derivative of. |
373 derivative. |
359 \subsection{A Bird's Eye View of the Bit-coded Lexer} |
374 \subsection{A Bird's Eye View of the Bit-coded Lexer} |
360 Before we give out the rest of the functions and definitions |
375 Before we give the details of the functions and definitions |
361 related to our |
376 related to our |
362 $\blexer$ (\emph{b}-itcoded lexer), we first provide a high-level |
377 $\blexer$ (\emph{b}-itcoded lexer), we first provide a high-level |
363 view of the algorithm, so the reader does not get lost in |
378 view of the algorithm. |
364 the details. |
|
365 \begin{figure}[H] |
379 \begin{figure}[H] |
366 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] |
380 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] |
367 %\draw (-6,-6) grid (6,6); |
381 %\draw (-6,-6) grid (6,6); |
368 |
382 |
369 \node [circle, draw] (r0) at (-6, 2) {$r$}; |
383 \node [circle, draw] (r0) at (-6, 2) {$r$}; |
434 %edge [] node {} (ldots2); |
448 %edge [] node {} (ldots2); |
435 |
449 |
436 |
450 |
437 |
451 |
438 \end{tikzpicture} |
452 \end{tikzpicture} |
439 \caption{Bird's Eye View of $\blexer$} |
453 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand |
|
454 for the annotated regular expressions in each derivative step. |
|
455 The characters used in each derivative is written as $c_i$. |
|
456 The relative size increase reflect the size increase in each derivative step.} |
440 \end{figure} |
457 \end{figure} |
441 \noindent |
458 \noindent |
442 The plain regular expressions |
459 The plain regular expressions |
443 are first ``lifted'' to an annotated regular expression, |
460 are first ``lifted'' to an annotated regular expression, |
444 with the function called \emph{internalise}. |
461 with the function called \emph{internalise} ($r \rightarrow _{bs}a$). |
445 Then the annotated regular expression $_{bs}a$ will |
462 Then the annotated regular expression $_{bs}a$ will |
446 go through successive derivatives with respect |
463 be transformed by successive derivatives with respect |
447 to the input stream of characters $c_1, c_2$ etc. |
464 to the input stream of characters $c_1, c_2$ etc. |
448 Each time a derivative is taken, the bitcodes |
465 Each time a derivative is taken, the bitcodes |
449 are moved around, discarded or augmented together |
466 are moved around, discarded or augmented together |
450 with the regular expression they are attached to. |
467 with the regular expression they are attached to. |
451 After all input has been consumed, the |
468 After all input has been consumed, the |
452 bitcodes are collected by $\bmkeps$, |
469 bitcodes are collected by $\bmkeps$, |
453 which traverses the nullable part of the regular expression |
470 which traverses the nullable part of the regular expression |
454 and collect the bitcodes along the way. |
471 and collects the bitcodes along the way. |
455 The collected bitcodes are then $\decode$d with the guidance |
472 The collected bitcodes are then $\decode$d with the guidance |
456 of the input regular expression $r$. |
473 of the input regular expression $r$ ( $_{bs}a \rightarrow r$). |
457 The most notable improvements of $\blexer$ |
474 The most notable improvements of $\blexer$ |
458 over $\lexer$ are |
475 over $\lexer$ are |
459 \begin{itemize} |
476 \begin{itemize} |
460 \item |
477 \item |
461 |
478 |
462 An absence of the second injection phase. |
479 An absence of the second injection phase. |
463 \item |
480 \item |
464 One need not store each pair of the |
481 One does not need to store each pair of the |
465 intermediate regular expressions $_{bs_i}a_i$ and $c_{i+1}$. |
482 intermediate regular expressions $_{bs_i}a_i$ and $c_{i+1}$. |
466 The previous annotated regular expression $_{bs_i}a_i$'s information is passed |
483 The previous annotated regular expression $_{bs_i}a_i$'s information is passed |
467 on without loss to its successor $_{bs_{i+1}}a_{i+1}$, |
484 on without loss to its successor $_{bs_{i+1}}a_{i+1}$, |
468 and $c_{i+1}$'s already contained in the strings in $L\;r$ \footnote{ |
485 and $c_{i+1}$'s information is stored in $L\;r$.\footnote{ |
469 which can be easily recovered if we decode in the correct order}. |
486 which will be used during the decode phase, where we use $r$ as |
|
487 a source of information.} |
470 \item |
488 \item |
471 The simplification works much better--one can maintain correctness |
489 Finally, simplification works much better--one can maintain correctness |
472 while applying quite strong simplifications, as we shall wee. |
490 while applying quite aggressive simplifications. |
473 \end{itemize} |
491 \end{itemize} |
474 \noindent |
492 %In the next section we will |
475 In the next section we will |
493 %give the operations needed in $\blexer$, |
476 give the operations needed in $\blexer$, |
494 %with some remarks on the idea behind their definitions. |
477 with some remarks on the idea behind their definitions. |
|
478 \subsection{Operations in $\textit{Blexer}$} |
495 \subsection{Operations in $\textit{Blexer}$} |
479 The first operation we define related to bit-coded regular expressions |
496 The first operation we define related to bit-coded regular expressions |
480 is how we move bits to the inside of regular expressions. |
497 is how we move bits to the inside of regular expressions. |
481 Called $\fuse$, this operation attaches bit-codes |
498 This operation is called $\fuse$: |
482 to the front of an annotated regular expression: |
|
483 \begin{center} |
499 \begin{center} |
484 \begin{tabular}{lcl} |
500 \begin{tabular}{lcl} |
485 $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ |
501 $\textit{fuse}\;bs \; (\ZERO)$ & $\dn$ & $\ZERO$\\ |
486 $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & |
502 $\textit{fuse}\;bs\; (_{bs'})\ONE$ & $\dn$ & |
487 $_{bs @ bs'}\ONE$\\ |
503 $_{bs @ bs'}\ONE$\\ |
488 $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & |
504 $\textit{fuse}\;bs\;(_{bs'}{\bf c})$ & $\dn$ & |
489 $_{bs@bs'}{\bf c}$\\ |
505 $_{bs@bs'}{\bf c}$\\ |
490 $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & |
506 $\textit{fuse}\;bs\;(_{bs'}\sum\textit{as})$ & $\dn$ & |
491 $_{bs@bs'}\sum\textit{as}$\\ |
507 $_{bs@bs'}\sum\textit{as}$\\ |
492 $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & |
508 $\textit{fuse}\;bs\; (_{bs'}a_1\cdot a_2)$ & $\dn$ & |
493 $_{bs@bs'}a_1 \cdot a_2$\\ |
509 $_{bs@bs'}a_1 \cdot a_2$\\ |
494 $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & |
510 $\textit{fuse}\;bs\;(_{bs'}a^*)$ & $\dn$ & |
495 $_{bs @ bs'}a^*$ |
511 $_{bs @ bs'}a^*$ |
496 \end{tabular} |
512 \end{tabular} |
497 \end{center} |
513 \end{center} |
498 |
514 |
499 \noindent |
515 \noindent |
500 With \emph{fuse} we are able to define the $\internalise$ function |
516 With \emph{fuse} we are able to define the $\internalise$ function, |
|
517 written $(\_)^\uparrow$, |
501 that translates a ``standard'' regular expression into an |
518 that translates a ``standard'' regular expression into an |
502 annotated regular expression. |
519 annotated regular expression. |
503 This function will be applied before we start |
520 This function will be applied before we start |
504 with the derivative phase of the algorithm. |
521 with the derivative phase of the algorithm. |
505 |
522 |
516 $(r^*)^\uparrow$ & $\dn$ & |
533 $(r^*)^\uparrow$ & $\dn$ & |
517 $_{[]}(r^\uparrow)^*$\\ |
534 $_{[]}(r^\uparrow)^*$\\ |
518 \end{tabular} |
535 \end{tabular} |
519 \end{center} |
536 \end{center} |
520 \noindent |
537 \noindent |
521 We use an up arrow with postfix notation |
|
522 to denote this operation |
|
523 for convenience. |
|
524 The opposite of $\textit{internalise}$ is |
538 The opposite of $\textit{internalise}$ is |
525 $\erase$, where all the bit-codes are removed, |
539 $\erase$, where all bit-codes are removed, |
526 and the alternative operator $\sum$ for annotated |
540 and the alternative operator $\sum$ for annotated |
527 regular expressions is transformed to the binary version |
541 regular expressions is transformed to the binary version |
528 in plain regular expressions. |
542 in (plain) regular expressions. This can be defined as follows: |
529 \begin{center} |
543 \begin{center}\label{eraseDef} |
530 \begin{tabular}{lcl} |
544 \begin{tabular}{lcl} |
531 $\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\ |
545 $\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\ |
532 $( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\ |
546 $( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\ |
533 $( _{bs}\mathbf{c} )_\downarrow$ & $\dn$ & $\mathbf{c}$\\ |
547 $( _{bs}\mathbf{c} )_\downarrow$ & $\dn$ & $\mathbf{c}$\\ |
534 $( _{bs} a_1 \cdot a_2 )_\downarrow$ & $\dn$ & |
548 $( _{bs} a_1 \cdot a_2 )_\downarrow$ & $\dn$ & |
535 $ (a_1) _\downarrow \cdot (a_2) _\downarrow$\\ |
549 $ (a_1) _\downarrow \cdot (a_2) _\downarrow$\\ |
536 $( _{bs} [])_\downarrow $ & $\dn$ & $\ZERO $\\ |
550 $( _{bs} [])_\downarrow $ & $\dn$ & $\ZERO $\\ |
537 $( _{bs} [a] )_\downarrow$ & $\dn$ & $a_\downarrow$\\ |
551 $( _{bs} [a] )_\downarrow$ & $\dn$ & $a_\downarrow$\\ |
538 $_{bs} \sum [a_1, \; a_2]$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\ |
552 $(_{bs} \sum [a_1, \; a_2])_\downarrow$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\ |
539 $(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\ |
553 $(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\ |
540 $( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$ |
554 $( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$ |
541 \end{tabular} |
555 \end{tabular} |
542 \end{center} |
556 \end{center} |
543 \noindent |
557 \noindent |
544 We also abbreviate the $\erase\; a$ operation |
558 We also abbreviate the $\erase\; a$ operation |
545 as $a_\downarrow$, for conciseness. |
559 as $a_\downarrow$, for conciseness. |
546 |
560 |
547 Testing whether an annotated regular expression |
561 Determining whether an annotated regular expression |
548 contains the empty string in its lauguage requires |
562 contains the empty string in its lauguage requires |
549 a dedicated function $\bnullable$. |
563 a dedicated function $\bnullable$. |
|
564 In our formalisation this function is defined by simply calling $\erase$ |
|
565 before $\nullable$. |
550 $\bnullable$ simply calls $\erase$ before $\nullable$. |
566 $\bnullable$ simply calls $\erase$ before $\nullable$. |
551 \begin{definition} |
567 \begin{definition} |
552 $\bnullable \; a \dn \nullable \; (a_\downarrow)$ |
568 $\bnullable \; a \dn \nullable \; (a_\downarrow)$ |
553 \end{definition} |
569 \end{definition} |
554 The function for collecting |
570 The function for collecting |
555 bitcodes from a |
571 bitcodes from a |
556 (b)nullable regular expression is called $\bmkeps$. |
572 (b)nullable regular expression is called $\bmkeps$. |
557 $\bmkeps$ is a variation of the $\textit{mkeps}$ function, |
573 This function is a lifted version of the $\textit{mkeps}$ function, |
558 which follows the path $\mkeps$ takes to traverse the |
574 which follows the path $\mkeps$ takes to traverse the |
559 $\nullable$ branches when visiting a regular expression, |
575 $\nullable$ branches when visiting a regular expression, |
560 but gives back bitcodes instead of a value. |
576 but collects bitcodes instead of generating a value. |
561 \begin{center} |
577 \begin{center} |
562 \begin{tabular}{lcl} |
578 \begin{tabular}{lcl} |
563 $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ |
579 $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ |
564 $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & |
580 $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & |
565 $\textit{if}\;\textit{bnullable}\,a$\\ |
581 $\textit{if}\;\textit{bnullable}\,a$\\ |
570 $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & |
586 $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & |
571 $bs \,@\, [S]$ |
587 $bs \,@\, [S]$ |
572 \end{tabular} |
588 \end{tabular} |
573 \end{center} |
589 \end{center} |
574 \noindent |
590 \noindent |
575 $\bmkeps$, just like $\mkeps$, |
591 This function, just like $\mkeps$, |
576 visits a regular expression tree respecting |
592 visits a regular expression tree respecting |
577 the POSIX rules. The difference, however, is that |
593 the POSIX rules. |
578 it does not create values, but only bitcodes. |
|
579 It traverses each child of the sequence regular expression |
594 It traverses each child of the sequence regular expression |
580 from left to right and creates a bitcode by stitching |
595 from left to right and creates a bitcode by stitching |
581 together bitcodes obtained from the children expressions. |
596 together bitcodes obtained from the children expressions. |
582 In the case of alternative regular expressions, |
597 In the case of alternative regular expressions, |
583 it looks for the leftmost |
598 it looks for the leftmost |
618 \textit{else}\;\textit{None}$ |
634 \textit{else}\;\textit{None}$ |
619 \end{tabular} |
635 \end{tabular} |
620 \end{center} |
636 \end{center} |
621 \noindent |
637 \noindent |
622 The function $\decode'$ returns a pair consisting of |
638 The function $\decode'$ returns a pair consisting of |
623 a partially decoded value and some leftover bit list that cannot |
639 a partially decoded value and some leftover bit-list. |
624 be decide yet. |
|
625 The function $\decode'$ succeeds if the left-over |
640 The function $\decode'$ succeeds if the left-over |
626 bit-sequence is empty. |
641 bit-sequence is empty. |
627 $\decode$ is terminating as $\decode'$ is terminating. |
642 |
628 $\decode'$ is terminating |
643 %$\decode$ is terminating as $\decode'$ is terminating. |
629 because at least one of $\decode'$'s parameters will go down in terms |
644 %$\decode'$ is terminating |
630 of size.\\ |
645 %because at least one of $\decode'$'s parameters will go down in terms |
|
646 %of size.\\ |
631 The reverse operation of $\decode$ is $\code$. |
647 The reverse operation of $\decode$ is $\code$. |
632 $\textit{code}$ encodes a value into a bitcode by converting |
648 This function encodes a value into a bitcode by converting |
633 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty |
649 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty |
634 star iteration by $S$. The border where a star iteration |
650 star iteration by $S$. The border where a star iteration |
635 terminates is marked by $Z$. |
651 terminates is marked by $Z$. |
636 This coding is lossy, as it throws away the information about |
652 This coding is lossy, as it throws away the information about |
637 characters, and does not encode the ``boundary'' between two |
653 characters, and does not encode the ``boundary'' between two |
638 sequence values. Moreover, with only the bitcode we cannot even tell |
654 sequence values. Moreover, with only the bitcode we cannot even tell |
639 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. |
655 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$, |
|
656 but this swill not be necessary in our proof. |
640 \begin{center} |
657 \begin{center} |
641 \begin{tabular}{lcl} |
658 \begin{tabular}{lcl} |
642 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
659 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
643 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
660 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
644 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ |
661 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ |
662 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \] |
679 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \] |
663 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition, |
680 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition, |
664 we obtain the property. |
681 we obtain the property. |
665 \end{proof} |
682 \end{proof} |
666 \noindent |
683 \noindent |
667 Now we give out the central part of this lexing algorithm, |
684 Now we define the central part of Sulzmann and Lu's lexing algorithm, |
668 the $\bder$ function (stands for \emph{b}itcoded-derivative): |
685 the $\bder$ function (which stands for \emph{b}itcoded-derivative): |
669 \begin{center} |
686 \begin{center} |
670 \begin{tabular}{@{}lcl@{}} |
687 \begin{tabular}{@{}lcl@{}} |
671 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
688 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
672 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
689 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
673 $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & |
690 $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & |
684 $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot |
701 $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot |
685 (_{[]}r^*))$ |
702 (_{[]}r^*))$ |
686 \end{tabular} |
703 \end{tabular} |
687 \end{center} |
704 \end{center} |
688 \noindent |
705 \noindent |
689 For most time we use the infix notation $(\_\backslash\_)$ |
706 The $\bder$ function tells us how regular expressions can be recursively traversed, |
690 to mean $\bder$ for brevity when |
|
691 there is no danger of confusion with derivatives on plain regular expressions. |
|
692 For example, we write $( _{[]}r^* ) \backslash c$ instead of $\bder \;c \; _{[]}r^*$, |
|
693 as the bitcodes at the front of $r^*$ indicates that it is |
|
694 a bit-coded regular expression, not a plain one.\\ |
|
695 $\bder$ tells us how regular expressions can be recursively traversed, |
|
696 where the bitcodes are augmented and carried around |
707 where the bitcodes are augmented and carried around |
697 when a derivative is taken. |
708 when a derivative is taken. |
698 We give the intuition behind some of the more involved cases in |
709 We give the intuition behind some of the more involved cases in |
699 $\bder$. \\ |
710 $\bder$. \\ |
700 For example, |
711 For example, |
701 in the \emph{star} case, |
712 in the \emph{star} case, |
702 a derivative on $_{bs}a^*$ means |
713 a derivative of $_{bs}a^*$ means |
703 that one more star iteratoin needs to be taken. |
714 that one more star iteration needs to be taken. |
704 we need to unfold it into a sequence, |
715 We therefore need to unfold it into a sequence, |
705 and attach an additional bit $Z$ to the front of $r \backslash c$ |
716 and attach an additional bit $Z$ to the front of $a \backslash c$ |
706 as a record to indicate one new star iteration is unfolded. |
717 as a record to indicate one new star iteration is unfolded. |
707 |
718 |
708 \noindent |
719 \noindent |
709 \begin{center} |
720 \begin{center} |
710 \begin{tabular}{@{}lcl@{}} |
721 \begin{tabular}{@{}lcl@{}} |
772 & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ |
783 & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ |
773 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$ |
784 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$ |
774 \end{tabular} |
785 \end{tabular} |
775 \end{center} |
786 \end{center} |
776 \noindent |
787 \noindent |
777 The difference is that (when $a_1$ is $\bnullable$) |
788 %We assume that $\bmkeps$ |
|
789 %correctly extracts the bitcode for how $a_1$ |
|
790 %matches the string prior to $c$ (more on this later). |
|
791 %The bitsequence $\textit{bs}$ which was initially |
|
792 %attached to the first element of the sequence |
|
793 %$a_1 \cdot a_2$, has now been |
|
794 %elevated to the top level of the $\sum$ constructor. |
|
795 %This is because this piece of information will be needed |
|
796 %whichever way the sequence is matched, |
|
797 %regardless of whether $c$ belongs to $a_1$ or $a_2$. |
|
798 The difference mainly lies in the $\textit{if}$ branch (when $a_1$ is $\bnullable$): |
778 we use $\bmkeps$ to store the lexing information |
799 we use $\bmkeps$ to store the lexing information |
779 in $a_1$ before collapsing |
800 in $a_1$ before collapsing |
780 it (as it has been fully matched by string prior to $c$), |
801 it (as it has been fully matched by string prior to $c$), |
781 and attach the collected bit-codes to the front of $a_2$ |
802 and attach the collected bit-codes to the front of $a_2$ |
782 before throwing away $a_1$. We assume that $\bmkeps$ |
803 before throwing away $a_1$. |
783 correctly extracts the bitcode for how $a_1$ |
|
784 matches the string prior to $c$ (more on this later). |
|
785 The bitsequence $\textit{bs}$ which was initially |
|
786 attached to the first element of the sequence |
|
787 $a_1 \cdot a_2$, has now been |
|
788 elevated to the top level of the $\sum$ constructor. |
|
789 This is because this piece of information will be needed |
|
790 whichever way the sequence is matched, |
|
791 regardless of whether $c$ belongs to $a_1$ or $a_2$. |
|
792 |
|
793 In the injection-based lexer, $r_1$ is immediately thrown away in |
804 In the injection-based lexer, $r_1$ is immediately thrown away in |
794 subsequent derivatives on the right branch (when $r_1$ is $\nullable$), |
805 in the $\textit{if}$ branch, |
795 depite $r_1$ potentially storing information of previous matches: |
806 the information $r_1$ stores is therefore lost: |
796 \begin{center} |
807 \begin{center} |
797 $(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ |
808 \begin{tabular}{lcl} |
|
809 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\ |
|
810 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
811 & & $\mathit{else}\; \ldots$\\ |
|
812 \end{tabular} |
798 \end{center} |
813 \end{center} |
799 \noindent |
814 \noindent |
800 this loss of information makes it necessary |
815 |
801 to store on stack all the regular expressions' |
816 %\noindent |
802 ``snapshots'' before they were |
817 %this loss of information makes it necessary |
803 taken derivative of. |
818 %to store on stack all the regular expressions' |
804 So that the related information will be available |
819 %``snapshots'' before they were |
805 once the child recursive |
820 %taken derivative of. |
806 call finishes.\\ |
821 %So that the related information will be available |
|
822 %once the child recursive |
|
823 %call finishes.\\ |
807 The rest of the clauses of $\bder$ is rather similar to |
824 The rest of the clauses of $\bder$ is rather similar to |
808 $\der$. \\ |
825 $\der$. \\ |
809 Generalising the derivative operation with bitcodes to strings, we have |
826 Generalising the derivative operation with bitcodes to strings, we have |
810 \begin{center} |
827 \begin{center} |
811 \begin{tabular}{@{}lcl@{}} |
828 \begin{tabular}{@{}lcl@{}} |
827 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
844 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
828 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
845 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
829 & & $\;\;\textit{else}\;\textit{None}$ |
846 & & $\;\;\textit{else}\;\textit{None}$ |
830 \end{tabular} |
847 \end{tabular} |
831 \end{center} |
848 \end{center} |
832 |
849 \noindent |
833 \noindent |
850 This function first attaches bitcodes to a |
834 $\blexer$ first attaches bitcodes to a |
851 plain regular expression $r$, and then builds successive derivatives |
835 plain regular expression, and then do successive derivatives |
|
836 with respect to the input string $s$, and |
852 with respect to the input string $s$, and |
837 then test whether the result is nullable. |
853 then test whether the result is (b)nullable. |
838 If yes, then extract the bitcodes out of the nullable expression, |
854 If yes, then extract the bitcodes from the nullable expression, |
839 and decodes the bitcodes into a lexical value. |
855 and decodes the bitcodes into a lexical value. |
840 If there does not exists a match between $r$ and $s$ the lexer |
856 If there does not exists a match between $r$ and $s$ the lexer |
841 outputs $\None$ indicating a failed lex.\\ |
857 outputs $\None$ indicating a mismatch.\\ |
842 Ausaf and Urban formally proved the correctness of the $\blexer$, namely |
858 Ausaf and Urban formally proved the correctness of the $\blexer$, namely |
843 \begin{property} |
859 \begin{property} |
844 $\blexer \;r \; s = \lexer \; r \; s$. |
860 $\blexer \;r \; s = \lexer \; r \; s$. |
845 \end{property} |
861 \end{property} |
846 \noindent |
862 \noindent |
847 This was claimed but not formalised in Sulzmann and Lu's work. |
863 This was claimed but not formalised in Sulzmann and Lu's work. |
848 We introduce the proof later, after we give all |
864 We introduce the proof later, after we give all |
849 the needed auxiliary functions and definitions. |
865 the needed auxiliary functions and definitions. |
850 \subsection{An Example $\blexer$ Run} |
866 \subsection{An Example $\blexer$ Run} |
851 Before introducing the proof we shall first walk the reader |
867 Before introducing the proof we shall first walk |
852 through a concrete example of our $\blexer$ calculating the right |
868 through a concrete example of our $\blexer$ calculating the right |
853 lexical information through bit-coded regular expressions.\\ |
869 lexical information through bit-coded regular expressions.\\ |
854 Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$. |
870 Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$. |
855 We give again the bird's eye view of this particular example |
871 We give again the bird's eye view of this particular example |
856 in each stage of the algorithm: |
872 in each stage of the algorithm: |
917 The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$ |
933 The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$ |
918 turned into $ZS$ after derivative w.r.t $b$ |
934 turned into $ZS$ after derivative w.r.t $b$ |
919 is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$ |
935 is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$ |
920 before processing $_Zb+_Sc$.\\ |
936 before processing $_Zb+_Sc$.\\ |
921 The annotated regular expressions |
937 The annotated regular expressions |
922 would look too cumbersome if we explicitly indicate all the |
938 would look overwhelming if we explicitly indicate all the |
923 locations where bitcodes are attached. |
939 locations where bitcodes are attached. |
924 For example, |
940 For example, |
925 $(aa)^*\cdot (b+c)$ would |
941 $(aa)^*\cdot (b+c)$ would |
926 look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$ |
942 look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$ |
927 after |
943 after |
928 internalise. |
944 internalise. |
929 Therefore for readability we omit bitcodes if they are empty. |
945 Therefore for readability we omit bitcodes if they are empty. |
930 This applies to all example annotated |
946 This applies to all annotated |
931 regular expressions in this thesis.\\ |
947 regular expressions in this thesis.\\ |
932 %and assume we have just read the first character $a$: |
948 %and assume we have just read the first character $a$: |
933 %\begin{center} |
949 %\begin{center} |
934 %\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
950 %\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
935 % \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
951 % \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
1049 using $\lexer$ or $\blexer$ alone. |
1066 using $\lexer$ or $\blexer$ alone. |
1050 \subsubsection{$\textit{Retrieve}$} |
1067 \subsubsection{$\textit{Retrieve}$} |
1051 The first function we shall introduce is $\retrieve$. |
1068 The first function we shall introduce is $\retrieve$. |
1052 Sulzmann and Lu gave its definition, and |
1069 Sulzmann and Lu gave its definition, and |
1053 Ausaf and Urban found |
1070 Ausaf and Urban found |
1054 its usage in mechanised proofs. |
1071 useful in their mechanised proofs. |
1055 Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$, |
1072 Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$, |
1056 after all the derivatives has been taken: |
1073 after all the derivatives have been taken: |
1057 \begin{center} |
1074 \begin{center} |
1058 \begin{tabular}{lcl} |
1075 \begin{tabular}{lcl} |
1059 & & $\ldots$\\ |
1076 & & $\ldots$\\ |
1060 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
1077 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
1061 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
1078 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
1062 & & $\ldots$ |
1079 & & $\ldots$ |
1063 \end{tabular} |
1080 \end{tabular} |
1064 \end{center} |
1081 \end{center} |
1065 \noindent |
1082 \noindent |
1066 $\bmkeps$ retrieves the value $v$'s |
1083 The function $\bmkeps$ retrieves the value $v$'s |
1067 information in the format |
1084 information in the format |
1068 of bitcodes, by travelling along the |
1085 of bitcodes, by travelling along the |
1069 path of the regular expression that corresponds to a POSIX match, |
1086 path of the regular expression that corresponds to a POSIX match, |
1070 collecting all the bitcodes. |
1087 collecting all the bitcodes. |
1071 We know that this "retrieved" bitcode leads to the correct value after decoding, |
1088 We know that this "retrieved" bitcode leads to the correct value after decoding, |
1072 which is $v_0$ in the injection-based lexing diagram. |
1089 which is $v_0$ in the injection-based lexing diagram: |
1073 As an observation we pointed at the beginning of this section, |
|
1074 the annotated regular expressions generated in successive derivative steps |
|
1075 in $\blexer$ after $\erase$ has the same structure |
|
1076 as those appeared in $\lexer$. |
|
1077 We redraw the diagram below to visualise this fact. |
|
1078 We pretend that all the values are |
|
1079 ready despite they are only calculated in $\lexer$. |
|
1080 In general we have $\vdash v_i:(a_i)_\downarrow$. |
|
1081 \vspace{20mm} |
1090 \vspace{20mm} |
1082 \begin{figure}[H]%\label{graph:injLexer} |
1091 \begin{figure}[H]%\label{graph:injLexer} |
1083 \begin{center} |
1092 \begin{center} |
1084 \begin{tikzcd}[ |
1093 \begin{tikzcd}[ |
1085 every matrix/.append style = {name=p}, |
1094 every matrix/.append style = {name=p}, |
1097 \node[E = (p-1-1) (p-2-1)] {}; |
1106 \node[E = (p-1-1) (p-2-1)] {}; |
1098 \node[E = (p-1-4) (p-2-4)] {}; |
1107 \node[E = (p-1-4) (p-2-4)] {}; |
1099 \node[E = (p-1-2) (p-2-2)] {}; |
1108 \node[E = (p-1-2) (p-2-2)] {}; |
1100 \node[E = (p-1-3) (p-2-3)] {}; |
1109 \node[E = (p-1-3) (p-2-3)] {}; |
1101 \end{tikzpicture} |
1110 \end{tikzpicture} |
1102 |
1111 \caption{Injection-based lexing diagram, |
|
1112 with matching value and regular expression pairs |
|
1113 encircled}\label{Inj2} |
1103 \end{figure} |
1114 \end{figure} |
1104 \vspace{20mm} |
1115 \vspace{20mm} |
1105 \noindent |
1116 \noindent |
1106 We encircle in the diagram all the pairs $v_i, a_i$ to show that these values |
1117 We have that $\vdash v_i:(a_i)_\downarrow$, |
|
1118 where $v_i$ are the intermediate values generated step by step in $\lexer$. |
|
1119 These values are not explicitly created in $\blexer$. |
|
1120 |
|
1121 We encircled in the diagram all the pairs $v_i, a_i$ to show that these values |
1107 and regular expressions share the same structure. |
1122 and regular expressions share the same structure. |
1108 These pairs all contain adequate information to |
1123 These pairs all contain adequate information to |
1109 obtain the final lexing result. |
1124 obtain the final lexing result. |
1110 For example, in the leftmost pair the |
1125 For example, in the leftmost pair the |
1111 lexical information is condensed in |
1126 lexical information is condensed in |
1112 $v_0$, whereas for the rightmost pair, |
1127 $v_0$, whereas for the rightmost pair, |
1113 the lexing result hides is in the bitcodes of $a_n$.\\ |
1128 the lexing result hides is in the bitcodes of $a_n$.\\ |
1114 $\bmkeps$ is able to extract from $a_n$ the result |
1129 The function $\bmkeps$ is able to extract from $a_n$ the result |
1115 by looking for nullable parts of the regular expression. |
1130 by looking for nullable parts of the regular expression. |
1116 However for regular expressions $a_i$ in general, |
1131 However for regular expressions $a_i$ in general, |
1117 they might not necessarily be nullable. |
1132 they might not necessarily be nullable. |
1118 For those regular expressions that are not nullable, |
1133 For those regular expressions that are not nullable, |
1119 $\bmkeps$ will not work. |
1134 $\bmkeps$ will not work. |
1153 |
1169 |
1154 $\retrieve \; \, _{bs} a^* $ & $ (\Stars \; []) $ & $\dn$ & $\textit{bs} \; @ \; [S]$ |
1170 $\retrieve \; \, _{bs} a^* $ & $ (\Stars \; []) $ & $\dn$ & $\textit{bs} \; @ \; [S]$ |
1155 \end{tabular} |
1171 \end{tabular} |
1156 \end{center} |
1172 \end{center} |
1157 \noindent |
1173 \noindent |
1158 As promised, $\retrieve$ collects the right bit-codes from the |
1174 As stated, $\retrieve$ collects the right bit-codes from the |
1159 final derivative $a_n$, guided by $v_n$: |
1175 final derivative $a_n$, guided by $v_n$. This can be proved |
|
1176 as follows: |
1160 \begin{lemma}\label{bmkepsRetrieve} |
1177 \begin{lemma}\label{bmkepsRetrieve} |
1161 $\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$ |
1178 $\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$ |
1162 \end{lemma} |
1179 \end{lemma} |
1163 \begin{proof} |
1180 \begin{proof} |
1164 By a routine induction on $a$. |
1181 By a routine induction on $a$. |
1165 \end{proof} |
1182 \end{proof} |
1166 \noindent |
1183 \noindent |
1167 The design of $\retrieve$ enables us to extract bitcodes |
1184 %The design of $\retrieve$ enables us to extract bitcodes |
1168 from both annotated regular expressions and values. |
1185 %from both annotated regular expressions and values. |
1169 $\retrieve$ always ``sucks up'' all the information. |
1186 %$\retrieve$ always ``sucks up'' all the information. |
1170 When more information is stored in the value, we would be able to |
1187 When more information is stored in the value, we would be able to |
1171 extract more from the value, and vice versa. |
1188 extract more from the value, and vice versa. |
1172 For example in star iterations, $\retrieve$ will be able to extract from $\Stars \; vs$ |
1189 For example in star iterations, $\retrieve$ will be able to extract from $\Stars \; vs$ |
1173 exactly the same bitcodes as $\code \; (\Stars \; vs)$: |
1190 exactly the same bitcodes as $\code \; (\Stars \; vs)$: |
1174 \begin{lemma} |
1191 \begin{lemma}\label{retrieveEncodeSTARS} |
1175 If $\forall v \in vs. \vdash v : r$, and $\code \; v = \retrieve \; (\internalise\; r) \; v$\\ |
1192 If $\forall v \in vs. \vdash v : r$, and $\code \; v = \retrieve \; (\internalise\; r) \; v$\\ |
1176 then $\code \; (\Stars \; vs) = \retrieve \; _{[]}((\internalise \; r)^*) \; (\Stars \; vs)$ |
1193 then $\code \; (\Stars \; vs) = \retrieve \; _{[]}((\internalise \; r)^*) \; (\Stars \; vs)$ |
1177 \end{lemma}\label{retrieveEncodeSTARS} |
1194 \end{lemma} |
1178 \begin{proof} |
1195 \begin{proof} |
1179 By induction on the value list $vs$. |
1196 By induction on the value list $vs$. |
1180 \end{proof} |
1197 \end{proof} |
1181 \noindent |
1198 \noindent |
1182 Similarly, when the value list does not hold information, |
1199 Similarly, when the value list does not hold information, |
1183 only the bitcodes plus some delimiter will be recorded: |
1200 only the bitcodes plus some delimiter will be recorded: |
1184 \begin{center} |
1201 \begin{center} |
1185 $\retrieve \; _{bs}((\internalise \; r)^*) \; (\Stars \; [] ) = bs @ [S]$. |
1202 $\retrieve \; _{bs}((\internalise \; r)^*) \; (\Stars \; [] ) = bs @ [S]$. |
1186 \end{center} |
1203 \end{center} |
1187 In general, if we have a regular expression just internalised |
1204 In general, if we have a regular expression that is just internalised |
1188 and the lexing result value, we could $\retrieve$ the bitcdoes |
1205 and the final lexing result value, we can $\retrieve$ the bitcodes |
1189 that look as if we have en$\code$-ed the value as bitcodes: |
1206 that look as if we have en$\code$-ed the value as bitcodes: |
1190 \begin{lemma} |
1207 \begin{lemma} |
1191 $\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$ |
1208 $\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$ |
1192 \end{lemma} |
1209 \end{lemma} |
1193 \begin{proof} |
1210 \begin{proof} |
1194 By induction on $r$. |
1211 By induction on $r$. |
1195 The star case uses lemma \ref{retrieveEncodeSTARS}. |
1212 The star case uses lemma \ref{retrieveEncodeSTARS}. |
1196 \end{proof} |
1213 \end{proof} |
1197 \noindent |
1214 \noindent |
1198 The following property is more interesting, as |
1215 The following property of $\retrieve$ is crucial, as |
1199 it provides a "bridge" between $a_0$ and $a_n $ in the |
1216 it provides a "bridge" between $a_0$ and $a_n $ in the |
1200 lexing diagram by linking adjacent regular expressions $a_i$ and |
1217 lexing diagram by linking adjacent regular expressions $a_i$ and |
1201 $a_{i+1}$. |
1218 $a_{i+1}$. |
1202 The property says that one |
1219 The property says that one |
1203 can retrieve the same bits from a derivative |
1220 can retrieve the same bits from a derivative |
1204 regular expression as those from |
1221 regular expression as those from |
1205 before the derivative took place, |
1222 before the derivative took place, |
1206 provided that the right values are used respectively: |
1223 provided that the corresponding values are used respectively: |
1207 \begin{lemma}\label{retrieveStepwise} |
1224 \begin{lemma}\label{retrieveStepwise} |
1208 $\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
1225 $\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
1209 \end{lemma} |
1226 \end{lemma} |
1210 \begin{proof} |
1227 \begin{proof} |
1211 By induction on $r$, where $v$ is allowed to be arbitrary. |
1228 By induction on $r$, generalising over $v$. |
1212 The induction principle is function $\erase$'s cases. |
1229 The induction principle in our formalisation |
|
1230 is function $\erase$'s cases. |
1213 \end{proof} |
1231 \end{proof} |
1214 \noindent |
1232 \noindent |
1215 Before we move on to the next |
1233 Before we move on to the next |
1216 helper function, we rewrite $\blexer$ in |
1234 helper function, we rewrite $\blexer$ in |
1217 the following way which makes later proofs more convenient: |
1235 the following way which makes later proofs more convenient: |
1220 \end{lemma} |
1238 \end{lemma} |
1221 \begin{proof} |
1239 \begin{proof} |
1222 Using lemma \ref{bmkepsRetrieve}. |
1240 Using lemma \ref{bmkepsRetrieve}. |
1223 \end{proof} |
1241 \end{proof} |
1224 \noindent |
1242 \noindent |
1225 $\retrieve$ allows connecting |
1243 The function $\retrieve$ allows connecting |
1226 between the intermediate |
1244 between the intermediate |
1227 results $a_i$ and $a_{i+1}$ in $\blexer$. |
1245 results $a_i$ and $a_{i+1}$ in $\blexer$. |
1228 For plain regular expressions something similar is required. |
1246 For plain regular expressions something similar is required. |
1229 |
1247 |
1230 \subsection{$\flex$} |
1248 \subsection{$\flex$} |
1231 Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$, |
1249 Ausaf and Urban defined an auxiliary function called $\flex$ for $\lexer$, |
1232 defined as |
1250 defined as |
1233 \begin{center} |
1251 \begin{center} |
1234 \begin{tabular}{lcl} |
1252 \begin{tabular}{lcl} |
1235 $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ |
1253 $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ |
1236 $\flex \; r \; f \; (c :: s) \; v$ & $=$ & $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$ |
1254 $\flex \; r \; f \; (c :: s) \; v$ & $=$ & $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$ |
1237 \end{tabular} |
1255 \end{tabular} |
1238 \end{center} |
1256 \end{center} |
1239 which accumulates the characters that need to be injected back, |
1257 which accumulates the characters that need to be injected back, |
1240 and does the injection in a stack-like manner (last taken derivative first injected). |
1258 and does the injection in a stack-like manner (last taken derivative first injected). |
|
1259 The function |
1241 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression |
1260 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression |
1242 $r$, the identity function $id$, the input string $s$ and the value |
1261 $r$, the identity function $id$, the input string $s$ and the value |
1243 $v_n= \mkeps \; (r\backslash s)$: |
1262 $v_n= \mkeps \; (r\backslash s)$: |
1244 \begin{lemma} |
1263 \begin{lemma} |
1245 $\flex \;\; r \;\; \textit{id}\;\; s \;\; (\mkeps \;(r\backslash s)) = \lexer \; r \; s$ |
1264 $\flex \;\; r \;\; \textit{id}\;\; s \;\; (\mkeps \;(r\backslash s)) = \lexer \; r \; s$ |
1247 \begin{proof} |
1266 \begin{proof} |
1248 By reverse induction on $s$. |
1267 By reverse induction on $s$. |
1249 \end{proof} |
1268 \end{proof} |
1250 \noindent |
1269 \noindent |
1251 The main advantage of using $\flex$ |
1270 The main advantage of using $\flex$ |
1252 in addition to $\lexer$ is that |
1271 in $\lexer$ is that |
1253 $\flex$ makes the value $v$ and function $f$ |
1272 $\flex$ makes the value $v$ and function $f$ |
1254 that manipulates the value explicit parameters, |
1273 that manipulates the value explicit parameters, |
1255 which ``exposes'' $\lexer$'s recursive call |
1274 which ``exposes'' $\lexer$'s recursive call |
1256 arguments and allows us to ``set breakpoints'' and ``resume'' |
1275 arguments and allows us to ``set breakpoints'' and ``resume'' |
1257 at any point during $\lexer$'s recursive calls.\\ |
1276 at any point during $\lexer$'s recursive calls. Therefore |
1258 The following stepwise property holds. |
1277 the following stepwise property holds. |
1259 \begin{lemma}\label{flexStepwise} |
1278 \begin{lemma}\label{flexStepwise} |
1260 $\textit{flex} \;\; r \;\; f \;\; (s@[c]) \;\; v = \flex \;\; r \;\; f \;\; s \;\; (\inj \; (r\backslash s) \; c \; v) $ |
1279 $\textit{flex} \;\; r \;\; f \;\; (s@[c]) \;\; v = \flex \;\; r \;\; f \;\; s \;\; (\inj \; (r\backslash s) \; c \; v) $ |
1261 \end{lemma} |
1280 \end{lemma} |
1262 \begin{proof} |
1281 \begin{proof} |
1263 By induction on the shape of $r\backslash s$. |
1282 By induction on the shape of $r\backslash s$. |
1264 \end{proof} |
1283 \end{proof} |
1265 \noindent |
1284 \noindent |
1266 With $\flex$ and $\retrieve$, |
1285 With $\flex$ and $\retrieve$, |
1267 we are ready to connect $\lexer$ and $\blexer$, |
1286 we are ready to connect $\lexer$ and $\blexer$, |
1268 giving the correctness proof. |
1287 giving the correctness proof for $\blexer$. |
1269 |
1288 |
1270 %---------------------------------------------------------------------------------------- |
1289 %---------------------------------------------------------------------------------------- |
1271 % SECTION correctness proof |
1290 % SECTION correctness proof |
1272 %---------------------------------------------------------------------------------------- |
1291 %---------------------------------------------------------------------------------------- |
1273 \section{Correctness of Bit-coded Algorithm (Without Simplification)} |
1292 \section{Correctness of the Bit-coded Algorithm (Without Simplification)} |
1274 $\flex$ and $\blexer$ essentially calculates the same thing. |
1293 We can first show that |
|
1294 $\flex$ and $\blexer$ calculates the same thing. |
1275 \begin{lemma}\label{flex_retrieve} |
1295 \begin{lemma}\label{flex_retrieve} |
1276 If $\vdash v: (r\backslash s)$, then\\ |
1296 If $\vdash v: (r\backslash s)$, then\\ |
1277 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
1297 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
1278 \end{lemma} |
1298 \end{lemma} |
1279 \begin{proof} |
1299 \begin{proof} |
1326 \noindent |
1346 \noindent |
1327 To piece things together and spell out the correctness |
1347 To piece things together and spell out the correctness |
1328 result of the bitcoded lexer more explicitly, |
1348 result of the bitcoded lexer more explicitly, |
1329 we use the fact from the previous chapter that |
1349 we use the fact from the previous chapter that |
1330 \[ |
1350 \[ |
1331 (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s = v |
1351 (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s =\Some \; v |
|
1352 \quad \quad |
|
1353 \nexists v. (r, s) \rightarrow v\;\; \textit{iff} \;\; \lexer \;r\;s = None |
1332 \] |
1354 \] |
1333 to obtain this corollary: |
1355 to obtain this corollary: |
1334 \begin{corollary}\label{blexerPOSIX} |
1356 \begin{corollary}\label{blexerPOSIX} |
1335 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = v$ |
1357 The $\blexer$ function correctly implements a POSIX lexer, namely\\ |
|
1358 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = \Some \; v$\\ |
|
1359 $\nexists v. (r, s) \rightarrow v\;\; \textit{iff} \;\; \blexer \;r\;s = None$ |
1336 \end{corollary} |
1360 \end{corollary} |
1337 Our main reason for wanting a bit-coded algorithm over |
1361 Our main reason for analysing the bit-coded algorithm over |
1338 the injection-based one is for its capabilities of allowing |
1362 the injection-based one is that it allows us to define |
1339 more aggressive simplifications. |
1363 more aggressive simplifications. |
1340 We will elaborate on this in the next chapter. |
1364 We will elaborate on this in the next chapter. |
1341 |
1365 |
1342 |
1366 |