28 \end{center} |
28 \end{center} |
29 \noindent |
29 \noindent |
30 The first problem with this algorithm is that |
30 The first problem with this algorithm is that |
31 for the function $\inj$ to work properly |
31 for the function $\inj$ to work properly |
32 we cannot destroy the structure of a regular expression, |
32 we cannot destroy the structure of a regular expression, |
33 and therefore canno simplify aggressively. |
33 and therefore cannot simplify too aggressively. |
34 For example, |
34 For example, |
35 \[ |
35 \[ |
36 r + (r + a) \rightarrow r + a |
36 r + (r + a) \rightarrow r + a |
37 \] |
37 \] |
38 cannot be done because that would require |
38 cannot be applied because that would require |
39 breaking up the inner alternative. |
39 breaking up the inner alternative. |
40 The $\lexer$ function therefore only enables |
40 The $\lexer$ plus $\textit{simp}$ therefore only enables |
41 same-level de-duplications like |
41 same-level de-duplications like |
42 \[ |
42 \[ |
43 r + r \rightarrow r. |
43 r + r \rightarrow r. |
44 \] |
44 \] |
45 Secondly, the algorithm recursively calls $\lexer$ on |
45 Secondly, the algorithm recursively calls $\lexer$ on |
156 \path (ldots) |
156 \path (ldots) |
157 edge [bend left, dashed] node {} (stack); |
157 edge [bend left, dashed] node {} (stack); |
158 \path (5.03, 4.9) |
158 \path (5.03, 4.9) |
159 edge [bend left, dashed] node {} (stack); |
159 edge [bend left, dashed] node {} (stack); |
160 \end{tikzpicture} |
160 \end{tikzpicture} |
161 \caption{More Derivatives Taken} |
161 \caption{More derivatives taken} |
162 \end{figure} |
162 \end{figure} |
163 |
163 |
164 |
164 |
165 \begin{figure}[H] |
165 \begin{figure}[H] |
166 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] |
166 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] |
220 \path (r2) |
220 \path (r2) |
221 edge [dashed, ] node {} (stack); |
221 edge [dashed, ] node {} (stack); |
222 \path (rn) |
222 \path (rn) |
223 edge [dashed, bend left] node {} (stack); |
223 edge [dashed, bend left] node {} (stack); |
224 \end{tikzpicture} |
224 \end{tikzpicture} |
225 \caption{Before Injection Phase Starts} |
225 \caption{Before injection phase starts} |
226 \end{figure} |
226 \end{figure} |
227 |
227 |
228 |
228 |
229 \noindent |
229 \noindent |
230 After all derivatives have been taken, the stack grows to a maximum size |
230 After all derivatives have been taken, the stack grows to a maximum size |
322 which condenses all the partial lexing information |
322 which condenses all the partial lexing information |
323 (that was originally stored in $r_i, c_{i+1}$ pairs) |
323 (that was originally stored in $r_i, c_{i+1}$ pairs) |
324 into bitcodes. |
324 into bitcodes. |
325 The bitcodes are then carried with the regular |
325 The bitcodes are then carried with the regular |
326 expression, and augmented or moved around |
326 expression, and augmented or moved around |
327 as the lexing goes on. |
327 as the lexing proceeds. |
328 It becomes unnecessary |
328 It becomes unnecessary |
329 to remember all the |
329 to remember all the |
330 intermediate expresssions, but only the most recent one |
330 intermediate expresssions, but only the most recent one |
331 with this bit-carrying regular expression. |
331 with this bit-carrying regular expression. |
332 Annotated regular expressions |
332 Annotated regular expressions |
333 are defined as the following |
333 are defined as the following |
334 Isabelle datatype \footnote{ We use subscript notation to indicate |
334 Isabelle datatype:\footnote{ We use subscript notation to indicate |
335 that the bitcodes are auxiliary information that do not |
335 that the bitcodes are auxiliary information that do not |
336 interfere with the structure of the regular expressions }: |
336 interfere with the structure of the regular expressions } |
337 \begin{center} |
337 \begin{center} |
338 \begin{tabular}{lcl} |
338 \begin{tabular}{lcl} |
339 $\textit{a}$ & $::=$ & $\ZERO$\\ |
339 $\textit{a}$ & $::=$ & $\ZERO$\\ |
340 & $\mid$ & $_{bs}\ONE$\\ |
340 & $\mid$ & $_{bs}\ONE$\\ |
341 & $\mid$ & $_{bs}{\bf c}$\\ |
341 & $\mid$ & $_{bs}{\bf c}$\\ |
345 \end{tabular} |
345 \end{tabular} |
346 \end{center} |
346 \end{center} |
347 \noindent |
347 \noindent |
348 where $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular |
348 where $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular |
349 expressions and $as$ for lists of annotated regular expressions. |
349 expressions and $as$ for lists of annotated regular expressions. |
350 The alternative constructor, written, $\sum$, has been generalised to |
350 The alternative constructor, written $\sum$, has been generalised to |
351 take a list of annotated regular expressions rather than just two. |
351 take a list of annotated regular expressions rather than just two. |
352 Why is it generalised? This is because when we analyse nested |
352 Why is it generalised? This is because when we analyse nested |
353 alternatives, there can be more than two elements at the same level |
353 alternatives, there can be more than two elements at the same level |
354 after de-duplication, which can no longer be stored in a binary |
354 after de-duplication, which can no longer be stored in a binary |
355 constructor. |
355 constructor. |
367 during lexing. |
367 during lexing. |
368 For example, |
368 For example, |
369 $(_{Z}a+_{S}b) \backslash a$ gives us |
369 $(_{Z}a+_{S}b) \backslash a$ gives us |
370 $_{Z}\ONE + \ZERO$, where the $Z$ bitcode will |
370 $_{Z}\ONE + \ZERO$, where the $Z$ bitcode will |
371 later be used to decode that a left branch was |
371 later be used to decode that a left branch was |
372 selected at the time when the part $a+b$ was anallysed by |
372 selected at the time when the part $a+b$ was analysed by |
373 derivative. |
373 derivative. |
374 \subsection{A Bird's Eye View of the Bit-coded Lexer} |
374 \subsection{A Bird's Eye View of the Bit-coded Lexer} |
375 Before we give the details of the functions and definitions |
375 Before we give the details of the functions and definitions |
376 related to our |
376 related to our |
377 $\blexer$ (\emph{b}-itcoded lexer), we first provide a high-level |
377 $\blexer$ (\emph{b}-itcoded lexer), we first provide a high-level |
451 |
451 |
452 \end{tikzpicture} |
452 \end{tikzpicture} |
453 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand |
453 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand |
454 for the annotated regular expressions in each derivative step. |
454 for the annotated regular expressions in each derivative step. |
455 The characters used in each derivative is written as $c_i$. |
455 The characters used in each derivative is written as $c_i$. |
456 The relative size increase reflect the size increase in each derivative step.} |
456 The size of the derivatives typically increase during each derivative step, |
|
457 therefore we draw the larger circles to indicate that. |
|
458 The process starts with an internalise phase and concludes with a decoding phase.} |
457 \end{figure} |
459 \end{figure} |
458 \noindent |
460 \noindent |
459 The plain regular expressions |
461 The plain regular expressions |
460 are first ``lifted'' to an annotated regular expression, |
462 are first ``lifted'' to an annotated regular expression, |
461 with the function called \emph{internalise} ($r \rightarrow _{bs}a$). |
463 with the function called \emph{internalise} ($r \rightarrow _{bs}a$). |
474 The most notable improvements of $\blexer$ |
476 The most notable improvements of $\blexer$ |
475 over $\lexer$ are |
477 over $\lexer$ are |
476 \begin{itemize} |
478 \begin{itemize} |
477 \item |
479 \item |
478 |
480 |
479 An absence of the second injection phase. |
481 the absence of the second injection phase. |
480 \item |
482 \item |
481 One does not need to store each pair of the |
483 one does not need to store each pair of the |
482 intermediate regular expressions $_{bs_i}a_i$ and $c_{i+1}$. |
484 intermediate regular expressions $_{bs_i}a_i$ and $c_{i+1}$. |
483 The previous annotated regular expression $_{bs_i}a_i$'s information is passed |
485 The previous annotated regular expression $_{bs_i}a_i$'s information is passed |
484 on without loss to its successor $_{bs_{i+1}}a_{i+1}$, |
486 on without loss to its successor $_{bs_{i+1}}a_{i+1}$, |
485 and $c_{i+1}$'s information is stored in $L\;r$.\footnote{ |
487 and $c_{i+1}$'s information is stored in $L\;r$.\footnote{ |
486 which will be used during the decode phase, where we use $r$ as |
488 which will be used during the decode phase, where we use $r$ as |
487 a source of information.} |
489 a source of information.} |
488 \item |
490 \item |
489 Finally, simplification works much better--one can maintain correctness |
491 simplification works much smoother--one can maintain correctness |
490 while applying quite aggressive simplifications. |
492 while applying quite aggressive simplifications. |
491 \end{itemize} |
493 \end{itemize} |
492 %In the next section we will |
494 %In the next section we will |
493 %give the operations needed in $\blexer$, |
495 %give the operations needed in $\blexer$, |
494 %with some remarks on the idea behind their definitions. |
496 %with some remarks on the idea behind their definitions. |
553 $(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\ |
555 $(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\ |
554 $( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$ |
556 $( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$ |
555 \end{tabular} |
557 \end{tabular} |
556 \end{center} |
558 \end{center} |
557 \noindent |
559 \noindent |
558 We also abbreviate the $\erase\; a$ operation |
560 Where we abbreviate the $\erase\; a$ operation |
559 as $a_\downarrow$, for conciseness. |
561 as $(a)_\downarrow$, for conciseness. |
560 |
562 |
561 Determining whether an annotated regular expression |
563 Determining whether an annotated regular expression |
562 contains the empty string in its lauguage requires |
564 contains the empty string in its language requires |
563 a dedicated function $\bnullable$. |
565 a dedicated function $\bnullable$. |
564 In our formalisation this function is defined by simply calling $\erase$ |
566 In our formalisation this function is defined by simply calling $\erase$ |
565 before $\nullable$. |
567 before $\nullable$. |
566 $\bnullable$ simply calls $\erase$ before $\nullable$. |
|
567 \begin{definition} |
568 \begin{definition} |
568 $\bnullable \; a \dn \nullable \; (a_\downarrow)$ |
569 $\bnullable \; a \dn \nullable \; (a_\downarrow)$ |
569 \end{definition} |
570 \end{definition} |
570 The function for collecting |
571 The function for collecting |
571 bitcodes from a |
572 bitcodes from a |
595 from left to right and creates a bitcode by stitching |
596 from left to right and creates a bitcode by stitching |
596 together bitcodes obtained from the children expressions. |
597 together bitcodes obtained from the children expressions. |
597 In the case of alternative regular expressions, |
598 In the case of alternative regular expressions, |
598 it looks for the leftmost |
599 it looks for the leftmost |
599 $\nullable$ branch |
600 $\nullable$ branch |
600 to visit and ignores other siblings. |
601 to visit and ignores the other siblings. |
601 %Whenever there is some bitcodes attached to a |
602 %Whenever there is some bitcodes attached to a |
602 %node, it returns the bitcodes concatenated with whatever |
603 %node, it returns the bitcodes concatenated with whatever |
603 %child recursive calls return. |
604 %child recursive calls return. |
604 The only time when $\bmkeps$ creates new bitcodes |
605 The only time when $\bmkeps$ creates new bitcodes |
605 is when it completes a star's iterations by attaching a $S$ to the end of the bitcode |
606 is when it completes a star's iterations by attaching a $S$ to the end of the bitcode |
606 list it returns. |
607 list it returns. |
607 |
608 |
608 The bitcodes extracted by $\bmkeps$ need to be |
609 The bitcodes extracted by $\bmkeps$ need to be |
609 $\decode$d (with the guidance of a plain regular expression): |
610 $\decode$d (with the guidance of a ``plain'' regular expression): |
610 %\begin{definition}[Bitdecoding of Values]\mbox{} |
611 %\begin{definition}[Bitdecoding of Values]\mbox{} |
611 \begin{center} |
612 \begin{center} |
612 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
613 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
613 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
614 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
614 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
615 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
651 terminates is marked by $Z$. |
652 terminates is marked by $Z$. |
652 This coding is lossy, as it throws away the information about |
653 This coding is lossy, as it throws away the information about |
653 characters, and does not encode the ``boundary'' between two |
654 characters, and does not encode the ``boundary'' between two |
654 sequence values. Moreover, with only the bitcode we cannot even tell |
655 sequence values. Moreover, with only the bitcode we cannot even tell |
655 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$, |
656 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$, |
656 but this swill not be necessary in our proof. |
657 but this will not be necessary in our correctness proof. |
657 \begin{center} |
658 \begin{center} |
658 \begin{tabular}{lcl} |
659 \begin{tabular}{lcl} |
659 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
660 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
660 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
661 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
661 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ |
662 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ |
679 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \] |
680 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \] |
680 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition, |
681 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition, |
681 we obtain the property. |
682 we obtain the property. |
682 \end{proof} |
683 \end{proof} |
683 \noindent |
684 \noindent |
684 Now we define the central part of Sulzmann and Lu's lexing algorithm, |
685 Now we define the central part of Sulzmann and Lu's second lexing algorithm, |
685 the $\bder$ function (which stands for \emph{b}itcoded-derivative): |
686 the $\bder$ function (which stands for \emph{b}itcoded-derivative): |
686 \begin{center} |
687 \begin{center} |
687 \begin{tabular}{@{}lcl@{}} |
688 \begin{tabular}{@{}lcl@{}} |
688 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
689 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
689 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
690 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
705 \noindent |
706 \noindent |
706 The $\bder$ function tells us how regular expressions can be recursively traversed, |
707 The $\bder$ function tells us how regular expressions can be recursively traversed, |
707 where the bitcodes are augmented and carried around |
708 where the bitcodes are augmented and carried around |
708 when a derivative is taken. |
709 when a derivative is taken. |
709 We give the intuition behind some of the more involved cases in |
710 We give the intuition behind some of the more involved cases in |
710 $\bder$. \\ |
711 $\bder$. |
|
712 |
711 For example, |
713 For example, |
712 in the \emph{star} case, |
714 in the \emph{star} case, |
713 a derivative of $_{bs}a^*$ means |
715 a derivative of $_{bs}a^*$ means |
714 that one more star iteration needs to be taken. |
716 that one more star iteration needs to be taken. |
715 We therefore need to unfold it into a sequence, |
717 We therefore need to unfold it into a sequence, |
800 in $a_1$ before collapsing |
802 in $a_1$ before collapsing |
801 it (as it has been fully matched by string prior to $c$), |
803 it (as it has been fully matched by string prior to $c$), |
802 and attach the collected bit-codes to the front of $a_2$ |
804 and attach the collected bit-codes to the front of $a_2$ |
803 before throwing away $a_1$. |
805 before throwing away $a_1$. |
804 In the injection-based lexer, $r_1$ is immediately thrown away in |
806 In the injection-based lexer, $r_1$ is immediately thrown away in |
805 in the $\textit{if}$ branch, |
807 the $\textit{if}$ branch, |
806 the information $r_1$ stores is therefore lost: |
808 the information $r_1$ stores is therefore lost: |
807 \begin{center} |
809 \begin{center} |
808 \begin{tabular}{lcl} |
810 \begin{tabular}{lcl} |
809 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\ |
811 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\ |
810 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
812 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
852 with respect to the input string $s$, and |
854 with respect to the input string $s$, and |
853 then test whether the result is (b)nullable. |
855 then test whether the result is (b)nullable. |
854 If yes, then extract the bitcodes from the nullable expression, |
856 If yes, then extract the bitcodes from the nullable expression, |
855 and decodes the bitcodes into a lexical value. |
857 and decodes the bitcodes into a lexical value. |
856 If there does not exists a match between $r$ and $s$ the lexer |
858 If there does not exists a match between $r$ and $s$ the lexer |
857 outputs $\None$ indicating a mismatch.\\ |
859 outputs $\None$ indicating a mismatch. |
858 Ausaf and Urban formally proved the correctness of the $\blexer$, namely |
860 Ausaf and Urban formally proved the correctness of the $\blexer$, namely |
859 \begin{property} |
861 \begin{property} |
860 $\blexer \;r \; s = \lexer \; r \; s$. |
862 $\blexer \;r \; s = \lexer \; r \; s$. |
861 \end{property} |
863 \end{property} |
862 \noindent |
864 \noindent |
864 We introduce the proof later, after we give all |
866 We introduce the proof later, after we give all |
865 the needed auxiliary functions and definitions. |
867 the needed auxiliary functions and definitions. |
866 \subsection{An Example $\blexer$ Run} |
868 \subsection{An Example $\blexer$ Run} |
867 Before introducing the proof we shall first walk |
869 Before introducing the proof we shall first walk |
868 through a concrete example of our $\blexer$ calculating the right |
870 through a concrete example of our $\blexer$ calculating the right |
869 lexical information through bit-coded regular expressions.\\ |
871 lexical information through bit-coded regular expressions. |
|
872 |
870 Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$. |
873 Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$. |
871 We give again the bird's eye view of this particular example |
874 We give again the bird's eye view of this particular example |
872 in each stage of the algorithm: |
875 in each stage of the algorithm: |
873 |
876 |
874 \tikzset{three sided/.style={ |
877 \tikzset{three sided/.style={ |
925 \path (bs) |
928 \path (bs) |
926 edge [] node {$\decode$} (v); |
929 edge [] node {$\decode$} (v); |
927 |
930 |
928 |
931 |
929 \end{tikzpicture} |
932 \end{tikzpicture} |
930 \caption{$\blexer \;\;\;\; (aa)^*(b+c) \;\;\;\; aab$} |
933 \caption{$\blexer$ with the regular expression $(aa)^*(b+c)$ and $aab$} |
931 \end{figure} |
934 \end{figure} |
932 \noindent |
935 \noindent |
933 The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$ |
936 The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$ |
934 turned into $ZS$ after derivative w.r.t $b$ |
937 turned into $ZS$ after derivative w.r.t $b$ |
935 is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$ |
938 is taken, which calls $\bmkeps$ on the nullable $_Z\ONE\cdot (aa)^*$ |
936 before processing $_Zb+_Sc$.\\ |
939 before processing $_Zb+_Sc$. |
|
940 |
937 The annotated regular expressions |
941 The annotated regular expressions |
938 would look overwhelming if we explicitly indicate all the |
942 would look overwhelming if we explicitly indicate all the |
939 locations where bitcodes are attached. |
943 locations where bitcodes are attached. |
940 For example, |
944 For example, |
941 $(aa)^*\cdot (b+c)$ would |
945 $(aa)^*\cdot (b+c)$ would |
942 look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$ |
946 look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$ |
943 after |
947 after |
944 internalise. |
948 internalise. |
945 Therefore for readability we omit bitcodes if they are empty. |
949 Therefore for readability we omit bitcodes if they are empty. |
946 This applies to all annotated |
950 This applies to all annotated |
947 regular expressions in this thesis.\\ |
951 regular expressions in this thesis. |
|
952 |
948 %and assume we have just read the first character $a$: |
953 %and assume we have just read the first character $a$: |
949 %\begin{center} |
954 %\begin{center} |
950 %\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
955 %\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
951 % \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
956 % \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
952 % {$(_{[Z]}(\ONE \cdot a) \cdot (aa)^* )\cdot bc$ |
957 % {$(_{[Z]}(\ONE \cdot a) \cdot (aa)^* )\cdot bc$ |
1028 % \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
1033 % \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] |
1029 % {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$ |
1034 % {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$ |
1030 % \nodepart{two} EOF}; |
1035 % \nodepart{two} EOF}; |
1031 %\end{tikzpicture} |
1036 %\end{tikzpicture} |
1032 %\end{center} |
1037 %\end{center} |
1033 \noindent |
1038 |
1034 In the next section we introduce the correctness proof |
1039 In the next section we introduce the correctness proof |
1035 found by Ausaf and Urban |
1040 found by Ausaf and Urban |
1036 of the bitcoded lexer. |
1041 of the bitcoded lexer. |
1037 %----------------------------------- |
1042 %----------------------------------- |
1038 % SUBSECTION 1 |
1043 % SUBSECTION 1 |
1065 the lexing process, which is not be possible |
1070 the lexing process, which is not be possible |
1066 using $\lexer$ or $\blexer$ alone. |
1071 using $\lexer$ or $\blexer$ alone. |
1067 \subsubsection{$\textit{Retrieve}$} |
1072 \subsubsection{$\textit{Retrieve}$} |
1068 The first function we shall introduce is $\retrieve$. |
1073 The first function we shall introduce is $\retrieve$. |
1069 Sulzmann and Lu gave its definition, and |
1074 Sulzmann and Lu gave its definition, and |
1070 Ausaf and Urban found |
1075 Ausaf and Urban found it |
1071 useful in their mechanised proofs. |
1076 useful in their mechanised proofs. |
1072 Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$, |
1077 Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$, |
1073 after all the derivatives have been taken: |
1078 after all the derivatives have been taken: |
1074 \begin{center} |
1079 \begin{center} |
1075 \begin{tabular}{lcl} |
1080 \begin{tabular}{lcl} |
1123 These pairs all contain adequate information to |
1128 These pairs all contain adequate information to |
1124 obtain the final lexing result. |
1129 obtain the final lexing result. |
1125 For example, in the leftmost pair the |
1130 For example, in the leftmost pair the |
1126 lexical information is condensed in |
1131 lexical information is condensed in |
1127 $v_0$, whereas for the rightmost pair, |
1132 $v_0$, whereas for the rightmost pair, |
1128 the lexing result hides is in the bitcodes of $a_n$.\\ |
1133 the lexing result hides is in the bitcodes of $a_n$. |
|
1134 |
1129 The function $\bmkeps$ is able to extract from $a_n$ the result |
1135 The function $\bmkeps$ is able to extract from $a_n$ the result |
1130 by looking for nullable parts of the regular expression. |
1136 by looking for nullable parts of the regular expression. |
1131 However for regular expressions $a_i$ in general, |
1137 However for regular expressions $a_i$ in general, |
1132 they might not necessarily be nullable. |
1138 they might not necessarily be nullable. |
1133 For those regular expressions that are not nullable, |
1139 For those regular expressions that are not nullable, |
1243 The function $\retrieve$ allows connecting |
1249 The function $\retrieve$ allows connecting |
1244 between the intermediate |
1250 between the intermediate |
1245 results $a_i$ and $a_{i+1}$ in $\blexer$. |
1251 results $a_i$ and $a_{i+1}$ in $\blexer$. |
1246 For plain regular expressions something similar is required. |
1252 For plain regular expressions something similar is required. |
1247 |
1253 |
1248 \subsection{$\flex$} |
1254 \subsection{The Function $\flex$} |
1249 Ausaf and Urban defined an auxiliary function called $\flex$ for $\lexer$, |
1255 Ausaf and Urban defined an auxiliary function called $\flex$ for $\lexer$, |
1250 defined as |
1256 defined as |
1251 \begin{center} |
1257 \begin{center} |
1252 \begin{tabular}{lcl} |
1258 \begin{tabular}{lcl} |
1253 $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ |
1259 $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ |
1254 $\flex \; r \; f \; (c :: s) \; v$ & $=$ & $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$ |
1260 $\flex \; r \; f \; (c :: s) \; v$ & $=$ & $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$ |
1255 \end{tabular} |
1261 \end{tabular} |
1256 \end{center} |
1262 \end{center} |
1257 which accumulates the characters that need to be injected back, |
1263 which accumulates the characters that need to be injected back, |
1258 and does the injection in a stack-like manner (last taken derivative first injected). |
1264 and does the injection in a stack-like manner (the last taken derivative is first injected). |
1259 The function |
1265 The function |
1260 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression |
1266 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression |
1261 $r$, the identity function $id$, the input string $s$ and the value |
1267 $r$, the identity function $id$, the input string $s$ and the value |
1262 $v_n= \mkeps \; (r\backslash s)$: |
1268 $v_n= \mkeps \; (r\backslash s)$: |
1263 \begin{lemma} |
1269 \begin{lemma} |
1289 %---------------------------------------------------------------------------------------- |
1295 %---------------------------------------------------------------------------------------- |
1290 % SECTION correctness proof |
1296 % SECTION correctness proof |
1291 %---------------------------------------------------------------------------------------- |
1297 %---------------------------------------------------------------------------------------- |
1292 \section{Correctness of the Bit-coded Algorithm (Without Simplification)} |
1298 \section{Correctness of the Bit-coded Algorithm (Without Simplification)} |
1293 We can first show that |
1299 We can first show that |
1294 $\flex$ and $\blexer$ calculates the same thing. |
1300 $\flex$ and $\blexer$ calculates the same value. |
1295 \begin{lemma}\label{flex_retrieve} |
1301 \begin{lemma}\label{flex_retrieve} |
1296 If $\vdash v: (r\backslash s)$, then\\ |
1302 If $\vdash v: (r\backslash s)$, then\\ |
1297 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
1303 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
1298 \end{lemma} |
1304 \end{lemma} |
1299 \begin{proof} |
1305 \begin{proof} |