12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of |
12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of |
13 derivatives of |
13 derivatives of |
14 regular expressions. |
14 regular expressions. |
15 We have implemented their algorithm in Scala and Isabelle, |
15 We have implemented their algorithm in Scala and Isabelle, |
16 and found problems |
16 and found problems |
17 in their algorithm such as de-duplication not working properly and redundant |
17 in their algorithm, such as de-duplication not working properly and redundant |
18 fixpoint construction. |
18 fixpoint construction. |
19 \section{The Motivation Behind Using Bitcodes} |
19 \section{The Motivation Behind Using Bitcodes} |
20 Let us give again the definition of $\lexer$ from Chapter \ref{Inj}: |
20 Let us give again the definition of $\lexer$ from Chapter \ref{Inj}: |
21 \begin{center} |
21 \begin{center} |
22 \begin{tabular}{lcl} |
22 \begin{tabular}{lcl} |
111 |
111 |
112 \end{tikzpicture} |
112 \end{tikzpicture} |
113 \caption{Second derivative taken} |
113 \caption{Second derivative taken} |
114 \end{figure} |
114 \end{figure} |
115 \noindent |
115 \noindent |
116 As the number of derivative steps increase, |
116 As the number of derivative steps increases, |
117 the stack would increase: |
117 the stack would increase: |
118 |
118 |
119 \begin{figure}[H] |
119 \begin{figure}[H] |
120 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] |
120 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] |
121 %\draw (-6,-6) grid (6,6); |
121 %\draw (-6,-6) grid (6,6); |
325 The bitcodes are then carried with the regular |
325 The bitcodes are then carried with the regular |
326 expression, and augmented or moved around |
326 expression, and augmented or moved around |
327 as the lexing proceeds. |
327 as the lexing proceeds. |
328 It becomes unnecessary |
328 It becomes unnecessary |
329 to remember all the |
329 to remember all the |
330 intermediate expresssions, but only the most recent one |
330 intermediate expressions, but only the most recent one |
331 with this bit-carrying regular expression. |
331 with this bit-carrying regular expression. |
332 Annotated regular expressions |
332 Annotated regular expressions |
333 are defined as the following |
333 are defined as the following |
334 Isabelle datatype:\footnote{ We use subscript notation to indicate |
334 Isabelle datatype:\footnote{ We use subscript notation to indicate |
335 that the bitcodes are auxiliary information that do not |
335 that the bitcodes are auxiliary information that does not |
336 interfere with the structure of the regular expressions } |
336 interfere with the structure of the regular expressions } |
337 \begin{center} |
337 \begin{center} |
338 \begin{tabular}{lcl} |
338 \begin{tabular}{lcl} |
339 $\textit{a}$ & $::=$ & $\ZERO$\\ |
339 $\textit{a}$ & $::=$ & $\ZERO$\\ |
340 & $\mid$ & $_{bs}\ONE$\\ |
340 & $\mid$ & $_{bs}\ONE$\\ |
450 |
450 |
451 |
451 |
452 \end{tikzpicture} |
452 \end{tikzpicture} |
453 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand |
453 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand |
454 for the annotated regular expressions in each derivative step. |
454 for the annotated regular expressions in each derivative step. |
455 The characters used in each derivative is written as $c_i$. |
455 The characters used in each derivative are written as $c_i$. |
456 The size of the derivatives typically increase during each derivative step, |
456 The size of the derivatives typically increases during each derivative step. |
457 therefore we draw the larger circles to indicate that. |
|
458 The process starts with an internalise phase and concludes with a decoding phase.} |
457 The process starts with an internalise phase and concludes with a decoding phase.} |
459 \end{figure} |
458 \end{figure} |
460 \noindent |
459 \noindent |
461 The plain regular expressions |
460 The plain regular expressions |
462 are first ``lifted'' to an annotated regular expression, |
461 are first ``lifted'' to an annotated regular expression, |
538 \end{center} |
537 \end{center} |
539 \noindent |
538 \noindent |
540 The opposite of $\textit{internalise}$ is |
539 The opposite of $\textit{internalise}$ is |
541 $\erase$, where all bit-codes are removed, |
540 $\erase$, where all bit-codes are removed, |
542 and the alternative operator $\sum$ for annotated |
541 and the alternative operator $\sum$ for annotated |
543 regular expressions is transformed to the binary version |
542 regular expressions is transformed into the binary version |
544 in (plain) regular expressions. This can be defined as follows: |
543 in (plain) regular expressions. This can be defined as follows: |
545 \begin{center}\label{eraseDef} |
544 \begin{center}\label{eraseDef} |
546 \begin{tabular}{lcl} |
545 \begin{tabular}{lcl} |
547 $\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\ |
546 $\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\ |
548 $( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\ |
547 $( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\ |
646 %because at least one of $\decode'$'s parameters will go down in terms |
645 %because at least one of $\decode'$'s parameters will go down in terms |
647 %of size.\\ |
646 %of size.\\ |
648 The reverse operation of $\decode$ is $\code$. |
647 The reverse operation of $\decode$ is $\code$. |
649 This function encodes a value into a bitcode by converting |
648 This function encodes a value into a bitcode by converting |
650 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty |
649 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty |
651 star iteration by $S$. The border where a star iteration |
650 star iteration by $S$. $ Z$ marks the border where a star iteration |
652 terminates is marked by $Z$. |
651 terminates. |
653 This coding is lossy, as it throws away the information about |
652 This coding is lossy, as it throws away the information about |
654 characters, and does not encode the ``boundary'' between two |
653 characters, and does not encode the ``boundary'' between two |
655 sequence values. Moreover, with only the bitcode we cannot even tell |
654 sequence values. Moreover, with only the bitcode we cannot even tell |
656 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$, |
655 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$, |
657 but this will not be necessary in our correctness proof. |
656 but this will not be necessary for our correctness proof. |
658 \begin{center} |
657 \begin{center} |
659 \begin{tabular}{lcl} |
658 \begin{tabular}{lcl} |
660 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
659 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
661 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
660 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
662 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ |
661 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ |
676 \[If \vdash v : r \; then \;\decode \; (\code \; v) \; r = \textit{Some}(v) \] |
675 \[If \vdash v : r \; then \;\decode \; (\code \; v) \; r = \textit{Some}(v) \] |
677 \end{lemma} |
676 \end{lemma} |
678 \begin{proof} |
677 \begin{proof} |
679 By proving a more general version of the lemma, on $\decode'$: |
678 By proving a more general version of the lemma, on $\decode'$: |
680 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \] |
679 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \] |
681 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition, |
680 Then we obtain the property by setting $ds$ to be $[]$ and unfolding the $\decode$ definition. |
682 we obtain the property. |
|
683 \end{proof} |
681 \end{proof} |
684 \noindent |
682 \noindent |
685 Now we define the central part of Sulzmann and Lu's second lexing algorithm, |
683 Now we define the central part of Sulzmann and Lu's second lexing algorithm, |
686 the $\bder$ function (which stands for \emph{b}itcoded-derivative): |
684 the $\bder$ function (which stands for \emph{b}itcoded-derivative): |
687 \begin{center} |
685 \begin{center} |
854 plain regular expression $r$, and then builds successive derivatives |
852 plain regular expression $r$, and then builds successive derivatives |
855 with respect to the input string $s$, and |
853 with respect to the input string $s$, and |
856 then test whether the result is (b)nullable. |
854 then test whether the result is (b)nullable. |
857 If yes, then extract the bitcodes from the nullable expression, |
855 If yes, then extract the bitcodes from the nullable expression, |
858 and decodes the bitcodes into a lexical value. |
856 and decodes the bitcodes into a lexical value. |
859 If there does not exists a match between $r$ and $s$ the lexer |
857 If there does not exist a match between $r$ and $s$, the lexer |
860 outputs $\None$ indicating a mismatch. |
858 outputs $\None$ indicating a mismatch. |
861 Ausaf and Urban formally proved the correctness of the $\blexer$, namely |
859 Ausaf and Urban formally proved the correctness of the $\blexer$, namely |
862 \begin{property} |
860 \begin{property} |
863 $\blexer \;r \; s = \lexer \; r \; s$. |
861 $\blexer \;r \; s = \lexer \; r \; s$. |
864 \end{property} |
862 \end{property} |
1066 We will elaborate on this, with the help of |
1064 We will elaborate on this, with the help of |
1067 some helper functions such as $\retrieve$ and |
1065 some helper functions such as $\retrieve$ and |
1068 $\flex$. |
1066 $\flex$. |
1069 \subsection{Specifications of Some Helper Functions} |
1067 \subsection{Specifications of Some Helper Functions} |
1070 The functions we introduce will give a more detailed view into |
1068 The functions we introduce will give a more detailed view into |
1071 the lexing process, which is not be possible |
1069 the lexing process, which is not possible |
1072 using $\lexer$ or $\blexer$ alone. |
1070 using $\lexer$ or $\blexer$ alone. |
1073 \subsubsection{$\textit{Retrieve}$} |
1071 \subsubsection{$\textit{Retrieve}$} |
1074 The first function we shall introduce is $\retrieve$. |
1072 The first function we shall introduce is $\retrieve$. |
1075 Sulzmann and Lu gave its definition, and |
1073 Sulzmann and Lu gave its definition, and |
1076 Ausaf and Urban found it |
1074 Ausaf and Urban found it |
1230 provided that the corresponding values are used respectively: |
1228 provided that the corresponding values are used respectively: |
1231 \begin{lemma}\label{retrieveStepwise} |
1229 \begin{lemma}\label{retrieveStepwise} |
1232 $\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
1230 $\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
1233 \end{lemma} |
1231 \end{lemma} |
1234 \begin{proof} |
1232 \begin{proof} |
1235 By induction on $r$, generalising over $v$. |
1233 We do an induction on $r$, generalising over $v$. |
1236 The induction principle in our formalisation |
1234 The induction principle in our formalisation |
1237 is function $\erase$'s cases. |
1235 is function $\erase$'s cases. |
1238 \end{proof} |
1236 \end{proof} |
1239 \noindent |
1237 \noindent |
1240 Before we move on to the next |
1238 Before we move on to the next |
1260 $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ |
1258 $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ |
1261 $\flex \; r \; f \; (c :: s) \; v$ & $=$ & $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$ |
1259 $\flex \; r \; f \; (c :: s) \; v$ & $=$ & $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$ |
1262 \end{tabular} |
1260 \end{tabular} |
1263 \end{center} |
1261 \end{center} |
1264 which accumulates the characters that need to be injected back, |
1262 which accumulates the characters that need to be injected back, |
1265 and does the injection in a stack-like manner (the last taken derivative is first injected). |
1263 and does the injection in a stack-like manner (the last character being chopped off in the derivatives phase is first injected). |
1266 The function |
1264 The function |
1267 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression |
1265 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression |
1268 $r$, the identity function $id$, the input string $s$ and the value |
1266 $r$, the identity function $id$, the input string $s$ and the value |
1269 $v_n= \mkeps \; (r\backslash s)$: |
1267 $v_n= \mkeps \; (r\backslash s)$: |
1270 \begin{lemma} |
1268 \begin{lemma} |