thys3/Paper.thy
changeset 616 8907d4b6316d
parent 615 8881a09a06fd
child 617 3ea6a38c33b5
equal deleted inserted replaced
615:8881a09a06fd 616:8907d4b6316d
   197 does not achieve the intended objective with bitcoded regular expressions.  The
   197 does not achieve the intended objective with bitcoded regular expressions.  The
   198 reason is that in the bitcoded setting, each copy generally has a
   198 reason is that in the bitcoded setting, each copy generally has a
   199 different bitcode annotation---so @{text nub} would never ``fire''.
   199 different bitcode annotation---so @{text nub} would never ``fire''.
   200 Inspired by Scala's library for lists, we shall instead use a @{text
   200 Inspired by Scala's library for lists, we shall instead use a @{text
   201 distinctWith} function that finds duplicates under an ``erasing'' function
   201 distinctWith} function that finds duplicates under an ``erasing'' function
   202 which deletes bitcodes before comparing regular expressions.
   202 that deletes bitcodes before comparing regular expressions.
   203 We shall also introduce our \emph{own} arguments and definitions for
   203 We shall also introduce our \emph{own} arguments and definitions for
   204 establishing the correctness of the bitcoded algorithm when 
   204 establishing the correctness of the bitcoded algorithm when 
   205 simplifications are included. Finally we
   205 simplifications are included. Finally we
   206 establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions
   206 establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions
   207 %of regular expressions and describe the definition
   207 %of regular expressions and describe the definition
   374   \end{center}  
   374   \end{center}  
   375 
   375 
   376   \noindent where we use @{term vs} to stand for a list of values. The
   376   \noindent where we use @{term vs} to stand for a list of values. The
   377   string underlying a value can be calculated by a @{const flat}
   377   string underlying a value can be calculated by a @{const flat}
   378   function, written @{term "flat DUMMY"}. It traverses a value and
   378   function, written @{term "flat DUMMY"}. It traverses a value and
   379   collects the characters contained in it \cite{AusafDyckhoffUrban2016}.
   379   collects the characters contained in it (see \cite{AusafDyckhoffUrban2016}).
   380 
   380 
   381 
   381 
   382   Sulzmann and Lu also define inductively an
   382   Sulzmann and Lu also define inductively an
   383   inhabitation relation that associates values to regular expressions. Our
   383   inhabitation relation that associates values to regular expressions. Our
   384   version of this relation is defined the following six rules for the values:
   384   version of this relation is defined by the following six rules:
   385   %
   385   %
   386   \begin{center}
   386   \begin{center}
   387   \begin{tabular}{@ {}l@ {}}
   387   \begin{tabular}{@ {}l@ {}}
   388   @{thm[mode=Axiom] Prf.intros(4)}\quad
   388   @{thm[mode=Axiom] Prf.intros(4)}\quad
   389   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad
   389   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad
   414   only contains values that flatten to the empty string. 
   414   only contains values that flatten to the empty string. 
   415   It is routine to establish how values ``inhabiting'' a regular
   415   It is routine to establish how values ``inhabiting'' a regular
   416   expression correspond to the language of a regular expression, namely
   416   expression correspond to the language of a regular expression, namely
   417   @{thm L_flat_Prf}.
   417   @{thm L_flat_Prf}.
   418 
   418 
   419   In general there is more than one value inhabited by a regular
   419   In general there is more than one value inhabiting a regular
   420   expression (meaning regular expressions can typically match more
   420   expression (meaning regular expressions can typically match more
   421   than one string). But even when fixing a string from the language of the
   421   than one string). But even when fixing a string from the language of the
   422   regular expression, there are generally more than one way of how the
   422   regular expression, there are generally more than one way of how the
   423   regular expression can match this string. POSIX lexing is about
   423   regular expression can match this string. POSIX lexing is about
   424   identifying the unique value for a given regular expression and a
   424   identifying the unique value for a given regular expression and a
   430   %	Unix-like operating systems \cite{POSIX}.}
   430   %	Unix-like operating systems \cite{POSIX}.}
   431   Sometimes these
   431   Sometimes these
   432   informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
   432   informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
   433   One contribution of our earlier paper is to give a convenient
   433   One contribution of our earlier paper is to give a convenient
   434  specification for what POSIX values are (the inductive rules are shown in
   434  specification for what POSIX values are (the inductive rules are shown in
   435   Figure~\ref{POSIXrules}).
   435   Fig~\ref{POSIXrules}).
   436 
   436 
   437 \begin{figure}[t]
   437 \begin{figure}[t]
   438   \begin{center}\small%
   438   \begin{center}\small%
   439   \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
   439   \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
   440   \\[-9mm]
   440   \\[-4.5mm]
   441   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
   441   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
   442   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
   442   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
   443   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
   443   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
   444   @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
   444   @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
   445   $\mprset{flushleft}
   445   $\mprset{flushleft}
   713   regular expression \emph{and} a bitsequence of a corresponding value,
   713   regular expression \emph{and} a bitsequence of a corresponding value,
   714   then we can always decode the value accurately~(see Fig.~\ref{decode}).
   714   then we can always decode the value accurately~(see Fig.~\ref{decode}).
   715   The function \textit{decode} checks whether all of the bitsequence is
   715   The function \textit{decode} checks whether all of the bitsequence is
   716   consumed and returns the corresponding value as @{term "Some v"}; otherwise
   716   consumed and returns the corresponding value as @{term "Some v"}; otherwise
   717   it fails with @{text "None"}. We can establish that for a value $v$
   717   it fails with @{text "None"}. We can establish that for a value $v$
   718   inhabited by a regular expression $r$, the decoding of its
   718   inhabiting a regular expression $r$, the decoding of its
   719   bitsequence never fails (see also \cite{NielsenHenglein2011}).
   719   bitsequence never fails (see also \cite{NielsenHenglein2011}).
   720 
   720 
   721   %The decoding can be
   721   %The decoding can be
   722   %defined by using two functions called $\textit{decode}'$ and
   722   %defined by using two functions called $\textit{decode}'$ and
   723   %\textit{decode}:
   723   %\textit{decode}:
   797   \end{tabular}
   797   \end{tabular}
   798   \end{minipage}\medskip
   798   \end{minipage}\medskip
   799   %!\end{center}    
   799   %!\end{center}    
   800 
   800 
   801   \noindent
   801   \noindent
       
   802   This function ``fuses'' a bitsequence to the topmost constructor of an bitcoded regular expressions. 
   802   A regular expression can then be \emph{internalised} into a bitcoded
   803   A regular expression can then be \emph{internalised} into a bitcoded
   803   regular expression as follows:
   804   regular expression as follows:
   804   %
   805   %
   805   %!\begin{center}
   806   %!\begin{center}
   806   %!\begin{tabular}{@ {}c@ {\hspace{2mm}}c@ {\hspace{2mm}}c@ {}}
   807   %!\begin{tabular}{@ {}c@ {\hspace{2mm}}c@ {\hspace{2mm}}c@ {}}
  1040 derivative $r\backslash s$ and another time for stacking up injection
  1041 derivative $r\backslash s$ and another time for stacking up injection
  1041 functions), it helps us with proving
  1042 functions), it helps us with proving
  1042 that incrementally building up values in @{text blexer} generates the same result.
  1043 that incrementally building up values in @{text blexer} generates the same result.
  1043 
  1044 
  1044 This brings us to our main lemma in this section: if we calculate a
  1045 This brings us to our main lemma in this section: if we calculate a
  1045 derivative, say $r\backslash s$, and have a value, say $v$, inhabited
  1046 derivative, say $r\backslash s$, and have a value, say $v$, inhabiting
  1046 by this derivative, then we can produce the result @{text lexer} generates
  1047 this derivative, then we can produce the result @{text lexer} generates
  1047 by applying this value to the stacked-up injection functions
  1048 by applying this value to the stacked-up injection functions
  1048 that $\textit{flex}$ assembles. The lemma establishes that this is the same
  1049 that $\textit{flex}$ assembles. The lemma establishes that this is the same
  1049 value as if we build the annotated derivative $r^\uparrow\backslash s$
  1050 value as if we build the annotated derivative $r^\uparrow\backslash s$
  1050 and then retrieve the corresponding bitcoded version, followed by a
  1051 and then retrieve the bitcoded version of @{text v}, followed by a
  1051 decoding step.
  1052 decoding step.
  1052 
  1053 
  1053 \begin{lemma}[Main Lemma]\label{mainlemma}\mbox{}\\\it
  1054 \begin{lemma}[Main Lemma]\label{mainlemma}\mbox{}\\\it
  1054 If $\vdash v : r\backslash s$ then 
  1055 If $\vdash v : r\backslash s$ then 
  1055 $\textit{Some}\,(\textit{flex}\,r\,\textit{id}\,s\,v) =
  1056 $\textit{Some}\,(\textit{flex}\,r\,\textit{id}\,s\,v) =
  1129      de-nest, or spill out, @{text ALTs} as follows
  1130      de-nest, or spill out, @{text ALTs} as follows
  1130      %
  1131      %
  1131      \[
  1132      \[
  1132      @{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"}
  1133      @{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"}
  1133      \quad\xrightarrow{bsimp}\quad
  1134      \quad\xrightarrow{bsimp}\quad
  1134      @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
  1135      @{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
  1135      \]
  1136      \]
  1136 
  1137 
  1137      \noindent
  1138      \noindent
  1138      where we just need to fuse the bitsequence that has accumulated in @{text "bs\<^sub>2"}
  1139      where we just need to fuse the bitsequence that has accumulated in @{text "bs\<^sub>2"}
  1139      to the alternatives in @{text "rs\<^sub>2"}. As we shall show below this will
  1140      to the alternatives in @{text "rs\<^sub>2"}. As we shall show below this will
  1176      \end{tabular}
  1177      \end{tabular}
  1177      \end{center}
  1178      \end{center}
  1178 
  1179 
  1179      \noindent where we scan the list from left to right (because we
  1180      \noindent where we scan the list from left to right (because we
  1180      have to remove later copies). In @{text distinctWith}, @{text eq} is intended to be an
  1181      have to remove later copies). In @{text distinctWith}, @{text eq} is intended to be an
  1181      equivalence relation for annotated regular expressions and @{text acc} is an accumulator for annotated regular
  1182      equivalence relation for bitcoded regular expressions and @{text acc} is an accumulator for bitcoded regular
  1182      expressions---essentially a set of regular expressions that we have already seen
  1183      expressions---essentially a set of regular expressions that we have already seen
  1183      while scanning the list. Therefore we delete an element, say @{text x},
  1184      while scanning the list. Therefore we delete an element, say @{text x},
  1184      from the list provided a @{text "y"} with @{text "y"} being equivalent to @{text x} is already in the accumulator;
  1185      from the list provided a @{text "y"} with @{text "y"} being equivalent to @{text x} is already in the accumulator;
  1185      otherwise we keep @{text x} and scan the rest of the list but 
  1186      otherwise we keep @{text x} and scan the rest of the list but 
  1186      add @{text "x"} as another ``seen'' element to @{text acc}. We will use
  1187      add @{text "x"} as another ``seen'' element to @{text acc}. We will use
  1187      @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
  1188      @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
  1188      before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
  1189      before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
  1189      annotated regular expressions to @{text bool}:
  1190      bitcoded regular expressions to @{text bool}:
  1190      %
  1191      %
  1191      \begin{center}
  1192      \begin{center}
  1192      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
  1193      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
  1193      @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\
  1194      @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\
  1194      @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\
  1195      @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\
  1283 
  1284 
  1284      Next we can include simplification after each derivative step leading to the
  1285      Next we can include simplification after each derivative step leading to the
  1285      following notion of bitcoded derivatives:
  1286      following notion of bitcoded derivatives:
  1286      
  1287      
  1287      \begin{center}
  1288      \begin{center}
  1288       \begin{tabular}{cc}
  1289       \begin{tabular}{c@ {\hspace{10mm}}c}
  1289       \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1290       \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1290       @{thm (lhs) bders_simp.simps(1)} & $\dn$ & @{thm (rhs) bders_simp.simps(1)}
  1291       @{thm (lhs) bders_simp.simps(1)} & $\dn$ & @{thm (rhs) bders_simp.simps(1)}
  1291       \end{tabular}
  1292       \end{tabular}
  1292       &
  1293       &
  1293       \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1294       \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1301 
  1302 
  1302      \begin{center}
  1303      \begin{center}
  1303 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  1304 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  1304   $\textit{blexer}^+\;r\,s$ & $\dn$ &
  1305   $\textit{blexer}^+\;r\,s$ & $\dn$ &
  1305       $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
  1306       $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
  1306       & & $\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
  1307       & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
  1307       & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\
  1308       & & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\;\textit{else}\;\textit{None}$
  1308       & & $\;\;\textit{else}\;\textit{None}$
       
  1309 \end{tabular}
  1309 \end{tabular}
  1310 \end{center}
  1310 \end{center}
  1311 
  1311 
  1312        \noindent The remaining task is to show that @{term blexer} and
  1312        \noindent
       
  1313        Note that in $\textit{blexer}^+$ the derivative $r_{der}$ is calculated
       
  1314        using the simplifying derivative $\_\,\backslash_{bsimp}\,\_$.
       
  1315        The remaining task is to show that @{term blexer} and
  1313        @{term "blexer_simp"} generate the same answers.
  1316        @{term "blexer_simp"} generate the same answers.
  1314 
  1317 
  1315        When we first
  1318        When we first
  1316        attempted this proof we encountered a problem with the idea
  1319        attempted this proof we encountered a problem with the idea
  1317        in Sulzmann and Lu's paper where the argument seems to be to appeal
  1320        in Sulzmann and Lu's paper where the argument seems to be to appeal
  1339      synchronise the change in the simplified regular expressions with
  1342      synchronise the change in the simplified regular expressions with
  1340      the original POSIX value, there is no hope of appealing to @{text retrieve} in the
  1343      the original POSIX value, there is no hope of appealing to @{text retrieve} in the
  1341      correctness argument for @{term blexer_simp}.
  1344      correctness argument for @{term blexer_simp}.
  1342 
  1345 
  1343      We found it more helpful to introduce the rewriting systems shown in
  1346      We found it more helpful to introduce the rewriting systems shown in
  1344      Figure~\ref{SimpRewrites}. The idea is to generate 
  1347      Fig~\ref{SimpRewrites}. The idea is to generate 
  1345      simplified regular expressions in small steps (unlike the @{text bsimp}-function which
  1348      simplified regular expressions in small steps (unlike the @{text bsimp}-function which
  1346      does the same in a big step), and show that each of
  1349      does the same in a big step), and show that each of
  1347      the small steps preserves the bitcodes that lead to the final POSIX value.
  1350      the small steps preserves the bitcodes that lead to the POSIX value.
  1348      The rewrite system is organised such that $\leadsto$ is for bitcoded regular
  1351      The rewrite system is organised such that $\leadsto$ is for bitcoded regular
  1349      expressions and $\stackrel{s}{\leadsto}$ for lists of bitcoded regular
  1352      expressions and $\stackrel{s}{\leadsto}$ for lists of bitcoded regular
  1350      expressions. The former essentially implements the simplifications of
  1353      expressions. The former essentially implements the simplifications of
  1351      @{text "bsimpSEQ"} and @{text flts}; while the latter implements the
  1354      @{text "bsimpSEQ"} and @{text flts}; while the latter implements the
  1352      simplifications in @{text "bsimpALTs"}. We can show that any bitcoded
  1355      simplifications in @{text "bsimpALTs"}. We can show that any bitcoded
  1501 %!\end{center}
  1504 %!\end{center}
  1502 
  1505 
  1503 % tell Chengsong about Indian paper of closed forms of derivatives
  1506 % tell Chengsong about Indian paper of closed forms of derivatives
  1504 
  1507 
  1505 \noindent
  1508 \noindent
  1506 where in (1) the $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
  1509 where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
  1507 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is 
  1510 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is 
  1508 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
  1511 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
  1509 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
  1512 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
  1510 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
  1513 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
  1511 We reason similarly for @{text STAR}.\smallskip
  1514 We reason similarly for @{text STAR}.\smallskip
  1552 transfer to our setting where we are interested in the \emph{size} of the
  1555 transfer to our setting where we are interested in the \emph{size} of the
  1553 derivatives. For example it is not true for our derivatives that the
  1556 derivatives. For example it is not true for our derivatives that the
  1554 set of of derivatives $r \backslash s$ for a given $r$ and all strings
  1557 set of of derivatives $r \backslash s$ for a given $r$ and all strings
  1555 $s$ is finite. This is because for our annotated regular expressions
  1558 $s$ is finite. This is because for our annotated regular expressions
  1556 the bitcode annotation is dependent on the number of iterations that
  1559 the bitcode annotation is dependent on the number of iterations that
  1557 are needed for STAR-regular expressions. Since we want to do lexing
  1560 are needed for STAR-regular expressions. This is not a problem for us: Since we intend to do lexing
  1558 by calculating (as fast as possible) derivatives, the bound on the size
  1561 by calculating (as fast as possible) derivatives, the bound on the size
  1559 of the derivatives is important, not the number of derivatives.
  1562 of the derivatives is important, not the number of derivatives.
  1560 
  1563 
  1561 
  1564 
  1562 *}
  1565 *}
  1571    \cite{Sulzmann2014}. This follows earlier work where we established
  1574    \cite{Sulzmann2014}. This follows earlier work where we established
  1572    the correctness of the first algorithm
  1575    the correctness of the first algorithm
  1573    \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
  1576    \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
  1574    introduce our own specification about what POSIX values are,
  1577    introduce our own specification about what POSIX values are,
  1575    because the informal definition given by Sulzmann and Lu did not
  1578    because the informal definition given by Sulzmann and Lu did not
  1576    stand up to a formal proof. Also for the second algorithm we needed
  1579    stand up to formal proof. Also for the second algorithm we needed
  1577    to introduce our own definitions and proof ideas in order to establish the
  1580    to introduce our own definitions and proof ideas in order to establish the
  1578    correctness.  Our interest in the second algorithm 
  1581    correctness.  Our interest in the second algorithm 
  1579    lies in the fact that by using bitcoded regular expressions and an aggressive
  1582    lies in the fact that by using bitcoded regular expressions and an aggressive
  1580    simplification method there is a chance that the derivatives
  1583    simplification method there is a chance that the derivatives
  1581    can be kept universally small  (we established in this paper that
  1584    can be kept universally small  (we established in this paper that
  1582    they can be kept finitely bounded for any string).
  1585    for a given $r$ they can be kept finitely bounded for all strings).
  1583    %This is important if one is after
  1586    %This is important if one is after
  1584    %an efficient POSIX lexing algorithm based on derivatives.
  1587    %an efficient POSIX lexing algorithm based on derivatives.
  1585 
  1588 
  1586    Having proved the correctness of the POSIX lexing algorithm, which
  1589    Having proved the correctness of the POSIX lexing algorithm, which
  1587    lessons have we learned? Well, we feel this is a very good example
  1590    lessons have we learned? Well, we feel this is a very good example
  1602    grow to arbitrarily big sizes and the algorithm needs to traverse
  1605    grow to arbitrarily big sizes and the algorithm needs to traverse
  1603    the derivatives possibly several times, then the algorithm will be
  1606    the derivatives possibly several times, then the algorithm will be
  1604    slow---excruciatingly slow that is. Other works seems to make
  1607    slow---excruciatingly slow that is. Other works seems to make
  1605    stronger claims, but during our work we have developed a healthy
  1608    stronger claims, but during our work we have developed a healthy
  1606    suspicion when for example experimental data is used to back up
  1609    suspicion when for example experimental data is used to back up
  1607    efficiency claims. For example Sulzmann and Lu write about their
  1610    efficiency claims. For instance Sulzmann and Lu write about their
  1608    equivalent of @{term blexer_simp} \textit{``...we can incrementally
  1611    equivalent of @{term blexer_simp} \textit{``...we can incrementally
  1609    compute bitcoded parse trees in linear time in the size of the
  1612    compute bitcoded parse trees in linear time in the size of the
  1610    input''} \cite[Page 14]{Sulzmann2014}.  Given the growth of the
  1613    input''} \cite[Page 14]{Sulzmann2014}.  Given the growth of the
  1611    derivatives in some cases even after aggressive simplification,
  1614    derivatives in some cases even after aggressive simplification,
  1612    this is a hard to believe claim. A similar claim about a
  1615    this is a hard to believe claim. A similar claim about a
  1671    one needs to represent them as
  1674    one needs to represent them as
  1672    sequences of $a \cdot a \cdot \ldots \cdot a$ (500 a's in sequence). We have run
  1675    sequences of $a \cdot a \cdot \ldots \cdot a$ (500 a's in sequence). We have run
  1673    their extracted code with such a regular expression as a
  1676    their extracted code with such a regular expression as a
  1674    single lexing rule and a string of 50\,000 a's---lexing in this case
  1677    single lexing rule and a string of 50\,000 a's---lexing in this case
  1675    takes approximately 5~minutes. We are not aware of any better
  1678    takes approximately 5~minutes. We are not aware of any better
  1676    translation using the traditional notion of DFAs. Therefore we
  1679    translation using the traditional notion of DFAs so that we can improve on this. Therefore we
  1677    prefer to stick with calculating derivatives, but attempt to make
  1680    prefer to stick with calculating derivatives, but attempt to make
  1678    this calculation (in the future) as fast as possible. What we can guaranty
  1681    this calculation (in the future) as fast as possible. What we can guaranty
  1679    with the presented work is that the maximum size of the derivatives
  1682    with the presented work is that the maximum size of the derivatives
  1680    for this example is not bigger than 9. This means our Scala
  1683    for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
  1681    implementation only needs a few seconds for this example.
  1684    implementation only needs a few seconds for this example and matching 50\,000 a's.
  1682    %
  1685    %
  1683    %
  1686    %
  1684    %Possible ideas are
  1687    %Possible ideas are
  1685    %zippers which have been used in the context of parsing of
  1688    %zippers which have been used in the context of parsing of
  1686    %context-free grammars \cite{zipperparser}.
  1689    %context-free grammars \cite{zipperparser}.