197 does not achieve the intended objective with bitcoded regular expressions. The |
197 does not achieve the intended objective with bitcoded regular expressions. The |
198 reason is that in the bitcoded setting, each copy generally has a |
198 reason is that in the bitcoded setting, each copy generally has a |
199 different bitcode annotation---so @{text nub} would never ``fire''. |
199 different bitcode annotation---so @{text nub} would never ``fire''. |
200 Inspired by Scala's library for lists, we shall instead use a @{text |
200 Inspired by Scala's library for lists, we shall instead use a @{text |
201 distinctWith} function that finds duplicates under an ``erasing'' function |
201 distinctWith} function that finds duplicates under an ``erasing'' function |
202 which deletes bitcodes before comparing regular expressions. |
202 that deletes bitcodes before comparing regular expressions. |
203 We shall also introduce our \emph{own} arguments and definitions for |
203 We shall also introduce our \emph{own} arguments and definitions for |
204 establishing the correctness of the bitcoded algorithm when |
204 establishing the correctness of the bitcoded algorithm when |
205 simplifications are included. Finally we |
205 simplifications are included. Finally we |
206 establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions |
206 establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions |
207 %of regular expressions and describe the definition |
207 %of regular expressions and describe the definition |
374 \end{center} |
374 \end{center} |
375 |
375 |
376 \noindent where we use @{term vs} to stand for a list of values. The |
376 \noindent where we use @{term vs} to stand for a list of values. The |
377 string underlying a value can be calculated by a @{const flat} |
377 string underlying a value can be calculated by a @{const flat} |
378 function, written @{term "flat DUMMY"}. It traverses a value and |
378 function, written @{term "flat DUMMY"}. It traverses a value and |
379 collects the characters contained in it \cite{AusafDyckhoffUrban2016}. |
379 collects the characters contained in it (see \cite{AusafDyckhoffUrban2016}). |
380 |
380 |
381 |
381 |
382 Sulzmann and Lu also define inductively an |
382 Sulzmann and Lu also define inductively an |
383 inhabitation relation that associates values to regular expressions. Our |
383 inhabitation relation that associates values to regular expressions. Our |
384 version of this relation is defined the following six rules for the values: |
384 version of this relation is defined by the following six rules: |
385 % |
385 % |
386 \begin{center} |
386 \begin{center} |
387 \begin{tabular}{@ {}l@ {}} |
387 \begin{tabular}{@ {}l@ {}} |
388 @{thm[mode=Axiom] Prf.intros(4)}\quad |
388 @{thm[mode=Axiom] Prf.intros(4)}\quad |
389 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad |
389 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad |
414 only contains values that flatten to the empty string. |
414 only contains values that flatten to the empty string. |
415 It is routine to establish how values ``inhabiting'' a regular |
415 It is routine to establish how values ``inhabiting'' a regular |
416 expression correspond to the language of a regular expression, namely |
416 expression correspond to the language of a regular expression, namely |
417 @{thm L_flat_Prf}. |
417 @{thm L_flat_Prf}. |
418 |
418 |
419 In general there is more than one value inhabited by a regular |
419 In general there is more than one value inhabiting a regular |
420 expression (meaning regular expressions can typically match more |
420 expression (meaning regular expressions can typically match more |
421 than one string). But even when fixing a string from the language of the |
421 than one string). But even when fixing a string from the language of the |
422 regular expression, there are generally more than one way of how the |
422 regular expression, there are generally more than one way of how the |
423 regular expression can match this string. POSIX lexing is about |
423 regular expression can match this string. POSIX lexing is about |
424 identifying the unique value for a given regular expression and a |
424 identifying the unique value for a given regular expression and a |
430 % Unix-like operating systems \cite{POSIX}.} |
430 % Unix-like operating systems \cite{POSIX}.} |
431 Sometimes these |
431 Sometimes these |
432 informal rules are called \emph{maximal munch rule} and \emph{rule priority}. |
432 informal rules are called \emph{maximal munch rule} and \emph{rule priority}. |
433 One contribution of our earlier paper is to give a convenient |
433 One contribution of our earlier paper is to give a convenient |
434 specification for what POSIX values are (the inductive rules are shown in |
434 specification for what POSIX values are (the inductive rules are shown in |
435 Figure~\ref{POSIXrules}). |
435 Fig~\ref{POSIXrules}). |
436 |
436 |
437 \begin{figure}[t] |
437 \begin{figure}[t] |
438 \begin{center}\small% |
438 \begin{center}\small% |
439 \begin{tabular}{@ {\hspace{-2mm}}c@ {}} |
439 \begin{tabular}{@ {\hspace{-2mm}}c@ {}} |
440 \\[-9mm] |
440 \\[-4.5mm] |
441 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad |
441 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad |
442 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad |
442 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad |
443 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad |
443 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad |
444 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\ |
444 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\ |
445 $\mprset{flushleft} |
445 $\mprset{flushleft} |
713 regular expression \emph{and} a bitsequence of a corresponding value, |
713 regular expression \emph{and} a bitsequence of a corresponding value, |
714 then we can always decode the value accurately~(see Fig.~\ref{decode}). |
714 then we can always decode the value accurately~(see Fig.~\ref{decode}). |
715 The function \textit{decode} checks whether all of the bitsequence is |
715 The function \textit{decode} checks whether all of the bitsequence is |
716 consumed and returns the corresponding value as @{term "Some v"}; otherwise |
716 consumed and returns the corresponding value as @{term "Some v"}; otherwise |
717 it fails with @{text "None"}. We can establish that for a value $v$ |
717 it fails with @{text "None"}. We can establish that for a value $v$ |
718 inhabited by a regular expression $r$, the decoding of its |
718 inhabiting a regular expression $r$, the decoding of its |
719 bitsequence never fails (see also \cite{NielsenHenglein2011}). |
719 bitsequence never fails (see also \cite{NielsenHenglein2011}). |
720 |
720 |
721 %The decoding can be |
721 %The decoding can be |
722 %defined by using two functions called $\textit{decode}'$ and |
722 %defined by using two functions called $\textit{decode}'$ and |
723 %\textit{decode}: |
723 %\textit{decode}: |
797 \end{tabular} |
797 \end{tabular} |
798 \end{minipage}\medskip |
798 \end{minipage}\medskip |
799 %!\end{center} |
799 %!\end{center} |
800 |
800 |
801 \noindent |
801 \noindent |
|
802 This function ``fuses'' a bitsequence to the topmost constructor of an bitcoded regular expressions. |
802 A regular expression can then be \emph{internalised} into a bitcoded |
803 A regular expression can then be \emph{internalised} into a bitcoded |
803 regular expression as follows: |
804 regular expression as follows: |
804 % |
805 % |
805 %!\begin{center} |
806 %!\begin{center} |
806 %!\begin{tabular}{@ {}c@ {\hspace{2mm}}c@ {\hspace{2mm}}c@ {}} |
807 %!\begin{tabular}{@ {}c@ {\hspace{2mm}}c@ {\hspace{2mm}}c@ {}} |
1040 derivative $r\backslash s$ and another time for stacking up injection |
1041 derivative $r\backslash s$ and another time for stacking up injection |
1041 functions), it helps us with proving |
1042 functions), it helps us with proving |
1042 that incrementally building up values in @{text blexer} generates the same result. |
1043 that incrementally building up values in @{text blexer} generates the same result. |
1043 |
1044 |
1044 This brings us to our main lemma in this section: if we calculate a |
1045 This brings us to our main lemma in this section: if we calculate a |
1045 derivative, say $r\backslash s$, and have a value, say $v$, inhabited |
1046 derivative, say $r\backslash s$, and have a value, say $v$, inhabiting |
1046 by this derivative, then we can produce the result @{text lexer} generates |
1047 this derivative, then we can produce the result @{text lexer} generates |
1047 by applying this value to the stacked-up injection functions |
1048 by applying this value to the stacked-up injection functions |
1048 that $\textit{flex}$ assembles. The lemma establishes that this is the same |
1049 that $\textit{flex}$ assembles. The lemma establishes that this is the same |
1049 value as if we build the annotated derivative $r^\uparrow\backslash s$ |
1050 value as if we build the annotated derivative $r^\uparrow\backslash s$ |
1050 and then retrieve the corresponding bitcoded version, followed by a |
1051 and then retrieve the bitcoded version of @{text v}, followed by a |
1051 decoding step. |
1052 decoding step. |
1052 |
1053 |
1053 \begin{lemma}[Main Lemma]\label{mainlemma}\mbox{}\\\it |
1054 \begin{lemma}[Main Lemma]\label{mainlemma}\mbox{}\\\it |
1054 If $\vdash v : r\backslash s$ then |
1055 If $\vdash v : r\backslash s$ then |
1055 $\textit{Some}\,(\textit{flex}\,r\,\textit{id}\,s\,v) = |
1056 $\textit{Some}\,(\textit{flex}\,r\,\textit{id}\,s\,v) = |
1129 de-nest, or spill out, @{text ALTs} as follows |
1130 de-nest, or spill out, @{text ALTs} as follows |
1130 % |
1131 % |
1131 \[ |
1132 \[ |
1132 @{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"} |
1133 @{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"} |
1133 \quad\xrightarrow{bsimp}\quad |
1134 \quad\xrightarrow{bsimp}\quad |
1134 @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"} |
1135 @{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"} |
1135 \] |
1136 \] |
1136 |
1137 |
1137 \noindent |
1138 \noindent |
1138 where we just need to fuse the bitsequence that has accumulated in @{text "bs\<^sub>2"} |
1139 where we just need to fuse the bitsequence that has accumulated in @{text "bs\<^sub>2"} |
1139 to the alternatives in @{text "rs\<^sub>2"}. As we shall show below this will |
1140 to the alternatives in @{text "rs\<^sub>2"}. As we shall show below this will |
1176 \end{tabular} |
1177 \end{tabular} |
1177 \end{center} |
1178 \end{center} |
1178 |
1179 |
1179 \noindent where we scan the list from left to right (because we |
1180 \noindent where we scan the list from left to right (because we |
1180 have to remove later copies). In @{text distinctWith}, @{text eq} is intended to be an |
1181 have to remove later copies). In @{text distinctWith}, @{text eq} is intended to be an |
1181 equivalence relation for annotated regular expressions and @{text acc} is an accumulator for annotated regular |
1182 equivalence relation for bitcoded regular expressions and @{text acc} is an accumulator for bitcoded regular |
1182 expressions---essentially a set of regular expressions that we have already seen |
1183 expressions---essentially a set of regular expressions that we have already seen |
1183 while scanning the list. Therefore we delete an element, say @{text x}, |
1184 while scanning the list. Therefore we delete an element, say @{text x}, |
1184 from the list provided a @{text "y"} with @{text "y"} being equivalent to @{text x} is already in the accumulator; |
1185 from the list provided a @{text "y"} with @{text "y"} being equivalent to @{text x} is already in the accumulator; |
1185 otherwise we keep @{text x} and scan the rest of the list but |
1186 otherwise we keep @{text x} and scan the rest of the list but |
1186 add @{text "x"} as another ``seen'' element to @{text acc}. We will use |
1187 add @{text "x"} as another ``seen'' element to @{text acc}. We will use |
1187 @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions |
1188 @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions |
1188 before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from |
1189 before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from |
1189 annotated regular expressions to @{text bool}: |
1190 bitcoded regular expressions to @{text bool}: |
1190 % |
1191 % |
1191 \begin{center} |
1192 \begin{center} |
1192 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}} |
1193 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}} |
1193 @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\ |
1194 @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\ |
1194 @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\ |
1195 @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\ |
1283 |
1284 |
1284 Next we can include simplification after each derivative step leading to the |
1285 Next we can include simplification after each derivative step leading to the |
1285 following notion of bitcoded derivatives: |
1286 following notion of bitcoded derivatives: |
1286 |
1287 |
1287 \begin{center} |
1288 \begin{center} |
1288 \begin{tabular}{cc} |
1289 \begin{tabular}{c@ {\hspace{10mm}}c} |
1289 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
1290 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
1290 @{thm (lhs) bders_simp.simps(1)} & $\dn$ & @{thm (rhs) bders_simp.simps(1)} |
1291 @{thm (lhs) bders_simp.simps(1)} & $\dn$ & @{thm (rhs) bders_simp.simps(1)} |
1291 \end{tabular} |
1292 \end{tabular} |
1292 & |
1293 & |
1293 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
1294 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
1301 |
1302 |
1302 \begin{center} |
1303 \begin{center} |
1303 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}} |
1304 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}} |
1304 $\textit{blexer}^+\;r\,s$ & $\dn$ & |
1305 $\textit{blexer}^+\;r\,s$ & $\dn$ & |
1305 $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\ |
1306 $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\ |
1306 & & $\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\ |
1307 & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\ |
1307 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\ |
1308 & & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\;\textit{else}\;\textit{None}$ |
1308 & & $\;\;\textit{else}\;\textit{None}$ |
|
1309 \end{tabular} |
1309 \end{tabular} |
1310 \end{center} |
1310 \end{center} |
1311 |
1311 |
1312 \noindent The remaining task is to show that @{term blexer} and |
1312 \noindent |
|
1313 Note that in $\textit{blexer}^+$ the derivative $r_{der}$ is calculated |
|
1314 using the simplifying derivative $\_\,\backslash_{bsimp}\,\_$. |
|
1315 The remaining task is to show that @{term blexer} and |
1313 @{term "blexer_simp"} generate the same answers. |
1316 @{term "blexer_simp"} generate the same answers. |
1314 |
1317 |
1315 When we first |
1318 When we first |
1316 attempted this proof we encountered a problem with the idea |
1319 attempted this proof we encountered a problem with the idea |
1317 in Sulzmann and Lu's paper where the argument seems to be to appeal |
1320 in Sulzmann and Lu's paper where the argument seems to be to appeal |
1339 synchronise the change in the simplified regular expressions with |
1342 synchronise the change in the simplified regular expressions with |
1340 the original POSIX value, there is no hope of appealing to @{text retrieve} in the |
1343 the original POSIX value, there is no hope of appealing to @{text retrieve} in the |
1341 correctness argument for @{term blexer_simp}. |
1344 correctness argument for @{term blexer_simp}. |
1342 |
1345 |
1343 We found it more helpful to introduce the rewriting systems shown in |
1346 We found it more helpful to introduce the rewriting systems shown in |
1344 Figure~\ref{SimpRewrites}. The idea is to generate |
1347 Fig~\ref{SimpRewrites}. The idea is to generate |
1345 simplified regular expressions in small steps (unlike the @{text bsimp}-function which |
1348 simplified regular expressions in small steps (unlike the @{text bsimp}-function which |
1346 does the same in a big step), and show that each of |
1349 does the same in a big step), and show that each of |
1347 the small steps preserves the bitcodes that lead to the final POSIX value. |
1350 the small steps preserves the bitcodes that lead to the POSIX value. |
1348 The rewrite system is organised such that $\leadsto$ is for bitcoded regular |
1351 The rewrite system is organised such that $\leadsto$ is for bitcoded regular |
1349 expressions and $\stackrel{s}{\leadsto}$ for lists of bitcoded regular |
1352 expressions and $\stackrel{s}{\leadsto}$ for lists of bitcoded regular |
1350 expressions. The former essentially implements the simplifications of |
1353 expressions. The former essentially implements the simplifications of |
1351 @{text "bsimpSEQ"} and @{text flts}; while the latter implements the |
1354 @{text "bsimpSEQ"} and @{text flts}; while the latter implements the |
1352 simplifications in @{text "bsimpALTs"}. We can show that any bitcoded |
1355 simplifications in @{text "bsimpALTs"}. We can show that any bitcoded |
1501 %!\end{center} |
1504 %!\end{center} |
1502 |
1505 |
1503 % tell Chengsong about Indian paper of closed forms of derivatives |
1506 % tell Chengsong about Indian paper of closed forms of derivatives |
1504 |
1507 |
1505 \noindent |
1508 \noindent |
1506 where in (1) the $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$). |
1509 where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$). |
1507 In (3) we know that $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is |
1510 In (3) we know that $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is |
1508 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
1511 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
1509 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands |
1512 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands |
1510 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
1513 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
1511 We reason similarly for @{text STAR}.\smallskip |
1514 We reason similarly for @{text STAR}.\smallskip |
1552 transfer to our setting where we are interested in the \emph{size} of the |
1555 transfer to our setting where we are interested in the \emph{size} of the |
1553 derivatives. For example it is not true for our derivatives that the |
1556 derivatives. For example it is not true for our derivatives that the |
1554 set of of derivatives $r \backslash s$ for a given $r$ and all strings |
1557 set of of derivatives $r \backslash s$ for a given $r$ and all strings |
1555 $s$ is finite. This is because for our annotated regular expressions |
1558 $s$ is finite. This is because for our annotated regular expressions |
1556 the bitcode annotation is dependent on the number of iterations that |
1559 the bitcode annotation is dependent on the number of iterations that |
1557 are needed for STAR-regular expressions. Since we want to do lexing |
1560 are needed for STAR-regular expressions. This is not a problem for us: Since we intend to do lexing |
1558 by calculating (as fast as possible) derivatives, the bound on the size |
1561 by calculating (as fast as possible) derivatives, the bound on the size |
1559 of the derivatives is important, not the number of derivatives. |
1562 of the derivatives is important, not the number of derivatives. |
1560 |
1563 |
1561 |
1564 |
1562 *} |
1565 *} |
1571 \cite{Sulzmann2014}. This follows earlier work where we established |
1574 \cite{Sulzmann2014}. This follows earlier work where we established |
1572 the correctness of the first algorithm |
1575 the correctness of the first algorithm |
1573 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to |
1576 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to |
1574 introduce our own specification about what POSIX values are, |
1577 introduce our own specification about what POSIX values are, |
1575 because the informal definition given by Sulzmann and Lu did not |
1578 because the informal definition given by Sulzmann and Lu did not |
1576 stand up to a formal proof. Also for the second algorithm we needed |
1579 stand up to formal proof. Also for the second algorithm we needed |
1577 to introduce our own definitions and proof ideas in order to establish the |
1580 to introduce our own definitions and proof ideas in order to establish the |
1578 correctness. Our interest in the second algorithm |
1581 correctness. Our interest in the second algorithm |
1579 lies in the fact that by using bitcoded regular expressions and an aggressive |
1582 lies in the fact that by using bitcoded regular expressions and an aggressive |
1580 simplification method there is a chance that the derivatives |
1583 simplification method there is a chance that the derivatives |
1581 can be kept universally small (we established in this paper that |
1584 can be kept universally small (we established in this paper that |
1582 they can be kept finitely bounded for any string). |
1585 for a given $r$ they can be kept finitely bounded for all strings). |
1583 %This is important if one is after |
1586 %This is important if one is after |
1584 %an efficient POSIX lexing algorithm based on derivatives. |
1587 %an efficient POSIX lexing algorithm based on derivatives. |
1585 |
1588 |
1586 Having proved the correctness of the POSIX lexing algorithm, which |
1589 Having proved the correctness of the POSIX lexing algorithm, which |
1587 lessons have we learned? Well, we feel this is a very good example |
1590 lessons have we learned? Well, we feel this is a very good example |
1602 grow to arbitrarily big sizes and the algorithm needs to traverse |
1605 grow to arbitrarily big sizes and the algorithm needs to traverse |
1603 the derivatives possibly several times, then the algorithm will be |
1606 the derivatives possibly several times, then the algorithm will be |
1604 slow---excruciatingly slow that is. Other works seems to make |
1607 slow---excruciatingly slow that is. Other works seems to make |
1605 stronger claims, but during our work we have developed a healthy |
1608 stronger claims, but during our work we have developed a healthy |
1606 suspicion when for example experimental data is used to back up |
1609 suspicion when for example experimental data is used to back up |
1607 efficiency claims. For example Sulzmann and Lu write about their |
1610 efficiency claims. For instance Sulzmann and Lu write about their |
1608 equivalent of @{term blexer_simp} \textit{``...we can incrementally |
1611 equivalent of @{term blexer_simp} \textit{``...we can incrementally |
1609 compute bitcoded parse trees in linear time in the size of the |
1612 compute bitcoded parse trees in linear time in the size of the |
1610 input''} \cite[Page 14]{Sulzmann2014}. Given the growth of the |
1613 input''} \cite[Page 14]{Sulzmann2014}. Given the growth of the |
1611 derivatives in some cases even after aggressive simplification, |
1614 derivatives in some cases even after aggressive simplification, |
1612 this is a hard to believe claim. A similar claim about a |
1615 this is a hard to believe claim. A similar claim about a |
1671 one needs to represent them as |
1674 one needs to represent them as |
1672 sequences of $a \cdot a \cdot \ldots \cdot a$ (500 a's in sequence). We have run |
1675 sequences of $a \cdot a \cdot \ldots \cdot a$ (500 a's in sequence). We have run |
1673 their extracted code with such a regular expression as a |
1676 their extracted code with such a regular expression as a |
1674 single lexing rule and a string of 50\,000 a's---lexing in this case |
1677 single lexing rule and a string of 50\,000 a's---lexing in this case |
1675 takes approximately 5~minutes. We are not aware of any better |
1678 takes approximately 5~minutes. We are not aware of any better |
1676 translation using the traditional notion of DFAs. Therefore we |
1679 translation using the traditional notion of DFAs so that we can improve on this. Therefore we |
1677 prefer to stick with calculating derivatives, but attempt to make |
1680 prefer to stick with calculating derivatives, but attempt to make |
1678 this calculation (in the future) as fast as possible. What we can guaranty |
1681 this calculation (in the future) as fast as possible. What we can guaranty |
1679 with the presented work is that the maximum size of the derivatives |
1682 with the presented work is that the maximum size of the derivatives |
1680 for this example is not bigger than 9. This means our Scala |
1683 for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala |
1681 implementation only needs a few seconds for this example. |
1684 implementation only needs a few seconds for this example and matching 50\,000 a's. |
1682 % |
1685 % |
1683 % |
1686 % |
1684 %Possible ideas are |
1687 %Possible ideas are |
1685 %zippers which have been used in the context of parsing of |
1688 %zippers which have been used in the context of parsing of |
1686 %context-free grammars \cite{zipperparser}. |
1689 %context-free grammars \cite{zipperparser}. |