349 than one string). But even when fixing a string from the language of the |
349 than one string). But even when fixing a string from the language of the |
350 regular expression, there are generally more than one way of how the |
350 regular expression, there are generally more than one way of how the |
351 regular expression can match this string. POSIX lexing is about |
351 regular expression can match this string. POSIX lexing is about |
352 identifying the unique value for a given regular expression and a |
352 identifying the unique value for a given regular expression and a |
353 string that satisfies the informal POSIX rules (see |
353 string that satisfies the informal POSIX rules (see |
354 \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).\footnote{POSIX |
354 \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}). |
355 lexing acquired its name from the fact that the corresponding |
355 %\footnote{POSIX |
356 rules were described as part of the POSIX specification for |
356 % lexing acquired its name from the fact that the corresponding |
357 Unix-like operating systems \cite{POSIX}.} Sometimes these |
357 % rules were described as part of the POSIX specification for |
|
358 % Unix-like operating systems \cite{POSIX}.} |
|
359 Sometimes these |
358 informal rules are called \emph{maximal munch rule} and \emph{rule priority}. |
360 informal rules are called \emph{maximal munch rule} and \emph{rule priority}. |
359 One contribution of our earlier paper is to give a convenient |
361 One contribution of our earlier paper is to give a convenient |
360 specification for what POSIX values are (the inductive rules are shown in |
362 specification for what POSIX values are (the inductive rules are shown in |
361 Figure~\ref{POSIXrules}). |
363 Figure~\ref{POSIXrules}). |
362 |
364 |
363 \begin{figure}[t] |
365 \begin{figure}[t] |
364 \begin{center} |
366 \begin{center}\small% |
365 \begin{tabular}{@ {}c@ {}} |
367 \begin{tabular}{@ {}c@ {}} |
|
368 \\[-9mm] |
366 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad |
369 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad |
367 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad |
370 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad |
368 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad |
371 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad |
369 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\ |
372 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\ |
370 $\mprset{flushleft} |
373 $\mprset{flushleft} |
464 \end{center} |
467 \end{center} |
465 % |
468 % |
466 \noindent |
469 \noindent |
467 The picture shows the steps required when a |
470 The picture shows the steps required when a |
468 regular expression, say @{text "r\<^sub>1"}, matches the string @{term |
471 regular expression, say @{text "r\<^sub>1"}, matches the string @{term |
469 "[a,b,c]"}. The first lexing algorithm by Sulzmann and Lu can be defined as:\\[-8mm] |
472 "[a,b,c]"}. The first lexing algorithm by Sulzmann and Lu can be defined as:%\\[-8mm] |
470 |
473 |
471 % \begin{figure}[t] |
474 % \begin{figure}[t] |
472 %\begin{center} |
475 %\begin{center} |
473 %\begin{tikzpicture}[scale=1,node distance=1cm, |
476 %\begin{tikzpicture}[scale=1,node distance=1cm, |
474 % every node/.style={minimum size=6mm}] |
477 % every node/.style={minimum size=6mm}] |
529 \end{proposition} |
532 \end{proposition} |
530 |
533 |
531 \noindent |
534 \noindent |
532 With this in place we were able to prove: |
535 With this in place we were able to prove: |
533 |
536 |
534 |
537 \begin{proposition}\mbox{}\label{lexercorrect} |
535 \begin{proposition}\mbox{}\smallskip\\\label{lexercorrect} |
538 \textrm{(1)} @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\ |
536 \begin{tabular}{ll} |
539 \mbox{\hspace{29mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}. |
537 (1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\ |
540 % |
538 (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\ |
541 % \smallskip\\ |
539 \end{tabular} |
542 %\begin{tabular}{ll} |
|
543 %(1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\ |
|
544 %(2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\ |
|
545 %\end{tabular} |
540 \end{proposition} |
546 \end{proposition} |
541 |
547 |
542 \noindent |
548 \noindent |
543 In fact we have shown that, in the success case, the generated POSIX value $v$ is |
549 In fact we have shown that, in the success case, the generated POSIX value $v$ is |
544 unique and in the failure case that there is no POSIX value $v$ that satisfies |
550 unique and in the failure case that there is no POSIX value $v$ that satisfies |
841 relate to the ``standard'' operations on regular expressions. For |
847 relate to the ``standard'' operations on regular expressions. For |
842 example if we build a bitcoded derivative and erase the result, this |
848 example if we build a bitcoded derivative and erase the result, this |
843 is the same as if we first erase the bitcoded regular expression and |
849 is the same as if we first erase the bitcoded regular expression and |
844 then perform the ``standard'' derivative operation. |
850 then perform the ``standard'' derivative operation. |
845 |
851 |
846 \begin{lemma}\label{bnullable}\mbox{}\smallskip\\ |
852 \begin{lemma}\label{bnullable}%\mbox{}\smallskip\\ |
847 \begin{tabular}{ll} |
853 \textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\ |
848 \textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\ |
854 \mbox{\hspace{22mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\ |
849 \textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\ |
855 \mbox{\hspace{22mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$ |
850 \textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$. |
856 %\begin{tabular}{ll} |
851 \end{tabular} |
857 %\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\ |
|
858 %\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\ |
|
859 %\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$. |
|
860 %\end{tabular} |
852 \end{lemma} |
861 \end{lemma} |
853 |
862 |
854 %\begin{proof} |
863 %\begin{proof} |
855 % All properties are by induction on annotated regular expressions. |
864 % All properties are by induction on annotated regular expressions. |
856 % %There are no interesting cases. |
865 % %There are no interesting cases. |
1063 @{text flts} and analyses lists of regular expressions coming from alternatives. |
1072 @{text flts} and analyses lists of regular expressions coming from alternatives. |
1064 It is defined as follows: |
1073 It is defined as follows: |
1065 |
1074 |
1066 \begin{center} |
1075 \begin{center} |
1067 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
1076 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
1068 @{thm (lhs) flts.simps(1)} & $\dn$ & @{thm (rhs) flts.simps(1)}\\ |
1077 \multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad |
1069 @{thm (lhs) flts.simps(2)} & $\dn$ & @{thm (rhs) flts.simps(2)}\\ |
1078 @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\\ |
1070 @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\ |
1079 @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\ |
1071 \end{tabular} |
1080 \end{tabular} |
1072 \end{center} |
1081 \end{center} |
1073 |
1082 |
1074 \noindent |
1083 \noindent |
1275 we shall show next. |
1284 we shall show next. |
1276 |
1285 |
1277 \begin{figure}[t] |
1286 \begin{figure}[t] |
1278 \begin{center} |
1287 \begin{center} |
1279 \begin{tabular}{@ {\hspace{-8mm}}c@ {}} |
1288 \begin{tabular}{@ {\hspace{-8mm}}c@ {}} |
|
1289 \\[-7mm] |
1280 @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad |
1290 @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad |
1281 @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad |
1291 @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad |
1282 @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\ |
1292 @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\ |
1283 @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad |
1293 @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad |
1284 @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\ |
1294 @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\ |
1346 where in (1) the $\textit{Suffix}(@{text "r"}_1, s)$ are the all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$). |
1356 where in (1) the $\textit{Suffix}(@{text "r"}_1, s)$ are the all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$). |
1347 In (3) we know that $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is |
1357 In (3) we know that $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is |
1348 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
1358 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
1349 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands |
1359 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands |
1350 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
1360 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
1351 We reason similarly for @{text STAR}.\medskip |
1361 We reason similarly for @{text STAR}.\smallskip |
1352 |
1362 |
1353 \noindent |
1363 |
1354 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is |
1364 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is |
1355 far from the actual bound we can expect. We can do better than this, but this does not improve |
1365 far from the actual bound we can expect. We can do better than this, but this does not improve |
1356 the finiteness property we are proving. If we are interested in a polynomial bound, |
1366 the finiteness property we are proving. If we are interested in a polynomial bound, |
1357 one would hope to obtain a similar tight bound as for partial |
1367 one would hope to obtain a similar tight bound as for partial |
1358 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with |
1368 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with |
1382 to introduce our own definitions and proof ideas in order to establish the |
1392 to introduce our own definitions and proof ideas in order to establish the |
1383 correctness. Our interest in the second algorithm |
1393 correctness. Our interest in the second algorithm |
1384 lies in the fact that by using bitcoded regular expressions and an aggressive |
1394 lies in the fact that by using bitcoded regular expressions and an aggressive |
1385 simplification method there is a chance that the derivatives |
1395 simplification method there is a chance that the derivatives |
1386 can be kept universally small (we established in this paper that |
1396 can be kept universally small (we established in this paper that |
1387 they can be kept finitely bounded for any string). This is important if one is after |
1397 they can be kept finitely bounded for any string). |
1388 an efficient POSIX lexing algorithm based on derivatives. |
1398 %This is important if one is after |
|
1399 %an efficient POSIX lexing algorithm based on derivatives. |
1389 |
1400 |
1390 Having proved the correctness of the POSIX lexing algorithm, which |
1401 Having proved the correctness of the POSIX lexing algorithm, which |
1391 lessons have we learned? Well, we feel this is a very good example |
1402 lessons have we learned? Well, we feel this is a very good example |
1392 where formal proofs give further insight into the matter at |
1403 where formal proofs give further insight into the matter at |
1393 hand. For example it is very hard to see a problem with @{text nub} |
1404 hand. For example it is very hard to see a problem with @{text nub} |
1394 vs @{text distinctWith} with only experimental data---one would still |
1405 vs @{text distinctWith} with only experimental data---one would still |
1395 see the correct result but find that simplification does not simplify in well-chosen, but not |
1406 see the correct result but find that simplification does not simplify in well-chosen, but not |
1396 obscure, examples. We found that from an implementation |
1407 obscure, examples. |
1397 point-of-view it is really important to have the formal proofs of |
1408 %We found that from an implementation |
1398 the corresponding properties at hand. |
1409 %point-of-view it is really important to have the formal proofs of |
|
1410 %the corresponding properties at hand. |
1399 |
1411 |
1400 We have also developed a |
1412 We have also developed a |
1401 healthy suspicion when experimental data is used to back up |
1413 healthy suspicion when experimental data is used to back up |
1402 efficiency claims. For example Sulzmann and Lu write about their |
1414 efficiency claims. For example Sulzmann and Lu write about their |
1403 equivalent of @{term blexer_simp} \textit{``...we can incrementally compute |
1415 equivalent of @{term blexer_simp} \textit{``...we can incrementally compute |
1428 The contribution of this paper is to make sure |
1440 The contribution of this paper is to make sure |
1429 derivatives do not grow arbitrarily big (universially) In the example \mbox{@{text "(a + aa)\<^sup>*"}}, |
1441 derivatives do not grow arbitrarily big (universially) In the example \mbox{@{text "(a + aa)\<^sup>*"}}, |
1430 \emph{all} derivatives have a size of 17 or less. The result is that |
1442 \emph{all} derivatives have a size of 17 or less. The result is that |
1431 lexing a string of, say, 50\,000 a's with the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes approximately |
1443 lexing a string of, say, 50\,000 a's with the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes approximately |
1432 10 seconds with our Scala implementation |
1444 10 seconds with our Scala implementation |
1433 of the presented algorithm. |
1445 of the presented algorithm. Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}. |
1434 \smallskip |
1446 %\\[-10mm] |
1435 |
|
1436 \noindent |
|
1437 Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.\\[-10mm] |
|
1438 |
1447 |
1439 |
1448 |
1440 %%\bibliographystyle{plain} |
1449 %%\bibliographystyle{plain} |
1441 \bibliography{root} |
1450 \bibliography{root} |
1442 |
1451 |