thys3/Paper.thy
changeset 499 6a100d32314c
parent 498 ab626b60ee64
child 502 1ab693d6342f
equal deleted inserted replaced
498:ab626b60ee64 499:6a100d32314c
   349   than one string). But even when fixing a string from the language of the
   349   than one string). But even when fixing a string from the language of the
   350   regular expression, there are generally more than one way of how the
   350   regular expression, there are generally more than one way of how the
   351   regular expression can match this string. POSIX lexing is about
   351   regular expression can match this string. POSIX lexing is about
   352   identifying the unique value for a given regular expression and a
   352   identifying the unique value for a given regular expression and a
   353   string that satisfies the informal POSIX rules (see
   353   string that satisfies the informal POSIX rules (see
   354   \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).\footnote{POSIX
   354   \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).
   355 	lexing acquired its name from the fact that the corresponding
   355   %\footnote{POSIX
   356 	rules were described as part of the POSIX specification for
   356   %	lexing acquired its name from the fact that the corresponding
   357 	Unix-like operating systems \cite{POSIX}.} Sometimes these
   357   %	rules were described as part of the POSIX specification for
       
   358   %	Unix-like operating systems \cite{POSIX}.}
       
   359   Sometimes these
   358   informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
   360   informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
   359   One contribution of our earlier paper is to give a convenient
   361   One contribution of our earlier paper is to give a convenient
   360  specification for what POSIX values are (the inductive rules are shown in
   362  specification for what POSIX values are (the inductive rules are shown in
   361   Figure~\ref{POSIXrules}).
   363   Figure~\ref{POSIXrules}).
   362 
   364 
   363 \begin{figure}[t]
   365 \begin{figure}[t]
   364   \begin{center}
   366   \begin{center}\small%
   365   \begin{tabular}{@ {}c@ {}}
   367   \begin{tabular}{@ {}c@ {}}
       
   368   \\[-9mm]
   366   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
   369   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
   367   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
   370   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
   368   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
   371   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
   369   @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
   372   @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
   370   $\mprset{flushleft}
   373   $\mprset{flushleft}
   464 \end{center}
   467 \end{center}
   465 %
   468 %
   466 \noindent
   469 \noindent
   467   The picture shows the steps required when a
   470   The picture shows the steps required when a
   468   regular expression, say @{text "r\<^sub>1"}, matches the string @{term
   471   regular expression, say @{text "r\<^sub>1"}, matches the string @{term
   469   "[a,b,c]"}. The first lexing algorithm by Sulzmann and Lu can be defined as:\\[-8mm]
   472   "[a,b,c]"}. The first lexing algorithm by Sulzmann and Lu can be defined as:%\\[-8mm]
   470 
   473 
   471 %  \begin{figure}[t]
   474 %  \begin{figure}[t]
   472 %\begin{center}
   475 %\begin{center}
   473 %\begin{tikzpicture}[scale=1,node distance=1cm,
   476 %\begin{tikzpicture}[scale=1,node distance=1cm,
   474 %                    every node/.style={minimum size=6mm}]
   477 %                    every node/.style={minimum size=6mm}]
   529 \end{proposition}
   532 \end{proposition}
   530 
   533 
   531   \noindent
   534   \noindent
   532   With this in place we were able to prove:
   535   With this in place we were able to prove:
   533 
   536 
   534 
   537   \begin{proposition}\mbox{}\label{lexercorrect}
   535   \begin{proposition}\mbox{}\smallskip\\\label{lexercorrect}
   538   \textrm{(1)} @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\
   536   \begin{tabular}{ll}
   539   \mbox{\hspace{29mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
   537   (1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\
   540   %
   538   (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
   541   % \smallskip\\
   539   \end{tabular}
   542   %\begin{tabular}{ll}
       
   543   %(1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\
       
   544   %(2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
       
   545   %\end{tabular}
   540   \end{proposition}
   546   \end{proposition}
   541 
   547 
   542   \noindent
   548   \noindent
   543   In fact we have shown that, in the success case, the generated POSIX value $v$ is
   549   In fact we have shown that, in the success case, the generated POSIX value $v$ is
   544   unique and in the failure case that there is no POSIX value $v$ that satisfies
   550   unique and in the failure case that there is no POSIX value $v$ that satisfies
   841 relate to the ``standard'' operations on regular expressions. For
   847 relate to the ``standard'' operations on regular expressions. For
   842 example if we build a bitcoded derivative and erase the result, this
   848 example if we build a bitcoded derivative and erase the result, this
   843 is the same as if we first erase the bitcoded regular expression and
   849 is the same as if we first erase the bitcoded regular expression and
   844 then perform the ``standard'' derivative operation.
   850 then perform the ``standard'' derivative operation.
   845 
   851 
   846 \begin{lemma}\label{bnullable}\mbox{}\smallskip\\
   852 \begin{lemma}\label{bnullable}%\mbox{}\smallskip\\
   847   \begin{tabular}{ll}
   853 \textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
   848 \textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
   854 \mbox{\hspace{22mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
   849 \textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
   855 \mbox{\hspace{22mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
   850 \textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$.
   856 %\begin{tabular}{ll}
   851 \end{tabular}  
   857 %\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
       
   858 %\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
       
   859 %\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$.
       
   860 %\end{tabular}  
   852 \end{lemma}
   861 \end{lemma}
   853 
   862 
   854 %\begin{proof}
   863 %\begin{proof}
   855 %  All properties are by induction on annotated regular expressions.
   864 %  All properties are by induction on annotated regular expressions.
   856 %  %There are no interesting cases.
   865 %  %There are no interesting cases.
  1063      @{text flts} and analyses lists of regular expressions coming from alternatives.
  1072      @{text flts} and analyses lists of regular expressions coming from alternatives.
  1064      It is defined as follows:
  1073      It is defined as follows:
  1065 
  1074 
  1066      \begin{center}
  1075      \begin{center}
  1067      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1076      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1068      @{thm (lhs) flts.simps(1)} & $\dn$ & @{thm (rhs) flts.simps(1)}\\
  1077      \multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad
  1069      @{thm (lhs) flts.simps(2)} & $\dn$ & @{thm (rhs) flts.simps(2)}\\
  1078      @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\\
  1070      @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
  1079      @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
  1071      \end{tabular}
  1080      \end{tabular}
  1072      \end{center}
  1081      \end{center}
  1073 
  1082 
  1074      \noindent
  1083      \noindent
  1275      we shall show next.
  1284      we shall show next.
  1276 
  1285 
  1277    \begin{figure}[t]
  1286    \begin{figure}[t]
  1278   \begin{center}
  1287   \begin{center}
  1279   \begin{tabular}{@ {\hspace{-8mm}}c@ {}}
  1288   \begin{tabular}{@ {\hspace{-8mm}}c@ {}}
       
  1289   \\[-7mm]
  1280   @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad
  1290   @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad
  1281   @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad
  1291   @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad
  1282   @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\
  1292   @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\
  1283   @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad
  1293   @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad
  1284   @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\
  1294   @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\
  1346 where in (1) the $\textit{Suffix}(@{text "r"}_1, s)$ are the all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
  1356 where in (1) the $\textit{Suffix}(@{text "r"}_1, s)$ are the all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
  1347 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is 
  1357 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is 
  1348 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
  1358 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
  1349 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
  1359 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
  1350 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
  1360 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
  1351 We reason similarly for @{text STAR}.\medskip
  1361 We reason similarly for @{text STAR}.\smallskip
  1352 
  1362 
  1353 \noindent
  1363 
  1354 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
  1364 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
  1355 far from the actual bound we can expect. We can do better than this, but this does not improve
  1365 far from the actual bound we can expect. We can do better than this, but this does not improve
  1356 the finiteness property we are proving. If we are interested in a polynomial bound,
  1366 the finiteness property we are proving. If we are interested in a polynomial bound,
  1357 one would hope to obtain a similar tight bound as for partial
  1367 one would hope to obtain a similar tight bound as for partial
  1358 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
  1368 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
  1382    to introduce our own definitions and proof ideas in order to establish the
  1392    to introduce our own definitions and proof ideas in order to establish the
  1383    correctness.  Our interest in the second algorithm 
  1393    correctness.  Our interest in the second algorithm 
  1384    lies in the fact that by using bitcoded regular expressions and an aggressive
  1394    lies in the fact that by using bitcoded regular expressions and an aggressive
  1385    simplification method there is a chance that the derivatives
  1395    simplification method there is a chance that the derivatives
  1386    can be kept universally small  (we established in this paper that
  1396    can be kept universally small  (we established in this paper that
  1387    they can be kept finitely bounded for any string). This is important if one is after
  1397    they can be kept finitely bounded for any string).
  1388    an efficient POSIX lexing algorithm based on derivatives.
  1398    %This is important if one is after
       
  1399    %an efficient POSIX lexing algorithm based on derivatives.
  1389 
  1400 
  1390    Having proved the correctness of the POSIX lexing algorithm, which
  1401    Having proved the correctness of the POSIX lexing algorithm, which
  1391    lessons have we learned? Well, we feel this is a very good example
  1402    lessons have we learned? Well, we feel this is a very good example
  1392    where formal proofs give further insight into the matter at
  1403    where formal proofs give further insight into the matter at
  1393    hand. For example it is very hard to see a problem with @{text nub}
  1404    hand. For example it is very hard to see a problem with @{text nub}
  1394    vs @{text distinctWith} with only experimental data---one would still
  1405    vs @{text distinctWith} with only experimental data---one would still
  1395    see the correct result but find that simplification does not simplify in well-chosen, but not
  1406    see the correct result but find that simplification does not simplify in well-chosen, but not
  1396    obscure, examples. We found that from an implementation
  1407    obscure, examples.
  1397    point-of-view it is really important to have the formal proofs of
  1408    %We found that from an implementation
  1398    the corresponding properties at hand.
  1409    %point-of-view it is really important to have the formal proofs of
       
  1410    %the corresponding properties at hand.
  1399 
  1411 
  1400    We have also developed a
  1412    We have also developed a
  1401    healthy suspicion when experimental data is used to back up
  1413    healthy suspicion when experimental data is used to back up
  1402    efficiency claims. For example Sulzmann and Lu write about their
  1414    efficiency claims. For example Sulzmann and Lu write about their
  1403    equivalent of @{term blexer_simp} \textit{``...we can incrementally compute
  1415    equivalent of @{term blexer_simp} \textit{``...we can incrementally compute
  1428    The contribution of this paper is to make sure
  1440    The contribution of this paper is to make sure
  1429    derivatives do not grow arbitrarily big (universially) In the example \mbox{@{text "(a + aa)\<^sup>*"}},
  1441    derivatives do not grow arbitrarily big (universially) In the example \mbox{@{text "(a + aa)\<^sup>*"}},
  1430    \emph{all} derivatives have a size of 17 or less. The result is that
  1442    \emph{all} derivatives have a size of 17 or less. The result is that
  1431    lexing a string of, say, 50\,000 a's with the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes approximately
  1443    lexing a string of, say, 50\,000 a's with the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes approximately
  1432    10 seconds with our Scala implementation
  1444    10 seconds with our Scala implementation
  1433    of the presented algorithm. 
  1445    of the presented algorithm. Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
  1434    \smallskip
  1446    %\\[-10mm]
  1435 
       
  1436    \noindent
       
  1437    Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.\\[-10mm]
       
  1438 
  1447 
  1439 
  1448 
  1440 %%\bibliographystyle{plain}
  1449 %%\bibliographystyle{plain}
  1441 \bibliography{root}
  1450 \bibliography{root}
  1442 
  1451