thys2/Paper/Paper.thy
changeset 426 5b77220fdf01
parent 425 14c558ae0b09
child 436 222333d2bdc2
\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method [..] in combination with the simplification steps [..]
  yields POSIX parse trees. We have tested this claim
  extensively [..] but yet
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}

\noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
the derivative-based lexing algorithm of Sulzmann and Lu
\cite{Sulzmann2014} where regular expressions are annotated with
bitsequences. We define the crucial simplification function as a
recursive function, without the need of a fix-point operation. One objective of
the simplification function is to remove duplicates of regular
expressions.  For this Sulzmann and Lu use in their paper the standard
@{text nub} function from Haskell's list library, but this function
does not achieve the intended objective with bitcoded regular expressions.  The
reason is that in the bitcoded setting, each copy generally has a
different bit-annotation, so @{text nub} never recognises two copies
as equal and consequently never removes them.
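
To illustrate the point, consider the following small Haskell experiment
(a toy datatype of our own, for illustration only, not Sulzmann and Lu's
actual code): two copies of the same regular expression that differ only
in their bit-annotations are not merged by @{text nub}, whereas comparing
them modulo erased annotations merges them as intended.

\begin{verbatim}
import Data.List (nub, nubBy)

data Bit   = Z | S deriving (Eq, Show)

-- a cut-down stand-in for bitcoded regular expressions
data ARexp = ACHAR [Bit] Char deriving (Eq, Show)

-- plain regular expressions, i.e. annotations erased
data Rexp  = CHAR Char deriving (Eq, Show)

erase :: ARexp -> Rexp
erase (ACHAR _ c) = CHAR c

-- two copies of the "same" regular expression, different bitcodes
rs :: [ARexp]
rs = [ACHAR [Z] 'a', ACHAR [S] 'a']

main :: IO ()
main = do
  print (nub rs)          -- keeps both copies: the bitcodes differ
  print (nubBy (\r1 r2 -> erase r1 == erase r2) rs)  -- keeps one
\end{verbatim}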
\begin{proposition}\label{matchcorr} 
@{text "match s r"} \;\;\text{if and only if}\;\; $s \in L(r)$
\end{proposition}

\noindent
It is a fun exercise to formally prove this property in a theorem prover.

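For readers who want to experiment, the underlying derivative-based
matcher can be rendered in Haskell as the following minimal sketch
(restricted to the basic regular expressions; the Isabelle definitions
are analogous):

\begin{verbatim}
data Rexp = ZERO | ONE | CHAR Char
          | ALT Rexp Rexp | SEQ Rexp Rexp | STAR Rexp

-- can the regular expression match the empty string?
nullable :: Rexp -> Bool
nullable ZERO        = False
nullable ONE         = True
nullable (CHAR _)    = False
nullable (ALT r1 r2) = nullable r1 || nullable r2
nullable (SEQ r1 r2) = nullable r1 && nullable r2
nullable (STAR _)    = True

-- Brzozowski derivative of r with respect to the character c
der :: Char -> Rexp -> Rexp
der _ ZERO        = ZERO
der _ ONE         = ZERO
der c (CHAR d)    = if c == d then ONE else ZERO
der c (ALT r1 r2) = ALT (der c r1) (der c r2)
der c (SEQ r1 r2)
  | nullable r1   = ALT (SEQ (der c r1) r2) (der c r2)
  | otherwise     = SEQ (der c r1) r2
der c (STAR r)    = SEQ (der c r) (STAR r)

-- match s r iff s is in L(r)
match :: String -> Rexp -> Bool
match s r = nullable (foldl (flip der) r s)

main :: IO ()
main = print (match "aaa" (STAR (CHAR 'a')))   -- True
\end{verbatim}
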
The novel idea of Sulzmann and Lu is to extend this algorithm for 
lexing, where it is important to find out which part of the string
is matched by which part of the regular expression.
For this, Sulzmann and Lu presented two lexing algorithms in their paper
  \end{center}


We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
this algorithm is correct, that is, it generates POSIX values. The
central property we established relates the derivative operation to the
injection function.

  \begin{proposition}\label{Posix2}
	\textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. 
\end{proposition}
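
For concreteness, here is a Haskell sketch of the standard @{text mkeps}
and @{text inj} functions the proposition refers to, reusing @{text Rexp}
and @{text nullable} from the matcher sketch above (the value
constructors carry a @{text V} prefix merely to avoid clashes with
Haskell's Prelude):

\begin{verbatim}
-- values recording how a regular expression matches a string
data Val = VEmpty | VChar Char | VSeq Val Val
         | VLeft Val | VRight Val | VStars [Val]
         deriving Show

-- the value for how a nullable r matches the empty string
-- (partial: only defined for nullable r)
mkeps :: Rexp -> Val
mkeps ONE         = VEmpty
mkeps (ALT r1 r2) =
  if nullable r1 then VLeft (mkeps r1) else VRight (mkeps r2)
mkeps (SEQ r1 r2) = VSeq (mkeps r1) (mkeps r2)
mkeps (STAR _)    = VStars []

-- inject c back into a value for r\c, yielding a value for r
-- (partial: defined only for well-formed value/regex pairs)
inj :: Rexp -> Char -> Val -> Val
inj (CHAR d)    _ VEmpty               = VChar d
inj (ALT r1 _)  c (VLeft v)            = VLeft  (inj r1 c v)
inj (ALT _ r2)  c (VRight v)           = VRight (inj r2 c v)
inj (SEQ r1 r2) c (VLeft (VSeq v1 v2)) = VSeq (inj r1 c v1) v2
inj (SEQ r1 r2) c (VRight v)           = VSeq (mkeps r1) (inj r2 c v)
inj (SEQ r1 r2) c (VSeq v1 v2)         = VSeq (inj r1 c v1) v2
inj (STAR r)    c (VSeq v (VStars vs)) = VStars (inj r c v : vs)
\end{verbatim}

Note that both functions are partial; in the formalisation the
corresponding lemmas carry nullability and well-formedness
side-conditions.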
  \end{proposition}

  \noindent
  In fact we have shown that in the success case the generated POSIX value $v$ is
  unique and in the failure case that there is no POSIX value $v$ that satisfies
  $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
  slow in cases where the derivatives grow arbitrarily (recall the example from the
  Introduction). However, it can be used as a convenient reference point for the correctness
  proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
  
*}

text {*

  In the second part of their paper \cite{Sulzmann2014},
  Sulzmann and Lu describe another algorithm that also generates POSIX
  values but dispenses with the second phase where characters are
  injected ``back'' into values. For this they annotate regular
  expressions with bitcodes, which we define in Isabelle/HOL as the datatype

  \begin{center}
  \begin{tabular}{lcl}
     ensure that the correct value corresponding to the original (unsimplified)
     regular expression can still be extracted. %In this way the value construction
     %is not affected by simplification. 

     However, there is one problem with the definition for the more
     aggressive simplification rules described by Sulzmann and Lu. Recasting
     their definition with our syntax, they define the step of removing
     duplicates as
     %
     \[ @{text "bsimp (ALTs bs rs)"} \dn @{text "ALTs
     bs (nub (map bsimp rs))"}
     \end{tabular}
     \end{tabular}
     \end{center}

     \noindent
     With this in place we can define our simplification function as

     \begin{center}
     \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
     @{thm (lhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]} & $\dn$ &
         @{thm (rhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]}\\

     \noindent do not hold under simplification---this property
     essentially purports that we can retrieve the same value from a
     simplified version of the regular expression. To start with, @{text retrieve}
     depends on the fact that the value @{text v} corresponds to the
     structure of the regular expression---but the whole point of simplification
     is to ``destroy'' this structure by making the regular expression simpler.
     To see this consider the regular expression @{text "r = r' + 0"} and a corresponding
     value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then 
     we can use @{text retrieve} and @{text v} in order to extract a corresponding
     bitsequence. The reason that this works is that @{text r} is an alternative
     be in both @{text "L (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ [r\<^sub>2] @ rs\<^sub>c)"} and
     @{text "L (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ rs\<^sub>c)"}.
     \end{proof}

     \noindent
     From this, we can show that @{text bmkeps} will produce the same bitsequence
     as long as one of the bitcoded regular expressions in $\leadsto$ is nullable (this lemma
     establishes the missing fact we were not able to establish using @{text retrieve}, as suggested
     in the paper by Sulzmann and Lu). 


     \begin{proof}
     By reverse induction on @{term s} generalising over @{text r}.
     \end{proof}

     \noindent
     With these lemmas in place we can finally establish that @{term "blexer_simp"} and @{term "blexer"}
     generate the same value, and using Theorem~\ref{thmone} from the previous section that this value
     is indeed the POSIX value.
     
     \begin{theorem}
     @{thm[mode=IfThen] main_blexer_simp}
  expressions be removed provided a regular expression earlier in the list can
  match the same strings.}\label{SimpRewrites}
  \end{figure}
*}

section {* Finiteness of Derivatives *}

text {*

In this section let us sketch our argument for why the size of the simplified
derivatives with the aggressive simplification function is finite. Suppose
we have a size function for bitcoded regular expressions, written
@{text "|r|"}, which counts the number of nodes if we regard $r$ as a tree
(we omit the precise definition). For this we show that for every $r$
there exists a bound $N$
such that 

\begin{center}
$\forall s. \; |@{term "bders_simp r s"}| < N$
\end{center}

\noindent
We prove this by induction on $r$. The base cases for @{term AZERO},
@{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is
for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction
hypotheses state $\forall s. \; |@{term "bders_simp r\<^sub>1 s"}| < N_1$ and
$\forall s. \; |@{term "bders_simp r\<^sub>2 s"}| < N_2$. We can reason as follows

       
\begin{center}
\begin{tabular}{lcll}
& & $ |@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}|$\\
& $ = $ & $|bsimp(ALTs\;bs\;((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) ::
    [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| $ & (1) \\
& $\leq$ &
    $|distinctBy\,(flts\,((@{term "bders_simp r\<^sub>1 s "}) \cdot r_2) ::
    [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (2) \\
& $\leq$ & $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2| +
             |distinctBy\,(flts\,   [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (3) \\
& $\leq$ & $N_1 + |r_2| + 2 + |distinctBy\,(flts\,   [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])|$ & (4)\\
& $\leq$ & $N_1 + |r_2| + 2 + l_{N_{2}} * N_{2}$ & (5)
\end{tabular}
\end{center}

% tell Chengsong about Indian paper of closed forms of derivatives

       
\noindent
where in (1) $Suf\!fix(s)$ contains the suffixes $s'$ of $s$ for which @{term "bders_simp r\<^sub>1 s''"} is nullable and
@{text "s = s'' @ s'"}. In (3) we know that $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2|$ is 
bounded by $N_1 + |r_2|$. In (5) we know the list comprehension contains only regular expressions of size smaller
than $N_2$. The list length after @{text distinctBy} is bounded by a number, which we call $l_{N_2}$. It stands
for the number of distinct regular expressions with a maximum size $N_2$ (there can only be finitely many of them).
We reason similarly in the @{text Star}-case.\medskip
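
As an aside, @{text distinctBy} itself is short; a Haskell rendering of
the idea behind our Isabelle definition is as follows (in our
formalisation the projection @{text f} is instantiated with the
bitcode-erasing function, so duplicates are detected modulo annotations):

\begin{verbatim}
-- like nub, but compares list elements through a projection f
distinctBy :: Eq b => (a -> b) -> [a] -> [a]
distinctBy f = go []
  where
    go _    []     = []
    go seen (x:xs)
      | f x `elem` seen = go seen xs                -- duplicate: drop
      | otherwise       = x : go (f x : seen) xs    -- first one: keep
\end{verbatim}

\noindent
Since the output contains at most one element per distinct projection,
its length is indeed bounded by the number $l_{N_2}$ of distinct erased
regular expressions of size at most $N_2$.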
       

\noindent
Clearly the bound we give in step (5) of this finiteness argument is very loose and
far from the actual bound we can expect. We can do better than this, but this does not improve
the finiteness property we are proving. If we are interested in a polynomial bound,
one would hope to obtain a similarly tight bound as for partial
derivatives introduced by Antimirov \cite{Antimirov95}. After all, the idea behind
@{text distinctBy} is to maintain a ``set'' of alternatives (like the sets in
partial derivatives). Unfortunately, obtaining the exact same bound would mean
we need to introduce simplifications such as
%
\[ (r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)
\]

\noindent
which exist for partial derivatives. However, if we introduce them in our
setting we would lose the POSIX property of our calculated values. We leave better
bounds for future work.

*}


section {* Conclusion *}

text {*

   We set out in this work to prove in Isabelle/HOL the correctness of
   the second POSIX lexing algorithm by Sulzmann and Lu
   \cite{Sulzmann2014}. This follows earlier work where we established
   the correctness of the first algorithm
   \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
   introduce our own specification about what POSIX values are,
   because the informal definition given by Sulzmann and Lu did not
   stand up to a formal proof. Also for the second algorithm we needed
   to introduce our own definitions and proof ideas in order to establish the
   correctness.  Our interest in the second algorithm 
   lies in the fact that by using bitcoded regular expressions and an aggressive
   simplification method there is a chance that the derivatives
   can be kept universally small  (we established in this paper that
   they can be kept finite for any string). This is important if one is after
   an efficient POSIX lexing algorithm.

   Having proved the correctness of the POSIX lexing algorithm, what
   lessons have we learned? Well, we feel this is a very good example
   where formal proofs give further insight into the matter at
   hand. For example, it is very hard to see a problem with @{text nub}
   vs @{text distinctBy} with only experimental data---one would still
   see the correct result but find that simplification does not simplify in well-chosen, but not
   obscure, examples. We found that from an implementation
   point-of-view it is really important to have the formal proofs of
   the corresponding properties at hand. We have also developed a
   healthy suspicion when experimental data is used to back up
   efficiency claims. For example Sulzmann and Lu write about their
   equivalent of @{term blexer_simp} ``...we can incrementally compute
   bitcoded parse trees in linear time in the size of the input''
   \cite[Page 14]{Sulzmann2014}. 
   Given the growth of the
   derivatives in some cases even after aggressive simplification, this
   claim is hard to believe. A similar claim about a theoretical runtime
   of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates POSIX matches and is based on
   derivatives \cite{verbatim}. In this case derivatives are not simplified.
   Clearly our result of having finite
   derivatives is rather weak in this context, but we think such efficiency claims
   require further scrutiny.\medskip

   \noindent
   Our Isabelle/HOL code is available under \url{https://github.com/urbanchr/posix}.


%%\bibliographystyle{plain}
\bibliography{root}
*}