171 \begin{quote}\it |
171 \begin{quote}\it |
172 ``Correctness Claim: We further claim that the incremental parsing |
172 ``Correctness Claim: We further claim that the incremental parsing |
173 method [..] in combination with the simplification steps [..] |
173 method [..] in combination with the simplification steps [..] |
174 yields POSIX parse trees. We have tested this claim |
174 yields POSIX parse trees. We have tested this claim |
175 extensively [..] but yet |
175 extensively [..] but yet |
176 have to work out all proof details.'' |
176 have to work out all proof details.'' \cite[Page 14]{Sulzmann2014} |
177 \end{quote} |
177 \end{quote} |
178 |
178 |
179 \noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL |
179 \noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL |
180 the derivative-based lexing algorithm of Sulzmann and Lu |
180 the derivative-based lexing algorithm of Sulzmann and Lu |
181 \cite{Sulzmann2014} where regular expressions are annotated with |
181 \cite{Sulzmann2014} where regular expressions are annotated with |
182 bitsequences. We define the crucial simplification function as a |
182 bitsequences. We define the crucial simplification function as a |
183 recursive function, instead of a fix-point operation. One objective of |
183 recursive function, without the need of a fix-point operation. One objective of |
184 the simplification function is to remove duplicates of regular |
184 the simplification function is to remove duplicates of regular |
185 expressions. For this Sulzmann and Lu use in their paper the standard |
185 expressions. For this Sulzmann and Lu use in their paper the standard |
186 @{text nub} function from Haskell's list library, but this function |
186 @{text nub} function from Haskell's list library, but this function |
187 does not achieve the intended objective with bitcoded regular expressions. The |
187 does not achieve the intended objective with bitcoded regular expressions. The |
188 reason is that in the bitcoded setting, each copy generally has a |
188 reason is that in the bitcoded setting, each copy generally has a |
1162 |
1162 |
1163 \noindent do not hold under simplification---this property |
1163 \noindent do not hold under simplification---this property |
1164 essentially purports that we can retrieve the same value from a |
1164 essentially purports that we can retrieve the same value from a |
1165 simplified version of the regular expression. To start with @{text retrieve} |
1165 simplified version of the regular expression. To start with @{text retrieve} |
1166 depends on the fact that the value @{text v} correspond to the |
1166 depends on the fact that the value @{text v} correspond to the |
1167 structure of the regular exprssions---but the whole point of simplification |
1167 structure of the regular expressions---but the whole point of simplification |
1168 is to ``destroy'' this structure by making the regular expression simpler. |
1168 is to ``destroy'' this structure by making the regular expression simpler. |
1169 To see this consider the regular expression @{text "r = r' + 0"} and a corresponding |
1169 To see this consider the regular expression @{text "r = r' + 0"} and a corresponding |
1170 value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then |
1170 value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then |
1171 we can use @{text retrieve} and @{text v} in order to extract a corresponding |
1171 we can use @{text retrieve} and @{text v} in order to extract a corresponding |
1172 bitsequence. The reason that this works is that @{text r} is an alternative |
1172 bitsequence. The reason that this works is that @{text r} is an alternative |
1311 expressions be removed provided a regular expression earlier in the list can |
1311 expressions be removed provided a regular expression earlier in the list can |
1312 match the same strings.}\label{SimpRewrites} |
1312 match the same strings.}\label{SimpRewrites} |
1313 \end{figure} |
1313 \end{figure} |
1314 *} |
1314 *} |
1315 |
1315 |
1316 section {* Bound - NO *} |
1316 section {* Finiteness of Derivatives *} |
|
1317 |
|
1318 text {* |
|
1319 |
|
1320 In this section let us sketch our argument on why the size of the simplified |
|
1321 derivatives with the aggressive simplification function is finite. Suppose |
|
1322 we have a size functions for bitcoded regular expressions, written |
|
1323 @{text "|r|"}, which counts the number of nodes if we regard $r$ as a tree |
|
1324 (we omit the precise definition). For this we show that for every $r$ |
|
1325 there exists a bound $N$ |
|
1326 such that |
|
1327 |
|
1328 \begin{center} |
|
1329 $\forall s. \; |@{term "bders_simp r s"}| < N$ |
|
1330 \end{center} |
|
1331 |
|
1332 \noindent |
|
1333 We prove this by induction on $r$. The base cases for @{term AZERO}, |
|
1334 @{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is |
|
1335 for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction |
|
1336 hypotheses state $\forall s. \; |@{term "bders_simp r\<^sub>1 s"}| < N_1$ and |
|
1337 $\forall s. \; |@{term "bders_simp r\<^sub>2 s"}| < N_2$. We can reason as follows |
|
1338 |
|
1339 \begin{center} |
|
1340 \begin{tabular}{lcll} |
|
1341 & & $ |@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}|$\\ |
|
1342 & $ = $ & $|bsimp(ALTs\;bs\;((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) :: |
|
1343 [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| $ & (1) \\ |
|
1344 & $\leq$ & |
|
1345 $|distinctBy\,(flts\,((@{term "bders_simp r\<^sub>1 s "}) \cdot r_2) :: |
|
1346 [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (2) \\ |
|
1347 & $\leq$ & $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2| + |
|
1348 |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (3) \\ |
|
1349 & $\leq$ & $N_1 + |r_2| + 2 + |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])|$ & (4)\\ |
|
1350 & $\leq$ & $N_1 + |r_2| + 2 + l_{N_{2}} * N_{2}$ & (5) |
|
1351 \end{tabular} |
|
1352 \end{center} |
|
1353 |
|
1354 % tell Chengsong about Indian paper of closed forms of derivatives |
|
1355 |
|
1356 \noindent |
|
1357 where in (1) the $Suf\!fix(s')$ are the suffixes where @{term "bders_simp r\<^sub>1 s''"} is nullable for |
|
1358 @{text "s = s'' @ s'"}. In (3) we know that $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2|$ is |
|
1359 bounded by $N_1 + |r_2|$. In (5) we know the list comprehension contains only regular expressions of size smaller |
|
1360 than $N_2$. The list length after @{text distinctBy} is bounded by a number, which we call $l_{N_2}$. It stands |
|
1361 for the number of distinct regular expressions with a maximum size $N_2$ (there can only be finitely many of them). |
|
1362 We reason similarly in the @{text Star}-case.\medskip |
|
1363 |
|
1364 \noindent |
|
1365 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is |
|
1366 far from the actual bound we can expect. We can do better than this, but this does not improve |
|
1367 the finiteness property we are proving. If we are interested in a polynomial bound, |
|
1368 one would hope to obtain a similar tight bound as for partial |
|
1369 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with |
|
1370 @{text distinctBy} is to maintain a ``set'' of alternatives (like the sets in |
|
1371 partial derivatives). Unfortunately to obtain the exact same bound would mean |
|
1372 we need to introduce simplifications such as |
|
1373 % |
|
1374 \[ (r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3) |
|
1375 \] |
|
1376 |
|
1377 \noindent |
|
1378 which exist for partial derivatives. However, if we introduce them in our |
|
1379 setting we would lose the POSIX property of our calculated values. We leave better |
|
1380 bounds for future work. |
|
1381 |
|
1382 *} |
1317 |
1383 |
1318 |
1384 |
1319 section {* Conclusion *} |
1385 section {* Conclusion *} |
1320 |
1386 |
1321 text {* |
1387 text {* |
1322 |
1388 |
1323 We set out in this work to implement in Isabelle/HOL the secod lexing |
1389 We set out in this work to prove in Isabelle/HOL the correctness of |
1324 algorithm by Sulzmann and Lu \cite{Sulzmann2014}. This algorithm is |
1390 the second POSIX lexing algorithm by Sulzmann and Lu |
1325 interesting because it includes an ``aggressive'' simplification |
1391 \cite{Sulzmann2014}. This follows earlier work where we established |
1326 function that keeps the sizes of derivatives finite. |
1392 the correctness of the first algorithm |
1327 |
1393 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to |
1328 Having proved the correctness of the POSIX lexing algorithm in |
1394 introduce our own specification about what POSIX values are, |
1329 [53], which lessons have we learned? Well, we feel this is a |
1395 because the informal definition given by Sulzmann and Lu did not |
1330 perfect example where formal proofs give further insight into the |
1396 stand up to a formal proof. Also for the second algorithm we needed |
1331 matter at hand. Our proofs were done both done by hand and |
1397 to introduce our own definitions and proof ideas in order to establish the |
1332 checked in Isabelle/HOL. The experience of doing our proofs in this |
1398 correctness. Our interest in the second algorithm |
1333 way has been that the mechanical checking was absolutely essential: |
1399 lies in the fact that by using bitcoded regular expressions and an aggressive |
1334 despite the apparent maturity, this subject area has hidden |
1400 simplification method there is a chance that the the derivatives |
1335 snares. |
1401 can be kept universally small (we established in this paper that |
|
1402 they can be kept finite for any string). This is important if one is after |
|
1403 an efficient POSIX lexing algorithm. |
|
1404 |
|
1405 Having proved the correctness of the POSIX lexing algorithm, which |
|
1406 lessons have we learned? Well, we feel this is a very good example |
|
1407 where formal proofs give further insight into the matter at |
|
1408 hand. For example it is very hard to see a problem with @{text nub} |
|
1409 vs @{text distinctBy} with only experimental data---one would still |
|
1410 see the correct result but find that simplification does not simplify in well-chosen, but not |
|
1411 obscure, examples. We found that from an implementation |
|
1412 point-of-view it is really important to have the formal proofs of |
|
1413 the corresponding properties at hand. We have also developed a |
|
1414 healthy suspicion when experimental data is used to back up |
|
1415 efficiency claims. For example Sulzmann and Lu write about their |
|
1416 equivalent of @{term blexer_simp} ``...we can incrementally compute |
|
1417 bitcoded parse trees in linear time in the size of the input'' |
|
1418 \cite[Page 14]{Sulzmann2014}. |
|
1419 Given the growth of the |
|
1420 derivatives in some cases even after aggressive simplification, this |
|
1421 is a hard to believe fact. A similar claim of about a theoretical runtime |
|
1422 of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates POSIX matches and is based on |
|
1423 derivatives \cite{verbatim}. In this case derivatives are not simplified. |
|
1424 Clearly our result of having finite |
|
1425 derivatives is rather weak in this context but we think such effeciency claims |
|
1426 require further scrutiny.\medskip |
|
1427 |
|
1428 \noindent |
|
1429 Our Isabelle/HOL code is available under \url{https://github.com/urbanchr/posix}. |
1336 |
1430 |
1337 |
1431 |
1338 %%\bibliographystyle{plain} |
1432 %%\bibliographystyle{plain} |
1339 \bibliography{root} |
1433 \bibliography{root} |
1340 *} |
1434 *} |