1550 |
1550 |
1551 \end{proof} |
1551 \end{proof} |
1552 |
1552 |
1553 *} |
1553 *} |
1554 |
1554 |
1555 section {* Extensions*} |
1555 |
1556 |
|
1557 text {* |
|
1558 |
|
1559 A strong point in favour of |
|
1560 Sulzmann and Lu's algorithm is that it can be extended in various |
|
1561 ways. If we are interested in tokenising a string, then we need to not just |
|
1562 split up the string into tokens, but also ``classify'' the tokens (for |
|
1563 example whether it is a keyword or an identifier). This can be done with |
|
1564 only minor modifications to the algorithm by introducing \emph{record |
|
1565 regular expressions} and \emph{record values} (for example |
|
1566 \cite{Sulzmann2014b}): |
|
1567 |
|
1568 \begin{center} |
|
1569 @{text "r :="} |
|
1570 @{text "..."} $\mid$ |
|
1571 @{text "(l : r)"} \qquad\qquad |
|
1572 @{text "v :="} |
|
1573 @{text "..."} $\mid$ |
|
1574 @{text "(l : v)"} |
|
1575 \end{center} |
|
1576 |
|
1577 \noindent where @{text l} is a label, say a string, @{text r} a regular |
|
1578 expression and @{text v} a value. All functions can be smoothly extended |
|
1579 to these regular expressions and values. For example \mbox{@{text "(l : |
|
1580 r)"}} is nullable iff @{term r} is, and so on. The purpose of the record |
|
1581 regular expression is to mark certain parts of a regular expression and |
|
1582 then record in the calculated value which parts of the string were matched |
|
1583 by this part. The label can then serve as classification for the tokens. |
|
1584 For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for |
|
1585 keywords and identifiers from the Introduction. With the record regular |
|
1586 expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}} |
|
1587 and then traverse the calculated value and only collect the underlying |
|
1588 strings in record values. With this we obtain finite sequences of pairs of |
|
1589 labels and strings, for example |
|
1590 |
|
1591 \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\] |
|
1592 |
|
1593 \noindent from which tokens with classifications (keyword-token, |
|
1594 identifier-token and so on) can be extracted. |
|
1595 |
|
1596 |
|
1597 *} |
|
1598 |
|
1599 |
|
1600 |
|
1601 section {* The Correctness Argument by Sulzmann and Lu\label{argu} *} |
|
1602 |
|
1603 text {* |
|
1604 % \newcommand{\greedy}{\succcurlyeq_{gr}} |
|
1605 \newcommand{\posix}{>} |
|
1606 |
|
1607 An extended version of \cite{Sulzmann2014} is available at the website of |
|
1608 its first author; this includes some ``proofs'', claimed in |
|
1609 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in |
|
1610 final form, we make no comment thereon, preferring to give general reasons |
|
1611 for our belief that the approach of \cite{Sulzmann2014} is problematic. |
|
1612 Their central definition is an ``ordering relation'' defined by the |
|
1613 rules (slightly adapted to fit our notation): |
|
1614 |
|
1615 ?? |
|
1616 |
|
1617 \noindent The idea behind the rules (A1) and (A2), for example, is that a |
|
1618 @{text Left}-value is bigger than a @{text Right}-value, if the underlying |
|
1619 string of the @{text Left}-value is longer or of equal length to the |
|
1620 underlying string of the @{text Right}-value. The order is reversed, |
|
1621 however, if the @{text Right}-value can match a longer string than a |
|
1622 @{text Left}-value. In this way the POSIX value is supposed to be the |
|
1623 biggest value for a given string and regular expression. |
|
1624 |
|
1625 Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by Frisch |
|
1626 and Cardelli from where they have taken the idea for their correctness |
|
1627 proof. Frisch and Cardelli introduced a similar ordering for GREEDY |
|
1628 matching and they showed that their GREEDY matching algorithm always |
|
1629 produces a maximal element according to this ordering (from all possible |
|
1630 solutions). The only difference between their GREEDY ordering and the |
|
1631 ``ordering'' by Sulzmann and Lu is that GREEDY always prefers a @{text |
|
1632 Left}-value over a @{text Right}-value, no matter what the underlying |
|
1633 string is. This seems to be only a very minor difference, but it has |
|
1634 drastic consequences in terms of what properties both orderings enjoy. |
|
1635 What is interesting for our purposes is that the properties reflexivity, |
|
1636 totality and transitivity for this GREEDY ordering can be proved |
|
1637 relatively easily by induction. |
|
1638 *} |
|
1639 |
|
1640 |
|
1641 section {* Conclusion *} |
|
1642 |
|
1643 text {* |
|
1644 |
|
1645 We have implemented the POSIX value calculation algorithm introduced by |
|
1646 Sulzmann and Lu |
|
1647 \cite{Sulzmann2014}. Our implementation is nearly identical to the |
|
1648 original and all modifications we introduced are harmless (like our char-clause for |
|
1649 @{text inj}). We have proved this algorithm to be correct, but correct |
|
1650 according to our own specification of what POSIX values are. Our |
|
1651 specification (inspired from work by Vansummeren \cite{Vansummeren2006}) appears to be |
|
1652 much simpler than in \cite{Sulzmann2014} and our proofs are nearly always |
|
1653 straightforward. We have attempted to formalise the original proof |
|
1654 by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains |
|
1655 unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors |
|
1656 already acknowledge some small problems, but our experience suggests |
|
1657 that there are more serious problems. |
|
1658 |
|
1659 Having proved the correctness of the POSIX lexing algorithm in |
|
1660 \cite{Sulzmann2014}, which lessons have we learned? Well, this is a |
|
1661 perfect example for the importance of the \emph{right} definitions. We |
|
1662 have (on and off) explored mechanisations as soon as first versions |
|
1663 of \cite{Sulzmann2014} appeared, but have made little progress with |
|
1664 turning the relatively detailed proof sketch in \cite{Sulzmann2014} into a |
|
1665 formalisable proof. Having seen \cite{Vansummeren2006} and adapted the |
|
1666 POSIX definition given there for the algorithm by Sulzmann and Lu made all |
|
1667 the difference: the proofs, as said, are nearly straightforward. The |
|
1668 question remains whether the original proof idea of \cite{Sulzmann2014}, |
|
1669 potentially using our result as a stepping stone, can be made to work? |
|
1670 Alas, we really do not know despite considerable effort. |
|
1671 |
|
1672 |
|
1673 Closely related to our work is an automata-based lexer formalised by |
|
1674 Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest |
|
1675 initial substrings, but Nipkow's algorithm is not completely |
|
1676 computational. The algorithm by Sulzmann and Lu, in contrast, can be |
|
1677 implemented with ease in any functional language. A bespoke lexer for the |
|
1678 Imp-language is formalised in Coq as part of the Software Foundations book |
|
1679 by Pierce et al \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they |
|
1680 do not generalise easily to more advanced features. |
|
1681 Our formalisation is available from the Archive of Formal Proofs \cite{aduAFP16} |
|
1682 under \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip |
|
1683 |
|
1684 \noindent |
|
1685 {\bf Acknowledgements:} |
|
1686 We are very grateful to Martin Sulzmann for his comments on our work and |
|
1687 moreover for patiently explaining to us the details in \cite{Sulzmann2014}. We |
|
1688 also received very helpful comments from James Cheney and anonymous referees. |
|
1689 % \small |
|
1690 \bibliographystyle{plain} |
|
1691 \bibliography{root} |
|
1692 |
|
1693 *} |
|
1694 |
1556 |
1695 |
1557 |
1696 (*<*) |
1558 (*<*) |
1697 end |
1559 end |
1698 (*>*) |
1560 (*>*) |