thys/Journal/Paper.thy
changeset 280 c840a99a3e05
parent 275 deea42c83c9e
child 287 95b3880d428f
(*<*)
theory Paper
imports
   "../Lexer"
   "../Simplifying"
   "../Positions"
   "~~/src/HOL/Library/LaTeXsugar"
begin

lemma Suc_0_fold:
  "Suc 0 = 1"
[...]

  string. The reason for introducing the more restricted version of
  lexical values is convenience later on when reasoning about an
  ordering relation for values.}

  \begin{center}
  \begin{tabular}{c@ {\hspace{12mm}}c}\label{prfintros}
  \\[-8mm]
  @{thm[mode=Axiom] Prf.intros(4)} & 
  @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
  @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} &
  @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
[...]

  \end{proof}

*}

section {* Extensions *}

text {*

  A strong point in favour of Sulzmann and Lu's algorithm is that it can
  be extended in various ways.  If we are interested in tokenising a
  string, then we need not just to split up the string into tokens, but
  also to ``classify'' the tokens (for example as keywords or
  identifiers). This can be done with only minor modifications to the
  algorithm by introducing \emph{record regular expressions} and
  \emph{record values} (for example \cite{Sulzmann2014b}):

  \begin{center}
  @{text "r :="}
  @{text "..."} $\mid$
  @{text "(l : r)"} \qquad\qquad
  @{text "v :="}
  @{text "..."} $\mid$
  @{text "(l : v)"}
  \end{center}

  \noindent where @{text l} is a label, say a string, @{text r} a regular
  expression and @{text v} a value. All functions can be smoothly extended
  to these regular expressions and values. For example \mbox{@{text "(l :
  r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
  regular expression is to mark certain parts of a regular expression and
  then record in the calculated value which parts of the string were
  matched by them. The label can then serve as a classification of the
  tokens. For this, recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
  keywords and identifiers from the Introduction. With record regular
  expressions we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
  and then traverse the calculated value, collecting only the underlying
  strings in record values. With this we obtain finite sequences of pairs
  of labels and strings, for example

  \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]

  \noindent from which tokens with classifications (keyword-token,
  identifier-token and so on) can be extracted.
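
  To make this concrete, here is a small sketch in Haskell (ours, given
  only for illustration; the names @{text Rec}, @{text RecV} and
  @{text collect} are not taken from \cite{Sulzmann2014b}). It extends
  the datatypes of regular expressions and values with record
  constructors, extends @{text nullable} in the obvious homomorphic way,
  and collects the labelled substrings from a calculated value:

  \begin{verbatim}
  -- a sketch of record regular expressions and record values;
  -- the names are ours, chosen only for illustration
  data Rexp = Zero | One | Ch Char | Seq Rexp Rexp
            | Alt Rexp Rexp | Star Rexp
            | Rec String Rexp         -- record regular expression (l : r)

  data Val = Empty | Chr Char | Sq Val Val
           | Lft Val | Rgt Val | Stars [Val]
           | RecV String Val          -- record value (l : v)

  -- nullable extends smoothly: (l : r) is nullable iff r is
  nullable :: Rexp -> Bool
  nullable Zero        = False
  nullable One         = True
  nullable (Ch _)      = False
  nullable (Seq r1 r2) = nullable r1 && nullable r2
  nullable (Alt r1 r2) = nullable r1 || nullable r2
  nullable (Star _)    = True
  nullable (Rec _ r)   = nullable r

  -- the underlying string of a value
  flat :: Val -> String
  flat Empty       = ""
  flat (Chr c)     = [c]
  flat (Sq v1 v2)  = flat v1 ++ flat v2
  flat (Lft v)     = flat v
  flat (Rgt v)     = flat v
  flat (Stars vs)  = concatMap flat vs
  flat (RecV _ v)  = flat v

  -- traverse a value and collect (l1 : s1), ..., (ln : sn)
  collect :: Val -> [(String, String)]
  collect (RecV l v) = (l, flat v) : collect v
  collect (Sq v1 v2) = collect v1 ++ collect v2
  collect (Lft v)    = collect v
  collect (Rgt v)    = collect v
  collect (Stars vs) = concatMap collect vs
  collect _          = []
  \end{verbatim}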
       

*}


section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}

text {*
%  \newcommand{\greedy}{\succcurlyeq_{gr}}
  \newcommand{\posix}{>}

  An extended version of \cite{Sulzmann2014} is available at the website
  of its first author; this includes some ``proofs'', claimed in
  \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not
  in final form, we make no comment on them, preferring to give general
  reasons for our belief that the approach of \cite{Sulzmann2014} is
  problematic. Their central definition is an ``ordering relation''
  defined by the rules (slightly adapted to fit our notation):

  ??

  \noindent The idea behind the rules (A1) and (A2), for example, is that
  a @{text Left}-value is bigger than a @{text Right}-value if the
  underlying string of the @{text Left}-value is longer than or of equal
  length to the underlying string of the @{text Right}-value. The order
  is reversed, however, if the @{text Right}-value can match a longer
  string than the @{text Left}-value. In this way the POSIX value is
  supposed to be the biggest value for a given string and regular
  expression.

  Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by
  Frisch and Cardelli, from which they have taken the idea for their
  correctness proof. Frisch and Cardelli introduced a similar ordering
  for GREEDY matching and showed that their GREEDY matching algorithm
  always produces a maximal element according to this ordering (from all
  possible solutions). The only difference between their GREEDY ordering
  and the ``ordering'' by Sulzmann and Lu is that GREEDY always prefers
  a @{text Left}-value over a @{text Right}-value, no matter what the
  underlying string is. This seems to be only a very minor difference,
  but it has drastic consequences in terms of what properties both
  orderings enjoy. What is interesting for our purposes is that the
  properties of reflexivity, totality and transitivity for this GREEDY
  ordering can be proved relatively easily by induction.
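
  To illustrate this one difference in the style of the earlier sketch
  (again ours and purely illustrative, reusing @{text Val} and
  @{text flat} from above; only the two clauses described in the text
  are shown, not the full set of rules):

  \begin{verbatim}
  -- GREEDY: a Left-value always beats a Right-value
  greedyPrefersLeft :: Val -> Val -> Bool
  greedyPrefersLeft (Lft _) (Rgt _) = True
  greedyPrefersLeft _       _       = False

  -- Sulzmann/Lu (A1/A2): a Left-value beats a Right-value only
  -- if its underlying string is at least as long
  posixPrefersLeft :: Val -> Val -> Bool
  posixPrefersLeft (Lft v1) (Rgt v2) =
    length (flat v1) >= length (flat v2)
  posixPrefersLeft _        _        = False
  \end{verbatim}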
       
*}


section {* Conclusion *}

text {*

       
  We have implemented the POSIX value calculation algorithm introduced by
  Sulzmann and Lu \cite{Sulzmann2014}. Our implementation is nearly
  identical to the original and all modifications we introduced are
  harmless (like our char-clause for @{text inj}). We have proved this
  algorithm to be correct, but correct according to our own specification
  of what POSIX values are. Our specification (inspired by the work of
  Vansummeren \cite{Vansummeren2006}) appears to be much simpler than the
  one in \cite{Sulzmann2014}, and our proofs are nearly always
  straightforward. We have attempted to formalise the original proof by
  Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
  unfillable gaps. In the online version of \cite{Sulzmann2014}, the
  authors already acknowledge some small problems, but our experience
  suggests that there are more serious problems.

       
  Having proved the correctness of the POSIX lexing algorithm in
  \cite{Sulzmann2014}, which lessons have we learned? Well, this is a
  perfect example of the importance of the \emph{right} definitions. We
  have (on and off) explored mechanisations as soon as first versions of
  \cite{Sulzmann2014} appeared, but made little progress with turning the
  relatively detailed proof sketch in \cite{Sulzmann2014} into a
  formalisable proof. Having seen \cite{Vansummeren2006} and adapted the
  POSIX definition given there for the algorithm by Sulzmann and Lu made
  all the difference: the proofs, as said, are nearly straightforward.
  The question remains whether the original proof idea of
  \cite{Sulzmann2014}, potentially using our result as a stepping stone,
  can be made to work. Alas, despite considerable effort, we really do
  not know.

       
  Closely related to our work is an automata-based lexer formalised by
  Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
  initial substrings, but Nipkow's algorithm is not completely
  computational. The algorithm by Sulzmann and Lu, in contrast, can be
  implemented with ease in any functional language. A bespoke lexer for
  the Imp-language is formalised in Coq as part of the Software
  Foundations book by Pierce et al.\ \cite{Pierce2015}. The disadvantage
  of such bespoke lexers is that they do not generalise easily to more
  advanced features. Our formalisation is available from the Archive of
  Formal Proofs \cite{aduAFP16} under
  \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip

       
  \noindent
  {\bf Acknowledgements:}
  We are very grateful to Martin Sulzmann for his comments on our work
  and moreover for patiently explaining to us the details in
  \cite{Sulzmann2014}. We also received very helpful comments from James
  Cheney and anonymous referees.
  %  \small
  \bibliographystyle{plain}
  \bibliography{root}

*}

(*<*)
end
(*>*)