(*<*)
theory PaperExt
imports
  (*"../LexerExt"*)
  (*"../PositionsExt"*)
  "~~/src/HOL/Library/LaTeXsugar"
begin
(*>*)

(*
declare [[show_question_marks = false]]
declare [[eta_contract = false]]

abbreviation "der_syn r c \<equiv> der c r"
abbreviation "ders_syn r s \<equiv> ders s r"

notation (latex output)
  If ("(\<^latex>\<open>\\textrm{\<close>if\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>then\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>else\<^latex>\<open>}\<close> (_))" 10) and
  Cons ("_\<^latex>\<open>\\mbox{$\\,$}\<close>::\<^latex>\<open>\\mbox{$\\,$}\<close>_" [75,73] 73) and
  ZERO ("\<^bold>0" 81) and
  ONE ("\<^bold>1" 81) and
  CHAR ("_" [1000] 80) and
  ALT ("_ + _" [77,77] 78) and
  SEQ ("_ \<cdot> _" [77,77] 78) and
  STAR ("_\<^sup>\<star>" [78] 78) and
  NTIMES ("_\<^bsup>'{_'}\<^esup>" [78, 50] 80) and
  FROMNTIMES ("_\<^bsup>'{_..'}\<^esup>" [78, 50] 80) and
  UPNTIMES ("_\<^bsup>'{.._'}\<^esup>" [78, 50] 80) and
  NMTIMES ("_\<^bsup>'{_.._'}\<^esup>" [78, 50,50] 80) and
  val.Void ("Empty" 78) and
  val.Char ("Char _" [1000] 78) and
  val.Left ("Left _" [79] 78) and
  val.Right ("Right _" [1000] 78) and
  val.Seq ("Seq _ _" [79,79] 78) and
  val.Stars ("Stars _" [1000] 78) and
  L ("L'(_')" [10] 78) and
  LV ("LV _ _" [80,73] 78) and
  der_syn ("_\\_" [79, 1000] 76) and
  ders_syn ("_\\_" [79, 1000] 76) and
  flat ("|_|" [75] 74) and
  flats ("|_|" [72] 74) and
  Sequ ("_ @ _" [78,77] 63) and
  injval ("inj _ _ _" [81,77,79] 76) and
  mkeps ("mkeps _" [79] 76) and
  length ("len _" [73] 73) and
  intlen ("len _" [73] 73) and
  set ("_" [73] 73) and
  Prf ("_ : _" [75,75] 75) and
  Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
  lexer ("lexer _ _" [78,78] 77) and
  DUMMY ("\<^latex>\<open>\\underline{\\hspace{2mm}}\<close>")
*)
(*>*)

(*
section {* Extensions *}

text {*

  A strong point in favour of Sulzmann and Lu's algorithm is that it can be
  extended in various ways. If we are interested in tokenising a string,
  then we need to not just split up the string into tokens, but also
  ``classify'' the tokens (for example whether they are keywords or
  identifiers). This can be done with only minor modifications to the
  algorithm by introducing \emph{record regular expressions} and
  \emph{record values} (for example \cite{Sulzmann2014b}):

  \begin{center}
  @{text "r :="} @{text "..."} $\mid$ @{text "(l : r)"}
  \qquad\qquad
  @{text "v :="} @{text "..."} $\mid$ @{text "(l : v)"}
  \end{center}

  \noindent where @{text l} is a label, say a string, @{text r} a regular
  expression and @{text v} a value. All functions can be smoothly extended
  to these regular expressions and values. For example \mbox{@{text "(l :
  r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
  regular expression is to mark certain parts of a regular expression and
  then record in the calculated value which parts of the string were
  matched by this part. The label can then serve as classification for the
  tokens. For this, recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"}
  for keywords and identifiers from the Introduction. With the record
  regular expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
  and then traverse the calculated value and only collect the underlying
  strings in record values. With this we obtain finite sequences of pairs
  of labels and strings, for example

  \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]

  \noindent from which tokens with classifications (keyword-token,
  identifier-token and so on) can be extracted.
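  Such a sequence can be obtained by a simple traversal of the calculated
  value, for example as follows (the function @{text collect} is only meant
  as a sketch here and is not part of our formalisation):

  \begin{center}
  \begin{tabular}{lcl}
  @{text "collect (l : v)"} & $\dn$ & @{text "(l, |v|) :: collect v"}\\
  @{text "collect (Seq v\<^sub>1 v\<^sub>2)"} & $\dn$ & @{text "collect v\<^sub>1 @ collect v\<^sub>2"}\\
  @{text "collect (Left v)"} & $\dn$ & @{text "collect v"}\\
  @{text "collect (Right v)"} & $\dn$ & @{text "collect v"}\\
  @{text "collect (Stars vs)"} & $\dn$ & @{text "concat (map collect vs)"}\\
  @{text "collect _"} & $\dn$ & @{text "[]"}\\
  \end{tabular}
  \end{center}

  \noindent where @{text "|v|"} stands for the flattening of @{text v} and
  @{text "@"} for list-append.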
  In the context of POSIX matching, it is also interesting to study
  additional constructors for bounded repetitions of regular expressions.
  For this let us extend the results from the previous section to the
  following four additional regular expression constructors:

  \begin{center}
  \begin{tabular}{lcrl@ {\hspace{12mm}}l}
  @{text r} & @{text ":="} & $\ldots\mid$ & @{term "NTIMES r n"} & exactly-@{text n}-times\\
            &              & $\mid$ & @{term "UPNTIMES r n"} & upto-@{text n}-times\\
            &              & $\mid$ & @{term "FROMNTIMES r n"} & from-@{text n}-times\\
            &              & $\mid$ & @{term "NMTIMES r n m"} & between-@{text n}-and-@{text m}-times\\
  \end{tabular}
  \end{center}

  \noindent We will call them \emph{bounded regular expressions}. With the
  help of the power operator for sets of strings (definition omitted, but
  see the sketch below), the languages recognised by these regular
  expressions can be defined in Isabelle as follows:

  \begin{center}
  \begin{tabular}{lcl}
  @{thm (lhs) L.simps(8)} & $\dn$ & @{thm (rhs) L.simps(8)}\\
  @{thm (lhs) L.simps(7)} & $\dn$ & @{thm (rhs) L.simps(7)}\\
  @{thm (lhs) L.simps(9)} & $\dn$ & @{thm (rhs) L.simps(9)}\\
  @{thm (lhs) L.simps(10)} & $\dn$ & @{thm (rhs) L.simps(10)}\\
  \end{tabular}
  \end{center}

  \noindent This definition implies that, in the last clause, @{term
  "NMTIMES r n m"} matches no string in case @{term "m < n"}, because then
  the interval @{term "{n..m}"} is empty.
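  For completeness, the omitted power operator on sets of strings can be
  defined as usual (this is the standard definition, given here only as a
  sketch and not necessarily verbatim as in our formalisation):

  \begin{center}
  \begin{tabular}{lcl}
  $A^{0}$   & $\dn$ & $\{[]\}$\\
  $A^{n+1}$ & $\dn$ & $A \,@\, A^{n}$\\
  \end{tabular}
  \end{center}

  \noindent where $@$ denotes the concatenation of sets of strings.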
  While the languages recognised by these regular expressions are
  straightforward to define, some care is needed for how to define the
  corresponding lexical values. First, with a slight abuse of language, we
  will (re)use values of the form @{term "Stars vs"} for the values
  inhabiting bounded regular expressions. Second, we need to introduce
  inductive rules for extending our inhabitation relation shown on
  Page~\ref{prfintros}, from which we then derived our notion of lexical
  values. Given the rule for @{term "STAR r"}, the rule for @{term
  "UPNTIMES r n"} just requires additionally that the length of the list of
  values must be less than or equal to @{term n}:

  \begin{center}
  @{thm[mode=Rule] Prf.intros(7)[of "vs"]}
  \end{center}

  \noindent As in the @{term "STAR r"}-rule, we require with the left
  premise that some non-empty part of the string is `chipped' away by
  \emph{every} value in @{text vs}; that is, the corresponding values do
  not flatten to the empty string. In the rule for @{term "NTIMES r n"}
  (that is exactly-@{term n}-times @{text r}) we clearly need to require
  that the length of the list of values equals @{text n}. But enforcing
  that each of these @{term n} values `chips' away some part of the string
  would be too strong. Therefore the rule distinguishes two lists of
  values: @{text "vs\<^sub>1"}, whose values flatten to non-empty strings, and
  @{text "vs\<^sub>2"}, whose values flatten to the empty string; only their
  combined length needs to equal @{text n}:

  \begin{center}
  @{thm[mode=Rule] Prf.intros(8)[of "vs\<^sub>1" r "vs\<^sub>2"]}
  \end{center}

  \noindent The derivative function is extended to the bounded regular
  expressions by the following clauses:

  \begin{center}
  \begin{tabular}{lcl}
  @{thm (lhs) der.simps(8)} & $\dn$ & @{thm (rhs) der.simps(8)}\\
  @{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}\\
  @{thm (lhs) der.simps(9)} & $\dn$ & @{thm (rhs) der.simps(9)}\\
  @{thm (lhs) der.simps(10)} & $\dn$ & @{thm (rhs) der.simps(10)}\\
  \end{tabular}
  \end{center}

  \noindent Similarly, the function @{text mkeps} is extended by the
  clauses:

  \begin{center}
  \begin{tabular}{lcl}
  @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
  @{thm (lhs) mkeps.simps(6)} & $\dn$ & @{thm (rhs) mkeps.simps(6)}\\
  @{thm (lhs) mkeps.simps(7)} & $\dn$ & @{thm (rhs) mkeps.simps(7)}\\
  @{thm (lhs) mkeps.simps(8)} & $\dn$ & @{thm (rhs) mkeps.simps(8)}\\
  \end{tabular}
  \end{center}

  \noindent and the injection function @{text injval} by the clauses:

  \begin{center}
  \begin{tabular}{lcl}
  @{thm (lhs) injval.simps(8)} & $\dn$ & @{thm (rhs) injval.simps(8)}\\
  @{thm (lhs) injval.simps(9)} & $\dn$ & @{thm (rhs) injval.simps(9)}\\
  @{thm (lhs) injval.simps(10)} & $\dn$ & @{thm (rhs) injval.simps(10)}\\
  @{thm (lhs) injval.simps(11)} & $\dn$ & @{thm (rhs) injval.simps(11)}\\
  \end{tabular}
  \end{center}

  \noindent Finally, our POSIX relation needs to be extended with the
  following rules for the bounded regular expressions:

  @{thm [mode=Rule] Posix_NTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_NTIMES2}

  @{thm [mode=Rule] Posix_UPNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_UPNTIMES2}

  @{thm [mode=Rule] Posix_FROMNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_FROMNTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_FROMNTIMES2}

  @{thm [mode=Rule] Posix_NMTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_NMTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_NMTIMES2}
*}
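text {*
  To see these extended definitions in action, consider a small worked
  example (this is only a sketch, which assumes the clauses for the
  derivatives of sequences and bounded repetitions take their standard
  form). Matching the string @{text "aa"} against @{text "a{2}"} proceeds
  by

  \begin{center}
  \begin{tabular}{lcl}
  @{text "der a (a{2})"} & $=$ & @{text "1 \<cdot> a{1}"}\\
  @{text "der a (1 \<cdot> a{1})"} & $=$ & @{text "(0 \<cdot> a{1}) + (1 \<cdot> a{0})"}\\
  \end{tabular}
  \end{center}

  \noindent The final regular expression is nullable, since @{text "a{0}"}
  matches the empty string; hence @{text "aa"} is accepted and @{text
  mkeps} can calculate a value for how it was matched.
*}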
section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}

text {*
  %\newcommand{\greedy}{\succcurlyeq_{gr}} \newcommand{\posix}{>}

  An extended version of \cite{Sulzmann2014} is available at the website of
  its first author; this includes some ``proofs'', claimed in
  \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
  final form, we make no comment thereon, preferring to give general
  reasons for our belief that the approach of \cite{Sulzmann2014} is
  problematic. Their central definition is an ``ordering relation'' defined
  by the rules (slightly adapted to fit our notation):

  ??

  \noindent The idea behind the rules (A1) and (A2), for example, is that a
  @{text Left}-value is bigger than a @{text Right}-value if the underlying
  string of the @{text Left}-value is longer than, or of equal length to,
  the underlying string of the @{text Right}-value. The order is reversed,
  however, if the @{text Right}-value can match a longer string than the
  @{text Left}-value. In this way the POSIX value is supposed to be the
  biggest value for a given string and regular expression.

  Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by Frisch
  and Cardelli from which they have taken the idea for their correctness
  proof. Frisch and Cardelli introduced a similar ordering for GREEDY
  matching and they showed that their GREEDY matching algorithm always
  produces a maximal element according to this ordering (from all possible
  solutions). The only difference between their GREEDY ordering and the
  ``ordering'' by Sulzmann and Lu is that GREEDY always prefers a @{text
  Left}-value over a @{text Right}-value, no matter what the underlying
  string is. This seems to be only a very minor difference, but it has
  drastic consequences in terms of what properties both orderings enjoy.
  What is interesting for our purposes is that the properties reflexivity,
  totality and transitivity for this GREEDY ordering can be proved
  relatively easily by induction.
*}
*)

section {* Conclusion *}

text {*

  We have implemented the POSIX value calculation algorithm introduced by
  Sulzmann and Lu \cite{Sulzmann2014}. Our implementation is nearly
  identical to the original and all modifications we introduced are
  harmless (like our char-clause for @{text inj}). We have proved this
  algorithm to be correct, but correct according to our own specification
  of what POSIX values are. Our specification (inspired by work of
  Vansummeren \cite{Vansummeren2006}) appears to be much simpler than the
  one in \cite{Sulzmann2014}, and our proofs are nearly always
  straightforward. We have attempted to formalise the original proof by
  Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
  unfillable gaps. In the online version of \cite{Sulzmann2014}, the
  authors already acknowledge some small problems, but our experience
  suggests that there are more serious problems.

  Having proved the correctness of the POSIX lexing algorithm in
  \cite{Sulzmann2014}, what lessons have we learned? Well, this is a
  perfect example of the importance of the \emph{right} definitions. We
  have (on and off) explored mechanisations ever since the first versions
  of \cite{Sulzmann2014} appeared, but have made little progress with
  turning the relatively detailed proof sketch in \cite{Sulzmann2014} into
  a formalisable proof. Seeing \cite{Vansummeren2006} and adapting the
  POSIX definition given there to the algorithm by Sulzmann and Lu made all
  the difference: the proofs, as said, are nearly always straightforward.
  The question remains whether the original proof idea of
  \cite{Sulzmann2014}, potentially using our result as a stepping stone,
  can be made to work. Alas, we really do not know, despite considerable
  effort.

  Closely related to our work is an automata-based lexer formalised by
  Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
  initial substrings, but Nipkow's algorithm is not completely
  computational. The algorithm by Sulzmann and Lu, in contrast, can be
  implemented with ease in any functional language. A bespoke lexer for the
  Imp-language is formalised in Coq as part of the Software Foundations
  book by Pierce et al.\ \cite{Pierce2015}. The disadvantage of such
  bespoke lexers is that they do not generalise easily to more advanced
  features. Our formalisation is available from the Archive of Formal
  Proofs \cite{aduAFP16} under
  \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip

  \noindent
  {\bf Acknowledgements:}
  We are very grateful to Martin Sulzmann for his comments on our work and
  moreover for patiently explaining to us the details in
  \cite{Sulzmann2014}. We also received very helpful comments from James
  Cheney and anonymous referees.
  %
  \small
  \bibliographystyle{plain}
  \bibliography{root}
*}

(*<*)
end
(*>*)