diff -r 56781ad291cf -r e00950ba4514 thys2/Journal/PaperExt.thy
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/thys2/Journal/PaperExt.thy	Mon Nov 01 10:40:21 2021 +0000
@@ -0,0 +1,323 @@
+(*<*)
+theory PaperExt
+imports
+   (*"../LexerExt"*)
+   (*"../PositionsExt"*)
+   "~~/src/HOL/Library/LaTeXsugar"
+begin
+(*>*)
+
+(*
+declare [[show_question_marks = false]]
+declare [[eta_contract = false]]
+
+abbreviation
+  "der_syn r c \<equiv> der c r"
+
+abbreviation
+  "ders_syn r s \<equiv> ders s r"
+
+
+notation (latex output)
+  If ("(\<^latex>\\\textrm{\if\<^latex>\}\ (_)/ \<^latex>\\\textrm{\then\<^latex>\}\ (_)/ \<^latex>\\\textrm{\else\<^latex>\}\ (_))" 10) and
+  Cons ("_\<^latex>\\\mbox{$\\,$}\::\<^latex>\\\mbox{$\\,$}\_" [75,73] 73) and
+
+  ZERO ("\<^bold>0" 81) and
+  ONE ("\<^bold>1" 81) and
+  CHAR ("_" [1000] 80) and
+  ALT ("_ + _" [77,77] 78) and
+  SEQ ("_ \ _" [77,77] 78) and
+  STAR ("_\<^sup>\<star>" [78] 78) and
+  NTIMES ("_\<^bsup>'{_'}\<^esup>" [78, 50] 80) and
+  FROMNTIMES ("_\<^bsup>'{_..'}\<^esup>" [78, 50] 80) and
+  UPNTIMES ("_\<^bsup>'{.._'}\<^esup>" [78, 50] 80) and
+  NMTIMES ("_\<^bsup>'{_.._'}\<^esup>" [78, 50,50] 80) and
+
+  val.Void ("Empty" 78) and
+  val.Char ("Char _" [1000] 78) and
+  val.Left ("Left _" [79] 78) and
+  val.Right ("Right _" [1000] 78) and
+  val.Seq ("Seq _ _" [79,79] 78) and
+  val.Stars ("Stars _" [1000] 78) and
+
+  L ("L'(_')" [10] 78) and
+  LV ("LV _ _" [80,73] 78) and
+  der_syn ("_\\_" [79, 1000] 76) and
+  ders_syn ("_\\_" [79, 1000] 76) and
+  flat ("|_|" [75] 74) and
+  flats ("|_|" [72] 74) and
+  Sequ ("_ @ _" [78,77] 63) and
+  injval ("inj _ _ _" [81,77,79] 76) and
+  mkeps ("mkeps _" [79] 76) and
+  length ("len _" [73] 73) and
+  intlen ("len _" [73] 73) and
+  set ("_" [73] 73) and
+
+  Prf ("_ : _" [75,75] 75) and
+  Posix ("'(_, _') \ _" [63,75,75] 75) and
+
+  lexer ("lexer _ _" [78,78] 77) and
+  DUMMY ("\<^latex>\\\underline{\\hspace{2mm}}\")
+*)
+(*>*)
+
+(*
+section {* Extensions *}
+
+text {*
+
+  A strong point in favour of Sulzmann and Lu's algorithm is that it
+  can be extended in various ways. If we are interested in tokenising
+  a string, then we need not just to split up the string into tokens,
+  but also to ``classify'' the tokens (for example as keywords or
+  identifiers). This can be done with only minor modifications to the
+  algorithm by introducing \emph{record regular expressions} and
+  \emph{record values} (for example \cite{Sulzmann2014b}):
+
+  \begin{center}
+  @{text "r :="}
+  @{text "..."} $\mid$
+  @{text "(l : r)"} \qquad\qquad
+  @{text "v :="}
+  @{text "..."} $\mid$
+  @{text "(l : v)"}
+  \end{center}
+
+  \noindent where @{text l} is a label, say a string, @{text r} a
+  regular expression and @{text v} a value. All functions can be
+  smoothly extended to these regular expressions and values. For
+  example \mbox{@{text "(l : r)"}} is nullable iff @{term r} is, and
+  so on. The purpose of the record regular expression is to mark
+  certain parts of a regular expression and then to record in the
+  calculated value which part of the string was matched by this part.
+  The label can then serve as a classification for the tokens. For
+  this, recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
+  keywords and identifiers from the Introduction. With record regular
+  expressions we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}},
+  then traverse the calculated value and collect only the underlying
+  strings in record values. With this we obtain finite sequences of
+  pairs of labels and strings, for example
+
+  \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
+
+  \noindent from which tokens with classifications (keyword-token,
+  identifier-token and so on) can be extracted.
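+
+  The traversal itself is easily definable. The following is a small
+  sketch only, not part of our formalisation: assuming a hypothetical
+  record constructor @{text "REC l r"} with corresponding record
+  values @{text "Rec l v"}, the function collecting the labelled
+  substrings could be written as
+
+  \begin{verbatim}
+  fun collect :: "val => (string * string) list" where
+    "collect Void        = []"
+  | "collect (Char c)    = []"
+  | "collect (Left v)    = collect v"
+  | "collect (Right v)   = collect v"
+  | "collect (Seq v1 v2) = collect v1 @ collect v2"
+  | "collect (Stars vs)  = concat (map collect vs)"
+    (* a record value contributes its label paired with the
+       matched (flattened) string, plus any nested records *)
+  | "collect (Rec l v)   = (l, flat v) # collect v"
+  \end{verbatim}
+
+  \noindent Applied to the value calculated for
+  \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}},
+  this yields exactly the label-string pairs above.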
+
+  In the context of POSIX matching, it is also interesting to study
+  additional constructors for bounded repetitions of regular
+  expressions. For this let us extend the results from the previous
+  section to the following four additional regular expression
+  constructors:
+
+  \begin{center}
+  \begin{tabular}{lcrl@ {\hspace{12mm}}l}
+  @{text r} & @{text ":="} & $\ldots\mid$ & @{term "NTIMES r n"} & exactly-@{text n}-times\\
+  & & $\mid$ & @{term "UPNTIMES r n"} & upto-@{text n}-times\\
+  & & $\mid$ & @{term "FROMNTIMES r n"} & from-@{text n}-times\\
+  & & $\mid$ & @{term "NMTIMES r n m"} & between-@{text n}-and-@{text m}-times\\
+  \end{tabular}
+  \end{center}
+
+  \noindent
+  We will call them \emph{bounded regular expressions}. With the help
+  of the power operator for sets of strings (definition omitted, but
+  see the sketch below), the languages recognised by these regular
+  expressions can be defined in Isabelle as follows:
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  @{thm (lhs) L.simps(8)} & $\dn$ & @{thm (rhs) L.simps(8)}\\
+  @{thm (lhs) L.simps(7)} & $\dn$ & @{thm (rhs) L.simps(7)}\\
+  @{thm (lhs) L.simps(9)} & $\dn$ & @{thm (rhs) L.simps(9)}\\
+  @{thm (lhs) L.simps(10)} & $\dn$ & @{thm (rhs) L.simps(10)}\\
+  \end{tabular}
+  \end{center}
+
+  \noindent This definition implies that in the last clause
+  @{term "NMTIMES r n m"} matches no string in case @{term "m < n"},
+  because then the interval @{term "{n..m}"} is empty.
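+
+  For concreteness, the omitted power operator could be defined along
+  the following lines (a sketch; @{term Sequ} is our sequence
+  operation on languages):
+
+  \begin{verbatim}
+  fun pow :: "string set => nat => string set" where
+    (* zero iterations: only the empty string *)
+    "pow A 0       = {[]}"
+  | "pow A (Suc n) = Sequ A (pow A n)"
+  \end{verbatim}
+
+  \noindent With this, for example, the language of @{term "NTIMES r n"}
+  is @{text "pow (L r) n"} and the language of @{term "UPNTIMES r n"}
+  is the union of all @{text "pow (L r) i"} with @{text "i"} ranging
+  over @{text "0..n"}.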
+
+  While the language recognised by these regular expressions is
+  straightforward, some care is needed in defining the corresponding
+  lexical values. First, with a slight abuse of language, we will
+  (re)use values of the form @{term "Stars vs"} for the values
+  inhabiting bounded regular expressions. Second, we need to
+  introduce inductive rules for extending our inhabitation relation
+  shown on Page~\ref{prfintros}, from which we then derived our
+  notion of lexical values. Given the rule for @{term "STAR r"}, the
+  rule for @{term "UPNTIMES r n"} additionally requires that the
+  length of the list of values is smaller than or equal to @{term n}:
+
+  \begin{center}
+  @{thm[mode=Rule] Prf.intros(7)[of "vs"]}
+  \end{center}
+
+  \noindent As in the @{term "STAR r"}-rule, the left premise
+  requires that some non-empty part of the string is `chipped' away
+  by \emph{every} value in @{text vs}, that is, the corresponding
+  values do not flatten to the empty string. In the rule for
+  @{term "NTIMES r n"} (that is exactly-@{term n}-times @{text r}) we
+  clearly need to require that the length of the list of values
+  equals @{text n}. But enforcing that every one of these @{term n}
+  values `chips' away some part of the string would be too strong.
+  The rule therefore splits the values into two lists, @{text "vs\<^sub>1"}
+  and @{text "vs\<^sub>2"}, where only the values in @{text "vs\<^sub>1"} need to
+  flatten to non-empty strings:
+
+  \begin{center}
+  @{thm[mode=Rule] Prf.intros(8)[of "vs\<^sub>1" r "vs\<^sub>2"]}
+  \end{center}
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  @{thm (lhs) der.simps(8)} & $\dn$ & @{thm (rhs) der.simps(8)}\\
+  @{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}\\
+  @{thm (lhs) der.simps(9)} & $\dn$ & @{thm (rhs) der.simps(9)}\\
+  @{thm (lhs) der.simps(10)} & $\dn$ & @{thm (rhs) der.simps(10)}\\
+  \end{tabular}
+  \end{center}
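+
+  \noindent In functional form, the derivatives of the bounded
+  constructors follow the textbook pattern sketched below (the
+  precise clauses are the ones displayed above; this sketch may
+  differ from them in detail):
+
+  \begin{verbatim}
+  (* one copy of r is `consumed' by the derivative and the
+     counters are decreased accordingly *)
+  "der c (NTIMES r n) =
+     (if n = 0 then ZERO else SEQ (der c r) (NTIMES r (n - 1)))"
+  "der c (UPNTIMES r n) =
+     (if n = 0 then ZERO else SEQ (der c r) (UPNTIMES r (n - 1)))"
+  "der c (FROMNTIMES r n) =
+     (if n = 0 then SEQ (der c r) (STAR r)
+      else SEQ (der c r) (FROMNTIMES r (n - 1)))"
+  "der c (NMTIMES r n m) =
+     (if m = 0 then ZERO
+      else if n = 0 then SEQ (der c r) (UPNTIMES r (m - 1))
+      else SEQ (der c r) (NMTIMES r (n - 1) (m - 1)))"
+  \end{verbatim}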
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
+  @{thm (lhs) mkeps.simps(6)} & $\dn$ & @{thm (rhs) mkeps.simps(6)}\\
+  @{thm (lhs) mkeps.simps(7)} & $\dn$ & @{thm (rhs) mkeps.simps(7)}\\
+  @{thm (lhs) mkeps.simps(8)} & $\dn$ & @{thm (rhs) mkeps.simps(8)}\\
+  \end{tabular}
+  \end{center}
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  @{thm (lhs) injval.simps(8)} & $\dn$ & @{thm (rhs) injval.simps(8)}\\
+  @{thm (lhs) injval.simps(9)} & $\dn$ & @{thm (rhs) injval.simps(9)}\\
+  @{thm (lhs) injval.simps(10)} & $\dn$ & @{thm (rhs) injval.simps(10)}\\
+  @{thm (lhs) injval.simps(11)} & $\dn$ & @{thm (rhs) injval.simps(11)}\\
+  \end{tabular}
+  \end{center}
+
+
+  @{thm [mode=Rule] Posix_NTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}
+
+  @{thm [mode=Rule] Posix_NTIMES2}
+
+  @{thm [mode=Rule] Posix_UPNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}
+
+  @{thm [mode=Rule] Posix_UPNTIMES2}
+
+  @{thm [mode=Rule] Posix_FROMNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}
+
+  @{thm [mode=Rule] Posix_FROMNTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}
+
+  @{thm [mode=Rule] Posix_FROMNTIMES2}
+
+  @{thm [mode=Rule] Posix_NMTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}
+
+  @{thm [mode=Rule] Posix_NMTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}
+
+  @{thm [mode=Rule] Posix_NMTIMES2}
+
+
+
+  @{term "\ i \ {m..n} . P i"} @{term "\ i \ {..n} . P i"}
+  @{term "\ i \ {m..n} . P i"} @{term "\ i \ {..n} . P i"}
+  @{term "\ i \ {0::nat..n} . P i"}
+
+
+
+*}
+
+
+
+section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}
+
+text {*
+%  \newcommand{\greedy}{\succcurlyeq_{gr}}
+  \newcommand{\posix}{>}
+
+  An extended version of \cite{Sulzmann2014} is available at the
+  website of its first author; this includes some ``proofs'', claimed
+  in \cite{Sulzmann2014} to be ``rigorous''. Since these are
+  evidently not in final form, we make no comment thereon, preferring
+  to give general reasons for our belief that the approach of
+  \cite{Sulzmann2014} is problematic. Their central definition is an
+  ``ordering relation'' defined by the rules (slightly adapted to fit
+  our notation):
+
+  ??
+
+  \noindent The idea behind the rules (A1) and (A2), for example, is
+  that a @{text Left}-value is bigger than a @{text Right}-value, if
+  the underlying string of the @{text Left}-value is longer than or
+  of equal length to the underlying string of the @{text Right}-value.
+  The order is reversed, however, if the @{text Right}-value can
+  match a longer string than the @{text Left}-value. In this way the
+  POSIX value is supposed to be the biggest value for a given string
+  and regular expression.
+
+  Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by
+  Frisch and Cardelli from which they have taken the idea for their
+  correctness proof. Frisch and Cardelli introduced a similar
+  ordering for GREEDY matching and showed that their GREEDY matching
+  algorithm always produces a maximal element according to this
+  ordering (from all possible solutions). The only difference between
+  their GREEDY ordering and the ``ordering'' by Sulzmann and Lu is
+  that GREEDY always prefers a @{text Left}-value over a
+  @{text Right}-value, no matter what the underlying string is. This
+  seems to be only a very minor difference, but it has drastic
+  consequences in terms of what properties both orderings enjoy. What
+  is interesting for our purposes is that the properties reflexivity,
+  totality and transitivity for this GREEDY ordering can be proved
+  relatively easily by induction.
+*}
+*)
+
+section {* Conclusion *}
+
+text {*
+
+  We have implemented the POSIX value calculation algorithm
+  introduced by Sulzmann and Lu \cite{Sulzmann2014}. Our
+  implementation is nearly identical to the original and all
+  modifications we introduced are harmless (like our char-clause for
+  @{text inj}). We have proved this algorithm to be correct, but
+  correct according to our own specification of what POSIX values
+  are. Our specification (inspired by work of Vansummeren
+  \cite{Vansummeren2006}) appears to be much simpler than the one in
+  \cite{Sulzmann2014}, and our proofs are nearly always
+  straightforward. We have attempted to formalise the original proof
+  by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
+  unfillable gaps. In the online version of \cite{Sulzmann2014}, the
+  authors already acknowledge some small problems, but our experience
+  suggests that there are more serious problems.
+
+  Having proved the correctness of the POSIX lexing algorithm of
+  \cite{Sulzmann2014}, what lessons have we learned? Well, this is a
+  perfect example of the importance of the \emph{right} definitions.
+  We have (on and off) explored mechanisations ever since the first
+  versions of \cite{Sulzmann2014} appeared, but made little progress
+  with turning the relatively detailed proof sketch in
+  \cite{Sulzmann2014} into a formalisable proof. Seeing
+  \cite{Vansummeren2006} and adapting the POSIX definition given
+  there to the algorithm by Sulzmann and Lu made all the difference:
+  the proofs, as said, are nearly straightforward. The question
+  remains whether the original proof idea of \cite{Sulzmann2014},
+  potentially using our result as a stepping stone, can be made to
+  work. Alas, we really do not know, despite considerable effort.
+
+  Closely related to our work is an automata-based lexer formalised
+  by Nipkow \cite{Nipkow98}. This lexer also splits up strings into
+  longest initial substrings, but Nipkow's algorithm is not
+  completely computational. The algorithm by Sulzmann and Lu, in
+  contrast, can be implemented with ease in any functional language.
+  A bespoke lexer for the Imp-language is formalised in Coq as part
+  of the Software Foundations book by Pierce et al.\ \cite{Pierce2015}.
+  The disadvantage of such bespoke lexers is that they do not
+  generalise easily to more advanced features. Our formalisation is
+  available from the Archive of Formal Proofs \cite{aduAFP16} under
+  \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip
+
+  \noindent
+  {\bf Acknowledgements:}
+  We are very grateful to Martin Sulzmann for his comments on our
+  work and moreover for patiently explaining to us the details in
+  \cite{Sulzmann2014}. We also received very helpful comments from
+  James Cheney and anonymous referees.
+  % \small
+  \bibliographystyle{plain}
+  \bibliography{root}
+
+*}
+
+(*<*)
+end
+(*>*)
\ No newline at end of file