thys/Journal/PaperExt.thy
changeset 291 25b7d6bfd294
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/thys/Journal/PaperExt.thy	Tue Aug 21 14:14:19 2018 +0100
@@ -0,0 +1,323 @@
+(*<*)
+theory PaperExt
+imports 
+   (*"../LexerExt"*)
+   (*"../PositionsExt"*)
+   "~~/src/HOL/Library/LaTeXsugar"
+begin
+(*>*)
+
+(*
+declare [[show_question_marks = false]]
+declare [[eta_contract = false]]
+
+abbreviation 
+  "der_syn r c \<equiv> der c r"
+
+abbreviation 
+  "ders_syn r s \<equiv> ders s r"
+
+
+notation (latex output)
+  If  ("(\<^latex>\<open>\\textrm{\<close>if\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>then\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>else\<^latex>\<open>}\<close> (_))" 10) and
+  Cons ("_\<^latex>\<open>\\mbox{$\\,$}\<close>::\<^latex>\<open>\\mbox{$\\,$}\<close>_" [75,73] 73) and  
+
+  ZERO ("\<^bold>0" 81) and 
+  ONE ("\<^bold>1" 81) and 
+  CHAR ("_" [1000] 80) and
+  ALT ("_ + _" [77,77] 78) and
+  SEQ ("_ \<cdot> _" [77,77] 78) and
+  STAR ("_\<^sup>\<star>" [78] 78) and
+  NTIMES ("_\<^bsup>'{_'}\<^esup>" [78, 50] 80) and
+  FROMNTIMES ("_\<^bsup>'{_..'}\<^esup>" [78, 50] 80) and
+  UPNTIMES ("_\<^bsup>'{.._'}\<^esup>" [78, 50] 80) and
+  NMTIMES ("_\<^bsup>'{_.._'}\<^esup>" [78, 50,50] 80) and
+
+  val.Void ("Empty" 78) and
+  val.Char ("Char _" [1000] 78) and
+  val.Left ("Left _" [79] 78) and
+  val.Right ("Right _" [1000] 78) and
+  val.Seq ("Seq _ _" [79,79] 78) and
+  val.Stars ("Stars _" [1000] 78) and
+
+  L ("L'(_')" [10] 78) and
+  LV ("LV _ _" [80,73] 78) and
+  der_syn ("_\\_" [79, 1000] 76) and  
+  ders_syn ("_\\_" [79, 1000] 76) and
+  flat ("|_|" [75] 74) and
+  flats ("|_|" [72] 74) and
+  Sequ ("_ @ _" [78,77] 63) and
+  injval ("inj _ _ _" [81,77,79] 76) and 
+  mkeps ("mkeps _" [79] 76) and 
+  length ("len _" [73] 73) and
+  intlen ("len _" [73] 73) and
+  set ("_" [73] 73) and
+ 
+  Prf ("_ : _" [75,75] 75) and
+  Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
+ 
+  lexer ("lexer _ _" [78,78] 77) and
+  DUMMY ("\<^latex>\<open>\\underline{\\hspace{2mm}}\<close>")
+*)  
+
+(*
+section {* Extensions *}
+
+text {*
+
+  A strong point in favour of
+  Sulzmann and Lu's algorithm is that it can be extended in various
+  ways.  If we are interested in tokenising a string, then we need not just
+  to split up the string into tokens, but also to ``classify'' the tokens
+  (for example as keywords or identifiers). This can be done with
+  only minor modifications to the algorithm by introducing \emph{record
+  regular expressions} and \emph{record values} (for example
+  \cite{Sulzmann2014b}):
+
+  \begin{center}  
+  @{text "r :="}
+  @{text "..."} $\mid$
+  @{text "(l : r)"} \qquad\qquad
+  @{text "v :="}
+  @{text "..."} $\mid$
+  @{text "(l : v)"}
+  \end{center}
+  
+  \noindent where @{text l} is a label, say a string, @{text r} a regular
+  expression and @{text v} a value. All functions can be smoothly extended
+  to these regular expressions and values. For example \mbox{@{text "(l :
+  r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
+  regular expression is to mark certain parts of a regular expression and
+  then record in the calculated value which parts of the string were matched
+  by these marked parts. The labels can then serve as classifications for the
+  tokens. For this, recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
+  keywords and identifiers from the Introduction. With the record regular
+  expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
+  and then traverse the calculated value, collecting only the underlying
+  strings of record values. With this we obtain finite sequences of pairs of
+  labels and strings, for example
+
+  \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
+  
+  \noindent from which tokens with classifications (keyword-token,
+  identifier-token and so on) can be extracted.
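+
+  To make this concrete, the following is a minimal sketch in Haskell;
+  the datatypes and the functions @{text flat} and @{text env} are our
+  own illustration of the idea, not the definitions of the
+  formalisation:
+
+  \begin{verbatim}
+  -- regular expressions and values, extended with records
+  data Rexp = Zero | One | Chr Char | Alt Rexp Rexp
+            | Seq Rexp Rexp | Star Rexp
+            | Rec String Rexp               -- (l : r)
+
+  data Val = Empty | Chr' Char | Left' Val | Right' Val
+           | Seq' Val Val | Stars [Val]
+           | Rec' String Val                -- (l : v)
+
+  -- flatten a value back to the string it matched
+  flat :: Val -> String
+  flat Empty        = ""
+  flat (Chr' c)     = [c]
+  flat (Left' v)    = flat v
+  flat (Right' v)   = flat v
+  flat (Seq' v1 v2) = flat v1 ++ flat v2
+  flat (Stars vs)   = concatMap flat vs
+  flat (Rec' _ v)   = flat v
+
+  -- collect the labelled substrings, i.e. the classified tokens
+  env :: Val -> [(String, String)]
+  env (Left' v)    = env v
+  env (Right' v)   = env v
+  env (Seq' v1 v2) = env v1 ++ env v2
+  env (Stars vs)   = concatMap env vs
+  env (Rec' l v)   = (l, flat v) : env v
+  env _            = []
+  \end{verbatim}
+
+  \noindent Applied to the value calculated for
+  \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}, the function
+  @{text env} returns exactly the label-string pairs shown above.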
+
+  In the context of POSIX matching, it is also interesting to study
+  additional constructors for bounded repetitions of regular expressions.
+  For this let us extend the results from the previous section to the
+  following four additional regular expression constructors:
+
+  \begin{center}
+  \begin{tabular}{lcrl@ {\hspace{12mm}}l}
+  @{text r} & @{text ":="} & $\ldots\mid$ & @{term "NTIMES r n"} & exactly-@{text n}-times\\
+            &              & $\mid$   & @{term "UPNTIMES r n"} & upto-@{text n}-times\\
+            &              & $\mid$   & @{term "FROMNTIMES r n"} & from-@{text n}-times\\
+            &              & $\mid$   & @{term "NMTIMES r n m"} & between-@{text n}-and-@{text m}-times\\
+  \end{tabular}
+  \end{center}
+
+  \noindent
+  We will call them \emph{bounded regular expressions}.
+  With the help of the power operator (definition omitted) for sets of
+  strings, the languages recognised by these regular expressions can be
+  defined in Isabelle as follows:
+
+  \begin{center}
+  \begin{tabular}{lcl} 
+  @{thm (lhs) L.simps(8)} & $\dn$ & @{thm (rhs) L.simps(8)}\\
+  @{thm (lhs) L.simps(7)} & $\dn$ & @{thm (rhs) L.simps(7)}\\
+  @{thm (lhs) L.simps(9)} & $\dn$ & @{thm (rhs) L.simps(9)}\\
+  @{thm (lhs) L.simps(10)} & $\dn$ & @{thm (rhs) L.simps(10)}\\
+  \end{tabular}
+  \end{center}
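+
+  \noindent where the omitted power operator on sets of strings can be
+  assumed to be the standard one, namely
+
+  \[
+  A^0 \;\dn\; \{[]\} \qquad\qquad A^{n+1} \;\dn\; A \,@\, A^n
+  \]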
+
+  \noindent This definition implies that in the last clause @{term
+  "NMTIMES r n m"} matches no string in case @{term "m < n"}, because
+  then the interval @{term "{n..m}"} is empty.  While the languages
+  recognised by these regular expressions are straightforward, some
+  care is needed in defining the corresponding lexical
+  values. First, with a slight abuse of language, we will (re)use
+  values of the form @{term "Stars vs"} for the values inhabiting
+  bounded regular expressions. Second, we need to introduce inductive
+  rules extending our inhabitation relation shown on
+  Page~\ref{prfintros}, from which we then derive our notion of
+  lexical values. Given the rule for @{term "STAR r"}, the rule for
+  @{term "UPNTIMES r n"} additionally requires that the length of
+  the list of values is at most @{term n}:
+
+  \begin{center}
+  @{thm[mode=Rule] Prf.intros(7)[of "vs"]}
+  \end{center}
+
+  \noindent As in the @{term "STAR r"}-rule, the left premise requires
+  that some non-empty part of the string is `chipped'
+  away by \emph{every} value in @{text vs}, that is, the corresponding
+  values do not flatten to the empty string. In the rule for @{term
+  "NTIMES r n"} (that is exactly-@{term n}-times @{text r}) we clearly need to require
+  that the length of the list of values equals @{text n}. But enforcing
+  that each of these @{term n} values `chips' away some part of the string
+  would be too strong. Therefore the rule splits the list of values into
+  two parts, @{text "vs\<^sub>1"} and @{text "vs\<^sub>2"}, where only the values in
+  @{text "vs\<^sub>1"} need to flatten to non-empty strings, while the values in
+  @{text "vs\<^sub>2"} flatten to the empty string:
+
+
+  \begin{center}
+  @{thm[mode=Rule] Prf.intros(8)[of "vs\<^sub>1" r "vs\<^sub>2"]}
+  \end{center}
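+
+  \noindent For example, the value @{text "Stars [Left (Char a), Right Empty, Right Empty]"}
+  inhabits the regular expression @{text "(a + 1)\<^bsup>{3}\<^esup>"} by this rule, with
+  @{text "vs\<^sub>1"} containing the single value that flattens to @{text "a"}
+  and @{text "vs\<^sub>2"} containing the two values that flatten to the empty
+  string.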
+
+  \noindent The derivative function needs to be extended to the four
+  bounded regular expressions as well:
+
+  \begin{center}
+  \begin{tabular}{lcl} 
+  @{thm (lhs) der.simps(8)} & $\dn$ & @{thm (rhs) der.simps(8)}\\
+  @{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}\\
+  @{thm (lhs) der.simps(9)} & $\dn$ & @{thm (rhs) der.simps(9)}\\
+  @{thm (lhs) der.simps(10)} & $\dn$ & @{thm (rhs) der.simps(10)}\\
+  \end{tabular}
+  \end{center} 
+
+  \noindent Similarly, the @{text mkeps} function receives four new
+  clauses:
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
+  @{thm (lhs) mkeps.simps(6)} & $\dn$ & @{thm (rhs) mkeps.simps(6)}\\
+  @{thm (lhs) mkeps.simps(7)} & $\dn$ & @{thm (rhs) mkeps.simps(7)}\\
+  @{thm (lhs) mkeps.simps(8)} & $\dn$ & @{thm (rhs) mkeps.simps(8)}\\
+  \end{tabular}
+  \end{center}
+
+  \noindent Finally, the injection function @{text injval} is extended
+  with four new clauses (a functional sketch of all three extensions is
+  given below):
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  @{thm (lhs) injval.simps(8)} & $\dn$ & @{thm (rhs) injval.simps(8)}\\
+  @{thm (lhs) injval.simps(9)} & $\dn$ & @{thm (rhs) injval.simps(9)}\\
+  @{thm (lhs) injval.simps(10)} & $\dn$ & @{thm (rhs) injval.simps(10)}\\
+  @{thm (lhs) injval.simps(11)} & $\dn$ & @{thm (rhs) injval.simps(11)}\\
+  \end{tabular}
+  \end{center}
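+
+  \noindent To see these definitions at a glance, here is a
+  self-contained Haskell sketch of how the derivative, @{text mkeps} and
+  the injection function extend to the bounded constructors. All names
+  are ours; the sketch follows the standard pattern of the clauses above
+  and is not a verbatim rendering of the formalisation:
+
+  \begin{verbatim}
+  data Rexp = Zero | One | Chr Char
+            | Alt Rexp Rexp | Seq Rexp Rexp | Star Rexp
+            | NTimes Rexp Int        -- r{n}
+            | UpNTimes Rexp Int      -- r{..n}
+            | FromNTimes Rexp Int    -- r{n..}
+            | NMTimes Rexp Int Int   -- r{n..m}
+
+  data Val = Empty | Chr' Char | Left' Val | Right' Val
+           | Seq' Val Val | Stars [Val]
+
+  nullable :: Rexp -> Bool
+  nullable Zero             = False
+  nullable One              = True
+  nullable (Chr _)          = False
+  nullable (Alt r1 r2)      = nullable r1 || nullable r2
+  nullable (Seq r1 r2)      = nullable r1 && nullable r2
+  nullable (Star _)         = True
+  nullable (NTimes r n)     = n == 0 || nullable r
+  nullable (UpNTimes _ _)   = True
+  nullable (FromNTimes r n) = n == 0 || nullable r
+  nullable (NMTimes r n m)  = n <= m && (n == 0 || nullable r)
+
+  der :: Char -> Rexp -> Rexp
+  der _ Zero        = Zero
+  der _ One         = Zero
+  der c (Chr d)     = if c == d then One else Zero
+  der c (Alt r1 r2) = Alt (der c r1) (der c r2)
+  der c (Seq r1 r2)
+    | nullable r1   = Alt (Seq (der c r1) r2) (der c r2)
+    | otherwise     = Seq (der c r1) r2
+  der c (Star r)    = Seq (der c r) (Star r)
+  der c (NTimes r n)
+    | n == 0        = Zero
+    | otherwise     = Seq (der c r) (NTimes r (n - 1))
+  der c (UpNTimes r n)
+    | n == 0        = Zero
+    | otherwise     = Seq (der c r) (UpNTimes r (n - 1))
+  der c (FromNTimes r n)
+    | n == 0        = Seq (der c r) (Star r)
+    | otherwise     = Seq (der c r) (FromNTimes r (n - 1))
+  der c (NMTimes r n m)
+    | m < n || m == 0 = Zero
+    | n == 0          = Seq (der c r) (UpNTimes r (m - 1))
+    | otherwise       = Seq (der c r) (NMTimes r (n - 1) (m - 1))
+
+  -- mkeps pads the bounded cases with values flattening to ""
+  mkeps :: Rexp -> Val
+  mkeps One              = Empty
+  mkeps (Alt r1 r2)      = if nullable r1 then Left' (mkeps r1)
+                           else Right' (mkeps r2)
+  mkeps (Seq r1 r2)      = Seq' (mkeps r1) (mkeps r2)
+  mkeps (Star _)         = Stars []
+  mkeps (NTimes r n)     = Stars (replicate n (mkeps r))
+  mkeps (UpNTimes _ _)   = Stars []
+  mkeps (FromNTimes r n) = Stars (replicate n (mkeps r))
+  mkeps (NMTimes r n _)  = Stars (replicate n (mkeps r))
+  mkeps _                = error "not nullable"
+
+  -- inj for the bounded cases mirrors the Star case
+  inj :: Rexp -> Char -> Val -> Val
+  inj (Chr d) _ Empty                        = Chr' d
+  inj (Alt r1 _) c (Left' v)                 = Left' (inj r1 c v)
+  inj (Alt _ r2) c (Right' v)                = Right' (inj r2 c v)
+  inj (Seq r1 _) c (Seq' v1 v2)              = Seq' (inj r1 c v1) v2
+  inj (Seq r1 _) c (Left' (Seq' v1 v2))      = Seq' (inj r1 c v1) v2
+  inj (Seq r1 r2) c (Right' v)               = Seq' (mkeps r1) (inj r2 c v)
+  inj (Star r) c (Seq' v (Stars vs))         = Stars (inj r c v : vs)
+  inj (NTimes r _) c (Seq' v (Stars vs))     = Stars (inj r c v : vs)
+  inj (UpNTimes r _) c (Seq' v (Stars vs))   = Stars (inj r c v : vs)
+  inj (FromNTimes r _) c (Seq' v (Stars vs)) = Stars (inj r c v : vs)
+  inj (NMTimes r _ _) c (Seq' v (Stars vs))  = Stars (inj r c v : vs)
+  inj _ _ _                                  = error "ill-formed value"
+  \end{verbatim}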
+
+  \noindent The POSIX relation is extended with the following rules for
+  the bounded regular expressions:
+
+  \begin{center}
+  \begin{tabular}{c}
+  @{thm [mode=Rule] Posix_NTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\qquad
+  @{thm [mode=Rule] Posix_NTIMES2}\medskip\\
+  @{thm [mode=Rule] Posix_UPNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\qquad
+  @{thm [mode=Rule] Posix_UPNTIMES2}\medskip\\
+  @{thm [mode=Rule] Posix_FROMNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\qquad
+  @{thm [mode=Rule] Posix_FROMNTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}\qquad
+  @{thm [mode=Rule] Posix_FROMNTIMES2}\medskip\\
+  @{thm [mode=Rule] Posix_NMTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\qquad
+  @{thm [mode=Rule] Posix_NMTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}\qquad
+  @{thm [mode=Rule] Posix_NMTIMES2}
+  \end{tabular}
+  \end{center}
+
+
+*}
+
+
+
+section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}
+
+text {*
+%  \newcommand{\greedy}{\succcurlyeq_{gr}}
+ \newcommand{\posix}{>}
+
+  An extended version of \cite{Sulzmann2014} is available at the website of
+  its first author; this includes some ``proofs'', claimed in
+  \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
+  final form, we make no comment thereon, preferring to give general reasons
+  for our belief that the approach of \cite{Sulzmann2014} is problematic.
+  Their central definition is an ``ordering relation'' defined by the
+  rules (slightly adapted to fit our notation):
+
+  ??
+
+  \noindent The idea behind the rules (A1) and (A2), for example, is that a
+  @{text Left}-value is bigger than a @{text Right}-value if the underlying
+  string of the @{text Left}-value is at least as long as the
+  underlying string of the @{text Right}-value. The order is reversed,
+  however, if the @{text Right}-value can match a longer string than the
+  @{text Left}-value. In this way the POSIX value is supposed to be the
+  biggest value for a given string and regular expression.
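+
+  For concreteness, the rules (A1) and (A2) can be rendered from this
+  description as follows (our paraphrase, not a verbatim copy of
+  \cite{Sulzmann2014}; the regular-expression index on the ordering is
+  suppressed):
+
+  \[
+  \frac{|v_1| \geq |v_2|}{\textit{Left}\;v_1 \posix \textit{Right}\;v_2}\;(A1)
+  \qquad
+  \frac{|v_1| < |v_2|}{\textit{Right}\;v_2 \posix \textit{Left}\;v_1}\;(A2)
+  \]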
+
+  Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by Frisch
+  and Cardelli, from which they took the idea for their correctness
+  proof. Frisch and Cardelli introduced a similar ordering for GREEDY
+  matching and they showed that their GREEDY matching algorithm always
+  produces a maximal element according to this ordering (from all possible
+  solutions). The only difference between their GREEDY ordering and the
+  ``ordering'' by Sulzmann and Lu is that GREEDY always prefers a @{text
+  Left}-value over a @{text Right}-value, no matter what the underlying
+  string is. This seems to be only a very minor difference, but it has
+  drastic consequences in terms of what properties both orderings enjoy.
+  What is interesting for our purposes is that the properties of
+  reflexivity, totality and transitivity for this GREEDY ordering can be
+  proved relatively easily by induction.
+*}
+*)
+
+section {* Conclusion *}
+
+text {*
+
+  We have implemented the POSIX value calculation algorithm introduced by
+  Sulzmann and Lu
+  \cite{Sulzmann2014}. Our implementation is nearly identical to the
+  original and all modifications we introduced are harmless (like our char-clause for
+  @{text inj}). We have proved this algorithm to be correct, but correct
+  according to our own specification of what POSIX values are. Our
+  specification (inspired by work of Vansummeren \cite{Vansummeren2006}) appears to be
+  much simpler than the one in \cite{Sulzmann2014} and our proofs are nearly always
+  straightforward. We have attempted to formalise the original proof
+  by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
+  unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
+  already acknowledge some small problems, but our experience suggests
+  that there are more serious problems. 
+  
+  Having proved the correctness of the POSIX lexing algorithm from
+  \cite{Sulzmann2014}, what lessons have we learned? Well, this is a
+  perfect example of the importance of the \emph{right} definitions. We
+  have (on and off) explored mechanisations ever since the first versions
+  of \cite{Sulzmann2014} appeared, but made little progress with
+  turning the relatively detailed proof sketch in \cite{Sulzmann2014} into a
+  formalisable proof. Having seen \cite{Vansummeren2006} and adapted the
+  POSIX definition given there to the algorithm by Sulzmann and Lu made all
+  the difference: the proofs, as said, are nearly straightforward. The
+  question remains whether the original proof idea of \cite{Sulzmann2014},
+  potentially using our result as a stepping stone, can be made to work.
+  Alas, despite considerable effort, we really do not know.
+
+
+  Closely related to our work is an automata-based lexer formalised by
+  Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
+  initial substrings, but Nipkow's algorithm is not completely
+  computational. The algorithm by Sulzmann and Lu, in contrast, can be
+  implemented with ease in any functional language. A bespoke lexer for the
+  Imp-language is formalised in Coq as part of the Software Foundations book
+  by Pierce et al.\ \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
+  do not generalise easily to more advanced features.
+  Our formalisation is available from the Archive of Formal Proofs \cite{aduAFP16}
+  under \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip
+
+ \noindent
+  {\bf Acknowledgements:}
+  We are very grateful to Martin Sulzmann for his comments on our work and 
+  moreover for patiently explaining to us the details in \cite{Sulzmann2014}. We
+  also received very helpful comments from James Cheney and anonymous referees.
+  %  \small
+  \bibliographystyle{plain}
+  \bibliography{root}
+
+*}
+
+(*<*)
+end
+(*>*)
\ No newline at end of file