
(*<*)
theory PaperExt
imports 
   (*"../LexerExt"*)
   (*"../PositionsExt"*)
   "~~/src/HOL/Library/LaTeXsugar"
begin
(*>*)

(*
declare [[show_question_marks = false]]
declare [[eta_contract = false]]

abbreviation 
  "der_syn r c \<equiv> der c r"

abbreviation 
  "ders_syn r s \<equiv> ders s r"


notation (latex output)
  If  ("(\<^latex>\<open>\\textrm{\<close>if\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>then\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>else\<^latex>\<open>}\<close> (_))" 10) and
  Cons ("_\<^latex>\<open>\\mbox{$\\,$}\<close>::\<^latex>\<open>\\mbox{$\\,$}\<close>_" [75,73] 73) and  

  ZERO ("\<^bold>0" 81) and 
  ONE ("\<^bold>1" 81) and 
  CHAR ("_" [1000] 80) and
  ALT ("_ + _" [77,77] 78) and
  SEQ ("_ \<cdot> _" [77,77] 78) and
  STAR ("_\<^sup>\<star>" [78] 78) and
  NTIMES ("_\<^bsup>'{_'}\<^esup>" [78, 50] 80) and
  FROMNTIMES ("_\<^bsup>'{_..'}\<^esup>" [78, 50] 80) and
  UPNTIMES ("_\<^bsup>'{.._'}\<^esup>" [78, 50] 80) and
  NMTIMES ("_\<^bsup>'{_.._'}\<^esup>" [78, 50,50] 80) and

  val.Void ("Empty" 78) and
  val.Char ("Char _" [1000] 78) and
  val.Left ("Left _" [79] 78) and
  val.Right ("Right _" [1000] 78) and
  val.Seq ("Seq _ _" [79,79] 78) and
  val.Stars ("Stars _" [1000] 78) and

  L ("L'(_')" [10] 78) and
  LV ("LV _ _" [80,73] 78) and
  der_syn ("_\\_" [79, 1000] 76) and  
  ders_syn ("_\\_" [79, 1000] 76) and
  flat ("|_|" [75] 74) and
  flats ("|_|" [72] 74) and
  Sequ ("_ @ _" [78,77] 63) and
  injval ("inj _ _ _" [81,77,79] 76) and 
  mkeps ("mkeps _" [79] 76) and 
  length ("len _" [73] 73) and
  intlen ("len _" [73] 73) and
  set ("_" [73] 73) and
 
  Prf ("_ : _" [75,75] 75) and
  Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
 
  lexer ("lexer _ _" [78,78] 77) and
  DUMMY ("\<^latex>\<open>\\underline{\\hspace{2mm}}\<close>")
*)  

(*
section {* Extensions *}

text {*

  A strong point in favour of
  Sulzmann and Lu's algorithm is that it can be extended in various
  ways.  If we are interested in tokenising a string, then we need not just to
  split up the string into tokens, but also to ``classify'' the tokens (for
  example as keywords or identifiers). This can be done with
  only minor modifications to the algorithm by introducing \emph{record
  regular expressions} and \emph{record values} (for example
  \cite{Sulzmann2014b}):

  \begin{center}  
  @{text "r :="}
  @{text "..."} $\mid$
  @{text "(l : r)"} \qquad\qquad
  @{text "v :="}
  @{text "..."} $\mid$
  @{text "(l : v)"}
  \end{center}
  
  \noindent where @{text l} is a label, say a string, @{text r} a regular
  expression and @{text v} a value. All functions can be smoothly extended
  to these regular expressions and values. For example \mbox{@{text "(l :
  r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
  regular expression is to mark certain parts of a regular expression and
  then record in the calculated value which part of the string was matched
  by them. The label can then serve as a classification of the tokens.
  For this, recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
  keywords and identifiers from the Introduction. With record regular
  expressions we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}},
  traverse the calculated value and collect only the underlying
  strings of record values. With this we obtain finite sequences of pairs of
  labels and strings, for example

  \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
  
  \noindent from which tokens with classifications (keyword-token,
  identifier-token and so on) can be extracted.
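
  To make this concrete, the following minimal sketch shows one way such an
  extension could look. The constructor names @{text REC} and @{text Rec} and
  the collecting function @{text env} are our own illustrative choices here,
  not notation from \cite{Sulzmann2014b}:

  type_synonym label = string

  datatype rexp = ZERO | ONE | CHAR char | SEQ rexp rexp | ALT rexp rexp
                | STAR rexp | REC label rexp    (* record regular expression (l : r) *)

  datatype val = Void | Char char | Seq val val | Left val | Right val
               | Stars "val list" | Rec label val    (* record value (l : v) *)

  (* flatten a value back into the string it matched *)
  fun flat :: "val \<Rightarrow> string" where
    "flat Void = []"
  | "flat (Char c) = [c]"
  | "flat (Left v) = flat v"
  | "flat (Right v) = flat v"
  | "flat (Seq v1 v2) = flat v1 @ flat v2"
  | "flat (Stars vs) = concat (map flat vs)"
  | "flat (Rec l v) = flat v"

  (* collect the label/string pairs recorded in a value *)
  fun env :: "val \<Rightarrow> (label \<times> string) list" where
    "env Void = []"
  | "env (Char c) = []"
  | "env (Left v) = env v"
  | "env (Right v) = env v"
  | "env (Seq v1 v2) = env v1 @ env v2"
  | "env (Stars vs) = concat (map env vs)"
  | "env (Rec l v) = (l, flat v) # env v"

  Traversing the calculated value with @{text env} then yields exactly the
  finite sequence of label/string pairs described above.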

  In the context of POSIX matching, it is also interesting to study additional
  constructors for the bounded repetition of regular expressions. For this let
  us extend the results from the previous section to the following four
  additional regular expression constructors:

  \begin{center}
  \begin{tabular}{lcrl@ {\hspace{12mm}}l}
  @{text r} & @{text ":="} & $\ldots\mid$ & @{term "NTIMES r n"} & exactly-@{text n}-times\\
            &              & $\mid$   & @{term "UPNTIMES r n"} & up-to-@{text n}-times\\
            &              & $\mid$   & @{term "FROMNTIMES r n"} & from-@{text n}-times\\
            &              & $\mid$   & @{term "NMTIMES r n m"} & between-@{text n}-and-@{text m}-times\\
  \end{tabular}
  \end{center}
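
  For concreteness, the four constructors might be added to the datatype of
  regular expressions as sketched below (only a sketch; the surrounding
  constructors are the standard ones from the earlier sections):

  datatype rexp = ZERO | ONE | CHAR char | SEQ rexp rexp | ALT rexp rexp
                | STAR rexp
                | NTIMES rexp nat          (* r{n}    exactly n times       *)
                | UPNTIMES rexp nat        (* r{..n}  up to n times         *)
                | FROMNTIMES rexp nat      (* r{n..}  at least n times      *)
                | NMTIMES rexp nat nat     (* r{n..m} between n and m times *)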

  \noindent
  We will call these four constructors \emph{bounded regular expressions}.
  With the help of the power operator for sets of strings (definition omitted), the languages
  recognised by these regular expressions can be defined in Isabelle as follows:

  \begin{center}
  \begin{tabular}{lcl} 
  @{thm (lhs) L.simps(8)} & $\dn$ & @{thm (rhs) L.simps(8)}\\
  @{thm (lhs) L.simps(7)} & $\dn$ & @{thm (rhs) L.simps(7)}\\
  @{thm (lhs) L.simps(9)} & $\dn$ & @{thm (rhs) L.simps(9)}\\
  @{thm (lhs) L.simps(10)} & $\dn$ & @{thm (rhs) L.simps(10)}\\
  \end{tabular}
  \end{center}
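
  Spelled out, the power operator and the four clauses can be sketched as
  follows. The names @{text Sequ}, @{text "@@"} and @{text pow} are
  illustrative assumptions standing in for the concatenation and power
  operators of the formalisation:

  definition Sequ :: "string set \<Rightarrow> string set \<Rightarrow> string set" (infixr "@@" 65)
    where "A @@ B \<equiv> {s1 @ s2 | s1 s2. s1 \<in> A \<and> s2 \<in> B}"

  (* n-fold concatenation of a language with itself *)
  fun pow :: "string set \<Rightarrow> nat \<Rightarrow> string set"
    where
    "pow A 0 = {[]}"
  | "pow A (Suc n) = A @@ pow A n"

  fun L :: "rexp \<Rightarrow> string set"
    where
    "L ZERO = {}"
  | "L ONE = {[]}"
  | "L (CHAR c) = {[c]}"
  | "L (SEQ r1 r2) = L r1 @@ L r2"
  | "L (ALT r1 r2) = L r1 \<union> L r2"
  | "L (STAR r) = (\<Union>i. pow (L r) i)"
  | "L (NTIMES r n) = pow (L r) n"
  | "L (UPNTIMES r n) = (\<Union>i \<in> {..n}. pow (L r) i)"
  | "L (FROMNTIMES r n) = (\<Union>i \<in> {n..}. pow (L r) i)"
  | "L (NMTIMES r n m) = (\<Union>i \<in> {n..m}. pow (L r) i)"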

  \noindent This definition implies that in the last clause @{term
  "NMTIMES r n m"} matches no string in case @{term "m < n"}, because
  then the interval @{term "{n..m}"} is empty.  While the languages
  recognised by these regular expressions are straightforward, some
  care is needed in defining the corresponding lexical
  values. First, with a slight abuse of language, we will (re)use
  values of the form @{term "Stars vs"} as the values inhabiting
  bounded regular expressions. Second, we need to introduce inductive
  rules extending our inhabitation relation shown on
  Page~\ref{prfintros}, from which we then derived our notion of
  lexical values. Given the rule for @{term "STAR r"}, the rule for
  @{term "UPNTIMES r n"} additionally requires that the length of
  the list of values is at most @{term n}:

  \begin{center}
  @{thm[mode=Rule] Prf.intros(7)[of "vs"]}
  \end{center}

  \noindent As in the @{term "STAR r"}-rule, the left premise requires
  that some non-empty part of the string is `chipped'
  away by \emph{every} value in @{text vs}; that is, the corresponding
  values do not flatten to the empty string. In the rule for @{term
  "NTIMES r n"} (that is exactly-@{term n}-times @{text r}) we clearly need to require
  that the length of the list of values equals @{text n}. But enforcing
  that each of these @{term n} values `chips' away some part of the string
  would be too strong, because @{term r} itself might be nullable. The rule
  below therefore splits the list of values into a part @{text "vs\<^sub>1"}
  whose values flatten to non-empty strings and a part @{text "vs\<^sub>2"}
  whose values flatten to the empty string:


  \begin{center}
  @{thm[mode=Rule] Prf.intros(8)[of "vs\<^sub>1" r "vs\<^sub>2"]}
  \end{center}
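
  In Isabelle these two rules might be rendered as clauses of an inductive
  definition roughly as follows. This is a sketch only, modelled on the rules
  displayed above and reusing the value datatype and @{text flat} from the
  record sketch earlier; the actual formalisation may phrase them differently:

  inductive Prf :: "val \<Rightarrow> rexp \<Rightarrow> bool"
    where
    (* ... the rules for ONE, CHAR, ALT, SEQ and STAR as on Page~\ref{prfintros} ... *)
    Prf_UPNTIMES:
      "\<lbrakk>\<forall>v \<in> set vs. Prf v r \<and> flat v \<noteq> []; length vs \<le> n\<rbrakk>
       \<Longrightarrow> Prf (Stars vs) (UPNTIMES r n)"
  | Prf_NTIMES:
      "\<lbrakk>\<forall>v \<in> set vs1. Prf v r \<and> flat v \<noteq> [];
        \<forall>v \<in> set vs2. Prf v r \<and> flat v = [];
        length (vs1 @ vs2) = n\<rbrakk>
       \<Longrightarrow> Prf (Stars (vs1 @ vs2)) (NTIMES r n)"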

  \noindent The derivative operation is extended to the bounded regular
  expressions by the following clauses:

  \begin{center}
  \begin{tabular}{lcl} 
  @{thm (lhs) der.simps(8)} & $\dn$ & @{thm (rhs) der.simps(8)}\\
  @{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}\\
  @{thm (lhs) der.simps(9)} & $\dn$ & @{thm (rhs) der.simps(9)}\\
  @{thm (lhs) der.simps(10)} & $\dn$ & @{thm (rhs) der.simps(10)}\\
  \end{tabular}
  \end{center} 
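
  Reconstructed as executable functions, these clauses might look as follows.
  This is a hedged sketch: the @{text nullable} clauses and the treatment of
  the corner cases, for instance @{term "m = 0"} in the @{term "NMTIMES r n m"}
  clause, are our reading of the definitions, not a verbatim copy of the
  formalisation:

  fun nullable :: "rexp \<Rightarrow> bool"
    where
    "nullable ZERO = False"
  | "nullable ONE = True"
  | "nullable (CHAR c) = False"
  | "nullable (ALT r1 r2) = (nullable r1 \<or> nullable r2)"
  | "nullable (SEQ r1 r2) = (nullable r1 \<and> nullable r2)"
  | "nullable (STAR r) = True"
  | "nullable (NTIMES r n) = (n = 0 \<or> nullable r)"
  | "nullable (UPNTIMES r n) = True"
  | "nullable (FROMNTIMES r n) = (n = 0 \<or> nullable r)"
  | "nullable (NMTIMES r n m) = (n \<le> m \<and> (n = 0 \<or> nullable r))"

  fun der :: "char \<Rightarrow> rexp \<Rightarrow> rexp"
    where
    "der c ZERO = ZERO"
  | "der c ONE = ZERO"
  | "der c (CHAR d) = (if c = d then ONE else ZERO)"
  | "der c (ALT r1 r2) = ALT (der c r1) (der c r2)"
  | "der c (SEQ r1 r2) =
       (if nullable r1 then ALT (SEQ (der c r1) r2) (der c r2)
        else SEQ (der c r1) r2)"
  | "der c (STAR r) = SEQ (der c r) (STAR r)"
    (* a bounded repetition peels off one copy and decrements the counter(s) *)
  | "der c (NTIMES r n) = (if n = 0 then ZERO else SEQ (der c r) (NTIMES r (n - 1)))"
  | "der c (UPNTIMES r n) = (if n = 0 then ZERO else SEQ (der c r) (UPNTIMES r (n - 1)))"
  | "der c (FROMNTIMES r n) =
       (if n = 0 then SEQ (der c r) (STAR r) else SEQ (der c r) (FROMNTIMES r (n - 1)))"
  | "der c (NMTIMES r n m) =
       (if m < n \<or> m = 0 then ZERO
        else if n = 0 then SEQ (der c r) (UPNTIMES r (m - 1))
        else SEQ (der c r) (NMTIMES r (n - 1) (m - 1)))"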


  \noindent Similarly, the function @{term mkeps}, which constructs a value
  witnessing how a nullable regular expression matches the empty string, is
  extended by the clauses:

  \begin{center}
  \begin{tabular}{lcl}
  @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
  @{thm (lhs) mkeps.simps(6)} & $\dn$ & @{thm (rhs) mkeps.simps(6)}\\
  @{thm (lhs) mkeps.simps(7)} & $\dn$ & @{thm (rhs) mkeps.simps(7)}\\
  @{thm (lhs) mkeps.simps(8)} & $\dn$ & @{thm (rhs) mkeps.simps(8)}\\
  \end{tabular}
  \end{center}
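
  A corresponding sketch for @{term mkeps} (again our reconstruction; as with
  the original @{term mkeps}, clauses for non-nullable regular expressions
  are omitted):

  fun mkeps :: "rexp \<Rightarrow> val"
    where
    "mkeps ONE = Void"
  | "mkeps (ALT r1 r2) = (if nullable r1 then Left (mkeps r1) else Right (mkeps r2))"
  | "mkeps (SEQ r1 r2) = Seq (mkeps r1) (mkeps r2)"
  | "mkeps (STAR r) = Stars []"
    (* the n mandatory copies are witnessed by n copies of mkeps r *)
  | "mkeps (NTIMES r n) = Stars (replicate n (mkeps r))"
  | "mkeps (UPNTIMES r n) = Stars []"
  | "mkeps (FROMNTIMES r n) = Stars (replicate n (mkeps r))"
  | "mkeps (NMTIMES r n m) = Stars (replicate n (mkeps r))"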

  \noindent Finally, the injection function @{text inj} obtains the following
  additional clauses:

  \begin{center}
  \begin{tabular}{lcl}
  @{thm (lhs) injval.simps(8)} & $\dn$ & @{thm (rhs) injval.simps(8)}\\
  @{thm (lhs) injval.simps(9)} & $\dn$ & @{thm (rhs) injval.simps(9)}\\
  @{thm (lhs) injval.simps(10)} & $\dn$ & @{thm (rhs) injval.simps(10)}\\
  @{thm (lhs) injval.simps(11)} & $\dn$ & @{thm (rhs) injval.simps(11)}\\
  \end{tabular}
  \end{center}
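
  These clauses all follow the shape of the @{term "STAR r"}-clause. A sketch,
  assuming the derivative clauses given above, under which a value for the
  derivative of a bounded regular expression always has the form
  @{term "Seq v (Stars vs)"}:

  fun injval :: "rexp \<Rightarrow> char \<Rightarrow> val \<Rightarrow> val"
    where
    (* ... the clauses for CHAR, ALT, SEQ and STAR as before ... *)
    "injval (NTIMES r n) c (Seq v (Stars vs)) = Stars (injval r c v # vs)"
  | "injval (UPNTIMES r n) c (Seq v (Stars vs)) = Stars (injval r c v # vs)"
  | "injval (FROMNTIMES r n) c (Seq v (Stars vs)) = Stars (injval r c v # vs)"
  | "injval (NMTIMES r n m) c (Seq v (Stars vs)) = Stars (injval r c v # vs)"

  The interesting point is that the counters play no role once the derivative
  has peeled off the first copy: the injected value is put back at the front
  of the list of values, exactly as in the @{term "STAR r"}-case.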
  

  \noindent The POSIX value relation @{text "(s, r) \<rightarrow> v"} is extended to the
  bounded regular expressions by the following rules:

  @{thm [mode=Rule] Posix_NTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_NTIMES2}

  @{thm [mode=Rule] Posix_UPNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_UPNTIMES2}

  @{thm [mode=Rule] Posix_FROMNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_FROMNTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_FROMNTIMES2}

  @{thm [mode=Rule] Posix_NMTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_NMTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_NMTIMES2}



   @{term "\<Sum> i \<in> {m..n} . P i"}  @{term "\<Sum> i \<in> {..n} . P i"}
  @{term "\<Union> i \<in> {m..n} . P i"}  @{term "\<Union> i \<in> {..n} . P i"}
   @{term "\<Union> i \<in> {0::nat..n} . P i"}



*}



section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}

text {*
%  \newcommand{\greedy}{\succcurlyeq_{gr}}
 \newcommand{\posix}{>}

  An extended version of \cite{Sulzmann2014} is available at the website of
  its first author; this includes some ``proofs'', claimed in
  \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
  final form, we make no comment thereon, preferring to give general reasons
  for our belief that the approach of \cite{Sulzmann2014} is problematic.
  Their central definition is an ``ordering relation'' defined by the
  rules (slightly adapted to fit our notation):

  ??

  \noindent The idea behind the rules (A1) and (A2), for example, is that a
  @{text Left}-value is bigger than a @{text Right}-value if the underlying
  string of the @{text Left}-value is longer than, or of equal length to, the
  underlying string of the @{text Right}-value. The order is reversed,
  however, if the @{text Right}-value can match a longer string than the
  @{text Left}-value. In this way the POSIX value is supposed to be the
  biggest value for a given string and regular expression.

  Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by Frisch
  and Cardelli from which they have taken the idea for their correctness
  proof.
  matching and they showed that their GREEDY matching algorithm always
  produces a maximal element according to this ordering (from all possible
  solutions). The only difference between their GREEDY ordering and the
  ``ordering'' by Sulzmann and Lu is that GREEDY always prefers a @{text
  Left}-value over a @{text Right}-value, no matter what the underlying
  string is. This seems to be only a very minor difference, but it has
  drastic consequences in terms of what properties both orderings enjoy.
  What is interesting for our purposes is that the properties reflexivity,
  totality and transitivity for this GREEDY ordering can be proved
  relatively easily by induction.
*}
*)

section {* Conclusion *}

text {*

  We have implemented the POSIX value calculation algorithm introduced by
  Sulzmann and Lu
  \cite{Sulzmann2014}. Our implementation is nearly identical to the
  original and all modifications we introduced are harmless (like our char-clause for
  @{text inj}). We have proved this algorithm to be correct, but correct
  according to our own specification of what POSIX values are. Our
  specification (inspired by the work of Vansummeren \cite{Vansummeren2006}) appears to be
  much simpler than in \cite{Sulzmann2014} and our proofs are nearly always
  straightforward. We have attempted to formalise the original proof
  by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
  unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
  already acknowledge some small problems, but our experience suggests
  that there are more serious problems. 
  
  Having proved the correctness of the POSIX lexing algorithm in
  \cite{Sulzmann2014}, what lessons have we learned? Well, this is a
  perfect example of the importance of the \emph{right} definitions. We
  have (on and off) explored mechanisations as soon as the first versions
  of \cite{Sulzmann2014} appeared, but have made little progress with
  turning the relatively detailed proof sketch in \cite{Sulzmann2014} into a
  formalisable proof. Seeing \cite{Vansummeren2006} and adapting the
  POSIX definition given there to the algorithm by Sulzmann and Lu made all
  the difference: the proofs, as said, are nearly straightforward. The
  question remains whether the original proof idea of \cite{Sulzmann2014},
  potentially using our result as a stepping stone, can be made to work.
  Alas, we really do not know, despite considerable effort.


  Closely related to our work is an automata-based lexer formalised by
  Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
  initial substrings, but Nipkow's algorithm is not completely
  computational. The algorithm by Sulzmann and Lu, in contrast, can be
  implemented with ease in any functional language. A bespoke lexer for the
  Imp-language is formalised in Coq as part of the Software Foundations book
  by Pierce et al.\ \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
  do not generalise easily to more advanced features.
  Our formalisation is available from the Archive of Formal Proofs \cite{aduAFP16}
  under \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip

 \noindent
  {\bf Acknowledgements:}
  We are very grateful to Martin Sulzmann for his comments on our work and 
  moreover for patiently explaining to us the details in \cite{Sulzmann2014}. We
  also received very helpful comments from James Cheney and anonymous referees.
  %  \small
  \bibliographystyle{plain}
  \bibliography{root}

*}

(*<*)
end
(*>*)