thys2/Journal/PaperExt.thy
changeset 369:e00950ba4514
       
(*<*)
theory PaperExt
imports 
   (*"../LexerExt"*)
   (*"../PositionsExt"*)
   "~~/src/HOL/Library/LaTeXsugar"
begin
(*>*)
       

(*
declare [[show_question_marks = false]]
declare [[eta_contract = false]]

abbreviation 
  "der_syn r c \<equiv> der c r"

abbreviation 
  "ders_syn r s \<equiv> ders s r"


notation (latex output)
  If  ("(\<^latex>\<open>\\textrm{\<close>if\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>then\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>else\<^latex>\<open>}\<close> (_))" 10) and
  Cons ("_\<^latex>\<open>\\mbox{$\\,$}\<close>::\<^latex>\<open>\\mbox{$\\,$}\<close>_" [75,73] 73) and

  ZERO ("\<^bold>0" 81) and 
  ONE ("\<^bold>1" 81) and 
  CHAR ("_" [1000] 80) and
  ALT ("_ + _" [77,77] 78) and
  SEQ ("_ \<cdot> _" [77,77] 78) and
  STAR ("_\<^sup>\<star>" [78] 78) and
  NTIMES ("_\<^bsup>'{_'}\<^esup>" [78, 50] 80) and
  FROMNTIMES ("_\<^bsup>'{_..'}\<^esup>" [78, 50] 80) and
  UPNTIMES ("_\<^bsup>'{.._'}\<^esup>" [78, 50] 80) and
  NMTIMES ("_\<^bsup>'{_.._'}\<^esup>" [78, 50,50] 80) and

  val.Void ("Empty" 78) and
  val.Char ("Char _" [1000] 78) and
  val.Left ("Left _" [79] 78) and
  val.Right ("Right _" [1000] 78) and
  val.Seq ("Seq _ _" [79,79] 78) and
  val.Stars ("Stars _" [1000] 78) and

  L ("L'(_')" [10] 78) and
  LV ("LV _ _" [80,73] 78) and
  der_syn ("_\\_" [79, 1000] 76) and
  ders_syn ("_\\_" [79, 1000] 76) and
  flat ("|_|" [75] 74) and
  flats ("|_|" [72] 74) and
  Sequ ("_ @ _" [78,77] 63) and
  injval ("inj _ _ _" [81,77,79] 76) and 
  mkeps ("mkeps _" [79] 76) and 
  length ("len _" [73] 73) and
  intlen ("len _" [73] 73) and
  set ("_" [73] 73) and

  Prf ("_ : _" [75,75] 75) and
  Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and

  lexer ("lexer _ _" [78,78] 77) and
  DUMMY ("\<^latex>\<open>\\underline{\\hspace{2mm}}\<close>")
*)
(*>*)
       
(*
section {* Extensions *}

text {*
       

  A strong point in favour of Sulzmann and Lu's algorithm is that it can be
  extended in various ways. If we are interested in tokenising a string,
  then we need not just to split up the string into tokens, but also to
  ``classify'' the tokens (for example, whether a token is a keyword or an
  identifier). This can be done with only minor modifications to the
  algorithm by introducing \emph{record regular expressions} and
  \emph{record values} (see for example \cite{Sulzmann2014b}):
       

  \begin{center}  
  @{text "r :="}
  @{text "..."} $\mid$
  @{text "(l : r)"} \qquad\qquad
  @{text "v :="}
  @{text "..."} $\mid$
  @{text "(l : v)"}
  \end{center}
       

  \noindent where @{text l} is a label, say a string, @{text r} a regular
  expression and @{text v} a value. All functions can be smoothly extended
  to these regular expressions and values. For example \mbox{@{text "(l :
  r)"}} is nullable iff @{term r} is, and so on. The purpose of a record
  regular expression is to mark certain parts of a regular expression and
  then to record in the calculated value which parts of the string were
  matched by them. The labels can then serve as a classification of the
  tokens. For this, recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
  keywords and identifiers from the Introduction. With record regular
  expressions we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
  and then traverse the calculated value, collecting only the underlying
  strings in record values. With this we obtain finite sequences of pairs of
  labels and strings, for example

  \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
  
  \noindent from which tokens with classifications (keyword-token,
  identifier-token and so on) can be extracted.
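
  To illustrate, here is a minimal sketch (our own reconstruction, not the
  definitions of \cite{Sulzmann2014b}) of how the lexing functions could be
  extended to record regular expressions; every clause simply passes the
  label through:

  \begin{center}
  \begin{tabular}{lcl}
  $\textit{nullable}\,(l : r)$ & $\dn$ & $\textit{nullable}\,r$\\
  $(l : r)\backslash c$ & $\dn$ & $(l : r\backslash c)$\\
  $\textit{mkeps}\,(l : r)$ & $\dn$ & $(l : \textit{mkeps}\,r)$\\
  $\textit{inj}\,(l : r)\,c\,(l : v)$ & $\dn$ & $(l : \textit{inj}\,r\,c\,v)$\\
  \end{tabular}
  \end{center}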
       

  In the context of POSIX matching, it is also interesting to study
  additional constructors for bounded repetitions of regular expressions.
  For this let us extend the results from the previous section to the
  following four additional regular expression constructors:
       

  \begin{center}
  \begin{tabular}{lcrl@ {\hspace{12mm}}l}
  @{text r} & @{text ":="} & $\ldots\mid$ & @{term "NTIMES r n"} & exactly-@{text n}-times\\
            &              & $\mid$   & @{term "UPNTIMES r n"} & upto-@{text n}-times\\
            &              & $\mid$   & @{term "FROMNTIMES r n"} & from-@{text n}-times\\
            &              & $\mid$   & @{term "NMTIMES r n m"} & between-@{text n}-and-@{text m}-times\\
  \end{tabular}
  \end{center}
       

  \noindent
  We will call them \emph{bounded regular expressions}.
  With the help of the power operator (definition omitted) for sets of
  strings, the languages recognised by these regular expressions can be
  defined in Isabelle as follows:
       

  \begin{center}
  \begin{tabular}{lcl} 
  @{thm (lhs) L.simps(8)} & $\dn$ & @{thm (rhs) L.simps(8)}\\
  @{thm (lhs) L.simps(7)} & $\dn$ & @{thm (rhs) L.simps(7)}\\
  @{thm (lhs) L.simps(9)} & $\dn$ & @{thm (rhs) L.simps(9)}\\
  @{thm (lhs) L.simps(10)} & $\dn$ & @{thm (rhs) L.simps(10)}\\
  \end{tabular}
  \end{center}
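
  \noindent For reference, a minimal sketch of the omitted power operator,
  defined recursively on sets of strings (our notation, consistent with the
  clauses displayed above):

  \[
  A^{0} \;\dn\; \{[]\} \qquad\qquad A^{n+1} \;\dn\; A \,@\, A^{n}
  \]

  \noindent so that, for instance, $L(r^{\{n\}}) = (L\,r)^{n}$ and
  $L(r^{\{..n\}}) = \bigcup_{i \,\le\, n}\,(L\,r)^{i}$.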
       

  \noindent This definition implies that in the last clause @{term
  "NMTIMES r n m"} matches no string in case @{term "m < n"}, because
  then the interval @{term "{n..m}"} is empty.  While the languages
  recognised by these regular expressions are straightforward to define,
  some care is needed in defining the corresponding lexical
  values. First, with a slight abuse of language, we will (re)use
  values of the form @{term "Stars vs"} for the values inhabiting
  bounded regular expressions. Second, we need to introduce inductive
  rules for extending our inhabitation relation shown on
  Page~\ref{prfintros}, from which we then derived our notion of
  lexical values. Given the rule for @{term "STAR r"}, the rule for
  @{term "UPNTIMES r n"} additionally requires that the length of
  the list of values is at most @{term n}:
       

  \begin{center}
  @{thm[mode=Rule] Prf.intros(7)[of "vs"]}
  \end{center}
       

  \noindent As in the @{term "STAR r"}-rule, we require with the
  left premise that some non-empty part of the string is `chipped'
  away by \emph{every} value in @{text vs}, that is, the corresponding
  values do not flatten to the empty string. In the rule for @{term
  "NTIMES r n"} (that is exactly-@{term n}-times @{text r}) we clearly need
  to require that the length of the list of values equals @{text n}. But
  enforcing that each of these @{term n} values `chips' away some part of
  a string would be too strong. Instead, the rule below splits the list
  into a part @{text "vs\<^sub>1"}, whose values flatten to non-empty strings,
  and a part @{text "vs\<^sub>2"}, whose values flatten to the empty string,
  and requires their combined length to be @{text n}:
       

  \begin{center}
  @{thm[mode=Rule] Prf.intros(8)[of "vs\<^sub>1" r "vs\<^sub>2"]}
  \end{center}
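
  \noindent A worked instance of this rule (our own example): for
  @{text "r = a + \<^bold>1"} and @{text "n = 3"}, the string @{text "aa"} is
  inhabited via

  \[
  \textit{Stars}\,[\textit{Left}(\textit{Char}\,a),\,\textit{Left}(\textit{Char}\,a),\,\textit{Right}(\textit{Empty})]
  \]

  \noindent where the first two values flatten to non-empty strings (the
  part @{text "vs\<^sub>1"}) and the last flattens to the empty string (the part
  @{text "vs\<^sub>2"}), with combined length three.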
       

  \noindent The derivative function extends to the bounded regular
  expressions as follows:

  \begin{center}
  \begin{tabular}{lcl} 
  @{thm (lhs) der.simps(8)} & $\dn$ & @{thm (rhs) der.simps(8)}\\
  @{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}\\
  @{thm (lhs) der.simps(9)} & $\dn$ & @{thm (rhs) der.simps(9)}\\
  @{thm (lhs) der.simps(10)} & $\dn$ & @{thm (rhs) der.simps(10)}\\
  \end{tabular}
  \end{center}
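
  \noindent Spelled out, one standard way of defining these derivatives (a
  sketch of our own; the antiquotations above display the actual Isabelle
  clauses):

  \begin{center}
  \begin{tabular}{lcl}
  $(r^{\{n\}})\backslash c$ & $\dn$ & $\textrm{if}\;n = 0\;\textrm{then}\;\mathbf{0}\;\textrm{else}\;(r\backslash c) \cdot r^{\{n-1\}}$\\
  $(r^{\{..n\}})\backslash c$ & $\dn$ & $\textrm{if}\;n = 0\;\textrm{then}\;\mathbf{0}\;\textrm{else}\;(r\backslash c) \cdot r^{\{..n-1\}}$\\
  $(r^{\{n..\}})\backslash c$ & $\dn$ & $\textrm{if}\;n = 0\;\textrm{then}\;(r\backslash c) \cdot r^{\star}\;\textrm{else}\;(r\backslash c) \cdot r^{\{n-1..\}}$\\
  $(r^{\{n..m\}})\backslash c$ & $\dn$ & $\textrm{if}\;m = 0\;\textrm{then}\;\mathbf{0}\;\textrm{else}$\\
  & & $\quad\textrm{if}\;n = 0\;\textrm{then}\;(r\backslash c) \cdot r^{\{..m-1\}}\;\textrm{else}\;(r\backslash c) \cdot r^{\{n-1..m-1\}}$\\
  \end{tabular}
  \end{center}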
       

  \noindent The @{text mkeps} function is extended as follows:

  \begin{center}
  \begin{tabular}{lcl}
  @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
  @{thm (lhs) mkeps.simps(6)} & $\dn$ & @{thm (rhs) mkeps.simps(6)}\\
  @{thm (lhs) mkeps.simps(7)} & $\dn$ & @{thm (rhs) mkeps.simps(7)}\\
  @{thm (lhs) mkeps.simps(8)} & $\dn$ & @{thm (rhs) mkeps.simps(8)}\\
  \end{tabular}
  \end{center}
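
  \noindent As a sketch (our reconstruction of the clauses hidden behind
  the antiquotations above): for @{term "NTIMES r n"}, @{term "FROMNTIMES r n"}
  and @{term "NMTIMES r n m"} one produces @{text n} copies of
  @{text "mkeps r"}, while for @{term "UPNTIMES r n"} zero copies suffice:

  \[
  \textit{mkeps}\,(r^{\{n\}}) \;\dn\;
  \textit{Stars}\,[\underbrace{\textit{mkeps}\,r,\ldots,\textit{mkeps}\,r}_{n}]
  \qquad\quad
  \textit{mkeps}\,(r^{\{..n\}}) \;\dn\; \textit{Stars}\,[]
  \]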
       

  \noindent Similarly, the injection function @{text inj} is extended:

  \begin{center}
  \begin{tabular}{lcl}
  @{thm (lhs) injval.simps(8)} & $\dn$ & @{thm (rhs) injval.simps(8)}\\
  @{thm (lhs) injval.simps(9)} & $\dn$ & @{thm (rhs) injval.simps(9)}\\
  @{thm (lhs) injval.simps(10)} & $\dn$ & @{thm (rhs) injval.simps(10)}\\
  @{thm (lhs) injval.simps(11)} & $\dn$ & @{thm (rhs) injval.simps(11)}\\
  \end{tabular}
  \end{center}
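
  \noindent A sketch of the expected shape of these clauses (our
  reconstruction): each follows the pattern of the @{term "STAR r"}-clause,
  prepending the injected value to the star-list, for example

  \[
  \textit{inj}\,(r^{\{n\}})\,c\,(\textit{Seq}\,v\,(\textit{Stars}\,vs))
  \;\dn\; \textit{Stars}\,((\textit{inj}\,r\,c\,v)::vs)
  \]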
       

  \noindent Finally, the POSIX value relation is extended by the following
  rules for the bounded regular expressions:

  @{thm [mode=Rule] Posix_NTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_NTIMES2}

  @{thm [mode=Rule] Posix_UPNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_UPNTIMES2}

  @{thm [mode=Rule] Posix_FROMNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_FROMNTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_FROMNTIMES2}

  @{thm [mode=Rule] Posix_NMTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_NMTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}

  @{thm [mode=Rule] Posix_NMTIMES2}
       
       
*}


section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}
       

text {*
%  \newcommand{\greedy}{\succcurlyeq_{gr}}
  \newcommand{\posix}{>}
       

  An extended version of \cite{Sulzmann2014} is available at the website of
  its first author; this includes some ``proofs'', claimed in
  \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
  final form, we make no comment thereon, preferring to give general reasons
  for our belief that the approach of \cite{Sulzmann2014} is problematic.
  Their central definition is an ``ordering relation'' defined by the
  rules (slightly adapted to fit our notation):

  ??
       

  \noindent The idea behind the rules (A1) and (A2), for example, is that a
  @{text Left}-value is bigger than a @{text Right}-value if the underlying
  string of the @{text Left}-value is at least as long as the underlying
  string of the @{text Right}-value. The order is reversed, however, if the
  @{text Right}-value can match a longer string than the @{text Left}-value.
  In this way the POSIX value is supposed to be the biggest value for a
  given string and regular expression.

  Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by Frisch
  and Cardelli from which they took the idea for their correctness
  proof. Frisch and Cardelli introduced a similar ordering for GREEDY
  matching and showed that their GREEDY matching algorithm always
  produces a maximal element according to this ordering (from all possible
  solutions). The only difference between their GREEDY ordering and the
  ``ordering'' by Sulzmann and Lu is that GREEDY always prefers a @{text
  Left}-value over a @{text Right}-value, no matter what the underlying
  string is. This seems to be only a very minor difference, but it has
  drastic consequences in terms of what properties both orderings enjoy.
  What is interesting for our purposes is that reflexivity, totality and
  transitivity of the GREEDY ordering can be proved relatively easily by
  induction.
*}
*)
       

section {* Conclusion *}
       

text {*

  We have implemented the POSIX value calculation algorithm introduced by
  Sulzmann and Lu \cite{Sulzmann2014}. Our implementation is nearly
  identical to the original and all modifications we introduced are
  harmless (like our char-clause for @{text inj}). We have proved this
  algorithm to be correct, but correct according to our own specification
  of what POSIX values are. Our specification (inspired by work of
  Vansummeren \cite{Vansummeren2006}) appears to be much simpler than that
  in \cite{Sulzmann2014} and our proofs are nearly always straightforward.
  We have attempted to formalise the original proof by Sulzmann and Lu
  \cite{Sulzmann2014}, but we believe it contains unfillable gaps. In the
  online version of \cite{Sulzmann2014}, the authors already acknowledge
  some small problems, but our experience suggests that there are more
  serious problems.
       
  
  Having proved the correctness of the POSIX lexing algorithm in
  \cite{Sulzmann2014}, what lessons have we learned? Well, this is a
  perfect example of the importance of the \emph{right} definitions. We
  have (on and off) explored mechanisations as soon as first versions
  of \cite{Sulzmann2014} appeared, but made little progress with
  turning the relatively detailed proof sketch in \cite{Sulzmann2014} into a
  formalisable proof. Having seen \cite{Vansummeren2006} and adapted the
  POSIX definition given there for the algorithm by Sulzmann and Lu made all
  the difference: the proofs, as said, are nearly straightforward. The
  question remains whether the original proof idea of \cite{Sulzmann2014},
  potentially using our result as a stepping stone, can be made to work.
  Alas, despite considerable effort, we really do not know.
       

  Closely related to our work is an automata-based lexer formalised by
  Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
  initial substrings, but Nipkow's algorithm is not completely
  computational. The algorithm by Sulzmann and Lu, in contrast, can be
  implemented with ease in any functional language. A bespoke lexer for the
  Imp-language is formalised in Coq as part of the Software Foundations book
  by Pierce et al.\ \cite{Pierce2015}. The disadvantage of such bespoke
  lexers is that they do not generalise easily to more advanced features.
  Our formalisation is available from the Archive of Formal Proofs
  \cite{aduAFP16} under
  \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip
       

  \noindent
  {\bf Acknowledgements:}
  We are very grateful to Martin Sulzmann for his comments on our work and
  for patiently explaining to us the details of \cite{Sulzmann2014}. We
  also received very helpful comments from James Cheney and the anonymous
  referees.
  %  \small
  \bibliographystyle{plain}
  \bibliography{root}

*}
       

(*<*)
end
(*>*)