(* TODO: add an overview of the finiteness proof; Georg commented "not helpful"; add more intuition about "closed forms". *)
(*<*)
theory PaperExt
imports
(*"../LexerExt"*)
(*"../PositionsExt"*)
"~~/src/HOL/Library/LaTeXsugar"
begin
(*>*)
(*
declare [[show_question_marks = false]]
declare [[eta_contract = false]]
abbreviation
"der_syn r c \<equiv> der c r"
abbreviation
"ders_syn r s \<equiv> ders s r"
notation (latex output)
If ("(\<^latex>\<open>\\textrm{\<close>if\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>then\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>else\<^latex>\<open>}\<close> (_))" 10) and
Cons ("_\<^latex>\<open>\\mbox{$\\,$}\<close>::\<^latex>\<open>\\mbox{$\\,$}\<close>_" [75,73] 73) and
ZERO ("\<^bold>0" 81) and
ONE ("\<^bold>1" 81) and
CHAR ("_" [1000] 80) and
ALT ("_ + _" [77,77] 78) and
SEQ ("_ \<cdot> _" [77,77] 78) and
STAR ("_\<^sup>\<star>" [78] 78) and
NTIMES ("_\<^bsup>'{_'}\<^esup>" [78, 50] 80) and
FROMNTIMES ("_\<^bsup>'{_..'}\<^esup>" [78, 50] 80) and
UPNTIMES ("_\<^bsup>'{.._'}\<^esup>" [78, 50] 80) and
NMTIMES ("_\<^bsup>'{_.._'}\<^esup>" [78, 50,50] 80) and
val.Void ("Empty" 78) and
val.Char ("Char _" [1000] 78) and
val.Left ("Left _" [79] 78) and
val.Right ("Right _" [1000] 78) and
val.Seq ("Seq _ _" [79,79] 78) and
val.Stars ("Stars _" [1000] 78) and
L ("L'(_')" [10] 78) and
LV ("LV _ _" [80,73] 78) and
der_syn ("_\\_" [79, 1000] 76) and
ders_syn ("_\\_" [79, 1000] 76) and
flat ("|_|" [75] 74) and
flats ("|_|" [72] 74) and
Sequ ("_ @ _" [78,77] 63) and
injval ("inj _ _ _" [81,77,79] 76) and
mkeps ("mkeps _" [79] 76) and
length ("len _" [73] 73) and
intlen ("len _" [73] 73) and
set ("_" [73] 73) and
Prf ("_ : _" [75,75] 75) and
Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
lexer ("lexer _ _" [78,78] 77) and
DUMMY ("\<^latex>\<open>\\underline{\\hspace{2mm}}\<close>")
*)
(*
section {* Extensions*}
text {*
A strong point in favour of
Sulzmann and Lu's algorithm is that it can be extended in various
ways. If we are interested in tokenising a string, then we need not just
to split up the string into tokens, but also to ``classify'' the tokens (for
example as keywords or identifiers). This can be done with
only minor modifications to the algorithm by introducing \emph{record
regular expressions} and \emph{record values} (for example
\cite{Sulzmann2014b}):
\begin{center}
@{text "r :="}
@{text "..."} $\mid$
@{text "(l : r)"} \qquad\qquad
@{text "v :="}
@{text "..."} $\mid$
@{text "(l : v)"}
\end{center}
\noindent where @{text l} is a label, say a string, @{text r} a regular
expression and @{text v} a value. All functions can be smoothly extended
to these regular expressions and values. For example \mbox{@{text "(l :
r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
regular expression is to mark certain parts of a regular expression and
then record in the calculated value which parts of the string were matched
by this part. The label can then serve as classification for the tokens.
For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
keywords and identifiers from the Introduction. With the record regular
expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
and then traverse the calculated value and only collect the underlying
strings in record values. With this we obtain finite sequences of pairs of
labels and strings, for example
\[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
\noindent from which tokens with classifications (keyword-token,
identifier-token and so on) can be extracted.
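\noindent
As a small illustration (the strings here are invented for this example),
suppose @{text "r\<^bsub>key\<^esub>"} matches the keyword @{text "if"} and
@{text "r\<^bsub>id\<^esub>"} matches the identifier @{text "x"}. Lexing
might then calculate a value of the shape

\[
\textit{Stars}\,[\textit{Left}\,(\textit{key} : v_1),\;\textit{Right}\,(\textit{id} : v_2)]
\]

\noindent with $|v_1| = \textit{if}$ and $|v_2| = \textit{x}$; traversing
this value and collecting the strings inside the record values yields the
pairs $(\textit{key} : \textit{if}), (\textit{id} : \textit{x})$.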
In the context of POSIX matching, it is also interesting to study additional
constructors for bounded repetitions of regular expressions. For this let
us extend the results from the previous section to the following four
additional regular expression constructors:
\begin{center}
\begin{tabular}{lcrl@ {\hspace{12mm}}l}
@{text r} & @{text ":="} & $\ldots\mid$ & @{term "NTIMES r n"} & exactly-@{text n}-times\\
& & $\mid$ & @{term "UPNTIMES r n"} & upto-@{text n}-times\\
& & $\mid$ & @{term "FROMNTIMES r n"} & from-@{text n}-times\\
& & $\mid$ & @{term "NMTIMES r n m"} & between-@{text nm}-times\\
\end{tabular}
\end{center}
\noindent
We will call them \emph{bounded regular expressions}.
With the help of the power operator (definition omitted) for sets of strings, the languages
recognised by these regular expressions can be defined in Isabelle as follows:
\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) L.simps(8)} & $\dn$ & @{thm (rhs) L.simps(8)}\\
@{thm (lhs) L.simps(7)} & $\dn$ & @{thm (rhs) L.simps(7)}\\
@{thm (lhs) L.simps(9)} & $\dn$ & @{thm (rhs) L.simps(9)}\\
@{thm (lhs) L.simps(10)} & $\dn$ & @{thm (rhs) L.simps(10)}\\
\end{tabular}
\end{center}
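\noindent
For reference, the power operator is presumably the standard iterated
concatenation of languages; this sketch is our assumption of the omitted
definition, written with the @{text "@"}-operator for language
concatenation used elsewhere in this paper:

\[
A^{0} \;\dn\; \{[]\}
\qquad\qquad
A^{n+1} \;\dn\; A \;@\; A^{n}
\]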
\noindent The definition of @{text L} implies that in the last clause @{term
"NMTIMES r n m"} matches no string in case @{term "m < n"}, because
then the interval @{term "{n..m}"} is empty. While the languages
recognised by these regular expressions are straightforward, some
care is needed in defining the corresponding lexical
values. First, with a slight abuse of language, we will (re)use
values of the form @{term "Stars vs"} for the values inhabiting
bounded regular expressions. Second, we need to introduce inductive
rules that extend our inhabitation relation shown on
Page~\ref{prfintros}, from which we then derived our notion of
lexical values. Given the rule for @{term "STAR r"}, the rule for
@{term "UPNTIMES r n"} additionally requires that the length of
the list of values is less than or equal to @{term n}:
\begin{center}
@{thm[mode=Rule] Prf.intros(7)[of "vs"]}
\end{center}
\noindent As in the @{term "STAR r"}-rule, we require with the
left premise that some non-empty part of the string is `chipped'
away by \emph{every} value in @{text vs}; that is, the corresponding
values do not flatten to the empty string. In the rule for @{term
"NTIMES r n"} (that is exactly-@{term n}-times @{text r}) we clearly need to require
that the length of the list of values equals @{text n}. But enforcing
that each of these @{term n} values `chips' away some part of the string
would be too strong, because if @{term r} is nullable, some of the values
may have to flatten to the empty string. The rule therefore splits the
list into @{text "vs\<^sub>1"} and @{text "vs\<^sub>2"} and requires
non-empty flattenings only for the former:
\begin{center}
@{thm[mode=Rule] Prf.intros(8)[of "vs\<^sub>1" r "vs\<^sub>2"]}
\end{center}
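\noindent
As a small, invented example of why this weaker requirement is needed:
for $r \dn a + \mathbf{1}$ the string $a$ is in $L(r^{\{2\}})$, but the
only way to inhabit this regular expression with exactly two values is
with a list such as

\[
\textit{Stars}\,[\textit{Left}\,(\textit{Char}\;a),\;\textit{Right}\,\textit{Empty}]
\]

\noindent in which the second value flattens to the empty string.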
\noindent
The derivative function is extended to the bounded regular expressions
with the following clauses:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) der.simps(8)} & $\dn$ & @{thm (rhs) der.simps(8)}\\
@{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}\\
@{thm (lhs) der.simps(9)} & $\dn$ & @{thm (rhs) der.simps(9)}\\
@{thm (lhs) der.simps(10)} & $\dn$ & @{thm (rhs) der.simps(10)}\\
\end{tabular}
\end{center}
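\noindent
For instance, assuming the clauses follow the pattern shown above, taking
the derivative of $a^{\{2\}}$ with respect to the letter $a$ unfolds one
copy of the repetition and decrements the counter:

\[
a^{\{2\}}\backslash a \;=\; (a\backslash a)\cdot a^{\{1\}} \;=\; \mathbf{1}\cdot a^{\{1\}}
\]

\noindent This matches exactly one further $a$, as expected.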
\noindent
Similarly, the @{text mkeps} function obtains clauses for the bounded
regular expressions:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
@{thm (lhs) mkeps.simps(6)} & $\dn$ & @{thm (rhs) mkeps.simps(6)}\\
@{thm (lhs) mkeps.simps(7)} & $\dn$ & @{thm (rhs) mkeps.simps(7)}\\
@{thm (lhs) mkeps.simps(8)} & $\dn$ & @{thm (rhs) mkeps.simps(8)}\\
\end{tabular}
\end{center}
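\noindent
Since the inhabitation rule for @{term "NTIMES r n"} requires exactly
@{term n} values, one expects @{text mkeps} to return a @{text Stars}-value
with @{term n} copies of the @{text mkeps}-value of the body (this
instance is our reading of the clauses above, not a verified statement):

\[
\textit{mkeps}\;(r^{\{2\}}) \;=\; \textit{Stars}\,[\textit{mkeps}\;r,\;\textit{mkeps}\;r]
\]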
\noindent
The injection function @{text injval} obtains the following additional
clauses:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) injval.simps(8)} & $\dn$ & @{thm (rhs) injval.simps(8)}\\
@{thm (lhs) injval.simps(9)} & $\dn$ & @{thm (rhs) injval.simps(9)}\\
@{thm (lhs) injval.simps(10)} & $\dn$ & @{thm (rhs) injval.simps(10)}\\
@{thm (lhs) injval.simps(11)} & $\dn$ & @{thm (rhs) injval.simps(11)}\\
\end{tabular}
\end{center}
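\noindent
These clauses mirror the @{term "STAR r"}-case of @{text injval}: on our
reading (again an illustration, not a verified statement), injecting a
character back into a value of a bounded regular expression prepends the
repaired value to the list of values, for instance

\[
\textit{inj}\;(r^{\{n\}})\;c\;(\textit{Seq}\;v\;(\textit{Stars}\;vs)) \;=\;
\textit{Stars}\;(\textit{inj}\;r\;c\;v :: vs)
\]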
\noindent
The POSIX relation is extended with the following rules for the bounded
regular expressions:

\begin{center}
@{thm [mode=Rule] Posix_NTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_NTIMES2}\medskip\\
@{thm [mode=Rule] Posix_UPNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_UPNTIMES2}\medskip\\
@{thm [mode=Rule] Posix_FROMNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_FROMNTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_FROMNTIMES2}\medskip\\
@{thm [mode=Rule] Posix_NMTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_NMTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_NMTIMES2}
\end{center}
@{term "\<Sum> i \<in> {m..n} . P i"} @{term "\<Sum> i \<in> {..n} . P i"}
@{term "\<Union> i \<in> {m..n} . P i"} @{term "\<Union> i \<in> {..n} . P i"}
@{term "\<Union> i \<in> {0::nat..n} . P i"}
*}
section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}
text {*
% \newcommand{\greedy}{\succcurlyeq_{gr}}
\newcommand{\posix}{>}
An extended version of \cite{Sulzmann2014} is available at the website of
its first author; this includes some ``proofs'', claimed in
\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
final form, we make no comment thereon, preferring to give general reasons
for our belief that the approach of \cite{Sulzmann2014} is problematic.
Their central definition is an ``ordering relation'' defined by the
rules (slightly adapted to fit our notation):
??
\noindent The idea behind the rules (A1) and (A2), for example, is that a
@{text Left}-value is bigger than a @{text Right}-value if the underlying
string of the @{text Left}-value is at least as long as the
underlying string of the @{text Right}-value. The order is reversed,
however, if the @{text Right}-value can match a longer string than the
@{text Left}-value. In this way the POSIX value is supposed to be the
biggest value for a given string and regular expression.
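\noindent
Based on this description only (we do not reproduce Sulzmann and Lu's
rules verbatim here), (A1) and (A2) have roughly the following shape,
where $v_1 \posix_{r} v_2$ denotes their ordering at the regular
expression $r$:

\[
\frac{\textit{len}\,|v_2| \leq \textit{len}\,|v_1|}
     {\textit{Left}\,v_1 \posix_{r_1 + r_2} \textit{Right}\,v_2}\;(A1)
\qquad
\frac{\textit{len}\,|v_1| < \textit{len}\,|v_2|}
     {\textit{Right}\,v_2 \posix_{r_1 + r_2} \textit{Left}\,v_1}\;(A2)
\]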
Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by Frisch
and Cardelli from which they took the idea for their correctness
proof. Frisch and Cardelli introduced a similar ordering for GREEDY
matching and they showed that their GREEDY matching algorithm always
produces a maximal element according to this ordering (from all possible
solutions). The only difference between their GREEDY ordering and the
``ordering'' by Sulzmann and Lu is that GREEDY always prefers a @{text
Left}-value over a @{text Right}-value, no matter what the underlying
string is. This seems to be only a very minor difference, but it has
drastic consequences in terms of what properties both orderings enjoy.
What is interesting for our purposes is that the properties reflexivity,
totality and transitivity for this GREEDY ordering can be proved
relatively easily by induction.
*}
*)
section {* Conclusion *}
text {*
We have implemented the POSIX value calculation algorithm introduced by
Sulzmann and Lu
\cite{Sulzmann2014}. Our implementation is nearly identical to the
original and all modifications we introduced are harmless (like our char-clause for
@{text inj}). We have proved this algorithm to be correct, but correct
according to our own specification of what POSIX values are. Our
specification (inspired by work of Vansummeren \cite{Vansummeren2006}) appears to be
much simpler than the one in \cite{Sulzmann2014} and our proofs are nearly always
straightforward. We have attempted to formalise the original proof
by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
already acknowledge some small problems, but our experience suggests
that there are more serious problems.
Having proved the correctness of the POSIX lexing algorithm in
\cite{Sulzmann2014}, what lessons have we learned? Well, this is a
perfect example for the importance of the \emph{right} definitions. We
have (on and off) explored mechanisations as soon as first versions
of \cite{Sulzmann2014} appeared, but have made little progress with
turning the relatively detailed proof sketch in \cite{Sulzmann2014} into a
formalisable proof. Having seen \cite{Vansummeren2006} and adapted the
POSIX definition given there for the algorithm by Sulzmann and Lu made all
the difference: the proofs, as said, are nearly straightforward. The
question remains whether the original proof idea of \cite{Sulzmann2014},
potentially using our result as a stepping stone, can be made to work.
Alas, we really do not know, despite considerable effort.
Closely related to our work is an automata-based lexer formalised by
Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
initial substrings, but Nipkow's algorithm is not completely
computational. The algorithm by Sulzmann and Lu, in contrast, can be
implemented with ease in any functional language. A bespoke lexer for the
Imp-language is formalised in Coq as part of the Software Foundations book
by Pierce et al.\ \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
do not generalise easily to more advanced features.
Our formalisation is available from the Archive of Formal Proofs \cite{aduAFP16}
under \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip
\noindent
{\bf Acknowledgements:}
We are very grateful to Martin Sulzmann for his comments on our work and
moreover for patiently explaining to us the details in \cite{Sulzmann2014}. We
also received very helpful comments from James Cheney and anonymous referees.
% \small
\bibliographystyle{plain}
\bibliography{root}
*}
(*<*)
end
(*>*)