(*<*)
theory PaperExt
imports
(*"../LexerExt"*)
(*"../PositionsExt"*)
"~~/src/HOL/Library/LaTeXsugar"
begin
(*>*)

(*
declare [[show_question_marks = false]]
declare [[eta_contract = false]]

abbreviation
 "der_syn r c \<equiv> der c r"

abbreviation
 "ders_syn r s \<equiv> ders s r"


notation (latex output)
If  ("(\<^latex>\<open>\\textrm{\<close>if\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>then\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>else\<^latex>\<open>}\<close> (_))" 10) and
Cons ("_\<^latex>\<open>\\mbox{$\\,$}\<close>::\<^latex>\<open>\\mbox{$\\,$}\<close>_" [75,73] 73) and

ZERO ("\<^bold>0" 81) and
ONE ("\<^bold>1" 81) and
CHAR ("_" [1000] 80) and
ALT ("_ + _" [77,77] 78) and
SEQ ("_ \<cdot> _" [77,77] 78) and
STAR ("_\<^sup>\<star>" [78] 78) and
NTIMES ("_\<^bsup>'{_'}\<^esup>" [78, 50] 80) and
FROMNTIMES ("_\<^bsup>'{_..'}\<^esup>" [78, 50] 80) and
UPNTIMES ("_\<^bsup>'{.._'}\<^esup>" [78, 50] 80) and
NMTIMES ("_\<^bsup>'{_.._'}\<^esup>" [78, 50,50] 80) and

val.Void ("Empty" 78) and
val.Char ("Char _" [1000] 78) and
val.Left ("Left _" [79] 78) and
val.Right ("Right _" [1000] 78) and
val.Seq ("Seq _ _" [79,79] 78) and
val.Stars ("Stars _" [1000] 78) and

L ("L'(_')" [10] 78) and
LV ("LV _ _" [80,73] 78) and
der_syn ("_\\_" [79, 1000] 76) and
ders_syn ("_\\_" [79, 1000] 76) and
flat ("|_|" [75] 74) and
flats ("|_|" [72] 74) and
Sequ ("_ @ _" [78,77] 63) and
injval ("inj _ _ _" [81,77,79] 76) and
mkeps ("mkeps _" [79] 76) and
length ("len _" [73] 73) and
intlen ("len _" [73] 73) and
set ("_" [73] 73) and

Prf ("_ : _" [75,75] 75) and
Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and

lexer ("lexer _ _" [78,78] 77) and
DUMMY ("\<^latex>\<open>\\underline{\\hspace{2mm}}\<close>")
*)
(*>*)

(*
section {* Extensions *}

text {*

A strong point in favour of
Sulzmann and Lu's algorithm is that it can be extended in various
ways. If we are interested in tokenising a string, then we need to not just
split up the string into tokens, but also ``classify'' the tokens (for
example whether they are keywords or identifiers). This can be done with
only minor modifications to the algorithm by introducing \emph{record
regular expressions} and \emph{record values} (for example
\cite{Sulzmann2014b}):

\begin{center}
@{text "r :="}
@{text "..."} $\mid$
@{text "(l : r)"} \qquad\qquad
@{text "v :="}
@{text "..."} $\mid$
@{text "(l : v)"}
\end{center}

\noindent where @{text l} is a label, say a string, @{text r} a regular
expression and @{text v} a value. All functions can be smoothly extended
to these regular expressions and values. For example \mbox{@{text "(l :
r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
regular expression is to mark certain parts of a regular expression and
then record in the calculated value which part of the string was matched
by the marked subexpression. The label can then serve as a classification
of the tokens. For this, recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
keywords and identifiers from the Introduction. With record regular
expressions we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
and then traverse the calculated value, collecting only the strings
underlying record values. With this we obtain finite sequences of pairs of
labels and strings, for example

\[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]

\noindent from which tokens with classifications (keyword-token,
identifier-token and so on) can be extracted.

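\noindent To illustrate the extraction step, the following Haskell-style
sketch (entirely hypothetical code of ours, not Sulzmann and Lu's; the
names @{text Val}, @{text flat} and @{text tokens} are our own) collects
all labelled substrings from a value in a single traversal:

\begin{verbatim}
-- hypothetical sketch: values extended with record values (Rec)
data Val = Empty | CharV Char | LeftV Val | RightV Val
         | SeqV Val Val | Stars [Val] | Rec String Val

-- the underlying string of a value (written |v| in the paper)
flat :: Val -> String
flat Empty        = ""
flat (CharV c)    = [c]
flat (LeftV v)    = flat v
flat (RightV v)   = flat v
flat (SeqV v1 v2) = flat v1 ++ flat v2
flat (Stars vs)   = concatMap flat vs
flat (Rec _ v)    = flat v

-- collect the (label, string) pairs in left-to-right order
tokens :: Val -> [(String, String)]
tokens (LeftV v)    = tokens v
tokens (RightV v)   = tokens v
tokens (SeqV v1 v2) = tokens v1 ++ tokens v2
tokens (Stars vs)   = concatMap tokens vs
tokens (Rec l v)    = [(l, flat v)]
tokens _            = []

-- e.g. for a value of ((key : r_key) + (id : r_id))* one obtains
-- pairs such as [("key","if"), ("id","x"), ...]
\end{verbatim}
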

In the context of POSIX matching, it is also interesting to study additional
constructors for bounded repetitions of regular expressions. For this let
us extend the results from the previous section to the following four
additional regular expression constructors:

\begin{center}
\begin{tabular}{lcrl@ {\hspace{12mm}}l}
@{text r} & @{text ":="} & $\ldots\mid$ & @{term "NTIMES r n"} & exactly-@{text n}-times\\
          &              & $\mid$ & @{term "UPNTIMES r n"} & upto-@{text n}-times\\
          &              & $\mid$ & @{term "FROMNTIMES r n"} & from-@{text n}-times\\
          &              & $\mid$ & @{term "NMTIMES r n m"} & between-@{text n}-and-@{text m}-times\\
\end{tabular}
\end{center}

\noindent
We will call them \emph{bounded regular expressions}.
With the help of the power operator for sets of strings (the standard
definition is recalled below), the languages
recognised by these regular expressions can be defined in Isabelle as follows:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) L.simps(8)} & $\dn$ & @{thm (rhs) L.simps(8)}\\
@{thm (lhs) L.simps(7)} & $\dn$ & @{thm (rhs) L.simps(7)}\\
@{thm (lhs) L.simps(9)} & $\dn$ & @{thm (rhs) L.simps(9)}\\
@{thm (lhs) L.simps(10)} & $\dn$ & @{thm (rhs) L.simps(10)}\\
\end{tabular}
\end{center}
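
\noindent The power operator mentioned above is the standard iteration of
language concatenation:

\[
A^{0} \;\dn\; \{[]\} \qquad\qquad
A^{n+1} \;\dn\; A \,@\, A^{n}
\]

\noindent where @{text "@"} denotes the concatenation of two languages.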

\noindent The definition of the languages implies that in the last clause
@{term "NMTIMES r n m"} matches no string in case @{term "m < n"}, because
then the interval @{term "{n..m}"} is empty. While the languages
recognised by these regular expressions are straightforward, some
care is needed for how to define the corresponding lexical
values. First, with a slight abuse of language, we will (re)use
values of the form @{term "Stars vs"} for the values inhabiting
bounded regular expressions. Second, we need to introduce inductive
rules for extending our inhabitation relation shown on
Page~\ref{prfintros}, from which we then derived our notion of
lexical values. Given the rule for @{term "STAR r"}, the rule for
@{term "UPNTIMES r n"} just requires additionally that the length of
the list of values must be at most @{term n}:

\begin{center}
@{thm[mode=Rule] Prf.intros(7)[of "vs"]}
\end{center}

\noindent Like in the @{term "STAR r"}-rule, we require with the
left-premise that some non-empty part of the string is `chipped'
away by \emph{every} value in @{text vs}, that is, the corresponding
values do not flatten to the empty string. In the rule for @{term
"NTIMES r n"} (that is exactly-@{term n}-times @{text r}) we clearly need to require
that the length of the list of values equals @{text n}. But enforcing
that each of these @{term n} values `chips' away some part of a string
would be too strong. Therefore the rule splits the list into two parts:
the values in @{text "vs\<^sub>1"} flatten to non-empty strings, whereas the
values in @{text "vs\<^sub>2"} flatten to the empty string.

\begin{center}
@{thm[mode=Rule] Prf.intros(8)[of "vs\<^sub>1" r "vs\<^sub>2"]}
\end{center}

\noindent The derivative function is extended to the bounded regular
expressions as follows:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) der.simps(8)} & $\dn$ & @{thm (rhs) der.simps(8)}\\
@{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}\\
@{thm (lhs) der.simps(9)} & $\dn$ & @{thm (rhs) der.simps(9)}\\
@{thm (lhs) der.simps(10)} & $\dn$ & @{thm (rhs) der.simps(10)}\\
\end{tabular}
\end{center}
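
\noindent To convey the idea behind these clauses, the following sketch
(in the same hypothetical Haskell notation as above; this is our reading
of the standard textbook clauses, not a verbatim rendering of the Isabelle
equations displayed above) spells out nullability and derivatives for the
bounded constructors:

\begin{verbatim}
-- hypothetical sketch: regular expressions with bounded repetitions
data Rexp = Zero | One | Ch Char
          | Alt Rexp Rexp | Seq Rexp Rexp | Star Rexp
          | NTimes Rexp Int | UpNTimes Rexp Int
          | FromNTimes Rexp Int | NMTimes Rexp Int Int

nullable :: Rexp -> Bool
nullable Zero             = False
nullable One              = True
nullable (Ch _)           = False
nullable (Alt r1 r2)      = nullable r1 || nullable r2
nullable (Seq r1 r2)      = nullable r1 && nullable r2
nullable (Star _)         = True
nullable (NTimes r n)     = n == 0 || nullable r
nullable (UpNTimes _ _)   = True
nullable (FromNTimes r n) = n == 0 || nullable r
nullable (NMTimes r n m)  = n <= m && (n == 0 || nullable r)

der :: Char -> Rexp -> Rexp
der _ Zero        = Zero
der _ One         = Zero
der c (Ch d)      = if c == d then One else Zero
der c (Alt r1 r2) = Alt (der c r1) (der c r2)
der c (Seq r1 r2)
  | nullable r1   = Alt (Seq (der c r1) r2) (der c r2)
  | otherwise     = Seq (der c r1) r2
der c (Star r)    = Seq (der c r) (Star r)
-- bounded constructors: peel off one copy, decrement the counters
der c (NTimes r n)
  | n == 0        = Zero
  | otherwise     = Seq (der c r) (NTimes r (n - 1))
der c (UpNTimes r n)
  | n == 0        = Zero
  | otherwise     = Seq (der c r) (UpNTimes r (n - 1))
der c (FromNTimes r n)
  | n == 0        = Seq (der c r) (Star r)
  | otherwise     = Seq (der c r) (FromNTimes r (n - 1))
der c (NMTimes r n m)
  | m < n || m == 0 = Zero
  | n == 0        = Seq (der c r) (UpNTimes r (m - 1))
  | otherwise     = Seq (der c r) (NMTimes r (n - 1) (m - 1))
\end{verbatim}
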

\noindent Similarly the function @{text mkeps}:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
@{thm (lhs) mkeps.simps(6)} & $\dn$ & @{thm (rhs) mkeps.simps(6)}\\
@{thm (lhs) mkeps.simps(7)} & $\dn$ & @{thm (rhs) mkeps.simps(7)}\\
@{thm (lhs) mkeps.simps(8)} & $\dn$ & @{thm (rhs) mkeps.simps(8)}\\
\end{tabular}
\end{center}
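
\noindent In the same hypothetical notation (reusing the datatypes from
the sketches above, and again assuming the standard clauses rather than
rendering the Isabelle equations): a nullable bounded regular expression
is matched against the empty string by as many copies of @{text "mkeps r"}
as the lower bound demands.

\begin{verbatim}
mkeps :: Rexp -> Val
mkeps One              = Empty
mkeps (Alt r1 r2)      = if nullable r1 then LeftV (mkeps r1)
                         else RightV (mkeps r2)
mkeps (Seq r1 r2)      = SeqV (mkeps r1) (mkeps r2)
mkeps (Star _)         = Stars []
-- bounded constructors: n copies for lower bound n, none for UpNTimes
mkeps (NTimes r n)     = Stars (replicate n (mkeps r))
mkeps (UpNTimes _ _)   = Stars []
mkeps (FromNTimes r n) = Stars (replicate n (mkeps r))
mkeps (NMTimes r n _)  = Stars (replicate n (mkeps r))
mkeps _                = error "mkeps: not nullable"
\end{verbatim}
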

\noindent And the injection function @{text inj}:

\begin{center}
\begin{tabular}{lcl}
@{thm (lhs) injval.simps(8)} & $\dn$ & @{thm (rhs) injval.simps(8)}\\
@{thm (lhs) injval.simps(9)} & $\dn$ & @{thm (rhs) injval.simps(9)}\\
@{thm (lhs) injval.simps(10)} & $\dn$ & @{thm (rhs) injval.simps(10)}\\
@{thm (lhs) injval.simps(11)} & $\dn$ & @{thm (rhs) injval.simps(11)}\\
\end{tabular}
\end{center}
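
\noindent The injection clauses for the bounded constructors all have the
same shape as the @{term "STAR r"}-clause: the character is injected into
the first value and the result is prepended to the list. A hypothetical
sketch, under the same assumptions as above:

\begin{verbatim}
-- inj undoes one derivative step: a value for (der c r) and the
-- character c are turned back into a value for r
injval :: Rexp -> Char -> Val -> Val
injval (Ch d) _ Empty                   = CharV d
injval (Alt r1 _) c (LeftV v)           = LeftV (injval r1 c v)
injval (Alt _ r2) c (RightV v)          = RightV (injval r2 c v)
injval (Seq r1 _) c (SeqV v1 v2)        = SeqV (injval r1 c v1) v2
injval (Seq r1 _) c (LeftV (SeqV v1 v2)) = SeqV (injval r1 c v1) v2
injval (Seq r1 r2) c (RightV v)         = SeqV (mkeps r1) (injval r2 c v)
injval (Star r) c (SeqV v (Stars vs))   = Stars (injval r c v : vs)
-- bounded constructors: identical shape to the Star case
injval (NTimes r _) c (SeqV v (Stars vs))     = Stars (injval r c v : vs)
injval (UpNTimes r _) c (SeqV v (Stars vs))   = Stars (injval r c v : vs)
injval (FromNTimes r _) c (SeqV v (Stars vs)) = Stars (injval r c v : vs)
injval (NMTimes r _ _) c (SeqV v (Stars vs))  = Stars (injval r c v : vs)
injval _ _ _ = error "injval: ill-formed arguments"
\end{verbatim}
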

\noindent Finally, the POSIX relation is extended with the following rules
for the bounded regular expressions:

\begin{center}
\begin{tabular}{c}
@{thm [mode=Rule] Posix_NTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_NTIMES2}\medskip\\
@{thm [mode=Rule] Posix_UPNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_UPNTIMES2}\medskip\\
@{thm [mode=Rule] Posix_FROMNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_FROMNTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_FROMNTIMES2}\medskip\\
@{thm [mode=Rule] Posix_NMTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_NMTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}\medskip\\
@{thm [mode=Rule] Posix_NMTIMES2}\\
\end{tabular}
\end{center}
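
\noindent With these pieces in place, the top-level lexer is the same
recursion as in the unextended algorithm; as a sketch, using the
hypothetical functions defined above:

\begin{verbatim}
-- take derivatives along the string, then rebuild the value by
-- injecting the characters back in reverse order
lexer :: Rexp -> String -> Maybe Val
lexer r []       = if nullable r then Just (mkeps r) else Nothing
lexer r (c : cs) = fmap (injval r c) (lexer (der c r) cs)
\end{verbatim}
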

@{term "\<Sum> i \<in> {m..n} . P i"} @{term "\<Sum> i \<in> {..n} . P i"}
@{term "\<Union> i \<in> {m..n} . P i"} @{term "\<Union> i \<in> {..n} . P i"}
@{term "\<Union> i \<in> {0::nat..n} . P i"}

*}


section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}

text {*
%  \newcommand{\greedy}{\succcurlyeq_{gr}}
 \newcommand{\posix}{>}

An extended version of \cite{Sulzmann2014} is available at the website of
its first author; this includes some ``proofs'', claimed in
\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
final form, we make no comment thereon, preferring to give general reasons
for our belief that the approach of \cite{Sulzmann2014} is problematic.
Their central definition is an ``ordering relation'' defined by the
rules (slightly adapted to fit our notation):

??

\noindent The idea behind the rules (A1) and (A2), for example, is that a
@{text Left}-value is bigger than a @{text Right}-value, if the underlying
string of the @{text Left}-value is at least as long as the
underlying string of the @{text Right}-value. The order is reversed,
however, if the @{text Right}-value can match a longer string than the
@{text Left}-value. In this way the POSIX value is supposed to be the
biggest value for a given string and regular expression.

Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by Frisch
and Cardelli from which they have taken the idea for their correctness
proof. Frisch and Cardelli introduced a similar ordering for GREEDY
matching and showed that their GREEDY matching algorithm always
produces a maximal element according to this ordering (from all possible
solutions). The only difference between their GREEDY ordering and the
``ordering'' by Sulzmann and Lu is that GREEDY always prefers a @{text
Left}-value over a @{text Right}-value, no matter what the underlying
string is. This seems to be only a very minor difference, but it has
drastic consequences in terms of what properties both orderings enjoy.
What is interesting for our purposes is that the properties of reflexivity,
totality and transitivity for this GREEDY ordering can be proved
relatively easily by induction.
*}
*)

section {* Conclusion *}

text {*

We have implemented the POSIX value calculation algorithm introduced by
Sulzmann and Lu
\cite{Sulzmann2014}. Our implementation is nearly identical to the
original and all modifications we introduced are harmless (like our char-clause for
@{text inj}). We have proved this algorithm to be correct, but correct
according to our own specification of what POSIX values are. Our
specification (inspired by work of Vansummeren \cite{Vansummeren2006}) appears to be
much simpler than the one in \cite{Sulzmann2014} and our proofs are nearly always
straightforward. We have attempted to formalise the original proof
by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
already acknowledge some small problems, but our experience suggests
that there are more serious problems.

Having proved the correctness of the POSIX lexing algorithm in
\cite{Sulzmann2014}, what lessons have we learned? Well, this is a
perfect example of the importance of the \emph{right} definitions. We
have (on and off) explored mechanisations ever since the first versions
of \cite{Sulzmann2014} appeared, but made little progress with
turning the relatively detailed proof sketch in \cite{Sulzmann2014} into a
formalisable proof. Having seen \cite{Vansummeren2006} and adapted the
POSIX definition given there for the algorithm by Sulzmann and Lu made all
the difference: the proofs, as said, are nearly always straightforward. The
question remains whether the original proof idea of \cite{Sulzmann2014},
potentially using our result as a stepping stone, can be made to work.
Alas, we really do not know, despite considerable effort.


Closely related to our work is an automata-based lexer formalised by
Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
initial substrings, but Nipkow's algorithm is not completely
computational. The algorithm by Sulzmann and Lu, in contrast, can be
implemented with ease in any functional language. A bespoke lexer for the
Imp-language is formalised in Coq as part of the Software Foundations book
by Pierce et al.\ \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
do not generalise easily to more advanced features.
Our formalisation is available from the Archive of Formal Proofs \cite{aduAFP16}
under \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip

\noindent
{\bf Acknowledgements:}
We are very grateful to Martin Sulzmann for his comments on our work and
moreover for patiently explaining to us the details of \cite{Sulzmann2014}. We
also received very helpful comments from James Cheney and the anonymous referees.
% \small
\bibliographystyle{plain}
\bibliography{root}

*}

(*<*)
end
(*>*)