369
|
1 |
(*<*)
|
|
2 |
theory PaperExt
|
|
3 |
imports
|
|
4 |
(*"../LexerExt"*)
|
|
5 |
(*"../PositionsExt"*)
|
|
6 |
"~~/src/HOL/Library/LaTeXsugar"
|
|
7 |
begin
|
|
8 |
(*>*)
|
|
9 |
|
|
10 |
(*
|
|
11 |
declare [[show_question_marks = false]]
|
|
12 |
declare [[eta_contract = false]]
|
|
13 |
|
|
14 |
abbreviation
|
|
15 |
"der_syn r c \<equiv> der c r"
|
|
16 |
|
|
17 |
abbreviation
|
|
18 |
"ders_syn r s \<equiv> ders s r"
|
|
19 |
|
|
20 |
|
|
21 |
notation (latex output)
|
|
22 |
If ("(\<^latex>\<open>\\textrm{\<close>if\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>then\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>else\<^latex>\<open>}\<close> (_))" 10) and
|
|
23 |
Cons ("_\<^latex>\<open>\\mbox{$\\,$}\<close>::\<^latex>\<open>\\mbox{$\\,$}\<close>_" [75,73] 73) and
|
|
24 |
|
|
25 |
ZERO ("\<^bold>0" 81) and
|
|
26 |
ONE ("\<^bold>1" 81) and
|
|
27 |
CHAR ("_" [1000] 80) and
|
|
28 |
ALT ("_ + _" [77,77] 78) and
|
|
29 |
SEQ ("_ \<cdot> _" [77,77] 78) and
|
|
30 |
STAR ("_\<^sup>\<star>" [78] 78) and
|
|
31 |
NTIMES ("_\<^bsup>'{_'}\<^esup>" [78, 50] 80) and
|
|
32 |
FROMNTIMES ("_\<^bsup>'{_..'}\<^esup>" [78, 50] 80) and
|
|
33 |
UPNTIMES ("_\<^bsup>'{.._'}\<^esup>" [78, 50] 80) and
|
|
34 |
NMTIMES ("_\<^bsup>'{_.._'}\<^esup>" [78, 50,50] 80) and
|
|
35 |
|
|
36 |
val.Void ("Empty" 78) and
|
|
37 |
val.Char ("Char _" [1000] 78) and
|
|
38 |
val.Left ("Left _" [79] 78) and
|
|
39 |
val.Right ("Right _" [1000] 78) and
|
|
40 |
val.Seq ("Seq _ _" [79,79] 78) and
|
|
41 |
val.Stars ("Stars _" [1000] 78) and
|
|
42 |
|
|
43 |
L ("L'(_')" [10] 78) and
|
|
44 |
LV ("LV _ _" [80,73] 78) and
|
|
45 |
der_syn ("_\\_" [79, 1000] 76) and
|
|
46 |
ders_syn ("_\\_" [79, 1000] 76) and
|
|
47 |
flat ("|_|" [75] 74) and
|
|
48 |
flats ("|_|" [72] 74) and
|
|
49 |
Sequ ("_ @ _" [78,77] 63) and
|
|
50 |
injval ("inj _ _ _" [81,77,79] 76) and
|
|
51 |
mkeps ("mkeps _" [79] 76) and
|
|
52 |
length ("len _" [73] 73) and
|
|
53 |
intlen ("len _" [73] 73) and
|
|
54 |
set ("_" [73] 73) and
|
|
55 |
|
|
56 |
Prf ("_ : _" [75,75] 75) and
|
|
57 |
Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
|
|
58 |
|
|
59 |
lexer ("lexer _ _" [78,78] 77) and
|
|
60 |
DUMMY ("\<^latex>\<open>\\underline{\\hspace{2mm}}\<close>")
|
|
61 |
*)
|
|
62 |
(*>*)
|
|
63 |
|
|
64 |
(*
|
|
65 |
section {* Extensions*}
|
|
66 |
|
|
67 |
text {*
|
|
68 |
|
|
69 |
A strong point in favour of
|
|
70 |
Sulzmann and Lu's algorithm is that it can be extended in various
|
|
71 |
ways. If we are interested in tokenising a string, then we need to not just
|
|
72 |
split up the string into tokens, but also ``classify'' the tokens (for
|
|
73 |
example whether it is a keyword or an identifier). This can be done with
|
|
74 |
only minor modifications to the algorithm by introducing \emph{record
|
|
75 |
regular expressions} and \emph{record values} (for example
|
|
76 |
\cite{Sulzmann2014b}):
|
|
77 |
|
|
78 |
\begin{center}
|
|
79 |
@{text "r :="}
|
|
80 |
@{text "..."} $\mid$
|
|
81 |
@{text "(l : r)"} \qquad\qquad
|
|
82 |
@{text "v :="}
|
|
83 |
@{text "..."} $\mid$
|
|
84 |
@{text "(l : v)"}
|
|
85 |
\end{center}
|
|
86 |
|
|
87 |
\noindent where @{text l} is a label, say a string, @{text r} a regular
|
|
88 |
expression and @{text v} a value. All functions can be smoothly extended
|
|
89 |
to these regular expressions and values. For example \mbox{@{text "(l :
|
|
90 |
r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
|
|
91 |
regular expression is to mark certain parts of a regular expression and
|
|
92 |
then record in the calculated value which parts of the string were matched
|
|
93 |
by this part. The label can then serve as classification for the tokens.
|
|
94 |
For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
|
|
95 |
keywords and identifiers from the Introduction. With the record regular
|
|
96 |
expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
|
|
97 |
and then traverse the calculated value and only collect the underlying
|
|
98 |
strings in record values. With this we obtain finite sequences of pairs of
|
|
99 |
labels and strings, for example
|
|
100 |
|
|
101 |
\[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
|
|
102 |
|
|
103 |
\noindent from which tokens with classifications (keyword-token,
|
|
104 |
identifier-token and so on) can be extracted.
|
|
105 |
|
|
106 |
In the context of POSIX matching, it is also interesting to study additional
|
|
107 |
constructors about bounded-repetitions of regular expressions. For this let
|
|
108 |
us extend the results from the previous section to the following four
|
|
109 |
additional regular expression constructors:
|
|
110 |
|
|
111 |
\begin{center}
|
|
112 |
\begin{tabular}{lcrl@ {\hspace{12mm}}l}
|
|
113 |
@{text r} & @{text ":="} & $\ldots\mid$ & @{term "NTIMES r n"} & exactly-@{text n}-times\\
|
|
114 |
& & $\mid$ & @{term "UPNTIMES r n"} & upto-@{text n}-times\\
|
|
115 |
& & $\mid$ & @{term "FROMNTIMES r n"} & from-@{text n}-times\\
|
|
116 |
& & $\mid$ & @{term "NMTIMES r n m"} & between-@{text nm}-times\\
|
|
117 |
\end{tabular}
|
|
118 |
\end{center}
|
|
119 |
|
|
120 |
\noindent
|
|
121 |
We will call them \emph{bounded regular expressions}.
|
|
122 |
With the help of the power operator (definition ommited) for sets of strings, the languages
|
|
123 |
recognised by these regular expression can be defined in Isabelle as follows:
|
|
124 |
|
|
125 |
\begin{center}
|
|
126 |
\begin{tabular}{lcl}
|
|
127 |
@{thm (lhs) L.simps(8)} & $\dn$ & @{thm (rhs) L.simps(8)}\\
|
|
128 |
@{thm (lhs) L.simps(7)} & $\dn$ & @{thm (rhs) L.simps(7)}\\
|
|
129 |
@{thm (lhs) L.simps(9)} & $\dn$ & @{thm (rhs) L.simps(9)}\\
|
|
130 |
@{thm (lhs) L.simps(10)} & $\dn$ & @{thm (rhs) L.simps(10)}\\
|
|
131 |
\end{tabular}
|
|
132 |
\end{center}
|
|
133 |
|
|
134 |
\noindent This definition implies that in the last clause @{term
|
|
135 |
"NMTIMES r n m"} matches no string in case @{term "m < n"}, because
|
|
136 |
then the interval @{term "{n..m}"} is empty. While the language
|
|
137 |
recognised by these regular expressions is straightforward, some
|
|
138 |
care is needed for how to define the corresponding lexical
|
|
139 |
values. First, with a slight abuse of language, we will (re)use
|
|
140 |
values of the form @{term "Stars vs"} for values inhabited in
|
|
141 |
bounded regular expressions. Second, we need to introduce inductive
|
|
142 |
rules for extending our inhabitation relation shown on
|
|
143 |
Page~\ref{prfintros}, from which we then derived our notion of
|
|
144 |
lexical values. Given the rule for @{term "STAR r"}, the rule for
|
|
145 |
@{term "UPNTIMES r n"} just requires additionally that the length of
|
|
146 |
the list of values must be smaller or equal to @{term n}:
|
|
147 |
|
|
148 |
\begin{center}
|
|
149 |
@{thm[mode=Rule] Prf.intros(7)[of "vs"]}
|
|
150 |
\end{center}
|
|
151 |
|
|
152 |
\noindent Like in the @{term "STAR r"}-rule, we require with the
|
|
153 |
left-premise that some non-empty part of the string is `chipped'
|
|
154 |
away by \emph{every} value in @{text vs}, that means the corresponding
|
|
155 |
values do not flatten to the empty string. In the rule for @{term
|
|
156 |
"NTIMES r n"} (that is exactly-@{term n}-times @{text r}) we clearly need to require
|
|
157 |
that the length of the list of values equals to @{text n}. But enforcing
|
|
158 |
that every of these @{term n} values `chipps' away some part of a string
|
|
159 |
would be too strong.
|
|
160 |
|
|
161 |
|
|
162 |
\begin{center}
|
|
163 |
@{thm[mode=Rule] Prf.intros(8)[of "vs\<^sub>1" r "vs\<^sub>2"]}
|
|
164 |
\end{center}
|
|
165 |
|
|
166 |
\begin{center}
|
|
167 |
\begin{tabular}{lcl}
|
|
168 |
@{thm (lhs) der.simps(8)} & $\dn$ & @{thm (rhs) der.simps(8)}\\
|
|
169 |
@{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}\\
|
|
170 |
@{thm (lhs) der.simps(9)} & $\dn$ & @{thm (rhs) der.simps(9)}\\
|
|
171 |
@{thm (lhs) der.simps(10)} & $\dn$ & @{thm (rhs) der.simps(10)}\\
|
|
172 |
\end{tabular}
|
|
173 |
\end{center}
|
|
174 |
|
|
175 |
|
|
176 |
\begin{center}
|
|
177 |
\begin{tabular}{lcl}
|
|
178 |
@{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
|
|
179 |
@{thm (lhs) mkeps.simps(6)} & $\dn$ & @{thm (rhs) mkeps.simps(6)}\\
|
|
180 |
@{thm (lhs) mkeps.simps(7)} & $\dn$ & @{thm (rhs) mkeps.simps(7)}\\
|
|
181 |
@{thm (lhs) mkeps.simps(8)} & $\dn$ & @{thm (rhs) mkeps.simps(8)}\\
|
|
182 |
\end{tabular}
|
|
183 |
\end{center}
|
|
184 |
|
|
185 |
\begin{center}
|
|
186 |
\begin{tabular}{lcl}
|
|
187 |
@{thm (lhs) injval.simps(8)} & $\dn$ & @{thm (rhs) injval.simps(8)}\\
|
|
188 |
@{thm (lhs) injval.simps(9)} & $\dn$ & @{thm (rhs) injval.simps(9)}\\
|
|
189 |
@{thm (lhs) injval.simps(10)} & $\dn$ & @{thm (rhs) injval.simps(10)}\\
|
|
190 |
@{thm (lhs) injval.simps(11)} & $\dn$ & @{thm (rhs) injval.simps(11)}\\
|
|
191 |
\end{tabular}
|
|
192 |
\end{center}
|
|
193 |
|
|
194 |
|
|
195 |
@{thm [mode=Rule] Posix_NTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}
|
|
196 |
|
|
197 |
@{thm [mode=Rule] Posix_NTIMES2}
|
|
198 |
|
|
199 |
@{thm [mode=Rule] Posix_UPNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}
|
|
200 |
|
|
201 |
@{thm [mode=Rule] Posix_UPNTIMES2}
|
|
202 |
|
|
203 |
@{thm [mode=Rule] Posix_FROMNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}
|
|
204 |
|
|
205 |
@{thm [mode=Rule] Posix_FROMNTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}
|
|
206 |
|
|
207 |
@{thm [mode=Rule] Posix_FROMNTIMES2}
|
|
208 |
|
|
209 |
@{thm [mode=Rule] Posix_NMTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]}
|
|
210 |
|
|
211 |
@{thm [mode=Rule] Posix_NMTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]}
|
|
212 |
|
|
213 |
@{thm [mode=Rule] Posix_NMTIMES2}
|
|
214 |
|
|
215 |
|
|
216 |
|
|
217 |
@{term "\<Sum> i \<in> {m..n} . P i"} @{term "\<Sum> i \<in> {..n} . P i"}
|
|
218 |
@{term "\<Union> i \<in> {m..n} . P i"} @{term "\<Union> i \<in> {..n} . P i"}
|
|
219 |
@{term "\<Union> i \<in> {0::nat..n} . P i"}
|
|
220 |
|
|
221 |
|
|
222 |
|
|
223 |
*}
|
|
224 |
|
|
225 |
|
|
226 |
|
|
227 |
section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}
|
|
228 |
|
|
229 |
text {*
|
|
230 |
% \newcommand{\greedy}{\succcurlyeq_{gr}}
|
|
231 |
\newcommand{\posix}{>}
|
|
232 |
|
|
233 |
An extended version of \cite{Sulzmann2014} is available at the website of
|
|
234 |
its first author; this includes some ``proofs'', claimed in
|
|
235 |
\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
|
|
236 |
final form, we make no comment thereon, preferring to give general reasons
|
|
237 |
for our belief that the approach of \cite{Sulzmann2014} is problematic.
|
|
238 |
Their central definition is an ``ordering relation'' defined by the
|
|
239 |
rules (slightly adapted to fit our notation):
|
|
240 |
|
|
241 |
??
|
|
242 |
|
|
243 |
\noindent The idea behind the rules (A1) and (A2), for example, is that a
|
|
244 |
@{text Left}-value is bigger than a @{text Right}-value, if the underlying
|
|
245 |
string of the @{text Left}-value is longer or of equal length to the
|
|
246 |
underlying string of the @{text Right}-value. The order is reversed,
|
|
247 |
however, if the @{text Right}-value can match a longer string than a
|
|
248 |
@{text Left}-value. In this way the POSIX value is supposed to be the
|
|
249 |
biggest value for a given string and regular expression.
|
|
250 |
|
|
251 |
Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by Frisch
|
|
252 |
and Cardelli from where they have taken the idea for their correctness
|
|
253 |
proof. Frisch and Cardelli introduced a similar ordering for GREEDY
|
|
254 |
matching and they showed that their GREEDY matching algorithm always
|
|
255 |
produces a maximal element according to this ordering (from all possible
|
|
256 |
solutions). The only difference between their GREEDY ordering and the
|
|
257 |
``ordering'' by Sulzmann and Lu is that GREEDY always prefers a @{text
|
|
258 |
Left}-value over a @{text Right}-value, no matter what the underlying
|
|
259 |
string is. This seems to be only a very minor difference, but it has
|
|
260 |
drastic consequences in terms of what properties both orderings enjoy.
|
|
261 |
What is interesting for our purposes is that the properties reflexivity,
|
|
262 |
totality and transitivity for this GREEDY ordering can be proved
|
|
263 |
relatively easily by induction.
|
|
264 |
*}
|
|
265 |
*)
|
|
266 |
|
|
267 |
section {* Conclusion *}
|
|
268 |
|
|
269 |
text {*
|
|
270 |
|
|
271 |
We have implemented the POSIX value calculation algorithm introduced by
|
|
272 |
Sulzmann and Lu
|
|
273 |
\cite{Sulzmann2014}. Our implementation is nearly identical to the
|
|
274 |
original and all modifications we introduced are harmless (like our char-clause for
|
|
275 |
@{text inj}). We have proved this algorithm to be correct, but correct
|
|
276 |
according to our own specification of what POSIX values are. Our
|
|
277 |
specification (inspired from work by Vansummeren \cite{Vansummeren2006}) appears to be
|
|
278 |
much simpler than in \cite{Sulzmann2014} and our proofs are nearly always
|
|
279 |
straightforward. We have attempted to formalise the original proof
|
|
280 |
by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
|
|
281 |
unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
|
|
282 |
already acknowledge some small problems, but our experience suggests
|
|
283 |
that there are more serious problems.
|
|
284 |
|
|
285 |
Having proved the correctness of the POSIX lexing algorithm in
|
|
286 |
\cite{Sulzmann2014}, which lessons have we learned? Well, this is a
|
|
287 |
perfect example for the importance of the \emph{right} definitions. We
|
|
288 |
have (on and off) explored mechanisations as soon as first versions
|
|
289 |
of \cite{Sulzmann2014} appeared, but have made little progress with
|
|
290 |
turning the relatively detailed proof sketch in \cite{Sulzmann2014} into a
|
|
291 |
formalisable proof. Having seen \cite{Vansummeren2006} and adapted the
|
|
292 |
POSIX definition given there for the algorithm by Sulzmann and Lu made all
|
|
293 |
the difference: the proofs, as said, are nearly straightforward. The
|
|
294 |
question remains whether the original proof idea of \cite{Sulzmann2014},
|
|
295 |
potentially using our result as a stepping stone, can be made to work?
|
|
296 |
Alas, we really do not know despite considerable effort.
|
|
297 |
|
|
298 |
|
|
299 |
Closely related to our work is an automata-based lexer formalised by
|
|
300 |
Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
|
|
301 |
initial substrings, but Nipkow's algorithm is not completely
|
|
302 |
computational. The algorithm by Sulzmann and Lu, in contrast, can be
|
|
303 |
implemented with ease in any functional language. A bespoke lexer for the
|
|
304 |
Imp-language is formalised in Coq as part of the Software Foundations book
|
|
305 |
by Pierce et al \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
|
|
306 |
do not generalise easily to more advanced features.
|
|
307 |
Our formalisation is available from the Archive of Formal Proofs \cite{aduAFP16}
|
|
308 |
under \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip
|
|
309 |
|
|
310 |
\noindent
|
|
311 |
{\bf Acknowledgements:}
|
|
312 |
We are very grateful to Martin Sulzmann for his comments on our work and
|
|
313 |
moreover for patiently explaining to us the details in \cite{Sulzmann2014}. We
|
|
314 |
also received very helpful comments from James Cheney and anonymous referees.
|
|
315 |
% \small
|
|
316 |
\bibliographystyle{plain}
|
|
317 |
\bibliography{root}
|
|
318 |
|
|
319 |
*}
|
|
320 |
|
|
321 |
(*<*)
|
|
322 |
end
|
|
323 |
(*>*) |