|
1 (*<*) |
|
2 theory PaperExt |
|
3 imports |
|
4 (*"../LexerExt"*) |
|
5 (*"../PositionsExt"*) |
|
6 "~~/src/HOL/Library/LaTeXsugar" |
|
7 begin |
|
8 (*>*) |
|
9 |
|
10 (* |
|
11 declare [[show_question_marks = false]] |
|
12 declare [[eta_contract = false]] |
|
13 |
|
14 abbreviation |
|
15 "der_syn r c \<equiv> der c r" |
|
16 |
|
17 abbreviation |
|
18 "ders_syn r s \<equiv> ders s r" |
|
19 |
|
20 |
|
21 notation (latex output) |
|
22 If ("(\<^latex>\<open>\\textrm{\<close>if\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>then\<^latex>\<open>}\<close> (_)/ \<^latex>\<open>\\textrm{\<close>else\<^latex>\<open>}\<close> (_))" 10) and |
|
23 Cons ("_\<^latex>\<open>\\mbox{$\\,$}\<close>::\<^latex>\<open>\\mbox{$\\,$}\<close>_" [75,73] 73) and |
|
24 |
|
25 ZERO ("\<^bold>0" 81) and |
|
26 ONE ("\<^bold>1" 81) and |
|
27 CHAR ("_" [1000] 80) and |
|
28 ALT ("_ + _" [77,77] 78) and |
|
29 SEQ ("_ \<cdot> _" [77,77] 78) and |
|
30 STAR ("_\<^sup>\<star>" [78] 78) and |
|
31 NTIMES ("_\<^bsup>'{_'}\<^esup>" [78, 50] 80) and |
|
32 FROMNTIMES ("_\<^bsup>'{_..'}\<^esup>" [78, 50] 80) and |
|
33 UPNTIMES ("_\<^bsup>'{.._'}\<^esup>" [78, 50] 80) and |
|
34 NMTIMES ("_\<^bsup>'{_.._'}\<^esup>" [78, 50,50] 80) and |
|
35 |
|
36 val.Void ("Empty" 78) and |
|
37 val.Char ("Char _" [1000] 78) and |
|
38 val.Left ("Left _" [79] 78) and |
|
39 val.Right ("Right _" [1000] 78) and |
|
40 val.Seq ("Seq _ _" [79,79] 78) and |
|
41 val.Stars ("Stars _" [1000] 78) and |
|
42 |
|
43 L ("L'(_')" [10] 78) and |
|
44 LV ("LV _ _" [80,73] 78) and |
|
45 der_syn ("_\\_" [79, 1000] 76) and |
|
46 ders_syn ("_\\_" [79, 1000] 76) and |
|
47 flat ("|_|" [75] 74) and |
|
48 flats ("|_|" [72] 74) and |
|
49 Sequ ("_ @ _" [78,77] 63) and |
|
50 injval ("inj _ _ _" [81,77,79] 76) and |
|
51 mkeps ("mkeps _" [79] 76) and |
|
52 length ("len _" [73] 73) and |
|
53 intlen ("len _" [73] 73) and |
|
54 set ("_" [73] 73) and |
|
55 |
|
56 Prf ("_ : _" [75,75] 75) and |
|
57 Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and |
|
58 |
|
59 lexer ("lexer _ _" [78,78] 77) and |
|
60 DUMMY ("\<^latex>\<open>\\underline{\\hspace{2mm}}\<close>") |
|
61 *) |
|
62 (*>*) |
|
63 |
|
64 (* |
|
65 section {* Extensions*} |
|
66 |
|
67 text {* |
|
68 |
|
69 A strong point in favour of |
|
70 Sulzmann and Lu's algorithm is that it can be extended in various |
|
71 ways. If we are interested in tokenising a string, then we need to not just |
|
72 split up the string into tokens, but also ``classify'' the tokens (for |
|
73 example whether it is a keyword or an identifier). This can be done with |
|
74 only minor modifications to the algorithm by introducing \emph{record |
|
75 regular expressions} and \emph{record values} (for example |
|
76 \cite{Sulzmann2014b}): |
|
77 |
|
78 \begin{center} |
|
79 @{text "r :="} |
|
80 @{text "..."} $\mid$ |
|
81 @{text "(l : r)"} \qquad\qquad |
|
82 @{text "v :="} |
|
83 @{text "..."} $\mid$ |
|
84 @{text "(l : v)"} |
|
85 \end{center} |
|
86 |
|
87 \noindent where @{text l} is a label, say a string, @{text r} a regular |
|
88 expression and @{text v} a value. All functions can be smoothly extended |
|
89 to these regular expressions and values. For example \mbox{@{text "(l : |
|
90 r)"}} is nullable iff @{term r} is, and so on. The purpose of the record |
|
91 regular expression is to mark certain parts of a regular expression and |
|
92 then record in the calculated value which parts of the string were matched |
|
93 by this part. The label can then serve as classification for the tokens. |
|
94 For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for |
|
95 keywords and identifiers from the Introduction. With the record regular |
|
96 expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}} |
|
97 and then traverse the calculated value and only collect the underlying |
|
98 strings in record values. With this we obtain finite sequences of pairs of |
|
99 labels and strings, for example |
|
100 |
|
101 \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\] |
|
102 |
|
103 \noindent from which tokens with classifications (keyword-token, |
|
104 identifier-token and so on) can be extracted. |
|
105 |
|
106 In the context of POSIX matching, it is also interesting to study additional |
|
107 constructors about bounded-repetitions of regular expressions. For this let |
|
108 us extend the results from the previous section to the following four |
|
109 additional regular expression constructors: |
|
110 |
|
111 \begin{center} |
|
112 \begin{tabular}{lcrl@ {\hspace{12mm}}l} |
|
113 @{text r} & @{text ":="} & $\ldots\mid$ & @{term "NTIMES r n"} & exactly-@{text n}-times\\ |
|
114 & & $\mid$ & @{term "UPNTIMES r n"} & upto-@{text n}-times\\ |
|
115 & & $\mid$ & @{term "FROMNTIMES r n"} & from-@{text n}-times\\ |
|
116 & & $\mid$ & @{term "NMTIMES r n m"} & between-@{text nm}-times\\ |
|
117 \end{tabular} |
|
118 \end{center} |
|
119 |
|
120 \noindent |
|
121 We will call them \emph{bounded regular expressions}. |
|
122 With the help of the power operator (definition ommited) for sets of strings, the languages |
|
123 recognised by these regular expression can be defined in Isabelle as follows: |
|
124 |
|
125 \begin{center} |
|
126 \begin{tabular}{lcl} |
|
127 @{thm (lhs) L.simps(8)} & $\dn$ & @{thm (rhs) L.simps(8)}\\ |
|
128 @{thm (lhs) L.simps(7)} & $\dn$ & @{thm (rhs) L.simps(7)}\\ |
|
129 @{thm (lhs) L.simps(9)} & $\dn$ & @{thm (rhs) L.simps(9)}\\ |
|
130 @{thm (lhs) L.simps(10)} & $\dn$ & @{thm (rhs) L.simps(10)}\\ |
|
131 \end{tabular} |
|
132 \end{center} |
|
133 |
|
134 \noindent This definition implies that in the last clause @{term |
|
135 "NMTIMES r n m"} matches no string in case @{term "m < n"}, because |
|
136 then the interval @{term "{n..m}"} is empty. While the language |
|
137 recognised by these regular expressions is straightforward, some |
|
138 care is needed for how to define the corresponding lexical |
|
139 values. First, with a slight abuse of language, we will (re)use |
|
140 values of the form @{term "Stars vs"} for values inhabited in |
|
141 bounded regular expressions. Second, we need to introduce inductive |
|
142 rules for extending our inhabitation relation shown on |
|
143 Page~\ref{prfintros}, from which we then derived our notion of |
|
144 lexical values. Given the rule for @{term "STAR r"}, the rule for |
|
145 @{term "UPNTIMES r n"} just requires additionally that the length of |
|
146 the list of values must be smaller or equal to @{term n}: |
|
147 |
|
148 \begin{center} |
|
149 @{thm[mode=Rule] Prf.intros(7)[of "vs"]} |
|
150 \end{center} |
|
151 |
|
152 \noindent Like in the @{term "STAR r"}-rule, we require with the |
|
153 left-premise that some non-empty part of the string is `chipped' |
|
154 away by \emph{every} value in @{text vs}, that means the corresponding |
|
155 values do not flatten to the empty string. In the rule for @{term |
|
156 "NTIMES r n"} (that is exactly-@{term n}-times @{text r}) we clearly need to require |
|
157 that the length of the list of values equals to @{text n}. But enforcing |
|
158 that every of these @{term n} values `chipps' away some part of a string |
|
159 would be too strong. |
|
160 |
|
161 |
|
162 \begin{center} |
|
163 @{thm[mode=Rule] Prf.intros(8)[of "vs\<^sub>1" r "vs\<^sub>2"]} |
|
164 \end{center} |
|
165 |
|
166 \begin{center} |
|
167 \begin{tabular}{lcl} |
|
168 @{thm (lhs) der.simps(8)} & $\dn$ & @{thm (rhs) der.simps(8)}\\ |
|
169 @{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}\\ |
|
170 @{thm (lhs) der.simps(9)} & $\dn$ & @{thm (rhs) der.simps(9)}\\ |
|
171 @{thm (lhs) der.simps(10)} & $\dn$ & @{thm (rhs) der.simps(10)}\\ |
|
172 \end{tabular} |
|
173 \end{center} |
|
174 |
|
175 |
|
176 \begin{center} |
|
177 \begin{tabular}{lcl} |
|
178 @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\ |
|
179 @{thm (lhs) mkeps.simps(6)} & $\dn$ & @{thm (rhs) mkeps.simps(6)}\\ |
|
180 @{thm (lhs) mkeps.simps(7)} & $\dn$ & @{thm (rhs) mkeps.simps(7)}\\ |
|
181 @{thm (lhs) mkeps.simps(8)} & $\dn$ & @{thm (rhs) mkeps.simps(8)}\\ |
|
182 \end{tabular} |
|
183 \end{center} |
|
184 |
|
185 \begin{center} |
|
186 \begin{tabular}{lcl} |
|
187 @{thm (lhs) injval.simps(8)} & $\dn$ & @{thm (rhs) injval.simps(8)}\\ |
|
188 @{thm (lhs) injval.simps(9)} & $\dn$ & @{thm (rhs) injval.simps(9)}\\ |
|
189 @{thm (lhs) injval.simps(10)} & $\dn$ & @{thm (rhs) injval.simps(10)}\\ |
|
190 @{thm (lhs) injval.simps(11)} & $\dn$ & @{thm (rhs) injval.simps(11)}\\ |
|
191 \end{tabular} |
|
192 \end{center} |
|
193 |
|
194 |
|
195 @{thm [mode=Rule] Posix_NTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]} |
|
196 |
|
197 @{thm [mode=Rule] Posix_NTIMES2} |
|
198 |
|
199 @{thm [mode=Rule] Posix_UPNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]} |
|
200 |
|
201 @{thm [mode=Rule] Posix_UPNTIMES2} |
|
202 |
|
203 @{thm [mode=Rule] Posix_FROMNTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]} |
|
204 |
|
205 @{thm [mode=Rule] Posix_FROMNTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]} |
|
206 |
|
207 @{thm [mode=Rule] Posix_FROMNTIMES2} |
|
208 |
|
209 @{thm [mode=Rule] Posix_NMTIMES1[of "s\<^sub>1" r v "s\<^sub>2"]} |
|
210 |
|
211 @{thm [mode=Rule] Posix_NMTIMES3[of "s\<^sub>1" r v "s\<^sub>2"]} |
|
212 |
|
213 @{thm [mode=Rule] Posix_NMTIMES2} |
|
214 |
|
215 |
|
216 |
|
217 @{term "\<Sum> i \<in> {m..n} . P i"} @{term "\<Sum> i \<in> {..n} . P i"} |
|
218 @{term "\<Union> i \<in> {m..n} . P i"} @{term "\<Union> i \<in> {..n} . P i"} |
|
219 @{term "\<Union> i \<in> {0::nat..n} . P i"} |
|
220 |
|
221 |
|
222 |
|
223 *} |
|
224 |
|
225 |
|
226 |
|
227 section {* The Correctness Argument by Sulzmann and Lu\label{argu} *} |
|
228 |
|
229 text {* |
|
230 % \newcommand{\greedy}{\succcurlyeq_{gr}} |
|
231 \newcommand{\posix}{>} |
|
232 |
|
233 An extended version of \cite{Sulzmann2014} is available at the website of |
|
234 its first author; this includes some ``proofs'', claimed in |
|
235 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in |
|
236 final form, we make no comment thereon, preferring to give general reasons |
|
237 for our belief that the approach of \cite{Sulzmann2014} is problematic. |
|
238 Their central definition is an ``ordering relation'' defined by the |
|
239 rules (slightly adapted to fit our notation): |
|
240 |
|
241 ?? |
|
242 |
|
243 \noindent The idea behind the rules (A1) and (A2), for example, is that a |
|
244 @{text Left}-value is bigger than a @{text Right}-value, if the underlying |
|
245 string of the @{text Left}-value is longer or of equal length to the |
|
246 underlying string of the @{text Right}-value. The order is reversed, |
|
247 however, if the @{text Right}-value can match a longer string than a |
|
248 @{text Left}-value. In this way the POSIX value is supposed to be the |
|
249 biggest value for a given string and regular expression. |
|
250 |
|
251 Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by Frisch |
|
252 and Cardelli from where they have taken the idea for their correctness |
|
253 proof. Frisch and Cardelli introduced a similar ordering for GREEDY |
|
254 matching and they showed that their GREEDY matching algorithm always |
|
255 produces a maximal element according to this ordering (from all possible |
|
256 solutions). The only difference between their GREEDY ordering and the |
|
257 ``ordering'' by Sulzmann and Lu is that GREEDY always prefers a @{text |
|
258 Left}-value over a @{text Right}-value, no matter what the underlying |
|
259 string is. This seems to be only a very minor difference, but it has |
|
260 drastic consequences in terms of what properties both orderings enjoy. |
|
261 What is interesting for our purposes is that the properties reflexivity, |
|
262 totality and transitivity for this GREEDY ordering can be proved |
|
263 relatively easily by induction. |
|
264 *} |
|
265 *) |
|
266 |
|
267 section {* Conclusion *} |
|
268 |
|
269 text {* |
|
270 |
|
271 We have implemented the POSIX value calculation algorithm introduced by |
|
272 Sulzmann and Lu |
|
273 \cite{Sulzmann2014}. Our implementation is nearly identical to the |
|
274 original and all modifications we introduced are harmless (like our char-clause for |
|
275 @{text inj}). We have proved this algorithm to be correct, but correct |
|
276 according to our own specification of what POSIX values are. Our |
|
277 specification (inspired from work by Vansummeren \cite{Vansummeren2006}) appears to be |
|
278 much simpler than in \cite{Sulzmann2014} and our proofs are nearly always |
|
279 straightforward. We have attempted to formalise the original proof |
|
280 by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains |
|
281 unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors |
|
282 already acknowledge some small problems, but our experience suggests |
|
283 that there are more serious problems. |
|
284 |
|
285 Having proved the correctness of the POSIX lexing algorithm in |
|
286 \cite{Sulzmann2014}, which lessons have we learned? Well, this is a |
|
287 perfect example for the importance of the \emph{right} definitions. We |
|
288 have (on and off) explored mechanisations as soon as first versions |
|
289 of \cite{Sulzmann2014} appeared, but have made little progress with |
|
290 turning the relatively detailed proof sketch in \cite{Sulzmann2014} into a |
|
291 formalisable proof. Having seen \cite{Vansummeren2006} and adapted the |
|
292 POSIX definition given there for the algorithm by Sulzmann and Lu made all |
|
293 the difference: the proofs, as said, are nearly straightforward. The |
|
294 question remains whether the original proof idea of \cite{Sulzmann2014}, |
|
295 potentially using our result as a stepping stone, can be made to work? |
|
296 Alas, we really do not know despite considerable effort. |
|
297 |
|
298 |
|
299 Closely related to our work is an automata-based lexer formalised by |
|
300 Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest |
|
301 initial substrings, but Nipkow's algorithm is not completely |
|
302 computational. The algorithm by Sulzmann and Lu, in contrast, can be |
|
303 implemented with ease in any functional language. A bespoke lexer for the |
|
304 Imp-language is formalised in Coq as part of the Software Foundations book |
|
305 by Pierce et al \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they |
|
306 do not generalise easily to more advanced features. |
|
307 Our formalisation is available from the Archive of Formal Proofs \cite{aduAFP16} |
|
308 under \url{http://www.isa-afp.org/entries/Posix-Lexing.shtml}.\medskip |
|
309 |
|
310 \noindent |
|
311 {\bf Acknowledgements:} |
|
312 We are very grateful to Martin Sulzmann for his comments on our work and |
|
313 moreover for patiently explaining to us the details in \cite{Sulzmann2014}. We |
|
314 also received very helpful comments from James Cheney and anonymous referees. |
|
315 % \small |
|
316 \bibliographystyle{plain} |
|
317 \bibliography{root} |
|
318 |
|
319 *} |
|
320 |
|
321 (*<*) |
|
322 end |
|
323 (*>*) |