245 |
245 |
246 section {* Bitcoded Regular Expressions and Derivatives *} |
246 section {* Bitcoded Regular Expressions and Derivatives *} |
247 |
247 |
248 text {* |
248 text {* |
249 |
249 |
|
250 In the second part of their paper \cite{Sulzmann2014}, |
250 Sulzmann and Lu describe another algorithm that generates POSIX |
251 Sulzmann and Lu describe another algorithm that generates POSIX |
251 values but dispences with the second phase where characters are |
252 values but dispences with the second phase where characters are |
252 injected ``back'' into values. For this they annotate bitcodes to |
253 injected ``back'' into values. For this they annotate bitcodes to |
253 regular expressions, which we define in Isabelle/HOL as the datatype |
254 regular expressions, which we define in Isabelle/HOL as the datatype |
254 |
255 |
255 \begin{center} |
256 \begin{center} |
256 \begin{tabular}{lcl} |
257 \begin{tabular}{lcl} |
257 @{term breg} & $::=$ & @{term "AZERO"}\\ |
258 @{term breg} & $::=$ & @{term "AZERO"} $\quad\mid\quad$ @{term "AONE bs"}\\ |
258 & $\mid$ & @{term "AONE bs"}\\ |
|
259 & $\mid$ & @{term "ACHAR bs c"}\\ |
259 & $\mid$ & @{term "ACHAR bs c"}\\ |
260 & $\mid$ & @{term "AALTs bs rs"}\\ |
260 & $\mid$ & @{term "AALTs bs rs"}\\ |
261 & $\mid$ & @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}\\ |
261 & $\mid$ & @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}\\ |
262 & $\mid$ & @{term "ASTAR bs r"} |
262 & $\mid$ & @{term "ASTAR bs r"} |
263 \end{tabular} |
263 \end{tabular} |
264 \end{center} |
264 \end{center} |
265 |
265 |
266 \noindent where @{text bs} stands for a bitsequences; @{text r}, |
266 \noindent where @{text bs} stands for a bitsequences; @{text r}, |
267 @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for annotated regular |
267 @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular |
268 expressions; and @{text rs} for a list of annotated regular |
268 expressions; and @{text rs} for lists of bitcoded regular |
269 expressions. In contrast to Sulzmann and Lu we generalise the |
269 expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"} |
270 alternative regular expressions to lists, instead of just having |
270 is just an abbreviation for @{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}. |
271 binary regular expressions. The idea with annotated regular |
271 For bitsequences we just use lists made up of the |
272 expressions is to incrementally generate the value information by |
272 constants @{text Z} and @{text S}. The idea with bitcoded regular |
273 recording bitsequences. Sulzmann and Lu then |
273 expressions is to incrementally generate the value information (for |
274 define a coding function for how values can be coded into bitsequences. |
274 example @{text Left} and @{text Right}) as bitsequences |
275 |
275 as part of the regular expression constructors. |
276 \begin{center} |
276 Sulzmann and Lu then define a coding |
|
277 function for how values can be coded into bitsequences. |
|
278 |
|
279 \begin{center} |
|
280 \begin{tabular}{cc} |
277 \begin{tabular}{lcl} |
281 \begin{tabular}{lcl} |
278 @{thm (lhs) code.simps(1)} & $\dn$ & @{thm (rhs) code.simps(1)}\\ |
282 @{thm (lhs) code.simps(1)} & $\dn$ & @{thm (rhs) code.simps(1)}\\ |
279 @{thm (lhs) code.simps(2)} & $\dn$ & @{thm (rhs) code.simps(2)}\\ |
283 @{thm (lhs) code.simps(2)} & $\dn$ & @{thm (rhs) code.simps(2)}\\ |
280 @{thm (lhs) code.simps(3)} & $\dn$ & @{thm (rhs) code.simps(3)}\\ |
284 @{thm (lhs) code.simps(3)} & $\dn$ & @{thm (rhs) code.simps(3)}\\ |
281 @{thm (lhs) code.simps(4)} & $\dn$ & @{thm (rhs) code.simps(4)}\\ |
285 @{thm (lhs) code.simps(4)} & $\dn$ & @{thm (rhs) code.simps(4)} |
|
286 \end{tabular} |
|
287 & |
|
288 \begin{tabular}{lcl} |
282 @{thm (lhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\ |
289 @{thm (lhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\ |
283 @{thm (lhs) code.simps(6)} & $\dn$ & @{thm (rhs) code.simps(6)}\\ |
290 @{thm (lhs) code.simps(6)} & $\dn$ & @{thm (rhs) code.simps(6)}\\ |
284 @{thm (lhs) code.simps(7)} & $\dn$ & @{thm (rhs) code.simps(7)} |
291 @{thm (lhs) code.simps(7)} & $\dn$ & @{thm (rhs) code.simps(7)}\\ |
285 \end{tabular} |
292 \mbox{\phantom{XX}}\\ |
286 \end{center} |
293 \end{tabular} |
287 |
294 \end{tabular} |
|
295 \end{center} |
|
296 |
|
297 \noindent |
|
298 As can be seen, this coding is ``lossy'' in the sense that we do not |
|
299 record explicitly character values and also not sequence values (for |
|
300 them we just append two bitsequences). We do, however, record the |
|
301 different alternatives for @{text Left}, respectively @{text Right}, as @{text Z} and |
|
302 @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate |
|
303 if there is still a value coming in the list of @{text Stars}, whereas @{text S} |
|
304 indicates the end of the list. The lossiness makes the process of |
|
305 decoding a bit more involved, but the point is that if we have a |
|
306 regular expression \emph{and} a bitsequence of a corresponding value, |
|
307 then we can always decode the value accurately. The decoding can be |
|
308 defined by using two functions called $\textit{decode}'$ and |
|
309 \textit{decode}: |
|
310 |
|
311 \begin{center} |
|
312 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}} |
|
313 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
|
314 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
|
315 $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
|
316 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; |
|
317 (\Left\,v, bs_1)$\\ |
|
318 $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
|
319 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; |
|
320 (\Right\,v, bs_1)$\\ |
|
321 $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & |
|
322 $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ |
|
323 & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$ |
|
324 \hspace{2mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ |
|
325 $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ |
|
326 $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & |
|
327 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
|
328 & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$ |
|
329 \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ |
|
330 $\textit{decode}\,bs\,r$ & $\dn$ & |
|
331 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
|
332 & & \hspace{7mm}$\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
|
333 \textit{else}\;\textit{None}$ |
|
334 \end{tabular} |
|
335 \end{center} |
|
336 |
|
337 \noindent |
|
338 The function \textit{decode} checks whether all of the bitsequence is |
|
339 consumed and returns the corresponding value as @{term "Some v"}; otherwise |
|
340 it fails with @{text "None"}. We can establish that for a value $v$ |
|
341 inhabited by a regular expression $r$, the decoding of its |
|
342 bitsequence never fails. |
|
343 |
|
344 \begin{lemma}\label{codedecode}\it |
|
345 If $\;\vdash v : r$ then |
|
346 $\;\textit{decode}\,(\textit{code}\, v)\,r = \textit{Some}\, v$. |
|
347 \end{lemma} |
|
348 |
|
349 \begin{proof} |
|
350 This follows from the property that |
|
351 $\textit{decode}'\,((\textit{code}\,v) \,@\, bs)\,r = (v, bs)$ holds |
|
352 for any bit-sequence $bs$ and $\vdash v : r$. This property can be |
|
353 easily proved by induction on $\vdash v : r$. |
|
354 \end{proof} |
|
355 |
|
356 Sulzmann and Lu define the function \emph{internalise} |
|
357 in order to transform standard regular expressions into annotated |
|
358 regular expressions. We write this operation as $r^\uparrow$. |
|
359 This internalisation uses the following |
|
360 \emph{fuse} function. |
|
361 |
|
362 \begin{center} |
|
363 \begin{tabular}{lcl} |
|
364 $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\ |
|
365 $\textit{fuse}\,bs\,(\textit{ONE}\,bs')$ & $\dn$ & |
|
366 $\textit{ONE}\,(bs\,@\,bs')$\\ |
|
367 $\textit{fuse}\,bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ & |
|
368 $\textit{CHAR}\,(bs\,@\,bs')\,c$\\ |
|
369 $\textit{fuse}\,bs\,(\textit{ALTs}\,bs'\,rs)$ & $\dn$ & |
|
370 $\textit{ALTs}\,(bs\,@\,bs')\,rs$\\ |
|
371 $\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,r_1\,r_2)$ & $\dn$ & |
|
372 $\textit{SEQ}\,(bs\,@\,bs')\,r_1\,r_2$\\ |
|
373 $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,r)$ & $\dn$ & |
|
374 $\textit{STAR}\,(bs\,@\,bs')\,r$ |
|
375 \end{tabular} |
|
376 \end{center} |
|
377 |
|
378 \noindent |
|
379 A regular expression can then be \emph{internalised} into a bitcoded |
|
380 regular expression as follows. |
|
381 |
|
382 \begin{center} |
|
383 \begin{tabular}{lcl} |
|
384 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\ |
|
385 $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\ |
|
386 $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\ |
|
387 $(r_1 + r_2)^\uparrow$ & $\dn$ & |
|
388 $\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\, |
|
389 (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\ |
|
390 $(r_1\cdot r_2)^\uparrow$ & $\dn$ & |
|
391 $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\ |
|
392 $(r^*)^\uparrow$ & $\dn$ & |
|
393 $\textit{STAR}\;[]\,r^\uparrow$\\ |
|
394 \end{tabular} |
|
395 \end{center} |
|
396 |
|
397 \noindent |
|
398 There is also an \emph{erase}-function, written $a^\downarrow$, which |
|
399 transforms a bitcoded regular expression into a (standard) regular |
|
400 expression by just erasing the annotated bitsequences. We omit the |
|
401 straightforward definition. For defining the algorithm, we also need |
|
402 the functions \textit{bnullable} and \textit{bmkeps}, which are the |
|
403 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on |
|
404 bitcoded regular expressions, instead of regular expressions. |
|
405 |
|
406 \begin{center} |
|
407 \begin{tabular}{lcl} |
|
408 $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{false}$ \textbf{fix}\\ |
|
409 $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{true}$\\ |
|
410 $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{false}$\\ |
|
411 $\textit{bnullable}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ & |
|
412 $\textit{bnullable}\,a_1\vee \textit{bnullable}\,a_2$\\ |
|
413 $\textit{bnullable}\,(\textit{SEQ}\,bs\,a_1\,a_2)$ & $\dn$ & |
|
414 $\textit{bnullable}\,a_1\wedge \textit{bnullable}\,a_2$\\ |
|
415 $\textit{bnullable}\,(\textit{STAR}\,bs\,a)$ & $\dn$ & |
|
416 $\textit{true}$ |
|
417 \end{tabular} |
|
418 \end{center} |
|
419 |
|
420 \begin{center} |
|
421 \begin{tabular}{lcl} |
|
422 $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$ \textbf{fix}\\ |
|
423 $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ & |
|
424 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
425 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\ |
|
426 & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\ |
|
427 $\textit{bmkeps}\,(\textit{SEQ}\,bs\,a_1\,a_2)$ & $\dn$ & |
|
428 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ |
|
429 $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ & |
|
430 $bs \,@\, [\S]$ |
|
431 \end{tabular} |
|
432 \end{center} |
|
433 |
|
434 |
|
435 \noindent |
|
436 The key function in the bitcoded algorithm is the derivative of an |
|
437 annotated regular expression. This derivative calculates the |
|
438 derivative but at the same time also the incremental part that |
|
439 contributes to constructing a value. |
|
440 |
|
441 \begin{center} |
|
442 \begin{tabular}{@ {}lcl@ {}} |
|
443 $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$ \textbf{fix}\\ |
|
444 $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
445 $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ & |
|
446 $\textit{if}\;c=d\; \;\textit{then}\; |
|
447 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ |
|
448 $(\textit{ALT}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ & |
|
449 $\textit{ALT}\,bs\,(a_1\backslash c)\,(a_2\backslash c)$\\ |
|
450 $(\textit{SEQ}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ & |
|
451 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
452 & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\backslash c)\,a_2)$\\ |
|
453 & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\backslash c))$\\ |
|
454 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\backslash c)\,a_2$\\ |
|
455 $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ & |
|
456 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\, |
|
457 (\textit{STAR}\,[]\,r)$ |
|
458 \end{tabular} |
|
459 \end{center} |
|
460 |
|
461 |
|
462 \noindent |
|
463 This function can also be extended to strings, written $a\backslash s$, |
|
464 just like the standard derivative. We omit the details. Finally we |
|
465 can define Sulzmann and Lu's bitcoded lexer, which we call \textit{blexer}: |
|
466 |
|
467 \noindent |
|
468 This bitcoded lexer first internalises the regular expression $r$ and then |
|
469 builds the annotated derivative according to $s$. If the derivative is |
|
470 nullable, then it extracts the bitcoded value using the |
|
471 $\textit{bmkeps}$ function. Finally it decodes the bitcoded value. If |
|
472 the derivative is \emph{not} nullable, then $\textit{None}$ is |
|
473 returned. The task is to show that this way of calculating a value |
|
474 generates the same result as with \textit{lexer}. |
|
475 |
|
476 Before we can proceed we need to define a function, called |
|
477 \textit{retrieve}, which Sulzmann and Lu introduced for the proof. |
|
478 |
|
479 \textbf{fix} |
|
480 |
|
481 \noindent |
|
482 The idea behind this function is to retrieve a possibly partial |
|
483 bitcode from an annotated regular expression, where the retrieval is |
|
484 guided by a value. For example if the value is $\Left$ then we |
|
485 descend into the left-hand side of an alternative (annotated) regular |
|
486 expression in order to assemble the bitcode. Similarly for |
|
487 $\Right$. The property we can show is that for a given $v$ and $r$ |
|
488 with $\vdash v : r$, the retrieved bitsequence from the internalised |
|
489 regular expression is equal to the bitcoded version of $v$. |
|
490 |
|
491 \begin{lemma}\label{retrievecode} |
|
492 If $\vdash v : r$ then $\textit{code}\, v = \textit{retrieve}\,(r^\uparrow)\,v$. |
|
493 \end{lemma} |
|
494 |
|
495 *} |
|
496 |
|
497 text {* |
288 There is also a corresponding decoding function that takes a bitsequence |
498 There is also a corresponding decoding function that takes a bitsequence |
289 and generates back a value. However, since the bitsequences are a ``lossy'' |
499 and generates back a value. However, since the bitsequences are a ``lossy'' |
290 coding (@{term Seq}s are not coded) the decoding function depends also |
500 coding (@{term Seq}s are not coded) the decoding function depends also |
291 on a regular expression in order to decode values. |
501 on a regular expression in order to decode values. |
292 |
502 |
293 \begin{center} |
503 |
294 \begin{tabular}{lcll} |
|
295 %@{thm (lhs) decode'.simps(1)} & $\dn$ & @{thm (rhs) decode'.simps(1)}\\ |
|
296 @{thm (lhs) decode'.simps(2)} & $\dn$ & @{thm (rhs) decode'.simps(2)}\\ |
|
297 @{thm (lhs) decode'.simps(3)} & $\dn$ & @{thm (rhs) decode'.simps(3)}\\ |
|
298 @{thm (lhs) decode'.simps(4)} & $\dn$ & @{thm (rhs) decode'.simps(4)}\\ |
|
299 @{thm (lhs) decode'.simps(5)} & $\dn$ & @{thm (rhs) decode'.simps(5)}\\ |
|
300 @{thm (lhs) decode'.simps(6)} & $\dn$ & @{thm (rhs) decode'.simps(6)}\\ |
|
301 @{thm (lhs) decode'.simps(7)} & $\dn$ & @{thm (rhs) decode'.simps(7)}\\ |
|
302 @{thm (lhs) decode'.simps(8)} & $\dn$ & @{thm (rhs) decode'.simps(8)}\\ |
|
303 @{thm (lhs) decode'.simps(9)} & $\dn$ & @{thm (rhs) decode'.simps(9)}\\ |
|
304 @{thm (lhs) decode'.simps(10)} & $\dn$ & @{thm (rhs) decode'.simps(10)} & fix |
|
305 \end{tabular} |
|
306 \end{center} |
|
307 |
504 |
308 |
505 |
309 The idea of the bitcodes is to annotate them to regular expressions and generate values |
506 The idea of the bitcodes is to annotate them to regular expressions and generate values |
310 incrementally. The bitcodes can be read off from the @{text breg} and then decoded into a value. |
507 incrementally. The bitcodes can be read off from the @{text breg} and then decoded into a value. |
311 |
508 |