235 $\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\ |
239 $\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\ |
236 $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$ |
240 $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$ |
237 \end{tabular} |
241 \end{tabular} |
238 \end{center} |
242 \end{center} |
239 \noindent |
243 \noindent |
240 $\textit{Rerase}$ throws away the bitcodes on the annotated regular expressions, |
244 $\textit{Rerase}$ throws away the bitcodes |
|
245 on the annotated regular expressions, |
241 but keeps everything else intact. |
246 but keeps everything else intact. |
242 |
247 Therefore it does not change the size |
243 Before we introduce more functions related to r-regular expressions, |
248 of an annotated regular expression: |
244 we first give out the reason why we take all the trouble |
249 \begin{lemma}\label{rsizeAsize} |
245 defining a new datatype in the first place. |
|
246 We could calculate the size of annotated regular expressions in terms of |
|
247 their un-annotated $\rrexp$ counterpart: |
|
248 \begin{lemma} |
|
249 $\rsize{\rerase a} = \asize a$ |
250 $\rsize{\rerase a} = \asize a$ |
250 \end{lemma} |
251 \end{lemma} |
251 \begin{proof} |
252 \begin{proof} |
252 By routine structural induction on $a$. |
253 By routine structural induction on $a$. |
253 \end{proof} |
254 \end{proof} |
254 \noindent |
255 \noindent |
|
256 |
255 \subsection{Motivation Behind a New Datatype} |
257 \subsection{Motivation Behind a New Datatype} |
256 The main difference between a plain regular expression |
258 The reason we take all the trouble |
257 and an r-regular expression is that it can take |
259 defining a new datatype is that $\erase$ makes things harder. |
258 non-binary arguments for its alternative constructor. |
|
259 This turned out to be necessary if we want our proofs to be |
|
260 simple. |
|
261 We initially started by using |
260 We initially started by using |
262 plain regular expressions and tried to prove |
261 plain regular expressions and tried to prove |
263 equalities like |
262 the lemma \ref{rsizeAsize}, |
264 \begin{center} |
263 however the $\erase$ function unavoidbly messes with the structure of the |
265 $\llbracket a \rrbracket = \llbracket a\downarrow \rrbracket_p$ |
264 annotated regular expression. |
266 \end{center} |
265 The $+$ constructor |
267 and |
266 of basic regular expressions is binary whereas $\sum$ |
268 \[ |
267 takes a list, and one has to convert between them: |
269 \llbracket a \backslash_{bsimps} s \rrbracket = |
268 \begin{center} |
270 \llbracket a \downarrow \backslash_s s |
269 \begin{tabular}{ccc} |
271 \] |
270 $\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\ |
272 One might be able to prove that |
271 $\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\ |
273 $\llbracket a \downarrow \rrbracket_p \leq \llbracket a \rrbracket$. |
272 $\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$ |
274 $\rrexp$ give the exact correspondence between an annotated regular expression |
273 \end{tabular} |
275 and its (r-)erased version: |
274 \end{center} |
276 |
275 \noindent |
277 This does not hold for plain $\rexp$s. |
276 An alternative regular expression with an empty argument list |
278 |
277 will be turned into a $\ZERO$. |
279 |
278 The singleton alternative $\sum [r]$ would have $r$ during the |
280 These operations are |
279 $\erase$ function. |
281 almost identical to those of the annotated regular expressions, |
280 The annotated regular expression $\sum[a, b, c]$ would turn into |
282 except that no bitcodes are attached. |
281 $(a+(b+c))$. |
283 Of course, the bits which encode the lexing information would grow linearly with respect |
282 All these operations change the size and structure of |
284 to the input, which should be taken into account when we wish to tackle the runtime comlexity. |
283 an annotated regular expressions, adding unnecessary |
285 But at the current stage |
284 complexities to the size bound proof.\\ |
286 we can safely ignore them. |
285 For example, if we define the size of a basic regular expression |
287 Similarly there is a size function for plain regular expressions: |
286 in the usual way, |
288 \begin{center} |
287 \begin{center} |
289 \begin{tabular}{ccc} |
288 \begin{tabular}{ccc} |
290 $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ |
289 $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ |
291 $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\ |
290 $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\ |
292 $\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\ |
291 $\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\ |
293 $\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\ |
292 $\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\ |
294 $\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\ |
293 $\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\ |
295 $\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$ |
294 $\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$ |
296 \end{tabular} |
295 \end{tabular} |
297 \end{center} |
296 \end{center} |
298 |
297 \noindent |
299 \noindent |
298 Then the property |
300 The idea of obatining a bound for $\llbracket \bderssimp{a}{s} \rrbracket$ |
299 \begin{center} |
301 is to get an equivalent form |
300 $\llbracket a \rrbracket = \llbracket a_\downarrow \rrbracket_p$ |
302 of something like $\llbracket \bderssimp{a}{s}\rrbracket = f(a, s)$, where $f(a, s)$ |
301 \end{center} |
303 is easier to estimate than $\llbracket \bderssimp{a}{s}\rrbracket$. |
302 does not hold. |
304 We notice that while it is not so clear how to obtain |
303 One might be able to prove an inequality such as |
305 a metamorphic representation of $\bderssimp{a}{s}$ (as we argued in chapter \ref{Bitcoded2}, |
304 $\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p $ |
306 not interleaving the application of the functions $\backslash$ and $\bsimp{\_}$ |
305 and then estimate $\llbracket a_\downarrow \rrbracket_p$, |
307 in the order as our lexer will result in the bit-codes dispensed differently), |
306 but we found our approach more straightforward.\\ |
308 it is possible to get an slightly different representation of the unlifted versions: |
307 |
309 $ (\bderssimp{a}{s})_\downarrow = (\erase \; \bsimp{a \backslash s})_\downarrow$. |
308 \subsection{Lexing Related Functions for $\rrexp$ such as $\backslash_r$, $\rdistincts$, and $\rsimp$} |
310 This suggest setting the bounding function $f(a, s)$ as |
309 The operations on r-regular expressions are |
311 $\llbracket (a \backslash s)_\downarrow \rrbracket_p$, the plain size |
310 almost identical to those of the annotated regular expressions, |
312 of the erased annotated regular expression. |
311 except that no bitcodes are used. For example, |
313 This requires the the regular expression accompanied by bitcodes |
312 the derivative operation becomes simpler:\\ |
314 to have the same size as its plain counterpart after erasure: |
|
315 \begin{center} |
|
316 $\asize{a} \stackrel{?}{=} \llbracket \erase(a)\rrbracket_p$. |
|
317 \end{center} |
|
318 \noindent |
|
319 But there is a minor nuisance: |
|
320 the erase function unavoidbly messes with the structure of the regular expression, |
|
321 due to the discrepancy between annotated regular expression's $\sum$ constructor |
|
322 and plain regular expression's $+$ constructor having different arity. |
|
323 \begin{center} |
|
324 \begin{tabular}{ccc} |
|
325 $\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\ |
|
326 $\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\ |
|
327 $\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$ |
|
328 \end{tabular} |
|
329 \end{center} |
|
330 \noindent |
|
331 An alternative regular expression with an empty list of children |
|
332 is turned into a $\ZERO$ during the |
|
333 $\erase$ function, thereby changing the size and structure of the regex. |
|
334 Therefore the equality in question does not hold. |
|
335 |
|
336 These will likely be fixable if we really want to use plain $\rexp$s for dealing |
|
337 with size, but we choose a more straightforward (or stupid) method by |
|
338 |
|
339 Similarly we could define the derivative and simplification on |
|
340 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, |
|
341 except that now they can operate on alternatives taking multiple arguments. |
|
342 |
|
343 \begin{center} |
|
344 \begin{tabular}{lcr} |
|
345 $(\RALTS{rs})\; \backslash c$ & $\dn$ & $\RALTS{\map\; (\_ \backslash c) \;rs}$\\ |
|
346 (other clauses omitted) |
|
347 With the new $\rrexp$ datatype in place, one can define its size function, |
|
348 which precisely mirrors that of the annotated regular expressions: |
|
349 \end{tabular} |
|
350 \end{center} |
|
351 \noindent |
|
352 \begin{center} |
|
353 \begin{tabular}{ccc} |
|
354 $\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\ |
|
355 $\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\ |
|
356 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\ |
|
357 $\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\ |
|
358 $\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\ |
|
359 $\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$ |
|
360 \end{tabular} |
|
361 \end{center} |
|
362 \noindent |
|
363 |
|
364 \subsection{Lexing Related Functions for $\rrexp$} |
|
365 Everything else for $\rrexp$ will be precisely the same for annotated expressions, |
|
366 except that they do not involve rectifying and augmenting bit-encoded tokenization information. |
|
367 As expected, most functions are simpler, such as the derivative: |
|
368 \begin{center} |
313 \begin{center} |
369 \begin{tabular}{@{}lcl@{}} |
314 \begin{tabular}{@{}lcl@{}} |
370 $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ |
315 $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ |
371 $(\ONE)\,\backslash_r c$ & $\dn$ & |
316 $(\ONE)\,\backslash_r c$ & $\dn$ & |
372 $\textit{if}\;c=d\; \;\textit{then}\; |
317 $\textit{if}\;c=d\; \;\textit{then}\; |
373 \ONE\;\textit{else}\;\ZERO$\\ |
318 \ONE\;\textit{else}\;\ZERO$\\ |
374 $(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ & |
319 $(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ & |
375 $\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\ |
320 $\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\ |
376 $(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ & |
321 $(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ & |
377 $\textit{if}\;\textit{rnullable}\,r_1$\\ |
322 $\textit{if}\;(\textit{rnullable}\,r_1)$\\ |
378 & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\ |
323 & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\ |
379 & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\ |
324 & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\ |
380 & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\ |
325 & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\ |
381 $(r^*)\,\backslash_r c$ & $\dn$ & |
326 $(r^*)\,\backslash_r c$ & $\dn$ & |
382 $( r\,\backslash_r c)\cdot |
327 $( r\,\backslash_r c)\cdot |
383 (_{[]}r^*))$ |
328 (_{[]}r^*))$ |
384 \end{tabular} |
329 \end{tabular} |
385 \end{center} |
330 \end{center} |
386 \noindent |
331 \noindent |
387 The simplification function is simplified without annotation causing superficial differences. |
332 Similarly, $\distinctBy$ does not need |
388 Duplicate removal without an equivalence relation: |
333 a function checking equivalence because |
|
334 there are no bit annotations causing superficial differences |
|
335 between syntactically equal terms. |
389 \begin{center} |
336 \begin{center} |
390 \begin{tabular}{lcl} |
337 \begin{tabular}{lcl} |
391 $\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\ |
338 $\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\ |
392 $\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ |
339 $\rdistinct{r :: rs}{rset}$ & $\dn$ & |
393 & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ |
340 $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ |
|
341 & & $\textit{else}\; \; |
|
342 r::\rdistinct{rs}{(rset \cup \{r\})}$ |
394 \end{tabular} |
343 \end{tabular} |
395 \end{center} |
344 \end{center} |
396 %TODO: definition of rsimp (maybe only the alternative clause) |
345 %TODO: definition of rsimp (maybe only the alternative clause) |
397 \noindent |
346 \noindent |
398 The prefix $r$ in front of $\rdistinct{}{}$ is used mainly to |
347 Notice there is a difference between our $\rdistincts$ and |
399 differentiate with $\textit{distinct}$, which is a built-in predicate |
348 the Isabelle $\textit {distinct}$ function. |
400 in Isabelle that says all the elements of a list are unique. |
349 In Isabelle $\textit{distinct}$ is a predicate |
|
350 that tests if all the elements of a list are unique.\\ |
401 With $\textit{rdistinct}$ one can chain together all the other modules |
351 With $\textit{rdistinct}$ one can chain together all the other modules |
402 of $\bsimp{\_}$ (removing the functionalities related to bit-sequences) |
352 of $\bsimp{\_}$ (removing the functionalities related to bit-sequences) |
403 and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$. |
353 and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$. |
404 We omit these functions, as they are routine. Please refer to the formalisation |
354 We omit these functions, as they are routine. Please refer to the formalisation |
405 (in file BasicIdentities.thy) for the exact definition. |
355 (in file BasicIdentities.thy) for the exact definition. |