ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 594 62f8fa03863e
parent 593 83fab852d72d
child 595 fa92124d1fb7
equal deleted inserted replaced
593:83fab852d72d 594:62f8fa03863e
   183 	\item
   183 	\item
   184 		We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
   184 		We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
   185 		The key observation is that $\distinctBy$'s output is
   185 		The key observation is that $\distinctBy$'s output is
   186 		a list with a constant length bound.
   186 		a list with a constant length bound.
   187 \end{itemize}
   187 \end{itemize}
   188 We give details of these steps in the next sections.
   188 We will expand on these steps in the next sections.\\
   189 The first step is to use 
       
   190 $\textit{rrexp}$s,
       
   191 something simpler than
       
   192 annotated regular expressions. 
       
   193 
   189 
   194 \section{The $\textit{Rrexp}$ Datatype and Its Lexing-Related Functions}
   190 \section{The $\textit{Rrexp}$ Datatype and Its Lexing-Related Functions}
   195 We first recap the definition of the new datatype $\rrexp$, called
   191 The first step is to define 
       
   192 $\textit{rrexp}$s.
       
   193 They are without bitcodes,
       
   194 allowing a much simpler size bound proof.
       
   195 Of course, the bits which encode the lexing information 
       
   196 would grow linearly with respect 
       
   197 to the input, which should be taken into account when we wish to tackle the runtime comlexity.
       
   198 But for the sake of the structural size 
       
   199 we can safely ignore them.\\
       
   200 To recapitulate, the datatype 
       
   201 definition of the $\rrexp$, called
   196 \emph{r-regular expressions},
   202 \emph{r-regular expressions},
   197 which we first defined in \ref{rrexpDef}.
   203 was initially defined in \ref{rrexpDef}.
   198 R-regular expressions are similar to basic regular expressions.
   204 The reason for the prefix $r$ is
   199 We call them \emph{r}-regular expressions
       
   200 to make a distinction  
   205 to make a distinction  
   201 with plain regular expressions.
   206 with basic regular expressions.
   202 \[			\rrexp ::=   \RZERO \mid  \RONE
   207 \[			\rrexp ::=   \RZERO \mid  \RONE
   203 	\mid  \RCHAR{c}  
   208 	\mid  \RCHAR{c}  
   204 	\mid  \RSEQ{r_1}{r_2}
   209 	\mid  \RSEQ{r_1}{r_2}
   205 	\mid  \RALTS{rs}
   210 	\mid  \RALTS{rs}
   206 	\mid \RSTAR{r}        
   211 	\mid \RSTAR{r}        
   220 \end{center}
   225 \end{center}
   221 \noindent
   226 \noindent
   222 The $r$ in the subscript of $\llbracket \rrbracket_r$ is to 
   227 The $r$ in the subscript of $\llbracket \rrbracket_r$ is to 
   223 differentiate with the same operation for annotated regular expressions.
   228 differentiate with the same operation for annotated regular expressions.
   224 Adding $r$ as subscript will be used in 
   229 Adding $r$ as subscript will be used in 
   225 other operations as well.
   230 other operations as well.\\
   226 
       
   227 The transformation from an annotated regular expression
   231 The transformation from an annotated regular expression
   228 to an r-regular expression is straightforward.
   232 to an r-regular expression is straightforward.
   229 \begin{center}
   233 \begin{center}
   230 	\begin{tabular}{lcl}
   234 	\begin{tabular}{lcl}
   231 		$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
   235 		$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
   235 		$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
   239 		$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
   236 		$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$
   240 		$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$
   237 	\end{tabular}
   241 	\end{tabular}
   238 \end{center}
   242 \end{center}
   239 \noindent
   243 \noindent
   240 $\textit{Rerase}$ throws away the bitcodes on the annotated regular expressions, 
   244 $\textit{Rerase}$ throws away the bitcodes 
       
   245 on the annotated regular expressions, 
   241 but keeps everything else intact.
   246 but keeps everything else intact.
   242 
   247 Therefore it does not change the size
   243 Before we introduce more functions related to r-regular expressions,
   248 of an annotated regular expression:
   244 we first give out the reason why we take all the trouble 
   249 \begin{lemma}\label{rsizeAsize}
   245 defining a new datatype in the first place.
       
   246 We could calculate the size of annotated regular expressions in terms of
       
   247 their un-annotated $\rrexp$ counterpart: 
       
   248 \begin{lemma}
       
   249 	$\rsize{\rerase a} = \asize a$
   250 	$\rsize{\rerase a} = \asize a$
   250 \end{lemma}
   251 \end{lemma}
   251 \begin{proof}
   252 \begin{proof}
   252 	By routine structural induction on $a$.
   253 	By routine structural induction on $a$.
   253 \end{proof}
   254 \end{proof}
   254 \noindent
   255 \noindent
       
   256 
   255 \subsection{Motivation Behind a New Datatype}
   257 \subsection{Motivation Behind a New Datatype}
   256 The main difference between a plain regular expression
   258 The reason we take all the trouble 
   257 and an r-regular expression is that it can take
   259 defining a new datatype is that $\erase$ makes things harder.
   258 non-binary arguments for its alternative constructor.
       
   259 This turned out to be necessary if we want our proofs to be
       
   260 simple.
       
   261 We initially started by using 
   260 We initially started by using 
   262 plain regular expressions and tried to prove
   261 plain regular expressions and tried to prove
   263 equalities like 
   262 the lemma \ref{rsizeAsize},
   264 \begin{center}
   263 however the $\erase$ function unavoidbly messes with the structure of the 
   265 	$\llbracket a \rrbracket = \llbracket a\downarrow \rrbracket_p$
   264 annotated regular expression.
   266 \end{center}
   265 The $+$ constructor
   267 and 
   266 of basic regular expressions is binary whereas $\sum$ 
   268 \[
   267 takes a list, and one has to convert between them:
   269 	\llbracket a \backslash_{bsimps} s \rrbracket = 
   268 \begin{center}
   270 	\llbracket a \downarrow \backslash_s s
   269 	\begin{tabular}{ccc}
   271 \]
   270 		$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
   272 One might be able to prove that 
   271 		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
   273 $\llbracket a \downarrow \rrbracket_p \leq \llbracket a \rrbracket$.
   272 		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
   274 $\rrexp$ give the exact correspondence between an annotated regular expression
   273 	\end{tabular}
   275 and its (r-)erased version:
   274 \end{center}
   276 
   275 \noindent
   277 This does not hold for plain $\rexp$s. 
   276 An alternative regular expression with an empty argument list
   278 
   277 will be turned into a $\ZERO$.
   279 
   278 The singleton alternative $\sum [r]$ would have $r$ during the
   280 These operations are 
   279 $\erase$ function.
   281 almost identical to those of the annotated regular expressions,
   280 The  annotated regular expression $\sum[a, b, c]$ would turn into
   282 except that no bitcodes are attached.
   281 $(a+(b+c))$.
   283 Of course, the bits which encode the lexing information would grow linearly with respect 
   282 All these operations change the size and structure of
   284 to the input, which should be taken into account when we wish to tackle the runtime comlexity.
   283 an annotated regular expressions, adding unnecessary 
   285 But at the current stage 
   284 complexities to the size bound proof.\\
   286 we can safely ignore them.
   285 For example, if we define the size of a basic regular expression 
   287 Similarly there is a size function for plain regular expressions:
   286 in the usual way,
   288 \begin{center}
   287 \begin{center}
   289 	\begin{tabular}{ccc}
   288 	\begin{tabular}{ccc}
   290 		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
   289 		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
   291 		$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
   290 		$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
   292 		$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
   291 		$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
   293 		$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
   292 		$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
   294 		$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
   293 		$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
   295 		$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$
   294 		$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$
   296 	\end{tabular}
   295 	\end{tabular}
   297 \end{center}
   296 \end{center}
   298 
   297 \noindent
   299 \noindent
   298 Then the property
   300 The idea of obatining a bound for $\llbracket \bderssimp{a}{s} \rrbracket$
   299 \begin{center}
   301 is to get an equivalent form
   300 	$\llbracket a \rrbracket = \llbracket a_\downarrow \rrbracket_p$
   302 of something like $\llbracket \bderssimp{a}{s}\rrbracket = f(a, s)$, where $f(a, s)$ 
   301 \end{center}
   303 is easier to estimate than $\llbracket \bderssimp{a}{s}\rrbracket$.
   302 does not hold.
   304 We notice that while it is not so clear how to obtain
   303 One might be able to prove an inequality such as
   305 a metamorphic representation of $\bderssimp{a}{s}$ (as we argued in chapter \ref{Bitcoded2},
   304 $\llbracket a \rrbracket  \leq \llbracket  a_\downarrow \rrbracket_p $
   306 not interleaving the application of the functions $\backslash$ and $\bsimp{\_}$ 
   305 and then estimate $\llbracket  a_\downarrow \rrbracket_p$,
   307 in the order as our lexer will result in the bit-codes dispensed differently),
   306 but we found our approach more straightforward.\\
   308 it is possible to get an slightly different representation of the unlifted versions:
   307 
   309 $ (\bderssimp{a}{s})_\downarrow = (\erase \; \bsimp{a \backslash s})_\downarrow$.
   308 \subsection{Lexing Related Functions for $\rrexp$ such as $\backslash_r$, $\rdistincts$, and $\rsimp$}
   310 This suggest setting the bounding function $f(a, s)$ as 
   309 The operations on r-regular expressions are 
   311 $\llbracket  (a \backslash s)_\downarrow \rrbracket_p$, the plain size
   310 almost identical to those of the annotated regular expressions,
   312 of the erased annotated regular expression.
   311 except that no bitcodes are used. For example,
   313 This requires the the regular expression accompanied by bitcodes
   312 the derivative operation becomes simpler:\\
   314 to have the same size as its plain counterpart after erasure:
       
   315 \begin{center}
       
   316 	$\asize{a} \stackrel{?}{=} \llbracket \erase(a)\rrbracket_p$. 
       
   317 \end{center}
       
   318 \noindent
       
   319 But there is a minor nuisance: 
       
   320 the erase function unavoidbly messes with the structure of the regular expression,
       
   321 due to the discrepancy between annotated regular expression's $\sum$ constructor
       
   322 and plain regular expression's $+$ constructor having different arity.
       
   323 \begin{center}
       
   324 	\begin{tabular}{ccc}
       
   325 		$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
       
   326 		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
       
   327 		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
       
   328 	\end{tabular}
       
   329 \end{center}
       
   330 \noindent
       
   331 An alternative regular expression with an empty list of children
       
   332 is turned into a $\ZERO$ during the
       
   333 $\erase$ function, thereby changing the size and structure of the regex.
       
   334 Therefore the equality in question does not hold.
       
   335 
       
   336 These will likely be fixable if we really want to use plain $\rexp$s for dealing
       
   337 with size, but we choose a more straightforward (or stupid) method by 
       
   338 
       
   339 Similarly we could define the derivative  and simplification on 
       
   340 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
       
   341 except that now they can operate on alternatives taking multiple arguments.
       
   342 
       
   343 \begin{center}
       
   344 	\begin{tabular}{lcr}
       
   345 		$(\RALTS{rs})\; \backslash c$ & $\dn$ &  $\RALTS{\map\; (\_ \backslash c) \;rs}$\\
       
   346 		(other clauses omitted)
       
   347 		With the new $\rrexp$ datatype in place, one can define its size function,
       
   348 		which precisely mirrors that of the annotated regular expressions:
       
   349 	\end{tabular}
       
   350 \end{center}
       
   351 \noindent
       
   352 \begin{center}
       
   353 	\begin{tabular}{ccc}
       
   354 		$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
       
   355 		$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
       
   356 		$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
       
   357 		$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
       
   358 		$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as   + 1$\\
       
   359 		$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$
       
   360 	\end{tabular}
       
   361 \end{center}
       
   362 \noindent
       
   363 
       
   364 \subsection{Lexing Related Functions for $\rrexp$}
       
   365 Everything else for $\rrexp$ will be precisely the same for annotated expressions,
       
   366 except that they do not involve rectifying and augmenting bit-encoded tokenization information.
       
   367 As expected, most functions are simpler, such as the derivative:
       
   368 \begin{center}
   313 \begin{center}
   369 	\begin{tabular}{@{}lcl@{}}
   314 	\begin{tabular}{@{}lcl@{}}
   370 		$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\  
   315 		$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\  
   371 		$(\ONE)\,\backslash_r c$ & $\dn$ &
   316 		$(\ONE)\,\backslash_r c$ & $\dn$ &
   372 		$\textit{if}\;c=d\; \;\textit{then}\;
   317 		$\textit{if}\;c=d\; \;\textit{then}\;
   373 		\ONE\;\textit{else}\;\ZERO$\\  
   318 		\ONE\;\textit{else}\;\ZERO$\\  
   374 		$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
   319 		$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
   375 		$\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
   320 		$\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
   376 		$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
   321 		$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
   377 		$\textit{if}\;\textit{rnullable}\,r_1$\\
   322 		$\textit{if}\;(\textit{rnullable}\,r_1)$\\
   378 						 & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
   323 						 & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
   379 						 & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
   324 						 & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
   380 						 & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
   325 						 & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
   381 		$(r^*)\,\backslash_r c$ & $\dn$ &
   326 		$(r^*)\,\backslash_r c$ & $\dn$ &
   382 		$( r\,\backslash_r c)\cdot
   327 		$( r\,\backslash_r c)\cdot
   383 		(_{[]}r^*))$
   328 		(_{[]}r^*))$
   384 	\end{tabular}    
   329 	\end{tabular}    
   385 \end{center}  
   330 \end{center}  
   386 \noindent
   331 \noindent
   387 The simplification function is simplified without annotation causing superficial differences.
   332 Similarly, $\distinctBy$ does not need 
   388 Duplicate removal without  an equivalence relation:
   333 a function checking equivalence because
       
   334 there are no bit annotations causing superficial differences
       
   335 between syntactically equal terms.
   389 \begin{center}
   336 \begin{center}
   390 	\begin{tabular}{lcl}
   337 	\begin{tabular}{lcl}
   391 		$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
   338 		$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
   392 		$\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
   339 		$\rdistinct{r :: rs}{rset}$ & $\dn$ & 
   393 					    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
   340 		$\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
       
   341 					    &        & $\textit{else}\; \;
       
   342 					    r::\rdistinct{rs}{(rset \cup \{r\})}$
   394 	\end{tabular}
   343 	\end{tabular}
   395 \end{center}
   344 \end{center}
   396 %TODO: definition of rsimp (maybe only the alternative clause)
   345 %TODO: definition of rsimp (maybe only the alternative clause)
   397 \noindent
   346 \noindent
   398 The prefix $r$ in front of $\rdistinct{}{}$ is used mainly to 
   347 Notice there is a difference between our $\rdistincts$ and
   399 differentiate with $\textit{distinct}$, which is a built-in predicate
   348 the Isabelle $\textit {distinct}$ function.
   400 in Isabelle that says all the elements of a list are unique.
   349 In Isabelle $\textit{distinct}$ is a predicate
       
   350 that tests if all the elements of a list are unique.\\
   401 With $\textit{rdistinct}$ one can chain together all the other modules
   351 With $\textit{rdistinct}$ one can chain together all the other modules
   402 of $\bsimp{\_}$ (removing the functionalities related to bit-sequences)
   352 of $\bsimp{\_}$ (removing the functionalities related to bit-sequences)
   403 and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$.
   353 and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$.
   404 We omit these functions, as they are routine. Please refer to the formalisation
   354 We omit these functions, as they are routine. Please refer to the formalisation
   405 (in file BasicIdentities.thy) for the exact definition.
   355 (in file BasicIdentities.thy) for the exact definition.