\chapter{A Better Bound and Other Extensions} % Main chapter title

\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
%algorithm to include constructs such as bounded repetitions and negations.
\lstset{style=myScalastyle}

This chapter is a ``miscellaneous''
chapter which records various
extensions to our formalisation of $\blexersimp$.\\
Firstly, we present further improvements
made to our lexer algorithm $\blexersimp$.
We devise a stronger simplification algorithm,
called $\bsimpStrong$, which can prune away
similar components in two regular expressions at the same
alternative level,
even if these regular expressions are not exactly the same.
We call the lexer that uses this stronger simplification function
$\blexerStrong$.
We conjecture that both
\begin{center}
	$\blexerStrong \;r \; s = \blexer\; r\;s$
\end{center}
and
\begin{center}
	$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
\end{center}
hold, but formalising
them is still work in progress.
We give reasons why the correctness and cubic size bound proofs
can be achieved
by exploring the connection between the internal
data structure of our $\blexerStrong$ and
Antimirov's partial derivatives.\\
Secondly, we extend our $\blexersimp$
to support bounded repetitions ($r^{\{n\}}$).
We update our formalisation of
the correctness and finiteness properties to
include this new construct. With bounded repetitions
we are able to out-compete other verified lexers such as
Verbatim++ on regular expressions which involve a lot of
repetitions. We also present the idempotency property proof
of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
This reinforces our claim that the fixpoint construction
originally required by Sulzmann and Lu can be removed in $\blexersimp$.\\
Last but not least, we present our efforts, and the challenges we met,
in further improving the algorithm with data
structures such as zippers.

%----------------------------------------------------------------------------------------
%	SECTION strongsimp
%----------------------------------------------------------------------------------------
\section{A Stronger Version of Simplification}
%TODO: search for isabelle proofs of algorithms that check equivalence 
In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
For example, the regular expression $aa \cdot a^*+ a \cdot a^*$ contains two terms,
which express the two possibilities it will match future input.
It will either match 2 $a$'s, then 0 or more $a$'s, in other words, at least 2 more $a$'s
\begin{figure}\label{string_3parts1}
\begin{center}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=3]
To implement this idea as an algorithm, we define a ``pruning function''
$\textit{prune}$, which takes away from a term the parts that already belong to
the accumulator $\textit{acc}$. Applying $\textit{prune}$ to every element of the
list gives a stronger version of $\textit{distinctBy}$,
which we call $\textit{distinctByPlus}$.
Here is a concise version of it (in the style of Scala):
\begin{figure}[H]
\begin{lstlisting}
def distinctByPlus(rs: List[ARexp], acc: Set[Rexp] = Set()) :
List[ARexp] = rs match {
  case Nil => Nil
  case r :: rs =>
    // a term whose erased form is already in acc is dropped entirely
    if (acc.contains(erase(r)))
      distinctByPlus(rs, acc)
    else {
      // otherwise prune the term and record its erased form in acc
      val pruned_r = prune(r, acc)
      pruned_r :: distinctByPlus(rs, acc + erase(pruned_r))
    }
}
\end{lstlisting}
\caption{The $\textit{distinctByPlus}$ function}
\end{figure}
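\noindent
To make the pruning idea concrete before discussing $\textit{prune}$ further,
here is a minimal sketch of it over plain (bitcode-free) regular expressions.
This is our own illustration under simplifying assumptions: the datatype and
the equation for sequences are chosen only for exposition, whereas the real
function operates on annotated regular expressions (\texttt{ARexp}) and has to
maintain bitcodes, which the sketch ignores.
\begin{lstlisting}
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// Simplified pruning: remove from r every part already in acc.
def prune(r: Rexp, acc: Set[Rexp]): Rexp = r match {
  // prune each branch of an alternative, dropping covered ones
  case ALTS(rs) => rs.map(prune(_, acc)).filter(_ != ZERO) match {
    case Nil       => ZERO      // all branches were covered by acc
    case r1 :: Nil => r1        // a single branch needs no ALTS
    case rs1       => ALTS(rs1)
  }
  // distribute over a sequence: (a+b)c is pruned branch-wise
  case SEQ(ALTS(rs), r2) => prune(ALTS(rs.map(SEQ(_, r2))), acc)
  // any other term is removed entirely if acc already has it
  case r1 => if (acc.contains(r1)) ZERO else r1
}
\end{lstlisting}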
But for the function $\textit{prune}$ there is a difficulty:
how could one define the $L$ function in a ``computable'' way,
so that it generates a (lazily evaluated, possibly infinite) set of strings in $L(r)$?
We would also need a function that decides whether $L(r_1) \subseteq L(r_2)$
holds.
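One could in principle enumerate $L(r)$ lazily; the following naive sketch
(our own illustration, reusing the plain-regex datatype from the sketch above)
indicates why this does not lead anywhere: without Kleene star the languages
here are finite, with star a fair (diagonal) enumeration would be needed, and
a subset test by enumeration can refute $L(r_1) \subseteq L(r_2)$ by finding a
counterexample but can never confirm it for infinite languages.
\begin{lstlisting}
// Naive enumeration of L(r) as a LazyList (finite fragment only);
// adding STAR would require a fair, diagonal enumeration.
def language(r: Rexp): LazyList[String] = r match {
  case ZERO        => LazyList()
  case ONE         => LazyList("")
  case CHR(c)      => LazyList(c.toString)
  case ALTS(rs)    => rs.to(LazyList).flatMap(language).distinct
  case SEQ(r1, r2) =>
    for (s1 <- language(r1); s2 <- language(r2)) yield s1 + s2
}
\end{lstlisting}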
to a bound better than Antimirov's cubic bound).
Rather, we do not try to eliminate every instance of regular expressions
that have ``language duplicates'', but only eliminate the ``exact duplicates''.
For this we need to break up terms such as $(a+b)\cdot c$ into the two
terms $a\cdot c$ and $b\cdot c$ before we add them to the accumulator:
\begin{figure}[H]
\begin{lstlisting}
def distinctWith(rs: List[ARexp],
                 pruneFunction: (ARexp, Set[Rexp]) => ARexp,
                 acc: Set[Rexp] = Set()) : List[ARexp] =
  rs match {
    case Nil => Nil
    case r :: rs =>
      // a term whose erased form is already in acc is dropped
      if (acc.contains(erase(r)))
        distinctWith(rs, pruneFunction, acc)
      else {
        // prune the term, then record all of its broken-up
        // sub-terms in the accumulator
        val pruned_r = pruneFunction(r, acc)
        pruned_r :: distinctWith(
          rs,
          pruneFunction,
          acc ++ turnIntoTerms(erase(pruned_r))
        )
      }
  }
\end{lstlisting}
\caption{A Stronger Version of $\textit{distinctBy}$}
\end{figure}
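\noindent
To see the net effect, here is a small worked example in terms of the
plain-regex sketch from above (again our own illustration, not the thesis'
test code): once $a\cdot c$ has been recorded in the accumulator, the term
$(a+b)\cdot c$ loses its $a\cdot c$ branch and only $b\cdot c$ survives.
\begin{lstlisting}
val a = CHR('a'); val b = CHR('b'); val c = CHR('c')
// once SEQ(a,c) is in acc, (a+b)c prunes to SEQ(b,c)
val acc: Set[Rexp] = Set(SEQ(a, c))
assert(prune(SEQ(ALTS(List(a, b)), c), acc) == SEQ(b, c))
\end{lstlisting}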
\noindent
This breaking-up process is done by the function $\textit{turnIntoTerms}$:
\begin{figure}