thys2/Paper/Paper.thy
changeset 461 c4b6906068a9
parent 460 6e269f557fc5
child 462 d9b672c4c0ac
   227 
   228   \noindent where @{const ZERO} stands for the regular expression that does
   229   not match any string, @{const ONE} for the regular expression that matches
   230   only the empty string and @{term c} for matching a character literal.
   231   The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
   232   We sometimes omit the $\cdot$ in a sequence regular expression for brevity.
   233   The
   234   \emph{language} of a regular expression, written $L$, is defined as usual
   235   and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).
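To make the datatype concrete, here is a small OCaml sketch of these constructors (our own illustration, not the Isabelle datatype of the paper; a constructor for the Kleene star is included because starred expressions such as @{text "(a + aa)\<^sup>*"} appear later in the text):

    (* Sketch of the regular expressions described above: ZERO matches no
       string, ONE only the empty string, CH c a single character literal,
       ALT is the + (alternative) and SEQ the \cdot (sequence) constructor. *)
    type rexp =
      | ZERO
      | ONE
      | CH of char
      | ALT of rexp * rexp   (* r1 + r2 *)
      | SEQ of rexp * rexp   (* r1 . r2 *)
      | STAR of rexp         (* r*      *)

    (* Example: the expression (a + aa)* used later in the text. *)
    let a_or_aa_star = STAR (ALT (CH 'a', SEQ (CH 'a', CH 'a')))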
   236 
   237   Central to Brzozowski's regular expression matcher are two functions
  1048      from the list provided @{text "f x"} is already in the accumulator;
  1049      otherwise we keep @{text x} and scan the rest of the list but
  1050      add @{text "f x"} as another ``seen'' element to @{text acc}. We will use
  1051      @{term distinctBy} where @{text f} is the erase function, @{term "erase (DUMMY)"},
  1052      which deletes bitsequences from bitcoded regular expressions.
  1053      This is clearly a computationally more expensive operation than @{text nub},
  1054      but is needed in order to make the removal of unnecessary copies
  1055      work properly.
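As a concrete reference, the behaviour just described can be written as the following OCaml sketch; the name distinct_by, the argument order and the use of a plain list for the accumulator are our choices, not the paper's formalisation:

    (* Sketch of distinctBy: drop an element x when f x has been seen before;
       otherwise keep x and record f x in the accumulator of "seen" images. *)
    let rec distinct_by xs f acc =
      match xs with
      | [] -> []
      | x :: rest ->
          let fx = f x in
          if List.mem fx acc
          then distinct_by rest f acc               (* f x already seen: remove x *)
          else x :: distinct_by rest f (fx :: acc)  (* keep x, remember f x       *)

    (* Example: deduplicating by absolute value keeps only the first of 2/-2, so
       distinct_by [1; 2; -2; 3] abs []  evaluates to  [1; 2; 3]. *)

In the paper the role of @{text f} is played by the erase function on bitcoded regular expressions, with the membership test on the accumulator taking the place of List.mem.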
  1056 
  1057      Our simplification function depends on three helper functions; one is called
  1058      @{text flts} and analyses lists of regular expressions coming from alternatives.
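The definition of @{text flts} itself is not part of this excerpt. Purely as an illustration of what analysing a list of regular expressions coming from alternatives can involve, the following sketch (over the plain rexp type from the first sketch, not the bitcoded regular expressions the paper uses) flattens nested alternatives and drops @{const ZERO} elements; the paper's @{text flts} may differ in its details:

    (* Illustrative only: flatten nested alternatives into one list and drop
       ZERO, which can never contribute a match. *)
    let rec flts (rs : rexp list) : rexp list =
      match rs with
      | [] -> []
      | ZERO :: rest -> flts rest                          (* ZERO matches nothing *)
      | ALT (r1, r2) :: rest -> flts [r1; r2] @ flts rest  (* flatten nested +     *)
      | r :: rest -> r :: flts rest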
  1307   @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}$LD$\\
  1308   \end{tabular}
  1309   \end{center}
  1310   \caption{The rewrite rules that generate simplified regular expressions
  1311   in small steps: @{term "rrewrite r\<^sub>1 r\<^sub>2"} is for bitcoded regular
  1312   expressions and @{term "srewrite rs\<^sub>1 rs\<^sub>2"} for \emph{lists} of bitcoded
  1313   regular expressions. The interesting rule is $LD$, which allows copies of regular
  1314   expressions to be removed, provided a regular expression earlier in the list can
  1315   match the same strings.}\label{SimpRewrites}
  1316   \end{figure}
  1317   *}
  1318 
  1319   section {* Finiteness of Derivatives *}
  1421    \cite[Page 14]{Sulzmann2014}.
  1422    Given the growth of the
  1423    derivatives in some cases even after aggressive simplification, this
  1424    claim is hard to believe. A similar claim about a theoretical runtime
  1425    of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates
  1426    tokens according to POSIX rules~\cite{verbatim}. For this it uses Brzozowski's
  1427    derivatives.
  1428    They write: ``The results of our empirical tests [..] confirm that Verbatim has
  1429    @{text "O(n\<^sup>2)"} time complexity.'' \cite[Section~VII]{verbatim}.
  1430    While their correctness proof for Verbatim is formalised in Coq, the claim about
  1431    the runtime complexity is only supported by some empirical evidence obtained
  1432    by using the code extraction facilities of Coq.
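To make the ``growth problem'' referred to below concrete, the following stand-alone OCaml sketch (our own; it is neither Verbatim's extracted code nor the verified lexer of this paper) measures how quickly plain, unsimplified Brzozowski derivatives of \mbox{@{text "(a + aa)\<^sup>*"}} grow as successive $a$'s are consumed:

    (* Stand-alone sketch (the rexp type from the earlier sketch is repeated so
       this compiles on its own): plain Brzozowski derivatives, no bitcodes, no
       simplification.  It only measures the size of the derivative of (a + aa)*
       after reading n characters 'a'. *)
    type rexp =
      | ZERO | ONE | CH of char
      | ALT of rexp * rexp | SEQ of rexp * rexp | STAR of rexp

    let rec nullable = function
      | ZERO -> false
      | ONE -> true
      | CH _ -> false
      | ALT (r1, r2) -> nullable r1 || nullable r2
      | SEQ (r1, r2) -> nullable r1 && nullable r2
      | STAR _ -> true

    let rec der c = function
      | ZERO | ONE -> ZERO
      | CH d -> if c = d then ONE else ZERO
      | ALT (r1, r2) -> ALT (der c r1, der c r2)
      | SEQ (r1, r2) ->
          if nullable r1 then ALT (SEQ (der c r1, r2), der c r2)
          else SEQ (der c r1, r2)
      | STAR r -> SEQ (der c r, STAR r)

    let rec size = function
      | ZERO | ONE | CH _ -> 1
      | ALT (r1, r2) | SEQ (r1, r2) -> 1 + size r1 + size r2
      | STAR r -> 1 + size r

    let () =
      let r = ref (STAR (ALT (CH 'a', SEQ (CH 'a', CH 'a')))) in   (* (a + aa)* *)
      for n = 1 to 15 do
        r := der 'a' !r;
        Printf.printf "derivative after %2d a's has size %d\n" n (size !r)
      done

In this sketch the reported sizes roughly double with each additional $a$; keeping such terms small is precisely what the simplification rules discussed earlier are for.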
  1433    In the context of our observations about the ``growth problem'' of derivatives,
  1434    we
  1435    tried out their extracted OCaml code with the example
  1436    \mbox{@{text "(a + aa)\<^sup>*"}} as a single lexing rule, and it took us around 5 minutes to tokenise a
  1437    string of 40 $a$'s and that increased to approximately 19 minutes when the