thys2/Paper/Paper.thy
changeset 464 e6248d2c20c2
parent 463 421397f267b9
child 474 726f4e65c0fe
equal deleted inserted replaced
463:421397f267b9 464:e6248d2c20c2
   100 then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
   100 then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
   101 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
   101 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
   102 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
   102 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
   103 of the work by Krauss and Nipkow \cite{Krauss2011}.  And another one
   103 of the work by Krauss and Nipkow \cite{Krauss2011}.  And another one
   104 in Coq is given by Coquand and Siles \cite{Coquand2012}.
   104 in Coq is given by Coquand and Siles \cite{Coquand2012}.
   105 Also Ribeiro and Du Bois give one in Agda \cite{RibeiroAgda2017}.
   105 Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}.
   106 
   106 
   107 
   107 
   108 However, there are two difficulties with derivative-based matchers:
   108 However, there are two difficulties with derivative-based matchers:
   109 First, Brzozowski's original matcher only generates a yes/no answer
   109 First, Brzozowski's original matcher only generates a yes/no answer
   110 for whether a regular expression matches a string or not.  This is too
   110 for whether a regular expression matches a string or not.  This is too
   229   not match any string, @{const ONE} for the regular expression that matches
   229   not match any string, @{const ONE} for the regular expression that matches
   230   only the empty string and @{term c} for matching a character literal.
   230   only the empty string and @{term c} for matching a character literal.
   231   The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
   231   The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
   232   We sometimes omit the $\cdot$ in a sequence regular expression for brevity. 
   232   We sometimes omit the $\cdot$ in a sequence regular expression for brevity. 
   233   The
   233   The
   234   \emph{language} of a regular expression, written $L$, is defined as usual
   234   \emph{language} of a regular expression, written $L(r)$, is defined as usual
   235   and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).
   235   and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).
   236 
   236 
   237   Central to Brzozowski's regular expression matcher are two functions
   237   Central to Brzozowski's regular expression matcher are two functions
   238   called @{text nullable} and \emph{derivative}. The latter is written
   238   called @{text nullable} and \emph{derivative}. The latter is written
   239   $r\backslash c$ for the derivative of the regular expression $r$
   239   $r\backslash c$ for the derivative of the regular expression $r$
  1165 
  1165 
  1166      \noindent do not hold under simplification---this property
  1166      \noindent do not hold under simplification---this property
  1167      essentially purports that we can retrieve the same value from a
  1167      essentially purports that we can retrieve the same value from a
  1168      simplified version of the regular expression. To start with @{text retrieve}
  1168      simplified version of the regular expression. To start with @{text retrieve}
  1169      depends on the fact that the value @{text v} correspond to the
  1169      depends on the fact that the value @{text v} correspond to the
  1170      structure of the regular expressions---but the whole point of simplification
  1170      structure of the regular expression @{text r}---but the whole point of simplification
  1171      is to ``destroy'' this structure by making the regular expression simpler.
  1171      is to ``destroy'' this structure by making the regular expression simpler.
  1172      To see this consider the regular expression @{text "r = r' + 0"} and a corresponding
  1172      To see this consider the regular expression @{text "r = r' + 0"} and a corresponding
  1173      value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then 
  1173      value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then 
  1174      we can use @{text retrieve} and @{text v} in order to extract a corresponding
  1174      we can use @{text retrieve} with @{text r} and @{text v} in order to extract a corresponding
  1175      bitsequence. The reason that this works is that @{text r} is an alternative
  1175      bitsequence. The reason that this works is that @{text r} is an alternative
  1176      regular expression and @{text v} a corresponding value. However, if we simplify
  1176      regular expression and @{text v} a corresponding @{text "Left"}-value. However, if we simplify
  1177      @{text r}, then @{text v} does not correspond to the shape of the regular 
  1177      @{text r}, then @{text v} does not correspond to the shape of the regular 
  1178      expression anymore. So unless one can somehow
  1178      expression anymore. So unless one can somehow
  1179      synchronise the change in the simplified regular expressions with
  1179      synchronise the change in the simplified regular expressions with
  1180      the original POSIX value, there is no hope of appealing to @{text retrieve} in the
  1180      the original POSIX value, there is no hope of appealing to @{text retrieve} in the
  1181      correctness argument for @{term blexer_simp}.
  1181      correctness argument for @{term blexer_simp}.
  1401    correctness.  Our interest in the second algorithm 
  1401    correctness.  Our interest in the second algorithm 
  1402    lies in the fact that by using bitcoded regular expressions and an aggressive
  1402    lies in the fact that by using bitcoded regular expressions and an aggressive
  1403    simplification method there is a chance that the the derivatives
  1403    simplification method there is a chance that the the derivatives
  1404    can be kept universally small  (we established in this paper that
  1404    can be kept universally small  (we established in this paper that
  1405    they can be kept finite for any string). This is important if one is after
  1405    they can be kept finite for any string). This is important if one is after
  1406    an efficient POSIX lexing algorithm.
  1406    an efficient POSIX lexing algorithm based on derivatives.
  1407 
  1407 
  1408    Having proved the correctness of the POSIX lexing algorithm, which
  1408    Having proved the correctness of the POSIX lexing algorithm, which
  1409    lessons have we learned? Well, we feel this is a very good example
  1409    lessons have we learned? Well, we feel this is a very good example
  1410    where formal proofs give further insight into the matter at
  1410    where formal proofs give further insight into the matter at
  1411    hand. For example it is very hard to see a problem with @{text nub}
  1411    hand. For example it is very hard to see a problem with @{text nub}
  1421    equivalent of @{term blexer_simp} ``...we can incrementally compute
  1421    equivalent of @{term blexer_simp} ``...we can incrementally compute
  1422    bitcoded parse trees in linear time in the size of the input''
  1422    bitcoded parse trees in linear time in the size of the input''
  1423    \cite[Page 14]{Sulzmann2014}. 
  1423    \cite[Page 14]{Sulzmann2014}. 
  1424    Given the growth of the
  1424    Given the growth of the
  1425    derivatives in some cases even after aggressive simplification, this
  1425    derivatives in some cases even after aggressive simplification, this
  1426    is a hard to believe fact. A similar claim about a theoretical runtime
  1426    is a hard to believe claim. A similar claim about a theoretical runtime
  1427    of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates
  1427    of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates
  1428    tokens according to POSIX rules~\cite{verbatim}. For this it uses Brzozowski's
  1428    tokens according to POSIX rules~\cite{verbatim}. For this it uses Brzozowski's
  1429    derivatives like in our work. 
  1429    derivatives like in our work. 
  1430    They write: ``The results of our empirical tests [..] confirm that Verbatim has
  1430    They write: ``The results of our empirical tests [..] confirm that Verbatim has
  1431    @{text "O(n\<^sup>2)"} time complexity.'' \cite[Section~VII]{verbatim}.
  1431    @{text "O(n\<^sup>2)"} time complexity.'' \cite[Section~VII]{verbatim}.
  1432    While their correctness proof for Verbatim is formalised in Coq, the claim about
  1432    While their correctness proof for Verbatim is formalised in Coq, the claim about
  1433    the runtime complexity is only supported by some emperical evidence obtained
  1433    the runtime complexity is only supported by some emperical evidence obtained
  1434    by using the code extraction facilities of Coq.
  1434    by using the code extraction facilities of Coq.
  1435    In the context of our observation with the ``growth problem'' of derivatives,
  1435    Given our observation with the ``growth problem'' of derivatives,
  1436    we
  1436    we
  1437    tried out their extracted OCaml code with the example
  1437    tried out their extracted OCaml code with the example
  1438    \mbox{@{text "(a + aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5 minutes to tokenise a
  1438    \mbox{@{text "(a + aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5 minutes to tokenise a
  1439    string of 40 $a$'s and that increased to approximately 19 minutes when the
  1439    string of 40 $a$'s and that increased to approximately 19 minutes when the
  1440    string is 50 $a$'s long. Given that derivatives are not simplified in the Verbatim
  1440    string is 50 $a$'s long. Taking into account that derivatives are not simplified in the Verbatim
  1441    lexer, such numbers are not surprising. 
  1441    lexer, such numbers are not surprising. 
  1442    Clearly our result of having finite
  1442    Clearly our result of having finite
  1443    derivatives might sound rather weak in this context but we think such effeciency claims
  1443    derivatives might sound rather weak in this context but we think such effeciency claims
  1444    really require further scrutiny.\medskip
  1444    really require further scrutiny.\medskip
  1445 
  1445