then @{term r} matches @{term s} (and {\em vice versa}). We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow~\cite{Krauss2011}. And another one
in Coq is given by Coquand and Siles~\cite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}.

However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing, where separate tokens
must be identified and also classified.
not match any string, @{const ONE} for the regular expression that matches
only the empty string and @{term c} for matching a character literal.
The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
We sometimes omit the $\cdot$ in a sequence regular expression for brevity.
The \emph{language} of a regular expression, written $L(r)$, is defined as usual
and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).
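For readers who want to experiment, the regular expressions above can
be represented by the following Isabelle/HOL datatype (a minimal
sketch; the constructor names @{text CH}, @{text ALT}, @{text SEQ} and
@{text STAR} are our illustrative choices and need not coincide with
the names used in the formalisation):

\begin{verbatim}
(* 0, 1, character, alternative, sequence and star *)
datatype rexp = ZERO | ONE | CH char
              | ALT rexp rexp | SEQ rexp rexp | STAR rexp
\end{verbatim}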

Central to Brzozowski's regular expression matcher are two functions
called @{text nullable} and \emph{derivative}. The latter is written
$r\backslash c$ for the derivative of the regular expression $r$
with respect to the character $c$.
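The standard clauses of both functions can be sketched as follows,
using the illustrative datatype from above; @{text nullable} tests
whether a regular expression can match the empty string, and
@{text "der c r"} implements $r\backslash c$:

\begin{verbatim}
fun nullable :: "rexp ⇒ bool" where
  "nullable ZERO = False"
| "nullable ONE = True"
| "nullable (CH c) = False"
| "nullable (ALT r1 r2) = (nullable r1 ∨ nullable r2)"
| "nullable (SEQ r1 r2) = (nullable r1 ∧ nullable r2)"
| "nullable (STAR r) = True"

fun der :: "char ⇒ rexp ⇒ rexp" where
  "der c ZERO = ZERO"
| "der c ONE = ZERO"
| "der c (CH d) = (if c = d then ONE else ZERO)"
| "der c (ALT r1 r2) = ALT (der c r1) (der c r2)"
| "der c (SEQ r1 r2) =
     (if nullable r1
      then ALT (SEQ (der c r1) r2) (der c r2)
      else SEQ (der c r1) r2)"
| "der c (STAR r) = SEQ (der c r) (STAR r)"
\end{verbatim}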

\noindent do not hold under simplification---this property
essentially purports that we can retrieve the same value from a
simplified version of the regular expression. To start with,
@{text retrieve} depends on the fact that the value @{text v}
corresponds to the structure of the regular expression @{text r}---but
the whole point of simplification is to ``destroy'' this structure by
making the regular expression simpler. To see this, consider the
regular expression @{text "r = r' + 0"} and a corresponding value
@{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then we
can use @{text retrieve} with @{text r} and @{text v} in order to
extract a corresponding bitsequence. The reason that this works is
that @{text r} is an alternative regular expression and @{text v} a
corresponding @{text "Left"}-value. However, if we simplify @{text r},
then @{text v} does not correspond to the shape of the regular
expression anymore. So unless one can somehow synchronise the change
in the simplified regular expressions with the original POSIX value,
there is no hope of appealing to @{text retrieve} in the correctness
argument for @{term blexer_simp}.
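To see where @{text retrieve} gets stuck, recall that its clauses for
annotated alternatives have roughly the following shape, where
@{text bs} is the bitcode annotated to the alternative (a sketch with
an assumed constructor @{text AALT}; the formalisation may differ in
the details):

\begin{verbatim}
retrieve (AALT bs r1 r2) (Left v)  = bs @ retrieve r1 v
retrieve (AALT bs r1 r2) (Right v) = bs @ retrieve r2 v
\end{verbatim}

\noindent Once @{text "r' + 0"} is simplified to @{text "r'"}, the
root of the regular expression is no longer an alternative, so the
value @{text "v = Left v'"} fits neither clause.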

correctness. Our interest in the second algorithm
lies in the fact that by using bitcoded regular expressions and an
aggressive simplification method there is a chance that the
derivatives can be kept universally small (we established in this
paper that their size can be kept finitely bounded for any string).
This is important if one is after an efficient POSIX lexing algorithm
based on derivatives.

Having proved the correctness of the POSIX lexing algorithm, what
lessons have we learned? Well, we feel this is a very good example
where formal proofs give further insight into the matter at hand. For
example, it is very hard to see a problem with @{text nub}
equivalent of @{term blexer_simp} ``...we can incrementally compute
bitcoded parse trees in linear time in the size of the input''
\cite[Page 14]{Sulzmann2014}. Given the growth of the derivatives in
some cases even after aggressive simplification, this is a
hard-to-believe claim. A similar claim about a theoretical runtime of
@{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which
calculates tokens according to POSIX rules~\cite{verbatim}. For this
it uses Brzozowski's derivatives like in our work. They write: ``The
results of our empirical tests [..] confirm that Verbatim has
@{text "O(n\<^sup>2)"} time complexity.'' \cite[Section~VII]{verbatim}.
While their correctness proof for Verbatim is formalised in Coq, the
claim about the runtime complexity is only supported by some
empirical evidence obtained by using the code extraction facilities
of Coq. Given our observation with the ``growth problem'' of
derivatives, we tried out their extracted OCaml code with the example
\mbox{@{text "(a + aa)\<^sup>*"}} as a single lexing rule, and it took
us around 5 minutes to tokenise a string of 40 $a$'s and approximately
19 minutes for a string of 50 $a$'s. Taking into account that
derivatives are not simplified in the Verbatim lexer, such numbers are
not surprising.
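The growth can be observed directly with the definitions sketched
earlier; the following @{text ders} (again an illustrative name) folds
the derivative over a string, and @{text size} is the size function
that Isabelle/HOL generates for every datatype:

\begin{verbatim}
fun ders :: "string ⇒ rexp ⇒ rexp" where
  "ders [] r = r"
| "ders (c # s) r = ders s (der c r)"

(* the unsimplified derivatives of (a + aa)* with respect
   to strings of a's grow very quickly in size *)
value "size (ders (replicate 10 (CHR ''a''))
              (STAR (ALT (CH (CHR ''a''))
                         (SEQ (CH (CHR ''a'')) (CH (CHR ''a''))))))"
\end{verbatim}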
Clearly our result of having finitely bounded
derivatives might sound rather weak in this context but we think such
efficiency claims really require further scrutiny.\medskip