then @{term r} matches @{term s} (and {\em vice versa}). We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow~\cite{Krauss2011}. And another one
in Coq is given by Coquand and Siles~\cite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}.

However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing, where separate tokens
must be identified and also classified.
not match any string, @{const ONE} for the regular expression that matches
only the empty string and @{term c} for matching a character literal.
The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
We sometimes omit the $\cdot$ in a sequence regular expression for brevity.
The \emph{language} of a regular expression, written $L(r)$, is defined as usual
and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).
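For readers who want to experiment, the regular expressions above can
be represented by the following Isabelle/HOL datatype (a minimal
sketch; the constructor names @{text CH}, @{text ALT}, @{text SEQ} and
@{text STAR} are our illustrative choices and need not coincide with
the names used in the formalisation):

\begin{verbatim}
(* 0, 1, character, alternative, sequence and star *)
datatype rexp = ZERO | ONE | CH char
              | ALT rexp rexp | SEQ rexp rexp | STAR rexp
\end{verbatim}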

Central to Brzozowski's regular expression matcher are two functions
called @{text nullable} and \emph{derivative}. The latter is written
$r\backslash c$ for the derivative of the regular expression $r$
with respect to the character $c$.
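The standard clauses of both functions can be sketched as follows,
using the illustrative datatype from above; @{text nullable} tests
whether a regular expression can match the empty string, and
@{text "der c r"} implements $r\backslash c$:

\begin{verbatim}
fun nullable :: "rexp ⇒ bool" where
  "nullable ZERO = False"
| "nullable ONE = True"
| "nullable (CH c) = False"
| "nullable (ALT r1 r2) = (nullable r1 ∨ nullable r2)"
| "nullable (SEQ r1 r2) = (nullable r1 ∧ nullable r2)"
| "nullable (STAR r) = True"

fun der :: "char ⇒ rexp ⇒ rexp" where
  "der c ZERO = ZERO"
| "der c ONE = ZERO"
| "der c (CH d) = (if c = d then ONE else ZERO)"
| "der c (ALT r1 r2) = ALT (der c r1) (der c r2)"
| "der c (SEQ r1 r2) =
     (if nullable r1
      then ALT (SEQ (der c r1) r2) (der c r2)
      else SEQ (der c r1) r2)"
| "der c (STAR r) = SEQ (der c r) (STAR r)"
\end{verbatim}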

\noindent do not hold under simplification---this property
essentially purports that we can retrieve the same value from a
simplified version of the regular expression. To start with,
@{text retrieve} depends on the fact that the value @{text v}
corresponds to the structure of the regular expression @{text r}---but
the whole point of simplification is to ``destroy'' this structure by
making the regular expression simpler. To see this, consider the
regular expression @{text "r = r' + 0"} and a corresponding value
@{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then we
can use @{text retrieve} with @{text r} and @{text v} in order to
extract a corresponding bitsequence. The reason that this works is
that @{text r} is an alternative regular expression and @{text v} a
corresponding @{text "Left"}-value. However, if we simplify @{text r},
then @{text v} does not correspond to the shape of the regular
expression anymore. So unless one can somehow synchronise the change
in the simplified regular expressions with the original POSIX value,
there is no hope of appealing to @{text retrieve} in the correctness
argument for @{term blexer_simp}.
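To see where @{text retrieve} gets stuck, recall that its clauses for
annotated alternatives have roughly the following shape, where
@{text bs} is the bitcode annotated to the alternative (a sketch with
an assumed constructor @{text AALT}; the formalisation may differ in
the details):

\begin{verbatim}
retrieve (AALT bs r1 r2) (Left v)  = bs @ retrieve r1 v
retrieve (AALT bs r1 r2) (Right v) = bs @ retrieve r2 v
\end{verbatim}

\noindent Once @{text "r' + 0"} is simplified to @{text "r'"}, the
root of the regular expression is no longer an alternative, so the
value @{text "v = Left v'"} fits neither clause.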

correctness. Our interest in the second algorithm
lies in the fact that by using bitcoded regular expressions and an
aggressive simplification method there is a chance that the
derivatives can be kept universally small (we established in this
paper that their size can be kept finitely bounded for any string).
This is important if one is after an efficient POSIX lexing algorithm
based on derivatives.

Having proved the correctness of the POSIX lexing algorithm, what
lessons have we learned? Well, we feel this is a very good example
where formal proofs give further insight into the matter at hand. For
example, it is very hard to see a problem with @{text nub}
equivalent of @{term blexer_simp} ``...we can incrementally compute
bitcoded parse trees in linear time in the size of the input''
\cite[Page 14]{Sulzmann2014}. Given the growth of the derivatives in
some cases even after aggressive simplification, this is a
hard-to-believe claim. A similar claim about a theoretical runtime of
@{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which
calculates tokens according to POSIX rules~\cite{verbatim}. For this
it uses Brzozowski's derivatives like in our work. They write: ``The
results of our empirical tests [..] confirm that Verbatim has
@{text "O(n\<^sup>2)"} time complexity.'' \cite[Section~VII]{verbatim}.
While their correctness proof for Verbatim is formalised in Coq, the
claim about the runtime complexity is only supported by some
empirical evidence obtained by using the code extraction facilities
of Coq. Given our observation with the ``growth problem'' of
derivatives, we tried out their extracted OCaml code with the example
\mbox{@{text "(a + aa)\<^sup>*"}} as a single lexing rule, and it took
us around 5 minutes to tokenise a string of 40 $a$'s and approximately
19 minutes for a string of 50 $a$'s. Taking into account that
derivatives are not simplified in the Verbatim lexer, such numbers are
not surprising.
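The growth can be observed directly with the definitions sketched
earlier; the following @{text ders} (again an illustrative name) folds
the derivative over a string, and @{text size} is the size function
that Isabelle/HOL generates for every datatype:

\begin{verbatim}
fun ders :: "string ⇒ rexp ⇒ rexp" where
  "ders [] r = r"
| "ders (c # s) r = ders s (der c r)"

(* the unsimplified derivatives of (a + aa)* with respect
   to strings of a's grow very quickly in size *)
value "size (ders (replicate 10 (CHR ''a''))
              (STAR (ALT (CH (CHR ''a''))
                         (SEQ (CH (CHR ''a'')) (CH (CHR ''a''))))))"
\end{verbatim}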
Clearly our result of having finitely bounded
derivatives might sound rather weak in this context but we think such
efficiency claims really require further scrutiny.\medskip