\noindent where @{const ZERO} stands for the regular expression that does
not match any string, @{const ONE} for the regular expression that matches
only the empty string and @{term c} for matching a character literal.
The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
We sometimes omit the $\cdot$ in a sequence regular expression for brevity.
The \emph{language} of a regular expression, written $L$, is defined as usual
and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).

Central to Brzozowski's regular expression matcher are two functions: a test
whether a regular expression can match the empty string, and the derivative of
a regular expression with respect to a character.
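The following Scala sketch (ours and purely illustrative, not the formalised
Isabelle definitions; we include the star constructor because it is used in
later examples) spells out the regular expressions above together with these
two functions:

\begin{verbatim}
// illustrative sketch, not the formalised Isabelle definitions
abstract class Rexp
case object ZERO extends Rexp                    // matches no string
case object ONE extends Rexp                     // matches only ""
case class CH(c: Char) extends Rexp              // matches the literal c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

// does r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CH(_)       => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// Brzozowski derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CH(d)       => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// a string is matched by deriving with respect to each character
// and testing the result for nullability
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((d, c) => der(c, d)))
\end{verbatim}

\noindent For instance, applying @{text matches} to the regular expression
$a + aa$ and the string $aa$ yields @{text true}.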
from the list provided @{text "f x"} is already in the accumulator;
otherwise we keep @{text x} and scan the rest of the list but
add @{text "f x"} as another ``seen'' element to @{text acc}. We will use
@{term distinctBy} where @{text f} is the erase function, @{term "erase (DUMMY)"},
that deletes bitsequences from bitcoded regular expressions.
This is clearly a computationally more expensive operation than @{text nub},
but is needed in order to make the removal of unnecessary copies work properly.
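The behaviour just described can be rendered as the following short Scala
sketch (again ours and only illustrative, not the Isabelle definition of
@{term distinctBy}); the accumulator @{text acc} records the images under
@{text f} of the elements kept so far:

\begin{verbatim}
// keep x only if f(x) has not been "seen" before; otherwise drop it
def distinctBy[A, B](xs: List[A], f: A => B,
                     acc: Set[B] = Set[B]()): List[A] = xs match {
  case Nil => Nil
  case x :: rest =>
    if (acc.contains(f(x))) distinctBy(rest, f, acc)
    else x :: distinctBy(rest, f, acc + f(x))
}

// with the identity function it behaves like the usual nub/distinct:
// distinctBy(List(1, 2, 1, 3, 2), (x: Int) => x) == List(1, 2, 3)
\end{verbatim}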

Our simplification function depends on three helper functions; one is called
@{text flts} and analyses lists of regular expressions coming from alternatives.
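The definition of @{text flts} is not reproduced here. As a rough indication
of the kind of analysis involved, the following Scala sketch (our guess at the
general shape, stated for the plain, unbitcoded @{text Rexp} datatype from the
earlier sketch and not to be read as the formalised function) drops
@{const ZERO} elements, which contribute nothing to an alternative, and spills
the children of nested alternatives into the surrounding list:

\begin{verbatim}
// rough sketch only: flatten the children of an alternative
def flts(rs: List[Rexp]): List[Rexp] = rs match {
  case Nil                 => Nil
  case ZERO :: rest        => flts(rest)               // 0 + r = r
  case ALT(r1, r2) :: rest => r1 :: r2 :: flts(rest)   // flatten nested +
  case r :: rest           => r :: flts(rest)
}
\end{verbatim}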
@{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}$LD$\\
\end{tabular}
\end{center}
\caption{The rewrite rules that generate simplified regular expressions
in small steps: @{term "rrewrite r\<^sub>1 r\<^sub>2"} is for bitcoded regular
expressions and @{term "srewrite rs\<^sub>1 rs\<^sub>2"} for \emph{lists} of bitcoded
regular expressions. The interesting rule is $LD$, which allows copies of regular
expressions to be removed provided a regular expression earlier in the list can
match the same strings.}\label{SimpRewrites}
\end{figure}
*}

section {* Finiteness of Derivatives *}
\cite[Page 14]{Sulzmann2014}.
Given the growth of the derivatives in some cases even after aggressive
simplification, this claim is hard to believe. A similar claim about a
theoretical runtime of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer,
which calculates tokens according to POSIX rules~\cite{verbatim}. For this it
uses Brzozowski's derivatives.
They write: ``The results of our empirical tests [..] confirm that Verbatim has
@{text "O(n\<^sup>2)"} time complexity.'' \cite[Section~VII]{verbatim}.
While their correctness proof for Verbatim is formalised in Coq, the claim about
the runtime complexity is only supported by some empirical evidence obtained
by using the code extraction facilities of Coq.
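The growth we allude to is easy to observe. Reusing the plain Scala sketch from
above (our own illustration of unsimplified derivatives, entirely unrelated to
Verbatim's code), one can print the syntactic size of iterated derivatives of
\mbox{@{text "(a + aa)\<^sup>*"}}:

\begin{verbatim}
// count the nodes of a regular expression
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CH(_) => 1
  case ALT(r1, r2)        => 1 + size(r1) + size(r2)
  case SEQ(r1, r2)        => 1 + size(r1) + size(r2)
  case STAR(r1)           => 1 + size(r1)
}

val r = STAR(ALT(CH('a'), SEQ(CH('a'), CH('a'))))   // (a + aa)*
var d: Rexp = r
for (n <- 1 to 12) {
  d = der('a', d)                                   // no simplification
  println(s"after $n derivatives: size ${size(d)}")
}
\end{verbatim}

\noindent Without interleaved simplification the sizes grow rapidly with every
character read; as noted above, in some cases growth persists even after
aggressive simplification.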
Prompted by our observation of the ``growth problem'' of derivatives, we
tried out their extracted OCaml code with the example
\mbox{@{text "(a + aa)\<^sup>*"}} as a single lexing rule, and it took us around 5 minutes to tokenise a
string of 40 $a$'s; this increased to approximately 19 minutes when the