98 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in |
98 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in |
99 succession) all the characters of the string matches the empty string, |
99 succession) all the characters of the string matches the empty string, |
100 then @{term r} matches @{term s} (and {\em vice versa}). We are aware |
100 then @{term r} matches @{term s} (and {\em vice versa}). We are aware |
101 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by |
101 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by |
102 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
102 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
103 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one |
103 of the work by Krauss and Nipkow~\cite{Krauss2011}. And another one |
104 in Coq is given by Coquand and Siles \cite{Coquand2012}. |
104 in Coq is given by Coquand and Siles \cite{Coquand2012}. |
105 Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}. |
105 Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}. |
106 |
106 |
107 |
107 |
108 However, there are two difficulties with derivative-based matchers: |
108 However, there are two difficulties with derivative-based matchers: |
159 regular expressions. In this version, POSIX values are |
159 regular expressions. In this version, POSIX values are |
160 represented as bitsequences and such sequences are incrementally generated |
160 represented as bitsequences and such sequences are incrementally generated |
161 when derivatives are calculated. The compact representation |
161 when derivatives are calculated. The compact representation |
162 of bitsequences and regular expressions allows them to define a more |
162 of bitsequences and regular expressions allows them to define a more |
163 ``aggressive'' simplification method that keeps the size of the |
163 ``aggressive'' simplification method that keeps the size of the |
164 derivatives finite no matter what the length of the string is. |
164 derivatives finitely bounded no matter what the length of the string is. |
165 They make some informal claims about the correctness and linear behaviour |
165 They make some informal claims about the correctness and linear behaviour |
166 of this version, but do not provide any supporting proof arguments, not |
166 of this version, but do not provide any supporting proof arguments, not |
167 even ``pencil-and-paper'' arguments. They write about their bitcoded |
167 even ``pencil-and-paper'' arguments. They write about their bitcoded |
168 \emph{incremental parsing method} (that is the algorithm to be formalised |
168 \emph{incremental parsing method} (that is the algorithm to be formalised |
169 in this paper): |
169 in this paper): |
193 We shall also introduce our own argument and definitions for |
193 We shall also introduce our own argument and definitions for |
194 establishing the correctness of the bitcoded algorithm when |
194 establishing the correctness of the bitcoded algorithm when |
195 simplifications are included.\medskip |
195 simplifications are included.\medskip |
196 |
196 |
197 \noindent In this paper, we shall first briefly introduce the basic notions |
197 \noindent In this paper, we shall first briefly introduce the basic notions |
198 of regular expressions and describe the basic definitions |
198 of regular expressions and describe the definition |
199 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves |
199 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves |
200 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove |
200 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove |
201 the correctness for the bitcoded algorithm without simplification, and |
201 the correctness for the bitcoded algorithm without simplification, and |
202 after that extend the proof to include simplification. |
202 after that extend the proof to include simplification. |
203 |
203 |
356 string that satisfies the informal POSIX rules (see |
356 string that satisfies the informal POSIX rules (see |
357 \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).\footnote{POSIX |
357 \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).\footnote{POSIX |
358 lexing acquired its name from the fact that the corresponding |
358 lexing acquired its name from the fact that the corresponding |
359 rules were described as part of the POSIX specification for |
359 rules were described as part of the POSIX specification for |
360 Unix-like operating systems \cite{POSIX}.} Sometimes these |
360 Unix-like operating systems \cite{POSIX}.} Sometimes these |
361 informal rules are called \emph{maximal much rule} and \emph{rule priority}. |
361 informal rules are called \emph{maximal munch rule} and \emph{rule priority}. |
362 One contribution of our earlier paper is to give a convenient |
362 One contribution of our earlier paper is to give a convenient |
363 specification for what POSIX values are (the inductive rules are shown in |
363 specification for what POSIX values are (the inductive rules are shown in |
364 Figure~\ref{POSIXrules}). |
364 Figure~\ref{POSIXrules}). |
365 |
365 |
366 \begin{figure}[t] |
366 \begin{figure}[t] |
510 (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\ |
510 (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\ |
511 \end{tabular} |
511 \end{tabular} |
512 \end{proposition} |
512 \end{proposition} |
513 |
513 |
514 \noindent |
514 \noindent |
515 In fact we have shown that in the success case the generated POSIX value $v$ is |
515 In fact we have shown that, in the success case, the generated POSIX value $v$ is |
516 unique and in the failure case that there is no POSIX value $v$ that satisfies |
516 unique and in the failure case that there is no POSIX value $v$ that satisfies |
517 $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly |
517 $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly |
518 slow in cases where the derivatives grow arbitrarily (recall the example from the |
518 slow in cases where the derivatives grow arbitrarily (recall the example from the |
519 Introduction). However it can be used as a convenient reference point for the correctness |
519 Introduction). However it can be used as a convenient reference point for the correctness |
520 proof of the second algorithm by Sulzmann and Lu, which we shall describe next. |
520 proof of the second algorithm by Sulzmann and Lu, which we shall describe next. |
1164 \] |
1164 \] |
1165 |
1165 |
1166 \noindent do not hold under simplification---this property |
1166 \noindent do not hold under simplification---this property |
1167 essentially purports that we can retrieve the same value from a |
1167 essentially purports that we can retrieve the same value from a |
1168 simplified version of the regular expression. To start with @{text retrieve} |
1168 simplified version of the regular expression. To start with @{text retrieve} |
1169 depends on the fact that the value @{text v} correspond to the |
1169 depends on the fact that the value @{text v} corresponds to the |
1170 structure of the regular expression @{text r}---but the whole point of simplification |
1170 structure of the regular expression @{text r}---but the whole point of simplification |
1171 is to ``destroy'' this structure by making the regular expression simpler. |
1171 is to ``destroy'' this structure by making the regular expression simpler. |
1172 To see this consider the regular expression @{text "r = r' + 0"} and a corresponding |
1172 To see this consider the regular expression @{text "r = r' + 0"} and a corresponding |
1173 value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then |
1173 value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then |
1174 we can use @{text retrieve} with @{text r} and @{text v} in order to extract a corresponding |
1174 we can use @{text retrieve} with @{text r} and @{text v} in order to extract a corresponding |
1298 @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad |
1298 @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad |
1299 @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\ |
1299 @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\ |
1300 @{thm[mode=Axiom] bs6}$A0$\qquad |
1300 @{thm[mode=Axiom] bs6}$A0$\qquad |
1301 @{thm[mode=Axiom] bs7}$A1$\\ |
1301 @{thm[mode=Axiom] bs7}$A1$\\ |
1302 @{thm[mode=Rule] bs8[of "rs\<^sub>1" "rs\<^sub>2"]}$AL$\\ |
1302 @{thm[mode=Rule] bs8[of "rs\<^sub>1" "rs\<^sub>2"]}$AL$\\ |
1303 @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LH$\qquad |
1303 @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LT$\qquad |
1304 @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}$LT$\\ |
1304 @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}$LH$\\ |
1305 @{thm[mode=Axiom] ss4}$L\ZERO$\qquad |
1305 @{thm[mode=Axiom] ss4}$L\ZERO$\qquad |
1306 @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}$LS$\medskip\\ |
1306 @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}$LS$\medskip\\ |
1307 @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}$LD$\\ |
1307 @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}$LD$\\ |
1308 \end{tabular} |
1308 \end{tabular} |
1309 \end{center} |
1309 \end{center} |
1398 because the informal definition given by Sulzmann and Lu did not |
1398 because the informal definition given by Sulzmann and Lu did not |
1399 stand up to a formal proof. Also for the second algorithm we needed |
1399 stand up to a formal proof. Also for the second algorithm we needed |
1400 to introduce our own definitions and proof ideas in order to establish the |
1400 to introduce our own definitions and proof ideas in order to establish the |
1401 correctness. Our interest in the second algorithm |
1401 correctness. Our interest in the second algorithm |
1402 lies in the fact that by using bitcoded regular expressions and an aggressive |
1402 lies in the fact that by using bitcoded regular expressions and an aggressive |
1403 simplification method there is a chance that the the derivatives |
1403 simplification method there is a chance that the derivatives |
1404 can be kept universally small (we established in this paper that |
1404 can be kept universally small (we established in this paper that |
1405 they can be kept finite for any string). This is important if one is after |
1405 they can be kept finite for any string). This is important if one is after |
1406 an efficient POSIX lexing algorithm based on derivatives. |
1406 an efficient POSIX lexing algorithm based on derivatives. |
1407 |
1407 |
1408 Having proved the correctness of the POSIX lexing algorithm, which |
1408 Having proved the correctness of the POSIX lexing algorithm, which |
1423 \cite[Page 14]{Sulzmann2014}. |
1423 \cite[Page 14]{Sulzmann2014}. |
1424 Given the growth of the |
1424 Given the growth of the |
1425 derivatives in some cases even after aggressive simplification, this |
1425 derivatives in some cases even after aggressive simplification, this |
1426 is a hard to believe claim. A similar claim about a theoretical runtime |
1426 is a hard to believe claim. A similar claim about a theoretical runtime |
1427 of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates |
1427 of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates |
1428 tokens according to POSIX rules~\cite{verbatim}. For this it uses Brzozowski's |
1428 tokens according to POSIX rules~\cite{verbatim}. For this Verbatim uses Brzozowski's |
1429 derivatives like in our work. |
1429 derivatives like in our work. |
1430 They write: ``The results of our empirical tests [..] confirm that Verbatim has |
1430 The authors write: ``The results of our empirical tests [..] confirm that Verbatim has |
1431 @{text "O(n\<^sup>2)"} time complexity.'' \cite[Section~VII]{verbatim}. |
1431 @{text "O(n\<^sup>2)"} time complexity.'' \cite[Section~VII]{verbatim}. |
1432 While their correctness proof for Verbatim is formalised in Coq, the claim about |
1432 While their correctness proof for Verbatim is formalised in Coq, the claim about |
1433 the runtime complexity is only supported by some emperical evidence obtained |
1433 the runtime complexity is only supported by some emperical evidence obtained |
1434 by using the code extraction facilities of Coq. |
1434 by using the code extraction facilities of Coq. |
1435 Given our observation with the ``growth problem'' of derivatives, |
1435 Given our observation with the ``growth problem'' of derivatives, |