214 \noindent In this paper, we shall first briefly introduce the basic notions |
214 \noindent In this paper, we shall first briefly introduce the basic notions |
215 of regular expressions and describe the definition |
215 of regular expressions and describe the definition |
216 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves |
216 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves |
217 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove |
217 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove |
218 the correctness for the bitcoded algorithm without simplification, and |
218 the correctness for the bitcoded algorithm without simplification, and |
219 after that extend the proof to include simplification.\mbox{}\\[-6mm] |
219 after that extend the proof to include simplification. |
|
220 Our Isabelle code including the results from Sec.~5 is available |
|
221 from \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}. |
|
222 \mbox{}\\[-6mm] |
220 |
223 |
221 *} |
224 *} |
222 |
225 |
223 section {* Background *} |
226 section {* Background *} |
224 |
227 |
254 \cite{AusafDyckhoffUrban2016}). |
257 \cite{AusafDyckhoffUrban2016}). |
255 |
258 |
256 In our work here we also add to the usual ``basic'' regular |
259 In our work here we also add to the usual ``basic'' regular |
257 expressions the \emph{bounded} regular expression @{term "NTIMES r |
260 expressions the \emph{bounded} regular expression @{term "NTIMES r |
258 n"} where the @{term n} specifies that @{term r} should match |
261 n"} where the @{term n} specifies that @{term r} should match |
259 exactly @{term n}-times. Again for brevity we omit the other bounded |
262 exactly @{term n}-times (it is not included in Sulzmann and Lu's original work). For brevity we omit the other bounded |
260 regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$ |
263 regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$ |
261 and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many |
264 and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many |
262 times @{text r} should match. The results presented in this paper |
265 times @{text r} should match. The results presented in this paper |
263 extend straightforwardly to them too. The importance of the bounded |
266 extend straightforwardly to them too. The importance of the bounded |
264 regular expressions is that they are often used in practical |
267 regular expressions is that they are often used in practical |
276 \cite{CountingSet2020}). Therefore regular expression matching |
279 \cite{CountingSet2020}). Therefore regular expression matching |
277 libraries that rely on the classic notion of DFAs often impose |
280 libraries that rely on the classic notion of DFAs often impose |
278 adhoc limits for bounded regular expressions: For example in the |
281 adhoc limits for bounded regular expressions: For example in the |
279 regular expression matching library in the Go language and also in Google's RE2 library the regular expression |
282 regular expression matching library in the Go language and also in Google's RE2 library the regular expression |
280 @{term "NTIMES a 1001"} is not permitted, because no counter can be |
283 @{term "NTIMES a 1001"} is not permitted, because no counter can be |
281 above 1000; and in the built-in regular expression library in Rust |
284 above 1000; and in the regular expression library in Rust |
282 expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error |
285 expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error |
283 message for being too big. Up until recently,\footnote{up until version 1.5.4 of the regex |
286 message for being too big. Up until recently,\footnote{up until version 1.5.4 of the regex |
284 library in Rust; see also CVE-2022-24713.} Rust |
287 library in Rust; see also CVE-2022-24713.} Rust |
285 however happily generated an automaton for the regular expression |
288 however happily generated automata for regular expressions such as |
286 @{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug |
289 @{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug |
287 in the algorithm that decides when a regular expression is acceptable |
290 in the algorithm that decides when a regular expression is acceptable |
288 or too big according to Rust's classification. We shall come back to |
291 or too big according to Rust's classification (it did not account for the fact that @{text "a\<^bsup>{0}\<^esup>"} and similar examples can match the empty string). We shall come back to |
289 this example later in the paper. |
292 this example later in the paper. |
290 These problems can of course be solved in matching |
293 These problems can of course be solved in matching |
291 algorithms where automata go beyond the classic notion and for |
294 algorithms where automata go beyond the classic notion and for |
292 instance include explicit counters (see for example~\cite{CountingSet2020}). |
295 instance include explicit counters (e.g.~\cite{CountingSet2020}). |
293 The point here is that Brzozowski derivatives and the algorithms by |
296 The point here is that Brzozowski derivatives and the algorithms by |
294 Sulzmann and Lu can be straightforwardly extended to deal with |
297 Sulzmann and Lu can be straightforwardly extended to deal with |
295 bounded regular expressions and moreover the resulting code |
298 bounded regular expressions and moreover the resulting code |
296 still consists of only simple recursive functions and inductive |
299 still consists of only simple recursive functions and inductive |
297 datatypes. Finally, bounded regular expressions |
300 datatypes. Finally, bounded regular expressions |
872 the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the |
875 the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the |
873 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on |
876 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on |
874 bitcoded regular expressions. |
877 bitcoded regular expressions. |
875 % |
878 % |
876 \begin{center} |
879 \begin{center} |
877 \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{6mm}}c@ {}} |
880 \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{10mm}}c@ {}} |
878 \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l} |
881 \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l} |
879 $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\ |
882 $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\ |
880 $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\ |
883 $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\ |
881 $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\ |
884 $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\ |
882 $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\ |
885 $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\ |
1458 we shall show next. |
1461 we shall show next. |
1459 |
1462 |
1460 \begin{figure}[t] |
1463 \begin{figure}[t] |
1461 \begin{center} |
1464 \begin{center} |
1462 \begin{tabular}{@ {\hspace{-8mm}}c@ {}} |
1465 \begin{tabular}{@ {\hspace{-8mm}}c@ {}} |
1463 \\[-7mm] |
1466 \\[-8mm] |
1464 @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad |
1467 @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad |
1465 @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad |
1468 @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad |
1466 @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\ |
1469 @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\ |
1467 @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad |
1470 @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad |
1468 @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\ |
1471 @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\ |
1567 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm] |
1570 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm] |
1568 |
1571 |
1569 Note that while Antimirov was able to give a bound on the \emph{size} |
1572 Note that while Antimirov was able to give a bound on the \emph{size} |
1570 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound |
1573 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound |
1571 on the \emph{number} of derivatives, provided they are quotient via |
1574 on the \emph{number} of derivatives, provided they are quotient via |
1572 ACI rules \cite{Brzozowski1964}. Brozozowski's result is crucial when one |
1575 ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one |
1573 uses his derivatives for obtaining a DFA (it essentially bounds |
1576 uses his derivatives for obtaining a DFA (it essentially bounds |
1574 the number of states). However, this result does \emph{not} |
1577 the number of states). However, this result does \emph{not} |
1575 transfer to our setting where we are interested in the \emph{size} of the |
1578 transfer to our setting where we are interested in the \emph{size} of the |
1576 derivatives. For example it is \emph{not} true for our derivatives that the |
1579 derivatives. For example it is \emph{not} true for our derivatives that the |
1577 set of derivatives $r \backslash s$ for a given $r$ and all strings |
1580 set of derivatives $r \backslash s$ for a given $r$ and all strings |
1592 We set out in this work to prove in Isabelle/HOL the correctness of |
1595 We set out in this work to prove in Isabelle/HOL the correctness of |
1593 the second POSIX lexing algorithm by Sulzmann and Lu |
1596 the second POSIX lexing algorithm by Sulzmann and Lu |
1594 \cite{Sulzmann2014}. This follows earlier work where we established |
1597 \cite{Sulzmann2014}. This follows earlier work where we established |
1595 the correctness of the first algorithm |
1598 the correctness of the first algorithm |
1596 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to |
1599 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to |
1597 introduce our own specification about what POSIX values are, |
1600 introduce our own specification for POSIX values, |
1598 because the informal definition given by Sulzmann and Lu did not |
1601 because the informal definition given by Sulzmann and Lu did not |
1599 stand up to formal proof. Also for the second algorithm we needed |
1602 stand up to formal proof. Also for the second algorithm we needed |
1600 to introduce our own definitions and proof ideas in order to |
1603 to introduce our own definitions and proof ideas in order to |
1601 establish the correctness. Our interest in the second algorithm |
1604 establish the correctness. Our interest in the second algorithm |
1602 lies in the fact that by using bitcoded regular expressions and an |
1605 lies in the fact that by using bitcoded regular expressions and an |
1603 aggressive simplification method there is a chance that the |
1606 aggressive simplification method there is a chance that the |
1604 derivatives can be kept universally small (we established in this |
1607 derivatives can be kept universally small (we established in this |
1605 paper that for a given $r$ they can be kept finitely bounded for |
1608 paper that for a given $r$ they can be kept finitely bounded for |
1606 all strings). Our formalisation is approximately 7500 lines of |
1609 \emph{all} strings). Our formalisation is approximately 7500 lines of |
1607 Isabelle code. A little more than half of this code concerns the finiteness bound |
1610 Isabelle code. A little more than half of this code concerns the finiteness bound |
1608 obtained in Section 5. This slight ``bloat'' in the latter part is because |
1611 obtained in Section 5. This slight ``bloat'' in the latter part is because |
1609 we had to introduce an intermediate datatype for annotated regular expressions and repeat many |
1612 we had to introduce an intermediate datatype for annotated regular expressions and repeat many |
1610 definitions for this intermediate datatype. But overall this made our |
1613 definitions for this intermediate datatype. But overall we think this made our |
1611 formalisation work smoother. |
1614 formalisation work smoother. The code of our formalisation can be found at |
|
1615 \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}. |
1612 %This is important if one is after |
1616 %This is important if one is after |
1613 %an efficient POSIX lexing algorithm based on derivatives. |
1617 %an efficient POSIX lexing algorithm based on derivatives. |
1614 |
1618 |
1615 Having proved the correctness of the POSIX lexing algorithm, which |
1619 Having proved the correctness of the POSIX lexing algorithm, which |
1616 lessons have we learned? Well, we feel this is a very good example |
1620 lessons have we learned? Well, we feel this is a very good example |
1695 more than 11 minutes to generate an automaton for this regular |
1699 more than 11 minutes to generate an automaton for this regular |
1696 expression and then to determine that a string of just one(!) |
1700 expression and then to determine that a string of just one(!) |
1697 @{text a} does \emph{not} match this regular expression. Therefore |
1701 @{text a} does \emph{not} match this regular expression. Therefore |
1698 it is probably a wise choice that in newer versions of Rust's |
1702 it is probably a wise choice that in newer versions of Rust's |
1699 regular expression library such regular expressions are declared as |
1703 regular expression library such regular expressions are declared as |
1700 ``too big''. While this is clearly |
1704 ``too big'' and raise an error message. While this is clearly |
1701 a contrived example, the safety guaranties Rust wants to provide necessitate |
1705 a contrived example, the safety guaranties Rust wants to provide necessitate |
1702 this conservative approach. |
1706 this conservative approach. |
1703 However, with the derivatives and simplifications we have shown |
1707 However, with the derivatives and simplifications we have shown |
1704 in this paper, the example can be solved with ease: |
1708 in this paper, the example can be solved with ease: |
1705 it essentially only involves operations on |
1709 it essentially only involves operations on |