diff -r 5499ba68188c -r f65444d29e74 thys2/Journal/Paper.thy --- a/thys2/Journal/Paper.thy Mon Nov 01 10:52:44 2021 +0000 +++ b/thys2/Journal/Paper.thy Tue Nov 02 13:57:59 2021 +0000 @@ -131,16 +131,11 @@ (*>*) - - -section \Introduction\ - - +section\Core of the proof\ text \ - -This works builds on previous work by Ausaf and Urban using +This paper builds on previous work by Ausaf and Urban using regular expression'd bit-coded derivatives to do lexing that -is both fast and satisfied the POSIX specification. +is both fast and satisfies the POSIX specification. In their work, a bit-coded algorithm introduced by Sulzmann and Lu was formally verified in Isabelle, by a very clever use of flex function and retrieve to carefully mimic the way a value is @@ -148,11 +143,102 @@ In the previous work, Ausaf and Urban established the below equality: \begin{lemma} -@{thm [mode=IfThen] bder_retrieve} +@{thm [mode=IfThen] MAIN_decode} +\end{lemma} + +This lemma establishes a link with the lexer without bit-codes. + +With it we get the correctness of bit-coded algorithm. +\begin{lemma} +@{thm [mode=IfThen] blexer_correctness} \end{lemma} -This lemma links the derivative of a bit-coded regular expression with -the regular expression itself before the derivative. +However what is not certain is whether we can add simplification +to the bit-coded algorithm, without breaking the correct lexing output. +This might sound trivial in the case of producing a YES/NO answer, +but once we require a lexing output to be produced (which is required +in applications like compiler front-end, malicious attack domain extraction, +etc.), it is not straightforward if we still extract what is needed according +to the POSIX standard. + +By simplification, we mean specifically the following rules: + +\begin{center} + \begin{tabular}{lcl} + @{thm[mode=Axiom] rrewrite.intros(1)[of "bs" "r\<^sub>2"]}\\ + @{thm[mode=Axiom] rrewrite.intros(2)[of "bs" "r\<^sub>1"]}\\ + @{thm[mode=Axiom] rrewrite.intros(3)[of "bs" "bs\<^sub>1" "r\<^sub>1"]}\\ + @{thm[mode=Rule] rrewrite.intros(4)[of "r\<^sub>1" "r\<^sub>2" "bs" "r\<^sub>3"]}\\ + @{thm[mode=Rule] rrewrite.intros(5)[of "bs" "r\<^sub>2"]}\\ + @{thm[mode=Rule] rrewrite.intros(6)[of "bs" "r\<^sub>1"]}\\ + @{thm[mode=Axiom] rrewrite.intros(7)[of "bs" "r\<^sub>2"]}\\ + @{thm[mode=Axiom] rrewrite.intros(8)[of "bs" "r\<^sub>1"]}\\ + @{thm[mode=Axiom] rrewrite.intros(9)[of "bs" "r\<^sub>2"]}\\ + @{thm[mode=Axiom] rrewrite.intros(10)[of "bs" "r\<^sub>1"]}\\ + @{thm[mode=Axiom] rrewrite.intros(11)[of "bs" ]}\\ + @{thm[mode=Axiom] rrewrite.intros(12)[of "bs" "r\<^sub>1"]}\\ + @{thm[mode=Rule] rrewrite.intros(13)[of "a\<^sub>1" "a\<^sub>2" "bs" "rs\<^sub>a" "rs\<^sub>b" "rs\<^sub>c"]}\\ + + \end{tabular} +\end{center} + + +And these can be made compact by the following simplification function: + +\begin{center} + \begin{tabular}{lcl} + @{thm (lhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]}\\ + @{thm (lhs) bsimp.simps(2)} & $\dn$ & @{thm (rhs) bsimp.simps(2)}\\ + @{thm (lhs) bsimp.simps(3)} & $\dn$ & @{thm (rhs) bsimp.simps(3)}\\ + @{thm (lhs) intern.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) intern.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ +\end{tabular} +\end{center} + +where the function $\mathit{bsimp_AALTs}$ + +The core idea of the proof is that two regular expressions, +if "isomorphic" up to a finite number of rewrite steps, will +remain "isomorphic" when we take the same sequence of +derivatives on both of them. +This can be expressed by the following rewrite relation lemma: +\begin{lemma} +@{thm [mode=IfThen] central} +\end{lemma} + +This isomorphic relation implies a property that leads to the +correctness result: +if two (nullable) regular expressions are "rewritable" in many steps +from one another, +then a call to function $textit{bmkeps}$ gives the same +bit-sequence : +\begin{lemma} +@{thm [mode=IfThen] rewrites_bmkeps} +\end{lemma} + +Given the same bit-sequence, the decode function +will give out the same value, which is the output +of both lexers: +\begin{lemma} +@{thm blexer_def} +\end{lemma} + +\begin{lemma} +@{thm blexer_simp_def} +\end{lemma} + +And that yields the correctness result: +\begin{lemma} +@{thm blexersimp_correctness} +\end{lemma} + + +\ + +section \Introduction\ + + +text \ + Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em derivative} @{term "der c r"} of a regular expression \r\ w.r.t.\ @@ -332,6 +418,9 @@ \ + + + section \Preliminaries\ text \\noindent Strings in Isabelle/HOL are lists of characters with