129 *) |
129 *) |
130 |
130 |
131 |
131 |
132 (*>*) |
132 (*>*) |
133 |
133 |
134 |
134 section\<open>Core of the proof\<close> |
135 |
|
136 section \<open>Introduction\<close> |
|
137 |
|
138 |
|
139 text \<open> |
135 text \<open> |
140 |
136 This paper builds on previous work by Ausaf and Urban using |
141 This works builds on previous work by Ausaf and Urban using |
|
142 regular expression'd bit-coded derivatives to do lexing that |
137 regular expression'd bit-coded derivatives to do lexing that |
143 is both fast and satisfied the POSIX specification. |
138 is both fast and satisfies the POSIX specification. |
144 In their work, a bit-coded algorithm introduced by Sulzmann and Lu |
139 In their work, a bit-coded algorithm introduced by Sulzmann and Lu |
145 was formally verified in Isabelle, by a very clever use of |
140 was formally verified in Isabelle, by a very clever use of |
146 flex function and retrieve to carefully mimic the way a value is |
141 flex function and retrieve to carefully mimic the way a value is |
147 built up by the injection funciton. |
142 built up by the injection funciton. |
148 |
143 |
149 In the previous work, Ausaf and Urban established the below equality: |
144 In the previous work, Ausaf and Urban established the below equality: |
150 \begin{lemma} |
145 \begin{lemma} |
151 @{thm [mode=IfThen] bder_retrieve} |
146 @{thm [mode=IfThen] MAIN_decode} |
152 \end{lemma} |
147 \end{lemma} |
153 |
148 |
154 This lemma links the derivative of a bit-coded regular expression with |
149 This lemma establishes a link with the lexer without bit-codes. |
155 the regular expression itself before the derivative. |
150 |
|
151 With it we get the correctness of bit-coded algorithm. |
|
152 \begin{lemma} |
|
153 @{thm [mode=IfThen] blexer_correctness} |
|
154 \end{lemma} |
|
155 |
|
156 However what is not certain is whether we can add simplification |
|
157 to the bit-coded algorithm, without breaking the correct lexing output. |
|
158 This might sound trivial in the case of producing a YES/NO answer, |
|
159 but once we require a lexing output to be produced (which is required |
|
160 in applications like compiler front-end, malicious attack domain extraction, |
|
161 etc.), it is not straightforward if we still extract what is needed according |
|
162 to the POSIX standard. |
|
163 |
|
164 By simplification, we mean specifically the following rules: |
|
165 |
|
166 \begin{center} |
|
167 \begin{tabular}{lcl} |
|
168 @{thm[mode=Axiom] rrewrite.intros(1)[of "bs" "r\<^sub>2"]}\\ |
|
169 @{thm[mode=Axiom] rrewrite.intros(2)[of "bs" "r\<^sub>1"]}\\ |
|
170 @{thm[mode=Axiom] rrewrite.intros(3)[of "bs" "bs\<^sub>1" "r\<^sub>1"]}\\ |
|
171 @{thm[mode=Rule] rrewrite.intros(4)[of "r\<^sub>1" "r\<^sub>2" "bs" "r\<^sub>3"]}\\ |
|
172 @{thm[mode=Rule] rrewrite.intros(5)[of "bs" "r\<^sub>2"]}\\ |
|
173 @{thm[mode=Rule] rrewrite.intros(6)[of "bs" "r\<^sub>1"]}\\ |
|
174 @{thm[mode=Axiom] rrewrite.intros(7)[of "bs" "r\<^sub>2"]}\\ |
|
175 @{thm[mode=Axiom] rrewrite.intros(8)[of "bs" "r\<^sub>1"]}\\ |
|
176 @{thm[mode=Axiom] rrewrite.intros(9)[of "bs" "r\<^sub>2"]}\\ |
|
177 @{thm[mode=Axiom] rrewrite.intros(10)[of "bs" "r\<^sub>1"]}\\ |
|
178 @{thm[mode=Axiom] rrewrite.intros(11)[of "bs" ]}\\ |
|
179 @{thm[mode=Axiom] rrewrite.intros(12)[of "bs" "r\<^sub>1"]}\\ |
|
180 @{thm[mode=Rule] rrewrite.intros(13)[of "a\<^sub>1" "a\<^sub>2" "bs" "rs\<^sub>a" "rs\<^sub>b" "rs\<^sub>c"]}\\ |
|
181 |
|
182 \end{tabular} |
|
183 \end{center} |
|
184 |
|
185 |
|
186 And these can be made compact by the following simplification function: |
|
187 |
|
188 \begin{center} |
|
189 \begin{tabular}{lcl} |
|
190 @{thm (lhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]}\\ |
|
191 @{thm (lhs) bsimp.simps(2)} & $\dn$ & @{thm (rhs) bsimp.simps(2)}\\ |
|
192 @{thm (lhs) bsimp.simps(3)} & $\dn$ & @{thm (rhs) bsimp.simps(3)}\\ |
|
193 @{thm (lhs) intern.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) intern.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
|
194 \end{tabular} |
|
195 \end{center} |
|
196 |
|
197 where the function $\mathit{bsimp_AALTs}$ |
|
198 |
|
199 The core idea of the proof is that two regular expressions, |
|
200 if "isomorphic" up to a finite number of rewrite steps, will |
|
201 remain "isomorphic" when we take the same sequence of |
|
202 derivatives on both of them. |
|
203 This can be expressed by the following rewrite relation lemma: |
|
204 \begin{lemma} |
|
205 @{thm [mode=IfThen] central} |
|
206 \end{lemma} |
|
207 |
|
208 This isomorphic relation implies a property that leads to the |
|
209 correctness result: |
|
210 if two (nullable) regular expressions are "rewritable" in many steps |
|
211 from one another, |
|
212 then a call to function $textit{bmkeps}$ gives the same |
|
213 bit-sequence : |
|
214 \begin{lemma} |
|
215 @{thm [mode=IfThen] rewrites_bmkeps} |
|
216 \end{lemma} |
|
217 |
|
218 Given the same bit-sequence, the decode function |
|
219 will give out the same value, which is the output |
|
220 of both lexers: |
|
221 \begin{lemma} |
|
222 @{thm blexer_def} |
|
223 \end{lemma} |
|
224 |
|
225 \begin{lemma} |
|
226 @{thm blexer_simp_def} |
|
227 \end{lemma} |
|
228 |
|
229 And that yields the correctness result: |
|
230 \begin{lemma} |
|
231 @{thm blexersimp_correctness} |
|
232 \end{lemma} |
|
233 |
|
234 |
|
235 \<close> |
|
236 |
|
237 section \<open>Introduction\<close> |
|
238 |
|
239 |
|
240 text \<open> |
|
241 |
156 |
242 |
157 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em |
243 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em |
158 derivative} @{term "der c r"} of a regular expression \<open>r\<close> w.r.t.\ |
244 derivative} @{term "der c r"} of a regular expression \<open>r\<close> w.r.t.\ |
159 a character~\<open>c\<close>, and showed that it gave a simple solution to the |
245 a character~\<open>c\<close>, and showed that it gave a simple solution to the |
160 problem of matching a string @{term s} with a regular expression @{term |
246 problem of matching a string @{term s} with a regular expression @{term |