author | Chengsong |
Sat, 08 Jul 2023 22:18:22 +0100 | |
changeset 656 | 753a3b0ee02b |
parent 640 | bd1354127574 |
child 666 | 6da4516ea87d |
permissions | -rw-r--r-- |
637 | 1 |
% Chapter Template |
2 |
||
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
3 |
\chapter{Conclusion and Future Work} % Main chapter title |
637 | 4 |
|
5 |
\label{Future} |
|
6 |
||
7 |
In this thesis, in order to solve the ReDoS attacks |
|
8 |
once and for all, we have set out to formalise the correctness |
|
9 |
proof of Sulzmann and Lu's |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
10 |
lexing algorithm with aggressive simplifications \cite{Sulzmann2014}. |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
11 |
We formalised our proofs in the Isabelle/HOL theorem prover. |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
12 |
We have fixed some inefficiencies and a bug in their original simplification |
637 | 13 |
algorithm, and established the correctness by proving |
14 |
that our algorithm outputs the same result as the original bit-coded |
|
15 |
lexer without simplifications (whose correctness was established in |
|
16 |
previous work by Ausaf et al. in |
|
17 |
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}). |
|
18 |
The proof technique used in \cite{Ausaf} does not work in our case |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
19 |
because the simplification function messes with the structure of simplified |
637 | 20 |
regular expressions. |
21 |
Despite having to try out several workarounds and being |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
22 |
stuck for months looking for proofs, |
637 | 23 |
we were delighted to have discovered the simple yet |
24 |
effective proof method |
|
25 |
by modelling individual |
|
26 |
simplification steps as small-step rewriting rules and proving |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
27 |
equivalence between terms linked by these rewrite rules. |
637 | 28 |
|
29 |
In addition, we have proved a formal size bound on the |
|
30 |
regular expressions. The technique was by establishing |
|
31 |
a ``closed form'' informally |
|
32 |
described by |
|
33 |
Murugesan and Shanmuga Sundaram \cite{Murugesan2014} |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
34 |
for compound derivatives |
637 | 35 |
and using those closed forms to control the size. |
36 |
The Isabelle/HOL code for our formalisation can be |
|
37 |
found at |
|
38 |
\begin{center} |
|
39 |
\url{https://github.com/hellotommmy/posix} |
|
40 |
\end{center} |
|
41 |
\noindent |
|
42 |
Thanks to our theorem-prover-friendly approach, |
|
43 |
we believe that |
|
44 |
this finiteness bound can be improved to a bound |
|
45 |
linear to input and |
|
46 |
cubic to the regular expression size using a technique by |
|
47 |
Antimirov\cite{Antimirov95}. |
|
48 |
Once formalised, this would be a guarantee for the absence of all super-linear behavious. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
49 |
We are yet to work out the |
637 | 50 |
details. |
51 |
||
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
52 |
Our formalisation is approximately 7500 lines of code. Slightly more than half of this concerns the finiteness bound obtained in Chapter 5. This slight "bloat" is because we had to repeat many definitions for the rrexp datatype. However, we think we would not have been able to formalise the quite intricate reasoning involving rrexps with annotated regular expressions because we would have to carry around the bit-sequences (that are of course necessary in the algorithm) but which do not contribute to the size bound of the calculated derivatives. |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
53 |
|
637 | 54 |
|
55 |
To our best knowledge, no lexing libraries using Brzozowski derivatives |
|
56 |
have similar complexity-related bounds, |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
57 |
and claims about running time are usually speculative and backed up only by empirical |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
58 |
evidence on some test cases. |
637 | 59 |
If a matching or lexing algorithm |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
60 |
does not come with complexity related |
637 | 61 |
guarantees (for examaple the internal data structure size |
62 |
does not grow indefinitely), |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
63 |
then one cannot claim with confidence of having solved the problem |
637 | 64 |
of catastrophic backtracking. |
65 |
||
66 |
We believe our proof method is not specific to this |
|
67 |
algorithm, and intend to extend this approach |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
68 |
to prove the correctness of the faster version |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
69 |
of the lexer proposed in chapter \cite{Cubic}. The closed |
637 | 70 |
forms can then be re-used as well. Together |
71 |
with the idea from Antimirov \cite{Antimirov95} |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
72 |
we plan to reduce the size bound to be just polynomial |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
73 |
with respect to the size of the regular expressions. |
637 | 74 |
%We have tried out a variety of other optimisations |
75 |
%to make lexing/parsing with derivatives faster, |
|
76 |
%for example in \cite{Edelmann2021}, \cite{Darragh2020} |
|
77 |
%and \cite{Verbatimpp}. |
|
78 |
||
79 |
We have learned quite a few lessons in this process. |
|
80 |
As simple as the end result may seem, |
|
81 |
coming up with the initial proof idea with rewriting relations |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
82 |
was the hardest for us, as it required a deep understanding of what information |
637 | 83 |
is preserved during the simplification process. |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
84 |
There the ideas given by Sulzmann and Lu and Ausaf et al. were of no use. |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
85 |
%Having the first formalisation gives us much more confidence----it took |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
86 |
%us much less time to obtain the closed forms and size bound, although they |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
87 |
%are more involved than the correctness proof. |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
88 |
Of course this has already been shown many times, |
637 | 89 |
the usefulness of formal approaches cannot be overstated: |
90 |
they not only allow us to find bugs in Sulzmann and Lu's simplification functions, |
|
91 |
but also helped us set up realistic expectations about performance |
|
92 |
of algorithms. We believed in the beginning that |
|
93 |
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2} |
|
94 |
was already able to achieve the linear bound claimed by |
|
95 |
Sulzmann and Lu. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
96 |
We then attempted to prove that the size $\llbracket \blexersimp\; r \; s\rrbracket$ |
637 | 97 |
is $O(\llbracket r\rrbracket^c\cdot |s|)$, |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
98 |
where $c$ is a small constant, making |
637 | 99 |
$\llbracket r\rrbracket^c$ a small constant factor. |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
100 |
We were then quite surprised that this did not go through despite a lot of effort. |
637 | 101 |
This led us to discover the example where $\llbracket r\rrbracket^c$ |
102 |
can in fact be exponentially large in chapter \ref{Finite}. |
|
103 |
We therefore learned the necessity to introduce the stronger simplifications |
|
104 |
in chapter \ref{Cubic}. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
105 |
Without formal proofs we would not have found out this so soon, |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
106 |
if at all. |
637 | 107 |
|
108 |
||
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
109 |
%In the future, we plan to incorporate many extensions |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
110 |
%to this work. |
637 | 111 |
|
112 |
\section{Future Work} |
|
113 |
The most obvious next-step is to implement the cubic bound and |
|
114 |
correctness of $\blexerStrong$ |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
115 |
in chapter \ref{Cubic}. |
637 | 116 |
A cubic bound ($O(\llbracket r\rrbracket^c\cdot |s|)$) with respect to |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
117 |
regular expression size will get us one step closer to fulfilling |
637 | 118 |
the linear complexity claim made by Sulzmann and Lu. |
119 |
||
120 |
With a linear size bound theoretically, the next challenge would be |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
121 |
to generate code that is competitive with respect to |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
122 |
matchers based on DFAs or NFAs. For that a lot of optimisations are needed. |
637 | 123 |
We aim to integrate the zipper data structure into our lexer. |
124 |
The idea is very simple: using a zipper with multiple focuses |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
125 |
just like Darragh \cite{Darragh2020} did in their parsing algorithm, we could represent |
637 | 126 |
\[ |
127 |
x\cdot r + y \cdot r + \ldots |
|
128 |
\] |
|
129 |
as |
|
130 |
\[ |
|
131 |
(x+y + \ldots) \cdot r. |
|
132 |
\] |
|
133 |
This would greatly reduce the amount of space needed |
|
134 |
when we have many terms like $x\cdot r$. |
|
135 |
Some initial results have been obtained, but |
|
136 |
significant changes are needed to make sure |
|
137 |
that the lexer output conforms to the POSIX standard. |
|
138 |
We aim to make use of Okui and Suzuki's labelling |
|
139 |
system \cite{Okui10} to ensure regular expressions represented as zippers |
|
140 |
always maintain the POSIX orderings. |
|
141 |
%Formalising correctness of |
|
142 |
%such a lexer using zippers will probably require using an imperative |
|
143 |
%framework with separation logic as it involves |
|
144 |
%mutable data structures, which is also interesting to look into. |
|
145 |
||
146 |
To further optimise the algorithm, |
|
147 |
we plan to add a deduplicate function that tells |
|
148 |
whether two regular expressions denote the same |
|
149 |
language using |
|
150 |
an efficient and verified equivalence checker like \cite{Krauss2011}. |
|
151 |
In this way, the internal data structures can be pruned more aggressively, |
|
152 |
leading to better simplifications and ultimately faster algorithms. |
|
153 |
||
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
154 |
Traditional automata approaches can be sped up by compiling |
637 | 155 |
multiple rules into the same automaton. This has been done |
156 |
in \cite{Kumar2006} and \cite{Becchi08}, for example. |
|
157 |
Currently our lexer only deals with a single regular expression each time. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
158 |
Extending this to multiple regular expressions might open up more |
637 | 159 |
possibilities of simplifications. |
160 |
||
161 |
As already mentioned in chapter \ref{RelatedWork}, |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
162 |
reducing the number of memory accesses can also accelerate the |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
163 |
matching speed. It would be interesting to study the memory |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
164 |
bandwidth of our derivative-based matching algorithm and improve accordingly. |
637 | 165 |
|
166 |
Memoization has been used frequently in lexing and parsing to achieve |
|
167 |
better complexity results, for example in \cite{Reps1998}, |
|
168 |
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}. |
|
169 |
We plan to explore the performance enhancements by memoization in our algorithm |
|
170 |
in a correctness-preserving way. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
171 |
The monadic data refinement technique that Lammich and Tuerk used |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
172 |
in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
173 |
algorithm seems also quite relevant for such an enterprise. |
637 | 174 |
We aim to learn from their refinement framework |
175 |
which generates high performance code with proofs |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
637
diff
changeset
|
176 |
that can be broken down into small steps. |
637 | 177 |
|
178 |
||
179 |
||
180 |
||
181 |
%We establish the correctness |
|
182 |
%claim made by them about the |
|
183 |
%, which were only proven with pencil and paper. |
|
184 |
||
185 |
||
186 |
%\section{Work on Back-References} |
|
187 |
%We introduced regular expressions with back-references |
|
188 |
%in chapter \ref{Introduction}. |
|
189 |
%We adopt the common practice of calling them rewbrs (Regular Expressions |
|
190 |
%With Back References) in this section for brevity. |
|
191 |
%It has been shown by Aho \cite{Aho1990} |
|
192 |
%that the k-vertex cover problem can be reduced |
|
193 |
%to the problem of rewbrs matching, and is therefore NP-complete. |
|
194 |
%Given the depth of the problem, the progress made at the full generality |
|
195 |
%of arbitrary rewbrs matching has been slow, with |
|
196 |
%theoretical work on them being |
|
197 |
%fairly recent. |
|
198 |
% |
|
199 |
%Campaneu et al. studied regexes with back-references |
|
200 |
%in the context of formal languages theory in |
|
201 |
%their 2003 work \cite{campeanu2003formal}. |
|
202 |
%They devised a pumping lemma for Perl-like regexes, |
|
203 |
%proving that the langugages denoted by them |
|
204 |
%is properly contained in context-sensitive |
|
205 |
%languages. |
|
206 |
%More interesting questions such as |
|
207 |
%whether the language denoted by Perl-like regexes |
|
208 |
%can express palindromes ($\{w \mid w = w^R\}$) |
|
209 |
%are discussed in \cite{campeanu2009patterns} |
|
210 |
%and \cite{CAMPEANU2009Intersect}. |
|
211 |
%Freydenberger \cite{Frey2013} investigated the |
|
212 |
%expressive power of back-references. He showed several |
|
213 |
%undecidability and decriptional complexity results |
|
214 |
%of back-references, concluding that they add |
|
215 |
%great power to certain programming tasks but are difficult to harness. |
|
216 |
%An interesting question would then be |
|
217 |
%whether we can add restrictions to them, |
|
218 |
%so that they become expressive enough for practical use such |
|
219 |
%as html files, but not too powerful. |
|
220 |
%Freydenberger and Schmid \cite{FREYDENBERGER20191} |
|
221 |
%introduced the notion of deterministic |
|
222 |
%regular expressions with back-references to achieve |
|
223 |
%a better balance between expressiveness and tractability. |
|
224 |
% |
|
225 |
% |
|
226 |
%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012} |
|
227 |
%investigated the time complexity of different variants |
|
228 |
%of back-references. |
|
229 |
%We are not aware of any work that uses derivatives on back-references. |
|
230 |
% |
|
231 |
%See \cite{BERGLUND2022} for a survey |
|
232 |
%of these works and comparison between different |
|
233 |
%flavours of back-references syntax (e.g. whether references can be circular, |
|
234 |
%can labels be repeated etc.). |
|
235 |
% |
|
236 |
% |
|
237 |
%\subsection{Matchers and Lexers with Mechanised Proofs} |
|
238 |
%We are aware |
|
239 |
%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by |
|
240 |
%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part |
|
241 |
%of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one |
|
242 |
%in Coq is given by Coquand and Siles \parencite{Coquand2012}. |
|
243 |
%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}. |
|
244 |
% |
|
245 |
%\subsection{Static Analysis of Evil Regex Patterns} |
|
246 |
%When a regular expression does not behave as intended, |
|
247 |
%people usually try to rewrite the regex to some equivalent form |
|
248 |
%or they try to avoid the possibly problematic patterns completely, |
|
249 |
%for which many false positives exist\parencite{Davis18}. |
|
250 |
%Animated tools to "debug" regular expressions such as |
|
251 |
% \parencite{regexploit2021} \parencite{regex101} are also popular. |
|
252 |
%We are also aware of static analysis work on regular expressions that |
|
253 |
%aims to detect potentially expoential regex patterns. Rathnayake and Thielecke |
|
254 |
%\parencite{Rathnayake2014StaticAF} proposed an algorithm |
|
255 |
%that detects regular expressions triggering exponential |
|
256 |
%behavious on backtracking matchers. |
|
257 |
%Weideman \parencite{Weideman2017Static} came up with |
|
258 |
%non-linear polynomial worst-time estimates |
|
259 |
%for regexes, attack string that exploit the worst-time |
|
260 |
%scenario, and "attack automata" that generates |
|
261 |
%attack strings. |
|
262 |
% |
|
263 |
% |
|
264 |
% |
|
265 |
% |
|
266 |
% |
|
267 |
% |
|
268 |
%Thanks to our theorem-prover-friendly approach, |
|
269 |
%we believe that |
|
270 |
%this finiteness bound can be improved to a bound |
|
271 |
%linear to input and |
|
272 |
%cubic to the regular expression size using a technique by |
|
273 |
%Antimirov\cite{Antimirov95}. |
|
274 |
%Once formalised, this would be a guarantee for the absence of all super-linear behavious. |
|
275 |
%We are working out the |
|
276 |
%details. |
|
277 |
% |
|
278 |
% |
|
279 |
%To our best knowledge, no lexing libraries using Brzozowski derivatives |
|
280 |
%have similar complexity-related bounds, |
|
281 |
%and claims about running time are usually speculative and backed by empirical |
|
282 |
%evidence on a few test cases. |
|
283 |
%If a matching or lexing algorithm |
|
284 |
%does not come with certain basic complexity related |
|
285 |
%guarantees (for examaple the internal data structure size |
|
286 |
%does not grow indefinitely), |
|
287 |
%then they cannot claim with confidence having solved the problem |
|
288 |
%of catastrophic backtracking. |
|
289 |
% |
|
290 |
% |
|
291 |
%\section{Optimisations} |
|
292 |
%Perhaps the biggest problem that prevents derivative-based lexing |
|
293 |
%from being more widely adopted |
|
294 |
%is that they are still not blindingly fast in practice, unable to |
|
295 |
%reach throughputs like gigabytes per second, which the commercial |
|
296 |
%regular expression matchers such as Snort and Bro are able to achieve. |
|
297 |
%For our algorithm to be more attractive for practical use, we |
|
298 |
%need more correctness-preserving optimisations. |
|
299 |
%%So far our focus has been mainly on the bit-coded algorithm, |
|
300 |
%%but the injection-based lexing algorithm |
|
301 |
%%could also be sped up in various ways: |
|
302 |
%% |
|
303 |
%One such optimisation is defining string derivatives |
|
304 |
%directly, without recursively decomposing them into |
|
305 |
%character-by-character derivatives. For example, instead of |
|
306 |
%replacing $(r_1 + r_2)\backslash (c::cs)$ by |
|
307 |
%$((r_1 + r_2)\backslash c)\backslash cs$, we rather |
|
308 |
%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$. |
|
309 |
% |
|
310 |
% |
|
311 |
%\subsection{Derivatives and Zippers} |
|
312 |
%Zipper is a data structure designed to focus on |
|
313 |
%and navigate between local parts of a tree. |
|
314 |
%The idea is that often operations on a large tree only deal with |
|
315 |
%local regions each time. |
|
316 |
%It was first formally described by Huet \cite{HuetZipper}. |
|
317 |
%Typical applications of zippers involve text editor buffers |
|
318 |
%and proof system databases. |
|
319 |
%In our setting, the idea is to compactify the representation |
|
320 |
%of derivatives with zippers, thereby making our algorithm faster. |
|
321 |
%We draw our inspirations from several works on parsing, derivatives |
|
322 |
%and zippers. |
|
323 |
% |
|
324 |
%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives |
|
325 |
%\cite{Zippy2020}. |
|
326 |
%They adopted zippers to improve the speed, and argued that the runtime |
|
327 |
%complexity of their algorithm was linear with respect to the input string. |
|
328 |
% |
|
329 |
%The idea of using Brzozowski derivatives on general context-free |
|
330 |
%parsing was first implemented |
|
331 |
%by Might et al. \ref{Might2011}. |
|
332 |
%They used memoization and fixpoint construction to eliminate infinite recursion, |
|
333 |
%which is a major problem for using derivatives on context-free grammars. |
|
334 |
%The initial version was quite slow----exponential on the size of the input. |
|
335 |
%Adams et al. \ref{Adams2016} improved the speed and argued that their version |
|
336 |
%was cubic with respect to the input. |
|
337 |
%Darragh and Adams \cite{10.1145/3408990} further improved the performance |
|
338 |
%by using zippers in an innovative way--their zippers had multiple focuses |
|
339 |
%instead of just one in a traditional zipper to handle ambiguity. |
|
340 |
%Their algorithm was not formalised, though. |
|
341 |
% |
|
342 |
%We aim to integrate the zipper data structure into our lexer. |
|
343 |
%The idea is very simple: using a zipper with multiple focuses |
|
344 |
%just like Darragh did, we could represent |
|
345 |
%\[ |
|
346 |
% x\cdot r + y \cdot r + \ldots |
|
347 |
%\] |
|
348 |
%as |
|
349 |
%\[ |
|
350 |
% (x+y + \ldots) \cdot r. |
|
351 |
%\] |
|
352 |
%This would greatly reduce the amount of space needed |
|
353 |
%when we have many terms like $x\cdot r$. |
|
354 |
% |
|
355 |
%into the current lexer. |
|
356 |
%This allows the lexer to not traverse the entire |
|
357 |
%derivative tree generated |
|
358 |
%in every intermediate step, but only a small segment |
|
359 |
%that is currently active. |
|
360 |
% |
|
361 |
%Some initial results have been obtained, but more care needs to be taken to make sure |
|
362 |
%that the lexer output conforms to the POSIX standard. Formalising correctness of |
|
363 |
%such a lexer using zippers will probably require using an imperative |
|
364 |
%framework with separation logic as it involves |
|
365 |
%mutable data structures, which is also interesting to look into. |
|
366 |
% |
|
367 |
%To further optimise the algorithm, |
|
368 |
%I plan to add a ``deduplicate'' function that tells |
|
369 |
%whether two regular expressions denote the same |
|
370 |
%language using |
|
371 |
%an efficient and verified equivalence checker. |
|
372 |
%In this way, the internal data structures can be pruned more aggressively, |
|
373 |
%leading to better simplifications and ultimately faster algorithms. |
|
374 |
% |
|
375 |
% |
|
376 |
%Some initial results |
|
377 |
%We first give a brief introduction to what zippers are, |
|
378 |
%and other works |
|
379 |
%that apply zippers to derivatives |
|
380 |
%When dealing with large trees, it would be a waste to |
|
381 |
%traverse the entire tree if |
|
382 |
%the operation only |
|
383 |
%involves a small fraction of it. |
|
384 |
%The idea is to put the focus on that subtree, turning other parts |
|
385 |
%of the tree into a context |
|
386 |
% |
|
387 |
% |
|
388 |
%One observation about our derivative-based lexing algorithm is that |
|
389 |
%the derivative operation sometimes traverses the entire regular expression |
|
390 |
%unnecessarily: |
|
391 |