\documentclass[a4paper,UKenglish]{lipics}

\usepackage{graphicx}
\usepackage{data}
\usepackage{tikz-cd}
\usepackage{tikz}
\usetikzlibrary{graphs}
\usetikzlibrary{graphdrawing}
\usegdlibrary{trees}
%\usepackage{algorithm}
\usepackage{amsmath}
\usepackage{xcolor}
\usepackage[noend]{algpseudocode}
\usepackage{enumitem}
\usepackage{nccmath}
\usepackage{soul}

\definecolor{darkblue}{rgb}{0,0,0.6}
\hypersetup{colorlinks=true,allcolors=darkblue}
\newcommand{\comment}[1]%
{{\color{red}$\Rightarrow$}\marginpar{\raggedright\small{\bf\color{red}#1}}}

% \documentclass{article}
%\usepackage[utf8]{inputenc}
%\usepackage[english]{babel}
%\usepackage{listings}
% \usepackage{amsthm}
%\usepackage{hyperref}
% \usepackage[margin=0.5in]{geometry}
%\usepackage{pmboxdraw}

\title{POSIX Regular Expression Matching and Lexing}
\author{Chengsong Tan}
\affil{King's College London\\
London, UK\\
\texttt{chengsong.tan@kcl.ac.uk}}
\authorrunning{Chengsong Tan}
\Copyright{Chengsong Tan}
|

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\def\erase{\textit{erase}}
\def\bders{\textit{bders}}
\def\lexer{\mathit{lexer}}
\def\blexer{\textit{blexer}}
\def\fuse{\textit{fuse}}
\def\flatten{\textit{flatten}}
\def\map{\textit{map}}
\def\blexers{\mathit{blexer\_simp}}
\def\simp{\mathit{simp}}
\def\mkeps{\mathit{mkeps}}
\def\bmkeps{\textit{bmkeps}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\flex{\textit{flex}}
\def\rup{r^\uparrow}
\def\retrieve{\textit{retrieve}}
\def\AALTS{\textit{AALTS}}
\def\AONE{\textit{AONE}}
%\theoremstyle{theorem}
%\newtheorem{theorem}{Theorem}
%\theoremstyle{lemma}
%\newtheorem{lemma}{Lemma}
%\newcommand{\lemmaautorefname}{Lemma}
%\theoremstyle{definition}
%\newtheorem{definition}{Definition}
\algnewcommand\algorithmicswitch{\textbf{switch}}
\algnewcommand\algorithmiccase{\textbf{case}}
\algnewcommand\algorithmicassert{\texttt{assert}}
\algnewcommand\Assert[1]{\State \algorithmicassert(#1)}%
% New "environments"
\algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
\algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}%
\algtext*{EndSwitch}%
\algtext*{EndCase}%
|


\begin{document}

\maketitle

\begin{abstract}
Brzozowski introduced in 1964 a beautifully simple algorithm for
regular expression matching based on the notion of derivatives of
regular expressions. In 2014, Sulzmann and Lu extended this algorithm
so that it not only gives a YES/NO answer for whether a regular
expression matches a string, but, in case it does, also says
\emph{how} it matches the string. This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. Being fast depends on a complete set of
simplification rules, some of which
have been put forward by Sulzmann and Lu. We have extended their
rules in order to obtain a tight bound on the size of regular expressions.
We have tested these extended rules, but have not
formally established their correctness. We have also not yet looked
at extended regular expressions, such as bounded repetitions,
negation and back-references.
\end{abstract}
|
109 |
|
110 \section{Recapitulation of Concepts From the Last Report} |
|
111 |
|
112 \subsection*{Regular Expressions and Derivatives} |
|
113 |
|
114 Suppose (basic) regular expressions are given by the following grammar: |
|
115 |
|
116 \[ r ::= \ZERO \mid \ONE |
|
117 \mid c |
|
118 \mid r_1 \cdot r_2 |
|
119 \mid r_1 + r_2 |
|
120 \mid r^* |
|
121 \] |
|
122 |
|
123 \noindent |
|
124 The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of |
|
125 regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of |
|
126 $\nullable$ defined below. |
|
127 |
|
128 \begin{center} |
|
129 \begin{tabular}{lcl} |
|
130 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
131 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
132 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
133 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
134 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
135 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
136 \end{tabular} |
|
137 \end{center} |
|
138 |
|
139 \begin{center} |
|
140 \begin{tabular}{lcl} |
|
141 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
142 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
143 $d \backslash c$ & $\dn$ & |
|
144 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
145 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
146 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\ |
|
147 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
148 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
149 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
150 \end{tabular} |
|
151 \end{center} |
|

%Assuming the classic notion of a
%\emph{language} of a regular expression, written $L(\_)$, t

\noindent
The main property of the derivative operation is that

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's regular-expression matching algorithm as:

\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]
|

\noindent
Assuming a string is given as a sequence of characters, say $c_0c_1\ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is, given
a string $s = c_0...c_{n-1}$ and a regular expression $r_0$, it generates YES if and only if $s \in L(r_0)$).
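The matcher just described can be rendered as a short program. The following Python sketch is purely illustrative and not from this report; regular expressions are encoded as nested tuples, a representation chosen only for this example.

```python
# Illustrative sketch of Brzozowski's matcher. Regular expressions are
# encoded as tuples: ('ZERO',), ('ONE',), ('CHAR', c),
# ('ALT', r1, r2), ('SEQ', r1, r2), ('STAR', r).

def nullable(r):
    tag = r[0]
    if tag in ('ONE', 'STAR'):
        return True
    if tag == 'ALT':
        return nullable(r[1]) or nullable(r[2])
    if tag == 'SEQ':
        return nullable(r[1]) and nullable(r[2])
    return False  # ZERO and CHAR

def der(c, r):
    tag = r[0]
    if tag in ('ZERO', 'ONE'):
        return ('ZERO',)
    if tag == 'CHAR':
        return ('ONE',) if r[1] == c else ('ZERO',)
    if tag == 'ALT':
        return ('ALT', der(c, r[1]), der(c, r[2]))
    if tag == 'SEQ':
        if nullable(r[1]):
            return ('ALT', ('SEQ', der(c, r[1]), r[2]), der(c, r[2]))
        return ('SEQ', der(c, r[1]), r[2])
    if tag == 'STAR':
        return ('SEQ', der(c, r[1]), r)

def matcher(r, s):
    # build successive derivatives, then test nullability
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For instance, with $r = (a+aa)^*$ the sketch accepts the string \texttt{aaa} and rejects \texttt{b}.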
|


\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}

One limitation of Brzozowski's algorithm is that it only produces a
YES/NO answer for whether a string is matched by a regular
expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
to allow generation of an actual matching, called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:
|


\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}
|

\noindent
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated in case the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}
|

\noindent
For convenience, we shall employ the following notation: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from left to
right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds the lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}
|


\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regular expression $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off; the second
is a character ${c_{i-1}}$, the character we want to inject; and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\mkeps(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
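The two functions above can be sketched in Python as follows (again an illustrative tuple encoding of our own, not from this report; a small \textit{nullable} is repeated here because \textit{mkeps} and \textit{inj} depend on it).

```python
# Illustrative sketch of mkeps and inj. Regular expressions are tuples
# such as ('CHAR', 'a'); values are tuples such as ('Char', 'a').

def nullable(r):
    tag = r[0]
    if tag in ('ONE', 'STAR'):
        return True
    if tag == 'ALT':
        return nullable(r[1]) or nullable(r[2])
    if tag == 'SEQ':
        return nullable(r[1]) and nullable(r[2])
    return False  # ZERO and CHAR

def mkeps(r):
    # builds the value for how a nullable r matches the empty string
    tag = r[0]
    if tag == 'ONE':
        return ('Empty',)
    if tag == 'ALT':
        return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))
    if tag == 'SEQ':
        return ('Seq', mkeps(r[1]), mkeps(r[2]))
    if tag == 'STAR':
        return ('Stars', [])

def inj(r, c, v):
    # injects character c back into v (a value for r\c), giving a value for r
    if r[0] == 'CHAR' and v[0] == 'Empty':
        return ('Char', c)
    if r[0] == 'ALT':
        if v[0] == 'Left':
            return ('Left', inj(r[1], c, v[1]))
        if v[0] == 'Right':
            return ('Right', inj(r[2], c, v[1]))
    if r[0] == 'SEQ':
        if v[0] == 'Seq':
            return ('Seq', inj(r[1], c, v[1]), v[2])
        if v[0] == 'Left':   # v = Left(Seq(v1, v2))
            return ('Seq', inj(r[1], c, v[1][1]), v[1][2])
        if v[0] == 'Right':  # r1 was nullable, c went into r2
            return ('Seq', mkeps(r[1]), inj(r[2], c, v[1]))
    if r[0] == 'STAR':       # v = Seq(v1, Stars vs)
        return ('Stars', [inj(r[1], c, v[1])] + v[2][1])
```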
|


\subsection*{Simplification of Regular Expressions}

The main drawback of building successive derivatives according
to Brzozowski's definition is that they can grow very quickly in size.
This is mainly due to the fact that the derivative operation often
generates ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if
implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
are excruciatingly slow. For example when starting with the regular
expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\textit{der}$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm.

Fortunately, one can simplify regular expressions after each derivative
step. Various simplifications of regular expressions are possible, such
as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
affect the answer for whether a regular expression matches a string or
not, but fortunately also do not affect the POSIX strategy of how
regular expressions match strings---although the latter is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
|

If we want the size of derivatives in Sulzmann and Lu's algorithm to
stay below the bound given by Antimirov's partial derivatives, we need
more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve the same size bound as that of the partial derivatives.

In order to implement the idea of ``spilling out alternatives'' and to
make them compatible with the $\textit{inj}$-mechanism, we use
\emph{bitcodes}. They were first introduced by Sulzmann and Lu.
Here bits and bitcodes (lists of bits) are defined as:

\begin{center}
$b ::= 1 \mid 0 \qquad
bs ::= [] \mid b:bs
$
\end{center}
|

\noindent
The $1$ and $0$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
$\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
$\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
$\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
code(\Stars\,vs)$
\end{tabular}
\end{center}
|

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that
bits are small and can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we
need the corresponding regular expression as extra information. This
means the decoding function is defined as:


|
%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
$\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
(\Left\,v, bs_1)$\\
$\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
(\Right\,v, bs_1)$\\
$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
$\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
& & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\

$\textit{decode}\,bs\,r$ & $\dn$ &
$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
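The coding and decoding functions can be sketched in Python as follows (our illustrative tuple encoding of values and regular expressions, not from this report; \textit{decode}$'$ is named \texttt{decode1}).

```python
# Illustrative sketch of code and decode. Values are tuples such as
# ('Seq', v1, v2); regular expressions such as ('SEQ', r1, r2).

def code(v):
    tag = v[0]
    if tag in ('Empty', 'Char'):
        return []                      # characters are thrown away
    if tag == 'Left':
        return [0] + code(v[1])
    if tag == 'Right':
        return [1] + code(v[1])
    if tag == 'Seq':
        return code(v[1]) + code(v[2])
    if tag == 'Stars':
        if not v[1]:
            return [0]                 # 0 marks the end of a star
        return [1] + code(v[1][0]) + code(('Stars', v[1][1:]))

def decode1(bs, r):
    # decode' in the text: returns a value together with the leftover bits
    tag = r[0]
    if tag == 'ONE':
        return ('Empty',), bs
    if tag == 'CHAR':
        return ('Char', r[1]), bs
    if tag == 'ALT':
        v, bs1 = decode1(bs[1:], r[1] if bs[0] == 0 else r[2])
        return (('Left', v) if bs[0] == 0 else ('Right', v)), bs1
    if tag == 'SEQ':
        v1, bs1 = decode1(bs, r[1])
        v2, bs2 = decode1(bs1, r[2])
        return ('Seq', v1, v2), bs2
    if tag == 'STAR':
        if bs[0] == 0:
            return ('Stars', []), bs[1:]
        v, bs1 = decode1(bs[1:], r[1])
        stars, bs2 = decode1(bs1, r)
        return ('Stars', [v] + stars[1]), bs2

def decode(bs, r):
    v, rest = decode1(bs, r)
    return v if rest == [] else None
```

A quick round-trip check: coding the value for how $a\cdot b^*$ matches \texttt{abb} gives $[1,1,0]$, and decoding that bitcode against the regular expression recovers the value.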
|

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}

\begin{center}
\begin{tabular}{lcl}
$\textit{a}$ & $::=$ & $\ZERO$\\
& $\mid$ & $_{bs}\ONE$\\
& $\mid$ & $_{bs}{\bf c}$\\
& $\mid$ & $_{bs}\oplus\,as$\\
& $\mid$ & $_{bs}a_1\cdot a_2$\\
& $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})

\noindent
where $bs$ stands for bitcodes, $a$ for \textbf{a}nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\oplus$) has been generalised to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.
|


To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
$(r_1 + r_2)^\uparrow$ & $\dn$ &
$_{[]}\oplus[\textit{fuse}\,[0]\,r_1^\uparrow,\,
\textit{fuse}\,[1]\,r_2^\uparrow]$\\
$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
$(r^*)^\uparrow$ & $\dn$ &
$_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
|
%\end{definition}

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
$_{bs @ bs'}\ONE$\\
$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
$_{bs@bs'}{\bf c}$\\
$\textit{fuse}\;bs\,_{bs'}\oplus\textit{as}$ & $\dn$ &
$_{bs@bs'}\oplus\textit{as}$\\
$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
$_{bs@bs'}a_1 \cdot a_2$\\
$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
$_{bs @ bs'}a^*$
\end{tabular}
\end{center}
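Internalisation and $\textit{fuse}$ can be sketched in Python as follows (an illustrative tuple encoding of our own, not from this report; annotated regular expressions carry a bit-list as their second component).

```python
# Illustrative sketch of fuse and internalisation. Annotated regular
# expressions are tuples carrying a bit-list bs, e.g. ('ACHAR', bs, c).

def fuse(bs, a):
    # attach bits to the front of an annotated regular expression
    tag = a[0]
    if tag == 'AZERO':
        return a
    if tag == 'AONE':
        return ('AONE', bs + a[1])
    if tag == 'ACHAR':
        return ('ACHAR', bs + a[1], a[2])
    if tag == 'AALTS':
        return ('AALTS', bs + a[1], a[2])
    if tag == 'ASEQ':
        return ('ASEQ', bs + a[1], a[2], a[3])
    if tag == 'ASTAR':
        return ('ASTAR', bs + a[1], a[2])

def internalise(r):
    # the up-arrow operation: lift an un-annotated regular expression
    tag = r[0]
    if tag == 'ZERO':
        return ('AZERO',)
    if tag == 'ONE':
        return ('AONE', [])
    if tag == 'CHAR':
        return ('ACHAR', [], r[1])
    if tag == 'ALT':
        return ('AALTS', [], [fuse([0], internalise(r[1])),
                              fuse([1], internalise(r[2]))])
    if tag == 'SEQ':
        return ('ASEQ', [], internalise(r[1]), internalise(r[2]))
    if tag == 'STAR':
        return ('ASTAR', [], internalise(r[1]))
```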
|

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:

|

\iffalse
%\begin{definition}{bder}
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
$(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
$(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
\textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
$(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
$\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
& &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
& &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
$(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
$\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
(\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}

\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
$(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
$(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
_{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\
$(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
$_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
$(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
& &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
& &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
$(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
$_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
(_{bs}\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}
\fi

|
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
_{bs}\ONE\;\textit{else}\;\ZERO$\\
$(_{bs}\oplus \;\textit{as})\,\backslash c$ & $\dn$ &
$_{bs}\oplus\;(\textit{as.map}(\backslash c))$\\
$(_{bs}a_1\cdot a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;_{bs}\oplus\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
& &$\phantom{\textit{then}\;_{bs}\oplus\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
$_{bs}(\textit{fuse}\, [1] \; (a\,\backslash c))\cdot
(_{[]}a^*)$
\end{tabular}
\end{center}
|

%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $1$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The sequence clause
is also more subtle: when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions, therefore we omit the
definition), and assuming that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
then the right branch of the alternative, which is $\textit{fuse} \; (\bmkeps \; a_1)\, (a_2
\backslash c)$, will collapse the regular expression $a_1$ (as it has
already been fully matched) and store the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing to it. The
bitsequence $\textit{bs}$, which was initially attached to the
first element of the sequence $a_1 \cdot a_2$, has
now been elevated to the top level of $\oplus$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:
|


%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
$\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
$\textit{bmkeps}\,(_{bs}\oplus a::\textit{as})$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a$\\
& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
& &$\textit{else}\;\textit{bmkeps}\,(_{bs}\oplus \textit{as})$\\
$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
$bs \,@\, [0]$
\end{tabular}
\end{center}
%\end{definition}
|

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using the bit $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:
|
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
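The annotated derivative together with $\textit{bmkeps}$ can be sketched in Python as follows (again our illustrative tuple encoding, not from this report; decoding is omitted, so the sketch returns the collected bitcode rather than a value; note that we attach the bit $1$ for an unfolded star iteration so that the collected bits agree with the coding function $\textit{code}$, where $1$ marks an iteration and $0$ the end of a star).

```python
# Illustrative sketch of the bitcoded derivative, bnullable, bmkeps and a
# blexer variant that returns the bitcode of the POSIX value (or None).

def fuse(bs, a):
    tag = a[0]
    if tag == 'AZERO':  return a
    if tag == 'AONE':   return ('AONE', bs + a[1])
    if tag == 'ACHAR':  return ('ACHAR', bs + a[1], a[2])
    if tag == 'AALTS':  return ('AALTS', bs + a[1], a[2])
    if tag == 'ASEQ':   return ('ASEQ', bs + a[1], a[2], a[3])
    if tag == 'ASTAR':  return ('ASTAR', bs + a[1], a[2])

def bnullable(a):
    tag = a[0]
    if tag in ('AONE', 'ASTAR'):  return True
    if tag == 'AALTS':            return any(bnullable(x) for x in a[2])
    if tag == 'ASEQ':             return bnullable(a[2]) and bnullable(a[3])
    return False                  # AZERO and ACHAR

def bmkeps(a):
    tag = a[0]
    if tag == 'AONE':   return a[1]
    if tag == 'ASEQ':   return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    if tag == 'ASTAR':  return a[1] + [0]       # 0 ends the star
    if tag == 'AALTS':  # first nullable alternative wins (POSIX)
        for x in a[2]:
            if bnullable(x):
                return a[1] + bmkeps(x)

def bder(c, a):
    tag = a[0]
    if tag in ('AZERO', 'AONE'):
        return ('AZERO',)
    if tag == 'ACHAR':
        return ('AONE', a[1]) if a[2] == c else ('AZERO',)
    if tag == 'AALTS':
        return ('AALTS', a[1], [bder(c, x) for x in a[2]])
    if tag == 'ASEQ':
        bs, a1, a2 = a[1], a[2], a[3]
        if bnullable(a1):
            return ('AALTS', bs, [('ASEQ', [], bder(c, a1), a2),
                                  fuse(bmkeps(a1), bder(c, a2))])
        return ('ASEQ', bs, bder(c, a1), a2)
    if tag == 'ASTAR':  # 1 marks one more iteration, matching code
        return ('ASEQ', a[1], fuse([1], bder(c, a[2])), ('ASTAR', [], a[2]))

def blex(a, s):
    # returns the bitcode for the POSIX value, or None if no match
    for c in s:
        a = bder(c, a)
    return bmkeps(a) if bnullable(a) else None
```

For $a^*$ and the string \texttt{aa} the collected bitcode is $[1,1,0]$, exactly the coding of $\Stars\,[\Char\,a, \Char\,a]$.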
|


\subsection*{Our Simplification Rules}

The main point of the bitcodes and annotated regular expressions is that
we can apply rather aggressive (in terms of size) simplification rules
in order to keep derivatives small. We have developed such
``aggressive'' simplification rules and generated test data that show
that the expected bound can be achieved. Obviously we could only
partially cover the search space as there are infinitely many regular
expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the $\oplus$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:
%\comment{Use $\ZERO$, $\ONE$ and so on.
%Is it $ALTS$ or $ALTS$?}\\

|
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
&&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\
\\
$\textit{simp} \; (_{bs}\oplus \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map}\;\textit{simp}\;\textit{as})) \; \textit{match} $ \\
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse}\;\textit{bs}\;a$ \\
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\oplus \textit{as'}$\\
\\
$\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}
|
\noindent
The simplification function does a pattern match on the regular
expression. When it detects that the regular expression is an
alternative or a sequence, it first simplifies its children regular
expressions recursively and then checks whether one of the children has
turned into $\ZERO$ or $\ONE$, which might trigger further
simplification at the current level. The most involved part is the
$\oplus$ clause, where we use two auxiliary functions
$\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all
later ones when it detects duplicates. Function $\textit{flatten}$
opens up nested $\oplus$s. Its recursive definition is given below:
|
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{flatten} \; (_{bs}\oplus \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
$\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\
$\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}
|
\noindent
Here $\textit{flatten}$ behaves like the traditional functional
programming flatten function, except that it also removes $\ZERO$s. Or
in terms of regular expressions, it removes parentheses, for example
changing $a+(b+c)$ into $a+b+c$.
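To make $\textit{simp}$, $\textit{flatten}$ and $\textit{distinct}$ concrete, here is a small Python sketch of the three functions (our own illustration, not the Scala code: the tuple encoding and the choice to compare duplicates by their erased form are assumptions; in the two $\ONE$-clauses we also fuse the $\ONE$'s own bits, as the worked examples later in this report do):

```python
# Annotated regular expressions as tuples, bitcodes as lists of 0/1 bits:
#   ("ZERO",), ("ONE", bs), ("CHR", bs, c),
#   ("ALTS", bs, [a1, a2, ...]), ("SEQ", bs, a1, a2), ("STAR", bs, a)

def fuse(bs, a):
    """Attach the bits bs to the front of a's top-level bitcode."""
    if a[0] == "ZERO": return a
    if a[0] == "ONE": return ("ONE", bs + a[1])
    if a[0] == "CHR": return ("CHR", bs + a[1], a[2])
    if a[0] == "ALTS": return ("ALTS", bs + a[1], a[2])
    if a[0] == "SEQ": return ("SEQ", bs + a[1], a[2], a[3])
    return ("STAR", bs + a[1], a[2])

def erase(a):
    """Drop all bitcodes, keeping the plain regular expression."""
    t = a[0]
    if t == "CHR": return ("CHR", a[2])
    if t == "ALTS": return ("ALTS", [erase(x) for x in a[2]])
    if t == "SEQ": return ("SEQ", erase(a[2]), erase(a[3]))
    if t == "STAR": return ("STAR", erase(a[2]))
    return (t,)                                   # ZERO, ONE

def flatten(xs):
    """Open up nested alternatives (fusing their bits) and drop ZEROs."""
    out = []
    for a in xs:
        if a[0] == "ZERO":
            continue
        if a[0] == "ALTS":
            out.extend(fuse(a[1], x) for x in a[2])
        else:
            out.append(a)
    return out

def distinct(xs):
    """Keep only the first copy among regexes equal up to their bits."""
    seen, out = [], []
    for a in xs:
        if erase(a) not in seen:
            seen.append(erase(a))
            out.append(a)
    return out

def simp(a):
    if a[0] == "SEQ":
        bs, a1, a2 = a[1], simp(a[2]), simp(a[3])
        if a1[0] == "ZERO" or a2[0] == "ZERO": return ("ZERO",)
        if a1[0] == "ONE": return fuse(bs + a1[1], a2)
        if a2[0] == "ONE": return fuse(bs + a2[1], a1)
        return ("SEQ", bs, a1, a2)
    if a[0] == "ALTS":
        xs = distinct(flatten([simp(x) for x in a[2]]))
        if not xs: return ("ZERO",)
        if len(xs) == 1: return fuse(a[1], xs[0])
        return ("ALTS", a[1], xs)
    return a
```

For instance, simplifying $_0\oplus[\ZERO,\; _1\oplus[_0a, _1b],\; _{10}a]$ removes the $\ZERO$, opens up the inner alternative while fusing its bit $_1$, and prunes the later duplicate copy of $a$.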
|
Suppose we apply simplification after each derivative step, and view
these two operations as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can use the previous natural
extension from derivatives w.r.t.~characters to derivatives
w.r.t.~strings: %\comment{simp in the [] case?}
|
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
|
\noindent
We obtain an optimised version of the algorithm:
|
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
|
\noindent
This algorithm keeps the regular expression size small: for example,
with this simplification the derivative in our previous $(a + aa)^*$
example, which had over 8000 nodes, is reduced to just 6 nodes, and the
size stays constant no matter how long the input string is.
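The growth, and the effect of simplification, can be reproduced with a small experiment. The following Python sketch (our own illustration using plain, unannotated regular expressions and a simplification in the same spirit; its node counts therefore differ from the exact figures quoted in this report) takes twelve derivatives of $(a+aa)^*$ with and without simplification:

```python
# Plain regular expressions as tuples:
#   ("ZERO",), ("ONE",), ("CHR", c), ("ALT", [r1, r2, ...]),
#   ("SEQ", r1, r2), ("STAR", r)

def nullable(r):
    t = r[0]
    if t in ("ONE", "STAR"): return True
    if t == "ALT": return any(nullable(x) for x in r[1])
    if t == "SEQ": return nullable(r[1]) and nullable(r[2])
    return False                                  # ZERO, CHR

def der(c, r):
    t = r[0]
    if t == "CHR": return ("ONE",) if r[1] == c else ("ZERO",)
    if t == "ALT": return ("ALT", [der(c, x) for x in r[1]])
    if t == "SEQ":
        if nullable(r[1]):
            return ("ALT", [("SEQ", der(c, r[1]), r[2]), der(c, r[2])])
        return ("SEQ", der(c, r[1]), r[2])
    if t == "STAR": return ("SEQ", der(c, r[1]), r)
    return ("ZERO",)                              # ZERO, ONE

def simp(r):
    t = r[0]
    if t == "SEQ":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1[0] == "ZERO" or r2[0] == "ZERO": return ("ZERO",)
        if r1[0] == "ONE": return r2
        if r2[0] == "ONE": return r1
        return ("SEQ", r1, r2)
    if t == "ALT":
        xs = []
        for x in (simp(y) for y in r[1]):
            if x[0] == "ZERO":
                continue
            for z in (x[1] if x[0] == "ALT" else [x]):  # flatten
                if z not in xs:                          # distinct
                    xs.append(z)
        if not xs: return ("ZERO",)
        if len(xs) == 1: return xs[0]
        return ("ALT", xs)
    return r

def size(r):
    t = r[0]
    if t == "ALT": return 1 + sum(size(x) for x in r[1])
    if t == "SEQ": return 1 + size(r[1]) + size(r[2])
    if t == "STAR": return 1 + size(r[1])
    return 1

# (a + aa)*
r0 = ("STAR", ("ALT", [("CHR", "a"),
                       ("SEQ", ("CHR", "a"), ("CHR", "a"))]))
plain, simped = r0, r0
for _ in range(12):
    plain = der("a", plain)            # grows rapidly
    simped = simp(der("a", simped))    # stays small
```

On this toy version the unsimplified derivative after twelve characters already has thousands of nodes, while the simplified one stabilises at a handful.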
|
\section{Introduction}
|
While we believe derivatives of regular expressions, written
$r\backslash s$, are a beautiful concept (in terms of ease of
implementing them in functional programming languages and in terms of
reasoning about them formally), they have one major drawback: every
derivative step can make regular expressions grow drastically in
size. This in turn has a negative effect on the runtime of the
corresponding lexing algorithms. Consider for example the regular
expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The
corresponding derivative already contains 8668 nodes, where we assume
the derivative is given as a tree. The reason for the poor runtime of
the derivative-based lexing algorithms is that they need to traverse
such trees over and over again. The solution is to find a complete set
of simplification rules that keep the sizes of derivatives uniformly
small.
|
For reasons beyond this report, it turns out that a complete set of
simplification rules depends on values being encoded as
bitsequences.\footnote{Values are the results the lexing algorithms
generate; they encode how a regular expression matched a string.} We
already know that the lexing algorithm using bitsequences but
\emph{without} simplification is correct, albeit horribly
slow. Therefore over the past six months I have been trying to prove
that the algorithm using bitsequences plus our simplification rules is
also correct. Formally this amounts to showing that
|
\begin{equation}\label{mainthm}
\blexers \; r \; s = \blexer \;r\;s
\end{equation}
|
\noindent
whereby $\blexers$ simplifies (makes derivatives smaller) in each
step, whereas with $\blexer$ the size can grow exponentially. This
would be an important milestone for my thesis, because we already
have a very good idea how to establish that our set of simplification
rules keeps the size of derivatives below a relatively tight bound.

In order to prove the main theorem in \eqref{mainthm}, we need to prove
that the two functions produce the same output. The definition of these
two functions is shown below.
|
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{lcl}
$\blexers \; r \, s$ &$\dn$ &
$\textit{let} \; a = (r^\uparrow)\backslash_{simp}\, s\; \textit{in}$\\
& & $\; \; \textit{if} \; \textit{bnullable}(a)$\\
& & $\; \; \textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\; \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
|
\noindent
In these definitions $(r^\uparrow)$ is a kind of coding function that
is the same in each case; similarly the \textit{decode} and
\textit{bmkeps} functions are the same in each case. Our main
theorem~\eqref{mainthm} therefore boils down to proving the following
two propositions (depending on which branch the if-else clause
takes). They establish that the derivatives \emph{with} simplification
do not change the computed result:
|
\begin{itemize}
\item{(a)} If a string $s$ is in the language $L(r)$, then \\
$\textit{bmkeps}\, ((r^\uparrow)\backslash_{simp}\,s) = \textit{bmkeps}\, ((r^\uparrow)\backslash s)$,\\
\item{(b)} If a string $s$ is \emph{not} in the language $L(r)$, then
$\rup \backslash_{simp} \,s$ is not nullable.
\end{itemize}
|
\noindent
We have already proved the second part in Isabelle. This is actually
not too difficult because we can show that simplification does not
change the language of simplified regular expressions.

If we can prove the first part, that is, that the bitsequence algorithm
with simplification produces the same result as the one without
simplification, then we are done. Unfortunately that part requires more
effort, because simplification does not only need to \emph{not} change
the language, but also not change the value (that is, the computed
result).
|
%\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
%Do you want to keep this? You essentially want to say that the old
%method used retrieve, which unfortunately cannot be adopted to
%the simplification rules. You could just say that and give an example.
%However you have to think about how you give the example....nobody knows
%about AZERO etc yet. Maybe it might be better to use normal regexes
%like $a + aa$, but annotate bitsequences as subscript like $_1(_0a + _1aa)$.

%\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
%REPLY:\\
%Yes, I am essentially saying that the old method
%cannot be adopted without adjustments.
%But this does not mean we should skip
%the proof of the bit-coded algorithm
%as it is still the main direction we are looking into
%to prove things. We are trying to modify
%the old proof to suit our needs, but not give
%up it totally, that is why i believe the old
%proof is fundamental in understanding
%what we are doing in the past 6 months.
%\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
|
\subsubsection*{Existing Proof}

For this we started by looking at the original proof that established
that the bitsequence algorithm produces the same result as the
algorithm not using bitsequences. Formally this proof established
|
\begin{equation}\label{lexer}
\blexer \; (r^\uparrow) s = \lexer \;r \;s
\end{equation}
|
%\noindent
%might provide us insight into proving
%\begin{center}
%$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
%\end{center}

\noindent
The proof uses two ``tricks''. One is that it uses a $\flex$-function
|
\begin{center}
\begin{tabular}{lcl}
$\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \; (r\backslash c) \;(\lambda v. f (\inj \; r \; c \; v)) \;s$ \\
$\textit{flex} \;r\; f\; [\,] $ & $\dn$ & $f$
\end{tabular}
\end{center}
|
\noindent
and then proves for the right-hand side in \eqref{lexer}

\begin{center}
$\lexer \;r\; s = \flex \;r\;\textit{id} \;s \;(\mkeps \; (r\backslash s))$
\end{center}
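To illustrate these two definitions, here is a Python sketch of $\flex$ together with the standard derivative, $\mkeps$ and $\inj$ functions for plain regular expressions (the tuple encodings of regular expressions and values are our own assumptions for illustration):

```python
# Plain regexes: ("ZERO",), ("ONE",), ("CHR", c), ("ALT", r1, r2),
#                ("SEQ", r1, r2), ("STAR", r)
# Values: ("Empty",), ("Char", c), ("Left", v), ("Right", v),
#         ("Sequ", v1, v2), ("Stars", [v, ...])

def nullable(r):
    t = r[0]
    if t in ("ONE", "STAR"): return True
    if t == "ALT": return nullable(r[1]) or nullable(r[2])
    if t == "SEQ": return nullable(r[1]) and nullable(r[2])
    return False                                  # ZERO, CHR

def der(c, r):
    t = r[0]
    if t == "CHR": return ("ONE",) if r[1] == c else ("ZERO",)
    if t == "ALT": return ("ALT", der(c, r[1]), der(c, r[2]))
    if t == "SEQ":
        if nullable(r[1]):
            return ("ALT", ("SEQ", der(c, r[1]), r[2]), der(c, r[2]))
        return ("SEQ", der(c, r[1]), r[2])
    if t == "STAR": return ("SEQ", der(c, r[1]), r)
    return ("ZERO",)                              # ZERO, ONE

def mkeps(r):
    t = r[0]
    if t == "ONE": return ("Empty",)
    if t == "ALT":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if t == "SEQ": return ("Sequ", mkeps(r[1]), mkeps(r[2]))
    return ("Stars", [])                          # STAR

def inj(r, c, v):
    t = r[0]
    if t == "CHR": return ("Char", r[1])                      # v = Empty
    if t == "ALT":
        if v[0] == "Left": return ("Left", inj(r[1], c, v[1]))
        return ("Right", inj(r[2], c, v[1]))
    if t == "SEQ":
        if v[0] == "Sequ": return ("Sequ", inj(r[1], c, v[1]), v[2])
        if v[0] == "Left":                                    # Left (Sequ v1 v2)
            return ("Sequ", inj(r[1], c, v[1][1]), v[1][2])
        return ("Sequ", mkeps(r[1]), inj(r[2], c, v[1]))      # Right v2
    # STAR: v = Sequ(v1, Stars vs)
    return ("Stars", [inj(r[1], c, v[1])] + v[2][1])

def flex(r, f, s):
    """Stack up one injection per character while taking derivatives."""
    if not s: return f
    c, rest = s[0], s[1:]
    return flex(der(c, r), lambda v: f(inj(r, c, v)), rest)

def ders(r, s):
    for c in s: r = der(c, r)
    return r

def lexer(r, s):
    d = ders(r, s)
    if not nullable(d): return None
    return flex(r, lambda v: v, s)(mkeps(d))
```

The last function is exactly the equation above: the stacked-up injections are applied to the value $\mkeps\;(r\backslash s)$.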
|
\noindent
The $\flex$-function essentially does lexing by stacking up injection
functions while doing derivatives, explicitly showing the order in
which the characters are injected back in each step.
|
%\noindent
%$\flex$ focuses on the injections instead of the derivatives ,
%compared to the original definition of $\lexer$, which puts equal
%amount of emphasis on injection and derivative with respect to each
%character:

%\begin{center}
%\begin{tabular}{lcl}
%$\textit{lexer} \; r\; (c\!::\!s) $ & $\dn$ & $\textit{case} \; \lexer \; (r\backslash c) \;s \; %\textit{of}$ \\
% & & $\textit{None} \; \Longrightarrow \; \textit{None}$\\
% & & $\textbar \; v \; \Longrightarrow \; \inj \; r\;c\;v$\\
%$\textit{lexer} \; r\; [\,] $ & $\dn$ & $\textit{if} \; \nullable (r) \; \textit{then} \; \mkeps% (r) \; \textit{else} \;None$
%\end{tabular}
%\end{center}
|
\noindent
The crux in the existing proof is how $\flex$ relates to injection, namely

\begin{center}
$\flex \; r \; id \; (s@[c]) \; v = \flex \; r \; id \; s \; (\inj \; (r\backslash s) \; c\; v)$.
\end{center}
|
\noindent
This property allows one to rewrite an induction hypothesis like

\begin{center}
$ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
\end{center}
|
\subsubsection{Retrieve Function}
The crucial point is to find the $\textit{POSIX}$ information of a
regular expression and how it is modified, augmented and propagated
during simplification, in parallel with the regular expression that
has not been simplified in the subsequent derivative operations. To
aid this, we use the helper function $\textit{retrieve}$ described by
Sulzmann and Lu:
|
\begin{center}
\begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
$\textit{retrieve}\,(_{bs}\ONE)\,\Empty$ & $\dn$ & $bs$\\
$\textit{retrieve}\,(_{bs}{\bf c})\,(\Char\,d)$ & $\dn$ & $bs$\\
$\textit{retrieve}\,(_{bs}\oplus a::as)\,(\Left\,v)$ & $\dn$ &
$bs \,@\, \textit{retrieve}\,a\,v$\\
$\textit{retrieve}\,(_{bs}\oplus a::as)\,(\Right\,v)$ & $\dn$ &
$\textit{bs} \,@\, \textit{retrieve}\,(_{[]}\oplus as)\,v$\\
$\textit{retrieve}\,(_{bs}a_1\cdot a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
$bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
$\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,[])$ & $\dn$ &
$bs \,@\, [0]$\\
$\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
\multicolumn{3}{l}{
\hspace{3cm}$bs \,@\, [1] \,@\, \textit{retrieve}\,a\,v\,@\,
\textit{retrieve}\,(_{[]}a^*)\,(\Stars\,vs)$}\\
\end{tabular}
\end{center}
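A Python transliteration of $\textit{retrieve}$ may make the clauses easier to follow (the tuple encodings of annotated regular expressions and values, and the treatment of list-alternatives via nested $\Left$/$\Right$ values, are our own reading):

```python
# Annotated regexes: ("ONE", bs), ("CHR", bs, c), ("ALTS", bs, [a, ...]),
#   ("SEQ", bs, a1, a2), ("STAR", bs, a); bitcodes bs are lists of 0/1.
# Values: ("Empty",), ("Char", c), ("Left", v), ("Right", v),
#   ("Sequ", v1, v2), ("Stars", [v, ...])

def retrieve(a, v):
    t = a[0]
    if t in ("ONE", "CHR"):                 # values Empty and Char d
        return a[1]
    if t == "ALTS":
        bs, alts = a[1], a[2]
        if v[0] == "Left":
            return bs + retrieve(alts[0], v[1])
        # Right: peel off the first alternative and recurse
        return bs + retrieve(("ALTS", [], alts[1:]), v[1])
    if t == "SEQ":
        return a[1] + retrieve(a[2], v[1]) + retrieve(a[3], v[2])
    # STAR, v = Stars [...]
    bs, body, vs = a[1], a[2], v[1]
    if not vs:
        return bs + [0]
    return (bs + [1] + retrieve(body, vs[0])
            + retrieve(("STAR", [], body), ("Stars", vs[1:])))

# the annotated regular expression _0(_0 ONE + _1 a^*)
a_ex = ("ALTS", [0], [("ONE", [0]), ("STAR", [1], ("CHR", [], "a"))])
```

For example, retrieving from $a_{ex}$ with the value $\Left(\Empty)$ assembles the bits $[0,0]$, and with a value describing one star iteration it assembles $[0,1,1,0]$.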
|
%\comment{Did not read further}\\
This function assembles the bitcode
%that corresponds to a lexical value for how
%the current derivative matches the suffix of the string(the characters that
%have not yet appeared, but will appear as the successive derivatives go on.
%How do we get this "future" information? By the value $v$, which is
%computed by a pass of the algorithm that uses
%$inj$ as described in the previous section).
using information from both the derivative regular expression and the
value. Sulzmann and Lu proposed this function, but did not prove
anything about it. Ausaf and Urban used it to connect the bitcoded
algorithm to the older algorithm by the following equation:
|
\begin{center}
$\inj \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\;
((r^\uparrow)\backslash c)\;v)\;r$
\end{center}
|
\noindent
whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf
and Urban also used this fact to prove the correctness of the bitcoded
algorithm without simplification. Our purpose in using it, however, is
to establish
|
\begin{center}
$ \textit{retrieve} \;
a \; v \;=\; \textit{retrieve} \; (\textit{simp}\,a) \; v'.$
\end{center}
|
The idea is that using $v'$, a simplified version of $v$ that has gone
through the same simplification steps as $\textit{simp}(a)$, we are able
to extract a bitcode that gives the same parsing information as the
unsimplified one. However, we noticed that constructing such a $v'$
from $v$ is not so straightforward. The point is that we might be able
to finally bridge the gap this way: by using a property of
$\textit{retrieve}$, the $\textit{RHS}$ of the above induction
hypothesis is
$\textit{decode}\, (\textit{retrieve}\, (r^\uparrow \backslash(s @ [c]))\; v)\; r$,
and this gives the main lemma result:
|
\begin{center}
$ \flex \;r\; id \; (s@[c]) \; v =\textit{decode}\,(\textit{retrieve}\, (\rup \backslash (s@[c])) \;v)\; r$
\end{center}
|
\noindent
To use this lemma result for our correctness proof, we simply replace
the $v$ in the $\textit{RHS}$ of the above equality with
$\mkeps\;(r\backslash (s@[c]))$, and apply the lemma that

\begin{center}
$\textit{decode} \; (\bmkeps \; \rup) \; r = \textit{decode} \; (\textit{retrieve} \; \rup \; (\mkeps\; r)) \;r$
\end{center}
|
\noindent
We then get the correctness of our bit-coded algorithm:
\begin{center}
$\flex \;r\; id \; s \; (\mkeps \; (r\backslash s)) = \textit{decode} \; (\bmkeps \; (\rup\backslash s)) \; r$
\end{center}
|
\noindent
The bridge between the above chain of equalities is the use of
$\retrieve$. If we want to use a similar technique for the simplified
version of the algorithm, we face the problem that in the above
equalities $\retrieve \; a \; v$ is not always defined. For example,
$\retrieve \; _0(_1a+_0a) \; \Left(\Empty)$
is defined, but not $\retrieve \; (_{01}a) \;\Left(\Empty)$,
even though we can extract the same POSIX
bits from the two annotated regular expressions.
The latter might occur when we try to retrieve from
a simplified regular expression using the same value
as for the unsimplified one.
This is because $\Left(\Empty)$ corresponds to
the regular expression structure $\ONE+r_2$ instead of
$\ONE$.
That means, if we want to prove that
|
\begin{center}
$\textit{decode} \; (\bmkeps \; (\rup\backslash s)) \; r = \textit{decode} \; (\bmkeps \; (\rup\backslash_{simp} s)) \; r$
\end{center}
|
\noindent
holds by using $\retrieve$, we probably need to prove an equality like
the one below:
\begin{center}
%$\retrieve \; \rup\backslash_{simp} s \; \mkeps(r\backslash_{simp} s)=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
$\retrieve \; \rup\backslash_{simp} s \; \mkeps(f(r\backslash s))=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
\end{center}
\noindent
Here $f$ rectifies $r\backslash s$ so that the value
$\mkeps(f(r\backslash s))$ becomes simple enough for the $\retrieve$
function to be defined on it.
|
\subsubsection{Ways to Rectify the Value}
One way to do this is to prove the following:
\begin{center}
$\retrieve \; \rup\backslash_{simp} s \; \mkeps(\simp(r\backslash s))=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
\end{center}
\noindent
The reason we choose $\simp$ as $f$ is that
$\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash \, s)$
have the same shape:
\begin{center}
$\erase (\rup\backslash_{simp} \, s) = \erase(\simp(\rup\backslash s))$
\end{center}
|
\noindent
The function $\erase$ in the above equality removes the bit-codes
from an annotated regular expression and keeps only the original
regular expression (just like ``erasing'' the bits); its definition is
omitted. $\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash s)$
are very closely related, but not identical.
|
\subsubsection{Example for $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$}
For example, let $r$ be the regular expression
$(a+b)(a+a^*)$ and $s$ be the string $aa$. Then
both $\erase (\rup\backslash_{simp} \, s)$ and $\erase (\simp (\rup\backslash s))$
are $\ONE + a^*$. However, without $\erase$,
\begin{center}
$\rup\backslash_{simp} \, s$ is equal to $_0(_0\ONE +_{11}a^*)$
\end{center}
\noindent
whereas
\begin{center}
$\simp(\rup\backslash s)$ is equal to $(_{00}\ONE +_{011}a^*)$
\end{center}
\noindent
(For the sake of visual simplicity, we use numbers to denote the bits
in bitcodes, as we previously defined for annotated
regular expressions: $\S$ is replaced by
subscript $_1$ and $\Z$ by $_0$.)
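The difference can be replayed concretely. The following self-contained Python sketch (our own stand-in for the Scala code; the bit conventions for internalisation and the star derivative follow the worked example in this section) implements $\fuse$, the bitcoded derivative, $\simp$ and $\bmkeps$, and reproduces the two results:

```python
# Annotated regexes: ("ZERO",), ("ONE", bs), ("CHR", bs, c),
# ("ALTS", bs, [a, ...]), ("SEQ", bs, a1, a2), ("STAR", bs, a); bits 0/1.

def fuse(bs, a):
    if a[0] == "ZERO": return a
    if a[0] == "ONE": return ("ONE", bs + a[1])
    if a[0] == "CHR": return ("CHR", bs + a[1], a[2])
    if a[0] == "ALTS": return ("ALTS", bs + a[1], a[2])
    if a[0] == "SEQ": return ("SEQ", bs + a[1], a[2], a[3])
    return ("STAR", bs + a[1], a[2])

def erase(a):
    t = a[0]
    if t == "CHR": return ("CHR", a[2])
    if t == "ALTS": return ("ALTS", [erase(x) for x in a[2]])
    if t == "SEQ": return ("SEQ", erase(a[2]), erase(a[3]))
    if t == "STAR": return ("STAR", erase(a[2]))
    return (t,)                                   # ZERO, ONE

def bnullable(a):
    t = a[0]
    if t in ("ONE", "STAR"): return True
    if t == "ALTS": return any(bnullable(x) for x in a[2])
    if t == "SEQ": return bnullable(a[2]) and bnullable(a[3])
    return False                                  # ZERO, CHR

def bmkeps(a):
    t = a[0]
    if t == "ONE": return a[1]
    if t == "ALTS": return a[1] + bmkeps(next(x for x in a[2] if bnullable(x)))
    if t == "SEQ": return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    return a[1] + [0]                             # STAR

def bder(c, a):
    t = a[0]
    if t == "CHR": return ("ONE", a[1]) if a[2] == c else ("ZERO",)
    if t == "ALTS": return ("ALTS", a[1], [bder(c, x) for x in a[2]])
    if t == "SEQ":
        bs, a1, a2 = a[1], a[2], a[3]
        if bnullable(a1):
            return ("ALTS", bs, [("SEQ", [], bder(c, a1), a2),
                                 fuse(bmkeps(a1), bder(c, a2))])
        return ("SEQ", bs, bder(c, a1), a2)
    if t == "STAR":
        return ("SEQ", a[1], fuse([1], bder(c, a[2])), ("STAR", [], a[2]))
    return ("ZERO",)                              # ZERO, ONE

def simp(a):
    if a[0] == "SEQ":
        bs, a1, a2 = a[1], simp(a[2]), simp(a[3])
        if a1[0] == "ZERO" or a2[0] == "ZERO": return ("ZERO",)
        if a1[0] == "ONE": return fuse(bs + a1[1], a2)  # fuse ONE's bits too
        if a2[0] == "ONE": return fuse(bs + a2[1], a1)
        return ("SEQ", bs, a1, a2)
    if a[0] == "ALTS":
        xs, seen = [], []
        for x in [simp(y) for y in a[2]]:
            opened = ([fuse(x[1], w) for w in x[2]] if x[0] == "ALTS"
                      else [] if x[0] == "ZERO" else [x])
            for z in opened:                      # flatten ...
                if erase(z) not in seen:          # ... and distinct
                    seen.append(erase(z))
                    xs.append(z)
        if not xs: return ("ZERO",)
        if len(xs) == 1: return fuse(a[1], xs[0])
        return ("ALTS", a[1], xs)
    return a

def bders(a, s, each=lambda x: x):
    for c in s:
        a = each(bder(c, a))
    return a

# r = (a+b)(a+a*), internalised
r = ("SEQ", [],
     ("ALTS", [], [("CHR", [0], "a"), ("CHR", [1], "b")]),
     ("ALTS", [], [("CHR", [0], "a"), ("STAR", [1], ("CHR", [], "a"))]))
lhs = bders(r, "aa", each=simp)    # simplify after every derivative
rhs = simp(bders(r, "aa"))         # simplify only at the end
```

Running it confirms the claims above: the two results erase to the same regular expression $\ONE + a^*$, carry their bits in different positions, and yet $\bmkeps$ extracts the same POSIX bits $[0,0]$ from both.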
|
What makes the difference?
|
%Two "rules" might be inferred from the above example.

%First, after erasing the bits the two regular expressions
%are exactly the same: both become $1+a^*$. Here the
%function $\simp$ exhibits the "one in the end equals many times
%at the front"
%property: one simplification in the end causes the
%same regular expression structure as
%successive simplifications done alongside derivatives.
%$\rup\backslash_{simp} \, s$ unfolds to
%$\simp((\simp(r\backslash a))\backslash a)$
%and $\simp(\rup\backslash s)$ unfolds to
%$\simp((r\backslash a)\backslash a)$. The one simplification
%in the latter causes the resulting regular expression to
%become $1+a^*$, exactly the same as the former with
%two simplifications.

%Second, the bit-codes are different, but they are essentially
%the same: if we push the outmost bits ${\bf_0}(_0\ONE +_{11}a^*)$ of $\rup\backslash_{simp} \, s$
%inside then we get $(_{00}\ONE +_{011}a^*)$, exactly the
%same as that of $\rup\backslash \, s$. And this difference
%does not matter when we try to apply $\bmkeps$ or $\retrieve$
%to it. This seems a good news if we want to use $\retrieve$
%to prove things.

%If we look into the difference above, we could see that the
%difference is not fundamental: the bits are just being moved
%around in a way that does not hurt the correctness.
|
During the first derivative operation,
$\rup\backslash a=(_0\ONE + \ZERO)(_0a + _1a^*)$ is
in the form of a sequence regular expression with
two components, the first
one, $\ONE + \ZERO$, being nullable.
Recall the definition of the simplification function:
|
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
&&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\
\\
$\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map}\;\textit{simp}\;\textit{as})) \; \textit{match} $ \\
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse}\;\textit{bs}\;a$ \\
&&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\
\\
$\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}
|
\noindent
and the definition of $\flatten$:
\begin{center}
\begin{tabular}{c c c}
$\flatten \; []$ & $\dn$ & $[]$\\
$\flatten \; (\ZERO::rs)$ & $\dn$ & $\flatten \; rs$\\
$\flatten \;(_{\textit{bs}_1}\oplus \textit{rs}_1 ::rs)$ & $\dn$ & $(\map \, (\fuse \, \textit{bs}_1) \,\textit{rs}_1) ::: \flatten(rs)$\\
$\flatten \; (r :: rs)$ & $\dn$ & $r::\flatten(rs)$
\end{tabular}
\end{center}
|
\noindent
If we call $\simp$ on $\rup\backslash a$, just as $\backslash_{simp}$
requires, then we would go through the third clause of
the sequence case: $\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$.
The $\ZERO$ in $(_0\ONE + \ZERO)$ is thrown away
by $\flatten$, and
$_0\ONE$ is merged into $(_0a + _1a^*)$ by simply
putting its bits ($_0$) at the front of the second component, giving
${\bf_0}(_0a + _1a^*)$.
After a second derivative operation,
namely, $(_0(_0a + _1a^*))\backslash a$, we get
$
_0(_0 \ONE + _1(_1\ONE \cdot a^*))
$, and this simplifies to $_0(_0 \ONE + _{11} a^*)$
by the third clause of the alternative case:
$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$.
The outermost bit $_0$ stays with
the outermost regular expression, rather than being fused to
its child regular expressions, as we will later see happens
to $\simp(\rup\backslash \, s)$.
|
If we choose not to simplify in the midst of derivative operations,
but only to do so at the end after the string has been exhausted,
namely, $\simp(\rup\backslash \, s)=\simp((\rup\backslash a)\backslash a)$,
then at the {\bf second} derivative of
$(\rup\backslash a)\bf{\backslash a}$
we will go through the following clause of $\backslash$:
\begin{center}
\begin{tabular}{lcl}
$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
$(\textit{when} \; \textit{bnullable}\,a_1)$\\
& &$\textit{ALTS}\,bs\,List(\;\;(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
& &$(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))\;\;)$\\
\end{tabular}
\end{center}
|
because
$\rup\backslash a = (_0\ONE + \ZERO)(_0a + _1a^*)$
is a sequence
with the first component being nullable
(unsimplified, unlike in the first round of running $\backslash_{simp}$).
Therefore $((_0\ONE + \ZERO)(_0a + _1a^*))\backslash a$ splits into
$([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_1\ONE \cdot a^*]))$.
After these two successive derivatives without simplification,
we apply $\simp$ to this regular expression, which goes through
the alternative clause, and each component of
$([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_1\ONE \cdot a^*]))$
is simplified, giving us the list $[\ZERO,\; _0(_0\ONE + _{11}a^*)]$.
This list is then ``flattened'': $\ZERO$ is
thrown away by $\textit{flatten}$, while $ _0(_0\ONE + _{11}a^*)$
is opened up to make a list consisting of two separate elements,
$_{00}\ONE$ and $_{011}a^*$; note that $\flatten$
$\fuse$s the bit(s) $_0$ to the front of $_0\ONE $ and $_{11}a^*$.
The order of simplification, which impacts the order in which
alternatives are opened up, causes
the bits to be moved differently.
|
\subsubsection{A Failed Attempt To Remedy the Problem Above}
A simple class of regular expression and string
pairs $(r, s)$ that trigger the difference between
$\rup\backslash_{simp} \, s$
and $\simp(\rup\backslash s)$ can be deduced from the above example:
\begin{center}
\begin{tabular}{lcl}
$D =\{ (r_1 \cdot r_2,\; [c_1c_2]) \mid $ & $\simp(r_2) = r_2, \simp(r_1 \backslash c_1) = \ONE,$\\
$r_1 \; \textit{not} \; \nullable, c_2 \in L(r_2),$ & $\exists \textit{rs},\textit{bs}.\; r_2 \backslash c_2 = \;_{bs}{\oplus rs},$\\
$\exists \textit{rs}_1. \; \simp(r_2 \backslash c_2) = \;_{bs}{\oplus \textit{rs}_1}$ & $\textit{and} \;\simp(r_1 \backslash [c_1c_2]) = \ZERO\}$\\
\end{tabular}
\end{center}
|
We take a pair $(r, \;s)$ from the set $D$.

Now we compute ${\bf \rup \backslash_{simp} s}$; we get:
|
\begin{center}
\begin{tabular}{lcl}
$(r_1\cdot r_2)\backslash_{simp} \, [c_1c_2]$ & $= \simp\left[ \big(\simp\left[ \left( r_1\cdot r_2 \right) \backslash c_1\right] \big)\backslash c_2\right]$\\
& $= \simp\left[ \big(\simp \left[ \left(r_1 \backslash c_1\right) \cdot r_2 \right] \big) \backslash c_2 \right]$\\
& $= \simp \left[ (\fuse \; \bmkeps(r_1\backslash c_1) \; \simp(r_2) ) \backslash c_2 \right]$\\
& $= \simp \left[ (\fuse \; \bmkeps(r_1\backslash c_1) \; r_2 ) \backslash c_2 \right]$
\end{tabular}
\end{center}
|
\noindent
From the definition of $D$ we know that $r_1 \backslash c_1$ is
nullable; therefore $\bmkeps(r_1\backslash c_1)$ returns a bitcode,
which we shall call $\textit{bs}_2$.
The above term can then be rewritten as
\begin{center}
$ \simp \left[ \fuse \; \textit{bs}_2\; (r_2 \backslash c_2) \right]$,
\end{center}
|
which is equal to
\begin{center}
$\simp \left[ \fuse \; \textit{bs}_2 \; _{bs}{\oplus rs} \right]$\\
$=\simp \left[ \; _{bs_2++bs}{\oplus rs} \right]$\\
$= \; _{bs_2++bs}{\oplus \textit{rs}_1} $
\end{center}
\noindent
by repeatedly using the properties from the set $D$. (The reason we
impose so many conditions on the pair $(r,s)$ is that these conditions
allow us to rewrite the terms easily and construct the difference.)
|
Now we compute ${\bf \simp(\rup \backslash s)}$:
\begin{center}
$\simp \big[(r_1\cdot r_2) \backslash [c_1c_2] \big]= \simp \left[ ((r_1 \cdot r_2 )\backslash c_1) \backslash c_2 \right]$
\end{center}
|
\noindent
Again, using the properties above, we obtain
the following chain of equalities:
\begin{center}
$\simp(\rup \backslash s)= \simp \left[ ((r_1 \cdot r_2 )\backslash c_1) \backslash c_2 \right]= \simp\left[ \big( \left(r_1 \backslash c_1\right) \cdot r_2 \big) \backslash c_2 \right]$\\
$= \simp \left[ \oplus[\big( \left(r_1 \backslash c_1\right) \backslash c_2 \big) \cdot r_2 \; , \; \fuse \; (\bmkeps \;r_1\backslash c_1) \; r_2 \backslash c_2 ] \right]$
\end{center}
|
\noindent
As before, we write $\textit{bs}_2$ for the bitcode returned by
$\bmkeps(r_1\backslash c_1)$.
Also, $\simp(r_2 \backslash c_2)$ is
$_{bs}\oplus \textit{rs}_1$,
and $((r_1 \backslash c_1) \backslash c_2) \cdot r_2$
simplifies to $\ZERO$,
so the above term can be expanded as
\begin{center}
\begin{tabular}{l}
$\textit{distinct}(\flatten[\ZERO\;, \; _{\textit{bs}_2++\textit{bs}}\oplus \textit{rs}_1] ) \; \textit{match} $ \\
$\textit{case} \; [] \Rightarrow \ZERO$ \\
$\textit{case} \; a :: [] \Rightarrow \fuse \; \textit{bs}\; a$ \\
$\textit{case} \; as' \Rightarrow \;_{[]}\oplus as'$\\
\end{tabular}
\end{center}
|
\noindent
Applying the definition of $\flatten$, we get
\begin{center}
$_{[]}\oplus (\textit{map} \; (\fuse\; (\textit{bs}_2 ++ \textit{bs}))\; \textit{rs}_1)$
\end{center}
\noindent
compared to the result
\begin{center}
$ \; _{bs_2++bs}{\oplus \textit{rs}_1} $
\end{center}
|
\noindent
Note how these two regular expressions
differ only in the position of the bits
$\textit{bs}_2++\textit{bs}$; they are otherwise the same.
What caused this difference?
The culprit is the $\flatten$ function, which spills
out the bitcodes of the inner alternatives when
there exists an outer alternative.
Note how the absence of simplification
caused $\simp(\rup \backslash s)$ to
generate the nested alternatives structure
\begin{center}
$ \oplus[\ZERO \;, \; _{bs}\oplus \textit{rs} ]$
\end{center}
and this will always trigger $\flatten$ to
spill out the inner alternative's bitcode $\textit{bs}$,
whereas when
simplification is done along the way,
the structure of nested alternatives is never created (we can
actually prove that the simplification function never allows nested
alternatives to happen; more on this later).
|

What if we did not allow the function $\simp$
to fuse out the bits when it is unnecessary?
For the above regular expression, we could
just delete the outer layer of alternatives
\begin{center}
\st{$ {\oplus[\ZERO \;,}$} $_{bs}\oplus \textit{rs}$ \st{$]$}
\end{center}
and get $_{bs}\oplus \textit{rs}$ instead, without
fusing the bits $\textit{bs}$ into every element
of $\textit{rs}$.
This idea can be realized by making the following
changes to the $\simp$-function:
|
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{simp}' \; (_{\textit{bs}}(a_1 \cdot a_2))$ & $\dn$ & $\textit{as} \; \simp \; \textit{was} \; \textit{before} $ \\
$\textit{simp}' \; (_{bs}\oplus as)$ & $\dn$ & \st{$\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $} \\
&&\st{$\quad\textit{case} \; [] \Rightarrow \ZERO$} \\
&&\st{$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$} \\
&&\st{$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$}\\
&&$\textit{if}(\textit{hollowAlternatives}( \textit{map \; simp \; as}))$\\
&&$\textit{then} \; \fuse \; \textit{bs}\; \textit{extractAlt}(\textit{map} \; \simp \; \textit{as} )$\\
&&$\textit{else} \; \simp(_{bs} \oplus \textit{as})$\\
$\textit{simp}' \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

|
\noindent
where $\textit{hollowAlternatives}$ and $\textit{extractAlt}$ are defined as follows:
\begin{center}
$\textit{hollowAlternatives}( \textit{rs}) \dn
\exists r = (_{\textit{bs}_1}\oplus \textit{rs}_1) \in \textit{rs}. \; \forall r' \in \textit{rs}, \;
\textit{either} \; r' = \ZERO \; \textit{or} \; r' = r $\\
$\textit{extractAlt}( \textit{rs}) \dn \textit{if}\big(
\exists r = (_{\textit{bs}_1}\oplus \textit{rs}_1) \in \textit{rs}. \; \forall r' \in \textit{rs}, \;
\textit{either} \; r' = \ZERO \; \textit{or} \; r' = r \big)\; \textit{then} \; \textit{return} \; r$
\end{center}
|
\noindent
Basically, $\textit{hollowAlternatives}$ captures the feature of
a list of regular expressions of the shape
\begin{center}
$ \oplus[\ZERO \;, \; _{bs}\oplus \textit{rs} ]$
\end{center}
and this means we can simply elevate the
inner regular expression $_{bs}\oplus \textit{rs}$
to the outermost level
and throw away the useless $\ZERO$s and
the outer $\oplus$ wrapper.
Using this new definition of $\simp$,
for the example where $r$ is the regular expression
$(a+b)(a+a^*)$ and $s$ is the string $aa$,
the problem of $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$
is resolved.
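A small Python sketch may make the intended change concrete. The tuple encoding and function names below are our own, not the verified Isabelle definitions, and the fall-back branch is only stubbed:

```python
# Annotated regular expressions as tuples:
#   ('ZERO',), ('ONE', bs), ('CHAR', bs, c), ('ALTS', bs, rs), ...
# where bitcodes bs are lists of 0/1.

def fuse(bs, a):
    """Prepend the bitcode bs to the top-level bits of a (fuse bs ZERO = ZERO)."""
    if a[0] == 'ZERO':
        return a
    return (a[0], bs + a[1]) + a[2:]

def hollow_alternatives(rs):
    """True iff every element of rs is ZERO or one and the same alternative."""
    alts = [r for r in rs if r[0] != 'ZERO']
    return bool(alts) and all(a[0] == 'ALTS' and a == alts[0] for a in alts)

def extract_alt(rs):
    """Return the single non-ZERO alternative in rs (assumes hollow_alternatives)."""
    return next(r for r in rs if r[0] == 'ALTS')

def simp_alts(bs, rs):
    """The modified ALTS clause of simp': elevate a lone inner alternative
    without fusing its bits into every child."""
    if hollow_alternatives(rs):
        return fuse(bs, extract_alt(rs))
    return ('ALTS', bs, rs)   # stub: here the old simp behaviour would apply

inner = ('ALTS', [1], [('CHAR', [0], 'a'), ('CHAR', [1], 'b')])
print(simp_alts([], [('ZERO',), inner]))
# the inner bits stay put; only the outer bits (here none) are fused
```

Note that the inner alternative survives with its bitcode intact, instead of having its bits pushed into every child as the old $\simp$ would do.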
|

Unfortunately this causes new problems:
consider the counterexample where
\begin{center}
$r$ is the regular expression
$(ab+(a^*+aa))$ and $s$ is the string $aa$.
\end{center}

\noindent
Here $\rup\backslash_{simp} \, s$ is equal to
$ _1(_{011}a^* + _1\ONE) $ whereas
$ \simp(\rup\backslash s) = (_{1011}a^* + _{11}\ONE)$.
This discrepancy does not appear for the old
version of $\simp$.

|
During the first derivative operation,
$\rup\backslash a=(_0\ONE + \ZERO)(_0a + _1a^*)$ is
a sequence regular expression with
two components, the first
of which, $(_0\ONE + \ZERO)$, is nullable.
Recall the definition of the simplification function:
|
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
&&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\
$\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse} \; bs \; a$ \\
&&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\
$\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}
|

\noindent
and the definition of $\flatten$:
\begin{center}
\begin{tabular}{c c c}
$\flatten \; []$ & $\dn$ & $[]$\\
$\flatten \; (\ZERO::rs)$ & $\dn$ & $\flatten \; rs$\\
$\flatten \;((_{\textit{bs}_1}\oplus \textit{rs}_1) ::rs)$ & $\dn$ & $(\map \, (\fuse \, \textit{bs}_1) \,\textit{rs}_1) ::: \flatten(rs)$\\
$\flatten \; (r :: rs)$ & $\dn$ & $r::\flatten(rs)$
\end{tabular}
\end{center}
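Transliterated into Python over a tuple encoding of annotated regular expressions (our own sketch, not the verified definitions), $\flatten$ and the $\fuse$ function it relies on look as follows:

```python
def fuse(bs, a):
    """Prepend the bitcode bs to the top-level bits of a (fuse bs ZERO = ZERO)."""
    if a[0] == 'ZERO':
        return a
    return (a[0], bs + a[1]) + a[2:]

def flatten(rs):
    """Throw away ZEROs and spill the children of nested alternatives,
    fusing the inner alternative's bits onto each spilled child."""
    if not rs:
        return []
    r, rest = rs[0], rs[1:]
    if r[0] == 'ZERO':
        return flatten(rest)
    if r[0] == 'ALTS':
        bs1, rs1 = r[1], r[2]
        return [fuse(bs1, x) for x in rs1] + flatten(rest)
    return [r] + flatten(rest)

# the situation from the text: flatten [ZERO, an annotated alternative]
rs1 = [('CHAR', [0], 'a'), ('CHAR', [1], 'b')]
print(flatten([('ZERO',), ('ALTS', [1, 1], rs1)]))
# -> [('CHAR', [1, 1, 0], 'a'), ('CHAR', [1, 1, 1], 'b')]
```

The example shows precisely the bit-spilling behaviour discussed above: the inner alternative's bits $[1,1]$ end up fused onto every one of its children.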
|

\noindent
If we call $\simp$ on $\rup\backslash a$,
we go through the third clause of
the sequence case: $\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$.
The $\ZERO$ in $(_0\ONE + \ZERO)$ is thrown away
by $\flatten$, and
$_0\ONE$ is merged into $(_0a + _1a^*)$ by simply
putting its bits ($_0$) in front of the second component:
${\bf_0}(_0a + _1a^*)$.
|
After a second derivative operation,
namely, $(_0(_0a + _1a^*))\backslash a$, we get
$ _0(_0 \ONE + _1(_1\ONE \cdot a^*)) $,
and this simplifies to $_0(_0 \ONE + _{11} a^*)$
by the third clause of the alternative case:
$\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$.
The outermost bit $_0$ stays with
the outermost regular expression, rather than being fused to
its child regular expressions, as we shall see happens
to $\simp(\rup\backslash \, s)$.
|
If we choose not to simplify in the midst of derivative operations,
but only do so at the end after the string has been exhausted,
namely, $\simp(\rup\backslash \, s)=\simp((\rup\backslash a)\backslash a)$,
then at the {\bf second} derivative of
$(\rup\backslash a)\bf{\backslash a}$
we go through the following clause of $\backslash$:
\begin{center}
\begin{tabular}{lcl}
$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
$(\textit{when} \; \textit{bnullable}\,a_1)$\\
& &$\textit{ALTS}\,bs\,List(\;\;(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
& &$(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))\;\;)$\\
\end{tabular}
\end{center}
|

because
$\rup\backslash a = (_0\ONE + \ZERO)(_0a + _1a^*)$
is a sequence
with the first component being nullable
(unsimplified, unlike in the first round of running $\backslash_{simp}$).
Therefore $((_0\ONE + \ZERO)(_0a + _1a^*))\backslash a$ splits into
$([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_1\ONE \cdot a^*]))$.
After these two successive derivatives without simplification,
we apply $\simp$ to this regular expression, which goes through
the alternative clause, and each component of
$([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_1\ONE \cdot a^*]))$
is simplified, giving us the list $[\ZERO, \; _0(_0\ONE + _{11}a^*)]$.
This list is then ``flattened'': $\ZERO$ is
thrown away by $\textit{flatten}$, while $ _0(_0\ONE + _{11}a^*)$
is opened up to make a list consisting of two separate elements
$_{00}\ONE$ and $_{011}a^*$; note that $\flatten$
$\fuse$s the bit $_0$ to the front of $_0\ONE $ and $_{11}a^*$.
The order of simplification, which impacts the order in which alternatives
are opened up, causes
the bits to be moved differently.
|

We have changed the algorithm to suppress the old
counterexample, but this gives rise to new counterexamples.
This dilemma means that this amendment is not a successful
attempt at making $\rup\backslash_{simp} \, s = \simp(\rup\backslash s)$
hold for every possible regular expression and string.
|
\subsection{Properties of the Function $\simp$}

We have proved in Isabelle quite a few properties
of the $\simp$-function, which help the proof go forward,
and we list them here to aid comprehension.

To start, we need a few auxiliary notions,
which are quite basic and are only written here
for clarity.
|

$\textit{sub}(r)$ computes the set of
sub-regular expressions of $r$:
\begin{center}
$\textit{sub}(\ONE) \dn \{\ONE\}$\\
$\textit{sub}(r_1 \cdot r_2) \dn \textit{sub}(r_1) \cup \textit{sub}(r_2) \cup \{r_1 \cdot r_2\}$\\
$\textit{sub}(r_1 + r_2) \dn \textit{sub}(r_1) \cup \textit{sub}(r_2) \cup \{r_1+r_2\}$\\
\end{center}
\begin{center}
$\textit{good}(r) \dn r \neq \ZERO \; \land \;
\forall r' \in \textit{sub}(r), \; \textit{if} \; r' = \;_{bs_1}\oplus \textit{rs}_1 \;
\textit{then} \; \nexists r'' \in \textit{rs}_1 \; s.t.\;
r'' = \;_{bs_2}\oplus \textit{rs}_2 $
\end{center}
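For illustration, $\textit{sub}$ can be transliterated into Python over a tuple encoding of regular expressions. The encoding is our own; the text gives the $\ONE$, $\cdot$ and $+$ clauses, and the star and base clauses below are our own obvious completion:

```python
def sub(r):
    """The set of sub-regular-expressions of r.  Tuples encode regexes:
    ('ZERO',), ('ONE',), ('CHAR', c), ('ALT', r1, r2), ('SEQ', r1, r2),
    ('STAR', r1)."""
    if r[0] in ('SEQ', 'ALT'):
        return sub(r[1]) | sub(r[2]) | {r}
    if r[0] == 'STAR':                 # our completion, not in the text
        return sub(r[1]) | {r}
    return {r}                         # base cases: ZERO, ONE, CHAR

r = ('SEQ', ('CHAR', 'a'), ('ALT', ('ONE',), ('CHAR', 'b')))
print(len(sub(r)))   # 5 sub-expressions, including r itself
```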
|

The properties are mainly the ones below:
\begin{itemize}
\item
\begin{center}
$\simp(\simp(r)) = \simp(r)$
\end{center}
\item
\begin{center}
$\textit{if} \; r = \simp(r') \; \textit{then} \; \textit{good}(r) $
\end{center}
\end{itemize}
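These two properties can be sanity-checked on small examples with a Python sketch of $\simp$ over a tuple encoding of annotated regular expressions. This is our own rendering, not the Isabelle definitions; in particular, this \texttt{distinct} compares whole annotated terms, whereas the verified version compares erased terms:

```python
def fuse(bs, a):
    if a[0] == 'ZERO':
        return a
    return (a[0], bs + a[1]) + a[2:]

def flatten(rs):
    out = []
    for r in rs:
        if r[0] == 'ZERO':
            continue
        if r[0] == 'ALTS':
            out.extend(fuse(r[1], x) for x in r[2])
        else:
            out.append(r)
    return out

def distinct(rs):
    out = []
    for r in rs:
        if r not in out:   # the verified version deduplicates erased terms
            out.append(r)
    return out

def simp(r):
    if r[0] == 'SEQ':
        bs, a1, a2 = r[1], simp(r[2]), simp(r[3])
        if a1[0] == 'ZERO' or a2[0] == 'ZERO':
            return ('ZERO',)
        if a1[0] == 'ONE':
            return fuse(bs + a1[1], a2)   # the ONE's own bits are kept
        return ('SEQ', bs, a1, a2)
    if r[0] == 'ALTS':
        rs = distinct(flatten([simp(x) for x in r[2]]))
        if not rs:
            return ('ZERO',)
        if len(rs) == 1:
            return fuse(r[1], rs[0])
        return ('ALTS', r[1], rs)
    return r

def good(r):
    """r is not ZERO and contains no directly nested alternatives."""
    if r[0] == 'ZERO':
        return False
    if r[0] == 'ALTS':
        return all(x[0] != 'ALTS' and good(x) for x in r[2])
    if r[0] == 'SEQ':
        return good(r[2]) and good(r[3])
    return True

nested = ('ALTS', [], [('ZERO',),
                       ('ALTS', [1], [('CHAR', [0], 'a'), ('CHAR', [1], 'b')])])
assert simp(simp(nested)) == simp(nested)   # idempotence
assert good(simp(nested))                   # simp produces good terms
```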
|
\subsection{The Contains Relation}
$\retrieve$ is too strong a function in that
it extracts only one bitcode instead of a set of them.
Therefore we try to define another relation (a predicate)
to capture the fact that the regular expression has bits
being moved around but still has all the bits needed.
The contains relation, written $\gg$,
takes two arguments in infix form
and returns a truth value.

|

In other words, from the set of regular expression and
value pairs
$\textit{RV} = \{(r, v) \mid r \text{ is a regular expression}, \; v \text{ is a value}\}$,
those that satisfy the following requirements are in the set
$\textit{RV}_{\textit{Contains}}$.
Unlike the $\retrieve$
function, which takes two arguments $r$ and $v$ and
produces a single answer $\textit{bs}$, the relation $\gg$ takes only
one regular expression argument $r$ and relates it to every bitcode that
can be generated by $r$.
|
\begin{center}
\begin{tabular}{llclll}
& & & $_{bs}\ONE$ & $\gg$ & $\textit{bs}$\\
& & & $_{bs}{\bf c}$ & $\gg$ & $\textit{bs}$\\
$\textit{if} \; r_1 \gg \textit{bs}_1$ & $\textit{and} \; r_2 \gg \textit{bs}_2$
& $\textit{then}$ &
$_{bs}(r_1 \cdot r_2)$ &
$\gg$ &
$\textit{bs} @ \textit{bs}_1 @ \textit{bs}_2$\\

$\textit{if} \; r \gg \textit{bs}_1$ & & $\textit{then}$ &
$_{bs}\oplus(r :: \textit{rs})$ &
$\gg$ &
$\textit{bs} @ \textit{bs}_1 $\\

$\textit{if} \; _{bs}(\oplus \textit{rs}) \gg \textit{bs} @ \textit{bs}_1$ & & $\textit{then}$ &
$_{bs}\oplus(r :: \textit{rs})$ &
$\gg$ &
$\textit{bs} @ \textit{bs}_1 $\\

$\textit{if} \; r \gg \textit{bs}_1\; \textit{and}$ & $_{bs}r^* \gg \textit{bs} @ \textit{bs}_2$ & $\textit{then}$ &
$_{bs}r^* $ &
$\gg$ &
$\textit{bs} @ [0] @ \textit{bs}_1@ \textit{bs}_2 $\\

& & & $_{bs}r^*$ & $\gg$ & $\textit{bs} @ [1]$\\
\end{tabular}
\end{center}
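As a sanity check of the rules, here is a small Python membership test for $r \gg \textit{bs}$. This is our own sketch over a tuple encoding: it decides the two list rules by checking whether some element of the alternative list produces the remaining bits, and it tries every split of the bitcode for sequences and stars, which is exponential but fine for illustration:

```python
def contains(r, bs):
    """Decide r >> bs: can the annotated regular expression r produce bs?
    Encoding: ('ZERO',), ('ONE', bs), ('CHAR', bs, c),
    ('SEQ', bs, r1, r2), ('ALTS', bs, rs), ('STAR', bs, r1)."""
    if r[0] in ('ONE', 'CHAR'):
        return bs == r[1]
    if r[0] == 'SEQ':
        _, b, r1, r2 = r
        if bs[:len(b)] != b:
            return False
        rest = bs[len(b):]
        # try every split rest = bs1 @ bs2
        return any(contains(r1, rest[:i]) and contains(r2, rest[i:])
                   for i in range(len(rest) + 1))
    if r[0] == 'ALTS':
        _, b, rs = r
        return bs[:len(b)] == b and any(contains(x, bs[len(b):]) for x in rs)
    if r[0] == 'STAR':
        _, b, r1 = r
        if bs[:len(b)] != b:
            return False
        rest = bs[len(b):]
        if rest == [1]:                    # bs r* >> bs @ [1]
            return True
        if rest and rest[0] == 0:          # bs r* >> bs @ [0] @ bs1 @ bs2
            body = rest[1:]
            return any(contains(r1, body[:i]) and contains(r, b + body[i:])
                       for i in range(len(body) + 1))
        return False
    return False                           # ZERO produces no bitcode

star = ('STAR', [], ('CHAR', [], 'a'))
print(contains(star, [0, 0, 1]))           # -> True: two iterations, then stop
```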
|
It is a predicate in the sense that given
a regular expression and a bitcode, it
returns true or false, depending on whether
or not the regular expression can actually produce that
bitcode. If the predicate returns true, then
we say that the regular expression $r$ contains
the bitcode $\textit{bs}$, written
$r \gg \textit{bs}$.
The $\gg$ operator with the
regular expression $r$ may also be seen as a
machine that takes the derivative of the regular expression
with respect to all strings simultaneously, collecting
the bits by going through the regular expression tree
structure in a depth-first manner, regardless of whether
the part being traversed is nullable or not.
It puts all the bits that can be produced by such a traversal
into a set.
|
For example, if we are given the regular expression
$((a+b)(c+d))^*$, the tree structure may be written as
\begin{center}
\begin{tikzpicture}
\tikz[tree layout]\graph[nodes={draw, circle}] {
* ->
{@-> {
{+1 ->
{a , b}
},
{+ ->
{c , d }
}
}
}
};
\end{tikzpicture}
\end{center}
|
\subsection{The $\textit{ders}_2$ Function}
If we want to prove the result
\begin{center}
$ \textit{blexer\_simp}(r, \; s) = \textit{blexer}(r, \; s)$
\end{center}
by induction
on the structure of the regular expression,
then we need to consider the case $r_1 \cdot r_2$,
where it would be good if we could express $(r_1 \cdot r_2) \backslash s$
in terms of $r_1 \backslash s$ and $r_2 \backslash s$.
This naturally induces the $\textit{ders}_2$ function,
which does a ``convolution'' of $r_1$ and $r_2$ over the string
$s$.
|
It is based on the observation that the derivative of $r_1 \cdot r_2$
with respect to a string $s$ can actually be written in an ``explicit form''
composed of derivatives of $r_1$ and derivatives of $r_2$.
This is illustrated by the following derivative calculation:
\begin{center}
$ (r_1 \cdot r_2) \backslash [c_1c_2] = (\textit{if} \; \nullable(r_1)\; \textit{then} \; ((r_1 \backslash c_1) \cdot r_2 + r_2 \backslash c_1) \; \textit{else} \; (r_1\backslash c_1) \cdot r_2) \backslash c_2$
\end{center}
which can be written informally in a ``convoluted sum''
format:
\begin{center}
$ (r_1 \cdot r_2) \backslash [c_1c_2] = \sum{r_1 \backslash s_i \cdot r_2 \backslash s_j}$
\end{center}
In a more precise manner, we should write
\begin{center}
$ (r_1 \cdot r_2) \backslash s = \big( (r_1 \backslash s) \cdot r_2 \big) \; + \sum\limits_{\substack{s_i @ s_j = s,\; s_j \neq [] \\ \nullable(r_1 \backslash s_i)}} r_2 \backslash s_j$
\end{center}
|
Note that this differs from the previous form in the sense that
it calculates the results $r_1\backslash s_i , r_2 \backslash s_j$ first and then glues them together
through nested alternatives, whereas the $(r_1 \cdot r_2) \backslash s$ procedure,
used by the algorithm $\lexer$, can only produce the
resulting alternatives regular expression as a whole, rather than
generating each of the child nodes one by one
in a single recursive call that is only for generating that
very expression itself.
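The claim that this explicit form agrees with the sequential derivative can be sanity-checked with a short Python sketch over plain (unannotated) regular expressions. The tuple encoding and the name \texttt{ders2} are ours, and agreement is checked semantically (by testing which continuations are accepted) rather than structurally:

```python
from functools import reduce

# plain regular expressions: ('ZERO',), ('ONE',), ('CHAR', c),
# ('ALT', r1, r2), ('SEQ', r1, r2), ('STAR', r)

def nullable(r):
    if r[0] in ('ONE', 'STAR'):
        return True
    if r[0] == 'ALT':
        return nullable(r[1]) or nullable(r[2])
    if r[0] == 'SEQ':
        return nullable(r[1]) and nullable(r[2])
    return False                      # ZERO, CHAR

def der(c, r):
    if r[0] == 'CHAR':
        return ('ONE',) if r[1] == c else ('ZERO',)
    if r[0] == 'ALT':
        return ('ALT', der(c, r[1]), der(c, r[2]))
    if r[0] == 'SEQ':
        left = ('SEQ', der(c, r[1]), r[2])
        return ('ALT', left, der(c, r[2])) if nullable(r[1]) else left
    if r[0] == 'STAR':
        return ('SEQ', der(c, r[1]), r)
    return ('ZERO',)                  # ZERO, ONE

def ders(s, r):
    return reduce(lambda acc, c: der(c, acc), s, r)

def ders2(s, r1, r2):
    """Explicit form of (r1 . r2)\\s: (r1\\s).r2 plus r2\\s_j for every
    split s = s_i @ s_j with s_j nonempty and r1\\s_i nullable."""
    res = ('SEQ', ders(s, r1), r2)
    for i in range(len(s)):
        if nullable(ders(s[:i], r1)):
            res = ('ALT', res, ders(s[i:], r2))
    return res

r1 = ('ALT', ('CHAR', 'a'), ('CHAR', 'b'))               # a + b
r2 = ('SEQ', ('CHAR', 'a'), ('STAR', ('CHAR', 'a')))     # a . a*
d_seq = ders('aa', ('SEQ', r1, r2))
d_two = ders2('aa', r1, r2)
# the two derivatives accept exactly the same continuations
for t in ['', 'a', 'b', 'aa', 'ab', 'ba']:
    assert nullable(ders(t, d_seq)) == nullable(ders(t, d_two))
```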
|
\section{Conclusion}
Under exhaustive tests we believe the main
result holds, yet a proof still seems elusive.
We have tried out different approaches, and
found a lot of properties of the function $\simp$.
The counterexamples where $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$
are also valuable in the sense that
they let us understand better why the two sides are not equal and what
the subtle differences are between a
regular expression simplified at every step and a
regular expression simplified only at the final moment.
We are almost there, but a last step is needed to make the proof work.
Hopefully in the next few weeks we will be able to find one.
|
%CONSTRUCTION SITE HERE
That is to say, despite the bits being moved around on the regular expression
(a difference in bits), the structure of the (unannotated) regular expression
after one simplification is exactly the same after the
same sequence of derivative operations,
regardless of whether we did simplification
along the way.
The function $\fuse$ is the culprit: it causes the order in which alternatives
are opened up to be remembered, and this is where the difference finally
appears between $\simp(\rup \backslash s)$ and $\rup \backslash_{simp} \,s$;
but we have to use $\fuse$, as it is essential to the simplification:
$\flatten$ needs it.
However, without $\erase$ the above equality does not hold:
for the regular expression
$(a+b)(a+a^*)$,
if we take the derivative with respect to the string $aa$,
we get results that differ in the positions of their bits, as computed above.

|
\subsection{Another Proof Strategy}
Note that sdddddr does not always equal sdsdsdsr
(here $s$ stands for one application of $\simp$ and $d$ for one derivative step).\\
This equivalence class method might still have the potential of proving this,
but not yet;
in parallel we tried another method of using $\retrieve$.\\
|

The $\textit{vsimp}$ function, defined as follows,
tries to simplify the value in lockstep with the
regular expression:\\

The problem here is that
|
we used $\retrieve$ for the key induction:
\begin{center}
$\textit{decode} \, (\retrieve \, (r\backslash (s @ [c])) \, v) \, r = \textit{decode} \, (\retrieve \, (r\backslash s) \, (\inj \, (r\backslash s) \, c \, v)) \, r$
\end{center}
Here, $\textit{decode}$ recovers a value that corresponds to a match (possibly partial)
from bits, and the bits are extracted by $\retrieve$;
the key value $v$ that guides $\retrieve$ is
$\mkeps \, (r\backslash s)$, $\inj \, r \, c \, (\mkeps \, (r\backslash s))$, $\inj (\inj (v))$, \ldots
|
The problem is that
we need $\textit{vsimp}$ to make a value that is suitable for decoding, i.e.\ one for which
\begin{center}
$\textit{Some}\,(\textit{flex} \; r \; \textit{id} \; (s@[c]) \; v) = \textit{Some}\,(\textit{flex} \; r \; \textit{id} \; s \; (\inj \, (r\backslash s) \, c \, v))$
\end{center}
|
Another way, which Christian came up with, might circumvent the
problem of finding a suitable value: instead of stating the $\textit{vsimp}$
function, include all possible values that a regular expression is able to produce in a set,
and prove that both $r$ and $sr$ are able to produce the bits that correspond to the POSIX value
produced by feeding the same initial regular expression $r$ and string $s$ to the
two functions $\textit{ders}$ and $\textit{ders\_simp}$.
|
If we define the equivalence relation $\sim_{m\epsilon}$ between two regular expressions
$r_1$ and $r_2$ as follows:
|
$r_1 \sim_{m\epsilon} r_2 \iff \bmkeps(r_1)= \bmkeps(r_2)$
(in other words, $r_1$ and $r_2$ produce the same output under the function $\bmkeps$),
|
then the first goal
might be restated as
$(r^\uparrow)\backslash_{simp}\, s \sim_{m\epsilon} (r^\uparrow)\backslash s$.
We tried to establish equivalences between further pairs of regular expressions,
such as dddr and dddsr, etc.,
but right now we are only able to establish the equivalence of dsr and dr, using structural induction on $r$.
Those involving multiple derivative operations are harder to prove.
|
Two attempts have been made:
(1) Induction on the number of derivative operations (or, in other words, on the length of the string $s$):
the inductive hypothesis was initially specified as
``for an arbitrary regular expression $r$ and
all strings $s$ in the language of $r$ whose length does not exceed
$n$, $\textit{ders} \; s \; r \sim_{m\epsilon} \textit{ders\_simp} \; s \; r$'',
and the proof goal may be stated as
``for an arbitrary regular expression $r$ and
all strings $s$ in the language of $r$ whose length does not exceed
$n+1$, $\textit{ders} \; s \; r \sim_{m\epsilon} \textit{ders\_simp} \; s \; r$''.
The problem here is that although we can easily break down
a string $s$ of length $n+1$ into $s_1@[c]$, it is not that easy
to use the induction hypothesis as a stepping stone, because $s_1$ may well not be
in the language $L(r)$. This inhibits us from obtaining the fact that
$\textit{ders} \; s_1 \; r \sim_{m\epsilon} \textit{ders\_simp} \; s_1 \; r$.
Further exploration is needed to amend this hypothesis so that it also includes the
situation when $\textit{ders} \; s_1 \; r$ is not nullable.
|
For example, what information (bits?
values?) can be extracted
from the regular expression $\textit{ders}(s_1,r)$ so that we can compute or predict the possible
result of $\bmkeps$ after another derivative operation? What function $f$ can be used to
carry out this task? The possible way of exploration can be
more directly perceived through the sketch below:
find a function $f$ such that
\begin{center}
\begin{tabular}{l}
$f(\textit{bders} \; s_1 \; r) = \textit{re}_1$\\
$f(\textit{bderss} \; s_1 \; r) = \textit{re}_2$\\
$\bmkeps(\textit{bders} \; s \; r) = g(\textit{re}_1, c)$\\
$\bmkeps(\textit{bderssimp} \; s \; r) = g(\textit{re}_2, c)$\\
and $g(\textit{re}_1, c) = g(\textit{re}_2, c)$.
\end{tabular}
\end{center}
|
The inductive hypothesis would be
``for all strings $s_1$ of length $\leq n$,
$f(\textit{bders} \; s_1 \; r) = \textit{re}_1$ and
$f(\textit{bderss} \; s_1 \; r) = \textit{re}_2$'';
proving this would be a lemma for the main proof, which would state that
``$\bmkeps(\textit{bders} \; s \; r) = g(\textit{re}_1, c)$ and
$\bmkeps(\textit{bderssimp} \; s \; r) = g(\textit{re}_2, c)$
for $s = s_1@[c]$''.
And $f$ needs to have a recursive property for the lemma to be provable:
it needs to store not only the ``after one char nullable'' information,
but also the ``after two chars nullable'' information,
and so on, so that it is able to predict what $f$ will compute after a derivative operation;
in other words, it needs to be ``infinitely recursive''.\\
|
To prove the lemma, in other words, to get
``for all strings $s_1$ of length $\leq n+1$,
$f(\textit{bders} \; s_1 \; r) = \textit{re}_3$ and
$f(\textit{bderss} \; s_1 \; r) = \textit{re}_4$''\\
from\\
``for all strings $s_1$ of length $\leq n$,
$f(\textit{bders} \; s_1 \; r) = \textit{re}_1$ and
$f(\textit{bderss} \; s_1 \; r) = \textit{re}_2$'',\\
it might be best to construct an auxiliary function $h$ such that\\
$h(\textit{re}_1, c) = \textit{re}_3$,\\
$h(\textit{re}_2, c) = \textit{re}_4$,\\
where $\textit{re}_3 = f(\textit{bder} \; c \; (\textit{bders} \; s_1 \; r))$ and\\
$\textit{re}_4 = f(\simp(\textit{bder} \; c \; (\textit{bderss} \; s_1 \; r)))$.
|
The key point here is that we are not only interested in what $\textit{bders} \; s \; r$ will produce under
$\bmkeps$, but also in how it will perform after one derivative operation and then $\bmkeps$, after two
derivative operations, and so on. In essence, we are preserving the regular expression
itself under the function $f$, in a less compact way than the regular expression itself: we are
not just recording but also interpreting what the regular expression matches.
In other words, we need to prove properties of $\textit{bderss} \; s \; r$ beyond the $\bmkeps$ result,
i.e., not just the nullable ones, but also those containing remaining characters.\\
|
(2) We observed the fact that
$\erase(\textit{sdddddr}) = \erase(\textit{sdsdsdsr})$,
that is to say, despite the bits being moved around on the regular expression
(a difference in bits), the structure of the (unannotated) regular expression
after one simplification is exactly the same after the
same sequence of derivative operations,
regardless of whether we did simplification
along the way.
However, without $\erase$ the above equality does not hold:
for the regular expression
$(a+b)(a+a^*)$,
if we take the derivative with respect to the string $aa$,
we get
|
%TODO

|

\noindent\rule[0.5ex]{\linewidth}{1pt}

|
\bibliographystyle{plain}
\bibliography{root}

\end{document}