algorithm to not just give a YES/NO answer for whether or not a
regular expression matches a string, but in case it does, also to
answer with \emph{how} it matches the string. This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. Being fast depends on a complete set of
simplification rules, some of which have been put forward by Sulzmann
and Lu. We have extended their rules in order to obtain a tight bound
on the size of regular expressions. We have tested these extended
rules, but have not yet formally established their correctness. We
also have not yet looked at extended regular expressions, such as
bounded repetitions, negation and back-references.
\end{abstract}

\section{Introduction}

While we believe derivatives of regular expressions are a beautiful
concept (in terms of the ease of implementing them in a functional
programming language and the ease of reasoning about them formally),
they have one major drawback: every derivative step can make regular
expressions grow drastically in size. This in turn has negative
effects on the runtime of the corresponding lexing algorithms.
Consider for example the regular expression $(a+aa)^*$ and the short
string $aaaaaaaaaaaa$. The size of the corresponding derivative is
already 8668 nodes when the derivative is seen as a tree. The reason
for the poor runtime of the lexing algorithms is that they need to
traverse such trees over and over again. The solution is to find a
complete set of simplification rules that keep the sizes of
derivatives uniformly small.

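To make the growth tangible, the following small Scala sketch (our
illustration, not the report's implementation; the names
\texttt{Rexp}, \texttt{der} and \texttt{size} are chosen ad hoc)
computes plain Brzozowski derivatives without any simplification and
measures their tree size. Iterating it over $(a+aa)^*$ and a string
of twelve $a$s already yields a derivative with several thousand
nodes; the exact count, like the 8668 quoted above, depends on the
precise derivative and size definitions used.

\begin{verbatim}
// Illustrative sketch only: plain derivatives without simplification,
// used to observe how quickly derivatives grow in size.
object DerivativeGrowth {
  sealed trait Rexp
  case object ZERO extends Rexp                  // matches no string
  case object ONE extends Rexp                   // matches the empty string
  case class CHAR(c: Char) extends Rexp
  case class ALT(r1: Rexp, r2: Rexp) extends Rexp
  case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
  case class STAR(r: Rexp) extends Rexp

  def nullable(r: Rexp): Boolean = r match {
    case ZERO => false
    case ONE => true
    case CHAR(_) => false
    case ALT(r1, r2) => nullable(r1) || nullable(r2)
    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
    case STAR(_) => true
  }

  def der(c: Char, r: Rexp): Rexp = r match {
    case ZERO => ZERO
    case ONE => ZERO
    case CHAR(d) => if (c == d) ONE else ZERO
    case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
    case SEQ(r1, r2) =>
      if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
      else SEQ(der(c, r1), r2)
    case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  }

  // size of a regular expression seen as a tree
  def size(r: Rexp): Int = r match {
    case ALT(r1, r2) => 1 + size(r1) + size(r2)
    case SEQ(r1, r2) => 1 + size(r1) + size(r2)
    case STAR(r1)    => 1 + size(r1)
    case _           => 1
  }

  def main(args: Array[String]): Unit = {
    val r: Rexp = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))  // (a+aa)*
    val d = "aaaaaaaaaaaa".foldLeft(r)((d, c) => der(c, d))
    println(size(d))  // several thousand nodes after only twelve characters
  }
}
\end{verbatim}
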
For reasons beyond this report, it turns out that a complete set of
simplification rules depends on values being encoded as bitsequences.
(Values are the results the lexing algorithms generate; they encode
how a regular expression matched a string. For instance, for the
regular expression $a + aa$ and the string $aa$, a value records that
the right alternative matched the two characters.) We already know
that the lexing algorithm \emph{without} simplification is correct.
Therefore in the past six months we were trying to prove that the
algorithm using bitsequences plus our simplification rules is
correct. Formally this amounts to showing that

\begin{equation}\label{mainthm}
\blexers \; r \; s = \blexer \;r\;s
\end{equation}

\noindent
whereby $\blexers$ simplifies (makes derivatives smaller) in each
step, whereas with $\blexer$ the size can grow exponentially. This
would be an important milestone, because we already have a very good
idea how to establish that our set of simplification rules keeps the
size of derivatives below a relatively tight bound.

In order to prove the main theorem \eqref{mainthm}, we need to prove
that the two functions produce the same output. The definition of
these functions is shown below.

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{lcl}
  $\blexers \; r \, s$ & $\dn$ &
      $\textit{let} \; a = (r^\uparrow)\backslash_{simp}\, s\; \textit{in}$\\
  & & $\;\;\textit{if} \; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
In these definitions $(r^\uparrow)$ is a kind of coding function that
is the same in both cases; similarly the $\textit{decode}$ and
$\textit{bmkeps}$ functions. Our main theorem \eqref{mainthm}
therefore boils down to proving the following two propositions
(depending on which branch the if-else clause takes). They establish
that the derivatives \emph{with} simplification do not change the
computed result:

\begin{itemize}
\item{} If a string $s$ is in the language $L(r)$, then \\
$\textit{bmkeps}\,((r^\uparrow)\backslash_{simp}\,s) = \textit{bmkeps}\,((r^\uparrow)\backslash s)$;
\item{} If a string $s$ is \emph{not} in the language $L(r)$, then
$\rup \backslash_{simp}\,s$ is not nullable.
\end{itemize}

\noindent
We have already proved the second part in Isabelle. This is actually
not too difficult, because we can show that simplification does not
change the language of regular expressions. If we can also prove the
first part, that is, that the bitsequence algorithm with
simplification produces the same result as the one without
simplification, then we are done. Unfortunately that part requires
more effort, because simplification does not only need to \emph{not}
change the language, but also not change the value (the computed
result).

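To make this concrete, here is a small illustration of ours (not
taken from the report), writing $\mathbf{1}$ for the regular
expression that matches only the empty string and using the value
constructors $\textit{Right}$, $\textit{Seq}$, $\textit{Empty}$ and
$\textit{Char}$, which may differ slightly from the report's
notation. The derivative of $a + aa$ with respect to the character
$a$ is

\begin{center}
$(a + aa)\backslash a \;=\; \mathbf{1} + (\mathbf{1} \cdot a)$
\end{center}

\noindent
A natural simplification rewrites $\mathbf{1} \cdot a$ to just $a$,
giving $\mathbf{1} + a$. Both expressions accept exactly the same
strings, so the language is unchanged. However, matching the
remaining input $a$ produces the value
$\textit{Right}(\textit{Seq}\,\textit{Empty}\,(\textit{Char}\,a))$
for the unsimplified derivative, but $\textit{Right}(\textit{Char}\,a)$
for the simplified one. The bitsequence annotations are there to
compensate for exactly this kind of difference, so that
$\textit{decode}$ can still reconstruct the value expected for the
original regular expression.
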
The existing correctness proof for the bit-coded algorithm
\emph{without} simplification relies on the function
$\textit{retrieve}$, which unfortunately cannot be adopted to our
simplification rules without adjustments (a sketch of the problem is
given below). We are not abandoning this proof, but modifying it to
suit our needs; understanding it is therefore fundamental for what we
have been doing in the past six months. The definition of $\lexer$
and its correctness proof can be found in
\cite{AusafDyckhoffUrban2016}.

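As a hedged sketch of the problem (our illustration, borrowing a
subscript notation $_{bs}r$ for a regular expression $r$ annotated
with the bitsequence $bs$; the exact definitions may differ from the
report's): the function $\textit{retrieve}$ traverses an annotated
regular expression and a value in lockstep in order to collect the
annotated bits. For an alternative such as $_{[]}(_{0}a + _{1}aa)$
and a value of the form $\textit{Right}(\ldots)$ this works, because
the shape of the value mirrors the shape of the expression. After
simplification, however, the alternative constructor may have
disappeared, while the value produced by the unsimplified derivative
still has the shape $\textit{Right}(\ldots)$; the two no longer fit
together and $\textit{retrieve}$ as originally defined cannot be
applied unchanged.
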
The correctness proof of
\begin{center}
$\blexer \; r^\uparrow s = \lexer \;r \;s$
\end{center}
\noindent