57 % SECTION strongsimp |
57 % SECTION strongsimp |
58 %---------------------------------------------------------------------------------------- |
58 %---------------------------------------------------------------------------------------- |
59 \section{A Stronger Version of Simplification} |
59 \section{A Stronger Version of Simplification} |
60 %TODO: search for isabelle proofs of algorithms that check equivalence |
60 %TODO: search for isabelle proofs of algorithms that check equivalence |
61 In our bitcoded lexing algorithm, (sub)terms represent (sub)matches. |
61 In our bitcoded lexing algorithm, (sub)terms represent (sub)matches. |
62 For example, the regular expression $aa \cdot a^*+ a \cdot a^*$ contains two terms, |
62 For example, the regular expression |
63 which expresses two possibilities it will match future input. |
63 \[ |
64 It will either match 2 $a$'s, then 0 or more $a$'s, in other words, at least 2 more $a$'s |
64 aa \cdot a^*+ a \cdot a^* + aa\cdot a^* |
65 \begin{figure}\label{string_3parts1} |
65 \] |
66 \begin{center} |
66 contains three terms, |
67 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
67 expressing three possibilities it will match future input. |
68 \node [rectangle split, rectangle split horizontal, rectangle split parts=3] |
68 The first and the third terms are identical, which means we can eliminate |
69 {Consumed Input |
69 the latter as we know it will not be picked up by $\bmkeps$. |
70 \nodepart{two} Expects $aa$ |
70 In $\bsimps$, the $\distinctBy$ function takes care of this. |
71 \nodepart{three} Then expects $a^*$}; |
71 The criteria $\distinctBy$ uses for removing a duplicate |
72 |
72 $a_2$ in the list |
|
73 \begin{center} |
|
74 $rs_a@[a_1]@rs_b@[a_2]@rs_c$ |
|
75 \end{center} |
|
76 is that |
|
77 \begin{center} |
|
78 $\rerase{a_1} = \rerase{a_2}$. |
|
79 \end{center} |
|
80 It can be characterised as the $LD$ |
|
81 rewrite rule in \ref{rrewriteRules}.\\ |
|
82 The problem , however, is that identical components |
|
83 in two slightly different regular expressions cannot be removed: |
|
84 \begin{figure}[H] |
|
85 \[ |
|
86 (a+b+d) \cdot r_1 + (a+c+e) \cdot r_1 \stackrel{?}{\rightsquigarrow} (a+b+d) \cdot r_1 + (c+e) \cdot r_1 |
|
87 \] |
|
88 \caption{Desired simplification, but not done in $\blexersimp$}\label{partialDedup} |
|
89 \end{figure} |
|
90 \noindent |
|
91 A simplification like this actually |
|
92 cannot be omitted, |
|
93 as without it the size could blow up even with our $\simp$ function: |
|
94 \begin{figure}[H] |
|
95 \centering |
|
96 \begin{tikzpicture} |
|
97 \begin{axis}[ |
|
98 %xlabel={$n$}, |
|
99 myplotstyle, |
|
100 xlabel={input length}, |
|
101 ylabel={size}, |
|
102 ] |
|
103 \addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data}; |
|
104 \end{axis} |
73 \end{tikzpicture} |
105 \end{tikzpicture} |
74 \end{center} |
106 \caption{Runtime of $\blexersimp$ for matching |
75 \end{figure} |
107 $\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ |
76 \noindent |
108 with strings |
77 Or it will match at least 1 more $a$: |
109 of the form $\protect\underbrace{aa..a}_{n}$.}\label{blexerExp} |
78 \begin{figure}\label{string_3parts2} |
110 \end{figure} |
79 \begin{center} |
111 \noindent |
80 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
112 We would like to apply the rewriting at some stage |
81 \node [rectangle split, rectangle split horizontal, rectangle split parts=3] |
113 \begin{figure}[H] |
82 {Consumed |
|
83 \nodepart{two} Expects $a$ |
|
84 \nodepart{three} Then expects $a^*$}; |
|
85 |
|
86 \end{tikzpicture} |
|
87 \end{center} |
|
88 \end{figure} |
|
89 If these two possibilities are identical, we can eliminate |
|
90 the second one as we know the second one corresponds to |
|
91 a match that is not $\POSIX$. |
|
92 |
|
93 |
|
94 If two identical regexes |
|
95 happen to be grouped into different sequences, one cannot use |
|
96 the $a + a \rightsquigarrow a$ rule to eliminate them, even if they |
|
97 are "parallel" to each other: |
|
98 \[ |
114 \[ |
99 (a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2 |
115 (a+b+d) \cdot r_1 \longrightarrow a \cdot r_1 + b \cdot r_1 + d \cdot r_1 |
100 \] |
116 \] |
101 and that's because they are followed by possibly |
117 \caption{Desired simplification, but not done in $\blexersimp$}\label{desiredSimp} |
102 different "suffixes" $r_1$ and $r_2$, and if |
118 \end{figure} |
103 they do turn out to be different then doing |
119 \noindent |
104 \[ |
120 in our $\simp$ function, |
105 (a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2 |
121 so that it makes the simplification in \ref{partialDedup} possible. |
106 \] |
122 For example, |
107 might cause a possible match where the string is in $L(a \cdot r_2)$ to be lost. |
123 can a function like the below one be used? |
108 |
124 %TODO: simp' that spills things |
109 Here is an example for this. |
125 |
110 Assume we have the derivative regex |
126 Unfortunately, |
111 \[( r_1 + r_2 + r_3)\cdot r+ |
127 if we introduce them in our |
112 ( r_1 + r_5 + r_6)\cdot( r_1 + r_2 + r_3)^* |
128 setting we would lose the POSIX property of our calculated values. |
113 \] |
129 For example given the regular expression |
114 |
130 \begin{center} |
115 occurring in an intermediate step in our bit-coded lexing algorithm. |
131 $(a + ab)(bc + c)$ |
116 |
132 \end{center} |
117 In this expression, there will be 6 "parallel" terms if we break up regexes |
133 and the string |
118 of shape $(a+b)\cdot c$ using the distributivity law (into $ac + bc$). |
134 \begin{center} |
119 \begin{align} |
135 $ab$, |
120 (\nonumber \\ |
136 \end{center} |
121 r_1 + & \label{term:1} \\ |
137 then our algorithm generates the following |
122 r_2 + & \label{term:2} \\ |
138 correct POSIX value |
123 r_3 & \label{term:3} \\ |
139 \begin{center} |
124 & )\cdot r\; + \; (\nonumber \\ |
140 $\Seq \; (\Right \; ab) \; (\Right \; c)$. |
125 r_1 + & \label{term:4} \\ |
141 \end{center} |
126 r_5 + & \label{term:5} \\ |
142 Essentially it matches the string with the longer Right-alternative |
127 r_6 \label{term:6}\\ |
143 in the first sequence (and |
128 & )\cdot r\nonumber |
144 then the 'rest' with the character regular expression $c$ from the second sequence). |
129 \end{align} |
145 If we add the simplification above, then we obtain the following value |
130 |
146 \begin{center} |
131 They have been labelled, and each label denotes |
147 $\Left \; (\Seq \; a \; (\Left \; bc))$ |
132 one term, for example, \ref{term:1} denotes |
148 \end{center} |
133 \[ |
149 where the $\Left$-alternatives get priority. |
134 r_1 \cdot r |
150 However this violates the POSIX rules. |
135 \] |
151 The reason for getting this undesired value |
136 \noindent |
152 is that the new rule splits this regular expression up into |
137 and \ref{term:3} denotes |
153 \begin{center} |
138 \[ |
154 $a\cdot(b c + c) + ab \cdot (bc + c)$, |
139 r_3\cdot r. |
155 \end{center} |
140 \] |
156 which becomes a regular expression with a |
141 From a lexer's point of view, \ref{term:4} will never |
157 totally different structure--the original |
142 be picked up in later phases of matching because there |
158 was a sequence, and now it becomes an alternative. |
143 is a term \ref{term:1} giving identical matching information. |
159 With an alternative the maximum munch rule no longer works.\\ |
144 The first term \ref{term:1} will match a string in $L(r_1 \cdot r)$, |
160 A method to reconcile this is to do the |
145 and so on for term \ref{term:2}, \ref{term:3}, etc. |
161 transformation in \ref{desiredSimp} ``non-invasively'', |
146 |
162 meaning that we traverse the list of regular expressions |
147 \mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$} |
163 \begin{center} |
148 \begin{center}\label{string_2parts} |
164 $rs_a@[a]@rs_c$ |
149 |
165 \end{center} |
150 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
166 in the alternative |
151 \node [rectangle split, rectangle split horizontal, rectangle split parts=2] |
167 \begin{center} |
152 {$r_{x1}$ |
168 $\sum ( rs_a@[a]@rs_c)$ |
153 \nodepart{two} $r_1\cdot r$ }; |
169 \end{center} |
154 %\caption{term 1 \ref{term:1}'s matching configuration} |
170 using a function similar to $\distinctBy$, |
155 \end{tikzpicture} |
171 but this time |
156 |
172 we allow a more general list rewrite: |
157 \end{center} |
173 \begin{mathpar} |
158 For term 1 \ref{term:1}, whatever was before the current |
174 \inferrule{\vspace{0mm} }{rs_a@[a]@rs_c |
159 input position was fully matched and the regex corresponding |
175 \stackrel{s}{\rightsquigarrow } |
160 to it reduced to $\ONE$, |
176 rs_a@[\textit{prune} \; a \; rs_a]@rs_c } |
161 and in the next input token, it will start on $r_1\cdot r$. |
177 \end{mathpar} |
162 The resulting value will be something of a similar configuration: |
178 %L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a) |
163 \begin{center}\label{value_2parts} |
179 where $\textit{prune} \;a \; acc$ traverses $a$ |
164 %\caption{term 1 \ref{term:1}'s matching configuration} |
180 without altering the structure of $a$, removing components in $a$ |
165 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
181 that have appeared in the accumulator $acc$. |
166 \node [rectangle split, rectangle split horizontal, rectangle split parts=2] |
182 For example |
167 {$v_{x1}$ |
183 \begin{center} |
168 \nodepart{two} $v_{r_1\cdot r}$ }; |
184 $\textit{prune} \;\;\; (r_a+r_f+r_g+r_h)r_d \;\; \; [(r_a+r_b+r_c)r_d, (r_e+r_f)r_d] $ |
169 \end{tikzpicture} |
185 \end{center} |
170 \end{center} |
186 should be equal to |
171 For term 2 \ref{term:2} we have a similar value partition: |
187 \begin{center} |
172 \begin{center}\label{value_2parts2} |
188 $(r_g+r_h)r_d$ |
173 %\caption{term 1 \ref{term:1}'s matching configuration} |
189 \end{center} |
174 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
190 because $r_gr_d$ and |
175 \node [rectangle split, rectangle split horizontal, rectangle split parts=2] |
191 $r_hr_d$ are the only terms |
176 {$v_{x2}$ |
192 that have not appeared in the accumulator list |
177 \nodepart{two} $v_{r_2\cdot r}$ }; |
193 \begin{center} |
178 \end{tikzpicture} |
194 $[(r_a+r_b+r_c)r_d, (r_e+r_f)r_d]$. |
179 \end{center} |
195 \end{center} |
180 \noindent |
196 We implemented |
181 and so on. |
197 function $\textit{prune}$ in Scala, |
182 We note that for term 4 \ref{term:4} its result value |
198 and incorporated into our lexer, |
183 after any position beyond the current one will always |
199 by replacing the $\simp$ function |
184 be the same: |
200 with a stronger version called $\bsimpStrong$ |
185 \begin{center}\label{value_2parts4} |
201 that prunes regular expressions. |
186 %\caption{term 1 \ref{term:1}'s matching configuration} |
202 We call this lexer $\blexerStrong$. |
187 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] |
203 $\blexerStrong$ is able to drastically reduce the |
188 \node [rectangle split, rectangle split horizontal, rectangle split parts=2] |
204 internal data structure size which could |
189 {$v_{x4}$ |
205 trigger exponential behaviours in |
190 \nodepart{two} $v_{r_1\cdot r}$ }; |
206 $\blexersimp$. |
191 \end{tikzpicture} |
207 \begin{figure}[H] |
192 \end{center} |
|
193 And $\POSIX$ rules says that we can eliminate at least one of them. |
|
194 Our algorithm always puts the regex with the longest initial sub-match at the left of the |
|
195 alternatives, so we safely throw away \ref{term:4}. |
|
196 The fact that term 1 and 4 are not immediately in the same alternative |
|
197 expression does not prevent them from being two duplicate matches at |
|
198 the current point of view. |
|
199 To implement this idea into an algorithm, we define a "pruning function" |
|
200 $\textit{prune}$, that takes away parts of all terms in $r$ that belongs to |
|
201 $\textit{acc}$, which acts as an element |
|
202 is a stronger version of $\textit{distinctBy}$. |
|
203 Here is a concise version of it (in the style of Scala): |
|
204 \begin{figure}[H] |
|
205 \begin{lstlisting} |
|
206 def distinctByPlus(rs: List[ARexp], acc: Set[Rexp] = Set()) : |
|
207 List[ARexp] = rs match { |
|
208 case Nil => Nil |
|
209 case r :: rs => |
|
210 if(acc.contains(erase(r))) |
|
211 distinctByPlus(rs, acc) |
|
212 else |
|
213 prune(r, acc) :: distinctByPlus(rs, prune(r, acc) +: acc) |
|
214 } |
|
215 |
|
216 \end{lstlisting} |
|
217 \caption{the distinctByPlus function} |
|
218 \end{figure} |
|
219 But for the function $\textit{prune}$ there is a difficulty: |
|
220 how could one define the $L$ function in a "computable" way, |
|
221 so that they generate a (lazy infinite) set of strings in $L(r)$. |
|
222 We also need a function that tests whether $L(r_1) \subseteq L(r_2)$ |
|
223 is true. |
|
224 For the moment we cut corners and do not define these two functions |
|
225 as they are not used by Antimirov (and will probably not contribute |
|
226 to a bound better than Antimirov's cubic bound). |
|
227 Rather, we do not try to eliminate in every instance of regular expressions |
|
228 that have "language duplicates", but only eliminate the "exact duplicates". |
|
229 For this we need to break up terms $(a+b)\cdot c$ into two |
|
230 terms $a\cdot c$ and $b\cdot c$ before we add them to the accumulator: |
|
231 \begin{figure}[H] |
|
232 \begin{lstlisting} |
|
233 def distinctWith(rs: List[ARexp], |
|
234 pruneFunction: (ARexp, Set[Rexp]) => ARexp, |
|
235 acc: Set[Rexp] = Set()) : List[ARexp] = |
|
236 rs match{ |
|
237 case Nil => Nil |
|
238 case r :: rs => |
|
239 if(acc(erase(r))) |
|
240 distinctWith(rs, pruneFunction, acc) |
|
241 else { |
|
242 val pruned_r = pruneFunction(r, acc) |
|
243 pruned_r :: |
|
244 distinctWith(rs, |
|
245 pruneFunction, |
|
246 turnIntoTerms(erase(pruned_r)) ++: acc |
|
247 ) |
|
248 } |
|
249 } |
|
250 \end{lstlisting} |
|
251 \caption{A Stronger Version of $\textit{distinctBy}$} |
|
252 \end{figure} |
|
253 \noindent |
|
254 This process is done by the function $\textit{turnIntoTerms}$: |
|
255 \begin{figure} |
|
256 \begin{verbatim} |
|
257 def turnIntoTerms(r: Rexp): List[Rexp] = r match { |
|
258 case SEQ(r1, r2) => if(isOne(r1)) |
|
259 turnIntoTerms(r2) |
|
260 else |
|
261 turnIntoTerms(r1).map(r11 => SEQ(r11, r2)) |
|
262 case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2) |
|
263 case ZERO => Nil |
|
264 case _ => r :: Nil |
|
265 } |
|
266 \end{verbatim} |
|
267 \caption{function that breaks up regular expression into "atomic" terms} |
|
268 \end{figure} |
|
269 |
|
270 \noindent |
|
271 The pruning function can be defined recursively: |
|
272 \begin{figure} |
|
273 \begin{verbatim} |
|
274 def prune7(r: ARexp, acc: Set[Rexp]) : ARexp = r match{ |
|
275 case AALTS(bs, rs) => rs.map(r => prune7(r, acc)).filter(_ != ZERO) match |
|
276 { |
|
277 case Nil => AZERO |
|
278 case r::Nil => fuse(bs, r) |
|
279 case rs1 => AALTS(bs, rs1) |
|
280 } |
|
281 case ASEQ(bs, r1, r2) => prune7(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match { |
|
282 case AZERO => AZERO |
|
283 case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2) |
|
284 case r1p => ASEQ(bs, r1p, r2) |
|
285 } |
|
286 case r => if(acc(erase(r))) AZERO else r |
|
287 } |
|
288 \end{verbatim} |
|
289 \caption{pruning function} |
|
290 \end{figure} |
|
291 |
|
292 |
|
293 |
|
294 \begin{figure} |
|
295 \centering |
208 \centering |
296 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
209 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
297 \begin{tikzpicture} |
210 \begin{tikzpicture} |
298 \begin{axis}[ |
211 \begin{axis}[ |
299 %xlabel={$n$}, |
212 %xlabel={$n$}, |
314 ] |
227 ] |
315 \addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data}; |
228 \addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data}; |
316 \end{axis} |
229 \end{axis} |
317 \end{tikzpicture}\\ |
230 \end{tikzpicture}\\ |
318 \multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings |
231 \multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings |
319 of the form $\underbrace{aa..a}_{n}$.} |
232 of the form $\underbrace{aa..a}_{n}$.} |
320 \end{tabular} |
233 \end{tabular} |
321 \caption{aaaaaStarStar} \label{fig:aaaaaStarStar} |
234 \caption{aaaaaStarStar} \label{fig:aaaaaStarStar} |
322 \end{figure} |
235 \end{figure} |
|
236 \begin{figure}[H] |
|
237 \begin{lstlisting} |
|
238 def atMostEmpty(r: Rexp) : Boolean = r match { |
|
239 case ZERO => true |
|
240 case ONE => true |
|
241 case STAR(r) => atMostEmpty(r) |
|
242 case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2) |
|
243 case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2) |
|
244 case CHAR(_) => false |
|
245 } |
|
246 |
|
247 |
|
248 def isOne(r: Rexp) : Boolean = r match { |
|
249 case ONE => true |
|
250 case SEQ(r1, r2) => isOne(r1) && isOne(r2) |
|
251 case ALTS(r1, r2) => (isOne(r1) || isOne(r2)) && (atMostEmpty(r1) && atMostEmpty(r2))//rs.forall(atMostEmpty) && rs.exists(isOne) |
|
252 case STAR(r0) => atMostEmpty(r0) |
|
253 case CHAR(c) => false |
|
254 case ZERO => false |
|
255 } |
|
256 |
|
257 //r = r' ~ tail' : If tail' matches tail => returns r' |
|
258 def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = r match { |
|
259 case SEQ(r1, r2) => |
|
260 if(r2 == tail) |
|
261 r1 |
|
262 else |
|
263 ZERO |
|
264 case r => ZERO |
|
265 } |
|
266 |
|
267 def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{ |
|
268 case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != ZERO) match |
|
269 { |
|
270 //all components have been removed, meaning this is effectively a duplicate |
|
271 //flats will take care of removing this AZERO |
|
272 case Nil => AZERO |
|
273 case r::Nil => fuse(bs, r) |
|
274 case rs1 => AALTS(bs, rs1) |
|
275 } |
|
276 case ASEQ(bs, r1, r2) => |
|
277 //remove the r2 in (ra + rb)r2 to identify the duplicate contents of r1 |
|
278 prune(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match { |
|
279 //after pruning, returns 0 |
|
280 case AZERO => AZERO |
|
281 //after pruning, got r1'.r2, where r1' is equal to 1 |
|
282 case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2) |
|
283 //assemble the pruned head r1p with r2 |
|
284 case r1p => ASEQ(bs, r1p, r2) |
|
285 } |
|
286 //this does the duplicate component removal task |
|
287 case r => if(acc(erase(r))) AZERO else r |
|
288 } |
|
289 \end{lstlisting} |
|
290 \caption{pruning function together with its helper functions} |
|
291 \end{figure} |
|
292 \noindent |
|
293 The benefits of using |
|
294 $\textit{prune}$ such as refining the finiteness bound |
|
295 to a cubic bound has not been formalised yet. |
|
296 Therefore we choose to use Scala code rather than an Isabelle-style formal |
|
297 definition like we did for $\simp$, as the definitions might change |
|
298 to suit proof needs. |
|
299 In the rest of the chapter we will use this convention consistently. |
|
300 \begin{figure}[H] |
|
301 \begin{lstlisting} |
|
302 def distinctWith(rs: List[ARexp], |
|
303 pruneFunction: (ARexp, Set[Rexp]) => ARexp, |
|
304 acc: Set[Rexp] = Set()) : List[ARexp] = |
|
305 rs match{ |
|
306 case Nil => Nil |
|
307 case r :: rs => |
|
308 if(acc(erase(r))) |
|
309 distinctWith(rs, pruneFunction, acc) |
|
310 else { |
|
311 val pruned_r = pruneFunction(r, acc) |
|
312 pruned_r :: |
|
313 distinctWith(rs, |
|
314 pruneFunction, |
|
315 turnIntoTerms(erase(pruned_r)) ++: acc |
|
316 ) |
|
317 } |
|
318 } |
|
319 \end{lstlisting} |
|
320 \caption{A Stronger Version of $\textit{distinctBy}$} |
|
321 \end{figure} |
|
322 \noindent |
|
323 The function $\textit{prune}$ is used in $\distinctWith$. |
|
324 $\distinctWith$ is a stronger version of $\distinctBy$ |
|
325 which not only removes duplicates as $\distinctBy$ would |
|
326 do, but also uses the $\textit{pruneFunction}$ |
|
327 argument to prune away verbose components in a regular expression.\\ |
|
328 \begin{figure}[H] |
|
329 \begin{lstlisting} |
|
330 //a stronger version of simp |
|
331 def bsimpStrong(r: ARexp): ARexp = |
|
332 { |
|
333 r match { |
|
334 case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match { |
|
335 //normal clauses same as simp |
|
336 case (AZERO, _) => AZERO |
|
337 case (_, AZERO) => AZERO |
|
338 case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s) |
|
339 //bs2 can be discarded |
|
340 case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil |
|
341 case (r1s, r2s) => ASEQ(bs1, r1s, r2s) |
|
342 } |
|
343 case AALTS(bs1, rs) => { |
|
344 //distinctBy(flat_res, erase) |
|
345 distinctWith(flats(rs.map(bsimpStrong(_))), prune) match { |
|
346 case Nil => AZERO |
|
347 case s :: Nil => fuse(bs1, s) |
|
348 case rs => AALTS(bs1, rs) |
|
349 } |
|
350 } |
|
351 //stars that can be treated as 1 |
|
352 case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs) |
|
353 case r => r |
|
354 } |
|
355 } |
|
356 \end{lstlisting} |
|
357 \caption{The function $\bsimpStrong$ and $\bdersStrongs$} |
|
358 \end{figure} |
|
359 \noindent |
|
360 $\distinctWith$, is in turn used in $\bsimpStrong$: |
|
361 \begin{figure}[H] |
|
362 \begin{lstlisting} |
|
363 //Conjecture: [| bdersStrong(s, r) |] = O([| r |]^3) |
|
364 def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match { |
|
365 case Nil => r |
|
366 case c::s => bdersStrong(s, bsimpStrong(bder(c, r))) |
|
367 } |
|
368 \end{lstlisting} |
|
369 \caption{The function $\bsimpStrong$ and $\bdersStrongs$} |
|
370 \end{figure} |
|
371 \noindent |
|
372 We conjecture that the above Scala function $\bdersStrongs$, |
|
373 written $\bdersStrong{\_}{\_}$ as an infix notation, |
|
374 satisfies the following property: |
|
375 \begin{conjecture} |
|
376 $\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$ |
|
377 \end{conjecture} |
|
378 The stronger version of $\blexersimp$'s |
|
379 code in Scala looks like: |
|
380 \begin{figure}[H] |
|
381 \begin{lstlisting} |
|
382 def strongBlexer(r: Rexp, s: String) : Option[Val] = { |
|
383 Try(Some(decode(r, strong_blex_simp(internalise(r), s.toList)))).getOrElse(None) |
|
384 } |
|
385 def strong_blex_simp(r: ARexp, s: List[Char]) : Bits = s match { |
|
386 case Nil => { |
|
387 if (bnullable(r)) { |
|
388 mkepsBC(r) |
|
389 } |
|
390 else |
|
391 throw new Exception("Not matched") |
|
392 } |
|
393 case c::cs => { |
|
394 strong_blex_simp(strongBsimp(bder(c, r)), cs) |
|
395 } |
|
396 } |
|
397 \end{lstlisting} |
|
398 \end{figure} |
|
399 \noindent |
|
400 We would like to preserve the correctness like the one |
|
401 we had for $\blexersimp$: |
|
402 \begin{conjecture}\label{cubicConjecture} |
|
403 $\blexerStrong \;r \; s = \blexer\; r\;s$ |
|
404 \end{conjecture} |
|
405 |
|
406 To implement this idea into an algorithm, we define a "pruning function" |
|
407 $\textit{prune}$, that takes away parts of all terms in $r$ that belongs to |
|
408 $\textit{acc}$, which acts as an element |
|
409 is a stronger version of $\textit{distinctBy}$. |
|
410 Here is a concise version of it (in the style of Scala): |
|
411 |
|
412 |
|
413 |
|
414 |
|
415 \noindent |
|
416 The pruning function can be defined recursively: |
|
417 |
|
418 |
|
419 |
|
420 |
|
421 This suggests a link towrads "partial derivatives" |
|
422 introduced by Antimirov \cite{Antimirov95}. |
|
423 |
|
424 \section{Antimirov's partial derivatives} |
|
425 The idea behind Antimirov's partial derivatives |
|
426 is to do derivatives in a similar way as suggested by Brzozowski, |
|
427 but maintain a set of regular expressions instead of a single one: |
|
428 |
|
429 %TODO: antimirov proposition 3.1, needs completion |
|
430 \begin{center} |
|
431 \begin{tabular}{ccc} |
|
432 $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\ |
|
433 $\partial_x(\ONE)$ & $=$ & $\phi$ |
|
434 \end{tabular} |
|
435 \end{center} |
|
436 |
|
437 Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together |
|
438 using the alternatives constructor, Antimirov cleverly chose to put them into |
|
439 a set instead. This breaks the terms in a derivative regular expression up, |
|
440 allowing us to understand what are the "atomic" components of it. |
|
441 For example, To compute what regular expression $x^*(xx + y)^*$'s |
|
442 derivative against $x$ is made of, one can do a partial derivative |
|
443 of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$ |
|
444 from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$. |
|
445 To get all the "atomic" components of a regular expression's possible derivatives, |
|
446 there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes |
|
447 whatever character is available at the head of the string inside the language of a |
|
448 regular expression, and gives back the character and the derivative regular expression |
|
449 as a pair (which he called "monomial"): |
|
450 \begin{center} |
|
451 \begin{tabular}{ccc} |
|
452 $\lf(\ONE)$ & $=$ & $\phi$\\ |
|
453 $\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\ |
|
454 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ |
|
455 $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\ |
|
456 \end{tabular} |
|
457 \end{center} |
|
458 %TODO: completion |
|
459 |
|
460 There is a slight difference in the last three clauses compared |
|
461 with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular |
|
462 expression $r$ with every element inside $\textit{rset}$ to create a set of |
|
463 sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates |
|
464 on a set of monomials (which Antimirov called "linear form") and a regular |
|
465 expression, and returns a linear form: |
|
466 \begin{center} |
|
467 \begin{tabular}{ccc} |
|
468 $l \bigodot (\ZERO)$ & $=$ & $\phi$\\ |
|
469 $l \bigodot (\ONE)$ & $=$ & $l$\\ |
|
470 $\phi \bigodot t$ & $=$ & $\phi$\\ |
|
471 $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\ |
|
472 $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\ |
|
473 $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\ |
|
474 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ |
|
475 $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\ |
|
476 \end{tabular} |
|
477 \end{center} |
|
478 %TODO: completion |
|
479 |
|
480 Some degree of simplification is applied when doing $\bigodot$, for example, |
|
481 $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$, |
|
482 and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and |
|
483 $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$, |
|
484 and so on. |
|
485 |
|
486 With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with |
|
487 an iterative procedure: |
|
488 \begin{center} |
|
489 \begin{tabular}{llll} |
|
490 $\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\ |
|
491 & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\ |
|
492 & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\ |
|
493 $\partial_{UNIV}(r)$ & $=$ & $\PD$ & |
|
494 \end{tabular} |
|
495 \end{center} |
|
496 |
|
497 |
|
498 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, |
|
499 |
|
500 |
323 |
501 |
324 |
502 |
325 |
503 |
326 %---------------------------------------------------------------------------------------- |
504 %---------------------------------------------------------------------------------------- |
327 % SECTION 1 |
505 % SECTION 1 |