%----------------------------------------------------------------------------------------
% SECTION strongsimp
%----------------------------------------------------------------------------------------
\section{A Stronger Version of Simplification}
%TODO: search for isabelle proofs of algorithms that check equivalence
In our bit-coded lexing algorithm, with or without simplification,
two alternative (distinct) sub-matches for the (sub-)string and (sub-)regex pair
are always expressed in the ``derivative regular expression'' as two
disjoint alternative terms at the current (sub-)regex level. The fact that they
are parallel tells us that, when we retrieve the information from this (sub-)regex,
there will always be a choice of which alternative term to take.
As an example, the regular expression $aa \cdot a^* + a \cdot a^*$ (omitting bit-codes)
expresses two possibilities for how it will match future input.
It will either match two $a$'s and then zero or more $a$'s, in other words, at least two more $a$'s:
\begin{figure}\label{string_3parts1}
\begin{center}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=3]
{Consumed Input
\nodepart{two} Expects $aa$
\nodepart{three} Then expects $a^*$};
\end{tikzpicture}
\end{center}
\end{figure}
\noindent
Or it will match at least one more $a$:
\begin{figure}\label{string_3parts2}
\begin{center}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=3]
{Consumed
\nodepart{two} Expects $a$
\nodepart{three} Then expects $a^*$};
\end{tikzpicture}
\end{center}
\end{figure}
If these two possibilities turn out to be identical, we can eliminate
the second one, as we know it corresponds to
a match that is not $\POSIX$.
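For instance (a sketch, writing $\_\backslash a$ for the derivative and applying
the usual simplifications on the fly), two further derivatives with respect to $a$
collapse the two alternatives above into the same regular expression:
\[
(aa \cdot a^* + a \cdot a^*)\backslash a \;\rightsquigarrow\; a \cdot a^* + a^*
\qquad\text{and}\qquad
(a \cdot a^* + a^*)\backslash a \;\rightsquigarrow\; a^* + a^*,
\]
at which point the rule removing duplicate alternatives eliminates the second $a^*$,
which corresponds to the non-$\POSIX$ match.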
|
If two identical regexes
happen to be grouped into different sequences, one cannot use
the $a + a \rightsquigarrow a$ rule to eliminate them, even if they
are ``parallel'' to each other:
\[
(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
\]
This is because they are followed by possibly
different ``suffixes'' $r_1$ and $r_2$; if these do turn out to be different,
then performing the rewrite above
might cause a possible match where the string is in $L(a \cdot r_2)$ to be lost.
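As a small, purely illustrative instance, take $r_1 = b$ and $r_2 = c$. Then
\[
L((a+b) \cdot b + (a+c) \cdot c) = \{ab, bb, ac, cc\}
\qquad\text{but}\qquad
L((a+b) \cdot b + c \cdot c) = \{ab, bb, cc\},
\]
so the string $ac \in L(a \cdot r_2)$ could no longer be matched after such a rewrite.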

Here is an example of this.
Assume we have the derivative regex
\[
(r_1 + r_2 + r_3)\cdot r \;+\; (r_1 + r_5 + r_6)\cdot r,
\qquad\text{where } r = (r_1 + r_2 + r_3)^*,
\]
occurring in an intermediate step in our bit-coded lexing algorithm.
In this expression, there will be 6 ``parallel'' terms if we break up regexes
of shape $(a+b)\cdot c$ using the distributivity law (into $a\cdot c + b\cdot c$):
\begin{align}
(\nonumber \\
r_1 + & \label{term:1} \\
r_2 + & \label{term:2} \\
r_3 & \label{term:3} \\
& )\cdot r\; + \; (\nonumber \\
r_1 + & \label{term:4} \\
r_5 + & \label{term:5} \\
r_6 \label{term:6}\\
& )\cdot r\nonumber
\end{align}

They have been labelled, and each label denotes
one term; for example, \ref{term:1} denotes
\[
r_1 \cdot r
\]
\noindent
and \ref{term:3} denotes
\[
r_3\cdot r.
\]
From a lexer's point of view, \ref{term:4} will never
be picked up in later phases of matching because there
is a term \ref{term:1} giving identical matching information.
The first term \ref{term:1} will match a string in $L(r_1 \cdot r)$,
and so on for terms \ref{term:2}, \ref{term:3}, etc.
The matching configuration of term \ref{term:1} can be pictured as follows:
\mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$}
\begin{center}\label{string_2parts}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
{$r_{x1}$
\nodepart{two} $r_1\cdot r$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\end{center}
For term 1 \ref{term:1}, whatever came before the current
input position has been fully matched and the regex corresponding
to it has been reduced to $\ONE$;
from the next input token onwards, matching will continue with $r_1\cdot r$.
The resulting value will have a similar configuration:
\begin{center}\label{value_2parts}
%\caption{term 1 \ref{term:1}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
{$v_{x1}$
\nodepart{two} $v_{r_1\cdot r}$ };
\end{tikzpicture}
\end{center}
For term 2 \ref{term:2} we have a similar value partition:
\begin{center}\label{value_2parts2}
%\caption{term 1 \ref{term:1}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
{$v_{x2}$
\nodepart{two} $v_{r_2\cdot r}$ };
\end{tikzpicture}
\end{center}
\noindent
and so on.
We note that for term 4 \ref{term:4}, the part of its result value
covering any position beyond the current one will always
be the same as that of term 1 \ref{term:1}:
\begin{center}\label{value_2parts4}
%\caption{term 1 \ref{term:1}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
{$v_{x4}$
\nodepart{two} $v_{r_1\cdot r}$ };
\end{tikzpicture}
\end{center}
And the $\POSIX$ rules say that we can eliminate at least one of them.
Our algorithm always puts the regex with the longest initial sub-match at the left of the
alternatives, so we can safely throw away \ref{term:4}.
The fact that terms 1 and 4 are not immediately in the same alternative
expression does not prevent them from being two duplicate matches from
the current point of view.
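In other words, on the running example the effect we are after is the pruning step
\[
(r_1 + r_2 + r_3)\cdot r \;+\; (r_1 + r_5 + r_6)\cdot r
\;\rightsquigarrow\;
(r_1 + r_2 + r_3)\cdot r \;+\; (r_5 + r_6)\cdot r,
\]
where the duplicate \ref{term:4} has been removed from the second alternative while
the rest of that alternative is kept.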
|
To implement this idea as an algorithm, we define a ``pruning function''
$\textit{prune}$ that takes away the parts of all terms in $r$ that belong to
$\textit{acc}$, which acts as an accumulator. The result
is a stronger version of $\textit{distinctBy}$.
Here is a concise version of it (in the style of Scala):
\begin{verbatim}
def distinctByPlus(rs: List[ARexp], acc: Set[Rexp] = Set()): List[ARexp] =
  rs match {
    case Nil => Nil
    case r :: rs =>
      if (acc.contains(erase(r)))
        // the whole term is already covered by the accumulator
        distinctByPlus(rs, acc)
      else {
        // otherwise prune away the parts of r that are covered by acc
        val pruned_r = prune(r, acc)
        pruned_r :: distinctByPlus(rs, acc + erase(pruned_r))
      }
  }
\end{verbatim}
But for the function $\textit{prune}$ there is a difficulty:
how could one define the $L$ function in a ``computable'' way,
so that it generates a (lazy, possibly infinite) set of strings in $L(r)$?
We would also need a function that tests whether $L(r_1) \subseteq L(r_2)$
holds.
For the moment we cut corners and do not define these two functions,
as they are not used by Antimirov (and will probably not contribute
to a bound better than Antimirov's cubic bound).
Rather than trying to eliminate every instance of regular expressions
that are ``language duplicates'', we only eliminate the ``exact duplicates''.
For this we need to break up terms $(a+b)\cdot c$ into two
terms $a\cdot c$ and $b\cdot c$ before we add them to the accumulator:
\begin{verbatim}
def distinctWith(rs: List[ARexp],
                 pruneFunction: (ARexp, Set[Rexp]) => ARexp,
                 acc: Set[Rexp] = Set()): List[ARexp] =
  rs match {
    case Nil => Nil
    case r :: rs =>
      if (acc(erase(r)))
        // an exact duplicate of an already accumulated term: drop it
        distinctWith(rs, pruneFunction, acc)
      else {
        // prune r against acc and remember its broken-up terms
        val pruned_r = pruneFunction(r, acc)
        pruned_r :: distinctWith(rs, pruneFunction,
                                 acc ++ turnIntoTerms(erase(pruned_r)))
      }
  }
\end{verbatim}
\noindent
The process of breaking up terms is done by the function $\textit{turnIntoTerms}$:
\begin{verbatim}
def turnIntoTerms(r: Rexp): List[Rexp] = r match {
  case SEQ(r1, r2) =>
    // ONE . r2 contributes the terms of r2; otherwise distribute
    // r2 over the terms of r1
    if (isOne(r1)) turnIntoTerms(r2)
    else turnIntoTerms(r1).map(r11 => SEQ(r11, r2))
  case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
  case ZERO => Nil
  case _ => r :: Nil
}
\end{verbatim}
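\noindent
As a small illustration (assuming the usual constructors CHAR, SEQ and ALTS for
plain regular expressions), $\textit{turnIntoTerms}$ breaks $(a+b)\cdot c$ up into
the two parallel terms $a\cdot c$ and $b\cdot c$:
\begin{verbatim}
// hypothetical usage sketch
val r = SEQ(ALTS(CHAR('a'), CHAR('b')), CHAR('c'))
turnIntoTerms(r)
// => List(SEQ(CHAR('a'), CHAR('c')), SEQ(CHAR('b'), CHAR('c')))
\end{verbatim}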
|
\noindent
The pruning function can be defined recursively:
\begin{verbatim}
def prune7(r: ARexp, acc: Set[Rexp]): ARexp = r match {
  case AALTS(bs, rs) =>
    // prune each alternative; drop the ones that are pruned to AZERO
    rs.map(r => prune7(r, acc)).filter(_ != AZERO) match {
      case Nil => AZERO
      case r :: Nil => fuse(bs, r)
      case rs1 => AALTS(bs, rs1)
    }
  case ASEQ(bs, r1, r2) =>
    // prune r1 against the accumulated terms that share the tail r2
    prune7(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
      case AZERO => AZERO
      case r1p if isOne(erase(r1p)) => fuse(bs ++ mkepsBC(r1p), r2)
      case r1p => ASEQ(bs, r1p, r2)
    }
  case r => if (acc(erase(r))) AZERO else r
}
\end{verbatim}
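\noindent
The function $\textit{prune7}$ relies on a helper $\textit{removeSeqTail}$ that strips
a known tail off the accumulated terms, so that $r_1$ can be pruned against the
corresponding heads. The following is only a minimal sketch of such a helper (an
assumption about its definition, not a verified artefact): accumulated terms that do
not end in the given tail are mapped to $\textit{ZERO}$ and therefore cannot
contribute to the pruning of $r_1$.
\begin{verbatim}
// hypothetical sketch of the helper assumed by prune7
def removeSeqTail(r: Rexp, tail: Rexp): Rexp = r match {
  case SEQ(r1, r2) if r2 == tail => r1
  case _ => ZERO
}
\end{verbatim}
\noindent
With these pieces in place, the stronger simplification can call
$\textit{distinctWith}(\textit{rs}, \textit{prune7})$ where the original algorithm
used $\textit{distinctBy}$; the $\textit{bsimpStrong}$ curve in
Figure~\ref{fig:aaaaaStarStar} compares the resulting size growth with that of
$\textit{bsimp}$.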
|
\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={size},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=30,
    ymax=800,
    ytick={0,200,...,800},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={$bsimpStrong$ size growth},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={size},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=30,
    ymax=42000,
    ytick={0,10000,...,40000},
    scaled ticks=true,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={$bsimp$ size growth},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs: size of the derivative regular expressions when matching
$((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ against strings
of the form $\underbrace{aa..a}_{n}$, using $bsimpStrong$ (left) and $bsimp$ (right).}
\end{tabular}
\caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
\end{figure}

%----------------------------------------------------------------------------------------
% SECTION 1