10 \theoremstyle{theorem} |
10 \theoremstyle{theorem} |
11 \newtheorem{theorem}{Theorem} |
11 \newtheorem{theorem}{Theorem} |
12 |
12 |
13 \theoremstyle{lemma} |
13 \theoremstyle{lemma} |
14 \newtheorem{lemma}{Lemma} |
14 \newtheorem{lemma}{Lemma} |
15 |
15 \usepackage{amsmath} |
16 \newcommand{\lemmaautorefname}{Lemma} |
16 \newcommand{\lemmaautorefname}{Lemma} |
17 |
17 |
18 \theoremstyle{definition} |
18 \theoremstyle{definition} |
19 \newtheorem{definition}{Definition} |
19 \newtheorem{definition}{Definition} |
20 \begin{document} |
20 \begin{document} |
21 This is an outline of the structure of the paper with a little bit of flesh in it. |
|
22 \section{The Flow of thought process} |
|
23 |
|
24 |
|
25 \begin{definition}{Regular Expressions and values} |
|
26 \begin{verbatim} |
|
27 TODO |
|
28 \end{verbatim} |
|
29 \end{definition} |
|
30 |
|
31 Value is.a parse tree for the regular expression matching the string. |
|
32 |
|
33 |
|
34 |
|
35 \begin{definition}{nullable} |
|
36 \begin{verbatim} |
|
37 TODO |
|
38 \end{verbatim} |
|
39 \end{definition} |
|
40 The idea behind nullable: whether it contains the empty string. |
|
41 |
|
42 \begin{definition}{derivatives} |
|
43 TODO |
|
44 |
|
45 \end{definition} |
|
46 |
|
47 This definition can be used for matching algorithm. |
|
48 |
|
49 \begin{definition}{matcher} |
|
50 TODO |
|
51 \end{definition} |
|
52 |
|
53 \begin{definition}{POSIX values} |
|
54 TODO |
|
55 \end{definition} |
|
56 |
|
57 \begin{definition}{POSIX lexer algorithm} |
|
58 TODO |
|
59 \end{definition} |
|
60 |
|
61 \begin{definition}{POSIX lexer algorithm with simplification} |
|
62 TODO |
|
63 \end{definition} |
|
64 This simplification algorithm is rather complicated as it entangles derivative, simplification and value reconstruction. We need to split the regular expression structure of the information for lexing so that simplifcation only changes the regex but does not destroy the information for reconstructing the resulting value.\\ |
|
65 |
|
66 Introduce regex with auxiliary information: |
|
67 |
|
68 \begin{definition}{annotated regular expression} |
|
69 TODO |
|
70 \end{definition} |
|
71 |
|
72 \begin{definition}{encoding} |
|
73 TODO |
|
74 \end{definition} |
|
75 Encoding translates values into bit codes with some information loss. |
|
76 |
|
77 \begin{definition}{decoding} |
|
78 TODO |
|
79 \end{definition} |
|
80 Decoding translates bitcodes back into values with the help of regex to recover the structure.\\ |
|
81 |
|
82 During different phases of lexing, we sometimes already know what the value would look like if we match the branch of regex with the string(e.g. a STAR with 1 more iteration, a left/right value), so we can partially encode the value at diffferent phases of the algorithm for later decoding. |
|
83 \\ |
|
84 Examples of such partial encoding: |
|
85 |
|
86 \begin{definition}{internalise} |
|
87 TODO |
|
88 \end{definition} |
|
89 When doing internalise on ALT:\\ |
|
90 Whichever branch is chosen, we know exactly the shape of the value, and therefore can get the bit code for such a value: Left corresponds to Z and Right to S |
|
91 |
|
92 \begin{definition}{bmkeps} |
|
93 TODO |
|
94 \end{definition} |
|
95 bmkeps on the STAR case:\\ |
|
96 We know there could be no more iteration for a star if we want to get a POSIX value for an empty string, so the value must be Stars [], corresponding to an S in the bit code. |
|
97 |
|
98 \begin{definition}{bder} |
|
99 TODO |
|
100 \end{definition} |
|
101 SEQ case with the first regex nullable:\\ |
|
102 bmkeps will extract the value for how a1 mathces the empty string and encode it into a bit sequence. |
|
103 |
|
104 \begin{definition}{blexer} |
|
105 \begin{verbatim} |
|
106 TODO |
|
107 \end{verbatim} |
|
108 \end{definition} |
|
109 |
|
110 adding simplifcation.\\ |
|
111 |
|
112 size of simplified regex: smaller than Antimirov's pder.\\ |
|
113 |
|
114 |
|
115 |
|
116 The rest of the document is the residual from a previous doc and may be deleted.\\ |
|
117 \begin{definition}{bsimp} |
|
118 \begin{verbatim} |
|
119 def bsimp(r: ARexp): ARexp = r match { |
|
120 case ASEQ(bs1, r1, r2) => (bsimp(r1), bsimp(r2)) match { |
|
121 case (AZERO, _) => AZERO |
|
122 case (_, AZERO) => AZERO |
|
123 case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s) |
|
124 case (r1s, r2s) => ASEQ(bs1, r1s, r2s) |
|
125 } |
|
126 case AALTS(bs1, rs) => { |
|
127 val rs_simp = rs.map(bsimp) |
|
128 val flat_res = flats(rs_simp) |
|
129 val dist_res = distinctBy(flat_res, erase) |
|
130 dist_res match { |
|
131 case Nil => AZERO |
|
132 case s :: Nil => fuse(bs1, s) |
|
133 case rs => AALTS(bs1, rs) |
|
134 } |
|
135 } |
|
136 //case ASTAR(bs, r) => ASTAR(bs, bsimp(r)) |
|
137 case r => r |
|
138 } |
|
139 \end{verbatim} |
|
140 \end{definition} |
|
141 |
|
142 \begin{definition}{sub-parts of bsimp} |
|
143 \begin{itemize} |
|
144 \item{flats}\\ |
|
145 flattens the list. |
|
146 \item{dB}\\ |
|
147 means distinctBy |
|
148 \item{Co}\\ |
|
149 The last matching clause of the function bsimp, with a slight modification to suit later reasoning. |
|
150 \begin{verbatim} |
|
151 def Co(bs1, rs): ARexp = { |
|
152 rs match { |
|
153 case Nil => AZERO |
|
154 case s :: Nil => fuse(bs1, s) |
|
155 case rs => AALTS(bs1, rs) |
|
156 } |
|
157 \end{verbatim} |
|
158 \end{itemize} |
|
159 \end{definition} |
|
160 |
|
161 \begin{definition}{fuse} |
|
162 \begin{verbatim} |
|
163 def fuse(bs: Bits, r: ARexp) : ARexp = r match { |
|
164 case AZERO => AZERO |
|
165 case AONE(cs) => AONE(bs ++ cs) |
|
166 case ACHAR(cs, f) => ACHAR(bs ++ cs, f) |
|
167 case AALTS(cs, rs) => AALTS(bs ++ cs, rs) |
|
168 case ASEQ(cs, r1, r2) => ASEQ(bs ++ cs, r1, r2) |
|
169 case ASTAR(cs, r) => ASTAR(bs ++ cs, r) |
|
170 } |
|
171 \end{verbatim} |
|
172 \end{definition} |
|
173 |
|
174 |
|
175 \begin{definition}{mkepsBC} |
|
176 \begin{verbatim} |
|
177 def mkepsBC(r: ARexp) : Bits = r match { |
|
178 case AONE(bs) => bs |
|
179 case AALTS(bs, rs) => { |
|
180 val n = rs.indexWhere(bnullable) |
|
181 bs ++ mkepsBC(rs(n)) |
|
182 } |
|
183 case ASEQ(bs, r1, r2) => bs ++ mkepsBC(r1) ++ mkepsBC(r2) |
|
184 case ASTAR(bs, r) => bs ++ List(Z) |
|
185 } |
|
186 \end{verbatim} |
|
187 \end{definition} |
|
188 |
|
189 \begin{definition}{mkepsBC equicalence} |
|
190 \\ |
|
191 Given 2 nullable annotated regular expressions r1, r2, if mkepsBC(r1) == mkepsBC(r2) |
|
192 then r1 and r2 are mkepsBC equivalent, denoted as r1 $\sim_{m\epsilon}$ r2 |
|
193 \end{definition} |
|
194 |
|
195 \begin{definition}{shorthand notation for ders} |
|
196 \\For the sake of reducing verbosity, we sometimes use the shorthand notation |
|
197 $d_{c}(r)$ for the function application bder(c, r) and $s(r)$(s here stands for simplification) for the function application bsimp(r) . |
|
198 \\We omit the subscript when it is clear from context what that character is and write $d(r)$ instead of $d_{c}(r)$. |
|
199 \\And we omit the parentheses when no confusion can be caused. For example ders\_simp(c, r) can be written as $s(d_{c}(r))$ or even $s d r$ as we know the derivative operation is w.r.t the character c. Here the s and d are more like operators that take an annotated regular expression as an input and return an annotated regular expression as an output |
|
200 |
|
201 \end{definition} |
|
202 |
|
203 |
|
204 |
|
205 \begin{definition}{distinctBy operation expressed in a different way--how it transforms the list}\\ |
|
206 Given two lists rs1 and rs2, we define the operation $--$:\\ |
|
207 $rs1 -- rs2 := [r \in rs1 | r \notin rs2]$ |
|
208 Note that the order each term appears in $rs_1 -- rs_2$ is preserved as in the original list. |
|
209 \end{definition} |
|
210 |
21 |
211 |
22 |
212 \section{Main Result} |
23 \section{Main Result} |
213 |
24 |
214 \begin{lemma}{simplification function does not simplify an already simplified regex}\label{lma1}\\ |
25 Want to prove |
215 bsimp(r) == bsimp(bsimp(r)) holds for any annotated regular expression r. |
26 \begin{equation}\label{cen} |
216 \end{lemma} |
27 \textit{bsimp}(\textit{bder}(c,a)) = \textit{bsimp}(\textit{bder}(c,\textit{bsimp}(a))). |
217 |
28 \end{equation} |
218 \begin{lemma}{simp and mkeps}\label{lma2}\\ |
29 For simplicity, we use $s$ to denote $\textit{bsimp}$ and use $a \backslash c$ or $d \; c \; a $ to denote $\textit{bder}(c,a)$, then we can write the equation we want to prove in the following manner: |
219 When r is nullable, we have that |
30 \begin{center} |
220 mkeps(bsimp(r)) == mkeps(r) |
31 $s \; d \; c \; a= s \; d \; c \; s \; a$ |
|
32 \end{center} |
|
33 Specifically, we are interested in the case where $a = a_1+a_2$. The inductive hypothesis is that |
|
34 |
|
35 \begin{center} |
|
36 $s \; d \; c \; a_1= s \; d \; c \; s \; a_1 \; \textbf{and} \; s \; d \; c \; a_2= s \; d \; c \; s \; a_2.$ |
|
37 \end{center} |
|
38 \noindent |
|
39 We want to prove that the $\textit{LHS}$ of \eqref{cen} is equal to the $\textit{RHS}$ of \eqref{cen}. |
|
40 For better readability the bits are ommitted as they don't inhibit the proof process but just adds to the |
|
41 nuisance of writing. |
|
42 $\textit{LHS}$ can be manipulated successively as follows: |
|
43 \begin{center} |
|
44 \begin{tabular}{@{}rrl@{}} |
|
45 %\multicolumn{3}{@{}l}{}\medskip\\ |
|
46 $\textit{LHS}$ & $=$ & $s \; (a_1+a_2) \backslash c$\\ |
|
47 & $=$ & $s \; (a_1 \backslash c + a_2 \backslash c)$ \\ |
|
48 & $\overset{\autoref{lma1}}{=}$ & $s(s(a_1 \backslash c) + s(a_2 \backslash c))$ \\ |
|
49 & $\overset{\autoref{lma2}}{=}$ & $s(s(a_1) \backslash c + s(a_2) \backslash c).$\\ |
|
50 \end{tabular} |
|
51 \end{center} |
|
52 $\textit{RHS}$ can be manipulated this way: |
|
53 \begin{center} |
|
54 \begin{tabular}{@{}rrl@{}} |
|
55 %\multicolumn{3}{@{}l}{}\medskip\\ |
|
56 $\textit{RHS}$ & $=$ & $s \; [(s(a_1+a_2)] \backslash c$\\ |
|
57 \end{tabular} |
|
58 \end{center} |
|
59 If we refer to $s(a_1+a_2)$ as $core$, then we have |
|
60 \begin{center} |
|
61 \begin{tabular}{@{}rrl@{}} |
|
62 %\multicolumn{3}{@{}l}{}\medskip\\ |
|
63 $\textit{RHS}$ & $=$ & $s \; (core \backslash c)$\\ |
|
64 \end{tabular} |
|
65 \end{center} |
|
66 and then |
|
67 \begin{center} |
|
68 \begin{tabular}{@{}rrl@{}} |
|
69 %\multicolumn{3}{@{}l}{}\medskip\\ |
|
70 $\textit{core}$ & $=$ & $s \; \textit{ALTS}(bs, a_1+a_2)$\\ |
|
71 & $\overset{\mathit{bsimp \; def}}{=}$ & $Li(ALTS(bs, dB(flats(s(a_1)+s(a_2))))$\\ |
|
72 \end{tabular} |
|
73 \end{center} |
|
74 Here we use $Li$ to refer to the operation that opens up the $\textit{ALTS}$ when it has 1 |
|
75 element, returns 0 when it has 0 elements or does nothing when |
|
76 there are 2 or more elements in the list $rs$ in $\textit{ALTS}(bs, rs)$(in scala code corresponds to the case clauses). |
|
77 |
|
78 Now in order to establish that $LHS = RHS$, we need to |
|
79 prove the transformed results we got above |
|
80 for $LHS$ and $RHS$ are equal to each other. |
|
81 That is, |
|
82 \begin{center} |
|
83 \begin{tabular}{@{}rrl@{}} |
|
84 %\multicolumn{3}{@{}l}{}\medskip\\ |
|
85 $s(s(a_1) \backslash c + s(a_2) \backslash c)$ & $=$ & $Li(ALTS(bs, dB(flats(s(a_1)+s(a_2))))$\\ |
|
86 \end{tabular} |
|
87 \end{center} |
|
88 We shall call the two sides of the above equation $LHS'$ and $RHS'$. |
|
89 To prove this equality we just need to consider what $s(a_1)$ and $s(a_2)$ look like. |
|
90 There are three interesting possibility for each, namely |
|
91 $s(a_i)$ is an alt, a star or a sequence. Combined that is |
|
92 9 possibilities. We just need to investigate each of these 9 possibilities. |
|
93 Here we only one of the 9 cases. The others are handled in a similar |
|
94 fashion. |
|
95 |
|
96 When $s(a_1) = ALTS(bs_1, as_1)$ and $s(a_2) = ALTS(bs_2, as_2)$, |
|
97 \begin{center} |
|
98 $\textit{LHS'}$ \\ |
|
99 $=$ \\ |
|
100 $s(ALTS(bs, ALTS(bs_1, as_1) \backslash c |
|
101 + ALTS(bs_2, as_2) \backslash c))$\\ |
|
102 $=$ \\ |
|
103 $s(ALTS(bs, ALTS(bs_1, as_1.map \backslash c )+ ALTS(bs_2,as_2.map \backslash c) ) )$\\ |
|
104 $\overset{\autoref{lma3}}{=}$ \\ |
|
105 $s(ALTS(bs, (as_1.map \backslash c ).map(fuse(bs_1))+ (as_2.map \backslash c).map(fuse(bs_2)) ) )$\\ |
|
106 \end{center} |
|
107 |
|
108 And then we deal with $RHS'$: |
|
109 $RHS'$\\ |
|
110 $\overset{\autoref{lma4}}{=}$ \\ |
|
111 $s(ALTS(bs, (as_1.map \backslash c ).map(fuse(bs_1))++ (as_2.map \backslash c).map(fuse(bs_2)) ) )$\\ |
|
112 and this completes the proof. |
|
113 |
|
114 \begin{lemma}{doing simplification in advance to subparts}\label{lma1}\\ |
|
115 We have that for any annotated regular expressions $a_1 \; a_2$ and bitcode $bs$,\\ |
|
116 $\textit{bsimp}(\textit{ALTS}(bs, a_1, a_2)) = |
|
117 \textit{bsimp}(\textit{ALTS}(bs, \textit{bsimp}(a_1), \textit{bsimp}(a_2))) $ |
|
118 \end{lemma} |
|
119 |
|
120 \begin{lemma}{combination of lemma 1 and inductive hypothesis(from now on use simple notation)}\label{lma2}\\ |
|
121 We have that for any annotated regular expressions $a_1 \; a_2$ and bitcode $bs$, |
|
122 $s(s(a_1 \backslash c) + s(a_2 \backslash c)) = |
|
123 s(s(a_1) \backslash c + s(a_2) \backslash c)$ |
221 \end{lemma} |
124 \end{lemma} |
222 |
125 |
223 |
126 |
224 %\begin{theorem}See~\cref{lma1}.\end{theorem} |
127 %\begin{theorem}See~\cref{lma1}.\end{theorem} |
225 %\begin{lemma}\label{lma1}\lipsum[2]\end{lemma} |
128 %\begin{lemma}\label{lma1}\lipsum[2]\end{lemma} |
226 |
129 |
227 \begin{lemma}{mkeps equivalence w.r.t some syntactically different regular expressions(1 ALTS)}\label{lma3}\\ |
130 \begin{lemma}{Spilling out ALTS does not affect simplification result}\label{lma3}\\ |
228 When one of the 2 regular expressions $s(r_1)$ and $s(r_2)$ is of the form ALTS(bs1, rs1), we have that $ds(ALTS(bs, r1, r2)) \sim_{m\epsilon} d(ALTS(bs, sr_1, sr_2))$ |
131 $s(ALTS(bs, ALTS(bs_1, as_1.map \backslash c )+ ALTS(bs_2,as_2.map \backslash c) ) )$\\ |
229 \end{lemma} |
132 $\overset{\autoref{lma3}}{=}$ \\ |
230 \begin{proof} |
133 $s(ALTS(bs, (as_1.map \backslash c ).map(fuse(bs_1))+ (as_2.map \backslash c).map(fuse(bs_2)) ) )$\\ |
231 By opening up one of the alts and show no additional changes are made.\\ |
134 \end{lemma} |
232 Details: $ds(ALTS(bs, r1, r2)) = d Co( bs, dB(flats(sr1, sr2)) )$ |
135 |
233 \end{proof} |
136 \begin{lemma}{deleting duplicates does not affect simplification result}\label{lma4}\\ |
234 |
137 $s(ALTS(bs, (as_1.map \backslash c ).map(fuse(bs_1))+ (as_2.map \backslash c).map(fuse(bs_2)) ) )$\\ |
|
138 $=$\\ |
|
139 $s(ALTS(bs, dB((as_1.map \backslash c ).map(fuse(bs_1))+ (as_2.map \backslash c).map(fuse(bs_2)) ) ))$ |
|
140 \end{lemma} |
235 |
141 |
236 \begin{lemma}{mkepsBC invariant manipulation of bits and notation}\label{lma7}\\ |
142 \begin{lemma}{mkepsBC invariant manipulation of bits and notation}\label{lma7}\\ |
237 ALTS(bs, ALTS(bs1, rs1), ALTS(bs2, rs2)) $\sim_{m\epsilon}$ ALTS(bs, rs1.map(fuse(bs1, \_)) ++ rs2.map(fuse(bs2, \_)) ). \\ |
143 ALTS(bs, ALTS(bs1, rs1), ALTS(bs2, rs2)) $\sim_{m\epsilon}$ ALTS(bs, rs1.map(fuse(bs1, \_)) ++ rs2.map(fuse(bs2, \_)) ). \\ |
238 We also use $bs2>>rs2 $ as a shorthand notation for rs2.map(fuse(bs2,\_)). |
144 We also use $bs2>>rs2 $ as a shorthand notation for rs2.map(fuse(bs2,\_)). |
239 \end{lemma} |
145 \end{lemma} |
240 |
146 |
241 \begin{lemma}{mkepsBC equivalence w.r.t syntactically different regular expressions(2 ALTS)}\label{lma4}\\ |
147 |
242 $sr_1 = ALTS(bs1, rs1)$ and $sr_2 = ALTS(bs2, rs2)$ we have $ d(sr_1 +sr_2 ) \sim_{m\epsilon} d(ALTS(bs, bs1>>rs1 ++ bs2>>rs2))$ |
|
243 \end{lemma} |
|
244 \begin{proof} |
|
245 We are just fusing bits inside here, there is no other structural change. |
|
246 \end{proof} |
|
247 |
148 |
248 \begin{lemma}{What does dB do to two already simplified ALTS}\label{lma5}\\ |
149 \begin{lemma}{What does dB do to two already simplified ALTS}\label{lma5}\\ |
249 $d Co(ALTS(bs, dB(bs1>>rs1 ++ bs2>>rs2))) = d Co(ALTS(bs, bs1>>rs1 ++ ((bs2>>rs2)--rs1) )) $ |
150 $d Co(ALTS(bs, dB(bs1>>rs1 ++ bs2>>rs2))) = d Co(ALTS(bs, bs1>>rs1 ++ ((bs2>>rs2)--rs1) )) $ |
250 \end{lemma} |
151 \end{lemma} |
251 \begin{proof} |
152 \begin{proof} |