1043 \noindent |
1054 \noindent |
1044 The harder closed forms are the sequence and star ones. |
1055 The harder closed forms are the sequence and star ones. |
1045 Before we go on to obtain them, some preliminary definitions |
1056 Before we go on to obtain them, some preliminary definitions |
1046 are needed to make proof statements concise. |
1057 are needed to make proof statements concise. |
1047 |
1058 |
1048 \section{"Closed Forms" of regular expressions' derivatives w.r.t strings} |
1059 \section{"Closed Forms" of Sequence Regular Expressions} |
1049 We note that the different ways in which regular expressions are |
1060 The problem of obataining a closed-form for sequence regular expression |
1050 nested do not matter under $\rsimp{}$: |
1061 is constructing $(r_1 \cdot r_2) \backslash_r s$ |
1051 \begin{center} |
1062 if we are only allowed to use a combination of $r_1 \backslash s''$ |
1052 To be proven:\\ |
1063 and $r_2 \backslash s''$ , where $s''$ is from $s$. |
1053 $\rsimp{r} \stackrel{?}{=} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ |
1064 First let's look at a series of derivatives steps on a sequence |
1054 and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$ |
1065 regular expression, assuming that each time the first |
1055 \end{center} |
1066 component of the sequence is always nullable): |
1056 \noindent |
1067 \begin{center} |
|
1068 |
|
1069 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ |
|
1070 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad |
|
1071 \ldots$ |
|
1072 |
|
1073 \end{center} |
|
1074 Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as |
|
1075 a giant alternative taking a list of terms |
|
1076 $[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$, |
|
1077 where the head of the list is always the term |
|
1078 representing a match involving only $r_1$, and the tail of the list consisting of |
|
1079 terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$. |
1057 This intuition is also echoed by IndianPaper, where they gave |
1080 This intuition is also echoed by IndianPaper, where they gave |
1058 a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$: |
1081 a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$: |
1059 \begin{center} |
1082 \begin{center} |
1060 $((r_1 \cdot r_2) \backslash_r c_1) \stackrel{\backslash_r c_2}{=} |
1083 \begin{tabular}{c} |
1061 ((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1),\; r_2 \backslash_r c_1)) |
1084 $(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\ |
1062 \stackrel{\backslash_r c_2}{=} |
1085 \rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n) |
1063 ((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2) |
1086 \myequiv$\\ |
|
1087 \rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2)) |
|
1088 + (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n) |
|
1089 $ |
|
1090 \end{tabular} |
|
1091 \end{center} |
|
1092 \noindent |
|
1093 The equality in above should be interpretated |
|
1094 as language equivalence. |
|
1095 The $\delta$ function works similarly to that of |
|
1096 a Kronecker delta function: |
|
1097 \[ \delta \; b\; r\] |
|
1098 will produce $r$ |
|
1099 if $b$ evaluates to true, |
|
1100 and $\RZERO$ otherwise. |
|
1101 Note that their formulation |
|
1102 \[ |
|
1103 ((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2) |
1064 + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2) |
1104 + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2) |
1065 $ |
1105 \] |
1066 \end{center} |
1106 does not faithfully |
1067 \noindent |
1107 represent what the intermediate derivatives would actually look like |
1068 The $\delta \; b \; r$ function above turns the entire term into $\RZERO$ |
1108 when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not |
1069 if the boolean condition $b$ evaluates to false, and outputs $r$ otherwise. |
1109 nullable in the head of the sequence. |
1070 The equality in their derivation steps should be interpretated |
1110 For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable, |
1071 as language equivalence. To pin down the intuition into rigorous terms, |
1111 the regular expression would not look like |
1072 $\sflat{}$ is used to enable the transformation from |
1112 \[ |
|
1113 (r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO, |
|
1114 \] |
|
1115 but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the |
|
1116 first place. |
|
1117 In a closed-form one would want to take into account this |
|
1118 and generate the list of |
|
1119 regular expressions $r_2 \backslash_r s''$ with |
|
1120 string pairs $(s', s'')$ where $s'@s'' = s$ and |
|
1121 $r_1 \backslash s'$ nullable. |
|
1122 We denote the list consisting of such |
|
1123 strings $s''$ as $\vsuf{s}{r_1}$. |
|
1124 |
|
1125 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: |
|
1126 \begin{center} |
|
1127 \begin{tabular}{lcl} |
|
1128 $\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
|
1129 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ |
|
1130 && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ |
|
1131 \end{tabular} |
|
1132 \end{center} |
|
1133 \noindent |
|
1134 The list is sorted in the order $r_2\backslash s''$ |
|
1135 appears in $(r_1\cdot r_2)\backslash s$. |
|
1136 In essence, $\vsuf{\_}{\_}$ is doing a |
|
1137 "virtual derivative" of $r_1 \cdot r_2$, but instead of producing |
|
1138 the entire result $(r_1 \cdot r_2) \backslash s$, |
|
1139 it only stores all the strings $s''$ such that $r_2 \backslash s''$ |
|
1140 are occurring terms in $(r_1\cdot r_2)\backslash s$. |
|
1141 |
|
1142 To make the closed form representation |
|
1143 more straightforward, |
|
1144 the flattetning function $\sflat{\_}$ is used to enable the transformation from |
1073 a left-associative nested sequence of alternatives into |
1145 a left-associative nested sequence of alternatives into |
1074 a flattened list: |
1146 a flattened list: |
1075 $r = \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{}}{\rightarrow} |
1147 \[ |
1076 (\ldots ((r_1 + r_2) + r_3) + \ldots)$ |
1148 \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow} |
1077 The definitions $\sflat{}$, $\sflataux{}$ are given below. |
1149 (\ldots ((r_1 + r_2) + r_3) + \ldots) |
1078 |
1150 \] |
|
1151 \noindent |
|
1152 The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below. |
1079 \begin{center} |
1153 \begin{center} |
1080 \begin{tabular}{ccc} |
1154 \begin{tabular}{ccc} |
1081 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
1155 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
1082 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ |
1156 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ |
1083 $\sflataux r$ & $=$ & $ [r]$ |
1157 $\sflataux r$ & $=$ & $ [r]$ |
1089 $\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\ |
1163 $\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\ |
1090 $\sflat{\sum []}$ & $ = $ & $ \sum []$\\ |
1164 $\sflat{\sum []}$ & $ = $ & $ \sum []$\\ |
1091 $\sflat r$ & $=$ & $ r$ |
1165 $\sflat r$ & $=$ & $ r$ |
1092 \end{tabular} |
1166 \end{tabular} |
1093 \end{center} |
1167 \end{center} |
1094 The intuition behind $\sflataux{\_}$ is to break up nested regexes |
1168 \noindent |
|
1169 $\sflataux{\_}$ breaks up nested alternative regexes |
1095 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape |
1170 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape |
1096 into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. |
1171 into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. |
1097 It will return the singleton list $[r]$ otherwise. |
1172 It will return the singleton list $[r]$ otherwise. |
1098 $\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
1173 $\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
1099 the output type a regular expression, not a list. |
1174 the output type a regular expression, not a list. |
1100 $\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the |
1175 $\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the |
1101 first element of the list of children of |
1176 first element of the list. |
1102 an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression |
1177 |
1103 $r_1 \cdot r_2 \backslash s$. |
1178 With $\sflataux{}$ a preliminary to the closed form can be stated, |
1104 \subsection{The $\textit{vsuf}$ Function} |
1179 where the derivative of $r_1 \cdot r_2 \backslash s$ can be |
1105 Note that $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2) |
1180 flattened into a list whose head and tail meet the description |
1106 + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)$ does not faithfully |
1181 we gave earlier. |
1107 represent what would the intermediate derivatives would actually look like. |
1182 \begin{lemma}\label{seqSfau0} |
1108 For example, when $r_1$ and $r_1 \backslash_r c_1$ are nullable, |
1183 $\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2 |
1109 the regular expression would not look like $(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO$, |
1184 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ |
1110 but actually $r_1\backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the |
1185 \end{lemma} |
1111 first place. |
1186 \begin{proof} |
1112 In a closed-form one would want to take into account this |
1187 By an induction on the string $s$, where the inductive cases |
1113 and generate the list of string pairs $(s', s'')$ where $s'@s'' = s$ such that |
1188 are split as $[]$ and $xs @ [x]$. |
1114 only $r_1 \backslash s'$ nullable. |
1189 Note the key identify holds: |
1115 With $\sflat{\_}$ and $\sflataux{\_}$ we make |
1190 \[ |
1116 precise what "closed forms" we have for the sequence derivatives and their simplifications, |
1191 \map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\; |
1117 in other words, how can we construct $(r_1 \cdot r_2) \backslash s$ |
1192 \map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1})) |
1118 and $\bderssimp{(r_1\cdot r_2)}{s}$, |
1193 \] |
1119 if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$) |
1194 = |
1120 and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over |
1195 \[ |
1121 the substring of $s$? |
1196 \map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1}) |
1122 First let's look at a series of derivatives steps on a sequence |
1197 \] |
1123 regular expression, (assuming) that each time the first |
1198 This enables the inductive case to go through. |
1124 component of the sequence is always nullable): |
1199 \end{proof} |
1125 \begin{center} |
1200 \noindent |
1126 |
1201 Note that this lemma does $\mathbf{not}$ depend on any |
1127 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ |
1202 specific definitions we used, |
1128 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad |
1203 allowing people investigating derivatives to get an alternative |
1129 \ldots$ |
1204 view of what $r_1 \cdot r_2$ is. |
1130 |
1205 |
1131 \end{center} |
1206 Now we are able to use this for the intuition that |
1132 %TODO: cite indian paper |
1207 the different ways in which regular expressions are |
1133 Indianpaper have come up with a slightly more formal way of putting the above process: |
1208 nested do not matter under $\rsimp{}$: |
1134 \begin{center} |
1209 \begin{center} |
1135 $r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) + |
1210 $\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ |
1136 \delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots |
1211 and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$ |
1137 + \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$ |
1212 \end{center} |
1138 \end{center} |
1213 Simply wrap with $\sum$ constructor and add |
1139 where $\delta(b, r)$ will produce $r$ |
1214 simplifications to both sides of \ref{seqSfau0} |
1140 if $b$ evaluates to true, |
1215 and one gets |
1141 and $\ZERO$ otherwise. |
1216 \begin{corollary}\label{seqClosedFormGeneral} |
1142 |
1217 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} } |
1143 But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical |
1218 =\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 :: |
1144 equivalence. To make this intuition useful |
1219 \map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$ |
1145 for a formal proof, we need something |
1220 \end{corollary} |
1146 stronger than language equivalence. |
1221 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}), |
1147 With the help of $\sflat{\_}$ we can state the equation in Indianpaper |
|
1148 more rigorously: |
|
1149 \begin{lemma} |
|
1150 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
|
1151 \end{lemma} |
|
1152 |
|
1153 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: |
|
1154 |
|
1155 \begin{center} |
|
1156 \begin{tabular}{lcl} |
|
1157 $\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
|
1158 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ |
|
1159 && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ |
|
1160 \end{tabular} |
|
1161 \end{center} |
|
1162 It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable, |
|
1163 and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in |
|
1164 the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$. |
|
1165 In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing |
|
1166 the entire result of $(r_1 \cdot r_2) \backslash s$, |
|
1167 it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$. |
|
1168 |
|
1169 With this we can also add simplifications to both sides and get |
|
1170 \begin{lemma} |
|
1171 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
1172 \end{lemma} |
|
1173 Together with the idempotency property of $\rsimp{}$\ref{rsimpIdem}, |
|
1174 %TODO: state the idempotency property of rsimp |
|
1175 it is possible to convert the above lemma to obtain a "closed form" |
1222 it is possible to convert the above lemma to obtain a "closed form" |
1176 for derivatives nested with simplification: |
1223 for derivatives nested with simplification: |
|
1224 \begin{lemma}\label{seqClosedForm} |
|
1225 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) |
|
1226 :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$ |
|
1227 \end{lemma} |
|
1228 \begin{proof} |
|
1229 By a case analysis of string $s$. |
|
1230 When $s$ is empty list, the rewrite is straightforward. |
|
1231 When $s$ is a list, one could use the corollary \ref{seqSfau0}, |
|
1232 and lemma \ref{Simpders} to rewrite the left-hand-side. |
|
1233 \end{proof} |
|
1234 As a corollary for this closed form, one can estimate the size |
|
1235 of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using |
|
1236 an easier-to-handle expression: |
|
1237 \begin{corollary}\label{seqEstimate1} |
|
1238 \begin{center} |
|
1239 |
|
1240 $\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) |
|
1241 :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$ |
|
1242 |
|
1243 \end{center} |
|
1244 \end{corollary} |
|
1245 \noindent |
|
1246 \subsection{Closed Forms for Star Regular Expressions} |
|
1247 We use a similar technique as $r_1 \cdot r_2$ case, |
|
1248 generating |
|
1249 all possible sub-strings $s'$ of $s$ |
|
1250 such that $r\backslash s' \cdot r^*$ will appear |
|
1251 as a term in $(r^*) \backslash s$. |
|
1252 The first function we define is a single-step |
|
1253 updating function $\starupdate$, which takes three arguments as input: |
|
1254 the new character $c$ to take derivative with, |
|
1255 the regular expression |
|
1256 $r$ directly under the star $r^*$, and the |
|
1257 list of strings $sSet$ for the derivative $r^* \backslash s$ |
|
1258 up til this point |
|
1259 such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ |
|
1260 (the equality is not exact, more on this later). |
|
1261 \begin{center} |
|
1262 \begin{tabular}{lcl} |
|
1263 $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ |
|
1264 $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ |
|
1265 & & $\textit{if} \; |
|
1266 (\rnullable \; (\rders \; r \; s))$ \\ |
|
1267 & & $\textit{then} \;\; (s @ [c]) :: [c] :: ( |
|
1268 \starupdate \; c \; r \; Ss)$ \\ |
|
1269 & & $\textit{else} \;\; (s @ [c]) :: ( |
|
1270 \starupdate \; c \; r \; Ss)$ |
|
1271 \end{tabular} |
|
1272 \end{center} |
|
1273 \noindent |
|
1274 As a generalisation from characters to strings, |
|
1275 $\starupdates$ takes a string instead of a character |
|
1276 as the first input argument, and is otherwise the same |
|
1277 as $\starupdate$. |
|
1278 \begin{center} |
|
1279 \begin{tabular}{lcl} |
|
1280 $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\ |
|
1281 $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; ( |
|
1282 \starupdate \; c \; r \; Ss)$ |
|
1283 \end{tabular} |
|
1284 \end{center} |
|
1285 \noindent |
|
1286 For the star regular expression, |
|
1287 its derivatives can be seen as a nested gigantic |
|
1288 alternative similar to that of sequence regular expression's derivatives, |
|
1289 and therefore need |
|
1290 to be ``straightened out" as well. |
|
1291 The function for this would be $\hflat{}$ and $\hflataux{}$. |
|
1292 \begin{center} |
|
1293 \begin{tabular}{lcl} |
|
1294 $\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ |
|
1295 $\hflataux{r}$ & $\dn$ & $[r]$ |
|
1296 \end{tabular} |
|
1297 \end{center} |
|
1298 |
|
1299 \begin{center} |
|
1300 \begin{tabular}{lcl} |
|
1301 $\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\ |
|
1302 $\hflat{r}$ & $\dn$ & $r$ |
|
1303 \end{tabular} |
|
1304 \end{center} |
|
1305 \noindent |
|
1306 %MAYBE TODO: introduce createdByStar |
|
1307 We first introduce an inductive property |
|
1308 for $\starupdate$ and $\hflataux{\_}$, |
|
1309 it says if we do derivatives of $r^*$ |
|
1310 with a string that starts with $c$, |
|
1311 then flatten it out, |
|
1312 we obtain a list |
|
1313 of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$, |
|
1314 where $sSet = \starupdates \; s \; r \; [[c]]$. |
|
1315 \begin{lemma}\label{starHfauInduct} |
|
1316 $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = |
|
1317 \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; |
|
1318 (\starupdates \; s \; r_0 \; [[c]])$ |
|
1319 \end{lemma} |
|
1320 \begin{proof} |
|
1321 By an induction on $s$, the inductive cases |
|
1322 being $[]$ and $s@[c]$. |
|
1323 \end{proof} |
|
1324 \noindent |
|
1325 Here is a corollary that states the lemma in |
|
1326 a more intuitive way: |
|
1327 \begin{corollary} |
|
1328 $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot |
|
1329 (r^*))\; (\starupdates \; c\; r\; [[c]])$ |
|
1330 \end{corollary} |
|
1331 \noindent |
|
1332 Note that this is also agnostic of the simplification |
|
1333 function we defined, and is therefore of more general interest. |
|
1334 |
|
1335 Now adding the $\rsimp{}$ bit for closed forms, |
|
1336 we have |
1177 \begin{lemma} |
1337 \begin{lemma} |
1178 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
1338 $a :: rs \grewrites \hflataux{a} @ rs$ |
1179 \end{lemma} |
1339 \end{lemma} |
1180 And now the reason we have (1) in section 1 is clear: |
1340 \noindent |
1181 $\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, |
1341 giving us |
1182 is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$ |
1342 \begin{lemma}\label{cbsHfauRsimpeq1} |
1183 as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$. |
1343 $\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$. |
1184 |
1344 \end{lemma} |
1185 %---------------------------------------------------------------------------------------- |
1345 \noindent |
1186 % SECTION 3 |
1346 This yields |
1187 %---------------------------------------------------------------------------------------- |
1347 \begin{lemma}\label{hfauRsimpeq2} |
1188 |
1348 $\rsimp{r} = \rsimp{(\sum \hflataux{r})}$ |
1189 \section{interaction between $\distinctWith$ and $\flts$} |
1349 \end{lemma} |
1190 Note that it is not immediately obvious that |
1350 \noindent |
1191 \begin{center} |
1351 Together with the rewriting relation |
1192 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $. |
1352 \begin{lemma}\label{starClosedForm6Hrewrites} |
1193 \end{center} |
1353 $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss |
|
1354 \scfrewrites |
|
1355 \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ |
|
1356 \end{lemma} |
|
1357 \noindent |
|
1358 We obtain the closed form for star regular expression: |
|
1359 \begin{lemma}\label{starClosedForm} |
|
1360 $\rderssimp{r^*}{c::s} = |
|
1361 \rsimp{ |
|
1362 (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; |
|
1363 (\starupdates \; s\; r \; [[c]]) |
|
1364 ) |
|
1365 ) |
|
1366 } |
|
1367 $ |
|
1368 \end{lemma} |
|
1369 \begin{proof} |
|
1370 By an induction on $s$. |
|
1371 The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2} |
|
1372 are used. |
|
1373 \end{proof} |
|
1374 \section{Estimating the Closed Forms' sizes} |
|
1375 We now summarize the closed forms below: |
|
1376 \begin{itemize} |
|
1377 \item |
|
1378 $\rderssimp{(\sum rs)}{s} \sequal |
|
1379 \sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$ |
|
1380 \item |
|
1381 $\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum ((r_1 \backslash s) \cdot r_2 ) |
|
1382 :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1}))$ |
|
1383 \item |
|
1384 |
|
1385 $\rderssimp{r^*}{c::s} = |
|
1386 \rsimp{ |
|
1387 (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; |
|
1388 (\starupdates \; s\; r \; [[c]]) |
|
1389 ) |
|
1390 ) |
|
1391 } |
|
1392 $ |
|
1393 \end{itemize} |
|
1394 \noindent |
|
1395 The closed forms on the left-hand-side |
|
1396 are all of the same shape: $\rsimp{ (\sum rs)} $. |
|
1397 Such regular expression will be bounded by the size of $\sum rs'$, |
|
1398 where every element in $rs'$ is distinct, and each element |
|
1399 can be described by some inductive sub-structures |
|
1400 (for example when $r = r_1 \cdot r_2$ then $rs'$ |
|
1401 will be solely comprised of $r_1 \backslash s'$ |
|
1402 and $r_2 \backslash s''$, $s'$ and $s''$ being |
|
1403 sub-strings of $s$). |
|
1404 which will each have a size uppder bound |
|
1405 according to inductive hypothesis, which controls $r \backslash s$. |
|
1406 |
|
1407 We elaborate the above reasoning by a series of lemmas |
|
1408 below, where straightforward proofs are omitted. |
|
1409 \begin{lemma} |
|
1410 If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$, |
|
1411 and $\textit{length} \; rs$ is less than or equal to $l$, |
|
1412 then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$. |
|
1413 \end{lemma} |
|
1414 \noindent |
|
1415 If we define all regular expressions with size no |
|
1416 more than $N$ as $\sizeNregex \; N$: |
|
1417 \[ |
|
1418 \sizeNregex \; N \dn \{r \mid \rsize{r} \leq N \} |
|
1419 \] |
|
1420 Then such set is finite: |
|
1421 \begin{lemma}\label{finiteSizeN} |
|
1422 $\textit{isFinite}\; (\sizeNregex \; N)$ |
|
1423 \end{lemma} |
|
1424 \begin{proof} |
|
1425 By overestimating the set $\sizeNregex \; N + 1$ |
|
1426 using union of sets like |
|
1427 $\{r_1 \cdot r_2 \mid r_1 \in A |
|
1428 \text{and} |
|
1429 r_2 \in A\} |
|
1430 $ where $A = \sizeNregex \; N$. |
|
1431 \end{proof} |
|
1432 \noindent |
|
1433 From this we get a corollary that |
|
1434 if forall $r \in rs$, $\rsize{r} \leq N$, then the output of |
|
1435 $\rdistinct{rs}{\varnothing}$ is a list of regular |
|
1436 expressions of finite size depending on $N$ only. |
|
1437 \begin{corollary} |
|
1438 Assumes that for all $r \in rs. \rsize{r} \leq N$, |
|
1439 and the cardinality of $\sizeNregex \; N$ is $c_N$ |
|
1440 then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$. |
|
1441 \end{corollary} |
|
1442 \noindent |
|
1443 We have proven that the output of $\rdistinct{rs'}{\varnothing}$ |
|
1444 is bounded by a constant $c_N$ depending only on $N$, |
|
1445 provided that each of $rs'$'s element |
|
1446 is bounded by $N$. |
|
1447 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. |
|
1448 |
|
1449 We show how $\rdistinct$ and $\rflts$ |
|
1450 in the simplification function together is at least as |
|
1451 good as $\rdistinct{}{}$ alone. |
|
1452 \begin{lemma}\label{interactionFltsDB} |
|
1453 $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
|
1454 \leq |
|
1455 \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
|
1456 \end{lemma} |
|
1457 \noindent |
1194 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of |
1458 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of |
1195 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. |
1459 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. |
1196 |
1460 |
1197 |
1461 Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$: |
1198 \subsection{A Closed Form for the Sequence Regular Expression} |
1462 \begin{lemma}\label{altsSimpControl} |
1199 \begin{quote}\it |
1463 $\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$ |
1200 Claim: For regular expressions $r_1 \cdot r_2$, we claim that |
1464 \end{lemma} |
1201 \begin{center} |
1465 \begin{proof} |
1202 $ \rderssimp{r_1 \cdot r_2}{s} = |
1466 By using \ref{interactionFltsDB}. |
1203 \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ |
1467 \end{proof} |
1204 \end{center} |
1468 \noindent |
1205 \end{quote} |
1469 which says that the size of regular expression |
1206 \noindent |
1470 is always smaller if we apply the full simplification |
1207 |
1471 rather than just one component ($\rdistinct{}{}$). |
1208 Before we get to the proof that says the intermediate result of our lexer will |
1472 |
1209 remain finitely bounded, which is an important efficiency/liveness guarantee, |
1473 |
1210 we shall first develop a few preparatory properties and definitions to |
1474 Now we are ready to control the sizes of |
1211 make the process of proving that a breeze. |
1475 $r_1 \cdot r_2 \backslash s$, $r^* \backslash s$. |
1212 |
|
1213 We define rewriting relations for $\rrexp$s, which allows us to do the |
|
1214 same trick as we did for the correctness proof, |
|
1215 but this time we will have stronger equalities established. |
|
1216 |
|
1217 |
|
1218 \section{Estimating the Closed Forms' sizes} |
|
1219 |
|
1220 The closed form $f\; (g\; (\sum rs)) $ is controlled |
|
1221 by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$. |
|
1222 %----------------------------------- |
|
1223 % SECTION 2 |
|
1224 %----------------------------------- |
|
1225 |
|
1226 \begin{theorem} |
1476 \begin{theorem} |
1227 For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ |
1477 For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$ |
1228 \end{theorem} |
1478 \end{theorem} |
1229 |
1479 \noindent |
1230 \noindent |
1480 \begin{proof} |
1231 \begin{proof} |
1481 We prove this by induction on $r$. The base cases for $\RZERO$, |
1232 We prove this by induction on $r$. The base cases for $\AZERO$, |
1482 $\RONE $ and $\RCHAR{c}$ are straightforward. |
1233 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is |
1483 In the sequence $r_1 \cdot r_2$ case, |
1234 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction |
1484 the inductive hypotheses state $\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and |
1235 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and |
1485 $\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows |
1236 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows |
|
1237 % |
1486 % |
1238 \begin{center} |
1487 \begin{center} |
1239 \begin{tabular}{lcll} |
1488 \begin{tabular}{lcll} |
1240 & & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ |
1489 & & $ \llbracket \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\ |
1241 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: |
1490 & $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash s \cdot r_2 \; \; :: \; \; |
1242 [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ |
1491 \map \; (\lambda s'. r_2\backslash_{rsimp} s') (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\ |
1243 & $\leq$ & |
1492 & $\leq$ & $\llbracket \rsimp{(\sum(r_1 \backslash s \cdot r_2 :: |
1244 $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: |
1493 \map \; (\lambda s'. r_2\backslash_{rsimp} s') (\vsuf{s}{r})))} \rrbracket_r $ & (2) \\ |
1245 [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ |
1494 |
1246 & $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + |
|
1247 \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ |
|
1248 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + |
|
1249 \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ |
|
1250 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) |
|
1251 \end{tabular} |
1495 \end{tabular} |
1252 \end{center} |
1496 \end{center} |
1253 |
1497 |
1254 |
1498 \end{proof} |
1255 \noindent |
1499 |
1256 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). |
1500 \noindent |
|
1501 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\rderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). |
1257 The reason why we could write the derivatives of a sequence this way is described in section 2. |
1502 The reason why we could write the derivatives of a sequence this way is described in section 2. |
1258 The term (2) is used to control (1). |
1503 The term (2) is used to control (1). |
1259 That is because one can obtain an overall |
1504 That is because one can obtain an overall |
1260 smaller regex list |
1505 smaller regex list |
1261 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. |
1506 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. |
1262 Section 3 is dedicated to its proof. |
1507 Section 3 is dedicated to its proof. |
1263 In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is |
1508 In (3) we know that $\llbracket \ASEQ{bs}{(\rderssimp{ r_1}{s}}{r_2}\rrbracket$ is |
1264 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
1509 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
1265 than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands |
1510 than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands |
1266 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
1511 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
1267 We reason similarly for $\STAR$.\medskip |
1512 We reason similarly for $\STAR$.\medskip |
1268 \end{proof} |
1513 |
|
1514 %----------------------------------- |
|
1515 % SECTION 2 |
|
1516 %----------------------------------- |
|
1517 |
|
1518 |
|
1519 %---------------------------------------------------------------------------------------- |
|
1520 % SECTION 3 |
|
1521 %---------------------------------------------------------------------------------------- |
|
1522 |
|
1523 |
|
1524 \subsection{A Closed Form for the Sequence Regular Expression} |
|
1525 \noindent |
|
1526 |
|
1527 Before we get to the proof that says the intermediate result of our lexer will |
|
1528 remain finitely bounded, which is an important efficiency/liveness guarantee, |
|
1529 we shall first develop a few preparatory properties and definitions to |
|
1530 make the process of proving that a breeze. |
|
1531 |
|
1532 We define rewriting relations for $\rrexp$s, which allows us to do the |
|
1533 same trick as we did for the correctness proof, |
|
1534 but this time we will have stronger equalities established. |
|
1535 |
|
1536 |
1269 |
1537 |
1270 What guarantee does this bound give us? |
1538 What guarantee does this bound give us? |
1271 |
1539 |
1272 Whatever the regex is, it will not grow indefinitely. |
1540 Whatever the regex is, it will not grow indefinitely. |
1273 Take our previous example $(a + aa)^*$ as an example: |
1541 Take our previous example $(a + aa)^*$ as an example: |