1102 |
1102 |
1103 \end{figure} |
1103 \end{figure} |
1104 \vspace{20mm} |
1104 \vspace{20mm} |
1105 \noindent |
1105 \noindent |
1106 We encircle in the diagram all the pairs $v_i, a_i$ to show that these values |
1106 We encircle in the diagram all the pairs $v_i, a_i$ to show that these values |
1107 and regular expressions correspond to each other. |
1107 and regular expressions share the same structure. |
1108 For the leftmost pair, we have that the |
1108 These pairs all contain adequate information to |
|
1109 obtain the final lexing result. |
|
1110 For example, in the leftmost pair the |
1109 lexical information is condensed in |
1111 lexical information is condensed in |
1110 $v_0$--the value part, whereas for the rightmost pair, |
1112 $v_0$, whereas for the rightmost pair, |
1111 the lexing result is in the bitcodes of $a_n$. |
1113 the lexing result hides is in the bitcodes of $a_n$.\\ |
1112 $\bmkeps$ is able to extract from $a_n$ the result |
1114 $\bmkeps$ is able to extract from $a_n$ the result |
1113 by looking for nullable parts of the regular expression, |
1115 by looking for nullable parts of the regular expression. |
1114 however for regular expressions $a_i$ in general |
1116 However for regular expressions $a_i$ in general, |
1115 they might not necessarily be nullable and therefore |
1117 they might not necessarily be nullable. |
1116 needs some ``help'' finding the POSIX bit-encoding. |
1118 For those regular expressions that are not nullable, |
|
1119 $\bmkeps$ will not work. |
|
1120 Therefore they need some additional ``help'' |
|
1121 finding the POSIX bit-encoding. |
1117 The most straightforward ``help'' comes from $a_i$'s corresponding |
1122 The most straightforward ``help'' comes from $a_i$'s corresponding |
1118 value $v_i$, and this suggests a function $f$ satisfying the |
1123 value $v_i$, and this suggests a function $f$ satisfying the |
1119 following properties: |
1124 following properties: |
1120 \begin{itemize} |
1125 \begin{itemize} |
1121 \item |
1126 \item |
1122 $f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$ |
1127 $f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$ |
1123 \item |
1128 \item |
1124 $f \; a_i\;v_i = f \; a_0 \; v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$ |
1129 $f \; a_i\;v_i = f \; a_0 \; v_0 = \code \; v_0$ %\decode \;(\code \; v_0) \; (\erase \; a_0)$ |
1125 \end{itemize} |
1130 \end{itemize} |
1126 \noindent |
1131 \noindent |
1127 If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$, |
1132 Sulzmann and Lu came up with a function satisfying |
1128 The core of the function $f$ is something that produces the bitcodes |
1133 the above |
1129 $\code \; v_0$. |
|
1130 It is unclear how, but Sulzmann and Lu came up with a function satisfying all the above |
|
1131 requirements, named \emph{retrieve}: |
1134 requirements, named \emph{retrieve}: |
1132 |
1135 \begin{center} |
1133 |
1136 \begin{tabular}{llcl} |
1134 |
1137 $\retrieve \; \, _{bs} \ONE$ & $ \Empty$ & $\dn$ & $\textit{bs}$\\ |
1135 \begin{center} |
1138 |
1136 \begin{tabular}{lcr} |
1139 $\retrieve \; \, _{bs} \mathbf{c}$ & $ (\Char \; c) $ & $\dn$ & $ \textit{bs}$\\ |
1137 $\retrieve \; \, (_{bs} \ONE) \; \, (\Empty)$ & $\dn$ & $\textit{bs}$\\ |
1140 |
1138 $\retrieve \; \, (_{bs} \mathbf{c} ) \; \Char(c)$ & $\dn$ & $ \textit{bs}$\\ |
1141 $\retrieve \; \, _{bs} a_1 \cdot a_2$ & $ (\Seq \; v_1 \; v_2) $ & |
1139 $\retrieve \; \, (_{bs} a_1 \cdot a_2) \; \Seq(v_1, v_2)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a_1\; v_1) @ (\retrieve \; a_2 \; v_2)$\\ |
1142 $\dn$ & $\textit{bs} \; @\; (\retrieve \; a_1\; v_1)\; @ \;(\retrieve \; a_2 \; v_2)$\\ |
1140 $\retrieve \; \, (_{bs} \Sigma (a :: \textit{as}) \; \,\Left(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v)$\\ |
1143 |
1141 $\retrieve \; \, (_{bs} \Sigma (a :: \textit{as} \; \, \Right(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; (_{[]}\Sigma \textit{as}) \; v)$\\ |
1144 $\retrieve \; \, _{bs} \Sigma (a :: \textit{as})$ & $ (\Left \; v) $ & |
1142 $\retrieve \; \, (_{bs} a^*) \; \, (\Stars(v :: vs)) $ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v) @ (\retrieve \; (_{[]} a^*) \; (\Stars(vs)))$\\ |
1145 $\dn$ & $\textit{bs}\; @\; (\retrieve \; a \; v)$\\ |
1143 $\retrieve \; \, (_{bs} a^*) \; \, (\Stars([]) $ & $\dn$ & $\textit{bs} @ [Z]$ |
1146 |
|
1147 $\retrieve \; \, _{bs} \Sigma (a :: \textit{as})$ & $ (\Right \; v) $ & |
|
1148 $\dn$ & $\textit{bs}\; @\; (\retrieve \; (_{[]}\Sigma \textit{as}) \; v)$\\ |
|
1149 |
|
1150 $\retrieve \; \, _{bs} a^* $ & $ (\Stars \; (v :: vs)) $ & $\dn$ & |
|
1151 $\textit{bs}\; @\; (\retrieve \; a \; v)\; @ \; |
|
1152 (\retrieve \; (_{[]} a^*) \; (\Stars \; vs) )$\\ |
|
1153 |
|
1154 $\retrieve \; \, _{bs} a^* $ & $ (\Stars \; []) $ & $\dn$ & $\textit{bs} \; @ \; [Z]$ |
1144 \end{tabular} |
1155 \end{tabular} |
1145 \end{center} |
1156 \end{center} |
1146 \noindent |
1157 \noindent |
1147 As promised, $\retrieve$ collects the right bit-codes from the |
1158 As promised, $\retrieve$ collects the right bit-codes from the |
1148 final derivative $a_n$: |
1159 final derivative $a_n$, guided by $v_n$: |
1149 \begin{lemma} |
1160 \begin{lemma}\label{bmkepsRetrieve} |
1150 $\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$ |
1161 $\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$ |
1151 \end{lemma} |
1162 \end{lemma} |
1152 \begin{proof} |
1163 \begin{proof} |
1153 By a routine induction on $a$. |
1164 By a routine induction on $a$. |
1154 \end{proof} |
1165 \end{proof} |
1155 The design of $\retrieve$ enables extraction of bit-codes |
1166 \noindent |
1156 from not only $\bnullable$ (annotated) regular expressions, |
1167 The design of $\retrieve$ enables us to extract bitcodes |
1157 but also those that are not $\bnullable$. |
1168 from both annotated regular expressions and values. |
1158 For example, if we have the regular expression just internalised |
1169 $\retrieve$ always ``sucks up'' all the information. |
|
1170 When more information is stored in the value, we would be able to |
|
1171 extract more from the value, and vice versa. |
|
1172 For example in star iterations, $\retrieve$ will be able to extract from $\Stars \; vs$ |
|
1173 exactly the same bitcodes as $\code \; (\Stars \; vs)$: |
|
1174 \begin{lemma} |
|
1175 If $\forall v \in vs. \vdash v : r$, and $\code \; v = \retrieve \; (\internalise\; r) \; v$\\ |
|
1176 then $\code \; (\Stars \; vs) = \retrieve \; _{[]}((\internalise \; r)^*) \; (\Stars \; vs)$ |
|
1177 \end{lemma}\label{retrieveEncodeSTARS} |
|
1178 \begin{proof} |
|
1179 By induction on the value list $vs$. |
|
1180 \end{proof} |
|
1181 \noindent |
|
1182 Similarly, when the value list does not hold information, |
|
1183 only the bitcodes plus some delimiter will be recorded: |
|
1184 \begin{center} |
|
1185 $\retrieve \; _{bs}((\internalise \; r)^*) \; (\Stars \; [] ) = bs @ [S]$. |
|
1186 \end{center} |
|
1187 In general, if we have a regular expression just internalised |
1159 and the lexing result value, we could $\retrieve$ the bitcdoes |
1188 and the lexing result value, we could $\retrieve$ the bitcdoes |
1160 that look as if we have en$\code$-ed the value: |
1189 that look as if we have en$\code$-ed the value as bitcodes: |
1161 \begin{lemma} |
1190 \begin{lemma} |
1162 $\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$ |
1191 $\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$ |
1163 \end{lemma} |
1192 \end{lemma} |
1164 \begin{proof} |
1193 \begin{proof} |
1165 By induction on $r$. |
1194 By induction on $r$. |
|
1195 The star case uses lemma \ref{retrieveEncodeSTARS}. |
1166 \end{proof} |
1196 \end{proof} |
1167 \noindent |
1197 \noindent |
1168 The following property is more interesting, as |
1198 The following property is more interesting, as |
1169 it provides a "bridge" between $a_0, v_0$ and $a_n, v_n$ in the |
1199 it provides a "bridge" between $a_0$ and $a_n $ in the |
1170 lexing diagram. |
1200 lexing diagram by linking adjacent regular expressions $a_i$ and |
1171 If you take derivative of an annotated regular expression, |
1201 $a_{i+1}$. |
1172 you can $\retrieve$ the same bit-codes as before the derivative took place, |
1202 The property says that one |
1173 provided that you use the corresponding value: |
1203 can retrieve the same bits from a derivative |
1174 |
1204 regular expression as those from |
|
1205 before the derivative took place, |
|
1206 provided that the right values are used respectively: |
1175 \begin{lemma}\label{retrieveStepwise} |
1207 \begin{lemma}\label{retrieveStepwise} |
1176 $\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
1208 $\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
1177 \end{lemma} |
1209 \end{lemma} |
1178 \begin{proof} |
1210 \begin{proof} |
1179 By induction on $r$, where $v$ is allowed to be arbitrary. |
1211 By induction on $r$, where $v$ is allowed to be arbitrary. |
1180 The induction principle is function $\erase$'s cases. |
1212 The induction principle is function $\erase$'s cases. |
1181 \end{proof} |
1213 \end{proof} |
1182 \noindent |
1214 \noindent |
1183 $\retrieve$ is connected to the $\blexer$ in the following way: |
1215 Before we move on to the next |
|
1216 helper function, we rewrite $\blexer$ in |
|
1217 the following way which makes later proofs more convenient: |
1184 \begin{lemma}\label{blexer_retrieve} |
1218 \begin{lemma}\label{blexer_retrieve} |
1185 $\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$ |
1219 $\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$ |
1186 \end{lemma} |
1220 \end{lemma} |
1187 \noindent |
1221 \begin{proof} |
1188 $\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regular expressiones of $\blexer$. |
1222 Using lemma \ref{bmkepsRetrieve}. |
1189 For plain regular expressions something similar is required as well. |
1223 \end{proof} |
|
1224 \noindent |
|
1225 $\retrieve$ allows connecting |
|
1226 between the intermediate |
|
1227 results $a_i$ and $a_{i+1}$ in $\blexer$. |
|
1228 For plain regular expressions something similar is required. |
1190 |
1229 |
1191 \subsection{$\flex$} |
1230 \subsection{$\flex$} |
1192 Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$, |
1231 Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$, |
1193 defined as |
1232 defined as |
1194 \begin{center} |
1233 \begin{center} |
1195 \begin{tabular}{lcr} |
1234 \begin{tabular}{lcl} |
1196 $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ |
1235 $\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\ |
1197 $\flex \; r \; f \; c :: s \; v$ & $=$ & $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$ |
1236 $\flex \; r \; f \; (c :: s) \; v$ & $=$ & $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$ |
1198 \end{tabular} |
1237 \end{tabular} |
1199 \end{center} |
1238 \end{center} |
1200 which accumulates the characters that needs to be injected back, |
1239 which accumulates the characters that need to be injected back, |
1201 and does the injection in a stack-like manner (last taken derivative first injected). |
1240 and does the injection in a stack-like manner (last taken derivative first injected). |
1202 $\flex$ is connected to the $\lexer$: |
1241 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression |
|
1242 $r$, the identity function $id$, the input string $s$ and the value |
|
1243 $v_n= \mkeps \; (r\backslash s)$: |
1203 \begin{lemma} |
1244 \begin{lemma} |
1204 $\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$ |
1245 $\flex \;\; r \;\; \textit{id}\;\; s \;\; (\mkeps \;(r\backslash s)) = \lexer \; r \; s$ |
1205 \end{lemma} |
1246 \end{lemma} |
1206 \begin{proof} |
1247 \begin{proof} |
1207 By reverse induction on $s$. |
1248 By reverse induction on $s$. |
1208 \end{proof} |
1249 \end{proof} |
1209 $\flex$ provides us a bridge between $\lexer$'s intermediate steps. |
1250 \noindent |
1210 What is even better about $\flex$ is that it allows us to |
1251 The main advantage of using $\flex$ |
1211 directly operate on the value $\mkeps (r\backslash v)$, |
1252 in addition to $\lexer$ is that |
1212 which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument. |
1253 $\flex$ makes the value $v$ and function $f$ |
1213 When the value created by $\mkeps$ becomes available, one can |
1254 that manipulates the value explicit parameters, |
1214 prove some stepwise properties of lexing nicely: |
1255 which ``exposes'' $\lexer$'s recursive call |
|
1256 arguments and allows us to ``set breakpoints'' and ``resume'' |
|
1257 at any point during $\lexer$'s recursive calls.\\ |
|
1258 The following stepwise property holds. |
1215 \begin{lemma}\label{flexStepwise} |
1259 \begin{lemma}\label{flexStepwise} |
1216 $\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $ |
1260 $\textit{flex} \;\; r \;\; f \;\; (s@[c]) \;\; v = \flex \;\; r \;\; f \;\; s \;\; (\inj \; (r\backslash s) \; c \; v) $ |
1217 \end{lemma} |
1261 \end{lemma} |
1218 \begin{proof} |
1262 \begin{proof} |
1219 By induction on the shape of $r\backslash s$ |
1263 By induction on the shape of $r\backslash s$. |
1220 \end{proof} |
1264 \end{proof} |
1221 \noindent |
1265 \noindent |
1222 With $\flex$ and $\retrieve$ ready, we are ready to connect $\lexer$ and $\blexer$ . |
1266 With $\flex$ and $\retrieve$, |
1223 |
1267 we are ready to connect $\lexer$ and $\blexer$, |
1224 \subsection{Correctness Proof of Bit-coded Algorithm} |
1268 giving the correctness proof. |
|
1269 |
|
1270 %---------------------------------------------------------------------------------------- |
|
1271 % SECTION correctness proof |
|
1272 %---------------------------------------------------------------------------------------- |
|
1273 \section{Correctness of Bit-coded Algorithm (Without Simplification)} |
|
1274 $\flex$ and $\blexer$ essentially calculates the same thing. |
1225 \begin{lemma}\label{flex_retrieve} |
1275 \begin{lemma}\label{flex_retrieve} |
|
1276 If $\vdash v: (r\backslash s)$, then\\ |
1226 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
1277 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
1227 \end{lemma} |
1278 \end{lemma} |
1228 \begin{proof} |
1279 \begin{proof} |
1229 By induction on $s$. The induction tactic is reverse induction on strings. |
1280 By induction on $s$. |
1230 $v$ is allowed to be arbitrary. |
1281 We prove the interesting case where |
|
1282 both $\flex$ and $\decode$ successfully terminates |
|
1283 with some value. |
|
1284 We take advantage of the stepwise properties |
|
1285 both sides. |
|
1286 The induction tactic is reverse induction on string $s$. |
|
1287 The inductive hypothesis says that $\flex \; r \; id \;s \; v = |
|
1288 \decode \; (\retrieve \; (r\backslash s) \; v) \; r$ holds, |
|
1289 where $v$ can be any value satisfying |
|
1290 the assumption $\vdash v: (r\backslash s)$. |
1231 The crucial point is to rewrite |
1291 The crucial point is to rewrite |
1232 \[ |
1292 \[ |
1233 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) |
1293 \retrieve \;\; (r \backslash (s@[c])) \;\; (\mkeps\; (r \backslash (s@[c]) )) |
1234 \] |
1294 \] |
1235 as |
1295 as |
1236 \[ |
1296 \[ |
1237 \retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash s@[c])) |
1297 \retrieve \;\; (r \backslash s) |
1238 \]. |
1298 \;\; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash (s@[c]))) |
|
1299 \] |
|
1300 using lemma \ref{retrieveStepwise}. |
1239 This enables us to equate |
1301 This enables us to equate |
1240 \[ |
1302 \[ |
1241 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) |
1303 \retrieve \; (r \backslash (s@[c])) \; (\mkeps \; (r \backslash (s@[c]) )) |
1242 \] |
1304 \] |
1243 with |
1305 with |
1244 \[ |
1306 \[ |
1245 \flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c]))) |
1307 \flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c]))) |
1246 \], |
1308 \] |
|
1309 using IH, |
1247 which in turn can be rewritten as |
1310 which in turn can be rewritten as |
1248 \[ |
1311 \[ |
1249 \flex \; r \; \textit{id} \; s@[c] \; (\mkeps (r\backslash s@[c])) |
1312 \flex \; r \; \textit{id} \; (s@[c]) \; (\mkeps \; (r \backslash (s@[c]))) |
1250 \]. |
1313 \]. |
1251 \end{proof} |
1314 \end{proof} |
1252 |
1315 \noindent |
1253 With the above lemma we can now link $\flex$ and $\blexer$. |
1316 With this pivotal lemma we can now link $\flex$ and $\blexer$ |
1254 |
1317 and finally give the correctness of $\blexer$--it outputs the same result as $\lexer$: |
1255 %---------------------------------------------------------------------------------------- |
|
1256 % SECTION correctness proof |
|
1257 %---------------------------------------------------------------------------------------- |
|
1258 \section{Correctness of Bit-coded Algorithm (Without Simplification)} |
|
1259 We now give the proof the correctness of the algorithm with bit-codes. |
|
1260 |
|
1261 \begin{lemma}\label{flex_blexer} |
|
1262 $\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$ |
|
1263 \end{lemma} |
|
1264 \begin{proof} |
|
1265 Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}. |
|
1266 \end{proof} |
|
1267 Finally the correctness of $\blexer$ is given as it outputs the same result as $\lexer$: |
|
1268 \begin{theorem} |
1318 \begin{theorem} |
1269 $\blexer\; r \; s = \lexer \; r \; s$ |
1319 $\blexer\; r \; s = \lexer \; r \; s$ |
1270 \end{theorem} |
1320 \end{theorem} |
1271 \begin{proof} |
1321 \begin{proof} |
1272 Straightforward corollary of \ref{flex_blexer}. |
1322 From \ref{flex_retrieve} we have that |
|
1323 $\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$, |
|
1324 which immediately implies this theorem. |
1273 \end{proof} |
1325 \end{proof} |
1274 \noindent |
1326 \noindent |
1275 To piece things together and spell out the exact correctness |
1327 To piece things together and spell out the correctness |
1276 of the bitcoded lexer |
1328 result of the bitcoded lexer more explicitly, |
1277 in terms of producing POSIX values, |
|
1278 we use the fact from the previous chapter that |
1329 we use the fact from the previous chapter that |
1279 \[ |
1330 \[ |
1280 If \; (r, s) \rightarrow v \; then \; \lexer \; r \; s = v |
1331 (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s = v |
1281 \] |
1332 \] |
1282 to obtain this corollary: |
1333 to obtain this corollary: |
1283 \begin{corollary}\label{blexerPOSIX} |
1334 \begin{corollary}\label{blexerPOSIX} |
1284 $If \; (r, s) \rightarrow v \; then \blexer \; r \; s = v$ |
1335 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = v$ |
1285 \end{corollary} |
1336 \end{corollary} |
1286 Our main reason for wanting a bit-coded algorithm over |
1337 Our main reason for wanting a bit-coded algorithm over |
1287 the injection-based one is for its capabilities of allowing |
1338 the injection-based one is for its capabilities of allowing |
1288 more aggressive simplifications. |
1339 more aggressive simplifications. |
1289 We will elaborate on this in the next chapter. |
1340 We will elaborate on this in the next chapter. |