ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
changeset 582 3e19073e91f4
parent 581 9db2500629be
child 585 4969ef817d92
equal deleted inserted replaced
581:9db2500629be 582:3e19073e91f4
  1102 
  1102 
  1103 \end{figure}
  1103 \end{figure}
  1104 \vspace{20mm}
  1104 \vspace{20mm}
  1105 \noindent
  1105 \noindent
  1106 We encircle in the diagram  all the pairs $v_i, a_i$ to show that these values
  1106 We encircle in the diagram  all the pairs $v_i, a_i$ to show that these values
  1107 and regular expressions correspond to each other.
  1107 and regular expressions share the same structure.
  1108 For the leftmost pair, we have that the 
  1108 These pairs all contain adequate information to 
       
  1109 obtain the final lexing result.
       
  1110 For example, in the leftmost pair the 
  1109 lexical information is condensed in 
  1111 lexical information is condensed in 
  1110 $v_0$--the value part, whereas for the rightmost pair,
  1112 $v_0$, whereas for the rightmost pair,
  1111 the lexing result is in the bitcodes of $a_n$.
  1113 the lexing result hides is in the bitcodes of $a_n$.\\
  1112 $\bmkeps$ is able to extract from $a_n$ the result
  1114 $\bmkeps$ is able to extract from $a_n$ the result
  1113 by looking for nullable parts of the regular expression,
  1115 by looking for nullable parts of the regular expression.
  1114 however for regular expressions $a_i$ in general
  1116 However for regular expressions $a_i$ in general,
  1115 they might not necessarily be nullable and therefore
  1117 they might not necessarily be nullable.
  1116 needs some ``help'' finding the POSIX bit-encoding.
  1118 For those regular expressions that are not nullable,
       
  1119 $\bmkeps$ will not work. 
       
  1120 Therefore they need some additional ``help'' 
       
  1121 finding the POSIX bit-encoding.
  1117 The most straightforward ``help'' comes from $a_i$'s corresponding
  1122 The most straightforward ``help'' comes from $a_i$'s corresponding
  1118 value $v_i$, and this suggests a function $f$ satisfying the
  1123 value $v_i$, and this suggests a function $f$ satisfying the
  1119 following properties:
  1124 following properties:
  1120 \begin{itemize}
  1125 \begin{itemize}
  1121 	\item
  1126 	\item
  1122 		$f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
  1127 		$f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
  1123 	\item
  1128 	\item
  1124 		$f \; a_i\;v_i = f \; a_0 \; v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
  1129 		$f \; a_i\;v_i = f \; a_0 \; v_0 = \code \; v_0$ %\decode \;(\code \; v_0) \; (\erase \; a_0)$
  1125 \end{itemize}
  1130 \end{itemize}
  1126 \noindent
  1131 \noindent
  1127 If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$,
  1132 Sulzmann and Lu came up with a function satisfying 
  1128 The core of the function $f$ is something that produces the bitcodes
  1133 the above
  1129 $\code \; v_0$.
       
  1130 It is unclear how, but Sulzmann and Lu came up with a function satisfying all the above
       
  1131 requirements, named \emph{retrieve}:
  1134 requirements, named \emph{retrieve}:
  1132 
  1135 \begin{center}
  1133 
  1136 \begin{tabular}{llcl}
  1134 
  1137 	$\retrieve \; \, _{bs} \ONE$ & $   \Empty$ & $\dn$ & $\textit{bs}$\\
  1135 \begin{center}
  1138 
  1136 \begin{tabular}{lcr}
  1139 	$\retrieve \; \, _{bs} \mathbf{c}$ & $ (\Char \; c) $ & $\dn$ & $ \textit{bs}$\\
  1137 	$\retrieve \; \, (_{bs} \ONE) \; \, (\Empty)$ & $\dn$ & $\textit{bs}$\\
  1140 
  1138 	$\retrieve \; \, (_{bs} \mathbf{c} ) \; \Char(c)$ & $\dn$ & $ \textit{bs}$\\
  1141 	$\retrieve \; \, _{bs} a_1 \cdot a_2$ & $   (\Seq \; v_1 \; v_2) $ & 
  1139 	$\retrieve \; \, (_{bs} a_1 \cdot a_2) \; \Seq(v_1, v_2)$ & $\dn$ &  $\textit{bs} @ (\retrieve \; a_1\; v_1) @ (\retrieve \; a_2 \; v_2)$\\
  1142 	$\dn$ &  $\textit{bs} \; @\;  (\retrieve \; a_1\; v_1)\; @ \;(\retrieve \; a_2 \; v_2)$\\
  1140 	$\retrieve \; \, (_{bs} \Sigma (a :: \textit{as}) \; \,\Left(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v)$\\
  1143 
  1141 	$\retrieve \; \, (_{bs} \Sigma (a :: \textit{as} \; \, \Right(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; (_{[]}\Sigma \textit{as}) \; v)$\\
  1144 	$\retrieve \; \, _{bs} \Sigma (a :: \textit{as})$ & $   (\Left \; v) $ & 
  1142 	$\retrieve \; \, (_{bs} a^*) \; \, (\Stars(v :: vs)) $ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v) @ (\retrieve \; (_{[]} a^*) \; (\Stars(vs)))$\\
  1145 	$\dn$ & $\textit{bs}\; @\; (\retrieve \; a \; v)$\\
  1143 	$\retrieve \; \, (_{bs} a^*) \; \, (\Stars([]) $ & $\dn$ & $\textit{bs} @ [Z]$
  1146 
       
  1147 	$\retrieve \; \, _{bs} \Sigma (a :: \textit{as})$ & $   (\Right \; v) $ & 
       
  1148 	$\dn$ & $\textit{bs}\; @\; (\retrieve \; (_{[]}\Sigma \textit{as}) \; v)$\\
       
  1149 
       
  1150 	$\retrieve \; \, _{bs} a^* $ & $   (\Stars \; (v :: vs)) $ & $\dn$ & 
       
  1151 	$\textit{bs}\; @\; (\retrieve \; a \; v)\; @ \; 
       
  1152 	(\retrieve \; (_{[]} a^*) \; (\Stars \; vs) )$\\
       
  1153 
       
  1154 	$\retrieve \; \, _{bs} a^* $ & $  (\Stars \; []) $ & $\dn$ & $\textit{bs} \; @ \; [Z]$
  1144 \end{tabular}
  1155 \end{tabular}
  1145 \end{center}
  1156 \end{center}
  1146 \noindent
  1157 \noindent
  1147 As promised, $\retrieve$ collects the right bit-codes from the 
  1158 As promised, $\retrieve$ collects the right bit-codes from the 
  1148 final derivative $a_n$:
  1159 final derivative $a_n$, guided by $v_n$:
  1149 \begin{lemma}
  1160 \begin{lemma}\label{bmkepsRetrieve}
  1150 	$\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$
  1161 	$\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$
  1151 \end{lemma}
  1162 \end{lemma}
  1152 \begin{proof}
  1163 \begin{proof}
  1153 	By a routine induction on $a$.
  1164 	By a routine induction on $a$.
  1154 \end{proof}
  1165 \end{proof}
  1155 The design of $\retrieve$ enables extraction of bit-codes
  1166 \noindent
  1156 from not only $\bnullable$ (annotated) regular expressions,
  1167 The design of $\retrieve$ enables us to extract bitcodes
  1157 but also those that are not $\bnullable$.
  1168 from both annotated regular expressions and values.
  1158 For example, if we have the regular expression just internalised
  1169 $\retrieve$ always ``sucks up'' all the information.
       
  1170 When more information is stored in the value, we would be able to
       
  1171 extract more from the value, and vice versa.
       
  1172 For example in star iterations, $\retrieve$ will be able to extract from $\Stars \; vs$
       
  1173 exactly the same bitcodes as $\code \; (\Stars \; vs)$:
       
  1174 \begin{lemma}
       
  1175   If $\forall v \in vs. \vdash v : r$, and  $\code \; v = \retrieve \; (\internalise\; r) \; v$\\
       
  1176   then $\code \; (\Stars \; vs) = \retrieve \; _{[]}((\internalise \;  r)^*) \; (\Stars \; vs)$
       
  1177 \end{lemma}\label{retrieveEncodeSTARS}
       
  1178 \begin{proof}
       
  1179 	By induction on the value list $vs$.
       
  1180 \end{proof}
       
  1181 \noindent
       
  1182 Similarly, when the value list does not hold information,
       
  1183 only the bitcodes plus some delimiter will be recorded:
       
  1184 \begin{center}
       
  1185   $\retrieve \; _{bs}((\internalise \;  r)^*) \; (\Stars \; [] )  = bs @ [S]$.
       
  1186 \end{center}
       
  1187 In general, if we have a regular expression just internalised
  1159 and the lexing result value, we could $\retrieve$ the bitcdoes
  1188 and the lexing result value, we could $\retrieve$ the bitcdoes
  1160 that look as if we have en$\code$-ed the value:
  1189 that look as if we have en$\code$-ed the value as bitcodes:
  1161 \begin{lemma}
  1190 \begin{lemma}
  1162 	$\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$
  1191 	$\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$
  1163 \end{lemma}
  1192 \end{lemma}
  1164 \begin{proof}
  1193 \begin{proof}
  1165 	By induction on $r$.
  1194 	By induction on $r$.
       
  1195 	The star case uses lemma \ref{retrieveEncodeSTARS}.
  1166 \end{proof}
  1196 \end{proof}
  1167 \noindent
  1197 \noindent
  1168 The following property is more interesting, as
  1198 The following property is more interesting, as
  1169 it provides a "bridge" between $a_0, v_0$ and $a_n, v_n$ in the
  1199 it provides a "bridge" between $a_0$ and $a_n $ in the
  1170 lexing diagram.
  1200 lexing diagram by linking adjacent regular expressions $a_i$ and
  1171 If you take derivative of an annotated regular expression, 
  1201 $a_{i+1}$.
  1172 you can $\retrieve$ the same bit-codes as before the derivative took place,
  1202 The property says that one 
  1173 provided that you use the corresponding value:
  1203 can retrieve the same bits from a derivative 
  1174 
  1204 regular expression as those from 
       
  1205 before the derivative took place,
       
  1206 provided that the right values are used respectively:
  1175 \begin{lemma}\label{retrieveStepwise}
  1207 \begin{lemma}\label{retrieveStepwise}
  1176 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
  1208 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
  1177 \end{lemma}
  1209 \end{lemma}
  1178 \begin{proof}
  1210 \begin{proof}
  1179 	By induction on $r$, where $v$ is allowed to be arbitrary.
  1211 	By induction on $r$, where $v$ is allowed to be arbitrary.
  1180 	The induction principle is function $\erase$'s cases.
  1212 	The induction principle is function $\erase$'s cases.
  1181 \end{proof}
  1213 \end{proof}
  1182 \noindent
  1214 \noindent
  1183 $\retrieve$ is connected to the $\blexer$ in the following way:
  1215 Before we move on to the next
       
  1216 helper function, we rewrite $\blexer$ in
       
  1217 the following way which makes later proofs more convenient:
  1184 \begin{lemma}\label{blexer_retrieve}
  1218 \begin{lemma}\label{blexer_retrieve}
  1185 $\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
  1219 $\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
  1186 \end{lemma}
  1220 \end{lemma}
  1187 \noindent
  1221 \begin{proof}
  1188 $\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regular expressiones of $\blexer$.
  1222 	Using lemma \ref{bmkepsRetrieve}.
  1189 For plain regular expressions something similar is required as well.
  1223 \end{proof}
       
  1224 \noindent
       
  1225 $\retrieve$ allows connecting
       
  1226 between the intermediate 
       
  1227 results $a_i$ and $a_{i+1}$ in $\blexer$.
       
  1228 For plain regular expressions something similar is required.
  1190 
  1229 
  1191 \subsection{$\flex$}
  1230 \subsection{$\flex$}
  1192 Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$,
  1231 Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$,
  1193 defined as
  1232 defined as
  1194 \begin{center}
  1233 \begin{center}
  1195 \begin{tabular}{lcr}
  1234 \begin{tabular}{lcl}
  1196 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
  1235 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
  1197 $\flex \; r \; f \; c :: s \; v$ &  $=$ &   $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$
  1236 $\flex \; r \; f \; (c :: s) \; v$ &  $=$ &   $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$
  1198 \end{tabular}
  1237 \end{tabular}
  1199 \end{center}
  1238 \end{center}
  1200 which accumulates the characters that needs to be injected back, 
  1239 which accumulates the characters that need to be injected back, 
  1201 and does the injection in a stack-like manner (last taken derivative first injected).
  1240 and does the injection in a stack-like manner (last taken derivative first injected).
  1202 $\flex$ is connected to the $\lexer$:
  1241 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression
       
  1242 $r$, the identity function $id$, the input string $s$ and the value
       
  1243 $v_n= \mkeps \; (r\backslash s)$:
  1203 \begin{lemma}
  1244 \begin{lemma}
  1204 $\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
  1245 	$\flex \;\; r \;\; \textit{id}\;\; s \;\; (\mkeps \;(r\backslash s)) = \lexer \; r \; s$
  1205 \end{lemma}
  1246 \end{lemma}
  1206 \begin{proof}
  1247 \begin{proof}
  1207 	By reverse induction on $s$.
  1248 	By reverse induction on $s$.
  1208 \end{proof}
  1249 \end{proof}
  1209 $\flex$ provides us a bridge between $\lexer$'s intermediate steps.
  1250 \noindent
  1210 What is even better about $\flex$ is that it allows us to 
  1251 The main advantage of using $\flex$
  1211 directly operate on the value $\mkeps (r\backslash v)$,
  1252 in addition to $\lexer$ is that
  1212 which is pivotal in the definition of  $\lexer $ and $\blexer$, but not visible as an argument.
  1253 $\flex$ makes the value $v$ and function $f$
  1213 When the value created by $\mkeps$ becomes available, one can 
  1254 that manipulates the value  explicit parameters,
  1214 prove some stepwise properties of lexing nicely:
  1255 which ``exposes'' $\lexer$'s recursive call
       
  1256 arguments and allows us to ``set breakpoints'' and ``resume''
       
  1257 at any point during $\lexer$'s recursive calls.\\
       
  1258 The following stepwise property holds. 
  1215 \begin{lemma}\label{flexStepwise}
  1259 \begin{lemma}\label{flexStepwise}
  1216 $\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
  1260 	$\textit{flex} \;\; r \;\; f \;\; (s@[c]) \;\; v = \flex \;\; r \;\; f  \;\; s \;\; (\inj \; (r\backslash s) \; c \; v) $
  1217 \end{lemma}
  1261 \end{lemma}
  1218 \begin{proof}
  1262 \begin{proof}
  1219 	By induction on the shape of $r\backslash s$
  1263 	By induction on the shape of $r\backslash s$.
  1220 \end{proof}
  1264 \end{proof}
  1221 \noindent
  1265 \noindent
  1222 With $\flex$ and $\retrieve$ ready, we are ready to connect $\lexer$ and $\blexer$ .
  1266 With $\flex$ and $\retrieve$, 
  1223 
  1267 we are ready to connect $\lexer$ and $\blexer$,
  1224 \subsection{Correctness Proof of Bit-coded Algorithm}
  1268 giving the correctness proof.
       
  1269 
       
  1270 %----------------------------------------------------------------------------------------
       
  1271 %	SECTION  correctness proof
       
  1272 %----------------------------------------------------------------------------------------
       
  1273 \section{Correctness of Bit-coded Algorithm (Without Simplification)}
       
  1274 $\flex$ and $\blexer$ essentially calculates the same thing.
  1225 \begin{lemma}\label{flex_retrieve}
  1275 \begin{lemma}\label{flex_retrieve}
       
  1276 	If $\vdash v: (r\backslash s)$, then\\
  1226 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
  1277 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
  1227 \end{lemma}
  1278 \end{lemma}
  1228 \begin{proof}
  1279 \begin{proof}
  1229 By induction on $s$. The induction tactic is reverse induction on strings.
  1280 By induction on $s$. 
  1230 $v$ is allowed to be arbitrary.
  1281 We prove the interesting case where
       
  1282 both $\flex$ and $\decode$ successfully terminates
       
  1283 with some value.
       
  1284 We take advantage of the stepwise properties  
       
  1285 both sides.
       
  1286 The induction tactic is reverse induction on string $s$.
       
  1287 The inductive hypothesis says that $\flex \; r \; id \;s \; v =
       
  1288 \decode \; (\retrieve \; (r\backslash s) \; v) \; r$ holds,
       
  1289 where $v$ can be any value satisfying 
       
  1290 the assumption $\vdash v: (r\backslash s)$.
  1231 The crucial point is to rewrite 
  1291 The crucial point is to rewrite 
  1232 \[
  1292 \[
  1233 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
  1293 	\retrieve \;\; (r \backslash (s@[c])) \;\; (\mkeps\; (r \backslash (s@[c]) ))
  1234 \]
  1294 \]
  1235 as
  1295 as
  1236 \[
  1296 \[
  1237 \retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\;  \mkeps (r \backslash s@[c]))
  1297 	\retrieve \;\; (r \backslash s) 
  1238 \].
  1298 	\;\; (\inj \; (r \backslash s) \; c\;  \mkeps (r \backslash (s@[c])))
       
  1299 \]
       
  1300 using lemma \ref{retrieveStepwise}.
  1239 This enables us to equate 
  1301 This enables us to equate 
  1240 \[
  1302 \[
  1241 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
  1303 	\retrieve \; (r \backslash (s@[c])) \; (\mkeps \; (r \backslash (s@[c]) ))
  1242 \] 
  1304 \] 
  1243 with 
  1305 with 
  1244 \[
  1306 \[
  1245 \flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))
  1307 \flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))
  1246 \],
  1308 \]
       
  1309 using IH,
  1247 which in turn can be rewritten as
  1310 which in turn can be rewritten as
  1248 \[
  1311 \[
  1249 \flex \; r \; \textit{id} \; s@[c] \;  (\mkeps (r\backslash s@[c]))
  1312 	\flex \; r \; \textit{id} \; (s@[c]) \;  (\mkeps \; (r \backslash (s@[c])))
  1250 \].
  1313 \].
  1251 \end{proof}
  1314 \end{proof}
  1252 
  1315 \noindent
  1253 With the above lemma we can now link $\flex$ and $\blexer$.
  1316 With this pivotal lemma we can now link $\flex$ and $\blexer$
  1254 
  1317 and finally give the correctness of $\blexer$--it outputs the same result as $\lexer$:
  1255 %----------------------------------------------------------------------------------------
       
  1256 %	SECTION  correctness proof
       
  1257 %----------------------------------------------------------------------------------------
       
  1258 \section{Correctness of Bit-coded Algorithm (Without Simplification)}
       
  1259 We now give the proof the correctness of the algorithm with bit-codes.
       
  1260 
       
  1261 \begin{lemma}\label{flex_blexer}
       
  1262 $\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$
       
  1263 \end{lemma}
       
  1264 \begin{proof}
       
  1265 Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
       
  1266 \end{proof}
       
  1267 Finally the correctness of $\blexer$ is given as it outputs the same result as $\lexer$:
       
  1268 \begin{theorem}
  1318 \begin{theorem}
  1269 	$\blexer\; r \; s = \lexer \; r \; s$
  1319 	$\blexer\; r \; s = \lexer \; r \; s$
  1270 \end{theorem}
  1320 \end{theorem}
  1271 \begin{proof}
  1321 \begin{proof}
  1272 	Straightforward corollary of \ref{flex_blexer}.
  1322 	From \ref{flex_retrieve} we have that 
       
  1323 	$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$,
       
  1324 	which immediately implies this theorem.
  1273 \end{proof}
  1325 \end{proof}
  1274 \noindent
  1326 \noindent
  1275 To piece things together and spell out the exact correctness
  1327 To piece things together and spell out the correctness
  1276 of the bitcoded lexer
  1328 result of the bitcoded lexer more explicitly,
  1277 in terms of producing POSIX values,
       
  1278 we use the fact from the previous chapter that
  1329 we use the fact from the previous chapter that
  1279 \[
  1330 \[
  1280 	If \; (r, s) \rightarrow v \; then \; \lexer \; r \; s = v
  1331 	 (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s = v
  1281 \]
  1332 \]
  1282 to obtain this corollary:
  1333 to obtain this corollary:
  1283 \begin{corollary}\label{blexerPOSIX}
  1334 \begin{corollary}\label{blexerPOSIX}
  1284 	$If \; (r, s) \rightarrow v \; then \blexer \; r \; s = v$
  1335 	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = v$
  1285 \end{corollary}
  1336 \end{corollary}
  1286 Our main reason for wanting a bit-coded algorithm over 
  1337 Our main reason for wanting a bit-coded algorithm over 
  1287 the injection-based one is for its capabilities of allowing
  1338 the injection-based one is for its capabilities of allowing
  1288 more aggressive simplifications.
  1339 more aggressive simplifications.
  1289 We will elaborate on this in the next chapter.
  1340 We will elaborate on this in the next chapter.