ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 623 c0c1ebe09c7d
parent 622 4b1149fb5aec
child 626 1c8525061545
equal deleted inserted replaced
622:4b1149fb5aec 623:c0c1ebe09c7d
   982 	the empty string , by always selecting the leftmost 
   982 	the empty string , by always selecting the leftmost 
   983 	nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they 
   983 	nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they 
   984 	appeared in the string, always preserving POSIXness.}\label{graph:inj}
   984 	appeared in the string, always preserving POSIXness.}\label{graph:inj}
   985 \end{figure}
   985 \end{figure}
   986 \noindent
   986 \noindent
   987 $\textit{inj}$ takes three arguments: a regular
   987 The function $\textit{inj}$ as defined by Sulzmann and Lu
       
   988 takes three arguments: a regular
   988 expression ${r_{i}}$, before the character is chopped off, 
   989 expression ${r_{i}}$, before the character is chopped off, 
   989 a character ${c_{i}}$, the character we want to inject back and 
   990 a character ${c_{i}}$ (the character we want to inject back) and 
   990 the third argument $v_{i+1}$ the value we want to inject into. 
   991 the third argument $v_{i+1}$ the value we want to inject into. 
   991 The result of an application 
   992 The result of an application 
   992 $\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
   993 $\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
   993 \[
   994 \[
   994 	(s_i, r_i) \rightarrow v_i
   995 	(s_i, r_i) \rightarrow v_i
  1008   $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
  1009   $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
  1009 \end{tabular}
  1010 \end{tabular}
  1010 \end{center}
  1011 \end{center}
  1011 
  1012 
  1012 \noindent 
  1013 \noindent 
  1013 The function does a recursion on 
  1014 The function recurses on 
  1014 the shape of regular
  1015 the shape of regular
  1015 expression $r_i$ and value $v_{i+1}$. 
  1016 expressionsw  and values.
  1016 Intuitively, each clause analyses 
  1017 Intuitively, each clause analyses 
  1017 how $r_i$ could have transformed when being 
  1018 how $r_i$ could have transformed when being 
  1018 derived by $c$, identifying which subpart
  1019 derived by $c$, identifying which subpart
  1019 of $v_{i+1}$ has the ``hole'' 
  1020 of $v_{i+1}$ has the ``hole'' 
  1020 to inject the character back into.
  1021 to inject the character back into.
  1021 Once the character is
  1022 Once the character is
  1022 injected back to that sub-value; 
  1023 injected back to that sub-value; 
  1023 $\inj$ assembles all things together
  1024 $\inj$ assembles all parts together
  1024 to form a new value.
  1025 to form a new value.
  1025 
  1026 
  1026 For instance, the last clause is an
  1027 For instance, the last clause is an
  1027 injection into a sequence value $v_{i+1}$
  1028 injection into a sequence value $v_{i+1}$
  1028 whose second child
  1029 whose second child
  1052 to the previous list of iterations, and then
  1053 to the previous list of iterations, and then
  1053 wrapped under the $\Stars$
  1054 wrapped under the $\Stars$
  1054 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
  1055 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
  1055 
  1056 
  1056 Recall that lemma 
  1057 Recall that lemma 
  1057 \ref{mePosix} tells us
  1058 \ref{mePosix} tells us that
  1058 $\mkeps$ always selects the POSIX matching among
  1059 $\mkeps$ always generates the POSIX value.
  1059 multiple values that flatten to the empty string.
  1060 The function $\inj$ preserves the POSIXness, provided
  1060 Now $\inj$ preserves the POSIXness, provided
  1061 the value before injection is POSIX, namely
  1061 the value before injection is POSIX:
       
  1062 \begin{lemma}\label{injPosix}
  1062 \begin{lemma}\label{injPosix}
  1063 	If
  1063 	If$(r \backslash c, s) \rightarrow v $,
  1064 	\[
  1064 	then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
  1065 		(r \backslash c, s) \rightarrow v 
       
  1066 	\]
       
  1067 	then
       
  1068 	\[
       
  1069 		(r, c :: s) \rightarrow (\inj r \; c\; v).
       
  1070 	\]
       
  1071 \end{lemma}
  1065 \end{lemma}
  1072 \begin{proof}
  1066 \begin{proof}
  1073 	By induction on $r$.
  1067 	By induction on $r$.
  1074 	The involved cases are sequence and star.
  1068 	The non-trivial cases are sequence and star.
  1075 	When $r = a \cdot b$, there could be
  1069 	When $r = a \cdot b$, there can be
  1076 	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
  1070 	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
  1077 	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
  1071 	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
  1078 	case.
  1072 	case.
  1079 	\begin{itemize}
  1073 	\begin{itemize}
  1080 		\item
  1074 		\item
  1159 			as the POSIX value for $a\cdot b$.
  1153 			as the POSIX value for $a\cdot b$.
  1160 	\end{itemize}
  1154 	\end{itemize}
  1161 	The star case can be proven similarly.
  1155 	The star case can be proven similarly.
  1162 \end{proof}
  1156 \end{proof}
  1163 \noindent
  1157 \noindent
  1164 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together
  1158 Putting all together, Sulzmann and Lu obtained the following algorithm
  1165 by following the procedure outlined in the diagram \ref{graph:inj},
  1159 outlined in the diagram \ref{graph:inj}:
  1166 and taking into consideration the possibility of a non-match,
       
  1167 a lexer can be built with the following recursive definition:
       
  1168 \begin{center}
  1160 \begin{center}
  1169 \begin{tabular}{lcl}
  1161 \begin{tabular}{lcl}
  1170 	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
  1162 	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
  1171 	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
  1163 	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
  1172 	& & $\quad \phantom{\mid}\; \None \implies \None$\\
  1164 	& & $\quad \phantom{\mid}\; \None \implies \None$\\
  1173 	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
  1165 	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
  1174 \end{tabular}
  1166 \end{tabular}
  1175 \end{center}
  1167 \end{center}
  1176 \noindent
  1168 \noindent
  1177 The central property of the $\lexer$ is that it gives the correct result by
  1169 The central property of the $\lexer$ is that it gives the correct result
  1178 $\POSIX$ standards:
  1170 according to
       
  1171 POSIX rules. 
  1179 \begin{theorem}\label{lexerCorrectness}
  1172 \begin{theorem}\label{lexerCorrectness}
  1180 	The $\lexer$ based on derivatives and injections is correct: 
  1173 	The $\lexer$ based on derivatives and injections is correct: 
  1181 	\begin{center}
  1174 	\begin{center}
  1182 		\begin{tabular}{lcl}
  1175 		\begin{tabular}{lcl}
  1183 			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
  1176 			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
  1184 			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
  1177 			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
  1185 		\end{tabular}
  1178 		\end{tabular}
  1186 	\end{center}
  1179 	\end{center}
  1187 \end{theorem} 
  1180 \end{theorem} 
  1188 \begin{proof}
  1181 \begin{proof}
  1189 By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
  1182 By induction on $s$. $r$ generalising over an arbitrary regular expression.
  1190 The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
  1183 The $[]$ case is proven by  an application of lemma \ref{mePosix}, and the inductive case
  1191 by lemma \ref{injPosix}.
  1184 by lemma \ref{injPosix}.
  1192 \end{proof}
  1185 \end{proof}
  1193 \noindent
  1186 \noindent
  1194 As we did earlier in this chapter on the matcher, one can 
  1187 As we did earlier in this chapter with the matcher, one can 
  1195 introduce simplification on the regular expression.
  1188 introduce simplification on the regular expression in each derivative step.
  1196 However, now one needs to do a backward phase and make sure
  1189 However, now one needs to do a backward phase and make sure
  1197 the values align with the regular expressions.
  1190 the values align with the regular expression.
  1198 Therefore one has to
  1191 Therefore one has to
  1199 be careful not to break the correctness, as the injection 
  1192 be careful not to break the correctness, as the injection 
  1200 function heavily relies on the structure of the regular expressions and values
  1193 function heavily relies on the structure of 
  1201 being correct and matching each other.
  1194 the regular expressions and values being aligned.
  1202 It can be achieved by recording some extra rectification functions
  1195 This can be achieved by recording some extra rectification functions
  1203 during the derivatives step, and applying these rectifications in 
  1196 during the derivatives step, and applying these rectifications in 
  1204 each run during the injection phase.
  1197 each run during the injection phase.
  1205 With extra care
  1198 With extra care
  1206 one can show that POSIXness will not be affected---although it is much harder
  1199 one can show that POSIXness will not be affected
  1207 to establish. 
  1200 by the simplifications listed here \cite{AusafDyckhoffUrban2016}. 
  1208 Some initial results in this regard have been
  1201 \begin{center}
  1209 obtained in \cite{AusafDyckhoffUrban2016}. 
  1202 	\begin{tabular}{lcl}
  1210 
  1203 		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
  1211 However, with all the simplification rules allowed
  1204 		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
  1212 in an injection-based lexer, one could still end up in
  1205 					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
  1213 trouble, when cases that require more involved and aggressive
  1206 					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
  1214 simplifications arise.
  1207 					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
       
  1208 					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
       
  1209 					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
       
  1210 		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
       
  1211 				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
       
  1212 				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
       
  1213 				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
       
  1214 		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
       
  1215 				   
       
  1216 	\end{tabular}
       
  1217 \end{center}
       
  1218 
       
  1219 However, with the simple-minded simplification rules allowed
       
  1220 in an injection-based lexer, one can still end up
       
  1221 with exploding derivatives.
  1215 \section{A Case Requring More Aggressive Simplifications}
  1222 \section{A Case Requring More Aggressive Simplifications}
  1216 For example, when starting with the regular
  1223 For example, when starting with the regular
  1217 expression $(a^* \cdot a^*)^*$ and building just over
  1224 expression $(a^* \cdot a^*)^*$ and building just over
  1218 a dozen successive derivatives 
  1225 a dozen successive derivatives 
  1219 w.r.t.~the character $a$, one obtains a derivative regular expression
  1226 w.r.t.~the character $a$, one obtains a derivative regular expression
  1220 with millions of nodes (when viewed as a tree)
  1227 with millions of nodes (when viewed as a tree)
  1221 even with simplification, which is not much better compared
  1228 even with the simple-minded simplification.
  1222 with the naive version without any simplifications:
       
  1223 \begin{figure}[H]
  1229 \begin{figure}[H]
  1224 \begin{center}
  1230 \begin{center}
  1225 \begin{tikzpicture}
  1231 \begin{tikzpicture}
  1226 \begin{axis}[
  1232 \begin{axis}[
  1227     xlabel={$n$},
  1233     xlabel={$n$},
  1239    
  1245    
  1240 That is because Sulzmann and Lu's 
  1246 That is because Sulzmann and Lu's 
  1241 injection-based lexing algorithm keeps a lot of 
  1247 injection-based lexing algorithm keeps a lot of 
  1242 "useless" values that will not be used. 
  1248 "useless" values that will not be used. 
  1243 These different ways of matching will grow exponentially with the string length.
  1249 These different ways of matching will grow exponentially with the string length.
  1244 Take 
  1250 Consider the case
  1245 \[
  1251 \[
  1246 	r= (a^*\cdot a^*)^* \quad and \quad
  1252 	r= (a^*\cdot a^*)^* \quad and \quad
  1247 	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
  1253 	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
  1248 \]
  1254 \]
  1249 as an example.
  1255 as an example.
  1254 multiple ways of splitting between 
  1260 multiple ways of splitting between 
  1255 the two $a^*$ sub-expressions.
  1261 the two $a^*$ sub-expressions.
  1256 When $n$ is equal to $1$, there are two lexical values for
  1262 When $n$ is equal to $1$, there are two lexical values for
  1257 the match:
  1263 the match:
  1258 \[
  1264 \[
  1259 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (value 1)
  1265 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
  1260 \]
  1266 \]
  1261 and
  1267 and
  1262 \[
  1268 \[
  1263 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (value 2)
  1269 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
  1264 \]
  1270 \]
  1265 The derivative of $\derssimp \;s \; r$ is
  1271 The derivative of $\derssimp \;s \; r$ is
  1266 \[
  1272 \[
  1267 	(a^*a^* + a^*)\cdot(a^*a^*)^*.
  1273 	(a^*a^* + a^*)\cdot(a^*a^*)^*.
  1268 \]
  1274 \]
  1270 correspond to value 1 and value 2, respectively.
  1276 correspond to value 1 and value 2, respectively.
  1271 When $n=2$, the number goes up to 7:
  1277 When $n=2$, the number goes up to 7:
  1272 \[
  1278 \[
  1273 	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
  1279 	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
  1274 \]
  1280 \]
  1275 ,
  1281 
  1276 \[
  1282 \[
  1277 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
  1283 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
  1278 \]
  1284 \]
  1279 ,
  1285 
  1280 \[
  1286 \[
  1281 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
  1287 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
  1282 \]
  1288 \]
  1283 ,
  1289 
  1284 \[
  1290 \[
  1285 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
  1291 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
  1286 \]
  1292 \]
  1287 ,
  1293 
  1288 \[
  1294 \[
  1289 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), 
  1295 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), 
  1290 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
  1296 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
  1291 		  ] 
  1297 		  ] 
  1292 \]
  1298 \]
  1293 ,
  1299 
  1294 \[
  1300 \[
  1295 	\Stars \; [
  1301 	\Stars \; [
  1296 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
  1302 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
  1297 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
  1303 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
  1298 		  ] 
  1304 		  ] 
  1313 seven distinct lexical values.
  1319 seven distinct lexical values.
  1314 
  1320 
  1315 It is not surprising that there are exponentially many 
  1321 It is not surprising that there are exponentially many 
  1316 distinct lexical values that cannot be eliminated by 
  1322 distinct lexical values that cannot be eliminated by 
  1317 the simple-minded simplification of $\derssimp$. 
  1323 the simple-minded simplification of $\derssimp$. 
  1318 
       
  1319 A lexer without a good enough strategy to 
  1324 A lexer without a good enough strategy to 
  1320 deduplicate will naturally
  1325 deduplicate will naturally
  1321 have an exponential runtime on ambiguous regular expressions.
  1326 have an exponential runtime on highly
       
  1327 ambiguous regular expressions, because there
       
  1328 are exponentially many matches.
       
  1329 For this particular example, it seems
       
  1330 that the number of distinct matches growth
       
  1331 speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
  1322 
  1332 
  1323 On the other hand, the
  1333 On the other hand, the
  1324  $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and 
  1334  $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and 
  1325 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
  1335 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
  1326 \[
  1336 \[
  1358 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
  1368 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
  1359 \]
  1369 \]
  1360 is simply too hopeless to contribute a POSIX lexical value,
  1370 is simply too hopeless to contribute a POSIX lexical value,
  1361 and is therefore thrown away.
  1371 and is therefore thrown away.
  1362 
  1372 
  1363 Ausaf and Dyckhoff and Urban \cite{AusafDyckhoffUrban2016} 
  1373 Ausaf et al. \cite{AusafDyckhoffUrban2016} 
  1364 have come up with some simplification steps, however those steps
  1374 have come up with some simplification steps, however those steps
  1365 are not yet sufficiently strong so that they achieve the above effects.
  1375 are not yet sufficiently strong so that they achieve the above effects.
  1366 And even with these relatively mild simplifications the proof
  1376 And even with these relatively mild simplifications the proof
  1367 is already quite a bit complicated than the theorem \ref{lexerCorrectness}.
  1377 is already quite a bit complicated than the theorem \ref{lexerCorrectness}.
  1368 One would prove something like: 
  1378 One would prove something like: 
  1377 has been done as a function on values
  1387 has been done as a function on values
  1378 showing how one can transform the value
  1388 showing how one can transform the value
  1379 underlying the simplified regular expression
  1389 underlying the simplified regular expression
  1380 to the unsimplified one.
  1390 to the unsimplified one.
  1381 
  1391 
  1382 We therefore choose a slightly different approach to
  1392 We therefore choose a slightly different approach
       
  1393 also described by Sulzmann and Lu to
  1383 get better simplifications, which uses
  1394 get better simplifications, which uses
  1384 some augmented data structures compared to 
  1395 some augmented data structures compared to 
  1385 plain regular expressions.
  1396 plain regular expressions.
  1386 We call them \emph{annotated}
  1397 We call them \emph{annotated}
  1387 regular expressions.
  1398 regular expressions.
  1388 With annotated regular expressions,
  1399 With annotated regular expressions,
  1389 we can avoid creating the intermediate values $v_1,\ldots v_n$ and a 
  1400 we can avoid creating the intermediate values $v_1,\ldots v_n$ and a 
  1390 second phase altogether.
  1401 second phase altogether.
  1391 In the meantime, we can also ensure that simplifications
       
  1392 are easily handled without breaking the correctness of the algorithm.
       
  1393 We introduce this new datatype and the 
  1402 We introduce this new datatype and the 
  1394 corresponding algorithm in the next chapter.
  1403 corresponding algorithm in the next chapter.
  1395 
  1404 
  1396 
  1405 
  1397 
  1406