982 the empty string , by always selecting the leftmost |
982 the empty string , by always selecting the leftmost |
983 nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they |
983 nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they |
984 appeared in the string, always preserving POSIXness.}\label{graph:inj} |
984 appeared in the string, always preserving POSIXness.}\label{graph:inj} |
985 \end{figure} |
985 \end{figure} |
986 \noindent |
986 \noindent |
987 $\textit{inj}$ takes three arguments: a regular |
987 The function $\textit{inj}$ as defined by Sulzmann and Lu |
|
988 takes three arguments: a regular |
988 expression ${r_{i}}$, before the character is chopped off, |
989 expression ${r_{i}}$, before the character is chopped off, |
989 a character ${c_{i}}$, the character we want to inject back and |
990 a character ${c_{i}}$ (the character we want to inject back) and |
990 the third argument $v_{i+1}$ the value we want to inject into. |
991 the third argument $v_{i+1}$ the value we want to inject into. |
991 The result of an application |
992 The result of an application |
992 $\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that |
993 $\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that |
993 \[ |
994 \[ |
994 (s_i, r_i) \rightarrow v_i |
995 (s_i, r_i) \rightarrow v_i |
1008 $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\ |
1009 $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\ |
1009 \end{tabular} |
1010 \end{tabular} |
1010 \end{center} |
1011 \end{center} |
1011 |
1012 |
1012 \noindent |
1013 \noindent |
1013 The function does a recursion on |
1014 The function recurses on |
1014 the shape of regular |
1015 the shape of regular |
1015 expression $r_i$ and value $v_{i+1}$. |
1016 expressionsw and values. |
1016 Intuitively, each clause analyses |
1017 Intuitively, each clause analyses |
1017 how $r_i$ could have transformed when being |
1018 how $r_i$ could have transformed when being |
1018 derived by $c$, identifying which subpart |
1019 derived by $c$, identifying which subpart |
1019 of $v_{i+1}$ has the ``hole'' |
1020 of $v_{i+1}$ has the ``hole'' |
1020 to inject the character back into. |
1021 to inject the character back into. |
1021 Once the character is |
1022 Once the character is |
1022 injected back to that sub-value; |
1023 injected back to that sub-value; |
1023 $\inj$ assembles all things together |
1024 $\inj$ assembles all parts together |
1024 to form a new value. |
1025 to form a new value. |
1025 |
1026 |
1026 For instance, the last clause is an |
1027 For instance, the last clause is an |
1027 injection into a sequence value $v_{i+1}$ |
1028 injection into a sequence value $v_{i+1}$ |
1028 whose second child |
1029 whose second child |
1052 to the previous list of iterations, and then |
1053 to the previous list of iterations, and then |
1053 wrapped under the $\Stars$ |
1054 wrapped under the $\Stars$ |
1054 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$. |
1055 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$. |
1055 |
1056 |
1056 Recall that lemma |
1057 Recall that lemma |
1057 \ref{mePosix} tells us |
1058 \ref{mePosix} tells us that |
1058 $\mkeps$ always selects the POSIX matching among |
1059 $\mkeps$ always generates the POSIX value. |
1059 multiple values that flatten to the empty string. |
1060 The function $\inj$ preserves the POSIXness, provided |
1060 Now $\inj$ preserves the POSIXness, provided |
1061 the value before injection is POSIX, namely |
1061 the value before injection is POSIX: |
|
1062 \begin{lemma}\label{injPosix} |
1062 \begin{lemma}\label{injPosix} |
1063 If |
1063 If$(r \backslash c, s) \rightarrow v $, |
1064 \[ |
1064 then $(r, c :: s) \rightarrow (\inj r \; c\; v)$. |
1065 (r \backslash c, s) \rightarrow v |
|
1066 \] |
|
1067 then |
|
1068 \[ |
|
1069 (r, c :: s) \rightarrow (\inj r \; c\; v). |
|
1070 \] |
|
1071 \end{lemma} |
1065 \end{lemma} |
1072 \begin{proof} |
1066 \begin{proof} |
1073 By induction on $r$. |
1067 By induction on $r$. |
1074 The involved cases are sequence and star. |
1068 The non-trivial cases are sequence and star. |
1075 When $r = a \cdot b$, there could be |
1069 When $r = a \cdot b$, there can be |
1076 three cases for the value $v$ satisfying $\vdash v:a\backslash c$. |
1070 three cases for the value $v$ satisfying $\vdash v:a\backslash c$. |
1077 We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each |
1071 We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each |
1078 case. |
1072 case. |
1079 \begin{itemize} |
1073 \begin{itemize} |
1080 \item |
1074 \item |
1159 as the POSIX value for $a\cdot b$. |
1153 as the POSIX value for $a\cdot b$. |
1160 \end{itemize} |
1154 \end{itemize} |
1161 The star case can be proven similarly. |
1155 The star case can be proven similarly. |
1162 \end{proof} |
1156 \end{proof} |
1163 \noindent |
1157 \noindent |
1164 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together |
1158 Putting all together, Sulzmann and Lu obtained the following algorithm |
1165 by following the procedure outlined in the diagram \ref{graph:inj}, |
1159 outlined in the diagram \ref{graph:inj}: |
1166 and taking into consideration the possibility of a non-match, |
|
1167 a lexer can be built with the following recursive definition: |
|
1168 \begin{center} |
1160 \begin{center} |
1169 \begin{tabular}{lcl} |
1161 \begin{tabular}{lcl} |
1170 $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ |
1162 $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ |
1171 $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\ |
1163 $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\ |
1172 & & $\quad \phantom{\mid}\; \None \implies \None$\\ |
1164 & & $\quad \phantom{\mid}\; \None \implies \None$\\ |
1173 & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$ |
1165 & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$ |
1174 \end{tabular} |
1166 \end{tabular} |
1175 \end{center} |
1167 \end{center} |
1176 \noindent |
1168 \noindent |
1177 The central property of the $\lexer$ is that it gives the correct result by |
1169 The central property of the $\lexer$ is that it gives the correct result |
1178 $\POSIX$ standards: |
1170 according to |
|
1171 POSIX rules. |
1179 \begin{theorem}\label{lexerCorrectness} |
1172 \begin{theorem}\label{lexerCorrectness} |
1180 The $\lexer$ based on derivatives and injections is correct: |
1173 The $\lexer$ based on derivatives and injections is correct: |
1181 \begin{center} |
1174 \begin{center} |
1182 \begin{tabular}{lcl} |
1175 \begin{tabular}{lcl} |
1183 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\ |
1176 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\ |
1184 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$ |
1177 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$ |
1185 \end{tabular} |
1178 \end{tabular} |
1186 \end{center} |
1179 \end{center} |
1187 \end{theorem} |
1180 \end{theorem} |
1188 \begin{proof} |
1181 \begin{proof} |
1189 By induction on $s$. $r$ is allowed to be an arbitrary regular expression. |
1182 By induction on $s$. $r$ generalising over an arbitrary regular expression. |
1190 The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case |
1183 The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case |
1191 by lemma \ref{injPosix}. |
1184 by lemma \ref{injPosix}. |
1192 \end{proof} |
1185 \end{proof} |
1193 \noindent |
1186 \noindent |
1194 As we did earlier in this chapter on the matcher, one can |
1187 As we did earlier in this chapter with the matcher, one can |
1195 introduce simplification on the regular expression. |
1188 introduce simplification on the regular expression in each derivative step. |
1196 However, now one needs to do a backward phase and make sure |
1189 However, now one needs to do a backward phase and make sure |
1197 the values align with the regular expressions. |
1190 the values align with the regular expression. |
1198 Therefore one has to |
1191 Therefore one has to |
1199 be careful not to break the correctness, as the injection |
1192 be careful not to break the correctness, as the injection |
1200 function heavily relies on the structure of the regular expressions and values |
1193 function heavily relies on the structure of |
1201 being correct and matching each other. |
1194 the regular expressions and values being aligned. |
1202 It can be achieved by recording some extra rectification functions |
1195 This can be achieved by recording some extra rectification functions |
1203 during the derivatives step, and applying these rectifications in |
1196 during the derivatives step, and applying these rectifications in |
1204 each run during the injection phase. |
1197 each run during the injection phase. |
1205 With extra care |
1198 With extra care |
1206 one can show that POSIXness will not be affected---although it is much harder |
1199 one can show that POSIXness will not be affected |
1207 to establish. |
1200 by the simplifications listed here \cite{AusafDyckhoffUrban2016}. |
1208 Some initial results in this regard have been |
1201 \begin{center} |
1209 obtained in \cite{AusafDyckhoffUrban2016}. |
1202 \begin{tabular}{lcl} |
1210 |
1203 $\simp \; r_1 \cdot r_2 $ & $ \dn$ & |
1211 However, with all the simplification rules allowed |
1204 $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\ |
1212 in an injection-based lexer, one could still end up in |
1205 & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\ |
1213 trouble, when cases that require more involved and aggressive |
1206 & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\ |
1214 simplifications arise. |
1207 & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\ |
|
1208 & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\ |
|
1209 & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\ |
|
1210 $\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\ |
|
1211 & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\ |
|
1212 & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\ |
|
1213 & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\ |
|
1214 $\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$ |
|
1215 |
|
1216 \end{tabular} |
|
1217 \end{center} |
|
1218 |
|
1219 However, with the simple-minded simplification rules allowed |
|
1220 in an injection-based lexer, one can still end up |
|
1221 with exploding derivatives. |
1215 \section{A Case Requring More Aggressive Simplifications} |
1222 \section{A Case Requring More Aggressive Simplifications} |
1216 For example, when starting with the regular |
1223 For example, when starting with the regular |
1217 expression $(a^* \cdot a^*)^*$ and building just over |
1224 expression $(a^* \cdot a^*)^*$ and building just over |
1218 a dozen successive derivatives |
1225 a dozen successive derivatives |
1219 w.r.t.~the character $a$, one obtains a derivative regular expression |
1226 w.r.t.~the character $a$, one obtains a derivative regular expression |
1220 with millions of nodes (when viewed as a tree) |
1227 with millions of nodes (when viewed as a tree) |
1221 even with simplification, which is not much better compared |
1228 even with the simple-minded simplification. |
1222 with the naive version without any simplifications: |
|
1223 \begin{figure}[H] |
1229 \begin{figure}[H] |
1224 \begin{center} |
1230 \begin{center} |
1225 \begin{tikzpicture} |
1231 \begin{tikzpicture} |
1226 \begin{axis}[ |
1232 \begin{axis}[ |
1227 xlabel={$n$}, |
1233 xlabel={$n$}, |
1270 correspond to value 1 and value 2, respectively. |
1276 correspond to value 1 and value 2, respectively. |
1271 When $n=2$, the number goes up to 7: |
1277 When $n=2$, the number goes up to 7: |
1272 \[ |
1278 \[ |
1273 \Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])] |
1279 \Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])] |
1274 \] |
1280 \] |
1275 , |
1281 |
1276 \[ |
1282 \[ |
1277 \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] |
1283 \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] |
1278 \] |
1284 \] |
1279 , |
1285 |
1280 \[ |
1286 \[ |
1281 \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])] |
1287 \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])] |
1282 \] |
1288 \] |
1283 , |
1289 |
1284 \[ |
1290 \[ |
1285 \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ] |
1291 \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ] |
1286 \] |
1292 \] |
1287 , |
1293 |
1288 \[ |
1294 \[ |
1289 \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), |
1295 \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), |
1290 \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]) |
1296 \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]) |
1291 ] |
1297 ] |
1292 \] |
1298 \] |
1293 , |
1299 |
1294 \[ |
1300 \[ |
1295 \Stars \; [ |
1301 \Stars \; [ |
1296 \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]), |
1302 \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]), |
1297 \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]) |
1303 \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]) |
1298 ] |
1304 ] |
1313 seven distinct lexical values. |
1319 seven distinct lexical values. |
1314 |
1320 |
1315 It is not surprising that there are exponentially many |
1321 It is not surprising that there are exponentially many |
1316 distinct lexical values that cannot be eliminated by |
1322 distinct lexical values that cannot be eliminated by |
1317 the simple-minded simplification of $\derssimp$. |
1323 the simple-minded simplification of $\derssimp$. |
1318 |
|
1319 A lexer without a good enough strategy to |
1324 A lexer without a good enough strategy to |
1320 deduplicate will naturally |
1325 deduplicate will naturally |
1321 have an exponential runtime on ambiguous regular expressions. |
1326 have an exponential runtime on highly |
|
1327 ambiguous regular expressions, because there |
|
1328 are exponentially many matches. |
|
1329 For this particular example, it seems |
|
1330 that the number of distinct matches growth |
|
1331 speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length). |
1322 |
1332 |
1323 On the other hand, the |
1333 On the other hand, the |
1324 $\POSIX$ value for $r= (a^*\cdot a^*)^*$ and |
1334 $\POSIX$ value for $r= (a^*\cdot a^*)^*$ and |
1325 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is |
1335 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is |
1326 \[ |
1336 \[ |
1377 has been done as a function on values |
1387 has been done as a function on values |
1378 showing how one can transform the value |
1388 showing how one can transform the value |
1379 underlying the simplified regular expression |
1389 underlying the simplified regular expression |
1380 to the unsimplified one. |
1390 to the unsimplified one. |
1381 |
1391 |
1382 We therefore choose a slightly different approach to |
1392 We therefore choose a slightly different approach |
|
1393 also described by Sulzmann and Lu to |
1383 get better simplifications, which uses |
1394 get better simplifications, which uses |
1384 some augmented data structures compared to |
1395 some augmented data structures compared to |
1385 plain regular expressions. |
1396 plain regular expressions. |
1386 We call them \emph{annotated} |
1397 We call them \emph{annotated} |
1387 regular expressions. |
1398 regular expressions. |
1388 With annotated regular expressions, |
1399 With annotated regular expressions, |
1389 we can avoid creating the intermediate values $v_1,\ldots v_n$ and a |
1400 we can avoid creating the intermediate values $v_1,\ldots v_n$ and a |
1390 second phase altogether. |
1401 second phase altogether. |
1391 In the meantime, we can also ensure that simplifications |
|
1392 are easily handled without breaking the correctness of the algorithm. |
|
1393 We introduce this new datatype and the |
1402 We introduce this new datatype and the |
1394 corresponding algorithm in the next chapter. |
1403 corresponding algorithm in the next chapter. |
1395 |
1404 |
1396 |
1405 |
1397 |
1406 |