\]
as an example.
This is a highly ambiguous regular expression, with
many ways to split up the string into multiple segments for
different star iterations,
and for each segment there are
multiple ways of splitting between
the two $a^*$ sub-expressions.
When $n$ is equal to $1$, there are two lexical values for
the match:
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (\text{value 1})
\]
and
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (\text{value 2})
\]
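
To make these two values concrete, the following Scala sketch (the
constructor names \texttt{Chr}, \texttt{Sequ} and \texttt{Stars} are
assumptions for this illustration, not necessarily the ones used elsewhere
in this thesis) defines the value datatype together with the flattening
function and shows that both values correspond to the same underlying
string $a$:
\begin{lstlisting}[language=Scala]
// values in the style of Sulzmann and Lu; only the constructors needed
// for this example are included
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val            // a single character
case class Sequ(v1: Val, v2: Val) extends Val  // a value for a sequence
case class Stars(vs: List[Val]) extends Val    // one value per star iteration

// the underlying string of a value (its flattening)
def flatten(v: Val): String = v match {
  case Empty        => ""
  case Chr(c)       => c.toString
  case Sequ(v1, v2) => flatten(v1) ++ flatten(v2)
  case Stars(vs)    => vs.map(flatten).mkString
}

// value 1: the single a is matched by the first a* of the iteration
val value1 = Stars(List(Sequ(Stars(List(Chr('a'))), Stars(Nil))))
// value 2: the single a is matched by the second a*
val value2 = Stars(List(Sequ(Stars(Nil), Stars(List(Chr('a'))))))

@main def checkFlattening(): Unit =
  // both values flatten to the same string "a"
  println((flatten(value1), flatten(value2)))
\end{lstlisting}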
|
The simplified derivative $\derssimp \; s \; r$ is
\[
(a^*a^* + a^*)\cdot(a^*a^*)^*.
\]
The $a^*a^*$ and $a^*$ in the first child of the above sequence
correspond to value 1 and value 2, respectively.
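
This derivative can be reproduced with a small Scala sketch of the
Brzozowski derivative together with a simple-minded simplification in the
spirit of $\derssimp$ (the datatype and the function names below are
assumptions for this illustration, not the definitions used in this thesis):
\begin{lstlisting}[language=Scala]
// regular expressions and the Brzozowski derivative, with a simple-minded
// simplification that only removes superfluous zero and one constructors
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2
    case (s1, ZERO) => s1
    case (s1, s2)   => ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, s2)  => s2
    case (s1, ONE)  => s1
    case (s1, s2)   => SEQ(s1, s2)
  }
  case other => other
}

val astar = STAR(CHAR('a'))              // a*
val r     = STAR(SEQ(astar, astar))      // (a*a*)*

@main def derivativeDemo(): Unit =
  // prints (a*a* + a*).(a*a*)* in constructor notation; its two
  // alternatives correspond to value 1 and value 2 above
  println(simp(der('a', r)))
\end{lstlisting}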
|
When $n=2$, the number goes up to 7:
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])],
\]
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])],
\]
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])],
\]
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []) ],
\]
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []),
	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
],
\]
\[
\Stars \; [
	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
]
\]
and
\[
\Stars \; [
	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
	\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
]
\]
And $\derssimp \; aa \; (a^*a^*)^*$ would be
\[
((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* +
(a^*a^* + a^*)\cdot(a^*a^*)^*,
\]
where the simplification has removed only two of the seven terms
corresponding to the seven distinct lexical values.
|
It is not surprising that there are exponentially many
distinct lexical values that cannot be eliminated by
the simple-minded simplification of $\derssimp$.
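
The growth can be estimated with a small counting sketch (a
back-of-the-envelope recurrence for this particular example, not a
definition from this thesis; as usual it assumes that every star iteration
consumes a non-empty segment): a value for $a^*\cdot a^*$ on $a^k$ chooses
one of $k+1$ split points, and a value for $(a^*a^*)^*$ on $a^n$ chooses a
decomposition of $a^n$ into non-empty segments together with a value for
each segment.
\begin{lstlisting}[language=Scala]
// number of distinct lexical values of (a*a*)* for the string a^n:
//   count(0) = 1
//   count(n) = sum_{k=1..n} (k+1) * count(n-k)
// computed bottom-up for n = 0..maxN
def counts(maxN: Int): Vector[BigInt] =
  (0 to maxN).foldLeft(Vector.empty[BigInt]) { (acc, n) =>
    val c =
      if (n == 0) BigInt(1)
      else (1 to n).map(k => BigInt(k + 1) * acc(n - k)).sum
    acc :+ c
  }

@main def countValues(): Unit =
  // prints 1, 2, 7, 24, 82, ... -- matching the 2 values for n = 1 and the
  // 7 values for n = 2 above, and growing by roughly a factor of 3.4 per step
  for ((c, n) <- counts(10).zipWithIndex) println(s"n = $n: $c")
\end{lstlisting}
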
A lexer without a good enough strategy to
deduplicate will naturally
have an exponential runtime on ambiguous regular expressions.

On the other hand, the
$\POSIX$ value for $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is
\[
\Stars\,
[\Seq \; (\Stars\,[\underbrace{\Char \; a,\ldots,\Char \; a}_\text{n iterations}])\; (\Stars\,[])]
\]
and at any point during lexing the subterms of a regular expression
that can still contribute to a POSIX value are only
a minority among the many other terms,
so one can remove those that cannot possibly
yield a POSIX value.
In the above example,
\[
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}
\]
can be further simplified by
first removing the duplicate of $A$;
this makes the first summand identical to the second,
which opens up the possibility of a further simplification
that removes the underbraced second summand.
The result would be
\[
(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*
\]
with corresponding values
\begin{center}
\begin{tabular}{lr}
$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\
$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$
\end{tabular}
\end{center}
Other terms, with underlying values such as
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])],
\]
are simply too hopeless to contribute a POSIX lexical value
and are therefore thrown away.
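
The effect of such stronger simplifications can be illustrated with the
following Scala sketch (an illustration over an assumed plain regular
expression datatype, not the simplification function used later in this
thesis), which flattens nested alternatives, drops zeros and keeps only the
leftmost copy of syntactically equal alternatives:
\begin{lstlisting}[language=Scala]
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// collect all alternatives of a regular expression into a flat list
def flts(r: Rexp): List[Rexp] = r match {
  case ALT(r1, r2) => flts(r1) ++ flts(r2)
  case ZERO        => Nil
  case other       => List(other)
}

def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) =>
    // flatten, then deduplicate, keeping the leftmost copy
    (flts(simp(r1)) ++ flts(simp(r2))).distinct match {
      case Nil       => ZERO
      case r0 :: Nil => r0
      case rs        => rs.reduceRight((a, b) => ALT(a, b))
    }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, s2)  => s2
    case (s1, ONE)  => s1
    case (s1, s2)   => SEQ(s1, s2)
  }
  case other => other
}

val astar = STAR(CHAR('a'))
val aa    = SEQ(astar, astar)                        // a*a*
val rest  = STAR(aa)                                 // (a*a*)*
// ((a*a* + a*) + a*).(a*a*)*  +  (a*a* + a*).(a*a*)*
val example = ALT(SEQ(ALT(ALT(aa, astar), astar), rest),
                  SEQ(ALT(aa, astar), rest))

@main def simpDemo(): Unit =
  // prints (a*a* + a*).(a*a*)* in constructor notation
  println(simp(example))
\end{lstlisting}
On the example above this yields exactly the term
$(a^*a^* + a^*)\cdot(a^*a^*)^*$ shown earlier; keeping the leftmost copy of a
duplicated alternative is safe because alternatives further to the left are
the ones preferred by the POSIX rules.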
|
|
Ausaf, Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}
have come up with some simplification steps, however those steps
are not yet strong enough to achieve the above effects.
And even with these relatively mild simplifications the proof
is already quite a bit more complicated than that of theorem \ref{lexerCorrectness}.
One would prove something like:
\[
\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\;
\textit{then}\;\; (r, c::s) \rightarrow
\inj\; r \; c \; ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v)
\]
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
not only has to return a simplified regular expression,
but also a function on values recording which specific simplifications
have been done, showing how one can transform a value
for the simplified regular expression
into a value for the unsimplified one.
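
The shape of such a statement can be made concrete with a Scala sketch
(an illustration over assumed \texttt{Rexp} and \texttt{Val} datatypes, not
Ausaf, Dyckhoff and Urban's actual definition) in which $\textit{simp}$
returns a pair: the rectification function on values as the first component
and the simplified regular expression as the second, mirroring the
$\textit{fst}$ and $\textit{snd}$ in the statement above:
\begin{lstlisting}[language=Scala]
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val

// fst: rectification function on values, snd: simplified regular expression
def simp(r: Rexp): (Val => Val, Rexp) = r match {
  case ALT(r1, r2) =>
    val (f1, s1) = simp(r1); val (f2, s2) = simp(r2)
    (s1, s2) match {
      case (ZERO, _) => (v => Right(f2(v)), s2)   // left branch disappeared
      case (_, ZERO) => (v => Left(f1(v)), s1)    // right branch disappeared
      case _         => ({ case Left(v)  => Left(f1(v))
                           case Right(v) => Right(f2(v)) }: Val => Val,
                         ALT(s1, s2))
    }
  case SEQ(r1, r2) =>
    val (f1, s1) = simp(r1); val (f2, s2) = simp(r2)
    (s1, s2) match {
      case (ZERO, _) | (_, ZERO) => (v => v, ZERO)         // no value exists
      case (ONE, _)  => (v => Sequ(f1(Empty), f2(v)), s2)  // 1 . r ~> r
      case (_, ONE)  => (v => Sequ(f1(v), f2(Empty)), s1)  // r . 1 ~> r
      case _         => ({ case Sequ(v1, v2) => Sequ(f1(v1), f2(v2)) }: Val => Val,
                         SEQ(s1, s2))
    }
  case other => (v => v, other)
}
\end{lstlisting}
Threading such a rectification function through every derivative step is
what makes the statement above noticeably more involved than
lemma \ref{injPosix}.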
|
1139 |
|
1140 We therefore choose a slightly different approach to |
|
1141 get better simplifications, which uses |
|
1142 some augmented data structures compared to |
|
1143 plain regular expressions. |
|
1144 We call them \emph{annotated} |
|
1145 regular expressions. |
|
1146 With annotated regular expressions, |
|
1147 we can avoid creating the intermediate values $v_1,\ldots v_n$ and a |
|
1148 second phase altogether. |
|
1149 In the meantime, we can also ensure that simplifications |
|
1150 are easily handled without breaking the correctness of the algorithm. |
|
1151 We introduce this new datatype and the |
|
1152 corresponding algorithm in the next chapter. |
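
As a rough preview (a sketch only, based on Sulzmann and Lu's idea of
attaching bit-sequences to regular expressions; the precise datatype and the
names used in the next chapter may differ), an annotated regular expression
can be thought of as a regular expression whose nodes carry the partial
lexing information that would otherwise have to be recovered by the
injection phase:
\begin{lstlisting}[language=Scala]
// one possible shape of annotated regular expressions: every node carries a
// sequence of bits recording the lexing decisions made so far, so that a
// value can be decoded from the bits at the end without a second phase
abstract class Bit
case object Z extends Bit
case object S extends Bit
type Bits = List[Bit]

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, rs: List[ARexp]) extends ARexp  // n-ary alternatives
case class ASEQ(bs: Bits, r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: Bits, r: ARexp) extends ARexp
\end{lstlisting}
A list-based alternative constructor like \texttt{AALTS}, for instance, would
allow the flattening and duplicate-removing simplification sketched earlier to
be applied directly on the annotated syntax.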
|