ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 573 454ced557605
parent 568 7a579f5533f8
child 577 f47fc4840579
equal deleted inserted replaced
568:7a579f5533f8 573:454ced557605
   937 \end{tabular}
   937 \end{tabular}
   938 \end{center}
   938 \end{center}
   939 \noindent
   939 \noindent
   940 The central property of the $\lexer$ is that it gives the correct result by
   940 The central property of the $\lexer$ is that it gives the correct result by
   941 $\POSIX$ standards:
   941 $\POSIX$ standards:
   942 \begin{theorem}
   942 \begin{theorem}\label{lexerCorrectness}
   943 	The $\lexer$ based on derivatives and injections is correct: 
   943 	The $\lexer$ based on derivatives and injections is correct: 
   944 	\begin{center}
   944 	\begin{center}
   945 		\begin{tabular}{lcl}
   945 		\begin{tabular}{lcl}
   946 			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
   946 			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
   947 			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
   947 			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
  1006 \]
  1006 \]
  1007 as an example.
  1007 as an example.
  1008 This is a highly ambiguous regular expression, with
  1008 This is a highly ambiguous regular expression, with
  1009 many ways to split up the string into multiple segments for
  1009 many ways to split up the string into multiple segments for
  1010 different star iteratioins,
  1010 different star iteratioins,
  1011 and each segment will have multiple ways of splitting between 
  1011 and for each segment 
       
  1012 multiple ways of splitting between 
  1012 the two $a^*$ sub-expressions.
  1013 the two $a^*$ sub-expressions.
  1013 It is not surprising there are exponentially many 
  1014 When $n$ is equal to $1$, there are two lexical values for
  1014 distinct lexical values
  1015 the match:
  1015 for such a pair of regular expression and string.
  1016 \[
       
  1017 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (value 1)
       
  1018 \]
       
  1019 and
       
  1020 \[
       
  1021 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (value 2)
       
  1022 \]
       
  1023 The derivative of $\derssimp \;s \; r$ is
       
  1024 \[
       
  1025 	(a^*a^* + a^*)\cdot(a^*a^*)^*.
       
  1026 \]
       
  1027 The $a^*a^*$ and $a^*$ in the first child of the above sequence
       
  1028 correspond to value 1 and value 2, respectively.
       
  1029 When $n=2$, the number goes up to 7:
       
  1030 \[
       
  1031 	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
       
  1032 \]
       
  1033 ,
       
  1034 \[
       
  1035 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
       
  1036 \]
       
  1037 ,
       
  1038 \[
       
  1039 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
       
  1040 \]
       
  1041 ,
       
  1042 \[
       
  1043 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
       
  1044 \]
       
  1045 ,
       
  1046 \[
       
  1047 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), 
       
  1048 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
       
  1049 		  ] 
       
  1050 \]
       
  1051 ,
       
  1052 \[
       
  1053 	\Stars \; [
       
  1054 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
       
  1055 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
       
  1056 		  ] 
       
  1057 \]
       
  1058 and
       
  1059 \[
       
  1060 	\Stars \; [
       
  1061 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
       
  1062 		   \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
       
  1063 		  ] 
       
  1064 \]
       
  1065 And $\derssimp \; aa \; (a^*a^*)^*$ would be
       
  1066 \[
       
  1067 	((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + 
       
  1068 	(a^*a^* + a^*)\cdot(a^*a^*)^*.
       
  1069 \]
       
  1070 which removes two out of the seven terms corresponding to the
       
  1071 seven distinct lexical values.
       
  1072 
       
  1073 It is not surprising that there are exponentially many 
       
  1074 distinct lexical values that cannot be eliminated by 
       
  1075 the simple-minded simplification of $\derssimp$. 
       
  1076 
  1016 A lexer without a good enough strategy to 
  1077 A lexer without a good enough strategy to 
  1017 deduplicate will naturally
  1078 deduplicate will naturally
  1018 have an exponential runtime on ambiguous regular expressions.
  1079 have an exponential runtime on ambiguous regular expressions.
  1019 
  1080 
  1020 Somehow one has to make sure which
  1081 On the other hand, the
  1021  lexical values are $\POSIX$ and must be kept in a lexing algorithm.
  1082  $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and 
  1022  For example, the above $r= (a^*\cdot a^*)^*$  and 
  1083 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
  1023 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
  1084 \[
  1024 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
  1085 	\Stars\,
  1025 We want to keep this value only, and remove all the regular expression subparts
  1086 	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]
  1026 not corresponding to this value during lexing.
  1087 \]
  1027 To do this, a two-phase algorithm with rectification is a bit too fragile.
  1088 and at any moment the  subterms in a regular expression
  1028 Can we not create those intermediate values $v_1,\ldots v_n$,
  1089 that will result in a POSIX value is only
  1029 and get the lexing information that should be already there while
  1090 a minority among the many other terms,
  1030 doing derivatives in one pass, without a second injection phase?
  1091 and one can remove ones that are absolutely not possible to 
  1031 In the meantime, can we ensure that simplifications
  1092 be POSIX.
  1032 are easily handled without breaking the correctness of the algorithm?
  1093 In the above example,
  1033 
  1094 \[
  1034 Sulzmann and Lu solved this problem by
  1095 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
  1035 introducing additional information to the 
  1096 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
  1036 regular expressions called \emph{bitcodes}.
  1097 \]
  1037 
  1098 can be further simplified by 
  1038 
  1099 removing the underlined term first,
  1039 
  1100 which would open up possibilities
  1040 
  1101 of further simplification that removes the
       
  1102 underbraced part.
       
  1103 The result would be 
       
  1104 \[
       
  1105 	(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*.
       
  1106 \]
       
  1107 with corresponding values
       
  1108 \begin{center}
       
  1109 	\begin{tabular}{lr}
       
  1110 		$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$  & $(\text{term 1})$\\
       
  1111 		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
       
  1112 	\end{tabular}
       
  1113 \end{center}
       
  1114 Other terms with an underlying value such as
       
  1115 \[
       
  1116 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
       
  1117 \]
       
  1118 is simply too hopeless to contribute a POSIX lexical value,
       
  1119 and is therefore thrown away.
       
  1120 
       
  1121 Ausaf and Dyckhoff and Urban \cite{AusafDyckhoffUrban2016} 
       
  1122 have come up with some simplification steps, however those steps
       
  1123 are not yet sufficiently strong so that they achieve the above effects.
       
  1124 And even with these relatively mild simplifications the proof
       
  1125 is already quite a bit complicated than the theorem \ref{lexerCorrectness}.
       
  1126 One would prove something like: 
       
  1127 \[
       
  1128 	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
       
  1129 	\textit{then}\;\; (r, c::s) \rightarrow 
       
  1130 	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v) 
       
  1131 \]
       
  1132 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
       
  1133 not only has to return a simplified regular expression,
       
  1134 but also what specific simplifications 
       
  1135 has been done as a function on values
       
  1136 showing how one can transform the value
       
  1137 underlying the simplified regular expression
       
  1138 to the unsimplified one.
       
  1139 
       
  1140 We therefore choose a slightly different approach to
       
  1141 get better simplifications, which uses
       
  1142 some augmented data structures compared to 
       
  1143 plain regular expressions.
       
  1144 We call them \emph{annotated}
       
  1145 regular expressions.
       
  1146 With annotated regular expressions,
       
  1147 we can avoid creating the intermediate values $v_1,\ldots v_n$ and a 
       
  1148 second phase altogether.
       
  1149 In the meantime, we can also ensure that simplifications
       
  1150 are easily handled without breaking the correctness of the algorithm.
       
  1151 We introduce this new datatype and the 
       
  1152 corresponding algorithm in the next chapter.
       
  1153 
       
  1154 
       
  1155 
       
  1156