ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 568 7a579f5533f8
parent 567 28cb8089ec36
child 573 454ced557605
equal deleted inserted replaced
567:28cb8089ec36 568:7a579f5533f8
   749 \noindent
   749 \noindent
   750 $\textit{inj}$ takes three arguments: a regular
   750 $\textit{inj}$ takes three arguments: a regular
   751 expression ${r_{i}}$, before the character is chopped off, 
   751 expression ${r_{i}}$, before the character is chopped off, 
   752 a character ${c_{i}}$, the character we want to inject back and 
   752 a character ${c_{i}}$, the character we want to inject back and 
   753 the third argument $v_{i+1}$ the value we want to inject into. 
   753 the third argument $v_{i+1}$ the value we want to inject into. 
   754 The result of this function is a new value $v_i$.
   754 The result of an application 
       
   755 $\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
       
   756 \[
       
   757 	(s_i, r_i) \rightarrow v_i
       
   758 \]
       
   759 holds.
   755 The definition of $\textit{inj}$ is as follows: 
   760 The definition of $\textit{inj}$ is as follows: 
   756 \begin{center}
   761 \begin{center}
   757 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   762 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
   758   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
   763   $\textit{inj}\;(c)\;c\,Empty$            & $\dn$ & $\Char\,c$\\
   759   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
   764   $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left  \; (\textit{inj}\; r_1 \; c\,v)$\\
   760   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
   765   $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c  \; v)$\\
   761   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
   766   $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$  & 
   762   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
   767   $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
   763   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
   768   $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & 
   764   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
   769   $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
       
   770   $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
       
   771   $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
   765 \end{tabular}
   772 \end{tabular}
   766 \end{center}
   773 \end{center}
   767 
   774 
   768 \noindent 
   775 \noindent 
   769 This definition is by recursion on the ``shape'' of regular
   776 The function does a recursion on 
   770 expressions and values. 
   777 the shape of regular
   771 The clauses do one thing--identifying the ``hole'' on a
   778 expression $r_i$ and value $v_{i+1}$. 
   772 value to inject the character back into.
   779 Intuitively, each clause analyses 
   773 For instance, in the last clause for injecting back to a value
   780 how $r_i$ could have transformed when being 
   774 that would turn into a new star value that corresponds to a star,
   781 derived by $c$, identifying which subpart
   775 we know it must be a sequence value. And we know that the first 
   782 of $v_{i+1}$ has the ``hole'' 
   776 value of that sequence corresponds to the child regex of the star
   783 to inject the character back into.
   777 with the first character being chopped off--an iteration of the star
   784 Once the character is
   778 that had just been unfolded. This value is followed by the already
   785 injected back to that sub-value; 
   779 matched star iterations we collected before. So we inject the character 
   786 $\inj$ assembles all things together
   780 back to the first value and form a new value with this latest iteration
   787 to form a new value.
   781 being added to the previous list of iterations, all under the $\Stars$
   788 
   782 top level.
   789 For instance, the last clause is an
   783 The POSIX value is maintained throughout the process:
   790 injection into a sequence value $v_{i+1}$
   784 \begin{lemma}
   791 whose second child
   785 $(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$
   792 value is a star, and the shape of the 
   786 \end{lemma}\label{injPosix}
   793 regular expression $r_i$ before injection 
   787 
   794 is a star.
   788 
   795 We therefore know 
   789 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
   796 the derivative 
       
   797 starts on a star and ends as a sequence:
       
   798 \[
       
   799 	(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
       
   800 \]
       
   801 during which an iteration of the star
       
   802 had just been unfolded, giving the below
       
   803 value inhabitation relation:
       
   804 \[
       
   805 	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
       
   806 \]
       
   807 The value list $vs$ corresponds to
       
   808 matched star iterations,
       
   809 and the ``hole'' lies in $v$ because
       
   810 \[
       
   811 	\vdash v: r\backslash c.
       
   812 \]
       
   813 Finally, 
       
   814 $\inj \; r \;c \; v$ is prepended
       
   815 to the previous list of iterations, and then
       
   816 wrapped under the $\Stars$
       
   817 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
       
   818 
       
   819 Recall that lemma 
       
   820 \ref{mePosix} tells us
       
   821 $\mkeps$ always selects the POSIX matching among
       
   822 multiple values that flatten to the empty string.
       
   823 Now $\inj$ preserves the POSIXness, provided
       
   824 the value before injection is POSIX:
       
   825 \begin{lemma}\label{injPosix}
       
   826 	If
       
   827 	\[
       
   828 		(r \backslash c, s) \rightarrow v 
       
   829 	\]
       
   830 	then
       
   831 	\[
       
   832 		(r, c :: s) \rightarrow (\inj r \; c\; v).
       
   833 	\]
       
   834 \end{lemma}
       
   835 \begin{proof}
       
   836 	By induction on $r$.
       
   837 	The involved cases are sequence and star.
       
   838 	When $r = a \cdot b$, there could be
       
   839 	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
       
   840 	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
       
   841 	case.
       
   842 	\begin{itemize}
       
   843 		\item
       
   844 			$v = \Seq \; v_a \; v_b$.\\
       
   845 			The ``not nullable'' clause of the $\inj$ function is taken:
       
   846 			\begin{center}
       
   847 				\begin{tabular}{lcl}
       
   848 					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
       
   849 					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
       
   850 				\end{tabular}
       
   851 			\end{center}
       
   852 			We know that there exists a unique pair of
       
   853 			$s_a$ and $s_b$ satisfaying	
       
   854 				$(a \backslash c, s_a) \rightarrow v_a$,
       
   855 				$(b , s_b) \rightarrow v_b$, and
       
   856 				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
       
   857 				L \; (a\backslash c) \land
       
   858 				s_4 \in L \; b$.
       
   859 			The last condition gives us
       
   860 			$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in 
       
   861 				L \; a \land
       
   862 				s_4 \in L \; b$.
       
   863 			By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
       
   864 			and this gives us
       
   865 			\[
       
   866 				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
       
   867 			\]
       
   868 		\item
       
   869 			$v = \Left \; (\Seq \; v_a \; v_b)$\\
       
   870 			The argument is almost identical to the above case,	
       
   871 			except that a different clause of $\inj$ is taken:
       
   872 			\begin{center}
       
   873 				\begin{tabular}{lcl}
       
   874 					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
       
   875 					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
       
   876 				\end{tabular}
       
   877 			\end{center}
       
   878 			With a similar reasoning, 
       
   879 
       
   880 			\[
       
   881 				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
       
   882 			\]
       
   883 			again holds.
       
   884 		\item 
       
   885 			$v = \Right \; v_b$\\
       
   886 			Again the injection result would be 
       
   887 			\begin{center}
       
   888 				\begin{tabular}{lcl}
       
   889 					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
       
   890 					& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
       
   891 				\end{tabular}
       
   892 			\end{center}
       
   893 			We know that $a$ must be nullable,
       
   894 			allowing us to call $\mkeps$ and get
       
   895 			\[
       
   896 				(a, []) \rightarrow \mkeps \; a.
       
   897 			\]
       
   898 			Also by inductive hypothesis
       
   899 			\[
       
   900 				(b, c::s) \rightarrow \inj\; b \; c \; v_b
       
   901 			\]
       
   902 			holds.
       
   903 			In addition, as
       
   904 			$\Right \;v_b$  instead of $\Left \ldots$ is 
       
   905 			the POSIX value for $v$, it must be the case
       
   906 			that $s \notin L \;( (a\backslash c)\cdot b)$.
       
   907 			This tells us that 
       
   908 			\[
       
   909 				\nexists s_3 \; s_4.
       
   910 				s_3 @s_4 = s  \land s_3 \in L \; (a\backslash c) 
       
   911 				\land s_4 \in L \; b
       
   912 			\]
       
   913 			which translates to
       
   914 			\[
       
   915 				\nexists s_3 \; s_4. \; s_3 \neq [] \land
       
   916 				s_3 @s_4 = c::s  \land s_3 \in L \; a 
       
   917 				\land s_4 \in L \; b.
       
   918 			\]
       
   919 			(Which basically says there cannot be a longer 
       
   920 			initial split for $s$ other than the empty string.)
       
   921 			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
       
   922 			as the POSIX value for $a\cdot b$.
       
   923 	\end{itemize}
       
   924 	The star case can be proven similarly.
       
   925 \end{proof}
       
   926 \noindent
       
   927 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together
       
   928 by following the procedure outlined in the diagram \ref{graph:inj},
   790 and taking into consideration the possibility of a non-match,
   929 and taking into consideration the possibility of a non-match,
   791 we have a lexer with the following recursive definition:
   930 a lexer can be built with the following recursive definition:
   792 \begin{center}
   931 \begin{center}
   793 \begin{tabular}{lcl}
   932 \begin{tabular}{lcl}
   794 $\lexer \; r \; [] $ & $=$ & $\textit{if} (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
   933 	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
   795 $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer (r\backslash c) s) \textit{of} $\\
   934 	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
   796 & & $\quad \None \implies \None$\\
   935 	& & $\quad \phantom{\mid}\; \None \implies \None$\\
   797 & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
   936 	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
   798 \end{tabular}
   937 \end{tabular}
   799 \end{center}
   938 \end{center}
   800  \noindent
   939 \noindent
   801  The central property of the $\lexer$ is that it gives the correct result by
   940 The central property of the $\lexer$ is that it gives the correct result by
   802  $\POSIX$ standards:
   941 $\POSIX$ standards:
   803  \begin{theorem}
   942 \begin{theorem}
   804 	 The $\lexer$ based on derivatives and injections is both sound and complete:
   943 	The $\lexer$ based on derivatives and injections is correct: 
   805  \begin{tabular}{lcl}
   944 	\begin{center}
   806 	 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
   945 		\begin{tabular}{lcl}
   807 	 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
   946 			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
   808  \end{tabular}
   947 			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
   809  \end{theorem}
   948 		\end{tabular}
   810  
   949 	\end{center}
   811  
   950 \end{theorem} 
   812  \begin{proof}
   951 \begin{proof}
   813  By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
   952 By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
   814  The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
   953 The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
   815  by lemma \ref{injPosix}.
   954 by lemma \ref{injPosix}.
   816  \end{proof}
   955 \end{proof}
   817 
       
   818 
       
   819 We now give a pictorial view of the algorithm (
       
   820 For convenience, we employ the following notations: the regular
       
   821 expression we start with is $r_0$, and the given string $s$ is composed
       
   822 of characters $c_0 c_1 \ldots c_{n-1}$. The
       
   823 values built incrementally by \emph{injecting} back the characters into the
       
   824 earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
       
   825 are always in the same subscript, i.e. $\vdash v_i : r_i$):
       
   826 
       
   827 \begin{ceqn}
       
   828 \begin{equation}\label{graph:2}
       
   829 \begin{tikzcd}
       
   830 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
   831 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   832 \end{tikzcd}
       
   833 \end{equation}
       
   834 \end{ceqn}
       
   835 
       
   836 \noindent
   956 \noindent
   837 As we did earlier in this chapter on the matcher, one can 
   957 As we did earlier in this chapter on the matcher, one can 
   838 introduce simplification on the regex.
   958 introduce simplification on the regex.
   839 However, now we need to do a backward phase and make sure
   959 However, now one needs to do a backward phase and make sure
   840 the values align with the regular expressions.
   960 the values align with the regular expressions.
   841 Therefore one has to
   961 Therefore one has to
   842 be careful not to break the correctness, as the injection 
   962 be careful not to break the correctness, as the injection 
   843 function heavily relies on the structure of the regexes and values
   963 function heavily relies on the structure of the regexes and values
   844 being correct and matching each other.
   964 being correct and matching each other.
   845 It can be achieved by recording some extra rectification functions
   965 It can be achieved by recording some extra rectification functions
   846 during the derivatives step, and applying these rectifications in 
   966 during the derivatives step, and applying these rectifications in 
   847 each run during the injection phase.
   967 each run during the injection phase.
   848 
   968 With extra care
   849 \ChristianComment{Do I introduce the lexer with rectification here?}
   969 one can show that POSIXness will not be affected---although it is much harder
   850 And we can prove that the POSIX value of how
       
   851 regular expressions match strings will not be affected---although it is much harder
       
   852 to establish. 
   970 to establish. 
   853 Some initial results in this regard have been
   971 Some initial results in this regard have been
   854 obtained in \cite{AusafDyckhoffUrban2016}. 
   972 obtained in \cite{AusafDyckhoffUrban2016}. 
   855 
   973 
   856 However, even with these simplification rules, we could still end up in
   974 However, with all the simplification rules allowed
   857 trouble, when we encounter cases that require more involved and aggressive
   975 in an injection-based lexer, one could still end up in
   858 simplifications.
   976 trouble, when cases that require more involved and aggressive
   859 \section{A Case Requring More Aggressive Simplification}
   977 simplifications arise.
       
   978 \section{A Case Requring More Aggressive Simplifications}
   860 For example, when starting with the regular
   979 For example, when starting with the regular
   861 expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
   980 expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
   862 w.r.t.~the character $a$, one obtains a derivative regular expression
   981 w.r.t.~the character $a$, one obtains a derivative regular expression
   863 with more than 9000 nodes (when viewed as a tree)
   982 with more than 9000 nodes (when viewed as a tree)
   864 even with simplification.
   983 even with simplification.
   874 \end{axis}
   993 \end{axis}
   875 \end{tikzpicture} 
   994 \end{tikzpicture} 
   876 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
   995 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
   877 \end{figure}\label{fig:BetterWaterloo}
   996 \end{figure}\label{fig:BetterWaterloo}
   878    
   997    
   879 That is because our lexing algorithm currently keeps a lot of 
   998 That is because Sulzmann and Lu's 
       
   999 injection-based lexing algorithm keeps a lot of 
   880 "useless" values that will not be used. 
  1000 "useless" values that will not be used. 
   881 These different ways of matching will grow exponentially with the string length.
  1001 These different ways of matching will grow exponentially with the string length.
   882 
  1002 Take 
   883 For $r= (a^*\cdot a^*)^*$ and  
  1003 \[
   884 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
  1004 	r= (a^*\cdot a^*)^* \quad and \quad
   885 if we do not allow any empty iterations in its lexical values,
  1005 	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
   886 there will be $n - 1$ "splitting points" on $s$ we can independently choose to 
  1006 \]
   887 split or not so that each sub-string
  1007 as an example.
   888 segmented by those chosen splitting points will form different iterations.
  1008 This is a highly ambiguous regular expression, with
   889 For example when $n=4$, we give out a few of the many possibilities of splitting:
  1009 many ways to split up the string into multiple segments for
   890 \begin{center}
  1010 different star iteratioins,
   891 \begin{tabular}{lcr}
  1011 and each segment will have multiple ways of splitting between 
   892 $aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
  1012 the two $a^*$ sub-expressions.
   893 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
  1013 It is not surprising there are exponentially many 
   894 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
  1014 distinct lexical values
   895 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
  1015 for such a pair of regular expression and string.
   896 $a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
  1016 A lexer without a good enough strategy to 
   897  & $\textit{etc}.$ &
  1017 deduplicate will naturally
   898  \end{tabular}
       
   899 \end{center}
       
   900 \noindent
       
   901 And for each iteration, there are still multiple ways to split
       
   902 between the two $a^*$s.
       
   903 It is not surprising there are exponentially many lexical values
       
   904 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
       
   905 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   906 A lexer to keep all the possible values will naturally 
       
   907 have an exponential runtime on ambiguous regular expressions.
  1018 have an exponential runtime on ambiguous regular expressions.
   908 With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
  1019 
   909 of a match. This means Sulzmann and Lu's injection-based algorithm 
       
   910  exponential by nature.
       
   911 Somehow one has to make sure which
  1020 Somehow one has to make sure which
   912  lexical values are $\POSIX$ and must be kept in a lexing algorithm.
  1021  lexical values are $\POSIX$ and must be kept in a lexing algorithm.
   913 
       
   914 
       
   915  For example, the above $r= (a^*\cdot a^*)^*$  and 
  1022  For example, the above $r= (a^*\cdot a^*)^*$  and 
   916 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
  1023 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
   917 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
  1024 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
   918 We want to keep this value only, and remove all the regular expression subparts
  1025 We want to keep this value only, and remove all the regular expression subparts
   919 not corresponding to this value during lexing.
  1026 not corresponding to this value during lexing.