lexing: changeset 568:7a579f5533f8

--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Thu Jul 14 14:57:32 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Sat Jul 16 18:34:46 2022 +0100
@@ -751,92 +751,212 @@
 expression ${r_{i}}$, before the character is chopped off, 
 a character ${c_{i}}$, the character we want to inject back and 
 the third argument $v_{i+1}$ the value we want to inject into. 
-The result of this function is a new value $v_i$.
+The result of an application 
+$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
+\[
+	(s_i, r_i) \rightarrow v_i
+\]
+holds.
 The definition of $\textit{inj}$ is as follows: 
 \begin{center}
-\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
-  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
+  $\textit{inj}\;(c)\;c\,Empty$            & $\dn$ & $\Char\,c$\\
+  $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left  \; (\textit{inj}\; r_1 \; c\,v)$\\
+  $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c  \; v)$\\
+  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$  & 
+  $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
+  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & 
+  $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
+  $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
+  $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
 \end{tabular}
 \end{center}
 
 \noindent 
-This definition is by recursion on the ``shape'' of regular
-expressions and values. 
-The clauses do one thing--identifying the ``hole'' on a
-value to inject the character back into.
-For instance, in the last clause for injecting back to a value
-that would turn into a new star value that corresponds to a star,
-we know it must be a sequence value. And we know that the first 
-value of that sequence corresponds to the child regex of the star
-with the first character being chopped off--an iteration of the star
-that had just been unfolded. This value is followed by the already
-matched star iterations we collected before. So we inject the character 
-back to the first value and form a new value with this latest iteration
-being added to the previous list of iterations, all under the $\Stars$
-top level.
-The POSIX value is maintained throughout the process:
-\begin{lemma}
-$(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$
-\end{lemma}\label{injPosix}
+The function does a recursion on 
+the shape of regular
+expression $r_i$ and value $v_{i+1}$. 
+Intuitively, each clause analyses 
+how $r_i$ could have transformed when being 
+derived by $c$, identifying which subpart
+of $v_{i+1}$ has the ``hole'' 
+to inject the character back into.
+Once the character is
+injected back to that sub-value; 
+$\inj$ assembles all things together
+to form a new value.
+
+For instance, the last clause is an
+injection into a sequence value $v_{i+1}$
+whose second child
+value is a star, and the shape of the 
+regular expression $r_i$ before injection 
+is a star.
+We therefore know 
+the derivative 
+starts on a star and ends as a sequence:
+\[
+	(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
+\]
+during which an iteration of the star
+had just been unfolded, giving the below
+value inhabitation relation:
+\[
+	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
+\]
+The value list $vs$ corresponds to
+matched star iterations,
+and the ``hole'' lies in $v$ because
+\[
+	\vdash v: r\backslash c.
+\]
+Finally, 
+$\inj \; r \;c \; v$ is prepended
+to the previous list of iterations, and then
+wrapped under the $\Stars$
+constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
 
+Recall that lemma 
+\ref{mePosix} tells us
+$\mkeps$ always selects the POSIX matching among
+multiple values that flatten to the empty string.
+Now $\inj$ preserves the POSIXness, provided
+the value before injection is POSIX:
+\begin{lemma}\label{injPosix}
+	If
+	\[
+		(r \backslash c, s) \rightarrow v 
+	\]
+	then
+	\[
+		(r, c :: s) \rightarrow (\inj r \; c\; v).
+	\]
+\end{lemma}
+\begin{proof}
+	By induction on $r$.
+	The involved cases are sequence and star.
+	When $r = a \cdot b$, there could be
+	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
+	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
+	case.
+	\begin{itemize}
+		\item
+			$v = \Seq \; v_a \; v_b$.\\
+			The ``not nullable'' clause of the $\inj$ function is taken:
+			\begin{center}
+				\begin{tabular}{lcl}
+					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
+					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
+				\end{tabular}
+			\end{center}
+			We know that there exists a unique pair of
+			$s_a$ and $s_b$ satisfaying	
+				$(a \backslash c, s_a) \rightarrow v_a$,
+				$(b , s_b) \rightarrow v_b$, and
+				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
+				L \; (a\backslash c) \land
+				s_4 \in L \; b$.
+			The last condition gives us
+			$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in 
+				L \; a \land
+				s_4 \in L \; b$.
+			By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
+			and this gives us
+			\[
+				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
+			\]
+		\item
+			$v = \Left \; (\Seq \; v_a \; v_b)$\\
+			The argument is almost identical to the above case,	
+			except that a different clause of $\inj$ is taken:
+			\begin{center}
+				\begin{tabular}{lcl}
+					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
+					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
+				\end{tabular}
+			\end{center}
+			With a similar reasoning, 
 
-Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
+			\[
+				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
+			\]
+			again holds.
+		\item 
+			$v = \Right \; v_b$\\
+			Again the injection result would be 
+			\begin{center}
+				\begin{tabular}{lcl}
+					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
+					& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
+				\end{tabular}
+			\end{center}
+			We know that $a$ must be nullable,
+			allowing us to call $\mkeps$ and get
+			\[
+				(a, []) \rightarrow \mkeps \; a.
+			\]
+			Also by inductive hypothesis
+			\[
+				(b, c::s) \rightarrow \inj\; b \; c \; v_b
+			\]
+			holds.
+			In addition, as
+			$\Right \;v_b$  instead of $\Left \ldots$ is 
+			the POSIX value for $v$, it must be the case
+			that $s \notin L \;( (a\backslash c)\cdot b)$.
+			This tells us that 
+			\[
+				\nexists s_3 \; s_4.
+				s_3 @s_4 = s  \land s_3 \in L \; (a\backslash c) 
+				\land s_4 \in L \; b
+			\]
+			which translates to
+			\[
+				\nexists s_3 \; s_4. \; s_3 \neq [] \land
+				s_3 @s_4 = c::s  \land s_3 \in L \; a 
+				\land s_4 \in L \; b.
+			\]
+			(Which basically says there cannot be a longer 
+			initial split for $s$ other than the empty string.)
+			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
+			as the POSIX value for $a\cdot b$.
+	\end{itemize}
+	The star case can be proven similarly.
+\end{proof}
+\noindent
+Putting all the functions $\inj$, $\mkeps$, $\backslash$ together
+by following the procedure outlined in the diagram \ref{graph:inj},
 and taking into consideration the possibility of a non-match,
-we have a lexer with the following recursive definition:
+a lexer can be built with the following recursive definition:
 \begin{center}
 \begin{tabular}{lcl}
-$\lexer \; r \; [] $ & $=$ & $\textit{if} (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
-$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer (r\backslash c) s) \textit{of} $\\
-& & $\quad \None \implies \None$\\
-& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
+	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
+	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
+	& & $\quad \phantom{\mid}\; \None \implies \None$\\
+	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
 \end{tabular}
 \end{center}
- \noindent
- The central property of the $\lexer$ is that it gives the correct result by
- $\POSIX$ standards:
- \begin{theorem}
-	 The $\lexer$ based on derivatives and injections is both sound and complete:
- \begin{tabular}{lcl}
-	 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
-	 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
- \end{tabular}
- \end{theorem}
- 
- 
- \begin{proof}
- By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
- The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
- by lemma \ref{injPosix}.
- \end{proof}
-
-
-We now give a pictorial view of the algorithm (
-For convenience, we employ the following notations: the regular
-expression we start with is $r_0$, and the given string $s$ is composed
-of characters $c_0 c_1 \ldots c_{n-1}$. The
-values built incrementally by \emph{injecting} back the characters into the
-earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
-are always in the same subscript, i.e. $\vdash v_i : r_i$):
-
-\begin{ceqn}
-\begin{equation}\label{graph:2}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
-v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
-
+\noindent
+The central property of the $\lexer$ is that it gives the correct result by
+$\POSIX$ standards:
+\begin{theorem}
+	The $\lexer$ based on derivatives and injections is correct: 
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
+			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
+		\end{tabular}
+	\end{center}
+\end{theorem} 
+\begin{proof}
+By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
+The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
+by lemma \ref{injPosix}.
+\end{proof}
 \noindent
 As we did earlier in this chapter on the matcher, one can 
 introduce simplification on the regex.
-However, now we need to do a backward phase and make sure
+However, now one needs to do a backward phase and make sure
 the values align with the regular expressions.
 Therefore one has to
 be careful not to break the correctness, as the injection 
@@ -845,18 +965,17 @@
 It can be achieved by recording some extra rectification functions
 during the derivatives step, and applying these rectifications in 
 each run during the injection phase.
-
-\ChristianComment{Do I introduce the lexer with rectification here?}
-And we can prove that the POSIX value of how
-regular expressions match strings will not be affected---although it is much harder
+With extra care
+one can show that POSIXness will not be affected---although it is much harder
 to establish. 
 Some initial results in this regard have been
 obtained in \cite{AusafDyckhoffUrban2016}. 
 
-However, even with these simplification rules, we could still end up in
-trouble, when we encounter cases that require more involved and aggressive
-simplifications.
-\section{A Case Requring More Aggressive Simplification}
+However, with all the simplification rules allowed
+in an injection-based lexer, one could still end up in
+trouble, when cases that require more involved and aggressive
+simplifications arise.
+\section{A Case Requring More Aggressive Simplifications}
 For example, when starting with the regular
 expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
 w.r.t.~the character $a$, one obtains a derivative regular expression
@@ -876,42 +995,30 @@
 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
 \end{figure}\label{fig:BetterWaterloo}
    
-That is because our lexing algorithm currently keeps a lot of 
+That is because Sulzmann and Lu's 
+injection-based lexing algorithm keeps a lot of 
 "useless" values that will not be used. 
 These different ways of matching will grow exponentially with the string length.
+Take 
+\[
+	r= (a^*\cdot a^*)^* \quad and \quad
+	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
+\]
+as an example.
+This is a highly ambiguous regular expression, with
+many ways to split up the string into multiple segments for
+different star iteratioins,
+and each segment will have multiple ways of splitting between 
+the two $a^*$ sub-expressions.
+It is not surprising there are exponentially many 
+distinct lexical values
+for such a pair of regular expression and string.
+A lexer without a good enough strategy to 
+deduplicate will naturally
+have an exponential runtime on ambiguous regular expressions.
 
-For $r= (a^*\cdot a^*)^*$ and  
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
-if we do not allow any empty iterations in its lexical values,
-there will be $n - 1$ "splitting points" on $s$ we can independently choose to 
-split or not so that each sub-string
-segmented by those chosen splitting points will form different iterations.
-For example when $n=4$, we give out a few of the many possibilities of splitting:
-\begin{center}
-\begin{tabular}{lcr}
-$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
-$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
-$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
-$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
-$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
- & $\textit{etc}.$ &
- \end{tabular}
-\end{center}
-\noindent
-And for each iteration, there are still multiple ways to split
-between the two $a^*$s.
-It is not surprising there are exponentially many lexical values
-that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-A lexer to keep all the possible values will naturally 
-have an exponential runtime on ambiguous regular expressions.
-With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
-of a match. This means Sulzmann and Lu's injection-based algorithm 
- exponential by nature.
 Somehow one has to make sure which
  lexical values are $\POSIX$ and must be kept in a lexing algorithm.
-
-
  For example, the above $r= (a^*\cdot a^*)^*$  and 
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.

--- a/ChengsongTanPhdThesis/main.tex	Thu Jul 14 14:57:32 2022 +0100
+++ b/ChengsongTanPhdThesis/main.tex	Sat Jul 16 18:34:46 2022 +0100
@@ -220,32 +220,17 @@
 \begin{abstract}
 \addchaptertocentry{\abstractname} % Add the abstract to the table of contents
 %\addchap{Abstract} 
-This work is about regular expressions and derivatives. It combines functional algorithms and their formal verification in the Isabelle/HOL theorem prover.
+This thesis is about regular expressions and derivatives. It combines functional algorithms and their formal verification in the Isabelle/HOL theorem prover. 
+
+Theoretical results say that regular expression matching should be linear with respect to the input. Under a certain class of regular expressions and inputs though, practical implementations often suffer from non-linear or even exponential running time, allowing ReDoS (regular expression denial-of-service ) attacks. This makes levers with formalised properties such as correctness and time complexity appealing.
+
+Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an “aggressive” simplification function that keeps the size of derivatives finitely bounded. Without simplification the size of some derivatives can grow arbitrarily big resulting in an extremely slow lexing algorithm. 
 
-Theoretical results say that regular expression matching
-should be linear with respect to the input.
-Under a certain class of regular expressions and inputs though,
-practical implementations often suffer from non-linear or even 
-exponential running time,
-allowing ReDoS (regular expression denial-of-service ) attacks.
-The reason behind this is that regex libraries in popular programming
-languages often want to support richer constructs
-than the usual basic regular expressions. Unfortunately, this means that even simple cases
-are "infected" with atrocious running time.
- 
+In this thesis we describe a variant of Sulzmann and Lu’s algorithm: Our variant is a recursive functional program, whereas Sulzmann and Lu’s version involves a fixpoint construction. We (i) prove in Isabelle/HOL that our algorithm is correct and generates unique POSIX values; we also (ii) establish a finite bound for the size of the derivatives. 
+
 
-This work aims to address the above vulnerability by a combination
-of Brzozowski's derivatives and interactive theorem proving. We give an 
-improved version of  Sulzmann and Lu's bit-coded algorithm using 
-derivatives, and give a formal guarantee in terms of correctness and 
-size of derivatives, which we formalised in Isabelle/HOL.
-Then we improve the algorithm with a stronger version of 
-simplification, and prove a time bound linear to input and
-cubic to regular expression size using a technique introduced by
-Antimirov.
-The result is an algorithm with provable guarantees on 
-correctness and finiteness. We believe this is the first 
-work with these two guarantees together.
+
+
 
 \end{abstract}

--- a/thys3/BlexerSimp.thy	Thu Jul 14 14:57:32 2022 +0100
+++ b/thys3/BlexerSimp.thy	Sat Jul 16 18:34:46 2022 +0100
@@ -359,11 +359,38 @@
 done
 
 
-lemma "set (tint (seqFold (erase x42) [erase x43])) = 
+lemma tint_seqFold_eq: shows"set (tint (seqFold (erase x42) [erase x43])) = 
            set (tint (SEQ (erase x42) (erase x43)))"
   apply(case_tac "erase x42 = ONE")
    apply simp
+  using seqFold_concats
   apply simp
+  done
+
+fun top :: "arexp \<Rightarrow> bit list" where
+  "top AZERO = []"
+| "top (AONE bs) = bs"
+| "top (ASEQ bs r1 r2) = bs"
+| "top (ACHAR v va) = v"
+| "top (AALTs v va) = v"
+| "top (ASTAR v va) = v "
+
+
+
+lemma top_bitcodes_preserved_p6:
+  shows "\<lbrakk> r = ASEQ bs a1 a2 \<rbrakk> \<Longrightarrow> p6 as r = AZERO \<or> (\<exists>bs3. top (p6 as r) = bs @ bs3)"
+  apply(induct r arbitrary: as)
+       apply simp
+  apply simp
+     apply simp
+
+  apply(case_tac "a1")
+         apply simp
+
+
+  sorry
+
+
 
 lemma prune6_preserves_fuse:
   shows "fuse bs (p6 as a) = p6 as (fuse bs a)"
@@ -377,7 +404,7 @@
   using fused_bits_at_head
 
     apply simp
-  apply(case_tac "erase x42 = ONE")
+  using tint_seqFold_eq
   apply simp
 
   sorry

author	Chengsong
	Sat, 16 Jul 2022 18:34:46 +0100
changeset 568	7a579f5533f8
parent 567	28cb8089ec36
child 569	5af61c89f51e
child 573	454ced557605

ChengsongTanPhdThesis/Chapters/Inj.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/main.tex		file \| annotate \| diff \| comparison \| revisions
thys3/BlexerSimp.thy		file \| annotate \| diff \| comparison \| revisions