chap4 comments done
authorChengsong
Mon, 12 Sep 2022 23:32:18 +0200
changeset 600 fd068f39ac23
parent 599 a5f666410101
child 601 ce4e5151a836
chap4 comments done
ChengsongTanPhdThesis/Chapters/Bitcoded2.tex
ChengsongTanPhdThesis/Chapters/Introduction.tex
ChengsongTanPhdThesis/main.tex
thys4/posix/FBound.thy
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Sat Sep 10 22:30:22 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Mon Sep 12 23:32:18 2022 +0200
@@ -11,39 +11,43 @@
 
 
 
-In this chapter we introduce the simplifications
+In this chapter we introduce simplifications
 on annotated regular expressions that can be applied to 
 each intermediate derivative result. This allows
 us to make $\blexer$ much more efficient.
-We contrast this simplification function 
-with Sulzmann and Lu's original
-simplifications, indicating the simplicity of our algorithm and
-improvements we made, demostrating
-the usefulness and reliability of formal proofs on algorithms.
+Sulzmann and Lu already proposed simplifications for bit-coded 
+regular expressions, but their simplification functions were inefficient.
+We contrast our simplification function 
+with Sulzmann and Lu's, indicating the simplicity of our algorithm.
+This provides another case for the usefulness 
+and reliability of formal proofs on algorithms.
 These ``aggressive'' simplifications would not be possible in the injection-based 
 lexing we introduced in chapter \ref{Inj}.
-We then go on to prove the correctness with the improved version of 
+We then prove the correctness of the improved version of 
 $\blexer$, called $\blexersimp$, by establishing 
 $\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
 
 \section{Simplifications by Sulzmann and Lu}
-The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
-and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns
-are scattered around different levels, and therefore requires 
-de-duplication at different levels:
+Consider the derivatives of examples such as $(a^*a^*)^*$
+and $(a^* + (aa)^*)^*$:
 \begin{center}
 	$(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\
 	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$
 \end{center}
 \noindent
-As we have already mentioned in \ref{eqn:growth2},
-a simple-minded simplification function cannot simplify
+As can be seen, there is a lot of duplication 
+in the example we have already mentioned in 
+\ref{eqn:growth2}.
+A simple-minded simplification function cannot simplify
 the third regular expression in the above chain of derivative
-regular expressions:
+regular expressions, namely
 \begin{center}
 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
 \end{center}
-one would expect a better simplification function to work in the 
+because the duplicates are
+not next to each other and therefore the rule
+$r + r \rightarrow r$ does not fire.
+One would expect a better simplification function to work in the 
 following way:
 \begin{gather*}
 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
@@ -61,10 +65,16 @@
 	)\cdot(a^*a^*)^*  
 \end{gather*}
 \noindent
-This motivating example came from testing Sulzmann and Lu's 
+In the first step, the nested alternative regular expression
+$(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.
+Now the third term $a^*$ is clearly identified as a duplicate
+and therefore removed in the second step. This causes the two
+top-level terms to become the same, so that the second 
+$(a^*a^*+a^*)\cdot(a^*a^*)^*$ is removed in the final step.\\
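+To make these three steps concrete, here is a small
+Scala-style sketch of the flattening and de-duplication steps
+on plain (un-annotated) regular expressions; the names
+\texttt{flts} and \texttt{simpAlts} are used only for
+illustration, and the definitions actually used on annotated
+regular expressions appear later in this chapter:
+\begin{verbatim}
+// plain regular expressions; ALTS carries a list so that
+// nested alternatives can be flattened into a single list
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class ALTS(rs: List[Rexp]) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// step 1: spill out nested alternatives
+def flts(rs: List[Rexp]): List[Rexp] = rs.flatMap {
+  case ALTS(rs1) => flts(rs1)
+  case r         => List(r)
+}
+
+// step 2: remove duplicates; step 3: a singleton
+// alternative collapses to its only element
+def simpAlts(rs: List[Rexp]): Rexp = flts(rs).distinct match {
+  case Nil     => ZERO
+  case List(r) => r
+  case rs1     => ALTS(rs1)
+}
+\end{verbatim}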
+This motivating example is from testing Sulzmann and Lu's 
 algorithm: their simplification does 
 not work!
-We quote their $\textit{simp}$ function verbatim here:
+Consider their simplification function (using our notation):
 \begin{center}
 	\begin{tabular}{lcl}
 		$\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & 
@@ -85,10 +95,10 @@
 	\end{tabular}
 \end{center}
 \noindent
-the $\textit{zeroable}$ predicate 
-which tests whether the regular expression
+where the $\textit{zeroable}$ predicate, 
+which tests whether the regular expression
 is equivalent to $\ZERO$,
-is defined as:
+can be defined as:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
@@ -190,6 +200,23 @@
 The assumption that the size of the regular expressions
 in the algorithm
 would stay below a finite constant is not true.
+There are two main reasons for this:
+(i) the $\textit{nub}$
+function requires two annotated regular expressions 
+to have identical annotations in order
+to qualify as duplicates,
+and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$
+even though both $a^*$ denote the same language;
+(ii) the ``flattening'' only applies to the head of the list
+in the 
+\begin{center}
+	\begin{tabular}{lcl}
+
+		$\simpsulz  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+		$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
+	\end{tabular}
+\end{center}
+\noindent
+clause, and is therefore not thorough enough to simplify all
+the parts of the regular expression that need it.\\
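+A possible remedy for (i), sketched below in Scala-style
+pseudocode, is to compare regular expressions only after their
+annotations have been erased (the function \texttt{erase} and
+the generic \texttt{distinctBy} are assumptions made for this
+illustration; a related \texttt{distinctWith} operation appears
+in the accompanying theory file):
+\begin{verbatim}
+// keep the first occurrence of each element, where two
+// elements count as equal if they map to the same key
+def distinctBy[A, B](xs: List[A], f: A => B,
+                     acc: Set[B] = Set()): List[A] = xs match {
+  case Nil => Nil
+  case x :: rest =>
+    if (acc.contains(f(x))) distinctBy(rest, f, acc)
+    else x :: distinctBy(rest, f, acc + f(x))
+}
+
+// deduplicate annotated regular expressions modulo their
+// annotations, so that e.g. SZZ a* and SZS a* are merged:
+// distinctBy(rs, erase)
+\end{verbatim}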
 In addition to that, even if the sizes of the regular expressions
 do stay finite, one has to take into account that
 the $\simpsulz$ function is applied many times
@@ -209,14 +236,14 @@
 by our solution. These solutions come with correctness
 statements that are backed up by formal proofs.
 \subsection{Flattening Nested Alternatives}
-The idea behind the 
+The idea behind the clause
 \begin{center}
 $\simpsulz  \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
 	       _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
 \end{center}
-clause is that it allows
+is that it allows
 duplicate removal of regular expressions at different
-levels.
+``levels'' of alternatives.
 For example, this would help with the
 following simplification:
 
@@ -226,15 +253,15 @@
 The problem here is that only the head element
 is ``spilled out'',
 whereas we would want to flatten
-an entire list to open up possibilities for further simplifications.\\
+an entire list to open up possibilities for further simplifications.
 Not flattening the rest of the elements also means that
 the later de-duplication process 
-does not fully remove apparent duplicates.
+does not fully remove further duplicates.
 For example,
 using $\simpsulz$ we could not 
 simplify
 \begin{center}
-$((a^* a^*)+ (a^* + a^*))\cdot (a^*a^*)^*+
+	$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$
 \end{center}
 due to the underlined part not being in the first element
@@ -427,17 +454,16 @@
 This algorithm keeps the regular expression size small.
 
 
-\subsection{$(a+aa)^*$ and $(a^*\cdot a^*)^*$  against 
-$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
-For example,
-with our simplification the
+\subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
+After Simplification}
+With our simplification, the size of the derivatives in the
 previous $(a^*a^*)^*$ example
 where $\simpsulz$ could not
-stop the fast growth (over
+prevent the fast growth (over
 3 million nodes just below $20$ input length)
-will be reduced to just 15 and stays constant, no matter how long the
+is reduced to just 15 and stays constant no matter how long the
 input string is.
-This is demonstrated in the graphs below.
+This is shown in the graphs below.
 \begin{figure}[H]
 \begin{center}
 \begin{tabular}{ll}
@@ -482,12 +508,11 @@
 \section{Correctness of $\blexersimp$}
 In this section we give details
 of the correctness proof of $\blexersimp$,
-an important contribution of this thesis.\\
+one of the contributions of this thesis.\\
 We first introduce the rewriting relation \emph{rrewrite}
 ($\rrewrite$) between two regular expressions,
 which expresses an atomic
-simplification step from the left-hand-side
-to the right-hand-side.
+simplification.
 We then prove properties about
 this rewriting relation and its reflexive transitive closure.
 Finally we leverage these properties to show
@@ -496,7 +521,7 @@
 
 \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
 In the $\blexer$'s correctness proof, we
-did not directly derive the fact that $\blexer$ gives out the POSIX value,
+did not directly derive the fact that $\blexer$ generates the POSIX value,
 but first proved that $\blexer$ is linked with $\lexer$.
 Then we re-use
 the correctness of $\lexer$
@@ -511,8 +536,8 @@
 produces the same output as $\blexer \; r\; s$,
 and then piecing it together with 
 $\blexer$'s correctness to achieve our main
-theorem:\footnote{ the case when 
-$s$ is not in $L \; r$, is routine to establish }
+theorem:\footnote{The case when 
+$s$ is not in $L \; r$ is routine to establish.}
 \begin{center}
 	$(r, s) \rightarrow v \; \;   \textit{iff} \;\;  \blexersimp \; r \; s = v$
 \end{center}
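+\noindent
+Schematically, the argument chains these two facts as follows:
+\begin{center}
+	$\blexersimp \; r \; s \;=\; \blexer \; r \; s \;=\; v \qquad \textit{when} \; (r, s) \rightarrow v$
+\end{center}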
@@ -526,7 +551,7 @@
 \end{center}
 where each rewrite step, written $\rightsquigarrow$,
 is an ``atomic'' simplification that
-cannot be broken down any further:
+is similar to a small-step reduction in operational semantics:
 \begin{figure}[H]
 \begin{mathpar}
 	\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
@@ -592,8 +617,10 @@
 \caption{The Reflexive Transitive Closure of 
 $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
 \end{figure}
-Two rewritable terms will remain rewritable to each other
-even after a derivative is taken:
+%Two rewritable terms will remain rewritable to each other
+%even after a derivative is taken:
+Rewriting is preserved under derivatives,
+namely
 \begin{center}
 	$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
 \end{center}
@@ -956,7 +983,7 @@
 
 \subsection{Comments on the Proof Techniques Used}
 Straightforward and simple as the proof may seem,
-the efforts we spent obtaining it was far from trivial.\\
+the efforts we spent obtaining it were far from trivial.\\
 We initially attempted to re-use the argument 
 in \cref{flex_retrieve}. 
 The problem was that both functions $\inj$ and $\retrieve$ require 
@@ -1007,6 +1034,16 @@
 \end{center}
 as equal, because they were both re-written
 from the same expression.\\
+The simplification rewriting rules
+given in \ref{rrewriteRules} are by no means
+final:
+one could come up with new rules
+such as 
+$\SEQ \; r_1 \; (\SEQ \; r_2 \; r_3) \rightarrow
+\SEQs \; [r_1, r_2, r_3]$.
+This does not fit with the proof technique
+of our main theorem, but seems not to violate the POSIX
+property.\\
 Having a correctness property is good. 
 But we would also like a guarantee that the lexer is not slow in 
 some sense, for example, not grinding to a halt regardless of the input.
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sat Sep 10 22:30:22 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Mon Sep 12 23:32:18 2022 +0200
@@ -42,6 +42,8 @@
 
 \newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
 
+\def\SEQ{\textit{SEQ}}
+\def\SEQs{\textit{SEQs}}
 \def\case{\textit{case}}
 \def\sequal{\stackrel{\mbox{\scriptsize rsimp}}{=}}
 \def\rsimpalts{\textit{rsimp}_{ALTS}}
--- a/ChengsongTanPhdThesis/main.tex	Sat Sep 10 22:30:22 2022 +0100
+++ b/ChengsongTanPhdThesis/main.tex	Mon Sep 12 23:32:18 2022 +0200
@@ -290,8 +290,14 @@
 \addchaptertocentry{\abstractname} % Add the abstract to the table of contents
 %\addchap{Abstract} 
 This thesis is about regular expressions and derivatives. It combines functional algorithms and their formal verification in the Isabelle/HOL theorem prover. 
-
-Theoretical results say that regular expression matching should be linear with respect to the input. Under a certain class of regular expressions and inputs though, practical implementations often suffer from non-linear or even exponential running time, allowing ReDoS (regular expression denial-of-service ) attacks. This makes levers with formalised properties such as correctness and time complexity appealing.
+Theoretical results say that regular expression matching should be 
+linear with respect to the input. 
+However, with some regular expressions and inputs, existing implementations 
+often suffer from non-linear or even exponential running time, 
+allowing for example ReDoS (regular expression denial-of-service) attacks. 
+To avoid these attacks, lexers with formalised properties about correctness and 
+running time become appealing, because the guarantees apply to all inputs, not just 
+a few empirical test cases.
 
 Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an “aggressive” simplification function that keeps the size of derivatives finitely bounded. Without simplification the size of some derivatives can grow arbitrarily big resulting in an extremely slow lexing algorithm. 
 
--- a/thys4/posix/FBound.thy	Sat Sep 10 22:30:22 2022 +0100
+++ b/thys4/posix/FBound.thy	Mon Sep 12 23:32:18 2022 +0200
@@ -560,7 +560,16 @@
 
 lemma scompsize_aux:
   shows "s_complexity (AALTs bs (rs1 @ distinctWith rs2 eq1 (set rs1))) \<le> s_complexity (AALTs bs (rs1 @ rs2))"
-  sorry
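+  (* proof sketch: induction on rs2, generalising rs1; the case where the
+     head of rs2 already has an eq1-duplicate in rs1 is discharged with
+     trans_le_add2, the remaining case with List.set_insert *)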
+  apply(induct rs2 arbitrary: rs1)
+   apply simp
+  apply simp
+  apply(case_tac "\<exists>x \<in> set rs1. a ~1 x")
+  using trans_le_add2 apply blast
+  apply simp
+  
+  by (metis List.set_insert)
+
+  
 
 
 
@@ -589,7 +598,28 @@
    apply auto
   by (metis eq_imp_le le_imp_less_Suc less_imp_le_nat rerase_fuse scomp_rerase2)
 
+lemma prf22:
+  shows "\<lbrakk>r1 \<le>1 r2; \<not> r1 ~1 r2\<rbrakk> \<Longrightarrow> s_complexity r1 \<noteq> s_complexity r2"
+  apply(induct rule:eq1.induct)
+                      apply simp+
+                      apply auto
 
+  sorry
+
+
+
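+(* a \<le>1 step either relates ~1-equivalent terms or strictly
+   decreases s_complexity *)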
+lemma compl_rrewrite_down1:
+  shows "r1 \<le>1 r2 \<Longrightarrow> r1 ~1 r2 \<or> s_complexity r1 < s_complexity r2"
+    apply(subgoal_tac "s_complexity r1 \<le> s_complexity r2")
+  apply(case_tac "r1 ~1 r2")
+    apply simp
+   apply(subgoal_tac "s_complexity r1 \<noteq> s_complexity r2")
+    apply simp
+  using prf22 apply blast
+  by (simp add: scomp_size_reduction)
+  
+lemma leq1_eq1_equal:
+  shows "\<lbrakk>r1 \<le>1 r2;
 
 
 
@@ -598,6 +628,7 @@
 
 lemma compl_rrewrite_down:
   shows "r1 \<le>1 r2 \<Longrightarrow>r1 = r2 \<or> s_complexity r1 < s_complexity r2"
+  apply(subgoal_tac "s_complexity r1 \<le> s_complexity r2")
   
   apply(induct rule: leq1.induct)
                     apply simp