ChengsongTanPhdThesis/main.tex
changeset 653 bc5571c38d1f
parent 646 56057198e4f5
child 664 ba44144875b1
equal deleted inserted replaced
652:a4d692a9a289 653:bc5571c38d1f
   289 
   289 
   290 
   290 
   291 \begin{abstract}
   291 \begin{abstract}
   292 \addchaptertocentry{\abstractname} % Add the abstract to the table of contents
   292 \addchaptertocentry{\abstractname} % Add the abstract to the table of contents
   293 %\addchap{Abstract} 
   293 %\addchap{Abstract} 
   294 \textbf{Problem: not like an abstract, more like a summary}
   294 \marginpar{\em Reviewer feedback:  Problem: not like an abstract, more like a summary\\
   295 \textbf{Goal for new abstract: more high-level and abstract, tell the problem and solution in a concise way.}
   295 New abstract: more high-level and abstract, tell the problem and solution in a concise way.
   296 
   296 }
   297 New abstract:\\
       
   298 %Modern computer systems rely on lexing for essential applications such as
   297 %Modern computer systems rely on lexing for essential applications such as
   299 %compilers, IDEs, file systems, and network intrusion detection systems.
   298 %compilers, IDEs, file systems, and network intrusion detection systems.
   300 %These applications require correctness with respect to
   299 %These applications require correctness with respect to
   301 %the POSIX standard and high performance.
   300 %the POSIX standard and high performance.
   302 %%While existing implementations of lexers often achieve high performance,
   301 %%While existing implementations of lexers often achieve high performance,
   314 While a formalised correctness proof for Sulzmann and Lu's algorithm already exists, this proof does not include any of the crucial simplification rules. These simplification rules are however necessary in order to have an acceptable runtime for this algorithm. Our version of the simplification rules includes a number of fixes and improvements: one problem we fix has to do with their use of the nub function that does not remove non-trivial duplicates. We improve the simplification rules by formulating them as simple recursive function and also by simplifying more instances of regular expressions.  As a result we can establish a bound on the size of derivatives. Our proofs are formalised in Isabelle/HOL.
   313 While a formalised correctness proof for Sulzmann and Lu's algorithm already exists, this proof does not include any of the crucial simplification rules. These simplification rules are however necessary in order to have an acceptable runtime for this algorithm. Our version of the simplification rules includes a number of fixes and improvements: one problem we fix has to do with their use of the nub function that does not remove non-trivial duplicates. We improve the simplification rules by formulating them as simple recursive function and also by simplifying more instances of regular expressions.  As a result we can establish a bound on the size of derivatives. Our proofs are formalised in Isabelle/HOL.
   315 
   314 
   316 
   315 
   317 
   316 
   318 Old abstract:
   317 Old abstract:
   319 This thesis is about regular expressions and derivatives. It combines functional algorithms and their formal verification in the Isabelle/HOL theorem prover. 
   318 %This thesis is about regular expressions and derivatives. It combines functional algorithms and their formal verification in the Isabelle/HOL theorem prover. 
   320 Classic results say that regular expression matching should be 
   319 %Classic results say that regular expression matching should be 
   321 linear with respect to the input. The size of the regular expressions
   320 %linear with respect to the input. The size of the regular expressions
   322 are often treated as a constant factor.
   321 %are often treated as a constant factor.
   323 However with some regular expressions and inputs, existing implementations 
   322 %However with some regular expressions and inputs, existing implementations 
   324 often suffer from non-linear or even exponential running time, 
   323 %often suffer from non-linear or even exponential running time, 
   325 giving rise to ReDoS (regular expression denial-of-service ) attacks. 
   324 %giving rise to ReDoS (regular expression denial-of-service ) attacks. 
   326 To avoid these attacks, regular expression matchers and lexers with 
   325 %To avoid these attacks, regular expression matchers and lexers with 
   327 formalised correctness and running time related
   326 %formalised correctness and running time related
   328 properties are of interest because the guarantees apply to all inputs, not just a finite
   327 %properties are of interest because the guarantees apply to all inputs, not just a finite
   329 number of empirical test cases.
   328 %number of empirical test cases.
   330 
   329 %
   331 Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an “aggressive” simplification function that keeps the size of derivatives finitely bounded. 
   330 %Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an “aggressive” simplification function that keeps the size of derivatives finitely bounded. 
   332 Our simplification function is more aggressive than the one by Sulzmann and Lu.
   331 %Our simplification function is more aggressive than the one by Sulzmann and Lu.
   333 We also fix a problem in Sulzmann and Lu's simplification to do with their use of
   332 %We also fix a problem in Sulzmann and Lu's simplification to do with their use of
   334 the $\textit{nub}$ function which does not remove non-trivial duplicates.
   333 %the $\textit{nub}$ function which does not remove non-trivial duplicates.
   335 Without simplification the size of some derivatives can grow arbitrarily big resulting in an extremely slow lexing algorithm.
   334 %Without simplification the size of some derivatives can grow arbitrarily big resulting in an extremely slow lexing algorithm.
   336 In this thesis we describe a variant of Sulzmann and Lu’s algorithm: Our variant is a recursive functional program, whereas Sulzmann and Lu’s version involves a fixpoint construction. We (i) prove in Isabelle/HOL that our algorithm is correct and generates unique POSIX values; we also (ii) establish a finite bound for the size of the derivatives for every input string; we also
   335 %In this thesis we describe a variant of Sulzmann and Lu’s algorithm: Our variant is a recursive functional program, whereas Sulzmann and Lu’s version involves a fixpoint construction. We (i) prove in Isabelle/HOL that our algorithm is correct and generates unique POSIX values; we also (ii) establish a finite bound for the size of the derivatives for every input string; we also
   337 (iii) give a program and a conjecture that the finite 
   336 %(iii) give a program and a conjecture that the finite 
   338 bound can be improved to be cubic if stronger simplification rules are applied. 
   337 %bound can be improved to be cubic if stronger simplification rules are applied. 
   339 
   338 
   340 
   339 
   341 
   340 
   342 
   341 
   343 
   342