ChengsongTanPhdThesis/main.tex
changeset 638 dd9dde2d902b
parent 629 96e009a446d5
child 646 56057198e4f5
equal deleted inserted replaced
637:e3752aac8ec2 638:dd9dde2d902b
   294 Classic results say that regular expression matching should be 
   294 Classic results say that regular expression matching should be 
   295 linear with respect to the input. The size of the regular expressions
   295 linear with respect to the input. The size of the regular expressions
   296 are often treated as a constant factor.
   296 are often treated as a constant factor.
   297 However with some regular expressions and inputs, existing implementations 
   297 However with some regular expressions and inputs, existing implementations 
   298 often suffer from non-linear or even exponential running time, 
   298 often suffer from non-linear or even exponential running time, 
   299 giving to for example ReDoS (regular expression denial-of-service ) attacks. 
   299 giving rise to ReDoS (regular expression denial-of-service ) attacks. 
   300 To avoid these attacks, lexers with formalised correctness and running time related
   300 To avoid these attacks, regular expression matchers and lexers with 
       
   301 formalised correctness and running time related
   301 properties are of interest because the guarantees apply to all inputs, not just a finite
   302 properties are of interest because the guarantees apply to all inputs, not just a finite
   302 number of empirical test cases.
   303 number of empirical test cases.
   303 
   304 
   304 Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an “aggressive” simplification function that keeps the size of derivatives finitely bounded. Without simplification the size of some derivatives can grow arbitrarily big resulting in an extremely slow lexing algorithm.
   305 Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an “aggressive” simplification function that keeps the size of derivatives finitely bounded. 
       
   306 Our simplification function is more aggressive than the one by Sulzmann and Lu.
       
   307 We also fix a problem in Sulzmann and Lu's simplification to do with their use of
       
   308 the $\textit{nub}$ function which does not remove non-trivial duplicates.
       
   309 Without simplification the size of some derivatives can grow arbitrarily big resulting in an extremely slow lexing algorithm.
   305 In this thesis we describe a variant of Sulzmann and Lu’s algorithm: Our variant is a recursive functional program, whereas Sulzmann and Lu’s version involves a fixpoint construction. We (i) prove in Isabelle/HOL that our algorithm is correct and generates unique POSIX values; we also (ii) establish a finite bound for the size of the derivatives for every input string; we also
   310 In this thesis we describe a variant of Sulzmann and Lu’s algorithm: Our variant is a recursive functional program, whereas Sulzmann and Lu’s version involves a fixpoint construction. We (i) prove in Isabelle/HOL that our algorithm is correct and generates unique POSIX values; we also (ii) establish a finite bound for the size of the derivatives for every input string; we also
   306 (iii) give a program and a conjecture that the finite 
   311 (iii) give a program and a conjecture that the finite 
   307 bound can be improved to be cubic if stronger simplification rules are applied. 
   312 bound can be improved to be cubic if stronger simplification rules are applied. 
   308 
   313 
   309 
   314 
   327 quick to lend a
   332 quick to lend a
   328 helping hand at difficult times. 
   333 helping hand at difficult times. 
   329 I want to thank Doctor Kathrin Stark, my SIGPLAN mentor, for offering brilliant advice
   334 I want to thank Doctor Kathrin Stark, my SIGPLAN mentor, for offering brilliant advice
   330 at the late stage of my PhD. My transition from a PhD student to a postdoc researcher
   335 at the late stage of my PhD. My transition from a PhD student to a postdoc researcher
   331 could not have been so smooth without Kathrin's mentoring.
   336 could not have been so smooth without Kathrin's mentoring.
   332 I want to thank Jeanna Wheeler, my UMO mentor, for helping me regulate my mental health
   337 %I want to thank Jeanna Wheeler, my UMO mentor, for helping me regulate my mental health
   333 and productivity, by being always encouraging
   338 %and productivity, by being always encouraging
   334 and compassionate in her sessions.
   339 %and compassionate in her sessions.
       
   340 I want to thank Jeanna Wheeler for helping me with keeping sane during my time during the PhD and COVID times when an encouraging and compassionate person was very appreciated.
   335 
   341 
   336 I want to thank my father Haiyan Tan and my mother Yunan Cheng,
   342 I want to thank my father Haiyan Tan and my mother Yunan Cheng,
   337 for their unconditional love, and who I have not seen
   343 for their unconditional love, and who I have not seen
   338 face to face for three years. 
   344 face to face for three years. 
   339 I really miss you.
   345 I really miss you.