# HG changeset patch # User Chengsong # Date 1648494004 -3600 # Node ID 6953d2786e7cdd2c5a36c5038b1bdb82a49cb57a # Parent 23818853a710f588c5dd8d7beeaed08fb82a30f6 hi diff -r 23818853a710 -r 6953d2786e7c ChengsongTanPhdThesis/Chapters/Chapter1.tex --- a/ChengsongTanPhdThesis/Chapters/Chapter1.tex Mon Mar 28 00:59:42 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Chapter1.tex Mon Mar 28 20:00:04 2022 +0100 @@ -45,6 +45,10 @@ %---------------------------------------------------------------------------------------- %This part is about regular expressions, Brzozowski derivatives, %and a bit-coded lexing algorithm with proven correctness and time bounds. + +%TODO: look up snort rules to use here--give readers idea of what regexes look like + + Regular expressions are widely used in computer science: be it in IDEs with syntax hightlighting and auto completion, command line tools like $\mathit{grep}$ that facilitates easy @@ -450,36 +454,55 @@ ``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.'' \end{center} -\section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking} +%\section{How people solve problems with regexes} -There is also static analysis work on regular expression that -have potential expoential behavious. Rathnayake and Thielecke +When a regular expression does not behave as intended, +people usually try to rewrite the regex to some equivalent form +or they try to avoid the possibly problematic patterns completely\parencite{Davis18}, +even though many of the patterns flagged as problematic are false positives. +Animated tools to ``debug'' regular expressions +are also quite popular, regexploit\parencite{regexploit2021} and regex101\parencite{regex101} +to name a few. +There is also static analysis work on regular expressions that +aims to detect potentially exponential regex patterns. 
Rathnayake and Thielecke \parencite{Rathnayake2014StaticAF} proposed an algorithm that detects regular expressions triggering exponential behavious on backtracking matchers. -People also developed static analysis methods for -generating non-linear polynomial worst-time estimates +Weideman \parencite{Weideman2017Static} came up with +non-linear polynomial worst-time estimates for regexes, attack string that exploit the worst-time scenario, and "attack automata" that generates -attack strings \parencite{Weideman2017Static}. -There are also tools to "debug" regular expressions -that allows people to see why a match failed or was especially slow -by showing the steps a back-tracking regex engine took\parencite{regexploit2021}. +attack strings. +%Arguably these methods limit the programmers' freedom +%or productivity when all they want is to come up with a regex +%that solves the text processing problem. + %TODO:also the regex101 debugger \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives} Is it possible to have a regex lexing algorithm with proven correctness and time complexity, which allows easy extensions to constructs like bounded repetitions, negation, lookarounds, and even back-references? -Building on top of Sulzmann and Lu's attempt to formalize the -notion of POSIX lexing rules \parencite{Sulzmann2014}, -Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled -POSIX matching as a ternary relation recursively defined in a -natural deduction style. -With the formally-specified rules for what a POSIX matching is, -they designed a regex matching algorithm based on Brzozowski derivatives, and -proved in Isabelle/HOL that the algorithm gives correct results. + + We propose Brzozowski derivatives on regular expressions as + a solution to this. + + In the last fifteen or so years, Brzozowski's derivatives of regular +expressions have sparked quite a bit of interest in the functional +programming and theorem prover communities. 
The beauty of +Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly +expressible in any functional language, and easily definable and +reasoned about in theorem provers---the definitions just consist of +inductive datatypes and simple recursive functions. +An algorithm based on them by +Sulzmann and Lu \parencite{Sulzmann2014} allows easy extension +to include extended regular expressions and + simplification of internal data structures, + eliminating the exponential behaviours. + + + @@ -487,23 +510,14 @@ %---------------------------------------------------------------------------------------- -\section{Our Approach} -In the last fifteen or so years, Brzozowski's derivatives of regular -expressions have sparked quite a bit of interest in the functional -programming and theorem prover communities. The beauty of -Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly -expressible in any functional language, and easily definable and -reasoned about in theorem provers---the definitions just consist of -inductive datatypes and simple recursive functions. Derivatives of a -regular expression, written $r \backslash c$, give a simple solution -to the problem of matching a string $s$ with a regular -expression $r$: if the derivative of $r$ w.r.t.\ (in -succession) all the characters of the string matches the empty string, -then $r$ matches $s$ (and {\em vice versa}). +\section{Our Contribution} + -This work aims to address the above vulnerability by the combination -of Brzozowski's derivatives and interactive theorem proving. We give an +This work addresses the vulnerability of super-linear and +buggy regex implementations by the combination +of Brzozowski's derivatives and interactive theorem proving. +We give an improved version of Sulzmann and Lu's bit-coded algorithm using derivatives, which come with a formal guarantee in terms of correctness and running time as an Isabelle/HOL proof. 
@@ -512,15 +526,6 @@ cubic to regular expression size using a technique by Antimirov. -\subsection{Existing Work} -We are aware -of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by -Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part -of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one -in Coq is given by Coquand and Siles \parencite{Coquand2012}. -Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}. - - %We propose Brzozowski's derivatives as a solution to this problem. The main contribution of this thesis is a proven correct lexing algorithm with formalized time bounds. @@ -542,18 +547,22 @@ and concluded that the algorithm is quadratic in terms of input length. When we tried out their extracted OCaml code with our example $(a+aa)^*$, the time it took to lex only 40 $a$'s was 5 minutes. -We therefore believe our results of a proof of performance on general + +We believe our results of a proof of performance on general inputs rather than specific examples a novel contribution.\\ - \section{Preliminaries about Lexing Using Brzozowski derivatives} - In the last fifteen or so years, Brzozowski's derivatives of regular -expressions have sparked quite a bit of interest in the functional -programming and theorem prover communities. -The beauty of -Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly -expressible in any functional language, and easily definable and -reasoned about in theorem provers---the definitions just consist of -inductive datatypes and simple recursive functions. + +\subsection{Related Work} +We are aware +of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by +Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part +of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one +in Coq is given by Coquand and Siles \parencite{Coquand2012}. 
+Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}. + + %We propose Brzozowski's derivatives as a solution to this problem. +% about Lexing Using Brzozowski derivatives + \section{Preliminaries} Suppose we have an alphabet $\Sigma$, the strings whose characters are from $\Sigma$ @@ -596,6 +605,13 @@ If the $\textit{StringSet}$ happen to have some structure, for example, if it is regular, then we have that it +% Derivatives of a +%regular expression, written $r \backslash c$, give a simple solution +%to the problem of matching a string $s$ with a regular +%expression $r$: if the derivative of $r$ w.r.t.\ (in +%succession) all the characters of the string matches the empty string, +%then $r$ matches $s$ (and {\em vice versa}). + The the derivative of regular expression, denoted as $r \backslash c$, is a function that takes parameters $r$ and $c$, and returns another regular expression $r'$, @@ -809,6 +825,19 @@ \end{center} \noindent + +Building on top of Sulzmann and Lu's attempt to formalize the +notion of POSIX lexing rules \parencite{Sulzmann2014}, +Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled +POSIX matching as a ternary relation recursively defined in a +natural deduction style. +With the formally-specified rules for what a POSIX matching is, +they proved in Isabelle/HOL that the algorithm gives correct results. + +But a correct result alone is not enough: we also want \textbf{efficiency}. + + + One regular expression can have multiple lexical values. For example for the regular expression $(a+b)^*$, it has a infinite list of values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$, @@ -829,6 +858,7 @@ C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}} \end{equation} which is clearly in exponential order. + A lexer aimed at getting all the possible values has an exponential worst case runtime. Therefore it is impractical to try to generate all possible matches in a run. 
In practice, we are usually @@ -941,9 +971,12 @@ each run during the injection phase. And we can prove that the POSIX value of how regular expressions match strings will not be affected---although is much harder -to establish. Some initial results in this regard have been +to establish. +Some initial results in this regard have been obtained in \cite{AusafDyckhoffUrban2016}. + + %Brzozowski, after giving the derivatives and simplification, %did not explore lexing with simplification or he may well be %stuck on an efficient simplificaiton with a proof. @@ -1434,20 +1467,6 @@ -\section{Backgound} -%Regular expression matching and lexing has been -% widely-used and well-implemented -%in software industry. -%TODO: expand the above into a full paragraph -%TODO: look up snort rules to use here--give readers idea of what regexes look like - - -Theoretical results say that regular expression matching -should be linear with respect to the input. -Under a certain class of regular expressions and inputs though, -practical implementations suffer from non-linear or even -exponential running time, -allowing a ReDoS (regular expression denial-of-service ) attack. %---------------------------------------------------------------------------------------- @@ -1455,246 +1474,8 @@ %---------------------------------------------------------------------------------------- -\section{What this Template Includes} - -\subsection{Folders} - -This template comes as a single zip file that expands out to several files and folders. The folder names are mostly self-explanatory: - -\keyword{Appendices} -- this is the folder where you put the appendices. Each appendix should go into its own separate \file{.tex} file. An example and template are included in the directory. - -\keyword{Chapters} -- this is the folder where you put the thesis chapters. A thesis usually has about six chapters, though there is no hard rule on this. 
Each chapter should go in its own separate \file{.tex} file and they can be split as: -\begin{itemize} -\item Chapter 1: Introduction to the thesis topic -\item Chapter 2: Background information and theory -\item Chapter 3: (Laboratory) experimental setup -\item Chapter 4: Details of experiment 1 -\item Chapter 5: Details of experiment 2 -\item Chapter 6: Discussion of the experimental results -\item Chapter 7: Conclusion and future directions -\end{itemize} -This chapter layout is specialised for the experimental sciences, your discipline may be different. - -\keyword{Figures} -- this folder contains all figures for the thesis. These are the final images that will go into the thesis document. - -\subsection{Files} - -Included are also several files, most of them are plain text and you can see their contents in a text editor. After initial compilation, you will see that more auxiliary files are created by \LaTeX{} or BibTeX and which you don't need to delete or worry about: - -\keyword{example.bib} -- this is an important file that contains all the bibliographic information and references that you will be citing in the thesis for use with BibTeX. You can write it manually, but there are reference manager programs available that will create and manage it for you. Bibliographies in \LaTeX{} are a large subject and you may need to read about BibTeX before starting with this. Many modern reference managers will allow you to export your references in BibTeX format which greatly eases the amount of work you have to do. - -\keyword{MastersDoctoralThesis.cls} -- this is an important file. It is the class file that tells \LaTeX{} how to format the thesis. - -\keyword{main.pdf} -- this is your beautifully typeset thesis (in the PDF file format) created by \LaTeX{}. It is supplied in the PDF with the template and after you compile the template you should get an identical version. - -\keyword{main.tex} -- this is an important file. 
This is the file that you tell \LaTeX{} to compile to produce your thesis as a PDF file. It contains the framework and constructs that tell \LaTeX{} how to layout the thesis. It is heavily commented so you can read exactly what each line of code does and why it is there. After you put your own information into the \emph{THESIS INFORMATION} block -- you have now started your thesis! - -Files that are \emph{not} included, but are created by \LaTeX{} as auxiliary files include: - -\keyword{main.aux} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. - -\keyword{main.bbl} -- this is an auxiliary file generated by BibTeX, if it is deleted, BibTeX simply regenerates it when you run the \file{main.aux} file. Whereas the \file{.bib} file contains all the references you have, this \file{.bbl} file contains the references you have actually cited in the thesis and is used to build the bibliography section of the thesis. - -\keyword{main.blg} -- this is an auxiliary file generated by BibTeX, if it is deleted BibTeX simply regenerates it when you run the main \file{.aux} file. - -\keyword{main.lof} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Figures} section. - -\keyword{main.log} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It contains messages from \LaTeX{}, if you receive errors and warnings from \LaTeX{}, they will be in this \file{.log} file. - -\keyword{main.lot} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Tables} section. 
- -\keyword{main.out} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. - -So from this long list, only the files with the \file{.bib}, \file{.cls} and \file{.tex} extensions are the most important ones. The other auxiliary files can be ignored or deleted as \LaTeX{} and BibTeX will regenerate them. - %---------------------------------------------------------------------------------------- -\section{Filling in Your Information in the \file{main.tex} File}\label{FillingFile} - -You will need to personalise the thesis template and make it your own by filling in your own information. This is done by editing the \file{main.tex} file in a text editor or your favourite LaTeX environment. - -Open the file and scroll down to the third large block titled \emph{THESIS INFORMATION} where you can see the entries for \emph{University Name}, \emph{Department Name}, etc \ldots - -Fill out the information about yourself, your group and institution. You can also insert web links, if you do, make sure you use the full URL, including the \code{http://} for this. If you don't want these to be linked, simply remove the \verb|\href{url}{name}| and only leave the name. - -When you have done this, save the file and recompile \code{main.tex}. All the information you filled in should now be in the PDF, complete with web links. You can now begin your thesis proper! - -%---------------------------------------------------------------------------------------- - -\section{The \code{main.tex} File Explained} - -The \file{main.tex} file contains the structure of the thesis. There are plenty of written comments that explain what pages, sections and formatting the \LaTeX{} code is creating. Each major document element is divided into commented blocks with titles in all capitals to make it obvious what the following bit of code is doing. 
Initially there seems to be a lot of \LaTeX{} code, but this is all formatting, and it has all been taken care of so you don't have to do it. - -Begin by checking that your information on the title page is correct. For the thesis declaration, your institution may insist on something different than the text given. If this is the case, just replace what you see with what is required in the \emph{DECLARATION PAGE} block. - -Then comes a page which contains a funny quote. You can put your own, or quote your favourite scientist, author, person, and so on. Make sure to put the name of the person who you took the quote from. - -Following this is the abstract page which summarises your work in a condensed way and can almost be used as a standalone document to describe what you have done. The text you write will cause the heading to move up so don't worry about running out of space. - -Next come the acknowledgements. On this page, write about all the people who you wish to thank (not forgetting parents, partners and your advisor/supervisor). - -The contents pages, list of figures and tables are all taken care of for you and do not need to be manually created or edited. The next set of pages are more likely to be optional and can be deleted since they are for a more technical thesis: insert a list of abbreviations you have used in the thesis, then a list of the physical constants and numbers you refer to and finally, a list of mathematical symbols used in any formulae. Making the effort to fill these tables means the reader has a one-stop place to refer to instead of searching the internet and references to try and find out what you meant by certain abbreviations or symbols. - -The list of symbols is split into the Roman and Greek alphabets. Whereas the abbreviations and symbols ought to be listed in alphabetical order (and this is \emph{not} done automatically for you) the list of physical constants should be grouped into similar themes. 
- -The next page contains a one line dedication. Who will you dedicate your thesis to? - -Finally, there is the block where the chapters are included. Uncomment the lines (delete the \code{\%} character) as you write the chapters. Each chapter should be written in its own file and put into the \emph{Chapters} folder and named \file{Chapter1}, \file{Chapter2}, etc\ldots Similarly for the appendices, uncomment the lines as you need them. Each appendix should go into its own file and placed in the \emph{Appendices} folder. - -After the preamble, chapters and appendices finally comes the bibliography. The bibliography style (called \option{authoryear}) is used for the bibliography and is a fully featured style that will even include links to where the referenced paper can be found online. Do not underestimate how grateful your reader will be to find that a reference to a paper is just a click away. Of course, this relies on you putting the URL information into the BibTeX file in the first place. - %---------------------------------------------------------------------------------------- -\section{Thesis Features and Conventions}\label{ThesisConventions} -To get the best out of this template, there are a few conventions that you may want to follow. - -One of the most important (and most difficult) things to keep track of in such a long document as a thesis is consistency. Using certain conventions and ways of doing things (such as using a Todo list) makes the job easier. Of course, all of these are optional and you can adopt your own method. - -\subsection{Printing Format} - -This thesis template is designed for double sided printing (i.e. content on the front and back of pages) as most theses are printed and bound this way. Switching to one sided printing is as simple as uncommenting the \option{oneside} option of the \code{documentclass} command at the top of the \file{main.tex} file. You may then wish to adjust the margins to suit specifications from your institution. 
- -The headers for the pages contain the page number on the outer side (so it is easy to flick through to the page you want) and the chapter name on the inner side. - -The text is set to 11 point by default with single line spacing, again, you can tune the text size and spacing should you want or need to using the options at the very start of \file{main.tex}. The spacing can be changed similarly by replacing the \option{singlespacing} with \option{onehalfspacing} or \option{doublespacing}. - -\subsection{Using US Letter Paper} - -The paper size used in the template is A4, which is the standard size in Europe. If you are using this thesis template elsewhere and particularly in the United States, then you may have to change the A4 paper size to the US Letter size. This can be done in the margins settings section in \file{main.tex}. - -Due to the differences in the paper size, the resulting margins may be different to what you like or require (as it is common for institutions to dictate certain margin sizes). If this is the case, then the margin sizes can be tweaked by modifying the values in the same block as where you set the paper size. Now your document should be set up for US Letter paper size with suitable margins. - -\subsection{References} - -The \code{biblatex} package is used to format the bibliography and inserts references such as this one \parencite{Reference1}. The options used in the \file{main.tex} file mean that the in-text citations of references are formatted with the author(s) listed with the date of the publication. Multiple references are separated by semicolons (e.g. \parencite{Reference2, Reference1}) and references with more than three authors only show the first author with \emph{et al.} indicating there are more authors (e.g. \parencite{Reference3}). This is done automatically for you. To see how you use references, have a look at the \file{Chapter1.tex} source file. 
Many reference managers allow you to simply drag the reference into the document as you type. - -Scientific references should come \emph{before} the punctuation mark if there is one (such as a comma or period). The same goes for footnotes\footnote{Such as this footnote, here down at the bottom of the page.}. You can change this but the most important thing is to keep the convention consistent throughout the thesis. Footnotes themselves should be full, descriptive sentences (beginning with a capital letter and ending with a full stop). The APA6 states: \enquote{Footnote numbers should be superscripted, [...], following any punctuation mark except a dash.} The Chicago manual of style states: \enquote{A note number should be placed at the end of a sentence or clause. The number follows any punctuation mark except the dash, which it precedes. It follows a closing parenthesis.} - -The bibliography is typeset with references listed in alphabetical order by the first author's last name. This is similar to the APA referencing style. To see how \LaTeX{} typesets the bibliography, have a look at the very end of this document (or just click on the reference number links in in-text citations). - -\subsubsection{A Note on bibtex} - -The bibtex backend used in the template by default does not correctly handle unicode character encoding (i.e. "international" characters). You may see a warning about this in the compilation log and, if your references contain unicode characters, they may not show up correctly or at all. The solution to this is to use the biber backend instead of the outdated bibtex backend. This is done by finding this in \file{main.tex}: \option{backend=bibtex} and changing it to \option{backend=biber}. You will then need to delete all auxiliary BibTeX files and navigate to the template directory in your terminal (command prompt). Once there, simply type \code{biber main} and biber will compile your bibliography. 
You can then compile \file{main.tex} as normal and your bibliography will be updated. An alternative is to set up your LaTeX editor to compile with biber instead of bibtex, see \href{http://tex.stackexchange.com/questions/154751/biblatex-with-biber-configuring-my-editor-to-avoid-undefined-citations/}{here} for how to do this for various editors. - -\subsection{Tables} - -Tables are an important way of displaying your results, below is an example table which was generated with this code: - -{\small -\begin{verbatim} -\begin{table} -\caption{The effects of treatments X and Y on the four groups studied.} -\label{tab:treatments} -\centering -\begin{tabular}{l l l} -\toprule -\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\ -\midrule -1 & 0.2 & 0.8\\ -2 & 0.17 & 0.7\\ -3 & 0.24 & 0.75\\ -4 & 0.68 & 0.3\\ -\bottomrule\\ -\end{tabular} -\end{table} -\end{verbatim} -} - -\begin{table} -\caption{The effects of treatments X and Y on the four groups studied.} -\label{tab:treatments} -\centering -\begin{tabular}{l l l} -\toprule -\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\ -\midrule -1 & 0.2 & 0.8\\ -2 & 0.17 & 0.7\\ -3 & 0.24 & 0.75\\ -4 & 0.68 & 0.3\\ -\bottomrule\\ -\end{tabular} -\end{table} - -You can reference tables with \verb|\ref{