ChengsongPhdThesis/ChengsongPhDThesis.tex
changeset 440 0856fbf8bda7
parent 438 a73b2e553804
child 441 426a93160f4a
equal deleted inserted replaced
439:a5376206fd52 440:0856fbf8bda7
    93   looked at extended regular expressions, such as bounded repetitions,
    93   looked at extended regular expressions, such as bounded repetitions,
    94   negation and back-references.
    94   negation and back-references.
    95 \end{abstract}
    95 \end{abstract}
    96 
    96 
    97 \section{Introduction}
    97 \section{Introduction}
    98 \subsection{Practical Example of Regex}
    98 \subsection{Basic Regex Introduction}
    99 
       
   100 %TODO: read rules libraries and the explanation for some of the rules
       
   101 matching some string $s$ with a regex
       
   102 
       
   103 \begin{verbatim}
       
   104 (?:(?:\"|'|\]|\}|\\|\d|
       
   105 (?:nan|infinity|true|false|null|undefined|symbol|math)
       
   106 |\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) 
       
   107 \end{verbatim}
       
   108 
       
   109 
       
   110 %Could be from a network intrusion detection algorithm.
       
   111 %Checking whether there is some malicious code 
       
   112 %in the network data blocks being routed.
       
   113 %If so, discard the data and identify the sender for future alert.
       
   114 
       
   115 \subsection{The problem: Efficient Matching and Lexing}
       
   116 A programmer writes patterns to process texts,
       
   117 where a regex is a structured symbolic pattern
       
   118 specifying what the string should be like.
       
   119 
       
   120 The above regex looks complicated, but can be 
       
   121 described by some basic constructs:
       
   122 
       
   123 
       
   124 Suppose (basic) regular expressions are given by the following grammar:
    99 Suppose (basic) regular expressions are given by the following grammar:
   125 \[			r ::=   \ZERO \mid  \ONE
   100 \[			r ::=   \ZERO \mid  \ONE
   126 			 \mid  c  
   101 			 \mid  c  
   127 			 \mid  r_1 \cdot r_2
   102 			 \mid  r_1 \cdot r_2
   128 			 \mid  r_1 + r_2   
   103 			 \mid  r_1 + r_2   
   129 			 \mid r^*         
   104 			 \mid r^*         
   130 \]
   105 \]
   131 
   106 
   132 \noindent
   107 Problem of matching:
   133 The intended meaning of the constructors is as follows: $\ZERO$
   108 
   134 cannot match any string, $\ONE$ can match the empty string, the
       
   135 character regular expression $c$ can match the character $c$, and so
       
   136 on.
       
   137 
       
   138 and the underlying algorithmic problem is:
       
   139 \begin{center}
   109 \begin{center}
   140 \begin{tabular}{lcr}
   110 \begin{tabular}{lcr}
   141 $\textit{Match}(r, s)$ & $ =  $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\
   111 $\textit{Match}(r, s)$ & $ =  $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\
   142 &				& $\textit{else} \; \textit{output} \; \textit{NO}$
   112 &				& $\textit{else} \; \textit{output} \; \textit{NO}$
   143 \end{tabular}
   113 \end{tabular}
   144 \end{center}
   114 \end{center}
       
   115 
       
   116 
       
   117 \subsection{The practical problem}
       
   118 Introduce the problem of matching a pattern with a string.
       
   119 Why important?
       
   120 Need to be fast, real-time, on large inputs.
       
   121 Examples: Snort, Bro, etc?
       
   122 \subsubsection{The rules for network intrusion analysis tools }
       
   123 TODO: read rules libraries such as Snort and the explanation for some of the rules
       
   124 TODO: pcre/pcre2?
       
   125 TODO: any other libraries?
       
   126 
       
   127 
       
   128 \subsection{Regexes that brought down CloudFlare}
       
   129 
       
   130 
       
   131 matching some string $s$ with a regex
       
   132 
       
   133 \begin{verbatim}
       
   134 (?:(?:\"|'|\]|\}|\\|\d|
       
   135 (?:nan|infinity|true|false|null|undefined|symbol|math)
       
   136 |\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) 
       
   137 \end{verbatim}
       
   138 
       
   139 
       
   140 %Could be from a network intrusion detection algorithm.
       
   141 %Checking whether there is some malicious code 
       
   142 %in the network data blocks being routed.
       
   143 %If so, discard the data and identify the sender for future alert.
       
   144 \section{Existing approaches}
       
   145 \subsection{Shortcomings of different methods}
       
   146 
       
   147 
       
   148 \subsubsection{ NFA's}
       
   149 Can be slow especially when many states are active.
       
   150 
       
   151 \subsubsection{DFAs}
       
   152 The tool JFLEX uses it.
       
   153 Advantages: super fast on most regexes \\
       
   154 TODO: show it being fast on a lot of inputs\\
       
   155 Disavantages:
       
   156 state explosion for bounded repetitions due to 
       
   157 theoretic bottleneck of having to remember exactly what the
       
   158 suffix up to length $n$ of input string is.
       
   159 "Countdown States activation problem":
       
   160 $.*a.{100}$ requires $2^100$ + DFA states.
       
   161 \subsubsection{variant of DFA's}
       
   162 counting set automata 
       
   163 \\
       
   164 TODO: microsoft 2020 oopsla CsA work, need to add bibli entry, and read, understand key novelty, learn to benchmark like it
       
   165 \\
       
   166 TODO: find weakness of such counting set automata?
       
   167 \\
       
   168 Other variants?
       
   169 
       
   170 \subsubsection{NFA and Regex: isomorphic structure}
       
   171 TODO: define mathematically an isomorphism?\\
       
   172 
       
   173 
       
   174 
       
   175 \subsubsection{variants of NFA's}
       
   176 How about acting on regular expressions themselves? Certain copies represent verbose info--that they will always match the same string--prune away!
       
   177 
       
   178 \subsection{Brzozowski's derivatives}
       
   179 
       
   180 \subsection{Sulzmann and Lu's algorithm}
       
   181 
       
   182 \subsection{Bit-coded algorithm}
       
   183 +bitcodes!
       
   184 Built on top of derivatives, but with auxiliary bits
       
   185 \subsection{Correctness Proof}
       
   186 unproven by Sulzmann  and Lu
       
   187 by Ausaf and Urban
       
   188 
       
   189 \section{My Work}
       
   190 
       
   191 \subsection{an improved version of bit-coded algorithm: with simp!}
       
   192 
       
   193 \subsection{a correctness proof for bitcoded}
       
   194 
       
   195 \subsection{finiteness proof }
       
   196 
       
   197 \subsection{stronger simplification}
       
   198 
       
   199 
       
   200 \subsection{cubic bound}
       
   201 
       
   202 
       
   203 
       
   204 
       
   205 
       
   206 
       
   207 
       
   208 
       
   209 
       
   210 
       
   211 
       
   212 \subsection{Too Detailed too basic? regex to NFA translation}
   145 
   213 
   146 Deciding whether a string is in the language of the regex
   214 Deciding whether a string is in the language of the regex
   147 can be intuitively done by constructing an NFA\cite{Thompson_1968}:
   215 can be intuitively done by constructing an NFA\cite{Thompson_1968}:
   148 and simulate the running of it:
   216 and simulate the running of it:
   149 \begin{figure}
   217 \begin{figure}