93 looked at extended regular expressions, such as bounded repetitions, |
93 looked at extended regular expressions, such as bounded repetitions, |
94 negation and back-references. |
94 negation and back-references. |
95 \end{abstract} |
95 \end{abstract} |
96 |
96 |
97 \section{Introduction} |
97 \section{Introduction} |
98 \subsection{Practical Example of Regex} |
98 \subsection{Basic Regex Introduction} |
99 |
|
100 %TODO: read rules libraries and the explanation for some of the rules |
|
101 matching some string $s$ with a regex |
|
102 |
|
103 \begin{verbatim} |
|
104 (?:(?:\"|'|\]|\}|\\|\d| |
|
105 (?:nan|infinity|true|false|null|undefined|symbol|math) |
|
106 |\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) |
|
107 \end{verbatim} |
|
108 |
|
109 |
|
110 %Could be from a network intrusion detection algorithm. |
|
111 %Checking whether there is some malicious code |
|
112 %in the network data blocks being routed. |
|
113 %If so, discard the data and identify the sender for future alert. |
|
114 |
|
115 \subsection{The problem: Efficient Matching and Lexing} |
|
116 A programmer writes patterns to process texts, |
|
117 where a regex is a structured symbolic pattern |
|
118 specifying what the string should be like. |
|
119 |
|
120 The above regex looks complicated, but can be |
|
121 described by some basic constructs: |
|
122 |
|
123 |
|
124 Suppose (basic) regular expressions are given by the following grammar: |
99 Suppose (basic) regular expressions are given by the following grammar: |
125 \[ r ::= \ZERO \mid \ONE |
100 \[ r ::= \ZERO \mid \ONE |
126 \mid c |
101 \mid c |
127 \mid r_1 \cdot r_2 |
102 \mid r_1 \cdot r_2 |
128 \mid r_1 + r_2 |
103 \mid r_1 + r_2 |
129 \mid r^* |
104 \mid r^* |
130 \] |
105 \] |
131 |
106 |
132 \noindent |
107 Problem of matching: |
133 The intended meaning of the constructors is as follows: $\ZERO$ |
108 |
134 cannot match any string, $\ONE$ can match the empty string, the |
|
135 character regular expression $c$ can match the character $c$, and so |
|
136 on. |
|
137 |
|
138 and the underlying algorithmic problem is: |
|
139 \begin{center} |
109 \begin{center} |
140 \begin{tabular}{lcr} |
110 \begin{tabular}{lcr} |
141 $\textit{Match}(r, s)$ & $ = $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\ |
111 $\textit{Match}(r, s)$ & $ = $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\ |
142 & & $\textit{else} \; \textit{output} \; \textit{NO}$ |
112 & & $\textit{else} \; \textit{output} \; \textit{NO}$ |
143 \end{tabular} |
113 \end{tabular} |
144 \end{center} |
114 \end{center} |
|
115 |
|
116 |
|
117 \subsection{The practical problem} |
|
118 Introduce the problem of matching a pattern with a string. |
|
119 Why important? |
|
120 Need to be fast, real-time, on large inputs. |
|
121 Examples: Snort, Bro, etc? |
|
122 \subsubsection{The rules for network intrusion analysis tools } |
|
123 TODO: read rules libraries such as Snort and the explanation for some of the rules |
|
124 TODO: pcre/pcre2? |
|
125 TODO: any other libraries? |
|
126 |
|
127 |
|
128 \subsection{Regexes that brought down CloudFlare} |
|
129 |
|
130 |
|
131 matching some string $s$ with a regex |
|
132 |
|
133 \begin{verbatim} |
|
134 (?:(?:\"|'|\]|\}|\\|\d| |
|
135 (?:nan|infinity|true|false|null|undefined|symbol|math) |
|
136 |\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) |
|
137 \end{verbatim} |
|
138 |
|
139 |
|
140 %Could be from a network intrusion detection algorithm. |
|
141 %Checking whether there is some malicious code |
|
142 %in the network data blocks being routed. |
|
143 %If so, discard the data and identify the sender for future alert. |
|
144 \section{Existing approaches} |
|
145 \subsection{Shortcomings of different methods} |
|
146 |
|
147 |
|
148 \subsubsection{ NFA's} |
|
149 Can be slow especially when many states are active. |
|
150 |
|
151 \subsubsection{DFAs} |
|
152 The tool JFLEX uses it. |
|
153 Advantages: super fast on most regexes \\ |
|
154 TODO: show it being fast on a lot of inputs\\ |
|
155 Disavantages: |
|
156 state explosion for bounded repetitions due to |
|
157 theoretic bottleneck of having to remember exactly what the |
|
158 suffix up to length $n$ of input string is. |
|
159 "Countdown States activation problem": |
|
160 $.*a.{100}$ requires $2^100$ + DFA states. |
|
161 \subsubsection{variant of DFA's} |
|
162 counting set automata |
|
163 \\ |
|
164 TODO: microsoft 2020 oopsla CsA work, need to add bibli entry, and read, understand key novelty, learn to benchmark like it |
|
165 \\ |
|
166 TODO: find weakness of such counting set automata? |
|
167 \\ |
|
168 Other variants? |
|
169 |
|
170 \subsubsection{NFA and Regex: isomorphic structure} |
|
171 TODO: define mathematically an isomorphism?\\ |
|
172 |
|
173 |
|
174 |
|
175 \subsubsection{variants of NFA's} |
|
176 How about acting on regular expressions themselves? Certain copies represent verbose info--that they will always match the same string--prune away! |
|
177 |
|
178 \subsection{Brzozowski's derivatives} |
|
179 |
|
180 \subsection{Sulzmann and Lu's algorithm} |
|
181 |
|
182 \subsection{Bit-coded algorithm} |
|
183 +bitcodes! |
|
184 Built on top of derivatives, but with auxiliary bits |
|
185 \subsection{Correctness Proof} |
|
186 unproven by Sulzmann and Lu |
|
187 by Ausaf and Urban |
|
188 |
|
189 \section{My Work} |
|
190 |
|
191 \subsection{an improved version of bit-coded algorithm: with simp!} |
|
192 |
|
193 \subsection{a correctness proof for bitcoded} |
|
194 |
|
195 \subsection{finiteness proof } |
|
196 |
|
197 \subsection{stronger simplification} |
|
198 |
|
199 |
|
200 \subsection{cubic bound} |
|
201 |
|
202 |
|
203 |
|
204 |
|
205 |
|
206 |
|
207 |
|
208 |
|
209 |
|
210 |
|
211 |
|
212 \subsection{Too Detailed too basic? regex to NFA translation} |
145 |
213 |
146 Deciding whether a string is in the language of the regex |
214 Deciding whether a string is in the language of the regex |
147 can be intuitively done by constructing an NFA\cite{Thompson_1968}: |
215 can be intuitively done by constructing an NFA\cite{Thompson_1968}: |
148 and simulate the running of it: |
216 and simulate the running of it: |
149 \begin{figure} |
217 \begin{figure} |