196 In this instance, a regular expression intended to just trim white |
196 In this instance, a regular expression intended to just trim white |
197 spaces from the beginning and the end of a line actually consumed |
197 spaces from the beginning and the end of a line actually consumed |
198 massive amounts of CPU-resources---causing web servers to |
198 massive amounts of CPU-resources---causing web servers to |
199 grind to a halt. This happened when a post with 20,000 white spaces was |
199 grind to a halt. This happened when a post with 20,000 white spaces was |
200 submitted, but importantly the white spaces were neither at the |
200 submitted, but importantly the white spaces were neither at the |
201 beginning nor at the end. As a result, the regular expression matching |
201 beginning nor at the end. |
202 engine needed to backtrack over many choices. The underlying problem is |
202 As a result, the regular expression matching |
|
203 engine needed to backtrack over many choices. |
|
204 In this example, the time needed to process the string is not |
|
205 exactly the classical exponential case, but rather $O(n^2)$ |
|
206 with respect to the string length. But this is enough for the |
|
207 home page of stackexchnge to respond not fast enough to |
|
208 the load balancer, which thought that there must be some |
|
209 attack and therefore stopped the servers from responding to |
|
210 requests. This makes the whole site become unavailable. |
|
211 Another fresh example that just came out of the oven |
|
212 is the cloudfare outage incident. A poorly written |
|
213 regular expression exhibited exponential behaviour |
|
214 and exhausted CPUs that serve HTTP traffic on 2 July 2019, |
|
215 resulting in a 27-minute outage. The regular expression contained |
|
216 adjacent Kleene stars, which is the source of the exponential number |
|
217 of backtracking possibilities. Although this outage has a series of |
|
218 causes that came from different vulnerabilities within the system |
|
219 of cloudfare, the heart of the problem lies within regular expression. |
|
220 Such outages could be excluded from happening if the matching |
|
221 algorithm is guaranteed to be fast.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}} |
|
222 The underlying problem is |
203 that many ``real life'' regular expression matching engines do not use |
223 that many ``real life'' regular expression matching engines do not use |
204 DFAs for matching. This is because they support regular expressions that |
224 DFAs for matching. |
|
225 This is because they support regular expressions that |
205 are not covered by the classical automata theory, and in this more |
226 are not covered by the classical automata theory, and in this more |
206 general setting there are quite a few research questions still |
227 general setting there are quite a few research questions still |
207 unanswered and fast algorithms still need to be developed (for example |
228 unanswered and fast algorithms still need to be developed (for example |
208 how to treat bounded repetitions, negation and back-references |
229 how to treat bounded repetitions, negation and back-references |
209 efficiently). |
230 efficiently). |