5 |
5 |
6 declare [[show_question_marks = false]] |
6 declare [[show_question_marks = false]] |
7 |
7 |
8 notation (latex output) |
8 notation (latex output) |
9 If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and |
9 If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and |
10 Cons ("_::_" [78,77] 73) and |
10 Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and |
|
11 ZERO ("\<^raw:\textrm{0}>" 78) and |
|
12 ONE ("\<^raw:\textrm{1}>" 78) and |
|
13 CHAR ("_" [1000] 10) and |
|
14 ALT ("_ + _" [1000,1000] 78) and |
|
15 SEQ ("_ \<cdot> _" [1000,1000] 78) and |
|
16 STAR ("_\<^sup>\<star>" [1000] 78) and |
11 val.Char ("Char _" [1000] 78) and |
17 val.Char ("Char _" [1000] 78) and |
12 val.Left ("Left _" [1000] 78) and |
18 val.Left ("Left _" [1000] 78) and |
13 val.Right ("Right _" [1000] 78) and |
19 val.Right ("Right _" [1000] 78) and |
14 L ("L _" [1000] 0) and |
20 L ("L _" [1000] 0) and |
15 flat ("|_|" [70] 73) and |
21 flat ("|_|" [70] 73) and |
21 (*>*) |
27 (*>*) |
22 |
28 |
23 section {* Introduction *} |
29 section {* Introduction *} |
24 |
30 |
25 text {* |
31 text {* |
26 |
32 |
27 \noindent |
33 |
|
34 Sulzmann and Lu \cite{Sulzmann2014} |
|
35 |
|
36 there are two commonly used |
|
37 disambiguation strategies to create a unique matching tree: |
|
38 one is called \emph{greedy} matching \cite{Frisch2004} and the |
|
39 other is \emph{POSIX} matching~\cite{Kuklewicz,Sulzmann2014}. |
|
40 For the latter there are two rough rules: |
|
41 |
|
42 \begin{itemize} |
|
43 \item The Longest Match Rule (or ``maximal munch rule''):\smallskip\\ The |
|
44 longest initial substring matched by any regular |
|
45 expression is taken as next token. |
|
46 |
|
47 \item Rule Priority:\smallskip\\ For a particular longest initial |
|
48 substring, the first regular expression that can match |
|
49 determines the token. |
|
50 \end{itemize} |
|
51 |
|
52 \noindent In the context of lexing, POSIX is the more |
|
53 interesting disambiguation strategy as it produces longest |
|
54 matches, which is necessary for tokenising programs. For |
|
55 example the string \textit{iffoo} should not match the keyword |
|
56 \textit{if} and the rest, but as one string \textit{iffoo}, |
|
57 which might be a variable name in a program. As another |
|
58 example consider the string $xy$ and the regular expression |
|
59 \mbox{$(x + y + xy)^*$}. Either the input string can be |
|
60 matched in two `iterations' by the single letter-regular |
|
61 expressions $x$ and $y$, or directly in one iteration by $xy$. |
|
62 The first case corresponds to greedy matching, which first |
|
63 matches with the left-most symbol and only matches the next |
|
64 symbol in case of a mismatch. The second case is POSIX |
|
65 matching, which prefers the longest match. In case more than |
|
66 one (longest) matches exist, only then it prefers the |
|
67 left-most match. While POSIX matching seems natural, it turns |
|
68 out to be much more subtle than greedy matching in terms of |
|
69 implementations and in terms of proving properties about it. |
|
70 If POSIX matching is implemented using automata, then one has |
|
71 to follow transitions (according to the input string) until |
|
72 one finds an accepting state, record this state and look for |
|
73 further transition which might lead to another accepting state |
|
74 that represents a longer input initial substring to be |
|
75 matched. Only if none can be found, the last accepting state |
|
76 is returned. |
|
77 |
|
78 |
|
79 Sulzmann and Lu's paper \cite{Sulzmann2014} targets POSIX |
|
80 regular expression matching. They write that it is known to be |
|
81 to be a non-trivial problem and nearly all POSIX matching |
|
82 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
|
83 For this they cite a study by Kuklewicz \cite{Kuklewicz}. My |
|
84 current work is about formalising the proofs in the paper by |
|
85 Sulzmann and Lu. Specifically, they propose in this paper a |
|
86 POSIX matching algorithm and give some details of a |
|
87 correctness proof for this algorithm inside the paper and some |
|
88 more details in an appendix. This correctness proof is |
|
89 unformalised, meaning it is just a ``pencil-and-paper'' proof, |
|
90 not done in a theorem prover. Though, the paper and presumably |
|
91 the proof have been peer-reviewed. Unfortunately their proof |
|
92 does not give enough details such that it can be |
|
93 straightforwardly implemented in a theorem prover, say |
|
94 Isabelle. In fact, the purported proof they outline does not |
|
95 work in central places. We discovered this when filling in |
|
96 many gaps and attempting to formalise the proof in Isabelle. |
|
97 |
|
98 |
|
99 |
|
100 {\bf Contributions:} |
|
101 |
|
102 *} |
|
103 |
|
104 section {* Preliminaries *} |
|
105 |
|
106 text {* \noindent Strings in Isabelle/HOL are lists of characters with |
|
107 the empty string being represented by the empty list, written @{term |
|
108 "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. By |
|
109 using the type char we have a supply of finitely many characters |
|
110 roughly corresponding to the ASCII character set. |
28 Regular exprtessions |
111 Regular exprtessions |
29 |
112 |
30 \begin{center} |
113 \begin{center} |
31 @{text "r :="} |
114 @{text "r :="} |
32 @{const "NULL"} $\mid$ |
115 @{const "ZERO"} $\mid$ |
33 @{const "EMPTY"} $\mid$ |
116 @{const "ONE"} $\mid$ |
34 @{term "CHAR c"} $\mid$ |
117 @{term "CHAR c"} $\mid$ |
35 @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ |
118 @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ |
36 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ |
119 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ |
37 @{term "STAR r"} |
120 @{term "STAR r"} |
38 \end{center} |
121 \end{center} |
|
122 |
|
123 *} |
|
124 |
|
125 section {* POSIX Regular Expression Matching *} |
|
126 |
|
127 section {* The Argument by Sulzmmann and Lu *} |
|
128 |
|
129 section {* Conclusion *} |
|
130 |
|
131 text {* |
|
132 |
|
133 Nipkow lexer from 2000 |
|
134 |
|
135 *} |
|
136 |
|
137 |
|
138 text {* |
|
139 |
|
140 |
|
141 |
39 |
142 |
40 \noindent |
143 \noindent |
41 Values |
144 Values |
42 |
145 |
43 \begin{center} |
146 \begin{center} |