Suffix Arrays (SA) are a powerfull tool used in many different fields such as genetics, string matching, compression, and many others. A suffix array A of T is just a sorted array of suffixes of T. To avoid quadratic space we store only the indices of the suffixes, instead of full suffixes of T. Given SA(T), one can search for occurrences of P directly on the suffix array using binary search in O(|P| lg |T|) time. This can be improved to O(|P| + lg |T|).
SA were originally developed by Gene Myers (ex-Celera) and Udi Manber (now at Google). For more than a decade, the direct construction of suffix array was O(N^2). SA were built in linear time indirectly by building an intermediate suffix tree (which are expensive from the point of view of memory). SA is equivalent to an in-order traversal of the leaves of the suffix tree.
Here a succint description of the algorithm:
- Sort Σ, the alphabet of the T in linear time by using radix sort;
- Replace each letter in the text with its rank among the letters in the text;
- Divide the text T into 3 parts:
- T0 = < (T[3i ], T[3i + 1], T[3i + 2]) , i>=0 . . . >
- T1 = < (T[3i + 1], T[3i + 2], T[3i + 3]) , i=0>
- T2 = < (T[3i + 2], T[3i + 3], T[3i + 4]) , i >= 0 >
- Recurse on
the concatenation of T0 and T1. When this recursive call returns, we get T0 and T1 sorted in a suffix array. After that, we need is to sort the suffixes of T2, and merge them with the old suffixes to get suffixes of T. This is the point where the algorithm become a diamond: the construction of the suffixes of T2 and the linear merge.
- Note that by construction T2[i:] = T[3i+2:]= (T[3i+2], T[3i+3:]) = (T[3i+2], T0[i+1:]). Therefore we can have T2[i:] from the already built To[i+1]
- Merge the sorted suffixes of T0, T1, and T2. Remember that suffixes of T0 and T1 are already sorted, so comparing a suffix from T0 and a suffix from T1 takes constant time. To compare against a suffix from T2, we decompose it again to get a suffix from either T0 or T1;