A suffix array stores the lexicographic order of all suffixes of a given string. For example, the suffix array for 'dog' is [1,3,2], because the suffixes of 'dog' are 'dog', 'g' and 'og', in this lexicographic order. The suffix array is a very space economical data structure (it takes only 4n bytes for a string of length n) and can be constructed extremely fast (the theoretical time complexity is O(n), although other algorithms exist that have a better practical performance). The suffix array can be used for various tasks in the field of string matching. One important application is to find the number of matches (sometimes also their positions) of a given query string. With an additional array (the LCP-array) the suffix array has been shown to be almost equivalent to the well-known suffix tree.
Our research is focused on applications of suffix- and LCP-arrays in data mining. More precisely, we use them to extract all frequent strings from a given (positive) database of strings. We further use the same data structures to exclude all strings that are also frequent in a different (negative) database of strings. The biological relevance of this task can be seen in the following example: suppose a genetic disease is conjectured to be caused by a defect on the X-chromosome, but it is unknown where and how this failure occurs. One can then collect the genetic sequences of the X-chromosome of 1000 ill patients in the positive database, and likewise the genetic sequences of 1000 healthy persons in the negative database. Then all patterns that occur frequently (or always) in the positive database and not too often (or never) in the negative database are potential indicators for the genetic defect under consideration.
Several interesting theoretical and practical problems arise when working in this field, including external memory construction and processing of suffix- and LCP-arrays, specialized solutions for range-minimum-queries (RMQs) in LCP-arrays, etc. Our work is situated on the borderline between algorithmics and data mining and tries to integrate solutions from one of these fields in the other.