Let Substr_k(X) denote the set of length-k substrings of a given string X for a given integer k > 0. We study the following basic string problem, called z-Shortest S_k-Equivalent Strings: Given a set S_k of n length-k strings and an integer z > 0, list z shortest distinct strings T_1,...,T_z such that Substr_k(T_i) = S_k, for all i ∈ [1,z]. The z-Shortest S_k-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The special case z = 1, referred to as Shortest S_k-Equivalent String, asks for a shortest string X such that Substr_k(X) = S_k.
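To make the definition concrete, the following minimal Python sketch computes Substr_k(X) and finds a shortest equivalent string by exhaustive search. The function names and the `max_len` cutoff are hypothetical, chosen for illustration only; the search takes exponential time and is not the algorithm presented in this work.

```python
from itertools import product
from typing import Optional, Set


def substr_k(x: str, k: int) -> Set[str]:
    """Return Substr_k(x): the set of all length-k substrings of x."""
    return {x[i:i + k] for i in range(len(x) - k + 1)}


def shortest_equivalent_bruteforce(s_k: Set[str], max_len: int) -> Optional[str]:
    """Exhaustively search for a shortest T with Substr_k(T) == s_k.

    Illustrative only: exponential time, suitable for tiny inputs.
    Returns None if no such string of length <= max_len exists.
    """
    k = len(next(iter(s_k)))
    alphabet = sorted(set("".join(s_k)))
    # A string with n distinct length-k substrings has length >= n + k - 1.
    for length in range(len(s_k) + k - 1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            t = "".join(letters)
            if substr_k(t, k) == s_k:
                return t
    return None
```

For example, `substr_k("abab", 2)` is `{"ab", "ba"}`, and `shortest_equivalent_bruteforce({"ab", "ba"}, 5)` returns `"aba"`.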
Our main contributions are summarized below:
- Given a directed graph G(V,E), the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of G at least once. DCP can be solved in Õ(|E||V|) time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest S_k-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP.
- We show that the length of a shortest string output by Shortest S_k-Equivalent String is in O(k + n²). We generalize this bound by showing that the total length of z shortest strings is in O(zk + zn² + z²n). We derive these upper bounds by showing (asymptotically tight) bounds on the total length of z shortest Eulerian walks in general directed graphs.
- We present an algorithm for solving z-Shortest S_k-Equivalent Strings in O(nk + n² log² n + zn² log n + |output|) time. If z = 1, the time becomes O(nk + n² log² n), because the input has size Θ(nk) and the output has size O(k + n²).
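The link between strings and walks used in the contributions above can be illustrated with the standard de Bruijn graph view: each string of S_k is an edge from its length-(k-1) prefix to its length-(k-1) suffix, and a string T with Substr_k(T) = S_k spells a walk that uses every edge at least once. The sketch below assumes k ≥ 2 and uses hypothetical helper names; it only illustrates this correspondence and is not the construction or the algorithm of this work.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def de_bruijn_graph(s_k: Iterable[str]) -> Dict[str, List[str]]:
    """Map each (k-1)-length prefix to the (k-1)-length suffixes it points to."""
    graph = defaultdict(list)
    for s in sorted(s_k):  # sorted only to make the output deterministic
        graph[s[:-1]].append(s[1:])
    return dict(graph)


def spell_walk(nodes: List[str]) -> str:
    """Spell the string of a walk given as consecutive (k-1)-length nodes."""
    return nodes[0] + "".join(v[-1] for v in nodes[1:])


def covers_all_edges(nodes: List[str], s_k: Iterable[str]) -> bool:
    """Check that the walk is consistent and uses exactly the edges of s_k,
    each at least once."""
    pairs = list(zip(nodes, nodes[1:]))
    # Consecutive nodes must overlap in k-2 letters to form a valid edge.
    if any(u[1:] != v[:-1] for u, v in pairs):
        return False
    used = {u + v[-1] for u, v in pairs}
    return used == set(s_k)
```

For S_2 = {"ab", "ac", "bc", "ca"}, the walk a, b, c, a, c covers all four edges, and `spell_walk(["a", "b", "c", "a", "c"])` returns `"abcac"`, a string of length n + k - 1 = 5 whose length-2 substrings are exactly S_2.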