An efficient approximate algorithm for the 1-median problem in metric spaces

Cantone, Domenico; Cincotti, Gianluca; Ferro, Alfredo; Pulvirenti, Alfredo

doi:10.1137/S1052623403424740

We propose a simple and natural linear randomized algorithm for the approximate 1-median selection problem in metric spaces. The 1-median of a finite subset S of a metric space is the element of S which minimizes the average distance from the remaining points in S. This problem is extremely important in most applications using clustering of metric spaces, but also in connection with several algorithms in bioinformatics. The only linear approximation algorithm for the 1-median problem, which provably works in any metric space without going through any Euclidean space, has been proposed by Indyk in {[}Proceedings of the 31st Annual ACM Symposium on Theory of Computing, Atlanta, 1999, pp. 428-432]. However, Indyk's algorithm, which is based on sufficiently large sampling, turns out not to be a practical solution. The same holds true even for its heuristic variants which use samplings of smaller size. The algorithm we propose has a simple and efficient implementation, which performs better than Indyk's algorithm in practice. On the other hand, while the performance of Indyk's algorithm is guaranteed by an approximation factor, in the case of our algorithm we are only able to produce experimental evidence of its precision. Extensive experimentation has been performed on both synthetic and real input datasets. Synthetic datasets were generated with uniform and skewed distributions, using various metrics. Real datasets have been extrapolated from real world official databases available on the web. Successful results of the proposed algorithm are reported for several applications in bioinformatics and various classes of approximate search queries.

We propose a simple and natural linear randomized algorithm for the approximate 1-median selection problem in metric spaces. The 1-median of a finite subset S of a metric space is the element of S which minimizes the average distance from the remaining points in S. This problem is extremely important in most applications using clustering of metric spaces, but also in connection with several algorithms in bioinformatics. The only linear approximation algorithm for the 1-median problem, which provably works in any metric space without going through any Euclidean space, has been proposed by Indyk in [Proceedings of the 31st Annual ACM Symposium on Theory of Computing, Atlanta, 1999, pp. 428–432]. However, Indyk’s algorithm, which is based on sufficiently large sampling, turns out not to be a practical solution. The same holds true even for its heuristic variants which use samplings of smaller size. The algorithm we propose has a simple and efficient implementation, which performs better than Indyk’s algorithm in practice. On the other hand, while the performance of Indyk’s algorithm is guaranteed by an approximation factor, in the case of our algorithm we are only able to produce experimental evidence of its precision. Extensive experimentation has been performed on both synthetic and real input datasets. Synthetic datasets were generated with uniform and skewed distributions, using various metrics. Real datasets have been extrapolated from real world official databases available on the web. Successful results of the proposed algorithm are reported for several applications in bioinformatics and various classes of approximate search queries.