Fast Discrete Distribution Clustering Using Wasserstein Barycenter with Sparse Support

AD2-clustering is a fast clustering algorithm for Discrete Distributions (D2) under the exact Wasserstein metric. It can scale to large-scale D2 data with parallel computing. This project is an efficient and scalable implementation publicly available for D2-clustering, and it is still in the very early stage of development. Contributions are welcomed.

Please see our paper for technical details:

Jianbo Ye, Panruo Wu, James Z. Wang and Jia Li, Fast Discrete Distribution Clustering Using Wasserstein Barycenter with Sparse Support, IEEE Transactions on Signal Processing, 2017 (arXiv:1510.00012 [stat.CO], September 2015)

Discrete Distributions

There are four data types of discrete distribution that have been covered in this project (See data format for their specifications):

[default] discrete distribution over vector space endowed with Euclidean distance
Same type as the default one, but over embeddings space with a finite vocabulary size (e.g. document represented by tf-idf over word vectors, sparsified histograms)
normalized dense histograms with bin-to-bin distance
normalized sparse histograms with bin-to-bin distance
[**experimental] d2 over n-gram provided with item-to-item similarity/distance (sparse histograms as the centroids are represented as 1-gram)

An object/instance can be represented as the joint of multiple discrete distributions (in combinations of any aforementioned types), called phases. For example, an image can be represented in two phases: color distribution and texture distribution; a protein sequence can be represented as distributions in three phases, aka, 1-gram, 2-gram, 3-gram of amino acid. A document can be represented by a bag of weighted word vectors. The co-clustering is then performed jointly over multiple phases.

How to compile and run tests

Though one can build a serial version to solve small-to-moderate scale problems with speed, the strength of AD2-clustering is its scalability to large dataset with parallelization at high scaling efficiency (over 80% on hundreds of cores). You can find quickstart tips on the Google cloud compute engine for references.

Build dependencies:

MPI
CBLAS
Mosek 7.x: free individual academic license available.

Create make.inc file to resolve the dependencies, see make.inc.Linux for an example. One can choose to build the program under Linux or Mac.

Compile

 $ make MPI=0 # build sequential version, or
 $ make MPI=1 # build MPI version (default)

Run unit tests (it takes several minutes):

 $ make MPI=0 test # test sequential version, or
 $ make MPI=1 test # test MPI version (default)

By default, 64bit floating numbers are used. Optionally, you can switch to use 32bit to save memory in computation and get some marginal speedups by edit the line in make.inc

D2_DEFINES=-D _D2_DOUBLE # change to _D2_SINGLE if the size of RAM is limited

Usage

See here for options to prepare D2 data and see here for detailed instructions on possible arguments and modes of program.

Examples

preprocessed "20newsgroups" as bags of word-vectors (related codes on preparing the data).
preprocessed n-gram of protein sequences: experimental
USPS handwritten digit dataset

See here for more examples.

Name		Name	Last commit message	Last commit date
Latest commit History 418 Commits
data		data
experiment		experiment
include		include
matlab @ 855633b		matlab @ 855633b
src		src
test		test
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
make.inc.Linux64		make.inc.Linux64
make.inc.OSX_Yosemite		make.inc.OSX_Yosemite
make.inc.Ubuntu		make.inc.Ubuntu

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fast Discrete Distribution Clustering Using Wasserstein Barycenter with Sparse Support

Discrete Distributions

How to compile and run tests

Usage

Examples

About

Releases

Packages

Contributors 2

Languages

License

bobye/d2_kmeans

Folders and files

Latest commit

History

Repository files navigation

Fast Discrete Distribution Clustering Using Wasserstein Barycenter with Sparse Support

Discrete Distributions

How to compile and run tests

Usage

Examples

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages