Welcome to the k-FreqItems GitHub!
In this repo, we share the implementations and experiments of our work A New Sparse Data Clustering Method Based on Frequent Items in SIGMOD 2023. We implement k-FreqItems and SILK with CUDA on a distributed platform (one master with 4 GPUs and one slave with 4 GPUs) for clustering massive sparse set data (or say, sparse binary vectors) on Jaccard distance. We also adapt two state-of-the-art methods k-Means++ and k-Means$\parallel$ for Jaccard distance as baselines (by using FreqItem as the cluster center).
We choose two small sparse data sets with ground truth laebls (i.e., News20 and RCV1) and five large-scale real-world sparse data sets (i.e., URL, Avazu, KDD2012, Criteo10M and Criteo1B) for performance evaluation. Users can download the data sets from their links. The statistics of data sets are summarized as follows.
Data Sets | # Data | # Dim | # Non-Zero Dim | Data Size | Global |
Local |
---|---|---|---|---|---|---|
News20 | 80 | 6.3 MB | 0.2 | - | ||
RCV1 | 65 | 136 MB | 0.2 | - | ||
URL | 116 | 1.1 GB | 0.4 | 0.2 | ||
Criteo10M | 39 | 1.6 GB | 0.2 | 0.1 | ||
Avazu | 15 | 2.6 GB | 0.3 | 0.2 | ||
KDD2012 | 11 | 7.3 GB | 0.5 | 0.2 | ||
Criteo1B | 39 | 153 GB | 0.2 | 0.1 |
The input sparse data sets are stored in binary format, where each file consists of two fields: pos
and data
, as shown below:
field | field type | description |
---|---|---|
pos | uint64_t *(n+1) |
start position of data for each point |
data | int *pos[n] |
non-zero dimensions IDs of all points |
pos
is an uint64_t
array of (n+1) length, which stores the start position of the data
array for each sparse data point. data
is an int
array of pos[n] length, which store the non-zero dimensions IDs of all sparse data points. Here we assume that all non-zero dimension IDs can be represented by an int
type integer. If the dimensionality exceeds the range of int
, one can store the non-zero dimension IDs by uint64_t
type with minor modification.With pos
and data
, one can efficient retrieve a specific data point with its data ID.
For example, suppose there is a sparse data set with four points: x_0={1,3,5,8}
, x_1={1,3}
, x_2={1,6,8}
, and x_3={1,8,10}
. Then, the pos
array is [0,4,6,9,12]
, e.g., pos[0]=0, pos[4]=12. And the data
array is [1,3,5,8,1,3,1,6,8,1,8,10]
. If you want to retrieve x_1
, you can first get its start position of data
and its length from pos
by its data ID 1
, i.e., start position is pos[1]=4
, and its length is pos[1+1]-pos[1]=6-4=2
. Then you can retrieve x_1
from data
by the start position 4
and its length 2
, i.e., x_1={1,3}
.
We have included the transformation code in the folder transformation/
to convert sparse data sets into the binary format. In addition, we also provide the binary format of the datasets we used (except Criteo1B) with 1, 2, 4, 8 slices. Users could download the datasets here.
The source codes require nvcc
with c++11
support. We have provided Makefile
for compilation. However, for different machines, users may need to specify your local path to MPI libary and CUDA SM and Gencode variations. Suppose the Makefile
is in accordance with the machines, users can use the following commands to compile the source codes:
cd methods/
make clean
make -j
Representative NVIDIA GPU cards for each architecture name and CUDA version can be found in http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/.
We provide bash (and python) scripts to reproduce all experiments reported in our SIGMOD 2023 paper. Suppose you have cloned this repo and you are in the folder k-FreqItems/
.
Please first download the datasets from the links we provided. Then, you can use transformation/sparse.cc
to convert them into the binary format we support. If you want to try multiple GPU and/or nodes, you could further run tranformation/partitin.cc
to split the data into 1,2,4,8 slices. After that, you copy the binary files to a specific folder like data/
. For example, when you get URL_0.bin, you can move it to the path data/URL/URL_0.bin
.
After preparing the data sets, we provide bash scripts to run k-FreqItems, SILK, and two baselines k-FreqItems++ and k-FreqItems$\parallel$. Users can reproduce the experiments by simply running the following command:
cd methods/
bash run.sh
By default, we provide the scripts for running methods with 4 GPUs once users specify the data path. With some minor modifications, users can run methods with different number of GPUs based on their machines.
We provide python scripts (i.e., plot.py
and plot_util.py
in scripts/
) to reproduce all the figures that are appeared in our SIGMOD 2023 paper. These scripts require python 3.7 (or higher versions) with numpy, scipy, and matplotlib installed. If not, you might need to use anaconda to create a new virtual environment and use pip to install those packages.
With the experimental results from step 2, users can reproduce all the figures with the following commands.
cd scripts/
python plot.py
Finally, we provide the source codes and results for the validation of FreqItem representation (Section 3.2 in our SIGMOD 2023 paper). Please refer to the folder representation/
.
Thank you so much for being so patient to read the user manual. We will appreciate using the following BibTeX to cite this work when you use k-FreqItems (or SILK) in your paper.
@article{huang2023new,
title={A New Sparse Data Clustering Method Based On Frequent Items},
author={Huang, Qiang and Luo, Pingyi and Tung, Anthony KH},
journal={Proceedings of the ACM on Management of Data},
volume={1},
number={1},
pages={1--28},
year={2023},
publisher={ACM New York, NY, USA}
}
It is welcome to contact me (huangq@comp.nus.edu.sg) if you meet any issue. Thank you.