DIODON 0.1
Libraries for Multivariate Data Analysis and Dimensionality Reduction for very large datasets
Diodon

Parallel C++ library for Multivariate Data Analysis of large datasets.

Overview

Diodon provides executables and functions to perform multivariate data analyses such as:

  • Singular Value Decomposition (SVD),
  • Principal Component Analysis (PCA) and variants (with different pre-treatments),
  • Multidimensional Scaling (MDS),
  • Correspondence Analysis (CoA),
  • Canonical Correlation Analysis (CCA, future work),
  • Multiple Correspondence Analysis (MCoA, future work).

Please read this note about the algorithms.

All these methods rely on a Singular Value Decomposition (SVD) of a 2D matrix. For small matrices the SVD can be computed directly using a sequential or multi-threaded LAPACK solver such as OpenBLAS or Intel MKL. For large matrices the exact SVD becomes too time consuming, so we use a Randomized Singular Value Decomposition (rSVD) instead, whose implementation is provided by the FMR library. FMR can compute the rSVD on parallel shared and distributed memory machines, using suitable parallel dense linear algebra routines internally: OpenBLAS or Intel MKL on a shared memory node, and Chameleon on distributed memory nodes (MPI).
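
For example, with the mds executable described in the usage sections below, the solver can be selected with the -svdm option (the input path and rank here are only illustrative):

# exact SVD, suitable for small matrices
mds -if /path/to/distance.h5 -svdm svd -of
# randomized SVD with a prescribed rank of 1000, suitable for large matrices
mds -if /path/to/distance.h5 -svdm rsvd -r 1000 -of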

MDS result on a test case of size 1043192: representation of the points on the first two axes.

Performance

MDS on Occigen with parallel HDF5 I/Os

MDS performance results on a square distance matrix of size N, with an rSVD rank of 10000, on the Occigen supercomputer. The input matrices represent distances between individuals in samples of diatoms from Lake Léman, courtesy of INRAE.

CPU time of the MDS as a function of the test case size.
Method N # nodes CPU Time (s)
MDS 99594 1 1047
MDS 270983 6 1099
MDS 426548 16 1054
MDS 616644 32 1195
MDS 1043192 100 1401

MDS and PCA on PlaFRIM with randomly generated data

MDS and PCA performance results example on randomly generated matrices of size N and rank 10000, on the PlaFRIM supercomputer.

CPU time of the MDS as a function of the matrix size on Bora nodes.
Method N # nodes CPU Time (s)
MDS 100000 1 477
MDS 200000 2 701
MDS 400000 8 847
MDS 600000 16 1165
MDS 800000 32 1323
CPU time of the PCA as a function of the matrix size on Bora nodes.
Method M N # nodes CPU Time (s)
PCA 500000 10000 1 545
PCA 1000000 10000 2 527
PCA 2000000 10000 4 604
PCA 4000000 10000 8 771
PCA 8000000 10000 16 1206

Download

Diodon source code is available on Gitlab Inria.

Clone Diodon using https (public access)

git clone https://gitlab.inria.fr/diodon/cppdiodon.git

Features

Supported libraries

  • zlib and bzip2 (required)
  • HDF5 (required), prefer the parallel MPI version if Chameleon is enabled
  • LAPACK (required), multithreaded Intel MKL or OpenBLAS recommended
  • MPI + Chameleon + StarPU (optional)

Installation examples of these libraries are given hereafter.

Data analysis algorithms

Method what it does
diodon::method::coa Correspondence Analysis
diodon::method::mds Multidimensional Scaling
diodon::method::pca Principal Component Analysis
diodon::method::pcamet Principal Component Analysis with metrics (diagonal)
diodon::method::svd Singular Value Decomposition

Installation

cppdiodon depends on git, cmake, a C/C++ compiler such as GNU GCC, and on some third-party libraries such as hdf5, lapack (multithreaded version preferred), and chameleon (parallel MPI version, optional).

Example packages on Linux Ubuntu:

  • git: to download diodon
  • cmake: for building the diodon C++ project
  • build-essential: GNU g++ compiler
  • zlib1g-dev: zlib compression library for I/O files management
  • libbz2-dev: bzip2 compression library for I/O files management
  • libhdf5-dev: I/O files management
  • gfortran (optional): GNU Fortran compiler for Chameleon
  • pkg-config (optional): for finding libraries (StarPU, Chameleon) on Unix systems
  • python3 (optional): for generating some figures (points cloud) after execution
  • libmkl-dev or libopenblas-dev + liblapacke-dev: dense linear algebra BLAS/LAPACK routines

Install dependencies packages

Linux Ubuntu 22.04

First, update your package database and install the usual development tools

sudo apt update -y
sudo apt install -y git cmake build-essential zlib1g-dev libbz2-dev gfortran pkg-config python-is-python3

Then install a multithreaded Lapack

# either Intel mkl if available
sudo apt install -y libmkl-dev
# OR you can replace mkl by openblas + lapacke (C interface)
sudo apt install -y libopenblas-dev liblapacke-dev

Then you will have to choose the parallelisation support: either the multithreaded version (threads) or the distributed one (with MPI+threads).

Multithreaded version

To install a multithreaded version without MPI (usage on one single machine)

# for the multithreaded version: hdf5 standard version without mpi is required
sudo apt install -y libhdf5-dev

Then follow the install procedure described below in the Multithreaded version section.

MPI version

To install a distributed version with MPI (usage on several machines at the same time for very large input data)

# for the distributed MPI version: hdf5 (mpi version),
sudo apt install -y libhdf5-mpi-dev   # I/O file management
# starpu and chameleon required for distributed linear algebra
sudo apt install -y libstarpu-dev     # runtime system required by Chameleon
# install chameleon manually (configure it to use the multithreaded Lapack, e.g. mkl)
cd $HOME
wget https://gitlab.inria.fr/api/v4/projects/616/packages/generic/source/v1.2.0/chameleon-1.2.0.tar.gz
tar xvf chameleon-1.2.0.tar.gz
cd chameleon-1.2.0
export CHAMELEON_ROOT=$PWD/install
mkdir -p build && cd build
# if using mkl
cmake .. -DCMAKE_INSTALL_PREFIX=$CHAMELEON_ROOT -DCHAMELEON_USE_MPI=ON -DBLA_VENDOR=Intel10_64lp -DCHAMELEON_KERNELS_MT=ON -DBUILD_SHARED_LIBS=ON
# if using openblas
cmake .. -DCMAKE_INSTALL_PREFIX=$CHAMELEON_ROOT -DCHAMELEON_USE_MPI=ON -DBLA_VENDOR=OpenBLAS -DBUILD_SHARED_LIBS=ON
sudo make -j3 install

Then follow the install procedure described below in the MPI version section.

macOS

Assuming git, gcc and cmake are already installed, you need to install the following: zlib, bzip2, and CBLAS/LAPACKE (the example here uses OpenBLAS but you can install Intel MKL instead).

brew install automake autoconf libtool zlib bzip2 hwloc pkgconfig openblas openmpi
# gcc and g++ symlinks are missing (avoid using the clang version in /usr/bin)
ln -sf /usr/local/bin/gcc-11 /usr/local/bin/gcc
ln -sf /usr/local/bin/g++-11 /usr/local/bin/g++
# use pkg-config .pc files to detect some dependencies
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:/usr/local/opt/openblas/lib/pkgconfig
# cmake checks blas.pc not openblas.pc
sudo cp /usr/local/opt/openblas/lib/pkgconfig/openblas.pc /usr/local/opt/openblas/lib/pkgconfig/blas.pc

Then you will have to choose the parallelisation support: either the multithreaded version (threads) or the distributed one (with MPI+threads).

Multithreaded version

To install a multithreaded version without MPI (usage on one single machine)

# install hdf5
# download a source tarball here https://www.hdfgroup.org/downloads/hdf5/source-code/
tar xvf hdf5-1.12.0.tar.gz
cd hdf5-1.12.0
CFLAGS="-fPIC" CXXFLAGS="-fPIC" ./configure --with-zlib=/usr/local/opt --prefix=/usr/local/ --enable-shared --disable-fortran --disable-static --disable-tests --enable-threadsafe --with-pthread --enable-unsupported
make -j5
sudo make install

Then follow the install procedure described below in the Multithreaded version section.

MPI version

To install a distributed version with MPI (usage on several machines at the same time for very large input data)

# install hdf5
# download a source tarball here https://www.hdfgroup.org/downloads/hdf5/source-code/
tar xvf hdf5-1.12.0.tar.gz
cd hdf5-1.12.0
CC=/usr/local/bin/mpicc CFLAGS="-fPIC" CXXFLAGS="-fPIC" ./configure --with-zlib=/usr/local/opt --prefix=/usr/local/ --enable-shared --disable-fortran --disable-static --disable-tests --enable-threadsafe --with-pthread --enable-unsupported --enable-parallel
make -j5
sudo make install
# install starpu
# download a source tarball here https://files.inria.fr/starpu/
wget https://files.inria.fr/starpu/starpu-1.4.1/starpu-1.4.1.tar.gz
tar xvf starpu-1.4.1.tar.gz
cd starpu-1.4.1/
./configure --enable-blas-lib=none --disable-starpufft --disable-mlr --disable-opencl --disable-cuda --disable-hdf5 --disable-fortran --prefix=/usr/local
make -j5
sudo make install
# install chameleon
cd $HOME
wget https://gitlab.inria.fr/api/v4/projects/616/packages/generic/source/v1.2.0/chameleon-1.2.0.tar.gz
tar xvf chameleon-1.2.0.tar.gz
cd chameleon-1.2.0
export CHAMELEON_ROOT=/usr/local/
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=${CHAMELEON_ROOT} -DCHAMELEON_USE_MPI=ON -DBLA_PREFER_PKGCONFIG=ON -DBUILD_SHARED_LIBS=ON
make -j5
sudo make install

Then follow the install procedure described below in the MPI version section.

Install Diodon with CMake

Main CMake options:

  • DIODON_USE_CHAMELEON (default OFF): enable parallel version with Chameleon
  • DIODON_USE_CHAMELEON_MPI (default ON if DIODON_USE_CHAMELEON is ON): enable parallel MPI version with Chameleon
  • DIODON_USE_INTERNAL_FMR (default ON): use the embedded fmr (in external/) version or an already installed one. To use an already installed FMR, set the fmr_ROOT environment variable or the CMAKE_PREFIX_PATH cmake variable to the root of the FMR installation (see the sketch after this list)
  • DIODON_USE_MKL_AS_BLAS (default OFF): force using Intel MKL for Lapack
  • DIODON_USE_OPENBLAS_AS_BLAS (default OFF): force using OpenBlas for Lapack
  • DIODON_USE_BLIS_AS_BLAS (default OFF): force using BLIS/FLAME for Lapack
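
For instance, a minimal sketch of building against an already installed FMR instead of the embedded one (the installation path is illustrative):

# point CMake to an external FMR installation
export fmr_ROOT=$HOME/fmr/install
cmake .. -DDIODON_USE_INTERNAL_FMR=OFF -DCMAKE_INSTALL_PREFIX=$DIODON_ROOT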

Multithreaded version

# install cppdiodon
cd $HOME
git clone --recursive https://gitlab.inria.fr/diodon/cppdiodon.git
cd cppdiodon
export DIODON_ROOT=$PWD/install
mkdir -p build && cd build
## configure, build and check
# if using mkl
cmake .. -DDIODON_USE_MKL_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$DIODON_ROOT
# if using openblas
cmake .. -DDIODON_USE_OPENBLAS_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$DIODON_ROOT
make VERBOSE=1
ctest -V
# build the doc if doxygen available
make doc
# install
make install

If you want to use your own mkl library, don't forget to source the dedicated environment. For example:

source /opt/intel/mkl/bin/mklvars.sh intel64

MPI version

# do not forget to install packages such as mkl, hdf5-mpi, starpu and chameleon
# install cppdiodon
cd $HOME
git clone --recursive https://gitlab.inria.fr/diodon/cppdiodon.git
cd cppdiodon
export DIODON_ROOT=$PWD/install
mkdir -p build && cd build
## configure, build and check
# if using mkl
cmake .. -DDIODON_USE_CHAMELEON=ON -DDIODON_USE_MKL_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$DIODON_ROOT
# if using openblas
cmake .. -DDIODON_USE_CHAMELEON=ON -DDIODON_USE_OPENBLAS_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$DIODON_ROOT
make VERBOSE=1
ctest -V
# build the doc if doxygen available
make doc
# install
make install

Docker image

We provide a docker image with Diodon installed with Chameleon and MPI. Please refer to the dockerfiles in tools/docker/dockerfile-ubuntu for the recipe. The docker image name is registry.gitlab.inria.fr/diodon/cppdiodon/ubuntu/chameleonmpi and you can use it as follows:

docker run -it registry.gitlab.inria.fr/diodon/cppdiodon/ubuntu/chameleonmpi
# or registry.gitlab.inria.fr/diodon/cppdiodon/ubuntu/chameleon
# or registry.gitlab.inria.fr/diodon/cppdiodon/ubuntu/lapack
# inside the docker image
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
# execute with lapack in shared memory
mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -t 2 -r 1000 -off h5 -ocf atlas_guyane_trnH.mds.h5 --blas
# execute with chameleon
mpiexec -np 2 mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -t 2 -r 1000 -off h5 -ocf atlas_guyane_trnH.mds.h5
# one can copy back the resulting file on the host
# note the container id
echo $HOSTNAME #for example 5689c5d5c633
exit
docker cp 5689c5d5c633:atlas_guyane_trnH.mds.h5 .

Executables usage

Diodon provides method executables to be used on data (e.g. classical individuals/observations tables, contingency tables, distance matrices) stored in text files, possibly compressed (.txt, .csv, .gz, .bz2), or HDF5 file(s) (.h5).

The executables are built in the src/ subdirectory of your build directory; if you have installed cppdiodon using make install, they are located in the bin/ subdirectory of the installation root.

Each executable should be run with some parameters: the path to the HDF5 file, the dataset name (default is 'distance'), and other parameters depending on the executable (method).
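
For instance, a minimal sketch assuming an installation under $DIODON_ROOT and an illustrative input path:

# run the MDS on a distance matrix stored in an HDF5 file, reading the 'distance' dataset and saving the results
$DIODON_ROOT/bin/mds -if /path/to/distances.h5 -ids distance -of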

Environment variables that can be used during execution

The following Diodon and FMR environment variables can be used to enable some features at runtime (a usage sketch follows the list):

  • DIODON_TIME: if set to 1, enables timing of some Diodon functions
  • FMR_TIME: if set to 1, enables timing of some FMR functions
  • FMR_CHAMELEON_NUM_THREADS: number of threads (cpu workers) to be used by chameleon
  • FMR_CHAMELEON_NUM_CUDAS: number of gpus (cuda devices) to be used by chameleon
  • FMR_CHAMELEON_ASYNC: if set to 1, enables the chameleon async interface (avoids task synchronization between some algorithms)
  • FMR_CHAMELEON_READH5_BYTILE: if set to 1, read HDF5 files directly by chameleon tiles instead of the default StarPU handmade task algorithm (read by panels of rows + build tiles + tile migration to get the 2D block-cyclic distribution)
  • FMR_CHAMELEON_READH5_BYTILE_DISTRIB_1D: to be used together with FMR_CHAMELEON_READH5_BYTILE; if set to 1, read by tile but follow a 1D distribution of tiles over MPI processes instead of the 2D block-cyclic one
  • FMR_CHAMELEON_READH5_SYM: if set to 1, read HDF5 files considering the matrix is symmetric in main memory
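
A minimal sketch of setting some of these variables before launching an executable (the executable arguments and input path are only illustrative):

# enable timing in Diodon and FMR
export DIODON_TIME=1
export FMR_TIME=1
# let chameleon use 16 cpu workers per MPI process
export FMR_CHAMELEON_NUM_THREADS=16
mpiexec -np 2 mds -if /path/to/distances.h5 -svdm rsvd -r 1000 -of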

Principal Component Analysis - pca program

# see parameters with --help option
# -if to give the path to the file, HDF5 file example
./src/pca -if ../data/atlas_guyane_trnH.h5 -iff h5
# .csv file example with " " (space) as delimiters
./src/pca -if ../data/atlas_guyane_trnH.csv -iff csv -ifd " "
# -ids to specify the HDF5 dataset name to consider as the input matrix to read (if -iff is h5)
# -iff to specify the input file format
# -ifd to specify the input file delimiters (if ascii file and not h5)
# -ict to specify whether or not to read columns tags (if there are headers for columns in the first line of the file)
# -irt to specify whether or not to read rows tags (if there are tags for rows in the first column of the file)
# -t to specify the number of threads to be used (by Intel MKL/OpenBlas and Chameleon)
# -pcam to specify a pretreatment method (either 'standard' = centered-scaled columnwise or 'bicentering', 'row_centering', 'col_centering' or 'scaling')
# -svdm to choose between 'svd' (default) or 'rsvd'. Prefer rsvd when the data are too big (large number of columns).
# -r to give the prescribed rank in the rsvd method
# -of to enable saving results (points cloud and eigen values) in file(s) (default is inputfilename.pca.h5, else use options -off, -ocf, -oef, -oevf, etc).
# -off format of the output file ("h5", "txt", "csv", "gz", "bz2")
# -ocf path of the file where to store the points cloud of the pca
# -oef path of the file where to store the eigen values of the pca (can be the same file as the cloud in h5 format)
# -oevf path of the file where to store the eigen vectors of the pca (can be the same file as the cloud in h5 format)
# -v 1 to enable verbose mode
./src/pca -if ../data/pca_template_withnames.txt -t 2 -iff txt -ifd \t -ict -irt -svdm rsvd -r 1000 -of
./src/pca -if ../data/pca_template_withnames.txt -t 2 -iff txt -ifd \t -ict -irt -svdm rsvd -r 1000 -of -off h5 -v 1
# with the chameleon version, one can use MPI and perform pca on several nodes
mpiexec -n 2 ./src/pca -if ../data/atlas_guyane_trnH.h5 -ids distance -t 2 -svdm rsvd -r 1000 -of

Principal Component Analysis with metrics - pcamet program

It is the same usage as the pca program, except that the input file for A (the main data) is given with -ifa, the metrics on rows M with -ifm, and the metrics on columns Q with -ifq.

The input dataset names, if input file is in hdf5 format, can be chosen with -idsa, -idsm and -idsq.

The metrics M, Q must be diagonal matrices stored as a vector, M of size (m, 1) and Q of size (n, 1) for A of size (m, n).

See parameters with the --help option.
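
For HDF5 inputs, a minimal sketch (file paths and dataset names are illustrative):

# A is the main data, M and Q the diagonal metrics stored as vectors
pcamet -ifa A.h5 -idsa values -ifm M.h5 -idsm values -ifq Q.h5 -idsq values -of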

Correspondence Analysis - coa program

It is the same usage as the pca program except there is no -pcam option.

See parameters with the --help option.

Singular Value Decomposition - svd program

It is the same usage as the pca program except there is no -pcam option.

The output filenames and dataset names (if output file is in hdf5 format) can be chosen with -ouf, -osf, -ovf, -ouds, -osds, -ovds (for U, S, and Vt respectively).

See parameters with the --help option.
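
For example, a sketch storing U, S and Vt in separate HDF5 files (the output file names and dataset names are illustrative):

./src/svd -if ../data/pca_template_withnames.h5 -iff h5 -ids values -of -off h5 -ouf u.h5 -osf s.h5 -ovf vt.h5 -ouds U -osds S -ovds Vt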

Multidimensional Scaling - mds program

# see parameters with --help option
# -if to give the path to the file, HDF5 file example
./src/mds -if ../data/atlas_guyane_trnH.h5 -iff h5
# .csv file example with " " (space) as delimiters
./src/mds -if ../data/atlas_guyane_trnH.csv -iff csv -ifd " "
# -ids to specify the HDF5 dataset name to consider as the input matrix to read (if -iff is h5)
# -t to specify the number of threads to be used (by Intel MKL/OpenBlas and Chameleon)
# -svdm to choose between 'svd' (default) or 'rsvd'. Prefer rsvd when the data are too big (large number of columns).
# -r to give the prescribed rank in the rsvd method
# -of to enable saving results (points cloud and eigen values) in file(s) (default is inputfilename.mds.h5, else use options -off, -ocf, -oef, etc).
# -off format of the output file ("h5", "txt", "csv", "gz", "bz2")
# -ocf path of the file where to store the points cloud of the mds
# -oef path of the file where to store the eigen values of the mds (can be the same file as the cloud in h5 format)
# -ncs to choose the number of columns to keep in the cloud points
# -v 1 to enable verbose mode
./src/mds -if ../data/atlas_guyane_trnH.h5 -ids distance -t 2 -svdm rsvd -r 1000 -of
./src/mds -if ../data/atlas_guyane_trnH.h5 -ids distance -t 2 -svdm rsvd -r 1000 -of -ocf mds.h5 -oef mds.h5 -off h5 -ncs 3
# with the chameleon version, one can use MPI and perform mds on several nodes
mpiexec -n 2 ./src/mds -if ../data/atlas_guyane_trnH.h5 -ids distance -t 2 -r 1000 -of

Input/Output file parameters examples

For all method executables, the input file given with the -if argument can be either an ascii file (.csv, .txt, .gz, .bz2) or an HDF5 one (.h5, the default). For ascii files the format must be given, as well as the delimiters used to separate values in the file, see the -iff and -ifd arguments. Indicate whether or not the input file contains tags for rows and columns as the first row/column with the -ict and -irt arguments. To specify the HDF5 dataset names of the row and column tags to read, use -ictds and -irtds.

Usage examples of the input file parameters -if, -iff, -ifd, -ift, -ids:

git clone https://gitlab.inria.fr/diodon/data4tests.git
export DIODON_DATA4TESTS=$PWD/data4tests
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv -iff csv -ifd ";"
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv -iff csv -ifd ";" -ict -irt
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_nonames.txt -iff txt -ifd $'\t'
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_withnames.txt -iff txt -ifd $'\t' -ict -irt
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv.gz -iff gz -ifd ";"
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv.gz -iff gz -ifd ";" -ict -irt
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv.bz2 -iff bz2 -ifd ";"
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv.bz2 -iff bz2 -ifd ";" -ict -irt
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_nonames.h5 -iff h5 -ids values
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_withnames.h5 -iff h5 -ids values -ict -irt

For all method executables, the output file(s) given with the -o?f argument (? depends on the method, e.g. c for cloud, e for eigen, etc., see the --help option of the executable) can be either an ascii file (.csv, .txt, .gz, .bz2) or an HDF5 one (.h5, the default). For ascii files the format must be given, as well as the delimiters used to separate values in the file, see the -off and -ofd arguments. Indicate whether or not to save row and column tags with the -ict and -irt arguments. To specify the HDF5 dataset names of the row and column tags to save, use -octds and -ortds. To restrict the number of columns of the matrix to save, use -ncs. To restrict the number of digits of the values to save, use -nds.

Usage examples of the output file parameters -of, -off, -ofd, -ods, -octds, -ortds, -ncs, -nds:

git clone https://gitlab.inria.fr/diodon/data4tests.git
export DIODON_DATA4TESTS=$PWD/data4tests
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv -iff csv -ifd ";" -of output_nn.csv -off csv -ofd ";"
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv -iff csv -ifd ";" -of output_wn.csv -off csv -ofd ";" -ict -irt
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_nonames.txt -iff txt -ifd $'\t' -of output_nn.txt -off txt -ofd $'\t'
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.txt -iff txt -ifd $'\t' -of output_wn.txt -off txt -ofd $'\t' -ict -irt
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv.gz -iff gz -ifd ";" -of output_nn.gz -off gz -ofd ";"
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv.gz -iff gz -ifd ";" -of output_wn.gz -off gz -ofd ";" -ict -irt
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv.bz2 -iff bz2 -ifd ";" -of output_nn.bz2 -off bz2 -ofd ";"
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv.bz2 -iff bz2 -ifd ";" -of output_wn.bz2 -off bz2 -ofd ";" -ict -irt
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_nonames.h5 -iff h5 -ids values -of output_nn.h5 -off h5 -ods values
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.h5 -iff h5 -ids values -of output_wn.h5 -off h5 -ods values -ict -irt -octds "cols" -ortds "rows"
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.h5 -iff h5 -ids values -of output_wn_cs3.h5 -off h5 -ods values -ict -irt -octds "cols" -ortds "rows" -ncs 3

Tutorial on a local computer

Setup your environment: either install the libraries yourself following the Installation section, or use the Ubuntu 22.04 docker image provided with the project

docker run -it registry.gitlab.inria.fr/diodon/cppdiodon/ubuntu/chameleonmpi

MDS examples

# see the available options
mds --help
# first example: we read the distance matrix in atlas_guyane_trnH.h5 and save the eigenvalues and points cloud in atlas_guyane_trnH.mds.h5
mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of
# notice that the default input file format is hdf5 (h5)
# and default dataset read is 'distance', you can change this with -ids
mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of -ids distance
# a file /root/cppdiodon/data/atlas_guyane_trnH.mds.h5 is created
# you can use h5dump to see the results
h5dump -H /root/cppdiodon/data/atlas_guyane_trnH.mds.h5
h5dump -d eigenvalues /root/cppdiodon/data/atlas_guyane_trnH.mds.h5
h5dump -d points /root/cppdiodon/data/atlas_guyane_trnH.mds.h5
# you can make an image of the points cloud (dataset must be called 'points' by default here)
python3 /root/cppdiodon/tools/post/nuage_png.py /root/cppdiodon/data/atlas_guyane_trnH.mds.h5

By default the SVD method used is the full SVD. You can use a Randomized SVD instead, with a prescribed rank that defaults to M/10, where M is the size of the input distance matrix.

You can choose to use RSVD with -svdm rsvd and you can change the prescribed rank with the -r flag, e.g.

mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of -svdm rsvd -r 1500

You can also force the full SVD method instead of the RSVD with the flag -svdm svd, e.g.

mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of -svdm svd

By default the number of threads used is 2. You can use more or fewer threads with the flag -t, e.g.

mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of -t 4

You can limit the number of columns saved in the output matrices with the flag -ncs

mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of -ncs 2
# look at the first 5 rows and first 2 columns of the 'points' dataset
h5dump -A 0 -d points -c "5,2" /root/cppdiodon/data/atlas_guyane_trnH.mds.h5

PCA examples

pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of
# see points cloud Y in /root/cppdiodon/data/pca_template_withnames_cloud.pca.txt
# eigen values L in /root/cppdiodon/data/pca_template_withnames_eigenvalues.pca.txt
# eigen vectors V in /root/cppdiodon/data/pca_template_withnames_eigenvectors.pca.txt

Notice that the default file format is h5, so it must be changed here with -iff txt. For txt files the default delimiter is a space; you can change this with the flag -ifd, here -ifd $'\t' means tab delimiters. In addition, when an input file provides tags to identify rows and columns, one must use -irt for rows and -ict for columns to read the tags. The tags are later saved in the output file together with the results.

The default pre-treatment is centering (columnwise) and scaling; this can be changed with the flag -pcam, e.g.

# centered-scaled columnwise (default)
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -pcam standard
# double-averaged
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -pcam bicentering
# centered columnwise
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -pcam col_centering
# centered rowwise
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -pcam row_centering
# scaled columnwise
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -pcam scaling

For ascii file output one can change the number of digits to save with -nds

pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -nds 2
more /root/cppdiodon/data/pca_template_withnames_cloud.pca.txt

The input and output file formats can differ

pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -off h5
h5dump /root/cppdiodon/data/pca_template_withnames.pca.h5
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -off csv -ofd ";"
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -off bz2

The output dataset names to use can be specified with -ocds, -oeds, -oevds

pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -off h5 -ocds riri -oeds fifi -oevds loulou
h5dump -A 0 -d riri -d fifi -d loulou /root/cppdiodon/data/pca_template_withnames.pca.h5

PCAmet examples

pcamet -ifa /root/cppdiodon/data/pca_template_withnames.txt -ifm /root/cppdiodon/data/pca_template_metrics_m.txt -ifq /root/cppdiodon/data/pca_template_metrics_q.txt -iff txt -ifd $'\t' -irt -ict -of
# see points cloud Y in /root/cppdiodon/data/pca_template_withnames_cloud.pcamet.txt
# eigen values L in /root/cppdiodon/data/pca_template_withnames_eigenvalues.pcamet.txt
# eigen vectors V in /root/cppdiodon/data/pca_template_withnames_eigenvectors.pcamet.txt

CoA examples

coa -if /root/cppdiodon/data/coa4tests_nonames.txt -iff txt -ifd $'\t' -of
# see Yr in /root/cppdiodon/data/coa4tests_nonames_yr.coa.txt
# Yc in /root/cppdiodon/data/coa4tests_nonames_yc.coa.txt
# eigen values L in /root/cppdiodon/data/coa4tests_nonames_eigenvalues.coa.txt

SVD examples

svd -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -ict -irt -of
# see U in /root/cppdiodon/data/pca_template_withnames_u.svd.txt
# S in /root/cppdiodon/data/pca_template_withnames_s.svd.txt
# Vt in /root/cppdiodon/data/pca_template_withnames_vt.svd.txt

Notice that for the svd program the default method is the Randomized SVD; to use a full SVD instead, use -svdm svd.
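
For instance, to force the full SVD on the same example:

svd -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -ict -irt -of -svdm svd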

Tutorial on the PlaFRIM supercomputer

Here is an example usage on the PlaFRIM supercomputer hosted at Inria Bordeaux.

Example using modules

ssh plafrim
# to avoid getting some guix paths in the current environment
export PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/dell/srvadmin/bin
unset CMAKE_PREFIX_PATH
unset CPLUS_INCLUDE_PATH
unset LIBRARY_PATH
unset CPATH
unset C_INCLUDE_PATH
unset PYTHONPATH
# load plafrim modules
module purge
module add build/cmake/3.15.3
module add compiler/gcc/9.3.0
module add hardware/hwloc/2.7.0
module add mpi/openmpi/4.0.3
module add runtime/starpu/1.3.7/mpi
module add io/hdf5/openmpi@4.0.2/1.10.5
module add linalg/mkl/2020_update4
cd $HOME
mkdir -p diodon
mkdir -p diodon/install
export INSTALLDIR=$HOME/diodon/install
# install chameleon
cd $HOME/diodon
wget https://gitlab.inria.fr/api/v4/projects/616/packages/generic/source/v1.2.0/chameleon-1.2.0.tar.gz
tar xvf chameleon-1.2.0.tar.gz
cd chameleon-1.2.0
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALLDIR -DCHAMELEON_USE_MPI=ON -DBLA_VENDOR=Intel10_64lp -DCHAMELEON_KERNELS_MT=ON -DBUILD_SHARED_LIBS=ON
make -j10 install
# install diodon
cd $HOME/diodon
git clone --recursive https://gitlab.inria.fr/diodon/cppdiodon.git
cd cppdiodon
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALLDIR -DCMAKE_PREFIX_PATH=$INSTALLDIR -DCMAKE_NO_SYSTEM_FROM_IMPORTED=ON -DDIODON_USE_CHAMELEON=ON -DDIODON_USE_MKL_AS_BLAS=ON
make -j10 install
# use cppdiodon in parallel on several nodes
salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive --ntasks-per-node=1 --threads-per-core=1
# make sure we use the same environment as for building
export PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/dell/srvadmin/bin
unset CMAKE_PREFIX_PATH
unset CPLUS_INCLUDE_PATH
unset LIBRARY_PATH
unset CPATH
unset C_INCLUDE_PATH
unset PYTHONPATH
module purge
module add build/cmake/3.15.3
module add compiler/gcc/9.3.0
module add hardware/hwloc/2.7.0
module add mpi/openmpi/4.0.3
module add runtime/starpu/1.3.7/mpi
module add io/hdf5/openmpi@4.0.2/1.10.5
module add linalg/mkl/2020_update4
export INSTALLDIR=$HOME/diodon/install
export DIODON_TIME=1
export FMR_TIME=1
# check mpi
mpiexec hostname
# mds, small testcase n=1502
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/mds -t 2 -if $HOME/diodon/cppdiodon/data/atlas_guyane_trnH.h5 -of
# mds, large testcase n=100000 (you have to be a member of the 'diodon' project on plafrim to get the matrices)
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/mds -t 34 -if /beegfs/diodon/gordon/L1-L10/L6.h5 -svdm rsvd -r 5000 -of -v 1
# CPU time for MDS (on L6, 2 bora nodes and rank=5000) should be around: 300s (5 min)
# Quality estimates should be around:
# [diodon] rsvd quality estimation ||Psy_r|| / ||G||= 0.999994
# [diodon] mds quality estimation ||Psy^+_r|| / ||G||= 0.998018
# results are saved into /beegfs/diodon/gordon/L1-L10/L6.mds.h5 in 'eigenvalues' and 'points' datasets
# the output file is saved in a directory shared by the group diodon, allow the group to modify/update the file
chmod g+w /beegfs/diodon/gordon/L1-L10/L6.mds.h5
# posttreatment
module add language/python/3.8.0
python3 -m pip install --user virtualenv
python3 -m venv env
source env/bin/activate
pip install numpy matplotlib h5py
python3 $HOME/diodon/cppdiodon/tools/post/nuage_png.py /beegfs/diodon/gordon/L1-L10/L6.mds.h5
# L6.png should be generated
# one can also perform benchmarks, by using the testing drivers, with values randomly generated for the input matrix
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/testmds -t 34 -gen 1 -rs 10000 -rk 1000
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/testpca -t 34 -gen 1 -rsm 10000 -rsn 1000
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/testpcamet -t 34 -gen 1 -rsm 10000 -rsn 1000
# warning: the method to build the contingency table, input matrix of coa, is very costly, use small sizes (m,n)
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/testcoa -t 34 -gen 1 -rsm 1000 -rsn 100
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/testsvd -t 34 -gen 1 -rsm 10000 -rsn 1000

Example using GNU Guix

GNU Guix is available on PlaFRIM and can be used to install cppdiodon.

It is necessary first to add a channel to get new packages coming from the GUIX-HPC project. Create a ~/.config/guix/channels.scm file with the following snippet (on PlaFRIM):

(cons (channel
        (name 'guix-hpc-non-free)
        (url "https://gitlab.inria.fr/guix-hpc/guix-hpc-non-free.git"))
      %default-channels)

Update and check Guix

# update packages recipes
guix pull
# check guix usage
guix build hello

Then install and use cppdiodon on PlaFRIM with Guix

# software environment setup
# allocate resources with slurm
salloc --nodes=2 --constraint bora --ntasks-per-node=1 --exclusive
guix shell --pure --preserve=SLURM cppdiodon-mpi --without-tests=hdf5-parallel-openmpi openmpi openssh slurm coreutils inetutils gawk grep sed -- /bin/bash --norc
# execute a MDS
mpiexec --bind-to board mds -t 34 -if /beegfs/diodon/gordon/L1-L10/L6.h5 -of -svdm rsvd -r 5000 -ncs 3 -v 1
# results are saved into /beegfs/diodon/gordon/L1-L10/L6.mds.h5 in 'eigenvalues' and 'points' datasets
# the output file is saved in a directory shared by the group diodon, allow the group to modify/update the file
chmod g+w /beegfs/diodon/gordon/L1-L10/L6.mds.h5

Source code structure

  • data/: input data useful for testing
  • docs/: doxygen documentation
  • examples/: API usage examples
  • external/: third party library FMR
  • include/: cppdiodon API
  • src/: cppdiodon main executables for calling data analysis methods (pca, coa, mds, etc.)
  • tests/: executables for functional testing
  • tools/: various scripts used for: continuous integration with gitlab-ci, generating docker image, posttreatments of methods, sonarqube analysis, ...

API

API usage examples