Parallel C++ library for Multivariate Data Analysis of large datasets.
Overview
Diodon provides executables and functions to compute multivariate data analyses such as:
- Singular Value Decomposition (SVD),
- Principal Component Analysis (PCA) and variants (with different pre-treatments),
- Multidimensional Scaling (MDS),
- Correspondence Analysis (CoA),
- Canonical Correlation Analysis (CCA, future work),
- Multiple Correspondence Analysis (MCoA, future work).
Please read this note about the algorithms.
All these methods rely on a Singular Value Decomposition (SVD) of a 2D matrix. For small matrices the SVD can be computed directly with a sequential or multi-threaded LAPACK solver such as OpenBLAS or Intel MKL. For large matrices the exact SVD becomes too time-consuming, so we use a Randomized Singular Value Decomposition (rSVD) instead, whose implementation is provided by the FMR library. FMR can compute the rSVD on parallel shared and distributed memory machines, using adequate parallel dense linear algebra routines internally: OpenBLAS or Intel MKL on a shared memory node, and Chameleon for distributed memory nodes (MPI).
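The rSVD idea can be illustrated with a short NumPy sketch of the generic randomized range-finder algorithm (an illustration only, not FMR's actual implementation):

```python
import numpy as np

def rsvd(A, rank, n_oversample=10, n_iter=2):
    """Randomized SVD: project A onto a low-dimensional random
    subspace, then compute an exact SVD of the small projection."""
    n = A.shape[1]
    rng = np.random.default_rng(0)
    Omega = rng.standard_normal((n, rank + n_oversample))  # random test matrix
    Y = A @ Omega
    for _ in range(n_iter):          # power iterations sharpen the subspace
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)           # orthonormal basis of (approx.) range(A)
    B = Q.T @ A                      # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]

# a matrix of exact rank 20 is recovered almost to machine precision
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 200)) @ np.diag(np.r_[np.linspace(10, 1, 20), np.zeros(180)])
U, s, Vt = rsvd(A, rank=20)
err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(err)  # tiny, since A has exact rank 20
```

The prescribed rank plays the same role as the `-r` flag of the Diodon executables: only the leading part of the spectrum is computed, at a fraction of the cost of a full SVD.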
MDS result on a test case of size 1043192: representation of the points on the first two axes.
Performance
MDS on Occigen with parallel HDF5 I/Os
MDS performance results example on a square distance matrix of size N, with an rSVD of prescribed rank 10000, on the Occigen supercomputer. The input matrices represent distances between individuals in samples of diatoms from Lake Léman, courtesy of INRAE.
CPU time of the MDS as a function of the test case.
| Method | N | # nodes | CPU Time (s) |
|--------|---|---------|--------------|
| MDS | 99594 | 1 | 1047 |
| MDS | 270983 | 6 | 1099 |
| MDS | 426548 | 16 | 1054 |
| MDS | 616644 | 32 | 1195 |
| MDS | 1043192 | 100 | 1401 |
MDS and PCA on PlaFRIM with randomly generated data
MDS and PCA performance results example on randomly generated matrices of size N, with an rSVD of prescribed rank 10000, on the PlaFRIM supercomputer.
CPU time of the MDS as a function of matrix size on Bora nodes.
| Method | N | # nodes | CPU Time (s) |
|--------|---|---------|--------------|
| MDS | 100000 | 1 | 477 |
| MDS | 200000 | 2 | 701 |
| MDS | 400000 | 8 | 847 |
| MDS | 600000 | 16 | 1165 |
| MDS | 800000 | 32 | 1323 |
CPU time of the PCA as a function of matrix size on Bora nodes.
| Method | M | N | # nodes | CPU Time (s) |
|--------|---|---|---------|--------------|
| PCA | 500000 | 10000 | 1 | 545 |
| PCA | 1000000 | 10000 | 2 | 527 |
| PCA | 2000000 | 10000 | 4 | 604 |
| PCA | 4000000 | 10000 | 8 | 771 |
| PCA | 8000000 | 10000 | 16 | 1206 |
Download
Diodon source code is available on Gitlab Inria.
Clone Diodon using https (public access)
Features
Supports
- zlib and bzip2 (required)
- HDF5 (required); prefer the parallel MPI version if Chameleon is enabled
- LAPACK (required); multithreaded Intel MKL or OpenBLAS recommended
- MPI + Chameleon + StarPU (optional)
Installation examples of these libraries are given hereafter.
Data analysis algorithms
Installation
cppdiodon depends on git, cmake, a C/C++ compiler such as GNU GCC, and on some third-party libraries such as HDF5, LAPACK (multithreaded version preferred) and Chameleon (parallel MPI version, optional).
Example packages on Linux Ubuntu:
- git: to download diodon
- cmake: for building the diodon C++ project
- build-essential: GNU g++ compiler
- zlib1g-dev: zlib compression library for I/O files management
- libbz2-dev: bzip2 compression library for I/O files management
- libhdf5-dev: I/O files management
- gfortran (optional): GNU Fortran compiler for Chameleon
- pkg-config (optional): for finding libraries (StarPU, Chameleon) on Unix systems
- python3 (optional): for generating some figures (points cloud) after execution
- libmkl-dev or libopenblas-dev + liblapacke-dev: dense linear algebra BLAS/LAPACK routines
Install dependencies packages
Linux Ubuntu 22.04
First, update your package lists and install the classical development tools
sudo apt update -y
sudo apt install -y git cmake build-essential zlib1g-dev libbz2-dev gfortran pkg-config python-is-python3
Then install a multithreaded Lapack
# either Intel mkl if available
sudo apt install -y libmkl-dev
# OR you can replace mkl by openblas + lapacke (C interface)
sudo apt install -y libopenblas-dev liblapacke-dev
Then you will have to choose the parallelisation support: either the multithreaded version (threads) or the distributed one (with MPI+threads).
Multithreaded version
To install a multithreaded version without MPI (usage on one single machine)
# for the multithreaded version: hdf5 standard version without mpi is required
sudo apt install -y libhdf5-dev
Then follow the Multithreaded version install procedure described hereafter.
MPI version
To install a distributed version with MPI (usage on several machines at the same time for very large input data)
# for the distributed MPI version: hdf5 (mpi version),
sudo apt install -y libhdf5-mpi-dev
# starpu and chameleon required for distributed linear algebra
sudo apt install -y libstarpu-dev
# install chameleon manually (indicate to use the multithreaded Lapack version for mkl)
cd $HOME
wget https:
tar xvf chameleon-1.2.0.tar.gz
cd chameleon-1.2.0
export CHAMELEON_ROOT=$PWD/install
mkdir -p build && cd build
# if using mkl
cmake .. -DCMAKE_INSTALL_PREFIX=$CHAMELEON_ROOT -DCHAMELEON_USE_MPI=ON -DBLA_VENDOR=Intel10_64lp -DCHAMELEON_KERNELS_MT=ON -DBUILD_SHARED_LIBS=ON
# if using openblas
cmake .. -DCMAKE_INSTALL_PREFIX=$CHAMELEON_ROOT -DCHAMELEON_USE_MPI=ON -DBLA_VENDOR=OpenBLAS -DBUILD_SHARED_LIBS=ON
sudo make -j3 install
Then follow the MPI version install procedure described hereafter.
MacOS X
Considering git, gcc and cmake are already installed you need to install the following: zlib, bzip2, CBLAS/LAPACKE (example here with OpenBLAS but you can install Intel MKL).
brew install automake autoconf libtool zlib bzip2 hwloc pkgconfig openblas openmpi
# gcc and g++ are missing (avoid using the clang version in /usr/bin)
ln -sf /usr/local/bin/gcc-11 /usr/local/bin/gcc
ln -sf /usr/local/bin/g++-11 /usr/local/bin/g++
# use pkg-config .pc files to detect some dependencies
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:/usr/local/opt/openblas/lib/pkgconfig
# cmake checks blas.pc not openblas.pc
sudo cp /usr/local/opt/openblas/lib/pkgconfig/openblas.pc /usr/local/opt/openblas/lib/pkgconfig/blas.pc
Then you will have to choose the parallelisation support: either the multithreaded version (threads) or the distributed one (with MPI+threads).
Multithreaded version
To install a multithreaded version without MPI (usage on one single machine)
# install hdf5
# download a source tarball here https:
tar xvf hdf5-1.12.0.tar.gz
cd hdf5-1.12.0
CFLAGS="-fPIC" CXXFLAGS="-fPIC" ./configure --with-zlib=/usr/local/opt --prefix=/usr/local/ --enable-shared --disable-fortran --disable-static --disable-tests --enable-threadsafe --with-pthread --enable-unsupported
make -j5
sudo make install
Then follow the Multithreaded version install procedure described hereafter.
MPI version
To install a distributed version with MPI (usage on several machines at the same time for very large input data)
# install hdf5
# download a source tarball here https:
tar xvf hdf5-1.12.0.tar.gz
cd hdf5-1.12.0
CC=/usr/local/bin/mpicc CFLAGS="-fPIC" CXXFLAGS="-fPIC" ./configure --with-zlib=/usr/local/opt --prefix=/usr/local/ --enable-shared --disable-fortran --disable-static --disable-tests --enable-threadsafe --with-pthread --enable-unsupported --enable-parallel
make -j5
sudo make install
# install starpu
# download a source tarball here https:
wget https:
tar xvf starpu-1.4.1.tar.gz
cd starpu-1.4.1/
./configure --enable-blas-lib=none --disable-starpufft --disable-mlr --disable-opencl --disable-cuda --disable-hdf5 --disable-fortran --prefix=/usr/local
make -j5
sudo make install
# install chameleon
cd $HOME
wget https:
tar xvf chameleon-1.2.0.tar.gz
cd chameleon-1.2.0
export CHAMELEON_ROOT=/usr/local/
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=${CHAMELEON_ROOT} -DCHAMELEON_USE_MPI=ON -DBLA_PREFER_PKGCONFIG=ON -DBUILD_SHARED_LIBS=ON
make -j5
sudo make install
Then follow the MPI version install procedure described hereafter.
Install Diodon with CMake
Main CMake options:
- DIODON_USE_CHAMELEON (default OFF): enable parallel version with Chameleon
- DIODON_USE_CHAMELEON_MPI (default ON if DIODON_USE_CHAMELEON is ON): enable parallel MPI version with Chameleon
- DIODON_USE_INTERNAL_FMR (default ON): use the embedded FMR version (in external/) or an already installed one. To use an already installed FMR, set the fmr_ROOT environment variable or the CMAKE_PREFIX_PATH CMake variable to the root of the FMR installation
- DIODON_USE_MKL_AS_BLAS (default OFF): force using Intel MKL for Lapack
- DIODON_USE_OPENBLAS_AS_BLAS (default OFF): force using OpenBlas for Lapack
- DIODON_USE_BLIS_AS_BLAS (default OFF): force using BLIS/FLAME for Lapack
Multithreaded version
# install cppdiodon
cd $HOME
git clone --recursive https:
cd cppdiodon
export DIODON_ROOT=$PWD/install
mkdir -p build && cd build
## configure, build and check
# if using mkl
cmake .. -DDIODON_USE_MKL_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$DIODON_ROOT
# if using openblas
cmake .. -DDIODON_USE_OPENBLAS_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$DIODON_ROOT
make VERBOSE=1
ctest -V
# build the doc if doxygen available
make doc
# install
make install
If you want to use your own MKL library, don't forget to source the dedicated environment script. For example:
source /opt/intel/mkl/bin/mklvars.sh intel64
MPI version
# do not forget to install packages such as mkl, hdf5-mpi, starpu and chameleon
# install cppdiodon
cd $HOME
git clone --recursive https:
cd cppdiodon
export DIODON_ROOT=$PWD/install
mkdir -p build && cd build
## configure, build and check
# if using mkl
cmake .. -DDIODON_USE_CHAMELEON=ON -DDIODON_USE_MKL_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$DIODON_ROOT
# if using openblas
cmake .. -DDIODON_USE_CHAMELEON=ON -DDIODON_USE_OPENBLAS_AS_BLAS=ON -DCMAKE_INSTALL_PREFIX=$DIODON_ROOT
make VERBOSE=1
ctest -V
# build the doc if doxygen available
make doc
# install
make install
Docker image
We provide a docker image with Diodon installed with Chameleon and MPI. Please refer to the dockerfiles in tools/docker/dockerfile-ubuntu for the recipe. The docker image name is registry.gitlab.inria.fr/diodon/cppdiodon/ubuntu/chameleonmpi and you can use it as follows:
docker run -it registry.gitlab.inria.fr/diodon/cppdiodon/ubuntu/chameleonmpi
# or registry.gitlab.inria.fr/diodon/cppdiodon/ubuntu/chameleon
# or registry.gitlab.inria.fr/diodon/cppdiodon/ubuntu/lapack
# inside the docker image
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
# execute with lapack in shared memory
mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -t 2 -r 1000 -off h5 -ocf atlas_guyane_trnH.mds.h5 --blas
# execute with chameleon
mpiexec -np 2 mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -t 2 -r 1000 -off h5 -ocf atlas_guyane_trnH.mds.h5
# one can copy the resulting file back to the host
# note the container id
echo $HOSTNAME #for example 5689c5d5c633
exit
docker cp 5689c5d5c633:atlas_guyane_trnH.mds.h5 .
Diodon classes and routines.
Executables usage
Diodon provides method executables to be used on data (e.g. classical individuals/observations tables, contingency tables, distance matrices) stored in ASCII text files (.txt, .csv, .gz, .bz2) or HDF5 file(s) (.h5).
The executables are built in the src/ subdirectory of your build directory, or, if you have installed cppdiodon using make install, in the bin/ subdirectory of the installation root.
Each executable should be run with some parameters: the path to the HDF5 file, the dataset name (default is 'distance') and other parameters depending on the executable (method).
Environment variables that can be used during execution
The following Diodon and FMR environment variables can be used to enable some features at runtime:
- DIODON_TIME: if set to 1, enables timing of some Diodon functions
- FMR_TIME: if set to 1, enables timing of some FMR functions
- FMR_CHAMELEON_NUM_THREADS: number of threads (cpu workers) to be used by chameleon
- FMR_CHAMELEON_NUM_CUDAS: number of gpus (cuda devices) to be used by chameleon
- FMR_CHAMELEON_ASYNC: if set to 1, enables the Chameleon asynchronous interface (avoids task synchronizations between some algorithms)
- FMR_CHAMELEON_READH5_BYTILE: if set to 1, read HDF5 files directly by Chameleon tiles instead of the default handmade StarPU task algorithm (read by panels of rows + build tiles + migrate tiles to get the 2D block-cyclic distribution)
- FMR_CHAMELEON_READH5_BYTILE_DISTRIB_1D: to be used with FMR_CHAMELEON_READH5_BYTILE; if set to 1, read by tiles but follow a 1D distribution of tiles over the MPI processes instead of the 2D block-cyclic one
- FMR_CHAMELEON_READH5_SYM: if set to 1, read HDF5 files considering the matrix symmetric in main memory
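When driving the executables from a script, these variables can be passed to a single run without changing the parent shell environment. A minimal Python sketch (the mds command line below is illustrative; adjust paths and flags to your installation):

```python
import os
import subprocess

# runtime switches for Diodon/FMR, passed only to the child process
env = dict(os.environ)
env.update({
    "DIODON_TIME": "1",                 # time some Diodon functions
    "FMR_TIME": "1",                    # time some FMR functions
    "FMR_CHAMELEON_NUM_THREADS": "8",   # CPU workers used by Chameleon
})

# hypothetical command line, as in the executable sections of this README
cmd = ["mds", "-if", "input.h5", "-of"]
# subprocess.run(cmd, env=env, check=True)  # uncomment to actually run
print(env["FMR_CHAMELEON_NUM_THREADS"])  # 8
```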
Principal Component Analysis - pca program
# see parameters with --help option
# -if to give the path to the file, HDF5 file example
./src/pca -if ../data/atlas_guyane_trnH.h5 -iff h5
# .csv file example with " " (space) as delimiters
./src/pca -if ../data/atlas_guyane_trnH.csv -iff csv -ifd " "
# -ids to specify the HDF5 dataset name to consider as the input matrix to read (if -iff is h5)
# -iff to specify the input file format
# -ifd to specify the input file delimiters (if ascii file and not h5)
# -ict to specify whether or not to read columns tags (if there are headers for columns in the first line of the file)
# -irt to specify whether or not to read rows tags (if there are tags for rows in the first column of the file)
# -t to specify the number of threads to be used (by Intel MKL/OpenBlas and Chameleon)
# -pcam to specify a pretreatment method (either 'standard' = centered-scaled columnwise or 'bicentering', 'row_centering', 'col_centering' or 'scaling')
# -svdm to choose between 'svd' (default) or 'rsvd'. Prefer rsvd when the data are too big (large number of columns).
# -r to give the prescribed rank in the rsvd method
# -of to enable saving results (points cloud and eigen values) in file(s) (default is inputfilename.pca.h5, else use options -off, -ocf, -oef, -oevf, etc).
# -off format of the output file ("h5", "txt", "csv", "gz", "bz2")
# -ocf path of the file where to store the points cloud of the pca
# -oef path of the file where to store the eigen values of the pca (can be the same file as the cloud in h5 format)
# -oevf path of the file where to store the eigen vectors of the pca (can be the same file as the cloud in h5 format)
# -v 1 to enable verbose mode
./src/pca -if ../data/pca_template_withnames.txt -t 2 -iff txt -ifd \t -ict -irt -svdm rsvd -r 1000 -of
./src/pca -if ../data/pca_template_withnames.txt -t 2 -iff txt -ifd \t -ict -irt -svdm rsvd -r 1000 -of -off h5 -v 1
# with the chameleon version, one can use MPI and perform pca on several nodes
mpiexec -n 2 ./src/pca -if ../data/atlas_guyane_trnH.h5 -ids distance -t 2 -svdm rsvd -r 1000 -of
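For reference, the computation performed with the default 'standard' pretreatment (center and scale each column, then an SVD) can be sketched in NumPy; this is an illustration of the textbook method, not Diodon's actual code:

```python
import numpy as np

def pca_standard(A, rank):
    """PCA with the 'standard' pretreatment: center and scale each
    column, then truncated SVD of the pretreated matrix."""
    m = A.shape[0]
    Ac = (A - A.mean(axis=0)) / A.std(axis=0)   # centered-scaled columnwise
    U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
    Y = U[:, :rank] * s[:rank]        # points cloud (principal components)
    eigvals = s[:rank] ** 2 / m       # eigen values of the correlation matrix
    return Y, eigvals, Vt[:rank]      # cloud, eigen values, eigen vectors

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5)) * np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y, eigvals, V = pca_standard(A, rank=2)
print(Y.shape, eigvals.shape)  # (100, 2) (2,)
```

The rank argument plays the same role as the `-r` flag when `-svdm rsvd` is used.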
Principal Component Analysis with metrics - pcamet program
Usage is the same as for the pca program, except that the input file for A (main data) is given with -ifa, the metrics on rows M with -ifm and the metrics on columns Q with -ifq.
The input dataset names, if input file is in hdf5 format, can be chosen with -idsa, -idsm and -idsq.
The metrics M, Q must be diagonal matrices stored as a vector, M of size (m, 1) and Q of size (n, 1) for A of size (m, n).
See parameters with the --help option.
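The role of the metrics can be illustrated with a NumPy sketch of one standard formulation (SVD of diag(sqrt(M)) A diag(sqrt(Q))); an illustration only, Diodon's exact conventions may differ:

```python
import numpy as np

def pca_with_metrics(A, M, Q, rank):
    """PCA under a diagonal row metric M (size m) and column metric Q
    (size n), both stored as vectors, for A of size (m, n):
    SVD of diag(sqrt(M)) @ A @ diag(sqrt(Q))."""
    B = np.sqrt(M)[:, None] * A * np.sqrt(Q)[None, :]
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank]

m, n = 8, 4
A = np.arange(m * n, dtype=float).reshape(m, n)
M = np.full(m, 1.0 / m)   # uniform row weights (a common choice)
Q = np.ones(n)            # identity column metric
U, s, Vt = pca_with_metrics(A, M, Q, rank=2)
print(s.shape)  # (2,)
```

With uniform row weights and an identity column metric this reduces to an ordinary SVD of A scaled by 1/sqrt(m).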
Correspondence Analysis - coa program
Usage is the same as for the pca program except there is no -pcam option.
See parameters with the --help option.
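The underlying computation of a correspondence analysis (SVD of the standardized residuals of the contingency table) can be sketched in NumPy; an illustration of the textbook method, not Diodon's actual code:

```python
import numpy as np

def coa(N, rank):
    """Correspondence analysis: SVD of the standardized residuals
    of a contingency table N."""
    P = N / N.sum()
    r = P.sum(axis=1)                       # row masses
    c = P.sum(axis=0)                       # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    Yr = (U[:, :rank] * s[:rank]) / np.sqrt(r)[:, None]   # row coordinates
    Yc = (Vt[:rank].T * s[:rank]) / np.sqrt(c)[:, None]   # column coordinates
    return Yr, Yc, s[:rank] ** 2            # eigen values = squared singular values

N = np.array([[30.0, 10.0], [10.0, 30.0]])
Yr, Yc, eigvals = coa(N, rank=1)
print(eigvals)  # approximately [0.25] for this 2x2 table
```

Yr and Yc correspond to the row and column clouds that the coa program saves.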
Singular Value Decomposition - svd program
Usage is the same as for the pca program except there is no -pcam option.
The output filenames and dataset names (if the output file is in HDF5 format) can be chosen with -ouf, -osf, -ovf, -ouds, -osds, -ovds (for U, S and Vt respectively).
See parameters with the --help option.
Multidimensional Scaling - mds program
# see parameters with --help option
# -if to give the path to the file, HDF5 file example
./src/mds -if ../data/atlas_guyane_trnH.h5 -iff h5
# .csv file example with " " (space) as delimiters
./src/mds -if ../data/atlas_guyane_trnH.csv -iff csv -ifd " "
# -ids to specify the HDF5 dataset name to consider as the input matrix to read (if -iff is h5)
# -t to specify the number of threads to be used (by Intel MKL/OpenBlas and Chameleon)
# -svdm to choose between 'svd' (default) or 'rsvd'. Prefer rsvd when the data are too big (large number of columns).
# -r to give the prescribed rank in the rsvd method
# -of to enable saving results (points cloud and eigen values) in file(s) (default is inputfilename.mds.h5, else use options -off, -ocf, -oef, etc).
# -off format of the output file ("h5", "txt", "csv", "gz", "bz2")
# -ocf path of the file where to store the points cloud of the mds
# -oef path of the file where to store the eigen values of the mds (can be the same file as the cloud in h5 format)
# -ncs to choose the number of columns to keep in the cloud points
# -v 1 to enable verbose mode
./src/mds -if ../data/atlas_guyane_trnH.h5 -ids distance -t 2 -svdm rsvd -r 1000 -of
./src/mds -if ../data/atlas_guyane_trnH.h5 -ids distance -t 2 -svdm rsvd -r 1000 -of -ocf mds.h5 -oef mds.h5 -off h5 -ncs 3
# with the chameleon version, one can use MPI and perform mds on several nodes
mpiexec -n 2 ./src/mds -if ../data/atlas_guyane_trnH.h5 -ids distance -t 2 -r 1000 -of
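For reference, the classical MDS computation (double-center the squared distances into a Gram matrix, then take its spectral decomposition) can be sketched in NumPy on a tiny synthetic distance matrix; an illustration of the textbook method, not Diodon's actual code:

```python
import numpy as np

def classical_mds(D, n_components):
    """Classical (Torgerson) MDS: Gram matrix by double-centering the
    squared distances, then spectral decomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering operator
    G = -0.5 * J @ (D ** 2) @ J               # Gram matrix
    w, V = np.linalg.eigh(G)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:n_components]  # keep the largest ones
    L = np.clip(w[idx], 0.0, None)            # discard negative eigenvalues
    return V[:, idx] * np.sqrt(L), w[idx]     # points cloud, eigen values

# four points on a line: one axis reproduces the distances exactly
X = np.array([[0.0], [1.0], [3.0], [6.0]])
D = np.abs(X - X.T)
Y, eigvals = classical_mds(D, n_components=1)
print(np.allclose(D, np.abs(Y - Y.T)))  # True
```

n_components plays the same role as the `-ncs` flag; on large matrices Diodon replaces the exact decomposition by the rSVD.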
Input/Output file parameters examples
For all method executables the input file given with the -if argument can be either an ASCII file (.csv, .txt, .gz, .bz2) or an HDF5 one (.h5, the default). For ASCII files the format must be given, as well as the delimiters used to separate values in the file; see the -iff and -ifd arguments. Indicate whether or not the input file contains tags for rows and columns as first row/column with the -irt and -ict arguments. To specify the HDF5 dataset names of the row and column tags to read, use -irtds and -ictds.
Usage examples of the input file parameters -if, -iff, -ifd, -ift, -ids:
git clone https:
export DIODON_DATA4TESTS=$PWD/data4tests
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv -iff csv -ifd ";"
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv -iff csv -ifd ";" -ict -irt
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_nonames.txt -iff txt -ifd $'\t'
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_withnames.txt -iff txt -ifd $'\t' -ict -irt
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv.gz -iff gz -ifd ";"
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv.gz -iff gz -ifd ";" -ict -irt
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv.bz2 -iff bz2 -ifd ";"
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv.bz2 -iff bz2 -ifd ";" -ict -irt
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_nonames.h5 -iff h5 -ids values
./tests/testLoad -if ${DIODON_DATA4TESTS}/pca_template_withnames.h5 -iff h5 -ids values -ict -irt
For all method executables the output file(s) given with the -o?f argument (? depends on the method: c for cloud, e for eigen values, etc.; see the --help option of the executable) can be either an ASCII file (.csv, .txt, .gz, .bz2) or an HDF5 one (.h5, the default). For ASCII files the format must be given, as well as the delimiters used to separate values in the file; see the -off and -ofd arguments. Indicate whether or not to save row and column tags with the -ict and -irt arguments. To specify the HDF5 dataset names of the row and column tags to save, use -ortds and -octds. To restrict the number of columns of the matrix to save, use -ncs. To restrict the number of digits of the values to save, use -nds.
Usage examples of the output file parameters -of, -off, -ofd, -ods, -octds, -ortds, -ncs, -nds:
git clone https:
export DIODON_DATA4TESTS=$PWD/data4tests
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv -iff csv -ifd ";" -of output_nn.csv -off csv -ofd ";"
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv -iff csv -ifd ";" -of output_wn.csv -off csv -ofd ";" -ict -irt
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_nonames.txt -iff txt -ifd $'\t' -of output_nn.txt -off txt -ofd $'\t'
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.txt -iff txt -ifd $'\t' -of output_wn.txt -off txt -ofd $'\t' -ict -irt
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv.gz -iff gz -ifd ";" -of output_nn.gz -off gz -ofd ";"
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv.gz -iff gz -ifd ";" -of output_wn.gz -off gz -ofd ";" -ict -irt
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_nonames.csv.bz2 -iff bz2 -ifd ";" -of output_nn.bz2 -off bz2 -ofd ";"
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.csv.bz2 -iff bz2 -ifd ";" -of output_wn.bz2 -off bz2 -ofd ";" -ict -irt
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_nonames.h5 -iff h5 -ids values -of output_nn.h5 -off h5 -ods values
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.h5 -iff h5 -ids values -of output_wn.h5 -off h5 -ods values -ict -irt -octds "cols" -ortds "rows"
./tests/testWrite -if ${DIODON_DATA4TESTS}/pca_template_withnames.h5 -iff h5 -ids values -of output_wn_cs3.h5 -off h5 -ods values -ict -irt -octds "cols" -ortds "rows" -ncs 3
Tutorial on a local computer
Set up your environment: either install the libraries yourself following the Installation section, or use the Ubuntu 22.04 docker image provided with the project
docker run -it registry.gitlab.inria.fr/diodon/cppdiodon/ubuntu/chameleonmpi
MDS examples
# see parameters with the --help option
mds -help
# first example: we read the distance matrix in atlas_guyane_trnH.h5 and save the eigenvalues and points cloud in atlas_guyane_trnH.mds.h5
mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of
# notice that the default input file format is hdf5 (h5)
# and default dataset read is 'distance', you can change this with -ids
mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of -ids distance
# a file /root/cppdiodon/data/atlas_guyane_trnH.mds.h5 is created
# you can use h5dump to see the results
h5dump -H /root/cppdiodon/data/atlas_guyane_trnH.mds.h5
h5dump -d eigenvalues /root/cppdiodon/data/atlas_guyane_trnH.mds.h5
h5dump -d points /root/cppdiodon/data/atlas_guyane_trnH.mds.h5
# you can make an image of the points cloud (dataset must be called 'points' by default here)
python3 /root/cppdiodon/tools/post/nuage_png.py /root/cppdiodon/data/atlas_guyane_trnH.mds.h5
By default the SVD method used is the full SVD. You can use a randomized SVD instead, with a prescribed rank which defaults to one tenth of M, the size of the input distance matrix.
You can choose to use RSVD with -svdm rsvd and you can change the prescribed rank with the -r flag, e.g.
mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of -svdm rsvd -r 1500
You can also force the SVD method instead of the RSVD, use the flag -svdm svd, e.g.
mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of -svdm svd
By default the number of threads used is 2. You can use more or less threads with the flag -t, e.g.
mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of -t 4
You can limit the number of columns saved in the output matrices with the flag -ncs
mds -if /root/cppdiodon/data/atlas_guyane_trnH.h5 -of -ncs 2
# look at the first 5 rows and first 2 columns of the 'points' dataset
h5dump -A 0 -d points -c "5,2" /root/cppdiodon/data/atlas_guyane_trnH.mds.h5
PCA examples
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of
# see points cloud Y in /root/cppdiodon/data/pca_template_withnames_cloud.pca.txt
# eigen values L in /root/cppdiodon/data/pca_template_withnames_eigenvalues.pca.txt
# eigen vectors V in /root/cppdiodon/data/pca_template_withnames_eigenvectors.pca.txt
Notice that the default file format is h5, so it must be changed here with -iff txt. For txt files the default delimiter is a space; you can change this with the -ifd flag, here -ifd $'\t' means tab delimiters. In addition, when an input file provides tags to identify rows and columns, one must use -irt for rows and -ict for columns to read the tags. They will later be saved in the output file along with the results.
The default pre-treatment is centered (columnwise) and scaled; this can be changed with the -pcam flag, e.g.
# centered-scaled columnwise (default)
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -pcam standard
# double-averaged
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -pcam bicentering
# centered columnwise
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -pcam col_centering
# centered row-wise
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -pcam row_centering
# scaled columnwise
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -pcam scaling
For ASCII output files one can change the number of digits to save with -nds
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -nds 2
more /root/cppdiodon/data/pca_template_withnames_cloud.pca.txt
The input and output file formats can differ
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -off h5
h5dump /root/cppdiodon/data/pca_template_withnames.pca.h5
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -off csv -ofd ";"
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -off bz2
The output dataset names to use can be specified with -ocds, -oeds and -oevds
pca -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -irt -ict -of -off h5 -ocds riri -oeds fifi -oevds loulou
h5dump -A 0 -d riri -d fifi -d loulou /root/cppdiodon/data/pca_template_withnames.pca.h5
PCAmet examples
pcamet -ifa /root/cppdiodon/data/pca_template_withnames.txt -ifm /root/cppdiodon/data/pca_template_metrics_m.txt -ifq /root/cppdiodon/data/pca_template_metrics_q.txt -iff txt -ifd $'\t' -irt -ict -of
# see points cloud Y in /root/cppdiodon/data/pca_template_withnames_cloud.pcamet.txt
# eigen values L in /root/cppdiodon/data/pca_template_withnames_eigenvalues.pcamet.txt
# eigen vectors V in /root/cppdiodon/data/pca_template_withnames_eigenvectors.pcamet.txt
CoA examples
coa -if /root/cppdiodon/data/coa4tests_nonames.txt -iff txt -ifd $'\t' -of
# see Yr in /root/cppdiodon/data/coa4tests_nonames_yr.coa.txt
# Yc in /root/cppdiodon/data/coa4tests_nonames_yc.coa.txt
# eigen values L in /root/cppdiodon/data/coa4tests_nonames_eigenvalues.coa.txt
SVD examples
svd -if /root/cppdiodon/data/pca_template_withnames.txt -iff txt -ifd $'\t' -ict -irt -of
# see U in /root/cppdiodon/data/pca_template_withnames_u.svd.txt
# S in /root/cppdiodon/data/pca_template_withnames_s.svd.txt
# Vt in /root/cppdiodon/data/pca_template_withnames_vt.svd.txt
Notice that here the default SVD method is always the randomized SVD; to use a full SVD instead, use -svdm svd.
Tutorial on the PlaFRIM supercomputer
Here is an example usage on the PlaFRIM supercomputer hosted at Inria Bordeaux.
Example using modules
ssh plafrim
# to avoid getting some guix paths in the current environment
export PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/dell/srvadmin/bin
unset CMAKE_PREFIX_PATH
unset CPLUS_INCLUDE_PATH
unset LIBRARY_PATH
unset CPATH
unset C_INCLUDE_PATH
unset PYTHONPATH
# load plafrim modules
module purge
module add build/cmake/3.15.3
module add compiler/gcc/9.3.0
module add hardware/hwloc/2.7.0
module add mpi/openmpi/4.0.3
module add runtime/starpu/1.3.7/mpi
module add io/hdf5/openmpi@4.0.2/1.10.5
module add linalg/mkl/2020_update4
cd $HOME
export INSTALLDIR=$HOME/diodon/install
# install chameleon
wget https:
tar xvf chameleon-1.2.0.tar.gz
cd chameleon-1.2.0
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALLDIR -DCHAMELEON_USE_MPI=ON -DBLA_VENDOR=Intel10_64lp -DCHAMELEON_KERNELS_MT=ON -DBUILD_SHARED_LIBS=ON
make -j10 install
# install diodon
git clone --recursive https:
cd cppdiodon
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALLDIR -DCMAKE_PREFIX_PATH=$INSTALLDIR -DCMAKE_NO_SYSTEM_FROM_IMPORTED=ON -DDIODON_USE_CHAMELEON=ON -DDIODON_USE_MKL_AS_BLAS=ON
make -j10 install
# use cppdiodon in parallel on several nodes
salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive --ntasks-per-node=1 --threads-per-core=1
# make sure we use the same environment as for building
export PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/dell/srvadmin/bin
unset CMAKE_PREFIX_PATH
unset CPLUS_INCLUDE_PATH
unset LIBRARY_PATH
unset CPATH
unset C_INCLUDE_PATH
unset PYTHONPATH
module purge
module add build/cmake/3.15.3
module add compiler/gcc/9.3.0
module add hardware/hwloc/2.7.0
module add mpi/openmpi/4.0.3
module add runtime/starpu/1.3.7/mpi
module add io/hdf5/openmpi@4.0.2/1.10.5
module add linalg/mkl/2020_update4
export INSTALLDIR=$HOME/diodon/install
export DIODON_TIME=1
export FMR_TIME=1
# check mpi
mpiexec hostname
# mds, small testcase n=1502
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/mds -t 2 -if $HOME/diodon/cppdiodon/data/atlas_guyane_trnH.h5 -of
# mds, large testcase n=100000 (you have to be a member of the 'diodon' project on plafrim to get the matrices)
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/mds -t 34 -if /beegfs/diodon/gordon/L1-L10/L6.h5 -svdm rsvd -r 5000 -of -v 1
# CPU time for MDS (on L6, 2 bora nodes and rank=5000) should be around: 300s (5 min)
# Quality estimates should be around:
# [diodon] rsvd quality estimation ||Psy_r|| / ||G||= 0.999994
# [diodon] mds quality estimation ||Psy^+_r|| / ||G||= 0.998018
# results are saved into /beegfs/diodon/gordon/L1-L10/L6.mds.h5 in 'eigenvalues' and 'points' datasets
# the output file is saved in a directory shared by the group diodon, allow the group to modify/update the file
chmod g+w /beegfs/diodon/gordon/L1-L10/L6.mds.h5
# posttreatment
module add language/python/3.8.0
python3 -m pip install --user virtualenv
python3 -m venv env
source env/bin/activate
pip install numpy matplotlib h5py
python3 $HOME/diodon/cppdiodon/tools/post/nuage_png.py /beegfs/diodon/gordon/L1-L10/L6.mds.h5
# L6.png should be generated
# one can also perform benchmarks, by using the testing drivers, with values randomly generated for the input matrix
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/testmds -t 34 -gen 1 -rs 10000 -rk 1000
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/testpca -t 34 -gen 1 -rsm 10000 -rsn 1000
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/testpcamet -t 34 -gen 1 -rsm 10000 -rsn 1000
# warning: the method to build the contingency table, input matrix of coa, is very costly, use small sizes (m,n)
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/testcoa -t 34 -gen 1 -rsm 1000 -rsn 100
mpiexec --map-by ppr:1:node:pe=36 $INSTALLDIR/bin/testsvd -t 34 -gen 1 -rsm 10000 -rsn 1000
Example using GNU Guix
GNU Guix is available on PlaFRIM and can be used to install cppdiodon.
It is necessary first to add a channel to get new packages coming from the GUIX-HPC project. Create a ~/.config/guix/channels.scm file with the following snippet (on PlaFRIM):
(cons (channel
(name 'guix-hpc-non-free)
(url "https://gitlab.inria.fr/guix-hpc/guix-hpc-non-free.git"))
%default-channels)
Update and check Guix
# update packages recipes
guix pull
# check guix usage
guix build hello
Then install and use cppdiodon on PlaFRIM with Guix
# software environment setup
# allocate resources with slurm
salloc --nodes=2 --constraint bora --ntasks-per-node=1 --exclusive
guix shell --pure --preserve=SLURM cppdiodon-mpi --without-tests=hdf5-parallel-openmpi openmpi openssh slurm coreutils inetutils gawk grep sed -- /bin/bash --norc
# execute a MDS
mpiexec --bind-to board mds -t 34 -if /beegfs/diodon/gordon/L1-L10/L6.h5 -of -svdm rsvd -r 5000 -ncs 3 -v 1
# results are saved into /beegfs/diodon/gordon/L1-L10/L6.mds.h5 in 'eigenvalues' and 'points' datasets
# the output file is saved in a directory shared by the group diodon, allow the group to modify/update the file
chmod g+w /beegfs/diodon/gordon/L1-L10/L6.mds.h5
Source code structure
- data/: input data useful for testing
- docs/: doxygen documentation
- examples/: API usage examples
- external/: third party library FMR
- include/: cppdiodon API
- src/: cppdiodon main executables for calling data analysis methods (pca, coa, mds, etc.)
- tests/: executables for functional testing
- tools/: various scripts used for: continuous integration with gitlab-ci, generating the docker image, post-treatment of results, sonarqube analysis, ...
API
API usage examples
- shared memory example with lapack matrix:
- distributed memory example with Chameleon matrix: