Cluster algorithms, PCA, and chemical conformer analysis.
Please see the README on GitLab at https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster
ConClusion
ConClusion provides principal component analysis, hierarchical clustering and DBSCAN in Haskell. There is also a command line interface for processing CREST conformer trajectories. Hence the name: CONformer CLUStering. The procedure to analyse conformer data has three steps:
- Read the trajectory and calculate a set of features for each conformer. The features can include the energy, a set of bond lengths, a set of bond angles, and a set of dihedral angles in arbitrary combination. These descriptors form a feature matrix.
- Optionally, perform a principal component analysis of the feature matrix to reduce the number of dimensions and remove redundancies.
- Cluster the (potentially PCA-processed) feature matrix. Different distance measures are available. Either DBSCAN or hierarchical clustering can be used to group the conformers.
While the command line interface only supports the workflow described above, the underlying clustering algorithms and PCA are implemented in a general way and can be used independently as a library.
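The PCA step can be summarised in textbook form. The sketch below assumes that the feature matrix $X$ holds one row per feature and one column per conformer. The symbols and the $1/(n-1)$ normalisation are assumptions made for illustration; this README does not state which convention ConClusion uses internally.

```latex
% X \in \mathbb{R}^{d \times n}: d features (rows), n conformers (columns), columns x_j
\begin{align}
  \bar{x} &= \frac{1}{n} \sum_{j=1}^{n} x_j, &
  \tilde{X} &= X - \bar{x}\,\mathbf{1}^{\mathsf{T}}, \\
  C &= \frac{1}{n-1}\,\tilde{X}\tilde{X}^{\mathsf{T}} = W \Lambda W^{\mathsf{T}}, &
  Y &= W_k^{\mathsf{T}}\,\tilde{X} \in \mathbb{R}^{k \times n}.
\end{align}
% W_k: the k eigenvectors of C that belong to the k largest eigenvalues
```

Here $Y$ is the reduced feature matrix with $k$ rows that would then be passed on to the clustering step.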
Installation
Bundled Archive
A self-contained executable archive is built for the main branch and for releases. It only has to be downloaded and can be executed directly on any Linux system. Go to the releases page and download an archive. Make it executable (e.g. chmod +x conclusion) and you are done!
From Source
If you have the Haskell toolchain installed, and therefore a working Cabal and GHC, you may build ConClusion from source. This also requires working BLAS and LAPACK libraries on your system.
git clone https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster.git ConClusion
cd ConClusion
cabal install --installdir="$PREFIX"
Choose a PREFIX where to install the executable. $HOME/.local/bin/ is often a good choice.
If you would like to use ConClusion on systems where Nix is not available (Windows, BSD, ...) this is the way to go.
With Nix
If Nix is available on your system, everything can be built by Nix:
git clone https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster.git ConClusion
cd ConClusion/nix
nix-build -A ConClusion.components.exes.conclusion
Usage
The command line interface of the conclusion executable offers full control over all three steps described above.
- -x --xyz takes the path to the XYZ trajectory that is to be processed. It must contain the energy in the comment line (which is the case for CREST trajectories).
- --dim specifies the characterising features: a set of internal coordinates and optionally the energy. A comma-separated list of any number of features can be given with the following syntax:
  - e for the energy
  - b m n for a bond length, where m and n are the 0-based atom indices
  - a m n o for an angle, where m, n and o are 0-based atom indices. Calculates the angle around n.
  - d m n o p for a dihedral angle, where m, n, o and p are 0-based atom indices. Calculates the rotation of the bond between n and o. Dihedrals use a metric that respects periodicity and direction of the rotation. Each dihedral therefore occupies two rows in the feature matrix. See this paper.
- -p --pca activates dimensionality reduction by principal component analysis. Give an integer to specify how many principal components are kept. During the execution of ConClusion the error introduced by the PCA will be printed.
- -c --cluster activates clustering of the results. The clustering algorithm is selected by giving either dbscan or hca.
- --measure specifies the distance measure between the conformers. By default the Euclidean distance is used, but Manhattan and Mahalanobis distances are also available, as well as a general L_p norm. If Mahalanobis distances are used, it may be worth disabling PCA.
- --joinstrat controls how inter-cluster distances are calculated in hierarchical clustering. single might be the best choice to get dense groups of conformers.
- --distance gives the search radius in DBSCAN or the dendrogram cut distance in hierarchical clustering.
- --minsize is the minimum size of a cluster in DBSCAN. If --forcemin is given, the clusters obtained by HCA are also filtered for a minimum size.
- --forcemin forces filtering of HCA clusters by the minimum size given with --minsize. Disabled by default.
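For reference, the dihedral features and the distance measures named in the options above can be written out explicitly. The distance definitions are the standard ones. The (cos, sin) encoding of a dihedral is only an assumption: it is one common encoding that respects periodicity and direction and yields two rows per dihedral, but this README does not confirm that ConClusion uses exactly this form.

```latex
% Assumed encoding of a dihedral angle \varphi as two features
\[
  \varphi \;\mapsto\; (\cos\varphi,\ \sin\varphi)
\]

% Distances between two feature vectors x, y \in \mathbb{R}^{d}
\begin{align}
  d_{\mathrm{Manhattan}}(x, y)   &= \sum_{i=1}^{d} |x_i - y_i| \\
  d_{\mathrm{Euclidean}}(x, y)   &= \Big( \sum_{i=1}^{d} (x_i - y_i)^2 \Big)^{1/2} \\
  d_{p}(x, y)                    &= \Big( \sum_{i=1}^{d} |x_i - y_i|^{p} \Big)^{1/p} \\
  d_{\mathrm{Mahalanobis}}(x, y) &= \sqrt{(x - y)^{\mathsf{T}}\, \Sigma^{-1}\, (x - y)}
\end{align}
% \Sigma: covariance matrix of the features over all conformers
```

The Manhattan and Euclidean distances are the special cases $p = 1$ and $p = 2$ of the $L_p$ norm. Because each dihedral contributes two rows, a feature set of the energy plus eight dihedrals, as in the example further down, gives a feature matrix with $1 + 2 \cdot 8 = 17$ rows.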
Each processing step produces a Gnuplot-compatible file (space-separated columns). The plain feature matrix is written to features.dat, the results of the PCA to pca.dat and the clustering results to cluster.dat. The first column in cluster.dat is an integer giving the number of the cluster the point belongs to, which can be used for colour-coding in Gnuplot.
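Since the first column of cluster.dat is the cluster number, a quick overview of the cluster sizes can be obtained with standard tools. The following is a minimal sketch, not part of ConClusion: the function name cluster_sizes is hypothetical, and the sketch assumes that cluster.dat contains no header lines.

```shell
# Count how many rows (points) carry each cluster number.
# Assumptions: column 1 is the cluster number, no header lines in the file.
cluster_sizes() {
    awk 'NF > 0 { count[$1]++ }
         END    { for (c in count) print c, count[c] }' "$1" | sort -n
}
```

Calling cluster_sizes cluster.dat prints one line per cluster with the cluster number and the number of points in it, sorted by cluster number.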
Example
A perylene dye with four phenoxy groups has different conformers with different spectral properties. For solubility, the dye also carries some alkyl groups. CREST finds about 1400 conformers, most of which differ only in the alkyl side chains, which do not influence the spectral properties. Consequently, only a much smaller number of groups of conformers that differ in the positions of the phenoxy groups exists. From each of those groups, the lowest-energy conformer shall be obtained. We therefore select eight dihedral angles and the energy as features: two dihedrals for each phenoxy group, one describing the rotation around the perylene-O bond and the other describing the rotation around the O-Ph bond. As some orientations of the phenoxy groups are not possible, the dihedral angles are not independent of each other, so we use a PCA to reduce dimensionality and remove redundancies. After the PCA, DBSCAN is used to obtain clusters of similar conformers. The lowest index in each cluster is also the lowest-energy conformer of that group, as CREST sorts conformers by energy.
conclusion \
--xyz=crest_conformers.xyz \
--pca=3 \
--dim="e, d 19 18 2 56, d 18 2 56 65, d 16 15 1 45, d 15 1 45 46, d 11 13 0 34, d 13 0 34 43, d 29 31 3 67, d 31 3 67 76" \
--measure=manhattan \
--cluster=dbscan \
--distance=0.3 \
--minsize=5
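Picking the lowest index in each cluster can be scripted on top of cluster.dat. The sketch below is not part of ConClusion and rests on assumptions that this README does not confirm: that cluster.dat contains one row per conformer in trajectory order and has no header lines. The function name cluster_representatives and the choice of 0-based row indices are hypothetical.

```shell
# Print, for each cluster number, the 0-based row index of its first member.
# Assumptions: one row per conformer, rows in trajectory order, no header lines,
# column 1 is the cluster number.
cluster_representatives() {
    awk 'BEGIN  { n = 0 }
         NF > 0 { if (!($1 in first)) first[$1] = n; n++ }
         END    { for (c in first) print c, first[c] }' "$1" | sort -n
}
```

Calling cluster_representatives cluster.dat prints one line per cluster with the cluster number and the row index of its first member. If the assumptions hold, that row corresponds to the lowest-energy conformer of the cluster, because CREST sorts conformers by energy.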
Library/Haskell Package
ConClusion provides principal component analysis and the clustering algorithms DBSCAN and hierarchical clustering. The algorithms are implemented on efficient parallel arrays and perform quite well. For the API, see the Haddock documentation, which can be generated by:
cabal haddock