DPA Module

DPA cluster type implementation.

This module provides the DPA (Density Peak Advanced) cluster type, which implements density-based clustering for molecular dynamics trajectory analysis using the DPA package installed in the conda environment.

class mdxplain.clustering.cluster_type.dpa.dpa.DPA(Z: float = 1.0, metric: str = 'euclidean', affinity: str = 'nearest_neighbors', density_algo: str = 'PAk', k_max: int = 1000, D_thr: float = 23.92812698, dim_algo: str = 'twoNN', blockAn: bool = True, block_ratio: int = 20, frac: float = 1.0, halos: bool = False, method: str = 'standard', sample_fraction: float = 0.1, knn_neighbors: int = 5, force: bool = False, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True)

DPA (Density Peak Advanced) cluster type.

DPA is a density-based clustering algorithm that identifies cluster centers as points with high density that are far from other high-density points. It’s particularly useful for identifying conformational states in molecular dynamics trajectories with complex cluster shapes and varying densities.

Uses the DPA package by Maria d’Errico et al. under the hood, and sklearn’s NearestNeighbors for the k-NN sampling method.

Examples

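>>> import numpy as np
>>> from mdxplain.clustering.cluster_type.dpa.dpa import DPA
>>> # Create DPA with default parameters
>>> dpa = DPA()
>>> # Create DPA with custom parameters
>>> dpa = DPA(Z=1.5, metric='euclidean', density_algo='PAk',
...           k_max=500, blockAn=True, block_ratio=10)
>>> # Initialize and compute clustering
>>> dpa.init_calculator()
>>> # Placeholder input of shape (n_samples, n_features)
>>> data = np.random.rand(2000, 10)
>>> labels, metadata = dpa.compute(data)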

References

M. d’Errico, E. Facco, A. Laio, A. Rodriguez, Information Sciences, Volume 560, June 2021, 476-492. See: https://github.com/mariaderrico/DPA

__init__(Z: float = 1.0, metric: str = 'euclidean', affinity: str = 'nearest_neighbors', density_algo: str = 'PAk', k_max: int = 1000, D_thr: float = 23.92812698, dim_algo: str = 'twoNN', blockAn: bool = True, block_ratio: int = 20, frac: float = 1.0, halos: bool = False, method: str = 'standard', sample_fraction: float = 0.1, knn_neighbors: int = 5, force: bool = False, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True) -> None

Initialize DPA cluster type.

Parameters

Z : float, default=1.0

The number of standard deviations, which fixes the level of statistical confidence at which one decides to consider a cluster meaningful.

metric : string or callable, default='euclidean'

The distance metric to use. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in VALID_METRIC = [precomputed, euclidean, cosine]. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.

affinity : string or callable, default='nearest_neighbors'

How to construct the affinity matrix.

  • “nearest_neighbors”: construct the affinity matrix by computing a graph of nearest neighbors.

  • “rbf”: construct the affinity matrix using a radial basis function (RBF) kernel.

  • “precomputed”: interpret X as a precomputed affinity matrix.

  • “precomputed_nearest_neighbors”: interpret X as a sparse graph of precomputed nearest neighbors, and construct the affinity matrix by selecting the n_neighbors nearest neighbors.

  • one of the kernels supported by sklearn.metrics.pairwise_kernels.

density_algo : string, default='PAk'

Define the algorithm to use as density estimator. It must be one of the options allowed by VALID_DENSITY = [PAk, kNN].

k_max : int, default=1000

Maximum number of nearest neighbors considered by the density estimator. This parameter is used when density_algo is “PAk” or “kNN” and is ignored otherwise. If density_algo=“PAk”, k_max bounds the search for the largest number of neighbors k_hat for which the condition of constant density holds, within a given level of confidence. If density_algo=“kNN”, k_max sets the number of neighbors used by the standard k-Nearest Neighbor algorithm. If the number of points in the sample N is less than the default value, k_max is automatically set to N/2.
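The automatic adjustment for small samples can be illustrated with a short sketch. This is not the package's code: the helper name effective_k_max is hypothetical, and both comparing N against the requested k_max and rounding N/2 down are assumptions made for the illustration.

```python
def effective_k_max(n_samples: int, k_max: int = 1000) -> int:
    """Hypothetical helper: reduce k_max for small samples."""
    # Assumption: the threshold is the requested k_max, and N/2 is rounded down.
    if n_samples < k_max:
        return n_samples // 2
    return k_max


print(effective_k_max(300))    # small sample: k_max is reduced
print(effective_k_max(5000))   # large sample: k_max is kept
```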

D_thr : float, default=23.92812698

This parameter is used when density_algo is “PAk” and is ignored otherwise. It sets the level of confidence in the PAk density estimator. The default value corresponds to a p-value of 10^-6 for a χ² distribution with one degree of freedom.
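The relation between the default threshold and the stated p-value can be checked with the standard library alone. The sketch below uses the identity that, for one degree of freedom, the χ² survival function equals erfc(sqrt(x/2)); the names D_THR and chi2_sf_one_dof are illustrative and not part of the package.

```python
import math

D_THR = 23.92812698  # default value taken from the signature


def chi2_sf_one_dof(x: float) -> float:
    """P(X > x) for a chi-squared variable with one degree of freedom."""
    # For Z ~ N(0, 1): P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


p_value = chi2_sf_one_dof(D_THR)
print(f"{p_value:.3e}")  # should be close to 1e-06
```

A larger D_thr corresponds to a smaller p-value, that is, a stricter test of the constant-density condition.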

dim_algo : string, default='twoNN'

Method for intrinsic dimensionality calculation. If dim_algo is “auto”, dim is assumed to be equal to n_features. If dim_algo is a string, it must be one of the options allowed by VALID_DIM = [auto, twoNN].

blockAn : bool, default=True

This parameter is used when dim_algo is “twoNN” and is ignored otherwise. If blockAn is True, the algorithm performs a block analysis that discriminates the relevant dimensions as a function of the block size. This makes it possible to study the stability of the estimate with respect to changes in the neighborhood size, which is crucial for ID estimation when the data lie on a manifold perturbed by high-dimensional noise.

block_ratio : int, default=20

This parameter is used when dim_algo is “twoNN” and is ignored otherwise. It sets the minimum size of the blocks to n_samples/block_ratio. If blockAn=False, block_ratio is ignored.

frac : float, default=1.0

This parameter is used when dim_algo is “twoNN” and is ignored otherwise. It defines the fraction of points in the data set used for the ID calculation. By default, the full data set is used.

halos : bool, default=False

Whether to return halo points. If True, dpa.halos is returned: frames in low-density regions are labeled 0, comparable to the -1 noise label in sklearn clustering algorithms. If False, dpa.labels is returned and each frame is assigned to its most probable cluster.

method : str, default='standard'

Clustering method:

  • “standard”: Load all data into memory (default)

  • “sampling_knn”: Sample data + k-NN classifier fallback

sample_fraction : float, default=0.1

Fraction of the data to sample for sampling-based methods (10% by default). Final sample size: max(50000, min(100000, sample_fraction * n_samples)).
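The sample-size formula can be worked through for a few dataset sizes. This is a minimal sketch of the stated formula only; the function name final_sample_size is hypothetical, and truncating the product to an integer is an assumption.

```python
def final_sample_size(n_samples: int, sample_fraction: float = 0.1) -> int:
    """Hypothetical helper evaluating the documented sample-size formula."""
    # Assumption: the fractional product is truncated to an integer.
    requested = int(sample_fraction * n_samples)
    return max(50_000, min(100_000, requested))


for n in (200_000, 750_000, 10_000_000):
    print(n, final_sample_size(n))
```

With the default fraction, the result is clamped to the range 50000 to 100000, so sample_fraction only changes the outcome for datasets between 500000 and 1000000 frames. The formula as written does not cap the result at n_samples.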

knn_neighbors : int, default=5

Number of nearest neighbors for k-NN classifier in sampling methods

force : bool, default=False

Override memory and dimensionality checks (converts errors to warnings)

n_jobs : int, default=-1

Number of parallel jobs for distance computations. -1 means using all processors.

max_blas_threads : int or None, default=1

Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default. Set auto_limit_blas=False to disable thread limiting.

auto_limit_blas : bool, default=True

Apply a safe thread policy: use a single BLAS thread when n_jobs != 1, otherwise use max_blas_threads (falling back to 2 when it is None).
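The interaction between n_jobs, max_blas_threads, and auto_limit_blas can be summarized in a small decision function. This is a sketch of the policy as described, not the package's implementation: the name resolve_blas_threads is hypothetical, and returning None to mean "no limit applied" when auto_limit_blas=False is an assumption.

```python
from typing import Optional


def resolve_blas_threads(n_jobs: int = -1,
                         max_blas_threads: Optional[int] = 1,
                         auto_limit_blas: bool = True) -> Optional[int]:
    """Hypothetical helper: return the BLAS thread limit, or None for no limit."""
    if not auto_limit_blas:
        return None  # assumption: thread limiting is disabled entirely
    if n_jobs != 1:
        return 1  # parallel jobs requested: keep BLAS single-threaded
    # Single job: honor the preferred limit, falling back to 2 when unset.
    return 2 if max_blas_threads is None else max_blas_threads


print(resolve_blas_threads())                                  # defaults
print(resolve_blas_threads(n_jobs=1, max_blas_threads=4))      # single job
print(resolve_blas_threads(n_jobs=1, max_blas_threads=None))   # fallback
```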

Returned Metadata

algorithm : str

Always “dpa”

hyperparameters : dict

Dictionary containing all DPA parameters used

original_shape : tuple

Shape of the input data (n_samples, n_features)

n_clusters : int

Number of clusters found (excluding noise/halo points)

n_noise : int

Number of noise/halo points identified

silhouette_score : float or None

Silhouette score for clustering quality assessment

computation_time : float

Time taken for clustering computation in seconds

cluster_centers : list or None

Indices of cluster center points

densities : list or None

Density values for each point

nn_distances : list or None

Distances to k_max neighbors for each point

nn_indices : list or None

Indices of k_max neighbors for each point

topography : list or None

Topography matrix with peak heights and saddle points

error_densities : list or None

Uncertainty values of density estimation

cache_path : str

Path used for caching results

References

Parameter descriptions adapted from the DPA package documentation. See: https://github.com/mariaderrico/DPA

classmethod get_type_name() -> str

Return unique string identifier for DPA cluster type.

Returns

str

The string ‘dpa’

init_calculator(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False) -> None

Initialize the DPA calculator.

Parameters

cache_path : str, optional

Directory path for cache files. Default is ‘./cache’.

max_memory_gb : float, optional

Maximum memory threshold in GB. Default is 2.0.

chunk_size : int, optional

Chunk size for processing large datasets. Default is 1000.

use_memmap : bool, optional

Whether to use memory mapping for large datasets. Default is False.

Returns

None

compute(data: ndarray, center_method: str = 'centroid') -> Tuple[ndarray, Dict[str, Any]]

Compute DPA clustering.

Parameters

data : numpy.ndarray

Input data matrix to cluster, shape (n_samples, n_features)

center_method : str, optional

Method for calculating cluster centers, default='centroid'

Returns

Tuple[numpy.ndarray, Dict]

Tuple containing:

  • cluster_labels: Cluster labels for each sample

  • metadata: Dictionary with clustering information

Raises

ValueError

If calculator is not initialized