DPA Module


DPA cluster type implementation.

This module provides the DPA (Density Peak Advanced) cluster type, which implements density-based clustering for molecular dynamics trajectory analysis using the DPA package installed in the conda environment.

class mdxplain.clustering.cluster_type.dpa.dpa.DPA(Z: float = 1.0, metric: str = 'euclidean', affinity: str = 'nearest_neighbors', density_algo: str = 'PAk', k_max: int = 1000, D_thr: float = 23.92812698, dim_algo: str = 'twoNN', blockAn: bool = True, block_ratio: int = 20, frac: float = 1.0, halos: bool = False, method: str = 'standard', sample_fraction: float = 0.1, knn_neighbors: int = 5, force: bool = False, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True)

DPA (Density Peak Advanced) cluster type.

DPA is a density-based clustering algorithm that identifies cluster centers as points with high density that are far from other high-density points. It’s particularly useful for identifying conformational states in molecular dynamics trajectories with complex cluster shapes and varying densities.

Uses the DPA package by Maria d’Errico et al. under the hood, and sklearn’s NearestNeighbors for the k-NN sampling method.
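The density-peak idea described above can be illustrated with a deliberately simplified sketch. It is not the DPA package's algorithm: it uses a plain k-NN density instead of the PAk estimator and skips the statistical merging controlled by Z. The function name `density_peak_centers` and its parameters are hypothetical.

```python
import numpy as np


def density_peak_centers(X, k=10, n_centers=2):
    """Pick candidate centers: dense points far from any denser point."""
    # Full pairwise Euclidean distance matrix (fine for small data sets).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Distance to the k-th nearest neighbor; index 0 is the point itself.
    r_k = np.sort(dist, axis=1)[:, k]
    rho = 1.0 / r_k  # crude density proxy (assumption, not PAk)

    # delta[i]: distance from i to the nearest point with higher density.
    order = np.argsort(-rho)
    delta = np.empty(len(X))
    delta[order[0]] = dist[order[0]].max()  # densest point has no denser one
    for rank in range(1, len(X)):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()

    # Centers combine high density with large separation.
    gamma = rho * delta
    return np.argsort(-gamma)[:n_centers]


rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),   # blob A: rows 0..99
    rng.normal(10.0, 1.0, size=(100, 2)),  # blob B: rows 100..199
])
centers = density_peak_centers(X, k=10, n_centers=2)
print(centers)
```

On two well-separated blobs, the two selected indices are expected to fall in different blobs, one per density peak.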

Examples

>>> # Create DPA with default parameters
>>> dpa = DPA()

>>> # Create DPA with custom parameters
>>> dpa = DPA(Z=1.5, metric='euclidean', density_algo='PAk',
...           k_max=500, blockAn=True, block_ratio=10)

>>> # Initialize and compute clustering
>>> dpa.init_calculator()
>>> labels, metadata = dpa.compute(data)

References

M. d’Errico, E. Facco, A. Laio, A. Rodriguez, Information Sciences, Volume 560, June 2021, 476-492. See: https://github.com/mariaderrico/DPA

__init__(Z: float = 1.0, metric: str = 'euclidean', affinity: str = 'nearest_neighbors', density_algo: str = 'PAk', k_max: int = 1000, D_thr: float = 23.92812698, dim_algo: str = 'twoNN', blockAn: bool = True, block_ratio: int = 20, frac: float = 1.0, halos: bool = False, method: str = 'standard', sample_fraction: float = 0.1, knn_neighbors: int = 5, force: bool = False, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True) → None

Initialize DPA cluster type.

Parameters

Z : float, default=1.0

The number of standard deviations, which fixes the level of statistical confidence at which one decides to consider a cluster meaningful.

metric : string or callable, default="euclidean"

The distance metric to use. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in VALID_METRIC = [precomputed, euclidean, cosine]. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.
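As a minimal sketch of the metric options, the snippet below builds distance matrices with scipy.spatial.distance.pdist, once with a named metric and once with a callable. The helper `manhattan` is a hypothetical example metric. Passing such a matrix together with metric="precomputed" is what the description above suggests, but how the wrapper forwards it to the DPA package is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def manhattan(u, v):
    """Hypothetical callable metric: takes two rows, returns one distance."""
    return float(np.abs(u - v).sum())


X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Named metric: condensed distances expanded to a square matrix.
D_euclid = squareform(pdist(X, metric="euclidean"))

# Callable metric: called on every pair of rows.
D_custom = squareform(pdist(X, metric=manhattan))

print(D_euclid)
print(D_custom)
```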

affinity : string or callable, default="nearest_neighbors"

How to construct the affinity matrix.

“nearest_neighbors”: construct the affinity matrix by computing a graph of nearest neighbors.
“rbf”: construct the affinity matrix using a radial basis function (RBF) kernel.
“precomputed”: interpret X as a precomputed affinity matrix.
“precomputed_nearest_neighbors”: interpret X as a sparse graph of precomputed nearest neighbors, and construct the affinity matrix by selecting the n_neighbors nearest neighbors.
one of the kernels supported by sklearn.metrics.pairwise_kernels.

density_algo : string, default="PAk"

Define the algorithm to use as density estimator. It must be one of the options allowed by VALID_DENSITY = [PAk, kNN].

k_max : int, default=1000

This parameter is used if density_algo is "PAk" or "kNN"; it is ignored otherwise. k_max sets the maximum number of nearest neighbors considered by the density estimator. If density_algo="PAk", k_max is used by the algorithm in the search for the largest number of neighbors k_hat for which the condition of constant density holds, within a given level of confidence. If density_algo="kNN", k_max sets the number of neighbors used by the standard k-Nearest Neighbor algorithm. If the number of points in the sample N is less than the default value, k_max is automatically set to N/2.
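The automatic adjustment of k_max for small samples can be sketched as follows. The helper `effective_k_max` is hypothetical. The sketch compares N against the requested k_max and uses integer division, both of which are assumptions about how the DPA package applies the rule.

```python
def effective_k_max(n_points, k_max=1000):
    """Return the k_max that would be used for a sample of n_points."""
    if n_points < k_max:
        # Small sample: fall back to half the number of points (assumed floor).
        return n_points // 2
    return k_max


small = effective_k_max(400)    # fewer points than k_max
large = effective_k_max(5000)   # enough points, k_max unchanged
print(small, large)
```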

D_thr : float, default=23.92812698

This parameter is used only if density_algo is "PAk". It sets the level of confidence in the PAk density estimator. The default value corresponds to a p-value of 10^-6 for a χ² distribution with one degree of freedom.
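The link between the default D_thr and the stated p-value can be checked with the standard library alone. For a χ² distribution with one degree of freedom, the tail probability is P(χ² > x) = erfc(√(x/2)), because χ² is the square of a standard normal variable. This is a verification sketch and not code from the package.

```python
import math

D_THR_DEFAULT = 23.92812698

# Survival function of chi-squared with 1 degree of freedom:
# P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
p_value = math.erfc(math.sqrt(D_THR_DEFAULT / 2.0))

print(f"{p_value:.3e}")
```

A larger D_thr corresponds to a smaller p-value, that is, a stricter confidence level.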

dim_algo : string, default="twoNN"

Method for intrinsic dimensionality calculation. If dim_algo is “auto”, dim is assumed to be equal to n_samples. If dim_algo is a string, it must be one of the options allowed by VALID_DIM = [auto, twoNN].
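To illustrate what a TwoNN-style estimate computes, the sketch below uses the ratio of the second to the first nearest-neighbor distance and a maximum-likelihood formula, d ≈ N / Σ ln(μ_i). This is a simplified variant for illustration; the DPA package's own twoNN routine, including block analysis, may differ. The function name `two_nn_dimension` is hypothetical.

```python
import numpy as np


def two_nn_dimension(X):
    """Estimate intrinsic dimension from first/second neighbor distances."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors
    # After partitioning, columns 0 and 1 hold the two smallest distances.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    mu = r2 / r1
    # Maximum-likelihood estimate (assumed variant, not the package's fit).
    return len(X) / np.sum(np.log(mu))


rng = np.random.default_rng(0)
plane = rng.uniform(0.0, 1.0, size=(1000, 2))
# Embed the 2-D plane in 5 dimensions; the intrinsic dimension stays 2.
X = np.hstack([plane, np.zeros((1000, 3))])
dim_estimate = two_nn_dimension(X)
print(dim_estimate)
```

The estimate should be close to 2 even though the data have five features, which is the point of estimating the intrinsic rather than the embedding dimension.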

blockAn : bool, default=True

This parameter is used only if dim_algo is "twoNN". If blockAn is True, the algorithm performs a block analysis that discriminates the relevant dimensions as a function of the block size. This makes it possible to study the stability of the estimate with respect to changes in the neighborhood size, which is crucial for ID estimation when the data lie on a manifold perturbed by high-dimensional noise.

block_ratio : int, default=20

This parameter is used only if dim_algo is "twoNN". It sets the minimum size of the blocks to n_samples/block_ratio. If blockAn=False, block_ratio is ignored.

frac : float, default=1.0

This parameter is used only if dim_algo is "twoNN". It defines the fraction of points in the data set used for the ID calculation. By default, the full data set is used.

halos : bool, default=False

Whether to return halo points. If True, dpa.halos is returned: frames in low-density regions are set to 0, analogous to the noise label -1 in sklearn clustering algorithms. If False, dpa.labels is returned and each frame is assigned to its most probable cluster.

method : str, default="standard"

Clustering method:

“standard”: Load all data into memory (default)
“sampling_knn”: Sample data + k-NN classifier fallback

sample_fraction : float, default=0.1

Fraction of the data to sample for sampling-based methods (default: 10%). Final sample size: max(50000, min(100000, sample_fraction * n_samples)).
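The sample-size formula can be worked through for a few trajectory lengths. The helper `final_sample_size` is hypothetical and only evaluates the stated expression; truncating to an integer is an assumption, and how the package handles data sets smaller than 50000 frames is not stated here.

```python
def final_sample_size(n_samples, sample_fraction=0.1):
    """Evaluate max(50000, min(100000, sample_fraction * n_samples))."""
    requested = sample_fraction * n_samples
    # Clamp the requested size to the range [50000, 100000].
    return int(max(50_000, min(100_000, requested)))


size_small = final_sample_size(200_000)     # 10% = 20000, raised to the floor
size_medium = final_sample_size(750_000)    # 10% = 75000, used as is
size_large = final_sample_size(5_000_000)   # 10% = 500000, capped
print(size_small, size_medium, size_large)
```

In effect, sample_fraction only changes the result when sample_fraction * n_samples lies between 50000 and 100000.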

knn_neighbors : int, default=5

Number of nearest neighbors for the k-NN classifier in sampling methods.
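The "sampling_knn" idea, clustering a subsample and assigning the remaining frames by k-NN vote, can be sketched with sklearn's NearestNeighbors. This is an illustration under assumptions: the helper `propagate_labels` is hypothetical, and the known blob membership stands in for the labels that DPA would produce on the subsample.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def propagate_labels(X_sample, sample_labels, X_rest, knn_neighbors=5):
    """Assign each remaining point the majority label of its neighbors."""
    nn = NearestNeighbors(n_neighbors=knn_neighbors).fit(X_sample)
    _, idx = nn.kneighbors(X_rest)       # neighbor indices into X_sample
    neighbor_labels = sample_labels[idx]  # shape (n_rest, knn_neighbors)
    # Majority vote per row; labels must be non-negative integers here.
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])


rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(10.0, 1.0, size=(200, 2)),
])
true_labels = np.repeat([0, 1], 200)

# Draw a 10% subsample; its labels stand in for a clustering result.
sample_idx = rng.choice(len(X), size=40, replace=False)
rest_mask = np.ones(len(X), dtype=bool)
rest_mask[sample_idx] = False

rest_labels = propagate_labels(
    X[sample_idx], true_labels[sample_idx], X[rest_mask], knn_neighbors=5
)
print((rest_labels == true_labels[rest_mask]).mean())
```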

force : bool, default=False

Override memory and dimensionality checks (converts errors to warnings).

n_jobs : int, default=-1

Number of parallel jobs for distance computations. -1 means using all processors.

max_blas_threads : int or None, default=1

Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default, or set auto_limit_blas=False to disable thread limiting.

auto_limit_blas : bool, default=True

Apply a safe thread policy: limit BLAS to 1 thread when n_jobs != 1; otherwise use max_blas_threads (falling back to 2 when it is None).
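The thread policy described for n_jobs, max_blas_threads and auto_limit_blas can be written as a small decision function. The helper `resolve_blas_threads` is hypothetical and only restates the documented rules. Returning None to mean "no limit applied" is an assumption of this sketch.

```python
from typing import Optional


def resolve_blas_threads(
    n_jobs: int,
    max_blas_threads: Optional[int] = 1,
    auto_limit_blas: bool = True,
) -> Optional[int]:
    """Return the BLAS thread limit implied by the documented policy."""
    if not auto_limit_blas:
        return None  # thread limiting disabled (assumed representation)
    if n_jobs != 1:
        return 1  # parallel jobs: keep BLAS single-threaded
    # Single job: honor the preferred limit, or fall back to 2.
    return 2 if max_blas_threads is None else max_blas_threads


print(resolve_blas_threads(n_jobs=-1, max_blas_threads=4))
print(resolve_blas_threads(n_jobs=1, max_blas_threads=4))
print(resolve_blas_threads(n_jobs=1, max_blas_threads=None))
```

Keeping BLAS at one thread when several jobs run in parallel avoids oversubscribing the CPU with nested thread pools.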

Returned Metadata

algorithm (str): Always "dpa"
hyperparameters (dict): Dictionary containing all DPA parameters used
original_shape (tuple): Shape of the input data (n_samples, n_features)
n_clusters (int): Number of clusters found (excluding noise/halo points)
n_noise (int): Number of noise/halo points identified
silhouette_score (float or None): Silhouette score for clustering quality assessment
computation_time (float): Time taken for clustering computation in seconds
cluster_centers (list or None): Indices of cluster center points
densities (list or None): Density values for each point
nn_distances (list or None): Distances to k_max neighbors for each point
nn_indices (list or None): Indices of k_max neighbors for each point
topography (list or None): Topography matrix with peak heights and saddle points
error_densities (list or None): Uncertainty values of density estimation
cache_path (str): Path used for caching results
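As a sketch of how n_clusters and n_noise relate to a label array, the snippet below counts distinct cluster labels while excluding one designated noise/halo label. The helper `summarize_labels` is hypothetical, and the noise label is passed explicitly because the value the package uses for halo points is treated here as an assumption.

```python
import numpy as np


def summarize_labels(labels, noise_label):
    """Count clusters (excluding noise) and noise points in a label array."""
    labels = np.asarray(labels)
    is_noise = labels == noise_label
    return {
        "n_clusters": int(len(np.unique(labels[~is_noise]))),
        "n_noise": int(is_noise.sum()),
    }


# Example labels with 0 used as the noise/halo label (assumption).
summary = summarize_labels([1, 1, 2, 0, 2, 3, 0], noise_label=0)
print(summary)
```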

References

Parameter descriptions adapted from the DPA package documentation. See: https://github.com/mariaderrico/DPA

classmethod get_type_name() → str

Return unique string identifier for DPA cluster type.

Returns

str: The string ‘dpa’

init_calculator(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False) → None

Initialize the DPA calculator.

Parameters

cache_path (str, optional): Directory path for cache files. Default is ‘./cache’.
max_memory_gb (float, optional): Maximum memory threshold in GB. Default is 2.0.
chunk_size (int, optional): Chunk size for processing large datasets. Default is 1000.
use_memmap (bool, optional): Whether to use memory mapping for large datasets. Default is False.

Returns

None

compute(data: ndarray, center_method: str = 'centroid') → Tuple[ndarray, Dict[str, Any]]

Compute DPA clustering.

Parameters

data (numpy.ndarray): Input data matrix to cluster, shape (n_samples, n_features)
center_method (str, optional): Method for calculating cluster centers. Default is "centroid".

Returns

Tuple[numpy.ndarray, Dict]

Tuple containing:

cluster_labels: Cluster labels for each sample
metadata: Dictionary with clustering information

Raises

ValueError: If calculator is not initialized