DPA Module
DPA cluster type implementation.
This module provides the DPA (Density Peak Advanced) cluster type, which implements density-based clustering for molecular dynamics trajectory analysis using the DPA package installed in the conda environment.
- class mdxplain.clustering.cluster_type.dpa.dpa.DPA(Z: float = 1.0, metric: str = 'euclidean', affinity: str = 'nearest_neighbors', density_algo: str = 'PAk', k_max: int = 1000, D_thr: float = 23.92812698, dim_algo: str = 'twoNN', blockAn: bool = True, block_ratio: int = 20, frac: float = 1.0, halos: bool = False, method: str = 'standard', sample_fraction: float = 0.1, knn_neighbors: int = 5, force: bool = False, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True)
DPA (Density Peak Advanced) cluster type.
DPA is a density-based clustering algorithm that identifies cluster centers as points with high density that are far from other high-density points. It’s particularly useful for identifying conformational states in molecular dynamics trajectories with complex cluster shapes and varying densities.
Uses the DPA package by M. d’Errico et al. under the hood, and sklearn’s NearestNeighbors for the k-NN sampling method.
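The notion of a density peak can be illustrated with a toy sketch. The code below is not the DPA package's algorithm, which relies on the PAk density estimator and the statistical confidence level Z described under Parameters. It is the simpler cutoff-based density-peak heuristic applied to 1-D points; the function name, the cutoff value, and the data are made up for illustration.

```python
def density_peak_scores(points, cutoff):
    """Return (rho, delta) for 1-D points using a simple cutoff density."""
    n = len(points)
    dist = [[abs(a - b) for b in points] for a in points]
    # rho[i]: number of other points closer than `cutoff` (local density).
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distance to the nearest point of strictly higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Points with no denser neighbor get their largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


points = [0, 1, 2, 50, 51, 52]  # two well-separated groups (made-up data)
rho, delta = density_peak_scores(points, cutoff=1.5)
# Candidate centers: high density AND far from any denser point.
scores = [r * d for r, d in zip(rho, delta)]
centers = sorted(sorted(range(len(points)), key=lambda i: -scores[i])[:2])
print(centers)  # indices of the two candidate centers
```

The middle point of each group combines the highest density with the largest distance to a denser point, which is what marks it as a candidate center.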
Examples
>>> # Create DPA with default parameters
>>> dpa = DPA()

>>> # Create DPA with custom parameters
>>> dpa = DPA(Z=1.5, metric='euclidean', density_algo='PAk',
...           k_max=500, blockAn=True, block_ratio=10)

>>> # Initialize and compute clustering
>>> dpa.init_calculator()
>>> labels, metadata = dpa.compute(data)
References
M. d’Errico, E. Facco, A. Laio, A. Rodriguez, Information Sciences, Volume 560, June 2021, 476-492. See: https://github.com/mariaderrico/DPA
- __init__(Z: float = 1.0, metric: str = 'euclidean', affinity: str = 'nearest_neighbors', density_algo: str = 'PAk', k_max: int = 1000, D_thr: float = 23.92812698, dim_algo: str = 'twoNN', blockAn: bool = True, block_ratio: int = 20, frac: float = 1.0, halos: bool = False, method: str = 'standard', sample_fraction: float = 0.1, knn_neighbors: int = 5, force: bool = False, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True) -> None
Initialize DPA cluster type.
Parameters
- Z : float, default=1.0
The number of standard deviations, which fixes the level of statistical confidence at which one decides to consider a cluster meaningful.
- metric : string or callable, default="euclidean"
The distance metric to use. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in VALID_METRIC = [precomputed, euclidean, cosine]. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.
- affinity : string or callable, default='nearest_neighbors'
How to construct the affinity matrix. One of:
"nearest_neighbors": construct the affinity matrix by computing a graph of nearest neighbors.
"rbf": construct the affinity matrix using a radial basis function (RBF) kernel.
"precomputed": interpret X as a precomputed affinity matrix.
"precomputed_nearest_neighbors": interpret X as a sparse graph of precomputed nearest neighbors, and construct the affinity matrix by selecting the n_neighbors nearest neighbors.
Any of the kernels supported by sklearn.metrics.pairwise_kernels.
- density_algo : string, default="PAk"
Define the algorithm to use as density estimator. It must be one of the options allowed by VALID_DENSITY = [PAk, kNN].
- k_max : int, default=1000
Considered only if density_algo is "PAk" or "kNN"; ignored otherwise. k_max sets the maximum number of nearest neighbors considered by the density estimator. If density_algo="PAk", k_max is used in the search for the largest number of neighbors k_hat for which the condition of constant density holds, within a given level of confidence. If density_algo="kNN", k_max sets the number of neighbors used by the standard k-Nearest Neighbor algorithm. If the number of points in the sample, N, is less than the default value, k_max is automatically set to N/2.
- D_thr : float, default=23.92812698
Considered only if density_algo is "PAk"; ignored otherwise. Sets the level of confidence in the PAk density estimator. The default value corresponds to a p-value of 10^-6 for a χ² distribution with one degree of freedom.
- dim_algo : string, default="twoNN"
Method for intrinsic dimensionality calculation. If dim_algo is “auto”, dim is assumed to be equal to n_samples. If dim_algo is a string, it must be one of the options allowed by VALID_DIM = [auto, twoNN].
- blockAn : bool, default=True
Considered only if dim_algo is "twoNN"; ignored otherwise. If blockAn is True, the algorithm performs a block analysis that discriminates the relevant dimensions as a function of the block size. This makes it possible to study how stable the estimate is with respect to changes in the neighborhood size, which is crucial for intrinsic dimension (ID) estimation when the data lie on a manifold perturbed by high-dimensional noise.
- block_ratio : int, default=20
Considered only if dim_algo is "twoNN"; ignored otherwise. Sets the minimum block size to n_samples/block_ratio. If blockAn=False, block_ratio is ignored.
- frac : float, default=1.0
Considered only if dim_algo is "twoNN"; ignored otherwise. Defines the fraction of points in the data set used for the ID calculation. By default the full data set is used.
- halos : bool, default=False
Whether to return halo assignments. If True, dpa.halos is returned instead of dpa.labels: frames that lie in low-density regions are set to 0, which plays a role similar to the -1 noise label in sklearn clustering algorithms. If False, each frame is assigned to its most probable cluster.
- method : str, default="standard"
Clustering method:
"standard": Load all data into memory (default).
"sampling_knn": Sample the data and assign the remaining points with a k-NN classifier.
- sample_fraction : float, default=0.1
Fraction of the data to sample for sampling-based methods (default 0.1, i.e. 10%). Final sample size: max(50000, min(100000, sample_fraction * n_samples)).
- knn_neighbors : int, default=5
Number of nearest neighbors for the k-NN classifier in sampling methods.
- force : bool, default=False
Override memory and dimensionality checks (converts errors to warnings).
- n_jobs : int, default=-1
Number of parallel jobs for distance computations. -1 means using all processors.
- max_blas_threads : int or None, default=1
Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default. To disable thread limiting altogether, set auto_limit_blas=False.
- auto_limit_blas : bool, default=True
Apply a safe thread policy: limit BLAS to 1 thread when n_jobs != 1; otherwise use max_blas_threads (falling back to 2 when it is None).
Returned Metadata
- algorithm : str
Always "dpa"
- hyperparameters : dict
Dictionary containing all DPA parameters used
- original_shape : tuple
Shape of the input data (n_samples, n_features)
- n_clusters : int
Number of clusters found (excluding noise/halo points)
- n_noise : int
Number of noise/halo points identified
- silhouette_score : float or None
Silhouette score for clustering quality assessment
- computation_time : float
Time taken for clustering computation in seconds
- cluster_centers : list or None
Indices of cluster center points
- densities : list or None
Density values for each point
- nn_distances : list or None
Distances to k_max neighbors for each point
- nn_indices : list or None
Indices of k_max neighbors for each point
- topography : list or None
Topography matrix with peak heights and saddle points
- error_densities : list or None
Uncertainty values of density estimation
- cache_path : str
Path used for caching results
References
Parameter descriptions adapted from the DPA package documentation. See: https://github.com/mariaderrico/DPA
- classmethod get_type_name() -> str
Return unique string identifier for DPA cluster type.
Returns
- str
The string ‘dpa’
- init_calculator(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False) -> None
Initialize the DPA calculator.
Parameters
- cache_path : str, optional
Directory path for cache files. Default is ‘./cache’.
- max_memory_gb : float, optional
Maximum memory threshold in GB. Default is 2.0.
- chunk_size : int, optional
Chunk size for processing large datasets. Default is 1000.
- use_memmap : bool, optional
Whether to use memory mapping for large datasets. Default is False.
Returns
None
- compute(data: ndarray, center_method: str = 'centroid') -> Tuple[ndarray, Dict[str, Any]]
Compute DPA clustering.
Parameters
- data : numpy.ndarray
Input data matrix to cluster, shape (n_samples, n_features)
- center_method : str, optional
Method for calculating cluster centers. Default is "centroid".
Returns
- Tuple[numpy.ndarray, Dict]
Tuple containing:
cluster_labels: Cluster labels for each sample
metadata: Dictionary with clustering information
Raises
- ValueError
If the calculator has not been initialized (call init_calculator() first).