HDBSCAN Module
HDBSCAN cluster type implementation.
This module provides the HDBSCAN cluster type that implements hierarchical density-based clustering for molecular dynamics trajectory analysis.
- class mdxplain.clustering.cluster_type.hdbscan.hdbscan.HDBSCAN(min_cluster_size: int = 5, min_samples: int | None = None, cluster_selection_epsilon: float = 0.0, cluster_selection_method: str = 'eom', method: str = 'standard', sample_fraction: float = 0.1, knn_neighbors: int = 5, force: bool = False, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True)
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) cluster type.
HDBSCAN extends DBSCAN into a hierarchical clustering algorithm and then extracts a flat clustering based on the stability of clusters. It is particularly useful for identifying conformational states in molecular dynamics trajectories whose regions differ in density.
Uses sklearn’s HDBSCAN and NearestNeighbors under the hood.
Examples
>>> # Create HDBSCAN with default parameters
>>> hdbscan = HDBSCAN()

>>> # Create HDBSCAN with custom parameters
>>> hdbscan = HDBSCAN(min_cluster_size=10, min_samples=5)

>>> # Initialize and compute clustering
>>> hdbscan.init_calculator()
>>> labels, metadata = hdbscan.compute(data)
- __init__(min_cluster_size: int = 5, min_samples: int | None = None, cluster_selection_epsilon: float = 0.0, cluster_selection_method: str = 'eom', method: str = 'standard', sample_fraction: float = 0.1, knn_neighbors: int = 5, force: bool = False, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True) -> None
Initialize HDBSCAN cluster type.
Parameters
- min_cluster_size : int, optional
Minimum size of clusters. Default is 5.
- min_samples : int, optional
Minimum number of samples in a neighborhood for a point to count as a core point. If None, defaults to min_cluster_size.
- cluster_selection_epsilon : float, optional
Distance threshold for cluster selection. Default is 0.0.
- cluster_selection_method : str, optional
Method for cluster selection (‘eom’ or ‘leaf’). Default is ‘eom’.
- method : str, default="standard"
Clustering method:
“standard”: Load all data into memory (default)
“sampling_approximate”: Sample data + approximate_predict for large datasets
“sampling_knn”: Sample data + k-NN classifier fallback
- sample_fraction : float, default=0.1
Fraction of the data to sample for the sampling-based methods (default 10%). Final sample size: max(50000, min(100000, sample_fraction * n_samples)).
- knn_neighbors : int, default=5
Number of neighbors for the k-NN classifier used by the “sampling_knn” method.
- force : bool, default=False
Override memory and dimensionality checks (converts errors to warnings).
- n_jobs : int, default=-1
Number of parallel jobs for core distance computation. -1 uses all processors.
- max_blas_threads : int or None, default=1
Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default; set auto_limit_blas=False to disable thread limiting.
- auto_limit_blas : bool, default=True
Apply a safe thread policy: limit BLAS to 1 thread when n_jobs != 1; otherwise use max_blas_threads (falling back to 2 when it is None).
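The two rules stated in the parameter list above, the final sample size and the BLAS thread policy, can be written out as plain functions. This is a minimal sketch rather than mdxplain's own code: the names `sampling_size` and `blas_thread_limit` are hypothetical, truncating the product to an integer is an assumption, and returning None to mean "no limit applied" is only a convention chosen for this example.

```python
from typing import Optional


def sampling_size(n_samples: int, sample_fraction: float = 0.1) -> int:
    """Evaluate max(50000, min(100000, sample_fraction * n_samples))."""
    # Assumption: the fractional product is truncated to an integer.
    requested = int(sample_fraction * n_samples)
    # Note: the floor of 50000 can exceed n_samples for small inputs;
    # the documentation does not say how that case is handled.
    return max(50_000, min(100_000, requested))


def blas_thread_limit(
    n_jobs: int = -1,
    max_blas_threads: Optional[int] = 1,
    auto_limit_blas: bool = True,
) -> Optional[int]:
    """Return the BLAS thread limit implied by the documented policy."""
    if not auto_limit_blas:
        return None  # thread limiting disabled (convention of this sketch)
    if n_jobs != 1:
        return 1  # parallel jobs requested: keep BLAS single-threaded
    # Single job: honour the preferred limit, falling back to 2 when None.
    return 2 if max_blas_threads is None else max_blas_threads


print(sampling_size(5_000_000))                           # capped at the upper bound
print(blas_thread_limit(n_jobs=1, max_blas_threads=None))  # fallback value
```

With the defaults, any dataset whose requested sample falls below 50,000 points is raised to that floor, and any request above 100,000 points is capped.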
Returned Metadata
- algorithm : str
Always “hdbscan”
- hyperparameters : dict
Dictionary containing all HDBSCAN parameters used
- original_shape : tuple
Shape of the input data (n_samples, n_features)
- n_clusters : int
Number of clusters found (excluding noise points)
- n_noise : int
Number of noise points identified (label -1)
- silhouette_score : float or None
Silhouette score for clustering quality assessment
- computation_time : float
Time taken for clustering computation in seconds
- cluster_probabilities : list or None
Cluster membership probabilities for each point
- outlier_scores : list or None
Outlier scores for each point
- cache_path : str
Path used for caching results
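The n_clusters and n_noise entries follow directly from the label convention, in which -1 marks noise. A minimal sketch of that bookkeeping is shown here; `summarize_labels` is a hypothetical helper, and a plain list stands in for the label array.

```python
def summarize_labels(labels):
    """Count clusters and noise points, treating label -1 as noise."""
    cluster_ids = {int(label) for label in labels if label != -1}
    n_noise = sum(1 for label in labels if label == -1)
    return {"n_clusters": len(cluster_ids), "n_noise": n_noise}


# Illustrative labels: three clusters (0, 1, 2) and two noise points.
labels = [0, 0, 1, -1, 1, 2, -1, 0]
summary = summarize_labels(labels)
print(summary)  # {'n_clusters': 3, 'n_noise': 2}
```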
- classmethod get_type_name() -> str
Return unique string identifier for HDBSCAN cluster type.
Returns
- str
The string ‘hdbscan’
- init_calculator(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False) -> None
Initialize the HDBSCAN calculator.
Parameters
- cache_path : str, optional
Directory path for cache files. Default is ‘./cache’.
- max_memory_gb : float, optional
Maximum memory threshold in GB. Default is 2.0.
- chunk_size : int, optional
Chunk size for processing large datasets. Default is 1000.
- use_memmap : bool, optional
Whether to use memory mapping for large datasets. Default is False.
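The documentation does not state how the max_memory_gb threshold is applied. As a rough back-of-envelope check before choosing a value, the size of a dense feature matrix can be estimated by hand. The sketch below assumes 8-byte (float64) values and 1 GB = 1024**3 bytes; both are assumptions, and `estimated_size_gb` and `fits_in_memory` are hypothetical helpers, not part of mdxplain.

```python
def estimated_size_gb(n_samples: int, n_features: int, bytes_per_value: int = 8) -> float:
    """Approximate in-memory size of a dense (n_samples, n_features) matrix."""
    # Assumption: 1 GB is counted as 1024**3 bytes.
    return n_samples * n_features * bytes_per_value / 1024**3


def fits_in_memory(n_samples: int, n_features: int, max_memory_gb: float = 2.0) -> bool:
    """Compare the estimated matrix size with a memory threshold in GB."""
    return estimated_size_gb(n_samples, n_features) <= max_memory_gb


# About one million frames with 128 features stay under the 2.0 GB default.
print(estimated_size_gb(1024**2, 128))  # 1.0
print(fits_in_memory(1024**2, 512))     # False
```

An estimate above the threshold suggests considering use_memmap=True or one of the sampling-based methods.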
- compute(data: ndarray, center_method: str = 'centroid') -> Tuple[ndarray, Dict[str, Any]]
Compute HDBSCAN clustering.
Parameters
- data : numpy.ndarray
Input data matrix to cluster, shape (n_samples, n_features)
- center_method : str, optional
Method for calculating cluster centers. Default is "centroid".
Returns
- Tuple[numpy.ndarray, Dict]
Tuple containing:
cluster_labels: Cluster labels for each sample (-1 for noise)
metadata: Dictionary with clustering information
Raises
- ValueError
If the calculator has not been initialized (call init_calculator() before compute())
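Once compute() has returned, a common next step is to collect the frames that belong to each cluster and set the noise frames aside. The sketch below shows one way to do that with the returned cluster_labels; `frames_by_cluster` is a hypothetical helper, and a plain list stands in for the label array.

```python
from collections import defaultdict


def frames_by_cluster(cluster_labels):
    """Map each cluster label to the list of frame indices carrying it."""
    groups = defaultdict(list)
    for frame_index, label in enumerate(cluster_labels):
        groups[int(label)].append(frame_index)
    return dict(groups)


# Illustrative labels for six frames; -1 marks a noise frame.
cluster_labels = [0, 0, -1, 1, 1, 0]
groups = frames_by_cluster(cluster_labels)
noise_frames = groups.pop(-1, [])  # separate noise from real clusters

print(groups)        # {0: [0, 1, 5], 1: [3, 4]}
print(noise_frames)  # [2]
```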