HDBSCAN Module


HDBSCAN cluster type implementation.

This module provides the HDBSCAN cluster type that implements hierarchical density-based clustering for molecular dynamics trajectory analysis.

class mdxplain.clustering.cluster_type.hdbscan.hdbscan.HDBSCAN(min_cluster_size: int = 5, min_samples: int | None = None, cluster_selection_epsilon: float = 0.0, cluster_selection_method: str = 'eom', method: str = 'standard', sample_fraction: float = 0.1, knn_neighbors: int = 5, force: bool = False, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True)

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) cluster type.

HDBSCAN extends DBSCAN into a hierarchical clustering algorithm and then extracts a flat clustering based on the stability of the clusters. It is particularly useful for identifying conformational states in molecular dynamics trajectories whose regions differ in density.

Uses sklearn’s HDBSCAN and NearestNeighbors under the hood.

Examples

>>> # Create HDBSCAN with default parameters
>>> hdbscan = HDBSCAN()
>>> # Create HDBSCAN with custom parameters
>>> hdbscan = HDBSCAN(min_cluster_size=10, min_samples=5)
>>> # Initialize and compute clustering
>>> hdbscan.init_calculator()
>>> labels, metadata = hdbscan.compute(data)

__init__(min_cluster_size: int = 5, min_samples: int | None = None, cluster_selection_epsilon: float = 0.0, cluster_selection_method: str = 'eom', method: str = 'standard', sample_fraction: float = 0.1, knn_neighbors: int = 5, force: bool = False, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True) -> None

Initialize HDBSCAN cluster type.

Parameters

min_cluster_size : int, optional

Minimum size of clusters. Default is 5.

min_samples : int, optional

Minimum samples in neighborhood for core point. If None, defaults to min_cluster_size.

cluster_selection_epsilon : float, optional

Distance threshold for cluster selection. Default is 0.0.

cluster_selection_method : str, optional

Method for cluster selection (‘eom’ or ‘leaf’). Default is ‘eom’.

method : str, default=“standard”

Clustering method:

  • “standard”: Load all data into memory (default)

  • “sampling_approximate”: Sample data + approximate_predict for large datasets

  • “sampling_knn”: Sample data + k-NN classifier fallback
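As a rough illustration of the k-NN fallback idea behind “sampling_knn”, the sketch below shows only the label-propagation step: points outside the sample receive the majority label of their nearest sampled neighbors. The helper `knn_assign` and the toy data are hypothetical; the sample labels are hard-coded here, whereas in practice they would come from clustering the sample, and this is not the module's actual implementation.

```python
import math
from collections import Counter


def knn_assign(sample_points, sample_labels, new_point, k=5):
    """Hypothetical helper: majority label among the k nearest sampled points."""
    # Rank sampled points by Euclidean distance to the new point.
    order = sorted(
        range(len(sample_points)),
        key=lambda i: math.dist(sample_points[i], new_point),
    )
    nearest_labels = [sample_labels[i] for i in order[:k]]
    # Majority vote over the k nearest labels.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy sample with two well-separated clusters (labels assumed already computed).
sample_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                 (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
sample_labels = [0, 0, 0, 1, 1, 1]

label_a = knn_assign(sample_points, sample_labels, (0.05, 0.05), k=3)
label_b = knn_assign(sample_points, sample_labels, (4.9, 5.2), k=3)
```

A library classifier would normally replace the brute-force distance sort shown here; the sketch only makes the voting step explicit.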

sample_fraction : float, default=0.1

Fraction of data to sample for the sampling-based methods (default 10%). Final sample size: max(50000, min(100000, sample_fraction * n_samples)).
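The sample-size formula can be checked with a few lines of arithmetic. The function name `final_sample_size` is hypothetical, and the sketch applies the documented formula literally; how the library handles datasets smaller than the 50000 floor is not stated in this reference and is not modelled here.

```python
def final_sample_size(n_samples: int, sample_fraction: float = 0.1) -> int:
    """Hypothetical helper applying the documented clamp to [50000, 100000]."""
    requested = int(sample_fraction * n_samples)
    # The lower bound of 50000 and the upper bound of 100000 come from the formula above.
    return max(50_000, min(100_000, requested))


small = final_sample_size(200_000)           # 10% would be 20000, raised to the floor
medium = final_sample_size(300_000, 0.25)    # 25% is 75000, inside the bounds
large = final_sample_size(5_000_000)         # 10% would be 500000, capped at the ceiling
```

In other words, sample_fraction only changes the result when the requested size falls between the two bounds.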

knn_neighbors : int, default=5

Number of neighbors for the k-NN classifier used by the “sampling_knn” method.

force : bool, default=False

Override memory and dimensionality checks (converts errors to warnings).

n_jobs : int, default=-1

Number of parallel jobs for core distance computation. -1 means using all processors.

max_blas_threads : int or None, default=1

Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default; set auto_limit_blas=False to disable thread limiting.

auto_limit_blas : bool, default=True

Apply a safe thread policy: limit BLAS to 1 thread when n_jobs != 1; otherwise use max_blas_threads (falling back to 2 when it is None).
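The interaction between n_jobs, max_blas_threads, and auto_limit_blas can be read as a small decision function. The sketch below is one interpretation of the parameter descriptions; `resolve_blas_threads` is a hypothetical name, and returning None to mean "no limit applied" is an assumption made for this illustration.

```python
from typing import Optional


def resolve_blas_threads(
    n_jobs: int,
    max_blas_threads: Optional[int] = 1,
    auto_limit_blas: bool = True,
) -> Optional[int]:
    """Hypothetical helper: BLAS thread limit implied by the documented policy."""
    if not auto_limit_blas:
        # Thread limiting disabled; None stands for "no limit" in this sketch.
        return None
    if n_jobs != 1:
        # Parallel jobs requested: keep BLAS single-threaded to avoid oversubscription.
        return 1
    # Single job: honor the preferred limit, falling back to 2 when it is None.
    return max_blas_threads if max_blas_threads is not None else 2


default_policy = resolve_blas_threads(n_jobs=-1)                        # parallel jobs
single_job = resolve_blas_threads(n_jobs=1, max_blas_threads=4)         # preferred limit
fallback = resolve_blas_threads(n_jobs=1, max_blas_threads=None)        # safe default
unlimited = resolve_blas_threads(n_jobs=4, auto_limit_blas=False)       # limiting off
```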

Returned Metadata

algorithm : str

Always “hdbscan”

hyperparameters : dict

Dictionary containing all HDBSCAN parameters used

original_shape : tuple

Shape of the input data (n_samples, n_features)

n_clusters : int

Number of clusters found (excluding noise points)

n_noise : int

Number of noise points identified (label -1)

silhouette_score : float or None

Silhouette score for clustering quality assessment

computation_time : float

Time taken for clustering computation in seconds

cluster_probabilities : list or None

Cluster membership probabilities for each point

outlier_scores : list or None

Outlier scores for each point

cache_path : str

Path used for caching results
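To make the n_clusters and n_noise fields concrete, the sketch below derives the same two counts from a plain label sequence in which -1 marks noise. The helper `summarize_labels` and the example labels are hypothetical and are not part of the module.

```python
from collections import Counter


def summarize_labels(labels):
    """Hypothetical helper: count clusters and noise points from cluster labels."""
    counts = Counter(int(label) for label in labels)
    # Label -1 marks noise, so it is removed before counting clusters.
    n_noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "cluster_sizes": dict(counts),
    }


summary = summarize_labels([0, 0, 1, -1, 1, 2, -1, 0])
```

The same function accepts a NumPy label array, since each label is converted with int() before counting.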

classmethod get_type_name() -> str

Return unique string identifier for HDBSCAN cluster type.

Returns

str

The string ‘hdbscan’

init_calculator(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False) -> None

Initialize the HDBSCAN calculator.

Parameters

cache_path : str, optional

Directory path for cache files. Default is ‘./cache’.

max_memory_gb : float, optional

Maximum memory threshold in GB. Default is 2.0.

chunk_size : int, optional

Chunk size for processing large datasets. Default is 1000.

use_memmap : bool, optional

Whether to use memory mapping for large datasets. Default is False.

compute(data: ndarray, center_method: str = 'centroid') -> Tuple[ndarray, Dict[str, Any]]

Compute HDBSCAN clustering.

Parameters

data : numpy.ndarray

Input data matrix to cluster, shape (n_samples, n_features)

center_method : str, optional

Method for calculating cluster centers. Default is “centroid”.

Returns

Tuple[numpy.ndarray, Dict]

Tuple containing:

  • cluster_labels: Cluster labels for each sample (-1 for noise)

  • metadata: Dictionary with clustering information

Raises

ValueError

If the calculator has not been initialized (call init_calculator() before compute()).