HDBSCAN Calculator

HDBSCAN calculator implementation.

This module provides the HDBSCANCalculator class that performs the actual HDBSCAN clustering computation using scikit-learn.

class mdxplain.clustering.cluster_type.hdbscan.hdbscan_calculator.HDBSCANCalculator(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False, max_blas_threads: int | None = 1, auto_limit_blas: bool = True)

Calculator for HDBSCAN clustering.

This class runs the HDBSCAN clustering computation through scikit-learn’s HDBSCAN implementation and computes clustering quality metrics.

Examples

>>> import numpy as np
>>> from mdxplain.clustering.cluster_type.hdbscan.hdbscan_calculator import HDBSCANCalculator
>>> # Create calculator and compute clustering
>>> calc = HDBSCANCalculator()
>>> data = np.random.rand(100, 10)
>>> labels, metadata = calc.compute(data, min_cluster_size=5, min_samples=5)
>>> print(f"Found {metadata['n_clusters']} clusters")

__init__(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False, max_blas_threads: int | None = 1, auto_limit_blas: bool = True) -> None

Initialize HDBSCAN calculator.

Parameters

cache_path : str, optional

Path for cache files. Default is ‘./cache’.

max_memory_gb : float, optional

Maximum memory threshold in GB. Default is 2.0.

chunk_size : int, optional

Chunk size for processing large datasets in sampling methods. Used for chunked k-NN prediction and approximate_predict. Default is 1000.

use_memmap : bool, optional

Whether to use memory mapping for large datasets. Default is False.

max_blas_threads : int or None, default=1

Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default, or set auto_limit_blas=False to disable thread limiting.

auto_limit_blas : bool, default=True

Whether to apply a safe thread policy: limit BLAS to 1 thread when n_jobs != 1; otherwise use max_blas_threads, falling back to 2 when it is None.

Returns

None
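
The thread policy described for max_blas_threads and auto_limit_blas can be read as a small decision rule. The sketch below is one interpretation of that description, not the library's code: the helper name resolve_blas_threads is hypothetical, and returning None to mean "no limit applied" is an assumption.

```python
from typing import Optional


def resolve_blas_threads(
    n_jobs: int,
    max_blas_threads: Optional[int] = 1,
    auto_limit_blas: bool = True,
) -> Optional[int]:
    """Hypothetical helper mirroring the documented thread policy.

    Returns the BLAS thread limit to apply, or None for "no limit"
    (the None convention is an assumption made for this sketch).
    """
    if not auto_limit_blas:
        # Thread limiting is disabled entirely.
        return None
    if n_jobs != 1:
        # Several parallel jobs: keep BLAS single-threaded.
        return 1
    # Single job: use the preferred limit, falling back to 2 when unset.
    return 2 if max_blas_threads is None else max_blas_threads


print(resolve_blas_threads(n_jobs=-1))
print(resolve_blas_threads(n_jobs=1, max_blas_threads=None))
print(resolve_blas_threads(n_jobs=1, max_blas_threads=4))
```

Under this reading, combining n_jobs=-1 with the default settings keeps each worker's BLAS calls single-threaded, which avoids oversubscribing the CPU.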

compute(data: ndarray, center_method: str = 'centroid', **kwargs) -> Tuple[ndarray, Dict[str, Any]]

Compute HDBSCAN clustering.

Parameters

data : numpy.ndarray

Input data matrix to cluster, shape (n_samples, n_features)

center_method : str, optional

Method for calculating cluster centers. Default is “centroid”. Options:

  • “centroid”: Representative point (medoid - closest to mean)

  • “mean”: Average of cluster members

  • “median”: Coordinate-wise median (robust to outliers)

  • “density_peak”: Point with highest local density

  • “median_centroid”: Medoid from median (more robust to outliers)

  • “rmsd_centroid”: Centroid using RMSD metric (better for structural comparisons)
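
To make the difference between some of these options concrete, the following NumPy sketch computes four of them for a single cluster. It is an illustration of the definitions listed above, assuming Euclidean distance, and is not the library's implementation; the function name cluster_centers is hypothetical, and “density_peak” and “rmsd_centroid” are left out.

```python
import numpy as np


def cluster_centers(points: np.ndarray) -> dict:
    """Illustrative centers for one cluster, shape (n_members, n_features)."""
    mean = points.mean(axis=0)
    median = np.median(points, axis=0)
    # Medoid-style representatives: the actual member closest to a reference point.
    centroid = points[np.argmin(np.linalg.norm(points - mean, axis=1))]
    median_centroid = points[np.argmin(np.linalg.norm(points - median, axis=1))]
    return {
        "mean": mean,
        "median": median,
        "centroid": centroid,
        "median_centroid": median_centroid,
    }


# Four tightly grouped members plus one outlier at (10, 10).
members = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]])
centers = cluster_centers(members)
for name, value in centers.items():
    print(name, value)
```

The outlier pulls the mean away from the group, while the median and both medoid-style centers stay at an actual or typical member position.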

kwargs : dict

HDBSCAN parameters including:

  • min_cluster_size : int, minimum size of clusters

  • min_samples : int, minimum samples in neighborhood

  • cluster_selection_epsilon : float, distance threshold

  • cluster_selection_method : str, cluster selection method

  • method : str, clustering method (‘standard’, ‘sampling_approximate’, ‘sampling_knn’)

  • sample_fraction : float, fraction of data to sample

  • n_jobs : int, number of parallel jobs (-1 uses all processors)

  • force : bool, override memory and dimensionality checks
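
The sampling-related options (method, sample_fraction, and the constructor's chunk_size) suggest a general pattern: cluster a subset of the data, then assign labels to the remaining points in chunks. The sketch below illustrates that pattern with a plain 1-nearest-neighbor assignment in NumPy. It is an assumption about the general idea, not the library's code: propagate_labels_chunked is a hypothetical helper, and the sample labels come from a simple threshold that stands in for an HDBSCAN run on the sample.

```python
import numpy as np


def propagate_labels_chunked(sample, sample_labels, rest, chunk_size=1000):
    """Give each row of `rest` the label of its nearest row in `sample`."""
    out = np.empty(len(rest), dtype=sample_labels.dtype)
    for start in range(0, len(rest), chunk_size):
        chunk = rest[start:start + chunk_size]
        # Pairwise Euclidean distances, shape (len(chunk), len(sample)).
        dists = np.linalg.norm(chunk[:, None, :] - sample[None, :, :], axis=2)
        out[start:start + chunk_size] = sample_labels[dists.argmin(axis=1)]
    return out


rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each.
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])

sample_fraction = 0.2
step = round(1 / sample_fraction)
sample_idx = np.arange(0, len(data), step)            # every 5th point
rest_idx = np.setdiff1d(np.arange(len(data)), sample_idx)

# Stand-in for clustering the sample (assumption: not an HDBSCAN result).
sample_labels = (data[sample_idx, 0] > 2.5).astype(int)

rest_labels = propagate_labels_chunked(
    data[sample_idx], sample_labels, data[rest_idx], chunk_size=16
)
print(len(sample_idx), len(rest_idx), np.bincount(rest_labels))
```

Processing the remaining points in chunks bounds the size of the distance matrix held in memory at any one time, which is the trade-off a chunk size controls.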

Returns

Tuple[numpy.ndarray, Dict]

Tuple containing:

  • cluster_labels: Cluster labels for each sample (-1 for noise)

  • metadata: Dictionary with clustering information
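
Because noise points are labeled -1, they usually need to be excluded before counting clusters or computing cluster sizes. The short sketch below shows this on a hand-written label array; the labels are made up for illustration and are not output from the calculator.

```python
import numpy as np

# Hand-written labels for illustration: two clusters (0 and 1) plus noise (-1).
labels = np.array([0, 0, 1, -1, 1, 1, -1, 0])

clustered = labels[labels != -1]                      # drop noise points
n_clusters = np.unique(clustered).size
n_noise = int(np.sum(labels == -1))
ids, counts = np.unique(clustered, return_counts=True)
sizes = {int(k): int(v) for k, v in zip(ids, counts)}

print(n_clusters, n_noise, sizes)  # 2 2 {0: 3, 1: 3}
```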

Raises

ValueError

If input data is invalid or required parameters are missing