DBSCAN Module

DBSCAN cluster type implementation.

This module provides the DBSCAN cluster type that implements density-based clustering for molecular dynamics trajectory analysis.

class mdxplain.clustering.cluster_type.dbscan.dbscan.DBSCAN(eps: float = 0.5, min_samples: int = 5, method: str = 'standard', sample_fraction: float = 0.1, force: bool = False, knn_neighbors: int = 5, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) cluster type.

DBSCAN groups together points that are closely packed while marking points in low-density regions as outliers. It’s particularly useful for identifying conformational states in molecular dynamics trajectories.

Uses sklearn’s DBSCAN and NearestNeighbors under the hood.
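
To make the density criterion concrete, here is a minimal pure-Python sketch of the textbook DBSCAN procedure. It is an illustration only, not the mdxplain or scikit-learn implementation: the names `region_query` and `dbscan_sketch` are hypothetical, the neighbor search is brute force, and a point's neighborhood is assumed to include the point itself (the scikit-learn convention for min_samples).

```python
from math import dist

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan_sketch(points, eps=0.5, min_samples=5):
    """Hypothetical brute-force DBSCAN; returns one label per point."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Two tight groups of four points each, plus one isolated point
points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
          (10, 10)]
labels = dbscan_sketch(points, eps=0.3, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point has too few neighbors to be a core point and is not reachable from any core point, so it keeps the noise label -1.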

Examples

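>>> import numpy as np
>>> from mdxplain.clustering.cluster_type.dbscan.dbscan import DBSCAN
>>> # Create DBSCAN with default parameters
>>> dbscan = DBSCAN()
>>> # Create DBSCAN with custom parameters
>>> dbscan = DBSCAN(eps=0.3, min_samples=10)
>>> # Placeholder feature matrix of shape (n_samples, n_features)
>>> data = np.random.rand(1000, 10)
>>> # Initialize and compute clustering
>>> dbscan.init_calculator()
>>> labels, metadata = dbscan.compute(data)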
__init__(eps: float = 0.5, min_samples: int = 5, method: str = 'standard', sample_fraction: float = 0.1, force: bool = False, knn_neighbors: int = 5, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True) -> None

Initialize DBSCAN cluster type.

Parameters

eps : float, optional

Maximum distance between two samples for them to count as neighbors. Default is 0.5.

min_samples : int, optional

Minimum number of samples in a point's neighborhood for it to be a core point. Default is 5.

method : str, default=“standard”

Clustering method:

  • “standard”: Load all data into memory (default)

  • “sampling_approximate”: Sample data + approximate_predict for large datasets

  • “precomputed”: Use precomputed distance matrix (data must be square)
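
For the “precomputed” method, the input has to be a square matrix of pairwise distances rather than a feature matrix. The following sketch shows what such a matrix looks like and how squareness can be checked. The helper names `pairwise_distances_sketch` and `is_square` are hypothetical, Euclidean distance is an assumption (for trajectories, a pairwise RMSD matrix would be a typical choice), and in practice the result would be a NumPy array rather than nested lists.

```python
from math import dist


def pairwise_distances_sketch(points):
    """Build a symmetric n x n Euclidean distance matrix as nested lists."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(points[i], points[j])
            matrix[i][j] = d  # distance is symmetric, so fill both halves
            matrix[j][i] = d
    return matrix


def is_square(matrix):
    """True if every row has as many entries as there are rows."""
    return all(len(row) == len(matrix) for row in matrix)


distance_matrix = pairwise_distances_sketch([(0, 0), (3, 4), (6, 8)])
print(is_square(distance_matrix))
```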

sample_fraction : float, default=0.1

Fraction of the data to sample for sampling-based methods (the default 0.1 corresponds to 10%). Final sample size: max(50000, min(100000, sample_fraction * n_samples)).
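
The sample-size formula clamps the requested fraction to the range 50000 to 100000. A small sketch of that arithmetic follows; the function name `sample_size_sketch` is hypothetical. Whether the implementation additionally caps the result at n_samples for small datasets is not stated here, so the sketch applies the formula literally.

```python
def sample_size_sketch(n_samples, sample_fraction=0.1):
    """Apply max(50000, min(100000, sample_fraction * n_samples)) literally."""
    return int(max(50_000, min(100_000, sample_fraction * n_samples)))


# 10% of 200000 frames is below the lower bound, so the lower bound is used
small = sample_size_sketch(200_000)
# 50% of 150000 frames falls inside the bounds and is used unchanged
medium = sample_size_sketch(150_000, sample_fraction=0.5)
# 10% of 5000000 frames exceeds the upper bound, so the upper bound is used
large = sample_size_sketch(5_000_000)
print(small, medium, large)  # 50000 75000 100000
```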

force : bool, default=False

Override memory and dimensionality checks, downgrading their errors to warnings.

knn_neighbors : int, default=5

Number of neighbors for k-NN sampling method.

n_jobs : int, default=-1

Number of parallel jobs for distance computations. -1 means using all processors.

max_blas_threads : int or None, default=1

Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default. To disable thread limiting altogether, set auto_limit_blas=False.

auto_limit_blas : bool, default=True

Apply a safe thread policy: use BLAS=1 when n_jobs != 1, otherwise use max_blas_threads (fallback 2 when None)
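
The interaction between n_jobs, max_blas_threads and auto_limit_blas can be read as a small decision rule. The sketch below encodes that reading; the function name `resolve_blas_threads_sketch` is hypothetical, and returning None to mean "no limit applied" when auto_limit_blas is False is an assumption based on the parameter descriptions, not the library's code.

```python
from typing import Optional


def resolve_blas_threads_sketch(
    n_jobs: int,
    max_blas_threads: Optional[int] = 1,
    auto_limit_blas: bool = True,
) -> Optional[int]:
    """Return the BLAS thread limit to apply, or None for no limit."""
    if not auto_limit_blas:
        return None  # assumption: thread limiting is disabled entirely
    if n_jobs != 1:
        return 1  # parallel jobs: keep BLAS single-threaded per job
    if max_blas_threads is None:
        return 2  # documented fallback when no preference is given
    return max_blas_threads


print(resolve_blas_threads_sketch(n_jobs=-1))                        # 1
print(resolve_blas_threads_sketch(n_jobs=1, max_blas_threads=4))     # 4
print(resolve_blas_threads_sketch(n_jobs=1, max_blas_threads=None))  # 2
```

Pinning BLAS to one thread while n_jobs runs several workers is a common way to avoid oversubscribing CPU cores.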

Returned Metadata

algorithm : str

Always “dbscan”

hyperparameters : dict

Dictionary containing all DBSCAN parameters used

original_shape : tuple

Shape of the input data (n_samples, n_features)

n_clusters : int

Number of clusters found (excluding noise points)

n_noise : int

Number of noise points identified (label -1)

silhouette_score : float or None

Silhouette score for clustering quality assessment

computation_time : float

Time taken for clustering computation in seconds

core_sample_indices : list

List of core sample indices

cache_path : str

Path used for caching results
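
The n_clusters and n_noise entries follow directly from the label array: noise carries the label -1 and is excluded from the cluster count. A minimal sketch of that bookkeeping, using a hypothetical helper `summarize_labels_sketch` and a hand-written label list in place of real output:

```python
def summarize_labels_sketch(labels):
    """Count clusters (excluding noise) and noise points (label -1)."""
    n_clusters = len(set(labels) - {-1})
    n_noise = sum(1 for label in labels if label == -1)
    return {"n_clusters": n_clusters, "n_noise": n_noise}


# Hand-written example labels: three clusters and two noise points
summary = summarize_labels_sketch([0, 0, 1, -1, 1, -1, 2])
print(summary)  # {'n_clusters': 3, 'n_noise': 2}
```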

classmethod get_type_name() -> str

Return unique string identifier for DBSCAN cluster type.

Returns

str

The string ‘dbscan’

init_calculator(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False) -> None

Initialize the DBSCAN calculator.

Parameters

cache_path : str, optional

Directory path for cache files. Default is ‘./cache’.

max_memory_gb : float, optional

Maximum memory threshold in GB. Default is 2.0.

chunk_size : int, optional

Chunk size for processing large datasets. Default is 1000.

use_memmap : bool, optional

Whether to use memory mapping for large datasets. Default is False.

Returns

None

compute(data: ndarray, center_method: str = 'centroid') -> Tuple[ndarray, Dict[str, Any]]

Compute DBSCAN clustering.

Parameters

data : numpy.ndarray

Input data matrix to cluster, shape (n_samples, n_features)

center_method : str, optional

Method for calculating cluster centers. Default is “centroid”.

Returns

Tuple[numpy.ndarray, Dict]

Tuple containing:

  • cluster_labels: Cluster labels for each sample (-1 for noise)

  • metadata: Dictionary with clustering information
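
A common follow-up step is to collect the sample indices that belong to each cluster. The sketch below does this for a label sequence like the one returned as cluster_labels. The helper name `frames_by_cluster_sketch` is hypothetical, the label values are hand-written stand-ins, and treating each row of the input as one trajectory frame is an assumption.

```python
from collections import defaultdict


def frames_by_cluster_sketch(cluster_labels):
    """Map each label (including -1 for noise) to its sample indices."""
    groups = defaultdict(list)
    for frame_index, label in enumerate(cluster_labels):
        groups[int(label)].append(frame_index)  # int() also accepts NumPy integers
    return dict(groups)


# Hand-written stand-in for the cluster_labels array
groups = frames_by_cluster_sketch([0, 0, -1, 1, 0, 1])
print(groups)  # {0: [0, 1, 4], -1: [2], 1: [3, 5]}
```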

Raises

ValueError

If calculator is not initialized