DBSCAN Module
DBSCAN cluster type implementation.
This module provides the DBSCAN cluster type that implements density-based clustering for molecular dynamics trajectory analysis.
- class mdxplain.clustering.cluster_type.dbscan.dbscan.DBSCAN(eps: float = 0.5, min_samples: int = 5, method: str = 'standard', sample_fraction: float = 0.1, force: bool = False, knn_neighbors: int = 5, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) cluster type.
DBSCAN groups together points that are closely packed while marking points in low-density regions as outliers. It’s particularly useful for identifying conformational states in molecular dynamics trajectories.
Uses sklearn’s DBSCAN and NearestNeighbors under the hood.
Examples
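>>> from mdxplain.clustering.cluster_type.dbscan.dbscan import DBSCAN

>>> # Create DBSCAN with default parameters
>>> dbscan = DBSCAN()

>>> # Create DBSCAN with custom parameters
>>> dbscan = DBSCAN(eps=0.3, min_samples=10)

>>> # Initialize and compute clustering
>>> # (data is a numpy.ndarray of shape (n_samples, n_features))
>>> dbscan.init_calculator()
>>> labels, metadata = dbscan.compute(data)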
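Because the class is described as a wrapper around scikit-learn, the following minimal sketch shows the underlying scikit-learn call on a toy 2D dataset. It uses sklearn.cluster.DBSCAN directly, not the mdxplain wrapper, and the points and parameter values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN as SklearnDBSCAN

# Toy data (illustrative only): two tight groups and one isolated point.
points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
    [10.0, 10.0],
])

# eps is the neighborhood radius; min_samples counts the point itself.
model = SklearnDBSCAN(eps=0.3, min_samples=3).fit(points)
labels = model.labels_

# Noise points carry the label -1 and are not counted as a cluster.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))
print(labels.tolist(), n_clusters, n_noise)
```

The isolated point has no neighbors within eps, so it is labelled as noise, while each tight group forms its own cluster.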
- __init__(eps: float = 0.5, min_samples: int = 5, method: str = 'standard', sample_fraction: float = 0.1, force: bool = False, knn_neighbors: int = 5, n_jobs: int = -1, max_blas_threads: int | None = 1, auto_limit_blas: bool = True) → None
Initialize DBSCAN cluster type.
Parameters
- eps : float, optional
Maximum distance between two samples for one to be considered in the neighborhood of the other. Default is 0.5.
- min_samples : int, optional
Minimum number of samples in a neighborhood for a point to count as a core point. Default is 5.
- method : str, default="standard"
Clustering method:
"standard": Load all data into memory (default)
"sampling_approximate": Sample the data and use approximate_predict (for large datasets)
"precomputed": Use a precomputed distance matrix (data must be square)
- sample_fraction : float, default=0.1
Fraction of the data to sample for sampling-based methods (0.1 means 10%). Final sample size: max(50000, min(100000, sample_fraction * n_samples)).
- force : bool, default=False
Override memory and dimensionality checks (converts errors to warnings).
- knn_neighbors : int, default=5
Number of neighbors for the k-NN sampling method.
- n_jobs : int, default=-1
Number of parallel jobs for distance computations. -1 means using all processors.
- max_blas_threads : int or None, default=1
Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default; set auto_limit_blas=False to disable thread limiting altogether.
- auto_limit_blas : bool, default=True
Apply a safe thread policy: limit BLAS to 1 thread when n_jobs != 1; otherwise use max_blas_threads (falling back to 2 when it is None).
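The sample-size formula and the BLAS thread policy described above can be written out as plain functions. This is a sketch of the documented rules only, not the library's code: the helper names final_sample_size and effective_blas_threads are hypothetical, and returning None to mean "no limit applied" is an assumption made for the example.

```python
from typing import Optional


def final_sample_size(n_samples: int, sample_fraction: float = 0.1) -> int:
    """Sample size per the documented formula (hypothetical helper name)."""
    return int(max(50_000, min(100_000, sample_fraction * n_samples)))


def effective_blas_threads(
    n_jobs: int = -1,
    max_blas_threads: Optional[int] = 1,
    auto_limit_blas: bool = True,
) -> Optional[int]:
    """Thread limit per the documented policy (hypothetical helper name).

    Returns None when no limit is applied (an assumption of this sketch).
    """
    if not auto_limit_blas:
        return None  # thread limiting disabled
    if n_jobs != 1:
        return 1  # parallel jobs: one BLAS thread
    # Single job: use the preferred limit, or 2 when none was given.
    return 2 if max_blas_threads is None else max_blas_threads


for n in (200_000, 600_000, 1_000_000):
    print(n, final_sample_size(n))
```

As written, the formula clamps the sample size to the range 50000 to 100000 regardless of sample_fraction; how datasets smaller than 50000 samples are handled is not stated in this reference.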
Returned Metadata
- algorithm : str
Always "dbscan"
- hyperparameters : dict
Dictionary containing all DBSCAN parameters used
- original_shape : tuple
Shape of the input data (n_samples, n_features)
- n_clusters : int
Number of clusters found (excluding noise points)
- n_noise : int
Number of noise points identified (label -1)
- silhouette_score : float or None
Silhouette score for clustering quality assessment
- computation_time : float
Time taken for clustering computation in seconds
- core_sample_indices : list
List of core sample indices
- cache_path : str
Path used for caching results
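To make the n_clusters and n_noise fields concrete, here is a minimal sketch of how such counts follow from a label array in which -1 marks noise. The helper summarize_labels is hypothetical and covers only a small part of the metadata listed above; it is not the library's implementation.

```python
from typing import Any, Dict, Sequence


def summarize_labels(labels: Sequence[int]) -> Dict[str, Any]:
    """Count clusters and noise points from DBSCAN-style labels (hypothetical helper)."""
    # Distinct non-negative labels are clusters; -1 is reserved for noise.
    cluster_ids = {int(label) for label in labels if label != -1}
    return {
        "algorithm": "dbscan",
        "n_clusters": len(cluster_ids),
        "n_noise": sum(1 for label in labels if label == -1),
    }


summary = summarize_labels([0, 0, 1, -1, 1, -1, 2])
print(summary)
```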
- classmethod get_type_name() → str
Return unique string identifier for DBSCAN cluster type.
Returns
- str
The string "dbscan"
- init_calculator(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False) → None
Initialize the DBSCAN calculator.
Parameters
- cache_path : str, optional
Directory path for cache files. Default is './cache'.
- max_memory_gb : float, optional
Maximum memory threshold in GB. Default is 2.0.
- chunk_size : int, optional
Chunk size for processing large datasets. Default is 1000.
- use_memmap : bool, optional
Whether to use memory mapping for large datasets. Default is False.
Returns
None
- compute(data: ndarray, center_method: str = 'centroid') → Tuple[ndarray, Dict[str, Any]]
Compute DBSCAN clustering.
Parameters
- data : numpy.ndarray
Input data matrix to cluster, shape (n_samples, n_features)
- center_method : str, optional
Method for calculating cluster centers. Default is "centroid".
Returns
- Tuple[numpy.ndarray, Dict]
Tuple containing:
cluster_labels: Cluster labels for each sample (-1 for noise)
metadata: Dictionary with clustering information
Raises
- ValueError
If the calculator has not been initialized (call init_calculator() before compute())
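The required call order (init_calculator() before compute()) can be illustrated with a small stand-in class. SketchClusterType is hypothetical and performs no real clustering; it only mirrors the documented contract that compute() raises ValueError when the calculator has not been initialized.

```python
from typing import Any, Dict, List, Optional, Tuple


class SketchClusterType:
    """Hypothetical stand-in illustrating the init-before-compute contract."""

    def __init__(self) -> None:
        self._cache_path: Optional[str] = None  # set by init_calculator()

    def init_calculator(self, cache_path: str = "./cache") -> None:
        self._cache_path = cache_path

    def compute(self, data: List[List[float]]) -> Tuple[List[int], Dict[str, Any]]:
        if self._cache_path is None:
            raise ValueError("Calculator not initialized; call init_calculator() first.")
        # Placeholder result: every sample is labelled as noise (-1).
        labels = [-1] * len(data)
        return labels, {"algorithm": "sketch", "cache_path": self._cache_path}


clusterer = SketchClusterType()
try:
    clusterer.compute([[0.0, 0.0]])
except ValueError as exc:
    print(f"compute() failed as expected: {exc}")

clusterer.init_calculator()
labels, metadata = clusterer.compute([[0.0, 0.0], [1.0, 1.0]])
```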