DPA Calculator
DPA calculator implementation.
This module provides the DPACalculator class, which performs the DPA clustering computation using the DPA package from the conda environment.
- class mdxplain.clustering.cluster_type.dpa.dpa_calculator.DPACalculator(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False, max_blas_threads: int | None = 1, auto_limit_blas: bool = True)
Calculator for DPA clustering.
This class implements the actual DPA clustering computation using the DPA package and computes clustering quality metrics.
Examples
>>> import numpy as np
>>> from mdxplain.clustering.cluster_type.dpa.dpa_calculator import DPACalculator
>>> # Create calculator and compute clustering
>>> calc = DPACalculator()
>>> data = np.random.rand(100, 10)
>>> labels, metadata = calc.compute(data, Z=2.0, affinity='euclidean',
...                                 nn_distances=10, density_algo='knn',
...                                 k_max=20, block_ratio=0.1, blockAn=False,
...                                 frac=1.0, halos=False)
>>> print(f"Found {metadata['n_clusters']} clusters")
- __init__(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False, max_blas_threads: int | None = 1, auto_limit_blas: bool = True) -> None
Initialize DPA calculator.
Parameters
- cache_path : str, optional
Path for cache files. Default is ‘./cache’.
- max_memory_gb : float, optional
Maximum memory threshold in GB. Default is 2.0.
- chunk_size : int, optional
Chunk size for processing large datasets in sampling methods. Used for chunked k-NN prediction. Default is 1000.
- use_memmap : bool, optional
Whether to use memory mapping for large datasets. Default is False.
- max_blas_threads : int or None, default=1
Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default. To disable thread limiting entirely, set auto_limit_blas=False.
- auto_limit_blas : bool, default=True
Whether to apply a safe thread policy: limit BLAS to 1 thread when n_jobs != 1; otherwise use max_blas_threads (falling back to 2 when it is None).
Returns
None
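The thread policy described by max_blas_threads and auto_limit_blas can be read as a small decision rule. The sketch below is one interpretation of the documented behavior, not the library's code: resolve_blas_threads is a hypothetical helper, and returning None to mean "no cap" is an assumption. The cap of 1 for parallel runs presumably avoids oversubscribing the CPU when several workers each call into BLAS.

```python
from typing import Optional


def resolve_blas_threads(
    n_jobs: int,
    max_blas_threads: Optional[int] = 1,
    auto_limit_blas: bool = True,
) -> Optional[int]:
    """Return the BLAS thread cap implied by the documented policy.

    Hypothetical helper, not part of mdxplain. None means "no cap".
    """
    if not auto_limit_blas:
        # Thread limiting is disabled entirely.
        return None
    if n_jobs != 1:
        # Parallel jobs: keep BLAS single-threaded in each worker.
        return 1
    # Single job: honor the preferred limit, falling back to 2 for None.
    return 2 if max_blas_threads is None else max_blas_threads


print(resolve_blas_threads(n_jobs=-1))                        # 1
print(resolve_blas_threads(n_jobs=1, max_blas_threads=4))     # 4
print(resolve_blas_threads(n_jobs=1, max_blas_threads=None))  # 2
```

With the constructor defaults (max_blas_threads=1, auto_limit_blas=True), this reading gives a cap of 1 regardless of n_jobs.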
- compute(data: ndarray, center_method: str = 'centroid', **kwargs) -> Tuple[ndarray, Dict[str, Any]]
Compute DPA clustering.
Parameters
- data : numpy.ndarray
Input data matrix to cluster, shape (n_samples, n_features)
- center_method : str, optional
Method for calculating cluster centers. Default is “centroid”. Options:
“centroid”: Representative point (medoid, the member closest to the mean)
“mean”: Average of cluster members
“median”: Coordinate-wise median (robust to outliers)
“density_peak”: Point with highest local density
“median_centroid”: Medoid from median (more robust to outliers)
“rmsd_centroid”: Centroid using RMSD metric (better for structural comparisons)
- kwargs : dict
DPA parameters, listed below. For more information, see the DPA init docstring, https://github.com/mariaderrico/DPA, and https://github.com/mariaderrico/DPA/blob/master/DPA_analysis.ipynb.
Z : float, density threshold parameter
affinity : str, affinity metric for distance calculation
nn_distances : int, number of nearest neighbors
density_algo : str, algorithm for density computation
k_max : int, maximum number of nearest neighbors considered by the density estimator
block_ratio : float, block ratio parameter
blockAn : bool, whether to use block analysis
frac : float, fraction parameter for sampling
halos : bool, whether to return halo points assigned to cluster 0
method : str, clustering method (‘standard’, ‘knn_sampling’)
sample_fraction : float, fraction of data to sample
force : bool, override memory and dimensionality checks
n_jobs : int, number of parallel jobs (-1 uses all processors)
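To make the difference between some of the center_method options concrete, here is a minimal sketch on a toy cluster with one outlier. It covers only “mean”, “median”, “centroid”, and “median_centroid” as described above. The helper names are hypothetical, plain Python lists are used so the sketch has no dependencies, and the library's own implementation may differ.

```python
import math
import statistics

# Toy 2-D cluster with one outlier (invented values for illustration).
members = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 3.0), (14.0, 5.0)]


def coordinate_wise(points, reducer):
    """Apply reducer (e.g. mean or median) to each coordinate separately."""
    return tuple(reducer(axis) for axis in zip(*points))


def closest_member(points, target):
    """Return the cluster member nearest to target (a medoid-style pick)."""
    return min(points, key=lambda p: math.dist(p, target))


mean_center = coordinate_wise(members, statistics.fmean)     # "mean"
median_center = coordinate_wise(members, statistics.median)  # "median"
centroid = closest_member(members, mean_center)              # "centroid"
median_centroid = closest_member(members, median_center)     # "median_centroid"

print(mean_center)      # (4.0, 2.0)
print(median_center)    # (2.0, 1.0)
print(centroid)         # (3.0, 3.0)
print(median_centroid)  # (2.0, 1.0)
```

The outlier pulls the mean toward it, so the “centroid” pick lands on (3.0, 3.0), while the median-based pick stays at (2.0, 1.0). Unlike “mean”, both medoid-style options always return an actual cluster member.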
Returns
- Tuple[numpy.ndarray, Dict]
Tuple containing:
cluster_labels: Cluster labels for each sample
metadata: Dictionary with clustering information
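As a sketch of how the returned labels might be inspected, the snippet below counts cluster sizes. The label values are invented stand-ins, not output from a real run. For a NumPy array, converting with cluster_labels.tolist() first keeps the keys as plain integers.

```python
from collections import Counter

# Stand-in for the cluster_labels returned by compute();
# these values are invented for illustration, not real output.
cluster_labels = [1, 1, 2, 0, 2, 2, 1, 3, 0, 2]

# Count how many samples fall into each label.
sizes = Counter(cluster_labels)
for label, count in sorted(sizes.items()):
    print(f"cluster {label}: {count} samples")

# Label and size of the most populated cluster.
largest_label, largest_size = sizes.most_common(1)[0]
print(largest_label, largest_size)  # 2 4
```

If halos is enabled, the parameter description above suggests label 0 holds the halo points, in which case it may be worth excluding it from such a summary.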
Raises
- ValueError
If input data is invalid or required parameters are missing
- ImportError
If DPA package is not available