DPA Calculator


DPA calculator implementation.

This module provides the DPACalculator class, which performs the actual DPA clustering computation using the DPA package from the conda environment.

class mdxplain.clustering.cluster_type.dpa.dpa_calculator.DPACalculator(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False, max_blas_threads: int | None = 1, auto_limit_blas: bool = True)

Calculator for DPA clustering.

This class implements the actual DPA clustering computation using the DPA package and computes clustering quality metrics.

Examples

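>>> import numpy as np
>>> from mdxplain.clustering.cluster_type.dpa.dpa_calculator import DPACalculator
>>> # Create calculator and compute clustering
>>> calc = DPACalculator()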
>>> data = np.random.rand(100, 10)
>>> labels, metadata = calc.compute(data, Z=2.0, affinity='euclidean',
...                                nn_distances=10, density_algo='knn',
...                                k_max=20, block_ratio=0.1, blockAn=False,
...                                frac=1.0, halos=False)
>>> print(f"Found {metadata['n_clusters']} clusters")

__init__(cache_path: str = './cache', max_memory_gb: float = 2.0, chunk_size: int = 1000, use_memmap: bool = False, max_blas_threads: int | None = 1, auto_limit_blas: bool = True) -> None

Initialize DPA calculator.

Parameters

cache_path : str, optional

Path for cache files. Default is ‘./cache’.

max_memory_gb : float, optional

Maximum memory threshold in GB. Default is 2.0.

chunk_size : int, optional

Chunk size for processing large datasets in sampling methods. Used for chunked k-NN prediction. Default is 1000.

use_memmap : bool, optional

Whether to use memory mapping for large datasets. Default is False.

max_blas_threads : int or None, default=1

Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default; set auto_limit_blas=False to disable thread limiting altogether.

auto_limit_blas : bool, default=True

Apply a safe thread policy: limit BLAS to 1 thread when n_jobs != 1; otherwise use max_blas_threads (falling back to 2 when it is None).
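
The thread policy described for max_blas_threads and auto_limit_blas can be sketched as a small pure function. This is only an illustration of the documented rules: resolve_blas_threads is a hypothetical helper, not part of mdxplain, and returning None to mean "no limit" is an assumption made for this sketch.

```python
from typing import Optional


def resolve_blas_threads(
    n_jobs: int,
    max_blas_threads: Optional[int] = 1,
    auto_limit_blas: bool = True,
) -> Optional[int]:
    """Hypothetical helper: return the BLAS thread limit, or None for no limit."""
    if not auto_limit_blas:
        # Thread limiting is disabled entirely.
        return None
    if n_jobs != 1:
        # Parallel jobs are requested: keep BLAS at a single thread.
        return 1
    if max_blas_threads is None:
        # Documented fallback when no preference is given.
        return 2
    return max_blas_threads
```

A limit resolved this way could then be applied with a tool such as threadpoolctl's threadpool_limits; whether DPACalculator does so internally is not stated here.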

Returns

None

compute(data: ndarray, center_method: str = 'centroid', **kwargs) -> Tuple[ndarray, Dict[str, Any]]

Compute DPA clustering.

Parameters

data : numpy.ndarray

Input data matrix to cluster, shape (n_samples, n_features)

center_method : str, optional

Method for calculating cluster centers, default=“centroid”:

  • “centroid”: Representative point (medoid - closest to mean)

  • “mean”: Average of cluster members

  • “median”: Coordinate-wise median (robust to outliers)

  • “density_peak”: Point with highest local density

  • “median_centroid”: Medoid from median (more robust to outliers)

  • “rmsd_centroid”: Centroid using RMSD metric (better for structural comparisons)
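
To make the difference between some of these options concrete, the following is a minimal NumPy sketch of the “mean”, “median”, “centroid”, and “median_centroid” variants for a single cluster. The function cluster_center is a hypothetical helper written for illustration and assumes Euclidean distance; it is not the mdxplain implementation, and “density_peak” and “rmsd_centroid” are left out because they need density estimates and structural data.

```python
import numpy as np


def cluster_center(points: np.ndarray, method: str = "centroid") -> np.ndarray:
    """Hypothetical helper: compute one center for a single cluster's members."""
    if method == "mean":
        return points.mean(axis=0)
    if method == "median":
        return np.median(points, axis=0)
    if method in ("centroid", "median_centroid"):
        # Medoid: the actual member closest to the mean (or to the median).
        if method == "centroid":
            target = points.mean(axis=0)
        else:
            target = np.median(points, axis=0)
        distances = np.linalg.norm(points - target, axis=1)
        return points[np.argmin(distances)]
    raise ValueError(f"Unsupported method in this sketch: {method!r}")


# Four members on a line plus one outlier at x = 14.
members = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [14.0, 0.0]])
mean_center = cluster_center(members, "mean")                # [4, 0], not a member
centroid = cluster_center(members, "centroid")               # [3, 0], member nearest the mean
median_centroid = cluster_center(members, "median_centroid") # [2, 0], member nearest the median
```

The outlier pulls the mean, and with it the “centroid” medoid, toward larger x, while the median-based medoid stays in the middle of the bulk of the cluster.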

kwargs : dict

DPA parameters, listed below. For more information, see the DPA init docstring, https://github.com/mariaderrico/DPA, and https://github.com/mariaderrico/DPA/blob/master/DPA_analysis.ipynb

  • Z : float, density threshold parameter

  • affinity : str, affinity metric for distance calculation

  • nn_distances : int, number of nearest neighbors

  • density_algo : str, algorithm for density computation

  • k_max : int, maximum number of nearest neighbors considered by the density estimator

  • block_ratio : float, block ratio parameter

  • blockAn : bool, whether to use block analysis

  • frac : float, fraction parameter for sampling

  • halos : bool, whether to return halo points assigned to cluster 0

  • method : str, clustering method (‘standard’, ‘knn_sampling’)

  • sample_fraction : float, fraction of data to sample

  • force : bool, override memory and dimensionality checks

  • n_jobs : int, number of parallel jobs (-1 uses all processors)
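
The ‘knn_sampling’ method, sample_fraction, and the chunk_size constructor argument suggest a workflow in which only a sample is clustered and the remaining points receive labels through chunked nearest-neighbor prediction. The sketch below illustrates that idea under stated assumptions: propagate_labels_chunked is a hypothetical helper, it uses a single nearest neighbor with Euclidean distance, and the sample labels are synthetic stand-ins rather than DPA output. The actual mdxplain procedure may differ.

```python
import numpy as np


def propagate_labels_chunked(data, sample_idx, sample_labels, chunk_size=1000):
    """Hypothetical helper: give every row the label of its nearest sampled row."""
    sample = data[sample_idx]
    labels = np.empty(len(data), dtype=sample_labels.dtype)
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Squared Euclidean distances, shape (len(chunk), len(sample)).
        # Working chunk by chunk bounds the size of this temporary array.
        d2 = ((chunk[:, None, :] - sample[None, :, :]) ** 2).sum(axis=2)
        labels[start:start + chunk_size] = sample_labels[d2.argmin(axis=1)]
    return labels


rng = np.random.default_rng(0)
# Two well-separated blobs stand in for real feature data.
data = np.vstack([rng.normal(0.0, 0.1, (500, 3)), rng.normal(5.0, 0.1, (500, 3))])
# Draw 10% of the rows, mirroring sample_fraction=0.1.
sample_idx = rng.choice(len(data), size=100, replace=False)
# Stand-in for the labels that clustering the sample would produce.
sample_labels = (sample_idx >= 500).astype(int)
labels = propagate_labels_chunked(data, sample_idx, sample_labels, chunk_size=128)
```

A smaller chunk_size lowers peak memory at the cost of more loop iterations, since the temporary distance array has chunk_size × sample-size entries.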

Returns

Tuple[numpy.ndarray, Dict]

Tuple containing:

  • cluster_labels: Cluster labels for each sample

  • metadata: Dictionary with clustering information

Raises

ValueError

If input data is invalid or required parameters are missing

ImportError

If DPA package is not available