Clustering Helper
Clustering helper functions.
Center Calculation Helper
Helper functions for calculating cluster centers.
Provides cluster center calculation from data and labels using various methods (centroid, mean, median, density_peak, median_centroid, rmsd_centroid).
- class mdxplain.clustering.helper.center_calculation_helper.CenterCalculationHelper
Helper for calculating cluster centers.
Calculates cluster centers from data and labels using various methods. Automatically handles memory-efficient chunk-wise processing for large datasets.
Available calculation methods:
centroid: Representative point (medoid - closest to mean)
mean: Average of all cluster members
median: Coordinate-wise median (robust to outliers)
density_peak: Point with highest local density
median_centroid: Medoid from median (robust centroid)
rmsd_centroid: Centroid using RMSD metric (structural)
- VALID_METHODS = ['centroid', 'mean', 'median', 'density_peak', 'median_centroid', 'rmsd_centroid']
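To make the difference between these center definitions concrete, here is a minimal NumPy sketch for a single cluster. It is an illustration only, not the library's implementation: the function name `sketch_centers` is hypothetical, and Euclidean distance is assumed for the medoid variants. `density_peak` and `rmsd_centroid` are left out.

```python
import numpy as np

def sketch_centers(points: np.ndarray) -> dict:
    """Illustrative center definitions for one cluster (hypothetical helper)."""
    # mean: average of all cluster members
    mean = points.mean(axis=0)
    # median: coordinate-wise median, less sensitive to outliers
    median = np.median(points, axis=0)
    # centroid (medoid): the actual member closest to the mean
    # (Euclidean distance is an assumption of this sketch)
    centroid = points[np.argmin(np.linalg.norm(points - mean, axis=1))]
    # median_centroid: the actual member closest to the median
    median_centroid = points[np.argmin(np.linalg.norm(points - median, axis=1))]
    return {
        "mean": mean,
        "median": median,
        "centroid": centroid,
        "median_centroid": median_centroid,
    }

# One outlier at x = 14 pulls the mean away from the bulk of the points.
cluster = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [14.0, 0.0]])
centers = sketch_centers(cluster)
```

In this toy cluster the mean is drawn toward the outlier, while the median and the median-based medoid stay with the bulk of the points. The two medoid variants always return a point that actually exists in the data.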
- static calculate_centers(data: ndarray, labels: ndarray, center_method: str = 'centroid', chunk_size: int = 1000, use_memmap: bool = False, max_memory_gb: float = 2.0, n_jobs: int = -1) -> ndarray | None
Calculate cluster centers from data and labels.
Parameters
- data : numpy.ndarray
Original data used for clustering, shape (n_samples, n_features)
- labels : numpy.ndarray
Cluster labels, shape (n_samples,)
- center_method : str, default="centroid"
Method for calculating centers:
"centroid": Representative point (medoid - closest to mean)
"mean": Average of all points
"median": Coordinate-wise median (robust to outliers)
"density_peak": Point with highest local density
"median_centroid": Medoid from median (more robust than centroid)
"rmsd_centroid": Centroid using RMSD metric (structural comparisons)
- chunk_size : int, default=1000
Chunk size for memory-safe processing
- use_memmap : bool, default=False
Whether to use chunk-wise processing (memory-safe for large data)
- max_memory_gb : float, default=2.0
Memory threshold for density_peak sampling (other methods ignore this)
- n_jobs : int, default=-1
Number of parallel jobs for density_peak method (other methods ignore this)
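As a rough picture of what chunk-wise processing with a given chunk size can look like, the sketch below accumulates per-cluster sums and counts one chunk at a time, so only one slice of the data is held as a dense array at once. The function `sketch_chunked_means` is a hypothetical illustration of the general technique for the "mean" method, not the library's actual code path.

```python
import numpy as np

def sketch_chunked_means(data, labels, chunk_size=1000):
    """Per-cluster means computed chunk by chunk (hypothetical helper)."""
    cluster_ids = np.unique(labels)  # sorted unique labels
    sums = np.zeros((len(cluster_ids), data.shape[1]))
    counts = np.zeros(len(cluster_ids), dtype=np.int64)
    for start in range(0, len(data), chunk_size):
        # Only this slice is materialized; works for ndarray and np.memmap alike.
        chunk = np.asarray(data[start:start + chunk_size], dtype=float)
        chunk_labels = labels[start:start + chunk_size]
        # Map each label to its row index in the accumulators.
        rows = np.searchsorted(cluster_ids, chunk_labels)
        # Unbuffered add so repeated row indices within a chunk all contribute.
        np.add.at(sums, rows, chunk)
        counts += np.bincount(rows, minlength=len(cluster_ids))
    return sums / counts[:, None]

data = np.arange(12, dtype=float).reshape(6, 2)
labels = np.array([0, 1, 0, 1, 0, 1])
chunked = sketch_chunked_means(data, labels, chunk_size=4)
```

The result matches a direct per-cluster `mean` over the full array; the chunk size only changes how much data is processed per step.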
Returns
- Optional[numpy.ndarray]
Array of cluster centers with shape (n_clusters, n_features), or None
Raises
- ValueError
If center_method is not one of VALID_METHODS
Notes
Memory-safe processing:
centroid, mean, median, median_centroid, rmsd_centroid: Always memory-safe
density_peak: Uses sampling if cluster exceeds max_memory_gb
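The sampling idea for density_peak can be sketched as follows. This is a hedged illustration under stated assumptions, not the library's algorithm: local density is taken here as the number of neighbours within a fixed `radius`, and the memory limit is stood in for by a hypothetical `max_points` cap. How the library actually defines density, or converts `max_memory_gb` into a sample size, is not specified in this documentation.

```python
import numpy as np

def sketch_density_peak(points, radius, max_points=5000, seed=0):
    """Pick the point with the most neighbours within `radius` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    if len(points) > max_points:
        # Assumed stand-in for the memory threshold: subsample large clusters
        # so the pairwise distance matrix stays small.
        idx = rng.choice(len(points), size=max_points, replace=False)
        candidates = points[idx]
    else:
        candidates = points
    # Full pairwise Euclidean distances among the candidates.
    diffs = candidates[:, None, :] - candidates[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Neighbour count within the radius (each point counts itself).
    density = (dists < radius).sum(axis=1)
    return candidates[np.argmax(density)]

pts = np.array([[0.0, 0.0], [0.4, 0.0], [-0.4, 0.0], [5.0, 5.0]])
peak = sketch_density_peak(pts, radius=0.5)
```

The pairwise matrix grows quadratically with cluster size, which is why a cap of this kind matters for the density-based method in particular.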
Examples
>>> # Standard usage
>>> centers = calculate_centers(data, labels, "centroid")
>>> # Memory-safe processing for large data
>>> centers = calculate_centers(data, labels, "centroid",
...     chunk_size=1000, use_memmap=True, max_memory_gb=2.0)