Clustering Helper

Clustering helper functions.

Center Calculation Helper

Helper functions for calculating cluster centers.

Provides cluster center calculation from data and labels using one of six methods (centroid, mean, median, density_peak, median_centroid, rmsd_centroid).

class mdxplain.clustering.helper.center_calculation_helper.CenterCalculationHelper

Helper for calculating cluster centers.

Calculates cluster centers from data and labels using various methods. Automatically handles memory-efficient chunk-wise processing for large datasets.

Available calculation methods:

centroid: Representative point (medoid - closest to mean)
mean: Average of all cluster members
median: Coordinate-wise median (robust to outliers)
density_peak: Point with highest local density
median_centroid: Medoid from median (robust centroid)
rmsd_centroid: Centroid using RMSD metric (structural)
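
To make the distinction between these methods concrete, here is a minimal NumPy sketch of four of them. It is an illustration under stated assumptions, not the mdxplain implementation: the function name sketch_centers is hypothetical, Euclidean distance is assumed for the medoid search, and density_peak and rmsd_centroid are left out.

```python
import numpy as np


def sketch_centers(data, labels, method="centroid"):
    """Illustrative only: return one center per unique label (sorted order)."""
    centers = []
    for label in np.unique(labels):
        members = data[labels == label]
        if method == "mean":
            # Average of all cluster members; need not be an observed point.
            center = members.mean(axis=0)
        elif method == "median":
            # Coordinate-wise median; also need not be an observed point.
            center = np.median(members, axis=0)
        elif method in ("centroid", "median_centroid"):
            # Medoid: the observed member closest to a reference point
            # (the mean for "centroid", the median for "median_centroid").
            if method == "centroid":
                reference = members.mean(axis=0)
            else:
                reference = np.median(members, axis=0)
            # Assumption: Euclidean distance.
            distances = np.linalg.norm(members - reference, axis=1)
            center = members[np.argmin(distances)]
        else:
            raise ValueError(f"Unsupported method in this sketch: {method!r}")
        centers.append(center)
    return np.array(centers)
```

The practical difference is that mean and median can return a point that never occurs in the data, whereas the two medoid variants always return an actual row of data.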

VALID_METHODS = ['centroid', 'mean', 'median', 'density_peak', 'median_centroid', 'rmsd_centroid']

static calculate_centers(data: ndarray, labels: ndarray, center_method: str = 'centroid', chunk_size: int = 1000, use_memmap: bool = False, max_memory_gb: float = 2.0, n_jobs: int = -1) → ndarray | None

Calculate cluster centers from data and labels.

Parameters

data : numpy.ndarray

Original data used for clustering, shape (n_samples, n_features)

labels : numpy.ndarray

Cluster labels, shape (n_samples,)

center_method : str, default="centroid"

Method for calculating centers:

"centroid": Representative point (medoid - closest to mean)
"mean": Average of all points
"median": Coordinate-wise median (robust to outliers)
"density_peak": Point with highest local density
"median_centroid": Medoid from median (more robust than centroid)
"rmsd_centroid": Centroid using RMSD metric (structural comparisons)

chunk_size : int, default=1000

Chunk size for memory-safe processing

use_memmap : bool, default=False

Whether to use chunk-wise processing (memory-safe for large data)

max_memory_gb : float, default=2.0

Memory threshold for density_peak sampling (other methods ignore this)

n_jobs : int, default=-1

Number of parallel jobs for density_peak method (other methods ignore this)
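
As a rough illustration of what chunk-wise processing can look like, the sketch below computes per-cluster means by accumulating sums and counts one chunk at a time, so only chunk_size rows are materialized at once. This is an assumption about the general technique, not the mdxplain code path; the function name chunked_cluster_means is hypothetical. The same loop works when data is a numpy.memmap, since slicing reads only the requested rows.

```python
import numpy as np


def chunked_cluster_means(data, labels, chunk_size=1000):
    """Illustrative only: per-cluster means accumulated chunk by chunk."""
    unique_labels = np.unique(labels)  # sorted, so searchsorted is valid below
    sums = np.zeros((len(unique_labels), data.shape[1]), dtype=np.float64)
    counts = np.zeros(len(unique_labels), dtype=np.int64)

    for start in range(0, data.shape[0], chunk_size):
        stop = start + chunk_size
        # Only this slice is loaded into memory (relevant for np.memmap input).
        chunk = np.asarray(data[start:stop], dtype=np.float64)
        # Map each label in the chunk to its row index in `sums`.
        rows = np.searchsorted(unique_labels, labels[start:stop])
        # Unbuffered add, so repeated row indices accumulate correctly.
        np.add.at(sums, rows, chunk)
        counts += np.bincount(rows, minlength=len(unique_labels))

    return sums / counts[:, None]
```

A mean decomposes cleanly into running sums and counts, which is why it is easy to stream. A coordinate-wise median or a medoid search does not decompose this way and needs a different strategy, such as a second pass over the data.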

Returns

Optional[numpy.ndarray]: Array of cluster centers with shape (n_clusters, n_features), or None

Raises

ValueError: If center_method is not valid

Notes

Memory-safe processing:

centroid, mean, median, median_centroid, rmsd_centroid: Always memory-safe
density_peak: Uses sampling if cluster exceeds max_memory_gb
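
The sampling behaviour for density_peak can be pictured with the sketch below. It is a guess at the general idea, not the library's algorithm: the density estimator (inverse distance to the k-th nearest neighbour), the memory rule (a float64 pairwise matrix of n x n entries must fit in max_memory_gb), and the names density_peak_sketch, k and seed are all assumptions introduced for illustration.

```python
import numpy as np


def density_peak_sketch(members, max_memory_gb=2.0, k=5, seed=0):
    """Illustrative only: return the member with the smallest k-NN distance."""
    n = members.shape[0]
    # Assumption: the n x n float64 distance matrix (8 bytes per entry)
    # is the dominant memory cost.
    max_points = max(1, int(np.sqrt(max_memory_gb * 1024**3 / 8)))
    if n > max_points:
        # Cluster too large: work on a random subsample instead.
        rng = np.random.default_rng(seed)
        members = members[rng.choice(n, size=max_points, replace=False)]

    # Squared Euclidean distances via the Gram matrix (n x n, no n x n x d array).
    sq_norms = np.einsum("ij,ij->i", members, members)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (members @ members.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip small negative rounding errors

    # Index 0 of each sorted row is the point itself, so index k is the
    # k-th nearest neighbour. A small distance means a high local density.
    k = min(k, members.shape[0] - 1)
    kth_sq_dist = np.partition(sq_dists, k, axis=1)[:, k]
    return members[np.argmin(kth_sq_dist)]
```

With the default max_memory_gb=2.0, this rule allows roughly 16,000 points per cluster before sampling starts; the threshold the library actually uses may differ.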

Examples

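>>> from mdxplain.clustering.helper.center_calculation_helper import CenterCalculationHelper

>>> # Standard usage
>>> centers = CenterCalculationHelper.calculate_centers(data, labels, "centroid")

>>> # Memory-safe processing for large data
>>> centers = CenterCalculationHelper.calculate_centers(
...     data, labels, "centroid",
...     chunk_size=1000, use_memmap=True, max_memory_gb=2.0)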