Clustering Helper


Clustering helper functions.

Center Calculation Helper

Helper functions for calculating cluster centers.

Provides cluster center calculation from data and labels using several methods (centroid, mean, median, density_peak, median_centroid, rmsd_centroid).

class mdxplain.clustering.helper.center_calculation_helper.CenterCalculationHelper

Helper for calculating cluster centers.

Calculates cluster centers from data and labels using various methods. Automatically handles memory-efficient chunk-wise processing for large datasets.

Available calculation methods:

  • centroid: Representative point (medoid - closest to mean)

  • mean: Average of all cluster members

  • median: Coordinate-wise median (robust to outliers)

  • density_peak: Point with highest local density

  • median_centroid: Medoid from median (robust centroid)

  • rmsd_centroid: Centroid using RMSD metric (structural)
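The difference between the averaged centers (mean, median) and the medoid-style centers (centroid, median_centroid) can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the mdxplain implementation: the function name `sketch_centers` is hypothetical, Euclidean distance is assumed for the medoid lookup, every distinct label is treated as a cluster, and density_peak and rmsd_centroid are left out.

```python
import numpy as np

def sketch_centers(data, labels, method="centroid"):
    """Illustrative only: one center per label, in sorted label order."""
    centers = []
    for label in np.unique(labels):
        members = data[labels == label]
        if method == "mean":
            # Average of all cluster members; need not be an actual data point.
            center = members.mean(axis=0)
        elif method == "median":
            # Coordinate-wise median; less sensitive to outliers than the mean.
            center = np.median(members, axis=0)
        elif method in ("centroid", "median_centroid"):
            # Medoid-style: return the actual member closest to a reference point
            # (the mean for "centroid", the median for "median_centroid").
            if method == "centroid":
                ref = members.mean(axis=0)
            else:
                ref = np.median(members, axis=0)
            dists = np.linalg.norm(members - ref, axis=1)  # assumed Euclidean
            center = members[np.argmin(dists)]
        else:
            raise ValueError(f"Unsupported method in this sketch: {method!r}")
        centers.append(center)
    return np.vstack(centers)

# Toy data: cluster 0 contains an outlier at x = 10.
data = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 5.0], [5.0, 7.0]])
labels = np.array([0, 0, 0, 1, 1])

mean_centers = sketch_centers(data, labels, "mean")
median_centers = sketch_centers(data, labels, "median")
medoid_centers = sketch_centers(data, labels, "centroid")
```

In this toy data the mean of cluster 0 is pulled toward the outlier, while the median and the medoid both stay at an observed, typical member. That is the practical reason to prefer a medoid-style center when the center must be a real frame or sample.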

VALID_METHODS = ['centroid', 'mean', 'median', 'density_peak', 'median_centroid', 'rmsd_centroid']
static calculate_centers(data: ndarray, labels: ndarray, center_method: str = 'centroid', chunk_size: int = 1000, use_memmap: bool = False, max_memory_gb: float = 2.0, n_jobs: int = -1) -> ndarray | None

Calculate cluster centers from data and labels.

Parameters

data : numpy.ndarray

Original data used for clustering, shape (n_samples, n_features)

labels : numpy.ndarray

Cluster labels, shape (n_samples,)

center_method : str, default="centroid"

Method for calculating centers:

  • “centroid”: Representative point (medoid - closest to mean)

  • “mean”: Average of all points

  • “median”: Coordinate-wise median (robust to outliers)

  • “density_peak”: Point with highest local density

  • “median_centroid”: Medoid from median (more robust than centroid)

  • “rmsd_centroid”: Centroid using RMSD metric (structural comparisons)

chunk_size : int, default=1000

Chunk size for memory-safe processing

use_memmap : bool, default=False

Whether to use chunk-wise processing (memory-safe for large data)

max_memory_gb : float, default=2.0

Memory threshold for density_peak sampling (other methods ignore this)

n_jobs : int, default=-1

Number of parallel jobs for density_peak method (other methods ignore this)

Returns

Optional[numpy.ndarray]

Array of cluster centers with shape (n_clusters, n_features), or None

Raises

ValueError

If center_method is not valid

Notes

Memory-safe processing:

  • centroid, mean, median, median_centroid, rmsd_centroid: Always memory-safe

  • density_peak: Uses sampling if cluster exceeds max_memory_gb
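The two memory strategies in these notes can be sketched as follows. This is a hedged illustration, not the library's code: `chunked_cluster_means` and `maybe_subsample` are hypothetical names, and the memory estimate assumes that a density calculation would need a full float64 pairwise distance matrix for the cluster, which the documentation does not state.

```python
import numpy as np

def chunked_cluster_means(data, labels, chunk_size=1000):
    """Per-cluster means accumulated chunk by chunk (illustrative sketch).

    Only one chunk of `data` is materialized at a time, so `data` may be
    a numpy.memmap backed by a file larger than RAM.
    """
    unique_labels = np.unique(labels)
    index_of = {label: i for i, label in enumerate(unique_labels)}
    sums = np.zeros((len(unique_labels), data.shape[1]), dtype=np.float64)
    counts = np.zeros(len(unique_labels), dtype=np.int64)
    for start in range(0, data.shape[0], chunk_size):
        chunk = np.asarray(data[start:start + chunk_size], dtype=np.float64)
        chunk_labels = labels[start:start + chunk_size]
        for label in np.unique(chunk_labels):
            mask = chunk_labels == label
            sums[index_of[label]] += chunk[mask].sum(axis=0)
            counts[index_of[label]] += mask.sum()
    return sums / counts[:, None]

def maybe_subsample(members, max_memory_gb=2.0, seed=0):
    """Subsample a cluster if an n x n float64 matrix would exceed the budget.

    The n * n * 8 bytes estimate is an assumption made for this sketch.
    """
    n = members.shape[0]
    budget_bytes = max_memory_gb * 1024**3
    if n * n * 8 <= budget_bytes:
        return members
    # Largest n whose pairwise matrix still fits in the budget (at least 1).
    max_n = max(1, int(np.sqrt(budget_bytes / 8)))
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=max_n, replace=False)
    return members[idx]

# Usage on synthetic data.
rng = np.random.default_rng(42)
data = rng.normal(size=(2500, 3))
labels = np.arange(2500) % 3
means = chunked_cluster_means(data, labels, chunk_size=1000)
sampled = maybe_subsample(data[labels == 0], max_memory_gb=1e-6)
```

Sums and counts can be accumulated across chunks exactly, which is why the mean is straightforward to compute in a streaming fashion. A medoid or median needs additional passes or approximations, so how mdxplain keeps those methods memory-safe is not something this sketch tries to reproduce.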

Examples

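>>> from mdxplain.clustering.helper.center_calculation_helper import CenterCalculationHelper
>>> # Standard usage
>>> centers = CenterCalculationHelper.calculate_centers(data, labels, "centroid")
>>> # Memory-safe processing for large data
>>> centers = CenterCalculationHelper.calculate_centers(
...     data, labels, "centroid",
...     chunk_size=1000, use_memmap=True, max_memory_gb=2.0)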