PCA Calculator


PCA calculator for dimensionality reduction of molecular dynamics data.

Implements PCA computation with support for incremental processing for large datasets using sklearn’s PCA and IncrementalPCA.

class mdxplain.decomposition.decomposition_type.pca.pca_calculator.PCACalculator(use_memmap: bool = False, cache_path: str = './cache', chunk_size: int = 2000)

Calculator for Principal Component Analysis (PCA) decomposition.

Implements PCA computation with support for both standard and incremental processing for large datasets. Uses sklearn’s PCA for standard computation and sklearn’s IncrementalPCA for chunk-wise processing when memory mapping is enabled.
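
The choice between the two sklearn estimators can be illustrated with a minimal sketch. It is not the mdxplain implementation: the helper name `fit_transform_pca` and its `use_incremental` flag are hypothetical, and only `PCA`, `IncrementalPCA`, `partial_fit`, and `transform` are actual sklearn APIs.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA


def fit_transform_pca(data, n_components, use_incremental=False, chunk_size=2000):
    """Hypothetical helper: pick standard or incremental PCA."""
    if not use_incremental:
        # Standard PCA: fit on the full matrix in one call.
        model = PCA(n_components=n_components)
        return model.fit_transform(data), model

    # Incremental PCA: fit chunk by chunk so only one chunk is processed at a time.
    model = IncrementalPCA(n_components=n_components)
    starts = range(0, data.shape[0], chunk_size)
    for start in starts:
        chunk = data[start:start + chunk_size]
        # sklearn can reject a batch with fewer samples than n_components,
        # so this sketch skips such a (typically trailing) chunk when fitting.
        if chunk.shape[0] >= n_components:
            model.partial_fit(chunk)

    # Project chunk by chunk as well, then stack the pieces.
    transformed = np.vstack(
        [model.transform(data[start:start + chunk_size]) for start in starts]
    )
    return transformed, model


rng = np.random.default_rng(0)
data = rng.random((1000, 20))
full, _ = fit_transform_pca(data, n_components=5)
chunked, _ = fit_transform_pca(data, n_components=5, use_incremental=True, chunk_size=200)
print(full.shape, chunked.shape)
```

Both paths return an array of shape (n_samples, n_components). The incremental result approximates the standard one and may differ slightly in values and component signs.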

Examples

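>>> import numpy as np
>>> from mdxplain.decomposition.decomposition_type.pca.pca_calculator import PCACalculator
>>> # Standard PCA computation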
>>> calc = PCACalculator()
>>> data = np.random.rand(1000, 100)
>>> transformed, metadata = calc.compute(data, n_components=10)
>>> # Incremental PCA for large datasets
>>> calc = PCACalculator(use_memmap=True, chunk_size=200)
>>> large_data = np.random.rand(10000, 500)
>>> transformed, metadata = calc.compute(large_data, n_components=50)

__init__(use_memmap: bool = False, cache_path: str = './cache', chunk_size: int = 2000) -> None

Initialize PCA calculator.

Parameters

use_memmap : bool, default=False

Whether to use memory mapping and incremental computation

cache_path : str, optional

Path for memory-mapped cache files (not used for PCA)

chunk_size : int, optional

Size of chunks for incremental PCA processing

Returns

None

Initializes PCA calculator with specified configuration

Examples

>>> # Standard PCA
>>> calc = PCACalculator()
>>> # Incremental PCA for large datasets
>>> calc = PCACalculator(use_memmap=True, chunk_size=1000)

compute(data: ndarray, **kwargs) -> Tuple[ndarray, Dict[str, Any]]

Compute PCA decomposition of input data.

Performs Principal Component Analysis on the input data matrix, using either standard PCA or incremental PCA based on the configuration settings.

Parameters

data : numpy.ndarray

Input data matrix to decompose, shape (n_samples, n_features)

**kwargs : dict

PCA parameters:

  • n_components : int, optional

    Number of components to keep (default: min(n_samples, n_features))

  • random_state : int, optional

    Random state for reproducible results

Returns

Tuple[numpy.ndarray, Dict]

Tuple containing:

  • transformed_data: PCA-transformed data, shape (n_samples, n_components)

  • metadata: Dictionary with PCA information including components, explained variance ratio, and hyperparameters

Examples

>>> import numpy as np
>>> # Compute PCA with 10 components
>>> calc = PCACalculator()
>>> data = np.random.rand(500, 100)
>>> transformed, metadata = calc.compute(data, n_components=10)
>>> print(f"Explained variance: {metadata['explained_variance_ratio']}")
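
The explained variance ratio can also guide the choice of n_components. The sketch below assumes that `metadata['explained_variance_ratio']` is a one-dimensional array with one entry per component, as in sklearn. The values and the 0.9 threshold are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical values standing in for metadata['explained_variance_ratio'].
explained_variance_ratio = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])

# Running total of variance explained by the first k components.
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.9
n_needed = int(np.searchsorted(cumulative, threshold)) + 1
print(n_needed)  # 4
```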

Raises

ValueError

If input data is invalid or n_components is too large
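
As a rough illustration of the check described above, the sketch below validates n_components against the documented default limit min(n_samples, n_features). The helper `resolve_n_components` is hypothetical and not part of mdxplain. The exact conditions and error messages of the real implementation are not specified here.

```python
def resolve_n_components(shape, n_components=None):
    """Hypothetical helper, not part of mdxplain: validate n_components."""
    if len(shape) != 2:
        raise ValueError(f"Expected a 2D matrix, got shape {shape}")
    n_samples, n_features = shape
    limit = min(n_samples, n_features)
    if n_components is None:
        # Documented default: keep min(n_samples, n_features) components.
        return limit
    if not 1 <= n_components <= limit:
        raise ValueError(
            f"n_components={n_components} must be between 1 and {limit}"
        )
    return n_components


print(resolve_n_components((500, 100)))      # 100 (default)
print(resolve_n_components((500, 100), 10))  # 10
try:
    resolve_n_components((500, 100), 200)
except ValueError as err:
    print(f"Rejected: {err}")
```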