PCA Data


PCA decomposition type for molecular dynamics analysis, providing standard and incremental Principal Component Analysis for dimensionality reduction of feature matrices.

class mdxplain.decomposition.decomposition_type.pca.pca.PCA(n_components: int | str | None = 'auto', random_state: int | None = None, offset: int | float = 0)

Principal Component Analysis decomposition type.

Implements PCA for dimensionality reduction of feature matrices from molecular dynamics trajectories. Supports both standard and incremental computation for large datasets.

This is a linear dimensionality reduction method that finds the directions of maximum variance in the data and projects the data onto these directions.
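The idea of finding maximum-variance directions and projecting onto them can be sketched with plain NumPy. This is an illustrative sketch only, not the mdxplain or scikit-learn implementation; the function name `pca_via_svd` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def pca_via_svd(data, n_components):
    """Project data onto its top principal components (illustrative only)."""
    # Center each feature so the directions describe variance, not the mean.
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are orthonormal directions sorted by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Variance along each direction, using the n_samples - 1 normalization.
    explained_variance = singular_values**2 / (data.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    # Coordinates of every sample in the reduced space.
    transformed = centered @ components.T
    return transformed, components, ratio[:n_components]


# Synthetic feature matrix: 200 samples, 6 features with different spreads.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
transformed, components, ratio = pca_via_svd(data, n_components=2)
print(transformed.shape)  # (200, 2)
```

The first returned ratio is never smaller than the second, because singular values come back in descending order.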

Uses sklearn’s PCA and IncrementalPCA under the hood.

Examples

>>> # Basic PCA decomposition via DecompositionManager
>>> from mdxplain.decomposition import decomposition_type
>>> decomp_manager = DecompositionManager()
>>> decomp_manager.add(
...     traj_data, "feature_selection", decomposition_type.PCA(n_components=10)
... )
>>> # PCA with incremental computation for large datasets
>>> pca = decomposition_type.PCA(n_components=50)
>>> pca.init_calculator(use_memmap=True, chunk_size=1000)
>>> transformed, metadata = pca.compute(large_data)

__init__(n_components: int | str | None = 'auto', random_state: int | None = None, offset: int | float = 0) -> None

Initialize PCA decomposition type with parameters.

Creates a PCA instance with specified parameters that will be used during computation.

Parameters

n_components : int, str, or None, default="auto"

Number of components to keep. Options:

  • int: Specific number of components

  • “auto”: Automatic selection via elbow detection (5% of features) [DEFAULT]

  • None: Uses min(n_samples, n_features)

random_state : int, optional

Random state for reproducible results

offset : int or float, default=0

Adjustment to the auto-selected component count (only applies when n_components="auto"):

  • int: Direct addition/subtraction (e.g., -2 selects 2 fewer)

  • float: Percentage adjustment (e.g., -0.5 selects 50% fewer)
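The two offset modes can be illustrated with a small helper. This is a sketch under stated assumptions, not the library's code: the name `apply_offset`, the rounding of the float case, and the clamping to the range 1 to `max_components` are all hypothetical choices made for the example.

```python
def apply_offset(auto_count, offset, max_components):
    """Adjust an automatically selected component count (hypothetical helper)."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(offset, bool):
        raise TypeError("offset must be int or float")
    if isinstance(offset, int):
        # Integer offsets are added directly to the count.
        adjusted = auto_count + offset
    else:
        # Assumption: a float scales the count, rounded to the nearest integer.
        adjusted = round(auto_count * (1.0 + offset))
    # Assumption: keep the result within [1, max_components].
    return max(1, min(adjusted, max_components))


print(apply_offset(10, -2, max_components=100))    # 8
print(apply_offset(10, -0.5, max_components=100))  # 5
print(apply_offset(10, 0, max_components=100))     # 10
```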

Returned Metadata

hyperparameters : dict

Dictionary containing all PCA parameters used

original_shape : tuple

Shape of the input data (n_samples, n_features)

use_memmap : bool

Whether memory mapping was used

chunk_size : int

Chunk size used for incremental processing

cache_path : str

Path used for caching results

explained_variance_ratio : array

Fraction of the total variance explained by each component

explained_variance : array

Variance explained by each component

method : str

Method used (‘standard_pca’ or ‘incremental_pca’)

n_chunks : int

Number of chunks used (only for incremental_pca)
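A common use of the explained_variance_ratio entry is to check how many leading components are needed to reach a target share of the variance. The sketch below is illustrative: the helper name `components_for_variance` is hypothetical, and the ratios are made-up numbers standing in for the metadata array.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold):
    """Return how many leading components reach the cumulative threshold."""
    running_totals = accumulate(explained_variance_ratio)
    for count, total in enumerate(running_totals, start=1):
        if total >= threshold:
            return count
    # The retained components never reach the threshold.
    return None


# Made-up ratios standing in for metadata["explained_variance_ratio"].
ratios = [0.5, 0.25, 0.125, 0.0625]
print(components_for_variance(ratios, 0.8))  # 3
```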

Examples

>>> # Create PCA instance with parameters
>>> pca = PCA(n_components=10, random_state=42)
>>> print(f"Type: {pca.get_type_name()}")
Type: pca

classmethod get_type_name() -> str

Get the type name for PCA decomposition.

Returns the unique string identifier for PCA decomposition type used for storing results and type identification.

Parameters

cls : type

The PCA class

Returns

str

String identifier ‘pca’

Examples

>>> print(PCA.get_type_name())
pca
>>> # Also accessible through the decomposition_type module
>>> print(decomposition_type.PCA.get_type_name())
pca

init_calculator(use_memmap: bool = False, cache_path: str = './cache', chunk_size: int = 2000) -> None

Initialize the PCA calculator with specified configuration.

Sets up the PCA calculator with options for memory mapping and incremental computation for large datasets.

Parameters

use_memmap : bool, default=False

Whether to use incremental computation for large datasets

cache_path : str, default='./cache'

Path for cache files (not used for PCA but kept for interface consistency)

chunk_size : int, default=2000

Number of samples to process per chunk for incremental computation

Returns

None

Sets self.calculator to initialized PCACalculator instance
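To make the role of chunk_size concrete, the sketch below splits a sample count into consecutive row ranges. How mdxplain actually partitions the data is not stated here, so the helper `chunk_bounds` and the ceiling-based chunk count are assumptions for illustration.

```python
import math


def chunk_bounds(n_samples, chunk_size):
    """Yield (start, stop) row ranges covering n_samples in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_samples, chunk_size):
        # The final chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, n_samples)


bounds = list(chunk_bounds(n_samples=4500, chunk_size=2000))
print(bounds)  # [(0, 2000), (2000, 4000), (4000, 4500)]
print(len(bounds) == math.ceil(4500 / 2000))  # True
```

With scikit-learn's IncrementalPCA, each batch passed to partial_fit must contain at least n_components rows, so chunk_size should not be smaller than the requested number of components and a short final chunk may need special handling.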

Examples

>>> # Basic initialization
>>> pca = PCA()
>>> pca.init_calculator()
>>> # With incremental computation for large datasets
>>> pca.init_calculator(use_memmap=True, chunk_size=1000)
>>> # With custom chunk size
>>> pca.init_calculator(chunk_size=500)

compute(data: ndarray) -> Tuple[ndarray, Dict[str, Any]]

Compute PCA decomposition of input data.

Performs Principal Component Analysis on the input feature matrix using the initialized calculator and the parameters provided during initialization.

Parameters

data : numpy.ndarray

Input feature matrix to decompose, shape (n_samples, n_features)

Returns

Tuple[numpy.ndarray, Dict]

Tuple containing:

  • transformed_data: PCA-transformed data matrix (n_samples, n_components)

  • metadata: Dictionary with PCA information including:

    • hyperparameters: Used parameters

    • explained_variance_ratio: Fraction of the total variance explained by each component

    • components: Principal components (eigenvectors)

    • explained_variance: Variance explained by each component

    • mean: Mean of the original data

    • method: ‘standard_pca’ or ‘incremental_pca’

Examples

>>> import numpy as np
>>> # Compute PCA with predefined parameters
>>> pca = PCA(n_components=10, random_state=42)
>>> pca.init_calculator()
>>> data = np.random.rand(1000, 100)
>>> transformed, metadata = pca.compute(data)
>>> print(f"Transformed shape: {transformed.shape}")
>>> # Incremental PCA for large datasets
>>> pca = PCA(n_components=50)
>>> pca.init_calculator(use_memmap=True, chunk_size=200)
>>> large_data = np.random.rand(10000, 500)
>>> transformed, metadata = pca.compute(large_data)

Raises

ValueError

If calculator is not initialized, input data is invalid, or n_components is too large