Kernel PCA Data


KernelPCA decomposition type implementation for molecular dynamics analysis.

Provides nonlinear dimensionality reduction of feature matrices using an RBF kernel.

class mdxplain.decomposition.decomposition_type.kernel_pca.kernel_pca.KernelPCA(n_components: int | str | None = 'auto', gamma: float | str | None = 'scale', use_nystrom: bool = False, n_landmarks: int = 10000, landmark_selection_mode: str = 'kmeans', random_state: int | None = None, use_parallel: bool = False, n_jobs: int = -1, min_chunk_size: int = 1000, max_blas_threads: int | None = 1, auto_limit_blas: bool = True, offset: int | float = 0)

Kernel Principal Component Analysis decomposition type.

Implements KernelPCA with an RBF kernel for nonlinear dimensionality reduction of feature matrices from molecular dynamics trajectories.

The method implicitly maps the data to a higher-dimensional feature space via the kernel function and then applies PCA in that space.
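The two steps described above (kernel mapping, then PCA in feature space) can be sketched in plain NumPy. This is a minimal illustration of textbook RBF kernel PCA, not the mdxplain implementation; the function name `rbf_kernel_pca` is hypothetical.

```python
import numpy as np

def rbf_kernel_pca(X, n_components, gamma):
    """Project X onto its leading kernel principal components (RBF kernel)."""
    # Pairwise squared Euclidean distances between all samples.
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # guard against round-off

    # RBF kernel: inner products in the implicit feature space.
    K = np.exp(-gamma * sq_dists)

    # Centering the kernel matrix centers the data in feature space.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # PCA in feature space reduces to an eigendecomposition of the centered kernel.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Projections of the training samples: eigenvectors scaled by sqrt(eigenvalue).
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Z = rbf_kernel_pca(X, n_components=3, gamma=0.1)
print(Z.shape)  # (200, 3)
```

The full kernel matrix has shape (n_samples, n_samples), which is why the class offers chunked and Nyström variants for large trajectories.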

Examples

>>> # Basic KernelPCA decomposition via DecompositionManager
>>> from mdxplain.decomposition import decomposition_type
>>> decomp_manager = DecompositionManager()
>>> decomp_manager.add(
...     traj_data, "feature_selection", decomposition_type.KernelPCA(n_components=10, gamma=0.1)
... )
>>> # KernelPCA with incremental computation for large datasets
>>> kpca = decomposition_type.KernelPCA(n_components=50)
>>> kpca.init_calculator(use_memmap=True, chunk_size=1000)
>>> transformed, metadata = kpca.compute(large_data)
>>> # KernelPCA with Nyström approximation for very large datasets
>>> kpca = decomposition_type.KernelPCA(
...     n_components=50, use_nystrom=True, n_landmarks=5000
... )
>>> kpca.init_calculator()
>>> transformed, metadata = kpca.compute(very_large_data)
__init__(n_components: int | str | None = 'auto', gamma: float | str | None = 'scale', use_nystrom: bool = False, n_landmarks: int = 10000, landmark_selection_mode: str = 'kmeans', random_state: int | None = None, use_parallel: bool = False, n_jobs: int = -1, min_chunk_size: int = 1000, max_blas_threads: int | None = 1, auto_limit_blas: bool = True, offset: int | float = 0) -> None

Initialize KernelPCA decomposition type with RBF kernel.

Creates a KernelPCA instance that always uses RBF kernel with specified parameters.

Parameters

n_components : int, str, or None, default="auto"

Number of components to keep. Options:

  • int: Specific number of components

  • “auto”: Automatic selection via elbow detection (5% of features) [DEFAULT]

  • None: Uses min(n_samples, n_features)

gamma : float, str, or None, default="scale"

RBF kernel coefficient. Options:

  • float: Specific gamma value

  • “scale”: 1.0 / (n_features * variance) [DEFAULT]

  • “auto”: 1.0 / n_features

  • None: Uses 1.0 / n_features (same as “auto”)
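The gamma options above can be read as a small resolution step. The sketch below uses a hypothetical helper, `resolve_gamma`, and assumes that "variance" means the variance over all entries of the feature matrix (the convention scikit-learn uses for `gamma="scale"` in its SVM estimators); the library may compute it differently.

```python
import numpy as np

def resolve_gamma(gamma, X):
    """Turn a gamma setting into a concrete float for data X (hypothetical helper)."""
    n_features = X.shape[1]
    if gamma is None or gamma == "auto":
        return 1.0 / n_features
    if gamma == "scale":
        # Assumption: variance taken over all entries of X.
        return 1.0 / (n_features * X.var())
    return float(gamma)

X = np.array([[0.0, 2.0], [2.0, 4.0]])  # 2 features, overall variance 2.0
print(resolve_gamma("scale", X))  # 0.25
print(resolve_gamma("auto", X))   # 0.5
```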

use_nystrom : bool, default=False

Whether to use the Nyström approximation for large datasets

n_landmarks : int, default=10000

Number of landmarks for the Nyström approximation

landmark_selection_mode : str, default="kmeans"

Method for landmark selection in Nyström approximation:

  • “kmeans”: Use KMeans centroids as landmarks (better coverage)

  • “random”: Use random sampling from data
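For intuition, the Nyström idea with "random" landmark selection can be sketched as follows. This is a generic NumPy illustration under stated assumptions, not the library's code: `rbf` and `nystrom_features` are hypothetical names, and the "kmeans" mode would replace the sampled rows with cluster centroids.

```python
import numpy as np

def rbf(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(X, n_landmarks, gamma, random_state=None):
    """Feature map Phi with Phi @ Phi.T approximating the full kernel matrix."""
    rng = np.random.default_rng(random_state)
    idx = rng.choice(X.shape[0], size=n_landmarks, replace=False)
    landmarks = X[idx]                        # "random" mode: sample rows of X
    K_mm = rbf(landmarks, landmarks, gamma)   # landmark-landmark kernel
    K_nm = rbf(X, landmarks, gamma)           # sample-landmark kernel
    # Inverse square root of K_mm via eigendecomposition, dropping tiny eigenvalues.
    w, V = np.linalg.eigh(K_mm)
    keep = w > 1e-10 * w.max()
    return K_nm @ (V[:, keep] / np.sqrt(w[keep]))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
Phi = nystrom_features(X, n_landmarks=50, gamma=0.5, random_state=0)
print(Phi.shape)  # 500 rows, at most 50 columns
```

Only the (n_samples, n_landmarks) and (n_landmarks, n_landmarks) kernel blocks are ever formed, which is what makes the approach usable when the full kernel matrix would not fit in memory.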

random_state : int, optional

Random state for reproducible results

use_parallel : bool, default=False

Whether to use parallel processing for matrix-vector multiplication

n_jobs : int, default=-1

Number of parallel jobs (-1 for all available CPU cores)

min_chunk_size : int, default=1000

Minimum chunk size per parallel process to avoid overhead
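How n_jobs and min_chunk_size might interact can be illustrated with a row-chunked matrix-vector product. The sketch is an assumption about the general technique, not the library's scheduler: `chunk_bounds` and `parallel_matvec` are hypothetical names, and threads are used here for simplicity.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def chunk_bounds(n_rows, n_jobs, min_chunk_size):
    """Split n_rows into at most n_jobs chunks of at least min_chunk_size rows."""
    workers = (os.cpu_count() or 1) if n_jobs == -1 else n_jobs
    n_chunks = max(1, min(workers, n_rows // min_chunk_size))
    edges = np.linspace(0, n_rows, n_chunks + 1, dtype=int)
    return list(zip(edges[:-1], edges[1:]))

def parallel_matvec(A, v, n_jobs=-1, min_chunk_size=1000):
    """Compute A @ v by multiplying row blocks of A in worker threads."""
    bounds = chunk_bounds(A.shape[0], n_jobs, min_chunk_size)
    with ThreadPoolExecutor(max_workers=len(bounds)) as pool:
        parts = list(pool.map(lambda b: A[b[0]:b[1]] @ v, bounds))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
A = rng.normal(size=(2500, 20))
v = rng.normal(size=20)
result = parallel_matvec(A, v, n_jobs=4, min_chunk_size=1000)
print(len(chunk_bounds(2500, 4, 1000)))  # 2: four workers requested, but only two chunks of >= 1000 rows fit
```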

max_blas_threads : int or None, default=1

Preferred BLAS/OpenMP thread limit. None falls back to a safe default; set auto_limit_blas=False to disable thread limiting altogether

auto_limit_blas : bool, default=True

Apply a safe thread policy: limit BLAS to 1 thread when n_jobs != 1, otherwise use max_blas_threads (falling back to 2 when it is None)
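Read literally, the thread policy described for max_blas_threads and auto_limit_blas amounts to the decision below. `resolve_blas_threads` is a hypothetical helper written from that description alone; how the library actually applies the resulting limit is not shown here.

```python
def resolve_blas_threads(n_jobs, max_blas_threads=1, auto_limit_blas=True):
    """Return the BLAS thread cap to apply, or None for no cap (hypothetical helper)."""
    if not auto_limit_blas:
        return None  # thread limiting disabled
    if n_jobs != 1:
        return 1     # several workers: keep BLAS single-threaded to avoid oversubscription
    return 2 if max_blas_threads is None else max_blas_threads

print(resolve_blas_threads(n_jobs=-1))                        # 1
print(resolve_blas_threads(n_jobs=1, max_blas_threads=4))     # 4
print(resolve_blas_threads(n_jobs=1, max_blas_threads=None))  # 2
```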

offset : int or float, default=0

Adjustment to the auto-selected component count (only applies when n_components="auto"):

  • int: Direct addition/subtraction (e.g., -2 selects 2 fewer)

  • float: Percentage adjustment (e.g., -0.5 selects 50% fewer)
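The two offset modes can be sketched as a small function. `apply_offset` is a hypothetical name, and both the rounding rule and the lower bound of one component are assumptions made for the illustration.

```python
def apply_offset(n_auto, offset=0):
    """Adjust an auto-selected component count by an int or float offset."""
    if isinstance(offset, float):
        # Float: relative change, e.g. -0.5 keeps half of the components.
        adjusted = round(n_auto * (1.0 + offset))
    else:
        # Int: absolute change, e.g. -2 keeps two fewer components.
        adjusted = n_auto + offset
    return max(1, adjusted)  # assumption: never go below one component

print(apply_offset(10, -2))    # 8
print(apply_offset(10, -0.5))  # 5
```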

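Returns

None

The constructor itself returns nothing. The entries below describe the keys of the metadata dictionary returned by compute():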

hyperparameters : dict

Dictionary containing all Kernel PCA parameters used

original_shape : tuple

Shape of the input data (n_samples, n_features)

use_memmap : bool

Whether memory mapping was used

chunk_size : int

Chunk size used for processing

cache_path : str

Path used for caching results

method : str

Method used (‘standard_kernel_pca’, ‘nystrom_kernel_pca’, or ‘chunk_wise_kernel_pca’)

approximation : str

Approximation method used (‘nystrom’ when Nyström approximation is enabled)

n_chunks : int

Number of chunks used for incremental processing

n_landmarks : int

Number of landmarks used for Nyström approximation (when applicable)

Examples

>>> # Create KernelPCA instance with RBF kernel
>>> kpca = KernelPCA(n_components=10, gamma=0.1)
>>> print(f"Type: {kpca.get_type_name()}")
Type: kernel_pca
>>> # Create KernelPCA with Nyström approximation
>>> kpca = KernelPCA(n_components=50, use_nystrom=True, n_landmarks=5000)
>>> # Create KernelPCA with parallel processing
>>> kpca = KernelPCA(n_components=10, use_parallel=True, n_jobs=8, min_chunk_size=500)
classmethod get_type_name() -> str

Get the type name for KernelPCA decomposition.

Returns the unique string identifier for KernelPCA decomposition type used for storing results and type identification.

Parameters

cls : type

The KernelPCA class

Returns

str

String identifier ‘kernel_pca’

Examples

>>> print(KernelPCA.get_type_name())
kernel_pca
>>> # Also accessible through the decomposition_type namespace
>>> print(decomposition_type.KernelPCA.get_type_name())
kernel_pca
init_calculator(use_memmap: bool = False, cache_path: str = './cache', chunk_size: int = 2000) -> None

Initialize the KernelPCA calculator with specified configuration.

Sets up the KernelPCA calculator with options for memory mapping and incremental kernel computation for large datasets.

Parameters

use_memmap : bool, default=False

Whether to use incremental kernel computation for large datasets

cache_path : str, default="./cache"

Path for cache files when using memory mapping

chunk_size : int, default=2000

Number of samples to process per chunk for incremental computation
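The combination of memory mapping and chunk-wise processing can be pictured as filling a disk-backed kernel matrix one block of rows at a time. The sketch below shows that general technique with `numpy.memmap`; the function name `chunked_rbf_kernel` and the file name are hypothetical, and the library's calculator may organize its cache differently.

```python
import os
import tempfile

import numpy as np

def chunked_rbf_kernel(X, gamma, chunk_size, path):
    """Fill an on-disk RBF kernel matrix chunk_size rows at a time."""
    n = X.shape[0]
    K = np.memmap(path, dtype=np.float64, mode="w+", shape=(n, n))
    sq_norms = (X**2).sum(axis=1)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Squared distances between this block of rows and all samples.
        sq = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * X[start:stop] @ X.T
        K[start:stop] = np.exp(-gamma * np.maximum(sq, 0.0))
    K.flush()
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
with tempfile.TemporaryDirectory() as cache_dir:
    K = chunked_rbf_kernel(X, gamma=0.1, chunk_size=200,
                           path=os.path.join(cache_dir, "kernel.dat"))
    diag_ok = bool(np.allclose(np.diag(K), 1.0))
    K_dense = np.array(K)  # copy into RAM before the cache directory is removed
    del K                  # release the memory map
print(K_dense.shape)  # (500, 500)
```

Only one (chunk_size, n_samples) block is held in memory during the fill, so a smaller chunk_size trades speed for a lower peak memory footprint.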

Returns

None

Sets self.calculator to initialized KernelPCACalculator instance

Examples

>>> # Basic initialization
>>> kpca = KernelPCA()
>>> kpca.init_calculator()
>>> # With incremental computation for large datasets
>>> kpca.init_calculator(use_memmap=True, chunk_size=1000)
>>> # With custom cache path
>>> kpca.init_calculator(
...     use_memmap=True,
...     cache_path="./cache/kernel_pca.dat"
... )
compute(data: ndarray) -> Tuple[ndarray, Dict[str, Any]]

Compute KernelPCA decomposition of input data using RBF kernel.

Performs Kernel Principal Component Analysis on the input feature matrix using the initialized calculator with RBF kernel and the parameters provided during initialization.

Parameters

data : numpy.ndarray

Input feature matrix to decompose, shape (n_samples, n_features)

Returns

Tuple[numpy.ndarray, Dict]

Tuple containing:

  • transformed_data: KernelPCA-transformed data matrix (n_samples, n_components)

  • metadata: Dictionary with KernelPCA information including:

    • hyperparameters: Used parameters

    • method: ‘standard_kernel_pca’ or ‘incremental_kernel_pca’

    • n_landmarks (optional): Number of landmarks used in the Nyström approximation

Examples

>>> # Compute KernelPCA with RBF kernel
>>> import numpy as np
>>> kpca = KernelPCA(n_components=10, gamma=0.1)
>>> kpca.init_calculator()
>>> data = np.random.rand(1000, 100)
>>> transformed, metadata = kpca.compute(data)
>>> print(f"Transformed shape: {transformed.shape}")
Transformed shape: (1000, 10)
>>> print(f"Kernel: {metadata['hyperparameters']['kernel']}")
Kernel: rbf
>>> # Incremental KernelPCA for large datasets
>>> kpca = KernelPCA(n_components=50, gamma=0.01)
>>> kpca.init_calculator(use_memmap=True, chunk_size=200)
>>> large_data = np.random.rand(10000, 500)
>>> transformed, metadata = kpca.compute(large_data)

Raises

ValueError

If the calculator has not been initialized, the input data is invalid, or n_components is too large