Decomposition Services

Decomposition factory implementations for simplified API access.

This module provides factory classes that simplify decomposition management by hiding explicit type instantiation from the user.

Decomposition Add Service

Factory for adding decomposition algorithms with simplified syntax.

class mdxplain.decomposition.services.decomposition_add_service.DecompositionAddService(manager: DecompositionManager, pipeline_data: PipelineData)

Service for adding decomposition algorithms without explicit type instantiation.

This service lets users add decomposition algorithms without importing or instantiating decomposition types directly. Each method accepts the parameters of its decomposition type together with the parameters of manager.add in a single call.

Examples

>>> pipeline.decomposition.add.pca("my_features", n_components=10)
>>> pipeline.decomposition.add.kernel_pca("contact_features", kernel='rbf', n_components=20)
>>> pipeline.decomposition.add.diffusion_maps("distance_features", n_components=15)

__init__(manager: DecompositionManager, pipeline_data: PipelineData) → None

Initialize factory with manager and pipeline data.

Parameters

manager (DecompositionManager): Decomposition manager instance
pipeline_data (PipelineData): Pipeline data container (injected by AutoInjectProxy)

Returns

None

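pca(selection_name: str, n_components: int | str | None = 'auto', random_state: int | None = None, offset: int | float = 0, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) → None

Add PCA (Principal Component Analysis) decomposition.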

PCA reduces dimensionality by finding the directions of maximum variance in the data and projecting the data onto these principal components.

Parameters

selection_name (str)

Name of feature selection to decompose

n_components (int, str, or None, default="auto")

Number of principal components. Options:

int: Specific number of components
"auto": Automatic selection via elbow detection (5% of features) [DEFAULT]
None: Uses min(n_samples, n_features)

random_state (int, optional)

Random state for reproducible results

offset (int or float, default=0)

Adjustment to auto-selected component count (only applies when n_components="auto"):

int: Direct addition/subtraction (e.g., -2 selects 2 fewer)
float: Percentage adjustment (e.g., -0.5 selects 50% fewer)
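
The offset rule can be sketched in a few lines. This is a minimal illustration, not mdxplain's implementation: the helper name apply_offset, the max_components argument, the rounding of fractional results, and the clamping to the range 1..max_components are all assumptions.

```python
def apply_offset(auto_count, offset, max_components):
    """Adjust an auto-selected component count (hypothetical helper)."""
    # bool is a subclass of int, so reject it explicitly
    if isinstance(offset, bool):
        raise TypeError("offset must be int or float, not bool")
    if isinstance(offset, int):
        # int: direct addition/subtraction
        adjusted = auto_count + offset
    else:
        # float: percentage adjustment, e.g. -0.5 means 50% fewer
        adjusted = round(auto_count * (1.0 + offset))
    # Assumed behavior: keep the result inside the valid range
    return max(1, min(adjusted, max_components))


print(apply_offset(10, -2, 50))    # 8
print(apply_offset(10, -0.5, 50))  # 5
```

Dispatching on the Python type mirrors the documented distinction: an int of -2 and a float of -2.0 would mean different things, so callers should pass the type they intend.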

decomposition_name (str, optional)

Name for the decomposition result. If None, uses algorithm-based name

data_selector_name (str, optional)

Name of data selector to apply before decomposition

force (bool, default=False)

Force recalculation even if decomposition already exists

Returns

None: Adds PCA decomposition results to pipeline data

Examples

>>> # Basic PCA decomposition
>>> pipeline.decomposition.add.pca("my_features", n_components=10)

>>> # PCA with reproducible results
>>> pipeline.decomposition.add.pca("my_features", n_components=15, random_state=42)

>>> # PCA with custom name and data selector
>>> pipeline.decomposition.add.pca(
...     "distance_features",
...     n_components=20,
...     decomposition_name="distance_pca",
...     data_selector_name="equilibrated_frames"
... )

>>> # Force recalculation of existing PCA
>>> pipeline.decomposition.add.pca(
...     "contact_features",
...     n_components=15,
...     force=True
... )

Notes

PCA is a linear dimensionality reduction technique that preserves the maximum amount of variance in the reduced representation. Components are ordered by explained variance ratio.
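
To make the ordering by explained variance ratio concrete, here is a minimal pure-Python sketch for two-dimensional data. It is not mdxplain's implementation; the function name and the sample points are illustrative assumptions. For a 2x2 covariance matrix the eigenvalues have a closed form, so no linear algebra library is needed.

```python
import math


def explained_variance_ratio_2d(points):
    """Return the explained variance ratios of the two principal components."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centered data
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
    half_trace = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    eigenvalues = [half_trace + radius, half_trace - radius]
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


# Points scattered mostly along the diagonal: the first component dominates
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
ratios = explained_variance_ratio_2d(data)
print(ratios)
```

The ratios sum to 1, and the first entry is the share of total variance captured by the first principal component.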

kernel_pca(selection_name: str, n_components: int | str | None = 'auto', gamma: float | str | None = 'scale', use_nystrom: bool = False, n_landmarks: int = 10000, landmark_selection_mode: str = 'kmeans', random_state: int | None = None, use_parallel: bool = False, n_jobs: int = -1, min_chunk_size: int = 1000, max_blas_threads: int | None = 1, auto_limit_blas: bool = True, offset: int | float = 0, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) → None

Add Kernel PCA decomposition.

Kernel PCA is a nonlinear extension of PCA that uses kernel functions to project data into higher-dimensional spaces where linear PCA can capture nonlinear relationships in the original space.

Parameters

selection_name (str)

Name of feature selection to decompose

n_components (int, str, or None, default="auto")

Number of components. Options:

int: Specific number of components
"auto": Automatic selection via elbow detection (5% of features) [DEFAULT]
None: Uses min(n_samples, n_features)

gamma (float, str, or None, default="scale")

RBF kernel coefficient. Options:

float: Specific gamma value
“scale”: 1.0 / (n_features * variance) [DEFAULT]
“auto”: 1.0 / n_features
None: Uses 1.0 / n_features (same as “auto”)
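
The gamma options can be read as a small lookup followed by the standard RBF kernel formula, k(x, y) = exp(-gamma * ||x - y||^2). The sketch below is illustrative only: resolve_gamma and rbf_kernel are hypothetical helpers, and variance is assumed to be a scalar computed beforehand over the selected feature values.

```python
import math


def resolve_gamma(gamma, n_features, variance):
    """Map a gamma option to a numeric value (hypothetical helper)."""
    if gamma is None or gamma == "auto":
        return 1.0 / n_features
    if gamma == "scale":
        return 1.0 / (n_features * variance)
    return float(gamma)


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)


g = resolve_gamma("scale", n_features=4, variance=0.5)
print(g)                                       # 0.5
print(rbf_kernel([0.0, 0.0], [0.0, 0.0], g))   # 1.0
```

A larger gamma makes the kernel decay faster with distance, so only very similar frames are treated as neighbors.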

use_nystrom (bool, default=False)

Whether to use Nyström approximation for large datasets

n_landmarks (int, default=10000)

Number of landmarks for Nyström approximation

landmark_selection_mode (str, default="kmeans")

Method for landmark selection in Nyström approximation:

“kmeans”: Use KMeans centroids as landmarks (better coverage)
“random”: Use random sampling from data

random_state (int, optional)

Random state for reproducible results

use_parallel (bool, default=False)

Whether to use parallel processing for matrix-vector multiplication

n_jobs (int, default=-1)

Number of parallel jobs (-1 for all available CPU cores)

min_chunk_size (int, default=1000)

Minimum chunk size per parallel process to avoid overhead

max_blas_threads (int or None, default=1)

Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default, or set auto_limit_blas=False to disable thread limiting

auto_limit_blas (bool, default=True)

Apply a safe thread policy: use BLAS=1 when n_jobs != 1, otherwise use max_blas_threads (fallback 2 when None)
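
Read together, max_blas_threads and auto_limit_blas describe a small decision rule. The following sketch shows one reading of that rule; resolve_blas_threads is a hypothetical helper, and returning None to mean "no limit applied" is an assumption rather than documented behavior.

```python
def resolve_blas_threads(n_jobs, max_blas_threads=1, auto_limit_blas=True):
    """Return the BLAS thread limit to apply, or None for no limit."""
    if not auto_limit_blas:
        # Thread limiting is disabled entirely
        return None
    if n_jobs != 1:
        # Several worker processes: one BLAS thread each avoids oversubscription
        return 1
    if max_blas_threads is None:
        # Documented fallback when no preference is given
        return 2
    return max_blas_threads


print(resolve_blas_threads(n_jobs=-1))                       # 1
print(resolve_blas_threads(n_jobs=1, max_blas_threads=4))    # 4
print(resolve_blas_threads(n_jobs=1, max_blas_threads=None)) # 2
```

The point of the policy is that n_jobs worker processes each running a multithreaded BLAS can request far more threads than there are cores, which usually slows the computation down.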

offset (int or float, default=0)

Adjustment to auto-selected component count (only applies when n_components="auto"):

int: Direct addition/subtraction (e.g., -2 selects 2 fewer)
float: Percentage adjustment (e.g., -0.5 selects 50% fewer)

decomposition_name (str, optional)

Name for the decomposition result

data_selector_name (str, optional)

Name of data selector to apply before decomposition

force (bool, default=False)

Force recalculation even if decomposition already exists

Returns

None: Adds Kernel PCA decomposition results to pipeline data

Examples

>>> # Basic RBF Kernel PCA
>>> pipeline.decomposition.add.kernel_pca("my_features", n_components=15)

>>> # Kernel PCA with Nyström approximation for large datasets
>>> pipeline.decomposition.add.kernel_pca(
...     "distance_features",
...     n_components=20,
...     use_nystrom=True,
...     n_landmarks=5000,
...     decomposition_name="nystrom_kpca"
... )

>>> # Parallel Kernel PCA with custom parameters
>>> pipeline.decomposition.add.kernel_pca(
...     "contact_features",
...     n_components=12,
...     gamma=0.1,
...     use_parallel=True,
...     n_jobs=8,
...     data_selector_name="folded_conformations"
... )

Notes

Kernel PCA uses an RBF kernel to capture complex nonlinear patterns in data. Nyström approximation is recommended for datasets with >10,000 samples.

contact_kernel_pca(selection_name: str, n_components: int | str | None = 'auto', gamma: float | str = 'scale', use_nystrom: bool = False, n_landmarks: int = 2000, landmark_selection: str = 'kmeans', random_state: int | None = None, use_parallel: bool = False, n_jobs: int = -1, min_chunk_size: int = 1000, max_blas_threads: int | None = 1, auto_limit_blas: bool = True, offset: int | float = 0, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) → None

Add Contact Kernel PCA decomposition.

Specialized Kernel PCA implementation optimized for contact matrix data with contact-specific kernel functions and regularization.

Parameters

selection_name (str)

Name of contact feature selection to decompose

n_components (int, str, or None, default="auto")

Number of components. Options:

int: Specific number of components
"auto": Automatic selection via elbow detection (5% of features) [DEFAULT]
None: Uses min(n_samples, n_features)

gamma (float or str, default="scale")

Kernel coefficient for Hamming/RBF kernel. Options:

float: Specific gamma value
“scale”: 1.0 / (n_features * variance) [DEFAULT]
“auto”: 1.0 / n_features

use_nystrom (bool, default=False)

Whether to use Nyström approximation for large datasets

n_landmarks (int, default=2000)

Number of landmarks for Nyström approximation

landmark_selection (str, default="kmeans")

Method for landmark selection in Nyström approximation:

“kmeans”: Use KMeans centroids as landmarks (better coverage)
“random”: Use random sampling from data

random_state (int, optional)

Random state for reproducible results

use_parallel (bool, default=False)

Whether to use parallel processing for matrix-vector multiplication

n_jobs (int, default=-1)

Number of parallel jobs (-1 for all available CPU cores)

min_chunk_size (int, default=1000)

Minimum chunk size per parallel process to avoid overhead

max_blas_threads (int or None, default=1)

Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default, or set auto_limit_blas=False to disable thread limiting

auto_limit_blas (bool, default=True)

Apply a safe thread policy: use BLAS=1 when n_jobs != 1, otherwise use max_blas_threads (fallback 2 when None)

offset (int or float, default=0)

Adjustment to auto-selected component count (only applies when n_components="auto"):

int: Direct addition/subtraction (e.g., -2 selects 2 fewer)
float: Percentage adjustment (e.g., -0.5 selects 50% fewer)

decomposition_name (str, optional)

Name for the decomposition result

data_selector_name (str, optional)

Name of data selector to apply before decomposition

force (bool, default=False)

Force recalculation even if decomposition already exists

Returns

None: Adds Contact Kernel PCA decomposition results to pipeline data

Examples

>>> # Basic Contact Kernel PCA
>>> pipeline.decomposition.add.contact_kernel_pca("contact_features", n_components=12)

>>> # Contact Kernel PCA with custom gamma
>>> pipeline.decomposition.add.contact_kernel_pca(
...     "persistent_contacts",
...     n_components=15,
...     gamma=0.5,
...     decomposition_name="contact_modes"
... )

>>> # Contact Kernel PCA with Nyström approximation
>>> pipeline.decomposition.add.contact_kernel_pca(
...     "native_contacts",
...     n_components=8,
...     gamma=2.0,
...     use_nystrom=True,
...     n_landmarks=1000,
...     data_selector_name="folded_states"
... )

Notes

Contact Kernel PCA uses specialized kernels that account for the binary nature of contact data and contact pattern correlations.
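
One reason binary contact data suits a Hamming-style kernel: for 0/1 vectors, the squared Euclidean distance equals the Hamming distance (the number of contacts that differ). The sketch below illustrates that identity with a Hamming-based RBF kernel. The exact kernel used by the library is not specified here, so the kernel form and the function names are assumptions.

```python
import math


def hamming_distance(x, y):
    """Count the positions where two binary contact vectors differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


def hamming_rbf_kernel(x, y, gamma):
    """RBF-style kernel on binary data: exp(-gamma * Hamming distance)."""
    return math.exp(-gamma * hamming_distance(x, y))


# Two frames described by five contacts each (1 = formed, 0 = broken)
frame_a = [1, 0, 1, 1, 0]
frame_b = [1, 1, 1, 0, 0]
sq_euclid = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b))

print(hamming_distance(frame_a, frame_b), sq_euclid)     # 2 2
print(hamming_rbf_kernel(frame_a, frame_a, gamma=0.5))   # 1.0
```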

diffusion_maps(selection_name: str, n_components: int, epsilon: float | None = None, use_nystrom: bool = False, n_landmarks: int = 1000, landmark_selection_mode: str = 'kmeans', alpha: float = 0.0, random_state: int | None = None, epsilon_k: int | None = None, epsilon_n_samples: int | None = None, epsilon_ref_size: int | None = None, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) → None

Add Diffusion Maps decomposition.

Diffusion Maps is a nonlinear dimensionality reduction technique that captures the intrinsic geometry of data manifolds. It’s particularly effective for analyzing conformational transitions in MD trajectories.

Parameters

selection_name (str): Name of feature selection to decompose
n_components (int, required): Number of diffusion components to compute
epsilon (float, optional): Kernel bandwidth parameter for Gaussian kernel. If None, estimated automatically
use_nystrom (bool, default=False): Whether to use Nyström approximation for very large datasets
n_landmarks (int, default=1000): Number of landmarks for Nyström approximation
landmark_selection_mode (str, default="kmeans"): Landmark selection mode for Nyström approximation ("kmeans" or "random")
alpha (float, default=0.0): Diffusion maps alpha normalization parameter (0.0 = no density correction)
random_state (int, optional): Random state for reproducible results
epsilon_k (int, optional, default=None): k for k-NN epsilon estimation when epsilon is None. If None, defaults to clamp(5 * log(n_frames), 20, 100).
epsilon_n_samples (int, optional, default=None): Number of samples used for epsilon estimation. If None, defaults to 5% of frames (capped by ref size).
epsilon_ref_size (int, optional, default=None): Reference pool size used for epsilon estimation. If None, defaults to 25% of frames.
decomposition_name (str, optional): Name for the decomposition result
data_selector_name (str, optional): Name of data selector to apply before decomposition
force (bool, default=False): Force recalculation even if decomposition already exists
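
The documented defaults for the three epsilon estimation parameters can be written out as arithmetic. This is a sketch under stated assumptions, not the library's code: default_epsilon_params is a hypothetical helper, log is taken to be the natural logarithm, and fractional results are truncated to integers with a minimum of 1.

```python
import math


def default_epsilon_params(n_frames, epsilon_k=None,
                           epsilon_n_samples=None, epsilon_ref_size=None):
    """Fill in unset epsilon estimation parameters from the frame count."""
    if epsilon_k is None:
        # clamp(5 * log(n_frames), 20, 100)
        epsilon_k = int(min(100, max(20, 5 * math.log(n_frames))))
    if epsilon_ref_size is None:
        # 25% of frames
        epsilon_ref_size = max(1, n_frames // 4)
    if epsilon_n_samples is None:
        # 5% of frames, capped by the reference pool size
        epsilon_n_samples = min(max(1, n_frames // 20), epsilon_ref_size)
    return epsilon_k, epsilon_n_samples, epsilon_ref_size


print(default_epsilon_params(10000))  # (46, 500, 2500)
print(default_epsilon_params(10))     # (20, 1, 2)
```

For short trajectories the lower clamp of 20 dominates epsilon_k, while the upper clamp of 100 only takes effect for hundreds of millions of frames under the natural-log assumption.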

Returns

None: Adds Diffusion Maps decomposition results to pipeline data

Examples

>>> # Basic Diffusion Maps
>>> pipeline.decomposition.add.diffusion_maps("my_features", n_components=12)

>>> # Diffusion Maps with custom epsilon
>>> pipeline.decomposition.add.diffusion_maps(
...     "distance_features",
...     n_components=20,
...     epsilon=1.5,
...     decomposition_name="transition_coordinates"
... )

>>> # Diffusion Maps with Nyström approximation
>>> pipeline.decomposition.add.diffusion_maps(
...     "conformational_features",
...     n_components=15,
...     use_nystrom=True,
...     n_landmarks=2000,
...     data_selector_name="equilibrated_frames"
... )

Notes

Diffusion Maps preserves diffusion distances and is excellent for identifying slow conformational coordinates and transition pathways. Uses RMSD-based distances to construct the diffusion process.
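
The roles of epsilon and alpha are easiest to see in the textbook construction of the diffusion operator from a pairwise distance matrix. The sketch below follows that standard recipe in pure Python. It is not taken from mdxplain: the function name, the bandwidth convention exp(-d^2 / epsilon), and the toy distance values are assumptions.

```python
import math


def diffusion_transition_matrix(dist, epsilon, alpha=0.0):
    """Build a row-stochastic diffusion matrix from pairwise distances."""
    n = len(dist)
    # Gaussian kernel; the exact bandwidth convention is an assumption
    kernel = [[math.exp(-dist[i][j] ** 2 / epsilon) for j in range(n)]
              for i in range(n)]
    # Alpha normalization: divide by (q_i * q_j) ** alpha, q = kernel row sums
    q = [sum(row) for row in kernel]
    kernel = [[kernel[i][j] / ((q[i] * q[j]) ** alpha) for j in range(n)]
              for i in range(n)]
    # Row-normalize to obtain Markov transition probabilities
    return [[v / sum(row) for v in row] for row in kernel]


# Toy pairwise distances between three frames (e.g. RMSD values)
distances = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
P = diffusion_transition_matrix(distances, epsilon=1.0, alpha=0.0)
for row in P:
    print([round(v, 3) for v in row])
```

Each row of P sums to 1, and the diffusion components are obtained from the leading non-trivial eigenvectors of this matrix. With alpha=0.0 no density correction is applied; alpha=1.0 removes the influence of sampling density on the resulting coordinates.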