Decomposition Services

Decomposition factory implementations for simplified API access.

This module provides factory classes that simplify decomposition management by hiding explicit type instantiation from the user.

Decomposition Add Service

Factory for adding decomposition algorithms with simplified syntax.

class mdxplain.decomposition.services.decomposition_add_service.DecompositionAddService(manager: DecompositionManager, pipeline_data: PipelineData)

Service for adding decomposition algorithms without explicit type instantiation.

This service provides an intuitive interface for adding decomposition algorithms without requiring users to import and instantiate decomposition types directly. Each method accepts the parameters of its decomposition type together with the manager.add parameters in a single call.
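The factory idea can be illustrated with a minimal sketch. Every class below (ToyPCA, ToyManager, ToyAddService) is a hypothetical stand-in written for this example, not the mdxplain implementation; it only shows how one call might split its keyword arguments between a type object and a manager-level add().

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToyPCA:
    """Hypothetical stand-in for a decomposition type object."""
    n_components: Optional[int] = None
    random_state: Optional[int] = None


class ToyManager:
    """Hypothetical stand-in for a manager with an explicit add() method."""

    def __init__(self) -> None:
        self.calls = []

    def add(self, selection_name: str, decomposition_type: Any,
            decomposition_name: Optional[str] = None,
            force: bool = False) -> None:
        # Record the request instead of running a real decomposition.
        self.calls.append({
            "selection_name": selection_name,
            "decomposition_type": decomposition_type,
            "decomposition_name": decomposition_name,
            "force": force,
        })


class ToyAddService:
    """Factory facade: builds the type object and forwards to manager.add()."""

    def __init__(self, manager: ToyManager) -> None:
        self._manager = manager

    def pca(self, selection_name: str, n_components: Optional[int] = None,
            random_state: Optional[int] = None,
            decomposition_name: Optional[str] = None,
            force: bool = False) -> None:
        # Type-specific keyword arguments go to the type object ...
        decomposition_type = ToyPCA(n_components=n_components,
                                    random_state=random_state)
        # ... and manager-level keyword arguments go to manager.add().
        self._manager.add(selection_name, decomposition_type,
                          decomposition_name=decomposition_name, force=force)


manager = ToyManager()
ToyAddService(manager).pca("my_features", n_components=10, force=True)
```

The caller never imports ToyPCA; the service is the only place that knows which type object belongs to which method.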

Examples

>>> pipeline.decomposition.add.pca("my_features", n_components=10)
>>> pipeline.decomposition.add.kernel_pca("contact_features", n_components=20)
>>> pipeline.decomposition.add.diffusion_maps("distance_features", n_components=15)

__init__(manager: DecompositionManager, pipeline_data: PipelineData) -> None

Initialize factory with manager and pipeline data.

Parameters

manager : DecompositionManager

Decomposition manager instance

pipeline_data : PipelineData

Pipeline data container (injected by AutoInjectProxy)

Returns

None

pca(selection_name: str, n_components: int | str | None = 'auto', random_state: int | None = None, offset: int | float = 0, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) -> None

Add PCA (Principal Component Analysis) decomposition.

PCA reduces dimensionality by finding the directions of maximum variance in the data and projecting the data onto these principal components.

Parameters

selection_name : str

Name of feature selection to decompose

n_components : int, str, or None, default="auto"

Number of principal components. Options:

  • int: Specific number of components

  • “auto”: Automatic selection via elbow detection (5% of features) [DEFAULT]

  • None: Uses min(n_samples, n_features)

random_state : int, optional

Random state for reproducible results

offset : int or float, default=0

Adjustment to auto-selected component count (only applies when n_components="auto"):

  • int: Direct addition/subtraction (e.g., -2 selects 2 fewer)

  • float: Percentage adjustment (e.g., -0.5 selects 50% fewer)

decomposition_name : str, optional

Name for the decomposition result. If None, an algorithm-based name is used

data_selector_name : str, optional

Name of data selector to apply before decomposition

force : bool, default=False

Force recalculation even if decomposition already exists

Returns

None

Adds PCA decomposition results to pipeline data

Examples

>>> # Basic PCA decomposition
>>> pipeline.decomposition.add.pca("my_features", n_components=10)
>>> # PCA with reproducible results
>>> pipeline.decomposition.add.pca("my_features", n_components=15, random_state=42)
>>> # PCA with custom name and data selector
>>> pipeline.decomposition.add.pca(
...     "distance_features",
...     n_components=20,
...     decomposition_name="distance_pca",
...     data_selector_name="equilibrated_frames"
... )
>>> # Force recalculation of existing PCA
>>> pipeline.decomposition.add.pca(
...     "contact_features",
...     n_components=15,
...     force=True
... )

Notes

PCA is a linear dimensionality reduction technique that preserves the maximum amount of variance in the reduced representation. Components are ordered by explained variance ratio.
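The n_components and offset rules listed under Parameters can be sketched as follows. The helper resolve_n_components is hypothetical and written only for illustration; the elbow detection itself is not reproduced (its result is passed in as auto_count), and the rounding and clamping choices are assumptions rather than documented behavior.

```python
from typing import Optional, Union


def resolve_n_components(n_components: Union[int, str, None],
                         n_samples: int,
                         n_features: int,
                         auto_count: int,
                         offset: Union[int, float] = 0) -> int:
    """Resolve the final component count (hypothetical helper).

    auto_count stands for whatever the automatic selection returned.
    """
    upper = min(n_samples, n_features)
    if n_components is None:
        # None: use min(n_samples, n_features).
        return upper
    if n_components != "auto":
        # int: use the requested number directly.
        return int(n_components)
    if isinstance(offset, int):
        # int offset: direct addition/subtraction.
        adjusted = auto_count + offset
    else:
        # float offset: percentage adjustment, e.g. -0.5 keeps half.
        adjusted = round(auto_count * (1.0 + offset))
    # Assumption: keep the result within [1, min(n_samples, n_features)].
    return max(1, min(adjusted, upper))


fewer_by_two = resolve_n_components("auto", 1000, 200, auto_count=10, offset=-2)
fewer_by_half = resolve_n_components("auto", 1000, 200, auto_count=10, offset=-0.5)
```

With an automatic count of 10, an offset of -2 gives 8 components and an offset of -0.5 gives 5.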

kernel_pca(selection_name: str, n_components: int | str | None = 'auto', gamma: float | str | None = 'scale', use_nystrom: bool = False, n_landmarks: int = 10000, landmark_selection_mode: str = 'kmeans', random_state: int | None = None, use_parallel: bool = False, n_jobs: int = -1, min_chunk_size: int = 1000, max_blas_threads: int | None = 1, auto_limit_blas: bool = True, offset: int | float = 0, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) -> None

Add Kernel PCA decomposition.

Kernel PCA is a nonlinear extension of PCA that uses kernel functions to project data into higher-dimensional spaces where linear PCA can capture nonlinear relationships in the original space.

Parameters

selection_name : str

Name of feature selection to decompose

n_components : int, str, or None, default="auto"

Number of components. Options:

  • int: Specific number of components

  • “auto”: Automatic selection via elbow detection (5% of features) [DEFAULT]

  • None: Uses min(n_samples, n_features)

gamma : float, str, or None, default="scale"

RBF kernel coefficient. Options:

  • float: Specific gamma value

  • “scale”: 1.0 / (n_features * variance) [DEFAULT]

  • “auto”: 1.0 / n_features

  • None: Uses 1.0 / n_features (same as “auto”)

use_nystrom : bool, default=False

Whether to use Nyström approximation for large datasets

n_landmarks : int, default=10000

Number of landmarks for Nyström approximation

landmark_selection_mode : str, default="kmeans"

Method for landmark selection in Nyström approximation:

  • “kmeans”: Use KMeans centroids as landmarks (better coverage)

  • “random”: Use random sampling from data

random_state : int, optional

Random state for reproducible results

use_parallel : bool, default=False

Whether to use parallel processing for matrix-vector multiplication

n_jobs : int, default=-1

Number of parallel jobs (-1 for all available CPU cores)

min_chunk_size : int, default=1000

Minimum chunk size per parallel process to avoid overhead

max_blas_threads : int or None, default=1

Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default; set auto_limit_blas=False to disable thread limiting

auto_limit_blas : bool, default=True

Apply a safe thread policy: use a single BLAS thread when n_jobs != 1; otherwise use max_blas_threads (2 if max_blas_threads is None)

offset : int or float, default=0

Adjustment to auto-selected component count (only applies when n_components="auto"):

  • int: Direct addition/subtraction (e.g., -2 selects 2 fewer)

  • float: Percentage adjustment (e.g., -0.5 selects 50% fewer)

decomposition_name : str, optional

Name for the decomposition result

data_selector_name : str, optional

Name of data selector to apply before decomposition

force : bool, default=False

Force recalculation even if decomposition already exists
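The interaction between n_jobs, max_blas_threads and auto_limit_blas described above can be read as the following decision rule. This is one reading of the parameter descriptions, not the library's code: effective_blas_threads is a hypothetical helper, and returning None to mean "no limit applied" is an assumption made for this sketch.

```python
from typing import Optional


def effective_blas_threads(n_jobs: int,
                           max_blas_threads: Optional[int] = 1,
                           auto_limit_blas: bool = True) -> Optional[int]:
    """Return the BLAS thread limit to apply, or None for no limit.

    Hypothetical helper mirroring the documented policy.
    """
    if not auto_limit_blas:
        # Thread limiting is disabled entirely.
        return None
    if n_jobs != 1:
        # Several worker processes: one BLAS thread each avoids
        # oversubscribing the CPU.
        return 1
    # Single process: honor the preferred limit, falling back to 2.
    return max_blas_threads if max_blas_threads is not None else 2


limit_parallel = effective_blas_threads(n_jobs=8, max_blas_threads=4)
limit_serial = effective_blas_threads(n_jobs=1, max_blas_threads=4)
```

Under this reading, max_blas_threads only takes effect when n_jobs is exactly 1.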

Returns

None

Adds Kernel PCA decomposition results to pipeline data

Examples

>>> # Basic RBF Kernel PCA
>>> pipeline.decomposition.add.kernel_pca("my_features", n_components=15)
>>> # Kernel PCA with Nyström approximation for large datasets
>>> pipeline.decomposition.add.kernel_pca(
...     "distance_features",
...     n_components=20,
...     use_nystrom=True,
...     n_landmarks=5000,
...     decomposition_name="nystrom_kpca"
... )
>>> # Parallel Kernel PCA with custom parameters
>>> pipeline.decomposition.add.kernel_pca(
...     "contact_features",
...     n_components=12,
...     gamma=0.1,
...     use_parallel=True,
...     n_jobs=8,
...     data_selector_name="folded_conformations"
... )

Notes

Kernel PCA uses an RBF kernel to capture complex nonlinear patterns in data. The Nyström approximation is recommended for datasets with more than 10,000 samples.
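The gamma options and the RBF kernel they feed into can be sketched in plain Python. The helpers resolve_gamma and rbf_kernel are hypothetical names for this example, and taking the variance over all matrix entries (as scikit-learn does for its own gamma="scale") is an assumption about what "variance" means here.

```python
import math
from statistics import pvariance
from typing import List, Sequence, Union

Matrix = List[List[float]]


def resolve_gamma(gamma: Union[float, str, None], X: Matrix) -> float:
    """Translate the gamma option into a number (hypothetical helper)."""
    n_features = len(X[0])
    if gamma is None or gamma == "auto":
        # "auto" and None: 1.0 / n_features
        return 1.0 / n_features
    if gamma == "scale":
        # "scale": 1.0 / (n_features * variance); the variance is taken
        # over all entries of X (an assumption, see lead-in).
        variance = pvariance([value for row in X for value in row])
        return 1.0 / (n_features * variance)
    # float: use the given value as is.
    return float(gamma)


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


X = [[0.0, 1.0], [2.0, 3.0]]
gamma_scale = resolve_gamma("scale", X)
similarity = rbf_kernel(X[0], X[1], gamma_scale)
```

A larger gamma makes the kernel decay faster with distance, so only very similar frames count as neighbors.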

contact_kernel_pca(selection_name: str, n_components: int | str | None = 'auto', gamma: float | str = 'scale', use_nystrom: bool = False, n_landmarks: int = 2000, landmark_selection: str = 'kmeans', random_state: int | None = None, use_parallel: bool = False, n_jobs: int = -1, min_chunk_size: int = 1000, max_blas_threads: int | None = 1, auto_limit_blas: bool = True, offset: int | float = 0, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) -> None

Add Contact Kernel PCA decomposition.

Specialized Kernel PCA implementation optimized for contact matrix data with contact-specific kernel functions and regularization.

Parameters

selection_name : str

Name of contact feature selection to decompose

n_components : int, str, or None, default="auto"

Number of components. Options:

  • int: Specific number of components

  • “auto”: Automatic selection via elbow detection (5% of features) [DEFAULT]

  • None: Uses min(n_samples, n_features)

gamma : float or str, default="scale"

Kernel coefficient for Hamming/RBF kernel. Options:

  • float: Specific gamma value

  • “scale”: 1.0 / (n_features * variance) [DEFAULT]

  • “auto”: 1.0 / n_features

use_nystrom : bool, default=False

Whether to use Nyström approximation for large datasets

n_landmarks : int, default=2000

Number of landmarks for Nyström approximation

landmark_selection : str, default="kmeans"

Method for landmark selection in Nyström approximation:

  • “kmeans”: Use KMeans centroids as landmarks (better coverage)

  • “random”: Use random sampling from data

random_state : int, optional

Random state for reproducible results

use_parallel : bool, default=False

Whether to use parallel processing for matrix-vector multiplication

n_jobs : int, default=-1

Number of parallel jobs (-1 for all available CPU cores)

min_chunk_size : int, default=1000

Minimum chunk size per parallel process to avoid overhead

max_blas_threads : int or None, default=1

Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default; set auto_limit_blas=False to disable thread limiting

auto_limit_blas : bool, default=True

Apply a safe thread policy: use a single BLAS thread when n_jobs != 1; otherwise use max_blas_threads (2 if max_blas_threads is None)

offset : int or float, default=0

Adjustment to auto-selected component count (only applies when n_components="auto"):

  • int: Direct addition/subtraction (e.g., -2 selects 2 fewer)

  • float: Percentage adjustment (e.g., -0.5 selects 50% fewer)

decomposition_name : str, optional

Name for the decomposition result

data_selector_name : str, optional

Name of data selector to apply before decomposition

force : bool, default=False

Force recalculation even if decomposition already exists

Returns

None

Adds Contact Kernel PCA decomposition results to pipeline data

Examples

>>> # Basic Contact Kernel PCA
>>> pipeline.decomposition.add.contact_kernel_pca("contact_features", n_components=12)
>>> # Contact Kernel PCA with custom gamma
>>> pipeline.decomposition.add.contact_kernel_pca(
...     "persistent_contacts",
...     n_components=15,
...     gamma=0.5,
...     decomposition_name="contact_modes"
... )
>>> # Contact Kernel PCA with Nyström approximation
>>> pipeline.decomposition.add.contact_kernel_pca(
...     "native_contacts",
...     n_components=8,
...     gamma=2.0,
...     use_nystrom=True,
...     n_landmarks=1000,
...     data_selector_name="folded_states"
... )

Notes

Contact Kernel PCA uses specialized kernels that account for the binary nature of contact data and contact pattern correlations.
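The "Hamming/RBF" wording in the gamma description has a simple reading for binary contact vectors: when every entry is 0 or 1, the squared Euclidean distance equals the Hamming distance, so an RBF kernel on contacts reduces to exp(-gamma * Hamming distance). The sketch below demonstrates that identity; whether mdxplain's contact kernel takes exactly this form is an assumption, and all function names are hypothetical.

```python
import math
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of contacts that differ between two frames."""
    return sum(x != y for x, y in zip(a, b))


def squared_euclidean(a: Sequence[int], b: Sequence[int]) -> int:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contact_rbf_kernel(a: Sequence[int], b: Sequence[int], gamma: float) -> float:
    """RBF kernel on binary contacts, written via the Hamming distance."""
    return math.exp(-gamma * hamming_distance(a, b))


# Two frames described by five binary contacts each (1 = contact formed).
frame_a = [1, 0, 1, 1, 0]
frame_b = [1, 1, 0, 1, 0]

differing_contacts = hamming_distance(frame_a, frame_b)
similarity = contact_rbf_kernel(frame_a, frame_b, gamma=0.5)
```

Here the two frames differ in two contacts, so both distance functions return 2 and the kernel value is exp(-1).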

diffusion_maps(selection_name: str, n_components: int, epsilon: float | None = None, use_nystrom: bool = False, n_landmarks: int = 1000, landmark_selection_mode: str = 'kmeans', alpha: float = 0.0, random_state: int | None = None, epsilon_k: int | None = None, epsilon_n_samples: int | None = None, epsilon_ref_size: int | None = None, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) -> None

Add Diffusion Maps decomposition.

Diffusion Maps is a nonlinear dimensionality reduction technique that captures the intrinsic geometry of data manifolds. It’s particularly effective for analyzing conformational transitions in MD trajectories.

Parameters

selection_name : str

Name of feature selection to decompose

n_components : int, required

Number of diffusion components to compute

epsilon : float, optional

Kernel bandwidth parameter for Gaussian kernel. If None, estimated automatically

use_nystrom : bool, default=False

Whether to use Nyström approximation for very large datasets

n_landmarks : int, default=1000

Number of landmarks for Nyström approximation

landmark_selection_mode : str, default="kmeans"

Landmark selection mode for Nyström approximation (“kmeans” or “random”)

alpha : float, default=0.0

Diffusion maps alpha normalization parameter (0.0 = no density correction)

random_state : int, optional

Random state for reproducible results

epsilon_k : int, optional, default=None

k for k-NN epsilon estimation when epsilon is None. If None, defaults to clamp(5 * log(n_frames), 20, 100).

epsilon_n_samples : int, optional, default=None

Number of samples used for epsilon estimation. If None, defaults to 5% of frames (capped by epsilon_ref_size).

epsilon_ref_size : int, optional, default=None

Reference pool size used for epsilon estimation. If None, defaults to 25% of frames.

decomposition_name : str, optional

Name for the decomposition result

data_selector_name : str, optional

Name of data selector to apply before decomposition

force : bool, default=False

Force recalculation even if decomposition already exists

Returns

None

Adds Diffusion Maps decomposition results to pipeline data

Examples

>>> # Basic Diffusion Maps
>>> pipeline.decomposition.add.diffusion_maps("my_features", n_components=12)
>>> # Diffusion Maps with custom epsilon
>>> pipeline.decomposition.add.diffusion_maps(
...     "distance_features",
...     n_components=20,
...     epsilon=1.5,
...     decomposition_name="transition_coordinates"
... )
>>> # Diffusion Maps with Nyström approximation
>>> pipeline.decomposition.add.diffusion_maps(
...     "conformational_features",
...     n_components=15,
...     use_nystrom=True,
...     n_landmarks=2000,
...     data_selector_name="equilibrated_frames"
... )

Notes

Diffusion Maps preserves diffusion distances and is well suited to identifying slow conformational coordinates and transition pathways. It uses RMSD-based distances to construct the diffusion process.
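The documented defaults for epsilon_k, epsilon_n_samples and epsilon_ref_size can be worked through with a short sketch. The helper default_epsilon_settings is hypothetical; using the natural logarithm, truncating to integers and enforcing a minimum of 1 are assumptions, since the parameter descriptions do not specify them.

```python
import math
from typing import Optional, Tuple


def default_epsilon_settings(n_frames: int,
                             epsilon_k: Optional[int] = None,
                             epsilon_n_samples: Optional[int] = None,
                             epsilon_ref_size: Optional[int] = None
                             ) -> Tuple[int, int, int]:
    """Fill in unset epsilon-estimation parameters (hypothetical helper).

    Returns (epsilon_k, epsilon_n_samples, epsilon_ref_size).
    """
    if epsilon_k is None:
        # clamp(5 * log(n_frames), 20, 100), natural log assumed
        epsilon_k = int(min(max(5 * math.log(n_frames), 20), 100))
    if epsilon_ref_size is None:
        # 25% of frames
        epsilon_ref_size = max(1, n_frames // 4)
    if epsilon_n_samples is None:
        # 5% of frames, capped by the reference pool size
        epsilon_n_samples = min(max(1, n_frames // 20), epsilon_ref_size)
    return epsilon_k, epsilon_n_samples, epsilon_ref_size


settings_10k = default_epsilon_settings(10_000)
```

For a 10,000-frame trajectory this gives k = 46, 500 samples and a reference pool of 2,500 frames; short trajectories hit the lower clamp of k = 20.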