Decomposition Services
Decomposition factory implementations for simplified API access.
This module provides factory classes that simplify decomposition management by hiding explicit type instantiation from the user.
Decomposition Add Service
Factory for adding decomposition algorithms with simplified syntax.
- class mdxplain.decomposition.services.decomposition_add_service.DecompositionAddService(manager: DecompositionManager, pipeline_data: PipelineData)
Service for adding decomposition algorithms without explicit type instantiation.
This service provides an intuitive interface for adding decomposition algorithms without requiring users to import and instantiate decomposition types directly. All decomposition type parameters are combined with manager.add parameters.
Examples
>>> pipeline.decomposition.add.pca("my_features", n_components=10)
>>> pipeline.decomposition.add.kernel_pca("contact_features", n_components=20)
>>> pipeline.decomposition.add.diffusion_maps("distance_features", n_components=15)
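The forwarding idea behind the service can be sketched as follows. This is a minimal illustration and not mdxplain's implementation: `PCAType`, `ToyManager`, and `ToyAddService` are hypothetical stand-ins, and the split between algorithm parameters and `add` parameters is an assumption based on the signatures documented on this page.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union


@dataclass
class PCAType:
    """Hypothetical decomposition type that holds algorithm parameters."""
    n_components: Union[int, str, None] = "auto"
    random_state: Optional[int] = None


class ToyManager:
    """Hypothetical manager that only records what it was asked to add."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def add(self, pipeline_data, selection_name, decomposition_type,
            decomposition_name=None, data_selector_name=None, force=False):
        self.calls.append({
            "selection_name": selection_name,
            "decomposition_type": decomposition_type,
            "decomposition_name": decomposition_name,
            "data_selector_name": data_selector_name,
            "force": force,
        })


class ToyAddService:
    """Hypothetical factory: builds the type object so the caller never has to."""

    def __init__(self, manager, pipeline_data):
        self._manager = manager
        self._pipeline_data = pipeline_data

    def pca(self, selection_name, n_components="auto", random_state=None,
            decomposition_name=None, data_selector_name=None, force=False):
        # Algorithm parameters go into the type object.
        decomposition_type = PCAType(n_components=n_components,
                                     random_state=random_state)
        # The remaining parameters are forwarded to the manager's add call.
        self._manager.add(
            self._pipeline_data, selection_name, decomposition_type,
            decomposition_name=decomposition_name,
            data_selector_name=data_selector_name,
            force=force,
        )


manager = ToyManager()
service = ToyAddService(manager, pipeline_data={})
service.pca("my_features", n_components=10, force=True)
print(manager.calls[0]["decomposition_type"])
```

The caller passes one flat set of keyword arguments, and the service decides which of them configure the algorithm and which configure the add operation.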
- __init__(manager: DecompositionManager, pipeline_data: PipelineData) -> None
Initialize factory with manager and pipeline data.
Parameters
- manager : DecompositionManager
Decomposition manager instance
- pipeline_data : PipelineData
Pipeline data container (injected by AutoInjectProxy)
Returns
None
- pca(selection_name: str, n_components: int | str | None = 'auto', random_state: int | None = None, offset: int | float = 0, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) -> None
Add PCA (Principal Component Analysis) decomposition.
PCA reduces dimensionality by finding the directions of maximum variance in the data and projecting the data onto these principal components.
Parameters
- selection_name : str
Name of feature selection to decompose
- n_components : int, str, or None, default="auto"
Number of principal components. Options:
int: Specific number of components
"auto": Automatic selection via elbow detection (5% of features) [DEFAULT]
None: Uses min(n_samples, n_features)
- random_state : int, optional
Random state for reproducible results
- offset : int or float, default=0
Adjustment to auto-selected component count (only applies when n_components="auto"):
int: Direct addition/subtraction (e.g., -2 selects 2 fewer)
float: Percentage adjustment (e.g., -0.5 selects 50% fewer)
- decomposition_name : str, optional
Name for the decomposition result. If None, uses algorithm-based name
- data_selector_name : str, optional
Name of data selector to apply before decomposition
- force : bool, default=False
Force recalculation even if decomposition already exists
Returns
- None
Adds PCA decomposition results to pipeline data
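The offset rule from the parameter list can be sketched as a small helper. `apply_offset` is a hypothetical name, and the rounding and the clamping to the range 1..max_components are assumptions; the page only states that an int is added directly and a float is applied as a percentage.

```python
from typing import Union


def apply_offset(auto_count: int, offset: Union[int, float],
                 max_components: int) -> int:
    """Adjust an auto-selected component count (hypothetical helper)."""
    if isinstance(offset, bool):
        raise TypeError("offset must be an int or a float, not a bool")
    if isinstance(offset, int):
        # int: direct addition or subtraction, e.g. -2 selects 2 fewer.
        adjusted = auto_count + offset
    else:
        # float: percentage adjustment, e.g. -0.5 selects 50% fewer.
        adjusted = round(auto_count * (1.0 + offset))
    # Assumption: keep the result within a usable range.
    return max(1, min(adjusted, max_components))


print(apply_offset(10, -2, max_components=50))
print(apply_offset(10, -0.5, max_components=50))
```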
Examples
>>> # Basic PCA decomposition
>>> pipeline.decomposition.add.pca("my_features", n_components=10)

>>> # PCA with reproducible results
>>> pipeline.decomposition.add.pca("my_features", n_components=15, random_state=42)

>>> # PCA with custom name and data selector
>>> pipeline.decomposition.add.pca(
...     "distance_features",
...     n_components=20,
...     decomposition_name="distance_pca",
...     data_selector_name="equilibrated_frames"
... )

>>> # Force recalculation of existing PCA
>>> pipeline.decomposition.add.pca(
...     "contact_features",
...     n_components=15,
...     force=True
... )
Notes
PCA is a linear dimensionality reduction technique that preserves the maximum amount of variance in the reduced representation. Components are ordered by explained variance ratio.
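To make the ordering by explained variance ratio concrete, here is a minimal NumPy sketch of PCA via singular value decomposition. It illustrates the technique on toy data and is not the code path mdxplain uses; the data and variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy data: 200 frames, 3 features with very different spreads.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

# PCA operates on mean-centered data.
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions; singular values come sorted
# in descending order, so the components are ordered by variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_variance = S**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project onto the first two principal components.
n_components = 2
projected = X_centered @ Vt[:n_components].T

print(explained_variance_ratio)
print(projected.shape)
```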
- kernel_pca(selection_name: str, n_components: int | str | None = 'auto', gamma: float | str | None = 'scale', use_nystrom: bool = False, n_landmarks: int = 10000, landmark_selection_mode: str = 'kmeans', random_state: int | None = None, use_parallel: bool = False, n_jobs: int = -1, min_chunk_size: int = 1000, max_blas_threads: int | None = 1, auto_limit_blas: bool = True, offset: int | float = 0, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) -> None
Add Kernel PCA decomposition.
Kernel PCA is a nonlinear extension of PCA that uses kernel functions to project data into higher-dimensional spaces where linear PCA can capture nonlinear relationships in the original space.
Parameters
- selection_name : str
Name of feature selection to decompose
- n_components : int, str, or None, default="auto"
Number of components. Options:
int: Specific number of components
"auto": Automatic selection via elbow detection (5% of features) [DEFAULT]
None: Uses min(n_samples, n_features)
- gamma : float, str, or None, default="scale"
RBF kernel coefficient. Options:
float: Specific gamma value
"scale": 1.0 / (n_features * variance) [DEFAULT]
"auto": 1.0 / n_features
None: Uses 1.0 / n_features (same as "auto")
- use_nystrom : bool, default=False
Whether to use Nyström approximation for large datasets
- n_landmarks : int, default=10000
Number of landmarks for Nyström approximation
- landmark_selection_mode : str, default="kmeans"
Method for landmark selection in Nyström approximation:
"kmeans": Use KMeans centroids as landmarks (better coverage)
"random": Use random sampling from data
- random_state : int, optional
Random state for reproducible results
- use_parallel : bool, default=False
Whether to use parallel processing for matrix-vector multiplication
- n_jobs : int, default=-1
Number of parallel jobs (-1 for all available CPU cores)
- min_chunk_size : int, default=1000
Minimum chunk size per parallel process to avoid overhead
- max_blas_threads : int or None, default=1
Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default; set auto_limit_blas=False to disable thread limiting
- auto_limit_blas : bool, default=True
Apply a safe thread policy: use BLAS=1 when n_jobs != 1, otherwise use max_blas_threads (fallback 2 when None)
- offset : int or float, default=0
Adjustment to auto-selected component count (only applies when n_components="auto"):
int: Direct addition/subtraction (e.g., -2 selects 2 fewer)
float: Percentage adjustment (e.g., -0.5 selects 50% fewer)
- decomposition_name : str, optional
Name for the decomposition result
- data_selector_name : str, optional
Name of data selector to apply before decomposition
- force : bool, default=False
Force recalculation even if decomposition already exists
Returns
- None
Adds Kernel PCA decomposition results to pipeline data
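The thread policy described for max_blas_threads and auto_limit_blas can be read as the following decision rule. `resolve_blas_threads` is a hypothetical helper written from the parameter descriptions only; returning None to mean "no limit applied" is an assumption of this sketch.

```python
from typing import Optional


def resolve_blas_threads(n_jobs: int,
                         max_blas_threads: Optional[int] = 1,
                         auto_limit_blas: bool = True) -> Optional[int]:
    """Return the BLAS thread limit to apply, or None for no limit."""
    if not auto_limit_blas:
        # Thread limiting is disabled entirely.
        return None
    if n_jobs != 1:
        # Several worker processes: one BLAS thread each avoids oversubscription.
        return 1
    # Single job: honor the preferred limit, falling back to 2 when None.
    return 2 if max_blas_threads is None else max_blas_threads


print(resolve_blas_threads(n_jobs=-1))
print(resolve_blas_threads(n_jobs=1, max_blas_threads=None))
```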
Examples
>>> # Basic RBF Kernel PCA
>>> pipeline.decomposition.add.kernel_pca("my_features", n_components=15)

>>> # Kernel PCA with Nyström approximation for large datasets
>>> pipeline.decomposition.add.kernel_pca(
...     "distance_features",
...     n_components=20,
...     use_nystrom=True,
...     n_landmarks=5000,
...     decomposition_name="nystrom_kpca"
... )

>>> # Parallel Kernel PCA with custom parameters
>>> pipeline.decomposition.add.kernel_pca(
...     "contact_features",
...     n_components=12,
...     gamma=0.1,
...     use_parallel=True,
...     n_jobs=8,
...     data_selector_name="folded_conformations"
... )
Notes
Kernel PCA uses an RBF kernel to capture complex nonlinear patterns in data. Nyström approximation is recommended for datasets with more than 10,000 samples.
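The gamma options listed for kernel_pca can be illustrated with a short pure-Python sketch. `resolve_gamma` and `rbf_kernel` are hypothetical helpers, and taking the variance over all entries of the feature matrix is an assumption; the page only gives the formula 1.0 / (n_features * variance).

```python
import math
from statistics import pvariance


def resolve_gamma(gamma, X):
    """Turn a gamma option into a number (hypothetical helper)."""
    n_features = len(X[0])
    if gamma is None or gamma == "auto":
        return 1.0 / n_features
    if gamma == "scale":
        # Assumption: variance is computed over every entry of X.
        variance = pvariance([value for row in X for value in row])
        return 1.0 / (n_features * variance)
    return float(gamma)


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Toy feature matrix: 4 frames, 2 features.
X = [[0.0, 4.0], [4.0, 0.0], [0.0, 0.0], [4.0, 4.0]]

gamma_scale = resolve_gamma("scale", X)
gamma_auto = resolve_gamma("auto", X)
print(gamma_scale, gamma_auto)
print(rbf_kernel(X[0], X[2], gamma_scale))
```

With "scale", features that vary widely yield a smaller gamma and therefore a wider kernel, which is why it is a reasonable default when feature magnitudes are unknown.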
- contact_kernel_pca(selection_name: str, n_components: int | str | None = 'auto', gamma: float | str = 'scale', use_nystrom: bool = False, n_landmarks: int = 2000, landmark_selection: str = 'kmeans', random_state: int | None = None, use_parallel: bool = False, n_jobs: int = -1, min_chunk_size: int = 1000, max_blas_threads: int | None = 1, auto_limit_blas: bool = True, offset: int | float = 0, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) -> None
Add Contact Kernel PCA decomposition.
Specialized Kernel PCA implementation optimized for contact matrix data with contact-specific kernel functions and regularization.
Parameters
- selection_name : str
Name of contact feature selection to decompose
- n_components : int, str, or None, default="auto"
Number of components. Options:
int: Specific number of components
"auto": Automatic selection via elbow detection (5% of features) [DEFAULT]
None: Uses min(n_samples, n_features)
- gamma : float or str, default="scale"
Kernel coefficient for Hamming/RBF kernel. Options:
float: Specific gamma value
"scale": 1.0 / (n_features * variance) [DEFAULT]
"auto": 1.0 / n_features
- use_nystrom : bool, default=False
Whether to use Nyström approximation for large datasets
- n_landmarks : int, default=2000
Number of landmarks for Nyström approximation
- landmark_selection : str, default="kmeans"
Method for landmark selection in Nyström approximation:
"kmeans": Use KMeans centroids as landmarks (better coverage)
"random": Use random sampling from data
- random_state : int, optional
Random state for reproducible results
- use_parallel : bool, default=False
Whether to use parallel processing for matrix-vector multiplication
- n_jobs : int, default=-1
Number of parallel jobs (-1 for all available CPU cores)
- min_chunk_size : int, default=1000
Minimum chunk size per parallel process to avoid overhead
- max_blas_threads : int or None, default=1
Preferred BLAS/OpenMP thread limit. Pass None to fall back to a safe default; set auto_limit_blas=False to disable thread limiting
- auto_limit_blas : bool, default=True
Apply a safe thread policy: use BLAS=1 when n_jobs != 1, otherwise use max_blas_threads (fallback 2 when None)
- offset : int or float, default=0
Adjustment to auto-selected component count (only applies when n_components="auto"):
int: Direct addition/subtraction (e.g., -2 selects 2 fewer)
float: Percentage adjustment (e.g., -0.5 selects 50% fewer)
- decomposition_name : str, optional
Name for the decomposition result
- data_selector_name : str, optional
Name of data selector to apply before decomposition
- force : bool, default=False
Force recalculation even if decomposition already exists
Returns
- None
Adds Contact Kernel PCA decomposition results to pipeline data
Examples
>>> # Basic Contact Kernel PCA
>>> pipeline.decomposition.add.contact_kernel_pca("contact_features", n_components=12)

>>> # Contact Kernel PCA with custom gamma
>>> pipeline.decomposition.add.contact_kernel_pca(
...     "persistent_contacts",
...     n_components=15,
...     gamma=0.5,
...     decomposition_name="contact_modes"
... )

>>> # Contact Kernel PCA with Nyström approximation
>>> pipeline.decomposition.add.contact_kernel_pca(
...     "native_contacts",
...     n_components=8,
...     gamma=2.0,
...     use_nystrom=True,
...     n_landmarks=1000,
...     data_selector_name="folded_states"
... )
Notes
Contact Kernel PCA uses specialized kernels that account for the binary nature of contact data and contact pattern correlations.
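One reason a Hamming/RBF kernel suits binary contact data: for 0/1 vectors the squared Euclidean distance equals the Hamming distance, so an RBF kernel reduces to an exponential of the number of differing contacts. The sketch below shows that identity on two toy frames. `contact_kernel` is a hypothetical function, and whether mdxplain's contact kernel has exactly this form is an assumption.

```python
import math


def hamming_distance(a, b):
    """Number of positions where two binary contact vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contact_kernel(a, b, gamma):
    """Hypothetical contact kernel: exp(-gamma * Hamming distance)."""
    return math.exp(-gamma * hamming_distance(a, b))


# Two toy frames with 5 contacts each (1 = contact formed, 0 = broken).
frame_a = [1, 0, 1, 1, 0]
frame_b = [1, 1, 0, 1, 0]

print(hamming_distance(frame_a, frame_b))
print(squared_euclidean(frame_a, frame_b))
print(contact_kernel(frame_a, frame_b, gamma=0.5))
```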
- diffusion_maps(selection_name: str, n_components: int, epsilon: float | None = None, use_nystrom: bool = False, n_landmarks: int = 1000, landmark_selection_mode: str = 'kmeans', alpha: float = 0.0, random_state: int | None = None, epsilon_k: int | None = None, epsilon_n_samples: int | None = None, epsilon_ref_size: int | None = None, decomposition_name: str | None = None, data_selector_name: str | None = None, force: bool = False) -> None
Add Diffusion Maps decomposition.
Diffusion Maps is a nonlinear dimensionality reduction technique that captures the intrinsic geometry of data manifolds. It’s particularly effective for analyzing conformational transitions in MD trajectories.
Parameters
- selection_name : str
Name of feature selection to decompose
- n_components : int, required
Number of diffusion components to compute
- epsilon : float, optional
Kernel bandwidth parameter for Gaussian kernel. If None, estimated automatically
- use_nystrom : bool, default=False
Whether to use Nyström approximation for very large datasets
- n_landmarks : int, default=1000
Number of landmarks for Nyström approximation
- landmark_selection_mode : str, default="kmeans"
Landmark selection mode for Nyström approximation ("kmeans" or "random")
- alpha : float, default=0.0
Diffusion maps alpha normalization parameter (0.0 = no density correction)
- random_state : int, optional
Random state for reproducible results
- epsilon_k : int, optional, default=None
k for k-NN epsilon estimation when epsilon is None. If None, defaults to clamp(5 * log(n_frames), 20, 100).
- epsilon_n_samples : int, optional, default=None
Number of samples used for epsilon estimation. If None, defaults to 5% of frames (capped by ref size).
- epsilon_ref_size : int, optional, default=None
Reference pool size used for epsilon estimation. If None, defaults to 25% of frames.
- decomposition_name : str, optional
Name for the decomposition result
- data_selector_name : str, optional
Name of data selector to apply before decomposition
- force : bool, default=False
Force recalculation even if decomposition already exists
Returns
- None
Adds Diffusion Maps decomposition results to pipeline data
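The defaults stated for epsilon_k, epsilon_n_samples, and epsilon_ref_size can be written out as small functions. Both helpers are hypothetical; the natural logarithm, the truncation to an integer, and the lower bound of 1 are assumptions of this sketch, since the page gives only the formulas.

```python
import math


def default_epsilon_k(n_frames: int) -> int:
    """clamp(5 * log(n_frames), 20, 100), assuming the natural logarithm."""
    raw = 5 * math.log(n_frames)
    return int(min(max(raw, 20), 100))


def default_epsilon_sizes(n_frames: int):
    """Return (epsilon_n_samples, epsilon_ref_size) for a trajectory length."""
    # Reference pool: 25% of frames.
    ref_size = max(1, n_frames // 4)
    # Samples: 5% of frames, capped by the reference pool size.
    n_samples = min(max(1, n_frames // 20), ref_size)
    return n_samples, ref_size


print(default_epsilon_k(1000))
print(default_epsilon_sizes(1000))
```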
Examples
>>> # Basic Diffusion Maps
>>> pipeline.decomposition.add.diffusion_maps("my_features", n_components=12)

>>> # Diffusion Maps with custom epsilon
>>> pipeline.decomposition.add.diffusion_maps(
...     "distance_features",
...     n_components=20,
...     epsilon=1.5,
...     decomposition_name="transition_coordinates"
... )

>>> # Diffusion Maps with Nyström approximation
>>> pipeline.decomposition.add.diffusion_maps(
...     "conformational_features",
...     n_components=15,
...     use_nystrom=True,
...     n_landmarks=2000,
...     data_selector_name="equilibrated_frames"
... )
Notes
Diffusion Maps preserves diffusion distances and is well suited to identifying slow conformational coordinates and transition pathways. It uses RMSD-based distances to construct the diffusion process.
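The role of epsilon and alpha can be seen in a minimal pure-Python sketch of the diffusion process construction. It uses squared Euclidean distances on toy 1-D points as a stand-in for the RMSD-based distances, and the kernel form exp(-d^2 / epsilon) is one common convention chosen here as an assumption; `diffusion_transition_matrix` is a hypothetical name.

```python
import math


def diffusion_transition_matrix(points, epsilon, alpha=0.0):
    """Build a row-stochastic diffusion matrix from a Gaussian kernel."""
    n = len(points)
    # Gaussian kernel with bandwidth epsilon.
    K = [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / epsilon)
          for q in points] for p in points]
    # Kernel density estimate at each point.
    density = [sum(row) for row in K]
    # Alpha normalization: alpha=0.0 leaves K unchanged (no density correction).
    K_alpha = [[K[i][j] / ((density[i] * density[j]) ** alpha)
                for j in range(n)] for i in range(n)]
    # Row-normalize so each row is a probability distribution over next states.
    return [[value / sum(row) for value in row] for row in K_alpha]


# Three nearby points and one distant point.
points = [[0.0], [0.1], [0.2], [2.0]]

P0 = diffusion_transition_matrix(points, epsilon=0.5, alpha=0.0)
P1 = diffusion_transition_matrix(points, epsilon=0.5, alpha=1.0)
print([round(v, 3) for v in P0[0]])
print([round(v, 3) for v in P1[0]])
```

The diffusion coordinates are then obtained from the leading nontrivial eigenvectors of this matrix, a step omitted here to keep the sketch dependency-free.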