Diffusion Maps Data
GitHub Link to Code.
Diffusion Maps decomposition type for nonlinear dimensionality reduction.
Diffusion Maps decomposition type implementation with nonlinear dimensionality reduction using diffusion processes and spectral analysis of transition matrices.
- class mdxplain.decomposition.decomposition_type.diffusion_maps.diffusion_maps.DiffusionMaps(n_components: int, epsilon: float | None = None, use_nystrom: bool = False, n_landmarks: int = 1000, landmark_selection_mode: str = 'kmeans', alpha: float = 0.0, random_state: int | None = None, epsilon_k: int | None = None, epsilon_n_samples: int | None = None, epsilon_ref_size: int | None = None)
Performs Diffusion Map analysis for nonlinear dimensionality reduction.
This class provides an implementation of the Diffusion Map algorithm, inspired by the MDAnalysis toolkit [2], and adapted for the specific data structures and workflows of this software package.
The method analyzes the intrinsic geometry of high-dimensional data by constructing a diffusion process on the data manifold, as originally proposed by Coifman & Lafon [1].
As a distance metric between configurations, this implementation uses the Root Mean Square Deviation (RMSD). The resulting diffusion coordinates correspond to the slowest collective motions in the system’s dynamics, making them powerful for analyzing molecular simulations. [4]
A clear, less mathematical introduction to the method can be found in the work by de la Porte et al. [3].
## Mathematical Workflow
The algorithm follows the “Random Walk” normalization approach:
Kernel Matrix Construction: A Gaussian kernel is constructed from the pairwise RMSD distance matrix d(i, j). The parameter epsilon controls the kernel’s bandwidth. K_ij = exp(-d(i, j)**2 / epsilon)
Transition Matrix Normalization: The kernel matrix K is row-normalized to create a row-stochastic transition matrix M, where each row sums to 1. D is a diagonal matrix of the row sums of K. M = D⁻¹ * K
Eigendecomposition: The top eigenvectors of the transition matrix M are computed. The first eigenvector, corresponding to the eigenvalue λ₀=1, is a trivial constant vector representing the stationary distribution of the diffusion process. It contains no geometric information and is therefore discarded.[1]_ The subsequent eigenvectors form the final diffusion coordinates. M * v_k = λ_k * v_k
This method also allows for out-of-sample extension using the Nyström method [5]. The Nyström method approximates the kernel matrix using a subset of the data, significantly reducing memory usage and computation time. This is particularly beneficial for large datasets common in molecular dynamics simulations. Use the use_nystrom parameter to enable this feature.
References
Examples
>>> # Basic Diffusion Maps decomposition via DecompositionManager >>> from mdxplain.decomposition import decomposition_type >>> decomp_manager = DecompositionManager() >>> decomp_manager.add( ... traj_data, "feature_selection", decomposition_type.DiffusionMaps(n_components=10, epsilon=0.05) ... )
>>> # Diffusion Maps with incremental computation for large datasets >>> diffmaps = decomposition_type.DiffusionMaps(n_components=20) >>> diffmaps.init_calculator(use_memmap=True, chunk_size=1000) >>> transformed, metadata = diffmaps.compute(large_data)
>>> # Diffusion Maps with Nyström approximation for very large datasets >>> diffmaps = decomposition_type.DiffusionMaps(use_nystrom=True, n_landmarks=2000) >>> diffmaps.init_calculator() >>> transformed, metadata = diffmaps.compute(very_large_data, n_components=50)
- __init__(n_components: int, epsilon: float | None = None, use_nystrom: bool = False, n_landmarks: int = 1000, landmark_selection_mode: str = 'kmeans', alpha: float = 0.0, random_state: int | None = None, epsilon_k: int | None = None, epsilon_n_samples: int | None = None, epsilon_ref_size: int | None = None) None
Initialize Diffusion Maps decomposition type.
Creates a Diffusion Maps instance with specified parameters for constructing the diffusion process and extracting diffusion coordinates.
Parameters
- n_componentsint, required
Number of diffusion coordinates to keep
- epsilonfloat, optional, default=1.0
Kernel bandwidth parameter for Gaussian kernel. Controls diffusion scale. If None, automatically estimated using k-nearest neighbors heuristic
- use_nystrombool, default=False
Whether to use Nyström approximation for very large datasets
- n_landmarksint, default=1000
Number of landmarks for Nyström approximation
- landmark_selection_modestr, default=”kmeans”
Landmark selection mode for Nyström approximation (“kmeans” or “random”)
- alphafloat, default=0.0
Diffusion maps alpha normalization parameter (0.0 = no density correction)
- random_stateint, optional
Random state for reproducible results
- epsilon_kint, optional, default=None
k for k-NN epsilon estimation when epsilon is None. If None, defaults to clamp(5 * log(n_frames), 20, 100).
- epsilon_n_samplesint, optional, default=None
Number of samples used for epsilon estimation. If None, defaults to 5% of frames (capped by ref size).
- epsilon_ref_sizeint, optional, default=None
Reference pool size used for epsilon estimation. If None, defaults to 25% of frames.
Returned Metadata
- hyperparametersdict
Dictionary containing all Diffusion Maps parameters used
- original_shapetuple
Shape of the input data (n_samples, n_features)
- use_memmapbool
Whether memory mapping was used
- chunk_sizeint
Chunk size used for processing
- cache_pathstr
Path used for caching results
- methodstr
Method used (‘standard_diffusion_maps’, ‘nystrom_diffusion_maps’, or ‘iterative_diffusion_maps’)
- approximationstr
Approximation method used (‘nystrom’ when Nyström approximation is enabled)
- n_landmarksint
Number of landmarks used for Nyström approximation (when applicable)
- eigenvaluesnumpy.ndarray
Eigenvalues of the diffusion operator (diffusion timescales)
Examples
>>> # Create Diffusion Maps instance >>> diffmaps = DiffusionMaps(n_components=10, epsilon=0.1) >>> print(f"Type: {diffmaps.get_type_name()}") 'diffusion_maps'
>>> # Create Diffusion Maps with Nyström approximation >>> diffmaps = DiffusionMaps(n_components=20, use_nystrom=True, n_landmarks=2000)
- classmethod get_type_name() str
Get the type name for Diffusion Maps decomposition.
Returns the unique string identifier for Diffusion Maps decomposition type used for storing results and type identification.
Parameters
- clstype
The DiffusionMaps class
Returns
- str
String identifier ‘diffusion_maps’
Examples
>>> print(DiffusionMaps.get_type_name()) 'diffusion_maps' >>> # Can also be used via class directly >>> print(decomposition_type.DiffusionMaps.get_type_name()) 'diffusion_maps'
- get_required_feature_type() str
DiffusionMaps requires coordinate features for RMSD-based distance computation.
Parameters
None
Returns
- str
‘coordinates’ - indicating this method requires coordinate features
- init_calculator(use_memmap: bool = False, cache_path: str = './cache', chunk_size: int = 2000) None
Initialize the Diffusion Maps calculator with specified configuration.
Sets up the Diffusion Maps calculator with options for memory mapping and iterative kernel computation for large datasets.
Parameters
- use_memmapbool, default=False
Whether to use iterative computation with memory mapping for large datasets
- cache_pathstr, optional
Path for cache files when using memory mapping
- chunk_sizeint, optional
Number of samples to process per chunk for iterative computation
Returns
- None
Sets self.calculator to initialized DiffusionMapsCalculator instance
Examples
>>> # Basic initialization >>> diffmaps = DiffusionMaps() >>> diffmaps.init_calculator()
>>> # With iterative computation for large datasets >>> diffmaps.init_calculator(use_memmap=True, chunk_size=500)
>>> # With custom cache path >>> diffmaps.init_calculator( ... use_memmap=True, ... cache_path="./cache/diffusion_maps.dat" ... )
- compute(data, **kwargs) Tuple[ndarray, Dict[str, Any]]
Compute Diffusion Maps decomposition of input trajectory.
Performs Diffusion Maps analysis on the input MDTraj trajectory using the initialized calculator with the parameters provided during initialization. Constructs diffusion coordinates that capture the intrinsic geometry.
Parameters
- datanumpy.ndarray
Input coordinate matrix (n_frames, n_features) where n_features = n_atoms * 3
- kwargsdict
Additional optional parameters (currently none supported)
Returns
- Tuple[numpy.ndarray, Dict]
Tuple containing:
diffusion_coords: Diffusion coordinates (n_frames, n_components)
metadata: Dictionary with Diffusion Maps information including:
hyperparameters: Used parameters
method: computation method used
eigenvalues: Eigenvalues of diffusion operator
Examples
>>> # Load trajectory and compute Diffusion Maps >>> import mdtraj as md >>> traj = md.load('trajectory.xtc', top='topology.pdb') >>> diffmaps = DiffusionMaps(n_components=10, epsilon=0.5, atom_selection="backbone") >>> diffmaps.init_calculator() >>> coords, metadata = diffmaps.compute(traj) >>> print(f"Coordinates shape: {coords.shape}") >>> print(f"Eigenvalues: {metadata['eigenvalues'][:3]}")
>>> # Iterative Diffusion Maps for large trajectories >>> diffmaps = DiffusionMaps(n_components=20, epsilon=1.0) >>> diffmaps.init_calculator(use_memmap=True, chunk_size=200) >>> coords, metadata = diffmaps.compute(large_trajectory)
Raises
- ValueError
If calculator is not initialized, input data is invalid, or n_components is too large