Diffusion Maps Data

GitHub Link to Code.

Diffusion Maps decomposition type for nonlinear dimensionality reduction.

Diffusion Maps decomposition type implementation with nonlinear dimensionality reduction using diffusion processes and spectral analysis of transition matrices.

class mdxplain.decomposition.decomposition_type.diffusion_maps.diffusion_maps.DiffusionMaps(n_components: int, epsilon: float | None = None, use_nystrom: bool = False, n_landmarks: int = 1000, landmark_selection_mode: str = 'kmeans', alpha: float = 0.0, random_state: int | None = None, epsilon_k: int | None = None, epsilon_n_samples: int | None = None, epsilon_ref_size: int | None = None)

Performs Diffusion Map analysis for nonlinear dimensionality reduction.

This class provides an implementation of the Diffusion Map algorithm, inspired by the MDAnalysis toolkit [2], and adapted for the specific data structures and workflows of this software package.

The method analyzes the intrinsic geometry of high-dimensional data by constructing a diffusion process on the data manifold, as originally proposed by Coifman & Lafon [1].

As a distance metric between configurations, this implementation uses the Root Mean Square Deviation (RMSD). The resulting diffusion coordinates correspond to the slowest collective motions in the system’s dynamics, making them powerful for analyzing molecular simulations. [4]

A clear, less mathematical introduction to the method can be found in the work by de la Porte et al. [3].

## Mathematical Workflow

The algorithm follows the “Random Walk” normalization approach:

  1. Kernel Matrix Construction: A Gaussian kernel is constructed from the pairwise RMSD distance matrix d(i, j). The parameter epsilon controls the kernel’s bandwidth. K_ij = exp(-d(i, j)**2 / epsilon)

  2. Transition Matrix Normalization: The kernel matrix K is row-normalized to create a row-stochastic transition matrix M, where each row sums to 1. D is a diagonal matrix of the row sums of K. M = D⁻¹ * K

  3. Eigendecomposition: The top eigenvectors of the transition matrix M are computed. The first eigenvector, corresponding to the eigenvalue λ₀=1, is a trivial constant vector representing the stationary distribution of the diffusion process. It contains no geometric information and is therefore discarded.[1]_ The subsequent eigenvectors form the final diffusion coordinates. M * v_k = λ_k * v_k

This method also allows for out-of-sample extension using the Nyström method [5]. The Nyström method approximates the kernel matrix using a subset of the data, significantly reducing memory usage and computation time. This is particularly beneficial for large datasets common in molecular dynamics simulations. Use the use_nystrom parameter to enable this feature.

References

Examples

>>> # Basic Diffusion Maps decomposition via DecompositionManager
>>> from mdxplain.decomposition import decomposition_type
>>> decomp_manager = DecompositionManager()
>>> decomp_manager.add(
...     traj_data, "feature_selection", decomposition_type.DiffusionMaps(n_components=10, epsilon=0.05)
... )
>>> # Diffusion Maps with incremental computation for large datasets
>>> diffmaps = decomposition_type.DiffusionMaps(n_components=20)
>>> diffmaps.init_calculator(use_memmap=True, chunk_size=1000)
>>> transformed, metadata = diffmaps.compute(large_data)
>>> # Diffusion Maps with Nyström approximation for very large datasets
>>> diffmaps = decomposition_type.DiffusionMaps(use_nystrom=True, n_landmarks=2000)
>>> diffmaps.init_calculator()
>>> transformed, metadata = diffmaps.compute(very_large_data, n_components=50)
__init__(n_components: int, epsilon: float | None = None, use_nystrom: bool = False, n_landmarks: int = 1000, landmark_selection_mode: str = 'kmeans', alpha: float = 0.0, random_state: int | None = None, epsilon_k: int | None = None, epsilon_n_samples: int | None = None, epsilon_ref_size: int | None = None) None

Initialize Diffusion Maps decomposition type.

Creates a Diffusion Maps instance with specified parameters for constructing the diffusion process and extracting diffusion coordinates.

Parameters

n_componentsint, required

Number of diffusion coordinates to keep

epsilonfloat, optional, default=1.0

Kernel bandwidth parameter for Gaussian kernel. Controls diffusion scale. If None, automatically estimated using k-nearest neighbors heuristic

use_nystrombool, default=False

Whether to use Nyström approximation for very large datasets

n_landmarksint, default=1000

Number of landmarks for Nyström approximation

landmark_selection_modestr, default=”kmeans”

Landmark selection mode for Nyström approximation (“kmeans” or “random”)

alphafloat, default=0.0

Diffusion maps alpha normalization parameter (0.0 = no density correction)

random_stateint, optional

Random state for reproducible results

epsilon_kint, optional, default=None

k for k-NN epsilon estimation when epsilon is None. If None, defaults to clamp(5 * log(n_frames), 20, 100).

epsilon_n_samplesint, optional, default=None

Number of samples used for epsilon estimation. If None, defaults to 5% of frames (capped by ref size).

epsilon_ref_sizeint, optional, default=None

Reference pool size used for epsilon estimation. If None, defaults to 25% of frames.

Returned Metadata

hyperparametersdict

Dictionary containing all Diffusion Maps parameters used

original_shapetuple

Shape of the input data (n_samples, n_features)

use_memmapbool

Whether memory mapping was used

chunk_sizeint

Chunk size used for processing

cache_pathstr

Path used for caching results

methodstr

Method used (‘standard_diffusion_maps’, ‘nystrom_diffusion_maps’, or ‘iterative_diffusion_maps’)

approximationstr

Approximation method used (‘nystrom’ when Nyström approximation is enabled)

n_landmarksint

Number of landmarks used for Nyström approximation (when applicable)

eigenvaluesnumpy.ndarray

Eigenvalues of the diffusion operator (diffusion timescales)

Examples

>>> # Create Diffusion Maps instance
>>> diffmaps = DiffusionMaps(n_components=10, epsilon=0.1)
>>> print(f"Type: {diffmaps.get_type_name()}")
'diffusion_maps'
>>> # Create Diffusion Maps with Nyström approximation
>>> diffmaps = DiffusionMaps(n_components=20, use_nystrom=True, n_landmarks=2000)
classmethod get_type_name() str

Get the type name for Diffusion Maps decomposition.

Returns the unique string identifier for Diffusion Maps decomposition type used for storing results and type identification.

Parameters

clstype

The DiffusionMaps class

Returns

str

String identifier ‘diffusion_maps’

Examples

>>> print(DiffusionMaps.get_type_name())
'diffusion_maps'
>>> # Can also be used via class directly
>>> print(decomposition_type.DiffusionMaps.get_type_name())
'diffusion_maps'
get_required_feature_type() str

DiffusionMaps requires coordinate features for RMSD-based distance computation.

Parameters

None

Returns

str

‘coordinates’ - indicating this method requires coordinate features

init_calculator(use_memmap: bool = False, cache_path: str = './cache', chunk_size: int = 2000) None

Initialize the Diffusion Maps calculator with specified configuration.

Sets up the Diffusion Maps calculator with options for memory mapping and iterative kernel computation for large datasets.

Parameters

use_memmapbool, default=False

Whether to use iterative computation with memory mapping for large datasets

cache_pathstr, optional

Path for cache files when using memory mapping

chunk_sizeint, optional

Number of samples to process per chunk for iterative computation

Returns

None

Sets self.calculator to initialized DiffusionMapsCalculator instance

Examples

>>> # Basic initialization
>>> diffmaps = DiffusionMaps()
>>> diffmaps.init_calculator()
>>> # With iterative computation for large datasets
>>> diffmaps.init_calculator(use_memmap=True, chunk_size=500)
>>> # With custom cache path
>>> diffmaps.init_calculator(
...     use_memmap=True,
...     cache_path="./cache/diffusion_maps.dat"
... )
compute(data, **kwargs) Tuple[ndarray, Dict[str, Any]]

Compute Diffusion Maps decomposition of input trajectory.

Performs Diffusion Maps analysis on the input MDTraj trajectory using the initialized calculator with the parameters provided during initialization. Constructs diffusion coordinates that capture the intrinsic geometry.

Parameters

datanumpy.ndarray

Input coordinate matrix (n_frames, n_features) where n_features = n_atoms * 3

kwargsdict

Additional optional parameters (currently none supported)

Returns

Tuple[numpy.ndarray, Dict]

Tuple containing:

  • diffusion_coords: Diffusion coordinates (n_frames, n_components)

  • metadata: Dictionary with Diffusion Maps information including:

    • hyperparameters: Used parameters

    • method: computation method used

    • eigenvalues: Eigenvalues of diffusion operator

Examples

>>> # Load trajectory and compute Diffusion Maps
>>> import mdtraj as md
>>> traj = md.load('trajectory.xtc', top='topology.pdb')
>>> diffmaps = DiffusionMaps(n_components=10, epsilon=0.5, atom_selection="backbone")
>>> diffmaps.init_calculator()
>>> coords, metadata = diffmaps.compute(traj)
>>> print(f"Coordinates shape: {coords.shape}")
>>> print(f"Eigenvalues: {metadata['eigenvalues'][:3]}")
>>> # Iterative Diffusion Maps for large trajectories
>>> diffmaps = DiffusionMaps(n_components=20, epsilon=1.0)
>>> diffmaps.init_calculator(use_memmap=True, chunk_size=200)
>>> coords, metadata = diffmaps.compute(large_trajectory)

Raises

ValueError

If calculator is not initialized, input data is invalid, or n_components is too large