Dask MD Trajectory
DaskMDTrajectory - Memory-efficient MDTraj-compatible trajectory class.
Provides the complete MDTraj.Trajectory interface with Dask/Zarr backend for efficient processing of large trajectory files.
- class mdxplain.trajectory.entities.dask_md_trajectory.DaskMDTrajectory(trajectory_file: str, topology_file: str | None = None, zarr_cache_path: str | None = None, chunk_size: int = 1000, n_workers: int | None = None)
Memory-efficient trajectory class compatible with MDTraj interface.
Uses Dask arrays and Zarr storage for optimal memory usage while maintaining full compatibility with MDTraj operations and workflows.
Attributes
- trajectory_file : str
Path to the trajectory file
- topology_file : str, optional
Path to the topology file
- zarr_cache_path : str, optional
Path to the Zarr cache directory
- chunk_size : int
Number of frames per chunk
- n_workers : int
Number of parallel workers
Examples
>>> dask_traj = DaskMDTrajectory(
...     trajectory_file='trajectory.xtc',
...     topology_file='topology.pdb',
...     zarr_cache_path='/tmp/my_cache.zarr'
... )
Notes
This class is designed for memory-efficient processing of large trajectory files using Dask and Zarr. It provides a drop-in replacement for MDTraj’s trajectory class with additional features for parallel processing and out-of-core computation. It supports all standard MDTraj properties and methods, including slicing, atom selection, superposition, and more.
In short, it is an md.Trajectory wrapper with a Dask/Zarr backend.
- __init__(trajectory_file: str, topology_file: str | None = None, zarr_cache_path: str | None = None, chunk_size: int = 1000, n_workers: int | None = None)
Initialize DaskMDTrajectory.
Parameters
- trajectory_file : str
Path to trajectory file (.xtc, .dcd, etc.)
- topology_file : str, optional
Path to topology file (.pdb, .gro, etc.)
- zarr_cache_path : str, optional
Custom path for Zarr cache file
- chunk_size : int, default=1000
Number of frames per chunk (optimized for performance)
- n_workers : int, optional
Number of parallel workers (defaults to CPU count)
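As a rough guide when choosing chunk_size, the memory taken by one chunk of coordinates can be estimated from the array shape. The sketch below assumes coordinates are stored as 32-bit floats, as in MDTraj's xyz arrays; the dtype actually used by the Zarr cache is an assumption here, and chunk_memory_mib is a hypothetical helper, not part of mdxplain.

```python
def chunk_memory_mib(chunk_size: int, n_atoms: int, bytes_per_value: int = 4) -> float:
    """Estimate the size of one coordinate chunk in MiB.

    A chunk holds chunk_size frames, each with n_atoms x 3 coordinates.
    bytes_per_value=4 assumes float32 storage (an assumption about the cache).
    """
    n_values = chunk_size * n_atoms * 3
    return n_values * bytes_per_value / (1024 ** 2)


# Default chunk_size=1000 with a 5000-atom system
print(f"{chunk_memory_mib(1000, 5000):.1f} MiB per chunk")  # 57.2 MiB per chunk
```

Several chunks may be held in memory at once when multiple workers run in parallel, so the practical footprint is likely a multiple of this figure.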
- classmethod from_mdtraj(mdtraj: Trajectory, zarr_cache_path: str | None = None, chunk_size: int = 1000, n_workers: int | None = None) -> DaskMDTrajectory
Create DaskMDTrajectory from existing MDTraj trajectory.
Parameters
- mdtraj : md.Trajectory
MDTraj trajectory object to convert
- zarr_cache_path : str, optional
Path for Zarr cache. If None, creates temporary cache.
- chunk_size : int, default=1000
Number of frames per chunk for Dask arrays
- n_workers : int, optional
Number of parallel workers (defaults to CPU count)
Returns
- DaskMDTrajectory
New DaskMDTrajectory instance with data from MDTraj
Examples
>>> import mdtraj as md
>>> traj = md.load('trajectory.xtc', top='topology.pdb')
>>> dask_traj = DaskMDTrajectory.from_mdtraj(traj)
>>> print(f"Converted {dask_traj.n_frames} frames")
>>> # With custom cache path
>>> dask_traj = DaskMDTrajectory.from_mdtraj(
...     traj, zarr_cache_path='/tmp/my_cache.zarr'
... )
- property n_frames: int
Number of frames in the trajectory.
Returns
- int
Total number of frames
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> print(dask_traj.n_frames)
1000
- property n_atoms: int
Number of atoms in the trajectory.
Returns
- int
Total number of atoms
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> print(dask_traj.n_atoms)
5000
- property n_residues: int
Number of residues in the trajectory.
Returns
- int
Total number of residues
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> print(dask_traj.n_residues)
333
- property topology: Topology
System topology.
Returns
- md.Topology
MDTraj topology object
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> topology = dask_traj.topology
>>> print(topology.n_atoms)
5000
- property top: Topology
System topology (alias for topology property).
Returns
- md.Topology
MDTraj topology object
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> top = dask_traj.top  # Same as dask_traj.topology
>>> print(top.n_residues)
333
- property time: ndarray
Simulation time for each frame (lazy loaded).
Returns
- np.ndarray
Array of simulation times with shape (n_frames,)
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> times = dask_traj.time
>>> print(f"First frame: {times[0]} ps")
First frame: 0.0 ps
- property timestep: float
Time between frames in picoseconds.
Returns
- float
Time step in picoseconds
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> dt = dask_traj.timestep
>>> print(f"Timestep: {dt} ps")
Timestep: 1.0 ps
- property xyz: ndarray
Cartesian coordinates (lazy loaded with memory management).
Returns
- np.ndarray
Coordinate array with shape (n_frames, n_atoms, 3)
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> coords = dask_traj.xyz  # Loads all coordinates into memory
>>> print(coords.shape)
(1000, 5000, 3)
>>> # For large trajectories, use slicing:
>>> coords_subset = dask_traj[0:100].xyz
- property unitcell_vectors: ndarray | None
Unit cell vectors (lazy loaded).
Returns
- Optional[np.ndarray]
Unit cell vectors array with shape (n_frames, 3, 3) or None if no unitcell
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> vectors = dask_traj.unitcell_vectors
>>> if vectors is not None:
...     print(f"Unit cell shape: {vectors.shape}")
- property unitcell_lengths: ndarray | None
Unit cell lengths (lazy loaded).
Returns
- Optional[np.ndarray]
Unit cell lengths array with shape (n_frames, 3) or None if no unitcell
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> lengths = dask_traj.unitcell_lengths
>>> if lengths is not None:
...     print(f"Average box size: {lengths.mean(axis=0)} nm")
- property unitcell_angles: ndarray | None
Unit cell angles (lazy loaded).
Returns
- Optional[np.ndarray]
Unit cell angles array with shape (n_frames, 3) or None if no unitcell
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> angles = dask_traj.unitcell_angles
>>> if angles is not None:
...     print(f"Box angles: {angles[0]} degrees")
- atom_slice(atom_indices: ndarray | list, inplace: bool = False) -> DaskMDTrajectory
Create trajectory from subset of atoms.
Parameters
- atom_indices : array_like
Indices of atoms to keep
- inplace : bool, default=False
If True, modify trajectory in place and return self. If False, return new trajectory instance.
Returns
- DaskMDTrajectory
Self if inplace=True, otherwise new trajectory with selected atoms
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Select first 100 atoms (new trajectory)
>>> small_traj = dask_traj.atom_slice(range(100))
>>> print(f"Original: {dask_traj.n_atoms} atoms, Sliced: {small_traj.n_atoms} atoms")
>>> # Select specific atom indices in-place
>>> ca_indices = dask_traj.topology.select('name CA')
>>> dask_traj.atom_slice(ca_indices, inplace=True)  # Modifies dask_traj
- center_coordinates(mass_weighted: bool = False, inplace: bool = True) -> DaskMDTrajectory
Center trajectory frames at origin (in-place operation by default).
This method acts in-place on the trajectory by default, similar to MDTraj behavior.
Parameters
- mass_weighted : bool, default=False
Use mass-weighted centering
- inplace : bool, default=True
If True, modify trajectory in place and return self. If False, return new trajectory instance.
Returns
- DaskMDTrajectory
Self if inplace=True, otherwise new trajectory with centered coordinates
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Center coordinates in-place (geometric center) - modifies dask_traj
>>> dask_traj.center_coordinates()
>>> # Center coordinates using mass weighting, create new trajectory
>>> mass_centered = dask_traj.center_coordinates(mass_weighted=True, inplace=False)
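The difference between geometric and mass-weighted centering can be illustrated on a plain NumPy array. This is a minimal sketch of the arithmetic only, not the mdxplain implementation (which presumably works chunk-wise on Dask arrays); center_frames and its masses argument are hypothetical names.

```python
from typing import Optional

import numpy as np


def center_frames(xyz: np.ndarray, masses: Optional[np.ndarray] = None) -> np.ndarray:
    """Return a copy of xyz with every frame centered at the origin.

    xyz has shape (n_frames, n_atoms, 3). With masses=None the geometric
    center is subtracted; otherwise the mass-weighted center is subtracted.
    """
    if masses is None:
        centers = xyz.mean(axis=1, keepdims=True)
    else:
        weights = masses / masses.sum()
        # Weighted sum over the atom axis, reshaped to (n_frames, 1, 3)
        centers = np.einsum("fai,a->fi", xyz, weights)[:, None, :]
    return xyz - centers


xyz = np.array([[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]])  # 1 frame, 2 atoms
geometric = center_frames(xyz)                         # atoms end up at x = -1 and x = +1
weighted = center_frames(xyz, np.array([3.0, 1.0]))    # atoms end up at x = -0.5 and x = +1.5
print(geometric[0])
print(weighted[0])
```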
- superpose(reference: DaskMDTrajectory | Trajectory | None = None, frame: int = 0, atom_indices: ndarray | None = None, ref_atom_indices: ndarray | None = None, parallel: bool = True, inplace: bool = True) -> DaskMDTrajectory
Align trajectory to reference structure (in-place operation by default).
This method acts in-place on the trajectory by default, similar to MDTraj behavior.
Parameters
- reference : DaskMDTrajectory, md.Trajectory, optional
Reference trajectory (if None, uses self as reference)
- frame : int, default=0
Frame index from reference trajectory to use for alignment
- atom_indices : np.ndarray, optional
Atoms to use for alignment on this trajectory
- ref_atom_indices : np.ndarray, optional
Atoms to use for alignment on reference trajectory. If None, uses same as atom_indices.
- parallel : bool, default=True
Use parallel processing (for MDTraj compatibility)
- inplace : bool, default=True
If True, modify trajectory in place and return self. If False, return new trajectory instance.
Returns
- DaskMDTrajectory
Self if inplace=True, otherwise new trajectory aligned to reference
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Align to own first frame in-place - modifies dask_traj
>>> dask_traj.superpose()
>>> # Align to own frame 10, create new trajectory
>>> aligned = dask_traj.superpose(frame=10, inplace=False)
>>> # Align to external trajectory frame 0
>>> other_traj = DaskMDTrajectory('other.xtc', 'topology.pdb')
>>> dask_traj.superpose(reference=other_traj, frame=0)
>>> # Align using only CA atoms
>>> ca_indices = dask_traj.topology.select('name CA')
>>> dask_traj.superpose(atom_indices=ca_indices)
Raises
- ValueError
If reference frame index is out of range
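Conceptually, superposition finds the rigid rotation and translation that minimize the RMSD between a frame and the reference. The sketch below shows that idea for a single frame using the Kabsch algorithm (SVD of the covariance matrix). It is a stand-in for illustration, not the routine used by MDTraj or mdxplain, and kabsch_align is a hypothetical helper.

```python
import numpy as np


def kabsch_align(mobile: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Rigidly align mobile (n_atoms, 3) onto reference (n_atoms, 3)."""
    reference_center = reference.mean(axis=0)
    p = mobile - mobile.mean(axis=0)
    q = reference - reference_center

    # Covariance matrix between the centered point sets and its SVD
    h = p.T @ q
    u, _, vt = np.linalg.svd(h)

    # Flip the last axis if needed so the result is a proper rotation
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    # Rotate the centered coordinates, then move them onto the reference center
    return p @ rotation.T + reference_center


# Usage: rotate and shift a small structure, then align it back
reference = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
theta = np.pi / 3
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
mobile = reference @ rot_z.T + np.array([5.0, -2.0, 1.0])
aligned = kabsch_align(mobile, reference)
rmsd = np.sqrt(((aligned - reference) ** 2).sum(axis=1).mean())
print(f"RMSD after alignment: {rmsd:.6f}")
```

Passing atom_indices corresponds to computing the rotation from a subset of atoms (for example CA atoms) and then applying it to all atoms of the frame.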
- smooth(width: int, order: int | None = None, atom_indices: ndarray | None = None, inplace: bool = False) -> DaskMDTrajectory
Apply smoothing filter to trajectory.
Parameters
- width : int
Smoothing window width
- order : int, optional
Polynomial order for Savitzky-Golay filter
- atom_indices : np.ndarray, optional
Atoms to smooth (default: all)
- inplace : bool, default=False
If True, modify trajectory in place and return self. If False, return new trajectory instance.
Returns
- DaskMDTrajectory
Self if inplace=True, otherwise new trajectory with smoothed coordinates
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Apply smoothing with window width 5 (new trajectory)
>>> smoothed = dask_traj.smooth(width=5)
>>> # Smooth only backbone atoms in-place
>>> backbone = dask_traj.topology.select('backbone')
>>> dask_traj.smooth(width=3, atom_indices=backbone, inplace=True)
- image_molecules(inplace: bool = True, anchor_molecules: ndarray | None = None, other_molecules: ndarray | None = None, sorted_bonds: ndarray | None = None, make_whole: bool = True) -> DaskMDTrajectory
Apply periodic boundary condition imaging to molecules.
Recenters molecules and wraps them into the primary unit cell using MDTraj’s image_molecules method. This operation modifies coordinates but does not change the number of atoms.
Parameters
- inplace : bool, default=True
If True, modify trajectory in place and return self. If False, return new trajectory instance.
- anchor_molecules : np.ndarray, optional
Indices of molecules to anchor at the origin. If None, uses all molecules.
- other_molecules : np.ndarray, optional
Indices of other molecules to image relative to anchors. If None, uses all molecules not in anchor_molecules.
- sorted_bonds : np.ndarray, optional
Pre-sorted bond array for performance optimization. If None, bonds are determined from topology.
- make_whole : bool, default=True
Make molecules whole across periodic boundary conditions before imaging.
Returns
- DaskMDTrajectory
Self if inplace=True, otherwise new trajectory with imaged coordinates
Examples
>>> import numpy as np
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Apply default imaging in-place
>>> dask_traj.image_molecules()
>>> # Image with specific anchor molecules, create new trajectory
>>> protein_molecules = np.array([0, 1, 2])
>>> imaged = dask_traj.image_molecules(
...     anchor_molecules=protein_molecules,
...     inplace=False
... )
- remove_solvent(exclude: list | None = None, inplace: bool = False) -> DaskMDTrajectory
Remove solvent atoms from trajectory.
Creates new trajectory without solvent atoms using MDTraj’s remove_solvent method. This operation changes the number of atoms and creates a new topology.
Parameters
- exclude : list, optional
List of solvent residue names to KEEP (not remove). Common values include ['HOH', 'WAT'] to keep water molecules. If None, removes all recognized solvent molecules.
- inplace : bool, default=False
If True, modify trajectory in place and return self. If False, return new trajectory instance.
Returns
- DaskMDTrajectory
Self if inplace=True, otherwise new trajectory without solvent atoms
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Remove all solvent, create new trajectory
>>> protein_only = dask_traj.remove_solvent()
>>> print(f"Atoms before: {dask_traj.n_atoms}")
>>> print(f"Atoms after: {protein_only.n_atoms}")
>>> # Keep water but remove other solvent
>>> keep_water = dask_traj.remove_solvent(exclude=['HOH', 'WAT'])
>>> # Remove solvent in-place (changes dask_traj)
>>> dask_traj.remove_solvent(inplace=True)
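The exclude parameter is easy to misread, because the names listed there are the ones that are retained. The following sketch mimics that filtering logic on plain residue names. SOLVENT_NAMES is a small hypothetical set chosen for this illustration, not MDTraj's actual list of solvent residues, and residues_to_keep is a hypothetical helper.

```python
# Hypothetical subset for illustration; the real solvent list is defined by MDTraj.
SOLVENT_NAMES = {"HOH", "WAT", "NA", "CL"}


def residues_to_keep(residue_names, exclude=None):
    """Return the residue names that survive solvent removal.

    Names listed in `exclude` are exempt from removal, i.e. they are kept.
    """
    exempt = set(exclude or [])
    removable = SOLVENT_NAMES - exempt
    return [name for name in residue_names if name not in removable]


residues = ["ALA", "GLY", "HOH", "NA", "HOH"]
print(residues_to_keep(residues))                   # ['ALA', 'GLY']
print(residues_to_keep(residues, exclude=["HOH"]))  # ['ALA', 'GLY', 'HOH', 'HOH']
```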
- join(other: DaskMDTrajectory, check_topology: bool = True) -> DaskMDTrajectory
Combine trajectories along frame axis.
Parameters
- other : DaskMDTrajectory
Trajectory to join
- check_topology : bool, default=True
Check topology compatibility
Returns
- DaskMDTrajectory
New trajectory with combined frames
Examples
>>> traj1 = DaskMDTrajectory('part1.xtc', 'topology.pdb')
>>> traj2 = DaskMDTrajectory('part2.xtc', 'topology.pdb')
>>> combined = traj1.join(traj2)
>>> print(f"Combined: {combined.n_frames} frames")
Raises
- ValueError
If trajectories have different number of atoms when check_topology=True
- stack(other: DaskMDTrajectory) -> DaskMDTrajectory
Combine trajectories along atom axis.
Parameters
- other : DaskMDTrajectory
Trajectory to stack
Returns
- DaskMDTrajectory
New trajectory with combined atoms
Examples
>>> protein = DaskMDTrajectory('protein.xtc', 'protein.pdb')
>>> ligand = DaskMDTrajectory('ligand.xtc', 'ligand.pdb')
>>> complex_traj = protein.stack(ligand)
>>> print(f"Complex: {complex_traj.n_atoms} atoms")
Raises
- ValueError
If trajectories have different number of frames
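In terms of the coordinate array with shape (n_frames, n_atoms, 3), join and stack differ only in the axis along which data is combined. A minimal NumPy sketch of the two cases, using zero-filled placeholder arrays rather than real trajectories:

```python
import numpy as np

part1 = np.zeros((10, 4, 3))  # 10 frames, 4 atoms
part2 = np.zeros((5, 4, 3))   # 5 more frames of the same 4-atom system

# join: concatenate along the frame axis (axis 0); atom counts must match
joined = np.concatenate([part1, part2], axis=0)
print(joined.shape)   # (15, 4, 3)

protein = np.zeros((10, 4, 3))  # 10 frames, 4 atoms
ligand = np.zeros((10, 2, 3))   # 10 frames, 2 atoms

# stack: concatenate along the atom axis (axis 1); frame counts must match
stacked = np.concatenate([protein, ligand], axis=1)
print(stacked.shape)  # (10, 6, 3)
```

This is why join raises on mismatched atom counts and stack raises on mismatched frame counts: the non-concatenated axes have to agree.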
- slice(key: int | slice | ndarray, return_dask: bool = True) -> DaskMDTrajectory | Trajectory
Extract specific frames.
Parameters
- key : int, slice, or array_like
Frame indices to extract
- return_dask : bool, default=True
If True, returns a new DaskMDTrajectory with sliced data. If False, returns an md.Trajectory (original behavior).
Returns
- DaskMDTrajectory or md.Trajectory
DaskMDTrajectory with selected frames (if return_dask=True) or MDTraj trajectory with selected frames (if return_dask=False)
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Extract first 100 frames as DaskMDTrajectory (default)
>>> subset = dask_traj.slice(slice(0, 100))
>>> # Extract specific frames as md.Trajectory
>>> frames = dask_traj.slice([10, 50, 100, 200], return_dask=False)
>>> print(f"Extracted {frames.n_frames} frames")
Note:
When return_dask=True, uses lazy Dask array slicing to avoid memory issues. The slicing operation is completely lazy until data is actually accessed.
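Lazy slicing keeps memory use low because only the chunks that overlap the requested frames need to be read once the data is accessed. The pure-Python sketch below shows that bookkeeping for a frame slice; chunks_touched is a hypothetical helper written for illustration, and the real chunk selection is handled by Dask.

```python
def chunks_touched(key: slice, n_frames: int, chunk_size: int) -> list:
    """Return the indices of the frame chunks that a slice needs to read."""
    frames = range(*key.indices(n_frames))
    return sorted({frame // chunk_size for frame in frames})


# 10,000 frames stored in chunks of 1,000 frames
print(chunks_touched(slice(0, 100), 10_000, 1_000))      # [0]
print(chunks_touched(slice(900, 2_100), 10_000, 1_000))  # [0, 1, 2]
```

In the first case a single chunk out of ten is read, so the other nine never enter memory.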
- cleanup() -> None
Clean up memory resources and clear internal caches.
This method clears the coordinate and time caches held in memory and releases the handle to the underlying Zarr store. It does NOT delete any files on disk.
Returns
- None
Clears caches in-place.
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # ... perform calculations ...
>>> dask_traj.cleanup()  # Free memory
- memory_usage() -> dict
Get memory usage information.
Returns
- dict
Memory usage statistics with keys: coordinates_size_mb, zarr_cache_size_mb, chunk_size, n_workers
Examples
>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> usage = dask_traj.memory_usage()
>>> print(f"Trajectory size: {usage['coordinates_size_mb']:.1f} MB")
>>> print(f"Cache size: {usage['zarr_cache_size_mb']:.1f} MB")
- save(filepath: str) -> None
Save DaskMDTrajectory to a portable self-contained archive.
Creates a .dask_traj archive (tar + zstd) containing the object metadata and the underlying Zarr cache. The archive can be moved or shared freely without the original Zarr cache.
Parameters
- filepath : str
Destination path. The .dask_traj extension is added automatically if not already present.
Returns
- None
Examples
>>> traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> traj.save('output/my_traj')
>>> # Creates 'output/my_traj.dask_traj'
- classmethod load(filepath: str) -> DaskMDTrajectory
Load DaskMDTrajectory from a .dask_traj archive.
Extracts the archive next to the file on first load; subsequent loads reuse the already-extracted Zarr cache.
Parameters
- filepath : str
Path to a .dask_traj archive created by save().
Returns
- DaskMDTrajectory
Fully initialised trajectory with coordinate access ready.
Raises
- FileNotFoundError
If the archive does not exist.
Examples
>>> traj = DaskMDTrajectory.load('output/my_traj.dask_traj')
>>> print(traj.n_frames)