Dask MD Trajectory


DaskMDTrajectory - Memory-efficient MDTraj-compatible trajectory class.

Provides the complete MDTraj Trajectory interface with a Dask/Zarr backend for efficient processing of large trajectory files.

class mdxplain.trajectory.entities.dask_md_trajectory.DaskMDTrajectory(trajectory_file: str, topology_file: str | None = None, zarr_cache_path: str | None = None, chunk_size: int = 1000, n_workers: int | None = None)

Memory-efficient trajectory class compatible with the MDTraj interface.

Uses Dask arrays and Zarr storage for optimal memory usage while maintaining full compatibility with MDTraj operations and workflows.

Attributes

trajectory_file : str

Path to the trajectory file

topology_file : str, optional

Path to the topology file

zarr_cache_path : str, optional

Path to the Zarr cache directory

chunk_size : int

Number of frames per chunk

n_workers : int

Number of parallel workers

Examples

>>> dask_traj = DaskMDTrajectory(
...     trajectory_file='trajectory.xtc',
...     topology_file='topology.pdb',
...     zarr_cache_path='/tmp/my_cache.zarr'
... )

Notes

This class is designed for memory-efficient processing of large trajectory files using Dask and Zarr. It provides a drop-in replacement for MDTraj’s trajectory class with additional features for parallel processing and out-of-core computation. It supports all standard MDTraj properties and methods, including slicing, atom selection, superposition, and more.

In short, it is an md.Trajectory wrapper with a Dask/Zarr backend.

__init__(trajectory_file: str, topology_file: str | None = None, zarr_cache_path: str | None = None, chunk_size: int = 1000, n_workers: int | None = None)

Initialize DaskMDTrajectory.

Parameters

trajectory_file : str

Path to trajectory file (.xtc, .dcd, etc.)

topology_file : str, optional

Path to topology file (.pdb, .gro, etc.)

zarr_cache_path : str, optional

Custom path for Zarr cache file

chunk_size : int, default=1000

Number of frames per chunk (optimized for performance)

n_workers : int, optional

Number of parallel workers (defaults to CPU count)
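
As a rough illustration of what chunk_size means, the sketch below splits a frame range into consecutive chunks. The helper frame_chunks is hypothetical and not part of mdxplain; the actual chunk layout is decided by Dask and Zarr and may differ.

```python
from __future__ import annotations


def frame_chunks(n_frames: int, chunk_size: int = 1000) -> list[tuple[int, int]]:
    """Return (start, stop) frame bounds for consecutive chunks.

    Hypothetical helper for illustration only; not part of mdxplain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, n_frames))
        for start in range(0, n_frames, chunk_size)
    ]


# 2500 frames with the default chunk_size: two full chunks and one partial chunk.
print(frame_chunks(2500))
```

A smaller chunk_size means more, smaller chunks; a larger one means fewer, bigger chunks that each need more memory while being processed.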

classmethod from_mdtraj(mdtraj: Trajectory, zarr_cache_path: str | None = None, chunk_size: int = 1000, n_workers: int | None = None) -> DaskMDTrajectory

Create DaskMDTrajectory from existing MDTraj trajectory.

Parameters

mdtraj : md.Trajectory

MDTraj trajectory object to convert

zarr_cache_path : str, optional

Path for Zarr cache. If None, creates temporary cache.

chunk_size : int, default=1000

Number of frames per chunk for Dask arrays

n_workers : int, optional

Number of parallel workers (defaults to CPU count)

Returns

DaskMDTrajectory

New DaskMDTrajectory instance with data from MDTraj

Examples

>>> import mdtraj as md
>>> traj = md.load('trajectory.xtc', top='topology.pdb')
>>> dask_traj = DaskMDTrajectory.from_mdtraj(traj)
>>> print(f"Converted {dask_traj.n_frames} frames")
>>> # With custom cache path
>>> dask_traj = DaskMDTrajectory.from_mdtraj(
...     traj, zarr_cache_path='/tmp/my_cache.zarr'
... )

property n_frames: int

Number of frames in the trajectory.

Returns

int

Total number of frames

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> print(dask_traj.n_frames)
1000

property n_atoms: int

Number of atoms in the trajectory.

Returns

int

Total number of atoms

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> print(dask_traj.n_atoms)
5000

property n_residues: int

Number of residues in the trajectory.

Returns

int

Total number of residues

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> print(dask_traj.n_residues)
333

property topology: Topology

System topology.

Returns

md.Topology

MDTraj topology object

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> topology = dask_traj.topology
>>> print(topology.n_atoms)
5000

property top: Topology

System topology (alias for topology property).

Returns

md.Topology

MDTraj topology object

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> top = dask_traj.top  # Same as dask_traj.topology
>>> print(top.n_residues)
333

property time: ndarray

Simulation time for each frame (lazy loaded).

Returns

np.ndarray

Array of simulation times with shape (n_frames,)

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> times = dask_traj.time
>>> print(f"First frame: {times[0]} ps")
First frame: 0.0 ps

property timestep: float

Time between frames in picoseconds.

Returns

float

Time step in picoseconds

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> dt = dask_traj.timestep
>>> print(f"Timestep: {dt} ps")
Timestep: 1.0 ps

property xyz: ndarray

Cartesian coordinates (lazy loaded with memory management).

Returns

np.ndarray

Coordinate array with shape (n_frames, n_atoms, 3)

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> coords = dask_traj.xyz  # Loads all coordinates into memory
>>> print(coords.shape)
(1000, 5000, 3)
>>> # For large trajectories, use slicing:
>>> coords_subset = dask_traj[0:100].xyz
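
To judge whether a full xyz load fits in memory, a back-of-the-envelope estimate can help. The sketch below assumes 4-byte (float32) coordinates, which is an assumption about the stored dtype; xyz_size_mib is a hypothetical helper, not part of mdxplain.

```python
def xyz_size_mib(n_frames: int, n_atoms: int, bytes_per_value: int = 4) -> float:
    """Estimate the size of an (n_frames, n_atoms, 3) array in MiB.

    bytes_per_value=4 assumes float32 coordinates (an assumption).
    """
    return n_frames * n_atoms * 3 * bytes_per_value / 1024**2


full = xyz_size_mib(1000, 5000)    # shape from the example above
subset = xyz_size_mib(100, 5000)   # first 100 frames only
print(f"full: {full:.1f} MiB, first 100 frames: {subset:.1f} MiB")
```

The estimate scales linearly with the number of frames, which is why slicing before accessing xyz keeps memory use bounded.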

property unitcell_vectors: ndarray | None

Unit cell vectors (lazy loaded).

Returns

Optional[np.ndarray]

Unit cell vectors array with shape (n_frames, 3, 3) or None if no unitcell

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> vectors = dask_traj.unitcell_vectors
>>> if vectors is not None:
...     print(f"Unit cell shape: {vectors.shape}")

property unitcell_lengths: ndarray | None

Unit cell lengths (lazy loaded).

Returns

Optional[np.ndarray]

Unit cell lengths array with shape (n_frames, 3) or None if no unitcell

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> lengths = dask_traj.unitcell_lengths
>>> if lengths is not None:
...     print(f"Average box size: {lengths.mean(axis=0)} nm")

property unitcell_angles: ndarray | None

Unit cell angles (lazy loaded).

Returns

Optional[np.ndarray]

Unit cell angles array with shape (n_frames, 3) or None if no unitcell

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> angles = dask_traj.unitcell_angles
>>> if angles is not None:
...     print(f"Box angles: {angles[0]} degrees")

atom_slice(atom_indices: ndarray | list, inplace: bool = False) -> DaskMDTrajectory

Create trajectory from subset of atoms.

Parameters

atom_indices : array_like

Indices of atoms to keep

inplace : bool, default=False

If True, modify trajectory in place and return self. If False, return new trajectory instance.

Returns

DaskMDTrajectory

Self if inplace=True, otherwise new trajectory with selected atoms

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Select first 100 atoms (new trajectory)
>>> small_traj = dask_traj.atom_slice(range(100))
>>> print(f"Original: {dask_traj.n_atoms} atoms, Sliced: {small_traj.n_atoms} atoms")
>>> # Select specific atom indices in-place
>>> ca_indices = dask_traj.topology.select('name CA')
>>> dask_traj.atom_slice(ca_indices, inplace=True)  # Modifies dask_traj

center_coordinates(mass_weighted: bool = False, inplace: bool = True) -> DaskMDTrajectory

Center trajectory frames at origin (in-place operation by default).

This method acts in-place on the trajectory by default, similar to MDTraj behavior.

Parameters

mass_weighted : bool, default=False

Use mass-weighted centering

inplace : bool, default=True

If True, modify trajectory in place and return self. If False, return new trajectory instance.

Returns

DaskMDTrajectory

Self if inplace=True, otherwise new trajectory with centered coordinates

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Center coordinates in-place (geometric center) - modifies dask_traj
>>> dask_traj.center_coordinates()
>>> # Center coordinates using mass weighting, create new trajectory
>>> mass_centered = dask_traj.center_coordinates(mass_weighted=True, inplace=False)
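
The difference between geometric and mass-weighted centering can be sketched with plain NumPy. This is a conceptual illustration on an in-memory array, not the chunked implementation used by DaskMDTrajectory; the function name center and the example masses are made up.

```python
from __future__ import annotations

import numpy as np


def center(xyz: np.ndarray, masses: np.ndarray | None = None) -> np.ndarray:
    """Return a copy of xyz (n_frames, n_atoms, 3) with every frame centered.

    masses=None uses the geometric center; otherwise the center of mass.
    """
    if masses is None:
        centers = xyz.mean(axis=1, keepdims=True)
    else:
        weights = masses / masses.sum()
        # Weighted average over the atom axis, one center per frame.
        centers = np.einsum("fad,a->fd", xyz, weights)[:, None, :]
    return xyz - centers


# One frame, two atoms on the x axis at 0 and 2.
xyz = np.array([[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]])
print(center(xyz)[0, :, 0])                        # geometric center at x = 1
print(center(xyz, np.array([1.0, 3.0]))[0, :, 0])  # center of mass at x = 1.5
```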

superpose(reference: DaskMDTrajectory | Trajectory | None = None, frame: int = 0, atom_indices: ndarray | None = None, ref_atom_indices: ndarray | None = None, parallel: bool = True, inplace: bool = True) -> DaskMDTrajectory

Align trajectory to reference structure (in-place operation by default).

This method acts in-place on the trajectory by default, similar to MDTraj behavior.

Parameters

reference : DaskMDTrajectory, md.Trajectory, optional

Reference trajectory (if None, uses self as reference)

frame : int, default=0

Frame index from reference trajectory to use for alignment

atom_indices : np.ndarray, optional

Atoms to use for alignment on this trajectory

ref_atom_indices : np.ndarray, optional

Atoms to use for alignment on reference trajectory. If None, uses same as atom_indices.

parallel : bool, default=True

Use parallel processing (for MDTraj compatibility)

inplace : bool, default=True

If True, modify trajectory in place and return self. If False, return new trajectory instance.

Returns

DaskMDTrajectory

Self if inplace=True, otherwise new trajectory aligned to reference

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Align to own first frame in-place - modifies dask_traj
>>> dask_traj.superpose()
>>> # Align to own frame 10, create new trajectory
>>> aligned = dask_traj.superpose(frame=10, inplace=False)
>>> # Align to external trajectory frame 0
>>> other_traj = DaskMDTrajectory('other.xtc', 'topology.pdb')
>>> dask_traj.superpose(reference=other_traj, frame=0)
>>> # Align using only CA atoms
>>> ca_indices = dask_traj.topology.select('name CA')
>>> dask_traj.superpose(atom_indices=ca_indices)

Raises

ValueError

If reference frame index is out of range
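
For readers unfamiliar with superposition, the sketch below aligns one frame onto a reference with the Kabsch algorithm in NumPy. It is a conceptual illustration only: the name kabsch_align is hypothetical, and the routine actually used by MDTraj or DaskMDTrajectory may be a different, more optimized one.

```python
import numpy as np


def kabsch_align(mobile: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Rigidly align mobile (n_atoms, 3) onto reference (n_atoms, 3)."""
    reference_center = reference.mean(axis=0)
    p = mobile - mobile.mean(axis=0)
    q = reference - reference_center

    # Cross-covariance matrix of the centered point sets and its SVD.
    u, _, vt = np.linalg.svd(p.T @ q)

    # Flip the last axis if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    # Rotate the centered coordinates and move them onto the reference center.
    return p @ rotation.T + reference_center


# Build a rotated and shifted copy of a reference, then align it back.
rng = np.random.default_rng(0)
reference = rng.normal(size=(10, 3))
angle = 0.7
rot_z = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
mobile = reference @ rot_z.T + np.array([1.0, -2.0, 0.5])
aligned = kabsch_align(mobile, reference)
print(np.allclose(aligned, reference))
```

Aligning a whole trajectory amounts to applying the same step to every frame against the chosen reference frame, restricted to atom_indices when only a subset should drive the fit.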

smooth(width: int, order: int | None = None, atom_indices: ndarray | None = None, inplace: bool = False) -> DaskMDTrajectory

Apply smoothing filter to trajectory.

Parameters

width : int

Smoothing window width

order : int, optional

Polynomial order for Savitzky-Golay filter

atom_indices : np.ndarray, optional

Atoms to smooth (default: all)

inplace : bool, default=False

If True, modify trajectory in place and return self. If False, return new trajectory instance.

Returns

DaskMDTrajectory

Self if inplace=True, otherwise new trajectory with smoothed coordinates

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Apply smoothing with window width 5 (new trajectory)
>>> smoothed = dask_traj.smooth(width=5)
>>> # Smooth only backbone atoms in-place
>>> backbone = dask_traj.topology.select('backbone')
>>> dask_traj.smooth(width=3, atom_indices=backbone, inplace=True)

image_molecules(inplace: bool = True, anchor_molecules: ndarray | None = None, other_molecules: ndarray | None = None, sorted_bonds: ndarray | None = None, make_whole: bool = True) -> DaskMDTrajectory

Apply periodic boundary condition imaging to molecules.

Recenters molecules and wraps them into the primary unit cell using MDTraj’s image_molecules method. This operation modifies coordinates but does not change the number of atoms.

Parameters

inplace : bool, default=True

If True, modify trajectory in place and return self. If False, return new trajectory instance.

anchor_molecules : np.ndarray, optional

Indices of molecules to anchor at the origin. If None, uses all molecules.

other_molecules : np.ndarray, optional

Indices of other molecules to image relative to anchors. If None, uses all molecules not in anchor_molecules.

sorted_bonds : np.ndarray, optional

Pre-sorted bond array for performance optimization. If None, bonds are determined from topology.

make_whole : bool, default=True

Make molecules whole across periodic boundary conditions before imaging.

Returns

DaskMDTrajectory

Self if inplace=True, otherwise new trajectory with imaged coordinates

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Apply default imaging in-place
>>> dask_traj.image_molecules()
>>> # Image with specific anchor molecules, create new trajectory
>>> import numpy as np
>>> protein_molecules = np.array([0, 1, 2])
>>> imaged = dask_traj.image_molecules(
...     anchor_molecules=protein_molecules,
...     inplace=False
... )

remove_solvent(exclude: list | None = None, inplace: bool = False) -> DaskMDTrajectory

Remove solvent atoms from trajectory.

Creates new trajectory without solvent atoms using MDTraj’s remove_solvent method. This operation changes the number of atoms and creates a new topology.

Parameters

exclude : list, optional

List of solvent residue names to KEEP (not remove). Common values include ['HOH', 'WAT'] to keep water molecules. If None, removes all recognized solvent molecules.

inplace : bool, default=False

If True, modify trajectory in place and return self. If False, return new trajectory instance.

Returns

DaskMDTrajectory

Self if inplace=True, otherwise new trajectory without solvent atoms

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Remove all solvent, create new trajectory
>>> protein_only = dask_traj.remove_solvent()
>>> print(f"Atoms before: {dask_traj.n_atoms}")
>>> print(f"Atoms after: {protein_only.n_atoms}")
>>> # Keep water but remove other solvent
>>> keep_water = dask_traj.remove_solvent(exclude=['HOH', 'WAT'])
>>> # Remove solvent in-place (changes dask_traj)
>>> dask_traj.remove_solvent(inplace=True)
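
Because exclude lists the residues to keep rather than to remove, its effect is easy to misread. The sketch below shows the selection logic on a plain list of per-atom residue names. The solvent set is a made-up, abbreviated example and atoms_to_keep is a hypothetical helper; the residues MDTraj actually recognizes as solvent may differ.

```python
from __future__ import annotations

# Made-up, abbreviated solvent list for illustration only.
SOLVENT_RESIDUES = {"HOH", "WAT", "NA", "CL"}


def atoms_to_keep(residue_names: list[str], exclude: list[str] | None = None) -> list[int]:
    """Return indices of atoms that survive solvent removal.

    residue_names[i] is the residue name of atom i. Names listed in
    exclude are kept even though they count as solvent.
    """
    removable = SOLVENT_RESIDUES - set(exclude or [])
    return [i for i, name in enumerate(residue_names) if name not in removable]


names = ["ALA", "ALA", "HOH", "NA", "HOH"]
print(atoms_to_keep(names))                          # protein atoms only
print(atoms_to_keep(names, exclude=["HOH", "WAT"]))  # protein plus water
```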

join(other: DaskMDTrajectory, check_topology: bool = True) -> DaskMDTrajectory

Combine trajectories along frame axis.

Parameters

other : DaskMDTrajectory

Trajectory to join

check_topology : bool, default=True

Check topology compatibility

Returns

DaskMDTrajectory

New trajectory with combined frames

Examples

>>> traj1 = DaskMDTrajectory('part1.xtc', 'topology.pdb')
>>> traj2 = DaskMDTrajectory('part2.xtc', 'topology.pdb')
>>> combined = traj1.join(traj2)
>>> print(f"Combined: {combined.n_frames} frames")

Raises

ValueError

If trajectories have different number of atoms when check_topology=True

stack(other: DaskMDTrajectory) -> DaskMDTrajectory

Combine trajectories along atom axis.

Parameters

other : DaskMDTrajectory

Trajectory to stack

Returns

DaskMDTrajectory

New trajectory with combined atoms

Examples

>>> protein = DaskMDTrajectory('protein.xtc', 'protein.pdb')
>>> ligand = DaskMDTrajectory('ligand.xtc', 'ligand.pdb')
>>> complex_traj = protein.stack(ligand)
>>> print(f"Complex: {complex_traj.n_atoms} atoms")

Raises

ValueError

If trajectories have different number of frames
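
In terms of the coordinate array of shape (n_frames, n_atoms, 3), join concatenates along axis 0 and stack along axis 1. The NumPy sketch below mirrors the shape checks described for both methods; join_frames and stack_atoms are hypothetical names, and topology handling is left out.

```python
import numpy as np


def join_frames(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Concatenate along the frame axis; atom counts must match."""
    if a.shape[1] != b.shape[1]:
        raise ValueError("trajectories have different numbers of atoms")
    return np.concatenate([a, b], axis=0)


def stack_atoms(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Concatenate along the atom axis; frame counts must match."""
    if a.shape[0] != b.shape[0]:
        raise ValueError("trajectories have different numbers of frames")
    return np.concatenate([a, b], axis=1)


part1 = np.zeros((100, 50, 3))
part2 = np.zeros((40, 50, 3))
ligand = np.zeros((100, 5, 3))
print(join_frames(part1, part2).shape)   # frames add up, atoms unchanged
print(stack_atoms(part1, ligand).shape)  # atoms add up, frames unchanged
```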

slice(key: int | slice | ndarray, return_dask: bool = True) -> DaskMDTrajectory | Trajectory

Extract specific frames.

Parameters

key : int, slice, or array_like

Frame indices to extract

return_dask : bool, default=True

If True, returns a new DaskMDTrajectory with sliced data. If False, returns an md.Trajectory (original behavior).

Returns

DaskMDTrajectory or md.Trajectory

DaskMDTrajectory with selected frames (if return_dask=True) or MDTraj trajectory with selected frames (if return_dask=False)

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # Extract first 100 frames as DaskMDTrajectory (default)
>>> subset = dask_traj.slice(slice(0, 100))
>>> # Extract specific frames as md.Trajectory
>>> frames = dask_traj.slice([10, 50, 100, 200], return_dask=False)
>>> print(f"Extracted {frames.n_frames} frames")

Notes

When return_dask=True, uses lazy Dask array slicing to avoid memory issues. The slicing operation is completely lazy until data is actually accessed.
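
The idea of lazy slicing can be shown with a toy class that only records which frames were requested and reads data when asked to. LazyFrames and load_frame are hypothetical and unrelated to the real Dask machinery; the sketch supports slice keys only.

```python
from typing import Callable, List


class LazyFrames:
    """Toy lazy frame view: slicing composes indices, compute() reads data."""

    def __init__(self, load_frame: Callable[[int], int], indices: range):
        self._load_frame = load_frame
        self._indices = indices

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, key: slice) -> "LazyFrames":
        # Slicing a range yields another range, so no frame is read here.
        return LazyFrames(self._load_frame, self._indices[key])

    def compute(self) -> List[int]:
        # Frames are read only at this point.
        return [self._load_frame(i) for i in self._indices]


reads: List[int] = []


def load_frame(index: int) -> int:
    reads.append(index)   # record every actual read
    return index * 10     # stand-in for real frame data


frames = LazyFrames(load_frame, range(1000))
subset = frames[0:100][::25]        # two chained slices
print(len(subset), len(reads))      # four frames selected, none read yet
values = subset.compute()
print(values, reads)
```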

cleanup() -> None

Clean up memory resources and clear internal caches.

This method clears the coordinate and time caches held in memory and releases the handle to the underlying Zarr store. It does NOT delete any files on disk.

Returns

None

Clears caches in-place.

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> # ... perform calculations ...
>>> dask_traj.cleanup() # Free memory

memory_usage() -> dict

Get memory usage information.

Returns

dict

Memory usage statistics with keys: coordinates_size_mb, zarr_cache_size_mb, chunk_size, n_workers

Examples

>>> dask_traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> usage = dask_traj.memory_usage()
>>> print(f"Trajectory size: {usage['coordinates_size_mb']:.1f} MB")
>>> print(f"Cache size: {usage['zarr_cache_size_mb']:.1f} MB")

save(filepath: str) -> None

Save DaskMDTrajectory to a portable self-contained archive.

Creates a .dask_traj archive (tar + zstd) containing the object metadata and the underlying Zarr cache. The archive can be moved or shared freely without the original Zarr cache.

Parameters

filepath : str

Destination path. The .dask_traj extension is added automatically if not already present.

Returns

None

Examples

>>> traj = DaskMDTrajectory('trajectory.xtc', 'topology.pdb')
>>> traj.save('output/my_traj')
>>> # Creates 'output/my_traj.dask_traj'
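
The automatic extension handling described for filepath could look roughly like the following. with_archive_suffix is a hypothetical helper written for illustration; how save() resolves the path internally is not shown in this reference.

```python
from pathlib import Path


def with_archive_suffix(filepath: str, suffix: str = ".dask_traj") -> Path:
    """Append suffix unless filepath already ends with it (hypothetical helper)."""
    path = Path(filepath)
    if path.name.endswith(suffix):
        return path
    # Append instead of replacing, so names such as 'run.v2' keep their dots.
    return path.with_name(path.name + suffix)


print(with_archive_suffix("output/my_traj"))
print(with_archive_suffix("output/my_traj.dask_traj"))
```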

classmethod load(filepath: str) -> DaskMDTrajectory

Load DaskMDTrajectory from a .dask_traj archive.

Extracts the archive next to the file on first load; subsequent loads reuse the already-extracted Zarr cache.

Parameters

filepath : str

Path to a .dask_traj archive created by save().

Returns

DaskMDTrajectory

Fully initialised trajectory with coordinate access ready.

Raises

FileNotFoundError

If the archive does not exist.

Examples

>>> traj = DaskMDTrajectory.load('output/my_traj.dask_traj')
>>> print(traj.n_frames)
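
The extract-on-first-load behaviour can be sketched with the standard library. Several details are assumptions made for this illustration: the helper name extract_once, the name of the extraction directory, and gzip compression in place of zstd so that the example needs no third-party packages.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_once(archive: Path) -> Path:
    """Extract archive next to itself on first call, then reuse the result.

    Hypothetical helper: the '<name>.extracted' directory is an assumption.
    """
    if not archive.exists():
        raise FileNotFoundError(archive)
    target = archive.with_name(archive.name + ".extracted")
    if target.exists():
        return target  # extracted by an earlier load, reuse it
    # Use the safe "data" extraction filter where this Python provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:*") as tar:
        tar.extractall(target, **kwargs)
    return target


with tempfile.TemporaryDirectory() as tmp:
    payload = Path(tmp) / "cache"
    payload.mkdir()
    (payload / "meta.txt").write_text("n_frames=1000")

    archive = Path(tmp) / "my_traj.dask_traj"
    with tarfile.open(archive, "w:gz") as tar:   # gzip stands in for zstd
        tar.add(payload, arcname="cache")

    first = extract_once(archive)    # extracts the archive
    second = extract_once(archive)   # reuses the extracted directory
    print(first == second, (first / "cache" / "meta.txt").read_text())
```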