Archive Utils


Archive utilities for pipeline persistence and sharing.

This module provides utilities for creating and extracting compressed archives containing pipeline data. Supports filtering of visualization files and structure files for flexible archive creation.

class mdxplain.utils.archive_utils.ArchiveUtils

Utilities for creating and extracting pipeline archives.

Provides static methods for compressing pipeline data into portable archives and extracting them. Supports selective inclusion of files based on type (essential data, visualizations, structure files).

Examples

>>> # Create archive from pipeline data
>>> archive_path = ArchiveUtils.create_archive(
...     pipeline_data, "analysis.tar.zst"
... )

>>> # Extract archive
>>> extract_dir = ArchiveUtils.extract_archive("analysis.tar.zst")

static is_essential_file(suffix: str, use_memmap: bool) → bool

Check if file is essential for pipeline load.

Which files are essential depends on memmap usage. Pickle files are always essential; memmap files (.dat) are essential only if use_memmap=True.

Parameters

suffix (str): File extension (lowercase with dot)
use_memmap (bool): Whether pipeline uses memory mapping

Returns

bool: True if file is essential for pipeline loading

Examples

>>> ArchiveUtils.is_essential_file('.dat', use_memmap=True)
True
>>> ArchiveUtils.is_essential_file('.dat', use_memmap=False)
False
>>> ArchiveUtils.is_essential_file('.pkl', use_memmap=False)
True

static is_zarr_directory(path: Path) → bool

Check if path is a zarr archive directory.

Zarr archives are directories used for trajectory caching with DaskMDTrajectory. Essential for trajectory loading.

Parameters

path (Path): Path to check

Returns

bool: True if path is zarr directory

Examples

>>> path = Path("cache/traj0.dask.zarr")
>>> ArchiveUtils.is_zarr_directory(path)
True

static is_visualization_file(suffix: str) → bool

Check if file is visualization output.

Visualization files are plot outputs that can be regenerated and are typically excluded from minimal archives.

Parameters

suffix (str): File extension (lowercase with dot)

Returns

bool: True if file is a visualization output

Examples

>>> ArchiveUtils.is_visualization_file('.png')
True
>>> ArchiveUtils.is_visualization_file('.dat')
False

static is_structure_file(suffix: str) → bool

Check if file is structure output.

Structure files include PDB coordinates and PyMOL scripts generated from feature importance analysis.

Parameters

suffix (str): File extension (lowercase with dot)

Returns

bool: True if file is a structure file

Examples

>>> ArchiveUtils.is_structure_file('.pdb')
True
>>> ArchiveUtils.is_structure_file('.dat')
False

static should_include_file(file_path: Path, exclude_visualizations: bool, include_structure_files: bool, use_memmap: bool) → bool

Determine if file should be included in archive.

Applies filtering logic based on file type and user preferences. Essential files depend on memmap usage.

Parameters

file_path (Path): Path to file to check
exclude_visualizations (bool): If True, exclude plot outputs (PNG, PDF, etc.)
include_structure_files (bool): If True, include PDB/PML structure files
use_memmap (bool): Whether pipeline uses memory mapping

Returns

bool: True if file should be included in archive

Examples

>>> path = Path("cache/features.dat")
>>> ArchiveUtils.should_include_file(path, True, True, True)
True
>>> ArchiveUtils.should_include_file(path, True, True, False)
False
>>> path = Path("plots/landscape.png")
>>> ArchiveUtils.should_include_file(path, True, True, True)
False
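
The filtering order described above can be sketched roughly as follows. This is an illustration only, not the module's implementation: the suffix sets and the name `should_include_sketch` are hypothetical, and the treatment of unrecognised file types is an assumption.

```python
from pathlib import Path

# Hypothetical suffix sets, chosen only for illustration.
ALWAYS_ESSENTIAL = {".pkl"}
MEMMAP_ONLY = {".dat"}
VISUALIZATION = {".png", ".pdf"}
STRUCTURE = {".pdb", ".pml"}


def should_include_sketch(
    file_path: Path,
    exclude_visualizations: bool,
    include_structure_files: bool,
    use_memmap: bool,
) -> bool:
    """Decide whether a file belongs in the archive, based on its suffix."""
    suffix = file_path.suffix.lower()
    if suffix in ALWAYS_ESSENTIAL:
        return True  # pickles are always needed
    if suffix in MEMMAP_ONLY:
        return use_memmap  # memmap data matters only when memmap is on
    if suffix in VISUALIZATION:
        return not exclude_visualizations
    if suffix in STRUCTURE:
        return include_structure_files
    return False  # assumption: unrecognised file types are left out
```

Checking the essential categories first means a `.dat` file is governed by `use_memmap` alone, regardless of the visualization and structure flags.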

static collect_cache_files(cache_dir: str, exclude_visualizations: bool, include_structure_files: bool, use_memmap: bool) → List[Tuple[str, str]]

Collect all files and zarr directories from cache for archiving.

Recursively scans cache directory and collects files and zarr directories matching the specified filter criteria.

Parameters

cache_dir (str): Path to cache directory
exclude_visualizations (bool): If True, exclude plot outputs
include_structure_files (bool): If True, include PDB/PML files
use_memmap (bool): Whether pipeline uses memory mapping

Returns

List[Tuple[str, str]]: List of (absolute_path, archive_path) tuples

Examples

>>> files = ArchiveUtils.collect_cache_files(
...     "./cache", exclude_visualizations=True,
...     include_structure_files=True, use_memmap=True
... )
>>> len(files) > 0
True

Notes

Files are filtered by extension
Zarr directories only included if use_memmap=True
.dat files only included if use_memmap=True
Zarr directories are added as directories, not individual files
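
A minimal sketch of the scan described in these notes, assuming zarr stores are recognised by a `.zarr` directory name and that archive paths are taken relative to the cache root. Both are assumptions, and `collect_sketch` and its `keep_file` callback are hypothetical names used only for this illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def collect_sketch(
    cache_dir: str, use_memmap: bool, keep_file: Callable[[Path], bool]
) -> List[Tuple[str, str]]:
    """Return (absolute_path, archive_path) pairs for files and zarr directories."""
    root = Path(cache_dir).resolve()
    collected = []
    for current, dirnames, filenames in os.walk(root):
        current_path = Path(current)
        for name in list(dirnames):
            if name.endswith(".zarr"):
                dirnames.remove(name)  # do not descend into the zarr store
                if use_memmap:
                    full = current_path / name
                    # The directory itself is recorded, not its chunk files.
                    collected.append((str(full), full.relative_to(root).as_posix()))
        for name in filenames:
            full = current_path / name
            if keep_file(full):
                collected.append((str(full), full.relative_to(root).as_posix()))
    return sorted(collected)


# Small demonstration on a throwaway cache directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "traj0.dask.zarr").mkdir()
    (Path(tmp) / "traj0.dask.zarr" / "chunk0").write_bytes(b"x")
    (Path(tmp) / "features.dat").write_bytes(b"x")
    (Path(tmp) / "plot.png").write_bytes(b"x")
    keep_dat = lambda p: p.suffix == ".dat"
    archive_names = [arc for _, arc in collect_sketch(tmp, True, keep_dat)]
    names_without_memmap = [arc for _, arc in collect_sketch(tmp, False, lambda p: False)]
```

Pruning the zarr directory from `dirnames` keeps its internal chunk files out of the result, so the store is archived as a single directory entry.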

static is_sha256_string(value: str) → bool

Check whether value is a raw SHA256 hex digest.

Parameters

value (str): Candidate SHA256 string.

Returns

bool: True when the value is a 64-character hexadecimal digest.
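
As an illustration of the check described here, a 64-character hexadecimal test can be written with a regular expression. The helper name `looks_like_sha256` is hypothetical, and accepting uppercase hex digits is an assumption; the reference does not say how the real method treats case.

```python
import re

# Exactly 64 hexadecimal characters, nothing before or after.
_SHA256_PATTERN = re.compile(r"[0-9a-fA-F]{64}")


def looks_like_sha256(value: str) -> bool:
    """Return True when value is a bare 64-character hex digest."""
    return isinstance(value, str) and _SHA256_PATTERN.fullmatch(value) is not None
```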

static parse_sha256_text(text: str) → str

Parse a SHA256 value from raw text or sha256sum-style content.

Parameters

text (str): Raw text containing a SHA256 digest.

Returns

str: Normalized lowercase SHA256 digest.

Raises

ValueError: If no valid SHA256 digest can be parsed from the text.
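
One way such parsing could work is sketched below, assuming the digest is the first whitespace-separated token, as in GNU `sha256sum` output (`<digest>  <filename>`). The function name `parse_sha256_sketch` is hypothetical, and other layouts, such as the BSD-style `SHA256 (file) = <digest>`, are not handled by this sketch.

```python
import re


def parse_sha256_sketch(text: str) -> str:
    """Extract a lowercase SHA256 digest from raw or sha256sum-style text."""
    tokens = text.strip().split()
    # A raw digest is a single token; sha256sum output puts the digest first.
    if tokens and re.fullmatch(r"[0-9a-fA-F]{64}", tokens[0]):
        return tokens[0].lower()
    raise ValueError("no SHA256 digest found in text")
```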

static compute_sha256(file_path: str) → str

Compute the SHA256 digest of a local file.

Parameters

file_path (str): Path to the file to hash.

Returns

str: Lowercase SHA256 digest.
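
Archives can be large, so a digest is usually computed in chunks rather than by reading the whole file into memory. The sketch below shows that pattern with the standard library; the name `sha256_of_file` and the 1 MiB chunk size are illustrative choices, not details taken from the module.

```python
import hashlib
import os
import tempfile


def sha256_of_file(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as handle:
        handle.write(b"abc")
    demo_digest = sha256_of_file(sample)
```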

static get_sha256_file_path(archive_path: str) → str

Build the sidecar .sha path for an archive file.

Parameters

archive_path (str): Local archive path.

Returns

str: Normalized absolute path to the sidecar SHA256 file.

static write_sha256_file(archive_path: str, sha_file_path: str | None = None) → str

Write a .sha sidecar file for an archive.

Parameters

archive_path (str): Local archive path.
sha_file_path (str, optional): Explicit output path for the SHA256 sidecar file.

Returns

str: Path to the written SHA256 file.

static resolve_sha_output_path(archive_path: str, sha: bool | str) → str | None

Resolve the requested SHA256 output path for an archive.

Parameters

archive_path (str): Target archive file path.
sha (bool or str): False disables SHA output, True uses the default sidecar path, and a string is treated as an explicit SHA256 output path.

Returns

str or None: Normalized SHA256 output path when enabled, otherwise None.
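
The three cases for `sha` can be sketched as below. This assumes the default sidecar path is the archive path with `.sha` appended and that "normalized" means an absolute path; `resolve_sha_sketch` is a hypothetical name for this illustration.

```python
import os
from typing import Optional, Union


def resolve_sha_sketch(archive_path: str, sha: Union[bool, str]) -> Optional[str]:
    """Map the sha option to an output path, or None when SHA output is disabled."""
    if sha is False:
        return None
    if sha is True:
        # Assumed default: sidecar file next to the archive.
        return os.path.abspath(archive_path + ".sha")
    return os.path.abspath(sha)  # explicit path supplied by the caller
```

The identity checks (`is False`, `is True`) keep booleans and strings apart, so an explicit path is never mistaken for a flag.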

static create_archive(pipeline_data, archive_path: str, compression: str = 'zst', exclude_visualizations: bool = True, include_structure_files: bool = True, compression_level: int | None = None, zstd_threads: int | None = None, reserve_cores: int = 2, sha: bool | str = True, overwrite: bool = False) → str

Create compressed archive with pipeline and cache files.

Creates a tar archive containing the pipeline pickle and the filtered cache directory files, using maximum compression.

Parameters

pipeline_data (PipelineData): Pipeline data object to save
archive_path (str): Path for output archive (extension added if missing)
compression (str, default="zst"): Compression method: "zst", "bz2", or "gz"
exclude_visualizations (bool, default=True): If True, exclude plot outputs
include_structure_files (bool, default=True): If True, include PDB/PML files
compression_level (int, optional): Compression level override. For zst this maps to level (1-19).
zstd_threads (int, optional): Thread count for zstd compression. If None, uses max(1, cpu_count - reserve_cores).
reserve_cores (int, default=2): Number of CPU cores to keep free for automatic zstd thread selection.
sha (bool or str, default=True): If True, write <archive>.sha next to the created archive. When a string is provided, it is used as the explicit SHA256 output path.
overwrite (bool, default=False): If True, replace existing archive outputs. When False, existing archive or SHA256 files raise FileExistsError.

Returns

str: Path to created archive file

Raises

ValueError: If the compression method is not supported
FileExistsError: If the archive or SHA256 file already exists and overwrite=False

Examples

>>> archive = ArchiveUtils.create_archive(
...     pipeline_data, "analysis.tar.zst"
... )
>>> Path(archive).exists()
True

Notes

Uses tempfile for pickle creation
Preserves relative paths in archive
zstd compression uses the zstandard Python library with streaming I/O
With use_memmap=False: Only pickle needed (all data in objects)
With use_memmap=True: Pickle + .dat files + zarr directories
tar.add() automatically handles both files and directories
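
The automatic thread selection documented for `zstd_threads` follows the formula max(1, cpu_count - reserve_cores). A small sketch of that arithmetic is shown below; `auto_zstd_threads` is a hypothetical helper, and its `cpu_count` argument exists only so the calculation can be tried with fixed numbers.

```python
import os
from typing import Optional


def auto_zstd_threads(reserve_cores: int = 2, cpu_count: Optional[int] = None) -> int:
    """Pick a thread count that leaves reserve_cores free, never below 1."""
    if cpu_count is None:
        # os.cpu_count() may return None on some platforms.
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count - reserve_cores)
```

On an 8-core machine this yields 6 threads with the default `reserve_cores=2`; on machines with two or fewer cores the floor of 1 applies.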

static extract_archive(archive_path: str, extract_to: str = None) → Path

Extract archive and return extraction directory.

Extracts compressed tar archive preserving directory structure. Creates extraction directory if it does not exist.

Parameters

archive_path (str): Path to archive file
extract_to (str, optional): Directory to extract to. If None, uses the archive's parent directory with the archive stem as the subdirectory name.

Returns

Path: Path to extraction directory

Raises

FileNotFoundError: If archive does not exist

Examples

>>> extract_dir = ArchiveUtils.extract_archive("analysis.tar.zst")
>>> (extract_dir / "pipeline.pkl").exists()
True

>>> extract_dir = ArchiveUtils.extract_archive(
...     "analysis.tar.zst",
...     extract_to="./restored"
... )

Notes

Automatically detects compression from file extension
Creates parent directories if needed
Preserves file permissions and timestamps
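
Detecting compression from the file extension can be sketched with the standard library's `tarfile` module. The sketch below covers only the `gz` and `bz2` cases; `zst` archives need an additional decompression layer that is not shown here. The suffix-to-mode mapping and the name `extract_sketch` are assumptions made for illustration, not the module's code.

```python
import tarfile
import tempfile
from pathlib import Path

# Assumed mapping from archive suffix to a stdlib tarfile read mode.
READ_MODES = {".gz": "r:gz", ".bz2": "r:bz2"}


def extract_sketch(archive_path: str, extract_to: str) -> Path:
    """Extract a .tar.gz or .tar.bz2 archive and return the target directory."""
    archive = Path(archive_path)
    if not archive.exists():
        raise FileNotFoundError(archive_path)
    mode = READ_MODES[archive.suffix]  # KeyError for suffixes this sketch skips
    target = Path(extract_to)
    target.mkdir(parents=True, exist_ok=True)  # create parents if needed
    # Use the safer "data" extraction filter where the interpreter supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, mode) as tar:
        tar.extractall(target, **kwargs)
    return target


# Round trip on a throwaway archive holding a placeholder pickle file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "pipeline.pkl"
    source.write_bytes(b"placeholder")
    archive = Path(tmp) / "analysis.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname="pipeline.pkl")
    out_dir = extract_sketch(str(archive), str(Path(tmp) / "restored"))
    restored_ok = (out_dir / "pipeline.pkl").read_bytes() == b"placeholder"
```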