Archive Utils

Archive utilities for pipeline persistence and sharing.

This module provides utilities for creating and extracting compressed archives containing pipeline data. Supports filtering of visualization files and structure files for flexible archive creation.

class mdxplain.utils.archive_utils.ArchiveUtils

Utilities for creating and extracting pipeline archives.

Provides static methods for compressing pipeline data into portable archives and extracting them. Supports selective inclusion of files based on type (essential data, visualizations, structure files).

Examples

>>> # Create archive from pipeline data
>>> archive_path = ArchiveUtils.create_archive(
...     pipeline_data, "analysis.tar.zst"
... )
>>> # Extract archive
>>> extract_dir = ArchiveUtils.extract_archive("analysis.tar.zst")

static is_essential_file(suffix: str, use_memmap: bool) → bool

Check whether a file is essential for loading the pipeline.

Which files are essential depends on memmap usage. The pickle is always essential. Memmap files (.dat) are essential only if use_memmap=True.

Parameters

suffix : str

File extension (lowercase with dot)

use_memmap : bool

Whether pipeline uses memory mapping

Returns

bool

True if file is essential for pipeline loading

Examples

>>> ArchiveUtils.is_essential_file('.dat', use_memmap=True)
True
>>> ArchiveUtils.is_essential_file('.dat', use_memmap=False)
False
>>> ArchiveUtils.is_essential_file('.pkl', use_memmap=False)
True
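
The rule described above can be restated as a small standalone function. This is only a sketch, not the mdxplain implementation: `is_essential_sketch` is a hypothetical name, and it assumes the pipeline pickle uses the `.pkl` suffix, as in the example above.

```python
def is_essential_sketch(suffix: str, use_memmap: bool) -> bool:
    """Hypothetical restatement of the documented rule."""
    # Assumption: the pipeline pickle is stored with the ".pkl" suffix.
    if suffix == ".pkl":
        return True
    # Memmap data files only matter when the pipeline uses memory mapping.
    if suffix == ".dat":
        return use_memmap
    # Everything else is treated as non-essential in this sketch.
    return False
```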

static is_zarr_directory(path: Path) → bool

Check whether a path is a zarr archive directory.

Zarr archives are directories used for trajectory caching with DaskMDTrajectory. They are essential for trajectory loading.

Parameters

path : Path

Path to check

Returns

bool

True if path is zarr directory

Examples

>>> path = Path("cache/traj0.dask.zarr")
>>> ArchiveUtils.is_zarr_directory(path)
True
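
One plausible way to detect such a directory is sketched below. The check itself is an assumption (an existing directory whose name ends in `.zarr`); the real method may use different criteria, and `is_zarr_directory_sketch` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def is_zarr_directory_sketch(path: Path) -> bool:
    """Hypothetical check, not the mdxplain implementation."""
    # Assumption: a zarr store is an existing directory named "*.zarr".
    return path.is_dir() and path.name.endswith(".zarr")


# Small demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "traj0.dask.zarr"
    store.mkdir()
    plain_file = Path(tmp) / "features.dat"
    plain_file.write_bytes(b"")
    store_is_zarr = is_zarr_directory_sketch(store)
    file_is_zarr = is_zarr_directory_sketch(plain_file)
```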

static is_visualization_file(suffix: str) → bool

Check if file is visualization output.

Visualization files are plot outputs that can be regenerated and are typically excluded from minimal archives.

Parameters

suffix : str

File extension (lowercase with dot)

Returns

bool

True if file is a visualization output

Examples

>>> ArchiveUtils.is_visualization_file('.png')
True
>>> ArchiveUtils.is_visualization_file('.dat')
False

static is_structure_file(suffix: str) → bool

Check if file is structure output.

Structure files include PDB coordinates and PyMOL scripts generated from feature importance analysis.

Parameters

suffix : str

File extension (lowercase with dot)

Returns

bool

True if file is a structure file

Examples

>>> ArchiveUtils.is_structure_file('.pdb')
True
>>> ArchiveUtils.is_structure_file('.dat')
False

static should_include_file(file_path: Path, exclude_visualizations: bool, include_structure_files: bool, use_memmap: bool) → bool

Determine if file should be included in archive.

Applies filtering logic based on file type and user preferences. Essential files depend on memmap usage.

Parameters

file_path : Path

Path to file to check

exclude_visualizations : bool

If True, exclude plot outputs (PNG, PDF, etc.)

include_structure_files : bool

If True, include PDB/PML structure files

use_memmap : bool

Whether pipeline uses memory mapping

Returns

bool

True if file should be included in archive

Examples

>>> path = Path("cache/features.dat")
>>> ArchiveUtils.should_include_file(path, True, True, True)
True
>>> ArchiveUtils.should_include_file(path, True, True, False)
False
>>> path = Path("plots/landscape.png")
>>> ArchiveUtils.should_include_file(path, True, True, True)
False
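
The combined filtering decision might look roughly like the sketch below. It is an illustration under stated assumptions, not the library code: the suffix sets are guesses based on the file types named in this page (the real sets may be larger), files of unknown type are assumed to be excluded, and `should_include_sketch` is a hypothetical name.

```python
from pathlib import Path

# Assumed suffix sets; the real module may recognise more extensions.
VISUALIZATION_SUFFIXES = {".png", ".pdf"}
STRUCTURE_SUFFIXES = {".pdb", ".pml"}


def should_include_sketch(
    file_path: Path,
    exclude_visualizations: bool,
    include_structure_files: bool,
    use_memmap: bool,
) -> bool:
    """Hypothetical filter mirroring the documented behaviour."""
    suffix = file_path.suffix.lower()
    if suffix == ".pkl":  # assumption: the pickle uses ".pkl"
        return True
    if suffix == ".dat":  # memmap data only matters with use_memmap=True
        return use_memmap
    if suffix in VISUALIZATION_SUFFIXES:
        return not exclude_visualizations
    if suffix in STRUCTURE_SUFFIXES:
        return include_structure_files
    # Assumption: files of unknown type are left out of the archive.
    return False
```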

static collect_cache_files(cache_dir: str, exclude_visualizations: bool, include_structure_files: bool, use_memmap: bool) → List[Tuple[str, str]]

Collect all files and zarr directories from cache for archiving.

Recursively scans cache directory and collects files and zarr directories matching the specified filter criteria.

Parameters

cache_dir : str

Path to cache directory

exclude_visualizations : bool

If True, exclude plot outputs

include_structure_files : bool

If True, include PDB/PML files

use_memmap : bool

Whether pipeline uses memory mapping

Returns

List[Tuple[str, str]]

List of (absolute_path, archive_path) tuples

Examples

>>> files = ArchiveUtils.collect_cache_files(
...     "./cache", exclude_visualizations=True,
...     include_structure_files=True, use_memmap=True
... )
>>> len(files) > 0
True

Notes

  • Files are filtered by extension

  • Zarr directories only included if use_memmap=True

  • .dat files only included if use_memmap=True

  • Zarr directories are added as directories, not individual files
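
The notes above can be illustrated with a simplified directory walk. This sketch only handles `.dat` files and zarr directories, assumes zarr stores are directories named `*.zarr`, and assumes archive paths are relative to the cache directory; `collect_sketch` is a hypothetical name, not the library function.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Tuple


def collect_sketch(cache_dir: str, use_memmap: bool) -> List[Tuple[str, str]]:
    """Hypothetical walk returning (absolute_path, archive_path) tuples."""
    root = Path(cache_dir).resolve()
    collected = []
    for current, dirnames, filenames in os.walk(root):
        zarr_dirs = [d for d in dirnames if d.endswith(".zarr")]
        # Prune zarr stores so the walk never descends into them;
        # they are added as whole directories instead.
        dirnames[:] = [d for d in dirnames if d not in zarr_dirs]
        if use_memmap:
            for name in zarr_dirs:
                full = Path(current) / name
                collected.append((str(full), full.relative_to(root).as_posix()))
        for name in filenames:
            full = Path(current) / name
            if full.suffix.lower() == ".dat" and use_memmap:
                collected.append((str(full), full.relative_to(root).as_posix()))
    return sorted(collected)


# Demo on a throwaway cache directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "traj0.dask.zarr").mkdir()
    (Path(tmp) / "traj0.dask.zarr" / "chunk0").write_bytes(b"x")
    (Path(tmp) / "features.dat").write_bytes(b"x")
    (Path(tmp) / "landscape.png").write_bytes(b"x")
    with_memmap = [arc for _, arc in collect_sketch(tmp, use_memmap=True)]
    without_memmap = [arc for _, arc in collect_sketch(tmp, use_memmap=False)]
```

Pruning `dirnames` in place is what keeps the individual chunk files inside a zarr store out of the result.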

static is_sha256_string(value: str) → bool

Check whether value is a raw SHA256 hex digest.

Parameters

value : str

Candidate SHA256 string.

Returns

bool

True when the value is a 64-character hexadecimal digest.
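
A check of this kind is commonly written with a regular expression, as in the sketch below. Whether the real method accepts uppercase hex digits is an assumption here, and `is_sha256_sketch` is a hypothetical name.

```python
import re

# 64 hexadecimal characters; accepting uppercase is an assumption.
_SHA256_RE = re.compile(r"[0-9a-fA-F]{64}")


def is_sha256_sketch(value: str) -> bool:
    """Hypothetical check for a raw SHA256 hex digest."""
    return _SHA256_RE.fullmatch(value) is not None
```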

static parse_sha256_text(text: str) → str

Parse a SHA256 value from raw text or sha256sum-style content.

Parameters

text : str

Raw text containing a SHA256 digest.

Returns

str

Normalized lowercase SHA256 digest.

Raises

ValueError

If no valid SHA256 digest can be parsed from the text.
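
For orientation, `sha256sum` writes lines of the form `<digest>  <filename>`. The sketch below shows one way to accept both that format and a bare digest. It is an assumption about the parsing strategy, not the library code, and `parse_sha256_sketch` is a hypothetical name.

```python
import re


def parse_sha256_sketch(text: str) -> str:
    """Hypothetical parser for a bare digest or sha256sum-style content."""
    for line in text.splitlines():
        parts = line.split()
        # The digest is the first whitespace-separated token on the line.
        if parts and re.fullmatch(r"[0-9a-fA-F]{64}", parts[0]):
            return parts[0].lower()
    raise ValueError("no valid SHA256 digest found in text")
```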

static compute_sha256(file_path: str) → str

Compute the SHA256 digest of a local file.

Parameters

file_path : str

Path to the file to hash.

Returns

str

Lowercase SHA256 digest.
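
Archives can be large, so hashing is usually done in chunks rather than by reading the whole file into memory. The following sketch uses the standard `hashlib` module; the chunk size and the name `compute_sha256_sketch` are illustrative assumptions.

```python
import hashlib
import os
import tempfile


def compute_sha256_sketch(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hypothetical chunked SHA256 of a local file."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()  # hexdigest() is already lowercase


# Demo on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as handle:
        handle.write(b"abc")
    sample_digest = compute_sha256_sketch(sample)
```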

static get_sha256_file_path(archive_path: str) → str

Build the sidecar .sha path for an archive file.

Parameters

archive_path : str

Local archive path.

Returns

str

Normalized absolute path to the sidecar SHA256 file.

static write_sha256_file(archive_path: str, sha_file_path: str | None = None) → str

Write a .sha sidecar file for an archive.

Parameters

archive_path : str

Local archive path.

sha_file_path : str, optional

Explicit output path for the SHA256 sidecar file.

Returns

str

Path to the written SHA256 file.
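
A sidecar workflow could be sketched as follows. Two points are assumptions rather than documented behaviour: the default sidecar name is taken to be the archive path with `.sha` appended, and the file content is written in `sha256sum` style. Both function names are hypothetical.

```python
import hashlib
import os
import tempfile
from typing import Optional


def sidecar_path_sketch(archive_path: str) -> str:
    # Assumption: "<archive>.sha" means the full archive name plus ".sha".
    return os.path.abspath(archive_path + ".sha")


def write_sidecar_sketch(archive_path: str, sha_file_path: Optional[str] = None) -> str:
    """Hypothetical sidecar writer; reads the whole file for brevity."""
    target = sha_file_path or sidecar_path_sketch(archive_path)
    with open(archive_path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    # Assumption: sha256sum-style content, "<digest>  <filename>".
    with open(target, "w", encoding="utf-8") as handle:
        handle.write(f"{digest}  {os.path.basename(archive_path)}\n")
    return target


# Demo with a placeholder file standing in for a real archive.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "analysis.tar.zst")
    with open(archive, "wb") as handle:
        handle.write(b"not a real archive")
    written = write_sidecar_sketch(archive)
    with open(written, encoding="utf-8") as handle:
        sidecar_text = handle.read()
    sidecar_name = os.path.basename(written)
```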

static resolve_sha_output_path(archive_path: str, sha: bool | str) → str | None

Resolve the requested SHA256 output path for an archive.

Parameters

archive_path : str

Target archive file path.

sha : bool or str

False disables SHA output, True uses the default sidecar path, and a string is treated as an explicit SHA256 output path.

Returns

str or None

Normalized SHA256 output path when enabled, otherwise None.
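
The three cases of the `sha` argument can be expressed compactly, as in this sketch. It assumes the default sidecar path is the archive path with `.sha` appended and that "normalized" means an absolute path; `resolve_sha_sketch` is a hypothetical name.

```python
import os
from typing import Optional, Union


def resolve_sha_sketch(archive_path: str, sha: Union[bool, str]) -> Optional[str]:
    """Hypothetical resolution of the SHA256 output path."""
    if sha is False:
        return None  # SHA output disabled
    if sha is True:
        # Assumption: default sidecar is "<archive>.sha" next to the archive.
        return os.path.abspath(archive_path + ".sha")
    # Any string is treated as an explicit output path.
    return os.path.abspath(sha)
```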

static create_archive(pipeline_data, archive_path: str, compression: str = 'zst', exclude_visualizations: bool = True, include_structure_files: bool = True, compression_level: int | None = None, zstd_threads: int | None = None, reserve_cores: int = 2, sha: bool | str = True, overwrite: bool = False) → str

Create compressed archive with pipeline and cache files.

Creates tar archive containing pipeline pickle and filtered cache directory files with maximum compression.

Parameters

pipeline_data : PipelineData

Pipeline data object to save

archive_path : str

Path for output archive (extension added if missing)

compression : str, default="zst"

Compression method: "zst", "bz2", or "gz"

exclude_visualizations : bool, default=True

If True, exclude plot outputs

include_structure_files : bool, default=True

If True, include PDB/PML files

compression_level : int, optional

Compression level override. For zst this maps to the zstd compression level (1-19).

zstd_threads : int, optional

Thread count for zstd compression. If None, uses max(1, cpu_count - reserve_cores).

reserve_cores : int, default=2

Number of CPU cores to keep free for automatic zstd thread selection.

sha : bool or str, default=True

If True, write <archive>.sha next to the created archive. If False, no SHA256 file is written. When a string is provided, it is used as the explicit SHA256 output path.

overwrite : bool, default=False

If True, replace existing archive outputs. When False, existing archive or SHA256 files raise FileExistsError.

Returns

str

Path to created archive file

Raises

ValueError

If the compression method is not supported

Examples

>>> archive = ArchiveUtils.create_archive(
...     pipeline_data, "analysis.tar.zst"
... )
>>> Path(archive).exists()
True

Notes

  • Uses tempfile for pickle creation

  • Preserves relative paths in archive

  • zstd compression uses the zstandard Python library with streaming I/O

  • With use_memmap=False: Only pickle needed (all data in objects)

  • With use_memmap=True: Pickle + .dat files + zarr directories

  • tar.add() automatically handles both files and directories
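
The overall flow in the notes above (pickle to a temporary file, then add everything to a tar archive) can be sketched with the standard library alone. This sketch deliberately uses gzip via `tarfile` instead of zstd so that it needs no third-party package. The function names, the `pipeline.pkl` member name (taken from the extraction example on this page) and the archive layout are illustrative assumptions, not the mdxplain implementation.

```python
import os
import pickle
import tarfile
import tempfile
from pathlib import Path
from typing import List, Tuple


def auto_threads_sketch(reserve_cores: int = 2) -> int:
    # Mirrors the documented formula max(1, cpu_count - reserve_cores).
    return max(1, (os.cpu_count() or 1) - reserve_cores)


def create_archive_sketch(obj, cache_files: List[Tuple[str, str]], archive_path: str) -> str:
    """Hypothetical gz-only archive writer (the real default is zst)."""
    with tempfile.TemporaryDirectory() as tmp:
        pickle_path = Path(tmp) / "pipeline.pkl"
        with open(pickle_path, "wb") as handle:
            pickle.dump(obj, handle)
        with tarfile.open(archive_path, "w:gz") as tar:
            tar.add(str(pickle_path), arcname="pipeline.pkl")
            # tar.add() handles plain files and whole directories alike.
            for absolute_path, arcname in cache_files:
                tar.add(absolute_path, arcname=arcname)
    return archive_path


# Demo: archive a small object together with one cache file.
with tempfile.TemporaryDirectory() as work:
    data_file = Path(work) / "features.dat"
    data_file.write_bytes(b"\x00" * 16)
    out = create_archive_sketch(
        {"demo": 1},
        [(str(data_file), "features.dat")],
        str(Path(work) / "analysis.tar.gz"),
    )
    with tarfile.open(out) as tar:
        member_names = sorted(tar.getnames())
```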

static extract_archive(archive_path: str, extract_to: str = None) → Path

Extract archive and return extraction directory.

Extracts compressed tar archive preserving directory structure. Creates extraction directory if it does not exist.

Parameters

archive_path : str

Path to archive file

extract_to : str, optional

Directory to extract to. If None, uses archive parent directory with archive stem as subdirectory name.

Returns

Path

Path to extraction directory

Raises

FileNotFoundError

If archive does not exist

Examples

>>> extract_dir = ArchiveUtils.extract_archive("analysis.tar.zst")
>>> (extract_dir / "pipeline.pkl").exists()
True
>>> extract_dir = ArchiveUtils.extract_archive(
...     "analysis.tar.zst",
...     extract_to="./restored"
... )

Notes

  • Automatically detects compression from file extension

  • Creates parent directories if needed

  • Preserves file permissions and timestamps
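
The default extraction directory and the extension-based detection can be sketched as below. Several points are assumptions: the "archive stem" is taken to be the file name without its full `.tar.*` extension, only gz and bz2 are handled (zst would need the third-party zstandard package), and all names are hypothetical. The extraction filter is used when the running Python version provides it.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import Optional

# Stdlib-only modes; ".zst" is intentionally left out of this sketch.
TAR_MODES = {".gz": "r:gz", ".bz2": "r:bz2"}


def default_extract_dir_sketch(archive_path: str) -> Path:
    archive = Path(archive_path)
    name = archive.name
    # Assumption: strip the full ".tar.*" extension to get the stem.
    for ext in (".tar.zst", ".tar.bz2", ".tar.gz"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    return archive.parent / name


def extract_sketch(archive_path: str, extract_to: Optional[str] = None) -> Path:
    """Hypothetical extractor for gz/bz2 tar archives."""
    archive = Path(archive_path)
    if not archive.exists():
        raise FileNotFoundError(archive_path)
    mode = TAR_MODES.get(archive.suffix)
    if mode is None:
        raise ValueError(f"unsupported extension in sketch: {archive.suffix}")
    target = Path(extract_to) if extract_to else default_extract_dir_sketch(archive_path)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode) as tar:
        if hasattr(tarfile, "data_filter"):
            tar.extractall(target, filter="data")  # safer on newer Pythons
        else:
            tar.extractall(target)
    return target


# Demo: round-trip a tiny archive.
with tempfile.TemporaryDirectory() as work:
    src = Path(work) / "pipeline.pkl"
    src.write_bytes(b"demo")
    archive = Path(work) / "analysis.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(str(src), arcname="pipeline.pkl")
    out_dir = extract_sketch(str(archive))
    restored_ok = (out_dir / "pipeline.pkl").read_bytes() == b"demo"
    out_dir_name = out_dir.name
```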