Archive Utils
Archive utilities for pipeline persistence and sharing.
This module provides utilities for creating and extracting compressed archives containing pipeline data. Supports filtering of visualization files and structure files for flexible archive creation.
- class mdxplain.utils.archive_utils.ArchiveUtils
Utilities for creating and extracting pipeline archives.
Provides static methods for compressing pipeline data into portable archives and extracting them. Supports selective inclusion of files based on type (essential data, visualizations, structure files).
Examples
>>> # Create archive from pipeline data
>>> archive_path = ArchiveUtils.create_archive(
...     pipeline_data, "analysis.tar.zst"
... )
>>> # Extract archive
>>> extract_dir = ArchiveUtils.extract_archive("analysis.tar.zst")
- static is_essential_file(suffix: str, use_memmap: bool) -> bool
Check if a file is essential for loading the pipeline.
Which files are essential depends on memmap usage. Pickle files are always essential. Memmap files (.dat) are essential only if use_memmap=True.
Parameters
- suffix : str
File extension (lowercase with dot)
- use_memmap : bool
Whether pipeline uses memory mapping
Returns
- bool
True if file is essential for pipeline loading
Examples
>>> ArchiveUtils.is_essential_file('.dat', use_memmap=True)
True
>>> ArchiveUtils.is_essential_file('.dat', use_memmap=False)
False
>>> ArchiveUtils.is_essential_file('.pkl', use_memmap=False)
True
- static is_zarr_directory(path: Path) -> bool
Check if path is a zarr archive directory.
Zarr archives are directories used for trajectory caching with DaskMDTrajectory. Essential for trajectory loading.
Parameters
- path : Path
Path to check
Returns
- bool
True if path is zarr directory
Examples
>>> path = Path("cache/traj0.dask.zarr")
>>> ArchiveUtils.is_zarr_directory(path)
True
- static is_visualization_file(suffix: str) -> bool
Check if file is visualization output.
Visualization files are plot outputs that can be regenerated and are typically excluded from minimal archives.
Parameters
- suffix : str
File extension (lowercase with dot)
Returns
- bool
True if file is a visualization output
Examples
>>> ArchiveUtils.is_visualization_file('.png')
True
>>> ArchiveUtils.is_visualization_file('.dat')
False
- static is_structure_file(suffix: str) -> bool
Check if file is structure output.
Structure files include PDB coordinates and PyMOL scripts generated from feature importance analysis.
Parameters
- suffix : str
File extension (lowercase with dot)
Returns
- bool
True if file is a structure file
Examples
>>> ArchiveUtils.is_structure_file('.pdb')
True
>>> ArchiveUtils.is_structure_file('.dat')
False
- static should_include_file(file_path: Path, exclude_visualizations: bool, include_structure_files: bool, use_memmap: bool) -> bool
Determine if file should be included in archive.
Applies filtering logic based on file type and user preferences. Essential files depend on memmap usage.
Parameters
- file_path : Path
Path to file to check
- exclude_visualizations : bool
If True, exclude plot outputs (PNG, PDF, etc.)
- include_structure_files : bool
If True, include PDB/PML structure files
- use_memmap : bool
Whether pipeline uses memory mapping
Returns
- bool
True if file should be included in archive
Examples
>>> path = Path("cache/features.dat")
>>> ArchiveUtils.should_include_file(path, True, True, True)
True
>>> ArchiveUtils.should_include_file(path, True, True, False)
False
>>> path = Path("plots/landscape.png")
>>> ArchiveUtils.should_include_file(path, True, True, True)
False
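The filtering rules described above can be sketched as a single decision function. This is an illustration, not the library's implementation: the name `should_include`, the four extension sets, and the choice to keep unrecognized file types are all assumptions made for the example.

```python
from pathlib import Path

# Assumed extension sets for illustration; the library's real lists may differ.
ALWAYS_ESSENTIAL = {".pkl"}
MEMMAP_ONLY = {".dat"}
VISUALIZATION = {".png", ".pdf"}
STRUCTURE = {".pdb", ".pml"}


def should_include(path: Path, exclude_visualizations: bool,
                   include_structure_files: bool, use_memmap: bool) -> bool:
    """Hypothetical stand-in for the documented filtering decision."""
    suffix = path.suffix.lower()
    if suffix in ALWAYS_ESSENTIAL:
        return True
    if suffix in MEMMAP_ONLY:
        # Memmap data is only needed when the pipeline uses memory mapping.
        return use_memmap
    if suffix in VISUALIZATION:
        return not exclude_visualizations
    if suffix in STRUCTURE:
        return include_structure_files
    # Assumption: unrecognized file types are kept.
    return True
```

With these assumed sets, the sketch returns the same results as the three documented examples for `features.dat` and `landscape.png`.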
- static collect_cache_files(cache_dir: str, exclude_visualizations: bool, include_structure_files: bool, use_memmap: bool) -> List[Tuple[str, str]]
Collect all files and zarr directories from cache for archiving.
Recursively scans cache directory and collects files and zarr directories matching the specified filter criteria.
Parameters
- cache_dir : str
Path to cache directory
- exclude_visualizations : bool
If True, exclude plot outputs
- include_structure_files : bool
If True, include PDB/PML files
- use_memmap : bool
Whether pipeline uses memory mapping
Returns
- List[Tuple[str, str]]
List of (absolute_path, archive_path) tuples
Examples
>>> files = ArchiveUtils.collect_cache_files(
...     "./cache", exclude_visualizations=True,
...     include_structure_files=True, use_memmap=True
... )
>>> len(files) > 0
True
Notes
Files are filtered by extension
Zarr directories only included if use_memmap=True
.dat files only included if use_memmap=True
Zarr directories are added as directories, not individual files
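The notes above suggest a recursive scan that treats each zarr directory as one entry instead of descending into it. A rough sketch under stated assumptions: `collect_files` and its `keep` predicate are hypothetical names, zarr directories are recognized by a `.zarr` name suffix, and archive paths are taken relative to the cache directory's parent. None of these details are confirmed by the documentation.

```python
import os
from typing import Callable, List, Tuple


def collect_files(cache_dir: str, use_memmap: bool,
                  keep: Callable[[str], bool]) -> List[Tuple[str, str]]:
    """Return sorted (absolute_path, archive_path) pairs found under cache_dir."""
    root = os.path.abspath(cache_dir)
    base = os.path.dirname(root)  # assumption: archive paths are relative to the parent
    collected = []
    for current, dirnames, filenames in os.walk(root):
        # Assumption: a zarr store is any directory whose name ends in ".zarr".
        zarr_dirs = [d for d in dirnames if d.endswith(".zarr")]
        # Prune zarr directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.endswith(".zarr")]
        if use_memmap:
            for name in zarr_dirs:
                full = os.path.join(current, name)
                collected.append((full, os.path.relpath(full, base)))
        for name in filenames:
            full = os.path.join(current, name)
            if keep(full):
                collected.append((full, os.path.relpath(full, base)))
    return sorted(collected)
```

Adding a zarr store as a single directory entry works because `tarfile.TarFile.add()` is recursive by default, so the whole tree is archived in one call.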
- static is_sha256_string(value: str) -> bool
Check whether value is a raw SHA256 hex digest.
Parameters
- value : str
Candidate SHA256 string.
Returns
- bool
True when the value is a 64-character hexadecimal digest.
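A check for a 64-character hexadecimal digest can be sketched with a regular expression. The name `looks_like_sha256` is hypothetical, and accepting uppercase hex digits is an assumption; the library may be stricter.

```python
import re

# Exactly 64 hexadecimal characters; accepting uppercase is an assumption.
_SHA256_RE = re.compile(r"[0-9a-fA-F]{64}")


def looks_like_sha256(value: str) -> bool:
    """Hypothetical check: True for a bare 64-character hex string."""
    return _SHA256_RE.fullmatch(value) is not None
```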
- static parse_sha256_text(text: str) -> str
Parse a SHA256 value from raw text or sha256sum-style content.
Parameters
- text : str
Raw text containing a SHA256 digest.
Returns
- str
Normalized lowercase SHA256 digest.
Raises
- ValueError
If no valid SHA256 digest can be parsed from the text.
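Output from `sha256sum` places the digest first, followed by whitespace and the file name, so one plausible way to handle both input forms is to validate the first whitespace-separated token. This is a sketch only: `parse_sha256` is a hypothetical name and the library's actual parsing rules are not documented here.

```python
import re


def parse_sha256(text: str) -> str:
    """Hypothetical parser: return the lowercase digest or raise ValueError."""
    tokens = text.strip().split()
    # The first token covers both a bare digest and "<digest>  <filename>" lines.
    if tokens and re.fullmatch(r"[0-9a-fA-F]{64}", tokens[0]):
        return tokens[0].lower()
    raise ValueError("no valid SHA256 digest found in text")
```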
- static compute_sha256(file_path: str) -> str
Compute the SHA256 digest of a local file.
Parameters
- file_path : str
Path to the file to hash.
Returns
- str
Lowercase SHA256 digest.
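Archives can be large, so a digest is usually computed by reading the file in chunks instead of loading it whole. A minimal sketch using the standard library; the name `sha256_of_file` and the 1 MiB chunk size are assumptions, not the library's code.

```python
import hashlib


def sha256_of_file(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hypothetical helper: hash a file in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # hexdigest() returns lowercase hexadecimal.
    return digest.hexdigest()
```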
- static get_sha256_file_path(archive_path: str) -> str
Build the sidecar .sha path for an archive file.
Parameters
- archive_path : str
Local archive path.
Returns
- str
Normalized absolute path to the sidecar SHA256 file.
- static write_sha256_file(archive_path: str, sha_file_path: str | None = None) -> str
Write a .sha sidecar file for an archive.
Parameters
- archive_path : str
Local archive path.
- sha_file_path : str, optional
Explicit output path for the SHA256 sidecar file.
Returns
- str
Path to the written SHA256 file.
- static resolve_sha_output_path(archive_path: str, sha: bool | str) -> str | None
Resolve the requested SHA256 output path for an archive.
Parameters
- archive_path : str
Target archive file path.
- sha : bool or str
False disables SHA output, True uses the default sidecar path, and a string is treated as an explicit SHA256 output path.
Returns
- str or None
Normalized SHA256 output path when enabled, otherwise None.
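The three cases for `sha` can be sketched as follows. The helper names are hypothetical, and two details are assumptions: that the default sidecar path is the archive path with `.sha` appended, and that "normalized" means `os.path.abspath`.

```python
import os
from typing import Optional, Union


def default_sha_path(archive_path: str) -> str:
    # Assumption: the sidecar is "<archive>.sha", returned as an absolute path.
    return os.path.abspath(archive_path + ".sha")


def resolve_sha_path(archive_path: str, sha: Union[bool, str]) -> Optional[str]:
    """Hypothetical resolver for the bool-or-str `sha` argument."""
    if sha is False:
        return None  # SHA output disabled
    if sha is True:
        return default_sha_path(archive_path)
    # Any string is treated as an explicit output path.
    return os.path.abspath(sha)
```

Comparing with `is True` and `is False` keeps the boolean cases separate from string values.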
- static create_archive(pipeline_data, archive_path: str, compression: str = 'zst', exclude_visualizations: bool = True, include_structure_files: bool = True, compression_level: int | None = None, zstd_threads: int | None = None, reserve_cores: int = 2, sha: bool | str = True, overwrite: bool = False) -> str
Create compressed archive with pipeline and cache files.
Creates tar archive containing pipeline pickle and filtered cache directory files with maximum compression.
Parameters
- pipeline_data : PipelineData
Pipeline data object to save
- archive_path : str
Path for output archive (extension added if missing)
- compression : str, default="zst"
Compression method: "zst", "bz2", or "gz"
- exclude_visualizations : bool, default=True
If True, exclude plot outputs
- include_structure_files : bool, default=True
If True, include PDB/PML files
- compression_level : int, optional
Compression level override. For zst this maps to level (1-19).
- zstd_threads : int, optional
Thread count for zstd compression. If None, uses max(1, cpu_count - reserve_cores).
- reserve_cores : int, default=2
Number of CPU cores to keep free for automatic zstd thread selection.
- sha : bool or str, default=True
If True, write <archive>.sha next to the created archive. When a string is provided, it is used as the explicit SHA256 output path.
- overwrite : bool, default=False
If True, replace existing archive outputs. When False, existing archive or SHA256 files raise FileExistsError.
Returns
- str
Path to created archive file
Raises
- ValueError
If compression method not supported
Examples
>>> archive = ArchiveUtils.create_archive(
...     pipeline_data, "analysis.tar.zst"
... )
>>> Path(archive).exists()
True
Notes
Uses tempfile for pickle creation
Preserves relative paths in archive
zstd compression uses the zstandard Python library with streaming I/O
With use_memmap=False: Only pickle needed (all data in objects)
With use_memmap=True: Pickle + .dat files + zarr directories
tar.add() automatically handles both files and directories
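The automatic thread selection described for zstd_threads, `max(1, cpu_count - reserve_cores)`, can be sketched as below. The function name `pick_zstd_threads` is hypothetical, and treating an unknown CPU count as 1 is an assumption.

```python
import os
from typing import Optional


def pick_zstd_threads(zstd_threads: Optional[int] = None,
                      reserve_cores: int = 2) -> int:
    """Hypothetical helper mirroring the documented default thread rule."""
    if zstd_threads is not None:
        return zstd_threads  # an explicit value wins
    # os.cpu_count() can return None; fall back to 1 (assumption).
    cpu_count = os.cpu_count() or 1
    return max(1, cpu_count - reserve_cores)
```

The `max(1, ...)` clamp keeps at least one compression thread on machines with `reserve_cores` or fewer cores.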
- static extract_archive(archive_path: str, extract_to: str = None) -> Path
Extract archive and return extraction directory.
Extracts compressed tar archive preserving directory structure. Creates extraction directory if it does not exist.
Parameters
- archive_path : str
Path to archive file
- extract_to : str, optional
Directory to extract to. If None, uses archive parent directory with archive stem as subdirectory name.
Returns
- Path
Path to extraction directory
Raises
- FileNotFoundError
If archive does not exist
Examples
>>> extract_dir = ArchiveUtils.extract_archive("analysis.tar.zst")
>>> (extract_dir / "pipeline.pkl").exists()
True
>>> extract_dir = ArchiveUtils.extract_archive(
...     "analysis.tar.zst",
...     extract_to="./restored"
... )
Notes
Automatically detects compression from file extension
Creates parent directories if needed
Preserves file permissions and timestamps
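Detecting the compression from the file extension can be sketched as a lookup on the final suffix. The name `detect_compression`, the case-insensitive match, and raising `ValueError` for unknown extensions are assumptions for illustration; only the three methods "zst", "bz2", and "gz" come from the documentation above.

```python
SUPPORTED = ("zst", "bz2", "gz")


def detect_compression(archive_path: str) -> str:
    """Hypothetical helper: map an archive file name to its compression method."""
    name = archive_path.lower()
    for method in SUPPORTED:
        if name.endswith("." + method):
            return method
    raise ValueError(f"cannot detect compression from: {archive_path}")
```

For "gz" and "bz2" the result maps directly onto the standard library's `tarfile.open` modes `"r:gz"` and `"r:bz2"`. A "zst" archive would first need a decompression stream, for example from the zstandard library mentioned in the notes for create_archive.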