Data Selector Manager

Data selector manager for trajectory frame selection.

This module provides the DataSelectorManager class that manages frame selection (row selection) as the counterpart to FeatureSelector’s column selection. It supports selection based on tags, clusters, and combinations thereof.

class mdxplain.data_selector.manager.data_selector_manager.DataSelectorManager

Manager for creating and managing trajectory frame selections.

This class provides methods to select trajectory frames (rows) based on various criteria such as tags, cluster assignments, or combinations. It serves as the counterpart to FeatureSelectorManager, focusing on row selection instead of column selection.

The manager supports:

  • Tag-based frame selection

  • Cluster-based frame selection

  • Combination of multiple selections

  • Frame index range selection

Examples

Pipeline mode (automatic injection):

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.create("folded_frames")
>>> pipeline.data_selector.select_by_cluster("folded_frames", "conformations", [0])

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.create(pipeline_data, "folded_frames")
>>> manager.select_by_cluster(pipeline_data, "folded_frames", "conformations", [0])
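
The difference between the two modes is only whether the caller passes pipeline_data explicitly. As a rough illustration of what "automatic injection" means, the sketch below wraps a toy manager in a proxy that prepends a shared data object to every call. InjectingProxy, ToyManager, and the plain dict standing in for PipelineData are hypothetical names invented for this example; the sketch does not claim to reproduce how PipelineManager is actually implemented.

```python
from functools import partial


class InjectingProxy:
    """Hypothetical wrapper that prepends a shared data object to every call."""

    def __init__(self, manager, pipeline_data):
        self._manager = manager
        self._pipeline_data = pipeline_data

    def __getattr__(self, name):
        # Only reached for attributes that are not defined on the proxy itself.
        attr = getattr(self._manager, name)
        if callable(attr):
            # Bind pipeline_data as the first positional argument.
            return partial(attr, self._pipeline_data)
        return attr


class ToyManager:
    """Stand-in for a manager; stores selector names in a plain dict."""

    def create(self, pipeline_data, name):
        if name in pipeline_data:
            raise ValueError(f"Selector '{name}' already exists")
        pipeline_data[name] = []


shared_data = {}  # stand-in for PipelineData (assumption for this sketch)
manager = ToyManager()

# Standalone style: the data object is passed explicitly.
manager.create(shared_data, "folded_frames")

# Pipeline style: the proxy injects the data object.
proxy = InjectingProxy(manager, shared_data)
proxy.create("unfolded_frames")

print(sorted(shared_data))
```

Both calls end up in the same method with the same first argument, which is why passing pipeline_data again in pipeline mode would shift every other argument by one position.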

__init__() → None

Initialize the data selector manager.

Returns

None

Initializes DataSelectorManager instance

create(pipeline_data: PipelineData, name: str) → None

Create a new data selector with given name.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.create("folded_frames")  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.create(pipeline_data, "folded_frames")  # WITH pipeline_data parameter

Parameters

pipeline_data : PipelineData

Pipeline data object to store the selector

name : str

Name for the new data selector

Returns

None

Creates empty DataSelectorData in pipeline_data

Raises

ValueError

If a selector with the given name already exists

Examples

>>> manager = DataSelectorManager()
>>> manager.create(pipeline_data, "folded_frames")
>>> manager.create(pipeline_data, "system_A_frames")

select_by_tags(pipeline_data: PipelineData, name: str, tags: List[str], match_all: bool = True, mode: str = 'add', stride: int = 1) → None

Select frames based on trajectory tags.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.select_by_tags("biased_system_A", ["system_A", "biased"])  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.select_by_tags(pipeline_data, "biased_system_A", ["system_A", "biased"])  # WITH pipeline_data parameter

Parameters

pipeline_data : PipelineData

Pipeline data object containing trajectory data

name : str

Name of the data selector to populate

tags : List[str]

List of tags to search for

match_all : bool, default=True

If True, frame must have ALL tags. If False, ANY tag matches.

mode : str, default="add"

Selection mode: "add" (union), "subtract" (difference), "intersect" (intersection)

stride : int, default=1

Minimum distance between consecutive frames (per trajectory). stride=1 returns all frames, stride=10 returns every 10th frame.

Returns

None

Updates DataSelectorData with selected frame indices

Raises

ValueError

If the selector name does not exist or no trajectories are loaded

Examples

>>> # Add frames with all specified tags
>>> manager.select_by_tags(
...     pipeline_data, "biased_system_A", ["system_A", "biased"], match_all=True, mode="add"
... )
>>> # Select every 5th frame from tagged trajectories
>>> manager.select_by_tags(
...     pipeline_data, "biased_sparse", ["biased"], stride=5
... )
>>> # Keep only frames that have these tags
>>> manager.select_by_tags(
...     pipeline_data, "my_frames", ["production"], mode="intersect"
... )
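
The match_all flag switches between "all tags" and "any tag" matching. A minimal sketch of that distinction with plain Python sets follows; trajectory_matches and the trajectory_tags dictionary are hypothetical and serve only to illustrate the rule, not to mirror the library's internal code.

```python
from typing import Iterable, List


def trajectory_matches(traj_tags: Iterable[str], wanted: List[str],
                       match_all: bool = True) -> bool:
    """Return True if the trajectory's tags satisfy the requested tags."""
    traj_set = set(traj_tags)
    if match_all:
        # ALL requested tags must be present.
        return set(wanted).issubset(traj_set)
    # ANY requested tag is enough.
    return not traj_set.isdisjoint(wanted)


# Hypothetical tag assignment: trajectory index -> tags
trajectory_tags = {
    0: ["system_A", "biased"],
    1: ["system_A", "unbiased"],
    2: ["system_B", "biased"],
}
wanted_tags = ["system_A", "biased"]

all_match = [i for i, t in trajectory_tags.items()
             if trajectory_matches(t, wanted_tags, match_all=True)]
any_match = [i for i, t in trajectory_tags.items()
             if trajectory_matches(t, wanted_tags, match_all=False)]

print(all_match)  # [0]
print(any_match)  # [0, 1, 2]
```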

select_by_cluster(pipeline_data: PipelineData, name: str, clustering_name: str, cluster_ids: List[int | str], mode: str = 'add', stride: int = 1) → None

Select frames based on cluster assignments.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.select_by_cluster("structured", "conformations", [0, 1])  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.select_by_cluster(pipeline_data, "structured", "conformations", [0, 1])  # WITH pipeline_data parameter

Parameters

pipeline_data : PipelineData

Pipeline data object containing cluster data

name : str

Name of the data selector to populate

clustering_name : str

Name of the clustering result to use

cluster_ids : List[Union[int, str]]

List of cluster IDs to select (can be numeric IDs or cluster names)

mode : str, default="add"

Selection mode: "add" (union), "subtract" (difference), "intersect" (intersection)

stride : int, default=1

Minimum distance between consecutive frames (per trajectory). Applied after cluster selection to maintain cluster representation.

Returns

None

Updates DataSelectorData with selected frame indices

Raises

ValueError

If the selector name does not exist or the clustering is not found

Examples

>>> # Add frames from specific clusters
>>> manager.select_by_cluster(
...     pipeline_data, "structured", "conformations", [0, 1], mode="add"
... )
>>> # Select every 10th frame from clusters (sparse sampling)
>>> manager.select_by_cluster(
...     pipeline_data, "structured_sparse", "conformations", [0, 1], stride=10
... )
>>> # Keep only frames from these clusters
>>> manager.select_by_cluster(
...     pipeline_data, "my_frames", "conformations", ["folded", "intermediate"], mode="intersect"
... )
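
The three mode values correspond to ordinary set operations between the frames already stored in a selector and the newly matched frames. The sketch below shows this with Python sets; representing a frame as a (trajectory index, frame index) tuple and the combine helper are assumptions made for illustration only.

```python
from typing import Set, Tuple

# Illustrative representation: (trajectory index, frame index)
Frame = Tuple[int, int]


def combine(current: Set[Frame], new: Set[Frame], mode: str = "add") -> Set[Frame]:
    """Combine an existing selection with newly matched frames."""
    if mode == "add":
        return current | new       # union
    if mode == "subtract":
        return current - new       # difference
    if mode == "intersect":
        return current & new       # intersection
    raise ValueError(f"Unknown mode: {mode!r}")


current = {(0, 10), (0, 20), (1, 5)}
new = {(0, 20), (1, 5), (1, 15)}

print(sorted(combine(current, new, "add")))        # [(0, 10), (0, 20), (1, 5), (1, 15)]
print(sorted(combine(current, new, "subtract")))   # [(0, 10)]
print(sorted(combine(current, new, "intersect")))  # [(0, 20), (1, 5)]
```

Under this reading, calling a select method with mode="intersect" on an empty selector yields an empty selection, so "intersect" and "subtract" are only useful after an earlier "add".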

select_by_indices(pipeline_data: PipelineData, name: str, trajectory_indices: Dict[int, List[int]] | Dict[str, List[int]] | Dict[int, str] | Dict[str, str], mode: str = 'add') → None

Select frames by explicit trajectory-specific frame indices.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.select_by_indices("custom_frames", {0: [10, 20], 1: [5, 15]})  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.select_by_indices(pipeline_data, "custom_frames", {0: [10, 20], 1: [5, 15]})  # pipeline_data required

Parameters

pipeline_data : PipelineData

Pipeline data object

name : str

Name of the data selector to populate

trajectory_indices : Dict[traj_selection, frame_selection]

Dictionary mapping trajectory selectors to frame specifications.

traj_selection:

  • int: trajectory index (0, 1, 2, …)

  • str: trajectory name ("system_A"), tag ("tag:biased"), pattern ("system_*")

  • Can resolve to multiple trajectories (e.g., a tag applies the frame selection to all matching trajectories)

frame_selection:

  • int: single frame (42)

  • List[int]: explicit frames ([10, 20, 30])

  • str: various formats:

    • Single: "42"

    • Range: "10-20" → [10, 11, …, 20]

    • Comma list: "10,20,30" → [10, 20, 30]

    • Combined: "10-20,30-40,50" → [10…20, 30…40, 50]

    • All: "all" → all frames in trajectory

  • dict: with stride support:

    • {"frames": frame_selection, "stride": N}

    • stride = minimum distance between consecutive frames

    • Example: {"frames": "0-100", "stride": 10} → [0, 10, 20, …, 100]

mode : str, default="add"

Selection mode: "add" (union), "subtract" (difference), "intersect" (intersection)

Returns

None

Updates DataSelectorData with specified trajectory frame indices

Examples

>>> # Direct trajectory-specific selection
>>> manager.select_by_indices(
...     pipeline_data, "custom_frames", 
...     {0: [10, 20, 30], 1: [5, 15, 25]}, mode="add"
... )
>>> # Combined ranges
>>> manager.select_by_indices(
...     pipeline_data, "complex_frames",
...     {0: "10-20,30-40,50", "system_A": "100-200"}, mode="add"
... )
>>> # All frames from tagged trajectories
>>> manager.select_by_indices(
...     pipeline_data, "all_biased",
...     {"tag:biased": "all"}, mode="add"
... )
>>> # With stride for sparse sampling
>>> manager.select_by_indices(
...     pipeline_data, "sparse_frames",
...     {0: {"frames": "0-1000", "stride": 50}}, mode="add"
... )
>>> # Complex mixed example
>>> manager.select_by_indices(
...     pipeline_data, "mixed_selection",
...     {
...         "system_A": {"frames": "10-20,100-200", "stride": 5},
...         "tag:biased": "all",
...         1: [42, 84, 126]
...     }, mode="add"
... )
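
The frame_selection formats listed under Parameters can be read as a small grammar. The following sketch expands such a specification into a sorted list of frame indices; parse_frames and apply_stride are hypothetical helpers written for this illustration, and the greedy "keep a frame if it is at least stride away from the last kept frame" rule is an assumption about how the minimum distance is applied.

```python
def apply_stride(frames, stride):
    """Keep frames so that consecutive kept frames are at least `stride` apart."""
    kept = []
    for frame in sorted(frames):
        if not kept or frame - kept[-1] >= stride:
            kept.append(frame)
    return kept


def parse_frames(spec, n_frames):
    """Expand an int, list, string, or {"frames", "stride"} dict into frame indices."""
    if isinstance(spec, dict):
        frames = parse_frames(spec["frames"], n_frames)
        return apply_stride(frames, int(spec.get("stride", 1)))
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, str):
        if spec.strip().lower() == "all":
            return list(range(n_frames))
        frames = set()
        for part in spec.split(","):
            part = part.strip()
            if "-" in part:
                # Ranges are inclusive on both ends: "10-20" -> 10..20
                start, end = part.split("-")
                frames.update(range(int(start), int(end) + 1))
            else:
                frames.add(int(part))
        return sorted(frames)
    # Anything else is treated as an iterable of ints.
    return sorted(set(spec))


print(parse_frames("10-12,30,40-41", n_frames=1000))  # [10, 11, 12, 30, 40, 41]
print(parse_frames({"frames": "0-100", "stride": 10}, n_frames=1000))
```

The second call prints the frames 0, 10, 20 and so on up to 100, matching the stride example given in the parameter description.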

get_selection_info(pipeline_data: PipelineData, name: str) → Dict[str, Any]

Get information about a data selection.

Parameters

pipeline_data : PipelineData

Pipeline data object

name : str

Name of the data selector

Returns

Dict[str, Any]

Dictionary with selection information

Examples

>>> info = manager.get_selection_info(pipeline_data, "folded_frames")
>>> print(f"Selected {info['n_frames']} frames")
>>> print(f"Selection type: {info['selection_type']}")

list_selectors(pipeline_data: PipelineData) → List[str]

List all available data selectors.

Parameters

pipeline_data : PipelineData

Pipeline data object

Returns

List[str]

List of selector names

Examples

>>> selectors = manager.list_selectors(pipeline_data)
>>> print(f"Available selectors: {selectors}")

clear_selector(pipeline_data: PipelineData, name: str) → None

Clear all frames and criteria from a data selector.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.clear_selector("my_frames")  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.clear_selector(pipeline_data, "my_frames")  # WITH pipeline_data parameter

Parameters

pipeline_data : PipelineData

Pipeline data object

name : str

Name of the selector to clear

Returns

None

Clears all frames and criteria from the selector

Examples

>>> manager.clear_selector(pipeline_data, "my_selection")

remove_selector(pipeline_data: PipelineData, name: str) → None

Remove a data selector.

Parameters

pipeline_data : PipelineData

Pipeline data object

name : str

Name of the selector to remove

Returns

None

Removes the selector from pipeline_data

Examples

>>> manager.remove_selector(pipeline_data, "old_selection")

save(pipeline_data: PipelineData, save_path: str) → None

Save all data selector data to a single file.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.save('data_selector.npy')  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.save(pipeline_data, 'data_selector.npy')  # pipeline_data required

Parameters

pipeline_data : PipelineData

Pipeline data container with data selector data

save_path : str

Path of the single file in which all data selector data is saved

Returns

None

Saves all data selector data to the specified file

Examples

>>> manager.save(pipeline_data, 'data_selector.npy')

load(pipeline_data: PipelineData, load_path: str) → None

Load all data selector data from a single file.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.load('data_selector.npy')  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.load(pipeline_data, 'data_selector.npy')  # pipeline_data required

Parameters

pipeline_data : PipelineData

Pipeline data container to load data selector data into

load_path : str

Path to saved data selector data file

Returns

None

Loads all data selector data from the specified file

Examples

>>> manager.load(pipeline_data, 'data_selector.npy')

print_info(pipeline_data: PipelineData) → None

Print data selector information.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.print_info()  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.print_info(pipeline_data)  # pipeline_data required

Parameters

pipeline_data : PipelineData

Pipeline data container with data selector data

Returns

None

Prints data selector information to console

Examples

>>> manager.print_info(pipeline_data)

create_from_clusters(pipeline_data: PipelineData, group_name: str, clustering_name: str, cluster_ids: List[int] | None = None, noise_id: int | None = -1, min_cluster_size: int | None = 2, force: bool = False) → None

Create data selectors automatically for clusters.

Creates one data selector per cluster and organizes them into a named group for easy reference. Noise clusters are filtered out by default.

Parameters

pipeline_data : PipelineData

Pipeline data object containing clustering results

group_name : str

Name for the selector group

clustering_name : str

Name of clustering result to use

cluster_ids : List[int], optional

Specific cluster IDs to include. If None, includes all non-noise clusters.

noise_id : int or None, default=-1

Cluster ID that represents noise/outliers to filter out.

  • If int: Filters out this specific cluster ID (e.g., -1 for sklearn)

  • If None: No filtering, creates selectors for ALL cluster IDs

min_cluster_size : int or None, default=2

Minimum number of frames required for a cluster to be included. If None, no size filter is applied (noise filtering still applies). The default of 2 excludes single-frame clusters, which is necessary for decision trees to work properly.

force : bool, default=False

Whether to overwrite existing selectors with the same names. If False, raises ValueError when a selector already exists.

Returns

None

Creates selectors and stores group in pipeline_data

Raises

ValueError

If clustering_name does not exist in pipeline_data

ValueError

If selector already exists and force is False

Examples

>>> # Create selectors for all non-noise clusters
>>> manager.create_from_clusters(
...     pipeline_data, "clusters", "dbscan_clustering"
... )
>>> # Create selectors for specific clusters only
>>> manager.create_from_clusters(
...     pipeline_data, "folded", "clustering",
...     cluster_ids=[0, 1, 2]
... )
>>> # Include ALL clusters (even noise)
>>> manager.create_from_clusters(
...     pipeline_data, "all_states", "clustering",
...     noise_id=None
... )
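
The interaction of cluster_ids, noise_id, and min_cluster_size can be pictured as a filter over the cluster labels. The sketch below applies that filter to a plain list of labels; clusters_to_keep is a hypothetical helper, and the choice to apply the noise and size filters even to explicitly requested cluster_ids is an assumption of this example rather than documented behavior.

```python
from collections import Counter


def clusters_to_keep(labels, cluster_ids=None, noise_id=-1, min_cluster_size=2):
    """Return the cluster IDs that would receive their own selector."""
    counts = Counter(labels)  # frames per cluster ID
    candidates = sorted(counts) if cluster_ids is None else list(cluster_ids)
    kept = []
    for cid in candidates:
        if noise_id is not None and cid == noise_id:
            continue  # drop the noise cluster
        if min_cluster_size is not None and counts.get(cid, 0) < min_cluster_size:
            continue  # drop clusters that are too small
        kept.append(cid)
    return kept


# Hypothetical per-frame labels, with -1 marking noise
labels = [0, 0, 1, -1, 2, 0, 1, -1]

print(clusters_to_keep(labels))                 # [0, 1]
print(clusters_to_keep(labels, noise_id=None))  # [-1, 0, 1]
print(clusters_to_keep(labels, noise_id=None, min_cluster_size=None))  # [-1, 0, 1, 2]
```

Cluster 2 contains a single frame, so it only appears once the size filter is disabled.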

create_from_tags(pipeline_data: PipelineData, group_name: str, tags: List[str] | None = None, force: bool = False) → None

Create data selectors automatically for tags.

Creates one data selector per tag and organizes them into a named group for easy reference.

Parameters

pipeline_data : PipelineData

Pipeline data object

group_name : str

Name for the group

tags : List[str], optional

Specific tags to include

force : bool, default=False

Overwrite existing selectors

Returns

None

Creates selectors and group

Examples

>>> manager.create_from_tags(
...     pipeline_data, "systems", tags=["system_A", "system_B"]
... )

get_group(pipeline_data: PipelineData, group_name: str) → List[str]

Get selector names in a group.

Parameters

pipeline_data : PipelineData

Pipeline data object

group_name : str

Group name

Returns

List[str]

Selector names

Examples

>>> selectors = manager.get_group(pipeline_data, "clusters")

list_groups(pipeline_data: PipelineData) → List[str]

List all group names.

Parameters

pipeline_data : PipelineData

Pipeline data object

Returns

List[str]

Group names

Examples

>>> groups = manager.list_groups(pipeline_data)

delete_group(pipeline_data: PipelineData, group_name: str, delete_selectors: bool = False) → None

Delete a group.

Parameters

pipeline_data : PipelineData

Pipeline data object

group_name : str

Group name

delete_selectors : bool, default=False

If True, also delete the selectors that belong to the group

Returns

None

Deletes group

Examples

>>> manager.delete_group(pipeline_data, "old_group")