Data Selector Manager

Data selector manager for trajectory frame selection.

This module provides the DataSelectorManager class that manages frame selection (row selection) as the counterpart to FeatureSelector’s column selection. It supports selection based on tags, clusters, and combinations thereof.

class mdxplain.data_selector.manager.data_selector_manager.DataSelectorManager

Manager for creating and managing trajectory frame selections.

This class provides methods to select trajectory frames (rows) based on various criteria such as tags, cluster assignments, or combinations. It serves as the counterpart to FeatureSelectorManager, focusing on row selection instead of column selection.

The manager supports:

Tag-based frame selection
Cluster-based frame selection
Combination of multiple selections
Frame index range selection

Examples

Pipeline mode (automatic injection):

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.create("folded_frames")
>>> pipeline.data_selector.select_by_cluster("folded_frames", "conformations", [0])

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.create(pipeline_data, "folded_frames")
>>> manager.select_by_cluster(pipeline_data, "folded_frames", "conformations", [0])
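The two calling conventions differ only in who supplies pipeline_data. The following is a minimal sketch of how such injection could be arranged. It uses hypothetical stand-in classes (ToyPipelineData, ToySelectorManager, InjectingProxy, ToyPipeline) and is not mdxplain's actual mechanism; it only illustrates why the argument is omitted in pipeline mode.

```python
from functools import partial


class ToyPipelineData:
    """Hypothetical stand-in for PipelineData: holds named selectors."""

    def __init__(self):
        self.selectors = {}


class ToySelectorManager:
    """Hypothetical manager whose methods take pipeline_data as first argument."""

    def create(self, pipeline_data, name):
        if name in pipeline_data.selectors:
            raise ValueError(f"Selector '{name}' already exists")
        pipeline_data.selectors[name] = []


class InjectingProxy:
    """Wraps a manager and pre-binds pipeline_data to every method it exposes."""

    def __init__(self, manager, pipeline_data):
        self._manager = manager
        self._pipeline_data = pipeline_data

    def __getattr__(self, attr):
        method = getattr(self._manager, attr)
        # The caller supplies only the remaining arguments.
        return partial(method, self._pipeline_data)


class ToyPipeline:
    """Hypothetical pipeline that owns the data and exposes a proxied manager."""

    def __init__(self):
        self.data = ToyPipelineData()
        self.data_selector = InjectingProxy(ToySelectorManager(), self.data)


# Standalone style: pipeline_data is passed explicitly.
standalone_data = ToyPipelineData()
ToySelectorManager().create(standalone_data, "folded_frames")

# Pipeline style: the proxy injects pipeline_data, so it is omitted.
pipeline = ToyPipeline()
pipeline.data_selector.create("folded_frames")
```

In this toy version, passing pipeline_data again in pipeline mode shifts every argument by one position and fails with a TypeError.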

__init__() → None

Initialize the data selector manager.

Returns

None: Initializes DataSelectorManager instance

create(pipeline_data: PipelineData, name: str) → None

Create a new, empty data selector with the given name.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.create("folded_frames")  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.create(pipeline_data, "folded_frames")  # WITH pipeline_data parameter

Parameters

pipeline_data (PipelineData): Pipeline data object to store the selector
name (str): Name for the new data selector

Returns

None: Creates empty DataSelectorData in pipeline_data

Raises

ValueError: If a selector with the given name already exists

Examples

>>> manager = DataSelectorManager()
>>> manager.create(pipeline_data, "folded_frames")
>>> manager.create(pipeline_data, "system_A_frames")

select_by_tags(pipeline_data: PipelineData, name: str, tags: List[str], match_all: bool = True, mode: str = 'add', stride: int = 1) → None

Select frames based on trajectory tags.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.select_by_tags("biased_system_A", ["system_A", "biased"])  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.select_by_tags(pipeline_data, "biased_system_A", ["system_A", "biased"])  # WITH pipeline_data parameter

Parameters

pipeline_data (PipelineData): Pipeline data object containing trajectory data
name (str): Name of the data selector to populate
tags (List[str]): List of tags to search for
match_all (bool, default=True): If True, a frame must have ALL tags. If False, ANY tag matches.
mode (str, default="add"): Selection mode: "add" (union), "subtract" (difference), "intersect" (intersection)
stride (int, default=1): Minimum distance between consecutive frames (per trajectory). stride=1 returns all frames; stride=10 returns every 10th frame.

Returns

None: Updates DataSelectorData with selected frame indices

Raises

ValueError: If selector name doesn’t exist or no trajectories loaded

Examples

>>> # Add frames with all specified tags
>>> manager.select_by_tags(
...     pipeline_data, "biased_system_A", ["system_A", "biased"], match_all=True, mode="add"
... )

>>> # Select every 5th frame from tagged trajectories
>>> manager.select_by_tags(
...     pipeline_data, "biased_sparse", ["biased"], stride=5
... )

>>> # Keep only frames that have these tags
>>> manager.select_by_tags(
...     pipeline_data, "my_frames", ["production"], mode="intersect"
... )
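The interaction of match_all and mode can be sketched with plain sets. This is an illustration under stated assumptions, not the library's implementation: tags are assumed to be attached to whole trajectories, frames are represented as (trajectory, frame) pairs, and the helper names frames_matching_tags and combine are hypothetical.

```python
def frames_matching_tags(trajectory_tags, n_frames, tags, match_all=True):
    """Return (traj, frame) pairs for trajectories whose tags match.

    trajectory_tags: dict mapping trajectory index -> set of tags (assumed layout)
    n_frames: dict mapping trajectory index -> number of frames
    """
    wanted = set(tags)
    selected = set()
    for traj, traj_tags in trajectory_tags.items():
        if match_all:
            matches = wanted <= traj_tags       # every requested tag is present
        else:
            matches = bool(wanted & traj_tags)  # at least one requested tag is present
        if matches:
            selected.update((traj, frame) for frame in range(n_frames[traj]))
    return selected


def combine(current, new, mode="add"):
    """Combine an existing selection with new frames using set semantics."""
    if mode == "add":
        return current | new   # union
    if mode == "subtract":
        return current - new   # difference
    if mode == "intersect":
        return current & new   # intersection
    raise ValueError(f"Unknown mode: {mode!r}")


trajectory_tags = {
    0: {"system_A", "biased"},
    1: {"system_A"},
    2: {"system_B", "biased"},
}
n_frames = {0: 3, 1: 2, 2: 2}

both = frames_matching_tags(trajectory_tags, n_frames, ["system_A", "biased"])
either = frames_matching_tags(
    trajectory_tags, n_frames, ["system_A", "biased"], match_all=False
)

# Only trajectory 0 carries both tags; all three carry at least one.
print(sorted(both))    # [(0, 0), (0, 1), (0, 2)]
print(len(either))     # 7

# Removing the "both" frames from the "either" selection leaves trajectories 1 and 2.
print(sorted(combine(either, both, mode="subtract")))
# [(1, 0), (1, 1), (2, 0), (2, 1)]
```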

select_by_cluster(pipeline_data: PipelineData, name: str, clustering_name: str, cluster_ids: List[int | str], mode: str = 'add', stride: int = 1) → None

Select frames based on cluster assignments.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.select_by_cluster("structured", "conformations", [0, 1])  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.select_by_cluster(pipeline_data, "structured", "conformations", [0, 1])  # WITH pipeline_data parameter

Parameters

pipeline_data (PipelineData): Pipeline data object containing cluster data
name (str): Name of the data selector to populate
clustering_name (str): Name of the clustering result to use
cluster_ids (List[Union[int, str]]): List of cluster IDs to select (can be numeric IDs or cluster names)
mode (str, default="add"): Selection mode: "add" (union), "subtract" (difference), "intersect" (intersection)
stride (int, default=1): Minimum distance between consecutive frames (per trajectory). Applied after cluster selection to maintain cluster representation.

Returns

None: Updates DataSelectorData with selected frame indices

Raises

ValueError: If selector name doesn’t exist or clustering not found

Examples

>>> # Add frames from specific clusters
>>> manager.select_by_cluster(
...     pipeline_data, "structured", "conformations", [0, 1], mode="add"
... )

>>> # Select every 10th frame from clusters (sparse sampling)
>>> manager.select_by_cluster(
...     pipeline_data, "structured_sparse", "conformations", [0, 1], stride=10
... )

>>> # Keep only frames from these clusters
>>> manager.select_by_cluster(
...     pipeline_data, "my_frames", "conformations", ["folded", "intermediate"], mode="intersect"
... )
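The note that stride is applied after cluster selection can be illustrated for a single trajectory. The sketch below assumes that "minimum distance" means greedy thinning (keep a frame only if it lies at least stride frames past the last kept one); that reading, and the helper names apply_stride and select_cluster_frames, are assumptions rather than the library's code. Only numeric cluster IDs are handled here.

```python
def apply_stride(frames, stride=1):
    """Keep a frame only if it is at least `stride` past the last kept frame."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    kept = []
    for frame in sorted(frames):
        if not kept or frame - kept[-1] >= stride:
            kept.append(frame)
    return kept


def select_cluster_frames(labels, cluster_ids, stride=1):
    """Select frames of one trajectory by cluster label, then thin them.

    labels: per-frame cluster labels of a single trajectory (assumed layout)
    """
    wanted = set(cluster_ids)
    members = [i for i, label in enumerate(labels) if label in wanted]
    return apply_stride(members, stride)


labels = [0, 0, 1, 1, 1, 0, 2, 2, 0, 0]

print(select_cluster_frames(labels, [0]))             # [0, 1, 5, 8, 9]
print(select_cluster_frames(labels, [0], stride=3))   # [0, 5, 8]

# On a contiguous block, the same rule reduces to "every stride-th frame".
print(apply_stride(range(0, 101), 10) == list(range(0, 101, 10)))  # True
```

For comparison, striding this toy trajectory first (frames 0, 3, 6, 9) and filtering by cluster afterwards would leave only frames 0 and 9 for cluster 0, which is one way to read the remark about maintaining cluster representation.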

select_by_indices(pipeline_data: PipelineData, name: str, trajectory_indices: Dict[int, List[int]] | Dict[str, List[int]] | Dict[int, str] | Dict[str, str], mode: str = 'add') → None

Select frames by explicit trajectory-specific frame indices.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.select_by_indices("custom_frames", {0: [10, 20], 1: [5, 15]})  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.select_by_indices(pipeline_data, "custom_frames", {0: [10, 20], 1: [5, 15]})  # pipeline_data required

Parameters

pipeline_data (PipelineData)

Pipeline data object

name (str)

Name of the data selector to populate

trajectory_indices (Dict[traj_selection, frame_selection])

Dictionary mapping trajectory selectors to frame specifications.

traj_selection:

int: trajectory index (0, 1, 2, …)
str: trajectory name ("system_A"), tag ("tag:biased"), or pattern ("system_*")
A selector can resolve to multiple trajectories (e.g., a tag applies the frame selection to all matching trajectories)

frame_selection:

int: single frame (42)
List[int]: explicit frames ([10, 20, 30])
str: one of several formats:
- Single: "42"
- Range: "10-20" → [10, 11, …, 20]
- Comma list: "10,20,30" → [10, 20, 30]
- Combined: "10-20,30-40,50" → [10…20, 30…40, 50]
- All: "all" → all frames in trajectory
dict: with stride support:
- {"frames": frame_selection, "stride": N}
- stride = minimum distance between consecutive frames
- Example: {"frames": "0-100", "stride": 10} → [0, 10, 20, …, 100]

mode (str, default="add")

Selection mode: "add" (union), "subtract" (difference), "intersect" (intersection)

Returns

None: Updates DataSelectorData with specified trajectory frame indices

Examples

>>> # Direct trajectory-specific selection
>>> manager.select_by_indices(
...     pipeline_data, "custom_frames", 
...     {0: [10, 20, 30], 1: [5, 15, 25]}, mode="add"
... )

>>> # Combined ranges
>>> manager.select_by_indices(
...     pipeline_data, "complex_frames",
...     {0: "10-20,30-40,50", "system_A": "100-200"}, mode="add"
... )

>>> # All frames from tagged trajectories
>>> manager.select_by_indices(
...     pipeline_data, "all_biased",
...     {"tag:biased": "all"}, mode="add"
... )

>>> # With stride for sparse sampling
>>> manager.select_by_indices(
...     pipeline_data, "sparse_frames",
...     {0: {"frames": "0-1000", "stride": 50}}, mode="add"
... )

>>> # Complex mixed example
>>> manager.select_by_indices(
...     pipeline_data, "mixed_selection",
...     {
...         "system_A": {"frames": "10-20,100-200", "stride": 5},
...         "tag:biased": "all",
...         1: [42, 84, 126]
...     }, mode="add"
... )
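The frame_selection formats listed under Parameters can be expanded with a small parser. The function below, parse_frame_selection, is a hypothetical helper written for illustration; it follows the documented formats (inclusive ranges, comma lists, "all", and the dict form with stride) but does no bounds checking and is not the library's parser.

```python
def parse_frame_selection(spec, n_frames):
    """Expand a frame specification into a sorted list of frame indices.

    n_frames is only needed to resolve the "all" keyword.
    """
    if isinstance(spec, dict):
        # Dict form: expand the inner selection, then enforce a minimum
        # distance of `stride` between consecutive kept frames.
        frames = parse_frame_selection(spec["frames"], n_frames)
        stride = spec.get("stride", 1)
        kept = []
        for frame in frames:
            if not kept or frame - kept[-1] >= stride:
                kept.append(frame)
        return kept
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, (list, tuple)):
        return sorted(set(spec))
    if isinstance(spec, str):
        text = spec.strip()
        if text == "all":
            return list(range(n_frames))
        frames = set()
        for part in text.split(","):
            part = part.strip()
            if "-" in part:
                start, end = part.split("-")
                frames.update(range(int(start), int(end) + 1))  # inclusive end
            else:
                frames.add(int(part))
        return sorted(frames)
    raise TypeError(f"Unsupported frame selection: {spec!r}")


print(parse_frame_selection("10,20,30", 100))   # [10, 20, 30]
print(parse_frame_selection("all", 5))          # [0, 1, 2, 3, 4]
print(parse_frame_selection({"frames": "0-100", "stride": 10}, 200))
# [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```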

get_selection_info(pipeline_data: PipelineData, name: str) → Dict[str, Any]

Get information about a data selection.

Parameters

pipeline_data (PipelineData): Pipeline data object
name (str): Name of the data selector

Returns

Dict[str, Any]: Dictionary with selection information

Examples

>>> info = manager.get_selection_info(pipeline_data, "folded_frames")
>>> print(f"Selected {info['n_frames']} frames")
>>> print(f"Selection type: {info['selection_type']}")

list_selectors(pipeline_data: PipelineData) → List[str]

List all available data selectors.

Parameters

pipeline_data (PipelineData): Pipeline data object

Returns

List[str]: List of selector names

Examples

>>> selectors = manager.list_selectors(pipeline_data)
>>> print(f"Available selectors: {selectors}")

clear_selector(pipeline_data: PipelineData, name: str) → None

Clear all frames and criteria from a data selector.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.clear_selector("my_frames")  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.clear_selector(pipeline_data, "my_frames")  # WITH pipeline_data parameter

Parameters

pipeline_data (PipelineData): Pipeline data object
name (str): Name of the selector to clear

Returns

None: Clears all frames and criteria from the selector

Examples

>>> manager.clear_selector(pipeline_data, "my_selection")

remove_selector(pipeline_data: PipelineData, name: str) → None

Remove a data selector.

Parameters

pipeline_data (PipelineData): Pipeline data object
name (str): Name of the selector to remove

Returns

None: Removes the selector from pipeline_data

Examples

>>> manager.remove_selector(pipeline_data, "old_selection")

save(pipeline_data: PipelineData, save_path: str) → None

Save all data selector data to a single file.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.save('data_selector.npy')  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.save(pipeline_data, 'data_selector.npy')  # pipeline_data required

Parameters

pipeline_data (PipelineData): Pipeline data container with data selector data
save_path (str): Path of the single file to which all data selector data is saved

Returns

None: Saves all data selector data to the specified file

Examples

>>> manager.save(pipeline_data, 'data_selector.npy')

load(pipeline_data: PipelineData, load_path: str) → None

Load all data selector data from a single file.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.load('data_selector.npy')  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.load(pipeline_data, 'data_selector.npy')  # pipeline_data required

Parameters

pipeline_data (PipelineData): Pipeline data container to load data selector data into
load_path (str): Path to saved data selector data file

Returns

None: Loads all data selector data from the specified file

Examples

>>> manager.load(pipeline_data, 'data_selector.npy')

print_info(pipeline_data: PipelineData) → None

Print data selector information.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.data_selector.print_info()  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.print_info(pipeline_data)  # pipeline_data required

Parameters

pipeline_data (PipelineData): Pipeline data container with data selector data

Returns

None: Prints data selector information to console

Examples

>>> manager.print_info(pipeline_data)

create_from_clusters(pipeline_data: PipelineData, group_name: str, clustering_name: str, cluster_ids: List[int] | None = None, noise_id: int | None = -1, min_cluster_size: int | None = 2, force: bool = False) → None

Create data selectors automatically for clusters.

Creates one data selector per cluster and organizes them into a named group for easy reference. Noise clusters are filtered out by default.

Parameters

pipeline_data (PipelineData)

Pipeline data object containing clustering results

group_name (str)

Name for the selector group

clustering_name (str)

Name of clustering result to use

cluster_ids (List[int], optional)

Specific cluster IDs to include. If None, includes all non-noise clusters.

noise_id (int or None, default=-1)

Cluster ID that represents noise/outliers to filter out.

If int: Filters out this specific cluster ID (e.g., -1 for sklearn)
If None: No filtering, creates selectors for ALL cluster IDs

min_cluster_size (int or None, default=2)

Minimum number of frames required for a cluster to be included. If None, no size filter is applied (noise filtering still applies). The default of 2 excludes single-frame clusters, which is necessary for decision trees to work properly.

force (bool, default=False)

Whether to overwrite existing selectors with the same names. If False, raises ValueError when a selector already exists.

Returns

None: Creates selectors and stores group in pipeline_data

Raises

ValueError: If clustering_name does not exist in pipeline_data
ValueError: If selector already exists and force is False

Examples

>>> # Create selectors for all non-noise clusters
>>> manager.create_from_clusters(
...     pipeline_data, "clusters", "dbscan_clustering"
... )

>>> # Create selectors for specific clusters only
>>> manager.create_from_clusters(
...     pipeline_data, "folded", "clustering",
...     cluster_ids=[0, 1, 2]
... )

>>> # Include ALL clusters (even noise)
>>> manager.create_from_clusters(
...     pipeline_data, "all_states", "clustering",
...     noise_id=None
... )
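How noise_id and min_cluster_size interact can be sketched as a filter over per-frame cluster labels. The helper clusters_to_create is hypothetical and only decides which cluster IDs would receive a selector; it assumes that the noise and size filters also apply to explicitly listed cluster_ids, which the description above does not state. How selectors are named is not covered.

```python
from collections import Counter


def clusters_to_create(labels, cluster_ids=None, noise_id=-1, min_cluster_size=2):
    """Return the cluster IDs that pass the noise and size filters."""
    counts = Counter(labels)
    candidates = sorted(counts) if cluster_ids is None else list(cluster_ids)
    kept = []
    for cluster_id in candidates:
        if noise_id is not None and cluster_id == noise_id:
            continue  # drop the noise cluster
        if min_cluster_size is not None and counts.get(cluster_id, 0) < min_cluster_size:
            continue  # drop clusters with too few frames
        kept.append(cluster_id)
    return kept


# Cluster 2 has a single frame; -1 marks noise.
labels = [0, 0, 0, 1, 1, 2, -1, -1, -1]

print(clusters_to_create(labels))                  # [0, 1]
print(clusters_to_create(labels, noise_id=None))   # [-1, 0, 1]
print(clusters_to_create(labels, noise_id=None, min_cluster_size=None))
# [-1, 0, 1, 2]
```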

create_from_tags(pipeline_data: PipelineData, group_name: str, tags: List[str] | None = None, force: bool = False) → None

Create data selectors automatically for tags.

Creates one data selector per tag and organizes them into a named group for easy reference.

Parameters

pipeline_data (PipelineData): Pipeline data object
group_name (str): Name for the group
tags (List[str], optional): Specific tags to include
force (bool, default=False): Overwrite existing selectors

Returns

None: Creates selectors and group

Examples

>>> manager.create_from_tags(
...     pipeline_data, "systems", tags=["system_A", "system_B"]
... )

get_group(pipeline_data: PipelineData, group_name: str) → List[str]

Get selector names in a group.

Parameters

pipeline_data (PipelineData): Pipeline data object
group_name (str): Group name

Returns

List[str]: Selector names

Examples

>>> selectors = manager.get_group(pipeline_data, "clusters")

list_groups(pipeline_data: PipelineData) → List[str]

List all group names.

Parameters

pipeline_data (PipelineData): Pipeline data object

Returns

List[str]: Group names

Examples

>>> groups = manager.list_groups(pipeline_data)

delete_group(pipeline_data: PipelineData, group_name: str, delete_selectors: bool = False) → None

Delete a group.

Parameters

pipeline_data (PipelineData): Pipeline data object
group_name (str): Group name
delete_selectors (bool, default=False): Also delete the selectors that belong to the group

Returns

None: Deletes group

Examples

>>> manager.delete_group(pipeline_data, "old_group")
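The relationship between groups and selectors, and the effect of delete_selectors, can be pictured with a toy model. ToyGroupStore is a hypothetical stand-in that treats a group as a named list of selector names; the real storage layout and error types may differ.

```python
class ToyGroupStore:
    """Hypothetical model: groups are named lists of selector names."""

    def __init__(self):
        self.selectors = {}  # selector name -> list of selected frames
        self.groups = {}     # group name -> list of selector names

    def add_group(self, group_name, selector_names):
        for name in selector_names:
            self.selectors.setdefault(name, [])
        self.groups[group_name] = list(selector_names)

    def get_group(self, group_name):
        if group_name not in self.groups:
            raise ValueError(f"Group '{group_name}' does not exist")
        return list(self.groups[group_name])

    def list_groups(self):
        return list(self.groups)

    def delete_group(self, group_name, delete_selectors=False):
        names = self.groups.pop(group_name)
        if delete_selectors:
            # Also drop the selectors that belonged to the group.
            for name in names:
                self.selectors.pop(name, None)


store = ToyGroupStore()
store.add_group("systems", ["system_A", "system_B"])
store.add_group("old_group", ["old_1", "old_2"])

# Default: the group entry disappears, its selectors remain.
store.delete_group("old_group")
print(store.list_groups())        # ['systems']
print(sorted(store.selectors))    # ['old_1', 'old_2', 'system_A', 'system_B']

# With delete_selectors=True the member selectors are removed as well.
store.delete_group("systems", delete_selectors=True)
print(sorted(store.selectors))    # ['old_1', 'old_2']
```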