Data Selector Manager
Data selector manager for trajectory frame selection.
This module provides the DataSelectorManager class that manages frame selection (row selection) as the counterpart to FeatureSelector’s column selection. It supports selection based on tags, clusters, and combinations thereof.
- class mdxplain.data_selector.manager.data_selector_manager.DataSelectorManager
Manager for creating and managing trajectory frame selections.
This class provides methods to select trajectory frames (rows) based on various criteria such as tags, cluster assignments, or combinations. It serves as the counterpart to FeatureSelectorManager, focusing on row selection instead of column selection.
The manager supports:
- Tag-based frame selection
- Cluster-based frame selection
- Combination of multiple selections
- Frame index range selection
Examples
Pipeline mode (automatic injection):
>>> pipeline = PipelineManager()
>>> pipeline.data_selector.create("folded_frames")
>>> pipeline.data_selector.select_by_cluster("folded_frames", "conformations", [0])
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.create(pipeline_data, "folded_frames")
>>> manager.select_by_cluster(pipeline_data, "folded_frames", "conformations", [0])
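The difference between the two modes comes down to who supplies `pipeline_data`. The following is a minimal sketch of how such automatic injection could work, not mdxplain's actual implementation: `DemoPipelineData`, `DemoSelectorManager`, and `InjectingProxy` are hypothetical stand-ins introduced only for illustration.

```python
import functools


class DemoPipelineData:
    """Hypothetical stand-in for the real PipelineData container."""

    def __init__(self):
        self.selectors = {}


class DemoSelectorManager:
    """Hypothetical manager whose methods take pipeline_data first."""

    def create(self, pipeline_data, name):
        if name in pipeline_data.selectors:
            raise ValueError(f"Selector '{name}' already exists")
        pipeline_data.selectors[name] = []


class InjectingProxy:
    """Wraps a manager and pre-binds pipeline_data to every method call."""

    def __init__(self, manager, pipeline_data):
        self._manager = manager
        self._pipeline_data = pipeline_data

    def __getattr__(self, attr):
        target = getattr(self._manager, attr)
        if callable(target):
            # Bind pipeline_data as the first positional argument.
            return functools.partial(target, self._pipeline_data)
        return target


data = DemoPipelineData()
proxy = InjectingProxy(DemoSelectorManager(), data)

# Pipeline-style call: no pipeline_data argument.
proxy.create("folded_frames")

# Standalone-style call: pipeline_data is passed explicitly.
DemoSelectorManager().create(data, "system_A_frames")

print(sorted(data.selectors))  # ['folded_frames', 'system_A_frames']
```

In this sketch, passing `pipeline_data` through the proxy as well would supply the argument twice and fail with a `TypeError`, which illustrates why the parameter should be omitted in pipeline mode.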
- __init__() → None
Initialize the data selector manager.
Returns
- None
Initializes DataSelectorManager instance
- create(pipeline_data: PipelineData, name: str) → None
Create a new data selector with the given name.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.data_selector.create("folded_frames")  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.create(pipeline_data, "folded_frames")  # WITH pipeline_data parameter
Parameters
- pipeline_data : PipelineData
Pipeline data object to store the selector
- name : str
Name for the new data selector
Returns
- None
Creates empty DataSelectorData in pipeline_data
Raises
- ValueError
If a selector with the given name already exists
Examples
>>> manager = DataSelectorManager()
>>> manager.create(pipeline_data, "folded_frames")
>>> manager.create(pipeline_data, "system_A_frames")
- select_by_tags(pipeline_data: PipelineData, name: str, tags: List[str], match_all: bool = True, mode: str = 'add', stride: int = 1) → None
Select frames based on trajectory tags.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.data_selector.select_by_tags("biased_system_A", ["system_A", "biased"])  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.select_by_tags(pipeline_data, "biased_system_A", ["system_A", "biased"])  # WITH pipeline_data parameter
Parameters
- pipeline_data : PipelineData
Pipeline data object containing trajectory data
- name : str
Name of the data selector to populate
- tags : List[str]
List of tags to search for
- match_all : bool, default=True
If True, frame must have ALL tags. If False, ANY tag matches.
- mode : str, default="add"
Selection mode: "add" (union), "subtract" (difference), "intersect" (intersection)
- stride : int, default=1
Minimum distance between consecutive frames (per trajectory). stride=1 returns all frames, stride=10 returns every 10th frame.
# TODO: We could use an Enum for modes
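The three selection modes behave like set operations on frame indices. The sketch below shows one way they could be modeled with an Enum. `SelectionMode` and `apply_mode` are hypothetical names used for illustration only and are not part of the documented API.

```python
from enum import Enum


class SelectionMode(Enum):
    """Hypothetical enum mirroring the documented mode strings."""

    ADD = "add"
    SUBTRACT = "subtract"
    INTERSECT = "intersect"


def apply_mode(current, new, mode="add"):
    """Combine two collections of frame indices; returns a sorted list."""
    mode = SelectionMode(mode)  # raises ValueError for unknown strings
    current, new = set(current), set(new)
    if mode is SelectionMode.ADD:
        result = current | new
    elif mode is SelectionMode.SUBTRACT:
        result = current - new
    else:
        result = current & new
    return sorted(result)


existing = [0, 1, 2, 3]
incoming = [2, 3, 4]
print(apply_mode(existing, incoming, "add"))        # [0, 1, 2, 3, 4]
print(apply_mode(existing, incoming, "subtract"))   # [0, 1]
print(apply_mode(existing, incoming, "intersect"))  # [2, 3]
```

Under this reading, "intersect" applied to an empty selector always yields an empty result, so it is most useful after an earlier "add".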
Returns
- None
Updates DataSelectorData with selected frame indices
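The effect of `match_all` can be illustrated with a small standalone sketch. It assumes that tags are attached to whole trajectories and inherited by their frames. The function `trajectories_matching` and the tag table are hypothetical.

```python
def trajectories_matching(tags_by_trajectory, tags, match_all=True):
    """Return indices of trajectories whose tags satisfy the query."""
    wanted = set(tags)
    matches = []
    for index, traj_tags in tags_by_trajectory.items():
        present = wanted & set(traj_tags)
        if match_all:
            ok = present == wanted  # every requested tag must be present
        else:
            ok = bool(present)      # one requested tag is enough
        if ok:
            matches.append(index)
    return sorted(matches)


# Hypothetical tag table: trajectory index -> tags
tag_table = {
    0: ["system_A", "biased"],
    1: ["system_A", "unbiased"],
    2: ["system_B", "biased"],
}

query = ["system_A", "biased"]
print(trajectories_matching(tag_table, query, match_all=True))   # [0]
print(trajectories_matching(tag_table, query, match_all=False))  # [0, 1, 2]
```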
Raises
- ValueError
If the selector name does not exist or no trajectories are loaded
Examples
>>> # Add frames with all specified tags
>>> manager.select_by_tags(
...     pipeline_data, "biased_system_A", ["system_A", "biased"], match_all=True, mode="add"
... )
>>> # Select every 5th frame from tagged trajectories
>>> manager.select_by_tags(
...     pipeline_data, "biased_sparse", ["biased"], stride=5
... )
>>> # Keep only frames that have these tags
>>> manager.select_by_tags(
...     pipeline_data, "my_frames", ["production"], mode="intersect"
... )
- select_by_cluster(pipeline_data: PipelineData, name: str, clustering_name: str, cluster_ids: List[int | str], mode: str = 'add', stride: int = 1) → None
Select frames based on cluster assignments.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.data_selector.select_by_cluster("structured", "conformations", [0, 1])  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.select_by_cluster(pipeline_data, "structured", "conformations", [0, 1])  # WITH pipeline_data parameter
Parameters
- pipeline_data : PipelineData
Pipeline data object containing cluster data
- name : str
Name of the data selector to populate
- clustering_name : str
Name of the clustering result to use
- cluster_ids : List[Union[int, str]]
List of cluster IDs to select (can be numeric IDs or cluster names)
- mode : str, default="add"
Selection mode: "add" (union), "subtract" (difference), "intersect" (intersection)
- stride : int, default=1
Minimum distance between consecutive frames (per trajectory). Applied after cluster selection to maintain cluster representation.
Returns
- None
Updates DataSelectorData with selected frame indices
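Applying the stride after cluster membership has been resolved means thinning operates only on frames that already belong to the requested clusters. The sketch below shows one plausible reading of "minimum distance between consecutive frames": a greedy pass per trajectory. Both function names and the label table are hypothetical.

```python
def thin_by_stride(frames, stride=1):
    """Greedily keep frames that are at least `stride` apart."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    kept = []
    for frame in sorted(frames):
        if not kept or frame - kept[-1] >= stride:
            kept.append(frame)
    return kept


def select_cluster_frames(labels_by_trajectory, cluster_ids, stride=1):
    """Pick frames in the given clusters, then thin each trajectory."""
    wanted = set(cluster_ids)
    selection = {}
    for traj, labels in labels_by_trajectory.items():
        members = [i for i, label in enumerate(labels) if label in wanted]
        selection[traj] = thin_by_stride(members, stride)
    return selection


# Hypothetical per-frame cluster labels for two trajectories
labels = {
    0: [0, 0, 0, 1, 1, 0, 0, 2, 0, 0],
    1: [2, 2, 0, 0, 0, 0],
}
print(select_cluster_frames(labels, [0], stride=1))
# {0: [0, 1, 2, 5, 6, 8, 9], 1: [2, 3, 4, 5]}
print(select_cluster_frames(labels, [0], stride=3))
# {0: [0, 5, 8], 1: [2, 5]}
```

For a contiguous run of frames this reduces to taking every N-th frame, which matches the "every 10th frame" description of `stride=10`.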
Raises
- ValueError
If the selector name does not exist or the clustering is not found
Examples
>>> # Add frames from specific clusters
>>> manager.select_by_cluster(
...     pipeline_data, "structured", "conformations", [0, 1], mode="add"
... )
>>> # Select every 10th frame from clusters (sparse sampling)
>>> manager.select_by_cluster(
...     pipeline_data, "structured_sparse", "conformations", [0, 1], stride=10
... )
>>> # Keep only frames from these clusters
>>> manager.select_by_cluster(
...     pipeline_data, "my_frames", "conformations", ["folded", "intermediate"], mode="intersect"
... )
- select_by_indices(pipeline_data: PipelineData, name: str, trajectory_indices: Dict[int, List[int]] | Dict[str, List[int]] | Dict[int, str] | Dict[str, str], mode: str = 'add') → None
Select frames by explicit trajectory-specific frame indices.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.data_selector.select_by_indices("custom_frames", {0: [10, 20], 1: [5, 15]})  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.select_by_indices(pipeline_data, "custom_frames", {0: [10, 20], 1: [5, 15]})  # WITH pipeline_data parameter
Parameters
- pipeline_data : PipelineData
Pipeline data object
- name : str
Name of the data selector to populate
- trajectory_indices : Dict[traj_selection, frame_selection]
Dictionary mapping trajectory selectors to frame specifications.
traj_selection:
  - int: trajectory index (0, 1, 2, …)
  - str: trajectory name ("system_A"), tag ("tag:biased"), or pattern ("system_*")
  - A selector can resolve to multiple trajectories (e.g., a tag applies the frames to all matching trajectories)
frame_selection:
  - int: single frame (42)
  - List[int]: explicit frames ([10, 20, 30])
  - str: one of the following formats:
    - Single: "42"
    - Range: "10-20" → [10, 11, …, 20]
    - Comma list: "10,20,30" → [10, 20, 30]
    - Combined: "10-20,30-40,50" → [10…20, 30…40, 50]
    - All: "all" → all frames in trajectory
  - dict: with stride support:
    - {"frames": frame_selection, "stride": N}
    - stride = minimum distance between consecutive frames
    - Example: {"frames": "0-100", "stride": 10} → [0, 10, 20, …, 100]
- mode : str, default="add"
Selection mode: "add" (union), "subtract" (difference), "intersect" (intersection)
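To make the frame specification formats concrete, here is a minimal sketch of a parser that expands each documented form into a list of frame indices. `parse_frames` is a hypothetical helper, not the library's parser. It assumes inclusive ranges, as in the "10-20" example, and does no bounds checking against the trajectory length.

```python
def parse_frames(spec, n_frames):
    """Expand a frame specification into a sorted list of frame indices."""
    if isinstance(spec, dict):
        # Dict form: resolve the inner spec, then enforce the minimum distance.
        frames = parse_frames(spec["frames"], n_frames)
        stride = spec.get("stride", 1)
        kept = []
        for frame in frames:
            if not kept or frame - kept[-1] >= stride:
                kept.append(frame)
        return kept
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, (list, tuple)):
        return sorted(set(spec))
    if spec.strip().lower() == "all":
        return list(range(n_frames))
    frames = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            frames.update(range(start, end + 1))  # ranges are inclusive
        else:
            frames.add(int(part))
    return sorted(frames)


print(parse_frames("10-12,20", 100))  # [10, 11, 12, 20]
print(parse_frames("all", 4))         # [0, 1, 2, 3]
print(parse_frames({"frames": "0-100", "stride": 10}, 200))
# [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```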
Returns
- None
Updates DataSelectorData with specified trajectory frame indices
Examples
>>> # Direct trajectory-specific selection
>>> manager.select_by_indices(
...     pipeline_data, "custom_frames",
...     {0: [10, 20, 30], 1: [5, 15, 25]}, mode="add"
... )
>>> # Combined ranges
>>> manager.select_by_indices(
...     pipeline_data, "complex_frames",
...     {0: "10-20,30-40,50", "system_A": "100-200"}, mode="add"
... )
>>> # All frames from tagged trajectories
>>> manager.select_by_indices(
...     pipeline_data, "all_biased",
...     {"tag:biased": "all"}, mode="add"
... )
>>> # With stride for sparse sampling
>>> manager.select_by_indices(
...     pipeline_data, "sparse_frames",
...     {0: {"frames": "0-1000", "stride": 50}}, mode="add"
... )
>>> # Complex mixed example
>>> manager.select_by_indices(
...     pipeline_data, "mixed_selection",
...     {
...         "system_A": {"frames": "10-20,100-200", "stride": 5},
...         "tag:biased": "all",
...         1: [42, 84, 126]
...     }, mode="add"
... )
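The dictionary keys in these examples mix indices, names, tags, and patterns. The sketch below shows how such keys could be resolved to trajectory indices. It assumes glob-style matching for patterns such as "system_*". `resolve_trajectories` and the name and tag tables are hypothetical.

```python
from fnmatch import fnmatchcase


def resolve_trajectories(selector, names, tags_by_name):
    """Map one trajectory selector to a list of trajectory indices."""
    if isinstance(selector, int):
        return [selector]
    if selector.startswith("tag:"):
        tag = selector[len("tag:"):]
        return [
            i for i, name in enumerate(names)
            if tag in tags_by_name.get(name, ())
        ]
    # Plain names and glob patterns are both handled by fnmatchcase.
    return [i for i, name in enumerate(names) if fnmatchcase(name, selector)]


# Hypothetical trajectory names and tags
names = ["system_A", "system_B", "control"]
tags = {"system_A": ["biased"], "system_B": ["biased"], "control": []}

print(resolve_trajectories(1, names, tags))             # [1]
print(resolve_trajectories("system_A", names, tags))    # [0]
print(resolve_trajectories("tag:biased", names, tags))  # [0, 1]
print(resolve_trajectories("system_*", names, tags))    # [0, 1]
```

A key that resolves to several trajectories applies the same frame specification to each of them.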
- get_selection_info(pipeline_data: PipelineData, name: str) → Dict[str, Any]
Get information about a data selection.
Parameters
- pipeline_data : PipelineData
Pipeline data object
- name : str
Name of the data selector
Returns
- Dict[str, Any]
Dictionary with selection information
Examples
>>> info = manager.get_selection_info(pipeline_data, "folded_frames")
>>> print(f"Selected {info['n_frames']} frames")
>>> print(f"Selection type: {info['selection_type']}")
- list_selectors(pipeline_data: PipelineData) → List[str]
List all available data selectors.
Parameters
- pipeline_data : PipelineData
Pipeline data object
Returns
- List[str]
List of selector names
Examples
>>> selectors = manager.list_selectors(pipeline_data)
>>> print(f"Available selectors: {selectors}")
- clear_selector(pipeline_data: PipelineData, name: str) → None
Clear all frames and criteria from a data selector.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.data_selector.clear_selector("my_frames")  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.clear_selector(pipeline_data, "my_frames")  # WITH pipeline_data parameter
Parameters
- pipeline_data : PipelineData
Pipeline data object
- name : str
Name of the selector to clear
Returns
- None
Clears all frames and criteria from the selector
Examples
>>> manager.clear_selector(pipeline_data, "my_selection")
- remove_selector(pipeline_data: PipelineData, name: str) → None
Remove a data selector.
Parameters
- pipeline_data : PipelineData
Pipeline data object
- name : str
Name of the selector to remove
Returns
- None
Removes the selector from pipeline_data
Examples
>>> manager.remove_selector(pipeline_data, "old_selection")
- save(pipeline_data: PipelineData, save_path: str) → None
Save all data selector data to a single file.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.data_selector.save('data_selector.npy')  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.save(pipeline_data, 'data_selector.npy')  # WITH pipeline_data parameter
Parameters
- pipeline_data : PipelineData
Pipeline data container with data selector data
- save_path : str
Path of the single file in which all data selector data is saved
Returns
- None
Saves all data selector data to the specified file
Examples
>>> manager.save(pipeline_data, 'data_selector.npy')
- load(pipeline_data: PipelineData, load_path: str) → None
Load all data selector data from a single file.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.data_selector.load('data_selector.npy')  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.load(pipeline_data, 'data_selector.npy')  # WITH pipeline_data parameter
Parameters
- pipeline_data : PipelineData
Pipeline data container to load data selector data into
- load_path : str
Path to the saved data selector data file
Returns
- None
Loads all data selector data from the specified file
Examples
>>> manager.load(pipeline_data, 'data_selector.npy')
- print_info(pipeline_data: PipelineData) → None
Print data selector information.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.data_selector.print_info()  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = DataSelectorManager()
>>> manager.print_info(pipeline_data)  # WITH pipeline_data parameter
Parameters
- pipeline_data : PipelineData
Pipeline data container with data selector data
Returns
- None
Prints data selector information to console
Examples
>>> manager.print_info(pipeline_data)
- create_from_clusters(pipeline_data: PipelineData, group_name: str, clustering_name: str, cluster_ids: List[int] | None = None, noise_id: int | None = -1, min_cluster_size: int | None = 2, force: bool = False) → None
Create data selectors automatically for clusters.
Creates one data selector per cluster and organizes them into a named group for easy reference. Noise clusters are filtered out by default.
Parameters
- pipeline_data : PipelineData
Pipeline data object containing clustering results
- group_name : str
Name for the selector group
- clustering_name : str
Name of the clustering result to use
- cluster_ids : List[int], optional
Specific cluster IDs to include. If None, includes all non-noise clusters.
- noise_id : int or None, default=-1
Cluster ID that represents noise/outliers to filter out.
  - If int: filters out this specific cluster ID (e.g., -1 for sklearn)
  - If None: no filtering; creates selectors for ALL cluster IDs
- min_cluster_size : int or None, default=2
Minimum number of frames required for a cluster to be included. If None, no size filter is applied (noise filtering still applies). The default of 2 excludes single-frame clusters, which is necessary for decision trees to work properly.
- force : bool, default=False
Whether to overwrite existing selectors with the same names. If False, a ValueError is raised when a selector already exists.
Returns
- None
Creates selectors and stores group in pipeline_data
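The interaction of `cluster_ids`, `noise_id`, and `min_cluster_size` can be sketched as a filter over per-frame cluster labels. This is an illustration under assumptions, not the library's code: `clusters_to_include` is a hypothetical helper, and it applies the noise and size filters even when `cluster_ids` is given explicitly, which the documentation does not specify.

```python
from collections import Counter


def clusters_to_include(labels, cluster_ids=None, noise_id=-1, min_cluster_size=2):
    """Return the sorted cluster IDs that would each get a selector."""
    counts = Counter(labels)
    candidates = list(counts) if cluster_ids is None else cluster_ids
    selected = []
    for cluster in candidates:
        if noise_id is not None and cluster == noise_id:
            continue  # drop the noise cluster
        if min_cluster_size is not None and counts.get(cluster, 0) < min_cluster_size:
            continue  # drop clusters with too few frames
        selected.append(cluster)
    return sorted(selected)


# Hypothetical per-frame labels; -1 marks noise, cluster 1 has a single frame
labels = [0, 0, 1, -1, 2, 2, 2, -1, 0]
print(clusters_to_include(labels))                      # [0, 2]
print(clusters_to_include(labels, cluster_ids=[1, 2]))  # [2]
print(clusters_to_include(labels, noise_id=None, min_cluster_size=None))
# [-1, 0, 1, 2]
```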
Raises
- ValueError
If clustering_name does not exist in pipeline_data
- ValueError
If selector already exists and force is False
Examples
>>> # Create selectors for all non-noise clusters
>>> manager.create_from_clusters(
...     pipeline_data, "clusters", "dbscan_clustering"
... )
>>> # Create selectors for specific clusters only
>>> manager.create_from_clusters(
...     pipeline_data, "folded", "clustering",
...     cluster_ids=[0, 1, 2]
... )
>>> # Include ALL clusters (even noise)
>>> manager.create_from_clusters(
...     pipeline_data, "all_states", "clustering",
...     noise_id=None
... )
- create_from_tags(pipeline_data: PipelineData, group_name: str, tags: List[str] | None = None, force: bool = False) → None
Create data selectors automatically for tags.
Creates one data selector per tag and organizes them into a named group for easy reference.
Parameters
- pipeline_data : PipelineData
Pipeline data object
- group_name : str
Name for the selector group
- tags : List[str], optional
Specific tags to include
- force : bool, default=False
Whether to overwrite existing selectors with the same names
Returns
- None
Creates the selectors and stores the group in pipeline_data
Examples
>>> manager.create_from_tags(
...     pipeline_data, "systems", tags=["system_A", "system_B"]
... )
- get_group(pipeline_data: PipelineData, group_name: str) → List[str]
Get the names of the selectors in a group.
Parameters
- pipeline_data : PipelineData
Pipeline data object
- group_name : str
Name of the group
Returns
- List[str]
Names of the selectors in the group
Examples
>>> selectors = manager.get_group(pipeline_data, "clusters")
- list_groups(pipeline_data: PipelineData) → List[str]
List all group names.
Parameters
- pipeline_data : PipelineData
Pipeline data object
Returns
- List[str]
Group names
Examples
>>> groups = manager.list_groups(pipeline_data)
- delete_group(pipeline_data: PipelineData, group_name: str, delete_selectors: bool = False) → None
Delete a group.
Parameters
- pipeline_data : PipelineData
Pipeline data object
- group_name : str
Name of the group to delete
- delete_selectors : bool, default=False
Whether to also delete the selectors that belong to the group
Returns
- None
Deletes the group
Examples
>>> manager.delete_group(pipeline_data, "old_group")
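The difference between the two `delete_selectors` settings is easiest to see in a toy model. The sketch below treats a group as a named list of selector names. `GroupRegistry` is a hypothetical stand-in and does not reflect how mdxplain stores groups internally.

```python
class GroupRegistry:
    """Hypothetical store: selectors by name, plus named groups of selector names."""

    def __init__(self):
        self.selectors = {}  # selector name -> list of frame indices
        self.groups = {}     # group name -> list of selector names

    def add_group(self, group_name, selector_names):
        """Register a group and make sure each member selector exists."""
        for name in selector_names:
            self.selectors.setdefault(name, [])
        self.groups[group_name] = list(selector_names)

    def delete_group(self, group_name, delete_selectors=False):
        """Remove the group; optionally remove its member selectors too."""
        members = self.groups.pop(group_name)  # KeyError if the group is unknown
        if delete_selectors:
            for name in members:
                self.selectors.pop(name, None)


registry = GroupRegistry()
registry.add_group("systems", ["system_A", "system_B"])
registry.add_group("clusters", ["cluster_0", "cluster_1"])

registry.delete_group("systems")                          # selectors survive
registry.delete_group("clusters", delete_selectors=True)  # selectors removed too

print(sorted(registry.groups))     # []
print(sorted(registry.selectors))  # ['system_A', 'system_B']
```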