Data Selection (Frame/Row Selection)
While FeatureSelector defines which features (matrix columns) to analyze, DataSelector chooses which trajectory frames (matrix rows) to include. This enables subset-based analyses focusing on specific conformational states, conditions, or time windows.
Core Concept
FeatureSelector: Defines matrix columns (which features: contacts, distances, etc.)
DataSelector: Defines matrix rows (which frames: states, trajectories, conditions)
Combined: Creates targeted analysis matrices for specific scientific questions
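The row/column split can be pictured with a toy matrix. The following is a minimal sketch in plain Python; `full_matrix`, `feature_columns`, and `frame_rows` are hypothetical stand-ins for what the two selectors resolve to, not the library's internal data structures.

```python
# Toy feature matrix: 6 frames (rows) x 4 features (columns).
# Entry value = frame * 4 + feature, so each cell is easy to trace.
full_matrix = [[frame * 4 + feature for feature in range(4)] for frame in range(6)]

# Hypothetical results of resolving the two selectors.
feature_columns = [1, 3]   # what a FeatureSelector would pick (columns)
frame_rows = [0, 2, 5]     # what a DataSelector would pick (rows)

# Combined: keep only the chosen rows, and within them only the chosen columns.
analysis_matrix = [[full_matrix[r][c] for c in feature_columns] for r in frame_rows]

print(analysis_matrix)  # [[1, 3], [9, 11], [21, 23]]
```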
Why Use Data Selection
State-specific analysis: Focus on folded, unfolded, or intermediate conformations
Condition comparison: Wild-type vs. mutant, ligand-bound vs. apo
Outlier removal: Exclude noise clusters or equilibration frames
Data reduction: Sample large datasets for manageable analysis
Combined criteria: Intersection of multiple selection criteria
Methods
create(name) - Create Named Frame Selection
pipeline.data_selector.create("folded_frames")
pipeline.data_selector.create("active_state")
select_by_cluster(name, clustering_name, cluster_ids, mode="add", stride=1) - Cluster-Based Selection
Parameters
clustering_name: Name of clustering result (e.g., "DPA", "DBSCAN")
cluster_ids: List of cluster numbers [0, 1, 2] or names
mode: Same as tags (add/subtract/intersect)
stride: Sample every Nth frame from the selected clusters
Why use it
Conformational states: cluster_ids=[0] → only folded state
Multi-state analysis: cluster_ids=[0, 2] → active + intermediate
Comparative studies: Different selectors for each state
Outlier removal: cluster_ids=[-1], mode="subtract" → exclude noise
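To make the cluster-based selection concrete, here is a minimal sketch that works on a plain list of per-frame cluster labels. The helper `frames_in_clusters` is hypothetical and is not part of the library. It also assumes that the stride is applied to the matching frames, not to absolute frame indices, which the text above does not specify.

```python
def frames_in_clusters(labels, cluster_ids, stride=1):
    """Return indices of frames whose label is in cluster_ids.

    Assumption: stride keeps every stride-th *matching* frame.
    """
    wanted = set(cluster_ids)
    matches = [i for i, label in enumerate(labels) if label in wanted]
    return matches[::stride]


# One cluster label per frame; -1 marks noise, as in DBSCAN.
labels = [0, 0, 1, -1, 2, 0, 1, 2, -1, 0]

folded = frames_in_clusters(labels, [0])                  # [0, 1, 5, 9]
active_or_intermediate = frames_in_clusters(labels, [0, 2])  # [0, 1, 4, 5, 7, 9]
sparse = frames_in_clusters(labels, [0, 2], stride=2)     # [0, 4, 7]
noise = frames_in_clusters(labels, [-1])                  # [3, 8]
```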
select_by_indices(name, trajectory_indices, mode="add") - Direct Frame Index Selection
The most flexible selection method: specify exact frame numbers for each trajectory. Frames can be given as single integers, lists, range strings, or dictionaries with a stride.
Parameters
trajectory_indices: Dictionary mapping trajectory selectors to frame specifications
Trajectory Selectors (Keys)
int: Trajectory index → 0, 1, 2
str: Trajectory name → "system_A"
str: Tag pattern → "tag:biased" (applies to all trajectories with tag)
str: Name pattern → "system_*" (glob-style matching)
Frame Specifications (Values)
int: Single frame → 42
List[int]: Explicit frames → [10, 20, 30, 50]
str: Various string formats:
Single: "42" → frame 42
Range: "10-20" → frames 10, 11, …, 20
Comma list: "10,20,30" → frames 10, 20, 30
Combined: "10-20,30-40,50" → frames 10-20, 30-40, and 50
All: "all" → all frames in trajectory
dict: With stride support: {"frames": frame_spec, "stride": N} → apply stride to frame selection
Example: {"frames": "0-100", "stride": 10} → frames 0, 10, 20, …, 100
mode: Selection mode (add/subtract/intersect)
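The frame specification formats listed above can be expanded into explicit frame lists. The following is a sketch of one way to do that, not the library's actual parser. `parse_frame_spec` and its `n_frames` argument are hypothetical names, and non-negative frame numbers and inclusive ranges are assumed.

```python
def parse_frame_spec(spec, n_frames):
    """Expand a frame specification into a sorted list of frame indices.

    Assumptions: frame numbers are non-negative, ranges are inclusive,
    and n_frames is only needed to expand "all".
    """
    if isinstance(spec, dict):
        # {"frames": frame_spec, "stride": N}: expand first, then thin out.
        frames = parse_frame_spec(spec["frames"], n_frames)
        return frames[:: spec.get("stride", 1)]
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, (list, tuple)):
        return sorted(set(spec))
    if isinstance(spec, str):
        text = spec.strip()
        if text == "all":
            return list(range(n_frames))
        frames = set()
        for part in text.split(","):
            part = part.strip()
            if "-" in part:
                start, end = part.split("-")
                frames.update(range(int(start), int(end) + 1))
            else:
                frames.add(int(part))
        return sorted(frames)
    raise TypeError(f"Unsupported frame specification: {spec!r}")


print(parse_frame_spec("10-12,20", 100))                         # [10, 11, 12, 20]
print(parse_frame_spec({"frames": "0-100", "stride": 10}, 200))  # [0, 10, ..., 100]
```

Expanding the range before applying the stride is what makes `{"frames": "0-100", "stride": 10}` yield frames 0, 10, 20, …, 100.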
Why use it
Manual frame selection: Choose specific frames from analysis (e.g., representative structures)
Time window analysis: Select equilibrated region, exclude equilibration
Sparse sampling: Reduce dataset size with stride for computational efficiency
Complex patterns: Combine multiple ranges and specific frames
Cross-trajectory patterns: Use tags/names to apply same frame selection to multiple trajectories
Examples
Simple Frame Selection
# Select specific frames from trajectory 0
pipeline.data_selector.create("custom_frames")
pipeline.data_selector.select_by_indices(
    "custom_frames",
    {0: [100, 200, 300, 500]}  # Four specific frames
)
Time Window (Exclude Equilibration)
# Use frames 1000-5000 from all trajectories (first 1000 = equilibration)
pipeline.data_selector.create("equilibrated")
pipeline.data_selector.select_by_indices(
    "equilibrated",
    {"all": "1000-5000"}  # "all" applies to all loaded trajectories
)
Range With Stride (Sparse Sampling)
# Every 50th frame from equilibrated region for computational efficiency
pipeline.data_selector.create("sparse_equilibrated")
pipeline.data_selector.select_by_indices(
    "sparse_equilibrated",
    {0: {"frames": "1000-5000", "stride": 50}}  # 1000, 1050, 1100, ..., 5000
)
Complex Combined Ranges
# Multiple time windows and specific frames
pipeline.data_selector.create("complex_selection")
pipeline.data_selector.select_by_indices(
    "complex_selection",
    {
        0: "100-200,500-600,1000",  # Two ranges + single frame
        1: "200-400,800-1000"  # Different ranges for trajectory 1
    }
)
Tag-Based Frame Selection
# Apply same frame selection to all trajectories with "biased" tag
pipeline.data_selector.create("biased_frames")
pipeline.data_selector.select_by_indices(
    "biased_frames",
    {"tag:biased": "500-2000"}  # Frames 500-2000 from all biased trajectories
)
Name Pattern Matching
# Use all frames from trajectories matching pattern
pipeline.data_selector.create("system_A_all")
pipeline.data_selector.select_by_indices(
    "system_A_all",
    {"system_A*": "all"}  # All frames from trajectories starting with "system_A"
)
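The dictionary keys used in the examples above (index, name, tag pattern, glob pattern, and "all") each have to be mapped to concrete trajectories. The following is a hedged sketch of such a lookup, built on the standard-library `fnmatch` module. The function `resolve_trajectory_key` and the `names` and `tags` structures are hypothetical illustrations, not the library's implementation.

```python
from fnmatch import fnmatchcase


def resolve_trajectory_key(key, names, tags):
    """Return the trajectory indices matched by one selector key.

    names: list of trajectory names (list position = trajectory index)
    tags:  dict mapping trajectory name -> set of tags
    """
    if isinstance(key, int):
        return [key]                      # direct trajectory index
    if key == "all":
        return list(range(len(names)))    # every loaded trajectory
    if key.startswith("tag:"):
        tag = key[len("tag:"):]
        return [i for i, name in enumerate(names) if tag in tags.get(name, set())]
    # Exact names and glob patterns are both handled by fnmatchcase.
    return [i for i, name in enumerate(names) if fnmatchcase(name, key)]


names = ["system_A_run1", "system_A_run2", "system_B", "system_C"]
tags = {"system_A_run1": {"biased"}, "system_B": {"biased", "mutant"}}

print(resolve_trajectory_key("system_A*", names, tags))   # [0, 1]
print(resolve_trajectory_key("tag:biased", names, tags))  # [0, 2]
print(resolve_trajectory_key("system_C", names, tags))    # [3]
```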
Multi-Trajectory With Different Frame Selections and Stride
# Complex real-world scenario: different selections per trajectory type
pipeline.data_selector.create("production_analysis")
pipeline.data_selector.select_by_indices(
    "production_analysis",
    {
        "tag:wild_type": {"frames": "2000-10000", "stride": 20},  # WT: sparse sampling
        "tag:mutant": "2000-10000",  # Mutant: all frames (smaller dataset)
        "system_C": [100, 500, 1000, 2000]  # System C: specific snapshots only
    }
)
Practical Examples
State-Specific Analysis
Scientific question: What features characterize the folded state?
Why: Focus feature importance analysis only on folded conformation
Use case: Identify stabilizing interactions in folded state, compare folded vs other states
Result: Only frames from cluster 0 (folded state) for downstream analysis
pipeline.data_selector.create("folded")
pipeline.data_selector.select_by_cluster("folded", "DPA", [0])
Multi-State Conformational Analysis
Scientific question: What features distinguish active and intermediate from inactive?
Why: Combine multiple conformational states for comparison against baseline
Use case: Activation pathway analysis, identify shared features of non-inactive states
Result: Frames from clusters 0 (active) + 2 (intermediate), excludes cluster 1 (inactive)
pipeline.data_selector.create("active_states")
pipeline.data_selector.select_by_cluster("active_states", "DPA", [0, 2])
Production Run With Data Reduction
Scientific question: What is baseline behavior in wild-type production simulations?
Why: match_all=True ensures only production WT (not equilibration WT)
Why: stride=5 reduces computational cost while maintaining statistical validity
Use case: Reference dataset for mutant comparison, representative WT behavior
Result: Every 5th frame from wild-type production runs only
pipeline.data_selector.create("wt_prod")
pipeline.data_selector.select_by_tags(
    "wt_prod",
    tags=["wild_type", "production"],  # Must have BOTH tags
    match_all=True,  # AND logic: production AND wild_type
    stride=5  # Sample every 5th frame for efficiency
)
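The difference between AND and OR tag matching can be shown with plain sets. This is an illustrative sketch only. `trajectories_with_tags` and `trajectory_tags` are hypothetical, and the OR behaviour for `match_all=False` is an assumption, since the text above only describes the `match_all=True` case.

```python
def trajectories_with_tags(trajectory_tags, tags, match_all=False):
    """Return names of trajectories matching the requested tags.

    match_all=True  -> trajectory must carry every requested tag (AND)
    match_all=False -> trajectory must carry at least one (OR, assumed)
    """
    wanted = set(tags)
    if match_all:
        return [name for name, have in trajectory_tags.items() if wanted <= set(have)]
    return [name for name, have in trajectory_tags.items() if wanted & set(have)]


trajectory_tags = {
    "wt_eq": ["wild_type", "equilibration"],
    "wt_prod": ["wild_type", "production"],
    "mut_prod": ["mutant", "production"],
}

and_match = trajectories_with_tags(trajectory_tags, ["wild_type", "production"], match_all=True)
or_match = trajectories_with_tags(trajectory_tags, ["wild_type", "production"], match_all=False)

print(and_match)  # ['wt_prod']
print(or_match)   # ['wt_eq', 'wt_prod', 'mut_prod']

# stride=5 on a 20-frame trajectory keeps frames 0, 5, 10, 15.
strided_frames = list(range(20))[::5]
```

With AND logic, the equilibration run of the wild type is excluded even though it carries the `wild_type` tag.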
Multi-Step Complex Selection With Set Operations
Scientific question: What features are specific to cluster 1 in biased simulations?
Why: 3-step refinement isolates specific subset using set operations
Use case: Enhanced sampling analysis, isolate converged non-noise states
pipeline.data_selector.create("biased_cluster1")
# Step 1: START with all biased simulation frames (union operation)
pipeline.data_selector.select_by_tags("biased_cluster1", ["biased"], mode="add")
# Current selection: all frames from biased trajectories
# Step 2: NARROW to cluster 1 only (intersection operation)
pipeline.data_selector.select_by_cluster("biased_cluster1", "DBSCAN", [1], mode="intersect")
# Current selection: biased frames that are ALSO in cluster 1
# Step 3: CLEAN by removing noise cluster (subtraction operation)
pipeline.data_selector.select_by_cluster("biased_cluster1", "DBSCAN", [-1], mode="subtract")
# Final selection: biased cluster 1, excluding any noise assignments
# Note: Safety step. DBSCAN labels noise as -1 and gives each frame exactly one label, so after Step 2 this subtraction is a safeguard, not a required filter
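The three modes correspond to ordinary set operations on frame identifiers. A minimal sketch, assuming frames are identified by `(trajectory_index, frame_index)` tuples; `apply_mode` and the example sets are hypothetical and only illustrate the add/intersect/subtract sequence used above.

```python
def apply_mode(current, new, mode="add"):
    """Combine the current selection with a new frame set."""
    if mode == "add":
        return current | new        # union
    if mode == "subtract":
        return current - new        # set difference
    if mode == "intersect":
        return current & new        # intersection
    raise ValueError(f"Unknown mode: {mode!r}")


# Frames as (trajectory_index, frame_index) pairs.
biased_frames = {(0, 0), (0, 1), (0, 2), (1, 0)}
cluster_1_frames = {(0, 1), (0, 2), (2, 5)}
noise_frames = {(1, 0), (2, 7)}

selection = set()
selection = apply_mode(selection, biased_frames, "add")            # Step 1
selection = apply_mode(selection, cluster_1_frames, "intersect")   # Step 2
selection = apply_mode(selection, noise_frames, "subtract")        # Step 3

print(sorted(selection))  # [(0, 1), (0, 2)]
```

In this toy data the noise set does not overlap cluster 1, so Step 3 leaves the selection unchanged. The order of operations still matters: intersecting before adding would start from an empty set and stay empty.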
Systematic Condition Comparison (WT vs Mutant in Same State)
Scientific question: What molecular differences exist between WT and mutant in folded state?
Why: Compare same conformational state across different systems
Use case: Mutation effect analysis, identify compensatory changes
Strategy: Create two matched selectors for direct comparison
Both select folded state (cluster 0)
Different genetic backgrounds (WT vs mutant tags)
These two selectors enable apples-to-apples comparison:
Same conformational state, different genetic backgrounds
Use in comparison.create_comparison() to find mutation-specific features
Selector A: Wild-type in folded state
pipeline.data_selector.create("wt_folded")
pipeline.data_selector.select_by_tags("wt_folded", ["wild_type"], mode="add")
# Get all WT frames
pipeline.data_selector.select_by_cluster("wt_folded", "conformations", [0], mode="intersect")
# Narrow to folded conformation only
# Result: WT folded state frames
Selector B: Mutant in folded state
pipeline.data_selector.create("mutant_folded")
pipeline.data_selector.select_by_tags("mutant_folded", ["mutant"], mode="add")
# Get all mutant frames
pipeline.data_selector.select_by_cluster("mutant_folded", "conformations", [0], mode="intersect")
# Narrow to folded conformation only
# Result: Mutant folded state frames
Outlier Removal (Quality Control)
Scientific question: What is protein behavior excluding simulation artifacts?
Why: DBSCAN noise cluster (-1) often contains artifacts, transitions, rare events
Use case: Clean analysis dataset, focus on well-sampled conformations
Result: Only well-defined clusters, excludes noise/artifacts
pipeline.data_selector.create("clean_conformations")
# Select only the well-defined clusters; noise (-1) is excluded by not listing it
pipeline.data_selector.select_by_cluster("clean_conformations", "DBSCAN", [0, 1, 2], mode="add")
# Alternative: select all frames first, then remove cluster -1 with mode="subtract"
Integration with Comparative Analysis
Data selectors define the frame sets (rows) for feature importance and comparison analyses:
# Create feature matrix (columns)
pipeline.feature_selector.create("binding_site")
pipeline.feature_selector.add.contacts("binding_site", "resid 120-140")
# Create frame selections (rows)
pipeline.data_selector.create("state_A")
pipeline.data_selector.select_by_cluster("state_A", "DPA", [0])
pipeline.data_selector.create("state_B")
pipeline.data_selector.select_by_cluster("state_B", "DPA", [1])
# Compare states using selected features and frames
pipeline.comparison.create_comparison(
    name="state_comparison",
    mode="one_vs_rest",
    feature_selector="binding_site",  # Column selection
    data_selectors=["state_A", "state_B"]  # Row selections
)
Common Use Cases
Conformational analysis: Select frames by cluster assignment
Condition comparison: Select frames by trajectory tags (WT/mutant, apo/holo)
Data reduction: Stride sampling for large datasets
Quality control: Exclude equilibration, outliers, or unstable frames
Multi-criteria filtering: Combine tags and clusters with mode operations