Data Selection (Frame/Row Selection)

While FeatureSelector defines which features (matrix columns) to analyze, DataSelector chooses which trajectory frames (matrix rows) to include. This enables subset-based analyses focusing on specific conformational states, conditions, or time windows.

Core Concept

FeatureSelector: Defines matrix columns (which features: contacts, distances, etc.)
DataSelector: Defines matrix rows (which frames: states, trajectories, conditions)
Combined: Creates targeted analysis matrices for specific scientific questions
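
To make the row/column split concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the tiny matrix and the names `feature_columns` and `frame_rows` are hypothetical stand-ins for what a feature selection and a data selection might resolve to.

```python
# Hypothetical feature matrix: 5 frames (rows) x 4 features (columns).
matrix = [
    [0.1, 0.2, 0.3, 0.4],
    [1.1, 1.2, 1.3, 1.4],
    [2.1, 2.2, 2.3, 2.4],
    [3.1, 3.2, 3.3, 3.4],
    [4.1, 4.2, 4.3, 4.4],
]

feature_columns = [0, 2]  # assumed result of a feature selection (columns)
frame_rows = [1, 3, 4]    # assumed result of a data selection (rows)

# Combined: keep only the chosen rows, then only the chosen columns.
sub_matrix = [[matrix[r][c] for c in feature_columns] for r in frame_rows]

print(sub_matrix)  # [[1.1, 1.3], [3.1, 3.3], [4.1, 4.3]]
```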

Why Use Data Selection

State-specific analysis: Focus on folded, unfolded, or intermediate conformations
Condition comparison: Wild-type vs. mutant, ligand-bound vs. apo
Outlier removal: Exclude noise clusters or equilibration frames
Data reduction: Sample large datasets for manageable analysis
Combined criteria: Intersection of multiple selection criteria

Methods

`create(name)` - Create Named Frame Selection

pipeline.data_selector.create("folded_frames")
pipeline.data_selector.create("active_state")

`select_by_tags(name, tags, match_all=True, mode="add", stride=1)` - Tag-Based Selection

Parameters

tags: List of trajectory tags to match
match_all:
- True (default): Frame needs ALL tags (AND logic) - ["wild_type", "production"] → both required
- False: Frame needs ANY tag (OR logic) - ["system_A", "system_B"] → either suffices
mode:
- "add" (default): Union - add frames to selection
- "subtract": Difference - remove frames from selection
- "intersect": Intersection - keep only overlap
stride: Sample every Nth frame (1=all frames, 10=every 10th)
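
The AND/OR behavior of match_all and the effect of stride can be illustrated with a small standalone sketch. This only models the documented semantics and is not the library's code: `trajectory_matches`, `select_frames_by_tags`, and the example trajectories are hypothetical, and applying the stride per trajectory is an assumption.

```python
def trajectory_matches(traj_tags, wanted_tags, match_all=True):
    """Return True if a trajectory's tags satisfy the requested tags."""
    traj_tags, wanted_tags = set(traj_tags), set(wanted_tags)
    if match_all:
        return wanted_tags <= traj_tags   # AND: every wanted tag is present
    return bool(wanted_tags & traj_tags)  # OR: at least one wanted tag is present


def select_frames_by_tags(trajectories, wanted_tags, match_all=True, stride=1):
    """Collect (trajectory_name, frame) pairs from matching trajectories."""
    selected = set()
    for name, info in trajectories.items():
        if trajectory_matches(info["tags"], wanted_tags, match_all):
            # Assumption: the stride is applied within each trajectory.
            for frame in range(0, info["n_frames"], stride):
                selected.add((name, frame))
    return selected


# Hypothetical trajectories, for illustration only.
trajectories = {
    "wt_eq": {"tags": ["wild_type", "equilibration"], "n_frames": 10},
    "wt_prod": {"tags": ["wild_type", "production"], "n_frames": 10},
    "mut_prod": {"tags": ["mutant", "production"], "n_frames": 10},
}

wanted = ["wild_type", "production"]
both = select_frames_by_tags(trajectories, wanted, match_all=True, stride=5)
either = select_frames_by_tags(trajectories, wanted, match_all=False, stride=5)

print(sorted(both))   # [('wt_prod', 0), ('wt_prod', 5)]
print(len(either))    # 6
```

With match_all=True only "wt_prod" carries both tags; with match_all=False all three trajectories carry at least one of them.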

Why use it

Condition-based filtering: tags=["wild_type"] → only WT trajectory frames
System organization: tags=["system_A", "biased"] → specific simulation type
Data sampling: stride=10 → reduce dataset size

`select_by_cluster(name, clustering_name, cluster_ids, mode="add", stride=1)` - Cluster-Based Selection

Parameters

clustering_name: Name of clustering result (e.g., "DPA", "DBSCAN")
cluster_ids: List of cluster numbers (e.g., [0, 1, 2]) or names
mode: Same as in select_by_tags (add/subtract/intersect)
stride: Sample every Nth frame of the selected clusters

Why use it

Conformational states: cluster_ids=[0] → only folded state
Multi-state analysis: cluster_ids=[0, 2] → active + intermediate
Comparative studies: Different selectors for each state
Outlier removal: cluster_ids=[-1], mode="subtract" → exclude noise
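
As a rough illustration of what cluster-based selection amounts to, the sketch below picks frame indices from a list of per-frame cluster labels. The labels and the helper `frames_in_clusters` are hypothetical; the only assumption carried over from the text is that noise is labeled -1.

```python
# Hypothetical per-frame cluster labels; -1 marks noise.
labels = [0, 0, 1, -1, 2, 0, -1, 1, 2, 2]


def frames_in_clusters(labels, cluster_ids):
    """Return the set of frame indices whose label is in cluster_ids."""
    wanted = set(cluster_ids)
    return {frame for frame, label in enumerate(labels) if label in wanted}


folded = frames_in_clusters(labels, [0])          # single state
multi_state = frames_in_clusters(labels, [0, 2])  # two states combined

# Outlier removal: start from every frame, then subtract the noise cluster.
all_frames = set(range(len(labels)))
without_noise = all_frames - frames_in_clusters(labels, [-1])

print(sorted(folded))         # [0, 1, 5]
print(sorted(multi_state))    # [0, 1, 4, 5, 8, 9]
print(sorted(without_noise))  # [0, 1, 2, 4, 5, 7, 8, 9]
```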

`select_by_indices(name, trajectory_indices, mode="add")` - Direct Frame Index Selection

The most flexible selection method: specify exact frame numbers for each trajectory. Both the trajectory keys and the frame values accept several input formats, listed below.

Parameters

trajectory_indices: Dictionary mapping trajectory selectors to frame specifications

Trajectory Selectors (Keys)

int: Trajectory index → 0, 1, 2
str: Trajectory name → "system_A"
str: Tag pattern → "tag:biased" (applies to all trajectories with tag)
str: Name pattern → "system_*" (glob-style matching)
str: "all" → applies to all loaded trajectories
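
One way to picture how such keys could be resolved to trajectory indices is sketched below using the standard-library `fnmatch` module for glob matching. The function `resolve_trajectory_key` and the example names and tags are hypothetical; the library may resolve keys differently, for example in how it orders the checks.

```python
from fnmatch import fnmatchcase


def resolve_trajectory_key(key, names, tags_by_name):
    """Map one selector key to a list of trajectory indices."""
    if isinstance(key, int):
        return [key]                      # direct trajectory index
    if key == "all":
        return list(range(len(names)))    # every loaded trajectory
    if key.startswith("tag:"):
        tag = key[len("tag:"):]
        return [i for i, n in enumerate(names) if tag in tags_by_name[n]]
    # Exact names and glob patterns are both handled by fnmatchcase.
    return [i for i, n in enumerate(names) if fnmatchcase(n, key)]


# Hypothetical trajectories, for illustration only.
names = ["system_A", "system_B", "control"]
tags_by_name = {
    "system_A": ["biased"],
    "system_B": ["biased", "mutant"],
    "control": [],
}

print(resolve_trajectory_key(1, names, tags_by_name))             # [1]
print(resolve_trajectory_key("system_A", names, tags_by_name))    # [0]
print(resolve_trajectory_key("tag:biased", names, tags_by_name))  # [0, 1]
print(resolve_trajectory_key("system_*", names, tags_by_name))    # [0, 1]
print(resolve_trajectory_key("all", names, tags_by_name))         # [0, 1, 2]
```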

Frame Specifications (Values)

int: Single frame → 42
List[int]: Explicit frames → [10, 20, 30, 50]
str: Various string formats:
- Single: "42" → frame 42
- Range: "10-20" → frames 10, 11, …, 20
- Comma list: "10,20,30" → frames 10, 20, 30
- Combined: "10-20,30-40,50" → frames 10-20, 30-40, and 50
- All: "all" → all frames in trajectory
dict: With stride support:
- {"frames": frame_spec, "stride": N} → apply stride to frame selection
- Example: {"frames": "0-100", "stride": 10} → frames 0, 10, 20, …, 100

mode: Selection mode (add/subtract/intersect)
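
The frame specification formats above can be expanded into explicit frame lists with a small parser. The following is a sketch under stated assumptions, not the library's parser: `parse_frame_spec` is a hypothetical helper, ranges are treated as inclusive (as in "10-20" → 10, 11, …, 20), the stride is applied to the already expanded frame list, and negative frame numbers are not handled.

```python
def parse_frame_spec(spec, n_frames):
    """Expand a frame specification into a sorted list of frame indices."""
    if isinstance(spec, dict):
        # {"frames": ..., "stride": N}: expand first, then take every Nth entry.
        frames = parse_frame_spec(spec["frames"], n_frames)
        return frames[::spec.get("stride", 1)]
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, (list, tuple)):
        return sorted(set(spec))
    if spec.strip() == "all":
        return list(range(n_frames))

    frames = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            frames.update(range(start, end + 1))  # ranges are inclusive
        else:
            frames.add(int(part))
    return sorted(frames)


print(parse_frame_spec("42", 100))                                # [42]
print(parse_frame_spec("10,20,30", 100))                          # [10, 20, 30]
print(len(parse_frame_spec("10-20,30-40,50", 100)))               # 23
print(parse_frame_spec({"frames": "0-100", "stride": 10}, 1000))  # [0, 10, ..., 100]
```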

Why use it

Manual frame selection: Choose specific frames identified by a previous analysis (e.g., representative structures)
Time window analysis: Select equilibrated region, exclude equilibration
Sparse sampling: Reduce dataset size with stride for computational efficiency
Complex patterns: Combine multiple ranges and specific frames
Cross-trajectory patterns: Use tags/names to apply same frame selection to multiple trajectories

Examples

Simple Frame Selection

# Select specific frames from trajectory 0
pipeline.data_selector.create("custom_frames")
pipeline.data_selector.select_by_indices(
    "custom_frames",
    {0: [100, 200, 300, 500]}  # Four specific frames
)

Time Window (Exclude Equilibration)

# Use frames 1000-5000 from all trajectories (first 1000 = equilibration)
pipeline.data_selector.create("equilibrated")
pipeline.data_selector.select_by_indices(
    "equilibrated",
    {"all": "1000-5000"}  # "all" applies to all loaded trajectories
)

Range With Stride (Sparse Sampling)

# Every 50th frame from equilibrated region for computational efficiency
pipeline.data_selector.create("sparse_equilibrated")
pipeline.data_selector.select_by_indices(
    "sparse_equilibrated",
    {0: {"frames": "1000-5000", "stride": 50}}  # 1000, 1050, 1100, ..., 5000
)
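
To gauge how much a stride shrinks a selection, the frame count can be checked with plain Python. This is only arithmetic on the range used above and assumes the range is inclusive and the stride starts at its first frame.

```python
start, end, stride = 1000, 5000, 50

dense = list(range(start, end + 1))  # every frame in the inclusive window
sparse = dense[::stride]             # every 50th frame of that window

print(len(dense))              # 4001
print(len(sparse))             # 81
print(sparse[:3], sparse[-1])  # [1000, 1050, 1100] 5000
```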

Complex Combined Ranges

# Multiple time windows and specific frames
pipeline.data_selector.create("complex_selection")
pipeline.data_selector.select_by_indices(
    "complex_selection",
    {
        0: "100-200,500-600,1000",  # Two ranges + single frame
        1: "200-400,800-1000"       # Different ranges for trajectory 1
    }
)

Tag-Based Frame Selection

# Apply same frame selection to all trajectories with "biased" tag
pipeline.data_selector.create("biased_frames")
pipeline.data_selector.select_by_indices(
    "biased_frames",
    {"tag:biased": "500-2000"}  # Frames 500-2000 from all biased trajectories
)

Name Pattern Matching

# Use all frames from trajectories matching pattern
pipeline.data_selector.create("system_A_all")
pipeline.data_selector.select_by_indices(
    "system_A_all",
    {"system_A*": "all"}  # All frames from trajectories starting with "system_A"
)

Multi-Trajectory With Different Frame Selections and Stride

# Complex real-world scenario: different selections per trajectory type
pipeline.data_selector.create("production_analysis")
pipeline.data_selector.select_by_indices(
    "production_analysis",
    {
        "tag:wild_type": {"frames": "2000-10000", "stride": 20},  # WT: sparse sampling
        "tag:mutant": "2000-10000",        # Mutant: all frames (smaller dataset)
        "system_C": [100, 500, 1000, 2000] # System C: specific snapshots only
    }
)

Practical Examples

State-Specific Analysis

Scientific question: What features characterize the folded state?
Why: Focus feature importance analysis only on folded conformation
Use case: Identify stabilizing interactions in folded state, compare folded vs other states
Result: Only frames from cluster 0 (folded state) for downstream analysis

pipeline.data_selector.create("folded")
pipeline.data_selector.select_by_cluster("folded", "DPA", [0])

Multi-State Conformational Analysis

Scientific question: What features distinguish active and intermediate from inactive?
Why: Combine multiple conformational states for comparison against baseline
Use case: Activation pathway analysis, identify shared features of non-inactive states
Result: Frames from clusters 0 (active) + 2 (intermediate), excludes cluster 1 (inactive)

pipeline.data_selector.create("active_states")
pipeline.data_selector.select_by_cluster("active_states", "DPA", [0, 2])

Production Run With Data Reduction

Scientific question: What is the baseline behavior in wild-type production simulations?
Why: match_all=True ensures only production WT (not equilibration WT)
Why: stride=5 reduces computational cost; consecutive frames are typically strongly correlated, so little information is lost
Use case: Reference dataset for mutant comparison, representative WT behavior
Result: Every 5th frame from wild-type production runs only

pipeline.data_selector.create("wt_prod")
pipeline.data_selector.select_by_tags(
    "wt_prod",
    tags=["wild_type", "production"],  # Must have BOTH tags
    match_all=True,                    # AND logic: production AND wild_type
    stride=5                           # Sample every 5th frame for efficiency
)

Multi-Step Complex Selection With Set Operations

Scientific question: What features are specific to cluster 1 in biased simulations?
Why: 3-step refinement isolates specific subset using set operations
Use case: Enhanced sampling analysis, isolate converged non-noise states

pipeline.data_selector.create("biased_cluster1")

# Step 1: START with all biased simulation frames (union operation)
pipeline.data_selector.select_by_tags("biased_cluster1", ["biased"], mode="add")
# Current selection: all frames from biased trajectories

# Step 2: NARROW to cluster 1 only (intersection operation)
pipeline.data_selector.select_by_cluster("biased_cluster1", "DBSCAN", [1], mode="intersect")
# Current selection: biased frames that are ALSO in cluster 1

# Step 3: CLEAN by removing noise cluster (subtraction operation)
pipeline.data_selector.select_by_cluster("biased_cluster1", "DBSCAN", [-1], mode="subtract")
# Final selection: biased cluster 1, excluding any noise assignments
# Note: this is a safety step. DBSCAN labels noise as -1, and subtracting it removes any such frames.
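
The three modes behave like ordinary set operations, which the following sketch mirrors with Python sets. The helper `apply_mode` and the frame sets are hypothetical; this models the documented add/intersect/subtract sequence and is not the library's internal code.

```python
def apply_mode(current, new, mode="add"):
    """Combine the current selection with new frames using a set operation."""
    if mode == "add":
        return current | new   # union
    if mode == "subtract":
        return current - new   # difference
    if mode == "intersect":
        return current & new   # intersection
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical frame sets, for illustration only.
biased = {0, 1, 2, 3, 4, 5}
cluster_1 = {2, 3, 7, 8}
noise = {5, 9}

selection = set()                                          # freshly created selection
selection = apply_mode(selection, biased, "add")           # step 1: {0, 1, 2, 3, 4, 5}
selection = apply_mode(selection, cluster_1, "intersect")  # step 2: {2, 3}
selection = apply_mode(selection, noise, "subtract")       # step 3: no-op here, no overlap

print(sorted(selection))  # [2, 3]
```

Order matters in this model: an intersect applied to a freshly created, empty selection stays empty, which is why the sequence starts with mode="add".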

Systematic Condition Comparison (WT vs Mutant in Same State)

Scientific question: What molecular differences exist between WT and mutant in folded state?
Why: Compare same conformational state across different systems
Use case: Mutation effect analysis, identify compensatory changes
Strategy: Create two matched selectors for direct comparison
- Both select the folded state (cluster 0)
- Each uses a different genetic background (WT vs. mutant tags)
Result: An apples-to-apples comparison of the same conformational state across different genetic backgrounds. Use both selectors in comparison.create_comparison() to find mutation-specific features.

Selector A: Wild-type in folded state

pipeline.data_selector.create("wt_folded")
pipeline.data_selector.select_by_tags("wt_folded", ["wild_type"], mode="add")
# Get all WT frames
pipeline.data_selector.select_by_cluster("wt_folded", "conformations", [0], mode="intersect")
# Narrow to folded conformation only
# Result: WT folded state frames

Selector B: Mutant in folded state

pipeline.data_selector.create("mutant_folded")
pipeline.data_selector.select_by_tags("mutant_folded", ["mutant"], mode="add")
# Get all mutant frames
pipeline.data_selector.select_by_cluster("mutant_folded", "conformations", [0], mode="intersect")
# Narrow to folded conformation only
# Result: Mutant folded state frames

Outlier Removal (Quality Control)

Scientific question: What is the protein's behavior once simulation artifacts are excluded?
Why: DBSCAN noise cluster (-1) often contains artifacts, transitions, rare events
Use case: Clean analysis dataset, focus on well-sampled conformations
Result: Only well-defined clusters, excludes noise/artifacts

pipeline.data_selector.create("clean_conformations")
# Add only the well-defined clusters; the noise cluster (-1) is never added
pipeline.data_selector.select_by_cluster("clean_conformations", "DBSCAN", [0, 1, 2], mode="add")
# Alternative: start from all frames and remove cluster -1 with mode="subtract"

Integration with Comparative Analysis

Data selectors define the frame sets (rows) for feature importance and comparison analyses:

# Create feature matrix (columns)
pipeline.feature_selector.create("binding_site")
pipeline.feature_selector.add.contacts("binding_site", "resid 120-140")

# Create frame selections (rows)
pipeline.data_selector.create("state_A")
pipeline.data_selector.select_by_cluster("state_A", "DPA", [0])

pipeline.data_selector.create("state_B")
pipeline.data_selector.select_by_cluster("state_B", "DPA", [1])

# Compare states using selected features and frames
pipeline.comparison.create_comparison(
    name="state_comparison",
    mode="one_vs_rest",
    feature_selector="binding_site",      # Column selection
    data_selectors=["state_A", "state_B"] # Row selections
)

Common Use Cases

Conformational analysis: Select frames by cluster assignment
Condition comparison: Select frames by trajectory tags (WT/mutant, apo/holo)
Data reduction: Stride sampling for large datasets
Quality control: Exclude equilibration, outliers, or unstable frames
Multi-criteria filtering: Combine tags and clusters with mode operations
