Data Selection (Frame/Row Selection) ==================================== .. todo: needs streamlining While FeatureSelector defines which features (matrix columns) to analyze, DataSelector chooses which trajectory frames (matrix rows) to include. This enables subset-based analyses focusing on specific conformational states, conditions, or time windows. Core Concept ------------ - **FeatureSelector**: Defines matrix columns (which features: contacts, distances, etc.) - **DataSelector**: Defines matrix rows (which frames: states, trajectories, conditions) - **Combined**: Creates targeted analysis matrices for specific scientific questions Why Use Data Selection ---------------------- - **State-specific analysis**: Focus on folded, unfolded, or intermediate conformations - **Condition comparison**: Wild-type vs. mutant, ligand-bound vs. apo - **Outlier removal**: Exclude noise clusters or equilibration frames - **Data reduction**: Sample large datasets for manageable analysis - **Combined criteria**: Intersection of multiple selection criteria Methods ------- ``create(name)`` - Create Named Frame Selection ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: python pipeline.data_selector.create("folded_frames") pipeline.data_selector.create("active_state") ``select_by_tags(name, tags, match_all=True, mode="add", stride=1)`` - Tag-Based Selection ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Parameters """""""""" - ``tags``: List of trajectory tags to match - ``match_all``: - ``True`` (default): Frame needs ALL tags (AND logic) - ``["wild_type", "production"]`` → both required - ``False``: Frame needs ANY tag (OR logic) - ``["system_A", "system_B"]`` → either suffices - ``mode``: - ``"add"`` (default): Union - add frames to selection - ``"subtract"``: Difference - remove frames from selection - ``"intersect"``: Intersection - keep only overlap - ``stride``: Sample every Nth frame (1=all frames, 10=every 10th) Why use it """""""""" - Condition-based filtering: ``tags=["wild_type"]`` → only WT trajectory frames - System organization: ``tags=["system_A", "biased"]`` → specific simulation type - Data sampling: ``stride=10`` → reduce dataset size ``select_by_cluster(name, clustering_name, cluster_ids, mode="add", stride=1)`` - Cluster-Based Selection ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Parameters """""""""" - ``clustering_name``: Name of clustering result (e.g., "DPA", "DBSCAN") - ``cluster_ids``: List of cluster numbers ``[0, 1, 2]`` or names - ``mode``: Same as tags (add/subtract/intersect) - ``stride``: Sample selected clusters Why use it """""""""" - Conformational states: ``cluster_ids=[0]`` → only folded state - Multi-state analysis: ``cluster_ids=[0, 2]`` Result → active + intermediate - Comparative studies: Different selectors for each state - Outlier removal: ``cluster_ids=[-1], mode="subtract"`` → exclude noise ``select_by_indices(name, trajectory_indices, mode="add")`` - Direct Frame Index Selection ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The most flexible selection method: specify exact frame numbers for each trajectory. Supports various input formats for maximum convenience. Parameters """""""""" ``trajectory_indices``: Dictionary mapping trajectory selectors to frame specifications Trajectory Selectors (Keys) ''''''''''''''''''''''''''' - ``int``: Trajectory index → ``0``, ``1``, ``2`` - ``str``: Trajectory name → ``"system_A"`` - ``str``: Tag pattern → ``"tag:biased"`` (applies to all trajectories with tag) - ``str``: Name pattern → ``"system_*"`` (glob-style matching) Frame Specifications (Values) ''''''''''''''''''''''''''''' - ``int``: Single frame → ``42`` - ``List[int]``: Explicit frames → ``[10, 20, 30, 50]`` - ``str``: Various string formats: - Single: ``"42"`` → frame 42 - Range: ``"10-20"`` → frames 10, 11, ..., 20 - Comma list: ``"10,20,30"`` → frames 10, 20, 30 - Combined: ``"10-20,30-40,50"`` → frames 10-20, 30-40, and 50 - All: ``"all"`` → all frames in trajectory - ``dict``: With stride support: - ``{"frames": frame_spec, "stride": N}`` → apply stride to frame selection - Example: ``{"frames": "0-100", "stride": 10}`` → frames 0, 10, 20, ..., 100 ``mode``: Selection mode (add/subtract/intersect) Why use it """""""""" - **Manual frame selection**: Choose specific frames from analysis (e.g., representative structures) - **Time window analysis**: Select equilibrated region, exclude equilibration - **Sparse sampling**: Reduce dataset size with stride for computational efficiency - **Complex patterns**: Combine multiple ranges and specific frames - **Cross-trajectory patterns**: Use tags/names to apply same frame selection to multiple trajectories Examples -------- Simple Frame Selection ^^^^^^^^^^^^^^^^^^^^^^ .. code:: python # Select specific frames from trajectory 0Result pipeline.data_selector.create("custom_frames") pipeline.data_selector.select_by_indices( "custom_frames", {0: [100, 200, 300, 500]} # Four specific frames ) Time Window (Exclude Equilibration) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: python # Use frames 1000-5000 from all trajectories (first 1000 = equilibration) pipeline.data_selector.create("equilibrated") pipeline.data_selector.select_by_indices( "equilibrated", {"all": "1000-5000"} # "all" applies to all loaded trajectories ) Range With Stride (Sparse Sampling) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: python # Every 50th frame from equilibrated region for computational efficiency pipeline.data_selector.create("sparse_equilibrated") pipeline.data_selector.select_by_indices( "sparse_equilibrated", {0: {"frames": "1000-5000", "stride": 50}} # 1000, 1050, 1100, ..., 5000 ) Complex Combined Ranges ^^^^^^^^^^^^^^^^^^^^^^^ .. code:: python # Multiple time windows and specific frames pipeline.data_selector.create("complex_selection") pipeline.data_selector.select_by_indices( "complex_selection", { 0: "100-200,500-600,1000", # Two ranges + single frame 1: "200-400,800-1000" # Different ranges for trajectory 1 } ) Tag-Based Frame Selection ^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: python # Apply same frame selection to all trajectories with "biased" tag pipeline.data_selector.create("biased_frames") pipeline.data_selector.select_by_indices( "biased_frames", {"tag:biased": "500-2000"} # Frames 500-2000 from all biased trajectories ) Name Pattern Matching ^^^^^^^^^^^^^^^^^^^^^ .. code:: python # Use all frames from trajectories matching pattern pipeline.data_selector.create("system_A_all") pipeline.data_selector.select_by_indices( "system_A_all", {"system_A*": "all"} # All frames from trajectories starting with "system_A" ) Multi-Trajectory With Different Frame Selections and Stride ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: python # Complex real-world scenario: different selections per trajectory type pipeline.data_selector.create("production_analysis") pipeline.data_selector.select_by_indices( "production_analysis", { "tag:wild_type": {"frames": "2000-10000", "stride": 20}, # WT: sparse sampling "tag:mutant": "2000-10000", # Mutant: all frames (smaller dataset) "system_C": [100, 500, 1000, 2000] # System C: specific snapshots only } ) Practical Examples ------------------ State-Specific Analysis ^^^^^^^^^^^^^^^^^^^^^^^ - **Scientific question**: What features characterize the folded state? - **Why**: Focus feature importance analysis only on folded conformation - **Use case**: Identify stabilizing interactions in folded state, compare folded vs other states - **Result**: Only frames from cluster 0 (folded state) for downstream analysis .. code:: python pipeline.data_selector.create("folded") pipeline.data_selector.select_by_cluster("folded", "DPA", [0]) Multi-State Conformational Analysis ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - **Scientific question**: What features distinguish active and intermediate from inactive? - **Why**: Combine multiple conformational states for comparison against baseline - **Use case**: Activation pathway analysis, identify shared features of non-inactive states - **Result**: Frames from clusters 0 (active) + 2 (intermediate), excludes cluster 1 (inactive) .. code:: python pipeline.data_selector.create("active_states") pipeline.data_selector.select_by_cluster("active_states", "DPA", [0, 2]) Production Run With Data Reduction ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - **Scientific question**: What is baseline behavior in wild-type production simulations? - **Why**: match_all=True ensures only production WT (not equilibration WT) - **Why**: stride=5 reduces computational cost while maintaining statistical validity - **Use case**: Reference dataset for mutant comparison, representative WT behavior - **Result**: Every 5th frame from wild-type production runs only .. code:: python pipeline.data_selector.create("wt_prod") pipeline.data_selector.select_by_tags( "wt_prod", tags=["wild_type", "production"], # Must have BOTH tags match_all=True, # AND logic: production AND wild_type stride=5 # Sample every 5th frame for efficiency ) Multi-Step Complex Selection With Set Operations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - **Scientific question**: What features are specific to cluster 1 in biased simulations? - **Why**: 3-step refinement isolates specific subset using set operations - **Use case**: Enhanced sampling analysis, isolate converged non-noise states .. code:: python pipeline.data_selector.create("biased_cluster1") # Step 1: START with all biased simulation frames (union operation) pipeline.data_selector.select_by_tags("biased_cluster1", ["biased"], mode="add") # Current selection: all frames from biased trajectories # Step 2: NARROW to cluster 1 only (intersection operation) pipeline.data_selector.select_by_cluster("biased_cluster1", "DBSCAN", [1], mode="intersect") # Current selection: biased frames that are ALSO in cluster 1 # Step 3: CLEAN by removing noise cluster (subtraction operation) pipeline.data_selector.select_by_cluster("biased_cluster1", "DBSCAN", [-1], mode="subtract") # Final selection: biased cluster 1, excluding any noise assignments # Note: Safety step - DBSCAN can assign noise (-1), this removes artifacts Systematic Condition Comparison (WT vs Mutant in Same State) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - **Scientific question**: What molecular differences exist between WT and mutant in folded state? - **Why**: Compare same conformational state across different systems - **Use case**: Mutation effect analysis, identify compensatory changes - **Strategy**: Create two matched selectors for direct comparison - Both select folded state (cluster 0) - Different genetic backgrounds (WT vs mutant tags) - These two selectors enable apples-to-apples comparison: - Same conformational state, different genetic backgrounds - Use in ``comparison.create_comparison()`` to find mutation-specific features .. rubric:: Wild-type in folded state .. code:: python pipeline.data_selector.create("wt_folded") pipeline.data_selector.select_by_tags("wt_folded", ["wild_type"], mode="add") # Get all WT frames pipeline.data_selector.select_by_cluster("wt_folded", "conformations", [0], mode="intersect") # Narrow to folded conformation only # Result: WT folded state frames .. rubric:: Selector B: Mutant in folded state .. code:: python pipeline.data_selector.create("mutant_folded") pipeline.data_selector.select_by_tags("mutant_folded", ["mutant"], mode="add") # Get all mutant frames pipeline.data_selector.select_by_cluster("mutant_folded", "conformations", [0], mode="intersect") # Narrow to folded conformation only # Result: Mutant folded state frames Outlier Removal (Quality Control) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - **Scientific question**: What is protein behavior excluding simulation artifacts? - **Why**: DBSCAN noise cluster (-1) often contains artifacts, transitions, rare events - **Use case**: Clean analysis dataset, focus on well-sampled conformations - **Result**: Only well-defined clusters, excludes noise/artifacts .. code:: python pipeline.data_selector.create("clean_conformations") # Start with all frames from main clustering pipeline.data_selector.select_by_cluster("clean_conformations", "DBSCAN", [0, 1, 2], mode="add") # Explicitly exclude noise (could also start with "all" and subtract) Integration with Comparative Analysis ------------------------------------- Data selectors define the frame sets (rows) for feature importance and comparison analyses: .. code:: python # Create feature matrix (columns) pipeline.feature_selector.create("binding_site") pipeline.feature_selector.add.contacts("binding_site", "resid 120-140") # Create frame selections (rows) pipeline.data_selector.create("state_A") pipeline.data_selector.select_by_cluster("state_A", "DPA", [0]) pipeline.data_selector.create("state_B") pipeline.data_selector.select_by_cluster("state_B", "DPA", [1]) # Compare states using selected features and frames pipeline.comparison.create_comparison( name="state_comparison", mode="one_vs_rest", feature_selector="binding_site", # Column selection data_selectors=["state_A", "state_B"] # Row selections ) Common Use Cases ^^^^^^^^^^^^^^^^ - **Conformational analysis**: Select frames by cluster assignment - **Condition comparison**: Select frames by trajectory tags (WT/mutant, apo/holo) - **Data reduction**: Stride sampling for large datasets - **Quality control**: Exclude equilibration, outliers, or unstable frames - **Multi-criteria filtering**: Combine tags and clusters with mode operations