Feature Selection ================= .. todo: generally fine; needs streamlining and clarification. Clarify on difference to Data Selector. When should each be used? Feature selection defines which molecular features (matrix columns) enter your analysis. mdxplain provides a custom parsing syntax that supports residue names, IDs, consensus nomenclature, and logical operations for flexible feature filtering. How the Parser Works -------------------- The selection language is case-insensitive and automatically recognizes keywords. Ranges use ``-`` (e.g., ``10-50``), lists are space-separated (e.g., ``ALA HIS GLY``), ``and`` combines conditions (intersection), and ``not`` excludes matches (difference). Selection Keywords ^^^^^^^^^^^^^^^^^^ ``res`` - Residue Names """"""""""""""""""""""" - **What**: Select by amino acid type - **Why use it**: - Analyze hydrophobic core: ``res ALA VAL LEU ILE PHE`` - Study charged interactions: ``res ARG LYS ASP GLU`` - Focus on specific residue types for functional analysis - Test chemical property hypotheses (aromatic, polar, etc.) - **Example**: ``res ARG LYS`` ``resid`` - Residue IDs """"""""""""""""""""""" - **What**: Select by residue numbers in PDB - **Why use it**: - Target known functional regions from literature: ``resid 120-140`` → binding site - Test structure-based hypotheses from crystal structures - Focus on specific structural elements (loops, termini) - Region-specific analysis based on domain knowledge - **Example**: ``resid 10-50`` → N-terminal region ``seqid`` - Sequence IDs """""""""""""""""""""""" - **What**: Select by sequence position - **Why use it**: Alignment-based analyses, homologous position comparisons across proteins - **Example**: Useful in multi-chain complexes ``consensus`` - Consensus Nomenclature """""""""""""""""""""""""""""""""""""" - **What**: Structure-based biological numbering (GPCRdb, CGN, kinase numbering) - **Why use it**: - **Functional motifs**: ``consensus 3x50`` → DRY motif in GPCRs (conserved activation switch) - **Structural elements**: ``consensus 7x*`` → entire TM7 helix - **Conserved networks**: ``consensus *40-*50`` → positions 40-50 e.g. across all TM helices (allosteric pathways) - **Cross-protein comparison**: Same consensus position = functionally equivalent - **Patterns**: - Single: ``7x53`` → NPxxY tyrosine (Y7.53, activation-associated) - Range: ``7x50-8x50`` → TM7-TM8 interface region - Wildcard: ``7x*`` → all TM7 positions - Multi-pattern: ``*40-*50`` → positions 40-50 in all TM helices - G-proteins: ``G.H5.*`` → helix 5 of G-alpha - ``all`` **prefix**: Include residues without consensus labels - ``7x-8x`` → only residues WITH consensus labels from 7x to 8x - ``all 7x-8x`` → ALL residues from 7x to 8x, including those without consensus (loops, unstructured regions) - **Why**: Capture complete structural regions including flexible elements - **Example**: ``all 7x-8x`` includes TM7, intracellular loop, and TM8 - **Note**: This is not fixed to a consensus. ``*40-*50`` for example is looking for a pattern that matches this. It does not have to be a TM helix, if you have other consensus labels. It is a pattern matching. It takes the first consensus label entry with ``*40`` and then takes the next with ``*50`` and takes the range in between. If it does NOT find a start or end point it takes the next residue in the list starting or ending with seqid. ``all`` """"""" - **What**: Select all features - **Why use it**: Initial exploration, global pattern discovery, unbiased analysis Logical Operators ^^^^^^^^^^^^^^^^^ - ``and`` - Intersection (both conditions must be true) - **Why**: Specific combinations: ``res ARG and resid 120-140`` → positive charges in binding site - ``not`` - Exclusion (remove matches) - **Why**: "All except" patterns: ``res ALA and not resid 25`` → all alanines except position 25 .. code:: python # Step 1: Create named feature selection (empty container) pipeline.feature_selector.create("my_selection") # Step 2: Add features using different selection syntaxes # Add contacts matching residue names pipeline.feature_selector.add.contacts("my_selection", "res ALA HIS") # Add contacts in specific residue ID range (binding site region) pipeline.feature_selector.add.contacts("my_selection", "resid 10-50") # Add contacts using consensus nomenclature (GPCR activation pathway) pipeline.feature_selector.add.contacts("my_selection", "consensus 7x50-8x50") # Add distances with logical operators (all ALA except position 25) pipeline.feature_selector.add.distances("my_selection", "res ALA and not resid 25") # Step 3: Add features with reduction (statistical filtering during selection) # Only keep contacts formed in >30% of frames (stable interactions) pipeline.feature_selector.add.contacts.with_frequency_reduction( "my_selection", "resid 120-140", threshold_min=0.3 ) # per_trajectory (default): reduce each trajectory independently # cross_trajectory_intersection=True: must pass in ALL trajectories # cross_trajectory_union=True: pass in ANY trajectory # cross_trajectory_pooled=True: pool frames first, then reduce once # Step 4: Multi-trajectory mode # common_denominator=True: Only Alanins present in ALL trajectories # (Useful for comparing different systems with slightly different structures) pipeline.feature_selector.add.contacts( "my_selection", "ALA", common_denominator=True ) # Step 5: Use pre-reduced features # use_reduced=True: Uses features from feature.reduce_data() instead of raw data # (When you want to apply global reduction first, then select subset) pipeline.feature_selector.add.distances( "my_selection", "all", use_reduced=True ) # Step 6: Apply selection to create final feature matrix # Combines all added selections into single feature matrix for analysis # Note each of the selections adds the features to the set. Its a union of all selectors before select call. pipeline.feature_selector.select("my_selection") Practical Examples with Biological Context ------------------------------------------ GPCR Activation Contact (Ionic Lock) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Selects a known constraining interaction in GPCRs: - 3x50: DRY motif arginine (R3.50) - 6x30: conserved glutamate (E6.30) on TM6 .. code:: python pipeline.feature_selector.add.contacts("gpcr", "consensus 3x50 and consensus 6x30") G-Protein Binding Interface ^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Combines two structural elements for GPCR-G-protein coupling: - G.H5.*: Complete helix 5 of G-alpha protein (major contact surface) - 8x*: Helix 8 of receptor (intracellular C-terminus) - Interface contacts important for G-protein activation .. code:: python pipeline.feature_selector.add.contacts("interface", "consensus G.H5.* and consensus 8x*") Binding Pocket Aromatic Cage ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Targets aromatic residues in known binding region: - Aromatic amino acids (PHE, TRP, TYR) form π-stacking interactions - Residues 100-150: Binding pocket from crystal structure - Creates feature set for ligand-binding site characterization .. code:: python pipeline.feature_selector.add.contacts("pocket", "res PHE TRP TYR and resid 100-150") Hydrophobic Core Interactions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Selects non-polar residues forming protein core: - ALA, VAL, LEU, ILE, PHE: Hydrophobic amino acids - Monitors core packing stability and folding integrity .. code:: python pipeline.feature_selector.add.contacts("core", "res ALA VAL LEU ILE PHE") TM Helix Interface Including Loop ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - With 'all' prefix - Captures complete structural region: - Without 'all': Only labeled TM7/TM8 residues - With 'all': Includes intracellular loop between helices - Complete interface for conformational change analysis .. code:: python pipeline.feature_selector.add.contacts("tm7_loop_tm8", "all 7x-8x")