Feature Selection

Feature selection defines which molecular features (matrix columns) enter your analysis. mdxplain provides a custom parsing syntax that supports residue names, IDs, consensus nomenclature, and logical operations for flexible feature filtering.

How the Parser Works

The selection language is case-insensitive and automatically recognizes keywords. Ranges use - (e.g., 10-50) and lists are space-separated (e.g., ALA HIS GLY). The and operator combines conditions (intersection), and the not operator excludes matches (difference).
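To make the range and list notation concrete, here is a minimal Python sketch of how such tokens could be expanded. It is not mdxplain's parser: the helper names expand_resid_tokens and normalize_residue_names are hypothetical, and it assumes that ranges are inclusive and that single IDs and ranges may be mixed in one list.

```python
def expand_resid_tokens(text):
    """Expand space-separated IDs and ranges (e.g. "10-12 25") into sorted ints."""
    ids = set()
    for token in text.split():
        if "-" in token:
            start, end = token.split("-", 1)
            # Assumption: both ends of a range are included.
            ids.update(range(int(start), int(end) + 1))
        else:
            ids.add(int(token))
    return sorted(ids)


def normalize_residue_names(text):
    """Upper-case a space-separated list of residue names (case-insensitive input)."""
    return [name.upper() for name in text.split()]


print(expand_resid_tokens("10-12 25"))         # [10, 11, 12, 25]
print(normalize_residue_names("ala His GLY"))  # ['ALA', 'HIS', 'GLY']
```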

Selection Keywords

res - Residue Names

  • What: Select by amino acid type

  • Why use it:

    • Analyze hydrophobic core: res ALA VAL LEU ILE PHE

    • Study charged interactions: res ARG LYS ASP GLU

    • Focus on specific residue types for functional analysis

    • Test chemical property hypotheses (aromatic, polar, etc.)

  • Example: res ARG LYS

resid - Residue IDs

  • What: Select by residue number as given in the PDB file

  • Why use it:

    • Target known functional regions from literature: resid 120-140 → binding site

    • Test structure-based hypotheses from crystal structures

    • Focus on specific structural elements (loops, termini)

    • Region-specific analysis based on domain knowledge

  • Example: resid 10-50 → N-terminal region

seqid - Sequence IDs

  • What: Select by sequence position

  • Why use it: Alignment-based analyses, homologous position comparisons across proteins

  • Example: Useful in multi-chain complexes

consensus - Consensus Nomenclature

  • What: Structure-based biological numbering (GPCRdb, CGN, kinase numbering)

  • Why use it:

    • Functional motifs: consensus 3x50 → DRY motif in GPCRs (conserved activation switch)

    • Structural elements: consensus 7x* → entire TM7 helix

    • Conserved networks: consensus *40-*50 → positions 40-50, e.g., across all TM helices (allosteric pathways)

    • Cross-protein comparison: Same consensus position = functionally equivalent

  • Patterns:

    • Single: 7x53 → NPxxY tyrosine (Y7.53, activation-associated)

    • Range: 7x50-8x50 → TM7-TM8 interface region

    • Wildcard: 7x* → all TM7 positions

    • Multi-pattern: *40-*50 → positions 40-50 in all TM helices

    • G-proteins: G.H5.* → helix 5 of G-alpha

    • all prefix: Include residues without consensus labels

      • 7x-8x → only residues WITH consensus labels from 7x to 8x

      • all 7x-8x → ALL residues from 7x to 8x, including those without consensus (loops, unstructured regions)

      • Why: Capture complete structural regions including flexible elements

      • Example: all 7x-8x includes TM7, intracellular loop, and TM8

  • Note: Consensus selection is pure pattern matching and is not tied to one particular nomenclature. For example, *40-*50 looks for labels that match this pattern; they do not have to belong to TM helices if your data uses other consensus labels. The parser takes the first consensus label matching *40, then the next label matching *50, and selects the range in between. If it does NOT find a start or end point, it takes the next residue in the list, starting or ending with seqid.
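The pattern matching described in the note can be illustrated with a small Python sketch. It is not mdxplain's implementation: the toy label list and the helpers match_indices and select_range are hypothetical, shell-style wildcards from the standard fnmatch module stand in for the consensus wildcard, and the include_unlabeled flag only mimics the effect of the all prefix. The fallback used when no start or end label is found is not modeled; the sketch raises an error instead.

```python
from fnmatch import fnmatchcase

# Toy residue list in sequence order; None marks residues without a consensus label.
labels = ["7x49", "7x50", "7x53", None, None, "8x47", "8x50", "8x51"]


def match_indices(pattern):
    """Indices of all labeled residues whose label matches the wildcard pattern."""
    return [i for i, lab in enumerate(labels)
            if lab is not None and fnmatchcase(lab, pattern)]


def select_range(start_pattern, end_pattern, include_unlabeled=False):
    """Select from the first start match to the next end match (inclusive)."""
    start = next((i for i, lab in enumerate(labels)
                  if lab is not None and fnmatchcase(lab, start_pattern)), None)
    if start is None:
        raise ValueError("no label matches the start pattern")
    end = next((i for i in range(start, len(labels))
                if labels[i] is not None and fnmatchcase(labels[i], end_pattern)), None)
    if end is None:
        raise ValueError("no label matches the end pattern")
    span = range(start, end + 1)
    if include_unlabeled:
        return list(span)  # like the all prefix: keep unlabeled residues too
    return [i for i in span if labels[i] is not None]


print(match_indices("7x*"))                                   # [0, 1, 2]
print(select_range("7x50", "8x50"))                           # [1, 2, 5, 6]
print(select_range("7x50", "8x50", include_unlabeled=True))   # [1, 2, 3, 4, 5, 6]
print(select_range("*x50", "*x53"))                           # [1, 2]
```

The last call shows why the order of labels matters: *x50 matches both 7x50 and 8x50, but only the first match is used as the start point.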

all

  • What: Select all features

  • Why use it: Initial exploration, global pattern discovery, unbiased analysis

Logical Operators

  • and - Intersection (both conditions must be true)

    • Why: Specific combinations: res ARG and resid 120-140 → positive charges in binding site

  • not - Exclusion (remove matches)

    • Why: “All except” patterns: res ALA and not resid 25 → all alanines except position 25
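The operator semantics above map directly onto set operations. The following Python sketch uses a made-up residue table and hypothetical helpers (by_name, by_id_range); it works on residues only and does not show how mdxplain maps a residue selection onto pairwise features such as contacts or distances.

```python
# Toy residue table: (residue name, residue ID); values are made up for illustration.
residues = [("ALA", 24), ("ALA", 25), ("ARG", 121), ("ARG", 200), ("LYS", 130)]


def by_name(*names):
    """Residues whose name is one of the given names."""
    return {r for r in residues if r[0] in names}


def by_id_range(start, end):
    """Residues whose ID lies in the inclusive range start..end."""
    return {r for r in residues if start <= r[1] <= end}


# "res ARG and resid 120-140": intersection of both conditions
arg_in_site = by_name("ARG") & by_id_range(120, 140)

# "res ALA and not resid 25": difference (alanines minus position 25)
ala_except_25 = by_name("ALA") - by_id_range(25, 25)

print(sorted(arg_in_site))    # [('ARG', 121)]
print(sorted(ala_except_25))  # [('ALA', 24)]
```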

# Step 1: Create named feature selection (empty container)
pipeline.feature_selector.create("my_selection")

# Step 2: Add features using different selection syntaxes
# Add contacts matching residue names
pipeline.feature_selector.add.contacts("my_selection", "res ALA HIS")

# Add contacts in a specific residue ID range (e.g., the N-terminal region)
pipeline.feature_selector.add.contacts("my_selection", "resid 10-50")

# Add contacts using consensus nomenclature (GPCR activation pathway)
pipeline.feature_selector.add.contacts("my_selection", "consensus 7x50-8x50")

# Add distances with logical operators (all ALA except position 25)
pipeline.feature_selector.add.distances("my_selection", "res ALA and not resid 25")

# Step 3: Add features with reduction (statistical filtering during selection)
# Only keep contacts formed in >30% of frames (stable interactions)
pipeline.feature_selector.add.contacts.with_frequency_reduction(
    "my_selection", "resid 120-140", threshold_min=0.3
)
# per_trajectory (default): reduce each trajectory independently
# cross_trajectory_intersection=True: must pass in ALL trajectories
# cross_trajectory_union=True: pass in ANY trajectory
# cross_trajectory_pooled=True: pool frames first, then reduce once

# Step 4: Multi-trajectory mode
# common_denominator=True: Only alanines present in ALL trajectories
# (Useful for comparing different systems with slightly different structures)
pipeline.feature_selector.add.contacts(
    "my_selection", "ALA", common_denominator=True
)

# Step 5: Use pre-reduced features
# use_reduced=True: Uses features from feature.reduce_data() instead of raw data
# (When you want to apply global reduction first, then select subset)
pipeline.feature_selector.add.distances(
    "my_selection", "all", use_reduced=True
)

# Step 6: Apply selection to create final feature matrix
# Combines all added selections into single feature matrix for analysis
# Note: each add call adds its features to the set, so the result is the union of all selectors added before the select call.
pipeline.feature_selector.select("my_selection")
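The frequency reduction and the cross-trajectory modes mentioned in Step 3 can be pictured with a small pure-Python sketch. This is an illustration under stated assumptions, not mdxplain's code: the data are made up, the helpers frequency and passing are hypothetical, and a contact is assumed to pass when its frequency is strictly greater than threshold_min.

```python
# Binary contact data: trajectory -> feature -> per-frame 0/1 values (made-up numbers).
trajectories = {
    "traj_a": {"c1": [1, 1, 1, 0], "c2": [0, 0, 1, 0], "c3": [1, 1, 0, 0]},
    "traj_b": {"c1": [1, 1, 0, 0], "c2": [1, 1, 1, 1], "c3": [0, 0, 0, 0]},
}
THRESHOLD_MIN = 0.3


def frequency(frames):
    """Fraction of frames in which the contact is formed."""
    return sum(frames) / len(frames)


def passing(features, threshold_min):
    """Names of features whose contact frequency exceeds the threshold."""
    return {name for name, frames in features.items()
            if frequency(frames) > threshold_min}


# Per trajectory: reduce each trajectory independently.
per_traj = {t: passing(f, THRESHOLD_MIN) for t, f in trajectories.items()}

# Intersection: must pass in ALL trajectories. Union: must pass in ANY trajectory.
intersection = set.intersection(*per_traj.values())
union = set.union(*per_traj.values())

# Pooled: concatenate the frames of all trajectories first, then reduce once.
pooled_frames = {}
for features in trajectories.values():
    for name, frames in features.items():
        pooled_frames.setdefault(name, []).extend(frames)
pooled = passing(pooled_frames, THRESHOLD_MIN)

print(sorted(intersection))  # ['c1']
print(sorted(union))         # ['c1', 'c2', 'c3']
print(sorted(pooled))        # ['c1', 'c2']
```

In this toy data c3 passes in traj_a alone (0.5) but fails once the frames are pooled (2 of 8 frames), which is why the union and pooled modes can give different results.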

Practical Examples with Biological Context

GPCR Activation Contact (Ionic Lock)

  • Selects a known constraining interaction in GPCRs:

  • 3x50: DRY motif arginine (R3.50)

  • 6x30: conserved glutamate (E6.30) on TM6

pipeline.feature_selector.add.contacts("gpcr", "consensus 3x50 and consensus 6x30")

G-Protein Binding Interface

  • Combines two structural elements for GPCR-G-protein coupling:

  • G.H5.*: Complete helix 5 of G-alpha protein (major contact surface)

  • 8x*: Helix 8 of receptor (intracellular C-terminus)

  • Interface contacts important for G-protein activation

pipeline.feature_selector.add.contacts("interface", "consensus G.H5.* and consensus 8x*")

Binding Pocket Aromatic Cage

  • Targets aromatic residues in known binding region:

  • Aromatic amino acids (PHE, TRP, TYR) form π-stacking interactions

  • Residues 100-150: Binding pocket from crystal structure

  • Creates feature set for ligand-binding site characterization

pipeline.feature_selector.add.contacts("pocket", "res PHE TRP TYR and resid 100-150")

Hydrophobic Core Interactions

  • Selects non-polar residues forming protein core:

  • ALA, VAL, LEU, ILE, PHE: Hydrophobic amino acids

  • Monitors core packing stability and folding integrity

pipeline.feature_selector.add.contacts("core", "res ALA VAL LEU ILE PHE")

TM Helix Interface Including Loop (with ‘all’ prefix)

  • Captures complete structural region:

  • Without ‘all’: Only labeled TM7/TM8 residues

  • With ‘all’: Includes intracellular loop between helices

  • Complete interface for conformational change analysis

pipeline.feature_selector.add.contacts("tm7_loop_tm8", "all 7x-8x")