mdxplain Overview
mdxplain creates a complete analytical loop from raw simulation data to explainable observations. The framework transforms large-scale trajectory data into interpretable insights through feature calculations, dimensionality reduction, clustering and comparisons.
Data Input
The pipeline begins with large datasets comprising 1 to 100+ MD simulations, each containing from 100 to over a million frames. A tagging system classifies simulations by condition, mutation, or experimental variant. Both the raw data and the associated metadata (tags and labels) are imported.
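To illustrate the idea of tagged simulations, here is a minimal sketch in plain Python. It is not the mdxplain API: `TrajectoryRecord`, `select_by_tags`, and the example names and tags are hypothetical and only show how tags can drive the selection of simulations.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryRecord:
    """Hypothetical container: one simulation plus its metadata."""
    name: str
    n_frames: int
    tags: frozenset = frozenset()


def select_by_tags(records, required):
    """Return the records that carry every tag in `required`."""
    required = set(required)
    return [r for r in records if required <= r.tags]


# Toy metadata, invented for illustration only.
records = [
    TrajectoryRecord("wt_run1", 10_000, frozenset({"wildtype", "apo"})),
    TrajectoryRecord("wt_run2", 10_000, frozenset({"wildtype", "ligand"})),
    TrajectoryRecord("mut_run1", 25_000, frozenset({"mutant", "apo"})),
]

apo = select_by_tags(records, ["apo"])
print([r.name for r in apo])  # ['wt_run1', 'mut_run1']
```

Keeping tags as a set per simulation makes combined queries (for example "mutant" and "apo") a simple subset test.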
mdxplain Tools
1. Feature Calculation
Central to mdxplain is the feature extraction engine, which computes a multitude of descriptive features from the simulation trajectories:
Distances
Contacts
Torsions
DSSP secondary structure
Solvent accessible surface area (SASA)
Atomic coordinates
Through these features, high-dimensional data is transformed for efficient computational analysis.
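As a rough illustration of the distance and contact features listed above, the following NumPy sketch turns per-frame coordinates into one feature row per frame. The function names, the random toy coordinates, and the 4.5 cutoff are assumptions made for this example; they are not taken from mdxplain.

```python
import numpy as np


def pairwise_distance_features(coords):
    """coords: (n_frames, n_atoms, 3) -> (n_frames, n_pairs) distances."""
    n_atoms = coords.shape[1]
    i, j = np.triu_indices(n_atoms, k=1)      # unique atom pairs with i < j
    diff = coords[:, i, :] - coords[:, j, :]  # (n_frames, n_pairs, 3)
    return np.linalg.norm(diff, axis=-1)


def contact_features(distances, cutoff=4.5):
    """Binary contacts: 1 where a pair is closer than `cutoff` (assumed value)."""
    return (distances < cutoff).astype(np.int8)


rng = np.random.default_rng(0)
coords = rng.normal(scale=5.0, size=(100, 6, 3))  # toy data, not a real trajectory
distances = pairwise_distance_features(coords)
contacts = contact_features(distances)
print(distances.shape, contacts.shape)  # (100, 15) (100, 15)
```

Each frame becomes a fixed-length vector (15 pair distances for 6 atoms), which is the shape later analysis steps operate on.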
2.1 Dimensionality Reduction and Clustering
The extracted features undergo dimensionality reduction to identify the most informative patterns while reducing computational complexity. mdxplain offers several decomposition methods for the reduction and several clustering methods for grouping the reduced data.
Decomposition methods:
PCA (Principal Component Analysis)
Kernel PCA
Contact Kernel PCA
Diffusion Maps
Clustering methods:
DPA (Density Peak Advanced)
DBSCAN (Density-Based Spatial Clustering)
HDBSCAN (Hierarchical DBSCAN)
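To make the decomposition step concrete, here is a minimal PCA sketch built on NumPy's singular value decomposition. It is a generic textbook implementation on synthetic data, not the mdxplain implementation; the `pca` helper and the toy matrix are assumptions. Clustering is left out because it would normally rely on a dedicated library.

```python
import numpy as np


def pca(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(1)
# Toy feature matrix: 500 frames, 10 features, one dominant direction plus noise.
latent = rng.normal(size=(500, 1))
weights = np.linspace(1.0, 2.0, 10).reshape(1, 10)
X = latent @ weights + 0.1 * rng.normal(size=(500, 10))

projection, explained_ratio = pca(X, n_components=2)
print(projection.shape)  # (500, 2)
```

The explained variance ratio indicates how much of the original variation each retained component carries, which helps decide how many components to keep.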
2.2 Feature Selection
The system identifies the most relevant features that capture essential molecular behavior. Different metrics (e.g. variance, range, transitions) can be applied to help prioritize features across different simulations.
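The metrics named above can be sketched per feature column as follows. The exact definitions used by mdxplain are not given here, so these are assumptions: in particular, "transition" is interpreted as the number of times a feature crosses its own mean between consecutive frames.

```python
import numpy as np


def variance_score(X):
    """Variance of each feature over all frames."""
    return X.var(axis=0)


def range_score(X):
    """Difference between the largest and smallest value of each feature."""
    return X.max(axis=0) - X.min(axis=0)


def transition_score(X):
    """Assumed definition: count of mean crossings between consecutive frames."""
    above = X > X.mean(axis=0)
    return np.sum(above[1:] != above[:-1], axis=0)


def top_features(scores, k):
    """Indices of the k highest-scoring features, best first."""
    return np.argsort(scores)[::-1][:k]


# Toy matrix: 4 frames (rows) x 3 features (columns).
X = np.array([
    [0.0, 1.0, 5.0],
    [0.0, 9.0, 5.1],
    [0.0, 1.0, 5.0],
    [0.0, 9.0, 5.1],
])
print(top_features(variance_score(X), 2))  # [1 2]
print(transition_score(X))                 # [0 3 3]
```

Note that the two non-constant features tie on transitions but differ strongly in variance and range, which is why applying more than one metric can be useful.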
3. Data Selection and Matrix Construction
Users can select specific subsets of the data based on frames, clusters, or tags to narrow down the analysis.
The target data from data selection or feature selection is organized into a final data matrix, in which rows represent frames and columns represent features.
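A minimal sketch of this selection step, assuming per-frame tags and cluster labels are stored as arrays aligned with the feature rows. The array layout, the `build_data_matrix` helper, and the toy values are hypothetical and do not reflect mdxplain's internal data structures.

```python
import numpy as np

rng = np.random.default_rng(2)
n_frames, n_features = 12, 5
features = rng.normal(size=(n_frames, n_features))  # toy feature values

# Hypothetical per-frame metadata, aligned with the rows of `features`.
frame_tags = np.array(["wildtype"] * 6 + ["mutant"] * 6)
cluster_labels = np.array([0, 0, 1, 1, 2, 2, 0, 1, 1, 2, 2, 2])


def build_data_matrix(features, frame_mask, feature_idx):
    """Rows = selected frames, columns = selected features."""
    return features[frame_mask][:, feature_idx]


# Data selection: mutant frames that fall into cluster 2.
frame_mask = (frame_tags == "mutant") & (cluster_labels == 2)
# Feature selection: keep two columns, e.g. the output of a ranking step.
selected_features = [0, 3]

data_matrix = build_data_matrix(features, frame_mask, selected_features)
print(data_matrix.shape)  # (3, 2)
```

Combining boolean masks keeps frame selection by tag and by cluster independent, so either criterion can be used alone.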
4. Comparison
mdxplain supports the systematic comparison of different datasets, such as mutant vs. wild-type proteins.
5. Explainability Through Feature Importance
At the core of mdxplain is its use of feature importance analysis to identify which combinations of molecular features serve as a system-specific “fingerprint”. Decision-tree-like visualizations highlight the features that best separate systems, reframing the question from “what happened?” to “why did it happen?”.
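The idea of ranking features by how well they separate two systems can be sketched with single-threshold splits scored by Gini impurity, which is the building block of a decision tree. This is a simplified stand-in written in NumPy, not mdxplain's feature importance method; the function names and the synthetic data (in which feature 2 is constructed to separate the two groups) are assumptions.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p**2)


def best_split_gain(x, y):
    """Largest impurity decrease from one threshold split on feature x."""
    parent = gini(y)
    best_gain, best_threshold = 0.0, None
    values = np.unique(x)
    for threshold in (values[:-1] + values[1:]) / 2:  # midpoints of sorted values
        left, right = y[x <= threshold], y[x > threshold]
        child = (left.size * gini(left) + right.size * gini(right)) / y.size
        if parent - child > best_gain:
            best_gain, best_threshold = parent - child, threshold
    return best_gain, best_threshold


def rank_features(X, y):
    """Feature indices sorted by best single-split gain, best first."""
    gains = np.array([best_split_gain(X[:, j], y)[0] for j in range(X.shape[1])])
    return np.argsort(gains)[::-1], gains


rng = np.random.default_rng(3)
y = np.array([0] * 100 + [1] * 100)      # 0 = wild type, 1 = mutant (toy labels)
X = rng.normal(size=(200, 5))            # uninformative noise features
X[:, 2] = rng.uniform(size=200) + 2 * y  # feature 2 separates the two systems

order, gains = rank_features(X, y)
print(order[0])  # 2
```

The threshold returned by `best_split_gain` is what a tree visualization would show at a node, for example "feature 2 <= threshold" separating the two systems.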
Data Output
mdxplain provides comprehensive output options:
Analysis Metrics
Data Exports
Visualizations
Energy Landscapes
3D Molecular Visualization
Cluster Dynamics
Decision Trees