mdxplain Overview

mdxplain overview diagram

mdxplain creates a complete analytical loop from raw simulation data to explainable observations. The framework transforms large-scale trajectory data into interpretable insights through feature calculations, dimensionality reduction, clustering and comparisons.

Data Input

The pipeline begins with large datasets comprising 1 to 100+ MD simulations, each containing from 100 to over a million frames. The tagging system allows simulations to be classified by condition, mutation, or experimental variant. Both the raw data and the associated metadata (tags and labels) are imported.
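
As a rough illustration of how tags can drive later selections, here is a minimal sketch in plain Python. The `Trajectory` class and the `select_by_tags` helper are hypothetical names invented for this example; they are not the mdxplain data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record for one simulation (not the mdxplain data model)."""
    name: str
    n_frames: int
    tags: set = field(default_factory=set)


def select_by_tags(trajectories, required_tags):
    """Return the trajectories that carry every tag in required_tags."""
    required = set(required_tags)
    return [t for t in trajectories if required <= t.tags]


# Three illustrative simulations with condition and variant tags.
trajectories = [
    Trajectory("run_01", 5000, {"wildtype", "300K"}),
    Trajectory("run_02", 5000, {"mutant", "300K"}),
    Trajectory("run_03", 8000, {"mutant", "310K"}),
]

mutants = select_by_tags(trajectories, ["mutant"])
print([t.name for t in mutants])
```

Storing tags as a set keeps the "has all of these tags" check a single subset comparison, which is one simple way to combine conditions such as variant and temperature.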

mdxplain Tools

1. Feature Calculation

Central to mdxplain is the feature extraction engine, which computes a multitude of descriptive features from the simulation trajectories:

  • Distances

  • Contacts

  • Torsions

  • DSSP secondary structure

  • Solvent accessible surface area (SASA)

  • Atomic coordinates

Through these features, high-dimensional trajectory data is transformed into a numerical representation suited to efficient computational analysis.
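
To make the idea of a per-frame feature concrete, the sketch below computes pairwise distance and contact features with NumPy. It assumes synthetic coordinates and an illustrative 0.45 nm contact cutoff; it does not use or reproduce the mdxplain feature engine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_atoms = 100, 5
# Synthetic coordinates with shape (frames, atoms, xyz); a stand-in for a real trajectory.
coords = rng.normal(size=(n_frames, n_atoms, 3))

# Indices of all unique atom pairs (i < j).
i_idx, j_idx = np.triu_indices(n_atoms, k=1)

# Distance features: one row per frame, one column per atom pair.
diff = coords[:, i_idx, :] - coords[:, j_idx, :]
distances = np.linalg.norm(diff, axis=-1)

# Contact features: binarized distances under an assumed cutoff.
cutoff = 0.45  # illustrative value, not an mdxplain default
contacts = (distances < cutoff).astype(np.int8)

print(distances.shape, contacts.shape)
```

Five atoms give ten unique pairs, so both arrays have one row per frame and ten columns. The number of pair features grows quadratically with the number of atoms or residues, which is why the later reduction and selection steps matter.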

2.1 Dimensionality Reduction and Clustering

The extracted features undergo dimensionality reduction to identify the most informative patterns while reducing computational complexity. mdxplain offers several decomposition methods for the reduction and several clustering methods for grouping similar frames.

Decomposition methods:

  • PCA (Principal Component Analysis)

  • Kernel PCA

  • Contact Kernel PCA

  • Diffusion Maps

Clustering methods:

  • DPA (Density Peak Advanced)

  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  • HDBSCAN (Hierarchical DBSCAN)
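
The decomposition step can be illustrated with plain PCA, the simplest of the methods listed above. The following is a minimal NumPy sketch based on the singular value decomposition of a synthetic feature matrix; the `pca` function is written for this example and is not the mdxplain implementation.

```python
import numpy as np


def pca(X, n_components):
    """Project X (frames x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return projected, ratio[:n_components]


rng = np.random.default_rng(1)
# Synthetic feature matrix: 200 frames, 6 features, most variance in feature 0.
X = rng.normal(size=(200, 6))
X[:, 0] *= 10.0

projected, ratio = pca(X, n_components=2)
print(projected.shape, ratio)
```

The projected coordinates are what a clustering method would then operate on. Kernel PCA and diffusion maps replace the linear projection with nonlinear embeddings but serve the same role in the pipeline.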

2.2 Feature Selection

The system identifies the most relevant features that capture essential molecular behavior. Different metrics (e.g. variance, range, transition) can be applied to help prioritize features across different simulations.
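
As one example of such a metric, the sketch below ranks features by their variance across frames and keeps the top k. The function name `top_k_by_variance` and the feature names are hypothetical and chosen only for illustration; mdxplain's own selection interface may look different.

```python
import numpy as np


def top_k_by_variance(X, feature_names, k):
    """Return the names of the k columns of X with the largest variance."""
    variances = X.var(axis=0)
    order = np.argsort(variances)[::-1][:k]
    return [feature_names[i] for i in order]


rng = np.random.default_rng(2)
feature_names = ["dist_1_2", "dist_1_3", "dist_2_3", "torsion_phi_2"]
# Per-column standard deviations chosen so the ranking is unambiguous.
X = rng.normal(size=(500, 4)) * np.array([0.1, 2.0, 0.5, 5.0])

selected = top_k_by_variance(X, feature_names, k=2)
print(selected)
```

A range-based metric would follow the same pattern, with `X.max(axis=0) - X.min(axis=0)` in place of the variance.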

3. Data Selection and Matrix Construction

Users can select specific subsets of the data based on frames, clusters, or tags to narrow the analysis further.

The target data from data selection and feature selection is organized into a final data matrix, in which rows represent frames and columns represent features.
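
The row and column selection described above can be sketched with NumPy boolean masks. All arrays, tags, and feature names below are synthetic and illustrative; this shows the shape of the operation, not the mdxplain interface.

```python
import numpy as np

rng = np.random.default_rng(3)
n_frames = 12
feature_names = np.array(["dist_1_2", "dist_1_3", "contact_2_5", "sasa_res_7"])
features = rng.normal(size=(n_frames, feature_names.size))

# Per-frame metadata (illustrative): a tag and a cluster label for each frame.
frame_tags = np.array(["wildtype"] * 6 + ["mutant"] * 6)
cluster_labels = np.array([0, 0, 1, 1, 2, 2, 0, 1, 1, 2, 2, 2])

# Row selection: mutant frames that belong to cluster 1 or 2.
row_mask = (frame_tags == "mutant") & np.isin(cluster_labels, [1, 2])

# Column selection: features kept by a previous feature-selection step.
col_mask = np.isin(feature_names, ["dist_1_3", "sasa_res_7"])

# Final data matrix: rows are the selected frames, columns the selected features.
data_matrix = features[row_mask][:, col_mask]
print(data_matrix.shape)
```

Keeping the masks separate from the feature array means the same full matrix can be reused for several different selections without recomputing any features.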

4. Comparison

mdxplain supports the systematic comparison of different datasets, such as mutated vs. wild-type proteins.

5. Explainability Through Feature Importance

At the core of mdxplain is its use of feature importance analysis to identify which combination of molecular features serves as a system-specific “fingerprint”. Decision-tree-like visualizations highlight the features that best separate systems, reframing the question from “what happened?” to “why did it happen?”.
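
The idea of finding the features that best separate two systems can be sketched as a single decision-tree split, often called a decision stump. The NumPy code below searches for the feature and threshold with the lowest weighted Gini impurity. The data is synthetic and the `gini` and `best_split` helpers are written for this example; it is a simplified stand-in, not the method mdxplain uses internally.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a binary label array (0.0 for an empty array)."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)


def best_split(X, y):
    """Find the (feature index, threshold) pair with the lowest weighted Gini impurity."""
    best = (None, None, np.inf)
    n = y.size
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive observed values.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            score = (left.size * gini(left) + right.size * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


rng = np.random.default_rng(4)
# Two synthetic systems (0 = wild type, 1 = mutant), 50 frames each, 3 features.
X = rng.normal(size=(100, 3))
y = np.repeat([0, 1], 50)
X[y == 1, 2] += 10.0  # only feature 2 differs between the systems

feature, threshold, impurity = best_split(X, y)
print(feature, float(threshold), impurity)
```

Here the search picks feature 2, the only one that was shifted between the two systems. A full decision tree repeats this search recursively on each side of the split, which is how combinations of features emerge as a fingerprint.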

Data Output

mdxplain provides comprehensive output options:

  • Analysis Metrics

  • Data Exports

  • Visualizations

    • Energy Landscapes

    • 3D Molecular Visualization

    • Cluster Dynamics

  • Decision Trees