Feature Importance Manager

Feature importance manager for ML-based feature analysis.

This module provides the FeatureImportanceManager class that manages feature importance analysis using various ML algorithms. It follows the same pattern as DecompositionManager, working with analyzer_types and creating FeatureImportanceData objects.

class mdxplain.feature_importance.manager.feature_importance_manager.FeatureImportanceManager(use_memmap: bool = False, chunk_size: int = 2000, cache_dir: str = './cache')

Manager for creating and managing feature importance analyses.

This class provides methods to run feature importance analysis on comparisons created by ComparisonManager. It uses various ML algorithms (analyzer_types) to determine which features are most important for distinguishing between different data groups. In practice, these analyzers are classifiers trained to separate the groups.
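As a rough illustration of what "important for distinguishing between groups" can mean, the sketch below ranks two made-up features by the Gini impurity decrease of their best single threshold split. It is a toy in plain Python, not the mdxplain implementation; the feature names, the data, and the helper functions are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split_gain(values: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (impurity decrease, threshold) of the best 'value <= threshold' split."""
    parent = gini(labels)
    n = len(labels)
    best_gain, best_thr = 0.0, float("nan")
    uniq = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(uniq, uniq[1:]):
        thr = (lo + hi) / 2.0
        left = [y for v, y in zip(values, labels) if v <= thr]
        right = [y for v, y in zip(values, labels) if v > thr]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_gain, best_thr


# Toy data: two groups (0 and 1) and two hypothetical features per frame.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "distance_A": [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],        # separates the groups cleanly
    "angle_B": [10.0, 50.0, 30.0, 20.0, 60.0, 40.0],     # mixes the groups
}

gains = {name: best_split_gain(vals, labels)[0] for name, vals in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)
```

A full decision tree accumulates such impurity decreases over many splits; the single-split version only shows the idea.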

The manager follows the same pattern as DecompositionManager:

  • Uses analyzer_type objects similar to decomposition_type

  • Creates FeatureImportanceData objects similar to DecompositionData

  • Integrates with pipeline via AutoInjectProxy
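The injection behaviour can be pictured with a small proxy that prepends a shared data object to every method call. This is a minimal sketch of the general pattern only: `AutoInject` and `ToyManager` are hypothetical stand-ins, and the real AutoInjectProxy may work differently.

```python
class AutoInject:
    """Hypothetical stand-in for an auto-injecting proxy."""

    def __init__(self, manager, pipeline_data):
        self._manager = manager
        self._pipeline_data = pipeline_data

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself.
        target = getattr(self._manager, name)
        if not callable(target):
            return target

        def bound(*args, **kwargs):
            # Inject the shared data object as the first positional argument.
            return target(self._pipeline_data, *args, **kwargs)

        return bound


class ToyManager:
    """Hypothetical manager whose methods take the data object first."""

    def add_analysis(self, pipeline_data, comparison_name, analysis_name):
        pipeline_data[analysis_name] = comparison_name

    def list_analyses(self, pipeline_data):
        return sorted(pipeline_data)


shared = {}  # stands in for PipelineData
proxy = AutoInject(ToyManager(), shared)
proxy.add_analysis("my_comparison", "tree_analysis")  # no data argument passed
names = proxy.list_analyses()
print(names)
```

Under this pattern, passing the data object explicitly through the proxy would shift every argument by one position, which is why the pipeline-mode calls omit it.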

Examples

Pipeline mode (automatic injection):

>>> pipeline = PipelineManager()
>>> from mdxplain.feature_importance import analyzer_types
>>> pipeline.feature_importance.add_analysis(
...     "my_comparison", analyzer_types.DecisionTree(max_depth=5), "tree_analysis"
... )

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.add_analysis(
...     pipeline_data, "my_comparison",
...     analyzer_types.DecisionTree(max_depth=5), "tree_analysis"
... )

__init__(use_memmap: bool = False, chunk_size: int = 2000, cache_dir: str = './cache') -> None

Initialize the feature importance manager.

Parameters

use_memmap : bool, default=False

Whether to use memory mapping for large datasets

chunk_size : int, default=2000

Processing chunk size for incremental computation

cache_dir : str, default="./cache"

Cache directory path

Returns

None

Initializes FeatureImportanceManager instance with specified configuration

add_analysis(pipeline_data: PipelineData, comparison_name: str, analyzer_type: AnalyzerTypeBase, analysis_name: str, force: bool = False) -> None

Add feature importance analysis for a comparison.

Runs feature importance analysis on all sub-comparisons within the specified comparison using the provided analyzer. Creates a single FeatureImportanceData object containing results for all sub-comparisons.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> from mdxplain.feature_importance import analyzer_types
>>> pipeline.feature_importance.add_analysis("folded_vs_unfolded", analyzer_types.DecisionTree(), "tree_analysis")  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.add_analysis(pipeline_data, "folded_vs_unfolded", analyzer_types.DecisionTree(), "tree_analysis")  # WITH pipeline_data parameter

Parameters

pipeline_data : PipelineData

Pipeline data object containing comparisons

comparison_name : str

Name of the comparison to analyze

analyzer_type : AnalyzerTypeBase

Analyzer instance (e.g., analyzer_types.DecisionTree(max_depth=5))

analysis_name : str

Name to store the analysis results

force : bool, default=False

Whether to overwrite existing analysis with same name

Returns

None

Creates FeatureImportanceData in pipeline_data

Raises

ValueError

If analysis already exists (and force=False), comparison not found, or analysis computation fails
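The interaction between an existing analysis name and the force flag can be sketched as a simple guard. The `store_analysis` helper and the dictionary store below are hypothetical; only the documented behaviour (ValueError unless force=True) is taken from the description above.

```python
def store_analysis(store: dict, name: str, result: object, force: bool = False) -> None:
    """Store a result under `name`, refusing to overwrite unless force=True."""
    if name in store and not force:
        raise ValueError(
            f"Analysis '{name}' already exists. Use force=True to overwrite."
        )
    store[name] = result


analyses = {}
store_analysis(analyses, "tree_analysis", {"max_depth": 5})

try:
    # Second call with the same name and force=False is rejected.
    store_analysis(analyses, "tree_analysis", {"max_depth": 10})
except ValueError as err:
    print(f"Rejected: {err}")

# With force=True the earlier result is replaced.
store_analysis(analyses, "tree_analysis", {"max_depth": 10}, force=True)
print(analyses["tree_analysis"])
```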

Examples

>>> from mdxplain.feature_importance import analyzer_types
>>> manager = FeatureImportanceManager()
>>> # Basic decision tree analysis
>>> manager.add_analysis(
...     pipeline_data, "folded_vs_unfolded",
...     analyzer_types.DecisionTree(max_depth=5, random_state=42),
...     "tree_analysis"
... )
>>> # Balanced tree for imbalanced data
>>> manager.add_analysis(
...     pipeline_data, "conformations",
...     analyzer_types.DecisionTree(class_weight="balanced"),
...     "balanced_tree", force=True
... )

get_analysis_info(pipeline_data: PipelineData, analysis_name: str) -> Dict[str, Any]

Get information about a feature importance analysis.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.get_analysis_info("tree_analysis")  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.get_analysis_info(pipeline_data, "tree_analysis")  # pipeline_data required

Parameters

pipeline_data : PipelineData

Pipeline data object

analysis_name : str

Name of the analysis

Returns

Dict[str, Any]

Dictionary with analysis information

Examples

>>> info = manager.get_analysis_info(pipeline_data, "tree_analysis")
>>> print(f"Analyzer: {info['analyzer_type']}")
>>> print(f"Comparisons: {info['n_comparisons']}")

get_top_features(pipeline_data: PipelineData, analysis_name: str, comparison_identifier: str | None = None, n: int = 10) -> List[Dict[str, Any]]

Get top N most important features from analysis.

Parameters

pipeline_data : PipelineData

Pipeline data object

analysis_name : str

Name of the analysis

comparison_identifier : str, optional

Specific sub-comparison to get features from. If None, returns average across all sub-comparisons.

n : int, default=10

Number of top features to return

Returns

List[Dict[str, Any]]

List of dictionaries with feature information
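The two selection modes (one sub-comparison versus the average over all of them) can be sketched as follows. The score layout and the "importance" key are assumptions for illustration; only the "feature_name" key appears in the examples of this reference.

```python
from typing import Dict, List, Optional


def top_features(
    scores: Dict[str, Dict[str, float]],
    comparison_identifier: Optional[str] = None,
    n: int = 10,
) -> List[Dict[str, object]]:
    """Rank features for one sub-comparison, or by their mean over all of them."""
    if comparison_identifier is not None:
        selected = scores[comparison_identifier]
    else:
        # Assumes every sub-comparison scores the same set of features.
        names = next(iter(scores.values())).keys()
        selected = {
            name: sum(sub[name] for sub in scores.values()) / len(scores)
            for name in names
        }
    ranked = sorted(selected.items(), key=lambda kv: kv[1], reverse=True)
    return [{"feature_name": k, "importance": v} for k, v in ranked[:n]]


# Hypothetical importance scores for two sub-comparisons.
scores = {
    "cluster_0_vs_rest": {"f1": 0.7, "f2": 0.2, "f3": 0.1},
    "cluster_1_vs_rest": {"f1": 0.1, "f2": 0.3, "f3": 0.6},
}

averaged = top_features(scores, n=2)
specific = top_features(scores, "cluster_1_vs_rest", n=1)
print([f["feature_name"] for f in averaged])
print([f["feature_name"] for f in specific])
```

Note that a feature dominating a single sub-comparison can rank lower in the average than it does in that sub-comparison alone.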

Examples

>>> # Get top features averaged across all comparisons
>>> top_features = manager.get_top_features(
...     pipeline_data, "tree_analysis", n=5
... )
>>> # Get top features for specific comparison
>>> top_features = manager.get_top_features(
...     pipeline_data, "tree_analysis", "folded_vs_rest", n=5
... )

get_all_top_features(pipeline_data: PipelineData, analysis_name: str, n: int = 10) -> Dict[str, List[Dict[str, Any]]]

Get top features for all sub-comparisons in an analysis.

Returns a dictionary where keys are comparison identifiers and values are lists of top features for each comparison.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> all_features = pipeline.feature_importance.get_all_top_features("dt_analysis", n=5)

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> all_features = manager.get_all_top_features(pipeline_data, "dt_analysis", n=5)

Parameters

pipeline_data : PipelineData

Pipeline data object

analysis_name : str

Name of the analysis

n : int, default=10

Number of top features per comparison

Returns

Dict[str, List[Dict[str, Any]]]

Dictionary mapping comparison names to their top features

Examples

>>> all_features = manager.get_all_top_features(
...     pipeline_data, "dt_analysis", n=5
... )
>>> # Access specific comparison
>>> cluster_0 = all_features["cluster_0_vs_rest"]
>>> print(f"Top feature: {cluster_0[0]['feature_name']}")

print_top_n_features(pipeline_data: PipelineData, analysis_name: str, n: int = 3) -> None

Print top N features for all comparisons in analysis.

Uses get_all_top_features() internally and formats output for console display. If a trained Decision Tree model is available for a comparison, the printed label is extended with a representative split criterion for that feature. This keeps the output focused on the actual tree rule instead of generic metadata labels.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.print_top_n_features("my_analysis", n=3)

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.print_top_n_features(pipeline_data, "my_analysis", n=3)

Parameters

pipeline_data : PipelineData

Pipeline data object

analysis_name : str

Name of the feature importance analysis

n : int, default=3

Number of top features to print per comparison

Returns

None

Prints to console

Examples

>>> pipeline.feature_importance.print_top_n_features(
...     "feature_importance", n=5
... )
Top 5 features for cluster_0_vs_rest:
  1. contacts: LEU13-ARG31: Non-Contact (0.456)
  2. torsions: GLY42_phi: <= 55.20 degrees (0.234)
  ...

Notes

  • Uses representative weighted split thresholds from the stored Decision Tree model when available

  • For binary discrete features, prints the left branch label (e.g. “Non-Contact”)

  • Falls back to feature_type: feature_name if no tree rule is available for a feature
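One way to read "representative weighted split threshold" is as a sample-weighted mean of all thresholds at which the tree splits on a feature. The sketch below follows that reading; the weighting scheme, the helper names, and the input layout are assumptions, not the documented algorithm.

```python
from typing import List, Tuple


def representative_threshold(splits: List[Tuple[float, int]]) -> float:
    """Sample-weighted mean of (threshold, n_samples) pairs for one feature."""
    total = sum(n for _, n in splits)
    return sum(thr * n for thr, n in splits) / total


def format_label(feature_type: str, feature_name: str,
                 splits: List[Tuple[float, int]], unit: str = "") -> str:
    """Build a printable label, falling back to 'type: name' without a tree rule."""
    if not splits:
        return f"{feature_type}: {feature_name}"
    thr = representative_threshold(splits)
    suffix = f" {unit}" if unit else ""
    return f"{feature_type}: {feature_name}: <= {thr:.2f}{suffix}"


# Hypothetical splits on one torsion: (threshold, samples reaching the node).
with_rule = format_label("torsions", "GLY42_phi", [(50.0, 300), (60.0, 100)], "degrees")
without_rule = format_label("contacts", "LEU13-ARG31", [])
print(with_rule)
print(without_rule)
```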

list_analyses(pipeline_data: PipelineData) -> List[str]

List all available feature importance analyses.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.list_analyses()  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.list_analyses(pipeline_data)  # pipeline_data required

Parameters

pipeline_data : PipelineData

Pipeline data object

Returns

List[str]

List of analysis names

Examples

>>> analyses = manager.list_analyses(pipeline_data)
>>> print(f"Available analyses: {analyses}")

remove_analysis(pipeline_data: PipelineData, analysis_name: str) -> None

Remove a feature importance analysis.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.remove_analysis("old_analysis")  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.remove_analysis(pipeline_data, "old_analysis")  # pipeline_data required

Parameters

pipeline_data : PipelineData

Pipeline data object

analysis_name : str

Name of the analysis to remove

Returns

None

Removes the analysis from pipeline_data

Examples

>>> manager.remove_analysis(pipeline_data, "old_analysis")

save(pipeline_data: PipelineData, save_path: str) -> None

Save all feature importance data to a single file.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.save('feature_importance.npy')  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.save(pipeline_data, 'feature_importance.npy')  # pipeline_data required

Parameters

pipeline_data : PipelineData

Pipeline data container with feature importance data

save_path : str

Path of the single file in which all feature importance data is saved

Returns

None

Saves all feature importance data to the specified file

Examples

>>> manager.save(pipeline_data, 'feature_importance.npy')

load(pipeline_data: PipelineData, load_path: str) -> None

Load all feature importance data from a single file.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.load('feature_importance.npy')  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.load(pipeline_data, 'feature_importance.npy')  # pipeline_data required

Parameters

pipeline_data : PipelineData

Pipeline data container to load feature importance data into

load_path : str

Path to saved feature importance data file

Returns

None

Loads all feature importance data from the specified file

Examples

>>> manager.load(pipeline_data, 'feature_importance.npy')

print_info(pipeline_data: PipelineData) -> None

Print feature importance data information.

Warning

When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.

Pipeline mode:

>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.print_info()  # NO pipeline_data parameter

Standalone mode:

>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.print_info(pipeline_data)  # pipeline_data required

Parameters

pipeline_data : PipelineData

Pipeline data container with feature importance data

Returns

None

Prints feature importance data information to console

Examples

>>> pipeline_data = PipelineData()
>>> feature_importance_manager = FeatureImportanceManager()
>>> feature_importance_manager.print_info(pipeline_data)

property add

Service for adding feature importance analyses with simplified syntax.

Provides an intuitive interface for adding feature importance analyses without requiring explicit analyzer type instantiation or imports.

Returns

FeatureImportanceAddService

Service instance for adding feature importance analyses with combined parameters

Examples

>>> # Add different analyzer types
>>> pipeline.feature_importance.add.decision_tree("my_comparison", "tree_analysis", max_depth=5)
>>> pipeline.feature_importance.add.decision_tree(
...     "folded_vs_unfolded",
...     "deep_tree",
...     max_depth=10,
...     criterion="entropy",
...     random_state=42
... )

Notes

Pipeline data is automatically injected by AutoInjectProxy. All analyzer type parameters are combined with add_analysis parameters.
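A service that "combines" analyzer parameters with add_analysis parameters could be built by forwarding unrecognized keyword arguments to the analyzer constructor. The sketch below shows that idea with hypothetical stand-ins (`DecisionTreeSpec`, `AddService`); it does not reproduce FeatureImportanceAddService.

```python
class DecisionTreeSpec:
    """Hypothetical stand-in for analyzer_types.DecisionTree."""

    def __init__(self, **params):
        self.params = params


class AddService:
    """Hypothetical add-service wrapping an add_analysis callable."""

    def __init__(self, add_analysis):
        self._add_analysis = add_analysis

    def decision_tree(self, comparison_name, analysis_name, force=False, **analyzer_params):
        # Remaining keyword arguments configure the analyzer itself.
        analyzer = DecisionTreeSpec(**analyzer_params)
        return self._add_analysis(comparison_name, analyzer, analysis_name, force=force)


calls = []


def fake_add_analysis(comparison_name, analyzer, analysis_name, force=False):
    # Records the call instead of running an analysis.
    calls.append((comparison_name, analyzer.params, analysis_name, force))


service = AddService(fake_add_analysis)
service.decision_tree("my_comparison", "tree_analysis", max_depth=5, criterion="entropy")
print(calls)
```

The caller never imports or instantiates an analyzer type; the method name selects it and the keyword arguments configure it.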

get_representative_frames(pipeline_data: PipelineData, analysis_name: str, n_top: int = 10) -> Dict[str, List[int]]

Find representative frames for each sub-comparison.

Finds frames that most strongly exhibit the top important features identified by the decision tree. Uses tree split rules to determine optimal feature values and scores frames based on how well they match these criteria.

Parameters

pipeline_data : PipelineData

Pipeline data object

analysis_name : str

Name of feature importance analysis

n_top : int, default=10

Number of top features to consider

Returns

Dict[str, List[int]]

Mapping from sub_comparison_name to [traj_idx, frame_idx]

Examples

>>> representatives = manager.get_representative_frames(
...     pipeline_data, "dt_analysis", n_top=10
... )
>>> print(representatives)
{'cluster_0_vs_rest': [1, 2341], 'cluster_1_vs_rest': [3, 156]}

Notes

  • Uses Decision Tree split rules to find characteristic frames

  • Frames maximize expression of top important features

  • Handles periodic features (torsions) with circular distance

  • For multiclass mode, uses centroids instead
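The circular-distance handling for periodic features mentioned in the notes can be sketched as below. The frame layout, the target values, and the unnormalized sum of distances are assumptions for illustration; the actual scoring derived from tree split rules may differ.

```python
from typing import Dict, List, Set, Tuple

Frame = Tuple[int, int, Dict[str, float]]  # (traj_idx, frame_idx, feature values)


def circular_distance(a_deg: float, b_deg: float, period: float = 360.0) -> float:
    """Shortest distance between two angles on a circle of the given period."""
    diff = abs(a_deg - b_deg) % period
    return min(diff, period - diff)


def best_frame(frames: List[Frame], targets: Dict[str, float],
               periodic: Set[str]) -> List[int]:
    """Return [traj_idx, frame_idx] of the frame closest to the target values."""

    def score(values: Dict[str, float]) -> float:
        total = 0.0
        for name, target in targets.items():
            if name in periodic:
                total += circular_distance(values[name], target)
            else:
                # Plain distance; mixed units would need normalization in practice.
                total += abs(values[name] - target)
        return total

    traj_idx, frame_idx, _ = min(frames, key=lambda f: score(f[2]))
    return [traj_idx, frame_idx]


# Hypothetical frames with a single torsion feature in degrees.
frames = [
    (0, 10, {"phi": 120.0}),
    (1, 2341, {"phi": -175.0}),
    (3, 156, {"phi": 0.0}),
]
representative = best_frame(frames, {"phi": 179.0}, periodic={"phi"})
print(representative)
```

With a linear distance, -175 degrees would look far from 179 degrees; the circular distance treats them as 6 degrees apart, so that frame is selected.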

Raises

ValueError

If analysis not found or not Decision Tree based