Feature Importance Manager
Feature importance manager for ML-based feature analysis.
This module provides the FeatureImportanceManager class that manages feature importance analysis using various ML algorithms. It follows the same pattern as DecompositionManager, working with analyzer_types and creating FeatureImportanceData objects.
- class mdxplain.feature_importance.manager.feature_importance_manager.FeatureImportanceManager(use_memmap: bool = False, chunk_size: int = 2000, cache_dir: str = './cache')
Manager for creating and managing feature importance analyses.
This class provides methods to run feature importance analysis on comparisons created by ComparisonManager. It uses various ML algorithms (analyzer_types) to determine which features are most important for distinguishing between different data groups. In practice, the analyzers are classifiers trained to separate those groups.
The manager follows the same pattern as DecompositionManager:
- Uses analyzer_type objects similar to decomposition_type
- Creates FeatureImportanceData objects similar to DecompositionData
- Integrates with pipeline via AutoInjectProxy
Examples
Pipeline mode (automatic injection):
>>> pipeline = PipelineManager()
>>> from mdxplain.feature_importance import analyzer_types
>>> pipeline.feature_importance.add_analysis(
...     "my_comparison", analyzer_types.DecisionTree(max_depth=5), "tree_analysis"
... )
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.add_analysis(
...     pipeline_data, "my_comparison",
...     analyzer_types.DecisionTree(max_depth=5), "tree_analysis"
... )
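The difference between the two modes comes down to who supplies the first argument. The following is a minimal sketch of that idea, not the library's implementation: `ToyAutoInjectProxy` and `ToyManager` are hypothetical stand-ins, and a plain dict plays the role of PipelineData.

```python
import functools


class ToyManager:
    """Hypothetical manager whose methods take pipeline_data first (standalone style)."""

    def list_analyses(self, pipeline_data):
        return sorted(pipeline_data["analyses"])


class ToyAutoInjectProxy:
    """Hypothetical proxy that prepends pipeline_data to every method call."""

    def __init__(self, manager, pipeline_data):
        self._manager = manager
        self._pipeline_data = pipeline_data

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself.
        attr = getattr(self._manager, name)
        if not callable(attr):
            return attr
        # Bind pipeline_data as the first positional argument.
        return functools.partial(attr, self._pipeline_data)


# A plain dict stands in for PipelineData in this sketch.
pipeline_data = {"analyses": {"tree_analysis": object()}}

standalone = ToyManager().list_analyses(pipeline_data)  # caller passes pipeline_data
proxied = ToyAutoInjectProxy(ToyManager(), pipeline_data).list_analyses()  # injected
```

Under this assumption, passing pipeline_data explicitly through the proxy would supply it twice, which is why the pipeline-mode calls omit it.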
- __init__(use_memmap: bool = False, chunk_size: int = 2000, cache_dir: str = './cache') None
Initialize the feature importance manager.
Parameters
- use_memmap : bool, default=False
Whether to use memory mapping for large datasets
- chunk_size : int, default=2000
Processing chunk size for incremental computation
- cache_dir : str, default="./cache"
Cache directory path
Returns
- None
Initializes FeatureImportanceManager instance with specified configuration
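As a rough illustration of what a chunk size means for incremental computation, the sketch below walks a dataset in fixed-size slices and accumulates a running mean. The helper names `iter_chunks` and `chunked_mean` are hypothetical and are not part of mdxplain; how the manager actually consumes chunk_size is not specified here.

```python
def iter_chunks(n_frames, chunk_size=2000):
    """Yield (start, stop) index pairs that cover range(n_frames)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_frames, chunk_size):
        yield start, min(start + chunk_size, n_frames)


def chunked_mean(values, chunk_size=2000):
    """Compute a mean while touching only one chunk of values at a time."""
    total, count = 0.0, 0
    for start, stop in iter_chunks(len(values), chunk_size):
        chunk = values[start:stop]
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else 0.0


# 5000 frames with chunk_size=2000 give two full chunks and one partial chunk.
chunks = list(iter_chunks(5000, chunk_size=2000))
```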
- add_analysis(pipeline_data: PipelineData, comparison_name: str, analyzer_type: AnalyzerTypeBase, analysis_name: str, force: bool = False) None
Add feature importance analysis for a comparison.
Runs feature importance analysis on all sub-comparisons within the specified comparison using the provided analyzer. Creates a single FeatureImportanceData object containing results for all sub-comparisons.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> from mdxplain.feature_importance import analyzer_types
>>> pipeline.feature_importance.add_analysis("folded_vs_unfolded", analyzer_types.DecisionTree(), "tree_analysis")  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.add_analysis(pipeline_data, "folded_vs_unfolded", analyzer_types.DecisionTree(), "tree_analysis")  # WITH pipeline_data parameter
Parameters
- pipeline_data : PipelineData
Pipeline data object containing comparisons
- comparison_name : str
Name of the comparison to analyze
- analyzer_type : AnalyzerTypeBase
Analyzer instance (e.g., analyzer_types.DecisionTree(max_depth=5))
- analysis_name : str
Name to store the analysis results
- force : bool, default=False
Whether to overwrite existing analysis with same name
Returns
- None
Creates FeatureImportanceData in pipeline_data
Raises
- ValueError
If analysis already exists (and force=False), comparison not found, or analysis computation fails
Examples
>>> from mdxplain.feature_importance import analyzer_types
>>> manager = FeatureImportanceManager()
>>> # Basic decision tree analysis
>>> manager.add_analysis(
...     pipeline_data, "folded_vs_unfolded",
...     analyzer_types.DecisionTree(max_depth=5, random_state=42),
...     "tree_analysis"
... )
>>> # Balanced tree for imbalanced data
>>> manager.add_analysis(
...     pipeline_data, "conformations",
...     analyzer_types.DecisionTree(class_weight="balanced"),
...     "balanced_tree", force=True
... )
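The interaction between an existing analysis name and the force flag can be sketched with a plain dictionary. This is an illustrative assumption about the behaviour described under Raises, not the library's code; `store_analysis` and the error message text are hypothetical.

```python
def store_analysis(analyses, name, result, force=False):
    """Store result under name; refuse to overwrite unless force is set."""
    if name in analyses and not force:
        raise ValueError(
            f"Analysis '{name}' already exists. Use force=True to overwrite."
        )
    analyses[name] = result


analyses = {}
store_analysis(analyses, "tree_analysis", {"max_depth": 5})

message = None
try:
    # Same name again without force: rejected.
    store_analysis(analyses, "tree_analysis", {"max_depth": 10})
except ValueError as err:
    message = str(err)

# Same name with force=True: the earlier result is replaced.
store_analysis(analyses, "tree_analysis", {"max_depth": 10}, force=True)
```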
- get_analysis_info(pipeline_data: PipelineData, analysis_name: str) Dict[str, Any]
Get information about a feature importance analysis.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.get_analysis_info("tree_analysis")  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.get_analysis_info(pipeline_data, "tree_analysis")  # pipeline_data required
Parameters
- pipeline_data : PipelineData
Pipeline data object
- analysis_name : str
Name of the analysis
Returns
- Dict[str, Any]
Dictionary with analysis information
Examples
>>> info = manager.get_analysis_info(pipeline_data, "tree_analysis")
>>> print(f"Analyzer: {info['analyzer_type']}")
>>> print(f"Comparisons: {info['n_comparisons']}")
- get_top_features(pipeline_data: PipelineData, analysis_name: str, comparison_identifier: str | None = None, n: int = 10) List[Dict[str, Any]]
Get top N most important features from analysis.
Parameters
- pipeline_data : PipelineData
Pipeline data object
- analysis_name : str
Name of the analysis
- comparison_identifier : str, optional
Specific sub-comparison to get features from. If None, returns average across all sub-comparisons.
- n : int, default=10
Number of top features to return
Returns
- List[Dict[str, Any]]
List of dictionaries with feature information
Examples
>>> # Get top features averaged across all comparisons
>>> top_features = manager.get_top_features(
...     pipeline_data, "tree_analysis", n=5
... )
>>> # Get top features for specific comparison
>>> top_features = manager.get_top_features(
...     pipeline_data, "tree_analysis", "folded_vs_rest", n=5
... )
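The two call styles above differ only in how the scores are gathered before ranking. A minimal sketch, assuming importances are stored as one feature-to-score mapping per sub-comparison: the function `top_features`, the toy scores, and the "importance" key are hypothetical, while "feature_name" follows the key used elsewhere on this page.

```python
def top_features(importances_by_comparison, comparison_identifier=None, n=10):
    """Return the n highest-scoring features as a list of dicts."""
    if comparison_identifier is not None:
        # Scores of one specific sub-comparison.
        scores = dict(importances_by_comparison[comparison_identifier])
    else:
        # Average each feature's importance over all sub-comparisons.
        n_comparisons = len(importances_by_comparison)
        scores = {}
        for importances in importances_by_comparison.values():
            for feature, value in importances.items():
                scores[feature] = scores.get(feature, 0.0) + value / n_comparisons
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [{"feature_name": f, "importance": v} for f, v in ranked[:n]]


# Toy importances, invented for illustration only.
toy = {
    "folded_vs_rest": {"LEU13-ARG31": 0.6, "GLY42_phi": 0.3, "ALA5_psi": 0.1},
    "unfolded_vs_rest": {"LEU13-ARG31": 0.2, "GLY42_phi": 0.7, "ALA5_psi": 0.1},
}

averaged = top_features(toy, n=2)
specific = top_features(toy, "folded_vs_rest", n=1)
```

Note that a feature can lead the averaged ranking without leading any single sub-comparison's ranking, so the two views are worth checking side by side.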
- get_all_top_features(pipeline_data: PipelineData, analysis_name: str, n: int = 10) Dict[str, List[Dict[str, Any]]]
Get top features for all sub-comparisons in an analysis.
Returns a dictionary where keys are comparison identifiers and values are lists of top features for each comparison.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> all_features = pipeline.feature_importance.get_all_top_features("dt_analysis", n=5)
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> all_features = manager.get_all_top_features(pipeline_data, "dt_analysis", n=5)
Parameters
- pipeline_data : PipelineData
Pipeline data object
- analysis_name : str
Name of the analysis
- n : int, default=10
Number of top features per comparison
Returns
- Dict[str, List[Dict[str, Any]]]
Dictionary mapping comparison names to their top features
Examples
>>> all_features = manager.get_all_top_features(
...     pipeline_data, "dt_analysis", n=5
... )
>>> # Access specific comparison
>>> cluster_0 = all_features["cluster_0_vs_rest"]
>>> print(f"Top feature: {cluster_0[0]['feature_name']}")
- print_top_n_features(pipeline_data: PipelineData, analysis_name: str, n: int = 3) None
Print top N features for all comparisons in analysis.
Uses get_all_top_features() internally and formats output for console display. If a trained Decision Tree model is available for a comparison, the printed label is extended with a representative split criterion for that feature. This keeps the output focused on the actual tree rule instead of generic metadata labels.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.print_top_n_features("my_analysis", n=3)
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.print_top_n_features(pipeline_data, "my_analysis", n=3)
Parameters
- pipeline_data : PipelineData
Pipeline data object
- analysis_name : str
Name of the feature importance analysis
- n : int, default=3
Number of top features to print per comparison
Returns
- None
Prints to console
Examples
>>> pipeline.feature_importance.print_top_n_features(
...     "feature_importance", n=5
... )
Top 5 features for cluster_0_vs_rest:
1. contacts: LEU13-ARG31: Non-Contact (0.456)
2. torsions: GLY42_phi: <= 55.20 degrees (0.234)
...
Notes
- Uses representative weighted split thresholds from the stored Decision Tree model when available
- For binary discrete features, prints the left branch label (e.g. "Non-Contact")
- Falls back to feature_type: feature_name if no tree rule is available for a feature
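One plausible reading of the labelling rules in the Notes above is sketched below. It assumes that a "representative weighted" threshold is the sample-weighted mean of all thresholds at which the tree splits on a feature; the source does not confirm that weighting. The helpers `representative_threshold` and `format_label` and their parameters are hypothetical.

```python
def representative_threshold(splits):
    """splits: list of (threshold, n_samples) for nodes splitting on one feature."""
    total = sum(n for _, n in splits)
    if total == 0:
        return None
    # Assumed weighting: each split counts in proportion to its sample count.
    return sum(t * n for t, n in splits) / total


def format_label(feature_type, feature_name, splits, unit="", left_label=None):
    """Build a printable label, falling back to 'feature_type: feature_name'."""
    base = f"{feature_type}: {feature_name}"
    if not splits:
        return base  # no tree rule available for this feature
    if left_label is not None:
        return f"{base}: {left_label}"  # binary discrete feature
    threshold = representative_threshold(splits)
    suffix = f" {unit}" if unit else ""
    return f"{base}: <= {threshold:.2f}{suffix}"


torsion_label = format_label(
    "torsions", "GLY42_phi", [(50.0, 300), (60.4, 300)], unit="degrees"
)
contact_label = format_label(
    "contacts", "LEU13-ARG31", [(0.5, 600)], left_label="Non-Contact"
)
fallback_label = format_label("distances", "ALA5-GLY9", [])
```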
- list_analyses(pipeline_data: PipelineData) List[str]
List all available feature importance analyses.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.list_analyses()  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.list_analyses(pipeline_data)  # pipeline_data required
Parameters
- pipeline_data : PipelineData
Pipeline data object
Returns
- List[str]
List of analysis names
Examples
>>> analyses = manager.list_analyses(pipeline_data)
>>> print(f"Available analyses: {analyses}")
- remove_analysis(pipeline_data: PipelineData, analysis_name: str) None
Remove a feature importance analysis.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.remove_analysis("old_analysis")  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.remove_analysis(pipeline_data, "old_analysis")  # pipeline_data required
Parameters
- pipeline_data : PipelineData
Pipeline data object
- analysis_name : str
Name of the analysis to remove
Returns
- None
Removes the analysis from pipeline_data
Examples
>>> manager.remove_analysis(pipeline_data, "old_analysis")
- save(pipeline_data: PipelineData, save_path: str) None
Save all feature importance data to a single file.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.save('feature_importance.npy')  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.save(pipeline_data, 'feature_importance.npy')  # pipeline_data required
Parameters
- pipeline_data : PipelineData
Pipeline data container with feature importance data
- save_path : str
Path where to save all feature importance data in one file
Returns
- None
Saves all feature importance data to the specified file
Examples
>>> manager.save(pipeline_data, 'feature_importance.npy')
- load(pipeline_data: PipelineData, load_path: str) None
Load all feature importance data from a single file.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.load('feature_importance.npy')  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.load(pipeline_data, 'feature_importance.npy')  # pipeline_data required
Parameters
- pipeline_data : PipelineData
Pipeline data container to load feature importance data into
- load_path : str
Path to saved feature importance data file
Returns
- None
Loads all feature importance data from the specified file
Examples
>>> manager.load(pipeline_data, 'feature_importance.npy')
- print_info(pipeline_data: PipelineData) None
Print feature importance data information.
Warning
When using PipelineManager, do NOT provide the pipeline_data parameter. The PipelineManager automatically injects this parameter.
Pipeline mode:
>>> pipeline = PipelineManager()
>>> pipeline.feature_importance.print_info()  # NO pipeline_data parameter
Standalone mode:
>>> pipeline_data = PipelineData()
>>> manager = FeatureImportanceManager()
>>> manager.print_info(pipeline_data)  # pipeline_data required
Parameters
- pipeline_data : PipelineData
Pipeline data container with feature importance data
Returns
- None
Prints feature importance data information to console
Examples
>>> pipeline_data = PipelineData()
>>> feature_importance_manager = FeatureImportanceManager()
>>> feature_importance_manager.print_info(pipeline_data)
- property add
Service for adding feature importance analyses with simplified syntax.
Provides an intuitive interface for adding feature importance analyses without requiring explicit analyzer type instantiation or imports.
Returns
- FeatureImportanceAddService
Service instance for adding feature importance analyses with combined parameters
Examples
>>> # Add different analyzer types
>>> pipeline.feature_importance.add.decision_tree("my_comparison", "tree_analysis", max_depth=5)
>>> pipeline.feature_importance.add.decision_tree(
...     "folded_vs_unfolded",
...     "deep_tree",
...     max_depth=10,
...     criterion="entropy",
...     random_state=42
... )
Notes
Pipeline data is automatically injected by AutoInjectProxy. All analyzer type parameters are combined with add_analysis parameters.
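The phrase "combined parameters" can be pictured as a thin wrapper that splits one flat argument list into analyzer settings and add_analysis settings. The sketch below is an assumption about that design, not the actual FeatureImportanceAddService; `ToyDecisionTree`, `ToyAddService`, and `fake_add_analysis` are hypothetical.

```python
class ToyDecisionTree:
    """Hypothetical analyzer type that only records its parameters."""

    def __init__(self, **params):
        self.params = params


class ToyAddService:
    """Hypothetical service: builds the analyzer, then delegates to add_analysis."""

    def __init__(self, add_analysis):
        self._add_analysis = add_analysis

    def decision_tree(self, comparison_name, analysis_name, force=False,
                      **analyzer_params):
        # Remaining keyword arguments are treated as analyzer settings.
        analyzer = ToyDecisionTree(**analyzer_params)
        return self._add_analysis(
            comparison_name, analyzer, analysis_name, force=force
        )


calls = []


def fake_add_analysis(comparison_name, analyzer_type, analysis_name, force=False):
    """Stand-in for add_analysis with pipeline_data already injected."""
    calls.append((comparison_name, analyzer_type.params, analysis_name, force))


add = ToyAddService(fake_add_analysis)
add.decision_tree("my_comparison", "tree_analysis", max_depth=5)
```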
- get_representative_frames(pipeline_data: PipelineData, analysis_name: str, n_top: int = 10) Dict[str, List[int]]
Find representative frames for each sub-comparison.
Finds frames that most strongly exhibit the top important features identified by the decision tree. Uses tree split rules to determine optimal feature values and scores frames based on how well they match these criteria.
Parameters
- pipeline_data : PipelineData
Pipeline data object
- analysis_name : str
Name of feature importance analysis
- n_top : int, default=10
Number of top features to consider
Returns
- Dict[str, List[int]]
Mapping from sub_comparison_name to [traj_idx, frame_idx]
Examples
>>> representatives = manager.get_representative_frames(
...     pipeline_data, "dt_analysis", n_top=10
... )
>>> print(representatives)
{'cluster_0_vs_rest': [1, 2341], 'cluster_1_vs_rest': [3, 156]}
Notes
- Uses Decision Tree split rules to find characteristic frames
- Frames maximize expression of top important features
- Handles periodic features (torsions) with circular distance
- For multiclass mode, uses centroids instead
Raises
- ValueError
If analysis not found or not Decision Tree based
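The circular distance mentioned in the Notes matters because torsion angles wrap around: -175 and 179 degrees are 6 degrees apart, not 354. The sketch below shows that distance and a simple way to pick the best-matching frame. It assumes each top feature has a single target value derived from the tree rules and that frames are scored by summed distance to those targets; the actual scoring is not specified here. `circular_distance`, `best_frame`, and the toy frames are hypothetical.

```python
def circular_distance(a_deg, b_deg, period=360.0):
    """Shortest distance between two angles on a circle of the given period."""
    diff = abs(a_deg - b_deg) % period
    return min(diff, period - diff)


def best_frame(frames, targets, periodic=()):
    """frames: list of ((traj_idx, frame_idx), {feature: value}) pairs.

    Returns [traj_idx, frame_idx] of the frame closest to all target values.
    """
    def score(values):
        total = 0.0
        for feature, target in targets.items():
            if feature in periodic:
                total += circular_distance(values[feature], target)
            else:
                total += abs(values[feature] - target)
        return total

    key, _ = min(frames, key=lambda item: score(item[1]))
    return list(key)


# Toy frames, invented for illustration only.
frames = [
    ((0, 10), {"GLY42_phi": 170.0}),
    ((1, 2341), {"GLY42_phi": -175.0}),
    ((3, 156), {"GLY42_phi": 20.0}),
]
representative = best_frame(
    frames, {"GLY42_phi": 179.0}, periodic={"GLY42_phi"}
)
```

With a plain absolute difference, the frame at 170 degrees would win; the circular distance correctly prefers the frame at -175 degrees.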