Feature Importance Entities
Feature importance data entity for storing ML analysis results.
This module contains the FeatureImportanceData class that stores feature importance analysis results from various ML algorithms. It provides separated data and metadata storage with flexible access methods.
- class mdxplain.feature_importance.entities.feature_importance_data.FeatureImportanceData(name: str)
Data entity for storing feature importance analysis results.
Stores feature importance results from ML algorithms (mainly classifiers), keeping data and metadata in separate, parallel lists. Each FeatureImportanceData instance contains the results for all sub-comparisons of a single analysis run.
Attributes
- name : str
Name identifier for this analysis
- analyzer_type : str
Type of analyzer used (e.g., “decision_tree”)
- comparison_name : str
Name of the comparison this analysis was run on
- data : List[np.ndarray]
List of feature importance arrays (one per sub-comparison)
- metadata : List[Dict[str, Any]]
List of metadata dictionaries (parallel to data list)
Examples
>>> fi_data = FeatureImportanceData("tree_analysis")
>>> fi_data.analyzer_type = "decision_tree"
>>> fi_data.comparison_name = "folded_vs_unfolded"

>>> # Access by index
>>> importance_0, meta_0 = fi_data.get_comparison(0)

>>> # Access by name
>>> importance, meta = fi_data.get_comparison("folded_vs_rest")
- __init__(name: str)
Initialize feature importance data with given name.
Parameters
- name : str
Name identifier for this analysis
Returns
- None
Initializes FeatureImportanceData with given name
Examples
>>> fi_data = FeatureImportanceData("my_analysis")
>>> print(fi_data.name)
my_analysis
- add_comparison_result(importance_scores: ndarray, metadata: Dict[str, Any]) → None
Add results for a sub-comparison.
Parameters
- importance_scores : np.ndarray
Feature importance scores from ML algorithm
- metadata : Dict[str, Any]
Metadata for this sub-comparison
Returns
- None
Adds results to data and metadata lists
Examples
>>> importance = np.array([0.3, 0.2, 0.1, 0.4])
>>> meta = {
...     "comparison": "folded_vs_unfolded",
...     "n_samples": 1000,
...     "accuracy": 0.85
... }
>>> fi_data.add_comparison_result(importance, meta)
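The parallel-list storage described above can be illustrated with a minimal sketch. `ParallelResultStore` is a hypothetical stand-in written for this example, not the mdxplain class; it only assumes that each call appends one entry to both lists so that index i refers to the same sub-comparison in `data` and `metadata`.

```python
from typing import Any, Dict, List

import numpy as np


class ParallelResultStore:
    """Hypothetical stand-in that mimics the parallel data/metadata lists."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: List[np.ndarray] = []
        self.metadata: List[Dict[str, Any]] = []

    def add_comparison_result(
        self, importance_scores: np.ndarray, metadata: Dict[str, Any]
    ) -> None:
        # Append to both lists in the same call so that index i always
        # refers to the same sub-comparison in `data` and `metadata`.
        self.data.append(np.asarray(importance_scores, dtype=float))
        self.metadata.append(dict(metadata))


store = ParallelResultStore("tree_analysis")
store.add_comparison_result(
    np.array([0.3, 0.2, 0.1, 0.4]),
    {"comparison": "folded_vs_rest", "n_samples": 1000, "accuracy": 0.85},
)
store.add_comparison_result(
    np.array([0.1, 0.5, 0.3, 0.1]),
    {"comparison": "unfolded_vs_rest", "n_samples": 1000, "accuracy": 0.80},
)
```

Because both lists grow together, `store.data[1]` and `store.metadata[1]` always describe the same sub-comparison.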
- get_comparison(identifier: int | str) → Tuple[ndarray, Dict[str, Any]]
Get comparison results by index or name.
Parameters
- identifier : int or str
If int: index of the comparison (0-based). If str: name of the comparison as stored in the metadata.
Returns
- Tuple[np.ndarray, Dict[str, Any]]
Tuple of (importance_scores, metadata)
Raises
- ValueError
If identifier not found
- TypeError
If identifier is neither int nor str
Examples
>>> # Access by index
>>> scores, meta = fi_data.get_comparison(0)

>>> # Access by name
>>> scores, meta = fi_data.get_comparison("folded_vs_rest")
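The index-or-name lookup and its two error cases can be sketched as follows. `lookup_comparison` is a hypothetical helper, and the assumption that the name is stored under the `"comparison"` metadata key is taken from the examples on this page rather than from the library source.

```python
from typing import Any, Dict, List, Tuple, Union

import numpy as np


def lookup_comparison(
    data: List[np.ndarray],
    metadata: List[Dict[str, Any]],
    identifier: Union[int, str],
) -> Tuple[np.ndarray, Dict[str, Any]]:
    """Hypothetical helper: resolve an index or a name to one result pair."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(identifier, bool) or not isinstance(identifier, (int, str)):
        raise TypeError("identifier must be int or str")

    if isinstance(identifier, int):
        if not 0 <= identifier < len(data):
            raise ValueError(f"No comparison at index {identifier}")
        return data[identifier], metadata[identifier]

    # Assumption: the comparison name lives under the "comparison" key.
    for scores, meta in zip(data, metadata):
        if meta.get("comparison") == identifier:
            return scores, meta
    raise ValueError(f"No comparison named {identifier!r}")


data = [np.array([0.3, 0.7]), np.array([0.6, 0.4])]
metadata = [{"comparison": "folded_vs_rest"}, {"comparison": "unfolded_vs_rest"}]

by_index = lookup_comparison(data, metadata, 0)
by_name = lookup_comparison(data, metadata, "unfolded_vs_rest")
```

Both calls return an `(importance_scores, metadata)` tuple; an unknown name or out-of-range index raises `ValueError`, and any other identifier type raises `TypeError`.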
- get_all_comparisons() → List[Tuple[ndarray, Dict[str, Any]]]
Get all comparison results.
Parameters
None
Returns
- List[Tuple[np.ndarray, Dict[str, Any]]]
List of (importance_scores, metadata) tuples
Examples
>>> all_results = fi_data.get_all_comparisons()
>>> for scores, meta in all_results:
...     print(f"{meta['comparison']}: {scores[:3]}")
- list_comparisons() → List[str]
List all available comparison names.
Parameters
None
Returns
- List[str]
List of comparison names from metadata
Examples
>>> names = fi_data.list_comparisons()
>>> print(f"Available comparisons: {names}")
- get_top_features(identifier: int | str, n: int = 10) → List[Tuple[int, float]]
Get top N most important features for a comparison.
Parameters
- identifier : int or str
Comparison identifier (index or name)
- n : int, default=10
Number of top features to return
Returns
- List[Tuple[int, float]]
List of (feature_index, importance_score) tuples, sorted by importance
Examples
>>> top_features = fi_data.get_top_features("folded_vs_rest", n=5)
>>> for feat_idx, score in top_features:
...     print(f"Feature {feat_idx}: {score:.3f}")
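One way to produce a sorted list of `(feature_index, importance_score)` tuples is to rank the score array with `np.argsort`. The function `top_features` below is a hypothetical illustration of that ranking step, not the library's implementation.

```python
from typing import List, Tuple

import numpy as np


def top_features(importance_scores: np.ndarray, n: int = 10) -> List[Tuple[int, float]]:
    """Hypothetical helper: the n highest-scoring features, best first."""
    scores = np.asarray(importance_scores, dtype=float)
    # argsort is ascending, so reverse it and keep the first n indices.
    order = np.argsort(scores)[::-1][:n]
    return [(int(i), float(scores[i])) for i in order]


top = top_features(np.array([0.3, 0.2, 0.1, 0.4]), n=2)
print(top)  # [(3, 0.4), (0, 0.3)]
```

If `n` exceeds the number of features, the slice simply returns all of them.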
- get_average_importance() → ndarray
Get average feature importance across all comparisons.
Parameters
None
Returns
- np.ndarray
Average importance scores across all sub-comparisons
Examples
>>> avg_importance = fi_data.get_average_importance()
>>> top_overall = np.argmax(avg_importance)
>>> print(f"Most important feature overall: {top_overall}")
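Averaging across sub-comparisons amounts to an element-wise mean over the importance arrays. The sketch below assumes every array has the same length (one score per feature); `average_importance` is a hypothetical helper, and how the library handles an empty result list is not stated in this page.

```python
from typing import List

import numpy as np


def average_importance(data: List[np.ndarray]) -> np.ndarray:
    """Hypothetical helper: per-feature mean over all sub-comparisons."""
    if not data:
        raise ValueError("no comparison results to average")
    # Stack to shape (n_comparisons, n_features), then average over axis 0.
    return np.mean(np.stack(data, axis=0), axis=0)


avg = average_importance(
    [np.array([0.3, 0.2, 0.1, 0.4]), np.array([0.1, 0.4, 0.1, 0.4])]
)
top_overall = int(np.argmax(avg))
```

Here the averaged scores are approximately `[0.2, 0.3, 0.1, 0.4]`, so `top_overall` is feature 3.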
- get_analysis_info() → Dict[str, Any]
Get summary information about this analysis.
Parameters
None
Returns
- Dict[str, Any]
Dictionary with analysis summary information
Examples
>>> info = fi_data.get_analysis_info()
>>> print(f"Analyzer: {info['analyzer_type']}")
>>> print(f"Comparisons: {info['n_comparisons']}")
- save(save_path: str) → None
Save FeatureImportanceData object to disk.
Parameters
- save_path : str
Path where to save the FeatureImportanceData object
Returns
- None
Saves the FeatureImportanceData object to the specified path
Examples
>>> feature_importance_data.save('analysis_results/tree_importance.pkl')
- load(load_path: str) → None
Load FeatureImportanceData object from disk.
Parameters
- load_path : str
Path to the saved FeatureImportanceData file
Returns
- None
Loads the FeatureImportanceData object from the specified path
Examples
>>> feature_importance_data.load('analysis_results/tree_importance.pkl')
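Since `load` returns None, it presumably restores state into the existing object rather than returning a new one. The sketch below shows one way such a save/load pair could work. It assumes pickle serialization, which the `.pkl` extension in the examples suggests but this page does not confirm; `PicklableResult` is a hypothetical stand-in, not the mdxplain class.

```python
import os
import pickle
import tempfile


class PicklableResult:
    """Hypothetical stand-in; not the library's implementation."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data = []
        self.metadata = []

    def save(self, save_path: str) -> None:
        # Create the parent directory if the path has one.
        parent = os.path.dirname(save_path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        with open(save_path, "wb") as fh:
            pickle.dump(self.__dict__, fh)

    def load(self, load_path: str) -> None:
        # Returns None: the loaded state replaces this instance's state.
        with open(load_path, "rb") as fh:
            self.__dict__.update(pickle.load(fh))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "analysis_results", "tree_importance.pkl")
    original = PicklableResult("tree_analysis")
    original.data.append([0.3, 0.2, 0.1, 0.4])
    original.save(path)

    restored = PicklableResult("placeholder")
    restored.load(path)
```

After the round trip, `restored` carries the name and data of `original`. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.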
- print_info(pipeline_data: PipelineData | None = None) → None
Print comprehensive feature importance information.
Parameters
- pipeline_data : PipelineData, optional
Pipeline data object for extracting feature names. If provided, top features will show actual feature names instead of indices.
Returns
- None
Prints feature importance information to console
Examples
>>> feature_importance_data.print_info()
=== FeatureImportanceData ===
Name: tree_analysis
Analyzer Type: RandomForest
Comparison: folded_vs_unfolded
Sub-Comparisons: 3 (folded_vs_rest, intermediate_vs_rest, unfolded_vs_rest)
Features Analyzed: 150

>>> # With feature names
>>> feature_importance_data.print_info(pipeline_data)
Top Feature Overall: Res10-Res50 distance (avg importance: 0.1362)
- get_representative_frame(pipeline_data: PipelineData, comparison_identifier: str, representative_mode: str = 'best', n_top: int = 10, use_memmap: bool = False, chunk_size: int = 2000) → Tuple[int, int]
Get representative frame for a sub-comparison.
Finds the most representative frame for a given sub-comparison based on feature importance. Supports two modes: “best” (frame maximizing top feature values) and “centroid” (frame closest to cluster center).
Parameters
- pipeline_data : PipelineData
Pipeline data object containing trajectories and features
- comparison_identifier : str
Sub-comparison identifier
- representative_mode : str, default="best"
Mode for frame selection:
“best”: Frame maximizing top important features
“centroid”: Frame closest to cluster centroid
- n_top : int, default=10
Number of top features to consider (for “best” mode)
- use_memmap : bool, default=False
Whether to use memory-mapped processing
- chunk_size : int, default=2000
Chunk size for memory-mapped processing
Returns
- Tuple[int, int]
Trajectory index and frame index of representative frame
Examples
>>> # Get best representative frame
>>> traj_idx, frame_idx = fi_data.get_representative_frame(
...     pipeline_data, "cluster_0_vs_rest", representative_mode="best"
... )

>>> # Get centroid frame
>>> traj_idx, frame_idx = fi_data.get_representative_frame(
...     pipeline_data, "cluster_0_vs_rest", representative_mode="centroid"
... )
Notes
“best” mode uses Decision Tree split rules for scoring
“centroid” mode finds frame minimizing distance to mean
Use memmap mode for large trajectories to save memory
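The idea behind “centroid” mode, picking the frame that minimizes the distance to the mean and reporting it as a (trajectory index, frame index) pair, can be sketched as below. This is an illustration under stated assumptions, not the library's code: `centroid_frame` is a hypothetical helper, the input is a plain list of per-trajectory feature matrices instead of a PipelineData object, Euclidean distance is assumed, and the mean is taken over all frames passed in (the real method presumably restricts it to the frames of the selected sub-comparison).

```python
from typing import List, Tuple

import numpy as np


def centroid_frame(features_per_traj: List[np.ndarray]) -> Tuple[int, int]:
    """Hypothetical helper: (trajectory index, frame index) closest to the mean."""
    # Each array has shape (n_frames, n_features); stack all frames.
    all_frames = np.concatenate(features_per_traj, axis=0)
    centroid = all_frames.mean(axis=0)
    distances = np.linalg.norm(all_frames - centroid, axis=1)
    flat_idx = int(np.argmin(distances))

    # Map the flat index back to (trajectory, frame) via cumulative lengths.
    lengths = [len(f) for f in features_per_traj]
    offsets = np.cumsum([0] + lengths)
    traj_idx = int(np.searchsorted(offsets, flat_idx, side="right") - 1)
    frame_idx = flat_idx - int(offsets[traj_idx])
    return traj_idx, frame_idx


traj_0 = np.array([[0.0, 0.0], [10.0, 10.0]])
traj_1 = np.array([[4.0, 4.0], [5.0, 6.0], [9.0, 9.0]])
result = centroid_frame([traj_0, traj_1])
print(result)  # (1, 1)
```

In this toy input the mean is (5.6, 5.8), and the closest frame is `[5.0, 6.0]`, the second frame of the second trajectory.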