Feature Importance Entities

Feature importance data entity for storing ML analysis results.

This module contains the FeatureImportanceData class, which stores feature importance analysis results from various ML algorithms. Data and metadata are stored separately, and results can be accessed by index or by name.

class mdxplain.feature_importance.entities.feature_importance_data.FeatureImportanceData(name: str)

Data entity for storing feature importance analysis results.

Stores feature importance results from ML algorithms (mainly classifiers), keeping the importance arrays and their metadata in separate, parallel lists. Each FeatureImportanceData contains the results for all sub-comparisons from a single analysis run.

Attributes

name : str

Name identifier for this analysis

analyzer_type : str

Type of analyzer used (e.g., "decision_tree")

comparison_name : str

Name of the comparison this analysis was run on

data : List[np.ndarray]

List of feature importance arrays (one per sub-comparison)

metadata : List[Dict[str, Any]]

List of metadata dictionaries (parallel to the data list)

Examples

>>> fi_data = FeatureImportanceData("tree_analysis")
>>> fi_data.analyzer_type = "decision_tree"
>>> fi_data.comparison_name = "folded_vs_unfolded"
>>> # Access by index
>>> importance_0, meta_0 = fi_data.get_comparison(0)
>>> # Access by name
>>> importance, meta = fi_data.get_comparison("folded_vs_rest")
__init__(name: str)

Initialize feature importance data with given name.

Parameters

name : str

Name identifier for this analysis

Returns

None

Initializes FeatureImportanceData with given name

Examples

>>> fi_data = FeatureImportanceData("my_analysis")
>>> print(fi_data.name)
my_analysis
add_comparison_result(importance_scores: ndarray, metadata: Dict[str, Any]) -> None

Add results for a sub-comparison.

Parameters

importance_scores : np.ndarray

Feature importance scores from ML algorithm

metadata : Dict[str, Any]

Metadata for this sub-comparison

Returns

None

Adds results to data and metadata lists

Examples

>>> importance = np.array([0.3, 0.2, 0.1, 0.4])
>>> meta = {
...     "comparison": "folded_vs_unfolded",
...     "n_samples": 1000,
...     "accuracy": 0.85
... }
>>> fi_data.add_comparison_result(importance, meta)
get_comparison(identifier: int | str) -> Tuple[ndarray, Dict[str, Any]]

Get comparison results by index or name.

Parameters

identifier : int or str

  • int: Index of the comparison (0-based)

  • str: Name of the comparison from metadata

Returns

Tuple[np.ndarray, Dict[str, Any]]

Tuple of (importance_scores, metadata)

Raises

ValueError

If the identifier is not found

TypeError

If identifier is neither int nor str

Examples

>>> # Access by index
>>> scores, meta = fi_data.get_comparison(0)
>>> # Access by name
>>> scores, meta = fi_data.get_comparison("folded_vs_rest")
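
The index-or-name lookup described above can be sketched with plain parallel lists. This is a minimal illustration, not the library's implementation: `lookup_comparison` is a hypothetical helper, and it assumes that the comparison name is stored under the "comparison" metadata key, as in the examples on this page.

```python
from typing import Any, Dict, List, Tuple, Union

import numpy as np


def lookup_comparison(
    data: List[np.ndarray],
    metadata: List[Dict[str, Any]],
    identifier: Union[int, str],
) -> Tuple[np.ndarray, Dict[str, Any]]:
    """Hypothetical helper: resolve an index or a name against parallel lists."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(identifier, bool) or not isinstance(identifier, (int, str)):
        raise TypeError(
            f"identifier must be int or str, got {type(identifier).__name__}"
        )
    if isinstance(identifier, int):
        if not 0 <= identifier < len(data):
            raise ValueError(f"comparison index {identifier} out of range")
        return data[identifier], metadata[identifier]
    # Assumption: the name lives under the "comparison" metadata key.
    for scores, meta in zip(data, metadata):
        if meta.get("comparison") == identifier:
            return scores, meta
    raise ValueError(f"comparison '{identifier}' not found")


data = [np.array([0.3, 0.2, 0.1, 0.4]), np.array([0.1, 0.6, 0.2, 0.1])]
metadata = [{"comparison": "folded_vs_rest"}, {"comparison": "unfolded_vs_rest"}]
scores, meta = lookup_comparison(data, metadata, "unfolded_vs_rest")
```

Because the two lists are parallel, a position found in one list is valid in the other, which is why a single index is enough to return both the scores and their metadata.
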
get_all_comparisons() -> List[Tuple[ndarray, Dict[str, Any]]]

Get all comparison results.

Parameters

None

Returns

List[Tuple[np.ndarray, Dict[str, Any]]]

List of (importance_scores, metadata) tuples

Examples

>>> all_results = fi_data.get_all_comparisons()
>>> for scores, meta in all_results:
...     print(f"{meta['comparison']}: {scores[:3]}")
list_comparisons() -> List[str]

List all available comparison names.

Parameters

None

Returns

List[str]

List of comparison names from metadata

Examples

>>> names = fi_data.list_comparisons()
>>> print(f"Available comparisons: {names}")
get_top_features(identifier: int | str, n: int = 10) -> List[Tuple[int, float]]

Get top N most important features for a comparison.

Parameters

identifier : int or str

Comparison identifier (index or name)

n : int, default=10

Number of top features to return

Returns

List[Tuple[int, float]]

List of (feature_index, importance_score) tuples, sorted by importance

Examples

>>> top_features = fi_data.get_top_features("folded_vs_rest", n=5)
>>> for feat_idx, score in top_features:
...     print(f"Feature {feat_idx}: {score:.3f}")
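
For intuition, a top-N selection of this kind can be reproduced with NumPy alone. The sketch below is an assumption about the general technique (sort indices by descending score, keep the first n); `top_features` is a hypothetical standalone function, and the library's own tie-breaking may differ.

```python
from typing import List, Tuple

import numpy as np


def top_features(scores: np.ndarray, n: int = 10) -> List[Tuple[int, float]]:
    """Hypothetical helper: (feature_index, score) pairs, highest score first."""
    # argsort is ascending, so reverse it and keep the first n indices.
    order = np.argsort(scores)[::-1][:n]
    return [(int(i), float(scores[i])) for i in order]


importance = np.array([0.3, 0.2, 0.1, 0.4])
top = top_features(importance, n=2)
for feat_idx, score in top:
    print(f"Feature {feat_idx}: {score:.3f}")
```

If n exceeds the number of features, the slice simply returns every feature in sorted order.
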
get_average_importance() -> ndarray

Get average feature importance across all comparisons.

Parameters

None

Returns

np.ndarray

Average importance scores across all sub-comparisons

Examples

>>> avg_importance = fi_data.get_average_importance()
>>> top_overall = np.argmax(avg_importance)
>>> print(f"Most important feature overall: {top_overall}")
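
As a rough illustration of averaging across sub-comparisons, the arrays can be stacked and reduced along the first axis. This sketch assumes an unweighted element-wise mean over arrays of equal length; the values are made up for the example.

```python
import numpy as np

# One importance array per sub-comparison (illustrative values).
per_comparison = [
    np.array([0.3, 0.2, 0.1, 0.4]),
    np.array([0.1, 0.6, 0.2, 0.1]),
    np.array([0.2, 0.1, 0.5, 0.2]),
]

# Stack into shape (n_comparisons, n_features) and average over comparisons.
avg_importance = np.mean(np.stack(per_comparison), axis=0)
top_overall = int(np.argmax(avg_importance))
print(f"Most important feature overall: {top_overall}")
```
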
get_analysis_info() -> Dict[str, Any]

Get summary information about this analysis.

Parameters

None

Returns

Dict[str, Any]

Dictionary with analysis summary information

Examples

>>> info = fi_data.get_analysis_info()
>>> print(f"Analyzer: {info['analyzer_type']}")
>>> print(f"Comparisons: {info['n_comparisons']}")
save(save_path: str) -> None

Save FeatureImportanceData object to disk.

Parameters

save_path : str

Path where to save the FeatureImportanceData object

Returns

None

Saves the FeatureImportanceData object to the specified path

Examples

>>> feature_importance_data.save('analysis_results/tree_importance.pkl')
load(load_path: str) -> None

Load FeatureImportanceData object from disk.

Parameters

load_path : str

Path to the saved FeatureImportanceData file

Returns

None

Loads the FeatureImportanceData object from the specified path

Examples

>>> feature_importance_data.load('analysis_results/tree_importance.pkl')
print_info(pipeline_data: PipelineData | None = None) -> None

Print comprehensive feature importance information.

Parameters

pipeline_data : PipelineData, optional

Pipeline data object for extracting feature names. If provided, top features will show actual feature names instead of indices.

Returns

None

Prints feature importance information to console

Examples

>>> feature_importance_data.print_info()
=== FeatureImportanceData ===
Name: tree_analysis
Analyzer Type: decision_tree
Comparison: folded_vs_unfolded
Sub-Comparisons: 3 (folded_vs_rest, intermediate_vs_rest, unfolded_vs_rest)
Features Analyzed: 150
>>> # With feature names
>>> feature_importance_data.print_info(pipeline_data)
Top Feature Overall: Res10-Res50 distance (avg importance: 0.1362)
get_representative_frame(pipeline_data: PipelineData, comparison_identifier: str, representative_mode: str = 'best', n_top: int = 10, use_memmap: bool = False, chunk_size: int = 2000) -> Tuple[int, int]

Get representative frame for a sub-comparison.

Finds the most representative frame for a given sub-comparison based on feature importance. Supports two modes: “best” (frame maximizing top feature values) and “centroid” (frame closest to cluster center).

Parameters

pipeline_data : PipelineData

Pipeline data object containing trajectories and features

comparison_identifier : str

Sub-comparison identifier

representative_mode : str, default="best"

Mode for frame selection:

  • “best”: Frame maximizing top important features

  • “centroid”: Frame closest to cluster centroid

n_top : int, default=10

Number of top features to consider (for “best” mode)

use_memmap : bool, default=False

Whether to use memory-mapped processing

chunk_size : int, default=2000

Chunk size for memory-mapped processing

Returns

Tuple[int, int]

Trajectory index and frame index of representative frame

Examples

>>> # Get best representative frame
>>> traj_idx, frame_idx = fi_data.get_representative_frame(
...     pipeline_data, "cluster_0_vs_rest", representative_mode="best"
... )
>>> # Get centroid frame
>>> traj_idx, frame_idx = fi_data.get_representative_frame(
...     pipeline_data, "cluster_0_vs_rest", representative_mode="centroid"
... )

Notes

  • “best” mode uses Decision Tree split rules for scoring

  • “centroid” mode finds frame minimizing distance to mean

  • Use memmap mode for large trajectories to save memory
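
The “centroid” idea from the notes above (the frame minimizing distance to the mean) can be sketched in a chunked form that also works on a memory-mapped array. This is only an illustration under stated assumptions: `centroid_frame` is a hypothetical function, Euclidean distance is assumed, and it returns a single row index into one feature matrix rather than the (trajectory index, frame index) pair that get_representative_frame returns.

```python
import numpy as np


def centroid_frame(features: np.ndarray, chunk_size: int = 2000) -> int:
    """Hypothetical helper: index of the row closest to the column-wise mean."""
    n_frames, n_features = features.shape

    # First pass: accumulate the mean chunk by chunk to bound memory use.
    total = np.zeros(n_features, dtype=np.float64)
    for start in range(0, n_frames, chunk_size):
        total += np.asarray(features[start:start + chunk_size]).sum(axis=0)
    centroid = total / n_frames

    # Second pass: track the smallest Euclidean distance seen so far.
    best_idx, best_dist = -1, np.inf
    for start in range(0, n_frames, chunk_size):
        chunk = np.asarray(features[start:start + chunk_size], dtype=np.float64)
        dists = np.linalg.norm(chunk - centroid, axis=1)
        local = int(np.argmin(dists))
        if dists[local] < best_dist:
            best_idx, best_dist = start + local, float(dists[local])
    return best_idx


frames = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]])
idx = centroid_frame(frames, chunk_size=2)
```

Two passes are used so that only one chunk of frames is held in memory at a time, which is the usual motivation for a chunk_size parameter alongside memory-mapped data.
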