abhi-glitchhg committed Oct 31, 2023
2 parents 7e2cb47 + 476aa3e commit 7d1ffa8
Showing 4 changed files with 25 additions and 10 deletions.
21 changes: 18 additions & 3 deletions README.md
@@ -3,7 +3,7 @@

# corrfeatred

-reduce features using correlation matrix
+select features using correlation matrix

## Installation

@@ -29,15 +29,30 @@ different_feature_set = reduce_features(correlation_matrix, threshold=0.8, polic
```


-## Workflow
+## Idea and workflow

Currently there is only one function, which takes a correlation matrix and a threshold as input and then constructs a graph.

Thereafter we find the maximal cliques in the graph; our goal is to keep at most one feature from each clique.

We create a graph where each node represents a feature and each edge represents collinearity between two features. Then the maximal cliques present in the graph are calculated.


Each clique represents a cluster of features that are correlated with each other, so one feature from the cluster is enough to represent the whole cluster in the final feature set. Hence, we can have multiple policies for choosing the features (minimum number of features, maximum number of features, etc.).

Our goal is to have at most one feature from each clique.

Finally, all features in the set returned by this function have pairwise correlation below the threshold.

![workflow](https://github.com/abhi-glitchhg/corrfeatred/assets/72816663/731c0be4-75a0-4355-b4aa-7682d7759d38)
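
The workflow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `corrfeatred/reduce.py`: the helper names (`build_graph`, `maximal_cliques`, `select_features`) and the matrix values are made up, the cliques are enumerated with the basic Bron-Kerbosch recursion, and the selection is a simple greedy pass rather than the `min`/`max` policies.

```python
from itertools import combinations


def build_graph(corr, threshold):
    """Adjacency sets: i and j are linked when |corr[i][j]| > threshold."""
    n = len(corr)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if abs(corr[i][j]) > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored nodes
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


def select_features(corr, threshold):
    """Greedy pick: keep a feature, then drop every feature linked to it."""
    adj = build_graph(corr, threshold)
    kept, dropped = [], set()
    for node in sorted(adj):
        if node not in dropped:
            kept.append(node)
            dropped |= adj[node]
    return kept


# Hypothetical 4-feature matrix (values chosen for illustration only).
corr = [
    [1.0, 0.9, 0.85, 0.1],
    [0.9, 1.0, 0.8, 0.2],
    [0.85, 0.8, 1.0, -0.9],
    [0.1, 0.2, -0.9, 1.0],
]
cliques = maximal_cliques(build_graph(corr, 0.75))
selected = select_features(corr, 0.75)
print(cliques)   # [[0, 1, 2], [2, 3]]
print(selected)  # [0, 3]
```

Because no two kept features share an edge, the result contains at most one feature per clique, and every remaining pair has absolute correlation at or below the threshold.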












10 changes: 5 additions & 5 deletions corrfeatred/reduce.py
@@ -12,12 +12,12 @@ def reduce_features(correlation_matrix, threshold=0.75,policy='min', random_stat
# correlation_matrix : df.corr
# threshold: float
# method: whether we want minimum number of features or maximum number of features. NOTE: this is bit unstable and sometimes `min` policy has more features than `max`, this depends on how graph develops.
-# random_state: random state, use this to get different set of features for same correlation matrix.
+# random_seed: random state, use this to get different set of features for same correlation matrix.
# """

-if random_state!=None:
-random_gen = random.Random(random_state)
-corrmatrix = correlation_matrix.copy()
+if random_seed!=None:
+random_gen = random.Random(random_seed)
+corrmatrix = correlation_matrix.abs().copy()
inf_ = 4*(corrmatrix.shape[0]) + 10 # adding 10 for no reason, this could be any positive number;
corr_matrix = corrmatrix > threshold
assert policy in ('min', 'max'), "wrong input parameter"
@@ -50,7 +50,7 @@ def reduce_features(correlation_matrix, threshold=0.75,policy='min', random_stat
node_to_clique_list[key].append(idx) # node to group mapping
while not np.all(mask):

-if random_state==None:
+if random_seed==None:
top = method(node_to_clique_count.items(), key = lambda kv: kv[1])
else:
dict_items = list(node_to_clique_count.items())
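
The switch to `correlation_matrix.abs()` in the first hunk of this file means that strongly negative correlations are now compared against the threshold too. A minimal sketch of the effect, using plain nested lists instead of a pandas DataFrame; `edges_above` and the matrix values are hypothetical and not part of corrfeatred:

```python
def edges_above(corr, threshold, use_abs):
    """Return index pairs (i, j), i < j, whose correlation exceeds threshold."""
    n = len(corr)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            value = abs(corr[i][j]) if use_abs else corr[i][j]
            if value > threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical matrix: feature 2 is strongly anti-correlated with 0 and 1.
corr = [
    [1.0, 0.9, -0.95],
    [0.9, 1.0, -0.85],
    [-0.95, -0.85, 1.0],
]
signed = edges_above(corr, 0.75, use_abs=False)
absolute = edges_above(corr, 0.75, use_abs=True)
print(signed)    # [(0, 1)]
print(absolute)  # [(0, 1), (0, 2), (1, 2)]
```

Without the absolute value, feature 2 would have no edges and would always survive the reduction, even though it carries nearly the same information as features 0 and 1.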
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "corrfeatred"
-version = "0.0.3.1"
+version = "0.0.3.2"
authors = [
{ name="Abhijit Deo", email="[email protected]" },
]
2 changes: 1 addition & 1 deletion tests/test_correlation.py
@@ -7,7 +7,7 @@ def test_hello():


def test_correlation():
-for i in (4,8,10,12,16,22,100,1000):
+for i in (4,8,10,12,16,22,100,500):
rand_array = np.random.uniform(0,1,i*i).reshape(i,i)
corr_arr = np.tril(rand_array) + np.triu(rand_array.T,1)
assert (corr_arr!=corr_arr.T).sum().sum() == 0