Parallel initialization for k-means #1754
base: main
Conversation
Thank you for the PR!
Codecov Report
Attention: Patch coverage is …

@@            Coverage Diff             @@
##             main    #1754      +/-   ##
==========================================
+ Coverage   92.26%   92.45%   +0.18%
==========================================
  Files          84       84
  Lines       12445    12438       -7
==========================================
+ Hits        11482    11499      +17
+ Misses        963      939      -24
@@ -169,7 +174,7 @@ def functional_value_(self) -> float:
         """
         return self._functional_value

-    def fit(self, x: DNDarray):
+    def fit(self, x: DNDarray, weights: torch.tensor = 1):
This is the fit of batch-parallel clustering. If we allow specification of weights here, they must be a DNDarray, and below the corresponding local arrays would have to be used for the local clusterings.
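A rough sketch of what is meant (the helper name, signature, and exact handling are assumptions, not the actual implementation): accept the weights only as a DNDarray with the same split as x and hand the process-local torch tensor to the local clustering.

```python
from typing import Optional

import torch
import heat as ht


def local_weights(x: ht.DNDarray, weights: Optional[ht.DNDarray] = None) -> Optional[torch.Tensor]:
    """Hypothetical helper: validate weights and return the process-local part."""
    if weights is None:
        return None
    if not isinstance(weights, ht.DNDarray) or weights.split != x.split:
        raise ValueError("weights must be a DNDarray with the same split as x")
    return weights.larray  # process-local torch tensor used by the local clustering
```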
@@ -233,6 +241,7 @@ def fit(self, x: DNDarray):
             self.max_iter,
             self.tol,
             local_random_state,
+            weights,
see above
@@ -102,7 +102,7 @@ def _update_centroids(self, x: DNDarray, matching_centroids: DNDarray):

         return new_cluster_centers

-    def fit(self, x: DNDarray) -> self:
+    def fit(self, x: DNDarray, oversampling: float = 100, iter_multiplier: float = 20) -> self:
Are there references for these default values? I had a look into Dask, and they use oversampling=2 as default.
            # output format: scalar
            #
            # Iteratively fill the tensor storing the centroids
            for _ in ht.arange(0, iter_multiplier * ht.log(init_cost)):
For loop counters, the standard Python range should be more efficient.
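For instance (a minimal sketch; the toy values and the cast to a Python scalar are assumptions about how the bound would be handled):

```python
import math

# Toy stand-ins for quantities computed earlier in the routine (assumptions):
iter_multiplier = 20.0
init_cost = 1.0e6  # cost of the initial clustering, as a Python scalar

# A plain Python range drives the loop without creating a DNDarray
# just for the loop counter.
for _ in range(int(iter_multiplier * math.log(init_cost))):
    pass  # ... fill the tensor storing the centroid candidates ...
```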
Looks fine 👍. I only have some minor comments concerning details.
                prob = oversampling * min_distance / min_distance.sum()
                # --> probability distribution with oversampling factor
                # output format: vector
                idx = ht.where(sample <= prob)
One may think about moving the creation of sample here, so that a new sample is drawn in every step of the loop.
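Roughly like this (a sketch only; the toy arrays and loop count stand in for the variables of the actual routine):

```python
import heat as ht

# Toy stand-ins (assumptions) for variables that exist in the actual routine:
x = ht.random.rand(1000, 2, split=0)
min_distance = ht.random.rand(1000, split=0)
oversampling, n_rounds = 2.0, 5

for _ in range(n_rounds):
    # Drawing sample inside the loop re-randomizes the candidate
    # selection in every round.
    sample = ht.random.rand(x.shape[0], split=x.split)
    prob = oversampling * min_distance / min_distance.sum()
    idx = ht.where(sample <= prob)
    # ... add x[idx] to the candidate centroids and update min_distance ...
```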
            # Evaluate distance between final centroids and data points
            if centroids.shape[0] <= self.n_clusters:
                raise ValueError(
                    "The oversampling factor and/or the number of iterations are chosen"
Why two strings?
It may also be helpful to use something like f"The oversampling factor (={oversampling}) ..." to give the user the actual values in the error message.
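Something along these lines (just a sketch; the exact wording is an assumption):

```python
if centroids.shape[0] <= self.n_clusters:
    raise ValueError(
        f"Only {centroids.shape[0]} candidate centroids were sampled for "
        f"{self.n_clusters} clusters; the oversampling factor "
        f"(={oversampling}) and/or the number of iterations "
        f"(={iter_multiplier}) may be chosen too small."
    )
```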
            reclustered_centroids = torch.zeros(
                (self.n_clusters, centroids.shape[1]),
                dtype=x.dtype.torch_type(),
                device=centroids.device,
centroids.device.torch_device()?
        ht.MPI_WORLD.Bcast(
            reclustered_centroids, root=0
        )  # by default it is broadcasted from process 0
        reclustered_centroids = ht.array(reclustered_centroids, split=x.split)
I don't know whether we want to split the centroids, as there are probably only a few of them. So this might produce overhead compared to having split=None here.
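I.e., something like (sketch):

```python
# Every process already holds the full reclustered_centroids after the Bcast,
# so wrapping them with split=None avoids redistributing a small array.
ht.MPI_WORLD.Bcast(reclustered_centroids, root=0)
reclustered_centroids = ht.array(reclustered_centroids, split=None)
```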
Description
The runtime bottleneck of k-means clustering is the initialization of the centroids, which was previously based on a cost-intensive serial algorithm. The aim of this pull request is to replace this algorithm with the more sophisticated k-means|| initialization of the centroids.
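For illustration, a rough single-process sketch of the k-means|| idea (Bahmani et al., 2012); this is not the PR's distributed Heat implementation, and the fixed number of rounds as well as the final reduction to k centers are simplifications:

```python
import numpy as np


def kmeans_parallel_init(x, k, oversampling=2.0, rounds=5, seed=None):
    """Illustrative k-means|| initialization on a single process."""
    rng = np.random.default_rng(seed)
    # Start from one uniformly sampled data point.
    centers = x[rng.integers(x.shape[0])][None, :]
    for _ in range(rounds):
        # Squared distance of every point to its closest current center.
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        # Each point becomes a candidate independently, with probability
        # proportional to oversampling * d2 / current cost.
        prob = np.minimum(oversampling * d2 / d2.sum(), 1.0)
        picked = rng.random(x.shape[0]) <= prob
        centers = np.vstack([centers, x[picked]])
    # The full algorithm then reclusters the (weighted) candidates down to
    # exactly k centers; that step is simplified away here.
    return centers[:k]
```

In the distributed setting the distance evaluation and the candidate sampling in each round run independently on the local data of every process, which is what makes this initialization cheaper than serial k-means++ seeding.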
Issue/s resolved:
Changes proposed:
Type of change
Performance: the runtime of the initialization improves for both split=None and split not None by (at least) an order of magnitude (depending on the setting, e.g., the size of the data and the chosen parameters).

Does this change modify the behaviour of other functions? If so, which?