Alternatives to PCA, such as umap #27

Open
Hellisotherpeople opened this issue Mar 27, 2024 · 13 comments

Comments

@Hellisotherpeople

There's a large body of work on dimensionality reduction that handles nonlinearity better than PCA, e.g. UMAP: https://umap-learn.readthedocs.io/en/latest/

Is it simple to just "drop" this in place of PCA and get theoretically better results? If not, why?

What about other methods, like NMF (https://en.wikipedia.org/wiki/Non-negative_matrix_factorization)?

@vgel
Owner

vgel commented Apr 6, 2024

Playing with UMAP currently! I have it working but it's pretty funky, needs small coefficients. Doesn't seem to be a huge improvement over PCA currently, but it's possible the way I'm doing it isn't ideal. Might include it experimentally in the upcoming release!

(Generating a vector with UMAP is also ~30x slower than PCA currently.)

@vgel
Owner

vgel commented Apr 6, 2024

[Two images attached]

@Hellisotherpeople
Author

Hellisotherpeople commented Apr 11, 2024

Very interesting!

Given the training performance issues you describe, note that there is a cuML GPU implementation of UMAP (and of many other dimensionality reduction algorithms that could be offered): https://docs.rapids.ai/api/cuml/stable/api/#umap. It's certainly a larger dependency chain, but these days everyone has accepted NVIDIA's stack as mandatory, so it might be good to offer it at least as an option.

I think there is some tuning you can do with base UMAP's hyperparameters to improve speed and possibly the quality of the generated control vectors. A UMAP expert would be able to look over that and make sure it's set "correctly" given the data. Unfortunately that is not me (and likely fewer than 100 of them exist in the world).

As for why it requires smaller coefficients and why it is hard to show that it performs better, I'd love to see some analysis from others in the community, or even from the UMAP creator himself (or at least one of the aforementioned 100).

I'm extremely appreciative that you have implemented it yourself and tried it. Very happy to see such rapid response and that it might even be made available to others. Thank you!!!

@vgel
Owner

vgel commented May 24, 2024

umap is now experimentally supported as an (undocumented) option in #34. To use it, call ControlVector.train(..., method="umap") and make sure the umap-learn package is installed.
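For anyone scripting this, here is a minimal sketch of checking for the dependency before asking for the umap method. The helper names `umap_available` and `choose_method` are hypothetical, and treating "pca_diff" as a valid fallback value for `method` is an assumption based on the name used later in this thread. The only library fact relied on is that the umap-learn package is imported as `umap`.

```python
import importlib.util


def umap_available() -> bool:
    """True if the `umap` module (installed by the umap-learn package) can be found."""
    return importlib.util.find_spec("umap") is not None


def choose_method(preferred: str = "umap", fallback: str = "pca_diff") -> str:
    """Return `preferred`, unless it is "umap" and umap-learn is not installed."""
    if preferred == "umap" and not umap_available():
        return fallback
    return preferred


method = choose_method()
# Then, with the elided arguments filled in for your own setup:
# vector = ControlVector.train(..., method=method)
```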

@vgel
Owner

vgel commented May 24, 2024

Please feel free to use this issue to continue discussing umap and potential improvements! I'm not sure if the current method is the ideal usage of it.

@vgel changed the title from "Couldn't we do better than PCA?" to "Alternatives to PCA, such as umap" on May 24, 2024

@thiswillbeyourgithub

Thanks @vgel for all this.

I don't have a GPU and will have little free time for a while, but I'm still very curious as to whether nonlinear dimensionality reduction works "better".

Here are a few thoughts:

  1. There are tons of dimred algorithms, for example PaCMAP.
  2. Each algorithm usually has lots of parameters, which leaves a lot of room for experimentation. So it might be better to have the "method" argument accept a callable that takes the "train" variable as input and whose result is stored in "directions[layer]", offering maximum flexibility at seemingly little coding cost.
  3. I'm especially interested in the effect of gradually shifting the balance between the local and global focus of those algorithms. For example, in the umap API we can tune the densmap parameter.
  4. Or hybrid approaches: use PCA for the rough direction, then add 0.1 times the vector obtained from a umap transform that focuses on local relationships. Also try it with a global focus.
  5. Also, maybe PCA starts working quickly (= with few examples) at the cost of a greater loss on benchmarks, whereas nonlinear dimred methods have lower sensitivity (= need more examples) but greater specificity (= steering along those directions has fewer side effects).
  6. In any case, those algorithms can be sped up greatly by first taking the PCA transform down to, say, 50 dimensions and then applying the nonlinear dimension reduction to the transformed data. This might not defeat the purpose entirely and could retain the hypothetical gains. AFAIK it's common practice, as PCA lets us check that we retain enough of the variance to be sure we're not screwing up the data.
  7. Maybe it would be a well-spent effort to set up a standardized test rig before tuning all those things. These days I'm thinking about the recent papers on abliteration, well summarized in this blog post, which helpfully uploaded easy-to-use datasets with good and bad examples of refusals. That might be the quickest way to create our own mini benchmark.
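To make items 2 and 6 concrete, here is a minimal numpy sketch. It assumes the callable receives one layer's (n_samples, hidden_dim) "train" array and returns a single direction. The names `DirectionFn`, `pca_direction`, `pca_prereduce` and `compute_directions` are hypothetical and not part of repeng.

```python
from typing import Callable, Dict, Tuple

import numpy as np

# Hypothetical signature for the callable proposed in item 2: it receives the
# (n_samples, hidden_dim) "train" array of one layer and returns one direction
# of shape (hidden_dim,).
DirectionFn = Callable[[np.ndarray], np.ndarray]


def pca_direction(train: np.ndarray) -> np.ndarray:
    """Baseline callable: first principal component of the centered data."""
    centered = train - train.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def pca_prereduce(
    train: np.ndarray, n_components: int = 50
) -> Tuple[np.ndarray, np.ndarray, float]:
    """Item 6: project onto the top principal components before a slower method.

    Returns the reduced data, the component matrix (to map results back to the
    original space), and the fraction of variance retained.
    """
    centered = train - train.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, vt.shape[0])
    retained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return centered @ vt[:k].T, vt[:k], retained


def compute_directions(
    layer_to_train: Dict[int, np.ndarray], method: DirectionFn
) -> Dict[int, np.ndarray]:
    """Apply the user-supplied callable per layer, as in directions[layer]."""
    return {layer: method(train) for layer, train in layer_to_train.items()}


# Usage on random stand-in data (real inputs would be hidden states).
rng = np.random.default_rng(0)
fake_train = rng.normal(size=(40, 16))
reduced, components, retained = pca_prereduce(fake_train, n_components=5)
directions = compute_directions({3: fake_train}, pca_direction)
```

A UMAP- or PaCMAP-based callable would have the same signature. How to turn a nonlinear embedding back into a direction in activation space is the open question of this thread and is not addressed by the sketch.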

Anyway, I won't have time for about 6-12 months but may do a PR eventually.

If anyone's interested, please share your findings, especially negative results!

@thiswillbeyourgithub

thiswillbeyourgithub commented Jun 25, 2024

Addendum to my thoughts above (I hope nobody will mind!):

  8. Instead of taking all N samples and doing a 1-dimensional PCA to derive the direction, I'm thinking of another way:

  • Outline:
    • Take the N samples.
    • Run k-means with n_cluster=k.
    • Split the N samples into the k roughly equal clusters (k-means has the nice property of tending to produce evenly sized clusters).
    • Then do the 1D PCA over each cluster.
    • Now, for each inference: compute the distance between the current activation and each cluster centroid, and normalize these distances so they sum to 1.
    • Now apply the k directions (1 per cluster) to the activation, weighted by the distances.
    • All resemblance to mixture of experts is intentional: the distances act a bit like the routing network, and the idea is to optimize the tradeoff between how effective representation engineering is and how rough it is (= risk of side effects).
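The outline above can be sketched in plain numpy as follows. This is a sketch under stated assumptions, not the implementation from any fork: k-means is a hand-rolled Lloyd's algorithm, all function names are hypothetical, and the weights follow the outline literally (proportional to distance). Whether nearer clusters should instead weigh more (for example by using inverse distances) is left open.

```python
import numpy as np


def kmeans(x: np.ndarray, k: int, n_iter: int = 50, seed: int = 0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


def first_pc(x: np.ndarray) -> np.ndarray:
    """First principal component (unit vector) of the centered data."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def fit_cluster_directions(train: np.ndarray, k: int):
    """Cluster the samples, then take one 1D PCA direction per cluster."""
    centroids, labels = kmeans(train, k)
    directions = np.zeros((k, train.shape[1]))
    for j in range(k):
        members = train[labels == j]
        if len(members) >= 2:  # a direction needs at least two points
            directions[j] = first_pc(members)
    return centroids, directions


def routing_weights(activation: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Distances to each centroid, normalized to sum to 1."""
    dists = np.linalg.norm(centroids - activation, axis=1)
    total = dists.sum()
    if total == 0.0:  # degenerate case: fall back to uniform weights
        return np.full(len(centroids), 1.0 / len(centroids))
    return dists / total


def steer(activation, centroids, directions, strength=1.0):
    """Add the weighted mix of the k cluster directions to the activation."""
    weights = routing_weights(activation, centroids)
    return activation + strength * (weights @ directions)
```

One step is deliberately omitted: the sign of a principal component is arbitrary, so a real implementation would need to orient each cluster's direction consistently (for example using which samples are positive and which are negative) before mixing them.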

@thiswillbeyourgithub

Hello, I am back and have a tiny bit of free time to devote to exploring those ideas. I plan to document things in this fork.

I saw that the owner of this repo already tried umap with some success. Can you share your remarks from testing it, in as much depth as your time allows, before I dive in myself?

@thiswillbeyourgithub

Btw, PaCMAP is an alternative to UMAP that does nonlinear dimension reduction, has somewhat fewer free parameters, appears much simpler to install and package, and can actually produce a 1-dimensional output (this was not initially possible, cf this issue). The author will probably update the package soonish.

@thiswillbeyourgithub

I'm struggling to make umap work. Can you tell me:

  • how many examples you generate in your dataset
  • which model you use
  • what layer you apply the control vector to
  • what umap settings you're using
  • what strength you're using

It's making it harder to investigate PaCMAP

@thiswillbeyourgithub

I am still interested in the answers to the questions in my previous comment :)

I think I figured out that trying to preserve the scale of the train array helps a lot. See https://github.com/thiswillbeyourgithub/repeng/blob/c0722440ce5f67d8be112ebe7a2ff3fd8e97ae80/repeng/extract.py#L479

The same goes for applying a regularization norm inferred from the initial data.

@thiswillbeyourgithub

Btw, something like the k-means idea from my addendum above (item 8: one PCA direction per cluster, mixed by distance to the centroids) works great and is present in my fork.

@thiswillbeyourgithub

thiswillbeyourgithub commented Dec 12, 2024

Basically I run UMAP/PaCMAP in 3 dimensions to project the samples, then k-means to find 2 clusters, then subtract the mean of each cluster from the samples of the other cluster, then apply pca_diff to the resulting data. It seems to work great. I can push the strength to around 5x and it stays coherent. Lots more things to try!

Edit: also, the directions are pretty much always orthogonal to what pca_diff alone would produce, so there seems to be a benefit to using UMAP/PaCMAP.
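Here is one possible reading of that pipeline as a minimal numpy sketch, together with the cosine check behind the "orthogonal" observation. Several things are assumptions: a plain PCA projection to 3 dimensions stands in for UMAP/PaCMAP (pass your own `project` callable to swap in the real embedding), a hand-rolled 2-means replaces a library k-means, and the final pca_diff step is approximated by the first principal component of the contrasted samples. All function names are hypothetical and none of this is taken from the fork.

```python
import numpy as np


def first_pc(x: np.ndarray) -> np.ndarray:
    """First principal component (unit vector) of the centered data."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def pca_project(x: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Stand-in for the 3-D UMAP/PaCMAP embedding (an assumption, not the real thing)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def two_means(x: np.ndarray, n_iter: int = 50, seed: int = 0) -> np.ndarray:
    """Lloyd's algorithm with k=2; returns a 0/1 label per sample."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=2, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if labels.min() == labels.max():  # one cluster collapsed; stop early
            break
        new = np.array([x[labels == j].mean(axis=0) for j in range(2)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels


def cluster_contrast_direction(train: np.ndarray, project=pca_project) -> np.ndarray:
    """Embed, split into 2 clusters, contrast against the other cluster's mean, then PCA."""
    labels = two_means(project(train))
    if labels.min() == labels.max():
        raise ValueError("clustering produced a single cluster")
    # Cluster means are taken in the original activation space.
    means = np.array([train[labels == j].mean(axis=0) for j in range(2)])
    # Subtract from each sample the mean of the *other* cluster.
    contrasted = train - means[1 - labels]
    return first_pc(contrasted)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; values near 0 mean the two directions are near-orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Comparing `cosine(cluster_contrast_direction(train), first_pc(train))` on real hidden states, with the actual UMAP or PaCMAP embedding plugged in as `project`, would be one way to reproduce the orthogonality check. With the PCA stand-in there is no reason to expect the two directions to differ much.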
