Commit

rephrase and typos

dmetivie committed Dec 28, 2023
1 parent 4303fbb commit 31f1cc2
Showing 6 changed files with 27 additions and 20 deletions.
11 changes: 8 additions & 3 deletions README.md
@@ -1,11 +1,12 @@

# ExpectationMaximization

[![Docs](https://img.shields.io/badge/docs-dev-blue.svg)](https://dmetivie.github.io/ExpectationMaximization.jl/dev)

This package provides a simple implementation of the Expectation Maximization (EM) algorithm used to fit mixture models.
Thanks to [Julia](https://julialang.org/)'s amazing [dispatch](https://www.youtube.com/watch?v=kc9HwsxE1OY) system, its spirit of generic and reusable code, and the [Distributions.jl](https://juliastats.org/Distributions.jl/stable/) package, the code, while very generic, is both expressive and fast! (Have a look at the [Benchmark section](https://dmetivie.github.io/ExpectationMaximization.jl/dev/benchmarks/))

## What type of mixtures?

In particular, it works on a lot of mixtures:

- Mixture of Univariate continuous distributions
@@ -14,9 +15,13 @@ In particular, it works on a lot of mixtures:
- Mixture of mixtures (univariate or multivariate and continuous or discrete)
- More?

Note that [Distributions](https://juliastats.org/Distributions.jl/stable/) *currently* does not allow `MixtureModel` to both have discrete and continuous components (but what does that? Rain).
## What EM algorithm?

So far the classic EM algorithm and the Stochastic EM are implemented. Look at the [Bibliography section](https://dmetivie.github.io/ExpectationMaximization.jl/dev/biblio) for references.

## How?

Just define a [`mix::MixtureModel`](https://juliastats.org/Distributions.jl/stable/mixture/) and do `fit_mle(mix, y)` with your data `y` and that's it!
Just define a [`mix::MixtureModel`](https://juliastats.org/Distributions.jl/stable/mixture/) and do `fit_mle(mix, y)` where `y` is your observation array (vector or matrix). That's it! For Stochastic EM, just do `fit_mle(mix, y, method = StochasticEM())`.
**Have a look at the [Examples](https://dmetivie.github.io/ExpectationMaximization.jl/dev/examples/#Examples) section**.
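
For instance, here is a minimal sketch (the components, parameters, and data below are made up for illustration):

```julia
using Distributions, ExpectationMaximization

# Data drawn from some "true" two-component Gaussian mixture (illustrative).
y = rand(MixtureModel([Normal(0, 1), Normal(4, 2)], [0.3, 0.7]), 10_000)

# Initial guess: the parameters of the instance are the EM starting point.
mix_guess = MixtureModel([Normal(-1, 1), Normal(1, 1)], [0.5, 0.5])

mix_fitted = fit_mle(mix_guess, y)                           # classic EM
mix_sfitted = fit_mle(mix_guess, y, method = StochasticEM()) # Stochastic EM
```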

To work, the only requirements are that the components of the mixture `dist ∈ dists = components(mix)` considered (custom or coming from an existing package)
23 changes: 11 additions & 12 deletions docs/src/benchmarks.md
@@ -1,33 +1,32 @@

# Benchmarks

I was inspired by [this benchmark](https://floswald.github.io/post/em-benchmarks/).
I am not too sure how to do 100% fair comparisons across languages[^1].
There is a small overhead for using `PyCall` and `RCall`. I checked that it was small in my experiments (~ a few milliseconds).

I test only Gaussian Mixture since it is the most common type of mixture (remembering that this packages allow plenty of mixtures, and it should be fast in general).
I test only the Gaussian Mixture case since it is the most common type of mixture (remember that this package allows plenty of other mixtures).

In my code, I did not use fancy programming tricks, the speed only comes from Julia, `LogExpFunctions.jl` for `logsumexp!` function and `fit_mle` for each distribution's coming from `Distributions.jl` package.
In the code, I did not use (too many) fancy programming tricks; the speed simply comes from Julia, the `logsumexp!` function from `LogExpFunctions.jl`, and the `fit_mle` of each component distribution coming from the `Distributions.jl` package.
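
As a rough illustration of the `logsumexp` point (a sketch, not the package's actual internals), the E-step responsibilities can be computed stably in log-space:

```julia
using Distributions, LogExpFunctions

# Sketch of a numerically stable E-step: γ[i, k] = P(component k | observation yᵢ).
function responsibilities(mix::MixtureModel, y::AbstractVector)
    α, dists = probs(mix), components(mix)
    LL = [log(α[k]) + logpdf(dists[k], y[i]) for i in eachindex(y), k in eachindex(α)]
    c = logsumexp(LL; dims = 2)  # per-observation log-likelihood, without overflow
    return exp.(LL .- c)
end
```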

## Univariate Gaussian mixture with 2 components

I compare with [Sklearn.py](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture)[^2], [mixtool.R](https://cran.r-project.org/web/packages/mixtools/index.html), and [mixem.py](https://mixem.readthedocs.io/en/latest/index.html)[^3].
I wanted to try [mclust](https://cloud.r-project.org/web/packages/mclust/vignettes/mclust.html), but I did not manage to specify the initial conditions.

Overall, [mixtool.R](https://cran.r-project.org/web/packages/mixtools/index.html) and [mixem.py](https://mixem.readthedocs.io/en/latest/index.html) were constructed in a similar spirit as this package hence easy to use for me. [Sklearn.py](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture) is build to fit the Sklearn format (all in one). `GaussianMixturesModel.jl` is build with a similar vibe.
Overall, [mixtool.R](https://cran.r-project.org/web/packages/mixtools/index.html) and [mixem.py](https://mixem.readthedocs.io/en/latest/index.html) were constructed in a similar spirit as this package, making them easy to use for me. [Sklearn.py](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture) is built to match the Sklearn format (all in one). `GaussianMixturesModel.jl` is built with a similar vibe.

If you have comments to improve these benchmarks, comments are welcomed.
If you have comments to improve these benchmarks, they are welcome.

You can find the benchmark code in [here](https://github.com/dmetivie/ExpectationMaximization.jl/tree/master/benchmarks/benchmark_v1_K2_unidim.jl).
You can find the benchmark code [here](https://github.com/dmetivie/ExpectationMaximization.jl/tree/master/benchmarks/benchmark_v1_K2_unidim.jl).

**Conclusion: for Gaussian mixture, `ExpectationMaximization.jl` is about 2 to 10 times faster than Python or R implementations** and about as fast as the specialized Julia package `GaussianMixturesModel.jl`.
**Conclusion: for Gaussian mixtures, `ExpectationMaximization.jl` is about 2 to 10 times faster than Python or R implementations** and about as fast as the specialized Julia package `GaussianMixturesModel.jl`.

![timing_K_2_rudimentary_wo_memory_leak](https://user-images.githubusercontent.com/46794064/227195619-c75b9276-932b-4029-8b49-6cce919acc87.svg)

[^1]: Note that `@btime` with `RCall` and `PyCall` might produce a small-time overhead compare to the true R/Python time see [here for example](https://discourse.julialang.org/t/benchmarking-julia-vs-python-vs-r-with-pycall-and-rcall/37308).
I did compare with `R` `microbenchmark` and Python `timeit` and it produces very similar timing but in my experience `BenchmarkTools` is smarter and simpler to use, i.e. it will figure out alone the number of repetition to do in function of the run.
[^1]: Note that `@btime` with `RCall` and `PyCall` might produce a small time overhead compared to the pure R/Python time; see [here for example](https://discourse.julialang.org/t/benchmarking-julia-vs-python-vs-r-with-pycall-and-rcall/37308).
I did compare with R's `microbenchmark` and Python's `timeit`, and they produced very similar timings; but in my experience `BenchmarkTools.jl` is smarter and simpler to use, i.e., it figures out on its own how many repetitions to perform depending on the run time.

[^2]: There is a suspect triggers warning regarding K-means which I do not want to use here. I asked a question [here](https://github.com/scikit-learn/scikit-learn/discussions/25916). It lead to [this issue](https://github.com/scikit-learn/scikit-learn/issues/26015) and [that PR](https://github.com/scikit-learn/scikit-learn/pull/26021). Turns out even if intial condition were provided K-mean were still computed. However to this day 23-11-29 with `scikit-learn 1.3.2` it still get the warning. Maybe it will be in the next release? I also noted this recent [PR](https://github.com/scikit-learn/scikit-learn/pull/26416).
Last, the step by step likelihood of `Sklearn` is not the same as outputted by `ExpectationMaximization.jl` and [mixtool.R](https://cran.r-project.org/web/packages/mixtools/index.html) (both agree), so I am a bit suspicious.
[^2]: There is a suspicious warning triggered regarding K-means, which I do not want to use here. I asked a question [here](https://github.com/scikit-learn/scikit-learn/discussions/25916). It led to [this issue](https://github.com/scikit-learn/scikit-learn/issues/26015) and [that PR](https://github.com/scikit-learn/scikit-learn/pull/26021). It turns out that even if initial conditions were provided, K-means was still computed. However, to this day (23-11-29) with `scikit-learn 1.3.2`, I still get the warning. Maybe the fix will land in the next release? I also noted this recent [PR](https://github.com/scikit-learn/scikit-learn/pull/26416).
Last, the step-by-step likelihood of `Sklearn` is not the same as the one output by `ExpectationMaximization.jl` and [mixtool.R](https://cran.r-project.org/web/packages/mixtools/index.html) (which agree with each other), so I am a bit suspicious.

[^3]: It overflows very quickly for $n>500$ or so. I think it is because of naive implementation of [`logsumexp`](https://github.com/sseemayer/mixem/blob/2ffd990b22a12d48313340b427feae73bcf6062d/mixem/em.py#L5). So I eventually did not include the result in the benchmark.
[^3]: It overflows very quickly for $n>500$ or so. I think it is because of the implementation of [`logsumexp`](https://github.com/sseemayer/mixem/blob/2ffd990b22a12d48313340b427feae73bcf6062d/mixem/em.py#L5). So I eventually did not include the result in the benchmark.
1 change: 0 additions & 1 deletion docs/src/biblio.md
@@ -1,4 +1,3 @@

# Bibliography

## Theory
1 change: 0 additions & 1 deletion docs/src/examples.md
@@ -1,4 +1,3 @@

# Examples

```@example 1
1 change: 0 additions & 1 deletion docs/src/fit_mle.md
@@ -1,4 +1,3 @@

# [Instance vs Type version](@id InstanceVType)

This package relies heavily on the types and methods defined in `Distributions.jl`, e.g., `fit_mle` and `logpdf`.
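
As a hedged sketch of the distinction (assuming, as this section develops, that the instance version uses the instance's parameters as the starting point that EM needs):

```julia
using Distributions, ExpectationMaximization

x = randn(1000)

# Type version, from Distributions.jl: no starting point involved.
fit_mle(Normal, x)

# Instance version: the instance carries initial parameters, which is the form
# the EM algorithm needs for a `MixtureModel`.
fit_mle(Normal(0.0, 1.0), x)
```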
10 changes: 8 additions & 2 deletions docs/src/index.md
@@ -1,8 +1,10 @@

# ExpectationMaximization.jl

This package provides a simple implementation of the Expectation Maximization (EM) algorithm used to fit mixture models.
Thanks to [Julia](https://julialang.org/)'s amazing [dispatch](https://www.youtube.com/watch?v=kc9HwsxE1OY) system, its spirit of generic and reusable code, and the [Distributions.jl](https://juliastats.org/Distributions.jl/stable/) package, the code, while very generic, is both expressive and fast[^1]!

## What type of mixtures?

In particular, it works on a lot of mixtures:

- Mixture of Univariate continuous distributions
@@ -11,7 +13,7 @@ In particular, it works on a lot of mixtures:
- Mixture of mixtures (univariate or multivariate and continuous or discrete)
- More?

Note that [Distributions](https://juliastats.org/Distributions.jl/stable/) *currently* does not allow `MixtureModel` to both have discrete and continuous components (but what does that? Rain).
## How?

Just define a [`mix::MixtureModel`](https://juliastats.org/Distributions.jl/stable/mixture/) and do `fit_mle(mix, y)` with your data `y` and that's it!
**Have a look at the [examples](@ref Examples) section**.
@@ -27,7 +29,11 @@ In case 2. is not explicitly known, you can always implement a numerical scheme, i
Or, when possible, represent your “difficult” distribution as a mixture of simple terms.
(I had [this](https://stats.stackexchange.com/questions/63647/estimating-parameters-of-students-t-distribution) in mind, but it is not directly a mixture model.)
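
For example, here is a hedged sketch of such a numerical scheme for a hypothetical one-parameter distribution `MyDist(θ)` (the struct, the `Optim.jl` call, and the weighted signature are illustrative assumptions):

```julia
using Distributions, Optim

# Hypothetical one-parameter distribution (an Exponential in disguise), standing
# in for a distribution whose MLE has no closed form.
struct MyDist{T<:Real} <: ContinuousUnivariateDistribution
    θ::T
end
Distributions.logpdf(d::MyDist, x::Real) = x < 0 ? -Inf : -log(d.θ) - x / d.θ

# Requirement 2. provided numerically: maximize the *weighted* log-likelihood,
# since the weighted form is what the EM M-step calls.
# Optimize over log(θ) to keep θ > 0 with an unconstrained solver.
function Distributions.fit_mle(::Type{<:MyDist}, x::AbstractVector, w::AbstractVector)
    negloglik(η) = -sum(w[i] * logpdf(MyDist(exp(η[1])), x[i]) for i in eachindex(x))
    res = optimize(negloglik, [0.0])  # Nelder–Mead from a crude initial guess
    return MyDist(exp(Optim.minimizer(res)[1]))
end
```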

!!! note
    [Distributions.jl](https://juliastats.org/Distributions.jl/stable/) *currently* does not allow a `MixtureModel` to have both discrete and continuous components[^2].

[^1]: Have a look at the [Benchmark section](@ref Benchmarks).
[^2]: Rain is a good example of a mixture having both a discrete component (a `Delta` distribution at `0`) and a continuous one (`Exponential`, `Gamma`, ...).

## Algorithms

