diff --git a/README.md b/README.md
index 78fe7b5..75f4648 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,3 @@
-
 # ExpectationMaximization
 
 [![Docs](https://img.shields.io/badge/docs-dev-blue.svg)](https://dmetivie.github.io/ExpectationMaximization.jl/dev)
@@ -6,6 +5,8 @@
 This package provides a simple implementation of the Expectation Maximization (EM) algorithm used to fit mixture models.
 Due to [Julia](https://julialang.org/) amazing [dispatch](https://www.youtube.com/watch?v=kc9HwsxE1OY) systems, generic and reusable code spirit, and the [Distributions.jl](https://juliastats.org/Distributions.jl/stable/) package, the code while being very generic is both very expressive and fast! (Have a look at the [Benchmark section](https://dmetivie.github.io/ExpectationMaximization.jl/dev/benchmarks/))
 
+## What type of mixtures?
+
 In particular, it works on a lot of mixtures:
 
 - Mixture of Univariate continuous distributions
@@ -14,9 +15,13 @@ In particular, it works on a lot of mixtures:
 - Mixture of mixtures (univariate or multivariate and continuous or discrete)
 - More?
 
-Note that [Distributions](https://juliastats.org/Distributions.jl/stable/) *currently* does not allow `MixtureModel` to both have discrete and continuous components (but what does that? Rain).
+## What EM algorithm?
+
+So far, the classic EM algorithm and the Stochastic EM are implemented. Look at the [Bibliography section](https://dmetivie.github.io/ExpectationMaximization.jl/dev/biblio) for references.
+
+## How?
 
-Just define a [`mix::MixtureModel`](https://juliastats.org/Distributions.jl/stable/mixture/) and do `fit_mle(mix, y)` with your data `y` and that's it!
+Just define a [`mix::MixtureModel`](https://juliastats.org/Distributions.jl/stable/mixture/) and do `fit_mle(mix, y)` where `y` is your observation array (vector or matrix). That's it! For Stochastic EM, just do `fit_mle(mix, y, method = StochasticEM())`. **Have a look at the [Examples](https://dmetivie.github.io/ExpectationMaximization.jl/dev/examples/#Examples) section**.
 
 To work, the only requirements are that the components of the mixture `dist ∈ dists = components(mix)` considered (custom or coming from an existing package)
diff --git a/docs/src/benchmarks.md b/docs/src/benchmarks.md
index 7216aed..a6a5c00 100644
--- a/docs/src/benchmarks.md
+++ b/docs/src/benchmarks.md
@@ -1,33 +1,32 @@
-
 # Benchmarks
 
 I was inspired by [this benchmark](https://floswald.github.io/post/em-benchmarks/). I am not too sure how to do 100% fair comparisons across languages[^1]. There is a small overhead for using `PyCall` and `RCall`. I checked that it was small in my experimentation (~ few milliseconds?).
 
-I test only Gaussian Mixture since it is the most common type of mixture (remembering that this packages allow plenty of mixtures, and it should be fast in general).
+I test only the Gaussian Mixture case since it is the most common type of mixture (remember that this package allows plenty of other mixtures).
 
-In my code, I did not use fancy programming tricks, the speed only comes from Julia, `LogExpFunctions.jl` for `logsumexp!` function and `fit_mle` for each distribution's coming from `Distributions.jl` package.
+In the code, I did not use (too many) fancy programming tricks; the speed comes only from Julia, from `LogExpFunctions.jl` for the `logsumexp!` function, and from the `fit_mle` of each distribution coming from the `Distributions.jl` package.
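To illustrate the point above, here is a minimal, hypothetical sketch (not the package's actual source) of how `logsumexp` yields numerically stable E-step responsibilities; the function name `e_step_sketch` and the toy mixture are illustrative assumptions only:

```julia
using Distributions, LogExpFunctions

# Hypothetical E-step: compute log-responsibilities log γ[i, k] for each
# observation y[i] and component k without under/overflow.
function e_step_sketch(y, α, dists)
    LL = [log(α[k]) + logpdf(dists[k], y[i]) for i in eachindex(y), k in eachindex(dists)]
    c = logsumexp(LL, dims = 2)  # stable log-likelihood of each observation
    return LL .- c               # each row of exp.(result) sums to 1
end

logγ = e_step_sketch(randn(100), [0.5, 0.5], [Normal(-1, 1), Normal(2, 1)])
```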
 ## Univariate Gaussian mixture with 2 components
 
 I compare with [Sklearn.py](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture)[^2], [mixtool.R](https://cran.r-project.org/web/packages/mixtools/index.html), [mixem.py](https://mixem.readthedocs.io/en/latest/index.html)[^3]. I wanted to try [mclust](https://cloud.r-project.org/web/packages/mclust/vignettes/mclust.html), but I did not manage to specify initial conditions
 
-Overall, [mixtool.R](https://cran.r-project.org/web/packages/mixtools/index.html) and [mixem.py](https://mixem.readthedocs.io/en/latest/index.html) were constructed in a similar spirit as this package hence easy to use for me. [Sklearn.py](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture) is build to fit the Sklearn format (all in one). `GaussianMixturesModel.jl` is build with a similar vibe.
+Overall, [mixtool.R](https://cran.r-project.org/web/packages/mixtools/index.html) and [mixem.py](https://mixem.readthedocs.io/en/latest/index.html) were constructed in a similar spirit to this package, making them easy to use for me. [Sklearn.py](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture) is built to match the Sklearn format (all in one). `GaussianMixturesModel.jl` is built with a similar vibe.
 
-If you have comments to improve these benchmarks, comments are welcomed.
+If you have comments to improve these benchmarks, they are welcome.
 
-You can find the benchmark code in [here](https://github.com/dmetivie/ExpectationMaximization.jl/tree/master/benchmarks/benchmark_v1_K2_unidim.jl).
+You can find the benchmark code [here](https://github.com/dmetivie/ExpectationMaximization.jl/tree/master/benchmarks/benchmark_v1_K2_unidim.jl).
 
-**Conclusion: for Gaussian mixture, `ExpectationMaximization.jl` is about 2 to 10 times faster than Python or R implementations** and about as fast as the specialized Julia package `GaussianMixturesModel.jl`.
+**Conclusion: for Gaussian mixtures, `ExpectationMaximization.jl` is about 2 to 10 times faster than Python or R implementations** and about as fast as the specialized Julia package `GaussianMixturesModel.jl`.
 
 ![timing_K_2_rudimentary_wo_memory_leak](https://user-images.githubusercontent.com/46794064/227195619-c75b9276-932b-4029-8b49-6cce919acc87.svg)
 
-[^1]: Note that `@btime` with `RCall` and `PyCall` might produce a small-time overhead compare to the true R/Python time see [here for example](https://discourse.julialang.org/t/benchmarking-julia-vs-python-vs-r-with-pycall-and-rcall/37308).
-I did compare with `R` `microbenchmark` and Python `timeit` and it produces very similar timing but in my experience `BenchmarkTools` is smarter and simpler to use, i.e. it will figure out alone the number of repetition to do in function of the run.
+[^1]: Note that `@btime` with `RCall` and `PyCall` might produce a small time overhead compared to the pure R/Python time; see [here for example](https://discourse.julialang.org/t/benchmarking-julia-vs-python-vs-r-with-pycall-and-rcall/37308).
+I did compare with `R` `microbenchmark` and Python `timeit` and they produced very similar timings, but in my experience `BenchmarkTools.jl` is smarter and simpler to use, i.e. it figures out on its own the number of repetitions to perform depending on the run.
 
-[^2]: There is a suspect triggers warning regarding K-means which I do not want to use here.
-I asked a question [here](https://github.com/scikit-learn/scikit-learn/discussions/25916). It lead to [this issue](https://github.com/scikit-learn/scikit-learn/issues/26015) and [that PR](https://github.com/scikit-learn/scikit-learn/pull/26021). Turns out even if intial condition were provided K-mean were still computed. However to this day 23-11-29 with `scikit-learn 1.3.2` it still get the warning. Maybe it will be in the next release? I also noted this recent [PR](https://github.com/scikit-learn/scikit-learn/pull/26416).
-Last, the step by step likelihood of `Sklearn` is not the same as outputted by `ExpectationMaximization.jl` and [mixtool.R](https://cran.r-project.org/web/packages/mixtools/index.html) (both agree), so I am a bit suspicious.
+[^2]: There is a suspicious warning triggered regarding K-means, which I do not want to use here. I asked a question [here](https://github.com/scikit-learn/scikit-learn/discussions/25916). It led to [this issue](https://github.com/scikit-learn/scikit-learn/issues/26015) and [that PR](https://github.com/scikit-learn/scikit-learn/pull/26021). It turns out that even if initial conditions were provided, K-means was still computed. However, to this day (23-11-29) with `scikit-learn 1.3.2` it still gets the warning. Maybe it will be fixed in the next release? I also noted this recent [PR](https://github.com/scikit-learn/scikit-learn/pull/26416).
+Last, the step-by-step likelihood of `Sklearn` is not the same as the one outputted by `ExpectationMaximization.jl` and [mixtool.R](https://cran.r-project.org/web/packages/mixtools/index.html) (both agree), so I am a bit suspicious.
 
-[^3]: It overflows very quickly for $n>500$ or so. I think it is because of naive implementation of [`logsumexp`](https://github.com/sseemayer/mixem/blob/2ffd990b22a12d48313340b427feae73bcf6062d/mixem/em.py#L5). So I eventually did not include the result in the benchmark.
+[^3]: It overflows very quickly for $n>500$ or so. I think it is because of the implementation of [`logsumexp`](https://github.com/sseemayer/mixem/blob/2ffd990b22a12d48313340b427feae73bcf6062d/mixem/em.py#L5). So I eventually did not include the result in the benchmark.
diff --git a/docs/src/biblio.md b/docs/src/biblio.md
index 9c350cf..908f2fa 100644
--- a/docs/src/biblio.md
+++ b/docs/src/biblio.md
@@ -1,4 +1,3 @@
-
 # Bibliography
 
 ## Theory
diff --git a/docs/src/examples.md b/docs/src/examples.md
index 5958e01..4f473d7 100644
--- a/docs/src/examples.md
+++ b/docs/src/examples.md
@@ -1,4 +1,3 @@
-
 # Examples
 
 ```@example 1
diff --git a/docs/src/fit_mle.md b/docs/src/fit_mle.md
index b4089c5..d83f4e3 100644
--- a/docs/src/fit_mle.md
+++ b/docs/src/fit_mle.md
@@ -1,4 +1,3 @@
-
 # [Instance vs Type version](@id InstanceVType)
 
 This package relies heavily on the types and methods defined in `Distributions.jl` e.g., `fit_mle` and `logpdf`.
diff --git a/docs/src/index.md b/docs/src/index.md
index 3c46a79..776d1a4 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -1,8 +1,10 @@
-
 # ExpectationMaximization.jl
 
 This package provides a simple implementation of the Expectation Maximization (EM) algorithm used to fit mixture models.
 Due to [Julia](https://julialang.org/) amazing [dispatch](https://www.youtube.com/watch?v=kc9HwsxE1OY) systems, generic and reusable code spirit, and the [Distributions.jl](https://juliastats.org/Distributions.jl/stable/) package, the code while being very generic is both very expressive and fast[^1]!
+
+## What type of mixtures?
+
 In particular, it works on a lot of mixtures:
 
 - Mixture of Univariate continuous distributions
@@ -11,7 +13,7 @@ In particular, it works on a lot of mixtures:
 - Mixture of mixtures (univariate or multivariate and continuous or discrete)
 - More?
 
-Note that [Distributions](https://juliastats.org/Distributions.jl/stable/) *currently* does not allow `MixtureModel` to both have discrete and continuous components (but what does that? Rain).
+## How?
 
 Just define a [`mix::MixtureModel`](https://juliastats.org/Distributions.jl/stable/mixture/) and do `fit_mle(mix, y)` with your data `y` and that's it! **Have a look at the [examples](@ref Examples) section**.
@@ -27,7 +29,11 @@ In case 2. is not explicit known, you can always implement a numerical scheme, i
 Or, when possible, represent your “difficult” distribution as a mixture of simple terms. (I had [this](https://stats.stackexchange.com/questions/63647/estimating-parameters-of-students-t-distribution) in mind, but it is not directly a mixture model.)
 
+!!! note
+    [Distributions.jl](https://juliastats.org/Distributions.jl/stable/) *currently* does not allow `MixtureModel` to have both discrete and continuous components[^2].
+
 [^1]: Have a look at the [Benchmark section](@ref Benchmarks).
+[^2]: Rain is a good example of a mixture having both a discrete component (a `Delta` distribution at `0`) and a continuous one (`Exponential`, `Gamma`, ...).
 
 ## Algorithms
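As a companion to the `fit_mle` usage described in the README and index changes above, a minimal runnable sketch; the true mixture, the initial guess, and the sample size are arbitrary choices for illustration:

```julia
using Distributions, ExpectationMaximization

# Simulated data from a known mixture, standing in for real observations.
y = rand(MixtureModel([Normal(-1.0, 0.5), Normal(2.0, 1.0)], [0.4, 0.6]), 10_000)

# Initial guess: EM starts from these components and weights.
mix_guess = MixtureModel([Normal(0.0, 1.0), Normal(1.0, 1.0)], [0.5, 0.5])

fitted = fit_mle(mix_guess, y)                               # classic EM
fitted_sem = fit_mle(mix_guess, y, method = StochasticEM())  # Stochastic EM
```

The initial guess matters: EM converges to a local maximum of the likelihood, so components should be started at distinct values.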