proposal: math: add Mean, Median, Mode, Variance, and StdDev #69264
In general the math package aims to provide the functions that are in the C++ standard library.
Thanks for the feedback! I get that the math package is meant to mirror the functions in C++'s standard library.
Can you do some detective work to see how people are dealing with this in open source Go now? Is there some go-stats package that has a million stars on GitHub? Are there ten libraries that are each imported five hundred times? Seeing that something already has big demand is important for bringing something that could be in a third-party library into the standard library. Otherwise this will just get closed with "write a third party library." Which has certainly happened to me more than once!
I’ve done some digging into how statistical functions are currently being handled in the Go community. While libraries like Gonum and others provide statistical methods, there's no single source of truth or dominant package in this space, and many are designed for more complex or specialized tasks. However, the basic statistical functions we're proposing, like Mean, Median, Mode, Variance, and StdDev, are general-purpose. By integrating these into the standard library, we'd eliminate the need for external dependencies for basic tasks, which is in line with Go's philosophy of having a strong standard library for common use cases. While third-party packages are an option, including these functions in the standard library would make them available to everyone without extra dependencies.
This is the part where we need to see evidence. Especially considering the existence of libraries like gonum: how often does the need arise for functions like those proposed, where you wouldn't also need the extra functionality that other libraries provide?
For what it's worth, Python has a statistics package in its standard library: https://docs.python.org/3/library/statistics.html It would be nice to have a simple package everyone agrees on for common use cases, but that doesn't necessarily need to be in std.
These functions sound pretty simple, but I think there's actually a lot of subtlety here. For instance, what does Mean return for an empty slice, and what does Median do when the input contains NaN?
In my experience, every time some numeric problem comes up, the gonum lib is suggested. They have a stats package: https://pkg.go.dev/gonum.org/v1/gonum/stat
Yeah, so think about having its functionality in the Go standard library straight away!
The Gonum library is indeed often suggested for statistical and numerical work in Go, and it has a dedicated stats package. However, my proposal is focused on adding foundational statistical functions like Mean, Median, Mode, Variance, and StdDev, which cover everyday needs without pulling in a full numerical library.
IMHO these functions would be very useful in the standard library, even if (or indeed, because) the implementation requires some care. There are many "quick" uses of these basic stats operations in testing, benchmarking, and writing CL descriptions that shouldn't require a heavyweight dependency on a fully-featured third-party stats library. (I often end up moving data out of my Go program to the shell and running the github.com/nferraz/st command.) Another function I would like is Percentile(n, series), which reports the nth percentile value of a given series.
If it belongs in the standard library at all, a separate package such as math/stats seems more appropriate than package math itself.
Here is a small experience report with existing stats packages: in some code I was using gonum's stats package, and a collaborator started using github.com/montanaflynn/stats as well, whose API returns an error (which I felt was annoying). Luckily, I caught the unnecessary dependency in code review. These are the types of things that can easily cause unnecessary dependencies to get added to projects. Hence, I think adding common statistics functions would be a great addition to std.
It seems like a lot of developers will benefit from this!
Can I know the update on this proposal?
The proposal review committee will likely look at it this week. It usually takes a few rounds to reach a final decision.
OK, cool!
Can I know the update on this proposal, please?
Sorry, we didn't get to it last week, but perhaps will this week.
Yes, please.
Some of the questions raised in the meeting were:
Thanks for the feedback! I totally get the concerns and here’s my take:
Overall, I think this keeps the package lightweight, practical, and easy to use, which should be the priority. Looking forward to hearing your thoughts!
I think the goal of limiting the scope would be to ensure that these (other than Percentile) are not potential additions. ;-) I agree that a slice result for Mode seems appropriate. Perhaps it should be called Modes.
I'm not sure about the proposed API. Specifically, it seems to me that these should arguably take an iter.Seq[float64] rather than a slice. So to me, this API only really makes sense for small data series, where the cost of looping multiple times is negligible and/or you are fine with pre-allocating them. An API to remedy that is arguably too complex for the stdlib. Are we okay with that limitation? If so, should we still make the arguments iterators?
Another design would be a single value with methods for all the various stats, whose factories take the slice or sequence (and any weights). That way it could do any sorting or failing on NaN upfront and cache any intermediate values required by multiple stats. Something like:

```go
stats, err := statistics.For(floats)
// handle err
fmt.Println(stats.Mean(), stats.Max(), stats.Median())
```
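As a rough sketch of what such a value-with-methods design might look like (the statistics.For name comes from the comment above; everything else here is hypothetical, with caching limited to the sum and the sorted order):

```go
package statistics

import (
	"errors"
	"math"
	"sort"
)

// Stats holds a sorted copy of the data plus cached intermediates,
// so shared work (sorting, summing) happens once.
type Stats struct {
	sorted []float64
	sum    float64
}

// For takes ownership of x, fails fast on NaN, and sorts the data
// upfront so order statistics are cheap later.
func For(x []float64) (*Stats, error) {
	if len(x) == 0 {
		return nil, errors.New("statistics: empty input")
	}
	s := &Stats{sorted: x}
	for _, v := range x {
		if math.IsNaN(v) {
			return nil, errors.New("statistics: NaN in input")
		}
		s.sum += v
	}
	sort.Float64s(s.sorted)
	return s, nil
}

func (s *Stats) Mean() float64 { return s.sum / float64(len(s.sorted)) }
func (s *Stats) Max() float64  { return s.sorted[len(s.sorted)-1] }

func (s *Stats) Median() float64 {
	n := len(s.sorted)
	if n%2 == 1 {
		return s.sorted[n/2]
	}
	return (s.sorted[n/2-1] + s.sorted[n/2]) / 2
}
```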
Though an …
There's a tantalizing idea here that perhaps one could just call fmt.Println on such a stats value and have it print a useful summary.
@adonovan it could not print percentiles, or print just quartiles, and you have to ask if you need something more specific. My main thought with the API is that it makes it clear that it's taking ownership. I'm guessing in most cases you want more than one stat at a time, so if it can cache some intermediary value that gets used for more than one stat, or speed things up by storing the numbers in a special order or data structure, that's a nice bonus. I don't know what specific numerical methods are used for stats, but I imagine there could be some savings by caching the sum or just knowing if there's a +Inf in there somewhere.
To be clear: that is the opposite of what I was trying to point out :) It requires multiple passes to calculate both the Mean and the Stddev with the proposed API, regardless of whether the values are given as an iter.Seq or a slice. And the promise has been that iter.Seq works even for sequences that cannot be cheaply materialized or re-iterated.
Sorry, long day, tired brain. I don't really have a strong feeling about iterator vs slice. My initial feeling was that the slice was simpler, but perhaps we should embrace iterators so that all sequences can be supplied with equal convenience.
That's true, but the point I was trying to make was that we are unlikely to be able to correctly anticipate the exact set of operations that we should compute in a single pass. Should it be mean, median, and the 90th percentile? What about the 95th or 99th? And so on. So, I argue for separate operators, each taking an iter.Seq[float64].
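For illustration, a minimal single-pass Mean over an iter.Seq (iter.Seq exists as of Go 1.23; the stats package name and the panic-on-empty behavior are assumptions carried over from the discussion):

```go
package stats

import "iter"

// Mean consumes seq exactly once and returns the arithmetic mean.
// It panics on an empty sequence, mirroring the slice-based proposal.
func Mean(seq iter.Seq[float64]) float64 {
	var sum float64
	var n int
	for v := range seq {
		sum += v
		n++
	}
	if n == 0 {
		panic("stats: empty sequence")
	}
	return sum / float64(n)
}
```

A slice caller would then write Mean(slices.Values(x)), so any kind of sequence can be supplied through the same entry point.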
It seems to me that this discussion about using iterators and collecting multiple results in a single pass is circling around the idea of a generic fold/reduce mechanism over iterators, with the statistics operations discussed here being a set of predefined combining functions to use with that mechanism. Someone who wants to compute multiple at once could then presumably write their own combining function that wraps multiple others and produces a struct or map type (with one field/element per inner function) as its result. I will say immediately that I'm not super convinced that such complexity is justified, but if we think that combining multiple operations over a single iterator is something we want to support then I'd wonder what that would look like as a more general facility. EDIT: After posting this I immediately found #61898, which proposes to add iterator adapters (including Reduce) in x/exp/xiter.
I am glad that you said that immediately. ;-)
This reminds me of a certain Google interview question from years back: how do you estimate the median value of a long stream with only finite working store? Any loop over a sequence can be expressed as a wrapper around a call to Reduce, but it is often neither clearer nor more efficient to do so. We absolutely should not require users of the new stats package to hold such higher-order concepts in mind. |
I should've said that my main intention in my earlier comment was to respond to the ideas around calculating multiple of these functions at the same time over a given sequence, not to the original proposal for separate functions. Concretely what I was thinking about was:
I intend the last item here to be an alternative to offering in this package any specialized API for calculating multiple aggregates together; in particular, an alternative to the statistics.For-style struct API discussed above. I'm proposing this only if there's consensus that supporting the use of multiple functions over a single sequence in only one pass is a requirement. If we can convince ourselves that it isn't a requirement then I don't think this complexity is justified. I expect that the original proposal's functions, potentially but not necessarily recast as taking iter.Seq[float64], would then suffice.
To clarify: methods could cache any intermediary calculations that other stats may need, so they don't need to be computed twice if you need two stats that depend on the same value. If, as part of storing and preparing the info, it could easily calculate and cache a few basic stats while it's at it, that's certainly a nice bonus, but that would be an implementation detail. Whether that makes sense in some part depends on what operations there will be (now and in the future), the methods for calculating them, and how many calculations can be shared between them. It could also have multiple factories, one for slices and one for iterators, so you could work easily with either without having to have a seq and a slice version of each operation.
I firmly believe it should not be a requirement and that such complexity is unwarranted. The goal for this package is to provide simple implementations of the most well known of all statistical functions. I imagine a typical usage will be to print a summary of results in a benchmarking scenario. The cost of computing the statistics will be insignificant.
I would recommend, if we include Quantile, doing what both Python and R do and accepting a list of quantiles to be computed. I admit this is purely anecdotal, but I can't really recall a situation in which I had to compute a single quantile.
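A multi-quantile API lets all requested quantiles share a single sort. A sketch under assumptions (non-empty x, each q in [0, 1], and linear interpolation, i.e. R's default "type 7" definition, which is only one of several common choices):

```go
package stats

import "sort"

// Quantiles sorts a copy of x once, then answers each requested
// quantile by linear interpolation between adjacent order statistics.
func Quantiles(x []float64, qs ...float64) []float64 {
	sorted := append([]float64(nil), x...)
	sort.Float64s(sorted)
	out := make([]float64, len(qs))
	for i, q := range qs {
		pos := q * float64(len(sorted)-1) // fractional index into sorted
		lo := int(pos)
		hi := lo
		if lo+1 < len(sorted) {
			hi = lo + 1
		}
		frac := pos - float64(lo)
		out[i] = sorted[lo]*(1-frac) + sorted[hi]*frac
	}
	return out
}
```

With this shape, Quantiles(x, 0.5, 0.9, 0.99) costs one O(n log n) sort plus O(1) per extra quantile.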
I think these should all take slices. Slices are faster and simpler. Just because we have iterators doesn't mean we should stop using slices in APIs: slices should still be the default, unless there's a good justification for using an iterator. In this case, if people have enough data that they must stream it, they should probably be using something more specialized.
Right.
I agree. Let's leave Mode out.

It seems like, if we're going to have standard deviation and variance, we need both population and sample versions. It would be nice to have a stats expert weigh in on including both population and sample variance, and on the question of which quantile definition to use. @adonovan is going to see about getting input from a stats expert, but any other experts should feel free to weigh in.

So, I believe that leaves us at the following API for package math/stats:

```go
func Mean(x []float64) float64
func Median(x []float64) float64
func Quantiles(x []float64, quantiles []float64) []float64
func SampleStdDev(x []float64) float64
func SampleVariance(x []float64) float64
func PopulationStdDev(x []float64) float64
func PopulationVariance(x []float64) float64
```

This leaves some open questions:
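To make the shape concrete, here is what typical usage might look like if this API landed as proposed (the math/stats import path is an assumption from the discussion; none of this exists yet):

```go
package main

import (
	"fmt"

	"math/stats" // hypothetical: the proposed package does not exist yet
)

func main() {
	// e.g. benchmark latencies in milliseconds
	latencies := []float64{12.1, 9.8, 15.3, 11.0, 10.4, 13.7}

	fmt.Printf("mean:   %.2f\n", stats.Mean(latencies))
	fmt.Printf("median: %.2f\n", stats.Median(latencies))

	qs := stats.Quantiles(latencies, []float64{0.5, 0.9, 0.99})
	fmt.Printf("p50=%.2f p90=%.2f p99=%.2f\n", qs[0], qs[1], qs[2])

	fmt.Printf("sample stddev: %.2f\n", stats.SampleStdDev(latencies))
}
```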
If we've settled on slices, then back to an earlier question: should they be generic, like

```go
func Mean[Slice ~[]E, E ~float32 | ~float64](x Slice) E
```

or at least

```go
func Mean[Slice ~[]E, E ~float64](x Slice) E
```

Also, could the quantiles param of Quantiles be variadic?
We should not use generics; package math is float64-only. @adonovan is still trying to find out what specific algorithms we should be using.
Would package math be float64-only if it were designed today? It's entirely reasonable either way, imo. However, I'd hope math/bits would use generics if designed today (or v2'd). If I had a slice of float32 values, I'd have to copy it into a []float64 just to call these functions.
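For illustration, this is the up-front copy a float64-only API forces on float32 data (the mean32 helper is hypothetical, and the stats.Mean call assumes the proposed package):

```go
// mean32 shows the O(n) conversion a []float32 caller would pay
// before calling a float64-only API like the proposed stats.Mean.
func mean32(x []float32) float64 {
	tmp := make([]float64, len(x))
	for i, v := range x {
		tmp[i] = float64(v)
	}
	return stats.Mean(tmp) // hypothetical math/stats call
}
```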
Two comments:
The only time the difference between population and sample standard deviation matters in practice is when n is very small. R, for example, always divides by (n-1) for both its base library var() and sd() functions (variance and standard deviation, respectively).
If one is computing the standard deviation, almost always one also wants the mean too. It makes me cringe to think I'd have to do two passes to get both; so much so that I would avoid using the standard library functions if it forced this. In place of SampleStdDev() and SampleVariance() (the latter seems redundant) I would just have a single MeanSd() func that returns both the mean and the sample standard deviation from a single pass. For example:

```go
// MeanSd returns the mean and sample standard
// deviation of the values in x, in a single pass.
func MeanSd(x []float64) (mean, sd float64)
```

I've provided a simple one-pass implementation of this here: https://github.com/glycerine/stats-go
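A textbook one-pass version of such a function uses Welford's online update for numerical stability. This is a generic sketch of that technique, not necessarily the linked repo's implementation:

```go
package stats

import "math"

// MeanSd returns the mean and sample standard deviation of x in a
// single pass, using Welford's online algorithm: m tracks the running
// mean and m2 the running sum of squared deviations from it.
func MeanSd(x []float64) (mean, sd float64) {
	var m, m2 float64
	for i, v := range x {
		delta := v - m
		m += delta / float64(i+1)
		m2 += delta * (v - m)
	}
	if len(x) < 2 {
		return m, 0
	}
	return m, math.Sqrt(m2 / float64(len(x)-1)) // n-1: sample variance
}
```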
For quantile computation, it is hard to provide an efficient, exact, online algorithm. Almost always you need an online algorithm for your statistics to avoid O(n^2) recomputation as data accumulates. Therefore, most users are going to be better off using an online T-digest implementation like https://github.com/caio/go-tdigest. Unless someone has a better algorithm or a clever way to get the exact quantiles online, I would leave Quantile out. Or the standard library could bring in and polish one of the T-digest implementations for Quantile and CDF (cumulative distribution function) computation. That would also be nice.
Thanks for weighing in, @glycerine!
Thanks! This all makes sense and certainly simplifies things. So instead of

```go
func SampleStdDev(x []float64) float64
func SampleVariance(x []float64) float64
func PopulationStdDev(x []float64) float64
func PopulationVariance(x []float64) float64
```

we'd have just

```go
func MeanAndStdDev(x []float64) (mean, stddev float64)
```
My sense is that T-digests would be beyond the scope of a small descriptive stats standard package. We're not trying to replace serious stats packages, just cover really common needs. T-digests are great if you need online quantiles, but they have their own cognitive overhead, especially around understanding how they approximate. My sense is that the common need is that you have a simple slice of data and just want to get a few quantiles. That's certainly been true in my code. I'm also not overly concerned with the performance of these functions. It just has to be "good enough." That's why we're thinking Quantiles would accept a slice of quantiles to compute, because that generally allows for a lot of work sharing and balances it with a simple API. (Side note: we could balance the performance needs a little more here by saying that Quantiles will be faster if you pass it sorted data, but that's not required.)
Please consider a struct for the API. Also, you could use an online variant like in https://www.johndcook.com/skewness_kurtosis.html
As a data point, at work we've created some internal helpers along these lines.
We've discussed this above and it doesn't seem like the right trade-off for a simple descriptive stats API.
Again, it's not clear this is justified in this case. For more advanced stats needs, such as online computation, it's easy enough to pull in an external, more specialized package.
@jimmyfrasche, I feel your pain here. I've run into this exact problem. However, I think it's rare enough that we shouldn't complicate this API for it. @adonovan and I just discussed that it would be worth considering a small language change to allow explicit conversions between []float32 and []float64.
I like the idea of a …
I believe the current proposed API for package math/stats is:

```go
// Mean returns the arithmetic mean of the values in x.
//
// If x is an empty slice, it panics.
// If x contains NaN or both Inf and -Inf, it returns NaN.
// If x contains Inf, it returns Inf. If x contains -Inf, it returns -Inf.
func Mean(x []float64) float64

// MeanAndStdDev returns the arithmetic mean and
// sample standard deviation of x.
//
// If x is an empty slice, it panics.
// If x contains NaN, it returns NaN, NaN.
// If x contains both Inf and -Inf, it returns NaN, Inf.
// If x contains Inf, it returns Inf, Inf. If x contains -Inf, it returns -Inf, Inf.
func MeanAndStdDev(x []float64) (mean, stddev float64)

// Median returns the median of the values in x.
// If len(x) is even, it returns the mean of the two central values.
//
// If x is an empty slice, it panics.
// If x contains NaN, it returns NaN.
// -Inf is treated as smaller than all other values,
// Inf is treated as larger than all other values, and
// -0.0 is treated as smaller than 0.0.
func Median(x []float64) float64

// Quantiles returns a sequence of quantiles of x.
//
// The returned slice has the same length as the quantiles slice,
// and the elements correspond one-to-one.
// A quantile of 0 corresponds to the minimum value in x and
// a quantile of 1 corresponds to the maximum value in x.
//
// TODO: Which quantile algorithm should we use?
// TODO: How should we treat quantiles < 0 or > 1?
//
// If x is an empty slice, it panics.
// If x contains NaN, it returns NaN.
// -Inf is treated as smaller than all other values,
// Inf is treated as larger than all other values, and
// -0.0 is treated as smaller than 0.0.
func Quantiles(x []float64, quantiles ...float64) []float64
```

There are two open questions on Quantiles, noted in the TODOs above.
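As a minimal sketch of a Median consistent with the doc comment above (panic on empty input, NaN propagation, averaging the two central values for even n; it assumes the math and sort imports, and omits the -0.0 < 0.0 refinement, since sort.Float64s treats them as equal):

```go
func Median(x []float64) float64 {
	if len(x) == 0 {
		panic("stats: Median of empty slice")
	}
	// Work on a copy so the caller's slice is untouched.
	sorted := append([]float64(nil), x...)
	for _, v := range sorted {
		if math.IsNaN(v) {
			return math.NaN()
		}
	}
	sort.Float64s(sorted) // orders -Inf first and +Inf last
	n := len(sorted)
	if n%2 == 1 {
		return sorted[n/2]
	}
	return (sorted[n/2-1] + sorted[n/2]) / 2
}
```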
I think we do need some doc about accuracy and overflow. Maybe we say the result is exactly what naive sequential summation would produce?
This should be defined one way or another, but my preference would be to do Kahan summation.
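For reference, Kahan summation tracks the low-order bits lost at each addition in a compensation term and folds them back in. A self-contained sketch (the kahanSum and mean names are illustrative, not proposed API):

```go
// kahanSum adds the values of x with Kahan compensation: c accumulates
// the rounding error of each step so it can be re-added later.
func kahanSum(x []float64) float64 {
	var sum, c float64
	for _, v := range x {
		y := v - c        // apply the running compensation
		t := sum + y      // big + small: low bits of y may be lost
		c = (t - sum) - y // recover what was just lost
		sum = t
	}
	return sum
}

// mean computes the arithmetic mean via compensated summation.
func mean(x []float64) float64 {
	return kahanSum(x) / float64(len(x))
}
```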
I agree with this in principle, but I don't want to lock us in to a particular algorithm when there could be higher-accuracy algorithms with good performance. Even Kahan summation isn't the state of the art here, from what I understand, and summation is only part of the accuracy story. To me, the point of pulling these functions into the standard library is to provide a reasonable default when people don't want to worry about the details. People can always reach for more specialized implementations. Looking to other languages, Python's statistics module is a useful comparison here.
I could see a blanket statement in the package documentation like "These functions aim to balance performance and accuracy, but some amount of error is inevitable in floating-point computations. The underlying implementations may change, resulting in small changes in their results from version to version. If the caller needs particular guarantees on accuracy and overflow behavior or version stability, they should use a more specialized implementation."
Description:
This proposal aims to enhance the Go standard library's math package (math/stats.go) by introducing several essential statistical functions. The proposed functions are Mean, Median, Mode, Variance, and StdDev, and many more.

Motivation:
The inclusion of these statistical functions directly in the math package will offer Go developers robust tools for data analysis and statistical computation, enhancing the language's utility in scientific and financial applications. Currently, developers often rely on external libraries for these calculations, which adds dependencies and potential inconsistencies. Integrating these functions into the standard library will address this.

Design:
The functions will be added to the existing math package, ensuring they are easy to use and integrate seamlessly with other mathematical operations. Detailed documentation and examples will be provided to illustrate their usage and edge case handling.

Examples: