Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BP-65: implement load balance for select bookie #4247

Open
TakaHiR07 opened this issue Mar 27, 2024 · 5 comments
Open

BP-65: implement load balance for select bookie #4247

TakaHiR07 opened this issue Mar 27, 2024 · 5 comments
Assignees

Comments

@TakaHiR07
Copy link
Contributor

TakaHiR07 commented Mar 27, 2024

BP

This is the master ticket for tracking BP-65 :
Proposal PR - #4246

Motivation

One of our clusters have 255 bookies, and we find that bookie's write pressure is very unbalance.

Usually there are several bookies write latency too high, which cause the message publish latency also too high in pulsar broker.

Currently, bookie have quarantine mechanism to deal with this case.

  • when broker select ensemble for a ledger, it uses RackAwarePlacement and random select strategy to decide which bookie should be written.
  • If bookie write error achieve n times, the bookie would be quarantined in broker.
  • To avoid too many broker quarantine a bookie at the same time, we can define quarantineRatio to do quarantine randomly.

However, this mechanism is not good enough to avoid bookie high write latency

  1. Since we select bookie ensemble randomly, ramdom strategy is hard to achieve good performance if we don't have too many bookies or too many ledgers to select.
  2. It is hard to define appropriate n time write error, and quarantineRatio. Because it is hard to map bookie write pressure to these configs.
  3. It is hard to define how long we quarantine a bookie. Because we don't know which time bookie's write pressure has already dropped. Now we can only set the quarantineTime by config.
  4. Now we can only quarantine a bookie when it already has write-pressure, but we can't avoid the write-pressure problem occur in advance.

Proposed Changes

To solve this write pressure problem, we propose to introduce bookie load balance mechanism, which is supplement of current quarantine mechanism.
When we choose ensemble for ledger, if we have load information of all bookies, we can prefer to select the low-load bookie as ensemble.
And we can avoid to write into the high-load bookie, which make the bookie perform worse and cause high write latency.
Therefore, with the bookie load information, we can better avoid high write latency problem occur

We notice that bookie already has DiskWeightBasedPlacement mechanism, which is similar to load balance. We just need to enhance this mechanism,
replace it to LoadWeightBasedPlacement.

The proposed changes involves:

  1. Implement BaseMetricMonitor in bookie server, which would collect bookie load information periodically
  2. bookie client continue to use getBookieInfo restApi to acquire load information from each bookie.
  3. modify the implementation of RackawareEnsemblePlacementPolicyImpl, support select ensemble by LoadWeightBasedPlacement. Since LoadWeightBasedPlacement is an enhancement of DiskWeightBasedPlacement, it would cover the DiskWeightBasedPlacement if feature enable.

BaseMetricMonitor

Now BaseMetricMonitor would collect multiple load metrics, including journal IOUtil, ledger IOUtil, bookie write bytes per second, cpu usage, free disk space, total disk space.
Then we can define the bookie load pressure by these metrics.
Actually for our cluster, bookie load pressure is mainly influenced by journal IOUtil, because we use HDD as journal disk and 1 journal disk is responsible for 3 bookie.

BaseMetricMonitor would collect the metrics per second by default. But we find that some metrics is jittering so much.
So it is necessary to smooth the collected metrics, by calculating average value between 3 seconds.
This 3 second can be modified by config baseMetricMonitorMetricSlideWindowSize

If one bookie contains multiple disks, we calculate the average value.

modification of getBookieInfo restApi

bookie client continue to use getBookieInfo restApi to acquire load information from each bookie.
That means if we enable LoadWeightBasedPlacement, the restApi would contain more information.

  • Previous: getBookieInfo contains totalDiskUsage and freeDiskUsage
message GetBookieInfoResponse {
    required StatusCode status = 1;
    optional int64 totalDiskCapacity = 2;
    optional int64 freeDiskSpace = 3;
}
  • Current: getBookieInfo contains totalDiskUsage, freeDiskUsage, and the other load information
message GetBookieInfoResponse {
    required StatusCode status = 1;
    optional int64 totalDiskCapacity = 2;
    optional int64 freeDiskSpace = 3;
    optional int32 journalIoUtil = 4;
    optional int32 ledgerIoUtil = 5;
    optional int32 cpuUsedRate = 6;
    optional int64 writeBytePerSecond = 7;
}

GetBookieInfoResponse in BookkeeperProtocol would be changed.
If we disable LoadWeightBasedPlacement or the restApi is error because of timeout or throwing exception, the load information would be -1.

And we have tested the pressure of this restApi bringing to cluster. Such as for our cluster, with more than 20 brokers and more than 200 bookies,
the pressure of restApi is still acceptable.

modification of RackawareEnsemblePlacementPolicyImpl

Implement a new strategy to select bookies for LoadWeightBasedPlacement.

The target is :

  • make write pressure more balance on all bookies. (Usually the more throughput, the higher write pressure)
  • avoid some bookies occur high write latency problem. (Usually the IOUtil become higher, write latency would be also higher)

So the designed strategy is :

  • use writeBytesPerSecond as load weight. And use roulette wheel selection, define the bookie selection probability by its load weight. The higher load weight, the less probability bookie to be selected.
  • Selection filter the high load bookie, whose journal IOUtil is higher than threshold. If bookie is already high load, we should not continue to write entry on it. Once its IOUtil decrease, bookie become writable.

To avoid a corner case that so many bookies being filtered, we add config lowLoadBookieRatio. Default if more than half of bookies are filtered, fallback to randomly selection.

Notice that many bookie clients would do ensemble selections separately, the probability of each bookie should not differ too much, or it would cause write incline problem.
So we have probability smooth operation in roulette wheel selection.

Furthermore, different users can implement their own selected strategy based on their production environment.

compatibility

LoadWeightBasedPlacement is a supplemental feature, ledger replication must obey the RackAwarePolicy/RegionAwarePolicy firstly,
and then try to obey LoadWeightBasedPlacement. We can disable the feature by configuration.

Because GetBookieInfo protocol has been changed, this restApi would get error if the version of bookie-server and bookie-client is not the same one. But since this restApi is used only when enable diskWeightBasedPlacement, I think it is no problem for most people, who do not enable diskWeightBasedPlacement in client.

Performance

We have applied LoadWeightBasedPlacement to our production clusters. And the high write latency problem no matter happen.

企业微信截图_91bcf3d9-251f-47a4-9ea5-666e06bb9115
企业微信截图_7ea1190b-3c69-4d5a-967e-5c75dd62b1b0

@thetumbled
Copy link
Member

List some data collected from the production enviroment after we lauch this feature.
A direct way to evaluate the performance of a load balancing algorithm is to look at the indicators we are concerned about, specifically P99 write latency. What is more, we care about the standard deviation of journal disk write throughput , the range of journal disk write throughput, and the top 10 of journal disk write throughput .
Let's take a look at the comparison of these indicators one by one (launched on 1.10):

  • P99 write latency
    image

It can be seen that the P99 write latency has significantly decreased from a peak of more than ten to twenty seconds to less than one second.

  • the standard deviation of journal disk write throughput
    image

The peak standard deviation of journal disk write throughput has decreased from 25MB/s to 21MB/s.

  • the range of journal disk write throughput
    image

The peak write throughput range has decreased from 150MB/s to 130MB/s.

  • Journal disk write throughput top 10
    image

There is also a significant decrease in the top 10 journal disk write throughput.

Could you help to review this BP? @shoothzj @wenbingshen @eolivelli @dlg99 @jiazhai

@eolivelli
Copy link
Contributor

This is a great idea +1

@hangc0276
Copy link
Contributor

I like this idea. I'm reviewing this proposal

@eolivelli
Copy link
Contributor

Can you please start the discussion on the [email protected] mailing list?

@TakaHiR07
Copy link
Contributor Author

Can you please start the discussion on the [email protected] mailing list?

Done, can see here, https://lists.apache.org/thread/ltdls9r55h70mdxc0z8k5qzvzw48nb6d

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants