Git information is split into events #44
Comments
In our Monday meeting, @sduenas volunteered to link the tickets in CHAOSS or bitergia-analytics where our team is doing the work.
I have analyzed the execution of the workers that produce and consume events, as well as the Redis stream size and the storage server (OpenSearch for now).

Using Redis Streams, multiple producers can be launched to generate events from repositories. The events are added to a Redis Stream and consumed by several consumer groups, which process them. For the tests, the consumers inserted the events into an OpenSearch database, but there could be other groups.

The rate at which consumers stored events in OpenSearch was not linear, and several challenges were encountered. Firstly, in the initial runs, an OpenSearch server was used with … Secondly, inserting all data into the same index tends to saturate OpenSearch, especially when surpassing 8 million items. In the latest tests, each consumer was set to insert data into a separate index, which improved OpenSearch performance, reduced memory usage, and decreased insertion times.

The number of OpenSearch consumers relative to the number of producers depends on OpenSearch's capacity to process and store events. A ratio of 20 producers to 15 consumers worked correctly, but issues arose after reaching 9 million items. In subsequent tests with a 20:30 ratio, no problems were encountered, and the lag (events pending consumption) remained nearly zero at all times, except for small peaks of about 10k events, well below the stream size used during the tests.

The event stream size was set to 1M entries. The stream must be limited because it consumes memory: on average, each event takes about 2.6KB (calculated using …).

Another important point concerns the producers. For Git repositories, there is a memory consumption limit: cloning large repositories consumes memory, usually only during the initial clone. Therefore, the number of workers per machine should be limited to prevent the machine from reaching 100% memory usage. Large Mozilla repositories, such as https://github.com/mozilla/gecko-dev, used up to 2.3G, while smaller ones used around 100M. On a machine with 8G of free memory, if 20 workers are running and simultaneously cloning 3 large and 17 small repositories, the machine will likely run out of memory, potentially killing any process running on it.

To replicate the analysis:
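This is not the replication procedure; it is only a minimal sketch, assuming redis-py and opensearch-py, of the producer/consumer layout described above: producers append events to a capped stream, and each consumer in a group bulk-inserts its events into its own OpenSearch index. The stream key, group name, and index prefix are illustrative assumptions, not names taken from grimoirelab-core.

```python
# Hypothetical sketch of the producer/consumer layout described above.
# Stream key, group name, and index prefix are illustrative only.
import json

import redis
from opensearchpy import OpenSearch, helpers

STREAM = "events"           # Redis Stream holding the events
GROUP = "opensearch"        # consumer group that stores events
MAX_STREAM_LEN = 1_000_000  # cap the stream to ~1M entries

r = redis.Redis()


def produce(event: dict) -> None:
    """Append one event to the stream, trimming it to ~1M entries."""
    r.xadd(STREAM, {"data": json.dumps(event)},
           maxlen=MAX_STREAM_LEN, approximate=True)


def consume(consumer_name: str, client: OpenSearch, batch: int = 500) -> None:
    """Read new events for this consumer and bulk-insert them into a
    per-consumer index (one index per consumer reduced saturation)."""
    try:
        r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except redis.ResponseError:
        pass  # group already exists

    index = f"events-{consumer_name}"  # separate index per consumer
    while True:
        entries = r.xreadgroup(GROUP, consumer_name, {STREAM: ">"},
                               count=batch, block=5000)
        if not entries:
            continue
        _, messages = entries[0]
        if not messages:
            continue
        actions = [
            {"_index": index, "_source": json.loads(fields[b"data"])}
            for _, fields in messages
        ]
        helpers.bulk(client, actions)
        r.xack(STREAM, GROUP, *[msg_id for msg_id, _ in messages])
```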
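As a rough sizing note for the stream cap and worker limits above: a full stream of 1M entries at about 2.6KB each is on the order of 2.6GB of Redis memory, and in the 8G example, 3 clones at roughly 2.3G plus 17 at about 100M add up to around 8.6G, which is why the machine can run out of memory.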
These features are now available on the grimoirelab-core main branch: chaoss/grimoirelab-core@8d19bd8
This looks promising :). Thanks @jjmerchante! I think we're still missing the Definition of Done here, or I may have missed it, sorry in advance ;). From the Definition of Done section in this ticket:
Thanks!
The new definition of done will state that this task focuses on eventizing the JSON docs, and not creating an API. |
Context
Task Description
There is a need to offer internal data scientists a new API based on events. These events will come from any of the different data sources. Initially, this should work with Git commits.
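As a purely hypothetical illustration of what a Git commit event might look like once the JSON docs are eventized, the sketch below uses made-up field names; it is not the actual GrimoireLab event schema.

```python
# Hypothetical illustration only: these field names are not the actual
# GrimoireLab event schema, just an example of a commit as an event.
git_commit_event = {
    "id": "c0ffee-0001",                                # unique event id
    "type": "git.commit",                               # event type, per data source
    "source": "https://github.com/mozilla/gecko-dev",   # origin repository
    "time": "2024-05-01T10:20:30+00:00",                # commit timestamp
    "data": {
        "hash": "a1b2c3d4",                             # commit SHA
        "author": "Jane Doe <jdoe@example.com>",
        "message": "Fix crash on startup",
        "files": ["widget/gtk/nsWindow.cpp"],
    },
}
```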
GrimoireLab tickets