
Git information is split into events #44

Open
dicortazar opened this issue Mar 18, 2024 · 5 comments
Assignees
Labels
GrimoireLab 2.x scalability Tickets related to scalability topics

Comments


dicortazar commented Mar 18, 2024

Context

  • Task goal: define the technical needs to scale the technology to 3.5K high-activity repositories. This mainly covers improvements and development in the areas of operations and scalability.
  • Scope: the initial scope covers only Git repositories at the retrieval and enrichment phases. It may also include the first gathering process, although that step can face other difficulties, such as getting banned by certain platforms.
  • Definition of done: when the first deployment of Bitergia Analytics is ready to go, downloading and/or enriching the 3.5K repositories should take no more than half a day in total.

Task Description

There is a need to offer internal data scientists a new API based on events. These events will come from any of the different data sources; initially, it should work with Git commits.

  • Definition of done: a simple (type, content, date) API response would be enough, but it must work with the 3.5K repositories described above and comply with the data completeness requirement stated earlier.
    • This is provided through ~/api/
    • Examples, as well as documentation on how to use the API, should be provided to third parties.

GrimoireLab tickets

@dicortazar dicortazar converted this from a draft issue Mar 18, 2024
@dicortazar dicortazar added the scalability Tickets related to scalability topics label Mar 18, 2024
@dicortazar dicortazar moved this to Backlog in Bitergia Analytics Mar 18, 2024
@dicortazar dicortazar moved this to Backlog in Bitergia Analytics Apr 15, 2024
@canasdiaz

In our Monday meeting, @sduenas volunteered to link the tickets in CHAOSS or bitergia-analytics where our team is doing the work.

@jjmerchante jjmerchante moved this from Backlog to Ready in Bitergia Analytics Apr 30, 2024
@sduenas sduenas moved this from Ready to In progress in Bitergia Analytics Jun 14, 2024
@jjmerchante

I have analyzed the execution of workers that produce and consume events, as well as the Redis stream size and the storage backend (OpenSearch for now).

Using Redis Streams, multiple producers can be launched to generate events from repositories. The events will be added to a Redis Stream and consumed by several consumer groups, which will process these events. For the tests, the consumers inserted the events into an OpenSearch database, but there could be other groups.
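To make the capped-stream behavior concrete, here is a minimal in-memory stand-in for the producer/consumer flow described above. This is not GrimoireLab code: it uses a `collections.deque` with a tiny cap to mimic what `XADD` with `MAXLEN` does in Redis, where real producers would call something like `r.xadd("events", event, maxlen=1_000_000)` via redis-py.

```python
from collections import deque

# In-memory stand-in for a capped Redis Stream (XADD with MAXLEN ~).
# Tiny cap for illustration; the real tests used a 1M-event stream.
STREAM_MAXLEN = 5

stream = deque(maxlen=STREAM_MAXLEN)

def produce(event_id):
    """Producer: append an event; the oldest entries fall off past the cap."""
    stream.append({"id": event_id, "type": "commit"})

def consume():
    """Consumer: pop the oldest pending event, as a consumer group would."""
    return stream.popleft() if stream else None

# Produce more events than the cap holds: the earliest ones are trimmed,
# which is why consumers must keep the lag below the stream size.
for i in range(8):
    produce(i)

print(len(stream))        # 5: cap enforced
print(consume()["id"])    # 3: events 0-2 were trimmed away
```

The same trade-off applies to the real stream: a larger cap buys more buffering at the cost of Redis memory.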

The rate at which consumers stored events in OpenSearch was not linear, and several challenges were encountered.

Firstly, in the initial runs, an OpenSearch server was used with OPENSEARCH_JAVA_OPTS configured to 512m, 1g, and 2g. Only with 2g did the issues cease, likely due to the volume of data being inserted.

Secondly, inserting all data into the same index tends to saturate OpenSearch, especially when surpassing 8 million items. In the latest tests, each consumer was set to insert data into a separate index, which improved OpenSearch performance, reduced memory usage, and decreased insertion times.
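A trivial sketch of the per-consumer index split mentioned above; the `events-<id>` naming scheme is an assumption for illustration, not the scheme used in the tests.

```python
def index_for(consumer_id, base="events"):
    """Route each consumer's inserts to its own index (hypothetical naming).
    Spreading writes across indices avoided saturating OpenSearch past
    ~8 million items in a single index during the tests."""
    return f"{base}-{consumer_id}"

print(index_for(3))  # events-3
```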

The right number of OpenSearch consumers relative to the number of producers depends on OpenSearch's capacity to process and store events. A ratio of 20 producers to 15 consumers worked correctly, but issues arose after reaching 9 million items. In subsequent tests with a 20:30 ratio, no problems were encountered, and the lag (events pending consumption) remained nearly zero at all times, except for small peaks of around 10k, well below the 1M stream size used during the tests.
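The ratio reasoning boils down to a steady-state check: aggregate consumption must keep up with aggregate production, and any transient lag peak must stay below the stream cap. A back-of-the-envelope version, with hypothetical per-worker rates (the tests above did not report per-worker throughput):

```python
# HYPOTHETICAL per-worker rates, for illustration only.
producers, consumers = 20, 30
produce_rate = 100   # events/s per producer (assumption)
consume_rate = 80    # events/s per consumer (assumption)

in_rate = producers * produce_rate     # events/s into the stream
out_rate = consumers * consume_rate    # events/s out of the stream

# With out_rate >= in_rate, lag stays bounded; any transient peak
# must also remain below the stream cap (1M) or old events get trimmed.
print(in_rate, out_rate)      # 2000 2400
print(out_rate >= in_rate)    # True: this 20:30 ratio keeps up
```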

The event stream size was set to 1M. The stream must be capped because it consumes memory. On average, each event takes about 2.6 KB (measured with MEMORY USAGE). Once the stream reaches 1M events, new events start replacing the oldest ones, so consumers must process them before that happens; the stream size should provide enough buffering for this. I found some posts where people compress the data with gzip before inserting it to use less memory, but I haven't tested that.
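Using the ~2.6 KB/event figure and the 1M cap, a quick estimate of the Redis memory a full stream would hold:

```python
# Estimate of Redis memory held by the capped stream, using the
# ~2.6 KB/event figure from MEMORY USAGE and the 1M-event cap.
events = 1_000_000
bytes_per_event = 2.6 * 1024

total_gib = events * bytes_per_event / 1024**3
print(round(total_gib, 2))  # ~2.48 GiB resident for a full stream
```

So the cap effectively budgets roughly 2.5 GiB of Redis memory for buffering; gzip-compressing event payloads would shrink the per-event figure accordingly.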

Another important point concerns the producers. For Git repositories, there is a memory consumption limit: cloning large repositories consumes memory, usually only during the initial clone. Therefore, the number of workers per machine should be limited to prevent the machine from reaching 100% memory usage. Large Mozilla repositories, such as https://github.com/mozilla/gecko-dev, used up to 2.3 GB, while smaller ones used around 100 MB. On a machine with 8 GB of free memory, if 20 workers are simultaneously downloading 3 large and 17 small repositories, the machine will likely run out of memory, potentially killing processes running on it.
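The worst-case arithmetic behind that worker-pool warning, using the clone-memory figures from the tests above:

```python
# Worst-case memory check for a worker pool, using the figures from
# the tests: ~2.3 GB per large clone (gecko-dev), ~0.1 GB per small one.
free_gb = 8.0
large_clones, small_clones = 3, 17   # simultaneous downloads

need_gb = large_clones * 2.3 + small_clones * 0.1
print(round(need_gb, 1))   # 8.6 GB needed
print(need_gb > free_gb)   # True: 20 concurrent workers would exhaust memory
```

This is why the worker count per machine should be sized against the largest repositories it may clone, not the average.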

To replicate the analysis:

  1. Run MariaDB, Redis, and OpenSearch.
  2. Create the fetch tasks for all the repositories from Mozilla (2.4K repositories).
  3. Run 30 consumers with `grimoirelab run opensearch-consumer https://admin:admin@localhost:9200 <index_name>`.
  4. Run 20 workers to execute the fetch tasks: `grimoirelab run workerpool --num-workers 20`.
  5. Monitor the status: `htop` to check the memory, `XINFO GROUPS events` from `redis-cli` to know the lag and total events processed by the consumers, and `python manage.py rqstats` to know the number of pending jobs.


sduenas commented Nov 13, 2024

These features are now available on grimoirelab-core main branch: chaoss/grimoirelab-core@8d19bd8

@dicortazar

This looks promising :). Thanks @jjmerchante !

I think we're still missing the Definition of Done here, or I may have missed it; sorry in advance ;).

From the definition of done section in this ticket:

  • Can I have an API I can access? Where is it?
  • Where is the documentation to access that API?
  • Where are the examples I can check?

Thanks!

@dicortazar

The new definition of done will state that this task focuses on eventizing the JSON docs, not on creating an API.

@dicortazar dicortazar moved this from PO review to Done in Bitergia Analytics Jan 13, 2025