
Git information is split into events #44

Open
dicortazar opened this issue Mar 18, 2024 · 5 comments
Assignees
Labels
GrimoireLab 2.x scalability Tickets related to scalability topics

Comments


dicortazar commented Mar 18, 2024

Context

  • Task goal: define the technical needs to scale the technology to 3.5K high-activity repositories. This mainly covers improvements and development in the areas of operations and scalability.
  • Scope: the initial scope covers only Git repositories at the retrieval and enrichment phases. It may also include the first gathering process, although that step can face other difficulties, such as getting banned by certain platforms.
  • Definition of done: when the first deployment of Bitergia Analytics is ready to go, downloading and/or enriching the 3.5K repositories should take no more than half a day in total.

Task Description

There is a need to offer internal data scientists a new API based on events. These events will come from any of the different data sources; initially, it should work with Git commits.

  • Definition of done: a simple (type, content, date) API response would be enough, but it must work with the 3.5K repositories described above and comply with the data completeness requirement stated earlier.
    • This is provided through ~/api/
    • Examples, as well as documentation on how to use the API, should be provided to third parties.

GrimoireLab tickets

@dicortazar dicortazar converted this from a draft issue Mar 18, 2024
@dicortazar dicortazar added the scalability Tickets related to scalability topics label Mar 18, 2024
@dicortazar dicortazar moved this to Backlog in Bitergia Analytics Mar 18, 2024
@dicortazar dicortazar moved this to Backlog in Bitergia Analytics Apr 15, 2024
@canasdiaz

In our Monday meeting, @sduenas volunteered to link the tickets in CHAOSS or bitergia-analytics where our team is doing the work.

@jjmerchante jjmerchante moved this from Backlog to Ready in Bitergia Analytics Apr 30, 2024
@sduenas sduenas moved this from Ready to In progress in Bitergia Analytics Jun 14, 2024
@jjmerchante

I have analyzed the execution of workers that produce and consume events, as well as the Redis stream size and the storage backend (OpenSearch for now).

Using Redis Streams, multiple producers can be launched to generate events from repositories. The events will be added to a Redis Stream and consumed by several consumer groups, which will process these events. For the tests, the consumers inserted the events into an OpenSearch database, but there could be other groups.
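To make the capped-stream behavior concrete, here is a minimal in-memory stand-in for the producer/consumer flow described above. This is not GrimoireLab code: it uses a `collections.deque` with a tiny cap to mimic what `XADD` with `MAXLEN` does in Redis, where real producers would call something like `r.xadd("events", event, maxlen=1_000_000)` via redis-py.

```python
from collections import deque

# In-memory stand-in for a capped Redis Stream (XADD with MAXLEN ~).
# Tiny cap for illustration; the real tests used a 1M-event stream.
STREAM_MAXLEN = 5

stream = deque(maxlen=STREAM_MAXLEN)

def produce(event_id):
    """Producer: append an event; the oldest entries fall off past the cap."""
    stream.append({"id": event_id, "type": "commit"})

def consume():
    """Consumer: pop the oldest pending event, as a consumer group would."""
    return stream.popleft() if stream else None

# Produce more events than the cap holds: the earliest ones are trimmed,
# which is why consumers must keep the lag below the stream size.
for i in range(8):
    produce(i)

print(len(stream))        # 5: cap enforced
print(consume()["id"])    # 3: events 0-2 were trimmed away
```

The same trade-off applies to the real stream: a larger cap buys more buffering at the cost of Redis memory.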

The rate at which consumers stored events in OpenSearch was not linear, and several challenges were encountered.

Firstly, in the initial runs, an OpenSearch server was used with OPENSEARCH_JAVA_OPTS configured to 512m, 1g, and 2g. Only with 2g did the issues cease, likely due to the volume of data being inserted.

Secondly, inserting all data into the same index tends to saturate OpenSearch, especially when surpassing 8 million items. In the latest tests, each consumer was set to insert data into a separate index, which improved OpenSearch performance, reduced memory usage, and decreased insertion times.
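A trivial sketch of the per-consumer index split mentioned above; the `events-<id>` naming scheme is an assumption for illustration, not the scheme used in the tests.

```python
def index_for(consumer_id, base="events"):
    """Route each consumer's inserts to its own index (hypothetical naming).
    Spreading writes across indices avoided saturating OpenSearch past
    ~8 million items in a single index during the tests."""
    return f"{base}-{consumer_id}"

print(index_for(3))  # events-3
```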

The right number of OpenSearch consumers relative to the number of producers depends on OpenSearch's capacity to process and store events. A ratio of 20 producers to 15 consumers worked correctly, but issues arose after reaching 9 million items. In subsequent tests with a 20:30 ratio, no problems were encountered, and the lag (events pending consumption) remained nearly zero at all times, except for small peaks of around 10k, well below the 1M stream size used during the tests.
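The ratio reasoning boils down to a steady-state check: aggregate consumption must keep up with aggregate production, and any transient lag peak must stay below the stream cap. A back-of-the-envelope version, with hypothetical per-worker rates (the tests above did not report per-worker throughput):

```python
# HYPOTHETICAL per-worker rates, for illustration only.
producers, consumers = 20, 30
produce_rate = 100   # events/s per producer (assumption)
consume_rate = 80    # events/s per consumer (assumption)

in_rate = producers * produce_rate     # events/s into the stream
out_rate = consumers * consume_rate    # events/s out of the stream

# With out_rate >= in_rate, lag stays bounded; any transient peak
# must also remain below the stream cap (1M) or old events get trimmed.
print(in_rate, out_rate)      # 2000 2400
print(out_rate >= in_rate)    # True: this 20:30 ratio keeps up
```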

The event stream size was set to 1M. The stream must be capped because it consumes memory. On average, each event takes about 2.6 KB (measured with MEMORY USAGE). Once the stream reaches 1M events, new events start replacing the oldest ones, so consumers must process them before that happens; the stream size should provide enough buffering for this. I found some posts where people compress the data with gzip before inserting it to use less memory, but I haven't tested that.
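Using the ~2.6 KB/event figure and the 1M cap, a quick estimate of the Redis memory a full stream would hold:

```python
# Estimate of Redis memory held by the capped stream, using the
# ~2.6 KB/event figure from MEMORY USAGE and the 1M-event cap.
events = 1_000_000
bytes_per_event = 2.6 * 1024

total_gib = events * bytes_per_event / 1024**3
print(round(total_gib, 2))  # ~2.48 GiB resident for a full stream
```

So the cap effectively budgets roughly 2.5 GiB of Redis memory for buffering; gzip-compressing event payloads would shrink the per-event figure accordingly.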

Another important point concerns the producers. For Git repositories, there is a memory consumption limit: cloning large repositories consumes memory, usually only during the initial clone. Therefore, the number of workers per machine should be limited to prevent the machine from reaching 100% memory usage. Large Mozilla repositories, such as https://github.com/mozilla/gecko-dev, used up to 2.3 GB, while smaller ones used around 100 MB. On a machine with 8 GB of free memory, if 20 workers are simultaneously downloading 3 large and 17 small repositories, the machine will likely run out of memory, potentially killing processes running on it.
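The worst-case arithmetic behind that worker-pool warning, using the clone-memory figures from the tests above:

```python
# Worst-case memory check for a worker pool, using the figures from
# the tests: ~2.3 GB per large clone (gecko-dev), ~0.1 GB per small one.
free_gb = 8.0
large_clones, small_clones = 3, 17   # simultaneous downloads

need_gb = large_clones * 2.3 + small_clones * 0.1
print(round(need_gb, 1))   # 8.6 GB needed
print(need_gb > free_gb)   # True: 20 concurrent workers would exhaust memory
```

This is why the worker count per machine should be sized against the largest repositories it may clone, not the average.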

To replicate the analysis:

  1. Run MariaDB, Redis, and OpenSearch.
  2. Create the fetch tasks for all the repositories from Mozilla (2.4K repositories).
  3. Run 30 consumers with `grimoirelab run opensearch-consumer https://admin:admin@localhost:9200 <index_name>`.
  4. Run 20 workers to execute the fetch tasks: `grimoirelab run workerpool --num-workers 20`.
  5. Monitor the status: `htop` to check the memory, `XINFO GROUPS events` from `redis-cli` to know the lag and total events processed by the consumers, and `python manage.py rqstats` to know the number of pending jobs.


sduenas commented Nov 13, 2024

These features are now available on grimoirelab-core main branch: chaoss/grimoirelab-core@8d19bd8

@dicortazar

This looks promising :). Thanks @jjmerchante !

I think we're still missing the Definition of Done here, or I may have missed it; sorry in advance ;).

From the definition of done section in this ticket:

  • Can I have an API I can access? Where is it?
  • Where is the documentation to access that API?
  • Where are the examples I can check?

Thanks!

@dicortazar

The new definition of done will state that this task focuses on eventizing the JSON docs, not on creating an API.

@dicortazar dicortazar moved this from PO review to Done in Bitergia Analytics Jan 13, 2025