Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the lzreposync loop over channels and synchronize them using a Status column #9551

Open
wants to merge 33 commits into
base: master
Choose a base branch
from

Conversation

waterflow80
Copy link
Collaborator

What does this PR change?

Made the lzreposync service continuously loop over the existing
channels and synchronize the corresponding repositories.

Added a status column in the rhnchannel table to indicate the
sync status of a given channel.

Also added some helper arguments to the service that allows us to
perform test operations, like creating a test channel and
associating repositories to it, etc

GUI diff

No difference.

  • DONE

Documentation

  • No documentation needed: only internal and user invisible changes

  • DONE

Test coverage

ℹ️ If a major new functionality is added, it is strongly recommended that tests for the new functionality are added to the Cucumber test suite

  • No tests: add explanation

  • No tests: already covered

  • Unit tests were added

  • Cucumber tests were added

  • DONE

Links

Issue(s): #
Port(s): # add downstream PR(s), if any

  • DONE

Changelogs

Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository

If you don't need a changelog check, please mark this checkbox:

  • No changelog needed

If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below)

Re-run a test

If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:

  • Re-run test "changelog_test"
  • Re-run test "backend_unittests_pgsql"
  • Re-run test "java_pgsql_tests"
  • Re-run test "schema_migration_test_pgsql"
  • Re-run test "susemanager_unittests"
  • Re-run test "javascript_lint"
  • Re-run test "spacecmd_unittests"

Before you merge

Check How to branch and merge properly!

from spacewalk.common.repo import GeneralRepoException
from spacewalk.satellite_tools.repo_plugins.deb_src import DebRepo

SLEEP_TIME = 2 # Num of seconds between one sync and another
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 seconds seems very short. I have no idea what time we would need here though.

Copy link
Collaborator Author

@waterflow80 waterflow80 Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree, 2 seconds aren't that much for the lzreposync to look for updated channels to sync. But I think that we shouldn't also make it too long to not let the users wait too much for the sync. I am also not sure what will be the best time to put yet.

By the way, will the user have the ability to explicitly fire the sync job for his own newly updated channels ? It's more like the On-Demand download strategy ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postgresql can also have a notification mechanism. I know we are using it in some other parts of our code, but I don't know how that works.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what context we'll be using the notification mechanism ?

python/lzreposync/src/lzreposync/__init__.py Show resolved Hide resolved
Comment on lines +238 to +251
parser.add_argument(
"--create-content-source",
help="Adding a new content source to channel by specifying 'channel_label', 'source_url', 'source_label' and 'type'\n"
"Eg: --create-content-source test_channel https://download.opensuse.org/update/leap/15.5/oss/ leap15.5 yum",
dest="source_info",
type=str,
nargs=4,
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why you want to create a channel from the command line? While looping you take basically no other parameter than the DB connection and may be the sleep time and act on the available channels.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, it's just temporary for me to able to test the service, we can remove it before merging the PR. And maybe I shouldn't have committed it in the first place.

python/lzreposync/src/lzreposync/__init__.py Outdated Show resolved Hide resolved
)
_display_failed_repos(ret)

except KeyboardInterrupt:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the only interruption we need?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'm still not completely sure about it.

And I think we should also be careful about the final channel's status, after any interruption in general, and not leaving the channel in a blocking status like in_progress, because in that case it will be ignored by future syncs.

It still needs some refinements..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be in those cases the status should be set to failed?

Copy link
Collaborator Author

@waterflow80 waterflow80 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true...I was just thinking about a rare scenario in which all the repositories of a given channel get completely synced, but before updating the database status to done the interruption happens. However the probability of this happening is extremely low, so I think setting the status to failed should be good enough.

Now in terms of implementation, I would like to get your opinion about these two possible ways:

First one

Add another try...catch block inside the for loop, to be able to get the current channel label being synced, and update its status:

...
for channel in all_channels
    ch_label = channel["channel_label"]
    try:
        sync_channel_workflow(ch_label)
    except KeyboardInterrupt:
        update_status(ch_label, 'failed')

Second one

Initialize an empty variable above the first try block and set it to None: curr_chnl_label=None. Then it gets updated inside the for loop with the current channel label being synced:

curr_chnl_label = None
try:
    ...
    for channel in all_channels:
        curr_chnl_label = channel["channel_label"]
        ....
except KeyboardInterrupt:
        update_status(curr_chnl_label, 'failed')

However in the latter case, we should add None check.

Both seem to have the same complexity, but I'm not sure what will be the best practice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not sure about the actual status, "failed" looks appropriate to me. Better to run it twice than have bad data.

For the implementation, I don't think it makes a difference. Pick either of them 😄

@@ -423,6 +423,7 @@ class OracleBackend(Backend):
"checksum_type_id": DBint(),
"channel_access": DBstring(10),
"update_tag": DBstring(128),
"status": DBstring(11),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more optimized to reduce the status to a single character in case there are lots of data. @rjmateus any opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Postgresql also support enums (occupying 4 bytes) https://www.postgresql.org/docs/current/datatype-enum.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can use enums, or just 1 char. Both should be fine

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I add the enum type creation script inside rhnPackage.sql, or in the schema upgrade script, or in both.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it needs to be defined in both places, one for new installations and the other for upgrades (only if it does not exist)

agraul and others added 2 commits January 2, 2025 13:39
lzreposync will be a spacewalk-repo-sync replacement written in Python.
It uses a src layout and a pyproject.toml. The target Python version is
3.11, compatibility with older Python versions is explicitly not a goal.
@waterflow80
Copy link
Collaborator Author

@cbosdo @agraul @rjmateus

Shouldn't we add a status state/column relative the repository itself, because currently, if one repo failed to sync among other repositories in a given channel, the next sync job will try to synchronize all the repositories of that channel, since it is labeled as failed.

Or should we think about another kind of status that indicates that some of the repositories of that channel have been synced and some others are not ?

Added the remote_path column that will hold the remote path/
url of a given package.

This information will help locate the package later-on on the
remote repository and download it.
A boolean argument that checks whether we should call the
header.hdr.fullFilelist()

We added this argument to disable the header.hdr.fullFilelist()
function only for the lzreposync service.
The inspect.getargspec() method is deprecated in Python 3
It can be replaced by inspect.getfullargspec()
The import_signatures is a boolean argument that specifies
whether we should execute the _import_signatures() method.

We added this parameter to disable the _import_signatures()
method for the lzreposync service.
Parsing the rpm's Primary.xml packages metadata file using
pulldom xml parser as a memory efficient parsing library.

Note that some attributes in the returned parsed object are
faked, and maybe filled in elsewhere.

The faking of some of the data is done because some
attributes are required by the importer service.
Parsing the rpm's filelists.xml metadata file using
pulldom xml parser as a memory efficient parsing library.

The parser parses the given filelists.xml file (normally in gz
format), and cache the filelist information of each package
in a separate file in the cache directory, using the package's
hash as the filename, with no file extension.
Using both primary_parser and filelists_parser, return the full
packages' metadata, pacakge by package, using lazing parsing.

Note that there some attributes that are faked, because we
can't fetch them now, and they're required by the package
importer later-on.
However, we can fake them more efficiently, using less memory.
Parsed the update-info.xml file and imported the parsed
patches/updates to the database.

We used pretty much the same code from the old Reposync class.
Import the parsed rpm and debian packages to the database in
batche, and associate each pacakge with the corresponding
channel
Parsed the debian Packages metadata file in a lazy way and
yield the metadata of each package separately.
Parsed the debian's Translation file that contains the full
description of packages, grouped by description-md5, and
cache the parsed descriptions in a cache directory.
Using both packages_parser and translation_parser, return the
full packages' metadata, pacakge by package, using lazing
parsing

Also set the debian repository's information in a DebRepo
class
Given the channel label, fetch important repository's
information form the database, and store it in a temporary
object RepoDTO
Added the necessary command line arguments.

Identify the target repositories, prepare the datastructures,
and execute the lazy synchronization of repositories/packages.
Added a new dependency python-gnupg used to verify repo
signature.
waterflow80 and others added 14 commits January 2, 2025 17:17
Ignored two linting complains about rasing exceptions floowing the
approach in the old reposync.

We can enhance the code instead of doing this though.
This commit completes almost all the logic and use cases
of the new lazy reposync.

**Note** that this commit will be restructured and possibly
divided into smaller and more convenient commits.
This commit is for review purposes.
Seemingly this error happened because we reached the maximum number
of unclosed db connections. And thought that this might be due to
the fact that the close() method in the Database class was not
implemented, and the rhnSQL.closeDB() was not closing any connection.

However, we're still hesitating about whether this is the root cause
of the problem, because the old(current) reposync is was using it
without any error.
This is the latest and almost the final version of the
lzreposync service. (gpg sig check not complete)

It contains pretty much all the necessary tests,
including the ones for updates/patches import.

Some of the remaining 'todos' are either for code
enhancements or some unclear concepts that
will be discussed with the team.

Of course, this commit will be split into smaller
ones later after rebase.
- Removed some todos.
- Changed some sql queries with equivalent ones using
JOIN...ON.
- Some other minor cleanup
Optimized some code by changing classes and methods
in some logics with free functions.

Consolidated the debian repo parsing.
Completed the gpg signature check for rpm repositories,
mainly for the repomd.xml file.

This is done by downloading the signature file from the
remote rpm repo, and executing 'gpg verify' to verify the
repomd.xml file against its signature using the already
added gpg keys on the filesystem.

So, if you haven't already added the required gpg keyring
on your system, you'll not be able to verify the repo.

You should ideally run this version directly on the uyuni-
server, because the gpg keyring will probably be present
there.
makedirs() in uyuni.common.fileutils now accepts relative paths that
consist of only a directory name or paths with trailing slashes.
Completed the gpg signature check for debian repositories.

If you haven't already added the required gpg keyring
on your system, you'll not be able to verify the repo,
and you'll normally get a GeneralRepoException.

You should ideally run this version directly on the uyuni-
server, because the gpg keyring will probably be present
there.
Mocked the SPACEWALK_GPG_HOMEDIR value to `~/.gnupg/`, which is the
default directory for gpg, in order to execute the gpg tests outside
the uyuni-server
Made the lzreposync service continuously loop over the existing
channels and synchronize the corresponding repositories.

Added a status column in the rhnchannel table to indicate the
sync status of a given channel.

Also added some helper arguments to the service that allows us to
perform test operations, like creating a test channel and
associating repositories to it, etc
@rjmateus
Copy link
Member

rjmateus commented Jan 7, 2025

If I remember correctly the idea to have the flag in the channel is to know if it is ready to use or not, meaning that at least one full synchronization went fine. Not fully synchronizing a channel has impacts on other parts, like Content Lifecycle Management.
On each iteration, all the repositories of the channel must be synchronized, so by adding a flag to the repository level I'm not sure what we would be benefiting. Do you see any use case for it?

@waterflow80
Copy link
Collaborator Author

If I remember correctly the idea to have the flag in the channel is to know if it is ready to use or not, meaning that at least one full synchronization went fine. Not fully synchronizing a channel has impacts on other parts, like CLM. On each iteration, all the repositories of the channel must be synchronized, so by adding a flag to the repository level I'm not sure what we would be benefiting. Do you see any use case for it?

You're right.

I am not actually completely aware of how the end user will interact with the channels, but while testing the sync service, I added some bad (nonexistent) repositories' urls to a channel; In this case, good repositories got synced, but the other, 'bad', ones did not, so the channel was labeled as failed . And the lzreposync would infinitely try to sync the channel again, and will never make it successful.

Note that each sync will include all the repositories of the channel, including the ones that has already been successfully synced. And this will be time and resource consuming.

However, this will only happen when the user puts some bad repositories' urls in a channel, so if there's already a control over these repositories in the front-end, then we won't face this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants