
WIP - Reject duplicate submissions #876

Closed (wants to merge 4 commits)

Conversation

noliveleger (Contributor) commented May 8, 2023

Description

TBC

Additional info

would supersede #859

@noliveleger noliveleger requested a review from jnm May 8, 2023 19:46
@noliveleger noliveleger force-pushed the duplicate-submissions branch 3 times, most recently from 9b39237 to 53c6320 Compare May 9, 2023 18:45
@noliveleger noliveleger force-pushed the duplicate-submissions branch from 53c6320 to 019e7bf Compare May 9, 2023 18:47
```diff
@@ -52,7 +52,7 @@ jobs:
       - name: Install Python dependencies
         run: pip-sync dependencies/pip/dev_requirements.txt
       - name: Run pytest
-        run: pytest -vv -rf
+        run: pytest -vv -rf --disable-warnings
```
Member:

🤔

```diff
@@ -32,6 +32,7 @@ test:
     POSTGRES_PASSWORD: kobo
     POSTGRES_DB: kobocat_test
     SERVICE_ACCOUNT_BACKEND_URL: redis://redis_cache:6379/4
+    GIT_LAB: "True"
```
Member:

maybe something more descriptive like SKIP_TESTS_WITH_CONCURRENCY

```diff
@@ -40,7 +41,7 @@ test:
   script:
     - apt-get update && apt-get install -y ghostscript gdal-bin libproj-dev gettext openjdk-11-jre
     - pip install -r dependencies/pip/dev_requirements.txt
-    - pytest -vv -rf
+    - pytest -vv -rf --disable-warnings
```
Member:

🤔

Comment on lines +90 to +94

```python
    fakeredis.FakeStrictRedis(),
):
    with patch(
        'onadata.apps.django_digest_backends.cache.RedisCacheNonceStorage._get_cache',
        fakeredis.FakeStrictRedis,
```
Member:

why is one instantiated and the other isn't?

```python
results[result] += 1

assert results[status.HTTP_201_CREATED] == 1
assert results[status.HTTP_409_CONFLICT] == DUPLICATE_SUBMISSIONS_COUNT - 1
```
Member:

Does the OpenRosa spec allow returning a 409? And do Enketo and Collect handle a 409 properly? I can't find the code, but I remember wanting to return a 40x that wasn't 400, and being forced to use 400 only because without it Collect wouldn't display the error message I was sending. It could've been Enketo, though, or I might be misremembering entirely.

Comment on lines +171 to +176

```python
# The start-time requirement protected submissions with identical responses
# from being rejected as duplicates *before* KoBoCAT had the concept of
# submission UUIDs. Nowadays, OpenRosa requires clients to send a UUID (in
# `<instanceID>`) within every submission; if the incoming XML has a UUID
# and still exactly matches an existing submission, it's certainly a
# duplicate (https://docs.opendatakit.org/openrosa-metadata/#fields).
```
Member:

We should reject outright any submission without a UUID
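A minimal sketch of that check (the helper name is hypothetical, and a real implementation would use a proper XML parser rather than a regex):

```python
import re


def require_instance_uuid(xml: str) -> str:
    """Hypothetical helper: pull <instanceID> out of the submission XML
    and reject the submission outright when it is missing."""
    match = re.search(r'<instanceID>\s*([^<]+?)\s*</instanceID>', xml)
    if match is None:
        raise ValueError('Submission has no <instanceID>; rejecting')
    return match.group(1)
```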

Comment on lines +189 to +194

```python
if existing_instance:
    existing_instance.check_active(force=False)
    # ensure we have saved the extra attachments
    new_attachments = save_attachments(existing_instance, media_files)
    if not new_attachments:
        raise DuplicateInstanceError()
```
jnm (Member) commented Jun 14, 2023:

Wouldn't ConflictingXMLHashInstanceError be raised already and block us from getting here? Never mind; I misunderstood: that exception only has to do with concurrent processing of submissions with identical XML.

Comment on lines +229 to +238

```python
if is_postgresql:
    cur = connection.cursor()
    cur.execute('SELECT pg_try_advisory_lock(%s::bigint);', (int_lock,))
    acquired = cur.fetchone()[0]
else:
    prefix = os.getenv('KOBOCAT_REDIS_LOCK_PREFIX', 'kc-lock')
    key_ = f'{prefix}:{int_lock}'
    redis_lock = settings.REDIS_LOCK_CLIENT.lock(key_, timeout=60)
    acquired = redis_lock.acquire(blocking=False)
yield acquired
```
Member:

I don't think support for something other than PostgreSQL is needed. We already depend on Postgres for many things.

Moot point, then, but just to say it: I think all os.getenv() calls for configuration (and their default values) should live in the settings files
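For reference, a Postgres-only version of the snippet above could be a small context manager (a sketch under the reviewer's suggestion, not the PR's code; unlike the snippet, it also releases the lock on exit):

```python
from contextlib import contextmanager


@contextmanager
def pg_advisory_lock(connection, int_lock: int):
    # Try to take a session-level advisory lock without blocking; yield
    # whether it was acquired, and release it on exit if it was.
    cur = connection.cursor()
    cur.execute('SELECT pg_try_advisory_lock(%s::bigint);', (int_lock,))
    acquired = cur.fetchone()[0]
    try:
        yield acquired
    finally:
        if acquired:
            cur.execute('SELECT pg_advisory_unlock(%s::bigint);', (int_lock,))
```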

Comment on lines -544 to +581

```diff
-        except DuplicateInstance:
-            response = OpenRosaResponse(t("Duplicate submission"))
+        except ConflictingXMLHashInstanceError:
+            response = OpenRosaResponse(t('Conflict with already existing instance'))
+            response.status_code = 409
+            response['Location'] = request.build_absolute_uri(request.path)
+            error = response
+        except DuplicateInstanceError:
+            response = OpenRosaResponse(t('Duplicate instance'))
```
jnm (Member) commented Jun 14, 2023:

A conflicting XML hash with no additional attachments is the same thing as a duplicate instance. If the XML is the same but there are additional attachments, no exception should be raised; this is normal operation. I don't think a new exception class is needed. I also don't think that DuplicateInstanceError can be reached anymore, but that's addressed by a different comment [was based on a misunderstanding]

jnm (Member) commented Jul 6, 2023:

The thing that confused me about this is that the verbiage is wrong. It's not a really a conflict with an existing instance, it's unwanted concurrent processing of two (or more) instances with identical XML.

I think ideally we should wait (a short amount of time) to try to acquire the lock before returning an error. I don't know what happens in Enketo now, but it's easy to imagine that a client would asynchronously send the following requests concurrently:

  1. Submission XML (hash abc123) + dog.jpg
  2. Submission XML (hash abc123) + cat.jpg
  3. Submission XML (hash abc123) + gecko.jpg

Ideally, all three requests would succeed. Let's assume request 1 arrives first and is still processing while requests 2 and 3 are received by the server: what I understand this PR would do is reject requests 2 and 3 immediately with an error code. I think we should wait (again, briefly) for request 1 to finish[^1] before returning an immediate rejection.

If the waiting takes too long, then we do have to reject in order to avoid sapping the worker pool with useless spinning. The message we return should be effectively "try again later", not something about a conflict with an existing instance. The HTTP code has to be dependent on the OpenRosa specification and compatible with Enketo and ODK Collect. Hopefully, those requirements are one and the same, but we'll have to test.

[^1]: Addendum: Whoops, we don't really need to spin until request 1 finishes; we just need to wait until the row has been created in `logger_instance`. Imagine one POST has 50 files: locking while all 50 are written to storage is not what we want. We should still avoid storing the same attachment multiple times for a submission, so within a single submission we should have some kind of attachment uniqueness constraint (if we don't already).

tiritea commented Sep 4, 2024:

> Let's assume request 1 arrives first and is still processing while requests 2 and 3 are received by the server: what I understand this PR would do is reject requests 2 and 3 immediately with an error code.

So my interpretation of the spec would be that we would only need to (immediately!) reject if the same submission XML was subsequently received [ie duplicate hash] and it includes no attachment (!). Because a client should never be attempting to resend just the submission XML on its own more than once.

Whereas if the same submission XML hash was received, but it included an attachment (or multiple!), then we can decide whether or not to reject based on whether any of those attachments have already been received (or are currently being processed) [and just throw away the XML, since we know it's already been processed or is currently being processed...].

So any locking would minimally only need to occur for the duration of calculating (then checking for existing, then storing if not) the submission XML hash, right?
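The narrow lock scope described above can be illustrated with an in-process sketch (hypothetical names; the real implementation would use a database or Redis lock keyed on the XML hash). The point is that the lock covers only the hash check-and-record step, never attachment storage:

```python
import hashlib
import threading

_registry_lock = threading.Lock()
_seen_hashes: set = set()


def register_submission_xml(xml: bytes) -> bool:
    """Return True if this submission XML is new, False if it is a
    duplicate. The lock is held only while the hash is checked and
    recorded; attachment processing happens entirely outside it."""
    digest = hashlib.sha256(xml).hexdigest()
    with _registry_lock:
        if digest in _seen_hashes:
            return False
        _seen_hashes.add(digest)
        return True
```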

```diff
@@ -1,6 +1,7 @@
 # coding: utf-8
 import os
 
+from fakeredis import FakeConnection, FakeStrictRedis, FakeServer
```
Member:

Ah, was the support for locking without PostgreSQL exclusively for unit testing?

Member:

Unit testing, at least on GitLab, is already using PostgreSQL, so I don't think we need to support any non-Postgres locking mechanisms for this

noliveleger (Contributor, Author):

Closed in favor of kobotoolbox/kpi#5047

rajpatel24 added a commit to kobotoolbox/kpi that referenced this pull request Jan 15, 2025
## Summary
Implemented logic to detect and reject duplicate submissions.

## Description

We have identified a race condition in the submission processing that
causes duplicate submissions with identical UUIDs and XML hashes. This
issue is particularly problematic under conditions with multiple remote
devices submitting forms simultaneously over unreliable networks.

To address this issue, a PR has been raised with the following proposed
changes:

- Race Condition Resolution: A locking mechanism has been added to
prevent the race condition when checking for existing instances and
creating new ones. This aims to eliminate duplicate submissions.

- UUID Enforcement: Submissions without a UUID are now explicitly
disallowed. This ensures that every submission is uniquely identifiable
and further mitigates the risk of duplicate entries.

- Introduction of `root_uuid`:

- To ensure a consistent submission UUID throughout its lifecycle and
prevent duplicate submissions with the same UUID, a new `root_uuid`
column has been added to the `Instance` model with a unique constraint
(`root_uuid` per `xform`).

- If the `<meta><rootUuid>` is present in the submission XML, it is
stored in the `root_uuid` column.

- If `<meta><rootUuid>` is not present, the value from
`<meta><instanceID>` is used instead.

- This approach guarantees that the `root_uuid` remains constant across
the lifecycle of a submission, providing a reliable identifier for all
instances.

- UUID Handling Improvement: Updated the logic to strip only the `uuid:`
prefix while preserving custom, non-UUID ID schemes (e.g.,
domain.com:1234). This ensures compliance with the OpenRosa spec and
prevents potential ID collisions with custom prefixes.

- Error Handling:
  - 202 Accepted: returned when the content is identical to an existing submission that was already successfully processed.
  - 409 Conflict: returned when a duplicate UUID is detected but the content differs.
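The `root_uuid` fallback and prefix handling described above could look roughly like this (hypothetical helper names, sketched from the bullet points rather than taken from the merged code):

```python
def strip_uuid_prefix(value: str) -> str:
    # Strip only the literal 'uuid:' prefix; custom ID schemes such as
    # 'domain.com:1234' are preserved, per the OpenRosa metadata spec.
    prefix = 'uuid:'
    return value[len(prefix):] if value.startswith(prefix) else value


def resolve_root_uuid(root_uuid, instance_id: str) -> str:
    # Prefer <meta><rootUuid>; fall back to <meta><instanceID> so that
    # root_uuid stays constant across the submission's lifecycle.
    return strip_uuid_prefix(root_uuid) if root_uuid else strip_uuid_prefix(instance_id)
```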

These changes should improve the robustness of the submission process
and prevent both race conditions and invalid submissions.

## Notes

- Implemented a fix to address the race condition that leads to
duplicate submissions with the same UUID and XML hash.
- Incorporated improvements from existing work, ensuring consistency and
robustness in handling concurrent submissions.
- The fix aims to prevent duplicate submissions, even under high load
and unreliable network conditions.

## Related issues

Supersedes kobotoolbox/kobocat#876 and kobotoolbox/kobocat#859

---------

Co-authored-by: Olivier Leger <[email protected]>