
IT does not assert that replica bundles are indexed #6647

Open
nadove-ucsc opened this issue Oct 23, 2024 · 11 comments
Labels
- + [priority] High
- bug [type] A defect preventing use of the system as specified
- debt [type] A defect incurring continued engineering cost
- orange [process] Done by the Azul team
- spike:2 [process] Spike estimate of two points
- test [subject] Unit and integration test code

Comments

@nadove-ucsc (Contributor)

No description provided.

@nadove-ucsc nadove-ucsc added the orange [process] Done by the Azul team label Oct 23, 2024
nadove-ucsc added a commit that referenced this issue Oct 23, 2024
nadove-ucsc added a commit that referenced this issue Oct 23, 2024
@dsotirho-ucsc dsotirho-ucsc added bug [type] A defect preventing use of the system as specified debt [type] A defect incurring continued engineering cost test [subject] Unit and integration test code + [priority] High labels Oct 23, 2024
nadove-ucsc added a commit that referenced this issue Nov 1, 2024
nadove-ucsc added a commit that referenced this issue Nov 3, 2024
nadove-ucsc added a commit that referenced this issue Nov 3, 2024
nadove-ucsc added a commit that referenced this issue Nov 4, 2024
nadove-ucsc added a commit that referenced this issue Nov 4, 2024
nadove-ucsc added a commit that referenced this issue Nov 5, 2024
nadove-ucsc added a commit that referenced this issue Nov 7, 2024
nadove-ucsc added a commit that referenced this issue Nov 8, 2024
nadove-ucsc added a commit that referenced this issue Nov 8, 2024
nadove-ucsc added a commit that referenced this issue Nov 8, 2024
nadove-ucsc added a commit that referenced this issue Nov 8, 2024
hannes-ucsc pushed a commit that referenced this issue Nov 9, 2024
@nadove-ucsc nadove-ucsc self-assigned this Nov 20, 2024
@nadove-ucsc nadove-ucsc added spike:1 [process] Spike estimate of one point spike:2 [process] Spike estimate of two points and removed spike:1 [process] Spike estimate of one point labels Nov 20, 2024
@nadove-ucsc (Contributor, Author)

Spike for design

@nadove-ucsc (Contributor, Author)

Currently, AnVIL replica bundles aren't recorded anywhere in Elasticsearch. There are no contributions because replica bundles don't emit any, and there are no replicas because we don't emit replicas for AnVIL bundles themselves (since they're synthetic) and the other replicas don't include any information about which bundle(s) they originated from.

@nadove-ucsc (Contributor, Author)

During the most recent reindex for anvilprod, replica bundles accounted for 12.5% of the total number of bundles:

**CloudWatch Logs Insights**
region: us-east-1
log-group-names: /aws/lambda/azul-indexer-anvilprod-contribute
start-time: 2024-11-12T08:00:00.000Z
end-time: 2024-11-17T07:59:59.000Z
query-string:

    fields @message
    | parse @message ' is a * bundle' as bundle_type
    | stats count(*) by bundle_type

| bundle_type   | count(*) |
|---------------|----------|
|               | 52963921 |
| replica       | 50022    |
| supplementary | 21671    |
| primary       | 327020   |
| DUOS          | 256      |
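For reference: 50,022 replica bundles out of 50,022 + 21,671 + 327,020 + 256 = 398,969 bundles in total is roughly 12.5%. The row with the empty bundle_type presumably counts log messages that don't match the parse pattern.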

@nadove-ucsc (Contributor, Author)

The current integration test already covers the possibility of indexing failures for replica bundles because it will fail if there are any messages in the fail queue. The current test is also not effective at testing whether the indexing was truly "complete", as it has no means of verifying that all expected non-bundle entities are present. So, as currently written, I'd argue that the omission of replica bundles from _assert_catalog_complete does not represent a meaningful gap in coverage.

@nadove-ucsc (Contributor, Author)

The easiest way to include replica bundles in the catalog_complete subtest would be to just emit contributions for them. These bundles would have no inner entities besides the dataset, and ideally we would also suppress the dataset. This would result in a 14.3% increase in the number of bundle contributions and aggregates. It would violate our current tenet that replica bundles never emit contributions, and it would have no benefit other than facilitating this change to the IT.
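For reference, the 14.3% figure presumably follows from the reindex counts above: 50,022 replica bundles relative to 327,020 + 21,671 + 256 = 348,947 non-replica bundles is an increase of roughly 14.3%.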

@nadove-ucsc (Contributor, Author)

If we continue to adhere to the tenet that replica bundles never emit contributions, then they'll need to record their existence via replicas instead. We could add a bundle_fqid field to replicas, but this would add a lot of complexity because replicas can be emitted by multiple bundles, so we'd need to resolve conflicts via a scripted update, as we currently do for hub IDs. Note that every replica is emitted by exactly one replica bundle and zero or more non-replica bundles.
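For illustration only, a scripted update along these lines might look roughly like the sketch below. It uses an elasticsearch-py 7.x-style call; the index name, the bundle_fqids field, and the document layout are hypothetical and do not reflect Azul's actual replica schema or its hub ID implementation.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # connection details omitted

    def add_bundle_fqid(replica_id: str, bundle_fqid: str) -> None:
        # Hypothetical sketch: append a bundle FQID to a replica document,
        # creating the document if it does not exist yet. Index and field
        # names are made up for illustration.
        es.update(index='replicas',
                  id=replica_id,
                  body={
                      'scripted_upsert': True,
                      'script': {
                          'source': '''
                              if (!ctx._source.bundle_fqids.contains(params.fqid)) {
                                  ctx._source.bundle_fqids.add(params.fqid);
                              }
                          ''',
                          'params': {'fqid': bundle_fqid}
                      },
                      'upsert': {'bundle_fqids': []}
                  })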

@nadove-ucsc (Contributor, Author)

Perhaps a better idea would be to emit special "stub" replicas for bundles, with no content or hub IDs, that would only be read during the IT. The changes to the indexer would be smaller and more localized in this case, and I would expect the performance impact to be smaller as well. But these replicas would serve no purpose once the IT is finished.
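To make the shape of this idea concrete, a stub replica and the corresponding IT-side check might look roughly like the sketch below; all field names and helpers are hypothetical, not Azul's actual replica schema.

    # Hypothetical sketch of a "stub" replica emitted for a replica bundle.
    # Field names are illustrative only.
    def stub_replica(bundle_fqid: str) -> dict:
        return {
            'replica_type': 'bundle_stub',  # distinguishes stubs from real replicas
            'bundle_fqid': bundle_fqid,     # lets the IT match stubs to notified bundles
            'contents': {},                 # no content ...
            'hub_ids': [],                  # ... and no hubs, so nothing else reads it
        }

    # The IT-side check could then be roughly (helpers hypothetical):
    #
    #     notified = {b.fqid for b in notified_replica_bundles}
    #     indexed = {hit['bundle_fqid']
    #                for hit in search_replicas(replica_type='bundle_stub')}
    #     assert notified <= indexed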

@hannes-ucsc (Member) commented Dec 17, 2024

> The current integration test already covers the possibility of indexing failures for replica bundles because it will fail if there are any messages in the fail queue.

Technically, I don't think that statement is true. I found no code in the IT that looks at the fail queues. Luckily, it takes a while for a notification to make its way into the fail queue, so the IT will detect a stall before it does. But this all depends on the queue configuration, the number of attempts, and the specific timeouts used in the IT.

We should add code that actually fails the IT with messages in the fail queues. We should also double check that the IT clears the fail queues before queuing more notifications. I think it does but the assignee should check.
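For illustration, such a check might look roughly like the sketch below; it calls boto3 directly and takes the fail queue names as an argument, whereas the actual IT would presumably reuse its existing queue configuration and helpers.

    import boto3

    def assert_fail_queues_empty(self, fail_queue_names: list[str]) -> None:
        # Sketch: fail the IT if any fail queue holds messages. Queue
        # discovery and naming are hypothetical; the real IT would use
        # its own queue configuration.
        sqs = boto3.resource('sqs')
        for name in fail_queue_names:
            queue = sqs.get_queue_by_name(QueueName=name)
            num_messages = int(queue.attributes['ApproximateNumberOfMessages'])
            self.assertEqual(0, num_messages,
                             f'Fail queue {name!r} is not empty')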

Most importantly, the code that validates the verbatim manifest formats is very permissive: it accepts any manifest with more than one replica. We should assert that each unfiltered verbatim manifest contains a replica for every file in every bundle that we index. This would be good enough coverage for now.

For extra points, we may also be able to require that there is at least one orphan, but I'm not sure. We do know the table names of the replica bundles (from their FQIDs), so we should be able to infer if there were non-schema orphans. Then again, the tables could be empty.

@achave11-ucsc (Member)

Spike to review response.

@nadove-ucsc (Contributor, Author)

> I found no code in the IT that looks at the fail queues.

I mistakenly thought that the indexer would count messages in the fail queues when deciding whether indexing had completed or not, in the same way that it checks for messages in the notifications and tallies queues. Since this is not the case, Hannes is correct that the IT does not directly check the fail queues.

> We should also double check that the IT clears the fail queues before queuing more notifications.

The IT does not purge the fail queues. On non-stable deployments, only the notifications and tallies are purged. On stable deployments, no queues are purged. See AzulClient.reset_indexer.

> the code that validates the verbatim manifest formats [...] accepts any manifest with more than one replica

I believe the code will accept a manifest with exactly one replica. But the point stands.

> We should assert that each unfiltered verbatim manifest contains a replica for every file in every bundle that we index.

As written, this is both impossible to implement and the wrong expected behavior. We expect orphaned files to be missing from unfiltered manifests (because they are not filtered by a dataset field), and supplementary bundles contain files that are not observable in the index, and thus cannot be used for the expected value in the test. However, these points largely cancel out, so the suggested improvement will work if we check only for non-orphaned files.

> we should be able to infer if there were non-schema orphans. Then again, the tables could be empty.

We do not emit replica bundles for empty tables, so I think this should work.

@hannes-ucsc (Member)

>> We should also double check that the IT clears the fail queues before queuing more notifications.
>
> The IT does not purge the fail queues. On non-stable deployments, only the notifications and tallies are purged. On stable deployments, no queues are purged. See AzulClient.reset_indexer.

The condition is here:

    purge_queues=not config.deployment.is_stable,

We'll deal with this in a separate issue: #6781

>> the code that validates the verbatim manifest formats [...] accepts any manifest with more than one replica
>
> I believe the code will accept a manifest with exactly one replica. But the point stands.

It currently requires at least one replica, right? IOW, it would accept a manifest with two replicas.

Same for pfb:

    self.assertGreater(next(num_records), 1)

>> We should assert that each unfiltered verbatim manifest contains a replica for every file in every bundle that we index.
>
> As written, this is both impossible to implement and the wrong expected behavior. We expect orphaned files to be missing from unfiltered manifests (because they are not filtered by a dataset field), and supplementary bundles contain files that are not observable in the index, and thus cannot be used for the expected value in the test. However, these points largely cancel out, so the suggested improvement will work if we check only for non-orphaned files.

An unfiltered manifest does contain orphans. The condition for whether to include orphans changed from the initial implementation (include them if the filtered field is an implicit hub ID) to whether the filtered fields are a subset of the implicit hub fields.

But yes, the IT doesn't know the set of orphans in the sources it indexes, so it can only expect a replica for each hit in /index/files.
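Concretely, the check might look roughly like the sketch below; the helpers for paging through /index/files and fetching the unfiltered verbatim manifest, as well as the replica document fields, are hypothetical.

    def check_verbatim_manifest_covers_indexed_files(self) -> None:
        # Sketch: every file hit in /index/files (i.e. every non-orphan file)
        # should have a replica in the unfiltered verbatim manifest.
        # Both helpers and the 'type'/'id' fields are hypothetical.
        indexed_files = {
            hit['files'][0]['uuid']
            for hit in self._get_all_hits('/index/files')
        }
        replica_files = {
            replica['id']
            for replica in self._get_verbatim_manifest(filters={})
            if replica.get('type') == 'file'
        }
        self.assertLessEqual(indexed_files, replica_files)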

>> we should be able to infer if there were non-schema orphans. Then again, the tables could be empty.
>
> We do not emit replica bundles for empty tables, so I think this should work.

Agreed.

@hannes-ucsc hannes-ucsc removed their assignment Dec 21, 2024
@hannes-ucsc hannes-ucsc changed the title Integration test does not assert that replica bundles are indexed IT does not assert that replica bundles are indexed Dec 21, 2024