
IT does not assert that replica bundles are indexed #6647

Open
nadove-ucsc opened this issue Oct 23, 2024 · 11 comments
Labels
- + [priority] High
- bug [type] A defect preventing use of the system as specified
- debt [type] A defect incurring continued engineering cost
- orange [process] Done by the Azul team
- spike:2 [process] Spike estimate of two points
- test [subject] Unit and integration test code

Comments

@nadove-ucsc (Contributor)

No description provided.

@nadove-ucsc nadove-ucsc added the orange [process] Done by the Azul team label Oct 23, 2024
nadove-ucsc added a commit that referenced this issue Oct 23, 2024
nadove-ucsc added a commit that referenced this issue Oct 23, 2024
@dsotirho-ucsc dsotirho-ucsc added bug [type] A defect preventing use of the system as specified debt [type] A defect incurring continued engineering cost test [subject] Unit and integration test code + [priority] High labels Oct 23, 2024
nadove-ucsc added a commit that referenced this issue Nov 1, 2024
nadove-ucsc added a commit that referenced this issue Nov 3, 2024
nadove-ucsc added a commit that referenced this issue Nov 3, 2024
nadove-ucsc added a commit that referenced this issue Nov 4, 2024
nadove-ucsc added a commit that referenced this issue Nov 4, 2024
nadove-ucsc added a commit that referenced this issue Nov 5, 2024
nadove-ucsc added a commit that referenced this issue Nov 7, 2024
nadove-ucsc added a commit that referenced this issue Nov 8, 2024
nadove-ucsc added a commit that referenced this issue Nov 8, 2024
nadove-ucsc added a commit that referenced this issue Nov 8, 2024
nadove-ucsc added a commit that referenced this issue Nov 8, 2024
hannes-ucsc pushed a commit that referenced this issue Nov 9, 2024
@nadove-ucsc nadove-ucsc self-assigned this Nov 20, 2024
@nadove-ucsc nadove-ucsc added spike:1 [process] Spike estimate of one point spike:2 [process] Spike estimate of two points and removed spike:1 [process] Spike estimate of one point labels Nov 20, 2024
@nadove-ucsc (Contributor, Author)

Spike for design

@nadove-ucsc (Contributor, Author)

Currently, AnVIL replica bundles aren't recorded anywhere in Elasticsearch. There are no contributions because replica bundles don't emit any, and there are no replicas because we don't emit replicas for AnVIL bundles themselves (since they're synthetic) and the other replicas don't include any information about which bundle(s) they originated from.

@nadove-ucsc (Contributor, Author)

During the most recent reindex for anvilprod, replica bundles accounted for 12.5% of the total number of bundles:

**CloudWatch Logs Insights**
region: us-east-1
log-group-names: /aws/lambda/azul-indexer-anvilprod-contribute
start-time: 2024-11-12T08:00:00.000Z
end-time: 2024-11-17T07:59:59.000Z
query-string:

    fields @message
    | parse @message ' is a * bundle' as bundle_type
    | stats count(*) by bundle_type

| bundle_type   | count(*) |
|---------------|----------|
|               | 52963921 |
| replica       | 50022    |
| supplementary | 21671    |
| primary       | 327020   |
| DUOS          | 256      |
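For reference: 50,022 replica bundles out of 50,022 + 21,671 + 327,020 + 256 = 398,969 bundles in total is roughly 12.5%. The row with the empty bundle_type presumably counts log messages that don't match the parse pattern.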

@nadove-ucsc (Contributor, Author)

The current integration test already covers the possibility of indexing failures for replica bundles because it will fail if there are any messages in the fail queue. The current test is also not effective at testing whether the indexing was truly "complete", as it has no means of verifying that all expected non-bundle entities are present. So, as currently written, I'd argue that the omission of replica bundles from _assert_catalog_complete does not represent a meaningful gap in coverage.

@nadove-ucsc (Contributor, Author)

The easiest way to include replica bundles in the catalog_complete subtest would be to just emit contributions for them. These bundles would have no inner entities besides the dataset, and ideally we would also suppress the dataset. This would result in a 14.3% increase in the number of bundle contributions and aggregates. It would violate our current tenet that replica bundles never emit contributions, and it would have no benefit other than facilitating this change to the IT.
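For reference, the 14.3% figure presumably follows from the reindex counts above: 50,022 replica bundles relative to 327,020 + 21,671 + 256 = 348,947 non-replica bundles is an increase of roughly 14.3%.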

@nadove-ucsc (Contributor, Author)

If we continue to adhere to the tenet that replica bundles never emit contributions, then they'll need to record their existence via replicas instead. We could add a bundle_fqid field to replicas, but this would add a lot of complexity because replicas can be emitted by multiple bundles, so we'd need to resolve conflicts via a scripted update, as we currently do for hub IDs. Note that every replica is emitted by exactly one replica bundle and zero or more non-replica bundles.
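For illustration only, a scripted update along these lines might look roughly like the sketch below. It uses an elasticsearch-py 7.x-style call; the index name, the bundle_fqids field, and the document layout are hypothetical and do not reflect Azul's actual replica schema or its hub ID implementation.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # connection details omitted

    def add_bundle_fqid(replica_id: str, bundle_fqid: str) -> None:
        # Hypothetical sketch: append a bundle FQID to a replica document,
        # creating the document if it does not exist yet. Index and field
        # names are made up for illustration.
        es.update(index='replicas',
                  id=replica_id,
                  body={
                      'scripted_upsert': True,
                      'script': {
                          'source': '''
                              if (!ctx._source.bundle_fqids.contains(params.fqid)) {
                                  ctx._source.bundle_fqids.add(params.fqid);
                              }
                          ''',
                          'params': {'fqid': bundle_fqid}
                      },
                      'upsert': {'bundle_fqids': []}
                  })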

@nadove-ucsc (Contributor, Author)

Perhaps a better idea would be to emit special "stub" replicas for bundles, with no content or hub IDs, that would only be read during the IT. The changes to the indexer would be smaller and more localized in this case, and I would expect the performance impact to be smaller as well. But these replicas would serve no purpose once the IT is finished.
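To make the shape of this idea concrete, a stub replica and the corresponding IT-side check might look roughly like the sketch below; all field names and helpers are hypothetical, not Azul's actual replica schema.

    # Hypothetical sketch of a "stub" replica emitted for a replica bundle.
    # Field names are illustrative only.
    def stub_replica(bundle_fqid: str) -> dict:
        return {
            'replica_type': 'bundle_stub',  # distinguishes stubs from real replicas
            'bundle_fqid': bundle_fqid,     # lets the IT match stubs to notified bundles
            'contents': {},                 # no content ...
            'hub_ids': [],                  # ... and no hubs, so nothing else reads it
        }

    # The IT-side check could then be roughly (helpers hypothetical):
    #
    #     notified = {b.fqid for b in notified_replica_bundles}
    #     indexed = {hit['bundle_fqid']
    #                for hit in search_replicas(replica_type='bundle_stub')}
    #     assert notified <= indexed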

@hannes-ucsc (Member) commented Dec 17, 2024

> The current integration test already covers the possibility of indexing failures for replica bundles because it will fail if there are any messages in the fail queue.

Technically, I don't think that statement is true. I found no code in the IT that looks at the fail queues. Luckily, it takes a while for a notification to make its way into the fail queue, so the IT will detect a stall before it does. But this all depends on the queue configuration, the number of attempts, and the specific timeouts used in the IT.

We should add code that actually fails the IT with messages in the fail queues. We should also double check that the IT clears the fail queues before queuing more notifications. I think it does but the assignee should check.
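For illustration, such a check might look roughly like the sketch below; it calls boto3 directly and takes the fail queue names as an argument, whereas the actual IT would presumably reuse its existing queue configuration and helpers.

    import boto3

    def assert_fail_queues_empty(self, fail_queue_names: list[str]) -> None:
        # Sketch: fail the IT if any fail queue holds messages. Queue
        # discovery and naming are hypothetical; the real IT would use
        # its own queue configuration.
        sqs = boto3.resource('sqs')
        for name in fail_queue_names:
            queue = sqs.get_queue_by_name(QueueName=name)
            num_messages = int(queue.attributes['ApproximateNumberOfMessages'])
            self.assertEqual(0, num_messages,
                             f'Fail queue {name!r} is not empty')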

Most importantly, the code that validates the verbatim manifest formats is very permissive: it accepts any manifest with more than one replica. We should assert that each unfiltered verbatim manifest contains a replica for every file in every bundle that we index. This would be good enough coverage for now.

For extra points, we may also be able to require that there is at least one orphan, but I'm not sure. We do know the table names of the replica bundles (from their FQIDs), so we should be able to infer if there were non-schema orphans. Then again, the tables could be empty.

@achave11-ucsc (Member)

Spike to review response.

@nadove-ucsc (Contributor, Author)

> I found no code in the IT that looks at the fail queues.

I mistakenly thought that the indexer would count messages in the fail queues when deciding whether indexing had completed or not, in the same way that it checks for messages in the notifications and tallies queues. Since this is not the case, Hannes is correct that the IT does not directly check the fail queues.

> We should also double check that the IT clears the fail queues before queuing more notifications.

The IT does not purge the fail queues. On non-stable deployments, only the notifications and tallies are purged. On stable deployments, no queues are purged. See AzulClient.reset_indexer.

> the code that validates the verbatim manifest formats [...] accepts any manifest with more than one replica

I believe the code will accept a manifest with exactly one replica. But the point stands.

> We should assert that each unfiltered verbatim manifest contains a replica for every file in every bundle that we index.

As written, this is both impossible to implement and the wrong expected behavior. We expect orphaned files to be missing from unfiltered manifests (because they are not filtered by a dataset field), and supplementary bundles contain files that are not observable in the index, and thus cannot be used for the expected value in the test. However, these points largely cancel out, so the suggested improvement will work if we check only for non-orphaned files.

> we should be able to infer if there were non-schema orphans. Then again, the tables could be empty.

We do not emit replica bundles for empty tables, so I think this should work.

@hannes-ucsc (Member)

>> We should also double check that the IT clears the fail queues before queuing more notifications.
>
> The IT does not purge the fail queues. On non-stable deployments, only the notifications and tallies are purged. On stable deployments, no queues are purged. See AzulClient.reset_indexer.

The condition is here:

    purge_queues=not config.deployment.is_stable,

We'll deal with this in a separate issue: #6781

>> the code that validates the verbatim manifest formats [...] accepts any manifest with more than one replica
>
> I believe the code will accept a manifest with exactly one replica. But the point stands.

It currently requires at least one replica, right? IOW, it would accept a manifest with two replicas.

Same for pfb:

    self.assertGreater(next(num_records), 1)

>> We should assert that each unfiltered verbatim manifest contains a replica for every file in every bundle that we index.
>
> As written, this is both impossible to implement and the wrong expected behavior. We expect orphaned files to be missing from unfiltered manifests (because they are not filtered by a dataset field), and supplementary bundles contain files that are not observable in the index, and thus cannot be used for the expected value in the test. However, these points largely cancel out, so the suggested improvement will work if we check only for non-orphaned files.

An unfiltered manifest does contain orphans. The condition for whether to include orphans changed from the initial implementation (include them if the filtered field is an implicit hub ID) to whether the filtered fields are a subset of the implicit hub fields.

But yes, the IT doesn't know the set of orphans in the sources it indexes, so it can only expect a replica for each hit in /index/files.
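Concretely, the check might look roughly like the sketch below; the helpers for paging through /index/files and fetching the unfiltered verbatim manifest, as well as the replica document fields, are hypothetical.

    def check_verbatim_manifest_covers_indexed_files(self) -> None:
        # Sketch: every file hit in /index/files (i.e. every non-orphan file)
        # should have a replica in the unfiltered verbatim manifest.
        # Both helpers and the 'type'/'id' fields are hypothetical.
        indexed_files = {
            hit['files'][0]['uuid']
            for hit in self._get_all_hits('/index/files')
        }
        replica_files = {
            replica['id']
            for replica in self._get_verbatim_manifest(filters={})
            if replica.get('type') == 'file'
        }
        self.assertLessEqual(indexed_files, replica_files)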

>> we should be able to infer if there were non-schema orphans. Then again, the tables could be empty.
>
> We do not emit replica bundles for empty tables, so I think this should work.

Agreed.

@hannes-ucsc hannes-ucsc removed their assignment Dec 21, 2024
@hannes-ucsc hannes-ucsc changed the title Integration test does not assert that replica bundles are indexed IT does not assert that replica bundles are indexed Dec 21, 2024