Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HMA] Bank and BankContent disable #1732

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from
Draft

Conversation

aryzle
Copy link
Collaborator

@aryzle aryzle commented Jan 11, 2025

Summary

fixes #1682

minor refactors

  • rename BankConfig -> IBank
  • rename BankContentConfig -> IBankContent
  • rename hash_media_post_impl -> hash_file

Test Plan

  • unittest

@aryzle aryzle changed the title Bank and BankContent disable [HMA] Bank and BankContent disable Jan 11, 2025
@aryzle
Copy link
Collaborator Author

aryzle commented Jan 11, 2025

@Dcallies this is still a draft but if you get a chance to look at it, would appreciate to know if I'm headed in the right direction 😁🙏

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @aryzle, thanks for contributing this PR! This one is in high demand, and I let another user who was asking for it know that there's a PR up.

blocking: Please split this PR into its component parts. It helps during review for PRs to have "one thesis", and the renames adds a lot of additional lines to the review, where the risk profile for renames are a lot smaller than the functional changes, and it can be hard to spot which changes are for which!

Additionally, I would prefer we hold off on the proposed renames. There is an in-progress migration to move these classes from HMA into python-threatexchange, and it will be difficult to complete any rename until that is completed. Additionally additionally, in programming languages which have a specific interface concept, prefixing with I can have some benefits for identifying them. However, in python, the waters are a bit muddy, since you can happily combine implementation in otherwise "abstract" classes.

HMA itself does mix implementation in some of these "interfaces" - so an author coming from a different language may project expectations onto these classes that might be surprising as a result of the rename. Thus, I'm not sure it makes sense to rename these classes. As long as you are willing to hold your nose on the current names, I think it's better to leave as is. If you still want to follow through with the rename, let's create a dedicated issue and discuss the tradeoffs of doing so before we follow through - it should not be a part of this functional change.

Comment on lines 236 to 237
except Exception as e:
abort(400, *e.args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Some of these exceptions would be the servers fault (500). I don't think we should capture every exception here to try and distinguish. Can you only capture the exceptions you know to be the client's fault?

data = request.get_json()

try:
if "disable_until_ts" in data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check to make sure that this is in a sane range.

@bp.route("/bank/<bank_name>/content/<content_id>", methods=["PUT"])
def bank_update_content(bank_name: str, content_id: int):
"""
Update the metadata for a banked content item.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Can you document the parameters accepted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we allow updating anything besides disable_until_ts? I could see collab_metadata or original_media_uri as candidates but not sure what they do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can start with just this one, and can add more in followup PRs if you feel inclined.

storage.bank_content_update(content)
except Exception as e:
abort(400, *e.args)
return jsonify(content)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure/ignorable: I think I wrote something to automatically json-ify the returns, but maybe not

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like it's needed on our DB entities, like jsonify(bank)

def hash_media_post_impl() -> dict[str, str]:
""" """

def hash_file() -> dict[str, str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: The old name actually was meaningful in that it was referring to the fact that it is one variation of the hash_file implementation - one where the media content is in the form data. There's another version which downloads from a URL. This new name is probably misleading because someone might confuse it with how tx hash <file> works.

Maybe something like hash_media_from_form_data might be more accurate? (I believe form data can only be included in a POST)

@@ -16,6 +18,8 @@
import OpenMediaMatch.storage.interface as iface
from OpenMediaMatch.blueprints import hashing

MAX_TIME = int(time.mktime((datetime.now() + timedelta(days=365 * 100)).timetuple()))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this but wanted to have a max time for disable_until_ts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: This time will only be calculated on import time. If the server was running for over a year, it would not be possible to set a time at all!

I recommend instead this record a duration like the number of seconds in a year, and calculate whether it's out of range dynamically in the function

nit: MAX_TIME is a bit generic as a name for this. Maybe something related to bank_content_disable?

@@ -201,7 +209,11 @@ def _bank_add_signals(
abort(400, f"Invalid {name} signal: {str(e)}")

content_config = iface.BankContentConfig(
id=0, disable_until_ts=0, collab_metadata={}, original_media_uri=None, bank=bank
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dcallies was this meant to add new content with disable_until_ts=0 ?

actually it looks like disable_until_ts and collab_metadata don't get used in storage.bank_add_content

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it looks like disable_until_ts and collab_metadata don't get used in storage.bank_add_content

Might be needed to complete the issue then!

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @aryzle ! Sorry for the slow review, I've been traveling in the last week.

We're getting closer! This looks like a decent scope for the size of this PR, but if you'd like, the only functional changes that are needed is the updates to the API that allows marking a content as disabled.

blocking: I think you are close enough to the right version that you should add some new unittests for you new APIs to make sure they get the right values! This will be easier if you add the GET option for bank content, since you can then just test the full create-update-delete cycle.

@@ -16,6 +18,8 @@
import OpenMediaMatch.storage.interface as iface
from OpenMediaMatch.blueprints import hashing

MAX_TIME = int(time.mktime((datetime.now() + timedelta(days=365 * 100)).timetuple()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: This time will only be calculated on import time. If the server was running for over a year, it would not be possible to set a time at all!

I recommend instead this record a duration like the number of seconds in a year, and calculate whether it's out of range dynamically in the function

nit: MAX_TIME is a bit generic as a name for this. Maybe something related to bank_content_disable?

Comment on lines +96 to +97
except IntegrityError:
abort(403, "Bank already exists")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: This is an update function, the bank is expected to exist (otherwise it will 404 on L82). In what circumstances are IntegrityError thrown?

Comment on lines +98 to +99
except Exception as e:
abort(500, *e.args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: What happens if you don't catch the exception at all? If you call this abort I think it may "hide" the root exception (by replacement).

I suspect the default behavior is also to 500, so you may not need this explicit abort.

@bp.route("/bank/<bank_name>/content/<content_id>", methods=["PUT"])
def bank_update_content(bank_name: str, content_id: int):
"""
Update the metadata for a banked content item.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can start with just this one, and can add more in followup PRs if you feel inclined.

Comment on lines +256 to +259
except ValueError as e:
abort(400, *e.args)
except KeyError as e:
abort(404, *e.args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are also potentially catching exceptions from storage.bank_content_update(content) which may need to be a 500.

Instead of throwing, catching, and aborting, just abort(400) inline.

Comment on lines +633 to +643
if enabled_only:
query = query.join(database.BankContent).where(
(
database.BankContent.disable_until_ts
== interface.BankContentConfig.ENABLED
)
or (
database.BankContent.disable_until_ts
< func.extract("epoch", func.now())
)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking question: Hmm, I'm not sure about this change because I'm worried it moves some of this computation to the database on non-indexed fields. What are the tradeoffs in having the caller do this filtering instead?

Comment on lines +635 to +642
(
database.BankContent.disable_until_ts
== interface.BankContentConfig.ENABLED
)
or (
database.BankContent.disable_until_ts
< func.extract("epoch", func.now())
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I don't think this quite works - check out the value of those constants:

https://github.com/facebook/ThreatExchange/blob/main/hasher-matcher-actioner/src/OpenMediaMatch/storage/interface.py#L331-L332

I think maybe if we were smarted, we would have chosen an arbitrarily large value for the disabled time, which would make the time check consistent.

@@ -72,7 +72,7 @@ def build_index(
index_cls = for_signal_type.get_index_cls()
signal_list = []
last_cs = None
for last_cs in bank_store.bank_yield_content(for_signal_type):
for last_cs in bank_store.bank_yield_content(for_signal_type, enabled_only=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking; Sadly, I think this may be more complex. Since the index is only built in snapshots, you may exclude a hash that will be enabled one second from now, where the better behavior is to probably index it anyways.

If you choose to follow through with this change, I recommend only excluding the hashes that are permanently disabled, and indexing all the ones that enabled by time.

ignorable/alt: My recommendation is to give less options to bank_yield_content and make the determination in python code here.

storage = get_storage()
current_app.logger.debug("getting bank content")
current_app.logger.debug(raw_results)
contents = storage.bank_content_get(raw_results)
enabled = [c for c in contents if c.enabled]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like this enable check will handle it for you already even if it's indexed.

https://github.com/facebook/ThreatExchange/blob/main/hasher-matcher-actioner/src/OpenMediaMatch/storage/interface.py#L331-L332

@@ -212,6 +224,44 @@ def _bank_add_signals(
}


@bp.route("/bank/<bank_name>/content/<content_id>", methods=["PUT"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alt: Consider also adding a GET version of this, which would then round out the CRUD and make unittesting easier for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[hma] Implement Bank & content-level disable
3 participants