-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HMA] Bank and BankContent disable #1732
base: main
Are you sure you want to change the base?
Conversation
@Dcallies this is still a draft but if you get a chance to look at it, would appreciate to know if I'm headed in the right direction 😁🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @aryzle, thanks for contributing this PR! This one is in high demand, and I let another user who was asking for it know that there's a PR up.
blocking: Please split this PR into its component parts. It helps during review for PRs to have "one thesis", and the renames adds a lot of additional lines to the review, where the risk profile for renames are a lot smaller than the functional changes, and it can be hard to spot which changes are for which!
Additionally, I would prefer we hold off on the proposed renames. There is an in-progress migration to move these classes from HMA into python-threatexchange
, and it will be difficult to complete any rename until that is completed. Additionally additionally, in programming languages which have a specific interface
concept, prefixing with I
can have some benefits for identifying them. However, in python, the waters are a bit muddy, since you can happily combine implementation in otherwise "abstract" classes.
HMA itself does mix implementation in some of these "interfaces" - so an author coming from a different language may project expectations onto these classes that might be surprising as a result of the rename. Thus, I'm not sure it makes sense to rename these classes. As long as you are willing to hold your nose on the current names, I think it's better to leave as is. If you still want to follow through with the rename, let's create a dedicated issue and discuss the tradeoffs of doing so before we follow through - it should not be a part of this functional change.
except Exception as e: | ||
abort(400, *e.args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Some of these exceptions would be the servers fault (500). I don't think we should capture every exception here to try and distinguish. Can you only capture the exceptions you know to be the client's fault?
data = request.get_json() | ||
|
||
try: | ||
if "disable_until_ts" in data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check to make sure that this is in a sane range.
@bp.route("/bank/<bank_name>/content/<content_id>", methods=["PUT"]) | ||
def bank_update_content(bank_name: str, content_id: int): | ||
""" | ||
Update the metadata for a banked content item. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Can you document the parameters accepted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we allow updating anything besides disable_until_ts
? I could see collab_metadata
or original_media_uri
as candidates but not sure what they do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can start with just this one, and can add more in followup PRs if you feel inclined.
storage.bank_content_update(content) | ||
except Exception as e: | ||
abort(400, *e.args) | ||
return jsonify(content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure/ignorable: I think I wrote something to automatically json-ify the returns, but maybe not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like it's needed on our DB entities, like jsonify(bank)
def hash_media_post_impl() -> dict[str, str]: | ||
""" """ | ||
|
||
def hash_file() -> dict[str, str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: The old name actually was meaningful in that it was referring to the fact that it is one variation of the hash_file implementation - one where the media content is in the form data. There's another version which downloads from a URL. This new name is probably misleading because someone might confuse it with how tx hash <file>
works.
Maybe something like hash_media_from_form_data
might be more accurate? (I believe form data can only be included in a POST)
@@ -16,6 +18,8 @@ | |||
import OpenMediaMatch.storage.interface as iface | |||
from OpenMediaMatch.blueprints import hashing | |||
|
|||
MAX_TIME = int(time.mktime((datetime.now() + timedelta(days=365 * 100)).timetuple())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this but wanted to have a max time for disable_until_ts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: This time will only be calculated on import time. If the server was running for over a year, it would not be possible to set a time at all!
I recommend instead this record a duration like the number of seconds in a year, and calculate whether it's out of range dynamically in the function
nit: MAX_TIME
is a bit generic as a name for this. Maybe something related to bank_content_disable?
@@ -201,7 +209,11 @@ def _bank_add_signals( | |||
abort(400, f"Invalid {name} signal: {str(e)}") | |||
|
|||
content_config = iface.BankContentConfig( | |||
id=0, disable_until_ts=0, collab_metadata={}, original_media_uri=None, bank=bank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dcallies was this meant to add new content with disable_until_ts=0
?
actually it looks like disable_until_ts
and collab_metadata
don't get used in storage.bank_add_content
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually it looks like disable_until_ts and collab_metadata don't get used in storage.bank_add_content
Might be needed to complete the issue then!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @aryzle ! Sorry for the slow review, I've been traveling in the last week.
We're getting closer! This looks like a decent scope for the size of this PR, but if you'd like, the only functional changes that are needed is the updates to the API that allows marking a content as disabled.
blocking: I think you are close enough to the right version that you should add some new unittests for you new APIs to make sure they get the right values! This will be easier if you add the GET option for bank content, since you can then just test the full create-update-delete cycle.
@@ -16,6 +18,8 @@ | |||
import OpenMediaMatch.storage.interface as iface | |||
from OpenMediaMatch.blueprints import hashing | |||
|
|||
MAX_TIME = int(time.mktime((datetime.now() + timedelta(days=365 * 100)).timetuple())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: This time will only be calculated on import time. If the server was running for over a year, it would not be possible to set a time at all!
I recommend instead this record a duration like the number of seconds in a year, and calculate whether it's out of range dynamically in the function
nit: MAX_TIME
is a bit generic as a name for this. Maybe something related to bank_content_disable?
except IntegrityError: | ||
abort(403, "Bank already exists") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: This is an update function, the bank is expected to exist (otherwise it will 404 on L82). In what circumstances are IntegrityError thrown?
except Exception as e: | ||
abort(500, *e.args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: What happens if you don't catch the exception at all? If you call this abort I think it may "hide" the root exception (by replacement).
I suspect the default behavior is also to 500, so you may not need this explicit abort.
@bp.route("/bank/<bank_name>/content/<content_id>", methods=["PUT"]) | ||
def bank_update_content(bank_name: str, content_id: int): | ||
""" | ||
Update the metadata for a banked content item. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can start with just this one, and can add more in followup PRs if you feel inclined.
except ValueError as e: | ||
abort(400, *e.args) | ||
except KeyError as e: | ||
abort(404, *e.args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are also potentially catching exceptions from storage.bank_content_update(content)
which may need to be a 500.
Instead of throwing, catching, and aborting, just abort(400
) inline.
if enabled_only: | ||
query = query.join(database.BankContent).where( | ||
( | ||
database.BankContent.disable_until_ts | ||
== interface.BankContentConfig.ENABLED | ||
) | ||
or ( | ||
database.BankContent.disable_until_ts | ||
< func.extract("epoch", func.now()) | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking question: Hmm, I'm not sure about this change because I'm worried it moves some of this computation to the database on non-indexed fields. What are the tradeoffs in having the caller do this filtering instead?
( | ||
database.BankContent.disable_until_ts | ||
== interface.BankContentConfig.ENABLED | ||
) | ||
or ( | ||
database.BankContent.disable_until_ts | ||
< func.extract("epoch", func.now()) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I don't think this quite works - check out the value of those constants:
I think maybe if we were smarted, we would have chosen an arbitrarily large value for the disabled time, which would make the time check consistent.
@@ -72,7 +72,7 @@ def build_index( | |||
index_cls = for_signal_type.get_index_cls() | |||
signal_list = [] | |||
last_cs = None | |||
for last_cs in bank_store.bank_yield_content(for_signal_type): | |||
for last_cs in bank_store.bank_yield_content(for_signal_type, enabled_only=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking; Sadly, I think this may be more complex. Since the index is only built in snapshots, you may exclude a hash that will be enabled one second from now, where the better behavior is to probably index it anyways.
If you choose to follow through with this change, I recommend only excluding the hashes that are permanently disabled, and indexing all the ones that enabled by time.
ignorable/alt: My recommendation is to give less options to bank_yield_content
and make the determination in python code here.
storage = get_storage() | ||
current_app.logger.debug("getting bank content") | ||
current_app.logger.debug(raw_results) | ||
contents = storage.bank_content_get(raw_results) | ||
enabled = [c for c in contents if c.enabled] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like this enable check will handle it for you already even if it's indexed.
@@ -212,6 +224,44 @@ def _bank_add_signals( | |||
} | |||
|
|||
|
|||
@bp.route("/bank/<bank_name>/content/<content_id>", methods=["PUT"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alt: Consider also adding a GET version of this, which would then round out the CRUD and make unittesting easier for you.
Summary
fixes #1682
minor refactors
BankConfig -> IBank
BankContentConfig -> IBankContent
hash_media_post_impl -> hash_file
Test Plan