Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: instrument served apps with Prometheus metrics #31

Merged
merged 1 commit into from
Apr 17, 2024

Conversation

squat
Copy link
Contributor

@squat squat commented Jan 15, 2024

This commit adds basic Prometheus instrumentation of the HTTP server of
all applications served with serve=True. This lets us conveniently
monitor how models are performing.

Signed-off-by: Lucas Servén Marín [email protected]

@squat squat force-pushed the instrument_served_apps branch from 3708ffe to 07c7e4e Compare January 16, 2024 01:24
@isidentical
Copy link
Collaborator

@squat i'd recommend manually testing it out with a basic python server and trying fal fn run t.py test_server to see the exact error, integration tests just show the server couldn't start and i assume its because of the loopy stuff but there might be others as well

@squat squat force-pushed the instrument_served_apps branch from 07c7e4e to 924ca6f Compare April 11, 2024 08:16
projects/fal/src/fal/api.py Outdated Show resolved Hide resolved
@squat squat requested a review from efiop April 11, 2024 08:20
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but at least 1 test is failing with 503 (could be unrelated, but restart didn't help)

@kakkoyun
Copy link
Contributor

/subscribe

@squat squat force-pushed the instrument_served_apps branch 2 times, most recently from 183312d to d490a21 Compare April 17, 2024 18:15
@squat
Copy link
Contributor Author

squat commented Apr 17, 2024

re-posting a conversation I had with @efiop on diagnosing the mysterious test failures we would get when we tried to import the PrometheusMiddleware in the _build_app func:

the problem was that re generate the metadata during app registration (not in the isolate agent). this means that the func to build the open api schema is run locally, which means that the middleware packages have to be installed locally as well as in the agent. So when we imported from starlette_exporter import PrometheusMiddleware this would raise an exception. HOWEVER we catch and ignore all exceptions raised during metadata generation and leave an empty metadata

if metadata is None:
metadata = {}
# TODO: let the user send more metadata than just openapi
if isinstance(func, ServeWrapper):
# Assigning in a separate property leaving a place for the user
# to add more metadata in the future
try:
metadata["openapi"] = func.openapi()
except Exception as e:
print(
f"[warning] Failed to generate OpenAPI metadata for function: {e}"
)
, which is why we would get a 200 but with an empty payload and no error

thanks @chamini2 for the pointers that lead to this fix!

projects/fal/pyproject.toml Outdated Show resolved Hide resolved
Comment on lines +910 to +917
# TODO(squat): handle shutdowns gracefully.
# You cannot add signal handlers to any loop if you're not
# on the main thread.
# How can we detect that we are being shut down and stop the
# uvicorn servers gracefully?
# loop = asyncio.get_running_loop()
# loop.add_signal_handler(signal.SIGINT, event.set)
# loop.add_signal_handler(signal.SIGTERM, event.set)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered running them in separate threads? Also interesting if there are implications to running two servers in the same event loop (basically same thread?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we run our two servers in the same event loop for all of our isolate cloud instances. We could run them in different threads, but that doesn't really fix this issue commented out here at all. We have to fix signal handling in the isolate agent to signal the thread in this fal app using a threading event, then this fal app can decide what to do with that event, e.g. set an asyncio event for async servers somehow call stop on the uvicorn workers. In short, threading vs async doesn't fix this graceful shutdown at all. I prefer to use async for our multiple servers. Some of our internal fal isolate servers have >3 servers running in the same event loop: grpc, http, metrics, etc. Eventually I'd like to define our async-friendly uvicorn server class here in fal-ai/fal and import it in our internal tooling.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, didn't know that we have an additional issue with the agent. Not a blocker for me, we can fix this when we need to.

This commit adds basic Prometheus instrumentation of the HTTP server of
all applications. This lets us conveniently monitor how models are
performing.

Signed-off-by: Lucas Servén Marín <[email protected]>
@squat squat force-pushed the instrument_served_apps branch from d490a21 to c3e03cc Compare April 17, 2024 22:21
@efiop efiop merged commit 77be50d into main Apr 17, 2024
5 checks passed
@efiop efiop deleted the instrument_served_apps branch April 17, 2024 22:41
efiop added a commit that referenced this pull request Apr 23, 2024
`gather` doesn't cancel other tasks automatically, e.g. causing metrics
server to keep running if user app failed to `setup()`.

Caused by #31
efiop added a commit that referenced this pull request Apr 23, 2024
`gather` doesn't cancel other tasks automatically, e.g. causing metrics
server to keep running if user app failed to `setup()`.

Caused by #31
efiop added a commit that referenced this pull request Apr 23, 2024
`gather` doesn't cancel other tasks automatically, e.g. causing metrics
server to keep running if user app failed to `setup()`.

Caused by #31
efiop added a commit that referenced this pull request Apr 23, 2024
`gather` doesn't cancel other tasks automatically, e.g. causing metrics
server to keep running if user app failed to `setup()`.

Caused by #31
efiop added a commit that referenced this pull request Apr 23, 2024
`gather` doesn't cancel other tasks automatically, e.g. causing metrics
server to keep running if user app failed to `setup()`.

Caused by #31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants