feat: instrument served apps with Prometheus metrics #31
Conversation
Force-pushed from 3708ffe to 07c7e4e
@squat I'd recommend manually testing it out with a basic Python server and trying
Force-pushed from 07c7e4e to 924ca6f
LGTM, but at least one test is failing with a 503 (could be unrelated, but a restart didn't help).
/subscribe
Force-pushed from 183312d to d490a21
Re-posting a conversation I had with @efiop on diagnosing the mysterious test failures we would get when we tried to import the
Thanks @chamini2 for the pointers that led to this fix!
```python
# TODO(squat): handle shutdowns gracefully.
# You cannot add signal handlers to any loop if you're not
# on the main thread.
# How can we detect that we are being shut down and stop the
# uvicorn servers gracefully?
# loop = asyncio.get_running_loop()
# loop.add_signal_handler(signal.SIGINT, event.set)
# loop.add_signal_handler(signal.SIGTERM, event.set)
```
Have you considered running them in separate threads? I'm also curious whether there are any implications to running two servers in the same event loop (and therefore on basically the same thread?).
We run our two servers in the same event loop on all of our isolate cloud instances. We could run them in different threads, but that doesn't really fix the issue commented out here at all. We'd have to fix signal handling in the isolate agent so it signals the thread in this fal app using a threading event; this fal app could then decide what to do with that event, e.g. set an asyncio event so the async servers can somehow call stop on the uvicorn workers. In short, threading vs. async doesn't change this graceful-shutdown problem at all, and I prefer async for our multiple servers. Some of our internal fal isolate servers have more than three servers running in the same event loop: gRPC, HTTP, metrics, etc. Eventually I'd like to define our async-friendly uvicorn server class here in fal-ai/fal and import it in our internal tooling.
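The bridging described above (a threading event set by the agent, relayed into an asyncio event that the servers in the loop wait on) can be sketched with the standard library alone. `bridge_shutdown` and the timer-driven agent stand-in are hypothetical names for illustration, not fal's actual API:

```python
import asyncio
import threading

def bridge_shutdown(stop_threading: threading.Event,
                    loop: asyncio.AbstractEventLoop) -> asyncio.Event:
    """Return an asyncio.Event that is set whenever stop_threading is set.

    A small watcher thread blocks on the threading.Event and schedules
    the asyncio.Event's set() onto the loop thread-safely.
    """
    stop_async = asyncio.Event()

    def watch() -> None:
        stop_threading.wait()  # blocks this helper thread, not the loop
        loop.call_soon_threadsafe(stop_async.set)

    threading.Thread(target=watch, daemon=True).start()
    return stop_async

async def main() -> None:
    stop_threading = threading.Event()
    loop = asyncio.get_running_loop()
    stop_async = bridge_shutdown(stop_threading, loop)

    # Simulate the agent signalling shutdown from another thread.
    threading.Timer(0.1, stop_threading.set).start()

    await stop_async.wait()
    # At this point each uvicorn server in the loop could be told to
    # stop and drain its connections.
    print("shutting down gracefully")

asyncio.run(main())
```

The same asyncio event can be awaited by any number of servers sharing the loop, which is why the choice of async over threads doesn't change the shutdown design.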
Oh, I didn't know we have an additional issue with the agent. Not a blocker for me; we can fix this when we need to.
This commit adds basic Prometheus instrumentation of the HTTP server of all applications. This lets us conveniently monitor how models are performing.

Signed-off-by: Lucas Servén Marín <[email protected]>
Force-pushed from d490a21 to c3e03cc
`gather` doesn't cancel other tasks automatically, e.g. causing metrics server to keep running if user app failed to `setup()`. Caused by #31
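The failure mode noted above is easy to reproduce with the standard library: `asyncio.gather` propagates the first exception but leaves sibling tasks running. A minimal sketch of one fix, where `run_all`, `metrics_server`, and `user_app` are hypothetical stand-ins rather than the code from this PR:

```python
import asyncio

async def metrics_server() -> None:
    # Stands in for the long-running metrics endpoint.
    await asyncio.sleep(3600)

async def user_app() -> None:
    # Stands in for a user app whose setup() fails.
    raise RuntimeError("setup() failed")

async def run_all(*coros) -> None:
    """Run coroutines together; cancel the rest when any one fails.

    Plain gather() would raise here too, but without the finally block
    the metrics task would keep running in the background.
    """
    tasks = [asyncio.ensure_future(c) for c in coros]
    try:
        await asyncio.gather(*tasks)
    finally:
        for t in tasks:
            t.cancel()  # no-op for tasks that already finished

async def main() -> None:
    try:
        await run_all(metrics_server(), user_app())
    except RuntimeError as exc:
        print(f"caught: {exc}")

asyncio.run(main())
```

On Python 3.11+, `asyncio.TaskGroup` provides this sibling-cancellation behaviour natively and would be the idiomatic choice.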
This commit adds basic Prometheus instrumentation of the HTTP server of all applications served with `serve=True`. This lets us conveniently monitor how models are performing.

Signed-off-by: Lucas Servén Marín [email protected]