feat(android): AOSP observability blog post #534

Blisse · 2024-11-25T10:09:02Z

No description provided.

cloudflare-workers-and-pages · 2024-11-25T10:10:31Z

Deploying interrupt with Cloudflare Pages

Latest commit:	`3837eac`
Status:	✅ Deploy successful!
Preview URL:	https://8d209cc3.interrupt.pages.dev
Branch Preview URL:	https://victorlai-interrupt.interrupt.pages.dev


sjp4 · 2024-11-27T11:20:23Z

_drafts/aospobservability.md

+
+There are definitely many good reasons to build on top of AOSP. Android is really the only option for custom, touch-first experiences, and the popularity and price point of phone, tablet, and watch form factors is really attractive for a lot of different product experiences. Android SoC vendors bundle support for a lot of common capabilities by default: Bluetooth, LTE, Wi-Fi, cameras, batteries, sensors, and more. And Android also has an incredibly strong community and ecosystem, the continued support of Google, and millions of apps and developers.
+
+But people often don’t distinguish between Android developers and AOSP developers enough. While Android *app* development is incredibly popular especially as a hobby, AOSP development is only really done for work due to the custom SoC required (outside of the LineageOS or GrapheneOS folks). More importantly, the actual work done by Android versus AOSP developers is often totally different. Android *app* developers are generally only involved with product-centric app development (which in itself is a huge problem space), whereas AOSP developers may develop some apps, but also work much closer to the kernel, by writing drivers, implementing vendor or hardware interfaces, creating and updating sepolicy, and more.


s/for work/professionally

sjp4 · 2024-11-27T11:20:52Z

_drafts/aospobservability.md

+
+Android developers are very familiar with Android **app** observability tools - everyone by default integrates Crashlytics, Bugsnag, Sentry, Instabug, or another observability SDK. When I worked on an AOSP product, we actually built 10 or so apps to replicate some device functionality, so we had to integrate the observability SDK into all 10 apps. But what about the other 10 to 100 apps and native services that we don't normally touch that were part of the generic AOSP implementation?
+
+When you build on top of AOSP, you’re actually potentially running 4 different sources of "apps": your own custom-built apps, your SoC vendor’s apps, the generic AOSP apps, and if you have an app store, third-party apps installed from the app store. You can only install an app observability SDK in your own custom-built apps, you can’t install SDKs into all these other apps, so you immediately hit the limit of what information app observability SDKs can provide.


When you build a device on top of AOSP

sjp4 · 2024-11-27T11:22:22Z

_drafts/aospobservability.md

+
+![Apps you don't build yourself can't integrate app observability SDKs](/img/aosp-observability/app-sources.svg)
+
+There are also 4 kinds of “apps”: (1) the common Android app written in Java or Kotlin, (2) native apps using C++ using the Android NDK, (3) binaries written in C++ or Rust, and (4) init.rc shell services which can invoke the binaries and more. App observability SDKs often come with NDK support, but they run into the same problem of only working on apps you build yourself, and there are 10 to 100 other apps running on the device that aren’t monitored.


The 4 types seem a bit muddled (what's the difference between native apps using C++ vs binaries written in C++?). Mention the kernel? Or maybe divide between "apps" whether native or not, vs system services, vs kernel?

sjp4 · 2024-11-27T11:23:34Z

_drafts/aospobservability.md

+
+As time went on though, we started running into more and more situations where the cached logs didn’t actually span the time of the incident, so they were basically useless. This is the natural result of a reactive pull-based model, and so we got to brainstorming what other solutions were possible.
+
+We eventually landed on the desire for a push-based model - literally a push messaging system, where we could poke the device remotely to trigger an upload. And we designed an app that could query FileProviders and ContentProviders for any additional data. But we hit a couple of roadblocks: our device was not GMS-compliant so we didn’t have Firebase Cloud Messaging and would have to integrate our own push messaging system, and we needed to enlist the help of a backend team to build the push backend and the push trigger, but also a storage solution, and no team had the budget to help and our teams didn’t have the expertise to contribute. This ends up being a common sticking point with internal tools for mobile teams - Android and AOSP developers have a lot of mobile experience and expertise, but often need help creating a maintainable backend solution that makes sense for the volume and cost of storing and processing the data.


sjp4 · 2024-11-27T11:25:39Z

_drafts/aospobservability.md

+
+### Android Bug Reports
+
+Logs are one tool to diagnose an issue, but Android has so much more information if you can access it - namely, Android Bug Reports, which gather data from all over the device into a single zip file. So any AOSP observability solution that can capture and upload logcat logs automatically may also want to build a way to trigger and upload Android Bug Reports when needed. Bug Reports are therefore often the second thing that developers build when they need more AOSP observability.


This section makes it sound like bugreports are the answer (not sure how deep you want to go into the alternatives, but maybe mention e.g. that they contain a complete dumpsys dump, which could also be captured individually)
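
To illustrate the "captured individually" point, here is a minimal sketch that parses the text printed by a single dumpsys service instead of unpacking a whole bug report. It assumes the `key: value` layout that `dumpsys battery` commonly prints; the sample text and the function name `parse_dumpsys_battery` are hypothetical and only there for illustration.

```python
# Hypothetical sample of `dumpsys battery` output; real devices may print more fields.
SAMPLE_DUMPSYS_BATTERY = """\
Current Battery Service state:
  AC powered: false
  USB powered: true
  status: 2
  health: 2
  present: true
  level: 85
  scale: 100
  voltage: 4100
  temperature: 275
  technology: Li-ion
"""


def parse_dumpsys_battery(output: str) -> dict:
    """Turn indented `key: value` lines into a dict of strings."""
    fields = {}
    for line in output.splitlines():
        key, separator, value = line.strip().partition(": ")
        if separator:  # skips the header line, which has no "key: value" pair
            fields[key] = value
    return fields


battery = parse_dumpsys_battery(SAMPLE_DUMPSYS_BATTERY)
print(battery["level"], battery["technology"])
```

On a development machine the input text could come from `adb shell dumpsys battery`; a few such targeted captures are much smaller than a full bug report, at the cost of deciding up front which services matter.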

sjp4 · 2024-11-27T11:26:23Z

_drafts/aospobservability.md

+
+## Understand the trends
+
+As we encountered more and more bugs, and downloaded more and more logs, several colleagues decided to write a number of different parsers in Python that they would run on the logs to rule out certain problems, or filter the logs for certain crashes. As they encountered more and more issues, if they could identify the issue in the log, they could write a parser for that issue and re-use it for next time.


Maybe add an example here of what we'd be looking for in the logs
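
One possible example, as a minimal sketch rather than the parsers the team actually wrote: counting fatal Java/Kotlin exceptions per process in a logcat capture. It assumes logcat's `threadtime` output and the `FATAL EXCEPTION` / `Process: ..., PID: ...` lines that `AndroidRuntime` logs on an uncaught exception; the sample log, package names, and function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical logcat excerpt (threadtime format); package names are made up.
SAMPLE_LOG = """\
11-27 11:20:01.123  1234  1234 E AndroidRuntime: FATAL EXCEPTION: main
11-27 11:20:01.124  1234  1234 E AndroidRuntime: Process: com.example.launcher, PID: 1234
11-27 11:20:01.125  1234  1234 E AndroidRuntime: java.lang.NullPointerException
11-27 11:25:42.500  2345  2345 I ActivityManager: Start proc 2345:com.example.settings/1000
11-27 11:31:10.010  3456  3456 E AndroidRuntime: FATAL EXCEPTION: main
11-27 11:31:10.011  3456  3456 E AndroidRuntime: Process: com.example.launcher, PID: 3456
"""

# Matches the "Process: <name>, PID: <pid>" line that follows a fatal exception header.
PROCESS_LINE = re.compile(
    r"E AndroidRuntime: Process: (?P<process>[\w.:]+), PID: (?P<pid>\d+)"
)


def count_java_crashes(log_text: str) -> Counter:
    """Count fatal Java/Kotlin exceptions per process name."""
    crashes = Counter()
    saw_fatal = False
    for line in log_text.splitlines():
        if "E AndroidRuntime: FATAL EXCEPTION" in line:
            saw_fatal = True
            continue
        match = PROCESS_LINE.search(line)
        if saw_fatal and match:
            crashes[match.group("process")] += 1
            saw_fatal = False
    return crashes


print(count_java_crashes(SAMPLE_LOG).most_common())
```

Each new issue would get its own small matcher like this, which is roughly the "write a parser and re-use it next time" workflow the paragraph describes.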

sjp4 · 2024-11-27T11:27:17Z

_drafts/aospobservability.md

+
+Most teams that develop Android hardware already buy Android app observability tools, like Crashlytics, plus a general purpose product analytics tool, like Grafana, for their backend, so it seems natural to try to fit AOSP data into these tools. But these tools are not designed for AOSP, they’re designed for apps or for servers, and so their data models or pricing structures can feel like they don’t make sense.
+
+- Existing Android app observability tools have a battle-tested data model for Java and Kotlin app crashes and C++ tombstones. But there’s really no way to record any other forms of data, so you have to coerce other kinds of crashes, like kernel panics, into the data model, which can be finicky and tedious to manually manage. -


nit: not really tombstones?

Also feels like we should mention somewhere here: that part of the solution is collecting the right information on the device (e.g. dropbox + dumpsys vs bugreport) - signal/noise.

sjp4 · 2024-11-27T11:33:22Z

_drafts/aospobservability.md

+
+There's a ton of content on what observability means for Android apps, but content and tooling for AOSP device observability feels sparse in comparison. Most AOSP developers end up building internal observability tooling because of the lack of standard tooling. And most AOSP developers end up on a similar journey: they discover the limits of their app observability tool, and they figure they want to pull logcat logs, then they find a way to push the logs from the device, and then they think of ways to pull the logs, and then discover even more Android data they want to pull, and then they finally realize they don't want to pull and parse the data individually and manually every time there's a problem.
+
+One of the reasons I joined Memfault was to build the tooling I wish I had previously, and to help other Android developers working on AOSP avoid going through the same years of struggles to learn the same lessons I had. And there are still many more lessons in AOSP observability to learn. If you have experience in observability at scale, I'd love to hear from you.


I like the narrative - I think it would be great with some example problems that you were trying to diagnose at scale, how logs might have helped you find them on an individual device, but not at the fleet level, etc - then whether either FWLS/L2M or other on-device stuff e.g. dumpsys feeding into a metrics-like system would have helped you.

If this blog is targeted at "monitoring" (vs "debugging") then I think that's a good start (just needs a bit of editing). But I'm wondering if we should also discuss how to make sense of the data for an individual device (i.e. timeline + device dashboard replaces that python tooling, to visualize)

gminn

Steve has some great suggestions already, I just have a few minor comments

gminn · 2025-01-21T21:22:24Z

_drafts/aospobservability.md

+- While everyone agrees that knowing how to comb through logs is an incredibly valuable skill to develop, at some point in time the tediousness wears on you, and you’d rather have something tell you what and where the useful insight is.
+- As the number of devices grows, figuring out the relative scale of an issue by manually parsing each device’s logs becomes infeasible, and you really want to know how many devices are impacted by a certain issue to determine the priority.
+
+At this point, teams often start to recognize another tool is needed to progress from *debugging* individual devices, to *monitoring* trends across the entire fleet.


This would be a great opportunity to plug your logging article as an aside!
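
On the fleet-level bullet in this hunk, a small sketch may help make "how many devices are impacted" concrete. It assumes each device's logs have already been reduced to `(device_id, issue)` pairs by some earlier parsing step; the device IDs, issue names, and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical parsed results: one (device_id, issue) pair per occurrence found in a log.
REPORTS = [
    ("device-a", "launcher crash"),
    ("device-b", "launcher crash"),
    ("device-a", "launcher crash"),
    ("device-c", "wifi firmware reset"),
]


def devices_per_issue(reports):
    """Return (issue, distinct device count) pairs, most widespread issue first."""
    devices = defaultdict(set)
    for device_id, issue in reports:
        devices[issue].add(device_id)  # a set ignores repeat occurrences on one device
    return sorted(
        ((issue, len(ids)) for issue, ids in devices.items()),
        key=lambda item: (-item[1], item[0]),
    )


for issue, device_count in devices_per_issue(REPORTS):
    print(f"{issue}: {device_count} device(s)")
```

Counting distinct devices rather than raw occurrences is a deliberate choice here: one device crash-looping should not outrank an issue that affects a large share of the fleet.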

gminn · 2025-01-21T21:27:10Z

_drafts/aospobservability.md

+- Existing Android app observability tools have a battle-tested data model for Java and Kotlin app crashes and C++ tombstones. But there’s really no way to record any other forms of data, so you have to coerce other kinds of crashes, like kernel panics, into the data model, which can be finicky and tedious to manually manage. -
+- Combining an Android app observability tool with a product analytics tool might get you the best of both worlds, if you understand all the limitations involved. For example, most observability SDKs are clearly server-focused (such as by providing a Java SDK as opposed to an Android SDK), so they often have no regard for data usage, bandwidth, battery use, or device performance, and it can be a minefield in a lot of ways.


I don't fully follow whether we need to break out these two points into bullets (I probably need to read this a few times to follow it) -- should they perhaps just be separate paragraphs?

gminn · 2025-01-21T21:30:01Z

_drafts/aospobservability.md

+
+If you can make a decision on what to buy, then it’s up to the team to build whatever pipeline will collect, transform, and send the device-level data into the observability tools. There can be so many interesting decisions to make around which kinds of processing belong on the device versus in the cloud, depending on whether the team can act full-stack or not, what data is being collected, and the costs involved. A dedicated engineer or team may need to be staffed to develop the necessary functionality or to manage its costs.
+
+![A simplified view of the eventual AOSP observability pipeline](/img/aosp-observability/analytics-pipeline.svg)


I love this diagram! Can you add a legend explaining the color-coding and introduce the image at the end of the paragraph above? e.g.:

"It is a complex pipeline but I tried to break it down into some basic, key pieces below:"

gminn · 2025-01-21T21:33:29Z

_drafts/aospobservability.md

+
+As time went on though, we started running into more and more situations where the cached logs didn’t actually span the time of the incident, so they were basically useless. This is the natural result of a reactive pull-based model, and so we got to brainstorming what other solutions were possible.
+
+We eventually landed on the desire for a push-based model - literally a push messaging system, where we could poke the device remotely to trigger an upload. And we designed an app that could query FileProviders and ContentProviders for any additional data. But we hit a couple of roadblocks: our device was not GMS-compliant so we didn’t have Firebase Cloud Messaging and would have to integrate our own push messaging system, and we needed to enlist the help of a backend team to build the push backend and the push trigger, but also a storage solution, and no team had the budget to help and our teams didn’t have the expertise to contribute. This ends up being a common sticking point with internal tools for mobile teams - Android and AOSP developers have a lot of mobile experience and expertise, but often need help creating a maintainable backend solution that makes sense for the volume and cost of storing and processing the data.


Some real-world context of when this would happen may help to mention here, such as if a customer reports an issue or you get other signals that the device is acting up and want to dig deeper

feat(android): AOSP observability blog post

3837eac

Blisse force-pushed the victorlai/interrupt branch from 4827e2f to 3837eac Compare November 26, 2024 00:57

Blisse marked this pull request as ready for review November 26, 2024 18:16

Blisse requested a review from a team as a code owner November 26, 2024 18:16

sjp4 reviewed Nov 27, 2024


gminn requested changes Jan 21, 2025


