feat(android): AOSP observability blog post #534

Open: @Blisse wants to merge 1 commit into master from victorlai/interrupt

@Blisse (Contributor) commented Nov 25, 2024

No description provided.

cloudflare-workers-and-pages bot commented Nov 25, 2024

Deploying interrupt with Cloudflare Pages

Latest commit: 3837eac
Status: ✅  Deploy successful!
Preview URL: https://8d209cc3.interrupt.pages.dev
Branch Preview URL: https://victorlai-interrupt.interrupt.pages.dev

@Blisse Blisse force-pushed the victorlai/interrupt branch from 4827e2f to 3837eac Compare November 26, 2024 00:57
@Blisse Blisse marked this pull request as ready for review November 26, 2024 18:16
@Blisse Blisse requested a review from a team as a code owner November 26, 2024 18:16

There are many good reasons to build on top of AOSP. Android is really the only option for custom, touch-first experiences, and the popularity and price point of phone, tablet, and watch form factors are attractive for many kinds of products. Android SoC vendors bundle support for many common capabilities by default: Bluetooth, LTE, Wi-Fi, cameras, batteries, sensors, and more. Android also has an incredibly strong community and ecosystem, the continued support of Google, and millions of apps and developers.

But people often don’t distinguish enough between Android developers and AOSP developers. While Android *app* development is incredibly popular, especially as a hobby, AOSP development is only really done for work due to the custom SoC required (outside of the LineageOS or GrapheneOS folks). More importantly, the actual work done by Android versus AOSP developers is often totally different. Android *app* developers are generally only involved with product-centric app development (which in itself is a huge problem space), whereas AOSP developers may develop some apps, but also work much closer to the kernel by writing drivers, implementing vendor or hardware interfaces, creating and updating sepolicy, and more.

Member commented:

s/for work/professionally


Android developers are very familiar with Android **app** observability tools - everyone by default integrates Crashlytics, Bugsnag, Sentry, Instabug, or another observability SDK. When I worked on an AOSP product, we actually built 10 or so apps to replicate some device functionality, so we had to integrate the observability SDK into all 10 apps. But what about the other 10 to 100 apps and native services that we don't normally touch that were part of the generic AOSP implementation?

When you build on top of AOSP, you’re actually potentially running 4 different sources of "apps": your own custom-built apps, your SoC vendor’s apps, the generic AOSP apps, and if you have an app store, third-party apps installed from the app store. You can only install an app observability SDK in your own custom-built apps; you can’t install SDKs into all these other apps, so you immediately hit the limit of what information app observability SDKs can provide.

Member commented:

When you build a device on top of AOSP


![Apps you don't build yourself can't integrate app observability SDKs](/img/aosp-observability/app-sources.svg)

There are also 4 kinds of “apps”: (1) the common Android app written in Java or Kotlin, (2) native apps written in C++ using the Android NDK, (3) binaries written in C++ or Rust, and (4) init.rc shell services which can invoke the binaries and more. App observability SDKs often come with NDK support, but they run into the same problem of only working on apps you build yourself, and there are 10 to 100 other apps running on the device that aren’t monitored.

Member commented:

The 4 types seem a bit muddled (what's the difference between native apps using C++ vs binaries written in C++?). Mention the kernel? Or maybe divide between "apps" whether native or not, vs system services, vs kernel?


As time went on though, we started running into more and more situations where the cached logs didn’t actually span the time of the incident, so they were basically useless. This is the natural result of a reactive pull-based model, and so we got to brainstorming what other solutions were possible.

We eventually landed on the desire for a push-based model - literally a push messaging system, where we could poke the device remotely to trigger an upload. We also designed an app that could query FileProviders and ContentProviders for any additional data. But we hit a couple of roadblocks. First, our device was not GMS-compliant, so we didn’t have Firebase Cloud Messaging and would have to integrate our own push messaging system. Second, we needed a backend team to build the push backend, the push trigger, and a storage solution, but no team had the budget to help and our own teams didn’t have the expertise to contribute. This ends up being a common sticking point with internal tools for mobile teams - Android and AOSP developers have a lot of mobile experience and expertise, but often need help creating a maintainable backend solution that makes sense for the volume and cost of storing and processing the data.

Member commented:

Define GMS


### Android Bug Reports

Logs are one tool to diagnose an issue, but Android has so much more information if you can access it - namely, Android Bug Reports, which gather data from all over the device into a single zip file. So any AOSP observability solution that can capture and upload logcat logs automatically may also want a way to trigger and upload Android Bug Reports when needed. That is why Bug Reports are often the second thing that developers build when they need more AOSP observability.

Member commented:

This section makes it sound like bugreports are the answer (not sure how deep you want to go into the alternatives, but maybe mention e.g. that they contain a complete dumpsys dump, which could also be captured individually)


## Understand the trends

As we encountered more bugs and downloaded more logs, several colleagues wrote a number of Python parsers that they would run on the logs to rule out certain problems or to filter for certain crashes. Whenever they could identify a new issue in a log, they wrote a parser for it and reused it the next time.
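
To make this concrete, here is a minimal sketch of the kind of parser described above. It is not the actual tooling from that project. It assumes logs captured in logcat's `threadtime` format and looks for the `FATAL EXCEPTION` lines that the `AndroidRuntime` tag logs when a Java or Kotlin app crashes. The function name `find_java_crashes` and the shape of the returned dictionaries are hypothetical.

```python
import re

# Matches a logcat "threadtime" line, for example:
# "11-25 18:16:02.123  1234  1234 E AndroidRuntime: FATAL EXCEPTION: main"
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<pid>\d+)\s+(?P<tid>\d+)\s+(?P<level>[VDIWEF])\s+"
    r"(?P<tag>[^:]+?)\s*:\s(?P<msg>.*)$"
)
# AndroidRuntime follows the FATAL EXCEPTION line with the crashing process.
PROCESS_RE = re.compile(r"^Process: (?P<process>\S+), PID: \d+")


def find_java_crashes(lines):
    """Return one dict per Java/Kotlin crash found in the given log lines.

    Hypothetical helper: each dict holds the crash timestamp and, when the
    follow-up "Process:" line is present, the crashing process name.
    """
    crashes = []
    current = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group("tag") != "AndroidRuntime":
            continue
        msg = match.group("msg")
        if msg.startswith("FATAL EXCEPTION"):
            # A new crash starts here; the process name arrives on a later line.
            current = {"timestamp": match.group("ts"), "process": None}
            crashes.append(current)
        elif current is not None and current["process"] is None:
            process_match = PROCESS_RE.match(msg)
            if process_match:
                current["process"] = process_match.group("process")
    return crashes
```

Calling `find_java_crashes(open("logcat.txt"))` on a downloaded log (the file name is only an example) would return a list that is easy to filter by process or time, which is roughly the point where a one-off script starts turning into reusable tooling.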

Member commented:

Maybe add an example here of what we'd be looking for in the logs


Most teams that develop Android hardware already buy Android app observability tools, like Crashlytics, plus a general-purpose product analytics tool, like Grafana, for their backend, so it seems natural to try to fit AOSP data into these tools. But these tools are not designed for AOSP; they’re designed for apps or for servers, so their data models or pricing structures can feel like they don’t make sense.

- Existing Android app observability tools have a battle-tested data model for Java and Kotlin app crashes and C++ tombstones. But there’s really no way to record any other forms of data, so you have to coerce other kinds of crashes, like kernel panics, into the data model, which can be finicky and tedious to manually manage.

Member commented:

nit: not really tombstones?

Member commented:

Also feels like we should mention somewhere here: that part of the solution is collecting the right information on the device (e.g. dropbox + dumpsys vs bugreport) - signal/noise.


There's a ton of content on what observability means for Android apps, but content and tooling for AOSP device observability feels sparse in comparison. Most AOSP developers end up building internal observability tooling because of the lack of standard tooling. And most AOSP developers end up on a similar journey: they discover the limits of their app observability tool, and they figure they want to pull logcat logs, then they find a way to push the logs from the device, and then they think of ways to pull the logs, and then discover even more Android data they want to pull, and then they finally realize they don't want to pull and parse the data individually and manually every time there's a problem.

One of the reasons I joined Memfault was to build the tooling I wish I had previously, and to help other Android developers working on AOSP avoid going through the same years of struggle to learn the same lessons I did. And there are still many more lessons in AOSP observability to learn; if you have experience in observability at scale, I'd love to hear from you.

Member commented:

I like the narrative - I think it would be great with some example problems that you were trying to diagnose at scale, how logs might have helped you find them on an individual device, but not at the fleet level, etc - then whether either FWLS/L2M or other on-device stuff e.g. dumpsys feeding into a metrics-like system would have helped you.

If this blog is targeted at "monitoring" (vs "debugging") then I think that's a good start (just needs a bit of editing). But I'm wondering if we should also discuss how to make sense of the data for an individual device (i.e. timeline + device dashboard replaces that python tooling, to visualize)

@gminn (Member) left a comment:

Steve has some great suggestions already, I just have a few minor comments

- While everyone agrees that knowing how to comb through logs is an incredibly valuable skill to develop, at some point in time the tediousness wears on you, and you’d rather have something tell you what and where the useful insight is.
- As the number of devices grows, figuring out the relative scale of an issue by manually parsing each device’s logs becomes infeasible, and you really want to know how many devices are impacted by a certain issue to determine the priority.

At this point, teams often start to recognize another tool is needed to progress from *debugging* individual devices, to *monitoring* trends across the entire fleet.
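
As a rough illustration of the fleet-level question, the sketch below counts how many distinct devices reported each issue. The `(device_id, signature)` pairs and the function name `devices_per_issue` are assumptions made for this example; in practice they would come from whatever parser or pipeline extracts issues from each device's logs.

```python
from collections import defaultdict


def devices_per_issue(reports):
    """Count distinct devices per issue signature.

    `reports` is an iterable of (device_id, signature) pairs. The result is a
    list of (signature, device_count) tuples, most widespread issue first.
    """
    devices = defaultdict(set)
    for device_id, signature in reports:
        # A set ignores repeat reports of the same issue from one device.
        devices[signature].add(device_id)
    counts = {signature: len(ids) for signature, ids in devices.items()}
    # Sort by device count descending, then by signature for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Counting devices rather than raw occurrences matters here: one device stuck in a crash loop can produce thousands of log lines, but it is still one impacted device when you are deciding priority.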

Member commented:

This would be a great opportunity to plug your logging article as an aside!

Comment on lines +77 to +78
- Existing Android app observability tools have a battle-tested data model for Java and Kotlin app crashes and C++ tombstones. But there’s really no way to record any other forms of data, so you have to coerce other kinds of crashes, like kernel panics, into the data model, which can be finicky and tedious to manually manage.
- Combining an Android app observability tool with a product analytics tool might get you the best of both worlds, if you understand all the limitations involved. For example, most observability SDKs are clearly server-focused (such as by providing a Java SDK as opposed to an Android SDK), so they often have no regard for data usage, bandwidth, battery use, or device performance, and it can be a minefield in a lot of ways.

Member commented:

I don't fully follow whether we need to break out these two points into bullets (I probably need to read this a few times to follow it) -- should they perhaps just be separate paragraphs?


If you can make a decision on what to buy, then it’s up to the team to build whatever pipeline will collect, transform, and send the device-level data into the observability tools. There can be many interesting decisions to make around which kinds of processing belong on the device versus in the cloud, depending on whether the team can act full-stack or not, what data is being collected, and the costs involved. A dedicated engineer or team may need to be staffed to develop the necessary functionality or to manage its costs.

![A simplified view of the eventual AOSP observability pipeline](/img/aosp-observability/analytics-pipeline.svg)

Member commented:

I love this diagram! Can you add a legend explaining the color-coding and introduce the image at the end of the paragraph above? e.g.:

"It is a complex pipeline but I tried to break it down into some basic, key pieces below:"


As time went on though, we started running into more and more situations where the cached logs didn’t actually span the time of the incident, so they were basically useless. This is the natural result of a reactive pull-based model, and so we got to brainstorming what other solutions were possible.

We eventually landed on the desire for a push-based model - literally a push messaging system, where we could poke the device remotely to trigger an upload. We also designed an app that could query FileProviders and ContentProviders for any additional data. But we hit a couple of roadblocks. First, our device was not GMS-compliant, so we didn’t have Firebase Cloud Messaging and would have to integrate our own push messaging system. Second, we needed a backend team to build the push backend, the push trigger, and a storage solution, but no team had the budget to help and our own teams didn’t have the expertise to contribute. This ends up being a common sticking point with internal tools for mobile teams - Android and AOSP developers have a lot of mobile experience and expertise, but often need help creating a maintainable backend solution that makes sense for the volume and cost of storing and processing the data.

Member commented:

Some real-world context of when this would happen may help to mention here, such as if a customer reports an issue or you get other signals that the device is acting up and want to dig deeper


3 participants