[PROF-9476] Add experimental profiling managed string storage #725

Open
wants to merge 56 commits into
base: main

ivoanjo
Member

@ivoanjo ivoanjo commented Nov 11, 2024

What does this PR do?

This PR builds on the work started by @AlexJF on #607 to introduce "managed string storage" for profiling.

The idea is to introduce another level of string storage for profiling that is decoupled in lifetime from individual profiles, and that is managed by the libdatadog client.

At its core, managed string storage provides a hashtable that stores strings and returns ids. These ids can then be provided to libdatadog instead of CharSlices when recording profiling samples.

For FFI users, this PR adds the following APIs to manage strings:

  • ddog_prof_ManagedStringStorage_new
  • ddog_prof_ManagedStringStorage_intern(String)
  • ddog_prof_ManagedStringStorage_unintern(id)
  • ddog_prof_ManagedStringStorage_advance_gen
  • ddog_prof_ManagedStringStorage_drop
  • ddog_prof_ManagedStringStorage_get_string

A key detail of the current implementation is that each intern call with the same string increases an internal usage counter, and each unintern call decreases it.

Then at advance_gen time, if the counter is zero, we get rid of the string.
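
To illustrate the counting scheme described above, here is a minimal, self-contained sketch. It is not the code from this PR: `ToyStringStorage` and its field layout are hypothetical, errors are modelled as `Option` rather than the real error types, and there is no FFI layer or locking.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified model of a managed string storage.
pub struct ToyStringStorage {
    next_id: u32,
    str_to_id: HashMap<String, u32>,
    // id -> (string, usage count)
    id_to_data: HashMap<u32, (String, u32)>,
}

impl ToyStringStorage {
    pub fn new() -> Self {
        // Assumption: id 0 is reserved for the empty string and is never counted.
        Self { next_id: 1, str_to_id: HashMap::new(), id_to_data: HashMap::new() }
    }

    /// Returns the id for `s`, bumping its usage counter. `None` signals an error
    /// (id or counter overflow, or the two maps being out of sync).
    pub fn intern(&mut self, s: &str) -> Option<u32> {
        if s.is_empty() {
            return Some(0);
        }
        if let Some(&id) = self.str_to_id.get(s) {
            let entry = self.id_to_data.get_mut(&id)?;
            entry.1 = entry.1.checked_add(1)?;
            return Some(id);
        }
        let id = self.next_id;
        self.next_id = self.next_id.checked_add(1)?;
        self.str_to_id.insert(s.to_owned(), id);
        self.id_to_data.insert(id, (s.to_owned(), 1));
        Some(id)
    }

    /// Decrements the usage counter. The string is NOT removed here.
    pub fn unintern(&mut self, id: u32) -> Option<()> {
        if id == 0 {
            return Some(());
        }
        let entry = self.id_to_data.get_mut(&id)?;
        entry.1 = entry.1.checked_sub(1)?;
        Some(())
    }

    /// Drops every string whose usage counter is zero.
    pub fn advance_gen(&mut self) {
        self.id_to_data.retain(|_, (_, count)| *count > 0);
        let live = &self.id_to_data;
        self.str_to_id.retain(|_, id| live.contains_key(&*id));
    }

    pub fn get_string(&self, id: u32) -> Option<&str> {
        if id == 0 {
            return Some("");
        }
        self.id_to_data.get(&id).map(|(s, _)| s.as_str())
    }
}
```

In this sketch, a string interned twice survives one unintern plus an `advance_gen`, and only disappears after the second unintern and the next `advance_gen`. Deferring removal to `advance_gen` means an id whose counter has dropped to zero stays resolvable until the caller advances the generation.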

Then, to interact with profiles, there's a new ddog_prof_Profile_with_string_storage API to create a profile with a given ManagedStringStorage, and all structures that make up a Sample (Mapping, Function, Label, etc.) have been extended so that they take either a CharSlice or a ManagedStringId.

Thus, after interning all strings for a sample, it's possible to add a sample to a profile entirely by referencing strings by ids, rather than CharSlices.

Motivation

The initial use-case is to support heap profiling -- "samples" related to heap profiling usually live across multiple profiles (as long as a given object is alive) and so this data must be kept somewhere.
Previously for Ruby we were keeping this on the Ruby profiler side, but having libdatadog manage this instead presents a few optimization opportunities.

We also hope to replace a few other "string tables" that other profilers had to build outside of libdatadog for similar use-cases.

This topic was also discussed in the following two documents (Datadog-only, sorry!):

Additional Notes

In keeping with the experimental nature of this feature, I've tried really hard to not disturb existing profiling API users with the new changes.

That is, my goal was: if you're not using managed string storage, you should NOT be affected AT ALL by it, be it through API changes or overhead.

(This is why on the pure-Rust profiling crate side, I ended up duplicating a bunch of structures and functions. I couldn't think of a great way to not disturb existing API users other than introducing alternative methods, but to be honest the duplication is all in very simple methods so I don't think this substantially increases complexity/maintenance vs trying to be smarter to bend Rust to our will.)

There are probably a lot of improvements we can make, but with this PR I'm hoping to get something close to a "good enough" state, so that we can merge it and then start iterating on main, rather than have this continue living in a branch for a lot longer.

This doesn't mean we shouldn't fix or improve things before merging, but I'll be trying to identify what needs to go in now and what can go in as separate, follow-up PRs.

As an addendum, there are still a bunch of `expect`s sprinkled around that should be turned into proper errors. I plan to do a pass on all of those. (But again, none of the panics affect existing code, so they're harmless and inert unless you're experimenting with the new APIs.)

How to test the change?

The branch in https://github.com/DataDog/dd-trace-rb/tree/ivoanjo/prof-9476-managed-string-storage-try2 is where I'm testing the changes on the Ruby profiler side.

It may not be entirely up-to-date with the latest ffi changes on the libdatadog side (I've been prettying up the API), but it shows how to use this concept, while passing all the profiling unit/integration tests, and has shown improvements in memory and latency in the reliability environment.

This will later allow us to introduce the new StringId code without
disturbing existing API clients.

This duplication is intended to be only an intermediate step to
introducing support for StringId.
Credit goes to @AlexJF, this is lifted from his earlier PR
#607
I decided to introduce the `PersistentStringId` name to distinguish
such ids from the `StringId` type used by things that are in the
profile-bound string table.
This makes it clear it's only kept around for experiments.
It doesn't really make sense to make it optional to have a CharSlice
here.
This makes this API a bit more future-proof, as well as a bit more
consistent with other such APIs in libdatadog.
@ivoanjo ivoanjo requested a review from a team as a code owner November 11, 2024 15:41
This reverts commit 624ebd0.

During PR review, I didn't quite remember why and when this was
needed, so let's remove until we figure out exactly why/if this is
needed.
A `MaybeError` is apparently already must_use so this is redundant. TIL.
}

#[no_mangle]
/// TODO: @ivoanjo Should this take a `*mut ManagedStringStorage` like Profile APIs do?
Contributor

@morrisonlevi morrisonlevi Jan 3, 2025

Ugh, this is something even in C I go back and forth on. It's one of those "do I trust the user or do I be more defensive?" things. Setting the C pointer to null makes it easier to debug when things go wrong, and can sometimes even prevent further things from going wrong. Sometimes it also makes it harder to debug because you don't get a use-after-free warning from ASAN, so that can swing both ways. But it's theoretically wholly wasted work because nobody should use the thing after it's been dropped...

Member Author

Yeah, I think the mut option is nice: API calls wrongly made on an already-dropped storage will often get an error back, and since the client should handle errors anyway, we're turning something that's definitely a bug into a nice error message.
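
For readers following the thread, the pattern under discussion can be sketched as below. This is only an illustration under assumptions: `Storage`, `StorageHandle`, `storage_new`, `storage_drop` and `storage_len` are hypothetical names, not libdatadog APIs, and the real functions would also carry `#[no_mangle]` and return proper result types.

```rust
/// Hypothetical stand-in for the real storage type.
pub struct Storage {
    strings: Vec<String>,
}

/// Hypothetical FFI handle: a struct wrapping an owning raw pointer.
#[repr(C)]
pub struct StorageHandle {
    inner: *mut Storage,
}

pub extern "C" fn storage_new() -> StorageHandle {
    let storage = Box::new(Storage { strings: Vec::new() });
    StorageHandle { inner: Box::into_raw(storage) }
}

/// Frees the storage and nulls the pointer inside the handle, so later calls
/// can detect the use-after-drop instead of touching freed memory.
///
/// # Safety
/// `handle` must be null or point to a valid `StorageHandle`.
pub unsafe extern "C" fn storage_drop(handle: *mut StorageHandle) {
    if let Some(handle) = unsafe { handle.as_mut() } {
        if !handle.inner.is_null() {
            drop(unsafe { Box::from_raw(handle.inner) });
            handle.inner = std::ptr::null_mut();
        }
    }
}

/// Returns the number of stored strings, or -1 if the handle is null or was
/// already dropped.
///
/// # Safety
/// `handle` must be null or point to a valid `StorageHandle`.
pub unsafe extern "C" fn storage_len(handle: *const StorageHandle) -> i64 {
    match unsafe { handle.as_ref() } {
        Some(h) if !h.inner.is_null() => unsafe { (*h.inner).strings.len() as i64 },
        _ => -1,
    }
}
```

Because the drop function receives a pointer to the handle rather than the handle by value, it can null the inner pointer. A second drop then becomes a no-op and a later call reports an error, at the cost mentioned above: ASAN no longer sees a use-after-free to flag.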


#[must_use]
#[no_mangle]
/// TODO: Consider having a variant of intern (and unintern?) that takes an array as input, instead
Contributor

Do you have a use case for this in your PoC for Ruby? If so, I'd do it, and if not, I'd pass.

Member Author

@ivoanjo ivoanjo Jan 6, 2025

I do! We're literally interning in a loop when I need to consume a whole stack. (Note: intern_or_raise here is just a nice helper to call ddog_prof_ManagedStringStorage_intern and check if there's an error in the result)

heap_stack* heap_stack_new(heap_recorder *recorder, ddog_prof_Slice_Location locations) {
  uint16_t frames_len = locations.len;
  // ...some error checking...
  heap_stack *stack = ruby_xcalloc(1, sizeof(heap_stack) + frames_len * sizeof(heap_frame));
  stack->frames_len = frames_len;
  for (uint16_t i = 0; i < stack->frames_len; i++) {
    const ddog_prof_Location *location = &locations.ptr[i];
    stack->frames[i] = (heap_frame) {
      .name = intern_or_raise(recorder->string_storage, location->function.name),
      .filename = intern_or_raise(recorder->string_storage, location->function.filename),
      // location->line is an int64_t. We don't expect to have to profile files with more than
      // ~2 billion lines (INT32_MAX), so this cast should be fairly safe?
      .line = (int32_t) location->line,
    };
  }
  return stack;
}

(from my working branch which is based off of DataDog/dd-trace-rb#3628 ).

My thinking is that most other libdatadog APIs fall into one of two categories: a) calls that do very little work but that we don't make very often (e.g. setup and reporting); or b) calls that do a big chunk of work every time (e.g. adding a sample to a profile). This API is a third kind, c): each call does very little work, and it gets called many times.

Thus, it seems like a prime candidate to turn c) into b): a more coarse-grained call would lower the per-call overhead of the FFI and of locking.
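
A rough sketch of what such a coarse-grained call could look like on the Rust side follows. The names (`Table`, `SharedStorage`, `intern_all`) are hypothetical and not part of this PR, usage counting is left out for brevity, and a plain `std::sync::Mutex` stands in for whatever locking the real storage uses.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical inner table; usage counts are omitted to keep the sketch short.
#[derive(Default)]
struct Table {
    next_id: u32,
    ids: HashMap<String, u32>,
}

impl Table {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        self.next_id += 1;
        self.ids.insert(s.to_owned(), self.next_id);
        self.next_id
    }
}

/// Hypothetical thread-safe wrapper around the table.
pub struct SharedStorage {
    inner: Mutex<Table>,
}

impl SharedStorage {
    pub fn new() -> Self {
        Self { inner: Mutex::new(Table::default()) }
    }

    /// Fine-grained variant: one lock acquisition (and one FFI call) per string.
    pub fn intern(&self, s: &str) -> u32 {
        self.inner.lock().expect("lock poisoned").intern(s)
    }

    /// Coarse-grained variant: a single lock acquisition for the whole batch.
    /// The returned ids are in the same order as the input strings.
    pub fn intern_all(&self, strings: &[&str]) -> Vec<u32> {
        let mut table = self.inner.lock().expect("lock poisoned");
        strings.iter().map(|s| table.intern(s)).collect()
    }
}
```

With something like `intern_all`, the loop in `heap_stack_new` could collect the names and filenames for a whole stack and cross the FFI boundary once, instead of twice per frame.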

…lementation

These are not needed currently, so let's simplify rather than having
a lot of unused appendages in this PR.
Comment on lines 17 to 18
cached_seq_num_for: Cell<Option<*const StringTable>>,
cached_seq_num: Cell<Option<StringId>>,
Contributor

We are storing the cache on each string, but aren't these all added in a batch to the same string table? I think you said that you add all these managed strings to the Profile's string table just before serialization, right? Couldn't we perform a larger batch operation and store the cache there? That way the memory is only used on serialization rather than kept around but largely not being used.

Member Author

As I was looking at this, I did a small tweak to store these as a tuple, rather than as separate entries, as they're related anyway -- 1f2b953 .

Your suggestion is interesting, but I'm curious how far were you thinking about the "large batch operation". In particular, were you thinking of moving the cache entirely away from the string table, to the profile? Or even to the caller of the profile?
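
For context, the per-string cache being discussed can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: `ProfileStringTable` and `CachedString` are hypothetical stand-ins, and the table uses a linear scan only to keep the example short.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a profile-bound string table.
pub struct ProfileStringTable {
    strings: Vec<String>,
}

impl ProfileStringTable {
    pub fn new() -> Self {
        Self { strings: Vec::new() }
    }

    /// Returns the position of `s` in the table, adding it if needed.
    pub fn intern(&mut self, s: &str) -> u32 {
        if let Some(pos) = self.strings.iter().position(|x| x == s) {
            return pos as u32;
        }
        self.strings.push(s.to_owned());
        (self.strings.len() - 1) as u32
    }
}

/// A managed string that remembers, as one tuple, which table it was last
/// resolved against and the id it received there.
pub struct CachedString {
    text: String,
    cached: Cell<Option<(*const ProfileStringTable, u32)>>,
}

impl CachedString {
    pub fn new(text: &str) -> Self {
        Self { text: text.to_owned(), cached: Cell::new(None) }
    }

    pub fn resolve(&self, table: &mut ProfileStringTable) -> u32 {
        // The pointer is used only as an identity key and is never dereferenced.
        let key: *const ProfileStringTable = &*table;
        if let Some((cached_for, id)) = self.cached.get() {
            if std::ptr::eq(cached_for, key) {
                return id;
            }
        }
        let id = table.intern(&self.text);
        self.cached.set(Some((key, id)));
        id
    }
}
```

Keeping the table pointer and the id in a single `Cell` means they can never disagree with each other. One caveat of using the address as a key, at least in this sketch, is that a new table allocated at the same address as a dropped one would be mistaken for the old one, so the cache has to be invalidated some other way when tables are replaced.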

ivoanjo added 11 commits January 7, 2025 09:41
…called with id 0

This is much nicer than having a weird panic in there.
…safer

With the current structure of the code, the `expect` inside `resolve`
should never fail; hopefully we don't introduce a bug in the future
that changes this.

(I know that ideally in Rust we would represent such constraints
in the type system, but I don't think I could do so without a lot of
other changes, so I decided to go for the more self-contained
solution for now.)
In particular, in the unlikely event that we would overflow the id,
we signal an error back to the caller rather than impacting the
application.

The caller is expected to stop using this string table and create
a new one instead. In the future we may make it easy to do so, by
e.g. having an API to create a new table from the existing strings
or something like that.
This will enable us to propagate failures when a ManagedStringId is not
known, which will improve debuggability and code quality by allowing us
to signal the error.
This string is supposed to live for as long as the managed string
storage does.

Treating it specially in intern matches what we do in other functions
and ensures that we can never overflow the reference count (or
something weird like that).
Comment on lines 53 to 93
pub fn intern(&mut self, item: &str) -> anyhow::Result<u32> {
    if item.is_empty() {
        // We don't increase ref-counts on the empty string
        return Ok(0);
    }

    let entry = self.str_to_id.get_key_value(item);
    match entry {
        Some((_, id)) => {
            let usage_count = &self
                .id_to_data
                .get(id)
                .ok_or_else(|| {
                    anyhow::anyhow!("BUG: id_to_data and str_to_id should be in sync")
                })?
                .usage_count;
            usage_count.set(usage_count.get() + 1);
            Ok(*id)
        }
        None => self.intern_new(item),
    }
}

pub fn intern_new(&mut self, item: &str) -> anyhow::Result<u32> {
    let id = self.next_id;
    let str: Rc<str> = item.into();
    let data = ManagedStringData {
        str: str.clone(),
        cached_seq_num: Cell::new(None),
        usage_count: Cell::new(1),
    };
    self.next_id = self
        .next_id
        .checked_add(1)
        .ok_or_else(|| anyhow::anyhow!("Ran out of string ids!"))?;
    let old_value = self.str_to_id.insert(str.clone(), id);
    debug_assert_eq!(old_value, None);
    let old_value = self.id_to_data.insert(id, data);
    debug_assert_eq!(old_value, None);
    Ok(id)
}
Contributor

What's the API level difference between these two methods? Also need to document when errors happen, and what maybe to do with them.

Member Author

Ah, nice catch, intern_new was fully 100% intended not to be public, yet somehow the pub ended up there.

It's not intended to be exposed (too dangerous!) -- I'll remove it.

Member Author

Fixed in 20b9469

Adding it as `pub` was an oversight, since the intent is for this to
be an inner helper that's only used by `intern` and by `new`.

Having this as `pub` is quite dangerous as this method can easily
be used to break a lot of the assumptions for the string storage.
Labels
profiling Relates to the profiling* modules.
6 participants