[PROF-9476] Add experimental profiling managed string storage #725
Conversation
This will later allow us to introduce the new StringId code without disturbing existing API clients. This duplication is intended to be only an intermediate step to introducing support for StringId.
I decided to introduce the `PersistentStringId` name to distinguish such ids from the `StringId` type used by things that are in the profile-bound string table.
This makes it clear it's only kept around for experiments.
It doesn't really make sense to make it optional to have a CharSlice here.
This makes this API a bit more future-proof, as well as a bit more consistent with other such APIs in libdatadog.
This reverts commit 624ebd0. During PR review, I didn't quite remember why and when this was needed, so let's remove it until we figure out exactly why/if it's needed.
A `MaybeError` is apparently already must_use so this is redundant. TIL.
}

#[no_mangle]
/// TODO: @ivoanjo Should this take a `*mut ManagedStringStorage` like Profile APIs do?
Ugh, this is something even in C I go back and forth on. It's one of those "do I trust the user or do I be more defensive?" things. Setting the C pointer to null makes it easier to debug when things go wrong, and can sometimes even prevent further things from going wrong. Sometimes it also makes it harder to debug because you don't get a use-after-free warning from ASAN, so that can swing both ways. But it's theoretically wholly wasted work because nobody should use the thing after it's been dropped...
Yeah, I think the mut option is nice since it means we can return an error back on API calls made wrongly, and since the client should handle those anyway, it means we're turning something that's definitively a bug into a nice error message.
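For illustration, here's roughly how the two options being discussed read from the caller's side. Both signatures are assumptions made up for this sketch, not what the PR currently exposes:

// Option 1 (assumed signature): drop takes the handle as-is. The caller must
// simply never touch `storage` again; a mistaken reuse is undefined behavior
// that only tools like ASAN might flag.
ddog_prof_ManagedStringStorage_drop(storage);

// Option 2 (assumed signature): drop takes a mutable pointer, like the Profile
// APIs, so the library can clear the handle. A later buggy call that reuses
// `storage` can then be detected and reported as an error result instead of a
// use-after-free.
ddog_prof_ManagedStringStorage_drop(&storage);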
#[must_use]
#[no_mangle]
/// TODO: Consider having a variant of intern (and unintern?) that takes an array as input, instead
Do you have a use case for this in your PoC for Ruby? If so, I'd do it, and if not, I'd pass.
I do! We're literally interning in a loop when I need to consume a whole stack. (Note: `intern_or_raise` here is just a nice helper to call `ddog_prof_ManagedStringStorage_intern` and check if there's an error in the result.)
heap_stack* heap_stack_new(heap_recorder *recorder, ddog_prof_Slice_Location locations) {
  uint16_t frames_len = locations.len;
  // ...some error checking...
  heap_stack *stack = ruby_xcalloc(1, sizeof(heap_stack) + frames_len * sizeof(heap_frame));
  stack->frames_len = frames_len;
  for (uint16_t i = 0; i < stack->frames_len; i++) {
    const ddog_prof_Location *location = &locations.ptr[i];
    stack->frames[i] = (heap_frame) {
      .name = intern_or_raise(recorder->string_storage, location->function.name),
      .filename = intern_or_raise(recorder->string_storage, location->function.filename),
      // The line in ddog_prof_Location is an int64_t. We don't expect to have to profile files
      // with more than 2M lines so this cast should be fairly safe?
      .line = (int32_t) location->line,
    };
  }
  return stack;
}
(from my working branch which is based off of DataDog/dd-trace-rb#3628 ).
My thinking is that, unlike most other libdatadog APIs, which either a) are very small but don't get called very often (e.g. setup and reporting), or b) do a big chunk of work on every call (profile add), this API c) does very little work and gets called many times.
Thus, it seems like a prime candidate for turning c) into b): a more coarse-grained call would lower the overhead cost of the FFI and locking.
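To make that concrete, here's a rough sketch of what such a coarse-grained variant could look like from C. The `ddog_prof_ManagedStringStorage_intern_all` name, its parameter types, and its return type are entirely hypothetical, shown only to illustrate the idea; no such function exists in this PR:

// Hypothetical batch variant (not part of this PR): intern every string in
// `strings` with a single FFI call and a single lock acquisition, writing the
// resulting ids into `out_ids`, which must have room for `strings.len` entries.
ddog_MaybeError ddog_prof_ManagedStringStorage_intern_all(
    ddog_prof_ManagedStringStorage storage,
    ddog_prof_Slice_CharSlice strings,
    ddog_prof_ManagedStringId *out_ids);

With something like this, the per-frame loop in heap_stack_new above could intern the names and filenames of a whole stack in one or two calls instead of two calls per frame.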
…lementation
These are not needed currently, so let's simplify rather than having a lot of unused appendages in this PR.
cached_seq_num_for: Cell<Option<*const StringTable>>,
cached_seq_num: Cell<Option<StringId>>,
We are storing the cache on each string, but aren't these all added in a batch to the same string table? I think you said that you add all these managed strings to the Profile's string table just before serialization, right? Couldn't we perform a larger batch operation and store the cache there? That way the memory is only used on serialization rather than kept around but largely not being used.
As I was looking at this, I did a small tweak to store these as a tuple, rather than as separate entries, as they're related anyway -- 1f2b953.
Your suggestion is interesting, but I'm curious how far you were thinking of taking the "large batch operation". In particular, were you thinking of moving the cache entirely away from the string table, to the profile? Or even to the caller of the profile?
…called with id 0
This is much nicer than having a weird panic in there.
…safer
With the current structure of the code, the `expect` inside `resolve` should never fail; hopefully we don't introduce a bug in the future that changes this. (I know that ideally in Rust we would represent such constraints in the type system, but I don't think I could do so without a lot of other changes, so I decided to go for the more self-contained solution for now.)
In particular, in the unlikely event that we would overflow the id, we signal an error back to the caller rather than impacting the application. The caller is expected to stop using this string table and create a new one instead. In the future we may make it easy to do so, by e.g. having an API to create a new table from the existing strings or something like that.
This will enable us to propagate failures when a ManagedStringId is not known, which will improve debugability and code quality by allowing us to signal the error.
This string is supposed to live for as long as the managed string storage does. Treating it specially in intern matches what we do in other functions and ensures that we can never overflow the reference count (or something weird like that).
pub fn intern(&mut self, item: &str) -> anyhow::Result<u32> {
    if item.is_empty() {
        // We don't increase ref-counts on the empty string
        return Ok(0);
    }

    let entry = self.str_to_id.get_key_value(item);
    match entry {
        Some((_, id)) => {
            let usage_count = &self
                .id_to_data
                .get(id)
                .ok_or_else(|| {
                    anyhow::anyhow!("BUG: id_to_data and str_to_id should be in sync")
                })?
                .usage_count;
            usage_count.set(usage_count.get() + 1);
            Ok(*id)
        }
        None => self.intern_new(item),
    }
}

pub fn intern_new(&mut self, item: &str) -> anyhow::Result<u32> {
    let id = self.next_id;
    let str: Rc<str> = item.into();
    let data = ManagedStringData {
        str: str.clone(),
        cached_seq_num: Cell::new(None),
        usage_count: Cell::new(1),
    };
    self.next_id = self
        .next_id
        .checked_add(1)
        .ok_or_else(|| anyhow::anyhow!("Ran out of string ids!"))?;
    let old_value = self.str_to_id.insert(str.clone(), id);
    debug_assert_eq!(old_value, None);
    let old_value = self.id_to_data.insert(id, data);
    debug_assert_eq!(old_value, None);
    Ok(id)
}
What's the API-level difference between these two methods? We also need to document when errors happen, and maybe what to do with them.
Ah, nice catch, `intern_new` was 100% intended not to be public, yet somehow the `pub` ended up there.
It's not intended to be exposed (too dangerous!) -- I'll remove it.
Fixed in 20b9469
Adding it as `pub` was an oversight, since the intent is for this to be an inner helper that's only used by `intern` and by `new`. Having this as `pub` is quite dangerous as this method can easily be used to break a lot of the assumptions for the string storage.
What does this PR do?
This PR builds on the work started by @AlexJF on #607 to introduce "managed string storage" for profiling.
The idea is to introduce another level of string storage for profiling that is decoupled in lifetime from individual profiles, and that is managed by the libdatadog client.
At its core, managed string storage provides a hashtable that stores strings and returns ids. These ids can then be provided to libdatadog instead of `CharSlice`s when recording profiling samples.
For FFI users, this PR adds the following APIs to manage strings:
ddog_prof_ManagedStringStorage_new
ddog_prof_ManagedStringStorage_intern(String)
ddog_prof_ManagedStringStorage_unintern(id)
ddog_prof_ManagedStringStorage_advance_gen
ddog_prof_ManagedStringStorage_drop
ddog_prof_ManagedStringStorage_get_string
A key detail of the current implementation is that each `intern` call with the same string will increase an internal usage counter, and each `unintern` call will reduce the counter. Then, at `advance_gen` time, if the counter is zero, we get rid of the string.
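To make the lifecycle concrete, here's a rough C sketch of how these calls are meant to fit together. The function names are the ones listed above, but everything else -- the header name, the exact return types, and the elided error/result checking -- is an assumption for illustration rather than the PR's actual signatures:

#include <datadog/profiling.h> // header name assumed

void managed_string_storage_lifecycle_sketch(void) {
  // Create one storage whose lifetime is decoupled from individual profiles.
  // (In the real FFI these calls report failures through result types; error
  // checking is elided here to keep the sketch short.)
  ddog_prof_ManagedStringStorage storage = ddog_prof_ManagedStringStorage_new();

  // Interning bumps an internal usage counter for the string and returns an id
  // that can be used instead of a CharSlice when recording samples.
  ddog_prof_ManagedStringId name_id =
      ddog_prof_ManagedStringStorage_intern(storage, DDOG_CHARSLICE_C("my_function"));

  // ...use name_id in as many samples and profiles as needed...

  // Each unintern decrements the usage counter for that id...
  ddog_prof_ManagedStringStorage_unintern(storage, name_id);

  // ...and at advance_gen time, strings whose counter is zero are dropped.
  ddog_prof_ManagedStringStorage_advance_gen(storage);

  // Finally, when the storage itself is no longer needed:
  ddog_prof_ManagedStringStorage_drop(storage);
}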
Then, to interact with profiles, there's a new `ddog_prof_Profile_with_string_storage` API to create a profile with a given `ManagedStringStorage`, and all structures that make up a `Sample` (`Mapping`, `Function`, `Label`, etc.) have been extended so that they take either a `CharSlice` or a `ManagedStringId`.
Thus, after interning all strings for a sample, it's possible to add a sample to a profile entirely by referencing strings by ids, rather than `CharSlice`s.
Motivation
The initial use-case is to support heap profiling -- "samples" related to heap profiling usually live across multiple profiles (as long as a given object is alive) and so this data must be kept somewhere.
Previously for Ruby we were keeping this on the Ruby profiler side, but having libdatadog manage this instead presents a few optimization opportunities.
We also hope to replace a few other "string tables" that other profilers had to build outside of libdatadog for similar use-cases.
This topic was also discussed in the following two documents (Datadog-only, sorry!):
Additional Notes
In keeping with the experimental nature of this feature, I've tried really hard to not disturb existing profiling API users with the new changes.
That is -- I was going for, if you're not using managed string storage, you should NOT be affected AT ALL by it -- be it API changes or overhead.
(This is why on the pure-Rust profiling crate side, I ended up duplicating a bunch of structures and functions. I couldn't think of a great way to not disturb existing API users other than introducing alternative methods, but to be honest the duplication is all in very simple methods so I don't think this substantially increases complexity/maintenance vs trying to be smarter to bend Rust to our will.)
There's probably a lot of improvements we can make, but with this PR I'm hoping to have something in a close-to-"good enough" state, so that we can merge it in and then start iterating on master, rather than have it continue living in a branch for a lot longer.
This doesn't mean we shouldn't fix or improve things before merging, but I'll be trying to identify what needs to go in now and what can go in as separate, follow-up PRs.
As an addendum, there's still a bunch of `expect`s sprinkled around that should be turned into proper errors. I plan to do a pass on all of those. (But again, none of the panics affect existing code, so they're harmless and inert unless you're experimenting with the new APIs.)
How to test the change?
The branch in https://github.com/DataDog/dd-trace-rb/tree/ivoanjo/prof-9476-managed-string-storage-try2 is where I'm testing the changes on the Ruby profiler side.
It may not be entirely up-to-date with the latest ffi changes on the libdatadog side (I've been prettying up the API), but it shows how to use this concept, while passing all the profiling unit/integration tests, and has shown improvements in memory and latency in the reliability environment.