[PROF-9476] Add experimental profiling managed string storage #725

ivoanjo · 2024-11-11T15:41:49Z

What does this PR do?

This PR builds on the work started by @AlexJF on #607 to introduce "managed string storage" for profiling.

The idea is to introduce another level of string storage for profiling that is decoupled in lifetime from individual profiles, and that is managed by the libdatadog client.

At its core, managed string storage provides a hashtable that stores strings and returns ids. These ids can then be provided to libdatadog instead of CharSlices when recording profiling samples.

For FFI users, this PR adds the following APIs to manage strings:

ddog_prof_ManagedStringStorage_new
ddog_prof_ManagedStringStorage_intern(String)
ddog_prof_ManagedStringStorage_unintern(id)
ddog_prof_ManagedStringStorage_advance_gen
ddog_prof_ManagedStringStorage_drop
ddog_prof_ManagedStringStorage_get_string

A key detail of the current implementation is that each intern call with the same string will increase an internal usage counter, and each unintern call will reduce the counter.

Then at advance_gen time, if the counter is zero, we get rid of the string.
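To make the counting behaviour concrete, here is a minimal sketch in Rust. It is not the code from this PR: `SketchStorage` and its layout are hypothetical, and it only models the semantics described above (intern increments a usage count, unintern decrements it, and advance_gen removes strings whose count is zero), with id 0 reserved for the empty string.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical, simplified model of a managed string storage.
pub struct SketchStorage {
    next_id: u32,
    str_to_id: HashMap<Rc<str>, u32>,
    /// id -> (string, usage count)
    id_to_data: HashMap<u32, (Rc<str>, u32)>,
}

impl SketchStorage {
    pub fn new() -> Self {
        // Assumption: id 0 always means the empty string and is never stored.
        SketchStorage {
            next_id: 1,
            str_to_id: HashMap::new(),
            id_to_data: HashMap::new(),
        }
    }

    /// Returns the id for `item`, bumping its usage count.
    /// Returns `None` if an id or a counter would overflow.
    pub fn intern(&mut self, item: &str) -> Option<u32> {
        if item.is_empty() {
            return Some(0);
        }
        if let Some(&id) = self.str_to_id.get(item) {
            let entry = self.id_to_data.get_mut(&id)?;
            entry.1 = entry.1.checked_add(1)?;
            return Some(id);
        }
        let id = self.next_id;
        self.next_id = id.checked_add(1)?;
        let shared: Rc<str> = item.into();
        self.str_to_id.insert(shared.clone(), id);
        self.id_to_data.insert(id, (shared, 1));
        Some(id)
    }

    /// Decrements the usage count; the string is only removed by `advance_gen`.
    /// Returns `false` for unknown ids or counts that are already zero.
    pub fn unintern(&mut self, id: u32) -> bool {
        match self.id_to_data.get_mut(&id) {
            Some(entry) if entry.1 > 0 => {
                entry.1 -= 1;
                true
            }
            _ => false,
        }
    }

    /// Drops every string whose usage count has reached zero.
    pub fn advance_gen(&mut self) {
        let str_to_id = &mut self.str_to_id;
        self.id_to_data.retain(|_, (string, count)| {
            if *count == 0 {
                str_to_id.remove(&**string);
                false
            } else {
                true
            }
        });
    }

    pub fn get_string(&self, id: u32) -> Option<&str> {
        if id == 0 {
            return Some("");
        }
        self.id_to_data.get(&id).map(|(string, _)| &**string)
    }
}
```

In this sketch a string interned twice needs two unintern calls before advance_gen removes it, and an id whose count is zero stays readable until the next advance_gen.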

Then, to interact with profiles, there's a new ddog_prof_Profile_with_string_storage API to create a profile with a given ManagedStringStorage, and all structures that make up a Sample (Mapping, Function, Label, etc.) have been extended so that they take either a CharSlice or a ManagedStringId.

Thus, after interning all strings for a sample, it's possible to add a sample to a profile entirely by referencing strings by ids, rather than CharSlices.

Motivation

The initial use-case is to support heap profiling -- "samples" related to heap profiling usually live across multiple profiles (as long as a given object is alive) and so this data must be kept somewhere.
Previously for Ruby we were keeping this on the Ruby profiler side, but having libdatadog manage this instead presents a few optimization opportunities.

We also hope to replace a few other "string tables" that other profilers had to build outside of libdatadog for similar use-cases.

This topic was also discussed in the following two documents (Datadog-only, sorry!):

Additional Notes

In keeping with the experimental nature of this feature, I've tried really hard to not disturb existing profiling API users with the new changes.

That is, my goal was that if you're not using managed string storage, you should NOT be affected AT ALL by it, be it through API changes or overhead.

(This is why on the pure-Rust profiling crate side, I ended up duplicating a bunch of structures and functions. I couldn't think of a great way to not disturb existing API users other than introducing alternative methods, but to be honest the duplication is all in very simple methods so I don't think this substantially increases complexity/maintenance vs trying to be smarter to bend Rust to our will.)

There are probably a lot of improvements we can make, but with this PR I'm hoping to have something close to a "good enough" state, so that we can merge this in and then start iterating on master, rather than have this continue living in a branch for a lot longer.

This doesn't mean we shouldn't fix or improve things before merging, but I'll be trying to identify what needs to go in now and what can go in as separate, follow-up PRs.

As an addendum, there are still a bunch of `expect`s sprinkled around that should be turned into proper errors. I plan to do a pass on all of those. (But again, none of the panics affect existing code, so they're harmless and inert unless you're experimenting with the new APIs.)

How to test the change?

The branch in https://github.com/DataDog/dd-trace-rb/tree/ivoanjo/prof-9476-managed-string-storage-try2 is where I'm testing the changes on the Ruby profiler side.

It may not be entirely up-to-date with the latest ffi changes on the libdatadog side (I've been prettying up the API), but it shows how to use this concept, while passing all the profiling unit/integration tests, and has shown improvements in memory and latency in the reliability environment.

morrisonlevi · 2025-01-03T16:03:21Z

profiling-ffi/src/string_storage.rs

+}
+
+#[no_mangle]
+/// TODO: @ivoanjo Should this take a `*mut ManagedStringStorage` like Profile APIs do?


Ugh, this is something even in C I go back and forth on. It's one of those "do I trust the user or do I be more defensive?" things. Setting the C pointer to null makes it easier to debug when things go wrong, and can sometimes even prevent further things from going wrong. Sometimes it also makes it harder to debug because you don't get a use-after-free warning from ASAN, so that can swing both ways. But it's theoretically wholly wasted work because nobody should use the thing after it's been dropped...

Yeah, I think the mut option is nice, since it means that calling the API wrongly will often return an error, and since the client should handle those anyway, we're turning something that's definitively a bug into a nice error message.
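As an illustration of the trade-off discussed in this thread, here is a hedged sketch of the "take a `*mut` handle and null it on drop" pattern. All names (`Storage`, `StorageHandle`, `sketch_storage_*`) are hypothetical and are not the libdatadog API; a real FFI export would also need a no_mangle attribute, which is left out here to keep the sketch edition-agnostic.

```rust
/// Hypothetical stand-in for the real storage type.
pub struct Storage {
    pub strings: Vec<String>,
}

/// Hypothetical C-visible handle: a struct wrapping the owning pointer.
#[repr(C)]
pub struct StorageHandle {
    pub inner: *mut Storage,
}

pub extern "C" fn sketch_storage_new() -> StorageHandle {
    let storage = Storage { strings: Vec::new() };
    StorageHandle {
        inner: Box::into_raw(Box::new(storage)),
    }
}

/// Frees the storage and nulls the pointer inside the caller's handle.
///
/// # Safety
/// `handle` must be null or point to a valid `StorageHandle` whose `inner`
/// is either null or was produced by `sketch_storage_new`.
pub unsafe extern "C" fn sketch_storage_drop(handle: *mut StorageHandle) {
    if let Some(handle) = unsafe { handle.as_mut() } {
        // Take the pointer out and leave null behind, so later calls can
        // detect the dropped state instead of touching freed memory.
        let inner = std::mem::replace(&mut handle.inner, std::ptr::null_mut());
        if !inner.is_null() {
            drop(unsafe { Box::from_raw(inner) });
        }
    }
}

/// Returns the number of stored strings, or -1 if the handle is null or
/// has already been dropped.
///
/// # Safety
/// Same requirements as `sketch_storage_drop`.
pub unsafe extern "C" fn sketch_storage_len(handle: *const StorageHandle) -> i64 {
    let storage = unsafe {
        match handle.as_ref() {
            Some(handle) => handle.inner.as_ref(),
            None => None,
        }
    };
    match storage {
        Some(storage) => storage.strings.len() as i64,
        None => -1,
    }
}
```

With this shape, using the handle after drop, or dropping it twice, becomes a reportable error or a no-op. The cost, as noted above, is that tools such as ASAN no longer see a use-after-free to flag.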

morrisonlevi · 2025-01-03T16:19:08Z

profiling-ffi/src/string_storage.rs

+
+#[must_use]
+#[no_mangle]
+/// TODO: Consider having a variant of intern (and unintern?) that takes an array as input, instead


Do you have a use case for this in your PoC for Ruby? If so, I'd do it, and if not, I'd pass.

I do! We're literally interning in a loop when I need to consume a whole stack. (Note: intern_or_raise here is just a nice helper to call ddog_prof_ManagedStringStorage_intern and check if there's an error in the result)

    heap_stack* heap_stack_new(heap_recorder *recorder, ddog_prof_Slice_Location locations) {
      uint16_t frames_len = locations.len;

      // ...some error checking...

      heap_stack *stack = ruby_xcalloc(1, sizeof(heap_stack) + frames_len * sizeof(heap_frame));
      stack->frames_len = frames_len;
      for (uint16_t i = 0; i < stack->frames_len; i++) {
        const ddog_prof_Location *location = &locations.ptr[i];
        stack->frames[i] = (heap_frame) {
          .name = intern_or_raise(recorder->string_storage, location->function.name),
          .filename = intern_or_raise(recorder->string_storage, location->function.filename),
          // ddog_prof_Location is a int64_t. We don't expect to have to profile files with more than
          // 2M lines so this cast should be fairly safe?
          .line = (int32_t) location->line,
        };
      }
      return stack;
    }

(from my working branch which is based off of DataDog/dd-trace-rb#3628 ).

My thinking is that most other libdatadog APIs either a) are very small but don't get called very often (e.g. setup and reporting), or b) do a big chunk of work on every call (profile add). This API, in contrast, c) does very little work and gets called many times.

Thus, it seems like a prime candidate to turn c) into b), by having a more coarse-grained call that lowers the overhead cost of the FFI and locking.
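A minimal sketch of what such a coarse-grained call could look like on the Rust side, assuming the storage sits behind a `Mutex`. The names `Table`, `SharedTable` and `intern_all` are hypothetical and the error type is simplified to `String`; the only point is that the batch variant acquires the lock once per call instead of once per string.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical, simplified table: string -> (id, usage count).
pub struct Table {
    next_id: u32,
    entries: HashMap<String, (u32, u32)>,
}

impl Table {
    fn intern(&mut self, item: &str) -> Result<u32, String> {
        if let Some(entry) = self.entries.get_mut(item) {
            entry.1 = entry.1.checked_add(1).ok_or("usage count overflow")?;
            return Ok(entry.0);
        }
        let id = self.next_id;
        self.next_id = id.checked_add(1).ok_or("ran out of string ids")?;
        self.entries.insert(item.to_owned(), (id, 1));
        Ok(id)
    }
}

pub struct SharedTable {
    inner: Mutex<Table>,
}

impl SharedTable {
    pub fn new() -> Self {
        let table = Table {
            next_id: 1,
            entries: HashMap::new(),
        };
        SharedTable {
            inner: Mutex::new(table),
        }
    }

    /// Fine-grained call: one lock acquisition per string.
    pub fn intern(&self, item: &str) -> Result<u32, String> {
        let mut table = self.inner.lock().map_err(|_| "lock poisoned".to_string())?;
        table.intern(item)
    }

    /// Coarse-grained call: one lock acquisition for the whole batch.
    /// `out[i]` receives the id of `items[i]`. If an error occurs midway,
    /// the strings interned so far stay interned.
    pub fn intern_all(&self, items: &[&str], out: &mut [u32]) -> Result<(), String> {
        if items.len() != out.len() {
            return Err("items and out must have the same length".to_string());
        }
        let mut table = self.inner.lock().map_err(|_| "lock poisoned".to_string())?;
        for (item, slot) in items.iter().zip(out.iter_mut()) {
            *slot = table.intern(item)?;
        }
        Ok(())
    }
}
```

For the heap stack example above, a caller could collect every frame's name and filename into one slice and make a single call per stack. Whether a partially applied batch should be rolled back on error is a design choice this sketch leaves open.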

morrisonlevi · 2025-01-06T18:57:18Z

profiling/src/collections/string_storage.rs

+    cached_seq_num_for: Cell<Option<*const StringTable>>,
+    cached_seq_num: Cell<Option<StringId>>,


We are storing the cache on each string, but aren't these all added in a batch to the same string table? I think you said that you add all these managed strings to the Profile's string table just before serialization, right? Couldn't we perform a larger batch operation and store the cache there? That way the memory is only used on serialization rather than kept around but largely not being used.

As I was looking at this, I did a small tweak to store these as a tuple, rather than as separate entries, as they're related anyway -- 1f2b953 .

Your suggestion is interesting, but I'm curious how far were you thinking about the "large batch operation". In particular, were you thinking of moving the cache entirely away from the string table, to the profile? Or even to the caller of the profile?
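One possible reading of the "larger batch operation" idea, sketched under stated assumptions: rather than caching a profile string id on every managed string, build a temporary map from managed id to profile string id when serializing and drop it afterwards. `ProfileStrings` and `resolve_batch` are hypothetical names, the managed storage is modelled as a plain `HashMap<u32, String>`, and nothing here reflects how this PR actually resolves ids.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the profile-bound string table.
pub struct ProfileStrings {
    strings: Vec<String>,
    index: HashMap<String, usize>,
}

impl ProfileStrings {
    pub fn new() -> Self {
        let mut table = ProfileStrings {
            strings: Vec::new(),
            index: HashMap::new(),
        };
        // Assumption: the empty string always occupies index 0.
        table.intern("");
        table
    }

    pub fn intern(&mut self, s: &str) -> usize {
        if let Some(&i) = self.index.get(s) {
            return i;
        }
        let i = self.strings.len();
        self.strings.push(s.to_owned());
        self.index.insert(s.to_owned(), i);
        i
    }
}

/// Resolves a batch of managed ids into profile string indices.
/// The returned map is the only cache: it lives for the duration of one
/// serialization and is then dropped, so no per-string cache field is needed.
pub fn resolve_batch(
    managed: &HashMap<u32, String>,
    profile: &mut ProfileStrings,
    ids: &[u32],
) -> Result<HashMap<u32, usize>, String> {
    let mut cache = HashMap::with_capacity(ids.len());
    for &id in ids {
        if cache.contains_key(&id) {
            continue;
        }
        let s = if id == 0 {
            ""
        } else {
            managed
                .get(&id)
                .ok_or_else(|| format!("unknown managed string id {id}"))?
                .as_str()
        };
        cache.insert(id, profile.intern(s));
    }
    Ok(cache)
}
```

The trade-off is that this moves one hash lookup per distinct id into the serialization path, in exchange for not keeping cache fields alive on every managed string between serializations.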

…safer With the current structure of the code, the `expect` inside `resolve` should never fail; hopefully we don't introduce a bug in the future that changes this. (I know that ideally in Rust we would represent such constraints in the type system, but I don't think I could do so without a lot of other changes, so I decided to go for the more self-contained solution for now.)

In particular, in the unlikely event that we would overflow the id, we signal an error back to the caller rather than impacting the application. The caller is expected to stop using this string table and create a new one instead. In the future we may make it easy to do so, by e.g. having an API to create a new table from the existing strings or something like that.

This string is supposed to live for as long as the managed string storage does. Treating it specially in intern matches what we do in other functions and ensures that we can never overflow the reference count (or something weird like that).

morrisonlevi · 2025-01-13T16:41:35Z

profiling/src/collections/string_storage.rs

+    pub fn intern(&mut self, item: &str) -> anyhow::Result<u32> {
+        if item.is_empty() {
+            // We don't increase ref-counts on the empty string
+            return Ok(0);
+        }
+
+        let entry = self.str_to_id.get_key_value(item);
+        match entry {
+            Some((_, id)) => {
+                let usage_count = &self
+                    .id_to_data
+                    .get(id)
+                    .ok_or_else(|| {
+                        anyhow::anyhow!("BUG: id_to_data and str_to_id should be in sync")
+                    })?
+                    .usage_count;
+                usage_count.set(usage_count.get() + 1);
+                Ok(*id)
+            }
+            None => self.intern_new(item),
+        }
+    }
+
+    pub fn intern_new(&mut self, item: &str) -> anyhow::Result<u32> {
+        let id = self.next_id;
+        let str: Rc<str> = item.into();
+        let data = ManagedStringData {
+            str: str.clone(),
+            cached_seq_num: Cell::new(None),
+            usage_count: Cell::new(1),
+        };
+        self.next_id = self
+            .next_id
+            .checked_add(1)
+            .ok_or_else(|| anyhow::anyhow!("Ran out of string ids!"))?;
+        let old_value = self.str_to_id.insert(str.clone(), id);
+        debug_assert_eq!(old_value, None);
+        let old_value = self.id_to_data.insert(id, data);
+        debug_assert_eq!(old_value, None);
+        Ok(id)
+    }


What's the API-level difference between these two methods? We also need to document when errors happen, and what callers should do about them.

Ah, nice catch, intern_new was fully 100% intended not to be public, yet somehow the pub ended up there.

It's not intended to be exposed (too dangerous!) -- I'll remove it.

Fixed in 20b9469

ivoanjo added 29 commits November 5, 2024 12:24

Introduce StringId copies of all functions that deal with Sample

6e847e3

This will later allow us to introduce the new StringId code without disturbing existing API clients. This duplication is intended to be only an intermediate step to introducing support for StringId.

Introduce StringStorage implementation

c8a2e6e

Credit goes to @AlexJF, this is lifted from his earlier PR #607

Introduce StringStorage instance into Profile

795d25f

Ran cargo fmt

8d68513

Introduce use of PersistentStringId in StringId* variants

d3c7d6d

I decided to introduce the `PersistentStringId` name to distinguish such ids from the `StringId` type used by things that are in the profile-bound string table.

Ran cargo fmt

7716d56

Fix label validation

0570625

Fix incorrect error message

2887d2f

Remove duplicate comments from StringId variants

81fb1ff

Extract method to avoid redundancy in add_sample

cc6b6c0

Introduce PersistentStringId::new to make it easier to create struc…

8fa0653

…ture

Import + wire up ffi changes for string storage

9413b95

Fix incorrect operation names on errors

82b1593

Isolate SimpleStringStorage into seprate module

a65c31d

This makes it clear it's only kept around for experiments.

Mark last_usage_gen as not in use

36d5d32

Add TODO to ManagedStringStorage_intern

64d6277

Make sure header gets refreshed when things change

624ebd0

Fix ddog_prof_Profile_with_string_storage not being correctly publi…

30d9a65

…shed in header

Remove unused includes

f9f3703

Remove unused struct

4aa5f44

Add short-circuit for looking up empty strings

0a99208

Replace ManagedStringStorageResult with equivalent MaybeError

e105dd1

Add TODO to get_string API

04d76d2

Have intern receive a CharSlice unconditionally

b273e47

It doesn't really make sense to make it optional to have a CharSlice here.

Introduce ManagedStringStorageNewResult

560316f

This makes this API a bit more future-proof, as well as a bit more consistent with other such APIs in libdatadog.

Add TODO about shape of ManagedStringStorage argument

66cf6fb

Rename PersistentStringId -> ManagedStringId

08f5760

Also expose ManagedStringId to ffi instead of u32

621ed3a

Avoid using expect in string storage ffi APIs

4be030b

ivoanjo requested a review from a team as a code owner November 11, 2024 15:41

morrisonlevi reviewed Dec 20, 2024

profiling-ffi/build.rs (outdated)

ivoanjo added 5 commits January 3, 2025 11:03

Revert "Make sure header gets refreshed when things change"

172a0a2

This reverts commit 624ebd0. During PR review, I didn't quite remember why and when this was needed, so let's remove until we figure out exactly why/if this is needed.

Merge branch 'main' into ivoanjo/prof-9476-managed-string-storage-try…

e291371

…3-clean

Remove redundant must_use as pointed out by clippy

d414999

A `MaybeError` is apparently already must_use so this is redundant. TIL.

Apply another clippy suggestion

fdfaa1f

More linting... Why can't all of this be applied in one step... :/

c634964

morrisonlevi reviewed Jan 3, 2025

profiling-ffi/src/string_storage.rs (outdated)

Merge branch 'main' into ivoanjo/prof-9476-managed-string-storage-try…

b335d14

…3-clean

morrisonlevi reviewed Jan 3, 2025

profiling/src/collections/string_storage.rs (outdated)

ivoanjo added 3 commits January 6, 2025 10:04

Improve API description when reading string from managed string storage

d4194a2

Remove StringStorage / SimpleStringStorage experimental trait/imp…

28782b5

…lementation These are not needed currently, so let's simplify rather than having a lot of unused appendages in this PR.

Cargo fmt

fe29cf9

morrisonlevi reviewed Jan 6, 2025

profiling/src/collections/string_storage.rs (outdated)

morrisonlevi reviewed Jan 6, 2025

profiling/src/collections/string_storage.rs (outdated)

morrisonlevi reviewed Jan 6, 2025

ivoanjo added 11 commits January 7, 2025 09:41

Use NonZeroU32 in managed string table functions that shouldn't be …

03e7649

…called with id 0 This is much nicer than having a weird panic in there.

Minor tweak to comment

79901eb

Simplify cached field into tuple, as both are related

1f2b953

Remove last_usage_gen from string storage since we're not using it …

2f1dc92

…for now

Document that failure to acquire lock is unlikely

8a1ebcc

Refactor: Allow Profile::resolve to fail

9211511

This will enable us to propagate failures when a ManagedStringId is not known, which will improve debugability and code quality by allowing us to signal the error.

Properly handle invalid ManagedStringIds by returning errors

8387b85

Add fast-path for empty string interning

ce63b4e

morrisonlevi reviewed Jan 13, 2025

Make intern_new private

20b9469

Adding it as `pub` was an oversight, since the intent is for this to be an inner helper that's only used by `intern` and by `new`. Having this as `pub` is quite dangerous as this method can easily be used to break a lot of the assumptions for the string storage.
