
0.6.0 – Call stack restoration of inlined code, enhanced calculations and presentation, new data model

@lahmatiy released this on 14 Jan 06:37

In the months since the previous release, CPUpro underwent significant refactoring, with major improvements in computations and new feature prototypes. This release introduces enhanced calculations, a new data model, and select UI features considered ready.

The main areas of focus are:

  • Enhanced calculations for accuracy and resource efficiency, along with new metrics to refine analysis and highlight potential issues. Optimized calculations improve performance and enable complex computations and metrics while enhancing UX for handling multiple profiles.
  • Improved profile details, particularly from V8 logs in JSON format. This enables deeper insights into V8 processes, including compilation, optimization, inlining, and memory allocation.
  • Sample aggregation and reduction, designed for analyzing long-running processes. This narrows the analysis scope, reduces noise, and simplifies work with large profiles.
  • Support for multiple profiles, enabling result averaging, noise reduction, and comparisons. Experimental features in this area show promise and are expected to mature soon.

CPUpro focuses on low-level details and long sessions, leaving small- and medium-scale scenarios to browser devtools like Chromium DevTools. It aims to fill gaps in areas where these tools fall short, such as data manipulation and profile comparisons.

This release delivers results from the first two focus areas, with other features planned for upcoming releases in the coming weeks.

Call stack restoration for inlined code

Previously, CPUpro only performed a basic parse of the V8 log (or, more precisely, the preprocessed V8 log – a JSON format derived from the V8 log, for instance by using the command node --prof-process --preprocess ...). It turned out that the stack traces in this profile format contain not only a “reference” to a function (call frame) but also an offset in the bytecode or machine code of the function. Previously, CPUpro, like other tools, simply ignored these offsets. However, this information is quite useful, since it enables two important artifacts:

  • A per-instruction association of samples (i.e., associating time with the exact location in the code) instead of per-function.
  • The ability to restore the original stack trace for inlined code.

Offsets alone are not enough. Functions (or, more precisely, their representation as bytecode or machine code) also have a table associating the compiled code with the original code, along with auxiliary dictionaries. Combined, these provide information about which point in the code corresponds to the block of the function being executed on the call stack, as well as whether that block is part of inlined code from another function. All this makes it possible to map the executing compiled code back to a location in the original code and to “unfold” code assembled from fragments of different functions (inline code) into a call chain—effectively inserting it into the call stack trace instead of a single function call. This is a rather complex and resource-intensive task (in both time and memory), but the result is worth it.
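
To make this more concrete, below is a simplified sketch of the idea. All data shapes and names here are illustrative assumptions, not CPUpro’s actual internals or the exact V8 log format: a sampled frame carries a code offset, the code entry’s position table maps that offset to a source location and an inlining entry, and walking the inlining entries unfolds the frame into a chain of call frames:

// Illustrative sketch: all shapes and names are assumptions,
// not CPUpro's actual internals or the exact V8 log format
const codeEntry = {
    callFrameId: 10, // the function this code was compiled for
    // maps a code offset to a source offset and, when the code at that
    // offset originates from an inlined function, an entry in `inlined`
    positions: [
        { codeOffset: 0x00, sourceOffset: 5, inliningId: -1 },
        { codeOffset: 0x40, sourceOffset: 120, inliningId: 0 }
    ],
    // the inlining entries: which call frame was inlined, and the position
    // in the host code where the inlined call occurred
    inlined: [
        { callFrameId: 42, positionIndex: 0 }
    ]
};

// find the last position entry at or before the sampled code offset
function findPosition(codeEntry, codeOffset) {
    let found = codeEntry.positions[0];
    for (const position of codeEntry.positions) {
        if (position.codeOffset > codeOffset) {
            break;
        }
        found = position;
    }
    return found;
}

// unfold a sampled frame into a chain of call frames by walking the
// inlining entries from the innermost code back to the host function
function unfoldFrame(codeEntry, sampledOffset) {
    const chain = [];
    let position = findPosition(codeEntry, sampledOffset);

    while (position.inliningId !== -1) {
        const inlineEntry = codeEntry.inlined[position.inliningId];
        chain.push({ callFrameId: inlineEntry.callFrameId, sourceOffset: position.sourceOffset });
        position = codeEntry.positions[inlineEntry.positionIndex];
    }

    chain.push({ callFrameId: codeEntry.callFrameId, sourceOffset: position.sourceOffset });
    return chain;
}

// a sample at offset 0x40 unfolds into two frames instead of one:
// the inlined function (42) executing inside its host function (10)
console.log(unfoldFrame(codeEntry, 0x40));
// [ { callFrameId: 42, sourceOffset: 120 }, { callFrameId: 10, sourceOffset: 5 } ]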

When analyzing performance profiles, two situations often arise. First, functions that are expected to appear at the top of the profile end up either near the bottom of the list or missing entirely. Second, some simple functions that merely call one or more other functions unexpectedly appear at the top of the list. This is typical for “hot” code (executed many, many times) and stems from the tendency of JavaScript engines to inline frequently called functions into their callers.

We won’t go into all aspects of how JavaScript engines perform function inlining. Within these release notes, only the effect of inlining matters: once a function is inlined into another function, it disappears from the stack trace. This means that the samples (time) previously attributed to that function are now attributed to the function into which it was inlined. It’s also worth noting that a function can inline code from several functions, including code from functions that the inlined functions themselves call. The number and depth of nested inlined functions are governed by complex rules, but in practice this can be dozens of functions with fairly deep nesting. This is further complicated by the fact that optimizations and inlining are non-deterministic: they may happen at any arbitrary time depending on many factors, or not happen at all, so the results will always differ from one measurement to another. In other words, optimizations (inlining) distort the distribution of time across functions, and this distortion has a large element of randomness, which often deprives us of reliable insight into how to improve code performance.

To demonstrate how inlining affects the analysis of work in a profile, let’s consider the following simple “benchmark”, in which two functions with identical code are tested, and see how the results are displayed:

function a() {
    let result = 0;
    for (let i = 0; i < 100_000; i++) {
        result += i % 7;
    }
    return result;
}

function b() {
    let result = 0;
    for (let i = 0; i < 100_000; i++) {
        result += i % 7;
    }
    return result;
}

function test(fn) {
    let result = 0;
    for (let i = 0; i < 10_000; i++) {
        result += fn();
    }
    return result;
}

test(a);
test(b);
Chromium DevTools:

[screenshot: Chromium DevTools profiling results for the benchmark]

Although the code of functions a and b is identical, the reported time for these functions differs severalfold. The difference is due to the effect of inlining. Without knowing the code and relying solely on what the tools show, one might conclude that b is slow, a is super-fast, and most of the work is done by test, since it accounts for most of the time. Knowing how the code is organized, however, we can say that this picture is misleading. In fact, any profiling tool will yield roughly the same results for this example. Spoiler alert: in this particular case, the issue is not so much how the data is displayed, but rather how the profile was recorded.

If we disable the optimizing compilers that perform code inlining, TurboFan and Maglev, using V8 flags (for Chromium-based browsers, --js-flags='--no-turbofan --no-maglev'), the result changes:

[screenshot: Chromium DevTools results with TurboFan and Maglev disabled]
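
As an aside, the same V8 flags can usually be passed to V8-based runtimes directly; for example, in Node.js (assuming a Node.js build whose V8 version supports these flags):

# Chromium-based browsers
chromium --js-flags='--no-turbofan --no-maglev'

# Node.js passes V8 flags through directly
node --no-turbofan --no-maglev benchmark.js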

Yes, the test now runs significantly slower (about 6 times), which is expected because the machine code is less optimized. At the same time, however, the duration of functions a and b is now roughly the same. This is a consequence of disabling code inlining, which makes the call stack more stable. However, the combined time of a and b is now less than 20% of the total time of the test function, which is somewhat puzzling, since intuitively a and b should take about 10 times longer to run than the calling function test.

Unfortunately, it is currently impossible to get profiling results in Chromium DevTools that match expectations, even in such a simple case. There are two main problems. The first relates to how Chromium DevTools interacts with V8 and records events (i.e., sampling the stack trace and VM state via CDP). The second is that Chromium DevTools (and CDP) does not use the offset information in the call frames on the stack, or the code tables, to restore the stack trace for inlined code.

None of this is a fundamental limitation of V8, CDP, or Chromium DevTools, but rather an engineering gap, and here is why. In V8-based runtimes (e.g., Node.js and Deno), it is possible to record what’s known as a V8 log using the --prof flag (this can also be done in the browser, but it’s much more complicated). By leveraging the data from the V8 log (including the JSON format derived from it), specifically the offsets for call frames in stack traces and the code tables for restoring the call stack with inlined code, one can obtain more accurate and expected results. Here’s how such a profile looks in the new version of CPUpro:
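
For reference, a typical workflow in Node.js looks like this (file names are illustrative; the actual log file name includes the isolate address and process id):

# 1. Record a V8 log while running the script
node --prof app.js

# 2. Convert the log into the preprocessed JSON form that CPUpro accepts
node --prof-process --preprocess isolate-*.log > app.v8log.json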

[screenshot: the profile in CPUpro with call stack restoration for inlined code]

And here’s how the same profile looks, but without restoring the stack trace for inlined code:

[screenshot: the same profile without call stack restoration for inlined code]

In real-world profiles, the effect of inlining is no less significant than in this synthetic test. Below is an example of how a profile for a massive webpack build appears with and without restoring the stack trace for inlined code (the example comes from the Node.js world since, as mentioned earlier, in-browser profiling with the necessary data is not natively supported, and other methods are quite complicated). As the example shows, 9 of the top 20 call frames differ between the two views, and for 4 call frames the self time changes by 5–40%. The timings remain unchanged only for VM states and RegExps, which have no nested calls and thus are not affected by inlining, as well as for functions that have no nested calls or only calls to VM states (like compilation or GC); in this example, that’s the single function deepClone().

[screenshot: top 20 call frames of a webpack build profile, with and without call stack restoration for inlined code]

There are currently some challenges with restoring the call stack for inlined code:

  • Code tables and the associated dictionaries exist only in the V8 log (and its derivative, the preprocessed V8 log). Other profiling formats (V8 cpuprofile, Chromium Performance Profile, and Edge Enhanced Performance Traces) do not contain this data. Chromium DevTools cannot process V8 logs or their derivatives, nor can it record this type of profile.
  • The V8 log does not record tables for the Sparkplug and Maglev compilers, meaning that offsets in their code can’t be associated with the corresponding location in the source code, and the inlined code stack trace cannot be restored. This is not an issue for Sparkplug, which does not perform inlining, but it is problematic for Maglev, which does. In both cases, unfortunately, work cannot be linked precisely to a location in the code.

Yes, the computations required to unfold inlined call stacks are fairly heavy. It’s not just the time but also the increased data volume, since building the call tree now depends not only on which function called which, but also on the location in the function where the call occurred. The call tree can grow by a factor of 1.5–2 in the number of nodes, which affects subsequent calculations. Perhaps this is why it has not been implemented in other profiling tools.
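
To illustrate why the tree grows, here is a minimal sketch (the structures and names are illustrative, not CPUpro’s actual ones): when tree children are keyed by call frame alone, calls from different positions collapse into a single node; keyed by call frame plus call position, they become distinct nodes:

// Minimal sketch with illustrative structures: keying tree children
// by call frame vs. by call frame + source position
function addPath(root, frames, keyOf) {
    let node = root;
    for (const frame of frames) {
        const key = keyOf(frame);
        if (!node.children.has(key)) {
            node.children.set(key, { key, children: new Map() });
        }
        node = node.children.get(key);
    }
}

function countNodes(node) {
    let count = 1;
    for (const child of node.children.values()) {
        count += countNodes(child);
    }
    return count;
}

// two sampled stacks: test() calls fn() from two different call sites
const stacks = [
    [{ fn: 'test', pos: 57 }, { fn: 'fn', pos: 0 }],
    [{ fn: 'test', pos: 62 }, { fn: 'fn', pos: 0 }]
];

const byFrame = { children: new Map() };
const byFramePos = { children: new Map() };
for (const stack of stacks) {
    addPath(byFrame, stack, frame => frame.fn);
    addPath(byFramePos, stack, frame => `${frame.fn}:${frame.pos}`);
}

console.log(countNodes(byFrame));    // 3 nodes: root, test, fn
console.log(countNodes(byFramePos)); // 5 nodes: same data, position-aware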

To add these new calculations in CPUpro to the already resource-intensive processing of the preprocessed V8 log, it was necessary to revisit and refactor the handling of such profiles for greater efficiency. After numerous experiments, several optimizations and effective approaches were found to nearly double the speed of preprocessed V8 log processing, and then fit the new calculations into the freed-up budget. As a result, processing of the preprocessed V8 log takes about the same amount of time as before (at most 10% slower), despite a much larger volume of computations (a bit more than double).

A pleasant surprise while writing these release notes was that in Firefox, the synthetic test described above shows the expected results right out of the box, with no extra effort:

[screenshot: Firefox profiler results for the synthetic test]

Hopefully, Firefox’s example and CPUpro’s achievements in data processing will inspire the developers of profiling tools to reach new milestones!

Presentation enhancements

Numerous interface changes have been introduced; below are the most significant improvements:

  1. New pages for all call frames, modules, and packages
    These pages now display entries in a tabular format with basic name filtering. Entries without samples may appear, particularly in V8 logs or preprocessed V8 logs that list all compiled functions—some of which do not show up within the sampling interval.

    [screenshot: the new “All call frames” page]
  2. Reworked timing tables on category, package, and module pages
    The tables now include the total number of nested entities (e.g., modules or call frames) and how many of these have recorded samples, with an option to expand details.

    [screenshot: reworked timing table with sampled/total counts]
  3. Consolidated nested call sites
    On the call frame page, nested call sites are consolidated by default in the “Nested call sites” tree and the Subtree flame chart. This approach offers clearer insights into nested tasks when a function is invoked from multiple call points with deeply branched call trees.

    [screenshots: the “Nested call sites” tree and the Subtree flame chart with consolidation enabled]

Changes in terminology and categories

Some terms have been updated to align with conventions adopted in V8. The primary terms are now call frame (JS/Wasm function, RegExp, VM state, native handler, etc.), which replaces "function", and script (ESM/Wasm module, new Function(), eval, <script>, etc.). Not every call frame has an associated script, so call frames are aggregated into modules (which can be "virtual" or "logical"), modules into packages, and packages into categories. While this was not always strictly adhered to in previous versions, it is now a strict rule.

There are also some changes to the categories:

  • The engine category has been split into two:
    • compilation – covering the work of parsers and compilers
    • blocking – for call frames like "atomic waits"
  • A new category, devtools, has been added. Currently, it includes a single module injected by VS Code when debugging Node.js scripts. In the future, other similar scripts may be added to this category.

New data model

The data model has been reworked to reflect changes in profile data processing and to prepare for working with multiple profiles. This is largely an internal part of CPUpro, relevant only to those working with raw profile data and building calculations upon it. In future releases, the data model will continue to evolve, and a special version of CPUpro will be introduced to perform calculations on profile data without using the UI.

One of the key changes is the introduction of a "shared dictionary" concept. This allows entities (functions, scripts, modules, etc.) to be shared across multiple profiles without requiring them to be matched during calculations. Working with multiple profiles also necessitates layering the data and creating weak links between datasets. For instance, dictionary objects, such as functions, cannot reference timing data directly. Instead, timing data references the dictionary entities they relate to. This alters the approach to calculations, as the starting point for selections and aggregations changes. This topic will be explored in more detail in upcoming releases and planned articles.
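
As a rough illustration of the direction (all shapes here are hypothetical, not CPUpro’s actual API): entities live once in a shared dictionary, while each profile layers its own timing data on top, referencing dictionary entries rather than being referenced by them:

// Hypothetical shapes, for illustration only — not CPUpro's actual API

// shared dictionary: entities exist once and are reused by every profile
const dict = {
    callFrames: [{ id: 0, name: 'test' }, { id: 1, name: 'a' }],
    scripts: [],
    modules: [],
    packages: []
};

// per-profile data layered on top of the dictionary: timings reference
// dictionary entities (here, by index), never the other way around, so
// the same entity can carry different timings in different profiles
const profiles = [
    { callFrameTimings: [{ callFrame: 0, selfTime: 120 }, { callFrame: 1, selfTime: 870 }] },
    { callFrameTimings: [{ callFrame: 0, selfTime: 115 }, { callFrame: 1, selfTime: 905 }] }
];

// selections and aggregations start from the timing data and join the
// dictionary, e.g. averaging self time per call frame across profiles
const avgSelfTime = dict.callFrames.map((callFrame, idx) => ({
    name: callFrame.name,
    avgSelfTime: profiles.reduce(
        (sum, profile) => sum + profile.callFrameTimings[idx].selfTime,
        0
    ) / profiles.length
}));

console.log(avgSelfTime);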

The changelog

  • Changed terminology in accordance with that adopted in V8; the main term is now call frame (JS/Wasm function, RegExp, VM state, native handler, etc.) instead of "function"
  • Added "All call frames," "All modules," and "All packages" pages, which display entries even if they have no samples (when a profile provides data about compiled functions, such as a V8 log)
  • Added an option to consolidate call frames (a "Consolidate call frames" checkbox, checked by default) to the "Nested call sites" tree on the call frame page
  • Added call frame consolidation in a call frame's nested calls flame chart on the call frame page
  • Added call tree updates on filter changes on the call frame page
  • Added a kind property for call frames, which can be script, function, regexp, vm-state, or root
  • Reworked tables on the category, package, and module pages to include records with no samples and show the number of nested entities (e.g., modules or call frames for a package) as sampled / all
  • Reworked the data model in preparation for multi-profile analysis and comparison
  • Reworked profile data preprocessing and initial profile computations
  • Introduced a shared dictionary for call frames, scripts, modules, packages, etc., to be used across a set of profiles
  • Introduced an optional call frame positions tree and related structures for source code position precision timings
  • Improved parsing of function names and URLs during the conversion of preprocessed V8 logs into cpuprofile
  • Changed line and column numbers to be zero-based during the conversion of preprocessed V8 logs into cpuprofile
  • Limited the display of very long regular expressions in the function page header to the first 256 characters
  • Replaced the engine category with compilation (covering parsers and compilers) and blocking (atomic waits) categories
  • Added a devtools category for scripts injected by a runtime for debugging purposes, such as those added by VS Code for scripts run from the "Run and Debug" panel
  • Fixed processing of v8/gc call frames that appear when Node.js is run with --expose-gc and global.gc() is invoked (added to internals)
  • Various fixes and improvements