
Releases: discoveryjs/cpupro

0.6.0 – Call stack restoration of inlined code, enhanced calculations and presentation, new data model

14 Jan 06:37

In the months since the previous release, CPUpro underwent significant refactoring, with major improvements in computations and new feature prototypes. This release introduces enhanced calculations, a new data model, and select UI features considered ready.

The main areas of focus are:

  • Enhanced calculations for accuracy and resource efficiency, along with new metrics to refine analysis and highlight potential issues. Optimized calculations improve performance and enable complex computations and metrics while enhancing UX for handling multiple profiles.
  • Improved profile details, particularly from V8 logs in JSON format. This enables deeper insights into V8 processes, including compilation, optimization, inlining, and memory allocation.
  • Sample aggregation and reduction, designed for analyzing long-running processes. This narrows the analysis scope, reduces noise, and simplifies work with large profiles.
  • Support for multiple profiles, enabling result averaging, noise reduction, and comparisons. Experimental features in this area show promise and are expected to mature soon.

CPUpro focuses on low-level details and long sessions, leaving small- and medium-scale scenarios to browser devtools like Chromium DevTools. It aims to fill gaps in areas where these tools fall short, such as data manipulation and profile comparisons.

This release delivers results from the first two focus areas, with other features planned for upcoming releases in the coming weeks.

Call stack restoration for inlined code

Previously, CPUpro only performed a basic parse of the V8 log (or, more precisely, the preprocessed V8 log – a JSON format derived from the V8 log, for instance by using the command node --prof-process --preprocess ...). It turned out that the stack traces in this profile format contain not only a “reference” to a function (call frame) but also an offset in the bytecode or machine code of the function. Previously, CPUpro, like other tools, simply ignored these offsets. However, this information is quite useful, since it enables two important artifacts:

  • A per-instruction association of samples (i.e., associating time with the exact location in the code) instead of per-function.
  • The ability to restore the original stack trace for inlined code.

Offsets alone are not enough. Functions (or, more precisely, their representation as bytecode or machine code) also have a table associating the compiled code with the original code, along with auxiliary dictionaries. Combined, these provide information about which point in the code corresponds to the block of the function being executed on the call stack, as well as whether that block is part of inlined code from another function. All this makes it possible to map the executing compiled code back to a location in the original code and to “unfold” code assembled from fragments of different functions (inlined code) into a call chain, effectively inserting it into the call stack trace instead of a single function call. This is a rather complex and resource-intensive task (in both time and memory), but the result is worth it.
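As an illustration of the idea, here is a minimal sketch of such “unfolding”. The data structures are hypothetical simplifications of what the preprocessed V8 log provides (code entries with inlining tables), not CPUpro's actual internal model:

```js
// A code entry: compiled code for a function, with a table mapping
// ranges of code offsets to the chain of call frames inlined there.
// (Hypothetical shape for illustration purposes.)
const codeEntry = {
    callFrame: 'test',
    inlined: [
        // { offsetStart, offsetEnd, chain: innermost-first call frames }
        { offsetStart: 120, offsetEnd: 180, chain: ['a', 'test'] },
        { offsetStart: 200, offsetEnd: 260, chain: ['b', 'test'] }
    ]
};

// Given a sampled (codeEntry, offset) pair, return the restored list of
// call frames, innermost first. Falls back to the entry's own call frame
// when the offset doesn't land inside inlined code.
function unfoldFrame(entry, offset) {
    for (const range of entry.inlined) {
        if (offset >= range.offsetStart && offset < range.offsetEnd) {
            return range.chain;
        }
    }
    return [entry.callFrame];
}
```

With these structures, a sample landing at offset 130 is restored to the chain `a` → `test` instead of being attributed to `test` alone.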

Two situations often arise when analyzing performance profiles. First, functions that are expected to appear at the top of the profile end up near the bottom of the list or are simply missing. Second, simple functions that merely call one or more other functions unexpectedly appear at the top of the list. This is quite typical for “hot” code (executed many, many times) and stems from JavaScript engines inlining code from frequently called functions into their callers.

We won’t go into all aspects of how JavaScript engines perform function inlining. Within these release notes, only the effect of inlining matters: once a function is inlined into another function, it disappears from the stack trace. This means that the samples (time) previously attributed to that function are now attributed to the function into which it was inlined. It’s also worth noting that a function can inline code from several functions, including functions reached through other inlined calls. The number and depth of nested inlined functions are governed by complex rules, but in practice this can be dozens of functions with fairly deep nesting. This is further complicated by the fact that optimizations and inlining are non-deterministic: they may happen at any arbitrary time depending on many factors, or not happen at all, so the results will always differ from one measurement to another. In other words, optimizations (inlining) distort the distribution of time across functions, and this distortion has a large element of randomness, which often deprives us of reliable insight into how to improve code performance.

To demonstrate how inlining affects profile analysis, consider the following simple “benchmark” that tests two functions with identical code, and see how the results are displayed:

function a() {
    let result = 0;
    for (let i = 0; i < 100_000; i++) {
        result += i % 7;
    }
    return result;
}

function b() {
    let result = 0;
    for (let i = 0; i < 100_000; i++) {
        result += i % 7;
    }
    return result;
}

function test(fn) {
    let result = 0;
    for (let i = 0; i < 10_000; i++) {
        result += fn();
    }
    return result;
}

test(a);
test(b);
Chromium DevTools:

image

Although the code of functions a and b is identical, the reported time for these functions differs several-fold. The difference is due to the effect of inlining. Without knowing the code and relying solely on what the tools show, one might conclude that b is slow, a is super-fast, and most of the work is done by test, since it accounts for most of the time. However, knowing how the code is organized, we can say that this is a misleading picture. In fact, any profiling tool will yield roughly the same results for this example. Spoiler alert: in this particular case, the issue is not so much how the data is displayed as how the profile was recorded.

If we disable the optimizing compilers that apply code inlining - Turbofan and Maglev - using V8 flags (for Chromium-based browsers, --js-flags='--no-turbofan --no-maglev'), the result changes:

image

Yes, now the test runs significantly slower (about 6 times slower), which is expected because the machine code is less optimized. At the same time, however, the duration of functions a and b is now roughly the same. This is a consequence of disabling code inlining, which makes the call stack more stable. However, the combined time of a and b is now less than 20% of the total time of the test function, which is somewhat puzzling, since intuitively it seems that a and b should take about 10 times longer to run than the calling function test.

Unfortunately, it is currently impossible to get profiling results in Chromium DevTools that match expectations, even in such a simple case. There are two main problems. The first relates to how Chromium DevTools interacts with V8 and records events (i.e., sampling the stack trace and VM state via CDP). The second is that Chromium DevTools (and CDP) does not use the offset information in the call frames on the stack or the code tables to restore the stack trace for inlined code.

None of this is a fundamental limitation of V8, CDP, or Chromium DevTools, but rather an engineering gap, and here is why. In V8-based runtimes (e.g., Node.js and Deno), it is possible to record what’s known as a V8 log using the --prof flag (this can also be done in the browser, but it’s much more complicated). By leveraging the data from the V8 log (including the JSON format derived from it), specifically the offsets for call frames in stack traces and the code tables for restoring the call stack with inlined code, one can obtain more accurate and expected results. Here’s how such a profile looks in the new version of CPUpro:

image

And here’s how the same profile looks, but without restoring the stack trace for inlined code:

image

In real-world profiles, the effect of inlining is no less significant than in this synthetic test. Here is an example of how a profile for a massive webpack build appears with and without restoring the stack trace for inlined code (taken from the Node.js world since, as mentioned earlier, in-browser profiling with the necessary data is not natively supported, and other methods are quite complicated). As the example shows, 9 of the top 20 call frames do not appear in both lists, and for 4 call frames the self time changes by 5–40%. The timings remain unchanged only for VM states and RegExp, which have no nested calls and thus are not affected by inlining, as well as for functions that have no nested calls or only calls to VM states (like compilation or GC); in this example, that’s the single function deepClone().

image

There are currently some challenges with restoring the call stack for inlined code:

  • Code tables and the associated dictionaries exist only in the V8 log (and its derivative, the preprocessed V8 log). Other profiling formats (V8 cpuprofile...

0.5.1

11 May 01:32
  • Added transformation from parent to children for call tree nodes for .cpuprofile files if needed (#5)
  • Implemented exclusion of trailing no-sample time. For certain profiles, the time from the last sample until endTime can be significant, indicating the conclusion of the profiling session or adjustments from excluding idle samples at the end. This time is now excluded from the profiling time, which is used for computing time percentages
  • Fixed double rendering of the page after the profile data is loaded
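The trailing no-sample exclusion can be sketched as follows. This is a simplified illustration: startTime, endTime and timeDeltas follow the V8 .cpuprofile format, while the gap threshold is an arbitrary assumption, not the exact heuristic used by CPUpro:

```js
// Compute profiling time, excluding a significant gap between the last
// sample and endTime (e.g. when idle samples were stripped at the end).
function computeProfilingTime(profile) {
    // Timestamp of the last sample = startTime + sum of time deltas
    const lastSampleTime = profile.timeDeltas.reduce(
        (time, delta) => time + delta,
        profile.startTime
    );

    const trailingGap = profile.endTime - lastSampleTime;
    const meanDelta = (lastSampleTime - profile.startTime) / profile.timeDeltas.length;

    // If the trailing gap is much larger than a typical sampling interval,
    // measure profiling time up to the last sample instead of endTime
    // (the 10x threshold here is an assumption for illustration)
    return trailingGap > 10 * meanDelta
        ? lastSampleTime - profile.startTime
        : profile.endTime - profile.startTime;
}
```

For a profile whose last sample lands far before endTime, the trailing gap is dropped from the time base used for percentages.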

0.5.0 Performance, Reworked UI, New formats, Deno

10 May 02:43

This release of CPUpro introduces significant updates: performance enhancements, a redesigned user interface, and expanded format and runtime support. The enhancements dramatically reduce the time to load and process extremely large profiles, making CPUpro highly efficient for analyzing complex long-running scripts. The user interface has been thoroughly revamped to offer a more intuitive and responsive experience across various features and views. New profile formats and support for the Deno runtime have been added, expanding the tool's versatility and adaptability to modern development environments.

Performance

CPUpro has been entirely re-engineered to optimize the preprocessing of profiles upon loading and for subsequent computations. This redesign enables it to handle massive profiles (exceeding 100MB) significantly faster than other tools. CPUpro is currently the best option for analyzing intense long-running scripts that generate extensive CPU profiles, such as webpack build profiles or prolonged browser sessions (that can last minutes or even tens of minutes).

The table below shows the load and first-render times of profiles of varying sizes across different tools:

| Profile size | Profile type | CPUpro v0.5 | CPUpro v0.4 | Chromium DevTools | speedscope |
| --- | --- | --- | --- | --- | --- |
| 33MB, 215k samples / 120k call tree | V8 cpuprofile | 0.5s | 0.8s | 4.6s | 6.5s |
| 113MB, 625k samples / 62k call tree | Chromium Profile | 1.3s | 1.6s | 10.6s | 12.4s |
| 114MB, 739k samples / 446k call tree | V8 cpuprofile | 1.3s | 2.6s | 12.3s | 18.5s |
| 239MB, 11.6M samples / 489k call tree | V8 cpuprofile | 2.8s | 11.3s | 48s | Out of memory (after 23s) |
| 277MB, 127k samples / 35k call tree | Chromium Profile | 1.9s | 2.2s | 4.2s | Out of memory (after 30s) |
| 418MB, 897k samples / 1.86M call tree | V8 cpuprofile | 4.6s | 8.7s | Out of memory (after 36s) | Out of memory (after 49s) |
| 2GB, 7.3M samples / 7.28M call tree | V8 cpuprofile | 27.1s | Out of memory (after 57s) | Invalid string length (after 20s) | Out of memory (after 43s) |

Chrome 124 / MacBook Pro 13-inch, M1, 2020

As indicated in the table, the time is affected not only by the profile size but also by its format, the number of samples and the size of the call tree (note that some profiles contain millions of samples and nodes). Notably, the Chromium Profile, which includes extensive additional data besides the CPU profile, tends to load faster than .cpuprofile files of the same size. It is worth mentioning that some tools struggle with large profiles, hitting the heap size limit (4GB) and crashing with "Out of memory" errors, which is particularly frustrating when a lengthy load yields no results. Unlike these tools, CPUpro avoids such pitfalls thanks to new optimizations and is now capable of loading and processing even 2GB profiles.

When comparing the loading time between CPUpro versions 0.4 and 0.5, the difference does not look so impressive. The reason is that a significant portion of the time is spent on loading and parsing JSON, which remains unchanged. However, if we isolate the processing time and initial rendering, where the main optimization efforts were concentrated, the new version shows performance improvements ranging from 1.5 to 11 times:

| Profile size | Profile type | Load data & parse | CPUpro v0.5 (computations + render) | CPUpro v0.4 (computations + render) | Delta |
| --- | --- | --- | --- | --- | --- |
| 33MB, 215k samples / 120k call tree | V8 cpuprofile | 0.3s | 0.16s | 0.52s | 3.1x |
| 113MB, 625k samples / 62k call tree | Chromium Profile | 1.1s | 0.21s | 0.64s | 3.0x |
| 114MB, 739k samples / 446k call tree | V8 cpuprofile | 0.9s | 0.37s | 1.48s | 4.0x |
| 239MB, 11.6M samples / 489k call tree | V8 cpuprofile | 2.2s | 0.79s | 9.21s | 11.7x |
| 277MB, 127k samples / 35k call tree | Chromium Profile | 1.9s | 0.15s | 0.24s | 1.7x |
| 418MB, 897k samples / 1.86M call tree | V8 cpuprofile | 3.6s | 1.12s | 4.26s | 3.6x |
| 2GB, 7.3M samples / 7.28M call tree | V8 cpuprofile | 22.1s | 4.98s | | |

Chrome 124 / MacBook Pro 13-inch, M1, 2020

The acceleration was achieved by switching to linear memory (TypedArrays) for tree representation and calculation storage, despite the increased number and complexity of computations added since v0.4. The majority of the calculation algorithms are implemented using simple loops without recursion or complex branching. Experiments with WebAssembly for some calculations resulted in up to a 2x speed increase in JavaScriptCore (Safari) and SpiderMonkey (Firefox), aligning their execution times with V8, where there was no change in performance. Remarkably, the new algorithms allow V8 to optimize JavaScript execution to match the efficiency of WebAssembly, which was unexpected.
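A minimal sketch of what such linear-memory calculations look like, assuming a call tree flattened so that every parent precedes its children. This is an illustration of the technique, not CPUpro's actual data model:

```js
// A call tree stored in flat typed arrays instead of an object graph.
// parent[i] is the index of node i's parent; node 0 is the root.
// Assumes nodes are ordered so a parent always precedes its children
// (true for a tree laid out by a breadth- or depth-first walk).
const parent = new Uint32Array([0, 0, 0, 1, 1]);
const selfTime = new Float64Array([0, 10, 5, 2, 3]);

function computeTotalTime(parent, selfTime) {
    const totalTime = Float64Array.from(selfTime);

    // Visit children before parents: a single reverse loop replaces
    // recursion, since a parent always has a smaller index
    for (let i = totalTime.length - 1; i > 0; i--) {
        totalTime[parent[i]] += totalTime[i];
    }

    return totalTime;
}
```

Here `computeTotalTime(parent, selfTime)` yields `[20, 15, 5, 2, 3]`: a single reverse pass replaces the usual recursive tree walk, and both inputs and results live in flat typed arrays rather than on the object heap.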

Adopting TypedArrays has drastically reduced heap memory usage. While modern browsers typically offer up to 4GB of heap space, exceeding this limit crashes the browser tab (and, accordingly, the app). CPUpro primarily uses the heap only for loading and parsing JSON and during the initial stages of data processing; after that, most data is managed using TypedArrays. These buffers, stored in what is termed "external memory", are limited only by the system's available memory, significantly lowering the risk of "Out of memory" crashes. In any case, there is little reason to worry, since CPUpro consumes memory sparingly:

| Profile size | Profile type | CPUpro v0.5 | CPUpro v0.4 | Chromium DevTools | speedscope |
| --- | --- | --- | --- | --- | --- |
| 33MB, 215k samples / 120k call tree | V8 cpuprofile | 8MB (external: 20MB) | 97MB | 752MB | 916MB |
| 113MB, 625k samples / 62k call tree | Chromium Profile | 7MB (external: 17MB) | 61MB | 1063MB | 466MB |
| 114MB, 739k samples / 446k call tree | V8 cpuprofile | 8MB (external: 155MB) | 324MB | 1803MB | 2001MB |
| 239MB, 11.6M samples / 489k call tree | V8 cpuprofile | 12MB (external: 92MB) | 463MB | 3877MB | Out of memory |
| 277MB, 127k samples / 35k call tree | Chromium Profile | 8MB (external: 9MB) | 34MB | 488MB | Out of memory |
| 418MB, 897k samples / 1.86M call tree | V8 cpuprofile | 18MB (external: 233MB) | 1387MB | Out of memory | Out of memory |
| 2GB, 7.3M samples / 7.28M call tree | V8 cpuprofile | 22MB (external: 866MB) | Out of memory | Invalid string length | Out of memory |

Data collected after loading the profile and calling the garbage collector

After loading the profile and initial calculations, CPUpro is ready for rapid timing recalculations and data resampling on demand, e.g. on filter changes. This enhancement enabled the introduction of new complex views that were previously impossible due to prolonged calculations (many seconds) and UI freezes that broke the user experience. Most views have also been optimized to react almost instantaneously to changes in filters, ensuring a seamless experience even with large profiles.

cpupro-perf.mov

The optimizations in speed and memory efficiency are not just about improving profile loading and UI responsiveness; they also unlock new capabilities. Notably, they are crucial for features such as profile comparison, which requires loading at least two profiles, potentially doubling both computation time and memory usage. These challenges have been addressed, setting the stage for future enhancements including profile comparison and more.

User interface

The user interface has undergone a significant redesign. The start page now appears more compact and provides a clearer overview of how the V8 engine operates. It features a timeline categorized by work type and function clustering tables, followed by a flamechart.

Demo


Other pages have also been reworked to be more informative. Each page now includes:

  • A timeline that displays not only self time but also nested time, with the distribution of nested time by categories.
  • A new section titled "Nested Time Distribution" that offers insights into the distribution of nested time in a hierarchical format, from a package to a function.
  • A basic flamechart displaying all frames related to the current subject (category, package, module, or function) as root frames.
image


The timeline has been enhanced with a tooltip that provides expanded details, and with the ability to select a range, a previously missing capability for focusing on specific segments of work.

image


The Flamechart is now faster and smoother. It includes new selection capabilities and a detailed information block for the selected or zoomed frame.

image


The welcome page has been redesigned as well, and now offers example profiles in various formats to try:

image

New formats, runtimes, and registries

Support for new formats has been introduced:

  • V8 log converted into JSON with the --preprocess op...

0.4.0

21 Jan 20:41
  • Report
    • Extracted regular expression into a separate area regexp
    • Fixed edge cases when scriptId is not a number
    • Added ancestor call sites on a function page
    • Added function grouping on a function page (enabled by default)
    • Added timeline split by areas on default page
    • Improved function subtree displaying
    • Fixed processing of evaled functions (call frames with evalmachine prefixes)
  • CLI:
    • Added support to load jsonxl files
  • API:
    • Profile (result of profileEnd()):
      • Renamed methods:
        • writeToFile() -> writeToFileAsync()
        • writeToFileSync() -> writeToFile()
        • writeJsonxlToFileSync() -> writeJsonxlToFile()
      • Changed writeToFileAsync(), writeToFile() and writeJsonxlToFile() methods to return a destination file path
      • Added writeReport() method as alias to report.writeToFile()
    • profileEnd().report
      • Renamed writeToFile() -> writeToFileAsync() and writeToFileSync() -> writeToFile() (however, at the moment both are sync)
      • Changed open() method to return a destination file path
    • Capture (result of profile())
      • Added onEnd(callback) method to add a callback to call once capturing is finished, a callback can take a profiling result argument
      • Added writeToFile(), writeJsonxlToFile(), writeReport() and openReport() methods to call corresponding methods once capturing is finished
    • Changed profile() to return an active capturing for a name if any instead of creating a new one
    • Changed profile() to subscribe on process exit to end profiling (process.on('exit', () => profileEnd()))
    • Added writeToFile(), writeJsonxlToFile(), writeReport() and openReport() methods that start profile() and call the corresponding method, i.e. writeReport() is the same as profile().writeReport()

0.3.0

06 Apr 17:32
  • Used jsonxl binary and gzip encodings for data when generating reports, which allows producing reports much faster and much smaller (up to 50 times) in size
  • Added writeJsonxlToFileSync() method to profile
  • Added build/*.html and package.json to exports
  • Report
    • Bumped discoveryjs to 1.0.0-beta.73
    • Enabled embed API for integrations
    • Reworked flamechart for performance and reliability; it's a little more compact now
    • Added badges for function references
    • Updated segments timeline
    • Fixed Windows path processing
    • New page badges

0.2.1 – Boosted flame chart performance and fixes

20 Apr 10:58
  • Added count badges and tweaked numeric captions
  • Reworked flamechart view to improve performance, especially on large datasets (eliminated double renders in some cases and a lot of unnecessary computations, among other optimisations)
  • Changed flamechart behaviour: clicking an already selected frame now selects the previously selected frame at a lower depth
  • Fixed flamechart's view height updating when stack depth is growing on zoom
  • Fixed processing of profiles when call frame scriptId is a non-numeric string
  • Bumped discoveryjs to 1.0.0-beta.65

0.2.0 – Support for Chromium profile format & flame charts

21 Feb 19:24
  • Added support for Chromium Developer Tools profile format (Trace Event Format)
  • Added flame chart on index page
  • Fixed time deltas processing
  • Fixed total time computation for areas, packages, modules and functions
  • Fixed module path processing
  • Reworked aggregations for areas, packages, modules and functions

0.1.1

08 Feb 15:33
  • Added missing bin field
  • Renamed profile recording method end() to profileEnd() for less confusion
  • Fixed a crash in viewer when an element in nodes doesn't contain a children field, e.g. when DevTools protocol is used
  • Fixed file module path normalization in viewer
  • Removed modification of startTime and endTime in recorded profile
  • Exposed createReport() method

0.1.0 – Hello world

07 Feb 22:09
  • Initial release