Be careful with performance.measure

I recently helped a team improve the performance of one of their services. They had already invested some time in optimizations, but the performance kept getting worse! Their Node.js service was suffering from slow response times. At some point, the team had added performance tracing using performance.mark() and performance.measure() to understand why the response time was so long. Little did they know that this would make things even worse…

While analyzing the service’s performance, I noticed that its memory usage grew over time, reaching over 4GB after a month of uptime. This sounds like a classic memory leak. Due to the high memory pressure, each request triggered one or more garbage collection runs, which took a long time to complete and blocked the main thread without actually freeing any memory. As a result, the service was unresponsive for several seconds during garbage collection. Restarting the service on a daily basis greatly improved the response times.
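
If you want to confirm a suspicion like this before attaching any tooling, a simple way is to log the process’s heap statistics periodically. The following is just a minimal sketch using Node’s process.memoryUsage(); the interval and the log format are arbitrary choices, not something the team had in place.

const logMemoryUsage = () => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const toMB = (bytes) => Math.round(bytes / 1024 / 1024);
  console.log(`rss=${toMB(rss)}MB heapUsed=${toMB(heapUsed)}MB heapTotal=${toMB(heapTotal)}MB`);
};
// Log the memory usage once per minute.
setInterval(logMemoryUsage, 60_000);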

This calls for a deeper analysis of the performance problem. I used the --inspect CLI parameter for Node.js to attach the Chrome Dev Tools as a debugger to the locally running Node.js service. Fortunately, the team already had a load test suite available to reproduce the problem. The Chrome Dev Tools provide a memory profiler to take and compare heap snapshots. One approach is to take a snapshot before the load test and another one after it to see what was allocated and what is still retained in memory. It’s best to warm up the service by running the load test once before taking the first snapshot, to avoid false positives such as caches that fill up over time. If you want to analyze the heap growth in more detail, it’s probably best to minimize the number of operations between the two snapshots, but in this case running the load test worked really well.
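
If attaching the Chrome Dev Tools is inconvenient, heap snapshots can also be written from within the process and loaded into the Memory tab afterwards. This is only a sketch of one possible setup using Node’s built-in v8.writeHeapSnapshot(); the SIGUSR2 trigger is an assumption on my part, not part of the team’s setup.

const v8 = require('node:v8');

// Write a heap snapshot whenever the process receives SIGUSR2,
// e.g. once before and once after a load test run.
process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot(); // the file name is generated automatically
  console.log(`Heap snapshot written to ${file}`);
});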

Analyzing the heap snapshots requires some luck. You can review the differences, look for things that catch your interest, and compare the objects on the heap to your code. One thing I noticed was a large number of PerformanceMark and PerformanceMeasure objects. These come from the usage of performance.mark() and performance.measure().
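
You don’t necessarily need a heap snapshot to spot this particular problem. Once the User Timing entries are under suspicion, you can ask the performance API itself how many entries it is still holding on to. A minimal sketch, assuming a reasonably recent Node.js where performance and its query methods are available globally:

// Count the retained entries; in the affected service these numbers
// kept growing with every request.
const markCount = performance.getEntriesByType('mark').length;
const measureCount = performance.getEntriesByType('measure').length;
console.log(`retained marks: ${markCount}, retained measures: ${measureCount}`);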

What are performance.mark() and performance.measure(), and how can they be used? The API is tightly coupled with the performance timeline view in the Chrome Dev Tools. You can use performance.mark() to add a marker to the timeline that represents a discrete named event. For example, you can use it to indicate when your page has loaded, or when the user performed an action. You can use performance.measure() to combine two existing markers into a time span. A time span can be an operation such as a calculation or a load process that you want to measure. Note that performance.measure() uses the most recent marker with a given name, so be careful when naming your markers in async code, otherwise the results of your measurements may be unexpected.

performance.mark('start');
// Your operation
performance.mark('end');
performance.measure('timespan', 'start', 'end');
Markers and measurements in the Performance Timeline view of the Chrome Dev Tools.
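
In a server that handles many requests concurrently, a common way to deal with this is to make the marker names unique per request. The handler, the requestId, and the dummy query below are hypothetical; the point is only the naming scheme:

// Stand-in for the actual asynchronous work.
const runDatabaseQuery = () => new Promise((resolve) => setTimeout(resolve, 100));

async function handleRequest(requestId) {
  performance.mark(`db-query-start-${requestId}`);
  await runDatabaseQuery();
  performance.mark(`db-query-end-${requestId}`);
  performance.measure(
    `db-query-${requestId}`,
    `db-query-start-${requestId}`,
    `db-query-end-${requestId}`
  );
}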

Both operations return an object with some details, such as the timestamp or the duration of the span. To display the marks and spans on the timeline, these objects are held in memory until they are explicitly released. It’s a bit hidden in the documentation, but you have to call performance.clearMarks() and performance.clearMeasures() to release these objects.
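
If you do want to keep using performance.measure(), one option is to clear the entries right after reading the result. A sketch along these lines, assuming a Node.js or browser version where performance.measure() returns the PerformanceMeasure entry:

performance.mark('start');
// Your operation
performance.mark('end');
const measurement = performance.measure('timespan', 'start', 'end');
console.log(`Your operation took ${measurement.duration} milliseconds.`);

// Release the retained entries so they cannot pile up.
performance.clearMarks('start');
performance.clearMarks('end');
performance.clearMeasures('timespan');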

Removing all performance.mark() and performance.measure() calls from the code fixed the memory leak. But you probably don’t want to use performance.mark() anyway. If you just want to log the duration of an operation, you are better off using performance.now() and calculating the runtime of your code yourself.

const start = performance.now();
// Your operation
const end = performance.now();
const duration = end - start;
console.log(`Your operation took ${duration} milliseconds.`);

In the end, there is some irony in the fact that a tool meant to help analyze the application’s performance introduced an even bigger problem and made the performance worse.