We’ve previously discussed how we adopted the new React Native architecture as a way to improve our performance. Before we discuss how we detect regressions, let’s first explain how we define performance.
Mobile performance indicators
In browsers, there is already a set of industry-standard metrics for measuring performance in the Core Web Vitals, and while they are by no means perfect, they focus on the real impact on user experience. We wanted something similar for our applications, so we adopted App Render Complete (ARC) and Navigation Total Blocking Time (NTBT) as our two most important measurements.
- App Render Complete is the time it takes to cold start the application for an authenticated user until it is fully loaded and interactive, roughly equivalent to Time To Interactive in the browser.
- Navigation Total Blocking Time is the amount of time the application is blocked from processing work during the 2-second window after a navigation. It’s a proxy for overall responsiveness in the absence of something like Interaction to Next Paint.
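To make the NTBT definition concrete, here is a minimal sketch of how such a value could be computed, assuming we already have a list of spans during which the JS thread was blocked. Collecting those spans is out of scope here, and the type and function names are illustrative, not our production implementation.

```typescript
// Hypothetical sketch: Navigation Total Blocking Time (NTBT) as the sum of
// time the JS thread is blocked during the 2-second window after a navigation.
// The span-collection mechanism (e.g. long-task tracking) is assumed, not shown.

interface BlockingSpan {
  start: number; // ms timestamp when the thread became blocked
  end: number;   // ms timestamp when it was unblocked
}

const NTBT_WINDOW_MS = 2_000;

function navigationTotalBlockingTime(
  navigationStart: number,
  blockingSpans: BlockingSpan[],
): number {
  const windowEnd = navigationStart + NTBT_WINDOW_MS;

  return blockingSpans.reduce((total, span) => {
    // Only count the portion of each blocking span that overlaps the window.
    const overlap =
      Math.min(span.end, windowEnd) - Math.max(span.start, navigationStart);
    return total + Math.max(overlap, 0);
  }, 0);
}
```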
We still collect many other metrics, such as render times, bundle sizes, network requests, frozen frames, and memory usage, but these are indicators that tell us why something went wrong rather than how our users perceive our applications.
Their advantage over the more holistic ARC/NTBT metrics is that they are more granular and deterministic. For example, it is much easier to reliably impact and detect an increase in bundle size or a decrease in total bandwidth usage, but this does not automatically translate into a noticeable difference for our users.
Metrics collection
Ultimately, we’re interested in how our apps perform on our users’ actual physical devices, but we also want to know how an app performs before we ship it. For this we leverage the Performance API (via react-native-performance), which we forward to Sentry for real user monitoring; in development this is supported out of the box by Rozenite.
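In practice this means marking and measuring key moments with the Performance API and forwarding the resulting entries. The sketch below uses react-native-performance's PerformanceObserver; `reportToSentry` is a hypothetical stand-in for however measurements reach your monitoring backend.

```typescript
import performance, { PerformanceObserver } from 'react-native-performance';

// Hypothetical stand-in for forwarding a measurement to Sentry (or any RUM backend).
declare function reportToSentry(name: string, durationMs: number): void;

// Mark interesting points in the app lifecycle...
performance.mark('appRenderStart');
// ...later, once the app is fully loaded and interactive:
performance.mark('appRenderComplete');
performance.measure('appRenderComplete', 'appRenderStart', 'appRenderComplete');

// Observe measures and forward them for real user monitoring.
new PerformanceObserver((list) => {
  list.getEntries().forEach((entry) => {
    reportToSentry(entry.name, entry.duration);
  });
}).observe({ type: 'measure', buffered: true });
```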
But we also wanted a reliable way to evaluate and compare two different versions to see if our optimizations are moving things forward or if new features are causing performance to regress. Since Maestro was already used for our end-to-end testing suite, we simply extended it to also collect performance benchmarks in some key flows.
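Conceptually, the harness is straightforward: a script drives a Maestro flow and picks up whatever metrics the flow leaves behind. A rough sketch follows; the flow path and the metrics file it writes are assumptions for illustration, not our actual setup.

```typescript
import { execSync } from 'node:child_process';
import { readFileSync } from 'node:fs';

// Hypothetical paths: a benchmark flow and the metrics file it is assumed to
// write out after navigating through the key screens.
const FLOW = 'flows/benchmarks/cold-start.yaml';
const METRICS_FILE = 'build/benchmark-metrics.json';

// Run the flow once with the standard Maestro CLI.
execSync(`maestro test ${FLOW}`, { stdio: 'inherit' });

// Collect the metrics (e.g. { "appRenderComplete": 1234, "ntbt": 180 }).
const metrics: Record<string, number> = JSON.parse(
  readFileSync(METRICS_FILE, 'utf8'),
);
console.log(metrics);
```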
To account for run-to-run variance, we ran the same flow multiple times on different devices in our CI and calculated the statistical significance for each metric. We could now compare each pull request against our main branch and see how it fared performance-wise. Performance regressions were surely a thing of the past.
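To compare a pull request against main, each metric's samples from both branches can be fed into a significance test. Here is a minimal sketch using Welch's t-statistic; it is illustrative, and the exact test is not the point.

```typescript
// Welch's t-statistic comparing a metric's samples from a PR branch against
// samples from main. (Our real setup may differ; this is illustrative.)

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
}

// |t| above ~2 is a rough signal that the difference is unlikely to be noise
// (for reasonably sized samples).
function welchT(prSamples: number[], mainSamples: number[]): number {
  const se = Math.sqrt(
    variance(prSamples) / prSamples.length +
      variance(mainSamples) / mainSamples.length,
  );
  return (mean(prSamples) - mean(mainSamples)) / se;
}

// Example: compare App Render Complete samples (ms) from a PR vs. main.
const t = welchT([1180, 1225, 1198, 1240], [1110, 1132, 1121, 1105]);
console.log(`t = ${t.toFixed(2)}`); // a large positive t suggests the PR is slower
```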
Reality check
In practice, this did not produce the results we had hoped for, for several reasons. We first saw that automated benchmarks were mainly used when developers wanted to verify that their optimizations were having an effect – which in itself is important and very valuable – but this was usually after seeing a regression in real user monitoring, not before.
To address this, we started running performance tests between release branches to see how they compared. While this helped detect regressions, they were generally difficult to resolve because there was a full week of changes to sift through – something our release managers simply didn’t have the capacity to do in every case. And even when they found the cause, it was often too late to simply turn back the clock.
On top of that, the App Render Complete metric was network-dependent and non-deterministic: if the servers were under extra load that hour or a feature flag had been toggled, the results would shift even though the code hadn’t changed, invalidating the statistical significance calculation.
Precision, specificity and variance
We had to go back to the drawing board and reconsider our strategy. We had three major challenges:
- Precision: even if we could detect that a regression had occurred, it was not clear which change had caused it.
- Specificity: we wanted to detect regressions caused by changes to our mobile codebase. While regressions that impact users in production matter regardless of their cause, the opposite is true pre-production, where we want to isolate our own changes as much as possible.
- Variance: for the reasons mentioned above, our benchmarks simply weren’t stable enough from run to run to say with certainty that one version was faster than another.
The solution to the precision problem was simple: we just needed to run the benchmarks for every merge, so that we could see on a time-series graph exactly when things changed. This was mainly an infrastructure challenge, but thanks to optimized pipelines, build processes and caching, we were able to reduce the total time from merge to a benchmark-ready build to around 8 minutes.
When it came to specificity, we needed to eliminate as many confounding factors as possible, with the backend being the main one. To achieve this, we record network traffic ahead of time and replay it during the tests, including API responses, feature flags, and WebSocket data. Additionally, runs were spread across even more devices.
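As an illustration of the replay idea (not our actual tooling; the recording format and the fetch-level interception point are assumptions), serving recorded responses instead of hitting the real backend could look like this:

```typescript
// Serve previously recorded responses during benchmark runs so that the
// backend no longer contributes variance.

interface RecordedResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Recordings keyed by method + URL (a real implementation would also need to
// handle request bodies, feature-flag payloads and WebSocket frames).
const recordings = new Map<string, RecordedResponse>();

const requestKey = (input: RequestInfo | URL, init?: RequestInit) => {
  const url =
    typeof input === 'string'
      ? input
      : input instanceof Request
        ? input.url
        : input.href;
  return `${init?.method ?? 'GET'} ${url}`;
};

const originalFetch = globalThis.fetch;

globalThis.fetch = async (input, init) => {
  const recorded = recordings.get(requestKey(input, init));
  if (recorded) {
    // Deterministic replay: same payload every run, no backend variance.
    return new Response(recorded.body, {
      status: recorded.status,
      headers: recorded.headers,
    });
  }
  // Fall through (or fail hard) for anything that wasn't recorded.
  return originalFetch(input, init);
};
```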
Together, these changes also helped solve the problem of variance, partly by reducing it, but also by increasing the sample size by several orders of magnitude. Just like in production, a single sample never tells the whole story, but looking at all of them over time, it was easy to spot trend changes that we could attribute to a range of 1-5 commits.
Alerting
As mentioned above, simply having the metrics is not enough, as any regression must be addressed quickly, so we needed an automated way to alert us. At the same time, if we alerted too often, or incorrectly due to the inherent variance, the alerts would be ignored.
After testing more esoteric models like online Bayesian change-point detection, we settled on a much simpler moving average: when a metric degrades by more than 10% for at least two consecutive runs, we trigger an alert.
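In code terms, the rule could look something like the sketch below. The window size and the assumption that higher values are worse (as for timing metrics) are illustrative details; the only parts taken from above are the 10% threshold and the two consecutive runs.

```typescript
const WINDOW = 10;          // how many previous runs feed the moving average
const THRESHOLD = 0.10;     // 10% degradation
const CONSECUTIVE_RUNS = 2; // required consecutive degraded runs

function shouldAlert(history: number[]): boolean {
  if (history.length < WINDOW + CONSECUTIVE_RUNS) return false;

  // The last CONSECUTIVE_RUNS samples must all exceed the moving average of
  // the runs that came before them by more than THRESHOLD.
  for (let i = history.length - CONSECUTIVE_RUNS; i < history.length; i++) {
    const baselineWindow = history.slice(i - WINDOW, i);
    const baseline =
      baselineWindow.reduce((a, b) => a + b, 0) / baselineWindow.length;
    if (history[i] <= baseline * (1 + THRESHOLD)) {
      return false;
    }
  }
  return true;
}
```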
Next steps
While detecting and fixing regressions before a release branch is cut is fantastic, the holy grail is preventing them from merging in the first place.
What’s stopping us from doing this at the moment is twofold: first, running benchmarks for every commit on every branch requires even more capacity in our pipelines, and second, we need enough statistical power to know whether there was an effect at all.
The two are at odds: given the same budget to spend, running benchmarks for more commits means running each of them on fewer devices, which reduces statistical power.
The trick we intend to apply is to spend our resources more intelligently – since the effect size varies, our sample size can too. Essentially, changes with a large impact can be detected with fewer runs, while changes with a smaller impact require more.
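One way to reason about that trade-off is a standard power calculation: the smaller the effect we want to detect, the more runs we need. The sketch below uses textbook values for significance and power; the numbers and function are illustrative, not our actual tooling.

```typescript
// Standard two-sample power approximation; z-values correspond to a two-sided
// 5% significance level and 80% power.
const Z_ALPHA = 1.96;
const Z_BETA = 0.84;

// Runs needed per variant to detect a relative effect (e.g. 0.05 for 5%)
// given the metric's relative standard deviation (e.g. 0.10 for 10% noise).
function runsPerVariant(relativeEffect: number, relativeStdDev: number): number {
  const n = 2 * ((Z_ALPHA + Z_BETA) * relativeStdDev / relativeEffect) ** 2;
  return Math.ceil(n);
}

console.log(runsPerVariant(0.10, 0.10)); // large regressions: ~16 runs
console.log(runsPerVariant(0.02, 0.10)); // subtle regressions: ~392 runs
```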
Make mobile performance regressions observable and actionable
By combining Maestro-based benchmarks, tighter variance control, and pragmatic alerting, we have taken performance regression detection from a reactive exercise to a systematic, near real-time signal.
While there is still work to be done to stop regressions before they are merged, this approach has already made performance a top-tier, continuously monitored concern – helping us deliver faster without slowing down.


