Preventing mobile performance regressions with Maestro

Mobile performance vitals

In browsers there is already an industry-standard set of metrics for measuring performance – the Core Web Vitals – and while they are by no means perfect, they focus on the actual impact on the user experience. We wanted something similar for apps, so we adopted App Render Complete and Navigation Total Blocking Time as our two most important metrics.

  • App Render Complete (ARC) is the time from cold booting the app for an authenticated user to it being fully loaded and interactive, roughly equivalent to Time To Interactive in the browser.

  • Navigation Total Blocking Time (NTBT) is the time the application is blocked from processing code during the 2-second window after a navigation. It’s a proxy for overall responsiveness in lieu of something better, like Interaction to Next Paint.
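To make the definition concrete, here is a simplified sketch of how a Total-Blocking-Time-style metric can be computed for that 2-second window. The `LongTask` shape and the helper are illustrative, not our actual implementation; only the portion of each task beyond the 50 ms blocking threshold counts.

```typescript
// Illustrative only: compute an NTBT-style value from long-task entries
// recorded during the 2 seconds following a navigation.
interface LongTask {
  startTime: number; // ms, on the same clock as navigationStart
  duration: number;  // ms
}

const BLOCKING_THRESHOLD_MS = 50;
const WINDOW_MS = 2_000;

function navigationTotalBlockingTime(navigationStart: number, tasks: LongTask[]): number {
  const windowEnd = navigationStart + WINDOW_MS;
  return tasks
    // Only tasks that start inside the 2-second post-navigation window.
    .filter((t) => t.startTime >= navigationStart && t.startTime < windowEnd)
    // A task only counts as "blocking" for the part of its duration beyond 50 ms.
    .reduce((total, t) => total + Math.max(0, t.duration - BLOCKING_THRESHOLD_MS), 0);
}
```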

We still collect a slew of other metrics – such as render times, bundle sizes, network requests, frozen frames and memory usage – but they are indicators that tell us why something went wrong rather than how our users perceive our apps.

Their advantage over the more holistic ARC/NTBT metrics is that they are more granular and deterministic. For example, it’s much easier to reliably impact and detect that bundle size increased or that total bandwidth usage decreased, but such a change doesn’t automatically translate into a noticeable difference for our users.

Collecting metrics

In the end, what we care about is how our apps run on our users’ actual physical devices, but we also want to know how an app performs before we ship it. For this we leverage the Performance API (via react-native-performance) that we pipe to Sentry for Real User Monitoring, and in development this is supported out of the box by Rozenite.
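As a rough illustration of that wiring (the details in our apps differ), react-native-performance exposes the familiar mark/measure API, and an observer can forward finished measures to whatever Real User Monitoring backend you use. The `reportMeasureToSentry` helper below is a hypothetical stand-in for the actual Sentry call.

```typescript
import performance, { PerformanceObserver } from 'react-native-performance';

// Hypothetical stand-in for however measurements are reported to Sentry.
declare function reportMeasureToSentry(name: string, durationMs: number): void;

// Forward every finished measure to Real User Monitoring.
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    reportMeasureToSentry(entry.name, entry.duration);
  }
});
observer.observe({ entryTypes: ['measure'] });

// Mark the interesting points of a flow...
performance.mark('appRenderStart');
// ...and later, once the screen is fully loaded and interactive:
performance.mark('appRenderComplete');
performance.measure('app_render_complete', 'appRenderStart', 'appRenderComplete');
```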

But we also wanted a reliable way to benchmark and compare two different builds to know whether our optimizations move the needle or new features regress performance. Since Maestro was already used for our end-to-end test suite, we simply extended it to also collect performance benchmarks in certain key flows.
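Conceptually, the benchmark harness is little more than running the same Maestro flow repeatedly and gathering the metrics each run produces. A minimal sketch is below; the flow name, the metrics file path and its JSON shape are made up for illustration, not part of Maestro itself.

```typescript
import { execFileSync } from 'node:child_process';
import { readFileSync } from 'node:fs';

// Run one Maestro flow, then read back the metrics the app wrote during the run.
function runBenchmarkOnce(flowFile: string, metricsPath: string): Record<string, number> {
  execFileSync('maestro', ['test', flowFile], { stdio: 'inherit' });
  return JSON.parse(readFileSync(metricsPath, 'utf8'));
}

// Repeat the flow to build up a sample per metric for later statistics.
function collectSamples(flowFile: string, metricsPath: string, runs: number): Record<string, number[]> {
  const samples: Record<string, number[]> = {};
  for (let i = 0; i < runs; i++) {
    for (const [name, value] of Object.entries(runBenchmarkOnce(flowFile, metricsPath))) {
      (samples[name] ??= []).push(value);
    }
  }
  return samples;
}
```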

To adjust for flukes we ran the same flow many times on different devices in our CI and calculated statistical significance for each metric. We were now able to compare each Pull Request to our main branch and see how it fared performance-wise. Surely, performance regressions were a thing of the past.
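The significance check itself can be as simple as a two-sample test per metric. Here is a sketch using Welch's t-statistic on the per-run samples; our actual analysis may differ, but the idea is the same.

```typescript
// Welch's t-statistic for comparing the same metric between two builds.
// A |t| comfortably above ~2, with a reasonable number of runs per side,
// suggests a real difference rather than run-to-run noise.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
}

function welchT(baseline: number[], candidate: number[]): number {
  const standardError = Math.sqrt(
    sampleVariance(baseline) / baseline.length + sampleVariance(candidate) / candidate.length,
  );
  return (mean(candidate) - mean(baseline)) / standardError;
}
```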

Reality check

In practice, this didn’t have the outcomes we had hoped for, for a few reasons. First, we saw that the automated benchmarks were mainly used when developers wanted validation that their optimizations had an effect – which in itself is important and highly valuable – but this was typically after we had seen a regression in Real User Monitoring, not before.

To address this we started running benchmarks between release branches to see how they fared. While this did catch regressions, they were typically hard to address as there was a full week of changes to go through – something our release managers simply weren’t able to do in every instance. Even if they found the cause, simply reverting often wasn’t a possibility.

On top of that, the App Render Complete metric was network-dependent and non-deterministic, so if the servers were under extra load that hour or a feature flag happened to be turned on, it would affect the benchmarks even though the code didn’t change, invalidating the statistical significance calculation.

Precision, specificity and variance

We had to go back to the drawing board and reconsider our strategy. We had three major challenges:

  1. Precision: Even if we could detect that a regression had occurred, it was not clear to us what change caused it.

  2. Specificity: We wanted to detect regressions caused by changes to our mobile codebase. While catching user-impacting regressions in production is crucial whatever their cause, the opposite is true for pre-production, where we want to isolate our own changes as much as possible.

  3. Variance: For reasons mentioned above, our benchmarks simply weren’t stable enough between each run to confidently say that one build was faster than another.

The solution to the precision problem was simple: run the benchmarks for every merge, so that we could see on a time-series graph exactly when things changed. This was mainly an infrastructure problem, but thanks to optimized pipelines, build processes and caching we were able to cut the total time from merge to benchmarks being ready down to about 8 minutes.

When it comes to specificity, we needed to cut out as many confounding factors as possible, with the backend being the main one. To achieve this we first record the network traffic and then replay it during the benchmarks, including API requests, feature flags and websocket data. Additionally, the runs were spread out across even more devices.
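A heavily simplified sketch of the replay idea for HTTP traffic is below. The recording format and lookup key are made up for illustration, and the real setup also has to cover feature flags and websocket data.

```typescript
// Serve previously recorded responses instead of hitting real backends, so
// benchmark runs are unaffected by server load or remote configuration changes.
interface RecordedResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

type Recording = Record<string, RecordedResponse>; // keyed by "METHOD URL"

export function installReplayFetch(recording: Recording): void {
  globalThis.fetch = async (input: RequestInfo | URL, init?: RequestInit): Promise<Response> => {
    const url =
      typeof input === 'string' ? input : input instanceof URL ? input.href : input.url;
    const method = (init?.method ?? 'GET').toUpperCase();
    const recorded = recording[`${method} ${url}`];
    if (!recorded) {
      throw new Error(`No recorded response for ${method} ${url}`);
    }
    return new Response(recorded.body, { status: recorded.status, headers: recorded.headers });
  };
}
```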

Together, these changes also contributed to solving the variance problem, in part by reducing it, but also by increasing the sample size by orders of magnitude. Just like in production, a single sample never tells the whole story, but by looking at all of them over time it was easy to see trend shifts that we could attribute to a range of 1-5 commits.

Alerting

As mentioned above, simply having the metrics isn’t enough, as any regression needs to be actioned quickly, so we needed an automated way to alert us. At the same time, if we alerted too often or incorrectly due to the inherent variance, the alerts would simply be ignored.

After trialing more esoteric models like Bayesian online changepoint detection, we settled on a much simpler moving average: when a metric regresses by more than 10% for at least two consecutive runs, we fire an alert.
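The check itself fits in a few lines. The sketch below uses the 10% threshold and two-consecutive-run rule described above; the length of the moving-average window is illustrative.

```typescript
const REGRESSION_THRESHOLD = 1.1; // alert on a >10% regression
const CONSECUTIVE_RUNS = 2;
const WINDOW = 10; // illustrative moving-average window length

// For metrics where higher is worse: alert when each of the latest two runs is
// more than 10% above the moving average of the runs that preceded it.
function shouldAlert(history: number[]): boolean {
  if (history.length < WINDOW + CONSECUTIVE_RUNS) return false;
  return Array.from({ length: CONSECUTIVE_RUNS }, (_, i) => {
    const index = history.length - CONSECUTIVE_RUNS + i;
    const window = history.slice(index - WINDOW, index);
    const baseline = window.reduce((a, b) => a + b, 0) / window.length;
    return history[index] > baseline * REGRESSION_THRESHOLD;
  }).every(Boolean);
}
```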

Next steps

While detecting and fixing regressions before a release branch is cut is fantastic, the holy grail is to prevent them from getting merged in the first place.

What’s stopping us from doing this at the moment is twofold: on the one hand, running this for every commit in every branch requires even more capacity in our pipelines; on the other, we need enough statistical power to tell whether there was an effect or not.

The two are antagonistic: with the same budget to spend, running benchmarks for more commits means spreading each one across fewer devices and runs, which reduces statistical power.

The trick we intend to apply is to spend our resources smarter – since the expected effect varies, so can our sample size. Essentially, for changes with a big impact we can do fewer runs, and for changes with a smaller impact we do more runs.
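A standard power-analysis approximation captures the intuition: the number of runs needed grows with the square of how small an effect you want to detect. The sketch below assumes roughly 95% confidence and 80% power; the exact numbers are illustrative, not our production configuration.

```typescript
// Approximate runs needed per build to detect a given relative effect, given
// the metric's run-to-run coefficient of variation (relative standard deviation).
const Z_ALPHA = 1.96; // two-sided 95% confidence
const Z_BETA = 0.84;  // 80% power

function runsNeeded(relativeEffect: number, coefficientOfVariation: number): number {
  return Math.ceil(
    (2 * (Z_ALPHA + Z_BETA) ** 2 * coefficientOfVariation ** 2) / relativeEffect ** 2,
  );
}

// Detecting a 10% change in a metric with 15% run-to-run noise is cheap,
// while resolving a 2% change in the same metric is not.
runsNeeded(0.1, 0.15);  // ≈ 36 runs per build
runsNeeded(0.02, 0.15); // ≈ 882 runs per build
```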

Making mobile performance regressions observable and actionable

By combining Maestro-based benchmarks, tighter control over variance, and pragmatic alerting, we have moved performance regression detection from a reactive exercise to a systematic, near-real-time signal.

While there is still work to do to stop regressions before they are merged, this approach has already made performance a first-class, continuously monitored concern – helping us ship faster without getting slower.

Explore open engineering roles at Kraken

