A/B Testing
Quantitative Method:
Compare how players behave across randomized variants using measurable metrics such as retention, conversion, and revenue. (The original description here matched a qualitative method and has been corrected to fit A/B testing.)
Best Stage: Late development → Live / post-launch
Primary Goal: Compare two variants to see which performs better
Effort: Moderate - High
Overview
A/B testing is a method used to compare two versions of a feature, mechanic, or interface in your game to determine which performs better. By splitting your players into randomized groups and showing each group a different version (A or B), you can collect data on how each version impacts player behavior. This approach enables data-driven decision-making, helping developers optimize the player experience.
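In practice, the randomized split is often implemented by hashing a stable player ID, so each player always sees the same variant across sessions. Here is a minimal sketch; the bucketing scheme and names are illustrative, not any particular SDK's API:

```python
import hashlib

def assign_variant(player_id: str, experiment: str) -> str:
    """Deterministically bucket a player into group 'A' or 'B'.

    Hashing (experiment name + player ID) keeps assignment stable across
    sessions and independent across experiments. Real A/B frameworks add
    salting, exposure logging, and configurable traffic splits on top of
    this basic idea.
    """
    digest = hashlib.sha256(f"{experiment}:{player_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# The same player always lands in the same group for a given experiment:
print(assign_variant("player_123", "new_tutorial"))
```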
A/B testing can be applied in various scenarios, including but not limited to:
- Testing different UI layouts or placements to improve user experience.
- Experimenting with different mechanics, difficulty levels, and progression to increase retention.
- Investigating different item prices/pricing models to maximize revenue.
- Analyzing the impact of new features on user experience.
- Testing different ad placements to improve ad revenue.
How It Works:
1. Identify a Variable to Test
   Choose a single change to measure. For example:
   - Will a larger “Buy” button increase purchases?
   - Does a daily login reward keep users coming back?
   - Do players prefer a fast or slow progression curve?
2. Create Two Versions
   - Version A (Control): The original or current implementation.
   - Version B (Variant): The new idea or change you want to test.
3. Segment Your Audience
   Randomly split your players into two (or more) groups. Make sure the groups are statistically similar to avoid bias. (see Determining Sample Size)
4. Run the Test
   Deploy each version to its corresponding player group while keeping all other variables constant.
5. Collect Data
   Measure key metrics over a meaningful period of time (a minimal tallying sketch follows this list); these might include:
   - Session length
   - Retention rate
   - Conversion rate
   - Revenue per user
   - Engagement with specific features
6. Analyze the Results
   Compare the data from both groups to see which version performed better. Use statistical significance testing to ensure your results are reliable.
7. Act on Your Findings
   If Version B outperforms Version A, consider rolling it out to all players. If not, use what you’ve learned to refine your hypothesis and test again.
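As a rough illustration of step 5, per-variant outcomes can be tallied before analysis. This is a minimal sketch; the function name and in-memory storage are stand-ins for whatever analytics pipeline your game already uses:

```python
from collections import defaultdict

# Tally of Day 2 outcomes per variant; a real game would aggregate these
# from its analytics backend rather than an in-memory dict.
counts = defaultdict(lambda: {"players": 0, "returned": 0})

def record_day2_outcome(variant: str, returned: bool) -> None:
    """Record one player's Day 2 outcome under their assigned variant."""
    counts[variant]["players"] += 1
    counts[variant]["returned"] += int(returned)

# Example: two simulated players, one per group.
record_day2_outcome("A", True)
record_day2_outcome("B", False)

# After the test window closes, compare retention per variant:
for variant, c in sorted(counts.items()):
    rate = c["returned"] / c["players"]
    print(f"Variant {variant}: {c['returned']}/{c['players']} returned ({rate:.1%})")
```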
A/B Test analysis example:
What’s being tested:
A new tutorial system for onboarding new players.
Goal:
See if the new tutorial improves Day 1 retention (whether players come back the next day).
Test Setup
- Group A (Control): 500 new players see the old tutorial
- Group B (Variant): 500 new players see the new tutorial
After Day 1, we check whether each player returned on Day 2.
Each player’s outcome is binary: Returned or Did Not Return.
Data Collected:
- Old tutorial (A): 200 / 500 players returned on Day 2, 300 did not
- New tutorial (B): 250 / 500 players returned on Day 2, 250 did not
At a glance: the new tutorial shows higher Day 2 retention (50% vs. 40%), pending statistical validation.
Analyzing with Chi-Square:
A Chi-Square Test of Independence is used to determine whether the difference in Day 2 return rates between the two groups is statistically significant; in other words, whether the difference is likely due to the feature being tested or just random chance.
Even if one version looks like it's performing better, a statistical test helps confirm whether that difference actually matters.
A Chi-Square Test is used when comparing categorical data — information grouped into distinct, non-numerical categories or labels. These can be:
- Nominal (unordered), like tutorial version, player region, or character class.
- Ordinal (ordered), like survey responses from "strongly disagree" to "strongly agree."
In game testing, it’s a helpful way to analyze things like retention, conversion, or player choices across different groups.
Calculate expected values:
To calculate expected values, we assume the tutorial has no effect on whether players return. Expected values are calculated using the formula:
Expected = (Row Total × Column Total) / Grand Total
With 450 players returning in total, 550 not returning, 500 players per group, and a grand total of 1,000, each tutorial version is expected to have (500 × 450) / 1000 = 225 players return on Day 2 and (500 × 550) / 1000 = 275 players not return.
Chi-Square Calculation
To test whether the observed differences are meaningful, we use a Chi-Square test, which compares observed values to expected values using the formula:
χ² = Σ (Observed − Expected)² / Expected
Where:
- Observed (O): the actual player counts
- Expected (E): the counts we would expect if the tutorial had no impact
Applying this formula to all four outcomes (returned and not returned for both tutorial versions):
χ² = (200 − 225)²/225 + (250 − 225)²/225 + (300 − 275)²/275 + (250 − 275)²/275 ≈ 2.78 + 2.78 + 2.27 + 2.27 ≈ 10.1
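The same statistic can be computed with a library instead of by hand. Here is a minimal sketch using SciPy's chi2_contingency, assuming SciPy is available in your environment; correction=False disables Yates' continuity correction so the result matches the manual calculation:

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table from the example above:
#                Returned  Did Not Return
table = [[200, 300],   # Old tutorial (A)
         [250, 250]]   # New tutorial (B)

# SciPy's default Yates' correction would give a slightly smaller, more
# conservative statistic for 2x2 tables, hence correction=False here.
chi2, p, dof, expected = chi2_contingency(table, correction=False)

print(f"chi-square: {chi2:.2f}")  # ~10.10, matching the hand calculation
print(f"p-value:    {p:.4f}")     # ~0.0015, well below the 0.05 threshold
print(f"dof:        {dof}")       # 1
print(expected)                   # [[225. 275.] [225. 275.]]
```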
Conclusion
With 1 degree of freedom, the critical Chi-Square value at the 0.05 significance level is 3.84. Our value of roughly 10.1 comfortably exceeds it (p ≈ 0.0015), so players who saw the new tutorial were significantly more likely to return on Day 2.
Decision: The new tutorial is performing better and should be considered for rollout.