2023-12-11

Intro

We’ve had reports of users having trouble playing videos in Gleev for quite a while. To help the investigation, a “benchmark” of the Joystream CDN was created. In this post I will share an analysis of the results, along with some observations and recommendations.

In the first section I will explain the test methodology in detail so that any issues with my approach can be spotted and pointed out. In the second section I will get to the actual analysis of the results. In the last section, I will discuss conclusions and recommendations.

This document is the first part of the CDN performance analysis. In a later document I will take a look at the results of synthetic tests and possibly at Atlas-related issues.

Goals

This benchmark had a specific purpose: to see how distribution performance is perceived by end users. While synthetic tests run from data centres can help us understand the performance of our network, what ultimately matters is what the end (Gleev) user experiences. To best isolate the performance of the distribution network, only cached assets are considered for this test. That is not enough to get a full understanding of the storage system, but for now we focus strictly on the performance of distributors. In the future we may want to run a similar test with uncached assets to get the full picture.

Test methodology

The benchmark is a web application available at https://benchmark.joyutils.org. When run, the test proceeds as follows:

  1. A reference speed test is performed using Cloudflare - this allows us to normalize the download speeds and reduce the impact of poor network connections on the overall results (more on that later).
  2. Test URLs are collected - details about both the media and thumbnail data objects for the test video are fetched from the Query Node. Both objects will be downloaded from all operators, so we get 18 test URLs (2 objects * 9 operators); see the first sketch after this list.
  3. For each test URL a test is performed:
    1. In the case of images, the full image is downloaded and various measurements are taken - most notably the full response time and the time-to-first-byte (TTFB). In the case of media, the first 10/25 MB of the video are downloaded (using a bytes=0-${maxDownloadSize - 1} Range header). As with images, the full response time and TTFB are measured, but also the download time (excluding the initial connection) and the download bandwidth; see the second sketch after this list.
    2. We wait 200ms to start the next run fresh.
    3. Both steps above are repeated 3 times. From those 3 runs, all the measurements are averaged and considered the final result for the given test URL.
  4. After all URLs are tested, the results are sent to Elasticsearch for analysis. Each test URL with its measurements creates a single document.
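To make step 2 concrete, below is a minimal TypeScript sketch of how the 18 test URLs could be assembled. The operator endpoints, the data object IDs and the /api/v1/assets/ path are placeholder assumptions made for illustration; the real benchmark resolves the objects and operator endpoints via the Query Node.

    // Placeholder operator endpoints and data object IDs - in the real benchmark
    // these are resolved from the Query Node for the test video.
    const operatorEndpoints: string[] = [
      "https://distributor-1.example.com",
      // ...the remaining 8 operators
    ];
    const dataObjectIds = { media: "1234", thumbnail: "5678" };

    // 2 objects * 9 operators = 18 test URLs.
    // The "/api/v1/assets/" path is an assumption about the distributor API.
    const testUrls = operatorEndpoints.flatMap((endpoint) =>
      Object.values(dataObjectIds).map((id) => `${endpoint}/api/v1/assets/${id}`)
    );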
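And here is a rough sketch of a single media run (step 3.1), using the Fetch API with a Range request. The function and field names are illustrative rather than the benchmark’s actual code, and TTFB is approximated as the time until the response headers arrive.

    // One media run: Range-limited download with TTFB, total time and
    // bandwidth measurements. Names and units are illustrative.
    async function measureMediaUrl(url: string, maxDownloadSize = 10 * 1024 * 1024) {
      const start = performance.now();
      const response = await fetch(url, {
        headers: { Range: `bytes=0-${maxDownloadSize - 1}` },
      });
      // Approximate TTFB as the moment the response headers are available.
      const ttfb = performance.now() - start;

      // Stream the body to completion to time the actual transfer.
      const reader = response.body!.getReader();
      let receivedBytes = 0;
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        receivedBytes += value?.length ?? 0;
      }
      const totalTime = performance.now() - start; // full response time (ms)
      const downloadTime = totalTime - ttfb; // transfer only, excluding initial connection (ms)
      const downloadSpeed = (receivedBytes * 8) / (downloadTime / 1000); // bits per second

      return { ttfb, totalTime, downloadTime, downloadSpeed, receivedBytes };
    }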

Normalized speeds

One thing that I was trying to avoid was poor network connections affecting the results. If a lot of users with a slow connection (like 15 Mbps) ran the test, that could give us a false impression that our infra offers slow download speeds, when in reality the values would be limited by the users’ connections. To minimize this impact, I’ve introduced normalizedDownloadSpeed = downloadSpeed / referenceDownloadSpeed, essentially the ratio of how much of the user’s connection was utilized. normalizedDownloadSpeed = 1 indicates that the user was able to download from a given node at their full available speed. It’s not an ideal statistic and should be considered alongside the unscaled download speeds.
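As a hypothetical worked example (numbers invented for illustration):

    // A user whose reference (Cloudflare) speed is 15 Mbps downloads from a
    // distributor at 12 Mbps - the node used 80% of their available bandwidth.
    const referenceDownloadSpeed = 15_000_000; // bits per second
    const downloadSpeed = 12_000_000; // bits per second
    const normalizedDownloadSpeed = downloadSpeed / referenceDownloadSpeed; // 0.8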

Test regions

For the purpose of analysis, I’ve split the results into multiple key regions. Most regions are simply continents, but Asia was split into more regions to give us a better understanding of the results across usage clusters (a sketch of this classification follows the list):

  1. South Asia - country is one of (India, Bangladesh, Nepal, Bhutan, Myanmar).
  2. Southeast Asia - country is not one of the above and geo longitude is above 90 (includes Japan).
  3. West Asia - all the remaining Asia results.
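A minimal sketch of how this split could be expressed in code; the country names and field shapes are assumptions made for illustration, and results from other continents map directly to their continent:

    // Illustrative classification of Asian results into the three regions above.
    const SOUTH_ASIA = ["India", "Bangladesh", "Nepal", "Bhutan", "Myanmar"];

    function classifyAsianRegion(country: string, longitude: number): string {
      if (SOUTH_ASIA.includes(country)) return "South Asia";
      if (longitude > 90) return "Southeast Asia"; // includes Japan
      return "West Asia";
    }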