We’re now operating a Neumob mobile app acceleration network with 164+ Points of Presence across the globe. You can see our global network here.
Real experience in building this network out creates a wealth of information about how different cloud providers actually function around the globe, and we thought we’d share a small piece of that (and we plan to share even more in the near future).
At Neumob, we focus on accelerating mobile applications around the globe while reducing those apps’ network errors. Our PoPs relay our customers’ mobile app traffic from the networks very close to our mobile users to the networks used by the origin mobile app servers as reliably as possible.
CDNs provide a similar service for web content. Traditionally, CDNs would spend hundreds of millions of dollars in capital and operating expenses to buy hardware, build infrastructure and supporting systems, then deploy them around the world in data centers after signing long-term contracts for hosting and bandwidth. They’d hire dozens of employees to manage this distributed network. They’d have staff dedicated to server hardware, automation, network operations, virtualization, and orchestration; and still others devoted to BGP/IGP/LAN engineering to ensure the best transit and peering to & from their own PoPs.
Neumob has taken a different approach to this problem. With the surge in public cloud vendors (such as AWS, Azure, Google Cloud, DigitalOcean, IBM SoftLayer, etc.), we’ve become quite a prolific buyer of cloud VMs around the world. We currently buy from over 55 unique cloud vendors (and that number is growing). We’ve developed layers of software that behave like a control plane on top of these cloud vendors, abstracting away much of the complexity and implementation differences, in order to create our 164+ PoP network.
As those who’ve worked in this industry know, this is not easy.
The Challenges of Onboarding Multiple Cloud VMs
Every cloud provider is different in its own special way, and our software stack is designed not to depend on specific cloud provider features, hypervisors, hardware configurations, or Linux distributions and configurations. This affords us an extreme degree of flexibility to rapidly onboard new providers and strategically add PoPs as we become aware of performance gaps in different parts of the world.
While this allows Neumob to be incredibly capital-efficient, it also creates an abundance of new technical challenges that we needed to solve. Cloud VMs can be less reliable and less stable than traditional hardware PoPs, with no guarantee that the network or hardware characteristics of one VM are the same as any of the others in the same PoP. We’ve seen this many times.
Therefore, our alerting, monitoring, and routing systems have to be real-time. Unlike traditional CDNs, which create static maps and routes through their networks, the Neumob system is learning, changing and adapting with every single network transaction in real time.
When public cloud providers are used, a customer generally doesn’t have control over the upstream routing, nor the transit providers and peers used. This challenge plays a key role in deciding which providers we use and where we use them. It’s why we work with so many unique providers.
A legacy CDN would have a dedicated network engineering team whose only job is to make sure that transit/peering connections are working well with the outside world. This creates many BGP challenges and requires continual monitoring and maintenance, yet it is a solvable problem by the CDN. At Neumob, since we rely on the network connections that Amazon AWS or Microsoft Azure or other providers use, which can and do change without notice, we need to make decisions based on what we have at that moment in time, then adjust our provider use accordingly.
Here’s some data to demonstrate this challenge and how we’ve solved it. To isolate the discussion, we will share 1 network path: Singapore to London. In our example, we’ve got a London-based mobile shopping app that is serving users in Singapore.
Neumob has 5 different cloud providers in Singapore (and 6 in London):
- Microsoft Azure
- Google Cloud
Why 5? Obviously redundancy is a part of this, yet more importantly, each has different network connectivity to the rest of the world, and to mobile operators & home ISPs in Singapore and Southeast Asia. So when we focus on network latencies between Singapore and London, which cloud vendor is best? If you guessed AWS or Google, you’d be wrong. Based on our data and our PoP choices, we’ll share the results.
First, a few notes about our data:
- This data was sampled every 30 minutes over the span of a few days in July 2017, with all measurements in each round taken during the same minute
- RTT was measured by sending UDP probes from our servers in London to our servers in Singapore
- We use UDP probes rather than ICMP Echo due to ICMP filtering, rate-limiting and general limitations
- Packet loss was measured by dividing the number of unreceived probes by the total number of probes sent over the entire time interval
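To make the measurement notes above concrete, here is a minimal sketch of how per-path probe results could be reduced to the figures reported in this post (median, standard deviation, min and max RTT, and packet loss). The function name `summarize_probes`, the use of `None` to mark an unreceived probe, and the choice of sample standard deviation are all assumptions for illustration, not a description of our production pipeline.

```python
import statistics
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize probe results for one path.

    Each entry is a round-trip time in milliseconds, or None when the
    probe was never received (assumed convention for this sketch).
    """
    sent = len(rtts_ms)
    if sent == 0:
        raise ValueError("no probes were sent")

    received = [r for r in rtts_ms if r is not None]
    lost = sent - len(received)

    summary = {
        "sent": sent,
        "lost": lost,
        # Unreceived probes divided by total probes sent.
        "loss_ratio": lost / sent,
    }
    if received:
        summary["median_ms"] = statistics.median(received)
        summary["min_ms"] = min(received)
        summary["max_ms"] = max(received)
        # Sample standard deviation needs at least two data points.
        summary["stdev_ms"] = (
            statistics.stdev(received) if len(received) > 1 else 0.0
        )
    return summary


if __name__ == "__main__":
    # Hypothetical sample values, not measurements from our network.
    print(summarize_probes([100.0, None, 200.0, 300.0]))
```

Keeping lost probes in the same sequence as the received ones means the loss ratio and the RTT statistics are always computed over the same measurement window.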
Fastest, Singapore to London:
|Singapore to London|Median RTT (ms)|
|Amazon Web Services (AWS)|223.43|
Yet that’s not the full story. Consistency, stability and reliability are also important. So which is the most consistent Singapore provider for Neumob during this period?
Most consistent, Singapore to London:
|Singapore to London|STDEV|Min RTT (ms)|Max RTT (ms)|
|Amazon Web Services (AWS)|66.13|164.65|418.52|
And finally, which provider experienced the least packet loss during this period? (Less is better, of course.)
|Least instances of packet loss|SoftLayer|195|
|Most instances of packet loss|AWS|657|
As you can see above, there’s no clear winner, but we can make some safe conclusions.
For example, Azure is clearly superior to AWS for this route during the period of these measurements, yet bear in mind that this only represents a snapshot in time; both RTT and packet loss tend to vary considerably as time goes on. VULTR is the fastest on average, but is less consistent than Google Cloud, Azure, or SoftLayer, and has a higher incidence of packet loss during this period. SoftLayer has the least packet loss, but is the slowest on average for this route. And this doesn’t even consider other factors such as cost, hardware reliability or ease of administration.
This is exactly why making routing decisions is not always easy, and why it doesn’t make sense to do this manually. Hence the need for a real-time machine learning system to constantly make adjustments.
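To illustrate the trade-off, here is a deliberately simple sketch that folds median RTT, RTT variability, and packet loss into one score per provider and picks the lowest. This is a static weighted sum, not the machine learning system described above; the `PathStats` type, the `score` and `pick_provider` functions, the weights, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PathStats:
    """Recent measurements for one provider on one path (hypothetical type)."""
    provider: str
    median_rtt_ms: float
    stdev_rtt_ms: float
    loss_ratio: float  # 0.0 to 1.0


def score(stats: PathStats,
          w_latency: float = 1.0,
          w_jitter: float = 0.5,
          w_loss: float = 1000.0) -> float:
    """Lower is better. The weights are illustrative assumptions only."""
    return (w_latency * stats.median_rtt_ms
            + w_jitter * stats.stdev_rtt_ms
            + w_loss * stats.loss_ratio)


def pick_provider(candidates: Iterable[PathStats]) -> PathStats:
    """Return the candidate with the lowest combined score."""
    candidates = list(candidates)
    if not candidates:
        raise ValueError("no candidate providers")
    return min(candidates, key=score)


if __name__ == "__main__":
    # Placeholder providers and made-up numbers, not our measured data.
    options = [
        PathStats("provider-a", median_rtt_ms=180.0, stdev_rtt_ms=40.0, loss_ratio=0.02),
        PathStats("provider-b", median_rtt_ms=200.0, stdev_rtt_ms=10.0, loss_ratio=0.001),
    ]
    print(pick_provider(options).provider)
```

Even in this toy version, the provider with the lower median RTT loses once its variability and loss are counted, and any fixed set of weights would need re-tuning as conditions change. That is the part a system that adapts in real time is meant to handle.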
Keep in mind that the above data only considers the path between Singapore and London, and only for a short period of time. We currently have 164 PoPs, and we need to make decisions for each path between the PoPs; each path between each PoP and our customer origins (a customer’s traditional CDN may even be our origin); as well as each path from our PoPs to the larger mobile (and WiFi!) Internet IP space. So this wasn’t merely a fun problem for Neumob to solve; it continues to be vital to our overall mobile app acceleration service offering.
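For a rough sense of scale, the short sketch below counts only the PoP-to-PoP paths, assuming any PoP may relay to any other (a full mesh, which is an assumption made for this illustration). The helper name `pop_to_pop_paths` is hypothetical, and paths to customer origins and to end-user networks would come on top of this figure.

```python
from math import comb


def pop_to_pop_paths(pop_count: int, directed: bool = False) -> int:
    """Count distinct PoP pairs, or ordered pairs when directed is True."""
    pairs = comb(pop_count, 2)
    return 2 * pairs if directed else pairs


if __name__ == "__main__":
    print(pop_to_pop_paths(164))                 # 13366 unordered pairs
    print(pop_to_pop_paths(164, directed=True))  # 26732 when each direction counts
```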
In future blogs, we’ll cover other interesting routes, and make more comparisons about our many cloud providers from other perspectives. At the end of the day, there are things we love about each of our many providers, or we wouldn’t be working with them and constantly tweaking the recipe in response to the demands of our customers.
(By the way, if you want to work on crazy-hard, ridiculously complicated, challenging and fun problems like these, come join our DevOps and Engineering teams. Learn more here.)