Building for the (Inevitable) Next Cloud Outage - Pavel Nikolov, Section

Fascinating talk about high availability/reliability:

Notes:

  • high availability and high reliability
  • no cloud provider can guarantee 100% uptime
  • not a question of “if” but rather “when” your servers go down
  • disaster recovery
    • active/active (DNS switch)
    • active/passive (spin up + DNS switch)
    • periodic backup (manual recovery)
    • no DR strategy (figure it out when it happens)
  • RTO - Recovery time objective
  • RPO - Recovery point objective (how much data can you afford to lose)
  • what if there is a different approach
    • self-healing
    • cloud native
    • no single point of failure
    • expect that anything could go down at any time – even DNS
  • BGP + Anycast IPs to the rescue
  • IP packets
    • Unicast - one-to-one
    • Multicast - one-to-many
    • Anycast - one-to-nearest
      • many servers around the world with the same IP address, packet finds the nearest one
  • Benefits from BGP (over DNS)
    • DNS has TTL (usually at least 300s)
    • BGP convergence takes seconds
  • downsides
    • you have to own an IP address range
    • your cloud provider has to support BYO IP
    • learning curve
  • BIRD routing daemon to announce the anycast IPs (see the config sketch after these notes)
  • BGP is the backbone of the internet
  • What about data (consistency)?
  • the answer is (almost) always Eventual Consistency
  • most applications can tolerate eventual consistency
  • most microservices do not need a database
  • Event sourcing is a perfect fit
    • ideal for microservices architecture
    • CQRS pattern - command query responsibility segregation
      • producers don’t need to know about consumers
      • consumers don’t know where events came from
    • Requires a durable event store (NATS JetStream or Kafka; see the sketch below)
    • Immutable data
      • replay data since beginning of time
    • Reproducible state
    • Eventual consistency
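
To make the BGP/anycast idea concrete, here is a minimal sketch of what announcing an anycast prefix with BIRD 2.x might look like. The router ID, prefix, neighbor address, and AS numbers are placeholders from the documentation ranges (nothing from the talk); the real values come from your provider’s BYO-IP setup.

    # /etc/bird/bird.conf -- hypothetical anycast announcement (BIRD 2.x)
    router id 192.0.2.10;               # this node's own unicast address

    protocol device {
    }

    # The anycast prefix this node claims. The "reject" route simply puts
    # the prefix into BIRD's table so it can be exported over BGP.
    protocol static anycast_routes {
      ipv4;
      route 203.0.113.0/24 reject;
    }

    protocol bgp upstream {
      local as 64512;                   # our example (private) ASN
      neighbor 192.0.2.1 as 64511;      # the provider's router or route server
      ipv4 {
        import none;                    # announce only, learn nothing
        export where net = 203.0.113.0/24;
      };
    }

The node would also bind an address from 203.0.113.0/24 on a loopback so it actually answers traffic. The payoff is exactly the point in the notes: when a node dies, its BGP session drops and the route is withdrawn in seconds, instead of waiting out a DNS TTL.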

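On the event-sourcing side, here is a rough Go sketch of the command/query split on top of NATS JetStream (using the nats.go client). The stream, subject, and payload are invented for illustration; the point is that the writer only appends immutable events, and a reader rebuilds its state by replaying the stream from the first event.

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Drain()

        js, err := nc.JetStream()
        if err != nil {
            log.Fatal(err)
        }

        // Durable, append-only event store: everything on "orders.*" is persisted.
        if _, err := js.AddStream(&nats.StreamConfig{
            Name:     "ORDERS",
            Subjects: []string{"orders.*"},
            Storage:  nats.FileStorage,
        }); err != nil {
            log.Fatal(err)
        }

        // Command side: the producer appends an immutable event and moves on;
        // it never needs to know who consumes it.
        if _, err := js.Publish("orders.created", []byte(`{"id":42,"total":9.99}`)); err != nil {
            log.Fatal(err)
        }

        // Query side: a consumer replays from the very first event to rebuild
        // its read model, so the state is reproducible at any time.
        sub, err := js.SubscribeSync("orders.*", nats.DeliverAll())
        if err != nil {
            log.Fatal(err)
        }
        for {
            msg, err := sub.NextMsg(2 * time.Second)
            if err != nil {
                break // caught up for now -- eventual consistency in action
            }
            fmt.Printf("replayed %s: %s\n", msg.Subject, msg.Data)
            msg.Ack()
        }
    }

Swapping JetStream for Kafka changes the client calls, not the shape of the pattern: the durable, replayable log is what makes the immutable data, reproducible state, and eventual consistency bullets above hang together.
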
@bminer, check this out.

My favorite part of this talk was the discussion of the “human factor” and the natural tendency for disaster recovery plans to erode over time. Sad but true. Guilty as charged.


Yes, easy to get busy (or lazy) – been there …

This is why it seems so important to build in end-to-end testing, redundancy/backup mechanisms, etc. If these are manual or separate from your main deployment and running processes, it’s easy to lose track of them.

So true. Somewhat related is documentation. Keeping it close to the code makes it much easier to maintain.

I had never heard of BGP until I heard this talk. Why don’t more cloud providers make it obvious how to do this? I also didn’t think anything other than unicast was possible (for average folks) on the Internet.

yeah, fascinating stuff. I assume it would NOT work to reserve an IP at a cloud provider, and then use that as an anycast IP at other places …

That is probably correct because it’s unlikely that competing cloud providers would coordinate routing information between one another.