Building for the (Inevitable) Next Cloud Outage - Pavel Nikolov, Section

Fascinating talk about high availability/reliability:

Notes:

  • high availability and high reliability
  • no cloud provider can guarantee 100% uptime
  • not a question of “if” but rather “when” your servers go down
  • disaster recovery
    • active/active (DNS switch)
    • active/passive (spin up + DNS switch)
    • periodic backup (manual recovery)
    • no DR strategy (figure it out when it happens)
  • RTO - Recovery time objective
  • RPO - Recovery point objective (how much data can you afford to lose)
  • what if there is a different approach
    • self-healing
    • cloud native
    • no single point of failure
    • expect that anything could go down at any time – even DNS
  • BGP + Anycast IPs to the rescue
  • IP packets
    • Unicast - one-to-one
    • Multicast - one-to-many
    • Anycast - one-to-nearest
      • many servers around the world with the same IP address, packet finds the nearest one
  • Benefits from BGP (over DNS)
    • DNS has TTL (usually at least 300s)
    • BGP convergence takes seconds
  • downsides
    • you have to own an IP address range
    • your cloud provider has to support BYO IP
    • learning curve
  • BIRD routing daemon to announce the anycast IPs (see the config sketch after these notes)
  • BGP is the backbone of the internet
  • What about data (consistency)?
  • the answer is (almost) always Eventual Consistency
  • most applications can tolerate eventual consistency
  • most microservices do not need a database
  • Event sourcing is a perfect fit
    • ideal for microservices architecture
    • CQRS pattern - command query responsibility segregation
      • producers don’t need to know about consumers
      • consumers don’t know where events came from
    • Requires a durable event store (NATS JetStream or Kafka; see the sketch below)
    • Immutable data
      • replay data since beginning of time
    • Reproducible state
    • Eventual consistency
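
To make the BGP/anycast idea concrete, here is a minimal sketch of what announcing an anycast prefix with BIRD 2.x might look like. The router ID, prefix, neighbor address, and AS numbers are placeholders from the documentation ranges (nothing from the talk); the real values come from your provider’s BYO-IP setup.

    # /etc/bird/bird.conf -- hypothetical anycast announcement (BIRD 2.x)
    router id 192.0.2.10;               # this node's own unicast address

    protocol device {
    }

    # The anycast prefix this node claims. The "reject" route simply puts
    # the prefix into BIRD's table so it can be exported over BGP.
    protocol static anycast_routes {
      ipv4;
      route 203.0.113.0/24 reject;
    }

    protocol bgp upstream {
      local as 64512;                   # our example (private) ASN
      neighbor 192.0.2.1 as 64511;      # the provider's router or route server
      ipv4 {
        import none;                    # announce only, learn nothing
        export where net = 203.0.113.0/24;
      };
    }

The node would also bind an address from 203.0.113.0/24 on a loopback so it actually answers traffic. The payoff is exactly the point in the notes: when a node dies, its BGP session drops and the route is withdrawn in seconds, instead of waiting out a DNS TTL.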

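On the event-sourcing side, here is a rough Go sketch of the command/query split on top of NATS JetStream (using the nats.go client). The stream, subject, and payload are invented for illustration; the point is that the writer only appends immutable events, and a reader rebuilds its state by replaying the stream from the first event.

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Drain()

        js, err := nc.JetStream()
        if err != nil {
            log.Fatal(err)
        }

        // Durable, append-only event store: everything on "orders.*" is persisted.
        if _, err := js.AddStream(&nats.StreamConfig{
            Name:     "ORDERS",
            Subjects: []string{"orders.*"},
            Storage:  nats.FileStorage,
        }); err != nil {
            log.Fatal(err)
        }

        // Command side: the producer appends an immutable event and moves on;
        // it never needs to know who consumes it.
        if _, err := js.Publish("orders.created", []byte(`{"id":42,"total":9.99}`)); err != nil {
            log.Fatal(err)
        }

        // Query side: a consumer replays from the very first event to rebuild
        // its read model, so the state is reproducible at any time.
        sub, err := js.SubscribeSync("orders.*", nats.DeliverAll())
        if err != nil {
            log.Fatal(err)
        }
        for {
            msg, err := sub.NextMsg(2 * time.Second)
            if err != nil {
                break // caught up for now -- eventual consistency in action
            }
            fmt.Printf("replayed %s: %s\n", msg.Subject, msg.Data)
            msg.Ack()
        }
    }

Swapping JetStream for Kafka changes the client calls, not the shape of the pattern: the durable, replayable log is what makes the immutable data, reproducible state, and eventual consistency bullets above hang together.
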
@bminer, check this out.

My favorite part of this talk was the discussion of the “human factor” and the natural tendency for disaster recovery plans to erode over time. Sad but true. Guilty as charged.


Yes, easy to get busy (or lazy) – been there …

This is why it seems so important to build in end-to-end testing, redundancy/backup mechanisms, etc. If these are manual or separate from your main deployment and running processes, it’s easy to lose track of them.

So true. Somewhat related is documentation. Keeping it close to the code makes it much easier to maintain.

I had never heard of BGP until I heard this talk. Why don’t more cloud providers make it obvious how to do this? I also didn’t think anything other than unicast was possible (for average folks) on the Internet.

yeah, fascinating stuff. I assume it would NOT work to reserve an IP at a cloud provider, and then use that as an anycast IP at other places …

That is probably correct because it’s unlikely that competing cloud providers would coordinate routing information between one another.