Liatrio - Your AI Enablement Partner

Moby once sang to us, “We are all made of stars”, and now infrastructure is all made of code. It’s a far cry from the days when that song was overplayed, and those times seemed much easier: VMware, blades, and a SAN, and you were good to go. Resiliency was the job of componentry and distance. Now, Everything As Code (EaC) makes seemingly limitless expansion accessible with just the push of a button. It can provide the responsive capacity and throughput that thirsty businesses need to meet ever-growing customer demands. Paired with this is the shift away from our beloved old “bricks and mortar” infrastructure… You can no longer blame physical technology, suppliers, or data center capacity for expansion problems. The new world order is limitless.

We are seeing unprecedented demands for expansion and capacity. Sure, it’s a great business problem to have, but one that makes for a hard life if we don’t set our north star up correctly as we scale. Problems are exacerbated at scale, and they open up other issues that can easily derail what would otherwise be a productive day.

The question now stands: are we expanding in a way that will provide the resilience and safety needed to deliver quality technology outcomes? Or are we just “bursting” into the cloud with the same energy your dog has at the park, running full speed with no idea where the ball went?

Why Resiliency Matters

Regional or global expansion is deeply tied to technology elasticity. Can you flex infrastructure into a new region, securely? Where do your cloud providers operate, and are you best aligned to use these areas effectively? What would you do if you had to consider Data Sovereignty as a principle? Hampering business expansion because your technology can’t swiftly service a region or area likely doesn’t bode well in the boardroom. Often these types of expansions are time sensitive based on market opportunity or a specific business strategy, so delivering this capability on time is really non-negotiable. Resilient and repeatable cloud patterns that work across regions and zones are crucial for rapid deployment. We shouldn’t have to start from nothing when responding to this type of expansion.

Downtime costs in more ways than you think. There is not only the direct hit on service and revenue associated with downtime, but also the human element of triage and repair, coupled with missed productivity in other areas. This is a giant snowball that barrels through opportunity. Countless go-lives, change approvals, and workstreams get benched or parked because P1 incidents drag key people away to fix issues, often caused by poorly architected practices. Funding is constantly rerouted away from innovation to fix dire issues that shouldn’t exist in the first place. What are you truly missing out on when incidents hit?

Innovation needs a strong foundation. There is a race to the bottom, with business leaders demanding new use cases fueled by AI and data. But provisioning the infrastructure required for your latest MCP pilot, or quickly supplying a dataset for the latest LLM to train on, is not going to work well unless it can be delivered in a timely, repeatable way… or dare I even say in an ephemeral way. The weakest, most vulnerable areas are often the ones that were rushed into service. I’ve seen countless attacks that trace back to an API accidentally left exposed in a hurriedly stood-up test environment, or to a network port left open when it shouldn’t have been, all because of pressure to deliver infrastructure quickly. Resiliency lives with consistency. No matter what the deadline looks like, the way you supply infrastructure and technology should be consistent, repeatable, and secure.
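One way to keep rushed environments consistent is to check every environment definition against the same rule before it is stood up. Here is a minimal Python sketch of that idea. The `IngressRule` shape, the `ALLOWED_PUBLIC_PORTS` policy, and the sample rules are all hypothetical, made up for illustration; a real setup would read these from your own infrastructure code and policy tooling.

```python
from dataclasses import dataclass

# Hypothetical policy: the only ports allowed to be open to the public internet.
ALLOWED_PUBLIC_PORTS = {443}

# CIDR ranges that mean "anyone on the internet".
PUBLIC_CIDRS = ("0.0.0.0/0", "::/0")


@dataclass(frozen=True)
class IngressRule:
    """A simplified, made-up firewall rule for illustration."""
    port: int
    source_cidr: str


def find_violations(environment: str, rules: list[IngressRule]) -> list[str]:
    """Return one message per rule that exposes a non-approved port publicly."""
    violations = []
    for rule in rules:
        is_public = rule.source_cidr in PUBLIC_CIDRS
        if is_public and rule.port not in ALLOWED_PUBLIC_PORTS:
            violations.append(
                f"{environment}: port {rule.port} is open to {rule.source_cidr}"
            )
    return violations


# Illustrative rules for a hurriedly created test environment.
test_env_rules = [
    IngressRule(port=443, source_cidr="0.0.0.0/0"),    # approved public port
    IngressRule(port=22, source_cidr="0.0.0.0/0"),     # SSH open to the world
    IngressRule(port=5432, source_cidr="10.0.0.0/16"),  # internal only
]

violations = find_violations("test", test_env_rules)
for message in violations:
    print(message)
```

Running the same check in the pipeline for test and production alike means the deadline never decides which controls get applied.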

What to Consider

It’s easy to get lost in the many details, but here are some architectural areas that remind us of what makes up resiliency at scale:

Architecting For Scale - How are you using (or not using) microservices, containers, orchestration, and automation? Are your systems loosely coupled or decoupled?
Pattern Elasticity - Where do your load balancing and autoscaling patterns fit into this? How do they work across regions and zones?
Everything As Code - Infrastructure is now code, so you can quickly stand up cloud infrastructure through scripts and code. But are your pipelines codified? If not, what is stopping you? Are you still running too many things manually?
Data Layer - Is this architecture using the right database type? Aside from traditional relational databases, cloud providers now offer options such as NoSQL, time series, or in-memory databases that are more appropriate for some applications. Application architecture can often be the weak link in a stack, causing faults at scale.
Security At Scale - Using repeatable cloud patterns that are frequently maintained ensures security controls are adhered to, reducing the risk of known vulnerabilities being exposed. Keeping secure whilst scaling is a crucial part of a resilient technology stack.
People and Process - We often forget the human element of resilience as we scurry to expand. Do the teams and people running your technology have the support to do this easily and well? This may mean training, access to operational data, reporting, or tools. It also means having the right processes, ones that reflect modern technology practices. What does your path to production look like? Hot tip: if it’s not streamlined and responsive to the pace of scale, the best people and technology won’t help.
Testing the Outcome - Having a test process that keeps pace with your technology initiatives is vital. An annual DR test probably won’t cut it anymore. Testing practices have evolved significantly: testing failover and auto-scaling patterns is the bare minimum, and the range extends through logging and alerting checks to fault injection testing and chaos engineering. Testing practices should be carefully chosen to match resiliency expectations.
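To make the testing point concrete, here is a small Python sketch of a fault injection test for a failover pattern. It is an illustration under stated assumptions, not a real cloud client: the region names, the `call_with_failover` helper, and the `RegionUnavailable` exception are all hypothetical. The idea is simply to force the primary region to fail on purpose and confirm that traffic lands on the secondary.

```python
from typing import Callable


class RegionUnavailable(Exception):
    """Raised by a region handler to signal (or simulate) an outage."""


def call_with_failover(
    handlers: list[tuple[str, Callable[[], str]]]
) -> tuple[str, str]:
    """Try each (region, handler) pair in order.

    Returns (region, response) from the first handler that succeeds and
    raises RuntimeError if every region is unavailable.
    """
    errors = []
    for region, handler in handlers:
        try:
            return region, handler()
        except RegionUnavailable as exc:
            errors.append(f"{region}: {exc}")
    raise RuntimeError("all regions failed: " + "; ".join(errors))


def healthy() -> str:
    """Stand-in for a region that responds normally."""
    return "ok"


def injected_outage() -> str:
    """Fault injection: deliberately make this region fail."""
    raise RegionUnavailable("injected fault")


# Inject a fault into the primary region and observe where the call lands.
served_by, response = call_with_failover(
    [("region-a", injected_outage), ("region-b", healthy)]
)
print(f"served by {served_by}: {response}")
```

A check like this can run on every change rather than once a year, which is the difference between knowing failover works and hoping it does.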

The Takeaways

In reality, achieving perfection in all these concepts is not always easy. Growth over a long period of time often means we don’t scale in the best possible way, and these are known risks assumed in the course of getting on with the job. But what is really important is understanding and measuring where you are, and prioritizing improvements that help lift the bar.

Some ideas that can help:

Service, Incident, and RCA reporting should be regularly reviewed to identify key themes and patterns that can be addressed. Are there processes, ceremonies, or methods to do this? Who is involved?
Review architectural processes to architect the right way, the first time.
Drive a culture of blameless post-mortems when things go wrong. Focus on surfacing the right learnings and improvements rather than the finger-pointing. How do we build up rather than tear down?
Identify where investments should be made, and how they should be implemented. This may take the form of a workshop, failure exercise, or Value Stream Mapping (VSM). Taking time with your team to work “on” rather than “in” does pay off.
Stay ahead of the tech. With so many innovations and changes happening right now, you can’t assume the same solutions will serve you effectively for the next five years. Understand what you need to do in your industry to stay alert and ahead, to architect the right things.
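The first idea above, reviewing incident and RCA reports for recurring themes, can start very simply. The Python sketch below tallies root causes across a list of incident records. The records, field names, and categories are invented for illustration; your own reporting tool will have its own export format.

```python
from collections import Counter

# Made-up incident records for illustration only.
incidents = [
    {"id": "INC-1", "root_cause": "manual change"},
    {"id": "INC-2", "root_cause": "missing autoscaling"},
    {"id": "INC-3", "root_cause": "manual change"},
    {"id": "INC-4", "root_cause": "expired certificate"},
    {"id": "INC-5", "root_cause": "manual change"},
    {"id": "INC-6", "root_cause": "missing autoscaling"},
]


def top_themes(records: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent root causes with their counts."""
    counts = Counter(record["root_cause"] for record in records)
    return counts.most_common(n)


for theme, count in top_themes(incidents, n=2):
    print(f"{theme}: {count} incidents")
```

Even a rough tally like this gives the review something specific to prioritize, instead of relying on whichever incident people remember best.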

Keeping resilience alive as a conversation is half the battle, so asking questions in your technology teams about how and why things are built the way they are will do a lot more than you think. Resilience isn’t a static goal; it’s a rhythm in technology that needs to be beating along to the business dance.

Infrastructure Expansion and Resilience: Setting the Right North Star
