Thoughts on operating your IT and applications in a cloud-native way
Work in progress – A rather long article in snackable chapters that will be added and extended in the coming weeks.
The content of this article represents my personal opinion that is based on my daily experiences as CTO of Microsoft Switzerland, working with organizations that run or plan to run their IT and applications on Microsoft Azure.
Azure evolves continuously, as does my knowledge. And the more I learn, the more I have the impression that I know very little. In this respect, this article does not pretend to share any dogmatic truth, but simply my current thoughts. I am truly thankful for any feedback that helps me continuously improve this content.
I talk with many organizations, in recent years mostly enterprises. The common denominator: they are looking for (business) agility.
Technical and business model innovation is disrupting industries. Business leaders of established organizations demand that their companies (digitally) transform and gain real-time, end-to-end insight into their companies, customers, and processes, that is, into their (aggregated) data, so that they can rapidly make fact- and data-based decisions that can be implemented quickly.
This series of blog posts, which together make up the Everything is Code article, aims to give you some hints on the necessary transformation steps and on how to approach the transformation process.
In this context, moving to the cloud* becomes the opportunity to achieve this highly demanded agility. Be careful, however: doing the same things in the cloud that you do on-premises will bring you cost savings, scalability, better resiliency, more possibilities to deploy near your customers, and better security, but it won’t bring you agility.
To achieve agility you will need to adopt a cloud-native approach. In the following 4 chapters, I will try to describe what this means, using examples from our own transformation** and knowledge acquired in my daily work helping organizations during their journey to the cloud.
Eventually, these steps will enable you to increase your capability to deploy features and updates to production, leading us to the title of this post and my guiding-line for this article:
because it is my profound conviction that the only true measure of your (IT) agility is the number of deployments you do to production.
To give you an idea, as of today, in an ever-evolving practice (DevOps is a journey!), we do
… and this is the number that I personally consider the real measure of (IT) agility.
As we will see in detail in the upcoming posts composing this article, your deployment capacity depends on a few macro-factors that I like to summarize as:
Before moving to the next chapter, just a glimpse of what we will discuss in part /05/ where everything comes together:
* Even if most of the things that I will mention are rather cloud-agnostic, my expertise is in Microsoft technologies and Microsoft Azure. In this context, and for the sake of transparency, "cloud" here becomes almost a synonym for Microsoft Azure.
** I am referring here to the migration of our internal IT and applications used by Microsoft employees from our data-centers on-premises to Microsoft Azure.
…and clearly, it is not real if it is not on a t-shirt:
Expedition Cloud is a web site that collects great information and testimonials about how our internal IT organization has migrated the majority of the infrastructure and applications used by Microsoft employees to Microsoft Azure. While you can find all the details here:
I would like to rapidly list a few milestones and lessons-learned of that expedition that are relevant to our mission-to-agility, along our guiding-line: #OnlyWhatYouDeployMatters.
Please note that, for the sake of keeping this post short and aligned with the goal of the article, I am simplifying the message considerably and adapting it for story-telling reasons.
The migration approach that was taken was relatively simple: move as much as possible to the public cloud, and move the remainder to a modern private cloud, as part of a hybrid cloud strategy.
Additionally, the idea was to move
commodity software, like email, to SaaS offerings from Microsoft or from third parties.
strategic software, into which we wanted to invest, to PaaS.
all the rest that just needed to be maintained to IaaS.
The migration has been a fantastic occasion to re-visit our IT portfolio:
This has allowed us to eliminate around 30% of our on-premises footprint, accounting for thousands of servers and VMs
We were also able to move around 15% of our functionality to SaaS
We identified less than 5% of functionality that would target the private cloud.
For the remaining 50% of the application portfolio, the target was Azure, with the approach explained before.
For IaaS, we proceeded really fast by lifting & shifting pre-production systems, which allowed us to rapidly gain important experience.
For the other applications, we started an in-depth portfolio analysis, to identify which apps to move first. We soon realized that this approach was not adequate and it was delaying our transformation.
Taking some risks, we drastically simplified the selection criteria. Fundamentally we said: if an application is not business critical, and if it is technically not too interconnected and challenging, we will move it first. Astonishingly, 75% of the applications were identified as first-movers, naturally including all new applications.
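To make the simplified selection rule tangible, here is a minimal sketch in Python. It is only an illustration: the `App` fields, the dependency threshold, and the sample portfolio are my own assumptions, not the actual criteria or data used in the migration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    business_critical: bool
    # Hypothetical proxy for "too interconnected and challenging":
    # the number of other systems this application depends on.
    dependency_count: int
    is_new: bool = False


# Hypothetical threshold; the article does not state a number.
MAX_DEPENDENCIES_FOR_FIRST_MOVER = 3


def is_first_mover(app: App) -> bool:
    """Return True if the app belongs in the first migration wave."""
    if app.is_new:
        return True  # new applications are always first-movers
    return (not app.business_critical
            and app.dependency_count <= MAX_DEPENDENCIES_FOR_FIRST_MOVER)


# Invented sample portfolio, for illustration only.
portfolio = [
    App("expense-reports", business_critical=False, dependency_count=1),
    App("general-ledger", business_critical=True, dependency_count=12),
    App("team-wiki", business_critical=False, dependency_count=0),
    App("new-chatbot", business_critical=False, dependency_count=7, is_new=True),
]

first_movers = [app.name for app in portfolio if is_first_mover(app)]
print(first_movers)  # ['expense-reports', 'team-wiki', 'new-chatbot']
```

The point of the sketch is how little information the rule needs: two or three attributes per application instead of an in-depth portfolio analysis.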
We started thinking differently
While we were successfully migrating applications, we started thinking differently. First of all, we started to think that
IaaS may well be seen like the next data center to close…
(I will let you reflect on this one…)
We realized that we wouldn’t really transform by focusing on moving applications to the cloud. If you do the same in the cloud as you do on-premises, the benefits you get are limited: some cost savings, scalability, better resiliency, more possibilities to deploy near your customers, better security. That’s it.
What we needed to do was
break down the application silos and think in business capabilities, in unique, discrete business services. This fundamental change is what enabled us to adopt a highly efficient DevOps culture, a key aspect of our mission-to-agility.
Let’s look at a concrete example of this business-service-driven architecture: our finance applications.
Finance Applications (An Example)
When we surveyed our legacy set of 36 on-premises, standalone applications used in procurement, we discovered a significant functionality overlap among them.
We broke down the complete feature set of each app to separate core functionality from duplicated features, and we used this information to build 16 discrete services, each of which has a unique set of functionality, presentation layer, and master data.
These vertical, discrete services have well-defined and well-managed APIs, that are used to bring them together into the different end-to-end User Experience layers, within a modern microservices architecture.
And why is this key to our mission-to-agility? The next chapter will answer this, while at the link below you can find the whole story of this transformation
The 16 discrete services mentioned before are “isolated units” that are developed, tested, deployed, and maintained by small, autonomous DevOps teams. While I will go into more detail on how DevOps teams work efficiently in the next blog post, here it is important to understand that
the concept of a microservice, understood as a unit of capability whose parts share the same needs for security and scalability, is the precondition, the necessary architecture, to enable true DevOps.
In this context, devops teams need to define and manage interfaces (APIs) smartly, while they have total freedom beyond these APIs, including the choice of the technologies they use to develop.
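The separation between a managed API and free implementation choices can be sketched as follows. This is a deliberately tiny, in-process illustration in Python; the names `InvoiceService`, `InMemoryInvoiceService`, and `render_invoice_summary` are hypothetical and do not come from the finance services described above, where the APIs would typically be network endpoints.

```python
from typing import Protocol


class InvoiceService(Protocol):
    """The published API: the only thing consumers may depend on."""

    def get_invoice_total(self, invoice_id: str) -> float: ...


class InMemoryInvoiceService:
    """One team's implementation. Storage and technology are its own choice."""

    def __init__(self) -> None:
        # Hypothetical master data owned exclusively by this service.
        self._line_items = {"INV-1": [100.0, 25.5]}

    def get_invoice_total(self, invoice_id: str) -> float:
        return sum(self._line_items.get(invoice_id, []))


def render_invoice_summary(service: InvoiceService, invoice_id: str) -> str:
    """Experience layer: composes services through their APIs only."""
    total = service.get_invoice_total(invoice_id)
    return f"Invoice {invoice_id}: {total:.2f}"


print(render_invoice_summary(InMemoryInvoiceService(), "INV-1"))
# Invoice INV-1: 125.50
```

The owning team could replace the in-memory dictionary with any database or language runtime without the experience layer noticing, as long as the contract stays stable.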
Deep Dives related to this chapter
While microservices bring a lot of advantages, this type of architecture also brings a few challenges. The first two that come to my mind are:
Challenge #1: Communication between Microservices
Challenge #2: Communication between consumers and an application
I will be dealing with these challenges and going deeper on Microservices in upcoming posts, including:
This post is part of the Everything is Code article and it aims to give some hints on how to approach DevOps in the context of my little formula:
DevOps is better development practices
In the previous chapters, I have stressed how DevOps is key to high deployment frequency, which in turn is key to (IT) agility. This concept is well accepted within the communities I talk to. However, organizations sometimes tell me: “Our products do not need to evolve so fast and we don’t need/want this type of deployment frequency”.
Please let me be opinionated on this matter: DevOps is a set of better development practices that not only allow you to deploy more frequently but also enable you to deploy better, more robust software in general.
While I warmly invite you to read the “2018 State of DevOps Report” published by DORA to get insightful data collected from over 30,000 surveys, I would like to briefly mention that the analysis shows how top DevOps performers not only deliver faster, but also deliver more, with fewer failures/bugs, and are able to recover faster.
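If you want to see where your own teams stand on these dimensions (faster, more, fewer failures, faster recovery), you can compute similar indicators from your deployment history. The sketch below is a minimal Python illustration; the `Deployment` record, its field names, and the sample data are assumptions of mine and not the report's methodology.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored


def delivery_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Summarize a non-empty deployment history over period_days days."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at
                     for d in failures if d.restored_at is not None]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Invented sample history: three deployments over three days.
base = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(base, base + timedelta(hours=2)),
    Deployment(base + timedelta(days=1), base + timedelta(days=1, hours=4)),
    Deployment(base + timedelta(days=2), base + timedelta(days=2, hours=6),
               failed=True,
               restored_at=base + timedelta(days=2, hours=6, minutes=30)),
]

metrics = delivery_metrics(sample, period_days=3)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Tracking these numbers over time tells you more than any single snapshot: the trend shows whether your practices are actually improving.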
So, what’s DevOps?
The definition that I like the most is
The union of people, process, and tools to deliver continuous value to customers
This definition fits well with the “Culture” multiplier in my formula and puts continuous value at the center. It requires bringing application engineering (development), operations engineering, quality engineering, and security engineering together, breaking down practices that were once siloed.
Improved coordination and collaboration across these disciplines reduces the time between when a change is committed to a system and when the change is being placed into production. And, it ensures that standards for security and reliability are met as part of the process.
“Pills of DevOps”
Many of the things that you will find below in compact form are described in detail at DevOps at Microsoft. Because DevOps relies on an agile mindset, if you are not yet familiar with agile, you may want to read What is agile?.
As a reminder, there is no dogmatic truth in what is listed here. I am simply sharing some practices that have delivered good results for the teams that have been using them, at Microsoft and among the organizations I have been working with.
I have grouped my thoughts around 1) DevOps Teams, 2) Planning and Release Management, 3) Quality Assurance, and 4) Live Site Management.
1) DevOps Teams
“Food for thought”:
teams fully own one or more microservices and consist of 10-12 engineers and 1 program manager (or service manager or product owner)
teams are vertical, covering the front-end, middle, and data/back-end layers
teams own development, testing, and deployment of features, and are also responsible for the service to run smoothly in production
Comment: It may sound a bit harsh, but if what you put into production can keep you in the office at night, it is in your own interest to deploy high quality code…
teams are (ideally) physically in the same room so that communication runs continuously without the need for meetings (except the daily scrums)
teams are self-managing and stay intact for 12-18 months
teams are self-forming. After these 12-18 months, Program Managers present the high-level roadmap of their services, and Engineers can state, in order of priority, which services they would like to work on for the next 12-18 months
Comment: This increases job happiness, reinforces the DevOps culture, and enables cross-pollination.
2) Planning and Release Management
One of the questions that I am asked the most is how we do release management at Microsoft. I normally like to use the Azure DevOps team as an example because they apply the “highest form of dogfooding” by using Azure DevOps to build Azure DevOps. This also allows me to refer to the following articles, which describe their approach well:
Their approach to planning covers alignment and autonomy:
Alignment represents the big picture in light of the business goals. It includes the product strategy over a period of 12-18 months (product roadmap) and a (high level) feature planning, over a period of 6 months.
Autonomy covers the details about what will be delivered to achieve the business goals. It includes Stories and Tasks.
“Food for thought”:
Release sprints are 3 weeks long
Comment: This length comes from trial and error. 2 weeks has proven to be too intensive, 4 weeks too long.
Forward planning covers the next 3 sprints
Within the sprints, some teams plan only with tasks that take no longer than 4 hours.
Comment: This is very powerful because if a task is planned for today and it is not completed, it will be discussed in the daily scrum of tomorrow. This enables extremely fast handling of issues.
Trunk-based development is great and it virtually eliminates the “merge debt”
700+ engineers work with 1 Master that is always healthy and shippable
Feature Flags are used to deploy fast and to make features available to selected (test) groups. They also protect from rollbacks, which are complicated when you are deploying in several regions or rings.
Following this approach, feedback flows in fast from pull requests, daily scrums, testing, and users, considering that code goes into production every 3 weeks. This enables continuous adaptation and adjustment of deliverables to maximize the customer/user experience.
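To illustrate how a feature flag decouples deploying code from exposing a feature, here is a minimal in-memory sketch in Python. The `FeatureFlag` class, the ring numbers, and the user names are hypothetical; a real setup would use a feature management service or a configuration store rather than a module-level object.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    name: str
    enabled_rings: set[int] = field(default_factory=set)   # rollout rings
    enabled_users: set[str] = field(default_factory=set)   # selected (test) users

    def is_enabled(self, user: str, ring: int) -> bool:
        return ring in self.enabled_rings or user in self.enabled_users


# The new code is deployed everywhere, but only visible in ring 0
# and to one pilot user (both values are invented for this example).
new_dashboard = FeatureFlag(
    "new-dashboard",
    enabled_rings={0},
    enabled_users={"pilot@example.com"},
)


def render_dashboard(user: str, ring: int) -> str:
    if new_dashboard.is_enabled(user, ring):
        return "new dashboard"
    return "classic dashboard"


print(render_dashboard("alice@example.com", ring=0))  # new dashboard
print(render_dashboard("bob@example.com", ring=2))    # classic dashboard
print(render_dashboard("pilot@example.com", ring=2))  # new dashboard

# If the feature misbehaves, switch it off instead of rolling back a deployment.
new_dashboard.enabled_rings.clear()
new_dashboard.enabled_users.clear()
print(render_dashboard("alice@example.com", ring=0))  # classic dashboard
```

Widening the rollout is then just a matter of adding the next ring to `enabled_rings`, with no new deployment involved.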
3) Quality Assurance
By increasing your deployment frequency
you will be dealing with a smaller amount of code to test
you will have a smaller number of bugs to deal with
you will exercise your testing skills more often and become more proficient at testing.
Test teams no longer exist in this setup
DevOps teams are responsible for the quality of their code
Pull Requests (PR) ensure that only “good code” merges to master. To complete a PR, DevOps teams at Microsoft need:
If there are more than 5 bugs per engineer in one team, work on new features is stopped to focus entirely on bug fixing.
Shift Left: The pull request flow gives a common point to enforce testing, code review, and error detection early in the pipeline. This helps shorten the feedback cycle for developers. Errors are usually detected very fast. This also gives confidence when refactoring, since all changes are tested all the time.
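The rule of 5 bugs per engineer mentioned above is simple enough to automate, for example as a check on a team dashboard. A minimal sketch in Python, where the function names and the sample numbers are my own:

```python
BUG_CAP_PER_ENGINEER = 5  # the limit stated in the rule above


def bug_cap(engineer_count: int) -> int:
    """Maximum number of open bugs a team may carry."""
    return engineer_count * BUG_CAP_PER_ENGINEER


def must_pause_feature_work(open_bugs: int, engineer_count: int) -> bool:
    """True when the team has more than 5 open bugs per engineer."""
    return open_bugs > bug_cap(engineer_count)


# A team of 10 engineers may carry up to 50 open bugs.
print(bug_cap(10))                      # 50
print(must_pause_feature_work(50, 10))  # False
print(must_pause_feature_work(51, 10))  # True
```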
4) Live Site Management
As mentioned above, DevOps teams are also responsible for live-site issues and interruptions, that is, for keeping their code running smoothly in production.
To provide focus and assist with an interrupt culture, each DevOps team self-organizes into 2 distinct sub-teams:
F-Team (Feature) works on committed features (new work)
L-Team (Live-Site) deals with all live-site issues and interruptions.
Typically, 2 engineers take turns forming the L-Team, which is 100% focused on keeping the services running smoothly.
When no issue is on the radar, they work on dashboards, monitoring improvements, and similar tasks, but they do not develop new features. That is the job of the F-Team only.
One important piece of advice…
You don’t have to be perfect from the beginning. A good approach to maturing your DevOps practices is to start from where it hurts most. When this pain is cured, move to the next one that hurts the most. I have seen great progress with this simple approach, while I have seen great struggles in teams that were aiming for perfection from the beginning.
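One simple way to organize the turns is a round-robin rotation. The Python sketch below assumes that the L-Team changes every sprint, which is my assumption for illustration (teams are free to choose another rhythm), and the engineer names are invented.

```python
def l_team_rotation(engineers: list[str], sprints: int,
                    l_team_size: int = 2) -> list[tuple[list[str], list[str]]]:
    """Return one (l_team, f_team) pair per sprint, rotating round-robin."""
    schedule = []
    n = len(engineers)
    for sprint in range(sprints):
        start = (sprint * l_team_size) % n
        l_team = [engineers[(start + i) % n] for i in range(l_team_size)]
        f_team = [e for e in engineers if e not in l_team]
        schedule.append((l_team, f_team))
    return schedule


team = ["Ana", "Ben", "Cho", "Dev", "Eli"]
for number, (l_team, f_team) in enumerate(l_team_rotation(team, sprints=3), start=1):
    print(f"Sprint {number}: L-Team={l_team} F-Team={f_team}")
# Sprint 1: L-Team=['Ana', 'Ben'] F-Team=['Cho', 'Dev', 'Eli']
# Sprint 2: L-Team=['Cho', 'Dev'] F-Team=['Eli', 'Ana'] is NOT the order used:
# the F-Team keeps the original team order, so it is ['Ana', 'Ben', 'Eli'].
# Sprint 3: L-Team=['Eli', 'Ana'] F-Team=['Ben', 'Cho', 'Dev']
```

Everybody gets live-site duty at regular intervals, which spreads operational knowledge across the whole team.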