Cloud & DevOps

1. Foundations

1.1 DevOps Principles

Introduction

DevOps is a cultural and technical movement that bridges software development (Dev) and IT operations (Ops) to deliver value faster, safer, and more reliably. It emerged from the agile world, addressing the friction between teams that build software and those that run it in production.

Why It Matters

High‑performing IT organisations have been reported to deploy 30× more frequently, have 60× fewer failures, and recover 168× faster than low performers (2015 State of DevOps Report, from the research programme that later became DORA). DevOps is the backbone of modern software delivery, enabling cloud‑native architectures, microservices, and continuous everything.

Fundamental Concepts

The core of DevOps includes Culture (collaboration, shared ownership), Automation (CI/CD, infrastructure as code), Measurement (metrics, telemetry), and Sharing (feedback, blameless postmortems). Agile, Scrum, Kanban, Lean, and the SDLC provide the process foundation.

Intermediate Topics

Deployment strategies: blue‑green, canary, rolling updates
Continuous Delivery vs Continuous Deployment
Feedback loops: chatops, feature flags, observability-driven development
Internal toolchains and platform engineering concepts

Advanced Topics

DevOps metrics (DORA, SPACE, DevEx) and their statistical correlation
Complex delivery pipelines with progressive delivery and automated rollbacks
Policy-as-code and compliance automation
Value stream management and flow metrics
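
To make the delivery metrics listed above more concrete, here is a minimal sketch that computes deployment frequency, change failure rate, and mean time to restore from a list of deployment records. The Deployment record and its field names are hypothetical and chosen for illustration; they are not taken from any DORA tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape, for illustration only.
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy was remediated


def deployment_frequency(deploys: list, days: int) -> float:
    """Average number of deployments per day over a window of `days` days."""
    return len(deploys) / days


def change_failure_rate(deploys: list) -> float:
    """Fraction of deployments that led to a failure in production."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)


def mean_time_to_restore(deploys: list) -> Optional[timedelta]:
    """Mean time between a failed deployment and its restoration."""
    durations = [
        d.restored_at - d.deployed_at
        for d in deploys
        if d.failed and d.restored_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

In practice the records would come from the CI/CD system and the incident tracker; the arithmetic stays this simple, and most of the effort goes into agreeing on what counts as a deployment and a failure.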

Enterprise Practices

Federated DevOps at scale with platform teams
Internal Developer Platforms (IDPs) and self‑service
Cross‑functional reliability councils
Compliance as code in heavily regulated environments (SOC2, HIPAA, PCI DSS)

Popular Tools

Planning and tracking: Jira, Azure Boards, GitHub Projects, Linear; CI/CD: Jenkins, GitHub Actions; IaC: Terraform; monitoring: Prometheus, Grafana.

Best Practices

Shift left on security and testing
Immutable infrastructure
Everything as code
Blameless culture and psychological safety

Common Mistakes

Treating DevOps as a toolchain rather than a culture
Automating broken processes
Ignoring measurement and feedback
Isolating security from the pipeline

Real‑World Use Cases

Netflix’s full CI/CD with Spinnaker, Etsy’s early deploy culture, Amazon’s two‑pizza teams and CI/CD at scale.

Troubleshooting Topics

Pipeline failures due to flaky tests
Environment drift and configuration mismatch
Slow feedback loops and how to diagnose them

Learning Resources

The Phoenix Project and The DevOps Handbook
DORA research and Google’s SRE books
Online courses: AWS DevOps, Azure DevOps, KodeKloud

Projects

Build a CI/CD pipeline for a static site
Deploy a containerised application with automated rollbacks
Create a full delivery pipeline with automated security scanning

Interview Topics

“Explain the difference between Continuous Delivery and Deployment.”
“How would you improve a team’s deployment frequency?”
“Describe a blameless postmortem culture.”

1.2 Linux Operating System

Introduction

Linux is the dominant operating system for cloud and DevOps. Nearly all containers, servers, and orchestration nodes run Linux. Proficiency in Linux administration is non‑negotiable.

Why It Matters

The large majority of cloud VMs, Kubernetes nodes, and container images run Linux. Understanding the OS internals directly impacts performance, security, and troubleshooting.

Fundamental Concepts

Distributions: Ubuntu, Debian, Fedora, CentOS Stream, Rocky Linux, AlmaLinux, RHEL, Arch Linux. Key elements: filesystems (ext4, XFS), permissions (ugo/rwx, ACLs), users, groups, processes (fork, exec), systemd services, cron scheduling, package managers (apt, dnf, yum, snap).

Intermediate Topics

Kernel tuning (sysctl)
Namespaces and cgroups (the building blocks of containers)
SELinux / AppArmor
Systemd unit files and timers
Network configuration (ip, nmcli)

Advanced Topics

eBPF for observability and security
Custom kernel compilation and tuning
Filesystem internals (inodes, journaling)
Memory management and OOM handling
Linux boot process (GRUB, initramfs)

Enterprise Practices

Immutable OS images (Fedora CoreOS, Flatcar, Bottlerocket)
CIS‑benchmarked hardened images
Centralised authentication (LDAP, SSSD)
Compliance scanning with OpenSCAP

Popular Tools

htop, iotop, strace, ltrace, lsof, tcpdump, auditd, systemd‑journald, rsyslog.

Best Practices

Minimal base images; reduce attack surface
Use standardised AMIs/Golden Images
Automate patching and configuration with Ansible/Chef
Monitor filesystem inodes and disk usage

Common Mistakes

Running services as root inside containers
Ignoring open file limits and port ranges
Misconfigured time synchronisation (NTP/chrony)
Leaving debug symbols and unnecessary packages in production

Real‑World Use Cases

Google’s Container‑Optimized OS (COS) for GKE nodes
Netflix’s use of Ubuntu with custom performance tuning
Financial services running RHEL with real‑time kernel patches

Troubleshooting Topics

High load average but low CPU (I/O wait)
Out‑of‑memory killer investigations
DNS resolution failures caused by /etc/resolv.conf or /etc/nsswitch.conf
“Too many open files” errors

Learning Resources

Linux Bible, UNIX and Linux System Administration Handbook
Linux Foundation courses (LFCS)
Distro‑specific documentation (Red Hat, Ubuntu)

Projects

Build a minimal Linux from scratch inside a VM
Set up a hardened web server with SELinux enforcing
Automate user and group management with Bash

Interview Topics

“How do you troubleshoot a process consuming 100% CPU?”
“Explain the difference between a hard link and a symbolic link.”
“What happens between pressing power and the login prompt?”
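
For the hard link versus symbolic link question above, a small experiment can help. The sketch below assumes a Linux filesystem that supports both link types; the file names are arbitrary. It creates a file, a hard link, and a symlink, deletes the original name, and reports what survives.

```python
import os
import tempfile


def link_demo() -> dict:
    """Create a file, a hard link and a symlink, then delete the original name."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")

        with open(original, "w") as f:
            f.write("data")

        os.link(original, hard)     # second directory entry for the same inode
        os.symlink(original, soft)  # separate file that only stores a path

        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
        link_count = os.stat(original).st_nlink

        os.remove(original)         # removes one name; the inode lives on via `hard`

        with open(hard) as f:
            hard_content = f.read()

        return {
            "same_inode": same_inode,
            "link_count": link_count,
            "hard_readable_after_delete": hard_content == "data",
            "symlink_dangling": os.path.islink(soft) and not os.path.exists(soft),
        }
```

The hard link keeps the data reachable because it is just another name for the same inode, while the symlink is left dangling because it only remembers the deleted path.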

1.3 Shell Scripting & Automation

Introduction

Shell scripting glues together tools, performs repeatable tasks, and drives CI/CD pipelines. Bash remains the universal glue; PowerShell and Python extend automation to Windows and complex logic.

Why It Matters

Automation is a core DevOps principle. Scripting eliminates manual toil, reduces errors, and ensures consistency across environments.

Fundamental Concepts

Bash: variables, loops, conditionals, functions, exit codes, stdin/stdout/stderr, job control. PowerShell: objects, cmdlets, modules. Python: subprocess, os, shutil, argparse.

Intermediate Topics

Error handling with set -e, traps
Logging and structured output (JSON with jq)
Templating with envsubst, sed, or Jinja2
Python for API clients, infrastructure glue

Advanced Topics

Idempotent scripts
Parallel execution with xargs, GNU Parallel
Writing custom CLIs with Click or Cobra (Go)
Shell‑based unit testing (Bats)
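
As an illustration of the idempotent scripts item above, the following sketch appends a line to a file only when it is missing, so running it any number of times leaves the same result. The ensure_line helper, the demo function, and the file name are hypothetical; this is one common pattern, not the only way to achieve idempotency.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to `path` unless it is already present.

    Returns True if the file was changed, False if it was already converged.
    """
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False  # desired state already reached; do nothing
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    path.write_text(text + prefix + line + "\n")
    return True


def demo() -> tuple:
    """Run the same step twice against a scratch file."""
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "profile"
        first = ensure_line(target, "export EDITOR=vim")
        second = ensure_line(target, "export EDITOR=vim")
        return first, second, target.read_text()
```

Returning whether anything changed mirrors the "changed" versus "ok" reporting that configuration management tools give, and makes the step safe to re-run after a partial failure.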

Enterprise Practices

Scripts stored in version control, reviewed and linted
ShellCheck integration in CI
Packaging scripts as RPM/DEB or container images

Best Practices

Use #!/usr/bin/env bash for portability
Quote variables, avoid eval
Prefer long options for readability
Return meaningful exit codes

Common Mistakes

Hardcoding secrets
Not checking command success
Over‑complicating when a simple built‑in suffices

Real‑World Use Cases

GitHub Actions composite actions powered by Bash
Cloud‑init and user‑data scripts to bootstrap EC2
Database migration scripts executed from CI

Troubleshooting Topics

Debugging with set -x and PS4
Locale issues (LC_ALL)
Handling spaces in filenames

Learning Resources

Advanced Bash‑Scripting Guide, ShellCheck wiki
Exercism Bash track
PowerShell in a Month of Lunches

Projects

Create a backup script with rotation and notification
Write a CLI tool in Python to manage cloud resources
Build a pre‑commit hook for linting and secrets scanning

Interview Topics

“How would you remove files older than 7 days?”
“Explain the difference between > and >>.”
“Script a health check that fails if HTTP status ≠ 200.”
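
The first interview question above is usually answered in the shell with a find invocation using -mtime. The sketch below shows the same idea in Python so the cutoff logic is explicit. The function name, the non‑recursive behaviour, and the demo file names are assumptions made for illustration.

```python
import os
import tempfile
import time
from pathlib import Path


def remove_older_than(directory: Path, days: float, now=None) -> list:
    """Delete regular files in `directory` last modified more than `days` days ago.

    Not recursive; symlinks and subdirectories are left alone.
    Returns the sorted names of the removed files.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    removed = []
    for entry in directory.iterdir():
        if entry.is_symlink() or not entry.is_file():
            continue
        if entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)


def demo() -> tuple:
    """Create one old and one fresh file, then clean up with a 7-day cutoff."""
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        old, new = root / "old.log", root / "new.log"
        old.write_text("x")
        new.write_text("y")
        ten_days_ago = time.time() - 10 * 86400
        os.utime(old, (ten_days_ago, ten_days_ago))  # backdate the old file
        removed = remove_older_than(root, days=7)
        return removed, sorted(p.name for p in root.iterdir())
```

Whichever language is used, it is worth listing the matching files in a dry run before deleting anything.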

1.4 Networking

Introduction

Networking is the foundation of distributed systems. A DevOps engineer must understand how data moves from a user’s browser through load balancers, firewalls, and proxies to the application and back.

Why It Matters

Misconfigured networks cause the hardest‑to‑debug outages. Cloud‑native patterns (service mesh, overlay networks) demand deeper networking knowledge.

Fundamental Concepts

OSI model, TCP/IP (three‑way handshake, flow control), UDP, ICMP. Address resolution (ARP), DNS (A, CNAME, MX, NS, DNSSEC), DHCP, NAT, PAT. Subnetting (CIDR, VLSM), VLANs, static/dynamic routing, BGP, Anycast.
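
Subnetting with CIDR is easier to reason about with a small experiment. The sketch below uses Python's standard ipaddress module to split a network into equal subnets and to detect overlapping ranges; the helper names and the example ranges are illustrative only.

```python
import ipaddress


def split_network(cidr: str, new_prefix: int) -> list:
    """Split `cidr` into equal subnets with prefix length `new_prefix`."""
    network = ipaddress.ip_network(cidr)
    return [str(subnet) for subnet in network.subnets(new_prefix=new_prefix)]


def find_overlaps(cidrs: list) -> list:
    """Return every pair of ranges in `cidrs` that share at least one address."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for i, a in enumerate(networks)
        for b in networks[i + 1:]
        if a.overlaps(b)
    ]
```

For example, split_network("10.0.0.0/24", 26) yields four /26 subnets of 64 addresses each, and find_overlaps can be run over a planned address list before any VPC or VNet is created.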

Intermediate Topics

HTTP/1.1, HTTP/2, HTTP/3 (QUIC)
TLS 1.3, mTLS, certificate management
Load balancers: L4 (NLB) vs L7 (ALB), algorithms (round‑robin, least connections)
Proxies: forward, reverse (nginx, Envoy)
WebSockets, gRPC (protocol and load‑balancing implications)
VPN, WireGuard, IPsec
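
The two load‑balancing algorithms named in the list above can be sketched in a few lines. This is a simplified in‑memory model with hypothetical class names; real load balancers also account for health checks, weights, and connection draining.

```python
import itertools


class RoundRobin:
    """Hand out backends in a fixed rotating order."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each new connection to the backend with the fewest active ones."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def pick(self):
        # Ties are broken by insertion order of the backends.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        """Call when a connection to `backend` closes."""
        self.active[backend] -= 1
```

Round‑robin works well when requests cost roughly the same; least connections tends to behave better with long‑lived connections such as WebSockets or gRPC streams, where request counts alone say little about load.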

Advanced Topics

Software‑defined networking (SDN) and overlays (VXLAN, Geneve)
Kubernetes networking (CNI, Calico, Cilium/eBPF, Flannel)
BGP in the data centre (Calico, MetalLB)
Network policy and micro‑segmentation
eBPF‑based observability (Cilium Hubble)
QUIC and HTTP/3 deep‑dive

Enterprise Practices

Hub‑and‑spoke VPC/VNet architectures
Transit gateways and network virtual appliances
DDoS protection (AWS Shield, Cloudflare Magic Transit)
Global traffic management with latency‑based DNS

Popular Tools

Wireshark, tcpdump, nmap, iperf, netcat, dig, mtr, tc (traffic control), Calico, Cilium.

Best Practices

Use explicit CIDR planning, avoid overlapping IP space
Implement network micro‑segmentation and zero‑trust
Enforce TLS everywhere with automated cert renewal
Monitor latency, packet loss, and TCP retransmissions

Common Mistakes

Security group rules too permissive (0.0.0.0/0)
Not understanding ephemeral port exhaustion
DNS resolution delays causing cascading timeouts
Load balancer health checks pointing to wrong path

Real‑World Use Cases

Cloudflare’s global Anycast network
Netflix’s Zuul gateway for traffic shaping
Cilium replacing kube‑proxy with eBPF in Kubernetes clusters

Troubleshooting Topics

ping, traceroute, mtr
dig +trace for DNS delegation
netstat -plant, ss -tunlp
tcpdump and Wireshark filter expressions
Analysing TLS handshake with openssl s_client

Learning Resources

TCP/IP Illustrated, Computer Networking: A Top‑Down Approach
GNS3 for network simulation
CNCF Cilium certification study materials

Projects

Simulate a multi‑VPC hub‑spoke network on AWS
Deploy a service mesh and trace a request across pods
Set up a WireGuard mesh between cloud regions

Interview Topics

“What happens when you type google.com in a browser?”
“Explain the TCP three‑way handshake and connection teardown.”
“How would you debug a 5xx error on a load balancer?”

2. Version Control

2.1 Git & Collaboration

Introduction

Git is the de‑facto version control system for modern software. It enables asynchronous collaboration, code review, and audit trails across distributed teams.

Why It Matters

All infrastructure as code, application code, and configuration live in Git. GitOps principles treat Git as the single source of truth for declarative infrastructure and application state.

Fundamental Concepts

Working tree, staging area, commits (SHA, message), branches, merging (fast‑forward, three‑way), rebasing (interactive), cherry‑pick, tags (lightweight/annotated), stash, and .gitignore.

Intermediate Topics

Branching models: Git Flow, GitHub Flow, Trunk‑Based Development
Monorepos vs polyrepos; tools: Bazel, Nx, Turborepo
Git hooks (client‑side and server‑side)
Submodules and subtrees for dependencies
Git attributes and large file storage (LFS)

Advanced Topics

Git internals: objects (blob, tree, commit, tag), packfiles, reflog
git bisect for automated bug finding
Custom merge drivers
Signed commits and tags (GPG, SSH)
Git‑based operations at scale (partial clones, sparse checkouts)
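
To illustrate the Git internals item above: a blob's object ID is the hash of a short header followed by the file content. The sketch below assumes a repository using Git's default SHA‑1 object format (repositories created with the SHA‑256 format hash differently); the function name is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob with this content (SHA-1 format)."""
    # Header is the object type, a space, the content length in bytes, and a NUL byte.
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()
```

The result should match what git hash-object prints for a file with the same bytes, which is why two identical files anywhere in history are stored only once.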

Enterprise Practices

Branch protection rules and required reviews
Commit message conventions (Conventional Commits)
Verified commits with Sigstore or GPG
Repository governance and CODEOWNERS
Automated changelog generation

Platforms

GitHub, GitLab, Bitbucket, Azure DevOps Repos, Gitea/Forgejo.

Best Practices

Commit small, logical units
Write meaningful commit messages (imperative mood)
Rebase feature branches on main before merging
Never force‑push to shared branches

Common Mistakes

Committing secrets or large binaries
Long‑lived feature branches causing painful merges
Merge conflict resolution that loses intent
Using git add . indiscriminately

Real‑World Use Cases

Linux kernel’s email‑based patch flow (git format‑patch)
Google’s monorepo with Piper and trunk‑based development
GitOps with Flux reconciling clusters against desired state stored in a Git repository

Troubleshooting Topics

Recovering lost commits with reflog
Undoing a force push (from a local reflog, or the remote’s reflog if you have access to it)
Resolving detached HEAD
Cleaning large .git folders with BFG or git filter-repo

Learning Resources

Pro Git book (free)
GitHub Skills, GitLab Learn
Oh My Git! (interactive game)

Projects

Contribute to an open‑source project via fork and PR
Set up Git hooks to lint commit messages and run tests
Simulate a broken rebase and repair it

Interview Topics

“Explain the difference between merge and rebase.”
“How would you recover a deleted branch?”
“Walk through a typical Git Flow release cycle.”

3. CI/CD

Introduction

Continuous Integration (CI) automates code integration and testing; Continuous Delivery/Deployment (CD) ensures every change is releasable or automatically deployed to production.

Why It Matters

CI/CD is the engine of DevOps speed. It reduces integration pain, catches defects early, and provides a repeatable, auditable path to production.

Fundamental Concepts

CI: frequent merges, automated build and test. CD: delivery (manual approval to prod) vs deployment (automatic). Pipeline as code, build agents/runners, artifact management.

Intermediate Topics

Multi‑stage pipelines, parallel jobs, matrix builds
Environment promotion (dev → staging → prod)
Secrets injection in pipelines
Artifact repositories (Nexus, Artifactory, ECR)
Containerised build agents

Advanced Topics

DAG‑based pipelines (Argo Workflows, Tekton)
Progressive delivery and deployment automation with automated rollbacks
Pipeline composition and reusable templates
Testing in production (canary analysis, chaos)
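
The idea behind DAG‑based pipelines can be sketched with Python's standard graphlib module (Python 3.9+): jobs declare what they depend on, and everything whose dependencies are finished can run in parallel. The job names are hypothetical, and this is not how Argo Workflows or Tekton are implemented internally.

```python
from graphlib import CycleError, TopologicalSorter


def execution_waves(pipeline: dict) -> list:
    """Group jobs into waves that can run in parallel.

    `pipeline` maps each job name to the set of jobs it depends on.
    Raises CycleError if the dependencies form a cycle.
    """
    sorter = TopologicalSorter(pipeline)
    sorter.prepare()  # validates the graph and detects cycles
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

A pipeline with independent lint and build jobs, tests and scans that depend on the build, and a final deploy step comes out as three waves, which is the parallelism a linear stage list would miss.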

Enterprise Practices

Centralised pipeline governance with shared libraries (Jenkins) or reusable workflows (GitHub Actions)
Binary authorization (only signed images promoted)
Compliance gating (SAST, license checks) before deployment
Multi‑cloud, multi‑account pipeline strategies

Popular Tools

Jenkins, GitHub Actions, GitLab CI/CD, CircleCI, TeamCity, Bamboo, Azure Pipelines, Argo Workflows, Dagger, Woodpecker CI.

Best Practices

Treat pipelines as code, version‑controlled alongside app
Keep builds fast (<10 mins); parallelise
Never store secrets in pipeline scripts; use vault integration
Design idempotent deployment steps

Common Mistakes

Flaky tests tolerated over weeks
Pipeline snowflakes (manual UI configuration)
Insufficient rollback plans
Missing pipeline monitoring and alerting

Real‑World Use Cases

Netflix Spinnaker for multi‑region deployments
GitHub Actions powering over 10 million builds/day
Shopify’s merge queue and CI with auto‑rollback

Troubleshooting Topics

Build cache invalidation issues
Environment‑specific failures due to differences
External dependency timeouts in pipelines

Learning Resources

Continuous Delivery book by Jez Humble & Dave Farley
GitHub Actions docs, GitLab CI guides
KodeKloud CI/CD courses

Projects

Create a pipeline that builds a Docker image, scans it, and deploys to Kubernetes
Implement a canary deployment with automated rollback based on Prometheus metrics
Build a GitHub Actions composite action for Terraform plan/apply

Interview Topics

“Design a CI/CD pipeline for a microservices application.”
“How do you handle database migrations in a CD pipeline?”
“What is the difference between a CI build and a release pipeline?”

4. Containers

4.1 Container Runtimes & Tools

Introduction

Containers package applications with their dependencies, ensuring consistent runtime across environments. They are the foundation of Kubernetes and modern cloud‑native platforms.

Why It Matters

Containers solve “works on my machine”, enable microservices, and provide isolation with lower overhead than VMs. The ecosystem (Docker, Podman, containerd) powers all major cloud services.

Fundamental Concepts

Images (layers, digests), containers (isolated processes), registries. Dockerfile instructions (FROM, RUN, COPY, CMD, ENTRYPOINT), build context, image caching.
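
Layer caching is easier to reason about with a toy model. In the sketch below each layer's cache key is a hash of its parent's key plus the instruction text, so a change invalidates that layer and everything after it. This is a deliberate simplification: real builders also hash the contents of files referenced by COPY and ADD, which is stood in for here by a made‑up content-hash comment in the instruction string.

```python
import hashlib


def layer_keys(instructions: list) -> list:
    """Chain each instruction's cache key to the key of the layer below it."""
    keys, parent = [], ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(parent)
    return keys


def reusable_layers(old: list, new: list) -> int:
    """Count how many leading layers of `new` can be served from the cache of `old`."""
    count = 0
    for old_key, new_key in zip(layer_keys(old), layer_keys(new)):
        if old_key != new_key:
            break  # the first miss invalidates every later layer
        count += 1
    return count
```

This is why dependency installation is normally placed before copying frequently changing source code: a source edit then only rebuilds the last layers.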

Intermediate Topics

Multi‑stage builds to minimise image size
Docker Compose for local multi‑service environments
Networking modes (bridge, host, overlay) and volume mounts
BuildKit and buildx for advanced caching and multi‑arch images
Rootless containers, user namespace remapping
Image scanning and SBOM generation (Trivy, Syft)

Advanced Topics

Container runtimes: containerd, CRI‑O, gVisor, Kata Containers, Firecracker
OCI specification and runtime tools (runc, crun)
Container‑optimised OS (Flatcar, Bottlerocket)
Secure supply chain: image signing (Cosign), attestations (in‑toto)
Buildpacks (Paketo, kpack) for source‑to‑image without Dockerfiles

Enterprise Practices

Using minimal distroless base images
Enforcing image policies in Kubernetes (OPA/Gatekeeper, Kyverno)
Centralised image promotion across environments
Continuous vulnerability management and patching

Popular Tools

Docker, Podman, Buildah, Skopeo, Dive (image inspection), Hadolint (Dockerfile lint).

Registries

Docker Hub, GHCR, ECR, GCR/Artifact Registry, ACR, Harbor, Quay.

Best Practices

Never run as root, drop capabilities
Use specific image digests, not latest
Clean package manager caches in the same layer
Run security scanners in CI and on admission

Common Mistakes

Baking secrets into images
Large images slowing deployments
Ignoring layer ordering and cache busting
Exposing Docker socket to containers

Real‑World Use Cases

Google’s internal container infrastructure (Borg, the precursor to Kubernetes)
Amazon ECS using Docker images to run billions of tasks
Cloudflare Workers using V8 isolates rather than OCI containers for edge compute

Troubleshooting Topics

Image pull failures and credential helpers
Out of disk space due to overlay2 amassing layers (docker system prune)
Container crash loops and exit codes

Learning Resources

Docker Mastery course (Bret Fisher)
Play with Docker online labs
CNCF container runtime landscape

Projects

Write a multi‑stage Dockerfile for a Go app reducing image from 800MB to 10MB
Set up a private registry with Harbor and vulnerability scanning
Build a CI pipeline that builds and signs multi‑architecture images

Interview Topics

“Explain the difference between CMD and ENTRYPOINT.”
“How does a container isolate processes without a full OS?”
“What are the layers in a Docker image and why do they matter?”

5. Kubernetes

5.1 Core Orchestration

Introduction

Kubernetes (K8s) is the standard container orchestrator. It schedules workloads across clusters, manages networking, storage, and scales applications automatically.

Why It Matters

Kubernetes abstracts infrastructure, enabling cloud portability and automated operations. With 96% of organisations using or evaluating it (CNCF Annual Survey 2021), K8s skills are in high demand.

Fundamental Concepts

Pods, Deployments, ReplicaSets, StatefulSets, DaemonSets, Jobs/CronJobs. Services (ClusterIP, NodePort, LoadBalancer, ExternalName), ConfigMaps, Secrets, Namespaces, Labels, Selectors.

Intermediate Topics

Ingress controllers and Gateway API
Storage: PV, PVC, StorageClasses, CSI drivers
Helm: charts, values, repositories, hooks
RBAC: ServiceAccounts, Roles, ClusterRoles, RoleBindings
Resource requests/limits, QoS classes
Liveness, readiness, startup probes
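
The QoS classes mentioned in the list above follow from how requests and limits are set. The sketch below is a simplified reading of the documented rules, using plain dictionaries instead of real Pod specs; it compares quantity strings verbatim, so it would not recognise "500m" and "0.5" as equal the way Kubernetes does.

```python
def qos_class(containers: list) -> str:
    """Classify a pod from its containers' resources.

    Each container is a dict like
    {"requests": {"cpu": "250m"}, "limits": {"cpu": "500m", "memory": "256Mi"}}.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A request that is omitted defaults to the limit for that resource.
        requests = {**limits, **container.get("requests", {})}
        for resource in ("cpu", "memory"):
            request, limit = requests.get(resource), limits.get(resource)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

The class matters under node memory pressure: BestEffort pods are the first candidates for eviction and Guaranteed pods the last.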

Advanced Topics

Custom Resource Definitions (CRDs) and Operators
Horizontal Pod Autoscaler (HPA) with custom metrics, Vertical Pod Autoscaler (VPA)
Cluster Autoscaler, Karpenter, node provisioning optimisation
Admission controllers (OPA Gatekeeper, Kyverno)
etcd backup and restore, multi‑cluster management
Scheduler extender and custom scheduling policies
Kubelet, container runtime interaction, and node lifecycle
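
The Horizontal Pod Autoscaler's core calculation is documented as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is close enough to 1. The sketch below reproduces only that formula; the min/max bounds and the 0.1 tolerance are parameters assumed here for illustration, and the real controller additionally handles missing metrics, pod readiness, and stabilisation windows.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA scaling formula and clamp the result to the allowed range."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target; avoid flapping
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas at 200m average CPU against a 100m target the result is 8; at 50m it is 2.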

Networking

CNI plugins: Calico (iptables or eBPF data plane), Cilium (eBPF, service mesh, network policy), Flannel, Weave. Service types, CoreDNS, network policies for micro‑segmentation.

Storage

CSI drivers for cloud volumes, Rook (Ceph), Longhorn, OpenEBS, Portworx. Snapshots, volume expansion, topology‑aware scheduling.

Security

Pod Security Standards (privileged, baseline, restricted) enforced through Pod Security Admission, the replacement for PodSecurityPolicy (removed in Kubernetes 1.25), image signature verification (Connaisseur, Sigstore), runtime security (Falco). Secrets management (External Secrets Operator, Sealed Secrets, Vault).

Enterprise Practices

Multi‑tenancy via namespaces, resource quotas, and limit ranges
GitOps for cluster configuration (Argo CD, Flux)
Centralised policy management with OPA/Gatekeeper
Cluster API for declarative cluster lifecycle management
Backup and disaster recovery with Velero

Managed Kubernetes

EKS, AKS, GKE (Autopilot), OpenShift, DigitalOcean Kubernetes.

Certifications

CKA, CKAD, CKS, KCNA.

Best Practices

Use namespaces for logical separation, not security isolation alone
Always define resource requests and limits
Avoid latest tags; use digests and image policies
Keep control plane and nodes up to date; check for deprecated APIs before upgrading (kube‑no‑trouble)
Monitor cluster state and audit API server logs

Common Mistakes

Running privileged containers unnecessarily
Exposing the Kubernetes dashboard publicly
Ignoring etcd performance and backups
Too many microservices on a single cluster without limits

Real‑World Use Cases

Spotify’s migration to Kubernetes for backend services
Airbnb’s multi‑cluster, multi‑region setup
CERN’s use of Kubernetes for physics data processing

Troubleshooting Topics

Pod stuck in Pending (scheduling, resources)
CrashLoopBackOff diagnostics (kubectl logs, describe)
Service connectivity (kube‑dns, endpoints, network policy)
OOMKilled and memory limits tuning

Learning Resources

Kubernetes in Action (Manning), O’Reilly Kubernetes titles
KodeKloud, killer.sh
CNCF Kubernetes project documentation

Projects

Deploy a 3‑tier application with Helm and ingress
Set up a cluster with Cilium and Hubble for network observability
Implement a blue‑green deployment using Argo Rollouts

Interview Topics

“Explain the control plane components and their roles.”
“How does a service route traffic to pods?”
“You have a pod in CrashLoopBackOff – how do you debug?”

6. Cloud Providers

6.1 AWS, Azure, GCP & Others

Introduction

Public cloud providers offer on‑demand compute, storage, and higher‑level services. AWS, Azure, and Google Cloud dominate, but Oracle Cloud, Alibaba Cloud, DigitalOcean, and Linode serve specific niches.

Why It Matters

Most organisations operate in at least one public cloud. Understanding cloud services, pricing models, and multi‑cloud architectures is essential for DevOps and architecture roles.

Fundamental Concepts

Regions, Availability Zones, IAM (users, roles, policies), virtual networks (VPC/VNet), compute instances (EC2, VMs, Compute Engine), object storage (S3, Blob, GCS), managed databases (RDS, Cloud SQL, Azure SQL).

Intermediate Topics

Cloud‑native services: Lambda, Cloud Functions, Cloud Run
Container services: ECS, EKS, AKS, GKE
Messaging: SQS/SNS, Azure Service Bus, Pub/Sub
CDN & edge: CloudFront, Azure CDN, Cloud CDN, Cloudflare
Monitoring & logging: CloudWatch, Azure Monitor, Cloud Operations Suite

Advanced Topics

Landing zone architecture (AWS Control Tower, Azure CAF, Google Cloud Foundation Fabric)
Service control policies, organisation‑level governance
Cross‑cloud networking and interconnection
FinOps tooling specific to each provider
Multi‑cloud abstraction layers (Crossplane, Terraform)

Enterprise Practices

Well‑Architected Framework reviews (AWS, Azure, GCP)
Cost allocation tags and chargeback models
Cloud security posture management (CSPM)
Hybrid connectivity (Direct Connect, ExpressRoute, Interconnect)

Popular Tools

AWS CLI, Azure CLI, gcloud, aws‑shell, cloud‑agnostic: Terraform, Pulumi.

DigitalOcean & Linode

Ideal for developers and smaller workloads; simplicity and predictable pricing.

Best Practices

Enforce MFA and least privilege IAM
Never hardcode credentials; use instance roles/managed identities
Enable logging on all critical services
Automate resource creation/destruction with IaC

Common Mistakes

Leaving default VPC settings with open security groups
Not setting billing alarms and budgets
Ignoring region‑specific service availability
Over‑provisioning resources, leading to unexpected bills

Real‑World Use Cases

Netflix runs entirely on AWS with multi‑region resilience
Maersk uses Azure for global supply chain operations
Spotify migrated to GCP for data analytics

Troubleshooting Topics

CloudFormation stack stuck in ROLLBACK_IN_PROGRESS; Terraform apply left partially complete
S3 bucket permission “Access Denied” despite policy
Intermittent connectivity across VPC peering

Learning Resources

Official cloud provider documentation and workshops
A Cloud Guru, Pluralsight
Cloud provider free tiers for hands‑on

Projects

Deploy a serverless web app with API Gateway, Lambda, and DynamoDB
Set up a multi‑account AWS organisation with SSO
Create a cost‑aware architecture using spot/preemptible VMs

Interview Topics

“Compare AWS IAM roles and Azure Managed Identities.”
“How would you design a globally resilient application on any cloud?”
“Explain VPC peering vs Transit Gateway.”

7. Infrastructure as Code (IaC)

7.1 Terraform, OpenTofu, Pulumi, CDKs

Introduction

IaC replaces manual point‑and‑click infrastructure provisioning with declarative or imperative code. It enables versioning, peer review, and reproducible environments.

Why It Matters

IaC eliminates configuration drift, reduces deployment time, and enforces security/compliance before resources are created. It’s a prerequisite for GitOps and self‑service platforms.

Fundamental Concepts

Declarative vs imperative approaches. Core primitives: resources, data sources, providers, variables, outputs. State management: local vs remote (S3, Azure Storage, GCS), state locking.
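
The declarative model boils down to comparing what the code says should exist with what the state says does exist. The sketch below is a toy version of that comparison using plain dictionaries keyed by hypothetical resource addresses; it is not Terraform's actual planning algorithm, which also refreshes real resources and orders changes by dependency.

```python
def plan(desired: dict, state: dict) -> dict:
    """Compare desired configuration with recorded state.

    Both arguments map a resource address to its attributes.
    """
    return {
        # In the code but not in state: must be created.
        "create": sorted(desired.keys() - state.keys()),
        # In both, but with different attributes: must be updated.
        "update": sorted(
            key for key in desired.keys() & state.keys() if desired[key] != state[key]
        ),
        # In state but no longer in the code: must be destroyed.
        "delete": sorted(state.keys() - desired.keys()),
    }
```

An empty plan on a second run is the practical test of a declarative tool: if nothing changed in the code or the real world, there is nothing to do.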

Intermediate Topics

Terraform: modules, workspaces, provisioners (and why to avoid them), dynamic blocks
OpenTofu as an open‑source fork with community governance
Pulumi: using real programming languages (TypeScript, Python, Go)
AWS CDK, CDK for Terraform (CDKTF), CDK8s – generating IaC from code
Crossplane: Kubernetes‑native control plane for infrastructure

Advanced Topics

Custom Terraform providers and provisioners
IaC testing: Terratest, Kitchen‑Terraform, policy‑based (Conftest, OPA)
State manipulation and migration (moved blocks, terraform import)
IaC in CI/CD pipelines with plan/apply approval gates
Drift detection and reconciliation (driftctl, Crossplane’s continuous reconciliation)

Enterprise Practices

Module registries with semantic versioning
Sentinel or OPA policies to enforce compliance before apply
Multi‑account provisioning with Terraform Cloud/Enterprise or Atlantis
Immutable infrastructure with Packer and Terraform in concert

Best Practices

Never store state files in Git; use remote backend with encryption
Structure repos by environment or module, not monoliths
Pin provider versions
Always run terraform plan and review before apply

Common Mistakes

Manual changes to resources managed by Terraform (drift)
Storing secrets in plain text in .tf files or state
Over‑reliance on workspaces for environment separation
Not using data sources, leading to hardcoded IDs

Real‑World Use Cases

HashiCorp’s own Terraform Cloud manages thousands of workspaces
Pulumi used by Snowflake to manage cloud infrastructure as code
AWS CDK powering large‑scale serverless applications

Troubleshooting Topics

State lock timeout and forced unlock
Provider authentication issues
Resource dependency cycle errors

Learning Resources

Terraform Up & Running (O’Reilly)
HashiCorp Developer tutorials (formerly HashiCorp Learn)
Pulumi University

Projects

Write a Terraform module for a highly available web server
Convert an existing AWS console setup into Terraform code
Build a CI pipeline that automatically applies Terraform on merge

Interview Topics

“What is the purpose of Terraform state?”
“How would you manage secrets in Terraform?”
“Explain the difference between Terraform and CloudFormation.”

8. Configuration Management

8.1 Ansible, Puppet, Chef, SaltStack

Introduction

Configuration management tools enforce desired state on servers, ensuring consistent software installation, file configuration, and service management across fleets of machines.

Why It Matters

While containers and immutable infrastructure reduce reliance on CM, managing Kubernetes nodes, on‑prem VMs, and legacy systems still demands solid CM skills.

Fundamental Concepts

Idempotency, push vs pull models, declarative (Puppet, Salt) vs procedural (Ansible). Inventory, playbooks/modules (Ansible), manifests/recipes, and convergence.

Intermediate Topics

Ansible roles, Galaxy, AWX/Automation Controller
Puppet Hiera, environments, r10k
Chef cookbooks, Berkshelf, Test Kitchen
SaltStack pillars, grains, reactors
Windows configuration with DSC and Ansible

Advanced Topics

Custom facts, plugins, and extensions
Self‑healing and drift remediation
Event‑driven automation (StackStorm, Salt Reactor)
Integrating CM with IaC and image baking (Packer+Ansible provisioner)

Enterprise Practices

Role‑based access control in AWX/Tower
Policy‑as‑code with Chef InSpec for compliance scanning
Secrets management integration (Vault, CyberArk)
Zero‑touch provisioning with PXE and Kickstart/Preseed plus CM

Best Practices

Treat CM code like application code: version, review, lint
Keep playbooks/manifests idempotent
Use dynamic inventories from cloud providers
Run CM in CI to validate against ephemeral instances

Common Mistakes

Storing unencrypted secrets in CM code
Non‑idempotent scripts causing drift
Long‑running convergence with no reporting

Real‑World Use Cases

Ansible used by NASA for patching and configuration
Puppet managing thousands of nodes at CERN
Chef used by Facebook (now Meta) for bare‑metal provisioning

Troubleshooting Topics

Ansible task hanging due to SSH connectivity
Puppet agent failing to apply catalogue
Debugging Jinja2 template errors

Learning Resources

Ansible for DevOps (Jeff Geerling)
Learn Puppet, Chef training sites
Red Hat official Ansible courses

Projects

Write an Ansible playbook to harden a Linux server to CIS benchmarks
Set up AWX and run a job template from a Git repo
Migrate a shell script‑based provisioning to Ansible roles

Interview Topics

“Compare Ansible and Terraform; when would you use each?”
“How do you handle secrets in Ansible?”
“Explain idempotency with an example.”

9. GitOps

9.1 Argo CD, Flux, and Progressive Delivery

Introduction

GitOps uses Git as the single source of truth for declarative application and infrastructure configuration. An operator continuously reconciles the live state with the desired state in Git.

Why It Matters

GitOps provides a unified, auditable, and secure deployment model. It simplifies rollbacks, enhances security (pull‑based), and integrates naturally with developer workflows.

Fundamental Concepts

Desired state in Git, reconciliation loop, pull vs push deployments. Argo CD and Flux as leading CNCF‑graduated tools.

Intermediate Topics

Application definition and sync strategies (auto‑sync, pruning)
Helm and Kustomize integration
Multi‑cluster/multi‑environment management
Image updater automation (Argo CD Image Updater, Flux Image Automation)

Advanced Topics

Progressive delivery: Argo Rollouts, Flagger with canary analysis (Prometheus metrics)
Multi‑tenant GitOps architectures
Custom health checks and diff strategies
GitOps for infrastructure (Crossplane + Argo/Flux)
Security: sealed secrets, SOPS, Vault integration
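
Canary analysis, as listed under progressive delivery above, can be pictured as a loop over traffic weights with a metric check at each step. The sketch below is a rough model under stated assumptions: success_rate_at is a hypothetical callable standing in for a Prometheus query, and the single threshold is far simpler than the failure limits and intervals that Argo Rollouts or Flagger actually support.

```python
def run_canary(steps: list, success_rate_at, threshold: float = 0.99) -> tuple:
    """Shift traffic step by step and stop at the first unhealthy measurement.

    `steps` is a list of canary traffic weights in percent, e.g. [10, 50, 100].
    `success_rate_at(weight)` returns the observed success rate at that weight.
    Returns (outcome, history) where outcome is "promote" or "rollback".
    """
    history = []
    for weight in steps:
        rate = success_rate_at(weight)
        history.append((weight, rate))
        if rate < threshold:
            return "rollback", history  # abort before sending more traffic
    return "promote", history
```

The history makes the decision auditable: it records how much traffic the canary had taken and what was measured when the rollout was promoted or aborted.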

Enterprise Practices

Repositories structured per team, environment, or cluster
Promotion pipelines from dev to prod via Git branches or directories
Policy enforcement: only allowed registries, required labels
Disaster recovery: bootstrap an empty cluster from Git in minutes

Popular Tools

Argo CD, Flux (including its Helm controller, which superseded the deprecated Helm Operator), Argo Rollouts, Flagger.

Best Practices

Separate config repos from application source
Avoid manual kubectl changes; let the operator reconcile
Use pull‑based model for better security
Implement automated drift detection and alerting

Common Mistakes

Committing secrets in plaintext (use Sealed Secrets or Vault)
Ignoring sync status and health checks
Too many manual syncs overriding Git truth

Real‑World Use Cases

Intuit uses Argo CD to manage 1000+ apps across clusters
Weaveworks (pioneers) built Flux for multi‑tenant Kubernetes
BMW uses GitOps for factory automation systems

Troubleshooting Topics

Argo CD reporting OutOfSync despite identical YAML (fields mutated or defaulted in the live cluster, for example by admission webhooks or controllers)
Flux reconciliation loops due to missing CRDs
Webhook delivery failures

Learning Resources

Argo CD and Flux documentation
GitOps Guide to the Galaxy (Red Hat)
CNCF GitOps Working Group papers

Projects

Set up Argo CD to deploy a simple app from a GitHub repo
Implement a canary deployment with Flagger and Linkerd
Build a multi‑cluster GitOps setup with a single control plane

Interview Topics

“Explain the difference between push and pull GitOps.”
“How do you handle secrets in a GitOps workflow?”
“What are the benefits of using Git as a source of truth?”

10. Service Mesh

10.1 Istio, Linkerd, Cilium Service Mesh

Introduction

A service mesh extracts networking and security logic from application code into a sidecar proxy or eBPF‑based layer, providing traffic management, observability, and encryption between services.

Why It Matters

It enables zero‑trust networking, fine‑grained traffic control (retries, timeouts, circuit breaking), and deep telemetry without application changes.

Fundamental Concepts

Data plane (Envoy, Linkerd‑proxy) and control plane. Sidecar injection vs sidecar‑less (ambient mesh, Cilium). mTLS, authorization policies, traffic splitting.
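Traffic splitting can be illustrated with a weighted selection function. This is a sketch only: real proxies typically pick a destination at random per request, whereas hashing a (hypothetical) request ID is used here so the example is deterministic. The version names and weights are assumptions.

```python
import hashlib
from typing import Dict


def pick_version(request_id: str, weights: Dict[str, int]) -> str:
    """Map a request to a version in proportion to integer weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for version, weight in weights.items():
        upper += weight
        if bucket < upper:
            return version
    raise AssertionError("unreachable: bucket is always below total")


# Send roughly 10% of requests to the new version.
weights = {"v1": 90, "v2": 10}
picks = [pick_version(f"req-{i}", weights) for i in range(10000)]
print(picks.count("v2") / len(picks))
```

In a mesh the same idea is expressed declaratively (for example as route weights) and enforced by the data plane, so the split can change without redeploying the application.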

Intermediate Topics

Istio: VirtualService, DestinationRule, Gateway
Linkerd: simplicity, Rust‑based proxy, automatic mTLS
Kuma, Consul Connect as alternatives
Observability: distributed tracing (Jaeger/Zipkin), service graphs
Fault injection and chaos testing at mesh layer

Advanced Topics

Ambient mesh (Istio sidecar‑less with ztunnel)
eBPF‑based service mesh (Cilium) offering performance and simplicity
Multi‑cluster mesh federation
Wasm extensions in Envoy for custom processing
Performance benchmarks and resource overhead analysis

Enterprise Practices

Gradual rollout of mTLS per namespace
Using authorization policies to enforce least‑privilege communication
Mesh‑federation across VPCs or clouds
Centralised certificate management with cert‑manager and mesh control plane

Best Practices

Start with observability, then enable mTLS, then traffic control
Use namespace‑level policies before global defaults
Monitor sidecar resource usage
Keep mesh version updated for security patches

Common Mistakes

Enforcing strict mTLS before all services are ready
Creating overly permissive authorization policies
Ignoring the increased CPU/memory footprint on high‑traffic pods

Real‑World Use Cases

eBay’s use of Istio for traffic routing and mTLS
Nordstrom’s Linkerd deployment for zero‑trust microservices
Cilium replacing kube‑proxy and sidecar‑based meshes with an eBPF data plane at many cloud‑native companies

Troubleshooting Topics

Pod unable to reach service after mesh injection (sidecar not ready)
mTLS certificate rotation failures
Envoy configuration dump analysis (istioctl proxy-config)

Learning Resources

Istio in Action, Linkerd getting started
Solo.io workshops
Cilium and Hubble documentation

Projects

Deploy Bookinfo app with Istio and observe mTLS traffic
Implement a canary release with Istio traffic shifting
Set up Linkerd and inject into a sample microservice

Interview Topics

“Why use a service mesh instead of application‑level libraries?”
“Explain how mTLS works in a mesh.”
“What is the difference between a VirtualService and a DestinationRule?”

11. Observability

11.1 Monitoring, Logging, Tracing, Profiling

Introduction

Observability is the ability to understand system internals from external outputs. It rests on three pillars: metrics, logs, and traces, augmented by continuous profiling and events.

Why It Matters

Without observability, teams fly blind in production. It enables rapid detection, diagnosis, and resolution of issues, directly feeding SRE error budgets and platform improvements.

Fundamental Concepts

Metrics (counter, gauge, histogram), logs (structured, unstructured), traces (spans, context propagation). OpenTelemetry (OTel) as the standard for instrumentation and collection.

Intermediate Topics

Prometheus data model, PromQL, recording rules, alerting rules
Grafana dashboards, alerting, and Loki for logs
ELK stack (Elasticsearch, Logstash, Kibana) and Beats
Fluentd and Fluent Bit for unified log aggregation
Alertmanager for routing alerts to PagerDuty, Slack, etc.

Advanced Topics

Cortex, Thanos, Mimir for long‑term, horizontally scalable Prometheus storage
High‑cardinality metrics and cost management
OpenTelemetry collector deployment (agent/gateway)
Exemplars linking metrics and traces
Continuous profiling (Pyroscope, Parca)
AIOps‑enhanced anomaly detection on observability data

Enterprise Practices

Centralised observability platform (Grafana Cloud, Datadog, New Relic, Splunk)
Service Level Objectives (SLOs) as primary operational metric
Log sampling and retention policies for cost control
Observability as code: dashboards, alerts, and recording rules in Git

Best Practices

Use structured logging (JSON) for machine parseability
Instrument applications with OpenTelemetry SDKs
Define meaningful SLOs and error budgets
Avoid alert fatigue by alerting on symptoms, not causes
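Structured logging can be sketched with Python's standard `logging` module. The field names (`level`, `logger`, `message`, `request_id`) and the `checkout` logger are illustrative choices, not a required schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra=`; None when the caller did not provide one.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output in `stream` only

logger.info("payment accepted", extra={"request_id": "req-123"})
line = stream.getvalue().strip()
print(line)
```

Because every line is one JSON object, a log pipeline can filter on `request_id` without regular expressions, which also makes it easy to join logs with traces.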

Common Mistakes

Logging too much or too little, or omitting request IDs that let entries be correlated
Dashboard sprawl without actionability
No retention or downsampling strategy
Ignoring P95/P99 latencies, only using averages
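The last point is easy to demonstrate. The sketch below uses made-up latencies and a nearest-rank percentile (one of several common definitions; monitoring systems often estimate percentiles from histogram buckets instead).

```python
import math
from statistics import mean
from typing import List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample: 90 fast requests and 10 very slow ones (milliseconds).
latencies_ms = [20.0] * 90 + [2000.0] * 10

print("mean:", mean(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
```

Here the mean (218 ms) describes no real request: most users see 20 ms and one in ten sees 2 s, which only the tail percentiles reveal.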

Real‑World Use Cases

Google’s Monarch and Borgmon for internal monitoring
Uber’s Jaeger tracing platform
Grafana Labs’ Loki used for petabyte‑scale log querying

Troubleshooting Topics

PromQL query returning no data (label mismatch)
Log ingestion backpressure in Fluent Bit
Trace context not propagating across services

Learning Resources

Prometheus: Up & Running, Grafana docs
OpenTelemetry official documentation
Monitoring distributed systems (Google SRE book chapter)

Projects

Deploy the Prometheus/Grafana stack and create a custom dashboard for a web app
Implement distributed tracing with OpenTelemetry in a microservice
Set up alerting for high error rates with Alertmanager and PagerDuty

Interview Topics

“How do you monitor a Kubernetes cluster?”
“Describe the three pillars of observability.”
“What is an SLO and how does it relate to monitoring?”

12. DevSecOps

12.1 Security in the DevOps Pipeline

Introduction

DevSecOps integrates security practices into every stage of the software delivery lifecycle, making security a shared responsibility rather than a final gate.

Why It Matters

With the rise of supply chain attacks and cloud breaches, security must be embedded from code to production. Compliance and risk management demand automated, continuous security.

Fundamental Concepts

Shift left, Zero Trust, least privilege, defense in depth. SAST (Static Application Security Testing), DAST (Dynamic), SCA (Software Composition Analysis), container scanning, secret scanning.

Intermediate Topics

SAST tools: SonarQube, Semgrep, CodeQL
SCA/dependency scanning: Snyk, Dependabot, OWASP Dependency‑Check
Container image scanning: Trivy, Grype, Clair
IaC scanning: tfsec, Checkov, KICS, Terrascan
Secrets detection: git‑secrets, TruffleHog, GitGuardian

Advanced Topics

Supply chain security: SLSA framework, Sigstore (Cosign, Rekor, Fulcio), in‑toto attestations
Policy as code: OPA/Rego, Kyverno, Sentinel
Runtime security: Falco, Tetragon (eBPF), seccomp profiles, AppArmor
Vulnerability management lifecycle and patch SLAs
Automated compliance: Prowler, InSpec, compliance‑as‑code

Enterprise Practices

Security champions program and threat modelling
Centralised secrets management (Vault, AWS Secrets Manager)
Building an internal secure software development policy
Continuous authorization and zero‑trust networking with service mesh

Popular Tools

Trivy, Snyk, SonarQube, Checkov, Vault, Falco, Kyverno.

Best Practices

Never store secrets in code or images
Run security scans in CI and block on critical vulns
Rotate credentials and use short‑lived tokens
Maintain an up‑to‑date SBOM for all artifacts
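Blocking a pipeline on critical vulnerabilities amounts to turning scanner findings into an exit code. The sketch below assumes a simplified, hypothetical report shape (a list of objects with `id` and `severity`); real scanners such as Trivy or Grype emit richer JSON that would need to be mapped into it, and the finding IDs are made up.

```python
import sys
from typing import Dict, List

# Ordered from least to most severe.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def blocking_findings(
    findings: List[Dict[str, str]], threshold: str = "CRITICAL"
) -> List[str]:
    """Return the IDs of findings at or above the threshold severity."""
    limit = SEVERITY_ORDER.index(threshold)
    return [
        finding["id"]
        for finding in findings
        if SEVERITY_ORDER.index(finding["severity"]) >= limit
    ]


def gate(findings: List[Dict[str, str]], threshold: str = "CRITICAL") -> int:
    """Return a process exit code: 0 lets the pipeline pass, 1 fails it."""
    blocked = blocking_findings(findings, threshold)
    for finding_id in blocked:
        print(f"blocking finding: {finding_id}", file=sys.stderr)
    return 1 if blocked else 0


# Made-up findings for illustration only.
report = [
    {"id": "EXAMPLE-0001", "severity": "LOW"},
    {"id": "EXAMPLE-0002", "severity": "CRITICAL"},
]
exit_code = gate(report)
print(exit_code)
```

In a CI job the last step would be `sys.exit(gate(report))`. Keeping the threshold a parameter lets teams start by blocking only on critical findings and tighten it later without drowning in alerts.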

Common Mistakes

Focusing only on perimeter security
Alert fatigue from unfiltered vulnerability scanners
Not scanning for secrets in git history
Treating compliance as a one‑off audit

Real‑World Use Cases

Netflix’s Security Monkey and Repokid for cloud security automation
Google’s Binary Authorization for only trusted images on GKE
Capital One’s adoption of cloud‑native security tooling after their breach

Troubleshooting Topics

False positives in SAST/DAST
Image pulling blocked by admission controller
Vault token renewal failures

Learning Resources

OWASP Top 10, DevSecOps Hub
Kubernetes Security by Liz Rice and Michael Hausenblas
Hands‑on: KodeKloud DevSecOps path

Projects

Build a CI pipeline that fails on critical CVE in container image
Implement secret management with External Secrets Operator
Create a Falco rule to detect suspicious exec in containers

Interview Topics

“How would you secure a CI/CD pipeline?”
“What is the difference between SAST and DAST?”
“Explain the principle of least privilege in a Kubernetes context.”

13. Site Reliability Engineering (SRE)

13.1 Reliability, Operations, and Chaos Engineering

Introduction

SRE applies software engineering principles to operations, focusing on automating toil, measuring reliability via SLOs, and managing risk through error budgets.

Why It Matters

SRE bridges the gap between product velocity and operational stability. It provides a data‑driven framework for trade‑off decisions and ensures services remain reliable while evolving.

Fundamental Concepts

SLI (Service Level Indicator), SLO (Objective), SLA (Agreement). Error budgets, toil, automation, and the concept of “Hope is not a strategy.”
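The relationship between an SLO and its error budget is simple arithmetic. As a sketch (function names are illustrative): a 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime, and a request-based budget can be tracked as the fraction still unspent.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability allowed by an availability SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    Negative values mean the budget is overspent.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


print(round(error_budget_minutes(0.999), 1))                 # minutes per 30 days
print(round(budget_remaining(0.999, 1_000_000, 250), 2))     # share of budget left
```

An error budget policy then attaches decisions to these numbers, for example pausing risky releases once the remaining fraction drops below an agreed level.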

Intermediate Topics

Defining meaningful SLIs (latency, availability, throughput, error rate)
Error budget policies: burn rate alerts, freezing changes
Incident management lifecycle: detection, response, blameless postmortem
Monitoring and observability through an SRE lens
Capacity planning and load testing

Advanced Topics

Chaos engineering: principles, chaos days, and game‑day exercises
Tools: LitmusChaos, Gremlin, Chaos Mesh, AWS Fault Injection Simulator
Toil elimination through runbook automation and self‑healing
Reliability across multi‑region and multi‑cloud architectures
Advanced error budget statistical analysis

Enterprise Practices

SRE organization models (embedded, consulting, platform)
Automated incident response and runbook execution
Reliability scoring and team dashboards
SLO‑based release gating

Popular Tools

Prometheus/Alertmanager for SLO monitoring, PagerDuty/Opsgenie, LitmusChaos, Gremlin.

Best Practices

Start SLOs from user journeys, not server metrics
Use multi‑window, multi‑burn‑rate alerts
Write blameless postmortems with action items
Automate toil away; never accept repetitive manual work
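A multi-window burn-rate check can be sketched as follows. Burn rate is the observed error ratio divided by the ratio the SLO allows; the default threshold of 14.4 corresponds to spending 2% of a 30-day budget within one hour. The choice of a one-hour long window and a five-minute short window is an assumption taken from common practice, and the function names are illustrative.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return error_ratio / (1 - slo)


def should_page(
    long_window_error_ratio: float,   # e.g. measured over the last hour
    short_window_error_ratio: float,  # e.g. measured over the last 5 minutes
    slo: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows burn fast: sustained and still happening."""
    return (
        burn_rate(long_window_error_ratio, slo) >= threshold
        and burn_rate(short_window_error_ratio, slo) >= threshold
    )


# Hypothetical readings for a 99.9% SLO.
print(should_page(0.02, 0.02, slo=0.999))    # ongoing incident
print(should_page(0.02, 0.0005, slo=0.999))  # already recovered
```

The long window filters out brief blips, while the short window stops the alert from firing long after the problem has ended.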

Common Mistakes

Setting SLO targets arbitrarily (e.g., 99.999% for everything)
Not enforcing error budgets, leading to unreliable features
Skipping postmortems after incidents
Measuring reliability only by uptime

Real‑World Use Cases

Google’s SRE teams managing Search, Gmail with tight SLOs
Dropbox’s SRE‑driven migration to gRPC with error budget adoption
Target’s chaos engineering program before Black Friday

Troubleshooting Topics

SLO burn rate alert triggered but no customer impact
Incident command handoff failures
Chaos experiment causing unexpected cascading failures

Learning Resources

Google SRE books (free online)
SRE Workbook and Seeking SRE
Chaos Engineering by Casey Rosenthal and Nora Jones

Projects

Define SLIs and SLOs for a sample microservice and create a dashboard
Automate a runbook using a script triggered by an alert
Design a chaos experiment to test database failover

Interview Topics

“How would you choose an SLO for a payment API?”
“Explain an error budget and how it affects development speed.”
“Walk through a recent incident and how you handled it.”

14. Platform Engineering

14.1 Internal Developer Platforms

Introduction

Platform Engineering builds self‑service internal platforms that abstract infrastructure complexity, offering paved roads (golden paths) for developers while reducing cognitive load.

Why It Matters

It addresses the “you build it, you run it” overload, enabling developers to focus on business logic without sacrificing autonomy. It’s the evolution of DevOps at scale.

Fundamental Concepts

Platform as a Product, Internal Developer Platform (IDP), golden paths, self‑service, developer experience (DevEx) metrics. Backstage as a developer portal.

Intermediate Topics

Building a service catalog with Backstage and Software Templates
Composable platforms with Crossplane, Terraform modules behind APIs
Scaffolding tools: Yeoman, Cookiecutter, custom CLI generators
Measuring platform adoption and satisfaction (DORA, SPACE, DevEx frameworks)

Advanced Topics

Platform orchestration: Kratix, Humanitec, Port
Federation and multi‑platform architectures
Policy‑driven platform with OPA and admission control
Dynamic environment provisioning and ephemeral environments
Platform observability and cost chargeback

Enterprise Practices

Treating platform as a product with dedicated PM, roadmap, and SLAs
Platform conformance testing and certification
Integrating security and compliance into golden paths by default
Building a community of practice around platform usage

Popular Tools

Backstage, Crossplane, Humanitec, Port, Kratix, Scaffolder.

Best Practices

Start with the thinnest viable platform (TVP), then iterate
Gather developer feedback continuously
Make the golden path the path of least resistance
Avoid building a platform that locks teams in; enable escape hatches

Common Mistakes

Building a platform no one asked for (ivory‑tower architecture)
Over‑abstraction that removes the flexibility teams need
Neglecting documentation and onboarding experience

Real‑World Use Cases

Spotify’s Backstage, now an open‑source CNCF project used by thousands of organisations
Humanitec powering IDPs at enterprises like Bosch
Monzo’s internal platform for safe, fast deployments

Troubleshooting Topics

Template rendering failures in Backstage
Self‑service provisioning stuck due to cloud API quota
Developer experience degradation due to platform API latency

Learning Resources

Team Topologies (book)
Backstage documentation and plugin ecosystem
Platform engineering community (platformengineering.org)

Projects

Set up a Backstage instance and create a software template
Build a simple self‑service API using Crossplane to provision a database
Design a golden path with Terraform modules and a CLI wrapper

Interview Topics

“What is the difference between Platform Engineering and DevOps?”
“How would you measure the success of an IDP?”
“Describe a golden path and how to enforce it.”

15. Databases, Messaging & Storage

15.1 Data Layer for Cloud Native Systems

Introduction

Stateful workloads require careful handling in dynamic environments. Choosing the right database, cache, and message broker directly impacts scalability, consistency, and resilience.

Why It Matters

Data is the hardest part of distributed systems. DevOps engineers must understand replication, failover, backups, and performance tuning to keep applications reliable.

Fundamental Concepts

SQL (PostgreSQL, MySQL) and NoSQL (MongoDB, DynamoDB, Cassandra). Caching (Redis, Valkey). Message brokers (RabbitMQ, Kafka, NATS). Event sourcing, CQRS.

Intermediate Topics

Connection pooling, read replicas, sharding
Managed cloud services (RDS, Aurora, Cloud SQL, Cosmos DB)
Kafka: topics, partitions, consumer groups, exactly‑once semantics
Elasticsearch for full‑text search and analytics
Redis clustering and persistence options

Advanced Topics

Operators for databases on Kubernetes (Crunchy Data for PostgreSQL, Strimzi for Kafka)
Distributed SQL (CockroachDB, YugabyteDB)
Vector databases for AI (pgvector, Pinecone, Weaviate)
Stream processing with Kafka Streams, Flink, or RisingWave
Backup and point‑in‑time recovery strategies, RPO/RTO

Enterprise Practices

Multi‑AZ and cross‑region replication
Automated failover and chaos testing of data layer
Data encryption at rest and in transit
Schema migration strategies in CI/CD (Flyway, Liquibase)

Storage Systems

Block (EBS, managed disks), object (S3, MinIO, Ceph), file (EFS, Azure Files). Container‑native storage: Rook, Longhorn, OpenEBS.

Best Practices

Treat database changes as code, apply with same pipeline
Monitor query performance and index usage
Separate OLTP and OLAP workloads
Use connection pooling and circuit breakers
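A circuit breaker stops callers from hammering a dependency that is already failing. The class below is a minimal, single-threaded sketch; the names, the failure threshold, and the cool-down are assumptions, and production code would normally rely on a maintained library and add locking.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping database or broker calls as `breaker.call(lambda: run_query())` (where `run_query` is whatever function talks to the dependency) means that, once the circuit opens, callers fail in microseconds instead of queueing on timeouts and exhausting the connection pool.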

Common Mistakes

Using the wrong consistency model for the problem
Ignoring backup verification; backups don’t exist until restored
Over‑loading a single broker topic without partitioning

Real‑World Use Cases

Uber’s migration to a multi‑active database architecture
Netflix’s use of Cassandra for cross‑regional replication
LinkedIn’s Kafka handling trillions of messages per day

Troubleshooting Topics

High latency on database due to missing indexes
Kafka consumer lag alerts
Redis OOM and eviction policies

Learning Resources

Designing Data‑Intensive Applications (Kleppmann)
Official documentation of each system
A Cloud Guru’s database specialty courses

Projects

Deploy PostgreSQL on Kubernetes with an operator
Build a simple message pipeline with Kafka and a consumer
Set up MinIO as an S3‑compatible object store and integrate with an application

Interview Topics

“How would you handle a database migration with zero downtime?”
“Explain CAP theorem in practice.”
“What strategies exist for caching in a microservices environment?”

16. API Management & Architecture

Introduction

APIs are the contracts between services. Managing them involves design, security, versioning, rate limiting, and providing developer portals.

Why It Matters

A well‑designed API ecosystem accelerates development and enables partnerships. API gateways become critical for north‑south traffic control.

Fundamental Concepts

REST, GraphQL, gRPC (Protobuf). API design best practices, versioning, pagination, error handling, OpenAPI specification.

Intermediate Topics

API gateways: Kong, Tyk, Apigee, AWS API Gateway, Azure APIM
Authentication and authorization: OAuth2, JWT, API keys
Rate limiting, throttling, and quotas
Developer portals and documentation (Swagger UI, Redoc)
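Rate limiting at a gateway is often explained with the token bucket algorithm: tokens refill at a steady rate up to a burst capacity, and each request spends one. The class below is a single-process sketch with assumed names and parameters; gateways such as Kong or Apigee ship their own implementations, usually backed by shared storage.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(
        self,
        rate: float,       # tokens added per second
        capacity: float,   # maximum burst size
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so bursts are allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would typically answer 429 Too Many Requests
```

Keeping one bucket per API key or client ID turns this into a per-consumer quota; the `cost` parameter lets expensive endpoints consume more of the budget than cheap ones.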

Advanced Topics

API as a product: monetization, analytics
Federation: GraphQL stitching and supergraph
gRPC load balancing and health checking
Event‑driven APIs with AsyncAPI
Service mesh for east‑west API traffic management

Enterprise Practices

Centralised API management with self‑service onboarding
API security scanning in CI (OWASP ZAP, 42Crunch)
Full lifecycle management: design → publish → deprecate
Observability integration with API analytics

Best Practices

Design APIs first using OpenAPI
Use API versioning via URL or headers
Enforce authentication at the gateway
Monitor API usage and error rates

Common Mistakes

Breaking changes without versioning
Over‑fetching and under‑fetching with REST (leading to GraphQL adoption)
Not setting rate limits, allowing abuse

Real‑World Use Cases

Twilio’s API‑first product strategy
Netflix’s move from Falcor to federated GraphQL
Google Cloud Endpoints managing external APIs

Troubleshooting Topics

429 Too Many Requests due to rate limiting
CORS errors from misconfigured gateway
gRPC deadline exceeded vs. connection refused

Learning Resources

Designing Web APIs (O’Reilly)
Stoplight, Swagger tools
GraphQL official tutorial

Projects

Design and implement a REST API with OpenAPI spec and deploy behind Kong
Build a GraphQL wrapper over a REST service
Set up an API gateway with OAuth2 authentication

Interview Topics

“REST vs GraphQL vs gRPC – when would you choose each?”
“How do you secure a public API?”
“What is the role of an API gateway in microservices?”

17. Testing, Architecture & Advanced Topics

17.1 Testing in DevOps

Introduction

Testing is not a phase but a continuous activity. DevOps incorporates unit, integration, performance, security, and chaos testing into the pipeline.

Why It Matters

Pre‑production confidence comes from automated testing. Skipping testing leads to production incidents, broken SLOs, and burned error budgets.

Fundamental Concepts

Testing pyramid: unit, integration, end‑to‑end. TDD, BDD. Performance testing, load/stress testing.

Intermediate Topics

Infrastructure testing: Terratest, Test Kitchen, InSpec
Policy testing: Conftest, OPA unit tests
Contract testing (Pact)
Chaos engineering as testing
Synthetic monitoring as post‑deployment validation

Advanced Topics

Continuous verification with canary analysis (Argo Rollouts + Prometheus)
Fuzzing for security
Load generation tools: k6, Locust, JMeter
Testing in production with feature flags and progressive delivery

Enterprise Practices

Shift‑left and shift‑right testing combined
Testing environments as code, on‑demand
Automated test result dashboards and quality gates

Best Practices

Tests must be fast and reliable; flaky tests undermine trust
Use real‑world production traffic replay for load testing
Automate everything, but also perform exploratory testing

Common Mistakes

Only testing happy paths
Slow integration tests blocking CI
Not testing disaster recovery procedures

Real‑World Use Cases

Google’s DiRT (Disaster Recovery Testing) exercises
Amazon’s use of automated canary deployments
GitHub’s testing of merge queue with thousands of tests

Troubleshooting Topics

Flaky test root‑cause analysis
Diagnosing a bottleneck revealed by a load test (profiling, stack traces)
Chaos experiment causing cascading failure

Learning Resources

Continuous Delivery (Humble & Farley)
k6 documentation
Chaos Engineering community

Projects

Write Terratest for a Terraform module and run in CI
Create a k6 load test script and integrate with GitHub Actions
Implement contract tests between two microservices

Interview Topics

“How would you introduce testing to a team that does none?”
“Explain the testing pyramid and whether it still applies.”
“How do you test infrastructure changes?”

17.2 Architecture Patterns

Introduction

Modern architecture choices (monolith, microservices, event‑driven) shape the operational model. DevOps engineers must understand the trade‑offs to design reliable systems.

Why It Matters

Architecture determines scalability, deployability, and resilience. A poorly chosen architecture can make DevOps practices impossible.

Fundamental Concepts

Monolithic vs distributed, SOA, microservices, event‑driven architecture. CQRS, event sourcing, domain‑driven design (DDD).

Intermediate Topics

12‑factor app methodology (and 15‑factor)
Backend‑for‑frontend (BFF), API composition
Saga pattern for distributed transactions
Idempotency and deduplication
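Idempotency and deduplication can be sketched with an idempotency key: the first request with a given key runs the handler, and any retry with the same key gets the stored result instead of a second side effect. The class and the `charge` handler are hypothetical, and the in-memory dictionary stands in for a durable store with an atomic check-and-set.

```python
from typing import Any, Callable, Dict


class IdempotentProcessor:
    def __init__(self, handler: Callable[[Dict[str, Any]], Any]) -> None:
        self._handler = handler
        self._results: Dict[str, Any] = {}  # assumption: replace with a durable store

    def process(self, idempotency_key: str, payload: Dict[str, Any]) -> Any:
        """Run the handler at most once per key; replay the result on retries."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._handler(payload)
        self._results[idempotency_key] = result
        return result


charges = []  # records side effects so duplicates would be visible


def charge(payload: Dict[str, Any]) -> str:
    charges.append(payload["amount"])
    return f"charged {payload['amount']}"


processor = IdempotentProcessor(charge)
first = processor.process("order-42", {"amount": 10})
retry = processor.process("order-42", {"amount": 10})  # e.g. a redelivered message
print(first, retry, charges)
```

This is what makes at-least-once delivery safe: the broker may redeliver, but the customer is charged once.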

Advanced Topics

Cell‑based architecture (as used by Amazon, DoorDash)
Multi‑runtime microservices with Dapr
Serverless orchestration (Step Functions, Durable Functions)
Event‑driven data mesh for analytics

Enterprise Practices

Well‑architected framework reviews across all clouds
Architecture decision records (ADRs) in source control
Fitness functions to continuously validate architectural qualities

Best Practices

Start with a monolith unless microservices are clearly needed
Design for failure from day one
Use async messaging to decouple services

Common Mistakes

Microservices without proper DevOps maturity (distributed monolith)
Ignoring eventual consistency implications
Over‑engineering for scale that never arrives

Real‑World Use Cases

Amazon’s API mandate (teams communicate only through service interfaces) and its cell‑based architecture
Netflix’s microservices with Hystrix for resilience
Uber’s domain‑oriented microservices

Troubleshooting Topics

Latency propagation in deeply chained services
Event duplication in distributed messaging
Data inconsistency across services

Learning Resources

Building Microservices (Sam Newman)
Domain‑Driven Design (Eric Evans)
Architecture Katas for practice

Projects

Refactor a monolithic app into two services with an event bus
Design an event‑driven order system with CQRS
Implement a 12‑factor app checklist

Interview Topics

“Monolith vs microservices: how do you decide?”
“Explain the Saga pattern.”
“What is domain‑driven design and why does it matter for DevOps?”

18. MLOps, AIOps, and Emerging Trends

18.1 MLOps

Introduction

MLOps extends DevOps principles to machine learning, covering data versioning, experiment tracking, model training pipelines, deployment, and monitoring.

Why It Matters

As AI becomes ubiquitous, reliable ML delivery pipelines are critical. MLOps ensures reproducibility, governance, and operational excellence for ML models.

Fundamental Concepts

Data versioning (DVC, LakeFS), experiment tracking (MLflow, W&B), model registry, feature stores (Feast, Tecton), training pipelines.

Intermediate Topics

Orchestration with Kubeflow, Airflow, Prefect
Model serving: Seldon Core, BentoML, KServe (formerly KFServing)
CI/CD for ML: building, testing, deploying models
Drift detection and model retraining triggers
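One common way to detect input drift is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with its distribution in production. The sketch below uses only the standard library; the bin edges and sample values are made up, and the often-quoted alert level of about 0.25 is a rule of thumb rather than a standard.

```python
import math
from typing import List, Sequence


def bin_fractions(
    values: Sequence[float], edges: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Share of values in each bin; `edges` are ascending cut points."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for value in values:
        index = sum(1 for edge in edges if value >= edge)
        counts[index] += 1
    # Clamp empty bins to eps so the logarithm below stays defined.
    return [max(count / len(values), eps) for count in counts]


def psi(expected: Sequence[float], actual: Sequence[float], edges: Sequence[float]) -> float:
    """Population Stability Index between a reference and a live sample."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


edges = [0.25, 0.5, 0.75]
training = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
production = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.7, 0.9]  # shifted upwards

print(psi(training, training, edges))
print(psi(training, production, edges) > 0.25)
```

A retraining trigger would compute this per feature on a schedule and open a ticket or start a pipeline run when the index stays above the chosen threshold.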

Advanced Topics

MLOps on Kubernetes with GPU scheduling (Kubeflow + NVIDIA GPU Operator)
ML metadata and lineage
A/B testing models in production
MLOps at massive scale (Ray, Ray Serve)

Enterprise Practices

Model governance, explainability, and auditability
Centralised feature store across teams
Automated retraining pipelines based on metric degradation

Best Practices

Treat data as code; version datasets
Monitor model performance (accuracy, fairness, drift)
Automate the full ML lifecycle

Common Mistakes

Not versioning data, making experiments unreproducible
Deploying models without monitoring
Ignoring operational overhead of GPUs

Real‑World Use Cases

Netflix’s recommendation pipelines built on Metaflow
Uber’s Michelangelo platform
Google’s TFX for production ML

Troubleshooting Topics

Model serving latency spikes
Training job OOM on GPU
Feature store inconsistency

Learning Resources

Introducing MLOps (O’Reilly)
Kubeflow and MLflow documentation
MLOps community (mlops.community)

Projects

Set up MLflow for experiment tracking and register a model
Build a Kubeflow pipeline that trains and deploys a model
Implement a drift detection monitor for a production model

Interview Topics

“How does MLOps differ from DevOps?”
“What is a feature store and why is it important?”
“Explain the ML lifecycle from data to production.”

18.2 AIOps

Introduction

AIOps applies AI/ML to IT operations data to automate anomaly detection, root cause analysis, and remediation.

Why It Matters

As systems grow complex, AIOps reduces alert noise, accelerates incident resolution, and enables proactive operations.

Fundamental Concepts

Event correlation, anomaly detection, log pattern recognition, predictive alerting, automated runbooks.
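Anomaly detection is the easiest of these to illustrate. The sketch below flags points that deviate strongly from a rolling window using a z-score; the window size, the threshold, and the sample series are assumptions, and commercial AIOps platforms use considerably more sophisticated models.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscore_anomalies(
    series: Sequence[float], window: int = 5, threshold: float = 3.0
) -> List[int]:
    """Return indexes whose value is far from the preceding window's mean."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    anomalies: List[int] = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical requests-per-second samples with one spike.
rps = [10, 11, 10, 12, 11, 10, 50, 11, 10]
print(zscore_anomalies(rps))
```

Note that the spike inflates the window's deviation for the next few points, which is one reason real systems prefer robust statistics or seasonal models over a plain rolling z-score.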

Intermediate Topics

AIOps platforms: Moogsoft, BigPanda, Dynatrace Davis, Splunk ITSI
Integration with observability and incident management
Training models on incident history for RCA suggestions

Advanced Topics

Generative AI for runbook generation and natural language querying of systems
Autonomous operations and self‑healing pipelines
AI‑driven capacity forecasting and FinOps

Enterprise Practices

Augmenting NOC/SRE with AIOps insights
Building custom AIOps with ML on observability data
Ethical considerations and human‑in‑the‑loop

Best Practices

Start with data quality; AIOps is only as good as the data
Use AI to enrich alerts, not replace humans
Combine with chaos engineering for training

Common Mistakes

Expecting magic without curated data and incident labeling
Over‑reliance on black‑box recommendations

Real‑World Use Cases

eBay’s AIOps for reducing MTTR
Intuit’s anomaly detection on financial services
Large banks using AIOps for compliance and fraud operations

Troubleshooting Topics

AIOps false positives flooding on‑call
Model drift causing missed incidents

Learning Resources

AIOps: Artificial Intelligence for IT Operations (O’Reilly)
Vendor‑specific training
AIOps Exchange community

Projects

Build a simple anomaly detector for Prometheus metrics using Prophet
Create a Slack bot that suggests runbooks based on alert content

Interview Topics

“How would you implement an AIOps strategy?”
“What’s the difference between AIOps and traditional monitoring?”

18.3 Emerging Technologies (2026+)

eBPF: Cilium, Tetragon, and continuous profilers are reshaping networking, security, and observability without kernel modules or changes to kernel source.
WebAssembly (Wasm): Serverless functions on edge, plugin extensibility in Envoy, and container alternatives.
Confidential Computing: Encrypted data in use via hardware‑enclaves (Intel SGX, AMD SEV), safeguarding sensitive workloads.
Agentic Operations: AI agents that autonomously plan, execute, and remediate infrastructure tasks, bridging LLMs and DevOps toolchains.
GreenOps: Sustainability‑aware cloud operations, carbon‑aware scheduling, and energy‑efficient architectures.
Cloud‑Native AI Infrastructure: Kubernetes‑native orchestration of LLMs (vLLM, KServe), GPU sharing, and Ray for distributed ML.
Policy‑as‑Code Everywhere: AI‑verified policies for security, cost, and architecture using OPA and new DSLs.
Edge & Distributed Cloud: 5G/MEC, Cloudflare Workers, Fastly Compute (formerly Compute@Edge), and AWS Outposts bringing compute closer to users.

19. FinOps

Introduction

FinOps is the cultural practice of managing cloud costs, where engineering, finance, and business teams collaborate to maximise business value.

Why It Matters

Cloud spend can spiral without governance. FinOps ensures every dollar is accounted for, forecasted, and optimised without slowing innovation.

Fundamental Concepts

Cost allocation, tagging, showback/chargeback, reserved instances and savings plans, spot/preemptible instances, rightsizing.

Intermediate Topics

Tools: Kubecost, CloudHealth, Cloudability, AWS Cost Explorer, Cloud Custodian
Budget alerts and anomaly detection
Unit economics: cost per API call, per customer
Optimising Kubernetes cost (requests vs actual usage, overprovisioning)
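Two of these ideas reduce to simple calculations: comparing requested resources with actual usage, and expressing spend as a unit cost. The sketch below uses made-up figures; the 50% utilisation cut-off and the function names are assumptions, and tools such as Kubecost derive the inputs from cluster metrics and billing data.

```python
from typing import Dict, List, Tuple


def overprovisioned(
    workloads: Dict[str, Tuple[float, float]], min_utilisation: float = 0.5
) -> List[str]:
    """Names of workloads using less than `min_utilisation` of their request.

    Each value is (requested_cpu_cores, average_used_cpu_cores).
    """
    flagged = []
    for name, (requested, used) in sorted(workloads.items()):
        if requested > 0 and used / requested < min_utilisation:
            flagged.append(name)
    return flagged


def cost_per_thousand_requests(monthly_cost: float, monthly_requests: int) -> float:
    """Unit cost: spend attributed to every 1,000 requests served."""
    if monthly_requests <= 0:
        raise ValueError("monthly_requests must be positive")
    return monthly_cost / monthly_requests * 1000


# Hypothetical workloads: (requested cores, average used cores).
workloads = {"api": (2.0, 1.6), "batch": (4.0, 0.8), "cache": (1.0, 0.3)}
flagged = overprovisioned(workloads)
print(flagged)
print(round(cost_per_thousand_requests(500.0, 2_000_000), 2))
```

Unit costs make spend comparable over time: a bill that doubles while cost per thousand requests falls is a sign of growth, not waste.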

Advanced Topics

Automated cost‑aware scheduling (kube‑downscaler)
Continuous cost optimisation with AI recommendations
Multi‑cloud cost management and commitment strategies
Integrating FinOps into CI/CD (cost estimation on pull requests)

Enterprise Practices

FinOps Foundation frameworks and maturity model
Chargeback models in self‑service platforms
Sustainability metrics alongside cost

Best Practices

Enforce tagging from day one
Make cost visible to engineers
Implement automated kill‑switches for non‑production environments

Common Mistakes

Treating FinOps as purely a finance function
Ignoring orphan resources (EBS volumes, idle IPs)
Not leveraging spot instances for fault‑tolerant workloads

Real‑World Use Cases

Atlassian saving millions through FinOps practices
Spotify’s Cost Insights Backstage plugin
AWS’s own internal cost optimisation team

Troubleshooting Topics

Unexplained cost spikes via detailed billing reports
Kubernetes pods consuming more than requested (no limits)

Learning Resources

FinOps Foundation certification & playbooks
Cloud provider cost optimisation whitepapers
Kubecost blog

Projects

Set up Kubecost and identify over‑provisioned workloads
Build a Lambda that stops dev instances at night and on weekends
Create a dashboard of cloud spend per team

Interview Topics

“How would you reduce a company’s AWS bill by 30%?”
“Explain spot instances and their risks.”
“What is a committed use discount?”

20. Career Paths, Certifications, and Interview Prep

20.1 Career Paths

DevOps Engineer: Builds and maintains CI/CD, IaC, and operational tooling.
Cloud Engineer/Architect: Designs cloud infrastructure, migration, and governance.
Site Reliability Engineer: Focuses on reliability, SLOs, and incident management.
Platform Engineer: Creates internal developer platforms and golden paths.
DevSecOps Engineer: Integrates security into the full delivery pipeline.
MLOps Engineer: Manages ML lifecycle from data to deployment.
FinOps Practitioner: Optimises cloud spend and bridges engineering/finance.
Cloud Security Engineer: Cloud posture management, threat detection, compliance.
Developer Experience (DevEx) Engineer: Improves workflows, tooling, and productivity.

20.2 Certifications (2026 landscape)

AWS: Cloud Practitioner, Solutions Architect (Associate/Professional), DevOps Engineer Professional, Security Specialty, Advanced Networking Specialty.
Azure: AZ‑900, AZ‑104, AZ‑305, AZ‑400 (DevOps), AZ‑500.
Google Cloud: Associate Cloud Engineer, Professional Cloud Architect, Professional Cloud DevOps Engineer, Professional Data Engineer.
Kubernetes: KCNA, CKA, CKAD, CKS.
HashiCorp: Terraform Associate, Vault Associate.
Linux: LFCS, RHCSA, RHCE.
FinOps: FinOps Certified Practitioner.
CNCF: Prometheus Certified Associate (PCA), Cilium Certified Associate (CCA), Istio Certified Associate (ICA).
Security: CISSP, CCSP, AWS/Azure Security.

20.3 Interview Preparation

Beginner: Linux basics, simple CI/CD pipelines, Docker, basic networking, cloud fundamentals.
Intermediate: Kubernetes troubleshooting, IaC design, monitoring/alerting setups, incident response scenarios.
Senior: System design (cloud‑native, multi‑region), SLO/SLI definition, chaos engineering, cost optimisation.
Staff/Principal: Organisational DevOps transformation, platform strategy, reliability at massive scale, influencing without authority.
Scenario questions: “Your production database is down – walk me through your response.” “Design a multi‑cloud active‑active architecture.”
Architecture design whiteboarding: common enterprise patterns, trade‑off discussions.

20.4 Projects (Complete List)

Beginner

Static website on S3 + CloudFront with CI/CD via GitHub Actions.
Containerise a Go/Python web app and push to a registry.
Deploy a container to a managed Kubernetes service.

Intermediate

Microservices app with Helm, Ingress, and Prometheus monitoring.
GitOps with Argo CD – auto‑sync a cluster.
Centralised logging with Loki and Grafana.

Advanced

Service mesh (Istio/Linkerd) with mTLS and observability.
Multi‑cloud Kubernetes federation with GitOps.
Chaos engineering experiment suite with LitmusChaos.

Enterprise

Internal Developer Platform with Backstage, Crossplane, and self‑service APIs.
Compliance‑as‑code pipeline (CIS hardening, SBOM, SLSA level 3).
FinOps automation: rightsizing, scheduling, and chargeback.

https://zabitechcommunity.netlify.app/post.html?id=cybersecurity-roadmap-2026-the-complete-step-by-step-guide-from-beginner

https://zabitechcommunity.netlify.app/post.html?id=frontend-developer-roadmap-2026

https://zabitechcommunity.netlify.app/post.html?id=the-complete-roadmap-to-become-a-software-developer-in-2026