Cloud & DevOps
1. Foundations
1.1 DevOps Principles
Introduction
DevOps is a cultural and technical movement that bridges software development (Dev) and IT operations (Ops) to deliver value faster, safer, and more reliably. It emerged from the agile world, addressing the friction between teams that build software and those that run it in production.
Why It Matters
High‑performing organisations deploy 30× more frequently, have 60× fewer failures, and recover 168× faster than low performers (2015 State of DevOps Report). DevOps is the backbone of modern software delivery, enabling cloud‑native architectures, microservices, and continuous everything.
Fundamental Concepts
The core of DevOps includes Culture (collaboration, shared ownership), Automation (CI/CD, infrastructure as code), Measurement (metrics, telemetry), and Sharing (feedback, blameless postmortems). Agile, Scrum, Kanban, Lean, and the SDLC provide the process foundation.
Intermediate Topics
- Deployment strategies: blue‑green, canary, rolling updates
- Continuous Delivery vs Continuous Deployment
- Feedback loops: chatops, feature flags, observability-driven development
- Internal toolchains and platform engineering concepts
Advanced Topics
- DevOps metrics (DORA, SPACE, DevEx) and their statistical correlation
- Complex delivery pipelines with progressive delivery and automated rollbacks
- Policy-as-code and compliance automation
- Value stream management and flow metrics
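The four DORA metrics named above can be computed from a plain deployment log. The following is a minimal sketch, assuming a hypothetical `Deployment` record with commit, deploy, and restore timestamps; the field names and the `dora_metrics` helper are illustrative and not part of any DORA tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    commit_time: datetime                      # when the change was committed
    deploy_time: datetime                      # when it reached production
    failed: bool = False                       # did it cause a production failure?
    restored_time: Optional[datetime] = None   # when service was restored


def _hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deployments: list, period_days: int) -> dict:
    """Compute the four DORA metrics over a reporting period."""
    failures = [d for d in deployments if d.failed]
    lead_times = [_hours(d.deploy_time - d.commit_time) for d in deployments]
    restore_times = [
        _hours(d.restored_time - d.deploy_time)
        for d in failures
        if d.restored_time is not None
    ]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        "median_time_to_restore_hours": median(restore_times) if restore_times else None,
    }
```

Medians are used rather than means so that a single very slow change does not dominate the figure; that is a design choice of this sketch, not a requirement.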
Enterprise Practices
- Federated DevOps at scale with platform teams
- Internal Developer Platforms (IDPs) and self‑service
- Cross‑functional reliability councils
- Compliance as code in heavily regulated environments (SOC2, HIPAA, PCI DSS)
Popular Tools
Planning: Jira, Azure Boards, GitHub Projects, Linear. CI/CD: Jenkins, GitHub Actions. IaC: Terraform. Monitoring: Prometheus, Grafana.
Best Practices
- Shift left on security and testing
- Immutable infrastructure
- Everything as code
- Blameless culture and psychological safety
Common Mistakes
- Treating DevOps as a toolchain rather than a culture
- Automating broken processes
- Ignoring measurement and feedback
- Isolating security from the pipeline
Real‑World Use Cases
Netflix’s full CI/CD with Spinnaker, Etsy’s early deploy culture, Amazon’s two‑pizza teams and CI/CD at scale.
Troubleshooting Topics
- Pipeline failures due to flaky tests
- Environment drift and configuration mismatch
- Slow feedback loops and how to diagnose them
Learning Resources
- The Phoenix Project and The DevOps Handbook
- DORA research and Google’s SRE books
- Online courses: AWS DevOps, Azure DevOps, KodeKloud
Projects
- Build a CI/CD pipeline for a static site
- Deploy a containerised application with automated rollbacks
- Create a full delivery pipeline with automated security scanning
Interview Topics
- “Explain the difference between Continuous Delivery and Deployment.”
- “How would you improve a team’s deployment frequency?”
- “Describe a blameless postmortem culture.”
1.2 Linux Operating System
Introduction
Linux is the dominant operating system for cloud and DevOps. Nearly all containers, servers, and orchestration nodes run Linux. Proficiency in Linux administration is non‑negotiable.
Why It Matters
The large majority of cloud VMs, Kubernetes nodes, and container images run Linux. Understanding the OS internals directly impacts performance, security, and troubleshooting.
Fundamental Concepts
Distributions: Ubuntu, Debian, Fedora, CentOS Stream, Rocky Linux, AlmaLinux, RHEL, Arch Linux. Key elements: filesystems (ext4, XFS), permissions (ugo/rwx, ACLs), users, groups, processes (fork, exec), systemd services, cron scheduling, package managers (apt, dnf, yum, snap).
Intermediate Topics
- Kernel tuning (sysctl)
- Namespaces and cgroups (the building blocks of containers)
- SELinux / AppArmor
- Systemd unit files and timers
- Network configuration (ip, nmcli)
Advanced Topics
- eBPF for observability and security
- Custom kernel compilation and tuning
- Filesystem internals (inodes, journaling)
- Memory management and OOM handling
- Linux boot process (GRUB, initramfs)
Enterprise Practices
- Immutable OS images (Fedora CoreOS, Bottlerocket)
- CIS‑benchmarked hardened images
- Centralised authentication (LDAP, SSSD)
- Compliance scanning with OpenSCAP
Popular Tools
htop, iotop, strace, ltrace, lsof, tcpdump, auditd, systemd‑journald, rsyslog.
Best Practices
- Minimal base images; reduce attack surface
- Use standardised AMIs/Golden Images
- Automate patching and configuration with Ansible/Chef
- Monitor filesystem inodes and disk usage
Common Mistakes
- Running services as root inside containers
- Ignoring open file limits and port ranges
- Misconfigured time synchronisation (NTP/chrony)
- Leaving debug symbols and unnecessary packages in production
Real‑World Use Cases
- Google’s container‑optimised OS (COS) for GKE nodes
- Netflix’s use of Ubuntu with custom performance tuning
- Financial services running RHEL with real‑time kernel patches
Troubleshooting Topics
- High load average but low CPU (I/O wait)
- Out‑of‑memory killer investigations
- DNS resolution failures caused by a misconfigured /etc/nsswitch.conf
- “Too many open files” errors
Learning Resources
- Linux Bible, UNIX and Linux System Administration Handbook
- Linux Foundation courses (LFCS)
- Distro‑specific documentation (Red Hat, Ubuntu)
Projects
- Build a minimal Linux from scratch inside a VM
- Set up a hardened web server with SELinux enforcing
- Automate user and group management with Bash
Interview Topics
- “How do you troubleshoot a process consuming 100% CPU?”
- “Explain the difference between a hard link and a symbolic link.”
- “What happens between pressing power and the login prompt?”
1.3 Shell Scripting & Automation
Introduction
Shell scripting glues together tools, performs repeatable tasks, and drives CI/CD pipelines. Bash remains the universal glue; PowerShell and Python extend automation to Windows and complex logic.
Why It Matters
Automation is a core DevOps principle. Scripting eliminates manual toil, reduces errors, and ensures consistency across environments.
Fundamental Concepts
Bash: variables, loops, conditionals, functions, exit codes, stdin/stdout/stderr, job control. PowerShell: objects, cmdlets, modules. Python: subprocess, os, shutil, argparse.
Intermediate Topics
- Error handling with set -e, traps
- Logging and structured output (JSON with jq)
- Templating with envsubst, sed, or Jinja2
- Python for API clients, infrastructure glue
Advanced Topics
- Idempotent scripts
- Parallel execution with xargs, GNU Parallel
- Writing custom CLIs with Click or Cobra (Go)
- Shell‑based unit testing (Bats)
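An idempotent script, as listed above, leaves the system in the same state no matter how many times it runs. The sketch below shows one common pattern: check before changing, and report whether anything changed. The `ensure_line` helper is hypothetical.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file exactly as given.

    Returns True if the file was modified, False if it was already
    in the desired state. Running it twice never duplicates the line.
    """
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False  # desired state already reached: do nothing
    # Start on a fresh line if the file does not end with a newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    path.write_text(text + separator + line + "\n")
    return True
```

Returning a changed/unchanged flag mirrors how configuration management tools report results, and it lets a caller restart a service only when something was actually modified.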
Enterprise Practices
- Scripts stored in version control, reviewed and linted
- ShellCheck integration in CI
- Packaging scripts as RPM/DEB or container images
Best Practices
- Use #!/usr/bin/env bash for portability
- Quote variables, avoid eval
- Prefer long options for readability
- Return meaningful exit codes
Common Mistakes
- Hardcoding secrets
- Not checking command success
- Over‑complicating when a simple built‑in suffices
Real‑World Use Cases
- GitHub Actions composite actions powered by Bash
- Cloud‑init and user‑data scripts to bootstrap EC2
- Database migration scripts executed from CI
Troubleshooting Topics
- Debugging with set -x and PS4
- Locale issues (LC_ALL)
- Handling spaces in filenames
Learning Resources
- Advanced Bash‑Scripting Guide, ShellCheck wiki
- Exercism Bash track
- PowerShell in a Month of Lunches
Projects
- Create a backup script with rotation and notification
- Write a CLI tool in Python to manage cloud resources
- Build a pre‑commit hook for linting and secrets scanning
Interview Topics
- “How would you remove files older than 7 days?”
- “Explain the difference between > and >>.”
- “Script a health check that fails if HTTP status ≠ 200.”
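The "remove files older than 7 days" question above is usually answered with find in the shell; the same logic in Python might look like the sketch below. The `remove_older_than` function and its dry-run default are assumptions made for illustration.

```python
import time
from pathlib import Path


def remove_older_than(directory: Path, days: float, dry_run: bool = True) -> list:
    """Delete regular files in `directory` whose mtime is older than `days`.

    With dry_run=True (the default) nothing is deleted; the function
    only returns the files that would be removed.
    """
    cutoff = time.time() - days * 86400  # seconds in a day
    matched = []
    for path in sorted(directory.iterdir()):
        # Skip subdirectories; only top-level regular files are considered.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            matched.append(path)
    return matched
```

Defaulting to a dry run is a deliberate safety choice: destructive scripts are easier to review when the first invocation only reports what would happen.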
1.4 Networking
Introduction
Networking is the foundation of distributed systems. A DevOps engineer must understand how data moves from a user’s browser through load balancers, firewalls, and proxies to the application and back.
Why It Matters
Misconfigured networks cause the hardest‑to‑debug outages. Cloud‑native patterns (service mesh, overlay networks) demand deeper networking knowledge.
Fundamental Concepts
OSI model, TCP/IP (three‑way handshake, flow control), UDP, ICMP. Address resolution (ARP), DNS (A, CNAME, MX, NS, DNSSEC), DHCP, NAT, PAT. Subnetting (CIDR, VLSM), VLANs, static/dynamic routing, BGP, Anycast.
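Subnetting with CIDR, mentioned above, can be explored with Python's standard `ipaddress` module. This sketch carves fixed-size subnets out of a parent block and flags overlapping ranges; the helper names `plan_subnets` and `find_overlaps` are illustrative.

```python
import ipaddress
from itertools import islice


def plan_subnets(parent_cidr: str, new_prefix: int, count: int) -> list:
    """Return the first `count` subnets of size /new_prefix inside parent_cidr."""
    parent = ipaddress.ip_network(parent_cidr)
    return [str(net) for net in islice(parent.subnets(new_prefix=new_prefix), count)]


def find_overlaps(cidrs: list) -> list:
    """Return every pair of CIDR blocks that share address space."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for i, a in enumerate(nets)
        for b in nets[i + 1:]
        if a.overlaps(b)
    ]
```

For example, `plan_subnets("10.0.0.0/16", 24, 3)` yields the first three /24 blocks of the /16, and `find_overlaps` can be run against a list of VPC ranges before peering them.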
Intermediate Topics
- HTTP/1.1, HTTP/2, HTTP/3 (QUIC)
- TLS 1.3, mTLS, certificate management
- Load balancers: L4 (NLB) vs L7 (ALB), algorithms (round‑robin, least connections)
- Proxies: forward, reverse (nginx, Envoy)
- WebSockets, gRPC (protocol and load‑balancing implications)
- VPN, WireGuard, IPsec
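The two load-balancing algorithms named above differ in what they track. The following toy sketch contrasts them; the class names are hypothetical and real load balancers add health checks, weights, and concurrency control.

```python
import itertools


class RoundRobin:
    """Hand out backends in a fixed rotation, ignoring current load."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each new connection to the backend with the fewest open ones."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def pick(self):
        # Ties go to the backend that was listed first.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        """Call when a connection to `backend` closes."""
        self.active[backend] -= 1
```

Round-robin suits short, uniform requests; least connections tends to behave better with long-lived connections such as WebSockets or gRPC streams, where request counts are a poor proxy for load.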
Advanced Topics
- Software‑defined networking (SDN) and overlays (VXLAN, Geneve)
- Kubernetes networking (CNI, Calico, Cilium/eBPF, Flannel)
- BGP in the data centre (Calico, MetalLB)
- Network policy and micro‑segmentation
- eBPF‑based observability (Cilium Hubble)
- QUIC and HTTP/3 deep‑dive
Enterprise Practices
- Hub‑and‑spoke VPC/VNet architectures
- Transit gateways and network virtual appliances
- DDoS protection (AWS Shield, Cloudflare Magic Transit)
- Global traffic management with latency‑based DNS
Popular Tools
Wireshark, tcpdump, nmap, iperf, netcat, dig, mtr, tc (traffic control), Calico, Cilium.
Best Practices
- Use explicit CIDR planning, avoid overlapping IP space
- Implement network micro‑segmentation and zero‑trust
- Enforce TLS everywhere with automated cert renewal
- Monitor latency, packet loss, and TCP retransmissions
Common Mistakes
- Security group rules too permissive (0.0.0.0/0)
- Not understanding ephemeral port exhaustion
- DNS resolution delays causing cascading timeouts
- Load balancer health checks pointing to wrong path
Real‑World Use Cases
- Cloudflare’s global Anycast network
- Netflix’s Zuul gateway for traffic shaping
- Cilium replacing kube‑proxy with eBPF in Kubernetes clusters
Troubleshooting Topics
- ping, traceroute, mtr
- dig +trace for DNS delegation
- netstat -plant, ss -tunlp
- tcpdump and Wireshark filter expressions
- Analysing TLS handshake with openssl s_client
Learning Resources
- TCP/IP Illustrated, Computer Networking: A Top‑Down Approach
- GNS3 for network simulation
- CNCF Cilium certification study materials
Projects
- Simulate a multi‑VPC hub‑spoke network on AWS
- Deploy a service mesh and trace a request across pods
- Set up a WireGuard mesh between cloud regions
Interview Topics
- “What happens when you type google.com in a browser?”
- “Explain the TCP three‑way handshake and connection teardown.”
- “How would you debug a 5xx error on a load balancer?”
2. Version Control
2.1 Git & Collaboration
Introduction
Git is the de‑facto version control system for modern software. It enables asynchronous collaboration, code review, and audit trails across distributed teams.
Why It Matters
All infrastructure as code, application code, and configuration live in Git. GitOps principles treat Git as the single source of truth for declarative infrastructure and application state.
Fundamental Concepts
Working tree, staging area, commits (SHA, message), branches, merging (fast‑forward, three‑way), rebasing (interactive), cherry‑pick, tags (lightweight/annotated), stash, and .gitignore.
Intermediate Topics
- Branching models: Git Flow, GitHub Flow, Trunk‑Based Development
- Monorepos vs polyrepos; tools: Bazel, Nx, Turborepo
- Git hooks (client‑side and server‑side)
- Submodules and subtrees for dependencies
- Git attributes and large file storage (LFS)
Advanced Topics
- Git internals: objects (blob, tree, commit, tag), packfiles, reflog
- git bisect for automated bug finding
- Custom merge drivers
- Signed commits and tags (GPG, SSH)
- Git‑based operations at scale (partial clones, sparse checkouts)
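Git's object model, listed above, is content-addressed: with the default SHA-1 object format, a blob's ID is the hash of a short header plus the file contents. The sketch below reproduces what `git hash-object` computes for a blob; the function name `git_blob_sha1` is illustrative.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the object ID Git assigns to a blob with this content.

    Git hashes the header "blob <size>\\0" followed by the raw bytes,
    so identical content always maps to the same object.
    """
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()
```

Because the ID depends only on content, two files with the same bytes are stored once, and any corruption changes the hash and is detectable.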
Enterprise Practices
- Branch protection rules and required reviews
- Commit message conventions (Conventional Commits)
- Verified commits with Sigstore or GPG
- Repository governance and CODEOWNERS
- Automated changelog generation
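Commit message conventions such as Conventional Commits, mentioned above, are typically enforced by a hook or CI check. This is a minimal sketch of such a check; the list of allowed types follows a common convention and is an assumption here, since the specification itself only mandates `feat` and `fix`.

```python
import re

# Commonly used types (an assumption; teams often customise this list).
TYPES = ("feat", "fix", "docs", "style", "refactor", "perf",
         "test", "build", "ci", "chore", "revert")

# type, optional (scope), optional "!" for breaking changes, ": ", subject
HEADER = re.compile(
    r"^(?P<type>%s)(\((?P<scope>[^()\s]+)\))?(?P<breaking>!)?: (?P<subject>\S.*)$"
    % "|".join(TYPES)
)


def is_conventional(message: str) -> bool:
    """Check only the first line (the header) of a commit message."""
    lines = message.splitlines()
    return bool(lines) and HEADER.match(lines[0]) is not None
```

A check like this could be called from a commit-msg hook, which receives the path of the message file as its first argument.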
Platforms
GitHub, GitLab, Bitbucket, Azure DevOps Repos, Gitea/Forgejo.
Best Practices
- Commit small, logical units
- Write meaningful commit messages (imperative mood)
- Rebase feature branches on main before merging
- Never force‑push to shared branches
Common Mistakes
- Committing secrets or large binaries
- Long‑lived feature branches causing painful merges
- Merge conflict resolution that loses intent
- Using git add . indiscriminately
Real‑World Use Cases
- Linux kernel’s email‑based patch flow (git format‑patch)
- Google’s monorepo with Piper and trunk‑based development
- GitOps with Flux reading the desired cluster state from a Git repository
Troubleshooting Topics
- Recovering lost commits with reflog
- Undoing a force push (from a local reflog, or from another clone that still has the old commits)
- Resolving detached HEAD
- Cleaning large .git folders with BFG or git filter-repo
Learning Resources
- Pro Git book (free)
- GitHub Skills, GitLab Learn
- Oh My Git! (interactive game)
Projects
- Contribute to an open‑source project via fork and PR
- Set up Git hooks to lint commit messages and run tests
- Simulate a broken rebase and repair it
Interview Topics
- “Explain the difference between merge and rebase.”
- “How would you recover a deleted branch?”
- “Walk through a typical Git Flow release cycle.”
3. CI/CD
Introduction
Continuous Integration (CI) automates code integration and testing; Continuous Delivery/Deployment (CD) ensures every change is releasable or automatically deployed to production.
Why It Matters
CI/CD is the engine of DevOps speed. It reduces integration pain, catches defects early, and provides a repeatable, auditable path to production.
Fundamental Concepts
CI: frequent merges, automated build and test. CD: delivery (manual approval to prod) vs deployment (automatic). Pipeline as code, build agents/runners, artifact management.
Intermediate Topics
- Multi‑stage pipelines, parallel jobs, matrix builds
- Environment promotion (dev → staging → prod)
- Secrets injection in pipelines
- Artifact repositories (Nexus, Artifactory, ECR)
- Containerised build agents
Advanced Topics
- DAG‑based pipelines (Argo Workflows, Tekton)
- Progressive delivery and deployment automation with automated rollbacks
- Pipeline composition and reusable templates
- Testing in production (canary analysis, chaos)
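Canary analysis with automated rollback, listed above, boils down to comparing the canary's metrics with the baseline and choosing an action. The sketch below uses error rate only; the `canary_verdict` function, its thresholds, and the three verdict strings are assumptions for illustration rather than the behaviour of any specific tool.

```python
def canary_verdict(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    max_ratio: float = 1.5,      # canary may be at most 1.5x worse than baseline
    min_requests: int = 100,     # do not judge on too little traffic
    min_error_rate: float = 0.001,  # floor so a zero-error baseline is not impossible to match
) -> str:
    """Return "wait", "promote", or "rollback" for a canary release."""
    if canary_total < min_requests:
        return "wait"
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    threshold = max(baseline_rate * max_ratio, min_error_rate)
    return "promote" if canary_rate <= threshold else "rollback"
```

In a real pipeline the four counters would come from a metrics query, and the verdict would be evaluated repeatedly as traffic is shifted in steps.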
Enterprise Practices
- Centralised pipeline governance with shared libraries (Jenkins) or reusable workflows (GitHub Actions)
- Binary authorization (only signed images promoted)
- Compliance gating (SAST, license checks) before deployment
- Multi‑cloud, multi‑account pipeline strategies
Popular Tools
Jenkins, GitHub Actions, GitLab CI/CD, CircleCI, TeamCity, Bamboo, Azure Pipelines, Argo Workflows, Dagger, Woodpecker CI.
Best Practices
- Treat pipelines as code, version‑controlled alongside app
- Keep builds fast (<10 mins); parallelise
- Never store secrets in pipeline scripts; use vault integration
- Design idempotent deployment steps
Common Mistakes
- Flaky tests tolerated over weeks
- Pipeline snowflakes (manual UI configuration)
- Insufficient rollback plans
- Missing pipeline monitoring and alerting
Real‑World Use Cases
- Netflix Spinnaker for multi‑region deployments
- GitHub Actions powering over 10 million builds/day
- Shopify’s merge queue and CI with auto‑rollback
Troubleshooting Topics
- Build cache invalidation issues
- Environment‑specific failures due to differences
- External dependency timeouts in pipelines
Learning Resources
- Continuous Delivery book by Jez Humble & Dave Farley
- GitHub Actions docs, GitLab CI guides
- KodeKloud CI/CD courses
Projects
- Create a pipeline that builds a Docker image, scans it, and deploys to Kubernetes
- Implement a canary deployment with automated rollback based on Prometheus metrics
- Build a GitHub Actions composite action for Terraform plan/apply
Interview Topics
- “Design a CI/CD pipeline for a microservices application.”
- “How do you handle database migrations in a CD pipeline?”
- “What is the difference between a CI build and a release pipeline?”
4. Containers
4.1 Container Runtimes & Tools
Introduction
Containers package applications with their dependencies, ensuring consistent runtime across environments. They are the foundation of Kubernetes and modern cloud‑native platforms.
Why It Matters
Containers solve “works on my machine”, enable microservices, and provide isolation with lower overhead than VMs. The ecosystem (Docker, Podman, containerd) underpins the container services of all major clouds.
Fundamental Concepts
Images (layers, digests), containers (isolated processes), registries. Dockerfile instructions (FROM, RUN, COPY, CMD, ENTRYPOINT), build context, image caching.
Intermediate Topics
- Multi‑stage builds to minimise image size
- Docker Compose for local multi‑service environments
- Networking modes (bridge, host, overlay) and volume mounts
- BuildKit and buildx for advanced caching and multi‑arch images
- Rootless containers, user namespace remapping
- Image scanning and SBOM generation (Trivy, Syft)
Advanced Topics
- Container runtimes: containerd, CRI‑O, gVisor, Kata Containers, Firecracker
- OCI specification and runtime tools (runc, crun)
- Container‑optimised OS (Flatcar, Bottlerocket)
- Secure supply chain: image signing (Cosign), attestations (in‑toto)
- Buildpacks (Paketo, kpack) for source‑to‑image without Dockerfiles
Enterprise Practices
- Using minimal distroless base images
- Enforcing image policies in Kubernetes (OPA/Gatekeeper, Kyverno)
- Centralised image promotion across environments
- Continuous vulnerability management and patching
Popular Tools
Docker, Podman, Buildah, Skopeo, Dive (image inspection), Hadolint (Dockerfile lint).
Registries
Docker Hub, GHCR, ECR, GCR/Artifact Registry, ACR, Harbor, Quay.
Best Practices
- Never run as root, drop capabilities
- Use specific image digests, not latest
- Clean package manager caches in the same layer
- Run security scanners in CI and on admission
Common Mistakes
- Baking secrets into images
- Large images slowing deployments
- Ignoring layer ordering and cache busting
- Exposing Docker socket to containers
Real‑World Use Cases
- Google’s internal container infrastructure (Borg, the precursor to Kubernetes)
- Amazon ECS using Docker images to run billions of tasks
- Cloudflare Workers running on V8 isolates, as a contrast to OCI containers for edge compute
Troubleshooting Topics
- Image pull failures and credential helpers
- Out of disk space due to overlay2 amassing layers (docker system prune)
- Container crash loops and exit codes
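When diagnosing crash loops from exit codes, as noted above, it helps to know that codes above 128 conventionally mean the process was killed by a signal (128 plus the signal number). The sketch below decodes that convention; the `explain_exit_code` helper is hypothetical, and the signal names assume a Linux host.

```python
import signal


def explain_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable hint."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128  # shell convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"application error (exit status {code})"
```

An exit code of 137 (SIGKILL) is often, though not always, the OOM killer; checking the container's OOMKilled flag or the kernel log confirms it.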
Learning Resources
- Docker Mastery course (Bret Fisher)
- Play with Docker online labs
- CNCF container runtime landscape
Projects
- Write a multi‑stage Dockerfile for a Go app reducing image from 800MB to 10MB
- Set up a private registry with Harbor and vulnerability scanning
- Build a CI pipeline that builds and signs multi‑architecture images
Interview Topics
- “Explain the difference between CMD and ENTRYPOINT.”
- “How does a container isolate processes without a full OS?”
- “What are the layers in a Docker image and why do they matter?”
5. Kubernetes
5.1 Core Orchestration
Introduction
Kubernetes (K8s) is the standard container orchestrator. It schedules workloads across clusters, manages networking, storage, and scales applications automatically.
Why It Matters
Kubernetes abstracts infrastructure, enabling a high degree of cloud portability and automated operations. With 96% of organisations using or evaluating it (CNCF Annual Survey 2021), K8s skills are expected in most DevOps roles.
Fundamental Concepts
Pods, Deployments, ReplicaSets, StatefulSets, DaemonSets, Jobs/CronJobs. Services (ClusterIP, NodePort, LoadBalancer, ExternalName), ConfigMaps, Secrets, Namespaces, Labels, Selectors.
Intermediate Topics
- Ingress controllers and Gateway API
- Storage: PV, PVC, StorageClasses, CSI drivers
- Helm: charts, values, repositories, hooks
- RBAC: ServiceAccounts, Roles, ClusterRoles, RoleBindings
- Resource requests/limits, QoS classes
- Liveness, readiness, startup probes
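The QoS classes mentioned above are derived from each container's requests and limits. The sketch below encodes the documented rules in simplified form; the `qos_class` function and the dictionary shape are assumptions, and quantities are compared as plain strings, so equivalent spellings such as "1" and "1000m" are not recognised as equal.

```python
def qos_class(containers: list) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    Each container is a dict like:
        {"requests": {"cpu": "100m"}, "limits": {"cpu": "200m", "memory": "64Mi"}}
    """
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A missing request defaults to the limit when a limit is set.
        requests = {**limits, **container.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, with request == limit.
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

The class matters under node memory pressure: BestEffort pods are evicted first and Guaranteed pods last.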
Advanced Topics
- Custom Resource Definitions (CRDs) and Operators
- Horizontal Pod Autoscaler (HPA) with custom metrics, Vertical Pod Autoscaler (VPA)
- Cluster Autoscaler, Karpenter, node provisioning optimisation
- Admission controllers (OPA Gatekeeper, Kyverno)
- etcd backup and restore, multi‑cluster management
- Scheduler extender and custom scheduling policies
- Kubelet, container runtime interaction, and node lifecycle
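The Horizontal Pod Autoscaler listed above scales on the ratio between the observed metric and its target. The sketch below follows the formula given in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with the default 10% tolerance; the function name and the min/max defaults are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Return the replica count the HPA formula would aim for."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the workload alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For instance, four replicas averaging 90% CPU against a 60% target give a ratio of 1.5 and a desired count of six. The real controller additionally handles missing metrics, unready pods, and stabilisation windows, which this sketch omits.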
Networking
CNI plugins: Calico (eBPF), Cilium (eBPF, service mesh, network policy), Flannel, Weave. Service types, CoreDNS, network policies for micro‑segmentation.
Storage
CSI drivers for cloud volumes, Rook (Ceph), Longhorn, OpenEBS, Portworx. Snapshots, volume expansion, topology‑aware scheduling.
Security
Pod Security Standards (privileged, baseline, restricted) enforced by Pod Security Admission, the replacement for the removed PodSecurityPolicy; image signature verification (Connaisseur, Sigstore); runtime security (Falco). Secrets management (External Secrets Operator, Sealed Secrets, Vault).
Enterprise Practices
- Multi‑tenancy via namespaces, resource quotas, and limit ranges
- GitOps for cluster configuration (Argo CD, Flux)
- Centralised policy management with OPA/Gatekeeper
- Cluster API for declarative cluster lifecycle management
- Backup and disaster recovery with Velero
Managed Kubernetes
EKS, AKS, GKE (Autopilot), OpenShift, DigitalOcean Kubernetes.
Certifications
CKA, CKAD, CKS, KCNA.
Best Practices
- Use namespaces for logical separation, not security isolation alone
- Always define resource requests and limits
- Avoid latest tags; use digests and image policies
- Keep control plane and nodes up to date (use Kube‑no‑trouble to find deprecated APIs before upgrading)
- Monitor cluster state and audit API server logs
Common Mistakes
- Running privileged containers unnecessarily
- Exposing the Kubernetes dashboard publicly
- Ignoring etcd performance and backups
- Too many microservices on a single cluster without limits
Real‑World Use Cases
- Spotify’s migration to Kubernetes for backend services
- Airbnb’s multi‑cluster, multi‑region setup
- CERN’s use of Kubernetes for physics data processing
Troubleshooting Topics
- Pod stuck in Pending (scheduling, resources)
- CrashLoopBackOff diagnostics (kubectl logs, describe)
- Service connectivity (kube‑dns, endpoints, network policy)
- OOMKilled and memory limits tuning
Learning Resources
- Kubernetes in Action (Manning), Kubernetes: Up and Running (O’Reilly)
- KodeKloud, killer.sh
- CNCF Kubernetes project documentation
Projects
- Deploy a 3‑tier application with Helm and ingress
- Set up a cluster with Cilium and Hubble for network observability
- Implement a blue‑green deployment using Argo Rollouts
Interview Topics
- “Explain the control plane components and their roles.”
- “How does a service route traffic to pods?”
- “You have a pod in CrashLoopBackOff – how do you debug?”
6. Cloud Providers
6.1 AWS, Azure, GCP & Others
Introduction
Public cloud providers offer on‑demand compute, storage, and higher‑level services. AWS, Azure, and Google Cloud dominate, but Oracle Cloud, Alibaba Cloud, DigitalOcean, and Linode serve specific niches.
Why It Matters
Most organisations operate in at least one public cloud. Understanding cloud services, pricing models, and multi‑cloud architectures is essential for DevOps and architecture roles.
Fundamental Concepts
Regions, Availability Zones, IAM (users, roles, policies), virtual networks (VPC/VNet), compute instances (EC2, VMs, Compute Engine), object storage (S3, Blob, GCS), managed databases (RDS, Cloud SQL, Azure SQL).
Intermediate Topics
- Cloud‑native services: Lambda, Cloud Functions, Cloud Run
- Container services: ECS, EKS, AKS, GKE
- Messaging: SQS/SNS, Azure Service Bus, Pub/Sub
- CDN & edge: CloudFront, Azure CDN, Cloud CDN, Cloudflare
- Monitoring & logging: CloudWatch, Azure Monitor, Cloud Operations Suite
Advanced Topics
- Landing zone architecture (AWS Control Tower, Azure CAF, Google Cloud Foundation Fabric)
- Service control policies, organisation‑level governance
- Cross‑cloud networking and interconnection
- FinOps tooling specific to each provider
- Multi‑cloud abstraction layers (Crossplane, Terraform)
Enterprise Practices
- Well‑Architected Framework reviews (AWS, Azure, GCP)
- Cost allocation tags and chargeback models
- Cloud security posture management (CSPM)
- Hybrid connectivity (Direct Connect, ExpressRoute, Interconnect)
Popular Tools
AWS CLI, Azure CLI, gcloud, aws‑shell, cloud‑agnostic: Terraform, Pulumi.
DigitalOcean & Linode
Ideal for developers and smaller workloads; simplicity and predictable pricing.
Best Practices
- Enforce MFA and least privilege IAM
- Never hardcode credentials; use instance roles/managed identities
- Enable logging on all critical services
- Automate resource creation/destruction with IaC
Common Mistakes
- Leaving default VPC settings with open security groups
- Not setting billing alarms and budgets
- Ignoring region‑specific service availability
- Over‑provisioning resources, leading to unexpected bills
Real‑World Use Cases
- Netflix runs entirely on AWS with multi‑region resilience
- Maersk uses Azure for global supply chain operations
- Spotify migrated to GCP for data analytics
Troubleshooting Topics
- CloudFormation stack stuck in ROLLBACK_IN_PROGRESS, or a Terraform apply left partially applied
- S3 bucket permission “Access Denied” despite policy
- Intermittent connectivity across VPC peering
Learning Resources
- Official cloud provider documentation and workshops
- A Cloud Guru, Pluralsight
- Cloud provider free tiers for hands‑on
Projects
- Deploy a serverless web app with API Gateway, Lambda, and DynamoDB
- Set up a multi‑account AWS organisation with SSO
- Create a cost‑aware architecture using spot/preemptible VMs
Interview Topics
- “Compare AWS IAM roles and Azure Managed Identities.”
- “How would you design a globally resilient application on any cloud?”
- “Explain VPC peering vs Transit Gateway.”
7. Infrastructure as Code (IaC)
7.1 Terraform, OpenTofu, Pulumi, CDKs
Introduction
IaC replaces manual point‑and‑click infrastructure provisioning with declarative or imperative code. It enables versioning, peer review, and reproducible environments.
Why It Matters
IaC eliminates configuration drift, reduces deployment time, and enforces security/compliance before resources are created. It’s a prerequisite for GitOps and self‑service platforms.
Fundamental Concepts
Declarative vs imperative approaches. Core primitives: resources, data sources, providers, variables, outputs. State management: local vs remote (S3, Azure Storage, GCS), state locking.
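In the declarative model described above, the tool compares the desired configuration with recorded state and derives the actions itself. The sketch below shows that diffing step in miniature; the `plan` function and the dictionary-based resource model are simplifications, not how Terraform represents state internally.

```python
def plan(desired: dict, state: dict) -> dict:
    """Compare desired resources with recorded state and list required actions.

    Both arguments map a resource address to its attributes, e.g.
        {"aws_s3_bucket.logs": {"versioning": True}}
    Nothing is changed; the result is only a proposal to review.
    """
    return {
        "create": sorted(k for k in desired if k not in state),
        "update": sorted(k for k in desired if k in state and desired[k] != state[k]),
        "delete": sorted(k for k in state if k not in desired),
    }
```

Separating the diff from the apply step is what makes a review gate possible: the proposal can be inspected and approved before anything is touched.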
Intermediate Topics
- Terraform: modules, workspaces, provisioners (and why to avoid them), dynamic blocks
- OpenTofu: an open‑source fork of Terraform with community governance under the Linux Foundation
- Pulumi: using real programming languages (TypeScript, Python, Go)
- AWS CDK, CDK for Terraform (CDKTF), CDK8s – generating IaC from code
- Crossplane: Kubernetes‑native control plane for infrastructure
Advanced Topics
- Custom Terraform providers and provisioners
- IaC testing: Terratest, Kitchen‑Terraform, policy‑based (Conftest, OPA)
- State manipulation and migration (moved blocks, terraform import)
- IaC in CI/CD pipelines with plan/apply approval gates
- Drift detection and reconciliation (driftctl, Crossplane’s continuous reconciliation)
Enterprise Practices
- Module registries with semantic versioning
- Sentinel or OPA policies to enforce compliance before apply
- Multi‑account provisioning with Terraform Cloud/Enterprise or Atlantis
- Immutable infrastructure with Packer and Terraform in concert
Best Practices
- Never store state files in Git; use remote backend with encryption
- Structure repos by environment or module, not monoliths
- Pin provider versions
- Always run terraform plan and review before apply
Common Mistakes
- Manual changes to resources managed by Terraform (drift)
- Storing secrets in plain text in .tf files or state
- Over‑reliance on workspaces for environment separation
- Not using data sources, leading to hardcoded IDs
Real‑World Use Cases
- HashiCorp’s own Terraform Cloud manages thousands of workspaces
- Pulumi used by Snowflake to manage cloud infrastructure as code
- AWS CDK powering large‑scale serverless applications
Troubleshooting Topics
- State lock timeout and forced unlock
- Provider authentication issues
- Resource dependency cycle errors
Learning Resources
- Terraform Up & Running (O’Reilly)
- HashiCorp Developer tutorials (formerly HashiCorp Learn)
- Pulumi University
Projects
- Write a Terraform module for a highly available web server
- Convert an existing AWS console setup into Terraform code
- Build a CI pipeline that automatically applies Terraform on merge
Interview Topics
- “What is the purpose of Terraform state?”
- “How would you manage secrets in Terraform?”
- “Explain the difference between Terraform and CloudFormation.”
8. Configuration Management
8.1 Ansible, Puppet, Chef, SaltStack
Introduction
Configuration management tools enforce desired state on servers, ensuring consistent software installation, file configuration, and service management across fleets of machines.
Why It Matters
While containers and immutable infrastructure reduce reliance on CM, managing Kubernetes nodes, on‑prem VMs, and legacy systems still demands solid CM skills.
Fundamental Concepts
Idempotency, push vs pull models, declarative (Puppet, Salt) vs procedural (Ansible). Inventory, playbooks/modules (Ansible), manifests/recipes, and convergence.
Intermediate Topics
- Ansible roles, Galaxy, AWX/Automation Controller
- Puppet Hiera, environments, r10k
- Chef cookbooks, Berkshelf, Test Kitchen
- SaltStack pillars, grains, reactors
- Windows configuration with DSC and Ansible
Advanced Topics
- Custom facts, plugins, and extensions
- Self‑healing and drift remediation
- Event‑driven automation (StackStorm, Salt Reactor)
- Integrating CM with IaC and image baking (Packer+Ansible provisioner)
Enterprise Practices
- Role‑based access control in AWX/Tower
- Policy‑as‑code with Chef InSpec for compliance scanning
- Secrets management integration (Vault, CyberArk)
- Zero‑touch provisioning with PXE and Kickstart/Preseed plus CM
Best Practices
- Treat CM code like application code: version, review, lint
- Keep playbooks/manifests idempotent
- Use dynamic inventories from cloud providers
- Run CM in CI to validate against ephemeral instances
Common Mistakes
- Storing unencrypted secrets in CM code
- Non‑idempotent scripts causing drift
- Long‑running convergence with no reporting
Real‑World Use Cases
- Ansible used by NASA for patching and configuration
- Puppet managing thousands of nodes at CERN
- Chef used by Facebook (now Meta) for fleet‑wide configuration management
Troubleshooting Topics
- Ansible task hanging due to SSH connectivity
- Puppet agent failing to apply catalogue
- Debugging Jinja2 template errors
Learning Resources
- Ansible for DevOps (Jeff Geerling)
- Learn Puppet, Chef training sites
- Red Hat official Ansible courses
Projects
- Write an Ansible playbook to harden a Linux server to CIS benchmarks
- Set up AWX and run a job template from a Git repo
- Migrate a shell script‑based provisioning to Ansible roles
Interview Topics
- “Compare Ansible and Terraform; when would you use each?”
- “How do you handle secrets in Ansible?”
- “Explain idempotency with an example.”
9. GitOps
9.1 Argo CD, Flux, and Progressive Delivery
Introduction
GitOps uses Git as the single source of truth for declarative application and infrastructure configuration. An operator continuously reconciles the live state with the desired state in Git.
Why It Matters
GitOps provides a unified, auditable, and secure deployment model. It simplifies rollbacks, enhances security (pull‑based), and integrates naturally with developer workflows.
Fundamental Concepts
Desired state in Git, reconciliation loop, pull vs push deployments. Argo CD and Flux as leading CNCF‑graduated tools.
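The reconciliation loop described above can be illustrated with a toy operator that repeatedly pushes the live state toward what Git declares. This is a sketch under simplifying assumptions: the `reconcile` function is hypothetical, state is modelled as plain dictionaries, and real operators such as Argo CD or Flux work against the Kubernetes API.

```python
def reconcile(desired: dict, live: dict, prune: bool = False) -> list:
    """Mutate `live` toward `desired` and return the actions taken.

    With prune=False, resources that exist only in the cluster are left
    alone; with prune=True they are removed.
    """
    actions = []
    for name, spec in desired.items():
        if name not in live:
            live[name] = spec
            actions.append(f"create {name}")
        elif live[name] != spec:
            live[name] = spec  # reverts manual drift
            actions.append(f"update {name}")
    if prune:
        for name in [n for n in live if n not in desired]:
            del live[name]
            actions.append(f"prune {name}")
    return actions
```

Once the live state matches Git, further runs return an empty action list, which is why the loop can safely run on a timer.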
Intermediate Topics
- Application definition and sync strategies (auto‑sync, pruning)
- Helm and Kustomize integration
- Multi‑cluster/multi‑environment management
- Image updater automation (Argo CD Image Updater, Flux Image Automation)
Advanced Topics
- Progressive delivery: Argo Rollouts, Flagger with canary analysis (Prometheus metrics)
- Multi‑tenant GitOps architectures
- Custom health checks and diff strategies
- GitOps for infrastructure (Crossplane + Argo/Flux)
- Security: sealed secrets, SOPS, Vault integration
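The canary analysis idea from the list above can be sketched as a loop over traffic weights with a metric gate at each step. This is an illustration only: `run_canary`, its parameters, and the 1% threshold are assumptions, and tools such as Argo Rollouts or Flagger express the same idea declaratively rather than through a function like this.

```python
from typing import Callable


def run_canary(
    steps: list[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> tuple[str, int]:
    """Walk through traffic weights; roll back at the first step that breaches the gate."""
    for weight in steps:
        # In practice error_rate_at would query a metrics backend such as Prometheus.
        if error_rate_at(weight) > max_error_rate:
            return ("rolled_back", weight)
    return ("promoted", 100)


if __name__ == "__main__":
    healthy = lambda weight: 0.001
    degrades_under_load = lambda weight: 0.05 if weight >= 25 else 0.0
    print(run_canary([10, 25, 50], healthy))
    print(run_canary([10, 25, 50], degrades_under_load))
```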
Enterprise Practices
- Repositories structured per team, environment, or cluster
- Promotion pipelines from dev to prod via Git branches or directories
- Policy enforcement: only allowed registries, required labels
- Disaster recovery: bootstrap an empty cluster from Git in minutes
Popular Tools
Argo CD, Flux CD, Helm Operator, Argo Rollouts, Flagger.
Best Practices
- Separate config repos from application source
- Avoid manual kubectl changes; let the operator reconcile
- Use pull‑based model for better security
- Implement automated drift detection and alerting
Common Mistakes
- Committing secrets in plaintext (use Sealed Secrets or Vault)
- Ignoring sync status and health checks
- Too many manual syncs overriding Git truth
Real‑World Use Cases
- Intuit uses Argo CD to manage 1000+ apps across clusters
- Weaveworks (pioneers) built Flux for multi‑tenant Kubernetes
- BMW uses GitOps for factory automation systems
Troubleshooting Topics
- Argo CD out of sync despite identical YAML (due to live mutation)
- Flux reconciliation loops due to missing CRDs
- Webhook delivery failures
Learning Resources
- Argo CD and Flux documentation
- GitOps Guide to the Galaxy (Weaveworks)
- CNCF GitOps Working Group papers
Projects
- Set up Argo CD to deploy a simple app from a GitHub repo
- Implement a canary deployment with Flagger and Linkerd
- Build a multi‑cluster GitOps setup with a single control plane
Interview Topics
- “Explain the difference between push and pull GitOps.”
- “How do you handle secrets in a GitOps workflow?”
- “What are the benefits of using Git as a source of truth?”
10. Service Mesh
10.1 Istio, Linkerd, Cilium Service Mesh
Introduction
A service mesh extracts networking and security logic from application code into a sidecar proxy or eBPF‑based layer, providing traffic management, observability, and encryption between services.
Why It Matters
It enables zero‑trust networking, fine‑grained traffic control (retries, timeouts, circuit breaking), and deep telemetry without application changes.
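Circuit breaking is easier to reason about as a small state machine. In a mesh this logic runs in the proxy, not in the application; the class below is an application-level sketch for illustration, and its name, thresholds, and injectable `clock` parameter are assumptions.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial request after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the timeout, let a trial request through (half-open state).
        return self.clock() - self.opened_at >= self.reset_timeout

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None  # close the circuit again
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open and restart the cool-down
```

A caller checks `allow()` before each request and reports the outcome with `record()`; while the circuit is open, requests fail fast instead of piling onto an unhealthy upstream.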
Fundamental Concepts
Data plane (Envoy, Linkerd‑proxy) and control plane. Sidecar injection vs sidecar‑less (ambient mesh, Cilium). mTLS, authorization policies, traffic splitting.
Intermediate Topics
- Istio: VirtualService, DestinationRule, Gateway
- Linkerd: simplicity, Rust‑based proxy, automatic mTLS
- Kuma, Consul Connect as alternatives
- Observability: distributed tracing (Jaeger/Zipkin), service graphs
- Fault injection and chaos testing at mesh layer
Advanced Topics
- Ambient mesh (Istio sidecar‑less with ztunnel)
- eBPF‑based service mesh (Cilium) offering performance and simplicity
- Multi‑cluster mesh federation
- Wasm extensions in Envoy for custom processing
- Performance benchmarks and resource overhead analysis
Enterprise Practices
- Gradual rollout of mTLS per namespace
- Using authorization policies to enforce least‑privilege communication
- Mesh‑federation across VPCs or clouds
- Centralised certificate management with cert‑manager and mesh control plane
Best Practices
- Start with observability, then enable mTLS, then traffic control
- Use namespace‑level policies before global defaults
- Monitor sidecar resource usage
- Keep mesh version updated for security patches
Common Mistakes
- Enforcing strict mTLS before all services are ready
- Creating overly permissive authorization policies
- Ignoring the increased CPU/memory footprint on high‑traffic pods
Real‑World Use Cases
- eBay’s use of Istio for traffic routing and mTLS
- Nordstrom’s Linkerd deployment for zero‑trust microservices
- Cilium replacing kube‑proxy and service mesh with eBPF at many cloud‑native companies
Troubleshooting Topics
- Pod unable to reach service after mesh injection (sidecar not ready)
- mTLS certificate rotation failures
- Envoy configuration dump analysis (istioctl proxy-config)
Learning Resources
- Istio in Action, Linkerd getting started
- Solo.io workshops
- Cilium and Hubble documentation
Projects
- Deploy Bookinfo app with Istio and observe mTLS traffic
- Implement a canary release with Istio traffic shifting
- Set up Linkerd and inject into a sample microservice
Interview Topics
- “Why use a service mesh instead of application‑level libraries?”
- “Explain how mTLS works in a mesh.”
- “What is the difference between a VirtualService and a DestinationRule?”
11. Observability
11.1 Monitoring, Logging, Tracing, Profiling
Introduction
Observability is the ability to understand system internals from external outputs. It rests on three pillars: metrics, logs, and traces, augmented by continuous profiling and events.
Why It Matters
Without observability, teams fly blind in production. It enables rapid detection, diagnosis, and resolution of issues, directly feeding SRE error budgets and platform improvements.
Fundamental Concepts
Metrics (counter, gauge, histogram), logs (structured, unstructured), traces (spans, context propagation). OpenTelemetry (OTel) as the standard for instrumentation and collection.
Intermediate Topics
- Prometheus data model, PromQL, recording rules, alerting rules
- Grafana dashboards, alerting, and Loki for logs
- ELK stack (Elasticsearch, Logstash, Kibana) and Beats
- Fluentd and Fluent Bit for unified log aggregation
- Alertmanager for routing alerts to PagerDuty, Slack, etc.
Advanced Topics
- Cortex, Thanos, Mimir for long‑term, scalable Prometheus
- High‑cardinality metrics and cost management
- OpenTelemetry collector deployment (agent/gateway)
- Exemplars linking metrics and traces
- Continuous profiling (Pyroscope, Parca)
- AIOps‑enhanced anomaly detection on observability data
Enterprise Practices
- Centralised observability platform (Grafana Cloud, Datadog, New Relic, Splunk)
- Service Level Objectives (SLOs) as primary operational metric
- Log sampling and retention policies for cost control
- Observability as code: dashboards, alerts, and recording rules in Git
Best Practices
- Use structured logging (JSON) for machine parseability
- Instrument applications with OpenTelemetry SDKs
- Define meaningful SLOs and error budgets
- Avoid alert fatigue by alerting on symptoms, not causes
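Structured logging can be shown with the standard library alone. The formatter below is a minimal sketch; the field names, including `request_id`, are illustrative choices rather than a required schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id is attached via `extra=`; it is None on other records
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("payment accepted", extra={"request_id": "r-42"})
```

Carrying a request ID on every line is what lets a log backend join entries from different services for one request.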
Common Mistakes
- Logging too much or too little (no request IDs)
- Dashboard sprawl without actionability
- No retention or downsampling strategy
- Ignoring P95/P99 latencies, only using averages
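The last point is easy to demonstrate with made-up numbers. In the sketch below, 90 requests take 20 ms and 10 take 2000 ms; the sample data and the nearest-rank `percentile` helper are assumptions for illustration (production systems usually estimate percentiles from histograms).

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [20] * 90 + [2000] * 10

if __name__ == "__main__":
    print("mean:", statistics.mean(latencies_ms))
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
    print("p99:", percentile(latencies_ms, 99))
```

The mean lands between the two groups and describes no real request, while the median hides the slow tail entirely; only the high percentiles show that one request in ten takes two seconds.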
Real‑World Use Cases
- Google’s Monarch and Borgmon for internal monitoring
- Uber’s Jaeger tracing platform
- Grafana Labs’ Loki used for petabyte‑scale log querying
Troubleshooting Topics
- PromQL query returning no data (label mismatch)
- Log ingestion backpressure in Fluent Bit
- Trace context not propagating across services
Learning Resources
- Prometheus: Up & Running, Grafana docs
- OpenTelemetry official documentation
- Monitoring distributed systems (Google SRE book chapter)
Projects
- Deploy the Prometheus/Grafana stack and create a custom dashboard for a web app
- Implement distributed tracing with OpenTelemetry in a microservice
- Set up alerting for high error rates with Alertmanager and PagerDuty
Interview Topics
- “How do you monitor a Kubernetes cluster?”
- “Describe the three pillars of observability.”
- “What is an SLO and how does it relate to monitoring?”
12. DevSecOps
12.1 Security in the DevOps Pipeline
Introduction
DevSecOps integrates security practices into every stage of the software delivery lifecycle, making security a shared responsibility rather than a final gate.
Why It Matters
With the rise of supply chain attacks and cloud breaches, security must be embedded from code to production. Compliance and risk management demand automated, continuous security.
Fundamental Concepts
Shift left, Zero Trust, least privilege, defense in depth. SAST (Static Application Security Testing), DAST (Dynamic), SCA (Software Composition Analysis), container scanning, secret scanning.
Intermediate Topics
- SAST tools: SonarQube, Semgrep, CodeQL
- SCA/dependency scanning: Snyk, Dependabot, OWASP Dependency‑Check
- Container image scanning: Trivy, Grype, Clair
- IaC scanning: tfsec, Checkov, Kics, Terrascan
- Secrets detection: git‑secrets, TruffleHog, GitGuardian
Advanced Topics
- Supply chain security: SLSA framework, Sigstore (Cosign, Rekor, Fulcio), in‑toto attestations
- Policy as code: OPA/Rego, Kyverno, Sentinel
- Runtime security: Falco, Tetragon (eBPF), seccomp profiles, AppArmor
- Vulnerability management lifecycle and patch SLAs
- Automated compliance: Prowler, InSpec, compliance‑as‑code
Enterprise Practices
- Security champions program and threat modelling
- Centralised secrets management (Vault, AWS Secrets Manager)
- Building an internal secure software development policy
- Continuous authorization and zero‑trust networking with service mesh
Popular Tools
Trivy, Snyk, SonarQube, Checkov, Vault, Falco, Kyverno.
Best Practices
- Never store secrets in code or images
- Run security scans in CI and block on critical vulns
- Rotate credentials and use short‑lived tokens
- Maintain an up‑to‑date SBOM for all artifacts
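Blocking a build on critical vulnerabilities can be sketched as a small gate over a scanner's JSON report. The report shape used here (a top-level `Results` list whose entries carry `Vulnerabilities` with `Severity` and `VulnerabilityID` fields) is modelled on Trivy's JSON output but should be treated as an assumption and checked against the scanner version in use; the CVE identifiers are placeholders.

```python
import json
import sys


def blocking_findings(report: dict, block_on=frozenset({"CRITICAL"})) -> list[str]:
    """Return the IDs of findings whose severity should fail the pipeline."""
    ids = []
    for result in report.get("Results", []):
        # Some scanners emit null instead of an empty list when nothing is found.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in block_on:
                ids.append(vuln.get("VulnerabilityID", "unknown"))
    return ids


if __name__ == "__main__":
    sample = json.loads("""
    {"Results": [{"Target": "app:latest", "Vulnerabilities": [
        {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
        {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}
    ]}]}
    """)
    found = blocking_findings(sample)
    print("blocking:", found)
    sys.exit(1 if found else 0)  # a non-zero exit code fails the CI job
```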
Common Mistakes
- Focusing only on perimeter security
- Alert fatigue from unfiltered vulnerability scanners
- Not scanning for secrets in git history
- Treating compliance as a one‑off audit
Real‑World Use Cases
- Netflix’s Security Monkey and Repokid for cloud security automation
- Google’s Binary Authorization for only trusted images on GKE
- Capital One’s adoption of cloud‑native security tooling after their breach
Troubleshooting Topics
- False positives in SAST/DAST
- Image pulling blocked by admission controller
- Vault token renewal failures
Learning Resources
- OWASP Top 10, DevSecOps Hub
- Kubernetes Security by Liz Rice
- Hands‑on: KodeKloud DevSecOps path
Projects
- Build a CI pipeline that fails on critical CVE in container image
- Implement secret management with External Secrets Operator
- Create a Falco rule to detect suspicious exec in containers
Interview Topics
- “How would you secure a CI/CD pipeline?”
- “What is the difference between SAST and DAST?”
- “Explain the principle of least privilege in a Kubernetes context.”
13. Site Reliability Engineering (SRE)
13.1 Reliability, Operations, and Chaos Engineering
Introduction
SRE applies software engineering principles to operations, focusing on automating toil, measuring reliability via SLOs, and managing risk through error budgets.
Why It Matters
SRE bridges the gap between product velocity and operational stability. It provides a data‑driven framework for trade‑off decisions and ensures services remain reliable while evolving.
Fundamental Concepts
SLI (Service Level Indicator), SLO (Objective), SLA (Agreement). Error budgets, toil, automation, and the concept of “Hope is not a strategy.”
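An error budget is just the complement of the SLO applied to a window. The arithmetic below is a sketch with hypothetical function names and example figures.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_consumed(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget already spent, for a request-based SLI."""
    allowed_bad = (1 - slo) * total_events
    actual_bad = total_events - good_events
    return actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of unavailability.
    print(round(error_budget_minutes(0.999), 1))
    # 500 failed requests out of 1,000,000 spends about half of a 99.9% budget.
    print(round(budget_consumed(0.999, 999_500, 1_000_000), 2))
```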
Intermediate Topics
- Defining meaningful SLIs (latency, availability, throughput, error rate)
- Error budget policies: burn rate alerts, freezing changes
- Incident management lifecycle: detection, response, blameless postmortem
- Monitoring and observability through an SRE lens
- Capacity planning and load testing
Advanced Topics
- Chaos engineering: principles, chaos‑day, game‑day exercises
- Tools: LitmusChaos, Gremlin, Chaos Mesh, AWS Fault Injection Simulator
- Toil elimination through runbook automation and self‑healing
- Reliability across multi‑region and multi‑cloud architectures
- Advanced error budget statistical analysis
Enterprise Practices
- SRE organization models (embedded, consulting, platform)
- Automated incident response and runbook execution
- Reliability scoring and team dashboards
- SLO‑based release gating
Popular Tools
Prometheus/Alertmanager for SLO monitoring, PagerDuty/Opsgenie, LitmusChaos, Gremlin.
Best Practices
- Start SLOs from user journeys, not server metrics
- Use multi‑window, multi‑burn‑rate alerts
- Write blameless postmortems with action items
- Automate toil away; never accept repetitive manual work
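Multi-window, multi-burn-rate alerting can be sketched in a few lines. Burn rate is the observed error rate divided by the error rate the SLO allows; the 14.4 threshold with 1-hour and 5-minute windows follows the example in the Google SRE Workbook, while the function names and input values are assumptions.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / (1 - slo)


def should_page(err_1h: float, err_5m: float, slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets quickly after recovery.
    """
    return burn_rate(err_1h, slo) > threshold and burn_rate(err_5m, slo) > threshold


if __name__ == "__main__":
    print(should_page(err_1h=0.02, err_5m=0.02))   # sustained burn
    print(should_page(err_1h=0.02, err_5m=0.001))  # already recovering
```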
Common Mistakes
- Setting SLO targets arbitrarily (e.g., 99.999% for everything)
- Not enforcing error budgets, leading to unreliable features
- Skipping postmortems after incidents
- Measuring reliability only by uptime
Real‑World Use Cases
- Google’s SRE teams managing Search, Gmail with tight SLOs
- Dropbox’s SRE‑driven migration to gRPC with error budget adoption
- Target’s chaos engineering program before Black Friday
Troubleshooting Topics
- SLO burn rate alert triggered but no customer impact
- Incident command handoff failures
- Chaos experiment causing unexpected cascading failures
Learning Resources
- Google SRE books (free online)
- SRE Workbook and Seeking SRE
- Chaos Engineering by Casey Rosenthal
Projects
- Define SLIs and SLOs for a sample microservice and create a dashboard
- Automate a runbook using a script triggered by an alert
- Design a chaos experiment to test database failover
Interview Topics
- “How would you choose an SLO for a payment API?”
- “Explain an error budget and how it affects development speed.”
- “Walk through a recent incident and how you handled it.”
14. Platform Engineering
14.1 Internal Developer Platforms
Introduction
Platform Engineering builds self‑service internal platforms that abstract infrastructure complexity, offering paved roads (golden paths) for developers while reducing cognitive load.
Why It Matters
It addresses the “you build it, you run it” overload, enabling developers to focus on business logic without sacrificing autonomy. It’s the evolution of DevOps at scale.
Fundamental Concepts
Platform as a Product, Internal Developer Platform (IDP), golden paths, self‑service, developer experience (DevEx) metrics. Backstage as a developer portal.
Intermediate Topics
- Building a service catalog with Backstage and Software Templates
- Composable platforms with Crossplane, Terraform modules behind APIs
- Scaffolding tools: Yeoman, Cookiecutter, custom CLI generators
- Measuring platform adoption and satisfaction (DORA, SPACE, DevEx frameworks)
Advanced Topics
- Platform orchestration: Kratix, Humanitec, Port
- Federation and multi‑platform architectures
- Policy‑driven platform with OPA and admission control
- Dynamic environment provisioning and ephemeral environments
- Platform observability and cost chargeback
Enterprise Practices
- Treating platform as a product with dedicated PM, roadmap, and SLAs
- Platform conformance testing and certification
- Integrating security and compliance into golden paths by default
- Building a community of practice around platform usage
Popular Tools
Backstage, Crossplane, Humanitec, Port, Kratix, Scaffolder.
Best Practices
- Start with the thinnest viable platform (MVP), then iterate
- Gather developer feedback continuously
- Make the golden path the path of least resistance
- Avoid building a platform that locks teams in; enable escape hatches
Common Mistakes
- Building a platform no one asked for (ivory‑tower architecture)
- Over‑abstraction leading to flexibility loss
- Neglecting documentation and onboarding experience
Real‑World Use Cases
- Spotify’s Backstage, now open‑source and used by thousands
- Humanitec powering IDPs at enterprises like Bosch
- Monzo’s internal platform for safe, fast deployments
Troubleshooting Topics
- Template rendering failures in Backstage
- Self‑service provisioning stuck due to cloud API quota
- Developer experience degradation due to platform API latency
Learning Resources
- Team Topologies (book)
- Backstage documentation and plugin ecosystem
- Platform engineering community (platformengineering.org)
Projects
- Set up a Backstage instance and create a software template
- Build a simple self‑service API using Crossplane to provision a database
- Design a golden path with Terraform modules and a CLI wrapper
Interview Topics
- “What is the difference between Platform Engineering and DevOps?”
- “How would you measure the success of an IDP?”
- “Describe a golden path and how to enforce it.”
15. Databases, Messaging & Storage
15.1 Data Layer for Cloud Native Systems
Introduction
Stateful workloads require careful handling in dynamic environments. Choosing the right database, cache, and message broker directly impacts scalability, consistency, and resilience.
Why It Matters
Data is the hardest part of distributed systems. DevOps engineers must understand replication, failover, backups, and performance tuning to keep applications reliable.
Fundamental Concepts
SQL (PostgreSQL, MySQL) and NoSQL (MongoDB, DynamoDB, Cassandra). Caching (Redis, Valkey). Message brokers (RabbitMQ, Kafka, NATS). Event sourcing, CQRS.
Intermediate Topics
- Connection pooling, read replicas, sharding
- Managed cloud services (RDS, Aurora, Cloud SQL, Cosmos DB)
- Kafka: topics, partitions, consumer groups, exactly‑once semantics
- Elasticsearch for full‑text search and analytics
- Redis clustering and persistence options
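Key-based partitioning, which underlies Kafka topics and sharded stores alike, can be sketched with a stable hash. CRC32 is used here only as a stand-in that is deterministic across processes; it is not the hash Kafka's default partitioner uses, and the function names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always lands on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(keys, num_partitions):
    """Group keys by partition to show how load spreads out."""
    layout = {p: [] for p in range(num_partitions)}
    for key in keys:
        layout[partition_for(key, num_partitions)].append(key)
    return layout


if __name__ == "__main__":
    events = ["customer-1", "customer-2", "customer-1", "customer-3"]
    for partition, keys in assign(events, 3).items():
        print(partition, keys)
```

Because ordering is only guaranteed within a partition, routing every event for one customer to the same partition preserves per-customer order. Changing the partition count remaps keys, which is one reason repartitioning needs planning.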
Advanced Topics
- Operators for databases on Kubernetes (Crunchy Data for PostgreSQL, Strimzi for Kafka)
- Distributed SQL (CockroachDB, YugabyteDB)
- Vector databases for AI (pgvector, Pinecone, Weaviate)
- Stream processing with Kafka Streams, Flink, or RisingWave
- Backup and point‑in‑time recovery strategies, RPO/RTO
Enterprise Practices
- Multi‑AZ and cross‑region replication
- Automated failover and chaos testing of data layer
- Data encryption at rest and in transit
- Schema migration strategies in CI/CD (Flyway, Liquibase)
Storage Systems
Block (EBS, managed disks), object (S3, MinIO, Ceph), file (EFS, Azure Files). Container‑native storage: Rook, Longhorn, OpenEBS.
Best Practices
- Treat database changes as code, apply with same pipeline
- Monitor query performance and index usage
- Separate OLTP and OLAP workloads
- Use connection pooling and circuit breakers
Common Mistakes
- Using the wrong consistency model for the problem
- Ignoring backup verification; backups don’t exist until restored
- Over‑loading a single broker topic without partitioning
Real‑World Use Cases
- Uber’s migration to a multi‑active database architecture
- Netflix’s use of Cassandra for cross‑regional replication
- LinkedIn’s Kafka handling trillions of messages per day
Troubleshooting Topics
- High latency on database due to missing indexes
- Kafka consumer lag alerts
- Redis OOM and eviction policies
Learning Resources
- Designing Data‑Intensive Applications (Kleppmann)
- Official documentation of each system
- A Cloud Guru’s database specialty courses
Projects
- Deploy PostgreSQL on Kubernetes with an operator
- Build a simple message pipeline with Kafka and a consumer
- Set up MinIO as an S3‑compatible object store and integrate with an application
Interview Topics
- “How would you handle a database migration with zero downtime?”
- “Explain CAP theorem in practice.”
- “What strategies exist for caching in a microservices environment?”
16. API Management & Architecture
Introduction
APIs are the contracts between services. Managing them involves design, security, versioning, rate limiting, and providing developer portals.
Why It Matters
A well‑designed API ecosystem accelerates development and enables partnerships. API gateways become critical for north‑south traffic control.
Fundamental Concepts
REST, GraphQL, gRPC (Protobuf). API design best practices, versioning, pagination, error handling, OpenAPI specification.
Intermediate Topics
- API gateways: Kong, Tyk, Apigee, AWS API Gateway, Azure APIM
- Authentication and authorization: OAuth2, JWT, API keys
- Rate limiting, throttling, and quotas
- Developer portals and documentation (Swagger UI, Redoc)
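Rate limiting at a gateway is commonly explained with a token bucket. The class below is a minimal single-process sketch; the name, parameters, and injectable `clock` are assumptions, and real gateways typically keep this state in shared storage so all instances agree.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would answer with 429 Too Many Requests
```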
Advanced Topics
- API as a product: monetization, analytics
- Federation: GraphQL stitching and supergraph
- gRPC load balancing and health checking
- Event‑driven APIs with AsyncAPI
- Service mesh for east‑west API traffic management
Enterprise Practices
- Centralised API management with self‑service onboarding
- API security scanning in CI (OWASP ZAP, 42Crunch)
- Full lifecycle management: design → publish → deprecate
- Observability integration with API analytics
Best Practices
- Design APIs first using OpenAPI
- Use API versioning via URL or headers
- Enforce authentication at the gateway
- Monitor API usage and error rates
Common Mistakes
- Breaking changes without versioning
- Over‑fetching and under‑fetching with REST (leading to GraphQL adoption)
- Not setting rate limits, allowing abuse
Real‑World Use Cases
- Twilio’s API‑first product strategy
- Netflix’s move from Falcor to federated GraphQL
- Google Cloud Endpoints managing external APIs
Troubleshooting Topics
- 429 Too Many Requests due to rate limiting
- CORS errors from misconfigured gateway
- gRPC deadline exceeded vs. connection refused
Learning Resources
- Designing Web APIs (O’Reilly)
- Stoplight, Swagger tools
- GraphQL official tutorial
Projects
- Design and implement a REST API with OpenAPI spec and deploy behind Kong
- Build a GraphQL wrapper over a REST service
- Set up an API gateway with OAuth2 authentication
Interview Topics
- “REST vs GraphQL vs gRPC – when would you choose each?”
- “How do you secure a public API?”
- “What is the role of an API gateway in microservices?”
17. Testing, Architecture & Advanced Topics
17.1 Testing in DevOps
Introduction
Testing is not a phase but a continuous activity. DevOps incorporates unit, integration, performance, security, and chaos testing into the pipeline.
Why It Matters
Pre‑production confidence comes from automated testing. Skipping testing leads to production incidents, broken SLOs, and burned error budgets.
Fundamental Concepts
Testing pyramid: unit, integration, end‑to‑end. TDD, BDD. Performance testing, load/stress testing.
Intermediate Topics
- Infrastructure testing: Terratest, Kitchen, InSpec
- Policy testing: Conftest, OPA unit tests
- Contract testing (Pact)
- Chaos engineering as testing
- Synthetic monitoring as post‑deployment validation
Advanced Topics
- Continuous verification with canary analysis (Argo Rollouts + Prometheus)
- Fuzzing for security
- Load generation tools: k6, Locust, JMeter
- Testing in production with feature flags and progressive delivery
Enterprise Practices
- Shift‑left and shift‑right testing combined
- Testing environments as code, on‑demand
- Automated test result dashboards and quality gates
Best Practices
- Tests must be fast and reliable; flaky tests undermine trust
- Use real‑world production traffic replay for load testing
- Automate everything, but also perform exploratory testing
Common Mistakes
- Only testing happy paths
- Slow integration tests blocking CI
- Not testing disaster recovery procedures
Real‑World Use Cases
- Google’s DiRT (Disaster Recovery Testing) exercises
- Amazon’s use of automated canary deployments
- GitHub’s testing of merge queue with thousands of tests
Troubleshooting Topics
- Flaky test root‑cause analysis
- Load test reveals bottleneck, debugging stack traces
- Chaos experiment causing cascading failure
Learning Resources
- Continuous Delivery (Humble & Farley)
- k6 documentation
- Chaos Engineering community
Projects
- Write Terratest for a Terraform module and run in CI
- Create a k6 load test script and integrate with GitHub Actions
- Implement contract tests between two microservices
Interview Topics
- “How would you introduce testing to a team that does none?”
- “Explain the testing pyramid and if it still applies.”
- “How do you test infrastructure changes?”
17.2 Architecture Patterns
Introduction
Modern architecture choices (monolith, microservices, event‑driven) shape the operational model. DevOps engineers must understand the trade‑offs to design reliable systems.
Why It Matters
Architecture determines scalability, deployability, and resilience. A poorly chosen architecture can make DevOps practices impossible.
Fundamental Concepts
Monolithic vs distributed, SOA, microservices, event‑driven architecture. CQRS, event sourcing, domain‑driven design (DDD).
Intermediate Topics
- 12‑factor app methodology (and 15‑factor)
- Backend‑for‑frontend (BFF), API composition
- Saga pattern for distributed transactions
- Idempotency and deduplication
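The Saga pattern from the list above can be sketched as a sequence of steps, each paired with a compensating action. This is an in-process, orchestration-style illustration with hypothetical names; a real saga coordinates separate services through messages and must persist its progress.

```python
def run_saga(steps) -> bool:
    """Run (action, compensation) pairs in order; on failure, undo completed steps in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


if __name__ == "__main__":
    log = []

    def charge_card():
        raise RuntimeError("payment declined")

    ok = run_saga([
        (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
        (charge_card, lambda: log.append("refund card")),
        (lambda: log.append("ship order"), lambda: log.append("cancel shipment")),
    ])
    print(ok, log)
```

Only steps that actually completed are compensated, which is why the failed payment step does not trigger a refund.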
Advanced Topics
- Cell‑based architecture (as used by Amazon, DoorDash)
- Multi‑runtime microservices with Dapr
- Serverless orchestration (Step Functions, Durable Functions)
- Event‑driven data mesh for analytics
Enterprise Practices
- Well‑architected framework reviews across all clouds
- Architecture decision records (ADRs) in source control
- Fitness functions to continuously validate architectural qualities
Best Practices
- Start with monolith unless microservices are absolutely needed
- Design for failure from day one
- Use async messaging to decouple services
Common Mistakes
- Microservices without proper DevOps maturity (distributed monolith)
- Ignoring eventual consistency implications
- Over‑engineering for scale that never arrives
Real‑World Use Cases
- Amazon’s rule: teams communicate through APIs; cell‑based architecture
- Netflix’s microservices with Hystrix for resilience
- Uber’s domain‑oriented microservices
Troubleshooting Topics
- Latency propagation in deeply chained services
- Event duplication in distributed messaging
- Data inconsistency across services
Learning Resources
- Building Microservices (Sam Newman)
- Domain‑Driven Design (Eric Evans)
- Architecture Katas for practice
Projects
- Refactor a monolithic app into two services with an event bus
- Design an event‑driven order system with CQRS
- Implement a 12‑factor app checklist
Interview Topics
- “Monolith vs microservices: how do you decide?”
- “Explain the Saga pattern.”
- “What is domain‑driven design and why does it matter for DevOps?”
18. MLOps, AIOps, and Emerging Trends
18.1 MLOps
Introduction
MLOps extends DevOps principles to machine learning, covering data versioning, experiment tracking, model training pipelines, deployment, and monitoring.
Why It Matters
As AI becomes ubiquitous, reliable ML delivery pipelines are critical. MLOps ensures reproducibility, governance, and operational excellence for ML models.
Fundamental Concepts
Data versioning (DVC, LakeFS), experiment tracking (MLflow, W&B), model registry, feature stores (Feast, Tecton), training pipelines.
Intermediate Topics
- Orchestration with Kubeflow, Airflow, Prefect
- Model serving: Seldon Core, BentoML, KServe (formerly KFServing)
- CI/CD for ML: building, testing, deploying models
- Drift detection and model retraining triggers
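Drift detection can be illustrated with the Population Stability Index (PSI), one of several possible measures. The sketch assumes the feature has already been bucketed into the same bins for training and production data; the function name and the sample counts are hypothetical.

```python
import math


def psi(expected_counts, actual_counts, eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp proportions so empty bins do not cause log(0) or division by zero.
        pe = max(e / e_total, eps)
        pa = max(a / a_total, eps)
        score += (pa - pe) * math.log(pa / pe)
    return score


if __name__ == "__main__":
    training = [50, 30, 20]
    production = [20, 30, 50]
    print(round(psi(training, training), 3))
    print(round(psi(training, production), 3))
```

A commonly cited rule of thumb treats values above roughly 0.25 as significant drift, but the threshold that should trigger retraining depends on the model and is best tuned against observed performance.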
Advanced Topics
- MLOps on Kubernetes with GPU scheduling (Kubeflow + NVIDIA GPU Operator)
- ML metadata and lineage
- A/B testing models in production
- MLOps at massive scale (Ray, Ray Serve)
Enterprise Practices
- Model governance, explainability, and auditability
- Centralised feature store across teams
- Automated retraining pipelines based on metric degradation
Best Practices
- Treat data as code; version datasets
- Monitor model performance (accuracy, fairness, drift)
- Automate the full ML lifecycle
Common Mistakes
- Not versioning data, making experiments unreproducible
- Deploying models without monitoring
- Ignoring operational overhead of GPUs
Real‑World Use Cases
- Netflix’s recommendation pipeline with Metaflow and Kubeflow
- Uber’s Michelangelo platform
- Google’s TFX for production ML
Troubleshooting Topics
- Model serving latency spikes
- Training job OOM on GPU
- Feature store inconsistency
Learning Resources
- Introducing MLOps (O’Reilly)
- Kubeflow and MLflow documentation
- MLOps community (mlops.community)
Projects
- Set up MLflow for experiment tracking and register a model
- Build a Kubeflow pipeline that trains and deploys a model
- Implement a drift detection monitor for a production model
Interview Topics
- “How does MLOps differ from DevOps?”
- “What is a feature store and why is it important?”
- “Explain the ML lifecycle from data to production.”
18.2 AIOps
Introduction
AIOps applies AI/ML to IT operations data to automate anomaly detection, root cause analysis, and remediation.
Why It Matters
As systems grow complex, AIOps reduces alert noise, accelerates incident resolution, and enables proactive operations.
Fundamental Concepts
Event correlation, anomaly detection, log pattern recognition, predictive alerting, automated runbooks.
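Anomaly detection on a metric can be sketched with a rolling z-score, which is far simpler than what commercial AIOps platforms do but shows the principle. The window size, threshold, and sample series below are assumptions.

```python
import statistics


def anomalies(series, window: int = 5, threshold: float = 3.0) -> list[int]:
    """Return indexes whose value deviates strongly from the preceding window."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            # Flat history: any different value counts as an anomaly.
            if series[i] != mean:
                flagged.append(i)
            continue
        if abs(series[i] - mean) / stdev > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    requests_per_second = [10, 11, 10, 9, 10, 11, 50, 10, 10]
    print(anomalies(requests_per_second))
```

Note that the spike inflates the window that follows it, so nearby points are judged against a noisier baseline; that is one reason curated data and more robust statistics matter in practice.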
Intermediate Topics
- AIOps platforms: Moogsoft, BigPanda, Dynatrace Davis, Splunk ITSI
- Integration with observability and incident management
- Training models on incident history for RCA suggestions
Advanced Topics
- Generative AI for runbook generation and natural language querying of systems
- Autonomous operations and self‑healing pipelines
- AI‑driven capacity forecasting and FinOps
Enterprise Practices
- Augmenting NOC/SRE with AIOps insights
- Building custom AIOps with ML on observability data
- Ethical considerations and human‑in‑the‑loop
Best Practices
- Start with data quality; AIOps is only as good as the data
- Use AI to enrich alerts, not replace humans
- Combine with chaos engineering for training
Common Mistakes
- Expecting magic without curated data and incident labeling
- Over‑reliance on black‑box recommendations
Real‑World Use Cases
- eBay’s AIOps for reducing MTTR
- Intuit’s anomaly detection on financial services
- Large banks using AIOps for compliance and fraud operations
Troubleshooting Topics
- AIOps false positives flooding on‑call
- Model drift causing missed incidents
Learning Resources
- AIOps: Artificial Intelligence for IT Operations (O’Reilly)
- Vendor‑specific training
- AIOps Exchange community
Projects
- Build a simple anomaly detection on Prometheus metrics using Prophet
- Create a Slack bot that suggests runbooks based on alert content
Interview Topics
- “How would you implement an AIOps strategy?”
- “What’s the difference between AIOps and traditional monitoring?”
18.3 Emerging Technologies (2026+)
- eBPF: Cilium, Tetragon, and continuous profiling revolutionising networking, security, and observability without kernel changes.
- WebAssembly (Wasm): Serverless functions on edge, plugin extensibility in Envoy, and container alternatives.
- Confidential Computing: Encrypted data in use via hardware‑enclaves (Intel SGX, AMD SEV), safeguarding sensitive workloads.
- Agentic Operations: AI agents that autonomously plan, execute, and remediate infrastructure tasks, bridging LLMs and DevOps toolchains.
- GreenOps: Sustainability‑aware cloud operations, carbon‑aware scheduling, and energy‑efficient architectures.
- Cloud‑Native AI Infrastructure: Kubernetes‑native orchestration of LLMs (vLLM, KServe), GPU sharing, and Ray for distributed ML.
- Policy‑as‑Code Everywhere: AI‑verified policies for security, cost, and architecture using OPA and new DSLs.
- Edge & Distributed Cloud: 5G/MEC, Cloudflare Workers, Fastly Compute@Edge, and AWS Outposts bringing compute closer.
19. FinOps
Introduction
FinOps is the cultural practice of managing cloud costs, where engineering, finance, and business teams collaborate to maximise business value.
Why It Matters
Cloud spend can spiral without governance. FinOps ensures every dollar is accounted for, forecasted, and optimised without slowing innovation.
Fundamental Concepts
Cost allocation, tagging, showback/chargeback, Reserved Instances/Savings Plans, spot/preemptible instances, rightsizing.
Intermediate Topics
- Tools: Kubecost, CloudHealth, Cloudability, AWS Cost Explorer, Cloud Custodian
- Budget alerts and anomaly detection
- Unit economics: cost per API call, per customer
- Optimising Kubernetes cost (requests vs actual usage, overprovisioning)
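The requests-versus-usage comparison from the last item can be sketched as a simple report. The input format (workload name mapped to requested and used CPU cores), the 50% utilisation cut-off, and the function name are assumptions; tools such as Kubecost derive similar figures from cluster metrics.

```python
def overprovisioned(workloads: dict, min_utilisation: float = 0.5) -> dict:
    """Return idle CPU cores for workloads using less than `min_utilisation` of their request.

    workloads maps a name to (cpu_requested_cores, cpu_used_cores).
    """
    report = {}
    for name, (requested, used) in workloads.items():
        utilisation = used / requested
        if utilisation < min_utilisation:
            report[name] = round(requested - used, 3)
    return report


if __name__ == "__main__":
    observed = {
        "api": (2.0, 0.4),     # 20% utilised
        "worker": (1.0, 0.9),  # 90% utilised
    }
    print(overprovisioned(observed))
```

Multiplying the idle cores by a per-core price turns the report into the kind of unit-cost figure engineers can act on.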
Advanced Topics
- Automated cost‑aware scheduling (kube‑downscaler)
- Continuous cost optimisation with AI recommendations
- Multi‑cloud cost management and commitment strategies
- Integrating FinOps into CI/CD (cost estimation on pull requests)
Enterprise Practices
- FinOps Foundation frameworks and maturity model
- Chargeback models in self‑service platforms
- Sustainability metrics alongside cost
Best Practices
- Enforce tagging from day one
- Make cost visible to engineers
- Implement automated kill‑switches for non‑production environments
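Tag enforcement is easiest to reason about as a small policy check. The Python sketch below assumes a hypothetical required-tag policy (`team`, `environment`, `cost-center`) and made-up resource IDs; in practice the same check would run in CI or as an admission/policy rule.

```python
# Hypothetical tagging policy; adjust the keys to your organisation.
REQUIRED_TAGS = ("team", "environment", "cost-center")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return required tag keys that are absent or blank."""
    return [key for key in required if not resource_tags.get(key, "").strip()]


def audit(resources):
    """Map each non-compliant resource ID to its missing tags."""
    violations = {}
    for resource_id, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            violations[resource_id] = missing
    return violations


resources = {
    "i-001": {"team": "payments", "environment": "prod", "cost-center": "42"},
    "i-002": {"team": "search", "environment": " "},
    "bucket-logs": {},
}
print(audit(resources))
```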
Common Mistakes
- Treating FinOps as purely a finance function
- Ignoring orphaned resources (unattached EBS volumes, idle Elastic IPs)
- Not leveraging spot instances for fault‑tolerant workloads
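Finding orphaned resources usually comes down to filtering an inventory. The following Python sketch works on an assumed inventory format (the `attached_to` and `detached_at` fields are hypothetical) and flags volumes that have been unattached for longer than a grace period; fetching the inventory from a cloud API is left out.

```python
from datetime import datetime, timedelta, timezone


def orphaned_volumes(volumes, now, min_age_days=7):
    """Return IDs of volumes unattached for at least min_age_days."""
    cutoff = now - timedelta(days=min_age_days)
    return [
        v["id"]
        for v in volumes
        if not v["attached_to"] and v["detached_at"] <= cutoff
    ]


now = datetime(2026, 1, 15, tzinfo=timezone.utc)
# Hypothetical inventory records.
volumes = [
    {"id": "vol-1", "attached_to": "i-1", "detached_at": None},
    {"id": "vol-2", "attached_to": None,
     "detached_at": datetime(2026, 1, 1, tzinfo=timezone.utc)},
    {"id": "vol-3", "attached_to": None,
     "detached_at": datetime(2026, 1, 12, tzinfo=timezone.utc)},
]
print(orphaned_volumes(volumes, now))
```

The grace period avoids flagging volumes that were detached only briefly, for example during a migration.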
Real‑World Use Cases
- Atlassian saving millions through FinOps practices
- Spotify’s Cost Insights plugin for Backstage
- AWS’s own internal cost optimisation team
Troubleshooting Topics
- Tracing unexplained cost spikes using detailed billing reports
- Kubernetes pods consuming more than their requests because no limits are set
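A first pass at spotting cost spikes in a daily billing export can be as simple as comparing each day with a trailing window. This Python sketch assumes a plain list of daily totals and uses a mean plus three standard deviations rule; the window, threshold, and numbers are illustrative choices, not what any provider's anomaly detection does.

```python
from statistics import mean, pstdev


def cost_spikes(daily_costs, window=7, threshold=3.0):
    """Return indices of days whose cost exceeds the trailing mean by
    more than `threshold` standard deviations."""
    spikes = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu = mean(history)
        # Floor sigma at 1% of the mean so flat history does not flag noise.
        sigma = max(pstdev(history), 0.01 * mu)
        if daily_costs[i] > mu + threshold * sigma:
            spikes.append(i)
    return spikes


# Hypothetical daily totals in USD.
daily = [100, 102, 98, 101, 99, 100, 103, 180, 101]
print(cost_spikes(daily))
```

Once a day is flagged, the detailed billing report for that date can be grouped by service, account, or tag to find the source.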
Learning Resources
- FinOps Foundation certification & playbooks
- Cloud provider cost optimisation whitepapers
- Kubecost blog
Projects
- Set up Kubecost and identify over‑provisioned workloads
- Build a Lambda that stops dev instances at night and on weekends
- Create a dashboard of cloud spend per team
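For the night-and-weekend shutdown project, the scheduling decision can be kept separate from the cloud API calls so it is easy to test. The Python sketch below shows only that decision logic; the working hours, the `env` and `state` fields, and the instance IDs are assumptions, and the actual stop call (for example via the provider SDK inside the Lambda handler) is omitted.

```python
from datetime import datetime


def should_be_running(now, start_hour=8, stop_hour=19):
    """True on weekdays between start_hour (inclusive) and stop_hour (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < stop_hour


def instances_to_stop(instances, now):
    """Return IDs of running dev instances outside working hours."""
    if should_be_running(now):
        return []
    return [
        i["id"]
        for i in instances
        if i["env"] == "dev" and i["state"] == "running"
    ]


# Hypothetical inventory.
fleet = [
    {"id": "i-dev-1", "env": "dev", "state": "running"},
    {"id": "i-dev-2", "env": "dev", "state": "stopped"},
    {"id": "i-prod-1", "env": "prod", "state": "running"},
]

monday_night = datetime(2026, 1, 5, 23, 0)  # 5 January 2026 is a Monday
print(instances_to_stop(fleet, monday_night))
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, lets the schedule be tested for any day and hour; time zone handling is left to the caller.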
Interview Topics
- “How would you reduce a company’s AWS bill by 30%?”
- “Explain spot instances and their risks.”
- “What is a committed use discount?”
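The committed use discount question often turns on break-even arithmetic: a commitment is paid for every hour of the term, so it only wins if utilisation stays above the ratio of the committed rate to the on-demand rate. A small Python sketch, with hypothetical rates and a one-year term assumed:

```python
def break_even_utilisation(on_demand_rate, committed_rate):
    """Fraction of the term a resource must run for the commitment to pay off."""
    return committed_rate / on_demand_rate


def commitment_savings(on_demand_rate, committed_rate, hours_used, hours_in_term):
    """Positive when the commitment is cheaper than paying on demand."""
    on_demand_cost = on_demand_rate * hours_used
    committed_cost = committed_rate * hours_in_term  # paid whether used or not
    return on_demand_cost - committed_cost


# Hypothetical rates: 1.00 USD/h on demand, 0.70 USD/h committed (30% discount).
HOURS_IN_YEAR = 8760
print(break_even_utilisation(1.00, 0.70))
print(commitment_savings(1.00, 0.70, hours_used=8000, hours_in_term=HOURS_IN_YEAR))
print(commitment_savings(1.00, 0.70, hours_used=5000, hours_in_term=HOURS_IN_YEAR))
```

With a 30% discount the break-even point is about 70% utilisation: at 8,000 hours the commitment saves money, at 5,000 hours it costs more than on demand.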
20. Career Paths, Certifications, and Interview Prep
20.1 Career Paths
- DevOps Engineer: Builds and maintains CI/CD, IaC, and operational tooling.
- Cloud Engineer/Architect: Designs cloud infrastructure, migration, and governance.
- Site Reliability Engineer: Focuses on reliability, SLOs, and incident management.
- Platform Engineer: Creates internal developer platforms and golden paths.
- DevSecOps Engineer: Integrates security into the full delivery pipeline.
- MLOps Engineer: Manages ML lifecycle from data to deployment.
- FinOps Practitioner: Optimises cloud spend and bridges engineering/finance.
- Cloud Security Engineer: Cloud posture management, threat detection, compliance.
- Developer Experience (DevEx) Engineer: Improves workflows, tooling, and productivity.
20.2 Certifications (2026 landscape)
- AWS: Cloud Practitioner, Solutions Architect (Associate/Pro), DevOps Engineer Pro, Security Specialty, Advanced Networking.
- Azure: AZ‑900, AZ‑104, AZ‑305, AZ‑400 (DevOps), AZ‑500.
- Google Cloud: Associate Cloud Engineer, Professional Cloud Architect, Professional Cloud DevOps Engineer, Professional Data Engineer.
- Kubernetes: KCNA, CKA, CKAD, CKS.
- HashiCorp: Terraform Associate, Vault Associate.
- Linux: LFCS, RHCSA, RHCE.
- FinOps: FinOps Certified Practitioner.
- CNCF: Prometheus Certified Associate (PCA), Cilium Certified Associate (CCA), Istio Certified Associate (ICA).
- Security: CISSP, CCSP, AWS/Azure Security.
20.3 Interview Preparation
- Beginner: Linux basics, simple CI/CD pipelines, Docker, basic networking, cloud fundamentals.
- Intermediate: Kubernetes troubleshooting, IaC design, monitoring/alerting setups, incident response scenarios.
- Senior: System design (cloud‑native, multi‑region), SLO/SLI definition, chaos engineering, cost optimisation.
- Staff/Principal: Organisational DevOps transformation, platform strategy, reliability at massive scale, influencing without authority.
- Scenario questions: “Your production database is down – walk me through your response.” “Design a multi‑cloud active‑active architecture.”
- Architecture design whiteboarding: common enterprise patterns, trade‑off discussions.
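SLO questions at the senior level often come down to error budget arithmetic, which is worth being able to do on a whiteboard. A minimal Python sketch, assuming an availability SLO measured over a rolling 30-day window:

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Error budget left after the downtime observed so far in the window."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% over 30 days: {error_budget_minutes(slo):.2f} minutes")
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget, so a single 30-minute incident consumes most of it.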
20.4 Projects (Complete List)
Beginner
- Static website on S3 + CloudFront with CI/CD via GitHub Actions.
- Containerise a Go/Python web app and push to a registry.
- Deploy a container to a managed Kubernetes service.
Intermediate
- Microservices app with Helm, Ingress, and Prometheus monitoring.
- GitOps with Argo CD – auto‑sync a cluster.
- Centralised logging with Loki and Grafana.
Advanced
- Service mesh (Istio/Linkerd) with mTLS and observability.
- Multi‑cloud Kubernetes federation with GitOps.
- Chaos engineering experiment suite with LitmusChaos.
Enterprise
- Internal Developer Platform with Backstage, Crossplane, and self‑service APIs.
- Compliance‑as‑code pipeline (CIS hardening, SBOM, SLSA level 3).
- FinOps automation: rightsizing, scheduling, and chargeback.