7 Days Terraweek Challenge: Day 1

Unveiling Terraform's Magic: Day 1 of 7 in the Terraweek Adventure!


Hey there, fabulous folks! It's time to buckle up for an electrifying ride through the world of Terraform. Day 1 of the Terraweek challenge is here, and we're diving headfirst into the heart of this revolutionary technology. Prepare to be amazed, inspired, and empowered as we unravel the mysteries of Terraform together.

Objectives of Day 1

  • Introduction to IaC

  • Different Infrastructure as Code Tools

  • Need for Terraform

  • Success Stories from Companies using Terraform

  • Terraform Installation on Various Environments (Windows, Linux, etc.)

  • HCL Language: Basic Syntax

  • Providers, Variables, Resource Attributes and Terraform State

Traditional Deployment Model: Steps and Drawbacks

Consider an organization that has to roll out a new app. These are the steps in the traditional approach:

Business Identification and Analysis:

  • Business: Identifies key stakeholders and gathers their input on application requirements.

  • Business Analyst: Gathers necessary business information, analyzes business needs, and converts them into a set of high-level technical requirements.

Solution Architecture and Deployment Planning:

  • Solution Architect: Receives information from the Business Analyst and then assists in architecting the framework for app development and subsequent deployment. This includes infrastructure considerations such as specifications, the number of servers needed for frontend, backend, Load Balancer, Databases, Firewalls, etc.

Infrastructure Setup and Configuration:

  • Infrastructure Deployment: The organization deploys the app in an on-premises environment, requiring the construction of a data center within the company, incurring significant costs on data center assets.

  • Procurement: If additional resources are needed, the Tech Lead requests them from the procurement team, who then places new orders with hardware providers. However, these orders may take days, weeks, or even months to fulfill.

  • Infrastructure Team: Once resources are acquired, they are responsible for installing and setting up hardware on racks and stacking equipment.

  • System Configuration: The System Administrator performs the initial configuration, followed by the Network Administrator ensuring the system is accessible on the network.

  • Storage and Backup: The Storage Administrator assigns storage to servers and manages storage resources, while the Backup Administrators configure backups.

  • Application Deployment: Finally, once the system is set up according to organizational standards, the Application Team deploys the app.

Drawbacks of Traditional Model:

This commonly seen deployment model has significant drawbacks. It is time-consuming, with turnaround times ranging from weeks to months. Scaling up or down is slow and costly. While some infrastructure provisioning tasks can be automated, others, such as racking, stacking, and cabling, remain manual and slow. Moreover, involving multiple teams increases the chances of errors and inconsistency. System resources are also underutilized: servers are sized for peak capacity but rarely run at it, leading to inefficiency and waste.

Transition to Cloud:

As a result, many companies are turning to virtualization and cloud service providers such as AWS, GCP, and Azure to mitigate these challenges and improve deployment efficiency. After moving to the cloud, companies saw a huge difference in time to market, with timelines shrinking from months to weeks and from weeks to days. This is because organizations now only have to worry about the application itself, as everything from data centers to networks and storage is fully managed by the cloud provider.

Advantages of Cloud Computing: The best part is that in a virtual environment, operating systems can be spun up in a matter of minutes, and with containers, in a matter of seconds. Work that used to take months in the traditional model now reaches the market in weeks or even days. This acceleration also comes with cost reduction, since less human effort and data center management are required.

Automation and Scalability with Cloud: Cloud providers offer APIs that automate tasks and reduce human error, enabling faster app deployment. The cloud's scale-up and scale-down capabilities reduce resource wastage by dynamically adjusting resources based on demand. With virtualization, we can now provision infrastructure in just a few steps, significantly faster than the traditional approach.

Evolution towards Infrastructure as Code (IaC)

After this revolution, companies started using programming languages to script and automate certain processes. Tools such as shell scripts and programming languages like Ruby, PowerShell, Python, etc., were utilized. Everyone was adopting a similar approach in a consistent fashion, using cloud APIs to automate tasks. This led to the evolution of Infrastructure as Code (IaC), which is widely used today.

Different Infrastructure as Code Tools

Most Commonly Used Infrastructure as Code (IaC) Tools:

We know that various cloud infrastructures can be provisioned through different cloud providers such as GCP, Azure, and AWS. The most effective method for provisioning infrastructure is by codifying the entire setup. This enables one to write and execute code to define, configure, provision, update, and delete infrastructure as needed. This methodology is referred to as Infrastructure as Code.

With Infrastructure as Code (IaC), it becomes possible to manage a wide range of resources and components as infrastructure, including networks, databases, messaging services, and application configurations.

Through Infrastructure as Code, we can define code using simple, human-readable, high-level languages.

Sample Terraform Code

resource "aws_instance" "webserver" {
  ami           = "ami-0c94855ba95c71c99"
  instance_type = "t2.micro"
  tags = {
    Name        = "webserver"
    Environment = "production"
  }
}

Sample Ansible Playbook Code

- name: Provision web server
  hosts: localhost
  tasks:
    - name: Launch EC2 instance
      amazon.aws.ec2_instance:
        name: webserver
        image_id: ami-0c94855ba95c71c99
        instance_type: t2.micro
        tags:
          Environment: production

Although Ansible and Terraform can both be used as Infrastructure as Code (IaC) tools, they have different use cases. Let's discuss them further without delay.

There are various IaC tools available, including Puppet, Chef, Ansible, Docker, Terraform, SaltStack, AWS CloudFormation, Vagrant, etc. While these IaC tools can be used for similar solutions, each has been created for a unique and different purpose based on industry requirements.

IaC can be broadly classified into three types:

  1. Configuration Management

  2. Provisioning Tools

  3. Server Templating

Configuration Management tools:

  • Ansible, Puppet, and SaltStack fall under this category.

  • These are the most commonly used tools for installing and managing software on existing infrastructure, handling tasks such as server configuration, database setup, networking devices, and user creation.

  • They also enforce a standard, consistent code structure.

  • Configurations are easy to add to and update when needed.

  • These tools are designed so that a single run can configure many servers or resources at once.

  • Ansible playbooks can be stored in a version control system such as GitHub and updated as requirements change, allowing the same playbook to be reused many times.

  • A very unique feature of these kinds of tools is their idempotence.

  • Idempotent means they can be run multiple times and will only make changes when the actual state differs from the desired state.

Provisioning Tools:

  • These tools are used to provision infrastructure using simple declarative syntax.

  • Infrastructure components range from virtual machine servers, VPCs, subnets, and security groups to databases such as MySQL and Redis.

  • They also include provisioning storage services such as S3 in AWS.

  • Terraform can provision infrastructure on whichever service providers we choose, based on our requirements.

  • CloudFormation is specifically used to deploy and provision AWS services.

Server Templating Tools:

  • These are tools such as Docker and HashiCorp's Vagrant and Packer.

  • They can be used to create custom images of VMs or containers.

  • The build steps are written as code in a file, and the resulting image bundles all requirements and dependencies.

  • For instance, a Docker custom image can be created using a Dockerfile.

  • The most commonly used server templates are VM images offered on https://www.osboxes.org/, custom AMIs on Amazon AWS, and Docker images on Docker Hub and other container registries.

  • These tools also support immutable infrastructure, unlike configuration management tools.

  • This means once a VM or container is deployed, it is designed to remain unchanged.

  • If changes are needed, instead of modifying running instances the way configuration management tools such as Ansible do, we build a new image and redeploy it, as the Packer sketch below illustrates.
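
As an illustration, here is a minimal Packer template written in HCL (a sketch only: the base AMI, region, and install command are assumptions, not taken from any specific setup):

source "amazon-ebs" "web" {
  ami_name      = "webserver-custom"        # name for the new AMI (illustrative)
  source_ami    = "ami-0c94855ba95c71c99"   # base image to build from (assumed)
  instance_type = "t2.micro"
  region        = "us-east-1"
  ssh_username  = "ec2-user"
}

build {
  sources = ["source.amazon-ebs.web"]

  provisioner "shell" {
    inline = ["sudo yum install -y nginx"]  # bake the dependency into the image
  }
}

Running packer build on this file produces a new AMI; deployments then replace running instances with ones launched from the updated image, which is what keeps the infrastructure immutable.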

Need for Terraform

As discussed earlier, Terraform is popular among Infrastructure as Code tools. It is a free and open-source tool developed by HashiCorp. Terraform can be used to build, validate, plan, apply, and destroy infrastructure with a few simple commands, shown below. One of its most significant features is its ability to deploy infrastructure across multiple platforms, from public clouds such as GCP, AWS, and Azure to private clouds such as OpenStack, in a matter of minutes. You can describe the components of your infrastructure, such as servers, networks, and databases, in a configuration file. Terraform will then manage the provisioning and orchestration of these resources.
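
That lifecycle maps directly onto Terraform's core CLI commands:

terraform init      # download providers and initialize the working directory
terraform validate  # check the configuration for errors
terraform plan      # preview the changes Terraform would make
terraform apply     # create or update the infrastructure
terraform destroy   # tear the infrastructure down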

How does Terraform manage resources across multiple platforms?

  • This is achieved through providers. Providers help manage third-party platforms.

Terraform also manages various service providers, including:

  • Network providers such as F5 BIG-IP, Cloudflare, DNS, and Palo Alto Networks.

  • Monitoring and management tools such as Grafana, Datadog, and Wavefront, along with identity platforms such as Auth0.

  • Databases like MongoDB, MySQL, PostgreSQL, InfluxDB.

  • Version control systems such as GitHub, Bitbucket, etc.

These are just a few examples. Terraform works with and supports over 100 such providers.
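
To see what this looks like in practice, here is a minimal sketch declaring two providers in one configuration (the version constraints are illustrative):

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    github = {
      source  = "integrations/github"
      version = "~> 6.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

provider "github" {
  # credentials come from the GITHUB_TOKEN environment variable
}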


Success Stories from Companies using Terraform

Vodafone UK's Terraform Tale of Empowering Observability

Wondering what strategies and tools Vodafone UK employs to integrate observability as code within its digital infrastructure, particularly focusing on Terraform implementation? Without further delay, let's get started.

Introduction to Vodafone's Transformation

Vodafone UK embarked on a journey from being a traditional telco to becoming a TechCo. During this transition, it evolved into a technology communications company with a significant emphasis on software engineering.

In 2018, they initiated their DevOps journey, which involved significant insourcing and led to rapid growth in scale and size as they effectively delivered and met the demands of the business. Subsequently, in 2019, their site reliability engineering (SRE) journey began, organically arising from the need to accelerate delivery rates due to the scale of resources they were managing.

Llywelyn Griffith-Swain was fortunate to be one of the first four SREs, contributing to the development of many initial implementations. The team has since taken it a step further. Additionally, in 2020, they successfully implemented their full infrastructure as code/zero-touch environments, where all operations are conducted through a CI/CD pipeline. This initiative underscored the necessity for observability as code, as they recognized the importance of treating observability in the same manner as their environments and applications. They firmly believe that these three elements provide a robust framework for software delivery.

Challenges Faced:

  • In 2018, they began insourcing and significantly expanded their developer team.

  • Initially, they relied on a centralized monitoring team for monitors, dashboards, and alerting.

  • As they grew, the developer team became a bottleneck, waiting for monitoring and alerting provisioning before projects could go live.

  • With the DevOps journey emphasizing developer ownership, they needed a solution that allowed for end-to-end control of code.

  • They also faced the challenge of a separate production team handling alerts, requiring efficient monitoring and alerting implementations.

  • Starting with around 150 developers, they sought a solution to promote ownership and provide real-time insight into production status.

  • Given the involvement of Llywelyn Griffith-Swain in the SRE team, the solution had to adhere to SRE principles.

  • Despite having a small team, the solution needed to be automated, enabling self-service for developers and removing bottlenecks.

  • The solution had to be simple, providing developers with control, repeatability, and written as code for easy implementation and scalability.

Tools Used for Solution

Azure DevOps CI/CD: Their Azure DevOps implementation housed the tooling for managing repositories, releases, teams, and user access, and ran their CI/CD pipelines.

PagerDuty: Embarking on their PagerDuty journey, they initially encountered a significant amount of manual configuration, as they had just established their SRE team and were new to using the tool for configuration.

Datadog: A monitoring and analytics platform for cloud-scale applications, offering comprehensive visibility into the performance of applications, servers, databases, and other infrastructure components.

Terraform: Fortunately, when they faced the challenge and examined their toolset, they discovered that Terraform providers existed for both PagerDuty and Datadog, through which they could manage and implement their solution.

AWS: Their current infrastructure and applications are hosted on Amazon Web Services (AWS).

They started with baby steps, which then led to giant leaps of progress.

Starting Small with Synthetic API Tests:

  • They aimed to begin their journey with synthetic API tests, considering them as the easiest to execute.

  • Synthetic API tests required only the endpoint URL for monitoring and understanding the response code.

  • The initial focus was on building monitors for these tests to track performance metrics.

Creating a 50,000-Foot View of Production:

  • After completing the initial step, they moved on to creating a comprehensive overview of production.

  • This overview monitored all incoming traffic, including data passing through their CDN, frontend, backend layers, and other background content.

  • It provided a centralized location to identify any issues that arose in real-time.

Implementing the RED Approach to Monitoring:

  • They recognized the need to understand precisely what was happening with each service.

  • Following the RED approach—rate, error, duration (or latency)—they monitored key metrics such as the number of requests, error counts, and request durations.

  • This approach provided valuable insights into the performance of their services.

Deploying Monitors and On-Call Teams:

  • Despite having service-specific dashboards, they implemented monitors to actively alert them of any changes or issues.

  • This proactive approach ensured that they didn't need to constantly monitor screens for updates.

  • Additionally, they assigned all development teams to be on call, enabling immediate response to any detected issues.

Learning and Growth Along the Way:

  • Through this process, they gained valuable experience and learned important lessons.

  • Each step, from starting small with synthetic API tests to implementing proactive monitoring and on-call teams, contributed to their growth and success in managing their infrastructure effectively.

    Challenge 1: Managing Synthetic Tests with Terraform 0.11

    1. Difficulty with Modularization:

      • They encountered challenges while managing synthetic tests using Terraform 0.11.

      • Utilizing count.index in Terraform's code, they created synthetic tests to monitor website availability.

      • The coding involved a variable called endpoints to monitor multiple websites like vodafone.co.uk and vodafone-trade-in.

    2. Nightmare of Variable Lists:

      • Despite having a module resource to call, managing multiple lists of variables became overwhelming.

      • With over ten similar variables, maintaining them became an absolute nightmare.

Solution: Building Modular with Variable Blocks

  1. Solving the Issue:

    • They found a solution by building modular structures with variable blocks.

    • By restructuring the code into these blocks, it became more human-readable and easier to manage.

  2. Enhanced Readability and Management:

    • Transforming the code into modular blocks improved its readability.

    • It made variable management much simpler, reducing the complexity they previously faced.

Challenge 2: Managing Terraform Version Upgrades

  1. Difficulty with Terraform 0.11:

    • While using Terraform 0.11, they encountered challenges with version upgrades.

    • They utilized a module or resource to build synthetic API tests in Datadog.

  2. Limitations of count.index:

    • In Terraform 0.11, count sized the set of monitors from the length of a variable, and count.index addressed each monitor by position.

    • For example, they provisioned three synthetic monitors: vodafone.co.uk, trade-in, and register-your-interest, each associated with a name.

  3. Issue with Numeric Indexes:

    • The challenge arose when any changes were required.

    • Terraform associated URLs with numeric indexes in its state, making it difficult to change without disrupting the monitor's history.

    • Replacing a URL shifted all subsequent monitors, leading to trust issues with monitor history (see the sketch after this list).
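
Here is a minimal stand-in sketch of the count pattern (not Vodafone's actual code; null_resource substitutes for the real synthetic-test resource):

variable "endpoints" {
  type    = list(string)
  default = ["vodafone.co.uk", "vodafone-trade-in", "register-your-interest"]
}

# count addresses each instance by position: api_test[0], api_test[1], api_test[2].
# Removing the first endpoint shifts every later index, so Terraform plans to
# destroy and recreate the remaining monitors, losing their history.
resource "null_resource" "api_test" {
  count = length(var.endpoints)

  triggers = {
    endpoint = var.endpoints[count.index]
  }
}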

Solution for Challenge 2: Stay Updated with Terraform Versions

  1. Transition to Terraform 0.12:

    • Upgrading to Terraform 0.12 introduced the use of for_each loops.

    • This eliminated the reliance on numerical indexes, replacing them with dictionary values based on monitor names.

  2. Preventing Future Challenges:

    • The adoption of for_each loops (sketched below) ensured that similar issues would not occur in the future.

    • By staying updated with Terraform versions, they avoided potential toil and maintained trust in their monitoring infrastructure.
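
The same stand-in resource rewritten with for_each (reusing the endpoints variable from the earlier sketch) keys state entries by name instead of position:

# for_each keys each instance by value: api_test["vodafone.co.uk"], and so on.
# Removing one endpoint now affects only that instance; the others keep their
# state entries and their monitor history.
resource "null_resource" "api_test" {
  for_each = toset(var.endpoints)

  triggers = {
    endpoint = each.value
  }
}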

The Initial Solution: Enabling Total Infrastructure Visibility

  1. Utilizing Terraform and Python Scripting:

    • They developed a solution involving Terraform code calling a Python script.

    • The Python script interfaced with Azure DevOps to retrieve information about teams, users, and services stored there.

    • Additionally, the script fetched data from AWS, specifically running Amazon ECS tasks, to gain insights into the environment (one possible wiring is sketched after this list).

  2. Automated Provisioning with Terraform:

    • The retrieved data from Azure DevOps and AWS was inputted into Terraform.

    • Terraform utilized this data to provision teams, users, and escalation policies within PagerDuty.

    • This automated the process of putting developers on-call, eliminating manual intervention.

  3. Integration with Datadog:

    • Data from AWS and ECS was fed into Datadog through Terraform.

    • This integration enabled the provisioning of monitors, dashboards, and various performance metrics within Datadog's platform.

  4. Training and Optimization:

    • Initially, developers needed training to be comfortable with the automated on-call system.

    • The implementation revealed challenges such as a significantly large state file, stemming from provisioning numerous developers, services, monitors, and dashboards.
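
The talk does not show the actual code, but one plausible wiring uses Terraform's external data source to run the Python script and feed its JSON output into the PagerDuty provider (the script name and output key are hypothetical):

# Run the hypothetical fetch_teams.py script, which must print a JSON object
# of string values, e.g. {"team_names": "frontend,backend,platform"}.
data "external" "azure_devops_teams" {
  program = ["python3", "${path.module}/fetch_teams.py"]
}

# Provision one PagerDuty team per name returned by the script
# (assumes the PagerDuty provider is configured elsewhere).
resource "pagerduty_team" "teams" {
  for_each = toset(split(",", data.external.azure_devops_teams.result.team_names))
  name     = each.value
}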

Lessons Learned and Optimization:

  1. Identifying Performance Bottlenecks:

    • The implementation highlighted performance issues, such as lengthy Terraform run times.

    • Provisioning over 150 developers, 100 services, and various monitoring resources resulted in a Terraform execution time of 17 minutes.

  2. Optimizing Infrastructure Provisioning:

    • They recognized the need to optimize infrastructure provisioning to reduce execution times and improve efficiency.

    • Addressing issues such as large state files and performance bottlenecks became essential for streamlining operations.

Final Challenge: State File Management

  1. Drawbacks of Large State Files:

    • Managing a large state file could lead to significant delays in delivery rates.

    • These inefficiencies tied directly into the lesson of splitting their state.

Solution for Final Challenge: Splitting the State

  1. Improving Efficiency with State Splitting:

    • They addressed the challenge by splitting their state file.

    • The state file was divided into sections such as PagerDuty users, dashboards, API tests, and monitors (one possible layout is sketched after this list).

  2. Enhanced Performance and Delivery Rates:

    • By splitting the state, Terraform could run much faster.

    • This optimization increased the rate at which they could deliver without becoming a bottleneck.

  3. Provisioning Total Visibility:

    • Despite the initial challenge, splitting the state file enabled them to provision total visibility across their entire estate.

    • It ensured that Terraform's execution speed did not hinder the deployment process.
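
One way to realize such a split (illustrative; the talk does not show their exact layout, and the bucket name is hypothetical) is to give each concern its own root module with its own state key:

# monitors/backend.tf - state for monitors only
terraform {
  backend "s3" {
    bucket = "observability-terraform-state"
    key    = "monitors/terraform.tfstate"
    region = "eu-west-1"
  }
}

# dashboards/backend.tf - a separate, smaller state for dashboards
terraform {
  backend "s3" {
    bucket = "observability-terraform-state"
    key    = "dashboards/terraform.tfstate"
    region = "eu-west-1"
  }
}

Each split state can then be planned and applied independently, so a change to a dashboard no longer waits on a refresh of every user, monitor, and test.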

From the Initial Solution to Final Success Stage

Developing Terraform Modules for PagerDuty and Datadog
The journey began with the SRE team developing Terraform modules for PagerDuty and Datadog. The goal was to provision these services through Terraform, enabling developers to easily utilize them.

Enabling Self-Service Provisioning
Developers could call these services by pulling down pre-built modules from an S3 bucket and supplying input variables (see the sketch below). This self-service approach removed bottlenecks and empowered developers to work at their desired pace.
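
Terraform supports S3 as a module source out of the box; here is a hedged example of consuming such a module (the bucket, archive name, and input variables are hypothetical):

module "service_monitoring" {
  # Pull the pre-built module archive straight from S3.
  source = "s3::https://s3-eu-west-1.amazonaws.com/terraform-modules/datadog-monitoring.zip"

  service_name = "webserver"
  team         = "frontend"
}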

Embracing Collaboration: Developers as Contributors
Surprisingly, developers began submitting pull requests to build modules for new enhancements or technologies. Collaborating with the SRE team, these contributions were approved, stored in the S3 bucket, and made available for all developers to use. This collaborative effort enhanced agility and facilitated rapid development.

Advancing the offering to its final stage completed the journey to a success story.

Expanding to Synthetic Browser Tests

  • Transition from API to Browser Tests: Team progressed from API tests to synthetic browser tests.

  • Customer User Journey Simulation: Synthetic browser tests programmatically simulate customer user journeys.

  • Release Blockers: These tests serve as release blockers, ensuring deployment readiness by identifying potential issues.

  • Enhanced Deployment Confidence: Introduction of synthetic browser tests enhances confidence in deployments by mimicking real user interactions.

  • Improved Test Coverage: Inclusion of browser tests expands test coverage, addressing both backend API functionality and frontend user experience.

Streamlining Environment Validation
  • Utilization of Synthetic Tests: Synthetic tests played a crucial role in validating new performance environments.

  • Effective for Destroy-and-Deploy Setups: Synthetic tests were particularly useful for environments using destroy-and-deploy setups.

  • Automation with Terraform: Terraform automation enabled immediate execution of synthetic tests without manual programming.

  • Enhanced Efficiency and Reliability: The use of Terraform automation enhanced both efficiency and reliability in the validation process.

  • Improved Validation Process: Combined use of synthetic tests and Terraform automation streamlined the environment validation process effectively.

Leveraging Datadog for Metrics and SLOs
  • Implementation of Datadog: Utilized Datadog for comprehensive metric collection.

  • Facilitating SLO Generation: Datadog implementation enabled the generation of Service Level Objectives (SLOs).

  • Automation with Terraform: SLOs were automatically created using Terraform, enhancing efficiency (a sketch follows this list).

  • Consistency Across Environments: Terraform executed at every release ensured consistency by storing each development team's state in S3.

  • Enhanced Monitoring and Automation: Combined use of Datadog and Terraform facilitated enhanced monitoring and automation processes.

The talk closed by walking through the benefits of Terraform-driven observability as code for the business, for developers, and for the SRE team.


Terraform Installation on Various Environments (Windows, Linux, etc.)

Steps to Install on Windows

Step 1: Open your web browser and navigate to the official Terraform website to download Terraform.

Step 2: Find the latest version of Terraform for Windows and download the appropriate ZIP archive. Terraform is distributed as a single binary executable, so no installation is required other than extracting the files from the ZIP archive.

Step 3: Create a folder named "terraform" anywhere on your hard disk. Copy the extracted Terraform binary into this folder, and add the folder's path to your PATH environment variable.

Step 4: Open Command Prompt and type the following command:

terraform --version

You have now successfully installed Terraform on your Windows machine.
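
Steps to Install on Linux (Ubuntu/Debian)

HashiCorp publishes an official apt repository for Linux; the commands below follow HashiCorp's documented installation steps:

wget -O - https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform
terraform --version

On other distributions, you can instead download the ZIP archive from the Terraform website, extract the single binary, and place it somewhere on your PATH, mirroring the Windows steps above.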


HCL Language

HashiCorp Configuration Language (HCL) is a configuration language created by HashiCorp, primarily used for writing configuration files that describe infrastructure resources managed by various HashiCorp tools such as Terraform, Nomad, Vault, and others.

HCL syntax is designed to be easy to read and write, with a focus on simplicity and readability. It uses a combination of key-value pairs, blocks, and expressions to define configurations.

Understanding the Basic Syntax of HCL in Terraform

HashiCorp Configuration Language (HCL) serves as the foundation for defining infrastructure as code (IaC) in Terraform. In this guide, we'll delve into the basic syntax of HCL, focusing on crucial concepts like Providers, Variables, Resource Attributes, and Terraform State.

Providers

Providers in Terraform serve as the bridge between the Terraform code and the APIs of cloud providers or other services. They define the resources and data sources available for use within your Terraform configuration.

provider "aws" {
  region = "us-west-2"
}

In this example, we declare the AWS provider and specify the region where our infrastructure resources will be provisioned.

Variables

Variables in Terraform enable parameterization, allowing you to customize your infrastructure configurations without modifying the underlying code. Variables can be defined at the root module level or within modules to promote reusability.

variable "instance_type" {
  type    = string
  default = "t2.micro"
}

Here, we define a variable named "instance_type" with a default value of "t2.micro" representing the instance type for our infrastructure.

Resource Attributes

Resources are the building blocks of infrastructure in Terraform. Each resource represents a single piece of infrastructure, such as an EC2 instance, a database, or a network interface. Resource attributes specify the properties of these infrastructure components.

resource "aws_instance" "example" {
  ami           = "ami-12345678"
  instance_type = var.instance_type
  subnet_id     = "subnet-12345678"
  tags = {
    Name = "example-instance"
  }
}

In this snippet, we create an AWS EC2 instance using the "aws_instance" resource block. We specify attributes like AMI ID, instance type (using a variable), subnet ID, and tags for the instance.

Terraform State

Terraform State is a crucial aspect of managing infrastructure as code. It serves as the source of truth for the current state of your infrastructure. Terraform uses state to map real-world resources to your configuration, track metadata, and plan and execute changes.

terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket"
    key    = "terraform.tfstate"
    region = "us-east-1"
  }
}

In this example, we configure Terraform to store state files in an S3 bucket named "my-terraform-state-bucket" in the "us-east-1" region.

Understanding these fundamental concepts of HCL in Terraform sets the stage for effectively managing and provisioning infrastructure using Terraform. Incorporate these practices into your workflows to harness the full power of infrastructure as code.

Conclusion:

In conclusion, our exploration of Terraform has provided invaluable insights into modern infrastructure management. From understanding Infrastructure as Code (IaC) fundamentals to mastering Terraform's capabilities, we've embarked on a journey poised to reshape infrastructure provisioning and management.

Throughout our journey, we've highlighted the need for Terraform, explored its transformative impact through success stories, and provided practical guidance for installation. We've also delved into key concepts like Providers, Variables, Resource Attributes, and Terraform State, laying the foundation for effective infrastructure management.

As we conclude Day 1 of the Terraweek challenge, let's embrace the knowledge gained and the opportunities ahead. Stay tuned for more exciting adventures as we continue to unlock the full potential of Terraform. The journey has just begun!