ETL Pipeline Migration Case Study

Customer Challenge

Antimetal engaged with a client that was already using Amazon Simple Storage Service (Amazon S3) to store daily CSV data. The challenge lay in the inefficient, disjointed data transformation processes employed by various teams, which led to duplicated effort and inconsistent outcomes.

Solution with Antimetal

Antimetal's objective was to optimize these processes by centralizing the data transformation operations. The initial step was a thorough assessment of the existing setup to identify the key pain points in the data handling workflows. The resulting solution hinged on deploying AWS Glue, chosen for its robust data processing capabilities, to replace the varied manual methods previously in use.
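
For illustration, the Glue side of such a pipeline can be expressed in Terraform roughly as follows. The database, crawler, and job names, the worker sizing, and the S3 paths are placeholders rather than the client's actual configuration, and the aws_iam_role.glue_etl role referenced here is sketched after the next paragraph.

# Hypothetical Glue Data Catalog database for the crawled CSV tables.
resource "aws_glue_catalog_database" "csv_catalog" {
  name = "daily_csv_catalog"
}

# Crawler that catalogs the daily CSV drops landing in S3.
resource "aws_glue_crawler" "daily_csv" {
  name          = "daily-csv-crawler"
  database_name = aws_glue_catalog_database.csv_catalog.name
  role          = aws_iam_role.glue_etl.arn # service role sketched below

  s3_target {
    path = "s3://example-daily-csv-bucket/incoming/" # placeholder bucket and prefix
  }
}

# Spark ETL job that transforms the crawled CSVs and loads them into RDS.
resource "aws_glue_job" "csv_to_rds" {
  name     = "csv-to-rds-etl"
  role_arn = aws_iam_role.glue_etl.arn

  command {
    script_location = "s3://example-daily-csv-bucket/scripts/csv_to_rds.py" # placeholder script path
  }

  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = 2
}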

A pivotal component of the strategy was establishing a secure, streamlined environment for data handling. This was achieved by configuring an Amazon Virtual Private Cloud (Amazon VPC) and implementing tightly scoped AWS IAM roles and policies, ensuring secure access control and data privacy. To support collaborative development and version control, the system was integrated with GitHub, giving the client's teams a cohesive working environment.
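
A minimal sketch of the Glue service role, assuming a hypothetical role name and using the AWS-managed AWSGlueServiceRole policy as a baseline, might look like this; in practice, bucket- and database-specific permissions would be added through additional, narrowly scoped policies.

# Hypothetical service role assumed by the Glue crawler and ETL job above.
resource "aws_iam_role" "glue_etl" {
  name = "glue-etl-service-role" # assumed name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "glue.amazonaws.com" }
    }]
  })
}

# AWS-managed baseline permissions for Glue; narrowly scoped inline policies
# would grant access to the specific buckets and databases involved.
resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_etl.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}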

Recognizing the importance of cost transparency and efficiency, Antimetal leveraged AWS Cost and Usage Reports alongside our proprietary cost management tools. This provided a clear view of the financial aspects of the new data transformation setup, allowing for an informed decision-making process regarding the transition to AWS Glue and Amazon Relational Database Service (Amazon RDS). Despite AWS Lambda's lower initial cost, the comprehensive analysis justified the shift towards a more scalable and efficient system, aligning with the client's broader objectives of improved practices and sustainability.
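
The Cost and Usage Report itself can also be managed as code. A minimal sketch, assuming a hypothetical report name and delivery bucket (CUR report definitions must be created in us-east-1), is shown below.

# Hypothetical hourly Cost and Usage Report with resource-level detail.
# Note: this resource must be created in the us-east-1 region.
resource "aws_cur_report_definition" "etl_cost_visibility" {
  report_name                = "etl-cost-visibility" # assumed name
  time_unit                  = "HOURLY"
  format                     = "textORcsv"
  compression                = "GZIP"
  additional_schema_elements = ["RESOURCES"]
  s3_bucket                  = "example-cur-bucket" # placeholder delivery bucket
  s3_prefix                  = "cur"
  s3_region                  = "us-east-1"
}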

With the solution's framework established and validated, Antimetal proceeded to codify the entire infrastructure using Terraform. This step significantly reduced the complexity of network and resource management, enabling the client's technical teams to direct their focus towards enhancing data transformation processes.

Throughout the project, a key focus was on building the client's internal expertise. Antimetal facilitated this through close collaboration with the client's engineers, imparting best practices in cloud cost optimization and FinOps management. This ensured that the client was not only equipped with a more efficient data transformation pipeline but also the knowledge to maintain optimal cloud operations into the future.

Cost Planning and Forecasting

The customer wanted a forecast of how much the proposed ETL setup would cost. The estimates below break the total down by service:

| Service | Operation | Charge Type | Frequency | Estimated Daily Cost | Estimated Monthly Cost |
| --- | --- | --- | --- | --- | --- |
| AWS Glue | Crawler | $0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run | Upon EventBridge trigger, roughly daily | $0.07 | $0.88 |
| AWS Glue | ETL | $0.44 per DPU-Hour for each Apache Spark or Spark Streaming job, billed per second with a 1-minute minimum | Daily | $6.81 | $81.72 |
| Amazon RDS | PostgreSQL database | $0.674 per hour for db.m7g.2xlarge | N/A | $16.18 | $194.11 |
| Amazon VPC | Gateways, IPs | | Daily | $7.44 | $89.28 |
| Amazon S3 | Storage | | | $13.18 | $158.16 |
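
The daily frequency above maps naturally onto a scheduled Glue trigger. A sketch, reusing the hypothetical csv_to_rds job from the earlier example and an assumed 06:00 UTC run time, might look like this:

# Hypothetical schedule that starts the ETL job once per day.
resource "aws_glue_trigger" "daily_etl" {
  name     = "daily-etl-run" # assumed name
  type     = "SCHEDULED"
  schedule = "cron(0 6 * * ? *)" # assumed 06:00 UTC daily run

  actions {
    job_name = aws_glue_job.csv_to_rds.name
  }
}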

Cloud Cost Financial Management Training

The following link points to a slide deck outlining the overall schedule of the training we provided to the customer to strengthen knowledge and culture around Cloud Cost Financial Management best practices.

https://docs.google.com/presentation/d/1MMlusZ--sAghj9dxjpAY2b69OSe7pwhLALZixkxL7gQ/edit?usp=sharing

Infrastructure as Code (IaC)

As mentioned above, our partnership with the customer also covered infrastructure as code with Terraform. Because the code is private we will not share the repository, but it contains Terraform code such as the following:

data "aws_canonical_user_id" "current" {}


locals {
  version_suspended = var.versioning.suspended != null ? var.versioning.suspended : false
}


resource "aws_s3_bucket" "this" {
  bucket              = "${var.bucket_name_prefix}${var.bucket_name}"
  object_lock_enabled = false
}


resource "aws_s3_bucket_acl" "this" {
  bucket = aws_s3_bucket.this.bucket
  access_control_policy {
    grant {
      permission = "FULL_CONTROL"
      grantee {
        type = "CanonicalUser"
        id   = data.aws_canonical_user_id.current.id
      }
    }


    owner {
      display_name = data.aws_canonical_user_id.current.display_name
      id           = data.aws_canonical_user_id.current.id
    }
  }
}


resource "aws_s3_bucket_ownership_controls" "this" {
  bucket = aws_s3_bucket.this.bucket
  rule {
    object_ownership = "BucketOwnerEnforced"
  }
}


resource "aws_s3_bucket_public_access_block" "this" {
  bucket = aws_s3_bucket.this.bucket


  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true


}


resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.bucket


  rule {
    bucket_key_enabled = true
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}


resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.bucket


  versioning_configuration {
    status     = coalesce(var.versioning.suspended, false) ? "Suspended" : (var.versioning.enabled ? "Enabled" : "Disabled")
    mfa_delete = var.versioning.mfa_delete
  }

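
The excerpt above assumes a handful of input variables declared elsewhere in the module. Their shape is roughly the following; the types and defaults shown here are illustrative rather than the client's actual definitions.

variable "bucket_name_prefix" {
  description = "Prefix prepended to the bucket name, e.g. an environment identifier."
  type        = string
  default     = ""
}

variable "bucket_name" {
  description = "Base name of the S3 bucket."
  type        = string
}

variable "versioning" {
  description = "Versioning settings for the bucket."
  type = object({
    enabled    = bool
    suspended  = optional(bool)
    mfa_delete = optional(string) # "Enabled" or "Disabled"
  })
  default = {
    enabled = true
  }
}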