ETL Pipeline Migration Case Study
Customer Challenge
Antimetal engaged with a client who was already utilizing Amazon Simple Storage Service (Amazon S3) for daily CSV data storage. The challenge lay in the inefficient and disjointed data transformation processes employed by various teams, leading to duplicated efforts and inconsistent outcomes.
Solution with Antimetal
Antimetal's objective was to optimize these processes by centralizing the data transformation operations. The initial step involved a thorough assessment of the existing setup, identifying the key pain points in the data handling workflows. The solution crafted by Antimetal hinged on the deployment of AWS Glue, chosen for its robust data processing capabilities, which would replace the varied and manual methods previously in use.
A pivotal component of the strategy was establishing a secure and streamlined environment for data handling. This was achieved through the configuration of an Amazon Virtual Private Cloud (Amazon VPC) and the implementation of precise AWS IAM roles and policies, ensuring secure access control and data privacy. To support collaborative development and version control, the system was integrated with GitHub, facilitating a cohesive working environment for the client's teams.
Recognizing the importance of cost transparency and efficiency, Antimetal leveraged AWS Cost and Usage Reports alongside our proprietary cost management tools. This provided a clear view of the financial aspects of the new data transformation setup, allowing for an informed decision-making process regarding the transition to AWS Glue and Amazon Relational Database Service (Amazon RDS). Despite AWS Lambda's lower initial cost, the comprehensive analysis justified the shift towards a more scalable and efficient system, aligning with the client's broader objectives of improved practices and sustainability.
With the solution's framework established and validated, Antimetal proceeded to codify the entire infrastructure using Terraform. This step significantly reduced the complexity of network and resource management, enabling the client's technical teams to direct their focus towards enhancing data transformation processes.
Throughout the project, a key focus was on building the client's internal expertise. Antimetal facilitated this through close collaboration with the client's engineers, imparting best practices in cloud cost optimization and FinOps management. This ensured that the client was equipped not only with a more efficient data transformation pipeline but also with the knowledge to maintain optimal cloud operations into the future.
Cost Planning and Forecasting
The customer wanted a forecast of how much the ETL setup would cost.
| Service | Operation | Charge Type | Frequency | Estimated Daily Cost | Estimated Monthly Cost |
| --- | --- | --- | --- | --- | --- |
| AWS Glue | Crawler | $0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run | Upon EventBridge Trigger, Roughly Daily | $0.07 | $0.88 |
| AWS Glue | ETL | $0.44 per DPU-Hour for each Apache Spark or Spark Streaming job, billed per second with a 1-minute minimum | Daily | $6.81 | $81.72 |
| Amazon RDS | Postgres Database | $0.674 per hour for db.m7g.2xlarge | N/A | $16.18 | $194.11 |
| Amazon VPC | Gateways, IPs | | Daily | $7.44 | $89.28 |
| Amazon S3 | Storage | | | $13.18 | $158.16 |
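
The daily figures for the rate-based rows can be reproduced from the listed unit prices. The following is a minimal sketch of that arithmetic, not the tooling used in the engagement. The function names are hypothetical, and it assumes the crawler uses a single DPU and finishes within the 10-minute minimum, and that the RDS instance runs 24 hours a day.

```python
from decimal import Decimal, ROUND_HALF_UP

HOURS_PER_DAY = Decimal(24)
MINUTES_PER_HOUR = Decimal(60)


def to_cents(amount: Decimal) -> Decimal:
    """Round a dollar amount to two decimal places."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def always_on_daily_cost(hourly_rate: Decimal) -> Decimal:
    """Daily cost of a resource billed per hour and running all day."""
    return to_cents(hourly_rate * HOURS_PER_DAY)


def glue_run_cost(rate_per_dpu_hour: Decimal, dpus: Decimal,
                  runtime_minutes: Decimal, minimum_minutes: Decimal) -> Decimal:
    """Cost of one Glue run, applying the billing minimum to short runs."""
    billed_minutes = max(runtime_minutes, minimum_minutes)
    return to_cents(rate_per_dpu_hour * dpus * billed_minutes / MINUTES_PER_HOUR)


def implied_dpu_hours(daily_cost: Decimal, rate_per_dpu_hour: Decimal) -> Decimal:
    """DPU-hours per day implied by a daily cost estimate, to one decimal place."""
    return (daily_cost / rate_per_dpu_hour).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Amazon RDS: $0.674 per hour for db.m7g.2xlarge, assumed always on.
rds_daily = always_on_daily_cost(Decimal("0.674"))

# Glue crawler: assumed 1 DPU and a 5-minute run, billed at the 10-minute minimum.
crawler_daily = glue_run_cost(Decimal("0.44"), Decimal(1), Decimal(5), Decimal(10))

# Glue ETL: back out the daily DPU-hours behind the $6.81 estimate.
etl_dpu_hours = implied_dpu_hours(Decimal("6.81"), Decimal("0.44"))

print(rds_daily)      # 16.18
print(crawler_daily)  # 0.07
print(etl_dpu_hours)  # 15.5
```

Under these assumptions the results match the table's daily estimates for the RDS and crawler rows, and the ETL estimate corresponds to roughly 15.5 DPU-hours per day. `Decimal` is used instead of floats so that rounding to cents is exact.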
Cloud Cost Financial Management Training
The following link contains a slide with the overall schedule of the training we provided to the customer to strengthen knowledge and culture around cloud cost financial management best practices.
https://docs.google.com/presentation/d/1MMlusZ--sAghj9dxjpAY2b69OSe7pwhLALZixkxL7gQ/edit?usp=sharing
Infrastructure-as-code (IaC)
As noted above, our partnership with the customer also covered infrastructure as code with Terraform. Because the code is private, we are not sharing the repository, but it contains the Terraform configuration for the infrastructure described above.