Welcome to this week’s enabledata Insights! Integrating Apache Airflow with AWS Glue lets you automate and manage complex ETL workflows with minimal effort. Here’s a step-by-step guide to configuring the GlueJobOperator in Airflow for a smooth Glue-Airflow integration.
Step 1: Create Your Glue Job
Define your Glue job using:
Visual ETL: For drag-and-drop simplicity.
Custom Python Scripts: For advanced transformations.
Save your script in an S3 bucket, e.g., s3://enabledata/scripts/job.py.
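For reference, a Glue Python script typically follows the standard Glue boilerplate. A minimal sketch of what job.py might look like (the source and target paths are placeholders; the awsglue imports only resolve inside the Glue runtime):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Glue passes the job name (and any custom arguments) on the command line.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Example transformation: read raw CSVs, write them back out as Parquet.
df = spark.read.csv("s3://enabledata/raw/", header=True)
df.write.mode("overwrite").parquet("s3://enabledata/processed/")

job.commit()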
Step 2: Set Up AWS Permissions
Ensure both Glue and Airflow have the correct roles and policies:
Glue: Use AWSGlueServiceRole with S3 access.
Airflow: Assign permissions to access Glue and invoke jobs.
Pro Tip: Use Terraform to automate IAM role and policy creation.
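If you'd rather script it than use Terraform, the same role can be provisioned with boto3. A minimal sketch, assuming the role name AWSGlueServiceRole and the enabledata bucket from Step 1 (adjust both to your setup):

import json
import boto3

iam = boto3.client("iam")

# Trust policy: let the Glue service assume this role.
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Fails with EntityAlreadyExists if the role is already there.
iam.create_role(
    RoleName="AWSGlueServiceRole",
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)

# Attach the AWS-managed Glue service policy.
iam.attach_role_policy(
    RoleName="AWSGlueServiceRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# Grant read access to the script bucket (placeholder bucket name).
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::enabledata", "arn:aws:s3:::enabledata/*"],
    }],
}

iam.put_role_policy(
    RoleName="AWSGlueServiceRole",
    PolicyName="glue-script-s3-access",
    PolicyDocument=json.dumps(s3_policy),
)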
Step 3: Add an AWS Connection in Airflow
In Airflow:
Go to Admin > Connections.
Create an AWS connection with:
Connection ID: aws_default.
Access key, secret key, and region details.
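Alternatively, the connection can be supplied as an environment variable instead of through the UI, which is convenient for containerized deployments. A minimal sketch with placeholder credentials (URL-encode any special characters in the secret):

export AIRFLOW_CONN_AWS_DEFAULT='aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMIbPxRfiCYEXAMPLEKEY@/?region_name=eu-west-1'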
Step 4: Configure Your Airflow DAG
Install the required provider package: apache-airflow-providers-amazon.
Example DAG to submit a Glue job:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="glue_integration_dag",
    start_date=datetime(2024, 11, 28),
    schedule="@daily",
    dagrun_timeout=timedelta(minutes=60),
    catchup=False,
    max_active_runs=1,
) as dag:
    # Submit the Glue job; the operator creates the job on first run
    # if it does not already exist, using create_job_kwargs.
    submit_glue_job = GlueJobOperator(
        task_id="submit_glue_job",
        job_name="test_job",
        script_location="s3://enabledata/scripts/job.py",
        s3_bucket="enabledata",  # bucket Glue uses for logs and artifacts
        iam_role_name="AWSGlueServiceRole",
        create_job_kwargs={
            "GlueVersion": "4.0",
            "NumberOfWorkers": 2,
            "WorkerType": "G.1X",
            "Timeout": 60,
        },
        retry_limit=0,
    )

    # Placeholder for downstream validation logic.
    validate_task = EmptyOperator(task_id="validate_results")

    submit_glue_job >> validate_task
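Note: with recent versions of the Amazon provider, GlueJobOperator waits for the Glue job to reach a terminal state by default (wait_for_completion=True), so validate_results only starts once the job has actually finished.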
Step 5: Run Your DAG
Trigger the DAG in the Airflow UI (or from the CLI, as shown below) and monitor the task logs for:
Successful Glue job submission.
Job completion status updates.
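For example, to kick off a run from the command line with the standard Airflow CLI:

airflow dags trigger glue_integration_dag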
Takeaways
Automation: Streamline ETL processes by connecting Airflow and Glue.
Cost-Efficiency: Prototyping this setup cost just $1.75.
Scalability: Easily manage and scale data workflows across tools.
Looking to enhance your data engineering processes or scale your pipeline architecture?
At enabledata.io, we offer expert insights and tailored solutions.
Schedule a FREE consulting call today or reach out to us at contact@enabledata.io.
Let's build smarter, scalable solutions together!