Amazon’s data services form a powerful ecosystem for building scalable, serverless data lakes and warehouses. This guide explains the complete picture: from S3 to Glue to Athena and Redshift. You’ll learn what each component does, why it exists, when to use it, and how they work together.
🔹 Part 1: What Is Amazon S3?
Amazon S3 (Simple Storage Service) is a durable, highly available object storage system used for storing raw data in data lakes.
- Stores: JSON, Parquet, CSV, Avro, etc.
- Cheap and elastic
- Cannot be queried directly (requires external engines like Athena, Redshift Spectrum, EMR)
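Getting raw data into S3 is just an object upload. Here is a minimal boto3 sketch (the bucket name and key are hypothetical placeholders):

import boto3

s3 = boto3.client('s3')

# Land a local raw file in the "raw zone" of a hypothetical data lake bucket
s3.upload_file(
    Filename='events_2024_06_01.json',
    Bucket='my-data-lake-bucket',
    Key='raw-data/events/year=2024/month=06/events_2024_06_01.json'
)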
🔹 Part 2: What Is AWS Glue?
AWS Glue is a serverless ETL (Extract, Transform, Load) service. It helps prepare, catalog, and transform data from various sources.
Glue Has Three Main Roles:
- Glue Jobs: Run data transformations using Spark (PySpark or Scala)
- Glue Crawlers: Scan S3 or JDBC sources to infer schema and create metadata
- Glue Data Catalog: Central metadata store (tables, schemas, locations)
Example Workflow:
- Upload raw data to S3
- Create a Crawler to scan files and generate a table in the Glue Catalog
- Write a Glue Job to clean/transform the data
- Load the final output into S3 or Redshift
Glue is great for automating pipelines and managing metadata.
🔹 Part 3: What Is the Glue Data Catalog?
Think of the Glue Data Catalog as a central schema registry and metadata store for your data lake.
- It stores table names, column names/types, file formats, S3 paths, and partitioning info
- It is not your data, just the blueprint
- It is used by Glue, Athena, Redshift Spectrum, and EMR
- It is region-specific, serverless, and managed by AWS
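Because it is only metadata, you can inspect the Catalog programmatically. A quick boto3 sketch (the database name is hypothetical) that lists each registered table and where its data lives in S3:

import boto3

glue = boto3.client('glue')

# List every table in a hypothetical Catalog database, with its S3 location
response = glue.get_tables(DatabaseName='my_database')
for table in response['TableList']:
    print(table['Name'], '->', table['StorageDescriptor']['Location'])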
🔹 Part 4: What Is Amazon Athena?
Amazon Athena is a serverless SQL query engine that lets you query S3 data directly using standard SQL.
- Reads data from S3 using schema from the Glue Catalog
- No data loading required
- Pay-per-query ($5 per TB of data scanned)
- Great for ad-hoc queries, exploration, or dashboard prototyping
Athena is perfect for cost-effective, exploratory analytics — but can be slower for frequent or complex joins.
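To see how the pay-per-query model looks from code, here is a minimal boto3 sketch (table, database, and results bucket are hypothetical): submit the query, poll until it finishes, then read the rows.

import time
import boto3

athena = boto3.client('athena')

# Submit a query against a table registered in the Glue Catalog
response = athena.start_query_execution(
    QueryString='SELECT country, COUNT(*) AS users FROM raw_users GROUP BY country',
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)
query_id = response['QueryExecutionId']

# Athena runs asynchronously, so poll for completion
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']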
🔹 Part 5: What Is Amazon Redshift?
Amazon Redshift is a fully managed cloud data warehouse optimized for high-performance SQL analytics.
- Stores data internally in compressed, columnar format
- Supports materialized views, fast joins, high concurrency, and BI dashboards
- You can COPY data from S3 into Redshift
- You can also query S3 using Redshift Spectrum
Redshift is ideal for dashboards and large analytical workloads.
🔹 Part 6: Why Move Data from S3 to Redshift?
| Reason | Explanation |
| --- | --- |
| Query performance | Redshift is optimized for joins, filters, and aggregations |
| BI dashboard compatibility | Better performance with Tableau, Power BI, etc. |
| Data modeling | Support for star schemas and materialized views |
| Governance and security | Row/column-level permissions and access control |
You move data from S3 to Redshift to support frequent, interactive queries and production analytics.
🔹 Part 7: How to Move Data from S3 to Redshift
Option 1: Redshift COPY command
COPY my_table FROM 's3://bucket/data/'
IAM_ROLE 'arn:aws:iam::account:role/MyRedshiftRole'
FORMAT AS PARQUET;
Option 2: Glue ETL Job to Redshift
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Assume dyf is your transformed DynamicFrame
# Writes through a Glue connection; Glue stages the data in the temp S3 dir
# and loads it into Redshift ("database" is the target Redshift database name)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-conn",
    connection_options={"dbtable": "public.my_table", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-dir/"
)
Option 3: Redshift Spectrum
Query S3 data in place through an external schema backed by the Glue Catalog:
SELECT * FROM spectrum_schema.sales_data WHERE year = 2024;
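The external schema itself is a one-time setup that maps a Glue Catalog database into Redshift. One way to run that DDL from Python is the Redshift Data API; a minimal sketch (cluster, database, user, and schema names are hypothetical):

import boto3

rsd = boto3.client('redshift-data')

# Map the Glue Catalog database into Redshift as an external schema
create_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
FROM DATA CATALOG
DATABASE 'my_database'
IAM_ROLE 'arn:aws:iam::account:role/MyRedshiftRole'
"""

rsd.execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql=create_schema_sql
)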
🔹 Part 8: Glue Crawler and Job Pseudocode
Glue Crawler (conceptual steps)
- Choose a data store (e.g., S3 path)
- Assign an IAM role for access
- Run crawler
- It infers schema and creates/updates Glue Catalog table
You can create a crawler via console, CLI, or boto3:
import boto3
client = boto3.client('glue')
client.create_crawler(
    Name='my-crawler',
    Role='AWSGlueServiceRole',
    DatabaseName='my_database',
    Targets={'S3Targets': [{'Path': 's3://my-bucket/raw-data/'}]},
    TablePrefix='raw_',
    SchemaChangePolicy={'UpdateBehavior': 'UPDATE_IN_DATABASE', 'DeleteBehavior': 'DEPRECATE_IN_DATABASE'}
)
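Creating the crawler does not run it. A short follow-up using the same client (the 10-second polling interval is arbitrary):

import time

# Kick off the crawler and wait for it to finish (state returns to READY)
client.start_crawler(Name='my-crawler')
time.sleep(10)
while client.get_crawler(Name='my-crawler')['Crawler']['State'] != 'READY':
    time.sleep(10)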
What Exactly Does a Glue Job Do?
A Glue Job reads data (often defined by the Glue Catalog), applies transformation logic using PySpark, and writes the cleaned or enriched data to a destination such as S3, Redshift, or RDS.
Typical Tasks Performed by a Glue Job:
- Read data from Glue Catalog or S3 directly
- Clean, filter, or enrich the data
- Join with other datasets
- Convert file format (e.g., CSV to Parquet)
- Write results to destination storage
Glue ETL Job (PySpark example)
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# Step 1: Read from Glue Catalog (data inferred by crawler)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="raw_users"
)
# Step 2: Transform - Filter out users under age 18
df = dyf.toDF().filter("age >= 18")
dyf_clean = DynamicFrame.fromDF(df, glueContext, "dyf_clean")
# Step 3: Optionally join with another dataset
# e.g., enrich with region info, etc.
# Step 4: Write back to S3 (cleaned zone)
output_path = "s3://my-bucket/cleaned/users/"
glueContext.write_dynamic_frame.from_options(
    frame=dyf_clean,
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet"
)
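On its own this is just a script; it becomes a Glue Job once you register it with an IAM role and a script location in S3. A minimal boto3 sketch (job name, script path, role, and worker settings are all hypothetical):

import boto3

glue = boto3.client('glue')

# Register the script above as a Glue job
glue.create_job(
    Name='clean-users-job',
    Role='AWSGlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/clean_users.py',
        'PythonVersion': '3'
    },
    GlueVersion='4.0',
    WorkerType='G.1X',
    NumberOfWorkers=2
)

# Trigger a run (a schedule, trigger, or workflow would normally do this)
glue.start_job_run(JobName='clean-users-job')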
🔺 Part 9: Putting It All Together
Raw Data → S3 Bucket
↓
Glue Crawler (creates table in Catalog)
↓
Athena (ad-hoc queries)
↓
Glue Job (cleans + transforms)
↓
Redshift (final structured analytics)
🔻 Final Thoughts
AWS Glue, S3, Athena, and Redshift form a modern, modular, scalable data platform. You don’t always need to move data — but knowing when and why to use each tool is what makes your architecture efficient, cost-effective, and powerful.