🧠 The Ultimate Guide to AWS Glue, S3, Athena, and Redshift

Amazon’s data services form a powerful ecosystem for building scalable, serverless data lakes and warehouses. This guide explains the complete picture: from S3 to Glue to Athena and Redshift. You’ll learn what each component does, why it exists, when to use it, and how they work together.


🔹 Part 1: What Is Amazon S3?

Amazon S3 (Simple Storage Service) is a durable, highly available object storage system used for storing raw data in data lakes.

  • Stores: JSON, Parquet, CSV, Avro, etc.
  • Cheap and elastic
  • Cannot be queried directly (requires external engines like Athena, Redshift Spectrum, EMR)

🔹 Part 2: What Is AWS Glue?

AWS Glue is a serverless ETL (Extract, Transform, Load) service. It helps prepare, catalog, and transform data from various sources.

Glue Has Three Main Roles:

  1. Glue Jobs: Run data transformations using Spark (PySpark or Scala)
  2. Glue Crawlers: Scan S3 or JDBC sources to infer schema and create metadata
  3. Glue Data Catalog: Central metadata store (tables, schemas, locations)

Example Workflow:

  • Upload raw data to S3
  • Create a Crawler to scan files and generate a table in the Glue Catalog
  • Write a Glue Job to clean/transform the data
  • Load the final output into S3 or Redshift

Glue is great for automating pipelines and managing metadata.


🔹 Part 3: What Is the Glue Data Catalog?

Think of the Glue Data Catalog as a central schema registry and metadata store for your data lake.

  • It stores table names, column names/types, file formats, S3 paths, and partitioning info
  • It is not your data, just the blueprint
  • It is used by Glue, Athena, Redshift Spectrum, and EMR
  • It is region-specific, serverless, and managed by AWS

🔹 Part 4: What Is Amazon Athena?

Amazon Athena is a serverless SQL query engine that lets you query S3 data directly using standard SQL.

  • Reads data from S3 using schema from the Glue Catalog
  • No data loading required
  • Pay-per-query (\$5 per TB scanned)
  • Great for ad-hoc queries, exploration, or dashboard prototyping

Athena is perfect for cost-effective, exploratory analytics — but can be slower for frequent or complex joins.


🔹 Part 5: What Is Amazon Redshift?

Amazon Redshift is a fully managed cloud data warehouse optimized for high-performance SQL analytics.

  • Stores data internally in compressed, columnar format
  • Supports materialized views, fast joins, dashboards, concurrency
  • You can COPY data from S3 into Redshift
  • You can also query S3 using Redshift Spectrum

Redshift is ideal for dashboards and large analytical workloads.


🔹 Part 6: Why Move Data from S3 to Redshift?

ReasonExplanation
Query performanceRedshift is optimized for joins, filters, and aggregations
BI dashboard compatibilityBetter performance with Tableau, Power BI, etc.
Data modelingSupport for star schemas, materialized views
Governance and securityRow/column-level permissions and access control

You move data from S3 to Redshift to support frequent, interactive queries and production analytics.


🔹 Part 7: How to Move Data from S3 to Redshift

Option 1: Redshift COPY command

COPY my_table FROM 's3://bucket/data/'
IAM_ROLE 'arn:aws:iam::account:role/MyRedshiftRole'
FORMAT AS PARQUET;

Option 2: Glue ETL Job to Redshift

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Assume dyf is your transformed DynamicFrame
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-conn",
    connection_options={"dbtable": "public.my_table"},
    redshift_tmp_dir="s3://my-temp-dir/"
)

Option 3: Redshift Spectrum

Query S3 data using external schema in Glue Catalog:

SELECT * FROM spectrum_schema.sales_data WHERE year = 2024;

🔹 Part 8: Glue Crawler and Job Pseudocode

Glue Crawler (conceptual steps)

  1. Choose a data store (e.g., S3 path)
  2. Assign an IAM role for access
  3. Run crawler
  4. It infers schema and creates/updates Glue Catalog table

You can create a crawler via console, CLI, or boto3:

import boto3
client = boto3.client('glue')

client.create_crawler(
    Name='my-crawler',
    Role='AWSGlueServiceRole',
    DatabaseName='my_database',
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw-data/"}]},
    TablePrefix='raw_',
    SchemaChangePolicy={'UpdateBehavior': 'UPDATE_IN_DATABASE', 'DeleteBehavior': 'DEPRECATE_IN_DATABASE'}
)

What Exactly Does a Glue Job Do?

A Glue Job reads data (often defined by the Glue Catalog), applies transformation logic using PySpark, and writes the cleaned or enriched data to a destination such as S3, Redshift, or RDS.

Typical Tasks Performed by a Glue Job:

  • Read data from Glue Catalog or S3 directly
  • Clean, filter, or enrich the data
  • Join with other datasets
  • Convert file format (e.g., CSV to Parquet)
  • Write results to destination storage

Glue ETL Job (PySpark example)

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Step 1: Read from Glue Catalog (data inferred by crawler)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="raw_users"
)

# Step 2: Transform - Filter out users under age 18
df = dyf.toDF().filter("age >= 18")
dyf_clean = DynamicFrame.fromDF(df, glueContext, "dyf_clean")

# Step 3: Optionally join with another dataset
# e.g., enrich with region info, etc.

# Step 4: Write back to S3 (cleaned zone)
output_path = "s3://my-bucket/cleaned/users/"
glueContext.write_dynamic_frame.from_options(
    frame=dyf_clean,
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet"
)

🔺 Part 9: Putting It All Together

Raw Data → S3 Bucket
          ↓
    Glue Crawler (creates table in Catalog)
          ↓
    Athena (ad-hoc queries)
          ↓
    Glue Job (cleans + transforms)
          ↓
    Redshift (final structured analytics)

🔻 Final Thoughts

AWS Glue, S3, Athena, and Redshift form a modern, modular, scalable data platform. You don’t always need to move data — but knowing when and why to use each tool is what makes your architecture efficient, cost-effective, and powerful.

Hara Bhara Kabab

Spinach and peas patties lightly spiced and grilled.

Price: $6.99

Chicken 65

South Indian-style spicy fried chicken nuggets.

Price: $8.99

Tandoori Mushroom

Button mushrooms marinated in tikka spices and grilled.

Price: $7.99

Tandoori Chicken

Bone-in chicken marinated and roasted in tandoor.

Price: $11.99

Paneer Lababdar

Cottage cheese in tangy tomato and creamy onion gravy.

Price: $11.49

Chicken Tikka Masala

Grilled chicken chunks in creamy spiced masala sauce.

Price: $12.49

Lamb Korma

Lamb cooked in rich yogurt-cashew gravy.

Price: $13.99

Fish Curry (Kerala Style)

Coconut milk-based curry with tangy tamarind and spices.

Price: $12.99

Mutton Biryani

Layered rice and marinated mutton, slow-cooked to perfection.

Price: $13.99

Veg Fried Rice

Sautéed rice with vegetables and Indo-Chinese flavor.

Price: $8.49

Lachha Paratha

Flaky, layered flatbread cooked in tandoor.

Price: $2.99

Paneer Paratha

Whole wheat flatbread stuffed with spiced paneer.

Price: $4.49

Pav Bhaji

Mumbai-style mashed vegetable curry served with buttered buns.

Price: $7.49

Pani Puri

Crispy puris filled with spicy and tangy water, potato & chickpeas.

Price: $5.99

Masala Dosa

Crispy crepe stuffed with potato masala and served with chutney.

Price: $7.99

Medu Vada

Crisp fried urad dal doughnuts served with sambhar.

Price: $5.49

Sweet Lassi

Thick yogurt drink sweetened and topped with cream.

Price: $3.99

Filter Coffee

Authentic South Indian filter coffee with strong decoction.

Price: $2.49

Jalebi

Deep-fried flour spirals soaked in saffron sugar syrup.

Price: $3.99

Malai Kulfi

Creamy frozen dessert flavored with cardamom and nuts.

Price: $3.99

Deluxe Veg Thali

Includes 3 curries, rice, bread, sweet, raita, pickle, and papad.

Royal Non-Veg Thali

Includes chicken curry, lamb curry, biryani, bread, dessert.

Test chutney

Price: $0.10

Leave a Reply

Your email address will not be published. Required fields are marked *