Amazon’s data services form a powerful ecosystem for building scalable, serverless data lakes and warehouses. This guide explains the complete picture: from S3 to Glue to Athena and Redshift. You’ll learn what each component does, why it exists, when to use it, and how they work together.
🔹 Part 1: What Is Amazon S3?
Amazon S3 (Simple Storage Service) is a durable, highly available object storage system used for storing raw data in data lakes.
- Stores: JSON, Parquet, CSV, Avro, etc.
- Cheap and elastic
- Cannot be queried directly (requires external engines like Athena, Redshift Spectrum, EMR)
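Getting raw data into S3 is just an object upload. Here is a minimal boto3 sketch (the bucket name and key are hypothetical placeholders):

import boto3

s3 = boto3.client('s3')

# Land a local raw file in the "raw zone" of a hypothetical data lake bucket
s3.upload_file(
    Filename='events_2024_06_01.json',
    Bucket='my-data-lake-bucket',
    Key='raw-data/events/year=2024/month=06/events_2024_06_01.json'
)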
🔹 Part 2: What Is AWS Glue?
AWS Glue is a serverless ETL (Extract, Transform, Load) service. It helps prepare, catalog, and transform data from various sources.
Glue Has Three Main Roles:
- Glue Jobs: Run data transformations using Spark (PySpark or Scala)
- Glue Crawlers: Scan S3 or JDBC sources to infer schema and create metadata
- Glue Data Catalog: Central metadata store (tables, schemas, locations)
Example Workflow:
- Upload raw data to S3
- Create a Crawler to scan files and generate a table in the Glue Catalog
- Write a Glue Job to clean/transform the data
- Load the final output into S3 or Redshift
Glue is great for automating pipelines and managing metadata.
🔹 Part 3: What Is the Glue Data Catalog?
Think of the Glue Data Catalog as a central schema registry and metadata store for your data lake.
- It stores table names, column names/types, file formats, S3 paths, and partitioning info
- It is not your data, just the blueprint
- It is used by Glue, Athena, Redshift Spectrum, and EMR
- It is region-specific, serverless, and managed by AWS
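Because it is only metadata, you can inspect the Catalog programmatically. A quick boto3 sketch (the database name is hypothetical) that lists each registered table and where its data lives in S3:

import boto3

glue = boto3.client('glue')

# List every table in a hypothetical Catalog database, with its S3 location
response = glue.get_tables(DatabaseName='my_database')
for table in response['TableList']:
    print(table['Name'], '->', table['StorageDescriptor']['Location'])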
🔹 Part 4: What Is Amazon Athena?
Amazon Athena is a serverless SQL query engine that lets you query S3 data directly using standard SQL.
- Reads data from S3 using schema from the Glue Catalog
- No data loading required
- Pay-per-query ($5 per TB of data scanned)
- Great for ad-hoc queries, exploration, or dashboard prototyping
Athena is perfect for cost-effective, exploratory analytics — but can be slower for frequent or complex joins.
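To see how the pay-per-query model looks from code, here is a minimal boto3 sketch (table, database, and results bucket are hypothetical): submit the query, poll until it finishes, then read the rows.

import time
import boto3

athena = boto3.client('athena')

# Submit a query against a table registered in the Glue Catalog
response = athena.start_query_execution(
    QueryString='SELECT country, COUNT(*) AS users FROM raw_users GROUP BY country',
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)
query_id = response['QueryExecutionId']

# Athena runs asynchronously, so poll for completion
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']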
🔹 Part 5: What Is Amazon Redshift?
Amazon Redshift is a fully managed cloud data warehouse optimized for high-performance SQL analytics.
- Stores data internally in compressed, columnar format
- Supports materialized views, fast joins, high concurrency, and BI dashboards
- You can COPY data from S3 into Redshift
- You can also query S3 using Redshift Spectrum
Redshift is ideal for dashboards and large analytical workloads.
🔹 Part 6: Why Move Data from S3 to Redshift?
| Reason | Explanation |
| --- | --- |
| Query performance | Redshift is optimized for joins, filters, and aggregations |
| BI dashboard compatibility | Better performance with Tableau, Power BI, etc. |
| Data modeling | Support for star schemas and materialized views |
| Governance and security | Row/column-level permissions and access control |
You move data from S3 to Redshift to support frequent, interactive queries and production analytics.
🔹 Part 7: How to Move Data from S3 to Redshift
Option 1: Redshift COPY command
COPY my_table FROM 's3://bucket/data/'
IAM_ROLE 'arn:aws:iam::account:role/MyRedshiftRole'
FORMAT AS PARQUET;
Option 2: Glue ETL Job to Redshift
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Assume dyf is your transformed DynamicFrame
# Writes through a Glue connection; Glue stages the data in the temp S3 dir
# and loads it into Redshift ("database" is the target Redshift database name)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-conn",
    connection_options={"dbtable": "public.my_table", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-dir/"
)
Option 3: Redshift Spectrum
Query S3 data in place through an external schema backed by the Glue Catalog:
SELECT * FROM spectrum_schema.sales_data WHERE year = 2024;
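The external schema itself is a one-time setup that maps a Glue Catalog database into Redshift. One way to run that DDL from Python is the Redshift Data API; a minimal sketch (cluster, database, user, and schema names are hypothetical):

import boto3

rsd = boto3.client('redshift-data')

# Map the Glue Catalog database into Redshift as an external schema
create_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
FROM DATA CATALOG
DATABASE 'my_database'
IAM_ROLE 'arn:aws:iam::account:role/MyRedshiftRole'
"""

rsd.execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql=create_schema_sql
)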
🔹 Part 8: Glue Crawler and Job Pseudocode
Glue Crawler (conceptual steps)
- Choose a data store (e.g., S3 path)
- Assign an IAM role for access
- Run crawler
- It infers schema and creates/updates Glue Catalog table
You can create a crawler via console, CLI, or boto3:
import boto3
client = boto3.client('glue')
client.create_crawler(
    Name='my-crawler',
    Role='AWSGlueServiceRole',
    DatabaseName='my_database',
    Targets={'S3Targets': [{'Path': 's3://my-bucket/raw-data/'}]},
    TablePrefix='raw_',
    SchemaChangePolicy={'UpdateBehavior': 'UPDATE_IN_DATABASE', 'DeleteBehavior': 'DEPRECATE_IN_DATABASE'}
)
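Creating the crawler does not run it. A short follow-up using the same client (the 10-second polling interval is arbitrary):

import time

# Kick off the crawler and wait for it to finish (state returns to READY)
client.start_crawler(Name='my-crawler')
time.sleep(10)
while client.get_crawler(Name='my-crawler')['Crawler']['State'] != 'READY':
    time.sleep(10)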
What Exactly Does a Glue Job Do?
A Glue Job reads data (often defined by the Glue Catalog), applies transformation logic using PySpark, and writes the cleaned or enriched data to a destination such as S3, Redshift, or RDS.
Typical Tasks Performed by a Glue Job:
- Read data from Glue Catalog or S3 directly
- Clean, filter, or enrich the data
- Join with other datasets
- Convert file format (e.g., CSV to Parquet)
- Write results to destination storage
Glue ETL Job (PySpark example)
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# Step 1: Read from Glue Catalog (data inferred by crawler)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="raw_users"
)
# Step 2: Transform - Filter out users under age 18
df = dyf.toDF().filter("age >= 18")
dyf_clean = DynamicFrame.fromDF(df, glueContext, "dyf_clean")
# Step 3: Optionally join with another dataset
# e.g., enrich with region info, etc.
# Step 4: Write back to S3 (cleaned zone)
output_path = "s3://my-bucket/cleaned/users/"
glueContext.write_dynamic_frame.from_options(
    frame=dyf_clean,
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet"
)
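On its own this is just a script; it becomes a Glue Job once you register it with an IAM role and a script location in S3. A minimal boto3 sketch (job name, script path, role, and worker settings are all hypothetical):

import boto3

glue = boto3.client('glue')

# Register the script above as a Glue job
glue.create_job(
    Name='clean-users-job',
    Role='AWSGlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/clean_users.py',
        'PythonVersion': '3'
    },
    GlueVersion='4.0',
    WorkerType='G.1X',
    NumberOfWorkers=2
)

# Trigger a run (a schedule, trigger, or workflow would normally do this)
glue.start_job_run(JobName='clean-users-job')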
🔺 Part 9: Putting It All Together
Raw Data → S3 Bucket
↓
Glue Crawler (creates table in Catalog)
↓
Athena (ad-hoc queries)
↓
Glue Job (cleans + transforms)
↓
Redshift (final structured analytics)
🔻 Final Thoughts
AWS Glue, S3, Athena, and Redshift form a modern, modular, scalable data platform. You don’t always need to move data — but knowing when and why to use each tool is what makes your architecture efficient, cost-effective, and powerful.