The Challenge
The client needed a solution that could publish real-time insert and update events, ingest data into the datalake with minimal delay, scale automatically as data volume increased, and improve performance without adding operational overhead. In addition, the system had to integrate seamlessly with heterogeneous data sources such as DB2, AuroraDB, DynamoDB, and RSE applications. The overall goal was to deliver fresh data with minimal latency while maintaining a simple, resilient, and scalable architecture.
Our Solution
We implemented a fully automated ingestion pipeline using AWS-native services, orchestrated end-to-end with Airflow.
High-Level Flow:
(DB2 / AuroraDB / DynamoDB) → Producer (ECS) → SQS → Lambda Consumer → Redshift Datalake
                                      ↑
                           Airflow Orchestration
Here’s how it works:
1. Data Extraction – Powered by Airflow + ECS
• Airflow DAGs orchestrate scheduled tasks.
• ECS Fargate containers execute Producer.py, connecting to databases (DB2, AuroraDB, DynamoDB).
• The producer converts every row into a standardized JSON event.
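For illustration, here is a minimal sketch of what this orchestration could look like, using the Amazon provider's EcsRunTaskOperator. The DAG id, schedule, cluster, task definition, command flags, and network settings below are placeholders rather than the client's actual configuration:

```python
# Hypothetical extraction DAG: Airflow launches the Fargate task that runs
# Producer.py. All names and IDs are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="producer_extraction",
    schedule_interval=timedelta(minutes=5),  # frequent runs keep data fresh
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    run_producer = EcsRunTaskOperator(
        task_id="run_producer",
        cluster="ingestion-cluster",      # placeholder cluster name
        task_definition="producer-task",  # task definition wrapping Producer.py
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [{
                "name": "producer",
                # select which source to extract on this run (illustrative flag)
                "command": ["python", "Producer.py", "--source", "db2"],
            }]
        },
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet
                "assignPublicIp": "DISABLED",
            }
        },
    )
```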
2. Event Publication to SQS
• Each event (insert/update) is published directly to an Amazon SQS queue.
• This gives the system durability, retry capability, and fault isolation.
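To make the event shape concrete, here is a simplified sketch of the publishing side, assuming the producer uses boto3; the queue URL and the envelope fields are illustrative:

```python
# Hypothetical publisher inside Producer.py: wrap each changed row in the
# standardized JSON envelope and send it to SQS. URL and fields are placeholders.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingestion-events"

def publish_event(table: str, operation: str, row: dict) -> None:
    event = {
        "table": table,          # e.g. "orders"
        "operation": operation,  # "insert" or "update"
        "payload": row,          # the row itself, as column -> value
    }
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        # default=str keeps dates and decimals JSON-serializable
        MessageBody=json.dumps(event, default=str),
    )
```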
3. Real-Time Processing via Lambda
• The SQS queue triggers our Lambda Consumer, which processes events at scale.
• Lambda performs transformations, deduplication, and schema validation.
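A hedged sketch of that handler follows. The required-field check, the in-batch dedup key, and the load_to_redshift() helper (sketched in the next step) are assumptions for illustration; the partial-batch response also requires ReportBatchItemFailures to be enabled on the SQS trigger:

```python
# Hypothetical Lambda consumer for the SQS trigger. The validation and
# dedup rules here are illustrative, not the client's exact logic.
import json

REQUIRED_FIELDS = {"table", "operation", "payload"}

def lambda_handler(event, context):
    failures = []  # message IDs that SQS should retry
    seen = set()   # dedupe identical events within this batch
    for record in event["Records"]:
        try:
            body = json.loads(record["body"])
            # Schema validation: reject events missing required fields.
            missing = REQUIRED_FIELDS - body.keys()
            if missing:
                raise ValueError(f"missing fields: {missing}")
            key = (body["table"], json.dumps(body["payload"], sort_keys=True))
            if key in seen:
                continue  # duplicate event in this batch; skip it
            seen.add(key)
            load_to_redshift(body)  # sketched in the ingestion step below
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    # Partial-batch response: only failed messages are retried (and can
    # eventually land in the DLQ).
    return {"batchItemFailures": failures}
```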
4. Ingestion into Redshift Datalake
• Lambda loads data into the Redshift data lake tables using optimized COPY/SQL logic.
• All events are also archived into S3 for audit, traceability, and replay purposes.
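One plausible shape for that load step uses the Redshift Data API: archive the raw event, stage the row payload in S3, then COPY it into the matching table. The bucket, cluster, database, user, and IAM role names are placeholders, and a production version would batch many events per COPY rather than issuing one per message:

```python
# Hypothetical load step: S3 archive plus Redshift COPY. All identifiers
# are illustrative placeholders.
import json
import uuid
import boto3

s3 = boto3.client("s3")
redshift = boto3.client("redshift-data")

BUCKET = "ingestion-archive"
COPY_ROLE = "arn:aws:iam::123456789012:role/redshift-copy"

def load_to_redshift(event: dict) -> None:
    event_id = uuid.uuid4()
    # 1) Archive the full raw event for audit, traceability, and replay.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"archive/{event['table']}/{event_id}.json",
        Body=json.dumps(event, default=str),
    )
    # 2) Stage just the row payload, whose keys map to table columns.
    staged = f"staging/{event['table']}/{event_id}.json"
    s3.put_object(
        Bucket=BUCKET,
        Key=staged,
        Body=json.dumps(event["payload"], default=str),
    )
    # 3) COPY the staged JSON into the matching Redshift table.
    redshift.execute_statement(
        ClusterIdentifier="datalake-cluster",  # placeholder
        Database="lake",
        DbUser="loader",
        Sql=(
            f"COPY {event['table']} FROM 's3://{BUCKET}/{staged}' "
            f"IAM_ROLE '{COPY_ROLE}' FORMAT AS JSON 'auto';"
        ),
    )
```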
5. Real-Time Availability for Reporting
• Data becomes available in near real time for BI dashboards, analytical workloads, and monitoring tools.
Results & Impact
With this architecture, data is now available in near real time, enabling:
• Seconds-level ingestion latency
• Fault tolerance with retries & DLQs (see the sketch after this list)
• Consistent JSON-based data model
• Up-to-date dashboards and analytics
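The retry-then-DLQ behavior above is standard SQS redrive configuration rather than custom code; a minimal sketch, with placeholder queue URL, DLQ ARN, and retry count:

```python
# Hypothetical redrive policy: after maxReceiveCount failed deliveries,
# SQS moves the message to the dead-letter queue for inspection and replay.
import json
import boto3

sqs = boto3.client("sqs")

sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/ingestion-events",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:ingestion-events-dlq",
            "maxReceiveCount": "5",  # retries before parking in the DLQ
        })
    },
)
```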
It’s a strong foundation for future enhancements such as CDC automation, streaming analytics, and ML-based anomaly detection.