Finance

Near Real-Time Data Ingestion for a Modern Datalake

The client needed a solution that could publish real-time insert and update events, ingest data into the datalake with minimal delay, scale automatically as data volume increased, and improve performance without adding operational overhead.

Confidential
6 Months
10 Team Members

The Challenge

The client needed a solution that could publish real-time insert and update events, ingest data into the datalake with minimal delay, scale automatically as data volume increased, and improve performance without adding operational overhead. In addition, the system had to integrate seamlessly with heterogeneous data sources such as DB2, AuroraDB, DynamoDB, and RSE applications. The overall goal was to deliver fresh data with minimal latency while maintaining a simple, resilient, and scalable architecture.

Our Solution

We implemented a fully automated ingestion pipeline using AWS-native services, orchestrated end-to-end with Airflow.
High-Level Flow:

(DB2 / AuroraDB / DynamoDB) → Producer (ECS) → SQS → Lambda Consumer → Redshift Datalake
                                     ↑
                            Airflow Orchestration
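
As a sketch of the producer's event model (field names here are illustrative assumptions, not the client's actual schema), each extracted row can be wrapped into a standardized JSON event like this:

```python
import json
import uuid
from datetime import datetime, timezone

def row_to_event(source, table, operation, row):
    """Wrap a database row in a standardized JSON event envelope.

    `operation` is "insert" or "update"; `row` is a dict of column values.
    The envelope fields below are illustrative, not the client's schema.
    """
    return json.dumps({
        "event_id": str(uuid.uuid4()),           # unique id, used later for deduplication
        "source": source,                        # e.g. "db2", "aurora", "dynamodb"
        "table": table,
        "operation": operation,
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "payload": row,
    })

# Example: a single updated row extracted from DB2
event = row_to_event("db2", "accounts", "update", {"account_id": 42, "balance": 1250.75})
```

The producer running on ECS would then publish each such event to the SQS queue (for example with boto3's `sqs.send_message`).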

Here’s how it works:
1. Data Extraction – Powered by Airflow + ECS
• Airflow DAGs orchestrate scheduled tasks.
• ECS Fargate containers execute Producer.py, connecting to databases (DB2, AuroraDB, DynamoDB).
• The producer converts every row into a standardized JSON event.
2. Event Publication to SQS
• Each event (insert/update) is published directly into an Amazon SQS Queue.
• This gives the system durability, retry capability, and fault isolation.
3. Real-Time Processing via Lambda
• The SQS queue triggers our Lambda Consumer, which processes events at scale.
• Lambda performs transformations, deduplication, and schema validation.
4. Ingestion into Redshift Datalake
• Lambda writes data into Redshift Lake tables using optimized COPY/SQL logic.
• All events are also archived into S3 for audit, traceability, and replay purposes.
5. Real-Time Availability for Reporting
• Data becomes available in near real time for BI dashboards, analytical workloads, and monitoring tools.

Technologies Used

Airflow
AWS ECS Fargate
Amazon SQS
AWS Lambda
Amazon Redshift
Amazon S3
DB2
Amazon Aurora
Amazon DynamoDB
RSE Applications
Python
Redshift COPY Command
JSON Event Model

Results & Impact

With this architecture, data is now available in near real time, enabling:
• Seconds-level ingestion latency
• Fault tolerance with retries & DLQs
• Consistent JSON-based data model
• Up-to-date dashboards and analytics
It’s a strong foundation for future enhancements such as CDC automation, streaming analytics, and ML-based anomaly detection.