Build and maintain scalable data pipelines and workflows within the Lakehouse environment.
Transform, cleanse, and aggregate data using Spark SQL or PySpark (illustrated in the sketch after this list).
Optimize Spark jobs for performance, cost efficiency, and reliability.
Develop and manage Lakehouse tables for efficient data storage and versioning.
Utilize notebooks for interactive data exploration, analysis, and development.
Implement data quality checks and monitoring to ensure accuracy and reliability.
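A minimal PySpark sketch of the transform-cleanse-aggregate pattern above, including a basic data quality check. The table and column names (raw.orders, gold.daily_revenue, order_id, amount, region, order_ts) are hypothetical placeholders, and Delta is assumed as the Lakehouse table format:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source table; substitute your own catalog/schema.
orders = spark.table("raw.orders")

# Cleanse: drop duplicates and rows missing the business key,
# and normalize the amount column to a fixed-precision decimal.
clean = (
    orders
    .dropDuplicates(["order_id"])
    .filter(F.col("order_id").isNotNull())
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
)

# Simple data quality check: fail fast if cleansing removed every row.
if clean.count() == 0:
    raise ValueError("Quality check failed: no valid rows after cleansing")

# Aggregate: daily revenue and order counts per region.
daily = (
    clean
    .groupBy("region", F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("order_count"))
)

# Persist as a managed Delta table for downstream consumers.
daily.write.format("delta").mode("overwrite").saveAsTable("gold.daily_revenue")
```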
Drive Automation:
Implement automated data ingestion processes using the data platform's native ingestion capabilities, optimizing for performance and minimizing manual intervention (see the upsert sketch after this list).
Design and implement end-to-end data pipelines, incorporating transformations, data quality checks, and monitoring.
Utilize CI/CD tools (Azure DevOps/GitHub Actions) to automate pipeline testing, deployment, and version control (a sample unit test follows the sketch below).
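A sketch of an idempotent ingestion step using a Delta Lake MERGE, so that reruns upsert rather than duplicate rows. It assumes the delta-spark package is available; the landing path, table name, and key column (/landing/customers/, silver.customers, customer_id) are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical landing zone written by the platform's ingestion tooling.
updates = spark.read.format("parquet").load("/landing/customers/")

# Target Lakehouse table to merge into.
target = DeltaTable.forName(spark, "silver.customers")

# Upsert: update rows whose key already exists, insert the rest.
# Because MERGE is keyed, re-running the job is safe (idempotent).
(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```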
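And a minimal pytest unit test of the kind a CI pipeline (an Azure DevOps or GitHub Actions job running pytest) would execute before deployment. The transforms module and its cleanse_orders function are hypothetical stand-ins for your own pipeline code:

```python
# test_transforms.py
import pytest
from pyspark.sql import SparkSession

from transforms import cleanse_orders  # hypothetical module under test


@pytest.fixture(scope="session")
def spark():
    # Small local session so the test runs on any CI agent.
    return (
        SparkSession.builder.master("local[2]").appName("ci-tests").getOrCreate()
    )


def test_cleanse_drops_null_keys_and_duplicates(spark):
    raw = spark.createDataFrame(
        [(1, 10.0), (None, 5.0), (1, 10.0)],  # one duplicate, one null key
        ["order_id", "amount"],
    )
    result = cleanse_orders(raw)
    assert result.count() == 1
    assert result.filter("order_id IS NULL").count() == 0
```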
Enterprise Data Warehouse (EDW) Management:
Create and maintain data models, schemas, and documentation for the EDW.
Collaborate with data analysts, data scientists, and business stakeholders to gather requirements, design data marts (a schema sketch follows this list), and support reporting and analytics initiatives.
Troubleshoot and resolve any issues related to data loading, transformation, or access within the EDW.
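One way such a data mart might be laid out: a small star schema defined through Spark SQL from PySpark. Every schema, table, and column name here is illustrative, and Delta is again assumed as the table format:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE SCHEMA IF NOT EXISTS mart")

# Dimension: who bought.
spark.sql("""
    CREATE TABLE IF NOT EXISTS mart.dim_customer (
        customer_key  BIGINT,
        customer_name STRING,
        region        STRING
    ) USING DELTA
""")

# Dimension: when they bought.
spark.sql("""
    CREATE TABLE IF NOT EXISTS mart.dim_date (
        date_key       INT,
        calendar_date  DATE,
        fiscal_quarter STRING
    ) USING DELTA
""")

# Fact table keyed to both dimensions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS mart.fact_sales (
        customer_key BIGINT,        -- FK to dim_customer
        date_key     INT,           -- FK to dim_date
        amount       DECIMAL(18,2),
        quantity     INT
    ) USING DELTA
""")
```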