Design and build production data pipelines for data ingestion and consumption within a big data architecture
Work with AWS tools and frameworks such as Spark (PySpark), Glue, Hudi, Kinesis, DMS, EMR, and Lambda for processing data.
Develop a data lake on S3 with landing, raw, trusted, and curated zones
Configure an AWS Redshift / Redshift Spectrum data lakehouse
Automate processes and infrastructure using the AWS CDK (Python).
Job Requirements:
Bachelor's/Master's degree in Engineering, Computer Science (or equivalent experience)
At least 8 years of relevant experience in data engineering
Experience working with Apache Spark RDDs and DataFrames, including the ability to create Spark jobs for data transformation and aggregation
Must be able to work with Glue, PySpark, Kinesis, DMS, EMR, and Lambda
Knowledge of Kubernetes (EKS), AWS, Kinesis, Istio, and Jaeger is essential
Must have expertise in Python, Redshift/Redshift Spectrum, Athena, and SQL
Understanding of how to work with file and table formats such as Hudi, Parquet, Avro, and ORC for large volumes of data
Experience working with NoSQL databases such as Cassandra, DocumentDB, and DynamoDB is a plus
Should be well-versed in data warehouse, data lake, and data lakehouse concepts
Deep understanding of Git, Gitflow-based workflows, and CI/CD
Well-versed in modern engineering best practices and design principles
Must have excellent communication, presentation, analytical, and problem-solving skills