Design and build production data pipelines for ingesting consumption data into a big data architecture
Process data with a variety of AWS services and big data tools, including Spark (PySpark), Glue, Hudi, Kinesis, DMS, EMR, and Lambda (a brief pipeline sketch follows this list)
Develop a data lake on S3 with landing, raw, trusted, and curated zones
Configure an AWS Redshift/Redshift Spectrum data lakehouse
Automate processes and infrastructure using the AWS CDK in Python (a second sketch follows this list)
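As a rough illustration of the pipeline work described above, here is a minimal PySpark sketch that promotes newly landed consumption events from a landing zone to the raw zone as partitioned Parquet. The bucket name, prefixes, and partition column are assumptions for illustration only, not details from this posting.

    # Minimal PySpark sketch: read landed JSON events and write them to the raw zone.
    # The bucket name, prefixes, and partition column below are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("consumption-ingest").getOrCreate()

    # Read newly landed consumption events from the (hypothetical) landing zone.
    landing = spark.read.json("s3://example-data-lake/landing/consumption/")

    # Stamp each record with its ingestion date before promoting it to the raw zone.
    raw = landing.withColumn("ingest_date", F.current_date())

    # Append partitioned Parquet to the (hypothetical) raw-zone prefix.
    (raw.write
        .mode("append")
        .partitionBy("ingest_date")
        .parquet("s3://example-data-lake/raw/consumption/"))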
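Likewise, as a minimal sketch of the infrastructure-automation side, the AWS CDK (Python) stack below declares one S3 bucket per data lake zone. The stack name, construct IDs, and retention setting are assumptions for illustration only.

    # Minimal AWS CDK (Python) sketch: one S3 bucket per data lake zone.
    # The stack name, construct IDs, and removal policy are illustrative assumptions.
    from aws_cdk import App, Stack, RemovalPolicy, aws_s3 as s3
    from constructs import Construct

    class DataLakeStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)
            # One versioned bucket per zone: landing, raw, trusted, curated.
            for zone in ("landing", "raw", "trusted", "curated"):
                s3.Bucket(
                    self,
                    f"{zone.capitalize()}ZoneBucket",
                    versioned=True,
                    removal_policy=RemovalPolicy.RETAIN,  # keep data if the stack is deleted
                )

    app = App()
    DataLakeStack(app, "data-lake-example")
    app.synth()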
Job Requirements:
Bachelor's/Master's degree in Engineering or Computer Science (or equivalent experience)
At least 3 years of relevant experience as a back-end engineer
Experience working with Python is required
Prior experience working in a high-growth startup is ideal
Knowledge of SDLC practices (unit testing, Git, Jira, etc.) will be helpful
Must be able to write complex database queries in SQL
Understanding of Postgres and other relational databases is required
Experience working with cloud computing platforms and their services (AWS EC2, S3, Athena, Kinesis, Lambda, Kafka, and/or Google Cloud Compute Engine, Cloud Storage, BigQuery, Bigtable, etc.)
Must be willing to collaborate and possess excellent written and verbal communication skills
Experience working with machine learning tools such as TensorFlow, PyTorch, scikit-learn, pandas, etc., is preferred
A track record of working with Spark will be beneficial
Ability to work with GPUs/TPUs will be an added advantage