A Bachelor's degree and a minimum of 3 years of relevant experience as a data engineer
Hands-on deployment experience with Hadoop/Spark, Scala, MySQL, Redshift, and AWS or other cloud-based systems
Comfortable writing code in Python, Ruby, Perl, or an equivalent scripting language
Experience with Cosmos/Scope, SQL, or Hadoop
At least 3 years of professional work experience programming in Python, Java, or Scala
2+ years of experience with distributed computing frameworks such as Apache Spark and Hadoop
Responsibilities
Design and develop ETL (extract-transform-load) processes to validate and transform data, calculate metrics and attributes, and populate data models using Hadoop, Spark, SQL, and other technologies (see the ETL sketch after this list)
Experience with cloud technologies such as Amazon S3 and cloud-hosted databases
Lead by example, demonstrating best practices for code development and optimization, unit testing, CI/CD, performance testing, capacity planning, documentation, monitoring, alerting, and incident response to ensure data availability, quality, usability, and performance.
Use programming languages such as SAS, R, Python, and SQL to create automated data gathering, cleansing, reporting, and visualization processes.
Implement systems for tracking data quality, usage, and consistency
Design and develop new data products using appropriate programming languages
Monitor and maintain system health and security
Oversee administration of, and improvements to, source control and the deployment process.
Prepare unit tests for all work to be released to our live environment, including data validation scripts for data set releases or changes (see the test sketch after this list)
Implement database performance tuning based on monitoring data
Design and implement data products using Hadoop technologies
Produce clear documentation of process flow diagrams and best practices
Design and implement multi-source data channels and ETL processes
Working experience with AWS services such as EMR, Athena, Glue, Redshift, and Lambda.
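As referenced in the ETL responsibility above, the following is a minimal sketch of that kind of validate-transform-load step, assuming PySpark; the S3 paths, the events dataset, and its user_id/amount columns are illustrative assumptions, not the team's actual sources or data model.

```python
# Minimal ETL sketch, assuming PySpark and a hypothetical "events" dataset.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw source data (bucket, path, and schema are assumptions).
raw = spark.read.json("s3://example-bucket/raw/events/")

# Transform: validate rows, then calculate simple per-user metrics.
valid = raw.filter(F.col("user_id").isNotNull() & (F.col("amount") >= 0))
metrics = valid.groupBy("user_id").agg(
    F.count("*").alias("event_count"),
    F.sum("amount").alias("total_amount"),
)

# Load: populate the target data model (Parquet location is also an assumption).
metrics.write.mode("overwrite").parquet("s3://example-bucket/models/user_metrics/")

spark.stop()
```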
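And a hedged sketch of the data validation unit tests mentioned above, assuming pytest; the check_no_null_keys helper and the inline fixture rows are hypothetical stand-ins for a real release-candidate dataset.

```python
# Data validation unit test sketch, assuming pytest (run with: pytest).

def check_no_null_keys(rows, key):
    """Return True only if every row has a non-null value for `key`."""
    return all(row.get(key) is not None for row in rows)


def test_user_id_is_never_null():
    # Illustrative fixture rows; a real test would load the data set under release.
    rows = [{"user_id": 1, "amount": 10.0}, {"user_id": 2, "amount": 0.0}]
    assert check_no_null_keys(rows, "user_id")


def test_amounts_are_non_negative():
    rows = [{"user_id": 1, "amount": 10.0}, {"user_id": 2, "amount": 0.0}]
    assert all(row["amount"] >= 0 for row in rows)
```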