StarTree is seeking exceptional Site Reliability Engineers (SREs) to manage, tune, and debug large-scale, highly available distributed systems. You will work with a team of passionate and talented engineers on the automation, tuning, and troubleshooting of Apache Pinot and SQL databases. We are looking for motivated, hardworking, and focused individuals who have a real passion for operational excellence, data systems, and automation.
Responsibilities:
Leverage various monitoring and alerting services to solve intricate programming problems at scale
Manage and tune multiple critical customer-facing Apache Pinot clusters
Monitor availability, read/write latencies, and other key telemetry to proactively identify SLO misses and help mitigate issues
Build rapport and work closely with customers to mitigate and resolve incidents
Execute disaster recovery strategies with minimal downtime
Collaborate with other engineers to understand and troubleshoot systems, and use the experience gained to influence the roadmaps of other teams
Requirements:
5+ years of experience as an engineer (SRE, SDET, or development)
Experience managing highly available, production-facing distributed systems and in-depth knowledge of Java are a plus
Experience with cloud platforms such as AWS, GCP, or Azure
Experience with Kubernetes and container orchestration
Familiarity with streaming systems, such as Kafka, Pulsar, Flume, Flink, Spark, or similar
Knowledge of standard methodologies related to security, performance, and disaster recovery
Strong troubleshooting and critical thinking skills