We are seeking a highly skilled Site Reliability Engineer (SRE) for a long-term contract opportunity based in RTP (Research Triangle Park), North Carolina. The primary focus of this role is managing and scaling our multi-cloud infrastructure, specifically leveraging AWS and GCP environments, while ensuring high availability and performance of applications running on Kubernetes. This involves active troubleshooting of runtime applications, participating in a 24x7 on-call rotation for proactive monitoring and issue resolution, and collaborating closely with development and quality engineering teams to automate system health processes.
A core part of the role is managing and performance-tuning critical data components, such as databases (Postgres, Redis, Cassandra, Elasticsearch) or streaming data pipelines (Kafka, Flink, Storm, Spark, Kubeflow). Candidates must be proficient in Linux, Python, and Shell scripting, and have deep experience in Kubernetes cluster maintenance and in debugging containerized applications developed in Golang, Java, or Python. Successful candidates will also be expected to follow SRE best practices, write and maintain comprehensive runbooks, and use modern infrastructure-as-code tools such as Terraform and CloudFormation. Experience with monitoring solutions such as Prometheus, Grafana, and the ELK stack is highly desirable.
Key Requirements
Manage AWS/GCP Cloud infrastructure and Kubernetes resources.
Troubleshoot applications effectively in runtime environments.
Manage and performance-tune databases (Postgres, Redis, Cassandra, Elasticsearch) or streaming data pipelines (Kafka; Flink, Storm, Spark, or Kubeflow frameworks desirable).
Write and maintain comprehensive runbooks for knowledge-driven automated processes and bots.
Collaborate with developers and quality engineering teams to automate the monitoring, alerting, availability, and scalability of applications.
Participate in the on-call rotation and in proactive monitoring, diagnosis, and resolution of issues in a 24x7 multi-cloud environment (AWS/GCP).
Analyze failures and help software engineers debug production issues across microservices and distributed platforms.
Must have experience maintaining production systems on AWS and/or GCP.
Proficiency in Linux, Python, and Shell scripting is required.
Experience with continuous integration practices and tools (Jenkins, Travis CI, CircleCI, etc.) is required.