Cloud Systems Reliability Engineer at ThoughtSpot
Bangalore, IN

We at ThoughtSpot are looking for a well-rounded software developer with a customer operations mindset to join our Cloud Systems Reliability Engineering (SRE) team. The right candidate for this role will have a passion for developing tools and automation to deploy, maintain, and troubleshoot distributed services and applications across cloud and on-premise infrastructure. You will be responsible for the big picture of how the services in the ThoughtSpot stack relate to each other, and you will use a breadth of tools and approaches to solve a broad spectrum of problems. Use cases include, but are not limited to, minimizing time spent on operational work, incident postmortems, and proactive identification of potential issues. The ability to work with cross-functional teams in a rapidly growing environment is important.

Responsibilities

  • Take a customer-first approach to understanding requirements, the nuances of different public cloud (AWS, Azure, or GCP) features, and operational costs, and then implement efficient solutions for running SaaS services. AWS is preferred.
  • Develop tools and frameworks for distributed systems, services and applications
  • Maintain, monitor, and troubleshoot infrastructure (IaaS).
  • Work with Engineering teams to define, architect and implement tools to enhance debuggability, supportability, availability, scalability and performance
  • Ensure all key services implement metrics, are monitored and raise appropriate alerts
  • Develop reliable and scalable systems to monitor production alerts and remediate them automatically
  • Be an expert in cloud and on-premise infrastructure by developing automation and best practices
  • Participate in on-call rotation for critical SRE systems, and lead incident review and root cause analysis
  • Handle NetOps and SecOps aspects for cloud and on-prem infrastructure.

Required skills/qualifications

  • B.Tech. degree in Computer Science or equivalent.
  • 5+ years of software development experience.
  • Prior experience as part of a Cloud SecOps or NetOps team. Strong expertise in Cloud Security and/or Cloud Networking.
  • Proficiency in programming in C/C++, Python, Go, or Java, and familiarity with Kubernetes.
  • Ability to diagnose technical problems, debug code, and automate tasks.
  • Strong problem-solving skills, algorithmic thinking, and a strong foundation in how systems should work.
  • Understanding of the tools and frameworks required to operate and manage cloud infrastructure.
  • Experience with Linux, virtualization, DevOps tools, and cloud technologies.
  • Solid communication skills and the ability to work independently.
  • DevOps mindset. You are familiar with Site Reliability Engineering (SRE) concepts. You treat operational issues as if they are software problems. You view software as a primary tool to manage, maintain, fix, and extend systems required to support large development environments. You promote operational excellence!
  • Automation. You leverage automation, monitoring, and data analysis to ensure high availability (HA) for internal services and infrastructure.
  • Problem solver. You love tackling the most difficult of challenges and know how to troubleshoot in order to get to the best solution.