The SRE Engineer on our team is responsible for the reliability of our infrastructure platforms, with a focus on availability and stability. This includes designing upgrades, overseeing changes, monitoring releases, and improving platform performance to provide quality services to our customers. While there is a strong focus on traditional operations functions, such as resolving incidents and taking part in crisis calls, there is an equal focus on developing automated self-healing solutions, informed by root cause analyses, that make the platform more resilient.
Metrics and Monitoring – Monitor the performance of server and storage platforms, design SLA, SLI, and SLO dashboards, and build analytics to predict anomalies. Improve SLIs and demonstrate continuous improvement in metrics.
Capacity Planning – Forecast platform resource requirements and identify potential performance and demand bottlenecks beforehand.
Change Management – Create release plans and automated release engineering workflows to deliver and install patches on the platforms with zero downtime.
Emergency Response – On-call support, root cause analysis, and blameless postmortems.
Automation – Automate repetitive tasks (toil management) and build automated health checks and self-healing environments.
Agile/DevOps Engineering – Work in a product operating model based on Agile/Scrum practices.
- Identify scalability bottlenecks and areas for performance improvements.
- Collaborate with the engineering team to propose features that solve recurring patterns of customer complaints.
- Experience in SRE practices / AIOps.
- Experience in desktop and server operating systems.
- Experience in Linux and OpenStack.
- Experience in scripting languages such as shell and Python.
- Experience in dashboard tools such as Grafana.
- Knowledge of server, storage, network, security, and firewall technologies.
- Knowledge of Agile/DevOps toolsets and automated release management.
- Experience in containers (Docker, Kubernetes).
- Experience in building DevOps CI/CD pipelines using Jenkins.
- Ability to provide technical guidance to development teams on improving our customer experience, SLAs, SLIs, and SLOs.
- Ability to handle multiple priorities in a fast-paced environment.
- Exposure to three-tier architecture: frontend, middle tier, and backend database.
- Effective use of application monitoring tools such as Dynatrace, Elasticsearch/Splunk, or similar.
- Teamwork, problem-solving, and good written and verbal communication skills.
- Ability to design solutions for a wide range of requirements.
- Ability to design and implement infrastructure and application monitoring using tools such as Dynatrace, Elasticsearch/Splunk, or similar.
- Experience in MySQL, PostgreSQL.
- Knowledge of serverless architecture.
- Knowledge of any of the cloud platforms: AWS, Google Cloud, Azure.
- Any of the infrastructure certifications: VMware, MCSE, Red Hat, OpenStack.