Site Reliability Engineer III
What We Want: Are you an experienced engineer who is passionate about building and managing highly available software and systems? Are you an excellent communicator looking for an interesting and ever-evolving career that allows you to work with cutting-edge cloud technology in a fast-paced and exciting environment? If so, we are building our new Site Reliability team and we would love to connect!
Our Site Reliability Engineers will be dedicated to proactively developing software and tools to monitor and improve the reliability of our company's systems and software at all levels. This includes anticipating production issues and implementing solutions before they impact users. To help lead this new initiative, we are hiring an experienced Site Reliability Engineer III. You would play a key role in our incident management operations, communicating across departments within SVG and working with us to build and lead this instrumental new team.
What You’ll Do:
The essential duties for this role include, but are not limited to:
- Leading software lifecycle activities and reviewing changes to determine how they may affect site reliability
- Proactively anticipating production issues, such as outages, slowness, processing delays, errors, and failures, and taking corrective action to prevent them
- Leading and improving the incident response process, which includes investigating issues, appropriately communicating key findings, and resolving issues, while collaborating with subject matter experts as needed throughout the process
- Designing, implementing, and improving company monitoring and alerting solutions, such as Splunk, Datadog, and CloudWatch
- Identifying and tracking Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
- Maintaining a record of system incidents and reliability metrics
- Creating and improving data-driven analyses that identify patterns and recommend preventative measures to avoid future incidents
- Leading performance tests to identify bottlenecks, opportunities for optimization, and capacity demands
- Designing, developing, updating, and supporting software solutions and automation using Python, Java, Node.js, Go, etc.
- Designing, building, managing, and supporting resources in Amazon Web Services
- Creating and updating company processes and procedures, documentation, knowledge base articles, and other resources related to Site Reliability Engineering and the incident management process
- Actively researching new technology and sharing knowledge with team members
- Participating in on-call rotations and triaging or resolving critical production issues