Site Reliability Engineering Manager - Western Cape

Job Summary

Role Overview:
As the Site Reliability Engineering Manager, you will oversee the SRE team and work closely with engineering, product, and infrastructure teams to ensure the continuous operation of our platform. You will be responsible for defining and driving the SRE strategy, implementing best practices, and ensuring that the systems we deliver are highly reliable, scalable, and resilient. Your leadership will be essential in creating a culture of operational excellence across the organization.

Key Responsibilities:
Lead, mentor, and develop a high-performing SRE team, fostering a culture of collaboration, accountability, and innovation. Shape the overall strategy for site reliability engineering across the company.
Own the availability, performance, and scalability of our production systems. Work proactively to identify potential issues and eliminate risks to ensure a seamless user experience.
Lead post-incident reviews and root cause analyses to drive continuous improvement. Implement and refine incident management processes and workflows.
Champion the development and implementation of automation tools to improve system reliability, reduce manual intervention, and enable faster recovery. Oversee the implementation of monitoring and alerting systems to ensure proactive issue detection.
Partner with software engineers, infrastructure teams, and product teams to design, build, and maintain systems that align with our high standards for availability, scalability, and performance.
Establish and drive best practices for system reliability, testing, and incident response. Regularly evaluate and enhance existing processes and tools.
Drive capacity planning and scaling strategies to meet the demands of our growing user base and business needs. Ensure that the system architecture is built to support future growth.
Ensure systems are secure and compliant with industry standards, safeguarding user data and privacy.

Required Qualifications:
BTech, Degree, Master's, or PhD in Computer Science, Information Technology, Information Systems, Computer Engineering or related fields.

Experience:
BTech in Computer Science, Information Technology, Information Systems, Computer Engineering or related fields coupled with 13 years relevant working experience; or
Degree in Computer Science, Information Technology, Information Systems, Computer Engineering or related fields coupled with 9 years relevant working experience; or
Master's Degree in Computer Science, Information Technology, Information Systems, Computer Engineering or related fields coupled with 7 years relevant working experience; or
PhD in Computer Science, Information Technology, Information Systems, Computer Engineering or related fields coupled with 5 years relevant working experience.

Computer and network infrastructure implementation
IT service, operations and management, including significant responsibility over Service Level Agreements
IT infrastructure or software team leadership
IT architecture and governance
Project management
IT systems engineering, application support, and user management
IT governance and security
Data governance and security
IT availability, resilience and redundancy
Systems analysis, design and engineering
Experience in supporting distributed software systems in a production environment such as Cloud and/or Data Centres
Procurement and IT asset management

Skills:
Essential:
Experience working with Linux and within the Open Source Software ecosystem
Experience with DevOps tools, processes and culture
Experience and/or certification and knowledge in SRE, ITIL or related IT management processes
Experience supporting and maintaining large-scale High-Performance Computing (HPC) and storage systems
Advanced experience with programming and/or scripting languages such as Python

If you're ready to make a lasting impact in the human capital development space and have the experience and passion to drive our site reliability initiatives, apply now!