Infrastructure SRE Team Lead

Cape Town, WC, ZA, South Africa

Job Description

A vacancy exists for an Infrastructure SRE Team Lead within the Micro Merchant Division - Kazang, in Century City, Cape Town.



This role is ideal for a seasoned Infrastructure SRE professional looking to take on a leadership position and drive innovation within a dynamic team.



We are seeking an experienced Infrastructure Site Reliability Engineer (SRE) Team Lead with deep expertise in Linux-based, open-source environments to lead a team ensuring the reliability, scalability, and performance of our critical systems. This role involves technical leadership, strategic planning, and hands-on implementation of automated solutions for system monitoring, optimization, and infrastructure management. You will collaborate with the DevOps and engineering teams, guiding best practices in CI/CD, observability, and infrastructure automation, while mentoring a team to enhance system resilience and operational efficiency.



Key Responsibilities include, but are not limited to:



  • Lead and mentor a team, fostering a culture of reliability, automation, and continuous improvement.
  • Provide technical guidance and career development support for team members.
  • Design, implement, and maintain reliable systems in a Linux and open-source environment to meet uptime and performance objectives.
  • Support the DevOps team with CI/CD pipelines, ensuring seamless and reliable deployments.
  • Manage and optimize AWS-based infrastructure for scalability, cost efficiency, and performance.
  • Develop and maintain monitoring and alerting systems to ensure observability and proactively address system issues.
  • Build and maintain robust solutions for metric collection, dashboarding, and alerting to provide actionable insights and real-time system visibility.
  • Conduct root cause analysis for incidents, implementing preventive measures to improve system resilience.
  • Perform regular system maintenance, including updates, patches, and optimizations.
  • Prepare and deliver comprehensive reporting on system performance, incidents, and reliability metrics.
  • Identify and mitigate risks to system reliability, scalability, and security.
  • Ensure compliance with organizational and regulatory standards in system design and operations.
  • Manage on-call rotations and incident response protocols.

In order to be considered for this position, the following requirements must be met:


Bachelor of Science or any related tertiary qualification.



A minimum of 5 years of professional experience in Site Reliability Engineering, DevOps, or a related field, with demonstrated expertise in Linux-based, open-source environments and cloud infrastructure (AWS), and the ambition to progress into a leadership capacity. Proven ability to mentor and develop team members.

Competencies required:



  • Excellent leadership and communication skills.
  • Strategic thinker with a proactive and results-oriented approach.
  • Ability to build and maintain strong cross-functional relationships.
  • High attention to detail and ability to enforce best practices.
  • Passion for technology and continuous learning.
  • Strong problem-solving and analytical skills.
  • Expertise in diagnosing and resolving complex system issues, including performance bottlenecks, service outages, and application errors, using debugging tools, logs, and monitoring data.
  • Proficiency in at least one programming or scripting language (e.g., Python, Bash, Go), with the ability to write automation scripts, develop tools, and optimize system performance.
  • Hands-on experience with AWS services (e.g., EC2, S3, RDS, VPC), with the ability to design, manage, and optimize cloud-based infrastructure for scalability, reliability, and cost-efficiency.
  • Skilled in implementing monitoring solutions and designing systems for metrics collection, dashboarding, and alerting to ensure system health and performance.
  • Proficiency with tools like Ansible, Terraform, or similar frameworks to automate system management, deployments, and configurations, reducing manual effort and ensuring consistency.
  • Demonstrates a proactive and analytical approach to identifying issues, diagnosing root causes, and implementing effective solutions in complex technical environments.
  • Works effectively with cross-functional teams, including DevOps, development, and operations, fostering a culture of shared ownership and open communication to achieve reliability goals.
  • Embraces change, learns new technologies quickly, and adjusts strategies to meet evolving system and organizational needs, particularly in fast-paced, dynamic environments.



Job Detail

  • Job Id
    JD1413808
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Cape Town, WC, ZA, South Africa
  • Education
    Not mentioned