Site Reliability Engineer (SRE) - LLM and Machine Learning
Posted 10 hours 43 minutes ago by Techruiter
Permanent
Not Specified
Other
London, United Kingdom
Job Description
We are a pioneering technology company specialising in cutting-edge Large Language Models (LLMs) and Machine Learning solutions. We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and performance of our LLM and Machine Learning infrastructure.
As an SRE, you will play a critical role in maintaining the stability and efficiency of our LLM and Machine Learning platforms. You will work closely with cross-functional teams to design, implement, and optimise infrastructure, monitor system health, and respond to incidents, enabling our researchers and engineers to focus on innovation.
Responsibilities
- Infrastructure Design and Automation: Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
- Deployment and Configuration: Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
- Monitoring and Alerting: Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues. Ensure optimal system performance.
- Incident Response: Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence.
- Capacity Planning: Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
- Security and Compliance: Collaborate with security teams to implement security best practices, carry out vulnerability assessments, and meet compliance requirements for LLM and Machine Learning systems.
- Continuous Improvement: Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimisation.
- Documentation: Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.
Requirements
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
- Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure.
- Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerisation technologies (e.g., Docker, Kubernetes).
- Experience with configuration management and infrastructure-as-code tools (e.g., Ansible, Terraform) and CI/CD pipelines.
- Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
- Scripting and automation skills (e.g., Python, Bash).
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration skills.