Site Reliability Engineer (SRE) - LLM and Machine Learning
Posted 2 days ago by Techruiter
We are a pioneering technology company specialising in cutting-edge Large Language Models (LLMs) and Machine Learning solutions. We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and performance of our LLM and Machine Learning infrastructure.
As an SRE, you will play a critical role in maintaining the stability and efficiency of our LLM and Machine Learning platforms. You will work closely with cross-functional teams to design, implement, and optimize infrastructure, monitor system health, and respond to incidents, enabling our researchers and engineers to focus on innovation.
Responsibilities
- Infrastructure Design and Automation: Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
- Deployment and Configuration: Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
- Monitoring and Alerting: Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues. Ensure optimal system performance.
- Incident Response: Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence.
- Capacity Planning: Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
- Security and Compliance: Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems.
- Continuous Improvement: Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimisation.
- Documentation: Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.
Requirements
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
- Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure.
- Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
- Experience with configuration management tools (e.g., Ansible, Terraform) and CI/CD pipelines.
- Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
- Scripting and automation skills (e.g., Python, Bash).
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration skills.