AgenticOps Platform Engineer Lead (Baddi)

Description
We are looking for a senior, hands-on AgentOps Platform Engineer to design, build, and operate the cloud-native infrastructure that powers our AI agents at scale. This is a lead-by-example role:
- You write the Terraform
- You build the pipelines
- You own the platform in production

GCP is your primary environment, but you will design with multi-cloud in mind (AWS, Azure), ensuring portability, resilience, and long-term flexibility. This role sits at the intersection of DevOps, MLOps, and AgentOps, with deep responsibility for reliability, security, observability, and cost.

KEY RESPONSIBILITIES

Platform & Infrastructure Ownership
- Design, build, and operate production-grade infrastructure for AI agents and LLM services
- Own Terraform-based Infrastructure as Code for all environments (dev, uat, prod)
- Lead infrastructure decisions through hands-on implementation, not diagrams
- Build scalable foundations for:
  - Agent orchestration
  - Inference services
  - RAG pipelines
  - Vector stores
- Optimise cloud resources for performance and cost efficiency

AgentOps & AI Platform Enablement
- Enable safe, continuous operation of autonomous agents
- Design agent runtime environments with:
  - Isolation & sandboxing
  - Failover and recovery strategies
  - Controlled rollout mechanisms
- Support prompt/version management, agent configuration, and tool/plugin lifecycle
- Work closely with Agentic RAG engineers to operationalise research into production

CI/CD & Automation
- Build and maintain CI/CD pipelines for:
  - Infrastructure
  - Agent services
  - Prompt and config changes
  - Model/version rollouts
- Automate workflows for:
  - Vector DB updates
  - RAG index refreshes
  - Agent memory stores
  - Tool registration and validation
- Reduce manual ops toil aggressively through automation

Observability & Production Readiness
- Design and implement deep observability for agent systems:
  - Platform health
  - Agent execution metrics
  - Latency, cost, and throughput
  - Failure modes and retries
- Build dashboards, alerts, and telemetry using:
  - Prometheus
  - Grafana
  - OpenTelemetry (or equivalent)
- Enable visibility into agent decision traces and runtime behavior

Security, Safety & Reliability
- Implement secure cloud architecture and IAM best practices
- Own production reliability, incident response, and recovery
- Enforce operational guardrails and safety controls for agent APIs
- Support responsible AI practices from an infrastructure and runtime perspective

Collaboration & Technical Leadership
- Work closely with:
  - Agentic RAG engineers
  - AI engineers
  - Product & CTO Office
- Define SLOs, reliability targets, and operational metrics
- Set the technical bar for AgentOps at BridgeAI
- Mentor engineers by example and code, not process overhead

REQUIRED SKILLS & EXPERIENCE

Core Platform & DevOps
- 5+ years in DevOps, Platform Engineering, SRE, or MLOps
- Strong, hands-on experience with GCP:
  - GKE / Compute Engine
  - Cloud Run / Cloud Functions
  - Cloud Storage, Pub/Sub
  - Vertex AI (or equivalent)
- Deep experience with Terraform (mandatory)

Containers, CI/CD & Automation
- Docker, Kubernetes, Helm
- CI/CD tooling (GitHub Actions, Jenkins, ArgoCD)
- Python and Bash for automation and platform glue code

Agentic & AI Systems
- Experience supporting LLM-based systems in production
- Understanding of:
  - Prompt/version management
  - Context handling & caching
  - Model rollout strategies
- Hands-on experience with vector databases (Weaviate, FAISS, Pinecone)
- Familiarity with RAG pipelines and agent execution patterns

Observability & Security
- Monitoring and telemetry using Prometheus, Grafana, OpenTelemetry
- Strong understanding of cloud security, IAM, and operational safety

NICE TO HAVE
- Multi-cloud experience (AWS, Azure)
- Exposure to agent frameworks (LangChain, LangGraph, AutoGen, CrewAI)
- Workflow orchestration and event-driven systems (Temporal, Airflow)
- Experience with responsible AI operations or safety monitoring

WHAT SUCCESS LOOKS LIKE
- Infrastructure is reproducible, observable, and boring (in a positive way)
- Agent failures are visible, debuggable, and recoverable
- Cloud costs are understood and controlled
- Engineers trust the platform and move faster because of it
- You are the go-to authority for AgentOps at BridgeAI

WHAT THIS ROLE IS (AND IS NOT)
- Deeply hands-on
- Terraform-first
- Production ownership
- Sets standards by building
- Not a people-manager role
- Not a ticket-based ops role
- Not a “just keep the lights on” job

Apply on Kit Job: kitjob.in/job/4melej
AgenticOps Platform Engineer Lead (Baddi) has been posted in the Baddi Engineering category on Locanto.
