AgenticOps Platform Engineer Lead, Thrissur
Description
We are looking for a senior, hands-on AgentOps Platform Engineer to design, build, and operate the cloud-native infrastructure that powers our AI agents at scale.

This is a lead-by-example role:
- You write the Terraform
- You build the pipelines
- You own the platform in production

GCP is your primary environment, but you will design with multi-cloud in mind (AWS, Azure), ensuring portability, resilience, and long-term flexibility. This role sits at the intersection of DevOps, MLOps, and AgentOps, with deep responsibility for reliability, security, observability, and cost.

KEY RESPONSIBILITIES

Platform & Infrastructure Ownership
- Design, build, and operate production-grade infrastructure for AI agents and LLM services
- Own Terraform-based Infrastructure as Code for all environments (dev, uat, prod)
- Lead infrastructure decisions through hands-on implementation, not diagrams
- Build scalable foundations for:
  - Agent orchestration
  - Inference services
  - RAG pipelines
  - Vector stores
- Optimise cloud resources for performance and cost efficiency

AgentOps & AI Platform Enablement
- Enable safe, continuous operation of autonomous agents
- Design agent runtime environments with:
  - Isolation & sandboxing
  - Failover and recovery strategies
  - Controlled rollout mechanisms
- Support prompt/version management, agent configuration, and tool/plugin lifecycle
- Work closely with Agentic RAG engineers to operationalise research into production

CI/CD & Automation
- Build and maintain CI/CD pipelines for:
  - Infrastructure
  - Agent services
  - Prompt and config changes
  - Model/version rollouts
- Automate workflows for:
  - Vector DB updates
  - RAG index refreshes
  - Agent memory stores
  - Tool registration and validation
- Reduce manual ops toil aggressively through automation

Observability & Production Readiness
- Design and implement deep observability for agent systems:
  - Platform health
  - Agent execution metrics
  - Latency, cost, and throughput
  - Failure modes and retries
- Build dashboards, alerts, and telemetry using:
  - Prometheus
  - Grafana
  - OpenTelemetry (or equivalent)
- Enable visibility into agent decision traces and runtime behavior

Security, Safety & Reliability
- Implement secure cloud architecture and IAM best practices
- Own production reliability, incident response, and recovery
- Enforce operational guardrails and safety controls for agent APIs
- Support responsible AI practices from an infrastructure and runtime perspective

Collaboration & Technical Leadership
- Work closely with:
  - Agentic RAG engineers
  - AI engineers
  - Product & CTO Office
- Define SLOs, reliability targets, and operational metrics
- Set the technical bar for AgentOps at BridgeAI
- Mentor engineers by example and code, not process overhead

REQUIRED SKILLS & EXPERIENCE

Core Platform & DevOps
- 5+ years in DevOps, Platform Engineering, SRE, or MLOps
- Strong, hands-on experience with GCP:
  - GKE / Compute Engine
  - Cloud Run / Functions
  - Cloud Storage, Pub/Sub
  - Vertex AI (or equivalent)
- Deep experience with Terraform (mandatory)

Containers, CI/CD & Automation
- Docker, Kubernetes, Helm
- CI/CD tooling (GitHub Actions, Jenkins, ArgoCD)
- Python and Bash for automation and platform glue code

Agentic & AI Systems
- Experience supporting LLM-based systems in production
- Understanding of:
  - Prompt/version management
  - Context handling & caching
  - Model rollout strategies
- Hands-on experience with vector databases (Weaviate, FAISS, Pinecone)
- Familiarity with RAG pipelines and agent execution patterns

Observability & Security
- Monitoring and telemetry using Prometheus, Grafana, OpenTelemetry
- Strong understanding of cloud security, IAM, and operational safety

NICE TO HAVE
- Multi-cloud experience (AWS, Azure)
- Exposure to agent frameworks (LangChain, LangGraph, AutoGen, CrewAI)
- Event-driven systems (Temporal, Airflow)
- Experience with responsible AI operations or safety monitoring

WHAT SUCCESS LOOKS LIKE
- Infrastructure is reproducible, observable, and boring (in a good way)
- Agent failures are visible, debuggable, and recoverable
- Cloud costs are understood and controlled
- Engineers trust the platform and move faster because of it
- You are the go-to authority for AgentOps at BridgeAI

WHAT THIS ROLE IS (AND IS NOT)
- Deeply hands-on
- Terraform-first
- Production ownership
- Sets standards by building
- Not a people-manager role
- Not a ticket-based ops role
- Not a "just keep the lights on" job
More info about this ad

AgenticOps Platform Engineer Lead has been posted in the Trichūr Engineering category on Locanto.