Terminal Bench Expert (Vapi)
Vapi, India
Posted: less than a week ago
Description
Employment Type: Contractor assignment (no medical/paid leave)
Skills:
- 3–10 years of experience in software engineering or relevant domains.
- Strong debugging, reasoning, and analytical skills.
About the Role:
Looking for highly analytical engineers, researchers, and domain specialists to contribute benchmark tasks for AI agent evaluation systems (e.g., Terminal-Bench). Design realistic, technically deep tasks simulating real-world scenarios such as debugging, data corruption, infrastructure failures, and complex workflows.
What does day-to-day look like:
- Design high-quality Terminal-Bench task ideas and specifications.
- Develop complex tasks requiring reasoning, investigation, and debugging.
- Write clear task descriptions, solution approaches, and verification logic.
- Define deterministic, outcome-based evaluation criteria.
- Identify realistic failure modes, edge cases, and operational constraints.
- Create tasks that challenge AI systems while remaining solvable by experts.
- Collaborate with reviewers to refine task quality and difficulty.
- Contribute expertise across one or more specialized domains.
Required Skills:
- 3–10 years of experience in software engineering or relevant domains.
- Strong debugging, reasoning, and analytical skills.
- Good understanding of system design, workflows, and dependencies.
- Ability to analyze complex systems across multiple layers.
- Experience with production systems, pipelines, or large-scale workflows.
- Solid technical writing and documentation skills.
- Exposure to LLMs, agentic systems, or AI evaluation frameworks.
- Experience reviewing technical specifications or designing validation logic.
Domains (Any of the following):
- Software Engineering & Code Operations
- Debugging & Codebase Navigation
- System Administration & Shell Workflows
- File & Text Processing Pipelines
- Data Engineering (ETL & Data Pipelines)
- Database & SQL Operations
- Machine Learning Pipelines & MLOps
- Post-training & Model Finetuning Workflows
- AI Evaluation & Benchmarking Systems
- Retrieval, Search & Ranking Systems
- GPU / Systems Performance Optimization
- Distributed Systems & Infrastructure
- Cloud & Platform Engineering
- DevOps & CI/CD Systems
- Build & Dependency Management
- Scientific & Numerical Computing
- Simulation & Optimization Systems
- Formal Methods & Theorem Proving
- Document & Structured Data Processing (PDFs, Excel, etc.)
- Media Processing (Video, Audio, Images via CLI tools)
- Programmatic Graphics & Design (SVG, layout, rendering)
- Data Visualization & Reporting Workflows
- Geospatial & Spatial Data Processing
- Time-series & Forecasting Systems
- Security, Forensics & Reverse Engineering
- Cybersecurity & Vulnerability Analysis
- Networking & API Integration Workflows
- Automation & Multi-step Toolchain Orchestration
- CLI Tooling & Developer Tool Workflows
- Version Control & Git Workflows
- Observability, Logging & Monitoring
- Storage Systems & File Systems
- Finance & Accounting Workflows
- Quantitative Finance & Risk Modeling
- Legal & Compliance Workflows
- Healthcare & Clinical Data Processing
- Supply Chain & Logistics Operations
- Marketing & Growth Analytics
- CRM & Sales Operations
- HR & Recruiting Analytics
- Consulting & Strategy Modeling
- Investment Workflows
- Operations Research & Decision Optimization
- Benchmark Infrastructure, Adapters & Harness
Evaluation Process (approximately 45 mins):
- One round of technical evaluation (45 mins)
Apply on Kit Job: kitjob.in/job/4mo126
Highlights
- Company name: Codefeast
- Job position: Terminal Bench Expert (Vapi)