Data Domain Architect Senior Associate - Agentic AI Evaluation & Annotation
Support the design, execution, and scaling of evaluation and annotation programs for agentic AI systems, with a focus on defining metrics, schemas, rubrics, and quality frameworks for multi-step reasoning, tool use, task completion, policy adherence, and safe agent behavior.
Job Description
As a Senior Associate in Consumer & Community Banking, you will support the development and operationalization of evaluation frameworks for agentic AI systems. This role will focus on how AI agents plan, reason, use tools, follow policies, recover from errors, and complete tasks across multi-turn, multi-step workflows. You will partner with data science, machine learning engineering, product, architecture, technology, and linguistics teams to define what “good” agent behavior looks like and translate that into measurable evaluation criteria. You will design annotation schemas, create rubrics for agent trajectories, train annotators, lead calibration exercises, maintain gold and challenge datasets, and help ensure evaluation outputs are consistent, scalable, auditable, and actionable.
The ideal candidate has hands-on experience evaluating LLM-based or agentic systems, including tool-calling behavior, planning quality, action sequencing, grounding, error recovery, and human-in-the-loop review workflows. This role requires someone who can go beyond language annotation and directly support the measurement and improvement of agentic AI performance.
Job Responsibilities
- Define and operationalize evaluation metrics for agentic AI workflows, including task success, step-level correctness, tool-use quality, policy adherence, recovery behavior, escalation decisions, and safe failure outcomes.
- Build and maintain agent-specific gold sets, challenge sets, and regression suites to assess planning quality, action sequencing, grounding, compliance boundaries, hallucination risk, loop detection, and release readiness.
- Design annotation schemas, rubrics, taxonomies, and labeling guidelines for evaluating agent trajectories across multi-turn, multi-tool, and workflow-based scenarios.
- Develop evaluation approaches for tool-using agents, including tool selection, tool-call precision and recall, argument correctness, response interpretation, and unnecessary or missing tool usage.
- Train, calibrate, and support annotators on agentic evaluation tasks, ensuring consistent application of schemas, rubrics, edge-case guidance, and quality expectations.
- Lead annotation quality routines, including calibration sessions, adjudication reviews, sampling, inter-annotator agreement analysis, feedback loops, and guideline refinement.
- Identify, classify, and report agent failure patterns, including incorrect planning, premature task completion, wrong tool use, invalid arguments, repeated actions, unsafe recommendations, and policy violations.
- Partner with ML engineers, product managers, technology and architecture teams, and data owners to align evaluation criteria with agent architecture, tool interfaces, orchestration logic, model behavior, and business requirements.
- Contribute to LLM-as-judge and automated evaluation workflows, including evaluator prompt design, rubric-based scoring, confidence thresholds, low-confidence flagging, and human-in-the-loop validation.
- Use prompt engineering and synthetic scenario generation to improve evaluation coverage, annotation instructions, pre-labeling workflows, and representative test cases for agentic systems.
- Produce clear reporting on agent performance, annotation quality, error trends, regression results, delivery status, release risks, and continuous improvement opportunities.
Required Qualifications, Capabilities, and Skills
- Master’s degree in Computer Science, Data Science, Computational Linguistics, Human-Computer Interaction, Cognitive Science, AI/ML, or a related field.
- 3+ years of experience supporting AI evaluation, annotation programs, ML-enabled products, LLM applications, conversational AI, workflow automation, or agentic AI systems.
- Hands-on experience evaluating LLM-based or agentic systems, including multi-step reasoning, planning quality, tool use, task completion, grounding, or workflow execution.
- Experience designing annotation schemas, evaluation rubrics, taxonomies, labeling guidelines, or grading standards for complex AI behaviors.
- Demonstrated ability to define measurable evaluation criteria for agentic workflows, including task success, step correctness, tool-call quality, policy adherence, recovery behavior, and escalation decisions.
- Experience training and calibrating annotators or reviewers on complex evaluation tasks, including rubric interpretation, edge-case resolution, adjudication, and quality feedback.
- Experience assessing tool-calling or API-using AI systems, including tool selection, argument accuracy, action sequencing, and interpretation of tool outputs.
- Working knowledge of agentic AI concepts such as planning, orchestration, tool invocation, context management, memory use, multi-turn execution, loop detection, and human handoff.
- Practical prompt engineering experience for LLM or agent evaluation workflows, including instruction refinement, evaluator prompts, pre-labeling, and synthetic test case generation.
- Hands-on Python experience for data analysis, cleaning, validation, automation, and evaluation result processing; experience using Git or similar version control tools.
- Strong analytical, communication, and documentation skills, with the ability to translate complex agent behavior into observable, measurable, and repeatable evaluation decisions.
Preferred Qualifications, Capabilities, and Skills
- Experience evaluating autonomous, semi-autonomous, or tool-using agents in production or pre-production environments.
- Experience building agent evaluation benchmarks, trajectory datasets, gold datasets, challenge sets, regression suites, or release-readiness test sets.
- Experience evaluating enterprise copilots, workflow agents, customer-service agents, operations agents, or multi-agent systems.
- Experience with agent observability, trace review, evaluation pipelines, model monitoring, quality dashboards, or regression analysis.
- Experience partnering with engineering teams to operationalize evaluation pipelines, automated scoring logic, validation checks, reporting, or audit-ready lineage.
- Experience applying automated quality checks, anomaly detection, or ML-based approaches to identify annotation inconsistencies or agent behavior regressions.
- Knowledge of emerging agentic AI evaluation methods, benchmarks, tooling, and best practices for measuring planning, tool use, recovery, grounding, and safe completion.