Beyond the model: Evaluating AI agricultural advisory systems so they work in the field

Key takeaways

Evaluating AI farmer advisory systems requires assessing real-world usefulness, usability, and trust—factors that matter as much as technical scores.

Assessments should happen at three levels: models, systems, and processes all shape farmer outcomes.

Equity should be measured, not assumed; inclusive benchmarking reveals who benefits—and who is left out.

Agricultural advisory services are increasingly adopting generative AI (gen AI) systems, including tools based on large language models (LLMs) such as chatbots, to provide farmers with tailored information on everything from how to manage pests to changes in commodity prices. However, developers of these tools face many challenges, including the need to function seamlessly in local languages and in different geographical areas and contexts.

Given these issues, developers must reliably ensure that AI tools perform safely and effectively in diverse real-world settings. A key approach is the benchmarking process. An AI benchmark is a standardized evaluation framework for comparing models across specific tasks (e.g., answering questions) and metrics (e.g., accuracy). For LLMs, a benchmark serves as a common testbed for assessing capabilities such as reasoning, factual knowledge, following instructions, robustness, and safety.
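To make the definition concrete, here is a minimal sketch of a benchmark in this sense: a fixed set of question-and-answer items (the task) scored with exact-match accuracy (the metric). The items, the `toy_model` stand-in, and the normalization rule are all illustrative assumptions, not part of any benchmark discussed in this post.

```python
from typing import Callable, Sequence, Tuple

# Illustrative items only: (question, reference answer) pairs.
BENCHMARK: Sequence[Tuple[str, str]] = [
    ("Which nutrient do legumes fix from the atmosphere?", "nitrogen"),
    ("What process do plants use to convert light into chemical energy?", "photosynthesis"),
]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def accuracy(model: Callable[[str], str], items: Sequence[Tuple[str, str]]) -> float:
    """Share of items where the model's answer exactly matches the reference."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = sum(normalize(model(q)) == normalize(ref) for q, ref in items)
    return correct / len(items)


def toy_model(question: str) -> str:
    """Hypothetical stand-in for an LLM call; answers only one kind of question."""
    return "Nitrogen" if "legumes" in question else "unknown"


score = accuracy(toy_model, BENCHMARK)
print(f"exact-match accuracy: {score:.2f}")
```

Because every model is run against the same items and scored by the same rule, the resulting numbers are comparable, which is what makes the testbed "standardized."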

Currently, most LLM benchmarking efforts focus on achieving high scores on technical metrics such as accuracy and correctness. While useful, this approach captures only a fraction of what makes agricultural advisory successful. Advisory systems must not only be agronomically sound and context-specific, but also easy to understand for users with varying literacy levels. Factors like usability, linguistic diversity, and trust are often ignored in model-centric evaluations, yet they determine whether a farmer actually uses the service.

The consensus among development practitioners is shifting: benchmarking must move beyond isolated model tests toward collaborative approaches that consider model behavior, system performance, and governance to ensure more trustworthy and reliable advisory tools for farmers.

Recognizing the gaps and potential for change, the AGX AI community—which aims to accelerate the development of responsible AI for small-scale producers in Africa and Asia—is bringing together researchers and practitioners to assess how best to evaluate these systems. On November 6, 2025, some AGX AI community members met during the IFPRI webinar session, Benchmarking LLMs for Agricultural Advisory: Insights from a Global Community of Practice. This post reviews key insights from that meeting.

From fragmented benchmarks to multilevel evaluation frameworks

The current landscape of AI evaluation in agriculture is innovative but fragmented. Most initiatives are tied to specific use cases and rely on different datasets or criteria, making it difficult to compare results across different geographies or value chains. This lack of mutual understanding and shared measurement systems prevents these efforts from offering broadly agreed-upon guidance that builds institutional trust.

To address this issue, one solution suggested during the webinar is that benchmarking should operate across three complementary levels: model, system, and process.

The model level corresponds to assessing LLMs’ technical capabilities, like reasoning and factual accuracy. The system level involves evaluating relevance, usefulness, and the actual user experience in the field. This leaves the process level to examine governance, risk mitigation, and the role of human experts in the solution’s lifecycle.
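One way to picture the three levels is as a single evaluation record that keeps the indicators for each level side by side. The sketch below is an assumption for illustration only: the class name, field names, and example indicators are hypothetical and do not come from the webinar or from any existing tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvaluationReport:
    """Hypothetical container for indicators at the three benchmarking levels."""
    model: Dict[str, float] = field(default_factory=dict)    # e.g. factual accuracy
    system: Dict[str, float] = field(default_factory=dict)   # e.g. task completion in the field
    process: Dict[str, float] = field(default_factory=dict)  # e.g. share of answers reviewed by experts

    def missing_levels(self) -> List[str]:
        """Return the levels for which no indicator has been recorded."""
        levels = {"model": self.model, "system": self.system, "process": self.process}
        return [name for name, indicators in levels.items() if not indicators]


# A report that only contains model-level scores is flagged as incomplete.
report = EvaluationReport(model={"factual_accuracy": 0.9})
print(report.missing_levels())
```

The point of the structure is simply that a strong model-level score cannot stand in for the other two levels: an evaluation with empty system or process sections is visibly unfinished.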

Ultimately, this structured trilevel approach strives to bridge the gap between technical performance and the sociotechnical realities that determine whether a gen AI-enabled advisory service is truly scalable.

This trilevel approach to evaluation mirrors the lifecycle of an AI advisory system from development to deployment. The framework employed by Precision Development (PxD) is a promising example. It treats the AI advisory lifecycle as an iterative approach involving a user journey assessment (system level), an AI assessment (model level), and a product assessment (process level).

Learning from the ‘user journey’

At the system level, PxD’s “user journey” assessment involves a detailed investigation of users’ interactions and perceptions of AI-based tools such as its chatbot-powered mobile application for plant protection advisory. The model level includes evaluating the response quality of the AI chatbot. Finally, at the process level, PxD relies on an in-house end-to-end product assessment, including stability and how its AI tools handle errors.

Evaluations collected in the “user journey” segment can produce insights on areas that need improvement across all three levels. One PxD product assessment found that navigating the user interface (UI) was confusing for many. For example, in one of PxD’s mobile applications, the buttons for checking temperature, price, and language were not easily discoverable because users would often click the text labels instead of the button icons themselves.

Thus, even “perfect” AI advice tends to fail if a tool’s UI is unintuitive. Issues such as hidden language options or complex navigation paths not only frustrate farmers but can also decrease trust. Because these barriers are invisible to standard accuracy metrics, incorporating approaches similar to the PxD “user journey” into benchmarking is essential for assessing real-world impacts.

Grounding model-level evaluation in agricultural use cases

While system-level views involving user experiences are vital, model-level evaluation remains the foundation of any benchmarking process.

Several datasets have been developed to assess how LLMs handle agricultural knowledge. These typically consist of test questions and answers. While a longer list of such efforts is maintained by Athena Infonomics, notable examples include the Agriculture Information Exchange Platform (AIEP) Golden Q&As, AgriBench, AgREASON, and AgXQA.

Despite differences in scope and design, their developers are collectively aiming to ground model evaluation in practical and factual agricultural use cases. They are striving to reflect actual farmer queries, local terminology, and realistic decision constraints (temporal and geospatial).

For example, testing whether a model can address questions about specific local pests is more valuable than testing it on general biology. To be effective, LLM benchmarking should combine both quantitative and qualitative metrics: the former measuring factual accuracy, the latter assessing how LLMs handle uncertainty or ambiguous questions, which are common in farming.
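As a rough sketch of how the two kinds of metrics might sit together, the example below reports a quantitative score (share of factually correct answers) next to a qualitative one (a reviewer's 1-to-5 rating of how well each answer handled uncertainty). The record fields, the rating scale, and the sample values are assumptions made up for illustration; no dataset named in this post is known to use this exact scheme.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical reviewed responses: a correctness flag plus a human rubric
# rating (1 = ignores uncertainty, 5 = handles it clearly and appropriately).
reviewed: List[Dict[str, object]] = [
    {"correct": True, "uncertainty_rating": 4},
    {"correct": False, "uncertainty_rating": 2},
    {"correct": True, "uncertainty_rating": 5},
    {"correct": False, "uncertainty_rating": 3},
]


def summarize(records: List[Dict[str, object]]) -> Dict[str, float]:
    """Report the quantitative and qualitative scores separately, not blended."""
    if not records:
        raise ValueError("no reviewed responses")
    return {
        "factual_accuracy": mean(1.0 if r["correct"] else 0.0 for r in records),
        "mean_uncertainty_rating": mean(float(r["uncertainty_rating"]) for r in records),
    }


summary = summarize(reviewed)
print(summary)
```

Keeping the two numbers separate is a deliberate choice in this sketch: averaging them into one score would hide cases where a model is often right but handles ambiguous questions poorly, or the reverse.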

While standardizing these two categories of metrics across the industry is an ongoing challenge, these efforts should ultimately provide the data necessary to inform better system design and user experience.

Equity, inclusion, and gender-responsive benchmarking

At the third and final level, process, a key issue is ensuring that the benchmarking of gen AI agricultural advisory systems accounts for equity and inclusion.

Differences in language, access, literacy, and social roles shape how advisory information is received and acted upon, meaning that model-centric metrics alone may obscure disparities in who benefits from gen AI-enabled services. Detecting factual inaccuracies and algorithmic bias in models can only go so far; evaluating whether a system is inclusive requires assessing agency and empowerment outcomes experienced by different user groups.

Inclusive benchmarking should incorporate indicators focused on outcome-oriented dimensions such as usability, access and reach, trust, adoption, and agency, with each disaggregated to capture differential experiences across gender and social groups. This can reveal gaps that are not visible in automated or aggregate performance metrics such as F1, accuracy, or BLEU scores.

Regarding access and reach, indicators such as women’s participation, preferred languages, or usage patterns are useful. On usability, trust, and adoption, methods including surveys, focus groups, and user testing can help assess perceived ease of use, comprehension, and reliability across different user groups.
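The disaggregation described above can be sketched in a few lines. The example below computes a single usability indicator (whether a user completed a task during testing) both in aggregate and per group. The group labels, field names, and values are invented for illustration and are not results from any study.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical user-testing records: (group, completed_task).
sessions: List[Tuple[str, bool]] = [
    ("women", True), ("women", False), ("women", True), ("women", False),
    ("men", True), ("men", True), ("men", True), ("men", True),
]


def completion_rates(records: List[Tuple[str, bool]]) -> Tuple[float, Dict[str, float]]:
    """Return the aggregate completion rate and the rate for each group."""
    if not records:
        raise ValueError("no sessions recorded")
    totals: Dict[str, int] = defaultdict(int)
    completed: Dict[str, int] = defaultdict(int)
    for group, done in records:
        totals[group] += 1
        completed[group] += int(done)
    overall = sum(completed.values()) / len(records)
    by_group = {group: completed[group] / totals[group] for group in totals}
    return overall, by_group


overall, by_group = completion_rates(sessions)
print(f"aggregate: {overall:.2f}")
for group, rate in by_group.items():
    print(f"{group}: {rate:.2f}")
```

In this made-up data the aggregate rate looks acceptable, while the per-group rates show that one group completes the task far less often. That is the kind of gap an aggregate metric alone would not reveal.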

Conclusion

As gen AI-enabled agricultural advisory systems move toward wider deployment, benchmarking practices must evolve beyond isolated model-level metrics to reflect real-world performance, governance, and equity outcomes. This will require multi-level evaluation frameworks that integrate model-, system-, and process-level assessment across the advisory lifecycle. Ultimately, reliable AI systems can help to support global food security, assisting farmers in efforts to adapt to climate and economic shocks. For this impact to scale, however, the AGX AI community must bridge the “affordability gap” alongside the “accuracy gap,” ensuring that service providers and smallholder farmers have access to effective tools.

Josué Kpodo is a PhD candidate at Michigan State University; Jagannath R is Research Manager with Precision Development (PxD); Michael Minkoff is a digital agriculture specialist; Niyati Singaraju is a Postdoctoral Fellow, Gender Research, International Rice Research Institute (IRRI) and the Gender and Inclusion Focal Point, CGIAR Gender + AI Accelerator & Digital Transformation Initiative. Opinions are the authors’.

Reference:
Athena Infonomics. (2025). A Look at Benchmarking Initiatives for gen AI-Powered Advisory in Agriculture. https://agxai.notion.site/a-look-at-benchmarking-initiatives#234b4002b2a881e2aa23dc8aee5b0e09
