The large language models (LLMs) that power most artificial intelligence chatbots and other applications are vastly larger than earlier systems, relying on huge datasets and billions of parameters to produce their human-like outputs. Such scale is often celebrated as the driver of their cutting-edge capabilities. But scale alone does not guarantee sound results. The reliability (i.e., robustness, accuracy, and trustworthiness) of such systems hinges significantly on the quality and contextual relevance of LLM datasets. In short, it is not only the quantity but also the quality of the data available for building gen AI models that is critical.
This is a major issue as AI systems expand their footprint across varied industries around the world—with particular implications for AI in agriculture, which has emerged as a priority sector, specifically for generative AI. For example, the African Union Continental AI Strategy, adopted in 2024, aims to leverage AI tools in pursuing its Agenda 2063 development plan and the Sustainable Development Goals (SDGs). Domestic policy documents in multiple African, South Asian, and Latin American countries attest to widespread commitments to use AI to speed agricultural modernization.
This expansion will demand ever-larger, higher-quality datasets. Developing apps to serve specific geographical areas and populations requires contextual, relevant, up-to-date, and scientifically robust data. Where will all that data come from? In this post, we examine that challenge and propose a solution to enable a rapprochement between content producers and data-hungry AI developers: content licensing agreements.
Data challenges in agriculture
Let’s consider an example. A local agritech developer is exploring how to build an input advisory application for farmers in several states in India. To train the model, the developer might use open-access data (like some governmental surveys or reports available on the internet).
But this approach has serious limitations. If the developer relies solely on what is openly available, the training data is likely to include a lot of irrelevant or generic information, or information from other regions, that compromises the quality and accuracy of the generated output, making it all but useless for its intended audience.
Without such regionally relevant, up-to-date datasets, there is a high risk that chatbots produce: hallucinations (inaccurate or irrelevant outputs presented as fact); biases (reinforcement of outdated or harmful stereotypes, often baked into legacy datasets); and a lack of nuance (generic advice that stops short of providing actionable, context-specific guidance).
To address this issue, the developer may seek to obtain content from universities or international research organizations or publishers (e.g., Springer, CGIAR, the UN Food and Agriculture Organization (FAO), and CABI) or even private sector organizations that may possess relevant data. Such a mix of more relevant sources will arguably yield more robust datasets used in training and fine-tuning the model, ultimately producing more reliable, useful answers for users.
This intensive demand for "AI-ready" data has resulted in mass scraping of content that is often deemed open access (or, in some reported cases, obtained illegally) to feed into the training and fine-tuning of LLMs and related models.
The debate over AI datasets
This rapid, often indiscriminate amassing of content has fueled heated debates about intellectual property rights, especially copyright. There are two contending positions in this debate.
On one side, advocates favoring rapid gen AI advancement say that legal or other encumbrances around access to data should be relaxed, and intellectual property laws shouldn’t curtail such access. Many argue that large-scale data appropriation for AI is a form of fair use, even of proprietary content.
On the other side, critics argue that such practices are unlawful and undermine the rights of creators and publishers. The legal arguments remain largely unresolved, given the complexities of IP laws and their nuanced variations around fair use in different jurisdictions.
But given the economic impact (on AI developers as well as IP holders), the controversy has grown from an academic matter into a broad public and political debate. We are thus at a crossroads of AI development. The era of "free" scraping of web content at massive scale is unlikely to continue. First, smaller, local AI developers lack the capacity to scrape at the necessary scale. Second, and more decisively, the legal risks are mounting. Technology giants such as OpenAI and Anthropic face costly lawsuits; in 2025, the latter reportedly settled a class-action copyright violation lawsuit, agreeing to pay $1.5 billion to authors whose books it scraped to assemble its gen AI training datasets.
Yet these ongoing disputes leave little clarity for AI developers, including those working on gen AI systems for agriculture (such as advisory chatbots and predictive analytical models). There is, however, a growing recognition of the importance of seeking lawful, dependable ways to access needed content. Trust remains a critical condition for broad adoption, and trust ultimately depends on the foundation of the data in an AI model.
Content licensing as a solution
Given the legal grey zones surrounding data use in AI training, content licensing offers a practical, legally defensible path forward. Unlike sweeping and uncertain claims of “fair use,” licenses—bilateral agreements spelling out the rights and obligations between interested parties—establish clarity, accountability, and enforceability.
As part of its role in the Generative AI for Agriculture (GAIA) project, CABI is examining data governance issues to improve access to robust content for gen AI developers in a legal, equitable, and sustainable way. We are developing a model content license (MCL) intended as a standardized template that can be adapted to specific contexts by agritech AI developers (including small- and medium-sized enterprises or intermediaries such as government departments acquiring licensed content for a developer) and creators (e.g., publishers, creative copyright licensors, universities) or collective rights organizations.
The MCL aims to be a legal instrument between content providers and content users—a reusable, adaptable framework that will allow both sides to expedite and streamline contractual negotiations for accessing relevant content for a fair fee.
The MCL is structured as a standard template to address a range of legal issues, including general privacy and data security, and specific rights and restrictions regarding the data in question. Typically, negotiating licenses for content access can be a time-consuming and costly process. The MCL aims to ease the process by providing a mutually agreeable starting point. It is geared towards enabling content access for the development of different types of AI models, including chatbots.
Costs: License fees vs. litigation
Skeptics often point to the costs of licensing as an obstacle to widespread adoption. Yet collective licensing mechanisms—widely used in the creative industries—demonstrate that it is feasible to pool content from multiple authors under clear agreements. More importantly, the alternative—a legally dubious free-for-all—is ultimately far more costly.
The choice for developers is stark: negotiate fair and legally sound licenses that foster trust and innovation, or risk being bogged down in expensive litigation that undermines both credibility and investor confidence.
Conclusion
AI innovation is moving at remarkable speed, but the question of data access will remain central to its trajectory. In digital agriculture, the stakes are especially high: robust datasets are not just a technical requirement; they are the foundation for trust, adoption, and impact at scale.
Content licenses—particularly standardized models like the MCL being advanced under GAIA—offer a pragmatic way forward. They provide a mechanism that is scalable, legally sound, and capable of balancing the interests of innovators and content providers alike. If the agrifood sector is to build trustworthy and sustainable AI ecosystems, then content licensing is not a side issue — it is the backbone.
Ameen Jauhar is Data Governance Lead at CABI. Opinions are the author’s.