Food systems research—and more generally, policy and development research—often relies on structured surveys, administrative data, or experiments. While these approaches yield valuable quantitative insights, they tend to miss critical qualitative dimensions. One useful qualitative approach is asking open-ended interview questions. When responses to such questions are collected in participants’ native languages, they can provide rich and nuanced information—for example, on the complex local challenges smallholder farmers face.
However, analyzing free-form text can be costly, time-consuming, and inconsistent across analysts. These challenges become even more pronounced when responses are in local or less widely spoken languages.
Recent advancements in generative artificial intelligence, especially large language models (LLMs), offer promising solutions to these challenges. LLMs, the technology behind many current AI applications, enable faster, cheaper, and potentially more consistent analysis of qualitative data. Such models are trained to generate coherent responses drawing on large datasets of text harvested from the internet.
Yet English-language text predominates in most LLM training datasets, while other languages remain underrepresented. As a result, such AI tools may not work well with text in local languages. This raises an important question: Can these models effectively overcome this obstacle and support food systems research in low- and middle-income countries?
Our work suggests the answer is yes, with some important caveats. We designed an experiment to evaluate how well different LLMs classify Swahili text—here, identifying occupational categories from interview responses—finding that they matched human-coded labels with roughly 80%-87% accuracy, though they still require human supervision and other precautions.
Our approach
While Swahili is spoken by over 100 million people across East Africa, the language has a relatively small digital footprint. This combination of widespread use and limited online representation makes Swahili ideal for testing LLMs on so-called lower-resource languages—defined as languages with little labeled or unlabeled text data available for training, and whose available data do not adequately represent the language and its sociocultural context.
Using phone survey data from Tanzania, we tested the capabilities of OpenAI’s GPT-3.5, GPT-4, and GPT-4o, Anthropic’s Claude 3.5 Sonnet and Haiku, and open-source models from Mistral. All models were accessed via APIs (application programming interfaces). To benchmark the LLMs (i.e., provide a baseline comparison with older, established approaches), we also tested a smaller, Swahili-only text-classification model. The dataset included 4,000+ open-ended responses describing respondents’ main jobs, which human coders had classified into 21 occupational categories. Each model was prompted to assign one of those categories to each response, and we measured accuracy as the share of cases where the model’s label matched the human-coded occupational category.
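To make the evaluation concrete, here is a minimal sketch of the accuracy calculation, assuming the model’s labels have already been collected. The category names and toy data are illustrative, not drawn from our dataset.

```python
# Minimal sketch: accuracy is the share of responses where the model's
# label matches the human coder's label. Data shown are illustrative only.

human_labels = ["Agriculture", "Retail/Trade", "Agriculture"]  # from manual coding
model_labels = ["Agriculture", "Agriculture", "Agriculture"]   # from an LLM run

matches = sum(m == h for m, h in zip(model_labels, human_labels))
accuracy = matches / len(human_labels)
print(f"Accuracy: {accuracy:.1%}")  # -> 66.7% in this toy example
```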
We compared two prompting strategies: zero-shot (asking for a classification without providing any examples) and few-shot (including a few labeled examples in the prompt) (Figure 1). We also used “function calling”—a feature of the OpenAI API that constrains the output format—so the model could respond using only the predefined category labels.
Figure 1

Source: Authors
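To illustrate how such prompting might look in code, the sketch below builds zero-shot or few-shot messages and uses OpenAI’s function-calling (“tools”) interface to restrict the output to predefined labels. The category list, Swahili examples, helper names, and model choice are all illustrative assumptions, not our exact implementation.

```python
# Illustrative sketch of zero-shot vs. few-shot prompting with OpenAI's
# function-calling ("tools") interface, which restricts the model's output
# to a predefined set of labels. Categories and examples are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIES = ["Agriculture", "Retail/Trade", "Teaching", "Transport"]  # 21 in the study

# A JSON-schema "tool" whose single argument must be one of the categories.
classify_tool = {
    "type": "function",
    "function": {
        "name": "assign_occupation",
        "description": "Assign one occupational category to a survey response.",
        "parameters": {
            "type": "object",
            "properties": {"category": {"type": "string", "enum": CATEGORIES}},
            "required": ["category"],
        },
    },
}

# Labeled examples added only in the few-shot condition (shown here as
# plain messages for simplicity).
FEW_SHOT_EXAMPLES = [
    {"role": "user", "content": "Ninalima mahindi shambani"},
    {"role": "assistant", "content": "Agriculture"},
]

def classify(response_text: str, few_shot: bool = False, model: str = "gpt-4o") -> str:
    messages = [{"role": "system",
                 "content": "Classify the Swahili job description into one category."}]
    if few_shot:
        messages += FEW_SHOT_EXAMPLES
    messages.append({"role": "user", "content": response_text})

    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        tools=[classify_tool],
        tool_choice={"type": "function", "function": {"name": "assign_occupation"}},
    )
    args = completion.choices[0].message.tool_calls[0].function.arguments
    return json.loads(args)["category"]

print(classify("ninauza nguo sokoni", few_shot=True))  # expected: Retail/Trade
```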
We selected occupation coding as a proof-of-concept. While many surveys use pre-coded lists, large-scale household and census surveys often collect open-ended responses to preserve nuance, capture diverse job titles, and enable flexible post-coding. Similarly, in rapid phone surveys, such as the Tanzania case, enumerators often record verbatim responses because it is impractical to go through a long list of occupations. These must then be manually coded, which preserves detail but shifts the burden downstream and creates the very bottleneck LLMs can help resolve. The same logic applies to other open-ended questions, such as those on climate impacts, livelihoods, or attitudes, where standardized pre-coding is often not feasible.
How language models make sense of text: Understanding embeddings
To understand how language models interpret survey responses, we examined how they organize meaning internally. The model converts each piece of text—for example, “coffee farming”—into a high-dimensional vector called an embedding. Responses that share meaning (i.e., are semantically similar) produce vectors that lie close together in this space, while very different ideas land farther apart. For example, “coffee farming” and “corn farming” are semantically closer than “coffee farming” and “teacher.”
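As a small illustration of this idea, the sketch below embeds three phrases and compares them with cosine similarity. The specific embedding model (OpenAI’s text-embedding-3-small) is an assumed choice for the example, not a detail from our study.

```python
# Illustrative sketch: embed three phrases and compare them with cosine
# similarity. The embedding model is an assumed choice for illustration.
import numpy as np
from openai import OpenAI

client = OpenAI()

phrases = ["coffee farming", "corn farming", "teacher"]
resp = client.embeddings.create(model="text-embedding-3-small", input=phrases)
vectors = [np.array(item.embedding) for item in resp.data]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Expect the two farming phrases to be closer to each other than to "teacher".
print(cosine_similarity(vectors[0], vectors[1]))  # coffee vs. corn farming
print(cosine_similarity(vectors[0], vectors[2]))  # coffee farming vs. teacher
```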
To visualize this, we projected the embeddings onto a two-dimensional surface (Figure 2).
Figure 2

Source: Authors
In the plot, each dot is a Swahili job description, colored by the occupation assigned by a human coder. Clusters of the same color suggest the model sees those responses as semantically similar. For example, many blue dots representing occupations in the agriculture sector appear together. Mixed or overlapping colors indicate ambiguity, such as “selling chickens,” which could belong to retail or agriculture. Isolated dots often represent short or unusual replies that lack enough context for clear classification.
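A plot along these lines can be produced by reducing the embedding vectors to two dimensions and coloring points by the human-assigned category. The sketch below uses PCA from scikit-learn with placeholder data; the actual projection method behind Figure 2 may differ.

```python
# Illustrative sketch: project high-dimensional embeddings onto two
# dimensions and color points by the human-assigned occupation. PCA is
# one simple choice; t-SNE or UMAP are common alternatives.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# embeddings: (n_responses, n_dims) array from an embedding model;
# human_labels: one occupation category per response (placeholder data here).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 1536))
human_labels = rng.choice(["Agriculture", "Retail/Trade", "Teaching"], size=200)

coords = PCA(n_components=2).fit_transform(embeddings)

for label in np.unique(human_labels):
    mask = human_labels == label
    plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=label)
plt.legend()
plt.title("Survey responses in embedding space (2-D projection)")
plt.show()
```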
How well did the models perform?
Overall, the language models matched the human-assigned job categories in 80-87% of cases (Figure 3). Claude 3.5 Sonnet and OpenAI’s GPT-4 performed best, especially with a few labeled examples. GPT-3.5 and Claude Haiku also performed well, while GPT-4o showed slightly lower accuracy. Mistral models, though smaller and less specialized, improved significantly with few-shot prompts after weaker zero-shot results.
The model we used as a benchmark achieved even higher accuracy (89.3%), but unlike LLMs that can be applied across many languages and tasks, it was specifically adapted to Swahili and trained on labeled examples, and thus may not generalize as well.
Figure 3

Where did the models make mistakes?
Interpreting informal, open-ended responses is inherently challenging. Many included local expressions or context-dependent phrasing. Human coders often rely on additional information, such as tone, follow-up questions, or related survey modules, to assign categories accurately. LLMs, however, process each response in isolation, limiting their ability to correctly classify ambiguous or nuanced replies.
Most errors occurred when:
- Responses were short or ambiguous, providing too little information.
- Jobs could belong to multiple categories, for example, “selling chickens” might be labeled as either agriculture or retail.
- Longer answers contained extra details that confused the model.
Few-shot prompting improved accuracy but could not fully resolve these challenges. As the examples in Table 1 show, human coders often relied on additional context that the models could not access.
Implications for research and practice
Our findings highlight both the promise and limitations of using LLMs to code open-ended survey responses. These models can process thousands of answers in minutes, significantly reducing manual effort. However, they still struggle with ambiguous or context-dependent replies that a human interviewer would clarify in conversation.
To help navigate these tradeoffs, we outline a few practical strategies below (Table 2).
Cost is an equally important consideration. Leading models like Claude 3.5 Sonnet and GPT-4 offer relatively high accuracy but can be expensive to use at scale. Smaller or language-specific models are cheaper and can perform well when given a few labeled examples, but often struggle on more challenging cases.
A human-in-the-loop approach is therefore the most practical strategy. Let the model code everything, but have researchers review responses the model flags as uncertain. Active learning can surface the 10%-20% of answers that drive most errors, balancing speed with the nuance that qualitative work requires.
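As a minimal sketch of what such a workflow could look like, assuming each model label comes with a confidence score (for example, self-reported by the model or derived from agreement across repeated runs, a design choice beyond what we tested), low-confidence cases can simply be routed to human coders:

```python
# Sketch of a human-in-the-loop filter: keep high-confidence model labels,
# route the rest to human coders. Assumes each prediction carries a
# confidence score, which is an illustrative design choice.

CONFIDENCE_THRESHOLD = 0.8  # tune on a small human-coded validation set

predictions = [
    {"text": "ninalima kahawa", "label": "Agriculture", "confidence": 0.95},
    {"text": "nauza kuku", "label": "Retail/Trade", "confidence": 0.55},
]

auto_accepted = [p for p in predictions if p["confidence"] >= CONFIDENCE_THRESHOLD]
needs_review = [p for p in predictions if p["confidence"] < CONFIDENCE_THRESHOLD]

print(f"Auto-coded: {len(auto_accepted)}, flagged for human review: {len(needs_review)}")
```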
Finally, our experiment focused on one task, occupation coding, in a single language. While it offered a clear reference point for LLM capabilities, it does not reflect the full range of open-ended questions researchers face. Real-world responses about perceptions or lived experiences are often more subjective. Here, LLMs may offer greater value by processing large volumes of complex text quickly and consistently—systematically identifying subtle meanings, emotional cues, and cultural references that might otherwise be overlooked or interpreted inconsistently, though this is an area that needs more study. We hope this exercise offers a useful template for applying LLMs to richer and more complex survey questions in the future.
Tushar Singh is a Senior Research Analyst with IFPRI’s Natural Resources and Resilience Unit; Himangshu Kumar is a Data Associate with Atlas Public Policy. This post is based on research that is not yet peer-reviewed. Opinions are the authors’.







