A Critical Examination of LLM Usage in the Context of Qualitative Research
- Yuan Ren
- Oct 10
- 7 min read
Updated: Oct 11

As generative artificial intelligence (GenAI) rapidly transforms the landscape of scientific research, qualitative scholars are confronted with an unprecedented technological challenge: Can this emerging tool—powered by large language models (LLMs)—truly fulfill the humanistic task of qualitative data analysis, which is fundamentally rooted in interpretive work? In their recent publication, Nguyen and Welch (2025) offer a systematic and incisive response to this question. They not only provide a detailed examination of the technical mechanisms behind GenAI but also engage in a profound philosophical and methodological reflection on its status as a scientific instrument.
What exactly is GenAI?
The authors begin by emphasizing that GenAI is not a form of general intelligence, but rather a probabilistic generation system driven by autoregressive large language models (AR-LLMs). These models generate text word by word by learning statistical relationships between terms across massive textual corpora, producing responses that appear coherent and natural (Henighan et al., 2020; Wei et al., 2022a). However, this process is synthetic rather than interpretive. While the generated outputs may mimic the rhetorical style of academic language, they do not entail genuine semantic understanding (Nguyen & Welch, 2025).
The training of these LLMs is based on the transformer architecture (Vaswani et al., 2017). Upon receiving user input—whether in the form of commands or raw text—the system tokenizes the input, embeds it into a high-dimensional vector space, models the contextual relationships, and then predicts the most probable next word sequentially to construct the output (Radford et al., 2018).
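To make this autoregressive loop concrete, here is a minimal illustrative sketch in Python. It replaces the transformer with a hand-written next-token probability table, so the vocabulary, probabilities, and function names are invented for illustration; what it shares with a real AR-LLM is only the mechanism of repeatedly predicting a next token and appending it to the context.

```python
import random

# Toy "model": maps the most recent token to a probability distribution over
# possible next tokens. A real AR-LLM conditions on the entire token sequence
# with a transformer; the vocabulary and probabilities here are invented.
NEXT_TOKEN_PROBS = {
    "<start>":  {"the": 0.7, "a": 0.3},
    "the":      {"data": 0.5, "model": 0.5},
    "a":        {"pattern": 1.0},
    "data":     {"suggest": 0.6, "show": 0.4},
    "model":    {"predicts": 1.0},
    "pattern":  {"emerges": 1.0},
    "suggest":  {"<end>": 1.0},
    "show":     {"<end>": 1.0},
    "predicts": {"<end>": 1.0},
    "emerges":  {"<end>": 1.0},
}

def generate(max_tokens: int = 10) -> str:
    """Autoregressive generation: predict one token at a time, append it to
    the context, and repeat until an end marker or the length limit."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = random.choices(list(dist), weights=dist.values(), k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])  # drop the <start> marker

print(generate())  # e.g. "the data suggest" on one run, "the model predicts" on another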
Crucially, this generative process lacks any built-in mechanism for factual verification. As Nguyen and Welch note, LLMs “are not designed to determine whether their answers are true”; their output “does not need to be correct, only to appear correct” (Thornton, 2023). Thus, the perceived reliability of these outputs is largely an illusion of fluency, not a reflection of substantive content.
Criteria for Evaluating the Scientific Suitability of Research Tools
The integration of any technological tool into academic research—especially in the domain of qualitative analysis—must not rely solely on novelty or efficiency. Nguyen and Welch (2025) underscore that scientific tools must undergo rigorous validation, guided by four core criteria: factual accuracy, reliability, transparency, and ethical compatibility. Unfortunately, GenAI performs poorly on all four fronts.
In terms of factual accuracy, studies have highlighted the phenomenon of “hallucinations” in LLMs—instances where the model generates entirely fabricated content, cloaked in persuasive and authoritative language (Maynez et al., 2020; Kalai et al., 2025). These outputs are not grounded in truth but are shaped by the statistical patterns learned from training data.
Regarding reliability, LLM outputs are highly unstable. Identical inputs often produce significantly different responses depending on the timing or dialogue iteration. This so-called “probabilistic drift” (Achiam et al., 2023) severely undermines the reproducibility essential for any scientific tool.
On the matter of transparency, LLMs function as opaque systems. Their complex architectures and undisclosed training processes create a “black box” effect, preventing researchers from tracing how outputs are generated (Burrell, 2016; Marcus, 2024).
Finally, ethical concerns loom large. LLMs may incorporate user inputs into future training data, raising serious issues related to data privacy, intellectual property, and informed consent (Lukas et al., 2023).
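Of these four criteria, reliability is the easiest to probe directly: send an identical prompt several times and compare the responses. The sketch below assumes the OpenAI Python client (the `openai` package) with an API key in the environment; the model name and the coding prompt are placeholders rather than the setup Nguyen and Welch used, and with non-zero sampling temperature divergent outputs are the expected behavior.

```python
from openai import OpenAI  # assumes the `openai` Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Assign up to three thematic codes to this interview excerpt: "
    "'I only use my laptop when I travel; at home it just gathers dust.'"
)

# Send the identical prompt five times and collect the responses.
responses = []
for _ in range(5):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,      # sampling, as in a default chatbot session
    )
    responses.append(completion.choices[0].message.content)

# A reproducible instrument would return the same coding each time;
# counting distinct answers makes the drift visible.
print(f"{len(set(responses))} distinct responses out of {len(responses)} runs")
```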

Is GenAI Suitable for Qualitative Data Analysis?
To empirically assess the capabilities of GenAI in qualitative analysis, Nguyen and Welch (2025) adopted a methodologically layered and analytically diverse research approach. At the center of their study was a well-known, large-scale qualitative dataset comprising interviews, blogs, news articles, and journal pieces about portable computing technologies, spanning from the 1980s to the 2010s. This ensured both diversity and familiarity, enabling the researchers to rigorously evaluate the validity of GenAI outputs.
The experiment involved a variety of prompting strategies—including zero-shot, few-shot, and chain-of-thought prompting (Wei et al., 2022b). To strengthen credibility, they also triangulated their findings with results from five independent studies examining ChatGPT in qualitative research and compared them against GenAI-enhanced features in commercial CAQDAS tools such as NVivo and QUALSOFT. Together, these sources formed a robust, three-tiered evidence base (Nguyen & Welch, 2025).
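For readers unfamiliar with these strategies, the sketch below shows what zero-shot, few-shot, and chain-of-thought prompts might look like for a qualitative coding task; the excerpt, codes, and wording are invented examples, not the prompts used in the study.

```python
EXCERPT = "I only use my laptop when I travel; at home it just gathers dust."

# Zero-shot: the task is stated directly, with no worked examples.
zero_shot = (
    "Assign one or more thematic codes to this interview excerpt:\n"
    f"{EXCERPT}"
)

# Few-shot: a handful of already-coded examples precede the new excerpt.
few_shot = (
    "Excerpt: 'My phone is the first thing I check every morning.'\n"
    "Codes: habitual use; dependence\n\n"
    "Excerpt: 'I gave up my tablet because it duplicated my phone.'\n"
    "Codes: device redundancy\n\n"
    f"Excerpt: '{EXCERPT}'\n"
    "Codes:"
)

# Chain-of-thought: the model is asked to lay out intermediate reasoning
# before committing to codes (Wei et al., 2022b).
chain_of_thought = (
    "Assign thematic codes to this interview excerpt. First explain, step by "
    "step, what the speaker is saying and why it might matter, then list the "
    f"codes.\nExcerpt: {EXCERPT}"
)

print(zero_shot, few_shot, chain_of_thought, sep="\n\n---\n\n")
```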
The results were conclusive: even when analyzing the same data, GenAI models produced wildly inconsistent outputs across different runs. Some outputs appeared as literal translations of the text, while others included fabricated quotes unrelated to the data. Even with identical prompts, responses varied significantly from session to session. This unpredictability forced researchers to restart their analysis repeatedly, compare outputs, and re-code data, ultimately increasing the analytic workload rather than improving efficiency.
The conclusion is unequivocal: current LLMs based on transformer architecture are not suitable for qualitative data analysis (Nguyen & Welch, 2025). Models like ChatGPT fail to perform consistent and meaningful coding of qualitative data in a stable, accurate, or comprehensive manner, and they are entirely incapable of engaging in the interpretation of social meaning. Yet this act of interpretation lies at the heart of qualitative inquiry—an inherently human capacity that no automation can replicate.
The Five Epistemic Risks of Using GenAI Uncritically
Nguyen and Welch identify five serious epistemic risks that arise from the uncritical use of GenAI in qualitative research:
1. Category Error: Mistaking LLM chatbots for qualitative analysis tools conflates the form of language with its cognitive function—treating synthetic, patterned output as if it were genuine interpretive insight.
2. Unreliable Outputs: LLM-generated results are highly unstable, inconsistent, and irreproducible; even with refined prompts, they fail to meet the precision and trustworthiness required for qualitative research.
3. Anthropomorphic Fallacies: Attributing understanding and collaboration to AI is a misreading of its fluency; what seems like dialogue is merely the repetition of linguistic patterns, not engagement with data.
4. Causal Misattribution: Blaming poor results on inadequate prompting shifts responsibility away from the model’s structural limitations and conceals the fact that it cannot truly perform interpretive tasks.
5. The Oracle Effect: Trusting AI outputs as objective or neutral because of their technological origin obscures their lack of context, logic, and meaning—ultimately undermining the depth and ethical awareness of qualitative research.
The Irreplaceable Human Core of Qualitative Analysis
A central message threaded throughout Nguyen and Welch’s work is that qualitative data analysis is not mechanized processing but the interpretation of meaning. This interpretive work involves multiple layers: identifying unexpected patterns, engaging in abstraction, theorizing, challenging assumptions, and reconfiguring concepts. Far from being a deductive operation, it is a deeply subjective, experiential, and context-sensitive endeavor.
Such analysis is rooted in the researcher’s situated understanding of the field, forged through embodied engagement and dialogical interaction within scholarly communities—a process of intersubjectivity (Yanow & Schwartz-Shea, 2014). Even among interpretivist scholars who reject positivist criteria, there remains a strong ethical commitment to epistemic responsibility: faithfully representing participants, questioning one’s assumptions, ensuring transparency in research procedures, analyzing systematically, and drawing logically grounded conclusions (Nguyen & Welch, 2025).
To entrust this complex and inherently uncertain process to a tool built on probabilistic pattern-matching rather than interpretive reasoning is to fundamentally misunderstand the essence of qualitative research.
A Call for Caution and Collective Responsibility in the Academic Community
Nguyen and Welch offer a compelling critique of the academic community’s growing tendency to uncritically embrace emerging technologies. They argue that before GenAI becomes a normalized tool within qualitative research, it is imperative to engage in rigorous scholarly debate and establish clear methodological and ethical standards. Premature adoption without thorough evaluation risks not only producing unreliable findings but also destabilizing the very foundations of qualitative research methodology.
Citing Alvarado (2023), the authors remind us that scholarly inquiry consists of more than just theory and method—it also involves a third dimension: tools. And the legitimacy of these tools cannot be determined by market enthusiasm or corporate branding alone; it must be established through the collective scrutiny of the academic community. This underscores a shared responsibility: every qualitative researcher must engage critically with the technologies they use and continuously ask, “Is this tool truly fit for qualitative inquiry?” This is not only a matter of scholarly judgment but a core tenet of research ethics.
References
Achiam J., Adler S., Agarwal S., Ahmad L., Akkaya I., Aleman F. L., Almeida D., Altenschmidt J., Altman S., Anadkat S., Avila R., Babuschkin I., Balaji S., Balcom V., Baltescu P., Bao H., Bavarian M., Belgum J., Bello I., …, Zoph B. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774. https://arxiv.org/abs/2303.08774v6.
Alvarado R. (2023). Simulating science: Computer simulations as scientific instruments (Vol. 479). Springer Nature.
Burrell J. (2016). How the machine ‘thinks’: Understanding opacity in machine learning algorithms. Big Data & Society, 3(1), Article 2053951715622512. https://doi.org/10.1177/2053951715622512
Henighan T., Kaplan J., Katz M., Chen M., Hesse C., Jackson J., Jun H., Brown T. B., Dhariwal P., Gray S., Hallacy C., Mann B., Radford A., Ramesh A., Ryder N., Ziegler D. M., Schulman J., Amodei D., McCandlish S. (2020). Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701. https://arxiv.org/abs/2010.14701v2
Kalai A. T., Nachum O., Zhang E. (2025). Why language models hallucinate. arXiv preprint arXiv:2509.04664. https://doi.org/10.48550/arXiv.2509.04664
Lukas N., Salem A., Sim R., Tople S., Wutschitz L., Zanella-Béguelin S. (2023). Analyzing leakage of personally identifiable information in language models. IEEE Symposium on Security and Privacy (SP), 346–363. https://doi.org/10.1109/SP46215.2023.10179300
Marcus G. F. (2024). Taming Silicon Valley: How we can ensure that AI works for us. MIT Press.
Maynez J., Narayan S., Bohnet B., McDonald R. (2020). On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 1906). Association for Computational Linguistics.
Radford A., Narasimhan K., Salimans T., Sutskever I. (2018). Improving language understanding by generative pre-training. OpenAI Technical Report. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
Thornton I. (2023). A special delivery by a fork: Where does artificial intelligence come from? New Directions for Evaluation, 23–32. https://doi.org/10.1002/ev.20560
Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wei J., Tay Y., Bommasani R., Raffel C., Zoph B., Borgeaud S., Yogatama D., Bosma M., Zhou D., Metzler D., Chi E. H., Hashimoto T., Vinyals O., Liang P., Dean J., Fedus W. (2022a). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682. https://arxiv.org/abs/2206.07682v2
Wei J., Wang X., Schuurmans D., Bosma M., Ichter B., Xia F., Chi E., Le Q. V., Zhou D. (2022b). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
Yanow D., Schwartz-Shea P. (2014). Interpretation and method: Empirical research methods and the interpretive turn (2nd ed.). Routledge.



