Vibe Research: A Workflow Tutorial for AI-Collaborative Research
- Yuan Ren

In recent years, large language models, coding agents, and automation tools have gradually entered research practice. They can assist researchers with literature organization, code implementation, experiment orchestration, result aggregation, and even drafting. But this does not mean the research process naturally becomes more rigorous. On the contrary, without clear boundaries, evaluation, and process control, AI often amplifies existing disorder: code structures balloon, experimental results become hard to trace, output files keep accumulating, and researchers’ grasp of the overall project state steadily declines. Surveys have already pointed out that for large language models to serve research effectively, the prerequisite is not that they “autonomously complete research,” but that they be placed within a process aligned with human research goals and equipped with clear evaluation mechanisms (Zhang et al., 2025).
What this article calls Vibe Research is not a strictly formalized academic term, but rather a practical working framework. Its basic principle is: researchers are responsible for problem definition, direction judgment, evaluation design, and conclusion review, while AI is responsible for high-frequency iteration, boilerplate implementation, local exploration, and repetitive labor. This article attempts to organize this framework into a concise and executable tutorial.

I. Project initiation: establish the runtime environment and collaboration boundaries first
At the cold-start stage of a project, the primary task is not to generate code immediately, but to first establish the most basic runtime environment and collaboration constraints. According to practical experience, this step usually includes: configuring agents.md, relevant skills, and plugins in advance within the project directory; if a command-line workflow is adopted, then further preparing background session tools such as tmux or screen so that long-running tasks keep executing reliably.
The importance of this stage lies mainly not in the tools themselves, but in the fact that it establishes a stable external framework for the subsequent research. In AI-collaborative research, environmental stability, recoverable tasks, and distinguishable responsibilities are often more important than pursuing an “elegant code structure” at the very beginning. From the perspective of research norms, this is also consistent with the emphasis in recent years within machine learning on reproducibility and transparency: the research process needs to satisfy the basic conditions of repeatable execution, traceability, and verifiability (NeurIPS, 2026).
II. The idea stage: research direction should be led by humans, with AI assisting understanding
At the idea formation stage, the role of AI should be limited to assistance rather than substitution. A relatively reliable approach is for researchers to first read the core papers themselves, screen out the truly relevant references and technical clues, and only then provide these materials to the agent so that it can build a more complete picture of the domain; throughout, researchers keep their own preliminary roadmap that clarifies how the project should proceed.
This division of labor is broadly consistent with the judgments in existing surveys. Relevant studies point out that LLMs can participate in multiple stages of the research workflow, but their role is built on alignment with human research goals, rather than replacing researchers in problem definition (Zhang et al., 2025). Therefore, at this stage, AI is better suited as an “understanding accelerator” rather than a “direction setter.”
III. The codebase stage: the code framework is the foundation of research work
When AI participates in code generation, the structure of the codebase is not a secondary issue, but rather the infrastructure of the research effort. In practice, a common problem is that agents tend to over-abstract at the early stage of a project—for example, splitting an originally simple data-loading process into multiple layers of dataclasses, configuration objects, and intermediate wrapper structures—causing system complexity to rise significantly before real experimentation has even begun.
Therefore, at the codebase design stage, researchers should actively constrain the level of abstraction in the first version of the framework as much as possible through plan mode or other means. The most important thing at this stage is not to build a general-purpose system for all future scenarios, but to establish the shortest, clearest, and most easily verifiable main research path. Which structures must be abstracted, which parts should remain direct for the time being, which configurations need to be exposed, and which logic should be centrally managed—these questions should all be determined by humans from the outset. If the first-layer framework is poorly designed, every subsequent feature iteration will accumulate additional complexity; conversely, if the main path is sufficiently clear, later modifications, comparisons, and debugging will all become more controllable.
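To make the "shortest, clearest main path" concrete, here is a minimal sketch of what a deliberately flat first version might look like. All names (`load_data`, `run_model`, the JSON-lines format, the `score` field) are hypothetical illustrations, not taken from the article: the point is that the whole path is readable top to bottom, with no configuration objects or wrapper layers yet.

```python
# A deliberately flat first version of the main research path.
# Every name and format here is illustrative, not prescribed.

import json
from pathlib import Path


def load_data(path):
    """Load records from a JSON-lines file -- no config objects, no wrappers."""
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line]


def run_model(records, threshold=0.5):
    """Stand-in for the actual model: a single, directly readable step."""
    return [r["score"] >= threshold for r in records]


def main(data_path, threshold=0.5):
    records = load_data(data_path)
    return run_model(records, threshold)
```

Abstractions (config dataclasses, registries, plugin layers) can be introduced later, once a real second use case forces them.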
IV. The evaluation stage: evaluation should precede experiment execution
In this workflow, evaluation should not be understood as “result statistics” after the experiment ends, but as “problem definition” before the experiment begins. More precisely, evaluation is the executable expression of the hypothesis. If the evaluation logic is not clearly designed before experiments begin, the large amount of computing resources invested later will likely be spent without direction.
Therefore, a more reliable sequence is: design the evaluation first, then move into experiments. Evaluation logic should not be temporarily attached to the tail end of experiments, but should instead be reserved as part of the codebase at the earliest stage. Its specific form can be determined by project complexity: it may be a relatively simple procedural script, or a well-separated modular system. But one principle should not be ignored: functions with different dependencies should be managed separately as much as possible, rather than cramming result saving, metric computation, and visualization into one overlong function.
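The separation principle above can be sketched in a few lines. This is an illustrative example, with hypothetical function names: metric computation is kept pure (no I/O, so it can be tested before any experiment runs), while result saving is isolated in its own function; a plotting step would likewise live separately, behind its own dependency.

```python
# Evaluation logic split by dependency: metric computation is pure,
# result saving touches the filesystem. Names are illustrative.

import json
from pathlib import Path
from statistics import mean


def compute_metrics(predictions, labels):
    """Pure function: no I/O, testable before any experiment runs."""
    correct = [int(p == y) for p, y in zip(predictions, labels)]
    return {"accuracy": mean(correct), "n": len(correct)}


def save_results(metrics, out_dir, run_id):
    """I/O only: one JSON file per run, so results stay traceable."""
    out = Path(out_dir) / f"{run_id}.json"
    out.write_text(json.dumps(metrics, indent=2))
    return out
```

Because `compute_metrics` has no side effects, it can be unit-tested on day one, before any compute budget is spent.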
V. The feature stage: use Red–Green–Refactor to establish a development closed loop
When the project enters the feature iteration stage, a relatively reliable practice is to adopt Red–Green–Refactor. According to Martin Fowler’s classic summary of TDD, this cycle includes three steps: first write a test that fails (Red), then complete the implementation so that it passes (Green), and finally reorganize the structure, eliminate duplication, and compress technical debt (Refactor) (Fowler, 2023).
This method is especially effective in coding-agent scenarios. Simon Willison (2026) explicitly recommends using red/green TDD when working with coding agents: first let the test fail, then drive the implementation based on the failure feedback, thereby reducing ambiguity and improving iteration quality.
The reason Refactor deserves special emphasis is that agents are often good at rapidly completing functionality, but are not naturally good at proactively reorganizing system structure after the functionality passes. If there are only Red and Green, but no Refactor, project complexity will still continue to accumulate across multiple rounds of iteration. Therefore, at the feature stage, the emphasis should not merely be on “getting AI to write the feature,” but rather on using tests to establish a verifiable, correctable, and sustainably maintainable development closed loop.
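One Red–Green–Refactor cycle can be shown in miniature. The function under test, `normalize_scores`, is a hypothetical example chosen for brevity; the comments mark where each phase of the cycle sits.

```python
# Red: write the test first. With no implementation, running it fails,
# and that failure message is what drives the agent's next step.
def test_normalize_scores():
    assert normalize_scores([2.0, 2.0]) == [0.5, 0.5]
    assert normalize_scores([]) == []


# Green: the smallest implementation that makes the test pass.
def normalize_scores(scores):
    total = sum(scores)
    if total == 0:
        return []
    return [s / total for s in scores]

# Refactor: with the test green, structure can be reorganized safely.
# Little to clean up in a toy example, but in a real project this is
# where duplication introduced during Green gets removed -- and the
# test guarantees behavior is preserved.
```

In an agent workflow, the researcher can require the agent to show the failing test output (Red) before accepting any implementation, which removes ambiguity about what "done" means.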

VI. The experiment stage: process management determines whether the project remains controllable
Once the experiment stage is truly under way, the core challenge of the project shifts from “how to write the code” to “how to maintain order.” Two practical measures are especially worth adopting.
The first is to enforce the maintenance of CHANGELOG.md. Especially when multiple agents are running simultaneously, as long as core logic changes, it should be briefly recorded in the changelog, at minimum explaining the content of the change and its scope of impact. The second is to distinguish between human-created and agent-generated documents through naming conventions—for example, all human-created markdown files uniformly use lowercase, while files created or heavily refactored by agents uniformly use uppercase, so that source and responsibility boundaries can be identified quickly.
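The case-based naming convention is mechanically checkable. The following sketch classifies markdown files by that convention; the convention itself is the article's, but the script and its function name are an illustrative assumption.

```python
# Classify markdown files by the article's naming convention:
# lowercase filenames = human-authored, uppercase = agent-generated.
# The helper itself is an illustrative sketch.

from pathlib import Path


def classify_markdown(root):
    """Return (human_files, agent_files) for all .md files under root."""
    human, agent = [], []
    for path in Path(root).rglob("*.md"):
        if path.stem == path.stem.lower():
            human.append(path.name)
        else:
            agent.append(path.name)
    return sorted(human), sorted(agent)
```

A check like this can run in CI or a pre-commit hook, so the responsibility boundary stays visible without anyone policing it by hand.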
These practices may not belong to strict mainstream norms, but they serve the same goal: keeping the experimental process traceable, restorable, and interpretable. This is also aligned with the current machine learning community’s requirements for result traceability: research should not only report results, but also preserve sufficient process information for subsequent inspection and reproduction (NeurIPS, 2026).
VII. The autoresearch stage: automated exploration must be built on clear boundaries
When features have already progressed to a relatively stable stage and the optimization targets are sufficiently clear, one may try entering the autoresearch stage. The core of this stage is not to “let AI optimize freely,” but to first have researchers define the acceptable optimization scope, including modifiable code regions, parameter space, and structural boundaries, and only then allow AI to explore under managed conditions.
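The "explore under managed conditions" idea can be sketched as a search space that the researcher declares up front, with the automated search only able to sample inside it. The parameter names and ranges below are hypothetical examples, not recommendations.

```python
# Constrained exploration: the researcher fixes the search space,
# and automated search can only sample inside it. Parameter names
# and ranges are illustrative.

import random

# Human-defined boundaries: what the automated search may touch.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-2),
    "dropout": (0.0, 0.5),
}


def sample_candidate(space, rng):
    """Draw one candidate strictly inside the declared boundaries."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}


def constrained_search(space, n_trials, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    return [sample_candidate(space, rng) for _ in range(n_trials)]
```

The same boundary idea extends beyond parameters: an allow-list of modifiable files or modules plays the role of `SEARCH_SPACE` for code edits.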
This judgment matters because automation by itself does not guarantee more reliable results. Relevant surveys likewise point out that the premise for LLMs to play a role in research is clear evaluation metrics and alignment with human goals. In other words, AI can help search for candidate solutions, but it cannot replace researchers in deciding what truly constitutes an improvement (Zhang et al., 2025). Therefore, autoresearch is better suited as a constrained mechanism of partial delegation, rather than a transfer of research judgment.
VIII. Result collection and draft stage: in the late stage of research, the key is convergence rather than expansion
Once experiments are running stably and the form of evaluation results has also become clear, the project enters its later stage. According to practical experience, the most critical task at this stage is to manage the output folder, promptly archive failed experiments or invalid results, and synchronously write core results into the changelog.
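The archiving step can be a small routine. The sketch below assumes a marker-file convention (a `SUCCESS` file in each completed run directory); that convention and the function name are illustrative assumptions, not from the article.

```python
# Late-stage housekeeping: run directories without a SUCCESS marker
# are moved into archive/, so the output folder shows only results
# that still matter. The marker-file convention is an assumption.

import shutil
from pathlib import Path


def archive_failed_runs(results_dir, archive_name="archive"):
    """Move run directories lacking a SUCCESS marker into archive/."""
    results = Path(results_dir)
    archive = results / archive_name
    archive.mkdir(exist_ok=True)
    moved = []
    for run in list(results.iterdir()):  # snapshot before moving entries
        if run.is_dir() and run != archive and not (run / "SUCCESS").exists():
            shutil.move(str(run), str(archive / run.name))
            moved.append(run.name)
    return sorted(moved)
```

Runs are archived rather than deleted, so a result can still be recovered if a "failed" experiment later turns out to matter.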
The significance of this step lies in maintaining the clarity of the research chain. Because as the number of experiments increases, what truly threatens project quality is often no longer “whether there are still new ideas,” but rather “whether one can still clearly know where each result came from, what modification it corresponds to, and what conclusion it supports.”
Entering the draft stage usually means that large-scale experiments—such as ablation studies or hyperparameter searches—have basically been completed, and the project is approaching closure. At this point, one should return to the original roadmap, check whether any key experiments have been omitted, confirm whether the chain of argument has been closed, and only then begin writing.
It should also be noted that this tutorial is more directly applicable to fields such as computer science and econometrics, where research workflows are often more formalized, data-driven, and easier to evaluate reproducibly. By contrast, it is generally less suitable for many areas in the humanities and broader social sciences, where interpretation and contextual analysis play a more central role.

Conclusion
Taken as a whole, Vibe Research does not advocate “handing research over to AI,” but rather re-establishing a stricter set of process constraints once AI has significantly increased iteration speed. The framework is valuable precisely because it consistently insists on three things: establish the framework first, design the evaluation first, and build a closed feedback loop into development. The core conclusion of Vibe Research can therefore be summarized in one sentence: AI can increase the speed of research, but responsibility for research judgment, evaluation, and conclusions must still rest with humans.
References
Fowler, M. (2023, December 11). bliki: TestDrivenDevelopment. Martinfowler.com. https://martinfowler.com/bliki/TestDrivenDevelopment.html
NeurIPS. (2026). PaperInformation / PaperChecklist. Neurips.cc. https://neurips.cc/public/guides/PaperChecklist
Willison, S. (2026). Red/green TDD - Agentic Engineering Patterns. Simon Willison’s Weblog. https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/
Zhang, Y., Khan, S. A., Mahmud, A., Yang, H., Lavin, A., Levin, M., Frey, J., Dunnmon, J., Evans, J., Bundy, A., Dzeroski, S., Tegner, J., & Zenil, H. (2025). Exploring the role of large language models in the scientific method: from hypothesis to discovery. NPJ Artificial Intelligence, 1(1), 14. https://doi.org/10.1038/s44387-025-00019-5



