April 16, 2025

Jens Ludwig

The advent of generative AI has created a great deal of excitement, but it has also sparked concern about when and how AI, especially large language models (LLMs) like ChatGPT, can be used in empirical research. Thankfully, Jens Ludwig, the Edwin A. and Betty L. Bergman Distinguished Service Professor at the Harris School of Public Policy, together with coauthors Sendhil Mullainathan and Ashesh Rambachan at the Massachusetts Institute of Technology, has developed a set of guidelines for using AI credibly in empirical research.

The new working paper, entitled "Large Language Models: An Applied Econometric Framework," asks a critical question: how can LLMs, a type of generative AI, be used effectively in econometrics while accounting for their known limitations? With recent advances in natural language processing, LLMs have shown promise in assisting researchers across many fields. Despite that potential, however, there remains a significant gap in understanding their limitations, particularly how these models can be applied to empirical tasks in a reliable and valid manner. The paper proposes an econometric framework that carefully distinguishes between two types of empirical research tasks, prediction problems and estimation problems, and offers guidelines, which the authors call a "contract," on how LLMs can be leveraged for each without compromising the integrity of the research.

"LLMs are a diverse, dynamic set of extraordinarily complex machine learning models involving many layers of interactivity and billions of parameters," the authors write in the introduction to the paper. "Their training datasets and architectures (among many other details) are often intentionally obscured because LLMs are proprietary commercial products. To further complicate matters, modeling the outputs of LLMs has proven to be equally difficult. Computer scientists struggle to characterize the brittleness of LLMs that lead their outputs to accomplish remarkable feats in some tasks yet produce bizarre failures in others."

The first aspect of the framework concerns the use of LLMs for prediction tasks, including generating hypotheses. According to the paper, prediction tasks can be validly conducted with LLMs provided there is no "leakage" between the language model's training data and the sample used by the researcher. Leakage occurs when an LLM has been trained on data that overlaps with the researcher's empirical sample, which biases the results: the model may simply be recalling outcomes it has already seen rather than genuinely predicting them. The paper recommends that researchers use open-source LLMs with clearly documented training data and published weights so that such overlap can be ruled out (the first sketch below illustrates one simple form such a check might take). This safeguard is essential to the reliability of the model's predictions, ensuring they do not reflect data that should not have been available to the model during the empirical research process.

The second focus is the use of LLMs for estimation tasks, where the goal is to automate the measurement of some economic concept, whether derived from text or from human subjects. Here, the authors argue that researchers must collect validation data to assess the errors in the LLM's automation. Without validation data, it is impossible to evaluate how well the model is performing or to account for its errors: an LLM may do well on certain tasks, but without proper checks its outputs can produce unreliable empirical estimates (the second sketch below shows one standard way a small hand-labeled validation set can be used to quantify and correct such errors).
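To make the leakage safeguard concrete, here is a minimal sketch of one way a researcher might enforce it, assuming an open-source model whose documentation reports when its training corpus ends. The cutoff date, column name, and dataframe are hypothetical illustrations; the paper itself does not prescribe this particular check.

```python
from datetime import date

import pandas as pd

# Assumed: an open-source LLM whose model card documents that its
# training corpus ends at a known cutoff date.
TRAINING_CUTOFF = date(2023, 3, 1)  # hypothetical, taken from the model card


def drop_leaked_rows(sample: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Keep only observations dated after the LLM's training cutoff.

    Predictions for earlier observations could simply echo text the model
    memorized during training ("leakage"), biasing any evaluation of the
    model's predictive performance.
    """
    dates = pd.to_datetime(sample[date_col]).dt.date
    leaked = dates <= TRAINING_CUTOFF
    print(f"Dropping {leaked.sum()} of {len(sample)} rows at risk of leakage.")
    return sample.loc[~leaked].copy()
```

A researcher might then call, say, `recent = drop_leaked_rows(headlines)` before asking the model to predict outcomes for those headlines, so that any measured accuracy cannot be an artifact of memorization.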
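To illustrate why validation data matter for estimation tasks, here is a minimal sketch of one standard misclassification correction for a binary label, using error rates estimated on a hand-coded validation subset. The paper's own estimators may differ; the function and variable names here are hypothetical.

```python
import numpy as np


def corrected_share(llm_labels: np.ndarray,
                    val_llm: np.ndarray,
                    val_human: np.ndarray) -> float:
    """Bias-correct the share of positive labels produced by an LLM.

    llm_labels: 0/1 LLM labels over the full sample.
    val_llm, val_human: 0/1 LLM and human labels on a random validation subset.

    Applies the classical misclassification correction
        p = (p_hat - fpr) / (tpr - fpr),
    where tpr and fpr are estimated from the validation data.
    """
    p_hat = llm_labels.mean()             # raw LLM-based share
    tpr = val_llm[val_human == 1].mean()  # sensitivity on validation set
    fpr = val_llm[val_human == 0].mean()  # false positive rate on validation set
    if np.isclose(tpr, fpr):
        raise ValueError("LLM labels are uninformative (tpr == fpr).")
    return float(np.clip((p_hat - fpr) / (tpr - fpr), 0.0, 1.0))
```

Even a modest hand-coded subset turns an opaque automation step into an estimator whose error can be measured and adjusted for, which is the spirit of the paper's call for benchmark data in estimation problems.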
By integrating validation procedures like these, researchers can ensure that their use of LLMs for estimation tasks meets the rigorous standards expected of empirical research.

The paper also includes two applications, one in finance and one in political economy, demonstrating how the framework can be applied to real-world research. These examples show that while the requirements for using LLMs in empirical research are stringent, they are essential to maintaining the quality and accuracy of the results. Ludwig and his coauthors emphasize that when these guidelines are followed, LLMs can be a powerful tool for researchers, even when language data are limited. When the safeguards are ignored, however, the resulting estimates can be deeply flawed, leading to potentially misleading conclusions.

"Altogether our results suggest that the excitement around the empirical uses of LLMs is warranted, provided researchers guard against training leakage by using open-source LLMs in prediction problems and collect benchmark data in estimation problems," the authors write in the paper's conclusion.