There are a variety of tasks that researchers perform in order
to conduct background research and learn about both existing and new
work in their research area. These include summarizing text, searching
relevant literature, seeking to understand new concepts, and translating
text from other languages.
Summarizing Text
First, LLMs are very capable summarizers. They are adept
at extracting the most relevant content from a passage of text, as
illustrated here:
The author discusses the potential of Generative AI, specifically
large language models like ChatGPT, to revolutionize research in various
domains such as ideation, writing, data analysis, coding, and mathematics,
and suggests that economists can significantly increase productivity
by automating micro tasks with AI, with the performance of these systems
expected to improve over time.
The summary sentence it provided covers all the important
bases. However, as discussed above, GPT-3.5/GPT-4 can process only up to 4,000
tokens or about 3,000 words, limiting their ability to produce summaries
of longer texts.
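This limit can be worked around by splitting a long text into chunks that each fit within the context window, summarizing each chunk separately, and then summarizing the combined partial summaries. The following is a minimal sketch of the chunking step (hypothetical illustrative code, not from the paper):

```python
def chunk_text(text: str, max_words: int = 3000) -> list[str]:
    """Split a long text into chunks of at most max_words words,
    so that each chunk fits within a ~4,000-token context window."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Each chunk can then be summarized separately, and the partial
# summaries combined and summarized in a final pass.
```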
Paper Summaries
By contrast, the LLM Claude 2 can process texts with up to 100,000
tokens or about 75,000 words at the time of writing, which covers
the length of almost all academic papers as well as many books. To
provide an example, I used the "Upload" button to feed the original
Feb. 2023 NBER working paper version of this paper (Korinek, Anton. 2023. "Language Models and Cognitive Automation for Economic Research." NBER Working Paper 30957)
into Claude 2 and asked for a short summary — "short" in order
to not take up too much space here:
This paper describes how large language models (LLMs) like ChatGPT
are starting to display useful capabilities across a range of research
tasks in economics. The author categorizes 25 potential use cases
into six domains: ideation, writing, background research, data analysis,
coding, and mathematical derivations. For each domain, the paper provides
instructions and examples to demonstrate current LLM abilities, rating
them on a scale from experimental to highly useful. The author argues
LLMs can help automate micro tasks and increase researcher productivity
in the short run. In the long run, the paper speculates cognitive
automation via LLMs may have profound implications for economic research
and the value of human cognitive labor.
The long context window of Claude 2 implies that it can
keep the entire text of a typical paper in its working memory.
The model can respond to follow-up questions on the content,
going into significant detail on the text provided. Additional examples
of useful prompts ... include:
What is the evidence supporting these points [or a specific point
from the summary]?
What are the author's main conclusions?
What are the most interesting (or most novel) ideas in this paper?
Can you explain the main findings of section [enter number]?
Claude allows the user to upload up to five documents and to ask the
LLM to perform tasks such as comparing them or identifying connections
between them.
Literature Research
At the time of writing, LLMs are of limited usefulness for
searching and referencing the academic literature. When asked for
citations or references, they frequently hallucinate and generate
authoritative-sounding papers that do not actually exist....
Although this has recently improved for cutting-edge LLMs — GPT-4
is significantly better in this respect than GPT-3.5 and is often
able to provide standard references that are frequently cited in the
literature — all references that LLMs generate should be carefully
double-checked. The following is an example:
Bethune, Z., & Korinek, A. (2021). Covid-19 Infection Externalities:
Trading Off Lives vs. Livelihoods. Journal of Monetary Economics,
117, 1-13.
The paper title is correct but the publication information
is hallucinated — and this is for a paper that has close to 200
citations. For small variations of my prompt (e.g., putting the authors
in alphabetical order), the LLM responded that it could not find such
a paper. Requests for academic literature frequently lead to invented
paper titles and journals that nonetheless sound superficially plausible.
Buchanan and Shapoval (2023, "GPT-3.5 Hallucinates Nonexistent Citations: Evidence from Economics," SSRN Working Paper) systematically examine the capability
of GPT-3.5 to write about every JEL code and find that more than 30%
of the citations generated do not exist.
The reason why LLMs struggle with academic references is that
they do not retain their training data verbatim, but instead retain
an abstract representation of it that involves some loss of information.
While this approach works for many general concepts, it falls short
for academic references where precision is crucial.
However, there are alternative tools available that employ LLMs to
help researchers find academic references. ... For example,
LLMs with web connections such as Bing or Bard can be prompted to
look up specific papers on the internet and generate reference information
from their findings. Moreover, the website https://elicit.org
allows searches of the academic literature and displays a list of
papers (that really exist) based on simple user prompts. It also includes
a brief summary generated by an LLM.
Formatting References
LLMs can also help with formatting references, as the following BibTeX and Chicago-style renditions of this paper's citation illustrate:
@techreport{korinek2023large,
  title={Large Language Models and Cognitive Automation for Economic Research},
  author={Korinek, Anton},
  year={2023},
  institution={NBER},
  type={Working Paper},
  number={30957},
  month={February}
}
Korinek, Anton. 2023. "Large Language Models and Cognitive Automation for Economic Research." NBER Working Paper No. 30957, February.
LLMs can convert a whole bibliography in batch, for
example from APA to Chicago style or into BibTeX format. For very
highly cited works that the LLM has frequently encountered in its
training data, a prompt like "bibtex reference for stiglitz weiss"
also works well and saves the user from having to copy or type the
detailed citation information of the work in question. However, it
is advisable to double-check the information for all but the most
highly cited papers.
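Once the citation fields are known, the two output formats shown above can also be produced deterministically. The following sketch (hypothetical helper functions, not from the paper) renders the same reference data as a BibTeX entry and a Chicago-style citation:

```python
def to_bibtex(ref: dict) -> str:
    """Render a working-paper reference as a BibTeX @techreport entry."""
    fields = ["title", "author", "year", "institution", "type", "number", "month"]
    body = ",\n".join(f"  {f}={{{ref[f]}}}" for f in fields if f in ref)
    return f"@techreport{{{ref['key']},\n{body}\n}}"

def to_chicago(ref: dict) -> str:
    """Render the same reference in Chicago author-date style."""
    return (f'{ref["author"]}. {ref["year"]}. "{ref["title"]}." '
            f'{ref["institution"]} {ref["type"]} No. {ref["number"]}, '
            f'{ref["month"]}.')

# Citation data for this paper, as used in the examples above.
korinek = {
    "key": "korinek2023large",
    "title": "Large Language Models and Cognitive Automation for Economic Research",
    "author": "Korinek, Anton",
    "year": "2023",
    "institution": "NBER",
    "type": "Working Paper",
    "number": "30957",
    "month": "February",
}
```

In practice, the appeal of using an LLM for this task is that it can parse the citation fields out of free-form text in the first place; the sketch only illustrates the target formats.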
Translating Text
LLMs are also useful for translating text across languages, as the following examples illustrate:
Spanish: La IA generativa, en particular los modelos de lenguaje
grandes (LLMs) como ChatGPT, tiene el potencial de revolucionar la
investigación en muchas disciplinas.
German: Generative KI, insbesondere grosse Sprachmodelle (LLMs) wie
ChatGPT, hat das Potenzial, die Forschung in vielen Disziplinen zu
revolutionieren.
Finnish: Generatiivinen tekoäly, erityisesti suuret kielen mallit
(LLM) kuten ChatGPT, on potentiaalia mullistaa tutkimusta monilla
aloilla.
During pre-training, LLMs develop representations of concepts
that abstract beyond any single language, which enables them to
map text easily from one language to another. Performance is worse on
lower-resource languages, for which less digitized text and fewer
digitized translations are available.
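Such translations are obtained simply by prompting the model. The following is a sketch of a reusable prompt template (the wording is a hypothetical illustration, not from the paper):

```python
def translation_prompt(text: str, target_language: str) -> str:
    """Build a prompt asking an LLM to translate a passage,
    keeping proper nouns such as 'ChatGPT' untranslated."""
    return (f"Translate the following text into {target_language}, "
            f"preserving proper nouns and technical terms:\n\n{text}")

sentence = ("Generative AI, specifically large language models (LLMs) "
            "like ChatGPT, has the potential to revolutionize research "
            "in many disciplines.")

# The same template can target any language the model supports.
prompt = translation_prompt(sentence, "Spanish")
```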
Explaining Concepts
LLMs can act as a tutor and explain many common economic concepts
— at a level that is useful for students who are trying to learn
new materials, and even for more advanced researchers who are branching
out beyond their main area of expertise. For good explanations, it
is critical to employ the most cutting-edge LLMs. The following is an example of such an explanation, on the use of instrumental variables:
Instrumental variables are useful for addressing issues of endogeneity
in statistical analysis, particularly in regression analysis. Endogeneity
can occur due to unobserved variables, measurement errors, or simultaneity,
which can lead to biased and inconsistent parameter estimates.
Instrumental variables are used to isolate the effect of the variable
of interest from the effects of other confounding variables. They
are particularly useful in situations where it is not possible to
directly observe all relevant variables, or where the variable of
interest is correlated with the error term.
In addition, instrumental variables can be used to estimate causal
relationships in observational data, where random assignment of treatments
is not possible. They provide a way to mimic the conditions of a randomized
controlled trial, which is considered the gold standard for causal
inference.
When employing LLMs in this way, follow-up questions that
go into further detail can be very useful and allow the user to obtain
personalized tutoring.
Other concept questions that the LLM answered well included the following:...
What is the invisible hand theorem and why does it hold?
What is Monte Carlo simulation?
How does Cournot equilibrium differ from Bertrand equilibrium?
What are the leading theories of exchange rate determination?
But sometimes LLMs, especially earlier versions, hallucinate
when providing answers. The following is an example using the older
GPT-3.5 model text-davinci-003, which powered the original free version
of ChatGPT in late 2022:
The Second Theorem of Welfare Economics states that any competitive
equilibrium is Pareto efficient. This means that any competitive market
outcome is the most efficient outcome possible, as it cannot be improved
upon without making at least one individual worse off.
(Author's note: this is clearly the first, not the second, theorem!)
The two fundamental theorems of welfare economics are
cornerstones of modern economics, but the text generated by GPT-3.5
confuses the two. Although LLMs give clear and impressive responses
in many instances, this illustrates that their output cannot always
be trusted, even on what seem like basic questions about fundamental
concepts. Since LLMs can be quite helpful and powerful as tutors,
one of the dilemmas they introduce is that students are likely to
use them even though they occasionally produce false responses with
great confidence.