The Table provides an
overview of commonly used LLMs as of September 2023, together with
some of their key properties and limitations, including their release
date, the maximum number of tokens that they can process, and their
training data cutoff date. It also lists the URLs at which chatbots
powered by these LLMs can be accessed.
OpenAI's ChatGPT is by far the most popular LLM. It comes
in a free version that is based on OpenAI's GPT-3.5 model as well
as a paid version for $20/month. Since March 2023, the paid version
has offered access to GPT-4, which is currently the most powerful
LLM that is publicly available. Both GPT-3.5 and GPT-4 were pre-trained
on data that cut off in September 2021, so they have no knowledge of more
recent events. They have a context window of 4000 tokens, amounting
to about 3000 words in English, with the limit applying to the sum
of the user prompt and the completion that is generated. Aside from
the ChatGPT web interface, OpenAI also offers access to its models
using an Application Programming Interface (API) that enables programmers
to query a range of different OpenAI LLMs while setting several model
parameters that affect the result. The models are available on a pay-per-use
basis and come in different sizes, with smaller models executing quickly and
cheaply, whereas larger models are more powerful but slower and more
expensive. The API also offers access to a version of GPT-4 with a
context window of 32k tokens.
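To illustrate what such an API call looks like, the following is a minimal sketch assuming the openai Python package (pre-1.0 interface) and an API key stored in the OPENAI_API_KEY environment variable; the model name, prompt, and parameter values are merely illustrative.

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Query a chat model with explicit parameter settings; temperature=0 makes the
# output close to deterministic, and max_tokens caps the length of the completion.
response = openai.ChatCompletion.create(
    model="gpt-4-0613",
    temperature=0,
    max_tokens=300,
    messages=[{"role": "user", "content": "Summarize the Solow growth model in two sentences."}],
)
print(response["choices"][0]["message"]["content"])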
Microsoft's New Bing chat engine is also based on GPT-3.5/4
and can browse the web in real time, serving users results that are
based on the most recent information available on the internet. It
also allows users to follow the links to the sources that it has identified.
Users can choose from three modes, Precise, Balanced, and Creative,
the latter of which provides free access to GPT-4.
As of June 2023, Google's Bard is based on the PaLM-2 Bison model,
which offers functionality at a level similar to GPT-3.5. Like
Bing, it can also search the web to include real-time information
in its response to user queries and allows users to follow links to
its sources. Users can pick from multiple answers, and the results can
easily be exported into spreadsheets. Like OpenAI,
Google also offers API access to a range of PaLM-2 models of different
sizes and capabilities, although it excludes its most powerful PaLM-2
Unicorn model from public access.
Anthropic's Claude 2 is an LLM that brands itself as being
helpful, honest, and harmless. It employs a process called constitutional
AI to train the LLM to follow a set of high-level ethical principles
(Bai et al., 2022). One of the highlights of Claude is that it has
a context window of 100k tokens, meaning that it can process about
75,000 words at once. This is far beyond the other models and implies
that Claude can process most academic papers in one go, as we will
explore further below. Unfortunately, Claude 2 is currently only available
in the US and UK. Anthropic also offers API access to its underlying
models for processing LLM requests in bulk.
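As a rough illustration of such access, the sketch below uses the anthropic Python package as documented in mid-2023 and an API key stored in the ANTHROPIC_API_KEY environment variable; the model name, prompt, and token limit are merely illustrative, and the interface may have changed since.

from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = Anthropic()

completion = client.completions.create(
    model="claude-2",
    max_tokens_to_sample=300,
    prompt=f"{HUMAN_PROMPT} Summarize the main argument of the following paper in three bullet points:\n\n<paper text>{AI_PROMPT}",
)
print(completion.completion)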
Meta's Llama 2 series is a set of models with 7B, 13B, and
70B parameters released in July 2023, as well as a code-generation
model named Code Llama released in August 2023. Meta has freely distributed
the underlying code and the weights of the trained models while withholding
the data used to train the model. The most powerful 70B parameter
version is on par with GPT-3.5 and is available on the leading cloud
computing platforms, including Microsoft Azure, AWS, and Hugging Face.
Llama 2 comes with a license that allows both researchers and (with
minor limitations) corporations to run the LLMs on their own computers
and to fine-tune and improve the pre-trained models. This is highly
beneficial from an economic perspective, as it distributes the social
surplus created by LLMs and stimulates innovation. However, as these
models become more powerful, it also poses growing safety risks (Anderljung et al., 2023).
For example, Llama has already allowed researchers to construct adversarial
attacks that circumvent the safety restrictions of all the LLMs listed
above (Zou et al., 2023).
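As an illustration of what running these models on one's own hardware involves, the following is a minimal sketch using the Hugging Face transformers library; it assumes that access to the gated meta-llama/Llama-2-7b-chat-hf weights has been granted and that the transformers and accelerate packages are installed, and the prompt is merely illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # smallest chat-tuned Llama 2 variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Tokenize a prompt, generate a completion, and decode it back into text.
inputs = tokenizer("Explain comparative advantage in one paragraph.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))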
A website that provides occasional users with a user-friendly interface
to all leading LLMs is https://poe.com.
Plugins
The capabilities of base LLMs can be significantly enhanced with plugins
that allow them to perform additional tasks that LLMs by themselves
are not good at. For economists, the plugin that is perhaps most noteworthy
at the time of writing is ChatGPT's Advanced Data Analysis, which
is available to ChatGPT Plus subscribers. The plugin allows ChatGPT
to write and execute computer code in a sandboxed environment and
to display the results as well as to build and iterate on them. Advanced
Data Analysis also allows users to upload files and perform data processing
tasks on them, ranging from complex analysis like regressions to file
conversions. We will cover several of these capabilities below. Google
Bard also runs code in the background to perform certain mathematical
tasks.
Another ChatGPT plugin that is useful for economists is Wolfram
Alpha, which can be activated in the plugin store that is available
to ChatGPT Plus subscribers. The site https://www.wolfram.com/wolfram-plugin-chatgpt/
describes a range of examples for how to use this plugin.
Vision-Language Models
Vision-language models (VLMs) combine LLMs with the ability to process visual information and integrate
the two. A version of GPT-4, which is not publicly available at the
time of writing, can incorporate visual information in its prompts.
Bard can display images from Google Search in its responses. This
is an area with a lot of potential for future use cases. For example,
early demonstrations suggest that VLMs are able to produce complex
outputs based on hand-drawn back-of-the-envelope drafts.
Reproducibility
Most of the applications in the remainder of this section use the
leading publicly available LLM at the time of writing, OpenAI's GPT-4,
version gpt-4-0613. In the online materials associated
with this article (see footnote on the frontpage of the article),
I provide Python code to reproduce the results by calling OpenAI's
API. The code sets the parameter "Temperature" to zero, which
makes the LLM responses close to deterministic. For non-programmers,
a user-friendly way to replicate the results is the OpenAI web interface
https://platform.openai.com/playground,
in which "Temperature'' can also be set to zero. Both the OpenAI
API and the Playground require a paid subscription to access GPT-4.*Executing all of the examples labeled GPT3.5/GPT-4 below required
a bit over 5k of input and 5k of output tokens each. At the time of
writing, the total cost was slightly below 50 cents. Further pricing
information is available at https://openai.com/pricing.
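As a back-of-the-envelope check on this figure, the short calculation below applies the GPT-4 (8k context) rates posted on that page at the time of writing, $0.03 per 1k input tokens and $0.06 per 1k output tokens; these rates may have changed since.

# Approximate cost of the GPT-3.5/GPT-4 examples at the assumed GPT-4 (8k) rates:
# $0.03 per 1k input tokens and $0.06 per 1k output tokens.
input_tokens, output_tokens = 5_000, 5_000
cost = input_tokens / 1000 * 0.03 + output_tokens / 1000 * 0.06
print(f"Approximate cost: ${cost:.2f}")  # $0.45, i.e., slightly below 50 cents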
There are two factors that limit the reproducibility of my results.
First, OpenAI states that "setting temperature to 0 will make the
outputs mostly deterministic, but a small amount of variability will
remain." I have observed these limits to reproducibility in particular
for examples with responses that span multiple sentences.*See https://platform.openai.com/docs/guides/gpt/why-are-model-outputs-inconsistent
for further information on the inconsistency of model output, even
at temperature zero, and https://community.openai.com/t/a-question-on-determinism/8185
for a discussion of the inherent indeterminacy of efficiently performing
LLM inference. In a nutshell, the efficient execution of LLMs with
hundreds of billions of parameters requires that calculations are
parallelized. However, given the discrete nature of computers, calculations
such as $(a \cdot b) \cdot c$ sometimes deliver a slightly different
result than $a \cdot (b \cdot c)$. When an LLM calculates which word
has the highest probability of coming next, minor differences in the parallelization
of the exact same calculations sometimes come to matter, resulting
in different word choices. And once one word changes, everything that
follows becomes different.
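This point can be illustrated with a short, self-contained Python sketch (independent of my reproduction code) that compares the two orders of multiplication for random operands; the exact number of mismatches depends on the random draws and the platform.

import random

# Count how often (a*b)*c and a*(b*c) differ in their least significant bits
# when a, b, and c are random floating-point numbers.
random.seed(0)
trials = 100_000
mismatches = 0
for _ in range(trials):
    a, b, c = random.random(), random.random(), random.random()
    if (a * b) * c != a * (b * c):
        mismatches += 1

print(f"{mismatches} of {trials} random triples gave different results")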
Second, OpenAI states that "as we launch safer and more capable
models, we regularly retire older models." Moreover, "after a
new version is launched, older versions will typically be deprecated
3 months later." If the gpt-4-0613 model is retired,
my results may no longer be reproducible.*Moreover, see https://platform.openai.com/docs/deprecations
on OpenAI's policy of model deprecations as well as the current timelines
for how long existing models are guaranteed to remain available.
The most convenient user interface is ChatGPT, available at https://chat.openai.com/,
which employs a "Temperature" parameter greater than zero that
introduces more variation into the model's responses. Accessing GPT-4
via this interface requires a paid subscription to ChatGPT Plus. This
allows users to try out the spirit of all the examples employing GPT-4
below, but the extra variability implies that the exact results will
differ every time a prompt is executed. The same applies to ChatGPT
Advanced Data Analysis and the Wolfram plugin, which both rely on
ChatGPT, and to Claude 2, which offers the ability to upload files.
My reproduction code therefore excludes the results of the latter
three models.