
LLMs Memorize vs Generalize: The Verdict

Meta, Google, Nvidia, and Cornell researchers reveal how much Large Language Models (LLMs) memorize compared to how much they generalize from their training data.

Summary

A new study by researchers from Meta, Google DeepMind, Cornell University, and NVIDIA estimates that Large Language Models (LLMs) have a fixed memorization capacity of approximately 3.6 bits per parameter. This capacity does not grow with more data; instead, when a model is trained on a larger dataset, the same fixed capacity is spread across more examples, so each individual data point is memorized to a lesser degree. This result could alleviate concerns about LLMs memorizing copyrighted or sensitive material. To separate memorization from generalization, the researchers used an unconventional method: training transformer models on uniformly random bitstrings, which contain no patterns to generalize from, so anything a model can reproduce must have been memorized. The study supports the argument for training LLMs on larger datasets and may have implications for copyright infringement lawsuits involving model providers.
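To make the "fixed capacity spread over more data" point concrete, here is a hedged back-of-envelope sketch. The ~3.6 bits/parameter figure comes from the study; the model size and dataset sizes below are hypothetical illustrations, not values reported by the authors.

```python
# Back-of-envelope sketch: total memorization capacity at ~3.6 bits/parameter,
# and how the per-sample share shrinks as the training set grows.
# The 3.6 bits/parameter estimate is from the study; the model and dataset
# sizes below are hypothetical examples chosen for illustration only.

BITS_PER_PARAM = 3.6

def capacity_bits(num_params: int) -> float:
    """Total memorization capacity implied by the study's estimate."""
    return BITS_PER_PARAM * num_params

def bits_per_sample(num_params: int, num_samples: int) -> float:
    """Rough memorization budget available per training sample."""
    return capacity_bits(num_params) / num_samples

model_params = 1_000_000_000  # hypothetical 1B-parameter model
for n_samples in (1_000_000, 100_000_000, 10_000_000_000):
    print(f"{n_samples:>14,} samples -> "
          f"{bits_per_sample(model_params, n_samples):10.4f} bits/sample")
```

With the capacity held fixed, the per-sample budget falls by 10,000x as the dataset grows from one million to ten billion samples, which is the intuition behind the claim that more training data reduces, rather than increases, verbatim memorization of any single document.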

Key Concepts

  • A new study reveals LLMs’ fixed memorization capacity is approximately 3.6 bits per parameter.
  • The memorization capacity does not increase with more training data.
  • More training data leads to safer generalization behavior, not increased risk.
  • The researchers measured memorization by training on datasets of uniformly random bitstrings, which contain no structure to generalize from (see the sketch after this list).
  • The study could influence decisions in copyright infringement lawsuits involving LLMs.
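The following is a minimal sketch of the random-bitstring idea, not the authors' actual experimental code: because uniformly random sequences carry no learnable structure, any bits a trained model can reproduce must have been memorized rather than generalized. The sequence length, dataset size, and model size below are hypothetical.

```python
import secrets

# Minimal sketch (not the authors' code): uniformly random bitstrings have no
# learnable structure, so any recall of them by a trained model is memorization.

SEQ_LEN_BITS = 64       # hypothetical length of each bitstring
NUM_SEQUENCES = 10_000  # hypothetical dataset size

def random_bitstring(n_bits: int) -> str:
    """Uniformly random bitstring; its information content is exactly n_bits."""
    return format(secrets.randbits(n_bits), f"0{n_bits}b")

dataset = [random_bitstring(SEQ_LEN_BITS) for _ in range(NUM_SEQUENCES)]

# Total information in the dataset: every bit is independent and uniform.
dataset_bits = SEQ_LEN_BITS * NUM_SEQUENCES

# Compare against the capacity implied by the ~3.6 bits/parameter estimate
# for a hypothetical tiny transformer.
model_params = 100_000
model_capacity_bits = 3.6 * model_params

print(f"dataset information : {dataset_bits:,.0f} bits")
print(f"model capacity      : {model_capacity_bits:,.0f} bits")
print("model can memorize at most "
      f"{min(1.0, model_capacity_bits / dataset_bits):.1%} of the data")
```

Once the dataset's information content exceeds the model's capacity, the model cannot store everything, which is how the study pins down a per-parameter memorization budget.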

Sentiment: POSITIVE

The findings are positive for LLM developers, as they may ease concerns about models memorizing copyrighted or proprietary training content. These insights could also influence ongoing copyright infringement lawsuits involving model providers.

Author and References

Author: VentureBeat
Article Link: How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell