
  • October 31, 2025

Why Python’s Default hash() Is Not Deterministic

Python’s built-in hash() is intentionally non-deterministic across process runs for security reasons. This design protects against hash-flooding attacks but makes hash() unsuitable for stable identifiers, persistence, or cross-process coordination. As a result, developers must introduce explicit hashing strategies, which adds extra complexity and boilerplate code that would otherwise be unnecessary.

Recently, while testing some refactoring capabilities of ChatGPT, an interesting issue surfaced in a piece of Python code. The code looked roughly like this:

my_document = {
    "key_1": "Main identity",
    "key_2": "Secondary identity",
}

doc_id = hash(f"{my_document['key_1']}_{my_document['key_2']}")

When reviewing this code with Codex, I was presented with the following warning:

Uses Python’s built-in hash for the checkpoint doc ID; hash randomization makes the ID unstable across processes, so retries/upserts will create duplicate docs and checkpoint reuse won’t work reliably.

My first reaction was disbelief: What do you mean, unstable?


Hash code generation: what we usually expect

Searching the web for “hash functions” quickly leads to a large body of documentation describing different hashing algorithms and use cases. One core idea shows up everywhere: a hash function is supposed to be deterministic.

A commonly cited definition goes like this:

Definition

Generating a hash is the process of applying a hash function to input data of arbitrary size to produce a fixed-size hash (digest) that deterministically represents that data. The same input always yields the same output, and small changes to the input produce significantly different outputs. For cryptographic uses, the function is designed to be computationally infeasible to invert or to find collisions.

Key characteristics

  • Deterministic mapping from arbitrary-length input to fixed-size output
  • Avalanche effect: tiny input changes produce large, unpredictable output changes
  • Efficiency on large inputs
  • Uniform distribution to minimize collisions
  • For cryptographic hashes specifically: preimage resistance, second-preimage resistance, and collision resistance
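These properties are easy to observe with a stable hash function such as SHA-256 (a quick illustration using the standard library, deliberately not Python's built-in hash()):

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Hash a string with SHA-256 and return the hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Deterministic: the same input always yields the same digest,
# in every process and on every machine.
assert sha256_hex("booking") == sha256_hex("booking")

# Avalanche effect: a one-character change produces a completely
# different digest.
print(sha256_hex("booking"))
print(sha256_hex("bookinh"))
```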

Even Wikipedia explicitly states this requirement:

Deterministic — A hash procedure must be deterministic—given the same input value, it must always generate the same hash value.

Wikipedia: Hash function

So why was Codex telling me that Python’s hash() does not behave this way?


So what is going on?

Some digging reveals that Python’s behavior is intentional. The key reason is security, specifically protection against hash-flooding denial-of-service attacks.

The root of the issue is documented in the oCERT advisory:

This advisory showed that predictable hash functions could be exploited by attackers to generate many colliding keys, degrading hash table performance from O(1) to O(n).

Searching the CPython issue tracker leads to the corresponding discussion:

In discussions involving Guido van Rossum and Christian Heimes, a proposal emerged: randomize hash values per process to make collision attacks impractical.

One summary from that discussion reads:

Guido wants the fix as simple and less intrusive as possible … Any new stuff is off the table unless it’s really necessary. … We haven’t agreed whether the randomization should be enabled by default or disabled by default.

In the end, randomization was enabled by default.

From a design perspective, this was a pragmatic security decision, even if it violated the common expectation that hash functions are deterministic across program runs.


What is actually happening in Python

When people say “Python’s built-in hash is randomized per process,” they mean the following:

  • For certain built-in types—most notably str, bytes, and some others—Python salts the hash with a random, per-interpreter seed chosen at startup.
  • Hash values are stable within a single process, but they may differ between separate runs of Python.
  • This randomization protects hash tables (dict, set) from collision-based DoS attacks.

This behavior is documented in the official Python documentation.

How it is controlled

Python exposes this behavior through an environment variable:

  • PYTHONHASHSEED

    • If unset or set to random, Python chooses a fresh random seed on each run.
    • If set to a decimal integer (for example 123), hash values become reproducible across runs.
    • If set to 0, hash randomization for affected types is disabled.

You can also inspect this at runtime via:

  • sys.flags.hash_randomization
  • sys.hash_info
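Both can be checked from a running interpreter (a minimal sketch):

```python
import sys

# 1 when hash randomization is active (the default), 0 when disabled.
print(sys.flags.hash_randomization)

# sys.hash_info describes the hash implementation in use.
print(sys.hash_info.algorithm)  # e.g. "siphash13" or "siphash24" for str/bytes
print(sys.hash_info.width)      # hash width in bits, typically 64
```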

Practical implications

  • Do not persist or compare values returned by hash() across processes or application restarts.
  • Do not use hash() as a stable identifier (for example, database keys, document IDs, or cache keys).
  • This behavior primarily affects string-like objects. Numeric types use a deterministic hashing scheme documented in the Python reference.
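The contrast is easy to demonstrate: numeric hashes do not depend on PYTHONHASHSEED, while string hashes are only stable within the current process (CPython behavior):

```python
# Numeric hashes do not depend on PYTHONHASHSEED.
assert hash(42) == 42            # small non-negative ints hash to themselves in CPython
assert hash(1) == hash(1.0)      # equal numbers of different types hash equally
assert hash(True) == hash(1)     # bool is a subclass of int

# Strings are the affected types; their hash is stable only
# within the current process.
assert hash("booking") == hash("booking")
```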

If you need a stable hash, use an explicit hashing algorithm instead, such as:

  • hashlib.sha256() for cryptographic stability
  • uuid.uuid5() for namespace-based identifiers

Quick demo: randomized vs reproducible hashing

Randomized (new seed each run)

python3 -c "print(hash('booking'))"
python3 -c "print(hash('booking'))"  # likely different

Reproducible (fixed seed)

PYTHONHASHSEED=123 python3 -c "print(hash('booking'))"
PYTHONHASHSEED=123 python3 -c "print(hash('booking'))"  # same value

Disable randomization

PYTHONHASHSEED=0 python3 -c "print(hash('booking'))"
PYTHONHASHSEED=0 python3 -c "print(hash('booking'))"  # same value

How the JVM addressed the same problem

The JVM ecosystem took a different approach.

Deterministic String.hashCode()

In Java, String.hashCode() is explicitly specified and deterministic:

s[0]*31^(n−1) + s[1]*31^(n−2) + … + s[n−1]

This definition has been stable for decades.
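To see just how deterministic it is, the formula can be reproduced in a few lines of Python (a sketch emulating Java's 32-bit signed overflow; it matches Java exactly for strings of BMP characters):

```python
def java_string_hash(s: str) -> int:
    """Reimplementation of Java's String.hashCode(): h = 31*h + c."""
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF   # wrap at 32 bits like Java's int
    # Map into Java's signed 32-bit range.
    return h - 0x100000000 if h >= 0x80000000 else h

# These match "".hashCode(), "a".hashCode(), and "ab".hashCode() in Java,
# on any machine, in any process.
assert java_string_hash("") == 0
assert java_string_hash("a") == 97
assert java_string_hash("ab") == 3105
```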

Hash-flooding defense in HashMap

Rather than salting hash codes, Java strengthened the hash table implementation itself. Since Java 8:

  • Buckets with excessive collisions are converted from linked lists into red–black trees.
  • This guarantees worst-case O(log n) performance instead of O(n).

Earlier Java versions (Java 7) briefly experimented with randomized string hashing behind a system property, but this approach was removed in favor of tree bins.


Concrete recommendations (and the cost of Python’s choice)

If you need identifiers that survive process restarts, deployments, or horizontal scaling, do not use hash(). Instead, choose one of the following explicitly:

  • Cryptographic hashes (stable, portable, slower):

    • hashlib.sha256(data).hexdigest()
    • Appropriate for persistence, database keys, cache keys, and distributed systems
  • Namespace-based UUIDs (stable, readable, standardized):

    • uuid.uuid5(namespace, value)
    • Well-suited for document IDs and external identifiers
  • Non-cryptographic hashes (fast, deterministic, third-party):

    • MurmurHash, xxHash, CityHash
    • Useful when performance matters and cryptographic guarantees are unnecessary

All of these options share one downside compared to hash():

They introduce additional dependencies, additional code, and additional decisions that developers must consciously make.

This is the real cost of Python’s design choice. A simple, obvious-looking function (hash()) cannot safely be used for common tasks such as generating IDs, forcing developers to learn about hash randomization—often the hard way.
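Applied to the snippet from the introduction, a stable doc ID could look like this (a sketch; NAMESPACE_URL is an arbitrary illustrative choice — any fixed namespace works):

```python
import hashlib
import uuid

my_document = {
    "key_1": "Main identity",
    "key_2": "Secondary identity",
}

composite = f"{my_document['key_1']}_{my_document['key_2']}"

# Option 1: cryptographic hash -- stable across processes and machines.
doc_id_sha = hashlib.sha256(composite.encode("utf-8")).hexdigest()

# Option 2: namespace-based UUID -- stable, readable, standardized.
doc_id_uuid = str(uuid.uuid5(uuid.NAMESPACE_URL, composite))

print(doc_id_sha)
print(doc_id_uuid)

# Both are reproducible: recomputing yields the identical ID.
assert doc_id_sha == hashlib.sha256(composite.encode("utf-8")).hexdigest()
assert doc_id_uuid == str(uuid.uuid5(uuid.NAMESPACE_URL, composite))
```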


Opinionated conclusion

Python and the JVM solved the same security problem in fundamentally different ways:

  • Python randomized hash values per process, preserving average-case performance but abandoning cross-run determinism.
  • Java/JVM preserved deterministic String.hashCode() and hardened HashMap itself to guarantee worst-case behavior.

From a security standpoint, Python’s solution is effective and easy to reason about.

From a developer-experience standpoint, it is far from ideal.

The result is a leaky abstraction: developers expect a hash function to be deterministic, yet Python’s most visible hashing API violates that expectation. The burden is pushed onto application code, where developers must now select, justify, and maintain explicit hashing strategies.

Whether this trade-off was worth it is debatable. What is clear is this:

In Python, hash() should be treated as an internal implementation detail, not a general-purpose hashing tool.

If you need determinism, be explicit—and be prepared to write more code to get it.
