
  • October 31, 2025

Why Python’s Default hash() Is Not Deterministic

Python’s built-in hash() is intentionally non-deterministic across process runs for security reasons. This design protects against hash-flooding attacks but makes hash() unsuitable for stable identifiers, persistence, or cross-process coordination. As a result, developers must introduce explicit hashing strategies, which adds extra complexity and boilerplate code that would otherwise be unnecessary.

Recently, while testing some refactoring capabilities of ChatGPT, an interesting issue surfaced in a piece of Python code. The code looked roughly like this:

my_document = {
    "key_1": "Main identity",
    "key_2": "Secondary identity",
}

doc_id = hash(f"{my_document['key_1']}_{my_document['key_2']}")

When reviewing this code with Codex, I was presented with the following warning:

Uses Python’s built-in hash for the checkpoint doc ID; hash randomization makes the ID unstable across processes, so retries/upserts will create duplicate docs and checkpoint reuse won’t work reliably.

My first reaction was disbelief: What do you mean, unstable?


Hash code generation: what we usually expect

Searching the web for “hash functions” quickly leads to a large body of documentation describing different hashing algorithms and use cases. One core idea shows up everywhere: a hash function is supposed to be deterministic.

A commonly cited definition goes like this:

Definition

Generating a hash is the process of applying a hash function to input data of arbitrary size to produce a fixed-size hash (digest) that deterministically represents that data. The same input always yields the same output, and small changes to the input produce significantly different outputs. For cryptographic uses, the function is designed to be computationally infeasible to invert or to find collisions.

Key characteristics

  • Deterministic mapping from arbitrary-length input to fixed-size output
  • Avalanche effect: tiny input changes produce large, unpredictable output changes
  • Efficiency on large inputs
  • Uniform distribution to minimize collisions
  • For cryptographic hashes specifically: preimage resistance, second-preimage resistance, and collision resistance
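These properties are easy to observe with a stable hash function such as SHA-256 (a quick illustration using the standard library, deliberately not Python's built-in hash()):

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Hash a string with SHA-256 and return the hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Deterministic: the same input always yields the same digest,
# in every process and on every machine.
assert sha256_hex("booking") == sha256_hex("booking")

# Avalanche effect: a one-character change produces a completely
# different digest.
print(sha256_hex("booking"))
print(sha256_hex("bookinh"))
```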

Even Wikipedia explicitly states this requirement:

Deterministic — A hash procedure must be deterministic—given the same input value, it must always generate the same hash value.

Wikipedia: Hash function

So why was Codex telling me that Python’s hash() does not behave this way?


So what is going on?

Some digging reveals that Python’s behavior is intentional. The key reason is security, specifically protection against hash-flooding denial-of-service attacks.

The root of the issue is documented in the oCERT advisory:

This advisory showed that predictable hash functions could be exploited by attackers to generate many colliding keys, degrading hash table performance from O(1) to O(n).

Searching the CPython issue tracker leads to the corresponding discussion:

In discussions involving Guido van Rossum and Christian Heimes, a proposal emerged: randomize hash values per process to make collision attacks impractical.

One summary from that discussion reads:

Guido wants the fix as simple and less intrusive as possible … Any new stuff is off the table unless it’s really necessary. … We haven’t agreed whether the randomization should be enabled by default or disabled by default.

In the end, randomization was enabled by default.

From a design perspective, this was a pragmatic security decision, even if it violated the common expectation that hash functions are deterministic across program runs.


What is actually happening in Python

When people say “Python’s built-in hash is randomized per process,” they mean the following:

  • For certain built-in types—most notably str, bytes, and some others—Python salts the hash with a random, per-interpreter seed chosen at startup.
  • Hash values are stable within a single process, but they may differ between separate runs of Python.
  • This randomization protects hash tables (dict, set) from collision-based DoS attacks.

This behavior is documented in the official Python documentation.

How it is controlled

Python exposes this behavior through an environment variable:

  • PYTHONHASHSEED

    • If unset or set to random, Python chooses a fresh random seed on each run.
    • If set to a decimal integer (for example 123), hash values become reproducible across runs.
    • If set to 0, hash randomization for affected types is disabled.

You can also inspect this at runtime via:

  • sys.flags.hash_randomization
  • sys.hash_info
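Both can be checked from a running interpreter (a minimal sketch):

```python
import sys

# 1 when hash randomization is active (the default), 0 when disabled.
print(sys.flags.hash_randomization)

# sys.hash_info describes the hash implementation in use.
print(sys.hash_info.algorithm)  # e.g. "siphash13" or "siphash24" for str/bytes
print(sys.hash_info.width)      # hash width in bits, typically 64
```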

Practical implications

  • Do not persist or compare values returned by hash() across processes or application restarts.
  • Do not use hash() as a stable identifier (for example, database keys, document IDs, or cache keys).
  • This behavior primarily affects string-like objects. Numeric types use a deterministic hashing scheme documented in the Python reference.
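The contrast is easy to demonstrate: numeric hashes do not depend on PYTHONHASHSEED, while string hashes are only stable within the current process (CPython behavior):

```python
# Numeric hashes do not depend on PYTHONHASHSEED.
assert hash(42) == 42            # small non-negative ints hash to themselves in CPython
assert hash(1) == hash(1.0)      # equal numbers of different types hash equally
assert hash(True) == hash(1)     # bool is a subclass of int

# Strings are the affected types; their hash is stable only
# within the current process.
assert hash("booking") == hash("booking")
```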

If you need a stable hash, use an explicit hashing algorithm instead, such as:

  • hashlib.sha256() for cryptographic stability
  • uuid.uuid5() for namespace-based identifiers

Quick demo: randomized vs reproducible hashing

Randomized (new seed each run)

python3 -c "print(hash('booking'))"
python3 -c "print(hash('booking'))"  # likely different

Reproducible (fixed seed)

PYTHONHASHSEED=123 python3 -c "print(hash('booking'))"
PYTHONHASHSEED=123 python3 -c "print(hash('booking'))"  # same value

Disable randomization

PYTHONHASHSEED=0 python3 -c "print(hash('booking'))"
PYTHONHASHSEED=0 python3 -c "print(hash('booking'))"  # same value

How the JVM addressed the same problem

The JVM ecosystem took a different approach.

Deterministic String.hashCode()

In Java, String.hashCode() is explicitly specified and deterministic:

s[0]*31^(n−1) + s[1]*31^(n−2) + … + s[n−1]

This definition has been stable for decades.
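To see just how deterministic it is, the formula can be reproduced in a few lines of Python (a sketch emulating Java's 32-bit signed overflow; it matches Java exactly for strings of BMP characters):

```python
def java_string_hash(s: str) -> int:
    """Reimplementation of Java's String.hashCode(): h = 31*h + c."""
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF   # wrap at 32 bits like Java's int
    # Map into Java's signed 32-bit range.
    return h - 0x100000000 if h >= 0x80000000 else h

# These match "".hashCode(), "a".hashCode(), and "ab".hashCode() in Java,
# on any machine, in any process.
assert java_string_hash("") == 0
assert java_string_hash("a") == 97
assert java_string_hash("ab") == 3105
```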

Hash-flooding defense in HashMap

Rather than salting hash codes, Java strengthened the hash table implementation itself. Since Java 8:

  • Buckets with excessive collisions are converted from linked lists into red–black trees.
  • This guarantees worst-case O(log n) performance instead of O(n).

Earlier Java versions (Java 7) briefly experimented with randomized string hashing behind a system property, but this approach was removed in favor of tree bins.


Concrete recommendations (and the cost of Python’s choice)

If you need identifiers that survive process restarts, deployments, or horizontal scaling, do not use hash(). Instead, choose one of the following explicitly:

  • Cryptographic hashes (stable, portable, slower):

    • hashlib.sha256(data).hexdigest()
    • Appropriate for persistence, database keys, cache keys, and distributed systems
  • Namespace-based UUIDs (stable, readable, standardized):

    • uuid.uuid5(namespace, value)
    • Well-suited for document IDs and external identifiers
  • Non-cryptographic hashes (fast, deterministic, third-party):

    • MurmurHash, xxHash, CityHash
    • Useful when performance matters and cryptographic guarantees are unnecessary

All of these options share one downside compared to hash():

They introduce additional dependencies, additional code, and additional decisions that developers must consciously make.

This is the real cost of Python’s design choice. A simple, obvious-looking function (hash()) cannot safely be used for common tasks such as generating IDs, forcing developers to learn about hash randomization—often the hard way.
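Applied to the snippet from the introduction, a stable doc ID could look like this (a sketch; NAMESPACE_URL is an arbitrary illustrative choice — any fixed namespace works):

```python
import hashlib
import uuid

my_document = {
    "key_1": "Main identity",
    "key_2": "Secondary identity",
}

composite = f"{my_document['key_1']}_{my_document['key_2']}"

# Option 1: cryptographic hash -- stable across processes and machines.
doc_id_sha = hashlib.sha256(composite.encode("utf-8")).hexdigest()

# Option 2: namespace-based UUID -- stable, readable, standardized.
doc_id_uuid = str(uuid.uuid5(uuid.NAMESPACE_URL, composite))

print(doc_id_sha)
print(doc_id_uuid)

# Both are reproducible: recomputing yields the identical ID.
assert doc_id_sha == hashlib.sha256(composite.encode("utf-8")).hexdigest()
assert doc_id_uuid == str(uuid.uuid5(uuid.NAMESPACE_URL, composite))
```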


Opinionated conclusion

Python and the JVM solved the same security problem in fundamentally different ways:

  • Python randomized hash values per process, preserving average-case performance but abandoning cross-run determinism.
  • Java/JVM preserved deterministic String.hashCode() and hardened HashMap itself to guarantee worst-case behavior.

From a security standpoint, Python’s solution is effective and easy to reason about.

From a developer-experience standpoint, it is far from ideal.

The result is a leaky abstraction: developers expect a hash function to be deterministic, yet Python’s most visible hashing API violates that expectation. The burden is pushed onto application code, where developers must now select, justify, and maintain explicit hashing strategies.

Whether this trade-off was worth it is debatable. What is clear is this:

In Python, hash() should be treated as an internal implementation detail, not a general-purpose hashing tool.

If you need determinism, be explicit—and be prepared to write more code to get it.
