
Why Python’s Default hash() Is Not Deterministic
Python’s built-in hash() is intentionally non-deterministic across process runs for security reasons. This design protects against hash-flooding attacks but makes hash() unsuitable for stable identifiers, persistence, or cross-process coordination. As a result, developers must introduce explicit hashing strategies, which adds extra complexity and boilerplate code that would otherwise be unnecessary.
Recently, while testing some refactoring capabilities of ChatGPT, an interesting issue surfaced in a piece of Python code. The code looked roughly like this:
my_document = {
"key_1": "Main identity",
"key_2": "Secondary identity",
}
doc_id = hash(f"{my_document['key_1']}_{my_document['key_2']}")
When reviewing this code with Codex, I was presented with the following warning:
Uses Python’s built-in hash for the checkpoint doc ID; hash randomization makes the ID unstable across processes, so retries/upserts will create duplicate docs and checkpoint reuse won’t work reliably.
My first reaction was disbelief: What do you mean, unstable?
Hash code generation: what we usually expect
Searching the web for “hash functions” quickly leads to a large body of documentation describing different hashing algorithms and use cases. One core idea shows up everywhere: a hash function is supposed to be deterministic.
A commonly cited definition goes like this:
Definition
Generating a hash is the process of applying a hash function to input data of arbitrary size to produce a fixed-size hash (digest) that deterministically represents that data. The same input always yields the same output, and small changes to the input produce significantly different outputs. For cryptographic uses, the function is designed to be computationally infeasible to invert or to find collisions.
Key characteristics
- Deterministic mapping from arbitrary-length input to fixed-size output
- Avalanche effect: tiny input changes produce large, unpredictable output changes
- Efficiency on large inputs
- Uniform distribution to minimize collisions
- For cryptographic hashes specifically: preimage resistance, second-preimage resistance, and collision resistance
Even Wikipedia explicitly states this requirement:
Deterministic — A hash procedure must be deterministic—given the same input value, it must always generate the same hash value.
So why was Codex telling me that Python’s hash() does not behave this way?
So what is going on?
Some digging reveals that Python’s behavior is intentional. The key reason is security, specifically protection against hash-flooding denial-of-service attacks.
The root of the issue is documented in the oCERT advisory:
This advisory showed that predictable hash functions could be exploited by attackers to generate many colliding keys, degrading hash table performance from O(1) to O(n).
Searching the CPython issue tracker leads to the corresponding discussion:
In discussions involving Guido van Rossum and Christian Heimes, a proposal emerged: randomize hash values per process to make collision attacks impractical.
One summary from that discussion reads:
Guido wants the fix as simple and less intrusive as possible … Any new stuff is off the table unless it’s really necessary. … We haven’t agreed whether the randomization should be enabled by default or disabled by default.
In the end, randomization was enabled by default.
From a design perspective, this was a pragmatic security decision, even if it violated the common expectation that hash functions are deterministic across program runs.
What is actually happening in Python
When people say “Python’s built-in hash is randomized per process,” they mean the following:
- For certain built-in types (most notably str, bytes, and some others), Python salts the hash with a random, per-interpreter seed chosen at startup.
- Hash values are stable within a single process, but they may differ between separate runs of Python.
- This randomization protects hash tables (dict, set) from collision-based DoS attacks.
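A quick way to see this programmatically is to compute the hash of the same string in two fresh interpreter processes. This is a sketch: the helper name is made up, and PYTHONHASHSEED is forced to random so the demo works even if your shell has pinned a seed.

```python
import os
import subprocess
import sys

def hash_in_fresh_process(value: str) -> int:
    # Force randomization on, even if PYTHONHASHSEED is set in this shell
    env = {**os.environ, "PYTHONHASHSEED": "random"}
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        capture_output=True, text=True, check=True, env=env,
    )
    return int(out.stdout)

first = hash_in_fresh_process("booking")
second = hash_in_fresh_process("booking")

# Within one process, hash() is perfectly stable...
print(hash("booking") == hash("booking"))  # True
# ...but two separate processes each pick their own seed
print(first == second)  # almost always False
```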
This behavior is documented in the official Python documentation:
How it is controlled
Python exposes this behavior through the PYTHONHASHSEED environment variable:
- If unset or set to random, Python chooses a fresh random seed on each run.
- If set to a decimal integer (for example 123), hash values become reproducible across runs.
- If set to 0, hash randomization for affected types is disabled.
You can also inspect this at runtime via:
- sys.flags.hash_randomization
- sys.hash_info
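For example, a minimal inspection sketch (the exact algorithm name varies by CPython version and build):

```python
import sys

# 1 when hash randomization is active (the default since Python 3.3)
print(sys.flags.hash_randomization)

# Details of the hash scheme in use, e.g. a SipHash variant on modern CPython
print(sys.hash_info.algorithm, sys.hash_info.hash_bits)
```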
Practical implications
- Do not persist or compare values returned by hash() across processes or application restarts.
- Do not use hash() as a stable identifier (for example, database keys, document IDs, or cache keys).
- This behavior primarily affects string-like objects. Numeric types use deterministic hashing schemes defined in the language specification.
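To illustrate the numeric case: number hashing follows a documented modular-arithmetic scheme, with modulus 2**61 - 1 on a 64-bit CPython build (the values below assume such a build).

```python
# Small ints hash to themselves
print(hash(42))          # 42

# The modulus of the numeric hash scheme wraps to zero
print(hash(2**61 - 1))   # 0

# Floats are hashed with the same scheme, so this too is stable across runs
print(hash(0.5))
```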
If you need a stable hash, use an explicit hashing algorithm instead, such as:
- hashlib.sha256() for cryptographic stability
- uuid.uuid5() for namespace-based identifiers
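Applied to the doc_id example from the beginning, a stable version might look like this (the NAMESPACE_URL choice is an illustrative assumption; pick a namespace that fits your domain):

```python
import hashlib
import uuid

my_document = {
    "key_1": "Main identity",
    "key_2": "Secondary identity",
}
key = f"{my_document['key_1']}_{my_document['key_2']}"

# Content-addressed ID: stable across processes, machines, and Python versions
doc_id = hashlib.sha256(key.encode("utf-8")).hexdigest()

# Namespace-based UUID: also fully deterministic for a given namespace + name
doc_uuid = uuid.uuid5(uuid.NAMESPACE_URL, key)

print(doc_id)
print(doc_uuid)
```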
Quick demo: randomized vs reproducible hashing
Randomized (new seed each run)
python3 -c "print(hash('booking'))"
python3 -c "print(hash('booking'))" # likely different
Reproducible (fixed seed)
PYTHONHASHSEED=123 python3 -c "print(hash('booking'))"
PYTHONHASHSEED=123 python3 -c "print(hash('booking'))" # same value
Disable randomization
PYTHONHASHSEED=0 python3 -c "print(hash('booking'))"
PYTHONHASHSEED=0 python3 -c "print(hash('booking'))" # same value
How the JVM addressed the same problem
The JVM ecosystem took a different approach.
Deterministic String.hashCode()
In Java, String.hashCode() is explicitly specified and deterministic:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
This definition has been stable for decades:
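For illustration, the algorithm is simple enough to re-implement in a few lines of Python. This sketch emulates Java's 32-bit signed integer overflow; it matches Java exactly only for strings of BMP characters, since Java hashes UTF-16 code units.

```python
def java_string_hash(s: str) -> int:
    """Java's String.hashCode(), re-implemented for illustration."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # emulate 32-bit int overflow
    return h - 0x1_0000_0000 if h >= 0x8000_0000 else h  # to signed 32-bit

# Identical on every run, on every machine
print(java_string_hash("booking"))

# The classic collision pair: "Aa" and "BB" share hash code 2112
print(java_string_hash("Aa") == java_string_hash("BB"))  # True
```

Determinism also makes collisions like "Aa"/"BB" predictable, which is exactly the property attackers exploited and why Java had to harden HashMap itself.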
Hash-flooding defense in HashMap
Rather than salting hash codes, Java strengthened the hash table implementation itself. Since Java 8:
- Buckets with excessive collisions are converted from linked lists into red–black trees.
- This guarantees worst-case O(log n) performance instead of O(n).
Relevant references:
- Java SE 8: HashMap implementation notes
- OpenJDK source constants: TREEIFY_THRESHOLD = 8, UNTREEIFY_THRESHOLD = 6, MIN_TREEIFY_CAPACITY = 64
Earlier Java versions (Java 7) briefly experimented with randomized string hashing behind a system property, but this approach was removed in favor of tree bins.
Concrete recommendations (and the cost of Python’s choice)
If you need identifiers that survive process restarts, deployments, or horizontal scaling, do not use hash(). Instead, choose one of the following explicitly:
Cryptographic hashes (stable, portable, slower):
- hashlib.sha256(data).hexdigest()
- Appropriate for persistence, database keys, cache keys, and distributed systems
Namespace-based UUIDs (stable, readable, standardized):
- uuid.uuid5(namespace, value)
- Well-suited for document IDs and external identifiers
Non-cryptographic hashes (fast, deterministic, third-party):
- MurmurHash, xxHash, CityHash
- Useful when performance matters and cryptographic guarantees are unnecessary
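None of those libraries ship with Python, but the stdlib's zlib.crc32 shows the same key property in miniature. CRC32 is a checksum with poor distribution, not a general-purpose hash, so treat this purely as an illustration of determinism, not a recommendation.

```python
import zlib

# The same input yields the same checksum in every process, on every machine,
# which is exactly what hash() does not guarantee for strings.
print(zlib.crc32(b"booking"))
```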
All of these options share one downside compared to hash():
They introduce additional dependencies, additional code, and additional decisions that developers must consciously make.
This is the real cost of Python’s design choice. A simple, obvious-looking function (hash()) cannot safely be used for common tasks such as generating IDs, forcing developers to learn about hash randomization—often the hard way.
Opinionated conclusion
Python and the JVM solved the same security problem in fundamentally different ways:
- Python randomized hash values per process, preserving average-case performance but abandoning cross-run determinism.
- Java/JVM preserved deterministic String.hashCode() and hardened HashMap itself to guarantee worst-case behavior.
From a security standpoint, Python’s solution is effective and easy to reason about.
From a developer-experience standpoint, it is far from ideal.
The result is a leaky abstraction: developers expect a hash function to be deterministic, yet Python’s most visible hashing API violates that expectation. The burden is pushed onto application code, where developers must now select, justify, and maintain explicit hashing strategies.
Whether this trade-off was worth it is debatable. What is clear is this:
In Python, hash() should be treated as an internal implementation detail, not a general-purpose hashing tool.
If you need determinism, be explicit—and be prepared to write more code to get it.


