A Bloom filter implementation in Clojure.
A Bloom filter is a probabilistic data structure for testing set membership. A deterministic data structure for testing set membership is a `java.util.Set`; some other languages would use a hash (or map) data structure with the target strings as keys and a boolean as the value. As the number of entries in your set grows, so does the memory usage of these data structures. Bloom filters sacrifice determinism in favor of significantly lower memory usage. Bloom filters do not suffer from false negatives (reporting that a given string is not in the set when it in fact is), but they do suffer from false positives (reporting that a string is in the set when it in fact is not). The false positive rate can be estimated fairly accurately, and, more importantly, the size of the filter and the number of hashes can be chosen to hit a target false positive rate.
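Conceptually, the filter is just a bit array plus `k` hash functions: adding an item sets the `k` bits its hashes select, and a membership query reports "present" only if all `k` bits are set. A minimal conceptual sketch (an illustration only, not this library's implementation; the `toy-` names are made up here):

```clojure
(import 'java.util.BitSet)

;; Toy Bloom filter: m bits and k hash functions derived by salting
;; Java's hashCode. Illustrative only.
(defn toy-filter [m k]
  {:bits (BitSet. m) :m m :k k})

(defn- bit-indices [{:keys [m k]} ^String s]
  ;; clojure.core/mod with a positive divisor is always non-negative
  (for [salt (range k)]
    (mod (.hashCode (str salt s)) m)))

(defn toy-add! [flt s]
  (doseq [i (bit-indices flt s)]
    (.set ^BitSet (:bits flt) (int i))))

(defn toy-member? [flt s]
  (every? #(.get ^BitSet (:bits flt) (int %)) (bit-indices flt s)))
```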
The `examples/words.clj` example demonstrates that calculating these values based on an estimated set size and a target false positive rate gets you within a factor of about three of the desired rate.
One use for such a filter is to quickly test that a given string is not a dictionary word, for example to help ensure a user's password is not a dictionary word.
This implementation uses a `java.util.BitSet` as the bit array and provides several helpers for constructing hash functions.
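Before walking through the full example, here is a minimal usage sketch assembled from the functions that appear in the example below (treat it as a sketch; the example is the authoritative usage):

```clojure
(require '[com.github.kyleburton.clj-bloom :as bf])

;; A 10M-bit filter using 7 permutations of the CRC32-based hash builder.
(def flt (bf/bloom-filter
          (* 10 1024 1024)
          (bf/make-permuted-hash-fn bf/make-hash-fn-crc32
                                    (map str (range 7)))))

(bf/add! flt "hello")
(bf/include? flt "hello")    ;; => true
(bf/include? flt "goodbye")  ;; => false (almost certainly)
```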
This example code is contained in the file `examples/words.clj`; it initializes a Bloom filter with the dictionary from `/usr/share/dict/words`. You can run this example with the `run.sh` script in the project root directory.
```clojure
(ns words
  (:require [clojure.contrib.duck-streams :as ds]
            [com.github.kyleburton.clj-bloom :as bf]))

(def ^:dynamic *words-file* "/usr/share/dict/words")

(defn all-words []
  (ds/read-lines *words-file*))

;; Add every dictionary word (lower-cased) to the filter.
(defn add-words-to-filter! [filter]
  (dorun
   (doseq [word (all-words)]
     (bf/add! filter (.toLowerCase word)))))

(defn run [hash-fn]
  (let [filter (bf/bloom-filter (* 10 1024 1024) hash-fn)]
    (add-words-to-filter! filter)
    (dorun
     (doseq [w (.split "The quick brown ornithopter hyper-jumped over the lazy trollusk" "\\s+")]
       (if (bf/include? filter (.toLowerCase w))
         (prn (format "HIT: '%s' in the filter" w))
         (prn (format "MISS: '%s' not in the filter" w)))))))

(defn make-words-filter [m-expected-entries k-hashes hash-fn]
  (let [flt (bf/bloom-filter
             m-expected-entries
             (bf/make-permuted-hash-fn
              (or hash-fn bf/make-hash-fn-hash-code)
              (map str (range 0 k-hashes))))]
    (add-words-to-filter! flt)
    flt))

;; Probe the filter with numeric strings (none of which are dictionary
;; words) and count how many are falsely reported as present.
(defn words-fp-rate [count flt]
  (loop [times count fps 0]
    (cond (= 0 times)                   fps
          (bf/include? flt (str times)) (recur (dec times) (inc fps))
          :else                         (recur (dec times) fps))))

(defn report-fp-rate [count flt]
  (let [rate (words-fp-rate count flt)]
    (/ rate count 1.0)))

;; Build a 2,875,518-bit, 7-hash filter (sized for a ~1% FP rate) with
;; each hash builder and report its measured false-positive rate.
(def word-flt-1pct (make-words-filter 2875518 7 bf/make-hash-fn-hash-code))
(printf "n=2875518, k=7, p=0.01: Java's hashCode: fp=%f cardinality=%d\n"
        (report-fp-rate 100000 word-flt-1pct)
        (.cardinality (:bitarray word-flt-1pct)))

(def word-flt-crc32-1pct (make-words-filter 2875518 7 bf/make-hash-fn-crc32))
(printf "n=2875518, k=7, p=0.01: CRC32: fp=%f cardinality=%d\n"
        (report-fp-rate 100000 word-flt-crc32-1pct)
        (.cardinality (:bitarray word-flt-crc32-1pct)))

(def word-flt-adler32-1pct (make-words-filter 2875518 7 bf/make-hash-fn-adler32))
(printf "n=2875518, k=7, p=0.01: Adler32: fp=%f cardinality=%d\n"
        (report-fp-rate 100000 word-flt-adler32-1pct)
        (.cardinality (:bitarray word-flt-adler32-1pct)))

(def word-flt-md5-1pct (make-words-filter 2875518 7 bf/make-hash-fn-md5))
(printf "n=2875518, k=7, p=0.01: MD5: fp=%f cardinality=%d\n"
        (report-fp-rate 100000 word-flt-md5-1pct)
        (.cardinality (:bitarray word-flt-md5-1pct)))

(def word-flt-sha1-1pct (make-words-filter 2875518 7 bf/make-hash-fn-sha1))
(printf "n=2875518, k=7, p=0.01: SHA1: fp=%f cardinality=%d\n"
        (report-fp-rate 100000 word-flt-sha1-1pct)
        (.cardinality (:bitarray word-flt-sha1-1pct)))
```
`/usr/share/dict/words` on my system is 2,486,813 bytes, while the recommended size of this filter for a 1% FP rate is 2,875,518 bits – roughly a sevenfold reduction in required memory. If the entries were larger, the memory savings would be larger as well.
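For reference, the arithmetic behind that factor, using the sizes quoted above:

```clojure
(/ 2875518 8.0)        ;; => 359439.75 bytes of filter
(/ 2486813 359439.75)  ;; => ~6.9, i.e. roughly sevenfold
```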
The Wikipedia article on Bloom filters discusses methods for determining optimal values for the number of hashes, `k`, and the size of the bit array, `m`, given an expected number of entries, `n`, and a target false positive probability, `p`.
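The standard formulas are m = -(n · ln p) / (ln 2)² and k = (m / n) · ln 2. Transcribed as a quick Clojure sketch (the library's own helpers, shown below, compute the same quantities):

```clojure
;; Optimal bit-array size and hash count for n entries at FP rate p.
(defn m-bits [n p]
  (/ (* -1.0 n (Math/log p)) (Math/pow (Math/log 2) 2)))

(defn k-hashes [n m]
  (* (/ (double m) n) (Math/log 2)))

(m-bits 234936 0.01)       ;; => ~2251875.3
(k-hashes 234936 2251875)  ;; => 6.643855378585771
```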
This library provides functions to help you determine those parameters for your data set. Following along with the `words.clj` example, the `/usr/share/dict/words` file on my system contains 234,936 words. This gives us n = 234936 entries; if we can tolerate a 1% error rate, the size of the bit array should be:
```clojure
(num-bits-for-entries-and-fp-probability 234936 0.01)
;; => 2251875
```
or about 2.2 million bits. Given this bit array size, we can now compute an optimal number of hashes:
```clojure
(num-hash-fns-for-entries-and-bits 234936 2251875)
;; => 6.643855378585771
```
rounding to 7. You can obtain both of these values in one call with:
```clojure
(optimal-n-and-k 300000 0.01)
;; => [2875518 7]
```
Based on the benchmarks in the `examples/words.clj` file, I do not recommend using Java's `hashCode` function with the Bloom filter: it results in a wildly higher false positive rate than the other hashes. My recommendation is to use the values reported by `optimal-n-and-k` as a starting point and measure your actual false positive rate, keeping in mind the '4.8 bits per key' rule: each additional 4.8 bits of filter per key decreases the false-positive rate by roughly an order of magnitude.
There is also a helper function, `make-optimal-filter`, that takes an estimated number of entries, a desired false-positive probability, and an optional hash builder function.
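For example (a sketch; the argument order follows the description above):

```clojure
;; A filter sized for ~234,936 entries at a 1% false-positive rate,
;; using the default hash builder.
(def dict-filter (bf/make-optimal-filter 234936 0.01))
```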
The `words.clj` example shows how to create your own hash function. As part of the definition of a Bloom filter, a number of hashes, `k`, are executed to determine the bits to set. The interface for this hash function should take a `java.lang.String` and an `int`: the string is the data to compute the hash for, and the int is the size in bits (`m`) of the bit array. The hash function must return the sequence of bit locations to be set in the filter for the given input string.
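As an illustration of that interface, here is a hypothetical hash function (the name and salting scheme are made up for this sketch; it is not one of the library's builders):

```clojure
;; Takes the input string and the bit-array size m; returns the seq of
;; bit indices to set. Derives 7 indices by salting Java's hashCode.
(defn my-hash-fn [^String s m]
  (for [salt (range 7)]
    (mod (.hashCode (str salt s)) m)))

;; Plug it into a filter:
(def my-flt (bf/bloom-filter 2875518 my-hash-fn))
```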
The current implementation uses a `java.util.BitSet`, which is limited to 2^31 - 1 bits. This puts an upper bound on the number of elements and the false-positive probability; at a 1% false-positive rate (about 9.6 bits per entry), for example, the filter tops out at roughly 224 million entries. Both `java.util.BitSet` and `java.math.BigInteger` have this limitation. I will look into implementing a version of the filter that uses `byte` arrays under the hood so that this limitation can be lifted.
If you’re using Leiningen, add the following to your `project.clj` file’s `:dependencies`:
```clojure
[com.github.kyleburton/clj-bloom "1.0.1"]
```
For Maven:
```xml
<dependencies>
  <dependency>
    <groupId>com.github.kyleburton</groupId>
    <artifactId>clj-bloom</artifactId>
    <version>1.0.1</version>
  </dependency>
  ...
</dependencies>
```
To build using Leiningen:
```sh
clj-bloom$ lein test
clj-bloom$ lein jar
clj-bloom$ lein uberjar
```