- define a distance metric among objects;
- compute the matrix of distances in O(N^2);
- define the fingerprint of the object;
- map the object to buckets; one different bucket for each different fingerprint;
- compare directly the objects in the same bucket.
Another interesting hash technique is the HashTree construction. Here an order set of items (itemset) is hashed with an hash function. The hash of item in position i-th is used to select a specific children of the node currently visited. In short, hashing is used to drive the visit of a path from the root to a leaf. Each leave of the HashTree contains a bucket, where similar itemsets are stored. The main difference with shingling is that order is preserved and that to items are considered as candidate for similarity if their hash collides in the first k positions. It turns our that this property if very useful for computing frequent itemsets in the Apriori algorithm. Here you have the C++ code for HashTree.