The original problem is in biology but the following captures the CS
issues, Assume I have a large number of locks and a large number of keys.
There is a scoring function between keys and locks and a key that fits a
lock will have a high score. There may be many keys fitting one lock and a
key may fit no locks well. The object is to find the best fitting lock for
each key.
Assume that the number of keys and locks is high enough that taking the
cartesian product of the two is computationally impractical. Also assume
that keys and locks have an attached location which is accurate within an
error (say 1 Km). Only keys and locks within 1 Km need be compared.
Now assume I can create a JavaRDD<Keys> and a JavaRDD<Locks> . I could
divide the locations into 1 Km squared bins and look only within a few
bins. Assume that it is practical to take a cartesian product for all
elements in a bin but not to keep all elements in memory. I could map my
RDDs into PairRDDs where the key is the bin assigned by location
I know how to take the cartesian product of two JavaRDDs but not how to
take a cartesian product of sets of elements sharing a common key (bin),
Any suggestions. Assume that in the worst cases the number of elements in a
bin are too large to keep in memory although if a bin were subdivided into,
say 100 subbins elements would fit in memory.
Any thoughts as to how to attack the problem
