Improve support for PairRDD joins

Description

As per this thread on the mailing list: https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/bTh6vONFta8

In the standard Spark API, PairRDDFunctions[K, V] exposes a join operation, join[W](other: RDD[(K, W)]): RDD[(K, (V, W))], which joins two RDDs with the same key type and returns a joined PairRDD keyed by the same K, making an effort to preserve the partitioner.
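
For reference, a minimal sketch of that behaviour (the user-id-keyed RDDs below are made up purely for illustration):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative only: both RDDs are keyed by the same Int user id.
def standardJoin(sc: SparkContext): RDD[(Int, (String, Double))] = {
  val names:  RDD[(Int, String)] = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  val totals: RDD[(Int, Double)] = sc.parallelize(Seq((1, 9.99), (2, 4.50)))

  // PairRDDFunctions.join keeps the key type K and pairs up the values:
  // the result is RDD[(Int, (String, Double))], reusing an existing partitioner where it can.
  names.join(totals)
}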

In the connector API, joinWithCassandraTable has the shape joinWithCassandraTable[R](...): RDD[(L, R)], where L is the element type of the RDD being joined and R is the type of the rows read from Cassandra. So for a PairRDD, the result is an RDD[((K, V), R)]. That type makes sense in the general case, but for a PairRDD it is probably not what you wanted, because the key has changed. RDD[(K, (V, R))] would be better.

I currently work around this with an additional map after joinWithCassandraTable that rewrites the key, but that map discards the partitioner, so subsequent key-based operations trigger a data shuffle.
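
A sketch of that workaround, assuming a hypothetical table test.words whose single partition key column corresponds to the pair's key (here a String); the final map is the step that rewrites the key and loses the partitioner:

import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

// wordCounts: RDD[(String, Long)] keyed by the (hypothetical) table's partition key column.
def joinAndRestoreKey(wordCounts: RDD[(String, Long)]): RDD[(String, (Long, CassandraRow))] = {
  // joinWithCassandraTable keys the result by the whole left element:
  // RDD[((String, Long), CassandraRow)]
  val joined: RDD[((String, Long), CassandraRow)] =
    wordCounts.joinWithCassandraTable[CassandraRow]("test", "words")

  // Workaround from the description: map the key back to the original K.
  // This discards the partitioner, so later key-based stages re-shuffle.
  joined.map { case ((word, count), row) => (word, (count, row)) }
}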

Pull Requests

None

Status

Assignee

Unassigned

Reporter

Neil Ord

Labels

None

Reviewer

None

Reviewer 2

None

Tester

None

Pull Request

None

Priority

Major