Remove semantic API ambiguity of selecting RDDs of KV pairs

Description

Consider this code:

Now answer these questions:

  • Which columns are used to instantiate objects of type A?

  • Which columns are used to instantiate objects of type B?

  • Which columns are used to instantiate objects of type C?

  • Does adding C to the tuple change how A and B are mapped?

Keep in mind that A, B, C can be:

  • a primitive type mapped to a single column

  • a user-defined type with a custom type converter mapped to a single column

  • a user-defined type with no custom type converter, mapped to a single UDT column

  • a tuple mapped to multiple columns

  • a class mapped to multiple columns

The current logic makes it hard to reason about which columns are used to instantiate B. Sometimes it uses all columns, sometimes only those not used by A. And class A does not always "consume" its columns.

Moreover, the logic for selecting the right implicit is quite convoluted (MagicTypeTricks is used to blacklist some implicits).

Unfortunately, we tried to squeeze too much logic into a single method with a single selection list. The use case of selecting just 2 columns into a 2-tuple overlaps with the use case of selecting full key-value pairs into 2-tuples.

I propose to add a new method similar to Spark's keyBy that would use a separate column list to map the key class. So instead we'd write:

which would type to RDD[(K, V)]. A shorthand version would also work:

There is no "column skipping". Both K and V are instantiated based on their respective column selections, and some columns can be shared between the two selection lists.
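To make the "no column skipping" rule concrete, here is a minimal plain-Scala sketch of the semantics. It is a toy model only, not the connector's API or implementation: rows are plain maps, and every name in it (KeyBySketch, project, this standalone keyBy, UserKey, UserInfo, the sample columns) is hypothetical and chosen for illustration.

```scala
object KeyBySketch {
  // Toy stand-in for a table row: column name -> value (an assumption, not a connector type).
  type Row = Map[String, Any]

  // Keep only the columns named in a selection list.
  def project(row: Row, columns: Seq[String]): Row =
    row.filter { case (name, _) => columns.contains(name) }

  // Key and value are each built from their own selection list.
  // Neither side "consumes" columns, so the two lists may overlap.
  def keyBy[K, V](rows: Seq[Row], keyColumns: Seq[String], valueColumns: Seq[String])
                 (toKey: Row => K, toValue: Row => V): Seq[(K, V)] =
    rows.map(row => (toKey(project(row, keyColumns)), toValue(project(row, valueColumns))))

  // Hypothetical key and value classes; both read the shared "region" column.
  final case class UserKey(id: Int, region: String)
  final case class UserInfo(region: String, name: String)

  val rows: Seq[Row] = Seq(
    Map("id" -> 1, "region" -> "eu", "name" -> "Ann", "age" -> 30),
    Map("id" -> 2, "region" -> "us", "name" -> "Bob", "age" -> 41)
  )

  val keyed: Seq[(UserKey, UserInfo)] =
    keyBy(rows, Seq("id", "region"), Seq("region", "name"))(
      r => UserKey(r("id").asInstanceOf[Int], r("region").asInstanceOf[String]),
      r => UserInfo(r("region").asInstanceOf[String], r("name").asInstanceOf[String])
    )

  def main(args: Array[String]): Unit = keyed.foreach(println)
}
```

In this model, adding or removing a column from the value list never changes how the key is built, which is the property the single-selection-list approach could not guarantee.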

Pull Requests

https://github.com/datastax/spark-cassandra-connector/pull/722 (b1.3)
https://github.com/datastax/spark-cassandra-connector/pull/723 (master)

Activity

H 
June 19, 2015 at 4:08 PM

Aside from a few minor questions on the master PR, all looks good. Very cool addition

Russell Spitzer 
May 7, 2015 at 3:57 PM

I like the keyBy method as well. I think that's pretty clear.

TupshinT 
May 7, 2015 at 3:51 PM

Love that shorthand notation.

Fixed

Created May 7, 2015 at 3:42 PM
Updated June 22, 2015 at 8:15 AM
Resolved June 22, 2015 at 8:15 AM