Fixed
Details
Assignee: Piotr Kołaczkowski
Reporter: Piotr Kołaczkowski
Reviewer: H
Created May 7, 2015 at 3:42 PM
Updated June 22, 2015 at 8:15 AM
Resolved June 22, 2015 at 8:15 AM
Consider this code:
Now answer these questions:
Which columns are used to instantiate objects of type A?
Which columns are used to instantiate objects of type B?
Which columns are used to instantiate objects of type C?
Does adding C to the tuple change how A and B are mapped?
Keep in mind that A, B, C can be:
a primitive type mapped to a single column
a user-defined type with a custom type converter mapped to a single column
a user-defined type with no custom type converter, mapped to a single UDT column
a tuple mapped to multiple columns
a class mapped to multiple columns
The current logic makes it hard to reason about which columns are used to instantiate B. Sometimes it uses all columns, sometimes only those not used by A. And class A does not always "consume" its columns.
Moreover, the logic for selecting the right implicit is quite convoluted (MagicTypeTricks is used to blacklist some implicits).
Unfortunately, we tried to squeeze too much logic into a single method with a single selection list. The use case of selecting just 2 columns into a 2-tuple overlaps with the use case of selecting full key-value pairs into 2-tuples.
I propose adding a new method, similar to Spark's keyBy, that would use a separate column list to map the key class. So instead we'd write:
which would have the type RDD[(K, V)]. A shorthand version would also work:
There is no "column skipping". Both K and V are instantiated based on their respective column selections, and some columns can be shared between the two selection lists.
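To make the "no column skipping" rule concrete, here is a minimal sketch in plain Scala, with no Spark or connector dependency. It is not the connector's implementation: Row, ColumnMapper, keyByColumns and the sample case classes are all hypothetical names invented for illustration. It only models the proposed semantics, where the key and the value are each read from their own selection list and the lists may overlap.

```scala
object KeyBySketch {
  // Assumption: a row is modeled as a plain map from column name to value.
  type Row = Map[String, Any]

  // Hypothetical stand-in for a row reader: it sees only its own column list.
  final class ColumnMapper[T](val columns: Seq[String], build: Seq[Any] => T) {
    // Looks up each selected column by name and passes the values to `build`.
    def read(row: Row): T = build(columns.map(row))
  }

  // Key and value are read independently from the same row.
  // Neither mapper "consumes" columns, so a column may appear in both lists.
  def keyByColumns[K, V](
      rows: Seq[Row],
      key: ColumnMapper[K],
      value: ColumnMapper[V]): Seq[(K, V)] =
    rows.map(row => (key.read(row), value.read(row)))

  // Hypothetical key and value classes that share the "region" column.
  final case class UserKey(id: Int, region: String)
  final case class UserInfo(region: String, name: String)

  val keyMapper = new ColumnMapper[UserKey](
    Seq("id", "region"),
    { case Seq(id: Int, region: String) => UserKey(id, region) })

  val valueMapper = new ColumnMapper[UserInfo](
    Seq("region", "name"),
    { case Seq(region: String, name: String) => UserInfo(region, name) })

  val rows: Seq[Row] = Seq(
    Map("id" -> 1, "region" -> "eu", "name" -> "Ann"),
    Map("id" -> 2, "region" -> "us", "name" -> "Bob"))

  // Each element pairs a UserKey with a UserInfo built from the same row.
  val pairs: Seq[(UserKey, UserInfo)] = keyByColumns(rows, keyMapper, valueMapper)
}
```

Because each mapper carries its own column list, adding a column to the value selection cannot change how the key is built, which is the property the questions above are probing.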