Add Tunable Read Parallelism for Full Table Scans

Description

We've previously limited read parallelism to the number of Spark cores along with prefetching. This may be unsuitable for users who want to run with a limited number of Spark cores. To fix this we can subdivide the partition's token range into `parallelism` pieces, request each of those pieces concurrently, and merge the resulting iterators.
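As a rough illustration of the subdivision step, the sketch below splits one token range into `parallelism` contiguous sub-ranges. The `TokenRange` case class and `split` helper are hypothetical stand-ins, not the connector's actual types (which handle the full Murmur3 token space and wraparound):

```scala
// Hypothetical sketch: split a token range [start, end) into `parallelism`
// contiguous sub-ranges so each can be queried concurrently.
object TokenRangeSplit {
  case class TokenRange(start: Long, end: Long)

  def split(range: TokenRange, parallelism: Int): Seq[TokenRange] = {
    val step = (range.end - range.start) / parallelism
    (0 until parallelism).map { i =>
      val s = range.start + i * step
      // Last piece absorbs any remainder from integer division.
      val e = if (i == parallelism - 1) range.end else s + step
      TokenRange(s, e)
    }
  }
}
```

Because the pieces are contiguous and emitted in order, concatenating their per-piece results in the same order would preserve the overall token ordering.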

Specifically,

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraTableScanRDD.scala#L331-L336

would be changed to produce multiple token-range iterators that are combined with something like `concatMap` from RxScala. The difficulty is that we don't want to break the token ordering of the results while still maintaining parallelism. It's not clear whether there is a way to do this without buffering the whole result, or whether token ordering even needs to be preserved now that we have the partitioner code. If order is lost, this would break `spanBy`.
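One order-preserving option is to launch all sub-range reads concurrently and concatenate the completed results in sub-range order. The sketch below uses plain Scala `Future`s rather than RxScala, and `fetchRange` is a hypothetical stand-in for issuing one sub-range query; note that it buffers each completed piece, which is exactly the trade-off discussed above:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object ParallelScanSketch {
  // Hypothetical fetch: reads one sub-range [start, end) and returns
  // its rows (modeled here as the tokens themselves) in token order.
  def fetchRange(start: Long, end: Long): Future[Seq[Long]] =
    Future { (start until end).toSeq }

  // Launch all sub-range reads concurrently, then concatenate the
  // buffered results in sub-range order. Parallelism is preserved on
  // the fetch side; ordering is preserved by the ordered flatten,
  // at the cost of holding each piece in memory.
  def orderedParallelScan(ranges: Seq[(Long, Long)]): Iterator[Long] = {
    val futures = ranges.map { case (s, e) => fetchRange(s, e) }
    Await.result(Future.sequence(futures), 30.seconds).iterator.flatten
  }
}
```

A streaming variant could instead consume the first sub-range's iterator while later sub-ranges prefetch in the background, which bounds buffering to the in-flight pieces.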

Pull Requests

None

Status

Assignee

Russell Spitzer

Reporter

Russell Spitzer

Labels

None

Reviewer

None

Reviewer 2

None

Tester

None

Components

Priority

Major