Introduce Data Locality for Single Partition Queries

Description

Our application is using Spark and Cassandra.
We make queries on time series data, like this:

The table structure is the following:

The problem we have encountered here is that the Connector sends queries on some fixed set of nodes, and those nodes are NOT the nodes containing partitions of our data.

While debugging, we have found the following code fragments which (as we suspect) cause this behavior.

In CassandraPartitionGenerator:

... and the "splitCount == 1" is the result from CassandraTableScanRDD:

Is it possible to make the queries run on the same nodes where the data partitions are located?

Steps to reproduce:

1. Allocate a C* cluster of K nodes, where K > 1
2. Create a keyspace with replica=1
3. Create table as described above that uses (fname, timestamp) as composite key; insert several entities there
4. Deploy a Spark cluster on the C* nodes, e.g. using Mesos
5. Run multiple queries from Spark-shell using the Connector:

... and see that each Spark Job contains a single task of NODE_LOCAL locality, assigned to the same node, regardless of a randomly chosen name.

Pull Requests

None

Status

Assignee

Unassigned

Reporter

Dmitry Ochnev

Labels

None

Reviewer

None

Reviewer 2

None

Tester

None

Pull Request

None

Affects versions

Priority

Major
Configure