Duplicate entries in a list when using saveToCassandra()

Description

Hi,

When saving rows with saveToCassandra(), the address in the list column ends up duplicated.

Note that I set Spark parallelism to 1. Is this expected behavior? My expectation was that the row would be completely overwritten, not duplicated.

With parallelism set to 2, the issue is not reproducible: there is a single address, as expected.

Environment

Spark Cassandra Connector version 2.0.6
Cassandra version 3.0.15

Pull Requests

None

Activity


Russell Spitzer July 18, 2018 at 3:54 PM

So this is apparently an interesting aspect of C* and you could file a bug there if you like. Basically, when an "insert" is performed on a list, it first prepends a tombstone at `writetime - 1`. In a batch, all mutations carry the same write timestamp, which means both inserts end up adding the same tombstone, each followed by its own entry. As a result, a series of inserts in a batch is effectively treated as a series of appends, although this is mostly unintentional.
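The timestamp interaction described above can be modeled with a small sketch. This is not Cassandra code: `insert_list`, the `(timestamp, value)` cell tuples, and the sample values are all hypothetical, simplifying list storage down to just the tombstone-vs-timestamp comparison.

```python
def insert_list(cells, elements, writetime):
    """Toy model of a C* list INSERT: a range tombstone at
    writetime - 1, then the new elements written at writetime."""
    tombstone = writetime - 1
    # The tombstone shadows every element written at or before it.
    cells = [(ts, v) for ts, v in cells if ts > tombstone]
    # The insert's own elements land after the tombstone.
    cells += [(writetime, v) for v in elements]
    return cells

# Separate writes get distinct timestamps: the second insert's
# tombstone shadows the first element, so the list is overwritten.
cells = insert_list([], ["addr"], writetime=100)
cells = insert_list(cells, ["addr"], writetime=200)
print(cells)  # [(200, 'addr')] - one element, as expected

# In a batch, both mutations share one timestamp: each tombstone sits
# at writetime - 1 and shadows neither entry, so the two inserts
# behave like appends and the address is duplicated.
cells = insert_list([], ["addr"], writetime=100)
cells = insert_list(cells, ["addr"], writetime=100)
print(cells)  # [(100, 'addr'), (100, 'addr')] - duplicated
```

The model reproduces the reported symptom: identical timestamps within a batch neutralize the per-insert tombstones, turning overwrites into appends.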

I'll mark this as invalid since this is really only fixable in C*, but it can easily be worked around on the application side by not inserting into the same list multiple times (reducing the dataset) or by not batching (limiting inserts to 1 record per batch).

Russell Spitzer July 18, 2018 at 2:11 PM

Confirmed this in CQL

Russell Spitzer July 18, 2018 at 2:10 PM

This is probably an idempotency-with-lists issue. When the two lists land in one partition, they are grouped into a single batched write, which probably produces a two-element list because of lists' strange semantics. The parallelism-2 version issues two separate writes, one overriding the other, probably because they have different timestamps.

I guess this is a bug in Cassandra's design? You could probably eliminate it by disabling batching for writes with lists like this when you aren't attempting to append (just overwrite).

Or by doing a reduceByKey in Spark to eliminate duplicates before writing.
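That dedup-before-write workaround can be sketched in plain Python; in Spark it would be something like `rdd.reduceByKey((a, b) => b)` ahead of `saveToCassandra()`. The row data and key names here are hypothetical.

```python
# Hypothetical rows keyed by primary key; the duplicate carries the
# same list payload that would otherwise be inserted twice in one batch.
rows = [
    ("user1", ["addr"]),
    ("user1", ["addr"]),  # duplicate of the same row
]

# reduceByKey-style fold: keep one value per key (last one wins),
# so each primary key reaches Cassandra exactly once and the batch
# never contains two inserts to the same list.
deduped = {}
for key, value in rows:
    deduped[key] = value

print(list(deduped.items()))  # [('user1', ['addr'])]
```

With only one mutation per list per batch, the same-timestamp tombstone problem described in the later comment never arises.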

Invalid

Details

Assignee

Reporter

Components

Affects versions

Priority

Created July 18, 2018 at 1:54 PM
Updated July 18, 2018 at 3:54 PM
Resolved July 18, 2018 at 3:54 PM