QueryBuilder.removeAll cannot delete a frozen set that was inserted via JSON, due to an element-order mismatch

Description

The customer is using DSE driver version 1.8.2.

Schema:

Repro step:

1. Insert a row into the table using a JSON literal.

2. Remove the frozen set with the QueryBuilder.removeAll method.

The binding boundRemove.setSet("test_data", getUDTValue(keyspace)); calls the getUDTValue function to construct the frozen set.

The issue is that the order of the elements returned by getUDTValue differs from the order in the inserted JSON data.

In cqlsh, the test_data element "世界知的所有権機関" appears before "中国":

cqlsh:test> select json * from test_table;

The order of elements in *Set<Object> tlaeiDataSentenceProcResultSet = new HashSet<>();* returned by getUDTValue is different: 中国 comes before 世界知的所有権機関, even though 世界知的所有権機関 was inserted first.

The current workaround is to use a LinkedHashSet instead of a HashSet for tlaeiDataTagInfoSet, which preserves insertion order so that the serialized set matches the JSON values (see the sketch below).
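
A sketch of that workaround, assuming an existing Session and PreparedStatement; buildUdtElement is a hypothetical stand-in for however the customer's code constructs each UDT value:

{code:java}
// Workaround sketch: build the set with a LinkedHashSet so that elements are
// serialized in insertion order and match the order of the original JSON insert.
// buildUdtElement is a hypothetical placeholder; session and preparedRemove are
// assumed to exist in the surrounding code.
Set<Object> tlaeiDataTagInfoSet = new LinkedHashSet<>();      // was: new HashSet<>()
tlaeiDataTagInfoSet.add(buildUdtElement("世界知的所有権機関")); // same order as in the JSON literal
tlaeiDataTagInfoSet.add(buildUdtElement("中国"));

BoundStatement boundRemove = preparedRemove.bind();
boundRemove.setSet("test_data", tlaeiDataTagInfoSet);
session.execute(boundRemove);
{code}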

Environment

None

Pull Requests

None

Activity

Olivier Michallat
August 24, 2020, 4:11 PM

Side note: could you work on simplifying the example a bit before opening a ticket? Here's a minimal reproducing case:
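
(The snippet itself was not preserved in this export; what follows is a hypothetical reconstruction along those lines, assuming a table keyed by a frozen<set<text>> column and Java driver 3.x APIs.)

{code:java}
import java.util.LinkedHashSet;
import java.util.Set;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class FrozenSetRepro {
  public static void main(String[] args) {
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect()) {

      session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH replication = "
          + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
      session.execute("CREATE TABLE IF NOT EXISTS test.foo (s frozen<set<text>> PRIMARY KEY)");

      // Insert through a JSON literal, as in the original report:
      session.execute("INSERT INTO test.foo JSON "
          + "'{\"s\": [\"世界知的所有権機関\", \"中国\"]}'");

      // Look the row up with a set built on the client side:
      Set<String> s = new LinkedHashSet<>();
      s.add("中国");
      s.add("世界知的所有権機関");

      Row row = session.execute("SELECT * FROM test.foo WHERE s = ?", s).one();
      System.out.println(row); // null when the serialized element order differs
    }
  }
}
{code}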

With that being said, I do reproduce the issue: row is null, but if I swap the two s.add() calls, it isn't.

I need to try a few more things, but my first thought is that this might be a problem on the server side.

Olivier Michallat
August 24, 2020, 4:23 PM

I tried inserting the row with a different element order from the Java side; I now have two rows with what looks like the same PK:
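
(Again only a hypothetical sketch, continuing from the reconstruction above; the actual code and output are not preserved here:)

{code:java}
// Insert the same two elements from Java, serialized in the other order than the
// JSON literal produced:
Set<String> s = new LinkedHashSet<>();
s.add("中国");
s.add("世界知的所有権機関");
session.execute("INSERT INTO test.foo (s) VALUES (?)", s);

// The JSON-inserted row is not overwritten; the table now holds two rows whose
// keys print as the same two elements.
for (Row r : session.execute("SELECT * FROM test.foo")) {
  System.out.println(r.getSet("s", String.class));
}
{code}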

This might be a corner case of how frozen works; I'm checking with the server developers.

Olivier Michallat
August 24, 2020, 10:54 PM

I got confirmation that this is a known limitation: once a set is frozen, it's just treated like an opaque blob. If you pass two different blobs (even if they correspond to the same elements in a different order), they are considered different values.
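
A small illustration of that point, as a hypothetical sketch against the driver 3.x codec API:

{code:java}
// Sketch: the serialized form of a set is just a sequence of bytes. Two sets with
// the same elements in a different iteration order serialize to different blobs,
// and a frozen column compares those blobs verbatim.
TypeCodec<Set<String>> codec =
    CodecRegistry.DEFAULT_INSTANCE.codecFor(DataType.set(DataType.text()));

Set<String> a = new LinkedHashSet<>(Arrays.asList("世界知的所有権機関", "中国"));
Set<String> b = new LinkedHashSet<>(Arrays.asList("中国", "世界知的所有権機関"));

ByteBuffer bytesA = codec.serialize(a, ProtocolVersion.V4);
ByteBuffer bytesB = codec.serialize(b, ProtocolVersion.V4);
System.out.println(bytesA.equals(bytesB)); // false: same elements, different blobs
{code}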

There are still some parts I don't fully understand: when you insert the data with the JSON literal, 世界知的所有権機関 always ends up first and 中国 second, regardless of the order in which they were specified. But when you build the set with a HashSet on the client, it's the other way around. I would expect a consistent order, because UDTValue uses the same hashcode on the client and the server, but maybe Cassandra uses something other than UDTValue to encode the JSON literal.

Another puzzling thing is that this only happens with non-ASCII characters. If I encode plain ASCII strings, like abc and def, it works.

But anyway, the bottom line is that we can't rely on a consistent encoding of frozen sets across the board. One way to work around that would be to read the full set, modify it on the client side, and write it back; but that's a read-modify-write, and it won't work well if the row is accessed concurrently. Beyond that, I think you should try to rework the data model: maybe you can associate a key with each test_data element, in order to turn test_table.test_data into a map. Or maybe the UDTs could be turned into auxiliary tables instead...
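
A hypothetical sketch of the read-modify-write approach, assuming a simplified schema where test_data is a regular frozen<set<text>> column keyed by an id column (the real schema uses UDT elements):

{code:java}
// Read-modify-write sketch (hypothetical schema: id text PRIMARY KEY,
// test_data frozen<set<text>>). The whole set is rewritten, so the result does
// not depend on how the server serialized the previous frozen blob.
// Caveat: not safe under concurrent writers, as noted above.
Row current = session.execute(
    "SELECT test_data FROM test.test_table WHERE id = ?", "some-id").one();
Set<String> testData = new LinkedHashSet<>(current.getSet("test_data", String.class));

testData.remove("中国"); // modify on the client

session.execute(
    "UPDATE test.test_table SET test_data = ? WHERE id = ?", testData, "some-id");
{code}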


For reference, here's a second attempt at a minimal case (closer to your example):
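
(That snippet is not preserved in this export either; a hypothetical reconstruction with a set of UDTs, to mirror the customer's test_data column, could look like:)

{code:java}
import java.util.HashSet;
import java.util.Set;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.UDTValue;
import com.datastax.driver.core.UserType;

public class FrozenUdtSetRepro {
  public static void main(String[] args) {
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect()) {

      session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH replication = "
          + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
      session.execute("CREATE TYPE IF NOT EXISTS test.item (name text)");
      session.execute("CREATE TABLE IF NOT EXISTS test.test_table ("
          + "id text PRIMARY KEY, test_data frozen<set<frozen<item>>>)");

      // Insert through a JSON literal:
      session.execute("INSERT INTO test.test_table JSON '{\"id\": \"1\", \"test_data\": "
          + "[{\"name\": \"世界知的所有権機関\"}, {\"name\": \"中国\"}]}'");

      // Rebuild the same elements on the client, as the customer's getUDTValue does:
      UserType itemType = cluster.getMetadata().getKeyspace("test").getUserType("item");
      Set<UDTValue> testData = new HashSet<>();
      testData.add(itemType.newValue().setString("name", "世界知的所有権機関"));
      testData.add(itemType.newValue().setString("name", "中国"));

      // Matching on the frozen column only succeeds if the serialized blobs are equal:
      Row row = session.execute(
          "SELECT * FROM test.test_table WHERE test_data = ? ALLOW FILTERING", testData).one();
      System.out.println(row); // null when the element order in the blob differs
    }
  }
}
{code}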

Assignee

Olivier Michallat

Reporter

Mike Zhang

Labels

None

PM Priority

None

Reproduced in

DSE-1.8.2

Affects versions

Fix versions

None

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None


Priority

Major