After upgrading the DataStax Java driver from 3.8.0 to 4.7.2, we noticed that when non-ASCII data is written to an 'ascii' field, the non-ASCII characters are silently replaced with question marks. The 3.x series threw an exception instead.
This seems to come from StringCodec.java line 66 (https://github.com/datastax/java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/type/codec/StringCodec.java#L66), which uses String.getBytes(). According to the Javadoc for that method, this is expected:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The CharsetEncoder class should be used when more control over the encoding process is required.
Although this does reduce the number of errors thrown, I would argue that silently altering data is not the ideal behavior for a database driver.
Could the behavior from 3.x perhaps be put back, to avoid data corruption?
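For reference, the replacement behavior can be reproduced without the driver. This is a minimal sketch using only the JDK; the class and method names are made up for illustration and are not part of the driver.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not driver code.
final class AsciiReplacementDemo {

    // Encodes the value as US-ASCII with String.getBytes and decodes it again,
    // which is roughly what a write followed by a read would see.
    static String roundTrip(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The unmappable character U+00E9 is replaced, no exception is thrown.
        System.out.println(roundTrip("caf\u00e9")); // prints "caf?"
    }
}
```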
Thanks for catching this, I agree that we should keep the same behavior as driver 3.
The legacy codec used a regex, but CharsetEncoder, as suggested in the docs, might be a tad cleaner.
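A rough sketch of what the CharsetEncoder approach could look like, using only JDK classes. The class name, method name, exception type and message are assumptions for illustration; the actual fix may be structured differently and would likely throw whatever exception the driver normally uses for codec errors.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the driver's StringCodec.
final class StrictAsciiEncoder {

    // Encodes the value as US-ASCII and fails instead of substituting '?'.
    static ByteBuffer encodeStrict(String value) {
        // CharsetEncoder instances are not thread-safe, so create one per call.
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return encoder.encode(CharBuffer.wrap(value));
        } catch (CharacterCodingException e) {
            // Assumption: IllegalArgumentException stands in for the driver's own exception type.
            throw new IllegalArgumentException("Value contains non-ASCII characters", e);
        }
    }
}
```

With this, `encodeStrict("abc")` returns a 3-byte buffer, while `encodeStrict("caf\u00e9")` throws rather than writing a question mark.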
I'm scheduling this for 4.8.0.
No pressure, but if you feel like opening a PR, contributions are welcome.
Sure, happy to.