Type codec for 'ascii' types replaces unicode with '?'

Description

Howdy,

After upgrading from the DataStax driver 3.8.0 to 4.7.2, we noticed that when an 'ascii' field is filled with non-ASCII data, the non-ASCII characters are replaced with question marks, instead of an exception being thrown as in the 3.x series.

This seems to come from StringCodec.java line 66: https://github.com/datastax/java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/type/codec/StringCodec.java#L66 where String.getBytes() is used. Per the docs of this method, the replacement is expected:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The CharsetEncoder class should be used when more control over the encoding process is required.

Although this does indeed reduce the number of errors thrown by the code, I would argue that silently altering data is not the ideal behavior for a database driver.

Could the behavior from 3.x perhaps be put back, to avoid data corruption?
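
For illustration, here is a minimal standalone sketch of the replacement behavior described above. It does not use the driver at all; the class and method names are made up for this example, and it only assumes that the codec ends up calling String.getBytes() with the US-ASCII charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the driver.
class AsciiReplacementDemo {

  // Encodes with String.getBytes, which silently substitutes
  // unmappable characters with the charset's replacement bytes.
  static byte[] lossyEncode(String value) {
    return value.getBytes(StandardCharsets.US_ASCII);
  }

  public static void main(String[] args) {
    // "caf\u00e9" is "café"; the escape avoids source-encoding issues.
    byte[] encoded = lossyEncode("caf\u00e9");
    // 'é' has no US-ASCII representation and is replaced by '?' (0x3F).
    System.out.println(new String(encoded, StandardCharsets.US_ASCII)); // prints "caf?"
  }
}
```

No exception is raised at any point, so the caller has no way to tell that the stored value differs from the input.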

Environment

None

Pull Requests

None

Activity

Olivier Michallat
July 1, 2020, 3:40 PM

Thanks for catching this, I agree that we should keep the same behavior as driver 3.

The legacy codec used a regex, but CharsetEncoder as suggested in the docs might be a tad cleaner.
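
As a rough sketch of what the CharsetEncoder approach could look like (the class and method names here are hypothetical and this is not the actual driver code), an encoder configured with CodingErrorAction.REPORT fails on non-ASCII input instead of substituting '?':

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the driver.
class StrictAsciiEncoder {

  // Encodes the value as US-ASCII, rejecting anything that cannot be represented.
  static ByteBuffer encodeStrict(String value) {
    // Encoders are not thread-safe, so a fresh one is created per call.
    CharsetEncoder encoder =
        StandardCharsets.US_ASCII
            .newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      return encoder.encode(CharBuffer.wrap(value));
    } catch (CharacterCodingException e) {
      // The exception type is an assumption; any unchecked exception would do here.
      throw new IllegalArgumentException("Value is not valid US-ASCII: " + value, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(encodeStrict("hello").remaining()); // prints 5
    try {
      encodeStrict("caf\u00e9"); // "café"
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Creating an encoder per call keeps the sketch simple; a real codec might prefer to reuse encoders per thread to limit allocations.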

I'm scheduling this for 4.8.0.
No pressure, but if you feel like opening a PR, contributions are welcome.

Tom van der Woerdt
July 1, 2020, 5:07 PM

Sure, happy to.

Fixed

Assignee

Alexandre Dutra

Reporter

Tom van der Woerdt

Labels

PM Priority

None

Reproduced in

None

Affects versions

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Priority

Major