Add metadata.schema.ignored-keyspaces option, and ignore all system keyspaces by default

Description

The token map is often a source of memory issues with large clusters. The refreshed-keyspaces configuration option can help, but it's opt-in: the driver loads everything by default. It would be better to take a more proactive approach.

It's probably safe to assume that applications don't usually need metadata about system keyspaces. So one thing we could do is ignore them by default. We need a new option that is the opposite of refreshed-keyspaces, proposed name: advanced.metadata.schema.ignored-keyspaces. By default it should be set to every system keyspace in Cassandra and DSE.
If a keyspace is present in both refreshed-keyspaces and ignored-keyspaces, we should include it, but log a warning.
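
The precedence rule above could look roughly like the following sketch. The class and method names are hypothetical (nothing here is existing driver API), warnings are collected in a list as a stand-in for a real logger, and an empty include list is assumed to mean "load everything".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of the driver: decides whether a keyspace's metadata is kept. */
final class KeyspaceSelection {
  private final Set<String> refreshed; // values of refreshed-keyspaces (assumed: empty = everything)
  private final Set<String> ignored; // values of the proposed ignored-keyspaces
  private final List<String> warnings = new ArrayList<>(); // stand-in for a real logger

  KeyspaceSelection(Set<String> refreshed, Set<String> ignored) {
    this.refreshed = refreshed;
    this.ignored = ignored;
    // Detect conflicts once, at configuration time, rather than on every refresh.
    for (String keyspace : refreshed) {
      if (ignored.contains(keyspace)) {
        warnings.add("Keyspace " + keyspace + " is both refreshed and ignored; including it");
      }
    }
  }

  boolean isIncluded(String keyspace) {
    if (refreshed.contains(keyspace)) {
      return true; // an explicit include wins over an exclude
    }
    if (ignored.contains(keyspace)) {
      return false;
    }
    return refreshed.isEmpty(); // no include list: everything not ignored is loaded
  }

  List<String> warnings() {
    return warnings;
  }
}
```

With refreshed = {system_auth} and ignored = {system, system_auth}, this sketch keeps system_auth, drops system, and records one warning.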

From an implementation perspective, I don't think we can handle excludes in the WHERE clause like we do for includes. But we can filter on the client side, possibly in CassandraSchemaRows.Builder.
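
To illustrate the difference between the two approaches, here is a minimal sketch. It assumes includes are pushed into the query as an IN restriction on keyspace_name, and that excludes are applied to the fetched rows afterwards. Rows are simplified to plain keyspace names, CassandraSchemaRows.Builder is not modeled, and all names below are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch: server-side includes versus client-side excludes. */
final class SchemaQuerySketch {
  private static final String BASE = "SELECT * FROM system_schema.keyspaces";

  /** Includes: restrict the query itself, so unwanted rows are never fetched. */
  static String keyspacesQuery(List<String> refreshed) {
    if (refreshed.isEmpty()) {
      return BASE; // no include list: fetch every keyspace
    }
    String names =
        refreshed.stream()
            .map(ks -> "'" + ks.replace("'", "''") + "'") // escape single quotes for CQL
            .collect(Collectors.joining(","));
    return BASE + " WHERE keyspace_name IN (" + names + ")";
  }

  /** Excludes: rows are fetched first, then dropped on the client. */
  static List<String> dropIgnored(List<String> fetchedKeyspaceNames, Set<String> ignored) {
    return fetchedKeyspaceNames.stream()
        .filter(name -> !ignored.contains(name))
        .collect(Collectors.toList());
  }
}
```

The trade-off is that client-side filtering still transfers the ignored rows over the network; only the in-memory metadata shrinks.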

Environment

None

Pull Requests

None

Activity

Alexandre Dutra
August 26, 2020, 10:19 AM

By default it should be set to every system keyspace in Cassandra and DSE.

The exact contents and names of system keyspaces evolved across C* versions (e.g. system_schema appeared in C* 3.0).

I think it would be safer and more future-proof to introduce the ability to filter by regular expressions, that is, ignored-keyspaces = [ "system.*" ].

I’d also add that ability to refreshed-keyspaces for consistency. However, in this case I guess all the filtering will have to be done client-side.

Olivier Michallat
August 27, 2020, 7:17 PM

It's an exclusion, we can add all the names that have ever existed.

Alexandre Dutra
August 28, 2020, 8:29 AM

That wouldn’t future-proof the token map against more system keyspaces that could be added in the future, would it?

Olivier Michallat
August 28, 2020, 9:47 PM

No. One risk is that the pattern could be too eager, but with something like system, system_.* and dse_.* it should be pretty safe.
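
A quick way to check how eager these patterns are, as a sketch: the class name and the sample keyspace names are made up for illustration, and whole-name matching is an assumption about how the option would treat its patterns.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: the three patterns suggested above, applied with whole-name matching. */
final class IgnoredKeyspacePatterns {
  static final List<Pattern> DEFAULTS =
      List.of(
          Pattern.compile("system"),
          Pattern.compile("system_.*"),
          Pattern.compile("dse_.*"));

  static boolean isIgnored(String keyspace) {
    for (Pattern pattern : DEFAULTS) {
      // matches() succeeds only if the pattern covers the entire keyspace name
      if (pattern.matcher(keyspace).matches()) {
        return true;
      }
    }
    return false;
  }
}
```

Under these assumptions, system, system_schema and dse_perf are ignored, while a user keyspace named systematic is kept. The broader system.* pattern would have matched systematic as well, which is the kind of over-eagerness the underscore avoids.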

For refreshed-keyspaces I'd like to keep the server-side filtering in order to avoid fetching too much data.

Olivier Michallat
September 1, 2020, 11:20 PM

For refreshed-keyspaces I'd like to keep the server-side filtering in order to avoid fetching too much data

Hrmm I don't like the asymmetry of having patterns on one side and not the other... OK, I'll allow it, but I'll add a recommendation in the docs to prefer name inclusions if possible.

Assignee

Olivier Michallat

Reporter

Olivier Michallat

Labels

None

PM Priority

None

Affects versions

None

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Epic Link

Priority

Minor