With the fixes to paging / fetchSize in 2.0.7, I thought I would try using the prefetch logic by adding a sequence like the following wherever the result set might be large and benefit from paging:
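A minimal sketch of such a sequence, assuming the driver's `ResultSet` methods `getAvailableWithoutFetching()`, `isFullyFetched()`, and `fetchMoreResults()`. The `PagedRows` interface and the `Prefetch.maybePrefetch` helper are hypothetical stand-ins so that the sketch compiles without the driver; they are not part of the driver API.

```java
/** Hypothetical stand-in for the three ResultSet methods used by the prefetch check. */
interface PagedRows {
    int getAvailableWithoutFetching();
    boolean isFullyFetched();
    void fetchMoreResults();
}

final class Prefetch {
    /**
     * Asks for the next page when exactly prefetchLimit rows remain buffered.
     * A limit of -1 never matches (the buffered count is never negative),
     * so it disables prefetching. Returns true if a fetch was requested.
     */
    static boolean maybePrefetch(PagedRows rs, int prefetchLimit) {
        if (rs.getAvailableWithoutFetching() == prefetchLimit && !rs.isFullyFetched()) {
            rs.fetchMoreResults();
            return true;
        }
        return false;
    }
}
```

The check would run once per row consumed, before reading the next row from the result set.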
Having done this, I find that when I read all the rows of a logical row spanning partitions, the query intermittently stops without returning all of them.
My schema is essentially the same as described before in CASSANDRA-6825 and CASSANDRA-6826. The difference from CASSANDRA-6825 lies in how I distributed the rows over the partitions. In CASSANDRA-6825, I was writing the rows in long runs of 10,000 in a partition, before choosing a partition for the next run. In this case, I randomly assigned each row to a partition by taking the first four hex digits of the ec hash modulo the number of partitions, 7. The query is the same as that described in CASSANDRA-6826:
SELECT ec, ea, rd FROM sr WHERE s = ? AND partition IN ? AND l = ? ALLOW FILTERING;
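The partition assignment described above (first four hex digits of the ec hash, modulo 7) can be sketched as follows. The class and method names are hypothetical, and the sketch assumes the ec hash is available as a hex string of at least four characters.

```java
final class PartitionAssigner {
    /** Number of partitions used in the test described above. */
    static final int NUM_PARTITIONS = 7;

    /** Maps a hex-encoded ec hash to a partition in the range 0..6. */
    static int partitionFor(String ecHashHex) {
        // Interpret the first four hex digits as an unsigned 16-bit value.
        int prefix = Integer.parseInt(ecHashHex.substring(0, 4), 16);
        return prefix % NUM_PARTITIONS;
    }
}
```

Because the hash prefix is effectively random, consecutive rows land in different partitions rather than in long runs within one partition.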
This problem is very intermittent. I expanded my JUnit test program to run this test multiple times. The query to read all the rows could fail to return the correct number in any one of the tests (the first, the second, or the third) while returning the correct results in the others. It could run successfully all three times, but then later fail all three times, even when using the same test seed and generating identical test data.
If I set the prefetch limit very low, i.e., issue the query for more rows only when we are near the end of the current page (say, when only one row remains), I don't see a failure. If I set the limit higher, typically so that the prefetch is issued when 100 rows remain, I can see the failing behavior.
This is not limited to my dual-core laptop. I was able to provoke the same failing behavior on an 8-core desktop system by setting the prefetch point at 500 rows while leaving the query fetch size at 1000 rows.
The only thing that is certain is that if I disable prefetching by setting the limit to -1, the correct counts appear.
To drill down on this problem, I added some code to the test validation to check each row in order, to determine exactly when a row was dropped instead of just checking the total count at the end. What I found was that the first dropped rows appeared in partition 0. In the one case I analyzed in detail, the returned row skipped ahead 759 rows in the partition. The expected and actual rows were in the same date range, so I could actually do a select count covering the two endpoints, and it showed 760 rows inclusive. So it appears that, at random points in the ResultSet, we skip ahead a bunch of rows.
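The row-by-row validation described above can be sketched roughly as follows. This is an illustration rather than the actual test code: `OrderedCheck` and `firstMismatch` are hypothetical names, and rows are assumed to be comparable with `equals`.

```java
import java.util.Iterator;
import java.util.List;

final class OrderedCheck {
    /**
     * Walks the expected rows in order alongside the rows actually returned.
     * Returns the index of the first expected row that is missing or different,
     * the expected size if extra rows were returned, or -1 if both match exactly.
     */
    static <T> int firstMismatch(List<T> expected, Iterator<T> actual) {
        int index = 0;
        for (T want : expected) {
            if (!actual.hasNext() || !want.equals(actual.next())) {
                return index;
            }
            index++;
        }
        return actual.hasNext() ? index : -1;
    }
}
```

Checking in order pinpoints where the result set first diverges, which a total count at the end cannot do.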
DataStax Cassandra 2.0.7
Single node cluster on a dual-core Windows laptop
it appears this version re-introduces a problem like the one we just solved in
Hum, I should have been a bit more careful here. Committed the fix, thanks.
I don't think that currentPage needs to be declared as volatile
Right, not sure why I did that.
Two minor issues
Thanks, and apologies for butchering the English language
define a private fetchMoreResults that accepts the same fetchState that was already fetched, so that it doesn't see a state inconsistent with the decisions already taken
That makes sense, committed that.
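The idea of passing along the state that was already read can be illustrated with a small sketch. It is not the committed driver code: `PagingCursor`, `FetchState`, and the compare-and-set guard are assumptions made to keep the example self-contained.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class PagingCursor {
    /** Immutable snapshot of paging progress (hypothetical). */
    static final class FetchState {
        final int nextPage;
        final boolean exhausted;
        FetchState(int nextPage, boolean exhausted) {
            this.nextPage = nextPage;
            this.exhausted = exhausted;
        }
    }

    private final AtomicReference<FetchState> fetchState =
            new AtomicReference<>(new FetchState(0, false));
    final AtomicInteger requestsIssued = new AtomicInteger();

    /** Public entry point: reads the state exactly once, then delegates. */
    boolean fetchMoreResults() {
        return fetchMoreResults(fetchState.get());
    }

    /**
     * Acts only on the snapshot the caller already inspected and never re-reads
     * the shared field, so its decision cannot disagree with the caller's.
     */
    private boolean fetchMoreResults(FetchState seen) {
        if (seen.exhausted) {
            return false;
        }
        FetchState advanced = new FetchState(seen.nextPage + 1, false);
        // Assumption: if another thread changed the state in the meantime, give up.
        if (!fetchState.compareAndSet(seen, advanced)) {
            return false;
        }
        requestsIssued.incrementAndGet(); // stands in for sending the page request
        return true;
    }

    /** Called when the server reports that no further pages exist. */
    void markExhausted() {
        fetchState.set(new FetchState(fetchState.get().nextPage, true));
    }
}
```

In this sketch, repeated calls can still request several pages ahead of the one being consumed; only an exhausted or concurrently changed state stops a request.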
so we would never issue more than one prefetch before finishing the currentPage
But we don't want that. The fact that the API allows a determined client to fetch N pages in advance without having started to consume the first one is on purpose. To be clear, I'm not saying it's the most useful thing ever, but I do think it's nice to have this ability and I'd really rather not remove it.
I understand, and I don't disagree. I just mentioned the double buffering for completeness. But it would lose this capability to gradually ramp up the number of pages pre-fetched without exhausting the first page. And it would introduce different issues of ensuring that the (pages, fetchState, info) tuple was updated in parallel in a consistent state, which would undoubtedly be uglier than the revision you made.
With today's changes, I re-ran the originally failing test more than a dozen times with no failures. On the faster desktop system, I let it randomly choose the seed and increased the run length from 1 million to 2 million and then 10 million inserts and reads, still with no failures.
Great! Thanks a lot for the report, review and testing!