An alternative to cursor pagination

atombender · 2024-05-01T21:43:25

The "Merkle tree" algorithm here isn't using a Merkle tree, it's just a binary partitioning algorithm. The point of a Merkle tree is that it's a tree of hashes. Also, it doesn't really solve the consistency problem the author claims is the biggest problem; yes, over time it will correct for Elasticsearch's eventual consistency, but in the short run it's just as bad as pagination.

I don't know the author's application, but I question the desire to get a consistent dump from Elasticsearch in the first place. It is very not much intended to be a "source of truth", so you're better off streaming the data from your original data source, which is presumably something like an SQL database.

That said, if you want a stable snapshot of an entire index — where your requirement is to not ever miss documents due to concurrent updates — then you can use Elasticsearch's snapshot support. Each snapshot is just that, a read-only snapshot of the data, allowing consistent reads.

The eventual consistency problem that the article describes is solved by refreshing the index. You can use "refresh=wait_for" when doing an update in order to wait for Elasticsearch to make the update searchable. You can also force a refresh. Any subsequent query will return the newest indexed data.

Since 6.x, Elasticsearch has had docvalue pagination via "search_after", which allows pagination without a durable cursor. Each cursor value is the docvalue set of the last seen document. This is consistent insofar as the set of source documents is consistent (so it's not safe against concurrent updates). There's essentially little need to use "_scroll" or offset-based pagination anymore.

dan-robertson · 2024-05-01T22:54:05

I think the author was using the search_after strategy (or something roughly equivalent). They were using ‘cursor’ in the generic sense as a place in the list of documents, rather than the specific durable cursors api offered by elasticsearch

ramsicandra · 2024-05-03T07:48:29

I think this is a fair criticism. The Merkle Tree here isn't used. I was just inspired by the diagram and come up with binary partitioning solution.

In terms of performance, it's fair to say that this binary partitioning algo is slightly worse than a cursor / search after pagination since there is an overhead of checking count while the cursor pagination does not need to.

Hmm, It never cross my mind to change/correct the design of using ES as a primary data source. My guess now is it would take as much effort or higher to migrate between ES -> SQL compared to ES -> ES.

I think the snapshot approach is interesting. If I had to start over, I'll most likely explore that.

reply

gregors · 2024-05-02T00:20:20

ok, I obviously don't know all the issues this team is dealing with, but it seems to be the goal of what they were doing was this...take the base data in one ES instance and split it out into several ES instances.

The way do do that is to go back to the data of record and repopulate the different ES instances (or new datastore). I can't strongly enough suggest people do not use ES as a database of record.

RubyConf 2019 - How to lose 50 Million Records in 5 minutes by Jon Druse

https://www.youtube.com/watch?v=Qbxmf_TxA-s

reply

wiml · 2024-05-01T21:38:17

The Merkle-tree approach reminds me of "set reconciliation" protocols — the author might be interested in references 3 and 5 of this paper: https://www.scs.stanford.edu/17au-cs244b/labs/projects/rucke...

jiggawatts · 2024-05-01T22:30:33

This is a generic problem across a vast range of technologies: how to handle scale out replication with delays and have a consistent “streaming” experience.

The main issue is that most query languages like SQL predate HTTP and haven’t been updated with the same concepts such as ETag headers and cookies.

Every database request should include a transaction consistency header that contains a vector clock of logical replication status. Every response should include cache control headers and an updated vector clock. This should then be encrypted and bubbled up through the HTTP pipeline to the end clients.

E.g.: a client should able to request a “snapshot no older then T minutes” for a report run and be able to return a cookie that ensures that subsequent queries remain consistent with that. Most databases can do this… for a single TCP socket connection only. They’ll lose track of the context if the socket is closed or if multiple servers connect from a web farm.

With an approach like this, any cache or any replica could be utilised for any query in a safe and consistent manner.

Essentially, this would be fixing an impedance mismatch between a stateful connection-oriented system and a stateless request-response system.

wbl · 2024-05-02T02:07:38

I don't think that works. First off the server doesn't know when the client goes away so has to hold the snapshot forever. Secondly you double your replication latency and have extreme sensitivity to stragglers, unless you take the risk of hitting an unavailable cache.

jiggawatts · 2024-05-02T06:02:59

The database engine would have to be designed for this, and/or the clients could request the level of consistency that they require. Most apps only care about “this age or newer”, except when paging through a data set where consistency matters.

E.g.: indexes can include the timestamp and then queries can filter out new rows implicitly. Physically this can be implemented with tiered indexes where the topmost layer is in-memory only and queries older than what it contains are rejected. The on-disk indexes then don’t need old row versions or timestamps.

reply

asadawadia · 2024-05-01T23:43:15

nice very clean

never thought of bubbling up the DB Transaction ID up to the cookie layer

shoelessone · 2024-05-02T02:42:20

Can somebody help me understand as a casual Elasticsearch (and honestly, more OpenSearch) user how so much data inside of Elasticsearch ends up not becoming the "database"?

I've always understood you're not supposed to treat ES/OS as a database, so in my head I've always sort of thought of it as a cache. If I'm using it I need to be able to reindex (not sure if this is the correct term technically) all of my data from a "real" database, something (in my case) like a SQL database of some sort or another data store.

If I'm reading the article correctly it sounds like they are trying to get massive amounts of data out of ES, but it feels like that's something that would come from a "real" database. To put it another way, it seems like admin tasks that involve exporting data ES wouldn't be the right place to do that.

Now very clearly the author knows an entire world more about ES and I don't fully understand their use case anyway, so I'm hoping somebody might be able to help me understand if there is some gray area or uses for ES/OS that might make different use cases more appropriate?

reply

atombender · 2024-05-02T18:15:59

You're right. They're arguably running into these challenges because they're using Elasticsearch in a way it's not intended to be used.

moomoo11 · 2024-05-01T23:11:38

Interesting.

I've always opted for cursor pagination for client consumer APIs.

For anything "admin" or management type stuff, it runs into the same issues you mentioned. In that case, I'd used application level queues and it took a long time for things like exports and what not.

ruslandanilin · 2024-05-01T21:32:57

This is quite an interesting solution. Especially with DFS approach in mind. Thank you!