
Random feedback #4

Open
natevw opened this issue Nov 13, 2014 · 4 comments

natevw commented Nov 13, 2014

With the caveat that I still know very little about Datalog, here are some thoughts on going whole hog without needing tons of index space:

According to https://github.com/tonsky/datascript#project-status you need:

EAVT, AEVT and AVET indexes

From reading e.g. http://docs.datomic.com/query.html#sec-2, I understand these to be permutations of the Entity/Attribute/Value/Transaction concepts. In our case "entity" is basically doc._id; "attribute" is the name of a property within that doc (i.e. its key); "value" is the value of said property; and I'm beginning to suspect that "transaction" is just doc._local_seq.

[Within a single instance of a database, doc._local_seq stores a monotonically increasing number — the database "_changes" sequence at that point. So for the same _id/_rev in a replica it may be different, but locally it's kinda what you need. Although I bet your trouble's gonna be that the view is only going to include emits from the most recent/winning version…so if you go to use it, your data could actually disappear mid-execution, which is obviously the opposite of the point!]
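To make that mapping concrete, here's a quick sketch (purely illustrative on my part; eavtTuples is a made-up helper, not anything in the plugin) of the EAVT tuples a plain doc would boil down to:

function eavtTuples(doc) {
  // e.g. {_id: "bob", _rev: "1-abc", _local_seq: 7, age: 42, city: "Fargo"}
  // yields ["bob", "age", 42, 7] and ["bob", "city", "Fargo", 7]
  return Object.keys(doc).filter(function (key) {
    return key[0] !== '_';                             // skip _id/_rev/_local_seq themselves
  }).map(function (key) {
    return [doc._id, key, doc[key], doc._local_seq];   // [E, A, V, T]
  });
}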

Code review

So anyway, putting aside transactions for a bit (I have a few more thoughts on those, maybe lower down ;-), let's review the code at https://github.com/dahjelle/pouch-datalog/blob/c9ae7f421c93d3f5b5ac64681fca5fe07b567912/index.js#L21:

callback(data.rows.map(function (el) {
  return el.value;
}));

Clean up emits

For starters, why not just drop the emit(tuple, tuple) silliness and allow just emit(tuple), via a simple change here:

callback(data.rows.map(function (row) {
  return row.key;
}));

Right?
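(For illustration, the map side would then be just the following; the e/a/v field names are my guess at how the tuple docs are shaped, not necessarily the plugin's actual schema.)

function map(doc) {
  // before: emit([doc.e, doc.a, doc.v], [doc.e, doc.a, doc.v]);
  emit([doc.e, doc.a, doc.v]);   // the key alone carries the tuple; no value needed
}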

Avoid redundant _id

For your AVE view you really just need to emit([k,v]). This will save a bit of redundant space usage. Your query connector can fix it back up without much trouble:

callback(data.rows.map(function (row) {
  if (index === 'ave') row.key.push(row.id);
  return row.key;
}));

Note that [since we've left transactions out — this is actually the AVET index!] this doesn't change the sort order or anything — internally, when you emit(k, v), CouchDB sort of stores/sorts that as [k, id] -> v; for array keys it's like you have one more item at the end that you can hop to via ?startkey_docid.
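Concretely, a raw CouchDB request against that view, resuming within a run of identical [A, V] keys, would look something like this (ddoc/view names and ids made up, URL-encoding elided; I'm not sure off-hand whether PouchDB's query() passes startkey_docid through):

GET /mydb/_design/datalog/_view/ave?startkey=["name","Bob"]&endkey=["name","Bob"]&startkey_docid=e17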

Getting rid of views, pt. 1

I'm of a mixed mind on this one, but for completeness: the EAV[T] index is sort of redundant with the built-in _all_docs one, assuming it always just gets called for a single E+A lookup. What I mean is that you would do something like this pseudocode in the scope above the code we've been reviewing:

if (index === 'eav') db.get(E).then(function (doc) { callback([[doc._id, A, doc[A]]]); });
else /* code above, though it could now be simplified to always just push id onto the key */;

That is, an index by E is basically just what CouchDB has underlying db.get (aka db.allDocs if you do need a range of E, which seems unlikely in this case?) and then picking out the relevant keys.

Now, I'm of a mixed mind, because if your documents are large and you only want to fetch a single attribute across each, having this index will save I/O and parsing overhead — so it's a tradeoff between storing an additional copy of your database in the form of this index, versus better query performance. So keeping it is in the Couch tradition of optimizing for disk usage last.

Getting rid of views, alternate universe

So what's interesting about this, from the perspective of a user of this library, is that really my job is just to emit whatever [A, V[, T]] tuples I think are relevant from a document!

There are a couple of tricks that might be useful in this vein:

  • at least in CouchDB [assuming PouchDB too] you can CommonJS require() logic in your views
  • you can actually just emit the index type as the first element of the array (assuming you do keep the EAV index)
  • so you could actually host several "Datalog" stores within a single database (or even a single design document!) by doing all the emits for each from a single map function, perhaps even with a helper.

So you could do things like:

var emitTuple = require('…datalog helper…')(doc);
emitTuple('name', doc.firstName + ' ' + doc.lastName);

where (unbeknownst to the user) the helper would be something like:

module.exports = function (doc) {
  function emitTuple(k, v, t) { // t optional
    emit(['eavt', doc._id, k, v, t]);
    emit(['avet', k, v, doc._id, t]);   // or see simplification without id/t above, but the point is the caller is insulated from this anyway!
  }
  return emitTuple;
};

Or I can almost imagine even an inverted version where pouch-datalog provides the "real" map and the user provides the "tuple" one, but basically splitting the actual CouchDB details from the Datalog tuple concept.

There's kind of a tradeoff here, where you do insulate the user from the internal indexing details but now they have to copy-paste the right version of your code into their ddoc before using this plugin (or is Kanso not actually dead yet?)

Although! You could just add an initialization phase to your plugin? Something like:

db.configureDataquery(function (doc) { emitTuple(…); });
db.dataquery(…);
db.dataquery(…);
db.dataquery(…);

…and it would basically write/overwrite "_design/datalog" with something generated from, roughly:

function configureDataquery(tupleMapper) {
  var ddoc = {};
  function realMap(doc) { function emitTuple(k, v, t) { /* …as above… */ } (TUPLE_MAP_PLACEHOLDER)(doc); }
  ddoc.views = { index: { map: realMap.toString().replace('TUPLE_MAP_PLACEHOLDER', tupleMapper.toString()) } };
  ddoc._id = "_design/datalog";
  //ddoc.options = { local_seq: true };
  return db.put(ddoc);
}
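…which, for concreteness, would leave a design doc in the database looking roughly like this (the map string being the spliced-together source, abbreviated here):

{
  "_id": "_design/datalog",
  "views": {
    "index": {
      "map": "function realMap(doc) { function emitTuple(k, v, t) { /* …as above… */ } (function (doc) { emitTuple(…); })(doc); }"
    }
  }
}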

Okay, in the time I spent explaining this I could have experimented with this for real, but at least now you know what I think you should do ;-)

One more thing

Oh, on the transaction thing, which I think is kind of cool (although I also agree it makes sense to at least leave it optional): basically you tell users (or give them a helper method…) that instead of saving over top of the old doc, they just post a new doc each time it changes. So instead of having a replacing sequence (pretend it's a changes feed):

{_id:"mydoc", _rev:"1-abc", _local_seq:1, …}
{_id:"mydoc", _rev:"2-cba", _local_seq:2, …}
{_id:"mydoc", _rev:"3-acb", _local_seq:3, …}

you store:

{_id:"mydoc@after@0", _rev:"1-abcd", _local_seq:1, …}
{_id:"mydoc@after@1-abcd", _rev:"1-cbad", _local_seq:2, …}
{_id:"mydoc@after@2-cbad", _rev:"1-acbd", _local_seq:3, …}

You'll still get conflict detection, since the ids are deterministically based on the same MVCC token, but now your emitTuple can use doc._id.split("@after@")[0] as E and doc._local_seq as T (and the provided k/v as A and V, natch), and voilà: your user gets Datalog even though their ops team gave them CouchDB!
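If you wanted to hand users a helper for that write pattern, a minimal sketch could be something like this (saveVersion is a made-up name, not part of pouch-datalog):

function saveVersion(db, baseId, prevRev, fields) {
  var doc = {};
  for (var k in fields) {
    if (fields.hasOwnProperty(k)) doc[k] = fields[k];   // copy the new state
  }
  doc._id = baseId + '@after@' + (prevRev || '0');      // id derives from the previous rev
  return db.put(doc).then(function (res) {
    return res.rev;   // feed this back in as prevRev when the doc next changes
  });
}
// usage: saveVersion(db, 'mydoc', null, {name: 'Bob'}).then(function (rev) { /* keep rev for the next change */ });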


natevw commented Nov 13, 2014

A tab I found open: https://groups.google.com/forum/#!topic/datomic/c9ZGWHLqTMY (Datomic already supports at least Couchbase as a backing store? But of course a JS implementation of what is striking me as a cool-but-compact query language is a sweet project!)


natevw commented Nov 13, 2014

So the more I think about it, and read the Datomic docs, the more I think this ultimately should become both a PouchDB and a CouchDB plugin, the way geo is! At that level, the T stuff could be handled properly and invisibly — i.e. when you catch up your datalog index(es) with recent primary storage changes, you would just keep the old emits instead of removing them like the MapReduce view engine does. The best part then is that you don't need any special write policy, and the resulting main set of documents can still be used for other indexes too (MapReduce/spatial/fulltext…).


kxepal commented Nov 14, 2014

@natevw ohai! Could you throw the most interesting ideas you've found over to the CouchDB dev@ ML?

dahjelle (Member) commented

@natevw Thank you so much for all your thoughts! It's gonna take me a while to digest all of it, but my (light) first pass yielded a few immediate questions:

  1. I'm very interested in having a plugin for CouchDB and PouchDB…I'd love to see some more info on this if you have it. I've never looked into CouchDB plugins, and I'm especially curious as to what level the two can share code. Do you have a link to this geo plugin? I found the CouchDB plugin, but not a PouchDB one…
  2. Yes, Datomic does support Couchbase as a backing store (along with several others). Definitely an option we considered but didn't end up going with.
  3. I didn't even think about keeping all the indexes in a single view prefixed by 'ave' or whatever. That's interesting! Are there performance differences you are aware of if you do this?
  4. I definitely think there are some options to make this more drop-in friendly, though I want to keep the option to index however you want ('cause I think that's one of the strengths of, well, CouchDB in general). For instance, I'm actually storing tuples in my CouchDB, whereas most people will be storing complete docs. But I don't see any reason not to default to indexing normal docs, with options to override it, and I definitely agree some helper functions would improve the drop-in-ability.
  5. As far as transactions…I'd have to do some more reading about that part myself. I see what you are saying…though wouldn't compaction eventually remove some of the historical data anyway, negating the purpose? Not to mention it gets really weird when things don't match across replicas… Maybe I'm saying that I don't anticipate using that functionality anyway, but perhaps it is more generally useful than I expect?
