February 17, 2016

Cloudant and CouchDB: Little Docs, Big Thoughts

Especially for developers like me who were brought up on RDBMSs, the first question that comes to mind with document-based data stores like Cloudant, or its laid-back self-hosted cousin CouchDB, is: how do I structure my data?

I'm learning more and more that the answer is not always clear cut and actually changes over time. Especially here, though, the old adage applies:

Start simple, then make it complicated!

Consider the following scenario:

You have been tasked with writing a web-app that houses a collection of real-life robots. Each robot can have one or more capabilities. These capabilities come in a handful of varieties, each of which can be registered with the web-app.

Start Simple

When it comes to document-based data stores, the simplest way to store data is such that each document represents a primary entity from the problem domain. From the description, the primary entity is a robot, so having one document per robot would be the simplest way.

e.g.

{
    "type": "robot",
    "id": "some-universally-unique-id",
    "name": "Optimus",
    "description": "A robot full of snazziness.",
    "owner": "[email protected]",
    "capability_instances": [
        {
            "type": "transform-into-a-vehicle",
            "description": "Make a sound and change into a vehicle",
            "instance_id": "some-universally-unique-instance-id",
            ...
            ..
            .
        },
        ...
        ..
        .
    ]
}

How About Updates?

Updating a robot—for example, changing a robot's description—is a matter of updating its document in the Cloudant way.

In other words:

- Fetch the latest.
  - On error, bubble the error.
  - On success, update the description of what was fetched.
  - Insert fetched-and-modified document.
    - On error:
        - If it's a conflict, do this all again.
        - Otherwise, bubble the error.
    - On success, bubble success.
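
The steps above can be sketched roughly as follows. This is a minimal sketch rather than any particular client library's API: the `db.get` and `db.insert` methods, the `statusCode` property on errors, and the `maxAttempts` cap are all assumptions made for illustration (the list above retries without a limit).

```javascript
// Hypothetical minimal client interface: db.get(id) resolves to a document,
// and db.insert(doc) rejects with err.statusCode === 409 on a conflict.
// These names are assumptions for illustration, not a specific library's API.
async function updateDescription(db, id, description, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Fetch the latest revision; any error here bubbles up to the caller.
    const doc = await db.get(id);
    doc.description = description;
    try {
      // Insert the fetched-and-modified document.
      return await db.insert(doc);
    } catch (err) {
      // Only a conflict is worth retrying; bubble everything else.
      if (err.statusCode !== 409) throw err;
    }
  }
  throw new Error('gave up after ' + maxAttempts + ' conflicting attempts');
}
```

Capping the retries is a design choice on my part: under heavy contention an unbounded loop could spin for a long time, so the sketch gives up and lets the caller decide what to do.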

But isn't this slow?

In practice, not really, but there is a lot of cross talk here. For predictable operations like updating a known field of a document, update handlers can be used to avoid the need to fetch-a-doc-to-update-a-doc.

Here is an example of an update handler:

{
    ...
    ...
    "updates": {
        "update-robot-description":  "function (doc, req) {
            var resp = {
                headers: {
                    'content-type': 'text/plain'
                },
                code: 200,
                body: 'ok'
            };

            /* Only existing robot documents may be updated. */
            if (!doc || doc.type !== 'robot') {
                resp.code = 404;
                resp.body = 'not-found';
                return [ null, resp ];
            }

            /* req.form holds the decoded form-encoded request body. */
            doc.description = req.form.description;
            return [ doc, resp ];
        }"
    }
    ...
    ...
}

With this update handler, the web-app can now update robot descriptions by sending a POST to /_design/DESIGN/_update/update-robot-description/SOME_DOC_ID (relative to the database) with the description field set in a form-encoded request body.

Views

Using Cloudant's built-in APIs, we can fetch the document for any robot; all we need is its id. But what if we wanted to fetch a document by other attributes, like its name?

The answer is a View.

Without going too much into the topic, a view is a special list of imaginary documents, each derived (via a function you provide) from a single real document. They're imaginary in that they're actually derived documents—again, from a single real document.

So, in theory, a view could safely disappear and there would be no real data loss. But don't worry: Views don't usually disappear. In fact, they're automatically generated and efficiently updated whenever any document changes.
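
As a rough illustration, here is what the function behind a view keyed on robot name might look like. The view name "robots-by-name" and the local `emit` stand-in are assumptions for this sketch; in the database, `emit` is supplied for you and each emitted row also carries the id of the document it came from.

```javascript
// Local stand-in for the emit() that CouchDB/Cloudant provides to map
// functions, so the sketch can be tried outside the database.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function for a hypothetical "robots-by-name" view. It is called once
// per document; each emit() adds one row (one "imaginary document").
function robotsByName(doc) {
  if (doc.type === 'robot' && doc.name) {
    emit(doc.name, null); // key: the robot's name
  }
}
```

With a view like this in a design document, fetching robots named Optimus would be a GET to /_design/DESIGN/_view/robots-by-name?key="Optimus" (relative to the database).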

So, with views you can have Cloudant efficiently prepare answers to questions about your documents, such as "which robots have this name?"

For more info on views, see this wiki page.

Queries

Views are great, but defining them is kind of difficult. Also, views spawn documents that map a key (e.g. the id of an instance of a type of capability) to a value (e.g. the robot that the capability instance is in). These get very complicated if you need to ask questions that involve more than one field of a document.

For example, "Give me the robots that have the name Optimus and a capability of type transform-into-a-transport-truck" is a difficult question to answer with views. Also, a view would need to be created ahead of time for each variation on this type of question: changes to the names or number of fields involved in the question will result in changes to the view code.

For situations where it's impractical to create a view ahead of time, Cloudant provides a feature called Cloudant Query. Similar to preparing a view, preparing a query involves creating an index; instead of writing a function, though, you tell Cloudant which fields to index, and it generates zero or more key:value mappings for each document. The important difference between a query and a view is that these key:value mappings are automatically fed into a query engine, which uses them to answer more complex questions.

Do note, though, that like views, each of these mappings derives from a single document.
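
To make the multi-field question above a little more concrete, here is a sketch of the request bodies it might translate to with Cloudant Query. The index name and the helper function are made up for illustration, and the document shape assumed is the big robot document from earlier, with its capability_instances array.

```javascript
// Index definition: the kind of body one would POST to the database's
// _index endpoint. The index name is hypothetical.
const robotIndex = {
  index: { fields: ['type', 'name'] },
  name: 'robots-by-type-and-name',
  type: 'json'
};

// Builds the kind of body one would POST to the database's _find endpoint:
// robots with a given name that have at least one capability instance of
// the given type.
function robotsNamedWithCapability(name, capabilityType) {
  return {
    selector: {
      type: 'robot',
      name: name,
      capability_instances: { $elemMatch: { type: capabilityType } }
    }
  };
}
```

Changing the question, say to a different name or capability type, only changes the arguments; no view code has to be rewritten.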

For more info on Cloudant Query, see this post.

Big Docs are Simple and Reads are FAST!

So there you have it: The big-doc extreme of document based data stores like Cloudant.

In summary:

Big Docs Are Hard to Keep DRY

In spite of all of those benefits, big documents aren't all sunshine and roses. Remember that document for that robot? Let's make a few small tweaks to it and add a second robot:

// Added a new capability, speech, to Optimus
// Added a new robot, Bumblebee, who can transform but
//   cannot speak.

{
    "type": "robot",
    "id": "some-universally-unique-id",
    "name": "Optimus",
    "description": "A robot full of snazziness.",
    "owner": "[email protected]",
    "capability_instances": [
        {
            "type": "transform-into-a-vehicle",
            "description": "Make a signature sound and change into a vehicle",
            "instance_id": "some-universally-unique-id"
        },
        {
            "type": "speech",
            "description": "Speak in a human language",
            "instance_id": "some-universally-unique-instance-id"
        }
    ]
},
{
    "type": "robot",
    "id": "some-other-universally-unique-id",
    "name": "Bumblebee",
    "description": "A robot that can't talk.",
    "owner": "[email protected]",
    "capability_instances": [
        {
            "type": "transform-into-a-vehicle",
            "description": "Make a signature sound and change into a vehicle",
            "instance_id": "some-universally-unique-instance-id"
        }
    ]
}

You may notice issue number one when documents are too big: repeated data. I guess this isn't an issue in and of itself (it's only a few bytes of repeated data), but what if we weren't sure what we wanted to store in a capability_instance?

For example, what if we wanted to add a link to a video demonstrating the type of capability? We'd need to:

- For each capability type:
  - For each robot that contains this capability type:
    - Fetch the robot.
    - Add a new "video" field to the capability 
      instance(s) of this type.
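
For a feel of what that loop involves, here is a sketch of the in-memory part for one capability type. The function name and the `videoUrl` parameter are hypothetical, and the robots are assumed to have been fetched already; every robot it touches would still need to be written back, conflict handling and all.

```javascript
// Adds a "video" field to every capability instance of the given type.
// Returns how many robot documents were changed, i.e. how many would
// need to be written back to the database afterwards.
function addVideoToCapabilityType(robots, capabilityType, videoUrl) {
  let touched = 0;
  for (const robot of robots) {
    let changed = false;
    for (const instance of robot.capability_instances || []) {
      if (instance.type === capabilityType) {
        instance.video = videoUrl;
        changed = true;
      }
    }
    if (changed) touched++;
  }
  return touched;
}
```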

Further, what if we, for some odd reason, wanted to embed the video right into the database? The space usage would be massive as the same video of a robot transforming into a vehicle would be embedded everywhere.

Big Docs Mean More Conflicts

Another issue with big documents is that the losers during "concurrent" edits need to do the conflict dance. In and of itself, this isn't an issue; in fact, it's how Cloudant and CouchDB were designed. In practice, though, this conflict resolution flow ends up scaling poorly as the number of concurrent editors increases.

A concrete example:

What if our bread and butter (the flow our users ran 24/7, at 10,000 operations per minute) was trying out new types of capability_instances with a single robot? The more concurrent edits to its list of capability_instances, the more conflicts would result, and the slower each edit would appear to be.

A less obvious wrinkle:

What if someone wanted to just change the description of a robot during this period of heavy load? They'd see conflicts as well, and may be wondering what's taking so long: It's just the description.

The answer: little docs.

Little Docs Help Ensure DRY-ness

Consider the following alternate list of documents:

// Details on each "class of capabilities" have been
//  moved to separate documents.
{
    "type": "robot",
    "id": "some-universally-unique-id",
    "name": "Optimus",
    "description": "A robot full of snazziness.",
    "owner": "[email protected]",
    "capability_instances": [
        {
            "name": "transform-into-a-vehicle",
            "id": "some-universally-unique-instance-id"
        },
        {
            "name": "speech"
        }
    ]
},
{
    "type": "robot",
    "id": "some-other-universally-unique-id",
    "name": "Bumblebee",
    "description": "A robot that can't talk.",
    "owner": "[email protected]",
    "capability_instances": [
        {
            "name": "transform-into-a-vehicle",
            "id": "some-universally-unique-instance-id"
        }
    ]
},
// Separate documents ---v
//
{
    "type": "capability",
    "name": "transform-into-a-vehicle",
    "description": "Make a signature sound and change into a vehicle"
},
{
     "type": "capability",
     "name": "speech",
     "description": "Speak in a human language"
}

Here, you'll notice that I created a few additional documents with a different type field. There's actually nothing special about the word type. It could be sunshine_and_lollipops. Cloudant doesn't care what the field is called.

Whatever the field is named, it can come in handy for preparing queries on only documents that represent capabilities, or while creating views where the view's imaginary documents spawn only from robot documents.

Little Docs Help Reduce Conflicts

I know. The collection of docs above helps us not repeat ourselves, but what about our users' desires to constantly try out different combinations of capabilities on Optimus?

Consider the following:

// Capability instances have been entirely removed
//   from robots.  Now, adding a capability instance will never
//   result in a conflict.
{
    "type": "robot",
    "id": "1-some-universally-unique-id",
    "name": "Optimus",
    "description": "A robot full of snazziness.",
    "owner": "[email protected]"
},
{
    "type": "robot",
    "id": "2-some-other-universally-unique-id",
    "name": "Bumblebee",
    "description": "A robot that can't talk.",
    "owner": "[email protected]"
},
{
    "type": "capability",
    "name": "transform-into-a-vehicle",
    "description": "Make a signature sound and change into a vehicle"
},
{
     "type": "capability",
     "name": "speech",
     "description": "Speak in a human language"
},
{
      "type": "capability_instance",
      "name": "speech",
      "robot": "1-some-universally-unique-id",
      "id": "some-universally-unique-instance-id"
},
{
      "type": "capability_instance",
      "name": "transform-into-a-vehicle",
      "robot": "1-some-universally-unique-id",
      "id": "some-universally-unique-instance-id"
},
{
      "type": "capability_instance",
      "name": "transform-into-a-vehicle",
      "robot": "2-some-other-universally-unique-id",
      "id": "some-universally-unique-instance-id"
}

Here we have removed capability_instances from our robots entirely. Now, adding a capability to a robot will always succeed with no conflict dance.

Neat, huh?

Small Docs: Benefits...and Compromises

Remember how before, fetching a robot was a simple case of asking the DB for its document?

Now it involves the following steps:

- Ask the DB for the robot's document and set it aside.
- Collect a list of capability instances referring to this robot (could be a view or query).
- Fetch those and incorporate them into the robot document we fetched earlier.
- Collect a list of capabilities referred to by these capability instances.
- Fetch these and incorporate them into the result.
- Return the composed document.
- Hope nothing changed during all of that.
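
Here is a sketch of the composition part of those steps. The `docs` array stands in for whatever the individual fetches, views, or queries returned; in a real app each lookup would be its own round trip to the database, which is exactly the cost being described. The function name and the shape of the composed result are my own choices for illustration.

```javascript
// Rebuilds a "big" robot document from the little documents.
// `docs` is an in-memory stand-in for the results of several fetches.
function composeRobot(docs, robotId) {
  // Step 1: the robot's own document.
  const robot = docs.find((d) => d.type === 'robot' && d.id === robotId);
  if (!robot) return null;

  // Steps 4-5: capabilities, indexed by name for quick lookup.
  const capabilitiesByName = new Map(
    docs.filter((d) => d.type === 'capability').map((d) => [d.name, d])
  );

  // Steps 2-3: capability instances that refer to this robot, each
  // enriched with the description from its capability document.
  const instances = docs
    .filter((d) => d.type === 'capability_instance' && d.robot === robotId)
    .map((inst) => {
      const capability = capabilitiesByName.get(inst.name) || {};
      return { id: inst.id, name: inst.name, description: capability.description };
    });

  // Step 6: return the composed document without mutating the original.
  return Object.assign({}, robot, { capability_instances: instances });
}
```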

Also, in case you were wondering about views, remember what I kept repeating earlier: Each document in a view may only be derived from a single real document. Ditto for the mappings backing a query, which are each derived from a single real document.

There is no way to have Cloudant round up the full view of a primary entity anymore. It necessarily requires multiple fetches.

Another compromise is that with multiple updates come potential consistency problems. We're using the same Cloudant as before here: the data in a single document will always be consistent and will never be caught between operations. However, now we're composing data from multiple documents. These could well be out of sync, and we have no way of knowing for sure.

Little Docs, Big Thoughts

You may have expected this post to hold all of the answers to your questions on how to structure your data in Cloudant or CouchDB. Hopefully, it's clear that there is really no right answer and that it depends on a few factors.

Does your problem domain have a primary entity? If it's highly relational, consider something else.

If your data is hierarchical in nature, document based data stores like Cloudant make a lot of sense. Introspection is easy—just read the documents. As well, having the database prepare answers to questions about your entities is effortless with views and queries. If, on the other hand, your problem domain has a bunch of primary entities that need to stay consistent with each other, you may want to consider another type of database.

Lots of reads? Not a lot of writes? Stay big!

If your data is accessed a lot and mutated relatively infrequently, it makes a lot of sense to stick with the initial one-document-per-primary-entity model, using views and queries to fetch slices of your data. It may seem like this could not possibly be efficient, but trust me: It's probably just fine.

An added benefit of this approach is guaranteed data consistency for free. Because updates to a document are atomic, and because views and query data are updated automatically and efficiently, there are no opportunities for fetching data mid-operation. You may occasionally fetch old data, but it'll never be inconsistent.

Going small always adds complexity and always slows down reads.

It may seem like a great idea to go small from the start. Heck, this is the only thing you can do in an RDBMS, but going small carries a cost: extra complexity and slower reads.

Large amounts of repeated data? Pull out only that data.

Because of these costs, small documents should only be used where absolutely required to mitigate specific capacity or performance issues. For example, if large pieces of data are repeated across documents, consider replacing that data with a reference to a typed document and update your existing views/queries to emit for only specific types of documents. Fetching primary entities (Robots above) will be a bit harder, but the added complexity will be justified.

Lots of writes to a portion of a doc will slow down writes to any part of the entire doc!

Another reason to evict a portion of a big document into a series of little documents is to distribute load. Recall that only one update can be made to a document at once (with concurrent editors seeing a costly conflict). Where you know there will be contention on a certain structure within a document, consider exploding the structure into individual documents, each referring to the original doc. This way, inserts into this structure will always be conflict-free. As above, though, this will make it just that little bit harder to fetch documents. It will also make it harder to avoid data consistency issues.

Well, that's that. Hope this helps!

— chris
