<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Twenty Two Tabs]]></title><description><![CDATA[Twenty Two Tabs]]></description><link>https://blog.twentytwotabs.com/</link><image><url>https://blog.twentytwotabs.com/favicon.png</url><title>Twenty Two Tabs</title><link>https://blog.twentytwotabs.com/</link></image><generator>Ghost 2.19</generator><lastBuildDate>Mon, 13 Apr 2026 04:33:08 GMT</lastBuildDate><atom:link href="https://blog.twentytwotabs.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The Smallest Bash Script in the Universe]]></title><description><![CDATA[A detailed read on why every script must begin with #!, how program execution works in Linux with fork and exec, and how interpreters are actually run.]]></description><link>https://blog.twentytwotabs.com/the-smallest-bash-program-in-the-universe/</link><guid isPermaLink="false">5b8c9f40bd14fb00010d2a75</guid><category><![CDATA[bash]]></category><dc:creator><![CDATA[Chris Taylor]]></dc:creator><pubDate>Mon, 03 Sep 2018 13:57:00 GMT</pubDate><media:content url="https://blog.twentytwotabs.com/content/images/2018/09/tree-branches-against-blue-sky-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.twentytwotabs.com/content/images/2018/09/tree-branches-against-blue-sky-1.jpg" alt="The Smallest Bash Script in the Universe"><p>Welcome back 👋🏽</p>
<p>As usual, if you find my ramblings interesting and want to read more, or think they're rubbish and want to tell me about it, remember to scroll down to leave a comment before, during, or after your read.</p>
<h1 id="tldr">tl;dr</h1>
<p>If you're here to see the smallest Bash script, look no further.</p>
<p>Here it is:</p>
<pre><code>#!/bin/bash
</code></pre>
<p>Read on if you're wondering why it's not just an empty file, as 10-years-ago-me believed.</p>
<h1 id="thesmallestkindasortabashscriptintheuniverse">The Smallest Kinda Sorta Bash Script in the Universe</h1>
<p>I'll let you in on a little secret.</p>
<p>Back in the day I had this misconception that a Bash script was any text file with execute permission.  In other words, I thought that <code>shell_script</code> in the following example would always be executed using <code>bash</code> as the interpreter:</p>
<pre><code># Create an empty file
$ : &gt;shell_script

# Mark that file as executable
$ chmod u+x shell_script

# Run that file
$ ./shell_script
$
</code></pre>
<p>To explain why <code>shell_script</code> is only kinda sorta the smallest Bash script, we gotta go a little deeper and learn how programs are executed on Linux.</p>
<h1 id="forkandexecve">fork and execve</h1>
<p>Most programs running on Linux run at least some code from the C programming language.</p>
<p>If they're not directly written in C, they're very likely either running code from the C runtime or running in an interpreter that uses the C runtime.  And BTW, for the purposes of this post, you can think of the C runtime as a bunch of functions that have already been written for C programmers.  This code sits in a shared library, ready to run.</p>
<p>The two functions we'll be focusing on are ones you wouldn't ever want to implement yourself, as they interact with the kernel to do fun and exciting things that you don't want to get wrong.  Whoever had to review my implementations of these functions in that university operating systems course can attest to that.  I'm sorry for subjecting you to that, professor <em>I Forgot What Your Name Is</em>.</p>
<p>Let's follow a C program named <code>shelly</code> as it runs on a Linux box and executes another program named <code>script</code>.  Whenever <code>shelly</code> wants to run another program, it needs to run two functions from the C runtime: <code>fork()</code> and <code>execve()</code>.</p>
<p><img src="https://blog.twentytwotabs.com/content/images/2018/09/shelly_exec_fork-00e2df12-3657-47b0-8adf-98f1335ea70c.png" alt="The Smallest Bash Script in the Universe"></p>
<h2 id="processsplittingwithfork">Process Splitting with fork</h2>
<p><code>fork()</code> tells the kernel to clone the currently running process into two processes.  It's sort of like how cells split in a petri dish.  In Linux, however, the resulting pair of processes forms a hierarchy, with the original process becoming the parent of the new process.  Once the clone is complete, both processes resume at the same spot: the <code>return</code> from <code>fork()</code> (the parent gets back the child's process ID; the child gets <code>0</code>).</p>
<p>So after <code>fork()</code> completes, there will be two <code>shelly</code> processes.</p>
<p><img src="https://blog.twentytwotabs.com/content/images/2018/09/shelly_fork-a1c3c30d-8377-4163-a4a2-640fa20dfda1.png" alt="The Smallest Bash Script in the Universe"></p>
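<p>You can watch this half of the dance from a shell, since every external command a shell runs starts with a <code>fork()</code>.  A quick sketch (assuming <code>bash</code> is installed):</p>
<pre><code># $$ expands to the PID of the current shell process
$ echo $$

# bash -c starts a child: fork() gives it a brand new PID,
# then execve() loads /bin/bash into it
$ bash -c 'echo $$'
</code></pre>
<p>Run both commands and you'll see two different PIDs: the parent keeps its own, and the clone gets a fresh one.</p>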
<h2 id="processreplacementwithexecve">Process Replacement with execve</h2>
<p><code>execve()</code> is a little more interesting.</p>
<p>It asks the kernel to replace the program running in the current process with some other program. <code>execve()</code> quite literally obliterates the memory image of the running program and replaces it with a new memory image for the desired program.  When it's done, execution continues at the first instruction of that new program.  So in our example above, after <code>execve()</code> completes, there will be one process running the code for the <code>shelly</code> program and one process running the code for the <code>script</code> program.</p>
<p>Since no remnants of <code>shelly</code>'s code will survive <code>execve()</code>, the call accepts data that the kernel will pass to the new program when it starts. This data comes in two forms you may have heard of:</p>
<ul>
<li><em>environment variables</em> (a list of key-value pairs)</li>
<li><em>arguments</em> (an array of strings)</li>
</ul>
<p>Arguments usually describe what a process should call itself and what precisely it should do. Environment variables usually describe the system that a process is running in or other software systems that a process may need to work with.</p>
<p>Since the environment of a process will very likely be the same as that of its children, that set of key-value pairs is usually just copied from the already running program into the <code>execve()</code> call, with <em>maybe</em> a few additions.</p>
<p>Arguments are usually invocation-specific, and so are instead passed to <code>execve()</code> explicitly every time it's run.</p>
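<p>Both forms are easy to observe from a shell, which is itself just calling <code>fork()</code> and <code>execve()</code> under the hood.  A small sketch (the variable name <code>COLOUR</code> is made up for illustration):</p>
<pre><code># An exported variable rides along in the environment of every child
$ export COLOUR=blue
$ bash -c 'echo $COLOUR'
blue

# Arguments are passed explicitly on each invocation;
# in bash -c, the words after the scriptlet become $0, $1, ...
$ bash -c 'echo $COLOUR $1' sh world
blue world
</code></pre>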
<p><img src="https://blog.twentytwotabs.com/content/images/2018/09/shelly_exec-766e63c7-f055-4dc9-8ffd-2700c1553128.png" alt="The Smallest Bash Script in the Universe"></p>
<h2 id="whatdoesexecveactuallydo">What does execve actually do?</h2>
<p><code>execve()</code>'s only purpose in life is to ask the kernel to run programs from executable files.</p>
<p>It does this by asking the kernel to identify the type of program in an executable file and to run the appropriate kernel code to load it and set it executing.  The kernel has handlers for programs in a handful of different binary formats, but the handler we're interested in is <a href="https://github.com/torvalds/linux/blob/v4.19-rc2/fs/binfmt_script.c#L24-L25">right here</a>.</p>
<h2 id="scriptsmustbeginwith">Scripts Must Begin with #!</h2>
<p>Here's the beginning of the real magic of executing a script file:</p>
<p><img src="https://blog.twentytwotabs.com/content/images/2018/09/load_script-80abb39b-7a25-4494-add4-22f669abaa11-1.png" alt="The Smallest Bash Script in the Universe"></p>
<p>The most important code (highlighted) checks the first two characters of the first line of the program file.  If these characters are <code>#</code> followed by <code>!</code>, the program is considered a script. It then goes on to parse out the text following the <code>#!</code> into two words: an <em>interpreter</em> and an <em>optional argument</em> to that interpreter.</p>
<p>With these values in hand, the kernel repeats what it was asked to do with a few changes:</p>
<ul>
<li>Instead of loading and executing the script file, load and execute the <em>interpreter</em>.</li>
<li>Use the <em>optional argument</em> from the <code>#!</code> line as the first argument to the interpreter.</li>
<li>Use the script's name as the second argument.</li>
<li>For the remaining arguments, copy the ones <code>execve()</code> was originally given, minus the original program name (the script's name already took its place).</li>
</ul>
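<p>A small experiment makes the translation visible (the file name <code>tracer</code> is just for illustration).  The <em>optional argument</em> <code>-x</code> on the <code>#!</code> line reaches the interpreter and switches on Bash's command tracing:</p>
<pre><code>$ cat &gt;tracer &lt;&lt;'EOF'
#!/bin/bash -x
echo hello
EOF
$ chmod u+x tracer

# The kernel actually runs: /bin/bash -x ./tracer
$ ./tracer
+ echo hello
hello
</code></pre>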
<h1 id="runningtheactualsmallestbashscriptintheuniverse">Running the Actual Smallest Bash Script in the Universe</h1>
<p>So let's say <code>shelly</code> calls <code>execve()</code> on a file with the following contents:</p>
<pre><code>$ cat shell_script
#!/bin/bash
$
</code></pre>
<p>When <code>shelly</code> forks and executes <code>shell_script</code>, the kernel will actually load and run <code>/bin/bash</code>, passing <code>shell_script</code> as its first argument.</p>
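<p>And sure enough, the one-line script runs and exits successfully, doing exactly nothing:</p>
<pre><code>$ printf '#!/bin/bash\n' &gt;shell_script
$ chmod u+x shell_script
$ ./shell_script
$ echo $?
0
</code></pre>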
<h1 id="thesmallestbashscript">The Smallest Bash Script</h1>
<p>And that's why the smallest Bash script is:</p>
<pre><code>#!/bin/bash
</code></pre>
<p>...and why it's important to always begin shell scripts with <code>#!</code>.</p>
<p>The end.</p>
<h1 id="umreally">Um...really?</h1>
<p><img src="https://blog.twentytwotabs.com/content/images/2018/09/objection.jpg" alt="The Smallest Bash Script in the Universe"></p>
<p>You may notice that if you list a bunch of commands in a text file and run it in Bash, those commands still execute as if the file were a Bash script:</p>
<pre><code># Create a file containing two lines of shell
$ cat &gt; shell_script &lt;&lt;COMMANDS
echo Hello
echo world
COMMANDS

# Mark the file as executable
$ chmod u+x shell_script

# Execute the file
$ ./shell_script
Hello
world
$
</code></pre>
<p>I haven't been lying to you.  Asking the kernel to execute a text file that doesn't start with <code>#!</code> will always fail (<code>execve()</code> returns an <em>Exec format error</em>).  If Bash is doing the executing, though, the shell will do you a solid: when <code>execve()</code> fails this way, Bash falls back to interpreting the file itself.</p>
<p>Don't believe me?  Take a peek at <a href="https://github.com/bminor/bash/blob/64447609994bfddeef1061948022c074093e9a9f/execute_cmd.c#L5532-L5565">the source</a>:</p>
<p><img src="https://blog.twentytwotabs.com/content/images/2018/09/bash-execute-command-3c6e6edf-5dd4-4433-b5ea-a72c07b8d5dd.png" alt="The Smallest Bash Script in the Universe"></p>
<p>Your script may not always be executed by Bash (e.g. from a cron job, by the backend of a web service, or by a continuous integration system) so it's important to always start scripts with <code>#!</code> followed by the interpreter and an optional argument.</p>
<h1 id="furtherreading">Further Reading</h1>
<p>Hope you found that interesting and helpful.  If it was, be sure to click Recommend below and maybe even leave a comment.</p>
<p>For further reading on how Bash scripts are executed, see:</p>
<ul>
<li><a href="https://lwn.net/Articles/630727/">How programs get run</a>, Linux Weekly News</li>
<li><a href="https://github.com/torvalds/linux/blob/v4.19-rc2/fs/binfmt_script.c"><code>binfmt_script.c</code></a>, The Linux Kernel</li>
<li><a href="http://man7.org/linux/man-pages/man2/execve.2.html"><code>execve(2)</code></a>, The Linux Manual</li>
<li><a href="https://github.com/bminor/bash/blob/64447609994bfddeef1061948022c074093e9a9f/execute_cmd.c#L5532-L5565"><code>execute_command.c</code></a>, GNU Bash</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[The Magic of $@]]></title><description><![CDATA[Recently while lurking some geeky subreddits, I stumbled upon an interesting question about bash.  I have a bit of a soft spot for shell programming, so I decided to share some insight in a blog-worthy response.  If you too are interested in the magic variable that is $@, read on.]]></description><link>https://blog.twentytwotabs.com/the-magic-of-dollars-at/</link><guid isPermaLink="false">5b78765b2025f70001d2cda7</guid><category><![CDATA[bash]]></category><dc:creator><![CDATA[Chris Taylor]]></dc:creator><pubDate>Sat, 18 Aug 2018 22:51:45 GMT</pubDate><media:content url="https://blog.twentytwotabs.com/content/images/2018/08/software-developer.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.twentytwotabs.com/content/images/2018/08/software-developer.jpg" alt="The Magic of $@"><p>Hey 👋</p>
<p>Recently while lurking some geeky subreddits, I stumbled upon an interesting question about <code>bash</code>.  I have a bit of a soft spot for shell programming, so I decided to share some insight in a blog-worthy response.  If you too are interested in the magic variable that is <code>$@</code>, read on.</p>
<p>I promise you'll learn something 🎓</p>
<p>So, here's the question:</p>
<blockquote>
<p>Can anyone explain this... I have a script which runs commands in docker containers with docker run and sh -c. I pass the arguments of the script to the command with $@. Now if I do this directly it only passes the first argument, but if I assign it to a different variable first and then pass that variable to docker run I get all the arguments. Let's say I have a file bin/test and the contents is this:</p>
<pre><code class="language-shell">#!/usr/bin/env bash

# Outputs all arguments
echo &quot;$@&quot;

# Only outputs first argument
docker run --rm -ti node:10.9 sh -c &quot;echo $@&quot;

# Outputs all arguments
args=$@
docker run --rm -ti node:10.9 sh -c &quot;echo $args&quot;
</code></pre>
<p>And I run it with <code>bin/test foo --bar</code> the output would be:</p>
<pre><code>foo --bar
foo
foo --bar
</code></pre>
<p>Why does the second command only pass the first argument?</p>
</blockquote>
<h1 id="afewwordsaboutwords">A few words about words</h1>
<p>As some background, whenever a command is parsed and run, a bunch of things happen, including two important steps:</p>
<ul>
<li><strong>Parameter substitution:</strong> for example, replacing variable references like <code>$var</code> and <code>${bar}</code> with the contents of variables <code>var</code> and <code>bar</code> respectively.</li>
<li><strong>Word splitting:</strong> splitting a command line into a list of words suitable for a call to <a href="https://linux.die.net/man/2/execve"><code>execve(2)</code></a>.  In English, this means starting with a command line like <code>grep pattern file</code> and splitting it into a command name, <code>grep</code>, and a list of arguments, <code>pattern</code> and <code>file</code>.</li>
</ul>
<p>Although other interesting things happen to command lines as the shell reads and executes them, those two steps happen in that order.  This is why seemingly correct scripts like:</p>
<pre><code class="language-shell">#!/bin/bash

file=&quot;abc def ghi.txt&quot;
cat $file
</code></pre>
<p>...don't quite work out as expected: after substitution, the unquoted <code>$file</code> is split into three separate words, so <code>cat</code> goes looking for three files named <code>abc</code>, <code>def</code>, and <code>ghi.txt</code>.</p>
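<p>A quick way to see the split is to count the words an unquoted expansion produces (a tiny sketch):</p>
<pre><code class="language-shell">$ file='abc def ghi.txt'

# Unquoted: $file is substituted, then split into three words
$ set -- $file
$ echo $#
3

# Quoted: the expansion stays a single word
$ set -- &quot;$file&quot;
$ echo $#
1
</code></pre>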
<h1 id="isspecial">$@ is special 🌟</h1>
<p><code>$@</code> is a special variable when it comes to <em>parameter substitution</em> and <em>word splitting</em>.</p>
<p>If <code>$@</code> is evaluated while it's not surrounded by double quotes, it simply expands to a space-delimited list of the positional parameters (the arguments to the enclosing script or function), which is then subject to the usual word splitting.</p>
<p>However, if <code>$@</code> is evaluated while it's directly surrounded by double quotes, it temporarily overrides word splitting such that each positional parameter is treated as a separate word—one word per positional parameter.  This even includes positional parameters that have spaces in them.</p>
<p>This is what makes this script crash and burn:</p>
<pre><code class="language-shell">$ cat foo.sh
#!/bin/bash

grep PASSWORD $@
$ ./foo.sh 'a file with spaces.txt'
grep: a: No such file or directory
grep: file: No such file or directory
grep: with: No such file or directory
grep: spaces.txt: No such file or directory
</code></pre>
<p>...while this script works out:</p>
<pre><code class="language-shell">$ cat foo.sh
#!/bin/bash

grep PASSWORD &quot;$@&quot;
$ ./foo.sh 'a file with spaces.txt'
MY PASSWORD IS passw0rd
</code></pre>
<p>This property also makes the output of even trivial scripts deviate ever so slightly from expectations:</p>
<pre><code class="language-shell">$ cat foo.sh
#!/bin/bash

echo You provided: $@
$ ./foo.sh &quot;a   parameter   with   tripled   spaces&quot;
a parameter with tripled spaces 🤔
</code></pre>
<p>Comparing the argument to <code>foo.sh</code> and its output, notice that sequences of multiple spaces are all folded into one. This is because the shell first expands <code>$@</code> into a space delimited list of the command line arguments, and then proceeds to split the resulting command line into words, starting with <code>./foo.sh</code> and ending in <code>spaces</code>.</p>
<h1 id="maybealittletoospecial">Maybe a little too special</h1>
<p>There's even more to <code>$@</code> 😱.  The moment a double-quoted <code>$@</code> is encountered, the shell temporarily goes into a word-emitting mode.</p>
<p>What this means is: if a double quoted <code>$@</code> appears directly after or right before what would otherwise be a word (with no separating whitespace), the first word emitted by <code>$@</code> will be joined to the word directly before it and the last word emitted by <code>$@</code> will be joined to word that directly follows it.</p>
<p>Feels like an example is in order:</p>
<pre><code>#!/bin/bash

echo a b c&quot;$@&quot;d e 
</code></pre>
<p>Here, <code>echo</code> will be called with the arguments: <code>a</code>, <code>b</code>, <code>c$1</code>, <code>$2</code>, ... , <code>${n}d</code>, and <code>e</code>. The same would happen if <code>$@</code> appeared in double quotes alongside other text.  In <code>echo a b &quot;c$@&quot;d e f</code>, for example, the first and last words from <code>$@</code> would be fused with <code>c</code> and <code>d</code>.</p>
<h1 id="isjustplainordinary">$* is just plain ordinary</h1>
<p>If you ever want to avoid <code>$@</code>'s special behaviour, use <code>$*</code>.  It's just like <code>$@</code>, but it doesn't flip the shell into any funky modes.  When double quoted, <code>&quot;$*&quot;</code> expands to a single word: all the positional parameters joined by the first character of <code>IFS</code> (a space, by default).  No muss...no fuss.</p>
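<p>A side-by-side sketch (assuming the default <code>IFS</code>):</p>
<pre><code class="language-shell">$ set -- 'a b' c

# &quot;$@&quot;: one word per positional parameter, spaces preserved
$ printf '[%s]' &quot;$@&quot;; echo
[a b][c]

# &quot;$*&quot;: a single word, parameters joined by spaces
$ printf '[%s]' &quot;$*&quot;; echo
[a b c]
</code></pre>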
<h1 id="sowhysmyscriptbroken">So why's my script broken?</h1>
<p>So, back to the problem script:</p>
<blockquote>
<pre><code>echo &quot;$@&quot;
</code></pre>
</blockquote>
<p>This works as expected because it's being run, unadulterated, by the shell that's interpreting the script.  The double quotes directly surrounding the <code>$@</code> have the shell translate its command line parameters directly into words.</p>
<blockquote>
<pre><code>docker run --rm -ti node:10.9 sh -c &quot;echo $@&quot;
</code></pre>
</blockquote>
<p>This works strangely because the shell that's interpreting your shell script sees the double-quoted <code>$@</code> and flips into word-emitting mode.  Although word splitting will occur on the rest of this command line later, when the shell expands <code>$@</code> it immediately turns it into words that bypass the real word-splitting phase: <code>&quot;echo $@&quot;</code> becomes the words <code>echo (1st arg)</code>, <code>(2nd arg)</code>, <code>(3rd arg)</code>, and so on, with the rest of the script's arguments as individual words.</p>
<p>By the time the entire command line is split into words, it becomes: <code>docker</code> <code>run</code> <code>--rm</code> <code>-ti</code> <code>node:10.9</code> <code>sh</code> <code>-c</code> <code>echo (1st arg)</code> <code>(2nd arg)</code> ... <code>(3rd arg)</code>. Docker then turns around and runs <code>sh -c 'echo (1st arg)' '(2nd arg)' '(3rd arg)' ...</code>, which means &quot;run the script <code>echo (1st arg)</code> with the command line arguments, <code>(2nd arg)</code>, <code>(3rd arg)</code>, ... .&quot;</p>
<p>To get around this strange behavior, you could try replacing <code>$@</code> with a variable that doesn't have special word-splitting-bypassing properties, like <code>$*</code>.</p>
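<p>Both the problem and the fix can be reproduced without Docker by letting a plain <code>sh</code> stand in for the <code>docker run ... sh</code> part (an illustrative sketch, not the original script):</p>
<pre><code class="language-shell">$ cat &gt;demo &lt;&lt;'EOF'
#!/bin/bash
sh -c &quot;echo $@&quot;   # becomes: sh -c 'echo foo' '--bar' (prints only foo)
sh -c &quot;echo $*&quot;   # becomes: sh -c 'echo foo --bar' (prints everything)
EOF
$ chmod u+x demo
$ ./demo foo --bar
foo
foo --bar
</code></pre>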
<blockquote>
<pre><code>args=$@; docker run --rm -ti node:10.9 sh -c &quot;echo $args&quot;
</code></pre>
</blockquote>
<p>This works as expected because the value of <code>$@</code> is stored in the <code>args</code> variable first.  Word splitting doesn't occur when you assign a value to a regular variable (even if the right-hand side were double quoted), so <code>args</code> simply receives the list of command line arguments, joined by single spaces, as one string.  Since <code>$args</code> is a plain ol' variable, it's expanded without any funky word-splitting-bypassing nonsense in the <code>-c</code> scriptlet, and the command runs as expected.</p>
<p>Hope this post is as helpful to you as it was to the reddit user who posted it.</p>
<p>Love shell?  Hate it?  Not sure why it does the things it does?</p>
<p><strong>Leave a comment</strong> and let's bash it out.</p>
<p>~ chris</p>
]]></content:encoded></item><item><title><![CDATA[Docker In A Nutshell]]></title><description><![CDATA[<p>Recently, I started ramping up on a little something called <a href="http://docker.io">Docker</a>.  In its simplest form, Docker lets you treat a runtime environment like a codebase under source control.</p>
<p>Want to record how an environment looks?</p>
<pre><code>docker commit &lt;container&gt; &lt;tag&gt;
</code></pre>
<p>Made a few unfortunate changes, and want</p>]]></description><link>https://blog.twentytwotabs.com/docker-in-a-nutshell/</link><guid isPermaLink="false">5b4bb0b762016a000135d41a</guid><category><![CDATA[docker]]></category><category><![CDATA[runtime]]></category><category><![CDATA[container]]></category><category><![CDATA[Getting Started]]></category><dc:creator><![CDATA[Chris Taylor]]></dc:creator><pubDate>Mon, 25 Apr 2016 20:38:00 GMT</pubDate><media:content url="https://blog.twentytwotabs.com/content/images/2018/08/horizontal.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.twentytwotabs.com/content/images/2018/08/horizontal.png" alt="Docker In A Nutshell"><p>Recently, I started ramping up on a little something called <a href="http://docker.io">Docker</a>.  In its simplest form, Docker lets you treat a runtime environment like a codebase under source control.</p>
<p>Want to record how an environment looks?</p>
<pre><code>docker commit &lt;container&gt; &lt;tag&gt;
</code></pre>
<p>Made a few unfortunate changes, and want to roll it back?</p>
<pre><code>docker rm &lt;container&gt;
docker run &lt;tag&gt; &lt;command&gt;
</code></pre>
<p>It's more than a little intriguing and I'll definitely get back to how to make the most of it, but first, an admission.  Early on, I was routinely confused by all of the terms.  &quot;Images, containers and processes, oh my,&quot; I recall saying to myself on more than one occasion while staring aimlessly at an <code>xterm</code>.  To help me keep all this straight in my head, I whipped up a simple analogy.</p>
<blockquote>
<p>In docker...</p>
<p>An <em>image</em> is like a blueprint for a jail cell.</p>
</blockquote>
<p>A docker image contains a <em>starting point</em> for a runtime environment.  This includes all of the files on disk at the time that environment starts.  In there as well are tips on what network ports the process in the environment may need to listen on, the current working directory of this process, as well as a default command to run when the environment is started.</p>
<blockquote>
<p>A <em>container</em> is like a completed jail cell...with internet access.</p>
<p>A <em>dockerized process</em> is like a prisoner.</p>
</blockquote>
<p>A docker container is the <em>runtime environment</em> mentioned above.  Processes bound to a container may do whatever they want—create or delete files on disk, produce diagnostics, create users, bind network ports, or even create new processes.  The useful thing is that the effects of these actions will extend no further than the container walls.  This means that processes in two different containers can simultaneously, e.g. write different content to <code>/etc/foo.conf</code> or listen on port <code>8080</code> without conflict.</p>
<blockquote>
<p>The <em>docker daemon</em> is part prison guard; part construction crew.</p>
</blockquote>
<p>The docker daemon does a lot of work.  But the important bits involve creating containers from images, running dockerized processes, and recording their output.  The daemon also manages network traffic destined to containers (connections from other containers as well as ones routed in through the docker host), and helps manage requests to turn containers into new images.</p>
<blockquote>
<p>A <em>linux machine with <a href="https://linuxcontainers.org">lxc</a> enabled and <a href="https://docker.io">docker</a> installed</em> is like a prison.</p>
</blockquote>
<p>Docker is effectively a thin veneer over a still-maturing set of kernel features collectively called <a href="https://linuxcontainers.org">Linux Containers</a>.  On their own, these features are a not-so-easy way to set up isolated zones of execution on a Linux machine.  Docker makes them extremely easy to use.</p>
<p>And that's it—Docker in a nutshell.  This will definitely save the day the next time I'm staring at my <code>xterm</code>.  Hope it saves yours, too :-)</p>
]]></content:encoded></item><item><title><![CDATA[Cloudant and Couchdb: Little Docs, Big Thoughts]]></title><description><![CDATA[<p>Especially for developers like me who were brought up on RDBS, when it comes to document-based data stores like <a href="https://www.cloudant.com">Cloudant</a> or its laid back self-hosted cousin, <a href="http://couchdb.apache.org/">CouchDB</a>, the first question to come to mind is <em>how do I structure my data</em>?</p>
<p>I'm learning more and more that the answer is</p>]]></description><link>https://blog.twentytwotabs.com/cloudant-and-couchdb-little-docs-big-thoughts/</link><guid isPermaLink="false">5b66830c4bb73600011c5bf2</guid><dc:creator><![CDATA[Chris Taylor]]></dc:creator><pubDate>Wed, 17 Feb 2016 04:55:00 GMT</pubDate><content:encoded><![CDATA[<p>Especially for developers like me who were brought up on RDBMSs, when it comes to document-based data stores like <a href="https://www.cloudant.com">Cloudant</a> or its laid back self-hosted cousin, <a href="http://couchdb.apache.org/">CouchDB</a>, the first question to come to mind is <em>how do I structure my data</em>?</p>
<p>I'm learning more and more that the answer is not always clear cut and actually changes over time.  Especially here, though, the old adage applies:</p>
<blockquote>
<p>Start simple, then make it complicated!</p>
</blockquote>
<p>Consider the following scenario:</p>
<p>You have been tasked with writing a web-app that houses a collection of real life <code>robot</code>s.  Each <code>robot</code> can have one or more <code>capability</code>s.  These <code>capability</code>s come in a handful of varieties, each of which can be registered with the web-app.</p>
<h2 id="startsimple">Start Simple</h2>
<p>When it comes to document-based data stores, the simplest way to store data is such that each document represents <em>a primary entity</em> from the problem domain. From the description, the primary entity is a <code>robot</code>, so having one document per <code>robot</code> would be <em>the simplest way</em>.</p>
<p>e.g.</p>
<pre><code class="language-javascript">{
    &quot;type&quot;: &quot;robot&quot;,
    &quot;id&quot;: &quot;some-universally-unique-id&quot;,
    &quot;name&quot;: &quot;Optimus&quot;,
    &quot;description&quot;: &quot;A robot full of snazziness.&quot;,
    &quot;owner&quot;: &quot;cttttt@domain.com&quot;,
    &quot;capability_instances&quot;: [
        {
            &quot;type&quot;: &quot;transform-into-a-vehicle&quot;,
            &quot;description&quot;: &quot;Make a sound and change into a vehicle&quot;,
            &quot;instance_id&quot;: &quot;some-universally-unique-instance-id&quot;,
            ...
            ..
            .
        },
        ...
        ..
        .
    ]
}

</code></pre>
<h2 id="howaboutupdates">How About Updates?</h2>
<p>Updating a <code>robot</code>—for example, changing a <code>robot</code>'s <code>description</code>—is a matter of updating its document in <em>the Cloudant way</em>.</p>
<p>In other words:</p>
<pre><code>- Fetch the latest.
  - On error, bubble the error.
  - On success, update the description of what was fetched.
  - Insert fetched-and-modified document.
    - On error:
        - If it's a conflict, do this all again.
        - Otherwise, bubble the error.
    - On success, bubble success.
</code></pre>
<blockquote>
<p>But isn't this slow?</p>
</blockquote>
<p>In practice, not really, but there is a lot of cross talk here.  For predictable operations like updating a known field of a document, <a href="https://wiki.apache.org/couchdb/Document_Update_Handlers">update handlers</a> can be used to avoid the need to <em>fetch-a-doc-to-update-a-doc</em>.</p>
<p>Here is an example of an update handler:</p>
<pre><code class="language-javascript">{
    ...
    ...
    &quot;updates&quot;: {
        &quot;update-robot-description&quot;:  &quot;function (doc, req) {
            var resp = { 
                headers: {
                    &quot;content-type&quot;: &quot;text/plain&quot;,
                },
                code: 200,
                body: &quot;ok&quot;
            };
            
            if (!doc || !doc.type || doc.type !== &quot;robot&quot;) {
                resp.code = 404;
                resp.body = &quot;not-found&quot;;
                return [ null, resp ];
            }

            doc.description = req.form.description;
            return [ doc, resp ];
        }&quot;
    }
    ...
    ...
}
</code></pre>
<p>With this update handler, the web-app can now update robot descriptions by sending a <code>POST</code> to <code>/_design/DESIGN/_update/update-robot-description/SOME_DOC_ID</code> with the <code>description</code> field set in the request body.</p>
<h2 id="views">Views</h2>
<p>Using Cloudant's built in APIs, given an ID, we can fetch the document for any robot; all we need is its <code>id</code>.  But what if we wanted to fetch a document by other attributes, like its <code>name</code>?</p>
<p>The answer is a <em>View</em>.</p>
<p>Without going too much into the topic, a view is a list of derived documents, each computed (via a function you provide) from a <em><strong>single real</strong> document</em>.  Because every entry is derived, the view itself holds no original data.</p>
<p>So, in theory, a view could safely disappear and there would be no real data loss.  But don't worry: Views don't usually disappear. In fact, they're automatically generated and efficiently updated whenever any document changes.</p>
<p>So, with views you can have Cloudant efficiently prepare answers to questions like the following:</p>
<ul>
<li>Give me all of the <code>robot</code>s with a <code>capability</code> of <code>type</code>, <code>transform-into-a-vehicle</code>.</li>
<li>How many <code>robot</code>s have the capability, <code>transform-into-a-vehicle</code>?</li>
<li>Give me all of the <code>robot</code>s with the <code>name</code>, <code>Optimus</code>.</li>
<li>Give me all of the <code>robot</code>s with no capabilities.</li>
</ul>
<p>For more info on views, see <a href="https://wiki.apache.org/couchdb/Introduction_to_CouchDB_views?action=show&amp;redirect=Views">this wiki page</a>.</p>
<h2 id="queries">Queries</h2>
<p>Views are great, but defining them is kind of difficult.  Also, views spawn documents that map a key (e.g. the id of an instance of a type of capability) to a value (e.g. the robot that the capability instance is in).  These get very complicated if you need to ask questions that involve more than one field of a document.</p>
<p>For example, <em>Give me the robots that have a name <code>Optimus</code> and a <code>capability</code> with <code>type</code>, <code>transform-into-a-transport-truck</code></em> is a difficult question to answer with views.  Also, a view would need to be created for each variation on this type of question ahead of time:  Changes to the names or number fields involved in the question will result in changes to the view code.</p>
<p>For situations where it's impractical to create a view ahead of time, Cloudant provides a feature called Cloudant Query.  Similar to preparing a view, preparing a query involves giving Cloudant a function that generates zero or more <code>key:value</code> mappings for each document.  The important difference between a query and a view is that these <code>key:value</code> mappings are automatically fed into a query engine which incrementally generates enough data to answer more complex questions.</p>
<p>Do note, though, that like views, each of these mappings derives from <em>a single document</em>.</p>
<p>For more info on Cloudant Query, see <a href="https://cloudant.com/blog/introducing-cloudant-query">this post</a>.</p>
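<p>As a rough sketch, the multi-field question above could be expressed as the kind of JSON selector you'd <code>POST</code> to the database's <code>_find</code> endpoint (the field names match the robot documents from earlier; <code>$elemMatch</code> is the array operator for matching an element of <code>capability_instances</code>):</p>

```json
{
  "selector": {
    "type": "robot",
    "name": "Optimus",
    "capability_instances": {
      "$elemMatch": {
        "type": "transform-into-a-transport-truck"
      }
    }
  },
  "fields": ["name", "capability_instances"]
}
```

<p>No view needs to be written ahead of time for this particular combination of fields; the query engine uses its indexes to answer it.</p>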
<h2 id="bigdocsaresimpleandreadsarefast">Big Docs are Simple and Reads are FAST!</h2>
<p>So there you have it: The big-doc extreme of document based data stores like Cloudant.</p>
<p>In summary:</p>
<ul>
<li>Big documents are a great starting point.</li>
<li>They're simple.</li>
<li>If your domain entities are strictly hierarchical, big documents are a natural fit.</li>
<li>Techniques like update-handlers can simplify updates.</li>
<li>Because of the way conflicts work, all reads are always consistent:  You'll never read a doc halfway through an operation.</li>
<li>Reads pertaining to a primary entity will be extremely fast because they'll always require a single fetch.</li>
</ul>
<h2 id="bigdocsarehardtokeepdry">Big Docs Are Hard to Keep DRY</h2>
<p>In spite of all of those benefits, big documents aren't all sunshine and roses. Remember that document for that robot?  Let's make a few small tweaks to it and add a second robot:</p>
<pre><code class="language-language-javascript">// Added a new capability, speech, to Optimus
// Added a new robot, Bumblebee, who can transform but
//   cannot speak.

{
    &quot;type&quot;: &quot;robot&quot;,
    &quot;id&quot;: &quot;some-universally-unique-id&quot;,
    &quot;name&quot;: &quot;Optimus&quot;,
    &quot;description&quot;: &quot;A robot full of snazziness.&quot;,
    &quot;owner&quot;: &quot;cttttt@domain.com&quot;,
    &quot;capability_instances&quot;: [
        {
            &quot;type&quot;: &quot;transform-into-a-vehicle&quot;,
            &quot;description&quot;: &quot;Make a signature sound and change into a vehicle&quot;,
            &quot;instance_id&quot;: &quot;some-universally-unique-id&quot;
        },
        {
            &quot;type&quot;: &quot;speech&quot;,
            &quot;description&quot;: &quot;Speak in a human language&quot;,
            &quot;instance_id&quot;: &quot;some-universally-unique-instance-id&quot;
        }
    ]
},
{
    &quot;type&quot;: &quot;robot&quot;,
    &quot;id&quot;: &quot;some-other-universally-unique-id&quot;,
    &quot;name&quot;: &quot;Bumblebee&quot;,
    &quot;description&quot;: &quot;A robot that can't talk.&quot;,
    &quot;owner&quot;: &quot;cttttt@domain.com&quot;,
    &quot;capability_instances&quot;: [
        {
            &quot;type&quot;: &quot;transform-into-a-vehicle&quot;,
            &quot;description&quot;: &quot;Make a signature sound and change into a vehicle&quot;,
            &quot;instance_id&quot;: &quot;some-universally-unique-instance-id&quot;
        }
    ]
}
</code></pre>
<p>Here you may notice issue number one with big documents: repeated data.  On its own, this isn't a huge problem (it's only a few bytes of repeated data), but what if we weren't sure what we wanted to store in a <code>capability_instance</code>?</p>
<p>For example, what if we wanted to add a link to a video demonstrating the type of capability?  We'd need to:</p>
<pre><code>- For each capability type:
  - For each robot that contains this capability type:
    - Fetch the robot.
    - Add a new &quot;video&quot; field to the capability 
      instance(s) of this type.
</code></pre>
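<p>The per-document edit inside that loop is simple enough; the cost is the fetch-and-store round trip for every robot (plus any conflict retries).  A sketch, where <code>addVideo</code> and the video URL are hypothetical:</p>

```javascript
// Hypothetical helper (not part of any Cloudant API): given a fetched
// robot document, add a "video" field to every capability instance of
// the given type.  A real migration would still have to PUT the
// modified document back and retry on conflict.
function addVideo(robotDoc, capabilityType, videoUrl) {
  robotDoc.capability_instances.forEach(function (instance) {
    if (instance.type === capabilityType) {
      instance.video = videoUrl;
    }
  });
  return robotDoc;
}

var robot = {
  type: "robot",
  name: "Optimus",
  capability_instances: [
    { type: "transform-into-a-vehicle" },
    { type: "speech" }
  ]
};

addVideo(robot, "transform-into-a-vehicle", "https://example.com/demo.mp4");
console.log(robot.capability_instances[0].video); // "https://example.com/demo.mp4"
console.log(robot.capability_instances[1].video); // undefined: speech untouched
```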
<p>Further, what if we, for some odd reason, wanted to embed the video right into the database?  The space usage would be massive as the same video of a robot transforming into a vehicle would be embedded <strong>everywhere</strong>.</p>
<h2 id="bigdocsmeanmoreconflicts">Big Docs Mean More Conflicts</h2>
<p>Another issue with big documents is that the losers during &quot;concurrent&quot; edits need to do the conflict dance.  In-and-of-itself, this isn't an issue; in fact, it's how Cloudant and CouchDB were designed.  In practice, though, this conflict resolution flow ends up scaling poorly as the number of concurrent editors increases.</p>
<p>A concrete example:</p>
<blockquote>
<p>What if our bread and butter—the flow our users used 24/7 and 10,000 operations per minute—was trying out new types of <code>capability_instances</code> with a single <code>robot</code>.  The more concurrent edits to its list of <code>capability_instances</code>, the more conflicts would result, and the slower each edit would appear to be.</p>
</blockquote>
<p>A less obvious wrinkle:</p>
<blockquote>
<p>What if someone wanted to just change the description of a robot during this period of heavy load? They'd see conflicts as well, and may be wondering what's taking so long:  It's just the description.</p>
</blockquote>
<p>The answer: <em>little docs</em>.</p>
<h2 id="littledocshelpensuredryness">Little Docs Help Ensure DRY-ness</h2>
<p>Consider the following alternate list of documents:</p>
<pre><code class="language-language-javascript">// Details on each &quot;class of capabilities&quot; have been
//  moved to separate documents.
{
    &quot;type&quot;: &quot;robot&quot;,
    &quot;id&quot;: &quot;some-universally-unique-id&quot;,
    &quot;name&quot;: &quot;Optimus&quot;,
    &quot;description&quot;: &quot;A robot full of snazziness.&quot;,
    &quot;owner&quot;: &quot;cttttt@domain.com&quot;,
    &quot;capability_instances&quot;: [
        {
            &quot;name&quot;: &quot;transform-into-a-vehicle&quot;,
            &quot;id&quot;: &quot;some-universally-unique-instance-id&quot;
        },
        {
            &quot;name&quot;: &quot;speech&quot;,
            &quot;id&quot;: &quot;some-universally-unique-instance-id&quot;
        }
    ]
},
{
    &quot;type&quot;: &quot;robot&quot;,
    &quot;id&quot;: &quot;some-other-universally-unique-id&quot;,
    &quot;name&quot;: &quot;Bumblebee&quot;,
    &quot;description&quot;: &quot;A robot that can't talk.&quot;,
    &quot;owner&quot;: &quot;cttttt@domain.com&quot;,
    &quot;capability_instances&quot;: [
        {
            &quot;name&quot;: &quot;transform-into-a-vehicle&quot;,
            &quot;id&quot;: &quot;some-universally-unique-instance-id&quot;
        }
    ]
},
// Separate documents ---v
//
{
    &quot;type&quot;: &quot;capability&quot;,
    &quot;name&quot;: &quot;transform-into-a-vehicle&quot;,
    &quot;description&quot;: &quot;Make a signature sound and change into a vehicle&quot;
},
{
     &quot;type&quot;: &quot;capability&quot;,
     &quot;name&quot;: &quot;speech&quot;,
     &quot;description&quot;: &quot;Speak in a human language&quot;
}
</code></pre>
<p>Here, you'll notice that I created a few additional documents with a different <code>type</code> field.  There's actually nothing special about the word <code>type</code>.  It could be <code>sunshine_and_lollipops</code>.  Cloudant doesn't care what the field is called.</p>
<p>Whatever the field is named, it can come in handy for preparing queries on only documents that represent <code>capability</code>s, or while creating views where the view's imaginary documents spawn only from <code>robot</code> documents.</p>
<h2 id="littledocshelpreduceconflicts">Little Docs Help Reduce Conflicts</h2>
<p>I know.  The collection of docs above helps us not repeat ourselves, but what about our users' desires to constantly try out different combinations of capabilities on <code>Optimus</code>?</p>
<p>Consider the following:</p>
<pre><code class="language-language-javascript">// Capability instances have been entirely removed
//   from robots.  Now, adding a capability instance will never
//   result in a conflict.
{
    &quot;type&quot;: &quot;robot&quot;,
    &quot;id&quot;: &quot;1-some-universally-unique-id&quot;,
    &quot;name&quot;: &quot;Optimus&quot;,
    &quot;description&quot;: &quot;A robot full of snazziness.&quot;,
    &quot;owner&quot;: &quot;cttttt@domain.com&quot;
},
{
    &quot;type&quot;: &quot;robot&quot;,
    &quot;id&quot;: &quot;2-some-other-universally-unique-id&quot;,
    &quot;name&quot;: &quot;Bumblebee&quot;,
    &quot;description&quot;: &quot;A robot that can't talk.&quot;,
    &quot;owner&quot;: &quot;cttttt@domain.com&quot;
},
{
    &quot;type&quot;: &quot;capability&quot;,
    &quot;name&quot;: &quot;transform-into-a-vehicle&quot;,
    &quot;description&quot;: &quot;Make a signature sound and change into a vehicle&quot;
},
{
     &quot;type&quot;: &quot;capability&quot;,
     &quot;name&quot;: &quot;speech&quot;,
     &quot;description&quot;: &quot;Speak in a human language&quot;
},
{
      &quot;type&quot;: &quot;capability_instance&quot;,
      &quot;name&quot;: &quot;speech&quot;,
      &quot;robot&quot;: &quot;1-some-universally-unique-id&quot;,
      &quot;id&quot;: &quot;some-universally-unique-instance-id&quot;
},
{
      &quot;type&quot;: &quot;capability_instance&quot;,
      &quot;name&quot;: &quot;transform-into-a-vehicle&quot;,
      &quot;robot&quot;: &quot;1-some-universally-unique-id&quot;,
      &quot;id&quot;: &quot;some-universally-unique-instance-id&quot;
},
{
      &quot;type&quot;: &quot;capability_instance&quot;,
      &quot;name&quot;: &quot;transform-into-a-vehicle&quot;,
      &quot;robot&quot;: &quot;2-some-other-universally-unique-id&quot;,
      &quot;id&quot;: &quot;some-universally-unique-instance-id&quot;
}
</code></pre>
<p>Here we have removed <code>capability_instance</code>s from our <code>robot</code>s entirely.  Now, adding a capability to a robot will always succeed with no conflict dance.</p>
<p>Neat huh?</p>
<h2 id="smalldocsbenefitsandcompromises">Small Docs: Benefits...and Compromises</h2>
<p>Remember how before, fetching a robot was a simple case of asking the DB for its document?</p>
<p>Now it involves the following steps:</p>
<pre><code>- Ask the DB for the robot's document and set it aside.
- Collect a list of capability instances referring to this robot (could be a view or query).
- Fetch those and incorporate them into the robot documents we fetched earlier.
- Collect a list of capabilities referred to by these capability instances.
- Fetch these and incorporate them into the result.
- Return the composed document.
- Hope nothing changed during all of that.
</code></pre>
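<p>Sketched in code, those steps amount to a client-side join.  Here each fetch is replaced by an in-memory array; in real code, every array below would be the result of a separate round trip to the database, a view, or a query:</p>

```javascript
// Stand-ins for three separate fetches: the robot document, the
// capability_instance documents, and the capability documents.
var robot = { type: "robot", id: "r1", name: "Optimus" };
var instances = [
  { type: "capability_instance", robot: "r1", name: "speech" },
  { type: "capability_instance", robot: "r1", name: "transform-into-a-vehicle" }
];
var capabilities = [
  { type: "capability", name: "speech", description: "Speak in a human language" },
  { type: "capability", name: "transform-into-a-vehicle", description: "Change into a vehicle" }
];

// The client-side "join": no single fetch can produce this composed
// entity, so we assemble it ourselves and hope nothing changed in
// between the fetches.
function composeRobot(robot, instances, capabilities) {
  var byName = {};
  capabilities.forEach(function (capability) {
    byName[capability.name] = capability;
  });

  return {
    name: robot.name,
    capability_instances: instances
      .filter(function (instance) { return instance.robot === robot.id; })
      .map(function (instance) {
        return {
          name: instance.name,
          description: (byName[instance.name] || {}).description
        };
      })
  };
}

console.log(JSON.stringify(composeRobot(robot, instances, capabilities), null, 2));
```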
<p>Also, in case you were wondering about views, remember what I kept repeating earlier:  <em>Each document in a view may only be derived from <strong>a single</strong> real document</em>.  Ditto for the mappings backing a query, which are each derived from <em><strong>a single</strong> real document</em>.</p>
<p>There is no way to have Cloudant round up the full view of a primary entity anymore.  It <em>necessarily requires</em> multiple fetches.</p>
<p>Another compromise is that with multiple documents come potential consistency problems.  We're using the same Cloudant as before here:  The data within a single document will always be consistent and will never reflect a state <em>between operations</em>.  However, now we're composing data from multiple documents.  These could well be out of sync with each other, and we have no way of knowing for sure.</p>
<h2 id="littledocsbigthoughts">Little Docs, Big Thoughts</h2>
<p>You may have expected this post to hold all of the answers to your questions on how to structure your data in Cloudant or CouchDB.  Hopefully, it's clear that there is really no right answer and that it depends on a few factors.</p>
<blockquote>
<p>Does your problem domain have a primary entity?  If it's highly relational, consider something else.</p>
</blockquote>
<p>If your data is hierarchical in nature, document based data stores like Cloudant make a lot of sense.  Introspection is easy—just read the documents.  As well, having the database prepare answers to questions about your entities is effortless with views and queries.  If, on the other hand, your problem domain has a bunch of primary entities that need to stay consistent with each other, you may want to consider another type of database.</p>
<blockquote>
<p>Lots of reads?  Not a lot of writes?  Stay big!</p>
</blockquote>
<p>If your data is accessed a lot and mutated relatively infrequently, it makes a lot of sense to stick with the initial one-document-per-primary-entity model, using views and queries to fetch slices of your data.  It may seem like this could not possibly be efficient, but trust me:  It's probably just fine.</p>
<p>An added benefit of this approach is guaranteed data consistency <em>for free</em>.  Because updates to a document are atomic, and because views and query data are updated automatically and efficiently, there are no opportunities for fetching data mid-operation.  You may occasionally fetch old data, but it'll never be inconsistent.</p>
<blockquote>
<p>Going small always adds complexity and always slows down reads.</p>
</blockquote>
<p>It may seem like a great idea to go small from the start.  Heck, this is the only thing you can do in an RDBMS, but going small carries a cost:</p>
<ul>
<li>Forming a primary entity requires multiple fetches.  There's no such thing as a <code>JOIN</code> in a document based data store, so round trips are required.</li>
<li>Multiple fetches mean the possibility of data inconsistency.  The only (sort of) atomic write operation in Cloudant is an insert or update of a single document (with atomicity enforced through conflicts).  This guarantee doesn't exist when updating multiple docs that compose an entity.</li>
</ul>
<blockquote>
<p>Large amounts of repeated data? Pull out <em>only that data</em>.</p>
</blockquote>
<p>Because of these costs, small documents should only be used where absolutely required to mitigate specific capacity or performance issues.  For example, if large pieces of data are repeated across documents, consider replacing that data with a reference to a <em>typed document</em> and update your existing views/queries to emit for only specific types of documents.  Fetching primary entities (<code>Robot</code>s above) will be a bit harder, but the added complexity will be justified.</p>
<blockquote>
<p>Lots of writes to a portion of a doc will slow down writes to <strong>any part</strong> of the <strong>entire doc</strong>!</p>
</blockquote>
<p>Another reason to evict a portion of a big document into a series of little documents is to distribute load.  Recall that only one update can be made to a document at once (with concurrent editors seeing a costly conflict).  Where you know there will be contention on a certain structure within a document, consider exploding the structure into individual documents, each referring to the original doc.  This way, inserts into this structure will always be conflict-free.  As above, though, this will make it just that little bit harder to fetch documents.  It will also make it harder to avoid data consistency issues.</p>
<p>Well, that's that.  Hope this helps!</p>
<p>— chris</p>
]]></content:encoded></item><item><title><![CDATA[Promises 101]]></title><description><![CDATA[<p>The other day I was reading up on <a href="http://koajs.com/">a new Node.js backend framework</a> and was intrigued as it was based on a new Javascript language feature: Generators.  Knowing nothing about this bleeding edge Javascript tech, I set out on a bit of an adventure.  It turns out that generators</p>]]></description><link>https://blog.twentytwotabs.com/promises-101/</link><guid isPermaLink="false">5b6681e94bb73600011c5bf0</guid><category><![CDATA[nodejs]]></category><category><![CDATA[promises]]></category><category><![CDATA[Getting Started]]></category><dc:creator><![CDATA[Chris Taylor]]></dc:creator><pubDate>Tue, 05 Jan 2016 04:50:00 GMT</pubDate><media:content url="https://blog.twentytwotabs.com/content/images/2018/08/nodejs-new-pantone-black.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.twentytwotabs.com/content/images/2018/08/nodejs-new-pantone-black.png" alt="Promises 101"><p>The other day I was reading up on <a href="http://koajs.com/">a new Node.js backend framework</a> and was intrigued as it was based on a new Javascript language feature: Generators.  Knowing nothing about this bleeding edge Javascript tech, I set out on a bit of an adventure.  It turns out that generators are part of a yellow brick road that the TC39 hopes will lead developers away from zigzagging pyramid-like programs to code that more closely resembles the plain old synchronous variety almost all of us are used to.</p>
<p>But where does this journey begin?  Consider the following slow function...and just a note that all of these code examples should run in NodeJS <code>v5.4.1</code> with no special flags or modules.</p>
<pre><code class="language-javascript">/*
 * A slow routine.
 */
function slowlyIncrement(i, cb) {
        setTimeout(function() {
                cb(null, i + 1);
        }, 1000);
}
</code></pre>
<p>This function follows a pretty well defined pattern where the last argument is a piece of code <em>pushed in</em> by the caller.  Internally, the function does some work and calls the provided callback, <code>cb</code>, sending it an optional error and a result.</p>
<p>Pretty simple.</p>
<p>But what if you need to run several slow routines in sequence?</p>
<pre><code class="language-javascript">slowlyIncrement(0, function(err, i) {
        // TODO: handle errors
        slowlyIncrement(i, function (err, j) {
                // TODO: handle errors
                slowlyIncrement(j, function (err, k) {
                        // TODO: handle errors
                        slowlyIncrement(k, function (err, x) {
                                // TODO: handle errors
                                slowlyIncrement(x, function (err, y) {
                                        // TODO: handle errors
                                        slowlyIncrement(y, function (err, z) {
                                                // TODO: handle errors
                                                console.log(z); // prints 6
                                        });
                                });
                        });
                });
        });
});
</code></pre>
<p>The main issue—the one that pops right out like one of those 3D stereograms from a few decades ago—is what I like to call <em>the bad kind of horizontal scaling</em>.  Another more subtle issue is how errors are handled.  Because potential errors are passed as arguments to each callback, there's no way to defer error handling:  Every callback must have error handling code.  Both of these issues may be jarring to someone more used to synchronous code and exceptions.</p>
<h2 id="arrowfunctions">Arrow functions</h2>
<p>One neat feature introduced in ES6/ES2015 is a shorthand for anonymous functions.  It's pretty simple.</p>
<p>In compatible runtimes, this:</p>
<pre><code class="language-javascript">function (i) {
        console.log(i);
}
</code></pre>
<p>...is equivalent to this:</p>
<pre><code class="language-javascript">(i) =&gt; {
        console.log(i);
}
</code></pre>
<p>So, our crazy zigzagging code can be reduced to this:</p>
<pre><code class="language-javascript">slowlyIncrement(0, (err, i) =&gt; {
        slowlyIncrement(i, (err, j) =&gt; {
                slowlyIncrement(j, (err, k) =&gt; {
                        slowlyIncrement(k, (err, x) =&gt; {
                                slowlyIncrement(x, (err, y) =&gt; {
                                        slowlyIncrement(y, (err, z) =&gt; {
                                                console.log(z); // prints 6
                                        });
                                });
                        });
                });
        });
});
</code></pre>
<h2 id="promises">Promises</h2>
<p>Although arrow functions are a great way to reduce a little repetition, they don't really address the structural issues with our code.  Enter <em>promises</em>.</p>
<p>As I mentioned above, when a slow function is provided a callback, the programmer is <em>pushing</em> an entity (a piece of code) into that function.  At least to me, this concept was pretty confusing when I started working with Javascript.  I was used to functions—slow or fast—<code>return</code>-ing values which got <em>pulled out</em> via a function call.</p>
<p><em>Promises</em> are an attempt at turning this unintuitive convention back on its head.  Don't study it too closely just yet, but consider the following new and improved slow function:</p>
<pre><code class="language-javascript">function slowlyIncrement (i) {
        return new Promise((resolve, reject) =&gt; {
                setTimeout(() =&gt; {
                        resolve(i + 1);
                }, 1000);
        });
}
</code></pre>
<p>One immediate improvement here is that we're no longer expecting callers to push code to our function.  Instead, callers pass in arguments and <em>pull out</em> an object that represents the eventual result.</p>
<p>For example:</p>
<pre><code class="language-javascript">var result = slowlyIncrement(0);
</code></pre>
<p>Now, the trick is coaxing the actual value out of this object.  This is accomplished via its only documented method, <code>then</code>:</p>
<pre><code class="language-javascript">var result = slowlyIncrement(0);
result.then((i) =&gt; {
        console.log(i); // prints 1
});
</code></pre>
<p>The idea is that the callback provided to <code>then</code> will be called when the value underlying the promise is known or <em>resolved</em>.</p>
<p>A neat bit of trivia: because the defining method of a promise is <code>then</code>, any object with a <code>then</code> method is usually described as a <em>thenable</em>.  But more on that later.</p>
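<p>That duck typing is easy to demonstrate.  As a sketch (using the built-in <code>Promise.resolve</code>, which adopts any object exposing a <code>then</code> method):</p>

```javascript
// A plain object pretending to be a promise: because it has a then
// method, the promise machinery treats it as a thenable.
var thenable = {
  then: function (resolve) {
    resolve(42);
  }
};

// Promise.resolve() absorbs the thenable into a real promise.
Promise.resolve(thenable).then(function (value) {
  console.log(value); // prints 42
});
```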
<h2 id="morethenmeetstheeye">More <code>then</code> meets the eye</h2>
<p><code>then</code> has a few more tricks up its sleeve; actually, two pretty simple properties:</p>
<ol>
<li><code>then</code> always returns a promise.  Let's call this promise <code>thenResult</code>.</li>
<li><code>thenResult</code> resolves to whatever the callback provided to <code>then</code> returns.  Specifically:
<ul>
<li>If the callback provided to <code>then</code> returns a run-of-the-mill value, <code>thenResult</code> will immediately be resolved with that value.</li>
<li>If the callback provided to <code>then</code> returns a <em>thenable</em>, <code>thenResult</code> will be resolved whenever (and with whatever) that <em>thenable</em> is resolved.</li>
</ul>
</li>
</ol>
<p>Here's an example of a run of the mill value being propagated:</p>
<pre><code class="language-javascript">var result = slowlyIncrement(0);
result
.then((i) =&gt; {
        return i + &quot;'s the result&quot;;
})
.then((i) =&gt; {
        console.log(i);  // prints &quot;1's the result&quot;
});
</code></pre>
<p>Here's an example of a promise being chained:</p>
<pre><code class="language-javascript">var result = slowlyIncrement(0);
result
.then((i) =&gt; {
        return slowlyIncrement(i); // a promise is returned here
}) // this then() returns a promise which resolves to whatever value the promise above does
.then((i) =&gt; {
        return i + &quot;'s the result&quot;;
})
.then((i) =&gt; {
        console.log(i); // prints &quot;2's the result&quot;
});
</code></pre>
<h2 id="scalingvertically">Scaling vertically</h2>
<p>Now we have all the knowledge required to tackle that nesting in our original code snippet.  You remember...the code that performed five increments in sequence and spanned three screen widths.</p>
<p>Here's the Cole's Notes version:</p>
<pre><code class="language-javascript">slowlyIncrement(0)
.then((i) =&gt; {
  return slowlyIncrement(i);
})
.then((j) =&gt; {
  return slowlyIncrement(j);
})
.then((k) =&gt; {
  return slowlyIncrement(k);
})
.then((x) =&gt; {
  return slowlyIncrement(x);
})
.then((y) =&gt; {
  return slowlyIncrement(y);
})
.then((z) =&gt; {
  console.log(z); // prints 6
})
</code></pre>
<p>...which can be reduced even further as:</p>
<pre><code class="language-javascript">// slowlyIncrement just so happens to accept one param
// ditto for console.log.
slowlyIncrement(0)
.then(slowlyIncrement)
.then(slowlyIncrement)
.then(slowlyIncrement)
.then(slowlyIncrement)
.then(slowlyIncrement)
.then(console.log)
</code></pre>
<h2 id="ohyeaherrorhandling">Oh yeah.  Error handling.</h2>
<p>There's something about the <code>then</code> method of a promise that I didn't mention earlier:  It actually takes two callbacks.</p>
<p>The first one is called whenever the value underlying the promise is known (or <em>resolves</em>).  The second one is called whenever an error occurs while arriving at a value.  When this happens, it's up to the promise-returning function to raise an error by <em>rejecting</em> the promise.</p>
<p>Here's an example of a more complete <code>slowlyIncrement</code> which rejects where the provided value is not a number:</p>
<pre><code class="language-javascript">function slowlyIncrement(i) {
        return new Promise(function (resolve, reject) {
                if (Number.isNaN(Number(i))) {
                        reject(new Error(i + &quot;: Not a number&quot;));
                } else {
                        setTimeout(() =&gt; {
                                resolve(i+1);
                        }, 1000);
                }
        });
}
</code></pre>
<p>Here's an example usage with proper error handling:</p>
<pre><code class="language-javascript">// Here, no error occurs.
slowlyIncrement(0)
.then(
  console.log, // prints 1
  (e) =&gt; {
    console.log(e.stack);
  }
);
</code></pre>
<p>Here's an example usage where an error actually occurs:</p>
<pre><code class="language-javascript">// Here, an error occurs because &quot;abcde&quot; isn't a number
slowlyIncrement(&quot;abcde&quot;)
.then(
  console.log,
  (e) =&gt; {
    console.log(e.stack);  // displays a stack trace
  }
);
</code></pre>
<h2 id="theweakestlink">The Weakest Link</h2>
<p>Just like resolved values propagate down the promise chain, so too do errors.  Here, errors will still be caught, even if we decide not to handle them for each and every call to <code>then</code>:</p>
<pre><code class="language-javascript">slowlyIncrement(&quot;abcde&quot;)
.then(slowlyIncrement)
.then(
  console.log,
  (e) =&gt; {
    console.log(e.stack);
  }
);
</code></pre>
<p>This is looking much more familiar, eh?  Let's close the loop.  With promises, that grimy, nested, error prone mess all the way at the top becomes this:</p>
<pre><code class="language-javascript">function slowlyIncrement(i) {
        return new Promise(function (resolve, reject) {
                if (Number.isNaN(Number(i))) {
                        reject(new Error(i + &quot;: Not a number&quot;));
                } else {
                        setTimeout(() =&gt; {
                                resolve(i+1);
                        }, 1000);
                }
        });
}

slowlyIncrement(0)
.then(slowlyIncrement)
.then(slowlyIncrement)
.then(slowlyIncrement)
.then(slowlyIncrement)
.then(slowlyIncrement)
.then(
  console.log,
  (e) =&gt; {
    console.log(e.stack);
  }
)
</code></pre>
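<p>As an aside, ES2015 promises also ship with a <code>catch</code> method, which is just shorthand for <code>then(undefined, handler)</code>.  A sketch, reusing the rejecting <code>slowlyIncrement</code> from above:</p>

```javascript
// The same rejecting slowlyIncrement from the example above.
function slowlyIncrement(i) {
  return new Promise(function (resolve, reject) {
    if (Number.isNaN(Number(i))) {
      reject(new Error(i + ": Not a number"));
    } else {
      setTimeout(function () {
        resolve(i + 1);
      }, 1000);
    }
  });
}

// catch(handler) behaves exactly like then(undefined, handler), so a
// single handler at the end of the chain sees any upstream rejection.
slowlyIncrement("abcde")
.then(slowlyIncrement)
.then(console.log)
.catch(function (e) {
  console.log(e.stack); // the rejection from the first call lands here
});
```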
<p>Now I know what you're thinking:</p>
<p><strong>The Good</strong></p>
<ul>
<li>We can cut our use of the keyword <code>function</code> down considerably with <em>arrow functions</em>.</li>
<li>With promises, we no longer need to <em>push code</em> to slow functions.  We simply call those functions and fetch a deferred result in the form of a <em>promise</em>.</li>
<li>Because promises may be chained, we no longer have to contend with very wide source files or hard to match parentheses.</li>
<li>Promises allow error handling code to be consolidated, instead of <em>needing to <strong>always</strong> be</em> spread out throughout an asynchronous call chain.</li>
</ul>
<p><strong>The Bad</strong></p>
<ul>
<li>We're still <em>pushing code to fetch a value</em>, only now to the <code>then</code> method of a promise.  This still feels a bit unnatural.</li>
<li>A promise is a type that controls the execution path of a program.  Shouldn't this be part of the language's syntax? This way, the compiler will be privy to the desired control flow and may be able to optimize or at least confirm syntax.</li>
<li>The implementation of the slow running function is still kind of unnatural.  This is probably related to the point above about the lack of language support for any of this.</li>
</ul>
<p>In my next post, I'll get into the part of the road closer to the Emerald City—that place where async code can look as clean and familiar as the synchronous code of old.</p>
<p>This leg of the journey is definitely under construction, with some pretty crazy twists and turns.  The exciting part, however, is how much of this plumbing is being brought behind the curtain of the Javascript language where it belongs.</p>
<p>Until next time,</p>
<p>— chris</p>
]]></content:encoded></item><item><title><![CDATA[How to Timeout a Connection in Node According to Stack Overflow]]></title><description><![CDATA[<p>Firstly, I just want to make it clear that this post is in no way a slight towards Stack Overflow.  Without this fantastic collection of carefully curated questions and answers, we would all be lost.  This is just a story about an imaginary software developer maintaining imaginary legacy code.</p>
<p>Last</p>]]></description><link>https://blog.twentytwotabs.com/how-to-timeout-a-connection-in-node/</link><guid isPermaLink="false">5b667e424bb73600011c5bee</guid><dc:creator><![CDATA[Chris Taylor]]></dc:creator><pubDate>Wed, 25 Nov 2015 04:34:00 GMT</pubDate><content:encoded><![CDATA[<p>Firstly, I just want to make it clear that this post is in no way a slight towards Stack Overflow.  Without this fantastic collection of carefully curated questions and answers, we would all be lost.  This is just a story about an imaginary software developer maintaining imaginary legacy code.</p>
<p>Last week, this imaginary dev was on a bit of an adventure. Instead of writing brand new code or making calm one-or-two line edits to fix bugs in well-written Javascript, he was helping maintain a bit of a rat's nest.</p>
<p>Let's just call this micro-service <em>the Gardiner expressway</em>.</p>
<h2 id="oncemoreuntothebreachdearfriends">Once more unto the breach, dear friends...</h2>
<p>What I learned from this program—which I coincidentally named after a notoriously unmaintainable downtown Toronto thoroughfare—is a brand new way to timeout an outgoing HTTP connection.</p>
<p>Here's a module that demonstrates this new approach.  I'll throw this into <code>get.js</code>:</p>
<pre><code class="language-language-javascript">/* eslint-env node */
/* eslint no-use-before-define: 0 */
/* eslint no-multi-spaces: 0 */

&quot;use strict&quot;;

var request = require(&quot;request&quot;);

module.exports = function (url, timeout, callback) {
    var isTimedOut;

    setTimeout(function () {
        isTimedOut = true;
        callback(new Error(&quot;Connection timed out&quot;));
        return;
    }, timeout);

    request.get(url, function (err, res, body) {
        if (isTimedOut) {
            return;
        }

        callback(err, body);
        return;
    });
};
</code></pre>
<p>And here's a quick driver program that takes the URL and timeout off of the command line, runs our modularized function to fetch the URL, and displays the number of characters in the response body:</p>
<pre><code class="language-language-javascript">/* eslint-env node */
/* eslint no-use-before-define: 0 */
/* eslint no-multi-spaces: 0 */
/* eslint no-shadow: 0 */
/* eslint no-process-exit: 2 */

&quot;use strict&quot;;

var get = require(&quot;./get&quot;);

if (process.argv.length !== 4) {
    console.error(
        &quot;Usage: %s URL MILLISECONDS\n&quot; +
        &quot;Fetch URL and display the length of the fetched body.\n&quot; +
        &quot;Timeout after the provided number of ms.&quot;
    );
    process.exit(1);
}

var url = process.argv[2],
    timeout = parseInt(process.argv[3], 10)
;

get(url, timeout, function (err, body) {
    if (err) {
        printError(err);
        return;
    }

    console.log(&quot;%s characters received&quot;, body.toString().length);
});

function printError(err) {
    if (err instanceof Error) {
        console.error(err.stack);
    } else {
        console.error(err.toString());
    }
}
</code></pre>
<p>So, for example, to fetch the URL <code>https://www.google.com</code> and give up after one second:</p>
<pre><code class="language-language-shell">$ npm install request
...
...
$ node index https://www.google.com 1000
55429 characters received
</code></pre>
<p>And, as you'd expect, if we bump the timeout down low enough, we'll see errors:</p>
<pre><code class="language-language-shell">$ node index https://www.google.com 900
55452 characters received
$ node index https://www.google.com 800
55460 characters received
$ node index https://www.google.com 700
55471 characters received
$ node index https://www.google.com 600
55415 characters received
$ node index https://www.google.com 500
Error: Connection timed out
    at null._onTimeout (/Users/ctaylorr/proj/hownottotimeout/get.js:14:11)
    at Timer.listOnTimeout (timers.js:110:15)
</code></pre>
<h2 id="itworksletsshipitbtwwhattheactualeff">It works! Let's ship it! BTW, what the actual eff?!?!?!</h2>
<p>But wait a second.  This isn't how timeouts are handled elsewhere, and this approach has some real problems:</p>
<ol>
<li>
<p>We start our <em>timeout countdown</em> well before the connection is even attempted.  So we're measuring not just the latency of the destination server, but also:</p>
<ul>
<li>DNS lookup time.</li>
<li>Socket creation time.</li>
<li>The amount of time it takes to run code in <code>request(...)</code> prior to creating a connection.</li>
</ul>
</li>
<li>
<p>The pending <code>setTimeout</code> keeps the process alive until the timer fires, even after a successful response.  This can be demonstrated by running <code>node index https://www.google.com 10000</code>.</p>
</li>
<li>
<p>Even when we do time out, the connection is still active.  Don't believe me?  Run <code>node index https://www.google.com:12345 500</code>.</p>
<p>Node will not terminate, since the pending connection is still keeping the event loop alive.  If this code were running in a web app, we would be leaking sockets under heavy load.  One of the goals of implementing timeouts is to conserve resources.</p>
<p>We better look into this.</p>
</li>
</ol>
<p>There are probably other issues, but what we've found so far is enough to justify investigating alternatives to this approach.</p>
<p>Let's try to address them one-by-one:</p>
<h2 id="butfirstletscheckthedocs">But first, let's check the docs...</h2>
<p>According to the docs for the <a href="https://github.com/request/request#requestoptions-callback">request module</a>, implementing a connection timeout requires an additional option:</p>
<pre><code class="language-language-javascript">var request = require(&quot;request&quot;);

request.get({
    url: &quot;https://www.google.com&quot;,
    timeout: 1000  // &lt;---- THAT'S IT. AN OPTION.
}, function (err, res, body) {
    // handle errors

    // process results
});
</code></pre>
<h2 id="itwasanoptionallalong">It was an option all along</h2>
<p>Yup.  And the resulting code is so simple, you don't even need a separate module.  Just call <code>request.get()</code> with the extra option.</p>
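<p>One detail worth handling in the callback: per the <code>request</code> README, the <code>timeout</code> option can fail in two distinct ways, a connection timeout (<code>err.code === "ETIMEDOUT"</code> with <code>err.connect === true</code>) and a read timeout (<code>err.code === "ESOCKETTIMEDOUT"</code>).  A small helper to tell them apart; the function name is my own, not part of the <code>request</code> API:</p>

```javascript
"use strict";

// Per the request module's README, the timeout option fails two ways:
//   connection timeout: err.code === "ETIMEDOUT" and err.connect === true
//   read timeout:       err.code === "ESOCKETTIMEDOUT"
// classifyTimeout is my own helper name, not part of the request API.
function classifyTimeout(err) {
    if (!err) {
        return null;
    }

    if (err.code === "ETIMEDOUT" && err.connect === true) {
        return "connection timeout";
    }

    if (err.code === "ESOCKETTIMEDOUT") {
        return "read timeout";
    }

    return null;  // not a timeout error
}

module.exports = classifyTimeout;
```

<p>In the <code>request.get</code> callback you could then write <code>if (classifyTimeout(err)) { /* retry or report */ }</code> to treat timeouts differently from other failures.</p>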
<h2 id="tldrknowyourtools">tl;dr: Know your tools.</h2>
<p>The moral of this story: when using new tech (an API, a framework, ...anything), always set aside some time to look at the documentation.  No need to commit the entire thing to memory, but make it your first stop when a new requirement pops up.  If nothing directly applies, consider composing concepts from the docs.</p>
<p>If that still doesn't work, only then consider the clever techniques from sites like <a href="https://stackoverflow.com/">Stack Overflow</a>.</p>
<p>Hope this helps,</p>
<p>— chris</p>
]]></content:encoded></item><item><title><![CDATA[Twenty two tabs at a time]]></title><description><![CDATA[The more tabs, the better in my book]]></description><link>https://blog.twentytwotabs.com/twenty-two-tabs-at-a-time/</link><guid isPermaLink="false">5b4b666f110a060001a1fcf9</guid><dc:creator><![CDATA[Chris Taylor]]></dc:creator><pubDate>Thu, 01 Oct 2015 15:21:00 GMT</pubDate><content:encoded><![CDATA[<p>I've always held the firm belief that being a software developer is more of a learning experience than anything else.  Granted, there will always be those super-hero moments when I live in an editor, write reams of code, hit <em>build</em>, run my unit tests and it all works out.  But a lot of days end up something like this...</p>
<p><img src="https://blog.twentytwotabs.com/content/images/2018/08/twenty_two_tabs-1.png" alt="twenty_two_tabs-1"></p>
<p>...and to me at least, that's a-okay.  It just means that by the time I packed it in for the day, I learned something.</p>
]]></content:encoded></item></channel></rss>