Planet CouchDB

September 01, 2010

Damien Katz

What's New in CouchDB 1.0 Security 'n Stuff

Today, I get a little help from Rebecca. She's writing a CouchApp, an application that is served right out of CouchDB and that lives in the browser. It has no middle tier application server in Ruby or Java. The application and display logic is written in JavaScript, the user interface is HTML & CSS, the backend is CouchDB and uses Ajax to shove JSON back and forth.

What's new in CouchDB 1.0 -- Part 4: Security 'n Stuff: Users, Authentication, Authorisation and Permissions

by Damien Katz at September 01, 2010 12:36 AM

August 31, 2010

Klaus Trainer

Building BigCouch on Ubuntu 10.04

I had some issues building BigCouch on Ubuntu. Below, I describe one way to get BigCouch installed on Ubuntu 10.04, with all dependencies installed from the Ubuntu main repository.

The first problem I encountered was that make failed because the SpiderMonkey headers were not found:

Could not find jsapi.h. Are Mozilla SpiderMonkey headers installed?
make: *** [compile] Error 1

To resolve this error, you need to tell the build system the location of the xulrunner development library (i.e. the Ubuntu package xulrunner-dev) and its include files, for example with the following patch:

Furthermore, you need to create a symbolic link /usr/lib/libmozjs.so, for example:

sudo ln -s /usr/lib/xulrunner-1.9.2.8/libmozjs.so /usr/lib/libmozjs.so

Finally, I hit another issue when running make install: the installation aborted because the spec generation failed. The culprit was a symbolic link, and deleting it let the installation succeed:

sudo rm /usr/lib/erlang/man

Note that this describes only one way to get BigCouch installed on Ubuntu 10.04, and a quite hackish one at that. If you have any suggestions for improving this description, or if you know of a better alternative, please share it in a comment!

by Klaus at August 31, 2010 09:22 AM

August 30, 2010

Chris Strom

Surviving Restart


Today, I would like to see if I can restart my (fab) game without the usual assorted problems. Before the switch to a permanent store (naturally CouchDB) this would have been impossible. At this point, I only need to restore the in-code setTimeouts, which are used to idle timeout players. Every other attribute should already be in CouchDB.

First up, I return the idle timeout, as epoch milliseconds, from the idle_watch method:
  timeout: 30*60*1000,
  timeouts: { },

  idle_watch: function(id) {
    if (this.timeouts[id])
      clearTimeout(this.timeouts[id]);

    var self = this;
    this.timeouts[id] = setTimeout(function() {
      Logger.info("timeout " + id +"!");
      self.drop_player(id);
    }, self.timeout);

    // Epoch milliseconds at which this player will be dropped
    return (new Date()).getTime() + self.timeout;
  }
setTimeout itself takes a delay in milliseconds. I hope to use the epoch milliseconds returned from idle_watch to re-establish timeouts. The difference between that stored deadline and the current epoch milliseconds is the remaining number of milliseconds in the timeout:
node> new Date((new Date()).getTime() + 30*60*1000).getTime();
1283131853957
node> new Date(1283131853957);
Mon, 30 Aug 2010 01:30:53 GMT
node> 1283131853957 - (new Date).getTime();
1533424
The only time the idle timeout gets updated is in update_player_status. That method is also responsible for storing updated player status in the backend, so I can use it to store the timeout time:
  update_player_status: function(status) {
    var self = this;
    this.get(status.id, function(player) {
      Logger.debug("[players.update_player_status] " + inspect(player));
      if (player) {
        Logger.info("players.update_player_status: " + status.id);
        player.status = status;
        player.timeout = self.idle_watch(status.id);
        db.saveDoc(player);
      }
      else {
        Logger.warn("[players.update_player_status] unknown player: " + status.id + "!");
      }
    });
  }
With that being stored in CouchDB:



I can add a bit of code to the players init() method to attempt to restore player timeout:
  init: function() {
    var self = this;

    // Now, in epoch milliseconds
    var now = (new Date()).getTime();

    // Grab all players from CouchDB
    self.all(function(players) {
      players.forEach(function(player){
        // Difference between recorded time and now
        var timeout = player.timeout - now;

        // If time left before timeout, begin the wait again
        if (timeout > 0) {
          Logger.info("[init_timeouts] " + player._id + " " + timeout);
          self.idle_watch(player._id, timeout);
        }
        // Otherwise drop the player
        else {
          Logger.info("[init_timeouts] dropping: " + player._id);
          self.drop_player(player._id);
        }
      });
    });

    // Ensure that the faye server has fully established by waiting
    // half a second before subscribing to channels
    setTimeout(function(){ self.init_subscriptions(); }, 500);

    return self;
  }
That seems to work just fine unless a player needs to be deleted. In that case, the drop_player() method needs to broadcast via faye that the player is no longer in the room. The problem is that the faye client is not immediately available.

To get around this, I move the restore-players functionality into an init_players() method to be invoked after the faye client is ready:
  init: function() {
    var self = this;

    // Ensure that the faye server has fully established by waiting
    // half a second before subscribing to channels
    setTimeout(function(){
      self.init_subscriptions();
      self.init_players();
    }, 500);

    return self;
  }
That is all well and good, but does it work?



That is pretty freaking cool! I move my player about in the room a little while, then stop the node.js server with a Ctrl+C. After waiting a few seconds, I restart the server and see that the timeout is restored:
cstrom@whitefall:~/repos/my_fab_game$ ./game.js
[INFO] Starting up...
[INFO] [init_timeouts] bob 1793694
...
Most importantly, I can continue playing as if nothing happened.
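One detail worth a second look: init() calls idle_watch with a second argument (the remaining milliseconds), while the idle_watch listing earlier in this post only accepts an id. As listed, a restored timer would therefore run for the full 30 minutes again. Below is a minimal sketch of a variant that honors the remaining time. The optional `remaining` parameter and the stubbed drop_player are assumptions for illustration, not code from the game.

```javascript
// Hypothetical stand-in for the players object; only the timeout
// bookkeeping is shown, and drop_player merely records the id.
var players = {
  timeout: 30 * 60 * 1000,   // default idle timeout in milliseconds
  timeouts: {},
  dropped: [],

  drop_player: function(id) {
    delete this.timeouts[id];
    this.dropped.push(id);
  },

  // `remaining` (assumed, optional): milliseconds left on a restored timer.
  // When it is omitted, the full default timeout is used.
  idle_watch: function(id, remaining) {
    if (this.timeouts[id]) clearTimeout(this.timeouts[id]);

    var wait = (typeof remaining === "number") ? remaining : this.timeout;
    var self = this;
    this.timeouts[id] = setTimeout(function() {
      self.drop_player(id);
    }, wait);

    // Epoch milliseconds at which the player will be dropped
    return (new Date()).getTime() + wait;
  }
};
```

With this shape, update_player_status can keep calling idle_watch(id) unchanged, while the restore code passes the difference between the stored deadline and now.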


Day #210

by eee.c (chris.eee@gmail.com) at August 30, 2010 02:36 AM

August 29, 2010

Ricky Ho

The Limitations of SPARQL

Recently, I have been looking at the RDF model and trying to compare it with the property graph model that I mentioned in a previous post. I have also looked at the SPARQL query model. While I think it is a very powerful query language based on variable bindings, I have also observed a couple of things that it doesn't handle well.

Note that I have only used SPARQL in very simple examples and don't claim to be an expert in this area. I am hoping my post here will invite SPARQL experts to share their experience.

Here are the limitations that I have seen.

Support of Negation

Because of the “Open World” assumption, SPARQL doesn’t support negation well, which means that expressing negation in SPARQL is not easy.
  • Find all persons who are Bob’s friends but don’t know Java
  • Find all persons who know Bob but don’t know Alice

Support of Path Expression

In SPARQL, expressing a variable-length path is not easy.
  • Find all posts written by Bob’s direct and indirect friends (everyone reachable from Bob)

Predicates cannot have Properties

This may be an RDF limitation that SPARQL inherits. Since RDF represents everything as triples, it is easy to implement properties of a node using extra triples, but it is very difficult to implement properties on edges.

In SPARQL, there is no way to attach a property to a “predicate”.
  • Bob has known Peter for 5 years

RDF Inference Rule

Inference rules are built around RDFS and OWL, which focus mainly on type and set relationships, and are implemented using a Rule: (conditions => derived triple) expression. But it is not easy to express a derived triple whose object’s value is an expression over existing triples.
  • Family income is the sum of all individual members’ incomes

Support of Fuzzy Matches with Ranked results

SPARQL is based on a boolean query model that is designed for exact matches. Expressing a fuzzy match with ranked results is very difficult.
  • Find the top 20 posts that are “similar” to this post, ranked by degree of similarity (let’s say similarity is measured by the number of common tags that the 2 posts share)

I am also very interested to see whether there is any large-scale deployment of RDF graphs in real-life scenarios. I am not aware of any popular social network sites that use RDF to store the social graph or social activities. I guess this may be due to the scalability of today's RDF implementations. I may be wrong though.

by Ricky Ho (rickyphyllis@gmail.com) at August 29, 2010 04:04 AM

Chris Strom

More Callbacks with node-couchdb


Yesterday, I was able to get player updates in my (fab) game stored in a CouchDB backend thanks to node-couchdb. Today, I hope to move the entire store over to CouchDB.

First up, I need to be able to drop players:
  drop_player: function(id) {
    Logger.info("players.drop_player " + id);
    this.faye.publish("/players/drop", id);

    this.get(id, function(player) {
      Logger.debug("[players.drop_player] " + inspect(player));
      if (player) db.removeDoc(id, player._rev);
    });
  }
To delete in CouchDB, I need the current revision of the document. I use the get() method from last night to grab the current document, then remove it immediately. I could keep the revision ID in the local store to get around the lookup/then delete. I will worry about doing that when this proves to be a bottleneck.
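For what it is worth, the "keep the revision ID locally" idea could be sketched as below. Everything here is an assumption for illustration: fakeDb is an in-memory stand-in, and its getDoc/removeDoc helpers are hypothetical, not the actual node-couchdb API.

```javascript
// In-memory stand-in for the CouchDB handle (hypothetical helpers).
var fakeDb = {
  docs: { bob: { _id: "bob", _rev: "1-abc" } },
  lookups: 0,
  getDoc: function(id, callback) {
    this.lookups++;
    callback(null, this.docs[id]);
  },
  removeDoc: function(id, rev) {
    // Like CouchDB, only delete when the revision matches the stored one
    if (this.docs[id] && this.docs[id]._rev === rev) delete this.docs[id];
  }
};

var revs = {};  // player id -> last revision seen locally

// Call this whenever a document is read or saved
function rememberRev(doc) {
  revs[doc._id] = doc._rev;
}

function dropPlayer(db, id) {
  if (revs[id]) {
    // Cached revision available: delete without a lookup
    db.removeDoc(id, revs[id]);
    delete revs[id];
    return;
  }
  // No cached revision: fall back to lookup, then delete
  db.getDoc(id, function(err, doc) {
    if (!err && doc) db.removeDoc(id, doc._rev);
  });
}
```

The trade-off is that a cached revision can go stale if anything else updates the document, in which case the delete would be rejected and the lookup path is needed anyway.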

Up next, I need to rework the idle timeout code a bit. I had been storing the idle timeout directly on the player objects in the local store (this._):
  idle_watch: function(id) {
    if (this._[id].idle_timeout) {
      clearTimeout(this._[id].idle_timeout);
    }

    var self = this;
    this._[id].idle_timeout = setTimeout(function() {
      Logger.info("timeout " + id +"!");
      self.drop_player(id);
    }, 30*60*1000);

    this._[id].idle_watch_started = "" + (new Date());
  }
Since I am no longer storing data locally, I need to create a new local store for these timeouts. Easy enough:
  timeouts: { },

  idle_watch: function(id) {
    if (this.timeouts[id])
      clearTimeout(this.timeouts[id]);

    var self = this;
    this.timeouts[id] = setTimeout(function() {
      Logger.info("timeout " + id +"!");
      self.drop_player(id);
    }, 30*60*1000);
  }
The timeouts are now stored in the timeouts attribute in my players object, which leaves very little still relying on the old local store. After replacing a couple more individual gets from the local store, I am left with only one more local store reference—a lookup of all players in a faye subscription:
  this.faye.subscribe("/players/query", function(q) {
    var ret = [];
    for (var id in self._) {
      ret.push(self._[id].status);
    }

    self.faye.publish("/players/all", ret);
  });
As with everything else in node-couchdb, a lookup of all documents in CouchDB is done via callback:
  all: function(callback) {
    return db.allDocs({include_docs:true}, function(err, docs) {
      callback(docs.rows.map(function(row) {return row.doc;}));
    });
  }
I use {include_docs:true} to pull back the player documents rather than just the metadata that is normally returned. Similarly, I map those results to return just the player documents (no metadata). With that, I need to convert the faye subscription to use the callback:
  all: function(callback) {
    return db.allDocs({include_docs:true}, function(err, docs) {
      callback(docs.rows.map(function(row) {return row.doc;}));
    });
  }
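The listing directly above shows the all() method again rather than the subscription itself. A converted subscription could look roughly like the following sketch. The faye and all() stubs, and the sample player documents, are hypothetical and exist only so the snippet runs on its own.

```javascript
// Stubs so the sketch is self-contained; the real game uses a faye
// client and node-couchdb instead of these hypothetical objects.
var published = [];
var handlers = {};

var players = {
  faye: {
    subscribe: function(channel, handler) { handlers[channel] = handler; },
    publish: function(channel, message) { published.push([channel, message]); }
  },

  // Same contract as all() above: calls back with the player documents
  all: function(callback) {
    callback([
      { _id: "bob",   status: { id: "bob",   x: 10, y: 20 } },
      { _id: "alice", status: { id: "alice", x: 5,  y: 5  } }
    ]);
  },

  init_subscriptions: function() {
    var self = this;
    this.faye.subscribe("/players/query", function(q) {
      // Publish only once the asynchronous lookup has called back
      self.all(function(docs) {
        var ret = docs.map(function(doc) { return doc.status; });
        self.faye.publish("/players/all", ret);
      });
    });
  }
};

// Simulate a client asking who is in the room
players.init_subscriptions();
handlers["/players/query"]({});
```

The key change from the local-store version is that the publish moves inside the callback, since the list of players is no longer available synchronously.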
And finally, I can remove the local store entirely. I am still storing timeouts locally, but that is a code timeout, so there is really no choice in the matter. Now that I think about it, I ought to be able to save the timeout time in the CouchDB store. If I do that, I could restore the software timeout after a server restart. I will pick back up with that tomorrow.


Day #209

by eee.c (chris.eee@gmail.com) at August 29, 2010 02:54 AM

August 28, 2010

Couchio

CouchDB: The Definitive Guide — Redesigned Website, Up To Date Content and All Open Source And Forkable on GitHub

Last week we (Noah, Chris and Jan, with great design help from Kristina Schneider) released our latest work on CouchDB: The Definitive Guide.

The latest update includes:

  • Updated and edited content as it appears in the printed book.
  • All new styles for your reading pleasure.
  • All new open source setup using Github for contributions and translations.

It took quite a while to get it all out, but we couldn’t be more proud about what we can offer you now. O’Reilly’s great editorial team went through all our chapters and cleaned up all the nitty-gritty and we think it turned out great. It feels almost like a real book now ;)

The Long Haul

The bulk of the work went into making sure the book is super easy to work on in the future. We want to be able to improve the book and collaborate with the open source community as much and as efficiently as possible.

Minimal Technology

This includes switching the sources from Asciidoc to a subset of HTML. This allows us to work with the native web, our primary publication platform, without having to deal with conversion issues left and right (trust us on that one).

The minimal nature of our markup might throw you off (we don’t even close our <p>’s!) but it is really great to write in and keeps everything lean and free from technological cruft that is not really needed. We encourage you to view source at least once.

A small XSLT (yuck, we know ;) transforms our minimal HTML into DocBook which O’Reilly in turn can take and produce a book from every once in a while.

Github Goodness

We’ve all been using Git and Github for quite some time in other projects and we finally migrated our book repository over (yes, Kristina is an avid Git user :). Half way through, Github introduced Organizations and it is just perfect for our needs.

You can now fork, edit, and contribute back to the book without much hassle. This is a good opportunity to thank the folks at O’Reilly again for their commitment to open source and for allowing us to publish our work under the Creative Commons license.

Aftermath

Getting everything together truly felt like shipping a 1.0 project. Once we agreed on a final date, we cut our todo list in half in favour of getting done in time. There is still work to do, but we managed to get all the rough edges out.

The result is amazing: in the few days we’ve been live, more work on the content has been done than in the past six months. We have already included contributions from third-party contributors, and the German translation is already two chapters in.

And nothing is stopping you from helping out :)

Here’s to the second edition!

— Chris, Noah, Jan & Kristina

August 28, 2010 10:22 PM

What’s new in CouchDB 1.0 — Part 4: Security’n stuff: Users, Authentication, Authorisation and Permissions

Welcome to Part 4 of my little mini-series on new features in CouchDB 1.0. Do not miss parts one, two and three.

Today, I get a little help from Rebecca. She’s writing a CouchApp, an application that is served right out of CouchDB and that lives in the browser. It has no middle tier application server in Ruby or Java. The application and display logic is written in JavaScript, the user interface is HTML & CSS, the backend is CouchDB and uses Ajax to shove JSON back and forth.

Rebecca is writing a small todo list app for herself and her friends, but we’ll punt on the actual application for now, so we can concentrate on the security features. Parts two and three of our book CouchDB: The Definitive Guide explain how the rest of the application development works; make sure to read up on it!

Security

Security is a wide field. This article only discusses some of the things you need to know to add authenticated-user features to your CouchDB application (whether it is a CouchApp or a regular application). It does not discuss best practices for securing network servers or defending against cross-site scripting.

View Source is Open Source

The CouchDB security model is based around the premise that Rebecca can control who can create documents of what form in which database inside CouchDB. It does not try to make CouchDB, and all the data she and others put in, absolutely watertight so that no information leaks. Although you can lock CouchDB down as much as you need, open and shareable databases are the default, and that is a good thing.

Traditional applications are built with a database in the back; an application is the only logical “user” of a database, the only entity that accesses the database directly (aside from maybe an administrator). CouchDB happily supports that model (there is nothing wrong with it either). The only case where security is relevant here is shared hosting where you have multiple mutually untrusting parties accessing a single CouchDB instance. The mechanics I describe can be used to make CouchDB useful here, but I won’t describe this specific scenario. Partly because it is a lot simpler to just give every user a separate full instance of CouchDB with “root” access, but mostly because I believe there is a much more interesting deployment scenario:

The idea of standalone CouchApps is that they travel with the data, since they are just some HTML, CSS & JavaScript that CouchDB tacks onto a _design document as attachments. Applications as data allow us to replicate them around just like we do with data. Data ultimately wants to be free and shareable with the people and applications we trust. Why shouldn’t applications do the same?

Since CouchApps run in the browser you can’t hide their implementation anywhere. In that, CouchApps are inherently Open Source and we believe that is a good thing because that is how the web works and that’s a crucial feature of the web. View source allows everyone curious to learn how a website was built.

Yadda, yadda, Open Source zealotry, you can’t hear it anymore, sorry :) If you like your Rails, Django, PHP or Java, CouchDB won’t prevent you from using them. You can create private, closed source applications, but you’re losing the powerful attribute of native app-shareability.

All that said you can make CouchDB as closed as you need it, but with each barrier to entry you lose a layer of data and application flexibility. You might want to reconsider some of your previous ideas about how to lock down your app in the light of ultra-portable, peer-to-peer shareable applications.

Ok, CouchApps are a big deal, you get that now, I’ll shut up. Back to the nuts and bolts of CouchDB security :)

Terminology

Let’s make sure we talk about the same things.

  • Admin Party: By default, CouchDB comes in admin party mode. Each request made to CouchDB is considered to come from an admin. This means it is extremely easy to get started. Isn’t that terribly insecure, you ask? By default CouchDB only listens on 127.0.0.1, your localhost IP address, so only users on your computer can access CouchDB. Most of the time that’s just you, so no biggie; but be aware of this when you are on a machine with multiple, possibly untrusted users.

  • Database: A database is a bucket that holds any number of documents in CouchDB. Each CouchDB server can have any number of databases. Each database is self contained and access to CouchDB can be defined on a per-database level. I’ll show you how below.

  • User: A user is identified by a username and matching password that is securely stored inside CouchDB. A user can have one or more roles assigned. A user with an empty name and password is the anonymous user. CouchDB further distinguishes admin users between server admins and database admins.

  • Authentication: The process of a user proving who she is by providing the correct username / password combination in an authenticated HTTP request.

  • Authorisation: The process of determining whether an authenticated user is allowed to do what she wants to do.

  • Roles: Roles are associated with users; you could also call them “groups”. For example, in Admin Party mode, each request is implicitly authenticated with the anonymous user that in turn implicitly gets assigned the _admin role that allows each request to do anything.

  • Anonymous User: A user with an empty username and password. All unauthenticated requests are implicitly assigned to the anonymous user.

  • Access Control Lists: A list of usernames or roles for a database. CouchDB distinguishes reader-ACLs and admin-ACLs. A database admin can fully access the database. The reader-ACL list defines a list of users or roles that can read from the database. If no reader-ACLs are defined, everybody can read from the database. Note that there is no writer-ACL; see Validation Functions next.

  • Validation Functions: A JavaScript function stored in the validate_doc_update field of a _design document. It gets executed whenever a write request reaches the database. It can decide to allow or deny access to the database based on the document that is being written and the authenticated user or her roles.

  • Stateless HTTP: Each HTTP request stands on its own. Client and server do not expect that any previous requests have been made.

  • Basic Auth: An authentication mechanism for HTTP that uses base64 encoded headers to send a user’s credentials to the server. Most notably known for producing an ugly pop-up window in the browser (although this can be prevented). Base64 encoding is not a form of encryption. It is not a safe transport for user credentials. It is easy for third parties to spy out a user’s password. Basic auth can only reasonably be used in a trusted environment: a local LAN, a VPN or over SSL.

  • Secure Cookie Auth: Unlike Basic Auth, Secure Cookie Auth uses HMAC (a keyed hash rather than encryption in the strict sense) to protect the credentials it transports. It can be used to securely authenticate users over an untrusted connection.

  • OAuth: Lets users allow applications to authenticate as the user to a service. The canonical example is a web application that does something with a user’s private account data on another service. With OAuth, the web application does not have to know the user’s credentials to do its work. Access permissions can be managed and revoked on a per-application basis. OAuth is not limited to web applications though.
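To make the Basic Auth entry above concrete, here is a minimal Node.js sketch showing that the header is only encoded, not encrypted: anyone who can see it can decode it without a key. The helper names are made up for this illustration.

```javascript
// Build the Authorization header value a client sends for Basic Auth
function basicAuthHeader(username, password) {
  var raw = username + ":" + password;
  return "Basic " + Buffer.from(raw, "utf8").toString("base64");
}

// Reverse it: no secret is needed, base64 is just an encoding
function decodeBasicAuth(header) {
  var encoded = header.replace(/^Basic /, "");
  var decoded = Buffer.from(encoded, "base64").toString("utf8");
  var sep = decoded.indexOf(":");
  return {
    username: decoded.slice(0, sep),
    password: decoded.slice(sep + 1)
  };
}

var header = basicAuthHeader("rebecca", "12345");
console.log(header);
console.log(decodeBasicAuth(header));
```

This is why Basic Auth should only travel over a connection that is already trusted or encrypted.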

Getting Started with Security in Futon

Let’s start with a blank slate. A fresh installation of CouchDB 0.11.0 or later, the admin party and a look at Futon:

In the lower right you should see that in fact, we are having an admin party.

Admin Party!

You should also see a link that asks you to “Fix This”. Click it and you should see a form that asks you to specify a username and password for the first server admin user.

Creating an Admin User

I put in rebecca and 12345, you can choose whatever else you like. Just remember it, otherwise, you’re locked out of the server (there are ways to get in again, but that’s beyond the scope of this article.)

The lower right should now greet you with your username and offer you to create more admins or to log out. Futon uses Secure Cookie Authentication to keep you logged in.

Logged In

At this point, CouchDB no longer runs in Admin Party mode and requires you to be logged in to perform certain actions: creating or deleting a database; creating, updating or deleting a _design document inside a database; reading or updating the _config API; reading _stats or _log; and requesting temporary views.

Under the Hood

Let’s see what goes on under the hood of Futon. We’re following the same steps as above, only we use curl on the command line instead of Futon.

First, see if CouchDB is running:

> curl http://127.0.0.1:5984/
{"couchdb":"Welcome","version":"1.0.0"}

Yes!

Next, let’s create an admin user:

> curl -X PUT http://127.0.0.1:5984/_config/admins/rebecca -d '"12345"'
""

Well that was easy. We created a config-level admin user and that takes CouchDB right out of admin party mode (intentionally left over production note: reference “partypooper” in some way).

Now all administrative requests to CouchDB (see above) need to be authenticated. To make your life a little easier Futon does a little dance for you and automatically logs you in as the newly created user.

What does “logs you in” mean? First, Futon creates a new document in the _users database. It has a special format that you have to follow if you are doing this on your own.

Luckily, CouchDB’s built-in client libraries couch.js and jquery.couch.js do all the heavy lifting for you.

> cat rebecca.json 
{
  "_id":"org.couchdb.user:rebecca",
  "name": "rebecca",
  "salt": "68cf5946d9760d19759b5016d90f612c",
  "password_sha": "3588a9b2039e53b674d8da361e4be98f00637f5a",
  "type":"user",
  "roles":["admin"]
}
> curl -X PUT "http://127.0.0.1:5984/_users/org.couchdb.user%3Arebecca" \
   -d@rebecca.json
{
   "ok":true,
   "id":"org.couchdb.user:rebecca",
   "rev":"1-9aa9e9a2c855e81061d6d8553d6adbc5"
}

This is laying foundations for the future. Once you have a user document in the _users database, you can use the _session API to get an encrypted session cookie that authenticates you for the next few requests. By default, a session cookie is valid for 10 minutes.

Showing all the cookie business with curl would be a little tedious, so I’ll jump over to how to do all that in your own code.

Aside: What’s the deal with users or admins created with the _config API vs. the _users database? Admins created through the _config API are persisted to CouchDB’s configuration file ($prefix/etc/couchdb/local.ini by default). Users in the _users database are stored in that database. For some setups, it is required that some external tool is able to create a user for CouchDB without having a user or admin account on that CouchDB, but access to the ini file (think system setup software). In addition, _config users are always automatically server admins, so use them with care.

Using jquery.couch.js in Your Application

jquery.couch.js is the standard JavaScript API that ships with CouchDB. Futon uses it for its snazzy interface. And so can you, or should, really, unless you want to re-do all the work the CouchDB project put into it :)

Let me show you the methods in question. I’m quoting right out of the API docs:

$.couch.signup(user_doc, password, options)

  • Hashes the password
  • Adds an empty roles array to the user_doc when not specified
  • Adds an id, composed of “org.couchdb.user:” and name, to the userdoc when not specified
  • Saves the user_doc with options as parameters in the userDb
  • Performs the success callback on the saved user_doc
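The first three steps of signup can be sketched in plain Node.js as follows. This is an illustration rather than the library's code: it assumes the CouchDB 1.0-era scheme of a SHA-1 hex digest over the password concatenated with a random salt, which is my understanding of how password_sha is derived, and the function names are made up.

```javascript
var crypto = require("crypto");

// Assumed scheme: SHA-1 hex digest of password + salt
function hashPassword(password, salt) {
  return crypto.createHash("sha1").update(password + salt).digest("hex");
}

// Build a user document in the format shown earlier in this article
function buildUserDoc(name, password, roles) {
  var salt = crypto.randomBytes(16).toString("hex");
  return {
    _id: "org.couchdb.user:" + name,   // id is the prefix plus the name
    name: name,
    type: "user",
    roles: roles || [],                // empty roles array when not specified
    salt: salt,
    password_sha: hashPassword(password, salt)
  };
}

console.log(buildUserDoc("rebecca", "12345"));
```

Because the salt is random, two users with the same password end up with different password_sha values.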

$.couch.login(options)

  • Does a POST request to “_session” with username and password, they have to be present in the options hash. Throws a 404 error when the password is wrong or there is no user with that username stored in the userDb.

$.couch.logout(options)

  • Does a DELETE request to “_session”.
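If you are not using jquery.couch.js, the login and logout requests described above are simple to build by hand. The sketch below only constructs the request descriptions (it does not send anything), and the helper names are hypothetical. It relies on _session accepting form-encoded name and password fields.

```javascript
// Describe the login request: POST /_session with form-encoded credentials
function buildLoginRequest(name, password) {
  return {
    method: "POST",
    path: "/_session",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: "name=" + encodeURIComponent(name) +
          "&password=" + encodeURIComponent(password)
  };
}

// Describe the logout request: DELETE /_session
function buildLogoutRequest() {
  return { method: "DELETE", path: "/_session" };
}

console.log(buildLoginRequest("rebecca", "12345"));
console.log(buildLogoutRequest());
```

On a successful login the server answers with a Set-Cookie header; sending that cookie back on later requests is what keeps you logged in.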

Concepts

The _session API provides you with a convenient endpoint to manage authenticated requests to CouchDB. A simple GET /_session returns a JSON object detailing your current session state.

> curl http://127.0.0.1:5984/_session | jsonpretty 
{
  "userCtx": {
    "name": null,
    "roles": [

    ]
  },
  "info": {
    "authentication_handlers": [
      "oauth",
      "cookie",
      "default"
    ],
    "authentication_db": "_users"
  },
  "ok": true
}

userCtx is where all the authentication information is stored. name is your login username and roles is the list of roles assigned to your user.

info has some server-wide information about the authentication system. authentication_handlers are the different ways CouchDB can do the actual authentication process for you. By default CouchDB ships with an OAuth handler, a cookie handler and the default handler (which does HTTP basic auth). The authentication_db is the database that user documents are stored in. The default is _users, but you can change it in the CouchDB configuration settings. Only do so if you have a very good reason.

ok just lets us know our request was a-ok.

We made an unauthenticated request to CouchDB, so we don’t see any values for userCtx.name or userCtx.roles. Let’s make one with user credentials:

> curl http://rebecca:12345@127.0.0.1:5984/_session | jsonpretty 
{
  "userCtx": {
    "name": "rebecca",
    "roles": [
      "_admin"
    ]
  },
  "info": {
    "authentication_handlers": [
      "oauth",
      "cookie",
      "default"
    ],
    "authenticated": "default",
    "authentication_db": "_users"
  },
  "ok": true
}

The result looks very similar, but this time the values inside userCtx are filled out. We used HTTP basic auth. The details of OAuth authentication are out of the scope of this article, but we sure should feature them at some point.

Cookie authentication, or by its full name Secure Cookie Authentication, works by granting access through HMAC digest transported credentials and one-time tokens. To increase convenience, a one-time token is actually valid for 10 minutes by default, but you can adjust that as needed.

We showed you the login() method of jquery.couch.js earlier, use it to log a user into CouchDB with cookie authentication.

Roles

With roles you can group multiple users. We’ll show you in a bit how roles allow you to define permissions on the CouchDB server and individual databases. A role is a simple string that doesn’t start with an underscore. Underscore-roles are reserved for CouchDB. Your roles can be anything, really. The only role that CouchDB prescribes is the _admin role (with an underscore, see?). It grants the user server-wide privileges to do anything.

ACLs, Database Admins & Validation Functions

To allow more fine-grained control over who can read from your databases, CouchDB comes with Access Control Lists (ACLs), Database Admins and Validation Functions.

Each database in CouchDB comes with its own security object. It is not a document, but simply a JSON structure associated with the database. On a newly created database, it looks like this:

{}

The empty object, duh :)

You can set two properties, admins and readers. Each is a JSON object with the two properties roles and names, which are lists of roles and usernames respectively.

Here is an example:

{
  "admins": {
    "roles": [],
    "names": ["rebecca"]
  }
}

For this database, our user rebecca is the admin. A database admin has full read access to the database as well as the ability to update the security object. You can add more users:

{
  "admins": {
    "roles": [],
    "names": ["rebecca", "pete"]
  }
}

Or if that starts to get tedious, you can assign roles, and then by adding roles to a user, they automatically inherit the right to administer the database. This assumes “rebecca” and “pete” each have the “local-heroes” role assigned:

{
  "admins": {
    "roles": ["local-heroes"],
    "names": []
  }
}

Readers

Now this is awesome :) — You can specify in the same way a list of usernames or roles to grant read-access to a database. If no readers are specified, everyone can read your database. This is cool, again, public databases make the world a better place.

In case you want only specific authenticated users to be able to read from your database, use the security object:

{
  "admins": {
    "roles": ["local-heroes"],
    "names": ["rebecca", "pete"]
  },
  "readers": {
    "roles": ["lolcat-heroes"],
    "names": ["simon", "ben", "james"]
  }
}

Now simon, ben and james are among your trusted readers as well as all users with the role “lolcat-heroes”.

There is no need to add names or roles from the admins section, since they automatically are also readers.
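The read rules described in this section can be summarized in a small function. This is only a sketch of the rules as stated in the text (no readers means public, admins are implicitly readers, server admins can do anything), not CouchDB's actual implementation, and the function names are made up.

```javascript
// True when the user is named in the section or holds one of its roles
function listed(section, userCtx) {
  if (!section) return false;
  var names = section.names || [];
  var roles = section.roles || [];
  if (userCtx.name && names.indexOf(userCtx.name) !== -1) return true;
  return (userCtx.roles || []).some(function(role) {
    return roles.indexOf(role) !== -1;
  });
}

// True when a section is missing or lists neither names nor roles
function isEmpty(section) {
  return !section ||
    ((section.names || []).length === 0 && (section.roles || []).length === 0);
}

function canRead(security, userCtx) {
  // Server admins can do anything
  if ((userCtx.roles || []).indexOf("_admin") !== -1) return true;
  // No readers specified: the database is public
  if (isEmpty(security.readers)) return true;
  // Database admins are automatically readers as well
  return listed(security.readers, userCtx) || listed(security.admins, userCtx);
}

var security = {
  admins:  { roles: ["local-heroes"],  names: ["rebecca", "pete"] },
  readers: { roles: ["lolcat-heroes"], names: ["simon", "ben", "james"] }
};

console.log(canRead(security, { name: "simon", roles: [] }));
console.log(canRead(security, { name: null, roles: [] }));
```

Writing the rules out this way also makes the default explicit: an empty security object leaves the database readable by everyone, including the anonymous user.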

Validation Functions or How to Control Write Access

What about restricting write access, can you just create a new property writers in the security object and do as before? — No, for this, you will be using a validation function.

CouchDB has had validation functions for quite some time, and they have always been the way to restrict write access to your database. The cool thing with validation functions is that they have full access to the document a user is trying to write as well as the user context, i.e. the username and any roles.

This allows a validation function to reject a document write because of both user-authentication (or the lack thereof) and document content or structure.

Validation functions are invoked once for every document that is written to the database. Each one is passed the document to be written, the previous revision of the document, if it exists, and the user context. To block a document write, the validation function needs to throw an exception. The return value is ignored. If no exception is thrown, the document write can proceed.

Here are a few examples.

Disallowing anonymous writes:

function(new_doc, old_doc, userCtx) {
  if(!userCtx.name) {
    // CouchDB sets userCtx.name only after a successful authentication
    throw({forbidden: "Please log in first."});
  }
}

Only allow writes to users with a certain role:

function(new_doc, old_doc, userCtx) {
  if(userCtx.roles.indexOf("editors") === -1) {
    // sure lovely that JavaScript doesn’t
    // have an Array.includes() method
    throw({unauthorized: "You are not an editor."});
  }
}

Only allow updates by the author (this assumes that the user sets his or her username as new_doc.name):

function(new_doc, old_doc, userCtx) {
  // compare the name stored on the incoming document with the logged-in user
  if(new_doc.name != userCtx.name) {
    throw({unauthorized: "You are not the author"});
  }
}
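Since a validation function sees both the user context and the document, the two kinds of checks can be combined. Here is a minimal sketch, assuming a hypothetical rule that every document needs a string title. The tryWrite helper is a local stand-in for the database, handy for trying functions out before deploying them; it is not part of CouchDB.

```javascript
// Combines an authentication check with a document-structure check.
// The "title" requirement is a hypothetical rule for this example.
function validate(new_doc, old_doc, userCtx) {
  if (!userCtx.name) {
    throw({forbidden: "Please log in first."});
  }
  if (!new_doc._deleted && typeof new_doc.title !== "string") {
    throw({forbidden: "Documents need a title."});
  }
}

// Local stand-in for the database (not a CouchDB API): call the
// function, treat any thrown object as a rejection and ignore the
// return value.
function tryWrite(fn, new_doc, old_doc, userCtx) {
  try {
    fn(new_doc, old_doc, userCtx);
    return {ok: true};
  } catch (e) {
    return {ok: false, reason: e.forbidden || e.unauthorized};
  }
}

var anonymous = {name: null, roles: []};
var rebecca = {name: "rebecca", roles: ["editors"]};

console.log(tryWrite(validate, {title: "Hello"}, null, anonymous));
// { ok: false, reason: 'Please log in first.' }
console.log(tryWrite(validate, {body: "no title"}, null, rebecca));
// { ok: false, reason: 'Documents need a title.' }
console.log(tryWrite(validate, {title: "Hello"}, null, rebecca));
// { ok: true }
```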

Conclusion

Alright, this was a really long post and we should wrap it up. We hope to have given you a good overview of the security concepts in CouchDB and enough pointers to keep you reading and experimenting.

August 28, 2010 09:22 PM

Chris Strom

Updating CouchDB from Node.js (It's asynchronous)

Tonight, I would like to explore hooking up the player store in my (fab) game to CouchDB.

First up, I install node-couchdb (it is in npm as couchdb, not node-couchdb):
cstrom@whitefall:~/repos/my_fab_game$ npm install couchdb
npm it worked if it ends with ok
npm cli [ 'install', 'couchdb' ]
npm version 0.1.26
...

npm activate couchdb@1.0.0
npm build Success: couchdb@1.0.0
npm ok
With that, I can declare CouchDB variables in my (fab) game:
var express = require('express'),
    http = require('http'),
    faye = require('faye'),
    puts = require( "sys" ).puts,
    p = require( "sys" ).p,
    couchdb = require('node-couchdb/lib/couchdb'),
    client = couchdb.createClient(5984, 'localhost'),
    db = client.db('my-fab-game');


...
When a player first enters the game room, I need to create a new record in the DB. The saveDoc method ought to do the trick:
  add_player: function(player) {
    var new_id = player.id;
    if (!this.get(new_id)) {
      this._[new_id] = {token: player.authToken};
      db.saveDoc(new_id, this._[new_id]);
    }
    delete(player['authToken']);

    this.update_player_status(player);
  }
Checking in the DB, I see the record created. Easy enough! Now I would like to be able to update that record as the player moves around the room.

To accomplish this, I need to pull back the record from the DB, modify the data accordingly, then save it back. The getDoc() and saveDoc() methods sound about right:
  get: function(id) {
    db.getDoc(id);
  },

  update_player_status: function(status) {
    var player = this.get(status.id);
    p(player);
    if (player) {
      puts("[update_player_status] " + status.id);
      player.status = status;
      db.saveDoc(player);
      this.idle_watch(status.id);
    }
    else {
      puts("[update_player_status] unknown player: " + status.id + "!");
    }
  }
But nothing happens. The initial record is created, but not updated. And, I'm seeing that "unknown player" message, so nothing is being returned from the get() method.

Ah! Nothing is getting returned because I have no return statement:
  get: function(id) {
    return db.getDoc(id);
  }
That should fix it. But of course it does not.

What the hell? Why is getDoc() not returning anything? Even in node-repl, there is nothing.

And then, I actually read the documentation.

The getDoc() method, like all methods that retrieve data in node-couchdb, does not return anything. It expects a callback to which it sends both errors and returned data.
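To see why the return-based version comes back empty, here is a minimal sketch with a made-up asynchronous client. fakeDb and its contents are hypothetical stand-ins, not the node-couchdb API; only the callback shape (error first, then data) follows the description above.

```javascript
// fakeDb is a hypothetical stand-in for an asynchronous client;
// it is not the node-couchdb API.
var fakeDb = {
  docs: { "player-1": { _id: "player-1", token: "abc" } },
  getDoc: function (id, callback) {
    var self = this;
    // Deliver the result later, as a network client would.
    setTimeout(function () {
      var doc = self.docs[id];
      if (doc) {
        callback(null, doc);
      } else {
        callback({ error: "not_found" });
      }
    }, 0);
    // Note: nothing is returned here.
  }
};

// Return-based use: the value is undefined because the result
// does not exist yet when getDoc() returns.
var returned = fakeDb.getDoc("player-1", function () {});
console.log(returned); // undefined

// Callback-based use: the document arrives in the callback.
fakeDb.getDoc("player-1", function (err, doc) {
  console.log(err, doc);
});
```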

I do not care about errors (if the player is not in the room, the message should be ignored), so I log the occurrence and send the response data onto the requesting function via a callback:
  _get: function(id, callback) {
    puts("trying to get: " + id);
    db.getDoc(id, function(err, res) {
      if (err) {
        // log the error and carry on; res is passed along as-is
        puts(JSON.stringify(err));
      }
      callback(res);
    });
  }
Putting this to use in the update_player_status method requires almost no change to the original, other than wrapping it in a callback for the get method:
  update_player_status: function(status) {
    var self = this;
    this._get(status.id, function(player) {
      p(player);
      if (player) {
        puts("[update_player_status] " + status.id);
        player.status = status;
        db.saveDoc(player);
        self.idle_watch(status.id);
      }
      else {
        puts("[update_player_status] unknown player: " + status.id + "!");
      }
    });
  }
With that, I can move about the room and the player status is updated in CouchDB:



Nice! That is a fine stopping point for now. Tomorrow I will (hopefully) finish adapting the store to CouchDB with a stretch goal of trying to tap into the changes stream.


Day #208

by eee.c (chris.eee@gmail.com) at August 28, 2010 03:28 AM

August 27, 2010

Damien Katz

August 21, 2010

Damien Katz

The Little Comedian

Gwen: Knock knock.

Me: Who's there?

Gwen: Knock knock.

Me: Who's there?

Gwen: Knock knock.

Me: Who's there?

Gwen: Banana!

Me: Banana who?

Gwen: Orange you glad I didn't say orange again?

Us: BWAHAHAHA!

by Damien Katz at August 21, 2010 01:53 AM

August 18, 2010

Couchio

Don't Reinvent The Wheel by Josh Berkus @ CouchCamp

The talk description that went out in some recent PR about Josh Berkus at CouchCamp wasn’t quite accurate, my bad.

Here is a description of the talk Josh will be giving at CouchCamp that we are all very much looking forward to.

Don’t Reinvent The Wheel

CouchDB, as a new database, is doing a lot of new cool stuff. But bucking the conventional wisdom doesn’t mean that you need to be ignorant of database history; the older databases like PostgreSQL have decades of experience which the Couch community can profitably steal from. This talk will touch on issues like scaling, security, complex queries, data architecture, optimization, upgrades, long-term maintenance, and standards that developers and users of Couch should be thinking of for the future of the database.

August 18, 2010 05:20 PM

John Wood

Slides From My Intro To CouchDB Talk

Thanks to everybody who showed up at Monday’s ChicagoDB meeting for the great discussion on MapReduce and my talk on CouchDB. Slides from my talk can be found on Slideshare, and the files/commands that were used for the demo can be found on github. As usual, please don’t hesitate to email me with any questions or comments.

See everybody next month!

by John Wood at August 18, 2010 01:40 PM

August 17, 2010

Couchio

Guest blog post from Max Ogden, creator of PDX API

We posted a case study today on PDX API, which is a JSON API that provides access to data from CivicApps, the open data initiative by the City of Portland. Max worked with us to write a blog post that gives some background information about working with government geo data. We really appreciate Max taking the time to help share his use of CouchDB with the community.

Now for Max’s blog post…

Portland, PDX API and GeoCouch…

In the fall of 2009 the City of Portland graciously hosted the volunteer organized WhereCamp conference at Metro headquarters. Metro is a regional government organization that, among other things like operating many local parks and the zoo, acts as a data warehouse for the 45+ municipalities in the greater Portland area. WhereCamp is a geo un-conference, where instead of lectures there are group discussions on proposed topics. One of the highlights was a brainstorming session led by Metro employees regarding how Metro can release their datasets to the public.

Since Sam Adams, the City of Portland’s current ‘younger, tech savvy’ mayor, took office, the idea of a city wide open data initiative had been drifting closer to reality. Metro is predictably bureaucratic, and there was apprehension within Metro about releasing data directly onto the internet. They didn’t want to have to spend a significant amount of money to engineer some sort of web infrastructure for hosting their vast collection of geo data. I have found that Metro’s concerns are also echoed throughout the region at all levels of government.

The solution came in the form of a joint effort amongst all of the major civic entities in Portland (such as Portland public schools, Metro, public transportation, etc). They started hosting around 100 raw GIS files of various shapes and sizes from CivicApps (http://www.civicapps.org), a website that they created for the initiative. The datasets themselves range in size from lists of bicycle parking racks to outlines of city parks and neighborhood boundaries.

After downloading some datasets and attempting to interact with the data it quickly became evident that the workflow was incredibly clunky. Most of the datasets are created and released as a Shapefile, which is a proprietary desktop GIS format. Expensive desktop GIS software is capable of analyzing the geo data in a Shapefile in many amazing ways, but it isn’t easy to extract the data out for other use cases. Shapefiles are the GIS equivalent of a Word document. Sure, the useful data is in there somewhere, but it’s buried deep within years of vestigial formatting bloat. CivicApps provides raw data downloads of entire datasets, which is a good start, but the same data could be distributed in a more efficient and accessible manner. Public geographic data ought to live on a server, but be accessed à la carte in small, efficient and interesting chunks rather than the entire dataset at a time.

For example, one of the datasets available on CivicApps contains all restaurants in the Portland area. I wanted to see which restaurants were near my house, but in order to find the handful of restaurants in my neighborhood out of the 3300 listed it took countless open source data conversion utilities, hours of reading documentation and many cups of coffee. After going through this process a few times I decided that nobody else should have to dive that deep into GIS-land in order to get at the data in a meaningful way.

There is a definite disconnect between government and open source when it comes to understanding the term ‘accessible data’. Most non-GIS developers aren’t going to want to learn how to work with a Shapefile. A great strategy for gaining adoption in open data initiatives is to build distribution tools that work around the constraints of the existing government data. Portland’s regional government, along with most GIS users at the professional level, uses Shapefiles almost exclusively. You aren’t going to convince an entire region full of career GIS developers to convert their datasets to some random open source format. It is the responsibility of the community to develop tools to convert the government’s raw data into a more usable form.

Government-level open data initiatives are increasing in frequency for a variety of reasons. I believe that they will become successful when a symbiotic relationship forms between the regional government (data suppliers) and the developer community within the region (data consumers). When local developers create applications using government data, governments save money because they no longer have to hire contractors to create the applications themselves, and citizens get to reap the benefits of applications with rich data created from government maintained datasets.

This means that you need a platform for hosting open data that is built on formats that developers already know.

GeoCouch is CouchDB plus a set of geospatial extensions. A mostly rewritten version was recently released that features a super fast R-tree spatial indexing implementation. GeoCouch didn’t take long to set up and populate with GeoJSON. Of the many formats to describe geographic data, the most ubiquitous is perhaps GeoJSON. GeoJSON is a standardized way of representing geographic data in pure JSON, and therefore you can throw any GeoJSON object at any application with a JSON parser. CouchDB uses JSON for transferring data, so GeoCouch naturally has baked-in support for GeoJSON.

I created some utilities to facilitate the conversion workflow from the raw Shapefiles that come from the government to documents in a GeoCouch instance. The overall process involves dumping the Shapefiles into a PostGIS instance, exporting GeoJSON from PostGIS, and bulk importing the GeoJSON into GeoCouch. This is the process that I have found to be the most fault tolerant when performing coordinate transformations against large datasets.
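The last step of that workflow, turning exported GeoJSON into a bulk import, might look roughly like this. It is a sketch rather than the utilities mentioned above: the toBulkDocs helper, the document layout and the sample feature are all assumptions. The only CouchDB-specific part is that the _bulk_docs endpoint accepts a JSON body with a docs array.

```javascript
// Wrap each GeoJSON feature in a document for a bulk import.
// The geometry/properties document layout and the sample
// feature are assumptions made for this sketch.
function toBulkDocs(featureCollection) {
  return {
    docs: featureCollection.features.map(function (feature) {
      return {
        geometry: feature.geometry,
        properties: feature.properties || {}
      };
    })
  };
}

var exported = {
  type: "FeatureCollection",
  features: [{
    type: "Feature",
    geometry: { type: "Point", coordinates: [-122.6765, 45.5231] },
    properties: { name: "Example bike rack" }
  }]
};

// This string would be the body of a POST to /<database>/_bulk_docs.
var payload = JSON.stringify(toBulkDocs(exported));
console.log(payload);
```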

Once the data has been loaded into GeoCouch it instantly turns into a nice, clean REST API for developers who want to be able to run bounding box queries against municipal datasets from Portland. For example, anyone can ask my GeoCouch, PDXAPI (http://www.pdxapi.com), to return a list of bicycle friendly trails in any rectangularly shaped region.
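To make the idea of a bounding box query concrete, here is what it boils down to for point data, done in memory rather than through GeoCouch's index. The inBbox helper and the sample trails are made up; the coordinate order (longitude, latitude) and the bbox order (west, south, east, north) follow GeoJSON.

```javascript
// bbox is [west, south, east, north] in longitude/latitude degrees.
// In-memory illustration only; GeoCouch answers this from its index.
function inBbox(feature, bbox) {
  var lon = feature.geometry.coordinates[0];
  var lat = feature.geometry.coordinates[1];
  return lon >= bbox[0] && lon <= bbox[2] &&
         lat >= bbox[1] && lat <= bbox[3];
}

// Hypothetical sample features.
var trails = [
  { type: "Feature",
    geometry: { type: "Point", coordinates: [-122.67, 45.52] },
    properties: { name: "Trail A" } },
  { type: "Feature",
    geometry: { type: "Point", coordinates: [-122.40, 45.60] },
    properties: { name: "Trail B" } }
];

var bbox = [-122.70, 45.50, -122.60, 45.55];
var hits = trails.filter(function (f) { return inBbox(f, bbox); });
console.log(hits.map(function (f) { return f.properties.name; })); // [ 'Trail A' ]
```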

=== Benefits ===

I initially wrote a simple proximity query server in Ruby that was able to do a proximity query and return objects from a dataset that were closest to a specified point. You could retrieve, for example, the closest 5 bus stops to your current location. The proximity lookup itself was very much brute force and didn’t use any type of spatial indexing. This was okay for a prototype, but definitely wouldn’t have scaled very far. GeoCouch’s spatial indexer, on the other hand, only has to index a particular dataset once and then subsequent lookups are incredibly snappy. GeoCouch’s R-tree is literally thousands of times faster than my own implementation. GeoCouch does the heavy lifting and lets me relax.
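For comparison, a brute-force proximity lookup of the kind described here can be sketched in a few lines. This is not the original Ruby server; it is an illustration in JavaScript with made-up stops, and it shows why the approach scales poorly: every query measures and sorts the whole dataset.

```javascript
// Great-circle distance in kilometres (haversine formula).
function haversineKm(a, b) {
  var R = 6371; // mean Earth radius in km
  var toRad = Math.PI / 180;
  var dLat = (b.lat - a.lat) * toRad;
  var dLon = (b.lon - a.lon) * toRad;
  var h = Math.pow(Math.sin(dLat / 2), 2) +
          Math.cos(a.lat * toRad) * Math.cos(b.lat * toRad) *
          Math.pow(Math.sin(dLon / 2), 2);
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Brute force: measure every point, sort, keep the first k.
function nearest(points, origin, k) {
  return points
    .map(function (p) { return { point: p, km: haversineKm(origin, p) }; })
    .sort(function (x, y) { return x.km - y.km; })
    .slice(0, k)
    .map(function (x) { return x.point; });
}

// Hypothetical sample data.
var stops = [
  { name: "Stop far",  lat: 45.60,  lon: -122.40 },
  { name: "Stop near", lat: 45.521, lon: -122.676 },
  { name: "Stop mid",  lat: 45.53,  lon: -122.66 }
];
var here = { lat: 45.5231, lon: -122.6765 };

console.log(nearest(stops, here, 2).map(function (s) { return s.name; }));
// [ 'Stop near', 'Stop mid' ]
```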

Being able to offer read access to large amounts of municipal data is great, but switching to GeoCouch also lets me accept upstream changes from users, or even create entirely new databases. This is a huge win. Being able to edit documents means that you can, with minimal effort, let users edit any data in Couch in a wiki-like fashion.

There are many projects in Portland that are dedicated to sharing quantitative information about local objects and places. Urban Edibles (http://www.urbanedibles.org) lets anyone contribute the locations of publicly accessible fruit bearing trees or other edible plants. TapLister (http://www.taplister.com) lets users contribute to lists of microbrews on tap at Portland bars. PC-PDX (http://www.pc-pdx.com) is a community calendar for live music.

These are all examples of applications embracing the principles of the civic web. Whereas the social web tries to reinvent and replace real life conversation, the civic web simply tries to augment the systems we already use by encouraging efficient and convenient participation in the happenings of your neighborhood. Your neighborhood is where you work, where you raise children, and where you invest time and emotion, so tools that let people more proactively engage in their communities have a more wholesome and positive long term impact on those communities than social media does.

Don Park (@donpdonp) and I have been working on an example CouchApp for manipulating data in GeoCouch. We’ve adapted the CouchApp to be a wiki for food cart and truck information in the Portland area, available at http://www.foodcartpages.com. If someone wants to start a community database of, say, publicly accessible rope swings in front yards around Portland, they can start a new database on my GeoCouch. They can then take the source code from Food Cart Pages (http://github.com/donpdonp/foodcartpages) and adapt it to work with their new rope swing dataset.

Going a step further, I’ve created an example iPhone native application (an Android app is in the works) for manipulating GeoCouch hosted data. The rope swing dataset developer can grab a copy of my iPhone source code (http://github.com/maxogden/pdx-food-carts-mobile) and adapt it to work with their application. Instead of writing in pure Objective-C, I chose to craft the application using Titanium, a JavaScript framework for cross platform native mobile development. This means that anyone who knows JavaScript can jump in and edit the iPhone source code and tailor the application to fit their use case.

When developing the mobile application, I didn’t need a special CouchDB client-side library in order to interact with the remote data stored on GeoCouch. I was able to use the vanilla AJAX library included with Titanium and fully interact with the data stored on GeoCouch. This is a great example of the importance of the ubiquitous patterns like REST that are present in the design of CouchDB.

GeoCouch happily acts as the centralized data store for both the web based CouchApp version of Food Cart Pages and the iPhone application. When a user on an iPhone takes a photo of a food cart’s menu and uploads it, anyone else consuming data from GeoCouch will see the new photo. As a bonus for using CouchDB, I get free support for conflict resolution and an easy way to store old revisions of documents.

At the end of the day, GeoCouch has, in a few months’ time, helped me to create an ecosystem for regional government and open source developers to share information with citizens, and for those same citizens to share information back. Crafting architecture for sharing large amounts of information over the web is usually no easy task, but GeoCouch has let me focus on the big picture and not get bogged down in the details. I am hoping to enable other developers in Portland and other cities to also see the big picture and start creating applications that not only their communities enjoy, but that they themselves also enjoy.

August 17, 2010 06:41 AM

August 16, 2010

Couchio

CouchCamp Early Bird Registration Closes Tomorrow, August 17th

Hey all, this is just a quick reminder that our early bird sales for CouchCamp close tomorrow, August 17th. Trust us, you want to be there :)

CouchCamp is ideal for anyone interested in learning more about CouchDB, including developers, administrators and business users. The three-day camp will include speaking sessions from Damien Katz, creator of CouchDB, Selena Deckelman, founder of Open Source Bridge, Ted Leung, director of advanced technology at Disney and Stuart Langridge of Canonical, makers of Ubuntu Linux. There will also be unconference sessions led by conference participants.

For more information or to register, visit: http://www.couch.io/couchcamp.

See you there!

August 16, 2010 10:26 AM

Upstream

Upstream Summer Party

Sorry for the short notice but we have been busy with being on vacation :)

Tomorrow (August 17th 20:00) is the date for our annual summer party. As last year’s location ceased to exist we are moving to the Kiki Blofeld.

Be there and let us buy you a drink or two.

by Alexander Lang at August 16, 2010 09:07 AM

August 13, 2010

Damien Katz

WARNING: CouchDB 1.0.0 Data Loss bug

Update August 13, 2010. A data recovery tool is now available: http://wiki.couchone.com/page/repair-tool

If you are running CouchDB 1.0.0 with the default delayed commit setting, you are subject to serious data loss on restart. See this page for instructions on how to force all outstanding commits and configure your server at runtime to run in the safer full commit mode: http://couchdb.apache.org/notice/1.0.1.html

The page also includes details of the bug and a postmortem.

0.11.1 and earlier are unaffected. Servers with delayed commits turned off are unaffected. 1.0.1 will be coming shortly that fixes the problem in all configurations.

DO NOT COMPACT THE DATABASE! Compaction throws away any lost updates permanently.

Data recovery is coming. The CouchDB core contributors are working on a utility to reliably recover lost updates, so no data is lost as long as the database file is not compacted.

by Damien Katz at August 13, 2010 06:42 PM

August 12, 2010

Till Klampäckel

Looking for Two PHP Developers in NYC

Hey everyone,

it's my sincere pleasure to announce that we're looking to fill two positions for PHP developers (entry/junior) in NYC.

Expectations

This is what we look for from candidates:

  • A strong and firm knowledge of PHP5
  • First hand experience with the Zend Framework
  • You've heard of PHPUnit and TDD
  • An idea of what a HTTP request is and the different applications that take part in one
  • You heard of CouchDB, MongoDB or Redis (generally "NoSQL") before

Last but absolutely not least:

We very, very, very much prefer people who contribute(d) to Open Source.

Playground

  • A web start-up.
  • The not-so-standard LAMP stack with: Linux, Nginx, PHP and mostly CouchDB.
  • A lot of time to play with Amazon Web Services.
  • Size matters to you? Databases and indices in the 100 millions.
  • Maybe Solr!
  • Definitely Redis!

... generally, we always try to use the right tool for the job.

If you're interested, please email me your resume:

till+nyphp@imagineeasy.com

If you know someone else and we happen to hire this person my special referral bonus is a couple beers next time we meet. ;-) [Disclaimer: If you're 21, or older.]

by Till Klampaeckel (till@php.net) at August 12, 2010 06:47 PM

August 10, 2010

Couchio

Press Release: Apache CouchDB Now Available on Google Android

Developers can now build web or native mobile applications taking advantage of CouchDB’s peer-to-peer sync 

Oakland, CALIF. – August 10, 2010 – Couchio (http://www.couch.io/), corporate sponsor of the CouchDB post-relational database, today announced that the first release of a CouchDB SDK for Android devices is now available for free download. Designed to take full advantage of CouchDB’s peer-to-peer sync facilities, CouchDB for Android allows developers to build web or native applications that work even if the Internet connection is slow, intermittent or completely down. With continuous access to a local copy of data, developers can leverage their existing knowledge about web technologies to quickly build collaborative business applications on mobile devices.
 

CouchDB for Android allows shared applications to work offline by automatically synchronizing  between platforms, alleviating a common pain point for users. Developers no longer have to develop an application once for the web, once for each mobile platform and then synchronize between the two. 
 

 “Our goal is to provide users with a kick-ass SDK for Android devices to build web and native applications using CouchDB as the device-native data store,” said Damien Katz, creator of CouchDB and CEO of Couchio. “CouchDB now makes sync ubiquitous and part of the mobile computing fabric.”
 

With CouchDB on Android, developers can build applications and access their data freely across devices, desktops or in the cloud, regardless of the network. Palm has already announced that the next version of its webOS will include services for syncing local data with CouchDB.

For more information about CouchDB on Android, or to download it for free, visit http://www.couch.io/android. From an Android device, interested parties can directly install CouchDB through the Android Marketplace.


About Couchio

Couchio (http://www.couch.io/), co-founded by the creator of Apache CouchDB, is the commercial CouchDB company providing services, support, training and hosting. CouchDB is an open source database designed for the reporting and storage of large amounts of semi-structured, document oriented data, unlike SQL databases, which store and report on very structured and correlated data. CouchDB changes the way document-based applications are built, benefiting from the cloud while also keeping data available at the network edges via replication. Couchio has received venture funding from Redpoint Ventures.

Media Contact:

Ray George

Page One PR

650-922-3825

ray@pageonepr.com

August 10, 2010 06:51 PM

August 06, 2010

Couchio

Because we just couldn’t not do it.



August 06, 2010 06:22 PM

Damien Katz

CouchDB Diplom Thesis

It's a whopping 163 pages containing all the nitty-gritty-researchy details on why CouchDB is the number one choice for writing distributed applications in both the small and large scale.

Diplom Thesis: Realisation of a Distributed Application Using the Document-Oriented Database CouchDB by Lena Herrmann

by Damien Katz at August 06, 2010 12:30 AM

August 05, 2010

Couchio

RelaxBack for Thursday 8/5/2010

Upcoming Events

Tonight! 7pm socal.js Mikeal will be talking about CouchDB and node.js.

CouchCamp tickets are on sale for $500 (all inclusive) until August 17th.

RESTFest is going to be September 17th - 18th in South Carolina.

Recent Happenings

New Case study on Migrating to CouchDB from a Relational Database.

Enzo has another post in his series about using CouchDB with Rails about understanding map/reduce.

Damien and Aaron Miller have erlang running on iOS :)

CouchCamp attendee spotlight on Max Ogden.

Lena Herrmann handed in her Thesis on Realisation of a Distributed Application Using the Document-Oriented Database CouchDB.

jchris wrote up a new description of CouchApps.

Jan started a page for everyone to add their local CouchDB meetups.

New expanded docs on installing CouchDB on Windows.

Jobs

Couchio is hiring a ton of roles! 6 week vacation…. I gotta figure out somewhere to travel to.

August 05, 2010 11:58 PM

Ricky Ho

Map/Reduce to recommend people connection

One common feature of social network sites is to recommend connections between people, e.g. "People you may know" on LinkedIn. The basic idea is very simple: if person A and person B don't know each other but have a lot of common friends, then the system should recommend person B to person A and vice versa.

From a graph theory perspective, for each person who is 2-degree reachable from person A, we count how many distinct paths (with 2 connecting edges) exist between this person and person A. Rank this list in terms of the number of paths and show the top 10 persons that person A should connect with.
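Before bringing in Map/Reduce, the counting itself can be sketched for a single person on a small in-memory graph. This is only an illustration with made-up data and a made-up recommend helper; it assumes each friend list contains no duplicates, so every 2-edge path is counted once.

```javascript
// graph maps each person to a list of friends (hypothetical sample data).
function recommend(graph, person, limit) {
  var direct = {};
  graph[person].forEach(function (f) { direct[f] = true; });

  // Count 2-edge paths person -> friend -> candidate.
  var paths = {};
  graph[person].forEach(function (friend) {
    (graph[friend] || []).forEach(function (candidate) {
      // Skip the person themselves and anyone already connected.
      if (candidate === person || direct[candidate]) return;
      paths[candidate] = (paths[candidate] || 0) + 1;
    });
  });

  // Rank by path count (descending), ties broken by name.
  return Object.keys(paths)
    .map(function (name) { return [name, paths[name]]; })
    .sort(function (a, b) { return b[1] - a[1] || (a[0] < b[0] ? -1 : 1); })
    .slice(0, limit);
}

var graph = {
  ricky: ["jay", "peter"],
  jay:   ["ricky", "susan"],
  peter: ["dave", "ricky", "susan"],
  susan: ["jay", "peter"],
  dave:  ["peter"]
};

console.log(recommend(graph, "ricky", 10)); // [ [ 'susan', 2 ], [ 'dave', 1 ] ]
```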

We show how we can use Map/Reduce to compute this top-10 connection list for every person. The problem can be stated as: for every person X, we determine a list of persons X1, X2 ... X10, the top 10 persons that person X has common friends with.

The social network graph is generally very sparse. Here we assume the input records are an adjacency list sorted by name.
"ricky" => ["jay", "peter", "phyllis"]
"peter" => ["dave", "jack", "ricky", "susan"]
We use two rounds of Map/Reduce jobs to compute the top-10 list.

First Round MR Job

The purpose of this MR job is to compute the number of distinct paths between all pairs of people who are separated by 2 degrees.
  • In Map(), we do a cartesian product for all pairs of friends (since these friends may be connected in 2 degrees). We also need to eliminate the pairs that already have a direct connection. Therefore, the Map() function should also emit pairs of directly connected persons. We need to order the key space such that all keys with the same pair of people go to the same reducer. In addition, the direct-connection pair needs to come before the pairs with 2 degrees of separation.
  • In Reduce(), we know all the key pairs reaching the same reducer will be sorted, so the direct-connection pair will come before the 2-degree pairs. The reducer just needs to check if the first pair is a directly connected one and, if so, skip the rest.
Input record ...  person -> connection_list
e.g. "ricky" => ["jay", "john", "mitch", "peter"]
also the connection list is sorted by alphabetical order

def map(person, connection_list)
  # Compute a cartesian product using nested loops
  for each friend1 in connection_list
    # Emit the direct connection (flag 0) so the reducer can
    # eliminate 2-degree pairs that already know each other
    emit([min(person, friend1), max(person, friend1), 0], 1)
    for each friend2 > friend1 in connection_list
      emit([friend1, friend2, 1], 1)

def partition(key)
  # Use the first two elements of the key to choose a reducer
  return super.partition([key[0], key[1]])

def reduce(person_pair, frequency_list)
  # Check if this is a new pair
  if @current_pair != [person_pair[0], person_pair[1]]
    @current_pair = [person_pair[0], person_pair[1]]
    # Skip all subsequent keys for this pair if these two
    # persons already know each other
    @skip = (person_pair[2] == 0)

  if !@skip
    path_count = 0
    for each count in frequency_list
      path_count += count
    emit([person_pair[0], person_pair[1]], path_count)

Output record ... person_pair => path_count
e.g. ["jay", "john"] => 5
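The first round can be tried out on a toy graph with a small in-memory simulation. This is a sketch, not a real Map/Reduce job: the sort stands in for the shuffle phase, the names mapRound1 and runRound1 and the sample graph are made up, and person names are assumed not to contain the "|" character used to build keys.

```javascript
// In-memory simulation of the first round; sample data is hypothetical.
function mapRound1(person, friends, emit) {
  var sorted = friends.slice().sort();
  sorted.forEach(function (f1, i) {
    // Direct connection (flag 0), with the pair in sorted order.
    var pair = [person, f1].sort();
    emit([pair[0], pair[1], 0], 1);
    // Every pair of friends shares `person` as a common friend (flag 1).
    sorted.slice(i + 1).forEach(function (f2) {
      emit([f1, f2, 1], 1);
    });
  });
}

function runRound1(graph) {
  var emitted = [];
  function emit(key, value) { emitted.push({ key: key, value: value }); }
  Object.keys(graph).forEach(function (p) { mapRound1(p, graph[p], emit); });

  // Stand-in for the shuffle: sort by (person1, person2, flag), so the
  // flag-0 key of a pair arrives before its flag-1 keys.
  emitted.sort(function (a, b) {
    var x = a.key.join("|"), y = b.key.join("|");
    return x < y ? -1 : x > y ? 1 : 0;
  });

  var result = {}, currentPair = null, skip = false;
  emitted.forEach(function (e) {
    var pair = e.key[0] + "|" + e.key[1];
    if (pair !== currentPair) {
      currentPair = pair;
      skip = (e.key[2] === 0); // these two already know each other
    }
    if (!skip) result[pair] = (result[pair] || 0) + e.value;
  });
  return result;
}

var sampleGraph = {
  ricky: ["jay", "peter"],
  jay:   ["ricky", "susan"],
  peter: ["dave", "ricky", "susan"],
  susan: ["jay", "peter"],
  dave:  ["peter"]
};

console.log(runRound1(sampleGraph));
```

On this sample, jay and peter end up with a count of 2 (common friends ricky and susan), while directly connected pairs such as jay and ricky are dropped.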



Second Round MR Job

The purpose of this MR job is to rank the connections for every person by the number of distinct paths between them.
  • In Map(), we rearrange the input records so they will be sorted before reaching the reducer.
  • In Reduce(), all the connections of a person are sorted, so we just need to aggregate the top 10 into a list and then write the list out.
Input record = Output record of round 1

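def map(person_pair, path_count)
  # Emit both directions so each person sees the other as a candidate
  emit([person_pair[0], path_count], person_pair[1])
  emit([person_pair[1], path_count], person_pair[0])

def partition(key)
  # Use the first element of the key to choose a reducer
  return super.partition(key[0])

def reduce(connection_count_pair, candidate_list)
  # Keys are assumed to arrive sorted by person, then by path_count descending
  # Check if this is a new person
  if @current_person != connection_count_pair[0]
    # Write out the previous person's list (there is none on the first call)
    emit(@current_person, @top_ten) if @current_person
    @top_ten = []
    @current_person = connection_count_pair[0]

  # Pick the top ten candidates to connect with
  for each candidate in candidate_list
    break if @top_ten.size >= 10
    @top_ten.append([candidate, connection_count_pair[1]])

def cleanup()
  # Assumes the framework calls a hook after the last key;
  # write out the list of the last person seen by this reducer
  emit(@current_person, @top_ten) if @current_person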

Output record ... person -> candidate_count_list
e.g. "ricky" => [["jay", 5], ["peter", 3] ...]
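The second round can be simulated in memory the same way. Again this is only a sketch with made-up input: runRound2 is a hypothetical helper, the sort stands in for the shuffle, and the input is assumed to be round-one output flattened into [person1, person2, pathCount] triples.

```javascript
// Input: round-one output as [person1, person2, pathCount] triples
// (hypothetical sample data).
function runRound2(pairCounts, limit) {
  var emitted = [];
  pairCounts.forEach(function (t) {
    // Each pair is a candidate for both of its members.
    emitted.push({ person: t[0], count: t[2], candidate: t[1] });
    emitted.push({ person: t[1], count: t[2], candidate: t[0] });
  });

  // Stand-in for the shuffle: by person, then path count descending,
  // then candidate name to keep the order deterministic.
  emitted.sort(function (a, b) {
    if (a.person !== b.person) return a.person < b.person ? -1 : 1;
    if (a.count !== b.count) return b.count - a.count;
    return a.candidate < b.candidate ? -1 : a.candidate > b.candidate ? 1 : 0;
  });

  // Reduce: keep the first `limit` candidates for each person.
  var topByPerson = {};
  emitted.forEach(function (e) {
    var list = topByPerson[e.person] || (topByPerson[e.person] = []);
    if (list.length < limit) list.push([e.candidate, e.count]);
  });
  return topByPerson;
}

var round1 = [["dave", "ricky", 1], ["jay", "peter", 2], ["ricky", "susan", 2]];
console.log(runRound2(round1, 10).ricky); // [ [ 'susan', 2 ], [ 'dave', 1 ] ]
```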

by Ricky Ho (rickyphyllis@gmail.com) at August 05, 2010 05:22 AM

August 04, 2010

Damien Katz

We are hiring open source contributors!

We are hiring front end and back end engineers, documentation writer, trainer, release engineer and managers. Must have an open source background. See our jobs page for more info:

http://couch.io/jobs

by Damien Katz at August 04, 2010 11:36 PM

Couchio

CouchCamp Attendee Spotlight: Max Ogden

Max Ogden is a programmer from Portland, OR. Max is becoming quite well known in the open government applications community with his recent work PDXAPI which won the Civic Apps award for best overall utilization of data. Max was also the first CouchCamp ticket buyer and we’re incredibly excited to have him attending.

What was your first CouchDB project?

Trying to set up the old version of GeoCouch back when it had dependencies on Python and Spatialite. I spent more time trying to get the dependencies to compile than I did actually working with any actual geographic data. When the new GeoCouch came out and it didn’t have any external dependencies I was way excited.

What are you currently working on?

I’m working on PDX API (http://pdxapi.com), a developer interface to civic geo datasets in Portland, OR. It’s a big GeoCouch instance that has a bunch of geographic datasets that mostly come from Portland’s regional government agencies.

What is your favorite part of CouchDB?

The ubiquity of JSON and JavaScript. It’s easy to get people excited about working with Couch, since a lot of developers already think RESTfully and throw JSON objects around all the time. CouchApps are also really exciting.

What are you looking forward to at CouchCamp?

Kickball in Marin county in late summer, seeing what other people are working on.

What is your favorite color?

#FFB901

What drink(s) would you like to see at CouchCamp?

Some fancy draft root beer (I don’t drink drink)

August 04, 2010 08:49 PM

Diplom Thesis: Realisation of a Distributed Application Using the Document-Oriented Database CouchDB by Lena Herrmann

Lena Herrmann & Thesis

Major Congrats to Lena Herrmann for handing in her Diplom Thesis on Realisation of a Distributed Application Using the Document-Oriented Database CouchDB.

It’s a whopping 163 pages containing all the nitty-gritty-researchy details on why CouchDB is the number one choice for writing distributed applications in both the small and large scale.

Its review is pending but we’ll give you a shout when the full text is available. The great folks at UPSTREAM where Lena wrote the thesis are contributing the text back to the wider community. Thank you Lena & UPSTREAM!

August 04, 2010 03:19 PM

RelaxBack for Tuesday 8/3/2010

Upcoming Events

Mikeal is speaking about node.js and CouchDB at the socal.js meetup on Thursday August 5th.

CouchCamp tickets are on sale for $500 (room, food and drink included) until August 17th.

Mathias Meyer is speaking about Couchapps at WebAppDays in September.

Recent Happenings

Geoff Buesing wrote a Rack adapter for CouchDB external processes.

Jason landed Debian support in build-couchdb.

Mikeal did a post about abstracting CouchDB.

Mailing List

On the dev list, the request for comment on CouchDB 1.0.1 and a proposal for view server protocol changes are still kicking around, as well as a new thread to get Filipe’s replicator db work into trunk.

The user list has threads about the content of userCtx, compilation errors, view performance testing, couchdb-lucene, info about deleted documents, and multiple view queries.

August 04, 2010 04:50 AM

August 03, 2010

Klaus Trainer

Got My Now Open Source Project to 0.1 - Announcing CouchDBCP 0.1

In his recent blog post, Damien Katz described how he (with the help of a lot of people) got his Open Source project CouchDB to version 1.0.

Trying to follow (at least some of) Damien's tips, I now try to explain the motivation behind CouchDBCP, and what it aims to be.

CouchDBCP

CouchDBCP is an acronym for CouchDB Clustering Proxy. Basically, it's a reverse proxy (a.k.a. gateway) for maintaining CouchDB clusters. Its objective is to allow for an abstraction of a single reliable CouchDB device, using a collection of possibly unreliable CouchDB units.

Why?

Almost two months ago, I submitted my bachelor thesis titled Conception and Implementation of a Reliable Web Service with CouchDB, which I've been working on for several months with increasing intensity. I defined a system's reliability as its ability to guarantee some well-defined combination of the properties consistency, availability, and partition tolerance (CAP), while maintaining scalability. In that context, "well-defined" does not mean that CAP properties are not allowed to dynamically change. Quite to the contrary, being able to flexibly choose the best tradeoff is key to scalability, as a system can only be called scalable if it is possible to guarantee desired behavior when aspects related to the system vary.

The initial motivation for CouchDBCP was to have a solution for managing CouchDB nodes in a decentralized way, so that one can have control over the consistency level and replication behavior. Imagine you have an application scenario that requires atomic data consistency in favor of availability just for a small subset of your data. Wouldn't it be cool, being able to control the consistency level on a per-request basis?

As an example, Riak (like Amazon's Dynamo) allows specifying the number of replicas that need to commit a particular read or write operation before the client receives a response indicating success. Also, since February 2010, Amazon SimpleDB allows choosing between eventual and atomic consistency guarantees on a per-request basis. A client can thereby override the default of eventual consistency when issuing a request on a type of resource that requires atomic data consistency.

However, as far as I know, for CouchDB no equivalent solution has been reported or previously recognized.

Current State of Implementation

At this point in time, CouchDBCP does not support data partitioning. It can only be used to manage redundant CouchDB instances in a decentralized way. As long as data partitioning is not available, it is intended that each CouchDBCP is assigned only one CouchDB node. A CouchDBCP is used as a gateway (more precisely as a reverse HTTP proxy) in front of a CouchDB. In order to minimize latency, the two components are intended to be located close together, connected by a high-speed, low-latency network link, if not simply located on the same computer.

Currently, most Futon tests are passing. Some are failing because the REST constraint of stateless communication is violated. Furthermore, there are a few tests that sometimes fail due to timeouts when the CouchDB instances are frequently restarted by the test. There should be room for improvement regarding socket communication, timeouts and error handling.

Regarding cookie authentication, CouchDBCP is currently able to maintain authentication state: If atomic consistency is used for cookie authentication (i.e., the POST request to /_session), it is possible to load-balance requests over all cluster nodes, without losing authentication.

I don't know, however, whether OAuth will ever work cluster-wide, since OAuth is significantly more complex than cookie authentication.

For read requests, both eventual and atomic consistency are supported. However for write requests, only atomic consistency is currently implemented. In practice that means that a cluster won't be available for a write operation when there is no quorum of nodes being able to commit.
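The quorum condition for writes can be sketched roughly as follows. This is only an illustration: the post does not say how CouchDBCP counts a quorum, so the simple majority rule and the names hasQuorum and writeOutcome are assumptions.

```javascript
// Hypothetical sketch: a write succeeds only if a majority of nodes commit.
// Assumes a simple majority quorum; CouchDBCP's actual rule may differ.
function hasQuorum(commits, clusterSize) {
  return commits > Math.floor(clusterSize / 2);
}

// commitResults: one boolean per node, true if that node committed the write.
function writeOutcome(commitResults) {
  var commits = commitResults.filter(function (ok) { return ok; }).length;
  return hasQuorum(commits, commitResults.length) ? 'committed' : 'unavailable';
}
```

With three nodes, two commits are enough under this rule; with four nodes, two are not, and the cluster reports the write as unavailable.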

To minimize latency and reduce load on the database, CouchDBCP caches all HTTP headers that have an ETag. Therefore, HEAD-requests as well as GET-requests where the If-None-Match header field matches a cached header's ETag, can be served without necessarily hitting the database.
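A minimal sketch of that caching decision might look like the following. The function name serveFromCache and the shape of the cache and request objects are invented for this example and are not CouchDBCP's real interface; cache invalidation on writes is left out entirely.

```javascript
// Hypothetical sketch of an ETag-aware header cache in front of CouchDB.
// The cache maps a request path to the response headers last seen for it.
function serveFromCache(cache, request) {
  var known = Object.prototype.hasOwnProperty.call(cache, request.path);
  var cached = known ? cache[request.path] : null;
  if (!cached || !cached.ETag) {
    return null; // nothing usable cached: forward to the database
  }
  if (request.method === 'HEAD') {
    return { status: 200, headers: cached };
  }
  if (request.method === 'GET' &&
      request.headers['If-None-Match'] === cached.ETag) {
    return { status: 304, headers: { ETag: cached.ETag } };
  }
  return null; // unconditional GET or non-matching ETag: forward
}
```

A null result means the proxy has to pass the request on to CouchDB; anything else can be answered from the cached headers alone.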

Future Goals

Future goals are:

  • allowing eventual consistency for write operations (will soon be available)
  • decentralized cluster configuration using an HTTP-based gossip protocol
  • data partitioning support based on consistent hashing (cf. Riak)
  • cluster monitoring

by Klaus at August 03, 2010 05:41 PM

August 01, 2010

Damien Katz

Getting Your Open Source Project to 1.0

The project I founded, Apache CouchDB, recently hit 1.0. I'm very proud :)

saucelabscupcakes.jpg
Awesome 1.0 cupcakes from Sauce Labs.

It's been a long time, but we finally produced a release that's complete, performs well and is rock solid.

Already CouchDB is on over 10 million machines. It's used by big respected websites (like the BBC) and groundbreaking organizations (Mozilla and Canonical). We run on most *nix, OS X, Windows, and even Android phones. We have dozens of frameworks and client libraries available. There are 2 books available for sale right now. There is a venture capital backed startup, Cloudant, that offers CouchDB hosting and scales to huge datasets. And I'm CEO of another venture backed ($2 million invested) 12 person start-up, Couchio.

So how did I get here? It took a lot of time and effort (almost 5 years!), and the help of a lot of people. Here are some tips of what it took to get CouchDB to 1.0.

Why?

Successful open source projects need a reason for being. You need to decide why you are creating a project and what problems it solves. Whether it's one or many reasons, you need to figure out what they are and explain them.

Perhaps you are making something new, that hasn't existed. Why hasn't it existed before? No one had the idea? No one had the will to carry it through? Or maybe you are making something that's already in existence, like an HTTP server. What are your reasons? Simpler, faster, more features, different license?

If you are just doing it as a learning exercise, that's fine. But don't expect to attract a community until you can explain why it's useful beyond your own personal goals.

With CouchDB, my reasons were:
1. A schemaless document database with views, bi-directional replication and conflict detection to enable disconnected operation would be really useful.
2. I wanted to understand more about creating distributed systems and database internals.

No one cares about reason #2 except for me. But the first reason is compelling.

Make sure you can tell people why your project exists and what it's good for. And put the reasons on your project site where people can find them.

Code Comes First

Don't start a project unless you have a deep commitment to being a strong coder.

Now I'm not saying you must be a strong coder to participate in a project. Not at all. I'm saying that you must be a strong coder to lead one. Maybe you'll get lucky and somehow attract some really good coders to your project. But most really good coders go to projects with already solid codebases, or start their own.

Also, you don't have to be a strong coder when you start out, but you should know the basics and have a strong desire to learn and get better. Don't expect to attract anyone to your project until you have a substantial amount of working code that isn't a big ball of spaghetti.

With CouchDB, I always emphasized the quality of the high-level design and code implementation. We cannot under any circumstances lose or corrupt your committed data, or get things into an inconsistent state. Reliability and durability are absolutely imperative. Any design or implementation that doesn't meet these goals doesn't make it into the project.

Some projects might not have an emphasis on the reliability, but on absolute performance. That's a fine choice to make, but make sure your users know what they are trading off. And then actually deliver on the performance.

As the project moves along, you will need to ensure the code quality (reliability, performance, resource usage, etc.) is improving over time. If you aren't a good coder, you won't be able to do this.

Know What You Aren't

Almost as important as knowing what your project is trying to accomplish is knowing what it isn't trying to accomplish.

When your project starts to get traction, but before it's done, you'll get a lot of people who want the project to work more like things they've used in the past. New users might think your goals and abilities are cool, but they'd trade it all for just a little more. They'll want everything your project does, plus a pony.

The problem is feature and scope creep. Even if you are successfully keeping the project on track, the community may get slowed down dealing with people trying to make it something it's not. Stating clearly what your project isn't trying to do or be helps make it much easier to explain what you can't implement or change.

Now, you can't define everything your project isn't. (It's not a video game. It's not accounting software. It's not a banana. It's not a rainbow. etc.). But you can find the things it's related to, overlaps with, or might be confused with, and explicitly say it's not those things.

With CouchDB, because we are a database, people often asked us to add features that were in traditional RDBMS's, but didn't fit well with the CouchDB data model. Not being intimately familiar with CouchDB's model and how it all fits together, they don't realize that what they're asking for simply doesn't work. But because we explicitly stated on the project site we aren't a relational database and aren't trying to replace relational databases, it made it much easier to explain why those features weren't a good fit for what CouchDB is trying to accomplish.

So if you don't clearly define what your project isn't, often people will try to make it into those things. This can damage the community, as moving forward is slower and people feel like they aren't being listened to. Be explicit what you aren't, and it makes it much easier to focus on what you actually are.

Don't Do Everything (Well)

So you are a superstar coder, your code is clear and concise and high quality, you write clear complete documentation, you create all the tests, and you fix every bug. You are awesome!

Thing is, you might be awesome, but until you actually get a community behind the project, the project will be limited in an absolute sense by what a single person can produce. And if you are doing everything, that's not a whole lot. Trying to do everything well means you'll probably never actually release anything.

Unfortunately, at first, you _will_ need to do everything. But just don't do everything really well. Instead, you'll have to do some things crappily, and then move on. In addition to writing all the code, you'll need to: Create a project site. Explain your project. Write documentation. Do the releases. Start a bug tracker. Create a mailing list and answer questions. And you'll have to do most of these things poorly if you want to keep moving the ball forward.

You'll have to do some things poorly. But you'll need to pick a few things that you do really well and execute on those things. (The code should be one of the things you do well).

And with everything you do, you'll need to make it easy for others to participate: to send patches, to update and create documentation, and to file bug reports. And make it clear that help is desired.

Don't get hung up on trying to make everything perfect. That just paralyzes you. But by picking a few things to do well, you will attract people to help you with the things you aren't doing well.

Community Wants to Help

Open Source is awesome in the way it attracts people who just want to help make something cool. Many people want to contribute their time, but only if they think their help will amount to something in the long run. They don't want to spend time and effort on something that doesn't yet show potential or might be abandoned if the creators lose interest.

If you have a solid codebase, then it becomes much easier to attract people to your community. If people can recognize there is at least something high quality about your project, but it's lacking in some areas, people will want to help you in those areas. But you have to have the high quality pieces in place. People don't want to be the one excellent contributor to dreck. They'd rather not have their efforts associated at all.

They do want to be a part of something great. They want to add their work and make it even better. They want to contribute to projects where the total excellence of the project reflects well on them and their efforts. They want to make the world a better place, and don't want their efforts wasted.

And people who like making the world a better place are exactly the kind of people you want to attract. You want people to have pride in their contributions, and to feel like they are really positively affecting the things they care about. Those people have lots of projects to choose where they can add their time and talents. If they feel their efforts on your project are wasted, they are gone. Make sure the people who show a strong desire to contribute aren't ignored, and feel like their efforts will eventually amount to something.

Being a part of Apache has helped CouchDB tremendously. Partially it's because Apache has helped our visibility and credibility. But it's also because we've adopted the "Apache Way", which is more focused on the community aspects of a project than on any specific contributor. Without our amazingly active community, CouchDB would be far behind where it is now.

Community Is Often Incompetent

Unfortunately, many people who will want to help you will produce contributions of poor quality. You will have to deal with this "help", and do so diplomatically. The best way to point out the shortcomings with their contributions is to identify what needs improvement without denigrating their overall effort. This can be hard, and many don't want to hear why their efforts aren't up to the project's standards.

Sometimes you have to hurt people's feelings. But it's better to be honest than to have the quality of your project brought down. If they can't handle the feedback, so be it. The good news is the people who do listen to constructive criticism and actually improve the quality of their contributions are incredibly valuable. Look for these people and nurture their involvement.

With CouchDB, we try to listen to all members of our community, but we only grant commit access to the ones who have shown high quality contributions. Our committers are our first line of defense against poor code and design.

Paul Graham Was Right

It seems to make sense to choose a mainstream language for your project. The more mainstream it is, the larger the potential community you can attract. While that's true to an extent, the quality of the community is more important than its absolute size. Much more important.

Using a mainstream language means you are also competing for contributors' time with other projects in the same language. So the pool is large, but in the end, you still have to attract quality developers from other things competing for their attention. And the competition might actually be stronger in that larger pool.

The more mainstream a language, the more likely it is that a random developer knows it because it's what they use at work. They aren't necessarily interested in being more productive, being more reliable, or whatever. They are interested in getting paid, and they choose their language not for elegance, power or performance, but for the number of job openings available.

If you pick a non-mainstream, more esoteric language, you tend to get a higher quality of developer. You tend to find people who absolutely love programming and building, and choose their languages not based on the scale of pay, but because they make the developers and projects more powerful. So while the total pool of contributors is smaller, they tend to give a higher quality of contribution. You get a much better signal to noise ratio.

As Paul Graham explained in Beating the Averages, the exotic languages tend to attract devs who love to learn and expand their toolbox. You'll attract more of the types of devs who don't mind creating new code to fill in the gaps, or diving into source to find a bug. They aren't afraid of what they don't know, they actually get excited by the chance to learn and do something new.

But if you pick enterprisey language X, you might find you are spending more of your time fixing problems and dealing with developers who just don't "get it". If you aren't careful, this can drown your project and bring the total code quality down to the point where you can't find good devs to help you anymore. With the less popular, esoteric languages, that tends to be less of a problem and you get a higher quality of contribution in general.

Use Your Brain

I can keep listing all the stuff we did, but you aren't creating the same project under the same circumstances. Pretty much everything I've said here, we've not followed at some point during the project. Often it was to the detriment of the project, but sometimes it just didn't make sense to blindly follow a rule or guideline.

You have a brain, and using it is the most important thing to remember at anytime. Projects can't follow cookie cutter rules. Even the "Apache Way", as I've discovered, means different things to different people, often at different times.

So take my advice here with a grain of salt, and use your brain to figure out what's actually important to you, your project and its community. Good luck!

by Damien Katz at August 01, 2010 05:38 PM

July 31, 2010

Couchio

RelaxBack for Friday 7/30/2010

Upcoming Events

CouchCamp tickets are still on sale. The event will take place on September 8th - 10th.

Recent Happenings

Damien wrote up a thorough article about bringing your open source project to 1.0.

Jason’s Linux packages now support Fedora.

Scott Davis gave a well received talk at the Dallas Tech Fest on CouchDB.

Sam Bisbee posted his intro to CouchDB slides.

Simpsons CouchApp :P.

Alexander Lang wrote up a great article on how to handle transactional use cases in CouchDB.

And Max Ogden has started a new GitHub project for all his GeoJSON JavaScript code.

Couchapps

BigBlueHat. Badass content management system.

Jobs

Something in Stuttgart that Jan has a line on.

July 31, 2010 03:37 AM

July 30, 2010

Upstream

Transactions in CouchDB

So we’ve been told that all these fancy new NoSQL stores don’t support transactions because that wouldn’t scale, and we’d just have to live with that. So yes, technically, CouchDB doesn’t support transactions, yet it still does. In a way.

What CouchDB doesn’t support is transactions that span multiple read/write operations, i.e. write document a, then write document b, if something goes wrong, roll back both writes. What it does support is single document “transactions”, i.e. a document is either written completely or not. So if our application requires a transaction, all we have to do is make sure that transaction happens in a single document.

Here’s our use case: at cobot (our coworking space management service) we have a feature where a coworking space can charge a coworker for a one time service, e.g. usage of a meeting room. The way it works is that the manager of a space goes to cobot and enters an amount and description, e.g. “$10” and “meeting room”. At the end of the month, a cron job sends out invoices for all the one time charges.

Our problem lies with this cron job. The job has to find all the charges for a coworker that haven’t been invoiced yet, create an invoice and mark the charges as invoiced. Without transactions, if something went wrong between creating the invoice and marking the charge as invoiced, the charge could end up being invoiced twice, because it hasn’t been marked as invoiced the first time.

Clearly we need to throw CouchDB away now and go back to a proper™ (a.k.a. relational) database, right?
Well, no. As I said earlier we can try to move the transaction into a single document.

In our relational past, we would have had an invoices table and a charges table. The invoices table would carry all the invoices data (have fun with that), and the charges would have a field for amount and description, as well as a boolean field invoiced.

In the world of NoSQL (documents, no transactions), instead we have 2 documents: an invoice document that contains all the invoice data and a charge document with the amount and description again. Oh, and the documents also have ids.

Now when we create the invoice, instead of marking the charge as invoiced, we add the charge’s id to an array invoiced_charges in the invoice.
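As a rough illustration, the two kinds of documents and the invoicing step might look like this. Only type, amount, description and invoiced_charges come from the text; the ids, the total field and the helper buildInvoice are made up for this sketch and are not taken from cobot.

```javascript
// Hypothetical charge documents as they might be stored in CouchDB.
var charges = [
  { _id: 'charge-1', type: 'charge', amount: 10, description: 'meeting room' },
  { _id: 'charge-2', type: 'charge', amount: 5, description: 'printing' }
];

// Build one invoice document that records which charges it covers.
// Saving this single document is the whole "transaction": either the
// invoice exists and lists the charges, or it does not exist at all.
function buildInvoice(id, chargeDocs) {
  return {
    _id: id,
    type: 'invoice',
    total: chargeDocs.reduce(function (sum, c) { return sum + c.amount; }, 0),
    invoiced_charges: chargeDocs.map(function (c) { return c._id; })
  };
}

var invoice = buildInvoice('invoice-2010-07', charges);
```

The charge documents themselves are never modified, so there is no second write that could fail after the invoice has been saved.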

That was pretty easy. Now the tricky part is, next month, to determine which charges have been invoiced already. The first approach could be to make two requests:

Fetch all the charges by using a view that just emits every document that is a charge:

function(doc) {
  if(doc.type == 'charge') {
    emit(doc._id, null);
  }
}

Then fetch all the charge ids from the invoices:

function(doc) {
  if(doc.invoiced_charges) {
    doc.invoiced_charges.forEach(function(id) {
      emit(id, null);
    });
  }
}

Then on the client side we can throw away all the charges whose ids are in the list of invoiced charges.
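A sketch of that client-side step, assuming the rows of both view results have already been fetched and parsed. The function name uninvoicedCharges is invented here; it reads the charge's id from each charge row's id field and the invoiced charge id from each key of the second view.

```javascript
// chargeRows: rows from the first view (one row per charge document).
// invoicedRows: rows from the second view (key = id of an invoiced charge).
function uninvoicedCharges(chargeRows, invoicedRows) {
  var invoiced = {};
  invoicedRows.forEach(function (row) { invoiced[row.key] = true; });
  return chargeRows.filter(function (row) {
    return !Object.prototype.hasOwnProperty.call(invoiced, row.id);
  });
}
```

This works, but it costs two requests and transfers every charge ever created to the client.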

But we can do better: We can use a list function to do the filtering within CouchDB. First we have to combine the above views into one:

function(doc) {
  if(doc.type == 'charge') {
    emit('charge', null);
  }
  if(doc.invoiced_charges) {
    // emit the invoiced charge's id as the value; the row's own id
    // is the id of the invoice document, not of the charge
    doc.invoiced_charges.forEach(function(id) {
      emit('_invoiced', id);
    });
  }
}

Now we add the following list function:

function() {
  // some helpers
  Array.prototype.index = function(val) {
    for(var i = 0, l = this.length; i < l; i++) {
      if(this[i] == val) return i;
    }
    return null;
  };

  Array.prototype.include = function(val) {
    return this.index(val) !== null;
  };


  // interesting stuff
  var used_ids = [];

  send_json_results(function(row, sender) {
    if(row['key'] == '_invoiced') {
      // the value holds the id of a charge that has been invoiced
      used_ids.push(row['value']);
    } else {
      if(!used_ids.include(row['id'])) {
        sender.send_row(row);
      }
    }
  });

  // more helpers

  // this just makes sending json from a list function easier
  function send_json_results(callback) {
    send('{"rows": [');
    var first_row = true, sender = {};
    sender.send_row = function(json) {
      if(!first_row) {
        send(',');
      }
      first_row = false;
      send(JSON.stringify(json));
    };

    var row; // declared locally instead of leaking a global
    while((row = getRow())) {
      callback(row, sender);
    }
    send(']}');
  };
}

(Disclaimer: this is not exactly the same code as used in cobot and I haven’t tested it again.)

Apart from all the helpers, what this essentially does is put all the invoiced charge ids into a list and check every charge against that list, only sending it down to the client if it hasn’t been invoiced already.

As the view is sorted by the keys (‘_invoiced’, ‘charge’) we can be sure that all the invoiced charge ids have been collected before the list function checks the first charge.
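The effect of the sort order can be tried outside CouchDB with a small simulation. This sketch assumes each '_invoiced' row carries the invoiced charge's id as its value; the sample rows and the name filterUninvoiced are invented for illustration.

```javascript
// Rows as a view would return them, already sorted by key.
// Assumption: each '_invoiced' row carries the invoiced charge's id as value.
var rows = [
  { id: 'invoice-1', key: '_invoiced', value: 'charge-1' },
  { id: 'charge-1', key: 'charge', value: null },
  { id: 'charge-2', key: 'charge', value: null }
];

// Single pass: collect invoiced ids first, then let through unseen charges.
function filterUninvoiced(sortedRows) {
  var usedIds = [], result = [];
  sortedRows.forEach(function (row) {
    if (row.key === '_invoiced') {
      usedIds.push(row.value);
    } else if (usedIds.indexOf(row.id) === -1) {
      result.push(row);
    }
  });
  return result;
}
```

If the '_invoiced' rows arrived after the charge rows, the same pass would let every charge through, which is why the single-pass filter depends on the key order.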

There you have it, transactions in CouchDB.

P.S. If you are writing your stuff in Ruby (and possibly Rails), Couch Potato now has support for creating and querying list functions. See the readme.

by Alexander Lang at July 30, 2010 10:56 AM

July 28, 2010

Couchio

RelaxBack for Wednesday 7/28/2010

Upcoming Events

NYC NoSQL meetup tonight

Recent Happenings

Jason Smith put together install binaries for CouchDB 1.0 on linux 32 and 64bit.

Klaus Trainer posted a good roundup of CouchDB 1.0 retrospectives.

New Stack Overflow question about what use cases CouchDB is best suited for.

CouchDB is not being re-written in C.

Calvin Yu posted a new node script for syncing design doc functions.

Mailing Lists

On the dev list Norman Baker started a thread about putting CouchDB code on GitHub. Also messages about bounding box queries in GeoCouch, the CouchDB SDK for Android on the HTC Tattoo,

On the user list some people are tracking down beam CPU performance issues that Sivian Greenberg is having.

Couchapps

Henrik Skupin is working on a test results dashboard for Mozilla test automation.

July 28, 2010 09:41 PM

July 27, 2010

Couchio

RelaxBack for Tuesday 7/27/2010

Upcoming

NYC CouchUp tonight.

NYC NoSQL Meetup tomorrow.

Recent

CouchDB case study of Aptela telecom. Highlight for me: “Reliability has been exceptional”.

Enzo has a good new post up on CouchDB and Rails.

Firefox 4 Beta 2 was released today which includes support for the latest draft version of the IndexedDatabase specification. IndexedDatabase provides low level transactions, indexing, and object stores to the browser. Mikeal has been working on a full implementation of CouchDB on top of IndexedDatabase and all of its tests pass on Firefox4B2.

Mailing lists

On the dev list, Noah Slater put out a request for comment for releasing CouchDB 1.0.1. Jason Smith added to the view protocol changes with a request for better error handling on form POST: errors on form POST currently can’t return a nice HTML page, which is something we need to fix.

The user list has a new thread about reporting in CouchDB and some talk about issues related to external handlers not being quite in sync with commits. This was brought up in the context of couchdb-clucene but most likely affects other services that use a similar externals method.

Couchapps

Afghan war leaks in a CouchDB CouchApp by Benoit! This is seriously awesome!

Jobs

Couchio hiring a C / database engineer.

July 27, 2010 10:05 PM

Guest blog post from Mahesh Paolini-Subramanya, CTO of Aptela

We posted a case study today on Aptela, the leading provider of business phone services for small business and mobile workers nationwide, and how they use CouchDB to scale their application.  Their CTO Mahesh Paolini-Subramanya worked with us to write a guest blog for us to further explain their use of CouchDB.  We really appreciate Mahesh taking the time to help share their use of CouchDB with the community.

Now for Mahesh’s blog post…

Aptela Achieves Replication and Scaling with CouchDB

We just launched our next generation calling platform, Aptela v5.0, and I had to make sure that when we rolled it out, we had a solution that would help our new calling platform be massively, yet affordably, scalable, and aid us as we continue to deliver reliable, crystal clear phone service; a mission critical requirement for all our customers. They rely heavily on our service to run their businesses, and we cannot go down.

As a business-class phone service provider with a customer base of more than 17,000 users across 3,000 small businesses nationwide, Aptela handles over 100 million minutes of calls per year. That’s a lot of calling! We needed a way to effectively manage the millions of Call Detail Records (CDRs) generated by those calls on a daily basis, so that we could provide those CDRs to our customers (and internal Aptela folks) instantly. We also needed the data to synchronize across all of our servers, all of the time.

I know, it’s all supposed to be so easy - you put all of your information in a database somewhere and magically, the problem is solved. Come to think of it, that is exactly what we did in the previous iteration of our architecture - A nice Postgres database happily serving up data kept all of our systems in sync. This of course, worked perfectly until the day the database server crashed (Really? The backup generator doesn’t work for more than 10 minutes? Quelle surprise!), and our customers were offline until our (previous!) hosting facility figured out which circuit-breaker to un-trip.

This, naturally, led us to replication, master-master configurations, master-slave setups, new hosting facilities, load-balancers, MySQL, and the next thing you know, it was Yet Another 3 AM crisis with me frantically Googling “repair corrupt MySQL database unknown error 3l33t”.

Our entire server-infrastructure is (and needs to be!) cloud-based, i.e., highly distributed, reliable, scalable, location-independent, fine-grained and with built-in coffee service.   Take incoming calls for example.  In our environment, they go to a randomly chosen server, which figures out what to do with the call based on the called number.  Then the server waits until the Official Data Store (MySQL or Postgres, or Oracle) figures out what to do with the call. The problem? We were spending all of our time figuring out how we could improve our databases to support our application, and not actually spending any time on improving our application! This was clearly not the answer. Side note - There should be some kind of law about this, e.g. Any software operation will eventually stagnate when the Development budget equals the Maintenance budget.

Enter CouchDB, which has worked like a charm for us. If anything, we have only begun to tap into all of the cool things that it does for us.

We are now able to handle our massive amounts of data by dumping it into local instances of CouchDB on each of our telephony nodes.  At this point, a couple of really neat things happen (ok, neat for me, probably not for you):

  • Billing information gets extracted from these CDRs, and replicated over into the billing system
  • Metadata associated with voicemails and recordings gets replicated across to the other telephony nodes
  • Metadata associated with the calls gets replicated over to the application nodes
  • The CDRs themselves all end up getting replicated to the reporting servers, where all sorts of goofy reports can now get generated off of them.
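Replication flows like these are typically started by POSTing a small JSON document to CouchDB's /_replicate endpoint. The sketch below only builds such request bodies; the database names, hosts, the filter name and the helper replicationRequest are invented for illustration and say nothing about how Aptela actually configures its nodes.

```javascript
// Build the JSON body for a POST to CouchDB's /_replicate endpoint.
// All database names and hosts below are invented for illustration.
function replicationRequest(source, target, filter) {
  var body = { source: source, target: target, continuous: true };
  if (filter) {
    body.filter = filter; // "designdoc/filtername" of a filter function
  }
  return body;
}

// Push every CDR from the local node to a reporting server.
var toReporting = replicationRequest(
  'cdrs', 'http://reporting.example.com:5984/cdrs');

// Push only documents accepted by a (hypothetical) filter to billing.
var toBilling = replicationRequest(
  'cdrs', 'http://billing.example.com:5984/billing', 'cdr/billable');
```

Because each flow is just another replication with its own target, adding a new consumer of the CDRs does not require changes on the telephony nodes beyond one more request of this kind.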

The free-form nature of CouchDB is tailor-made for reporting and that alone makes it worth the price of admission. Come to think of it, that was pretty much what made us look at CouchDB in the first place! That said, once we started working with it, it became immediately clear that this was the solution to all our data management and maintenance issues. 

CouchDB is written in Erlang and to paraphrase – We love Erlang so much we wrote our entire application in it – which makes it trivially easy to integrate it into our application. It also has an extremely easy to use REST API (Representational State Transfer), which makes integrating it into our back-office systems just about as trivial.

Now, you might be nitpicking that we didn’t really solve the problem as I originally described it (consistency across all the nodes). This is quite ok, since we are actually fairly devout believers in Eventual Consistency, i.e., giving up strict consistency across nodes in exchange for high availability.

For example, when a voicemail gets received at one node, we copy the audio and metadata over to the other nodes asynchronously. If, however, the client calls up at one of the other nodes before the audio/metadata gets there, and wants to listen to that voicemail, we tell the caller that “The Audio is still being Processed”, and to “Call Back In Just A Wee Bit”. You could consider this a bit of a cop-out, but it works just fine for everyone involved because the view of voicemails is Eventually Consistent, but we don’t lock anyone out of the system while updates are occurring. For extra credit, we just move the call over to the node where the voicemail was left, so that it is immediately accessible.

Finding the right tool for the job was the goal, and CouchDB is the perfect match! I know that we have barely begun to fully utilize everything available with CouchDB. Going forward, we plan to use it to help us continue to improve the way we manage our data and I feel confident that it will be able to evolve right along with us. Bravo Damien!

July 27, 2010 05:08 PM

RelaxBack for Monday 7/26/2010

Upcoming

There is a CouchDB meetup tomorrow, Tuesday the 27th, in NYC at 7pm at DBA, 41 First Avenue, New York City, NY. 10003. Between 2nd and 3rd Street.

Mikeal Rogers will be speaking on Wednesday the 28th at the NYC NoSQL meetup.

Recent Stuff

Yay Benoit! Benoit has been rockin the couchapp commits. Benoit has pushed 7 major versions of the Python couchapp toolkit so far and hasn’t shown much sign of slowing down. Couchapp also had recent contributions from Henrik Skupin and Geoff Buesing.

David Nolen wrote some awesome clojure code that can write about 5500 documents a second using bulk docs.

Cloudant was at OSCON last week where SETI showed off Open SETIQuest which hosts all of its metadata in Cloudant.

Mathias wrote up his 10 biggest pet peeves in CouchDB.

Mailing lists

On the dev list Mikeal suggested some changes to the view server architecture and Robert Newson worked out the integration of the latest MochiWeb release into CouchDB which now supports native SSL.

On the user list there were debates about the best way to structure AND & OR style queries, scheduled tasks, and Simon Metson posted a CouchDB job that has opened up in Bristol, UK.

Couchapps

TweetEater displays tweets that are stored in CouchDB; source and example.

And Russell pushed his new sofa blog http://chewbranca.com.

July 27, 2010 12:39 AM

NYC CouchUp Tuesday the 27th at DBA

There is a CouchDB meetup tomorrow, Tuesday the 27th, in NYC at 7pm at DBA, 41 First Avenue, New York City, NY. 10003. Between 2nd and 3rd Street.


July 27, 2010 12:24 AM

July 26, 2010

Damien Katz

CouchCamp is Coming Soon!

CouchCamp, September 8-10

This is the place to be to learn and hack on Apache CouchDB. In honor of the recent 1.0 release, for a limited time it's only $500, with accommodations.

In addition to unconference style discussions, we've got some great speakers: Selena Deckelmann, Stuart Langridge, Ted Leung, Josh Berkus, Dion Almaer and me :)

One thing I'm really excited to talk about is our work porting CouchDB to mobile platforms. Android, iOS, RIM, etc. We've got some very cool stuff coming :)

by Damien Katz at July 26, 2010 05:10 PM

July 23, 2010

Couchio

New O’Reilly Book: CouchDB Kurz & Gut

Hey you fellow Germans out there! O’Reilly published a new book on CouchDB just for you! It is Mario Scheliga’s CouchDB Kurz & Gut, which translates roughly to "CouchDB Short and Good". The Kurz & Gut series consists of compact books with enough content to get you running, and they remain useful later as a quick reference in day-to-day work (Jan happily remembers having the LaTeX edition at hand a couple of years back).

From all of us at Couchio:

Mario, awesome job! Thanks a lot for helping get the good word out and providing such a great resource for the German CouchDB community. Rock on! (or relax, either is fine :)

July 23, 2010 09:45 AM

July 22, 2010

Will Hartung

Couch 1.0 retrospect

Surprise, to me, CouchDB 1.0 hit today.

A call went out for folks to chime in about their experiences with CouchDB.

I'm a casual participant, follower, and non-user of CouchDB since around 0.9. I have dabbled with it, but have not been able to employ it anywhere in my projects. The tool is in the box, just doesn't fit anything I need right now.

I didn't head over to the Couch project trying to fill a need, rather when a friend started using it, I went over to see what the buzz was about. I learned a bit and started to linger in IRC and on the mailing list.

I made an effort to understand CouchDB. I have a solid understanding of RDBMS systems, so I really wanted to understand this "new" thing. With the understanding I have, I post random stuff on things like Stack Overflow to help answer Couch questions, and offer other bits of support.

As a "non-user", I found the community to be great. The principals live on the mailing lists and IRC. They don't just talk, they listen, and discuss the finest nuances of the systems with random strangers that just happen to show up.

For me, the Couch community is the center of the NoSQL world. I think they have a unique perspective and are doing things differently from what the other DB projects are trying to do. Because of its uniqueness, it's a great place to start and radiate out into the larger DB world.

CouchDB 1.0 is exciting as it "finishes" the first leg of their journey. But the journey will continue and I wish them the best.

Congrats to the CouchDB team and un-team.

by Will Hartung (noreply@blogger.com) at July 22, 2010 06:20 PM

John Wood

Speaking About CouchDB at Upcoming ChicagoDB Meeting

I’m going to be speaking about CouchDB at the next ChicagoDB meeting, which will be held on August 16th, 2010. I’m currently putting together some slides that will (I hope) provide a good introduction to CouchDB and its features. I also plan on doing a live demo at the end, so everybody can see CouchDB in action.

Information about the meeting can be found here. I hope to see you there!

by John Wood at July 22, 2010 03:33 PM

Couchio

CouchDB Relaxback

The last week was a big one for CouchDB. Here is a brief recap (inspired by Mark Phillips’s great blog post on the Riak Recaps). I can’t get everything in, cause there’s way too much. Leave a comment (or email jchris@couch.io) if I left anything out, or if you have ideas for what to mention next week.

1.0 Released

We’ve been working toward CouchDB 1.0 for five years, so getting that out the door was rather major. There was even a New York Times article that calls CouchDB the first production ready NoSQL database (I heartily concur, we’ve been production worthy since 0.8.0). Now that we are 1.0 there’s no reason not to use CouchDB in your banking, medical, air-traffic control, or other mission-critical applications.

As part of the 1.0 release we asked the community to write retrospective blog posts on how they came to CouchDB. Thanks to Klaus for a collection of links:

We’re still hoping to find more of these, so if you’ve been involved in CouchDB for a while (or even a short time) and you write one, let me know.

Windows Support

Part of the 1.0 release is Windows support. Thanks to Aaron Miller for building a provisional installer kit. Some folks are having trouble with the installer, so we’re waiting on Mark Hammond to get home and build a proper one. Thanks Mark!

CouchCamp

Last year we had CouchHack which was the first time a few of the committers met each other. It was also when Damien, Jan, and Chris first decided to form a company behind CouchDB.

This year we are hosting CouchCamp which will be the biggest gathering of CouchDB supporters in the history of all time. Of all time. There will be scrumptious organic food (and maybe Damien will even make a McDonald’s run) and quality beer. All this and a cabin bed at Walker Creek Ranch in Marin County. Slots are filling up fast, and there are only a limited number of beds. Anyone who registers after we are out of beds will be welcome, but you’ll end up camping (for real) in the great outdoors. Registration for this two and a half day event is currently only $500, in celebration of CouchDB 1.0.

CouchRest

In other news, the CouchRest Ruby library for CouchDB has a new set of active maintainers. For help with patching CouchRest, talk to Marcos Tapajós, Sam Lown, and Will Leinweber. Thanks Rubyists!

GeoCouch

GeoCouch is taking the world by storm. There was a meetup in Augsburg. PDXAPI continues to kick ass, and has begun to spawn related applications like Food Cart Pages and mobile clients. Thanks Volker Mische, Max Ogden, Don Park and others.

This just in: PDXAPI has won an award (and $1,000) from the Mayor of Portland!

Faster JavaScript Views

A few months ago, the Riak team (great hackers) released some code to make communication between Erlang and JavaScript way more efficient. After that Paul Davis integrated it with CouchDB, as a proof of concept. Since then, he’s started to refactor it for easier builds. We are looking forward to integrating this with CouchDB trunk in a near-future release.

Mr. Rogers on Sesame Street

Just kidding about the PBS reference (as a kid I was always fascinated by cameos). But Cloudant’s Mike, Alan, and Dave stopped by the Couchio offices and we talked about collaboration and strategy, and how to help get the CouchDB story to the millions of developers who haven’t heard a word of it yet.

One thing that came out of the meeting: we plan to host some CouchDB trainings. If you want one in your town, email hello@couch.io and we’ll get it together.

Monthly webinars

O’Reilly has been sponsoring monthly Webcasts in association with the CouchDB book. Here’s the list so far.

CouchApp Evently Guided Hack. Learn how to make CouchApps the JChris way.

Intro to Apache CouchDB

What’s new in CouchDB 1.0. A round up of the features and improvements that CouchDB’s seen in the last few months.

Flexible Scaling with CouchDB Replication (video not yet available.)

Next month (on the 25th) we’ll be doing one on how to make crash-only applications using the _changes feed. Signup link to follow.

New Committers

The Apache CouchDB project is very lucky to have two new committers joining us: Filipe Manana and Robert Newson. Filipe has been hard at work on updating the replicator to be more robust and performant. Robert is the force behind CouchDB-Lucene.

July 22, 2010 05:29 AM

July 20, 2010

Klaus Trainer

First CouchDB Meetup in Augsburg

Yesterday, I went to Augsburg in order to attend the colloquium Volker had previously announced. Volker gave a nice introductory talk on CouchDB and GeoCouch, where he first highlighted some CouchDB features (e.g. MVCC, schema-less JSON data format), and then gave a nice introduction to GeoCouch.

The talk was geared more towards geographers than computer scientists, but unfortunately the audience was mostly computer scientists, with no geographers. The geographers who had previously shown interest in the talk cancelled on short notice due to another appointment.

Volker explained the motivation behind trying to create yet another solution for geospatial search. One reason for doing so is that CouchDB, unlike other database systems, has a schema-free data format, and therefore does not impose any unnecessary constraints in this regard. Another reason is that some of its features allow for perfect usage scenarios with mobile devices running GeoCouch: it runs with a low memory footprint, and, due to local storage and incremental replication of data, applications can work in spite of temporary network unavailability.

Finally, Volker gave some overview of GeoCouch's current state and talked a bit about its future direction.

After the talk and the following discussion, four of us joined the meetup: Volker, Tom, Ben, and me. We had drinks and pizza (generously sponsored by couch.io), while discussing application scenarios for CouchDB, GeoCouch, and other NoSQL databases, as well as related practical tradeoffs.

Thank you, couch.io, for sponsoring the meetup!

by Klaus at July 20, 2010 05:19 PM

CouchDB 1.0 Retrospectives

As CouchDB version 1.0 has been released recently, a few CouchDB developers, as well as Will Hartung (a non-developer and self-confessed non-user of CouchDB) and I (a CouchDB user), have written down a bit of our personal CouchDB stories, i.e., how we got involved and what experiences we had with CouchDB and the CouchDB community.

I'd like to list the respective writings (in no particular order):

PS: Please tell me, in case I've missed one!

UPDATE 2010-07-25: Added a link to Randall's writing.

by Klaus at July 20, 2010 04:55 PM

My Personal CouchDB 1.0 Retrospective

In my previous blog post I gave a link to a post on Damien Katz's blog. In case you don't know: Damien Katz is the inventor and initial creator of CouchDB. In case you still don't know anything about CouchDB, you'd better grab the book or go to the project site, and maybe later get your own instance at couch.io.

By the way, this blog is served from a CouchDB instance provided by couch.io. There is no classical application layer, just CouchDB as a backend and your browser as frontend. It's a classical 2-tier architecture.

So back to my CouchDB story...

I probably arrived on Damien's blog after following a link on Hacker News. After enjoying Damien's insightful observations about crappy programmers, I started reading some other blog posts, amongst others about CouchDB, and spent a few hours on them.

At that time, I had a growing interest in functional programming, distributed programming, as well as the underlying concepts. I had learned a few functional programming concepts (e.g. list comprehensions, anonymous and higher-order functions) through playing around with and later using them on a regular basis in Python.

A few months later (it was during an internship), I got the opportunity to do some functional programming in practice: I was engaged in a project where a lot of XSL Transformations needed to be done, which was my part then. Doing XSLT turned out to be not too bad at that time: I could delve into basic functional programming concepts and techniques, like recursion, immutable variables, and pattern matching, while applying them in a rather constrained (and therefore non-distracting) environment.

Still during that internship, I started studying Erlang (of course, using Joe's book). What had intrigued me about Erlang (and is still intriguing me) are its built-in distributed programming abstractions, as well as its focus on concurrency, fault-tolerance, and maintainability.

Meanwhile, I also started to observe CouchDB. It became more and more clear to me that CouchDB was the Open Source Project written in Erlang that had the highest potential with regard to its popularity and impact. Arguably its ingenuity lies in its particular combination of mostly rather old and traditional technologies (e.g. B-Trees, MVCC, HTTP).

Consequently, about one year ago, I decided that I'd do my bachelor thesis in Computer Science about how to create reliable web services with CouchDB.

Having submitted my bachelor thesis six weeks ago, I am now working on improving my prototype implementation of a reverse proxy in order to turn it into a not-so-prototype-anymore. The goal for it is to be usable (at least to some degree) and to have a good code base that can be built upon.

I will write more on that later. Source code will then be available on my github site.

by Klaus at July 20, 2010 04:51 PM

Ricky Ho

Graph Processing in Map Reduce

In my previous post about Google's Pregel model, I described how a general pattern of parallel graph processing can be expressed as multiple iterations of processing until a termination condition is reached. Within each iteration, the same processing happens at a set of nodes (i.e., the context nodes).

Each context node performs a sequence of steps independently (hence achieving parallelism):
  1. Aggregate all incoming messages received from its direct inward arcs during the last iteration
  2. With this aggregated message, perform some local computation (i.e., on the local state of the node and its direct outward arcs)
  3. Pass the result of local computation along all outward arcs to its direct neighbors
This processing pattern can be implemented using the Map/Reduce model, with one MR job for each iteration. The sequence is a little different from the above. Typically, a mapper performs (2) and (3), emitting the message with its neighbor's node id as the key. The reducer is responsible for performing (1).

Issue of using Map/Reduce

However, due to the functional programming nature of Map() and Reduce(), M/R does not automatically retain "state" between jobs. To retain the graph across iterations, the mapper needs to explicitly pass the corresponding portion of the graph along to the reducer, in addition to the messages themselves. Similarly, the reducer needs to handle both types of data passed to it.

map(id, node) {
  emit(id, node)  # pass the node itself along so the graph survives the iteration
  partial_result = local_compute(node)
  for each neighbor in node.outE.inV {
    emit(neighbor.id, partial_result)
  }
}

reduce(id, list_of_msg) {
  node = null
  result = 0

  # The list mixes the node record with the messages sent to that node
  for each msg in list_of_msg {
    if type_of(msg) == Node
      node = msg
    else
      result = aggregate(result, msg)
    end
  }

  node.value = result
  emit(id, node)
}

The downside of this approach is that a substantial amount of I/O processing and bandwidth is consumed just passing the graph itself around.
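To make the data flow concrete, here is a minimal in-memory sketch of one such iteration in Python. It is only an illustration: the toy graph layout (a dict with "value" and "out" keys), the even split used in place of local_compute(), and the sum used in place of aggregate() are all assumptions, and every neighbor is assumed to exist in the graph.

```python
from collections import defaultdict


def map_node(node_id, node):
    """Emit the node itself (to carry the graph forward) plus one message per outward arc."""
    yield node_id, ("node", node)
    # Assumed stand-in for local_compute(): split the value evenly over outward arcs.
    partial_result = node["value"] / max(len(node["out"]), 1)
    for neighbor in node["out"]:
        yield neighbor, ("msg", partial_result)


def reduce_node(node_id, records):
    """Separate the node record from the messages, then aggregate the messages."""
    node, result = None, 0
    for kind, payload in records:
        if kind == "node":
            node = payload
        else:
            result += payload  # assumed stand-in for aggregate()
    return node_id, {"value": result, "out": node["out"]}


def run_iteration(graph):
    """Run one map/shuffle/reduce round over a graph held in a dict."""
    shuffled = defaultdict(list)
    for node_id, node in graph.items():
        for key, record in map_node(node_id, node):
            shuffled[key].append(record)
    return dict(reduce_node(key, records) for key, records in sorted(shuffled.items()))
```

Note that every node record travels through the shuffle alongside the messages, which is exactly the overhead described above.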

Google's Pregel model provides an alternative message distribution model so that state can be retained at the processing node across iterations.

The Schimmy Trick

In a recent research paper, Jimmy Lin and Michael Schatz use a clever partition() algorithm in Map/Reduce that achieves "stickiness" of graph distribution while maintaining a sorted order of node ids on disk.

The whole graph is broken down into multiple files and stored in HDFS. Each file contains multiple records, and each record describes a node and its corresponding adjacency list.

id -> [nodeProps, [[arcProps, toNodeId], [arcProps, toNodeId], ...]]

In addition, the records are physically sorted within the file by their node id.

There are as many reducers as there are files, so each reducer task is assigned one of these files. The partition() function, in turn, makes all nodes within a file land on that file's associated reducer.

The mapper does the same thing as before, except that the first line in the method is removed, as it no longer needs to emit the graph.

The reducer receives all the messages emitted from the mappers, sorted by the Map/Reduce framework by key (which happens to be the node id). The reducer can also open the corresponding file in HDFS, which likewise holds a list of nodes sorted by id. The reducer can simply read the HDFS file sequentially on each reduce() call, confident that all preceding nodes in the file have already received their corresponding messages.

reduce(id, list_of_msg) {
  nodeInFile = readFromFile()

  # Emit preceding nodes that receive no message
  while(nodeInFile.id < id)
    emit(nodeInFile.id, nodeInFile)
    nodeInFile = readFromFile()  # advance to the next record in the file
  end

  result = 0

  for each msg in list_of_msg {
    result = aggregate(result, msg)
  }

  nodeInFile.value = result
  emit(id, nodeInFile)
}
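The reducer above is essentially a merge of two streams sorted by node id. A rough Python sketch of that merge follows; the function name, the tuple and dict shapes, the use of sum in place of aggregate(), and the final flush of trailing nodes are assumptions for illustration, not details taken from the paper.

```python
def schimmy_reduce(nodes_in_file, grouped_messages):
    """Merge a node file with grouped messages; both must be sorted by node id.

    nodes_in_file: iterable of (node_id, node_dict).
    grouped_messages: iterable of (node_id, list_of_messages).
    Yields (node_id, node_dict) for every node in the file.
    """
    nodes = iter(nodes_in_file)
    current = next(nodes, None)
    for msg_id, messages in grouped_messages:
        # Emit preceding nodes that received no message, unchanged.
        while current is not None and current[0] < msg_id:
            yield current
            current = next(nodes, None)
        if current is None or current[0] != msg_id:
            raise KeyError(f"message addressed to unknown node {msg_id!r}")
        node_id, node = current
        # sum() is an assumed stand-in for aggregate().
        yield node_id, {**node, "value": sum(messages)}
        current = next(nodes, None)
    # Assumed final step: flush nodes that sort after the last message key.
    while current is not None:
        yield current
        current = next(nodes, None)
```

Because both inputs are sorted, the file is read exactly once from front to back and no node record has to pass through the shuffle.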

Although the Schimmy trick improves on the classical Map/Reduce approach, it only eliminates shipping the graph from the mapper to the reducer. At each iteration, the mapper still needs to read the whole graph from HDFS and the reducer still needs to write the whole graph back to HDFS, which maintains 3-way replication for each file.

Hadoop provides a co-location mechanism for mappers and tries to assign each mapper files that sit on the same machine. However, this co-location mechanism is not available for reducers, so the reducer still needs to write the graph back over the network.

Pregel Advantage

Since the Pregel model retains worker state across iterations (the same worker is responsible for the same set of nodes), the graph can be loaded into memory once and reused across iterations. This reduces I/O overhead, as there is no need to read from and write to disk at each iteration. For fault resilience, there is a periodic checkpoint at which every worker writes its in-memory state to disk.

Also, Pregel (with its stateful characteristic) only sends locally computed results (but not the graph structure) over the network, which keeps bandwidth consumption minimal.
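As a rough single-process illustration of that stateful loop (not Pregel's actual API, which is not shown in this post), the Python sketch below keeps the graph in memory and moves only messages between supersteps. The dict layout, the even split, and the sum aggregation are assumptions, and every neighbor is assumed to exist in the graph.

```python
def run_supersteps(graph, max_supersteps=3):
    """graph: node id -> {"value": number, "out": [neighbor ids]}, updated in place."""
    inbox = {node_id: [] for node_id in graph}
    for _ in range(max_supersteps):
        outbox = {node_id: [] for node_id in graph}
        for node_id, node in graph.items():
            if inbox[node_id]:
                node["value"] = sum(inbox[node_id])  # aggregate incoming messages
            share = node["value"] / max(len(node["out"]), 1)  # local computation
            for neighbor in node["out"]:
                outbox[neighbor].append(share)  # only messages move, never the graph
        inbox = outbox
    return graph
```

A real system would also add the periodic checkpoint mentioned above; it is left out here to keep the sketch short.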

Of course, Pregel is very new and relatively immature compared to Map/Reduce.

by Ricky Ho (rickyphyllis@gmail.com) at July 20, 2010 03:13 PM

Couchio

Guest Blogpost by Mark Headd, Creator of TweetMy311

A little bit about Mark — He is an experienced voice, mobile and web application developer who has built civic applications for the District of Columbia, the Sunlight Foundation, the New York State Senate, and the cities of New York, San Francisco and Toronto. He writes about open government, programming, communication technologies and open source software on his blog at voiceingov.org.

Now for his guest blogpost…

Building Twitter Apps with CouchDB

CouchDB is a sexy beast.

There are so many things about it that make it attractive to Web 2.0 and mashup developers, and it seems like the more I use it the more cool features I find that make my life as a developer easier.

The dead simple HTTP API. Replication. URL rewriting. I could go on…

Several months ago, I started a project to develop an application that would let citizens who live in cities that have adopted the Open311 standard submit service requests using Twitter. There are some important reasons why I believe Twitter makes an ideal interface for submitting geographically specific service requests to municipalities. When I started to consider various alternatives for a platform for the TweetMy311 application, I looked very closely at CouchDB. This was right around the time CouchDB 0.11 was released. I soon discovered something else cool about my favorite NoSQL database — CouchDB absolutely rocks as a platform for Twitter applications.

Why CouchDB?

If you’ve had a play with the Twitter API and you’re thinking about building an application that uses it, you should take a close look at CouchDB. There are a number of reasons why it makes an ideal platform for Twitter applications:

  • You interact with a CouchDB instance the same way that you interact with the Twitter API — by making HTTP calls. This can help keep the code for your application clean and simple, and provides lots of opportunities for code reuse within your application.
  • Documents in CouchDB are structured as JSON, which is one of the formats returned from the Twitter API when searching for Tweets (or “status objects” in Twitter parlance).
  • Documents in a CouchDB database are assigned a globally unique ID — it’s how documents are distinguished from one another. Twitter also uses unique identifiers for status objects, so using the ID of a Twitter status object as the ID for a document in CouchDB makes life pretty easy for a Twitter app developer.

These benefits become readily apparent when you begin interacting with the Twitter API and storing status objects in CouchDB.

A Quick Example

Twitter actually has several different APIs that developers can interact with — the basic REST API, the Search API and the Streaming API. Since there are lots of resources on these different APIs, and tons of good tutorials for how to use them, I’ll focus in this post on the most basic way to use HTTP requests to get Twitter status updates — using statuses/show. (The same approach described below can be used to get multiple status updates using the Twitter REST or Search APIs and store them in a CouchDB database.)

Consider the following Tweet:

This is a status update I sent when I was walking around my neighborhood in Wilmington, Delaware thinking about how Twitter could be used to start a 311 service request. If I wanted to get this status update in JSON format from the Twitter API, I would use the statuses/show method with the id of the status update.

This URL will return the full JSON object for the status update. So now we can start to see how this JSON structure can be saved into a CouchDB database.

First, create a CouchDB database to use for our example:

curl -X PUT http://127.0.0.1:5984/twittertest
{"ok":true}

Consider the following sample script (written in PHP):

This is a very simplistic example of how you can interact with the Twitter API and store JSON-formatted status updates in CouchDB. Do note, you’ll need to use PHP version 5.2.0 or greater to take advantage of PHP’s JSON functions. Not counting our constant declarations and basic cURL functions, it takes 3 lines of code to grab a status update and store it in CouchDB. When you run this script from the command line, you’ll see something like this:

curl http://127.0.0.1/twitter-example.php?status_id=7911766753
{"ok":true,"id":"7911766753","rev":"3-8855db513f6fc0d45bbbc66a6be02035"}

It may seem redundant to get the ID of the Twitter status update from the JSON object returned from the API (since we’re already using this ID to get the status update in the first place), but consider a scenario where you are interacting with the Twitter REST API, to get @mentions or to search for any status update that contains specific phrases. Those interactions could potentially return dozens or hundreds of Tweets at a time as a collection of JSON formatted objects. By wrapping our simple example logic in a foreach() statement, you can easily process large volumes of status requests without having the ID in advance.
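For readers who prefer Python to PHP, the same loop might be sketched roughly as follows. This is not the script from this post: the function names are made up for illustration, the database URL assumes the twittertest database created earlier on a local CouchDB, and the status objects are assumed to be already-decoded JSON dicts with an "id" field.

```python
import json
import urllib.request

# Assumes the local twittertest database created earlier in this post.
COUCH_DB_URL = "http://127.0.0.1:5984/twittertest"


def build_put_request(status):
    """Build (but do not send) a PUT that stores one status object under its Twitter ID."""
    doc_id = str(status["id"])
    body = json.dumps(status).encode("utf-8")
    return urllib.request.Request(
        f"{COUCH_DB_URL}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def store_statuses(statuses):
    """Send one PUT per status object; needs a running CouchDB, so nothing calls it here."""
    results = []
    for status in statuses:
        with urllib.request.urlopen(build_put_request(status)) as response:
            results.append(json.load(response))
    return results
```

Storing a status that already exists needs the document's current _rev in the body, otherwise CouchDB answers with a conflict, so a real application would have to handle that case.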

Ch-ch-ch-ch-Changes

Now that you’ve got a handle on just how easy it is to grab Twitter status updates and store them in CouchDB, you’re probably wondering what to do next in building your Twitter app. Let’s say your Twitter app looks for @mentions on a specific account and then processes those status updates based on the content of the Tweet, or by looking at the location of the Tweet (this is, by the way, what TweetMy311 does).

You’ll probably need to go through a series of steps, processing a status update at each step and ultimately sending a response back to a user via the Twitter API. Get a Tweet — process it n number of times —  send back a response. The series of steps involved in processing a Tweet (one step to potentially many steps, depending on what your application does) is the heart of any Twitter application, and can cause lots of complexity if not approached properly.

Fortunately, CouchDB makes this aspect of building Twitter applications incredibly easy to do through the _changes API. The _changes API is a powerful, easy to use mechanism for receiving notices when a change has been made to your CouchDB database. Each time a new document is inserted into a CouchDB database, or an existing document is updated, the _changes API will provide notice. The book “CouchDB: The Definitive Guide” has an entire section on the _changes API that goes into great detail about all the ways it can be used.

For the purposes of this post, let’s focus on the continuous changes API. This API lets you set up a single, persistent HTTP connection to CouchDB and get notices each and every time a document is inserted or updated. You can access the continuous changes API by simply running the following at the command line:

curl -X GET "http://127.0.0.1:5984/twittertest/_changes?feed=continuous"

If you run this command in a separate terminal window you’ll see the following:

{"seq":1,"id":"7911766753","changes":[{"rev":"1-daac85c534df91d607ea8a7aa8f46fb2"}]}

This is the first change (in sequence order) that has happened in the specified CouchDB database since it was created.

There are lots of options you can use to refine how the HTTP connection to the _changes API works and which specific changes your Twitter application acts on. If the HTTP client you’re using is picky about how long it will keep a connection open without getting a response, you can use the “heartbeat” parameter to tell CouchDB to send a newline character at specific intervals, to tell your HTTP client that the connection is still alive.

You can also use the “since” parameter to specify the change sequence you want to act on. For example, if you run the following example in your terminal, you won’t see anything (yet) because there has been only one change to the database:

curl -X GET "http://127.0.0.1:5984/twittertest/_changes?feed=continuous&since=2"
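If you would rather consume the feed from code than from curl, a small Python sketch might look like this. The helper names are hypothetical; the query parameters are the ones described above (feed, since, and heartbeat, with the heartbeat interval given in milliseconds), and the parser simply skips the empty heartbeat lines.

```python
import json
from urllib.parse import urlencode


def changes_url(db_url, since=None, heartbeat_ms=None):
    """Build a continuous _changes URL with optional since and heartbeat parameters."""
    params = {"feed": "continuous"}
    if since is not None:
        params["since"] = since
    if heartbeat_ms is not None:
        params["heartbeat"] = heartbeat_ms
    return f"{db_url}/_changes?{urlencode(params)}"


def parse_changes(lines):
    """Yield one dict per change line, ignoring the empty lines sent as heartbeats."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        line = raw.strip()
        if not line:
            continue  # heartbeat newline, nothing to process
        yield json.loads(line)
```

The lines passed to parse_changes() can come from any HTTP client that lets you iterate over a streaming response body line by line.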

Even more powerful, you can specify filters and apply them to the _changes API to access only the changes that meet specific tests. Filters live inside design documents and can be applied by using the “filter” parameter. For example, consider a Twitter application that processes incoming @mentions stored in a CouchDB database through a series of steps.

Let’s say that something specific needs to happen on the third step of processing Tweets in your application (e.g., sending a response back to the user through the Twitter API). We want to use the _changes API to tease out any changes that are at step 3. We can do this by looking at the revision number in the _rev field of each document:

{
  "_id": "_design/process_tweet",
  "filters": {
    "step_3": "function(doc, req) { return doc._rev.indexOf('2-') === 0; }"
  }
}

When we define our filter like this, we can now access the _changes API to ensure we only access changes to status messages that have reached the third step in processing.

To test this filter, you can open up Futon and access the twittertest database. The Twitter status update we inserted earlier in this post should have a revision starting with a “1”. When you access that document in Futon and save it (thereby changing its revision to begin with a “2”), you will see it come through in the HTTP connection we have set up to the _changes API that uses our custom process_tweet filter. For example:

curl -X GET "http://127.0.0.1:5984/twittertest/_changes?feed=continuous&filter=process_tweet/step_3"
{"seq":3,"id":"7911766753","changes":[{"rev":"2-0e701a74cb89c65853fc25dbff152364"}]}
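The same check can also be made on the client side once a change line has been decoded. The Python sketch below uses hypothetical helper names and assumes the change has the shape shown in the output above. It compares the whole number before the dash, so a revision such as "20-..." is not mistaken for step 2.

```python
def rev_generation(rev):
    """Return the integer prefix of a CouchDB revision string such as '2-0e70...'."""
    return int(rev.split("-", 1)[0])


def at_generation(change, generation):
    """True if any revision listed in a decoded change line has the given generation."""
    return any(rev_generation(entry["rev"]) == generation for entry in change["changes"])
```

Filtering on the server with a filter function saves bandwidth; a client-side check like this is mainly useful as a safety net or while experimenting.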

Towards Easier Twitter Apps

As you can see, there are lots of reasons why CouchDB is ideal for building Twitter applications. It’s easy to use, incredibly powerful and has lots of cool built in features that can help Twitter application developers.

The primary challenge I faced in building TweetMy311 was time —  I was the only developer on the project and my time to get it up and running was very limited. I didn’t want this constraint to diminish what the application could do, and I also wanted to make my approach repeatable - so I can build more cool Twitter apps in the future. CouchDB was the perfect choice for my project, and I’m proud to use it on a project that will soon help lots of people make their communities better.

The life of a Twitter developer can be challenging on the best of days —  make it easier, with CouchDB.

July 20, 2010 12:24 PM

July 14, 2010

Randall Leeds

CouchDB Turns 1.0!

Today CouchDB had its first "software birthday"! CouchDB is now officially tagged with a 1.0 release number, a mark often reserved for projects which have emerged from a development and beta-testing period as stable and production-ready systems. For users this means reliability and for developers it means a stable platform to build upon. Exciting times, indeed. Appropriately, this is my first post on this new blog and it runs on CouchDB thanks to the good folks at Couch.io!

My personal CouchDB story started at The Open Planning Project where I spent a summer working on the (now discontinued) Melkjug project. It was to be a news aggregation tool with a focus on ease of tuning: the news you want to read. I was in heaven. I had an internship working on free software with cool people who promised a better civic future. Idealism seemed a misnomer because this felt like very real progress.

Luke, Melkjug's lead developer, had played around with CouchDB as a storage engine, but had let it fall by the wayside in favor of active feature development. After reading up a little bit, I found that CouchDB seemed a perfect fit for storing Atom or RSS news items. After all, the world has several different feed formats (while RSS prevails in common usage as an umbrella term for the whole bunch) and CouchDB is schema-less! In other words, when you're storing documents it just makes sense to use a document database (duh?).

So I took it upon myself to revive the CouchDB storage backend. While my involvement with Melkjug ended at the end of the summer, my interest in CouchDB continued quietly. My last year of undergraduate education followed, full of multiprocessing, (mostly functional) programming languages, and distributed systems. CouchDB, armed with Erlang and a bad-ass replication model, pulsed quietly at the center of all these things. I lurked on the mailing list. I read all the developer e-mails. I contributed little, but thought a lot.

Finally, in the spring of 2009, with Joe Armstrong's Programming Erlang under my arm, I made a bold proposal to add clustering support to CouchDB. With the classic naivety of an over-eager student I proposed to do in a summer what Meebo and Cloudant were actively engaged in doing with multi-person development teams. No matter that the project never got off the ground, because it got me in touch with Meebo, and since last September I've been happily contributing to CouchDB-Lounge.

CouchDB has been my triumphant return to free software development. CouchDB got me programming again when I had been focused on theory.

How did CouchDB manage to keep me so hooked? Community. CouchDB has a community which is very much alive with the spirit of collaboration. At first I thought that Erlang scared off the realists and what was left was a bunch of science project geeks with pure hearts. That's just plain false, though. CouchDB is in the wild today, it has proven its usefulness, and it attracts the attention of all kinds of developers.

One side anecdote deserves mention. Last summer, while I was traveling in Europe, my bank halted and then (repeatedly) failed to reinstate my debit card. Without having ever met me in person, Jan Lehnardt, an early CouchDB committer and advocate, spotted me a plane ticket from Oslo to Berlin. That may sound absolutely crazy, but so does the idea of free software to the old institutions of production. Jan had talked to me extensively on IRC and the CouchDB developer mailing list, and was largely responsible for putting me in touch with Meebo (and by extension, for getting me a job). He was (is) so sincerely interested in growing the CouchDB community and meeting like-minded developers, and trusted so deeply the goodness of those in the community, that he didn't think twice about extending this helping hand. Trekking around Berlin with my copy of Christopher Kelty's Two Bits and thinking about Chris Anderson's crazy ideas, I felt inspired. I felt like part of a revolution. I still do.

Since that time, the Lounge has grown into an even more serious project, supporting day-to-day storage needs at Meebo. It has moved to github to facilitate involvement (please join us!). I've contributed a bunch of performance and bug fix patches to the CouchDB code-base and made a bunch of friends in an amazing community that continues to motivate me every day.

Thanks, everybody. I expect this next year to be even better.

by tilgovi at July 14, 2010 08:07 PM

Volker Mische

How I met CouchDB

It was a Saturday in late April 2008, and I was sitting at my laptop in my 5m² room down under, chatting with some German people I had been chatting with for about 8 years by that time. Suddenly I discovered that Jan, who I hadn't talked with for years, was there. When I wondered why he was in there, he replied that he wanted to brag about his apache.org email address. This is how I found out about CouchDB.

After several long discussions with Jan I finally wrapped my head around the document-oriented concept. I was blown away: it was exactly what I would have liked to use on so many occasions during my one-year internship at a geospatial company. CouchDB wasn't ready for that, though, as I needed spatial indexing. One week later I had a first idea of what such an extension might look like.

And only 2 years later I'm really involved in CouchDB and people are actually starting to use GeoCouch :) I'd like to use this blog post to thank the developers and the whole community. It's been a great time and the IRC channel just kicks ass. You all helped to make CouchDB 1.0 possible!

by Volker Mische at July 14, 2010 10:33 AM

July 13, 2010

Ricky Ho

Google Pregel Graph Processing

A lot of real-life problems can be expressed in terms of entities related to each other and are best captured using graphical models. Well-established graph theory can then be applied to process the graph and return interesting results. The general processing patterns can be categorized as follows:
  1. Capture (e.g. When John is connected to Peter in a social network, a link is created between two Person nodes)
  2. Query (e.g. Find all of John's friends of friends who are younger than 30 and married)
  3. Mining (e.g. Find out the most influential person in Silicon Valley)

Distributed and Parallel Graph Processing

Although using a graph to represent a relationship network is not new, the size of networks has increased dramatically in the past decade, to the point where storing the whole graph in one place is no longer feasible. Therefore, the graph needs to be broken down into multiple partitions stored in different places. Traditional graph algorithms that assume the whole graph resides in memory are no longer valid, so we need to redesign them to work in a distributed environment. On the other hand, by breaking the graph into partitions, we can manipulate it in parallel to speed up processing.

Property Graph Model

The paper “Constructions from Dots and Lines” by Marko A. Rodriguez and Peter Neubauer illustrates the idea very well. Basically, a graph contains nodes and arcs.

A node has a "type" which defines a set of properties (name/value pairs) that the node can be associated with.

An arc defines a directed relationship between nodes, and hence contains the fromNode, toNode as well as a set of properties defined by the "type" of the arc.
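
A minimal Python sketch of this model follows. The class, field and helper names (Node, Arc, connect, in_e, out_e, in_v, out_v) are illustrative assumptions rather than anything defined in the paper; they mirror the "inE"/"outE" and "inV"/"outV" terms used later in this post, with out_v as the fromNode and in_v as the toNode.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality avoids recursing through cyclic references
class Node:
    node_id: str
    type: str
    properties: dict = field(default_factory=dict)
    in_e: list = field(default_factory=list)   # arcs pointing at this node
    out_e: list = field(default_factory=list)  # arcs leaving this node

@dataclass(eq=False)
class Arc:
    type: str
    out_v: Node  # fromNode: the node the arc leaves
    in_v: Node   # toNode: the node the arc points to
    properties: dict = field(default_factory=dict)

def connect(from_node, to_node, arc_type, **properties):
    """Create a directed arc and register it on both of its end nodes."""
    arc = Arc(arc_type, from_node, to_node, properties)
    from_node.out_e.append(arc)
    to_node.in_e.append(arc)
    return arc

# Hypothetical sample data.
john = Node("john", "Person", {"age": 28})
peter = Node("peter", "Person")
arc = connect(john, peter, "friend", since=2009)
print(arc.out_v.node_id, "->", arc.in_v.node_id, arc.properties)
```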



General Parallel Graph Processing

Most graph processing algorithms can be expressed as a combination of "traversal" and "transformation".

Parallel Graph Traversal

A "traversal" can be expressed as a path which contains a sequence of segments. Each segment contains a traversal from a node to an arc, followed by a traversal from an arc to a node. In Marko and Peter's model, a Node (Vertex) contains a collection of "inE" and another collection of "outE". An Arc (Edge), on the other hand, contains one "inV" and one "outV". So to express a "friend-of-a-friend" relationship over a social network, we can use the following:

./outE[@type='friend']/inV/outE[@type='friend']/inV

Loops can also be expressed in the path. To express all persons that are reachable from this person, we can use the following:

.(/outE[@type='friend']/inV)*[@cycle='infinite']

On the implementation side, a traversal can be processed in the following way:
  1. Start with a set of "context nodes", which can be defined by a list of node ids or by a search criterion (in which case the search result determines the starting context nodes).
  2. Repeat until all segments in the path are exhausted: perform a walk from all context nodes in parallel, evaluating all outward arcs (i.e. outE) against the conditions (i.e. @type='friend'). The nodes that these arcs point to (i.e. inV) become the context nodes of the next round.
  3. Return the final context nodes.

Such a traversal path can also be used to express inferred (or derived) relationships, which don't have a physical arc stored in the graph model.
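
The traversal steps above can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions: the graph is a plain list of (from_node, arc_type, to_node) tuples, the function names traverse and reachable are hypothetical, and each round runs sequentially even though the per-node work inside a round is independent and could be spread over partitions.

```python
def traverse(arcs, context_nodes, path):
    """Walk `path` (a list of arc types, one per segment) from the context nodes."""
    context = set(context_nodes)
    for arc_type in path:
        # One round: follow every matching outward arc of every context node.
        # The nodes those arcs point to become the next round's context nodes.
        context = {to_node for (from_node, kind, to_node) in arcs
                   if from_node in context and kind == arc_type}
    return context

def reachable(arcs, start, arc_type):
    """Repeat a one-segment traversal until no new node is discovered."""
    seen, frontier = set(), {start}
    while frontier:
        frontier = traverse(arcs, frontier, [arc_type]) - seen
        seen |= frontier
    return seen

# Hypothetical sample data.
arcs = [
    ("john", "friend", "peter"),
    ("john", "friend", "mary"),
    ("peter", "friend", "paul"),
    ("mary", "friend", "john"),
    ("mary", "colleague", "zoe"),
]
# Friend-of-a-friend: two "friend" segments starting from john.
print(sorted(traverse(arcs, ["john"], ["friend", "friend"])))
# Everyone reachable from john over "friend" arcs.
print(sorted(reachable(arcs, "john", "friend")))
```

Note that john shows up in his own friend-of-a-friend result because mary links back to him; a real query would probably filter out the starting node.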

Parallel Graph Transformation

The main goal of graph transformation is to modify the graph. This includes modifying the properties of existing nodes and arcs, creating new arcs and nodes, and removing existing arcs and nodes. The modification logic is provided by a user-defined function, which is applied to all active nodes.

The graph transformation process can be implemented in the following steps:
  1. Start with a set of "active nodes", which can be defined by a list of node ids or by a search criterion (in which case the search result determines the starting active nodes).
  2. Repeat until there are no more active nodes: execute the user-defined transformation, which modifies the properties of the active nodes and their outward arcs. It can also remove outward arcs or create new arcs that point to existing or new nodes (in other words, the graph connectivity can be modified). It can also send messages to other nodes (which will be picked up in the next round) as well as receive messages sent from other nodes in the previous round.
  3. Return the transformed graph, or perform a traversal to return a subset of the transformed graph.

Google's Pregel

Pregel can be thought of as a generalized parallel graph transformation framework. In this model, the most basic (atomic) unit is a "node" that contains its properties, its outward arcs (and their properties), and the id (just the id) of the node that each outward arc points to. The node also has a logical inbox to receive all messages sent to it.


The whole graph is broken down into multiple "partitions", each containing a large number of nodes. A partition is a unit of execution and typically has an execution thread associated with it. A "worker" machine can host multiple partitions.


The execution model is based on the BSP (Bulk Synchronous Parallel) model. In this model, multiple processing units proceed in parallel through a sequence of "supersteps". Within each superstep, each processing unit first receives all messages delivered to it from the preceding superstep, then manipulates its local data and may queue up messages that it intends to send to other processing units. This happens asynchronously and simultaneously among all processing units. The queued messages are delivered to the destination processing units but won't be seen until the next superstep. When all processing units have finished message delivery (the synchronization point), the next superstep can start, and the cycle repeats until the termination condition has been reached.
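
To make the superstep cycle concrete, here is a minimal single-process sketch in Python. It is not Pregel's API: the function name run_supersteps, the dictionary-based graph and the "spread the largest value" algorithm are assumptions chosen for illustration, and the loop over nodes runs sequentially where a real system would run partitions in parallel. What it shows is that messages queued in one superstep only become visible in the next one, and that the computation stops once no messages are in flight.

```python
def run_supersteps(out_arcs, values):
    """Simulate BSP supersteps that spread the largest value through the graph.

    out_arcs: {node_id: [neighbour ids]}, values: {node_id: initial number}.
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    inbox = {node: [] for node in out_arcs}
    active = set(out_arcs)  # every node is active in superstep 0
    superstep = 0
    while active:
        # Messages queued here are delivered only at the end of the superstep.
        outbox = {node: [] for node in out_arcs}
        for node in active:
            new_value = max([values[node]] + inbox[node])
            if superstep == 0 or new_value > values[node]:
                values[node] = new_value
                for neighbour in out_arcs[node]:
                    outbox[neighbour].append(new_value)
            # Otherwise the node learned nothing new and stays silent.
        # Synchronization point: queued messages become visible only now.
        inbox = outbox
        active = {node for node, messages in inbox.items() if messages}
        superstep += 1
    return values, superstep

# Hypothetical sample graph: a <-> b <-> c.
out_arcs = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final_values, supersteps = run_supersteps(out_arcs, {"a": 3, "b": 6, "c": 1})
print(final_values, supersteps)
```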

Notice that, depending on the graph algorithm, the assignment of nodes to partitions may affect overall performance. Pregel provides a default assignment where partition = nodeId % N, but users can override this assignment algorithm if they want. In general, it is a good idea to put close-neighbor nodes into the same partition so that messages between these nodes don't need to flow over the network, which reduces communication overhead. Of course, this also means that traversing neighboring nodes all happens within the same machine, which hinders parallelism. This is usually not a problem when the context nodes are very diverse. In my experience of parallel graph processing, coarse-grained parallelism is preferred over fine-grained parallelism as it reduces communication overhead.
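
The effect of the assignment function can be illustrated with a small Python sketch. The default below is the nodeId % N rule mentioned above; everything else (the function names, the by_cluster override and the sample graph of two three-node clusters) is a made-up example, not part of Pregel. Counting the arcs that cross partitions gives a rough proxy for how many messages would have to travel over the network.

```python
def default_partition(node_id, num_partitions):
    """The default rule described above: partition = nodeId % N."""
    return node_id % num_partitions

def assign(node_ids, num_partitions, partition_fn=default_partition):
    """Group node ids by the partition that partition_fn assigns them to."""
    partitions = {p: [] for p in range(num_partitions)}
    for node_id in node_ids:
        partitions[partition_fn(node_id, num_partitions)].append(node_id)
    return partitions

def cut_arcs(arcs, num_partitions, partition_fn=default_partition):
    """Count arcs whose two end nodes live in different partitions."""
    return sum(1 for a, b in arcs
               if partition_fn(a, num_partitions) != partition_fn(b, num_partitions))

def by_cluster(node_id, num_partitions):
    """Hypothetical override that keeps ids 0-2 together and ids 3-5 together."""
    return node_id // 3

# Two clusters of close neighbours: 0-1-2 and 3-4-5.
arcs = [(0, 1), (1, 2), (3, 4), (4, 5)]
print(assign(range(6), 2), cut_arcs(arcs, 2))
print(assign(range(6), 2, by_cluster), cut_arcs(arcs, 2, by_cluster))
```

In this toy graph the modulo rule splits every arc across the two partitions, while the locality-aware override splits none of them.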

The complete picture of execution can be implemented as follows:


The basic processing unit is a "thread" associated with each partition, running inside a worker. Each worker receives messages from the previous superstep in its "inQ" and dispatches each message to the partition in which the destination node resides. After that, a user-defined "compute()" function is invoked on each node of the partition. Notice that there is a single thread per partition, so nodes within a partition are executed sequentially, and the order of execution is nondeterministic.

The "master" plays a central role in coordinating the execution of supersteps in sequence. It signals the beginning of a new superstep to all workers once it knows that all of them have completed the previous one. It also pings each worker to learn its processing status and periodically issues a "checkpoint" command to all workers, which then save their partitions to a persistent graph store. Pregel doesn't define or mandate the graph storage model, so any persistence mechanism should work. There is a "load" phase at the beginning, where each partition starts empty and reads a slice of the graph storage. For each node read from storage, a "partition()" function is invoked: the node is loaded into the current partition if the function returns the current partition; otherwise the node is queued to the partition to which it is assigned.

Fault resilience is achieved through the checkpoint mechanism, in which each worker is instructed to save its in-memory graph partitions to the graph storage periodically (at the beginning of a superstep). If a worker is detected to be dead (not responding to the "ping" message from the master), the master instructs the surviving workers to take over the partitions of the failed worker. The whole computation is reverted to the previous checkpoint and proceeds again from there (even the healthy workers need to redo the previous processing). The Pregel paper mentions a potential optimization: re-execute only the failed partitions from the previous checkpoint by replaying the previously received messages. Of course, this requires keeping a log of all messages received between nodes at every superstep since the previous checkpoint. This optimization, however, relies on the algorithm being deterministic (in other words, the same input executed at a later time produces the same output).

A further optimization is available in Pregel to reduce network bandwidth usage. Messages destined for the same node can be combined using a user-defined "combine()" function, which is required to be associative and commutative. This is similar to the combine() method in the Google Map/Reduce model.
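
As a rough Python sketch of the idea (the function name combine_messages and the message format are assumptions, not Pregel's actual interface), a worker could fold all queued messages for the same destination into one before sending them. Because the combine function is associative and commutative, the result does not depend on the order in which messages were queued.

```python
def combine_messages(outgoing, combine):
    """Fold messages bound for the same node into a single message.

    outgoing: iterable of (destination_node, message) pairs.
    combine: associative and commutative function of two messages.
    """
    combined = {}
    for destination, message in outgoing:
        if destination in combined:
            combined[destination] = combine(combined[destination], message)
        else:
            combined[destination] = message
    return combined

# Hypothetical queue of four messages for two destination nodes.
outgoing = [("b", 3), ("c", 6), ("b", 1), ("b", 6)]
print(combine_messages(outgoing, max))                 # keep only the largest value
print(combine_messages(outgoing, lambda x, y: x + y))  # or send one running sum
```

Four queued messages shrink to two, one per destination node.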

In addition, each node can emit an "aggregate value" at the end of "compute()". The worker invokes a user-defined "aggregate()" function that aggregates all of its nodes' aggregate values into a partition-level aggregate value, and so on all the way up to the master. The final aggregated value is made available to all nodes in the next superstep. This aggregate value can be used to calculate summary statistics across nodes as well as to coordinate the progress of the processing units.
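
The two-level aggregation can be sketched in Python as follows. The function name aggregate_superstep and the list-of-lists input are illustrative assumptions; the point is only that the same user-defined function is applied first within each partition and then across partitions.

```python
from functools import reduce

def aggregate_superstep(partitions, aggregate):
    """Reduce node values per partition, then reduce the partition results.

    partitions: list of lists, one list of emitted node values per partition.
    aggregate: user-defined function combining two values into one.
    """
    # Partition level: each worker folds the values emitted by its own nodes.
    partition_values = [reduce(aggregate, node_values)
                        for node_values in partitions if node_values]
    # Master level: fold the partition-level values into the final value.
    return reduce(aggregate, partition_values)

# Hypothetical values emitted by the nodes of three partitions.
emitted = [[3, 6], [1], [4, 2]]
print(aggregate_superstep(emitted, max))
print(aggregate_superstep(emitted, lambda a, b: a + b))
```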

I think the Pregel model is general enough for a large portion of classical graph algorithms. I'll cover how we map these traditional algorithms to Pregel in subsequent postings.

Reference

http://www.slideshare.net/slidarko/graph-windycitydb2010

by Ricky Ho (rickyphyllis@gmail.com) at July 13, 2010 06:53 PM

July 09, 2010

Couchio

Our next webcast, Flexible Scaling with CouchDB Replication is on July 14th at 10am PST

Chris Anderson will be presenting Flexible Scaling with CouchDB Replication next Wednesday, July 14th, at 10am PST.  Register now!

“CouchDB is known for having a flexible schemaless JSON storage API. But that is just the tip of the iceberg when it comes to flexibility. In this webcast we’ll learn how replication can be used to share data securely, build offline-capable applications, provide redundancy, and manage large-scale clusters. We’ll start with replication basics, then cover offline replication, filtered replication, and replication in a cluster.”

July 09, 2010 02:51 PM

July 08, 2010

Couchio

CouchCamp Whiteboard

I’ve created a wiki page to serve as a whiteboard of session ideas at CouchCamp 2010.

http://wiki.apache.org/couchdb/CouchCamp2010

Please contribute ideas and join the discussion.

July 08, 2010 10:15 PM

Josh Berkus speaking at CouchCamp

I’m proud to announce that Josh Berkus will be speaking at CouchCamp.

Josh is best known for his work in the PostgreSQL project, and has been a database consultant and engineer since 1995. He’s worked with MSSQL, MySQL, SQLite, and Oracle as well as CouchDB, Memcached and Redis, and at Greenplum and Sun, thus qualifying him as an “old relational database guy” at the age of 40. He’s also a darned good cook.

http://www.couch.io/couchcamp

w00t!

July 08, 2010 09:57 PM

July 07, 2010

Volker Mische

GeoCouch Talk in Augsburg

As part of the diploma candidates' colloquium of the Chair of Human Geography and Geoinformatics, I will give a talk about GeoCouch at the University of Augsburg on 19 July 2010 at 5:30 pm (room 2125). The exact title is:

GeoCouch: Eine Erweiterung für CouchDB zur Abfrage räumlicher Daten (GeoCouch: An Extension for CouchDB for Querying Spatial Data)

It is aimed at geographers, so it won't go too deep into the details of the implementation. No prior knowledge of CouchDB is required either. So anyone who wants to learn more about CouchDB and GeoCouch is warmly invited. Afterwards I will of course be available for questions.

I have no idea how big the CouchDB community in the Augsburg area is, but if anyone takes up this invitation, there is nothing to stop us from having a small CouchDB/GeoCouch/NoSQL meetup afterwards. It's best to let me know by email, because once a few people are definitely coming, others will surely consider it too.

Note to Planet CouchDB readers: this post was originally written in German because it is about a talk in German.

by Volker Mische at July 07, 2010 10:09 AM

June 30, 2010

Mikeal Rogers

CouchDB builtin reduce functions

Today was one of those “how did I not know about this?!” days.

At lunch @jchris mentioned that using the “builtin” sum reduce was up to 100x faster than using a JavaScript reduce function that does the same thing. Then of course my reaction was “wtf is a builtin reduce?”.

Apparently CouchDB has these awesome internal Erlang functions for common reduce operations. You just name one in your design document where the JavaScript reduce function would normally go, and CouchDB uses it instead of calling out to the view server. Since they don’t talk to the view server and can work on native Erlang terms, skipping the JSON serialization steps, they are absurdly fast. Like so fast that there is very little reason to ever write another JavaScript reduce.

It’s all written up on the wiki but here is the gist of it.

{
  "_id":"_design/company",
  "_rev":"12345",
  "language": "javascript",
  "views":
  {
    "all_customers": {
      "map": "function(doc) { if (doc.type == 'customer')  emit(doc.id, 1) }",
      "reduce" : "_count"
    },
    "total_purchases_by_customer": {
      "map": "function(doc) { if (doc.type == 'purchase')  emit(doc.customer_id, doc.amount) }",
      "reduce": "_sum"
    }
  }
}
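
To illustrate what these two views return, here is a small Python emulation. It runs entirely in memory and does not talk to CouchDB; the sample documents and the helper names are made up, and the real builtins work incrementally inside the database, so treat this as a sketch of the results only. "_count" yields the number of emitted rows and "_sum" adds up the emitted values, optionally grouped by key.

```python
from collections import defaultdict

# Hypothetical documents matching the two map functions above.
docs = [
    {"type": "customer", "id": "c1"},
    {"type": "customer", "id": "c2"},
    {"type": "purchase", "customer_id": "c1", "amount": 20},
    {"type": "purchase", "customer_id": "c1", "amount": 5},
    {"type": "purchase", "customer_id": "c2", "amount": 12},
]

def map_all_customers(doc):
    if doc.get("type") == "customer":
        yield doc.get("id"), 1

def map_total_purchases(doc):
    if doc.get("type") == "purchase":
        yield doc.get("customer_id"), doc.get("amount")

builtin_count = len  # stands in for "_count": number of rows
builtin_sum = sum    # stands in for "_sum": sum of the emitted values

def query(map_fn, reduce_fn, group=False):
    """Emulate querying a view, optionally grouping rows by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    if not group:
        return reduce_fn([value for _, value in rows])
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}

print(query(map_all_customers, builtin_count))
print(query(map_total_purchases, builtin_sum, group=True))
```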

by mikeal at June 30, 2010 09:27 PM

Couchio

New Case Study About Mu Dynamics!

We have posted our latest case study profiling how Mu Dynamics is using CouchDB. Thanks to Kowsik Guruswamy for sharing their story about how they are using CouchDB on their new site pcapr.net.

Kowsik Guruswamy, Founder & CTO of Mu Dynamics, says “The ability to dream of an application and bring it to life with CouchDB has just been an incredible, heady experience. With valuable utilities such as Collaborative Network Forensics and xtractr, pcapr has evolved to become the largest repository of packets on the Internet with around 60 million packets.”

June 30, 2010 02:39 PM

June 29, 2010

Couchio

CouchUp Thursday 6pm at Pacific Coast Brewery, Oakland

It’s time again to combine our favorite two things, drinks and CouchDB :)

We’ll be doing a CouchUp on Thursday the 1st of July at 6pm. The location is a local brewery just a few blocks from the 12th st BART station, Pacific Coast Brewing Co.



June 29, 2010 06:42 PM

Mikeal Rogers

CouchUp Thursday 6pm at Pacific Coast Brewery, Oakland

It’s time again to combine our favorite two things, drinks and CouchDB :)

We’ll be doing a CouchUp on Thursday the 1st of July at 6pm. The location is a local brewery just a few blocks from the 12th st BART station, Pacific Coast Brewing Co.


by mikeal at June 29, 2010 06:37 PM

June 22, 2010

Mikeal Rogers

Open Source BBQ, Sunday, June 27th 2010

Time for another Open Source BBQ!

All are welcome, contributions appreciated, I’m gonna start cooking around 3pm and probably close it down when the sun goes down.

Please RSVP:

http://www.mobaganda.com/opensourcebbq-june2010

or on Plancast

http://plancast.com/a/3pt9

by mikeal at June 22, 2010 10:13 PM

Couchio

New Whitepaper on migrating to CouchDB. Thanks @johnpwood!

When John Wood from Interactive Mediums was migrating from MySQL to CouchDB he started a blog series to detail the process. We found his blog posts, thought they were awesome, and asked if we could turn them into a paper to help others that might find themselves doing their own migration. Check out the just released whitepaper! And thanks again to John for doing such a great job on the blog and for being a pleasure to work with :)

June 22, 2010 07:19 PM

Damien Katz

Migrating From MySQL to CouchDB

Here is a detailed white paper from John P. Wood of Interactive Mediums, going into depth about the migration of their mobile marketing archive from MySQL to CouchDB.

http://www.couch.io/migrating-to-couchdb

by Damien Katz at June 22, 2010 06:08 PM