Planet CouchDB

July 04, 2009

Chris Anderson

Kings of Code slides

This talk has a bit more attitude than some of my others. I'm not sure if it was recorded, but that's cool: "you had to be there."

My favorite bit was getting to tell Steven Pemberton that "in the fullness of time there is only one CouchDB." I find the first-wave web guys really get it.

Also fun in this slide deck is a new bullet point on my "Why CouchDB?" slide: "Makes Google look old-school."

Kings of Code CouchDB slides here - 6.5MB pdf

by jchris at July 04, 2009 02:54 PM

Chris Strom

Don't Program RSS by Coincidence

Up tonight is the recipes RSS feed. For the most part, the inside, RSpec-driven work is the same as that for the meals RSS feed.

I do need to create a new recipes-by-date CouchDB view:
  # Design document holding the by_date view. The map function is kept on
  # one line because JSON strings cannot contain raw newlines.
  recipes_view = <<_JS
{
  "views": {
    "by_date": {
      "map": "function (doc) { if (typeof(doc['preparations']) != 'undefined') { emit(doc['date'], [doc['_id'], doc['title']]); } }"
    }
  },
  "language": "javascript"
}
_JS

  RestClient.put "#{@@db}/_design/recipes",
    recipes_view,
    :content_type => 'application/json'
That gets consumed by the Sinatra application:
  url = "#{@@db}/_design/recipes/_view/by_date?limit=10&descending=true"
  data = RestClient.get url
  @recipe_view = JSON.parse(data)['rows']
As with the meals RSS feed, I use the results of the CouchDB view to build the recipes RSS feed with RSS::Maker.
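
For anyone who has not looked at raw view output: a CouchDB view response is a JSON object whose "rows" each carry an id, a key and a value. Here is a minimal sketch of turning such rows into something a feed builder could loop over. The sample response body and the helper name recipe_items are made up for illustration and are not taken from the application.

```ruby
require 'json'

# Hypothetical response body, shaped like the output of the by_date view:
# the key is doc['date'] and the value is [doc['_id'], doc['title']].
SAMPLE_VIEW_RESPONSE = <<~END_JSON
  {"total_rows": 2, "offset": 0, "rows": [
    {"id": "2009-06-11-pbj", "key": "2009-06-11",
     "value": ["2009-06-11-pbj", "Peanut Butter and Jelly"]},
    {"id": "2009-06-01-pizza", "key": "2009-06-01",
     "value": ["2009-06-01-pizza", "Pizza"]}
  ]}
END_JSON

# Turn view rows into plain hashes, one per future feed item.
def recipe_items(response_body)
  JSON.parse(response_body)['rows'].map do |row|
    id, title = row['value']
    { :id => id, :title => title, :date => row['key'] }
  end
end

recipe_items(SAMPLE_VIEW_RESPONSE).each do |item|
  puts "#{item[:date]}  #{item[:title]}  (#{item[:id]})"
end
```

Because the view is queried with descending=true, the rows already arrive newest first, so no extra sorting should be needed on the Ruby side.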

With the inside work complete, I can move back out to the Cucumber scenario:
  Feature: RSS

    So that I tell my user when there are updates to this great cooking site
    As an RSS bot
    I want to be able to consume your RSS

    Scenario: Recipe RSS

      Given 20 delicious, easy to prepare recipes
      When I access the recipe RSS feed
      Then I should see the 10 most recent recipes
      And I should see the summary of each recipe
The first step is a re-used, earlier defined step. I can define the next two steps with:
  When /^I access the recipe RSS feed$/ do
    visit('/recipes.rss')
    response.should be_ok
  end

  Then /^I should see the 10 most recent recipes$/ do
    response.
      should have_selector("channel > item > title",
                           :count => 10)
    response.
      should have_selector("channel > item > title",
                           :content => "delicious, easy to prepare")
  end
When I get to the last step, I realize that I will have to either choose a different first step or define a new one. The last step calls for recipe summaries, but the "Given 20 delicious, easy to prepare recipes" step does not define them.

I will determine what to do with that first step tomorrow. I may also want to refactor a little.

by eee.c (chris.eee@gmail.com) at July 04, 2009 03:02 AM

July 03, 2009

Chris Anderson

The P2P Web (part one)

(Part one because this is only just some of what I'd like to say on the topic.)

The web was originally designed as a peer-to-peer medium. Tim Berners-Lee needed a way to share physics papers with his friends around the world. Since they were physicists, and the medium was simply published texts, the barriers to entry were low. All you had to do to become an independent publisher was run a copy of the NCSA web server, and point it to an HTML directory of your papers.

To make a long story short, the idea caught on. In the 15 years since, we've seen an explosion of uses for the web, alongside a steady increase in the complexity of web applications and deployments. Google's web index is a long way away from the humble list of all the university web servers of the early years. Even much smaller sites (like the ones you might be building right now) are more likely to consist of a service-oriented architecture and complex caching protocols than a directory index of static HTML files.

This complexity comes for a reason: the web has gotten bigger and users have grown to expect sites to integrate data from all kinds of sources. Even if your site is simple enough to be hosted as plain old HTML, you're still subject to the traffic spikes that millions of active users can bring without warning.

More importantly, all these changes have made it that much harder for the average user to run their own web sites. At the same time, companies have stepped into the gap to make it easy to publish a blog or connect with your friends (as long as you host your writing or your friendships on the company server). Most users have never considered controlling their own platforms. Instead, we see frustration about control manifest itself in ways that many of us geeks find amusing: clamoring to grab "your" Facebook URL before someone else does, pushing the limits of how ugly one can style a Myspace page, anger at site operators for not implementing your favorite imaginary features. These are all reactions to the very real power site operators have over their users.

There are also more-geeky reactions: the OpenID and Data-Portability movements, for instance. While these geek movements have their heart in the right place, I was never able to get excited about them. For as much as they see the value of giving control to users, they're still steeped in the idea of a web owned by vendors, where users can at best "go on strike" to demand more respect. The data-portability movement may have had users' interests at heart, but the ability to send your photos from Flickr to Picasa isn't a huge step up.

I suppose if there's a similar term to describe the p2p web, it'd be "application portability." When an application is designed to run against a client-based CouchDB node, not only can it (and the data it touches) be replicated from (say) Flickr to Picasa, it can also be replicated from Alice to Bob, or even modified by Alice and then replicated for publishing by any generic CouchDB hosting provider.

I tend to dig into the technical details of how this is accomplished, but I'll skip that part for today. Instead I'll just say it's easy. Once you've got a group of people with CouchDB on their laptops, it only takes a few minutes before they are actively sharing and aggregating data in an ad-hoc way. Since applications are just another form of data, they are replicated along with other changes. The big picture here seems to be a little bit harder for experienced developers to grasp than for new developers and end users.

If you've never jumped through the mental hoops required to build a modern web application cluster, if you've never struggled with defining indexes on a relational database or configuring an HTTP proxy, then you have less invested in the centralized model of application development. I've found that people who haven't yet learned how "to do it right" are quite comfortable with the relaxed model CouchDB provides. "You mean I just save this document and then load it again when I need it?" "I can get this same data onto your computer by clicking that button?"

What is simple for a user requires a lot of relearning on the part of experienced developers. The physics of the web are changing, and a lot of the hard work we had to do to make the centralized model function just doesn't apply to the p2p web. So far I've said more about the centralized web than the peer-to-peer web. I should at least mention that I don't see the centralized web going anywhere anytime soon. There will always be room for online shopping carts and even centralized message routers. But as a general rule of thumb, technology has followed the path of least resistance.

When users have the data they care about locally, they can afford to burn much more CPU time pulling interesting patterns from it than (say) Google can use to categorize a particular web page. This in turn frees the centralized services to provide what they do best: message routing and peer discovery. It's nice that Facebook can help me find photos of my friends, but it's frustrating that each time I visit an album I have to wait for image files to cross the wire. Social graph services of the future will be built on the assumption that users have data they may want to share with their friends but not the service provider. It's up to them to figure out how to meet that need.

by jchris at July 03, 2009 06:10 PM

July 02, 2009

Rodrigo Moya

GCDS expectations

With just a few hours before I leave for Gran Canaria, here’s a list of things I personally would like to get from the conference:

  • I’ve been to all GUADECs except for 2 (Stuttgart and Istanbul), and every time I’ve missed one, I was doubly excited to go to the next. Having missed last year’s, that is the case again this year.
  • Since for the first time we are having a joint KDE/GNOME conference, I am expecting a big push on collaboration and cooperation between the 2 projects. I am not sure what will come out of this, but we should all really be looking forward to it, since it would help both projects a lot. So, keep the rivalry only for the sport activities, please (maybe a KDE vs GNOME football game? :-) )
  • As I’ve already blogged about recently, we (at Canonical) are trying to push CouchDB use to the desktop. I’ve got all the code I’ve been working on ready to be shown (karmic packages here, but broken for jaunty right now, sorry), so if someone wants to see it in action (a technology preview, of course, not everything is done yet), just find me around and I’ll do a personal demo (a better demo if you buy me a beer :-D ). Other Canonical staff will be around also showing these (and other) technologies, so if interested, just ask.
  • GNOME 3.0 plans and technologies like mutter, gnome-shell.
  • I only played the FreeFA tournament in Vilanova (yeah, was part of the cool champion team), so I’m looking forward to defending the title :-D
  • Mojo Picón, a spicy hot sauce typical of the Canary Islands. Make sure you try the Papas Arrugadas with that sauce.
  • Have a lot of fun!

The only bad thing is that I’m going to miss the first few days of the San Fermín festival in Pamplona, but since I’ll be back home on the 10th, I’ll have the chance to enjoy the last few days of it. As I’ve said before, please use other dates than July 6th to 14th next year!

See you all in Gran Canaria!

by rodrigo at July 02, 2009 02:53 PM

Mikeal Rogers

Up for a Pint?

I’m in London for the next few days and would love to grab a drink with any community members be you Mozilla, CouchDB, Python, Windmill, JavaScript or just plain old coffee, whisky or beer geeks :)

by mikeal at July 02, 2009 01:55 PM

June 30, 2009

John Wood

CouchDB: Databases and Documents

This is part 2 in a series of posts that describe our investigation into CouchDB as a solution to several database related performance issues facing the TextMe application.

by John Wood at June 30, 2009 03:16 PM

June 27, 2009

CouchDB Podcast

June 26, 2009

Mikeal Rogers

Heading to EuroPython

I’m getting all packed up and leaving Sunday for EuroPython in Birmingham, UK.

This will be my first time at EuroPython and my first time in Europe!

I’ll be giving two talks, one on Windmill and one about CouchDB and Python. The Windmill talk will be more or less the talk that I gave at Open Source Bridge last week, which went very well. This is the first time I’ll be talking about CouchDB, the most exciting new technology on the web. The talk will mostly be about breaking our old data modeling habits that we developed to deal with SQL and what libraries and tools are available for interacting with CouchDB in Python.

I will also be in London for a few extra days after the conference so anyone interested in a meetup should ping me.

by mikeal at June 26, 2009 11:35 PM

June 24, 2009

Damien Katz

StackOverflow Podcast

Yesterday I did a StackOverflow podcast with Joel Spolsky and Jeff Atwood. We talked about CouchDB and Erlang, among other things: StackOverflow Episode 59

by Damien Katz at June 24, 2009 07:27 PM

June 20, 2009

Jan Lehnardt

EU Summer Tour

After the US Spring Tour in April this year, I’m about to embark on the EU Summer Tour.

I’ll be visiting London, Amsterdam, Zurich and Gran Canaria. Here’s when, how and why:

June 22nd–26th: CouchDB University & Factory, London

The CouchDB University is a three-day training course where J Chris and I teach a select group of students everything about CouchDB. Even with little prior knowledge, you’ll leave able to build amazing CouchDB applications at small and large scale, as well as extend CouchDB itself.

The CouchDB Factory is a track at the Erlang Factory running all day Friday.


June 29th–30th: Kings of Code, Amsterdam

Kings of Code looks like it is going to be a kick-ass web developer conference featuring some of my favourite web people: Geoffrey Grosenbach, Joe Stump & Francisco Tolmasky. I’m fairly confident that the other speakers will be among my favourites after Kings of Code :)

J Chris will be talking about CouchDB.

It’s still in discussion, but I might talk about CouchDB and Erlang for web developers on one of the side events.


July 1st–2nd: ICOODB, Zurich

I’ll be taking the night train from Amsterdam to Zurich to give a three hour tutorial as well as a 60 minute presentation on CouchDB at the International Conference on Object Databases. Strictly speaking, CouchDB is not an object-oriented database, but it stores objects and is of interest to the research community that meets in Zurich.

Prof. Stefan Edlich invited me to speak at ICOODB and I’m very happy I can make it.


July 1st–7th: GUADEC, Gran Canaria

Canonical, the kind folks behind the Ubuntu Linux distribution, are pushing CouchDB to become a centerpiece of the Ubuntu desktop data synchronization infrastructure. Merrily sync your contacts and calendar data between your machines and an online backup service, and share select data with your peers. And yeah, UbuntuOne is also related :)

Canonical is flying me out to attend the Linux Desktop Summit to talk to desktop application developers and show them how cool CouchDB is and where it is useful for them.

Also, Gran Canaria, I couldn’t say no. Thank you Canonical!


As much as I am excited about the travels and meeting all you out there, I’ll be missing three weeks in my favourite city, Berlin and it makes me a little sad.

by Jan (jan@apache.org) at June 20, 2009 12:09 PM

June 19, 2009

John Wood

Paginating Records in CouchDB via CouchRest

When I began looking into replacing some of TextMe’s large MySQL tables with CouchDB databases, one of the things I noticed right away was that pagination support was not quite there in CouchRest. I say “not quite there” because CouchRest does have the ability to fetch data from the database in paginated chunks, but [...]

by John Wood at June 19, 2009 06:37 PM

Rodrigo Moya

CouchDB contacts in Evolution

Continuing with my CouchDB on the desktop series, here’s the 1st screenshot:

Evolution addressbook showing contacts stored in CouchDB

It’s Evolution addressbook components showing contacts from a CouchDB database. As stated in previous posts, all contacts in that database would be automatically replicated to a remote CouchDB instance, so, for instance, you could just see and edit/delete/whatever them from a web interface, and the changes would show up in Evolution.

Code is in GNOME git, under couchdb-glib and evolution-couchdb modules.

by rodrigo at June 19, 2009 04:27 PM

June 17, 2009

Paul Joseph Davis

Forging - Beats TLAs

Avoiding hate-mail

I am a fan of testing. I write tests. I try and write good tests that test functionality and not that a computer can properly add integers. Testing is great for validating my mental model and checking that refactored code continues to conform to my mental model.

New analogy!

I just realized the reason why I dislike test driven development. My coding is like simulated annealing. I work up alternative solutions quickly and iterate through until I find a local extremum that appears to suck the least. My testing phases are more like a non-linear cooling optimization at the end. To me these phases are the final step that gives any particular solution its strength and confidence.

Slightly differently

I code like a blacksmith forges. Get the object of current obsession malleable, beat on it for a while and then temper the result. Face it. Forging etymologically kicks the crap out of all those TLAs.

June 17, 2009 04:00 AM

June 16, 2009

Jan Lehnardt

Caveats of Evaluating Databases

This is part two in a small series about measuring software performance. There’s a lot of common sense covered, but I feel it necessary to shed some light.

If you haven’t, check out part one.


Say you want to find out what’s behind the buzz of all these new #nosql databases. There’s a large number to choose from today. All options come in varying degrees of maturity and characteristics so it’d be nice to know what solves your problem best. A non-exhaustive list of these databases or storage systems includes Memcache[DB], Tokyo Cabinet / Tyrant, Project Voldemort, Scalaris, Dynomite, Redis, Persevere, MongoDB, Solr or my favourite, CouchDB. And these are just some of the open source ones.

This article is not a comprehensive comparison of any of the mentioned systems. Instead it tries to give you an idea about what to look for when evaluating a storage system or how to take into perspective evaluations and benchmarks others have done.

We’ll look at some of the technical aspects of data storage systems: Applying common sense when reading benchmarks; b-trees and hashing; speed vs. concurrency; networked systems and their problems; low level data storage (disks’n stuff); and data reliability on single-nodes and multi-node systems.

There are a lot of other reasons to decide for or against a project based on a lot of non-technical criteria, but things like commercial support or a healthy open source community are not part of this article.

Astounding Numbers

From time to time you see some crazy numbers posted to the reddits of the internets that claim fantastic performance.

The (imaginary) SuperfastDB can store 450,000 items per second!

Wow.

No word on where the items are stored (in memory? on a harddrive? Spindles? Solid State?), what an item is exactly and how big it is, the rest of the hardware this was run on and how to reproduce it.

But boy, 450,000 a second!

My shoes can do 650,000 a second, but you’ve got to figure out what.

Context is as important as reproducibility. The last article here established that finding out that my system and your system come up with different numbers is not much of a help. Any sort of serious test must come with a set of scripts or programs and comprehensive instructions on how the tests were run.


Everything “cool” in computer science has been around for 25+ years. Actual innovation is rare. Advancements in hardware and new combinations of existing solutions make for new stuff coming out each day (that’s a good thing), but the fundamental rules are the same for all. We’re all running von Neumann machines, quicksort is still pretty quick and hashes and b-trees rule the storage world.

Let’s recap.

Hashes & Trees

Hashing revolves around the idea of O(1) lookups. Allocate a number of buckets, create a function that gives you the number of a bucket for any data item you might want to store, and make sure no two data items hit the same bucket (or work around it when they do). At runtime you only need to ask your function where to look for or store your data. The catch is the allocation of your set of buckets: if you need to store more items than you have buckets, some more work is required, which gives you O(N) operations that you can’t ignore in practice.
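
To make the O(N) part concrete, here is a toy hash table in Ruby. It is only a sketch: the class name TinyHash, the chaining of collisions in arrays and the "double when full" growth rule are all choices made for this illustration, not a description of how any particular database does it.

```ruby
# A toy hash table: lookups touch one bucket, but growing the table
# has to move every stored item into a new set of buckets.
class TinyHash
  attr_reader :rehashed_items

  def initialize(bucket_count = 4)
    @buckets = Array.new(bucket_count) { [] }
    @size = 0
    @rehashed_items = 0 # counts the O(N) work done while growing
  end

  def []=(key, value)
    grow if @size >= @buckets.length
    bucket = bucket_for(key)
    pair = bucket.find { |k, _| k == key }
    if pair
      pair[1] = value # overwrite an existing key
    else
      bucket << [key, value]
      @size += 1
    end
  end

  def [](key)
    pair = bucket_for(key).find { |k, _| k == key }
    pair && pair[1]
  end

  private

  # The hash function: turn a key into a bucket.
  def bucket_for(key, buckets = @buckets)
    buckets[key.hash % buckets.length]
  end

  # Double the bucket count and move every item over.
  def grow
    bigger = Array.new(@buckets.length * 2) { [] }
    @buckets.each do |bucket|
      bucket.each do |pair|
        bucket_for(pair[0], bigger) << pair
        @rehashed_items += 1
      end
    end
    @buckets = bigger
  end
end

table = TinyHash.new(4)
100.times { |i| table["key#{i}"] = i }
puts "lookup: #{table['key42']}, items moved while growing: #{table.rehashed_items}"
```

Storing 100 items in a table that starts with 4 buckets moves more items around than were ever inserted, which is the kind of hidden cost a benchmark with a pre-sized table will not show.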

The other elephant in the room is the b-tree. The fundamental idea here is to get to your data in a minimal number of steps traversing a tree, because making a step is expensive, but reading your data is very fast comparatively. Steps are expensive because they translate to a head seek (that is, the time your spinning hard drive needs to position the reading arm over the spot to read your data from), but reading from a harddrive once the reading head is in place is fast.
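
A back-of-the-envelope sketch of why wide nodes matter. This is a rough model only: it assumes every node holds the same number of entries (the "fanout"), and the 10 ms per seek is a number assumed for illustration, not a measurement.

```ruby
# Rough model: every node holds `fanout` entries, so each extra level
# multiplies the number of reachable items by `fanout`.
def tree_levels(item_count, fanout)
  levels = 1
  capacity = fanout
  while capacity < item_count
    capacity *= fanout
    levels += 1
  end
  levels
end

SEEK_MS = 10 # assumed cost of one head seek on a spinning disk

[2, 100, 1000].each do |fanout|
  levels = tree_levels(1_000_000, fanout)
  puts "fanout #{fanout}: #{levels} levels, roughly #{levels * SEEK_MS} ms of seeking"
end
```

In this model a binary tree needs 20 steps to reach one of a million items, while a tree with 100 entries per node needs 3.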

There are a bunch of more interesting lookup structures like R-Trees for spatial queries, but they are mostly used for secondary indexes on top of a regular data set that lives in a hash or b-tree.

Concurrency vs. Speed

Concurrency is hard. The devil lies in the details and when briefly looking at things, the details are often overlooked. Suits the devil.

Creating storage systems that assume only one access occurs at a time is relatively easy. If resources are shared concurrently, things become tricky. The two larger schools of thought (and practice) are locking and no-locking (heh).

Locking means that the database has to maintain information for everybody who wants to write to a part of the database, and what part it is.

No locking, or optimistic locking or MVCC moves that burden to the person who is trying to write to the database. She must prove that she won’t be overwriting any existing data.

The trade-off here is leaner request handling on the server, which works well with remote & concurrent clients, at the expense of more complexity on the client (the person who wants to store something in our database).

Hybrid approaches are possible too: While MVCC is used internally, the database’s clients can rely on database-side locking (e.g. PostgreSQL or InnoDB).
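
Here is what the optimistic approach looks like from the writer’s side, as a minimal in-memory sketch. The names TinyStore and ConflictError and the integer revision counter are invented for this example; real systems use their own revision formats.

```ruby
# A toy document store with optimistic concurrency: no locks are held,
# but every write must name the revision it was based on.
class ConflictError < StandardError; end

class TinyStore
  def initialize
    @docs = {} # id => { :rev => Integer, :body => Hash }
  end

  def get(id)
    @docs[id]
  end

  # Stores the body and returns the new revision. A write based on a
  # stale (or missing) revision is rejected instead of overwriting data.
  def put(id, body, rev = nil)
    current = @docs[id]
    current_rev = current ? current[:rev] : nil
    raise ConflictError, "document #{id} changed since it was read" unless rev == current_rev

    new_rev = (current_rev || 0) + 1
    @docs[id] = { :rev => new_rev, :body => body }
    new_rev
  end
end

store = TinyStore.new
store.put("contact", { "phone" => "555-1234" })

# Two writers read the same revision...
alice_rev = store.get("contact")[:rev]
bob_rev = store.get("contact")[:rev]

# ...Alice writes first and wins, Bob has to re-read and try again.
store.put("contact", { "phone" => "555-0000" }, alice_rev)
begin
  store.put("contact", { "phone" => "555-9999" }, bob_rev)
rescue ConflictError => e
  puts "Bob's write was rejected: #{e.message}"
end
```

The server does one comparison per write and keeps no per-client state. The retry logic, which is the extra complexity mentioned above, lives entirely in the client.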

Networks

Just a quick note: We already talk about client and server here. There is a strong case for embedded databases like SQLite that don’t expose a concurrent user model to the outside. The program that needs an embedded database just includes it.

Another approach to using databases is having a dedicated computer running a database system and sharing it over the network with any number of clients using this database server. The server can also be “a bunch of servers” or a cluster. More on that later.

A separate database server (networked or not) will need to spend some time to deal with connections, network failures, unspecified client behaviour and so on. The upside is a piece of infrastructure that can be maintained separately. An embedded database will thus be faster but probably won’t solve all of your problems and it will always be tied to your application.

fsync(): Reliability vs. Speed

When people tell me “SuperfastDB does 450,000 a second!” I ask “How many fsync()s is that?”. Let me explain:

A database system uses operating system services to use any hardware. The operating system exposes a harddrive through a filesystem. The database system talks to the filesystem and asks it to store or retrieve data on its behalf. The filesystem then goes ahead and tries to satisfy the database’s requests.

(I’ll not talk about databases that can use raw block devices to store data. They exist but they are not as common as those that use the filesystem.)

The filesystem also tries to be clever – for good reasons. When the database requests a piece of data, the filesystem will not only find that piece and return it, it will also store it in a cache to avoid having to actually talk to the harddrive the next time this piece of data gets requested. When the data changes, the filesystem either removes it from the cache or updates it along with the harddrive. It might even go further and only store the new data that comes in with a write request in the cache and rely on a periodic task to write all of the cache back to the drive. Writing a bunch of pieces at once is more efficient than storing each one on its own.

More efficient equals faster and faster is good, right? Well, it depends: If all goes well, this approach is a nice one. But you know computers, things will not go well 100% of the time. The failure scenarios are endless, but they boil down to the question: “What happens when your machine dies and you have data that has only been written to memory?” The answer isn’t too hard: That data is lost. If there is a delay between a write request finishing and data being written (or “flushed”) to disk, any data that has been “written” during the delay period is subject to loss.

There are cases where this is not a problem; in other cases it is. A developer should have the chance to decide. (Note that even your hardware could be lying to you about having stored data, but I’ll punt on this one, get proper hardware).

So, flushing to disk needs to happen before you can rest assured your data has been stored. Your operating system has an API call that forces the filesystem to write its cache to disk. It is called fsync() (on UNIX systems) and it is an expensive operation. You can only do so many fsync()s in a second and it is not a great many.

The 450,000 items were most likely just written to memory and not to disk.
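
If you want to see the difference on your own machine, here is a small sketch using Ruby’s IO#fsync. The helper names write_records and timed, the record count and the temp files are all choices made for this example, and the timings will depend entirely on your disk and filesystem.

```ruby
require 'tempfile'

# Write one record per line. With sync_each, every record is forced to
# disk before the next one is written; without it, the operating system
# is free to keep the data in its cache for a while.
def write_records(path, records, sync_each:)
  File.open(path, 'w') do |file|
    records.each do |record|
      file.write(record + "\n")
      file.fsync if sync_each
    end
  end
end

# Return the seconds a block takes to run.
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

records = Array.new(200) { |i| "item-#{i}" }
buffered_file = Tempfile.new('buffered')
durable_file = Tempfile.new('durable')

buffered = timed { write_records(buffered_file.path, records, sync_each: false) }
durable = timed { write_records(durable_file.path, records, sync_each: true) }

puts format("buffered: %.4fs, fsync per record: %.4fs", buffered, durable)
```

Both files end up with the same contents. The only difference is when you are allowed to believe the data is safe.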

Space & In-Place

When writing the files that represent what lives in a database to disk (at the end of the day, your data ends up in one file or another on the filesystem), there are multiple options to handle updates.

An update is a change to your data item, for example, a new phone number. The intuitive way to handle this is to go and find the old phone number in the file, and overwrite it with the new number. Easy.

There are several problems with this approach: What to do if the new phone number is longer than the old one (say you added an international calling prefix)? The new number needs to be written to a different place and the change in location must be recorded. Not too big of an issue.

Back to failure scenarios: Again, the reasons can be manifold, but what happens when we’ve (over-)written the first 4 digits of the old with the new number and then the server dies, power goes away or the database server crashes? The next time you want to read the phone number you get a mix of the old and the new one (if you are lucky) and you don’t exactly know that this is the case and which parts are missing. Your database file is inconsistent and you need to run an integrity check to find missing bits and correct half-written bytes. In the worst case that means scanning your entire database file a few times before you have resolved all inconsistencies. If you have a lot of data, that can take days.

To solve this, you always write the new phone number to a new place in the database file and only when it has been fsync()ed to disk, you update the location of the phone number (and then flush that update to disk as well). You will never end up in a scenario where your database file is in an inconsistent state, and after a crash you are back online without an integrity check.

The trade-off is write speed (remember, fsync()s are expensive) in exchange for not needing a consistency check after a failure.

A nice bonus is that if the “new place in the database” is the end of the file, you keep your disk-drive head busy with writing data to disk instead of seeking all over the place (remember: seeks are expensive).
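
The write-then-point idea can be sketched in a few lines. This is a toy, not the file format of any real database: the COMMIT line, the helper names append_update and read_current, and the assumption that values never contain such a line are all made up for this example.

```ruby
require 'tempfile'

# Append the new value and flush it, then append a small commit record
# pointing at it and flush again. Nothing already in the file is touched.
def append_update(path, value)
  File.open(path, 'ab') do |file|
    offset = File.size(path)
    file.write(value)
    file.fsync
    file.write("\nCOMMIT #{offset} #{value.bytesize}\n")
    file.fsync
  end
end

# The current value is whatever the last complete commit record points at.
def read_current(path)
  data = File.binread(path)
  commits = data.scan(/^COMMIT (\d+) (\d+)\n/)
  return nil if commits.empty?

  offset, length = commits.last.map(&:to_i)
  data.byteslice(offset, length)
end

db = Tempfile.new('phonebook')
append_update(db.path, "555-1234")
append_update(db.path, "+1-555-1234")

# Simulate a crash halfway through a third update:
# some of the new digits made it to the file, the commit record did not.
File.open(db.path, 'ab') { |file| file.write("+44-55") }

puts "after the crash we still read: #{read_current(db.path)}"
```

The half-written third number is just dead bytes at the end of the file. The reader never sees it because no commit record points at it.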

Distribution, Sharding & Resharding

So far, we’ve been looking at scenarios that involve a single database. We learned a great deal (I hope), but in reality we often deal with more than one database. The simplest reason to have two databases is for redundancy. Failures can bring down your database temporarily or even permanently. If it is a temporary issue, waiting a bit (or a bit longer) to get up and running again might be an option, but often, an application or service should be available at all times. In a fatal failure, where a database server is lost beyond repair, your data is gone if you haven’t stored it in a second place.

“I’ll just make two copies, easy!”. Yup easy, until you look at the details (that damn devil again!).

It’s all about failures again. Consider a single read request. A client connects to a server and asks for a data item. The server looks it up and returns the data to the client. All is well. At any point things can go wrong. The network connection can drop (or slow down so much that client or server assume it dropped), the client can disappear (because of a network failure or crash) as can the server. Clients, servers and the protocols they speak need to be built around the assumption that any of these things (and many more) can go wrong. If any part is not designed to handle error cases, your system will do funny things, but it won’t reliably store and manage your data.

Add complexity: With each write target (store in two places) the possibility of error and the need for proper error handling grows exponentially. When evaluating a distributed storage system, looking at how errors are handled is vital.


Another reason to distribute data among multiple servers is capacity. The three metrics of interest here are read requests, write requests and data. If you have more requests or data than a single machine can handle, you need to move to multiple machines. Each metric calls for different strategies, but they often go along with each other. The need for fault tolerance that I discussed above needs to be considered alongside.

Growing read capacity is relatively easy once you’ve covered the base case where the source for reading data might not be the same as the target for writing data and that there can be a mismatch (cf. eventual consistency).

Distributing writes and data works by assigning each of two machines 50% of the operations. A clever intermediate, a proxy server for example, decides which request goes where and all is well: we can store twice as much and we can store at twice the speed. When we need to grow bigger yet, we add another server and tell the proxy server to distribute the load equally among them. Adding a proxy for distribution introduces a single point of failure and you don’t want these; there’s added complexity with this approach.
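
The routing decision such a proxy makes can be as small as this. It is a sketch under assumptions: hashing the key with MD5 and taking it modulo the number of servers is one common way to split the load, and the name shard_for and the "item-N" keys are invented for the example.

```ruby
require 'digest'

# Map a key to one of `shard_count` servers, as a proxy might.
# The same key always lands on the same server.
def shard_for(key, shard_count)
  Digest::MD5.hexdigest(key).to_i(16) % shard_count
end

keys = Array.new(1000) { |i| "item-#{i}" }
load_per_shard = keys.group_by { |key| shard_for(key, 2) }
                     .transform_values(&:length)

puts "items per server: #{load_per_shard.inspect}"
```

With a decent hash function the two servers each end up with roughly half of the items, without the proxy having to remember where it put anything.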

resharding.png

The diagram shows that there is another step needed that wasn’t included in the above description. The new “node” needs to have a copy of all data items that are assigned to it and are currently living on the two existing nodes. The process of moving data items to new nodes is called resharding and needs to happen every time a new node is added.

Resharding can be an expensive operation if you have a lot of data. Techniques like consistent hashing help with minimising the amount of items that need to move. If you are looking at a sharding database, you want to understand how the sharding is performed and if you like the trade-offs.
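
To get a feeling for how much consistent hashing helps, here is a sketch that counts how many items change servers when a third node joins, once with plain "hash modulo number of nodes" and once with a hash ring. The class HashRing, the 100 points per node and the node names are assumptions made for this illustration, not any specific database’s implementation.

```ruby
require 'digest'

def ring_position(text)
  Digest::MD5.hexdigest(text).to_i(16)
end

# A hash ring: every node owns many points on the ring, and a key belongs
# to the node owning the first point at or after the key's own position
# (wrapping around to the first point at the end of the ring).
class HashRing
  def initialize(nodes, points_per_node = 100)
    entries = nodes.flat_map do |node|
      Array.new(points_per_node) { |i| [ring_position("#{node}:#{i}"), node] }
    end
    @entries = entries.sort_by(&:first)
  end

  def node_for(key)
    target = ring_position(key)
    entry = @entries.bsearch { |e| e[0] >= target } || @entries.first
    entry[1]
  end
end

# The naive alternative: hash modulo the number of nodes.
def modulo_node(key, nodes)
  nodes[ring_position(key) % nodes.length]
end

keys = Array.new(1000) { |i| "item-#{i}" }
old_nodes = %w[node-a node-b]
new_nodes = %w[node-a node-b node-c]

old_ring = HashRing.new(old_nodes)
new_ring = HashRing.new(new_nodes)

moved_on_ring = keys.count { |k| old_ring.node_for(k) != new_ring.node_for(k) }
moved_by_modulo = keys.count { |k| modulo_node(k, old_nodes) != modulo_node(k, new_nodes) }

puts "moved with modulo: #{moved_by_modulo} of #{keys.length}"
puts "moved with hash ring: #{moved_on_ring} of #{keys.length}"
```

On the ring, the only items that move are the ones the new node takes over. With modulo, most items change servers, including many that just shuffle between the two old nodes.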

CAP Theorem

The CAP Theorem states that out of consistency, availability and partition tolerance, a system can choose to support two at any given moment, but never three.

cap.png

Consistency guarantees that all clients that talk to a cluster of nodes will always get to read the same data. Write operations are atomic on all nodes.

Availability guarantees that in any (reasonable) failure scenario, clients are still able to access their data.

Partition tolerance guarantees that when nodes in the cluster lose their network connection and two or more completely separated sub-clusters emerge, the system will still be able to store and retrieve data.

Please Talk! (To Developers)

If you are aiming for a comparative benchmark of two or more systems, you should run your procedure by their authors. I have found that developers are happy to help out with benchmarks by clearing up misconceptions or sharing tricks to speed things up (which you can choose to ignore, if you are looking for an out-of-the-box comparison, but this is rarely useful).

by Jan (jan@apache.org) at June 16, 2009 09:57 PM

June 15, 2009

John Wood

CouchDB: A Case Study

This is part 1 in a series of posts that describe our investigation into CouchDB as a solution to several database related performance issues facing the TextMe application. Part 2: Databases and Documents >> The wall was quickly approaching. After only a few short years, several of our database tables had over a million rows, a [...]

by John Wood at June 15, 2009 01:40 PM

June 13, 2009

Chris Strom

Outside the Homepage

I finished off the navigation between meals Cucumber scenario last night. Before moving on, I have to ask myself, "did I create a navigation between recipes scenario?" Sadly, the answer to that is "no", so I do so now:
    Scenario: Navigating to other recipes

Given a "Spaghetti" recipe from May 30, 2009
And a "Pizza" recipe from June 1, 2009
And a "Peanut Butter and Jelly" recipe from June 11, 2009
When I view the "Peanut Butter and Jelly" recipe
Then I should see the "Peanut Butter and Jelly" title
When I click "Pizza"
Then I should see the "Pizza" title
When I click "Spaghetti"
Then I should see the "Spaghetti" title
When I click "Pizza"
Then I should see the "Pizza" title
When I click "Peanut Butter and Jelly"
Then I should see the "Peanut Butter and Jelly" title
That's very nearly a cut-n-paste of the navigation between meals scenario. Hopefully the work already done to implement the latter will make this new scenario easy to implement.

I will not be implementing that today. Even with the new scenario, I have 170 out of 220 scenario steps complete. Most of the remaining scenarios deal with the homepage and site-wide navigation. If I can get those complete, I can deploy—if only in beta. I can live without between-recipe navigation for a beta deployment. Heck, I can live without it in live code for a little while.

So it's on to the "Site" feature. To put myself in the right frame of mind, I revisit the Cucumber preamble that I wrote for this feature:
  So that I may explore many wonderful recipes and see the meals in which they were served
As someone interested in cooking
I want to be able to easily explore this awesome site
Ah, effusive language really inspires. I must do well by these users!

The first scenario in there is "Quickly scanning meals and recipes accessible from the home page". The "given" preconditions are:
    Given 25 yummy meals
And 1 delicious recipe for each meal
And the first 5 recipes are Italian
And the second 10 recipes are Vegetarian
When creating the 25 meals, I inject the IDs into an instance variable for use in later steps:
Given /^(\d+) yummy meals$/ do |count|
start_date = Date.new(2009, 6, 11)

@meal_ids = (0...count.to_i).inject([]) do |memo, i|
date = start_date - (i * 10)

meal = {
:title => "Meal #{i}",
:date => date.to_s,
:serves => 4,
:summary => "meal summary",
:description => "meal description",
:type => "Meal",
:menu => []
}

RestClient.put "#{@@db}/#{date.to_s}",
meal.to_json,
:content_type => 'application/json'

memo + [date.to_s]
end
end
I do something similar when creating a recipe for each meal. The only difference is that I need to update the meal with the recipe included on the menu:
Given /^1 delicious recipe for each meal$/ do
@recipe_ids = @meal_ids.inject([]) do |memo, meal_id|
data = RestClient.get "#{@@db}/#{meal_id}"
meal = JSON.parse(data)

permalink = meal['date'] + "-recipe"

recipe = {
:title => "Recipe for #{meal['title']}",
:date => meal['date'],
:preparations => [
{ 'ingredient' => { 'name' => 'ingredient' } }
]
}

RestClient.put "#{@@db}/#{permalink}",
recipe.to_json,
:content_type => 'application/json'

# Update the meal to include the recipe in the menu
meal['menu'] << "[recipe:#{permalink.gsub(/-/, '/')}]"
RestClient.put "#{@@db}/#{meal['_id']}",
meal.to_json,
:content_type => 'application/json'

memo + [permalink]
end
end
Per the CouchDB API, I can PUT the meal because the document retrieved at the beginning of this step includes the _rev attribute. Had I simply reused the JSON from the meal creation step, I would have gotten a 409 Conflict error from CouchDB telling me that the PUT operation failed. By retrieving the meals via their IDs, I pick up the current revision, allowing this step to pass without error.
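The revision check can be sketched with a toy in-memory store. This is a Python stand-in for illustration only, not the CouchDB API: ToyStore and Conflict are hypothetical names, and revisions are simplified to plain counters rather than real CouchDB revision strings.

```python
import copy


class Conflict(Exception):
    """Stands in for CouchDB's 409 Conflict response."""


class ToyStore:
    """In-memory stand-in for a database with per-document revisions."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        # Return a copy so callers cannot mutate the stored document.
        return copy.deepcopy(self._docs[doc_id])

    def put(self, doc_id, doc):
        current = self._docs.get(doc_id)
        # Updating an existing document requires its current revision.
        if current is not None and doc.get("_rev") != current["_rev"]:
            raise Conflict("document update conflict for %s" % doc_id)
        generation = 1 if current is None else int(current["_rev"]) + 1
        stored = copy.deepcopy(doc)
        stored["_id"] = doc_id
        stored["_rev"] = str(generation)
        self._docs[doc_id] = stored
        return stored["_rev"]


store = ToyStore()
meal = {"title": "Meal 0", "menu": []}
store.put("2009-06-11", meal)  # first write, no revision needed

# Reusing the remembered JSON fails: it carries no revision.
stale_update_rejected = False
try:
    store.put("2009-06-11", dict(meal, menu=["recipe"]))
except Conflict:
    stale_update_rejected = True

# Fetching first picks up the current revision, so the update succeeds.
fresh = store.get("2009-06-11")
fresh["menu"].append("recipe")
new_rev = store.put("2009-06-11", fresh)
```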

I need to call it a night at this point. I will pick up the next pending step tomorrow and then start working my way into the homepage code.
(commit)

by eee.c (chris.eee@gmail.com) at June 13, 2009 01:58 PM

June 12, 2009

Chris Anderson

NoSQL Slides

NoSQL was a rip-roaring good time. It was fun to catch up with old friends as well as get an all day brain-dump of what's going on in the distributed database world. I'm pretty heads-down on CouchDB, so seeing how others have approached a similar problem space was eye opening.

Mostly I was amazed at the level of complexity in the various Big Table clones. My impression is that most of them blend storage concerns with distribution concerns. Mixing it all up in one big distributed porridge allows for optimizations that CouchDB's approach can't yield. However, I think the CouchDB approach of building a solid single-node implementation always with an eye toward distributed uses is just as viable as always.

Meebo's CouchDB-Lounge proxy is a pure HTTP approach to building CouchDB clusters that span multiple machines. It doesn't handle the dynamic nature of truly large clusters, where you can count on nodes continuously leaving and joining. But it makes up for that with a truly simple implementation (just a few hundred lines of Python and C). When the need arises, someone will have an easy time adding facilities for managing dynamic large clusters, because the separation of concerns is so clear.

Maybe I'm underestimating the necessary complexity in large cluster management, but so far betting on simplicity has been a winning strategy and I intend to continue it.

The slides are available as a PDF here.

by jchris at June 12, 2009 09:29 PM

June 11, 2009

Rodrigo Moya

couchdb-glib 0.1

As the first step on CouchDB desktop integration, here’s version 0.1 of couchdb-glib, a GLib-based API to talk to CouchDB.

This initial version only allows reading and does all operations synchronously (not a problem in most cases, since the communication is done to the local CouchDB instance, which is quite quick, at least from what my tests show so far). Next releases will have all the missing functionality.

And, well, no screenshots to show, so here’s some example code for you to enjoy.

Source code is in GNOME GIT, under couchdb-glib module.

by rodrigo at June 11, 2009 10:35 AM

June 09, 2009

Paul Joseph Davis

Erlang Dependency Graph

Overview

Got distracted and wrote a small parser for dialyzer to print the dependencies for a set of Erlang beam files. Just used CouchDB sources as that's what I had handy.

Script

#! /usr/bin/env python

import os
import re
import subprocess as sp
import sys

edge_re = re.compile(r"(couch[^:]+):([^/]+)/\d+")

def dialyze(file):
    command = ' '.join([
        "dialyzer",
        "--build_plt",
        "-pa", "src/ibrowse",
        "-pa", "src/mochiweb",
        "-pa", "/usr/local/lib/erlang/lib",
        "-c", file
    ])
    pipe = sp.Popen(
        command, shell=True, stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.PIPE
    )
    (stdout, stderr) = pipe.communicate(input="")
    for line in stdout.split("\n"):
        match = edge_re.match(line.strip())
        if not match:
            continue
        yield (match.group(1), match.group(2))

def start_graph():
    print "digraph G {"

def add_edges(file, edges):
    src = os.path.split(file)[1][:-len(".beam")]
    edges[src] = set()
    for (mod, fun) in dialyze(file):
        edges[src].add(mod)

def end_graph():
    print "}"

def main():
    if len(sys.argv) != 2:
        print "usage: %s code_dir" % sys.argv[0]
        exit(-1)

    start_graph()
    edges = {}
    for root, dnames, fnames in os.walk(sys.argv[1]):
        for fname in fnames:
            if not fname.endswith(".beam"):
                continue
            add_edges(os.path.join(root, fname), edges)
    keys = edges.keys()
    keys.sort(key=lambda x: len(edges[x]), reverse=True)
    for k in keys:
        for m in edges[k]:
            print "  %s -> %s;" % (k, m)
    end_graph()

if __name__ == '__main__':
    main()

Result

CouchDB Dependency Graph

June 09, 2009 04:00 AM

June 06, 2009

Damien Katz

June 03, 2009

Ricky Ho

RESTFul Design Patterns

This post summarizes a set of RESTful design practices that I have used quite successfully.

Object Patterns

If there are many objects of the same type, the object URL should contain the id of the object.
http://www.xyz.com/library/books/668102
If this object is a singleton object of that type, the id is not needed.
http://www.xyz.com/library

Get the object representation
HTTP GET is used to obtain a representation of the object. By default the URI refers to the object's metadata but not actual content. To get the actual content ...
http://www.xyz.com/library/books/668102.content
HTTP header "Accept" is also used to indicate the expected format. Note also that the representation of the whole object is returned. There is no URL representation at the attribute level.
GET /library/books/668102 HTTP/1.1
Host: www.xyz.com
Accept: application/json

Modify an existing Object

HTTP PUT is used to modify the object. The request body contains the representation of the Object after successful modification.

Create a new Object
HTTP PUT is also used to create the object if the caller has complete control over assigning the object id. The request body contains the representation of the Object after successful creation.
PUT /library/books/668102 HTTP/1.1
Host: www.xyz.com
Content-Type: application/xml
Content-Length: nnn

<book>
<title>Restful design</title>
<author>Ricky</author>
</book>
HTTP/1.1 201 Created

If the caller has no control over the object id, an HTTP POST is made to the object's parent container, with the request body containing the representation of the Object. The response should contain a reference to the URL of the created object.
POST /library/books HTTP/1.1
Host: www.xyz.com
Content-Type: application/xml
Content-Length: nnn

<book>
<title>Restful design</title>
<author>Ricky</author>
</book>
HTTP/1.1 201 Created
Location: /library/books/668102
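The two creation styles can be sketched without any web framework. The Library class below is a hypothetical in-memory container; it assumes the conventional 201 Created status for both styles and a Location header when the server picks the id.

```python
import itertools


class Library:
    """Toy in-memory "books" container; not tied to any web framework."""

    def __init__(self):
        self.books = {}
        # Hypothetical id generator, starting at the id used in the examples.
        self._ids = itertools.count(668102)

    def put(self, book_id, representation):
        """Caller chooses the id: create or replace at a known URL."""
        created = book_id not in self.books
        self.books[book_id] = representation
        return (201 if created else 200), {}

    def post(self, representation):
        """Server chooses the id: report the new URL in a Location header."""
        book_id = next(self._ids)
        self.books[book_id] = representation
        return 201, {"Location": "/library/books/%d" % book_id}


library = Library()
status, headers = library.post({"title": "Restful design", "author": "Ricky"})
print(status, headers["Location"])
```

A side effect of this split is that PUT stays idempotent (repeating it leaves one book), while repeating the POST creates a second book under a new id.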

Invoke synchronous operation of the Object
HTTP POST is used to invoke an operation of the object that has a side effect. The operation is indicated in a mandatory parameter "action". The arguments of the method can be encoded in the URL (for primitive types) or in the request body (for complex types).
POST /library/books/668102?action=buy&user=ricky HTTP/1.1
Host: www.xyz.com

POST /library/books/668102?action=buy HTTP/1.1
Host: www.xyz.com
Content-Type: application/xml; charset=utf-8
Content-Length: nnn

<user>
<id>ricky</id>
<addr>175, Westin St. CA 12345</addr>
</user>

Invoke asynchronous operation of the Object

When the operation takes a long time to complete, an asynchronous mode should be used. In a polling approach, a transient transaction object is returned immediately to the caller. The caller can then use GET requests to poll for the result of the operation.

We can also use a notification approach. In this case, the caller passes along a callback URI when making the request. The server will invoke the callback URI to POST the result when it is done.
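The polling approach can be sketched in-process, with a background thread standing in for the slow server-side operation. OperationServer, slow_buy and poll_until_done are hypothetical names for illustration; a real service would expose post and get over HTTP.

```python
import threading
import time
import uuid


class OperationServer:
    """Toy server that runs slow operations in background threads."""

    def __init__(self):
        self._transactions = {}

    def post(self, operation, *args):
        """Start the operation and return a transaction id immediately."""
        txn_id = uuid.uuid4().hex
        self._transactions[txn_id] = {"status": "pending", "result": None}

        def run():
            result = operation(*args)
            self._transactions[txn_id] = {"status": "done", "result": result}

        threading.Thread(target=run).start()
        return txn_id

    def get(self, txn_id):
        """Poll the transient transaction object."""
        return self._transactions[txn_id]


def slow_buy(book_id, user):
    time.sleep(0.1)  # stand-in for a long-running operation
    return "%s bought book %d" % (user, book_id)


def poll_until_done(server, txn_id, interval=0.02, timeout=5.0):
    """Repeat GET until the transaction reports completion."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = server.get(txn_id)
        if state["status"] == "done":
            return state["result"]
        time.sleep(interval)
    raise TimeoutError("operation did not finish in time")


server = OperationServer()
txn = server.post(slow_buy, 668102, "ricky")
print(poll_until_done(server, txn))
```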

Destroy an existing Object
HTTP DELETE is used to destroy the object. This releases all the resources associated with the object.
DELETE /library/books/668102 HTTP/1.1
Host: www.xyz.com


Container Patterns

The immediate parent of a container must be an object (can be a singleton object without an id or an object with an id). Container "contains" other objects or containers. If a container is destroyed, everything underneath will be destroyed automatically in a recursive manner.
http://www.xyz.com/library/books
http://www.xyz.com/library/dvds
http://www.xyz.com/library/books/668102/chapters

In a GET operation, the container by default returns only the URL references of its immediate children. An optional parameter "expand" can be used to request the actual representation of all children and descendants.

A more sophisticated GET operation can contain a "criteria" parameter to show only the children that fulfill certain criteria.
GET http://www.xyz.com/library/books?criteria=[author likes 'ricky']


Reference Patterns

In many cases, objects refer to each other. The reference is embedded inside the representation of the object that references it. If the referenced object is deleted, all these references need to be fixed in an application-specific way.

by Ricky Ho (rickyphyllis@gmail.com) at June 03, 2009 11:02 PM

June 02, 2009

Rodrigo Moya

Desktop data/settings replication

In the last UDS, there were some talks about UbuntuOne, the technologies it uses, and how it could be well integrated into the Desktop. Also, there were discussions about how it could be integrated painlessly into upstream projects. So, here’s an idea on how this could be done.

First, it must be said that the easiest (and quickest) way of achieving UbuntuOne integration in Ubuntu would be to just patch/extend applications so that they supported accessing the UbuntuOne server, and have Ubuntu packages use that as default for users with UbuntuOne accounts. That would make most Ubuntu users happy, but it would not benefit users of other distributions at all or, worse, the upstream projects.

Now, if we look at the technologies being used in UbuntuOne, there is one awesome thing, called CouchDB, a project supported by the Apache Foundation, which provides databases (of JSON documents) that can be replicated (and 2-way synchronized) to other hosts. So, what if we had Linux Desktop applications use this for storage of files and settings?

couchdb-in-the-desktop

Well, what would happen is that we’d gain data / settings replication and synchronization for free. And also, if we could come up with standard formats / locations for common information (accounts, notes, mails, calendars, etc, etc), we’d also gain a shared storage for all applications to use, solving the problem of incompatible formats / locations used by similar free software applications.

And other advantages:

  • CouchDB knows already how to deal with conflicts, as this is included in the automatic replication / syncing features it provides.
  • While normal documents in CouchDB are JSON, you can attach any kind of file to any JSON document (even to empty JSON documents), so any kind of files can be stored. Also, it allows users to create as many databases as needed, so storage for different needs can be easily separated.
  • CouchDB provides a sort of revision history, so it could be used for nice stuff like Zeitgeist.
  • This, not being an Ubuntu-only solution, could benefit every Linux Desktop user.
  • UbuntuOne would be a service built on top of this that users can subscribe to. But others could just set up a CouchDB server on their home / company network and use that by just pointing their local CouchDB to their remote CouchDB replication server.

To continue my investigations/playing on this, I’m going to try writing a gvfs backend to manage files in the CouchDB instances. Once that’s done, applications could start just writing their files to couchdb://… URIs instead of file://… ones and enter the replication/synchronization world with just a single change. Next, a GConf/d-conf backend could be added for replicating/sync’ing settings, and so on.

by rodrigo at June 02, 2009 10:44 PM

May 31, 2009

Chris Anderson

Less

I like it when people talk about less code. Code slower, all of that. Here's my attempt to talk about less.

Less Layers

This is one aspect people find appealing about pure Couch apps. Less layers makes deployment easier. Less layers means less impedance mismatch between layers. This makes applications different.

But which layer do we drop? Model, View, Controller, Client, Server?

Unlearning

The hardest thing about switching to a document store is learning to think in documents. Documents are self-contained, so reconstructing their meaning for users is a simpler task, mostly consisting of presentation. CouchDB does enforce some constraints of its own, but they are largely a consequence of its distributed programming model, so we learn to live with them.

Model

Validating user input is crucial for security, as well as useful for providing guarantees to your application's views etc. Most frameworks have you doing this sort of thing in an application server, which is scaled and distributed differently from your database. Without an application server, where do the models live?

In CouchDB models feel functional, not object oriented. This can be a mind fuck but after about six months you get used to it. For instance, CouchDB's validation functions can access only one document at a time, and have no side effects other than blocking invalid database updates.

Imagine if your Rails controller had read-only access to your database, and could make only one select query per request (determined by the URL), and was required to return true or fail with a message. It's that weird. Did I mention validation functions are run during CouchDB replication, not just during user access?

But it makes sense, when you understand the physics of distributed computing. Of course you run validations at replication, because replication is accomplished via the same HTTP channels as normal client access. It's not hard to "replicate" into CouchDB from JavaScript running in a browser.
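To get a feel for that functional style, here is a rough Python analog of a validation function. CouchDB's own validation functions are JavaScript stored in a design document; the names and the rules below (a required title, an immutable author field) are assumptions made up for this sketch.

```python
class Forbidden(Exception):
    """Raised to reject an update, with a message for the client."""


def validate_doc_update(new_doc, old_doc):
    """Accept or reject one proposed document; no other data is visible.

    The function has no side effects: it either returns or raises.
    """
    if not new_doc.get("title"):
        raise Forbidden("documents need a title")
    if old_doc is not None and new_doc.get("author") != old_doc.get("author"):
        raise Forbidden("the author field may not change")


def apply_update(database, doc_id, new_doc):
    """Same path for direct client writes and for replicated writes."""
    validate_doc_update(new_doc, database.get(doc_id))
    database[doc_id] = new_doc


database = {}
apply_update(database, "post-1", {"title": "Less", "author": "jchris"})

try:
    apply_update(database, "post-1", {"title": "Less", "author": "mallory"})
except Forbidden as error:
    print("rejected: %s" % error)
```

Because the check sees only the new document and the one it replaces, it can run identically on every node that receives the update.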

View

Do I have to tell you that CouchDB can transform JSON documents and Map Reduce rows into formatted output, for instance as XML feeds or HTML pages? This blog's Atom feed is generated on CouchDB's server side using a _list function stored in Sofa's design document.

Client / Server

When CouchDB is running on localhost, this distinction becomes less important in certain ways. In other ways, the constraints of HTTP (definitely a networking protocol) make local applications easier to deploy and manage. When a web application is local, what makes it a web application?

Controller

I think the controller is largely moved to the client, in my experience it's wrapped up in event hooks attached to HTML elements. Good riddance, that bit of my Rails app was always a pain anyway. Keep refactoring the controller layer, eventually it's gone. ;)

There are other application models besides MVC, I'm not going into them right away, but I'm taking requests.

Karaoke

I could go on all night, but @amysue's playing Jump really loud on the Piano, which means it's time for out-the-door to Chopsticks III. Little known fact, my karaoke name is Grandpa Chris. Protip: if you've ever heard it before, you can sing Wooly Bully, but you have to #leanintoit.

by jchris at May 31, 2009 03:22 AM

May 29, 2009

Chris Anderson

First Principles

I'm not particularly concerned with people who take issue with some of the CouchDB demos I've been doing lately. Either they don't get it or they're trying hard not to. If you're on the cusp, and you're not sure whether or not you get it, I encourage you to read Jacob Kaplan-Moss's blog post from a couple of years ago: "Of the Web". For the non link-clicking types I'll quote the bits that make me happiest:

Let me tell you something: Django may be built for the Web, but CouchDB is built of the Web. I've never seen software that so completely embraces the philosophies behind HTTP. CouchDB makes Django look old-school in the same way that Django makes ASP look outdated.

And more:

Look, CouchDB may succeed, and it may fail; who knows. I'm sure of one thing, though - this is what the software of the future looks like.

So while there may be platforms that can pass bytes around among a particular set of listeners with less overhead than CouchDB, I'd be surprised if they can also subsume that functionality into a web-first database that also has the resilience and flexibility of CouchDB.

Maybe I should characterize all the chat rooms in Toast with word clouds, so you can see which room has the highest signal/noise ratio just by looking at the top terms. Can your MQ do that?

by jchris at May 29, 2009 04:42 AM

May 28, 2009

Upstream

Conference Triathlon - Euruko, Ruby on OS X, RailsWayCon

Wow, this was the month of conferences. First we visited Barcelona for Euruko, a great conference that takes place in a different city every year. We attended it for the 3rd time. The talks ranged from practical, like “Cooking with Chef”, through entertaining, like “Fun with Ruby (and without R***s), program your own games with Gosu”, to just geeky, like the lightning talk about “Vimmish and how much fun grammar parsers can be”. I really liked the 2 days, 1 track format and the people I met there. And you can be sure that we will be in Krakow next year too.

5 days later I visited Amsterdam for Ruby on OS X, which I already blogged about.

And a little more than a week later, until yesterday, four of us attended the RailsWayCon here in Berlin, which tries to fill the gap that RailsConf Europe left. The first day was reserved for whole-day tutorial sessions.

The second day offered a lot of advanced topics to choose from in 3 tracks. I chose to hear more about Asynchronous Processing from Mathias Meyer and the nitty gritty details about Events from Lourens Naudé. The keynote about the Present and Future of Programming Languages by Ola Bini was also very interesting.

Upstream also took an active part at the conference: Alex gave his talk about CouchDB Frameworks for Ruby and CouchApp (using his new presentation tool boom_amazing) and I introduced MacRuby, the Ruby that plays nice with Objective-C.

On the third day Yehuda Katz revealed some more details about Rails 3. The talk by Michael Koziarski about Rails Performance was good for a reality check. In the afternoon everybody was tired after 3 days of conference and the talks lost quality. Still a very good conference with potential.

by Thilo Utke at May 28, 2009 09:57 AM

May 27, 2009

Jason Davies

Sebastian Bergmann

May 25, 2009

Damien Katz

Realtime Chat on CouchDB

Hot off the presses, Chris Anderson just got Toast, a CouchDB based real-time chat working about an hour ago. Say hi

Update - Chris writes about design and motivation of Toast in Simple Wins

toast.png

by Damien Katz at May 25, 2009 01:09 PM

Ricky Ho

Solving TF-IDF using Map-Reduce

TF-IDF (Term Frequency, Inverse Document Frequency) is a basic technique to compute the relevancy of a document with respect to a particular term.

A "term" is a generalized idea of what a document contains (e.g. a term can be a word, a phrase, or a concept).

Intuitively, the relevancy of a document to a term can be calculated from the fraction of the document that the term makes up (i.e. the count of the term in that document divided by the total number of terms in it). We call this the "term frequency".

On the other hand, if this is a very common term that appears in many other documents, then its relevancy should be reduced. We measure this with the fraction of documents that contain the term (i.e. the count of documents having this term divided by the total number of documents). We call this the "document frequency".

The overall relevancy of a document with respect to a term can be computed using both the term frequency and document frequency.

relevancy = term frequency * log (1 / document frequency)

This is called tf-idf. A "document" can be considered as a multi-dimensional vector where each dimension represents a term with the tf-idf as its value.
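As a quick worked example of the formula above, here is a small sketch. The function name, parameter names and the numbers are illustrative assumptions, and the natural logarithm is used.

```python
import math


def tf_idf(term_count, terms_in_doc, docs_with_term, total_docs):
    """relevancy = term frequency * log(1 / document frequency)"""
    term_frequency = term_count / terms_in_doc
    document_frequency = docs_with_term / total_docs
    return term_frequency * math.log(1 / document_frequency)


# A term that fills 3% of a document and occurs in 1% of all documents.
rare = tf_idf(term_count=3, terms_in_doc=100, docs_with_term=10, total_docs=1000)

# The same term frequency, but the term occurs in every document.
common = tf_idf(term_count=3, terms_in_doc=100, docs_with_term=1000, total_docs=1000)

print("rare term:   %.3f" % rare)
print("common term: %.3f" % common)
```

The second call comes out as zero because log(1) is zero: a term that appears in every document says nothing about any one of them.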

Compute TF-IDF using Map/Reduce

To extract the terms from a document, the following process is common
  • Extract words by tokenizing the input stream
  • Make the words case-insensitive (e.g. transform to all lower case)
  • Apply n-gram to extract phrases
  • Filter out stop words
  • Normalize the word's concept (e.g. transform cat, cats, kittens to cat)
To keep the term simple, each word itself is a term in our example below.

We use multiple rounds of Map/Reduce to gradually compute …
  1. the word count per word/doc combination
  2. the total number of words per doc
  3. the total number of docs per word. And finally compute the TF-IDF
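Before looking at a distributed implementation, the three rounds above can be sketched on a single machine in plain Python. This is an illustration of the data flow only: the function name and the tiny input are made up, and each round is an ordinary in-memory aggregation rather than a real Map/Reduce job.

```python
import math
from collections import Counter


def tf_idf_rounds(docs):
    """docs maps a doc id to its text; returns {(word, doc_id): tf-idf}."""
    # Round 1: word count per (word, doc) combination.
    word_doc_count = Counter(
        (word, doc_id)
        for doc_id, text in docs.items()
        for word in text.lower().split()
    )

    # Round 2: total number of words per doc.
    words_per_doc = Counter()
    for (word, doc_id), count in word_doc_count.items():
        words_per_doc[doc_id] += count

    # Round 3: number of docs per word, then the final score.
    docs_per_word = Counter(word for word, _ in word_doc_count)
    total_docs = len(docs)
    return {
        (word, doc_id): (count / words_per_doc[doc_id])
        * math.log(total_docs / docs_per_word[word])
        for (word, doc_id), count in word_doc_count.items()
    }


scores = tf_idf_rounds({"d1": "cat cat dog", "d2": "dog bird"})
for (word, doc_id), score in sorted(scores.items()):
    print("%-5s %s %.3f" % (word, doc_id, score))
```

Each round only needs the output of the previous one grouped by a different key, which is what makes the computation fit the Map/Reduce model.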



Implementation in Apache PIG

There are many ways to implement the Map/Reduce paradigm above. Apache Hadoop is a pretty popular approach, using Java or another programming language (e.g. via Hadoop Streaming).

Apache PIG is another approach, based on a higher-level language with parallel processing constructs built in. Here are the three rounds of map/reduce logic implemented in a PIG script:
REGISTER rickyudf.jar

/* Build up the input data stream */
A1 = LOAD 'dirdir/data.txt' AS (words:chararray);
DocWordStream1 =
FOREACH A1 GENERATE
'data.txt' AS docId,
FLATTEN(TOKENIZE(words)) AS word;

A2 = LOAD 'dirdir/data2.txt' AS (words:chararray);
DocWordStream2 =
FOREACH A2 GENERATE
'data2.txt' AS docId,
FLATTEN(TOKENIZE(words)) AS word;

A3 = LOAD 'dirdir/data3.txt' AS (words:chararray);
DocWordStream3 =
FOREACH A3 GENERATE
'data3.txt' AS docId,
FLATTEN(TOKENIZE(words)) AS word;

InStream = UNION DocWordStream1,
DocWordStream2,
DocWordStream3;

/* Round 1: word count per word/doc combination */
B = GROUP InStream BY (word, docId);
Round1 = FOREACH B GENERATE
group AS wordDoc,
COUNT(InStream) AS wordCount;

/* Round 2: total word count per doc */
C = GROUP Round1 BY wordDoc.docId;
WW = GROUP C ALL;
C2 = FOREACH WW GENERATE
FLATTEN(C),
COUNT(C) AS totalDocs;
Round2 = FOREACH C2 GENERATE
FLATTEN(Round1),
SUM(Round1.wordCount) AS wordCountPerDoc,
totalDocs;

/* Round 3: Compute the total doc count per word */
D = GROUP Round2 BY wordDoc.word;
D2 = FOREACH D GENERATE
FLATTEN(Round2),
COUNT(Round2) AS docCountPerWord;
Round3 = FOREACH D2 GENERATE
$0.word AS word,
$0.docId AS docId,
com.ricky.TFIDF(wordCount,
wordCountPerDoc,
totalDocs,
docCountPerWord) AS tfidf;

/* Order the output by relevancy */
ORDERRound3 = ORDER Round3 BY word ASC,
tfidf DESC;
DUMP ORDERRound3;



Here is the corresponding User Defined Function in Java (contained in rickyudf.jar)
package com.ricky;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TFIDF extends EvalFunc<Double> {

@Override
public Double exec(Tuple input) throws IOException {
// Arguments arrive in the order passed by the Pig script:
// wordCount, wordCountPerDoc, totalDocs, docCountPerWord.
long wordCount = (Long) input.get(0);
long wordCountPerDoc = (Long) input.get(1);
long totalDocs = (Long) input.get(2);
long docCountPerWord = (Long) input.get(3);

double tf = (wordCount * 1.0) / wordCountPerDoc;
double idf = Math.log((totalDocs * 1.0) / docCountPerWord);

return tf * idf;
}
}

by Ricky Ho (rickyphyllis@gmail.com) at May 25, 2009 06:53 AM

Chris Anderson

Simple Wins

The biggest response I got to Toast, my realtime CouchDB chat server was: "wtf why didn't you use XYZ technology?"

The point of developing chat in CouchDB is not to show how CouchDB is an ideal persisted chat server (even if it is). The point is to show how CouchDB's "databasey" features, because they are implemented using HTTP, can be leveraged to make powerful end-user experiences, with just a minimum of code.

Before I dig into how Toast works, let's talk about simplicity. Detractors to the experiment tend to fall into three camps: First are the script kiddies who are so gleeful that I didn't harden my HTML escape functions that all they can think to say is:

alert("fart!");

I was utterly surprised by how many of them there are. I guess it's a good thing that CouchDB got a lot of interest from the 15 and under set, but damn - kids smell.

The second set of detractors say realtime applications shouldn't go to disk, because it's slow and a lot of data. To them I say "disk is free". Also, I'm not trying to build a stock exchange or a missile guidance system. Call me when performance matters. CouchDB is designed for highly concurrent applications, and while Toast's traffic doesn't count as super high, the upshot is that my old Mac Mini was able to sustain 4 hours at the top slot on Hacker News without load exceeding 0.3.

No technology can scale to Twitter-like sizes without application-specific problems, and building a large-scale realtime messaging system will always require a lot of tuning and encountering strange issues. I would definitely recommend that anyone building a system like that have a look at CouchDB.

The third set of detractors seem to think speed is everything. Um, huh? What matters is fast enough. As long as latency is acceptable, the only reason speed matters is if you are trying to cram a bunch of activity through a limited number of processes. If you have 10,000 users all updating once a second, you either need a k/v store that can handle 10k updates per second, or you need 10 key value stores that can each handle 1000 updates per second. With CouchDB the second thing is a lot easier to build than it is with most of the other options.
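The arithmetic of the second option can be sketched as follows: spread 10,000 users over 10 stores with a stable hash and count how many updates per second each store would see. The function name and the user ids are hypothetical, and nothing here is CouchDB-specific.

```python
import hashlib
from collections import Counter


def store_for(user_id, store_count=10):
    """Pick a store for a user with a stable hash of the user id."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % store_count


# One update per user per second, 10,000 users in total.
load = Counter(store_for("user-%d" % i) for i in range(10000))
for store, updates_per_second in sorted(load.items()):
    print("store %d: about %d updates/second" % (store, updates_per_second))
```

Each store ends up with roughly a tenth of the traffic, so no single process has to absorb the full 10k updates per second.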

But what none of these detractors seem to understand (except for the script-kiddies) is that the important thing about this demo is the web. CouchDB is part of the web, and serves web applications natively.

Serving RESTful web applications natively means two things.

  • you already know how to use CouchDB.

  • your applications can be deployed anywhere there's a CouchDB. This makes transporting an application (and its data) from one node to another trivially simple. This is why I call it the p2p web.

Back to Simplicity

Of all the qualities a good program can have, simplicity is most often overlooked. When you see benchmarks ("look how many operations per second my hash table can handle") almost never do they brag about how easy it was to achieve. Even if you wanted to brag about simplicity, it's hard to find a quantitative measure. Certainly the Quine test isn't about true simplicity:

#!/usr/bin/perl
$_=<<'eof';eval $_;
print "#!/usr/bin/perl\n\$_=<<'eof';eval \$_;\n${_}eof\n"
eof

So if we can't measure simplicity, how can we value it? I'm not alone when I say the web stack (HTML and friends) is simpler than the predecessors in computing history. This is completely subjective, of course. No one would argue that modern web browsers are less complex than for instance Prodigy's old school terminal but I will argue that it's a lot easier to write robust applications for the web.

The kind of simplicity I'm talking about is what you see when you view-source on a basic HTML page. There's not much learning curve, and a beginner can probably get somewhere useful just by reorganizing the content they see in their editor.

The same reason explains why PHP and the other scripting languages beat out Java on the server. Maybe Java is more "correct", but PHP's dirtiness means you don't have to be an expert to build something using it.

CouchDB is simple in that same way, in that it hides much of the complexity from you as a developer. Actually, there isn't much complexity to hide as CouchDB is simple on the inside as well.

Making Toast

There have always been a few killer features on the CouchDB roadmap. They all play directly into CouchDB's dual nature: scaling up to thousands of nodes, as well as scaling down: CouchDB's paradigm use case is local deployments. Users serve applications to their peers in human-sized groups. CouchDB is built to give you control of your data - in your pocket, on your laptop, and in the cloud.

One of these features has always been filtered replication. A top use case is for splitting shards in a partitioned cluster. The first step toward filtered replication is a realtime stream of updates, made available as events in the database sequence. HTTP access to the update stream of database changes is crucial for chat on CouchDB.

Oh, messages in sequence... That's another new one that Eric aka @thisfred was pretty happy about. It gives a lot of flexibility, but should only be used if you know what you are doing. Essentially you can order view results in your database by order of the local sequence number (when the document was written to the server generating the views). However, in replication this order is not preserved, so tracking an ordered stream across multiple disconnected nodes has its own challenges.
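A toy model shows why local sequence order is a per-node property. ToyFeed below is a hypothetical in-memory stand-in, not CouchDB's API: each node stamps documents with its own counter as they arrive, so two nodes holding the same messages can list them in different orders.

```python
class ToyFeed:
    """In-memory stand-in for a database with a local update sequence."""

    def __init__(self):
        self.update_seq = 0
        self._seq_of = {}  # doc id -> local sequence of its latest write

    def save(self, doc_id):
        self.update_seq += 1
        self._seq_of[doc_id] = self.update_seq

    def changes(self, since=0):
        """(seq, doc id) pairs written after `since`, in local order."""
        return sorted(
            (seq, doc_id) for doc_id, seq in self._seq_of.items() if seq > since
        )


# The node where the messages were first written.
origin = ToyFeed()
for message in ["m1", "m2", "m3"]:
    origin.save(message)

# Another node that happened to receive the same messages in another order.
replica = ToyFeed()
for message in ["m3", "m1", "m2"]:
    replica.save(message)

print([doc_id for _, doc_id in origin.changes()])
print([doc_id for _, doc_id in replica.changes()])
print(origin.changes(since=2))
```

A chat client that polls with the last sequence number it saw gets a cheap "what's new" query, as long as it keeps talking to the same node.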

Toast took about 6 hours of coding, interrupted by a few hours of adding features to CouchDB to support it.

Normal Toast

The simplest thing about Toast is that it can run on any copy of CouchDB (0.10.0-dev or newer). That's what makes it a CouchApp. "What does it matter?" is a common question when people are confronted with their first self-contained CouchApp. The difference is that a pure CouchDB application is as portable as the data it manages.

View Source

When the application is just data, and moves through the same replication flow as other data, it gives users control over the source code and not just their data. Some users won't notice, but those who do will start hacking on the apps they use. Just as an Excel user quickly turns into an Excel hacker, we'll see CouchApp users becoming savvy about editing JavaScript views, Ajax callbacks, etc.

To see the Toast source code in deployment (yes this is the actual code that runs Toast) click this link to the Toast design document in Futon. Clicking index.html will take you into the application.

Deploy via HTTP

Design documents are the source code of CouchDB Applications. They contain definitions of views as well as HTML pages that turn into Ajax applications. There's room in the CouchDB application model to have applications written in PHP, Ruby, Python or any number of languages, as long as they are properly sandboxed.

The cool thing about deploying applications as regular documents is that I can replicate from my local machine to a cluster to deploy. Or I could use HTTP to PUT the application to a remote machine. Anonymous users cannot create design documents, but admins can.

Share through replication

Replication is the bomb.

Any two CouchDB databases can be merged using replication. All new updates are applied to the target database, whether they are the addition of a new document, updating an existing document, or even the deletion of an existing document. Replication is incremental, which means extra data is not transferred if the databases have replicated recently.

When I wanted to merge Jason's channel onto my localhost, I could replicate his CouchDB database to mine via HTTP. This way any messages he'd seen would be available for me to browse locally. By connecting a mesh of CouchDBs you'd be able to keep up with a few channels without much latency.
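As a rough sketch of what "replicate via HTTP" can look like from a script: CouchDB accepts a POST to its _replicate endpoint with a JSON body naming the source and target. The helper name, the example.com URL and the local database name below are placeholders of mine, not the actual Toast setup, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_replication_request(server, source, target):
    """Build (but do not send) a POST to the server's _replicate endpoint."""
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    return urllib.request.Request(
        url=server.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical example: pull a remote channel database into a local one.
request = build_replication_request(
    "http://127.0.0.1:5984",
    source="http://example.com/toast",  # placeholder remote database URL
    target="toast",                     # placeholder local database name
)
# urllib.request.urlopen(request) would ask the local CouchDB to start replicating.
```

Because replication is incremental, running the same request again later should only transfer what changed in between.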

Futon Replication

Above is a screenshot of what replication looks like in CouchDB. You merely enter the source and target database URLs, and CouchDB handles the rest. Here's a zoom in on what it would look like for you when you're using it.

Replication Controls

Low Energy State

So duh, the web is simpler than Prodigy. What else is simpler? Basic is simpler than C. Ruby is simpler than Java. The web is simpler because more of the decisions are made for you.

How do realtime updates work on CouchDB?

$ curl 'http://jchrisa.net/toast/_changes?continuous=true&since=9457'

A recent sequence number will always be last in the output. Changing your request parameters to point to the current bottom sequence id will give output like this, after a few messages are sent in the room:

$ curl 'http://jchrisa.net/toast/_changes?continuous=true&since=9507'
{"results":[
    {"seq":9508,"id":"fca5e615dec80dbdd848a0a3738d5be4","changes":[{"rev":"1-1761360865"}]},
    {"seq":9509,"id":"224b8f9ce4ef5587d0c6676b87d04aa6","changes":[{"rev":"1-3621019413"}]},

where each line is a JSON object. The cool thing is that CouchDB (and your browser) will hold open the connection for minutes if that's how much time passes between updates.
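If you did want to read the lines rather than just react to them, a minimal sketch could look like the following. It assumes the line format shown in the sample above (a `{"results":[` header, then one change per line with a trailing comma); the function name is my own, and the exact framing of the feed may differ between CouchDB versions.

```python
import json


def parse_change_line(line):
    """Return one change as a dict, or None for lines that carry no change."""
    text = line.strip().rstrip(",")
    try:
        change = json.loads(text)
    except ValueError:
        return None  # header line, blank keep-alive line, or closing bracket
    if isinstance(change, dict) and "seq" in change:
        return change
    return None


# The sample lines from the curl output above.
sample = [
    '{"results":[',
    '    {"seq":9508,"id":"fca5e615dec80dbdd848a0a3738d5be4","changes":[{"rev":"1-1761360865"}]},',
    '    {"seq":9509,"id":"224b8f9ce4ef5587d0c6676b87d04aa6","changes":[{"rev":"1-3621019413"}]},',
]
changes = [c for c in map(parse_change_line, sample) if c is not None]
last_seq = changes[-1]["seq"]  # the value to pass as since= when reconnecting
```

Keeping track of the last sequence number is what lets a client drop the connection and pick up again without missing updates.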

In the case of Toast I don't even parse the JSON text, but rather rely on the browser to let me know that something has changed, by hooking into the onreadystatechange event. Here's the relevant code, from the bottom of channel.html:

c_xhr = jQuery.ajaxSettings.xhr();
c_xhr.open("GET", app.db.uri+"_changes?continuous=true&since="+db_info.update_seq, true);
c_xhr.send("");
c_xhr.onreadystatechange = function() {
  refreshView();
};

This calls the refreshView() function anytime a new line comes over the wire from CouchDB. So we don't even care what the lines say. Simple.

refreshView() just makes sure the user has the 25 most recent messages on their screen. It's not fancy, but it is simple. Also, this code could be optimized in a straightforward manner to use far fewer resources.

Toast is a showcase of how simple a real-time chat system can be, when you leverage CouchDB's _changes API.

There will be other exciting apps coming.

by jchris at May 25, 2009 06:10 AM

May 23, 2009

Chris Anderson

My Couch is on Port 80

It started with the tshirt:

My Couch Is On Port 80

Now it's turned into a (realtime chat) movement.

Toast is a simple demo chat application for CouchDB. It can still have a lot added to it, but it can IM, and that's the fun part.

by jchris at May 23, 2009 08:08 PM

Building things

I had a wild weekend meeting VCs in SF, and came home determined to actually write some code. Meetings are fun but they remind you that nothing talks like code.

So I'm putting together a demo suggested by Damien, involving realtime chat on CouchDB.

I'll be upgrading the CouchDB server this blog runs on, which should be hopefully hiccoughless.

by jchris at May 23, 2009 05:35 PM

May 20, 2009

Jan Lehnardt

Benchmarks: You are Doing it Wrong

This is part one in a small series about measuring software performance. There’s a lot of common sense covered, but I feel it is necessary to shed some light.


Coffee

Pete needs coffee and his coffee maker broke down. Pete’s browsing through Craigslist. He’s looking for a coffee maker and he’s fine with a used one if he can get it from nearby. While results may vary when Pete’s got his coffee, his brain processes what he sees on a web page in between 200 and 500 milliseconds. Of course this depends on the complexity of the page and outside distractions[citation needed].

Computers are very limited in what they can calculate but they are incredibly fast and reliable. Human brains are a lot more sophisticated, but not as fast on raw computations. To render the Craigslist homepage takes about 150ms right now (I’m in Berlin) when I ask curl and it takes Safari around 1.4 seconds (1400ms) to display the page.

This in part demonstrates the measuring dilemma. Pete never sees the 150ms response for http://craigslist.org/. He only sees that it takes a bit before his browser finishes loading. We’ll get back to that later.

The point here is, even if all parts of the system would result in a sub-200ms response time, Pete (and everybody else) would not notice. Pages would change “instantly” as far as he (and everybody else) is concerned. While the fallacies of distributed computing (read: The Internet) will probably never get us there, at some point it does not make any more sense to speed things up because no one will notice.

Moving Parts

Let’s take a look at what a typical web app looks like. This is not exactly how Craigslist works (because I don’t know how Craigslist works), but it is a close enough approximation to illustrate problems with benchmarking.

You have a web server, some middleware, a database. A user request comes in, the web server takes care of the networking and parses the HTTP request. The request gets handed to the middleware layer which figures out what to run; then runs whatever is needed to serve the request. The middleware might talk to your database and other external resources like files or remote web services. The request bounces back to the web server which sends out any resulting HTML. The HTML includes references to other resources living on your web server, like CSS-, JS- or image files and the process starts anew for every resource. A little different each time, but in general, all requests are similar. And along the way there are caches to store intermediate results to avoid expensive recomputation.

That’s a lot of moving parts. Getting a top-to-bottom profile of all components to figure out where bottlenecks lie is pretty complex (but nice to have). I’ll start making up numbers now; the absolute values are not important, only the numbers relative to each other. Say a request takes 1.5 seconds (1500ms) to be fully rendered in a browser.

In a simple case like Craigslist there is the initial HTML, a CSS file, a JS file and the favicon. Except for the HTML, these are all static resources and involve reading some data from a disk (or from memory) and serving it to the browser, which then renders it. The most notable things to do for performance are keeping data small (gzip compression, high jpg compression) and avoiding requests altogether (HTTP level caching in the browser). Making the web server any faster doesn’t buy us much (yeah, hand wavey, but I don’t want to focus on static resources here; Pete wants his coffee). Let’s say all static resources take 500ms to serve & render.

(Read all about improving client experience with proper use of HTTP from Steve Souders. The YSlow tool is indispensable for tuning a web site.)

That leaves us with 1000ms for the initial HTML. We’ll chop off 200ms for network latency [cf. Network Fallacies]. Let’s pretend HTTP parsing, middleware routing, middleware execution and database access share the remaining 800ms equally, 200ms each.

If you now set out to improve one part of the big puzzle that is your web app and gain 10ms in the database access time, this is probably time not well spent (unless you have the numbers to prove it).
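To put the made-up numbers side by side, here is the arithmetic as a small sketch. It reads "middleware routing & execution" as two separate stages, since that is the reading under which 200ms each adds up; the variable names are mine.

```python
total_ms = 1500    # full page render in the browser (made-up number from the text)
static_ms = 500    # CSS, JS and favicon, served and rendered
network_ms = 200   # network latency

html_ms = total_ms - static_ms      # 1000 ms left for the initial HTML
server_ms = html_ms - network_ms    # 800 ms spent on the server side

stages = ["HTTP parsing", "middleware routing", "middleware execution", "database access"]
per_stage_ms = server_ms / len(stages)   # 200 ms each

db_gain_ms = 10
share_of_total = db_gain_ms / total_ms   # fraction of the whole request that was saved

print(f"{per_stage_ms:.0f} ms per stage; a {db_gain_ms} ms gain is {share_of_total:.1%} of the request")
# prints: 200 ms per stage; a 10 ms gain is 0.7% of the request
```

A 10ms win inside a 200ms stage looks like 5% locally, but it is well under 1% of what Pete actually waits for.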

Variables

We established that there are a lot of moving parts. Each part has a variable performance characteristic, based on load, disk I/O, state of various caches (down to CPU L2 caches) and different OS scheduler behaviour based on any input variable. It is nearly impossible to know every interfering factor, so any numbers you ever come up with should be read with a grain of salt. In addition, when my system reports a number of 1000ms and yours reports 1200ms the only thing we can derive from that is our systems are different and we knew that before.

To combat variables, usually profiles are run multiple times (and a lot of times!) to have statistics tell you the margin of error you’re getting. Profiles should also run a long time with the same amounts of data that you will see in production. If you run a quick profile for a few seconds or minutes, you will hit empty caches and get skewed numbers. If your data does not have the same properties as the data you have in your production environment, you’ll get skewed results.
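A minimal sketch of "run it many times and let statistics tell you the margin of error" might look like this. The workload function is a stand-in I made up, and the margin uses a plain normal approximation, which is only a rough guide for small or skewed samples.

```python
import statistics
import time


def measure(fn, runs=30):
    """Time fn() `runs` times and return the samples in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """Return the mean and a rough 95% margin of error (normal approximation)."""
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / len(samples) ** 0.5
    return mean, 1.96 * stderr


def workload():
    # Stand-in for the real request being profiled (an assumption for this sketch).
    sum(i * i for i in range(10_000))


mean_ms, margin_ms = summarize(measure(workload))
print(f"{mean_ms:.3f} ms +/- {margin_ms:.3f} ms")
```

Note that this only addresses run-to-run noise. It does nothing about cold caches or unrepresentative data, which is why the runs also need to be long and use production-like data.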

Story time: Chris tried to find out how many documents of a certain size he could write into CouchDB. CouchDB has a feature that generates a UUID for every new document you store. The UUID variant it is using uses a full 128 bits of randomness. The documents are then stored in a b+-tree. Turns out that for a b+-tree, truly random keys for any kind of access are the worst possible case to handle. Chris then switched to pre-generated sequential ids for his test and got a 10x improvement. Now he’s testing the best case for CouchDB which coincides with the application’s data, but your application might have a different key distribution only resulting in a 2x or 5x improvement or none at all.

In a different case, the amount of data stored and retrieved could easily fit in memory and Linux’ filesystem cache was smart enough to turn all disk access to memory access which is naturally faster. But it doesn’t help if your production setup has more data than fits in memory.
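To get a feel for why the key distribution matters, here is a small sketch that counts how many inserts would land somewhere in the middle of a sorted structure rather than at its right edge. It says nothing about CouchDB's actual b+-tree implementation; the helper names and the id formats are assumptions for illustration.

```python
import uuid


def random_ids(n):
    """IDs in the style of fully random 128-bit UUIDs."""
    return [uuid.uuid4().hex for _ in range(n)]


def sequential_ids(n):
    """Zero-padded counters, which sort in the order they are generated."""
    return ["%010d" % i for i in range(n)]


def out_of_order_inserts(ids):
    """Count inserts whose key sorts before the largest key seen so far."""
    count, highest = 0, ""
    for key in ids:
        if key < highest:
            count += 1   # would land somewhere in the middle of the tree
        else:
            highest = key  # lands at the right edge
    return count


print(out_of_order_inserts(sequential_ids(1000)))  # 0: every insert is an append
print(out_of_order_inserts(random_ids(1000)))      # most inserts land mid-tree
```

Where your application's keys fall between those two extremes is exactly the thing a benchmark with synthetic ids will not tell you.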

Take home point: Profiling data matters.

The second part of this little series will look at pitfalls when profiling storage systems.

Trade Offs

Tool X might give you 5ms response times and this is an order of magnitude faster than anything else on the market. Programming is all about trade-offs and everybody is bound by the same laws.

On the outside it might appear that everybody who is not using Tool X is a moron. But speed & latency are only part of the picture. We already established that going from 5ms to 50ms might not even be noticeable by anyone using your product. The expense for speed can be multiple things:

  • Memory; instead of doing computations over and over, Tool X might have a cute caching layer that saves recomputation by storing results in memory. If you are CPU bound, that might be good, if you are memory bound it might not. A trade off.

  • Concurrency; the clever data structures in Tool X are extremely fast when only one request at a time is processed, and because it is so fast most of the time, it appears as if it would process multiple requests in parallel. Eventually though, a high number of concurrent requests fill up the request queue and response time suffers. A variation on this is that Tool X might work exceptionally well on a single CPU or core, but not on many, leaving your beefy servers idling.

  • Reliability; making sure data is actually stored is an expensive operation. Making sure a data store is in a consistent state and not corrupted is another. There are two trade offs here: Buffers that store data in memory before committing it to disk to ensure a higher data throughput. In case of a power loss or crash (hard- or software), the data is gone. This may or may not be acceptable for your application. The other is a consistency check that is required to run after a failure. If you have a lot of data, this can take days. If you can afford to be offline, that’s okay, but maybe you can’t afford it.

Make sure to understand what requirements you have and pick the tool that complies instead of taking the one that has the prettiest numbers. Who’s the moron when your web application is offline for a fix up for a day and your customers impatiently wait to get their job done, or worse, you lose their data?

But…My Boss Wants Numbers!

Yeah, you want to know which one of these databases, caches, programming languages, language constructs or tools is faster, harder, stronger. Numbers are cool and you can draw pretty graphs that management types can compare and make decisions from.

First thing a good exec knows is that she’s operating on insufficient data (aside, everybody does all the time, but sometimes it is just not apparent to you) and diagrams drawn from numbers are a very distilled view of reality. And graphs from numbers that are effectively made up by bad profiling are not much more than a fairy tale.

If you are going to produce numbers, make sure you understand how much is and isn’t covered by your results. Before passing them on, make sure the receiving person knows as much.

A Call to Arms

I’m in the market for databases and key-value stores. Every solution has a sweet spot in terms of data, hardware, setup and operation and there are enough permutations that you can pick the one that is closest to your problem. But how to find out? Ideally, you download & install all possible candidates, create a profiling test suite with proper testing data, make extensive tests and compare the results. This can easily take weeks and you might not have that much time.

I would like to ask developers [*] of storage systems to compile a set of profiling suites that simulate different usage patterns of their system (read-heavy & write-heavy loads, fault tolerance, distributed operation and a lot more). A fault tolerance suite should include steps necessary to get data live again, like any rebuild or checkup time. I would like users of these systems to help their developers to find out how to reliably measure different scenarios.

* I’m working on CouchDB and I’d like to have such a suite very much!

Even better, developers could agree (hehe) on a set of benchmarks that objectively measure performance for easy comparison. I know this is a lot of work and the results can still be questionable (you read the above part, didn’t you?), but it’ll help our users a great deal when figuring out what to use.


Stay tuned for the next part in this series about things you can do wrong when testing databases & k-v stores.

by Jan (jan@apache.org) at May 20, 2009 03:41 PM

May 19, 2009

Helder Ribeiro

Second steps with CouchDB: Playing with hierarchical data

So I jumped on the bandwagon and joined all the cool kids in scorning relational databases and playing with CouchDB (for a sort of long and excellent wrap up on this versus that, read this). I installed it, read through a lot of docs, thought I understood it pretty well and immediately started searching the tubes for what Ruby framework/library/gizmo would best allow me to get kickstarted with using it on a new project.

Turns out that was a bit premature, as my brain couldn’t really handle trying to model the domain of my intended application onto this totally new way of thinking so abruptly. It’s like when you’re a native Portuguese speaker and you’re drunk, trying to make yourself pass as speaking Spanish and you keep spilling out the German you’ve been learning for the past two years. I’m sure everyone can relate to that.

Time to take a step back.

The best way I found to get more familiar with this new type of database was to get rid of all the mental cruft I had around it. So I forgot about my app, its data model and the web framework and went on to play with the database alone.

I had seen this blog post about how to store hierarchical data in CouchDB and decided to play with the example data and views the guy provided (big thanks to Paul Bonser!). [Note: If you can follow that, you don't need to read this, as it's basically an expanded version of one of his examples.]

This is the data:


{
  "docs": [
    {"_id":"Food", "path":["Food"]},
    {"_id":"Fruit", "path":["Food","Fruit"]},
    {"_id":"Red", "path":["Food","Fruit","Red"]},
    {"_id":"Cherry", "path":["Food","Fruit","Red","Cherry"]},
    {"_id":"Tomato", "path":["Food","Fruit","Red","Tomato"]},
    {"_id":"Yellow", "path":["Food","Fruit","Yellow"]},
    {"_id":"Banana", "path":["Food","Fruit","Yellow","Banana"]},
    {"_id":"Meat", "path":["Food","Meat"]},
    {"_id":"Beef", "path":["Food","Meat","Beef"]},
    {"_id":"Pork", "path":["Food","Meat","Pork"]}
  ]
}

Which corresponds to this tree:
Tree

To create a database and import this data, save that snippet as a file somewhere (I’ll use /tmp/data.json). CouchDB talks to the world in HTTP. We’re gonna use curl for that so you really see what’s going on. Web browsers are for wimps.

Usually CouchDB runs locally on 127.0.0.1, port 5984. To create a new database all you need to do is PUT to that address with the name you want your DB to have in the URL and no payload. We’ll save that URL in a variable because we’re gonna use it a lot.


DB="http://127.0.0.1:5984/hierarchical_data"
curl -v -X PUT $DB

Here -v means verbose, and -X lets you choose the HTTP method.

We have our database, now we import the data using the bulk document API. We specify the payload (data) with -d and feed it as a string from the file by prefixing its path with an @ (that’s one curl trick I didn’t know about!):


curl -v -d @/tmp/data.json -X POST $DB/_bulk_docs

And this is the view I was interested in (it lists all descendants of a node, including itself):


{
  "language": "javascript",
  "views": {
    "descendants": {
      "map": "function(doc) { for (var i in doc.path) { emit(doc.path[i], doc); } }"
    }
  }
}

This thing about views going through all objects in your database took a little time to sink in with me. Initially I thought the query took place in the view, that I would somehow pass the node from which I wanted the descendants as the doc argument to that function. That’s not how it works. The query actually takes place in the view parameters, and the view function itself only flattens everything out into a convenient array so you can query it better.

This view I just mentioned, for example, doesn’t actually give you the elements in a sub-tree. It goes through each object (document) in the database and adds it to the array of results once for each of its ancestors.

To see it for yourself, save it in a file somewhere (/tmp/view.json in my case) and add it to the database. We do that by creating a special design document:


curl -v -d @/tmp/view.json -X PUT $DB/_design/tree

Now, to run it, just execute:


curl -v -X GET $DB/_design/tree/_view/descendants

Or see it in the browser: http://localhost:5984/hierarchical_data/_design/tree/_view/descendants

This is what you get:


{"total_rows":29,"offset":0,"rows":[
{"id":"Banana","key":"Banana","value":""},
{"id":"Beef","key":"Beef","value":""},
{"id":"Cherry","key":"Cherry","value":""},
{"id":"Banana","key":"Food","value":""},
{"id":"Beef","key":"Food","value":""},
{"id":"Cherry","key":"Food","value":""},
{"id":"Food","key":"Food","value":""},
{"id":"Fruit","key":"Food","value":""},
{"id":"Meat","key":"Food","value":""},
{"id":"Pork","key":"Food","value":""},
{"id":"Red","key":"Food","value":""},
{"id":"Tomato","key":"Food","value":""},
{"id":"Yellow","key":"Food","value":""},
{"id":"Banana","key":"Fruit","value":""},
{"id":"Cherry","key":"Fruit","value":""},
{"id":"Fruit","key":"Fruit","value":""},
{"id":"Red","key":"Fruit","value":""},
{"id":"Tomato","key":"Fruit","value":""},
{"id":"Yellow","key":"Fruit","value":""},
{"id":"Beef","key":"Meat","value":""},
{"id":"Meat","key":"Meat","value":""},
{"id":"Pork","key":"Meat","value":""},
{"id":"Pork","key":"Pork","value":""},
{"id":"Cherry","key":"Red","value":""},
{"id":"Red","key":"Red","value":""},
{"id":"Tomato","key":"Red","value":""},
{"id":"Tomato","key":"Tomato","value":""},
{"id":"Banana","key":"Yellow","value":""},
{"id":"Yellow","key":"Yellow","value":""}
]}

As you can see, there was no view parameter in that call, and this looks nothing like a list of descendants. Each emit call is responsible for one line of the output, which contains the id of the doc object, a key and a value. [I replaced doc as the value in the original emit call with '' to make it more readable.] Note that lines do not appear in the order they were emitted (otherwise you’d see lines with the same id grouped together). CouchDB automatically sorts them by key. Another thing you’ll notice is that, among all the lines sharing a given key, the ids are exactly the descendants of that key’s node (including the node itself). Convenient, huh?

Now, to get the descendants of one particular node, just query the view with that node’s name in the key:


curl -v -X GET 'http://localhost:5984/hierarchical_data/_design/tree/_view/descendants?key="Fruit"'

Or, again, use the browser: http://localhost:5984/hierarchical_data/_design/tree/_view/descendants?key=%22Fruit%22

And bingo!


{"total_rows":29,"offset":13,"rows":[
{"id":"Banana","key":"Fruit","value":""},
{"id":"Cherry","key":"Fruit","value":""},
{"id":"Fruit","key":"Fruit","value":""},
{"id":"Red","key":"Fruit","value":""},
{"id":"Tomato","key":"Fruit","value":""},
{"id":"Yellow","key":"Fruit","value":""}
]}
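If the "flatten first, query later" idea still feels slippery, here is the same thing emulated in plain Python, with no CouchDB involved. It is only a mental model of what the view does with the example data, not how CouchDB stores its index; the function names are mine.

```python
docs = [
    {"_id": "Food", "path": ["Food"]},
    {"_id": "Fruit", "path": ["Food", "Fruit"]},
    {"_id": "Red", "path": ["Food", "Fruit", "Red"]},
    {"_id": "Cherry", "path": ["Food", "Fruit", "Red", "Cherry"]},
    {"_id": "Tomato", "path": ["Food", "Fruit", "Red", "Tomato"]},
    {"_id": "Yellow", "path": ["Food", "Fruit", "Yellow"]},
    {"_id": "Banana", "path": ["Food", "Fruit", "Yellow", "Banana"]},
    {"_id": "Meat", "path": ["Food", "Meat"]},
    {"_id": "Beef", "path": ["Food", "Meat", "Beef"]},
    {"_id": "Pork", "path": ["Food", "Meat", "Pork"]},
]


def map_descendants(doc):
    """Mirror of the JavaScript map function: one row per path element."""
    for element in doc["path"]:
        yield (element, doc["_id"])


# The "view": every emitted (key, id) row, sorted by key and then by id.
rows = sorted(row for doc in docs for row in map_descendants(doc))


def query(key):
    """Return the ids of all rows with this key, like ?key="..." on the view."""
    return [doc_id for row_key, doc_id in rows if row_key == key]


print(len(rows))       # 29
print(query("Fruit"))  # ['Banana', 'Cherry', 'Fruit', 'Red', 'Tomato', 'Yellow']
```

The map step runs once per document and knows nothing about the question being asked. The question only shows up at query time, as a key lookup into the sorted rows.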

by helder at May 19, 2009 07:29 PM

May 18, 2009

Paul Joseph Davis

CouchDB JSON Parser Timings

Overview

I've put together some work on integrating eep0018 into CouchDB as well as adding support for Spidermonkey 1.8.1. This is all still very experimental. I have the test suite passing for both branches except where Spidermonkey's JSON serialization differs from the JavaScript function previously used ([undefined] is serialized as [null]).

So, after getting those branches together I spent a bit of time and ran some tests to see what kind of speed differences I could get. Turns out it's dependent on the amount of data we give it, but there is a noticeable impact.

Branches

  • Trunk - As of when I ran the tests today.
  • eep0018 - Includes the JSON parser in Erlang
  • Spidermonkey 1.8.1 - Includes eep0018 and Spidermonkey trunk as of today (or yesterday...). Configured with --enable-optimized=-O3, which is the only flag touched (i.e., no JIT enabled).

Caveats

Notice that I'm inserting 4KiB documents. If we shrank those down to a couple of bytes then these numbers tend to even out. I know that the eep0018 code is hampered by moving across the VM boundary so when we're dealing with _bulk_docs the key would be that eep0018 allows us to post more docs in a single request.

Other things people might want to play with are the numbers in couch_utils:should_flush/0 to see if we can tune how much data gets sent to the view server in one go.

So, there's a lot of permutations for different speed tests, not to mention just making sure that these branches aren't screwed beyond recognition in terms of actually working.

If you're bored and looking for something to do instead of clicking through Twitter or the current trendy social news site on a Monday morning, I invite you to grab one or two or all of the branches and run your own tests.

Benchmark Script

I realize this isn't the most sound measurement system, but I'm tired and didn't feel like being thorough. You can grab it here.

#!/usr/bin/env python
# Uses the couchdb-python package; db.query() runs a temporary view.
import time
import couchdb

server = couchdb.Server("http://127.0.0.1:5984/")
# Start every run from an empty database.
if "eep0018" in server:
    del server["eep0018"]
db = server.create("eep0018")

# Insert 10,000 documents with a 4KiB text field, in batches of 1,000.
start = time.time()
updates = []
for docid in range(10000):
    doc = {"_id": "%.10d" % docid, "integer": docid, "text": "a" * 4096}
    updates.append(doc)
    if len(updates) >= 1000:
        db.update(updates)
        updates = []
if updates:
    db.update(updates)
end = time.time()
print("Inserting: %f" % (end - start))

# Map only: iterate over every row of a temporary view.
start = time.time()
for row in db.query("function(doc) {emit(doc._id, doc.integer % 100);}"):
    pass
end = time.time()
print("Map only: %f" % (end - start))

# Map plus a JavaScript reduce function.
start = time.time()
for row in db.query("function(doc) {emit(doc._id, doc.integer / 100);}",
            reduce_fun="function(keys, vals) {return sum(vals);}"
        ):
    pass
end = time.time()
print("With reduce: %f" % (end - start))

# Map plus the built-in (Erlang) _sum reduce.
start = time.time()
for row in db.query("function(doc) {emit(doc._id, doc.integer * 2);}",
            reduce_fun="_sum"
        ):
    pass
end = time.time()
print("With erlang reduce: %f" % (end - start))

Results

This is the data that I got from running that test script against each of the three branches three times. The error bars are simple min/max notations. No fancy standard deviation shit going on here.

CouchDB JSON Parsing Times

Hand Waving

The results generally make sense. We get a speed bump during insertion when we switch to the eep0018 branch. The views are faster too. When we add the Spidermonkey 1.8.1 updates we get the same insert speed (because we don't touch the view server) and faster view computation.

For the more motivated timing people out there, if someone wants to play around with data sizes and look at timings for different scenarios that'd be pretty awesome. And more fancy number math probably wouldn't hurt.

May 18, 2009 04:00 AM

May 17, 2009

Upstream

New Couch Potato: simple, testable, opinionated.

After my talk about Ruby CouchDB frameworks at Scotland on Rails, where I dismissed a few of the libraries available (including my own Couch Potato) as not fitting the CouchDB way of doing things, I have been hacking away for the past few weeks on a complete overhaul of Couch Potato.

As a first result I have just released version 0.2 of the framework. Its new goals are simplicity, embracing the CouchDB semantics and testability. In order to achieve this I had to introduce some major changes:

I disconnected models from the database - there are no more save/get/find methods in the models. Instead you can hand the models to a database object that will save/load documents for you - this allows for unit tests that are completely disconnected from the database

I have dropped associations and thrown away all the ActiveRecord like view creation/querying, replacing it with a new, more CouchDB like system. That new system is lighter, simpler and easier to extend.

The following paragraphs will show you how to work with the new Couch Potato.

Saving / loading models

As I said, I have decoupled the models from the database: a model doesn’t have permanent access to the database anymore. Instead you instantiate a database object yourself and tell it to load or save a model object. This change isn’t so much about CouchDB as it is about testability. Having the database separated means you can now have true unit tests of your models without talking to the database. Here is how you save/load models:

The database is responsible for running the model’s validations and lifecycle callbacks, saving the document to the database and afterwards setting the _id and _rev on the model. This way a lot of the persistence related logic is removed from the models making them more lightweight and most importantly easier to test (see below).
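To illustrate the shape of this design (and only the shape), here is a language-neutral sketch in Python. None of these class or method names are Couch Potato's API; they are hypothetical stand-ins showing a model with no database access and a database object that validates, stores, and then sets _id and _rev.

```python
import uuid


class User:
    """A plain model: attributes and validation, no database access."""

    def __init__(self, name):
        self.name = name
        self._id = None
        self._rev = None

    def is_valid(self):
        return bool(self.name)


class InMemoryStore:
    """Stand-in for the real document store (hypothetical, for illustration)."""

    def __init__(self):
        self.documents = {}

    def put(self, doc_id, doc):
        self.documents[doc_id] = dict(doc)
        return "1-" + uuid.uuid4().hex  # fake revision string

    def get(self, doc_id):
        return self.documents[doc_id]


class Database:
    """Owns persistence: validates, stores, then sets _id and _rev on the model."""

    def __init__(self, store):
        self.store = store

    def save(self, model):
        if not model.is_valid():
            return False
        model._id = model._id or uuid.uuid4().hex
        model._rev = self.store.put(model._id, {"name": model.name})
        return True

    def load(self, doc_id):
        user = User(self.store.get(doc_id)["name"])
        user._id = doc_id
        return user


db = Database(InMemoryStore())
user = User("joe")
saved = db.save(user)
loaded = db.load(user._id)
```

Because the model never reaches for the database itself, swapping the store for an in-memory one (as above) takes nothing more than passing a different object to the constructor.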

New Views

Overhauling of the views had two main goals:

  • provide a simple and extensible way for saving/querying views that works the way CouchDB works
  • since models don’t have access to the database anymore, find a new way to query them

Here is how you create and query a simple view:

The way views work is now essentially reversed (inversion of control, ring a bell?). Instead of the view method calling the database, the new view method creates a specification for a CouchDB view. The view is passed to the actual database in order to query CouchDB. Again this makes testing easier (see next section) but it also gives Couch Potato a clean interface in order to make creating views easier through abstractions. There is now a hierarchy of view specification classes that you can use to more easily create and query views. The above example used the CouchPotato::View::ModelViewSpec which is the default. This spec creates a view whose map function emits one (or an array of) attribute of the model and returns full documents.

Other view specs let you create more customized views, for example the RawViewSpec, which lets you define your own map/reduce functions and return the raw CouchDB results hash.

For more examples see the Documentation in the CouchPotato::View::*ViewSpec classes.

Testing

As I have mentioned repeatedly, decoupling the database from the models makes testing easier. With the new Couch Potato you can unit test your models without hitting the database once. First of all that makes your tests lightning fast, and secondly it allows you to more easily test, for example, your lifecycle callbacks because you can call them directly, passing them a stub or mock database.

Here’s an example: (with RSpec)

If you don’t like calling the run_callbacks method you can also use the actual database object but without a real connection to CouchDB. Couch Potato is based on the excellent CouchRest - a more low level CouchDB framework. The CouchPotato::Database constructor expects an instance of a CouchRest database which we can replace with a stub:
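The general technique, calling a lifecycle callback directly and handing it a mock in place of the database, can be sketched like this in Python. The Comment class, the after_create callback and the save_document method are all invented for this illustration and are not part of Couch Potato or CouchRest.

```python
from unittest import mock


class Comment:
    """Model with a lifecycle callback that receives the database explicitly."""

    def __init__(self, text):
        self.text = text
        self.notified = False

    def after_create(self, database):
        # Callback under test: records an audit entry through the database.
        database.save_document({"type": "audit", "text": self.text})
        self.notified = True


def test_after_create_writes_audit_entry():
    database = mock.Mock()          # no connection to CouchDB needed
    comment = Comment("first!")
    comment.after_create(database)  # call the callback directly
    database.save_document.assert_called_once_with(
        {"type": "audit", "text": "first!"}
    )
    assert comment.notified


test_after_create_writes_audit_entry()
```

The test never touches the network, so it runs in microseconds and fails only when the callback's own logic is wrong.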

For more details and examples please see the README, the RDocs and watch this blog. If you want to know all about Couch Potato I encourage you to read through its source code and specs - it’s not that much code actually. Note that all this is still work in progress and time will show how well all of this works. I would be happy to hear your feedback.

by Alexander Lang at May 17, 2009 10:57 PM

Ricky Ho

Cloud Security Considerations

"Security concerns" are perhaps one of the biggest hurdles for enterprises moving their applications from an in-house data center to the public cloud. While a number of these concerns are real, I've found most of them are just fears that come from a lack of understanding of the cloud operating environment.

In this post, I will dig into each common security concern to look at its implications and suggest a set of practices and techniques to tackle the associated risks. I argue that with proper use of existing security technology, running applications in the public cloud can often be more secure than running them within your own data center.

Legal and Compliance

Country laws, auditing practices and compliance requirements typically evolve pretty slowly compared to technology. For example, if a country's law requires you to store your data within a certain geographic location, there is not much that technology can do to help. You need to identify this set of data, as well as the corresponding applications that manipulate it, because you have to run them in a data center residing in that particular geographic location.

Other than acknowledging that legal practices need to be modified before enterprises can legally run their applications in the cloud, I will not go further into this area because there is not much we can do at the technical level.


Trust the Cloud Provider

First of all, you have to establish some level of trust in the cloud provider. At the very least, you need to believe your cloud provider has no incentive to steal your information. Also, you need to have confidence that they have sufficient security expertise to handle different kinds of attacks. Typically, extensive due diligence is necessary to get a deep understanding of the cloud provider's operating procedures, the underlying technologies being used, their expertise and experience in handling various kinds of security attacks, etc.

In the due diligence session, here are some typical questions ...
  • What is the level of physical security that their data center runs on ?
  • How much power does their system administrator have ? Can they see all your sensitive information or modify your environment ? e.g. Can the system admin modify the machine image after creation ?
  • Does the cloud provider subcontract certain functionality to a 3rd party provider ? If so, you may need to expand the evaluation to those 3rd parties.
  • Do they face significant consequences (legal, financial, or reputational) in case any security compromise happens ?
  • Do they have sufficient security expertise and toolsets for detecting and isolating attacks ? A good probe is to understand how they tackle DDOS attacks.
  • What guarantees do they provide in the SLA, and are they willing to share your loss in case a security breach happens ? (almost no cloud provider today provides risk sharing in their SLA)
  • If the cloud provider goes out of business, is there a process to get your data back ? How is your existing data disposed of ?
  • In case any government agencies or legal entities need to examine the cloud provider, how will your sensitive data be protected ?
Having a comprehensive set of security operating procedures is very important.
  • How do they control the access of their personnel to the data center environment ?
  • In case of security attacks that have caused damage, how and when will you be notified ?
  • How do they isolate different tenants running in their shared physical environment ?
  • What tools do they use to monitor their operating environment ? How do they diagnose a security attack incident ?
  • Do all sensitive operations leave an audit trail ?
In order to stay competitive in business, public cloud providers have generally acquired the best system administration talent, usually better than your own data center administrators (particularly when you are an SME). In this regard, running your application in the public cloud can actually be more secure.

On the other hand, cloud providers are more likely to be the target of attacks.


Trust the Underlying Technologies

Since your application will be running in a shared environment, you depend heavily on the virtualization technology (e.g. the hypervisor) to provide the fundamental isolation from other tenants. Any defect in the virtualization technology can leak your sensitive information to other tenants.

Talk to your cloud provider to get a deep understanding of their selected technology and make sure it is proven and robust.

Computation / Execution environment
Most likely your cloud provider runs a hypervisor on top of physical machines and therefore the isolation mechanism is especially important. You need to be sure that once your virtual machine instance (guest VM) is started, it is technically impossible even for the cloud provider's administrator to gain access to any information residing in memory or on local disk. And after you shut down your VM, all data in memory, on the swap device, and on local disk must be completely wiped and unrecoverable.

Data Storage / Persistence
Make sure your data is stored in a highly reliable environment such that the chance of data loss or corruption is practically zero even in the event of a disaster. To achieve high data resilience, cloud providers typically create additional copies of your data and put them into geographically distributed locations. An auto-sync mechanism is also involved to keep the copies up to date.
  • How many copies of your data are made ?
  • Are these copies being placed in independent availability zones ?
  • How do they keep these copies in sync ? Is there any window during which the data is inconsistent ?
  • In case some copies are lost, what is the mechanism to figure out which is the valid copy ? How is the recovery done and what is the worst possible damage ?
  • Is my data recoverable after deletion ? If so, what are their data retention policies ?
For scalability reasons, the cloud provider usually offers an "eventual consistency model", which relaxes the data integrity guarantee. This model is quite different from traditional DB semantics, so you need to reevaluate what assumptions your application is making about the underlying DB and adjust your application logic accordingly.

Network communication
Cloud providers in general do a pretty good job of isolating network communication between guest VMs. For example, Amazon EC2 allows the setup of security groups, which are basically software firewalls that control who can talk to whom. Nevertheless, a "security group" is not equivalent to a LAN segment. Guest VMs (even within the same security group) can only see the network traffic that is directed to themselves, not communication among other VMs. Because access to the physical network is disallowed, it is much harder for other tenants to sniff your network traffic. On the other hand, running any low-level network intrusion detection software becomes unnecessary or even impossible. Also note that multicast communication is typically not supported in the cloud environment. Your application needs some changes in case it relies on a multicast-enabled network.

Nevertheless, the cloud provider's network admin can get access to the physical network and see everything flowing across it. For highly sensitive data, you should encrypt the communication.

Best Practices

There are some common practices that you can use to improve the cloud security

Encrypt all sensitive information
If your application is running in an environment over which you have no control, cryptography is your best friend. In general, you should encrypt your data whenever there is a possibility of it being exposed to an environment accessible by someone you don't trust.
  • Network Communication: For all communication channels that have sensitive information flowing across them, you should encrypt the network traffic.
  • Data Storage: For all sensitive data that you plan to store in the cloud, encrypt it first.
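As an illustration of the "encrypt it first" advice, here is a minimal sketch using Ruby's standard OpenSSL bindings with AES-256-GCM. The method names (encrypt_for_cloud, decrypt_from_cloud) and the sample record are hypothetical, and key storage is deliberately left out; this is one possible approach, not a recommendation of a specific scheme.

```ruby
require 'openssl'

# Encrypt plaintext with AES-256-GCM before it leaves your environment.
# Returns the IV, ciphertext and authentication tag needed to decrypt later.
def encrypt_for_cloud(plaintext, key)
  cipher = OpenSSL::Cipher.new('aes-256-gcm')
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv   # fresh nonce for every message
  cipher.auth_data = ''   # no additional authenticated data in this sketch
  ciphertext = cipher.update(plaintext) + cipher.final
  { iv: iv, ciphertext: ciphertext, tag: cipher.auth_tag }
end

# Reverse the operation; raises OpenSSL::Cipher::CipherError if the
# ciphertext or tag was modified while stored in the cloud.
def decrypt_from_cloud(sealed, key)
  decipher = OpenSSL::Cipher.new('aes-256-gcm')
  decipher.decrypt
  decipher.key = key
  decipher.iv = sealed[:iv]
  decipher.auth_tag = sealed[:tag]
  decipher.auth_data = ''
  decipher.update(sealed[:ciphertext]) + decipher.final
end

key = OpenSSL::Random.random_bytes(32)  # keep this OUTSIDE the cloud storing the data
sealed = encrypt_for_cloud('customer record 42', key)
restored = decrypt_from_cloud(sealed, key)
```

Only the sealed hash would be uploaded; the key stays somewhere the storage provider cannot reach.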
Secure your private key
The effectiveness of any cryptographic technique depends largely on how well you secure your private key. Make sure you put your private key in a separate, safe environment (e.g. in a different cloud provider's environment). You can also encrypt your private key with another secret key, or split the secret key into multiple pieces and put each piece in a different place. The basic idea is to increase the number of independent systems that the hacker needs to compromise before gaining access to your sensitive information.

On the other hand, you also need to consider how to recover your data if you lose your private key (e.g. the system where your private key is stored has been irrecoverably damaged). One way is to store additional copies of the key, but this will increase the chances of it being stolen. Another way is to apply an erasure coding technique to break your private key into n pieces such that the key can be recovered from any m pieces, where m is less than n.
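The idea of splitting a key into pieces can be sketched with a simple XOR split. This is an assumption-laden illustration, not the author's scheme: the helper names are hypothetical, and ALL n pieces are required to rebuild the key, so it does not provide m-of-n recovery.

```ruby
require 'securerandom'

# XOR two equal-length byte arrays element by element.
def xor_bytes(a, b)
  a.zip(b).map { |x, y| x ^ y }
end

# Split a secret into n shares. n - 1 shares are pure random noise; the last
# share is the secret XORed with all of them, so any n - 1 shares reveal nothing.
def split_secret(secret, n)
  bytes = secret.bytes
  random_shares = Array.new(n - 1) { SecureRandom.random_bytes(bytes.size).bytes }
  last_share = random_shares.reduce(bytes) { |acc, share| xor_bytes(acc, share) }
  (random_shares + [last_share]).map { |share| share.pack('C*') }
end

# XOR every share together to recover the original secret (as a binary string).
def join_secret(shares)
  shares.map(&:bytes).reduce { |acc, share| xor_bytes(acc, share) }.pack('C*')
end

shares = split_secret('my-private-key-material', 3)
recovered = join_secret(shares)
```

Each share could then be stored with a different provider, so that compromising any single one yields only random bytes.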

Build your image carefully
If the underlying OS is malicious, then all the cryptographic techniques are useless. Therefore, it is extremely important that your application runs on top of a trustworthy OS image. In case you are building your own image, you need to make sure you start from a clean and trustworthy OS source.

The base OS may come with a configuration containing software components that you don't need. You should harden the OS by removing this unused software, as well as the services, ports and user accounts that you don't need. It is good practice to adopt a default-closed policy: every service is disabled by default and an explicit configuration change is needed to turn it on.

Afterwards, you need to make sure each piece of software that you install on the clean image is also trustworthy. The effective trustworthiness of your final OS image is that of its weakest link: the minimum trustworthiness across all the installed software and the base OS.

Never put any sensitive information (e.g. your private key) into your customized image. Even if you are planning to use your image only in a private environment, you may share your image with the public in the future and forget to remove the sensitive information. In case you need a password or secret key to start your VM, such sensitive information should be pushed in (via startup user parameters, or data upload) by the initiator when it starts the VM.

Over the life cycle of your customized image, you need to monitor the relevant security patches and apply them in a timely manner.

Anomaly detection
Continuous health monitoring is important to detect security attacks or abnormal workload on your application. One effective way is to take a baseline of your application by observing the traffic patterns over a period of time. Novelty detection techniques can then be used to detect a sudden change in traffic pattern, which often indicates a DDOS attack.
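A very simple form of the baseline idea can be sketched as a z-score check: record requests per minute during normal operation, then flag any new measurement that is too many standard deviations away from the baseline mean. The sample numbers and the threshold of 3.0 are illustrative assumptions; real novelty detection is usually more sophisticated.

```ruby
# Mean and (population) standard deviation of a baseline sample.
def baseline_stats(samples)
  mean = samples.sum.to_f / samples.size
  variance = samples.sum { |x| (x - mean)**2 } / samples.size
  [mean, Math.sqrt(variance)]
end

# True when the value lies more than `threshold` standard deviations
# away from the baseline mean.
def anomalous?(value, mean, stddev, threshold = 3.0)
  return value != mean if stddev.zero?
  ((value - mean) / stddev).abs > threshold
end

# Hypothetical requests-per-minute observed during normal operation.
baseline = [120, 130, 125, 118, 122, 127, 131, 124]
mean, stddev = baseline_stats(baseline)

normal_minute = anomalous?(126, mean, stddev)  # within the usual range
flood_minute  = anomalous?(900, mean, stddev)  # far outside the baseline
```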

Keep sensitive data within your data center
The bottom line is ... you don't have to put your sensitive data into the public cloud. For any sensitive information that you have concerns about, just keep it and the associated operations within your data center and carefully select a set of sensitive operations to be exposed as a service. Instead of worrying about data storage protection, you then just concentrate the protection at the service interface level and apply proven authentication/authorization techniques.

e.g. you can run your user database within your data center and expose an authentication service to applications running in the public cloud.

Backup cloud data periodically
To make sure your data is not lost even when the cloud provider goes out of business, you should set up a periodic backup process. The backup should be in a format restorable in a different environment and should also be transferred to and stored in a different location (e.g. a different cloud provider).

A common approach is to utilize a DB replication mechanism where data updates are propagated from the master to the slaves in an asynchronous manner. One of the slave replicas is taken offline regularly for backup purposes. The backup data is transferred to a separate environment, restored and tested/validated. The whole process should be automated and executed periodically, at an interval based on how much data loss you can tolerate.

Run on Multiple Clouds simultaneously
An even better approach is to prepare your application to run on multiple clouds simultaneously. This forces you to isolate your application from vendor-specific features upfront, and all the vendor lock-in issues disappear.

Related to security, since each cloud provider has its own set of physical resources, it is extremely unlikely that hackers or a DDOS attack can bring all of them down at the same time. Therefore you gain additional reliability and resiliency when you run your application across different cloud providers.

Since your application is already sitting in multiple cloud providers' environments, you no longer need to worry about migrating your application to a different provider when the primary provider goes out of business. On the other hand, storing encrypted data in one cloud provider and your key in another cloud provider is a very good idea, as the hacker now needs to compromise two cloud providers' security measures (which is very difficult) before gaining access to your sensitive data.

You can also implement your data backup strategy by asynchronously replicating data from one cloud provider to another and providing application services using the replicated data on different cloud providers. In this model, no data restoration is necessary when one cloud provider is inaccessible.

by Ricky Ho (rickyphyllis@gmail.com) at May 17, 2009 06:42 AM

Chris Strom

A View and a View

‹prev | My Chain | next›

The first view that I need to create renders the list of meals for a given year. We tend to have a lot of meals in a year, so this list is a simple unordered list.

In the view specification, before(:each) example, two meals suffice to create a list:
  before(:each) do
    assigns[:meals] = @meal =
      [
        {
          'title' => 'Meal 1',
          '_id' => '2009-05-14'
        },
        {
          'title' => 'Meal 2',
          '_id' => '2009-05-14'
        }
      ]
  end
The two examples that describe the list are:
  it "should display a list of meals" do
    render("/views/meal_by_year.haml")
    response.should have_selector("ul li", :count => 2)
  end

  it "should link to meals" do
    render("/views/meal_by_year.haml")
    response.should have_selector("li a", :content => "Meal 1")
  end
The Haml code that makes these examples pass is:
%ul
  - @meal.each do |meal|
    %li
      %a{:href => "/meals/#{meal['_id']}"}= meal['title']
One view down...
(commit)

The other view that I need to create is a corresponding CouchDB view. I need to map each document to the year in which it was prepared, pointing to the meal ID and title (enough information to hyperlink to the meal). Additionally, a simple reduction pointing to the meal ID and title will allow me to group recipes by the year. The view that I want is:
{
"views": {
"by_year": {
"map": "function (doc) { emit(doc['date'].substring(0, 4), [doc['_id'], doc['title']]); }",
"reduce": "function(keys, values, rereduce) { return values; }"
}
},
"language": "javascript"
}
The date is an ISO 8601 string (YYYY first), which is the reason for the substring(0,4) call on it.

With that in place, and some non-trivial reworking of the views to match the actual data structure from the map-reduce rather than what I had guessed it would be, I can work my way back out to the Cucumber scenario to define how the next two steps behave—on the 2009 meals page, the 2009 recipe should be included, but not the 2008 recipe:
Then /^the "([^\"]*)" meal should be included in the list$/ do |title|
response.should have_selector("li a", :content => title)
end

Then /^the "([^\"]*)" meal should not be included in the list$/ do |title|
response.should_not have_selector("a", :content => title)
end
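The reworking mentioned above, matching the application code to the structure that the map-reduce actually returns, might look something like the following sketch. The response body here is hypothetical (shaped like a reduced view queried with group=true, with a flat list of [id, title] pairs per year), and meals_for_year is an illustrative helper, not code from the application.

```ruby
require 'json'

# Hypothetical response from the by_year view queried with group=true.
body = <<~BODY
  {"rows":[
    {"key":"2008","value":[["id-2008-05-13","Salad. Mmmm."]]},
    {"key":"2009","value":[["id-2009-05-13","Even Fried, They Won't Eat It"]]}
  ]}
BODY

# Pick the row for the requested year and turn each [id, title] pair into
# the hash shape that the Haml template iterates over.
def meals_for_year(body, year)
  row = JSON.parse(body)['rows'].find { |r| r['key'] == year.to_s }
  return [] unless row
  row['value'].map { |id, title| { '_id' => id, 'title' => title } }
end

meals_2009 = meals_for_year(body, 2009)
meals_2007 = meals_for_year(body, 2007)
```

A year with no meals simply yields an empty list, which the template renders as an empty ul.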
All that is left is to verify that clicking through to the 2008 meals works:



(commit)

Tomorrow.

by eee.c (chris.eee@gmail.com) at May 17, 2009 02:50 AM

Damien Katz

My Brother the Badass

He's the one with the blue headgear.

by Damien Katz at May 17, 2009 01:09 AM

May 16, 2009

Jan Lehnardt

CouchDB University

Prior to last week’s Erlang Factory, we ran the CouchDB University, a three-day training course where Chris & I taught a group of eleven all about CouchDB.

First of all, thanks to all the students for signing up for this first installment of a commercial CouchDB training. It is great to see our work validated that way. It also shows that CouchDB is a hot topic.

We went through all areas of CouchDB: An introductory overview, the HTTP API basics, setup and administration, basic and advanced view theory and practice, replication and distributed setups, performance tricks, CouchApps and finally internals.

I feel we could have gone on for five days. Despite my fear of not having enough material, it panned out pretty well.

Special thanks to Kevin Ferguson and Paul Davis. Kevin gave an introduction to couchdb-lounge, meebo.com’s CouchDB add-on that adds sharding, auto-replication and failover. Paul gave a tour through couchdb-lucene, Robert Newson’s add-on that adds fulltext search to CouchDB.

The next round of training will be in London in June. If you missed the Palo Alto event and can’t make it to London, please get in touch and we can try and see if we can set something up in your neighbourhood.

by Jan (jan@apache.org) at May 16, 2009 10:30 AM

Erlang Factory, Palo Alto 2009

The Erlang Factory 2009 in Palo Alto was the biggest gathering of Erlang expertise on the West Coast (or ever). Frank & Francesco and the Erlang Training & Consulting Team together with their numerous sponsors ran a solid conference.

The two days featured three tracks and by attending one talk you usually missed out on at least two other really interesting ones. Luckily, most of the sessions got videotaped and the recordings are already showing up.

Highlights include Cliff Moon’s live-killing of Dynomite nodes that power live.com to demonstrate the fault tolerance of the system (it worked well). Damien talked about why he decided to do CouchDB; his personal, rather than technical, talk was well received.

The conference closed with an open discussion about what’s next. The Erlang community has a lot to catch up on compared to other open source communities, and different strategies and improvements were discussed. I hope something comes out of it.

Personally, I find the Erlang folks the most relaxed and approachable of all. Robert Virding, one of the three original Erlang designers and developers, was fun to talk to and he was genuinely interested in new developments. The matter of “stardom” is completely absent and this was pretty refreshing.

If you missed this Erlang Factory, the next one is in London in June, I hope to see all you European folks there!

by Jan (jan@apache.org) at May 16, 2009 10:19 AM

May 15, 2009

Till Klampäckel

RFC: CouchDB on FreeBSD

Thanks to Wesley, we recently managed to update CouchDB's FreeBSD port to the official 0.9.0 release.

My current TODO for the port includes:

  • a super-cool rc-script (currently, there is none)
  • automatic user setup/creation (couchdb)
  • patching of the install/source to use BSD-style directories for the database (e.g. /var/db/couchdb).

In regard to the rc-script, I continued on a work in progress and committed an idea on Github. This work in progress (couchdb) works out of the box. Just download the file, put it in /usr/local/etc/rc.d/, chmod +x couchdb and done. I also updated the CouchDB wiki page explaining how CouchDB on FreeBSD is set up (e.g. the options of the rc-script) and updated the instructions on user creation, something that I plan to roll into the couchdb port ASAP.

But before I continue on the port, I'd like to ask for feedback from people (probably you!) who use CouchDB on FreeBSD. For example, are you happy with the current options, or do you need a different set, etc. If this work in progress is what you need, then that's valuable feedback as well.

Thanks!

by Till Klampaeckel (till@php.net) at May 15, 2009 02:19 PM

Chris Strom

Browsing Meals

‹prev | My Chain | next›

Up next, according to Cucumber (my master), is the "Browse Meals" feature. The first scenario in there is "Browsing a meal in a given year":
  Scenario: Browsing a meal in a given year
Given a "Even Fried, They Won't Eat It" meal enjoyed in 2009
When I view the list of meals prepared in 2009
Then "Even Fried, They Won't Eat It" should be included in the list

Given a "Even Fried, They Won't Eat It" meal enjoyed in 2009
And a "Salad. Mmmm." meal enjoyed in 2008
When I view the list of meals prepared in 2009
Then I should be able to follow a link to the list of meals in 2008
And "Salad. Mmmm." should be included in the list
This was one of the first scenarios that I wrote and it shows, I think. I have the same "Given" and the order of the steps reads a bit off. Re-organizing the scenario a bit:
  Scenario: Browsing a meal in a given year

Given a "Even Fried, They Won't Eat It" meal enjoyed in 2009
And a "Salad. Mmmm." meal enjoyed in 2008
When I view the list of meals prepared in 2009
Then the "Even Fried, They Won't Eat It" meal should be included in the list
And the "Salad. Mmmm." meal should not be included in the list
When I follow the link to the list of meals in 2008
Then the "Even Fried, They Won't Eat It" meal should not be included in the list
And the "Salad. Mmmm." meal should be included in the list
Much better.

My "Givens" are declared at the outset. There are two paths being followed (one for 2009, one for 2008), both starting with "When" declarations. Best of all, the second "When" flows from the first—clicking a link displayed in the first.

Implementing the first two steps can be accomplished via a single Given block:
Given /^a "([^\"]*)" meal enjoyed in (\d+)$/ do |title, year|
  date = Date.new(year.to_i, 5, 13)

  permalink = "id-#{date.to_s}"

  meal = {
    :title => title,
    :date => date,
    :serves => 4,
    :summary => "meal summary",
    :description => "meal description"
  }

  RestClient.put "#{@@db}/#{permalink}",
    meal.to_json,
    :content_type => 'application/json'
end
The next step, viewing the list of meals in 2009, can be defined as:
When /^I view the list of meals prepared in 2009$/ do
visit("/meals/2009")
response.status.should == 200
end
It fails, of course, since I have to define the meals action, so into the code I go...

As with recipes, I write my meal creation / tear down before / after blocks (this ain't no relational DB with fancy transactions):
  context "a CouchDB meal" do
    before(:each) do
      @date = Date.new(2009, 5, 13)
      @title = "Meal Title"
      @permalink = "id-#{@date.to_s}"

      meal = {
        :title => @title,
        :date => @date,
        :serves => 4,
        :summary => "meal summary",
        :description => "meal description"
      }

      RestClient.put "#{@@db}/#{@permalink}",
        meal.to_json,
        :content_type => 'application/json'
    end

    after(:each) do
      data = RestClient.get "#{@@db}/#{@permalink}"
      meal = JSON.parse(data)

      RestClient.delete "#{@@db}/#{@permalink}?rev=#{meal['_rev']}"
    end
  end
The first meals action example is a simple one:
    describe "GET /meals/YYYY" do
it "should respond OK" do
get "/meals/2009"
response.should be_ok
end
end
This fails, of course, since the action has not been defined. Let's define it and call it a night:
get %r{/meals/(\d+)} do |year|
end
(commit)
(commit)

Three steps down, 5 to go in the first meal scenario:



Looks like some fun with map-reduce is in store for tomorrow!

by eee.c (chris.eee@gmail.com) at May 15, 2009 03:04 AM

May 14, 2009

Chris Strom

Render CouchDB Images Via Sinatra

‹prev | My Chain | next›

Having finished the recipe details last night, I realized that I had no way of serving up the images stored in the CouchDB database. Something of a frightening realization, but it turns out to be as simple as:
get '/images/:permalink/:image' do
content_type 'image/jpeg'
RestClient.get "#{@@db}/#{params[:permalink]}/#{params[:image]}"
end
The content_type line specifies the content type as a JPEG image.

The RestClient.get call retrieves the attachment from the CouchDB database and serves up the output to the requesting client. The :permalink parameter is the ID of the recipe document itself. Accessing an attachment to that document is as simple as adding a slash and the name of the attachment. If the recipe's ID is 2008-07-21-spinach and the image filename is spinach_pie_0004.jpg, then I can use the Sinatra resource to access the image at http://localhost:4567/images/2008-07-21-spinach/spinach_pie_0004.jpg:



Next up: deleting this and redoing it with a spec—mostly to handle not found errors.
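One possible direction for that rewrite, sketched here under assumptions: rather than hardcoding image/jpeg, the content type could be looked up in the document's _attachments stubs, with a nil result signalling that a not found response is in order. The helper name and the sample document are hypothetical; the stub structure mirrors what CouchDB stores for attachments.

```ruby
# Look up an attachment's content type from a parsed CouchDB document.
# Returns nil when the document has no such attachment, which the caller
# could translate into a 404 response.
def attachment_content_type(doc, filename, default = 'application/octet-stream')
  attachment = (doc['_attachments'] || {})[filename]
  attachment ? attachment.fetch('content_type', default) : nil
end

# Hypothetical recipe document as it might come back from JSON.parse.
doc = {
  '_id' => '2008-07-21-spinach',
  '_attachments' => {
    'spinach_pie_0004.jpg' => {
      'stub' => true,
      'content_type' => 'image/jpeg',
      'length' => 30411
    }
  }
}

found   = attachment_content_type(doc, 'spinach_pie_0004.jpg')
missing = attachment_content_type(doc, 'no_such_image.jpg')
```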

by eee.c (chris.eee@gmail.com) at May 14, 2009 11:52 AM

Deliberate CouchDB Views

‹prev | My Chain | next›

I was able to get CouchDB map-reduce working yesterday. Even though it is working, I am not sure that I fully understand why it is working. I want to avoid programming by coincidence, so I am taking some time to step back and understand my tool of choice.

To recap, I want a list of every ingredient in all recipes, pointing to each recipe in which they are used:



I was able to accomplish this via CouchDB map-reduces. In temporary view form it looks like the following:



In design document form, this looks like:
{
"_id": "_design/recipes",
"_rev": "16-2289953674",
"views": {
"by_ingredients": {
"map": "function (doc) { for (var i in doc['preparations']) {
var ingredient = doc['preparations'][i]['ingredient']['name'];
var value = [doc['_id'], doc['title']];
emit(ingredient, value); }}",
"reduce": "function(keys, values, rereduce) { return values; }"
}
},
"language": "javascript"
}


But when I access the view via HTTP, I get this:
cstrom@jaynestown:~/repos/eee-code$ curl "http://localhost:5984/eee/_design/recipes/_view/by_ingredients"
{"rows":[
{"key":null,"value":[[["2006-06-17-fish","Green Chutney Covered Fish"],
["2006-06-17-shrimp","Curried Shrimp"],
["2007-01-15-soup","Crockpot Lentil Andouille Soup"],
...]]}
CouchDB params are well documented, and I probably just need to read that documentation a little closer. For instance, the reason I am getting the above is because of:
If a view contains both a map and reduce function, querying that view will by default return the result of the reduce function. The result of the map function only may be retrieved by passing reduce=false as a query parameter.
Indeed there are both a map and reduce, so I must be getting the reduce above (though not the one that I want). Passing reduce=false does get me the non-reduced results (the mapped results) that I was getting yesterday before adding the reduce:
cstrom@jaynestown:~/repos/eee-code$ curl "http://localhost:5984/eee/_design/recipes/_view/by_ingredients?reduce=false"
{"total_rows":72,"offset":0,"rows":[
{"id":"2008-07-21-spinach","key":"artichoke hearts","value":["2008-07-21-spinach","Spinach and Artichoke Pie"]},
{"id":"2008-07-19-oatmeal","key":"barley","value":["2008-07-19-oatmeal","Multi-grain Oatmeal"]},
{"id":"2006-10-08-dressing","key":"black pepper","value":["2006-10-08-dressing","Mustard Vinaigrette"]},
{"id":"2008-07-19-oatmeal","key":"brown sugar","value":["2008-07-19-oatmeal","Multi-grain Oatmeal"]},
{"id":"2002-01-13-hollandaise_sauce","key":"butter","value":["2002-01-13-hollandaise_sauce","Hollandaise Sauce"]},
{"id":"2006-07-26-fish","key":"butter","value":["2006-07-26-fish","Pan-Fried Fish with Potato Crust"]},
{"id":"2006-06-17-shrimp","key":"cardamom pod","value":["2006-06-17-shrimp","Curried Shrimp"]},
{"id":"2007-01-15-soup","key":"celery","value":["2007-01-15-soup","Crockpot Lentil Andouille Soup"]},
{"id":"2006-06-17-fish","key":"cilantro","value":["2006-06-17-fish","Green Chutney Covered Fish"]},
{"id":"2006-06-17-raita","key":"cilantro","value":["2006-06-17-raita","Yogurt Raita"]},
{"id":"2006-06-17-shrimp","key":"cinnamon","value":["2006-06-17-shrimp","Curried Shrimp"]},
{"id":"2008-07-19-oatmeal","key":"cinnamon","value":["2008-07-19-oatmeal","Multi-grain Oatmeal"]},
...
So what was that first result? That is not the reduced set that I want. I want the reduced set that I saw above in Futon. Again, actually reading the documentation, I find:
Keep in mind that the the Futon Web-Client silently adds group=true to your views
In desperation last night I added the group=true option in order to get the results I desired, but did not fully understand. At the very least, I am not relying on undocumented behavior that is likely to change in a future CouchDB release. For that alone, I feel much better.

But what does group=true actually do? According to the documentation:
The group option controls whether the reduce function reduces to a set of distinct keys or to a single result row.
It would seem that group=false is the default. When I access it with group=true, I get my desired results (as I did last night):
cstrom@jaynestown:~/repos/eee-code$ curl "http://localhost:5984/eee/_design/recipes/_view/by_ingredients?group=true"
{"rows":[
{"key":"artichoke hearts","value":[["2008-07-21-spinach","Spinach and Artichoke Pie"]]},
{"key":"barley","value":[["2008-07-19-oatmeal","Multi-grain Oatmeal"]]},
{"key":"black pepper","value":[["2006-10-08-dressing","Mustard Vinaigrette"]]},
{"key":"brown sugar","value":[["2008-07-19-oatmeal","Multi-grain Oatmeal"]]},
{"key":"butter","value":[["2006-07-26-fish","Pan-Fried Fish with Potato Crust"],["2002-01-13-hollandaise_sauce","Hollandaise Sauce"]]},
{"key":"cardamom pod","value":[["2006-06-17-shrimp","Curried Shrimp"]]},
{"key":"celery","value":[["2007-01-15-soup","Crockpot Lentil Andouille Soup"]]},
{"key":"cilantro","value":[["2006-06-17-raita","Yogurt Raita"],["2006-06-17-fish","Green Chutney Covered Fish"]]},
{"key":"cinnamon","value":[["2008-07-19-oatmeal","Multi-grain Oatmeal"],["2006-06-17-shrimp","Curried Shrimp"]]},
...
I understand this well enough to proceed. I am still not sure why the default is not to group the results. Why would I want a single result row? It certainly does not provide useful results in the case where I reduce to a list of recipes. Perhaps it proves useful in the case when reducing to a count or when making use of the rereduce parameter to the reduce function. Maybe, but why make it the default?
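To make the difference concrete, here is a rough Ruby simulation of what the two settings do with a "return values" reduce. It is only a mental model under simplifying assumptions: real CouchDB passes [key, docid] pairs as keys and may re-reduce partial results, which is why the actual ungrouped output above has an extra level of nesting.

```ruby
# Rows as the map function might emit them: [key, value].
mapped = [
  ['butter',   ['2002-01-13-hollandaise_sauce', 'Hollandaise Sauce']],
  ['butter',   ['2006-07-26-fish', 'Pan-Fried Fish with Potato Crust']],
  ['cilantro', ['2006-06-17-fish', 'Green Chutney Covered Fish']]
]

# Same idea as the reduce in the design document: return the values unchanged.
reduce = ->(_keys, values, _rereduce) { values }

# group=false: a single reduce over every row, giving one row with a null key.
ungrouped = [{
  'key' => nil,
  'value' => reduce.call(mapped.map(&:first), mapped.map(&:last), false)
}]

# group=true: one reduce per distinct key.
grouped = mapped.group_by(&:first).map do |key, rows|
  { 'key' => key, 'value' => reduce.call(rows.map(&:first), rows.map(&:last), false) }
end
```

With a reduce that counts or sums, the single ungrouped row would be a grand total, which is one case where that default is useful.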

Something to learn another day. For now, I am satisfied that I can build on my current understanding. It should prove more than adequate for the meal / blog work that is coming up this week.

by eee.c (chris.eee@gmail.com) at May 14, 2009 11:52 AM

May 13, 2009

Mikeal Rogers

RiP: Annotations Remix

I had some fun this weekend with Python, <video>, CouchDB and Brett Gaylor’s RiP: A Remix Manifesto. In just a few hours I was able to crank out a little annotations remix which allows anyone to add annotations to the film that are displayed as people view it.

I’m hosting it on my little mac mini (currently hidden in a data-center) so hopefully it doesn’t fall over pushing so much video :)

I’ve posted all the code up on github. The more I use <video> and CouchDB the more excited I get about the future of web applications. This entire project was done in little chunks of spare time over the weekend and most of that was me messing around with styling. To get the data stored, queried, and displayed took less than 2 hours.

Hope you all enjoy the annotations remix and if you haven’t already go and pay what you want for a terrific copy of RiP: A Remix Manifesto. It’s worth it.

by mikeal at May 13, 2009 11:25 PM

Chris Strom

Meal Upload

‹prev | My Chain | next›

To prepare for working with meals in my EEE Cooks replacement, I need to upload the old Rails meals into the new CouchDB store. Fortunately, I have already done this with recipes.

Just as I did with Recipe, I need to re-open the Meal class so that a few methods can be added to it:
class Meal < ActiveRecord::Base
  # IDs - date suffices for a meal (we never do more than one meal per day)
  def _id; date.to_s end

  # For uploading meal images
  def _attachments
    {
      self.image.filename =>
      {
        :data => Base64.encode64(File.open(self.image.full_filename).read).gsub(/\n/, ''),
        :content_type => "image/jpeg"
      }
    }
  end

  # For the menu
  def menu; menu_items.map(&:name) end

  # To distinguish between meals, recipes, etc.
  def type; self.class.to_s end
end
To include these methods in the JSON output:
>> m = Meal.first
=> #<Meal id: 15, title: "The Bestest Dinner Ever", summary: "...
>> json = m.to_json(:methods => [:_id, :menu, :type, :_attachments], :except => [:id, :image_old])
=> "{"type": "Meal",
"title": "The Bestest Dinner Ever",
"published": true,
"date": "2006/05/03",
"_id": "2006-05-03",
"serves": 4,
"_attachments": {"egg_dinner_7321.jpg": {"data": "/9j/4AAQSkZJRgABAQEASABIAAD...
Uploading the document to CouchDB is easily accomplished with a RestClient call:
>> RestClient.put "http://localhost:5984/eee/#{m._id}", json, :content_type => 'application/json'
=> "{"ok":true,"id":"2006-05-03","rev":"1-1328519176"}\n"
A quick check in Futon to make sure all is OK and I am all set:



For work over the next couple of days, I note the JSON structure in CouchDB:
{
  "_id": "2006-05-03",
  "_rev": "1-1328519176",
  "type": "Meal",
  "title": "The Bestest Dinner Ever",
  "published": true,
  "date": "2006/05/03",
  "serves": 4,
  "summary": " [kid:son1] was very excited about this meal. When he sat down to eat, he had in front of him pasta and cheese sticks. A thin slice of heaven for him. And this was before the sausage was put on the table. ",
  "description": " The girls were quite pleased with the dinner as well. Especially once Robin gave some lettuce to [kid:daughter2]. ",
  "author_id": null,
  "menu": [
    " [recipe:2006/05/03/eggs] ",
    " [recipe:2003/11/25/caesar_salad Caesar Salad] (served topped with the fried eggs) ",
    " Baked Macaroni with Sausage ",
    " Mozzarella Sticks "
  ],
  "_attachments": {
    "egg_dinner_7321.jpg": {
      "stub": true,
      "content_type": "image/jpeg",
      "length": 30411
    }
  }
}

by eee.c (chris.eee@gmail.com) at May 13, 2009 03:10 AM

May 11, 2009

Chris Strom

Prototyping CouchDB Views

‹prev | My Chain | next›

With recipe search done (yay!), I am ready to move on to other things today. There are only a few scenarios left, all dealing with the meal in which the recipes were originally served.

Before moving onto those things, I thought I would play around with CouchDB views a bit first. It's somewhat amazing to me that I have spent this much time on a CouchDB project and have yet to do something as basic as a view.

Ah well, time to rectify that oversight.

On our current site, we have a list of recipes by ingredient:



It would be very nice to be able to reproduce that with a CouchDB view.

In Javascript, what I would like to do with each recipe document is this:
function (doc) {
  for (var i in doc['preparations']) {
    var ingredient = doc['preparations'][i]['ingredient']['name'];
    var value = [doc['_id'], doc['title']];
    emit(ingredient, value);
  }
}
For each recipe document, iterate over each ingredient preparation and emit the ingredient, pointing to the document title and ID (enough to link to the recipe). I add this to the DB using Futon. From the main DB page, I select the Design Document view from the drop-down and create the following document:



Accessing the view results in:
cstrom@jaynestown:~/repos/eee-code$ curl "http://localhost:5984/eee/_design/recipes/_view/by_ingredients"
{"total_rows":72,"offset":0,"rows":[
{"id":"2008-07-21-spinach","key":"artichoke hearts","value":["2008-07-21-spinach","Spinach and Artichoke Pie"]},
{"id":"2008-07-19-oatmeal","key":"barley","value":["2008-07-19-oatmeal","Multi-grain Oatmeal"]},
{"id":"2006-10-08-dressing","key":"black pepper","value":["2006-10-08-dressing","Mustard Vinaigrette"]},
{"id":"2008-07-19-oatmeal","key":"brown sugar","value":["2008-07-19-oatmeal","Multi-grain Oatmeal"]},
{"id":"2002-01-13-hollandaise_sauce","key":"butter","value":["2002-01-13-hollandaise_sauce","Hollandaise Sauce"]},
{"id":"2006-07-26-fish","key":"butter","value":["2006-07-26-fish","Pan-Fried Fish with Potato Crust"]},
{"id":"2006-06-17-shrimp","key":"cardamom pod","value":["2006-06-17-shrimp","Curried Shrimp"]},
{"id":"2007-01-15-soup","key":"celery","value":["2007-01-15-soup","Crockpot Lentil Andouille Soup"]},
{"id":"2006-06-17-fish","key":"cilantro","value":["2006-06-17-fish","Green Chutney Covered Fish"]},
{"id":"2006-06-17-raita","key":"cilantro","value":["2006-06-17-raita","Yogurt Raita"]},
{"id":"2006-06-17-shrimp","key":"cinnamon","value":["2006-06-17-shrimp","Curried Shrimp"]},
{"id":"2008-07-19-oatmeal","key":"cinnamon","value":["2008-07-19-oatmeal","Multi-grain Oatmeal"]},
{"id":"2006-06-17-shrimp","key":"clove","value":["2006-06-17-shrimp","Curried Shrimp"]},
{"id":"2006-06-17-raita","key":"cucumber","value":["2006-06-17-raita","Yogurt Raita"]},
{"id":"2006-06-17-raita","key":"cumin","value":["2006-06-17-raita","Yogurt Raita"]},
{"id":"2006-06-17-shrimp","key":"curry leaves","value":["2006-06-17-shrimp","Curried Shrimp"]},
{"id":"2008-07-19-oatmeal","key":"dry milk powder","value":["2008-07-19-oatmeal","Multi-grain Oatmeal"]},
{"id":"2002-01-13-hollandaise_sauce","key":"egg yolks","value":["2002-01-13-hollandaise_sauce","Hollandaise Sauce"]},
{"id":"2002-01-13-eggs_benedict","key":"eggs","value":["2002-01-13-eggs_benedict","Crab Eggs Benedict"]},
{"id":"2008-07-21-spinach","key":"eggs","value":["2008-07-21-spinach","Spinach and Artichoke Pie"]},
...
They are sorted by ingredient name, but there are multiple records with "cinnamon" and with "eggs". What I really want is a list of each ingredient pointing to the recipes in which they are used. Sure I could assemble that list in Ruby code (especially with the nice ordering above), but maybe CouchDB can help?

Of course it can!

In addition to the "map" attribute for "by_ingredients", I add the following, as-simple-as-it-gets "reduce" attribute:
function(keys, values, rereduce) {
  return values;
}
In Futon, this looks like:



Now when I access the by_ingredients view, this time with the group attribute set, I get exactly what I am looking for:
cstrom@jaynestown:~/repos/eee-code$ curl "http://localhost:5984/eee/_design/recipes/_view/by_ingredients?group=true"
{"rows":[
{"key":"artichoke hearts","value":[["2008-07-21-spinach","Spinach and Artichoke Pie"]]},
{"key":"barley","value":[["2008-07-19-oatmeal","Multi-grain Oatmeal"]]},
{"key":"black pepper","value":[["2006-10-08-dressing","Mustard Vinaigrette"]]},
{"key":"brown sugar","value":[["2008-07-19-oatmeal","Multi-grain Oatmeal"]]},
{"key":"butter","value":[["2006-07-26-fish","Pan-Fried Fish with Potato Crust"],["2002-01-13-hollandaise_sauce","Hollandaise Sauce"]]},
{"key":"cardamom pod","value":[["2006-06-17-shrimp","Curried Shrimp"]]},
{"key":"celery","value":[["2007-01-15-soup","Crockpot Lentil Andouille Soup"]]},
{"key":"cilantro","value":[["2006-06-17-raita","Yogurt Raita"],["2006-06-17-fish","Green Chutney Covered Fish"]]},
{"key":"cinnamon","value":[["2008-07-19-oatmeal","Multi-grain Oatmeal"],["2006-06-17-shrimp","Curried Shrimp"]]},
{"key":"clove","value":[["2006-06-17-shrimp","Curried Shrimp"]]},
{"key":"cucumber","value":[["2006-06-17-raita","Yogurt Raita"]]},
{"key":"cumin","value":[["2006-06-17-raita","Yogurt Raita"]]},
{"key":"curry leaves","value":[["2006-06-17-shrimp","Curried Shrimp"]]},
{"key":"dry milk powder","value":[["2008-07-19-oatmeal","Multi-grain Oatmeal"]]},
{"key":"egg yolks","value":[["2002-01-13-hollandaise_sauce","Hollandaise Sauce"]]},
{"key":"eggs","value":[["2008-07-21-spinach","Spinach and Artichoke Pie"],["2002-01-13-eggs_benedict","Crab Eggs Benedict"]]},
...
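
To see why the grouped output has that shape, here is a minimal in-memory sketch of the map step followed by a per-key reduce. It is not CouchDB itself: the sample documents and the helper names (mapRecipe, reduceValues, groupedView) are made up for illustration, and the sketch calls reduce exactly once per key. A real CouchDB server may also call reduce with rereduce set to true on partial results, in which case returning values unchanged could nest the arrays one level deeper.

```javascript
// In-memory sketch of a map step followed by a grouped reduce.
// The sample documents are made up; only the map and reduce bodies mirror the post.
const docs = [
  { _id: 'r1', title: 'Hollandaise Sauce',
    preparations: [{ ingredient: { name: 'butter' } }, { ingredient: { name: 'egg yolks' } }] },
  { _id: 'r2', title: 'Pan-Fried Fish',
    preparations: [{ ingredient: { name: 'butter' } }] }
];

// Same logic as the by_ingredients map function, with emit passed in explicitly.
function mapRecipe(doc, emit) {
  for (const prep of doc.preparations || []) {
    emit(prep.ingredient.name, [doc._id, doc.title]);
  }
}

// Same as-simple-as-it-gets reduce: hand back the values for the key.
function reduceValues(keys, values, rereduce) {
  return values;
}

function groupedView(allDocs) {
  const rows = [];
  allDocs.forEach(doc => mapRecipe(doc, (key, value) => rows.push({ key, value })));
  // Views are sorted by key; a plain string comparison is enough for this sketch.
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  const grouped = new Map();
  for (const row of rows) {
    if (!grouped.has(row.key)) grouped.set(row.key, []);
    grouped.get(row.key).push(row.value);
  }
  // group=true: one output row per distinct key, reduced over that key's values.
  return [...grouped].map(([key, values]) => ({ key, value: reduceValues(null, values, false) }));
}

const ingredientRows = groupedView(docs);
console.log(JSON.stringify(ingredientRows));
```

Each output row carries one ingredient and the list of [id, title] pairs that use it, which is the same shape as the curl output above.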

by eee.c (chris.eee@gmail.com) at May 11, 2009 02:30 AM

May 09, 2009

Ricky Ho

Machine Learning Intuition

As more and more user data is gathered on web sites (such as e-commerce and social networking sites), data mining and machine learning techniques become increasingly important tools for analyzing that data and extracting useful information from it.

There is a wide variety of machine learning applications, such as …
  • Recommendation: After you buy a book at Amazon or rent a movie from Netflix, the site recommends other items that you may be interested in
  • Fraud detection: To protect its buyers and sellers, an auction site like eBay detects abnormal patterns to identify fraudulent transactions
  • Market segmentation: Product companies divide their market into segments of similar potential customers and design a specific marketing campaign for each segment
  • Social network analysis: By analyzing users’ social network profile data, a social networking site like Facebook can categorize its users and personalize their experience
  • Medical research: Analyzing DNA patterns, cancer research, diagnosing problems from symptoms
However, machine learning theory involves a lot of math, which is non-trivial for people who don’t have a rigorous math background. Therefore, I am trying to provide an intuitive perspective behind the math.

General Problem

Each piece of data can be represented as a vector [x1, x2, …] where xi are the attributes of the data.

Such attributes can be numeric or categorical (e.g. age is a numeric attribute and gender is a categorical attribute).

There are basically 3 branches of machine learning ...

Supervised learning
  • The main use of supervised learning is to predict an output based on a set of training data. A set of data with structure [x1, x2 …, y] is presented (in this case y is the output). The learning algorithm learns from the training set how to predict the output y for data it sees in the future.
  • When y is numeric, the prediction is called regression. When y is categorical, the prediction is called classification.

Unsupervised learning
  • The main use of unsupervised learning is to discover unknown patterns within data. (e.g. grouping similar data, or detecting outliers).
  • Identifying clusters is a classical scenario of unsupervised learning

Reinforcement learning
  • This is also known as “continuous learning”, where the final output is not given. The agent chooses an action based on its current state and is then presented with a reward. The agent learns how to maximize its reward and comes up with a model called the “optimal policy”. A policy is a mapping from “state” to “action” (given I am at a particular state, what action should I take).

Data Warehouse

A data warehouse is not “machine learning”; it is basically a special way to store your data so that it can easily be grouped in many ways for manual analysis.

Typically, data is created by the OLTP systems that run the company’s business operations. OLTP captures the “latest state” of the company. Data is periodically snapshotted into the data warehouse for OLAP; in other words, the data warehouse adds a time dimension to the data.

An ETL process extracts data from various sources, cleanses it, transforms it into the form needed by the data warehouse, and then loads it into the data cube.

A data warehouse typically organizes the data as a multi-dimensional data cube based on a "Star schema" (1 Fact table + N Dimension tables). Each cell contains aggregate data along a different combination of dimensions.


OLAP processing involves the following operations:
  • Rollup: Aggregate data within a particular dimension (e.g. for the “time” dimension, you can “rollup” the aggregation from “month” into “quarter”)
  • Drilldown: Break down the data within a particular dimension (e.g. for the “time” dimension, you can “drilldown” from “months” into “days”)
  • Slice: Cut a layer out of a particular dimension (e.g. look at all data for “Feb”)
  • Dice: Select a sub data cube (e.g. look at all data for “Jan” and “Feb” as well as products “laptop” and “hard disk”)
The data warehouse can be further diced into specific “data marts” that focus on different areas for further drilldown analysis.
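
The OLAP operations can be sketched on a tiny in-memory fact table. This is only an illustration under made-up data: the rows, the quarterOf lookup and the function names are assumptions, and a real warehouse would run these as queries over a star schema rather than array filters.

```javascript
// Tiny fact table: one row per (month, product) with a sales measure (made-up data).
const facts = [
  { month: 'Jan', product: 'laptop', sales: 10 },
  { month: 'Feb', product: 'laptop', sales: 12 },
  { month: 'Feb', product: 'hard disk', sales: 5 },
  { month: 'Apr', product: 'laptop', sales: 7 }
];

// Hypothetical lookup from the finer level (month) to the coarser level (quarter).
const quarterOf = { Jan: 'Q1', Feb: 'Q1', Mar: 'Q1', Apr: 'Q2', May: 'Q2', Jun: 'Q2' };

// Rollup: aggregate the time dimension from month up to quarter.
function rollupByQuarter(rows) {
  const totals = {};
  for (const row of rows) {
    const quarter = quarterOf[row.month];
    totals[quarter] = (totals[quarter] || 0) + row.sales;
  }
  return totals;
}

// Slice: fix one value of one dimension.
const slice = (rows, month) => rows.filter(r => r.month === month);

// Dice: pick a sub-cube by restricting several dimensions at once.
const dice = (rows, months, products) =>
  rows.filter(r => months.includes(r.month) && products.includes(r.product));

const byQuarter = rollupByQuarter(facts);
const febOnly = slice(facts, 'Feb');
const subCube = dice(facts, ['Jan', 'Feb'], ['laptop', 'hard disk']);
console.log(byQuarter, febOnly.length, subCube.length);
```

Drilldown is the reverse of rollup: starting from the quarter totals, you go back to the month-level rows that produced them.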

Some Philosophy

To determine the output from a set of input attributes, one way is to study the physics behind them and write a function that transforms the input attributes into the output. However, what if the relationship is unknown, or hasn’t been formally specified?

Instead of relying on a sound theoretical model, machine learning tries to make predictions based on previously observed data. There are 2 broad types of learning strategies.

Instance-based learning
  • Also known as lazy learning, the learner remembers all the previously seen examples. When a new piece of input data arrives, it tries to find the best-matching data it has previously seen and uses that data's output to predict the output of the new data. The underlying assumption is that if two pieces of data are “similar” in their input attributes, their outputs are also similar.
  • Nearest neighbor is a classical approach to instance-based learning

Model-based learning
Also known as eager learning, the learner builds a generalized model upfront, whereas lazy learning learns from seen examples at query time. Instead of learning a generic model that fits all observed data, lazy learning can focus its analysis close to the query point. However, getting a comprehensible model from lazy learning is harder, and it also requires a large amount of memory to store all seen data.

by Ricky Ho (rickyphyllis@gmail.com) at May 09, 2009 06:58 PM

Machine Learning: Linear Model

Linear Model is a family of model-based learning approaches that assume the output y can be expressed as a linear algebraic relation with the input attributes x1, x2 ...

Here our goal is to learn the parameters of the underlying model, which are the coefficients.

Linear Regression

Here the input and output are both real numbers, related through a simple linear relationship. The learning goal is to figure out the hidden weight value (ie: the W vector).

Given a batch of training data, we want to figure out the weight vector W such that the total error (the sum of the squared differences between the predicted output and the actual output) is minimized.
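
Written out, the batch objective looks roughly like this. The notation is an assumption made for illustration: N training samples indexed by n, a constant input x_0 = 1 so that w_0 acts as the intercept, and a conventional factor of 1/2 that only simplifies the derivative.

```latex
\hat{y}^{(n)} = W \cdot x^{(n)} = w_0 + w_1 x_1^{(n)} + w_2 x_2^{(n)} + \dots
\qquad
E(W) = \frac{1}{2} \sum_{n=1}^{N} \left( y^{(n)} - \hat{y}^{(n)} \right)^2
```

Learning then means finding the W that minimizes E(W).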


Instead of using the batch processing approach, a more effective approach is to learn incrementally (update the weight vector for each input data) using a gradient descent approach.

Gradient Descent

Gradient descent is a very general technique that we can use to incrementally adjust the parameters of the linear model. The basic idea of "gradient descent" is to adjust each dimension (w0, w1, w2) of the W vector according to its contribution to the squared error. That contribution is measured by the gradient along the dimension, i.e. the partial derivative of the squared error with respect to w0, w1, w2.

In the case of Linear Regression ...
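
As a minimal sketch of the incremental update, assume the squared error for one sample is E = 0.5 * (y - yHat)^2, so the partial derivative with respect to w_i is -(y - yHat) * x_i and each weight moves by learningRate * (y - yHat) * x_i. The function names, the learning rate and the training data below are illustrative choices, not taken from the post.

```javascript
// Stochastic gradient descent for a linear model y ≈ w0 + w1*x1 + ... (sketch).
function predict(weights, x) {
  // x[0] is assumed to be the constant 1 so that weights[0] acts as the bias w0.
  return weights.reduce((sum, w, i) => sum + w * x[i], 0);
}

function trainLinear(samples, learningRate, epochs) {
  const weights = new Array(samples[0].x.length).fill(0);
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (const { x, y } of samples) {
      const error = y - predict(weights, x);
      // Move each weight against the gradient of the squared error for this sample.
      for (let i = 0; i < weights.length; i++) {
        weights[i] += learningRate * error * x[i];
      }
    }
  }
  return weights;
}

// Made-up, noise-free training data generated from y = 1 + 2 * x1.
const samples = [0, 1, 2, 3, 4].map(x1 => ({ x: [1, x1], y: 1 + 2 * x1 }));
const learned = trainLinear(samples, 0.05, 2000);
console.log(learned);
```

With noise-free data the learned weights approach the generating coefficients; with a learning rate that is too large the updates would diverge instead.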


Logistic Regression

Logistic Regression is used when the output y is binary and not a real number. The first part is the same as linear regression, while in a second step a sigmoid function is applied to squash the output value between 0 and 1.

We use the exact same gradient descent approach to determine the weight vector W.
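
Here is a small sketch of the two steps and the weight update. One assumption to flag: the update below is the gradient step for the log-loss, which has the same form as the linear regression update; with a squared-error loss an extra factor p * (1 - p) would appear. The names, learning rate and AND training set are made up for illustration.

```javascript
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// x[0] is assumed to be the constant 1 so that weights[0] acts as the bias.
function probability(weights, x) {
  const z = weights.reduce((sum, w, i) => sum + w * x[i], 0); // linear step
  return sigmoid(z);                                           // squashed into (0, 1)
}

function trainLogistic(samples, learningRate, epochs) {
  const weights = new Array(samples[0].x.length).fill(0);
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (const { x, y } of samples) {
      const error = y - probability(weights, x);
      for (let i = 0; i < weights.length; i++) {
        weights[i] += learningRate * error * x[i];
      }
    }
  }
  return weights;
}

// Made-up training set: the logical AND of two binary inputs.
const andSamples = [
  { x: [1, 0, 0], y: 0 }, { x: [1, 0, 1], y: 0 },
  { x: [1, 1, 0], y: 0 }, { x: [1, 1, 1], y: 1 }
];
const andWeights = trainLogistic(andSamples, 0.5, 2000);
// Round the probability to get a binary prediction for each training input.
const andOutputs = andSamples.map(s => Math.round(probability(andWeights, s.x)));
console.log(andOutputs);
```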

Neural Network

Inspired by how our brain works, a neural network organizes many logistic regression units into layers of perceptrons (each unit has both inputs and outputs in binary form).

Learning in a neural network means discovering all the hidden values of w. In general, we use the same technique as above, adjusting the weights using gradient descent layer by layer. We start from the output layer and move towards the input layer (this technique is called backpropagation). Except at the output layer, we don't know the exact error, so we need a way to estimate the error at the hidden layers.

But notice there is a symmetry between the weights and the inputs: we can use the same technique we use to adjust the weights to estimate the error of the hidden layer.



Support Vector Machine

by Ricky Ho (rickyphyllis@gmail.com) at May 09, 2009 06:54 PM

May 08, 2009

Jason Davies

CouchDB on Wheels

Ely Service now runs on CouchDB. Things just got a little simpler: no more Django plus PostgreSQL plus Nginx.

Casual Lofa: the World's fastest furniture

Ely Service is, as J. Chris Anderson put it, “just a very ordinary-looking garage Web site”. It's a simple Web site, which I originally developed using Django. It consists of six pages, one of which has a contact form for sending emails. So the requirements are very straightforward.

Why switch?

This was an experiment to see how easy it is to develop a simple Web site using CouchDB and (almost) nothing else. Ely Service is essentially a static Web site, and hence barely exploits any of the roaring power of CouchDB's B-Tree index or its distributed capabilities.

CouchApp

CouchApp is a set of scripts that make developing standalone CouchDB applications a lot simpler. Using Futon to do this at the moment is far too painful, although I could imagine a lightweight IDE that allows various show/list functions to be previewed as they are developed. Patches welcome!

In a nutshell, CouchApp allows you to store your map/reduce views, lists, shows and validation functions as files in a directory tree. You can also include various helper functions and templates, which are inserted using macros before being pushed to the database.

I put the majority of the Ely Service site into its own app and the contact form handler into a separate app. Complex sites may consist of many apps that work together. Here is the structure of the "elyservice" application:

elyservice/
  _attachments/
  lib/
    helpers/
      ejs.js
  templates/
    layout/
      head.html
      tail.html
  shows/
    contact.js
    page.js

Show Me

As this is a simple site with only 6 static pages, these are all generated using simple "show" functions.

function (doc, req) {
  // !json templates
  // !code lib/helpers/couchapp.js
  // !code lib/helpers/ejs.js
  var body = new EJS({
    text: templates.layout.head + templates.page + templates.layout.tail
  }).render({
    assets: assetPath(),
    doc: doc
  });
  return {
    headers: {'Content-Type': 'text/html; charset=UTF-8'},
    body: body
  };
}

Using CouchDB to send E-mail

This is the most complex part of the site, as it requires the use of an external process to send the emails. Strictly speaking, an external process is not necessary; a cron job would also do the job just fine.

I decided to write this as a generic CouchApp so it could be reused across multiple sites. Pretty much every site has a contact form of some kind.

This works like a standard UNIX mail spooler. New messages are created with a status of "spool", and the notification script sets the status to "sent" when it has finished sending. Unsent messages are retrieved by the notification script by calling the "mail_spool" view:

function (doc) {
  if (doc.type == 'mail' && doc.status == 'spool')
    emit(null, null);
}

The actual sending of email is done by the send_emails.py script, which is launched as an external process.
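
The spool cycle itself is small enough to sketch in memory. This is not the send_emails.py script (which is Python and talks to CouchDB); the function names, the sample documents and the sendFn callback are hypothetical, and only the "spool" to "sent" transition follows the description above.

```javascript
// In-memory sketch of the spool cycle; the real script reads and updates CouchDB documents.
function mailSpool(docs) {
  // Same filter as the mail_spool view's map function.
  return docs.filter(doc => doc.type === 'mail' && doc.status === 'spool');
}

function processSpool(docs, sendFn) {
  let sentCount = 0;
  for (const doc of mailSpool(docs)) {
    sendFn(doc);          // hypothetical delivery callback
    doc.status = 'sent';  // mark as done so the view no longer returns it
    sentCount++;
  }
  return sentCount;
}

const mailbox = [
  { _id: 'a', type: 'mail', status: 'spool', subject: 'Hello' },
  { _id: 'b', type: 'mail', status: 'sent', subject: 'Old' },
  { _id: 'c', type: 'page', title: 'Home' }
];
const delivered = [];
const sentNow = processSpool(mailbox, doc => delivered.push(doc._id));
console.log(sentNow, delivered, mailSpool(mailbox).length);
```

Running the cycle a second time sends nothing, because no document is left in the "spool" state.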

I've put the contact form code here, including the mail spooler: http://github.com/jasondavies/couchdb-contact-form/tree/master

Nginx Configuration

One of the only remaining hurdles to truly pure CouchApps is support for clean URLs. I wanted to retain the clean URLs of the original Ely Service site, and in order to do this I had to rewrite them using a reverse proxy. Nginx was ideal for this task.

server {
    listen 89.145.97.172:80;
    server_name www.elyservice.co.uk;
    set $projectname elyservice;

    location / {
        if ($request_method !~ ^(GET|HEAD)$) {
            return 444;
        }

        proxy_pass http://127.0.0.1:5984/elyservice;
        proxy_redirect default;
        proxy_set_header X-Orig-Host '$host:$server_port';

        rewrite ^/media/(.+)$ /$projectname/_design/elyservice/$1 break;
        rewrite ^/$ '/$projectname/_design/elyservice/_show/pages' break;
        rewrite ^/(.*)/$ '/$projectname/_design/elyservice/_show/pages/pages:$1' break;

        return 404;
    }

    location /contact/ {
        if ($request_method !~ ^(GET|HEAD|POST)$) {
            return 444;
        }

        proxy_pass http://127.0.0.1:5984/elyservice;
        proxy_redirect default;
        proxy_set_header X-Orig-Host '$host:$server_port';

        if ($request_method = POST) {
            rewrite ^/contact/$ /$projectname/ break;
        }
        rewrite ^/contact/$ '/$projectname/_design/elyservice/_show/contact' break;

        return 404;
    }
}

It turns out that Nginx automatically decodes URL-encoded characters in rewrite URLs before passing them through the proxy. Hence I couldn't use "pages/foo" for my docids. No problem here, I simply elected to use "pages:foo" instead.
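
The effect is easy to reproduce outside Nginx. In this sketch decodeURIComponent merely stands in for a proxy that decodes the path before forwarding it, and the segment count is just a convenient way to show that a decoded slash splits the docid in two.

```javascript
const showPath = '/elyservice/_design/elyservice/_show/pages/';

// A docid containing a slash must be sent as %2F to stay a single path segment.
const encodedId = encodeURIComponent('pages/foo');

// Stand-in for a proxy that decodes the path before passing it on.
const afterDecoding = decodeURIComponent(showPath + encodedId);
const slashSegments = afterDecoding.split('/').length;

// A colon can appear literally in a path segment, so decoding leaves the docid in one piece.
const colonSegments = decodeURIComponent(showPath + 'pages:foo').split('/').length;

console.log(encodedId, slashSegments, colonSegments);
```

The slash version ends up with one more path segment than the colon version, so the decoded request no longer addresses a single document.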

It's worth noting that support for a CouchDB rewrite handler is under active discussion at the moment, so watch this space.

Security and Validation

Validation is very important; I don't want d00dz being able to edit any document in the database. All it takes is a POST or a PUT and anyone can create or update any document. To prevent this, first of all I added an admin user to local.ini. This user is given the special role of "_admin", which has the special privilege of being able to create and modify design docs.

However, this level of security is not enough, as someone could still PUT malicious text to the home page doc for example.

A simple way to prevent this is to configure Nginx to reject any requests that aren't HEAD or GET:

if ($request_method !~ ^(GET|HEAD)$) {
    return 444;
}

Note: the non-standard error code 444 causes nginx to drop the connection (see https://calomel.org/nginx.html). The standard "forbidden" error code 403 could be used instead.

There may be cases, though, where we want users to be able to create/modify some documents but not others. For Ely Service, we want anonymous users to be able to create new documents of type "mail", but nothing else. This is where validate_doc_update comes in handy.

function (newDoc, oldDoc, userCtx) {
  // !code _attachments/validate.js
  if (userCtx.roles.indexOf('_admin') != -1) {
    return;
  }
  if (oldDoc == null) {
    return validate(newDoc);
  }
  throw {
    forbidden: "Invalid operation: existing messages cannot be modified."
  };
}
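
The validate function lives in _attachments/validate.js and is not shown here, so the following is only a guess at its general shape: a hypothetical validate that accepts "mail" documents with a message field (the field name is an assumption), wrapped in the same control flow as above so the three cases can be tried out in plain JavaScript.

```javascript
// Hypothetical stand-in for _attachments/validate.js, which the post does not show.
function validate(newDoc) {
  if (newDoc.type !== 'mail') {
    throw { forbidden: 'Only documents of type "mail" may be created.' };
  }
  if (!newDoc.message) {
    throw { forbidden: 'A mail document needs a message.' };
  }
}

// Same control flow as the validate_doc_update function above.
function validateDocUpdate(newDoc, oldDoc, userCtx) {
  if (userCtx.roles.indexOf('_admin') !== -1) {
    return;
  }
  if (oldDoc == null) {
    return validate(newDoc);
  }
  throw { forbidden: 'Invalid operation: existing messages cannot be modified.' };
}

// Helper for trying it out: returns the rejection reason, or null if accepted.
function rejectionReason(newDoc, oldDoc, roles) {
  try {
    validateDocUpdate(newDoc, oldDoc, { roles });
    return null;
  } catch (err) {
    return err.forbidden;
  }
}

console.log(rejectionReason({ type: 'mail', message: 'hi' }, null, []));
console.log(rejectionReason({ type: 'page' }, null, []));
console.log(rejectionReason({ type: 'page' }, null, ['_admin']));
```

Anonymous users can create a new mail document, are refused for anything else, and admins bypass the checks entirely.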

Conclusion

Although CouchDB is still alpha software, developing and deploying a simple Web site using CouchApp was very straightforward. The real benefits of CouchDB were not exploited at all, but we'll see some of that in a future post.

Several people have noted that Ely Service loads very quickly. This is a combination of CouchDB's raw speed and the simplicity of Ely Service's design.

by Jason Davies at May 08, 2009 12:30 AM

May 05, 2009

Ricky Ho

Kids Learning Resources

While chatting with a friend, I thought it would be useful to summarize a list of good learning resources for kids that I've used in the past.

Learning in this generation is very different, given how many resources are available on the internet. It is important to teach kids to find knowledge by themselves.

General tools for knowledge acquisition

Geography and Sightseeing

  • Google Map for navigating to different places
  • Using the street view feature by dragging the little person icon (at the top left corner) into the map, we can visit different countries virtually. Here is the Eiffel tower in Paris, France. Google street view is extending its coverage so expect to see more areas in the world being covered in future.
  • Google Earth for an even more sophisticated 3D map navigation. Watch their introduction video.

History and Ancient Empire


Human Body


Money Management


Developing good economic sense and planning to use money wisely are important, especially when kids are commonly spoiled by their parents these days.

Online Games


I normally don't recommend exposing kids to video or online games unless they are carefully chosen for educational purposes. Kids are so easily addicted to games. But if you can use their addiction appropriately, it is a good way to motivate them to learn, and it is fun too.

Here are some exceptionally good ones from Lego

Game Designing


It is also good to teach them how a game is designed because this gives them an opportunity to switch roles (from a player to a designer) and establish a good overall picture. There are some very good tools for designing games.

Kids Programming

Programming is all about planning the steps to achieve a goal, and debugging is all about using a systematic way to find out why something doesn't work as planned. Even if you are not planning for your kids to become programmers or computer professionals, teaching them how to program can develop very good planning and analytical skills, which are very useful in their daily life.
  • The famous LOGO Green turtle is a good start at age 6 - 7
  • Lego Mindstorm is a very good one because kids are exposed not just to programming skills but to other engineering skills (such as mechanics) as well. It costs some money but I think it is well worth it.
  • A great link for robot fans.
  • At around age 10, I found a professional programming language can be taught. I have chosen Ruby given its expressiveness and simplicity.
  • One of my colleagues has suggested BlueJ, basically Java, but I personally haven't tried that yet (given I am a Ruby fan).

Materials at all subjects and all levels

by Ricky Ho (rickyphyllis@gmail.com) at May 05, 2009 03:55 PM

Machine Learning: Nearest Neighbor

This is the simplest technique for instance-based learning. Basically, find the previously seen data point that is "closest" to the query data point, and then use its output for the prediction.

The concept of "close" is defined by a distance function dist(A, B), which needs to satisfy the triangle inequality.
ie: dist(A, B) + dist(B, C) >= dist(A, C)

Defining the distance function can be domain specific. One popular generic distance function is the Euclidean distance.
dist(A, B) = square_root(sum_over_i(square(xai - xbi)))

In order to give each attribute the same degree of influence, you need to normalize their scales to the same range. You also need to figure out a way to compute the difference between categorical values (ie: whether "red" is more similar to "blue" or "green"). A common approach is to see whether "red" and "blue" affect the output value in a similar way. If both colors have a similar probability distribution across the output values, then we consider the two colors similar.

Therefore you need to transform the attributes xi.
  • Normalize their scale: transform xi = (xi - mean) / std-deviation
  • Quantify categorical data: If xi is categorical, then (xai - xbi) = sum_over_k(abs(P(class[k] | xai) – P(class[k] | xbi))). The absolute value matters: without it the terms always sum to zero, because each distribution sums to 1.
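
Both transformations can be sketched in a few lines. The function names and sample numbers are made up; the categorical difference sums absolute differences of the class probabilities so that positive and negative terms do not cancel, and zScores assumes the attribute is not constant (otherwise the standard deviation would be zero).

```javascript
// Z-score normalization of one numeric attribute across all data points.
function zScores(values) {
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length;
  const std = Math.sqrt(variance); // assumed non-zero, i.e. the attribute varies
  return values.map(v => (v - mean) / std);
}

// Difference between two categorical values, given P(class[k] | value) for each.
// Absolute differences are summed so that positive and negative terms do not cancel.
function categoricalDiff(probsA, probsB) {
  return probsA.reduce((s, p, k) => s + Math.abs(p - probsB[k]), 0);
}

const ages = zScores([20, 30, 40]);
// Made-up class distributions for three colors over two output classes.
const redVsBlue = categoricalDiff([0.8, 0.2], [0.7, 0.3]);
const redVsGreen = categoricalDiff([0.8, 0.2], [0.1, 0.9]);
console.log(ages, redVsBlue, redVsGreen);
```

In this made-up example "red" comes out closer to "blue" than to "green", because red and blue are distributed across the output classes in nearly the same way.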
Nearest neighbor is sensitive to outliers: if you have a few abnormal data points, query points around these outliers will be wrongly estimated. One solution is to use multiple nearest neighbors and combine their outputs in a certain way. This is known as KNN (k-nearest-neighbor). If the problem is classification, every neighbor casts a vote with a weight inversely proportional to its "distance" from the query point, and the majority wins. If the problem is regression, the weighted average of their outputs is used instead.
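
A brute-force sketch of KNN follows, using the Euclidean distance and inverse-distance weights. The data, the function names and the small constant added to the distance (to avoid dividing by zero when the query coincides with a stored point) are illustrative assumptions.

```javascript
function euclidean(a, b) {
  return Math.sqrt(a.reduce((s, ai, i) => s + (ai - b[i]) ** 2, 0));
}

// Returns the k stored points closest to the query, each annotated with its distance d.
function nearest(points, query, k) {
  return points
    .map(p => ({ ...p, d: euclidean(p.x, query) }))
    .sort((p, q) => p.d - q.d)
    .slice(0, k);
}

// Classification: each neighbor votes with weight 1 / distance; the heaviest class wins.
function knnClassify(points, query, k) {
  const votes = {};
  for (const p of nearest(points, query, k)) {
    const weight = 1 / (p.d + 1e-9); // small constant avoids division by zero (assumption)
    votes[p.y] = (votes[p.y] || 0) + weight;
  }
  return Object.keys(votes).reduce((best, label) => (votes[label] > votes[best] ? label : best));
}

// Regression: distance-weighted average of the neighbors' outputs.
function knnRegress(points, query, k) {
  let weightedSum = 0, totalWeight = 0;
  for (const p of nearest(points, query, k)) {
    const weight = 1 / (p.d + 1e-9);
    weightedSum += weight * p.y;
    totalWeight += weight;
  }
  return weightedSum / totalWeight;
}

const labeled = [
  { x: [0, 0], y: 'red' }, { x: [0, 1], y: 'red' },
  { x: [5, 5], y: 'blue' }, { x: [5, 6], y: 'blue' }, { x: [1, 1], y: 'blue' }
];
const label = knnClassify(labeled, [0.2, 0.2], 3);

const numeric = [{ x: [0], y: 10 }, { x: [2], y: 20 }, { x: [10], y: 100 }];
const estimate = knnRegress(numeric, [1], 2);
console.log(label, estimate);
```

The lone "blue" point at [1, 1] is among the three nearest neighbors of the query, but it is outvoted by the two closer "red" points.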

Execution Optimization

One problem with instance-based learning is that you need to store all previously seen data and also compute the distance from the query point to each of them. Both the time and space complexity of serving a single query are O(M * N), where M is the number of dimensions and N is the number of previous data points.

Instead of computing the distance between the query point and each of the existing data points, you can organize the existing points into a KD Tree based on the distance function. The KD Tree has the property that the max distance between two nodes is bounded by the level of their common parent.

Using the KD Tree, you navigate the tree starting at the root node. Basically, you calculate the distance from the current node to the query point
dist(current_node, query_point)

and from each of the child nodes of the current_node
dist(child_j, query_point)

Then find the minimum of them. If the minimum is one of the children, navigate down the tree by setting current_node to this child and repeat the process. You terminate when the current_node is the minimum, or when there are no more child nodes.

After terminating at a particular node, this node is pretty close to the query point. You need to explore the nodes surrounding this node (its siblings, its siblings' children, its parent's siblings) to locate the K nearest neighbors.

By using a KD Tree, the time complexity depends on the depth of the tree and hence is of order O(M * log N).

Note that the KD Tree is not effective when the data has many dimensions (> 6).

Another way is to throw away some of the previously seen data if it won't affect the resulting prediction. This is especially effective for classification problems: you can keep just the data at the boundary between two different output values and throw away the interior points of a cluster whose data points all have the same output value. However, if you are using KNN, then throwing away some points may change the result. So a general approach is to verify that the previously seen data is still correctly predicted after throwing out various combinations of points.

Recommendation Engine

A very popular application of KNN is the recommendation engine of many e-commerce web sites, using a technique called "Collaborative Filtering". E.g. when an online user has purchased a book, the web site looks at other "similar" users to see what other books they have bought and recommends those to the current user.

First of all, how do we determine which attributes of the users to capture? This is a domain-specific question because we want to identify the attributes that are most influential. Maybe we can use static information such as the user's age, gender, city ... etc. But here let's use something more direct ... the implicit transaction information (e.g. if the user has purchased a book online, we know that he likes that book) as well as explicit rating information (e.g. the user rates a book he bought previously, so we know whether he/she likes the book or not).

Let's use a simple example to illustrate the idea. Here we have a number of users who rate a set of movies. The ratings range over [0 - 10], where 0 means the user hates the movie and 10 means the user likes it extremely.


The next important thing is to define the distance function. We don't want to use the ratings directly, for the following reasons.

Some nice users give an average rating of 7 while some tough users give an average rating of 5. Likewise, the range of ratings is wide for some users and narrow for others. We don't want these factors to affect our calculation of user similarity. We consider two users to have the same taste as long as they rate the same movie above their average or below their average by the same percentage of their rating range. Two users have different tastes if they rate the movies in different directions.

Let rating_i_x denote user_i's rating of movie_x

We can use the correlation coefficient to capture this.

sum_of_product =
sum_over_x( (rating_i_x - avg(rating_i)) * (rating_j_x - avg(rating_j)) )

If this number is positive, then user_i and user_j are moving in the same direction. If this number is negative, then they are moving in opposite directions (negatively correlated). If this number is zero, then they are uncorrelated.

We also need to normalize by the range of each user's ratings, so we compute
root_of_product_square_sum =
square_root( sum_over_x((rating_i_x - avg(rating_i)) ** 2) * sum_over_x((rating_j_x - avg(rating_j)) ** 2) )

Define Pearson Coefficient = sum_of_product / root_of_product_square_sum

We let the Pearson Coefficient quantify the "similarity" between 2 users.
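
As a minimal sketch, the coefficient can be computed as below. It uses the standard Pearson normalization (the square root of the product of the two users' squared-deviation sums) and assumes both arrays hold ratings for the same movies in the same order; the users and ratings are made up.

```javascript
function mean(values) {
  return values.reduce((s, v) => s + v, 0) / values.length;
}

// ratingsI[x] and ratingsJ[x] are the two users' ratings of the same movie x.
function pearson(ratingsI, ratingsJ) {
  const avgI = mean(ratingsI);
  const avgJ = mean(ratingsJ);
  let sumOfProduct = 0, squareSumI = 0, squareSumJ = 0;
  for (let x = 0; x < ratingsI.length; x++) {
    const di = ratingsI[x] - avgI; // deviation from user_i's own average
    const dj = ratingsJ[x] - avgJ; // deviation from user_j's own average
    sumOfProduct += di * dj;
    squareSumI += di * di;
    squareSumJ += dj * dj;
  }
  return sumOfProduct / Math.sqrt(squareSumI * squareSumJ);
}

const niceUser = [8, 7, 6];     // averages 7
const toughUser = [6, 5, 4];    // averages 5, but ranks the movies the same way
const oppositeUser = [2, 5, 8]; // ranks the movies the other way round

const sameTaste = pearson(niceUser, toughUser);
const oppositeTaste = pearson(niceUser, oppositeUser);
console.log(sameTaste, oppositeTaste);
```

The nice and tough users come out perfectly correlated even though their averages differ, which is the point of subtracting each user's own average first. A user who gives every movie the same rating has zero deviation, and the coefficient is then undefined.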

We may also use negatively correlated users to make recommendations. For example, if user_i and user_j are negatively correlated, then we can recommend the movies that user_j hates to user_i. However, this seems a bit risky, so we are not doing it here.

by Ricky Ho (rickyphyllis@gmail.com) at May 05, 2009 07:18 AM

Machine Learning: Probabilistic Model

The probabilistic model is a very popular approach to “model-based learning” based on Bayesian theory. Under this approach, all input attributes are binary and the output is categorical.

Here, we are given a set of data with structure [x1, x2 …, y] (in this case y is the output). The learning algorithm learns from the training set how to predict the output y for data it sees in the future.

We assume there exists a hidden probability distribution from the input attributes to the output. The goal is to learn this hidden distribution and apply it to the input attributes of a later-encountered query point to pick the class that has the maximum probability.

Making Prediction

Let's say the possible values of output y are {class_1, class_2, class_3}. Given input [x1, x2, x3, x4], we need to compute the probability of each output class_j, and predict the one which has the highest value.
max {P(class_j | observed_attributes)}

According to Bayes theorem, this value is equal to …
max { P(observed_attributes | class_j) * P(class_j) / P(observed_attributes) }

The denominator is the same for all class_j, so we can ignore it and just need to find
max { P(observed_attributes | class_j) * P(class_j) }

P(class_j) is easy to find; we just calculate
P(class_j) = (samples_of_class_j / total_samples)

Now, let's look at the other term, P(observed_attributes | class_j). By the chain rule of probability,
P(x1 ^ x2 ^ x3 ^ x4 | class_j) =
P(x1 | class_j) *
P(x2 | class_j ^ x1) *
P(x3 | class_j ^ x1 ^ x2) *
P(x4 | class_j ^ x1 ^ x2 ^ x3)

Learning the probability distribution

In order to provide all the above terms for the prediction, we need to build the probability distribution model by observing the training data set. Notice that finding the last term is difficult because, with m input attributes, there can be 2 ** (m – 1) possible situations to watch. It is very likely that we haven’t seen enough of these situations in the training data, in which case this term will be zero for all class_j.


Bayesian Network

A Bayesian Network is based on the fact that we know certain attributes are clearly independent. By applying this domain knowledge, we draw a dependency graph between attributes. The probability of occurrence of a particular node depends only on the occurrence of its parent nodes and nothing else. To be more precise, nodeA and nodeB (which are not related by a parent-child relationship) don't need to be completely independent; they just need to be independent given their parents.

In other words, we don't need P(A|B) = P(A),
we just need P(A | parentsOfA ^ B) = P(A | parentsOfA)
Therefore P(x4 | class_j ^ x1 ^ x2 ^ x3) = P(x4 | class_j ^ parentsOf_x4)

We only need to watch the 2 ** p occurrence combinations of the parent nodes, where p is the number of parent nodes.


Naive Bayes

Naive Bayes goes a step further by assuming that every attribute is independent of all the others given the class, so each term reduces to P(x_i | class_j).


Spam Filter Application

Let's walk through an application of the Naive Bayes approach. Here we want to classify a particular email to determine whether it is spam or not.

The possible values of output y are {spam, nonspam}

Given an email: "Hi, how are you doing ?", we need to find ...
max of P(spam | mail) and P(nonspam | mail), which is the same as ...
max of P(mail | spam) * P(spam) and P(mail | nonspam) * P(nonspam)

Let's focus on P(mail | spam) * P(spam)

We can view mail as an array of words [w1, w2, w3, w4, w5]
P(mail | spam) =
P(w1='hi' | spam) *
P(w2='how' | spam ^ w1='hi') *
P(w3='are' | spam ^ w1='hi' ^ w2='how') *
...
We make some naive assumptions here
  • Chance of occurrence is independent of preceding words. In other words, P(w2='how' | spam ^ w1='hi') is the same as P(w2='how' | spam)
  • Chance of occurrence is independent of word position. P(w2='how' | spam) is the same as (number of 'how' in spam mail) / (number of all words in spam mail)
With these assumptions ...
P(mail | spam) =
(hi_count_in_spam / total_words_in_spam) *
(how_count_in_spam / total_words_in_spam) *
(are_count_in_spam / total_words_in_spam) *
...
What if we haven't seen the word "how" in the training data? Then the probability becomes zero. Here we adjust the terms (add-one smoothing) so that a reasonable probability is assigned to unseen words.
P(mail | spam) =
((hi_count_in_spam + 1) / (total_words_in_spam + total_vocabulary)) *
((how_count_in_spam + 1) / (total_words_in_spam + total_vocabulary)) *
((are_count_in_spam + 1) / (total_words_in_spam + total_vocabulary)) *
...

What we need is the word count per word/class combination as well as the total word count per class. These can be obtained by feeding a large number of training sample mails labeled with "spam" or "nonspam" into a learning process.
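The whole spam filter can be sketched in a few lines. The function names, the regular-expression tokenizer and the tiny training set in the usage note are assumptions made for illustration. The sketch also sums logarithms instead of multiplying probabilities, which avoids numeric underflow on long mails and does not change which class wins.

```python
import re
from collections import Counter
from math import log

def tokenize(text):
    """Lower-case the mail and split it into words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_mails):
    """Collect the counts needed for prediction from (label, text) pairs."""
    word_counts = {}         # label -> Counter(word -> count in that class)
    mail_counts = Counter()  # label -> number of training mails
    for label, text in labeled_mails:
        mail_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, mail_counts

def classify(text, word_counts, mail_counts):
    """Return the label maximizing P(mail | label) * P(label), with add-one smoothing."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_mails = sum(mail_counts.values())
    best_label, best_score = None, None
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = log(mail_counts[label] / total_mails)  # log P(label)
        for word in tokenize(text):
            # (count + 1) / (total_words_in_class + total_vocabulary)
            score += log((counts[word] + 1) / (total_words + len(vocabulary)))
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, after `model = train([("spam", "buy cheap pills now"), ("nonspam", "hi how are you doing")])`, calling `classify("cheap pills", *model)` picks "spam".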

The learning process can also be done in parallel using two rounds of Map/Reduce.


Alternatively, we can also update the counts incrementally as new mail arrives.
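Both the parallel and the incremental variants work because the counts are simple sums that can be merged in any order. The sketch below is one possible way to split the work and is an assumption, since the exact two rounds are not spelled out here: plain Python functions stand in for a Map/Reduce framework, the first aggregation produces counts per (class, word) pair, and the second derives the word totals per class.

```python
from collections import Counter

def count_words(labeled_mails):
    """Map side: count (label, word) pairs in one partition of the training set."""
    counts = Counter()
    for label, text in labeled_mails:
        for word in text.lower().split():
            counts[(label, word)] += 1
    return counts

def merge(partials):
    """Reduce side: add up the partial counts produced by several partitions."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return total

def words_per_class(counts):
    """Second aggregation: total number of words seen for each class."""
    totals = Counter()
    for (label, _word), n in counts.items():
        totals[label] += n
    return totals
```

An incremental update is then just `counts.update(count_words([new_mail]))` whenever a newly labeled mail arrives.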

by Ricky Ho (rickyphyllis@gmail.com) at May 05, 2009 12:00 AM

May 04, 2009

Jan Lehnardt

CouchHack 2009 — Relaxing at Damien's

Wow, just, wow.

What

CouchHack is an open-for-all, multi-day CouchDB developer & user meetup. The first one took place in Asheville, NC at Damien’s. And boy it was fun, relaxing and productive all at the same time.

A total of six, including Damien, got together for four days of learning, hacking and goofing off.

Special thanks to HUDORA & Ben Young for sponsoring CouchHack.

Got Stuff Done

The concentration of developer knowledge and enthusiasm turned out to be a major boost for productivity (no surprise here).

CouchHack — Whiteboard

Batch PUT

CouchDB comes with a _bulk_docs API that lets you send batches of documents to be saved. _bulk_docs is significantly faster than writing single documents to CouchDB. There are situations, however, where collecting a set of documents prior to saving them is not desirable or possible. Take a distributed logging system that stores events on a central server.
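For reference, a bulk save is a single POST of a JSON object with a "docs" array to the database's _bulk_docs resource. The sketch below only builds such a request with the Python standard library and does not send it; the helper name and the database URL in the usage note are assumptions for illustration.

```python
import json
from urllib.request import Request

def bulk_docs_request(db_url, docs):
    """Build (but do not send) a POST to the _bulk_docs resource of a database."""
    body = json.dumps({"docs": docs}).encode("utf-8")
    return Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a CouchDB assumed to be listening on localhost, `bulk_docs_request("http://localhost:5984/logs", events)` yields a request that can be passed to `urllib.request.urlopen`.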

To speed up this use-case, we came up with the idea of adding a write buffer to single document PUT and POST requests, and Chris took the time to write the code. He went through different design strategies (single module, gen_server, ets tables) and finally settled on pure Erlang lists for the buffer. When saved with a special query parameter, the PUT or POST request gets buffered for one second before being flushed to disk. In the event of a power failure, everything written since the last one-second sync might get lost. Be sure to use this only for data that you can afford to lose. Community review & commit are pending.

CouchDBX

I spent some time on CouchDBX. The first version of CouchDBX was hand-crafted and it was non-trivial to update single or even all components.

CouchDBX bundles CouchDB (of course), Erlang & Spidermonkey — all components required to run CouchDB as well as a small Cocoa application that lets you launch CouchDB with a double-click.

One other dependency, ICU (International Components for Unicode), is not included. Including ICU would make the downloadable package weigh in at about 40 MB, where the current one is around 12 MB.

Luckily, ICU is bundled with Mac OS X. Unfortunately, Apple doesn’t ship ICU header files with the installation. WebKit does, though, and with them it is possible to build CouchDB and link it against the built-in ICU library.

Only my autotools-fu is too weak to get it set up with a stock CouchDB. I tried a few times in the past and always got stuck at some point. Eventually, while setting up the script to do the downloading, building & packaging of all components, I added a “post install phase” where I would re-link the necessary libs against the built-in ICU library. This worked well.

We now have couchdbx-core-builder.sh, a script that can build any combination of Erlang & CouchDB that is known to work. It builds and packages all components and then strips out everything that is not strictly needed for CouchDB’s operation. This includes a great many Erlang standard libraries.

Finally, I updated the CouchDBX application bundle (I used Geoffrey Grosenbach’s version with the embedded WebKit view for Futon) and already released two versions.

CouchHack Live

As part of the CouchDB Podcast, we recorded an episode with everybody at one table. This is the first Podcast that features Damien. Go check it out :)

Reduce Overflow

CouchDB’s Map/Reduce system is different from both the Google and Hadoop models. I’d say it is improved, but nevertheless, it is different. One thing newcomers commonly miss is that the reduce function has to actually reduce its input values. It cannot just collect them. The size of its output should grow no faster than log N, where N is the number of input values per key.
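The difference is easy to see by measuring the size of what a reduce function returns. CouchDB reduce functions are normally written in JavaScript and receive (keys, values, rereduce); the Python below is only an illustration of the size behavior, with made-up function names, not code that CouchDB would run.

```python
import json

def reduce_sum(values):
    """Well-behaved: returns a single number however many values come in."""
    return sum(values)

def reduce_collect(values):
    """Misbehaving: returns every input value, so the output grows with the input."""
    return list(values)

def output_size(reduce_fn, n):
    """Length of the JSON a reduce function returns for the inputs 1..n."""
    return len(json.dumps(reduce_fn(range(1, n + 1))))
```

For 1,000 inputs the summing reduce returns a handful of characters, while the collecting reduce returns several thousand, and the gap keeps widening as the view grows.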

Chris and Paul worked on a patch that warns the user when his or her reduce function returns too much data. Once committed, the warning will be on by default, but it can be overridden for cases where the user knows what he’s doing and has a scenario where he can get away with it. Usually though, a faulty reduce function will make view requests crawlingly slow. Community review & commit are pending.

Relax

But it wasn’t that we worked a hard 20 hours a day. We also got together to have a good time, play Wii and havoc heli. Damien’s got an AirZooka that we used to knock down random things in the house.

It was a perfect combination of geeking out and hanging out and we all can’t wait to do the next one :)

Future CouchHacks

We hope to make CouchHack a distributed event that takes place all over the world, wherever CouchDB hackers happen to be.

I hope to set up a CouchHack in Berlin soonish. I'll keep you updated :)

If you want to open your own CouchHack, feel free to add it to the wiki.

by Jan (jan@apache.org) at May 04, 2009 09:45 AM

May 02, 2009

Chris Anderson

Erlang Factory

Too tired to say much.

Here are the slides from today's CouchDB talks (remixed a bit for publication)

CouchDB Internals and CouchDB to the Edge - pdf

by jchris at May 02, 2009 08:07 AM

April 29, 2009

Damien Katz

Pr0n on the Couch

Yes, I saw the slides of the presentation CouchDB: Perform like a Pr0n Star. It's sparked a raging debate all over the internet that's taken on a life of its own. I'm getting some testy emails. So I think I should say something, lest some people interpret my silence as approval.

Now, I don't know what it was like to be there in attendance, but what I saw was not offensive to me (as if!). I thought it was kind of humorous. But how I felt is beside the point.

Was it sexist? Not in my opinion. Sexual themes aren't necessarily sexist, and I didn't see anything to support a notion of women being less suitable for development or other tech work. Reasonable people can differ here, but I don't see sexism in this talk.

But was the talk inappropriate? In my opinion, yes. I wouldn't give a talk like that, and I would discourage colleagues from giving talks like that at a developer conference.

Some people in attendance were, if not offended, at least made to feel uncomfortable. I can imagine there are conferences where it's just fine, even encouraged, to push the limits of polite behavior. Heck, sounds like a fun conference, can I go too?

But at a developer conference, that's not a good thing. Most everyone at conferences are strangers to one another, and developers aren't known to be the most outgoing people. Anything that makes people feel uncomfortable is going to shut down communication and openness. Even if those who are uncomfortable skipped the talk, others didn't and no doubt will be gabbing about it afterward. There can be no doubt that sexual themes absolutely will make some people feel uncomfortable and close them off. Not just women, some men also feel really uncomfortable about this stuff, but are less likely to admit it.

Now, I'm not against making people feel uncomfortable (sometimes I even like it). There are a lot of very important issues that people feel uneasy discussing, but that doesn't mean we shouldn't discuss them. I remember as a kid watching C. Everett Koop talk about condoms and intercourse during the AIDS/HIV crisis. Lots of people got wound up about that. But he absolutely had to; the cost of people's discomfort was nothing compared to the cost of their ignorance.

But in this case, the cost of people's discomfort was outweighed by nothing. It was simply unnecessary, and therefore inappropriate. Not wildly, dangerously, or seriously inappropriate. Just your garden variety inappropriateness.

Now, the reason I'm writing this is not to express my disapproval or outrage on this talk. Honestly I don't feel that strongly about this talk, and I'm a little surprised it's such a big issue. But there can be no doubt it is a big issue and lots of people are talking about it.

The real reason I'm writing this is I don't want to see a misinterpretation of CouchDB culture, or worse a trend of being so cool we don't need politeness and decorum. We shouldn't be that way, we are an inclusive community and CouchDB is a progressive technology, not a cultural movement. If we piss people off it should be because our technology is disruptive, not our community.

by Damien Katz at April 29, 2009 06:44 PM

April 24, 2009

Chris Anderson

"CouchDB to the Edge" at JSConf (with slides)

Chilling in the Track A room at JSConf - very relaxed feeling conference.

Jan and I just gave the talk "CouchDB to the Edge" about the p2p web.

There will be video of our talk, but for now the slides are here (pdf).

by jchris at April 24, 2009 06:22 PM

Damien Katz

CouchDB Coming to the Bay Area

CouchHack, a 3 day hacker-fest at my house here in Asheville, is now done and I really had a blast. It was great to meet everyone and just hang out, and somehow a lot of code got written (but not by me). Thanks to Chris Anderson, Brad Anderson, Paul Davis, Jan Lehnardt, and Benjamin Young. You were all excellent guests and are invited back when we do it again.

And thanks to my very understanding and supportive wife Laura. It was her idea to have it at our house, she and the kids visited Grandma so we could hack!

Next week I'll be in the San Francisco area from Tuesday, April 28 through May 2 to give a keynote at the Erlang Factory Conference. CouchDB contributors Chris Anderson, Paul Davis, Adam Kocoloski and Jan Lehnardt will be there too.

Last Minute Special: If you just want to learn about CouchDB, there is now a special price for the CouchDB conference track only.

Contact me if you want to meet about CouchDB or related projects and ventures while I'm in the area: damien_katz@yahoo.com

by Damien Katz at April 24, 2009 04:50 PM

April 23, 2009

Ricky Ho

Good and Bad Public Cloud candidates

I recently had a good conversation with Ed and Dean of RightScale about the characteristics that make an application a good public cloud citizen.

Good Public Cloud Candidates

Being stateless
  • There is no warm up and cool down periods required. Newly started instances are immediately ready to work
  • Work dispatching is very simple when any instances can do the work
Compute intensive with small dataset size
  • Cloud computing enables quick deployment of a lot of CPUs to work on your problem. But if a large amount of data has to be loaded before the CPUs can start their computation, latency and bandwidth cost will increase
Contains only non-sensitive data
  • Although cryptography technology is sufficient to protect your data, you don't want to pay the CPU overhead for encrypt/decrypt for every piece of data that you use in the cloud.
Highly fluctuating workload pattern
  • Now you don't need to provision inhouse equipment to cater for the peak load, which sits idle for most of the time.
  • The "pay as you go" model saves cost because you don't pay for machines when you are not using them
New application launch with unknown workload anticipation
  • You don't need to take the risk of over-estimating the popularity of your new application, and buy more equipment than you actually need.
  • You don't need to take the risk of under-estimating the popularity and end up frustrating your customers because you cannot handle their workload.
  • You can defer your big step investment and still be able to try out new ideas.


Bad Public Cloud Candidates

Demand special hardware
  • Most cloud equipment is cheap, general-purpose commodity hardware. If your application demands a particular piece of hardware (e.g. a graphics processor for rendering), it will either not work at all or be awfully slow.
Demand multi-cast communication facilities
  • Most of the cloud providers disable multicast traffic in their network
Need to reside in a particular geographic location
  • If there is legal or business requirement demanding your server to run in a physical location and that location is not covered by cloud providers, then you are out of luck.
Contain large dataset
  • Bandwidth cost across the cloud boundary is high, so you may end up with a large bill when loading a large amount of data into the cloud
  • Loading a large amount of data also takes time. You need to compare that with the overall time of the processing itself to see if it makes sense
Contain highly sensitive data
  • Legal, liability and auditing practices haven't caught up yet. Companies running their core business apps in the cloud will face a lot of legal challenges
Demand extremely low latency of user response
  • Since you have no control over where the machines reside, latency usually increases when you run your app in a cloud at a remote location
Run by 24 x 7 with extremely stable and non-fluctuating workload pattern
  • If you keep a machine running without ever shutting it down, many cost analysis reports show that running the machine in-house will be cheaper (especially for large enterprises that already have a data center set up and a team of system administrators)

Hybrid Cloud


I personally believe the real usage pattern of the public cloud for large enterprises is to move the fluctuating workload into the public cloud (e.g. Christmas sales for an e-commerce site, newly launched services) but retain most of the steady workload in-house. In fact, I think enterprises are going to move their applications in and out constantly based on changes in traffic patterns.

It is more appropriate to do the classification at the component level rather than at the App level. Instead of saying whether the App (as a whole) is suitable or not, we should determine which components of the app should run in the public cloud and which should stay in the data center.

In other words, an application runs in a hybrid cloud mix, which spans public and private clouds.

The ability to move your application “frictionless” across cloud boundaries and manage the scattered components in a holistic way is key. Once you have this freedom to move, then the price of a particular public cloud provider has less impact on you because you can easily move to any cheaper provider at any time.

by Ricky Ho (rickyphyllis@gmail.com) at April 23, 2009 04:01 PM