Planet CouchDB

February 09, 2010

Damien Katz

First week in the new office

Last week was our first week in our new office in Old Downtown Oakland. It's a really neat area with lots of restaurants and bars, and hardly any murders.

Oh yeah, we've changed our name to Couchio. Our new blog will be here http://blog.couch.io/ soon.

Our office:
office - 4.jpg

Our office manager Claire:
office - 6.jpg

Chris and Mikeal:
office - 1.jpg

Claire and Jan:
office - 2.jpg

Nitin:
office - 8.jpg

Super Awesome Art by Julie Armbruster:
office - 3.jpg

Me:
office - 7.jpg

My Office:
office - 5.jpg

So far we are really disorganized and discombobulated. But I'm are working on it! I even bought Management for Dummies. Things will be running smoothly in no time ;)

Also we are looking hard for someone to help us offer CouchDB support and hopefully build a whole support organization. Email me damien@couch.io if you are interested or know someone who is.

by Damien Katz at February 09, 2010 12:40 AM

February 06, 2010

Chris Strom

Exploring CouchDB-Lucene 0.5

‹prev | My Chain | next›

Thanks to a fix by the clairvoyant Robert Newson, I was able to get the latest couchdb-lucene built. Now to install it.

Per the README, the maven build process creates a .zip file that contains everything needed. But what to do with the contents on my development netbook? The contents are not exactly typical Linux layout:
cstrom@whitefall:~/tmp/couchdb-lucene-0.5-SNAPSHOT$ ls
bin conf lib LICENSE README.md tools
These directories look to be intended as self-contained—maybe in a /opt/couchdb-lucene or /usr/local/couchdb-lucene directory. For now, I will create a local directory in my home directory, and symlink it to couchdb-lucene (for future installs):
cstrom@whitefall:~$ mkdir local
cstrom@whitefall:~$ cd !$
cstrom@whitefall:~/local$ cp /home/cstrom/repos/couchdb-lucene/target/couchdb-lucene-0.5-SNAPSHOT-dist.zip .
cstrom@whitefall:~/local$ unzip couchdb-lucene-0.5-SNAPSHOT-dist.zip
cstrom@whitefall:~/local$ ln -s couchdb-lucene-0.5-SNAPSHOT couchdb-lucene
With that, I can add the necessary configuration to /etc/couchdb/local.ini:
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
;fti=/path/to/python /path/to/couchdb-lucene/tools/couchdb-external-hook.py
fti=/usr/bin/python /home/cstrom/local/couchdb-lucene/tools/couchdb-external-hook.py

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
Lastly, I need to start up the couchdb-lucene server (say, that's new!):
cstrom@whitefall:~/local/couchdb-lucene$ ./bin/run 
2010-02-04 21:33:45,309 INFO [Main] Index output goes to: /home/cstrom/local/couchdb-lucene-0.5-SNAPSHOT/indexes
2010-02-04 21:33:45,416 INFO [Main] Accepting connections with SelectChannelConnector@localhost:5985
So I have my couchdb-lucene server running, I restart my couchdb server (to pick up the local.ini changes. Now what?

To actually get indexing working, I need a CouchDB design document that contains a "fulltext" object. Since I am only doing exploratory code here, I name the design document "test" and define it as:
{
"_id": "_design/test",
"_rev": "1-93d99ffe0bddccd4b1aa521f3a569a50",
"fulltext": {
"by_title": {
"index": "function(rec) { var doc=new Document(); doc.add(rec.title); return doc; }\n"
}
}

}
The "fulltext" object can contain any number of index types. Here, I define only one: "by_title". The "by_title" object contains only an "index" function. It can contain other attributes (and I will need some of them eventually), but for exploratory purposes, all I need to do is define the "index" function.

The index function will operate on each record in the CouchDB database—that is what the rec input value represents. Inside the function, I create a new couchdb-lucene document object ("new Document()"), add the title from the CouchDB record ("rec.title") to the couchdb-lucene document, and finally return that couchdb-lucene document. If I wanted to index only recipes and exclude the meal in which they were served, I could have added a condition to return null if the rec's type was "Meal". For now, I index everything.

That should do it. I access this test resource with curl thusly:
cstrom@whitefall:~/repos/eee-code$ curl http://localhost:5984/eee/_fti/test/by_title?q=fish
{"q":"default:fish",
"etag":"357888d94e68","skip":0,"limit":25,"total_rows":20,"search_duration":1,"fetch_duration":4,
"rows":[{"id":"2004-10-24-fish","score":3.0870423316955566},
{"id":"2004-02-27","score":3.0870423316955566},
{"id":"2002-04-07","score":3.0870423316955566},
{"id":"2006-02-05-chili","score":3.0870423316955566},
{"id":"2002-02-15","score":2.4696338176727295},
{"id":"2002-02-18","score":2.4696338176727295},
{"id":"2004-03-31-fish","score":2.4696338176727295},
//...
]}
(I added the formatting for readability)

That seems to be working. To be sure I could check a sample of those document IDs in the CouchDB server to ensure that each has "fish" in the title. Instead, I will try out the field store couchdb-lucene feature, which can be used to store a field for return in the results. To use this, I only need to add {"store":"yes"} to the add method of the couchdb-lucene document:
function(rec) { var doc=new Document(); doc.add(rec.title, {"store":"yes"}); return doc; }
Then, when I request the same query I find:
cstrom@whitefall:~/repos/eee-code$ curl http://localhost:5984/eee/_fti/test/by_title?q=fish
{"q":"default:fish",
"etag":"35b5363947ea","skip":0,"limit":25,"total_rows":20,"search_duration":1,"fetch_duration":6,
"rows":[{"id":"2004-10-24-fish","score":3.0870423316955566,"fields":{"default":"Fish Ghosts"}},
{"id":"2004-02-27","score":3.0870423316955566,"fields":{"default":"Fish in a Packet"}},
{"id":"2002-04-07","score":3.0870423316955566,"fields":{"default":"Fish Curry"}},
{"id":"2006-02-05-chili","score":3.0870423316955566,"fields":{"default":"Fish Chili"}},
{"id":"2002-02-15","score":2.4696338176727295,"fields":{"default":"Fish Sticks and Leftovers"}},
{"id":"2002-02-18","score":2.4696338176727295,"fields":{"default":"Sesame Fish Sticks"}},
{"id":"2004-03-31-fish","score":2.4696338176727295,"fields":{"default":"Quick Thai Fish Curry"}},
//...
]}
Indeed, the titles are now returned with the results and yes, the query is indeed only returning those documents with "fish" in the title.

That is a good stopping point for tonight. Up tomorrow: using these and more spiffy couchdb-lucene features to get my various search Cucumber scenarios passing again.

Day #4

by eee.c (chris.eee@gmail.com) at February 06, 2010 05:01 AM

February 04, 2010

Chris Strom

Reduce Numbers, Not Sizes

‹prev | My Chain | next›

Picking up from last night I am currently unable to get even the simplest of once rock-solid Cucumber scenarios to pass:
cstrom@cstrom-laptop:~/repos/eee-code$ cucumber features/browse_meals.feature:7
Feature: Browse Meals

So that I can find meals made on special occasions
As a person interested in exploring meals and how they drive certain recipes
I want to browse meals by date

Scenario: Browsing a meal in a given year # features/browse_meals.feature:7
Given a "Even Fried, They Won't Eat It" meal enjoyed in 2009 # features/step_definitions/meal_details.rb:1
And a "Salad. Mmmm." meal enjoyed in 2008 # features/step_definitions/meal_details.rb:1
When I view the list of meals prepared in 2009 # features/step_definitions/meal_details.rb:44
HTTP status code 500 (RestClient::RequestFailed)
/usr/lib/ruby/1.8/net/http.rb:543:in `start'
./features/support/../../eee.rb:114:in `GET (?-mix:\/meals\/(\d+))'
(eval):2:in `visit'
./features/step_definitions/meal_details.rb:45:in `/^I view the list of meals prepared in 2009$/'

features/browse_meals.feature:11:in `When I view the list of meals prepared in 2009'
Then the "Even Fried, They Won't Eat It" meal should be included in the list # features/step_definitions/meal_details.rb:75
And the "Salad. Mmmm." meal should not be included in the list # features/step_definitions/meal_details.rb:79
When I follow the link to the list of meals in 2008 # features/step_definitions/meal_details.rb:59
Then the "Even Fried, They Won't Eat It" meal should not be included in the list # features/step_definitions/meal_details.rb:79
And the "Salad. Mmmm." meal should be included in the list # features/step_definitions/meal_details.rb:75

Failing Scenarios:
cucumber features/browse_meals.feature:7 # Scenario: Browsing a meal in a given year

1 scenario (1 failed)
8 steps (1 failed, 5 skipped, 2 passed)
0m0.907s
Alas! This is my laziness (not the good kind) catching up with me. When I was learning CouchDB, I failed to heed an important CouchDB idiom: reduce to a single value when map-reducing. In the heady days of CouchDB 0.9, one could get away with reducing to smaller lists of values. Those same map-reduces now result in wonderful Erlang errors like:
[Wed, 03 Feb 2010 03:00:07 GMT] [error] [<0.4416.0>] {error_report,<0.24.0>,
{<0.4416.0>,crash_report,
[[{initial_call,{couch_view_group,init,['Argument__1']}},
{pid,<0.4416.0>},
{registered_name,[]},
{error_info,
{exit,
{reduce_overflow_error,
<<"Reduce output must shrink more rapidly: Current output: '[[{\"_id\": \"2008-05-13\",\"_rev\": \"1-3c7fe582a51492dc6a756e0
721373165\",\"type\": \"Meal\",\"title\": \"Salad. '... (first 100 of 476 bytes)">>},
[{gen_server,terminate,6},{proc_lib,init_p_do_apply,3}]}},
{ancestors,
[couch_view,couch_secondary_services,couch_server_sup,<0.1.0>]},
{messages,[]},
{links,[<0.4418.0>,<0.61.0>]},
{dictionary,[]},
{trap_exit,true},
{status,running},
{heap_size,1597},
{stack_size,24},
{reductions,675}],
...
What this is telling me is that the reduce in map-reduce does not refer to reducing the size of values, rather it refers to the total number of values (which should be very close to one). The "Current output" in the backtrace is a smaller version of each meal document (just the title and date of each). The output is smaller, but not reduced.

At least two of my "reduce" functions were plain silly and need to be deleted. I have a by-year map, which I then reduce with this nonsensical function:
function(keys, values, rereduce) { return values; }
There is no reduce going on there at all! I added that before I realized that you can select ranges in a map view (non-reduced) thusly:
  url = "#{@@db}/_design/meals/_view/by_date?" +
"startkey=%22#{year}-00-00%22&" +
"endkey=%22#{year}-99-99%22"
data = RestClient.get url
Removing two silly reduces gets me past Erlang backtraces and on to the following:
 Scenario: Browsing a meal in a given year                                          # features/browse_meals.feature:7
Given a "Even Fried, They Won't Eat It" meal enjoyed in 2009 # features/step_definitions/meal_details.rb:1
And a "Salad. Mmmm." meal enjoyed in 2008 # features/step_definitions/meal_details.rb:1
When I view the list of meals prepared in 2009 # features/step_definitions/meal_details.rb:44
undefined local variable or method `response' for #<MyWorld:0x..fdb7afd40> (NameError)
./features/step_definitions/meal_details.rb:46:in `/^I view the list of meals prepared in 2009$/'
features/browse_meals.feature:11:in `When I view the list of meals prepared in 2009'
Ugh. This is due to my shift to rack test last night. The response method is no longer available—I have to use one of response_body or response_code.

After changing that in nearly ever Cucumber step definition that I have, I have all of my scenarios passing except for searching (I have not installed couchdb-lucene yet) and a by-ingredients index page that relies on a problematic map-reduce. I will pick up tomorrow addressing those.

Day #2

by eee.c (chris.eee@gmail.com) at February 04, 2010 10:29 AM

January 19, 2010

Paul Joseph Davis

erlang_js - Awesome

erlang_js - Awesome

erlang_js

If you haven't heard, the Basho team released erlang_js today. Its a linked in driver that provides a Spidermonkey JavaScript context to run JS code for Erlang. This is interesting to me because it avoids the stdio overhead incurred by the current Map/Reduce system that CouchDB uses. So I did what any bored hacker would do: threw erlang_js into the CouchDB build system and hacked the view generation code to use the in-VM contexts.

Numbers

These times are for the "mega view" reported in seconds from raindrop-perf.py found here.

Run Trunk erlang_js
1 13.63 6.89
2 11.16 6.94
3 11.82 6.80

Code

Look here.

Next Up

The communication between Erlang and JS is unnecessarily converting Erlang -> JSON -> Spidermonkey Objects. I've written the code to go from the external Erlang representation to Spidermonkey objects directly so I plan on integrating that in the next couple days to see how these numbers change.

Code Might be Nice

Just thought that maybe people would be interested in the code that's used to talk to erlang_js. Its pretty straight forward, though not very elegant on my side.

% From couch_query_servers.erl
start_doc_map(_Lang, Functions) ->
    {ok, Port} = js_driver:new(),
    ok = js_driver:define_js(
        Port, <<"map_support.js">>, map_support(), 5000
    ),
    lists:foreach(fun(FuncSource) ->
        Source = <<"map_funs.push(", FuncSource/binary, ");">>,
        ok = js_driver:define_js(Port, Source)
    end, Functions),
    {ok, Port}.

map_docs(Port, Docs) ->
    Results = lists:map(
        fun(Doc) ->
            Json = couch_doc:to_json_obj(Doc, []),
            {ok, Results} = js:call(Port, <<"map_doc">>, [Json]),
            lists:map(
                fun(FunRs) ->
                    [list_to_tuple(FunResult) || FunResult <- FunRs]
                end,
            Results)
        end,
        Docs),
    {ok, Results}.

stop_doc_map(nil) ->
    ok;
stop_doc_map(Port) ->
    js_driver:destroy(Port).
% EOF

And map_support.js I wrote to avoid having to think to hard on the Erlang side of things:

// src/couchdb/priv/map_support.js
var map_funs = [];
var results = [];

var emit = function(key, value) {
  results.push([key, value]);
};

var map_doc = function(doc) {
  var ret = [];
  map_funs.forEach(function(func) {
    results = [];
    func(doc);
    ret.push(results);
  });
  return ret;
};
// EOF

Shoutout

Go, Basho, Go!

January 19, 2010 05:00 AM

January 18, 2010

Paul Joseph Davis

Fighting Hype with Hype - RDBMS FTW!!!1!

Fighting Hype with Hype - RDBMS FTW!!!1!

Yay Google Alerts

I woke up this morning to an amusing blog post by Ryan Park in my Google alerts. I tend to read alot of the "RDBMS's are awesome! No, NOSQL is moar awesome!" with great bemusement. Even though it's a year and a half old it was amusing enough to motivate me to write out some of the thoughts I had while reading it.

1. Data integrity is not guaranteed.

Ryan spends a couple paragraphs talking about how horrible it is that non-RDBMS systems don't provide data constraints. In general he's pretty spot on here. One of the first things that tends to get left out of non-RDBMS systems is constraint enforcement.

There are two and a half points I'd like to make. First, constraints are usually the first to go because they're costly. Costly to implement and costly at runtime. Especially when the system is being designed with the ability to run on multiple machines.

Secondly, there are plenty of people that don't use constraints. Ryan falls pretty squarely into the "RDBMS's work for me, so they should work for you too" camp. He appears to know his stuff but what people like Ryan forget is that there are a lot of people that don't. They use an RDBMS because that's what the internet says to do. And then they plop an ORM on top of it and never actually use any of the RDBMS features that are so lauded. As it turns out, lots of these developers are super happy using a database that doesn't provide constraint enforcement.

The last half a point I'll make later as it was suggested in a comment on Ryan's post and applies later on in the conversation.

2. Inconsistency will provide a terrible user experience.

Even reading the bullet point on this one and I knew I was in for some fun. I mean seriously, if that's not proof by assertion then I don't know what is.

Ryan is quite right that developers need to make some things appear consistent so as to not confuse users. I can't speak to the specifics of SimpleDB, but obviously someone's using it successfully so I'll assume that its possible.

The two things I'd point out though is that consistency is not limited to those crazy non-RDBMS people. Even in a traditional three-tier web architecture, there's the issue with sessions. Basically the issue is that a client needs to be repeatedly routed to the same application server handling their session.

The other thing I'll point out is an interesting Facebook blog post I read a couple months ago. Its an interesting look at how Facebook added a second datacenter on the east coast. A datacenter based on MySQL no less. I'll draw your attention to the "Cache Consistency" section. And I'll I'm going to point out is that their solution required modifying MySQL's query parser. Seriously.

3. Aggregate operations will require more coding.

For the bullet point, yes, that's more or less true. The argument in support of this is pretty much non-existent. If Ryan really wanted to make an argument about aggregates, the best thing would be to go on about how a non-RDBMS requires you to know what type of aggregates you'll want up front and then do insert time calculations for these values. While that will work just fine, it makes ad-hoc queries harder. The ad-hoc issue is the next bullet point, but for some reason the connection wasn't made.

4. Complicated reports, and ad hoc queries, will require a lot more coding.

This was one of my favorite bullet points in the whole article. And by favorite I mean that it produced the most WTF's per word.

Firstly, Ryan points out that there are three general work loads for databases. (1) General queries that are used by the application, (2) More complicated queries run by staff for reporting, (3) ad-hoc queries for debugging. I would pretty much agree with him there. But then he goes on to make the assertion that points 2 and 3 are better served by SQL.

The entire second paragraph is some sort of weird twisted logic to bolster the argument that SQL makes reporting super easy. My favorite part of the whole thing is the quote right in the middle:

In my previous jobs, our reports often required hundreds 
of lines of SQL to get the right information out of the
database. This is a lot of code, but it was required to
generate the data for our customers.

As far as I can tell, the argument is that SQL makes complex reports easy even though it still might take hundreds of lines to get the data required. And the other thing that's not mentioned, these reports can still take a substantial amount of time to generate. But obviously this is still better than the non-RDBMS systems where they don't even have SQL! Because obviously its impossible that an any imperative language could be as good as SQL... because... because... well I never figured that part out either.

5. Aggregate operations will be much slower if you don't use an RDBMS.

Yeah. This one is special.

RDBMSes are highly optimized for performing aggregate
operations across huge volumes of data. Fast algorithms
like the hash join, merge join, and indexed binary search
have been around for 20 years or more.

Ok. Breathe. I'm assuming that he forgot that joins are not aggregates. And a binary search, well, shit... I guess the non-RDBMS people really are screwed. And the second paragraph talks about how the client is going to need to scan the entire database thus incurring the huge network transfer to even try and compute an aggregate.

I'll just point out that there are non-RDBMS systems that provide aggregate functionality and anything that uses a b+tree probably uses binary search. Remember people, just because you can't think of a different solution to your problem doesn't mean it can't exist.

6. Data import, export, and backup will be slow and difficult.

Now this is just FUD. Sorry, but there's no better way to say it. If you've ever had to fit some random piece of data into your existing relational schema you'll probably agree that this is crap. Munging random data is hard. And if its not random data then its not really that important. And getting data out? Perhaps Ryan was being satirical?

7. SimpleDB isn't that fast.

Jan Lehnardt has a couple thought provoking arguments and Volker Mische provides some interesting fodder as well. Basically, to say something isn't fast requires you to define what fast is and, generally, no two people will ever agree on the same definition.

That said, Ryan does make an allusion to this situation when he mentions that SimpleDB probably needs a larger DB to be measured on. And he also points out that lots of databases probably fit into RAM.

There's an interesting article by one of the 37signals guys about buying more RAM instead of sharding. While definitely a valid approach, not everyone can go out and buy a single machine with 32 GiB of RAM (though obviously that's getting closer). Though I now curiously wonder what type of disks they have to keep up with the write load they might have.

8. Relational databases are scalable, even with massive data sets.

I don't have a better response than the commenter jackson on the original blog post. Once an RDBMS is scaled to multiple machines, lots of the benefits are nullified and you're dealing with the same issues that the non-RDBMS folks are.

9. Super-scalability is overrated. Slowing the pace of your product development is even worse.

There is definitely a lot of noise in the echo chamber about scalability. Developers like to talk about needing hundreds of nodes to support their work load because that's just cool. But in reality, the issue isn't adding the hundredth node to a system, its adding the second. Regardless of the database being used, if that second node isn't planned for it'll be painful. Non-RDBMS systems generally reduce that pain point by discouraging designs that exacerbate the problems when adding a second node.

10. SimpleDB is useful, but only in certain contexts.

I'll file this under the "No shit?" category. There are plenty of places that an RDBMS might be a better fit than any given non-RDBMS. And vice versa. The underlying issue that people seem to miss is being able to describe situations where one might be better than the other.

The bottom line to this whole "My database is better than your database!" argument is that "You're both right, so STFU!" Eventually people will calm down and start to realize that there are multiple solutions and the right one will depend as much on the problem domain as the developer coding the solution. A better use of time would be finding personal projects and drawing up the arguments for and against the coded solution so that others might learn from past experience.

January 18, 2010 05:00 AM

January 16, 2010

Ricky Ho

RoadMap to SaaS

I have observed a pattern from multiple enterprises moving from a traditional web app to a SaaS. Trying to capture this pattern and a number of lessons learned. I use a J2EE Web App architecture to illustrate but the same principles apply to other technology platforms.

Stage 1: Some working Web App

At the very beginning, we have an web application that works well. We analyze the function of the web application and group the implementation classes accordingly


Stage 2: Separate functionality across processes

We analyze the functions and partition them into different processes (JVM). The partition needs to be coarse-grain and each process will communicate with each other via a service interface exposed by a Facade class. The service interface can be any remote object invocation interface or XML over HTTP. Restful web service is the de-facto way for the service interface.


Stage 3: Move different processes to different machines

To scale out beyond a single server's capacity, we move the process to a separate machine. Notice that the machine can be a physical machine, or a virtual machine running on top of hypervisor.



Stage 4: Build service pools

If the service itself is stateless, we can easily scale out the service capacity by putting multiple machines (running the same service) into a server pool. A network load balancer will be used to spread the workload evenly to the member servers.

When the workload increase, more machines can be added to the pool to boost up the overall capacity of the service. Elastic scalability provided by Cloud computing provider make growing and shrinking the pool even more rapid, and can hence dynamic workload more effectively.



Stage 5: Scale the data tier by partitioning

After we scale out the processing tier, we found the data tier becomes the bottleneck. So we also need to distribute the data access workload by partition the database according to the data key. Here are some classical techniques how we can build a distributed database.


Stage 6: Add Cache to reduce server load

If the application is has a high read/write ratio, or has some tolerance of data staleness, we can add a cache layer to reduce the hit of the actual services. Clients will check the cache before sending the request to the service.

We need to make sure the cached items to remain fresh. There are various schemes to achieve this. e.g. cached items can be expired after certain timeout, or an explicit invalidation request can be make to specific cached items when the corresponding backend data has been changed.

We can use local cache (reside in the same machine as the client) or a distributed cache engine such as Memcached or Oracle Coherence.



Stage 7: Consider which service to expose to public

At this point, we want to expose some of the services to the public either because this can bring revenue to our company or can facilitate a better B2B integration with our business partners. There are a lot of considerations to decide what to be expose, such as security consideration, scalability, service level agreement, utilization tracking ... etc.


Stage 8: Deploy an ingress service gateway

Once we decide what services to be exposed, we decide to use a specialized ingress service gateway to handle the concern that we outline above. Most of the XML service gateway is equipped with message validation, security checking, message transformation, routing logic.



Stage 9: Deploy an egress service gateway

We not only providing service to the public, but may also consume other public services. In this case we deploy an egress service gateway which can help to lookup the service provider endpoints, extract the service policy of the public service providers ... etc.


One important function of the egress service gateway is to manage my dependencies to external service providers. It typically keeps a list of equivalent service providers together with their availability and response time, and routing my request to the one according to my selection criteria (e.g. cheapest, most-reliable, lowest-latency ... etc.)



Stage 10: Implement service version control

My service will evolve after exposing to the public. In the ideal case, only the service implementation change but not the service interface so there is no need to change the client code.

But mostly likely it is not that ideal. There are chances that I need to change the service interface or the message format. In this case, I may need to run multiple versions of the services simultaneously to make sure I am not breaking an existing clients. This means I need the ingress service gateway to be intelligent about routing the client request to the right version of the service implementation.


A typical way is to maintain a matrix of version and keep track of their compatibilities. For example, we can use a release convention such that all minor releases is required to be backward compatible but major release is not required for that. Having this compatibility matrix information, the ingress gateway can determine the client version from its incoming request and route it to the server which has the latest compatible version.


Stage 11: Outsource infrastructure to public Cloud provider

Purchase necessary hardware equipment and maintaining them can be very costly, especially when there are idle time in the usage of computing resources. Idle time is usually unavoidable because we need to budget the resource at the peak workload scenarios so there are idle time at non-peak hours.

For a more efficient use of computing resources, we can consider some public cloud provider such as Amazon AWS or Microsoft Azure.

But it is important that running an application in the cloud may need to redesign the application to cope with some unique characteristics of the cloud environment, such as how to deal with high network latency and bandwidth cost, as well as how to design application to live with an eventually consistent DB.

by Ricky Ho (rickyphyllis@gmail.com) at January 16, 2010 06:04 AM

January 14, 2010

Ricky Ho

Notes on Oracle Coherence

Oracle Coherence is a distributed cache that functionally comparable to Memcached. On top of the basic cache API function, it has some additional capabilities that is attractive for building large scale enterprise applications.

The API is based on the Java Map (Hashtable) Interface, which provides a key/value store semantics where the value can be any Java Serializable object. Coherence allows data stored in multiple caches identified by a unique name (which they called a "named cache").

Below code examples are extracted from the great presentation from Brian Oliver of Oracle

The common usage pattern is to first locate a cache by its name, and then act on the cache.

Basic cache function (Map, JCache)
  • Get data by key
  • Update data by key
  • Remove data by key
NamedCache nc = CacheFactory.getCache("mine");

Object previous = nc.put("key", "hello world");

Object current = nc.get("key");

int size = nc.size();

Object value = nc.remove("key");

Set keys = nc.keySet();

Set entries = nc.entrySet();

boolean exists = nc.containsKey("key");

Cache Modification Event Listener (ObservableMap)

You can register an event listener on a cache to catch certain change events happen within the cache.
  • New cache item is inserted
  • Existing cache item is deleted
  • Existing cache item is updated
NamedCache nc = CacheFactory.getCache("stocks");

nc.addMapListener(new MapListener() {
public void onInsert(MapEvent mapEvent) {
...
}

public void onUpdate(MapEvent mapEvent) {
...
}

public void onDelete(MapEvent mapEvent) {
...
}
});


View of Filtered Cache (QueryMap)

You can also define a "view" by providing a "filter" which is basically a boolean function. Only items that is evaluated to be true by this function will be visible in this view.

NamedCache nc = CacheFactory.getCache("people");

Set keys =
nc.keySet(new LikeFilter("getLastName", "%Stone%"));

Set entries =
nc.entrySet(new EqualsFilter("getAge", 35));


Continuous Query Support (ContinuousQueryCache)

The view can also be used as a "continuous query". All new coming data that fulfilled the filter criteria will be included automatically in the view.

NamedCache nc = CacheFactory.getCache("stocks");

NamedCache expensiveItems =
new ContinuousQueryCache(nc,
new GreaterThan("getPrice", 1000));


Parallel Query Support (InvocableMap)

We can also spread a query execution and partial aggregation across all nodes and have them execute in parallel, followed by the final aggregation.
NamedCache nc = CacheFactory.getCache("stocks");

Double total =
(Double)nc.aggregate(AlwaysFilter.INSTANCE,
new DoubleSum("getQuantity"));

Set symbols =
(Set)nc.aggregate(new EqualsFilter("getOwner", "Larry"),
new DistinctValue("getSymbol"));


Parallel Execution Processing Support (InvocableMap)

We can ship a piece of processing logic to all nodes which will execute the processing in parallel
NamedCache nc = CacheFactory.getCache("stocks");

nc.invokeAll(new EqualsFilter("getSymbol", "ORCL"),
new StockSplitProcessor());

class StockSplitProcessor extends AbstractProcessor {
Object process(Entry entry) {
Stock stock = (Stock)entry.getValue();
stock.quantity *= 2;
entry.setValue(stock);
return null;
}
}


Implementation Architecture

Oracle Coherence runs on a cluster of identical server machines connected via a network. Within each server, there are multiple layers of software provide a unified data storage and processing abstraction over a distributed environment.


Smart Data Proxy

Application typically runs within a node of the cluster as well. The cache interface is implemented by a set of smart data proxy which knows the location of master (primary) and slave (backup) copy of data based on its key.

Read through with 2 level cache

When the client "read" data from the proxy, it first try to find the data in a local cache (also called the "near cache" within the same machine). If it is not found, the smart proxy will then locate the distributed cache for the corresponding copy (also called the L2 cache). Since this is a read, either a master or a slave copy is fine. If the smart proxy wouldn't find data from the distributed cache, it will lookup data from the backend DB. The return data will then propagate back to the client and the cache will be populated.

Master/Slave data partitioning

Updating data (insert, update, delete) is done in the reverse way. Under the master/slave architecture, all updates will go to the corresponding master node that owns that piece of data. Coherence support two modes of update; "Write through" and "Write behind". "Write through" will update the DB backend immediately after updating the master copy, but before updating the slave copy, and therefore keep the DB always up to date. "Write behind" will update the slave copy and then the DB in an asynchronous fashion. Data lost is possible in "write behind" mode, which has a higher throughput because multiple write can be merge in a single write, resulting in a fewer number of writes.

Moving processing logic towards data

While extracting data from the cache to the application is a typical way of processing data, it is not very scalable when large volume of data is required to be processed. Instead of shipping the data to the processing logic, a much more efficient way is to ship the processing logic to where the data is residing. This is exactly why Oracle Coherence provide an invocableMap interface where the client can provide a "processor" class that get shipped to every node where processing can be conducted with local data. Moving code towards the data dstributed across many nodes also enable parallel processing because now every node can conduct local processing in parallel.

The processor logic is shipped into the processing queue of the execution node, which has an active processor dequeue the processor object and execute it. Notice that this execution is performed in a serial manner, in other words, the processor will completely finished a processing job before proceeding to the next job. There is no worry about multi-threading issue and no need to use locks, and therefore no dead lock issue.

by Ricky Ho (rickyphyllis@gmail.com) at January 14, 2010 12:32 AM

January 12, 2010

Mark Headd

NoSQL Telephony with Tropo and CouchDB

In the last two posts, I’ve provided a basic overview of how to create cloud telephony applications using the Tropo platform and CouchDB.

Apache CouchDB Logo

In the first post of this series, I walked through a quick install of CouchDB and provided information on getting a Tropo account set up. In the second post, we created a simple auto attendant Tropo script in PHP that populates a CouchDB database with a call record for each inbound call that is transferred.

I’ll conclude the series with information on how to retrieve information from a CouchDB instance for use in a cloud telephony application, and talk about design documents. This post will also introduce the reader to the concepts of CouchDB Views and Show Functions - powerful tools that can be harnessed to create truly cutting edge cloud phone apps.

First, let’s create a CouchDB database to hold our call settings.

Creating a Call Settings Database

As mentioned in the previous CouchDB posts, you can create a new call settings database using curl from the command line, or using the Futon GUI.

$ curl -X PUT http://your_new_couchdb_ip:5984/call_settings

You should see a response from CouchDB like this:

{”ok”:true}

You can add a record to the call settings database the same way. This time, however, we’ll append the URL for our CouchDB database with a document ID, in this case ‘1000′ - this is the extension that a caller to our cloud telephony app will dial. We’ll use the document ID and and the CouchDB REST API to get all of the settings we’ll need to conduct the transfer - these settings can be seen in the document structure below (feel free to add others to meet your needs or preferences).

$ curl -X PUT http://your_new_couchdb_ip:5984/call_settings/1000 -d ‘{”first_name”:”Joe”,”last_name”:”Blow”,”phone”:”17777777777″,”title”:”Master of Disaster”,”ring_tone”:”audio/ring.wav”}’

You should see a response from CouchDB like this:

{”ok”:true,”id”:”1000″,”rev”:”1-0cf5a7c3a70ac5760f1a5d8dcb8b48d2″}

Let’s add a few more documents to our call settings database (replacing the telephone numbers below with real ones that you want callers to transfer to) and then view all of the documents that we have created.

$ curl -X PUT http://your_new_couchdb_ip:5984/call_settings/2000 -d ‘{”first_name”:”Harry”,”last_name”:”Smith”,”phone”:”18888888888″,”title”:”President of the World”,”ring_tone”:”audio/ring.wav”}’

$ curl -X PUT http://your_new_couchdb_ip:5984/call_settings/3000 -d ‘{”first_name”:”Martin”,”last_name”:”Scorsese”,”phone”:”19999999999″,”title”:”The Departed”,”ring_tone”:”audio/ring.wav”}’

You can view all of the documents in a CouchDB database using the HTTP GET method:

$ curl -X GET http://your_new_couchdb_ip:5984/call_settings/_all_docs

You should see a response from CouchDB like this:

{”total_rows”:3,”offset”:0,”rows”:[
{"id":"1000","key":"1000","value":{"rev":"1-0cf5a7c3a70ac5760f1a5d8dcb8b48d2"}},
{"id":"2000","key":"2000","value":{"rev":"1-ee2f09516df8b191a89791b01828d788"}},
{"id":"3000","key":"3000","value":{"rev":"1-a1399e8218ae75e1efb73ba3f87862ff"}}
]}

Now we need to modify our Tropo PHP script to retrieve the settings we want to use with each transferred call.

Note, for now we’ll keep the logic simple - if a caller enters an extension that does not exist we’ll get a specific HTTP response back from CouchDB - something in the 400 class of responses. If this happens, we’ll just end the call - in the real world you’d want to do something a little more friendly, but you can sort that out when you build your own cloud telephony application. ;-)

Modifying the Tropo Script

So, our new Tropo script looks like this:

Note that the getPhoneNumberByExtension() method no longer returns a hard coded phone number - it is using the 4-digit extension entered by the caller to access our CouchDB database using the REST API. The response from CouchDB is a document in JSON format, that we can easily parse using PHP’s handy json_decode() function.

I’ve also modified the value of the $callLog variable to correctly capture some of the variables exposed in the Tropo environment (i.e., the session ID of the call, and the caller ID - see this thread for more information).

So now we have a working cloud telephony application built on Tropo that uses CouchDB to get its call settings, and also to write a call record for billing, reconciliation, etc.

As cool as this is, there is still a lot more we can do with CouchDB in our cloud telephony apps. Note the constants declared at the top of the Tropo script - the last two are blank; one for a design document name, and one for a show function.

define(”COUCH_DB_DESIGN_DOCUMENT_NAME”, “”);
define(”COUCH_DB_SHOW_FUNCTION_NAME”, “”);

Let’s talk about those concepts now, and explore how they could be used in a cloud telephony application.

Getting more out of CouchDB - Design Documents, Map/Reduce and Show Functions

As the title of this post suggests, we’re building cloud-based phone applications without SQL. CouchDB doesn’t use SQL - instead it uses a Map/Recuce framework to index documents in a database.

Map functions can be used to emit a key-value listing of documents in a CouchDB database. Reduce functions are used to aggregate the key-value pairs emitted by a Map function. Map/Reduce functions (or Views) live inside of a special document in a CouchDB database called a “design document“, which has a document ID prefixed with “_design/”.

For example, suppose we have a special design document in our database called “_design/extensions” with a View called “getExtensions” - our View is made up of a Map function and (optionally) a Reduce function. Let’s assume our View has only a Map function to return data on extensions with valid phone numbers to transfer a caller to.

function(doc) {
  if(doc.phone.length == 11 && doc.phone.substr(0,1) == ‘1′) {
    emit(doc._id, doc.phone);
  }
}

Our Map function (which is written in JavaScript, and stored in our design document) has one parameter - doc. This function is called for each document in our database, and the doc parameter represents the document itself. As can be seen, we simply examine each document in the database to see if it has a valid phone number (11 digits, starting with 1).

Views are accessed using a specific URI structure (do note, however, that the REST API for querying Views can change significantly between CouchDB versions), and the response is a set of key-value pairs formatted as JSON.

http://{server_ip_address}:5984/{database_name}/_design/{design_document_id}/_view/{view_name}

$ curl -X GET http://your_new_couchdb_ip:5984/call_settings/_design/extensions/_view/getExtensions

You should see a response from CouchDB like this:

{”total_rows”:3,”offset”:0,”rows”:[
{"id":"1000","key":"1000","value":"17777777777"},
{"id":"2000","key":"2000","value":"18888888888"},
{"id":"3000","key":"3000","value":"19999999999"}
]}

You can check to see if your Map function is working properly by adding a document with an invalid phone number.

$ curl -X PUT http://your_new_couchdb_ip:5984/call_settings/4000 -d ‘{”first_name”:”Richard”,”last_name”:”Kimble”,”phone”:”4444444″,”title”:”The Fugitive”,”ring_tone”:”audio/ring.wav”}’

Accessing the getExtensions view will return the same results as before, as the phone number for the new document does not pass validation. Using design documents and Views, cloud telephony developers can use CouchDB to build grammars for user input which will significantly enhance the usability of the sample application we’ve used during the last few posts.

But there is even more potential with another piece of functionality in CouchDB - show functions. Show function also live in design documents, alongside Views. Show functions allow a developer to return specifically formatted content from a CouchDB instance, not just data in JSON format.

A basic show function that can be used to return information from our CouchDB database in the format of a SRGS grammar might look like this.

function(doc, req) {
 var grammer = ‘<?xml version=\”1.0\”?><grammar xmlns=\”http://www.w3.org/2001/06/grammar\”>’;
 grammar += ‘<rule id=\”R_1\”><one-of>’;
 grammar += ‘<item>’ + doc.phone + ‘<\item>’;
 grammar += ‘</one-of></rule></grammar>’;
 return grammar;
}

Like Views, Show Functions are accessed using a specific URI structure.

http://{server_ip_address}:5984/{database_name}/_design/{design_document_id}/_show/{show_function_name}/{document_id}

Note that the Show function above is different from the Map function discussed earlier in that it takes two parameters - doc and req. As before, the doc parameter represents the document the function is called against. The req parameter represents a parameter that is sent in with the HTTP request, which can be used inside the function to render output. So a Show function canbe accessed using the above URL with an optional parameter as well, like so.

http://{server_ip_address}:5984/{database_name}/_design/{design_document_id}/_show/{show_function_name}/{document_id}?{parameter}={some_value}

Conclusion

I hope this series of posts has provided a helpful overview of CouchDB, with an emphasis on how it can be used to build cloud telephony applications.

Cloud telephony platforms like Tropo, CloudVox, CallFire and others provide enormous flexibility to developers in building and deploying sophisticated cloud telephony applications.

Pair these tools with CouchDB and you’ve got a powerful combination for building full featured, easy to maintain cloud-based phone apps.

a

by Mark Headd at January 12, 2010 10:51 PM

January 06, 2010

Mark Headd

Relaxing on the Couch with Tropo and CouchDB

This is the continuation of a series that will describe how to build voice applications with the Tropo cloud telephony platform and CouchDB.
Apache CouchDB Logo
In the last post, I detailed how to get a CouchDB instance up and running on Ubuntu, and how to get an account started on Tropo so that you can start building cloud telephony applications. In this post, we’ll create our first CouchDB database and create a simple Tropo application that connects to our CouchDB instance. First, however, we need to tweak the default settings for CouchDB so that we can access our CouchDB instance from the an external environment.

Configuring CouchDB

Recall from the last post that the configuration files for CouchDB are located in /usr/local/etc/couchdb/. Open the local configuration file and take a look at the default settings:

$ sudo vim /usr/local/etc/couchdb/local.ini

In the [httpd] section, you’ll notice the setting for the default port that is used to connect to CouchDB - 5984. You’ll also note the bind_address setting. By default, CouchDB listens only on localhost – you can change this by altering the value of bind_address to a publicly resolvable IP address (you may need to uncomment this setting as well).

However, before proceeding please note that CouchDB does not yet have a built in security model, so anyone that can access the IP address in the configuration file can potentially access your CouchDB instance. We’ll need to take some steps to restrict access to our CouchDB instance – there are several ways of doing this.

First, if you know the IP address (or range of addresses) that will be accessing your CouchDB instances, you can simply use IPTables to restrict access to that IP:

$ iptables -A INPUT -p tcp -s 64.57.102.34 –dport 5984 -j ACCEPT

The above command would restrict access to your CouchDB instance to a single IP address.

Another method for securing a publicly exposed CouchDB instance is to use Apache password authentication. A good overview of this approach can be found here.

After you’ve modified the bind_address setting, restart CouchDB and test connectivity to it:

$ sudo /usr/local/etc/init.d/couchdb restart
$ curl http://your_new_couchdb_ip:5984

Creating our CouchDB Database

Once you have your CouchDB instance up and running, you can create a database in one of two ways. The first, and easiest, is simply to use the curl command. You create a database in CouchDB by using the HTTP PUT method:

$curl -X PUT http://your_new_couchdb_ip:5984/call_logs

You can also create a database (and do lots of management functions in CouchDB) by using the GUI (called Futon). Its located at http://your_new_couchdb_ip:5984/_utils

Building a Simple Auto Attendant Application with Tropo

Now that we have an initial database in our CouchDB instance, lets build a simple Tropo application that will populate it with records (or documents in CouchDB parlance):

This simple application is a basic auto attendant. It asks the caller for a 4-digit extension and then transfers them to a 10-digit PSTN number. At the end of the call, we write a very simple call log document to our new call_logs database using the HTTP POST method.

(One small side note – you can use either the POST or PUT methods to insert a document into a CouchDB database. However, using PUT assumes you want to assign a specific document ID to your document. When you use HTTP POST, CouchDB will automatically assign a document ID. For now, we’ll keep things simple and use POST.)

Much of the functionality in this simple app is just stubbed out for now - i.e., the getPhoneNumberByExtension() method - we’ll build more of this out in later posts.

Modify this file by adding your instance-specifc details to the constant declarations at the top. Do also note that the last two constants can remain blank for now.


define("COUCH_DB_DESIGN_DOCUMENT_NAME", "");
define("COUCH_DB_SHOW_FUNCTION_NAME", "");

We’ll talk about how to use design documents in the next post.

When you load this file up on Tropo and make a test call, you will see your call log document is inserted into the call_logs database. The structure of the document is pure JSON, which is supported quite nicely in PHP (and most every other language that can run on Tropo as well).

In the next post, we’ll examine CouchDB design documents in more detail and modify our simple demo application to get a list of extensions from another CouchDB database and parse the JSON data structure in the getPhoneNumberByExtension() method.

Until then, keep on relaxing….

a

by Mark Headd at January 06, 2010 11:48 PM

January 04, 2010

Mark Headd

Building Voice Applications with Tropo and CouchDB

The beginning of a new decade is usually the time when there is a lot of reflection on what has changed for the better (and the worse) over the past 10 years.
Voxeo Tropo Logo

The end of 2009 saw its fair share of pundits talking about how far we’ve come since the year 2000, but for me the most dramatic change has been to the way that voice applications are developed and deployed. In the past 10 years (hell, in the past 10 months!) we’ve seen a dramatic shift toward cloud telephony with the launch of a number of new services that have fundamentally altered how voice applications can be built.

There has simply never been a more varied and powerful array of tools available for developers to build phone applications with than exists today. Some of the newest and most innovative platforms around for building cloud-based phone applications are listed below:

Over the same period of time there’s also been lots of changes to some of the foundational technologies supporting voice applications (and other kinds of web applications). The NoSQL movement is gaining steam, and there is an interesting collection of new document-oriented databases available for developers to use. One of my favorite is Apache CouchDB.

Apache CouchDB Logo

The thing I find really interesting about CouchDB is that it makes use of an HTTP-based API - pretty much any tool or technology that can communicate via HTTP can be used to interact with a CouchDB instance (hello command line…). In addition, that data structure of the documents stored in a CouchDB instance is JSON. In my mind, this makes CouchDB a very useful choice when building cloud applications, specifically cloud telephony applications.

Who wants to use one of those old relational databases to build a cutting edge cloud phone app? That’s so 2009.

This post and the next several that follow it will detail how to set up a CouchDB instance and to build a cloud telephony application with it using the Voxeo Tropo platform.

For those that don’t know, Tropo is one of the new cloud telephony platforms that lets developers author voice apps quickly and easily using one of several different languages. It’s open source, well documented and supported by Voxeo’s industry leading telephony infrastructure.

So, if you want to start the new decade of right by learning how to build powerful, scalable, full featured voice applications using Tropo and CouchDB, read on.

Getting Started with Tropo

Head on over to Tropo.com and set up a new account (if you don’t have one already). Take a little time to review the documentation for Tropo - I’d recommend running through a few of the sample apps if you have time. They’re fairly self explanatory and provide a solid overview of the different languages that can be used to write a Tropo app - I’ll be using PHP for the example application in this series of blog posts, but you can use any language that Tropo supports.

Deploying and testing an application on Tropo is a snap, and you can even deploy a PSTN number for your application (or you can use the Skype calling number automatically provisioned when you create a Tropo app). More on this in the next post.

Installing CouchDB

The next step in building our cloud telephony application for the new decade is getting CouchDB up and running. The steps listed below detail how to install CouchDB 0.9 on Ubuntu 8.04 (the long-term support version of Ubuntu). A few points before we get started…

This specific combination of Ubuntu and CouchDB is my own preference. I typically run Ubuntu 8.04 when I deploy a new virtual server, but you are free to run whatever version you like, or another Linux distro entirely - its up to you. Depending on the version of Ubuntu you are running, you may be able to get CouchDB 0.9 installed by simply doing sudo apt-get install couchdb.

Keep in mind, though, that the HTTP API for CouchDB can change dramatically between versions - I’ve noticed some significant changes when going from 0.8 to 0.9 - the discussion here will focus on version 0.9 (as does a lot of good documentation on CouchDB available on the web).

If you don’t want to install CouchDB yourself, you may be able to take advantage of one of the growing number of CouchDB hosting services like CloudAnt or Couch.io. Again, its up to you.

To install CouchDB 0.9 on Ubuntu 8.04, using the following steps.

Step 1. Determine what version of Ubuntu is running on your machine:

$ cat /etc/lsb-release

Step 2. Install Erlang - CouchDB 0.9 requires at least Erlang version 5.5.5. If you are running Ubuntu 8.10 or above, you can probably get the required Erlang version by simply doing sudo apt-get install erlang. (Note - The last step in this section may take a while, feel free to go grab a cup of Joe while the source compiles.)

$ sudo apt-get build-dep erlang
$ sudo apt-get install java-gcj-compat java-gcj-compat-dev
$ wget http://www.erlang.org/download/otp_src_R12B-5.tar.gz
$ tar -zxvf otp_src_R12B-5.tar.gz
$ cd otp_src_R12B-5/
$ ./configure && make && sudo make install

Step 3. Install CouchDB dependencies:

$ sudo apt-get install libmozjs-dev libicu-dev libcurl4-openssl-dev

Step 4. Download CouchDB 0.9 Source and install:

$ wget http://apache.mirrors.redwire.net/couchdb/0.9.0/apache-couchdb-0.9.0.tar.gz
$ tar -zxvf apache-couchdb-0.9.0.tar.gz
$ cd apache-couchdb-0.9.0/
$ ./configure && make && sudo make install

Step 5. Create a user for CouchDB (More on this in the CouchDB README.txt file):

$ sudo adduser – –system – –home /usr/local/var/lib/couchdb – –no-create-home – –shell /bin/bash – –group – –gecos “CouchDB Administrator” couchdb
$ sudo chown -R couchdb /usr/local/etc/couchdb
$ sudo chown -R couchdb /usr/local/var/lib/couchdb
$ sudo chown -R couchdb /usr/local/var/log/couchdb

Step 6. Start CouchDB

$ sudo /usr/local/etc/init.d/couchdb start

If you see an error that says:

“Apache CouchDB needs write permission on the PID file: /usr/local/var/run/couchdb.pid”

Do the following, then try starting CouchDB again:

$ sudo touch /usr/local/var/run/couchdb.pid
$ sudo chown couchdb:couchdb /usr/local/var/run/couchdb.pid

When CouchDB starts successfully, you will see a message that says:

* Starting database server couchdb     [ OK ]

Step 7. Test connectivity to CouchDB:

$ curl http://127.0.0.1:5984

You should see:

{”couchdb”:”Welcome”,”version”:”0.9.0″}

The configuration file for CouchDB is located at /usr/local/etc/couchdb/default.ini — in the next post, we’ll modify some of the config settings for our CouchDB instance so that we can access it via HTTP from the Tropo environment.

We’ll also set up our first CouchDB database, add some documents and start coding our new cloud telephony application using Tropo.

Stay tuned…

a

by Mark Headd at January 04, 2010 11:49 PM

January 02, 2010

Mikeal Rogers

Moving on

The new year is bringing some big changes for me. A few weeks back I accepted a position at Relaxed Inc. and notified Mozilla that I would be leaving at the end of the year.

Mozilla

I started working at Mozilla 2 years ago. I started the day after my employment at the Open Source Applications Foundation ended. At this point I already took for granted some of the best parts of working at Mozilla; working for a public benefit organization, spending 100% of my time working on Open Source, working with very smart people in the open (lists, IRC, etc.).

But Mozilla is even more than all that. Succeeding at Mozilla means something more than a pat on the back and a good end of the year review. When you succeed at Mozilla you impact one of the most important products on the internet. You reach hundreds of millions of users and contribute to keeping the web an open and free (as in speech) world. There is no other place in the world you can work where you can conceivably have this kind of impact.

Mozilla as an organization is truly unique. Last year was the hardest I’ve ever had, i suffered a huge loss in my personal life and Mozilla was as supportive during this time as any of my friends or family. There are a lot of places that let you put so much of yourself in to the organization to help it attain it’s goals but there are only a handful that are there to support you when you need it.

Relaxed Inc.

I started using CouchDB in 2008 after a great talk by Jan Lehnardt at OSCON. I started using it right away and over the next year it re-shaped how I think about web development and applications. In the last 6 months my group at Mozilla has become a heavy CouchDB user and not just because of my own interest but because CouchDB was the only solution for some of the harder problems we needed to solve with our results storage.

As I’ve used CouchDB more and more and become a part of the CouchDB community I’ve had the pleasure of knowing some of the core contributors, three of which have decided to found a new startup around CouchDB; Jan Lehnardt, J Chris Anderson, and the creator of CouchDB Damien Katz. Shorty after they received their funding they made an offer. It’s an amazing opportunity and while the decision to leave Mozilla is one of the hardest I’ve ever had to make I’m very excited about my future at Relaxed.

The Future

I’m really looking forward to working with everyone at Relaxed. It’s an exciting time and I’m not 100% sure yet which projects I currently work on that I will still have time to maintain. In the next week or so I’ll be doing a blog post on all the libraries I currently work on and maintain (it’s a long list) and what their status is moving forward. I still maintain code I wrote long before I worked at Mozilla and have every intention of continuing to work on some of the projects I started at Mozilla.

One thing is certain. I’m not the guy who figures out how to test the browser any more. Windmill and Mozmill are important projects that I have every intention of supporting by making time for code reviews and community support but I won’t be available to put time in to new feature work and refactoring like I have in the past. Luckily there are solid communities behind both of these projects and I’m confident that there are people who can continue to drive them in the future.

I don’t know what is going to happen next, all I know is that it should be fun, it won’t be like anything I’ve done before, and will certainly continue to include lots JavaScript and Python.

For everyone who depends on me and the code I’ve written over the last few years I’ll be sure to keep you all up to date. And one thing I can promise is that if you want to fix anything in one my projects, fork it on github and send me a pull request and I will always find time to look at it :)

by mikeal at January 02, 2010 09:30 PM

December 22, 2009

Ricky Ho

Takeaways on Responsive Design

I recently have attended a great talk from Kent Beck who is the thought leader in Extreme Programming, Test-Driven-Development and Code refactoring. This talk "Responsive Design" outline a set of key principles of how to create a design that is "malleable" so it can respond quickly to future changes.

Anticipating Changes
I don't think one can design a system without knowing the context of the problem. You have to know what problem your solution is trying to solve. Of course, you may not know the detail of every aspects, or you may know that certain parts of the requirements may undergoing some drastic changes. In these areas where changes are anticipated, you need to build in more flexibilities into your design.

In my opinion, knowing what you know well and what you don't know is important. Good designers usually have good instinct in sensing between the "known" and "unknown" and adjust the flexibility of his design along the way as more information is gathered.

As more information is gathered, the dynamics of "change anticipation" also evolves. Certain parts of your system has reduced its anticipated changes due to less unknowns so now you can trade off some flexibility for efficiency or simplicity. On the other hand, you may discover that certain parts of the system has increased its anticipated changes and so even more flexibility is needed.

Safe Steps
If the system is already in production, making changes in the architecture is harder because there are other systems that may already depend on it. And any changes on it may break those other systems. "Safe Steps" is about how we can design the changes to existing system with minimal impact to those other systems that depends on it.

Design for Evolution
One important aspect when design a system is not just by looking at what the end result should be, but also look at what the evolution path of the system should look like. The key idea is that a time dimension is introduced here and the overall cost and risk should be summed along the time dimension.

In other words, it is not about whether you have designed a solution that finally meet the business requirement. What is important is how your solution bring value to the business as it evolves over time. A good design is a live animal that can breath and evolve together with your business.

Kent also talk about 4 key design approaches under different conditions

1. Leap

"Leap" is a brave approach where you go ahead to design and implement the new system. When complete, swap out the existing components with the new one. This approach requires a very good understanding of the functionality of system you want to build, how the existing system and how other systems depends on the existing system.

"Leap" can be a very effective approach if the system is very self-contained with clearly defined responsibilities. But in general, this approach is somewhat high-risk and is an exception rather than a norm for large enterprise applications.

2. Parallel

"Parallel" takes a "wrap and replace" approach. The new system is designed to run in parallel with existing system so that migration can be conducted gradually in a risk-containing manner. If there is any problem happens during the migration, the whole system can be switched back to the original system immediately so the risk is contained.

After 100% client has been migrated to the new system for some period of time, the old system can be shutdown without even the clients notice it.

"Parallel" approach still requires you to have a clear understanding of what you want to build. But it relax you from knowing the dependencies of existing system. Of course, the design may be more complicated because it needs to run in parallel with the existing system and has to deal with things like data consistency and synchronization issues.

"Parallel" is a predominant approach that I've seen people used in reality.

3. Stepping Stone

"Stepping Stone" is very useful when you don't exactly know your destination, but you know that there are some intermediate steps that you have to do. In this approach, the designer focus in those intermediate steps that will lead to the final destination.

"Stepping Stone" requires the designer to have a "scope of variability" in mind about the final solution and then identify some common ground across the perceived variability. Knowing what you don't know is important to define the "stepping stone".

This approach is also very useful to design the evolution path of the system. Since you need to look at how the "stepping stone" will provide value to the existing systems while it evolves.

4. Simplification

"Simplification" is the about how you can simplify your intermediate design by breaking down the requirement into multiple phases. I personally don't think the designer has the flexibility to change the ultimate requirement but she definitely can break down the ultimate requirement into multiple phases so she can control the evolution path of her design. In other words, by simplifying the requirement in each phase, she can pick the challenges that she want to tackle in different phases.

"Simplification" is also an important skill for experienced designers. The only way to tackle any complex system beyond a human's brain power is to break the original complexity down into simpler systems and tackle them in incremental steps.

"Simplification" is also an important abstraction skills where experience designer can generalize a specific problem case into a generic problem pattern where a generic solution can be found (e.g. design pattern), and then customize a specific solution from the design pattern.

by Ricky Ho (rickyphyllis@gmail.com) at December 22, 2009 01:23 AM

December 20, 2009

Maximillian Dornseif

Running a CouchDB cluster on Amazon EC2

CouchDB is a nearly zero-configuration multi-master document oriented database. It is a awsome product build by an awsome team.

So far I have been using CouchDB like we would have used any other modern Document Datastore: in a centraized fashion. One Server at our premises. For backup purposes we replicated on a second couchdb instance running on our backup server.

Hosting about 300 GB of data a small 2.6 GHz Server with consumer-grade disks we started seeing preformance issues. Also we see latency issues since we are hosting some application at Amazon EC2 “in the cloud” which
results an an addiotional 40 ms delay for all queries to our locally hosted server.

So this is the right time to use some more of CouchDBs capabilities and spin up additional instances on demand at Amazon EC2. I assume you have already set up an Amazon EC2 account and are comfortable with the general concepts.

There are some tutorials out there which threat EC2 like a regular hosting provider. This is a seriously misguided approach. If you don’t use EC2 in a way that you always can loose one or two instances, you are using it wrong. If you are not spinning up servers in a way that it takes the same time to set up one instance than it takes to set up 10 instances you are using it wrong.

To use EC2 as it meant to be used, we need automation. We will use puppet in this example.

I assume that you have installed a “puppetmaster” on a machine called puppet.example.com. I also assume the puppet configuration on the puppetmaster is at /etc/puppet. On BSD ist might be located instead at /usr/local/etc/puppet. Place the following content at /etc/puppet/files/etc/couchdb/local.ini:

[couchdb]
database_dir = /mnt/couchdb
view_index_dir = /mnt/couchdb

[httpd]
bind_address = 0.0.0.0

[couch_httpd_auth]
require_valid_user=true

[admins]
admin = sekrit

This ensures that only clients which authenticate as user “admin” with the password “sekrit” are allowed to access the server. You might want to change “sekrit” to something more suble.

Add /etc/puppet/fileserver.conf to make sure the local.ini file can be moved the clients:

[files]
  path /etc/puppet/files
  allow *

Then add /etc/puppet/manifests/site.pp to allow automatic installation and configuration:

class couchserver {
  package { "couchdb": ensure => latest }
  package { "python-couchdb": ensure => installed }
  group { "couchdb": ensure => present }
  user { "couchdb": ensure => present, groups => "couchdb",
    comment => "CouchDB Administrator",
    home => "/mnt/couchdb" }
  file { "/etc/couchdb": ensure => directory,
    owner  => couchdb, group  => couchdb,
    mode   => 755 }
  file { "/mnt/couchdb": ensure => directory,
    owner  => couchdb, group  => couchdb,
    mode   => 700 }
  file {"local.ini":
    mode => 774,
    owner => couchdb, group => couchdb,
    path => "/etc/couchdb/local.ini",
    source =>
      "puppet://puppet.exmple.com/files/etc/couchdb/local.ini"
  }
  service { couchdb:
    ensure    => running,
    subscribe => [Package[couchdb],
                  File["local.ini"],
                  File["/mnt/couchdb"]]
}}

node "PLACEHOLDER" {
    include couchserver
}

Now we have to create a Amazon “security group” to firewall our CouchDB servers. Since I like the belt and suspenders way of doing things we not only will use HTTP-Auth in CouchDB but also firewall rules. You have to have the EC2 commandline tools installed. I assume your comapny has a public IP range at 17.18.19.0/24.

$ ec2-add-group couchserver -d 'couchdb server'
$ ec2-authorize couchserver -P tcp -p 5984 -s 17.18.19.0/24

Next step is starting a EC2 instance. We use a Small Ubuntu 9.10 AMI since it comes with a decent version of CouchDB. We then log in and install Puppet.

$ ec2-run-instances ami-a62a01d2 --key YOUR_EC2_SSH_KEY \
  --instance-type m1.small --region eu-west-1 \
  --group default --group couchdb
INSTANCE       i-ec985e9b   ...
$ sleep 120
# get the id from the output of ec2-run-instances
$ ec2-describe-instances i-ec985e9b
INSTANCE       i-ec985e9b      79.125.56.43   10.227.94.80
# get the ip from the output of ec2-describe-instances
$ ssh -i ~/.ssh/YOUR_EC2_SSH_KEY ubuntu@79.125.56.43
# on the EC2 instance:
$ sudo apt-get update -y
$ sudo apt-get install -y puppet
$ puppetd --test --server puppet.example.com

This will result in a Error message about certificates. The puppet client requested a certificate and you have to sign this certificate at the puppet server. There is still some room for automatation. Log into the puppetmaster and list the signature requests with puppetca -l. You’ see the name of your newly created instance. Sign that name by using puppetca -s:

root@puppet:~# puppetca -l
ip-10-20-30-40.eu-west-1.compute.internal
root@puppet:~# puppetca -s ip-10-20-30-40.compute.internal
Signed ip-10-20-30-40.eu-west-1.compute.internal
root@puppet:~# perl -npe 's//ip-10-20-30-40.compute.internal/;' \
   -i.bak /etc/puppet/manifests/site.pp

The last line automatically edits /etc/puppet/manifests/site.pp to contian configuration information for the new instance. That’s all there is to do on the puppet master.

Now back on the new instance you can make puppet configure your CouchDB by typing puppetd --test --server puppet.example.com. This should install CouchDB and configure it to use the “big” 140 GB disk of your instance and to require password authentication.

You can test if CouchDB is up, running and secured by using cURL:

$ curl http://127.0.0.1:5984
{"error":"unauthorized","reason":"Authentication required."}

This is the point in time where we can start replication from our internal, behind-the-firewall CouchDB to the new box running at Amazon. Since there are some issues regarding commandline tools and authentication I created a patched version of python-couchdb at GitHub. Download it from here to a machine in your internal network, untar it and change in the couchddb-python directory. Then initiate replication:

$ PYTHONPATH=. python ./couchdb/tools/manual_replication.py \
  --source=http://couchdb.internal.example.com:5984 \
  --target=http://admin:sekrit@79.125.56.43:5984/ --push \
  --continuous

After this ran, set up permanent two-way replication between the two Servers:

$ PYTHONPATH=. python ./couchdb/tools/manual_replication.py \
  --source=http://couchdb.internal.example.com:5984 \
  --target=http://admin:sekrit@79.125.56.43:5984/ --push
$ PYTHONPATH=. python ./couchdb/tools/manual_replication.py \
  --source=http://admin:sekrit@79.125.56.43:5984/ \
  --target=http://couchdb.internal.example.com:5984 \
  --continuous

Basicaly that’s it. We are still missing a few bit’s and pices to get full automation, but we are nearly there. and for a cluster you probably want more than one CouchDB instance running at Amazon.

by mdornseif at December 20, 2009 09:38 PM

Volker Mische

GeoCouch: The future

GeoCouch started as a proof of concept and was heavily rewritten for the 0.10 release. As more and more people got interested, I got feedback to see what people really want/need. And now it's time to determine the future of GeoCouch. It's your chance to shape the future. In this blog entry I'll explain my ideas for the future, but I'm more than happy to get further ideas/complains from you. So please check if my ideas match your use-cases for GeoCouch.

Stripping it down

GeoCouch needs an external spatial index, at the moment I use SpatiaLite for it, but a PostGIS backend would be easily possible. My inital idea was that it is better to use the existing power of spatial databases, rather than reinventing the wheel. I though I could use all the power they have, that I can even use them for complex analytics, but I can't. As I only store the geometries, I need to “ask” CouchDB for the attributes (no, I don't want to store attributes in my spatial index).

If I don't use the full power of the spatial databases, but only a small fraction, there might be better solution. Therefore I propose that GeoCouch will use a simple spatial index for storing the geometries, not a full blown spatial database. I haven't decided yet which one it'll be, but I really think about moving this part to Erlang (I know that quite a few people would love that move).

You will loose functionality like reprojection. The spatial index won't know anything about projections. So GeoCouch won't be projection aware anymore, but you application still can be. For example if you want to return your data in a different projection than it was stored, you do the transformation after you've queried GeoCouch.

You would also loose fancy things for geometries, like boolean operations on them. But this is something I'd call complex analytics, and not simple querying.

GeoCouch would only support three simple queries: bounding search, polygon search and radius/distance search. If the search would be within a union of polygons, let's say all countries of the European Union, you would simply make the union operation before you query GeoCouch.

Complex analytics

What I call “complex analytics” is things like: “return all apple trees that are located with a 10km range around buildings that have are over 100m high, but only in countries with a population over 50 million people” is not possible with GeoCouch as you would need the attribute values as well. Those are stored in CouchDB, so you would need to request them. What GeoCouch only supports is a simple: give me all IDs within a bounding box/polygon/radius.

Conclusion

Simple requests are needed for everyday use, thus they should be incredibly fast. Complex analytics don't necessarily need to handle thousands of requests per second, in most cases they don't even need to be processed in real-time. I'd like to see some layer build above GeoCouch, so CouchDB can even be used for analytics (which is a thing I wanted to have right from the start).

This means that GeoCouch will be mainly for high performance and massive sized projects that need some simple spatial bits, what I think the majority of users need.

If you either think you really need only those simple queries, but you want them to be fast, or you think this is wrong, that you need dynamic reprojection I can only invite you to leave a comment below or drop a mail to volker.mische@gmail.com. Thanks.

by Volker Mische at December 20, 2009 02:37 PM

December 16, 2009

Damien Katz

Preschools in/near Piedmont?

Anyone know of a good preschool in/near Piedmont CA accepting applications for January? This is for our 2 year old girl, born Aug 2007.

by Damien Katz at December 16, 2009 09:33 PM

December 15, 2009

Damien Katz

Relaxed Inc.

So something interesting happened recently.

I, Jan, and Chris are building a startup around Apache CouchDB and Redpoint Ventures has invested $2 million. Pretty cool huh?

What are we going to do with that? Well, we are still figuring that out. For the most part we are going to try to grow a large and healthy CouchDB ecosystem and then build our own business(es) within that. We are working with Satish Dharmaraj on some basic strategy stuff right now, and we aren't trying to be secretive or "stealthy" as TechCrunch said. It's just very early and I'm also very busy planning a move. Which bring me to my 2nd and 3rd announcements.

2nd Announcement: Yes, I've left IBM.

To the projects at IBM using CouchDB, please continue to feel free contact me at anytime. IBM has been very good to me and the CouchDB project and I want IBM and its customers to be successful. That hasn't changed.

To Anant Jhingran, David Fallside and especially Sam Ruby, thank you for your early support and getting IBM behind CouchDB. We'd never have made it this far without you.

3rd Announcement: California here we come! I, Laura, and our 5 yo, 2 yo, and 9 mo (for those counting at home, that's 5 people) are moving to Piedmont CA the first week of January. We have everything all settled, other than a million small details that Laura has to deal with.

We still need some recommendations for a good local family doctor, ophthalmologist and dentist. And does any know where you can get them to convert your car to a lowrider that goes bouncy bouncy down the street?

Anyway, exciting times ahead. Follow me, Chris and Jan on twitter as we figure all this stuff out.

by Damien Katz at December 15, 2009 05:51 PM

Stuart Langridge

Desktopcouch on Windows/Mac

One of the things that I’d really like to see is desktopcouch on Windows and the Mac, so that your data can be everywhere, on all your machines. Now, I don’t know enough about those platforms to actually do it, but I’d be happy to help anyone who was interested in making it happen; here are some thoughts on what’d be required, and do please chip in here with questions, or ask me or others on the #ubuntuone IRC channel on irc.freenode.net.

Desktop Couch, for those of you who aren’t sure, is a personal CouchDB that your apps can store their data in. It’s secured for you alone, and it comes with a built-in replication setup, so two desktopcouches on the same network can exchange data. This means that if your applications store their data in desktopcouch — for example, Bindwood, our Firefox extension to store your bookmarks there — then all the machines on your network can exchange that data, meaning that adding a bookmark on one of your machines will automatically add it on the others, without you having to do anything, and without you having to sign up to some third-party service to make that happen. It’ll all work on your local network. (It can also work via a third-party service — Ubuntu One is such a service, and there could be others, as long as they deploy CouchDB in the cloud — so that machines on different networks can also exchange data.)

CouchDB

The first thing you need to have, to get a personal CouchDB, is CouchDB. I know the Mozilla Raindrop team have done some work getting CouchDB nice and robust and working on Windows, and I believe they have an installer. Working with that would be very cool indeed. (Indeed, it’s possible for Raindrop to use desktopcouch if you have it, so that might be an interesting test!)

On-demand startup and port numbers (or, what’s your D-Bus?)

Desktopcouch is started on demand, when the first application requires it, rather than running all the time even when you don’t need it. Because it is a personal CouchDB, and because there might be more than one user on the machine that you’re on, it can’t run on a specific port number; instead, you ask desktopcouch for its port number when you want it, and asking that question starts it up on a randomly-chosen port if it’s not already running. The way this is done on Ubuntu (and other Linuxes) is with D-Bus, which is a Linux-specific IPC mechanism. It’s possible to use D-Bus on other platforms, but a much better way would be to use your platform’s specific way of passing messages to a service and starting a service if it’s not already running. (This is one of the things I don’t know how to do on other platforms. Would Macs use launchd or something? Should a Windows service be running on startup? I don’t know.) Desktopcouch only uses D-Bus for these two things (”what’s your port number” and “start this service if it’s not already running”), so replacing those sections with a native way of doing that on Windows/Mac would be reasonably easy and make it fit in much better with the Right Way of doing things on that platform; these changes would need to be made both in desktopcouch itself, and in APIs (desktopcouch.records) that connect to desktopcouch.

Where are the keys? (or, no gnome-keyring on Windows)

Because desktopcouch is a personal CouchDB, access to it is secured by OAuth. When desktopcouch is first set up, it randomly generates a set of OAuth tokens, and these are stored in the Gnome keyring on Ubuntu. Obviously, it would be better to use the platform’s own way of storing authentication data; I believe the Mac has a “keychain”, and Windows surely has something similar. Again, this would need changing in desktopcouch itself (to store the keys in the right place), and in desktopcouch.records (to retrieve the keys from that place to use them).

And that’s it

With those changes in place, desktopcouch should run on another platform, meaning that you can exchange data between all your apps on all your machines. We’re already getting some sterling work done to see desktopcouch on other Linux distributions and on phones like the Nokia N900 (hooray for Thomas, among others!). I’d really like to see this happen on Windows and Mac too; are you interested in helping make it happen? Let me know, and I’ll give you all the help you need!

by sil at December 15, 2009 01:39 PM

December 08, 2009

Enda Farrell

CouchDB compaction - big impacts

CouchDB needs to have it’s databases compacted regularly. It’s quite easy to do but the ease of doing so may lead you into thinking that it’s not worthy of serious consideration. You need to be aware of a few things.

Here at the beeb we have many databases of very differing sizes, with very different “busy times” and to be honest we don’t really know nor care what data is in them. Some documents are very simple, some we know are serialised PHP objects, some are JSON base64 encodings of images (we do not allow attachments) - some docs are less than 50 bytes, some are pushing 1 meg. Some databases have their busy times at 2am - others with the broadcast on the weekends of television shows.

We had a bug in some code a while ago (not CouchDB code) that meant we wanted to not compact our databases for a while. Ordinarily we compact daily - but for one namespace (renamed as “brain test” below) we wanted to keep all previous versions of data for a while (and still do so I did not compact it yesterday). As it was a “hmm - we need to do a one-off here” I compacted everything (nearly together) instead of our system code’s more gentle “one at a time” approach.

Here are some charts so see what happened.

This chart is going to take some explaining:

 

  • Think of it like a stock chart: the blue is the “volume” - the overall size on disk of the database. Its axis is on the left drawn on a log scale given the range of sizes. The biggest was ~ 150 GB, the smallest 80k - but it’s not really that important here.
  • The top of the red bar is the relative size of the database before the compaction started - that’s why they are all “one” on the right-hand axis
  • The bottom of the red bar is the relative size of the database after the compaction finished. Some of hte databases did not compact much, some compacted down to about 2% of their original size
  • The thin black lines going up are the amount of space that the compaction took during the process. For namespaces that compact really well, that line is short and the “high tide mark” of the amount of space used is not much above the opening size. However - for namespaces that do not compact well - this high tide mark is as big as the original database. More on this watch-point later. 

 

So - what can we see here? 

 

  • The biggest - “big page” - saved 32% of its space - it ended up at about 100 GB in the end.
  • “a cache” which was the second biggest compacted REALLY well - due essentially to the very high number of revisions that some of its keys had. It started at 129 gig - finished at 7 gig. That was really quite a nice saving.
  • I didn’t compact “brain test” so it finished up the same size - but look at it’s “high” bar. If I had compacted it - it would have take up quite a bit of space. The high bar ought not be there as it didn’t undergo the taking-up-more-space-temporarily compaction process - it’s a flaw in my spreadsheet.
  • As you look at the red bars, you can see that most databases compact well - but small ones look like they don’t. This is because small ones are not being revised much - there’s very little going on in them so there’s little changing and hence little for the compactor to free up.

 

So that’s the good news. Now the bad.

Here’s what the process did to one of our 16 servers handling the CouchDB data:

For two hours on these 8 core (2 x quad-core DL-380s) machines were taking a thumping. The ops crew came wandering over wondering what’s going on. They know that big HTTP traffic can cause load on these boxes but high load without traffic? That’s odd.

I don’t have the charts that show the slower response times that this causes on the users of the service but they’ll look a bit like the load graph.

What can we take from this?

 

  • Compact regularily - but only if the amount of free space on your drives is greater than the size of your biggest database. This is a hard limit that your “capacity planning” must take into account
  • It can save you a lot of space
  • It will heavily load your servers - perhaps for quite a while
  • Tell your ops crew that this will happen and that they can expect this sort of load

 

 

by Enda Farrell at December 08, 2009 11:07 AM

December 03, 2009

Jason Davies

Non-Relational Databases and World Domination

I gave a talk this morning at the Ruby on Rails eXchange. Excellent audience! The slides are here: Non-Relational Databases and World Domination (PDF).

I'll update this post with links to related material when I get a chance.

by Jason Davies at December 03, 2009 03:23 PM

December 02, 2009

Christopher Lenz

CouchDB "Joins"

CouchDB - relax

I've been playing more and more with CouchDB lately. After putting together a Python library, I worked on a brand new included HTML/AJAX interface. Now I'm starting to dive into the Erlang code, which is my first serious encounter with Erlang. In particular, I started a branch that aims to replace the HTTP server underpinnings with Bob Ippolito´s MochiWeb library.

Despite all that activity (and past experience with the conceptually similar Lotus Notes), the correct approach to designing applications “the CouchDB way” isn't always obvious to me at this point. Just today, there was a discussion on IRC how you'd go about modeling a simple blogging system with “post” and “comment” entities, where any blog post might have N comments. If you'd be using an SQL database, you'd obviously have two tables with foreign keys and you'd be using joins. (At least until you needed to add some denormalization.)

But what would the “obvious” approach in CouchDB look like?

Note: I've updated this post to clarify the role of view collation, and to stress that all three approaches are equally valid for different kinds of applications.

read on …

December 02, 2009 08:32 AM

December 01, 2009

Ricky Ho

Query Processing for NOSQL DB

The recent rise of NoSQL provides an alternative model in building extremely large scale storage system. Nevetheless, compare to the more mature RDBMS, NoSQL has some fundamental limitations that we need to be aware of.
  1. It calls for a more relaxed data consistency model
  2. It provides primitive querying and searching capability
There are techniques we can employ to mitigate each of these issue. Regarding the data consistency concern, I have discussed a number of design patterns in my previous blog to implement system with different strength of consistency guarantee.

Here I like to give myself a try to tackle the second issue.

So what is the problem ?

Many of the NoSQL DB today is based on the DHT (Distributed Hash Table) model, which provides a hashtable access semantics. To access or modify any object data, the client is required to supply the primary key of the object, then the DB will lookup the object using an equality match to the supplied key.

For example, if we use DHT to implement a customer DB, we can choose the customer id as the key. And then we can get/set/operate on any customer object if we know its id
  • cust_data = DHT.get(cust_id)
  • DHT.set(cust_id, modified_cust_data)
  • DHT.execute(cust_id, func(cust) {cust.add_credit(200)})
In the real world, we may want to search data based on other attributes than its primary key, we may also search attributes based on "greater/less than" relationship, or we may want to combine multiple search criteria using a boolean expression.

Using our customer DB example, we may do ...
  • Lookup customers within a zip code
  • Lookup customers whose income is above 200K
  • Lookup customer using keywords "chief executives"
Although query processing and indexing technique is pretty common in RDBMS world, it is seriously lacking in the NoSQL world because of the very nature of the "distributed architecture" underlying most of NoSQL DB.

It seems to me that the responsibility of building an indexing and query mechanism lands on the NoSQL user. Therefore, I want to explore some possible techniques to handle these.


Companion SQL-DB

A very straighforward approach is provide querying capability is to augment NoSQL with an RDBMS or TextDB (for keyword search). e.g. We add the metadata of the object into a RDBMS so we can query its metadata using standard SQL query.

Of course, this requires the RDBMS to be large enough to store the search-able attributes of each object. Since we only store the attributes required for search, rather than the whole object into the RDBMS, this turns out to be a very practical and common approach.


Scatter/Gather Local Search

Some of the NOSQL DB provides indexing and query processing mechanism within the local DB. In this case, we can have the query processor broadcast the query to every node in the DHT where a local search will be conducted with results sent back to the query processor which aggregates into a single response.

Notice that the search is happening in parallel across all nodes in the DHT.


Distributed B-Tree

B+Tree is a common indexing structure using in RDBMS. A distributed version of B+Tree can also be used in a DHT environment. The basic idea is to hash the search-able attribute to locate the root node of the B+ Tree. The "value" of the root node contains the id of its children node. So the client can then issue another DHT lookup call to find the children node. Continue this process, the client eventually navigate down to the leaf node, where the object id of the matching the search criteria is found. Then the client will issue another DHT lookup to extract the actual object.

Caution is needed when the B+Tree node is updated due to split/merge caused by object creation and deletion. This should be ideally done in an atomic fashion. This paper from Microsoft, HP and Toronto U describe a distributed transaction protocol to provide the required atomicity. Distributed transaction is an expensive operation but its uses here is justified because most of the B+ tree updates rarely involve more than a single machine.


Prefix Hash Table (distributed Trie)

Trie is an alternative data structure, where every path (from the root) contains the prefix of the key. Basically, every node in the Trie contains all the data whose key is prefixed by it. Berkeley and Intel research has a paper to describe this mechanism.

1. Lookup a key
To locate a particular key, we start with its one digit prefix and do a DHT lookup to see if we get a leaf node. If so, we search within this leaf node as we know the key must be contained inside. If it is not a leaf node, we extend the prefix with an extra digit and repeat the whole process again.
# Locate the key next to input key
def locate(key)
leaf = locate_leaf_node(key)
return leaf.locate(key)
end

# Locate leaf node containing input key
def locate_leaf_node(key)
for (i in 1 .. key.length)
node = DHT.lookup(key.prefix(n))
return node if node.is_leaf?
end
raise exception
end

2. Range Query
Perform a range query can be done by first locate the leaf node that contains the start key and then walk in the ascending order direction until we exceed the end key. Note that we can walk across a leaf node by following the leaf node chain.
def range(startkey, endkey) {
result = Array.new
leaf = locate_leaf_node(startkey)
while leaf != nil
result.append(leaf.range(startkey, endkey))
if (leaf.largestkey < endkey)
leaf = leaf.nextleaf
end
end
return result
end
To speedup the search, we can use a parallel search mechanism. Instead of walking from the start key in a sequential manner, we can find the common prefix of the start key and end key (as we know all the result is under its subtree) and perform a parallel search of the children leaf nodes of this subtree.

3. Insert and Delete keys
To insert a new key, we first identify the leaf node that contains the inserted key. If the leaf node has available capacity (less than B keys), then simply add it there. Otherwise, we need to split the leaf node into two branches and redistribute its existing keys to the newly created child nodes.

To delete a key, we similarly identify the leaf node that contains the deleted key and then remove it there. This may cause some of my parents to have less than B + 1 keys so I may need to merge some child nodes.


Combining Multiple Search Criteria

When we have multiple criteria in the search, each criteria may use a different index that resides within a different set of machines in the DHT. Multiple criterias can be combined using boolean operators such as OR / AND. Performing OR operation is very straightforward because we just need to union the results of each individual index search that is performed separately. On the other hand, performing AND operation is trickier because we need to deal with the situation that each individual criteria may have a large number of matches but their intersection is small. The challenge is: how can we efficiently perform an intersection between two potentially very large sets ?

One naive implementation is to send all matched object ids of each criteria to a server that performs the set intersection. If each data set is large, this approach may cause a large bandwidth consumption for sending across all the potential object ids.

A number of techniques are described here in this paper

1. Bloom Filter
Instead of sending the whole set of matched object id, we can send a more compact representation called "Bloom Filter". Bloom filter is a much more compact representation that can be used for testing set membership. The output has zero false negative, but has a chance of false positive p, which is controllable.


For minimizing bandwidth, we typically pick the one with the larger set as the sending machine and perform the intersection on the receiving machine who has the smaller set.

Notice that the false positive can actually be completely eliminated by sending the matched result of Set2 back to Set1 machine, which double check the membership of set1 again. In most cases, 100% precision is not needed and a small probability of false positive is often acceptable.

2. Caching
It is possible that certain search criteria is very popular and will be issued over and over again. The corresponding bloom filter of this hot spots can be cached in the receiving machine. Since the bloom filter has a small footprint, we can cache a lot of bloom filters of popular search criterias.

3. Incremental fetch
In case if the client doesn't need to get the full set of matched results, we can stream the data back to client using a cursor mode. Basically, at the sending side, set1 is sorted and broken into smaller chunks with a bloom filter computed and attached to each chunk. At the receiving side, every element of set2 is checked for every bloom filter per chunk.

Notice that we save computation at the sending side (compute the bloom filter for the chunk rather than the whole set1) at the cost of doing more at the receiving side (since we need to repeat the checking of the whole set2 for each chunk of set1). The assumption is that client only needs a small subset of all the matched data.

An optimization we can do is to mark the range of each chunk in set1 and ask set2 to skip the objects that falls within the same range.

by Ricky Ho (rickyphyllis@gmail.com) at December 01, 2009 12:34 AM

November 30, 2009

Mikeal Rogers

Hosting?

I’m starting to work on a simple blog to replace this WordPress instance.

I’ve had a great run with WordPress but I have a few ideas I want to experiment with and I also want to dogfood couchdb-pythonviews a little more.

This blog is hosted on Dreamhost. Dreamhost has been a great host for a low impact blog, the uptime hasn’t been 100% but all the maintenance has been easy and it’s also remained dirt cheap for the last few years.

I need to find a new hosting provider. I have one dedicated server but I don’t plan on running a blog there because that server is a little busy.

I need something cheap. I need root (or some kind of sudo jail) where I can run CouchDB and nginx and manage Python. Preferably Debian. Definitely Linux. Decent uptime.

I’ve considered EC2 but for a low impact site it’s actually quite expensive (~30 dollars a month before bandwidth) and the performance I’m told is about 5x slower than a Macbook.

Backups aren’t necessary since I have CouchDB replication for backing up all the important bits.

I’m open to any and all suggestions.

by mikeal at November 30, 2009 12:56 AM

November 25, 2009

Ricky Ho

NOSQL Patterns

Over the last couple years, we see an emerging data storage mechanism for storing large scale of data. These storage solution differs quite significantly with the RDBMS model and is also known as the NOSQL. Some of the key players include ...
These solutions has a number of characteristics in common
  • Key value store
  • Run on large number of commodity machines
  • Data are partitioned and replicated among these machines
  • Relax the data consistency requirement. (because the CAP theorem proves that you cannot get Consistency, Availability and Partitioning at the the same time)
The aim of this blog is to extract the underlying technologies that these solutions have in common, and get a deeper understanding on the implication to your application's design. I am not intending to compare the features of these solutions, nor to suggest which one to use.


API model

The underlying data model can be considered as a large Hashtable (key/value store).

The basic form of API access is
  • get(key) -- Extract the value given a key
  • put(key, value) -- Create or Update the value given its key
  • delete(key) -- Remove the key and its associated value
More advance form of API allows to execute user defined function in the server environment
  • execute(key, operation, parameters) -- Invoke an operation to the value (given its key) which is a special data structure (e.g. List, Set, Map .... etc).
  • mapreduce(keyList, mapFunc, reduceFunc) -- Invoke a map/reduce function across a key range.

Machines layout

The underlying infratructure is composed of large number (hundreds or thousands) of cheap, commoditized, unreliable machines connected through a network. We call each machine a physical node (PN). Each PN has the same set of software configuration but may have varying hardware capacity in terms of CPU, memory and disk storage. Within each PN, there will be a variable number of virtual node (VN) running according to the available hardware capacity of the PN.


Data partitioning (Consistent Hashing)

Since the overall hashtable is distributed across many VNs, we need a way to map each key to the corresponding VN.

One way is to use
partition = key mod (total_VNs)

The disadvantage of this scheme is when we alter the number of VNs, then the ownership of existing keys has changed dramatically, which requires full data redistribution. Most large scale store use a "consistent hashing" technique to minimize the amount of ownership changes.


In the consistent hashing scheme, the key space is finite and lie on the circumference of a ring. The virtual node id is also allocated from the same key space. For any key, its owner node is defined as the first encountered virtual node if walking clockwise from that key. If the owner node crashes, all the key it owns will be adopted by its clockwise neighbor. Therefore, key redistribution happens only within the neighbor of the crashed node, all other nodes retains the same set of keys.


Data replication

To provide high reiability from individually unreliable resource, we need to replicate the data partitions.

Replication not only improves the overall reliability of data, it also helps performance by spreading the workload across multiple replicas.


While read-only request can be dispatched to any replicas, update request is more challenging because we need to carefully co-ordinate the update which happens in these replicas.

Membership Changes

Notice that virtual nodes can join and leave the network at any time without impacting the operation of the ring.

When a new node joins the network
  1. The joining node announce its presence and its id to some well known VNs or just broadcast)
  2. All the neighbors (left and right side) will adjust the change of key ownership as well as the change of replica memberships. This is typically done synchronously.
  3. The joining node starts to bulk copy data from its neighbor in parallel asynchronously.
  4. The membership change is asynchronously propagate to the other nodes.

Notice that other nodes may not have their membership view updated yet so they may still forward the request to the old nodes. But since these old nodes (which is the neighbor of the new joined node) has been updated (in step 2), so they will forward the request to the new joined node.

On the other hand, the new joined node may still in the process of downloading the data and not ready to serve yet. We use the vector clock (described below) to determine whether the new joined node is ready to serve the request and if not, the client can contact another replica.

When an existing node leaves the network (e.g. crash)
  1. The crashed node no longer respond to gossip message so its neighbors knows about it.
  2. The neighbor will update the membership changes and copy data asynchronously

We haven't talked about how the virtual nodes is mapped into the physical nodes. Many schemes are possible with the main goal that Virtual Node replicas should not be sitting on the same physical node. One simple scheme is to assigned Virtual node to Physical node in a random manner but check to make sure that a physical node doesn't contain replicas of the same key ranges.

Notice that since machine crashes happen at the physical node level, which has many virtual nodes runs on it. So when a single Physical node crashes, the workload (of its multiple virtual node) is scattered across many physical machines. Therefore the increased workload due to physical node crashes is evenly balanced.


Client Consistency

Once we have multiple copies of the same data, we need to worry about how to synchronize them such that the client can has a consistent view of the data.

There is a number of client consistency models
  1. Strict Consistency (one copy serializability): This provides the semantics as if there is only one copy of data. Any update is observed instantaneously.
  2. Read your write consistency: The allows the client to see his own update immediately (and the client can switch server between requests), but not the updates made by other clients
  3. Session consistency: Provide the read-your-write consistency only when the client is issuing the request under the same session scope (which is usually bind to the same server)
  4. Monotonic Read Consistency: This provide the time monotonicity guarantee that the client will only see more updated version of the data in future requests.
  5. Eventual Consistency: This provides the weakness form of guarantee. The client can see an inconsistent view as the update are in progress. This model works when concurrent access of the same data is very unlikely, and the client need to wait for some time if he needs to see his previous update.

Depends on which consistency model to provide, 2 mechanisms need to be arranged ...
  • How the client request is dispatched to a replica
  • How the replicas propagate and apply the updates
There are various models how these 2 aspects can be done, with different tradeoffs.

Master Slave (or Single Master) Model

Under this model, each data partition has a single master and multiple slaves. In above model, B is the master of keyAB and C, D are the slaves. All update requests has to go to the master where update is applied and then asynchronously propagated to the slaves. Notice that there is a time window of data lost if the master crashes before it propagate its update to any slaves, so some system will wait synchronously for the update to be propagated to at least one slave.

Read requests can go to any replicas if the client can tolerate some degree of data staleness. This is where the read workload is distributed among many replicas. If the client cannot tolerate staleness for certain data, it also need to go to the master.

Note that this model doesn't mean there is one particular physical node that plays the role as the master. The granularity of "mastership" happens at the virtual node level. Each physical node has some virtual nodes acts as master of some partitions while other virtual nodes acts as slaves of other partitions. Therefore, the write workload is also distributed across different physical node, although this is due to partitioning rather than replicas

When a physical node crashes, the masters of certain partitions will be lost. Usually, the most updated slave will be nominated to become the new master.

Master Slave model works very well in general when the application has a high read/write ratio. It also works very well when the update happens evenly in the key range. So it is the predominant model of data replication.

There are 2 ways how the master propagate updates to the slave; State transfer and Operation transfer. In State transfer, the master passes its latest state to the slave, which then replace its current state with the latest state. In operation transfer, the master propagate a sequence of operations to the slave which then apply the operations in its local state.

The state transfer model is more robust against message lost because as long as a latter more updated message arrives, the replica still be able to advance to the latest state.

Even in state transfer mode, we don't want to send the full object for updating other replicas because changes typically happens within a small portion of the object. In will be a waste of network bandwidth if we send the unchanged portion of the object, so we need a mechanism to detect and send just the delta (the portion that has been changed). One common approach is break the object into chunks and compute a hash tree of the object. So the replica can just compare their hash tree to figure out which chunk of the object has been changed and only send those over.

In operation transfer mode, usually much less data need to be send over the network. However, it requires a reliable message mechanism with delivery order guarantee.


Multi-Master (or No Master) Model

If there is hot spots in certain key range, and there is intensive write request, the master slave model will be unable to spread the workload evenly. Multi-master model allows updates to happen at any replica (I think call it "No-Master" is more accurate).

If any client can issue any update to any server, how do we synchronize the states such that we can retain client consistency and also eventually every replica will get to the same state ? We describe a number of different approaches in following ...

Quorum Based 2PC

To provide "strict consistency", we can use a traditional 2PC protocol to bring all replicas to the same state at every update. Lets say there is N replicas for a data. When the data is update, there is a "prepare" phase where the coordinator ask every replica to confirm whether each of them is ready to perform the update. Each of the replica will then write the data to a log file and when success, respond to the coordinator.

After gathering all replicas responses positively, the coordinator will initiate the second "commit" phase and then ask every replicas to commit and each replica then write another log entry to confirm the update. Notice that there are some scalability issue as the coordinator need to "synchronously" wait for quite a lot of back and forth network roundtrip and disk I/O to complete.

On the other hand, if any one of the replica crashes, the update will be unsuccessful. As there are more replicas, chance of having one of them increases. Therefore, replication is hurting the availability rather than helping. This make traditional 2PC not a popular choice for high throughput transactional system.

A more efficient way is to use the quorum based 2PC (e.g. PAXOS). In this model, the coordinator only need to update W replicas (rather than all N replicas) synchronously. The coordinator still write to all the N replicas but only wait for positive acknowledgment for any W of the N to confirm. This is much more efficient from a probabilistic standpoint.

However, since no all replicas are update, we need to be careful when reading the data to make sure the read can reach at least one replica that has been previously updated successful. When reading the data, we need to read R replicas and return the one with the latest timestamp.

For "strict consistency", the important condition is to make sure the read set and the write set overlap. ie: W + R > N


As you can see, the quorum based 2PC can be considered as a general 2PC protocol where the traditional 2PC is a special case where W = N and R = 1. The general quorum-based model allow us to pick W and R according to our tradeoff decisions between read and write workload ratio.

If the user cannot afford to pick W, R large enough, ie: W + R <= N, then the client is relaxing its consistency model to a weaker one.

If the client can tolerate a more relax consistency model, we don't need to use the 2PC commit or quorum based protocol as above. Here we describe a Gossip model where updates are propagate asynchronous via gossip message exchanges and an auto-entropy protocol to apply the update such that every replica eventually get to the latest state.

Vector Clock


Vector Clock is a timestamp mechanism such that we can reason about causal relationship between updates. First of all, each replica keeps vector clock. Lets say replica i has its clock Vi. Vi[i] is the logical clock which if every replica follows certain rules to update its vector clock.
  • Whenever an internal operation happens at replica i, it will advance its clock Vi[i]
  • Whenever replica i send a message to replica j, it will first advance its clock Vi[i] and attach its vector clock Vi to the message
  • Whenever replica j receive a message from replica i, it will first advance its clock Vj[j] and then merge its clock with the clock Vm attached in the message. ie: Vj[k] = max(Vj[k], Vm[k])

A partial order relationship can be defined such that Vi > Vj iff for all k, Vi[k] >= Vj[k]. We can use these partial ordering to derive causal relationship between updates. The reasoning behind is
  • The effect of an internal operation will be seen immediately at the same node
  • After receiving a message, the receiving node knows the situation of the sending node at the time when the message is send. The situation is not only including what is happening at the sending node, but also all the other nodes that the sending node knows about.
  • In other words, Vi[i] reflects the time of the latest internal operation happens at node i. Vi[k] = 6 reflects replica i has known the situation of replica k up to its logical clock 6.
Notice that the term "situation" is used here in an abstract sense. Depends on what information is passed in the message, the situation can be different. This will affect how the vector clock will be advanced. In below, we describe the "state transfer model" and the "operation transfer model" which has different information passed in the message and the advancement of their vector clock will also be different.

Because state is always flow from the replica to the client but not the other way round, the client doesn't have an entry in the Vector clock. The vector clock contains only one entry for each replica. However, the client will also keep a vector clock from the last replica it contacts. This is important for support the client consistency model we describe above. For example, to support monotonic read, the replica will make sure the vector clock attached to the data is > the client's submitted vector clock in the request.


Gossip (State Transfer Model)

In a state transfer model, each replica maintain a vector clock as well as a state version tree where each state is neither > or < among each other (based on vector clock comparison). In other words, the state version tree contains all the conflicting updates.

At query time, the client will attach its vector clock and the replica will send back a subset of the state tree which precedes the client's vector clock (this will provide monotonic read consistency). The client will then advance its vector clock by merging all the versions. This means the client is responsible to resolve the conflict of all these versions because when the client sends the update later, its vector clock will precede all these versions.


At update, the client will send its vector clock and the replica will check whether the client state precedes any of its existing version, if so, it will throw away the client's update.


Replicas also gossip among each other in the background and try to merge their version tree together.

Gossip (Operation Transfer Model)

In an operation transfer approach, the sequence of applying the operations is very important. At the minimum causal order need to be maintained. Because of the ordering issue, each replica has to defer executing the operation until all the preceding operations has been executed. Therefore replicas save the operation request to a log file and exchange the log among each other and consolidate these operation logs to figure out the right sequence to apply the operations to their local store in an appropriate order.

"Causal order" means every replica will apply changes to the "causes" before apply changes to the "effect". "Total order" requires that every replica applies the operation in the same sequence.

In this model, each replica keeps a list of vector clock, Vi is the vector clock the replica itself and Vj is the vector clock when replica i receive replica j's gossip message. There is also a V-state that represent the vector clock of the last updated state.

When a query is submitted by the client, it will also send along its vector clock which reflect the client's view of the world. The replica will check if it has a view of the state that is later than the client's view.


When an update operation is received, the replica will buffer the update operation until it can be applied to the local state. Every submitted operation will be tag with 2 timestamp, V-client indicates the client's view when he is making the update request. V-@receive is the replica's view when it receives the submission.

This update operation request will be sitting in the queue until the replica has received all the other updates that this one depends on. This condition is reflected in the vector clock Vi when it is larger than V-client


On the background, different replicas exchange their log for the queued updates and update each other's vector clock. After the log exchange, each replica will check whether certain operation can be applied (when all the dependent operation has been received) and apply them accordingly. Notice that it is possible that multiple operations are ready for applying at the same time, the replica will sort these operation in causal order (by using the Vector clock comparison) and apply them in the right order.


The concurrent update problem at different replica can also happen. Which means there can be multiple valid sequences of operation. In order for different replica to apply concurrent update in the same order, we need a total ordering mechanism.

One approach is whoever do the update first acquire a monotonic sequence number and late comers follow the sequence. On the other hand, if the operation itself is commutative, then the order to apply the operations doesn't matter

After applying the update, the update operation cannot be immediately removed from the queue because the update may not be fully exchange to every replica yet. We continuously check the Vector clock of each replicas after log exchange and after we confirm than everyone has receive this update, then we'll remove it from the queue.

Map Reduce Execution

Notice that the distributed store architecture fits well into distributed processing as well. For example, to process a Map/Reduce operation over an input key list.

The system will push the map and reduce function to all the nodes (ie: moving the processing logic towards the data). The map function of the input keys will be distributed among the replicas of owning those input, and then forward the map output to the reduce function, where the aggregation logic will be executed.


Handling Deletes

In a multi-master replication system, we use Vector clock timestamp to determine causal order, we need to handle "delete" very carefully such that we don't lost the associated timestamp information of the deleted object, otherwise we cannot even reason the order of when to apply the delete.

Therefore, we typically handle delete as a special update by marking the object as "deleted" but still keep its metadata / timestamp information around. Around a long enough time that we are confident that every replica has marked this object deleted, then we garbage collected the deleted object to reclaim its space.


Storage Implementaton

One strategy is to use make the storage implementation pluggable. e.g. A local MySQL DB, Berkeley DB, Filesystem or even a in memory Hashtable can be used as a storage mechanism.

Another strategy is to implement the storage in a highly scalable way. Here are some techniques that I learn from CouchDB and Google BigTable.

CouchDB has a MVCC model that uses a copy-on-modified approach. Any update will cause a private copy being made which in turn cause the index also need to be modified and causing the a private copy of the index as well, all the way up to the root pointer.

Notice that the update happens in an append-only mode where the modified data is appended to the file and the old data becomes garbage. Periodic garbage collection is done to compact the data. Here is how the model is implemented in memory and disks

In Google BigTable model, the data is broken down into multiple generations and the memory is use to hold the newest generation. Any query will search the mem data as well as all the data sets on disks and merge all the return results. Fast detection of whether a generation contains a key can be done by checking a bloom filter.

When update happens, both the mem data and the commit log will be written so that if the machine crashes before the mem data flush to disk, it can be recovered from the commit log.

by Ricky Ho (rickyphyllis@gmail.com) at November 25, 2009 06:04 PM

Cloud Computing Patterns

I have attended a presentation by Simon Guest from Microsoft on their cloud computing architecture. Although there was no new concept or idea introduced, Simon has provided an excellent summary on the major patterns of doing cloud computing.

I have to admit that I am not familiar with Azure and this is my first time hearing a Microsoft cloud computing presentation. I felt Microsoft has explained their Azure platform in a very comprehensible way. I am quite impressed.

Simon talked about 5 patterns of Cloud computing. Let me summarize it (and mix-in a lot of my own thoughts) ...

1. Use Cloud for Scaling
The key idea is to spin up and down machine resources according to workload so the user only pay for the actual usage. There is two types of access patterns: passive listener model and active worker model.

Passive listener model uses a synchronous communication pattern where the client pushes request to the server and synchronously wait for the processing result.
In the passive listener model, machine instances are typically sit behind a load balancer. To scale the resource according to the work load, we can use a monitor service that send NULL client request and use the measured response time to spin up and down the size of the machine resources.

On the other hand, Active worker model uses an asynchronous communication patterns where the client put the request to a queue, which will be periodically polled by the server. After queuing the request, the client will do some other work and come back later to pickup the result. The client can also provide a callback address where the server can push the result into after the processing is done.
In the active worker model, the monitor can measure the number of requests sitting in the queue and use that to determine whether machine instances (at the consuming end) need to be spin up or down.


2. Use Cloud for Multi-tenancy
Multi-tenancy is more a SaaS provider (rather than an enterprise) usage scenario. The key idea is to use the same set of code / software to host the application for different customers (tenants) who may have slightly different requirement in
  • UI branding
  • Business rules for decision criteria
  • Data schema
The approach is to provide sufficient "customization" capability for their customer. The most challenging part is to determine which aspects should be opened for customization and which shouldn't. After identifying these configurable parameters, it is straightforward to define configuration metadata to capture that.

3. Use Cloud for Batch processing
This is about running things like statistics computation, report generation, machine learning, analytics ... etc. These task is done in batch mode and so it is more economical to use the "pay as you go" model. On the other hand, batch processing has very high tolerance in latency and so is a perfect candidate of running in the cloud.
Here is an example of how to run Map/Reduce framework in the cloud. Microsoft hasn't provided a Map/Reduce solution at this moment but Simon mentioned that Dryad in Microsoft research may be a future Microsoft solution. Interestingly, Simon also recommended Hadoop.

Of course, one challenge is how to move the data from the cloud in the first place. In my earlier blog, I have describe some best practices on this.

4. Use Cloud for Storage
The idea of storing data into the cloud and no need to worry about DBA tasks. Most cloud vendor provide large scale key/value store as well as RDBMS services. Their data storage services will also take care of data partitioning, replication ... etc. Building cloud storage is a big topic involving many distributed computing concepts and techniques, I have covered it in a separate blog.

5. Use Cloud for Communication
A queue (or mailbox) service provide a mechanism for different machines to communicate in an asynchronous manner via message passing.

Azure also provide a relay service in the cloud which is quite useful for machines behind different firewall to communicate. In a typical firewall setup, incoming connection is not allowed so these machine cannot directly establish a socket to each other. In order for them to communicate, each need to open an on-going socket connection to the cloud relay, which will route traffic between these connections.

I have used the same technique in a previous P2P project where user's PC behind their firewall need to communicate, and I know this relay approach works very well.

by Ricky Ho (rickyphyllis@gmail.com) at November 25, 2009 03:10 PM

November 24, 2009

Enda Farrell

Problems with replication maps

For a long set of reasons that I must sometime write about, I have a set of CouchDB databases which replicate with each other. Each database replicates with two others: one in the same datacentre, one in the other datacentre (we’re only running with two datacentres at the moment). One is to do with resiliance the other with response capability.

Anyway - having two or more replications is not a problem with CouchDB (if you’re careful) - even with our version 0.9x.

Lacking symmetry

However - we may not have been as careful as we’d have wished: look at these numbers:

 

 

Here we see that the replication maps have datacentre symmetry - the data in each node in the two data centres are the same. Apart from the very slight increase in DC2’s data - because I took those numbers about 5 min after the first DC. This set of data (one of 15 which we have today) is growing at 0.3% per 5 min.

 

However - some nodes have 200,000 docs - others none at all - something is amiss.

The datacentres give us the capacity to handle requests in our platform, but we also have replications within each datacentre for resilience. Each node should be either a primary or a fallback node, and we should have resilience symmetry - but look at what we actually have when we align the primary and fallbacks:

Ouch - this does not look symmetrical at all!

 Here’s where the problem lies - a bug in the config:

Here we can see a few problems:

 

  • two of our nodes appear as both Primary and as Fallback nodes
  • two of our nodes are unused

 

 

Fixing these problems

The first thing to do is not to correct the bug in the config - it’s to write tests that can spot this happening. This way we can help ensure that someone coming behind us does not end up in the same situation again. We can even write these as runtime tests - not just unit tests - so that in the future if we have dynamic config reloading we can test it’ll be A-OK.

Second will then be to actually fix the config - quite easy in this case - and get it rolled out to production. This will then fill the unused nodes with the appropriate data.

Third will be to reshard everything - which has (in our system) the side-effect of deleting data from places that it ought not be. However - we do need to be careful that we do not have unexpected side effects.

The two nodes that appear as both primary and fallback (kv003:5986 and kv004:5987) have too much data as they get their “own” data and they were incorrectly replicated into too. Here’s how the other two nodes got out of hand.

 

  1. Data is correctly added to the primary node kv001:5987.
  2. Replication sent a copy from kv001:5987 to kv004:5987.
  3. Unsymmetric replication sent a copy from kv004:5987 to kv002:5987

 

Hence kv002:5987 has too much data. Here is the same story for kv001:5986.

 

  1. Data is correctly added to the primary node kv002:5986.
  2. Replication sent a copy from kv002:5986 to kv003:5986.
  3. Unsymmetric replication sent a copy from kv003:5986 to kv001:5986.

 

When we reshard, we will unexpectedly find data on kv003:5986 and kv004:5987 - and they’ll be deleted (on the code that we have). CouchDB replication will also delete the data from those nodes’ replication pair - so we’ll end up with appropriatly populated databases.

To ease the problems of the time it takes to reshard, you might be tempted to simply look at the data found on kv003:5986 and kv004:5987 (one quarter of the full set) and reshard those - and you’d be correct. In the fix - in order to balance the load across the nodes - kv002:5986 will be demoted to be a fallback and kv004:5986 will be promoted to be a full primary node as shown here:

 

However - there’s more.

DNS, VIPs and things that can go wrong

Alas, if the above really was the source of the problem we wouldn’t have fallen into it. You see, for the resilience to work we need the primary and fallback nodes to “flip” automatically - for this we have some carp magic doing its work. Of course, for that to be picked up by the Tomcat layer above, DNS CNAMES must be used. 

This table is horrible - so sorry - but it shows what is there now and by assuming that we need to keep everything symmetrical - an excellent assumption - we can spot what needs to change. A larger version appears when you click on it - but it essentially shows how to fix the problem.

In the yellow box we see the nodes that are both primary and fallback and the nodes that are unused. The fix - in the green - shows the newly bolded fallback changes that most simply fix the problem. It was important that we were able to “simply” find the fix by only changing the fallback nodes - we avoid a potentially long-running and danger-fraught resharding of lots more data. 

by Enda Farrell at November 24, 2009 11:23 AM

November 23, 2009

Paul Joseph Davis

Erlang NIF Test

Erlang NIF Test

NOTE

This tutorial requires a fairly recent snapshot of Erlang. I'm using the snapshot from November 22nd, 2009. The official release containing the required functionality is slated to be out on November 25th, 2009. You'll see I haven't actually installed the snapshot build in case you're like me and want to wait for an official release.

UPDATE

No more NIF's! The daily snapshot page was updated and the new tarball for today doesn't include the new NIF functions. Luckily there are public mirrors of the code.

$ git clone git://github.com/janl/erlang0d.git otp_src_R13B03
$ cd otp_src_R13B03
$ git checkout origin/R13B03-20091122225501

Erlang NIF's

I've been waiting excitedly for the new Natively Implemented Function (NIF) interface to land in the next Erlang release since I first saw them announced. Then I saw another message form @dizzyco that was more specific. So I did what any normal person would do. Read the test suite and wrote a minimal NIF to figure out the compiling and call semantics.

The NIF Module C API

The first thing to note is that a NIF module has four callbacks that are used for bookkeeping with loading the shared library code: load, reload, upgrade, and unload. Each function gets an ErlNifEnv* argument, a pointer to some driver specific data, and (except unload) an ERL_NIF_TERM load_info argument. The environment and private data pointers are pretty standard for this sort of thing. I'm not entirely certain what load_info is for. The method for initializing NIF modules takes a second parameter which may be what this is for, but I haven't investigated to find out for certain.

After defining each of those four methods, to actually implement the NIF functions we define a function that takes an ErlNifEnv* argument and zero or more positional parameters of type ERL_NIF_TERM. These functions will show up in the Erlang side and can be called as expected.

The code for our minimal NIF module looks like this:

// mynif.c
#include <stdio.h>
#include "erl_nif.h"

static int
load(ErlNifEnv* env, void** priv, ERL_NIF_TERM load_info)
{
    return 0;
}

static int
reload(ErlNifEnv* env, void** priv, ERL_NIF_TERM load_info)
{
    return 0;
}

static int
upgrade(ErlNifEnv* env, void** priv, void** old_priv,
          ERL_NIF_TERM load_info)
{
    return 0;
}

static void
unload(ErlNifEnv* env, void* priv)
{
    return;
}

static ERL_NIF_TERM
do_something(ErlNifEnv* env, ERL_NIF_TERM a1)
{
    unsigned long val;
    if(!enif_get_ulong(env, a1, &val)) {
        return enif_make_badarg(env);
    } else {
        return enif_make_ulong(env, val*2);
    }
}

static ErlNifFunc mynif_funcs[] =
{
    {"do_something", 1, do_something}
};

ERL_NIF_INIT(mynif, mynif_funcs, load, reload, upgrade, unload)
// EOF

That's all fairly straight forward. Define the four required functions and just return 0 to indicate no error. The ErlNifFunc structure appears to be a triple of {name_in_erlang, arity, name_in_c} calls. There's an example in the source on having the same Erlang name and different arities. As you'd expect, you just specify the same string, and change the second value.

The implementation of do_something shows a basic error when the argument is not an unsigned long. We'll test that this works as expected later.

The Erlang API

The Erlang side is pretty simple as well. To load a NIF module we just call erlang:load_nif/2. The first parameter is the path to the shared object to load. The second parameter I just specify as 0 to follow the test code, I've not investigated its use though I assume it shows up in the load_info argument in the module API.

Another thing to note is that the NIF module and its corresponding Erlang module have overlapping function namespaces. When we define a function in the NIF module, it shows up in our Erlang module. The tests use a pattern to throw an error if the Erlang function gets called. In other words, when we load the NIF module it replaces the Erlang definition, so if we hit the Erlang definition we report an error.

Our Erlang code looks like this:

// mynif.erl
-module(mynif).
-export([start/0, do_something/1]).

start() ->
    erlang:load_nif("mynif", 0).

do_something(_Val) ->
    nif_error(?LINE).    

nif_error(Line) ->
    exit({nif_not_loaded,module,?MODULE,line,Line}).
// EOF

Building the Modules

Building appears to just be the standard shared object style. I happened to have an example lying around from my earlier work on an EEP0018 module (which I'll definitely be revisiting now). The linker dark magic is outside this simple example, but there are plenty of places that will explain this. I haven't tested the Linux flags, but they should work just fine.

// Makefile
OTPROOT=/Users/davisp/tmp/otp_src_R13B03/
INCLUDES = -I$(OTPROOT)/erts/emulator/beam/

# OS X flags.
GCCFLAGS = -O3 -fPIC -bundle -flat_namespace -undefined suppress -fno-common -Wall

# Linux Flags
#GCCFLAGS = -O3 -fPIC -shared -fno-common -Wall

CFLAGS = $(GCCFLAGS) $(INCLUDES)
LDFLAGS = $(GCCFLAGS) $(LIBS)

OBJECTS = mynif.o

DRIVER = mynif.so
BEAM = mynif.beam

all: $(DRIVER) $(BEAM)

clean: 
    rm -f *.o *.beam $(DRIVER)

$(DRIVER): $(OBJECTS)
    gcc -o $@ $^ $(LDFLAGS)

$(BEAM): mynif.erl
    erlc $^
# EOF

With all three of those files in your $CWD you should be able to just run make and have the proper output in the same directory.

Running the Example

A sample console log to show that it behaves as expected:

$ ~/tmp/otp_src_R13B03/bin/erl
Erlang R13B03 (erts-5.7.4) [source] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

Eshell V5.7.4  (abort with ^G)
1> mynif:start().
ok
2> mynif:do_something(0).
0
3> mynif:do_something(2).
4
4> mynif:do_something(nil).
** exception error: bad argument
     in function  mynif:do_something/1
        called as mynif:do_something(nil)
5> mynif:do_something(2.3). 
** exception error: bad argument
     in function  mynif:do_something/1
        called as mynif:do_something(2.3)

And there you have it. This is fairly exciting stuff. I've already got a list of projects I'm going to play with integrating into the NIF API to see what type of speedups I can get.

November 23, 2009 05:00 AM

November 21, 2009

Mikeal Rogers

JSON Performance in Python

In part of my ongoing performance work in our CouchDB+Python application I’ve decided to sit down and profile JSON performance in the different open source libraries available for Python.

I ran this test profiling json (pure Python simplejson) available in Python stdlib, simplejson compiled with C speedups, cjson, and jsonlib2, with a large JSON document. The test decodes and encodes a large JSON object 100 times. It then runs that test 100 times in each library in succession in order to find the average encode/decode time for each library and minimize other environmental factors that may occur. These numbers were taken on my MacBook Air running Mac OS X 1.6.1 with the default Python 2.6.

The time represents in milliseconds how long it takes to encode/decode this JSON object 100 times.

JSONPerf

I honestly didn’t expect the stdlib json to be this far behind.

Among the other C based libraries there isn’t a clear winner. cjson is the best decoder but the slowest encoder, simplejson compiled with C speedups is the fastest encoder but the slowest decoder while jsonlib2 is somewhere in the middle for both cases.

Also, annoyingly, cjson doesn’t implement the same API as the other libraries (dump and load functions are named encode and decode) making it much more difficult for a library to include support for all available libraries. Now rather than just being able to add a user defined json module I’ll need to add support for user defined parsing and encoding functions to couchdb-pythonviews, couchquery, and couchdb-wsgi.

by mikeal at November 21, 2009 12:07 AM

November 17, 2009

Volker Mische

FOSS4G 2009: “Geodata and CouchDB” presentation is online

The final wrap-up of the FOSS4G 2009, my presentation on “Geodata and CouchDB” is available online in several formats. It should also be of interest for people who are new to CouchDB as huge parts of the talk are an introduction into CouchDB.

by Volker Mische at November 17, 2009 09:48 AM

Paul Joseph Davis

Unix in 13 lines of Python - Mostly trivial

Unix in 13 lines of Python - Mostly trivial

I found this fairly amusing, but just a big if statement? So I ported it to Python. I was reminded that iterating over stdin in Python is always a bit weird. I generally create an iterator that just yields sys.stdin.readline() forever, but went more hackish here to conserve lines and add lulz.

import sys
print "You have no mail"

commands = {
    "uname": lambda args: "Punix 1.0",
    "halt": lambda args: exit(0)
}
error = lambda args: "Command not found."

sys.stdout.write("$ ")
for line in (sys.stdin.readline() for i in xrange(sys.maxint)):
    print commands.get(line.split()[0], error)(line.split()[1:])
    sys.stdout.write("$ ")

November 17, 2009 05:00 AM

Ricky Ho

Impression on Scala

I have been hearing quite a lot of good comments about the Scala programming language. I personally use Java extensively in the past and have switched to Ruby (and some Erlang) in last 2 years. The following features that I heard about Scala really attracts me ...
  • Scala code is compiled in Java byte code and run natively in JVMs. Code written in Scala immediately enjoy the performance and robustness of Java VM technologies.
  • Easy to integrate with Java code and libraries, immediately enjoy the wide portfolio of exiting Java libraries.
  • It has good support to the Actor model, which I believe is an important programming paradigm for multi-core machine architecture.
So I decide to take a Scala tutorial from Dean Wampler today in the Qcon conference. This is a summary of my impression on Scala after the class.

First of all, Scala is a strongly typed language. However it has a type inference mechanism so you don't have to type declaration is optional. But in some place (like a method signature), type declaration is mandatory. It is not very clear to me when I have to declare a type.

Having the "val" and "var" declaration in variables is very nice because it makes immutability explicit. In Ruby, you can make an object immutable by sending it a freeze() method but Scala do this more explicitly.

But I found it confusing to have a method define in 2 different ways

class A() {
def hello {
...
}
}
class A() {
def hello = {
...
}
}
The MyFunction[+A1, -A2] is really confusing to me. I feel the typeless language is much more easy.

Removing the open and close bracket is also causing a lot of confusion to me.
class Person(givenName: String) {
var myName = givenName
def name =(anotherName: String) = {
myName = anotherName
}
}

class Person(givenName: String) {
var myName = givenName
def name =(anotherName: String) = myName = anotherName
}
The special "implicit" conversion method provides a mechanism to develop DSL (Domain Specific Language) in Scala but it also looks very odd to me. Basically, you need to import a SINGLE implicit conversion method that needs to take care of all possible conversions.

All the method that ends with ":" has a reverse calling order is also an odd stuff to me.

Traits provides mixins for Scala but I feel the "Module" mechanism in Ruby has done a better job.

Scala has the notion of "function" and can pass "function" as parameters. Again, I feel Ruby blocks has done a better job.

Perhaps due to JVM's limitation of supporting a dynamic language, Scala is not very strong in doing meta-programming, Scala doesn't provide the "open class" property where you can modify an existing class (add methods, change method implementation, add class ... etc.) at run time

Scala also emulate a number of Erlang features but I don't feel it is doing a very clean job. For example, it emulate the pattern matching style of Erlang programming using the case Class and unapply() method but it seems a little bit odd to me.

Erlang has 2 cool features which I couldn't find in Scala (maybe I am expecting too much)
  • The ability to run two version of class at the same time
  • Able to create and pass function objects to a remote process (kinda like a remote code loading)
Overall impression

I have to admit that my impression on Scala is not as good as before I attend the tutorial. Scala tries to put different useful programming paradigm in the JVM but I have a feeling of force-fit. Of course its close tie to JVM is still a good reason to use Scala. But from a pure programming perspective, I will prefer to use a combination of Ruby and Erlang, rather than Scala.

by Ricky Ho (rickyphyllis@gmail.com) at November 17, 2009 02:36 AM

November 15, 2009

Ricky Ho

Machine Learning: Association Rule

A typical example is the "market-basket" problem. Lets say a supermarket keep track of all the purchase transactions. Each purchase transaction is a subset of all the item available in the store. e.g. {beer, milk, diaper, butter}.

The problem is: By analyzing a large set of transactions, can be discover the correlation between subsets ? ie: people buying milk and butter has a high tendency of buying diaper. Or people buying diaper tends to buy soda and ice-cream.

Such correlation is called an association rule, which has the following form:
A => B where A, B are disjoint subsets of U (a universal set)

This rule can be interpreted as: From the transaction statistics, people buying all items in set A tends to also buy all items in set B.

Note that people buying both set A AND set B is denoted as (A union B) rather than (A intersect B).

There are two concepts need to be defined here ...

"Support" is defined with respect to a subset X as the % of total transaction that has contains subset X. This can be indicate as P(contains X). e.g. The support of {beer, diaper} is P(contains {beer, diaper}) which means if we randomly pick a transaction, how likely that it will contain both beer and diaper.

"support" of an association rule A => B is defined as the "support" of (A union B)

"Confidence" is defined with respect to a rule (A => B) that given we know a transaction contains A, how likely that it also contains B.

P(contains B | contains A) = P(contains B union A) / P(contains A) which is the same as Support(A union B) / Support(A)

Mining Association Rules

The problem is how can we discover the association rules that has a high enough "support" and "confidence". First of all, an arbitrary threshold of "support" and "confidence" is set according to domain specific concerns. There are two phases.

1) Extract all subset X where support(X) > thresholdOfSupport
2) For all extracted subset X, discover A => B where A is subset of X and B is (X - A)

1) is also known as the "finding frequent subsets" problem. A naive implementation can generate all possible subsets and check their support value. The naive approach has exponential complexity 2 exp(N) .

Apriori Algorithm to find frequent subsets

This algorithm exploit the fact that if X is not a frequent subset, then (X union Anything) will never be a frequent subset. So it starts with scanning small subsets and throw away those that doesn't has high enough support. In other words, it prune the tree as it grows.

Lets say the universal set is {I1, I2, I3, I4, I5, I6}

First round:
  • Generate possible candidates of 1-item subset. ie: {I1}, {I2}, ... {I6}
  • Find out all supports of the candidate set. ie: support({I1}), support({I2}), .... support({I6})
  • Filter out those whose value < supportThreshold
Second round:
  • From the surviving 1-item subset, generate possible 2-item subset candidates. ie: {I1, I2}, {I1, I4} ... Note that we can skip any subset that contains I3 because it is out.
  • Find out all supports of the 2-item candidate set.
  • Filter out those whose value < supportThreshold

K round: (repeat until no more surviving k-1 item subset)
  • From the surviving k-1 item subset, generate possible k-item subset candidates by adding one more item that is not already in the k-1 item subset. Skip any k item subset that contains any throw-away k-1 item candidates from the last round.
  • Calculate the support of k-item candidates
  • Throw away those whose support value < supportThreshold
Find association rules from frequent subsets

After knowing a frequent k-item subset X, we want to find its subset A such that the confidence value of (A => X-A) is higher than the confidence threshold.

Note that confidence = support(X) / support(A)

First round:
  • Within X, generate possible candidates of k-1 item subset.
  • Find out all confidence of the candidate set.
  • Filter out those whose confidence value < confidenceThreshold
  • For those surviving k-1 item subset A, mark the rule (A => X-A)
J round: (repeat until no more surviving j-1 item subset)
  • Within the surviving j-item subset A, generate possible candidates of j-1 item subset.
  • Find out all confidence of the candidate set.
  • Filter out those whose confidence value < confidenceThreshold
  • For those surviving j-1 item subset A', mark the rule (A' => X-A')
Miscellaneous

Note that confidence is absolute but not relative.
When (A => B) has confidence = 75%, it is also possible that (!A => B) has confidence = 90%. In other words, it is possible that some rules are contradict to each other and usually the one with higher support and confidence wins.

by Ricky Ho (rickyphyllis@gmail.com) at November 15, 2009 02:47 PM

November 13, 2009

Ricky Ho

Amazon Cloud Computing

Cloud computing is becoming a very hot area as it provides cost savings and time-to-market benefits to a wide spectrum of organizations.

At the consumer end, small startup companies found Cloud computing can significantly reduce their initial setup cost. Large enterprises also found Cloud computing allows them to improve resource utilization and cost effectiveness, although they also have security and control concerns. Here is a very common cloud deployment model across many large enterprises.

Traditional software companies who distributes software on CD also look into the SaaS model as a new way of doing business. However, a SaaS model typically requires the companies to build some kind of web site. But these companies may not have the expertise to build large scale web sites and operate it. Cloud computing also allows them to outsource the SaaS infrastructure.

Here we look at the leader in the cloud computing provider space. AWS from Amazon.

Amazon Web Service

Amazon is the current leading provider in the Cloud computing space. At the heart of its technology stack (which is known as the Amazon Web Services), it includes an IaaS stack, a PaaS stack and a SaaS stack.
  1. Their IaaS stack includes infrastructure resource such as virtual machine, virtual mount disks, virtual network, load balancer, VPN, Databases.
  2. Their PaaS stack provides a set of distributed computing services including queuing, data storage, metadata, parallel batch processing,
  3. Their SaaS stack provides a set of high level services such as content delivery network, payment processing services, ecommerce fulfillment services.
Since we are focusing in the Cloud Computing aspects, we will describe their IaaS and PaaS stack below but will skip their SaaS stack.


EC2 – Elastic Computing

Amazon has procured a large number of commoditized Intel boxes running virtualization software Xen. On top of Xen, Linux or Windows can be run as the guest OS . The guest operating system can have many variations with different set of software packages installed.

Each configuration is bundled as a custom machine image (called AMI). Amazon host a catalog of AMI for the users to choose from. Some AMI is free while other requires a usage charge. User can also customize their own setup by starting from a standard AMI, make their special configuration changes and then create a specific AMI that is customized for their specific needs. The AMIs are stored in Amazon’s storage subsystem S3.

Amazon also classifies their machines in terms of their processor power (no of cores, memory and disk size) and charged their usage at a different rate. These machines can be run in different network topology specified by the users. There is an “availability zone” concept which is basically a logical data center. “Availability zone” has no interdependency and is therefore very unlikely to fail at the same time. To achieve high availability, users should consider putting their EC2 instances in different availability zones.

“Security Group” is the virtual firewall of Amazon EC2 environment. EC2 instances can be grouped under “security group” which specifies which port is open to which incoming range of IP addresses. So EC2 instances that running applications at various level of security requirements can be put into appropriated security groups and managed using ACL (access control list). Somewhat very similar to what network administrator configure their firewalls.

User can start the virtual machine (called an EC2 instance) by specifying the AMI, the machine size, the security group, and its authentication key via command line or an HTTP/XML message. So it is very easy to startup the virtual machine and start running the user’s application. When the application completes, the user can also shutdown the EC2 instance via command line or HTTP/XML message. The user is only charged for the actual time when the EC2 instance is running.

One of the issue of extremely dynamic machine configuration (such as EC2) is that a lot of configuration setting is transient and does not survive across reboot. For example, the node name and IP address may have been changed, all the data stored in local files is lost. Latency and network bandwidth between machines may also have changed. Fortunately, Amazon provides a number of ways to mitigate these issues.
  • By paying some charge, user can reserve a stable IP address, called “elastic IP”, which can be attached to EC2 instance after they bootup. External facing machine is typically done this way.
  • To deal with data persistence, Amazon also provides a logical network disk, called “elastic block storage” to store the data. By paying some charges, EBS is reserved for the user and it survives across EC2 reboots. User can attach the EBS to EC2 instances after the reboot.

EBS – Elastic Block Storage

Based on RAID disks, EBS provides a persistent block storage device for data persistence where user can attach it to a running EC2 instance within the same availability zone. EBS is typically used as a file system that is mounted to EC2 instance, or as raw devices for database.

Although EBS is a network devices to the EC2 instance, benchmark from Amazon shows that it has higher performance than local disk access. Unlike S3 which is based on eventual consistent model, EBS provides strict consistency where latest updates are immediately available.


CloudWatch -- Monitoring Services

CloudWatch provides an API to extract system level metrics for each VM (e.g. CPU, network I/O and disk I/O) as well as for each load balancer services (e.g. response time, request rate). The collected metrics is modeled as a multi-dimensional data cube and therefore can be queried and aggregated (e.g. min/max/avg/sum/count) in different dimensions, such as by time, or by machine groups (by ami, by machine class, by particular machine instance id, by auto-scaling group).

This metrics is also used to drive the auto-scaling services (described below). Note that the metrics are predefined by Amazon and custom metrics (application level metrics) is not supported at this moment.


Load Balancing Services

Load balancer provides a way to group identical VMs into a pool. Amazon provides a way to create a software load balancer in a region and then attach EC2 instances (of the same region) to the it. The EC2 instances under a particular load balancer can be in different availability zone but they have to be in the same region.


Auto-Scaling Services

Auto-scaling allows the user to group a number of EC2 instances (typically behind the same load balancer) and specify a set of triggers to grow and shrink the group. Trigger defines the condition which is matching the collected metrics from the CloudWatch and match that against some threshold values. When match, the associated action can be to grow or shrink the group.

Auto-scaling allows resource capacity (number of EC2 instances) automatically adjusted to the actual workload. This way user can automatically spawn more VMs as the workload increases and shutdown the VM as the load decreases.


Relational DB Services

RDS is basically running MySQL in the EC2.


S3 – Simple Storage Service

Amazon S3 provides a HTTP/XML services to save and retrieve content. It provides a file system-like metaphor where “objects” are group under “buckets”. Based on a REST design, each object and bucket has its own URL.

With HTTP verbs (PUT, GET, DELETE, POST), user can create a bucket, list all the objects within the bucket, create object within a bucket, retrieve an object, remove an object, remove a bucket … etc.

Under S3, each object has a unique URI which serves as its key. There is no query mechanism in S3 and User has to lookup the object by its key. Each object is stored as an opaque byte array with maximum 5GB size. S3 also provides an interesting partial object retrieval mechanism by specifying the ranges of bytes in the URL.

However, partial put is not current support but it can be simulated by breaking the large object into multiple small objects and then do the assembly at the app level. Breaking down the object also help to speed up the upload and download by doing the data transfer in parallel.

Within Amazon S3, each S3 objects are replicated across 2 (or more) data center and also cache at the edge for fast retrieval.

Amazon S3 is based on an “eventual consistent” model which means it is possible that an application won’t see the change it just made. Therefore, some degree of tolerance of inconsistent view is required by the application. Application should avoid the situation of having two concurrent modifications to the same object. And application should wait for some time between updates, and also should expect all the data it reads is potentially stale for few seconds.

There is also no versioning concept in S3, but it is not hard to build one on top of S3.


SimpleDB – queriable data storage

Unlike S3 where data has to be looked up by key, SimpleDB provides a semi-structured data store with querying capability. Each object can be stored as a number of attributes where the user can search the object by the attribute name.

Similar to the concepts of “buckets “ and “objects” in S3, SimpleDB is organized as a set of “items” grouped by “domains”. However, each item can have a number of “attributes” (up to 256). Each attribute can store one or multiple values and the value must be a string (or a string array in case of multi-valued attribute). Each attribute can store up to 1K bytes, so it is not appropriate to store binary content.

SimpleDB is typically used as a metadata store in conjuction with S3 where the actual data is being stored. SimpleDB is also schema-less. Each item can define its own set of attributes and is free to add more or remove some attributes at runtime.

SimpleDB provides a query capability which is quite different from SQL. The “where” clause can only match an attribute value with a constant but not with other attributes. On the other hand, the query result only return the name of the matched items but not the attributes, which means subsequent lookup by item name is needed. Also, there is no equivalent of “order by” and the returned query result is unsorted.

Since all attribute are store as strings (even number, dates … etc). All comparison operation is done based on lexical order. Therefore, special encoding is needed for data type such as date, number to string to make sure comparison operation is done correctly.

SimpleDB is also based on an eventual consistency model like S3.


SQS – Simple Queue Service

Amazon provides a queue services for application to communicate in an asynchronous way with each other. Message (up to 256KB size) can be sent to queues. Each queue is replicated across multiple data centers.

Enterprises use HTTP protocol to send messages to a queue. “At least once” semantics is provided, which means, when the sender get back a 200 OK response, SQS guarantees that the message will be received by at least one receiver.

Receiving messages from a queue is done by polling rather than event driven calling interface. Since messages are replicated across queues asynchronously, it is possible that receivers only get some (but not all) messages sent to the queue. But the receiver keep polling the queue, he will eventually get all messages sent to the queue. On the other hand, message can be delivered out of order or delivered more than once. So the message processing logic needs to be idempotent as well as independent of message arrival order.

Once message is taken by a receiver, the message is invisible to other receivers for a period of time but it is not gone yet. The original receiver is supposed to process the message and make an explicit call to remove the message permanently from the queue. If such “removal” request is not made within the timeout period, the message will be visible in the queue again and will be picked up by subsequent receivers.


Elastic Map/Reduce

Amazon provides an easy way to run Hadoop Map/Reduce in the EC2 environment. They provide a web UI interface to start/stop a Hadoop Cluster and submit jobs to it. For a detail of how Hadoop works, see here.

Under elastic MR, both input and output data are stored into S3 rather than HDFS. This means data need to be loaded to S3 before the Hadoop processing can be started. Elastic also provides a job flow definition so user can concatenate multiple Map/Reduce job together. Elastic MR supports the program to be written in Java (jar) or any programming language (Hadoop streaming) as well as PIG and Hive.


Virtual Private Cloud

VPC is a VPN solution such that the user can extend its data center to include EC2 instances running in the Amazon cloud. Notice that this is an "elastic data center" because its size can grow and shrink when the user starts / stops EC2 instances.

User can create a VPC object which represents an isolated virtual network in the Amazon cloud environment and user can create multiple virtual subnets under a VPC. When starting the EC2 instance, the subnet id need to be specified so that the EC2 instance will be put into the subnet under the corresponding VPC.

EC2 instances under the VPC is completely isolated from the rest of Amazon's infrastructure at the network packet routing level (of course it is software-implemented isolation). Then a pair of gateway objects (VPN Gateway on the Amazon side and Customer gateway on the data center side) need to be created. Finally a connection object is created that binds these 2 gateway objects together and then attached to the VPC object.

After these steps, the two gateway will do the appropriate routing between your data center and the Amazon VPC with VPN technologies used underneath to protect the network traffic.


Things to watch out

I personally think Amazon provides a very complete set of services that is sufficient for a wide spectrum of deployment scenarios. Nevertheless, there are a number of limitations that needs to pay attention to …
  • There are no Cloud standards today. Whatever choice made for a provider will imply some degree of lock-in to a vendor specific architecture. Amazon is no exception. One way to minimize such lock-in is to introduce an insulation layer to localize all the provider-specific API.
  • Cloud providers typically run their infrastructure on low-cost commodity hardware inside some data center with network connected between them. Amazon is not making their hosting environment very transparently and so it is not very clear how much reliability one can expect from their environment. On the other hand, the SLA guarantee that Amazon is willing to provide is relatively low.
  • Multicast communication is not supported between EC2 instances. This means application has to communicate using TCP point-to-point protocol. Some cluster replication framework based on IP multicast simply doesn’t work in EC2 environment.
  • EBS currently cannot be attached to a multiple EC2 instance at the same time. This means some application (e.g. Oracle cluster) which based on having multiple machines accessing a shared disk simply won’t work in EC2 environment.

by Ricky Ho (rickyphyllis@gmail.com) at November 13, 2009 04:52 PM

Upstream

Awesome presentations with boom amazing

Screen shot 2009-11-13 at 14.56.03
Boom amazing in action at JSConf.EU

Last weekend at JSConf.EU I gave another talk on CouchDB and CouchApps. This gave me an excellent excuse to hack on boom amazing instead of preparing my talk. Boom amazing is my personal presentation tool. Instead of slides my presentation is laid out on a single large surface and I can pan around, zoom in and out and rotate my viewpoint to show specific contents.

In the end I managed to add a bunch of new features and still get my talk done. The following post explains how boom amazing works and how you can create your own fancy presentations with it.

What is behind it?

Boom amazing consists of a few relatively simple building blocks:

  • One or more SVG files that contain all the texts and graphics
  • a CouchApp that includes some HTML and CSS, and JavaScript
  • some client side JavaScript to handle manipulating SVG and moving around (using JQuery and the jQuery SVG plugin)
  • CouchDB: stores all the data: the presentation itself (as document attachments), the CouchApp and all the camera positions and data for replaying presentations
  • A browser that displays the SVG file, I use PlainView which has a fullscreen mode.

Generating SVGs on OSX…

… pretty much sucks. I’ve tried a few tools already but still haven’t found anything that is simple and stable. I first tried Adobe Illustrator which can export to SVG, but first of all it’s slow, it sometimes crashes and I’m not willing to pay that much money just for drawing a few circles and write some text.

My current setup consists of Vector Designer for drawing and Scribus for converting to SVG. While I do like working in Vector Designer very much it doesn’t have an SVG exporter, it limits the maximum zoom and it has problems with the kerning when I set the font size below a certain level. I might be able to work around the zoom and kerning problems by simply using larger page sizes (this time I used 2×2 meters). I then export my drawing to the EPS format which works okay, except that the day before the conference Vector Designer suddenly created file sizes of 115MB instead of a few hundred KB like before. Which brings me to Scribus. It’s an open source vector drawing app with a truly ugly user interface, but it can import EPS and export SVG. At least sometimes. For me it worked in ~40% of the cases. Everything else was crashes, unknown import errors and missing data.

There’s a command line tool called pdf2svg (which you can install using MacPorts) which I also tried, but the resulting SVGs are much larger than what Scribus creates and the browser refused to display them.

There is hope though, a one man company called Bohemian Coding is working on a vector tool called Sketch and their previous apps look really slick.

If you’re not on OSX or can tolerate working with X11 apps there’s also Inkscape which has built-in SVG support.

Using it

In order to use boom amazing all you have to do is download it and push it into CouchDB.

git clone git://github.com/langalex/boom_amazing.git
cd boom_amazing
curl -X PUT http://localhost:5984/boom_amazing
couchapp push boom_amazing

After that go to futon and upload your SVG file into any document in the boom_amazing database. Then you go to http://localhost:5984/boom_amazing/_design/boom_amazing/index.html. When you move your mouse to the top a toolbar appears where you can select your presentation from the dropdown. Once the file has loaded you can:

  • pan around by click & dragging the mouse
  • zoom by pressing & holding [alt] and moving the mouse up and down
  • rotate by holding down [ctrl] and moving the mouse left and right

Whenever you like a position you can save that by clicking the save button in the toolbar. After you have recorded all positions you can then jump back and forth between them with the Previous and Next buttons or by pressing [b] and [space] – which is what I do when presenting.

Hacking

If you are interested in extending the app here are the basics of how it works:

In _attachments/javascripts/app.js you find a small sammy application which controls all the actions like loading and saving data. Sammy maps the hash part of the URL to different actions, so when a link points to e.g. #/slides/23 sammy will automatically call the code to load slide (slide refers to a saved position) 23.

screen.js includes all the logic that handles mouse and keyboard events to move around. It also does all the computations for things like smooth transitions between positions and rotating around the center of the screen instead of just (0, 0).

The way this all works is that the jQuery SVG plugin loads the file and turns everything into DOM nodes. The application then takes all the nodes and puts them into a single group node. A group node can have an attribute transforms which expects a rotate, scale and translate parameter – the rest is up to the browser.

Conclusion

The feedback I have been getting for the talks where I have used boom amazing has been great so I am going to continue to use and extend it. Right now performance with larger SVGs is an issue but I don’t have any ideas wether I can do anything about it – except for buying faster laptops. There are also still a bunch of basic features missing, like editing or removing saved positions, or reordering.

I would love to see other people to pick up boom amazing and use it for their presentations. It certainly is more work than putting together a bunch of slides but then again I think it is worth the effort, and the results easily make any fancy Keynote presentation look like 1990. :D

by Alexander Lang at November 13, 2009 02:44 PM

November 12, 2009

Till Klampäckel

PHP: So you'd like to migrate from MySQL to CouchDB? - Part II

This is part II of my introductory series to move from MySQL a relational database (management) system to CouchDB. I will be using MySQL as an example. Part I of this series is available here.

Recap

In part I, I introduced CouchDB by explaining its basics. I continued by showing a simple request to create a document using curl (on the shell) and expanded how the same request could be written in PHP (using ext/curl) — or in HTTP_Request2 or with phpillow.

Part II will focus on the most basic operations and help you build a small wrapper for CouchDB. The code will be available on Github.

Getting data into CouchDB

As I explained before — CouchDB is all HTTP.

But for starters, we'll take a shortcut here and briefly talk about CouchDB's nifty administration frontend called futon. Think of Futon as a really easy to use phpMyAdmin. And by default futon is included in your CouchDB installation.

Features

Futon will allow you to virtually do anything you need:

  • create a database
  • create a document
  • create a view
  • build the view
  • view ;-) (documents, views)
  • update all of the above
  • delete all of the above
  • replicate with other CouchDB servers

Assuming the installation completed successfully, and CouchDB runs on 127.0.0.1:5984, the URL to your very own Futon is http://127.0.0.1:5984/_utils/. And it doesn't hurt to create a bookmark while you are at it.

Why GUI?

Purists will always argue that using a GUI is slower than for example hacking your requests from the shell.

That may be correct once you are a level 99 CouchDB super hero, but since we all start at level 1, Futon is a great way to interact with CouchDB to learn the basics. And besides, even level 99 CouchDB super heroes sometimes like to just click and see, and not type in crazy hacker commands to get it done.

I'll encourage everyone to check it out.

Read, write, update and delete.

Sometimes also referred to as CRUD (create, read, update, delete) — read, write, update and delete are the basics most web applications do.

Since most of you have done a "write a blog in X"-tutorial before — and I personally am tired of writing blogs in different languages or with different backends — let's use another example.

Think about a small guestbook application, it does all of the above — a fancy guest will even do update. For the sake of simplicity, I'll skip on the frontend in this example and we'll work on the backend and essentially create a small wrapper class for CouchDB.

Operations

By now — "CouchDB is all HTTP" — should sound all familiar. So in turn, all these CRUD operations in CouchDB translate to the following HTTP request methods:

  • write/create - PUT or POST
  • read - GET
  • update - PUT or POST
  • delete - DELETE

On write

Whenever you supply an ID of a new document along with the document, you should use PUT.

When you don't care about the document ID, use POST instead, and CouchDB will generate a unique ID for you.

This unique ID will not look like an autoincremented integer, but we should not cling to this concept anyway. Without diving into too advanced context now, but the auto_increment feature in MySQL is a little flawed in general and in a distributed context especially. More on this (maybe) in a later part of this series — in the mean-time, check out Joshua Schachter's post.

On update

By default, CouchDB keeps a revision ID of each document. To many this is a pretty cool feature — out of the box, so to speak. But there are two very important and fundamental things to be aware of.

  1. CouchDB will keep previous revisions of a document around until you run compact on the database. Think of compact as a house keeping chore. It will wipe your database clean of previous revisions and even optimize the indices (in case you had a lot of data changing operations in the mean time). For CouchDB revisions are especially important in a distributed context (think replication — more on this later) and while it's cool to have them, they should not be used as a feature and be exposed to the user of your application.

  2. In case we decide a document, we always have to provide the previous (or current) revision of the document. This sounds strange, but the reasons are simple — in case another update gets in between all we have to do is provide the necessary interfaces and workflows in our application to alert the user and avoid a possible conflict.

CouchDB and the HTTP standard

CouchDB's API adheres to the above in 99.999999999% of the time. And it only breaks the pattern once. The exception to the rule is that when you bulk request multiple documents, which is strictly speaking a GET operation CouchDB will allow you to post in this case.

The reason for this is that the length of GET request is limited (by RFC) and if we exceeded this by requesting too many document IDs, we would hit a hard limit. To get around this, the operation is POST — but more on this later.

Requirements

For the following examples we assume a working CouchDB installation and a database called guestbook. No admins are set up — we can read and write without permission.

For simplicty, we imagine a form with the following fields:

  • author
  • entry

... and in addition to those two keys that may be populated by the user we add:

  • type (always: guestbook)
  • added (a date of some kind)

... the last two are not absolutely necessary, but will come handy in future iterations of this tutorial.

Also, on the tech side, we need a webserver with PHP and HTTP_Request2 (pear install HTTP_Request2-alpha) in your include_path. :-)

by Till Klampaeckel (till@php.net) at November 12, 2009 07:13 PM

Ricky Ho

Machine Learning with Linear Model

Linear Model is a family of model-based learning approaches that assume the output y can be expressed as a linear algebraic relation with the input attributes x1, x2 ...

The input attributes x1, x2 ... is expected to be numeric and the output is expected to be numeric as well.

Here our goal is to learn the parameters of the underlying model, which is the coefficients.

Linear Regression

Here the input and output are both numeric, related through a simple linear relationship. The learning goal is to figure out the hidden weight value (ie: the W vector).

Notice that non-linear relationship is equivalent of a linear relationship at a higher dimension. e.g. if x2 = x1 * x1, then it becomes a quadratic relationship. Because of this, the polynomial regression can be done using linear regression technique.

Given a batch of training data, we want to figure out the weight vector W such that the total sum of error (which is the difference between the predicted output and the actual output) to be minimized.

Instead of using the batch processing approach, a more effective approach is to learn incrementally (update the weight vector for each input data) using a gradient descent approach.

Gradient Descent

Gradient descent is a very general technique that we can use to incrementally adjust the parameters of the linear model. The basic idea of "gradient descent" is to adjust each dimension (w0, w1, w2) of the W vector according to their contribution of the square error. Their contribution is measured by the gradient along the dimension which is the differentiation of the square error with respect to w0, w1, w2.

In the case of Linear Regression ...

Logistic Regression

Logistic Regression is used when the output y is binary and not a real number. The first part is the same as linear regression while a second step sigmod function is applied to clamp the output value between 0 and 1.

We use the exact same gradient descent approach to determine the weight vector W.



Neural Network

Inspired by how our brain works, Neural network organize many logistic regression units into layers of perceptrons (each unit has both input and outputs in binary form).

Learning in Neural network is to discover all the hidden values of w. In general, we use the same technique above to adjust the weight using gradient descent layer by layer. We start from the output layer and move towards the input layer (this technique is called backpropagation). Except the output layer, we don't exactly know the error at the hidden layer, we need to have a way to estimate the error at the hidden layers.

But notice there is a symmetry between the weight and the input, we can use the same technique how we adjust the weight to estimate the error of the hidden layer.



by Ricky Ho (rickyphyllis@gmail.com) at November 12, 2009 01:15 AM

November 10, 2009

Chris Anderson

What #NoSQL means to me: No SQL in HTML5

A popular question these days has been "What does NoSQL mean?"

Some say it means "Not only SQL" or something. But to me it means something different.

NoSQL in HTML5

In my CouchDB talk at the original NoSQL event I introduce CouchDB's document-oriented approach to distribution and concurrency. But mostly I talk about the advantages of local data. CouchDB handles the syncing problem so you can think about your application.

If you build web apps and you aren't paying attention to the HTML5 storage question then you also probably aren't reading this blog.

If you ask me what NoSQL means to me, it means No SQL in HTML5. From my perspective, we seem to be winning that struggle. It's widely recognized that SQL doesn't belong in the browser, but we do want something.

What do we want if not SQL? There is WebSimpleDB which totally does the trick as far as I'm concerned (you can build a Couch on top of it) but I feel for the implementors, as it has a fair amount of unnecessary complexity. I think all in all implementation is more important than specification anyway.

I've been working with friends on the JSONDB specification but I'm afraid I'm stuck for now, waiting for a friendly implementation. The important point I'm trying to make is that there are only a couple of features in WebSimpleDB that we need in order to build a Couch, so if someone is implementing and can point me to a branch where I can start building the JavaScript pieces, maybe we can build a WebCouch, which will be a powerful use case for any specification.

by jchris at November 10, 2009 12:09 AM

November 09, 2009

Ricky Ho

Support Vector Machine

Support vector machine is a very powerful classification technique. Its theory is based on the linear model but can also handle non-linear model very well. It is also immute to the curse of high dimensionality.

In support vector machine (SVM), inputs are numeric and output are binary. Each data sample can be consider as a m dimension point label as + or - depends on the output.

Optimal separating hyperplane
Assume there are m numeric input attributes, the key approach of SVM is to try finding a (m - 1) dimension hyperplane which can separate the points in the best way. (ie: all the +ve points on one side of the plane and all the -ve points on the other side).

There are many planes that divide the regions. But we need to find the red line which has the maximum margin.


Sometimes there may be noise or variation that not all points lies in the same side of the plane. So we modify the equation to allow for some errors in the constraints and we want to minimize the overall errors in the optimization goal.


At first glance, it seems like the classification will be O(n) where n is the size of training data. This is not the case because most of the alpha values are zero except the supporting vectors (points touching the margin band) which is a very small value.

Non-Linearity
So far, we have made an assumption that the data is "linearly separable". What if this assumption is not true ? (e.g. y = a1.x1.x1 + a2.x1.x2)

The answer is to create another attribute x3 = x1.x1 and x4 = x1.x2.
In other words, we can always make a non-linear equation becomes linear by introducing extra variables which is a non-linear combination of existing variables. Notice that adding these extra variables effectively is increasing the dimension of the data, but we maintain the linearity of the data point.

As an example, a quadratic equation y = 3x.x + 2x + 5 is a one variable, non-linear equation. But if we introduce z = x.x, then it becomes a 2 variable, linear equation with
y = 3z + 2x + 5

So by adding more attributes to increase the dimensionality of the data points, we can keep the model in linear form. So, we can solve non-linear model by transforming the current data into a higher dimension (adding extra attributes by combining existing attributes in a non-linear way). And then apply the hyperplane separation technique described above to build the model (figure out the alpha value) and use that to classify new data points.

But ! How do we decide what extra attributes should be added and how they should be composed from existing attributes, and how many of them do we need to reconstruct the linearity ?

The Kernal Trick
From examine the above algorithm, an interesting finding is that we only need to know the dot product between two data points but not individual input attribute values. In other words, we don't need to care about how to calculate the extra attributes as long as we know how to calculate the dot product of the new transform space.


The process of using SVM is the same as the other machine learning algorithms
  1. Pick a tool (such as libSVM)
  2. Prepare the input data (convert them to numeric or filter them, normalize their range)
  3. Pick a Kernel function and its parameters
  4. Run cross-validation against different combination of parameters
  5. Pick the best parameter and retrain them.
  6. Now we have the learned model, we can use this for classifying new data

by Ricky Ho (rickyphyllis@gmail.com) at November 09, 2009 08:02 PM

November 06, 2009

Ricky Ho

Hadoop Map/Reduce Implementation

In my previous post, I talk about the methodology of transforming a sequential algorithm into parallel. After that, we can implement the parallel algorithm, one of the popular framework we can use is the Apache Opensource Hadoop Map/Reduce framework.

Functional Programming

Multithreading is one of the popular way of doing parallel programming, but major complexity of multi-thread programming is to co-ordinate the access of each thread to the shared data. We need things like semaphores, locks, and also use them with great care, otherwise dead locks will result.

If we can eliminate the shared state completely, then the complexity of co-ordination will disappear. This is the fundamental concept of functional programming. Data is explicitly passed between functions as parameters or return values which can only be changed by the active function at that moment. Imagine functions are connected to each other via a directed acyclic graph. Since there is no hidden dependency (via shared state), functions in the DAG can run anywhere in parallel as long as one is not an ancestor of the other. In other words, analyze the parallelism is much easier when there is no hidden dependency from shared state.

Map/Reduce functions

Map/reduce is a special form of such a DAG which is applicable in a wide range of use cases. It is organized as a “map” function which transform a piece of data into some number of key/value pairs. Each of these elements will then be sorted by their key and reach to the same node, where a “reduce” function is use to merge the values (of the same key) into a single result.


map(input_record) {

emit(k1, v1)

emit(k2, v2)

}

reduce (key, values) {
aggregate = initialize()
while (values.has_next) {
aggregate = merge(values.next)
}
collect(key, aggregate)
}

The Map/Reduce DAG is organized in this way.



A parallel algorithm is usually structure as multiple rounds of Map/Reduce




Distributed File Systems

The distributed file system is designed to handle large files (multi-GB) with sequential read/write operation. Each file is broken into chunks, and stored across multiple data nodes as local OS files.



There is a master “NameNode” to keep track of overall file directory structure and the placement of chunks. This NameNode is the central control point and may re-distributed replicas as needed.

To read a file, the client API will calculate the chunk index based on the offset of the file pointer and make a request to the NameNode. The NameNode will reply which DataNodes has a copy of that chunk. From this points, the client contacts the DataNode directly without going through the NameNode.

To write a file, client API will first contact the NameNode who will designate one of the replica as the primary (by granting it a lease). The response of the NameNode contains who is the primary and who are the secondary replicas. Then the client push its changes to all DataNodes in any order, but this change is stored in a buffer of each DataNode. After changes are buffered at all DataNodes, the client send a “commit” request to the primary, which determines an order to update and then push this order to all other secondaries. After all secondaries complete the commit, the primary will response to the client about the success.

All changes of chunk distribution and metadata changes will be written to an operation log file at the NameNode. This log file maintain an order list of operation which is important for the NameNode to recover its view after a crash. The NameNode also maintain its persistent state by regularly check-pointing to a file.

In case of the NameNode crash, all lease granting operation will fail and so any write operation is effectively fail also. Read operation should continuously to work as long as the clinet program has a handle to the DataNode. To recover from NameNode crash, a new NameNode can take over after restoring the state from the last checkpoint file and replay the operation log.

When a DataNode crashes, it will be detected by the NameNode after missing its hearbeat for a while. The NameNode removes the crashed DataNode from the cluster and spread its chunks to other surviving DataNodes. This way, the replication factor of each chunk will be maintained across the cluster.

Later when the DataNode recover and rejoin the cluster, it reports all its chunks to the NameNode at boot time. Each chunk has a version number which will advanced at each update. Therefore, the NameNode can easily figure out if any of the chunks of a DataNode becomes stale. Those stale chunks will be garbage collected at a later time.


Job Execution

Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).

The job execution starts when the client program uploading three files: “job.xml” (the job config including map, combine, reduce function and input/output data path, etc.), “job.split” (specifies how many splits and range based on dividing files into ~16 – 64 MB size), “job.jar” (the actual Mapper and Reducer implementation classes) to the HDFS location (specified by the “mapred.system.dir” property in the “hadoop-default.conf” file). Then the client program notifies the JobTracker about the Job submission. The JobTracker returns a Job id to the client program and starts allocating map tasks to the idle TaskTrackers when they poll for tasks.




Each TaskTracker has a defined number of "task slots" based on the capacity of the machine. There are heartbeat protocol allows the JobTracker to know how many free slots from each TaskTracker. The JobTracker will determine appropriate jobs for the TaskTrackers based on how busy thay are, their network proximity to the data sources (preferring same node, then same rack, then same network switch). The assigned TaskTrackers will fork a MapTask (separate JVM process) to execute the map phase processing. The MapTask extracts the input data from the splits by using the “RecordReader” and “InputFormat” and it invokes the user provided “map” function which emits a number of key/value pair in the memory buffer.

When the buffer is full, the output collector will spill the memory buffer into disk. For optimizing the network bandwidth, an optional “combine” function can be invoked to partially reduce values of each key. Afterwards, the “partition” function is invoked on each key to calculate its reducer node index. The memory buffer is eventually flushed into 2 files, the first index file contains an offset pointer of each partition. The second data file contains all records sorted by partition and then by key.

When the map task has finished executing all input records, it start the commit process, it first flush the in-memory buffer (even it is not full) to the index + data file pair. Then a merge sort for all index + data file pairs will be performed to create a single index + data file pair.

The index + data file pair will then be splitted into are R local directories, one for each partition. After all the MapTask completes (all splits are done), the TaskTracker will notify the JobTracker which keeps track of the overall progress of job. JobTracker also provide a web interface for viewing the job status.

When the JobTracker notices that some map tasks are completed, it will start allocating reduce tasks to subsequent polling TaskTrackers (there are R TaskTrackers will be allocated for reduce task). These allocated TaskTrackers remotely download the region files (according to the assigned reducer index) from the completed map phase nodes and concatenate (merge sort) them into a single file. Whenever more map tasks are completed afterwards, JobTracker will notify these allocated TaskTrackers to download more region files (merge with previous file). In this manner, downloading region files are interleaved with the map task progress. The reduce phase is not started at this moment yet.

Eventually all the map tasks are completed. The JobTracker then notifies all the allocated TaskTrackers to proceed to the reduce phase. Each allocated TaskTracker will fork a ReduceTask (separate JVM) to read the downloaded file (which is already sorted by key) and invoke the “reduce” function, which collects the key/aggregatedValue into the final output file (one per reducer node). Note that each reduce task (and map task as well) is single-threaded. And this thread will invoke the reduce(key, values) function in assending (or descending) order of the keys assigned to this reduce task. This provides an interesting property that all entries written by the reduce() function is sorted in increasing order. The output of each reducer is written to a temp output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.

The Map/Reduce framework is resilient to crashes of any components. TaskTracker nodes periodically report their status to the JobTracker which keeps track of the overall job progress. If the JobTracker hasn’t heard from any TaskTracker nodes for a long time, it assumes the TaskTracker node has been crashed and will reassign its tasks appropriately to other TaskTracker nodes. Since the map phase result is stored in the local disk, which will not be available when the TaskTracker node crashes. In case a map-phase TaskTracker node crashes, the crashed MapTasks (regardless of whether it is complete or not) will be reassigned to a different TaskTracker node, which will rerun all the assigned splits. However, the reduce phase result is stored in HDFS, which is available even the TaskTracker node crashes. Therefore, in case a reduce-phase TaskTracker node crashes, only the incomplete ReduceTasks need to be reassigned to a different TaskTracker node, where the incompleted reduce tasks will be re-run.

The job submission process is asynchronous. Client program can poll for the job status at any time by supplying the job id.

by Ricky Ho (rickyphyllis@gmail.com) at November 06, 2009 08:37 PM

What Hadoop is good at

Hadoop is getting more popular these days. Lets look at what it is good at and what not.

The Map/Reduce Programming model
Map/Reduce offers a different programming model for handling concurrency than the traditional multi-thread model.

Multi-thread programming model allows multiple processing units (with different execution logic) to access the shared set of data. To maintain data integrity, each processing units co-ordinate their access to the shared data by using Locks, Semaphores. Problem such as "race condition", "deadlocks" can easily happen but hard to debug. This makes multi-thread programming difficult to write and hard to maintain. (Java provides a concurrent library package to ease the development of multi-thread programming)

Data-driven programming model feeds data into different processing units (with same or different execution logic). Execution is triggered by arrival of data. Since processing units can only access data piped to them, data sharing between processing units is prohibited upfront. Because of this, there is no need to co-ordinate access to data.

This doesn't mean there is no co-ordination for data access. We should think of the co-ordination is done explicitly by the graph. ie: by defining how the nodes (processing units) are connected to each other via data pipes.

Map-Reduce programming model is a specialized form of data-driven programming model where the graph is defined as a "sequential" list of MapReduce jobs. Within each Map/Reduce job, execution is broken down into a "map" phase and a "reduce" phase. In the map phase, each data split is processed and one or multiple output is produced with a key attached. This key is used to route the outputs (of the Map phase) to the second "reduce" phase, where data with the same key is collected and processed in an aggregated way.

Note that in a Map/Reduce model, parallelism happens only within a Job and execution between jobs are done in a sequential manner. As different jobs may access the same set of data, knowing that jobs is executed serially eliminate the needs of coordinating data access between jobs.

Design application to run in Hadoop is a matter of breaking down the algorithm in a number of sequential jobs and then exploit data parallelism within each job. Not all algorithms can fit in to the Map Reduce model. For a more general approach to break down an algorithm into parallel, please visit here.

Characteristics of Hadoop Processing
A detail explanation of Hadoop implementation can be found here. Basically Hadoop has the following characteristics ...
  • Hadoop is "data-parallel", but "process-sequential". Within a job, parallelism happens within a map phase and a reduce phase. But these two phases cannot be run in parallel, the reduce phase cannot be started until the map phase is fully completed.
  • All data being accessed by the map process need to be freezed (update cannot happen) until the whole job is completed. This means Hadoop processes data in chunks using a batch-oriented fashion, making it not very suitable for stream-based processing where data flows in continuously and immediate processing is needed.
  • Data communication happens via a distributed file system (HDFS). Latency is introduced as extensive network I/O is involved in moving data around (ie: Need to write 3 copies of data synchronously). This latency is not an issue for batch-oriented processing where throughput is the primary factor. But this means Hadoop is not suitable for online access where low latency is critical.
Given the above characteristics, Hadoop is NOT good at the following ...
  • Perform online data access where low latency is critical
  • Perform random ad/hoc processing of a small subset of data within a large data set
  • Perform real-time, stream-based processing where data is arrived continuously and immediate processing is needed.

by Ricky Ho (rickyphyllis@gmail.com) at November 06, 2009 05:24 AM

November 05, 2009

Ricky Ho

Notes on Memcached

Some notes about Memcached. Here is its architecture.


How it works ?
Memcached is organized as a farm of N servers. The storage model can be considered as a huge HashTable partitioned among these N servers.

Every API request takes a "key" parameter. There is a 2-step process at the client lib ...
  • Given the key, locate the server
  • Forward the request to that server
The server receiving the request will do a local lookup for that key. The servers within the farm doesn't gossip with each other at all. Each server use asynchronous, non-blocking I/O and one thread can be used to handle large number of incoming TCP sockets. Actually a thread pool is being used but the number of threads is independent of the number of incoming sockets. This architecture is highly scalable for large number of incoming network connections.

API
Memcached provide a HashTable-like interface, so it has ...
  • get(key)
  • set(key, value)
Memcached also provides a richer "multi-get" so that one read request can retrieve values for multiple keys. The client library will issue different requests to multiple servers and doing the lookup in parallel.
  • get_multi(["k1", "k2", "k3"])
Some client lib offers a "master-key" concept such that a key contains 2 parts, the prefix master-key and the suffix key. In this model, the client lib only use the prefix to located the server (rather than looking at the whole key) and then pass the suffix key to that server. So user can group entries to be stored by the same server by using the same prefix key.
  • get_multi(["user1:k1", "user1:k2", "user1:k3"]) -- This request just go to the server hosting all keys of "user1:*"

For updating data, Memcached provides a number of variations.
  • set(key, value, expiration) -- Memcached guarantees the item will never be staying in the cache once the expiration time is reached. (Note that it is possible that the item being kicked out before expiration due to cache full)
  • add(key, value, expiration) -- Success only when no entry of the key exist.
  • replace(key, value, expiration) -- Success only when an entry of the key already exist.
Server Crashes
When one of the server crashes, all entries owned by that server is lost. Higher resilience can be achieved by storing redundant copies of data in different servers. Memcached has no support for data replication. This has to be taken care by the application (or client lib).

Note that the default server hashing algorithm doesn't handle the growth and shrink of the number of servers very well. When the number of servers changes, the ownership equation (key mod N) will all be wrong. In other words, if the crashed server needs to be taken out from the pool, the total number of servers will be decreased by one and all the existing entries needs to be redistributed to different server. Effectively, the whole cache (among all server) is invalidated even when just one server crashes.

So one approach to address this problem is to retain the number of Memcached servers across system crashes. We can have a monitor server to detect the heartbeat of all Memcached server and in case any crashes is detected, start a new server with the same IP address as the dead server. In this case, although the new server will still lost all the entries and has to repopulate the cache, the ownership of the keys are unchanged and data within the surviving node doesn't need to be redistributed.

Another approach is to run logical servers within a farm of physical machines. When a physical machine crashes, its logical servers will be re-start in the surviving physical machines. In other words, the number of logical servers is unchanged even when crashes happens. This logical server approach is also good when the underlying physical machines has different memory capacity. We can start more Memcached process in the machine with more memory and proportionally spread the cache according to memory capacity.

We also can use a more sophisticated technique called "consistent hashing", which localize the ownership changes to just the neighbor of the crashed server. Under this schema, each server is assigned with an id under the same key space. The ownership of a key is determined by the closest server whose key is the first one encountered when walking in the anti-clockwise direction. When a server crashes, its immediate upstream neighbor server (walking along the anti-clockwise direction) will adopt the key ownership of the dead server, while all other servers has the same ownership of key range unchanged.


Atomicity

Each request to Memcached is atomic by itself. But there is no direct support for atomicity across multiple requests. However, App can implement its own locking mechanism by using the "add()" operation provide by Memcached as follows ...
success = add("lock", null, 5.seconds)
if success
set("key1", value1)
set("key2", value2)
cache.delete("lock")
else
raise UpdateException.new("fail to get lock")
end

Memcached also support a "check-and-set" mechanism that can be used for optimistic concurrency control. The basic idea is to get a version stamp when getting an object and pass that version stamp in the set method. The system will verify the version stamp to make sure the entry hasn't been modified by something else or otherwise, fail the update.
data, stamp = get_stamp(key)
...
check_and_set(key, value1, stamp)

What Memcached doesn't do ?
Memcached's design goal is centered at performance and scalability. By design, it doesn't deal with the following concerns.
  • Authentication for client request
  • Data replication between servers for fault resilience
  • Key > 250 chars
  • Large object > 1MB
  • Storing collection objects
Data Replication DIY
First of all, think carefully about whether you really need to have data replication at the cache level, given that cache data should always be able to recreated from the original source (although at a higher cost).

The main purpose of using a cache is for "performance" reason. If your system cannot tolerate data lost at the cache level, rethink your design !

Although Memcached doesn't provide data replication, it can easily be done by the client lib or at the application level, based on a similar idea described below.

At the client side, we can use multiple keys to represent different copies of the same data. A monotonically increasing version number is also attached with the data. This version number is used to identify the most up-to-date copy and will be incremented for each update.

When doing update, we update all the copies of the same data via different keys.
def reliable_set(key, versioned_value)
key_array = [key+':1', key+':2', key+':3']
new_value = versioned_value.value
new_version = versioned_value.version + 1
new_versioned_value =
combine(new_value, new_version)

for k in key_array
set(k, new_versioned_value)
end
end
For reading the data from cache, use "multi-get" for multiple keys (one for each copy) and return the copy which has the latest version. If any discrepancy is detected (ie: some copies have a lacking version, or some copies are missing), start a background thread to write the latest version back to all copies.
def reliable_get(key)
key_array = [key+':1', key+':2', key+':3']
value_array = get_multi(key_array)

latest_version = 0
latest_value = nil
need_fix = false

for v in value_array
if (v.version > latest_verson)
if (!need_fix) && (latest_version > 0)
need_fix = true
end
latest_version = v.version
latest_value = v.value
end
end
versioned_value =
combine(latest_value, latest_version)

if need_fix
Thread.new do
reliable_set(key, versioned_value)
end
end

return versioned_value
end
When we delete the data, we don't actually remove it. Instead, we mark the data as deleted but keep it in the cache and let it expire.

User Throttling
An interesting use case other than caching is to throttle user that is too active. Basically you want to disallow user request that is too frequent.
user = get(userId)
if user != null
disallow request and warn user
else
add(userId, anything, inactive_period)
handle request
end

by Ricky Ho (rickyphyllis@gmail.com) at November 05, 2009 04:20 PM

Enda Farrell

CouchDB 0.9x - 1st read from v large views serially

On a server, we run 4 different CouchDB nodes, each with 30 or so databases. We can therefore have over 100 databases - and if you’re reading from large views - or view over large databases - you will need to do so serially.

We have 4 as “normally” CouchDB is kind re it’s use of memory, and kind in terms of CPU usage (we have 8-core machines here). Yes - there is a headache re being disk-IO bound, but it’s not normally a problem.

But views are now needed (to identify conflicting docs) as we have many applications using our KV service in both of our datacentres. And adding them is not straight-forward.

In the docs, you’ll see that a view is no more than a “special” document, so “creating” them is indeed straight-forward. What’s not in the docs (yet at least) is that if you’re going to add a view to a large database, you’re going to need to have PLENTY of memory. Last night we ran into “eheap_alloc” which took down each of the nodes.

So - you gotta take your time, and do them serially.

Unfortunately, the first read of a view over a very large database can take some time. Even a 70 thousand doc 8 GB database can take 2 to 4 hours to build even a simple view (which led to the code that added and read them in parallel) - go give yourself many days.

 

Here’s what top looks like mid-1st read:

top - 15:05:54 up 71 days, 17:26,  1 user,  load average: 1.02, 1.04, 1.04
Tasks: 147 total,   1 running, 146 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.5%us,  1.7%sy,  0.0%ni, 86.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16438912k total, 16331660k used,   107252k free,    19048k buffers
Swap: 16771820k total,  8493532k used,  8278288k free,  6754524k cached


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8893 couchdb   18   0 15.6g 7.8g 1944 S    0 49.6  13:53.09 couchjs
 2790 couchdb   25   0  357m 132m 4180 S    0  0.8  76:06.97 beam.smp
 2843 couchdb   25   0  342m 167m 4116 S   86  1.0  81:54.23 beam.smp
 2896 couchdb   25   0  175m  17m 3896 S    0  0.1   0:06.73 beam.smp
 2949 couchdb   25   0  171m  14m 3744 S    0  0.1   0:06.03 beam.smp
32320 couchdb   18   0  150m 105m 1948 S   15  0.7  13:36.61 couchjs

by Enda Farrell at November 05, 2009 01:38 PM

Mikeal Rogers

CouchDB View Performance (Python vs JavaScript)

We’re gearing up for some heavy CouchDB usage in a new automation system and it has fallen upon me to do some performance benchmarking.

The most important thing for us to figure out was whether or not the CentOS virtual machine we’re currently running CouchDB on is going to be enough even in the short term. Until today we’ve been running 0.9 and have encountered performance problems.

Our main bottleneck is, and has always been, view generation and update performance. We tend to have medium to large size documents (jobs are relatively small but results from test runs can be incredibly large).

View generation of large documents has typically been our biggest issue which we have previously mitigated by refreshing all views after any large write but that isn’t going to work for the amount of results that we plan on pouring in to the new system.

Last weekend I wrote a Python view server for CouchDB. couchdb-python includes a view server but in the past I’ve heard complaints about performance (although none recently). In addition, the view server in couchdb-python only supports map and reduce, which is only about 1/5 of the current view server spec which includes handlers for update, show, list, filter, and validate which provide the groundwork for CouchDB as an application platform. As of Sunday my view server passes all of the current CouchDB spec and initial performance tests showed it faster than the JavaScript view server.

Below are the performance graphs for CouchDB trunk running on a CentOS virtual machine. I’m using Python 2.6 with the default stdlib json library. The spidermonkey core is 1.7 (I don’t know what the status of using 1.8 with CouchDB is but as we’ll see below, this won’t improve performance too much for these tests).

These graphs show view generation time for a given number of documents in a new database. The design doc I used had two views, one does emit(doc['type'],doc), the other emit(doc['_id'], 1).

The graphs support zooming, mouseover and all kinds of flot goodness :)

JavaScript is the yellow line. Python is the Blue line.

This is a test of moderately sized documents, what we normally expect the size of a job or build description. Each document is identical and fairly simple with a size of ~1,588 bytes.

These documents were incredibly large, they were taken from a full fennec mochitest run. Each document is identical and while large it consists mostly of small sized JSON objects inside a much larger JSON object coming in at ~139,096 bytes.

––
––

I had also intended to chart the reduce performance with a simple sum operation but all the results were sub-second regardless of the amount of documents I threw at it with Python being only a little faster than JavaScript.

The nearly identical reduce time tells me that the actual code processing time inside the view functions are hardly different which means that the large difference in performance during view generation is most likely due to JSON serialization time. This also explains why larger documents cause an even greater difference in performance between Python and JavaScript.

Improving Performance

The Python view server is already as optimized as I can imagine for processing time inside the views. Since CouchDB doesn’t provide a way for the view server to support it’s own concurrency we’ve basically hit the wall here on what Python can provide. If we increased the complexity of the view functions I think that Python would start to show better than Spidermonkey 1.7, but 1.8 with traceing enabled would likely bridge that gap, possibly even showing JavaScript faster than Python.

The big problem is JSON serialization. We can make Python faster by compiling simplejson with C speedups. But using the C based JSON parser in newer versions of Spidermonkey requires some other changes to CouchDB since there are differences in the encoding of undefined.

At the end of the day though, this all looks great. CouchDB trunk (pre-0.11) is going to run fast enough for what we need right now even on a virtual machine and if we start to see view generation bottlenecks on views that aren’t hit as often and have to update a large number of documents we can just move those views to Python and the performance should be back down to sub-second.

by mikeal at November 05, 2009 02:13 AM

November 04, 2009

Till Klampäckel

CouchDB: checkpointing on view building

I'm posting about this tidbit because Google seemed to know nothing about it.

Anyway, during the view building process, we may see the following in the couchdb.log (level = info, at least, in local.ini):

[...] [info] [...] checkpointing view update at seq 78163851 for citations _design/erlang
[...] [debug] [...] New task status for citations _design/erlang: Processed 17844590 of 107444308 changes (16%)
[...] [debug] [...] New task status for citations _design/erlang: Processed 17848060 of 107444308 changes (16%)
[...] [debug] [...] New task status for citations _design/erlang: Processed 17850878 of 107444308 changes (16%)
[...] [info] [...] checkpointing view update at seq 78170348 for citations _design/erlang
[...] [debug] [...] New task status for citations _design/erlang: Processed 17851087 of 107444308 changes (16%)

The above tells us, that CouchDB saved the current process during indexing and allows us to resume in case we decide to restart the CouchDB and interrupt the indexing process. I've tried it myself a couple times with CouchDB 0.10.0 — I also had not noticed this feature prior to it.

And why is this useful in particular? The biggest use for this is upgrading computing power (e.g. on AWS EC2) when we realize we need MOAR and then we are still able to resume when we boot into more resources.

Sidenote: Checkpointing will not help if indexing is stopped and the view is adjusted/changed. Or when the indexing stopped due to an error, such as a crash.

That's all, kids.

by Till Klampaeckel (till@php.net) at November 04, 2009 06:50 PM

Enda Farrell

CouchDB 0.9x - compact serially

If you have more than 20 namespaces on a CouchDB instance, you may run into a little trouble.

Everything will work fine, but on the 0.9x series of CouchDB, if you requested a compaction of each and every namespace at essentially the same time, some would not be done.

In fairness, I need to check the code that requests these compactions to check that it checks for the {“ok”:”true”} response, but even so, you will need to remember to do them in series (or maybe two or three at a time)

You might need to consider doing them in series in any case to avoid the disk space being completely used up, which will cause major difficulties for the users of your databases.

by Enda Farrell at November 04, 2009 02:27 PM

November 03, 2009

Chris Anderson

How CouchDB Treats the Disks

Blogs are pretty much append-only. Most of the time you add new posts, sometimes you edit recent ones. There are new comments.

CouchDB's file format is also append-only. This doesn't make it special, lots of databases have used similar techniques before, but it does make it more like web data.

It also makes it smooth for the disk, because the disk doesn't have to seek for writes, and most reads are from file-system cache.

Here's some slides:

Intro to CouchDB's Append Only B-Tree (9MB pdf) Feel free to reuse with attribution.

by jchris at November 03, 2009 10:45 PM

November 02, 2009

Ricky Ho

Principal Component Analysis

One common problem of machine learning is the "curse of high dimensionality". When there are too many attributes in the input data, many of the ML algorithms will be very inefficient or some of them will even be non-performing (e.g. in nearest neighbor computation, data points in a high-dimensional space are pretty much equal distance with each other).

It is quite possible that the attributes we selected are inter-dependent on each other. If so, we may be able to extract a smaller subset of independent attributes that may still be very useful to describe the data characteristics. In other words, we may be able to reduce the number of dimensions significantly without losing much fidelity of the data.

"Dimension Reduction" is a technique to determine how we can reduce the number of dimensions while minimizing the loss of fidelity of data characteristics. It is typically applied during the data cleansing stage before feeding into the machine learning algorithm.

"Feature Selection" is a simple techniques to select a subset of features that is more significant. A very simple "filtering" approach can be used by looking at each attribute independently and rank their significance using some measurement (e.g. info gain) and throw away those that has minimum significance. A more sophisticated "wrapper" approach is to look at different subset of features to do the evaluation. There are two common model in the "wrapper" approach, "forward selection" and "backward elimination".

In forward selection, we start with no attribute and then start to pick the attribute with the highest statistical significance, (ie: prediction improves a lot from the cross validation check). After picking the first attribute, and we start to pair it up with another unselected attribute and find the one with the most significant improvement in the cross-validation-check. We keep growing the set of attributes until we don't find significant improvements. One issue of the "forward selection" approach is that it may miss "grouped features". For example, attribute1 and attribute2 may be insignificant when they are stand-alone, but combining them will give very big improvement.

Backward elimination can be used to handle this problem. It basically goes the opposite direction, starts with the full set of attributes and start to drop those attribute that has least statistical significance (ie: prediction degrades very little from the cross validation check). The downside of "backward elimination" is that it is much more expensive to run.

A more powerful approach called "Feature Extraction" is more commonly used to extract a different set of attributes by linearly combining the existing set of attributes. Principal Component Analysis "PCA" is a very popular technique in this arena. PCA can analyze the interdependency between pair of attributes and identify those significant ones.

The intuition of PCA
The intuition is to rearrange the linear combination of existing m attributes in different way to form another set of m attributes. The new set of attributes has the characteristic that
  • Each attribute is independent of each other
  • The set of attributes is ranked according to the range of variation
Note that attributes with narrow range of variation doesn't provide much information to describe the data samples and so can be ignored with minimal lost of fidelity. So we remove that to reduce the dimensionality.

The question is : How do we recompose the m attributes to exhibit the above 2 characteristics. Lets take a deeper look into it.

Underlying theory of PCA
Assume there are N data points in the input data set and each data point is described by M attributes. We use the statistical definition for the "mean", "variance" of each attribute and "co-variance" for every pair of attributes. Co-variance is an indicator of dependencies of two attributes with zero implies independence.


In an ideal situation, we want COV-x to be a diagonal matrix, which means COV(i, j) to be zero. In other words, all pairs of attribute-i and attribute-j are independent to each other. We also want the diagonal to be ranked in descending order.

So the problem can be reduced to finding a different combination of the m attributes to form a new set of m attributes (Y = P. X) such that COV-y is a ranked diagonal matrix.

How do we determine P ?


Some Matrix theory
Here is a review of Matrice theory that will be used

Lets find the transformation matrice P


So the PCA process can be summarized in following ...
  1. Input: X, a matrice of (m * n), a set of N sample data points, each with M attributes.
  2. Compute Cov-X, a matrice of (m * m), the Covariance matrice of X
  3. Compute the m Eigenvectors and m Eigenvalues of Cov-X
  4. Order the Eigenvectors according to the Eigenvalues
  5. Now found the transformation matrice P, which is a matrice of (m * m). Note that each row vector of P corresponding to an eigenvector, which is effectively the axis of the new co-ordinate system.
  6. Truncate P to just take the top k rows. Now P' is a (k * m) matrice.
  7. Apply P' . X to all input data to result in a matrice of (k * n). This is effectively reduce each data point from m-dimension to k-dimension.

References
A very good paper
Some Matrix math review and step by step PCA calculation

by Ricky Ho (rickyphyllis@gmail.com) at November 02, 2009 10:41 PM

Jan Lehnardt

NoSQL Berlin Debrief

On November 22nd, ’09 Isabel, Thomas and I ran NoSQL Berlin. In summary it was “awesome”. Read on.

4036245707_fda489c343.jpg

We had speakers and attendees travelling as far as from New York to attend. About 80 people were attending in the standing-room-only newthinking Store. Thanks all for showing up!

Additional thanks to all speakers for their great presentations. And special thanks to our sponsors Peritor, SoundCloud, StudiVZ, Versant & Sociomantic.

The Talks

For the benefit of the growing NoSQL community, we’re publishing videos of all talks along with their slides under the Creative Commons Attribution 3.0 License.

  • Consistency in Key-Value Stores, Monika Moser (Video, Slides)
  • Redis, Fast and Furious, Mathias Meyer (Video, Slides)
  • Peer-to-peer Applications with CouchDB, Jan Lehnardt (Video, Slides)
  • Riak, Martin Scholl (Video, Slides)
  • MongoDB, Mathias Stearn (Video, Slides)
  • 18:25 — 4th Generation Object Databases, Prof. Stefan Edlich (Video, Slides)

The Future

The feedback and responses for NoSQL Berlin were overwhelming. We’re planning a bigger follow-up event in spring 2010.

by Jan (jan@apache.org) at November 02, 2009 04:02 PM

October 31, 2009

Till Klampäckel

PHP: So you'd like to migrate from MySQL to CouchDB? - Part I

Update (2009-10-13): I posted part II!

This is the first part of a series. I'll start off by introducing CouchDB — from a PHP side, then I'll demo a couple basic use cases and I later on, I'll dive into migrations from MySQL.

My idea is to introduce CouchDB to a world where database-driven development generally refers to MySQL. By no means, this is meant to be disrespectful to MySQL, or SQL-databases in general. However, I'm a firm believer in using the right tool for the job.

First things first!

First off, before using CouchDB and maybe eventually replacing MySQL with it, we need to ask ourself the "Why?"-question.

And in order to be capable of more than a well-educated guess we need to familiarize ourselves with the CouchDB basics.

Basics

  • Document-oriented and schema-less storage.
  • Erlang (for rock-solid-scaling-goodness).
  • RESTful HTTP-API (we'll get to that).
  • Everything is JSON - request data, storage, response!

Document-oriented

In a document-oriented as to opposed to a relational store, the data is not stored in table, where data is usually broken down into fields. In a document-oriented store each record is stored along side and can have its own characteristics — properties of any kind.

As an example, consider these two records:

Till Klampaeckel, Berlin
Till Klampaeckel, till@php.net, Berlin, Germany

In a relational store, we would attempt to break down, or normalize, the data. Which means that we would probably create a table with the columns name, email, city and country.

Consider adding another record:

Till Klampaeckel, +49110, till@some.jabber

(Just fyi — this is not my real phone number!)

Looking for an intersection in the records, the name is the only thing this record has in common with the previous two. With a relational database, we would either have to add a column for phone number and chat, or we would start splitting off the data into multiple tables (e.g. a table called phone and one called chat) in order to get grip.

With a document-oriented database — such as CouchDB — this is not an issue.

We can store any data, constraints do not apply.

Erlang

Erlang was invented a while ago, by Ericsson, when it was still sans Sony. In a nutshell, Erlang's true strength is reliability and stability. It also manages to really utilize all the resources modern hardware has to offer since it's a master of parallelization.

CouchDB is written in Erlang, and also accepts view code written in Erlang. More on views later.

RESTful HTTP-API

For starters, a lot of HTTP-APIs claim to be RESTful, most of them are not. HTTP has so called request verbs (DELETE, GET, HEAD, POST, PUT among them) and a lot of APIs don't use them to the fullest extend, or rather not all.

Instead, most APIs are limited GET and maybe use a little POST. An example of such an API is the Flickr API.

Most of us are familiar with GET and POST already. For example, when you opened the web page to this blog entry, the browser made a GET-request. If you decide to post a comment later on — you guessed it, that's a POST-request.

Aside from its basic yet powerful nature, HTTP is interesting in particular because it is the least common multiple in many programming language. Whatever you use — C#, PHP, Python, Ruby — these languages know how to talk HTTP. And even better — most of them ship pretty comfortable wrappers.

JSON

JSON — it's godsend for those of us who never liked XML.

It's very lightweight, yet we able to represent lists and objects, integers, strings — most data types you would want to use. A clear disadvantage of JSON is that it lacks validation (think DTD), and of course comments — ha, ha!

Why, oh why?

So along with "Why?", we should consider the following:

  • Does it make sense?
  • Is CouchDB (really) the better fit for my application?
  • What is my #1 problem in MySQL, and how does CouchDB solve it?

And if we are still convinced to migrate all of our data, we'll need to decide on an access wrapper.

It's all HTTP, right?

By now, everyone has heard that CouchDB has a RESTful HTTP-API. But what does that imply?

It means, that we won't need to build a new extension in PHP to be able to use it. There's already either ext/socket or ext/curl — often both — in 99% of all PHP installs out there. Which means that PHP is more than ready to talk to CouchDB — right out of the box.

Since I mentioned JSON before — today ext/json is available in most PHP installs as well. If however we happen to be one of the few unfortunates who don't have and cannot get this extension, we should use Services_JSON instead.

Install it!

CouchDB installations are available in most Linux and Unix distributions. On MacOSX, get CouchDBXthe one-click CouchDB package, and there's a work in process for Windows as well. Especially interesting for those who run Ubuntu 9.10 (which has been released a few days ago), there's already a CouchDB install included.

Ubuntu/Debian:

apt-get install couchdb

FreeBSD:

cd /usr/ports/databases/couchdb && make install clean

by Till Klampaeckel (till@php.net) at October 31, 2009 10:59 AM

Chris Anderson

HTML5 Web Workers

Web Workers open up the web client to message-passing-style programming. Getting this into HTML5 is the first step toward taking Erlang's robust parallelism to the web.

People keep asking me what needs to happen to get the CouchDB spark into the web. I think the #1 most important thing is making sure Web Workers are available on every platform.

Web Workers open up message passing APIs to the client-side. I think we'll get a bunch of really cool services running inside Web Workers. The main benefit is simplicity. Message passing APIs make it easier to reason about certain kinds of JavaScript library functions.

Practically any one of the existing browser based storage implementations can be made to support a CouchDB instance. The job becomes a lot simpler when wrapped inside a web worker instance.

There are other applications, from gaming to graphics to crypto, that benefit from Web Workers.

by jchris at October 31, 2009 02:02 AM

October 30, 2009

Upstream

Unit Testing CouchDB Views with Couch Potato

I just released Couch Potato 0.2.14 and amongst other things it has a new feature i think is pretty neat: you can unit test your (JavaScript) views using RSpec and Ruby.

You can declare a view in Couch Potato like this:

class Comment
  property :post_id
  view :by_post_id, :key => :post_id
end

This will generate a pair of map/reduce function and push them to Couch Potato. The map function looks something like this:

function(doc) {
  if(doc.ruby_class == 'Comment') {
    emit(doc.post_id, 1);
  }
}

And here’s a unit test for that function:

describe Comment, 'by_post_id' do
  it "should map to the post_id" do
    Comment.by_post_id.should map({:ruby_class => 'Comment', :post_id => 3}).to([3, null])
  end
end

As you can see all you have to do is pass a Ruby Hash that looks like your CouchDB document and the expected results. You can also pass in multiple results if you expect your map function to emit multiple key/value pairs.

Testing a reduce function works the same way:

describe Comment, 'by_post_id' do
  it "should reduce to the number of comments" do
    Comment.by_post_id.should reduce([], [1, 1, 1]).to(3)
  end
end

For testing re-reducing you simply call .should rereduce(...).to(...).

How it works

So how come you can test JavaScript functions in pure Ruby? Well, by stealing other people’s tricks. I recently contributed a few patches to mustache.js which is a new templating library ported to JavaScript by @janl. It has a test runner currently implemented in Ruby which generates JavaScript code on the fly, runs it using Spidermonkey and reads back the results. I have added a few steps to this process:

  1. A custom RSpec matcher collects the map function, a Ruby Hash representing the input and the expected output
  2. The input is converted to JavaScript using the JSON gem
  3. JavaScript code is generated that runs the document through the map function and prints the resulting JSON.
  4. The Ruby code runs spidermonkey, collects the output and parses it back to Ruby using again the JSON gem
  5. The results are compared to the expected values.

You can see how it works by looking at the code for the RSpec matchers.

I think this addition lowers the barrier to test your views quite a bit. Although I’m usually not a big fan of “one language to rule them all” and love writing JavaScript, being able to write all the necessary tests in Ruby when working on a Ruby project makes things way easier.

by Alexander Lang at October 30, 2009 03:40 PM

Rodrigo Moya

Syncing Evolution contacts to Ubuntu One

The other day was about Tomboy notes, today, Evolution contacts syncing to Ubuntu One!

For the basic setup, see this tutorial. So, once you have contacts in the Evolution CouchDB Ubuntu One addressbook, syncing to Ubuntu One happens automatically:

The same contacts show up automatically in the Ubuntu One web UI:

Now, we just need to get mobile devices (N900, Android, etc) to sync contacts there also, and your contacts would be everywhere you need them!

Enjoy it!

by rodrigo at October 30, 2009 01:47 PM

Chris Anderson

What CouchDB brings to HTML5

CouchApps are the product of an HTML5 browser and a CouchDB instance. Their key advantage is portability, based on the ubiquity of the html5 platform. Features like Web Workers and cross-domain XHR really make a huge difference in the fabric of the web. Their availability on every platform is key to the future of the web.

Because CouchDB is aimed squarely at the line between the desktop and the cloud, Couch disrupts a lot of the web-physics we're accustomed too. There will be new applications, but also new security concerns. I think the biggest change will be in our perception of privacy.

In a CouchDB-enabled web, data-flows don't have to be centralized, which means friends can communicate without going through a fixed domain. This makes the web more efficient. It also means I can make data available to my social network without relying on 3rd-party services.

I think what we perceive as changes, are actually a conservation of the original values of the web, when it was common that everyone would run their own websites, owning their own content. That's why I think it's important that CouchDB-style replication be part of the HTML5 standard.

by jchris at October 30, 2009 04:58 AM

October 29, 2009

Damien Katz

Koala on the loose!

Ubuntu 9.10 Karmic Koala has just been released. This is big news as this version includes Apache CouchDB, used as a replicable database by desktop apps. This means CouchDB will be on over 10 million desktops. Nice :)

ubuntu_couchdb.png

by Damien Katz at October 29, 2009 08:08 PM

"CouchDB Implements a Fundamental Algorithm"

Chris Anderson, an Apache CouchDB contributor, writes a great article about the core design of CouchDB:

CouchDB Implements a Fundamental Algorithm

by Damien Katz at October 29, 2009 07:17 PM

Mikeal Rogers

Introducing… couchdb-wsgi

Last weekend I put together some pretty useful code that converts [CouchDB's external process](http://wiki.apache.org/couchdb/ExternalProcesses) JSON request/responses to a WSGI compliant interface.

This means you should be able to run any modern Python web framework in an external process :)

The simplest example:

#!/usr/bin/python
import couchdb_wsgi
 
def application(environ, start_response):
    start_response('200 Ok', [('content-type', 'text/plain')])
    return ['Hello World']
 
couchdb_wsgi.CouchDBWSGIHandler(application).run()

But a far more interesting example is running a django app :)

#!/usr/bin/python
import os, sys
import couchdb_wsgi
 
django_project = os.path.join(os.path.dirname(__file__), 'mysite')
sys.path.append(django_project)
os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'
 
import django.core.handlers.wsgi
 
application = django.core.handlers.wsgi.WSGIHandler()
 
couchdb_wsgi.CouchDBWSGIHandler(application).run()

All the code is [up on github](http://github.com/mikeal/couchdb-wsgi) and I’ve written up some solid [Sphinx docs that are up on gh-pages](http://mikeal.github.com/couchdb-wsgi/). I also pushed an [initial release to PyPI](http://pypi.python.org/pypi/couchdb-wsgi).

by mikeal at October 29, 2009 05:47 AM

October 28, 2009

Ricky Ho

Math Concepts for kids and teens

Summarizing some key math concepts that I teach my kids.

Fundamentals
  • A correct value system is the most important foundation (the goal to excel, the willingness to help)
  • How to make decisions ? (differentiate emotional decision and strategic decision)
  • How to do planning ?
  • How to do reasoning, analyzing and drawing conclusion ?
  • How to be open minded, humble but not blindly follow conventional wisdom ? (why human walk with 2 legs, why do we have supermarkets, how do we decide where to put a bus station, why an apple is more expensive than an orange)
  • How to be patient and control emotions ?
  • Develop a good sense of numbers and able to read different charts and graphs, observing relationship between variables and their trends.
  • Appreciation of doing things in a smart way

Basic Math Concepts
  • Numbers, counting (Integer) and quantities (Real)
  • Cause and effect
  • Set (belongs, union, intersect, subset)
  • Function (dependent and independent variable, continuous vs discrete). Various graphing (histogram, line graph, plot), 2D curve and 3D plane
  • Linear equations, degree of freedoms, relationship between number of variables and number of equations.
  • Calculus (differentiation and integration), multi-variables and partial differentiation
  • Logic (if/then, necessary/sufficient conditions, equivalence) and Proof establishments
  • Debate and Logic fallacies
  • Geometry and Vector (think 3D instead of 2D)
  • Probabilities (Draw a tree of all outcomes and counting)
  • Probability distribution function and expected gains
  • Permutations and Combinations (how to find out "all possibilities")
  • Mathematical induction, recursion in proofs.
  • Digits with different bases (and their relationship with Polynomials)
  • Making predictions: False positives, False negatives and how trade-off decisions should be made

Math Models
  • Decision tree (decision and outcome alternations, min/max strategy). Expected gain and optimization
  • Game theory (Nash equilibrium). Outcome prediction within a social group. Win/win and win/lose and lose/lose situations.
  • Finding solution using Search tree, exhaustive search in all possibilities in a systematic way (tree traversal, breath-first vs depth-first vs heuristic)
  • Linear programming for constraint satisfaction and optimization
  • Deterministic vs Stochastic process (Markov chains), Queuing theory
  • Control system (equilibrium, stability and feedback loop)
  • Graph model (nodes and arcs, path finding, shortest path, minimal spanning tree)
  • Finite State Machines (everything happens in a cycle)

by Ricky Ho (rickyphyllis@gmail.com) at October 28, 2009 04:16 PM

Rodrigo Moya

Syncing Tomboy notes with Ubuntu One

Lots of people keep asking the same question (how do I sync Tomboy notes with Ubuntu One?), so, since there is a nice tutorial already, posting it here to get to a wider audience: the tutorial.

Since this is also my first post about it (didn’t want to make it too public until it worked great), I wish to give special thanks to Sandy Armstrong, Tomboy’s super hacker, whose help in making this work has been very valuable. Not only he helped us in all the problems we found, but he was very receptive on our suggestions for changes in the syncing protocol. So, every time you sync your notes (to Ubuntu One or a Snowy server), please save some money to pay him (and me, if possible) some beers :-D

by rodrigo at October 28, 2009 11:33 AM

October 25, 2009

Upstream

Testing Couchapps with Cucumber and Culerity

On last week’s RailsCamp UK I started hacking on a new CouchApp called HejHej. Its purpose is to help me learn Swedish, but what’s more important here: I wrote this app BDD style using Cucumber, the famous BDD tool and Culerity, my humble addition that allows me to test any webapp (including client side JavaScript) with it.

The whole thing is available on Github so you can check out the features and steps there. In the following post I will show the necessary steps to test your own CouchApps with Cucumber.

First of all you will need to install some software:

  1. Java – most computers have it installed already (OSX does). If not go to java.sun.com
  2. JRuby – Culerity is written in Ruby but wraps a Java library called HTMLUnit, hence we need both. You can get JRuby at jruby.org. Just download the tar and extract it to any directory.
  3. Culerity – comes as a RubyGem that you will have to install into JRuby. To install run JRUBY_HOME/bin/jruby -S gem install culerity --source=http://gemcutter.org
  4. Cucumber – jruby -S gem install cucumber
  5. RestClient – helps to clean up the test database: jruby -S gem install rest-client jruby-openssl

Next change to your CouchApp’s directory and create the following directory structure:


  CouchApp root
  |- features
     |- support
        |- env.rb
        |- paths.rb
     |- step_definitions
        |- common_culerity_steps.rb

Copy and paste the following into your env.rb:


  require 'rubygems'
  require 'culerity'

  require 'cucumber/formatter/unicode'

  require 'restclient'

  Before do
    RestClient.delete "#{host}/#{database}" rescue nil
    RestClient.put "#{host}/#{database}", ""
    system "couchapp push"
  end

Warning: the before hook you see there will delete and recreate your CouchDB database before every run, so make sure you are using a separate database for testing than for your actual “production” database.

This is what your common_culerity_steps.rb should look like:


  require 'culerity'

  Symbol.class_eval do
    def to_proc
      Proc.new{|object| object.send(self)}
    end
  end unless :symbol.respond_to?(:to_proc)

  Before do
    $server ||= Culerity::run_server
    $browser = Culerity::RemoteBrowserProxy.new $server, {:browser => :firefox, :javascript_exceptions => true, :resynchronize => true, :status_code_exceptions => true}
    $browser.log_level = :warning
  end

  def host
    'http://localhost:5984'
  end

  def database
    'hejhej'
  end

  at_exit do
    $browser.exit if $browser
    $server.close if $server
  end

  When /I press "(.*)"/ do |button|
    button = [$browser.button(:text, button), $browser.button(:id, button)].find(&:exist?)
    button.click
    When 'I wait for the AJAX call to finish'
  end

  When /I click "(.*)"/ do |link|
    When "I follow \"#{link}\""
  end

  When /I follow "(.*)"/ do |link|
    _link = [[:text, /^#{Regexp.escape(link)}$/], [:id, link], [:title, link]].map{|args| $browser.link(*args)}.find{|__link| __link.exist?}
    raise "link \"#{link}\" not found" unless _link
    _link.click
    When 'I wait for the AJAX call to finish'
  end

  When /I follow \/(.*)\// do |link|
    $browser.link(:text, /#{link}/).click
    When 'I wait for the AJAX call to finish'
  end

  When /I fill in "(.*)" with "(.*)"/ do |field, value|
    find_by_label_or_id(:text_field, field).set value
  end

  When /I attach "(.*)" to "(.*)"/ do |value, field|
    $browser.file_field(:id, find_label(field).for).set(value)
  end

  When /I check "(.*)"/ do |field|
    find_by_label_or_id(:check_box, field).set true
  end

  def find_by_label_or_id(element, field)
    begin
      $browser.send(element, :id, find_label(/#{field}/).for)
    rescue #Celerity::Exception::UnknownObjectException
      $browser.send element, :id, field
    end
  end

  When /^I uncheck "(.*)"$/ do |field|
    $browser.check_box(:id, find_label(field).for).set(false)
  end

  When /^I select "([^"]+)" from "([^"]+)"$/ do |value, field|
    find_by_label_or_id(:select_list, field).select value
  end

  When /^I select "([^"]+)"$/ do |value|
    $browser.option(:text => value).select
  end

  When /I choose "(.*)"/ do |field|
    $browser.radio(:id, find_label(field).for).set(true)
  end

  When /I go to the (.+)/ do |path|
    $browser.goto host + path_to(path)
  end

  When /I wait for the AJAX call to finish/ do
    sleep 0.4
  end

  When /^I visit "([^"]+)"$/ do |url|
    $browser.goto host + url
  end

  Then /^I should see "(.*)"$/ do |text|
    Then "I should see /#{Regexp.escape(text)}/"
  end

  Then /^I should see \/(.*)\/$/ do |text|
    # if we simply check for the browser.html content
    # we don't find content that has been added dynamically,
    # e.g. after an ajax call
    div = $browser.div(:text, /#{text}/)
    begin
      div.html
    rescue
      #puts $browser.html
      raise("div with '#{text}' not found")
    end
  end

  Then /I should see the text "(.*)"/ do |text|
    $browser.html.should include(text)
  end

  Then /I should not see the text "(.*)"/ do |text|
    $browser.html.should_not include(text)
  end

  Then /I should not see "(.*)"/ do |text|
    div = $browser.div(:text, /#{text}/).html rescue nil
    div.should be_nil
  end

  Then /I should see no link "([^"]+)"/ do |text|
    $browser.link(:text => text).should_not exist
  end

  Then /I should not find the page "([^"]+)"/ do |url|
    no_exception = false
    begin
      $browser.goto host + url
      no_exception = true
    rescue => e
      e.message.should =~ /404/
    end
    no_exception.should be_false
  end

  Then /^"([^\"]*)" should be chosen$/ do |field|
    find_by_label_or_id(:radio, field).should be_checked
  end

  Then /^"([^\"]*)" should be checked$/ do |field|
    find_by_label_or_id(:check_box, field).should be_checked
  end

  Then /^"([^\"]*)" should not be checked$/ do |field|
    find_by_label_or_id(:check_box, field).should_not be_checked
  end

  Then /^"([^\"]*)" should be selected$/ do |selection|
    $browser.option(:text => selection).should be_selected
  end

  When 'show response' do
    p $browser.url
    open_response_in_browser
  end

  def find_label(text)
    $browser.label :text, text
  end

  def open_response_in_browser
    tmp_file = '/tmp/culerity_results.html'
    FileUtils.rm_f tmp_file
    File.open(tmp_file, 'w') do |f|
      f < < $browser.div(:id, 'content').html
    end
    `open #{tmp_file}`
  end

Make sure to change the database name (and host if necessary).

Finally the paths.rb


  module NavigationHelpers
    def path_to(page_name)
      case page_name
      when /start page/
        "/#{database}/_design/#{database}/index.html"
      else
        raise "Can't find mapping from \"#{page_name}\" to a path."
      end
    end
  end

  World(NavigationHelpers)

Now that you are done with the prerequisites you can write your first feature. Create a file start_page.feature in the features directory and paste the following:


  Feature: start page
    In order to feel welcome
    As a user
    I want to be welcomed on the start page

    Scenario: go to the start page
      When I go to the start page
      Then I should see "Welcome"

Now you can run your first feature:

jruby -S cucumber features/start_page.feature

Congratulations, you can now test drive your application. Explore the step definitions in the common_culerity_steps.rb in order to learn how to traverse links (When I follow “link”), fill out forms etc.

In order to make the feature pass you should add the text “Welcome” to the index.html of your CouchApp.

by Alexander Lang at October 25, 2009 08:32 PM

Chris Anderson

CouchDB Implements a Fundamental Algorithm

We're seeing a lot of action in the key/value map/reduce world lately. On the one hand this is because simpler stuff scales more simply, and on the other because key/value and B-Tree stores map cleanly to some fundamental algorithms.

At the heart of CouchDB's value proposition is incremental, peer-to-peer replication. All copies of a database are complete and can function independently at all times. Data can be shared with other nodes via replication, which can be triggered or run in continuous mode.

Replication is damn simple. Most of the "hard" work is done at CouchDB write time. Did I mention how simple CouchDB writes are? The gist of it is that for document storage, CouchDB keeps two indexes. One is by document id -- this is used for most lookups, and to enforce uniqueness of document ids within a database. The other index is by local update sequence. Each time a database is updated, the update is tagged with a monotonically increasing integer. This is a just local sequence specific to the local database, so it doesn't have to be replicated or shared.

The by-sequence index drives CouchDB's incremental replication, map reduce views, and external indexers. How? When you start the replicator, it traverses the by-sequence index to find the original updates. This means the target database will see the updates in roughly the same order that the source database saw them. When the replicator gets to the end of the by-sequence index, it knows it has seen the complete state of the current database. The replicator finishes by storing the source db's final sequence number into a replication history document on the target db.

The next time replication begins, the replicator can pull the high-water mark from the history document. The new replication can pick up from the last sequence number, so it doesn't have to start from the top again.

In primitive form, you can start to see how this technique allows for incremental peer-based replication, as any node can track its status vis-a-via any other node, and bring itself up to date, just by traversing the by-sequence index. Incremental map views are just like replication, except instead of updates being copied verbatim to the view index file, they are filtered and transformed using JavaScript functions. For an added bonus, compaction can be thought of as local replication to an empty file.

Optimized

In practice, CouchDB makes a few optimization to the above technique. The big one is that the by-sequence index is sparse. This means if a single document is updated over and over again, replication will only transfer the current state, not all the intermediate states, saving on bandwidth, storage and processing costs.

Another optimization is that before an update is transfered, it is compared to the content of the target database. If the target already has that update (perhaps from replication with a different peer) it doesn't need to be transfered.

Lockless

When you sit down to start implementing something that exhibits these properties, there is a challenge you run into as soon as you start to think about concurrency. View query range-scans, replication, document listings, reduce queries -- these are all potentially long-running processes that depend on the consistency of an index over multiple operations.

What's more, think about the writer: it's got to update the by-sequence and the by-id indexes together in an all-or-nothing way. (The view engine also updates multiple B-Tree indexes atomically as well.)

Your standard database toolkit will tempt you to reach for a familiar abstraction layer: transactional isolation. If you can ensure that only one process can access the B-Tree at any time, then you don't have to worry about writes altering the data-structure while reads are in progress.

Homey don't play that: transactions are a form of locking, and I un-friended locking years ago, for posting too many lame updates to my log files.

To get to the core of how CouchDB provides these properties in a lockless way requires understanding the append-only file format, but I'll give you the basic picture: Each db file has a single writer process and multiple reader processes. Readers can proceed independently of the writer, getting a consistent snapshot of the data, even as changes are being made.

In a nutshell, since we only ever append to the file, we never touch anything that's already on disk. Each time we commit to the file, we rewrite the portion of the index that is invalidated by the changed documents. For geeks: this means we update O(log n) B-Tree nodes on each update. The last thing we write to the file for each commit is the B-Tree root (there's also a signed header written to ensure reliability.)

Each reader grabs the file, and finds the last header / B-Tree root. From there it can traverse the by-seq or the by-id index (or in the case of a view file, any of the map indexes, or the back-reference index). If the writer updates the file, it will go unnoticed by any existing readers, as they will not have pointers to the new portion of the file. In this way CouchDB gives readers long-lived consistent access to indexes without locking or transactions.

Original?

CouchDB is not the first implementation of an append-only B-Tree, but it may be unique that it has the ability to interleave multiple indexes on disk in an append-only way. This interleaving is what gives it the ability to update the by-sequence and by-id indexes together atomically, without transactions.

The overall effect of this design is to maximize for reliability and concurrency. There is no way to corrupt the data, as we never touch what we've already written to disk. Erlang can support as many simultaneous readers as resources allow, which we've found in practice to be in the tens of thousands.

If you're the type to have names for these techniques, let me know in the comments. I wouldn't be surprised to find CouchDB described in a dusty-old algorithms book somewhere.

The fact that CouchDB's API is such a thin wrapper around a fundamental algorithm, makes me think it is time to start talking and thinking about alternate implementations: C++, JavaScript, Ruby

by jchris at October 25, 2009 06:49 PM

October 24, 2009

Chris Anderson

HTML5 Storage Continues

I love HTML5 because it's got such a pragmatic approach. The original descriptive effort of current usage and implementation is undoubtably the right approach. On the applications front, new efforts like Web Workers lead the way on the future of computing.

On the storage front, the original notion of, "heh, let's just drop SQLite in there and call it awesome" is a perfect example of the original descriptive pragmatism. But HTML5 also has an obligation to consider the "long now" of computing. The #1 concern in the coming decade will be concurrency.

I'm working on an HTML5 spec proposal on how you can build CouchDB using a B-Tree and Web Workers. The focus is on simplicity and ease of implementation. Some people don't realize that CouchDB is built on a very simple core, so we don't need much spec to support it.

It's still messy, and fluid. Here's an excerpt from something I sent to some folks about implementing it:

Our goal original goal is for the spec to be as simple as physically possible, and still able to support CouchDB-style offline replication. Adding features on top of that can be approached pragmatically.

I think implementing our spec should involve a lot of the same work for WebSimpleDB, so trying it out won't create a huge cost. Eg: once you've got Tokyo Cabinet wired into the window, wrapping it for JSONDB/WebWorkerStorage should overlap with wrapping it for WebSimpleDB.

The minimum we need in order to build a proper replicating CouchDB is a B-Tree with (optional) secondary indexes, that can be run from within a web worker.

I think the fact that we can demonstrate a useful query model without exposing a transaction API is key. In the multi-core world, locking is not acceptable. We understand that underlying implementations may hold locks, but giving a locking API to user code seems like a recipe for fail, as it prevents truly lockless implementations by vendors.

I'm still hard at work in the WebWorkerStorage repo, and I'm eager to take patches.

I think this spec has a chance if it's put together collaboratively and is released along with an implementation. As far as implementation goes, once a browser implements the bare-bones B-Tree (secondary indexes as described in WebSimpleDB help a lot but aren't strictly necessary) we should be able to easily build a replicating Couch inside a web worker.

by jchris at October 24, 2009 07:07 PM