Planet CouchDB

January 19, 2012

Damien Katz

Couchbase Meetup at new HQ

Join us Thursday January 19 at 6:30 PM at our brand new Headquarters (aka Fort Awesome). Join and RSVP here.

by Damien Katz at January 19, 2012 02:39 AM

January 16, 2012

Ricky Ho

Machine Learning: Ensemble Methods

Ensemble methods are a popular approach in machine learning, based on the idea of combining multiple models. For example, by mixing different machine learning algorithms (e.g. SVM, logistic regression, Bayesian network), an ensemble method can automatically favor the algorithmic model that fits the data best. On the other hand, by mixing different parameter sets of the same algorithmic model (e.g. random forest, boosted trees), it can pick the best set of parameters for that model.

Bagging
Bagging is based on the idea of learning multiple models from different sets of sample data, obtained by random sampling (with replacement) from the same training set. After the models are learned, we use a voting scheme to predict future data. For a classification problem, we use the majority class voted by all models. For a regression problem, we take the average of the estimated outputs from all models.
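The bagging procedure can be sketched in a few lines of JavaScript. This is only an illustrative sketch: the helper names (makeRng, bootstrapSample, majorityVote, averagePrediction, bag) and the small seeded random generator are assumptions made for this example, not part of any library.

```javascript
// Small seeded pseudo-random generator (an LCG) so the sketch is reproducible.
// Hypothetical helper; any source of uniform numbers in [0, 1) would do.
function makeRng(seed) {
  let state = seed >>> 0;
  return function () {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Draw a bootstrap sample: same size as the training set, with replacement.
function bootstrapSample(data, rng) {
  return data.map(() => data[Math.floor(rng() * data.length)]);
}

// Classification: return the class predicted by the most models.
function majorityVote(predictions) {
  const counts = new Map();
  for (const p of predictions) counts.set(p, (counts.get(p) || 0) + 1);
  let best = null;
  let bestCount = -1;
  for (const [label, count] of counts) {
    if (count > bestCount) { best = label; bestCount = count; }
  }
  return best;
}

// Regression: average the outputs of all models.
function averagePrediction(predictions) {
  return predictions.reduce((sum, p) => sum + p, 0) / predictions.length;
}

// Train one model per bootstrap sample. `train` is any function that takes
// a data set and returns a predictor function.
function bag(data, train, numModels, rng) {
  const models = [];
  for (let i = 0; i < numModels; i++) {
    models.push(train(bootstrapSample(data, rng)));
  }
  return models;
}
```

To predict, call every model in the returned array on the new data point and pass the outputs to majorityVote (classification) or averagePrediction (regression).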

The models don't have to be equally weighted. We can use a weighted average of the individual models to come up with the final model: m = w1.m1 + w2.m2 + w3.m3 + ...

To obtain the weights w1, w2, ... we can again use machine learning. In this case, we first use the training data set to train the individual models. After that, we use the validation data set to train the weights. Concretely, we feed each data point from the validation set to each model and combine the outputs into the final prediction. Based on how the ensembled prediction deviates from the actual outcome, we can learn the optimal set of weights.
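As a rough sketch of learning the weights on a validation set, the code below uses plain gradient descent on the mean squared error of a regression ensemble. The function names, the step count and the learning rate are assumptions chosen for illustration; the fixed learning rate only behaves well when the model outputs are on a modest scale.

```javascript
// Combine model outputs with weights: m(x) = w1.m1(x) + w2.m2(x) + ...
function ensemblePredict(models, weights, x) {
  return models.reduce((sum, model, i) => sum + weights[i] * model(x), 0);
}

// Learn the weights on a validation set (array of {x, y} pairs) by
// gradient descent on the mean squared error of the ensembled prediction.
function learnWeights(models, validation, steps = 2000, rate = 0.01) {
  const weights = models.map(() => 1 / models.length); // start equally weighted
  for (let s = 0; s < steps; s++) {
    const grad = models.map(() => 0);
    for (const { x, y } of validation) {
      // How far the ensembled prediction deviates from the actual outcome.
      const err = ensemblePredict(models, weights, x) - y;
      models.forEach((model, i) => {
        grad[i] += (2 * err * model(x)) / validation.length;
      });
    }
    grad.forEach((g, i) => { weights[i] -= rate * g; });
  }
  return weights;
}
```

The individual models are assumed to be already trained on the training set; only the weights are fitted here.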

Boosting
Boosting extends the idea of bagging by putting more emphasis on the training data that is wrongly predicted. In boosting, weighted sampling is used. Initially every training data point is equally weighted, but with each iteration the data points that are wrongly classified have their weights increased. In addition, each model is itself weighted according to how well it predicts the data in its round.
  1. Each training data point carries a sample weight (initially all the same)
  2. For each iteration, take a sample (with replacement) based on the weights and train a model on it
  3. Compute the error rate e of that model
  4. Increase the relative weight of each wrongly predicted sample, for example by multiplying the weight of each correctly predicted sample by e/(1-e) and renormalizing
  5. Weight the trained model according to its error, for example by log((1-e)/e)
  6. Stop when e is small enough
  7. Ensemble the overall model according to the model weights.
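The steps above can be sketched as a small AdaBoost-style loop. Several things here are assumptions for the sake of a short, deterministic example: the data has a single numeric feature, labels are +1 or -1, the weak learner is a threshold "stump", and instead of resampling by weight (step 2) the weights are passed directly to the learner, which is a common variant. The names trainStump and adaBoost are made up for this sketch.

```javascript
// Weak learner: the threshold "stump" on one numeric feature with the
// lowest weighted error. Labels are +1 or -1.
function trainStump(xs, ys, weights) {
  let best = { error: Infinity, predict: null };
  for (const threshold of xs) {
    for (const sign of [1, -1]) {
      const predict = (x) => (x >= threshold ? sign : -sign);
      let error = 0;
      xs.forEach((x, i) => { if (predict(x) !== ys[i]) error += weights[i]; });
      if (error < best.error) best = { error, predict };
    }
  }
  return best;
}

function adaBoost(xs, ys, rounds) {
  const n = xs.length;
  let weights = new Array(n).fill(1 / n);            // step 1: equal weights
  const ensemble = [];
  for (let r = 0; r < rounds; r++) {
    const { error, predict } = trainStump(xs, ys, weights); // steps 2-3
    if (error >= 0.5) break;                         // no better than chance
    const e = Math.max(error, 1e-10);                // avoid division by zero
    // Step 4: scale up wrongly predicted samples by (1-e)/e, then renormalize.
    // After renormalizing this equals scaling the correct ones by e/(1-e).
    weights = weights.map((w, i) =>
      predict(xs[i]) !== ys[i] ? (w * (1 - e)) / e : w);
    const total = weights.reduce((a, b) => a + b, 0);
    weights = weights.map((w) => w / total);
    // Step 5: weight the model by log((1-e)/e).
    ensemble.push({ predict, alpha: Math.log((1 - e) / e) });
    if (error < 1e-10) break;                        // step 6: e is small enough
  }
  // Step 7: the final prediction is the weighted vote of all models.
  return (x) => {
    const score = ensemble.reduce((s, m) => s + m.alpha * m.predict(x), 0);
    return score >= 0 ? 1 : -1;
  };
}
```

No single stump can separate a data set such as xs = [1, 2, 3, 4, 5, 6] with ys = [1, 1, -1, -1, 1, 1], but a weighted vote of a few stumps can.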
Gradient boosting is one of the most powerful and popular boosting methods. It is based on incrementally adding a function that fits the residuals.

Find a function F to fit y ~ F(x) using a gradient descent approach.

Start with some initial guess for the function F. Compute the loss based on the residual.
loss = L(y - F(x)) where L is some loss function

Take the partial derivative of the loss with respect to the function F, and compute its value at every data point. Use another machine learning model to learn a function g(x) that predicts the negative of this partial derivative (for squared loss, this is proportional to the residual). Then update F(x) <- F(x) + a.g(x) where a is the learning rate.
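Here is a minimal sketch of that loop for the squared loss L = (y - F(x))^2 / 2, where the negative gradient with respect to F(x) is simply the residual y - F(x). The weak learner (a one-split regression stump on a single numeric feature) and the names fitStump and gradientBoost are assumptions made to keep the example short.

```javascript
// Weak learner: a one-split regression "stump" fit by least squares.
function fitStump(xs, targets) {
  const mean = (v) => (v.length ? v.reduce((a, b) => a + b, 0) / v.length : 0);
  let best = null;
  for (const threshold of xs) {
    const leftMean = mean(targets.filter((_, i) => xs[i] < threshold));
    const rightMean = mean(targets.filter((_, i) => xs[i] >= threshold));
    // Sum of squared errors if we predict the mean on each side of the split.
    const sse = targets.reduce((s, t, i) => {
      const p = xs[i] < threshold ? leftMean : rightMean;
      return s + (t - p) ** 2;
    }, 0);
    if (best === null || sse < best.sse) {
      best = { sse, threshold, leftMean, rightMean };
    }
  }
  return (x) => (x < best.threshold ? best.leftMean : best.rightMean);
}

function gradientBoost(xs, ys, rounds, learningRate) {
  // Initial guess for F: the constant mean of y.
  const initial = ys.reduce((a, b) => a + b, 0) / ys.length;
  const learners = [];
  // F(x) = initial + a.g1(x) + a.g2(x) + ...
  const F = (x) =>
    learners.reduce((sum, g) => sum + learningRate * g(x), initial);
  for (let r = 0; r < rounds; r++) {
    // Negative gradient of the squared loss at every data point.
    const residuals = xs.map((x, i) => ys[i] - F(x));
    // Learn g(x) to predict it, then F(x) <- F(x) + a.g(x).
    learners.push(fitStump(xs, residuals));
  }
  return F;
}
```

For other loss functions the residuals line would be replaced by the negative derivative of that loss, while the rest of the loop stays the same.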

Other ways of Sampling
Instead of sampling the training data, we can also sample the attributes (e.g. we can randomly pick a subset of the input variables). Random forest is an example of this approach.
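A small sketch of attribute sampling, assuming each training row is a plain object keyed by attribute name. The helper names sampleFeatures and projectRow are hypothetical, and k must not exceed the number of attributes. Note that a real random forest typically combines this with bootstrap sampling of the rows and picks a fresh attribute subset at each tree split; the sketch only shows the subset selection itself.

```javascript
// Pick k distinct attribute names at random.
function sampleFeatures(featureNames, k, rng = Math.random) {
  const pool = featureNames.slice();
  // Partial Fisher-Yates shuffle: settle one random element per position.
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, k);
}

// Keep only the chosen attributes of one training row.
function projectRow(row, features) {
  const out = {};
  for (const f of features) out[f] = row[f];
  return out;
}
```

Each model in the ensemble would then be trained on rows projected onto its own attribute subset.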

Sliding window
Machine learning rests on the important assumption that the future repeats the same patterns as the past. As we all know, this assumption becomes less and less valid as time goes by. In other words, we should put more weight on recent history than on long-term history.

Ensemble methods also give us an elegant way to decay the weight of old data. In this case, we maintain a sliding window of models (say the last 7 days). We learn a new model daily and expire the oldest model, which we computed 7 days ago.

M = M1 + M2.a + M3.a^2 + ... + M7.a^6
M = M / (1 + a + a^2 + ... + a^6)
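The two formulas above can be sketched as follows, assuming that M1 is the most recent model, M7 the oldest, and a is a decay factor between 0 and 1 (the formulas do not spell this out). The function names are made up for the example, and each model is assumed to be a function returning a numeric prediction.

```javascript
// models[0] is the most recent model (M1), the last entry the oldest.
function decayedEnsemble(models, a) {
  const weights = models.map((_, i) => Math.pow(a, i));   // 1, a, a^2, ...
  const total = weights.reduce((s, w) => s + w, 0);       // 1 + a + a^2 + ...
  return (x) =>
    models.reduce((sum, m, i) => sum + weights[i] * m(x), 0) / total;
}

// Daily update: put today's model in front and expire anything that
// falls outside the window.
function slideWindow(models, newModel, windowSize = 7) {
  return [newModel, ...models].slice(0, windowSize);
}
```

With a close to 1 all seven models count almost equally; with a small a the prediction is dominated by the newest model.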

by Ricky Ho (noreply@blogger.com) at January 16, 2012 12:41 AM

January 11, 2012

Damien Katz

Why Couchbase?

So apparently my last entry ruffled some feathers, so maybe I should explain why I think Couchbase is the future?

Simple Fast Elastic.

That's pretty much it. We make it very simple to get started, we are extremely fast (and getting faster), and we really are "web scale", with the ability to add and remove machines from a cluster to rapidly scale your capacity to your workload.

The Membase product was very fast and scalable, but a bit too simple, with no reporting capability or cross-datacenter replication capability.

The CouchDB product has a lot of features, but is too slow, unable to keep up with high loads and unable to scale out on its own.

The combination of the 2 will hit a sweet spot to allow developers to quickly get their apps up and running, along with the reliability, speed and low cost that make running it in production cheap and worry free.

Our 2.0 product is coming soon, adding CouchDB style views and reporting with a nifty trick for extremely fast failover while maintaining full coherency with the underlying distributed data storage (we are calling it our B-Superstar index). We'll of course have lightning fast reads (same as Memcached) but also very fast durable writes. For 2kb docs, we are currently getting sustained random insert/update rates of 25k writes/sec, fully durable, with compaction in background so it can go all day and all night. We've got some more write work coming soon which we are hoping will give us another performance boost too before 2.0. Stay tuned.

And so right now the focus is on the features and customers that pay, a thing that allows us to build a real sustainable business. And that's REAL DAMN IMPORTANT. It's not enough to build some cool technology, not enough to build a community of excited technologists. You need to cross the chasm and build a real business. A business that provides support, training, documentation and of course a reliable product. A business you can call up when you have difficulty upgrading from an old version, or are getting some weird error you've never seen before at 3am. A business you know will be around to support you for years to come.

And so while we focus on the features and customers that most quickly make us a viable business (and it's growing fast), we are still looking to build the features and technology to expand our use cases and get customers and developers excited. Future versions are planned to have full CouchDB compatible replication technology, with the ability to support all sorts of mobile and embedded databases, such as our new TouchDB projects for iOS and Android. So with Couchbase you can have a fast, scalable database in the cloud that also supports the offline use of thousands, or millions, of apps on devices that drop in and out of internet connectivity, and can sync when connected but are still completely usable when disconnected.

That's some cool shit. Simple Fast Elastic. And Reliable. And Mobile. That's why Couchbase.

by Damien Katz at January 11, 2012 08:17 PM

January 09, 2012

Till Klampäckel

The future of CouchDB

… is not Damien Katz.

TL;DR

The blog post Damien Katz wrote earlier today doesn't mean much, if anything, for the Apache CouchDB project (or memcache project for that matter). If anything, it's a public note that Damien Katz acknowledged that he moved (on) from CouchDB to Couchbase.

Short story, long

I'm not a contributor to CouchDB by means of code, (but) I blog a lot, I maintain the FreeBSD port, wrote a book and have an opinion on many things CouchDB. I've been a CouchDB user for something like four years (since pre-0.8 times) and a BigCouch user (and Cloudant customer) for about 1.5 to two years.

I am not sure what Damien Katz tried to achieve when he posted his message to the community and while I personally find it ignorant (to say the least), it worries me how it is perceived by the general public.

Talk is cheap

Of course it may sound like the end of the world when the creator of CouchDB quits his own project, but truth be told, Damien Katz left CouchDB a long time ago. Couchbase moved past CouchDB long before they announced it, basically when they started integrating Membase, though there are all kinds of notable contributions from Couchbase employees (e.g. Filipe and Jan).

Damien himself hasn't (actually) committed to CouchDB in over a year. That makes his move no real surprise; just the way he decided to communicate it surprised me. Especially since he said he has no regrets, I find the tone and statements in his blog post rather questionable. That is both from a personal and professional perspective.

I just attended a Couchbase event in Berlin last year where talks about CouchDB were given along with newer Couchbase developments. So personally, while I welcome clarity, all too sudden changes in strategy don't make me happy. If I was new to CouchDB and/or Couchbase, this would look like a headless chicken (excuse the image) and way too much drama to get into.

And on a professional level, Damien's post invalidates the efforts of many people who both contribute to and work with Apache CouchDB on a daily basis.

Then later today Damien shared this:

TIL, if you create an open source project, you should stick with it forever and ever. Family can live off unicorns and stardust. — Damien Katz

Real talk

First off, most people in this discussion (excluding HN, of course ;-)) are actually active Open Source contributors one way or the other. Many of us have other projects (plural) besides CouchDB. It's not that we troll about something we have no idea about.

Secondly, it's not the fact that Damien left, it's how he left.

No one blames people for moving on: it happens all the time. I do it all the time — write code, push it out, move on.

If code is good enough it'll be picked up, if not, it'll rot on GitHub forever. It happened to other projects and it happened to CouchDB. But why would anyone pronounce a project dead in which he is no longer invested?

Anyway: I wish Damien good luck in the future.

So where is CouchDB at?

I wrote a blog post about the current state of CouchDB last year (2011) in early December.

A few things have changed since then:

Overall, I still see contributions from notable community members all around. Not sure if it's my own perception, but there has been more activity as of late. A new release of Apache CouchDB is just around the corner. Overall, good times for Apache CouchDB indeed.

The other thing that happened after my blog post is that Couchbase said in their 2011 review (which was published in late December) that it would officially step off Apache CouchDB and contribute documentation and OSX builds (aka CouchDBX or Couchbase Single Server) to the Apache CouchDB project.

This is great news and announcing that they will step off the turf is fine too since it clears up a couple misconceptions people may have about Apache CouchDB and the former Couchbase Single Server.

And all in, this makes Damien's blog post even more unnecessary and confusing to many users out there. Especially confusing for those who are not neck-deep in Apache CouchDB — and by that I mean: they are neither subscribed to a mailing list, take part on IRC, read the CouchDB planet or know any of the contributors directly.

In the end his blog post confuses people because it contains absolutely nothing but fear, uncertainty and doubt (FUD).

Outlook

I can't reveal too much because it's not my business to announce anything — let me just say that there are good times ahead for Apache CouchDB.

Fin

I'd like to get past all the drama and re-focus on what is important: CouchDB and our data. I don't care for the rest, I want to see exciting things from Apache CouchDB in 2012.

by Till Klampaeckel (till@php.net) at January 09, 2012 10:51 PM

January 06, 2012

Volker Mische

The future of GeoCouch and CouchDB

The CouchDB world is currently full of “The future of CouchDB” blog posts. It started with the blog post from Damien Katz, the creator of CouchDB. Of course people were also concerned about the future of GeoCouch. No worries, it will be good.

The future of Apache CouchDB

The reactions were quite different. People who are not deeply involved with the CouchDB community think that this means the end of Apache CouchDB. My reaction was positive, I tweeted:

“It’s good to see the Damien is so open to [the] world”

The reason was, that for me it was pretty clear that it would happen, and I was just happy that Damien officially made the cut.

The reactions from CouchDB community members were pretty much what Till Klampäckel describes in his blog post. You could see it coming after Couchbase announced that they are not the CouchDB company and that their product won’t be Apache CouchDB compatible.

I agree with Till here: the way Damien wrote his blog post isn’t the best imaginable. For outsiders, it really seems to be the end of Apache CouchDB, but it is not. For me it just shows why foundations like the Apache Foundation are such a great idea. Even if the original creator leaves the project, it still lives on.

Apache CouchDB has a lot of contributors, and the mailing lists and IRC channel are as busy as always. That CouchDB has a future is also shown by the blog post from Cloudant. They will keep supporting Apache CouchDB.

The future of GeoCouch

After this quick recap of what has happened so far, it’s time to talk about the future of GeoCouch. As you may know, I work for Couchbase on the integration of spatial functionality into their product.

Currently the overlap between Apache CouchDB and the version Couchbase uses internally is still quite huge, but it will diverge more and more in the future. Thus it will get harder and harder to maintain a single version that supports Apache CouchDB and Couchbase.

The good news is that GeoCouch is pretty much a data structure only. It's an R-tree that stores JSON documents. This can easily be used by CouchDB and Couchbase. Perhaps small wrappers will be needed, but those should be minimal.

The easiest way to understand what the future looks like is a small illustration:

Illustration of GeoCouch and its relation to CouchDB and Couchbase

GeoCouch's core is the R-tree; it's the same code for CouchDB and Couchbase. On top of it there will be code that is specific to either CouchDB or Couchbase.

This means that the majority of the development I do for Couchbase will also improve the GeoCouch you can use for CouchDB.

Conclusion

The future of all three, Apache CouchDB, Couchbase and GeoCouch looks bright.

by Volker Mische at January 06, 2012 03:02 PM

Damien Katz

The Future of CouchDB

What's the future of CouchDB? It's Couchbase.

Huh? So what about Apache CouchDB? Well, that's a great project. I founded it, coded the earliest versions almost completely myself, I've spent a huge amount of blood, sweat and tears on it. I'm very proud of it and the impact it's had. And now I, and the Couchbase team, are mostly moving on. It's not that we think CouchDB isn't awesome. It's that we are creating the successor to it: Couchbase Server. A product and project with similar capabilities and goals, but faster, more scalable, more customer and developer focused. And definitely not part of Apache.

With Apache CouchDB, much of the focus has been around creating a consensus based developer community that helps govern and move the project forward. Apache has done, and is doing, a good job of that. But for us, it's no longer enough. CouchDB was something I created because I thought an easy to use, peer based, replicating document store was something the world would find useful. And it proved a lot of the ideas were possible and useful and it's been successful beyond my wildest ambitions. But if I had it all to do again, I'd do many things differently.

If it sounds like I'm saying Apache was a mistake, I'm not. Apache was a big part in the success of CouchDB, without it CouchDB would not have enjoyed the early success it did. But in my opinion it's reached a point where the consensus based approach has limited the competitiveness of the project. It's not personal, it's business.

And now, as it turns out, I have a chance to do it all again, without the pain of starting from scratch. Building on the previous Apache CouchDB and Membase projects, throwing out what didn't work, and strengthening what does, and advancing great technologies to make something that is developer friendly, high performance, designed for mission critical deployment and mobile integration, and can move faster and more responsively to users' and customers' needs than a community based project.

Apache CouchDB, as project and community, is in fine shape. And many of us at Couchbase are still contributing back to it. But the future, the one I'm pushing forward on, is Couchbase Server.

And what is my part in building Couchbase? Right now I'm focusing on getting Couchbase 2.0 ready for serious production use. I'm once again an engineer and coder, back in the trenches, designing and writing code, reviewing code and designs, helping other engineers and solving tough problems. And I'm dead serious about making it the easiest, fastest and most reliable NoSQL database. Easy for developers to use, easy to deploy, reliable on single machines or large clusters, and fast as hell. We are building something you can put your mission critical, customer facing business data on, and not feel like you're running a dirty hack.

Soon, to work more closely with the team (and get rid of my nasty Oakland commute), I'll be relocating my family to the Mountain View area. Shit just got real!

And I'm really excited about the work we've got in the pipeline. We are moving more and more of the core database into C/C++, while still using many of the concurrency and reliability design principles we've proven with the Erlang codebase. And Erlang is still going to be part of the product as well, particularly with cluster management, but most of the performance sensitive portions will be moving over to C code. Erlang is still a great language, but when you need top performance and low level control, C is hard to beat.

Anyway, there's so much to talk about, too much for one blog post. One of my New Year's resolutions is to blog more, and I've got a ton of interesting things to talk about. The trials and tribulations of building a startup and an engineering culture. What's wrong (and right) with Erlang. Bringing forth UnQL. TouchDB for Mobile. And yes, we'll still interoperate with Apache CouchDB and Memcached. But the future is Couchbase.

Ride with me.

Edit

As J. Chris Anderson notes in the comments, Couchbase is completely open source and Apache licensed:


Everything Couchbase does is open source, we have 2 github pages that are very active:

https://github.com/couchbaselabs

https://github.com/couchbase

Probably the most fun place to jump into development is the code review: http://review.couchbase.org/

Let me clarify, if you like Apache CouchDB, stick with it. I'm working on something I think you'll like a lot better. If not, well, there's still Apache CouchDB.

by Damien Katz at January 06, 2012 01:19 AM

December 21, 2011

Till Klampäckel

Quo vadis, CouchDB?

Update, 2011-12-21: Couchbase posted their review of 2011 (the other day) — TL;DR: Couchbase Single Server (their Apache CouchDB distribution) is discontinued and its documentation (and its buildtools) will be contributed to Apache CouchDB.


When Ubuntu1 dropped CouchDB two weeks ago, there were a couple of things which annoy (present tense) me a lot. Add to that the general echo from various media outlets and blogs which pronounced CouchDB dead, and a general misconception about how this situation, or CouchDB in general, is dealt with.

Some people said I am caremad about CouchDB and that is probably true. Let me try to work through these things without offending more people.

Ubuntu1

What annoy[ed,s] me about this situation is that I wrote a chapter about Ubuntu1 in my CouchDB book. And while I realize that as soon as a book is published the information is outdated, I also want to say that I could have used the space for another project.

I talked to a couple of people about CouchDB at Ubuntu1 on IRC and no one made it sound like they are having huge or for that matter any issues.

Of course I work for neither Canonical nor Couchbase. I haven't signed any NDAs etc., but looking back a week or two my well-educated guess is that not even the people at Couchbase knew there were fundamental issues with CouchDB and Ubuntu1.

The NDA-part is of course an assumption: don't quote me on it.

Transparency

Scumbag Ubuntu1 drops CouchDB and doesn't say why. — myself on Twitter

First off: I'm not really sorry. I was abusing a meme and if you read my Twitter bio, you should not take things personally.

I also should have known better since it's not like I expect anything transparent from Canonical. (Just said it.)

When people are compelled to write a press release and put it out like that, they should expect a backlash. The reason why I reacted harshly is that Canonical didn't share any valuable information on why they discontinued using CouchDB except for: it doesn't scale.

And I'm not aware of anything concrete to date.

Helpful criticism — how does it work?

Please take a look at the following email: https://lists.launchpad.net/u1db-discuss/msg00043.html

This email contains a lot of criticism. And it's all valid as well.

CouchDB feedback

Other examples:

These are great emails because they contain extremely valuable feedback.

Deal with it!

In my (humble) opinion, these kinds of emails are exactly what is necessary in CouchDB-land, and many other open source projects: criticism and a little time to reflect on not so awesome features. And then moving on to make it better. If the feedback cycle doesn't happen, there's no development or evolution, just stagnation.

And in retrospect I wish more people would share their opinion on CouchDB and this situation more often. Since I'm personally invested in CouchDB, it's hard to say certain things. Honesty is sometimes brutal, but it's necessary.

In summary, a CouchDB user like Ubuntu1 (or Canonical) doesn't have the civic duty to give feedback, but to desert a project while pretending to be an Open Source vendor, and not talking to the community of the project or sharing your issues in public, that is extremely unhelpful.

Overall it strikes me that the only thing to date known about Canonical's collaboration with CouchDB is the support for OAuth in CouchDB. And most people don't even know about that (or wouldn't know how to use it). It worries me personally to not know the kind of problems Canonical ran into because they seem so messed up that they couldn't be discussed in public.

CouchDB doesn't scale

One thing I was able to extract is: CouchDB doesn't scale.

Thanks! But no thanks.

I wrote a book on CouchDB and I pretty much used it all, or at least looked at it very, very closely. I also get plenty of experience with CouchDB due to my job. Indeed, there are many situations where CouchDB doesn't scale or where it becomes extremely hard to make it scale. Situations where the user is better off putting data somewhere else.

I (and I'm assuming others) enjoy learning the reasons why things break, so we can take this experience and use it going forward. If this doesn't happen we might as well all subscribe to the koolaid of a closed source vendor and purchase update subscriptions, install security packs and happily live ever after.

A patch to make CouchDB scale?

Another piece of information I gathered from the various emails written is that Canonical maintained CouchDB-specific patches for Ubuntu1. However, it's unknown what the purpose of these patches was: for example, whether they made CouchDB scale (magically) for Ubuntu1 or whether the patchset added a new feature.

What I'd really like to know is why these patches were not discussed in the open and why no one worked with the project on incorporating them into upstream. The upstream is the Apache CouchDB project.

This is another example of where communication went horribly wrong or just didn't happen.

A CouchDB company

I'm a little torn here and I don't want to offend anyone (further) especially since I know a couple Couchbase'rs or original CouchOne'rs (Hello to Jan, JChris and Mikeal) in person, but seriously: a lot of people realized that CouchOne stopped being The CouchDB company a long time ago.

This is not to say that the CouchDB project members who are employed by CouchOne/Couchbase are not dedicated to CouchDB. But if I take a look at the mobile strategy and the more or less recent integration of CouchDB with Membase/Memcache, I must notice that these strategies are far away from Apache CouchDB. Big data (whatever that means), to mobile and back.

The conclusion is that the majority of work done will not be merged into Apache CouchDB and this is one of the reasons why the Apache CouchDB project hasn't evolved much in a long time.

Not all changes can go upstream

I realize that when a company has a different strategy, not everything they do can be sent upstream. After all, most if not all companies operate in a world where money is to be made and goals are to be met. Nothing wrong there.

But let's take a look at the one project which could have been dedicated to Apache CouchDB: the documentation project.

CouchOne hired an ex-MySQL'er to write really great documentation for CouchDB. The documentation made sense, it was up to date with releases, contained lots of examples and what not. But it was never contributed to the open source project. The documentation is still online today, though it's now the documentation of the Couchbase Server HTTP API.

Wakey, wakey!

So in my opinion the biggest news is not that Canonical stopped using CouchDB and it's also not outrageous to think that there can be one CouchDB company. The biggest news is that Couchbase officially said: "It's not us!".

Having said that and also not knowing much about Canonical's setup and scale, I still fail to even remotely understand why they didn't work with Cloudant, who specialize in making CouchDB scale, all along.

CouchDB and Evolution

Of course it is unfair to single them (Couchbase employees) out like that. For the record, there are pretty vivid projects such as GeoCouch which are also funded by Couchbase and while being devoted to the project, these guys also have to meet goals for their company.

Add to that, that other CouchDB contributors involved have not driven substantial user-facing changes in Apache CouchDB either. CouchDB is still a very technical project with a lot of technical issues to solve. The upside to this situation is that while other NoSQL vendors add new buzzwords to each and every CHANGELOG, CouchDB is very conservative and stability driven. I appreciate that a lot.

User-facing changes, on the other hand, are just as important for the health of a project. Subtle changes aside, today's talks on, for example, querying CouchDB are extremely similar to the talks given a year or two ago. Whatever happens in this regard is not visible to users at all.

Take URL rewriting, virtualhosts and range queries as examples for features. I question:

  • the usefulness for 80% of the users
  • the rather questionable quality
  • the state of their documentation

Users need to have the ability to grasp what's on the roadmap for CouchDB. There needs to be a way for not so technical users to provide feedback which is actually incorporated into the project. All of these things aside from opening issues in a monster like Jira.

Since no one bothers currently, this is not going to happen soon.

Pretty candid stuff.

Marketing

In terms of marketing and with a lack of an official CouchDB company, the CouchDB project has taken a PostgreSQL-attitude in the last two years.

In a nutshell:

We don't give a damn if you don't realize that our database is better than this other database.

This is a little dangerous for the project itself because when I look at the cash other NoSQL vendors pour into marketing for their NoSQL database, I quickly realize that with the lack of support this project can go away pretty soon.

CouchDB being an Apache project doesn't save me or anyone either: clean intellectual property, deserted, for forever.

The various larger companies (let's say Cloudant and Meebo) are basically occupied with their own forks, with maybe too little reason to merge anything back upstream yet. There are independent contributors such as Enki Multimedia who contribute to core but also to sub-projects like CouchApp.

And then, there's Couchbase, which is trying to tie CouchDB behind Memcached and, from what I can tell, pretty much abandons HTTP and other slower CouchDB principles in the process.

Is CouchDB alive and kicking?

You saw it coming: it depends!

Dear Jan, I'm still thinking about the email you wrote while I write my own blog entry. And honestly, that email and the general response raised more questions for myself and others than it answered.

I'd like to emphasize a difference I see (thanks, Lukas):

Core

Is the core of Apache CouchDB alive? — It's not dead.

  • Yes, because some companies drive a lot of stability into CouchDB.
  • No, because there's little or no innovation happening right now.

Ecosystem

There is a lot of innovation going on in CouchDB's ecosystem.

Most notable, the following projects come to mind:

  • BigCouch
  • Couchappspora
  • CouchDB-lucene
  • Doctrine ODM in PHP (and I'm sure there are similar projects in other languages)
  • ElasticSearch's river
  • erica
  • GeoCouch
  • Lounge (and lode)
  • various JavaScript libraries to connect CouchDB with CouchApps or node.js
  • various open data projects (like refuge.io)

Need more? Check out CouchDB in the wild which I think is more or less up to date.

Hate it or love it: there is plenty of innovation going on. And many (if not all) CouchDB committers are a part of it.

The innovation just doesn't happen in CouchDB's core.

Fin

My closing words are that I don't plan on migrating anywhere else. If anything, we have mostly migrated to BigCouch.

For Apache CouchDB, I think it's important that someone fills that void. That can be either a company, a BDFL or more engaging project leaders (plural). I think this is required so the project continues vividly.

Because I would really like to see the project survive.

by Till Klampaeckel (till@php.net) at December 21, 2011 04:15 PM

December 05, 2011

Klaus Trainer

couchapp-compress

If you've ever wondered whether there's a tool that will automatically compress JavaScript files when you're pushing your CouchApp somewhere, the answer is that there exists at least one: couchapp-compress.

Basically, couchapp-compress is a small Ruby script that wraps the couchapp command line tool. It compresses a CouchApp's JavaScript files and puts them all together into one file, and temporarily changes the CouchApp so that the compressed file is used instead of all the single uncompressed JavaScript files. It pushes the CouchApp and then restores the previous state, so that everything again looks as it did before couchapp-compress was executed.

Check out the README for more details if you're curious and want to give it a try.

by Klaus at December 05, 2011 11:00 AM

November 11, 2011

Chris Strom

CouchDB and Backbone-relational

I ended last night's exploration of Backbone-relational having made quite a bit of headway, but also with a bug. The bug resulted in a duplicate set of empty models in a has-many relationship.

In my case, my Backbone.js calendar application has appointments. Each appointment has many people that are invited ("invitees"). When viewing an appointment on the 17th, there should be 3 invitees. And indeed they show up, but so do three uninvited, no-name phantoms:


To figure this out, I add debugger statements to my calendar application. Lots of debugger statements.

Eventually, I add a debugger statement to my collection's parse() method:

    Invitees = Backbone.Collection.extend({
      model: Models.Invitee,

      url: function( models ) {
        return '/invitees?' + ( models ? 'ids=' + _.pluck( models, 'id' ).join(',') : '' );
      },

      parse: function(response) {
        debugger;
        return _(response.rows).map(function(row) { return row.doc; });
      }
    });
There is nothing actually wrong with that parse statement. It converts the results of a CouchDB query into a list of attributes that can be used to build individual Invitee models. As can be seen from the parse() method, the CouchDB response contains a "rows" attribute with the actual invitee documents / attributes. The "rows" in that response contain several attributes. The only one that is needed to create an Invitee Backbone model is the "doc" attribute, which contains the entire JSON representation of a Person/Invitee:
    {
      "_id": "6bb06bde80925d1a058448ac4d006f6e",
      "_rev": "3-231654f6914afe2e20eb57a41ec8497a",
      "firstName": "Black",
      "lastName": "Francis",
      "type": "Person"
    }
Checking one of the objects in the response in the debugger, I find that, indeed, the person is included in the list of results:


So far so good. This is exactly what I expect and my parse method should work fine. And it does. The problem is that the existing models in the collection look like:


That is, they look like:
    attributes: Object
      id: "6bb06bde80925d1a058448ac4d006f6e"
That is just the model placeholder until the real thing can be fetched from the server. But, when the response is fetched from CouchDB, it comes back as:
    doc: Object
      _id: "6bb06bde80925d1a058448ac4d006f6e"
      _rev: "3-231654f6914afe2e20eb57a41ec8497a"
      firstName: "Black"
      lastName: "Francis"
      type: "Person"
Do you see the problem there? I did not. Not for quite some time. The problem is that the existing placeholder has an "id" attribute, but the replacement from CouchDB has an "_id" attribute. CouchDB puts an underscore in front of the ID to indicate that it is metadata. Far be it from me to argue the wisdom of doing so, but it sure screws things up for me here.

The problem is that, when the model is fetched, it tries to replace the existing model, but there is no existing model. Although "id" and "_id" point to the same ID, they are two different attributes. And so Backbone simply adds the new model to the collection, retaining the placeholder. Hence the duplicates.
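
The effect can be reproduced without Backbone. The sketch below is a simplified stand-in for a collection that merges incoming records by their "id" attribute. The mergeById helper is hypothetical and is not Backbone's actual implementation; it only illustrates why a record carrying "_id" alone never matches its placeholder.

```javascript
// Simplified stand-in for a collection that merges records by "id".
// mergeById is a hypothetical helper, not part of Backbone.
function mergeById(existing, incoming) {
  const result = existing.slice();
  incoming.forEach(function (record) {
    const index = result.findIndex(function (r) {
      return record.id !== undefined && r.id === record.id;
    });
    if (index === -1) {
      result.push(record);      // no match on "id": treated as a new record
    } else {
      result[index] = record;   // match on "id": the placeholder is replaced
    }
  });
  return result;
}

const placeholders = [{ id: '6bb06bde80925d1a058448ac4d006f6e' }];
const fromCouch = [{ _id: '6bb06bde80925d1a058448ac4d006f6e', firstName: 'Black' }];

// The CouchDB document only has "_id", so it never matches the placeholder.
const merged = mergeById(placeholders, fromCouch);
console.log(merged.length);
```

With the placeholder and the fetched document keyed differently, the merged list ends up with both entries, which is the doubled collection seen in the dialog.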

It took me a long time to track that down and to figure it out. As is usually the case, the solution is simple. In parse() I copy the "_id" attribute to "id":
    Invitees = Backbone.Collection.extend({
      // ...
      parse: function(response) {
        return _(response.rows).map(function(row) {
          var doc = row.doc;
          doc['id'] = doc['_id'];
          return doc;
        });
      }
    });
And with that, I have no more phantom invitees:


That was a pain to track down. It is more a function of using CouchDB than anything else, but I do wish I could have figured that out quicker.


Day #200

by Chris Strom (noreply@blogger.com) at November 11, 2011 05:31 AM

November 10, 2011

Chris Strom

Replacing Homespun "Has Many" with Backbone-Relational


I got started with backbone-relational last night. It took me a bit, but I ended up making some progress. Tonight, I hope to actually get it working in my Backbone.js calendar application.

The specific use case for which I need backbone-relational is the calendar appointments in my application. Calendar appointments have many people invited to them:


In JSON, the invitees attribute of this appointment is simply a list of IDs:
    {
      "_id": "6bb06bde80925d1a058448ac4d004fb9",
      "_rev": "2-7fb2e6109fa93284c19696dc89753102",
      "title": "Test #7",
      "description": "asdf",
      "startDate": "2011-11-17",
      "invitees": [
        "6bb06bde80925d1a058448ac4d0062d6",
        "6bb06bde80925d1a058448ac4d006758",
        "6bb06bde80925d1a058448ac4d006f6e"
      ]
    }
And, to get that invitees attribute loading actual Invitee models, I had to add a relations attribute to my Appointment relational-model:
    Appointment = Backbone.RelationalModel.extend({
      // ...
      relations: [
        {
          type: Backbone.HasMany,
          key: 'invitees',
          relatedModel: 'Invitee',
          collectionType: 'Invitees'
        }
      ],
      // ...
    });
At this point, I can load the invitees in the Javascript console, but they are no longer showing up in the appointment dialog. For that, I need to replace the loadInvitees method call from my pre-backbone-relational days with the fetchRelated() method from backbone-relational:
    Appointment = Backbone.RelationalModel.extend({
      // ...
      initialize: function(attributes) {
        if (!this.id)
          this.id = attributes['_id'];

        this.fetchRelated("invitees");
        // this.loadInvitees();
      },
      // ...
    });
With that change, I have an invitees collection in the "invitees" attribute of my model. To make use of that, I pass said collection to the collection view (but only if people have been invited):
    var AppointmentEdit = new (Backbone.View.extend({
      // ...
      showInvitees: function() {
        $('.invitees', this.el).remove();
        $('#edit-dialog').append('<div class="invitees"></div>');

        if (this.model.get("invitees").length == 0) return this;

        var view = new Invitees({collection: this.model.get("invitees")});

        $('.invitees').append(view.render().el);

        return this;
      }
    }));
Amazingly, that works! It turns out that I was not that far away from success last night after all.

Although it does work (the invitees again show up in the appointment dialog), this works because of the somewhat hackish collection fetch() method that I wrote a few nights back. Specifically, the collection overrides fetch to individually retrieve the models specified by the list of IDs, manually triggering the "reset" event when complete.

Looking through the backbone-relational documentation, I see that it recommends mucking with the collection's URL so that it can request multiple IDs from the backend.

First up, my backend. Since I am using CouchDB, I can POST from my node.js backend to CouchDB, requesting all documents with the IDs POSTed in the request body:
    app.get('/invitees', function(req, res){
      var options = {
        method: 'POST',
        host: 'localhost',
        port: 5984,
        path: '/calendar/_all_docs?include_docs=true'
      };

      var couch_req = http.request(options, function(couch_response) {
        console.log("Got response: %s %s:%d%s", couch_response.statusCode, options.host, options.port, options.path);

        couch_response.pipe(res);
      }).on('error', function(e) { /* ... */ });

      var ids = req.param('ids').split(/,/);
      couch_req.write(JSON.stringify({"keys":ids}));

      couch_req.end();
    });
Admittedly, that is somewhat exotic, but it works. Now, I need to be able to GET the /invitees resource with a query parameter containing a comma-separated list of IDs.
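
One thing worth noting about the handler above: if the request arrives without an ids parameter, calling split() on undefined throws. A minimal sketch of a more defensive way to build the POST body follows. The keysBody name is hypothetical; the {"keys": [...]} shape is the one used in the handler.

```javascript
// Turn a comma-separated "ids" query parameter into the JSON body that is
// POSTed to CouchDB's _all_docs resource.
// keysBody is a hypothetical helper name.
function keysBody(idsParam) {
  const ids = (idsParam || '')
    .split(',')
    .filter(function (id) { return id.length > 0; });
  return JSON.stringify({ keys: ids });
}

console.log(keysBody('a1,b2,c3'));
```

A missing or empty parameter then produces an empty keys list rather than an exception.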

As suggested by the backbone-relational documentation, I do that in the url() method of my collection (url can be either property or method in Backbone):
    Invitees = Backbone.Collection.extend({
      model: Models.Invitee,

      url: function( models ) {
        return '/invitees' + ( models ? '?ids=' + _.pluck( models, 'id' ).join(',') : '' );
      },

      parse: function(response) {
        return _(response.rows).map(function(row) { return row.doc;});
      }
    });
(I also have to parse() the results coming back from CouchDB to ensure that they map into an array of Model attributes).
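
For readers without Underscore at hand, the url() logic can be sketched in plain JavaScript. The inviteesUrl name is hypothetical. Unlike the version above, this sketch also treats an empty list as "no IDs", which is an assumption about the desired behavior.

```javascript
// Plain-JavaScript version of the url() logic, without Underscore.
// inviteesUrl is a hypothetical name used only for this sketch.
function inviteesUrl(models) {
  if (!models || models.length === 0) return '/invitees';
  const ids = models.map(function (model) { return model.id; });
  return '/invitees?ids=' + ids.join(',');
}

console.log(inviteesUrl([{ id: 'a1' }, { id: 'b2' }]));
```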

And (again) amazingly, this works. Except... that my collections have now doubled in size with the first bunch of elements being empty:


I do a little bit of digging, but am unable to see an obvious explanation for the phantom invitees. I will pick back up here tomorrow to solve this minor mystery (and hopefully conclude my exploration of backbone-relational).

I am still undecided on backbone-relational at this point. It has certainly guided me to a cleaner solution. But the reliance on global variable definitions for the models and collections remains worrisome. Perhaps another day's exploration will ease my concerns.


Day #199

by Chris Strom (noreply@blogger.com) at November 10, 2011 05:08 AM

October 09, 2011

Chris Strom

Filtering Backbone.js Collections


One of the things lacking in my Backbone.js calendar application is the ability to switch to different months. Currently it simply shows the current month (and does not even display the month name):


So I do a little plain-old Javascript hacking to give me a global function named draw_calendar(), which takes an ISO 8601 year-and-month string and draws the basic calendar (without the Backbone application adding appointments):
    function draw_calendar(year_and_month) {
      $('.year-and-month', 'h1').html(' (' + year_and_month + ') ');
      reset_calendar();
      add_dates_to_calendar(year_and_month);
    };
The first line in there now displays the calendar date:

And, if I set the date to September 2011 in Chrome's Javascript console:
draw_calendar('2011-09');
Then I am treated to a blank September calendar:
Now I just need to teach my Backbone application to filter its Appointments collection. But, first I need the CouchDB backend to be able to support filtering appointments by month.

In the futon admin interface, I define a temporary view that lists documents by the year and month of the appointment's startDate:

This gives me the desired results—keys of the format YYYY-MM and values of the appointment documents:

So I save the view so that I can re-use it in my application:
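
The view definition itself only appeared in a screenshot. The map function below is a guess that is consistent with the YYYY-MM keys and whole-document values seen in the query results, not a copy of the original. The emit() stub exists only so that the sketch runs outside of CouchDB.

```javascript
// Stand-in for CouchDB's emit(), so the map function can run under node.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// A guess at the map function: key on the YYYY-MM prefix of startDate.
function byMonth(doc) {
  if (doc.startDate) {
    emit(doc.startDate.substring(0, 7), doc);
  }
}

byMonth({ _id: 'a1', title: 'Finish Beta', startDate: '2011-10-31' });
byMonth({ _id: 'b2', type: 'Person' }); // no startDate, so nothing is emitted

console.log(emitted[0].key);
```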

Since CouchDB is of the web, I can query my new view with curl to verify that I can find all startDates in the month of 2011-10:
➜  calendar git:(pagination) curl http://localhost:5984/calendar/_design/appointments/_view/by_month\?key\='"2011-10"'
{"total_rows":18,"offset":12,"rows":[
{"id":"4a5600f5b2e36fc99d24fe9b8700037d",
"key":"2011-10",
"value":{
"_id":"4a5600f5b2e36fc99d24fe9b8700037d",
"_rev":"5-181418fffeacdc08b5fdda91525978b6",
"title":"Update dialog errors",
"description":"asdf1",
"startDate":"2011-10-05"}},
{"id":"8b5c80c0211068428272af4784000451",
"key":"2011-10",
"value":{
"_id":"8b5c80c0211068428272af4784000451",
"_rev":"1-87f5be84687eada2cd178c8dab7aa34c",
"title":"Finish Beta",
"description":"Book important",
"startDate":"2011-10-31"}},
{"id":"8b5c80c0211068428272af478400f7bb",
"key":"2011-10",
"value":{
"_id":"8b5c80c0211068428272af478400f7bb",
"_rev":"2-b1ce8c6c815a11f7b78b7db412988451",
"title":"Go to bed early",
"description":"asdf",
"startDate":"2011-10-22"}},
{"id":"8b5c80c0211068428272af478400fc93",
"key":"2011-10",
"value":{
"_id":"8b5c80c0211068428272af478400fc93",
"_rev":"1-b19359520433df557ad2fa0d56165f24",
"title":"Validations",
"description":"description should be required",
"startDate":"2011-10-03"}},
{"id":"956a5c19fd866a6a024bbb4c39002e3b",
"key":"2011-10",
"value":{
"_id":"956a5c19fd866a6a024bbb4c39002e3b",
"_rev":"2-0bbb8660f7f116cc813e0fd9093cec6a",
"title":"Has Description",
"description":"asdf",
"startDate":"2011-10-13"}},
{"id":"956a5c19fd866a6a024bbb4c390031a2",
"key":"2011-10",
"value":{
"_id":"956a5c19fd866a6a024bbb4c390031a2",
"_rev":"2-0568aded9df520a0c1c8bb2ee8961156",
"title":"In-dialog errors","description":"asdf",
"startDate":"2011-10-04"}}
]}
Yay!

I update my backend server, which is effectively middleware between the browser and CouchDB, to request this design document resource rather than the _all_docs resource it had been using.
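
The backend change is not shown. A sketch of how the request path might be built, reusing the view URL from the curl call, could look like the following. The byMonthPath name is hypothetical. The key has to be sent as JSON, hence the quoting and URL-encoding.

```javascript
// Build the path for the by_month view, keyed on a YYYY-MM string.
// CouchDB expects the key as JSON, so the string is quoted and URL-encoded.
// byMonthPath is a hypothetical helper name.
function byMonthPath(yearAndMonth) {
  const key = encodeURIComponent(JSON.stringify(yearAndMonth));
  return '/calendar/_design/appointments/_view/by_month?key=' + key;
}

console.log(byMonthPath('2011-10'));
```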

Teaching the Appointments Backbone collection how to interact with this new resource requires almost no changes at all. Only the parse() method needs to be updated to return appointments from the value attribute of my query:
    var Collections = (function() {
      var Appointments = Backbone.Collection.extend({
        model: Models.Appointment,
        url: '/appointments',
        parse: function(response) {
          return _(response.rows).map(function(row) { return row.value ;});
        }
      });

      return {Appointments: Appointments};
    })();
To initialize the collection, I now need to pass the date query parameter to the Appointments URL. This is done by passing a data option to the collection's fetch() method (just like with jQuery's ajax() call):
    // ...
    // Initialize the app
    var appointments = new Collections.Appointments();

    new Views.Application({collection: appointments});

    var today = new Date(),
        year = today.getFullYear(),
        month = today.getMonth() + 1,
        year_and_month = year + '-' + pad(month);

    appointments.fetch({data: {date: year_and_month}});
With that working, all that remains is the ability to switch months.
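
The pad() helper used when building year_and_month is not shown in the post. One possible definition, along with the date formatting wrapped in a function, is sketched here. Both function bodies are assumptions about what the original helper does.

```javascript
// pad() is not shown in the post; this is one possible definition that
// left-pads a month number to two digits.
function pad(n) {
  return (n < 10 ? '0' : '') + n;
}

// Format a Date as YYYY-MM (getMonth() is zero-based, hence the + 1).
function yearAndMonth(date) {
  return date.getFullYear() + '-' + pad(date.getMonth() + 1);
}

console.log(yearAndMonth(new Date(2011, 8, 15)));
```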

For tonight, I will just try to get this working manually in Chrome's Javascript console. In there I set the variable september to last month's date, then draw_calendar(september) (to get the calendar itself drawn correctly). Lastly, I tell the collection to load in September's appointments via a calendar.appointments.fetch({data: {date: september}}):
Based on the length of the Calendar appointments collections, it seems that 9 appointments hath September and, indeed, they actually show up on the calendar:

Best of all, I can do all of the common operations in my Backbone application like editing:
That is pretty darn cool.

I call it a night here. I will pick back up tomorrow trying to do all of this without resorting to Javascript console hacking.


Day #167

by Chris Strom (noreply@blogger.com) at October 09, 2011 04:01 AM

October 04, 2011

Chris Strom

Set CouchDB Attributes Before Backbone.js Transport


I am still not quite done with my conversion to faye as the persistence layer for my calendar Backbone.js application. Most of the mistakes that I have found so far have been of the basic Backbone implementation variety, not the faye-as-persistence-layer variety. Well tonight, I think that I have a legitimate faye persistence problem.

When I update newly created models from my Backbone application, I am getting HTTP 400 responses from CouchDB:
Got response: 400 localhost:5984/calendar/undefined
{ error: 'bad_request', reason: 'Invalid rev format' }
Hrm... Checking the CouchDB logs, I see that I have at least got the PUT part of the update working:
[info] 127.0.0.1 - - 'PUT' /calendar/undefined 400
The PUT is correct, but the "undefined" is a clear indication that whatever I am updating does not have an "id" attribute.

My first instinct is that my overridden Backbone.sync() is not sending the JSON representation of the model, but rather it is sending the model itself:
    var faye = new Faye.Client('/faye');
    Backbone.sync = function(method, model, options) {
      faye.publish("/calendars/" + method, model);
    }
So adding a toJSON() ought to fix the problem:
    var faye = new Faye.Client('/faye');
    Backbone.sync = function(method, model, options) {
      if (model.toJSON) model = model.toJSON();
      faye.publish("/calendars/" + method, model);
    }
Except it has no effect whatsoever. And really, it should not have an effect. If I am always sending a Backbone model object to my overridden Backbone.sync() method, then something must already be calling toJSON() on my models. Since Backbone.sync() is sending directly to faye, it must be faye that is calling toJSON(). And in fact it is. Looking through the faye source, for each of the various transports, I see Faye.toJSON:
    Faye.Transport.WebSocket = Faye.extend(Faye.Class(Faye.Transport, {
      // ...
      request: function(messages, timeout) {
        this._timeout = this._timeout || timeout;
        this._messages = this._messages || {};
        Faye.each(messages, function(message) {
          this._messages[message.id] = message;
        }, this);
        this.withSocket(function(socket) { socket.send(Faye.toJSON(messages)) });
      },
      // ...
    });
Ah, so I am lucky that Faye has that toJSON() wrapper. That or faye is just a fabulous choice for a Backbone transport. At the very least, it is not the cause of my woes.
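
The same behavior can be seen with the native JSON.stringify(), which calls toJSON() on any object that defines it. The sketch below assumes that Faye.toJSON behaves like JSON.stringify() in this respect. The model object is a minimal stand-in, not a real Backbone model.

```javascript
// A minimal stand-in for a Backbone model: attributes plus a toJSON() method.
const model = {
  attributes: { title: 'Test #7', startDate: '2011-11-17' },
  toJSON: function () {
    return Object.assign({}, this.attributes);
  }
};

// JSON.stringify() calls toJSON() on any object that defines it, so the
// serialized form contains only the attributes, not the wrapper object.
const wire = JSON.stringify(model);
console.log(wire);
```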

Anyhow, I still have my original problem that some update messages do not have the "_id" or "_rev" attributes needed on the server:
    client.subscribe('/calendars/update', function(message) {
      // HTTP request options
      var options = {
        method: 'PUT',
        host: 'localhost',
        port: 5984,
        path: '/calendar/' + message._id,
        headers: {
          'content-type': 'application/json',
          'if-match': message._rev
        }
      };

      // ...
    });
So I finally take the advice of my Recipes with Backbone co-author and put the mapping of the "_id" and "_rev" attributes into the Backbone.sync() method:
    Backbone.sync = function(method, model, options) {
      var message = model.toJSON();
      if (!message._id && message.id) message._id = message.id;
      if (!message._rev && message.rev) message._rev = message.rev;

      faye.publish("/calendars/" + method, message);
    }
I assign an intermediate message variable so that setting attributes does not affect the real model.
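
The mapping can also be pulled out into a small function that is easy to check in isolation. This is only a sketch: toCouchMessage is a hypothetical name, the revision value is made up, and the explicit copy stands in for what toJSON() is relied on to provide.

```javascript
// Copy "id"/"rev" into the underscore-prefixed names CouchDB expects,
// working on a copy so the caller's object is left untouched.
// toCouchMessage is a hypothetical helper name.
function toCouchMessage(attributes) {
  const message = Object.assign({}, attributes);
  if (!message._id && message.id) message._id = message.id;
  if (!message._rev && message.rev) message._rev = message.rev;
  return message;
}

// The rev value here is invented for illustration.
const attrs = { id: '8b5c80c0211068428272af478400df1e', rev: '1-abc', title: 'New' };
const message = toCouchMessage(attrs);
console.log(message._id, message._rev);
```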

With that, I can now update my newly created model as many times as I like:
Best of all, there is nothing but beautiful HTTP 201's from CouchDB:
Got response: 201 localhost:5984/calendar/8b5c80c0211068428272af478400df1e
{ ok: true,
id: '8b5c80c0211068428272af478400df1e',
rev: '4-2929cf0e4933a4e32474145eb3e79f02' }
That is a fine stopping point for tonight. I think that I might have resolved all of my faye transport issues (and issues that I noticed after better testing with faye than in the original). I will do some more monkey testing tomorrow and then possibly move on to other areas to explore.


Day #151

by Chris Strom (noreply@blogger.com) at October 04, 2011 04:09 AM

October 02, 2011

Chris Strom

Deleting Backbone.js Records with Faye as the Persistence Layer


To date, I have enjoyed decent success replacing the persistence layer in my Backbone.js calendar application. With only a few hiccups, I have replaced the normal REST-based persistence layer with faye pub-sub channels. The hope is that this will make it easier to respond to asynchronous change from other sources.

So far, I am able to fetch and populate a calendar appointment collection. I can also create new appointments. Now, I really need to get deleting appointments working:
Thanks to replacing Backbone.sync() in my Backbone application, I already have delete requests being sent to the /calendars/delete faye channel:
    var faye = new Faye.Client('/faye');
    Backbone.sync = function(method, model, options) {
      faye.publish("/calendars/" + method, model);
    }
Thanks to simple client logging:
    _(['create', 'update', 'delete', 'read', 'changes']).each(function(method) {
      faye.subscribe('/calendars/' + method, function(message) {
        console.log('[/calendars/' + method + ']');
        console.log(message);
      });
    });
...I can already see that, indeed, deletes are being published as expected:
To actually delete things from my CouchDB backend, I have to subscribe to the /calendars/delete faye channel in my express.js server. I already have this working for adding appointments, so I can adapt the overall structure of the /calendars/add listener for /calendars/delete:
    client.subscribe('/calendars/delete', function(message) {
      // HTTP request options
      var options = {...};

      // The request object
      var req = http.request(options, function(response) {...});

      // Rudimentary connection error handling
      req.on('error', function(e) {...});

      // Send the request
      req.end();
    });
The HTTP options for the DELETE request are standard node.js HTTP request parameters:
    // HTTP request options
    var options = {
      method: 'DELETE',
      host: 'localhost',
      port: 5984,
      path: '/calendar/' + message._id,
      headers: {
        'content-type': 'application/json',
        'if-match': message._rev
      }
    };
Experience has taught me that I need to send the CouchDB revisions along with operations on existing records. The if-match HTTP header ought to work. The _rev attribute on the record/message sent to /calendars/delete holds the latest revision that the client holds. Similarly, I can delete the correct object from CouchDB by specifying the object ID in the path attribute—the ID coming from the _id attribute of the record/message.

That should be sufficient to delete the record from CouchDB. To tell my Backbone app to remove the deleted element from the UI, I need to send a message back on a separate faye channel. The convention that I have been following is to send requests to the server on a channel named after a CRUD operation and to send responses back on a channel named after the Backbone collection method to be used. In this case, I want to send back the deleted record on the /calendars/remove channel.
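
The channel naming convention can be written down as a small lookup table. In this sketch, the delete/remove pairing comes from this post, read/reset and update/changes come from the neighboring posts, and create/add is an assumption. The function names are hypothetical.

```javascript
// Request channels are named after the CRUD operation; reply channels are
// named after what the client should do with the reply.
// The "create" -> "add" pairing is an assumption.
const replyMethod = {
  create: 'add',
  read: 'reset',
  update: 'changes',
  delete: 'remove'
};

function requestChannel(method) {
  return '/calendars/' + method;
}

function replyChannel(method) {
  return '/calendars/' + replyMethod[method];
}

console.log(requestChannel('delete'), '->', replyChannel('delete'));
```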

To achieve this, I do the normal node.js thing of passing an accumulator callback to http.request(). This accumulator callback accumulates chunks of the reply into a local data variable. When all of the data has been received, the response is parsed as JSON (of course it's JSON—this is a CouchDB data store), and the JSON object is sent back to the client:
    var req = http.request(options, function(response) {
      console.log("Got response: %s %s:%d%s", response.statusCode, options.host, options.port, options.path);

      // Accumulate the response and publish when done
      var data = '';
      response.on('data', function(chunk) { data += chunk; });
      response.on('end', function() {
        var couch_response = JSON.parse(data);

        console.log(couch_response);

        client.publish('/calendars/remove', couch_response);
      });
    });
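
The accumulate-then-parse step can be exercised on its own with node's EventEmitter standing in for the HTTP response. This is a sketch under that assumption. The collectJson name is hypothetical, and the try/catch around JSON.parse() is an addition that the handler above does not have.

```javascript
const EventEmitter = require('events');

// Accumulate 'data' chunks from a response-like emitter and hand the
// parsed JSON to the callback once 'end' fires.
// collectJson is a hypothetical helper name.
function collectJson(response, callback) {
  let data = '';
  response.on('data', function (chunk) { data += chunk; });
  response.on('end', function () {
    let parsed;
    try {
      parsed = JSON.parse(data);
    } catch (e) {
      return callback(e);
    }
    callback(null, parsed);
  });
}

// Simulate a CouchDB reply arriving in two chunks.
const fakeResponse = new EventEmitter();
let result = null;
collectJson(fakeResponse, function (err, body) { result = body; });
fakeResponse.emit('data', '{"ok":true,');
fakeResponse.emit('data', '"id":"abc"}');
fakeResponse.emit('end');
console.log(result);
```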
Before hooking a Backbone subscription to this /calendars/remove channel, I supply a simple logging tracer bullet:
    faye.subscribe('/calendars/remove', function(message) {
      console.log('[/calendars/remove]');
      console.log(message);
    });
So, if I have done this correctly, clicking the delete icon in the calendar UI should send a message on the /calendars/delete channel (which we have already seen working above). The faye subscription on the server should be able to use this to remove the object from the CouchDB database. Finally, this should result in another message being broadcast on the /calendars/remove channel. This /calendars/remove message should be logged in the browser. So let's see what actually happens...
Holy cow! That actually worked.

If I reload the web page, I see one less fake appointment on my calendar. Of course, I would like to have the appointment removed immediately. Could it be as simple as sending that JSON message as an argument to the remove() method of the collection?
    faye.subscribe('/calendars/remove', function(message) {
      console.log('[/calendars/remove]');
      console.log(message);

      calendar.appointments.remove(message);
    });
Well, no. It is not that simple. That has no effect on the page. But wait...

After digging through the Backbone code a bit, I think it ought to work. The remove() method looks up models to be deleted via get():
    _remove : function(model, options) {
      options || (options = {});
      model = this.getByCid(model) || this.get(model);
      // ...
    }
The get() method looks up models by the id attribute:
    // Get a model from the set by id.
    get : function(id) {
      if (id == null) return null;
      return this._byId[id.id != null ? id.id : id];
    },
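
The lookup accepts either a bare ID or an object that carries an "id" property, which is why a plain message object can be passed to remove(). A standalone sketch of the same expression, with a hand-built index standing in for the collection's internal _byId map and made-up sample data:

```javascript
// Simplified version of the lookup: models are indexed by id, and get()
// accepts either a bare id or an object that carries an "id" property.
const byId = {
  'abc123': { id: 'abc123', title: 'Fake appointment' }
};

function get(id) {
  if (id == null) return null;
  return byId[id.id != null ? id.id : id];
}

console.log(get('abc123') === get({ id: 'abc123', ok: true }));
```

An object that only carries "_id" would not be found this way, since the expression never looks at that property.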
The id attribute was set on the message/record received on the /calendars/remove channel:
So why is the UI not being updated by removing the appointment from the calendar?

After a little more digging, I realize that it is being removed from the collection, but the necessary events are not being generated to trigger the Backbone view to remove itself. The events that the view currently subscribes to are:
    var Appointment = Backbone.View.extend({
      initialize: function(options) {
        // ....
        options.model.bind('destroy', this.remove, this);
        options.model.bind('error', this.deleteError, this);
        options.model.bind('change', this.render, this);
      },
      // ...
    });
It turns out that the destroy event is only emitted if the default Backbone.sync is in place. If the XHR DELETE is successful, a success callback is fired that emits destroy. Since I have replaced Backbone.sync, that event never fires.
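
A simplified model of that flow, not Backbone's actual code, makes the difference visible. Here destroy() hands a success callback to whatever sync function is in place, and only that callback records the "destroy" event. All names in the sketch are hypothetical.

```javascript
// Simplified model of the flow described above: destroy() wraps the
// success callback, and only that callback triggers the "destroy" event.
function makeModel(sync) {
  const events = [];
  return {
    events: events,
    destroy: function () {
      const options = {
        success: function () { events.push('destroy'); }
      };
      sync('delete', this, options);
    }
  };
}

// A sync that reports success, as the default XHR-based one does.
function syncWithCallback(method, model, options) {
  options.success({ ok: true });
}

// A sync that only publishes and never calls options.success.
function publishOnlySync(method, model, options) {
  // faye.publish(...) would go here
}

const a = makeModel(syncWithCallback);
const b = makeModel(publishOnlySync);
a.destroy();
b.destroy();
console.log(a.events, b.events);
```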

What does fire is the remove event. So all I need to change in order to make this work is to replace destroy with remove:
    var Appointment = Backbone.View.extend({
      initialize: function(options) {
        // ....
        options.model.bind('remove', this.remove, this);
        options.model.bind('error', this.deleteError, this);
        options.model.bind('change', this.render, this);
      },
      // ...
    });
And it works! I can now remove all of those fake appointments:
Ah. Much better.

It took a bit of detective work, but, in the end, not much really needed to change.

I really need to cut back on my chain posts so that I can focus on writing Recipes with Backbone. So I sincerely hope that lessons learned tonight will make updates easier tomorrow. Fingers crossed.


Day #147

by Chris Strom (noreply@blogger.com) at October 02, 2011 09:19 PM

Faye as the Persistence Layer in Backbone.js


Yesterday I was able to override the sync() method in my Backbone.js model to achieve an added layer of persistence. In addition to the normal REST persistence, my model also persists newly created appointments on a Faye pub-sub channel:
    var Appointment = Backbone.Model.extend({
      urlRoot : '/appointments',
      initialize: function(attributes) {
        // ...
        this.faye = new Faye.Client('/faye');
      },
      // ...
      sync: function(method, model, options) {
        if (method == "create") {
          this.faye.publish("/calendars/public", model);
        }
        Backbone.sync.call(this, method, this, options);
      }
    });
As my esteemed Recipes with Backbone co-author pointed out yesterday, it might make sense to switch entirely to Faye for persistence. It is hard for me to wrap my brain around all of the implications for such a change. At the very least, it is going to break my tests, which stub out XHR REST calls (via sinon.js). That aside, will it clean up my backend code?

Only one way to find out and that is to get started. So, in my browser code, I redefine Backbone.sync() to send any sync requests for create, update, delete or read to a faye channel named accordingly:
    var faye = new Faye.Client('/faye');
    Backbone.sync = function(method, model, options) {
      faye.publish("/calendars/" + method, model);
    }

    // Simple logging of Backbone sync messages
    _(['create', 'update', 'delete', 'read']).each(function(method) {
      faye.subscribe('/calendars/' + method, function(message) {
        console.log('[/calendars/' + method + ']');
        console.log(message);
      });
    });
With that, when I reload my funky calendar Backbone application, I see an empty calendar:
There ought to be 10 appointments on that calendar. I just switched persistence transports, so a few other things need to change as well. To figure out where to start, I check Chrome's Javascript console. There, I see that the request for "read" did go out:
That read request comes when the application is initialized—which includes a fetch() of the collection:
    // Initialize the app
    var appointments = new Collections.Appointments;

    new Views.Application({collection: appointments});
    appointments.fetch();
It can be argued that I should not be fetching here, which requires a round trip to the server. The Backbone documentation itself suggests fetching the data in the backend (node.js / express.js in my case). The data can then be interpolated into the page as a Backbone reset() call. Personally, I prefer serving up a static file that shows something almost immediately followed by quick requests to populate the page with useful, actionable information.

To get "actionable" stuff in my currently empty calendar, I need something on the server side to reply to the request on the /calendars/read channel. Doing faye things on the server is relatively easy. I already have the faye node.js adapter hooked into my express.js application. I can then call getClient() to gain access to client actions like subscribe():
    // Faye server
    var bayeux = new faye.NodeAdapter({mount: '/faye', timeout: 45});
    bayeux.attach(app);

    // Faye clients
    var client = bayeux.getClient();

    client.subscribe('/calendars/read', function() {
      // do awesome stuff here
    });
Now, when the client receives a message on the "read" channel, I can do awesome stuff. In this case, I need to read from my CouchDB backend store:
    client.subscribe('/calendars/read', function() {
      // CouchDB connection options
      var options = {
        host: 'localhost',
        port: 5984,
        path: '/calendar/_all_docs?include_docs=true'
      };

      // Send a GET request to CouchDB
      var req = http.get(options, function(couch_response) {
        console.log("Got response: %s %s:%d%s", couch_response.statusCode, options.host, options.port, options.path);

        // Accumulate the response and publish when done
        var data = '';
        couch_response.on('data', function(chunk) { data += chunk; });
        couch_response.on('end', function() {
          var all_docs = JSON.parse(data);
          client.publish('/calendars/reset', all_docs);
        });
      });

      // If anything goes wrong, log it (TODO: publish to the /errors ?)
      req.on('error', function(e) {
        console.log("Got error: " + e.message);
      });
    });
This is a bit more work than with the normal REST interface. With pure REST, I could make the request to CouchDB and pipe() the response back to the client. Backbone (more accurately jQuery) itself takes care of parsing the JSON. Here, I have to accumulate the data response from CouchDB and parse it into a JSON object to be published on a Faye channel. I could send back a JSON string, requiring the client to parse, but that feels like bad form. Faye channels can transmit actual data structures, so that is what I ought to do.

Anyhow, I publish to the /calendars/reset channel because that is what the client will do with this information, namely reset the currently empty appointments collection:
    window.calendar = new Cal();

    faye.subscribe('/calendars/reset', function(all_docs) {
      console.log('[/calendars/reset]');
      console.log(all_docs);

      calendar.appointments.reset(all_docs);
    });
Upon reloading the page, however, I still see no appointments on the calendar. In the Javascript console, I can see that the /calendars/read message is still going out. I also see that I am getting a response back that includes the ten appointments already scheduled for this month:
So the message is coming back over the /calendars/reset channel as expected. It is the CouchDB query results as expected, but something is going wrong in the call to reset() on the appointments collection. Probably, something related to the "Uncaught ReferenceError: description is not defined" error message at the bottom of the Javascript console.

Digging through the Backbone code a bit (have I mentioned how nice it is to read that?), I find that calls to reset() or add() need to be run through parse() first. Well, not always, just when parse() does something with the data. Something like I had to do with the CouchDB results:
    var Appointments = Backbone.Collection.extend({
      model: Models.Appointment,
      parse: function(response) {
        return _(response.rows).map(function(row) { return row.doc ;});
      }
    });
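
To make the mismatch concrete, here is a sketch of what an _all_docs?include_docs=true response looks like and what the parse step reduces it to. The IDs, revisions and titles are made up. The parseRows name is hypothetical and uses plain Array.prototype.map instead of Underscore.

```javascript
// Shape of an _all_docs?include_docs=true response (values are made up).
const response = {
  total_rows: 2,
  offset: 0,
  rows: [
    { id: 'a1', key: 'a1', value: { rev: '1-x' }, doc: { _id: 'a1', _rev: '1-x', title: 'One' } },
    { id: 'b2', key: 'b2', value: { rev: '1-y' }, doc: { _id: 'b2', _rev: '1-y', title: 'Two' } }
  ]
};

// Same idea as parse(): keep only the "doc" of each row.
function parseRows(res) {
  return res.rows.map(function (row) { return row.doc; });
}

const docs = parseRows(response);
console.log(docs.map(function (d) { return d.title; }));
```

Passing the raw response to reset() would hand the collection a single wrapper object rather than a list of appointment attributes.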
Anyhow, the fix is easy enough—just run the results through parse():
    faye.subscribe('/calendars/reset', function(message) {
      console.log('[/calendars/reset]');
      console.log(message);

      var all_docs = calendar.appointments.parse(message);
      calendar.appointments.reset(all_docs);
    });
With that, I have my calendar appointments again populating my calendar:
Only now they are being populated via Faye with an assist from overriding Backbone's sync() function.

On the plus side, it was relatively easy to swap out the entire persistence layer in Backbone. A simple (and in this case very small) override of Backbone.sync() did the trick. On the minus side, I had to do a little more work to convert CouchDB responses into real Javascript data structures. That is not a huge negative (and one that I can easily push into a helper function). Still outstanding is how this will affect the entire application. Also, I have the feeling that I could choose faye channel names better. Questions for another day...


Day #143

by Chris Strom (noreply@blogger.com) at October 02, 2011 08:42 PM

Backbone.js Updates with Faye as the Persistence Layer


I need to focus on writing Recipes with Backbone tonight, but I still hope to build some on the progress from last night. My efforts to switch to faye as the persistence layer in my Backbone.js calendar application have gone quite well to date. I can create, delete and read objects over faye at this point. So all that remains is update.

The Backbone view code from prior to the persistence layer switch still works, so I can still open an edit dialog to make changes:
Deciding that I should be more humble in my plea to the coding gods, I change the description from "dammit" to "please". Clicking OK seemingly updates the appointment on my calendar (mouseovers reveal the description). Even attempting to re-edit the appointment includes the updated description:
So is that it? Does it just work?

Of course not. I have not subscribed to the /calendars/update faye channel on my backend. My debug subscription in the client verifies that the message is being published to that channel:
So all I ought to need is to add a backend subscription to that channel. This follows a node.js / express.js pattern that has become familiar over the past few nights:
    client.subscribe('/calendars/update', function(message) {
      // HTTP request options
      var options = {...};

      // The request object
      var req = http.request(options, function(response) {...});

      // Rudimentary connection error handling
      req.on('error', function(e) {...});

      // Write the PUT body and send the request
      req.write(JSON.stringify(message));
      req.end();
    });
The pattern is to set HTTP options, here a PUT, to update the existing record:
    // HTTP request options
    var options = {
      method: 'PUT',
      host: 'localhost',
      port: 5984,
      path: '/calendar/' + message._id,
      headers: {
        'content-type': 'application/json',
        'if-match': message._rev
      }
    };
(the if-match is a CouchDB optimistic locking thing)

Next, I build the http request object which includes a response handler callback. This callback parses the JSON response from CouchDB and sends it back on the /calendars/changes channel:
    // The request object
    var req = http.request(options, function(response) {
      console.log("Got response: %s %s:%d%s", response.statusCode, options.host, options.port, options.path);

      // Accumulate the response and publish when done
      var data = '';
      response.on('data', function(chunk) { data += chunk; });
      response.on('end', function() {
        var couch_response = JSON.parse(data);
        client.publish('/calendars/changes', couch_response);
      });
    });
Last, I send the message/record via the request object and close the request so that the CouchDB server knows that I have no more HTTP PUT data to send:
    // Write the PUT body and send the request
    req.write(JSON.stringify(message));
    req.end();
If all goes according to plan, the messages that I already know are being sent on /calendars/update will be seen by my server-side subscription, which will tell CouchDB to update the record and finally the browser will see the update on the /calendars/changes channel.

And that is exactly what happens. The PUT is logged as a successful HTTP 201 response from CouchDB:
Got response: 201 localhost:5984/calendar/66543e3457df7597f0e41764e500067c
{ ok: true,
id: '66543e3457df7597f0e41764e500067c',
rev: '3-243c4d7084fcdcb7c728917a73e94b97' }
And I even see that response back in the browser:
Nice!

That almost seems too easy. And sadly, it is. If I try to make another change on the same record, the CouchDB updates fail with an HTTP 409 / Document Conflict:
Got response: 409 localhost:5984/calendar/66543e3457df7597f0e41764e500067c
{ error: 'conflict',
reason: 'Document update conflict.' }
This is because the revision ID that is stored in the Backbone model is now out of date. I need to take the revision returned from the first update and ensure that the model becomes aware of it. Otherwise, CouchDB's optimistic locking kicks in, rejecting the update.
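
The 201-then-409 sequence can be reproduced with a toy, in-memory stand-in for the database. This is only a sketch of the optimistic-locking idea: ToyStore and its put() method are hypothetical names, and real CouchDB computes its revision strings differently.

```javascript
// Toy stand-in for a CouchDB database (an assumption for illustration only;
// real CouchDB derives revision hashes differently).
function ToyStore() {
  this.docs = {};   // id -> { rev: "N-...", body: {...} }
}

// Mimics PUT /db/:id with an If-Match header.
ToyStore.prototype.put = function(id, ifMatchRev, body) {
  var current = this.docs[id];
  if (current && current.rev !== ifMatchRev) {
    return { status: 409, error: "conflict", reason: "Document update conflict." };
  }
  var generation = current ? parseInt(current.rev, 10) + 1 : 1;
  var rev = generation + "-toy";
  this.docs[id] = { rev: rev, body: body };
  return { status: 201, ok: true, id: id, rev: rev };
};

var store = new ToyStore();
var created = store.put("appt", undefined, { description: "dammit" });

// The client-side model remembers the revision it was loaded with.
var model = { _id: "appt", _rev: created.rev, description: "dammit" };

// The first update succeeds, but the model still holds the old revision...
var first = store.put(model._id, model._rev, { description: "please" });

// ...so a second update with the stale revision is rejected.
var second = store.put(model._id, model._rev, { description: "pretty please" });

// Copying the returned revision into the model lets the next update through.
model._rev = first.rev;
var third = store.put(model._id, model._rev, { description: "pretty please" });

console.log(first.status, second.status, third.status); // 201 409 201
```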

True to my word, I call it a night here. I will pick back up tomorrow solving this last mystery. Then, perhaps, some refactoring because this code is extremely soggy (i.e. not DRY).




Day #149

by Chris Strom (noreply@blogger.com) at October 02, 2011 08:40 PM

Worky Faye Updates with Backbone.js

Up tonight, I hope to get multiple updates working with my Backbone.js calendar application when using faye as the persistence layer.

When I make changes, I am sending the update as a message from my Backbone application to the /calendars/update faye channel. The updated information is sent back from the server on the /calendars/changes faye channel:
The problem is that I am not doing anything with that change information. Ordinarily that would not be a problem, but my backend storage is CouchDB. CouchDB really needs that rev attribute. If a subsequent update is sent with the old rev, CouchDB's optimistic locking will kick in and reject the update:
Got response: 409 localhost:5984/calendar/cd823d4aaaf358069f9a800410000b95
{ error: 'conflict',
reason: 'Document update conflict.' }
So, in my client, I subscribe to the /calendars/changes channel. In the subscription's callback, I use the revision published by the server to update the Backbone model:
    faye.subscribe('/calendars/changes', function(message) {
      console.log('[/calendars/changes]');
      console.log(message);

      var model = calendar.appointments.get(message);
      model.set({rev: message.rev});
      model.set({_rev: message.rev});
    });
With that, I can update an appointment. And update it again... and it works:
Got response: 201 localhost:5984/calendar/cd823d4aaaf358069f9a800410000b95
{ ok: true,
id: 'cd823d4aaaf358069f9a800410000b95',
rev: '7-55b3b95b6996072aa1519b6070cf12b6' }

Got response: 201 localhost:5984/calendar/cd823d4aaaf358069f9a800410000b95
{ ok: true,
id: 'cd823d4aaaf358069f9a800410000b95',
rev: '8-a06160ea8bef50bd71d92a73db61c14b' }
Yay! Except... immediately after I see the successful update, I see:
Got response: 201 localhost:5984/calendar/cd823d4aaaf358069f9a800410000b95
{ ok: true,
id: 'cd823d4aaaf358069f9a800410000b95',
rev: '8-a06160ea8bef50bd71d92a73db61c14b' }
Got response: 409 localhost:5984/calendar/cd823d4aaaf358069f9a800410000b95
{ error: 'conflict',
reason: 'Document update conflict.' }
When I update again, I successfully update the record, which is followed immediately by two conflicts:
Got response: 201 localhost:5984/calendar/cd823d4aaaf358069f9a800410000b95
{ ok: true,
id: 'cd823d4aaaf358069f9a800410000b95',
rev: '9-c09e4a0a8bbd5cc25a9e67b19b6a254b' }
Got response: 409 localhost:5984/calendar/cd823d4aaaf358069f9a800410000b95
{ error: 'conflict',
reason: 'Document update conflict.' }
Got response: 409 localhost:5984/calendar/cd823d4aaaf358069f9a800410000b95
{ error: 'conflict',
reason: 'Document update conflict.' }
Ah. I know what that is. And I will fix it. But first, I need to finish proofreading the recipes that taught me how to solve it.

Recipes with Backbone. Going alpha tonight.


Day #149

by Chris Strom (noreply@blogger.com) at October 02, 2011 08:37 PM

September 27, 2011

Damien Katz

Become a Distributed Database Expert (or just look like one)

At Couchbase we are looking for experienced hackers to help us build the fastest, most reliable distributed database on the planet. You don't need to be an expert already, but you should be ready to learn the ins and outs of distributed database systems, including:

  • Distributed Systems
  • Systems Resource Management: io (disk, network), cpu, memory usage
  • Maximizing Throughput and Minimizing Latency
  • Functional programming
  • Systems Reliability
  • Network Programming
  • Profiling, Benchmarking and Optimization
  • Cluster and Network Topology
  • Replication and Logical Sync
  • Distributed Data modeling
  • Embedded and Mobile software

More info here: http://www.couchbase.com/company/jobs Or you can send your resume and qualifications to me here: damien@couchbase.com

by Damien Katz at September 27, 2011 10:08 PM

September 24, 2011

Damien Katz

Re: Data sync


>On Sep 23, 2011, at 1:40 AM, XXXX XXXXX wrote:
>
>Hi Damien,
>
>Greeting from XXXXX XXXXXX;
>
>Im running a small company with history in the mobile enterprise space
>
>We are just about to get some seed funding to build sqllite sync
>technology for mobile devices;
>
>I came across CouchBase extremely cool;
>
>We are planning to offer some of same features;
>
>Offline access
>Smart sync
>Bandwidth optimisation
>
>It would be good to get any advice or pointers you might have in
>terms of building sync technology for mobile
>
>All the best,
>
>XXXX XXXXX,

Hello! I would say that mobile sync is a deceptively hard problem if you want to get all the nice properties. I suggest you look at how Couchbase replication works and try to duplicate it, and ideally, try to interoperate with it.

Some of the properties you probably want:

Incremental replication - The ability to stop and restart replication and not lose all your progress. Vital in a mobile environment where connections are slow and flaky.

Concurrency - You want to be able to use both the local and the remote databases while they are being sync'd/replicated, with no global locking. That way the app is usable at all times and syncs in the background.

Conflict management - You need a plan for how you'll deal with and manage edit conflicts.

Partial replication - Having replicas that only hold an interesting subset of other replicas. Important when sharing a large data set of which mobile clients only need a portion.

Ad hoc Topology - Couchbase supports ad hoc topology: any machine can sync with any other machine without prior knowledge. This is much more flexible than a single centralized sync point or fixed topology. Though many deployments will only need a single sync point, often new ones will need to be added.

Schema upgrade - Couchbase is schemaless, so it's easy to add new fields/properties without breaking things. If using a schema, it's difficult to upgrade remote clients when they have new data in older schemas, etc.

Security - The ability to refuse updates if they come from unauthorized sources.


Anyway, Couchbase and CouchDB have worked out these problems and are successful in production on millions of machines. It's not the only way to build a sync scheme, but it's one of the most successful.

-Damien

by Damien Katz at September 24, 2011 06:26 PM

September 21, 2011

Henri Bergius

September 20, 2011

Volker Mische

FOSS4G 2011: Report

The FOSS4G 2011 is over now. Time for a small report. The crowd was amazing and it was again the ultimate gathering of the Free and Open Source for Geospatial developer tribe. Solid presentations and great evenings.

My talk: The State of GeoCouch

I'm really happy with how my talk went, I really enjoyed it. There were lots of people (although there was a talk from Frank Warmerdam at the same time) asking interesting questions at the end.

The talk is not only about GeoCouch but also gives you an overview of some of the features it leverages from Apache CouchDB. In the end you should have an overview why you might want to use GeoCouch for your next project.

You can get the slides right here.

Other talks

I was happy to see that there was another talk about GeoCouch. Other talks I really enjoyed were:

And of course there were also great talks in the plenary sessions, from Paul Ramsey about Why do you do that? An exploration of open source business models and Schuyler Erle's so funny lightning talk about Pivoting to Monetize Mobile Hyperlocal Social Gamification by Going Viral.

Code Sprint

At the code sprint I was working on MapQuery together with Steven Ottens and Justin Penka. Steven was working on TMS support, Justin on a 6-minute tutorial and I on making it possible to add features manually.

The OpenLayers developers did the migration from Subversion to Git for their development. OpenLayers is now available on Github.

And luckily there was a fire alarm in between to take a group photograph.

Future of the FOSS4G

I really hope there won't be a yearly FOSS4G conference for the whole of the US. There should be regional events, as I think one big one would draw the attention away from the international conference. Why should you fly to Beijing for the FOSS4G 2012 if you can meet the majority of the developers in the US as well?

Final words

The FOSS4G was great. It was organized well and people were always out in the evenings. The only minor nitpick is that many people working remotely had the city of their company on their name badge and not the one they live in. It seems that the original form you had to fill in was confusing. So for next year it should perhaps say “Location where you live”. Hence I still don't believe that there were more Dutch than German people at the conference (Tik hem aan, ouwe! ;)

by Volker Mische at September 20, 2011 02:11 PM

September 08, 2011

Chris Strom

Overriding Model.get in Backbone.js

When I create a new appointment in my Backbone.js calendar, it saves just fine. If I reload the page or check in the CouchDB backend, the appointment persists. But, if I try to delete the appointment immediately after I create it, I get an error.

The specific error is an HTTP 409, which indicates some kind of conflict. In CouchDB, this usually means that I have forgotten to include a revision number. But wait... I am including the revision number on DELETE:
    window.Appointment = Backbone.Model.extend({
      // ...
      destroy: function() {
        Backbone.Model.prototype.destroy.call(this, {
          headers: {'If-Match': this.get("_rev")}
        });
      }
    });
In fact, that works when I delete a pre-existing record. It is just newly created records that throw me for a loop. So what gives?

The trouble turns out to be how CouchDB represents IDs, revisions and other meta data about the documents that it stores. On create, CouchDB returns:
{"ok":true,"id":"7acf98778a669f4d6fc33d6b340106de","rev":"1-21662a1368aa1592d1e5d1df710f6d8c"}
But the actual data is stored as:
{
"_id": "7acf98778a669f4d6fc33d6b340106de",
"_rev": "1-21662a1368aa1592d1e5d1df710f6d8c",

"title": "Delete me #2",
"description": "asdf",
"startDate": "2011-09-15"
}
CouchDB normally represents meta data with a leading underscore ("_id", "_rev"). In the POST / create response, however, the ID and revision returned are not meta-data. Rather they are the actual data returned describing the newly created record.

The problem is that Backbone slurps the CouchDB response directly into the model's attributes. This means that appointment.get("_rev") will not work but appointment.get("rev") will.

Now, I could change my delete code to get _rev or rev:
    destroy: function() {
      Backbone.Model.prototype.destroy.call(this, {
        headers: {'If-Match': this.get("_rev") || this.get("rev")}
      });
    }
The problem with this approach is twofold. First, I have to remember to do this everywhere that I want to access the revision (which will definitely be necessary when I add updates). The other is that I have to remember CouchDB's meta-data policy any time I want to access these attributes.

I think, ideally, I would like to call this.get("rev") and have it just work, regardless of update, create or delete.

This turns out to be relatively easy with the pseudo sub-class method override suggested in Backbone's documentation:

    window.Appointment = Backbone.Model.extend({
      get: function(attribute) {
        return Backbone.Model.prototype.get.call(this, attribute) ||
               Backbone.Model.prototype.get.call(this, "_" + attribute);
      },
      // ...
    });
In my new get() method, I call the get() method directly on the Backbone.Model.prototype. Since I am invoking it directly, and not as a method on an instantiated object, I have to supply the object context to be used inside the method. After all, the get() method expects to be called on an object and, as such, it expects the this variable to refer to that object.

Not coincidentally, the Javascript call() method does just this—it sets the this variable inside the function to the first argument supplied. In this case, I supply the this variable from my Appointment model. So, in the end, the original get() is called with the same this variable with which it would have otherwise been called.

The difference is that I can make two calls—one with the normal attribute (e.g. "rev") and the second with an underscore prepended to it (e.g. "_rev").
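
The fallback can be tried outside of Backbone with a stripped-down stand-in for the model class. In this sketch BaseModel is a hypothetical substitute for Backbone.Model that only stores attributes; the override itself has the same shape as the one above.

```javascript
// Stand-in for Backbone.Model, reduced to attribute storage (an assumption
// for illustration; the real class does much more).
function BaseModel(attributes) {
  this.attributes = attributes || {};
}
BaseModel.prototype.get = function(attribute) {
  return this.attributes[attribute];
};

// Subclass whose get() falls back to the underscore-prefixed attribute.
function Appointment(attributes) {
  BaseModel.call(this, attributes);
}
Appointment.prototype = Object.create(BaseModel.prototype);
Appointment.prototype.constructor = Appointment;
Appointment.prototype.get = function(attribute) {
  // call() runs the original get() with `this` set to the current instance
  return BaseModel.prototype.get.call(this, attribute) ||
         BaseModel.prototype.get.call(this, "_" + attribute);
};

var loaded  = new Appointment({ _id: "abc", _rev: "3-def" }); // fetched document
var created = new Appointment({ id: "xyz", rev: "1-uvw" });   // create response

console.log(loaded.get("rev"));   // 3-def
console.log(created.get("rev"));  // 1-uvw
```

One thing to keep in mind with this approach: because the fallback uses ||, an attribute whose value is legitimately falsy (0, an empty string, false) also triggers the second, underscore-prefixed lookup.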

With that, I can change my destroy() method to work with both CouchDB updates and deletes:

    window.Appointment = Backbone.Model.extend({
      get: function(attribute) {
        return Backbone.Model.prototype.get.call(this, attribute) ||
               Backbone.Model.prototype.get.call(this, "_" + attribute);
      },
      destroy: function() {
        Backbone.Model.prototype.destroy.call(this, {
          headers: {'If-Match': this.get("rev")}
        });
      },
      // ...
    });
If I try to destroy a pre-existing record, the first Backbone.Model.prototype.get.call() in my get() will return undefined but the second one ("_rev") will return the current revision number for the record.

If I try to destroy a record created after page load, then the first Backbone.Model.prototype.get.call() will return the revision ID.

I do not know how often I will be connecting to a CouchDB backend as I work with Backbone. Still, it is comforting knowing that workarounds like this are fairly straightforward with Backbone.


Day #134

by Chris Strom (noreply@blogger.com) at September 08, 2011 04:12 AM

September 05, 2011

Chris Strom

Backbone.js: Telling the View to Delete the Model, Which Tells the View to Delete Itself

I think that I have more or less figured out how to delete things in Backbone.js. I also have a halfway decent, event driven means of removing deleted things from the UI. But, to date, I am doing all of that deleting in the Javascript console. Tonight, I would like to add a UI element to do the deleting.

Adding the UI element is easy enough. I add an "X" inside a <span> with a class of "delete" to my calendar event template:
    <script type="text/template" id="calendar-event-template">
      <span class="event" title="<%= description %>">
        <%= title %>
        <span class="delete">X</span>
      </span>
    </script>
On the page, the "X" displays like:
To hook that "X" up to a function call, I need to add a click event to the View. For now, I stick with tracer bullets:
    window.EventView = Backbone.View.extend({
      // ...
      events: {
        'click .delete': 'test'
      },
      test: function() {console.log("delete")}
    });
So, when a click event is received in this view for an element with the delete class, the "test" function is called. Sure enough, clicking on the X inside the delete <span> logs "delete" messages to the console:
Cool beans, but that is not the final target I hope to hit. So, I replace the test function with a handler for the "click .delete" event:
    window.EventView = Backbone.View.extend({
      // ...
      events: {
        'click .delete': 'deleteClick'
      },
      deleteClick: function() {
        this.model.destroy();
      },

      remove: function() {
        $(this.el).find('.event').remove();
      }
    });
It feels a little awkward having a remove() method and a deleteClick(). The former removes the UI element from the page. The latter handles clicks that should signal the model to delete itself, which will, in turn, tell the view to remove itself from the page. I will worry about the odd feeling another day. For now, I am not quite done with my delete.

I am telling a CouchDB store to delete a record. Since CouchDB uses optimistic locking, I need to supply the revision ID when deleting the record. The revision is already stored in the model, so I have been deleting like this:
e.destroy({headers: {'If-Match':e.model.get("_rev")} })
It seems really wrong to me that the View should be responsible for knowing about this. But how to get the model to do this? I could create a new destroyWithRevision method on the model, but the view would still need to know to call this instead of the conventional destroy() method.

Luckily, Backbone.js does support overriding methods and calling the superclass's original method:

    window.Event = Backbone.Model.extend({
      // ...
      destroy: function() {
        Backbone.Model.prototype.destroy.call(this, {
          headers: {'If-Match': this.get("_rev")}
        });
      }
    });
That is slick. I call the destroy function that resides on the Backbone.Model prototype. Since I am invoking that function directly, I need to supply an object instance so that the method has a this (or self if you're a Rubyist) to which it can refer. That is the first argument to a Javascript call method. Then I can supply the arguments that inform CouchDB of the revision being deleted.

Slick indeed. Now, when the view tells the model to delete itself, it can call the very conventional destroy() method—completely unaware of this complexity. As an added bonus, I am not losing any of the benefits of optimistic locking—if the loaded model was superseded before the user clicked the "X", the delete would fail. And yes, when I click the little "X", the calendar event goes away from the UI:
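
One possible refinement, sketched here under assumptions: the override above replaces whatever options the caller passed to destroy(), so callbacks handed in by a view would be dropped. The variant below merges the If-Match header into the caller's options before delegating. BaseModel and CalendarEvent are hypothetical stand-ins so that the sketch runs without Backbone; BaseModel's destroy() only records the options it receives.

```javascript
// Stand-in for Backbone.Model#destroy (an assumption for illustration): it
// just records the options it would hand to the persistence layer.
function BaseModel(attributes) {
  this.attributes = attributes || {};
}
BaseModel.prototype.get = function(name) { return this.attributes[name]; };
BaseModel.prototype.destroy = function(options) {
  this.lastDestroyOptions = options;
  return this;
};

function CalendarEvent(attributes) { BaseModel.call(this, attributes); }
CalendarEvent.prototype = Object.create(BaseModel.prototype);
CalendarEvent.prototype.constructor = CalendarEvent;

// Add the If-Match header, but keep whatever options the caller passed in
// (for example success or error callbacks).
CalendarEvent.prototype.destroy = function(options) {
  var merged = {};
  for (var key in (options || {})) { merged[key] = options[key]; }
  merged.headers = { 'If-Match': this.get("_rev") };
  return BaseModel.prototype.destroy.call(this, merged);
};

var onError = function() {};
var calendarEvent = new CalendarEvent({ _id: "abc", _rev: "2-def" });
calendarEvent.destroy({ error: onError });

console.log(calendarEvent.lastDestroyOptions.headers); // { 'If-Match': '2-def' }
```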


That is a good stopping point for tonight. Up tomorrow, I think that my partner in crime on the Recipes with Backbone book has given me some food for thought on how to improve my view.


Day #130

by Chris Strom (noreply@blogger.com) at September 05, 2011 07:32 PM

jQuery UI and Backbone.js

Before doing anything else with my little Backbone.js calendar application, I would like to be able to add new appointments / calendar events. I have been doing that via the CouchDB backend and it is getting a bit old. Besides, with the new month, most of my events have disappeared:
But how to add these appointments?

I rather fancy a jQuery-ui modal dialog box that pops up when I click on the appropriate day. But, I have no idea where to hook the jQuery-ui dialog into my Backbone app...

First things first, I download and install jQuery-ui (the javascript and the theme css) and add it to my Jade layout template:
    !!!
    html
      head
        title= title
        link(rel='stylesheet', href='/stylesheets/style.css')
        link(rel='stylesheet', href='/stylesheets/blitzer/jquery-ui.css')
        script(src='/javascripts/jquery.min.js')
        script(src='/javascripts/jquery-ui.min.js')
        script(src='/javascripts/underscore.js')
        script(src='/javascripts/backbone.js')
      body!= body
Next, I create a very simple dialog in the Jade template:
    #dialog(title="Add calendar event")
      #calendar-event-start-date
      p title
      p
        input#calendar-event-title(type="text", name="title")
      p description
      p
        input#calendar-event-description(type="text", name="description")
I have the intention of eventually grabbing the values for appointments from the two dialog fields and from the #calendar-event-start-date <div> (which I will populate from the date clicked). But before I reach that point, I need to make this a jQuery-ui dialog:
    script
      $(function() {
        $('#dialog').dialog({
          autoOpen: false,
          modal: true,
          buttons: [
            { text: "OK",
              click: function() { $(this).dialog("close"); } },
            { text: "Cancel",
              click: function() { $(this).dialog("close"); } }
          ]
        });
      });
So far, there is absolutely nothing Backbone-y about this. I change that by adding an AppView Backbone View class:

    window.AppView = Backbone.View.extend({
      el: $("#dialog"),
      events: {
        'click .ok': 'create'
      },
      create: function() {
        console.log("here");
        Events.create({
          title: "foo",
          description: "bar",
          startDate: "2011-09-01"});
      }
    });

    window.AppView = new AppView;
After reloading the page, I open that dialog from the Javascript console:
$('#dialog').dialog('open')
And I am greeted with a right proper jQuery-ui dialog:
Unfortunately, when I click the "OK" button, nothing happens. Well, the dialog closes (the behavior specified in my jQuery-ui dialog() invocation), but a new Event is not created. Even the console.log() statement is not reached.

Hrm...

Eventually, I track this down to two things. First, I need to set the el attribute to the dialog's parent:
    window.AppView = Backbone.View.extend({
      el: $("#dialog").parent(),
      // ...
    });
This way, the wrapper divs added by jQuery-ui become the element for this view. Also, I need to add a class to the OK button:

    script
      $(function() {
        $('#dialog').dialog({
          autoOpen: false,
          modal: true,
          buttons: [
            { text: "OK",
              class: "ok",
              click: function() { $(this).dialog("close"); } },
            { text: "Cancel",
              click: function() { $(this).dialog("close"); } }
          ]
        });
      });
With that, I reach my console.log statement and I try to create my event:


Well, once I create the backend POST route, I ought to be able to create appointments.

The POST route in my express.js needs to POST the submitted JSON to CouchDB as 'application/json' data. Thus, my POST route is:
    app.post('/events', function(req, res){
      var options = {
        method: 'POST',
        host: 'localhost',
        port: 5984,
        path: '/calendar',
        headers: {'content-type': 'application/json'}
      };

      var couch_req = http.request(options, function(couch_response) {
        console.log("Got response: %s %s:%d%s", couch_response.statusCode, options.host, options.port, options.path);

        couch_response.pipe(res);
      }).on('error', function(e) {
        console.log("Got error: " + e.message);
      });

      couch_req.write(JSON.stringify(req.body));
      couch_req.end();
    });
Aside from the headers and the write() of the JSON data to the CouchDB request, the remainder of this route looks very similar to stuff that I have been writing for GETs and DELETEs over the past few days. It may be time to investigate adding an abstraction layer in my express app. Another day, perhaps.

With the backend POST route, I am able to create appointments. I still have a bunch of cleanup to do in this, but I think I am off to a good start. I will pick back up here tomorrow.

Day #130

by Chris Strom (noreply@blogger.com) at September 05, 2011 07:31 PM

Error Handling in Backbone.js

I was able to eliminate the last little oddity in my Backbone.js appointment calendar application last night. At this point I am able to add and delete (though not update) calendar appointments. I am getting to the point that the overall codebase leaves much to be desired. Before I begin refactoring, I notice yet another bug...

If I add an appointment to my calendar:

Then save it, all is well:
The appointment shows up on the calendar and persists on reload.

If I don't reload the page and delete the appointment by clicking the "X" icon, it is removed:

If I now reload the page, the appointment is back from the great beyond:

So what gives? A quick check of the error logs reveals that the calendar appointment was created (HTTP 201). But when I tried to delete the record, there was a 409 response from my CouchDB backend:
Got response: 201 localhost:5984/calendar
Got response: 409 localhost:5984/calendar/7acf98778a669f4d6fc33d6b3400e480
Got response: 200 localhost:5984/calendar/_all_docs?include_docs=true
There are, in fact, two bugs here. The first is that my Backbone app is not sending the revision number of the newly created appointment when it comes time to delete the record. That is a somewhat understandable oversight on my part. What is not so OK is the lack of error handling that I have built. The frontend responded to the 409 as if nothing went wrong: the appointment was removed from the calendar anyway.

Taking a look at the delete route in my express.js server, I have:
    app.delete('/appointments/:id', function(req, res){
      var options = { /* Connection Options */ };

      var couch_req = http.request(options, function(couch_response) {
        console.log("Got response: %s %s:%d%s", couch_response.statusCode, options.host, options.port, options.path);

        couch_response.pipe(res);
      }).on('error', function(e) {
        console.log("Got error: " + e.message);
      });

      couch_req.end();
    });
Interesting. I had expected the 409 response from CouchDB to be considered an error by node.js's http.request(). But I am not seeing the "Got error" message logged. I am seeing the "Got response" message:
Got response: 409 localhost:5984/calendar/7acf98778a669f4d6fc33d6b3400e480
Ah, looking at the http.request documentation, I see that:
If any error is encountered during the request (be that with DNS resolution, TCP level errors, or actual HTTP parse errors) an 'error' event is emitted on the returned request object.
The failure here is not a connection error and not technically a parse error, so I suppose that the error event should not be fired after all.

Checking out the Network tab in Chrome's Developer Tools, I see:
Hrm... the response being sent back from the node.js app is a 200 OK:
HTTP/1.1 200 OK
X-Powered-By: Express
Connection: keep-alive
Transfer-Encoding: chunked
Looking at the actual body of the response, however, there clearly was an error:
{"error":"conflict","reason":"Document update conflict."}
Well, I have the correct 409 statusCode in the couch_response already. It seems that the solution here is simple enough. I set the HTTP response from my express.js app to be that of the CouchDB response that I am proxying:
    // ...
    var couch_req = http.request(options, function(couch_response) {
      console.log("Got response: %s %s:%d%s", couch_response.statusCode, options.host, options.port, options.path);

      res.statusCode = couch_response.statusCode;
      couch_response.pipe(res);
    }). // ...
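
The status propagation can be isolated into a small helper and exercised without a running CouchDB. This is a sketch under assumptions: proxyCouchResponse is a hypothetical name, and the two fake objects are hand-rolled stubs that only mimic the statusCode, pipe() and end() members of Node's response objects.

```javascript
// Copy the upstream status onto the outgoing response before piping the body.
// Works with any objects that look like Node's http.IncomingMessage
// (statusCode, pipe) and http.ServerResponse (statusCode).
function proxyCouchResponse(couchResponse, res) {
  res.statusCode = couchResponse.statusCode;
  couchResponse.pipe(res);
}

// Hand-rolled stubs so the sketch runs without a CouchDB server
// (hypothetical objects, for illustration only).
var fakeCouchResponse = {
  statusCode: 409,
  body: '{"error":"conflict","reason":"Document update conflict."}',
  pipe: function(destination) { destination.end(this.body); }
};
var fakeRes = {
  statusCode: 200,   // what gets sent when nothing else is set
  sent: null,
  end: function(body) { this.sent = body; }
};

proxyCouchResponse(fakeCouchResponse, fakeRes);
console.log(fakeRes.statusCode, fakeRes.sent);
```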
Now, when I delete, the response back in the browser is the expected 409:
But that is not quite the end of it. Although the appointment is no longer removed from the UI, there is no visual indication to the user of why this occurred. Clicks on the "X" icon now seemingly have no effect.

Well, the model is receiving the 409 error, but the view needs to be told of the fact. Can it be as easy as subscribing the view to an error event from the model?
    window.AppointmentView = Backbone.View.extend({
      initialize: function(options) {
        this.container = $('#' + this.model.get('startDate'));
        options.model.bind('destroy', this.remove, this);
        options.model.bind('error', this.deleteError, this);
      },
      deleteError: function(model, error) {
        // TODO: blame the user instead of the programmer...
        if (error.status == 409) {
          alert("This site does not understand CouchDB revisions.");
        }
        else {
          alert("This site was made by an idiot.");
        }
      },
      // ...
    });
Yup. It's exactly that easy. Now, when I click delete, an alert pops up informing me that I'm an idiot:
That's a good stopping point for tonight. Up tomorrow, I will fix the 409 error itself and (assuming I do not uncover yet another defect) start to refactor a bit.


Day #133

by Chris Strom (noreply@blogger.com) at September 05, 2011 07:29 PM

September 04, 2011

Ricky Ho

Recommendation Engine

In a classical model of a recommendation system, there are "users" and "items". Users have associated metadata (or content) such as age, gender, race and other demographic information. Items also have metadata such as text description, price, weight, etc. On top of that, there are interactions (or transactions) between users and items, such as userA downloading or purchasing movieB, or userX giving a rating of 5 to productY.



Now, given all the metadata of users and items, as well as their interactions over time, can we answer the following questions?
  1. What is the probability that userX purchases itemY?
  2. What rating will userX give to itemY?
  3. What are the top k unseen items that should be recommended to userX?
Content-based Approach
In this approach, we make use of the metadata to categorize users and items and then match them at the category level. One example is recommending jobs to candidates: we can do an IR/text search to match the user's resume with the job descriptions. Another example is recommending an item that is "similar" to one that the user has purchased. Similarity is measured according to the item's metadata, and various distance functions can be used. The goal is to find the k nearest neighbors of the item we know the user likes.

Collaborative Filtering Approach
In this approach, we look purely at the interactions between user and item, and use that to perform our recommendation. The interaction data can be represented as a matrix.


Notice that each cell represents the interaction between a user and an item. For example, the cell can contain the rating that the user gives to the item (in which case the cell is a numeric value), or the cell can be just a binary value indicating whether the interaction between user and item has happened (e.g. a "1" if userX has purchased itemY, and "0" otherwise).

The matrix is also extremely sparse, meaning that most of the cells are unfilled. We need to be careful about how we treat these unfilled cells. There are 2 common ways:
  • Treat these unknown cells as "0", making them equivalent to the user giving a rating of "0". This may or may not be a good idea, depending on your application scenario.
  • Guess what the missing value should be. For example, to guess how userX will rate itemA given that we know his rating of itemB, we can look at all users (or those in the same age group as userX) who have rated both itemA and itemB, then compute their average ratings. Use the average ratings of itemA and itemB to interpolate userX's rating of itemA from his rating of itemB.
User-based Collaboration Filter
In this model, we do the following
  1. Find a group of users that is “similar” to user X
  2. Find all movies liked by this group that hasn’t been seen by user X
  3. Rank these movies and recommend to user X

This introduces the concept of user-to-user similarity, which is basically the similarity between 2 row vectors of the user/item matrix. To compute the K nearest neighbors of a particular user, a naive implementation is to compute the "similarity" to all other users and pick the top K.

Different similarity functions can be used. Jaccard similarity is defined as the number of movies that both users have seen (the intersection) divided by the number of movies that either of them has seen (the union). Pearson similarity first normalizes each user's ratings (by subtracting the user's mean rating) and then computes the cosine similarity.
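The two similarity functions can be sketched as follows. This is a minimal illustration, assuming each user is stored as a sparse object of the form { itemId: rating }; the names jaccard, pearson, userX and userY are made up for the example, and the Pearson variant here centers on the mean over co-rated items only, which is one of several common choices.

```javascript
// Each user is a sparse row of the user/item matrix: { itemId: rating }.

// Jaccard similarity: |items both have seen| / |items either has seen|.
function jaccard(a, b) {
  const itemsA = Object.keys(a);
  const itemsB = new Set(Object.keys(b));
  const both = itemsA.filter((item) => itemsB.has(item)).length;
  const either = itemsA.length + itemsB.size - both;
  return either === 0 ? 0 : both / either;
}

// Pearson similarity: center each user's ratings on their mean (taken over
// the co-rated items here), then take the cosine of the centered vectors.
function pearson(a, b) {
  const common = Object.keys(a).filter((item) => item in b);
  if (common.length === 0) return 0;
  const mean = (r) => common.reduce((s, item) => s + r[item], 0) / common.length;
  const meanA = mean(a);
  const meanB = mean(b);
  let dot = 0, normA = 0, normB = 0;
  for (const item of common) {
    const da = a[item] - meanA;
    const db = b[item] - meanB;
    dot += da * db;
    normA += da * da;
    normB += db * db;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

const userX = { m1: 5, m2: 3, m3: 1 };
const userY = { m1: 4, m2: 2, m4: 5 };
console.log(jaccard(userX, userY)); // 2 shared movies out of 4 seen by either
console.log(pearson(userX, userY)); // both rate m1 above m2
```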

There are two problems with this approach
  1. Comparing userX and userY is expensive as they have millions of attributes
  2. Finding the top k similar users to userX requires computing the similarity between userX and every other user
Locality Sensitive Hashing and Minhash
To resolve problem 1, we approximate the similarity using a cheap estimation function, called minhash. The idea is to find a hash function h() such that the probability of h(userX) = h(userY) is proportional to the similarity of userX and userY. If we can find 100 such h() functions, we can just count the number of functions for which h(userX) = h(userY) to determine how similar userX is to userY. The idea is depicted as follows ...


Permuting the rows is expensive if the number of rows is large. Remember that the purpose of h(c1) is to return the row number of the first row (in permuted order) in which column c1 is 1. So we can scan each row of c1 to see if it is 1; if so, we apply a function newRowNum = hash(rowNum) to simulate a permutation, and keep the minimum of the newRowNum values seen so far.

As an optimization, instead of doing one column at a time, we can do it a row at a time. The algorithm is as follows
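A minimal sketch of the row-at-a-time computation, under some assumptions: the matrix is stored as a list of rows, each listing the columns (users) that hold a 1, and the hash family h(r) = (a*r + b) mod 5 with fixed coefficients is a hypothetical choice that happens to permute the 5 row numbers of this toy example. The function names are made up for illustration.

```javascript
// rows[r] lists the columns (users) that have a 1 in row r (item r).
function minhashSignatures(rows, numCols, hashFns) {
  // sig[i][c] = smallest hashFns[i](r) over all rows r where column c is 1.
  const sig = hashFns.map(() => new Array(numCols).fill(Infinity));
  rows.forEach((cols, r) => {
    // Hash the row number once per hash function (the simulated permutation).
    const hashed = hashFns.map((h) => h(r));
    for (const c of cols) {
      hashed.forEach((value, i) => {
        if (value < sig[i][c]) sig[i][c] = value;
      });
    }
  });
  return sig;
}

// Fraction of hash functions on which two columns agree.
function estimateSimilarity(sig, c1, c2) {
  const agree = sig.filter((row) => row[c1] === row[c2]).length;
  return agree / sig.length;
}

// Hypothetical hash family for a 5-row matrix: h(r) = (a*r + b) mod 5.
const hashFns = [[1, 1], [3, 1], [2, 4], [4, 0]].map(
  ([a, b]) => (r) => (a * r + b) % 5
);

// 5 items x 3 users; e.g. item 0 was seen by users 0 and 1.
const rows = [[0, 1], [2], [1, 2], [0, 1], [2]];
const sig = minhashSignatures(rows, 3, hashFns);
console.log(sig);
console.log(estimateSimilarity(sig, 0, 2)); // users 0 and 2 share no items
```

With only four hash functions the estimate is very coarse; the text above suggests something on the order of 100.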


To solve problem 2, we need to avoid computing every other user's similarity to userX. The idea is to hash users into buckets such that similar users fall into the same bucket. Then, instead of comparing against all users, we only compute the similarity of those users who are in the same bucket as userX.

The idea is to horizontally partition each column (a user's minhash signature) into b bands, each with r rows, and hash each band into a bucket. By picking the parameters b and r, we can control the likelihood (as a function of their similarity) that two users fall into the same bucket in at least one band.
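The banding idea can be sketched like this. The layout sig[i][u] (the i-th minhash value of user u) and all function names are assumptions made for the example; the bucket key is simply the band's values joined into a string rather than a real hash. If two signatures agree on a fraction s of their rows, the chance that they collide in at least one band is 1 - (1 - s^r)^b.

```javascript
// Probability that two users whose signatures agree on a fraction s of rows
// share a bucket in at least one of b bands of r rows: 1 - (1 - s^r)^b.
function candidateProbability(s, b, r) {
  return 1 - Math.pow(1 - Math.pow(s, r), b);
}

// Bucket key of user u in one band: the r signature values of that band.
function bandKey(sig, u, band, r) {
  return sig.slice(band * r, (band + 1) * r).map((row) => row[u]).join(',');
}

// One bucket table per band, mapping band key -> users with that key.
function buildBuckets(sig, numUsers, b, r) {
  const bands = [];
  for (let band = 0; band < b; band++) {
    const buckets = new Map();
    for (let u = 0; u < numUsers; u++) {
      const key = bandKey(sig, u, band, r);
      if (!buckets.has(key)) buckets.set(key, []);
      buckets.get(key).push(u);
    }
    bands.push(buckets);
  }
  return bands;
}

// Users that share a bucket with userX in at least one band.
function candidatesFor(sig, bands, r, userX) {
  const found = new Set();
  bands.forEach((buckets, band) => {
    for (const u of buckets.get(bandKey(sig, userX, band, r))) {
      if (u !== userX) found.add(u);
    }
  });
  return [...found].sort((x, y) => x - y);
}

// 4 minhash rows x 3 users, split into b = 2 bands of r = 2 rows.
const sig = [[1, 1, 0], [0, 0, 2], [0, 0, 1], [0, 0, 1]];
const bands = buildBuckets(sig, 3, 2, 2);
console.log(candidatesFor(sig, bands, 2, 0)); // only user 1 collides with user 0
console.log(candidateProbability(0.8, 20, 5), candidateProbability(0.2, 20, 5));
```

Only the candidates returned for userX need a full similarity computation.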


Item-based Collaboration Filter
If we transpose the user/item matrix and do the same thing, we can compute the item to item similarity. In this model, we do the following ...
  1. Find the set of movies that user X likes (from interaction data)
  2. Find a group of movies that are similar to the set of movies that we know user X likes
  3. Rank these movies and recommend to user X

It turns out that the item-based collaborative filter has more benefits than computing user-to-user similarity, for the following reasons ...
  • The number of items is typically smaller than the number of users
  • While a user's taste changes over time, so the user similarity matrix needs to be updated more frequently, item-to-item similarity tends to be more stable and requires fewer updates.
Singular Value Decomposition
If we look back at the matrix, we can see that matrix multiplication is equivalent to mapping an item from the item space to the user space. In other words, if we view each of the existing items as an axis in the user space (notice that each user is a vector of their ratings on existing items), then multiplying a new item by the matrix gives a vector of the same form as a user. We can then compute the dot product of this projected new item with a user to determine their similarity. It turns out that this is equivalent to mapping the user to the item space and computing the dot product there.


In other words, multiplying by the matrix is equivalent to mapping between item space and user space. Now let's imagine there is a hidden concept space in between. Instead of jumping directly from user space to item space, we can think of jumping from user space to a concept space, and then to the item space.


Notice that here we first map the user space to the concept space and also map the item space to the concept space. Then we match both user and item at the concept space. This is a generalization of our recommender.

We can use SVD to factor the matrix into 2 parts. Let P be the m by n matrix (m rows and n columns). P = UDV where U is an m by m matrix whose columns are the eigenvectors of P*transpose(P), and V is an n by n matrix whose rows are the eigenvectors of transpose(P)*P. D is an m by n diagonal matrix containing the singular values of P, which are the square roots of the eigenvalues of P*transpose(P) (which has the same non-zero eigenvalues as transpose(P)*P).

In other words, we can decompose P into U*squareroot(D) and squareroot(D)*V.

Notice that D can be thought of as the strength of each "concept" in the concept space, and its values are ordered by decreasing magnitude. If we remove some of the weakest concepts by setting them to zero, we reduce the number of non-zero elements in D, which effectively generalizes the concept space (makes it focus on the important concepts).
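As a worked form of this truncation, written in the same P = UDV notation (the symbols k, U_k, D_k, V_k and d_c are introduced here only for illustration): keep the k strongest concepts and score user u against item i in the concept space.

```latex
P \approx P_k = U_k D_k V_k,
\qquad
\hat{p}_{ui} = \sum_{c=1}^{k} U_{uc}\, d_c\, V_{ci}
             = \sum_{c=1}^{k} \left(U_{uc}\sqrt{d_c}\right)\left(\sqrt{d_c}\, V_{ci}\right)
```

Here U_k holds the first k columns of U, V_k the first k rows of V, and d_1 >= ... >= d_k are the k largest diagonal entries of D. The second form matches the split into U*squareroot(D) and squareroot(D)*V: each user and each item becomes a k-dimensional vector in the concept space, and the predicted affinity is their dot product.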

Calculating the SVD of a matrix with large dimensions is expensive. Fortunately, if our goal is to compute an SVD approximation (with k non-zero diagonal values), we can use the random projection mechanism as described here.

Association Rule Based
In this model, we use the market-basket association rule algorithm to discover rules like ...
{item1, item2} => {item3, item4, item5}

We represent each user as a basket and each viewing as an item (notice that we ignore the rating and use a binary value). After that, we use an association rule mining algorithm to detect frequent item sets and the association rules. Then, for each user, we match the user's previously viewed items against the set of rules to determine what other movies we should recommend.
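A small sketch of the two measures that rule mining usually relies on, support and confidence, plus the matching step. It assumes baskets are plain arrays of item ids and that the rules have already been mined by some frequent item set algorithm (not shown); all names are hypothetical.

```javascript
// Each basket is the list of movies one user has viewed (ratings ignored).

// support(items) = fraction of baskets containing every item in the set.
function support(baskets, items) {
  const hits = baskets.filter((basket) =>
    items.every((item) => basket.includes(item))
  ).length;
  return hits / baskets.length;
}

// confidence(lhs => rhs) = support(lhs and rhs together) / support(lhs).
function confidence(baskets, lhs, rhs) {
  const lhsSupport = support(baskets, lhs);
  return lhsSupport === 0 ? 0 : support(baskets, lhs.concat(rhs)) / lhsSupport;
}

// Recommend the right-hand side of every rule whose left-hand side
// the user has fully viewed, skipping movies already seen.
function recommend(rules, viewed) {
  const out = new Set();
  for (const { lhs, rhs } of rules) {
    if (lhs.every((item) => viewed.includes(item))) {
      rhs.filter((item) => !viewed.includes(item)).forEach((item) => out.add(item));
    }
  }
  return [...out];
}

const baskets = [['a', 'b', 'c'], ['a', 'b'], ['a', 'c'], ['b', 'c']];
console.log(support(baskets, ['a', 'b']));           // 2 of 4 baskets
console.log(confidence(baskets, ['a', 'b'], ['c'])); // 1 of those 2 also has c
console.log(recommend([{ lhs: ['a', 'b'], rhs: ['c'] }], ['a', 'b']));
```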

Evaluate the recommender
After we have a recommender, how do we evaluate its performance?

The basic idea is to separate the data into a training set and a test set. For the test set, we remove certain user-to-movie interactions (change certain cells from 1 to 0), pretending the user hasn't seen those items. Then we use the training set to train a recommender and feed the test set (with the removed interactions) to the recommender. The performance is measured by how much the recommended items overlap with the ones that we have removed. In other words, a good recommender should be able to recover the set of items that we have removed from the test set.
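The overlap measure can be sketched as a simple recall computation. This is only an illustration: holdOut hides every n-th interaction deterministically (a real evaluation would more likely sample at random), and both function names are made up here.

```javascript
// Hide every n-th interaction of a user; the rest is fed to the recommender.
function holdOut(items, n) {
  const kept = [];
  const hidden = [];
  items.forEach((item, idx) => ((idx + 1) % n === 0 ? hidden : kept).push(item));
  return { kept, hidden };
}

// Fraction of the hidden items that the recommender managed to recover.
function recall(recommended, hidden) {
  if (hidden.length === 0) return 0;
  const hits = hidden.filter((item) => recommended.includes(item)).length;
  return hits / hidden.length;
}

const { kept, hidden } = holdOut(['a', 'b', 'c', 'd'], 2);
console.log(kept, hidden);
// Pretend a recommender, given `kept`, returned these items:
console.log(recall(['b', 'x'], hidden));
```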

Leverage tagging information on items
In some cases, items have explicit tags associated with them (we can consider the tags a user-annotated concept space added to the items). Consider each item to be described by a vector of tags. Users can then be auto-tagged based on the items they have interacted with. For example, if userX purchases itemY, which is tagged with Z1 and Z2, then the user will increase Z1 and Z2 in her existing tag vector. We can use a time decay mechanism to update the user's tag vector as follows ...

current_user_tag = alpha * item_tag + (1 - alpha) * prev_user_tag

To recommend items to the user, we simply calculate the top k items by computing the dot product (i.e. the cosine similarity, when the vectors are normalized) of the user tag vector and each item tag vector.
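A minimal sketch of the update rule and the top-k lookup, assuming tag vectors are sparse objects of the form { tag: weight }; the function names and the sample catalog are hypothetical.

```javascript
// current_user_tag = alpha * item_tag + (1 - alpha) * prev_user_tag
function updateUserTags(prevUserTags, itemTags, alpha) {
  const tags = new Set([...Object.keys(prevUserTags), ...Object.keys(itemTags)]);
  const next = {};
  for (const tag of tags) {
    next[tag] = alpha * (itemTags[tag] || 0) + (1 - alpha) * (prevUserTags[tag] || 0);
  }
  return next;
}

// Dot product of two sparse tag vectors.
function dot(a, b) {
  return Object.keys(a).reduce((sum, tag) => sum + a[tag] * (b[tag] || 0), 0);
}

// Ids of the k items whose tag vectors score highest against the user.
function topK(userTags, items, k) {
  return Object.keys(items)
    .map((id) => ({ id, score: dot(userTags, items[id]) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.id);
}

// userX already leans towards Z1, then purchases an item tagged Z1 and Z2.
const userTags = updateUserTags({ Z1: 1 }, { Z1: 1, Z2: 1 }, 0.5);
const catalog = { A: { Z1: 1 }, B: { Z2: 1 }, C: { Z3: 1 } };
console.log(userTags);
console.log(topK(userTags, catalog, 2));
```

A larger alpha makes recent purchases dominate; a smaller one makes the user's tag vector change more slowly.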

by Ricky Ho (noreply@blogger.com) at September 04, 2011 11:35 PM

September 01, 2011

Chris Strom

A Simple Node.js + CouchDB Calendar


Last night, I got a nice, little node.js and CouchDB app thrown together. The node (really express.js) app serves up simple HTML and also passes through requests to the CouchDB database. Tonight I would like to serve up a static HTML calendar and populate events from the CouchDB store.

For as long as I have been building HTML calendars, I have always put the ISO8601 date on the day cells. In Jade templating this looks like:
...
tr#week4
  td.sunday
  td.monday 22
  td.tuesday
  td.wednesday
  td#2011-08-25.thursday
  td#2011-08-26.friday
  td.saturday
...
The resultant HTML is then:
And the resultant page looks like:
The benefits of using ISO 8601 are numerous, which is why it is the de facto standard for XML and JSON dates and times. Since CouchDB is returning JSON, I can be pretty sure that it will be returning ISO 8601 (especially since I created the data in the first place). By identifying the date cells by ISO 8601, it will make it easy to tie date records to date cells. Let's have a look at what I mean...

Accessing the /events resource in my app returns:
{"total_rows":2,"offset":0,"rows":[

{"id":"fdbed27594feb433c74e82eb910015e0",
"key":"fdbed27594feb433c74e82eb910015e0",
"value":{"rev":"2-b7c22d428e648a6cdd2978c213f79ec0"},
"doc":{"_id":"fdbed27594feb433c74e82eb910015e0",
"_rev":"2-b7c22d428e648a6cdd2978c213f79ec0",
"startDate":"2011-08-25",
"title":"create blog post",
"description":"talk about node and CouchDB"}},
{"id":"fdbed27594feb433c74e82eb91001f45",
"key":"fdbed27594feb433c74e82eb91001f45",
"value":{"rev":"1-2b18432cf6e63b82c6507ff28af9724c"},
"doc":{"_id":"fdbed27594feb433c74e82eb91001f45",
"_rev":"1-2b18432cf6e63b82c6507ff28af9724c",
"startDate":"2011-08-26",
"title":"blog again",
"description":"add backbone into the node + couch mix"}}
]}
(this is just a pass-thru to CouchDB's _all_docs?include_docs=true)

To get those events into the calendar, I perform a jQuery getJSON call inside a document-ready:
$(function() {
  $.getJSON('/events', function(data) {
    $.each(data.rows, function(i, rec) { add_event(rec.doc) });
  });
});
For each of the rows in the events returned from CouchDB, I extract the document (the event itself) and make a call to add_event().

The add_event() function then exploits the fact that the cells in my calendar are identified with an ISO 8601 date:
function add_event(event) {
  var date = event.startDate,
      title = event.title,
      description = event.description;

  $('#' + date).html(
    '<span title="' + description + '">' +
      title +
    '</span>'
  );
}
If the startDate from CouchDB is 2011-08-26, then this function finds the correct cell via a jQuery $('#2011-08-26') selector. If that selector is found, then the inner HTML is replaced with the event's title (and the description in a <span> title attribute). If the calendar event is for a date not currently displayed, no worries, the selector returns an empty wrapped set, in which case the html() has nothing to do.

The result is a rather snappy:
Nice. There is no ability to modify or remove elements just yet, but it was quite easy to get this calendar populated quickly. Up tomorrow, I think I shall begin exploring doing this again, but with Backbone.js.


Day #126

by Chris Strom (noreply@blogger.com) at September 01, 2011 01:14 PM

Deleting Things in Backbone.js


After getting a pretty decent Backbone.js view implementation in place yesterday, today I would like to see if I can add a bit of interactivity to the beasty. The easiest thing seems to be deleting. "Easy" usually turns out to be a red-flag, but who knows? Maybe this time it'll just work.

Anyhow, the first thing I need is a delete route in my express.js app. Nothing too fancy ought to be required. I just need to make an HTTP request with a method of "DELETE" to my CouchDB backend. The response from CouchDB can then be piped directly to my Backbone app. Something like this ought to do:
app.delete('/events/:id', function(req, res){
  var options = {
    method: 'DELETE',
    host: 'localhost',
    port: 5984,
    path: '/calendar/' + req.params.id
  };

  // Send the HTTP request with the DELETE options
  var couch_req = http.request(options, function(couch_response) {
    console.log("Got response: %s %s:%d%s", couch_response.statusCode, options.host, options.port, options.path);

    // Pipe the response from CouchDB to the browser
    couch_response.pipe(res);
  }).on('error', function(e) {
    console.log("Got error: " + e.message);
  });

  // Send the complete request.
  couch_req.end();
});
Now to my Backbone application. As usual when I am exploring, I use Chrome's Javascript console for interacting with page elements and Javascript objects. In this case, I would like to delete the Backbone model responsible for the "foo" calendar event on the first of next month:
In the console, I find that entry is the second of four calendar events (clearly, I need to investigate sorting another day):
Assuming that is an Event model, all I need do is call its destroy method and the offending event should be stricken from existence:
> e.destroy()

=> child
Hrm... Dunno what I expected. I suppose it has to be chainable, so maybe it worked..? Actually, no, it did not. Examining the express.js app's log, I see no log entries. Reloading the page, I see that the calendar event is still there. So what gives?

One of the first places I check is the Event model itself:
window.Event = Backbone.Model.extend({});
Say, that looks a bit spartan. Perhaps more is needed for Backbone to know how to delete a thing from the database.

After a bit of research, I find that yes, two things are needed for this to work: a URL root (e.g. /events) and a record ID. In retrospect, both make all kinds of sense. How else is Backbone supposed to infer the resource to be DELETEd?

Anyhow, the fix should be pretty easy. My new and improved Event model looks like:
window.Event = Backbone.Model.extend({
  urlRoot : '/events',
  initialize: function(attributes) { this.id = attributes['_id']; }
});
The urlRoot property is fairly self-explanatory. The initialize method for setting the model's id is less so. CouchDB stores document IDs in the "_id" attribute:
{

"_id": "a38f51509190f265959bbb2b5d001128",
"_rev": "1-174f31204df52e79a92c1ac875ac09a2",
"startDate": "2011-09-01",
"title": "foo",
"description": "bar"
}
This is available in the model's attributes (I could call event.get("_id") to retrieve it), but Backbone has no way to tie it to the special id property of a Backbone model. So I link the two manually in the model's initializer.

Now I should be able to delete the bogus calendar event. Reloading the page and trying again I get:
Well that is progress. I am seeing an AJAX request logged, but is it doing anything? Checking the express.js logs, it is failing to do something:
Got response: 409 localhost:5984/calendar/a38f51509190f265959bbb2b5d001128
That is certainly progress, but 409?!

Ah, wait. This is CouchDB. I need to work a little harder to delete things. Specifically, I have to assure the database that I am acting on the same revision that is currently in the database. To work properly with this optimistic locking, I need to supply the revision as a query parameter:
DELETE /calendar/a38f51509190f265959bbb2b5d001128?rev=1-174f31204df52e79a92c1ac875ac09a2 HTTP/1.0
Or as an If-Match header:
DELETE /calendar/a38f51509190f265959bbb2b5d001128 HTTP/1.0
If-Match: "1-174f31204df52e79a92c1ac875ac09a2"
Hrm... I tend to think it would be easier to transmit the revision via the If-Match header. If I could set that in my Backbone application, then I ought to be able to pass that directly through to CouchDB:
app.delete('/events/:id', function(req, res){
  var options = {
    method: 'DELETE',
    host: 'localhost',
    port: 5984,
    path: '/calendar/' + req.params.id,
    headers: req.headers
  };

  var couch_req = http.request(options, function(couch_response) {
    // ...
  });

  couch_req.end();
});
I rather like that one line change. No futzing with query parameters just feels cleaner.

But is it even possible to set HTTP headers in Backbone? Actually, it is relatively easy. Anything supplied in the destroy (or update or create) method is sent along to jQuery as an option. Since jQuery AJAX requests recognize a headers attribute, something like this ought to work:
> e.destroy({headers: {'If-Match':'1-174f31204df52e79a92c1ac875ac09a2'} })
And, finally, I see a change in the CouchDB response:
Got response: 200 localhost:5984/calendar/a38f51509190f265959bbb2b5d001128?rev=1-174f31204df52e79a92c1ac875ac09a2
Most importantly, I no longer have a bogus entry on the first:
That is good progress for tonight. I still have work to do with deleting. It would be nice to do this from the UI rather than the Javascript console. Also, I should not have to refresh my display to see the calendar event removed. But I will worry about those things tomorrow.


Day #128

by Chris Strom (noreply@blogger.com) at September 01, 2011 01:11 PM

Pass-Thru Node.js and CouchDB


Up tonight, I try to get started with a simple node.js / CouchDB application. I am a big fan of CouchDB with node.js because CouchDB speaks HTTP natively. There is no need for middleware or data drivers with CouchDB—I can just make HTTP requests and process the response or send it along to the client.

I already have the latest node.js installed and a CouchDB server running.

My first step is to create a simple express.js application to play with:
➜  repos  express calendar   

create : calendar
create : calendar/package.json
create : calendar/app.js
create : calendar/public/stylesheets
create : calendar/public/stylesheets/style.css
create : calendar/public/javascripts
create : calendar/public/images
create : calendar/views
create : calendar/views/layout.jade
create : calendar/views/index.jade
In the new "calendar" app directory, I need to install the express and jade packages from npm:
➜  calendar git:(master) ✗ npm install express jade

jade@0.14.2 ./node_modules/jade
express-unstable@2.4.3 ./node_modules/express
├── mime@1.2.2
├── connect@1.6.0
└── qs@0.3.1
My next step is to use the Futon admin interface to create a database. I fancy a calendar app, so I name my database accordingly:


Next, I create a calendar event:


And an event for tomorrow:


To allow my simple express.js app to pass those events back to the browser, I need to establish a route for all events. In that route, I access the special _all_docs resource in CouchDB to pull back all records in the calendar DB (I prolly would not do that in a larger DB). Once the CouchDB response comes back, I write the data back to the browser:
app.get('/events', function(req, res){
  var options = {
    host: 'localhost',
    port: 5984,
    path: '/calendar/_all_docs'
  };

  http.get(options, function(couch_response) {
    console.log("Got response: %s %s:%d%s", couch_response.statusCode, options.host, options.port, options.path);

    res.contentType('json');

    // Send all couch data to the client
    couch_response.on('data', function (chunk) {
      res.write(chunk);
    });

    // When couch is done, so is this request
    couch_response.on('end', function (chunk) {
      res.end();
    });
  }).on('error', function(e) {
    console.log("Got error: " + e.message);
  });
});
That's a bit of work establishing 'data' and 'end' listeners. Fortunately, node has an answer for this case in the pipe method for all stream objects:
http.get(options, function(couch_response) {
  console.log("Got response: %s %s:%d%s", couch_response.statusCode, options.host, options.port, options.path);

  couch_response.pipe(res)
})
Any data emitted by couch_response is written to the original Response object, which is ended when couch_response ends. The result of accessing the /events resource is:


As can be seen in the screen shot, the actual event data is not being returned—just IDs and other metadata. To get the full event, I can create another express.js route with a similar callback:
app.get('/events/:id', function(req, res){
  var options = {
    host: 'localhost',
    port: 5984,
    path: '/calendar/' + req.params.id
  };

  http.get(options, function(couch_response) {
    console.log("Got response: %s %s:%d%s", couch_response.statusCode, options.host, options.port, options.path);

    couch_response.pipe(res);
  }).on('error', function(e) {
    console.log("Got error: " + e.message);
  });
});
This route makes use of some nifty parameter assignments in express.js. Named parameters in the route (the :id in /events/:id) are made available in the request params object (e.g. req.params.id). That bit of coolness aside, this is nearly identical to the all-events route from above.


That is all well and good, but I do not think that I want my client making dozens of calls to this resource for each event on a particular calendar. Instead, I go back into my /events route and add include_docs=true to the URL. This includes the documents (which are relatively small) along with the meta data:
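A minimal sketch of that change, assuming the same host, port and database as in the routes above. The helper name allEventsOptions is made up for illustration; it only builds the options object that would be handed to http.get().

```javascript
// Build the http.get() options for the all-events route.
// With includeDocs set, CouchDB returns each document alongside its metadata.
function allEventsOptions(includeDocs) {
  return {
    host: 'localhost',
    port: 5984,
    path: '/calendar/_all_docs' + (includeDocs ? '?include_docs=true' : '')
  };
}

console.log(allEventsOptions(true).path);
```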


Cool beans. That is a good stopping point for tonight. Tomorrow I will hook those into jQuery Ajax calls to build a month-view calendar. And then the fun begins with a backbone.js equivalent.

Day #124

by Chris Strom (noreply@blogger.com) at September 01, 2011 12:52 PM

August 29, 2011

Ricky Ho

Scale Independently in the Cloud

Deploying a large scale system nowadays is quite different from before, when the data center was the only choice. A traditional deployment exercise typically involves an intensive performance modeling exercise to accurately predict the resource requirements of the production system. The accuracy is very important because it is expensive and slow to make changes after deployment.

This performance modeling typically involves the following steps.
  1. Build a graph model based on the component interactions.
  2. Express the mathematical relationship between input traffic, the resource consumption at the processing node (CPU and memory, based on the processing algorithm), and the output traffic (which becomes the input of downstream processing nodes)
  3. Model the external workload as a random variable (with a workload distribution function)
  4. Run a simulation exercise to compute the corresponding workload distribution function for each link and node. The workload units include CPU, memory and network requirements (latency and bandwidth).
  5. Based on business requirements, pick a peak external load target (say 95%). Vary the external workload from 0 to the max workload and compute the corresponding range of workload at each node and link in the graph.
  6. The max CPU, memory and I/O of each node define the capacity that needs to be provisioned for that node. The max value of each link defines the network bandwidth / latency requirement of that link


Notice that the resources are typically provisioned at the peak load target, which means the resources are idle most of the time, impacting the efficiency of the overall system. On the other hand, SaaS based systems introduce a more dynamic relationship (anyone can call anyone) between components, which makes this traditional way of performance modeling more challenging. The performance modeling exercise needs to be conducted whenever new clients or new services are introduced into the system, resulting in a non-trivial ongoing maintenance cost.

Thanks to the cloud computing phenomenon, the underlying dynamics and economics have shifted quite significantly over the last few years, and capacity planning is now quite different from before.

First of all, making a wrong capacity estimate is less costly when deploying additional resources takes minutes rather than months. Instead of attempting to construct the full picture of the system, the cloud practice is to focus on each individual component and make sure each can "scale independently". The steps are as follows ...
  1. Each component scales independently using horizontal scaling, i.e. f(a.x) = a.f(x)
  2. Instead of establishing a formal mathematical model, just deploy the system in the cloud, adjust the input workload and measure the utilization at each node and link (e.g. with AWS CloudWatch)
  3. Based on the utilization measurements, define the initial deployment capacity based on average load (not peak load).
  4. Use auto-scaling to adjust the pool size of independent components according to runtime workload.
  5. Sync workload is typically fronted by a load balancer. Async workload is fronted by scalable queues. Output can be a callout, stored in a queue, or stored in scalable storage
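Steps 1, 3 and 4 can be illustrated with a tiny sizing function. It is a sketch under the linear scaling assumption f(a.x) = a.f(x); the function name, its parameters and the sample numbers are all hypothetical, and a real auto-scaler (such as the cloud provider's own) applies its own policies.

```javascript
// If one node handles perNodeCapacity requests/sec and capacity scales
// linearly with node count, size the pool so that each node runs at
// roughly targetUtilization, clamped to the allowed pool bounds.
function desiredPoolSize(currentLoad, perNodeCapacity, targetUtilization, minNodes, maxNodes) {
  const needed = Math.ceil(currentLoad / (perNodeCapacity * targetUtilization));
  return Math.min(maxNodes, Math.max(minNodes, needed));
}

// Measured: 100 req/s per node; aim for 75% utilization; pool of 2 to 20 nodes.
console.log(desiredPoolSize(900, 100, 0.75, 2, 20));   // average load
console.log(desiredPoolSize(0, 100, 0.75, 2, 20));     // idle: stay at the minimum
console.log(desiredPoolSize(10000, 100, 0.75, 2, 20)); // spike: capped at the maximum
```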


By focusing on "scale independently", each component can plug and play much more easily with other components, because fewer assumptions are made about each other when each component can dynamically adjust its capacity according to run-time need. This results in a system that is not only more scalable but also more flexible.

by Ricky Ho (noreply@blogger.com) at August 29, 2011 05:14 AM

August 09, 2011

Mark Headd

Speech Recognition for Open311

Really excited about a new project I started recently to enable phone-based speech recognition for 311 service requests.

Here is a screen cast demonstrating the solution.




I write about it in detail on the Tropo blog. Head on over to get the details, or check out the code for this solution (still a work in progress, but under active development) on GitHub.


by Mark Headd at August 09, 2011 01:47 PM

July 14, 2011

Ricky Ho

Designing algorithms for Map Reduce

Since the emergence of the Hadoop implementation, I have been trying to morph existing algorithms from various areas into the map/reduce model. The results are pretty encouraging and I've found Map/Reduce applicable in a wide spectrum of application scenarios.

I wanted to write down my findings, but found that the scope is too broad and that I haven't spent enough time exploring different problem domains. Finally, I realized that there is no way for me to completely cover what Map/Reduce can do in all areas, so I am just dumping out what I know at this moment, over a long weekend when I have an extra day.

Notice that Map/Reduce is good for "data parallelism", which is different from "task parallelism". Here is a description about their difference and a general parallel processing design methodology.

I'll cover the abstract Map/Reduce processing model below. For a detailed description of the implementation of the Hadoop framework, please refer to my earlier blog here.


Abstract Processing Model
There is no formal definition of the Map/Reduce model. Based on the Hadoop implementation, we can think of it as a "distributed merge-sort engine". The general processing flow is as follows.
  • Input data is "split" among multiple mapper processes which execute in parallel
  • The result of each mapper is partitioned by key and locally sorted
  • Mapper results with the same key land on the same reducer and are consolidated there
  • A merge sort happens at the reducer, so all keys arriving at the same reducer are sorted

Within the processing flow, user defined functions can be plugged-in to the framework.
  • map(key1, value1) -> emit(key2, value2)
  • reduce(key2, value2_list) -> emit(key2, aggregated_value2)
  • combine(key2, value2_list) -> emit(key2, combined_value2)
  • partition(key2) return reducerNo
Designing an algorithm for map/reduce is about how to morph your problem into a distributed sorting problem and fit your algorithm into the user defined functions above.
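To make the flow concrete, here is a toy single-process simulation of the model, using word count as the plugged-in logic. It is only a sketch: runMapReduce is a made-up helper (not a Hadoop API), there is no combiner, and the partition function is a hypothetical one based on the first character of the key.

```javascript
// Toy, in-memory simulation of map -> partition -> sort -> reduce.
function runMapReduce(inputs, map, reduce, numReducers, partition) {
  // Map phase: each record emits [key2, value2] pairs, which are routed
  // to a reducer by the partition function and grouped by key there.
  const partitions = Array.from({ length: numReducers }, () => new Map());
  for (const [key1, value1] of inputs) {
    map(key1, value1, (key2, value2) => {
      const bucket = partitions[partition(key2, numReducers)];
      if (!bucket.has(key2)) bucket.set(key2, []);
      bucket.get(key2).push(value2);
    });
  }
  // Reduce phase: every reducer sees its own keys in sorted order.
  const output = [];
  for (const bucket of partitions) {
    for (const key2 of [...bucket.keys()].sort()) {
      reduce(key2, bucket.get(key2), (key, value) => output.push([key, value]));
    }
  }
  return output;
}

// Word count: map emits (word, 1); reduce sums the ones per word.
const lines = [[0, 'b a b'], [1, 'a c']];
const counts = runMapReduce(
  lines,
  (offset, line, emit) => line.split(' ').forEach((word) => emit(word, 1)),
  (word, ones, emit) => emit(word, ones.reduce((sum, n) => sum + n, 0)),
  2,
  (key, n) => key.charCodeAt(0) % n
);
console.log(counts);
```

Note that the output is sorted within each reducer but not across reducers, which is why the Sorting section below needs a range-based partition function.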

To analyze the complexity of the algorithm, we need to understand the processing cost, especially the cost of network communication in such a highly distributed system.

Let's first consider the communication between the input data split and the Mapper. To minimize this overhead, we need to run the mapper logic at the data split (without moving the data). How well we do this depends on how the input data is stored and whether we can run the mapper code there. For HDFS and Cassandra, we can run the mapper at the storage node, and the scheduler algorithm of the JobTracker will assign the mapper to the data split that it is collocated with, and hence significantly reduce data movement. Other data stores such as Amazon S3 don't allow execution of mapper logic at the storage node and therefore incur more data traffic.

The communication between Mapper and Reducer cannot be avoided by collocation, because the reducer that a record goes to depends on the emitted key. The only mechanism available is the combine() function, which can perform a local consolidation and hence reduce the data sent to the reducer.

Finally, the communication between the reducer and the output data store depends on the store's implementation. For HDFS, the data is triply replicated and hence the cost of writing can be high. Cassandra (a NoSQL data store) allows configurable latency with various degrees of data consistency trade-off. Fortunately, in most cases the volume of result data after Map/Reduce processing is not high.

Now, we see how to fit various different kinds of algorithms into the Map/Reduce model ...


Map-Only
"Embarrassingly parallel" problems are those where the same processing is applied to each data element independently; in other words, there is no need to consolidate or aggregate individual results.

These kinds of problems can be expressed as a Map-only job (by setting the number of reducers to zero). In this case, the Mapper's emitted results go directly to the output format.

Some examples of map-only jobs are ...
  • Distributed grep
  • Document format conversion
  • ETL
  • Input data sampling

Sorting
As we described above, Hadoop is fundamentally a distributed sorting engine, so using it for sorting is a natural fit.

For example, we can use an identity function for both map() and reduce(); the output is then equivalent to sorting the input data. Notice that we are using a single reducer here, so the merge is still sequential although the sorting is done at the mappers in parallel.

We can perform the merge in parallel by using multiple reducers. In this case, the output of each reducer is sorted, and we may need to do a final merge on all the reducers' outputs. Another way is to use a customized partition() function such that the keys are partitioned by range. In this case, each reducer sorts a particular range and the final result is just the concatenation of each reducer's sorted result.
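partition(key) {
  // width of the key range handled by each reducer
  range = (KEY_MAX - KEY_MIN) / NUM_OF_REDUCERS
  reducer_no = floor((key - KEY_MIN) / range)
  // key == KEY_MAX would otherwise land one past the last reducer
  return min(reducer_no, NUM_OF_REDUCERS - 1)
}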


Inverted Indexes
The map/reduce model originated at Google, which has many scenarios that involve building large scale inverted indexes. Building an inverted index is about parsing different documents to build a word -> document index for keyword search.

In fact, the inverted index is pretty general and can be applied in many scenarios. To build an inverted index, we can feed the mapper each document (or the lines within a document). The Mapper will parse the words in the document and emit [word, doc] pairs along with other metadata, such as where in the document the word occurs. The reducer can simply be an identity function that just dumps out the list, or it can perform some statistical aggregation per word.

In a more general form of inverted index, there are "container" and "element" concepts. The Map and Reduce functions are organized in the following pattern.
map(key, container) {
  for each element in container {
    element_meta =
      extract_metadata(element, container)
    emit(element, [container_id, element_meta])
  }
}

reduce(element, container_ids) {
  element_stat =
    compute_stat(container_ids)
  emit(element, [element_stat, container_ids])
}
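For the text case, the container is a document and the element is a word. The following single-process sketch mirrors the pattern above; the function names and the sample documents are made up, the element metadata is the word's position in the document, and the per-word statistic is its document frequency.

```javascript
// Build word -> [{ doc, positions }] from { docId: text }.
function buildInvertedIndex(docs) {
  const index = new Map();
  for (const [docId, text] of Object.entries(docs)) {
    // "map": for each word (element) in the document (container),
    // record where it occurs (the element metadata).
    const perDoc = new Map();
    text.toLowerCase().split(/\W+/).filter(Boolean).forEach((word, pos) => {
      if (!perDoc.has(word)) perDoc.set(word, []);
      perDoc.get(word).push(pos);
    });
    // "shuffle": group the emitted pairs by word.
    for (const [word, positions] of perDoc) {
      if (!index.has(word)) index.set(word, []);
      index.get(word).push({ doc: docId, positions });
    }
  }
  return index;
}

// "reduce": a per-word statistic, here the number of documents containing it.
function docFrequency(index, word) {
  return (index.get(word) || []).length;
}

const index = buildInvertedIndex({ d1: 'Couch is relaxing', d2: 'is it? It is' });
console.log(index.get('is'));
console.log(docFrequency(index, 'couch'));
```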

In a text index, we do not just count the raw frequency of the terms but also adjust their weighting based on their frequency distribution, so that common words have less significance when they appear in a document. The final value after normalization is called TF-IDF (term frequency times inverse document frequency) and can be computed using Map/Reduce as well.


Simple Statistics Computation
Computing max, min, and count is very straightforward since these operations are commutative and associative. Each mapper performs the local computation and sends the result to a single reducer, which does the final computation.

A combine function is typically used to reduce network traffic. Notice that the input to the combine function must look the same as the input to the reducer function, and the output of the combine function must look the same as the output of the map function. There is also no guarantee that the combiner function will be invoked at all.

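class Mapper {
buffer

map(key, number) {
buffer.append(number)
if (buffer.is_full) {
flush()
}
}

# Emit the local max of the buffered numbers and reset the buffer
flush() {
if (!buffer.is_empty) {
max = compute_max(buffer)
emit(1, max)
buffer.clear()
}
}

# Flush whatever is left when the mapper finishes
close() {
flush()
}
}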


class Reducer {
reduce(key, list_of_local_max) {
global_max = -INFINITY  # starting at 0 would be wrong if all inputs are negative
for local_max in list_of_local_max {
if local_max > global_max {
global_max = local_max
}
}
emit(1, global_max)
}
}


class Combiner {
combine(key, list_of_local_max) {
local_max = maximum(list_of_local_max)
emit(1, local_max)
}
}
Computing avg is done in a similar way except that instead of computing the local avg, we compute the local sum and local count. The reducer will do the final sum divided by the final count to come up with the final avg.
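
The reason for shipping (sum, count) pairs instead of local averages is that an average of averages is wrong whenever the splits have different sizes. A small Python sketch, with made-up function names and sample splits:

```python
def map_partial(numbers):
    """Mapper/combiner side: emit one (sum, count) pair for a split."""
    return (sum(numbers), len(numbers))


def reduce_average(partials):
    """Reducer side: divide the final sum by the final count."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


splits = [[1, 2, 3], [10], [4, 6]]
average = reduce_average([map_partial(s) for s in splits])

# For comparison only: averaging the local averages gives a different answer.
naive = sum(sum(s) / len(s) for s in splits) / len(splits)
```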

Computing a histogram is pretty common in statistics and gives a quick idea of the data distribution. A typical approach is to divide the range of values into intervals. The mapper emits a count for the interval each number falls into, and the reducer computes the sum for each interval.
class Mapper {
interval_start = [0, 20, 40, 60, 80]

map(key, number) {
# Scan from the highest interval down and pick the first one whose
# start is not greater than the number (numbers below 0 are ignored)
i = NO_OF_INTERVALS - 1
while (i >= 0) {
if (number >= interval_start[i]) {
emit(i, 1)
break
}
i = i - 1
}
}
}


class Reducer {
reduce(interval, counts) {
total_counts = 0
for each count in counts {
total_counts += count
}
emit(interval, total_counts)
}
}


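class Combiner {
combine(interval, occurrences) {
# Sum rather than count, since inputs may already be partial counts
partial_count = 0
for each count in occurrences {
partial_count += count
}
emit(interval, partial_count)
}
}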
Notice that a non-uniform distribution of values across intervals may cause an unbalanced workload among reducers and hence undermine the degree of parallelism. We'll address this in the later part of this post.


In-Mapper Combine
Jimmy Lin, in his excellent book, talks about a technique called "in-mapper combine", which gives the application control over when the combine takes place. The general idea is to maintain a HashMap to buffer the intermediate results and to have separate logic that determines when to actually emit the data from the buffer. The general code structure is as follows ...
class Mapper {
buffer

init() {
buffer = HashMap.new
}

map(key, data) {
elements = process(data)
for each element {
....
check_and_put(buffer, k2, v2)
}
}

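check_and_put(buffer, k2, v2) {
# Merge the new value into whatever is already buffered for this key
# (combine() is application specific, e.g. addition for counts)
buffer[k2] = combine(buffer[k2], v2)
if buffer.full {
for each k in buffer.keys {
emit(k, buffer[k])
}
buffer.clear()
}
}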

close() {
for each k2 in buffer.keys {
emit(k2, buffer[k2])
}
}
}
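
Here is a hedged Python sketch of the same structure, using word count as the combine logic. The class name, the emit callback and the max_keys flush threshold are assumptions chosen for illustration; a real mapper would size the buffer by memory rather than by key count.

```python
class WordCountMapper:
    """Mapper that combines counts in memory before emitting."""

    def __init__(self, emit, max_keys=1000):
        self.emit = emit          # callback taking (key, value)
        self.max_keys = max_keys  # flush threshold (an assumption)
        self.buffer = {}

    def map(self, key, line):
        for word in line.split():
            self.buffer[word] = self.buffer.get(word, 0) + 1
            if len(self.buffer) >= self.max_keys:
                self.flush()

    def flush(self):
        for word, count in self.buffer.items():
            self.emit(word, count)
        self.buffer.clear()

    def close(self):
        # Emit whatever is still buffered when the input is exhausted.
        self.flush()


emitted = []
mapper = WordCountMapper(lambda k, v: emitted.append((k, v)), max_keys=3)
mapper.map(0, "a b a a")
mapper.map(1, "c a")
mapper.close()
```

Six input words produce only four emitted pairs here. The same key can still be emitted more than once across flushes, so the reducer must still sum the values.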

SQL Model
The SQL model can be used to extract data from the data source. It contains a number of primitives.

Projection / Filter
This logic is typically implemented in the Mapper
  • result = SELECT c1, c2, c3, c4 FROM source WHERE conditions
Aggregation / Group by / Having
This logic is typically implemented in the Reducer
  • SELECT sum(c3) as s1, avg(c4) as s2 ... FROM result GROUP BY c1, c2 HAVING conditions
The above example can be realized by the following map/reduce job
class Mapper {
map(k, rec) {
select_fields =
[rec.c1, rec.c2, rec.c3, rec.c4]
group_fields =
[rec.c1, rec.c2]
if (filter_condition == true) {
emit(group_fields, select_fields)
}
}
}

class Reducer {
reduce(group_fields, list_of_rec) {
s1 = 0
s2 = 0
for each rec in list_of_rec {
s1 += rec.c3
s2 += rec.c4
}
s2 = s2 / list_of_rec.size
if (having_condition == true) {
emit(group_fields, [s1, s2])
}
}
}
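
For readers who prefer something runnable, the following Python sketch simulates the same query in memory. The where and having callables, the dictionary-based records and the sample rows are all assumptions for illustration.

```python
from collections import defaultdict


def run_query(records, where, having):
    groups = defaultdict(list)
    for rec in records:                        # map phase: filter and group
        if where(rec):
            groups[(rec["c1"], rec["c2"])].append(rec)
    results = {}
    for key, recs in groups.items():           # reduce phase: aggregate
        s1 = sum(r["c3"] for r in recs)
        s2 = sum(r["c4"] for r in recs) / len(recs)
        if having(s1, s2):
            results[key] = (s1, s2)
    return results


rows = [
    {"c1": "x", "c2": 1, "c3": 10, "c4": 2.0},
    {"c1": "x", "c2": 1, "c3": 5, "c4": 4.0},
    {"c1": "y", "c2": 2, "c3": 1, "c4": 8.0},
    {"c1": "y", "c2": 2, "c3": -3, "c4": 1.0},
]
result = run_query(
    rows,
    where=lambda r: r["c3"] > 0,
    having=lambda s1, s2: s1 >= 5,
)
```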

Data Joins
Joining 2 data sets is a very common operation in the Relational Data Model and is very mature in RDBMS implementations. The common join mechanisms in a centralized DB architecture are as follows
  1. Nested loop join -- This is the most basic and naive mechanism and is organized as two loops. The outer loop reads from data set1; the inner loop scans through the whole of data set2 and compares each record with the record just read from data set1.
  2. Indexed join -- An index (e.g. B-Tree index) is built for one of the data sets (say data set2, which is the smaller one). The join scans through data set1 and looks up the index to find the matching records of data set2.
  3. Merge join -- Pre-sort both data sets so they are arranged physically in increasing order. The join is realized by just merging the two data sets. a) Locate the first record in both data set1 and set2, which is their corresponding minimum key. b) In the one with the smaller minimum key (say data set1), keep scanning until finding the next key which is bigger than the minimum key of the other data set (i.e. data set2); call this the next minimum key of data set1. c) Switch position and repeat the whole thing until one of the data sets is exhausted.
  4. Hash / Partition join -- Partition data set1 and data set2 into smaller pieces and apply another join algorithm to each smaller piece. A linear scan with a hash() function is typically performed to partition the data sets such that data in set1 and data in set2 with the same key will land in the same partition.
  5. Semi join -- This is mainly used to join two data sets that are stored at different locations. The goal is to reduce the amount of data transfer such that only the full records that appear in the final join result are sent across. a) Data set2 sends its key set to the machine holding data set1. b) The machine holding data set1 does a join and sends back the records in data set1 that match one of the sent-over keys. c) The machine holding data set2 does a final join with the data sent back.
The map reduce environment has corresponding joins.

General reducer-side join
This is the most basic one: records from data set1 and set2 with the same key will land on the same reducer, which will then compute a cartesian product. The downside of this model is that the reducer needs to have enough memory to hold all records of each key.
map(k1, rec) {
emit(rec.key, [rec.type, rec])
}

reduce(k2, list_of_rec) {
list_of_typeA = []
list_of_typeB = []
for each rec in list_of_rec {
if (rec.type == 'A') {
list_of_typeA.append(rec)
} else {
list_of_typeB.append(rec)
}
}

# Compute the cartesian product
for recA in list_of_typeA {
for recB in list_of_typeB {
emit(k2, [recA, recB])
}
}
}
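
The sketch below simulates a reducer-side join in Python. The grouping dictionary stands in for the shuffle, and the function name, the key parameter and the sample users/orders records are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(records_a, records_b, key):
    # Shuffle: records of both types with the same key land in one group.
    groups = defaultdict(lambda: ([], []))
    for rec in records_a:
        groups[rec[key]][0].append(rec)
    for rec in records_b:
        groups[rec[key]][1].append(rec)
    # Reduce: cartesian product of the two lists within each group.
    joined = []
    for k, (list_a, list_b) in groups.items():
        joined.extend((k, a, b) for a, b in product(list_a, list_b))
    return joined


users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [
    {"id": 1, "item": "pen"},
    {"id": 1, "item": "ink"},
    {"id": 3, "item": "cup"},
]
joined = reduce_side_join(users, orders, key="id")
```

Keys that appear in only one data set produce an empty product and so drop out, which gives inner-join behavior.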

Optimized reducer-side join
You can "secondary sort" the records for each key by emitting a composite key [key, type] and defining a customized partition function that partitions on the key alone. In this model, you arrange for the data type with fewer records per key to arrive first, so the reducer only needs to hold the records of that type in memory.
map(k1, rec) {
emit([rec.key, rec.type], rec)
}

partition(key_pair) {
return super.partition(key_pair[0])
}

reduce(k2, list_of_rec) {
list_of_typeA = []
for each rec in list_of_rec {
if (rec.type == 'A') {
list_of_typeA.append(rec)
} else { # record of type B; all type A records for this key have already arrived
for recA in list_of_typeA {
emit(k2, [recA, rec])
}
}
}
}

While being very flexible, the downside of the reducer-side join is that all data needs to be transferred from the mapper to the reducer and the result then written to HDFS. A map-side join exploits some special arrangement of the input files such that the join is performed at the mapper. The advantage of doing it in the mapper is that we can exploit the data locality of the Map reduce framework, where the mapper is allocated an input split on its local machine, which reduces the data transfer from the disk to the mapper. After the map-side join, the result is written directly to the output HDFS files, which eliminates the data transfer between the mapper and the reducer.

Map-side partition join
This model requires the 2 data sets to be partitioned into 2 sets of partition files (same number of partitions for each set). The size of each partition is such that it can fit into the memory of the Mapper machine. We also need to configure the Map/Reduce job such that the partition file is not split; in other words, the whole partition is assigned to a single mapper task.

The mapper will detect the partition of the input file and then read the corresponding partition file of the other data set into an in-memory hashtable. After that, the mapper will look up the hashtable to do the join.
class Mapper {
table = Hashtable.new

init() {
partition = detect_input_filename()
table = load("hdfs://dataset2/" + partition)
}

map(k1, rec1) {
rec2 = table[rec1.key]
if (rec2 != nil) {
emit(rec1.key, [rec1, rec2])
}
}
}

Map-side partition merge join
In addition, if the partition files are also sorted, then the mapper can use a merge join, which has an even smaller memory footprint.
class Mapper {
rec2_key = nil
next_rec2 = nil
list_of_rec2 = []
file = nil

init() {
partition = detect_input_filename()
file = open("hdfs://dataset2/" + partition, "r")
next_rec2 = file.read()
fill_rec2_list()
}

# Fill list_of_rec2 with the next run of rec2 sharing the same key
# (assumes file.read() returns nil at end of file)
fill_rec2_list() {
list_of_rec2 = []
if (next_rec2 == nil) {
rec2_key = nil  # data set2 is exhausted
return
}
rec2_key = next_rec2.key
while (next_rec2 != nil && next_rec2.key == rec2_key) {
list_of_rec2.append(next_rec2)
next_rec2 = file.read()
}
}

map(k1, rec1) {
# Skip rec2 groups whose key is smaller than the current rec1 key
while (rec2_key != nil && rec1.key > rec2_key) {
fill_rec2_list()
}

if (rec1.key == rec2_key) {
for rec2 in list_of_rec2 {
emit(rec1.key, [rec1, rec2])
}
}

}
}
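
The merge logic can be tried out in isolation with the following Python sketch, which joins two in-memory lists that are already sorted by key. It is a simplified stand-in for the mapper above (no HDFS, no partitions), and the function name and sample data are assumptions.

```python
def merge_join(left, right, key):
    """Inner join two lists already sorted by key; handles duplicate keys."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side records sharing this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left record with this key against that run.
            while i < len(left) and key(left[i]) == lk:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
pairs = merge_join(left, right, key=lambda rec: rec[0])
```

Only one run of equal-key records from the right side is held at a time, which is where the smaller memory footprint comes from.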

Memcache join
The model is very straightforward: the second data set is loaded into a distributed hash table (like memcache), which has effectively unlimited size. The mapper receives an input split from the first data set and then looks up the memcache for the corresponding record of the other data set.

There are also other, more sophisticated join mechanisms, such as the semi-join described in this paper.

Graph Algorithms
Many problems can be modeled as a graph of nodes and edges. In the search engine environment, computing the rank of a document using PageRank or HITS can be modeled as a sequence of iterations of Map/Reduce jobs.

In the past, I have blogged about a number of very basic graph algorithms in map reduce, including topological sort, finding shortest paths, minimum spanning tree etc., and also how to recommend people connections using Map/Reduce.

Because graph traversal is inherently sequential, I am not sure Map/Reduce is the best parallel processing model for graph processing. Another problem is that due to the "stateless nature" of the map() and reduce() functions, the whole graph needs to be transferred between mapper and reducer, which incurs significant communication costs. Jimmy Lin has described a clever technique called Schimmy, which uses a special partitioning function that lets the reducer retain ownership of nodes across map/reduce jobs. I have described this technique as well as a general model of Map/Reduce graph processing in a previous blog.

I think a parallel programming model specific for Graph processing will perform much better. Google's Pregel model is a good example of that.


Machine Learning
Many machine learning algorithms involve multiple iterations of parallel processing, which fits very well into the Map/Reduce model.

For example, we can use map reduce to calculate the statistics for probabilistic methods such as naive Bayes.

A simple example of computing K-Means cluster can also be done in the following way.
  • Input: A set of points, with k initial centroids
  • Output: K final centroids
Iterate until no more change of membership
  1. For each point, assign it to be the member of closest centroid
  2. Re-compute the centroid from the assigned point members
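
The two steps above map naturally onto a map phase (assign) and a reduce phase (re-compute). Below is a minimal single-process Python sketch of that loop. The function names and sample points are assumptions, and it stops when the centroids stop moving, which happens exactly when membership stops changing.

```python
from collections import defaultdict


def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_iteration(points, centroids):
    groups = defaultdict(list)
    for point in points:                       # map: assign to closest centroid
        groups[closest(point, centroids)].append(point)
    new_centroids = list(centroids)            # empty clusters keep their centroid
    for i, members in groups.items():          # reduce: re-compute each centroid
        new_centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iterations=100):
    for _ in range(max_iterations):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```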


For a complete list of Machine learning algorithms and how they can be implemented using the Map/Reduce model, here is a very good paper.


Matrix arithmetic
A lot of real-life relationships can be represented as a Matrix. One example is the vector space model of Information Retrieval, where the columns represent docs and the rows represent terms. Another example is the social network graph, where the columns as well as the rows represent people and the binary value of each cell represents a "friend" relationship. In this case, M + M.M represents all the people that I can reach within 2 degrees.
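
As a small worked example of the M + M.M idea, the Python sketch below uses plain nested lists for a four-person chain of friendships. The helper names and the sample matrix are assumptions, and a non-zero entry is read as "reachable within 2 degrees".

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    n, m, p = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
        for i in range(n)
    ]


def within_two_degrees(m):
    """True where (M + M.M)[i][j] is non-zero."""
    m2 = matmul(m, m)
    n = len(m)
    return [[m[i][j] + m2[i][j] > 0 for j in range(n)] for i in range(n)]


# Chain of friendships: 0 - 1 - 2 - 3
friends = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
reach = within_two_degrees(friends)
```

Person 0 reaches person 2 through person 1, but person 3 is three hops away and stays unreachable. The diagonal also becomes non-zero because a friend of my friend includes myself.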

Processing of dense matrices is very easy to parallelize. But since the sequential version is O(N^3), it is not that interesting for matrices of large size (rows and columns in the millions).

A lot of real-world graph problems can be represented as sparse matrices, so my interest is more in the processing of sparse matrices. I don't have much to share at this moment but I hope this is something I will blog about in the future.

by Ricky Ho (noreply@blogger.com) at July 14, 2011 06:39 AM

July 10, 2011

Ricky Ho

Fraud Detection Methods

Online electronic fraud has become increasingly problematic for many companies offering services on the web. Here I am trying to generalize a set of techniques that I have found useful in the past.

To be effective in combating fraud, the first thing companies need is an overall top-down strategy to deal with it, including ...
  1. Have a clearly defined security objective, a good understanding of the fraudsters' motivation, as well as the consequences of fraud.
  2. Have an effective analytic method in place to detect fraud immediately when it happens
  3. Have a responsive handling process in place to react immediately after fraud is detected
  4. Have a preventive process in place to feed newly discovered fraud patterns back into the system
In the following discussion I will focus more on the technical side of the analytic methods, but I want to reiterate that the process side is equally (or even more) important for the whole effort of combating fraud to be effective.

Setting Objectives and Targets
Setting the objectives upfront is very important for guiding the subsequent design process of the technical mechanism, especially when making tradeoff decisions between false positives and false negatives. A high false negative rate means fraud goes through undetected, while a high false positive rate will cause inconvenience to your existing customers as well as an unnecessarily large manual investigation effort.

From another angle, some companies look at fraud detection methods as a way to optimize the use of existing resources for manual investigation, which is usually the last resort for handling fraud. These companies usually have a fixed-size team of fraud investigators. If these people spend too much time on legitimate transactions, there will be less time left to investigate the real fraudulent transactions. Therefore, the analytical methods aim at guiding the manual investigation effort toward those transactions with a higher chance of fraud.

Notice that fraud detection is a game of continuous improvement. At each iteration, there is a baseline (usually the current best method) and an improvement threshold. The method at each iteration is supposed to provide at least that much improvement over the baseline. In the first iteration, the baseline can be very low (e.g. simply random guess). At each iteration, the baseline will be raised until the company's objectives and targets have been satisfactorily met.

Instrumenting Analytical Methods
Depending on the nature of the business and the motivation of the fraudster, the characteristics of fraud can be very different. It is very important to understand them before designing the best mechanism to combat them.

Here is a high level decision process to determine the correct method


a) Rule-based approach
This applies if the attack pattern is well-defined (e.g. fraudulent credit card transactions tend to have a higher-than-usual spending amount as well as a higher-than-usual transaction rate). These attack patterns can usually be extracted from domain experts in the business. The best way to implement a solution in this case is to encode such knowledge as rules, or even hard-wire it into the application code for efficiency reasons.

Notice that rules need to be maintained as new attack patterns are discovered or old attack patterns become obsolete. A rule engine is a pretty common approach to keep such domain knowledge in a declarative form so it can be easily maintained.

b) Classification approach
If we have training examples for both the normal case and the fraud case, classification methods (based on machine learning) can perform very well. Such analytic methods include logistic regression, decision trees (random forest), support vector machines, Bayesian networks (naive Bayes), neural networks ... etc.

To compare the performance of different classification methods, a confusion matrix is commonly used. It is a 2 by 2 matrix counting the true positives, false positives, true negatives and false negatives. Based on the cost associated with false positives and false negatives, we can determine the best method (or ensemble of multiple methods) to achieve a minimal cost.
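
To illustrate cost-based comparison, the Python sketch below counts the four confusion-matrix cells for two hypothetical classifiers and weighs their errors. The labels, the two prediction lists and the cost figures are invented for the example.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for boolean labels where True means fraud."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif not p and not a:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def total_cost(actual, predicted, cost_fp, cost_fn):
    """Weigh false positives and false negatives by their business cost."""
    _, fp, _, fn = confusion_counts(actual, predicted)
    return fp * cost_fp + fn * cost_fn


actual = [True, False, False, True, False, False]
cautious = [True, True, True, True, False, False]       # flags more, misses none
permissive = [True, False, False, False, False, False]  # flags less, misses one

cost_cautious = total_cost(actual, cautious, cost_fp=5, cost_fn=100)
cost_permissive = total_cost(actual, permissive, cost_fp=5, cost_fn=100)
```

With a missed fraud assumed to cost twenty times as much as a needless investigation, the cautious classifier comes out cheaper even though it raises more false alarms.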

c) One-Class Model approach
If we have training examples only for normal cases but no fraud examples, we can still learn a model based on normal data and then compute the distance between the transaction data and the model we learned. We flag the transaction as fraud if the distance exceeds a domain-specific threshold. Here the distance function between the model and a data point needs to be defined. Commonly used ones include statistical methods, where the model is the mean and standard deviation of the normal data and the P-value serves as the distance function. Euclidean distance, Jaccard distance and cosine distance are also commonly used.
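
A minimal Python sketch of the one-class idea follows. Note that it uses the z-score (number of standard deviations from the mean) as the distance instead of a P-value, which is a simplification. The sample amounts and the threshold of 3 are assumptions.

```python
import statistics


def fit_normal_model(values):
    """Learn the model from normal data only: (mean, standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)


def z_distance(value, model):
    mean, stdev = model
    return abs(value - mean) / stdev


def is_suspicious(value, model, threshold=3.0):
    # The threshold is domain specific; 3.0 is only a placeholder.
    return z_distance(value, model) > threshold


normal_amounts = [20, 22, 19, 21, 18, 20, 23, 17]
model = fit_normal_model(normal_amounts)
```

Here the learned mean is 20 and the standard deviation is 2, so an amount of 21 is well within the normal range while an amount of 40 is ten standard deviations away.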

d) Density based methods and clustering methods
If we know nothing about the fraud patterns and also don't have training examples even for normal cases, then we can make some assumptions about the distribution of the data, such as that fraud data is less dense than normal data; in other words, a fraud transaction will have fewer neighbors within a certain radius. If this assumption is reasonable, then we can use a density-based method to predict fraud transactions, for example by counting the number of neighbors within radius r, or measuring the distance to the kth nearest neighbor. We can also use a clustering method to learn clusters and flag transactions that are too distant from their cluster center as fraud.
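
The neighbor-counting variant can be sketched in a few lines of Python. This brute-force version compares every pair of points, so it only suits small data sets. The function names, radius, minimum neighbor count and sample points are assumptions.

```python
import math


def neighbor_counts(points, radius):
    """For each point, count the other points within the given radius."""
    counts = []
    for i, p in enumerate(points):
        counts.append(
            sum(
                1
                for j, q in enumerate(points)
                if i != j and math.dist(p, q) <= radius
            )
        )
    return counts


def flag_sparse(points, radius, min_neighbors):
    """Flag points whose neighborhood is less dense than the minimum."""
    return [c < min_neighbors for c in neighbor_counts(points, radius)]


points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
flags = flag_sparse(points, radius=1.0, min_neighbors=2)
```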

Determining input signals
In my experience, determining the right signal is the most important part of the whole process. Sometimes we use raw input attributes as the signal while other times we need to combine multiple attributes to provide the signal.

For example, as we take raw measurements at different points in time, the input signal may involve computing the rate of change of these raw measurements over time. In other words, it is not adequate to just look at each data point in isolation; we need to aggregate raw measurements in a domain-specific way.

In my past experience, a large portion of fraud detection cases is about how to deal with account takeover transactions (stolen identities and impersonation). Usually, detecting a sudden change of behavior (e.g. change point detection) is an effective approach to deal with this kind of fraud.

Time dimension
Instead of looking at each transaction in isolation, in many cases we need to look at the "context" under which fraud is evaluated. As we discussed above for detecting a sudden change of behavior, it is quite common to use the past data of a user to build a normal model and evaluate the recent transactions against it to determine whether they are fraud. In other words, we compare his/her current behavior with the past.

Besides the "time dimension", we can look into other contexts as well. For example, we can look at the behavior of a user's peer group, treating the deviation of one person's behavior from that of the peer group as an indication of a stolen identity.

Notice that the normal pattern may also evolve over time; nevertheless, we usually don't expect such change to be sudden or rapid. To cater for such slow drift, the normal model needs to be continuously adjusted as well. A pretty common technique is to compute a long-term behavioral signature based on a longer time span of transactional data (e.g. 6 months) and compute a short-term behavioral signature based on a shorter time span of data. Then the short-term signature is compared with the long-term signature using a distance function and fraud is flagged if it exceeds a pre-defined threshold. It is also important to have an incremental update mechanism for the long-term signature rather than recomputing it from scratch at every update. A common approach is to use an exponential time-decay function such as ...
M[t+1] = a.M[t] + (1-a)S[t].
where 0 < a < 1
M[t] is model at time t
S[t] is the transaction at time t
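
The update rule above is cheap to apply incrementally. The Python sketch below treats the signature as a small feature vector and pairs the update with a Euclidean distance check. The two features (average amount, transactions per day), the decay factor 0.9 and the threshold are all assumptions for illustration.

```python
def update_signature(model, observation, a=0.9):
    """M[t+1] = a * M[t] + (1 - a) * S[t], applied element-wise."""
    return [a * m + (1 - a) * s for m, s in zip(model, observation)]


def distance(u, v):
    """Euclidean distance between two signatures."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


# Hypothetical signature: [average amount, transactions per day]
long_term = [100.0, 2.0]
typical = [110.0, 2.0]
unusual = [900.0, 15.0]
threshold = 200.0  # domain specific; chosen arbitrarily here

flag_typical = distance(typical, long_term) > threshold
flag_unusual = distance(unusual, long_term) > threshold

# Only fold behavior that was not flagged into the long-term signature.
long_term = update_signature(long_term, typical)
```

A value of a close to 1 makes the long-term signature drift slowly, so a single unusual transaction cannot move it far.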

The importance of Domain Experts
Although sophisticated machine learning algorithms are pretty powerful at providing a generalized solution for a broad range of problems, in my past experience I have yet to see many cases where a sophisticated machine learning algorithm beats domain expertise. In many projects, a simple algorithm with deep domain expertise significantly outperforms sophisticated analytical methods. Therefore, the common pattern that I recommend is to build a rule-based solution at the core and augment it with machine learning analytical methods.

by Ricky Ho (noreply@blogger.com) at July 10, 2011 01:05 AM

July 07, 2011

Damien Katz

Break's Over. Big Mover. Couchbase changing the Game.

There is some seriously cool stuff coming up at CouchConf on July 29. One of the things I'm most excited about is that Richard Hipp, creator of SQLite, will join me on stage to talk about our current joint project. Can't tell you what it is right now, but if you feel the Earth shift a little that day, you'll know why...and be sure to watch this space on July 29 to learn the details!

by Damien Katz at July 07, 2011 10:52 PM

June 21, 2011

Damien Katz

Couchbase Training Summer Special

We are doing a special training deal this summer--$395 for two days of training!

The next one is in Portland in just a couple days on June 27 and 28! If you're in Portland for OSBridge, or you are in the area, you should definitely sign up.

http://www.couchbase.com/couchdb-training/portland-june-2011

Also, if you're in NYC this summer and want to learn about Membase Server, we'll be doing a class on July 11 and 12th.

http://www.couchbase.com/membase-training/nyc-july-2011

We only have a limited number of seats so it's important to sign up ASAP.

by Damien Katz at June 21, 2011 07:59 PM

June 16, 2011

Damien Katz

CouchConf Early Bird Special Ends Tomorrow

Sign up by Friday, June 16, for the early bird rate. CouchConf is July 29 in San Francisco.

CouchConf is the only conference dedicated to all things Couch. This one-day event is for any developer who wants to take a deeper dive into Couchbase technology, learn where it's headed and build really cool stuff.

Sign up now!

by Damien Katz at June 16, 2011 03:57 PM

June 15, 2011

Till Klampäckel

RFC: Mocking protected methods

Update, 2011-06-16, 12:15 AM Thanks for the comments.

(I swear I had something like that before and it didn't work!) Here's the solution:

$userId = 61382;
$docId  = 'CLD2_62e029fc-1dae-4f20-873e-69facb64a21a';
$body   = '{"error":"missing","reason":"problem?"}';

$client = new Zend_Http_Client;
$client->setAdapter(new Zend_Http_Client_Adapter_Test);

$couchdb = $this->getMock(
    'CouchDB',
    array('makeRequest',),
    array($userId, $this->config,)
);
$couchdb->expects($this->once())
    ->method('makeRequest')
    ->will($this->returnValue(new Zend_Http_Response(200, array(), $body)));
$couchdb->setHttpClient($client);
$couchdb->getDocument($docId);

--- Original blog entry ---

I wrote a couple tests for a small CouchDB access wrapper today. But when I wrote the implementation itself, I realized that my class setup depends on an actual CouchDB server being available and here my journey began.

Example code

Consider the following example:

My objective is not to be able to test any of the protected methods directly, but to be able to supply a fixture so we don't have to setup CouchDB to run our testsuite. My fixture would replace makeRequest() and return a JSON string instead.

... more after the jump.

by Till Klampaeckel (till@php.net) at June 15, 2011 10:19 PM

June 02, 2011

Mark Headd

“Phind It For Me” Live in Philly

Really excited to launch a new OpenGov project in Philadelphia - Phind It For Me.

The service is built on PHLAPI and the point data sets it houses. As such, one could understand why I’d be interested in enhancing the data sets currently in PHLAPI.

I’m really excited about this project - source code available on GitHub - and would love to see if there is an interest in launching in other cities with CouchDB-based geospatial data repositories, like Baltimore.

It’s built on the awesome new SMSified platform from Voxeo (disclaimer, I work there) and uses a Node.js module I built for working with the SMSified API.

As always, dear readers, any comments or feedback is welcomed.

Do head on over to the project website and check it out!

by Mark Headd at June 02, 2011 03:32 PM

May 19, 2011

Damien Katz

Upstream

JavaScript Workshop for Designers 2

Workshop

We are doing another iteration of our JavaScript for Designers Workshop (in German). Our designer Kristina and I (Alex) will be teaching the basics of programming in the browser for a full day. You don’t need to have any knowledge about programming. You don’t even have to be a designer :)

The workshop is scheduled for July 15, early bird tickets are available for €300 until June 10.

For more information visit jstraining.de.

by Alexander Lang at May 19, 2011 08:01 AM

April 23, 2011

Till Klampäckel

Some thoughts on outages

Cloud, everybody wants it, some actually use it. So what's my takeaway from AWS' recent outage?

Background

So first off, we had two pieces of our infrastructure failing (three if we include our Multi-AZ RDS), both of which involve EBS.

Numero uno

One of those pieces in my immediate reach was a MySQL server, which we use to keep sessions. And to say the least about AWS and in their defense, the instance had run for almost 550 days and had never given us much or any reason to believe it would let us down.

In almost two years with AWS I did not magically lose a single instance. I had to reboot two or three once because the host had issues and Amazon sent us an email asking us to make sure the instance would survive a reboot, but that's about it.

Recovering the service, or at least launching a replacement in a different region, would have been possible if we had not, by coincidence, hit several limits on our AWS account (instances, IPs and EBS volumes), which apparently take multiple days to lift. We contacted AWS immediately and the autoresponder told us to email back if it was urgent, but I guess they had their hands full and apparently we are not high up the chain enough to express how urgent it really was.

I also tried to reach out to some of the AWS evangelists on Twitter, which didn't work since they went silent almost all the way through this outage.

All in all, it took roughly five hours to get the volume back and another 4-5 to recover the database. As far as I can tell, nothing was lost.

And in our defense — we were well aware of this SPOF and had already plans to move on a more redundant approach — I have another blog post in draft about evaluating alternatives (Membase).

Numero due

The second critical piece of infrastructure which failed for us is our hosted BigCouch cluster with Cloudant.

We managed to (manually) fail over to their cluster in us-west-1 later in the day and brought the service back up. We would have done this earlier, but AWS suggested it would be only a few hours, which is why we wanted to avoid the hassle of having to sync up clusters later on.

Sidenote: Cloudant is still (day three of the downtime) trying to get all pieces back online. Kudos to everyone from Cloudant for their hard work and patience with us.

Lesson learned for myself: When things fail which are not within your reach, it's pretty hard to do anything and stay calm. A good thing is to keep everyone busy so our team tried to reach out to all customers (we average about 200,000 users per day) via Twitter and Facebook and it looks like we've tackled that well.

Un trio!

Well, I don't have much to say about Amazon RDS (hosted MySQL in the cloud). Except that it didn't live up to our expectations: it costs a lot of money, but we learned that apparently that doesn't buy us anything.

Looking at the CloudWatch statistics associated with our RDS setup (or EBS in general), I'm rather wary and don't even know if they can be trusted. In the end, I can't really say for how long RDS was down or failed to fail over, but it must have been back before I got a handle on our own MySQL server.

The rest?

The rest of our infrastructure seems fine on AWS — most of the servers are stateless (Shared nothing, anyone?) and no EBS is involved.

And with the absence of EBS, there are no issues to begin with. Everything continued to work just as expected.

Design for failure.

This is not really a take away, but a no-brainer. It's also not limited to AWS or cloud computing in general.

You should design for failure when you build any type of applications.

In terms of cloud computing it's really not as bad as ZOMG MY INSTANCE IS GONE!!!11, but it certainly can happen. I've heard people claim that EC2 instances constantly disappear and with my background of almost two years on AWS, I know that's just not true.

Designing for failure should take the following into account:

Services become unavailable

These services don't have to run in the cloud, they can become unavailable on bare metal too. For example, a service could crash, there could be a network partition or maintenance involved. The bottom line: How does your application deal with it?

Services run degraded

For example, higher latency between the services, slower response time from disk — you name it.

The unexpected

Sometimes, everything is green, but your application still chokes.

While testing for the unexpected is of course impossible, validating what comes in is not.

Recovery

I'm not sure if a fire drill is necessary, but it helps to have a plan on how to troubleshoot issues to be able to recover from an outage.

In our case, we log almost anything and everything to syslog and utilize loggly to centralize logs. loggly's nifty console provides us with great input about the state of our application at any time.

Add to the centralized logging, that we have a lot of monitoring using ganglia and munin in place. Monitoring is an ongoing project for us since it seems like once you start, you just can't stop. ;-)

And last but not least: We can launch a new configured EC2 instance with a couple mouse clicks using Scalarium.

I value all these things equally — without them, troubleshooting and recovery would be impossible or at least a royal PITA.

Don't buy hype

So to get to the bottom of this (still ongoing) event, I'm not particularly pissed that there was downtime. Of course I can live without it, but what I mean is: Things are bound to fail.

The truth is though, that Amazon's product description is not exactly honest, or at the very least provides everyone with a lot of room for interpretation. You're asking for how much interpretation? I'm sure you could put ten cloud experts into one room and come away with 15 different opinions.

For example the use of attributes to a service such as highly available may cause different expectations for different people.

Let me break it down for you: highly available in AWS speak, means, "it works most of the time".

What about an SLA?

Highly available is not an SLA with those infamous nine Erlang nines.

On paper a multi-az deployment of Amazon RDS gets pretty close to what generally people expect from highly available: MySQL master-master replication, backups — in multiple datacenters. As of today we all know: even these things can fail.

And speaking of SLAs: It looks like none of the failing services are covered by it, so AWS' track record remains clean. This is because EBS is not explicitly named in it and, by the way, neither is RDS. Amazon's SLA, as far as EC2 is concerned, covers some but not all network outages. Since I was able to access the instance the entire time, none of this applies here.

Multi-Zone

On Twitter people are quick to suggest that everyone who complains now should have had a setup in multiple availability zones.

Here's what I think about it:

  • I find it rather amusing that apparently everyone on bare metal runs in multiple datacenters.

  • When people suggest multi-zone (or multi-az in Amazon-speak), I think they really mean multi-region. A zone is effectively one of us-east-1a, us-east-1b, us-east-1c and us-east-1d. Since all datacenters (= availability zones) in the us-east-1 region failed on 2011/04/21, your multi-zone setup would not have covered your butt. Even Amazon's multi-az RDS failed.

  • Little do people know, but the zone identifiers (e.g. us-east-1a, us-east-1b) are tied to customer accounts. So for example, Cloudant's version of us-east-1a may be my us-east-1c; it may or may not be the same. This is why in many cases AWS never calls out explicit zones in outages. This also makes it somewhat hard to plan ahead in a single region.

  • AWS sells customers on the idea that an actual multi-az setup is plenty. I don't know too many companies who do multi-region (maybe SimpleGeo?). Not even Netflix does multi-region, but I guess they managed to sail around this disaster because they don't use EBS.

  • In the end it shouldn't be necessary to do a multi-region setup (and deal with its caveats) since according to AWS the different zones inside a region are different physical locations (let's call them datacenters) to begin with. Correct me if I'm wrong, but the description says different physical location, this is not just another rack in the same building or another port on the core switch.

Communication

Which brings me to the most important point of my blog entry.

In a nutshell, when you build for the AWS platform, you're building for a blackbox. There are a couple papers and blog posts where people try to reverse engineer the platform and write about its behavior. The problem with these things is that most people are guessing, though often (of course depending on the person writing) it seems to be a very well educated guess.

Roman Stanek blogged about communication between AWS and its customers, so head on over; I pretty much agree with everything he has to say.

Fin

So what exactly is my take away? In terms of technical details and as far as redundancy is concerned: not so much.

Whatever you do to run redundantly on AWS applies to setups in your local colocation or POP as well. And in theory, AWS makes it easier to leverage multiple datacenters (availability zones) and even achieve somewhat of a global footprint by distributing across different regions.

The real question for anyone to ask is, Is AWS fit to host anything which requires permanent storage? (I'm inclined to say no.)

That's all.

by Till Klampaeckel (till@php.net) at April 23, 2011 07:56 PM

April 22, 2011

Ricky Ho

K-Means Clustering in Map Reduce

Unsupervised machine learning has broad application in many e-commerce sites, and one common usage is to find clusters of consumers with common behaviors. Among clustering methods, K-means is the most basic and also an efficient one.

K-Means clustering involves the following logical steps:

1) Determine the value of k
2) Determine the initial k centroids
3) Repeat until convergence
- Determine membership: Assign each point to the closest centroid
- Update centroid position: Compute new centroid position from assigned members

Determine the value of K
This is basically asking the question: "How many clusters are you interested in discovering?"
So the answer is specific to the problem domain.

One way is to try different values of K. At some point, we'll see that increasing K doesn't do much to improve the overall quality of the clustering. That is then the right value of K.

Notice that the overall quality of the clustering is measured by the average distance from each data point to the centroid of its associated cluster (the smaller the better).
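
As a rough sketch of this quality measure (the function name and the toy data are my own assumptions, not taken from any library), the average distance can be computed for two candidate values of K and compared:

```python
import math

def average_distance(points, centroids):
    """Mean Euclidean distance from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(math.dist(p, c) for c in centroids)
    return total / len(points)

# Toy data: two obvious groups, one around x=0 and one around x=10.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Hypothetical centroids for K=1 and K=2.
quality_k1 = average_distance(points, [(5.0, 1.0)])
quality_k2 = average_distance(points, [(0.0, 1.0), (10.0, 1.0)])
```

Here going from K=1 to K=2 drops the average distance sharply. Once a further increase of K stops producing such a drop, K is large enough.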


Determine the initial K centroids
We need to pick K centroids to start the algorithm. So one way to pick them is to randomly pick K points from the whole data set.

However, picking a good set of centroids can reduce the number of subsequent iterations. By "good" I mean the K centroids should be as far apart from each other as possible or, even better, the initial K centroids are close to the final K centroids. As you can see, choosing K random points is reasonable but not optimal.

Another approach is to take a small random sample from the input data set and do a hierarchical clustering within this smaller set (note that hierarchical clustering does not scale to large data sets).

We can also partition the space into overlapping regions using the canopy clustering technique (described below) and pick the center of each canopy as an initial centroid.

Iteration
Each iteration is implemented as a Map/Reduce job. First of all, we need a control program on the client side to initialize the centroid positions, kick off the iterations of Map/Reduce jobs and determine whether the iteration should end ...

kmeans(data) {
  initial_centroids = pick(k, data)
  upload(data)
  writeToS3(initial_centroids)
  old_centroids = initial_centroids
  while (true) {
    // one Map/Reduce job per iteration; the reducer writes the new centroids to S3
    map_reduce()
    new_centroids = readFromS3()
    if change(new_centroids, old_centroids) < delta {
      break
    } else {
      old_centroids = new_centroids
    }
  }
  result = readFromS3()
  return result
}


Within each iteration, most of the processing is done in the Map task, which determines the membership of each point and computes a partial sum over the member points of each cluster.

The reducer does the easy job of aggregating all partial sums and computing the updated centroid positions. It then writes them to a shared store (S3 in this case), where they can be picked up by the Map/Reduce job of the next round.
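
To make the division of labor concrete, here is a minimal single-process sketch in Python. It is only an illustration under my own assumptions: each input split is a plain list, the "shared store" is just a return value rather than S3, and all function names are made up for this example.

```python
import math

def closest_index(point, centroids):
    """Index of the centroid nearest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def map_task(points, centroids):
    """Assign each point to its closest centroid; emit partial sums per cluster."""
    partial = {}  # cluster index -> (coordinate sums, member count)
    for p in points:
        i = closest_index(p, centroids)
        sums, count = partial.get(i, ([0.0] * len(p), 0))
        partial[i] = ([s + x for s, x in zip(sums, p)], count + 1)
    return partial

def reduce_task(partials, centroids):
    """Aggregate the partial sums of all mappers into new centroid positions."""
    totals = {}
    for partial in partials:
        for i, (sums, count) in partial.items():
            tsums, tcount = totals.get(i, ([0.0] * len(sums), 0))
            totals[i] = ([a + b for a, b in zip(tsums, sums)], tcount + count)
    new_centroids = list(centroids)  # a cluster without members keeps its position
    for i, (sums, count) in totals.items():
        new_centroids[i] = tuple(s / count for s in sums)
    return new_centroids

def kmeans(splits, centroids, delta=1e-6, max_rounds=100):
    """Control program: run rounds until the centroids move less than delta."""
    for _ in range(max_rounds):
        partials = [map_task(split, centroids) for split in splits]
        new_centroids = reduce_task(partials, centroids)
        change = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if change < delta:
            break
    return centroids

# Two input splits, as if processed by two mappers.
splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
result = kmeans(splits, [(0.0, 0.0), (10.0, 0.0)])
```

Emitting sums and counts instead of individual points is what keeps the reducer cheap: it only ever sees one small record per cluster per mapper.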



Complexity Analysis
Most of the work is done by the Mapper and the workload is pretty balanced. So the time complexity of one iteration will be O(k*n/p), where k is the number of clusters, n is the number of data points and p is the number of machines. Note that the factor of k comes from the closest_centroid() function, which compares each data point with each intermediate centroid as follows ...
closest_centroid(point, listOfCentroids) {
  bestCentroid = listOfCentroids[0]
  minDistance = INFINITY
  for each centroid in listOfCentroids {
    distance = dist(point, centroid)
    if distance < minDistance {
      minDistance = distance
      bestCentroid = centroid
    }
  }
  return bestCentroid
}

If we partition the space into proximity regions, we only need to compare each point with the centroids within the same proximity region and can treat all other centroids as being at infinite distance. In other words, we don't have to compare each point with all k centroids.

Canopy clustering provides such a partitioning mechanism.


Canopy Clustering
To define the proximity region (canopy), we can draw a circle (or hypersphere) centered at a data point. Points outside this sphere are considered to be too far away.

However, if we apply this definition to every point, then we will have as many proximity regions as there are points, which doesn't save much processing. We also observe that points that are very close to each other can stay in the same region without each point creating its own. Therefore, we can draw a smaller circle within the big circle (with the same center) such that data points within the small circle are not allowed to form their own proximity region.


Let T1 be the radius of the big circle and T2 the radius of the small circle. Notice that the proximity regions can overlap with each other, and the degree of overlapping will be affected by the choice of T1. The choice of T2 affects how many canopies will be formed. Picking the right values of T1 and T2 is domain-specific, and also depends on the number of clusters and the space volume. If there is a small number of clusters within a big space, then a bigger T1 should be chosen.

To create the canopies (and mark the data points with the canopies), we will do the following steps ...
1) Create the canopy centers, with one scan
  • Keep a list of canopies, initially an empty list
  • Scan each data point: if it is within T2 distance of an existing canopy, discard it. Otherwise, add this point to the list of canopies

2) Assign data points to the canopies, with another scan
  • Start with the list of canopies from the last step
  • Scan each data point: if it is within T1 of a canopy A, add A as an assigned canopy to the data point. Notice that a data point can be assigned to multiple canopies
  • When done, each data point will look like

Notice that the input data points now carry an extra attribute that contains the assigned canopies. When comparing a point with the intermediate centroids, we only need to compare it with centroids within the same canopy. Here is the modified version of the algorithm ...

closest_centroid(point, listOfCentroids) {
  bestCentroid = listOfCentroids[0]
  minDistance = INFINITY
  for each cent in listOfCentroids {
    // skip centroids that share no canopy with this point
    if (not point.myCanopy.intersects(cent.myCanopy)) {
      continue
    }
    distance = dist(point, cent)
    if distance < minDistance {
      minDistance = distance
      bestCentroid = cent
    }
  }
  return bestCentroid
}
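
The two scans and the restricted comparison can be sketched in Python as follows. This is a single-machine illustration only; the function names are my own, canopies are identified by their index in the list of centers, and a centroid's canopies are computed the same way as a point's.

```python
import math

def create_canopies(points, t2):
    """Scan 1: a point becomes a canopy center unless it is within T2 of an existing one."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > t2 for c in centers):
            centers.append(p)
    return centers

def assign_canopies(point, centers, t1):
    """Scan 2: a point belongs to every canopy whose center is within T1."""
    return {i for i, c in enumerate(centers) if math.dist(point, c) <= t1}

def closest_centroid(point, centroids, centers, t1):
    """Compare the point only with centroids sharing at least one canopy with it."""
    point_canopies = assign_canopies(point, centers, t1)
    best, min_distance = None, math.inf  # best stays None if no centroid shares a canopy
    for cent in centroids:
        if not point_canopies & assign_canopies(cent, centers, t1):
            continue  # treated as infinitely far away
        distance = math.dist(point, cent)
        if distance < min_distance:
            best, min_distance = cent, distance
    return best

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers = create_canopies(points, t2=2.0)
centroids = [(0.5, 0.0), (10.5, 0.0)]
nearest = closest_centroid((1.0, 0.0), centroids, centers, t1=3.0)
```

In a real job the canopy sets would be computed once and stored with each point, as described above, rather than recomputed on every comparison.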

by Ricky Ho (noreply@blogger.com) at April 22, 2011 11:04 PM

April 19, 2011

Volker Mische

FOSSGIS, GeoCouch and MapQuery

Two weeks ago I had the chance to give a talk about GeoCouch and MapQuery at FOSSGIS 2011. Most of the people who read this blog are probably aware of GeoCouch, but not so much of MapQuery. For me these two projects are tightly connected and therefore deserve a quick introduction/update.

GeoCouch

GeoCouch, a spatial index for CouchDB, gains more and more attention. One of the reasons is that the installation recently got way easier for developers as well as for normal users. You can now install GeoCouch as an extension right next to your already existing CouchDB instance. You may also download a binary of Couchbase-Server, which already includes GeoCouch. And finally there's the brand new Iris Couch hosting as well (previously known as the CouchOne hosting). So getting started with GeoCouch is easier than ever before.

Some people might have wondered about the state/future of GeoCouch, especially after the merger of CouchOne with Membase to form Couchbase. I will keep on developing GeoCouch at Couchbase and it is (as it always was) fully open source, licensed under the Apache 2.0 License.

The new home for the latest source is the Couchbase Github repository.

OpenStreetMap

FOSSGIS was also about OpenStreetMap. The idea of putting OpenStreetMap data into GeoCouch is very sensible, but hasn't really been done (AFAIK) on a big scale. Luckily Jochen Topf from Geofabrik told me about his project Osmium, which makes it possible to process OSM data with JavaScript. There is already a script to output a Shapefile, so it should be really easy to output GeoJSON, which could be consumed by GeoCouch. So if you (who are currently reading this) have some spare time, please give it a go :)

MapQuery

MapQuery is a web mapping framework that builds on OpenLayers and jQuery. The goal is a framework that is just as easy to use as jQuery, combined with the power of OpenLayers. It's meant for people who just want to get started with web mapping, but also for those who already know OpenLayers and want easy integration into their jQuery application.

I was able to show a quick demo of the MapQuery API at FOSSGIS. I won't publish it here, as things are about to move fast. After over a year of discussions about MapQuery and only a few code contributions, it seems that we are finally getting somewhere. That feels so good :)

The wonderful EduGIS is built on an early version of MapQuery (source code), but will be merged with the most recent version of my fork.

Other big news is that the WhereGroup hired Christian Wygoda, who is a committer of the MapQuery project. This also means that Mapbender 3 will use MapQuery.

And finally I've also met a developer from another company that was building a big mapping application based on OpenLayers and jQuery. I don't want to disclose it here, as the code isn't open source yet, but the developer told me that it should be easily possible. I will keep in touch with them and hope they will contribute their code to MapQuery.

To conclude about MapQuery: If you want to stay in touch with the project, please subscribe to the official mailing list, as this is where things are happening (there's also the sparsely attended IRC channel #mapquery on freenode). If you want to be a user of MapQuery, you should be patient and wait a bit. If you plan to contribute, you can start now. Currently the biggest item is moving the EduGIS MapQuery code base over to the MapQuery version of my fork. The "documentation" is the demos.

FOSSGIS

As people started to ask about the slides from my presentation at FOSSGIS, here they are.

FOSSGIS was a really awesome event, where I met a lot of new people, but also a lot of friends I haven't seen in a while. I'm really looking forward to next year's conference, but also hope that I might see many of the people at this year's FOSS4G in Denver.

by Volker Mische at April 19, 2011 12:23 PM

April 04, 2011

Till Klampäckel

Trying out BigCouch with Chef-Solo and Vagrant

So the other day, I wanted to quickly check something in BigCouch and thanks to Vagrant, chef(-solo) and a couple cookbooks — courtesy of Cloudant — this was exceptionally easy.

As a matter of fact, I had BigCouch set up and running within literally minutes.

Here's how.

Requirements

You'll need git, Ruby, gems and Vagrant (along with Virtualbox) installed. If you need help with those items, I suggest you check out my previous blog post called Getting the most out of Chef with Scalarium and vagrant.

For the operating system, I suggest you get an Ubuntu 10.04 box (aka Lucid).

Vagrant (along with Ruby and Virtualbox) is a one time setup which you can use and abuse for all kinds of things, so don't worry about the extra steps.

Setup

Clone the cookbooks in $HOME:

$ git clone http://github.com/cloudant/cloudant_cookbooks

Create a vagrant environment:

$ mkdir ~/bigcouch-test
$ cd ~/bigcouch-test
$ vagrant init

Setup ~/bigcouch-test/Vagrantfile:

Vagrant::Config.run do |config|
  config.vm.box = "base"
  config.vm.box_url = "http://files.vagrantup.com/lucid32.box"

  # Forward a port from the guest to the host, which allows for outside
  # computers to access the VM, whereas host only networking does not.
  # config.vm.forward_port "http", 80, 8080

  config.vm.provisioner = :chef_solo
  config.chef.cookbooks_path = "~/cloudant_cookbooks"
  config.chef.add_recipe "bigcouch::default"
end

Start the vm:

$ vagrant up

Use BigCouch

$ vagrant ssh
$ sudo /etc/init.d/bigcouch start
$ ps aux|grep [b]igcouch

Done. (You should see processes located in /opt/bigcouch.)

Fin

That's all. As an added bonus, you could open BigCouch's ports on the VM and use it from your host system, because otherwise this is all a matter of localhost. See config.vm.forward_port in your Vagrantfile.

by Till Klampaeckel (till@php.net) at April 04, 2011 03:56 PM

March 21, 2011

Damien Katz

Couchbase SF Training Was Awesome

I had a blast teaching the first Couchbase CouchDB Training with training pro Alan McKean last week. 2 intensive days of hands on teaching and talking about Apache CouchDB to enthusiastic and excited people. It was actually a learning experience for me too, there's a lot in CouchDB I haven't had a chance to use yet :)

It's not too late to sign up for the remaining 3 cities on the Couchbase Training World Tour: Austin, London and Berlin.

by Damien Katz at March 21, 2011 10:49 PM

March 20, 2011

Upstream

Testing the AnyTime date picker with Cucumber/Selenium

On cobot we are making extensive use of the AnyTime date picker. When it came to integration testing with Cucumber/Capybara, until recently I got away with using the Rack::Test driver and doing

And I fill in "2010-01-01 12:45:00" for "Date"

Yesterday I finally needed to enter a date in a Scenario that was using JavaScript (with the Selenium driver). With JavaScript enabled the above step doesn’t work anymore because when Selenium tries to type into the text field AnyTime pops up and resets the text field’s contents. After some fiddling around I decided that I would make Selenium click the actual buttons on the date picker. Not only did this seem to be the least hackish way, it would also mean I would be testing the real thing, i.e. clicking on buttons like a user would. This becomes especially important once you start with internationalization and different time formats (e.g. 24h vs. 12h system), where you want to make sure AnyTime generates the proper date string.

Long story short, it took me almost a day to figure out how to do this. I played around with a lot of variations, but in the end this is what you have to do:

When /I select the time "([^"]+)" from "([^"]+)"/ do |time_string, label|
  time = Time.parse(time_string)
  And %Q{I fill in "" for "#{label}"}
  find_field(label).click

  click_on_selectors ".AnyTime-btn:visible:contains(#{time.year})",
    ".AnyTime-mon#{time.month}-btn:visible",
    ".AnyTime-dom-btn:contains(#{time.day}):visible:first",
    ".AnyTime-hr#{time.hour}-btn:visible",
    ".AnyTime-x-btn:visible"
end

def click_on_selectors(*selectors)

  def recurse(*selectors)
    if selectors.any?
      wait_for_css_selector_fn(selectors.first,
        "$('#{escape_javascript selectors.first}').click(); #{recurse(*selectors[1..-1])}")
    else
      'window.__capybara_wait = false;'
    end
  end

  page.evaluate_script "window.__capybara_wait = true"
  page.evaluate_script recurse(*selectors)
  wait_until 10 do
    page.evaluate_script "!window.__capybara_wait"
  end
end


include ActionView::Helpers::JavaScriptHelper
def wait_for_css_selector_fn(selector, after)
  <<-JS
    (function() {
      var time = new Date().getTime();
      var runDelayed = function() {
        if(!$('#{escape_javascript selector}').length) {
          if(time < new Date().getTime() - 5000) {
            throw('waited too long for #{escape_javascript selector}')
          } else {
            window.setTimeout(runDelayed, 100);
          }
        } else {
          #{after};
        };
      }
      window.setTimeout(runDelayed, 100);
    })();
  JS
end

In your Cucumber feature you can then call:

When I select the time "2010-01-01 12:00" from "Date"

What the above code essentially does is open the date picker and click on all the required buttons. Just as important (and that was the tricky part) it asynchronously waits for the necessary events (open/close date picker, date shows up in text field). Enjoy.

by Alexander Lang at March 20, 2011 07:53 PM

Ricky Ho

Compare Machine Learning models with ROC Curve

ROC Curve is a common method to compare performance between different models. It can also be used to pick trade-off decisions between "false positives" and "false negatives". The ROC curve is defined as a plot of "true positive rate" against "false positive rate". However, I don't find the ROC concept intuitive and have struggled for a while to grasp it.

Here is my attempt to explain ROC curve from a different angle. We use a binary classification example to illustrate the idea. (ie: predicting whether a patient has cancer or not)

First of all, no predictive model is 100% correct. The desirable state is that a person who actually has cancer gets a positive test result, and a person who actually has no cancer gets a negative test result. Since the test is imperfect, it is possible that a person who actually has cancer tests negative (ie: Fail to detect) or a person who actually has no cancer tests positive (ie: False alarm).


In reality, there is always a tradeoff between the false negative rate and the false positive rate. People can tune the decision threshold to adjust them (e.g. in "random forest", we can set the threshold to predict positive when more than 30% of the decision trees predict positive). Usually, the threshold is set based on the consequence or cost of mis-classification (e.g. in this example, a failure to detect has a much higher cost than a false alarm).


This can also be used to compare model performance. A good model is one that has both low false positive rate and low false negative rate, which is indicated in the size of the gray area below (the smaller the better).

"Random guess" is the worst prediction model and is used as a baseline for comparison. The decision threshold of a random guess is a number between 0 and 1 that determines how often it predicts positive rather than negative.


The ROC Curve is basically what I have described above with one transformation: the y-axis is transformed from "fail to detect" to 1 - "fail to detect", which now becomes "success to detect". Honestly I don't understand why this representation is better though.
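
To see where the points of such a curve come from, here is a small Python sketch that sweeps the decision threshold over a handful of scored examples. The labels and scores are made-up toy data and the function name is my own; it only illustrates the idea and is not taken from any particular library.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
        # tp / positives is "success to detect", i.e. 1 - "fail to detect"
        points.append((fp / negatives, tp / positives))
    return points

labels = [1, 1, 0, 1, 0, 0]               # 1 = actually has cancer
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]   # model's predicted probability
curve = roc_points(labels, scores)
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1): more cases are detected, at the price of more false alarms.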

Now, the ROC curve will look as follows ...

by Ricky Ho (noreply@blogger.com) at March 20, 2011 03:39 AM

March 18, 2011

Ricky Ho

Predictive Analytics Conference 2011

I attended the San Francisco Predictive Analytics conference this week and got a chance to chat with some of the best data mining practitioners in the country. Here is a summary of my key takeaways.

How is the division of labor between human and machine?

Another way to ask this question is how “machine learning” and “domain expertise” work together and complement each other, since each has different strengths and weaknesses.


Machine learning is very good at processing large amounts of data in an unbiased way, while a human is unable to process the same data volume and human judgment is usually biased. However, a machine cannot look beyond the data it is given. For example, if the prediction power is low, machine learning methods cannot distinguish whether this is because the data is not clean, because the wrong model was chosen, or because some important input feature was not captured. Domain expertise must be brought in to figure out the problem.

So the consensus is that data mining / machine learning is simply a toolbox that can be used to augment human domain expertise, but can never replace it. For example, the domain expert can throw a large number of input features at the machine learning model, which can determine the subset that is most influential. But if the domain expert doesn’t recognize an important input feature (and doesn’t capture it), there is no way the machine learning model can figure out what is missing, or even recognize that something is missing.


On the other hand, humans are also very good at visualizing data patterns. “Data visualization” techniques can be a powerful means to get a good sense of the data and quickly identify the areas where drilldown analysis should be conducted. Of course, visualization is limited to low-dimensional data, as humans cannot comprehend more than a handful of dimensions. Humans are also easily biased, so they may find patterns which are actually coincidence. By working together, human and machine complement each other very well.

What are some of the key design decisions in data mining?
  1. Balance between false +ve and false –ve based on cost / consequence of making a wrong decision.
  2. We don’t have to use one method from beginning to end. We can use different methods at different stages of the analysis. For example, in a multi-class (A, B, C) problem, we can use a decision tree to distinguish A from notA (ie: B, C) and then use a support vector machine to separate B and C. As another example, we can use a decision tree to determine the best input attributes to be used by a neural network.

What is the most powerful / most commonly used supervised machine learning modeling technique?


The general answer is that each modeling technique has its strengths and weaknesses and none of them wins in all situations. So understanding their respective strengths and weaknesses is important for picking the right one.

Generalized Linear Regression
Linear and Logistic regression are based on fitting a linear plane to a set of data points such that the root mean square error (distance between predicted output and actual output) is minimized. They are by far the most commonly used techniques, one for numeric output and the other for categorical output. They have a long history in statistics and are supported in pretty much all commercial and open source data mining tools.

Linear and Logistic regression models require a certain amount of data preparation such as missing data handling. They also assume that the output (or logit output) is a linear combination of the input features, and the error is expected to be normally distributed. However, real-life scenarios are not always linear. To deal with non-linearity, input terms are mixed (usually by cross-multiplication) in different ways to generate additional input terms called “interactions”. This process is trial and error and can generate a huge number of combinations. Nevertheless, these models do a reasonably good job in a wide spectrum of business problems and are well understood by statisticians and data miners. They are also commonly used as a baseline for comparison with other models.
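
As a small illustration of what generating “interactions” by cross-multiplication can look like (my own toy helper, restricted to pairwise products):

```python
from itertools import combinations

def add_interactions(row):
    """Append the product of every pair of input features to the row."""
    return list(row) + [a * b for a, b in combinations(row, 2)]

row = [2.0, 3.0, 5.0]
expanded = add_interactions(row)  # 3 original features + 3 pairwise products
```

With n input features there are n*(n-1)/2 pairwise products alone, which is why the number of candidate terms grows so quickly.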

Neural Network
Neural Network is based on multiple layers of perceptrons (each is like a logistic regression with binary input and output). There is typically a hidden layer (so the number of layers is 3) with N perceptrons (where N is found by trial and error). Because of the extra layer and the logit() function in the neural network, it can handle non-linearity very well. If it has good predictors in its input data, a neural network can achieve very high prediction performance.

Similar to linear regression, a neural network requires careful data preparation to remove noisy data as well as redundant input attributes (those that are highly correlated). A neural network also takes much longer to train compared to other methods. Also, the model that a neural network has learned is not explainable, and it is hard to make good sense of it.

Support Vector Machine
Support Vector Machine is a binary classifier (input features are numeric). It is based on finding a linear plane that can separate the binary output classes such that the margin is maximized. The optimal solution is expressed in terms of the dot product of vectors. If the points are not linearly separable, we can use a function to transform the points to a higher dimension space such that they are linearly separable. The math shows that the dot product (after transforming to a hi-dim space) can be generalized into a Kernel function (Radial basis function being the most common one). Although the underlying math is not easy for everyone to understand, SVM has demonstrated outstanding performance in a wide spectrum of problems and has recently become one of the most effective methods.
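
The kernel idea can be checked numerically with a small Python sketch. I use a degree-2 polynomial kernel here because its transformation to the higher dimension space can be written out explicitly; the function names and the gamma value are my own choices for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(v):
    """Explicit transformation of a 2-D point into a 3-D feature space."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """Degree-2 polynomial kernel: the same dot product without computing phi."""
    return dot(a, b) ** 2

def rbf_kernel(a, b, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(a), phi(b))   # transform first, then take the dot product
implicit = poly_kernel(a, b)     # kernel applied directly to the original points
```

Both values agree (up to floating point rounding), which is the point of the kernel function: the classifier never has to materialize the high dimension vectors. For the Radial basis function the corresponding feature space is infinite-dimensional, so the kernel form is the only practical one.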

Despite its powerful capability, SVM is not broadly implemented in commercial products because of a patent issue: AT&T holds the patent on SVM. In addition, a non-linear kernel function (such as the most common Radial Basis function) is difficult to implement in a parallel programming model such as Map/Reduce. SVM is undergoing active research, and a derivative, Support Vector Regression, can be used to predict numeric output.


Tree Ensembles

This is combining “ensemble methods” with “decision tree”.

Decision tree is a first-generation machine learning algorithm based on a greedy approach. For a classification problem, a decision tree tries to split a branch where the combined “purity” (measured by either the Gini index or Entropy) after the split is maximized. For a regression problem, a decision tree tries to split where the combined “between-class-variance” divided by “within-class-variance” is maximized. This is equivalent to maximizing the F-value after the split. The splitting continues until a terminating condition is reached, such as too few members remaining in the branch or the gain of a further split being insignificant.
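
As a minimal sketch of how a candidate split can be scored with the Gini index (the helper names are my own, and the combined score is computed as a size-weighted impurity, so maximizing purity means minimizing this number):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure branch, larger when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted impurity of the two branches produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

clean_split = split_impurity(["a", "a", "a"], ["b", "b", "b"])
mixed_split = split_impurity(["a", "b", "a"], ["b", "a", "b"])
```

The greedy tree builder would pick the first split here, since it separates the two classes perfectly while the second leaves both branches mixed.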

Decision trees are very good at dealing with missing values (simply not using that value in learning and going down both paths in scoring). A decision model captured as a decision tree is also very comprehensible and explainable. However, a decision tree is relatively sensitive to noise and can easily overfit the data. Although the learning mechanism is easy to understand, a decision tree doesn’t perform very well in general and is rarely used in real systems. However, when decision trees are used together with ensemble methods, they become extraordinarily powerful as all these weaknesses disappear.


The idea of an ensemble is simple. Instead of learning one model, we learn multiple models and combine the estimates of the individual learners (e.g. we let them vote on categorical output and compute the average for numeric output).


There are two main models for creating different learners. One is called “bagging”, which basically draws samples (with replacement) from the training set and then has the same tree algorithm learn on the different sample data sets. The other is called “boosting”, which runs a sequence of iterations where samples are drawn from the training set based on a probability distribution in which the items wrongly predicted in the last round have a higher chance of being selected. In other words, the algorithm pays more attention to learning from wrongly classified examples.
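
Here is a toy Python sketch of the bagging variant. To keep it short, the learner is a one-feature threshold "stump" that I made up for this example rather than a real tree algorithm, and the data set is invented.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) samples from the training set with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Toy learner: threshold halfway between the two class means."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:           # the sample contained one class only
        only = 1 if ones else 0
        return lambda x: only
    mean0, mean1 = sum(zeros) / len(zeros), sum(ones) / len(ones)
    threshold = (mean0 + mean1) / 2
    high = 1 if mean1 > mean0 else 0    # class predicted above the threshold
    return lambda x: high if x > threshold else 1 - high

def bagging(data, n_models, seed=0):
    """Train each model on its own bootstrap sample of the same training set."""
    rng = random.Random(seed)
    return [train_stump(bootstrap(data, rng)) for _ in range(n_models)]

def vote(models, x):
    """Majority vote over the individual predictions."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
models = bagging(data, n_models=25)
low, high = vote(models, 1.5), vote(models, 8.5)
```

Boosting would differ in the sampling step: instead of drawing uniformly, each round would draw with probabilities weighted towards the examples the previous round got wrong.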


It turns out that the ensemble tree is the most popular method at this moment, as it achieves very good predictions across the board, is easy to understand and can be implemented in Map/Reduce. Google recently published a good paper on their PLANET project, which implements ensemble trees on Map/Reduce.

by Ricky Ho (noreply@blogger.com) at March 18, 2011 06:18 AM

March 07, 2011

Damien Katz

Is Node.js an application server?

An insightful comment on Reddit about Node.js:

And the idea that you fully understand your own code is a bit suspect, too. My code's all nice and fast until somebody passes me in a POST request with a million keys, or decides to upload a 10GB file where I was expecting a 5KB file and I run a hash algorithm over it, or I accidentally use way more memory than I expected and push the system into swap, or any number of other things like that. My life would be a lot easier if my code never did anything I didn't expect.

I keep wondering where Node fits in a production environment and who writes the code that powers it.

by Damien Katz at March 07, 2011 03:00 AM

March 04, 2011

Mark Headd

Building an Open311 Application with Node.js and CouchDB

Lots of work is being done to finalize the next version of the Open311 API spec (officially referred to as GeoReport V2).

Almost a year ago I launched TweetMy311 - a service that lets people report non-emergency service requests using a smart phone and Twitter. Since then, a lot has changed - not only with the Open311 specification but with the tools available to build powerful Twitter-based applications.
Node.js
In the last several months, I’ve spent a lot of time learning about and working with Node.js. Some of the things I did in the initial version of TweetMy311 (written in PHP) are so much easier to do in Node.js that I’ve decided to completely rewrite the application to use Node. In addition, since I initially launched TweetMy311, CouchDB (the NoSQL database on which the app is built) has also seen a lot of enhancements.

I’m expecting the overhaul I’m currently working on to make the application code a lot more efficient and easier to understand. Once this overhaul is complete, I intend to release a big chunk of it as open source software, so that anyone who wants to build a powerful Node.js/CouchDB-based civic app can do so.

It’s also exciting to see new cities get on board the Open311 bandwagon. The City of Boston is now supporting Open311 and has started to issue API keys to developers.

As part of my work to overhaul TweetMy311, I’ve developed a neat little Node.js library for interacting with the Open311 API. Since I just started to work with the Boston implementation, I thought it would be helpful to others interested in doing so to walk through a quick example.

If you want to run this example for yourself, you’ll need to have Node.js installed, specifically the latest version - v0.4.2. If you have the Node Package Manager installed, you can simply do:

npm install open311

Once you’ve done this, you should be able to run the following script:

Which will output:

This is just a quick example of how to make the most basic of API calls with the Node.js Open311 module. You can use this module to build fully featured Open311 applications.

I’ll be doing some more blogging in the weeks ahead as the rewrite of TweetMy311 continues, and work on this phase of the GeoReport V2 spec is concluded.

Stay tuned!


by Mark Headd at March 04, 2011 02:54 AM

March 03, 2011

Damien Katz

So You Wanna Learn About CouchDB?

CouchDB World Tour Coming! Along with Alan McKean, I and other Couchbase staff will be doing 5 training sessions in 5 different cities starting in March. I'll be teaching the San Francisco one :) We've developed some incredible material that I'm really excited to present. So go ahead and sign up, and bring a friend!

Sign up now

Claire was so inspired she wrote a song about it:

http://vimeo.com/20499717

by Damien Katz at March 03, 2011 01:43 AM

February 15, 2011

Till Klampäckel

node.js & socket.io fun

I recently had the extreme pleasure to use node.js and socket.io on a project. Here are some insights.

Objective

So the objective of the project was to read data from the _changes feed of our CouchDB cluster (hosted by Cloudant) and publish the data to a widget which we can use to display a constant stream of "what are people doing right now".

The core of the problem we faced was not just taking this stream of data and feeding it on to a page, but since we'll deploy this widget to our homepage we needed to make sure that no matter how many clients see it, the impact on the database cluster is minimal; that is, only a single client (or, down the road, up to three for failover) would actually read data from the cluster.

After shopping around for a technology to use, it became obvious that we needed some sort of abstraction because of how the different technologies (e.g. comet, websockets, ajax longpolling, ...) are implemented in different browsers. We decided to build this project on top of socket.io — pretty much for the same reasons most people go to jQuery, prototype or dojo these days.
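To make the single-reader requirement concrete, here is a minimal sketch of the fan-out idea. It is written in Python rather than the node.js and socket.io used in the project, and it is not the code from the post. It relies on the fact that a continuous _changes feed delivers one JSON object per line, with blank lines as heartbeats. The `Broadcaster` class, the callback-style subscribers, and the sample feed lines are all assumptions made up for illustration.

```python
import json
from typing import Callable, Iterable, Iterator, List


def parse_changes(lines: Iterable[str]) -> Iterator[dict]:
    """Yield change rows from a line-delimited _changes stream."""
    for line in lines:
        line = line.strip()
        if not line:            # blank lines are heartbeats; skip them
            continue
        row = json.loads(line)
        if "last_seq" in row:   # closing line of a feed that has ended
            break
        yield row


class Broadcaster:
    """Hypothetical hub: one upstream reader, any number of subscribers."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[dict], None]] = []

    def subscribe(self, callback: Callable[[dict], None]) -> None:
        self._subscribers.append(callback)

    def run(self, lines: Iterable[str]) -> int:
        """Read the feed once and hand every row to every subscriber."""
        delivered = 0
        for row in parse_changes(lines):
            for callback in self._subscribers:
                callback(row)
            delivered += 1
        return delivered


# Simulated feed lines (made up); a real reader would stream them over HTTP.
SAMPLE_FEED = [
    '{"seq":1,"id":"doc-a","changes":[{"rev":"1-aaa"}]}',
    '',
    '{"seq":2,"id":"doc-b","changes":[{"rev":"1-bbb"}]}',
    '{"last_seq":2}',
]

hub = Broadcaster()
seen_by_first: List[str] = []
seen_by_second: List[str] = []
hub.subscribe(lambda row: seen_by_first.append(row["id"]))
hub.subscribe(lambda row: seen_by_second.append(row["id"]))
delivered = hub.run(SAMPLE_FEED)
```

However many subscribers are attached, the feed itself is iterated only once, which is the property that keeps the load on the database cluster flat.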

... more after the jump.

by Till Klampaeckel (till@php.net) at February 15, 2011 10:47 PM

Socket.io & nodejs: at a medium pace

In my last blog entry, I shared some nodejs-code to read CouchDB's _changes feed and publish the data to a website. In order to update the page in a continuous fashion, I used socket.io, which provides a nifty abstraction across server- to client-side transports, for example websockets and ajax longpoll.

Full-throttle

When we tested the code for a few days over the weekend, the largest issue we ran into was that the stream moved too fast. In fact it moved so fast, we couldn't read anything and were at risk of getting a seizure when we watched the page for too long.

Certainly awesome from one point of view — people are using the website — but it also led to the next objective: I had to find a way to throttle broadcasting to the client. Here's how!
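One possible shape for such a throttle, sketched in Python rather than node.js and not taken from the post: buffer incoming messages and release at most one per interval. The `ThrottledQueue` name, the bounded buffer that drops the oldest entries, and the injectable clock are all assumptions for illustration.

```python
import time
from collections import deque


class ThrottledQueue:
    """Hypothetical throttle: release at most one message per interval."""

    def __init__(self, min_interval, max_pending=100, clock=time.monotonic):
        self.min_interval = min_interval
        # A bounded deque silently drops the oldest message when full,
        # so a fast feed cannot grow the buffer without limit.
        self.pending = deque(maxlen=max_pending)
        self.clock = clock
        self._last_sent = None

    def push(self, message):
        self.pending.append(message)

    def poll(self):
        """Return the next message if the interval has elapsed, else None."""
        if not self.pending:
            return None
        now = self.clock()
        if self._last_sent is not None and now - self._last_sent < self.min_interval:
            return None
        self._last_sent = now
        return self.pending.popleft()


# Drive the throttle with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
queue = ThrottledQueue(min_interval=2.0, clock=lambda: fake_now[0])
for name in ("a", "b", "c"):
    queue.push(name)

first = queue.poll()     # released immediately
blocked = queue.poll()   # too soon, nothing released
fake_now[0] = 2.0
second = queue.poll()    # interval elapsed, next message released
```

A broadcasting loop would call `poll()` on a timer and send whatever it returns to the connected clients.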

... more after the jump.

by Till Klampaeckel (till@php.net) at February 15, 2011 10:44 PM

February 09, 2011

Damien Katz

CouchOne + Membase = Couchbase

I've got some news I'm extremely excited to finally announce: a merger between CouchOne and Membase!

A little background: I met James Phillips, the co-founder of Membase, for the first time in December. I'd heard a little about Membase up to that point, but I was most impressed with some of their high profile users. For example, Membase is a key part of Zynga, where giving millions of users a fast, low latency experience is critical.

Membase has been targeting large scale mission critical apps, being able to scale out quickly and support millions of users, and getting impressive traction. They'd been going after a very specific pain point, a completely different part of the market than what we were targeting. They've focused on performance and scalability and exploiting all the power and memory available on modern servers. Simple, Fast, Elastic.

At CouchOne we've been focusing on very different problems: mobile, sync and offline use cases. We make it easy to build applications that travel with you, allowing you access to your important data no matter the network conditions. Slow and unreliable connectivity means many businesses can't rely on the cloud for mission critical apps: all their data is gone when their network is down. But with Couch powered apps on your phone or tablet, putting data directly on the machines at the edge of the network, you have your apps and data with you at all times and safely backed up to the cloud.

Couchbase!

What James had was the vision to see the great fit between the two companies. While independently we were both doing very well, we both have a lot of growing to do yet. And amazingly, in the direction Membase needed to grow, we were already doing very well. And in the direction we needed to grow, Membase was already doing very well. Not only were the parts of the stack we were focusing on different and complementary, but the way we built out our teams was different and complementary. I'm not sure we could have planned it any better, and we didn't plan it at all!

And so I'm thrilled to announce Couchbase, a merging of both our companies and our technology!

Technologically, we'll be joining the products together to create a high volume, low latency, elastic clustered Couchbase server system. A Couch that's Simple, Fast, Elastic with all the reliability and power of CouchDB. We'll also continue to support the Membase API, for both backwards compatibility and its performance advantages over HTTP. We will be the only solution out there that can scale to Zynga sized workloads and down to phones and tablets and everything in between, supporting millions of users and keeping everything in sync.

For existing CouchDB users, we will fully support CouchDB's HTTP API with all its associated benefits: seamless integration with other HTTP based infrastructure, a universally supported, human-readable protocol and direct-browser access just to name a few.

Together as Couchbase, we'll have the fastest, most scalable (both scale up and scale down) NoSQL solution. We will become the standard storage for mobile devices, and the standard server technology for syncing them all together. Our unified solution will dramatically simplify your technology stack and maintenance for building fast responsive apps that scale to millions of users, and also scaling down to phones so people can work and play even when not connected to the network.

My role at Couchbase will be CTO, overseeing the technical direction of the company. Dustin Sallings will be the chief architect. Bob Wiederhold will be CEO and co-founder James Phillips will continue to be product-oriented maniac :) CouchOne co-founders Chris Anderson and Jan Lehnardt will take roles to lead our mobile efforts and to work with our developers and community.

What's in it for you?

It's all upside! In the short term we'll be able to provide a much better developer and support experience for both CouchOne and Membase technologies, and move the development speed ahead much faster. The long term benefits are that CouchDB users will acquire the high performance, high scale easy-fast-elastic capabilities of Membase, while Membase users will acquire CouchDB's indexing features (map/reduce views, lucene, R-Tree GeoCouch), replication, reliability, and an easy path to mobile.

This is hot stuff! 2011 is the year of Couchbase!

by Damien Katz at February 09, 2011 06:09 PM

January 26, 2011

CouchOne

Putting Apps on the Web

Chris Anderson here.

The Web has been synonymous with HTML for too long. But its principles go deeper than arguments over W3C standards or backwards compatible CSS hacks. At the core of the web is the simple concept of linked data. I will argue that the Web, properly understood as linked data, the applications that use it, and the users that interface with it, can be much richer than just the presentation through HTML, and that extending the reach of the web beyond the confines of the browser is crucial to its long term success.

The surge of smartphone ‘apps’ has pundits and experts alike forecasting the demise of the web. This is the wrong way to think about it. Instead of concentrating on how apps are killing the web, let’s think about how apps can embrace the principles of the Web. Apps could potentially become vibrant participants of the Web, while still retaining their slick user interfaces and proprietary business models.

The web gave us write once run anywhere, a computing holy grail. But apps are platform-specific by design. Putting data on the web (giving it URLs) makes it “linked data”. The huge advantage to such an approach is interoperability at the data layer. Recently, #newtwitter showed us how you can build your web app on top of the same API as your native clients, building almost no new HTML on the server. This wave is only just beginning.

The Web Is Dead?

This past summer, in 2010, Wired Magazine pronounced the web dead.

Cover: WIRED, The Web is Dead

Here is the tagline that ran along with the article:

Two decades after its birth, the World Wide Web is in decline, as simpler, sleeker services — think apps — are less about the searching and more about the getting. Chris Anderson explains how this new paradigm reflects the inevitable course of capitalism. And Michael Wolff explains why the new breed of media titan is forsaking the Web for more promising (and profitable) pastures.

Of course we’ll take Wired Magazine’s pronouncements in the reverse oracular spirit they’ve earned for a previous obituary, 1997’s “Apple is Dead”:

Pray

That year turned out to be a turning point for Apple. Steve came back and everything. By my lights, Wired’s “The Web Is Dead” cover is as ringing an endorsement as the Web could hope for.

Apps are threatening the web, with their quick path to revenue (even if they rarely become big hits.) The threat exists because the insides of apps often don’t interact with the web via hyperlinks, or if they do, it’s in an ad-hoc and limited way. This isn’t a bug, it’s just the way apps are — slickness matters in a big way, so developers optimize for user experience, not ‘webiness’.

Is this a turning point for the web, where it will continue as the dominant platform, or will it be replaced by apps? I think the reality is that the Web will win by remaining the backbone, the conduit on which the data is shared and exchanged. Even for the slickest native apps. But before I explain my position, let’s look at the strengths of the Web, and how different the app world is.

Long Live the Web

It is clear that in the possibility space of coding, one goal stood on the horizon for a long time: write once, run anywhere. It’s not always been clear whether it’s attainable, or just a mirage. It’s more of a social question (one of conventions) than a technical one.

Eventually, 21 years ago, the web emerged as the first successful consumer application platform to be independently implemented multiple times, across a wide range of hardware and software environments. This aspect is often overshadowed by the more fundamental changes to our social fabric as the Internet makes us all more connected.

Write once run anywhere has been regarded as a holy-grail since even before Sun’s foray into cross platform application widgets (Java) was unable to take hold as a standard. Similarly Adobe Flash and Microsoft Silverlight have attacked the HTML juggernaut. But in the fullness of time, those efforts look futile when held against the 3D accelerated, mobile, HTML5 web. How does the web do that?

The web is able to subsume the host operating system, as well as invaders (Java and Flash) to become the dominant user interface metaphor of our time, by adhering to a principled stance about openness, one that Tim Berners-Lee can describe better than I can.

The key to Berners-Lee’s argument is that the web enables applications to be brought online, and without any formal coordination, begin to interoperate with other applications on the web, because the constraints of hyperlinking and HTML are defined narrowly enough to allow for consensus. Rather than have a million features, it is better to have one killer feature. For the web, that feature is linking.

Godzilla the App Store

The web was comfortable in its dominance, maybe even starting to stagnate as a platform, when its first new competitor came along — the app store. Such a different way of thinking about software! With the web, anyone can put a new site up, whenever they want. In the app store, plan to wait a week or two before you can show your work to anyone.

But the payoff - literally! With the web your best bet was to put up some ads and hope for massive traffic, or else burden your users with a paywall and suffer the reduced engagement and lack of in-links. In the App Store you can make money from day one, even without massive traffic. It’s no surprise the app ecosystem is growing fast.

The new app sensation, Instagram, is quickly subsuming Facebook’s most powerful feature, photo sharing. And they don’t even have a way to browse your own photos via the web. Don’t get me wrong, I think Instagram is doing everything right, it’s just not very webby — because that’s the way apps are.

What matters in an app first and foremost is that people use it, and encourage their friends to use it. Instagram obviously put a great effort into their onboarding, because setting up my account was seamless and quick. This is development effort well spent, not thinking about linkability. Which is sad, because the message of the Web is so powerful: Build your app into the Web, and it can be linked to and extended beyond the confines of any one system.

Of course, real apps do understand this, and Instagram does share photos by a single-photo url, distributed via other social systems. It pays to be on the web, but Instagram at least has found that the user experience of the Web was no longer the place to invest, when growing the user base.

As a pragmatic matter, it makes sense to focus on slick interfaces and viral adoption, over an interest in broad web-style interoperability. But I think there’s another way.

The Web is more than HTML

Those of us who were aware of the web in the early 90’s knew the web had won when URLs joined the popular parlance. Once everything has a link, people can start to put the web together — by discussing and linking to pages, we build the meaning among the links. The act of linking is more important than the details of HTML. Of course, getting HTML right gives a baseline of interoperability.

Think about #newtwitter again - an (almost) all Ajax application that takes advantage of the same APIs Twitter provides for desktop and mobile clients. The link between the native and the desktop applications is made up of innumerable URLs shared via Web protocols. In Twitter’s case, the URLs are mostly hosted at sites like twitter.com and t.co. Twitter is centralized by design - we’ve seen this mostly-Ajax pattern applied to several prominent websites.

The major criticism that still applies to these Ajax heavy apps, is that because they depend on remote resources, they are inherently slow and unreliable, compared to native apps that don’t require remote servers.

Giving reliable low-latency URLs to native and local web applications

There is a way to have the best of both worlds, by moving the URLs to localhost and asynchronously sharing changes with best-effort as the network connection allows.
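As a rough illustration of that idea, the following Python sketch keeps every read and write local and pushes changes to a remote on a best-effort basis. It is a deliberately simplified stand-in, not CouchDB's replication protocol (which tracks revisions and handles conflicts); the `LocalFirstStore` class, the `send` callback, and the simulated network are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class LocalFirstStore:
    """Hypothetical store: local reads and writes, best-effort sharing."""

    def __init__(self, send: Callable[[str, dict], None]) -> None:
        self.docs: Dict[str, dict] = {}
        self.outbox: List[Tuple[str, dict]] = []   # changes not yet shared
        self._send = send

    def put(self, doc_id: str, doc: dict) -> None:
        # The write succeeds locally whether or not the network is up.
        self.docs[doc_id] = doc
        self.outbox.append((doc_id, doc))

    def get(self, doc_id: str) -> dict:
        return self.docs[doc_id]

    def sync(self) -> int:
        """Push queued changes in order; stop at the first failure."""
        sent = 0
        while self.outbox:
            doc_id, doc = self.outbox[0]
            try:
                self._send(doc_id, doc)
            except ConnectionError:
                break                    # keep the rest for the next attempt
            self.outbox.pop(0)
            sent += 1
        return sent


# Simulated remote and network state, for illustration only.
remote: Dict[str, dict] = {}
online = [False]


def flaky_send(doc_id: str, doc: dict) -> None:
    if not online[0]:
        raise ConnectionError("network is down")
    remote[doc_id] = doc


store = LocalFirstStore(flaky_send)
store.put("note-1", {"text": "written offline"})
sent_offline = store.sync()    # nothing leaves the device yet
online[0] = True
sent_online = store.sync()     # queued change is shared once connected
```

The point of the design is that `put` and `get` never wait on the network; only `sync` does, and it can fail without losing anything.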

My team at CouchOne is putting apps on the web, by building web addressable data into the core of our App Store compatible platform. Our mission is to ensure you have the data you need, no matter the network conditions. We’re happy that developers can make money in the app store, but we know that users will be happier when their data isn’t locked into silos. Everyone should know by now that snappiness is the most important feature any user experience can have.

We’re confident that the benefits you accrue by moving your web server from a high-latency remote server to the mobile device will mean a real competitive advantage for early adopters, and eventually a wholesale movement of the web towards client-based eventually consistent applications.

Putting Apps on The Web

The web has traditionally been known as the HTML platform as deployed mostly in browsers. It really has become the #1 way code and functionality gets in front of users. The web won because of its simplicity. But its simplicity is holding it back in the app world.

By giving URLs to data you can have the best of both apps and the web. Because the data is shared between the two interfaces so intimately, and presumably with a simple REST interface to support the web client, we can properly say these linked-data apps are on the web. The key to why #newtwitter feels more advanced than other sites with heavy APIs is that it supports the web client from the public API. This proves that the app API is really Web stuff, and we’ll see a lot more of this in the future.

For instance you might have a PhoneGap style cross-platform HTML5 mobile application that suddenly becomes popular among the business crowd, making a native Blackberry UI into a worthwhile investment. The local linked-data layer that CouchOne provides gives distinct implementations a way to communicate with a common data substrate. With real-time replication, the Blackberry users can collaborate with the web users, on the same app at the same time. The app is on the web, even if it is slick and low latency.

The web won because it allowed linking to anyone, so adding new pages to the web doesn’t require coordination beforehand. In the future, the web will allow sharing of data with the people you choose, using something much like the replication built into Apache CouchDB. Our aim isn’t merely to add to the web developer’s toolkit, it’s to fundamentally rebalance the web in favor of the edge.

As a developer, you’ll be able to take the existing data stored by an existing application, and write your own interface for it. Your favorite social network doesn’t have a native UI for your phone’s operating system, but they do offer a Couch feed? Just build your own UI. You’ll still be able to take advantage of real time interaction with users running against the same data on different platforms.

We strive to offer a relaxing new way to share state changes across computing devices, the web, and mobile phones. If you’re interested in combining the web and app worlds, check out Dale’s CouchDB on Android tutorial (coming soon) and the CouchApp community wiki.

January 26, 2011 03:00 PM

January 21, 2011

Mark Headd

Experiments in Open Data: Baltimore Edition

A lot of my open gov energy of late has been focused on replicating a technique pioneered by Max Ogden (creator of PDXAPI) to convert geographic information in shapefile format into an easy to use format for developers.

Specifically, Max has pioneered a technique for converting shapefiles into documents in an instance of GeoCouch (the geographically enabled version of CouchDB).
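To give a feel for what "shapefiles into documents" can mean, here is a minimal Python sketch. It is not Max's actual tooling. It assumes the shapefile has already been converted to GeoJSON-style features by some GIS tool, and it assumes a document layout with a top-level `geometry` field; the `feature_to_doc` and `bulk_docs_body` helpers, the `source` field, and the sample feature (with made-up placeholder coordinates) are all hypothetical. The resulting body has the `{"docs": [...]}` shape that CouchDB's `_bulk_docs` endpoint accepts.

```python
import json


def feature_to_doc(feature: dict, source: str) -> dict:
    """Flatten one GeoJSON-style feature into a CouchDB-style document."""
    doc = dict(feature.get("properties", {}))   # attributes become fields
    doc["geometry"] = feature["geometry"]       # assumed field name
    doc["source"] = source                      # hypothetical provenance field
    return doc


def bulk_docs_body(features: list, source: str) -> str:
    """Build a JSON body in the shape expected by POST /{db}/_bulk_docs."""
    return json.dumps({"docs": [feature_to_doc(f, source) for f in features]})


# Made-up sample feature, for illustration only.
SAMPLE_FEATURES = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-76.61, 39.29]},
        "properties": {"name": "Example Library"},
    },
]

body = json.loads(bulk_docs_body(SAMPLE_FEATURES, "example.shp"))
```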

I was thrilled recently to come across some data for the City of Baltimore and since I know there are some open government developments in the works there, I decided to put together a quick screencast showing how open data - when provided in an easily used format - can form the basis for some pretty useful civic applications.

The screencast below walks through a quick demonstration of an application I wrote in PHP to run on the Tropo platform - it currently supports SMS, IM and Twitter use.

Just send an address in the City of Baltimore to one of the following user accounts along with a hashtag for the type of location you are looking for:

  • SMS: (410) 205-4503
  • Jabber / Gtalk: bmorelocal@tropo.im
  • Twitter: @baltimoreAPI

This demo application interacts with a GeoCouch instance I have running in Amazon EC2 - you can take a look at the data I populated it with by going to baltapi.com and accessing the standard CouchDB user interface. I haven’t really locked this instance down all that tight, but there really isn’t anything in it that I can’t replace.


Locate places in Baltimore via SMS

Besides, one of the nice things about this technique is how easy it is to convert data from shapefile format and populate a GeoCouch instance. Hopefully others with GIS datasets will look at this approach as a viable one for providing data to developers. (If anyone has some shapefiles for the City of Baltimore and you want to share them, let me know and I’ll load them into baltapi.com.)

There are a number of people in Baltimore pushing for an open data program from their city government, and I have heard that there are some really cool things in the pipeline. I can’t wait to see how things develop there, and I want to do anything I can to help.

Hopefully, this simple demo will be useful in illustrating both the ease with which data can be shared with developers and the potential benefit that applications built on top of open data can hold for municipalities.

UPDATE (4/18/2011): I’ve actually replicated all of the Baltimore data from the EC2 instance discussed in this blog post to the new Iris Couch instance. Iris Couch is by far the easiest way to get started using CouchDB, and Couch’s replication feature makes it easy to move data into an Iris Couch instance.
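For readers who have not used replication before, the following Python sketch builds the kind of request involved: a POST to CouchDB's `_replicate` endpoint with a JSON body naming a source and a target. The `build_replication_request` helper, the database name, and the URLs are placeholders, not the actual instances mentioned in this post, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_replication_request(couch_url, source, target, continuous=False):
    """Build (but do not send) a POST request for CouchDB's _replicate."""
    body = {"source": source, "target": target}
    if continuous:
        body["continuous"] = True   # keep replicating as new changes arrive
    return urllib.request.Request(
        url=couch_url.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder names and URLs; substitute your own source and target.
request = build_replication_request(
    "http://localhost:5984",
    source="baltimore",
    target="http://example.invalid/baltimore",
)
payload = json.loads(request.data)
# urllib.request.urlopen(request) would send it to a running CouchDB.
```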


by Mark Headd at January 21, 2011 09:46 PM

January 19, 2011

CouchOne

Hosting Bulletin

Today, one of our CouchDB hosting customers suffered a security issue. During the build up for a product launch, the customer’s developer posted their open source software to the public web. Unfortunately, the source code contained the database administrator credentials. This was an unfortunate error — in this industry the mantra is to hustle and to deliver at all costs. But then a simple oversight anyone could make leads to calamity — we know the feeling and sympathize.

In any event, despite our service’s “beta” status, we are working with our customer to restore their data and reinstate their service. Fortunately, the restoration is greatly simplified by CouchDB’s features.

We value all of our users’ security and as such we strongly suggest that password information never be revealed in publicly accessible source code to avoid this kind of situation in the future. Considering today’s distributed source code systems, the best policy is never to commit authentication credentials at all. We’d like to stress again, that the security incident is not caused by any vulnerability in Apache CouchDB or our hosting infrastructure.
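One common way to follow this advice is to read credentials from the environment at startup, so they never appear in the repository. The Python sketch below is only an illustration of that pattern; the variable names `COUCH_USER` and `COUCH_PASSWORD` and the `load_couch_credentials` helper are assumptions, not something our hosting service prescribes.

```python
import os


class MissingCredentials(RuntimeError):
    """Raised when a required credential is not set in the environment."""


def load_couch_credentials(environ=os.environ):
    """Return (user, password) from the environment, never from source."""
    try:
        return environ["COUCH_USER"], environ["COUCH_PASSWORD"]
    except KeyError as missing:
        # Fail loudly at startup instead of falling back to a hardcoded value.
        raise MissingCredentials(
            f"set {missing.args[0]} in the environment"
        ) from None


# Example with an explicit mapping standing in for the real environment.
credentials = load_couch_credentials(
    {"COUCH_USER": "admin", "COUCH_PASSWORD": "example-only"}
)

try:
    load_couch_credentials({"COUCH_USER": "admin"})
    missing_detected = False
except MissingCredentials:
    missing_detected = True
```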

If you have any questions about this or our hosting service in general, do not hesitate to get in touch: hosting@couchone.com

— Jason, VP of Hosting

January 19, 2011 10:52 PM

January 14, 2011

Upstream

Euruko 2011 call for papers

It's very exciting to organize a conference, especially if it is the well-known EuRuKo. The preparations for this year's EuRuKo in Berlin are in full effect. We just published the call for papers and I hope we get a good number of interesting talk proposals from you by the deadline on 22nd February. So stretch your fingers and send in a proposal

…[if] you have researched something about Ruby, developed a gem, found a unique usage for Ruby or you had a life changing experience with Ruby.

For details see the official post. We will choose wisely from the submissions so that you’ll never feel the urge to leave the track on this single track conference :) .

by Thilo Utke at January 14, 2011 03:30 PM

January 12, 2011

John Wood

CouchDB Plugins for Scout

Back in December I whipped up a series of CouchDB plugins for the Scout monitoring service. The plugins allow you to track all sorts of metrics for CouchDB, including (but not limited to):

  • Mean reads / second
  • Mean writes / second
  • Mean requests / second for DELETE, GET, HEAD, POST, and PUT requests
  • Mean view requests / second
  • Mean bulk HTTP requests / second
  • Counts for various HTTP response codes

In addition, there is a plugin for individual CouchDB databases and individual couchdb-lucene indexes. The database plugin will report:

  • Database size
  • Number of documents
  • Number of deleted documents
  • Number of update operations
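As a rough idea of where such numbers can come from, the sketch below maps the database information document that CouchDB returns for `GET /{db}` onto the four metrics above. It is written in Python and is not the plugin's own code; the mapping, in particular treating `update_seq` as the count of update operations, is an assumption, and the sample response values are made up.

```python
import json


def database_metrics(info: dict) -> dict:
    """Map a CouchDB database info document onto the four metrics."""
    return {
        "database_size_bytes": info["disk_size"],
        "documents": info["doc_count"],
        "deleted_documents": info["doc_del_count"],
        "update_operations": info["update_seq"],   # assumed mapping
    }


# Made-up sample of a database info response, for illustration only.
SAMPLE_INFO = json.loads(
    '{"db_name": "example", "doc_count": 120, "doc_del_count": 3,'
    ' "update_seq": 150, "disk_size": 409600}'
)

metrics = database_metrics(SAMPLE_INFO)
```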

The couchdb-lucene plugin will report:

  • Size of the index
  • Number of documents indexed
  • Number of deleted documents

The kind folks over at Scout have just released two new, official plugins based on the ones I created. The CouchDB Overall plugin combines some of the more important CouchDB metrics into a single plugin, and the CouchDB Database plugin reports the same set of stats as the database plugin listed above.

The original plugins can be found at https://github.com/signal/scout-plugins/tree/master/couchdb. More information can be found here. I hope you find them useful.

Thanks to Doug Barth for some help on the plugins, and Derek over at Scout for putting together the official plugins.

by John Wood at January 12, 2011 08:09 PM

January 11, 2011

Henri Bergius

December 22, 2010

Upstream

JavaScript under scrutiny at Plat_forms 2011

… we consider their participation a glimpse of the future …

With these words our team (Alex, Frank and me) got accepted with JavaScript into the web development platform comparison “Plat_forms 2011”. Big words :)

Plat_forms is a contest in which teams of three programmers compete to implement the same requirements for a web-based system within two days, using different technology platforms. It will be held in Nürnberg from 18th to 19th January.

The purpose of the Plat_forms is to provide new insights into the real pros, cons and emergent properties of each platform by analyzing various aspects like usability, structure, performance, scalability etc.

Sadly we were the only applicant for JavaScript, so our results for the JavaScript platform will be treated noncompetitively in the evaluation. The other participating platforms are our beloved Ruby, PHP, Perl and good old Java.

Although we use Ruby as our main language we are getting more and more comfortable with JavaScript. Just recently JavaScript got a huge boost with faster VMs, server-side execution, powerful libraries and more possibilities on the client side, commonly referred to as HTML5. So our usage of JavaScript increased over time and now ranges from rich client interfaces through special server-side tasks to small JavaScript-only apps.

With our participation with JavaScript at Plat_forms 2011 we want to push our skills and boundaries further. We are excited and anticipate the insights that our participation with JavaScript will reveal.

by Thilo Utke at December 22, 2010 04:33 PM