Reductio

Crossfilter has played a pretty big part in my work for the last couple of years. For those who don't know, Crossfilter is a Javascript library that "supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records" in the web browser. It is designed for use in visualizations and applications that allow interaction with and filtering of multiple dimensions of a single data set.

I've found Crossfilter to be a very powerful tool for building applications that support data interactions that feel 'real-time'. The idea is that <100ms interactions are fast enough to allow us to identify patterns and correlations as the visualization changes in exact coordination with our filtering. I use Crossfilter in several places, but the biggest project is easily Palladio, which is a testbed platform for research activities and a powerful example of the possibilities of browser-based data exploration tools.

Faceted browsing in Palladio - driven by Crossfilter and Reductio

Faceted browsing in Palladio - driven by Crossfilter and Reductio

Crossfilter consistently trips up people trying to use it for the first time because, I believe, its programming model for aggregations is inconsistent with the model we usually expect. It's for this reason that I started Reductio, a library that helps build up aggregation functions that work correctly and efficiently with Crossfilter's model.

I'm not going to get into all of the details here, but defining aggregations in Crossfilter requires defining functions that incrementally update the aggregation based on the addition or removal of records. This is at odds with the standard way of computing aggregations by building them up from scratch that we see in SQL or more standard map-reduce models.*

// A function that can be passed to Array.reduce to sum the array
function(a, b) {
  return a + b;
}
// The equivalent operation in Crossfilter requires 3 functions
function reduceAdd(p, v) {
  return p + v.total;
}

function reduceRemove(p, v) {
  return p - v.total;
}

function reduceInitial() {
  return 0;
}

It's these aggregation functions that consistently trip people up. The summation case is simple enough because summation is an operation that is computationally equivalent given the running total and the value to be added or removed. But what about operations that are not reversible? Take the computation of a maximum. When adding a new record to a maximum aggregation, we just need to check if the value of the record is larger than the current largest value we've seen (the current maximum).

// A function that can be passed to Array.reduce to return the max
function (a, b) {
  return Math.max(a, b);
}

But if we remove a record with a value equal to the current maximum, we need to know the next smallest value. And if we remove the record with that value we need to know the next smallest, and so on. We have to keep track of all values seen in order to avoid needing to rebuild the entire aggregation from scratch when a record is removed. And yes, this is significantly faster than rebuilding from scratch!

// The equivalent operation in Crossfilter (optimized, with
// error guards). Adapted from Reductio
var bisect = crossfilter.bisect.by(function(d) { return d; }).left;

function reduceAdd(p, v) {
    i = bisect(p.values, v.number, 0, p.values.length);
    p.values.splice(i, 0, v.number);
    p.max = p.values[p.values.length - 1];
    return p;
}

function reduceRemove(p, v) {
    i = bisect(p.values, v.number, 0, p.values.length);
    p.values.splice(i, 1);
    
    // Check for undefined.
    if(p.values.length === 0) {
        p.max = undefined;
        return p;
    }
 
    p.max = p.values[p.values.length - 1];
    return p;
}

function reduceInitial() {
  return { values: [], max: 0 };
}

It's a lot of work to come up with these aggregation functions, make them efficient, and test them thoroughly, and Crossfilter beginners (myself included) consistently struggle with this. Believe it or not, things get even nastier when dealing with thing like pivot- or exception-aggregations, when building aggregations where a record counts in more than one aggregation group (like moving averages, or tags), or when trying to compose multiple aggregations (commonly required for visualizations as simple as a scatter-plot). Reductio supports all of these aggregations and a few more, and I hope it will allow a quicker, smoother start when using Crossfilter.


* This difference in model is to allow very fast interactions with data and aggregations. If you aren't building an application that requires real-time interactions and are thinking about using Crossfilter simply to do one-time calculations or aggregations on your data, think again about using Crossfilter as it is probably not the right tool for the job.

Elastic lists using Protovis

I've been seeing more and more list-based visualizations used for data selection showing up in BI software. These types of selection interfaces are especially prominent in Qlikview and SAP BusinessObjects Explorer (which you can try on theweb).

Ever since seeing Moritz Stefaner's implementation of Elastic Lists, I've been a bit dissatisfied with the implementations in enterprise BI tools, including the ones listed above. "Elastic" lists leverage the list format to visualize characteristics of the data by tying the size of the bar representing a column value in the selection list to a metadata metric - in this case the probability that a given column value will occur in a dataset.

In order to help myself understand the strengths and weaknesses of this type of visualization more thoroughly, I started to experiment with list-based visualizations in Protovis (a Javascript-based visualization library using SVG for rendering). Eventually, I added in elasticity and gave the list selection the power to drive a second visualization. It uses the cars dataset and visualization from the Protovis examples to demonstrate driving a second visualization with the list selection. (Note: the coordinates on the second visualization are reversed for reasons that I haven't looked into at the moment.)

That experiment is now working well enough that I thought I'd publish it so that others can comment, use the code (but really, it's a bit of a mess, so be wary), and experiment with the concept. If you want to add some capability, go right ahead and fork the project on Github.

For my part, I will likely do a more thorough analysis of list-based visualization in BI tools eventually, but for now I think I can safely say that anywhere a list appears, there is little excuse for lack of "elasticity" in the visualization.

Note: This visualization will only work in browsers that support the SVG standard. It does not work in IE6, 7, or 8. Pretty much any other browser (Firefox, Chrome, Safari, etc.) should work fine.

You can view a static image of the visualization below.