Crossfilter has played a pretty big part in my work for the last couple of years. For those who don't know, Crossfilter is a Javascript library that "supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records" in the web browser. It is designed for use in visualizations and applications that allow interaction with and filtering of multiple dimensions of a single data set.

I've found Crossfilter to be a very powerful tool for building applications that support data interactions that feel 'real-time'. The idea is that <100ms interactions are fast enough to allow us to identify patterns and correlations as the visualization changes in exact coordination with our filtering. I use Crossfilter in several places, but the biggest project is easily Palladio, which is a testbed platform for research activities and a powerful example of the possibilities of browser-based data exploration tools.

Faceted browsing in Palladio - driven by Crossfilter and Reductio

Crossfilter consistently trips up people trying to use it for the first time because, I believe, its programming model for aggregations is inconsistent with the model we usually expect. It's for this reason that I started Reductio, a library that helps build up aggregation functions that work correctly and efficiently with Crossfilter's model.

I'm not going to get into all of the details here, but defining aggregations in Crossfilter requires defining functions that incrementally update the aggregation based on the addition or removal of records. This is at odds with the standard way of computing aggregations by building them up from scratch that we see in SQL or more standard map-reduce models.*

// A function that can be passed to Array.reduce to sum the array
function(a, b) {
  return a + b;
}

// The equivalent operation in Crossfilter requires 3 functions.
// Here each record v is assumed to have a numeric 'value' field.
function reduceAdd(p, v) {
  return p + v.value;
}

function reduceRemove(p, v) {
  return p - v.value;
}

function reduceInitial() {
  return 0;
}

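For context, here is roughly how these three functions are attached to a Crossfilter group (a minimal sketch; the records array and its 'category' and 'value' fields are hypothetical):

// A minimal wiring sketch, assuming records like { category: 'a', value: 10 }.
var cf = crossfilter(records);
var categoryDim = cf.dimension(function(d) { return d.category; });
var sumByCategory = categoryDim.group().reduce(reduceAdd, reduceRemove, reduceInitial);

// Each group's value is the sum of 'value' for the records currently in that
// category, and Crossfilter keeps it up to date as filters change.
console.log(sumByCategory.all());
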
It's these aggregation functions that consistently trip people up. The summation case is simple enough because addition is reversible: given the running total, we can incorporate or undo a record's contribution with a single addition or subtraction. But what about operations that are not reversible? Take the computation of a maximum. When adding a new record to a maximum aggregation, we just need to check whether the value of the record is larger than the current largest value we've seen (the current maximum).

// A function that can be passed to Array.reduce to return the max
function(a, b) {
  return Math.max(a, b);
}

But if we remove a record with a value equal to the current maximum, we need to know the next smallest value. And if we remove the record with that value we need to know the next smallest, and so on. We have to keep track of all values seen in order to avoid needing to rebuild the entire aggregation from scratch when a record is removed. And yes, this is significantly faster than rebuilding from scratch!

// The equivalent operation in Crossfilter (optimized, with
// error guards). Adapted from Reductio.
// Crossfilter's bisector gives us binary search over the sorted value list.
var bisect = crossfilter.bisect.by(function(d) { return d; }).left;

function reduceAdd(p, v) {
    // Insert the new value into the sorted list and update the max.
    var i = bisect(p.values, v.number, 0, p.values.length);
    p.values.splice(i, 0, v.number);
    p.max = p.values[p.values.length - 1];
    return p;
}

function reduceRemove(p, v) {
    // Remove the value from the sorted list.
    var i = bisect(p.values, v.number, 0, p.values.length);
    p.values.splice(i, 1);
    // Check for undefined.
    if(p.values.length === 0) {
        p.max = undefined;
        return p;
    }
    p.max = p.values[p.values.length - 1];
    return p;
}

function reduceInitial() {
    return { values: [], max: 0 };
}

It's a lot of work to come up with these aggregation functions, make them efficient, and test them thoroughly, and Crossfilter beginners (myself included) consistently struggle with this. Believe it or not, things get even nastier when dealing with things like pivot- or exception-aggregations, when building aggregations where a record counts in more than one aggregation group (like moving averages, or tags), or when trying to compose multiple aggregations (commonly required for visualizations as simple as a scatter-plot). Reductio supports all of these aggregations and a few more, and I hope it will allow a quicker, smoother start when using Crossfilter.
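
As an illustration, here is roughly what the maximum aggregation above looks like when Reductio builds the reduce functions for you (a minimal sketch based on Reductio's documented API; 'group' is assumed to be an existing Crossfilter group and 'number' matches the field used above):

// Attach a max aggregation to an existing Crossfilter group.
var reducer = reductio().max(function(d) { return d.number; });
reducer(group);

// Each group value now carries a 'max' property that Reductio keeps
// current as records are filtered in and out.
group.all().forEach(function(g) { console.log(g.key, g.value.max); });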

* This difference in model is to allow very fast interactions with data and aggregations. If you aren't building an application that requires real-time interactions and are thinking about using Crossfilter simply to do one-time calculations or aggregations on your data, think again about using Crossfilter as it is probably not the right tool for the job.

Skating to where the puck is

Being an SAP Mentor is an interesting situation. The relationship with SAP usually feels amazingly collaborative, but occasionally uncomfortably adversarial. This is probably a healthy place for a relationship with a massive organization to be. Every organization includes many great people, and interactions with those people are always delightful. Every organization is also, as an organization, predisposed to get the answers to certain questions consistently right and the answers to other sorts of questions consistently wrong.

One such question that SAP has consistently gotten wrong in the past is how to push technology adoption among a massive install base. Historically, it would appear that SAP has tended to try to monetize upgrades to underlying technology in order to recoup development costs or profit from its technology offerings. Netweaver, BW, Visual Composer, add-ons like SSO and Gateway, Personas, Business Client, and HANA* are all technologies in which SAP has either underinvested due to lack of direct revenue potential (BW especially) or has tried to monetize. The result is that widely deployed "free" technology from SAP often stagnates while advanced technology with an additional cost often sees limited adoption in SAP's install base.

This outcome is terrible for SAP's business. It makes it very difficult for SAP to keep its application offerings competitive in the marketplace. The reason is that the application functionality needed to stay competitive often depends on improvements in the underlying technology. But if technology is either widely deployed and under-featured, or decently featured but not widely deployed, then applications must target only the under-featured technology stack in order to have a large enough potential install base to justify development costs. In other words, SAP often forces itself to build applications on sub-par technology.

We see this dynamic constantly with SAP. One example is UI technology. Clearly a long-standing weakness of SAP's applications has been user-interface and user-experience. The SAP GUI is outdated, and SAP's attempts to improve on it, like the Business Client, have, in my opinion, been torpedoed by SAP's monetization reflex. Initially they weren't enough of an improvement to see widespread adoption under a monetization scheme, and with the lack of adoption, investment faded. A similar dynamic played out with SAP's support for Flash and Silverlight as UI technologies for years after it had become clear they were destined for the trash-heap of web-UI delivery technologies.

And yet, over the last 4 years, SAP has been able to overcome this tendency in one area that may be incredibly important to its long-term business prospects around the newly announced S/4HANA. In the case of a key succession of technologies, SAP has been able to do something different, with impressive results. Those technologies: Gateway, SAP UI5, and Fiori**.

Initially, all seemed to be going well. SAP developed Gateway in part due to prodding from influencers like SAP Mentors around the need for a modern, RESTful API layer in SAP applications to allow the development of custom user interfaces and add-ons in a sustainable manner. DJ Adams and others showed the way with projects like the Alternative Dispatch Layer (in 2009), to make development of these APIs easier. Uwe Fetzer taught the ABAP stack to speak JSON. And suddenly one could fairly easily create modern-looking APIs on SAP systems. SAP folded a lot of those learnings into Gateway, which was a great way to push the tools for building modern APIs into SAP's applications and out to the install-base.

Well, it would have been, but SAP made its usual mistake: it attempted to monetize Gateway.

The result of the monetization attempt could be predicted well enough. It would make roll-out of Gateway to the SAP install base slow, at best. Fiori would be delayed because it would need to build out its own API layer rather than relying on Gateway technology. Applications like Simple Finance or S/4HANA that are dependent on Fiori would subsequently be delayed, if they were created at all. Perhaps Fiori's roll-out is delayed by a year, Simple Finance by 2 years, and S/4HANA isn't announced until 2018.

But S/4HANA was announced in January 2015. So what went differently this time?

Fortunately, back in 2011, shortly after the initial Gateway monetization strategy was announced, SAP Mentors like Graham Robinson and other community members stepped in to explain why this was a mistake and push back against the strategy, both publicly and privately. While clearly not the only reason for SAP's change of heart, this feedback from the community was powerful, and ultimately SAP revised Gateway licensing in 2012 so that it made sense for SAP, its partners, and customers to build new applications using Gateway.

This revision set us on the path to relatively quick uptake of SAP UI5 (a modern browser-based front-end framework which leverages the Gateway APIs) and later of Fiori and the Fiori apps (most of which are based on UI5 and Gateway APIs). With Fiori, SAP again thought short-term and attempted to monetize it, only reversing course and including Fiori with existing Business Suite licenses after similar pressure from another group of SAP Mentors who had seen customers' frustration with SAP's stagnant user-experience first-hand. These Mentors, community members, and analysts were able to communicate the role Fiori needed to play in guarding against SAP customers migrating to cloud vendors offering a more compelling user-experience story.

Fiori, meanwhile, is a hit and serves as the basis for Simple Finance and now S/4HANA - SAP's recently announced blockbuster refresh of the entire Business Suite offering, which SAP clearly hopes will drive its business for the next 10 years. Would that be happening now if SAP had remained on its standard course of attempting to monetize Gateway? I don't think so. The interaction with SAP on these topics often left some feathers a bit ruffled, but it sure must be nice for those like Graham and DJ to see some of the fruits of those discussions in major products like S/4HANA.

*A note on HANA: I think that HANA may be the exception in this story. Unlike most of SAP's other technology offerings, HANA is good enough to be competitive on its own, and not simply as a platform for SAP's applications. The results are not yet in on the HANA monetization strategy, but things are looking OK, if not great. Of course, Graham has something to say about that, and what he says is always worth reading.

** Fiori is actually a design language focused on user-experience, plus a line of mini-applications implementing that design language to improve the SAP Business Suite user-experience for specific business processes. In the context of S/4HANA, however, we can think of Fiori as a prerequisite and enabling technology component.

Store everything: the lie behind data lakes

I hope the data lake idea has passed the peak of its current, unfortunately large, round of hype. Gartner came down on the concept pretty hard last year, only 4 years after the term was coined by Pentaho's James Dixon. More recently, Michael Stonebraker illustrated many of the same concepts from the perspective of a data management professional (note the excellent comment by @datachick). The frustration of conscientious data professionals with this concept is palpable.

The initial worry - wasted effort

The idea of a data lake is that we establish a single huge repository to store all data in the enterprise in its original format. The rationale is that the processes of data transformation, cleansing, and normalization result in a loss of information that may be useful at a later date. To avoid this loss, we should store the original data rather than transforming it before storing it.

This is a relatively noble goal, but it overlooks the fact that, done correctly, these processes also introduce clarity and reduce false signals that are latent in the raw data. If every data analysis project needs to recreate the transformation and cleansing process before doing its analysis, the result will be duplicated work and errors. Long live the datawarehouse.

The deeper problem - adding information

In my reading, this sums up the usual justification for data lakes as well as the source of frustration usually expressed by data professionals. But I think there is another issue with data lakes that is often overlooked: the data lake approach implies that transformation is an information-negative operation - that transforming necessarily discards data, and therefore information, from the original data set. It is a response to a common frustration with datawarehousing: that the data in the warehouse doesn't quite answer the question we are trying to ask, and that if only we could get at the source data we could reconstruct the dataset so as to be able to answer it. Sometimes that's true.

Usually, however, there are real issues with this approach. The transformation from raw data to datawarehouse (or other intermediate representation) may remove some information, but it also adds significant information to the source data. Specifically, it adds or acts on information about which data can go together and how they can be combined, and about how the data collection method may have influenced the data; it sometimes weights data according to external information about its quality, adds interpolated data, and so on. Usually this additional information is not stored along with the source data set. It's rare that it's stored in an accessible way along with the resulting data in a datawarehouse. I've seen almost no mention of maintaining this type of information in the context of a data lake. It is simply lost, or at best difficult to access.

Existential questions - If a tree falls...

Which brings me to the big lie around data lakes: storing everything. If a measurement is never taken, is it data? How many measurements are not taken for each measurement that is? Can a data lake store these un-measured measures?

The issue in data analysis, I find, is not so much that data to answer our question was collected but discarded. Rather, the problem is that the data was never collected, the measurements never taken, or taken improperly for use in addressing the question now at hand. Like the traveller in Frost's poem, we have limited resources. Collecting one measure is a road that precludes collecting other measures due to these limits. Storing everything, or even most things, is not possible.

Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
— Robert Frost, "The Road Not Taken"

Later, we often wonder if taking the other road would have allowed us to answer our current questions, but we can't go back and make different decisions about measurement. Data lakes don't change that. Maybe data lake vendors should switch to selling time machines.


I've been working on this for a while, so it's probably worth posting something about it.

Palladio is a platform for data visualization and exploration designed for use by humanities researchers. We're in early beta at the moment and will be doing a series of releases throughout 2014. You can read about it and try it out here:

I think the website does a pretty good job of explaining the capabilities of the platform, so I'll leave that for the moment. I encourage you to go check it out before reading on, because it will be worth understanding how the platform works to give some context to the discussion below.

The map view in the Palladio interface

So, why am I super-excited about this? Mostly because of the great team, which has the vision, technical skills, theoretical and domain knowledge, and information design chops to pull off this type of project. I consider myself lucky to be able to work with this group.

It's also great to be working on a project like this for a field that is simultaneously very strong on information theory and a bit underserved in terms of some types of tools. This is in stark contrast to my usual enterprise data management and visualization work where the theory tends to be weak but a plethora of tools exist.

In addition to trying to build a tool that incorporates important and underserved aspects of humanistic inquiry, I am excited to work with a team that buys into introducing state-of-the-art concepts around data exploration tools in general. Many of the concepts we are working to implement in Palladio are directly applicable to the types of data exploration problems we find in the enterprise and are concepts rarely expressed in existing tools. Palladio is a great example (one of many great examples) of how the process of humanistic inquiry can motivate the development of methods that are both technically and conceptually applicable in wildly different disciplines.


The thing that initially most impresses people about Palladio is the way that filtering and movement are integral to the visualization. Specifically, the visualizations update and move in real-time as you filter. This is not a new concept, but I don't think I've ever seen it fully implemented in a general-purpose tool. Getting the level of movement right is a design challenge that the team is tackling as a work in progress, but in my opinion this characteristic of real-time updates and movement is a key feature for a data exploration tool, and few if any tools implement it.

I'll try not to get too squishy here, but this behavior of the tool allows a person to interact with the data in a very direct way, giving a feel for the data that would not otherwise exist. When you can see the results of your interactions with the data in real time, it is a lot easier to conceptually link step-changes and interesting events with the actions that caused them. For example, dragging a filter along the timeline component allows you to play back history at your own speed, speeding up or slowing down as suits you. My theory-foo is weak, but when you see it, I think you'll understand. Try it out with the sample data.


Techy alert: Palladio is a purely client-side, browser-based application. The only involvement of a server is to deliver the HTML, Javascript, and CSS files that comprise Palladio. We arrived at this design through a few iterations, but the motivation was that we wanted to be cross-platform, super-easy to install, and still support pretty big data sets and fluid interactions. 10 years ago, this would have been nearly impossible, but today we have web browsers that, amazingly, are capable of supporting these types of applications. Yes, browsers can now support dynamically filtering, aggregating, and displaying 10s and 100s of thousands of rows of data and displaying hundreds of data points simultaneously in SVG; thousands of data points if you use canvas instead.
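
To make that concrete, here is a minimal sketch (not Palladio's actual code) of the kind of client-side filter-and-aggregate loop this enables, using Crossfilter; the records array and its fields are hypothetical:

// Hypothetical records like { date: Date, place: 'Paris', count: 3 }.
var cf = crossfilter(records);
var dateDim = cf.dimension(function(d) { return d.date; });
var placeDim = cf.dimension(function(d) { return d.place; });
var countsByPlace = placeDim.group().reduceSum(function(d) { return d.count; });

// Narrowing the date range re-aggregates every other view in milliseconds,
// which is what makes coordinated, real-time filtering feasible in the browser.
dateDim.filterRange([new Date(1700, 0, 1), new Date(1800, 0, 1)]);
console.log(countsByPlace.top(10));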

The time for client-side data visualization in the browser has come and we are taking advantage of that in a big way. A great strength of browser-rendered visualizations is that they allow true interaction with the visualization. Just using SVG or Canvas as a nicer replacement for static images is fine, but it isn't fully exploiting the medium. Add to this that the type of interactivity we are providing with Palladio is technically impossible in a client-server setup. Even if the server responds to queries instantaneously, the round-trip time the client-server communication introduces means that interactions won't be as closely linked as they are in Palladio, severely degrading the quality of the interactive experience.

Admittedly, we have work to do on performance, and our cross-browser support could be better. Additionally, the problem of data that simply doesn't fit in the browser's memory remains unaddressed, though we have some ideas for mitigating it. But I think this is an application design approach that could work for the vast majority of data sets out there, either because the data itself is relatively small, or through judicious use of pre-aggregation to avoid performance and size issues.


Lastly, user experience and information design have been integral components of this project from the start. The design has been overhauled several times along the way, and I wouldn't be at all surprised if it happened again. To be clear, I'm a complete design newb, but we have a great designer working on the team. One thing that has become clear to me through this process is that designing a general purpose interactive visualization tool is hard. There are more corner-cases than I previously imagined possible, but we are trying to get the big pieces right and I think we're on the road to success.

Obviously the organizational dynamics on a small team like ours are very different than those in a big development organization, but it seems like information design on most of the enterprise data exploration tools from larger vendors either started out problematic and stayed that way, or started out pretty well and started slipping as the tool took off. I'm not sure if there is an answer to this, but it's clear that when building a tool in this space, having at least one information designer with a strong voice on the team is indispensable.

Let me sum up

So, that's the Palladio project, as well as a few takeaways that I feel can be applied back to pretty much any data exploration project. In closing, I'll just mention that none of this would be possible without some great open source projects that we use and, most importantly, without the great team working on this as well as the feedback and patience of our dedicated beta participants. The central Javascript libraries we used to pull this off are d3.js, Crossfilter, and Angular.js. The members of the core team are Nicole Coleman (our fearless leader), Giorgio Caviglia (design and visualization), Mark Braude (documentation, testing, working with our beta users, and project management), and myself doing technical implementation work. Dan Edelstein is the principal investigator.

It's been a great ride so far and we've got some exciting things planned for the rest of the year. This is definitely a work in progress, and feedback is very welcome. Follow the Humanities + Design lab on Twitter for updates.