Toward an analysis of datawarehouse and business intelligence challenges - part 2

(This post is a bit of a blast from the past. It was original published in 2010 and was lost during a migration in 2011.)

This is the second half of an analysis, and with my first post on the topic constitutes my first swipe and listing the current fundamental challenges of the datawarehousing and business intelligence fields. The list is in no particular order and will surely change in the future. It is conceived as the beginning of a framework from which to evaluate new or maturing technologies and architectures from the perspective of applicability to the field.

Aggregating silo-ed data sources

Silos silos silos. Anyone trying to do data analysis has run into this problem: the data exists, but we can't get at it. The technical aspects of this challenge are many (bandwidth, interfaces, and ETL), but it's worth noting that they are usually dwarfed by the cultural and organizational obstacles (default against sharing, departmental rivalries), many of which are in place for good reason (security and permissions concerns, privacy laws).

Representing data in a meaningful way

Historically this feels like one of the least-addressed challenges, but we are finally seeing some serious attention paid to this problem. Challenges in representation of data range from visualization (and the related topic of responsible visualization - as visualization is too often untruthful), to analytical views and tools, through search and guided data exploration.

As we stand, the data in datawarehouses and business intelligence datamarts is too often opaque and misunderstood by most users. Even the most impressive and advanced visualizations and analysis tools (Gapminder, BusinessObjects Explorer, and Qlikview, for example) are still highly guided constructs that are often only applicable to predetermined datasets. We have come a long way (finally) over the last decade, but we have a long way yet to go.

Representing reporting structures

Reporting structures are now fairly well understood, but representing them efficiently in our datawarehouses or BI tools remains a challenge. Some examples of such structures: reporting hierarchies, time-dependency, calculated measures, and derived or dependent characteristics. Challenges revolve around rollup and calculation performance, reorganization due to reporting structure changes, and accessibility to potential users.

Performance

Traditionally this is the "big one" and it is still very much an unsolved problem. Bound by the CAP tradeoff, we are more or less forced to give up either consistency, availability, or partition-tolerance in order to improve performance under constant resources. Two approaches prevail: architectures that give up one or more of the three in exchange for performance, and architectures that attempt to better optimize for the problem-space in order to improve performance while maintaining all three CAP axes. Both are perfectly legitimate approaches, but it will be important to recognize which architectural approach is being pursued in any given product or technology. As a wise person once said, "there is no such thing as a free lunch".

Further complicating matters, there are multiple performance aspects of datawarehouse and business intelligence applications, and we need to be clear which ones we attempt to optimize for. These aspects include query performance (keeping in mind the random access vs. batch/bulk/scan access difference), data loading (ETL) and reorganization, and (in some systems) writeback or changing of data.

Security

Security models pose more of a management problem than a technical problem for datawarehouse and BI applications. Nonetheless, I think they're worth mentioning as a core challenge to keep in mind, just in case someone comes up with a way to make reasoning about security in analytical-processing-oriented datasets less painful.

Data loading

Last but certainly not least, data loading is a perennial headache in datawarehouse and BI systems. The three basic types of data loading (batch, real-time/streaming, and write-back/input) all to some extent conflict with each other. Add to that the complexity of managing a profusion of delta mechanisms (many of which exist for good reason, others of which exist because of careless design) and different interface formats and we've got ourselves a real party. Standardization of interfaces and design practices are the key touchstones of conquering this challenge, but as with many of these challenges, this is more of a human problem than a technical problem.

Conclusion - technical vs. design challenges

If we take one thing away from this enumeration of the challenges of the datawarehouse and business intelligence spaces, I hope it is the fact that most of these challenges are more human in nature than they are technical. They tend to derive from the difficulty in making tradeoff decisions, standardizing interfaces and architectures, identifying and focusing on the problem space, and understanding how people may actually use these systems to greatest effect. Because of this, these challenges are often at least as susceptible to design solutions as they are to pure technical solutions. There is a tendency in the industry to focus on technical answers to these challenges over design answers, perhaps because technical solutions are often more impressive and in some sense physical. I think that's unfortunate.