Toward an analysis of datawarehouse and business intelligence challenges - part 1

(This post is a bit of a blast from the past. It was originally published in 2010 and was lost during a migration in 2011.)

Since I work in and around datawarehousing and business intelligence, I've developed notes and thoughts over the years on the key challenges in these areas. New technologies and architectural approaches are drastically changing the landscape of the field and can help to address some of these challenges, but enterprise software vendors and customers are often not aware of new approaches or their applicability to the classic problems of the field, which continue to persist.

I'm starting to compile this list publicly, in what I hope will be a more-or-less living document, because I will start using it to evaluate the applicability of newly maturing technologies (in-memory, non-relational or NoSQL databases, etc.) and architectures (map-reduce, streaming datawarehousing, etc.) to these old problems. This list is a survey, not an in-depth analysis of the problems. I may provide more in-depth analyses if it seems relevant, but I will more likely look for and point to references where they are available.

This is about half of my initial list and is in no particular order. I'll post the second half of the list shortly.

Data volume

This is a classic datawarehousing problem, often addressed through data modeling, aggregation, or compression. Even though it is one of the oldest problems of the field, it is by no means solved, or even optimally addressed. Enterprises continue to struggle with the cost and technical feasibility of scaling up their datawarehouses, often due to limitations of the underlying database technology.

Data quality

We may seem to be able to procure and store all of the data necessary, but that is no guarantee that the data is correct. This challenge has more to do with data being wrong than with data being misunderstood or semantically misaligned, though this is a related problem. Data quality issues can arise for many reasons including incorrect data at the point of entry, incomplete data, duplicate data, or data that becomes incorrect because of an invalid transformation.

Data consistency

Even when data is correctly stored in a datawarehouse, it may become temporarily inconsistent under certain operations. For example, when deleting or loading data, there may be a period of time when queries can access part of the data being loaded or deleted, but not all of it. This can be thought of as an inconsistent state, and while most datawarehousing tools ensure consistency in some manner, this is an area that may sometimes be traded for better handling of another challenging area. The classic tradeoff is between consistency, loading performance, and query performance.

Semantic integration

An oft-overlooked but extremely important concept, semantic integration challenges come in two flavors:

  • Homonyms: data that has the same name but different meanings (your "Revenue" may not be the same as my "Revenue").
  • Synonyms: data that has the same meaning but different names.
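A toy sketch of how these two flavors show up when integrating source systems. All field names and mappings here are hypothetical, invented purely for illustration:

```python
# Hypothetical field mappings from two source systems into one
# canonical warehouse vocabulary.
# Homonym: both systems have a field named "revenue", but suppose
# system A reports gross revenue while system B reports net revenue.
# Synonym: "cust_id" and "customer_number" name the same concept.
canonical_mapping = {
    "system_a": {"revenue": "gross_revenue", "cust_id": "customer_id"},
    "system_b": {"revenue": "net_revenue", "customer_number": "customer_id"},
}

def to_canonical(system: str, record: dict) -> dict:
    """Rename source fields to the shared warehouse vocabulary."""
    mapping = canonical_mapping[system]
    return {mapping.get(k, k): v for k, v in record.items()}

print(to_canonical("system_a", {"revenue": 100, "cust_id": 7}))
# {'gross_revenue': 100, 'customer_id': 7}
```

The point of the sketch is that neither flavor can be resolved by looking at the data alone; the mapping encodes business knowledge that has to be gathered from the source-system owners.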

Historical data

Dealing with historical data is a challenge that could be subsumed under other challenges. Usually the problems here are mostly issues of handling volume, master data management (changing dimensions), and semantic integration. However, historical data brings some unique angles to these challenges, including possible relaxation of requirements around performance and detail, as well as new legal, audit, and governance requirements around retaining or discarding specific data-sets.

Unstructured data

Datawarehouses have always focused on structured data, primarily because of a complete lack of tools for handling unstructured data rather than because of a philosophical view that unstructured data does not belong in a datawarehouse. This is not to say that the philosophical view does not exist, but rather that the philosophical view derives from an inability to execute rather than any underlying principle, and so should be ignored in light of new tools.

Unstructured data brings with it design constraints and requirements that do not normally appear in datawarehousing discussions. These include a lack of design-time information about dimensionality, the existence of non-numeric "key figures" (text- or image-based data, for example), document-oriented data, and the need for full-text search. Additionally, the challenge of derived dimensions and measures is strongly related to unstructured data, as these are key tools for allowing us to derive structured reporting and analysis from unstructured data-sets.

Why in-memory doesn't matter (and why it does)

(This post is a bit of a blast from the past. It was originally published in 2009 and was lost during a migration in 2011. The software landscape it references is obviously dated and many links are likely broken, but the analysis is still relevant.)

Well, that title was going to be perfect flame-bait, but then I went all moderate and decided to write a blog that actually matters. So here's the low-down:

There's a lot of talk lately about in-memory and how it's the awesome. This is especially true in the SAP-o-sphere, primarily due to SAP's marketing might getting thrown behind Business Warehouse Accelerator (BWA) and the in-memory analytics baked into Business ByDesign.

I'm here today to throw some cold water on that pronouncement. Yes, in-memory is a great idea in a lot of situations, but it has its downsides, and it won't address a lot of the issues that people are saying it addresses. In the SAP space, I blame some of the marketing around BWA. In the rest of the internet, I'm not sure if this is even an issue.

Since I've actually done a fair amount of thinking about these issues (and as a result I troll people on Twitter about it), I thought maybe it'd be helpful if I wrote it down.

So let's get down to brass tacks:

How in-memory helps

In short: it speeds everything up.

How much? Well, let's do the math: Your high-end server hard drive has a seek time of around 2 ms. That's 2*10^-3 seconds (thanks Google). Yes, I'm ignoring rotational latency to keep it simple.

Meanwhile, fast RAM has a latency measured in nanoseconds. Let's say 10ns to keep it simple. That's 10^-8 seconds.

So, if I remember my arithmetic (and I don't), RAM is about 2*10^5, or 200,000 times faster than hard disk access.
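The back-of-the-envelope arithmetic above can be checked in a few lines of Python. The 2 ms and 10 ns figures are the rough numbers assumed in this post, not measurements:

```python
# Rough latency figures assumed above (not measurements).
disk_seek_s = 2e-3     # ~2 ms seek time for a high-end server hard drive
ram_latency_s = 10e-9  # ~10 ns latency for fast RAM

speedup = disk_seek_s / ram_latency_s
print(f"RAM is roughly {speedup:,.0f}x faster")  # roughly 200,000x
```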

Keep in mind that RAM is actually faster because the CPU-memory interface usually supports faster transfer rates than the disk-CPU interface. But then, hard disks are actually faster because there are ways to drastically improve overall access performance and transfer rates (RAID, iSCSI? - not really my area). Point is, RAM helps your data access go a lot faster.

But ... er ... wait a second (or several thousand)

So here I am thinking, "Well, we're all fine and dandy then. I just put my job in RAM and it goes somewhere between 100,000 and 1,000,000 times as fast. Awesome!".

But then I remember that RAM isn't a viable backing store for some applications, like ERPs (no matter what Hasso Plattner seems to be saying) or any other application where you can't lose data, period. Yes it can act as a cache, but your writes (at least) are going to have to be transactional and will be constrained by the speed of your actual backing store, which will probably be a database on disk.

And then I see actual benchmarks meant to reflect the real world like this. For those who won't click the link, the numbers are a bit hard to read, but I'm seeing RAM doing about 10,000 database operations in the amount of time it takes a hard disk store to do about 100. That's only a 100x speedup.

Ok, now I'm back down to earth and I'm thinking, "I just put my job in RAM and I'll get maybe a 50-100x speedup but at the cost of significant volatility". (I'm also thinking that SAP's claimed performance improvements of 10x - 100x sound just about like what we'd expect.)

This is still really really good. It makes some things possible that were not possible before and it makes some things easy that used to be hard.

And finally, why in-memory doesn't matter

But really, what is the proportion of well-optimized workloads in the world? How often are people going to use in-memory as an excuse to be lazy about solving the actual underlying problems? In my experience, a lot. Already we are hearing things along the lines of, "The massive BW query on a DSO is slow? Throw the DSO into the BWA index." [Editor's note: A DSO is essentially a flat table. Also, the current version of BWA doesn't support direct indexing of DSOs, but it probably will soon, along with directly indexing ERP tables.]

Now's the part where we who know what we're doing tear these people to shreds and tell them to implement a real Information Lifecycle Management system and build an Inmon-approved data warehouse using their BW system (BW makes it relatively easy). Then that complex query on a flat table that used to take two days of runtime will run in 30 seconds.

Well, that would be one approach, but frankly most people and companies don't have the time or the organizational maturity in their IT function to pull this off. And in this world, where people have neither the time nor the business processes for this sort of thing, it starts to make sense to spend money on the problem instead, and something like BWA is a great thing in this context.

But it's not great because it's in-memory. It's great because it takes your data - that data you haven't had the time to properly build into a datawarehouse with a layered and scalable architecture, highly optimized ROLAP stores, painstakingly configured caching, and carefully crafted delta processes - and it compresses it, partitions it, and denormalizes it (where appropriate). Then, as the icing on the cake, it caches the heck out of it in memory.

Let's be clear: BW already has in-memory capabilities. Livecache is used with APO, and the OLAP cache resides in memory. The reason BWA matters is not that it is in-memory. It matters because it does the hard work for you behind the scenes, and partially because of this it is able to use architectural paradigms like column-based stores, compression, and partitioning that deliver performance improvements for certain types of queries regardless of the backing store.

In-memory is great, and fast, and should be used. But in most ways that are really important, it doesn't matter all that much.

How Google+ spams people using the Gmail widget

So, there's this new-ish widget in Gmail that shows up on the right side gutter and gives information about the person you are receiving email from. It's actually pretty useful. In addition to basic contact information, it shows thumbnails of recent pictures in emails from that person, and gives chat options.

Here, I have sent an email to my real Gmail account from a throwaway account called ethanjewett@gmail.com

Recently, the option to add a person to your Google+ circles has been added to this widget in Gmail. Circles have also been integrated into Gmail Contacts, so a reasonable person might well think that this widget is a way to add people to contact groups.

The new G+ button

Clicking on this button pops up a list of your circles, allowing you to choose how you want to categorize this person.

I'm my own friend

At this point, my "friend" has been added to my "Friends" circle.

What I might not expect at this point is that Google has sent my new "Friend" a very excited email falsely stating that I am inviting him to join Google+.

I'm a little concerned by this for the following reasons:

  • Did I invite this person to join Google+? No, I added him to a circle in Gmail.
  • Am I recommending Google+ for this person? No. In fact, I only found out about this because people I added to my circles in this manner were confused by the invitation. One of them wrote back to me asking what it was. I would never have invited some of these people to Google+ because I know they have no interest in it.
  • Did I want this person to know that I had added them to a circle/group in my Gmail contacts? Not necessarily. I really believe that this function is easy to confuse with contact groups, and organizing contacts is a very private activity that I do not want my contacts to know about. What if this is someone I want to avoid, someone I don't want thinking about me? What if I added this person to a circle called "Stalkers"?
  • Google (and everyone else for that matter) should never ever ever email my contacts without my express consent.

I trust Google with my contact information and my email. This type of behavior makes it quite clear that Google is not worthy of that trust. Am I going to do anything about it? Not at the moment, but it is one more straw added to the camel's back.

SAP BI OnDemand and Hana

It's been some time now since the press releases and SAP TechEd Bangalore keynote proclaiming that SAP's BI OnDemand product now runs on HANA as its underlying database. The press releases have gone out. The product is here. The BI OnDemand website has been updated with a shiny new "Powered by SAP Hana" logo.

There is only one problem. It seems that the BI OnDemand that most people can see is not actually powered by Hana.

I discovered this for myself when discussing the topic with Courtney Bjorlin, who was working on an article about the announcement. SAP confirms in the article that only the "Advanced Edition" of BI OnDemand is available on the HANA database. At SAP's TechEd in Madrid, I was able to ask around on the show floor and hallways and find out more about the situation.

How do I get BI OnDemand running on HANA?

You have to buy the "Advanced Edition" of BI OnDemand. This involves a sales process and is a hosted version of the BI OnDemand platform. It seems that it's not exactly SaaS or "OnDemand", but more on that below.

The fact that the logo at https://bi.ondemand.com says "Powered by SAP Hana" is apparently an inaccuracy. Hopefully that will be corrected soon.

What are these different "editions" of BI OnDemand?

There are three "editions" of BI OnDemand: Personal, Essential, and Advanced. Based on my discussions, it seems that the Personal and Essential editions are SaaS applications hosted by SAP, while the Advanced edition is hosted by partners. All editions seem to include the same web interface as seen on bi.ondemand.com, but the Essential edition includes customization and branding options as well as more storage. The Advanced edition features even more storage and customization options, plus access to a hosted version of BusinessObjects Data Services, which can be used to manage the contents of DataSets. This integration of Data Services can allow for incremental updates to DataSets, which is a key feature and is not possible in the Personal or Essential editions.

As far as I can tell, none of this is documented anywhere on SAP's standard sites. My thanks to Richard Hirsch for finding this presentation outlining some of these points (see page 17): link to PDF (link has been removed because SAP has let the domain lapse - original URL was http://sap-partnersummit2011.com/doc/post_event/FKOM2011_BA&T%20track_Day2/BAtrack_2_BA-Solutions&Innovation.pdf).

So if I have the Advanced edition, I'm now on Hana?

No, not quite.

First of all, based on discussions at TechEd Madrid, it seems that only new customers can currently get onto the Hana-based BI OnDemand platform. Apparently there are contingencies for existing customers to migrate eventually, but right now it is only for new customers.

Further complicating the issue, it seems that not all hosting partners for the Advanced edition provide HANA as the underlying platform. I was told by SAP employees on the show floor that only one partner is currently providing BI OnDemand on HANA, and that partner is only in North America. Other partners are providing BI OnDemand on the older Microsoft SQL Server-based platform. I have yet to confirm this; it is based on only one source, so take it with a grain of salt. But there is clearly confusion around the availability of BI OnDemand using HANA, even if you are purchasing the Advanced edition.

If capabilities provided only by HANA are required for your implementation, be sure you are actually getting HANA when you buy the BI OnDemand Advanced Edition.

Is it Hana or HANA?

I have no idea. I did learn at TechEd that HANA (or Hana) is not an acronym, so I'm leaning towards Hana, but old habits die hard.

Ok, enough with the Q&A. What does this mean?

In my view, this means that SAP still has a lot of work to do getting its message across clearly. It is not particularly bad or good that HANA is not available for the Personal or Essential editions of BI OnDemand. These editions are limited to data set sizes that are simply too small for HANA to make much of a difference.

The greater concern here is one of communication. For any company, it is extremely important to say what you mean and mean what you say. It would have been much better if SAP had been clear about the roll-out of HANA for BI OnDemand from the beginning. As it stands now, many people will try out the Personal edition and think that they are using "Hana", but they're not.

Looking to the wider view, I worry about what this partial roll-out means for SAP's BI cloud play. The BI SaaS market is still very immature and SAP has the opportunity to play a leading role in this emerging market. However, the BI OnDemand product doesn't seem to have received the sort of development attention required for this role, and the deployment options seem to be severely lacking.

Companies and departments looking to buy powerful SaaS BI capabilities are not interested in figuring out what database the product is using and the impact this has on their reporting needs. SaaS should work as defined in SLAs, and it should keep getting faster and better in a way that is non-disruptive.

After talking with some of the BI OnDemand development team in February, I know that they have a good understanding of the BI SaaS space and have some great ideas for the BI OnDemand platform. I'd love to see SAP deliver on its potential in this area and I think they have the people and the vision to do so, but we haven't seen it in the product yet.

Hopefully SAP can get both the BI OnDemand message and the platform straightened out quickly. The BI SaaS market is still extremely young and SAP could be leading the way.

Disclosure: SAP provided my travel and badge for the TechEd + Sapphire 2011 conference in Madrid.