I worked for an advertising network. Our business was to serve ads. Our clients, the advertisers, wanted to know which ads and campaigns received queries, clicks, and conversions, and how much they paid.
We started with web logs, which recorded the details of each query, click, and conversion.
We digested these logs into DataCube objects, which are in-memory OLAP cubes. An OLAP cube contains a specified set of hierarchical dimensions and a specified set of cumulative measures (in our case, impressions, clicks, conversions, revenue, etc.).
Each OLAP cube we generated contained a set of dimensions and the rolled-up metrics for those dimensions. Each of these cubes was mapped to an autogenerated relational table with fields corresponding to those dimensions and metrics.
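The rollup described above can be sketched roughly as follows. This is a minimal, hypothetical reconstruction: the `DataCube` name comes from the text, but the specific dimensions, measure names, and event format are illustrative assumptions, not the actual schema.

```python
from collections import defaultdict

# Hypothetical dimensions and cumulative measures for one cube.
DIMENSIONS = ("date", "advertiser", "campaign")
MEASURES = ("impressions", "clicks", "conversions", "revenue")

class DataCube:
    """Sketch of an in-memory OLAP cube: rows keyed by dimension values,
    with cumulative measures accumulated per key."""

    def __init__(self):
        # tuple of dimension values -> list of running measure totals
        self.rows = defaultdict(lambda: [0] * len(MEASURES))

    def add(self, event):
        key = tuple(event[d] for d in DIMENSIONS)
        totals = self.rows[key]
        for i, m in enumerate(MEASURES):
            totals[i] += event.get(m, 0)

cube = DataCube()
cube.add({"date": "2010-01-01", "advertiser": "acme", "campaign": "spring",
          "impressions": 1, "clicks": 0, "conversions": 0, "revenue": 0.0})
cube.add({"date": "2010-01-01", "advertiser": "acme", "campaign": "spring",
          "impressions": 1, "clicks": 1, "conversions": 0, "revenue": 0.05})
# Two events with the same dimension values roll up into a single cube row.
```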
For each row of data in the cube, we'd check the relational table for an existing row with that date and that combination of dimension values. If one existed, we'd update the measure totals; if not, we'd insert a new row.
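That check-then-update-or-insert step might look like the sketch below, written against SQLite for illustration. The table and column names are hypothetical; the real tables were autogenerated per cube, and the production database was presumably not SQLite.

```python
import sqlite3

# Hypothetical autogenerated table for a cube with (date, advertiser) dimensions.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE cube_daily (
    date        TEXT,
    advertiser  TEXT,
    impressions INTEGER,
    clicks      INTEGER,
    PRIMARY KEY (date, advertiser))""")

def load_row(conn, date, advertiser, impressions, clicks):
    # Check for an existing row with this date and these dimension values...
    cur = conn.execute(
        "SELECT 1 FROM cube_daily WHERE date = ? AND advertiser = ?",
        (date, advertiser))
    if cur.fetchone():
        # ...update the running measure totals if found...
        conn.execute(
            """UPDATE cube_daily
               SET impressions = impressions + ?, clicks = clicks + ?
               WHERE date = ? AND advertiser = ?""",
            (impressions, clicks, date, advertiser))
    else:
        # ...otherwise insert a new row.
        conn.execute(
            "INSERT INTO cube_daily VALUES (?, ?, ?, ?)",
            (date, advertiser, impressions, clicks))

load_row(conn, "2010-01-01", "acme", 100, 3)
load_row(conn, "2010-01-01", "acme", 50, 2)  # same key: totals accumulate
```

Note that this pattern costs a read plus a write per cube row, which is part of why the load step was expensive at volume.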
The UI presented the contents of the relational tables to the user.
This approach was reasonable but had issues:
The update/insert operations were expensive, even when optimized.
There was a perpetual battle between dimensional richness and cube load times. We always had to digest away some information.
Business stakeholders began to ask for more dimensions for use in their analytics, which increased the cube load times further.
As traffic volume increased, processing an hour's worth of data began to take more than 60 minutes. That was obviously unsustainable, and it led to reduced dimensional richness and, consequently, disappointed stakeholders.
We could have stored raw log data in relational tables and scaled storage horizontally with Greenplum or a similar solution, but such systems are all very expensive.
Eventually, we ran out of clever ways to improve the code. We needed a new approach.