By Marco Fedele
Using Spark, I've developed a dashboard for a client whose original database has 14 tables with more than 26M rows in total. The data is about physical users, parcels, products and web users.
You can think of it as an e-commerce company that also has physical stores.
The requirements were:
- make the dashboard multi-user
- perform SQL like queries on a fixed subset of fields
- queries should return a web page with ~20 fixed tables of statistical analysis
- not all required queries can be answered from the original DB structure; for example, we don't have the user's gender, but we can infer it from other data
- metrics should be counted differently depending on their kind, e.g. gender should be counted over physical users, month of invoice should be counted as number of invoices per year, site accesses should only count web users, …
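To make the last requirement concrete, here is a minimal sketch in plain Python of what "counted differently per kind" means (the field names are invented for illustration; in Spark this maps to conditional aggregates rather than a loop):

```python
from collections import Counter

# Hypothetical rows from the unified table; the real schema differs.
rows = [
    {"kind": "physical", "gender": "F"},
    {"kind": "physical", "gender": "M"},
    {"kind": "web",      "gender": None},
]

# Each metric only counts the rows relevant to its kind:
# gender is counted over physical users only...
gender_counts = Counter(r["gender"] for r in rows if r["kind"] == "physical")

# ...while site accesses only count web users.
web_accesses = sum(1 for r in rows if r["kind"] == "web")
```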
The solution I came up with was to merge, deduplicate, pivot and re-aggregate all the data into a single table storing only the relevant information, reducing the number of rows by more than 4× compared to the original database.
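The shape of that pipeline, sketched with pandas for brevity (the tables, columns and values are made up; the real job runs the equivalent Spark DataFrame operations):

```python
import pandas as pd

# Hypothetical fragments of two of the original tables.
users = pd.DataFrame({"user_id": [1, 1, 2], "name": ["Ann", "Ann", "Bob"]})
invoices = pd.DataFrame({"user_id": [1, 1, 2],
                         "month": [1, 2, 1],
                         "amount": [10.0, 20.0, 5.0]})

# Deduplicate, join, then pivot so each user becomes a single row
# holding only the fields the dashboard actually needs.
users = users.drop_duplicates("user_id")
merged = users.merge(invoices, on="user_id")
reduced = merged.pivot_table(index="user_id", columns="month",
                             values="amount", aggfunc="sum")
```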
To calculate the metrics, the only approach I've come up with is getting the RDD from the DataFrame and then using the `aggregate` function to manually analyse (and count) every metric from every single row.
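Roughly, that approach is a single-pass fold over all rows, sketched here in plain Python with `functools.reduce` standing in for the RDD's `aggregate` (field names are invented for illustration):

```python
from functools import reduce

# Illustrative rows from the reduced table.
rows = [
    {"kind": "physical", "gender": "F"},
    {"kind": "physical", "gender": "M"},
    {"kind": "web", "gender": None},
]

def seq_op(acc, row):
    # Inspect every row once and update each metric by hand,
    # mirroring what the seqOp of rdd.aggregate does.
    if row["kind"] == "physical" and row["gender"]:
        acc["gender"][row["gender"]] = acc["gender"].get(row["gender"], 0) + 1
    if row["kind"] == "web":
        acc["web_accesses"] += 1
    return acc

metrics = reduce(seq_op, rows, {"gender": {}, "web_accesses": 0})
```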
Performance is really poor: more than a minute to compute the metrics from the reduced dataset on an 8-core CPU, with a single user.
I'm not asking for code, but for documentation, examples, knowledge. I think I'm missing something, and I don't know what it is. I've seen a lot of dashboards perform queries on bigger datasets almost instantly, and I don't understand how they do it (apart from raw computational power).
Source: Stack Overflow