I'm new to systems development and was wondering if someone more experienced could help me think through some questions about databases, web services, and overall architecture.
I have a web scraper that's supposed to run daily. It collects, scrubs, and aggregates data on local businesses from multiple publicly available government datasets, and the result goes into a Postgres DB.
The user will then have an admin dashboard showing some metrics and trends. What I don't know is whether this dashboard should query the DB directly every time the user loads it.
I imagine that's not the wisest approach, since every page load would hit the DB with multiple JOINs, SUMs, COUNTs, etc. and slow it down. Would it be better to compile these metrics overnight (or hourly?) and store them somewhere?
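To make the "compile overnight" idea concrete, this is a minimal sketch of what I'm picturing: a nightly job writes daily snapshots into a small rollup table, and the dashboard reads from that instead of aggregating the raw data. The table and column names here are made up for illustration, and I'm assuming a plain `companies` table plus the `pg` client:
// Hypothetical nightly pre-aggregation sketch (not my actual code).
// The dashboard would read company_counts_by_sector instead of running
// SUM/COUNT over the raw companies table on every page load.
const { Pool } = require('pg');
const pool = new Pool(); // connection settings come from the usual PG* env vars

const CREATE_ROLLUP = `
  CREATE TABLE IF NOT EXISTS company_counts_by_sector (
    snapshot_date      date    NOT NULL,
    sector             text    NOT NULL,
    active_companies   integer NOT NULL,
    inactive_companies integer NOT NULL,
    PRIMARY KEY (snapshot_date, sector)
  );
`;

const FILL_ROLLUP = `
  INSERT INTO company_counts_by_sector
    (snapshot_date, sector, active_companies, inactive_companies)
  SELECT current_date,
         sector,
         count(*) FILTER (WHERE active),
         count(*) FILTER (WHERE NOT active)
  FROM companies
  GROUP BY sector
  ON CONFLICT (snapshot_date, sector) DO NOTHING;  -- safe to re-run the same night
`;

async function refreshRollups() {
  await pool.query(CREATE_ROLLUP);
  await pool.query(FILL_ROLLUP);
}

refreshRollups().catch(console.error).finally(() => pool.end());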
I was doing some research and came across these "Analytical Databases". Is that what I should use? This similar question seems to have solved the puzzle, especially @samxli's comment on the accepted answer.
I could really use some pointers here. How is analytics commonly handled in production? Thank you so much in advance! :thumbs-up:
Solution details:
- NodeJS web scraper with CAPTCHA bypassing collects public data daily
- Data from multiple sources is scrubbed, aggregated, and saved to a Postgres DB (rough daily job sketch after this list)
- Data contains public information about local businesses - see below
- A dashboard shows historical data (time series), metrics, and trends
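To give the list above a bit more shape, this is roughly how I picture the daily run being wired together. node-cron for scheduling is only an assumption on my part, and the three pipeline functions are empty stand-ins for my actual code:
// Rough daily-job skeleton (illustrative only).
const cron = require('node-cron');

async function scrapeSources()          { return []; }   // fetch the public government datasets
function scrubAndAggregate(raw)         { return raw; }  // normalize into the record shape below
async function saveCompanies(companies) {}               // upsert into Postgres

cron.schedule('0 3 * * *', async () => {                 // every day at 03:00
  const companies = scrubAndAggregate(await scrapeSources());
  await saveCompanies(companies);
  // a nightly metrics rollup (like the one sketched earlier) would slot in here too
});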
Sample record:
{
  trade_name: "ACME Inc.",
  legal_name: "Watchmen Hero Services Incorporated",
  active: true,
  foundation_date: "2018-11-23",
  sector: "services",
  main_activity: { id: 12318, name: "Law enforcement" },
  secondary_activities: [],
  address: {}, // standard address object
  location: { lat: -23.2319, long: 42.1212 },
  ...
}
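For what it's worth, here's roughly how I picture one of these records landing in Postgres. A single `companies` table with JSONB columns for the nested fields is just my working assumption, and I've left out whatever unique key (registration number, etc.) the real data carries:
// Hypothetical insert of one scraped record using the `pg` client.
const { Pool } = require('pg');
const pool = new Pool();

async function saveCompany(c) {
  await pool.query(
    `INSERT INTO companies
       (trade_name, legal_name, active, foundation_date, sector,
        main_activity, secondary_activities, address, location)
     VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)`,
    [
      c.trade_name,
      c.legal_name,
      c.active,
      c.foundation_date,
      c.sector,
      c.main_activity,                        // jsonb column
      JSON.stringify(c.secondary_activities), // jsonb column (arrays need explicit stringify)
      c.address,                              // jsonb column
      c.location,                             // jsonb for now, maybe PostGIS later
    ]
  );
}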
Sample metrics:
- Total number of active and inactive companies over time per sector and activity
- Estimated tax revenue over time per district and activity
- Top N most common activities per city district
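To make the load question concrete, this is roughly what one of those metrics (the Top N activities one) looks like as a live query against the raw table, i.e. what the dashboard would run on every page load if I don't pre-aggregate. The JSONB layout and the "district" key inside the address object are assumptions, not something settled:
// Illustrative "top 5 activities per district" query, run live against the raw table.
const { Pool } = require('pg');
const pool = new Pool();

const TOP_ACTIVITIES = `
  SELECT district, activity, companies FROM (
    SELECT address->>'district'   AS district,
           main_activity->>'name' AS activity,
           count(*)               AS companies,
           rank() OVER (PARTITION BY address->>'district'
                        ORDER BY count(*) DESC) AS rnk
    FROM companies
    WHERE active
    GROUP BY 1, 2
  ) ranked
  WHERE rnk <= 5;
`;

pool.query(TOP_ACTIVITIES)
  .then(({ rows }) => console.table(rows))
  .catch(console.error)
  .finally(() => pool.end());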