
I'm new to systems development and was wondering if someone more experienced than I am could help me figure out some questions about databases, web services, and overall architecture.

I have a web scraper that's supposed to run daily. It will collect, scrub, and aggregate data on local businesses from multiple publicly available government data sources. This data goes into a Postgres DB.

The user will then have an admin dashboard where they can see some metrics and trends. What I don't know is whether this dashboard should query the DB every time the user loads it.

I imagine this is not the wisest approach, since it would overload and slow down the DB with multiple JOINs, SUMs, COUNTs, etc. I believe it would be best to compile these metrics overnight (or hourly?) and store them somewhere.
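
For instance, the kind of pre-computation I have in mind could be a Postgres materialized view refreshed on a schedule; a minimal sketch, with table and column names made up to match the sample record below:

-- Hypothetical table: companies(sector, active, ...)
-- Pre-aggregated counts per sector, refreshed on a schedule instead of
-- being recomputed on every dashboard load.
CREATE MATERIALIZED VIEW company_counts_by_sector AS
SELECT
  sector,
  COUNT(*) FILTER (WHERE active)     AS active_companies,
  COUNT(*) FILTER (WHERE NOT active) AS inactive_companies
FROM companies
GROUP BY sector;

-- A unique index lets the nightly refresh (e.g. from cron or pg_cron)
-- run without locking out dashboard readers.
CREATE UNIQUE INDEX ON company_counts_by_sector (sector);
REFRESH MATERIALIZED VIEW CONCURRENTLY company_counts_by_sector;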

I was doing some research and came across these "Analytical Databases". Is that what I should use? This similar question seems to have solved the puzzle, especially @samxli's comment on the accepted answer.

I could really use some pointers here. How is analytics commonly handled in production? Thank you so much in advance!

Solution details:

  • A Node.js web scraper with CAPTCHA bypassing collects public data daily
  • Data from multiple sources is scrubbed, aggregated, and saved to a Postgres DB
  • Data contains public information about local businesses - see below
  • A dashboard shows historical data (time series), metrics, and trends

Sample record:

{
  trade_name: "ACME Inc.",
  legal_name: "Watchmen Hero Services Incorporated",
  active: true,
  foundation_date: "2018-11-23",
  sector: "services",
  main_activity: { id: 12318, name: "Law enforcement" },
  secondary_activities: [],
  address: {}, // standard address object
  location: { lat: -23.2319, long: 42.1212 },
  ...
}

Sample metrics:

  • Total number of active and inactive companies over time per sector and activity
  • Estimated tax revenue over time per district and activity
  • Top N most common activities per city district
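
To make that concrete, the last metric could be a plain SQL query along these lines (assuming a hypothetical flattened companies table with district and main_activity_name columns derived from the record above):

-- Top 5 most common main activities per district.
SELECT district, main_activity_name, company_count
FROM (
  SELECT
    district,
    main_activity_name,
    COUNT(*) AS company_count,
    ROW_NUMBER() OVER (PARTITION BY district ORDER BY COUNT(*) DESC) AS rn
  FROM companies
  GROUP BY district, main_activity_name
) ranked
WHERE rn <= 5
ORDER BY district, company_count DESC;
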
Thiago Villa

1 Answer


I can see a few options. I agree with you that, at scale, you want to separate reads from writes so that analytics doesn't impact the performance of the rest of your system.

You might want to look into replication - https://www.brianstorti.com/replication/. You can read from a "read replica" and get a near-realtime view of the data without disrupting write performance on the primary.
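
As a rough sketch, Postgres' built-in logical replication can even be set up in SQL alone (physical streaming replicas, which that article also covers, are configured through postgresql.conf instead; the names below are placeholders):

-- On the primary (requires wal_level = logical in postgresql.conf):
CREATE PUBLICATION analytics_pub FOR TABLE companies;

-- On the replica, subscribe and point dashboard queries at this server:
CREATE SUBSCRIPTION analytics_sub
  CONNECTION 'host=primary.internal dbname=businesses user=replicator'
  PUBLICATION analytics_pub;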

Alternatively, if you're willing to do more work for something that holds up at scale, dig deeper into your findings on analytical databases (OLAP) and look at building out a star schema (https://en.wikipedia.org/wiki/Star_schema). You can put an ETL (Extract, Transform, Load) process in place to pull data from your transactional database into your analytics database, in a format that is much easier to aggregate and work with. I've worked on something similar with hundreds of data sources synced in 30-minute batches into a data warehouse. This might be overkill if you only have a single data source, though.
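
To give you an idea of the shape, a star schema for your data might put daily company snapshots in a fact table surrounded by small dimension tables; all names below are invented for illustration:

-- Dimension tables: small and descriptive, heavily joined against.
CREATE TABLE dim_date (
  date_key  INTEGER PRIMARY KEY,  -- e.g. 20181123
  full_date DATE NOT NULL,
  year      INTEGER NOT NULL,
  month     INTEGER NOT NULL
);

CREATE TABLE dim_activity (
  activity_key SERIAL PRIMARY KEY,
  activity_id  INTEGER NOT NULL,  -- e.g. 12318 from the source data
  name         TEXT NOT NULL,
  sector       TEXT NOT NULL
);

CREATE TABLE dim_district (
  district_key SERIAL PRIMARY KEY,
  city         TEXT NOT NULL,
  district     TEXT NOT NULL
);

-- Fact table: one row per company per snapshot date, holding the
-- measures the dashboard aggregates.
CREATE TABLE fact_company_snapshot (
  date_key        INTEGER REFERENCES dim_date,
  activity_key    INTEGER REFERENCES dim_activity,
  district_key    INTEGER REFERENCES dim_district,
  active          BOOLEAN NOT NULL,
  est_tax_revenue NUMERIC
);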

Lastly, instead of Postgres, if you're primarily dealing with time-series data and metrics, also consider the Elastic Stack (https://hackernoon.com/elastic-stack-a-brief-introduction-794bc7ff7d4f).

If you don't go the Elastic route, also consider BI (business intelligence) tools like Power BI to build your dashboards, rather than reinventing the wheel.

Riaan Nel