Real Estate News Data Pipeline & API Dashboard

Turning scattered publisher feeds and fragile scraping workflows into a controlled data pipeline with source monitoring, scheduler controls, authenticated API access, and production deployment support.

13 publisher sources · Normalized schema · API key access · Scheduler controls · Dockerized deployment

Client

Private real estate data client

Status

Private client build

A news data platform built for repeatable operations

The product collects news from multiple UAE, GCC, and regional publishers whose sites use different CMS platforms, markup structures, APIs, and access constraints. Instead of treating each scraper as a one-off script, the system was built as a repeatable ingestion pipeline.

The platform also exposes a read-only public API protected by API keys, with OpenAPI documentation, quota enforcement, request logging, and health endpoints. This makes the collected data usable by other products without giving them direct database access.

Product context

The work was not just about scraping pages. The system needed to collect articles from different publisher platforms, normalize inconsistent data, avoid duplicate records, track each run, support operator controls, and expose clean API access for downstream consumers.
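As an illustration of the duplicate-avoidance requirement, here is a minimal sketch of a URL-based dedupe key. It is not the production logic: the function names canonicalize_url and dedupe_key, the choice to strip only utm_ tracking parameters, and the use of a SHA-256 digest are all assumptions made for the example.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only utm_* parameters are treated as tracking noise.
TRACKING_PREFIXES = ("utm_",)


def canonicalize_url(url: str) -> str:
    """Normalize a URL so trivially different links compare equal."""
    parts = urlsplit(url.strip())
    # Drop tracking parameters and sort the rest for a stable order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # Remove the trailing slash, but keep "/" for a bare domain.
    path = parts.path.rstrip("/") or "/"
    # The fragment is discarded because it never changes the article.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(sorted(query)), "")
    )


def dedupe_key(url: str) -> str:
    """Hash of the canonical URL, usable as a unique key in storage."""
    return hashlib.sha256(canonicalize_url(url).encode("utf-8")).hexdigest()
```

A key like this could back a unique index in the article store, so a second scrape of the same story updates the existing record instead of inserting a new one.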

The challenge

Each source exposes content differently. Some use WordPress REST APIs, some expose server-rendered JSON state, some rely on schema.org metadata, and others require source-specific DOM parsing. Several sources also need proxy handling because direct requests fail behind access controls.
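To make one of these extraction paths concrete, the sketch below reads schema.org metadata from JSON-LD script tags using only the Python standard library. The class and function names are hypothetical, and it assumes the publisher embeds a NewsArticle or Article object with a plain string @type; real pages vary and would need more defensive handling.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_article_metadata(page_html):
    """Return the first schema.org NewsArticle/Article object, or None."""
    collector = JsonLdCollector()
    collector.feed(page_html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the whole run
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            # Assumption: @type is a plain string, not a list of types.
            if isinstance(item, dict) and item.get("@type") in ("NewsArticle", "Article"):
                return item
    return None
```

When a page carries no usable metadata the function returns None, which is the point where a source-specific DOM parser would take over.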

What we built

We built the platform as an operator-managed data product: source-specific extraction, shared normalization, MongoDB persistence, run auditing, scheduler controls, API access, and deployment documentation working as one system.

Multi-source ingestion pipeline

Source-specific scrapers collect listings and article bodies from 13 publishers. Between them, the scrapers read WordPress REST APIs, Inertia payloads, Next.js state, Reuters Fusion CMS data, schema.org metadata, and server-rendered HTML.
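One way to picture source-specific scrapers feeding a shared, audited run is a small registry, sketched below. The RunRecord fields, the run_sources function, and the idea that each scraper is a zero-argument callable returning a list of articles are assumptions for illustration, not the platform's actual interfaces.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical contract: a scraper takes no arguments and returns raw articles.
Scraper = Callable[[], List[dict]]


@dataclass
class RunRecord:
    source: str
    status: str  # "ok" or "failed"
    article_count: int
    duration_seconds: float
    error: Optional[str] = None


def run_sources(registry: Dict[str, Scraper]) -> List[RunRecord]:
    """Run every registered scraper and record one audit entry per source."""
    records = []
    for name, scraper in registry.items():
        started = time.monotonic()
        try:
            articles = scraper()
            records.append(
                RunRecord(name, "ok", len(articles), time.monotonic() - started)
            )
        except Exception as exc:
            # A failing source is recorded and skipped so the others still run.
            records.append(
                RunRecord(name, "failed", 0, time.monotonic() - started, str(exc))
            )
    return records
```

Keeping the failure inside the loop means one blocked publisher shows up as a failed record in the audit trail rather than aborting the whole run.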

Shared normalization layer

All sources are shaped into a consistent article model with headline, summary, URL, canonical URL, publish dates, authors, categories, tags, images, body text, read time, and source metadata.
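The article model described above could look roughly like the dataclass below, shown with one adapter for a WordPress REST post. The field names, the from_wordpress_post helper, and the 200-words-per-minute read-time estimate are assumptions; the WordPress keys used (title.rendered, excerpt.rendered, content.rendered, link, date_gmt, modified_gmt) are the standard fields of a REST post object.

```python
import html
import math
import re
from dataclasses import dataclass, field
from typing import List, Optional

_TAG_RE = re.compile(r"<[^>]+>")
WORDS_PER_MINUTE = 200  # assumed reading speed for the read-time estimate


def strip_html(fragment: str) -> str:
    """Drop tags, decode entities, and collapse whitespace."""
    text = _TAG_RE.sub(" ", fragment)
    return " ".join(html.unescape(text).split())


@dataclass
class Article:
    headline: str
    url: str
    canonical_url: str
    source: str
    summary: str = ""
    published_at: Optional[str] = None
    modified_at: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    body_text: str = ""
    read_time_minutes: int = 0


def from_wordpress_post(post: dict, source: str) -> Article:
    """Map a WordPress REST post object onto the shared article model."""
    body = strip_html(post.get("content", {}).get("rendered", ""))
    word_count = len(body.split())
    return Article(
        headline=strip_html(post["title"]["rendered"]),
        url=post["link"],
        # Assumption: the post link doubles as the canonical URL.
        canonical_url=post["link"],
        source=source,
        summary=strip_html(post.get("excerpt", {}).get("rendered", "")),
        published_at=post.get("date_gmt"),
        modified_at=post.get("modified_gmt"),
        body_text=body,
        read_time_minutes=math.ceil(word_count / WORDS_PER_MINUTE),
    )
```

Each other source type would get its own adapter that returns the same Article shape, so storage and the API never need to know where a record came from.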

Authenticated API surface

The backend exposes article, source, health, OpenAPI, and Swagger endpoints. Access is controlled through API keys, daily limits, and key revocation, and tracked through usage logs and analytics.
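The combination of API keys, daily limits, and revocation can be sketched as a small in-memory gate. This is an illustration only: the QuotaGate class, its method names, and the reason strings are hypothetical, and a real deployment would keep keys and counters in the database rather than in process memory.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional, Tuple


def hash_key(raw_key: str) -> str:
    """Store only a digest so raw keys never sit in storage."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


@dataclass
class ApiKeyRecord:
    daily_limit: int
    revoked: bool = False


@dataclass
class QuotaGate:
    keys: Dict[str, ApiKeyRecord] = field(default_factory=dict)
    usage: Dict[Tuple[str, date], int] = field(default_factory=dict)

    def register(self, raw_key: str, daily_limit: int) -> None:
        self.keys[hash_key(raw_key)] = ApiKeyRecord(daily_limit)

    def revoke(self, raw_key: str) -> None:
        self.keys[hash_key(raw_key)].revoked = True

    def check(self, raw_key: str, today: Optional[date] = None) -> Tuple[bool, str]:
        """Return (allowed, reason) and count the request when allowed."""
        digest = hash_key(raw_key)
        record = self.keys.get(digest)
        if record is None:
            return False, "unknown_key"
        if record.revoked:
            return False, "revoked"
        bucket = (digest, today or date.today())
        used = self.usage.get(bucket, 0)
        if used >= record.daily_limit:
            return False, "quota_exceeded"
        self.usage[bucket] = used + 1
        return True, "ok"
```

Counting usage per key and per calendar day gives the daily limit a natural reset, and the same counters can feed the usage logs.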

Operator dashboard

Operators can monitor corpus size, source health, scrape runs, and usage estimates, open per-source detail views, and manage scheduler configuration and runtime settings without touching server code.

The result

The result is a maintainable ingestion and delivery system that can collect content from multiple publishers, store it in a normalized database, monitor ongoing source health, expose data through a secure API, and run in production through Docker.

The build turns a brittle manual scraping workflow into an operator-managed platform with visibility, retry behavior, runtime configuration, API access, and deployment documentation.

13 sources

publisher integrations handled through source-specific logic

API keys

authenticated read access with quota and usage tracking

Scheduler

manual and scheduled scrape controls for operators

Docker

deployment support for single-service and two-service setups

Client feedback

“The system gave our operators a clearer way to monitor sources, recover from source issues, and use the collected data through a controlled API instead of relying on fragile scripts.”

Name withheld

Operations Lead, Private Data Platform

Execution logic

Why this mattered

The page stays outcome-led, but the proof is in the product decisions underneath: what we protected, what we simplified, and what became easier for the client to operate.