Multi-source data automation

Real Estate News Data Pipeline & API Dashboard

Turning scattered publisher feeds and fragile scraping workflows into a controlled data pipeline with source monitoring, scheduler controls, authenticated API access, and production deployment support.

13 publisher sources · Normalized schema · API key access · Scheduler controls · Dockerized deployment

Client

Internal product build

Status

Internal product build

Category

Data Automation & API Platform

Timeline

2026

Overview

A news data platform built for repeatable operations

The product collects news from multiple UAE, GCC, and regional publishers whose sites use different CMS platforms, markup structures, APIs, and access constraints. Instead of treating each scraper as a one-off script, the system was built as a repeatable ingestion pipeline.

The platform also exposes a read-only API for external consumers, protected by API keys, with OpenAPI documentation, quota enforcement, request logging, and health endpoints. Other products can use the collected data without being given direct database access.

Product context

The work was not just about scraping pages. The system needed to collect articles from different publisher platforms, normalize inconsistent data, avoid duplicate records, track each run, support operator controls, and expose clean API access for downstream consumers.
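
One way to approach the "avoid duplicate records" requirement is to derive a stable key from the source and a cleaned-up article URL. The sketch below is illustrative only: `canonicalize_url`, `dedupe_key`, and the `TRACKING_PARAMS` list are hypothetical names, and the rules (drop `www.`, trailing slashes, fragments, and common tracking parameters) are assumptions rather than the platform's actual logic. It expects absolute URLs.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters to ignore when comparing URLs.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "fbclid", "gclid",
}


def canonicalize_url(url: str) -> str:
    """Reduce an absolute article URL to a comparable form."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Treat "/story/" and "/story" as the same path.
    path = parts.path.rstrip("/") or "/"
    # Keep only non-tracking parameters, in a stable order.
    kept = sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if k.lower() not in TRACKING_PARAMS
    )
    # The fragment is dropped by passing an empty string.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))


def dedupe_key(source_id: str, url: str) -> str:
    """Hash of source plus canonical URL, usable as a uniqueness key."""
    raw = f"{source_id}|{canonicalize_url(url)}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

A key like this could back a unique index in the article store, so that re-scraping the same article updates the existing record instead of creating a second one.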

Challenge

The challenge

Each source exposes content differently. Some use WordPress REST APIs, some expose server-rendered JSON state, some rely on schema.org metadata, and others require source-specific DOM parsing. Several sources also need proxy handling because direct requests fail behind access controls.
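
As an illustration of the schema.org case, the sketch below pulls article metadata out of JSON-LD `<script>` blocks using only the Python standard library. It is a minimal example, not the platform's extractor: `JsonLdCollector`, `extract_article_metadata`, and the returned field names are hypothetical, and real publisher pages may need more defensive handling.

```python
import json
from html.parser import HTMLParser

ARTICLE_TYPES = {"Article", "NewsArticle", "BlogPosting"}


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_article_metadata(html: str):
    """Return headline, date, and URL from the first article-like JSON-LD item."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        items = []
        for entry in data if isinstance(data, list) else [data]:
            if isinstance(entry, dict):
                items.append(entry)
                # Some sites nest their items under "@graph".
                graph = entry.get("@graph")
                if isinstance(graph, list):
                    items.extend(g for g in graph if isinstance(g, dict))
        for item in items:
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if ARTICLE_TYPES.intersection(t for t in types if isinstance(t, str)):
                return {
                    "headline": item.get("headline"),
                    "date_published": item.get("datePublished"),
                    "url": item.get("url"),
                }
    return None
```

Sources without usable metadata would fall through to `None` here, which is where source-specific DOM parsing would take over.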

What we built

We built the platform as an operator-managed data product: source-specific extraction, shared normalization, MongoDB persistence, run auditing, scheduler controls, API access, and deployment documentation working as one system.

01

Multi-source ingestion pipeline

Source-specific scrapers collect listings and article bodies from 13 publishers. Depending on the source, they read WordPress REST APIs, Inertia payloads, Next.js state, Reuters Fusion CMS data, schema.org metadata, or server-rendered HTML.
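
A common way to keep source-specific logic manageable is a small registry that maps each source to an extraction strategy. The sketch below assumes this pattern for illustration; the source does not say how the scrapers are organized. The source IDs, the `SOURCES` mapping, and the `pageProps.articles` shape are hypothetical. The WordPress fields (`title.rendered`, `link`) follow the standard WordPress REST post object.

```python
import json
from typing import Callable, Dict, List

Extractor = Callable[[str], List[dict]]
_EXTRACTORS: Dict[str, Extractor] = {}


def register(strategy: str):
    """Decorator that files an extractor under a strategy name."""
    def decorator(func: Extractor) -> Extractor:
        _EXTRACTORS[strategy] = func
        return func
    return decorator


@register("wordpress_rest")
def from_wordpress_rest(payload: str) -> List[dict]:
    # WordPress REST posts expose the title under title.rendered.
    posts = json.loads(payload)
    return [{"headline": p["title"]["rendered"], "url": p["link"]} for p in posts]


@register("next_state")
def from_next_state(payload: str) -> List[dict]:
    # Hypothetical layout: a Next.js state object with articles in pageProps.
    state = json.loads(payload)
    articles = state["props"]["pageProps"]["articles"]
    return [{"headline": a["title"], "url": a["href"]} for a in articles]


# Hypothetical source configuration: source ID -> extraction strategy.
SOURCES = {
    "publisher-a": "wordpress_rest",
    "publisher-b": "next_state",
}


def extract(source_id: str, payload: str) -> List[dict]:
    """Dispatch a raw payload to the extractor configured for its source."""
    strategy = SOURCES[source_id]
    return _EXTRACTORS[strategy](payload)
```

With this shape, adding a publisher means registering one extractor and one configuration entry, while the rest of the pipeline only ever calls `extract`.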

02

Shared normalization layer

All sources are shaped into a consistent article model with headline, summary, URL, canonical URL, publish dates, authors, categories, tags, images, body text, read time, and source metadata.
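
The article model described above could be sketched as a dataclass plus one normalizing function. This is an assumption-laden example, not the platform's schema: the raw keys (`title`, `link`, `date`, `author`, `body`, and so on), the 200 words-per-minute read-time rule, and the choice to treat naive timestamps as UTC are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

WORDS_PER_MINUTE = 200  # assumed reading speed for the read-time estimate


@dataclass
class Article:
    source_id: str
    headline: str
    url: str
    canonical_url: str
    summary: str = ""
    published_at: Optional[datetime] = None
    authors: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    body_text: str = ""
    read_time_minutes: int = 0


def _as_list(value) -> List[str]:
    """Accept None, a single string, or a list; drop blank entries."""
    if value is None:
        return []
    if isinstance(value, str):
        value = [value]
    return [str(v).strip() for v in value if str(v).strip()]


def _parse_date(value) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp and convert it to UTC."""
    if not value:
        return None
    parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive = UTC
    return parsed.astimezone(timezone.utc)


def normalize(source_id: str, raw: dict) -> Article:
    """Shape one raw record (hypothetical keys) into the shared model."""
    body = " ".join(str(raw.get("body", "")).split())
    words = len(body.split())
    url = str(raw["link"]).strip()
    return Article(
        source_id=source_id,
        headline=str(raw["title"]).strip(),
        url=url,
        canonical_url=str(raw.get("canonical") or url).strip(),
        summary=str(raw.get("excerpt", "")).strip(),
        published_at=_parse_date(raw.get("date")),
        authors=_as_list(raw.get("author")),
        categories=_as_list(raw.get("categories")),
        tags=_as_list(raw.get("tags")),
        images=_as_list(raw.get("images")),
        body_text=body,
        read_time_minutes=math.ceil(words / WORDS_PER_MINUTE) if words else 0,
    )
```

Keeping every coercion in one place means downstream consumers never need to know whether a given publisher sent one author as a string or several as a list.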

03

Authenticated API surface

The backend exposes article, source, health, OpenAPI, and Swagger endpoints. Access is controlled through API keys with daily limits and revocation support, and usage is recorded in logs and analytics.
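
The combination of API keys, daily limits, and revocation can be illustrated with a small in-memory gate. This is a sketch under stated assumptions, not the backend's implementation: `KeyGate`, `ApiKeyRecord`, and the HTTP-style status codes returned by `check` are hypothetical, and a production system would persist keys and counters rather than hold them in dictionaries.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Dict, Tuple


def hash_key(raw_key: str) -> str:
    """Store only a hash, so a leaked table does not leak usable keys."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


@dataclass
class ApiKeyRecord:
    key_hash: str
    daily_limit: int
    revoked: bool = False


class KeyGate:
    """In-memory key check with a per-day request counter."""

    def __init__(self):
        self._records: Dict[str, ApiKeyRecord] = {}
        self._usage: Dict[Tuple[str, date], int] = {}

    def issue(self, raw_key: str, daily_limit: int) -> None:
        key_hash = hash_key(raw_key)
        self._records[key_hash] = ApiKeyRecord(key_hash, daily_limit)

    def revoke(self, raw_key: str) -> None:
        record = self._records.get(hash_key(raw_key))
        if record is not None:
            record.revoked = True

    def check(self, raw_key: str, today: date) -> Tuple[int, str]:
        """Return an HTTP-style status and reason; count allowed requests."""
        record = self._records.get(hash_key(raw_key))
        if record is None or record.revoked:
            return 401, "invalid or revoked key"
        used = self._usage.get((record.key_hash, today), 0)
        if used >= record.daily_limit:
            return 429, "daily quota exceeded"
        self._usage[(record.key_hash, today)] = used + 1
        return 200, "ok"
```

Because the counter is keyed by date, the quota resets naturally on the next day without a cleanup job, and the same usage map doubles as raw material for usage reporting.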

04

Operator dashboard

Operators can monitor corpus size, source health, scrape runs, scheduler configuration, source detail views, runtime settings, and usage estimates without touching server code.
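
A source-health indicator of the kind shown on such a dashboard could be derived from recent run records. The sketch below is one possible rule, assumed for illustration: `RunRecord`, the five-run window, and the `healthy` / `degraded` / `failing` / `unknown` labels are hypothetical and not taken from the product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    source_id: str
    ok: bool
    articles_saved: int


def source_health(runs: List[RunRecord], window: int = 5) -> str:
    """Classify a source from its most recent runs (ordered oldest to newest)."""
    recent = runs[-window:]
    if not recent:
        return "unknown"  # never scraped, so nothing to judge
    failures = sum(1 for run in recent if not run.ok)
    if failures == 0:
        return "healthy"
    if failures == len(recent):
        return "failing"
    return "degraded"
```

Looking only at a recent window lets a source recover its status after a fix, instead of being marked down forever by old failures.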

Result

The result

The result is a maintainable ingestion and delivery system that can collect content from multiple publishers, store it in a normalized database, monitor ongoing source health, expose data through a secure API, and run in production through Docker.

The build turns a brittle manual scraping workflow into an operator-managed platform with visibility, retry behavior, runtime configuration, API access, and deployment documentation.

13

publisher/source integrations handled through source-specific logic

API keys

authenticated read access with quota and usage tracking

Scheduler

manual and scheduled scrape controls for operators

Docker

deployment support for single-service and two-service setups

Client feedback

The delivery was professional from start to finish. Requirements were understood quickly, milestones were met, and the final system gave us a more dependable way to manage the workflow.

Operator feedback

Private data workflow user

Execution logic

Why this mattered

The page stays outcome-led, but the proof is in the product decisions underneath: what we protected, what we simplified, and what became easier for the client to operate.

Scraping became operational

The work moved from one-off scripts into monitored source runs, source health, scheduler controls, and audit records.

Data became usable downstream

Normalized article records and API key access make the data usable by other products without exposing the database directly.

Source issues became visible

Operators can see failures, retry behavior, source health, recent runs, and runtime settings before problems become silent data gaps.
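
The link between retry behavior and visible failures can be sketched as a runner that retries a scrape with exponential backoff and records every attempt in an audit object. This is illustrative only: `RunAudit`, `run_with_retries`, the three-attempt default, and the doubling delay are assumptions, not the platform's actual settings.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunAudit:
    source_id: str
    attempts: int = 0
    ok: bool = False
    articles: int = 0
    errors: List[str] = field(default_factory=list)


def run_with_retries(
    source_id: str,
    scrape: Callable[[], list],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> RunAudit:
    """Run a scrape callable, retrying on any exception, and audit the outcome."""
    audit = RunAudit(source_id)
    for attempt in range(1, max_attempts + 1):
        audit.attempts = attempt
        try:
            items = scrape()
        except Exception as exc:
            # Keep each failure so the dashboard can show what went wrong.
            audit.errors.append(f"attempt {attempt}: {exc}")
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
            continue
        audit.ok = True
        audit.articles = len(items)
        break
    return audit
```

Because the audit keeps errors even when a later attempt succeeds, a source that only works on its third try still shows up as unstable rather than silently passing. The `sleep` parameter is injectable so the backoff can be tested without waiting.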
