Data Collection Methodology

This page documents the technical approach behind our expired domain research database, including data sources, collection frequency, validation processes, and metric definitions.

1. Domain Discovery Pipeline

Our primary discovery mechanism monitors domain registration events across all major registrars via the ICANN Centralized Zone Data Service (CZDS). We process approximately 300,000 zone file changes daily across gTLD and ccTLD zones. Expired domains are identified when they enter the Redemption Grace Period (RGP, EPP status redemptionPeriod) or are flagged as Pending Delete (EPP status pendingDelete).

CZDS Zone Files → Parser → RGP/PendingDelete Filter → Metric Enrichment → Archive DB
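
The first stages of this pipeline can be sketched roughly as follows. This is a minimal illustration, not the production code: the function names are hypothetical, the parser handles only simple one-record-per-line zone entries, and it assumes that a name disappearing from the zone between two daily snapshots is treated as a candidate whose status is then checked. The status strings are assumed to arrive from a separate lookup that is not shown.

```python
from typing import Iterable, Set

# EPP status codes matching the two expiry states named above.
EXPIRY_STATUSES = {"redemptionperiod", "pendingdelete"}


def parse_zone_domains(zone_lines: Iterable[str], tld: str) -> Set[str]:
    """Collect second-level domains that have NS records in a zone file."""
    suffix = "." + tld.lower().strip(".")
    domains = set()
    for line in zone_lines:
        line = line.strip()
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        fields = line.split()
        # Records look like: name [ttl] [class] type rdata
        if "NS" not in (f.upper() for f in fields[1:4]):
            continue
        name = fields[0].lower().rstrip(".")
        # Keep only names directly under the TLD (no deeper host names).
        if name.endswith(suffix) and name.count(".") == suffix.count("."):
            domains.add(name)
    return domains


def dropped_domains(previous: Set[str], current: Set[str]) -> Set[str]:
    """Names present in yesterday's zone but missing from today's."""
    return previous - current


def is_expiring(statuses: Iterable[str]) -> bool:
    """True if any status is redemptionPeriod or pendingDelete."""
    normalized = {s.replace(" ", "").lower() for s in statuses}
    return bool(normalized & EXPIRY_STATUSES)
```

Candidates returned by dropped_domains would be passed through is_expiring before enrichment, so that names removed from the zone for other reasons are not archived as expired.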

2. Metric Sources

Each domain record is enriched with metrics from multiple providers to ensure data integrity and cross-validation:

  • Domain Authority (DA) — Proprietary score (0–100) based on link profile strength, calculated using a machine learning model trained on SERP correlation data.
  • Domain Rating (DR) — Independent authority score measuring the strength of a domain's backlink profile relative to all indexed domains.
  • Spam Score — Probability score (0–100) indicating the likelihood that a domain has been used for spam, based on 27 spam flag signals.
  • Backlink Count — Total number of external links pointing to the domain from unique pages, excluding internal links and links carrying the nofollow attribute.
  • Referring Domains — Count of unique root domains with at least one dofollow link to the target domain.
  • Domain Age — Years since initial registration, sourced from WHOIS historical records.
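
For illustration, an enriched record holding the metrics above might be shaped like this. The class and field names are hypothetical, and counting Domain Age in completed whole years is an assumption; the actual schema may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DomainRecord:
    """Hypothetical shape of one enriched record."""
    domain: str
    domain_authority: int    # DA, 0-100
    domain_rating: int       # DR, 0-100
    spam_score: int          # 0-100
    backlink_count: int      # external links, nofollow excluded
    referring_domains: int   # unique root domains with a dofollow link
    first_registered: date   # from WHOIS historical records

    def age_years(self, today: date) -> int:
        """Completed years since initial registration (assumed rounding)."""
        years = today.year - self.first_registered.year
        anniversary = (self.first_registered.month, self.first_registered.day)
        if (today.month, today.day) < anniversary:
            years -= 1  # this year's anniversary has not been reached yet
        return years
```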

3. Historical Trend Data

We maintain annual snapshots of DA, DR, and backlink counts for each domain, starting from 2021 or the domain's first appearance in our system. This enables researchers to track authority decay curves, which show how quickly a domain loses authority after content removal or expiration.
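
One simple way to read a decay curve from such snapshots is the fractional change between consecutive years. The sketch below assumes snapshots are available as a year-to-value mapping; the function name and the sample values in the usage note are made up for illustration.

```python
from typing import Dict, List, Tuple


def yearly_decay(snapshots: Dict[int, float]) -> List[Tuple[int, float]]:
    """Return (year, fractional change from the previous snapshot) pairs."""
    years = sorted(snapshots)
    changes = []
    for prev, cur in zip(years, years[1:]):
        if snapshots[prev] == 0:
            continue  # change relative to zero is undefined
        delta = (snapshots[cur] - snapshots[prev]) / snapshots[prev]
        changes.append((cur, delta))
    return changes
```

For a domain whose DA snapshots were 50, 40, and 30 in three consecutive years, this yields a 20% drop followed by a 25% drop.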

4. PBN Footprint Detection

Our PBN detection algorithm analyzes clusters of domains for shared characteristics that indicate coordinated private blog network operation:

  • Shared hosting IP address ranges (/24 blocks, historically called Class C)
  • Identical or near-identical WHOIS registrant patterns
  • Reciprocal linking patterns within the cluster
  • Temporal correlation in registration/expiration dates
  • Content similarity fingerprints via SimHash
  • Shared Google Analytics/AdSense identifiers
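
Two of these signals are easy to sketch: SimHash content fingerprints and shared /24 hosting blocks. The code below is a minimal illustration under stated assumptions, not the detection algorithm itself. It uses MD5 as the per-token hash, weights every token equally, and assumes IPv4 addresses; the function names are hypothetical.

```python
import hashlib
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """SimHash fingerprint; bits must be a multiple of 8, at most 128."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small values suggest similar content."""
    return bin(a ^ b).count("1")


def group_by_slash24(hosts: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each /24 network to the domains hosted in it (2 or more only)."""
    groups = defaultdict(list)
    for domain, ip in hosts.items():
        network = ipaddress.ip_network(f"{ip}/24", strict=False)
        groups[str(network)].append(domain)
    return {net: sorted(ds) for net, ds in groups.items() if len(ds) > 1}
```

Pages with near-identical text produce fingerprints that differ in only a few bits, so a low Hamming distance between two domains' fingerprints can be combined with a shared /24 block as evidence for the same cluster. Any distance threshold would need to be tuned against audited data.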

5. Update Cadence

Zone files are processed daily. Metric enrichment runs weekly on Sundays. Historical snapshots are captured on the first of each month. PBN cluster analysis runs bi-weekly. The full pipeline is orchestrated via a distributed task queue on dedicated infrastructure.
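
The cadence above can be expressed as a small scheduling rule. This is a sketch only: the job names are hypothetical, and "bi-weekly" is assumed here to mean every 14 days counted from an arbitrary anchor Sunday, which may not match the real schedule.

```python
from datetime import date
from typing import List

# Hypothetical anchor date (a Sunday) for the assumed 14-day PBN cycle.
PBN_ANCHOR = date(2024, 1, 7)


def jobs_due(day: date) -> List[str]:
    """List the pipeline jobs that would run on the given day."""
    jobs = ["zone_files"]                     # daily
    if day.weekday() == 6:                    # Sunday (Monday is 0)
        jobs.append("metric_enrichment")
    if day.day == 1:                          # first of the month
        jobs.append("historical_snapshot")
    if (day - PBN_ANCHOR).days % 14 == 0:     # assumed every 14 days
        jobs.append("pbn_cluster_analysis")
    return jobs
```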

6. Data Validation

Automated validation catches anomalies before records enter the production database:

  • DA/DR values outside 0–100 range are rejected
  • Backlink counts that differ by more than 200% from the prior snapshot are flagged for manual review
  • Duplicate domain entries are deduplicated by canonical form
  • WHOIS age calculations are cross-referenced against archive.org first-seen dates
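
The range, change, and deduplication checks above might look like the following. This is a hedged sketch: the function names and message strings are hypothetical, the 200% threshold is read as a relative change against the prior snapshot, and "canonical form" is assumed to mean a lowercased name without a trailing dot. The archive.org cross-reference is not shown.

```python
from typing import List, Optional


def validate_record(
    da: int, dr: int, backlinks: int, prior_backlinks: Optional[int] = None
) -> List[str]:
    """Return a list of issues; an empty list means the record passes."""
    issues = []
    for name, value in (("DA", da), ("DR", dr)):
        if not 0 <= value <= 100:
            issues.append(f"reject: {name} out of range")
    if prior_backlinks:  # skip when there is no prior snapshot (or it is 0)
        change = abs(backlinks - prior_backlinks) / prior_backlinks
        if change > 2.0:  # more than 200% relative to the prior snapshot
            issues.append("review: backlink count changed by more than 200%")
    return issues


def canonical_domain(name: str) -> str:
    """Assumed canonical form: trimmed, lowercased, no trailing dot."""
    return name.strip().lower().rstrip(".")
```

Deduplication can then key records on canonical_domain(name), so that "Example.COM." and "example.com" collapse into one entry.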

7. Limitations

Our data has known limitations. Metric scores from third-party providers may lag actual authority changes by 2–4 weeks. Backlink counts reflect only indexed links and may undercount total link volume. PBN detection produces false positives at a rate of approximately 3%, based on manual audits. Historical data prior to 2021 is sparse for domains that were not in our initial seed list.