kallakata@home:~$

Reality Scraper


Multi-Site Real Estate Scraper

Scrapes apartment listings for sale from multiple Czech real estate sites and saves them to a single CSV file with automatic duplicate detection.

Supported Sites

  • Realingo.cz - Browser automation (Playwright)
  • Sreality.cz - REST API (fastest!)
  • Bezrealitky.cz - Browser automation (Playwright)

All sites can be enabled/disabled individually in config.yaml.

Features

  • Multi-site scraping - Scrape from multiple sites in one run
  • Duplicate detection - Automatically detect cross-site duplicates with confidence scoring
  • Single CSV output - All results in one file with source column
  • Incremental updates - Only new listings are added on subsequent runs
  • TUI interface - Beautiful terminal UI with real-time stats
  • Filtering - Location, disposition, price range
  • Analysis tools - Export unique properties, generate reports

Setup

# Install dependencies
pip install -e .

# Or with dev tools (testing)
pip install -e ".[dev]"

# Install Playwright browsers
python -m playwright install chromium

Quick Start

# Run with default config (CLI mode)
python scraper.py

# Run with TUI (recommended!)
python scraper.py --tui

# Use custom config
python scraper.py my_config.yaml --tui

Configuration

Edit config.yaml to customize search criteria and enabled sites:

# Enable/disable sites
sites:
  realingo:
    enabled: true
  sreality:
    enabled: true
  bezrealitky:
    enabled: true

# Search criteria (applies to all sites)
location: "Praha"              # City or district
location_radius: "2"           # Search radius in km
disposition:                   # Apartment layout
  - "3+kk"
  - "3+1"
price_min: null                # Min price in CZK
price_max: 11000000            # Max price in CZK

# Duplicate detection
duplicate_detection:
  enabled: true
  min_confidence: 0.60         # 0.60-1.00 (higher = stricter)
  area_tolerance_m2: 5         # ±5m² for matching

# Output
output_file: "listings.csv"
headless: true                 # Set false to see browser

Location examples: Praha, Brno, Praha 7, Nusle, Holešovice

Disposition options: 1+kk, 1+1, 2+kk, 2+1, 3+kk, 3+1, 4+kk, 4+1, 5+kk, 5+1

Output

CSV file (listings.csv) with columns:

  • id - Unique listing ID
  • source - Site source (realingo, sreality, bezrealitky)
  • url - Full listing URL
  • price - Price
  • currency - CZK or EUR
  • location - Address/location
  • disposition - Apartment layout (3+kk, etc.)
  • area_m2 - Size in m²
  • scraped_at - When scraped
  • duplicate_group_id - Hash for duplicate detection
  • duplicate_confidence - Confidence score (0.0-1.0)
  • first_seen_as_duplicate - Timestamp when first marked as duplicate

TUI Features

Run with --tui flag for an interactive terminal interface:

  • 📊 Real-time stats - Total, unique, duplicates, new listings
  • 🎨 Color highlighting - Each duplicate group gets a color
  • 🔍 Filtering - Show all, unique only, or duplicates only
  • ⌨️ Keyboard shortcuts:
    • S - Start scraping
    • X - Stop scraping
    • D - Toggle duplicate highlighting
    • U - Show unique only
    • G - Show duplicates only
    • A - Show all
    • R - Refresh table
    • O / Enter - Open selected URL in browser
    • Q - Quit

Analysis Tools

Analyze Duplicates

# Show statistics and top duplicate groups
python scripts/analyze_duplicates.py listings.csv --report

# Export unique properties only (one per duplicate group)
python scripts/analyze_duplicates.py listings.csv --unique-only unique.csv

# Export duplicates only
python scripts/analyze_duplicates.py listings.csv --duplicates-only dups.csv

# Show top 20 duplicate groups
python scripts/analyze_duplicates.py listings.csv --report --top 20

Testing

# Run unit tests
pytest

# Run with verbose output
pytest -v

# Test specific module
pytest tests/test_deduplication.py

Notes

  • The site shows ~40 listings per map viewport
  • Run periodically to capture new listings over time
  • Respects rate limits with configurable delay between requests