README

The Indie Cini

A data-driven planner for Chicago's independent cinemas.

The Indie Cini consolidates the latest showtimes and critic data for Chicago's major indie venues into a unified, interactive calendar and review dashboard. Under the hood, it runs a modular ETL-style pipeline of scrapers, storage backends, and a tailored React UI.

Key points

  • Unifies showtimes from the Gene Siskel Film Center, the Music Box Theatre, and FACETS
  • Enriches screening data with aggregated critic reviews from Metacritic
  • Exposes an interactive calendar + “ReviewSpew” dashboard

Scroll down for the full technical write-up.

The Indie Cini calendar and review dashboard

Full README

Rendered from readme_public.md

🎬 The Indie Cini

A data-driven planner for Chicago's independent cinemas.

Chicago is home to fantastic independent cinemas and many eager moviegoers. But these cinemas and moviegoers are often separated by a digital gulf, one largely unnoticed despite its expanse.

  • Fandango and other major showtime aggregators tend not to list screenings from these venues.
  • Critical reviews and aggregated scores are strewn across innumerable websites, making movie discovery needlessly difficult.

To address these gaps, The Indie Cini consolidates showtimes and film metadata from Chicago’s leading independent cinemas and enriches them with aggregated critic reviews. It currently covers the programming of the Gene Siskel Film Center, the Music Box Theatre, and FACETS.

Built on an end-to-end pipeline of scrapers, data modeling, and a tailored UI, The Indie Cini delivers an interactive calendar, a critic-score dashboard, flexible filters, and a responsive layout, all designed to simplify the discovery of Chicago's indie film culture.

🧭 Overview

The Indie Cini automates the collection, normalization, and transformation of cinema slate data from Chicago’s top independent venues, alongside film review data from a major aggregator. The result is a unified and consistently updated view of the city’s indie screening landscape.

Core outcomes include:

  • Unified dataset of all screening events across Chicago's major indie venues

  • Aggregated critic reviews powering a sortable, detail-rich dashboard

  • Interactive calendar interface, with filters that refine both the calendar and the review dashboard

  • Analytics-ready data foundation for analyzing score dynamics and theatrical longevity, with initial support for capturing user interaction events

🏗️ Architecture

High-level architecture of The Indie Cini data pipeline

The system follows a modular ETL-style pipeline, built for stability and transparency:

  1. Venue Scraping Layer (Python + Selenium + BeautifulSoup)

    • Separate scrapers for Siskel, Music Box, and FACETS, with seasonal expansion support for the Chicago International Film Festival (CIFF)
    • Custom RotatingDriver manages rate limits, restarts, and JS-heavy pages
    • ScrapeResult bundles parsed DataFrames, source HTML, and run metadata
  2. Review Scraping Layer (Python + Selenium + BeautifulSoup)

    • Metacritic pages identified via automated search with empirically tuned relevance ranking
    • Dynamically loaded content retrieved through simulated user interaction (scrolling, “Load more,” cookie dismissal)
    • Master files (mc_<data_type>_accum) reduce redundant scraping
    • (Planned): Letterboxd score integration
  3. Artifact Observability Layer (Python + pandas)

    • Timestamped inventory reports record the latest file, artifact date, and freshness status for each expected source/type combination
    • Artifact quality reports produce summary metrics and entity-level findings for duplicate film identifiers, missing values, invalid runtimes or review scores, and unmatched Metacritic searches
    • Reports are generated before database loading, creating a checkpoint between scrape artifact generation and downstream transformation
  4. Ingestion Layer (Python + pandas + SQLAlchemy)

    • Standardized loading routines clean column names and enforce dtypes
    • Data is routed to either a local MySQL dev database or a production PostgreSQL database on Render
  5. Storage Layer (PostgreSQL on Render, AWS S3)

    • Unified StorageBackend abstraction for consistent local/S3 file I/O
    • pathing.py defines centralized naming conventions and directory structure
    • ScrapeResult objects serialize directly into versioned S3 archives
    • Raw scrape artifacts persist in S3/local storage, while PostgreSQL houses the application's raw and transformed tables
  6. Transformation Layer (dbt)

    • dbt transforms raw PostgreSQL tables into layered models spanning staging, intermediate, and mart tiers:
      • Staging: Standardizes venue slate and review data
      • Intermediate: Constructs stable join keys, deduplicates Metacritic records, and aggregates film reviews by publication and film
      • Marts: Two feature-specific marts power the frontend: mart_show_calendar and mart_review_dashboard
    • dbt tests enforce core data integrity assumptions, including null, uniqueness, and composite-key checks
  7. Frontend (Next.js + React Big Calendar)

    • Interactive calendar with daily/weekly/agenda views
    • Event blocks sized by runtime and color-coded by venue
    • Tooltips preview event details and link directly to booking pages
    • Critic-score dashboard with sorting, hovercards, and click-to-highlight behavior
    • Filter controls for venue, release type, and runtime
    • Fully responsive layout, plus a desktop resizable split panel between calendar and dashboard
  8. Usage Analytics (Next.js API routes + PostgreSQL)

    • Lightweight frontend instrumentation captures user interaction events, including page views and outbound clicks
    • Events are sent via a server-side API route and persisted to an append-only analytics_events table in PostgreSQL
    • A dbt staging model (stg_analytics_events) standardizes this data for potential downstream analysis
    • This layer establishes a foundation for future behavioral insights, without being a core feature of the current application
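As an illustration of the Storage Layer described above, the following is a minimal sketch of how a result object might be serialized through a shared backend interface. It is not the project's actual code: the method names (write_bytes, read_bytes, archive_key, to_bytes), the save_result helper, the key naming scheme, and the in-memory backend standing in for S3 are all assumptions, and plain dicts stand in for pandas DataFrames so that the example needs only the standard library.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Protocol


class StorageBackend(Protocol):
    """Minimal interface shared by every backend in this sketch (hypothetical)."""

    def write_bytes(self, key: str, data: bytes) -> None: ...

    def read_bytes(self, key: str) -> bytes: ...


class InMemoryBackend:
    """Stand-in for an object store such as S3; keeps blobs in a dict."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def write_bytes(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def read_bytes(self, key: str) -> bytes:
        return self._blobs[key]


@dataclass
class ScrapeResult:
    """Bundles parsed rows, source HTML, and run metadata for one scrape."""

    source: str                  # e.g. "facets"
    rows: list[dict[str, str]]   # plain dicts stand in for a DataFrame
    html: str
    scraped_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def archive_key(self) -> str:
        # Assumed naming scheme: one timestamped object per scrape run.
        stamp = self.scraped_at.strftime("%Y%m%dT%H%M%SZ")
        return f"{self.source}/scrape_{stamp}.json"

    def to_bytes(self) -> bytes:
        payload = {
            "source": self.source,
            "scraped_at": self.scraped_at.isoformat(),
            "rows": self.rows,
            "html": self.html,
        }
        return json.dumps(payload).encode("utf-8")


def save_result(result: ScrapeResult, backend: StorageBackend) -> str:
    """Serialize a result through whichever backend is supplied."""
    key = result.archive_key()
    backend.write_bytes(key, result.to_bytes())
    return key
```

Because save_result depends only on the interface, a local-filesystem or S3-backed class exposing the same two methods could be swapped in without touching the scraper code, which is the kind of consistency a unified abstraction is meant to provide.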

🧩 Key Features

| Category | Feature |
|---|---|
| Data Integration | Unified cinema slate and review datasets spanning Chicago's leading independent cinemas |
| Transformation & Modeling | Layered dbt architecture (staging → intermediate → marts) with surrogate key construction, deduplication pipelines, and feature-specific data marts |
| Automation | Automated daily web scraping and ingestion via a custom rotating Selenium WebDriver, orchestrated via Render cron jobs |
| Data Quality & Observability | Artifact inventory reports track scrape freshness, while quality reports run structural checks and record entity-level findings; ingestion checks, structured logging, and dbt tests further enforce consistency |
| Storage Architecture | Centralized storage abstractions standardize file and object I/O, with PostgreSQL serving as the canonical warehouse for ingested and modeled data |
| Frontend | Interactive screening calendar and review dashboard with dynamic filtering, tooltips, and responsive layout |
| User Analytics | Lightweight tracking of user interaction events (page views, outbound clicks) via a custom API route, with events stored in PostgreSQL and standardized in dbt for potential downstream analysis |
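The freshness tracking mentioned under Data Quality & Observability could look roughly like the sketch below. It is an assumption-laden illustration rather than the project's implementation: the InventoryRow fields, the status labels ("fresh", "stale", "missing"), the one-day threshold, and the function names are all hypothetical, and the standard library is used in place of pandas.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple

# (source, data_type), e.g. ("facets", "showtimes")
SourceType = Tuple[str, str]


@dataclass(frozen=True)
class InventoryRow:
    source: str
    data_type: str
    latest_artifact: Optional[date]
    status: str


def freshness_status(
    latest: Optional[date], today: date, max_age_days: int = 1
) -> str:
    """Classify an artifact date; the labels and threshold are assumptions."""
    if latest is None:
        return "missing"
    if today - latest <= timedelta(days=max_age_days):
        return "fresh"
    return "stale"


def build_inventory(
    expected: List[SourceType],
    artifacts: Dict[SourceType, List[date]],
    today: date,
    max_age_days: int = 1,
) -> List[InventoryRow]:
    """Emit one row per expected source/type pair, even when nothing was found."""
    rows = []
    for source, data_type in expected:
        dates = artifacts.get((source, data_type), [])
        latest = max(dates) if dates else None
        status = freshness_status(latest, today, max_age_days)
        rows.append(InventoryRow(source, data_type, latest, status))
    return rows
```

Iterating over the expected combinations, rather than over whatever files happen to exist, is what lets a report flag a source that silently produced no artifact at all.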

🗂️ Data Sources

| Source | Purpose |
|---|---|
| Gene Siskel Film Center | Primary showtime data and film metadata |
| Music Box Theatre | Primary showtime data and film metadata |
| FACETS | Primary showtime data and film metadata |
| Chicago International Film Festival (CIFF) | Seasonal expansion of the above sources |
| Metacritic | Aggregated critic reviews and associated film metadata |
| Letterboxd (in development) | Audience score enrichment |

🧮 Technologies Used

| Layer | Stack |
|---|---|
| Backend & Ingestion | Python, Selenium, BeautifulSoup, pandas, SQLAlchemy |
| Storage | PostgreSQL (Render), AWS S3 |
| Transformation | dbt |
| Frontend | Next.js / React, React Big Calendar, TailwindCSS |
| Deployment & Orchestration | Render (Web Service + Cron jobs) |
| Logging & Observability | Custom structured logger; artifact inventory and quality reports |
| Version Control | Git / GitHub (private repository) |

🧰 Development Notes

  • Environment: Linux-based development with ChromeDriver for Selenium
  • Storage Structure: data/ and test/ directories separate production runs from sandbox runs
  • Output Format: CSV and PKL scrape outputs are archived under /data/pkl/{venue or review source}/{category}/
  • Observability Outputs: Timestamped inventory and quality reports are archived under /data/pkl/observability/
  • Logging: Timestamped, structured logs (e.g., facets_scrape | +0.01s | https://facets.org/...)
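The log line shape shown in the Logging note (name, elapsed time, message, separated by pipes) can be reproduced with a few lines of Python. This is only a sketch of one way to produce that shape. The RunLogger class, the format_log_line helper, and the choice to measure elapsed time from the start of the run are assumptions, not details of the project's custom logger.

```python
import time
from typing import Callable


def format_log_line(run_name: str, elapsed_seconds: float, message: str) -> str:
    """Render 'name | +0.01s | message' with two-decimal elapsed seconds."""
    return f"{run_name} | +{elapsed_seconds:.2f}s | {message}"


class RunLogger:
    """Tracks time since construction and formats each message (hypothetical)."""

    def __init__(
        self, run_name: str, clock: Callable[[], float] = time.perf_counter
    ) -> None:
        self.run_name = run_name
        self._clock = clock
        self._start = clock()  # assumed reference point: start of the run

    def line(self, message: str) -> str:
        elapsed = self._clock() - self._start
        return format_log_line(self.run_name, elapsed, message)


if __name__ == "__main__":
    logger = RunLogger("facets_scrape")
    print(logger.line("fetching listing page"))
```

Passing the clock in as a parameter keeps the formatter deterministic under test, since a fake clock can replace time.perf_counter.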

📸 Showcase

Live Demo

Visual Highlights

  • 📅 Main Calendar: Weekly and agenda views with color-coded venues
  • 🔍 Filters Panel: Filter by release type, runtime, and venue
  • 🗞️ ReviewSpew Dashboard: Sortable critic-review table with hovercards
  • 🎯 Click-to-Highlight: Selecting a film or director isolates its screenings in the calendar
  • 🖥️ Resizable Split-Panel Layout: Drag to adjust the space between the ReviewSpew dashboard and the calendar.
  • 📱 Responsive Mobile Layout
  • 🧩 Architecture Diagram: ETL pipeline overview → scener.onrender.com/pipeline

🚀 Future Work

  • Integrate Letterboxd and Rotten Tomatoes audience scores
  • Add a scrollable film-poster ribbon beneath the ReviewSpew dashboard for quick visual browsing.
  • Implement predictive models for audience–critic divergence and for the duration of a film's theatrical run.
  • Expand to additional Chicago-area venues: Logan Square, Davis, ... and possibly chain cinemas.

Frontend-specific

  • Add filters for review scores, specified date ranges, and polarizing genres.
  • Surface subtle features (like click-to-highlight) through small UX cues.

👤 Author

Max Ruther
M.S. Computer Science (Data Science concentration) — DePaul University
Solo developer and data engineer behind The Indie Cini.
💌 Portfolio

This repository remains private to protect proprietary logic and sensitive API configurations.
Public assets, screenshots, and diagrams are available on the case-study page.