README — The Indie Cini

🎬 The Indie Cini

A data-driven planner for Chicago's independent cinemas.

Chicago is home to fantastic independent cinemas and many eager moviegoers. But these cinemas and moviegoers are often separated by a digital gulf, one largely unnoticed despite its expanse.

Fandango and other major showtime aggregators tend not to list screenings from these venues.
Critical reviews and aggregated scores are strewn across innumerable websites, making movie discovery needlessly difficult.

To address these gaps, The Indie Cini consolidates showtimes and film metadata from Chicago’s leading independent cinemas and enriches them with aggregated critic reviews. It currently covers the programming of the Gene Siskel Film Center, the Music Box Theatre, and FACETS.

Built on an end-to-end pipeline of scrapers, data modeling, and a tailored UI, The Indie Cini delivers an interactive calendar, a critic-score dashboard, flexible filters, and a responsive layout, all designed to simplify the discovery of Chicago's indie film culture.

🧭 Overview

The Indie Cini automates the collection, normalization, and transformation of cinema slate data from Chicago’s top independent venues, alongside film review data from a major aggregator. The result is a unified and consistently updated view of the city’s indie screening landscape.

Core outcomes include:

Unified dataset of all screening events across Chicago's major indie venues
Aggregated critic reviews powering a sortable, detail-rich dashboard
Interactive calendar interface, with filters that refine both the calendar and the review dashboard
Analytics-ready data foundation for analyzing score dynamics and theatrical longevity, with initial support for capturing user interaction events

🏗️ Architecture

High-level architecture of The Indie Cini data pipeline.

The system follows a modular ETL-style pipeline, built for stability and transparency:

Venue Scraping Layer (Python + Selenium + BeautifulSoup)
- Separate scrapers for Siskel, Music Box, and FACETS, with seasonal support for the Chicago International Film Festival (CIFF)
- Custom RotatingDriver manages rate limits, restarts, and JS-heavy pages
- ScrapeResult bundles parsed DataFrames, source HTML, and run metadata
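
As a rough illustration of the bundling idea, the sketch below shows one way a result object could carry parsed tables, source HTML, and run metadata together. The field names, the use of plain lists in place of DataFrames, and the example.org URL are all assumptions made for illustration, not the project's actual ScrapeResult.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ScrapeResult:
    """Hypothetical bundle for one scraper run (field names are assumptions)."""

    source: str      # e.g. "siskel", "musicbox", "facets"
    tables: dict     # table name -> parsed rows (DataFrames in the real pipeline)
    raw_html: dict   # URL -> page source, kept for re-parsing and debugging
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        # Stamp the run so archived artifacts can be ordered and versioned.
        self.metadata.setdefault(
            "scraped_at", datetime.now(timezone.utc).isoformat()
        )
        # Record row counts up front so later reports need not reopen the tables.
        self.metadata.setdefault(
            "row_counts", {name: len(rows) for name, rows in self.tables.items()}
        )


# Placeholder data, not real scrape output.
result = ScrapeResult(
    source="facets",
    tables={"shows": [{"title": "Example Film", "start": "2025-01-01T19:00"}]},
    raw_html={"https://example.org/calendar": "<html>...</html>"},
)
print(result.metadata["row_counts"])
```

Keeping the raw HTML next to the parsed rows means a parsing bug can be fixed and replayed without hitting the venue site again.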
Review Scraping Layer (Python + Selenium + BeautifulSoup)
- Metacritic pages identified via automated search with empirically tuned relevance ranking
- Dynamically loaded content retrieved through simulated user interaction (scrolling, “Load more,” cookie dismissal)
- Master files (mc_<data_type>_accum) reduce redundant scraping
- (Planned): Letterboxd score integration
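
The role of the master files can be sketched as a simple "skip what is already accumulated" filter. The README does not describe the schema of the mc_<data_type>_accum files, so the film-ID lists and both function names below are hypothetical.

```python
def films_needing_scrape(slate_film_ids, accumulated_ids):
    """Return slate IDs the master file has not seen yet, in slate order."""
    seen = set(accumulated_ids)
    pending = []
    for film_id in slate_film_ids:
        if film_id not in seen:
            pending.append(film_id)
            seen.add(film_id)  # also drops repeats within the slate itself
    return pending


def update_master(accumulated_ids, newly_scraped_ids):
    """Append newly scraped IDs to the master list without duplicating entries."""
    return list(accumulated_ids) + films_needing_scrape(
        newly_scraped_ids, accumulated_ids
    )


# Placeholder IDs for illustration only.
master = ["film-b"]
slate = ["film-a", "film-b", "film-a", "film-c"]

pending = films_needing_scrape(slate, master)
master = update_master(master, pending)
print(pending, master)
```

Only the films in `pending` would trigger a review-page visit, which is where the saving in scrape volume comes from.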
Artifact Observability Layer (Python + pandas)
- Timestamped inventory reports record the latest file, artifact date, and freshness status for each expected source/type combination
- Artifact quality reports produce summary metrics and entity-level findings for duplicate film identifiers, missing values, invalid runtimes or review scores, and unmatched Metacritic searches
- Reports are generated before database loading, creating a checkpoint between scrape artifact generation and downstream transformation
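
A minimal sketch of the quality-report idea, written in plain Python rather than pandas to stay dependency-free. The record keys, the finding labels, and the 600-minute runtime ceiling are assumptions; the 0 to 100 score range follows Metacritic's scale.

```python
from collections import Counter


def quality_findings(films):
    """Return (film_id, finding) pairs for a list of film records (dicts)."""
    findings = []
    id_counts = Counter(f.get("film_id") for f in films)
    for film in films:
        film_id = film.get("film_id")
        if film_id is None or not film.get("title"):
            findings.append((film_id, "missing_value"))
        if film_id is not None and id_counts[film_id] > 1:
            findings.append((film_id, "duplicate_film_id"))
        runtime = film.get("runtime_min")
        if runtime is not None and not (0 < runtime <= 600):  # assumed ceiling
            findings.append((film_id, "invalid_runtime"))
        score = film.get("review_score")
        if score is not None and not (0 <= score <= 100):
            findings.append((film_id, "invalid_review_score"))
    return findings


def summarize(findings):
    """Collapse entity-level findings into summary counts per finding type."""
    return dict(Counter(kind for _, kind in findings))


# Placeholder records for illustration only.
films = [
    {"film_id": "f1", "title": "A", "runtime_min": 95, "review_score": 81},
    {"film_id": "f1", "title": "A", "runtime_min": 95, "review_score": 81},
    {"film_id": "f2", "title": "", "runtime_min": -5, "review_score": 120},
]
findings = quality_findings(films)
print(summarize(findings))
```

Returning both the entity-level pairs and the summary mirrors the two report granularities described above.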
Ingestion Layer (Python + pandas + SQLAlchemy)
- Standardized loading routines clean column names and enforce dtypes
- Data is routed to either a local MySQL dev database or a production PostgreSQL database on Render
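
The column cleaning and dtype enforcement could look roughly like the following. This is a standard-library sketch operating on dict rows instead of DataFrames; the function names, the snake_case convention, and the sample schema are assumptions.

```python
import re


def clean_column_name(name):
    """Normalize a raw header to snake_case: trim, collapse symbols, lowercase."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_")
    return cleaned.lower()


def coerce_row(row, schema):
    """Rename keys and cast values according to a {column: type} schema.

    Empty strings become None so they load as NULL rather than as text.
    Columns missing from the schema are kept as strings.
    """
    out = {}
    for raw_key, value in row.items():
        key = clean_column_name(raw_key)
        caster = schema.get(key, str)
        out[key] = None if value in ("", None) else caster(value)
    return out


# Placeholder row for illustration only.
schema = {"runtime_min": int, "year": int}
raw = {"Film Title": "Example", " Runtime (min) ": "95", "Year": ""}
print(coerce_row(raw, schema))
```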
Storage Layer (PostgreSQL on Render, AWS S3)
- Unified StorageBackend abstraction for consistent local/S3 file I/O
- pathing.py defines centralized naming conventions and directory structure
- ScrapeResult objects serialize directly into versioned S3 archives
- Raw scrape artifacts persist in S3/local storage, while PostgreSQL houses the application's raw and transformed tables
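
To make the storage abstraction concrete, here is a small sketch of one interface with two interchangeable backends. The method names are assumptions, the in-memory class merely stands in for an S3-backed implementation, and archive_key is a hypothetical helper in the spirit of pathing.py.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class StorageBackend(ABC):
    """Common interface so callers do not care where bytes end up."""

    @abstractmethod
    def write_bytes(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def read_bytes(self, key: str) -> bytes: ...


class LocalBackend(StorageBackend):
    def __init__(self, root):
        self.root = Path(root)

    def write_bytes(self, key, data):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read_bytes(self, key):
        return (self.root / key).read_bytes()


class InMemoryBackend(StorageBackend):
    """Stand-in for an object store such as S3: keys map to byte blobs."""

    def __init__(self):
        self.objects = {}

    def write_bytes(self, key, data):
        self.objects[key] = data

    def read_bytes(self, key):
        return self.objects[key]


def archive_key(source, category, filename):
    # Relative key (no leading slash) so it can sit under any backend root.
    return f"data/pkl/{source}/{category}/{filename}"


key = archive_key("facets", "shows", "2025-01-01.csv")
for backend in (LocalBackend(tempfile.mkdtemp()), InMemoryBackend()):
    backend.write_bytes(key, b"title,start\n")
    print(type(backend).__name__, backend.read_bytes(key))
```

Because both backends accept the same keys, a scraper can be pointed at local disk during development and at object storage in production without changing its write calls.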
Transformation Layer (dbt)
- dbt transforms raw PostgreSQL tables into layered models spanning staging, intermediate, and mart tiers:
  - Staging: Standardizes venue slate and review data
  - Intermediate: Constructs stable join keys, deduplicates Metacritic records, and aggregates film reviews by publication and film
  - Marts: Two feature-specific marts power the frontend: mart_show_calendar and mart_review_dashboard
- dbt tests enforce core data integrity assumptions, including null, uniqueness, and composite-key checks
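
The join-key and composite-key ideas are implemented in dbt (SQL models and declared tests) in the project itself. The Python below only illustrates the underlying logic under assumed inputs: the hashing scheme, the normalization rules, and the column names are not taken from the actual models.

```python
import hashlib


def surrogate_key(*parts):
    """Hash normalized parts into one stable key; None becomes an empty string."""
    normalized = ["" if p is None else str(p).strip().lower() for p in parts]
    return hashlib.md5("|".join(normalized).encode("utf-8")).hexdigest()


def composite_key_violations(rows, key_columns):
    """Return key tuples with a null component or that repeat an earlier row."""
    seen, bad = set(), []
    for row in rows:
        key = tuple(row.get(col) for col in key_columns)
        if None in key or key in seen:
            bad.append(key)
        seen.add(key)
    return bad


# Placeholder rows for illustration only.
shows = [
    {"venue": "siskel", "film_key": "a", "start": "t1"},
    {"venue": "siskel", "film_key": "a", "start": "t1"},  # repeated key
    {"venue": None, "film_key": "b", "start": "t2"},      # null component
    {"venue": "facets", "film_key": "b", "start": "t2"},
]
violations = composite_key_violations(shows, ["venue", "film_key", "start"])
print(violations)
```

Normalizing before hashing is what makes the key stable: differences in casing or stray whitespace between a venue page and a review page no longer produce different keys.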
Frontend (Next.js + React Big Calendar)
- Interactive calendar with daily/weekly/agenda views
- Event blocks sized by runtime and color-coded by venue
- Tooltips preview event details and link directly to booking pages
- Critic-score dashboard with sorting, hovercards, and click-to-highlight behavior
- Filter controls for venue, release type, and runtime
- Fully responsive layout, plus a desktop resizable split panel between calendar and dashboard
Usage Analytics (Next.js API routes + PostgreSQL)
- Lightweight frontend instrumentation captures user interaction events, including page views and outbound clicks
- Events are sent via a server-side API route and persisted to an append-only analytics_events table in PostgreSQL
- A dbt staging model (stg_analytics_events) standardizes this data for potential downstream analysis
- This layer establishes a foundation for future behavioral insights, without being a core feature of the current application
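
The append-only pattern can be sketched as follows. The real route is a Next.js API handler writing to PostgreSQL; this version uses Python with an in-memory SQLite database as a stand-in, and the column names and allowed event types are assumptions.

```python
import json
import sqlite3

ALLOWED_EVENTS = {"page_view", "outbound_click"}  # assumed event names


def record_event(conn, event_type, payload):
    """Insert one event; rows are only ever appended, never updated or deleted."""
    if event_type not in ALLOWED_EVENTS:
        raise ValueError(f"unknown event type: {event_type}")
    conn.execute(
        "INSERT INTO analytics_events (event_type, payload) VALUES (?, ?)",
        (event_type, json.dumps(payload)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analytics_events ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " event_type TEXT NOT NULL,"
    " payload TEXT NOT NULL,"
    " created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)

# Placeholder events for illustration only.
record_event(conn, "page_view", {"path": "/"})
record_event(conn, "outbound_click", {"venue": "facets"})
print(conn.execute("SELECT COUNT(*) FROM analytics_events").fetchone()[0])
```

Storing the payload as serialized JSON keeps the table schema fixed while the set of tracked properties evolves; a staging model can then unpack whichever fields it needs.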

🧩 Key Features

Category	Feature
Data Integration	Unified cinema slate and review datasets spanning Chicago's leading independent cinemas
Transformation & Modeling	Layered dbt architecture (staging → intermediate → marts) with surrogate key construction, deduplication pipelines, and feature-specific data marts
Automation	Daily scraping and ingestion run as Render cron jobs, using a custom rotating Selenium WebDriver
Data Quality & Observability	Artifact inventory reports track scrape freshness, while quality reports run structural checks and record entity-level findings; ingestion checks, structured logging, and dbt tests further enforce consistency
Storage Architecture	Centralized storage abstractions standardize file and object I/O, with PostgreSQL serving as the canonical warehouse for ingested and modeled data
Frontend	Interactive screening calendar and review dashboard with dynamic filtering, tooltips, and responsive layout
User Analytics	Lightweight tracking of user interaction events (page views, outbound clicks) via a custom API route, with events stored in PostgreSQL and standardized in dbt for potential downstream analysis

🗂️ Data Sources

Source	Purpose
Gene Siskel Film Center	Primary showtime data and film metadata
Music Box Theatre	Primary showtime data and film metadata
FACETS	Primary showtime data and film metadata
Chicago International Film Festival (CIFF)	Seasonal festival programming that supplements the venue sources above
Metacritic	Aggregated critic reviews and associated film metadata
*Letterboxd (in development)*	Audience score enrichment

🧮 Technologies Used

Layer	Stack
Backend & Ingestion	Python, Selenium, BeautifulSoup, pandas, SQLAlchemy
Storage	PostgreSQL (Render), AWS S3
Transformation	dbt
Frontend	Next.js / React, React Big Calendar, TailwindCSS
Deployment & Orchestration	Render (Web Service + Cron jobs)
Logging & Observability	Custom structured logger; artifact inventory and quality reports
Version Control	Git / GitHub (private repository)

🧰 Development Notes

Environment: Linux-based development with ChromeDriver for Selenium
Storage Structure: data/ holds production runs, while test/ holds sandbox runs
Output Format: CSV and PKL scrape outputs archived under /data/pkl/{venue or review source}/{category}/
Observability Outputs: Timestamped inventory and quality reports are archived under /data/pkl/observability/
Logging: Timestamped, structured logs (e.g., facets_scrape | +0.01s | https://facets.org/...)
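
A log line in the format shown above (name, elapsed time, message) could be produced by something like the sketch below. The class name and its injectable clock are assumptions made for illustration; the project's custom logger is not shown in this README.

```python
import time


class StepLogger:
    """Hypothetical logger that prefixes each message with elapsed run time."""

    def __init__(self, name, clock=time.monotonic):
        self.name = name
        self.clock = clock      # injectable so elapsed time can be tested
        self.start = clock()    # reference point for all later messages

    def format(self, message):
        elapsed = self.clock() - self.start
        return f"{self.name} | +{elapsed:.2f}s | {message}"

    def log(self, message):
        print(self.format(message))


logger = StepLogger("facets_scrape")
logger.log("starting run")
```

Using a monotonic clock for the elapsed field avoids negative or jumping offsets if the system clock is adjusted mid-run.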

📸 Showcase

Live Demo

Shortlink: bit.ly/indiecini
Direct: scener.onrender.com

Visual Highlights

📅 Main Calendar: Weekly and agenda views with color-coded venues
🔍 Filters Panel: Filter by release type, runtime, and venue
🗞️ ReviewSpew Dashboard: Sortable critic-review table with hovercards
🎯 Click-to-Highlight: Selecting a film or director isolates its screenings in the calendar
🖥️ Resizable Split-Panel Layout: Drag to adjust the space between the ReviewSpew dashboard and the calendar
📱 Responsive Mobile Layout
🧩 Architecture Diagram: ETL pipeline overview → scener.onrender.com/pipeline

🚀 Future Work

Integrate Letterboxd and Rotten Tomatoes audience scores
Add a scrollable film-poster ribbon beneath the ReviewSpew dashboard for quick visual browsing
Implement predictive models for audience–critic divergence and for the duration of a film's theatrical run
Expand to additional Chicago-area venues (Logan Square, Davis, ...) and possibly to chain cinemas

Frontend-specific

Add filters for review scores, specified date ranges, and polarizing genres
Surface subtle features (like click-to-highlight) through small UX cues

👤 Author

Max Ruther
M.S. Computer Science (Data Science concentration) — DePaul University
Solo developer and data engineer behind The Indie Cini.
💌 Portfolio

This repository remains private to protect proprietary logic and sensitive API configurations.
Public assets, screenshots, and diagrams are available on the case-study page.
