ResearchMatch

Finding the right professor to work with shouldn't feel like doom-scrolling through half-broken faculty pages and Google Scholar profiles.

ResearchMatch flips that script with an automated pipeline that scrapes, enriches & ranks research profiles, then serves lightning-fast matches through an elegant web UI.

Below is a behind-the-scenes look at the architecture you can run in a single Python script today, plus the cloud-native design we'd move to at production scale.

Why We Built It

Lower the barrier to undergraduate and graduate research.

Benchmark LLMs against traditional NLP baselines on a real task.

Exercise full-stack muscle, from scrapers to observability dashboards, and take a first crack at the engineering problems of running it at scale.

Current MVP Architecture

Traditional Scraping:

Crawl College of Computing faculty pages (we did this for Georgia Tech, but it should work for other universities too) → names, bios, personal sites, Scholar & ORCID links → BeautifulSoup, DuckDuckGo API
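A minimal sketch of that crawl step. The directory URL and CSS selectors below are hypothetical placeholders; real faculty pages need their own selectors:

```python
import requests
from bs4 import BeautifulSoup

FACULTY_URL = "https://www.cc.gatech.edu/people/faculty"  # hypothetical listing page

def scrape_faculty(url: str) -> list[dict]:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    profiles = []
    for card in soup.select(".faculty-card"):  # placeholder selector
        name = card.select_one(".name")
        bio = card.select_one(".bio")
        link = card.select_one("a[href]")
        profiles.append({
            "name": name.get_text(strip=True) if name else None,
            "bio": bio.get_text(strip=True) if bio else None,
            "homepage": link["href"] if link else None,
        })
    return profiles
```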

Four LLM Scrapers:

DeepSeek, GPT-4o, Mistral & Llama each ingest a professor's first Scholar page and produce a list of research topics
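A sketch of one such scraper (GPT-4o shown; the other three differ only in client and endpoint). The prompt wording is illustrative, not the project's exact prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_topics(scholar_page_text: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Extract a comma-separated list of research topics "
                        "from this Google Scholar profile. Reply with topics only."},
            {"role": "user", "content": scholar_page_text},
        ],
    )
    raw = response.choices[0].message.content or ""
    return [t.strip() for t in raw.split(",") if t.strip()]
```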

Matching Engine:

Keyword, TF-IDF & Word2Vec rank professors against the student's query (secondary sort by citations/h-index) → scikit-learn, gensim
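A condensed sketch of the TF-IDF path, assuming each professor dict carries hypothetical `topics` and `citations` fields:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_professors(query: str, professors: list[dict]) -> list[dict]:
    # One "document" per professor: their extracted research topics.
    corpus = [" ".join(p.get("topics", [])) for p in professors]
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(corpus)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    # Primary key: similarity; secondary key: citation count (both descending).
    order = sorted(
        range(len(professors)),
        key=lambda i: (scores[i], professors[i].get("citations", 0)),
        reverse=True,
    )
    return [professors[i] for i in order]
```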

Streamlit Dashboards:

Precision, recall, F1, BLEU, ROUGE plus live latency distributions → Streamlit, Plotly

Redis Cache:

Last 1,000 queries cached as {query_hash → response_blob} to short-circuit recomputation.
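A minimal read-through cache sketch with redis-py. The `rm:` key prefix is an assumption, and capping the cache near the last 1,000 queries can be approximated with Redis's `maxmemory` plus `allkeys-lru` eviction:

```python
import hashlib
import json
import redis

r = redis.Redis()  # localhost:6379 by default

def cached_match(query: str, compute) -> dict:
    key = "rm:" + hashlib.sha256(query.encode()).hexdigest()
    blob = r.get(key)
    if blob is not None:
        return json.loads(blob)   # hit: skip recomputation entirely
    result = compute(query)       # miss: run the matcher once
    r.set(key, json.dumps(result))
    return result
```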

Take-away: in <300 LOC we support five scraping methods, three matchers, evaluation against Perplexity SONAR “ground truth,” and a real-time metrics board—all backed by a single JSON file.

Performance Insights

LLM vs. Scraping: vanilla keyword scraping and Llama nudged out GPT-class models on F1 (≈ 0.27 vs 0.24) because Scholar bios contain well-tokenised keywords.

Caching matters: exact-keyword queries were already µs-level, but TF-IDF & Word2Vec dropped from ~5 ms to sub-ms when the Redis layer hit.

Latency dashboard: box-plots + rolling 90% CIs gave us instant feedback when an LLM endpoint slowed down or cache-hit ratios slipped.
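For flavour, a minimal Streamlit + Plotly board of the same shape, with synthetic timings standing in for the live data:

```python
import numpy as np
import pandas as pd
import plotly.express as px
import streamlit as st

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "matcher": np.repeat(["keyword", "tfidf", "word2vec"], 200),
    "latency_ms": np.concatenate([
        rng.exponential(0.001, 200),  # keyword lookups: microsecond-level
        rng.exponential(5.0, 200),    # TF-IDF before the cache hits
        rng.exponential(5.0, 200),    # Word2Vec before the cache hits
    ]),
})
# Log scale keeps the µs and ms regimes readable on one plot.
st.plotly_chart(px.box(df, x="matcher", y="latency_ms", log_y=True))
```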

The Road to Production-Scale

1 · Ingestion

Cloud-native upgrade: AWS Step Functions coordinate K8s scraper jobs on Fargate, publishing results to Kafka

Why it scales: serverless, auto-scaling workers with built-in retry semantics
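A sketch of the kick-off call with boto3; the state-machine ARN is a placeholder, and the fan-out and retry logic would live in the state-machine definition itself:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

def start_crawl(university: str) -> str:
    execution = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:rm-ingest",
        input=json.dumps({"university": university}),
    )
    return execution["executionArn"]
```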

2 · Enrichment

Cloud-native upgrade: Flink / Kinesis stream processor feeds stateless LLM micro-services that add embeddings & topic labels

Why it scales: parallel shard-out, lower per-message cost, easy horizontal expansion
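The key property is statelessness: each worker is a pure function of the incoming record, so shards can be added freely. A toy sketch, with `embed` and `label_topics` standing in for the LLM micro-service calls and the Flink/Kinesis plumbing elided:

```python
from typing import Callable

def enrich_record(record: dict,
                  embed: Callable[[str], list[float]],
                  label_topics: Callable[[str], list[str]]) -> dict:
    # No shared state: the output depends only on this record.
    text = f'{record.get("name", "")} {record.get("bio", "")}'
    return {**record, "embedding": embed(text), "topics": label_topics(text)}
```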

3 · Storage

Cloud-native upgrade: Iceberg + S3 data lake, Aurora Postgres for metadata, Pinecone vector search

Why it scales: clean separation of hot vs. cold data, infinite object-store capacity, and managed vector indexing
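A sketch of the vector-search tier, assuming the pinecone-client v3+ SDK and a pre-built index named `professors` (both assumptions):

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")    # hypothetical credentials
index = pc.Index("professors")  # hypothetical pre-built index

def nearest_professors(query_embedding: list[float], k: int = 5):
    result = index.query(vector=query_embedding, top_k=k, include_metadata=True)
    return [(match.id, match.score) for match in result.matches]
```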

4 · Core APIs

Cloud-native upgrade: Matching Engine (TF-IDF + ANN) and Personalisation service exposed via gRPC / GraphQL

Why it scales: low-latency, strongly-typed contracts, efficient binary transport

5 · Edge & UI

Cloud-native upgrade: CloudFront with global Redis edge cache, JWT/OIDC auth, Next.js SPA & Flutter mobile apps

Why it scales: < 100 ms P99 reads worldwide, authenticated and cached at the edge

6 · Observability

Cloud-native upgrade: Prometheus metrics, OpenTelemetry traces, CloudWatch logs

Why it scales: uniform telemetry → safe, zero-downtime rollouts and rapid incident triage
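On the metrics leg, a minimal prometheus_client sketch (the metric name is ours; traces and logs ride OpenTelemetry and CloudWatch):

```python
from prometheus_client import Histogram, start_http_server

MATCH_LATENCY = Histogram(
    "researchmatch_match_latency_seconds",
    "End-to-end latency of the matching engine",
)

@MATCH_LATENCY.time()  # records each call's duration into the histogram
def match(query: str) -> list[dict]:
    ...                # call the matching engine here

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```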

Caching in the Future Design

Edge cache (CloudFront + Global Redis) for fully rendered search pages.

Micro-cache inside the Matching Engine for 1 s hot-query bursts.

Vector-search cache in Pinecone’s pod memory—evicts on LRU by embedding norm.

LLM response cache (S3 + Athena) keyed by [model, prompt_hash] to amortise API spend.
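A sketch of that last layer with boto3; the bucket name and key layout are assumptions, and Athena would query the resulting objects for spend analysis:

```python
import hashlib
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "researchmatch-llm-cache"  # hypothetical bucket

def cache_key(model: str, prompt: str) -> str:
    return f"{model}/{hashlib.sha256(prompt.encode()).hexdigest()}.json"

def get_or_call(model: str, prompt: str, call_llm) -> dict:
    key = cache_key(model, prompt)
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        return json.loads(obj["Body"].read())  # cache hit: zero API spend
    except s3.exceptions.NoSuchKey:
        response = call_llm(model, prompt)     # cache miss: pay exactly once
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(response))
        return response
```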

What’s Next

Incremental scraping using change-feeds rather than full re-crawls.

Hybrid ranker that blends BM25, ANN & an LLM re-ranker in a single gRPC hop (sketched after this list).

Personalised digests—a Notification Hub surfaces new papers the moment they land.
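A toy sketch of the blend step with rank_bm25; the weighting and the source of the ANN scores are assumptions, and the LLM re-rank is elided:

```python
from rank_bm25 import BM25Okapi

def hybrid_rank(query: str, docs: list[str], ann_scores: list[float],
                alpha: float = 0.5) -> list[int]:
    bm25 = BM25Okapi([doc.split() for doc in docs])
    lexical = bm25.get_scores(query.split())
    # Min-max normalise BM25 so it blends on the same scale as ANN similarity.
    lo, hi = min(lexical), max(lexical)
    lexical = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in lexical]
    blended = [alpha * l + (1 - alpha) * a for l, a in zip(lexical, ann_scores)]
    # Best-first doc indices; the top-k would then go to the LLM re-ranker.
    return sorted(range(len(docs)), key=lambda i: blended[i], reverse=True)
```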

ResearchMatch started as a 500-profile JSON prototype; the envisioned pipeline can ingest every faculty profile on Earth and serve sub-second, personalised matches.

We built this as a team at Georgia Tech for CS6675: Advanced Internet Systems and Applications.

- Shrey Gupta
