This is the fifth and final post in a series about Chive, a decentralized eprint service on AT Protocol. The first post covers the architecture, the second covers the knowledge graph, the third covers collections, and the fourth covers open review.

This post describes Chive v0.3.0. Details may change as the project develops. You can follow the project on Bluesky.

The previous posts focus on features you interact with directly: submitting papers, organizing collections, writing reviews. This one covers what happens behind the scenes: how Chive extracts citations from your papers, enriches metadata from external sources, finds related work, and surfaces papers you might care about.

Citation extraction

When you upload a PDF, Chive extracts the references using GROBID, an open-source machine learning library for parsing scientific documents. GROBID identifies the bibliographic entries in your paper and produces structured citation data: titles, authors, venues, years, DOIs. Chive calls GROBID's /api/processReferences endpoint with DOI consolidation enabled, so GROBID tries to resolve each reference to a DOI during extraction.

Chive then tries to match each extracted citation against its index. The matching has two strategies. An exact DOI match produces a confidence score of 1.0. If there’s no DOI, Chive normalizes the title (lowercased, punctuation removed, whitespace collapsed) and compares it against indexed titles for a confidence score of 0.8. When both the citing and cited paper are on Chive, the citation becomes an edge in the knowledge graph, stored as a CITES relationship in Neo4j.
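The two matching strategies can be sketched as follows; the function names and the flat index shape are hypothetical stand-ins for Chive's real matcher:

```python
import re
import string

def normalize_title(title):
    """Lowercase, strip punctuation, collapse whitespace."""
    lowered = title.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()

def match_citation(citation, index):
    """Match an extracted citation against an index.

    `index` maps lowercased DOIs and normalized titles to eprint IDs
    (a stand-in for Chive's real search index). Returns an
    (eprint_id, confidence) pair, or None when nothing matches.
    """
    doi = citation.get("doi")
    if doi and doi.lower() in index:
        return index[doi.lower()], 1.0   # exact DOI match
    title = citation.get("title")
    if title:
        key = normalize_title(title)
        if key in index:
            return index[key], 0.8       # normalized-title match
    return None
```

DOI matching takes priority because DOIs are globally unique, whereas normalized titles can collide across papers, hence the lower confidence score.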

These citation edges combine with the rich text references that authors embed in their abstracts and the entity links that reviewers create to build a multi-layered citation network across the platform: formal citations come from reference lists; semantic connections come from inline references; and community-contributed links come from reviews and annotations.

For non-PDF formats (LaTeX, DOCX, HTML, Markdown, Jupyter, ODT, RTF, EPUB, and plain text), Chive uses format-specific text extraction. The extractor locates the reference section by searching for common headings (“References,” “Bibliography,” “Works Cited,” and variants), extracts individual citation strings, and sends them to GROBID’s /api/processCitation endpoint in batches. The structured output then goes through the same matching pipeline. For LaTeX, Chive can also parse BibTeX entries directly.
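The heading search might look like this minimal sketch; the heading list and function name are illustrative, not Chive's actual implementation:

```python
import re

# Common reference-section headings (Chive's real list may differ).
_REF_HEADINGS = re.compile(
    r"^\s*(references|bibliography|works cited|literature cited)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def find_reference_section(text):
    """Return the text after the last reference-style heading, if any.

    Taking the last match guards against a table of contents or an
    inline mention of the word "References" earlier in the document.
    """
    matches = list(_REF_HEADINGS.finditer(text))
    if not matches:
        return None
    return text[matches[-1].end():].strip()
```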

The GROBID client uses a circuit breaker (five failures to open, thirty-second half-open timeout, two retry attempts with exponential backoff) so a GROBID outage doesn’t block the rest of the indexing pipeline.
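A minimal version of that breaker-plus-retry policy, using the parameters from this post (five failures to open, thirty-second half-open timeout, two retries with exponential backoff); the class and function names are hypothetical:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a trial
    call through after `reset_timeout` seconds (half-open state)."""

    def __init__(self, threshold=5, reset_timeout=30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the timeout has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

def call_with_retries(fn, breaker, retries=2, base_delay=1.0):
    """Call `fn` with up to `retries` retries and exponential backoff,
    skipping the call entirely while the breaker is open."""
    if not breaker.allow():
        raise RuntimeError("circuit open: skipping GROBID call")
    for attempt in range(retries + 1):
        try:
            result = fn()
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
            if attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

The point of the breaker is that once GROBID is known to be down, indexing jobs fail fast instead of each waiting out the full retry schedule.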

You can also add citations manually. If the automated extraction misses a reference, or if you want to cite a paper that doesn’t appear in your reference list, you can add it yourself from the eprint page. The UI distinguishes user-provided citations from auto-extracted ones with source badges (User, GROBID, Semantic Scholar, Crossref), so readers can see where each citation came from. User-provided citations can include a contextual note explaining the relationship, displayed in a collapsible section below the citation entry.

Enrichment from external sources

Beyond citations, Chive enriches eprint metadata from three external services.

Semantic Scholar

Semantic Scholar provides paper metadata (title, DOI, year, venue, fields of study), citation and reference counts with influential citation flags, and related paper recommendations via SPECTER2 embeddings. SPECTER2 is a neural model trained to produce embeddings that capture semantic similarity between scientific papers: two papers with similar SPECTER2 embeddings tend to address similar research questions, even if they share no citations. Semantic Scholar also provides TLDR summaries and external identifiers (arXiv, PubMed, DBLP).
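Similarity between embedding vectors is conventionally measured with cosine similarity; the post doesn't name the exact metric Chive uses, so this is a generic sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors:
    1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```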

OpenAlex

OpenAlex provides hierarchical topic classification (domain, field, subfield), concepts with Wikidata identifiers, works counts, cited-by counts, and institutional affiliations with ROR codes. The concept hierarchy is useful for connecting papers across subfield boundaries. Because OpenAlex concepts carry Wikidata QIDs, Chive can map them into its own knowledge graph, so external classifications feed into the same structure that powers browse mode and faceted search. When a paper isn't in the OpenAlex database, Chive can use OpenAlex's text classification API to classify the paper’s title and abstract into topics and concepts.
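A sketch of pulling Wikidata QIDs out of an OpenAlex work record, assuming the `concepts` field shape of the OpenAlex works API; the score cutoff is an illustrative choice, not Chive's:

```python
def extract_concepts(work, min_score=0.3):
    """Pull (QID, name, score) triples from an OpenAlex work record.

    Each OpenAlex concept carries a `wikidata` URL; the QID is its
    final path segment. Concepts below the score cutoff or without a
    Wikidata link are skipped.
    """
    out = []
    for c in work.get("concepts", []):
        wikidata = c.get("wikidata") or ""
        if c.get("score", 0.0) >= min_score and wikidata:
            qid = wikidata.rstrip("/").rsplit("/", 1)[-1]
            out.append((qid, c.get("display_name"), c["score"]))
    return out
```

Because the QIDs are the same identifiers Chive's knowledge graph uses, the mapping into the graph is a direct lookup rather than fuzzy matching.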

Crossref

Crossref provides DOI resolution and journal metadata. It's the primary fallback when GROBID extracts a reference with a DOI but limited metadata: Chive resolves the DOI against Crossref to fill in the title, publication date, and container title. Crossref enrichment is conditional: Chive only queries it when a DOI exists and the record has missing fields.
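The conditional check and backfill might look like this; the Crossref works endpoint (`https://api.crossref.org/works/{doi}`) is real, while the record field names on the Chive side are illustrative:

```python
import json
import urllib.request

def needs_crossref(record):
    """Query Crossref only when a DOI exists and metadata is missing."""
    missing = (not record.get("title")
               or not record.get("container_title")
               or not record.get("published"))
    return bool(record.get("doi")) and missing

def enrich_from_crossref(record):
    """Backfill title, publication date, and container title from the
    Crossref works API. Existing values are never overwritten."""
    if not needs_crossref(record):
        return record
    url = f"https://api.crossref.org/works/{record['doi']}"
    with urllib.request.urlopen(url) as resp:
        msg = json.load(resp)["message"]
    if not record.get("title"):
        titles = msg.get("title") or []
        record["title"] = titles[0] if titles else None
    if not record.get("container_title"):
        ct = msg.get("container-title") or []
        record["container_title"] = ct[0] if ct else None
    if not record.get("published"):
        parts = (msg.get("issued") or {}).get("date-parts") or [[]]
        record["published"] = "-".join(str(p) for p in parts[0]) or None
    return record
```

Crossref's `title` and `container-title` fields are arrays and `issued` uses a `date-parts` structure, which is why the backfill unwraps them.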

Enrichment runs asynchronously when papers are indexed. If any source is unavailable, extraction continues with the others. The results are stored in Chive’s database and used to improve search, recommendations, and the discovery features described below. On each eprint page, an enrichment panel shows the external identifiers (Semantic Scholar ID, OpenAlex ID), citation metrics (citation count, references count, influential citations), OpenAlex topics, and Semantic Scholar concepts, when available.

Paper relatedness

Chive uses seven signal types to determine how papers relate to each other.

1. Direct citation. The simplest signal: paper A cites paper B, or paper B is cited by paper A. These relationships come from the citation extraction pipeline.

2. Co-citation. Two papers that are frequently cited together by the same third paper are likely related. Co-citation is a standard bibliometric measure introduced by Henry Small in 1973. Chive tracks co-citation counts as a signal of thematic relatedness.

3. Bibliographic coupling. The inverse of co-citation: two papers that cite many of the same references are likely addressing similar questions. The signal includes a count of shared references, so stronger coupling (more shared references) produces a stronger relatedness score.

4. Semantic similarity. SPECTER2 embeddings from Semantic Scholar capture meaning-level similarity. Two papers can be semantically similar without sharing any citations, which makes this signal particularly useful for connecting work across subfield boundaries or between fields that use different citation conventions. When SPECTER2 data isn’t available, Chive falls back to Elasticsearch's more-like-this query with a discount factor.

5. Concept and topic overlap. OpenAlex classifies papers into a hierarchical taxonomy of domains, fields, and subfields, each linked to Wikidata. Chive maps these concepts to its own knowledge graph, so papers that share concepts at different levels of the hierarchy are related to different degrees: sharing a subfield is a stronger signal than sharing a domain.

6. Author network. Papers by the same author or by frequent collaborators are surfaced as potentially relevant. The relatedness score scales with the degree of author overlap.

7. Collaborative filtering. Based on user interaction patterns: if researchers who bookmarked paper A also tend to bookmark paper B, the two papers may be related in ways that the other signals don't capture.
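Co-citation and bibliographic coupling are both computable from the same citing-paper-to-references map; a minimal sketch (function names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def co_citation_counts(citations):
    """Count how often each pair of papers is cited together.

    `citations` maps each citing paper to its set of cited papers.
    Pairs are stored in sorted order so (a, b) and (b, a) collapse.
    """
    counts = defaultdict(int)
    for cited in citations.values():
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
    return dict(counts)

def bibliographic_coupling(citations, a, b):
    """Number of references papers a and b share."""
    return len(citations.get(a, set()) & citations.get(b, set()))
```

The symmetry is visible in the code: co-citation iterates over the citing side and counts cited pairs, while coupling intersects the cited sides of a citing pair.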

Each signal produces a score. Users can configure which signals to include and how to weight them (on a 0-100 scale per signal) from their discovery settings. The default weights favor author network and semantic similarity, with co-citation, concept overlap, and collaborative filtering as supporting signals. Users can also set a minimum score threshold and a maximum result count (up to 50).
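A sketch of the weighted combination, with illustrative defaults that follow the stated emphasis on author network and semantic similarity; the signal names and weight values here are assumptions, not Chive's actual configuration:

```python
# Weights on a 0-100 scale per signal, mirroring the discovery
# settings; a disabled signal is simply absent from the mapping.
DEFAULT_WEIGHTS = {
    "direct_citation": 70,
    "co_citation": 50,
    "bibliographic_coupling": 50,
    "semantic_similarity": 90,
    "concept_overlap": 50,
    "author_network": 90,
    "collaborative_filtering": 40,
}

def related_papers(candidates, weights=DEFAULT_WEIGHTS,
                   min_score=0.1, max_results=50):
    """Combine per-signal scores (each in [0, 1]) into one relatedness
    score per candidate, then apply the threshold and result cap.

    `candidates` maps paper ID -> {signal name: score}.
    """
    total_weight = sum(weights.values()) or 1
    ranked = []
    for paper, signals in candidates.items():
        score = sum(
            weights.get(name, 0) * value
            for name, value in signals.items()
        ) / total_weight
        if score >= min_score:
            ranked.append((paper, score))
    ranked.sort(key=lambda item: -item[1])
    return ranked[:max_results]
```

Normalizing by the total weight keeps the combined score in [0, 1] regardless of how many signals a user enables, so the threshold means the same thing under any configuration.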

User-curated relationships

Alongside automated signals, researchers can manually assert relationships between papers. You link two papers and select a relationship type: extends, replicates, contradicts, reviews, is-supplement-to, or a general related label. Each link can include a description explaining the connection.

These curated relationships are stored as AT Protocol records in your PDS and indexed alongside the automated signals. They’re useful for connections that algorithms miss: a paper that contradicts a widely-cited result, or a methods paper that serves as the practical implementation of a theoretical framework. Because they're user-created ATProto records, they carry your identity and can be referenced by others.

Citation network visualization

For eprints with citation nodes in the Chive knowledge graph (of which there aren't any at the moment), the eprint page has an interactive citation graph built with React Flow that visualizes the paper’s position in the citation network. The graph places the paper at the center with citing papers on one side and references on the other. Influential citations are highlighted with distinct styling. Clicking a node navigates to that paper.

The visualization can be collapsed to a summary showing citation counts (cited-by, references, influential citations) or expanded to the full interactive view, and you can configure the default display mode (hidden, preview, or expanded) in your discovery settings. A minimap in the corner helps with navigation when the graph is large.

Following fields

You can follow knowledge graph fields to tailor what Chive surfaces for you. Your profile already declares your research fields, and those are used to show new papers in your areas on the eprints page. But you can also follow fields outside your own research using Chive's discovery settings, for areas you want to track without claiming them as your own.

The trending page has two tabs: papers trending in your declared research fields, and papers trending in the fields you follow. A computational linguist who wants to keep up with formal semantics or psycholinguistics can follow those fields and see trending work from each separately. There's also a configuration option to merge followed fields into your research field tab if you prefer a single stream.

The discovery settings are stored as part of your profile and include per-signal toggles (enable or disable semantic similarity, co-citation, topic overlap, and so on) as well as per-signal weights. You can tune how aggressively Chive recommends papers and how much diversity you want relative to your own declared fields (low, medium, or high).

Trending and velocity

Chive tracks views, unique views, and downloads for each eprint across three time windows: 24 hours, 7 days, and 30 days. Views are counted with Redis counters. Unique views are estimated using Redis HyperLogLog structures, which provide cardinality estimates with a standard error of about 0.81% without storing individual visitor identifiers. Time-windowed counts use sorted sets scored by timestamp, so expired entries can be trimmed as the window slides.
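An in-memory stand-in for this counting scheme; the Redis equivalents (INCR, PFADD/PFCOUNT, ZADD/ZREMRANGEBYSCORE/ZCARD) are noted in comments, but the class itself is illustrative:

```python
import time

class ViewCounter:
    """In-memory stand-in for Chive's Redis counters."""

    def __init__(self):
        self.total = 0           # Redis: INCR on a plain counter
        self.unique = set()      # Redis: PFADD/PFCOUNT on an HLL sketch,
                                 # which stores no visitor identifiers
        self.timestamps = []     # Redis: sorted set scored by timestamp

    def record_view(self, viewer_id, now=None):
        now = time.time() if now is None else now
        self.total += 1
        self.unique.add(viewer_id)
        self.timestamps.append(now)

    def views_in_window(self, seconds, now=None):
        """Count views in the trailing window, trimming expired entries.
        Redis: ZREMRANGEBYSCORE to trim, then ZCARD to count."""
        now = time.time() if now is None else now
        cutoff = now - seconds
        self.timestamps = [t for t in self.timestamps if t >= cutoff]
        return len(self.timestamps)
```

The HyperLogLog trade is the interesting one: a few kilobytes per eprint buys an approximate unique-visitor count without retaining anything that identifies a visitor.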

The trending score for a given time window is the view count within that window. Papers are ranked by how many views they received in the last 24 hours, 7 days, or 30 days.

There’s also a velocity indicator that measures acceleration: how the recent rate of attention compares to a longer-term baseline. For the 24-hour window, the baseline is the 7-day average. For the 7-day window, the baseline is the 30-day average. A positive velocity means a paper is gaining attention faster than its baseline; negative means it’s decelerating. This is useful for spotting papers that are picking up momentum in a field, separate from papers that are popular in absolute terms. A paper with moderate total views but high velocity is breaking through; a paper with high total views but negative velocity has peaked.
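One common formulation of such a velocity measure (the post doesn't give Chive's exact formula): compare the recent per-hour rate to the baseline per-hour rate, and report the relative change:

```python
def velocity(views_recent, recent_hours, views_baseline, baseline_hours):
    """Relative acceleration of attention.

    Positive means the recent rate exceeds the baseline rate (gaining
    momentum); negative means it is decelerating. A paper with views
    but no baseline at all is treated as maximally accelerating.
    """
    recent_rate = views_recent / recent_hours
    baseline_rate = views_baseline / baseline_hours
    if baseline_rate == 0:
        return 1.0 if recent_rate > 0 else 0.0
    return (recent_rate - baseline_rate) / baseline_rate
```

For the 24-hour window this would be called as `velocity(views_24h, 24, views_7d, 168)`: a paper with 48 views today against 168 views over the week has doubled its baseline hourly rate, giving a velocity of 1.0.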

Trending can be filtered by field, so you can see what’s gaining traction specifically in, say, formal semantics or computational linguistics.

Your paper’s reach

Beyond platform-wide trending, you can see analytics for your own papers. Each eprint page shows its all-time view count, download count, and endorsement count. The time-windowed breakdowns (24 hours, 7 days, 30 days) are used on the trending page and in the recommendation pipeline but aren’t yet surfaced on individual eprint pages. Your author profile aggregates metrics across all your claimed papers.

How this connects to the rest

Citations, enrichment data, rich text references, and entity links all write to the same knowledge graph that undergirds collections, browse mode, faceted search, and trending. A citation extracted from a newly indexed paper creates the same kind of graph edge as a concept mapped from OpenAlex or an entity link a reviewer adds by hand. The graph doesn't distinguish between them.

This means the collection feed system from the third post picks up discovery events automatically. If someone uploads a paper that cites a paper you’re tracking in a collection, or if a new paper gets classified under a field your collection watches, the activity stream reflects it.

But as always, Chive never modifies your records. The citation graph, enrichment metadata, recommendation scores, and view counts are all derived indexes, rebuildable from the firehose, external APIs, and ongoing traffic.


In this series: What Chive is · The knowledge graph · Collections · Reviews, annotations, and endorsements · Discovery and citations

Technical deep dives: XRPC adapter · Lexicon namespace · Rich text · Firehose · Storage · Knowledge graph schema · Review system · Citations · Discovery · Plugins · Auth · Observability

chive.pub · github.com/chive-pub/chive · docs.chive.pub