Implementing h-index in SQL

24. April 2022  ‐ 2 min read

The h-index is an easy-to-calculate metric that measures the impact of the publications of an author. The easiest definition is that it is the largest number h such that the author has h published papers with at least h citations. It’s mostly used in the scientific community, but it can also be useful in other contexts. Recently I was asked to implement a h-index calculation functionality in a project of mine (hi @niko, if you’re reading this), and since I keep the relevant data in an SQLite database, I thought that it’d be convenient to use SQL to calculate the h-index.

A useful restatement of the definition of the h-index is $$h =\max\lbrace i \in \mathbb{N}: f(i) \ge i \rbrace$$ where $f(i)$ is the number of citations of the paper with index $i$, which allowed me to come up with the following solution:

SELECT MAX(Ranking)
FROM (SELECT Author, Citations, ROW_NUMBER() OVER (PARTITION BY Author ORDER BY Citations DESC) AS Ranking
    FROM CitationData)
WHERE Ranking <= Score
GROUP BY Author
ORDER BY MAX(Ranking) DESC

A few things are worth pointing out:

  • You should use ROW_NUMBER() and not RANK(), because with RANK() having multiple items with the same number of citations would lead to a wrong result.
  • This is SQLite-compatible, but I’m quite certain that it should be trivial to use in another SQL dialect.
  • Probably this isn’t the most efficient solution, but at least it’s one that is easy to understand.

I hope you found this useful. Feel free to send me feedback via mail.