Implementing h-index in SQL
24. April 2022 • 2 min readThe h-index is an easy-to-calculate metric that measures the impact of the publications of an author. The easiest definition is that it is the largest number h
such that the author has h
published papers with at least h
citations. It’s mostly used in the scientific community, but it can also be useful in other contexts. Recently I was asked to implement a h-index calculation functionality in a project of mine, and since I keep the relevant data in an SQLite database, I thought that it’d be convenient to use SQL to calculate the h-index.
A useful restatement of the definition of the h-index is $$h =\max\lbrace i \in \mathbb{N}: f(i) \ge i \rbrace$$ where $f(i)$ is the number of citations of the paper with index $i$, which allowed me to come up with the following solution:
SELECT MAX(Ranking)
FROM (SELECT Author, Citations, ROW_NUMBER() OVER (PARTITION BY Author ORDER BY Citations DESC) AS Ranking
FROM CitationData)
WHERE Ranking <= Score
GROUP BY Author
ORDER BY MAX(Ranking) DESC
A few things are worth pointing out:
- You should use
ROW_NUMBER()
and notRANK()
, because withRANK()
having multiple items with the same number of citations would lead to a wrong result. - This is SQLite-compatible, but I’m quite certain that it should be trivial to use in another SQL dialect.
- Probably this isn’t the most efficient solution, but at least it’s one that is easy to understand.
I hope you found this useful. Feel free to send me feedback via mail.