Back
Close

What is IDF and how is it calculated?

andrew-lucker
5,109 views

Inverse Document Frequency

IDF is one of the most basic terms of modern search engine relevance calculation. It is used to determine how rare a term is and how relevant it is to the original query. For example take the query "the Golden State Warriors". This query is difficult because there is no invidual word that identifies our intention to search for a basketball team. Instead we need to look at groups of words and weigh how relevant each set is to the overall query. This is the basics of flat search query relevance and it all starts with IDF.

Before we can calculate IDF we need to associate each document or query with a set of features. For this tutorial we will use only n-grams. An n-gram is one or more words. We can use python's string methods to quickly extract features from a document or query.

def extract_features( document ):
terms = tuple(document.lower().split())
features = set()
for i in range(len(terms)):
for n in range(1,4):
if i+n <= len(terms):
features.add(terms[i:i+n])
return features
print(extract_features('The Golden State Warriors'))
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Next we need to calculate Document Frequency, then invert it. The formula for IDF starts with the total number of documents in our database: N. Then we divide this by the number of documents containing our term: tD. This will never result in a number less than 1, because 1 indicates that the term is present in all documents, there is no document frequency more common than that limit. Next we normally take the logarithm of this whole term, because we may be indexing billions of documents, and the IDF can get pretty unwieldy unless we refer to it in terms of order of magnitude.

Here we can calculate the IDF for all of our features in a small database of documents.

def extract_features( document ):
terms = tuple(document.lower().split())
features = set()
for i in range(len(terms)):
for n in range(1,4):
if i+n <= len(terms):
features.add(terms[i:i+n])
return features
documents = [
"This article is about the Golden State Warriors",
"This article is about the Golden Arches",
"This article is about state machines",
"This article is about viking warriors"]
def calculate_idf( documents ):
N = len(documents)
from collections import Counter
tD = Counter()
for d in documents:
features = extract_features(d)
for f in features:
tD[" ".join(f)] += 1
IDF = []
import math
for (term,term_frequency) in tD.items():
term_IDF = math.log(float(N) / term_frequency)
IDF.append(( term_IDF, term ))
IDF.sort(reverse=True)
return IDF
for (IDF, term) in calculate_idf(documents):
print(IDF, term)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

As you can see in the output, rare terms are assigned higher IDF and thus can be weighted higher in relevancy calculation.

Create your playground on Tech.io
This playground was created on Tech.io, our hands-on, knowledge-sharing platform for developers.
Go to tech.io
codingame x discord
Join the CodinGame community on Discord to chat about puzzle contributions, challenges, streams, blog articles - all that good stuff!
JOIN US ON DISCORD
Online Participants