Spark NLP Document Similarity Ranker as-retriever for RAG tasks

Stefano Lori
4 min read · Mar 18, 2024

Breaking news! Spark NLP (https://sparknlp.org/) gets enhanced with a new DocumentSimilarityRanker as-retriever interface for your RAG (Retrieval-Augmented Generation) applications at scale! :D

Spark NLP, a popular open-source library for natural language processing, has recently been upgraded so that its DocumentSimilarityRanker annotator can be used in RAG tasks. This interface provides a powerful tool for building vector databases and running approximate similarity search at very large scale.

What is the DocumentSimilarityRanker as-retriever interface?

The DocumentSimilarityRanker as-retriever interface is a new capability of the annotator: it lets users build a vector database from a large collection of text documents and then run searches that retrieve the documents most similar to a given query.
This is done by first fitting a DocumentSimilarityRanker model (a Spark ML LSH model) on a corpus of text documents, each of which is represented as a vector in a high-dimensional space by its sentence embeddings. Once the model is fitted, it retrieves the documents similar to a given query by finding those whose vectors are closest to the query vector.

...
query = "Fifth document, Florence in Italy, is among the most beautiful cities in Europe."
...
document_similarity_ranker = DocumentSimilarityRankerApproach() \
.setInputCols("sentence_embeddings") \
.setOutputCol("doc_similarity_rankings") \
.setSimilarityMethod("brp") \
.setNumberOfNeighbours(3) \
.setBucketLength(2.0) \
.setNumHashTables(3) \
.setVisibleDistances(True) \
.setIdentityRanking(True) \
.asRetriever(query) # <--- HERE IT IS!
...
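
To make the idea behind the "brp" (bucketed random projection) similarity method more concrete, here is a minimal pure-Python sketch of that family of LSH: each hash table projects a vector onto a random direction and divides by the bucket length, and only documents that share a bucket with the query are ranked by exact distance. This is a simplified illustration, not the Spark ML or Spark NLP implementation; the helper names (random_unit_vector, brp_hash, retrieve) and the tiny 3-dimensional vectors are made up for the example.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction and normalise it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def brp_hash(vector, directions, bucket_length):
    """Return one bucket index per hash table: floor(dot(v, r) / bucket_length)."""
    return tuple(
        math.floor(sum(a * b for a, b in zip(vector, r)) / bucket_length)
        for r in directions
    )


def euclidean(a, b):
    """Exact Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, corpus, directions, bucket_length, k):
    """Keep documents sharing at least one bucket with the query, then rank them by exact distance."""
    q_hash = brp_hash(query, directions, bucket_length)
    candidates = [
        (doc_id, euclidean(query, vec))
        for doc_id, vec in corpus.items()
        if any(h == q for h, q in zip(brp_hash(vec, directions, bucket_length), q_hash))
    ]
    return sorted(candidates, key=lambda pair: pair[1])[:k]


rng = random.Random(42)
# Three random directions, playing the role of three hash tables
directions = [random_unit_vector(3, rng) for _ in range(3)]

# Hypothetical 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [1.1, 0.1, 0.0],
    "doc_c": [50.0, 50.0, 50.0],
}
query = [1.0, 0.0, 0.0]

results = retrieve(query, corpus, directions, bucket_length=2.0, k=3)
print(results)
```

A larger bucket length makes collisions more likely (more candidates, fewer misses), while more hash tables give a close document more chances to share a bucket with the query.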

Why is DocumentSimilarityRanker as-retriever important?

DocumentSimilarityRanker asRetriever(query) is an important capability for several reasons. First, it can be used to identify relevant documents from a very large corpus of text data quickly and efficiently. This is because the model can use approximate similarity search techniques to find documents that are similar to the query, even if the query is not a perfect match for any of the documents in the corpus. Second, the model can be used to rank retrieved documents based on their similarity to the query. This allows users to focus on the most relevant documents first.
Third, in the LLM era, having the capability of searching massive document repositories and ranking document similarity at scale is a great advantage for more accurate answers using RAG techniques.

Image credit: Anyscale blog (https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1)
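
As a rough illustration of where a retriever sits in a RAG application, the sketch below takes retrieved documents with their distances, keeps the closest ones, and assembles them into a prompt for an LLM. The function name build_rag_prompt, the prompt wording and the sample texts and distances are all hypothetical; Spark NLP does not prescribe any of them.

```python
def build_rag_prompt(question, retrieved, max_docs=3):
    """Assemble a prompt from the closest retrieved documents.

    `retrieved` is a list of (text, distance) pairs; a smaller distance means more similar.
    """
    # Closest documents first, truncated to the context budget
    ranked = sorted(retrieved, key=lambda pair: pair[1])[:max_docs]
    context = "\n".join(f"[{i}] {text}" for i, (text, _) in enumerate(ranked, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Illustrative retriever output (texts and distances are made up for this example)
retrieved = [
    ("The warmest place in France is the French Riviera coast.", 0.28),
    ("Florence in Italy is among the most beautiful cities in Europe.", 0.0),
]

prompt = build_rag_prompt("Which Italian city is described as beautiful?", retrieved)
print(prompt)
```

The better the ranking returned by the retriever, the more relevant the context handed to the LLM, which is why ranking quality matters for the final answer.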

How to use DocumentSimilarityRanker as-retriever?

Let’s get hands-on!

First, we need to install Spark and Spark NLP (https://sparknlp.org/docs/en/install), for instance with conda:

$ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.8 -y
$ conda activate sparknlp
$ pip install spark-nlp==5.2.x pyspark==3.3.x

Then we can launch a session enriched with the Spark NLP dependencies as follows.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.x") \
    .getOrCreate()

Then, we need to define a set of documents (corpus), for instance:

df = spark.createDataFrame([
    ["First document, this is my first sentence. This is my second sentence."],
    ["Second document, this is my second sentence. This is my second sentence."],
    ["Third document, climate change is arguably one of the most pressing problems of our time."],
    ["Fourth document, climate change is definitely one of the most pressing problems of our time."],
    ["Fifth document, Florence in Italy, is among the most beautiful cities in Europe."],
    ["Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France."],
    ["Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France."],
    ["Eighth document, the warmest place in France is the French Riviera coast in Southern France."]
]).toDF("text")

Then we can define the target query as the 5th document and build the pipeline as follows:

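from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import RoBertaSentenceEmbeddings
from sparknlp.annotator.similarity.document_similarity_ranker import *

query = "Fifth document, Florence in Italy, is among the most beautiful cities in Europe."

# Turn the raw "text" column into Spark NLP document annotations
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Embed each document as a vector (downloads the default pretrained model)
sentence_embeddings = RoBertaSentenceEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# LSH-based ranker, used here as a retriever for the query
document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(3) \
    .setBucketLength(2.0) \
    .setNumHashTables(3) \
    .setVisibleDistances(True) \
    .setIdentityRanking(True) \
    .asRetriever(query)

# Flatten the rankings into plain DataFrame columns
document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
    .setInputCols("doc_similarity_rankings") \
    .setOutputCols(
        "finished_doc_similarity_rankings_id",
        "finished_doc_similarity_rankings_neighbors") \
    .setExtractNearestNeighbor(True)  # adds the nearest_neighbor_* columns selected below

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_embeddings,
    document_similarity_ranker,
    document_similarity_ranker_finisher
])

model = pipeline.fit(df)

# Show the retrieved documents, closest to the query first
(
    model
    .transform(df)
    .select("text", "nearest_neighbor_id", "nearest_neighbor_distance")
    .orderBy("nearest_neighbor_distance", ascending=True)
    .show(10, False)
)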

After running the code, we can observe and rank the results!

sent_roberta_base download started this may take some time.
Approximate size to download 284.8 MB
[OK!]
[Stage 71:========================> (7 + 9) / 16]
+--------------------------------------------------------------------------------------------+-------------------+-------------------------+
|text |nearest_neighbor_id|nearest_neighbor_distance|
+--------------------------------------------------------------------------------------------+-------------------+-------------------------+
|Fifth document, Florence in Italy, is among the most beautiful cities in Europe. |-1320876223 |0.0 |
|Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France. |1293373212 |0.17848861258809434 |
|Eighth document, the warmest place in France is the French Riviera coast in Southern France.|-1719102856 |0.2761524746260818 |
+--------------------------------------------------------------------------------------------+-------------------+-------------------------+

The most similar documents in the corpus are returned together with their distance from the query! Internally, the annotator used the Spark ML LSH model configured above (similarity method "brp") on the sentence embeddings.

Note that, because we set the identity ranking flag to True

...
.setIdentityRanking(True)
...

the first result is the target query itself, at distance 0.0.
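
When the query is itself part of the corpus, as it is here, a downstream RAG step usually wants its neighbours rather than the query. The sketch below post-processes the three rows from the output above in plain Python, sorting by distance and dropping the identity match. The helper name rank_without_identity is hypothetical; setting the identity ranking flag to False in the annotator is presumably the more direct way to get the same effect.

```python
def rank_without_identity(rows, query_text):
    """Sort (text, distance) rows by distance and drop the row that is the query itself."""
    ranked = sorted(rows, key=lambda row: row[1])
    return [row for row in ranked if row[0] != query_text]


query = "Fifth document, Florence in Italy, is among the most beautiful cities in Europe."

# (text, nearest_neighbor_distance) pairs copied from the output table above
rows = [
    ("Eighth document, the warmest place in France is the French Riviera coast in Southern France.",
     0.2761524746260818),
    ("Fifth document, Florence in Italy, is among the most beautiful cities in Europe.",
     0.0),
    ("Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France.",
     0.17848861258809434),
]

neighbours = rank_without_identity(rows, query)
for text, distance in neighbours:
    print(f"{distance:.4f}  {text}")
```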

Conclusions

The new DocumentSimilarityRanker as-retriever interface is a powerful addition to the Spark NLP project. It allows users to create LSH vector databases of text documents and then use them to retrieve documents that are similar to a given query. This is an important tool for several RAG-based tasks, including information retrieval, question answering, and summarization. Enjoy the power of Spark NLP! :D

Stefano Lori

Lead Big Data and AI, Senior Data Scientist in Fintech, ESG and Spark NLP contributor.