Ranking document similarity at scale with Spark NLP

Stefano Lori
9 min read · Jul 2, 2023


Combining the power of Spark NLP sentence embeddings and LSH approximate nearest neighbors search pipelines to catch contextual and semantic meanings in large textual document datasets

Context

Are you tired of manually sifting through mountains of text documents to find similar content? Say goodbye to the tedious process! Spark NLP, the cutting-edge natural language processing library, has just added a new useful annotator: the Document Similarity Ranker.

As the name suggests, this annotator can effortlessly compare and rank the similarity between different documents. Whether you’re dealing with articles, reports, customer reviews, or any other textual data, this powerful tool will save you countless hours and boost your productivity.

Core architecture

At its core, the Document Similarity Ranker leverages the power of various components to provide its impressive functionality.

Let’s break it down:

  1. Firstly, the Document Similarity Ranker relies on a sentence embeddings pipeline. This pipeline processes individual sentences and transforms them into fixed-dimensional vectors called embeddings. These embeddings capture the contextual and semantic information of each sentence, enabling a more nuanced understanding of the text.
    Spark NLP provides several implementations of these, for instance RoBertaSentenceEmbeddings or E5Embeddings.
  2. Once the sentence embeddings are generated, the Document Similarity Ranker connects them to a Spark ML Locality Sensitive Hashing (LSH) pipeline. Locality-Sensitive Hashing is a technique used for efficient nearest neighbor search in high-dimensional spaces. It maps similar vectors to the same or nearby hash buckets, allowing for fast retrieval of similar items.
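To make the LSH step concrete, here is a minimal, standalone sketch (not the ranker’s own code) that runs Spark ML’s BucketedRandomProjectionLSH directly on a few invented 2-D vectors; the Document Similarity Ranker wires real sentence embeddings into this same machinery:

from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy vectors standing in for sentence embeddings:
# ids 0 and 1 are close to each other, id 2 is far away
df = spark.createDataFrame(
    [
        (0, Vectors.dense([1.0, 1.0])),
        (1, Vectors.dense([1.0, 0.9])),
        (2, Vectors.dense([-1.0, -1.0]))
    ],
    ["id", "features"]
)

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(df)

# Self-join: for each vector, retrieve neighbours within Euclidean distance 1.0
model.approxSimilarityJoin(df, df, 1.0, distCol="dist") \
    .filter("datasetA.id != datasetB.id") \
    .selectExpr("datasetA.id as idA", "datasetB.id as idB", "dist") \
    .show()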

In summary, the Document Similarity Ranker seamlessly integrates the strength of Spark NLP’s sentence embeddings pipelines with the power of the Spark built-in Approximate Nearest Neighbors (ANN) search engine.
Awesome! Coding time!

Hands-on with the Document Similarity Ranker!

I’ll code in Python, but remember you can use the great Scala APIs too! :)

The Spark NLP library provides many sentence-level embeddings models such as Bert, RoBerta or XlmRoBerta (multilingual), which leverage pretrained transformer models.
Please find more info here: https://sparknlp.org/docs/en/transformers .
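As an example, swapping in a lighter sentence-embeddings model is a one-liner; the model name below (sent_small_bert_L2_128) is one choice from the Spark NLP Models Hub, not something the rest of this article depends on:

from sparknlp.annotator import BertSentenceEmbeddings

# A lightweight BERT sentence model; any sentence-level embeddings annotator
# with the same input/output columns can be dropped into the pipelines below.
sentence_embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")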

Many well-written Spark NLP articles can help you get started quickly with sentence embeddings pipelines; for instance, you can find one here:
https://www.johnsnowlabs.com/understanding-the-power-of-transformers-a-guide-to-sentence-embeddings-in-spark-nlp/ .

Install Spark NLP (≥ 5.x)

Please refer to https://sparknlp.org/docs/en/install to install the library.
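For a quick start in a pip-based environment, the setup typically boils down to the line below (exact versions may differ; follow the install page for your platform):

pip install spark-nlp pyspark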

Code time!

1. Load the necessary classes
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline

2. Initialise the PySpark session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark-NLP-doc-sim-ranker") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.x.x") \
    .getOrCreate()
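Alternatively, the library ships a convenience starter that creates a Spark NLP-ready session with sensible defaults, so you don’t have to wire the configuration by hand:

import sparknlp

# Starts (or reuses) a SparkSession preconfigured with the Spark NLP jar
spark = sparknlp.start()
print(sparknlp.version())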

3. Let’s create a test dataset with quite evident pairwise similarity, i.e. the first and second, third and fourth, fifth and sixth, and seventh and eighth sentences.

data = spark.createDataFrame(
    [
        ["First document, this is my first sentence. This is my second sentence."],
        ["Second document, this is my second sentence. This is my second sentence."],
        ["Third document, climate change is arguably one of the most pressing problems of our time."],
        ["Fourth document, climate change is definitely one of the most pressing problems of our time."],
        ["Fifth document, Florence in Italy, is among the most beautiful cities in Europe."],
        ["Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France."],
        ["Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France."],
        ["Eighth document, the warmest place in France is the French Riviera coast in Southern France."]
    ]
).toDF("text")

4. Let’s build a RoBerta Sentence Embeddings pipeline (see https://sparknlp.org/api/com/johnsnowlabs/nlp/embeddings/RoBertaSentenceEmbeddings.html)

from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_embeddings = RoBertaSentenceEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_embeddings
])

roberta_results = pipeline.fit(data).transform(data)

roberta_results.show()

This will download the default pretrained model sent_roberta_base: https://sparknlp.org/2021/09/01/sent_roberta_base_en.html .

When executing in the notebook of your choice, you’ll see the following

[ / ]sent_roberta_base download started this may take some time.
Approximate size to download 284.8 MB
[ — ]Download done! Loading the resource.

and calling show() on the transformed DataFrame prints:

+--------------------+--------------------+--------------------+
| text| document| sentence_embeddings|
+--------------------+--------------------+--------------------+
|First document, t...|[{document, 0, 69...|[{sentence_embedd...|
|Second document, ...|[{document, 0, 71...|[{sentence_embedd...|
|Third document, c...|[{document, 0, 88...|[{sentence_embedd...|
|Fourth document, ...|[{document, 0, 91...|[{sentence_embedd...|
|Fifth document, F...|[{document, 0, 79...|[{sentence_embedd...|
|Sixth document, F...|[{document, 0, 89...|[{sentence_embedd...|
|Seventh document,...|[{document, 0, 10...|[{sentence_embedd...|
|Eighth document, ...|[{document, 0, 91...|[{sentence_embedd...|
+--------------------+--------------------+--------------------+

We can now zoom into the sentence embeddings schema:

root
|-- sentence_embeddings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
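The schema above can be reproduced directly from the DataFrame of step 4:

roberta_results.select("sentence_embeddings").printSchema()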

Below is the struct result (truncated), showing the embeddings array that contains the final representations of our text dataset.

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|sentence_embeddings
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|[{sentence_embeddings, 0, 69, First document, this is my first sentence. This is my second sentence., {sentence -> 0, token -> First document, this is my first sentence. This is my second sentence., pieceId -> -1, isWordStart -> true}, [-0.0016509573, -0.20525712, -0.21965097, -0.059896577, 0.1287623, 0.19543253, 0.2
|[{sentence_embeddings, 0, 71, Second document, this is my second sentence. This is my second sentence., {sentence -> 0, token -> Second document, this is my second sentence. This is my second sentence., pieceId -> -1, isWordStart -> true}, [-7.9203E-4, -0.19994189, -0.21818015, -0.068899736, 0.12664562, 0.1954791, 0.
|[{sentence_embeddings, 0, 88, Third document, climate change is arguably one of the most pressing problems of our time., {sentence -> 0, token -> Third document, climate change is arguably one of the most pressing problems of our time., pieceId -> -1, isWordStart -> true}, [-0.008408386, -0.20978682, -0.21336395, -0.
|[{sentence_embeddings, 0, 91, Fourth document, climate change is definitely one of the most pressing problems of our time., {sentence -> 0, token -> Fourth document, climate change is definitely one of the most pressing problems of our time., pieceId -> -1, isWordStart -> true}, [-0.003835852, -0.20862408, -0.2177316
|[{sentence_embeddings, 0, 79, Fifth document, Florence in Italy, is among the most beautiful cities in Europe., {sentence -> 0, token -> Fifth document, Florence in Italy, is among the most beautiful cities in Europe., pieceId -> -1, isWordStart -> true}, [-0.017528567, -0.20887347, -0.21519172, -0.03888638, 0.133574
|[{sentence_embeddings, 0, 89, Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France., {sentence -> 0, token -> Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France., pieceId -> -1, isWordStart -> true}, [-0.023446266, -0.20165044, -0.21878108, -
|[{sentence_embeddings, 0, 101, Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France., {sentence -> 0, token -> Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France., pieceId -> -1, isWordStart -> true}, [-0.008019538, -0
|[{sentence_embeddings, 0, 91, Eighth document, the warmest place in France is the French Riviera coast in Southern France., {sentence -> 0, token -> Eighth document, the warmest place in France is the French Riviera coast in Southern France., pieceId -> -1, isWordStart -> true}, [-0.016810948, -0.2047661, -0.2230267,
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ...
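If you only want the raw numeric vectors, for instance to feed them into your own tooling, you can index into the annotation struct; a small sketch on the same roberta_results DataFrame:

from pyspark.sql.functions import expr

# Each row carries a single document-level annotation; grab its float vector
roberta_results.select(
    expr("sentence_embeddings[0].embeddings").alias("vector")
).show(3, truncate=80)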

5. Now that we have produced the transformer representations, let’s add the Document Similarity Ranker to create an LSH-based ANN search engine.

To do so, let’s rewrite the pipeline adding the necessary steps:

from sparknlp.annotator.similarity.document_similarity_ranker import *

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_embeddings = RoBertaSentenceEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Let's set the DocumentSimilarityRankerApproach instance
# necessary to create the LSH model engine
document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(1) \
    .setBucketLength(2.0) \
    .setNumHashTables(3) \
    .setVisibleDistances(True) \
    .setIdentityRanking(False)

# Let's extract the information we need in the columns of interest
document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
    .setInputCols("doc_similarity_rankings") \
    .setOutputCols(
        "finished_doc_similarity_rankings_id",
        "finished_doc_similarity_rankings_neighbors") \
    .setExtractNearestNeighbor(True)

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_embeddings,
    document_similarity_ranker,
    document_similarity_ranker_finisher
])

docSimRankerPipeline = pipeline.fit(data).transform(data)

(
    docSimRankerPipeline
    .select(
        "finished_doc_similarity_rankings_id",
        "finished_doc_similarity_rankings_neighbors"
    ).show(10, False)
)

Running the pipeline will compute the RoBerta Sentence Embeddings and propagate them to the LSH Approximate Nearest Neighbors (ANN) search engine, which will compute for each document a result such as the following:

+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612 |[(1634839239,0.12448559273510636)] |
|1634839239 |[(1510101612,0.12448559273510636)] |
|-612640902 |[(1274183715,0.12201215887654807)] |
|1274183715 |[(-612640902,0.12201215887654807)] |
|-1320876223 |[(1293373212,0.17848861258809434)] |
|1293373212 |[(-1320876223,0.17848861258809434)] |
|-1548374770 |[(-1719102856,0.2329717161223739)] |
|-1719102856 |[(-1548374770,0.2329717161223739)] |
+-----------------------------------+------------------------------------------+
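The IDs are integers generated by the ranker, so when inspecting results it helps to show the original text alongside; a minimal sketch, assuming the text column is still present in the transform output (adjust if your finisher configuration drops it):

docSimRankerPipeline.select(
    "text",
    "finished_doc_similarity_rankings_id",
    "finished_doc_similarity_rankings_neighbors"
).show(truncate=40)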

Alternatively, we can use the brand-new E5Embeddings:

from sparknlp.annotator.similarity.document_similarity_ranker import *

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_embeddings = E5Embeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(1) \
    .setBucketLength(2.0) \
    .setNumHashTables(3) \
    .setVisibleDistances(True) \
    .setIdentityRanking(False)

document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
    .setInputCols("doc_similarity_rankings") \
    .setOutputCols(
        "finished_doc_similarity_rankings_id",
        "finished_doc_similarity_rankings_neighbors") \
    .setExtractNearestNeighbor(True)

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_embeddings,
    document_similarity_ranker,
    document_similarity_ranker_finisher
])

docSimRankerPipeline = pipeline.fit(data).transform(data)

(
    docSimRankerPipeline
    .select(
        "finished_doc_similarity_rankings_id",
        "finished_doc_similarity_rankings_neighbors"
    ).show(10, False)
)

The results are consistent with our dataset construction, where the document pairs 1–2, 3–4, 5–6 and 7–8 are retrieved as best neighbors.

Now, let’s inspect the input parameters we set and the results we obtained.

Input setters for parameters

  • setInputCols("sentence_embeddings"): sets the input column
  • setOutputCol("doc_similarity_rankings"): sets the output column
  • setSimilarityMethod("brp"): selects the LSH method (brp for Bucketed Random Projection, mh for MinHash) used for the approximate nearest neighbours search
  • setNumberOfNeighbours(1): sets the desired number of similar documents to retrieve for each document in the set
  • setBucketLength(2.0): LSH parameter used to control the average size of hash buckets and improve recall
  • setNumHashTables(3): LSH parameter used to control the number of hash tables used in LSH OR-amplification and improve recall
  • setVisibleDistances(True): makes the distances visible in the result, useful for debugging
  • setIdentityRanking(False): controls whether each document’s identity match (distance 0.0) appears in the result, useful for debugging

Output parameters and results

During pipeline execution the DocumentSimilarityRankerApproach will train a DocumentSimilarityRankerModel, and the resulting model will produce two columns:

  • finished_doc_similarity_rankings_id: this column represents the ID of the given document
  • finished_doc_similarity_rankings_neighbors: this column represents the result of the ANN model, providing the ranked list of similar documents to the corresponding ID, ordered by LSH distance

A simple similarity self-test

Are the document representations and the similarity search effective?

A simple similarity test can be done by setting the identity ranking to True and observing that each document appears at distance 0.0 from itself.
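Concretely, the only change to the earlier ranker configuration is the identity flag:

document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(1) \
    .setBucketLength(2.0) \
    .setNumHashTables(3) \
    .setVisibleDistances(True) \
    .setIdentityRanking(True)  # each document should now match itself at distance 0.0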

+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612 |[(1510101612,0.0)] |
|1634839239 |[(1634839239,0.0)] |
|-612640902 |[(-612640902,0.0)] |
|1274183715 |[(1274183715,0.0)] |
|-1320876223 |[(-1320876223,0.0)] |
|1293373212 |[(1293373212,0.0)] |
|-1548374770 |[(-1548374770,0.0)] |
|-1719102856 |[(-1719102856,0.0)] |
+-----------------------------------+------------------------------------------+

Ranking nearest neighbors of a specific document

In the previous pipeline we can just set the visible distances parameter to False:

...
document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(3) \
    .setBucketLength(2.0) \
    .setNumHashTables(3) \
    .setIdentityRanking(False) \
    .setVisibleDistances(False)  # <-- HERE: hide distances, return only ranked IDs
...

obtaining a string column containing the ranked IDs:

+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612 |[1634839239,1274183715,-612640902] |
|1634839239 |[1510101612,1274183715,-612640902] |
|-612640902 |[1274183715,-1719102856,-1548374770] |
|1274183715 |[-612640902,-1719102856,-1320876223] |
|-1320876223 |[1293373212,-1719102856,1274183715] |
|1293373212 |[-1320876223,-1719102856,-1548374770] |
|-1548374770 |[-1719102856,1274183715,-612640902] |
|-1719102856 |[-1548374770,-1320876223,1274183715] |
+-----------------------------------+------------------------------------------+

The step to parse and explode the column is straightforward:

from pyspark.sql.functions import col, explode, regexp_replace, split

# Parse the string column into an array of integers and explode, filtering on the target id
target_id = 1510101612

(
    docSimRankerPipeline
    .withColumn("finished_doc_similarity_rankings_neighbors",
                regexp_replace("finished_doc_similarity_rankings_neighbors", "[\\[\\]]", ""))
    .withColumn("col_array",
                split(col("finished_doc_similarity_rankings_neighbors"), ",").cast("array<int>"))
    .select("finished_doc_similarity_rankings_id", explode("col_array").alias("exploded_value"))
    .filter(col("finished_doc_similarity_rankings_id") == target_id)
    .show()
)

finally obtaining the N nearest neighbour documents of target_id.

+-----------------------------------+--------------+
|finished_doc_similarity_rankings_id|exploded_value|
+-----------------------------------+--------------+
| 1510101612| 1634839239|
| 1510101612| 1274183715|
| 1510101612| -612640902|
+-----------------------------------+--------------+
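As a possible follow-up (the variable names here are illustrative, not part of the library), you can join the exploded neighbour IDs back to their original documents to read the actual texts:

from pyspark.sql.functions import col, explode, regexp_replace, split

target_id = 1510101612

# Same parsing as above, but kept in a DataFrame instead of shown directly
neighbors = (
    docSimRankerPipeline
    .withColumn("neighbors_str",
                regexp_replace("finished_doc_similarity_rankings_neighbors", "[\\[\\]]", ""))
    .withColumn("col_array", split(col("neighbors_str"), ",").cast("array<int>"))
    .select("finished_doc_similarity_rankings_id", explode("col_array").alias("exploded_value"))
    .filter(col("finished_doc_similarity_rankings_id") == target_id)
)

# Map each neighbour ID back to its text via the ranker's own ID column
id_to_text = docSimRankerPipeline.select(
    col("finished_doc_similarity_rankings_id").alias("neighbor_id"),
    col("text").alias("neighbor_text")
)

neighbors.join(id_to_text, neighbors["exploded_value"] == id_to_text["neighbor_id"]) \
    .select("exploded_value", "neighbor_text") \
    .show(truncate=60)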

Conclusion

In conclusion, the document similarity ranker showcases the exceptional usability of the library. Spark NLP’s pipelines offer a seamless and intuitive way to orchestrate the entire document similarity analysis process.
With the ability to incorporate custom components and fine-tune parameters, these pipelines empower users to tailor the document similarity analysis to specific requirements and domains. This emphasis on usability makes Spark NLP’s pipelines a valuable tool for both experienced data scientists and NLP practitioners, fostering efficiency and productivity in large-scale NLP projects.


Stefano Lori

Lead Big Data and AI, Senior Data Scientist in Fintech, ESG and Spark NLP contributor.