Cleaning and extracting text from HTML/XML documents using Spark NLP

Stefano Lori
Published in spark-nlp · Jan 13, 2021

Spark NLP is an open-source text processing library for advanced natural language processing in Python, Java, and Scala. To date, the library has achieved the best performing academic peer-reviewed results for two years in a row, backed by a fast-growing community (2.5M downloads and 9x growth in 2020).

Some more impressive numbers from the latest 2.7.x release:

  • more accurate and faster annotators, with support for up to 375 languages.
  • support for state-of-the-art Seq2Seq and Text2Text transformers. This includes new annotators for Google T5 (Text-To-Text Transfer Transformer) and MarianMT for Neural Machine Translation, with over 646 new pre-trained models and pipelines.
  • 720+ new pretrained models and pipelines overall, extending multi-lingual support to 192+ languages such as Chinese, Japanese, Korean, Arabic, Persian, Urdu, and Hebrew.

Today I’m going to talk about a new annotator that was added in the latest release: the DocumentNormalizer.

Why do we need a document normalizer?

The Spark NLP community expressed the need for an annotator capable of directly processing input HTML/XML documents in order to clean them or extract specific content.

Imagine you have aggregated a collection of raw HTML documents from a given data source with your preferred crawler library, and you now want to remove all the tags to focus on their contents.

Please don’t call the ghostbusters, just use the brand new Spark NLP DocumentNormalizer annotator! :D

But wait, what is an annotator? o.O
Let’s look at the definition to get an idea.

In Spark NLP, all Annotators are either Estimators or Transformers as we see in Spark ML. An Estimator in Spark ML is an algorithm that can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator that trains on a DataFrame and produces a model. A Transformer is an algorithm that can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.
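To make this concrete, here is a minimal Spark ML sketch (with a hypothetical toy DataFrame, assuming a SparkSession is already available as spark) showing an Estimator being fit to produce a Transformer, which then transforms a DataFrame:

from pyspark.ml.feature import StringIndexer  # StringIndexer is an Estimator

# Hypothetical toy data, just to illustrate the fit/transform contract
toy_df = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexer_model = indexer.fit(toy_df)       # Estimator.fit -> Transformer (a model)
indexer_model.transform(toy_df).show()    # Transformer.transform -> new DataFrame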

Let’s load some data to a text column in your input Spark SQL DataFrame:

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

path = "html-docs"

data = spark.sparkContext.wholeTextFiles(path)
df = data.toDF(schema=["filename", "text"]).select("text")

df.show()
+--------------------+
|                text|
+--------------------+
|<div class='w3-co...|
|<span style="font...|
|<!DOCTYPE html> <...|
+--------------------+
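If you do not have HTML files at hand and want to reproduce the output above, a minimal sketch (with purely hypothetical file names and contents) that writes a couple of HTML snippets into the html-docs folder before running the loading code could look like this:

import os

# Hypothetical sample documents, just to have something to normalize
samples = {
    "doc1.html": "<div class='w3-container'><p>Hello <b>Spark NLP</b>!</p></div>",
    "doc2.html": '<span style="font-size:14px">Another <i>snippet</i></span>',
}

os.makedirs("html-docs", exist_ok=True)
for name, content in samples.items():
    with open(os.path.join("html-docs", name), "w", encoding="utf-8") as f:
        f.write(content)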

Once your input DataFrame is loaded you can define your next pipeline stages:

documentAssembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

inputColName = "document"
outputColName = "normalizedDocument"

action = "clean"
cleanUpPatterns = ["<[^>]*>"]
replacement = " "
removalPolicy = "pretty_all"
encoding = "UTF-8"

documentNormalizer = DocumentNormalizer() \
    .setInputCols(inputColName) \
    .setOutputCol(outputColName) \
    .setAction(action) \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(replacement) \
    .setPolicy(removalPolicy) \
    .setLowercase(True) \
    .setEncoding(encoding)

Let’s walk through the different parameters we are setting in this example.

  • inputColName: input column name string, which targets a column of type Array(AnnotatorType.DOCUMENT).
  • outputColName: output column name string, which targets a column of type AnnotatorType.DOCUMENT.
  • action: action string to perform when applying the regex patterns, i.e. (clean | extract). Default is “clean”.
  • cleanUpPatterns: normalization regex patterns; whatever they match will be removed from the document.
    Default is “<[^>]*>” (i.e., it removes all HTML tags).
  • replacement: replacement string to apply when the regexes match.
    Default is “ ”.
  • removalPolicy: policy used when removing matched patterns from the text. Valid policy values are: “all”, “pretty_all”, “first”, “pretty_first”.
    Default is “pretty_all”.
  • lowercase: boolean indicating whether to convert strings to lowercase.
    Default is False.
  • encoding: file encoding to apply to the normalized documents. Supported encodings are: UTF-8, UTF-16, US-ASCII, ISO-8859-1, UTF-16BE, UTF-16LE.
    Default is “UTF-8”.
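As a quick aside, if instead of cleaning the markup you want to pull out the contents of specific XML tags, the same annotator can be switched to the extract action. A minimal sketch (the output column name is just an example), mirroring the XML use case shown at the end of this article:

# Extract the contents of <name> tags instead of cleaning markup
documentNormalizerExtract = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("extractedDocument") \
    .setAction("extract") \
    .setPatterns(["name"])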

Thanks to the new DocumentNormalizer annotator, we can now apply the regular expressions we have chosen in order to normalize the document and prepare it for the following stages.

In this example, our goal is therefore to strip away the markup and keep the text contained within the HTML tags.

With the document loaded in the DataFrame input column “text”, let’s build and execute a simple pipeline with the following code:

documentAssembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

cleanUpPatterns = ["<[^>]*>"]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

docPatternRemoverPipeline = Pipeline() \
    .setStages([
        documentAssembler,
        documentNormalizer
    ])

ds = docPatternRemoverPipeline.fit(df).transform(df)

ds.select("normalizedDocument").show(1, False)

The final show call lets us visualize the annotator’s action.

As we can see in the result, the content of the HTML tags has been extracted, lowercased, and formatted into the output column “normalizedDocument”, and we can now use it as input for the next stages of a Spark NLP pipeline.
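If you prefer the cleaned text as a plain string column rather than as Spark NLP annotation structs, one option is to append Spark NLP’s Finisher as a final stage; a minimal sketch (the cleanText column name is just an example):

from sparknlp.base import Finisher

# Convert the annotation structs into plain string columns
finisher = Finisher() \
    .setInputCols(["normalizedDocument"]) \
    .setOutputCols(["cleanText"]) \
    .setOutputAsArray(False) \
    .setCleanAnnotations(True)

finisher.transform(ds).select("cleanText").show(1, False)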

To provide a more complex example, the DocumentNormalizer annotator can be used as a text preparation step before the SentenceDetector, followed by the Tokenizer, as shown in the following pipeline:

documentAssembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

cleanUpPatterns = ["<[^>]*>"]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

sentenceDetector = SentenceDetector() \
    .setInputCols(["normalizedDocument"]) \
    .setOutputCol("sentence")

regexTokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .fit(df)

docPatternRemoverPipeline = Pipeline() \
    .setStages([
        documentAssembler,
        documentNormalizer,
        sentenceDetector,
        regexTokenizer])

ds = docPatternRemoverPipeline.fit(df).transform(df)

ds.select("normalizedDocument").show(10)

This pipeline processes your HTML documents, applies the document normalization according to your parameter settings, and chains the cleaning action with the sentence detector and regex tokenizer, producing clean tokens ready for further processing.
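To peek at the resulting tokens, an illustrative query (not part of the pipeline itself) can explode the token annotations into one row per token:

from pyspark.sql import functions as F

# Each token annotation stores its text in the "result" field
ds.select(F.explode("token.result").alias("token")).show(20, False)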

Other interesting use cases in which the DocumentNormalizer can be useful:

  • extract the content of a specific HTML tag, e.g. <p>:
    action = "clean"
    tag = "p"
    patterns = ["<"+tag+"(.+?)>(.+?)<\\/"+tag+">"]
  • obfuscate PII such as emails in HTML content (see the sketch after this list):
    action = "clean"
    patterns = ["([^.@\\s]+)(\\.[^.@\\s]+)*@([^.@\\s]+\\.)+([^.@\\s]+)"]
    replacement = "***OBFUSCATED PII***"
  • extract XML name tag contents:
    action = "extract"
    tag = "name"
    patterns = [tag]
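As an illustration of the PII use case above, here is a minimal sketch of a DocumentNormalizer configured to obfuscate e-mail addresses, reusing the documentAssembler and df defined earlier (the output column name is just an example):

# Replace anything matching the e-mail regex with an obfuscation marker
piiObfuscator = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("obfuscatedDocument") \
    .setAction("clean") \
    .setPatterns(["([^.@\\s]+)(\\.[^.@\\s]+)*@([^.@\\s]+\\.)+([^.@\\s]+)"]) \
    .setReplacement("***OBFUSCATED PII***")

piiPipeline = Pipeline().setStages([documentAssembler, piiObfuscator])
piiPipeline.fit(df).transform(df).select("obfuscatedDocument").show(1, False)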

Hope this article was useful! Have fun with the brand new Spark NLP release!
