Large Scale Information Extraction Using Apache Tika on Spark

Garoux LLC
Aug 22, 2022

Natural language processing (NLP) models are built on text, but documents are stored as PDFs, Word documents, and other binary formats. To analyze documents with NLP, we first need to extract the text. Fortunately, the Apache Tika project makes this easy, with built-in methods to extract text and metadata from over a thousand file types. This is especially useful when analyzing large collections of documents in a variety of formats. Apache Spark is the standard framework for large-scale data analysis, so it pairs well with Tika for processing massive document collections. In this article we'll walk through an example using Tika with Spark on AWS Elastic MapReduce (EMR). All the code is available on GitHub.

First clone the example repository:
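The repository's address isn't reproduced here, so the URL below is a placeholder:

```shell
# Placeholder URL -- replace with the example repository's actual address.
git clone https://github.com/<your-org>/<tika-spark-example>.git
cd <tika-spark-example>
```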

Then set up an S3 bucket in AWS with three subfolders named input, output, and resources. Into the input folder, upload a few files in a variety of formats from which you'd like to extract text. Also make sure you have the AWS CLI configured locally with S3 and EMR permissions enabled. Then run the following command from the root directory of the git repository.
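Using the AWS CLI, the bucket setup might look like this (the bucket name and the local sample path are placeholders, not part of the example repository):

```shell
# Hypothetical bucket name -- substitute your own.
BUCKET=my-tika-example-bucket

# Create the bucket and its three folders.
aws s3 mb "s3://$BUCKET"
aws s3api put-object --bucket "$BUCKET" --key input/
aws s3api put-object --bucket "$BUCKET" --key output/
aws s3api put-object --bucket "$BUCKET" --key resources/

# Upload a few sample documents to extract text from.
aws s3 cp ./samples/ "s3://$BUCKET/input/" --recursive
```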

This command uploads all of the example's JARs to the resources folder of your S3 bucket, then creates a Spark job on EMR that extracts the text from everything in your input folder and stores the results in the output folder. The output is written as JSON Lines, with one object per line in the following format:
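The exact schema comes from the example code, which isn't reproduced here; a typical shape, with illustrative field names, is one object per input document:

```json
{"file": "input/report.pdf", "text": "Extracted plain text of the document...", "metadata": {"Content-Type": "application/pdf"}}
```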

The code running in the Spark job is actually quite simple.
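The repository's exact code isn't shown here, but the core of such a job can be sketched as follows (the bucket paths and output schema are illustrative assumptions):

```scala
// Minimal sketch: read raw files, run each through Tika, write JSON Lines.
import org.apache.spark.sql.SparkSession
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.sax.BodyContentHandler

object TikaExtract {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("tika-extract").getOrCreate()
    val sc = spark.sparkContext

    // Read every file in the input folder as raw bytes.
    val extracted = sc.binaryFiles("s3://my-tika-example-bucket/input/*").map {
      case (path, stream) =>
        val parser  = new AutoDetectParser()       // detects the file type for us
        val handler = new BodyContentHandler(-1)   // -1 disables the write limit
        val meta    = new Metadata()
        val in      = stream.open()
        try parser.parse(in, handler, meta) finally in.close()
        (path, handler.toString)                   // (file path, extracted text)
    }

    // Persist one JSON object per line.
    import spark.implicits._
    extracted.toDF("file", "text")
      .write.json("s3://my-tika-example-bucket/output/")
  }
}
```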

The one tricky part involves the JARs for Tika. You'll notice that I haven't included Tika as a dependency through SBT, instead opting to provide the libraries directly as JARs. This is because Tika depends on a version of Apache commons-compress that conflicts with the version bundled with Spark. The solution is to build the Tika JARs from source with commons-compress as a shaded dependency. The JARs included in the example code were built from Tika 1.26 with shaded dependencies to avoid this issue.
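Tika's build is Maven-based, so shading of this kind is done with the Maven Shade Plugin's relocation feature. A sketch of the relevant configuration (illustrative, not the exact build file used for these JARs) looks like:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- Move commons-compress classes into a private namespace
                 so they cannot clash with Spark's bundled copy. -->
            <pattern>org.apache.commons.compress</pattern>
            <shadedPattern>shaded.org.apache.commons.compress</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```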

And that’s it! Hopefully this helps you get started analyzing your documents at scale using Tika, Spark, and NLP.

Garoux LLC

Garoux LLC provides consulting in data science, machine learning, and AI. Check out https://garoux.com to learn more.