- Use Pig scripts to parse content before indexing to Solr.
- Fully SolrCloud compliant.
- Supports Kerberos for authentication.
Supported versions are:
- Solr 5.x
- Pig versions 0.12, 0.13, 0.14, and 0.15
- Hadoop 2.x
Configure the solr-hadoop-common submodule by running the following commands:
git submodule init
This initializes your local configuration file.
git submodule update
This fetches all the data from that project and checks out the appropriate commit listed in the superproject.
The build uses Gradle. To build the .jar files, run this command:
./gradlew clean shadowJar --info
This creates the following .jar file:
solr-pig-functions/build/libs/solr-pig-functions-{version}.jar
The .jar is required to use Pig functions to index content to Solr.
The solr-pig-functions-{version}.jar includes three user-defined functions (UDFs) and one store function. These functions are:
- com/lucidworks/hadoop/pig/SolrStoreFunc.class
- com/lucidworks/hadoop/pig/EpochToCalendar.class
- com/lucidworks/hadoop/pig/Extract.class
- com/lucidworks/hadoop/pig/Histogram.class
There are two approaches to using functions in Pig: REGISTER them in the script, or load them with your Pig command-line request.
If using REGISTER, the Pig function jar must be placed in HDFS in order to be used by your Pig script. It can be located anywhere in HDFS; you can either supply the path in your script or use a variable and define the variable with a -p property definition.
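For example, the REGISTER approach might look like the following sketch. The HDFS path and the pigFunctionsJar parameter name are hypothetical; substitute the location where you copied the jar.
-- Register the Pig functions jar from HDFS; the parameter is passed at launch time,
-- e.g. pig -p pigFunctionsJar=hdfs:///user/pig/solr-pig-functions-{version}.jar myscript.pig
REGISTER $pigFunctionsJar;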
The example below uses the second approach, loading the .jar with the -Dpig.additional.jars system property when launching the script. With this approach, the .jar can be located anywhere on the machine where the script will be run.
When using the Pig functions, you can pass parameters for your script on the command line. You will need to pass the location of Solr and the collection to use; these are shown in detail in the example below.
When a Solr cluster is secured with Kerberos for internode communication, Pig scripts must include the full path to a JAAS file. The JAAS file must contain the service principal and the path to a keytab file, which will be used to index the output of the script to Solr.
Two parameters provide the information the script needs to access the JAAS file:
lww.jaas.file - The path to the JAAS file that includes a section for the service principal who will write to the Solr indexes. For example, to use this property in a Pig script:
set lww.jaas.file '/opt/${namePackage}/conf/login.conf';
The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed).
lww.jaas.appname - The name of the section in the JAAS file that includes the correct service principal and keytab path. For example, to use this property in a Pig script:
set lww.jaas.appname 'Client';
Here is a sample section of a JAAS file:
Client { --(1)
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/data/solr-indexer.keytab" --(2)
storeKey=true
useTicketCache=true
debug=true
principal="[email protected]"; --(3)
};
(1) The name of this section of the JAAS file. This name will be used with the lww.jaas.appname parameter.
(2) The location of the keytab file.
(3) The service principal name. This should be a different principal than the one used for Solr, but must have access to both Solr and Pig.
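Putting these together, the Kerberos-related lines at the top of a Pig script might look like the following sketch; the JAAS file path is hypothetical, and 'Client' matches the section name in the sample JAAS file above.
-- Path is hypothetical; use the location where the JAAS file was copied on every Node Manager node
set lww.jaas.file '/etc/security/jaas/login.conf';
-- 'Client' is the section name from the sample JAAS file above
set lww.jaas.appname 'Client';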
The following Pig script will take a simple CSV file and index it to Solr.
set solr.zkhost '$zkHost';
set solr.collection '$collection'; -- (1)
A = load '$csv' using PigStorage(',') as (id_s:chararray,city_s:chararray,country_s:chararray,code_s:chararray,code2_s:chararray,latitude_s:chararray,longitude_s:chararray,flag_s:chararray); -- (2)
--dump A;
B = FOREACH A GENERATE $0 as id, 'city_s', $1, 'country_s', $2, 'code_s', $3, 'code2_s', $4, 'latitude_s', $5, 'longitude_s', $6, 'flag_s', $7; -- (3)
ok = store B into 'SOLR' using com.lucidworks.hadoop.pig.SolrStoreFunc(); -- (4)
This relatively simple script does several things that help illustrate how the Solr Pig functions work.
(1) This and the line above define parameters that are needed by SolrStoreFunc to know where Solr is. SolrStoreFunc needs the properties solr.zkhost and solr.collection, and these lines map the zkHost and collection parameters we will pass when invoking Pig to the required properties.
(2) Load the CSV file, using the path and name we will pass with the csv parameter. We also define the field names for each column in the CSV file, and their types.
(3) For each item in the CSV file, generate a document ID from the first field ($0) and then define each field name and value in name, value pairs.
(4) Load the documents into Solr using SolrStoreFunc. While we don't need to define the location of Solr here, the function will use the solr.zkhost and solr.collection properties set from the zkHost and collection parameters we pass when we invoke the Pig script.
Warning: When using SolrStoreFunc, the document ID must be the first field.
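For example, if you wanted to use a different column as the document ID, you would reorder the fields in the FOREACH ... GENERATE statement so the ID comes first. The following is a hypothetical variation of line (3) above that uses the code_s column ($3) as the document ID:
-- Hypothetical variation: generate the document id from the fourth CSV column (code_s)
-- and keep it as the first field, followed by name/value pairs for the remaining fields.
B = FOREACH A GENERATE $3 as id, 'city_s', $1, 'country_s', $2, 'latitude_s', $5, 'longitude_s', $6, 'flag_s', $7;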
To run this script, we invoke Pig and define the parameters referenced in the script with the -p option, as in this command:
./bin/pig -Dpig.additional.jars=/path/to/solr-pig-functions.jar -p csv=/path/to/my/csv/airports.dat -p zkHost=zknode1:2181,zknode2:2181,zknode3:2181/solr -p collection=myCollection ~/myScripts/index-csv.pig
The parameters to pass are:
csv - The path and name of the CSV file we want to process.
zkHost - The ZooKeeper connection string for a SolrCloud cluster, in the form of zkhost1:port,zkhost2:port,zkhost3:port/chroot. In the script, we mapped this to the solr.zkhost property, which is required by SolrStoreFunc to know where to send the output documents.
collection - The Solr collection to index into. In the script, we mapped this to the solr.collection property, which is required by SolrStoreFunc to know the Solr collection the documents should be indexed to.
Tip: The zkHost parameter is needed only when you are indexing to a SolrCloud cluster. If, however, you are not using SolrCloud, you can pass the location of a standalone Solr instance instead. In the script, you would change the line that maps solr.zkhost to the zkHost parameter so that it points to your standalone Solr instance.
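As a sketch of what that change might look like: the solr.server.url property and the solr parameter name below are assumptions based on other Lucidworks Hadoop connector configuration keys, not something confirmed by this document, so verify them against the documentation for your version.
-- Hypothetical: replace the 'set solr.zkhost' line with a standalone Solr URL.
-- Property name solr.server.url and the $solr parameter are assumptions; verify for your version.
set solr.server.url '$solr';
You would then invoke Pig with a parameter such as -p solr=http://solrhost:8983/solr instead of the zkHost parameter.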