Thursday, October 8, 2015

Building a Spark runnable application package using SBT and IBM BigInsights 4



As a continuation of my older blog post about building a standalone Spark job and running it (see the link below):
Running spark application in a standalone mode

I have decided to create another post about how to build and run your Spark application on a commercial, YARN-enabled Hadoop distribution, since most of us will probably not configure Hadoop from scratch and will instead use some kind of commercial distribution. I have used the IBM BigInsights 4.0 Quick Start Edition (now called IBM IOP for Hadoop) for this purpose.

This post describes the following steps:

Step 1: install SBT on the target machine (Ubuntu Linux)
Step 2: code the simple program (yarn-client compatible; a sketch appears at the end of this post)
Step 3: copy the file into the SBT-enabled system
Step 4: create and edit the simpleCluster.sbt (a sketch appears at the end of this post)
Step 5: create the mkDirStructure.sh to automate the directory creation
Step 6: run the mkDirStructure.sh
Step 7: package the spark application
Step 8: create the input on BigInsights system
Step 9: move the jar to the BigInsights driver machine
Step 10: run the spark application on the BigInsights machine

You can download the full document from here:
Building a Spark runnable application package using SBT and IBM BigInsighst 4.pdf
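
For orientation, here is a minimal sketch of what the simpleCluster.sbt build definition could look like. The Scala and Spark version numbers below are only placeholders (they must match the versions shipped with your BigInsights cluster), and marking spark-core as "provided" keeps the cluster's own Spark libraries out of the packaged jar:

// simpleCluster.sbt -- a hedged sketch; the version numbers are placeholders
name := "SimpleCluster"

version := "1.0"

scalaVersion := "2.10.4"

// "provided" excludes Spark itself from the jar, because the BigInsights
// cluster already supplies its own Spark runtime at execution time
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"

And here is a hedged sketch of the kind of simple program the post packages. The object name, the HDFS path, and the word-count logic are illustrative placeholders rather than the exact code from the PDF; the important point for yarn-client compatibility is that no master URL is hard-coded in the SparkConf, so spark-submit can supply it (for example, --master yarn-client) when you run the jar on the BigInsights machine:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical yarn-client friendly word count. The master URL is deliberately
// not set in the SparkConf; spark-submit supplies it at run time.
object SimpleCluster {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleCluster")
    val sc = new SparkContext(conf)

    // Placeholder HDFS path; replace it with an input file that exists on the cluster
    val lines = sc.textFile("hdfs:///user/spark/input.txt")

    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}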

Thursday, September 17, 2015

Building a Spark runnable standalone application package using SBT and Ubuntu Linux


As everyone in the world of computing knows, Apache Spark is one of the most interesting and talked-about projects in today's open source community. Yet, although Apache Spark gets so much attention, it is still a long way from being a "user friendly" application, and one of the areas in which it is a bit lacking is the build process.

When building your own standalone Java application you would typically use something like Apache Ant, or the built-in tools within your IDE, to generate the required jar file or any other construct that is needed. Both of these subjects are covered in detail in many books and documents.
Apache Spark, on the other hand, is typically built with a tool called "sbt", which is short for "simple build tool".

This build tool is mainly used within the Scala ecosystem, and in some cases can become quite "not simple". More about the tool here: http://www.scala-sbt.org/

When I started working with Apache Spark I had some major issues with sbt and its integration with Spark, so in order to help others avoid going through the same problems I had, I have decided to post this hands-on tutorial.


This tutorial covers the following steps:
  • Step 1: install SBT on the target machine
  • Step 2: code the simple program (a sketch appears after this list)
  • Step 3: copy the file into the SBT enabled system 
  • Step 4: create the input text file at /home/spark/input.txt
  • Step 5: create and edit the simple.sbt file 
  • Step 6: create the mkDirStructure.sh to automate the directory creation
  • Step 7: run the mkDirStructure.sh
  • Step 8: package the spark application
  • Step 9: run the spark application

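To give a feel for Step 2 and Step 4, here is a hedged sketch of a standalone "simple program"; the object name and the counting logic are illustrative placeholders, not the exact code from the tutorial. It reads the input text file created in Step 4 and runs against the local master, which is enough for a standalone test:

import org.apache.spark.{SparkConf, SparkContext}

// A hedged sketch of a standalone Spark application; names and logic are placeholders.
object SimpleApp {
  def main(args: Array[String]): Unit = {
    // A local master is fine here because the job runs standalone on one machine
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)

    // The input text file created in Step 4
    val logData = sc.textFile("/home/spark/input.txt").cache()

    val numAs = logData.filter(_.contains("a")).count()
    val numBs = logData.filter(_.contains("b")).count()

    println(s"Lines with a: $numAs, lines with b: $numBs")
    sc.stop()
  }
}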


Sunday, August 9, 2015

Connecting IBM QRadar SIEM with a Java client for event collection

What is IBM QRadar?

IBM® Security QRadar® SIEM consolidates log source event data from thousands of devices, endpoints, and applications distributed throughout a network. It performs immediate normalization and correlation activities on raw data to distinguish real threats from false positives.

What is Syslog?

In computing, syslog is a widely used standard for message logging. It permits separation of the software that generates messages, the system that stores them, and the software that reports and analyzes them.

Computer system designers may use syslog for system management and security auditing as well as general informational, analysis, and debugging messages. A wide variety of devices, such as printers and routers, and message receivers across many platforms use the syslog standard. This permits the consolidation of logging data from different types of systems in a central repository. Implementations of syslog exist for many operating systems.



Each message is labeled with a facility code and assigned a severity level. The facility code indicates the type of software that generated the message.


Messages may be directed to various destinations, selected by facility and severity, including the console, files, remote syslog servers, or relays.
Most implementations provide a command line utility, often called logger, as well as a link library, to send messages to the log.

How can we submit events to QRadar?


The simplest way to send events to QRadar is to write LEEF-formatted messages to syslog.



LEEF format formula:



LEEF:LEEFVersion|Vendor|Product|ProductVersion|EventID|Key1=Value1<tab>Key2=Value2<tab>Key3=Value3<tab>...<tab>KeyN=ValueN

LEEF Event format example:

Jan 18 11:07:53 192.168.1.1 LEEF:1.0|QRadar|QRM|1.0|NEW_PORT_DISCOVERD|src=172.5.6.67 dst=172.50.123.1 sev=5 cat=anomaly msg=this is a message

More about LEEF format here:
LEEF Format in IBM QRadar
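
To make the tab-separated attribute section concrete, here is a minimal sketch of a helper that assembles such an event string (written in Scala, like the build examples earlier on this blog); the object and function names are hypothetical, and the sample values are borrowed from the example above:

// A hedged sketch of building a LEEF event string: the header fields are
// pipe-separated, the event attributes are tab-separated key=value pairs.
object LeefFormat {
  def leefEvent(vendor: String, product: String, productVersion: String,
                eventId: String, attributes: Seq[(String, String)]): String = {
    val header = s"LEEF:1.0|$vendor|$product|$productVersion|$eventId|"
    header + attributes.map { case (k, v) => s"$k=$v" }.mkString("\t")
  }

  def main(args: Array[String]): Unit = {
    // Sample values borrowed from the example event above
    val event = leefEvent("QRadar", "QRM", "1.0", "NEW_PORT_DISCOVERD",
      Seq("src" -> "172.5.6.67", "dst" -> "172.50.123.1",
          "sev" -> "5", "cat" -> "anomaly", "msg" -> "this is a message"))
    println(event)
  }
}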

Definition of a "log source" in IBM QRadar

Through the Admin tab, go to "Log Sources" and define a new log source, as shown in the screenshots below:

[Screenshots: defining the new log source from the QRadar Admin tab]

Sending events to QRadar using Apache log4j



import java.net.InetAddress;
import java.net.UnknownHostException;

import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;
import org.apache.log4j.net.SyslogAppender;

public class QRadarSink {

    private static final String LOGGER_NAME = "SYSLOG";

    // The layout rebuilds the syslog header (timestamp plus host name) in front
    // of the LEEF payload, since the appender's own header is disabled below.
    private static final String PATTERN_PREFIX = "%d{MMM dd HH:mm:ss}  ";
    private static final String PATTERN_POSTFIX = "      %m%n";

    private Logger logger = Logger.getLogger(QRadarSink.class.getName());

    public QRadarSink(String destinationHost, String port)
            throws UnknownHostException {

        String hostname = InetAddress.getLocalHost().getHostName();
        PatternLayout layout = new PatternLayout(PATTERN_PREFIX + hostname + PATTERN_POSTFIX);

        // Configure a log4j SyslogAppender that forwards every logged record
        // to the QRadar host over syslog.
        SyslogAppender syslog = new SyslogAppender();
        syslog.setName(LOGGER_NAME);
        syslog.setSyslogHost(destinationHost + ":" + port);
        syslog.setFacilityPrinting(false);
        syslog.setHeader(false);
        syslog.setLayout(layout);
        syslog.activateOptions();
        logger.addAppender(syslog);
    }

    public void pushEventToQradar(String message) {
        logger.error(message);
    }

    public static void main(String[] args) throws Exception {
        QRadarSink sink = new QRadarSink("9.148.5.113", "514");
        // LEEF attributes must be separated by tab characters (\t).
        sink.pushEventToQradar("LEEF:1.0|InfoSphereStreams|PedictiveBlacklisting|1.0|NEW_EVENT_DISCOVERD|"
                + "src=206.64.49.42\tdst=172.50.123.1\tdevTime=Jul 20 2015 14:05:20\t"
                + "devTimeFormat=MMM dd yyyy HH:mm:ss\tproto=4\tsev=9\tfilterMatched=nonMatched");
    }
}


QRadar Log Activity


Monday, July 27, 2015

Securing Hadoop environments with MIT Kerberos, OpenLDAP, and IBM BigInsights 3.0.0.2

Big Data environments are characterized by a multiplicity of technologies, distributed data repositories, and parallel computation systems with different deployment models. With all that complexity, organizations want to maintain data privacy and to ensure that the data will not be exposed to unauthorized parties.
Organizations also need to provide a unified security mechanism that allows single sign-on, ensuring that any service connected to the data cluster goes through the authentication process before being permitted to access the data.
Like other distributed systems, Big Data clusters share the same security weaknesses: it is demanding to ensure that parties are who they claim to be and to verify client applications before they join the cluster and access the data that resides on federated systems.
This article describes the series of steps required to set up an IBM Big Data environment that uses Kerberos for host validation and authentication of client applications. The environment settings were based on the requirements of an IBM customer, as described in the next section of this article.

System Requirements:

The following are the system requirements for this tutorial:

  • The system must manage a large number of documents and the metadata for those documents. The documents are classified into a variety of different topics and categories.
  • The system should handle many different document types (such as HTML, PDF, spreadsheets, etc.) that originate from many different systems.
  • The system should provide a federated search that considers the documents as well as the relevant topics that are associated with them.
  • The document categories are mapped to different authorization groups. Users belonging to those groups will have access to the corresponding documents.
  • Metadata is added throughout the document's life cycle.

The Proof of Concept (PoC) documented in this article demonstrates the ability to apply a single sign-on mechanism to a subset of the proposed environment, using a Kerberos ticket to authenticate hosts, users, and add-on services to the BigInsights Hadoop cluster.



Contents of this document:

Background 
Topology solution and hosts 
Installation prerequisites: 
Setting up users and groups in OpenLDAP:
Step 1: Setting up the Linux machines:
1. Host name setup:
Host name requirements:
Host resolution:
2. Passwordless SSH for the root user
3. Install LDAP client (on each Linux node)
4. Install DB2 prerequisites (on each Linux node) 
5. Install Kerberos V5 client libraries on each of the Linux machines (4 total) 
6. Install various prerequisites
7. Disable IPV6 on all nodes 
8. Disable firewall 
9. Disable Selinux 
10. Create disks for data store 
11. Configure Sudo permissions for admin user: 
12. Configure limits.conf on each BI node: 
13. Configure /etc/ssh/sshd_config on each BI node 
14. Configure pam_ldap module
15. Configure SSHD at /etc/pam.d/sshd 
16. Configure System auth at /etc/pam.d/system-auth 
17. Configure the LDAP client configuration at /etc/openldap/ldap.conf
18. Configure name service daemon at /etc/nslcd.conf 
19. Configure name service switch at /etc/nsswitch.conf 
20. Configure pam_ldap.conf at /etc/pam_ldap.conf 
21. Copy certs from openLDAP server to all of the BigInsights nodes 
22. Start local name service daemon (nslcd)
Step 2: Setting up IBM JDK and JCE: 
Download and Install IBM JDK and JCE on Linux servers: 
Step 3: Open LDAP time synchronization 
Step 4: Configuring Kerberos client on all BigInsights nodes 
1. /etc/krb5.conf on each of your Linux machines (4 total) 
2. Add Kerberos service definitions to each /etc/services (all Linux machines) 
Step 5: Creating and deploying host keytabs 
1. Create the host keytabs 
2. Configure the sssd (security daemon) file on each node
3. Caching enablement 
4. Deploy, initialize, and test the host keytabs
Step 6: Create the service Keytabs: 
Step 7: Initialize the service keytabs 
Step 8: Create the cluster hosts file for the BigInsights installer 
Step 9: Run BigInsights installer prechecker 
Step 10: BigInsights installation 
Prefix 1: Complete users LDIF file
Prefix 2: Complete groups LDIF file 
Prefix 3: Complete hosts LDIF file 


The following article complements this one; it explains how to set up Kerberos on Microsoft Active Directory:

IBM Kerberos Automation Toolkit for hadoop
An automation toolkit is available for download to ease setting up this environment. The latest version of the toolkit can be downloaded from this location:


This article was made possible by the joint work of me and Roman Zeltser.