Thursday, October 8, 2015

Building a Spark runnable application package using SBT and IBM BigInsights 4



As a continuation of my older blog post about building a standalone Spark job and running it (see the link below):
Running spark application in a standalone mode

I have decided to create another post about how to build and run your Spark application on a commercial, YARN-enabled Hadoop distribution, since most of us will probably not configure Hadoop from scratch and will instead use some kind of commercial distribution. I have used the IBM BigInsights 4.0 Quick Start Edition (now called IBM IOP for Hadoop) for this purpose.

This post describes the following steps:

Step 1: install SBT on the target machine (Ubuntu Linux)
Step 2: code the simple program (yarn-client compatible; a sketch appears at the end of this post)
Step 3: copy the file into the SBT-enabled system
Step 4: create and edit the simpleCluster.sbt (a sketch appears at the end of this post)
Step 5: create the mkDirStructure.sh to automate the directory creation
Step 6: run the mkDirStructure.sh
Step 7: package the spark application
Step 8: create the input on BigInsights system
Step 9: move the jar to the BigInsights driver machine
Step 10: run the spark application on the BigInsights machine

You can download the full document from here:
Building a Spark runnable application package using SBT and IBM BigInsighst 4.pdf
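
For orientation, here is a minimal sketch of what the simpleCluster.sbt build definition could look like. The Scala and Spark version numbers below are only placeholders (they must match the versions shipped with your BigInsights cluster), and marking spark-core as "provided" keeps the cluster's own Spark libraries out of the packaged jar:

// simpleCluster.sbt -- a hedged sketch; the version numbers are placeholders
name := "SimpleCluster"

version := "1.0"

scalaVersion := "2.10.4"

// "provided" excludes Spark itself from the jar, because the BigInsights
// cluster already supplies its own Spark runtime at execution time
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"

And here is a hedged sketch of the kind of simple program the post packages. The object name, the HDFS path, and the word-count logic are illustrative placeholders rather than the exact code from the PDF; the important point for yarn-client compatibility is that no master URL is hard-coded in the SparkConf, so spark-submit can supply it (for example, --master yarn-client) when you run the jar on the BigInsights machine:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical yarn-client friendly word count. The master URL is deliberately
// not set in the SparkConf; spark-submit supplies it at run time.
object SimpleCluster {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleCluster")
    val sc = new SparkContext(conf)

    // Placeholder HDFS path; replace it with an input file that exists on the cluster
    val lines = sc.textFile("hdfs:///user/spark/input.txt")

    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}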

Thursday, September 17, 2015

Building a Spark runnable standalone application package using SBT and Ubuntu Linux


As everyone in the world of computing knows, Apache Spark is one of the most interesting and talked-about projects in today's open source community. Yet, although Apache Spark gets so much attention, it is still a long way from being a "user friendly" application, and one of the areas in which it is a bit lacking is the build process.

When building your own standalone Java application you would typically use something like Apache Ant, or the built-in tools within your IDE, to generate the required jar file or any other construct that is needed. Both of these subjects are covered in detail in many books and documents.
Apache Spark, on the other hand, is typically built with a tool called "sbt", which is short for "simple build tool".

This build tool is mainly used within the Scala ecosystem, and in some cases can become quite "not simple". More about the tool here: http://www.scala-sbt.org/

When I started working with Apache Spark I had some major issues with sbt and its integration with Spark, so in order to help others avoid going through the same problems I had, I have decided to post this hands-on tutorial.


This tutorial covers the following steps:
  • Step 1: install SBT on the target machine
  • Step 2: code the simple program (a sketch appears after this list)
  • Step 3: copy the file into the SBT enabled system 
  • Step 4: create the input text file at /home/spark/input.txt
  • Step 5: create and edit the simple.sbt file 
  • Step 6: create the mkDirStructure.sh to automate the directory creation
  • Step 7: run the mkDirStructure.sh
  • Step 8: package the spark application
  • Step 9: run the spark application

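To give a feel for Step 2 and Step 4, here is a hedged sketch of a standalone "simple program"; the object name and the counting logic are illustrative placeholders, not the exact code from the tutorial. It reads the input text file created in Step 4 and runs against the local master, which is enough for a standalone test:

import org.apache.spark.{SparkConf, SparkContext}

// A hedged sketch of a standalone Spark application; names and logic are placeholders.
object SimpleApp {
  def main(args: Array[String]): Unit = {
    // A local master is fine here because the job runs standalone on one machine
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)

    // The input text file created in Step 4
    val logData = sc.textFile("/home/spark/input.txt").cache()

    val numAs = logData.filter(_.contains("a")).count()
    val numBs = logData.filter(_.contains("b")).count()

    println(s"Lines with a: $numAs, lines with b: $numBs")
    sc.stop()
  }
}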


Sunday, August 9, 2015

Connecting IBM QRadar SIEM with a Java client for event collection

What is IBM QRadar?

IBM® Security QRadar® SIEM consolidates log source event data from thousands of devices, endpoints, and applications distributed throughout a network. It performs immediate normalization and correlation activities on raw data to distinguish real threats from false positives.

What is Syslog?

In computing, syslog is a widely used standard for message logging. It permits separation of the software that generates messages, the system that stores them, and the software that reports and analyzes them.

Computer system designers may use syslog for system management and security auditing as well as general informational, analysis, and debugging messages. A wide variety of devices, such as printers and routers, and message receivers across many platforms use the syslog standard. This permits the consolidation of logging data from different types of systems in a central repository. Implementations of syslog exist for many operating systems.



Each message is labeled with a facility code and assigned a severity level. The facility code indicates the type of software that generated the message.


Messages may be directed to various destinations, selected by facility and severity, including the console, files, remote syslog servers, or relays.
Most implementations provide a command line utility, often called logger, as well as a link library, to send messages to the log.

How can we submit events to QRadar?


The simplest way to send events to QRadar is to write LEEF-formatted messages to syslog.



LEEF format formula:



LEEF:LEEFVersion|Vendor|Product|ProductVersion|EventID|Key1=Value1<tab>Key2=Value2<tab>Key3=Value3<tab>...<tab>KeyN=ValueN

LEEF Event format example:

Jan 18 11:07:53 192.168.1.1 LEEF:1.0|QRadar|QRM|1.0|NEW_PORT_DISCOVERD|src=172.5.6.67 dst=172.50.123.1 sev=5 cat=anomaly msg=this is a message

More about LEEF format here:
LEEF Format in IBM QRadar
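
To make the tab-separated attribute section concrete, here is a minimal sketch of a helper that assembles such an event string (written in Scala, like the build examples earlier on this blog); the object and function names are hypothetical, and the sample values are borrowed from the example above:

// A hedged sketch of building a LEEF event string: the header fields are
// pipe-separated, the event attributes are tab-separated key=value pairs.
object LeefFormat {
  def leefEvent(vendor: String, product: String, productVersion: String,
                eventId: String, attributes: Seq[(String, String)]): String = {
    val header = s"LEEF:1.0|$vendor|$product|$productVersion|$eventId|"
    header + attributes.map { case (k, v) => s"$k=$v" }.mkString("\t")
  }

  def main(args: Array[String]): Unit = {
    // Sample values borrowed from the example event above
    val event = leefEvent("QRadar", "QRM", "1.0", "NEW_PORT_DISCOVERD",
      Seq("src" -> "172.5.6.67", "dst" -> "172.50.123.1",
          "sev" -> "5", "cat" -> "anomaly", "msg" -> "this is a message"))
    println(event)
  }
}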

Definition of a "log source" in IBM QRadar

Through the Admin tab, go to "Log Sources" and define a new log source, as shown in the screenshots below:

[Screenshots: defining the new log source from the QRadar Admin tab]

Sending events to QRadar using Apache log4j



import java.net.InetAddress;
import java.net.UnknownHostException;

import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;
import org.apache.log4j.net.SyslogAppender;

public class QRadarSink {

    private static final String LOGGER_NAME = "SYSLOG";

    // The layout rebuilds the syslog header (timestamp plus host name) in front
    // of the LEEF payload, since the appender's own header is disabled below.
    private static final String PATTERN_PREFIX = "%d{MMM dd HH:mm:ss}  ";
    private static final String PATTERN_POSTFIX = "      %m%n";

    private Logger logger = Logger.getLogger(QRadarSink.class.getName());

    public QRadarSink(String destinationHost, String port)
            throws UnknownHostException {

        String hostname = InetAddress.getLocalHost().getHostName();
        PatternLayout layout = new PatternLayout(PATTERN_PREFIX + hostname + PATTERN_POSTFIX);

        // Configure a log4j SyslogAppender that forwards every logged record
        // to the QRadar host over syslog.
        SyslogAppender syslog = new SyslogAppender();
        syslog.setName(LOGGER_NAME);
        syslog.setSyslogHost(destinationHost + ":" + port);
        syslog.setFacilityPrinting(false);
        syslog.setHeader(false);
        syslog.setLayout(layout);
        syslog.activateOptions();
        logger.addAppender(syslog);
    }

    public void pushEventToQradar(String message) {
        logger.error(message);
    }

    public static void main(String[] args) throws Exception {
        QRadarSink sink = new QRadarSink("9.148.5.113", "514");
        // LEEF attributes must be separated by tab characters (\t).
        sink.pushEventToQradar("LEEF:1.0|InfoSphereStreams|PedictiveBlacklisting|1.0|NEW_EVENT_DISCOVERD|"
                + "src=206.64.49.42\tdst=172.50.123.1\tdevTime=Jul 20 2015 14:05:20\t"
                + "devTimeFormat=MMM dd yyyy HH:mm:ss\tproto=4\tsev=9\tfilterMatched=nonMatched");
    }
}


QRadar Log Activity


Monday, July 27, 2015

Securing Hadoop environments with MIT Kerberos, OpenLDAP, and IBM BigInsights 3.0.0.2

Big Data environments are characterized by a multiplicity of technologies, distributed data repositories, and parallel computation systems with different deployment models. With all that complexity, organizations want to maintain data privacy and to ensure that the data will not be exposed to unauthorized parties.
Organizations also need to provide a unified security mechanism that allows single sign-on, ensuring that any service connected to the data cluster goes through the authentication process before being permitted to access the data.
Like other distributed systems, Big Data clusters share the same security weaknesses: it is demanding to ensure that parties are who they claim to be and to verify client applications before they join the cluster and access the data that resides on federated systems.
This article describes the series of steps required to set up an IBM Big Data environment that uses Kerberos for host validation and authentication of client applications. The environment settings were based on the requirements of an IBM customer, as described in the next section of this article.

System Requirements:

The following are the system requirements for this tutorial:

  • The system must manage a large number of documents and the metadata for those documents. The documents are classified into a variety of different topics and categories.
  • The system should handle many different document types (such as HTML, PDF, spreadsheets, etc.) that originate from many different systems.
  • The system should provide a federated search that considers the documents as well as the relevant topics that are associated with them.
  • The document categories are mapped to different authorization groups. Users belonging to those groups will have access to the corresponding documents.
  • Metadata is added throughout the document's life cycle.

The Proof of Concept (PoC) documented in this article demonstrates the ability to apply a single sign-on mechanism to a subset of the proposed environment, using a Kerberos ticket to authenticate hosts, users, and add-on services to the BigInsights Hadoop cluster.



Contents of this document:

Background 
Topology solution and hosts 
Installation prerequisites: 
Setting up users and groups in OpenLDAP:
Step 1: Setting up the Linux machines:
1. Host name setup:
Host name requirements:
Host resolution:
2. Passwordless SSH for the root user
3. Install LDAP client (on each Linux node)
4. Install DB2 prerequisites (on each Linux node) 
5. Install Kerberos V5 client libraries on each of the Linux machines (4 total) 
6. Install various prerequisites
7. Disable IPV6 on all nodes 
8. Disable firewall 
9. Disable Selinux 
10. Create disks for data store 
11. Configure Sudo permissions for admin user: 
12. Configure limits.conf on each BI node: 
13. Configure /etc/ssh/sshd_config on each BI node 
14. Configure pam_ldap module
15. Configure SSHD at /etc/pam.d/sshd 
16. Configure System auth at /etc/pam.d/system-auth 
17. Configure the LDAP client configuration at /etc/openldap/ldap.conf
18. Configure name service daemon at /etc/nslcd.conf 
19. Configure name service switch at /etc/nsswitch.conf 
20. Configure pam_ldap.conf at /etc/pam_ldap.conf 
21. Copy certs from openLDAP server to all of the BigInsights nodes 
22. Start local name service daemon (nslcd)
Step 2: Setting up IBM JDK and JCE: 
Download and Install IBM JDK and JCE on Linux servers: 
Step 3: Open LDAP time synchronization 
Step 4: Configuring Kerberos client on all BigInsights nodes 
1. /etc/krb5.conf on each of your Linux machines (4 total) 
2. Add Kerberos service definitions to each /etc/services (all Linux machines) 
Step 5: Creating and deploying host keytabs 
1. Create the host keytabs 
2. Configure the sssd (security daemon) file on each node
3. Caching enablement 
4. Deploy, initialize, and test the host keytabs
Step 6: Create the service Keytabs: 
Step 7: Initialize the service keytabs 
Step 8: Create the cluster hosts file for the BigInsights installer 
Step 9: Run BigInsights installer prechecker 
Step 10: BigInsights installation 
Prefix 1: Complete users LDIF file
Prefix 2: Complete groups LDIF file 
Prefix 3: Complete hosts LDIF file 


The following article complements this one; it explains how to set up Kerberos on Microsoft Active Directory:

IBM Kerberos Automation Toolkit for hadoop
An automation toolkit is available for download to ease setting up this environment. The latest version of the toolkit can be downloaded from this location:


This article was made possible by the joint work of me and Roman Zeltser.