Sunday, December 15, 2013

Map/Reduce Programming Paradigm Example

MapReduce is the heart of Hadoop. It is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions.

The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

Let’s start with a word count example:
Assume we have 3 web documents, and each document contains some web text. A real application won’t be quite so simple: it is likely to contain millions or even billions of rows, and they might not be neatly formatted rows at all. In fact, no matter how big or small the amount of data we need to analyze, the key principles covered here remain the same.

Doc 1 : Sachin is playing 200th Test.
Doc 2 : Dhoni is batting with Sachin in his 50th Test. Test is going good.
Doc 3 : Sachin and Dhoni playing Test Cricket Match in Mumbai.

In the map phase each sentence is split into words, and every word is emitted with the count 1, forming the initial key value pairs as
Doc 1 :
<Sachin,1>
<is,1>
<playing,1>
<200th,1>
<Test,1>

Doc 2 :
<Dhoni,1>
<is,1>
<batting,1>
<with,1>
<Sachin,1>
<in,1>
<his,1>
<50th,1>
<Test,1>
<Test,1>
<is,1>
<going,1>
<good,1>

Doc 3:
<Sachin,1>
<and,1>
<Dhoni,1>
<playing,1>
<Test,1>
<Cricket,1>
<Match,1>
<in,1>
<Mumbai,1>

In the reduce phase the pairs are grouped together by key and the values for the same key are added up, so the output key value pairs would be
<Sachin,3>
<and,1>
<is,3>
<playing,2>
<200th,1>
<Test,4>
<Dhoni,2>
<batting,1>
<with,1>
<in,2>
<his,1>
<50th,1>
<going,1>
<good,1>
<Cricket,1>
<Match,1>
<Mumbai,1>

This gives the number of occurrences of each word in the input. Thus reduce forms an aggregation phase over the keys.

The point to be noted here is that the mapper first executes completely on the entire data set, splitting the words and forming the initial key value pairs. Only after this entire process is completed does the reducer start.


This Map/Reduce program divides into 3 parts :
Part 1 : Mapper Class
Part 2 : Reducer Class
Part 3 : Driver Class, which will trigger the Map/Reduce job.

Part 1 : Code of Mapper Class :
The functionality of the map method is as follows
 Step 1 : Create an IntWritable variable ‘one’ with value 1
 Step 2 : Convert the input line from Text type to a String
 Step 3 : Use a tokenizer to split the line into words
 Step 4 : Iterate through each word and form key value pairs as
          a) Assign each word from the tokenizer (of String type) to a Text ‘word’
          b) Form key value pairs for each word as <word,one> and push them to the output collector



import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
      /*Hadoop supported data types. The data types provided here are Hadoop specific types designed for operational efficiency, suited for massively parallel and lightning fast read/write operations*/
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      /*The map method performs the tokenizing job and frames the initial key value pairs.
        Mapper<LongWritable, Text, Text, IntWritable> refers to the data types of the input and output key value pairs specific to the mapper, or rather the map method,
        i.e. Mapper<Input Key Type, Input Value Type, Output Key Type, Output Value Type>. */
      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            //take one line at a time and tokenize it
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);

            //iterate through all the words available in that line and form the key value pairs
            while (tokenizer.hasMoreTokens())
            {
                  word.set(tokenizer.nextToken());
                  //send to the output collector, which in turn passes the pairs on to the reducer
                  output.collect(word, one);
            }
      }
}



Part 2 : Code of Reducer Class :
The functionality of the reduce method is as follows
   Step a) Initialize a variable ‘sum’ as 0
   Step b) Iterate through all the values for a key and sum them up
   Step c) Push the key and the obtained sum as the value to the output collector


import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
      /*The reduce method accepts the key value pairs from the mappers, aggregates the values for each key and produces the final output*/
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            int sum = 0;
            /*iterate through all the values available for a key, add them together and
              emit the key along with the sum of its values*/
            while (values.hasNext())
            {
                  sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
      }
}

Part 3 : Code of Driver Class, which will trigger the Map/Reduce job.

The functionality of the driver is as follows
   Step a) Create a JobConf object
   Step b) Set the output key and value classes for the Map/Reduce job on the configuration object
   Step c) Provide the mapper and reducer class names
   Step d) Provide the input and output HDFS directory paths for the Map/Reduce job

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;


public class WordCountDriver extends Configured implements Tool{
      public int run(String[] args) throws Exception
      {
            //creating a JobConf object and assigning a job name for identification purposes
            JobConf conf = new JobConf(getConf(), WordCountDriver.class);
            conf.setJobName("WordCount");

            //Setting configuration object with the Data Type of output Key and Value
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            //Providing the mapper and reducer class names
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            //the hdfs input and output directory to be fetched from the command line
            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
            return 0;
      }
     
      public static void main(String[] args) throws Exception
      {
            int res = ToolRunner.run(new Configuration(), new WordCountDriver(),args);
            System.exit(res);
      }
}
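To actually run the job, the three classes need to be compiled against the Hadoop jar, packaged, and submitted with the hadoop command. A minimal sketch for a Hadoop 1.x style installation (the jar names and HDFS paths below are placeholders; adjust them to your setup):

# placeholder jar name and HDFS paths - adjust to your installation
javac -classpath $HADOOP_HOME/hadoop-core-*.jar -d wordcount_classes WordCountMapper.java WordCountReducer.java WordCountDriver.java
jar -cvf wordcount.jar -C wordcount_classes .
bin/hadoop jar wordcount.jar WordCountDriver /user/hduser/wordcount/input /user/hduser/wordcount/output

The last two arguments are the HDFS input directory (containing the text files to count) and an output directory that must not already exist; the word counts appear in the part-* files under the output directory.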


Friday, November 8, 2013

Steps to run Presto on linux m/c

Facebook announced on 6th November 2013 that it is committing its Presto low-latency, SQL-compliant query system for Hadoop to open source. Facebook has one of the largest data warehouses in the world, now surpassing 300 petabytes. That sheer scale has forced Facebook to invent its own tools for working with variably-structured data at high scale. This move brings yet another fast query option to Hadoop.


Steps to run Presto on linux m/c  :


Step 1 : First install Hadoop and Hive. Hadoop installation can be followed from the blog post "Setup Hadoop".
Setup Hive as :
Step a : Download a Hive release from : http://mirror.reverse.net/pub/apache/hive/ and extract it
Step b : Export the path of the Hive folder as :
export HIVE_HOME=/home/hduser/hadoop/hive-0.11
Step c : Add $HIVE_HOME/bin to your PATH:
export PATH=$HIVE_HOME/bin:$PATH
Step d : Create a directory on HDFS for Hive as :
bin/hadoop fs -mkdir /home/hduser/hadoop-workDir/warehouse
On my machine HDFS is located at "/home/hduser/hadoop-workDir/"; adjust the path according to the HDFS location in your Hadoop installation.
Step e : To use the hive command line interface (cli) from the shell:
$HIVE_HOME/bin/hive
Step f : A simple example of creating a table in Hive :
hive> CREATE TABLE hiveExample (eId INT, eName STRING);
Step g : hive> SHOW TABLES;

Step 2 : Download and deploy Presto as per the instructions shown at :
http://prestodb.io/docs/current/installation/deployment.html
Start the Presto server as :
bin/launcher run

Step 3 : Download presto-cli-0.52-executable.jar (from http://prestodb.io/docs/current/installation/cli.html), rename it to presto, make it executable (chmod +x), then run it:
./presto --server localhost:8080 --catalog hive --schema default

Step 4 : From another prompt, start the Hive metastore. Use the same port here that you specified in "etc/catalog/hive.properties" in Step 2 (a minimal sketch of that file is shown below) :
bin/hive --service metastore -p 9083
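For reference, a minimal sketch of what etc/catalog/hive.properties can look like. The connector.name value depends on your Hadoop distribution (e.g. hive-hadoop1 or hive-cdh4); treat these exact values as an example and check the deployment link in Step 2:

# example values - adjust connector.name and the metastore host/port to your setup
connector.name=hive-cdh4
hive.metastore.uri=thrift://localhost:9083

The port at the end of hive.metastore.uri (9083 here) is the one that must match the -p option passed to the metastore above.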

Step 5 : At the Presto command prompt shown in Step 3, run the command "show tables"; it will list the table (hiveExample) we created in Hive.
Most Hive operations can be run through Presto, and it is very fast compared to using Hive directly.

Tuesday, October 29, 2013

Steps to run K-Means Clustering on Apache Mahout

For the k-means example on Mahout, I am downloading the Reuters dataset from : http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz

I am assuming you have already configured Apache Mahout and Hadoop. Otherwise, first set up Hadoop and Mahout (the Hadoop setup and Apache Mahout setup are explained in previous blog posts).

Steps to run the K-Means algorithm on Apache Mahout using this Reuters dataset :

Step 1 : untar this file as : tar -xvzf reuters21578.tar.gz

Step 2 : Extract the Reuters articles into individual text files using the Lucene benchmark utility as :
mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters reuters-out

Step 3 : Copy the "reuters-out" output folder from the local file system to the Hadoop file system as :
hadoop dfs -put reuters-out /home/hduser/hadoop-workDir/reuters-out

Step 4 : Mahout understands sequence files/directories. So, convert the "reuters-out" folder into a sequence-file directory as :
mahout seqdirectory -i /home/hduser/hadoop-workDir/reuters-out -o /home/hduser/hadoop-workDir/reuters-out-seqdir -c UTF-8 -chunk 5

-c         : The name of the character encoding of the input files. Defaults to UTF-8.
-chunk  : The chunk size in megabytes. Defaults to 64.

Step 5 : Convert these files of sequence directory to vectors as :
mahout seq2sparse -i /home/hduser/hadoop-workDir/reuters-out-seqdir -o /home/hduser/hadoop-workDir/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector

--maxDFPercent : The max percentage of docs for the DF. Can be used to remove really high-frequency terms. Expressed as an integer between 0 and 100. Default is 99. Useful for removing junk words / stop words that repeat many times in the documents.

--namedVector : (Optional) Whether the output vectors should be NamedVectors. Setting this helps to get the vector name (i.e., the file name) in the cluster output.

Step 6:  Run k-means clustering on these vectors as :
mahout kmeans -i /home/hduser/hadoop-workDir/reuters-out-seqdir-sparse-kmeans/tfidf-vectors -c /home/hduser/hadoop-workDir/reuters-kmeans-clusters -o /home/hduser/hadoop-workDir/reuters-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

-dm : Tells which distance measure to use. "org.apache.mahout.common.distance.CosineDistanceMeasure" is just one type of distance measure. Mahout ships with several implementations, and we can use any of them according to our requirements; some are :
EuclideanDistanceMeasure, MahalanobisDistanceMeasure, ManhattanDistanceMeasure, etc.
-x : Maximum number of iterations.
-k  : The number of clusters (the value of k for k-means); 20 is used in this example. For choosing k, go through : http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set or http://datachurn.blogspot.in/2013/05/clustering-tips.html

Step 7 : Dump the clustered points to the local machine to analyze which documents belong to which clusters as :
mahout seqdumper -i /home/hduser/hadoop-workDir/reuters-kmeans/clusteredPoints/part-m-00000 -o resultFile.txt

Step 8 : Get a cleaner result from this dump file with the Java code below :

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.Map;


public class mahoutClusterOutput {

    String inputFile = "/home/hduser/hadoop/mahout-0.8/reuters-kmeans-sgm/resultFile.txt";
    String outputFile = "/home/tarun/resultFileCluster.txt";

    // maps cluster id -> comma-separated list of document names belonging to that cluster
    LinkedHashMap<String, String> mpIdData = new LinkedHashMap<String, String>();

    public void fileRead()
    {
        try
        {
            FileInputStream fstream = new FileInputStream(inputFile);
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = "";
            int lineNumber = 0;
            while ((strLine = br.readLine()) != null)
            {
                lineNumber++;

                // skip the seqdumper header lines and any line too short to hold a clustered point
                if (lineNumber < 3 || strLine.length() < 20)
                {
                    continue;
                }

                // the cluster id sits in the second ':'-separated field and the document name
                // in the fifth field, before the '=' that starts the vector values
                String[] tempStr = strLine.split(":");
                if (mpIdData.containsKey(tempStr[1]))
                {
                    String val = mpIdData.get(tempStr[1]) + "," + tempStr[4].split("=")[0].replace("/", "");
                    mpIdData.put(tempStr[1], val);
                }
                else
                {
                    String val = tempStr[4].split("=")[0].replace("/", "");
                    mpIdData.put(tempStr[1], val);
                }
            }
            in.close();

            // write one line per cluster: the cluster id followed by the documents it contains
            FileWriter f1 = new FileWriter(outputFile, true);
            BufferedWriter out = new BufferedWriter(f1);
            for (Map.Entry<String, String> entry : mpIdData.entrySet())
            {
                out.write("Cluster Id : " + entry.getKey());
                out.write("-------> Documents : " + entry.getValue());
                out.write("\n");
            }
            out.close();
        }
        catch (Exception e)
        {
            System.out.println("Exception message : " + e.getMessage());
        }
    }

    public static void main(String[] args)
    {
        mahoutClusterOutput ob = new mahoutClusterOutput();
        ob.fileRead();
    }
}


 

Wednesday, October 9, 2013

Ensemble Learning

Ensemble learning builds a prediction model by combining the strengths of a collection of simpler base models.
Bagging, Boosting and Random Forests fall into the category of ensemble learning.
Ensemble learning can be broken down into two tasks:
            1) Developing a population of base learners from the training data.
            2) Combining them to form the composite predictor.

Boosting and Regularization Paths :

Boosting is a technique that goes a step further; it builds an ensemble model by conducting a regularized and supervised search in a high-dimensional space of weak learners.

Before going forward, let’s start with some terminology :

Least Squares :
            Least squares is a standard approach to the approximate solution of overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the errors made in the results of every single equation.
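To make this concrete, for a linear model with coefficient vector $\beta$ and observations $(x_i, y_i)$, the least-squares estimate can be written as

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{T}\beta \right)^2$$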

Regularized versions :
Regularization, in the field of machine learning, refers to a process of introducing additional information in order to prevent overfitting.
           
 Ridge regression :
                        Ridge regression adds a constraint that $\lVert \beta \rVert_2$, the L2-norm of the parameter vector, is not greater than a given value. Equivalently, it may solve an unconstrained minimization of the least-squares objective with a penalty $\alpha \lVert \beta \rVert_2^2$ added, where $\alpha$ is a constant.

Lasso method :

An alternative regularized version of least squares is Lasso (least absolute shrinkage and selection operator), which uses the constraint that $\lVert \beta \rVert_1$, the L1-norm of the parameter vector, is no greater than a given value. (As above, this is equivalent to an unconstrained minimization of the least-squares objective with a penalty $\alpha \lVert \beta \rVert_1$ added.)
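Written in the unconstrained (penalized) form, with the same $\alpha$ as above, the two estimators differ only in the penalty term:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{T}\beta \right)^2 + \alpha \lVert \beta \rVert_2^2$$

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{T}\beta \right)^2 + \alpha \lVert \beta \rVert_1$$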


One of the prime differences between Lasso and ridge regression is that in ridge regression, as the penalty is increased, all parameters are reduced while still remaining non-zero, while in Lasso, increasing the penalty will cause more and more of the parameters to be driven to zero.


Bet on Sparsity Principle :

Boosting’s forward stagewise strategy with shrinkage approximately minimizes the same loss function with a Lasso-style L1 penalty.

However, the sometimes superior performance of boosting over procedures such as the support vector machine may be largely due to the implicit use of the L1 versus L2 penalty. The shrinkage resulting from the L1 penalty is better suited to sparse situations, where there are few basis functions with nonzero coefficients.

If the true coefficients are dense (for example, a large number of small, Gaussian-distributed nonzero coefficients), then in a Bayesian sense the best predictor is ridge regression. That is, we should use an L2 rather than an L1 penalty when fitting the coefficients.
On the other hand, if there are only a small number (e.g., 1,000) of coefficients that are nonzero, the lasso (L1 penalty) will work better. We think of this as a sparse scenario, while the first case (Gaussian coefficients) is dense.

In other words, use of the L1 penalty follows what we call the “bet on sparsity” principle for high-dimensional problems: use a procedure that does well in sparse problems, since no procedure does well in dense problems.



Monday, July 22, 2013

Regression Model Analysis

Regression model analysis is commonly done using the RMSE (Root Mean Square Error) value.
An RMSE of zero, meaning that the estimator $\hat{\theta}$ predicts observations of the parameter $\theta$ with perfect accuracy, is the ideal, but is practically never possible.
The regression line predicts the average y value associated with a given x value. Note that it is also necessary to get a measure of the spread of the y values around that average. To do this, we use the root-mean-square error (r.m.s. error).
To construct the r.m.s. error, you first need to determine the residuals. Residuals are the differences between the actual values and the predicted values, denoted by

$$e_i = y_i - \hat{y}_i$$

where $y_i$ is the observed value for the $i$th observation and $\hat{y}_i$ is the predicted value.

They can be positive or negative, depending on whether the predicted value under- or over-estimates the actual value. Squaring the residuals, averaging the squares, and taking the square root gives us the r.m.s. error. You then use the r.m.s. error as a measure of the spread of the y values about the predicted y value.
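Putting that recipe into a single formula, for $n$ observations:

$$\text{r.m.s. error} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$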






Tuesday, June 11, 2013

Data Preparation and Predictive Modeling Mistakes in M/C Learning and How to Avoid These Mistakes?


1) Mistake (Including ID Fields as Predictors) :
Because most IDs look like continuous integers (and older IDs are typically smaller), it is possible that they may make their way into the model as predictive variables. Be sure to exclude them as early in the process as possible to avoid any confusion while building the model.

2) Mistake(Using Anachronistic Variables) :
Make sure that no predictor variables contain information about the outcome. Because models are built using historical data, it is possible that some of the variables accessible when you build your model would not have been available at the point in time the model is meant to reflect.

3) Mistake(Selecting the wrong Y-variable) :
When building your dataset for a logistic regression model, you’ll want to select the response with the smaller number of data points as your y-variable. A great example of this from the higher-ed world comes from building a retention model: in most cases, we’ll actually want to model attrition, identifying those individuals who are likely to leave (hopefully the smaller group) rather than those who are likely to stay.

4) Mistake(Allowing Duplicate Records) :
Don’t include duplicates in a model file. Including just two records per person gives that person twice as much predictive power. To make sure that each person’s influence counts equally, only one record per person or action being modeled should be included.

5) Mistake(Modeling on Too Small of a Population) :
Double-check your population size. A good goal to shoot for in a modeling dataset is at least 1,000 records spanning three years. Including at least three years helps to account for any year-to-year fluctuations in our dataset. The larger our population size is, the more robust our model will be.

6) Mistake(Judging the quality of a model using one measure) :
It’s a mistake to judge the quality of a model with a single measure. We should use methods such as cross-validation and look at more than one evaluation metric.

7) Mistake(Not Accounting for Outliers and/or Missing Values) :
Be sure to account for any outliers and/or missing values. Large gaps in individual variables can add up when we are combining those variables to build a predictive model. Checking the minimum and maximum values for each variable is a quick way to spot any records that are out of the usual realm.

8) Mistake(Failing to consider enough variables) :
When deciding which variables to audition for a model, we want to include anything we have on hand that we think could possibly be predictive; leaving potentially useful variables out is the mistake. Weeding out the extra variables is something the modeling program will do, so don’t be afraid to throw the kitchen sink at it on your first pass.

Monday, June 10, 2013

Underfitting/Overfitting Problem in M/C learning


Underfitting : If our algorithm works badly even on the points in our training data set, then the algorithm is underfitting the data set. This can be checked easily through the cost function. The cost function in linear regression is half the mean squared error; for example, if the mean squared error is C, the cost function is 0.5C. If in an experiment the cost ends up high even after many iterations, then chances are we have an underfitting problem, and we can say that the learning algorithm is not a good fit for the problem. Underfitting is also known as high bias (a strong bias towards its hypothesis). In other words, the hypothesis space the learning algorithm explores is too small to properly represent the data.
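For reference, the cost function referred to here (half the mean squared error over $m$ training examples, for a hypothesis $h_\theta$) can be written as

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$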

How to avoid underfitting :
More data will not generally help; it will, in fact, likely increase the training error. Instead, we should add more features, because that expands the hypothesis space. This includes making new features from existing features. In the same way, adding more parameters may also expand the hypothesis space.

Overfitting : If our algorithm works well on the points in our data set but not on new points, then the algorithm is overfitting the data set. Overfitting can be checked easily by splitting the data set so that 90% of the data is in our training set and 10% in a cross-validation set. Train on the training set, then measure the cost on the cross-validation set. If the cross-validation cost is much higher than the training cost, then chances are we have an overfitting problem. In other words, the hypothesis space is too large, and perhaps some features are faking the learning algorithm out.

How to avoid overfitting :
To avoid overfitting, add regularization if there are many features. Regularization forces the magnitudes of the parameters to be smaller (shrinking the hypothesis space). For this, add a new term to the cost function,

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

which penalizes the magnitudes of the parameters $\theta_j$; the larger the regularization parameter $\lambda$, the more strongly the parameters are shrunk towards zero.