Saturday 12 December 2015

POC #: Generate Analytics from a Product-based Company Web Log.

The POC is based on analysing the web log of a product-based company.

Industry: E-Commerce

1. Load the weblog data into HDFS using the HDFS client.
2. Develop a Pig program to load the log and perform analytics on IP, Category-1, Category-2, page and status_code:
   2.1. Count of page views by individual user, i.e. [IP, count(*)]
   2.2. Top/Bottom 2 for Category-1 / Category-2 / page / users (exclude status codes other than 200)
        Top 2 and bottom 2 records:
        ·         Category, total_number_views
        ·         page, total_number_views
        ·         IP, total_number_of_views
   2.3. Total page views / category-wise page views / unique page views
        ·         page, total_number_of_views
        ·         category, total_views
        ·         page, total_number_of_unique_views
   2.4. Count of status codes 200 / 404 / 400 / 500
        ·         status_code, count
3. Load the results into MySQL tables using Sqoop.
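Before wiring up the full pipeline, KPI 2.1 (page views per IP) can be sanity-checked on a couple of made-up log lines with plain awk. The sample records below are purely illustrative of the IP,Category-1,Category-2,page,status_code format; they are not the actual input data.

```shell
#!/bin/sh
# Hypothetical sample records in the weblog format:
# IP,Category-1,Category-2,page,status_code
cat > /tmp/weblog_sample.txt <<'EOF'
10.0.0.1,Electronics,Mobiles,101,200
10.0.0.1,Electronics,Laptops,102,200
10.0.0.2,Clothing,Shirts,103,404
EOF

# Count of page views per IP -- the same aggregation the Pig
# "group by IP" / COUNT step performs below.
awk -F',' '{ count[$1]++ } END { for (ip in count) print ip "," count[ip] }' \
    /tmp/weblog_sample.txt | sort
# prints:
# 10.0.0.1,2
# 10.0.0.2,1
```

This gives a quick expected answer to compare against the Pig output on a small slice of the real log.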

CODING DETAILS

Pig Script

weblog.pig

/* Develop Pig program to extract data for the following KPIs */

FILE1 = LOAD '/home/bhavesh/weblog.txt' using PigStorage(',') as (IP:chararray,Category1:chararray,Category2:chararray,page:int,statuscode:int);
GRPD = group FILE1 by IP;  
IPCOUNT = foreach GRPD generate group,COUNT(FILE1.IP);

/* Count of page views by individual user ie IP, count(*) */
STORE IPCOUNT into '/home/bhavesh/Count_by_Pageview_by_IndividualUser/' using PigStorage(',');

/*(Exclude status code other than 200) */

FILE_STATUS_CODE_200 = FILTER FILE1 by statuscode == 200;

/*
Top 2 and bottom 2 records
Category, total_number_views
page, total_number_views
IP, total_number_of_views
*/

GRPD_CATEGORY1 = group FILE_STATUS_CODE_200 by Category1;
CNT_FOR_CATEGORY1 = FOREACH GRPD_CATEGORY1 generate group,COUNT(FILE_STATUS_CODE_200.Category1) as COUNTING;

SORTED_CATEGORY1_DEC = ORDER CNT_FOR_CATEGORY1 by COUNTING DESC;
TOP_2_CATEGORY1 = limit SORTED_CATEGORY1_DEC 2;
STORE TOP_2_CATEGORY1 into '/home/bhavesh/Top2_By_Category1/' using PigStorage(',');

SORTED_CATEGORY1_ASC = ORDER CNT_FOR_CATEGORY1 by COUNTING ASC;
BOTTOM_2_CATEGORY1 = limit SORTED_CATEGORY1_ASC 2;
STORE BOTTOM_2_CATEGORY1 into '/home/bhavesh/Bottom2_By_Category1/' using PigStorage(',');

GRPD_CATEGORY2 = group FILE_STATUS_CODE_200 by Category2;
CNT_FOR_CATEGORY2 = FOREACH GRPD_CATEGORY2 generate group,COUNT(FILE_STATUS_CODE_200.Category2) as COUNTING;

SORTED_CATEGORY2_DEC = ORDER CNT_FOR_CATEGORY2 by COUNTING DESC;
TOP_2_CATEGORY2 = limit SORTED_CATEGORY2_DEC 2;
STORE TOP_2_CATEGORY2 into '/home/bhavesh/Top2_By_Category2/' using PigStorage(',');

SORTED_CATEGORY2_ASC = ORDER CNT_FOR_CATEGORY2 by COUNTING ASC;
BOTTOM_2_CATEGORY2 = limit SORTED_CATEGORY2_ASC 2;
STORE BOTTOM_2_CATEGORY2 into '/home/bhavesh/Bottom2_By_Category2/' using PigStorage(',');

GRPD_PAGES = group FILE_STATUS_CODE_200 by page;
CNT_FOR_PAGE = FOREACH GRPD_PAGES generate group,COUNT(FILE_STATUS_CODE_200.page) as COUNTING;

SORTED_PAGE_DEC = ORDER CNT_FOR_PAGE by COUNTING DESC;
TOP_2_PAGE = limit SORTED_PAGE_DEC 2;
STORE TOP_2_PAGE into '/home/bhavesh/Top2_By_PAGE/' using PigStorage(',');

SORTED_PAGE_ASC = ORDER CNT_FOR_PAGE by COUNTING ASC;
BOTTOM_2_PAGE = limit SORTED_PAGE_ASC 2;
STORE BOTTOM_2_PAGE into '/home/bhavesh/Bottom2_By_PAGE/' using PigStorage(',');

GRPD_IP = group FILE_STATUS_CODE_200 by IP;
CNT_FOR_IP = FOREACH GRPD_IP generate group,COUNT(FILE_STATUS_CODE_200.IP) as COUNTING;

SORTED_IP_DEC = ORDER CNT_FOR_IP by COUNTING DESC;
TOP_2_IP = limit SORTED_IP_DEC 2;
STORE TOP_2_IP into '/home/bhavesh/Top2_By_IP/' using PigStorage(',');

SORTED_IP_ASC = ORDER CNT_FOR_IP by COUNTING ASC;
BOTTOM_2_IP = limit SORTED_IP_ASC 2;
STORE BOTTOM_2_IP into '/home/bhavesh/Bottom2_By_IP/' using PigStorage(',');

/* Total page views / Category wise pageviews / Unique pageviews
page,total_number_of_views
category, total_views
page, total_number_of_unique_views */

GRPD_TOTALPAGES = group FILE1 by page;
CNT_FOR_TOTALPAGE = FOREACH GRPD_TOTALPAGES generate group,COUNT(FILE1.page) as COUNTING;
SORTED_PAGES_DEC = ORDER CNT_FOR_TOTALPAGE by COUNTING DESC;
STORE SORTED_PAGES_DEC into '/home/bhavesh/Total_PAGES_Count/' using PigStorage(',');

GRPD_TOTALCATEGORY1 = group FILE1 by Category1;
CNT_FOR_TOTALCATEGORY1 = FOREACH GRPD_TOTALCATEGORY1 generate group,COUNT(FILE1.page) as COUNTING;
SORTED_TOTALCATEGORY1_DEC = ORDER CNT_FOR_TOTALCATEGORY1 by COUNTING DESC;
STORE SORTED_TOTALCATEGORY1_DEC into '/home/bhavesh/Total_Category1_Count/' using PigStorage(',');

GRPD_TOTALCATEGORY2 = group FILE1 by Category2;
CNT_FOR_TOTALCATEGORY2 = FOREACH GRPD_TOTALCATEGORY2 generate group,COUNT(FILE1.page) as COUNTING;
SORTED_TOTALCATEGORY2_DEC = ORDER CNT_FOR_TOTALCATEGORY2 by COUNTING DESC;
STORE SORTED_TOTALCATEGORY2_DEC into '/home/bhavesh/Total_Category2_Count/' using PigStorage(',');

GRPD_TOTALPAGES_UNIQUEVIEW = group FILE1 by page;
CNT_FOR_TOTALPAGE_UNIQUEVIEW = FOREACH GRPD_TOTALPAGES_UNIQUEVIEW {
    internet_protocol = FILE1.IP;
    unique_internet_protocol = DISTINCT internet_protocol;
    GENERATE group, COUNT(unique_internet_protocol);
};
STORE CNT_FOR_TOTALPAGE_UNIQUEVIEW into '/home/bhavesh/Page_Total_Number_Of_Unique_Views/' using PigStorage(',');
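The nested FOREACH above de-duplicates the IPs within each page group before counting them. The same idea can be sketched in plain shell on made-up sample lines (IP in field 1, page in field 4): project (page, IP), drop duplicates, then count per page.

```shell
#!/bin/sh
# Hypothetical sample: IP,Category-1,Category-2,page,status_code.
# The first IP views page 101 twice; it should count as one unique view.
cat > /tmp/uniq_sample.txt <<'EOF'
10.0.0.1,Electronics,Mobiles,101,200
10.0.0.1,Electronics,Mobiles,101,200
10.0.0.2,Electronics,Mobiles,101,200
EOF

# Unique views per page -- mirrors DISTINCT inside the nested FOREACH.
awk -F',' '{ print $4 "," $1 }' /tmp/uniq_sample.txt \
    | sort -u \
    | awk -F',' '{ count[$1]++ } END { for (p in count) print p "," count[p] }'
# prints:
# 101,2
```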

/*
Count of status code = 200 / 404 / 400 / 500
status_code, count
*/

GRPD_STATUS = group FILE1 by statuscode;
STATUS_CODE_COUNT = foreach GRPD_STATUS generate group,COUNT(FILE1.statuscode);

STORE STATUS_CODE_COUNT into '/home/bhavesh/Status_Code_Count/' using PigStorage(',');

Shell Script

#####################################################################
### HEADER - PROGRAM NAME - weblog.sh
### AUTHOR - BHAVESH BHADRICHA
### DATE - 11/DEC/2015
### VERSION - 1.0
### DESCRIPTION - The data comprises information gathered from
### websites: IP, two product categories, page and status code.
###
### Problem statement: analyse the data in the Hadoop ecosystem to:
### 1. Load data into HDFS using the HDFS client.
### 2. Develop a Pig program to parse the web logs and extract
###    meaningful results from them.
###    Input file format: IP,Category-1,Category-2,page,status_code
###
### Pig program to extract data for the following:
### 3. Count of page views by individual user: IP, count(*)
### 4. Top/Bottom 2 for Category-1 / Category-2 / page / users
###    (exclude status codes other than 200).
###    Top 2 and bottom 2 records:
###    Category, total_number_views
###    page, total_number_views
###    IP, total_number_of_views
### 5. Total page views / category-wise page views / unique page views:
###    page, total_number_of_views
###    category, total_views
###    page, total_number_of_unique_views
### 6. Count of status codes 200 / 404 / 400 / 500:
###    status_code, count
### 7. Load the results into MySQL tables using Sqoop.
#####################################################################
###DEFINING THE LOCAL VARIABLES###
######################################################################
DATE=$(date +"%Y%m%d_%H%M%S")
LOGFILE="/home/bhavesh/POC/WEBLOG_POC/LOG/"$DATE".log"
#####################################################################
###  Load data into HDFS using HDFS client  ####################################
#####################################################################

hadoop fs -put weblog.txt /home/bhavesh/weblog.txt

########################## PIG Processing ###############################
#### Pig computes the page-view, category, IP and status-code KPIs ####
#####################################################################

echo "Pig Script starts here"

echo "PIG Script,Weblog Processing" >> $LOGFILE

hadoop fs -rmr /home/bhavesh/Total_Category1_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_PAGE
hadoop fs -rmr /home/bhavesh/Top2_By_Category1
hadoop fs -rmr /home/bhavesh/Top2_By_IP
hadoop fs -rmr /home/bhavesh/Total_PAGES_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_IP
hadoop fs -rmr /home/bhavesh/Bottom2_By_Category2
hadoop fs -rmr /home/bhavesh/Status_Code_Count
hadoop fs -rmr /home/bhavesh/Count_by_Pageview_by_IndividualUser
hadoop fs -rmr /home/bhavesh/Page_Total_Number_Of_Unique_Views
hadoop fs -rmr /home/bhavesh/Total_Category2_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_Category1
hadoop fs -rmr /home/bhavesh/Top2_By_Category2
hadoop fs -rmr /home/bhavesh/Top2_By_PAGE

pig /home/bhavesh/POC/WEBLOG_POC/weblog.pig

if [ $? -eq 0 ]; then
    echo "Successfully finished Pig processing" >> $LOGFILE
else
    echo "Pig processing failed, please check the log" >> $LOGFILE
fi


############################ HIVE Processing ###############################
###### Hive loads the Pig output into external Hive tables  ###############
########################################################################

echo "HIVE Script starts here"

echo "HIVE LOAD data into Table " >> $LOGFILE

hive -e 'drop table if exists TotalCategory1Count';
hive -e 'drop table if exists Bottom2ByPAGE';
hive -e 'drop table if exists Top2ByCategory1';
hive -e 'drop table if exists Top2ByIP';
hive -e 'drop table if exists TotalPAGESCount';
hive -e 'drop table if exists Bottom2ByIP';
hive -e 'drop table if exists Bottom2ByCategory2';
hive -e 'drop table if exists StatusCodeCount';
hive -e 'drop table if exists CountbyPageviewbyIndividualUser';
hive -e 'drop table if exists PageTotalNumberOfUniqueViews';
hive -e 'drop table if exists TotalCategory2Count';
hive -e 'drop table if exists Bottom2ByCategory1';
hive -e 'drop table if exists Top2ByCategory2';
hive -e 'drop table if exists Top2ByPAGE';

hive -e "create external table TotalCategory1Count
(Category1 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/TotalCategory1Count'";

hive -e "create external table Bottom2ByPAGE
(Pages int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByPAGE'";

hive -e "create external table Top2ByCategory1
(Category1 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByCategory1'";


hive -e "create external table Top2ByIP
(IP string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByIP'";

hive -e "create external table TotalPAGESCount
(PAGES int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/TotalPAGESCount'";

hive -e "create external table Bottom2ByIP
(IP string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByIP'";

hive -e "create external table Bottom2ByCategory2
(Category2 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByCategory2'";

hive -e "create external table StatusCodeCount
(StatusCode int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/StatusCodeCount'";

hive -e "create external table CountbyPageviewbyIndividualUser
(IP string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/CountbyPageviewbyIndividualUser'";

hive -e "create external table PageTotalNumberOfUniqueViews
(page int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/PageTotalNumberOfUniqueViews'";

hive -e "create external table TotalCategory2Count
(Category2 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/TotalCategory2Count'";

hive -e "create external table Bottom2ByCategory1
(Category1 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByCategory1'";

hive -e "create external table Top2ByCategory2
(Category2 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByCategory2'";

hive -e "create external table Top2ByPAGE
(page int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByPAGE'";

hive -e "load data inpath '/home/bhavesh/Total_Category1_Count/part-r-00000' overwrite into table TotalCategory1Count";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_PAGE/part-r-00000' overwrite into table Bottom2ByPAGE";
hive -e "load data inpath '/home/bhavesh/Top2_By_Category1/part-r-00000' overwrite into table Top2ByCategory1";
hive -e "load data inpath '/home/bhavesh/Top2_By_IP/part-r-00000' overwrite into table Top2ByIP";
hive -e "load data inpath '/home/bhavesh/Total_PAGES_Count/part-r-00000' overwrite into table TotalPAGESCount";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_IP/part-r-00000' overwrite into table Bottom2ByIP";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_Category2/part-r-00000' overwrite into table Bottom2ByCategory2";
hive -e "load data inpath '/home/bhavesh/Status_Code_Count/part-r-00000' overwrite into table StatusCodeCount";
hive -e "load data inpath '/home/bhavesh/Count_by_Pageview_by_IndividualUser/part-r-00000' overwrite into table CountbyPageviewbyIndividualUser";
hive -e "load data inpath '/home/bhavesh/Page_Total_Number_Of_Unique_Views/part-r-00000' overwrite into table PageTotalNumberOfUniqueViews";
hive -e "load data inpath '/home/bhavesh/Total_Category2_Count/part-r-00000' overwrite into table TotalCategory2Count";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_Category1/part-r-00000' overwrite into table Bottom2ByCategory1";
hive -e "load data inpath '/home/bhavesh/Top2_By_Category2/part-r-00000' overwrite into table Top2ByCategory2";
hive -e "load data inpath '/home/bhavesh/Top2_By_PAGE/part-r-00000' overwrite into table Top2ByPAGE";


############################ SQOOP Processing ##############################
###### Pushing the Hive table data into RDBMS tables via Sqoop ############
#########################################################################

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalCategory1Count --export-dir /home/bhavesh/hive/TotalCategory1Count/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByPAGE --export-dir /home/bhavesh/hive/Bottom2ByPAGE/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByCategory1 --export-dir /home/bhavesh/hive/Top2ByCategory1/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByIP --export-dir /home/bhavesh/hive/Top2ByIP/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalPAGESCount --export-dir /home/bhavesh/hive/TotalPAGESCount/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByIP --export-dir /home/bhavesh/hive/Bottom2ByIP/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByCategory2 --export-dir /home/bhavesh/hive/Bottom2ByCategory2/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table StatusCodeCount --export-dir /home/bhavesh/hive/StatusCodeCount/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table CountbyPageviewbyIndividualUser --export-dir /home/bhavesh/hive/CountbyPageviewbyIndividualUser/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table PageTotalNumberOfUniqueViews --export-dir /home/bhavesh/hive/PageTotalNumberOfUniqueViews/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalCategory2Count --export-dir /home/bhavesh/hive/TotalCategory2Count/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByCategory1 --export-dir /home/bhavesh/hive/Bottom2ByCategory1/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByCategory2 --export-dir /home/bhavesh/hive/Top2ByCategory2/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByPAGE --export-dir /home/bhavesh/hive/Top2ByPAGE/part-r-00000 --input-fields-terminated-by ',';
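Note that sqoop export does not create the target tables: each MySQL table must already exist with a column layout matching the exported files. A minimal DDL sketch for one of them, with column names assumed to mirror the corresponding Hive external table:

```sql
-- Assumed target table for the TotalCategory1Count export;
-- the column names and types mirror the Hive schema above.
CREATE TABLE TotalCategory1Count (
  Category1 VARCHAR(255),
  countings INT
);
```

The remaining thirteen tables follow the same pattern, each matching its Hive counterpart.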

Input Data

[screenshot]

Script Execution

[screenshot]

Hive Output

[screenshot]

MySQL Output

[screenshot]