POC #: Generate Analytics from a Product-Based Company Web Log
This POC is based on analysing the web log of a product-based company.
Industry: E-Commerce
1. Load the weblog data into HDFS using the HDFS client.
2. Develop a Pig program to load the log and perform analytics on IP, Category-1, Category-2, page and status_code (a hypothetical sample of the input format follows this list).
2.1. Count of page views by individual user, i.e. [IP, count(*)]
2.2. Top/bottom 2: Category-1 / Category-2 / page / users (exclude status codes other than 200)
Top 2 and bottom 2 records:
· Category, total_number_views
· page, total_number_views
· IP, total_number_of_views
2.3. Total page views / category-wise page views / unique page views
· page, total_number_of_views
· category, total_views
· page, total_number_of_unique_views
2.4. Count of status codes 200 / 404 / 400 / 500
· status_code, count
3. Load the results into tables in a MySQL database using Sqoop.
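For reference, weblog.txt is a comma-delimited text file whose columns match the Pig schema used below (IP, Category-1, Category-2, page, status_code). The rows in this sketch are hypothetical, invented purely to illustrate the record shape:

# Hypothetical sample input -- the values are made up for illustration;
# only the column order (IP,Category-1,Category-2,page,status_code) is real.
cat > weblog.txt <<'EOF'
10.0.0.1,Electronics,Mobiles,101,200
10.0.0.2,Clothing,Shirts,102,404
10.0.0.1,Electronics,Laptops,103,200
EOF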
CODING DETAILS
Pig Script
weblog.pig
/* Pig program to extract data for the following KPIs */
FILE1 = LOAD '/home/bhavesh/weblog.txt' USING PigStorage(',') AS (IP:chararray, Category1:chararray, Category2:chararray, page:int, statuscode:int);

/* Count of page views by individual user, i.e. (IP, count(*)) */
GRPD = GROUP FILE1 BY IP;
IPCOUNT = FOREACH GRPD GENERATE group, COUNT(FILE1.IP);
STORE IPCOUNT INTO '/home/bhavesh/Count_by_Pageview_by_IndividualUser/' USING PigStorage(',');
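/* Each STORE in this script writes comma-delimited part files under the
   named HDFS directory; this one emits lines of the form <IP>,<count>. */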
/* Exclude status codes other than 200 */
FILE_STATUS_CODE_200 = FILTER FILE1 BY statuscode == 200;

/*
Top 2 and bottom 2 records:
Category, total_number_views
page, total_number_views
IP, total_number_of_views
*/
GRPD_CATEGORY1 = GROUP FILE_STATUS_CODE_200 BY Category1;
CNT_FOR_CATEGORY1 = FOREACH GRPD_CATEGORY1 GENERATE group, COUNT(FILE_STATUS_CODE_200.Category1) AS COUNTING;
SORTED_CATEGORY1_DEC = ORDER CNT_FOR_CATEGORY1 BY COUNTING DESC;
TOP_2_CATEGORY1 = LIMIT SORTED_CATEGORY1_DEC 2;
STORE TOP_2_CATEGORY1 INTO '/home/bhavesh/Top2_By_Category1/' USING PigStorage(',');
SORTED_CATEGORY1_ASC = ORDER CNT_FOR_CATEGORY1 BY COUNTING ASC;
BOTTOM_2_CATEGORY1 = LIMIT SORTED_CATEGORY1_ASC 2;
STORE BOTTOM_2_CATEGORY1 INTO '/home/bhavesh/Bottom2_By_Category1/' USING PigStorage(',');
GRPD_CATEGORY2 = GROUP FILE_STATUS_CODE_200 BY Category2;
CNT_FOR_CATEGORY2 = FOREACH GRPD_CATEGORY2 GENERATE group, COUNT(FILE_STATUS_CODE_200.Category2) AS COUNTING;
SORTED_CATEGORY2_DEC = ORDER CNT_FOR_CATEGORY2 BY COUNTING DESC;
TOP_2_CATEGORY2 = LIMIT SORTED_CATEGORY2_DEC 2;
STORE TOP_2_CATEGORY2 INTO '/home/bhavesh/Top2_By_Category2/' USING PigStorage(',');
SORTED_CATEGORY2_ASC = ORDER CNT_FOR_CATEGORY2 BY COUNTING ASC;
BOTTOM_2_CATEGORY2 = LIMIT SORTED_CATEGORY2_ASC 2;
STORE BOTTOM_2_CATEGORY2 INTO '/home/bhavesh/Bottom2_By_Category2/' USING PigStorage(',');
GRPD_PAGES = GROUP FILE_STATUS_CODE_200 BY page;
CNT_FOR_PAGE = FOREACH GRPD_PAGES GENERATE group, COUNT(FILE_STATUS_CODE_200.page) AS COUNTING;
SORTED_PAGE_DEC = ORDER CNT_FOR_PAGE BY COUNTING DESC;
TOP_2_PAGE = LIMIT SORTED_PAGE_DEC 2;
STORE TOP_2_PAGE INTO '/home/bhavesh/Top2_By_PAGE/' USING PigStorage(',');
SORTED_PAGE_ASC = ORDER CNT_FOR_PAGE BY COUNTING ASC;
BOTTOM_2_PAGE = LIMIT SORTED_PAGE_ASC 2;
STORE BOTTOM_2_PAGE INTO '/home/bhavesh/Bottom2_By_PAGE/' USING PigStorage(',');
GRPD_IP = GROUP FILE_STATUS_CODE_200 BY IP;
CNT_FOR_IP = FOREACH GRPD_IP GENERATE group, COUNT(FILE_STATUS_CODE_200.IP) AS COUNTING;
SORTED_IP_DEC = ORDER CNT_FOR_IP BY COUNTING DESC;
TOP_2_IP = LIMIT SORTED_IP_DEC 2;
STORE TOP_2_IP INTO '/home/bhavesh/Top2_By_IP/' USING PigStorage(',');
SORTED_IP_ASC = ORDER CNT_FOR_IP BY COUNTING ASC;
BOTTOM_2_IP = LIMIT SORTED_IP_ASC 2;
STORE BOTTOM_2_IP INTO '/home/bhavesh/Bottom2_By_IP/' USING PigStorage(',');
/* Total page views / category-wise page views / unique page views
page, total_number_of_views
category, total_views
page, total_number_of_unique_views */
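/* Note: unlike the top/bottom-2 KPIs above, these totals are computed
   over FILE1, i.e. across all status codes, not just 200. */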
GRPD_TOTALPAGES = GROUP FILE1 BY page;
CNT_FOR_TOTALPAGE = FOREACH GRPD_TOTALPAGES GENERATE group, COUNT(FILE1.page) AS COUNTING;
SORTED_PAGES_DEC = ORDER CNT_FOR_TOTALPAGE BY COUNTING DESC;
STORE SORTED_PAGES_DEC INTO '/home/bhavesh/Total_PAGES_Count/' USING PigStorage(',');

GRPD_TOTALCATEGORY1 = GROUP FILE1 BY Category1;
CNT_FOR_TOTALCATEGORY1 = FOREACH GRPD_TOTALCATEGORY1 GENERATE group, COUNT(FILE1.page) AS COUNTING;
SORTED_TOTALCATEGORY1_DEC = ORDER CNT_FOR_TOTALCATEGORY1 BY COUNTING DESC;
STORE SORTED_TOTALCATEGORY1_DEC INTO '/home/bhavesh/Total_Category1_Count/' USING PigStorage(',');

GRPD_TOTALCATEGORY2 = GROUP FILE1 BY Category2;
CNT_FOR_TOTALCATEGORY2 = FOREACH GRPD_TOTALCATEGORY2 GENERATE group, COUNT(FILE1.page) AS COUNTING;
SORTED_TOTALCATEGORY2_DEC = ORDER CNT_FOR_TOTALCATEGORY2 BY COUNTING DESC;
STORE SORTED_TOTALCATEGORY2_DEC INTO '/home/bhavesh/Total_Category2_Count/' USING PigStorage(',');
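/* Unique page views: a nested FOREACH projects each page's bag of IPs,
   de-duplicates it with DISTINCT, and counts the distinct IPs. */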
GRPD_TOTALPAGES_UNIQUEVIEW = GROUP FILE1 BY page;
CNT_FOR_TOTALPAGE_UNIQUEVIEW = FOREACH GRPD_TOTALPAGES_UNIQUEVIEW {
    internet_protocol = FILE1.IP;
    unique_internet_protocol = DISTINCT internet_protocol;
    GENERATE group, COUNT(unique_internet_protocol);
};
STORE CNT_FOR_TOTALPAGE_UNIQUEVIEW INTO '/home/bhavesh/Page_Total_Number_Of_Unique_Views/' USING PigStorage(',');
/*
Count of status codes 200 / 404 / 400 / 500:
status_code, count
*/
GRPD_STATUS = GROUP FILE1 BY statuscode;
STATUS_CODE_COUNT = FOREACH GRPD_STATUS GENERATE group, COUNT(FILE1.statuscode);
STORE STATUS_CODE_COUNT INTO '/home/bhavesh/Status_Code_Count/' USING PigStorage(',');
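A quick manual sanity check, before wiring everything into the wrapper script below (same paths as above; part-file names follow the standard Hadoop part-r-NNNNN convention):

# Run the Pig script and inspect one of the result sets by hand.
pig /home/bhavesh/POC/WEBLOG_POC/weblog.pig
hadoop fs -cat /home/bhavesh/Status_Code_Count/part-*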
Shell Script
#####################################################################
######################### COMPLETE SCRIPT ##########################
### HEADER - PROGRAM NAME - <weblog.sh>                          ###
### AUTHOR - BHAVESH BHADRICHA                                   ###
### DATE - 11/DEC/2015                                           ###
### VERSION - 1.0                                                ###
### DESCRIPTION - Data: information gathered from websites,     ###
### containing IP, two product categories, page and status code ###
###                                                             ###
### Problem Statement: Analyse the data in the Hadoop           ###
### ecosystem to:                                               ###
### 1. Load data into HDFS using the HDFS client                ###
### 2. Develop a PIG program to parse the web logs and extract  ###
###    meaningful results from them                             ###
###    Input file format:                                       ###
###    IP,Category-1,Category-2,page,status_code                ###
###    The PIG program extracts data for the following:         ###
### 3. Count of page views by individual user                   ###
###    IP, count(*)                                             ###
### 4. Top/bottom 2: Category-1 / Category-2 / page / users     ###
###    (exclude status codes other than 200)                    ###
###    Top 2 and bottom 2 records:                              ###
###    Category, total_number_views                             ###
###    page, total_number_views                                 ###
###    IP, total_number_of_views                                ###
### 5. Total page views / category-wise page views /            ###
###    unique page views                                        ###
###    page, total_number_of_views                              ###
###    category, total_views                                    ###
###    page, total_number_of_unique_views                       ###
### 6. Count of status codes 200 / 404 / 400 / 500              ###
###    status_code, count                                       ###
### 7. Load the results into tables in a MySQL database         ###
###    using Sqoop                                              ###
#####################################################################
#####################################################################
### DEFINING THE LOCAL VARIABLES ###
#####################################################################
DATE=$(date +"%Y%m%d_%H%M%S")
LOGFILE="/home/bhavesh/POC/WEBLOG_POC/LOG/"$DATE".log"
#####################################################################
### Load data into HDFS using the HDFS client ###
#####################################################################
hadoop fs -put weblog.txt /home/bhavesh/weblog.txt
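# Note: 'hadoop fs -put' fails if /home/bhavesh/weblog.txt already exists
# in HDFS; when re-running the pipeline, remove the old copy first:
# hadoop fs -rm /home/bhavesh/weblog.txt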
########################## PIG Processing ##########################
### Pig computes the page-view, category, IP and status-code KPIs ###
#####################################################################
echo "Pig Script starts here"
echo "PIG Script, Weblog Processing" >> $LOGFILE
hadoop fs -rmr /home/bhavesh/Total_Category1_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_PAGE
hadoop fs -rmr /home/bhavesh/Top2_By_Category1
hadoop fs -rmr /home/bhavesh/Top2_By_IP
hadoop fs -rmr /home/bhavesh/Total_PAGES_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_IP
hadoop fs -rmr /home/bhavesh/Bottom2_By_Category2
hadoop fs -rmr /home/bhavesh/Status_Code_Count
hadoop fs -rmr /home/bhavesh/Count_by_Pageview_by_IndividualUser
hadoop fs -rmr /home/bhavesh/Page_Total_Number_Of_Unique_Views
hadoop fs -rmr /home/bhavesh/Total_Category2_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_Category1
hadoop fs -rmr /home/bhavesh/Top2_By_Category2
hadoop fs -rmr /home/bhavesh/Top2_By_PAGE
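# Pig's STORE aborts if an output directory already exists, which is why
# every result directory is removed above before the script runs.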
pig /home/bhavesh/POC/WEBLOG_POC/weblog.pig
if [ $? -eq 0 ]; then
    echo "Successfully finished PIG Processing" >> $LOGFILE
else
    echo "PIG Processing Failed, please check the log" >> $LOGFILE
fi
########################## HIVE Processing ##########################
### Hive loads the Pig results into external Hive tables ###
######################################################################
echo "HIVE Script starts here"
echo "HIVE LOAD data into Table" >> $LOGFILE
hive -e 'drop table if exists TotalCategory1Count';
hive -e 'drop table if exists Bottom2ByPAGE';
hive -e 'drop table if exists Top2ByCategory1';
hive -e 'drop table if exists Top2ByIP';
hive -e 'drop table if exists TotalPAGESCount';
hive -e 'drop table if exists Bottom2ByIP';
hive -e 'drop table if exists Bottom2ByCategory2';
hive -e 'drop table if exists StatusCodeCount';
hive -e 'drop table if exists CountbyPageviewbyIndividualUser';
hive -e 'drop table if exists PageTotalNumberOfUniqueViews';
hive -e 'drop table if exists TotalCategory2Count';
hive -e 'drop table if exists Bottom2ByCategory1';
hive -e 'drop table if exists Top2ByCategory2';
hive -e 'drop table if exists Top2ByPAGE';
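# Dropping the tables first keeps the script re-runnable: every run
# recreates each external table from scratch before loading new data.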
hive -e "create external table
TotalCategory1Count
(Category1 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/TotalCategory1Count'";
hive -e "create external table
Bottom2ByPAGE
(Pages int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/Bottom2ByPAGE'";
hive -e "create external table
Top2ByCategory1
(Category1 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/Top2ByCategory1'";
hive -e "create external table Top2ByIP
(IP string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/Top2ByIP'";
hive -e "create external table
TotalPAGESCount
(PAGES int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/TotalPAGESCount'";
hive -e "create external table
Bottom2ByIP
(IP string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/Bottom2ByIP'";
hive -e "create external table
Bottom2ByCategory2
(Category2 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/Bottom2ByCategory2'";
hive -e "create external table
StatusCodeCount
(StatusCode int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/StatusCodeCount'";
hive -e "create external table
CountbyPageviewbyIndividualUser
(IP string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/CountbyPageviewbyIndividualUser'";
hive -e "create external table
PageTotalNumberOfUniqueViews
(page int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/PageTotalNumberOfUniqueViews'";
hive -e "create external table
TotalCategory2Count
(Category2 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/TotalCategory2Count'";
hive -e "create external table
Bottom2ByCategory1
(Category1 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/Bottom2ByCategory1'";
hive -e "create external table
Top2ByCategory2
(Category2 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/Top2ByCategory2'";
hive -e "create external table Top2ByPAGE
(page int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location
'/home/bhavesh/hive/Top2ByPAGE'";
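# Note: 'load data inpath' MOVES each Pig output file into the table's
# HDFS location, so the original /home/bhavesh/<result> directories are
# left empty once the loads below complete.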
hive -e "load data inpath
'/home/bhavesh/Total_Category1_Count/part-r-00000' overwrite into table
TotalCategory1Count";
hive -e "load data inpath
'/home/bhavesh/Bottom2_By_PAGE/part-r-00000' overwrite into table
Bottom2ByPAGE";
hive -e "load data inpath
'/home/bhavesh/Top2_By_Category1/part-r-00000' overwrite into table
Top2ByCategory1";
hive -e "load data inpath
'/home/bhavesh/Top2_By_IP/part-r-00000' overwrite into table Top2ByIP";
hive -e "load data inpath
'/home/bhavesh/Total_PAGES_Count/part-r-00000' overwrite into table
TotalPAGESCount";
hive -e "load data inpath
'/home/bhavesh/Bottom2_By_IP/part-r-00000' overwrite into table
Bottom2ByIP";
hive -e "load data inpath
'/home/bhavesh/Bottom2_By_Category2/part-r-00000' overwrite into table
Bottom2ByCategory2";
hive -e "load data inpath
'/home/bhavesh/Status_Code_Count/part-r-00000' overwrite into table
StatusCodeCount";
hive -e "load data inpath
'/home/bhavesh/Count_by_Pageview_by_IndividualUser/part-r-00000' overwrite into
table CountbyPageviewbyIndividualUser";
hive -e "load data inpath
'/home/bhavesh/Page_Total_Number_Of_Unique_Views/part-r-00000' overwrite into
table PageTotalNumberOfUniqueViews";
hive -e "load data inpath
'/home/bhavesh/Total_Category2_Count/part-r-00000' overwrite into table
TotalCategory2Count";
hive -e "load data inpath
'/home/bhavesh/Bottom2_By_Category1/part-r-00000' overwrite into table
Bottom2ByCategory1";
hive -e "load data inpath
'/home/bhavesh/Top2_By_Category2/part-r-00000' overwrite into table
Top2ByCategory2";
hive -e "load data inpath
'/home/bhavesh/Top2_By_PAGE/part-r-00000' overwrite into table
Top2ByPAGE";
########################## SQOOP Processing ##########################
### Pushing the HIVE table data into RDBMS tables via SQOOP ###
#######################################################################
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table TotalCategory1Count --export-dir /home/bhavesh/hive/TotalCategory1Count/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table Bottom2ByPAGE --export-dir /home/bhavesh/hive/Bottom2ByPAGE/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table Top2ByCategory1 --export-dir /home/bhavesh/hive/Top2ByCategory1/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table Top2ByIP --export-dir /home/bhavesh/hive/Top2ByIP/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table TotalPAGESCount --export-dir /home/bhavesh/hive/TotalPAGESCount/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table Bottom2ByIP --export-dir /home/bhavesh/hive/Bottom2ByIP/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table Bottom2ByCategory2 --export-dir /home/bhavesh/hive/Bottom2ByCategory2/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table StatusCodeCount --export-dir /home/bhavesh/hive/StatusCodeCount/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table CountbyPageviewbyIndividualUser --export-dir /home/bhavesh/hive/CountbyPageviewbyIndividualUser/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table PageTotalNumberOfUniqueViews --export-dir /home/bhavesh/hive/PageTotalNumberOfUniqueViews/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table TotalCategory2Count --export-dir /home/bhavesh/hive/TotalCategory2Count/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table Bottom2ByCategory1 --export-dir /home/bhavesh/hive/Bottom2ByCategory1/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table Top2ByCategory2 --export-dir /home/bhavesh/hive/Top2ByCategory2/part-r-00000 \
  --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root \
  --table Top2ByPAGE --export-dir /home/bhavesh/hive/Top2ByPAGE/part-r-00000 \
  --input-fields-terminated-by ',';
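One caveat worth noting: sqoop export does not create the target tables, so each MySQL table must exist before the corresponding export runs. A minimal sketch for one of them, mirroring the Hive column types, plus a quick row check (credentials and database name taken from the --connect strings above):

# Create the target table before its export, then verify rows arrived.
mysql -u root -proot mysql -e "CREATE TABLE IF NOT EXISTS StatusCodeCount (StatusCode INT, countings INT);"
mysql -u root -proot mysql -e "SELECT * FROM StatusCodeCount LIMIT 5;"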