Log Parsing through Hadoop, Hive & Python

One of the primary analyses done on web access logs is cohort analysis, where one needs to pull the user access date and time along with other dimensions such as user, IP, geo data, etc. Here I will be using Hadoop/Hive/Python to pull date and IP data from an access log into Hadoop and run some queries. The example illustrates the use of Hadoop (version 0.20.1) streaming, SerDe, and a Hive (version 0.40) custom mapper plugin (get_access_log_ip).

The steps below load a few thousand rows from "access_log_2010_01_25" into a target table (dw_log_ip_test, the data warehouse access log), extracting the date from a format like DD/Mon/YYYY:HH:MM:SS -0800 down to 'DD/Mon/YYYY', along with the remote IP address, through a Python streaming script.

Step 1: First create a table for the access log (access_log_2010_01_25) and then load data into it.

hive> 
CREATE TABLE access_log_2010_01_25 (
  request_date STRING,
  remote_ip STRING,
  method STRING,
  request STRING,
  protocol STRING,
  user STRING,
  status STRING,
  size STRING,
  time STRING,
  remote_host STRING,
  ts STRING,
  perf STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
"input.regex" = "\\[([^]]+)\\] ([^ ]*) \"([^ ]*) ([^ ]*) ([^ \"]*)\" user=([^ ]*) status=([^ ]*) size=([^ ]*)  time=([^ ]*) host=([^ ]*) timestamp=([^ ]*) perf=([^ ]*)",
"output.format.string" = "%1$s %2$s \"%3$s %4$s %5$s\" user=%6$s status=%7$s size=%8$s  time=%9$s  host=%10$s timestamp=%11$s  perf=%12$s"
)
STORED AS TEXTFILE;


hive> LOAD DATA LOCAL INPATH '/mnt/web_ser101/weblog_server101_20100125_1'   
    >   OVERWRITE INTO TABLE access_log_2010_01_25;
#- After the load, one of the records looks like:
#- 25/Jan/2010:13:14:05 -0800      123.123.123.123   GET     /xmls/public/thumbnail.xml   HTTP/1.1        -       302     250     0  abcd.com   1264454045    - 
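
To see how the RegexSerDe capture groups line up with the table columns, here is a minimal sketch in Python that applies the same input.regex to a hypothetical raw log line (reconstructed from the parsed record above; the exact raw format of your logs may differ):

#!/usr/bin/python
import re

# Same pattern as the "input.regex" SERDEPROPERTIES value above,
# with the SQL string escaping removed.
pattern = (r'\[([^]]+)\] ([^ ]*) "([^ ]*) ([^ ]*) ([^ "]*)" user=([^ ]*) '
           r'status=([^ ]*) size=([^ ]*)  time=([^ ]*) host=([^ ]*) '
           r'timestamp=([^ ]*) perf=([^ ]*)')

# Hypothetical raw line, reconstructed from the parsed record shown above.
raw = ('[25/Jan/2010:13:14:05 -0800] 123.123.123.123 '
       '"GET /xmls/public/thumbnail.xml HTTP/1.1" user=- status=302 '
       'size=250  time=0 host=abcd.com timestamp=1264454045 perf=-')

columns = ['request_date', 'remote_ip', 'method', 'request', 'protocol',
           'user', 'status', 'size', 'time', 'remote_host', 'ts', 'perf']

m = re.match(pattern, raw)
if m:
    # Each capture group becomes one column of access_log_2010_01_25.
    for name, value in zip(columns, m.groups()):
        print(name + ' = ' + value)
else:
    print('no match -- check the regex against your raw log format')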
 
 
Step 2: Create a target test table.
 
hive>  CREATE  TABLE  dw_log_ip_test (dt string, remote_ip string);
 
Step 3: In an editor of your choice, create a simple Python script (get_access_log_ip.py) that pulls the date out of the "date/time" string along with the "remote_ip" address, as follows.
 
#!/usr/bin/python
import sys

# Hive streams each selected row to stdin as tab-separated fields.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    fields = line.split('\t')
    dt = fields[0].split(':')[0]     #-- Get date part, e.g. 25/Jan/2010
    ip = fields[1]                   #-- Get remote IP
    # Emit tab-separated output so Hive maps it back to (dt, remote_ip).
    print(dt + '\t' + ip)
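
Before plugging the script into Hive, it can help to check the parsing logic locally. Below is a minimal sketch, assuming the MAP ... USING clause streams each selected row to the script as tab-separated request_date and remote_ip values (mirroring the record shown in Step 1):

#!/usr/bin/python
# Local check of the mapper logic: one tab-separated (request_date, remote_ip)
# row, the same shape the MAP ... USING clause streams to the script.
sample_row = "25/Jan/2010:13:14:05 -0800\t123.123.123.123"

fields = sample_row.split('\t')
dt = fields[0].split(':')[0]      # -> 25/Jan/2010
ip = fields[1]                    # -> 123.123.123.123
print(dt + "\t" + ip)             # -> 25/Jan/2010<TAB>123.123.123.123

Piping a real row through get_access_log_ip.py should give the same two-column, tab-separated output that Hive then loads into dw_log_ip_test.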


Step 4: Now extract the data into the dw_log_ip_test table, loading only a limited slice of data (10 seconds' worth).


hive> FROM access_log_2010_01_25 L
    > INSERT OVERWRITE TABLE dw_log_ip_test MAP L.request_date, L.remote_ip
    > USING '/home/hadoop/hive/etl/scripts/get_access_log_ip.py' AS dt, remote_ip
    > WHERE L.request_date > '25/Jan/2010:13:11:40'
    >   and L.request_date < '25/Jan/2010:13:11:50';

# Hive outputs some information like:
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Ended Job = job_201001251623_0094, Tracking URL = http://hadoop_main:50030/jobdetails.jsp?jobid=job_201001251623_0094
Kill Command = /usr/local/hadoop/bin/hadoop job -Dmapred.job.tracker=hdfs://hadoop_main:9001 -kill job_201001251623_0094
2010-02-03 18:42:40,793 Stage-1 map = 0%, reduce = 0%
2010-02-03 18:42:53,874 Stage-1 map = 50%, reduce = 0%
2010-02-03 18:43:05,981 Stage-1 map = 100%, reduce = 0%
2010-02-03 18:43:09,012 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201001251623_0094
Ended Job = -416940662, job is filtered out (removed at runtime).
Launching Job 2 out of 2
…
…
Loading data to table dw_log_ip_test
11110 Rows loaded to dw_log_ip_test
OK

#-- Now check the results...
hive> select dt, remote_ip from dw_log_ip_test;

hive> select dt, remote_ip, count(1)
    >   from dw_log_ip_test
    >  group by dt, remote_ip;
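
The closing GROUP BY query is the cohort-style rollup: hit counts per (date, remote IP). As an illustration only, here is the same aggregation sketched in plain Python over a few hypothetical (dt, remote_ip) pairs:

#!/usr/bin/python
from collections import Counter

# Hypothetical (dt, remote_ip) rows, the shape produced by the mapper above.
rows = [
    ('25/Jan/2010', '123.123.123.123'),
    ('25/Jan/2010', '123.123.123.123'),
    ('25/Jan/2010', '10.0.0.7'),
]

# Equivalent of: select dt, remote_ip, count(1) ... group by dt, remote_ip
for (dt, ip), n in Counter(rows).items():
    print(dt + '\t' + ip + '\t' + str(n))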
   
