Big Data Processing with Apache Spark by Manuel Ignacio Franco Galeano



Write a command to start the TCP server on port 12345:

python3 log_socket_producer.py --port 12345

Open two more terminal windows and start two more instances of this server, on ports 9876 and 8765.
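The log_socket_producer.py file is defined in an earlier exercise that this excerpt does not include. For readers following along, here is a minimal hypothetical sketch of such a producer: a TCP server that emits one fake Apache-style log line per second, in the format the consumer below expects. The sample values, the one-second delay, and the single-client handling are all assumptions, not the book's code:

import argparse
import random
import socket
import time

SAMPLE_PATHS = ['/index.html', '/about.html', '/products.html']

def serve(port):
    # Listen on the given port and stream one fake log line per second
    # to the first client that connects (the Spark consumer).
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('localhost', port))
    server.listen(1)
    conn, _ = server.accept()
    while True:
        # Fake log line matching: $ip - [$time] "$request" $status $bytes "$agent"
        line = '10.0.0.%d - [01/Jan/2018:00:00:01] "%s" 200 %d "Mozilla/5.0"\n' % (
            random.randint(1, 254),
            random.choice(SAMPLE_PATHS),
            random.randint(100, 5000),
        )
        conn.sendall(line.encode('utf-8'))
        time.sleep(1)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--port', type=int, required=True)
    serve(parser.parse_args().port)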

Creating a TCP Spark Stream Consumer

Create a Python file named log_socket_consumer.py and import the necessary packages:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import argparse
import re
import os

Write a function that applies a regular expression to every incoming message and returns a dictionary:

def parse_log_entry(msg):
    """
    Parse a log entry from the format
    $ip_addr - [$time_local] "$request" $status $bytes_sent "$http_user_agent"
    to a dictionary
    """
    data = {}
    # Regular expression that parses a log entry; raw strings keep
    # Python from treating the backslashes as escape sequences
    search_term = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+\-\s+\[(.*)]\s+' \
                  r'"(\/[/.a-zA-Z0-9-]+)"\s+(\d{3})\s+(\d+)\s+"(.*)"'
    values = re.findall(search_term, msg)
    if values:
        val = values[0]
        data['ip'] = val[0]
        data['date'] = val[1]
        data['path'] = val[2]
        data['status'] = val[3]
        data['bytes_sent'] = val[4]
        data['agent'] = val[5]
    return data
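Given a line in that format, re.findall returns a single tuple whose fields map onto the dictionary keys; lines that do not match yield an empty dictionary. A quick illustration (the sample line is made up for demonstration, not from the book):

entry = parse_log_entry(
    '127.0.0.1 - [01/Jan/2018:00:00:01] "/index.html" 200 512 "Mozilla/5.0"')
# entry == {'ip': '127.0.0.1', 'date': '01/Jan/2018:00:00:01',
#           'path': '/index.html', 'status': '200',
#           'bytes_sent': '512', 'agent': 'Mozilla/5.0'}

Note that every captured value is a string; status and bytes_sent are not converted to integers by this function.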



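The excerpt stops before the streaming logic, but the imports above suggest how parse_log_entry is meant to be wired into the consumer. A minimal sketch, assuming a one-second batch interval and a --port argument mirroring the producer's; the names and structure here are assumptions, not the book's exact code:

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='TCP Spark stream consumer')
    parser.add_argument('--port', type=int, default=12345)
    args = parser.parse_args()

    sc = SparkContext(appName='LogSocketConsumer')
    ssc = StreamingContext(sc, 1)  # 1-second micro-batches (assumption)

    # Connect to one of the running log producers and parse each line,
    # dropping lines the regular expression did not match
    lines = ssc.socketTextStream('localhost', args.port)
    entries = lines.map(parse_log_entry).filter(lambda entry: entry)

    # Print a sample of each batch to the console
    entries.pprint()

    ssc.start()
    ssc.awaitTermination()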


