I Heart Logs: Event Data, Stream Processing, and Data Integration by Jay Kreps
Author: Jay Kreps
Language: eng
Tags: Computers, Databases, Data Processing, General
ISBN: 9781491909331
Google: N9iYBAAAQBAJ
Publisher: O'Reilly Media, Inc.
Published: 2014-09-23T21:01:19+00:00
So far, I have only described what amounts to a fancy method of copying data from place to place. However, schlepping bytes between storage systems is not the end of the story. It turns out that “log” is another word for “stream” and logs are at the heart of stream processing.
But wait, what exactly is stream processing?
If you are a fan of database literature or semi-successful data infrastructure products of the late 1990s and early 2000s, you likely associate stream processing with efforts to build a SQL engine or “boxes-and-arrows” interface for event-driven processing.
If you follow the explosion of open source data systems, you likely associate stream processing with some of the systems in this space, for example, Storm, Akka, S4, and Samza. Most people see these as a kind of asynchronous message processing system that is not that different from a cluster-aware RPC layer (and in fact some things in this space are exactly that). I have heard stream processing described as a model where you process all your data immediately and then throw it away.
Both these views are a little limited. Stream processing has nothing to do with SQL. Nor is it limited to real-time processing. There is no inherent reason you can’t process the stream of data from yesterday or a month ago using a variety of different languages to express the computation. Nor must you (or should you) throw away the original data that was captured.
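To make the point about old data concrete, here is a minimal sketch in Python, assuming a hypothetical in-memory log that retains every event it receives. The names `RetainedLog`, `read_from`, and `count_words` are invented for illustration and do not correspond to any particular system's API. The same computation runs over only the newest event or over the whole retained history; the only difference is the offset the reader starts from.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Event:
    offset: int      # position of the event in the log
    timestamp: int   # hypothetical event time, in seconds
    payload: str


class RetainedLog:
    """Hypothetical append-only log that never discards events."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, timestamp: int, payload: str) -> int:
        # New events always go at the end; the offset is the list index.
        offset = len(self._events)
        self._events.append(Event(offset, timestamp, payload))
        return offset

    def read_from(self, offset: int) -> Iterator[Event]:
        # A reader may start anywhere, including the very beginning.
        for event in self._events[offset:]:
            yield event


def count_words(events: Iterable[Event]) -> int:
    # The computation is the same whether the events are new or old.
    return sum(len(event.payload.split()) for event in events)


log = RetainedLog()
log.append(100, "page view")
log.append(200, "ad click")
log.append(300, "page view again")

live_total = count_words(log.read_from(2))      # only the newest event
replayed_total = count_words(log.read_from(0))  # the full retained history
```

Because the original events are kept rather than thrown away, a new or corrected computation can be run later by reading again from an earlier offset.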
I see stream processing as something much broader: infrastructure for continuous data processing. I think the computational model can be as general as MapReduce or other distributed processing frameworks, but with the ability to produce low-latency results.
The real driver for the processing model is the method of data collection. Data collected in batch is naturally processed in batch. When data is collected continuously, it is naturally processed continuously.
The United States census provides a good example of batch data collection. The census periodically kicks off and does a brute-force discovery and enumeration of US residents by having people walk from door to door. This made a lot of sense in 1790, when the census first began (see Figure 3-1). Data collection at the time was inherently batch oriented, as it involved riding around on horseback and writing down records on paper, then transporting this batch of records to a central location where humans added up all the counts. These days, when you describe the census process, the listener immediately wonders why we don’t keep a journal of births and deaths and produce population counts either continuously or with whatever granularity is needed.
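The contrast between the two collection styles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how any census is actually run: the household lists, the `("birth" | "death", person_id)` event format, and the function names are all hypothetical. The batch function recounts everyone from scratch, while the continuous function updates a running count as each journal entry arrives.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical journal entry: (kind, person_id), kind is "birth" or "death".
JournalEntry = Tuple[str, str]


def batch_count(households: List[List[str]]) -> int:
    """Batch style: visit every household and add up all the people."""
    return sum(len(household) for household in households)


def running_population(journal: Iterable[JournalEntry], start: int = 0) -> Iterator[int]:
    """Continuous style: emit an updated count after every journal entry."""
    population = start
    for kind, _person_id in journal:
        if kind == "birth":
            population += 1
        elif kind == "death":
            population -= 1
        else:
            raise ValueError(f"unknown journal entry kind: {kind!r}")
        yield population


# One door-to-door enumeration establishes a baseline.
households = [["ann", "bob"], ["cy"]]
baseline = batch_count(households)

# After that, the journal keeps the count current without a recount.
journal = [("birth", "dee"), ("death", "bob"), ("birth", "eli")]
counts = list(running_population(journal, start=baseline))
```

Here the batch pass yields a baseline of 3, and the journal then produces a fresh count after each entry. Coarser granularity, such as one count per year, can be obtained by sampling the same sequence.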
