What Is Data Science? by Mike Loukides
Author: Mike Loukides
Language: eng
Format: epub
Publisher: O'Reilly Media
Published: 2011-04-09T16:00:00+00:00
Storing data is only part of building a data platform, though. Data is only useful if you can do something with it, and enormous datasets present computational problems. Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the “map” stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors; in the “reduce” stage, the intermediate results are combined by a single reduce task. In hindsight, MapReduce seems like an obvious solution to Google’s biggest problem: running large searches. It’s easy to distribute a search across thousands of processors and then combine the results into a single set of answers. What’s less obvious is that MapReduce has proven widely applicable to other large data problems, ranging from search to machine learning.
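The map and reduce stages described above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not Google’s or Hadoop’s implementation: the word-count task and the names split_into_chunks, map_task, reduce_task, and word_count are assumptions chosen for the example, and the map tasks run one after another here rather than on separate processors.

```python
from collections import Counter


def split_into_chunks(records, n_chunks):
    """Divide the input into roughly equal chunks, one per (simulated) worker."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_task(chunk):
    """The identical subtask applied to every chunk: count words locally."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_task(partial_counts):
    """The single reduce step: merge all intermediate results."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def word_count(records, n_chunks=4):
    """Hypothetical driver: split, map each chunk, then reduce."""
    chunks = split_into_chunks(records, n_chunks)
    # On a real cluster each map_task would run on a different machine;
    # here they run sequentially to keep the sketch self-contained.
    partials = [map_task(chunk) for chunk in chunks]
    return reduce_task(partials)


if __name__ == "__main__":
    lines = ["big data is big", "data science uses data"]
    print(word_count(lines, n_chunks=2).most_common(2))
```

Because each map task touches only its own chunk, the map stage can be spread across as many workers as there are chunks; only the final merge needs to see all of the intermediate results.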
The most popular open source implementation of MapReduce is the Hadoop project. Yahoo’s claim that they had built the world’s largest production Hadoop application, with 10,000 cores running Linux, brought it onto center stage. Many of the key Hadoop developers have found a home at Cloudera, which provides commercial support. Amazon’s Elastic MapReduce makes it much easier to put Hadoop to work without investing in racks of Linux machines, by providing preconfigured Hadoop images for its EC2 clusters. You can allocate and de-allocate processors as needed, paying only for the time you use them.
Hadoop goes far beyond a simple MapReduce implementation (of which there are several); it’s the key component of a data platform. It incorporates HDFS, a distributed filesystem designed for the performance and reliability requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components. If anything can be called a one-stop information platform, Hadoop is it.
Hadoop has been instrumental in enabling “agile” data analysis. In software development, “agile practices” are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turn-around times. If you start a calculation, it might not finish for hours, or even days. But Hadoop (and particularly Elastic MapReduce) makes it easy to build clusters that can perform computations on large datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It’s easier to consult with clients to figure out whether you’re asking the right questions, and it’s possible to pursue intriguing possibilities that you’d otherwise have to drop for lack of time.
Hadoop is essentially a batch system, but Hadoop Online Prototype (HOP) is an experimental project that enables stream processing. HOP processes data as it arrives, and delivers intermediate results in (near) real-time. Near real-time data analysis enables features like trending topics on sites like Twitter. These features only require soft real-time; reports on trending topics don’t require millisecond accuracy. As with the number of followers on Twitter, a “trending topics” report only needs to be current to within five minutes, or even an hour.
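To make the soft real-time idea concrete, here is a small sketch of a “trending topics” counter over a sliding time window. It does not use HOP or any Hadoop API; the TrendingTopics class, its method names, and the five-minute default window are assumptions made for illustration, and the sketch assumes events arrive in non-decreasing timestamp order.

```python
from collections import Counter, deque


class TrendingTopics:
    """Hypothetical sliding-window counter for a stream of topic mentions."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, topic) pairs, oldest first
        self.counts = Counter()  # mentions per topic inside the window

    def add(self, timestamp, topic):
        """Record one mention as it arrives, then drop stale mentions."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        """Remove mentions that have fallen out of the time window."""
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n=3):
        """Return the n most-mentioned topics currently in the window."""
        return self.counts.most_common(n)


if __name__ == "__main__":
    trends = TrendingTopics(window_seconds=300)
    for ts, topic in [(0, "hadoop"), (10, "pig"), (20, "hadoop"), (30, "hive")]:
        trends.add(ts, topic)
    print(trends.top(2))
```

The report is only as fresh as the most recent event that triggered an expiry pass, which is acceptable under the soft real-time requirement: a result that lags by seconds or minutes is still useful.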
