What Is Data Science? by Mike Loukides

Author: Loukides, Mike; Date: August 6, 2019

Language: eng
Format: epub
Publisher: O'Reilly Media
Published: 2011-04-09T16:00:00+00:00

Storing data is only part of building a data platform, though. Data is only useful if you can do something with it, and enormous datasets present computational problems. Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the “map” stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors; the intermediate results are then combined by a single reduce task. In hindsight, MapReduce seems like an obvious solution to Google’s biggest problem, creating large searches. It’s easy to distribute a search across thousands of processors, and then combine the results into a single set of answers. What’s less obvious is that MapReduce has proven to be widely applicable to many large data problems, ranging from search to machine learning.
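The map and reduce stages described above can be illustrated with a small word-count sketch. This is not Google's or Hadoop's implementation: the function names (`map_task`, `shuffle`, `reduce_task`, `map_reduce`) are hypothetical, and a local thread pool stands in for the computing cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group intermediate pairs by key so each word's counts end up together."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_task(groups):
    """Reduce: combine the grouped intermediate results into final totals."""
    return {word: sum(counts) for word, counts in groups.items()}


def map_reduce(chunks, workers=4):
    # Every chunk gets the identical map subtask. The thread pool is only a
    # stand-in (an assumption of this sketch) for processors in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_task, chunks))
    # A single reduce step combines all of the intermediate results.
    return reduce_task(shuffle(mapped))


chunks = ["big data is big", "data science needs data"]
counts = map_reduce(chunks)
print(counts["data"], counts["big"])
```

Because each map subtask only sees its own chunk, the chunks can be processed in any order and on any machine; only the grouping and reduce steps need to see results from all of them.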

The most popular open source implementation of MapReduce is the Hadoop project. Yahoo’s claim that they had built the world’s largest production Hadoop application, with 10,000 cores running Linux, brought it onto center stage. Many of the key Hadoop developers have found a home at Cloudera, which provides commercial support. Amazon’s Elastic MapReduce makes it much easier to put Hadoop to work without investing in racks of Linux machines, by providing preconfigured Hadoop images for its EC2 clusters. You can allocate and de-allocate processors as needed, paying only for the time you use them.

Hadoop goes far beyond a simple MapReduce implementation (of which there are several); it’s the key component of a data platform. It incorporates HDFS, a distributed filesystem designed for the performance and reliability requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components. If anything can be called a one-stop information platform, Hadoop is it.

Hadoop has been instrumental in enabling “agile” data analysis. In software development, “agile practices” are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turn-around times. If you start a calculation, it might not finish for hours, or even days. But Hadoop (and particularly Elastic MapReduce) makes it easy to build clusters that can perform computations on large datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It’s easier to consult with clients to figure out whether you’re asking the right questions, and it’s possible to pursue intriguing possibilities that you’d otherwise have to drop for lack of time.

Hadoop is essentially a batch system, but Hadoop Online Prototype (HOP) is an experimental project that enables stream processing. HOP processes data as it arrives, and delivers intermediate results in (near) real-time. Near real-time data analysis enables features like trending topics on sites like Twitter. These features only require soft real-time; reports on trending topics don’t require millisecond accuracy. As with the number of followers on Twitter, a “trending topics” report only needs to be current to within five minutes, or even an hour.
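To make the soft real-time idea concrete, here is a minimal sketch of a trending-topics counter over a sliding time window. It is not how HOP or Twitter work internally: the `TrendingTopics` class and its methods are hypothetical, and the sketch assumes events arrive with non-decreasing timestamps given in seconds.

```python
from collections import Counter, deque


class TrendingTopics:
    """Count topic mentions seen within the last `window_seconds`."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, topic) pairs in arrival order
        self.counts = Counter()  # mentions per topic inside the window

    def add(self, timestamp, topic):
        # Process each event as it arrives, then drop anything too old.
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Remove events that have fallen out of the time window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n=3):
        # Intermediate result: the most-mentioned topics right now.
        return self.counts.most_common(n)


tracker = TrendingTopics(window_seconds=300)
tracker.add(0, "hadoop")
tracker.add(10, "pig")
tracker.add(20, "pig")
print(tracker.top(1))
```

The five-minute default window mirrors the tolerance mentioned above: a report built this way is only approximately current, which is all that soft real-time requires.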