HBase: The Definitive Guide by Lars George
Author: Lars George
Language: eng
Format: epub, pdf
Tags: COMPUTERS / Data Modeling & Design
ISBN: 9781449396138
Publisher: O'Reilly Media
Published: 2011-08-29T16:00:00+00:00
Fixed column mapping
The row key must be the first field and cannot be placed anywhere else. You can work around this with a subsequent FOREACH...GENERATE statement that reorders the fields of the relation.
Check with the Pig project site to see if these features have since been added.
Cascading
Cascading is an alternative API to MapReduce. Under the covers, it uses MapReduce during execution, but during development, users don’t have to think in MapReduce to create solutions for execution on Hadoop.
The model used is similar to a real-world pipe assembly, where data sources are taps, and outputs are sinks. These are piped together to form the processing flow, where data passes through the pipe and is transformed in the process. Pipes can be connected into larger pipe assemblies, so that more complex processing pipelines are built from existing ones.
Data then streams through the pipeline and can be split, merged, grouped, or joined. The data is represented as tuples, forming a tuple stream through the assembly. This very visually oriented model makes building MapReduce jobs more like construction work, while abstracting the complexity of the actual work involved.
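To make the tap, pipe, and sink vocabulary more concrete, the following is a minimal sketch of the same idea in plain Java. It does not use the Cascading API: the class PipeAssemblySketch, the Pipe interface, and the two sample pipes are hypothetical names invented for this illustration, and tuples are simply modeled as lists of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy model of a pipe assembly in plain Java. None of these names are
// Cascading classes; they only mirror the tap -> pipe -> sink idea.
class PipeAssemblySketch {

    // A "pipe" transforms one tuple stream into another.
    interface Pipe extends Function<List<List<String>>, List<List<String>>> {}

    // Split each single-field tuple (a line) into one tuple per word.
    static final Pipe SPLIT_WORDS = tuples -> tuples.stream()
        .flatMap(t -> Arrays.stream(t.get(0).split("\\s+")))
        .filter(w -> !w.isEmpty())
        .map(w -> List.of(w))
        .collect(Collectors.toList());

    // Group tuples by their first field and emit (word, count) tuples,
    // sorted by word because a TreeMap is used for the grouping.
    static final Pipe COUNT_BY_WORD = tuples -> {
        Map<String, Long> counts = tuples.stream()
            .collect(Collectors.groupingBy(t -> t.get(0), TreeMap::new, Collectors.counting()));
        return counts.entrySet().stream()
            .map(e -> List.of(e.getKey(), String.valueOf(e.getValue())))
            .collect(Collectors.toList());
    };

    // Connect a source "tap" (a list of lines) through the pipes, in order,
    // and collect the result in a list that plays the role of the sink.
    static List<List<String>> run(List<String> sourceTap, Pipe... assembly) {
        List<List<String>> stream = sourceTap.stream()
            .map(line -> List.of(line))
            .collect(Collectors.toList());
        for (Pipe pipe : assembly) {
            stream = pipe.apply(stream);
        }
        return new ArrayList<>(stream);
    }

    public static void main(String[] args) {
        List<List<String>> sink = run(List.of("a b a", "b c"), SPLIT_WORDS, COUNT_BY_WORD);
        sink.forEach(System.out::println); // prints [a, 2] then [b, 2] then [c, 1]
    }
}
```

Each pipe only knows about the tuple stream it receives, which is what allows existing pipes to be chained into larger assemblies. In Cascading itself, the planner turns such an assembly into MapReduce jobs; the sketch above runs everything in memory.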
Cascading (as of version 1.0.1) has support for reading and writing data to and from an HBase cluster. Detailed information and access to the source code can be found on the Cascading Modules page (http://www.cascading.org/modules.html).
Example 6-2 shows how to sink data into an HBase cluster. See the GitHub repository, linked from the modules page, for more up-to-date API information.
Example 6-2. Using Cascading to insert data into HBase
// read data from the default filesystem
// emits two fields: "offset" and "line"
Tap source = new Hfs(new TextLine(), inputFileLhs);

// store data in an HBase cluster, accepts fields "num", "lower", and "upper"
// will automatically scope incoming fields to their proper familyname,
// "left" or "right"
Fields keyFields = new Fields("num");
String[] familyNames = {"left", "right"};
Fields[] valueFields = new Fields[] {new Fields("lower"), new Fields("upper") };
Tap hBaseTap = new HBaseTap("multitable",
  new HBaseScheme(keyFields, familyNames, valueFields), SinkMode.REPLACE);

// a simple pipe assembly to parse the input into fields
// a real app would likely chain multiple Pipes together for more complex
// processing
Pipe parsePipe = new Each("insert", new Fields("line"),
  new RegexSplitter(new Fields("num", "lower", "upper"), " "));

// "plan" a cluster executable Flow
// this connects the source Tap and hBaseTap (the sink Tap) to the parsePipe
Flow parseFlow = new FlowConnector(properties).connect(source, hBaseTap, parsePipe);

// start the flow, and block until complete
parseFlow.complete();

// open an iterator on the HBase table we stuffed data into
TupleEntryIterator iterator = parseFlow.openSink();

while(iterator.hasNext()) {
  // print out each tuple from HBase
  System.out.println( "iterator.next() = " + iterator.next() );
}

iterator.close();
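The field-to-column scoping described in the comments of Example 6-2 can be pictured with a small, self-contained sketch. It uses neither Cascading nor HBase: the class FieldMappingSketch and its toRow() method are hypothetical, and the assumption that each value field becomes a column named family:field (for example, left:lower) is an illustrative reading of the example, not a statement about how HBaseScheme is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch only: mimics how the fields of Example 6-2 could end up
// as HBase cells. The family:qualifier layout (qualifier = field name) is an
// assumption made for illustration.
class FieldMappingSketch {

    // Same pairing as in the example: "lower" -> "left", "upper" -> "right".
    static final String[] FAMILY_NAMES = {"left", "right"};
    static final String[] VALUE_FIELDS = {"lower", "upper"};

    // Parse a line of the form "num lower upper" and return
    // rowKey -> {column -> value}, with "num" used as the row key.
    static Map<String, Map<String, String>> toRow(String line) {
        String[] parts = line.split(" ");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 fields: " + line);
        }
        Map<String, String> cells = new LinkedHashMap<>();
        for (int i = 0; i < VALUE_FIELDS.length; i++) {
            cells.put(FAMILY_NAMES[i] + ":" + VALUE_FIELDS[i], parts[i + 1]);
        }
        Map<String, Map<String, String>> row = new LinkedHashMap<>();
        row.put(parts[0], cells);
        return row;
    }

    public static void main(String[] args) {
        // prints {1={left:lower=a, right:upper=A}}
        System.out.println(toRow("1 a A"));
    }
}
```

The split on a single space mirrors the " " pattern passed to RegexSplitter in the example, and the first field plays the role of keyFields.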
