SQL on Big Data by Sumit Pal
Author: Sumit Pal
Language: eng
Format: epub, pdf
Publisher: Apress, Berkeley, CA
Queries can be submitted to Impala through either the Impala shell or JDBC/ODBC drivers. Once a query is submitted, the query planner turns the request into a collection of plan fragments, and the coordinator then initiates execution on the remote impalad daemons. Intermediate results are streamed between impalad processes before the final results are streamed back to the client.
The coordinator orchestrates interactions between the impalad processes across all the data nodes and also aggregates the results produced by each data node. Each impalad also exposes a remote procedure call (RPC) interface that other impalad processes can connect to in order to exchange data. The coordinator uses the same interface to assign work to each impalad.
Following query submission by the client, the query is validated before the query engine optimizes it. Validation covers both syntax and semantics: it ensures that the query is well formed and that it makes sense. The metadata needed for this analysis resides in the Hive metastore. After this step, query planning occurs, whereby Impala tries to determine the most efficient way to execute the query and produce the results. The EXPLAIN statement shows the plan that Impala has chosen: it outlines the steps that impalad will perform and gives the relevant details of how the workload will be distributed among the nodes.
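The semantic half of the validation step can be illustrated with a minimal sketch. It assumes a plain dictionary as a stand-in for the table metadata held in the Hive metastore; the names METASTORE, AnalysisError, and analyze are hypothetical and are not part of Impala. It only shows the idea that a query which parses correctly can still be rejected because it refers to tables or columns that do not exist.

```python
# Hypothetical stand-in for table metadata that would come from the Hive metastore.
METASTORE = {
    "sales": {"region", "amount"},
    "customers": {"id", "name"},
}


class AnalysisError(Exception):
    """Raised when a query refers to something the metadata does not know about."""


def analyze(table, columns, metastore=METASTORE):
    """Semantic check: the table must exist and every column must belong to it."""
    if table not in metastore:
        raise AnalysisError(f"unknown table: {table}")
    missing = sorted(set(columns) - metastore[table])
    if missing:
        raise AnalysisError(f"unknown column(s) in {table}: {', '.join(missing)}")
    return True


# SELECT region, amount FROM sales  -> passes the check
ok = analyze("sales", ["region", "amount"])

# SELECT price FROM sales  -> syntactically fine, semantically invalid
try:
    analyze("sales", ["price"])
    error_message = None
except AnalysisError as exc:
    error_message = str(exc)
```

A real analyzer works on a parsed query tree and also checks types, functions, and privileges; this sketch covers only name resolution.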
Optimization involves generating the best physical plan for the query, in terms of execution cost, and generating code for the query so that it runs faster on the hardware. The optimized query is submitted to the coordinator, which orchestrates its execution: it assigns work units to the impalad processes, manages the interactions between Impala nodes, and aggregates the results once each impalad process has produced its part. The coordinator is part of the impalad process and is present on every node, so any node can act as the query coordinator.
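The coordinator's role of assigning work and then aggregating the per-node results can be sketched as a scatter-gather loop. This is an illustrative simulation, not Impala code: the node names, the PARTITIONS data, and the functions run_fragment and coordinate are all assumptions made for the example, and threads stand in for remote impalad processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds one partition of (region, amount) rows.
PARTITIONS = {
    "node1": [("east", 10), ("west", 5)],
    "node2": [("east", 7), ("north", 3)],
    "node3": [("west", 1)],
}


def run_fragment(rows):
    """Work done locally on one node: partial SUM(amount) GROUP BY region."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial


def coordinate(partitions):
    """Coordinator: hand one fragment to each node, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_fragment, partitions.values()))
    final = {}
    for partial in partials:
        for region, subtotal in partial.items():
            final[region] = final.get(region, 0) + subtotal
    return final


result = coordinate(PARTITIONS)
```

Each node reduces its own rows first, so only small partial aggregates travel to the coordinator rather than the raw data.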
Once the coordinator has assigned work to each impalad process, the impalad process works with the storage engine to execute the query operators, applying optimizations such as constant folding and predicate pushdown so that only the portions of the data actually needed to satisfy the query are read.
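The two optimizations named above can be shown in a small sketch. It assumes a toy expression format (integer literals, column names as strings, and (operator, left, right) tuples) and a toy scan function; none of these names come from Impala. Constant folding evaluates literal-only sub-expressions once, before execution, and predicate pushdown applies the filter inside the scan so that non-matching rows never leave the storage layer.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(expr):
    """Collapse sub-expressions whose operands are all integer literals.

    An expression is an int literal, a column name (str),
    or a tuple of the form (op, left, right).
    """
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return (op, left, right)


def scan(rows, predicate=None):
    """Toy table scan; a pushed-down predicate filters rows as they are read."""
    for row in rows:
        if predicate is None or predicate(row):
            yield row


ROWS = [
    {"id": 1, "price": 50},
    {"id": 2, "price": 150},
    {"id": 3, "price": 250},
]

# WHERE price > 10 * 10 + 20 is folded once to WHERE price > 120.
threshold = fold_constants(("+", ("*", 10, 10), 20))

# The folded predicate is evaluated inside the scan, not after it.
matching = list(scan(ROWS, predicate=lambda row: row["price"] > threshold))
```

Sub-expressions that involve a column are left in place, so fold_constants(("+", "price", ("*", 2, 3))) only simplifies the literal part.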
The execution engine (the executor process) in the impalad daemon executes the optimized query by reading from the data source at high speed. It uses all the disks and their controllers to maximize read throughput, and it executes query fragments that have been optimized through Impala's LLVM-based code generation. The Impala executor can serve hundreds of plan fragments at any given time.
Impala does nothing special for failover. Because HDFS provides failover through replication, if the impalad daemon is installed on the nodes that hold the replicas, Impala will seamlessly start using the impalad on those nodes.
