PYTHON CRASH COURSE: A COMPLETE BEGINNER’S GUIDE TO LEARN PYTHON AND CODING QUICKLY by ERIC LUTZ & MARK MATTHES

Author:ERIC LUTZ & MARK MATTHES [LUTZ, ERIC]
Language: eng
Format: azw3
Publisher: CODING AND PROGRAMMING ACADEMY
Published: 2020-08-03T16:00:00+00:00


Chapter 7: Data Science Tips and Tricks

One of the major strengths of Data Scientists is a strong background in Math and Statistics. Mathematics helps them create complex analytics. Besides this, they also use mathematics to create Machine Learning models and Artificial Intelligence. Like software engineers, Data Scientists must also interact with the business side.

This involves mastering the business domain so that they can draw useful insights. Data Scientists analyze data to help a business, and this calls for some business acumen. Lastly, the results need to be communicated to the business in a way that anyone can understand.

This calls for the ability to communicate advanced results and observations, both verbally and visually, in a way that the business can understand and act on.

Therefore, it is important for any aspiring Data Scientist to know about Data Mining.

Data Mining describes the process of structuring raw data so that patterns in it can be recognized using mathematical and computational algorithms.

Below are five mining techniques that every data scientist should know:

MapReduce

Modern Data Mining applications need to manage vast amounts of data quickly. To deal with these applications, a new software stack has emerged. Its programming systems get their parallelism from a computing cluster, and the stack is built on a new kind of file system called a distributed file system.

A distributed file system works with much larger units than the disk blocks found in a conventional operating system. It also replicates data to protect against media failures.

On top of such file systems, higher-level programming systems have been created. One of these is MapReduce, a style of computing that has been implemented in several systems, including Google's own implementation and the open-source Hadoop.

You can use a MapReduce implementation to manage large-scale computations in a way that tolerates hardware faults. You only need to write two functions, Map and Reduce; the system then handles parallel execution and the coordination of tasks.
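To make the division of labor concrete, here is a minimal single-machine sketch of the idea in plain Python. It is not Hadoop's or Google's API: the names map_words, reduce_counts, and run_mapreduce are made up for illustration, and run_mapreduce only imitates, in one process, the grouping step that a real framework would perform across a cluster.

```python
from collections import defaultdict


def map_words(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: combine all the values that were emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, reduce.

    A real system would run the mapper and reducer on many machines and
    restart tasks that fail; this sketch does everything in one loop.
    """
    grouped = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


docs = ["the cat sat", "the dog sat down"]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
print(word_counts)
```

Only map_words and reduce_counts are specific to the word-count problem. Swapping in a different pair of functions changes the computation without touching the part that stands in for the framework.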

Distance Measures

A fundamental problem in Data Mining is examining data for similar items. One example is searching a collection of web pages for near-duplicates. Some of these pages could be plagiarized, or they could be pages with almost identical content that differ only in minor details. Other examples include finding customers who buy similar products or finding images with similar characteristics.

A distance measure is a technique for handling this problem. It is used to search for nearest neighbors in a high-dimensional space. For every application, it is important to define what similarity means. A popular definition is the Jaccard Similarity: the ratio of the size of the intersection of two sets to the size of their union. It is well suited to revealing textual similarity between documents and similarities in certain customer behaviors.
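The Jaccard Similarity is short enough to write out directly. The sketch below uses Python's built-in set type. The function name and the shopping-basket data are invented for illustration, and treating two empty sets as fully similar is an assumption made here, not something the definition settles.

```python
def jaccard_similarity(a, b):
    """Return |A intersect B| / |A union B| for two collections of items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption for this sketch: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical shopping baskets for two customers.
basket_1 = {"bread", "milk", "eggs"}
basket_2 = {"bread", "milk", "jam"}

# 2 shared items out of 4 distinct items overall.
print(jaccard_similarity(basket_1, basket_2))  # 0.5
```

A value of 1.0 means the two sets are identical and 0.0 means they share nothing, so 1 minus the similarity can serve as the corresponding distance.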

Finding similar documents, for example, raises several complications. Many small pieces of one document may appear out of order in another, there may be too many documents to compare every pair, and the documents may be too large to fit in main memory.
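One common way to cope with pieces of a document appearing out of order is to break each document into short overlapping word sequences, often called shingles, and compare the resulting sets. The text above does not name this technique, so treat the following as an illustrative sketch only. The function names, the choice of k = 2, and the sample sentences are all assumptions.

```python
def shingles(text, k=2):
    """Return the set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard Similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# The second text contains the same two pieces as the first, swapped around.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "over the lazy dog the quick brown fox jumps"

similarity = jaccard(shingles(doc_a), shingles(doc_b))
print(similarity)
```

Because most shingles survive the reordering, the two texts still score as highly similar. This sketch does nothing about the other two complications, the number of document pairs and the limits of main memory.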


