Machine Learning in a Nutshell (Executive Leadership Series) by Lifehacker Books

Author:Lifehacker Books , Date: April 2, 2018 ,Views: 146

Machine Learning in a Nutshell (Executive Leadership Series) by Lifehacker Books

Author:Lifehacker Books
Language: eng
Format: azw3
Publisher: UNKNOWN
Published: 2017-07-31T07:00:00+00:00

Jocelyn decided to begin her data exploration work by focusing on the target feature. The structure of the data available from the Galaxy Zoo project is shown below in the table. The category of each galaxy is voted on by multiple Galaxy Zoo participants, and the data includes the fraction of these votes for each of the categories.

The raw data did not contain a single column that could be used as a target feature, so Jocelyn had to design one from the data sources that were present. She generated two possible target features from the data provided. In both cases, the target feature level was set to the galaxy category that received the majority of the votes. In the first target feature, just three levels were used:

elliptical (P EL majority), spiral (P CW, P ACW, or P EDGE majority), and other (P MG or P DK majority). The second target feature allowed three levels for spiral galaxies: spiral cw (P CW majority), spiral acw (P ACW majority), and spiral edge (P EDGE majority).

The main observation that Jocelyn made from these is that galaxies in the dataset were not evenly distributed across the different morphology types. Instead, the elliptical level was much more heavily represented than the others in both cases. Using the 3-level target feature as her initial focus, Jocelyn began to look at the different descriptive features in the data downloaded from the SDSS repository that might be useful in building a model to predict galaxy morphology.

The SDSS download that Jocelyn had access to was a big dataset—over 600,000 rows. Although modern predictive analytics and machine learning tools can handle data of this size, a large dataset can be cumbersome when performing data exploration operations—calculating summary statistics, generating visualizations, and performing correlation tests can just take too long.

For this reason, Jocelyn extracted a small sample of 10,000 rows from the full dataset for exploratory analysis using stratified sampling.

Given that (1) the SDSS data that Jocelyn downloaded was already in a single table; (2) the data was already at the right prediction subject level (one row per galaxy); and (3) many of the columns in this dataset would most likely be used directly as features in the ABT that she was building, Jocelyn decided to produce a data quality report on this dataset. The following table shows an extract from this data quality report.

Download

Machine Learning in a Nutshell (Executive Leadership Series) by Lifehacker Books.azw3

Copyright Disclaimer:
This site does not store any files on its server. We only index and link to content provided by other sites. Please contact the content providers to delete copyright contents if any and email us, we'll remove relevant links or contents immediately.

Categories

other	Arts & Photography
Biographies & Memoirs	Business & Money
Calendars	Christian Books & Bibles
Comics & Graphic Novels	Computers & Technology
Cookbooks, Food & Wine	Crafts, Hobbies & Home
Education & Teaching	Engineering & Transportation
Health, Fitness & Dieting	Humor & Entertainment
Law	Lesbian, Gay, Bisexual & Transgender Books
Literature & Fiction	Medical Books
Mystery, Thriller & Suspense	Parenting & Relationships
Politics & Social Sciences	Reference
Religion & Spirituality	Romance
Science & Math	Science Fiction & Fantasy
Self-Help	Sports & Outdoors
Teen & Young Adult	Test Preparation
Travel	Children's Books
History