Cleaning Data for Effective Data Science: Doing the other 80% of the work with Python, R, and command-line tools

David Mertz

A comprehensive guide for data scientists to master effective data cleaning tools and techniques

Key Features
  • Master data cleaning techniques in a language-agnostic manner
  • Learn from intriguing hands-on examples from numerous domains, such as biology, weather data, demographics, physics, time series, and image processing
  • Work with detailed, commented, well-tested code samples in Python and R
Book Description

It is something of a truism in data science, data analysis, or machine
learning that most of the effort needed to achieve your actual purpose
lies in cleaning your data. Written in David's signature friendly and
humorous style, this book discusses in detail the essential steps
performed in every production data science or data analysis pipeline and
prepares you for data visualization and modeling results.

The book dives into the practical application of tools and techniques needed
for data ingestion, anomaly detection, value imputation, and feature
engineering. It also offers long-form exercises at the end of each
chapter to practice the skills acquired.

You will begin by looking at the ingestion of data formats such as JSON, CSV, SQL
RDBMSes, HDF5, NoSQL databases, files in image formats, and binary
serialized data structures. Further, the book provides numerous example
data sets and data files, which are available for download and
independent exploration.

Moving on from formats, you will impute
missing values, detect unreliable data and statistical anomalies, and
generate synthetic features that are necessary for successful data
analysis and visualization goals.

By the end of this book, you
will have acquired a firm understanding of the data cleaning process
necessary to perform real-world data science and machine learning tasks.

What you will learn
  • Think carefully about your data and ask the right questions
  • Identify problem data pertaining to individual data points
  • Detect problem data in the systematic “shape” of the data
  • Remediate data integrity and hygiene problems
  • Prepare data for analytic and machine learning tasks
  • Impute values into missing or unreliable data
  • Generate synthetic features that are more amenable to data science, data analysis, or visualization goals
Who this book is for

This book is designed to benefit software developers, data scientists,
aspiring data scientists, and students who are interested in data
analysis or scientific computing.

Basic familiarity with
statistics, general concepts in machine learning, knowledge of a
programming language (Python or R), and some exposure to data science
are helpful. A glossary, references, and friendly asides should help
bring all readers up to speed.

The text will also be helpful to
intermediate and advanced data scientists who want to improve their
rigor in data hygiene and wish for a refresher on data preparation
issues.

Table of Contents
  1. Data Ingestion - Tabular Formats
  2. Data Ingestion - Hierarchical Formats
  3. Data Ingestion - Repurposing Data Sources
  4. The Vicissitudes of Error - Anomaly Detection
  5. The Vicissitudes of Error - Data Quality
  6. Rectification and Creation - Value Imputation
  7. Rectification and Creation - Feature Engineering
  8. Ancillary Matters - Closure/Glossary

Year: 2021
Edition: 1
Publisher: Packt Publishing
Language: English
Pages: 498
ISBN-10: 1801071292
ISBN-13: 9781801071291