《Bad Data Handbook》

  • Title: 《Bad Data Handbook》: Cleaning Up The Data So You Can Get Back To Work
  • Category: Computers
  • Author: Q. Ethan McCallum
  • Publisher: O'Reilly Media
  • Publication date: 2012-11-21
  • Price: USD 39.99
  • Binding: Paperback
  • Pages: 264

《Bad Data Handbook》 Description:

What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. In this handbook, data expert Q. Ethan McCallum has gathered 19 colleagues from every corner of the data arena to reveal how they’ve recovered from nasty data problems. From cranky storage to poor representation to misguided policy, there are many paths to bad data. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it.

Among the many topics covered, you’ll discover how to:

  • Test drive your data to see if it’s ready for analysis
  • Work spreadsheet data into a usable form
  • Handle encoding problems that lurk in text data
  • Develop a successful web-scraping effort
  • Use NLP tools to reveal the real sentiment of online reviews
  • Address cloud computing issues that can impact your analysis effort
  • Avoid policies that create data analysis roadblocks
  • Take a systematic approach to data quality analysis

About the author, Q. Ethan McCallum:

Q. Ethan McCallum is a consultant, writer, and technology enthusiast, though perhaps not in that order. His work has appeared online on The O’Reilly Network and Java.net, and also in print publications such as C/C++ Users Journal, Doctor Dobb’s Journal, and Linux Magazine. In his professional roles, he helps companies make smart decisions about data and technology.

《Bad Data Handbook》 Table of contents:

Chapter 1 Setting the Pace: What Is Bad Data?
Chapter 2 Is It Just Me, or Does This Data Smell Funny?
Understand the Data Structure
Field Validation
Value Validation
Physical Interpretation of Simple Statistics
Visualization
Keyword PPC Example
Search Referral Example
Recommendation Analysis
Time Series Data
Conclusion
Chapter 3 Data Intended for Human Consumption, Not Machine Consumption
The Data
The Problem: Data Formatted for Human Consumption
The Solution: Writing Code
Postscript
Other Formats
Summary
Chapter 4 Bad Data Lurking in Plain Text
Which Plain Text Encoding?
Guessing Text Encoding
Normalizing Text
Problem: Application-Specific Characters Leaking into Plain Text
Text Processing with Python
Exercises
Chapter 5 (Re)Organizing the Web’s Data
Can You Get That?
General Workflow Example
The Real Difficulties
The Dark Side
Conclusion
Chapter 6 Detecting Liars and the Confused in Contradictory Online Reviews
Weotta
Getting Reviews
Sentiment Classification
Polarized Language
Corpus Creation
Training a Classifier
Validating the Classifier
Designing with Data
Lessons Learned
Summary
Resources
Chapter 7 Will the Bad Data Please Stand Up?
Example 1: Defect Reduction in Manufacturing
Example 2: Who’s Calling?
Example 3: When “Typical” Does Not Mean “Average”
Lessons Learned
Will This Be on the Test?
Chapter 8 Blood, Sweat, and Urine
A Very Nerdy Body Swap Comedy
How Chemists Make Up Numbers
All Your Database Are Belong to Us
Check, Please
Live Fast, Die Young, and Leave a Good-Looking Corpse Code Repository
Rehab for Chemists (and Other Spreadsheet Abusers)
tl;dr
Chapter 9 When Data and Reality Don’t Match
Whose Ticker Is It Anyway?
Splits, Dividends, and Rescaling
Bad Reality
Conclusion
Chapter 10 Subtle Sources of Bias and Error
Imputation Bias: General Issues
Reporting Errors: General Issues
Other Sources of Bias
Conclusions
References
Chapter 11 Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
But First, Let’s Reflect on Graduate School …
Moving On to the Professional World
Moving into Government Work
Government Data Is Very Real
Service Call Data as an Applied Example
Moving Forward
Lessons Learned and Looking Ahead
Chapter 12 When Databases Attack: A Guide for When to Stick to Files
History
Consider Files as Your Datastore
File Concepts
A Web Framework Backed by Files
Reflections
Chapter 13 Crouching Table, Hidden Network
A Relational Cost Allocations Model
The Delicate Sound of a Combinatorial Explosion…
The Hidden Network Emerges
Storing the Graph
Navigating the Graph with Gremlin
Finding Value in Network Properties
Think in Terms of Multiple Data Models and Use the Right Tool for the Job
Acknowledgments
Chapter 14 Myths of Cloud Computing
Introduction to the Cloud
What Is “The Cloud”?
The Cloud and Big Data
Introducing Fred
At First Everything Is Great
They Put 100% of Their Infrastructure in the Cloud
As Things Grow, They Scale Easily at First
Then Things Start Having Trouble
They Need to Improve Performance
Higher IO Becomes Critical
A Major Regional Outage Causes Massive Downtime
Higher IO Comes with a Cost
Data Sizes Increase
Geo Redundancy Becomes a Priority
Horizontal Scale Isn’t as Easy as They Hoped
Costs Increase Dramatically
Fred’s Follies
Myth 1: Cloud Is a Great Solution for All Infrastructure Components
Myth 2: Cloud Will Save Us Money
Myth 3: Cloud IO Performance Can Be Improved to Acceptable Levels Through Software RAID
Myth 4: Cloud Computing Makes Horizontal Scaling Easy
Conclusion and Recommendations
Chapter 15 The Dark Side of Data Science
Avoid These Pitfalls
Know Nothing About Thy Data
Thou Shalt Provide Your Data Scientists with a Single Tool for All Tasks
Thou Shalt Analyze for Analysis’ Sake Only
Thou Shalt Compartmentalize Learnings
Thou Shalt Expect Omnipotence from Data Scientists
Final Thoughts
Chapter 16 How to Feed and Care for Your Machine-Learning Experts
Define the Problem
Fake It Before You Make It
Create a Training Set
Pick the Features
Encode the Data
Split Into Training, Test, and Solution Sets
Describe the Problem
Respond to Questions
Integrate the Solutions
Conclusion
Chapter 17 Data Traceability
Why?
Personal Experience
Immutability: Borrowing an Idea from Functional Programming
An Example
Conclusion
Chapter 18 Social Media: Erasable Ink?
Social Media: Whose Data Is This Anyway?
Control
Commercial Resyndication
Expectations Around Communication and Expression
Technical Implications of New End User Expectations
What Does the Industry Do?
What Should End Users Do?
How Do We Work Together?
Chapter 19 Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough
Framework Introduction: The Four Cs of Data Quality Analysis
Complete
Coherent
Correct
aCcountable
Conclusion


