Fantastic Free Data Science Books for Aspiring Data Scientists

Data science is an ever-growing field with a staggering number of data scientist wannabes looking for FREE Data Science Books.

So if you're looking for Data Science Ebooks to download for FREE, we thought we'd give you a hand: we've curated a list of some of our favourite FREE books on Data Science.

Whether you're looking for books about stats, machine learning, data visualisation, ethics or more, we've got you covered.

In this post we bring you all the FREE general Data Science books that we've found (so far), categorised by sub-topic so you can find what you're looking for easily.

We'll be adding more FREE Ebooks for Data Scientists regularly, so bookmark the page!

To get the books, click on the image and you'll be taken to the page where you can download or read the book.

Is there a book that you'd like to recommend for this list?

Are any of the links out of date?

Send me a message (link at the bottom) and I'll jump right on it.

Bookmark this page, enjoy and don't forget to share!

The authors and/or publishers of these books have been kind enough to allow people to read these Data Science Books for FREE.

They're all great books we'd recommend anyway, so in case you want a hard copy, we've included links to a printed (and more up-to-date) copy of each book where one is available.

Disclosure: The FREE ebooks were free to download at the time of posting but other links in this post may contain affiliate links. As Amazon Associates we may earn from qualifying purchases.

You can find further details in our T&Cs.

FREE General Data Science Books

Going Pro in Data Science

Jerry Overton

Digging for answers to your pressing business questions probably won’t resemble those tidy case studies that lead you step-by-step from data collection to cool insights. Data science is not so clear-cut in the real world. Instead of high-quality data with the right velocity, variety, and volume, many data scientists have to work with missing or sketchy information extracted from people in the organization.

In this O'Reilly report, Jerry Overton - Distinguished Engineer at global IT leader CSC - introduces practices for making good decisions in a messy and complicated world. What he simply calls "data science that works" is a trial-and-error process of creating and testing hypotheses, gathering evidence, and drawing conclusions. These skills are far more useful for practicing data scientists than, say, mastering the details of a machine-learning algorithm.

Adapted and expanded from a series of articles Overton published on O’Reilly Radar and on the CSC Blog, each chapter is ideal for current and aspiring data scientists who want to go pro, as well as IT execs and managers looking to hire in this field.

Introduction To Data Science

Rafael A. Irizarry

The demand for skilled data science practitioners in industry, academia, and government is rapidly growing.

This book introduces concepts and skills that can help you tackle real-world data analysis challenges. It covers concepts from probability, statistical inference, linear regression and machine learning. It also helps you develop skills such as R programming, data wrangling with dplyr, data visualization with ggplot2, algorithm building with caret, file organization with UNIX/Linux shell, version control with Git and GitHub, and reproducible document preparation with knitr and R markdown.

The book is divided into six parts: R; Data Visualization; Data Wrangling; Probability, Inference and Regression with R; Machine Learning; and Productivity Tools. Each part has several chapters meant to be presented as one lecture. The book includes dozens of exercises distributed across most chapters.

An Introduction to Data Science

Jeffrey Stanton

In this Introduction to Data Science eBook, a series of data problems of increasing complexity is used to illustrate the skills and capabilities needed by data scientists.

The open source data analysis program known as "R" and its graphical user interface companion "R-Studio" are used to work with real data examples to illustrate both the challenges of data science and some of the techniques used to address those challenges.

To the greatest extent possible, real datasets reflecting important contemporary issues are used as the basis of the discussions.

GET A HARD COPY OF THE BOOK HERE

Executive Data Science

Brian Caffo, Roger D. Peng and Jeffrey T. Leek

In this concise book you will learn what you need to know to begin assembling and leading a data science enterprise, even if you have never worked in data science before.

You’ll get a crash course in data science so that you’ll be conversant in the field and understand your role as a leader. You’ll also learn how to recruit, assemble, evaluate, and develop a team with complementary skill sets and roles.

You’ll learn the structure of the data science pipeline, the goals of each stage, and how to keep your team on target throughout.

Finally, you’ll learn some down-to-earth practical skills that will help you overcome the common challenges that frequently derail data science projects.

The Data Science Handbook

Carl Shan, Henry Wang, William Chen and Max Song

The Data Science Handbook is a compilation of in-depth interviews with 25 remarkable data scientists, where they share their insights, stories, and advice.

These 25 data scientists hail from a wide selection of backgrounds, disciplines, and industries.

Some of them, like DJ Patil and Hilary Mason, were part of the trailblazing wave of data scientists who catapulted the field into national attention.

Others are at the start of their careers, such as Clare Corthell, who made her own path to data science by creating the Open Source Data Science Masters, a self-guided curriculum built on freely available internet resources.

GET A HARD COPY OF THE BOOK HERE

Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer

MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.

This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains.
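To give a feel for the programming model described above, here is a minimal single-machine sketch of the classic word count in plain Python. It is our own illustration, not code from the book: the function names (`mapper`, `shuffle`, `reducer`, `word_count`) are hypothetical, and a real framework would run the map and reduce steps in parallel across a cluster and handle scheduling and fault tolerance for you.

```python
from collections import defaultdict
from itertools import chain


def mapper(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(word, counts):
    """Reduce step: combine all the values emitted for one key."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = chain.from_iterable(mapper(doc) for doc in documents)
    grouped = shuffle(pairs)
    return dict(reducer(word, counts) for word, counts in grouped.items())


# Hypothetical toy input, just to exercise the sketch.
docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(docs)
print(counts)
```

The point of the abstraction is that you only write `mapper` and `reducer`; everything in between is the framework's job.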

GRAB A HARD COPY OF THE BOOK HERE

Conversations on Data Science

Roger D. Peng and Hilary Parker

What is life like as a data scientist? In academia? In industry? What is the role of automation in data science? What is evidence-based data analysis? And what is RCatLadies?

These questions, and more, are discussed in Conversations on Data Science by Roger Peng and Hilary Parker, co-hosts of the popular Not So Standard Deviations podcast. This book collects many of their discussions from the podcast and distills them into a readable format. The conversational style of Roger and Hilary gives readers a behind-the-scenes look into how data science is done in real life. As new episodes of the podcast are created, this book is updated to add new topics of discussion. Readers of the book are entitled to free updates in the future as the book evolves.

Social Media Mining: An Introduction

Reza Zafarani, Mohammad Ali Abbasi and Huan Liu

Social Media Mining integrates social media, social network analysis, and data mining to provide a convenient and coherent platform for students, practitioners, researchers, and project managers to understand the basics and potentials of social media mining.

It introduces the unique problems arising from social media data and presents fundamental concepts, emerging issues, and effective algorithms for network analysis and data mining.

Suitable for use in advanced undergraduate and beginning graduate courses as well as professional short courses, the text contains exercises of different degrees of difficulty that improve understanding and help apply concepts, principles, and methods in various scenarios of social media mining.

GRAB A HARD COPY OF THE BOOK HERE

How to Be a Modern Scientist

Jeff Leek

The face of academia is changing. It is no longer sufficient to just publish or perish. We are now in an era where Twitter, GitHub, Figshare, and Alt Metrics are regular parts of the scientific workflow. Here I give high-level advice about which tools to use, how to use them, and what to look out for. This book is appropriate for scientists at all levels who want to stay on top of the current technological developments affecting modern scientific careers.

The book is probably most suited to graduate students and postdocs in the sciences, but may be of interest to others who want to adapt their scientific process to use modern tools.

Extracting Data From NoSQL Databases

Petter Näsholm

NoSQL databases differ from RDBMSs mainly in that they use non-relational data models, lack explicit schemas and scale horizontally. Some of these features cause problems for applications like Spotfire when extracting and importing data.

This thesis investigates how these problems can be solved, thus enabling support for NoSQL databases in Spotfire. The approach and conclusions are valid for any application that interacts with databases in a similar way to Spotfire.
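To illustrate why a missing schema is awkward for a tool that expects tables, here is a small Python sketch that flattens schemaless documents into rows with a shared set of columns. This is our own example and not the approach taken in the thesis: the helper names (`flatten`, `to_table`) and the sample documents are hypothetical.

```python
def flatten(document, prefix=""):
    """Turn nested dicts into one flat dict with dotted column names."""
    flat = {}
    for key, value in document.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_table(documents):
    """Infer a column list from all documents and fill the gaps with None."""
    rows = [flatten(doc) for doc in documents]
    columns = sorted({column for row in rows for column in row})
    table = [[row.get(column) for column in columns] for row in rows]
    return columns, table


# Hypothetical documents: no two share exactly the same fields.
documents = [
    {"name": "Ada", "address": {"city": "London"}},
    {"name": "Linus", "age": 54},
]
columns, table = to_table(documents)
print(columns)
print(table)
```

Because the columns are only known after reading every document, an importing application has to either scan the data first or guess, which is the kind of problem the paragraph above describes.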

Hadoop Illuminated

Mark Kerzner and Sujee Maniyam

'Hadoop illuminated' is the open source book about Apache Hadoop™. It aims to make Hadoop knowledge accessible to a wider audience, not just to the highly technical.

The book is a 'living book' - the authors will keep updating it to cover the fast-evolving Hadoop ecosystem.

Numsense!

Annalyn Ng and Kenneth Soo

This book was available for FREE with Kindle Unlimited when we last checked, but you have to check Amazon in your country. Try Kindle Unlimited for FREE.

Numsense! is lovingly put together by two data science enthusiasts, Annalyn Ng (University of Cambridge) and Kenneth Soo (Stanford University), who wrote:

"We noticed that while data science is increasingly used to improve workplace decisions, many people know little about the field. Hence, we wrote these tutorials so that everyone and anyone can learn - be it an aspiring student or enterprising business professional".

Each tutorial covers the important functions and assumptions of a data science technique, without any maths or jargon. The book also illustrates these techniques with real-world data and examples.

Process Improvement Using Data

Kevin Dunn

This book is a guide on how to improve processes using the large quantities of data that are routinely collected from process systems. It is a semi-permanent draft, so you might want to bookmark the page!

The book covers visualisation first, in Chapter 1, since most data analysis studies start by plotting the data. This is an extremely brief introduction to this topic, only illustrating the most basic plots required for this book.

This is followed by Chapter 2 on univariate data analysis, which is a comprehensive treatment of univariate techniques to quantify variability and then to compare variability. We look at various univariate distributions and consider tests of significance from a confidence-interval viewpoint. This is arguably a more useful and intuitive way, instead of using hypothesis tests.
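As a rough illustration of the confidence-interval viewpoint, the Python sketch below builds an interval for a sample mean and then checks whether a target value falls inside it, rather than running a formal hypothesis test. It is our own example, not taken from the book, and it assumes a normal approximation (a z value instead of the t value you would normally prefer for a small sample). The measurements and the target are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    centre = mean(sample)
    half_width = z * stdev(sample) / sqrt(len(sample))
    return centre - half_width, centre + half_width


# Hypothetical measurements and a hypothetical target value.
measurements = [10.2, 9.9, 10.4, 10.1, 10.3, 10.0, 10.2, 10.5]
target = 10.0

low, high = confidence_interval(measurements)
target_is_plausible = low <= target <= high
print(f"95% interval: ({low:.3f}, {high:.3f})")
print(f"Target inside interval: {target_is_plausible}")
```

Reading the result as a range of plausible values for the mean also shows how far off the target the process is, which a bare pass/fail test result does not.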

Chapter 3 is on monitoring charts to track variability, and Chapter 4 introduces the area of multivariate data. The first natural application is least squares modelling, where we learn how variation in one variable is related to another variable. This chapter briefly covers multiple linear regression and outliers.

Chapter 5 covers designed experiments, where we intentionally introduce variation into our system to learn more about it. We learn how to use the models from the experiments to optimize our process.

The final chapter, Chapter 6, is on latent variable modelling where we learn how to deal with multiple variables and extract information from them. This section is divided into several chapters (PCA, PLS, and applications).

Understanding the Chief Data Officer

Julie Steele

To manage today's flood of available data, a number of high-profile corporations have adopted a new position in addition to existing CTOs and CIOs: the Chief Data Officer, or CDO. In this report, Julie Steele of Silicon Valley Data Science provides a clear, concise look at how CDOs view their nascent role in high-profile organizations such as Wells Fargo, Samsung, the Republican National Committee, Allstate, and the Federal Reserve Board.

Although there are as many CDO implementations as there are organizations that employ them, some distinct patterns have emerged. This report presents a picture of the current landscape, as well as some guidelines and best practices for those considering adding a CDO role to their own company.

Data Science at the Command Line

Jeroen Janssens

This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.

To get you started - whether you’re on Windows, macOS, or Linux - author Jeroen Janssens has developed a Docker image packed with over 80 command-line tools.

Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.

GET A HARD COPY OF THE BOOK HERE

Data and Social Good

Mike Barlow

Data may indeed be the "new oil" - a seemingly inexhaustible source of fuel for spectacular economic growth - but it's also a valuable resource for humanitarian groups looking to improve and protect the lives of less fortunate people. In this O'Reilly report, you'll learn how statisticians and data scientists are volunteering their time to help a variety of nonprofit organizations around the world.

Mike Barlow cites several examples of how data and the work of data scientists have made a measurable impact on organizations such as DataKind, a group that connects socially minded data scientists with organizations working to address critical humanitarian issues. There's certainly no lack of demand for data science services among nonprofits today, because these organizations, too, realise the potential of data for changing people's fortunes.

Free Data Visualisation Books

D3 Tips and Tricks

Malcolm MacLean

D3 Tips and Tricks is a book written to help those who may be unfamiliar with JavaScript or web page creation get started turning information into visualization.

Data is the new medium of choice for telling a story or presenting compelling information on the Internet and d3.js is an extraordinary framework for presentation of data on a web page.

This book is not written for experts. It's put together as a guide to get you started if you're unsure what d3.js can do. It reads more like a story as it leads the reader through the basics of line graphs and on to discover animation, tooltips, tables, interfacing with databases via PHP, sankey diagrams, force diagrams, maps and more...

Data Visualization: A Practical Introduction

Kieran Healy

You should look at your data.

Graphs and charts let you explore and learn about the structure of the information you collect. Good data visualizations also make it easier to communicate your ideas and findings to other people. Beyond that, producing effective plots from your own data is the best way to develop a good eye for reading and understanding graphs - good and bad - made by others, whether presented in research articles, business slide decks, public policy advocacy, or media reports.

This book teaches you how to do it.

GET A HARD COPY OF THE BOOK HERE

If you're looking for more FREE Data Science Books we also have the following posts.