A Moment’s Digression


I had a data nerd moment while reading a novel the other day. I got in an argument with the book. But I think the book started it. It’s a frivolous discussion, probably, but sometimes those are the most fun.

What happened? Late in the book, the ghosts invite Patri to their New Year’s Eve party. She has to think about it.

Invitations to a magic party with ghosts were obviously going to be very rare. There might be another chance, but for Patri that was beside the point. She was wondering how many such invitations there could be in eternity. That was a different question. Repetition in eternity was not a matter of probabilities, no matter how large the numbers. In eternity, as distinct from “in life” or “outside life,” this party was an absolutely unique occasion.

No, no, NO! I wanted to shout at the book (I was in a restaurant, so I didn’t). In all eternity, this invitation is bound to happen again.

Big data, infinite time: rare events do happen.
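The arithmetic behind that objection can be sketched in a few lines. This is only an illustration under simplifying assumptions: each "occasion" is treated as an independent trial with the same small probability p of producing an invitation, and the value of p below is made up, not taken from the novel. The function name prob_at_least_once is likewise just a label for this sketch.

```python
import math


def prob_at_least_once(p: float, n: int) -> float:
    """Chance that an event with per-trial probability p happens
    at least once in n independent trials: 1 - (1 - p)**n."""
    if not 0.0 <= p <= 1.0 or n < 0:
        raise ValueError("need 0 <= p <= 1 and n >= 0")
    if p == 1.0:
        return 1.0 if n > 0 else 0.0
    # log1p/expm1 keep precision when p is tiny and n is huge.
    return -math.expm1(n * math.log1p(-p))


# Hypothetical rarity: a one-in-a-billion chance per occasion.
p = 1e-9
for n in (10**6, 10**9, 10**10, 10**12):
    print(f"{n:>17,d} trials: {prob_at_least_once(p, n):.6f}")
```

Under these assumptions, for any fixed p greater than zero the result climbs toward 1 as the number of trials grows without bound, which is the sense in which a rare event is "bound to happen again" given enough time.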

You can read the rest of the post here. Enjoy.

Women in Data Science – Positive Imagery for the Win


Unfortunately, I didn’t look at the announcement page for tomorrow’s SF Data Science Meetup until today, so this post is probably too late in terms of getting people to attend. But I absolutely love the image that they are using to publicize the talk.

A person’s mind generalizes to the images and associations they see most often, and computer science and technology are unfortunately quite skewed when it comes to gender. The articles on the hostile and unwelcoming environment that women programmers face today (in 2014!) are numerous and depressing.

Fortunately, this seems to be less of an issue for women in data science. I know quite a few great women data scientists, and as far as I know, none of us have had any “bro” type issues with clients or colleagues. On the other hand, the two times that I taught EMC’s Data Science and Big Data Analytics course, the classes were entirely male. Respectful to me — but all male.

Perhaps this just means that women are more likely to choose academic programs rather than professional certification courses, or perhaps the makeup of the EMC course has become more balanced. I hope so. Keeping the field gender-balanced (and diverse in all other ways, as well) is the best way to maintain a positive environment for everyone who chooses this profession.

So it’s good to see a conscious effort to associate data science with female practitioners. Not because men shouldn’t be data scientists too, but rather to eradicate the idea that women shouldn’t be. In fact, that’s one of the reasons we chose a woman for the cover of our book. Thanks to the organizers of the SF Data Science Meetup for doing this.

Follow me via RSS!

I went back to using RSS to follow blogs and other websites recently; I don’t know why I ever stopped. My email doesn’t get clogged by notifications anymore, and I don’t lose blog updates in the ever-flowing stream of Twitter or Facebook or the WordPress reader. I can follow any blog on any platform as long as it has an RSS feed, and I don’t need to have accounts on every possible platform, either, just Feedly (and not even that, if I didn’t want to sync between devices).

It also occurred to me that RSS is really the only reliable medium for following an irregular blog like this one. Since I don’t blog on a regular schedule (or all that often), my posts tend to get lost in the WordPress reader, as do tweets and Facebook/Google+ updates.

So I’ve added a “Follow me on Feedly” button to the side of my blog; if you use another RSS reader, like Bloglovin or NetNewsWire, there is a generic RSS widget, as well. Even if you follow me on WordPress, or follow Win-Vector on Twitter, please do consider also following me (and other bloggers you love) via RSS, so you will be sure to never miss my blog updates. I promise, they will not all be about the book.


Popularity and Social Networks: Life is still like high school


I remember setting up the Multo blog a few years ago: my first blog explicitly meant for public consumption. On the “Follow” widget — the button that allows readers to follow a blog via email notifications — there is an option to show the count of the blog’s followers.

My first reaction: why would I want to do that?

It’s an insecurity reflex, of course, one left over from high school. I was never one of the popular or cool kids, though I was lucky enough not to be one of the pariahs, either. Like most of us, I flitted on the edges of the cool circle — the very outer edges, in my case — once in a while being noticed, mostly not. As my life, so will be my blog, my mind said. Why would I want to advertise my obscurity to the world?


New Data Visualization post up on Win-Vector Blog


I have a new post up on the Win-Vector blog, exploring a couple of data visualization tasks, and touching on the difference between graphing for data exploration and graphing for the communication of results.

Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical.

One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of data in ways that improve both data exploration and communication. Of course, getting at the right graph can be a bit of work, and often I will stop when I get to a visualization that tells me what I need to know — even if no one can read that graph but me. In this post I’ll look at a couple of ggplot graphs that take the extra step: communicating effectively to others.

The post concerns itself mostly with the ggplot code to generate the graphs, but there is a bigger-picture point, too. Data visualization is a bit like the drafts of a piece of writing: early graphs are rough, sometimes ugly, and highly detailed. By the time you get to the point of presenting the results — of articulating the “story” that is found in the data — you might want to use graphs that abstract away some of that detail, so that your viewers more clearly see the point you are trying to make, or the key insight that you are trying to convey.

You can read the post here.