The Sphering Transform for Detecting Distribution Drift
I have a new blog post up on the Win-Vector blog: Detecting Data Differences Using the Sphering Transform. Read more
Just created a handy little Rosetta stone of common data operations in both pandas and polars. Maybe you’ll find it useful, too. Read more
This article is a shortened version of a post from the Wallaroo Blog, originally written by Julio Barros and me. I’m posting the non-Wallaroo section of that article here, with permission, because I think it’s a useful reference for A/B testing—one that I refer to myself. Hopefully, others find it helpful as well. Read more
I want to check what code blocks look like natively in this blog theme. Read more
I’ve been inspired to start using ninazumel.com for some microblogging, and not just about data science. In fact, probably mostly not about data science. Why not? I have the site, after all. Read more
I have a new article up on the Win-Vector blog, about the ξ (‘XI’) correlation coefficient that was recently introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. Read more
John Mount and I will be giving a talk for the online University of San Francisco Seminar Series in Data Science: Read more
When the world feels like it’s falling apart around you, it feels good to solve little problems that are completely under your control. And that’s what I’ve been doing this past week. This was originally posted at Multo. Read more
Back in the good old days, ninazumel.com was a static site that I maintained myself, in pure HTML. But that (to me) was so much of a hassle that I never did even the little bit of site maintenance that the website required. So I moved it to wordpress.com. Read more
We recently did a couple of talks about our vtreat data treatment package: one for the Python version, and one for the R version. If you are fitting machine learning models on messy real-world data, then you might find vtreat useful. Do check out one of the introductory talks below. Read more
We just got the authors' copies of Practical Data Science with R, 2nd Edition. Hurray!! Read more
I've put a new release of the WVPlots package up on CRAN. This release adds consistent palette and/or other color controls to most of the functions in the package. Read more
John has just put up an article on the Win-Vector blog, highlighting some of our popular series of articles, as well as our more popular posts. If you like the articles that I point to on this blog, check out some of the other posts written by John, too. Read more
I have a new article up on Win-Vector, discussing differential privacy and recent results on applying differential privacy to enable reuse of holdout data in machine learning. Read more
We’re in the middle of marketing efforts here at Win-Vector, and I’ve just spent a few hours going through the Win-Vector blog so I could update our Popular Articles page (I have to do that for Multo someday, too). Read more
I have a new article up on the Win-Vector Blog, on checking your input variables for signal: Read more
I’ve just put up the next installment of the new “Working with Sessionized Data” series on Win-Vector. Read more
We’ve been wanting to get more into training over at Win-Vector, but I don’t want to completely give up client work, because clients and their problems are often the inspiration for cool solutions – and good blog articles. Working on the video course for the last couple of months has given me some good ideas, too. Read more
I put a new post up on Win-Vector a couple of days ago called "The Geometry of Classifiers", a follow-up post to a recent paper by Fernandez-Delgado, et al. that investigates several classifiers against a body of data sets, mostly from the UCI Machine Learning Repository. Our article follows up the study with seven additional classifier implementations from scikit-learn and an interactive Shiny app to explore the results. Read more
I ran across this essay recently on the role of design standards for scientific data visualization. The author, Jure Triglav, draws his inspiration from the creation and continued use of the NYCTA Graphics Standards, which were instituted in the late 1960s to unify the signage for the New York City subway system. Read more
I had a data nerd moment while reading a novel the other day. I got in an argument with the book. But I think the book started it. It's a frivolous discussion, probably, but sometimes those are the most fun. Read more
I have a new article up on the Win-Vector blog: Bandit Formulations for A/B Tests: Some Intuition. The article discusses the bandit problem formulation as an alternative to significance-based formulations for A/B tests. Read more
There's been a buzz the last few days about the 715 new planets that NASA has verified, using data from the Kepler Space Telescope. This discovery doubles the number of known planets, and turned up four new planets that could possibly support life. Read more
I remember setting up the Multo blog a few years ago: my first blog explicitly meant for public consumption. On the "Follow" widget -- the button that allows readers to follow a blog via email notifications -- there is an option to show the count of the blog's followers. Read more
It's been a while since I've posted here, but I have good news: the last appendix has gone to the editors. The book is now content complete. What a relief! We are hoping to release the book late in the first quarter of next year. In the meantime, you can still get early drafts of our chapters through Manning’s Early Access program, if you haven’t yet. The link is here. Read more
Christian Goldbach, Prussian mathematician. Probably most famous for the Goldbach conjecture, one of the oldest unsolved problems in mathematics:
Every even integer greater than 2 can be expressed as the sum of two primes. Read more
"No insults, please!" said Pugg. "For I am not your usual uncouth pirate, but refined and with a Ph.D. and therefore extremely high-strung."
…until the development of computers the possibility of dealing successfully with the complex itself was never really envisaged. Perhaps the most successful substitute for such a possibility, as well as the nearest approach to it, came in mathematics. … To find the simple in the complex, the finite in the infinite -- that is not a bad description of the aim and essence of mathematics. Read more
We are sending substantive drafts of the first four chapters of our data science book out for review. Manning, our publisher, hopes to launch the book in their Early Access Program (MEAP) by early May. Crossing our fingers! Read more
As I've posted previously, we are writing a data science book. The preview of the first chapter of our book should come out in about a month or so. We are almost finished with the revisions to the first four chapters, and we've started refining the outline of the next three. Exciting! It happens that I've been rereading mathematician Gian-Carlo Rota's collection of essays, Indiscrete Thoughts, and I've found a few passages that really speak to me, now that I'm in book-writing mode. Enjoy. Read more
So there's this article that's been making the rounds called "The 10 Least Stressful Jobs of 2013"; perhaps you've read it. I don't normally bother with articles like that, but it came to my attention because some of my old graduate-school friends (who are professors) threw a mini-rant on social media over the fact that University Professor is the Number One least stressful job of the year, according to the article. And just now, I tripped over a blog post where a librarian takes umbrage over the fact that they are also on the list. Read more
One of my favorite cheesy movies is a gem from 1984 called The Adventures of Buckaroo Banzai Across the 8th Dimension. For those who haven't seen it, Buckaroo Banzai is a brilliant young neurosurgeon and particle physicist who spends his days conducting cutting-edge research. At night, he and his research colleagues -- all engineers and scientists and doctors -- rock New Jersey as a band called the Hong Kong Cavaliers. In between the brilliant science and the rock-star night life, the Cavaliers find time to save the world from an alien invasion led by none other than John Lithgow. Read more
I’m happy to announce that John Mount and I have just signed a contract with Manning Publications to write a book on Data Science. We have both talked about doing this for quite a while, and we are excited that we finally have the opportunity. Read more
I came across an interesting article in The Atlantic a little while back that discussed the connection between writing and thinking. New Dorp, a Staten Island high school in a poor and working-class neighborhood, was able to improve student performance when they realized that their students couldn’t write. These underperforming students often could read and could do math. The majority of them were well-behaved, and seemed to want to learn. Yet they couldn't pass standard proficiency tests, and couldn't graduate. All because they couldn't form complex sentences. Read more
When people asked me what it means to be a data scientist, I used to answer, "it means you don't have to hold my hand." By which I meant that as a data scientist (a consulting data scientist), I can handle the data collection, the data cleaning and wrangling, the analysis, and the final presentation of results (both technical and for the business audience) with a minimal amount of assistance from my clients or their people. Not no assistance, of course, but little enough that I'm not interfering too much with their day-to-day job. Read more
I came across a post from Emily Willingham the other day: "Is a PhD required for Good Science Writing?". As a science writer with a science PhD, her answer is: it is not required, and it can often be an impediment. I saw a similar sentiment echoed once by Lee Gutkind, the founder and editor of the journal Creative Nonfiction. I don't remember exactly what he wrote, but it was something to the effect that scientists are exactly the wrong people to produce literary, accessible writing about matters scientific. Read more