Over the summer I worked with the Chicago Department of Public Health on reducing lead poisoning in Chicago children.  I collaborated with a team of four data scientists, as part of the Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship, to build a predictive model that helped CDPH improve their lead exposure prevention efforts.  Here's the presentation I gave at the end of the summer to the Chicago civic data science community, summarizing our results.
This summer I'll be at the University of Chicago, as a fellow for the Eric & Wendy Schmidt Data Science for Social Good summer fellowship.  
DSSG brings in leaders in data analytics from government and industry to mentor teams of graduate students as we work through real-life challenges with local and national data.  
This is its second year running, and already it's received a fair amount of positive press.

Here's an example of D3's excellent mapping functionality.  It's a map of average sunlight exposure across the United States. 

It has simple interactivity - you can zoom and pan across the map, and clicking on any bubble will pull up a chart of hourly sunlight averages for that measurement station.  

Nothing fancy, but good practice for implementing maps in D3.


We just wrapped up our final submissions for the Large-Scale Hierarchical Text Classification challenge on Kaggle.  This was definitely the biggest dataset I've worked with - millions of documents, and hundreds of thousands of predictors and response classes.  

Especially when dealing with limited computing power, datasets of this scale really drive home the advantage of using broad, even sloppy tools to reduce the problem space at hand.  


Our most recent Kaggle challenge was the "March Machine Learning Mania" competition.  The name is taken from "March Madness", the nickname for the NCAA basketball tournament period that takes place every year during this month.  

Conway's Game of Life is a classic problem space in computational science.  The game is played on a square grid of cells, each of which is either colored (alive) or white (dead).  At each step, four simple rules that roughly mimic population dynamics decide which cells live and which die, so the board evolves into new combinations of live and dead cells.  

The "game" can't be won or lost; rather, it serves as an interesting way to explore how simple starting conditions and rules can yield wildly divergent and complex patterns.
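
To make the rules concrete, here is a minimal sketch of one generation step in Python.  It assumes the standard formulation of the four rules (a live cell dies with fewer than two live neighbors, survives with two or three, and dies with more than three; a dead cell with exactly three live neighbors comes alive), and it assumes cells beyond the edge of the board count as dead.  The function names are my own for illustration, not taken from any particular implementation.

```python
def count_live_neighbors(board, row, col):
    """Count live cells among the (up to eight) neighbors of (row, col)."""
    size = len(board)
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the cell itself
            r, c = row + dr, col + dc
            # Assumption: anything beyond the edge of the board is dead.
            if 0 <= r < size and 0 <= c < size:
                total += board[r][c]
    return total


def step(board):
    """Return the next generation of a square board of 0s (dead) and 1s (alive)."""
    size = len(board)
    new_board = [[0] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            n = count_live_neighbors(board, row, col)
            if board[row][col] == 1:
                # Rules 1-3: a live cell survives only with 2 or 3 live neighbors.
                new_board[row][col] = 1 if n in (2, 3) else 0
            else:
                # Rule 4: a dead cell with exactly 3 live neighbors comes alive.
                new_board[row][col] = 1 if n == 3 else 0
    return new_board


# A "blinker": three live cells in a vertical line.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
next_gen = step(blinker)
```

Stepping the blinker once flips it from a vertical line to a horizontal one, and stepping again flips it back, which is a handy sanity check for any implementation.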

D3 is a popular tool for creating interactive data visualizations on the web.  It's a library of functions for the JavaScript language, available freely online (and with a very large code base of examples and implementations).  
I'm taking a course this semester to learn how to make data visualizations using D3.  
As a very basic example, here's an adjustable bar chart I created that represents state tax rates in America.  (The data is accurate as of December 2013.)

I joined a new data science group here at Harvard, headed by Luke Bornn in the Statistics department.  The group's sole focus is to compete in data challenges on Kaggle.com - something I've heard a lot about but had never checked out, until now.  

Kaggle is a platform which allows people and companies to post challenging problems related to data analysis, which anyone who has a Kaggle account can then attempt to solve.  The datasets are often large and unwieldy, and represent a diverse array of issues that show up in data science - classification, prediction, wrangling, and all that good stuff.  Sometimes the prize for winning is cash, sometimes it's a job, sometimes it's just for "swag", as they call it.

I'm excited, since it'll be a chance to test my data science skills against real competitors.  It'll be good to work in a team, too - a lot of my colleagues are highly capable CS and stats folks, and I'm sure I'm going to learn a lot from them.  More to follow.

It's impossible to overestimate the importance of good data management.  

Data management, or data wrangling, consists of scraping, parsing, and cleaning datasets, whether they come from somewhere on the internet or from data that has piled up in some other way.  It is the least glamorous aspect of data science.  It doesn't make fancy predictions or generate cool interactive visuals.  But you can't get started without a dataset, and in cases where you don't have a nice, neat, shiny Excel spreadsheet to start with, you're at least going to need to get your hands dirty with a little data wrangling.  
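
As a rough illustration of the parsing and cleaning steps, here is a small Python sketch using only the standard library.  The input text, the column names, and the `clean_rows` helper are all made up for the example; real wrangling jobs are usually messier and depend heavily on the source.

```python
import csv
import io

# Made-up sample input standing in for a scraped file; this is not real data.
RAW = """name, value
 alpha , 1.5
beta,
gamma, n/a
 delta,2
"""


def clean_rows(text):
    """Parse CSV text, trim whitespace, and keep only rows with a numeric value."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    cleaned = []
    for row in reader:
        if len(row) != len(header):
            continue  # skip blank or malformed rows
        name, value = (field.strip() for field in row)
        try:
            cleaned.append({header[0]: name, header[1]: float(value)})
        except ValueError:
            continue  # skip rows whose value is missing or not a number
    return cleaned


rows = clean_rows(RAW)
```

Here the rows with an empty or non-numeric value are simply dropped.  Whether to drop, flag, or fill in such rows is a judgment call that depends on the dataset.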

I collaborated with three other data scientists on an effort to predict sentiment and popularity rankings in New York Times articles.  
(The video and website I created myself.)