Data Science Done Right - Literate Programming for Reproducible Research

Thomas Cooper

Do you take your raw data and clean it up in a spreadsheet? If so, do you note down the steps you take to go from the raw data to the clean data?

When you perform the analysis that forms the basis of your groundbreaking, paradigm-shifting paper, will you record every button you pressed and every setting you used to analyse your data in SPSS?

Taking these kinds of notes can be a burden, but it is vital if your research is to be reproducible by others. Producing reproducible research helps your future self when it comes to writing up your analysis, and it helps other researchers looking to verify your results.

Using a programming language such as Python or R to write the scripts that clean your data and perform the analysis means that the steps you took are clearly laid out, whereas in Excel or SPSS these steps can often be hard to discern. Using these languages also means that your reasoning, notes and data analysis can live in one single document, and this is where literate programming comes in.
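To make the point about scripted cleaning concrete, here is a minimal sketch using only the Python standard library. The column names, the sample rows and the `clean_rows` function are all made up for illustration; the only point is that each cleaning step is an explicit line of code that anyone can rerun.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, a missing value, a duplicate row.
RAW_CSV = """id,area,house_price
1, Jesmond ,250000
2,Heaton,
2,Heaton,
3,byker,120000
"""


def clean_rows(raw_text):
    """Return cleaned rows; every cleaning decision is written down as code."""
    reader = csv.DictReader(io.StringIO(raw_text))
    seen_ids = set()
    cleaned = []
    for row in reader:
        # Step 1: drop rows with a missing house price.
        if not row["house_price"].strip():
            continue
        # Step 2: drop duplicate ids, keeping the first occurrence.
        if row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])
        # Step 3: tidy the area name and convert numeric fields to integers.
        cleaned.append({
            "id": int(row["id"]),
            "area": row["area"].strip().title(),
            "house_price": int(row["house_price"]),
        })
    return cleaned


cleaned = clean_rows(RAW_CSV)
for row in cleaned:
    print(row)
```

Compared with editing cells by hand, the script doubles as the record of what was done to the data.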

Literate programming is an extension of an idea developed by Donald Knuth[1], whereby an explanation in plain English is accompanied by executable code in one document. The document is then compiled (or "woven", in Knuth's terminology) and the code is executed at the same time. This technique has been adopted by data scientists worldwide using packages like Jupyter[2] and by statisticians in research via Sweave[3] and Knitr[4].
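As a rough illustration of the idea, the toy sketch below keeps prose and code in one string, runs each code chunk and places its output next to the code. The `<<py>>` and `<<end>>` markers and the `weave` function are invented for this example; they are not the syntax used by Jupyter, Sweave or Knitr, which do the same job far more thoroughly.

```python
import contextlib
import io
import re

# Hypothetical chunk markers, chosen for this sketch only.
CHUNK = re.compile(r"<<py>>\n(.*?)<<end>>\n?", re.DOTALL)

SOURCE = """We counted the tweets collected on three days.
<<py>>
counts = [3, 5, 4]
print("total:", sum(counts))
<<end>>
The total is recomputed every time the document is built.
"""


def weave(source):
    """Run each code chunk and insert its printed output after the code."""
    namespace = {}  # shared, so later chunks can use earlier variables

    def run_chunk(match):
        code = match.group(1)
        buffer = io.StringIO()
        # Capture anything the chunk prints while it executes.
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
        return code + "--> " + buffer.getvalue()

    return CHUNK.sub(run_chunk, source)


document = weave(SOURCE)
print(document)
```

Because the number in the output comes from running the code, the text and the results cannot drift apart when the data changes.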

Workshop:

This workshop will introduce the tools and techniques needed to ensure your research is open and reproducible.

As a motivating example we will use geotagged Twitter data from Newcastle’s Urban Observatory. We will perform a thematic analysis of the tweets and link the themes with geographic measures of socioeconomic status, such as average house price and deprivation levels, drawn from openly available datasets.
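The linking step might look something like the sketch below. Everything in it is an assumption made for illustration: the records are invented (not Urban Observatory data), the area codes, themes and measure values are placeholders, and `themes_by_measure` is a hypothetical helper rather than the workshop's actual code.

```python
from collections import Counter, defaultdict

# Made-up tweets, each already assigned an area code and a theme.
tweets = [
    {"area": "A", "theme": "transport"},
    {"area": "A", "theme": "weather"},
    {"area": "B", "theme": "transport"},
    {"area": "B", "theme": "transport"},
    {"area": "C", "theme": "sport"},
]

# Hypothetical area-level measure, e.g. a deprivation decile per area.
area_measure = {"A": 2, "B": 7}


def themes_by_measure(tweets, area_measure):
    """Count tweet themes for each value of the area-level measure."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        measure = area_measure.get(tweet["area"])
        if measure is None:
            # No area-level data for this tweet; skip it explicitly.
            continue
        counts[measure][tweet["theme"]] += 1
    return {measure: dict(themes) for measure, themes in counts.items()}


linked = themes_by_measure(tweets, area_measure)
print(linked)
```

Tweets from areas with no matching measure are dropped in a visible step, which is the kind of decision a reproducible analysis should put on record.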

We will show the process of data cleaning and analysis using Python, producing a final piece of reproducible research in a single document.

We hope researchers will find these tools useful in their ongoing work and that the use of these tools will facilitate more efficient sharing of research between DEN members.

References
[1] Donald E. Knuth. Literate Programming. CSLI Lecture Notes. Center for the Study of Language and Information (CSLI), Stanford, CA, 1992.
[2] Jupyter Project, http://jupyter.org/
[3] Friedrich Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. In Wolfgang Härdle and Bernd Rönz, editors, Compstat 2002 - Proceedings in Computational Statistics, pages 575-580. Physica Verlag, Heidelberg, 2002. ISBN 3-7908-1517-9, https://www.statistik.lmu.de/~leisch/Sweave/
[4] Knitr, http://yihui.name/knitr/

Helen Rice Sat 16 Apr 2016 11:04AM

I would attend this workshop if it runs.

Jonny Law Wed 20 Apr 2016 5:06PM

I'm a co-author of this workshop proposal, just thought I'd introduce myself. I work alongside Tom, in the Cloud Computing for Big Data CDT. Hopefully we can get more people on board who want to produce quantitative social media research to go alongside existing qualitative research.
