Publications by David Smith

dv01 uses R to bring greater transparency to the consumer lending market

26.04.2017

The founder of the NYC-based startup dv01 watched the 2008 financial crisis and was inspired to bring greater transparency to institutional investors in the consumer lending market. Despite being an open-source shop, they switched their data services to Microsoft SQL Server to provide better performance (reducing latency for queries from tens of...

1402 sym 2 img

Where Europe lives, in 14 lines of R Code

27.04.2017

Via Max Galka, always a great source of interesting data visualizations, we have this lovely visualization of population density in Europe in 2011, created by Henrik Lindberg. Impressively, the chart was created with just 14 lines of R code. (To recreate it yourself, download the GEOSTAT-grid-POP-1K-2011-V2-0-1.zip file from eurostat, and move ...

1127 sym 2 img

Make pleasingly parallel R code with rxExecBy

28.04.2017

Some things are easy to convert from a long-running sequential process to a system where each part runs at the same time, thus reducing the required time overall. We often call these “embarrassingly parallel” problems, but given how easy it is to reduce the time it takes to execute them by converting them into a parallel process, “pleasingl...

2505 sym 2 img
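
To make the idea of an "embarrassingly parallel" problem concrete, here is a minimal sketch in which each group of a data set is processed independently, first sequentially and then on a small local cluster. It uses base R's parallel package rather than rxExecBy; the per-group task (a regression slope on the built-in mtcars data) and the name fit_slope are illustrative assumptions, not taken from the post.

```r
library(parallel)

# Hypothetical per-group task: slope of mpg on wt within each cylinder group.
groups <- split(mtcars, mtcars$cyl)
fit_slope <- function(df) coef(lm(mpg ~ wt, data = df))[["wt"]]

# Sequential version: one group after another.
sequential <- lapply(groups, fit_slope)

# Parallel version: the groups are independent, so they can be
# distributed across worker processes without any coordination.
cl <- makeCluster(2)
parallel_result <- parLapply(cl, groups, fit_slope)
stopCluster(cl)

print(unlist(sequential))
print(unlist(parallel_result))
```

Because no group depends on another, the two versions should return the same values; only the elapsed time differs once the per-group work is large enough to outweigh the cost of starting the workers.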

Using Microsoft R with Alteryx

01.05.2017

Alteryx Designer, the self-service analytics workflow tool, recently added integration with Microsoft R. This allows you to train models provided by Microsoft R, and create predictions from them, without needing to write R code — you simply drag-and-drop to create a workflow. In a recent post at the Microsoft R blog, Bharath Sankaranarayan wal...

1184 sym 2 img

The Datasaurus Dozen

02.05.2017

There's a reason why data scientists spend so much time exploring data using graphics. Relying only on data summaries like means, variances, and correlations can be dangerous, because wildly different data sets can give similar results. This is a principle that has been demonstrated in statistics classes for decades with Anscombe's Quartet: fou...

2032 sym 4 img
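
The point about summaries can be checked directly with Anscombe's Quartet, which ships with R as the built-in anscombe data frame. The sketch below computes the mean, variance, and correlation for each of the four x/y pairs; the name summary_stats is an illustrative choice.

```r
# Summary statistics for the four x/y pairs in the built-in `anscombe` data.
summary_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set    = i,
    mean_x = mean(x),
    mean_y = mean(y),
    var_x  = var(x),
    var_y  = var(y),
    cor_xy = cor(x, y)
  )
}))

print(round(summary_stats, 2))
```

The four rows come out nearly identical, yet plotting each pair (for example with plot(anscombe$x1, anscombe$y1)) shows four very different shapes.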

Technical Foundations of Informatics: A modern introduction to R

03.05.2017

Informatics (or Information Science) is the practice of creating, storing, finding, manipulating and sharing information. These are all tasks that the R language was designed for, and so Technical Foundations of Informatics, the online course guide for the University of Washington course of the same name, also provides an excellent resource for...

3038 sym 2 img

Real-time scoring with Microsoft R Server 9.1

04.05.2017

Once you've built a predictive model, in many cases the next step is to operationalize the model: that is, generate predictions from the pre-trained model in real time. In this scenario, latency becomes the critical metric: new data typically become available a single row at a time, and it's important to respond with that single prediction (or s...

2433 sym 2 img

Predicting Hospital Length of Stay using SQL Server R Services

09.05.2017

Last week, my Microsoft colleagues Bharath Sankaranarayan and Carl Saroufim presented a live webinar showing how you can predict a patient's length of stay at a hospital using SQL Server R Services. The recorded webinar is available for on-demand viewing now. (Registration is required to view.) The webinar is based on the Machine Learning Solut...

3143 sym 2 img

Stack Overflow Trends

10.05.2017

Developer Q&A site Stack Overflow recently introduced Stack Overflow Trends, a useful tool for tracking the growth and decline in the rate of questions asked on various topics (by their Stack Overflow tag). For example, you can see that activity around both R and Python has been increasing over the past 8 years. As you'd expect from a general pu...

1247 sym 4 img

Analyzing data on CRAN packages

11.05.2017

There's a handy new function in R 3.4.0 for anyone interested in data about CRAN packages. It's not documented, but it's pretty simple: tools::CRAN_package_db() returns a data frame with one row for every package on CRAN and 65 columns of data on those packages, as shown below. > names(tools::CRAN_package_db()) [1] "Package" "Ver...

1958 sym R (1852 sym/2 pcs) 4 img
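
Since tools::CRAN_package_db() returns an ordinary data frame with one row per package, the usual data frame tools apply to it. The sketch below is a minimal illustration that runs offline on a hypothetical three-row stand-in. The package names are made up, and the License column is assumed to be among the columns of the real table (only "Package" is confirmed by the excerpt above).

```r
# With network access and R >= 3.4.0, the real table would come from:
#   db <- tools::CRAN_package_db()
# Hypothetical stand-in so the sketch runs without a CRAN connection.
db <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC"),
  License = c("GPL-3", "MIT + file LICENSE", "GPL-3"),  # assumed column
  stringsAsFactors = FALSE
)

# One row per package, so nrow() counts packages.
n_packages <- nrow(db)

# Tabulate a column and sort by frequency, most common first.
license_counts <- sort(table(db$License), decreasing = TRUE)

print(n_packages)
print(head(license_counts, 10))
```

Swapping the stand-in for the real call should leave the rest of the code unchanged, provided the assumed column is present.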