Publications by Guest Blogger

Data Science Accelerator for Credit Risk Prediction

12.07.2017

by Fang Zhou, Data Scientist, and Graham Williams, Director of Data Science, both at Microsoft. Credit risk scoring is a classic but increasingly important operation in banking, as banks are becoming far more risk-conscious when lending for mortgages, credit card payments or other commercial purposes, in an industry known for fierce competition and the gl...

4310 sym 6 img

Tutorial: Deep Learning with R on Azure with Keras and CNTK

09.08.2017

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft). Microsoft's Cognitive Toolkit (better known as CNTK) is a commercial-grade and open-source framework for deep learning tasks. At present CNTK does not have a native R interface but can be accessed through Keras, a high-level API which wraps various ...

4453 sym R (1254 sym/3 pcs) 2 img

Calculating a fuzzy kmeans membership matrix with R and Rcpp

24.08.2017

by Błażej Moska, computer science student and data science intern. Suppose that we have performed K-means clustering in R and are satisfied with our results, but later we realize that it would also be useful to have a membership matrix. Of course it would be easier to repeat clustering using one of the fuzzy kmeans functions availab...

2863 sym R (3167 sym/6 pcs)

Estimating mean variance and mean absolute bias of a regression tree by bootstrapping using foreach and rpart packages

26.10.2017

by Błażej Moska, computer science student and data science intern. One of the most important things in predictive modelling is how our algorithm will cope with various datasets, both training and testing (previously unseen). This is closely connected with the concept of the bias-variance tradeoff. Roughly speaking, variance of an estimator descr...

3711 sym

Role Playing with Probabilities: The Importance of Distributions

02.11.2017

by Jocelyn Barker, Data Scientist at Microsoft. I have a confession to make. I am not just a statistics nerd; I am also a role-playing games geek. I have been playing Dungeons and Dragons (DnD) and its variants since high school. While playing with my friends the other day, it occurred to me that DnD may have some lessons to share in my job as a data s...

12563 sym 16 img 2 tbl

Recap: EARL Boston 2017

09.11.2017

by Emmanuel Awa, Francesca Lazzeri and Jaya Mathew, data scientists at Microsoft. A few of us got to attend the EARL conference in Boston last week, which brought together a group of talented users of R from academia and industry. The conference highlighted various Enterprise Applications of R. Despite being a small conference, the quality of the talks...

2444 sym

Scale up your parallel R workloads with containers and doAzureParallel

21.11.2017

by JS Tan (Program Manager, Microsoft). The R language is by far the most popular statistical language, and has seen massive adoption in both academia and industry. In our new data-centric economy, the models and algorithms that data scientists build in R are not just being used for research and experimentation. They are now also being deploye...

2969 sym 2 img

How to make Python easier for the R user: revoscalepy

28.11.2017

by Siddarth Ramesh, Data Scientist, Microsoft. I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organization...

8032 sym Python (1464 sym/5 pcs)

An introduction to seplyr

14.12.2017

by John Mount, Win-Vector LLC. seplyr is an R package that supplies improved standard evaluation interfaces for many common data wrangling tasks. The core of seplyr is a re-skinning of dplyr's functionality to seplyr conventions (similar to how stringr re-skins the implementing package stringi). Standard Evaluation and Non-Standard Evaluation ...

6386 sym R (3467 sym/9 pcs) 6 img

DataExplorer: Fast Data Exploration With Minimum Code

08.02.2018

by Boxuan Cui, Data Scientist at Smarter Travel. Once upon a time, there was a joke: "In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data." (Big Data Borat, @BigDataBorat, February 27, 2013) According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyabl...

3995 sym R (1599 sym/13 pcs) 16 img