Publications by Guest Blogger
Data Science Accelerator for Credit Risk Prediction
by Fang Zhou, Data Scientist, and Graham Williams, Director of Data Science, both at Microsoft
Credit risk scoring is a classic but increasingly important operation in banking, as banks are becoming far more careful about risk when lending for mortgages, credit card payments or other commercial purposes, in an industry known for fierce competition and the gl...
4310 sym 6 img
Tutorial: Deep Learning with R on Azure with Keras and CNTK
by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)
Microsoft's Cognitive Toolkit (better known as CNTK) is a commercial-grade and open-source framework for deep learning tasks. At present CNTK does not have a native R interface but can be accessed through Keras, a high-level API which wraps various ...
4453 sym R (1254 sym/3 pcs) 2 img
Calculating a fuzzy kmeans membership matrix with R and Rcpp
by Błażej Moska, computer science student and data science intern
Suppose that we have performed K-means clustering in R and are satisfied with our results, but later we realize that it would also be useful to have a membership matrix. Of course it would be easier to repeat clustering using one of the fuzzy kmeans functions availab...
2863 sym R (3167 sym/6 pcs)
Estimating mean variance and mean absolute bias of a regression tree by bootstrapping using foreach and rpart packages
by Błażej Moska, computer science student and data science intern
One of the most important things in predictive modelling is how our algorithm copes with various datasets, both training and testing (previously unseen). This is closely connected with the concept of the bias-variance tradeoff. Roughly speaking, variance of an estimator descr...
3711 sym
Role Playing with Probabilities: The Importance of Distributions
by Jocelyn Barker, Data Scientist at Microsoft
I have a confession to make. I am not just a statistics nerd; I am also a role-playing games geek. I have been playing Dungeons and Dragons (DnD) and its variants since high school. While playing with my friends the other day, it occurred to me that DnD may have some lessons to share in my job as a data s...
12563 sym 16 img 2 tbl
Recap: EARL Boston 2017
by Emmanuel Awa, Francesca Lazzeri and Jaya Mathew, data scientists at Microsoft
A few of us got to attend the EARL conference in Boston last week, which brought together a group of talented users of R from academia and industry. The conference highlighted various Enterprise Applications of R. Despite being a small conference, the quality of the talks...
2444 sym
Scale up your parallel R workloads with containers and doAzureParallel
by JS Tan (Program Manager, Microsoft)
The R language is by far the most popular statistical language, and has seen massive adoption in both academia and industry. In our new data-centric economy, the models and algorithms that data scientists build in R are not just being used for research and experimentation. They are now also being deploye...
2969 sym 2 img
How to make Python easier for the R user: revoscalepy
by Siddarth Ramesh, Data Scientist, Microsoft
I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organization...
8032 sym Python (1464 sym/5 pcs)
An introduction to seplyr
by John Mount, Win-Vector LLC
seplyr is an R package that supplies improved standard evaluation interfaces for many common data wrangling tasks. The core of seplyr is a re-skinning of dplyr's functionality to seplyr conventions (similar to how stringr re-skins the implementing package stringi). Standard Evaluation and Non-Standard Evaluation ...
6386 sym R (3467 sym/9 pcs) 6 img
DataExplorer: Fast Data Exploration With Minimum Code
by Boxuan Cui, Data Scientist at Smarter Travel
Once upon a time, there was a joke: "In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data." (Big Data Borat, @BigDataBorat, February 27, 2013)
According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyabl...
3995 sym R (1599 sym/13 pcs) 16 img