Publications by David Smith
In case you missed it: March 2017 roundup
In case you missed them, here are some articles from March of particular interest to R users. A tutorial and comparison of the SparkR, sparklyr, rsparkling, and RevoScaleR packages for using R with Spark. An analysis of Scrabble games between AI players. The doAzureParallel package, a backend to “foreach” for parallel computations on Azure...
Prepare real-world data for analysis with the vtreat package
As anyone who's tried to analyze real-world data knows, there are any number of problems that may be lurking in the data that can prevent you from being able to fit a useful predictive model: Categorical variables can include infrequently-used levels, which will cause problems if sampling leaves them unrepresented in the training set. Numerical ...
Data Amp: a major on-line Microsoft event, April 19
This coming Wednesday, April 19 at 8AM Pacific Time (click for your local time), Microsoft will be hosting a major on-line event of interest to anyone working with big data, analytics, and artificial intelligence: Microsoft Data Amp. During Data Amp, Executive Vice President Scott Guthrie and Corporate Vice President Joseph Sirosh will share how ...
Free AI Workshop, May 9 in Seattle
There will be a free AI workshop in Seattle on May 9, presented by members of the Microsoft Data Science team. The AI Immersion Workshop includes five specializations to choose from (in parallel tracks), all focused on an aspect of developing and deploying intelligent applications: Applied Machine Learning for Developers, featuring Microsoft R Se...
Warren Buffett Shareholder Letters: Sentiment Analysis in R
Warren Buffett — known as the “Oracle of Omaha” — is one of the most successful investors of all time. Wherever the winds of the market may blow, he always seems to find a way to deliver impressive returns for his investors and his company, Berkshire Hathaway. Every year he authors his famous “shareholder letter” with his musing abo...
Microsoft R Server 9.1 now available
During today's Data Amp online event, Joseph Sirosh announced the new Microsoft R Server 9.1, which is available for customers now. In addition, the updated Microsoft R Client, which has the same capabilities for local use, is available free for everyone on both Windows and, new to this update, Linux. This release adds many new capabilit...
SQL Server 2017 to add Python support
One of the major announcements from yesterday's Data Amp event was that SQL Server 2017 will add Python as a supported language. Just as with the continued R support, SQL Server 2017 will allow you to process data in the database using any Python function or package without needing to export the data from the database, and use SQL Server itself a...
Reproducible Data Science with R
Yesterday, I had the honour of presenting at The Data Science Conference in Chicago. My topic was Reproducible Data Science with R, and while the specific practices in the talk are aimed at R users, my intent was to make a general argument for doing data science within a reproducible workflow. Whatever your tools, a reproducible process: Saves...
R 3.4.0 now available
R 3.4.0, the latest release of the R programming language (codename: “You Stupid Darkness”), is now available. This is the annual major update to the R language engine, and provides improved performance for R programs. The source code was released by the R Core Team on Friday and binaries for Windows, Mac and Linux are available for downloa...
Using checkpoint with knitr and RStudio
The knitr package by Yihui Xie is a wonderful tool for reproducible data science. I especially like using it with R Markdown documents, where with some simple markup in an easy-to-read document I can easily combine R code and narrative text to generate an attractive document with words, tables and pictures in HTML, PDF or Word format. Say, somet...