Publications by David Smith
Data Analysis for Life Sciences
Rafael Irizarry from the Harvard T.H. Chan School of Public Health has presented a number of courses on R and Biostatistics on EdX, and he recently also provided an index of all of the course modules as YouTube videos with supplemental materials. The EdX courses, which you can take for free, are linked below, or you can simply follow the series of YouTube...
1484 sym 2 img
IEEE Spectrum 2017 Top Programming Languages
IEEE Spectrum has published its fourth annual ranking of top programming languages, and the R language is again featured in the Top 10. This year R ranks at #6, down a spot from its 2016 ranking (and with an IEEE score, derived from search, social media, and job listing trends, tied with the #5 place-getter, C#). Python has taken the #1...
1541 sym 2 img
Analyzing GitHub pull requests with Neural Embeddings, in R
At the useR!2017 conference earlier this month, my colleague Ali Zaidi gave a presentation on using Neural Embeddings to analyze GitHub pull request comments (processed using the tidy text framework). The data analysis was done using R and distributed on Spark, and the resulting neural network trained using the Microsoft Cognitive Toolkit. You ...
812 sym
SQL Server 2017 release candidate now available
SQL Server 2017, the next major release of the SQL Server database, has been available as a community preview for around 8 months, but now the first full-featured release candidate is available for public preview. For those looking to do data science with data in SQL Server, there are a number of new features compared to SQL Server 2016 which mig...
1565 sym
Introducing Joyplots
This is a joyplot: a series of histograms, density plots or time series for a number of data segments, all aligned to the same horizontal scale and presented with a slight overlap. Peak time for sports and leisure #dataviz. About time for a joyplot; might do a write-up on them. #rstats code at https://t.co/Q2AgW068Wa pic.twitter.com/SVT6pkB2hB ...
3147 sym
The R6 Class System
R is an object-oriented language with several object-orientation systems. There's the original (and still widely-used) S3 class system based on the “class” attribute. There's the somewhat stricter, signature-based S4 class system. There are reference classes (also called R5), which provide R objects with multiple references without duplicati...
2443 sym R (71 sym/2 pcs)
Learn parallel programming in R with these exercises for "foreach"
The foreach package provides a simple looping construct for R: the foreach function, which you may be familiar with from other languages like JavaScript or C#. It's basically a function-based version of a "for" loop. But what makes foreach useful isn't iteration: it's the way it makes it easy to run those iterations in parallel, and save time on ...
1577 sym
How to use H2O with R on HDInsight
H2O.ai is an open-source AI platform that provides a number of machine-learning algorithms that run on the Spark distributed computing framework. Azure HDInsight is Microsoft's fully-managed Apache Hadoop platform in the cloud, which makes it easy to spin up and manage Azure clusters of any size. It's also easy to run H2O on HDInsight: H2O AI...
1744 sym
A modern database interface for R
At the useR! conference last month, Jim Hester gave a talk about two packages that provide a modern database interface for R. Those packages are the odbc package (developed by Jim and other members of the RStudio team), and the DBI package (developed by Kirill Müller with support from the R Consortium). To communicate with databases, a common ...
4276 sym R (143 sym/3 pcs)
Applications in energy, retail and shipping
The Solutions section of the Cortana Intelligence Gallery provides more than two dozen working examples of applying machine learning, data science and artificial intelligence to real-world problems. Each solution provides sample data, scripts for model training and evaluation, and reporting of predictions. You can deploy a complete stack in Azur...
3156 sym 2 img