Publications by hrbrmstr
R⁶ — Disproving Approval
I couldn’t let this stand unchallenged: The new Rasmussen Poll, one of the most accurate in the 2016 Election, just out with a Trump 50% Approval Rating.That's higher than O's #'s!— Donald J. Trump (@realDonaldTrump) June 18, 2017 Rasmussen makes their Presidential polling data available for both ? & O. Why not compare their ratings from day ...
849 sym R (1462 sym/1 pcs) 2 img
Ten-HUT! The Apache Drill R interface package — sergeant — is now on CRAN
I’m extremely pleased to announce that the sergeant package is now on CRAN or will be hitting your local CRAN mirror soon. sergeant provides JDBC, DBI and dplyr/dbplyr interfaces to Apache Drill. I’ve also wrapped a few goodies into the dplyr custom functions that work with Drill and if you have Drill UDFs that don’t work “out of the box”...
1356 sym
R⁶ — General (Attys) Distributions
Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice): Only Watergate and the Civil War have prompted shorter tenures as AG (if Sessions were to leave now). A daily viz: https://t.co/aJ4KDsC5kC pic.twitter.com/ZoiEV3MhGp— Matt...
1562 sym R (1732 sym/1 pcs) 4 img
Reading PCAP Files with Apache Drill and the sergeant R Package
It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes the access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hive, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also means that I get access to all those platforms in R...
4033 sym R (3226 sym/1 pcs)
Analyzing “Crawl-Delay” Settings in Common Crawl robots.txt Data with R
One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest: Apologies for a Medium link but if you do ANY web scraping, you need to read this #rstats // Ethics in Web Scraping https://t.co/y5YxvzB8Fd— boB Rudis (@hrbrmstr) July 26, 2017 If you load up that tweet and follow the thread, you’l...
5056 sym R (1169 sym/3 pcs) 2 img
R⁶ — Reticulating Parquet Files
The reticulate package provides a very clean & concise interface bridge between R and Python which makes it handy to work with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to use reticulate to create parquet files directly from R using reticulate as a bridge to the pyarrow module,...
2182 sym R (753 sym/3 pcs)
R⁶ — Exploring macOS Applications with codesign, Gatekeeper & R
(General reminder abt “R⁶” posts in that they are heavy on code-examples, minimal on expository. I try to design them with 2-3 “nuggets” embedded for those who take the time to walk through the code examples on their systems. I’ll always provide further expository if requested in a comment, so don’t hesitate to ask if something is c...
2402 sym R (4255 sym/1 pcs)
Caching httr Requests? This means WAR[C]!
I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on GitHub to work with WARC files it’s a tad fragile and improving it would mean reinv...
4394 sym R (5032 sym/6 pcs)
Reticulating Readability
I needed to clean some web HTML content for a project and I usually use hgr::clean_text() for it and that generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document and it usually does a good job but there are some pages that it fails miserably on since it’...
7232 sym R (481 sym/3 pcs)
Unbottling “.msg” Files in R
There was a discussion on Twitter about the need to read in “.msg” files using R. The “MSG” file format is one of the many binary abominations created by Microsoft to lock folks and users into their platform and tools. Thankfully, they (eventually) provided documentation for the MSG file format which helped me throw together a small R pac...
2103 sym R (3880 sym/1 pcs) 4 img