Publications by John Mount
I think Pandas may have “lost the plot.”
I’ve thought of Pandas as an in-memory column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database. Now I kind of wonder what Pandas is, or what it wants to be. The version 1.3.0 package seems to be marking natural ways to work with a data frame as “low performance” and issuing warnin...
1644 sym Python (1338 sym/10 pcs) 2 img
Using the data algebra for Statistics and Data Science
I have a new intermediate introduction on the data algebra up here: Using the data algebra for Statistics and Data Science. The data algebra is a tool for data processing in Python which is implemented on top of any of Pandas, Google BigQuery, PostgreSQL, MySQL, Spark, and SQLite. It allows you to develop data processing pipelines incrementally an...
623 sym
Data Algebra 0.9.0 Release
I am pleased to announce the 0.9.0 release of the data algebra. The data algebra is a realization of the Codd relational algebra for data, written in terms of Python method chaining. It allows the concise, clear specification of useful data transforms. Some examples can be found here. Benefits include being able to specify a single data transformat...
1624 sym
An appreciation of Cover’s universal portfolio in Python
I have a new theoretical finance note up: an appreciation of Cover’s universal portfolio in Python.
279 sym
An Effective Personal Jupyter Data Science Workflow
I would like to share what I have found to be a very effective personal Jupyter workflow for data science development. Jupyter (nee IPython) workbooks are JSON documents that allow a data scientist to mix: code, markdown, results, images, and graphs. They are a great contribution to scientific reproducibility, as they can contain a number of steps...
6821 sym 4 img
Y-Aware PCA
We have had some trouble with some articles being damaged or hard to access in the Win Vector blog. I (John Mount) do want to apologize for that. In particular, the graphs are missing for Dr. Nina Zumel’s wonderful y-aware Principal Components regression series. The complete R .md and .Rmd files that generated the articles are easy to get to, and...
2364 sym
Separating Code from Presentation in Jupyter Notebooks
One of the great conveniences of performing a data science style analysis using Jupyter is that Jupyter notebooks are literate containers that combine code, text, results, and graphs. This is also one of the pain points in working with Jupyter notebooks with partners or with source control. That is: Jupyter notebooks are JSON (which rapidly becom...
3921 sym R (249 sym/4 pcs) 2 img
Survive R
New PDF slides version (presented at the Bay Area R Users Meetup October 13, 2009). We at Win-Vector LLC appear to like R a bit more than some of our, perhaps wiser, colleagues (see: Choose your weapon: Matlab, R or something else? and R and data). While we do like R (see: Exciting Technique #1: The “R” language) we also understand the ne...
4527 sym
R examine objects tutorial
This article is a quick concrete example of how to use the techniques from Survive R to lower the steepness of The R Project for Statistical Computing’s learning curve (so an apology to all readers who are not interested in R). What follows is for people who already use R and want to achieve more control of the software. I am a fan of the R. T...
7447 sym R (1917 sym/11 pcs) 4 img
CRU graph yet again (with R)
IowaHawk has an excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produce similarly bad results using R. If the re-constructed technique is close to what was originally done then so many bad moves were taken that you can’t learn much of any...
4631 sym R (3279 sym/5 pcs) 8 img