Publications by John Mount

Better SQL Generation via the data_algebra

18.12.2019

In our recent note What is new for rquery December 2019 we mentioned an ugly processing pipeline that translates into SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra. dplyr translates the query to SQL as: SELECT 5.0 AS `x`, `sum23` FROM (SELECT `col1`, `col2...

1850 sym

A Richer Category for Data Wrangling

22.12.2019

I’ve been writing a lot about category theory interpretations of data-processing pipelines and some of the improvements we feel they are driving in both the data_algebra and in rquery/rqdatatable. I think I’ve found an even better category theory re-formulation of the package, which I will describe here. In the earlier formalism our data transfo...

5800 sym Python (2110 sym/34 pcs) 9 tbl

data_algebra 0.7.0 What is New

07.06.2021

I’ve been tinkering a lot recently with the data_algebra, and just released version 0.7.0 to PyPI. In this note I’ll touch on what the data algebra is, what the new features are, and my plans going forward. The data algebra: The data algebra is a modern realization of elements of Codd’s 1969 relational model for data wrangling (see also Co...

5622 sym Python (2219 sym/15 pcs) 4 tbl

Using WITH For Neater SQL

21.06.2021

I’d like to work an example of using SQL WITH Common Table Expressions to produce more legible SQL. A major complaint with SQL is that it composes statements by rightward nesting. That is: a sequence of operations A -> B -> C is represented as SELECT C FROM SELECT B FROM SELECT A. However, the SQL 99 standard introduced the WITH statement and...

1912 sym Python (1903 sym/10 pcs) 2 tbl
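
To make the nesting complaint in the excerpt above concrete, here is a minimal sketch using Python's standard sqlite3 module. It is not taken from the original article: the table `d`, its columns, and the step names `step_a` and `step_b` are hypothetical, chosen only to show the same three-step query written once by nesting and once with WITH.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (g TEXT, x REAL)")
conn.executemany(
    "INSERT INTO d VALUES (?, ?)",
    [("a", 1.0), ("a", 2.0), ("b", 5.0)],
)

# Nested form: each step wraps the previous one, so the first
# operation ends up innermost and the query reads inside-out.
nested_sql = """
SELECT g, total
FROM (
    SELECT g, SUM(x2) AS total
    FROM (
        SELECT g, x * 2 AS x2
        FROM d
    ) AS step_a
    GROUP BY g
) AS step_b
WHERE total > 6
ORDER BY g
"""

# WITH form: the same steps, named and listed in the order they run.
with_sql = """
WITH step_a AS (
    SELECT g, x * 2 AS x2 FROM d
),
step_b AS (
    SELECT g, SUM(x2) AS total FROM step_a GROUP BY g
)
SELECT g, total FROM step_b WHERE total > 6 ORDER BY g
"""

nested_rows = conn.execute(nested_sql).fetchall()
with_rows = conn.execute(with_sql).fetchall()
conn.close()

print(nested_rows == with_rows)
```

Both queries return the same rows; only the reading order differs. In the WITH form each intermediate result has a name and appears in the order it is computed.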

I think Pandas may have “lost the plot.”

04.08.2021

I’ve thought of Pandas as an in-memory column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database. Now I kind of wonder what Pandas is, or what it wants to be. The version 1.3.0 package seems to be marking natural ways to work with a data frame as “low performance” and issuing warnin...

1644 sym Python (1338 sym/10 pcs) 2 img

Using the data algebra for Statistics and Data Science

28.09.2021

I have a new intermediate introduction on the data algebra up here: Using the data algebra for Statistics and Data Science. The data algebra is a tool for data processing in Python which is implemented on top of any of Pandas, Google BigQuery, PostgreSQL, MySQL, Spark, and SQLite. It allows you to develop data processing pipelines incrementally an...

623 sym

Data Algebra 0.9.0 Release

09.10.2021

I am pleased to announce the 0.9.0 release of the data algebra. The data algebra is a realization of the Codd relational algebra for data, written in terms of Python method chaining. It allows the concise, clear specification of useful data transforms. Some examples can be found here. Benefits include being able to specify a single data transformat...

1624 sym

An appreciation of Cover’s universal portfolio in Python

23.02.2022

I have a new theoretical finance note up: an appreciation of Cover’s universal portfolio in Python.

279 sym

An Effective Personal Jupyter Data Science Workflow

20.08.2022

I would like to share what I have found to be a very effective personal Jupyter workflow for data science development. Jupyter (née IPython) workbooks are JSON documents that allow a data scientist to mix: code, markdown, results, images, and graphs. They are a great contribution to scientific reproducibility, as they can contain a number of steps...

6821 sym 4 img

Y-Aware PCA

08.09.2022

We have had some trouble with some articles being damaged or hard to access in the Win Vector blog. I (John Mount) do want to apologize for that. In particular the graphs are missing for Dr. Nina Zumel’s wonderful y-aware Principal Components regression series. The complete R .md and .Rmd files that generated the articles are easy to get to, and...

2364 sym