Publications by George Pipis

Hack: How to Convert all Character Variables to Factors

06.10.2020

Suppose we want to convert all character variables to factors in a large data frame with many columns, where converting them one by one is not practical. Our approach is to detect the “char” variables and convert them to “Factors”. Let’s provide a toy example: df<-data.frame(Gender = c("F"...

896 sym R (288 sym/2 pcs) 4 img
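
The detect-and-convert approach described in the entry above could look like the following minimal sketch in base R. The data frame `toy` and its columns are hypothetical and are not the post's own example.

```r
# Hypothetical toy data frame; the column names and values are illustrative only.
toy <- data.frame(
  gender = c("F", "M", "F"),
  city   = c("Rome", "Oslo", "Rome"),
  age    = c(31, 45, 28),
  stringsAsFactors = FALSE
)

# Detect the character columns, then convert only those to factors.
is_char <- vapply(toy, is.character, logical(1))
toy[is_char] <- lapply(toy[is_char], as.factor)

# Inspect the resulting column types.
str(toy)
```

Numeric columns are left untouched because only the columns flagged by `is.character` are reassigned.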

How to get Data from Different Sources in R

06.10.2020

The data that we want to get could be in different places and in different formats. We will provide some examples of how you can get data from different sources. Get Data from SQL It is very common for the data to be stored in an SQL database. We have provided an extensive example of how you can connect R with SQL. Get csv/text Data from HTTP(s) ...

1823 sym R (1234 sym/5 pcs) 2 img

ANOVA vs Multiple Comparisons

15.10.2020

When we run an ANOVA, we analyze the differences among group means in a sample. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means. ANOVA Null and Alternative Hypothesis The null hypothesis in ANOVA is that there is no difference between m...

2937 sym R (2303 sym/9 pcs) 2 img
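
As a rough illustration of the entry above, the sketch below runs a one-way ANOVA and then a multiple comparison procedure in base R. It assumes the built-in `PlantGrowth` dataset and Tukey's HSD as the follow-up test; the post itself may use different data and a different procedure.

```r
# PlantGrowth ships with R: plant weight under a control and two treatments.
fit <- aov(weight ~ group, data = PlantGrowth)

# Omnibus F-test: the null hypothesis is that all group means are equal.
summary(fit)

# Tukey's HSD: pairwise differences between groups with
# family-wise error rate control.
tukey <- TukeyHSD(fit)
tukey
```

The ANOVA table only says whether some difference exists among the groups, while the pairwise output indicates which pairs of groups differ.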

Hack: How to Install and Load Packages Dynamically

16.10.2020

When we share an R script file with someone else, we assume that they have already installed the required R packages. However, this is not always the case and for that reason, I strongly suggest adding this piece of code to every shared R script which requires a package. Let’s assume that your code requires the following three packages: “rea...

886 sym R (181 sym/1 pcs)
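
One possible way to install and load packages dynamically, as the entry above suggests, is sketched below. The vector `required` holds placeholder names (base packages, so that running the sketch installs nothing); it is not the post's own package list.

```r
# Placeholder package names; replace with the packages your script needs.
# Base packages are used here so that running the sketch installs nothing.
required <- c("stats", "utils")

# Keep only the packages that are not installed yet.
to_install <- setdiff(required, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install)
}

# Attach every required package; character.only lets us pass names as strings.
invisible(lapply(required, library, character.only = TRUE))
```

Placing a snippet like this at the top of a shared script means the recipient does not have to install anything by hand before running it.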

Hack: The “count(case when … else … end)” in dplyr

18.10.2020

When I run queries in SQL (or even HiveQL, Spark SQL and so on), it is quite common to use the syntax of count(case when.. else ... end). Today, I will show you an example of how to run this type of command in dplyr. Let’s start: library(sqldf) library(dplyr) df<-data.frame(id = 1:10, gender = c("m","m","m","f","f","f","m","...

829 sym R (893 sym/5 pcs) 2 img
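
A common dplyr counterpart of the SQL conditional count mentioned above is to sum a logical condition inside `summarise()`. The sketch below assumes that dplyr is installed and uses a hypothetical data frame `orders`; it is not necessarily the approach the post takes.

```r
library(dplyr)

# Hypothetical data; the column names and values are illustrative only.
orders <- data.frame(
  id     = 1:6,
  gender = c("m", "m", "f", "f", "f", "m"),
  amt    = c(5, 20, 30, 10, 20, 50)
)

# SQL: count(case when gender = 'm' then 1 end) counts the matching rows.
# In R, summing a logical vector does the same, because TRUE counts as 1.
result <- orders %>%
  summarise(
    n_male       = sum(gender == "m"),
    n_female     = sum(gender == "f"),
    n_large_male = sum(gender == "m" & amt >= 20)
  )
result
```

If the columns can contain missing values, adding `na.rm = TRUE` to each `sum()` keeps an `NA` from propagating into the counts.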

Hack: The ‘[‘ in R lists

18.10.2020

Assume that you have a list and you want to get the n-th element of each component or, more generally, to subset the list. You can use the command sapply(list, "[", c(1,2,3,..)). Let’s see this in practice. mylist<-list(id=1:10, gender=c("m","m","m","f","f","f","m","f","f","f"), amt=c(5,20,30,10,20,50,5,20,10,30) ...

688 sym R (394 sym/4 pcs)
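
To make the `sapply(list, "[", ...)` idiom from the entry above concrete, here is a small sketch with a hypothetical all-numeric list called `scores`.

```r
# Hypothetical list with three numeric components of equal length.
scores <- list(
  a = c(1, 2, 3),
  b = c(10, 20, 30),
  c = c(100, 200, 300)
)

# "[" is an ordinary function, so sapply can apply it to every component.
# Second element of each component, returned as a named vector.
second <- sapply(scores, "[", 2)
second

# First and third elements of each component, one column per component.
first_third <- sapply(scores, "[", c(1, 3))
first_third
```

When the components have different types, such as numbers and strings, `sapply` coerces the result to a common type, so `lapply` may be the safer choice in that case.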

Tidyverse Tips

01.11.2020

I have found the following commands quite useful during the EDA part of any Data Science project. We will work with the tidyverse package, although we will actually need only dplyr and ggplot2, and with the iris dataset. select_if | rename_if The select_if function belongs to dplyr and is very useful when we want to choose some columns based on...

3102 sym R (739 sym/8 pcs) 16 img
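
A short sketch of the two functions named in the entry above, applied to `iris`. It assumes that dplyr is installed; the choice of `is.numeric` and `toupper` is illustrative and may differ from the post's examples.

```r
library(dplyr)

# Keep only the columns for which the predicate returns TRUE.
numeric_cols <- iris %>% select_if(is.numeric)
names(numeric_cols)

# Rename only the columns that satisfy the predicate.
renamed <- iris %>% rename_if(is.numeric, toupper)
names(renamed)
```

Recent dplyr versions mark these scoped verbs as superseded; `select(where(is.numeric))` and `rename_with(toupper, where(is.numeric))` are the newer equivalents.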

Excess Deaths during the 1st Wave of Covid-19

02.11.2020

Abstract Our goal is to provide some summary statistics of deaths across countries during the 1st Wave of Covid-19 and to compare these numbers with the corresponding ones of the previous years. This analysis is not scientific and we cannot draw any conclusion about the impact of Covid-19 since we need to take into consideration many other param...

4683 sym R (1563 sym/5 pcs) 34 img

Undersampling by Groups in R

06.11.2020

When we are dealing with unbalanced classes in Machine Learning projects there are many approaches that you can follow. To name some of them: Undersampling: We try to reduce the observations from the majority class so that the final dataset is balanced. Oversampling: We try to generate more observations from the minority class usually by re...

2548 sym R (2672 sym/10 pcs)
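
The undersampling idea from the entry above can be sketched in base R by sampling every class down to the size of the smallest one. The data frame `obs` and its columns are hypothetical, and the post may implement this differently.

```r
set.seed(1)

# Hypothetical unbalanced data: 90 rows of class "a" and 10 rows of class "b".
obs <- data.frame(
  class = rep(c("a", "b"), times = c(90, 10)),
  x     = rnorm(100)
)

# Size of the smallest class; every class is sampled down to this size.
n_min <- min(table(obs$class))

# Split the row indices by class and draw n_min of them from each group.
# sample.int on the group length avoids the sample() pitfall with
# length-one vectors.
keep <- unlist(
  lapply(split(seq_len(nrow(obs)), obs$class), function(rows) {
    rows[sample.int(length(rows), n_min)]
  }),
  use.names = FALSE
)

balanced <- obs[keep, ]
table(balanced$class)
```

The same pattern extends to more than two classes, since the split is driven by whatever values the grouping column contains.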

Skewness and Kurtosis in Statistics

09.11.2020

Most commonly, a distribution is described by its mean and variance, which are the first and second moments respectively. Other, less common measures are the skewness (third moment) and the kurtosis (fourth moment). Today, we will try to give a brief explanation of these measures and we will show how we can calculate them in R. Skewness The skewne...

5350 sym R (1187 sym/10 pcs) 10 img
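
One way to calculate the two measures from the entry above in base R is through standardized central moments. This is a sketch under that assumption; the function names are our own, and the post may rely on a package that applies small-sample corrections instead.

```r
# Skewness as the third standardized moment (population version, no correction).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Kurtosis as the fourth standardized moment (not excess kurtosis).
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# A small symmetric sample, chosen only for illustration.
x <- c(1, 2, 3, 4, 5)
skewness(x)
kurtosis(x)
```

With this definition a normal distribution has a kurtosis of 3, so subtracting 3 gives the excess kurtosis that some packages report.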