Data Handling

DataFrames tutorials:

Useful packages:

  • DataSkimmer.jl produces a summary of tabular data (e.g. DataFrames), including histograms.
  • Strapping.jl converts between structs and tables.
  • SplitApplyCombine.jl contains data manipulation routines, such as splitdims (converting between vectors of vectors and matrices etc.), group, innerjoin. Similar to what DataFrames offers, but for additional data types.
  • InvertedIndices.jl for selecting when conditions are not true.
  • TableTransforms.jl provides data transformations (scaling, quantiles, selecting rows, etc.). The idea is to construct a pipeline that can be applied to any data source that is compatible with Tables.jl.

DataFrames

Column names can be given as strings or symbols, and access works with either. Within a single grouping expression, however, one needs to pick one or the other.
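A minimal sketch of the point above. The data frame and its column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2], b = ["x", "y", "x"], v = [10, 20, 30])

# Column access accepts either a Symbol or a String.
col_sym = df[!, :a]
col_str = df[!, "a"]

# Within one grouping expression, stay with a single kind of name.
gdf_sym = groupby(df, [:a, :b])
gdf_str = groupby(df, ["a", "b"])

# Both groupings should define the same groups.
same_groups = length(gdf_sym) == length(gdf_str)
```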

Chaining transformations

  • Use a single combine with multiple transformations, as in

combine(df, :a => sum, :b => mean)

  • Multiple combine calls in a row do not work, even with @chain from Chain.jl. The result of each combine is fed into the next step, so later steps no longer see the original columns.
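The following sketch, on a small made-up data frame, shows why a second combine cannot reach back to the original columns: the output of the first one contains only the aggregated column.

```julia
using DataFrames, Statistics

df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# One combine call with both transformations: one row, two result columns.
res = combine(df, :a => sum, :b => mean)

# A first combine on its own drops every column it did not produce,
# so a following combine(step1, :b => mean) would not find :b.
step1 = combine(df, :a => sum)
```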

Converting to multi-dimensional array:

Deleting columns:

  • Using Not from InvertedIndices: select!(df, Not(:x1));

Renaming columns:

  • rename!(df, :old => :new)
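A short sketch combining the two operations above on a hypothetical data frame. It assumes that Not is available after loading DataFrames alone, which re-exports it.

```julia
using DataFrames

df = DataFrame(x1 = 1:3, x2 = 4:6, old = 7:9)

# Drop column :x1 in place by keeping everything except it.
select!(df, Not(:x1))

# Rename :old to :new in place.
rename!(df, :old => :new)
```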

Vector valued outputs of a transformation:

  • Example: compute grouped weighted quantiles (quantile with FrequencyWeights comes from StatsBase.jl)
  • combine(gdf, [:y, :wt] => Ref ∘ ((y, wt) -> quantile(y, FrequencyWeights(wt), [0.1, 0.7])))
  • Composing Ref with the actual transformation wraps the vector result so that it is stored as a single value per group. Without Ref, the vector would be spread over multiple rows (one row for each quantile).
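To see the effect of Ref in isolation, here is a simplified sketch that uses the unweighted quantile from Statistics instead of the weighted version. The data, the probabilities, and the output column name :q are assumptions for illustration.

```julia
using DataFrames, Statistics

df = DataFrame(g = [1, 1, 1, 2, 2, 2], y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
gdf = groupby(df, :g)

probs = [0.1, 0.7]

# With Ref: one row per group; the :q column holds a vector of quantiles.
wrapped = combine(gdf, :y => (Ref ∘ (y -> quantile(y, probs))) => :q)

# Without Ref: each returned vector is spread over several rows.
expanded = combine(gdf, :y => (y -> quantile(y, probs)) => :q)
```

Here wrapped has one row per group, while expanded has one row per group and quantile.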

Grouping

The following gives all the rows of the original DataFrame for which the "keys" have the selected values:

df = DataFrame(pk1=rand(1:10, 100), pk2=rand('a':'z', 100), value=rand(100)); 
gdf = groupby(df, [:pk1, :pk2]);
dfSub = gdf[(1, 'a')]
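Because the example above uses random data, a given key may not exist. The sketch below uses fixed, made-up values and also shows two related lookups that should be available on a GroupedDataFrame: indexing with a NamedTuple and checking for a key with haskey.

```julia
using DataFrames

df = DataFrame(pk1 = [1, 1, 2, 1], pk2 = ['a', 'b', 'a', 'a'],
               value = [0.1, 0.2, 0.3, 0.4])
gdf = groupby(df, [:pk1, :pk2])

# Look up one group by a tuple of key values, in the order of the grouping columns.
dfSub = gdf[(1, 'a')]

# A NamedTuple with the grouping column names selects the same group.
dfSubNamed = gdf[(pk1 = 1, pk2 = 'a')]

# Check whether a key combination exists before indexing with it.
hasKey = haskey(gdf, (3, 'z'))
```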

Multiple combine operations in one go:

weighted_sum(x,m) = sum(x .* m);
combine(gdf2, [[:y,:mass], [:z,:mass]] .=> weighted_sum .=> [:a,:b])
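The snippet above does not define gdf2. A self-contained sketch with a hypothetical grouped data frame shows how the broadcast pairs expand: each input column pair is matched with the corresponding output name.

```julia
using DataFrames

df2 = DataFrame(g = [1, 1, 2], y = [1, 2, 3], z = [4, 5, 6], mass = [1, 1, 2])
gdf2 = groupby(df2, :g)

weighted_sum(x, m) = sum(x .* m)

# Expands to [:y, :mass] => weighted_sum => :a and [:z, :mass] => weighted_sum => :b.
res = combine(gdf2, [[:y, :mass], [:z, :mass]] .=> weighted_sum .=> [:a, :b])
```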

Categorical Data

Assigning levels:

xV = [1,2,3,1,2,3];
# Sorting ensures that categories are sorted correctly.
valV = sort(unique(xV));
xV = categorical(xV);
# Note that recode takes Pairs. These are constructed with `map` and then splatted.
xV = recode(xV, map(j -> j => "x$j", valV)...);

levelcode returns the underlying integer code of an entry. Broadcast it as levelcode.(xV) to get the codes for the whole array.
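As an alternative sketch, the levels can also be fixed when the array is created, using the levels keyword of categorical. The values below are made up for illustration.

```julia
using CategoricalArrays

# Pass the levels explicitly to fix their order, independent of sorting.
sizeV = categorical(["low", "high", "mid", "low"]; levels = ["low", "mid", "high"])

levelV = levels(sizeV)      # the levels in the order given above
codeV = levelcode.(sizeV)   # integer position of each entry within the levels
```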

Missing Values

Missings.jl has convenience functions for dealing with missing values.
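For simple cases, functions that ship with Julia itself may already be enough. The sketch below uses only skipmissing and coalesce from Base, not Missings.jl, on a made-up vector.

```julia
using Statistics

x = [1.0, missing, 3.0]

# skipmissing gives an iterator over the non-missing entries.
xMean = mean(skipmissing(x))

# coalesce replaces each missing entry with a fallback value.
xFilled = coalesce.(x, 0.0)
```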

Also useful, and more general, is Skipper.jl. Example:

sa = skip(x -> isnan(x) || isinf(x), data);
dataMean = mean(sa); # Ignores skipped values
sa .* 2; # Ignores skipped values
sa[2]; # Uses original indices, if not skipped.
complement(sa) .= mean(sa); # Sets the skipped values to the mean

Statistical Data Files

ReadStatTables.jl seems to be the most up-to-date package for reading Stata files.