Data Handling¶
DataFrames tutorials:
Useful packages:
- DataSkimmer.jl produces a summary of tabular data (e.g. DataFrames), including histogram.
- Strapping.jl converts between structs and tables.
- SplitApplyCombine.jl contains data manipulation routines, such as splitdims (converting between vectors of vectors and matrices etc.), group, innerjoin. Similar to what DataFrames offers, but for additional data types.
- InvertedIndices.jl for selecting when conditions are not true.
- TableTransforms.jl provides data transformations (scaling, quantiles, selecting rows, etc). Idea is to construct a pipeline that can be applied to any data source that is Tables.jl compatible.
DataFrames¶
Column names can be string or symbol. Access works with either. But in grouping expressions, one needs to pick one or the other.
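A minimal sketch of both access styles, assuming only DataFrames.jl is installed. The column names a and b are made up for illustration. Each groupby call sticks to one kind of name, as the note above suggests.

```julia
using DataFrames

# Hypothetical toy data; the column names are illustrative only.
df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# Symbol and string access return the same underlying column.
same_column = df[!, :a] === df[!, "a"]

# In a grouping expression, use symbols throughout or strings throughout.
by_symbol = groupby(df, [:a, :b])
by_string = groupby(df, ["a", "b"])
```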
Examining DataFrames
Chaining transformations
- Use only a single combine with multiple transformations, as in combine(df, :a => sum, :b => mean).
- Even with @chain from Chain.jl, multiple combine calls in a row do not work: the result of each combine is fed into the next step, which makes sense.
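A small sketch of the single-combine pattern on a grouped frame. The data and the grouping column g are invented for illustration. After this call the result only contains g, a_sum and b_mean, so a second combine would no longer see the original columns.

```julia
using DataFrames, Statistics

# Hypothetical toy data.
df = DataFrame(g = [1, 1, 2, 2], a = [1, 2, 3, 4], b = [10.0, 20.0, 30.0, 40.0])
gdf = groupby(df, :g)

# Both transformations go into one combine call.
res = combine(gdf, :a => sum, :b => mean)
```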
Converting to multi-dimensional array:
Deleting columns:
- Using Not from InvertedIndices: select!(df, Not(:x1));
Renaming columns:
rename!(df, :old => :new)
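Both operations together on a toy frame, as a sketch. The column names are taken from the snippets above, the data are made up. DataFrames re-exports Not, so no separate using InvertedIndices is needed here.

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(x1 = 1:3, x2 = 4:6, old = 7:9)

select!(df, Not(:x1))       # drop column x1 in place
rename!(df, :old => :new)   # rename column old to new in place
```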
Vector valued outputs of a transformation:
- Example: compute grouped quantiles
combine(gdf, [:y, :wt] => Ref ∘ ((y, wt) -> quantile(y, FrequencyWeights(wt), [0.1, 0.7])))
- Composing Ref with the actual transformation keeps the returned vector as a single value per group. Without it, combine would spread the vector over several rows, producing one row for each quantile.
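The effect of Ref can be seen in a simplified sketch. It uses unweighted quantiles from Statistics instead of FrequencyWeights, so that StatsBase is not required. The data, the column name q and the chosen probabilities are assumptions for illustration.

```julia
using DataFrames, Statistics

# Hypothetical toy data: two groups with three observations each.
df = DataFrame(g = [1, 1, 1, 2, 2, 2], y = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0])
gdf = groupby(df, :g)
probs = [0.0, 0.5, 1.0]

# Without Ref: the vector result is spread over rows (one row per quantile).
spread = combine(gdf, :y => (y -> quantile(y, probs)) => :q)

# With Ref: each group keeps its quantile vector in a single cell.
packed = combine(gdf, :y => Ref ∘ (y -> quantile(y, probs)) => :q)
```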
Grouping¶
The following gives all the rows of the original DataFrame for which the "keys" have the selected values:
df = DataFrame(pk1=rand(1:10, 100), pk2=rand('a':'z', 100), value=rand(100));
gdf = groupby(df, [:pk1, :pk2]);
# Throws a KeyError if the random data contain no row with pk1 == 1 and pk2 == 'a'.
dfSub = gdf[(1, 'a')]
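With random data the requested key may not exist. A deterministic sketch of the same lookup, with made-up values, plus haskey and get for keys that might be absent:

```julia
using DataFrames

# Hypothetical toy data with a known set of key combinations.
df = DataFrame(pk1 = [1, 1, 2, 1], pk2 = ['a', 'b', 'a', 'a'],
               value = [0.1, 0.2, 0.3, 0.4])
gdf = groupby(df, [:pk1, :pk2])

# All rows with pk1 == 1 and pk2 == 'a' (rows 1 and 4 of df).
dfSub = gdf[(1, 'a')]

# Guarded lookups for a key combination that is not present.
key_present = haskey(gdf, (3, 'z'))
fallback = get(gdf, (3, 'z'), nothing)
```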
Multiple combine operations in one go:
weighted_sum(x,m) = sum(x .* m);
combine(gdf2, [[:y,:mass], [:z,:mass]] .=> weighted_sum .=> [:a,:b])
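A self-contained version of the call above, as a sketch. The data and the grouping column g are assumptions, since gdf2 is not defined in these notes. The broadcasted .=> pairs the first input list with :a and the second with :b.

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(g = [1, 1, 2, 2],
               y = [1.0, 2.0, 3.0, 4.0],
               z = [10.0, 20.0, 30.0, 40.0],
               mass = [1.0, 3.0, 2.0, 2.0])
gdf2 = groupby(df, :g)

weighted_sum(x, m) = sum(x .* m)

# Expands to [:y, :mass] => weighted_sum => :a and [:z, :mass] => weighted_sum => :b.
res = combine(gdf2, [[:y, :mass], [:z, :mass]] .=> weighted_sum .=> [:a, :b])
```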
Precomputing temporary variables in groups:
- Examples: compute the weighted mean of several variables. Each requires the total mass of each group, which should be precomputed.
- There doesn't seem to be a good way of doing this. @aside from Chain.jl is made for that purpose. But it's not clear how to access the column of the current GroupedDataFrame. sum(_.mass) works for a DataFrame but not for a GroupedDataFrame.
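One option that may cover this case without macros is to pass a function to combine. It receives each group as a SubDataFrame, so the total mass can be computed once and reused. This is only a sketch with invented data and column names, and it is not the @aside approach mentioned above.

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(g = [1, 1, 2, 2],
               y = [1.0, 2.0, 3.0, 4.0],
               z = [10.0, 20.0, 30.0, 40.0],
               mass = [1.0, 3.0, 2.0, 2.0])
gdf = groupby(df, :g)

res = combine(gdf) do sdf
    total_mass = sum(sdf.mass)   # precomputed once per group
    (y_mean = sum(sdf.y .* sdf.mass) / total_mass,
     z_mean = sum(sdf.z .* sdf.mass) / total_mass)
end
```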
Manually iterating over each GroupedDataFrame:
gdf = groupby(df, :x);
v = Vector{Any}();
for subdf in gdf
push!(v, f(subdf));
end
This works fine. We end up with a Vector of whatever f returns (NamedTuples would be logical). That Vector can now be assembled into a result DataFrame.
The benefit: f can perform arbitrary calculations. No need to deal with macro issues, such as escaping variables.
The drawback (according to the docs) is performance. But I have not tried it.
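A sketch of the full round trip, with a hypothetical f that returns a NamedTuple per group. It uses a comprehension instead of push! into a Vector{Any}, so the result is a concretely typed vector of NamedTuples that the DataFrame constructor accepts directly. Data and column names are made up.

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(x = [1, 1, 2], mass = [1.0, 3.0, 2.0], y = [1.0, 2.0, 3.0])

# Hypothetical per-group function: arbitrary calculations, one NamedTuple out.
function f(subdf)
    total_mass = sum(subdf.mass)
    return (x = first(subdf.x),
            total_mass = total_mass,
            y_mean = sum(subdf.y .* subdf.mass) / total_mass)
end

gdf = groupby(df, :x)
rows = [f(subdf) for subdf in gdf]   # one NamedTuple per group
result = DataFrame(rows)             # assemble into a DataFrame
```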
Categorical Data¶
Assigning levels:
xV = [1,2,3,1,2,3];
# Sorting ensures that categories are sorted correctly.
valV = sort(unique(xV));
xV = categorical(xV);
# Note that recode takes Pairs. These are constructed with `map` and then splatted.
xV = recode(xV, map(j -> j => "x$j", valV)...);
levelcode returns the underlying integer code of a categorical value. Broadcast it (levelcode.(xV)) to get the code for each entry.
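A short sketch of how the codes follow the level order, assuming CategoricalArrays.jl. The string values and their ordering are made up for illustration.

```julia
using CategoricalArrays

# Hypothetical toy data.
xV = categorical(["lo", "hi", "mid", "lo"])

# Set the level order explicitly; the default would be sorted alphabetically.
levels!(xV, ["lo", "mid", "hi"])

# The code of each entry is its position in levels(xV).
codes = levelcode.(xV)
```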
Missing Values¶
Missings.jl has convenience functions for dealing with missing values.
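A few of these helpers in a minimal sketch. skipmissing and coalesce are part of Base, while passmissing and disallowmissing come from Missings.jl. The data are made up.

```julia
using Missings, Statistics

# Hypothetical toy data.
x = [1.0, missing, 3.0]

m = mean(skipmissing(x))           # mean over the non-missing entries
filled = coalesce.(x, 0.0)         # replace missing by 0.0
safe_sqrt = passmissing(sqrt)      # returns missing when the argument is missing
roots = safe_sqrt.(x)
strict = disallowmissing(filled)   # element type without Missing
```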
Also useful, but more general is Skipper.jl. Example:
sa = skip(x -> isnan(x) || isinf(x), data);
dataMean = mean(sa); # Ignores skipped value
sa .* 2; # Ignores skipped values
sa[2]; # Uses original indices, if not skipped.
complement(sa) .= mean(sa); # Sets skipped value
Statistical Data Files¶
ReadStatTables.jl seems most up to date for reading STATA files.