- DataSkimmer.jl produces a summary of tabular data (e.g.
DataFrames), including histograms.
structs and tables.
- SplitApplyCombine.jl contains data manipulation routines, such as
`splitdims` (converting between vectors of vectors and matrices, etc.) and
`innerjoin`. Similar to what
DataFrames offers, but for additional data types.
- InvertedIndices.jl for selecting when conditions are not true.
- TableTransforms.jl provides data transformations (scaling, quantiles, selecting rows, etc.). The idea is to construct a pipeline that can be applied to any data source that is
Column names can be strings or symbols. Access works with either, but in grouping expressions one needs to pick one or the other.
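A minimal sketch of the two access styles, assuming a made-up DataFrame (the column names `a` and `b` are illustrative only). Within a single `groupby` call the sketch sticks to one style:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2], b = ["x", "y", "x"], v = [1.0, 2.0, 3.0])

# Column access works with a Symbol or a String.
colSym = df[!, :a]
colStr = df[!, "a"]

# In a grouping expression, use all Symbols or all Strings.
gSym = groupby(df, [:a, :b])
gStr = groupby(df, ["a", "b"])
```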
- only have a single combine with multiple transformations as in
combine(df, :a => sum, :b => mean)
- even with
`combine` in a row do not work. The result of each
`combine` is fed into the next step, which makes sense.
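The difference between one `combine` with several transformations and several `combine` calls in a row can be sketched as follows. The data are made up for illustration:

```julia
using DataFrames, Statistics

df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# One combine, several transformations: both results land in the same row.
res = combine(df, :a => sum, :b => mean)

# A second combine would only see the output of the first one.
# That output has the single column a_sum, so :b is no longer available.
step1 = combine(df, :a => sum)
remaining = names(step1)
```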
Converting to multi-dimensional array:
Renaming a column:
rename!(df, :old => :new)
Vector valued outputs of a transformation:
- Example: compute grouped quantiles
combine(gdf, [:y, :wt] => Ref ∘ ((y, wt) -> quantile(y, FrequencyWeights(wt), [0.1, 0.7])))
- The composition of
`Ref` with the actual transformation prevents broadcasting the results (which would produce one row for each quantile).
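A small self-contained sketch of the effect of `Ref`, assuming made-up data with integer frequency weights. The helper name `wquant` and the target column `:q` are illustrative, not part of the original notes:

```julia
using DataFrames, StatsBase

df = DataFrame(g  = [1, 1, 1, 2, 2, 2],
               y  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
               wt = [1, 2, 1, 1, 1, 2])
gdf = groupby(df, :g)

# Weighted quantiles of y within one group; returns a 2-element vector.
wquant(y, wt) = quantile(y, FrequencyWeights(wt), [0.1, 0.7])

# With Ref: one row per group, each cell holds the whole vector of quantiles.
wide = combine(gdf, [:y, :wt] => Ref ∘ wquant => :q)

# Without Ref: the vector is spread over rows, one row per quantile.
long = combine(gdf, [:y, :wt] => wquant => :q)
```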
The following gives all the rows of the original
DataFrame for which the "keys" have the selected values:
df = DataFrame(pk1=rand(1:10, 100), pk2=rand('a':'z', 100), value=rand(100))
gdf = groupby(df, [:pk1, :pk2])
dfSub = gdf[(1, 'a')]  # throws a KeyError if no row has pk1 == 1 and pk2 == 'a'
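With random data the requested key may not exist. A deterministic sketch, assuming a small made-up table, that also guards the lookup with `haskey`:

```julia
using DataFrames

df = DataFrame(pk1   = [1, 1, 2, 2, 1],
               pk2   = ['a', 'b', 'a', 'b', 'a'],
               value = [0.1, 0.2, 0.3, 0.4, 0.5])
gdf = groupby(df, [:pk1, :pk2])

# All rows of df with pk1 == 1 and pk2 == 'a' (rows 1 and 5 here).
dfSub = gdf[(1, 'a')]

# Guard against keys that do not occur in the data.
key = (3, 'z')
maybeSub = haskey(gdf, key) ? gdf[key] : nothing
```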
Multiple combine operations in one go:
weighted_sum(x, m) = sum(x .* m)
combine(gdf2, [[:y, :mass], [:z, :mass]] .=> weighted_sum .=> [:a, :b])
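To make the broadcasted `.=>` form concrete, here is a sketch with made-up data. The grouping column `g` and the values are assumptions; `gdf2`, `weighted_sum`, and the column names follow the line above:

```julia
using DataFrames

df = DataFrame(g    = [1, 1, 2, 2],
               y    = [1.0, 2.0, 3.0, 4.0],
               z    = [10.0, 20.0, 30.0, 40.0],
               mass = [1.0, 3.0, 2.0, 2.0])
gdf2 = groupby(df, :g)

weighted_sum(x, m) = sum(x .* m)

# Broadcasting pairs each input column set with its own output name:
# [:y, :mass] => weighted_sum => :a  and  [:z, :mass] => weighted_sum => :b
res = combine(gdf2, [[:y, :mass], [:z, :mass]] .=> weighted_sum .=> [:a, :b])
```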
xV = [1,2,3,1,2,3]
valV = sort(unique(xV))  # Sorting ensures that categories are sorted correctly.
xV = categorical(xV)
# Note that recode takes Pairs. These are constructed with `map` and then splatted.
xV = recode(xV, map(j -> j => "x$j", valV)...)
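`levelcode` returns the underlying integer code (the index into `levels`) of a categorical value. Broadcast it as `levelcode.(xV)` to get the code for each entry.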
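A short sketch of `recode` and `levelcode` together, assuming CategoricalArrays.jl is loaded and using an unsorted made-up input. The names `cV` and `labelled` are illustrative:

```julia
using CategoricalArrays

xV = [3, 1, 2, 1]
valV = sort(unique(xV))

# Levels of a categorical array built from integers are sorted: 1, 2, 3.
cV = categorical(xV)

# Build the Pairs 1 => "x1", 2 => "x2", 3 => "x3" and splat them into recode.
labelled = recode(cV, map(j -> j => "x$j", valV)...)

# Integer code of each entry, i.e. its position in levels(cV).
codes = levelcode.(cV)
```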
Missings.jl has convenience functions for dealing with missing values.
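A few of these helpers, as a minimal sketch on made-up data (only a subset of what the package offers):

```julia
using Missings

x = [1, missing, 3]

# Replace missing values lazily, then materialize the result.
filled = collect(Missings.replace(x, 0))

# Wrap a function so that it returns missing when given missing.
safeSqrt = passmissing(sqrt)
r1 = safeSqrt(4.0)
r2 = safeSqrt(missing)

# Widen or narrow the element type of an array.
widened  = allowmissing([1, 2])
narrowed = disallowmissing(widened)
```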
Also useful, but more general, is Skipper.jl. Example:
sa = skip(x -> isnan(x) || isinf(x), data)
dataMean = mean(sa)  # Ignores skipped values
sa .* 2  # Ignores skipped values
sa  # Uses original indices, if not skipped.
complement(sa) .= mean(sa)  # Sets skipped values
Statistical Data Files
ReadStatTables.jl seems to be the most up-to-date package for reading Stata files.