# Data Handling

DataFrames tutorials:

Useful packages:

- DataSkimmer.jl produces a summary of tabular data (e.g. `DataFrames`), including histograms.
- `Strapping.jl` converts between `struct`s and tables.
- `SplitApplyCombine.jl` contains data manipulation routines, such as `splitdims` (converting between vectors of vectors and matrices etc.), `group`, `innerjoin`. Similar to what `DataFrames` offers, but for additional data types.
- `InvertedIndices.jl` for selecting when conditions are not true.
- TableTransforms.jl provides data transformations (scaling, quantiles, selecting rows, etc.). The idea is to construct a pipeline that can be applied to any data source that is `Tables.jl` compatible.

## DataFrames

Column names can be given as a `String` or a `Symbol`. Access works with either, but within a grouping expression one needs to pick one or the other.
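A minimal sketch of the two naming styles, using a small made-up `DataFrame` (the column names `a` and `b` are purely illustrative):

```julia
using DataFrames

# Illustrative data; the column names are arbitrary.
df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# Symbols and strings address the same column.
col_sym = df[!, :a]
col_str = df[!, "a"]
same_column = col_sym === col_str

# Within one grouping expression, stick to a single style.
gdf_sym = groupby(df, [:a, :b])
gdf_str = groupby(df, ["a", "b"])
```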

Chaining transformations:

- Use a single `combine` with multiple transformations, as in `combine(df, :a => sum, :b => mean)`.
- Even with `@chain` from `Chain.jl`, multiple `combine` calls in a row do not work: the result of each `combine` is fed into the next step, so columns that the first call did not produce are no longer available. Which makes sense.
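The point can be illustrated with a small sketch. The data and the grouping column `g` are invented for the example:

```julia
using DataFrames, Statistics

# Illustrative data with a grouping column `g`.
df = DataFrame(g = [1, 1, 2, 2], a = [1, 2, 3, 4], b = [10.0, 20.0, 30.0, 40.0])
gdf = groupby(df, :g)

# One combine call, several transformations.
out = combine(gdf, :a => sum, :b => mean)

# A first combine on its own drops `b`, so a second combine
# applied to its result could no longer refer to that column.
step1 = combine(gdf, :a => sum)
b_survives = "b" in names(step1)
```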

Converting to multi-dimensional array:

Deleting columns:

- Using `Not` from `InvertedIndices`: `select!(df, Not(:x1));`

Renaming columns:

`rename!(df, :old => :new)`
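Deleting and renaming can be tried together on a throwaway `DataFrame`. This is a sketch with invented column names; it relies on `DataFrames` re-exporting `Not`, so no separate import of `InvertedIndices` is needed here:

```julia
using DataFrames

# Illustrative columns only.
df = DataFrame(x1 = 1:3, x2 = 4:6, old = 7:9)

# Drop column x1 in place, keeping everything else.
select!(df, Not(:x1))

# Rename in place.
rename!(df, :old => :new)

remaining = names(df)
```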

Vector-valued outputs of a transformation:

- Example: compute grouped quantiles:
  `combine(gdf, [:y, :wt] => Ref ∘ ((y, wt) -> quantile(y, FrequencyWeights(wt), [0.1, 0.7])))`
- Composing `Ref` with the actual transformation wraps the result so that the whole vector of quantiles is stored in a single cell. Without it, `combine` would spread the vector over rows, producing one row for each quantile.
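The effect of `Ref` is easiest to see by comparing row counts. The sketch below uses unweighted quantiles from the `Statistics` standard library as a stand-in for the weighted version, purely to keep dependencies small; the data and the helper `q` are hypothetical:

```julia
using DataFrames, Statistics

# Two groups of three observations each (made-up data).
df = DataFrame(g = [1, 1, 1, 2, 2, 2], y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
gdf = groupby(df, :g)

# Hypothetical helper returning a vector of two quantiles.
q(y) = quantile(y, [0.1, 0.7])

# Without Ref: the vector is spread over rows (two rows per group).
one_row_per_quantile = combine(gdf, :y => q => :q)

# With Ref: the vector is kept as a single cell (one row per group).
one_row_per_group = combine(gdf, :y => (Ref ∘ q) => :q)
```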

### Grouping

The following gives all the rows of the original `DataFrame` for which the "keys" have the selected values:

```
using DataFrames

df = DataFrame(pk1=rand(1:10, 100), pk2=rand('a':'z', 100), value=rand(100));
gdf = groupby(df, [:pk1, :pk2]);
# Throws a KeyError if no row has pk1 == 1 and pk2 == 'a'.
dfSub = gdf[(1, 'a')]
```
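With random data the requested key may not exist, so a deterministic variant can be easier to experiment with. This is a sketch with invented values; `haskey` is used to check for a group before indexing:

```julia
using DataFrames

# Small deterministic table (illustrative values).
df = DataFrame(pk1 = [1, 1, 2, 2], pk2 = ['a', 'b', 'a', 'a'], value = [0.1, 0.2, 0.3, 0.4])
gdf = groupby(df, [:pk1, :pk2])

# All rows with pk1 == 2 and pk2 == 'a'.
dfSub = gdf[(2, 'a')]

# Check for a key combination before indexing.
has_missing_key = haskey(gdf, (1, 'z'))
```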

Multiple `combine` operations in one go:

```
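# gdf2 is assumed to be a GroupedDataFrame with columns :y, :z, and :mass.
weighted_sum(x, m) = sum(x .* m);
# Broadcasting `.=>` pairs each list of input columns with its output name.
combine(gdf2, [[:y, :mass], [:z, :mass]] .=> weighted_sum .=> [:a, :b])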
```
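A self-contained version of the same pattern, assuming a hypothetical table with a grouping column `g`:

```julia
using DataFrames

# Made-up data; `g` is the grouping column.
df2 = DataFrame(g = [1, 1, 2],
                y = [1.0, 2.0, 3.0],
                z = [10.0, 20.0, 30.0],
                mass = [1.0, 2.0, 3.0])
gdf2 = groupby(df2, :g)

weighted_sum(x, m) = sum(x .* m)

# Expands to [:y, :mass] => weighted_sum => :a and [:z, :mass] => weighted_sum => :b.
out = combine(gdf2, [[:y, :mass], [:z, :mass]] .=> weighted_sum .=> [:a, :b])
```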

## Categorical Data

Assigning levels:

```
using CategoricalArrays

xV = [1, 2, 3, 1, 2, 3];
# Sorting ensures that categories are sorted correctly.
valV = sort(unique(xV));
xV = categorical(xV);
# Note that recode takes Pairs. These are constructed with `map` and then splatted.
xV = recode(xV, map(j -> j => "x$j", valV)...);
```

`levelcode` returns the underlying integer code of a categorical value. Broadcast it, as in `levelcode.(xV)`, to get the code for each entry.
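A short sketch of how the codes follow the level order. The labels are invented, and the levels are set explicitly with `levels!` so that the order is not left to the default sorting:

```julia
using CategoricalArrays

# Illustrative labels.
xV = categorical(["lo", "hi", "mid", "lo"])

# Fix the level order explicitly.
levels!(xV, ["lo", "mid", "hi"])

# Integer position of each entry's level within levels(xV).
codes = levelcode.(xV)
```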

## Missing Values

`Missings.jl` has convenience functions for dealing with missing values.
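As a rough sketch of what this looks like in practice: `skipmissing` and `coalesce` come from Base, while `passmissing` and `disallowmissing` are from `Missings.jl`. The data and the name `safe_sqrt` are made up for illustration:

```julia
using Missings, Statistics

x = [1.0, missing, 3.0]

# Base: ignore or replace missing values.
m = mean(skipmissing(x))
filled = coalesce.(x, 0.0)

# Missings.jl: wrap a function so that missing inputs propagate.
safe_sqrt = passmissing(sqrt)
y = safe_sqrt.(x)

# Missings.jl: narrow the element type once no missing values remain.
z = disallowmissing(x[[1, 3]])
```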

Also useful, but more general is Skipper.jl. Example:

```
using Skipper, Statistics

data = [1.0, 2.0, NaN, Inf, 5.0];  # example data
sa = skip(x -> isnan(x) || isinf(x), data);
dataMean = mean(sa); # Ignores skipped values
sa .* 2; # Ignores skipped values
sa[2]; # Uses original indices, if not skipped.
complement(sa) .= mean(sa); # Sets skipped values
```

## Statistical Data Files

ReadStatTables.jl seems to be the most up-to-date package for reading Stata files.