Channel: G-Forge

The torch-dataframe – subsetting and sampling

Subsetting and batching is like dealing cards – should be random unless you are doing a trick. The image is cc from Steven Depolo.

In my previous two posts I covered the most basic data manipulation that you may need. In this post I’ll try to give a quick introduction to some of the sampling methods that we can use in our machine learning projects.

All posts in the torch-dataframe series

  1. Intro to the torch-dataframe
  2. Modifications
  3. Subsetting
  4. The mnist example
  5. Multilabel classification

First we start by loading the mtcars dataset the same way as we have previously:

require 'Dataframe'
mtcars_df = Dataframe("mtcars.csv"):
  rename_column("", "rownames"):
  drop(Df_Array("cyl", "disp", "drat", "vs", "carb"))

Splitting the data

A common strategy is to split our dataset into three sets

  • train: the examples that we will train our model on
  • validate: the examples that we will use to check how our model is doing
  • test: the examples that we “lock away into a vault” and only look at once we have decided and trained our final model.

The split proportions can differ depending on application and dataset. The default proportions used in the torch-dataframe package are 70%, 20% and 10%. The split is achieved via the create_subsets function:

mtcars_df:create_subsets(Df_Dict{train=5, validate=3, test=2})

If you provide custom split proportions that don’t sum to 1, the function automatically normalizes the values and prints a warning: Warning: You have provided a total ~= 1 (10). In the example above, the proportions 5, 3 and 2 are thus normalized to 50%, 30% and 20%.
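Given that the proportions are normalized anyway, you can equivalently pass fractions that already sum to 1, which should avoid the warning (same 50/30/20 split as above, just pre-normalized):

```lua
-- Same split as Df_Dict{train=5, validate=3, test=2}, expressed
-- as fractions summing to 1 so no normalization warning is printed
mtcars_df:create_subsets(Df_Dict{train=0.5, validate=0.3, test=0.2})
```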

Now you have three subsets in your data that you can access via the get_subset method or just via ["/subset_name"]:

th> mtcars_df["/train"]:size()
16
                                                                      [0.0001s] 

Note: The current implementation (v. 1.5) of torch-dataframe stores a subset as a shallow wrapper around the parent data, containing only a list of indexes. If you print a subset you will see the indexes from the original dataset that are included in that particular subset:

th> mtcars_df["/test"]

+---------+
| indexes |
+---------+
|      13 |
|       3 |
|      20 |
|      19 |
|      18 |
|      27 |
|      30 |
+---------+

                                                                      [0.0009s] 

Samplers

Many machine learning procedures follow the same steps:

  1. split your data
  2. sample from the training subset a random batch
  3. perform a calculation on that batch and update your parameters accordingly
  4. restart from 2.

The torch-dataframe therefore tries to make this entire process as painless as possible. We have also extended torch-dataset’s excellent samplers, which allow you to sample using the following approaches:

  • linear: Does a linear walk through the data. Note that the subsetting already applies a random permutation to your data unless you have only one subset.
  • ordered: Sorts the indexes and then does a linear walk through the data.
  • permutation: Reorganizes the order (permutes) and then walks through the rows. After each epoch you must reset the sampler, and the reset creates a new permutation.
  • uniform: Samples uniformly from the data. This means that within one epoch the same example may occur several times while others won’t appear at all.
  • label permutation: Permutes the data according to labels. This means that we can make sure that the training examples are evenly distributed between labels.
  • label uniform: Samples uniformly but according to labels.
  • label distribution: Samples according to specific distributions.

You choose your samplers either during the create_subsets call using the sampler argument, or you can set them later for each subset using the set_sampler function. Here is an example with a sampler that also requires you to set the labels:

th> mtcars_df["/train"]:set_labels("gear"):set_sampler("label-permutation")
th> mtcars_df["/train"]:get_batch(4)

+-------------------------------------------------------------+
| rownames      |  mpg |  hp |   wt | qsec | am        | gear |
+-------------------------------------------------------------+
| Merc 240D     | 24.4 |  62 | 3.19 |   20 | Automatic |    4 |
| Merc 280C     | 17.8 | 123 | 3.44 | 18.9 | Automatic |    4 |
| Merc 450SLC   | 15.2 | 180 | 3.78 |   18 | Automatic |    3 |
| Maserati Bora |   15 | 335 | 3.57 | 14.6 | Manual    |    5 |
+-------------------------------------------------------------+

false   
                                                                      [0.0063s] 

Using a sampler is done by calling get_batch. The second value returned from get_batch indicates whether reset_sampler should be invoked. Note that this is only required for some of the samplers; most will always return false.
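Putting this together, the four steps listed earlier can be sketched as an epoch loop. This is a rough sketch, assuming the permutation sampler signals the end of an epoch via the second return value of get_batch; update_model is a hypothetical placeholder for your own forward/backward pass and parameter update:

```lua
-- Sketch of the split/sample/update/restart loop from the steps above
local train = mtcars_df["/train"]
train:set_sampler("permutation")

for epoch = 1,30 do
  local reset = false
  while not reset do
    local batch
    batch, reset = train:get_batch(4)
    if batch == nil then break end -- guard in case no batch is returned
    local data, labels = batch:to_tensor{
      data_columns = Df_Array("mpg", "hp"),
      label_columns = Df_Array("gear")
    }
    update_model(data, labels) -- hypothetical training step
  end
  -- The permutation sampler requires a reset after each epoch,
  -- which also creates a new permutation
  train:reset_sampler()
end
```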

Batch to tensor

One of the core functions is the ability to export data into tensors that can be used for deep learning. This is done via the to_tensor function, which converts the numerical columns into a tensor of size self:size() x #self:get_numerical_colnames(). As we frequently have some input data that we want to map onto a set of labels/targets, the Batchframe subclass extends the to_tensor function. There are several options, where the most common is probably to load the data from an external file and match it with one or more columns within the dataframe. Below is an example where both the data and the labels reside in the dataframe:

th> mtcars_df:as_categorical("am"):head(2)

+-----------------------------------------------------------+
| rownames      | mpg |  hp |    wt |  qsec |     am | gear |
+-----------------------------------------------------------+
| Mazda RX4     |  21 | 110 |  2.62 | 16.46 | Manual |    4 |
| Mazda RX4 Wag |  21 | 110 | 2.875 | 17.02 | Manual |    4 |
+-----------------------------------------------------------+

                                                                      [0.0026s] 
th> mtcars_df["/train"]:
 get_batch(3):
 to_tensor{data_columns = Df_Array("mpg", "hp"), 
           label_columns = Df_Array("am", "gear")}
  24.4000   62.0000
  16.4000  180.0000
  19.7000  175.0000
[torch.DoubleTensor of size 3x2]

 1  4
 1  3
 2  5
[torch.DoubleTensor of size 3x2]

{
  1 : "mpg"
  2 : "hp"
}
                                                                      [0.0032s] 

You can substitute data_columns with load_data_fn, or label_columns with load_label_fn. Each function receives a single row as a plain Lua table, and any information in that row can be used for generating a tensor, e.g. the filename of an image. As loading files is time-consuming, I often like to do this in parallel.

A convenient way is to set the batch_args argument when creating the subsets, where you can specify the data/label retrieval strategies:

data:create_subsets{
  data_retriever = function(row) return load_img(row.filename) end,
  label_retriever = Df_Array("image_class")
}
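The load_img above is a placeholder for whatever loading routine you use. A minimal sketch of such a helper, assuming the images are RGB files readable by torch’s image package, could look like:

```lua
-- Hypothetical image loader used as a data retriever: reads the file
-- referenced by the row and returns it as a 3-channel double tensor
require 'image'

local function load_img(filename)
  return image.load(filename, 3, 'double')
end
```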

Summary

In this post we’ve reviewed some of the core functions for setting up the dataframe for machine learning applications such as data-splitting, subsetting and converting the data into torch-friendly tensors.

