Channel: G-Forge

The torch-dataframe – subsetting and sampling

Subsetting and batching is like dealing cards – should be random unless you are doing a trick. The image is cc from Steven Depolo.

In my previous two posts I covered the most basic data manipulation that you may need. In this post I’ll try to give a quick introduction to some of the sampling methods that we can use in our machine learning projects.

All posts in the torch-dataframe series

  1. Intro to the torch-dataframe
  2. Modifications
  3. Subsetting
  4. The mnist example
  5. Multilabel classification

First we start by loading the mtcars dataset the same way as we have previously:

require 'Dataframe'
mtcars_df = Dataframe("mtcars.csv"):
  rename_column("", "rownames"):
  drop(Df_Array("cyl", "disp", "drat", "vs", "carb"))

Splitting the data

A common strategy is to split our dataset into three sets

  • train: the examples that we will train our model on
  • validate: the examples that we will use to check how our model is doing
  • test: the examples that we “lock away into a vault” and only look at once we have decided and trained our final model.

The split proportions can differ depending on application and dataset. The default proportions used in the torch-dataframe package are 70%, 20% and 10%. The split is achieved via the create_subsets function:

mtcars_df:create_subsets(Df_Dict{train=5, validate=3, test=2})

If you provide custom split proportions that don’t sum to 1, the function automatically normalizes the values and prints a warning: Warning: You have provided a total ~= 1 (10). In the example above, the proportions 5, 3 and 2 are thus normalized to 50%, 30% and 20%.
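Given that the proportions are normalized anyway, you can equivalently pass fractions that already sum to 1, which should avoid the warning (same 50/30/20 split as above, just pre-normalized):

```lua
-- Same split as Df_Dict{train=5, validate=3, test=2}, expressed
-- as fractions summing to 1 so no normalization warning is printed
mtcars_df:create_subsets(Df_Dict{train=0.5, validate=0.3, test=0.2})
```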

Now you have three subsets in your data that you can access via the get_subset method or just via ["/subset_name"]:

th> mtcars_df["/train"]:size()
16
                                                                      [0.0001s] 

Note: The current implementation (v. 1.5) of torch-dataframe stores a subset as a shallow wrapper around the parent data, containing only a list of indexes. If you print a subset you will see the indexes from the original dataset that are included in that particular subset:

th> mtcars_df["/test"]

+---------+
| indexes |
+---------+
|      13 |
|       3 |
|      20 |
|      19 |
|      18 |
|      27 |
|      30 |
+---------+

                                                                      [0.0009s] 

Samplers

Many machine learning procedures follow the same steps:

  1. split your data
  2. sample from the training subset a random batch
  3. perform a calculation on that batch and update your parameters accordingly
  4. restart from 2.

The torch-dataframe therefore tries to make this entire process as painless as possible. We have also extended torch-dataset’s excellent samplers, which allow you to sample using the following approaches:

  • linear: Does a linear walk through the data. Note that the subsetting already applies a random permutation to your data unless you have only one subset.
  • ordered: Sorts the indexes and then does a linear walk through the data.
  • permutation: Reorganizes the order (permutes) and then walks through the rows. After each epoch you must reset the sampler, and the reset creates a new permutation.
  • uniform: Samples uniformly from the data. This means that within one epoch the same example may occur several times while others won’t appear at all.
  • label permutation: Permutes the data according to labels. This means that we can make sure that the training examples are evenly distributed between labels.
  • label uniform: Samples uniformly but according to labels.
  • label distribution: Samples according to specific distributions.

You choose your samplers either during the create_subsets call using the sampler argument, or you can set them later for each subset using the set_sampler function. Here is an example with a sampler that also requires you to set the labels:

th> mtcars_df["/train"]:set_labels("gear"):set_sampler("label-permutation")
th> mtcars_df["/train"]:get_batch(4)

+-------------------------------------------------------------+
| rownames      |  mpg |  hp |   wt | qsec | am        | gear |
+-------------------------------------------------------------+
| Merc 240D     | 24.4 |  62 | 3.19 |   20 | Automatic |    4 |
| Merc 280C     | 17.8 | 123 | 3.44 | 18.9 | Automatic |    4 |
| Merc 450SLC   | 15.2 | 180 | 3.78 |   18 | Automatic |    3 |
| Maserati Bora |   15 | 335 | 3.57 | 14.6 | Manual    |    5 |
+-------------------------------------------------------------+

false   
                                                                      [0.0063s] 

Using a sampler is done by calling get_batch. The second value returned from get_batch indicates whether reset_sampler should be invoked. Note that this is only required for some of the samplers; most will always return false.
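Putting this together, the four steps listed earlier can be sketched as an epoch loop. This is a rough sketch, assuming the permutation sampler signals the end of an epoch via the second return value of get_batch; update_model is a hypothetical placeholder for your own forward/backward pass and parameter update:

```lua
-- Sketch of the split/sample/update/restart loop from the steps above
local train = mtcars_df["/train"]
train:set_sampler("permutation")

for epoch = 1,30 do
  local reset = false
  while not reset do
    local batch
    batch, reset = train:get_batch(4)
    if batch == nil then break end -- guard in case no batch is returned
    local data, labels = batch:to_tensor{
      data_columns = Df_Array("mpg", "hp"),
      label_columns = Df_Array("gear")
    }
    update_model(data, labels) -- hypothetical training step
  end
  -- The permutation sampler requires a reset after each epoch,
  -- which also creates a new permutation
  train:reset_sampler()
end
```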

Batch to tensor

One of the core functions is the ability to export data into tensors that can be used for deep learning. This is done via the to_tensor function, which converts the numerical columns into a tensor of size self:size() x #self:get_numerical_colnames(). As we frequently have some input data that we want to map onto a set of labels/targets, the Batchframe subclass extends the to_tensor function. There are several options, where the most common is probably to load the data from an external file and match it with one or more columns within the dataframe. Below is an example where both the data and the labels reside in the dataframe:

th> mtcars_df:as_categorical("am"):head(2)

+-----------------------------------------------------------+
| rownames      | mpg |  hp |    wt |  qsec |     am | gear |
+-----------------------------------------------------------+
| Mazda RX4     |  21 | 110 |  2.62 | 16.46 | Manual |    4 |
| Mazda RX4 Wag |  21 | 110 | 2.875 | 17.02 | Manual |    4 |
+-----------------------------------------------------------+

                                                                      [0.0026s] 
th> mtcars_df["/train"]:
 get_batch(3):
 to_tensor{data_columns = Df_Array("mpg", "hp"), 
           label_columns = Df_Array("am", "gear")}
  24.4000   62.0000
  16.4000  180.0000
  19.7000  175.0000
[torch.DoubleTensor of size 3x2]

 1  4
 1  3
 2  5
[torch.DoubleTensor of size 3x2]

{
  1 : "mpg"
  2 : "hp"
}
                                                                      [0.0032s] 

You can substitute data_columns with load_data_fn, or label_columns with load_label_fn. Each function receives a single row as a plain Lua table, and any information in that row can be used for generating a tensor, e.g. the filename of an image. As loading files is time-consuming, I often like to do this in parallel.

A convenient way is to set the batch_args argument when creating the subsets, where you can specify the data/label retrieval strategies:

data:create_subsets{
  data_retriever = function(row) return load_img(row.filename) end,
  label_retriever = Df_Array("image_class")
}
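The load_img above is a placeholder for whatever loading routine you use. A minimal sketch of such a helper, assuming the images are RGB files readable by torch’s image package, could look like:

```lua
-- Hypothetical image loader used as a data retriever: reads the file
-- referenced by the row and returns it as a 3-channel double tensor
require 'image'

local function load_img(filename)
  return image.load(filename, 3, 'double')
end
```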

Summary

In this post we’ve reviewed some of the core functions for setting up the dataframe for machine learning applications such as data-splitting, subsetting and converting the data into torch-friendly tensors.

