DropoutWrapper: The three keep probabilities (input, output, and state) don't seem to have a critical effect.
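A minimal sketch (TF 1.x) of the three keep probabilities on DropoutWrapper; the cell type, size, and keep values below are illustrative, not this project's settings:

    import tensorflow as tf

    cell = tf.nn.rnn_cell.LSTMCell(num_units=128)
    cell = tf.nn.rnn_cell.DropoutWrapper(
        cell,
        input_keep_prob=0.9,    # dropout on the cell input
        output_keep_prob=0.9,   # dropout on the cell output
        state_keep_prob=0.9)    # dropout on the recurrent state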
Batch data selection: This is critical. For some training data, a network trained with sequential (continuous) batch selection cannot learn, while the same network with random batch selection can.
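A sketch of the two selection strategies, assuming the training data is an array of sequences (names and shapes are assumptions):

    import numpy as np

    def random_batches(sequences, batch_size):
        # random selection: reshuffle the sequence order every epoch
        order = np.random.permutation(len(sequences))
        for start in range(0, len(order) - batch_size + 1, batch_size):
            yield sequences[order[start:start + batch_size]]

    def sequential_batches(sequences, batch_size):
        # continuous selection: batches walk through the data in order
        for start in range(0, len(sequences) - batch_size + 1, batch_size):
            yield sequences[start:start + batch_size]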
Batch size: A network with batch size 100 learns much faster than one with batch size 1, even when learning time is measured in the number of sequences seen. The critical problem with a small batch size is that the loss is unstable: it fluctuates, and the fluctuations are large.
Optimizer and Initializer: Use proper initializers and optimizer parameters; training is very sensitive to both.
RMSPropOptimizer: Use tf.random_normal_initializer(mean=0.0, stddev=0.01) for the fully connected layer after the cell; the default initializer does not seem to work well. Using decay = 0.9 instead of 1.0 makes learning about 10 times faster, with learningRate = 1e-4. RMSPropOptimizer is more sensitive than GradientDescentOptimizer.
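A sketch of this setup, assuming a softmax classification head on the RNN output (placeholder shapes, num_classes, and variable names are illustrative):

    import tensorflow as tf

    num_classes = 10
    cell_output = tf.placeholder(tf.float32, [None, 128])   # last RNN output
    labels = tf.placeholder(tf.int32, [None])

    # fully connected layer after the cell, with the small random-normal initializer
    logits = tf.layers.dense(
        cell_output, num_classes,
        kernel_initializer=tf.random_normal_initializer(mean=0.0, stddev=0.01))

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

    # learning_rate=1e-4 and decay=0.9, as noted above
    optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-4, decay=0.9)
    train_op = optimizer.minimize(loss)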
GradientDescentOptimizer: The learning rate is decayed with a per-epoch decay factor:

    nEpochs = 10000
    learningRate = 1.0
    lrDecay = 10 ** (-3.75 / nEpochs)  # so that lrDecay ** (nEpochs - nEpochs_lr) ~ 0.001
    nEpochs_lr = nEpochs // 5

At every epoch i:

    lrDecay = config.lrDecay ** max(i + 1 - config.nEpochs_lr, 0.0)
    m.assign_lr(sess, config.learningRate * lrDecay)
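The model's assign_lr is not shown here; a minimal sketch of how such a method could look (following the TF 1.x PTB tutorial pattern, so this is an assumption, not this project's actual code):

    import tensorflow as tf

    class Model(object):
        def __init__(self):
            # non-trainable variable holding the current learning rate
            self._lr = tf.Variable(1.0, trainable=False, name="learning_rate")
            self._new_lr = tf.placeholder(tf.float32, shape=[], name="new_learning_rate")
            self._lr_update = tf.assign(self._lr, self._new_lr)
            # ... build the rest of the graph; the optimizer reads self._lr, e.g.
            # optimizer = tf.train.GradientDescentOptimizer(self._lr)

        def assign_lr(self, sess, lr_value):
            # push the decayed learning rate into the graph once per epoch
            sess.run(self._lr_update, feed_dict={self._new_lr: lr_value})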
Queue: Queue size (capacity) and the number of threads don't have much effect; 1 or 2 is enough for both.
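A sketch of a small batch queue (TF 1.x) matching these settings; the shapes and the random-tensor stand-in for real data are illustrative:

    import tensorflow as tf

    batch_example = tf.random_normal([100, 300, 50])   # batchSz x seqSz x featureSz
    queue = tf.FIFOQueue(capacity=2, dtypes=[tf.float32], shapes=[[100, 300, 50]])
    enqueue_op = queue.enqueue(batch_example)
    tf.train.add_queue_runner(tf.train.QueueRunner(queue, [enqueue_op] * 2))  # 2 threads
    next_batch = queue.dequeue()

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        print(sess.run(next_batch).shape)              # (100, 300, 50)
        coord.request_stop()
        coord.join(threads)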
Training time: Training on CPU only is just about twice as slow as on an Nvidia GeForce 1080 Ti (batchSz=100, seqSz=300, timestepSz=30). GPU utilization is only about 60% during training. This is a very strange problem.