A Summary of Preprocessing for Training on ImageNet

Posted On 2020-05-24

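Searching turns up plenty about preprocessing for inference with ImageNet-pretrained models, but almost nothing about preprocessing when training a model, especially how to handle training images whose aspect ratios vary. This post summarizes the preprocessing used in the papers and official implementations of AlexNet, ResNet, and Inception.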

Introduction

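ImageNet's training images vary widely in resolution and aspect ratio. Image-classification CNNs, on the other hand, generally take a fixed input size (224, 299, and so on). Let's look at the papers and code to see how the aspect ratio of input images is handled at training time and at test time.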

AlexNet

AlexNet, which brought a major breakthrough to CNN-based image classification, describes its preprocessing very clearly. From the paper (p.2):

ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image. We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the (centered) raw RGB values of the pixels.

The key point is that the image is rescaled so that its shorter side is 256, and the central 256×256 patch is then cropped from the result.
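
As a rough sketch of the geometry described above, the following computes the resized dimensions and the center-crop box using only the standard library. The function names `resize_shorter_side` and `center_crop_box` are my own, and rounding the longer side to the nearest integer is an assumption (the paper does not say how it is rounded). The actual pixel resampling would be done with whatever image library you use.

```python
def resize_shorter_side(width, height, target=256):
    """Return (new_width, new_height) with the shorter side equal to `target`.

    The aspect ratio is preserved; rounding to the nearest integer is an assumption.
    """
    scale = target / min(width, height)
    return max(target, round(width * scale)), max(target, round(height * scale))


def center_crop_box(width, height, crop_w, crop_h):
    """Return (left, top, right, bottom) of a centered crop (right/bottom exclusive)."""
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


# Example: a 500x375 landscape image.
new_w, new_h = resize_shorter_side(500, 375)
box = center_crop_box(new_w, new_h, 256, 256)
print((new_w, new_h), box)  # (341, 256) (42, 0, 298, 256)
```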

However, the input resolution of the AlexNet network is 224×224. The paper addresses this too (p.5):

The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256×256 images and training our network on these extracted patches⁴. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.

Footnote 4 adds: "This is the reason why the input images in Figure 2 are 224 × 224 × 3-dimensional."

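What this means: starting from the 256×256 cropped image, enumerate 224×224 patches while shifting one pixel at a time (e.g. (0, 0, 224, 224), (0, 1, 224, 225), ..., (31, 31, 255, 255)), which gives 32×32 = 1024 patches. Applying a horizontal flip to each of these 1024 patches yields 2048 samples.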
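
The count is easy to check by enumerating the boxes. This is only a sketch of the arithmetic, with boxes written as (left, top, right, bottom) and offsets running from 0 to 31 as in the example above. One caveat: with exclusive right and bottom edges, offset 32 also fits inside a 256×256 image, so strictly 33 offsets per axis are valid. The factor of 2048 corresponds to counting 32 offsets per axis.

```python
image_size, patch_size = 256, 224

# Offsets 0..31 along each axis, matching the enumeration described in the text.
boxes = [
    (x, y, x + patch_size, y + patch_size)
    for y in range(image_size - patch_size)
    for x in range(image_size - patch_size)
]
num_with_flips = len(boxes) * 2  # each patch plus its horizontal reflection

# Number of offsets per axis that keep the patch fully inside the image.
num_valid_offsets = image_size - patch_size + 1

print(len(boxes), num_with_flips)   # 1024 2048
print(num_valid_offsets ** 2 * 2)   # 2178
```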

From a modern point of view, though, there is no need to enumerate them all: it is enough to apply Random Crop and Horizontal Flip to the 256×256 center-cropped image. That covers the training data.

Test time is also covered on p.5:

At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.

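So at test time it is enough to apply Ten Crop to the 256×256 image (Ten Crop is a common name for the four corner crops plus the center crop, each with and without a horizontal flip).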

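To summarize, the AlexNet preprocessing is as follows.

Common to training and testing: resize so that the shorter side is 256, then crop the center of the image to 256×256 (Center Crop)
Training:
1. Random Crop with a 224×224 output
2. Random Horizontal Flip
Testing:
1. Crop 224×224 patches at five locations (the four corners and the center) and horizontally flip each one to make 10 patches (Ten Crop)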
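
The test-time side can be sketched as below: list the ten (box, flip) combinations and average the per-patch softmax outputs. The function names are hypothetical, and the model itself is left out; `average_predictions` just takes one list of class probabilities per patch.

```python
def five_crop_boxes(width, height, crop=224):
    """Four corner boxes plus the center box, as (left, top, right, bottom)."""
    max_x, max_y = width - crop, height - crop
    origins = [(0, 0), (max_x, 0), (0, max_y), (max_x, max_y),
               (max_x // 2, max_y // 2)]
    return [(x, y, x + crop, y + crop) for x, y in origins]


def ten_crop_specs(width, height, crop=224):
    """Each of the five boxes, once unflipped and once horizontally flipped."""
    return [(box, flip)
            for flip in (False, True)
            for box in five_crop_boxes(width, height, crop)]


def average_predictions(probabilities):
    """Element-wise mean of per-patch softmax outputs (one list per patch)."""
    n = len(probabilities)
    return [sum(column) / n for column in zip(*probabilities)]


specs = ten_crop_specs(256, 256)
print(len(specs))  # 10
print(specs[4])    # ((16, 16, 240, 240), False)
```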

ResNet

ResNet's preprocessing is basically the same as AlexNet's, with a few modifications. From the paper, for training (p.4):

The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [40]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used.

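Color augmentation is a matter of taste, so I will ignore it here. Where AlexNet fixed the shorter side at 256, ResNet samples it randomly. Also, AlexNet's two crops (Center Crop, then Random Crop) become one: after resizing, you simply take a Random Crop. ResNet's input size is also 224×224. For testing: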

In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [40, 12], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).

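The comparison studies use the same Ten Crop as AlexNet, but for the best results the authors apply TTA (Test-Time Augmentation) on top (Ten Crop is itself a form of TTA). My reading is that this TTA varies the shorter side over {224, 256, 384, 480, 640} and takes 224×224 Ten Crops at each scale, on the assumption that changing the network's input resolution would not help. Note, however, that the quote says "fully-convolutional form", which suggests the network may instead be applied to the whole resized image, so treat this reading as a guess.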

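To summarize, the preprocessing in the ResNet paper is as follows.

Training: resize so that the shorter side is between 256 and 480 → Random Horizontal Flip → Random Crop (Flip and Crop can be swapped without harm)
Testing: as with AlexNet, resize the shorter side to 256 → Ten Crop. Running TTA with the shorter side set to each of 224, 256, 384, 480, and 640 improves results further.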
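
Here is a minimal sketch of the training-side sampling, again computing only sizes and boxes with the standard library. The name `sample_resnet_train_params` is hypothetical, and several details are assumptions: the shorter side is drawn as a uniform integer, the longer side is rounded to the nearest integer, and the flip probability is 0.5. The paper only says the side is "randomly sampled in [256, 480]".

```python
import random


def sample_resnet_train_params(width, height, crop=224,
                               min_side=256, max_side=480, rng=random):
    """Sample (resized_size, crop_box, flip) for one training image.

    resized_size is (new_width, new_height); crop_box is (left, top, right, bottom)
    in the coordinates of the resized image.
    """
    # Scale augmentation: pick the shorter side after resizing (inclusive range).
    side = rng.randint(min_side, max_side)
    scale = side / min(width, height)
    new_w = max(side, round(width * scale))
    new_h = max(side, round(height * scale))

    # Random 224x224 crop anywhere inside the resized image.
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)

    # Horizontal flip with probability 0.5 (assumed).
    flip = rng.random() < 0.5
    return (new_w, new_h), (left, top, left + crop, top + crop), flip


size, box, flip = sample_resnet_train_params(500, 375, rng=random.Random(0))
print(size, box, flip)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is handy when debugging an input pipeline.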

The official TensorFlow implementation of ResNet

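This is the official TensorFlow implementation for training ResNet on ImageNet. Plenty of material covers preprocessing for pretrained models, but very little covers preprocessing for training, so it took some effort to find.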

https://github.com/tensorflow/models/blob/09d3c74a31d7e0c1742ae65025c249609b3c9d81/official/r1/resnet/imagenet_preprocessing.py#L228

def preprocess_image(image_buffer, bbox, output_height, output_width,
                     num_channels, is_training=False):
  """Preprocesses the given image.
  Preprocessing includes decoding, cropping, and resizing for both training
  and eval images. Training preprocessing, however, introduces some random
  distortion of the image to improve accuracy.
  Args:
    image_buffer: scalar string Tensor representing the raw JPEG image buffer.
    bbox: 3-D float Tensor of bounding boxes arranged [1, num_boxes, coords]
      where each coordinate is [0, 1) and the coordinates are arranged as
      [ymin, xmin, ymax, xmax].
    output_height: The height of the image after preprocessing.
    output_width: The width of the image after preprocessing.
    num_channels: Integer depth of the image buffer for decoding.
    is_training: `True` if we're preprocessing the image for training and
      `False` otherwise.
  Returns:
    A preprocessed image.
  """
  if is_training:
    # For training, we want to randomize some of the distortions.
    image = _decode_crop_and_flip(image_buffer, bbox, num_channels)
    image = _resize_image(image, output_height, output_width)
  else:
    # For validation, we want to decode, resize, then just crop the middle.
    image = tf.image.decode_jpeg(image_buffer, channels=num_channels)
    image = _aspect_preserving_resize(image, _RESIZE_MIN)
    image = _central_crop(image, output_height, output_width)

  image.set_shape([output_height, output_width, num_channels])

  return _mean_image_subtraction(image, _CHANNEL_MEANS, num_channels)

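Setting aside the question of why a classification model takes a bounding box, training does Random Crop + Random Flip → resize, while evaluation resizes with the aspect ratio preserved (based on the shorter side) and then takes a Center Crop.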

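Center Crop is one of the ten Ten Crop patches, so this function does no TTA. Even Ten Crop multiplies the inference cost by 10, so when you want evaluation to finish quickly, as with validation during training, substituting a Center Crop like this is a reasonable choice.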

Inception

Inception uses somewhat unusual preprocessing. From the paper (p.8):

Our image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, like dropout and learning rate, so it is hard to give a definitive guidance to the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3. Also, we found that the photometric distortions by Andrew Howard [8] were useful to combat overfitting to some extent. In addition, we started to use random interpolation methods (bilinear, area, nearest neighbor and cubic, with equal probability) for resizing relatively late and in conjunction with other hyperparameter changes, so we could not tell definitely whether the final results were affected positively by their use.

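You can see the trial and error that went into this; the distinctive part is the effort put into Random Crop. The input resolution of Inception in this paper is 224 (there are several versions, though, and later ones changed the resolution).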

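Inception's training-time preprocessing goes as follows.

1. Choose the crop area as a uniform random fraction between 8% and 100% of the image area
2. Choose the aspect ratio at random between 3/4 and 4/3 in the same way
3. Derive the crop height and width from these (if the crop would not fit inside the image, sample again)
4. If it fits, take a Random Crop of that size
5. Resize the crop to the input resolution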
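
The steps above can be sketched roughly as follows. `sample_inception_crop` is a hypothetical name, and a few things are my assumptions rather than statements from the paper: the aspect ratio is drawn uniformly, the number of retries is capped at `max_attempts`, and the whole image is used as a fallback if every attempt fails. The returned box would then be cropped and resized to the input resolution with an image library.

```python
import math
import random


def sample_inception_crop(width, height, area_range=(0.08, 1.0),
                          ratio_range=(3 / 4, 4 / 3), max_attempts=10, rng=random):
    """Return (left, top, right, bottom) of a random crop inside the image."""
    for _ in range(max_attempts):
        # Step 1: crop area as a fraction of the full image area.
        target_area = rng.uniform(*area_range) * width * height
        # Step 2: aspect ratio (crop width / crop height).
        aspect = rng.uniform(*ratio_range)
        # Step 3: width and height that give this area and aspect ratio.
        crop_w = round(math.sqrt(target_area * aspect))
        crop_h = round(math.sqrt(target_area / aspect))
        # Step 4: accept only if the crop fits, then place it at random.
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, left + crop_w, top + crop_h
    # Fallback (an assumption, not from the paper): use the whole image.
    return 0, 0, width, height


print(sample_inception_crop(500, 375, rng=random.Random(0)))
```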

Horizontal flip is not mentioned, but you can add it if you like. The algorithm is a bit complicated, but it is interesting that it can be applied regardless of the input resolution.

Test-time preprocessing builds on the AlexNet approach (Krizhevsky et al.), but with more aggressive TTA (p.8-9):

During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resize the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions. This results in 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year’s entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).

A TTA that takes "4×3×6×2 = 144 crops" sounds extreme, but it works like this:

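1. Resize the image so that the shorter side is 256, 288, 320, and 352. This gives 4 rectangles.
2. From each rectangle in step 1, take the left, center, and right squares whose side equals the shorter side. For example, if the shorter side is vertical with length S and the longer side is horizontal with length L, cut squares of side S starting at xy coordinates (0, 0), ((L-S)/2, 0), and (L-S, 0). This gives 3 squares.
3. From each square in step 2, first resize the whole square to 224×224. Then crop 224×224 patches at the four corners and the center (the same as Ten Crop before flipping). This gives 6 squares of 224×224.
4. For every pattern from step 3, include both a horizontally flipped and an unflipped version. This gives 2 patterns.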

Hence the "4×3×6×2 = 144 crops". It is a little complicated.
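
To make the bookkeeping concrete, here is a sketch that enumerates all the crops as plain tuples. The function name and the tuple layout are my own, and the longer side is rounded to the nearest integer as an assumption. Each entry is (scale, square box in the resized image, either the string "resize" or a 224×224 box inside the square, flip flag).

```python
def inception_test_crops(width, height, scales=(256, 288, 320, 352), crop=224):
    """Enumerate (scale, square_box, crop_spec, flip) for every test-time crop."""
    specs = []
    for s in scales:
        # Step 1: resize so the shorter side is s.
        scale = s / min(width, height)
        long_side = max(s, round(max(width, height) * scale))
        # Step 2: left/center/right squares (top/center/bottom for portrait images).
        for off in (0, (long_side - s) // 2, long_side - s):
            if width >= height:
                square = (off, 0, off + s, s)
            else:
                square = (0, off, s, off + s)
            # Step 3: the whole square resized, plus four corners and the center.
            m = s - crop
            origins = ((0, 0), (m, 0), (0, m), (m, m), (m // 2, m // 2))
            inner = ["resize"] + [(x, y, x + crop, y + crop) for x, y in origins]
            # Step 4: each one unflipped and flipped.
            for spec in inner:
                for flip in (False, True):
                    specs.append((s, square, spec, flip))
    return specs


crops = inception_test_crops(500, 375)
print(len(crops))  # 144
```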

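To summarize:

Training: Random Crop with a randomly varied crop size and aspect ratio → resize
Testing: TTA over "4×3×6×2 = 144" patches

This preprocessing is hard to sum up in one line, so see the description above for details.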

Which one to use

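Honestly, it comes down to preference, and other preprocessing schemes are perfectly fine too. If you are unsure, my own intuition is to follow the official TensorFlow ResNet implementation:

Training: ResNet preprocessing
Testing: ResNet preprocessing with a plain Center Crop instead of Ten Crop (use Ten Crop or more elaborate TTA if you want more accuracy)

I think this is the easiest to understand and to implement. Inception's preprocessing works as well, but unless you have a specific reason, ResNet's preprocessing looks like a safe default. ~~ResNet is so famous that not knowing it marks you as an outsider, so if anyone objects you can confidently answer "that is what the ResNet paper says"~~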

Tags: preprocessing, image processing, paper reading