Byte列を通じてNumPy配列からTensorFlowのテンソルへ変換する

Posted On 2021-01-23

6.2k{icon} {views}

Byte列を通じてNumPy配列からTensorFlowのテンソルへ変換する方法です。TFRecordで記録したデータを扱う際に役に立つと思われるやり方です。

NumPy配列をByte列に変換する

こんなことやらなくても、TF2のデフォルトとなったEagerテンソルだと特に問題ないのです。しかし、TF1系列のときに作られたと思われる機能で、Eagerテンソルになっていない（グラフモードのテンソル）場合であると便利です。例えば、TFRecordで記録したものをtf.data.Datasetで回すケースです。これ以外はほとんど使わないと思います。

ニッチすぎるかもしれません。NumPy配列を最も容量的に無駄のないやり方で、ファイルに書き出せるとしたら、Byte列に変換するという方法です。.tobytes()という関数で一発です。

import numpy as np

def tobytes():
    x = np.arange(8, dtype=np.int32)
    x_bytes = x.tobytes()
    print(x)
    # [0 1 2 3 4 5 6 7]
    print(x_bytes)
    # b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00'

if __name__ == "__main__":
    tobytes()

NumPy配列の型を変えるとByte列の長さが変わります。データ型によって、配列の要素にどのぐらいのメモリを割り当てるかが変わるので当たり前ですね。

def tobytes():
    x = np.arange(8, dtype=np.uint8)
    x_bytes = x.tobytes()
    print(x)
    print(x_bytes)
    # [0 1 2 3 4 5 6 7]
    # b'\x00\x01\x02\x03\x04\x05\x06\x07'

Byte列に変換するとshapeは消える

Byte列に変換すると元の配列のshapeの情報は消えます。以下のコードはNumPy配列としては異なりますが、Byte列に変換すると全て同じ値を示します。shapeの情報が落ちているからです。

def tobytes():
    x = np.arange(8, dtype=np.int32).reshape(4, 2).tobytes()
    print(x)
    y = np.arange(8, dtype=np.int32).tobytes()
    print(y)
    z = np.arange(8, dtype=np.int32).reshape(2, 2, 2).tobytes()
    print(z)
# b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00'
# b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00'
# b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00'

それっぽい痕跡はありますが、元のデータの型の情報は落ちていると捉えるべきです。データセットのスキーマーを記録しておくとあとで役立つでしょう。

Byte列からNumPy配列を復元するとはnp.frombuffer

Byte列からNumPy配列を復元するときは、np.frombufferという関数を使います。.tobytes()がシリアライズにあたるなら、np.frombufferがデシリアライズですね。

def frombytes():
    x_bytes = np.arange(8, dtype=np.int32).tobytes()
    x_recon = np.frombuffer(x_bytes, dtype=np.int32)
    print(x_recon)
    # [0 1 2 3 4 5 6 7]

shapeの調整はfrombufferでNumPy配列に戻したあとでやります。

def frombytes():
    x_bytes = np.arange(8, dtype=np.int32).reshape(4, 2).tobytes()
    x_recon = np.frombuffer(x_bytes, dtype=np.int32).reshape(4, 2)
    print(x_recon)

データ型をミスマッチがあると

つまり、元のshapeとデータ型の情報を持っておく必要があります。もし変換の前後でデータ型がミスマッチするような意地悪なケースを想定すると、配列の要素数や値が変わってしまいます。

def frombytes():
    x_bytes = np.arange(8, dtype=np.int32).tobytes()
    x_recon = np.frombuffer(x_bytes, dtype=np.uint8)
    print(x_recon)
    # [0 0 0 0 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 6 0 0 0 7 0 0 0]

def frombytes():
    x_bytes = np.arange(8, dtype=np.float32).tobytes()
    x_recon = np.frombuffer(x_bytes, dtype=np.int32)
    print(x_recon)
    # [         0 1065353216 1073741824 1077936128 1082130432 1084227584
    # 1086324736 1088421888]

Byte列の変換前後で、intとfloatを混同すると、とんでもない値になってしまうのでここは注意が必要です。

TensorFlowでByte列をテンソルに変換する

ここからが本題ですが、NumPyでtobytes()に変換したあと、NumPy配列ではなくTensorFlowのテンソルに変換したいことがあります。TFRecordのことを考えたときです。tf.io.decode_rawという関数でできます。

def bytes_to_tf():
    x_bytes = np.arange(8, dtype=np.int32).tobytes()
    x_tf = tf.io.decode_raw(x_bytes, tf.int32)
    print(x_tf)
    # tf.Tensor([0 1 2 3 4 5 6 7], shape=(8,), dtype=int32)

「エンディアンとか大丈夫なのかな？」と思ったのですが、そもそもNumPy配列とTensorFlowのテンソルが相当互換性があるので、こういうことやっても大丈夫なようです。一応主要な型で試してみます。

def bytes_to_tf():
    x_bytes = np.arange(8, dtype=np.uint8).tobytes()
    x_tf = tf.io.decode_raw(x_bytes, tf.uint8)
    print(x_tf)
    # tf.Tensor([0 1 2 3 4 5 6 7], shape=(8,), dtype=uint8)

def bytes_to_tf():
    x_bytes = (np.arange(8, dtype=np.float32)-3).tobytes()
    x_tf = tf.io.decode_raw(x_bytes, tf.float32)
    print(x_tf)
    # tf.Tensor([-3. -2. -1.  0.  1.  2.  3.  4.], shape=(8,), dtype=float32)

def bytes_to_tf():
    x = np.zeros(8, dtype=np.complex64)
    x.real = np.arange(8)
    x.imag = -np.arange(8)
    x_bytes = x.tobytes()
    x_tf = tf.io.decode_raw(x_bytes, tf.complex64)
    print(x_tf)
    # tf.Tensor([0.+0.j 1.-1.j 2.-2.j 3.-3.j 4.-4.j 5.-5.j 6.-6.j 7.-7.j], shape=(8,), dtype=complex64)

特に問題なさそうです

画像をByte列してTensorFlowのテンソルにする

より実践的に、NumPy配列の画像をByte列を経由してTensorFlowのテンソルにする例を見てみます。MNISTでやってみましょう。

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def mnist():
    (img, _), (_, _) = tf.keras.datasets.mnist.load_data()
    img_bytes = img.tobytes()
    img_tf = tf.reshape(tf.io.decode_raw(img_bytes, tf.uint8), (60000, 28, 28))
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        ax.imshow(img_tf[i])
    plt.show()

うまくいきました。TFRecordではそのままByte列で記録すればよさそうですね。

Shikoan's ML Blogの中の人が運営しているサークル「じゅ～しぃ～すくりぷと」の本のご案内