[Python/TensorFlow] How to serialize a SparseTensor using SequenceExample

1 Who this article is for
2 How to write it with SequenceExample
3 [Supplement] About FeatureLists
4 [Supplement] About parse_single_sequence_example
5 Summary

Who this article is for

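This is a follow-up to my previous article.

That article covered how to convert a SparseTensor to TFRecord format and read it back. In it, I mentioned that the official TensorFlow documentation recommends "SequenceExample" with "VarLenFeature" over "SparseFeature". I left it at "apparently the latter is simpler to write", so I looked into it afterwards.

This article summarizes what I found. My conclusion for now is that SparseFeature is probably the simpler of the two to write.

That said, impressions will differ from person to person, so there is little point in dwelling on it. Let's get to the main topic.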

How to write it with SequenceExample

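This time I will show the code I tried first. It uses a number of elements that my earlier articles did not cover; those are explained in the supplements further down.

The sample code was run in the following environment.

OS: Windows 10
Python version: 3.7.9
tensorflow version: 2.1.0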

# [TFRecord writer side]
import numpy as np
import tensorflow as tf
import scipy.sparse  # "import scipy" alone does not guarantee scipy.sparse is loaded

def int64_feature(l):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=l))

def float_feature(l):
    return tf.train.Feature(float_list=tf.train.FloatList(value=l))

# Main
sample_mat = scipy.sparse.random(12, 12, density=0.1)

# Store each index as [row, col] so it can be used directly when reading back
indices_features = []
for row, col in zip(sample_mat.row, sample_mat.col):
    indices_features.append(int64_feature([row, col]))

values_feature = float_feature(sample_mat.data)
shape_feature = int64_feature([12, 12])

feature_lists = tf.train.FeatureLists(feature_list={
    "indices": tf.train.FeatureList(feature=indices_features),
    "values":  tf.train.FeatureList(feature=[values_feature]),
    "shape":   tf.train.FeatureList(feature=[shape_feature])
})
seq_example = tf.train.SequenceExample(feature_lists=feature_lists)

savepath = "sparse_seq_example.tfrecords"
# Save in TFRecord format
with tf.io.TFRecordWriter(savepath) as writer:
    writer.write(seq_example.SerializeToString())

print(seq_example)
"""
feature_lists {
  feature_list {
    key: "indices"
    value {
      feature {
        int64_list {
          value: 11
          value: 5
        }
      }
      - omitted for brevity -
      feature {
        int64_list {
          value: 9
          value: 11
        }
      }
    }
  }
  feature_list {
    key: "shape"
    value {
      feature {
        int64_list {
          value: 12
          value: 12
        }
      }
    }
  }
  feature_list {
    key: "values"
    value {
      feature {
        float_list {
          value: 0.4992492198944092
          value: 0.3927936255931854
          value: 0.23264065384864807
          value: 0.26840609312057495
          value: 0.7273346781730652
          value: 0.10040651261806488
          value: 0.8834704160690308
          value: 0.31499430537223816
          value: 0.25683218240737915
          value: 0.6979917287826538
          value: 0.7752628922462463
          value: 0.9083815217018127
          value: 0.7380165457725525
        }
      }
    }
  }
}
"""

# [TFRecord reader side]
import numpy as np
import tensorflow as tf
import scipy

# Must match the path used on the writer side
filepath = "sparse_seq_example.tfrecords"

# Being able to read the shape back as well may be an advantage over SparseFeature


def parse_seq_example(example):
    sequence_features = {
        "indices": tf.io.VarLenFeature(dtype=tf.int64),
        "values":  tf.io.VarLenFeature(dtype=tf.float32),
        "shape":   tf.io.VarLenFeature(dtype=tf.int64)
    }

    _, train_data = tf.io.parse_single_sequence_example(
        example,
        sequence_features=sequence_features
    )

    indices = tf.sparse.to_dense(train_data["indices"])
    values =  tf.reshape(tf.sparse.to_dense(train_data["values"]), shape=(-1,))
    shape =   tf.reshape(tf.sparse.to_dense(train_data["shape"]), shape=(-1,))

    train_X = tf.sparse.SparseTensor(indices, values, shape)

    # Return the restored SparseTensor so the printed result matches the output below
    return train_X


dataset = tf.data.TFRecordDataset([filepath]).map(parse_seq_example)

for data in dataset:
    print(data)

"""
[Output]
SparseTensor(
indices=tf.Tensor(
[[11  5]
 [ 9  2]
 [ 2 10]
 [ 9  6]
 [ 1  6]
 [11  1]
 [ 9 10]
 [10  4]
 [ 8 10]
 [ 3  0]
 [ 0  3]
 [ 5 11]
 [ 4  5]
 [ 9 11]], shape=(14, 2), dtype=int64),
values=tf.Tensor(
[0.49924922 0.39279363 0.23264065 0.2684061  0.7273347  0.10040651
 0.8834704  0.7617732  0.3149943  0.25683218 0.6979917  0.7752629
 0.9083815  0.73801655], shape=(14,), dtype=float32),
dense_shape=tf.Tensor([12 12], shape=(2,), dtype=int64))
"""

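I have included both the TFRecord writer side and the reader side. The point to note when serializing with SequenceExample is that it uses "tf.train.FeatureLists".

On the reader side, the major difference is that "tf.io.parse_single_sequence_example" is used.

For details, read the supplements in this article or refer to the following Stack Overflow answer.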

What are the advantages of using tf.train.SequenceExample over tf.train.Example for variable length features?


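From here on I assume you are familiar with the APIs involved. As in the sample code above, converting a SparseTensor to TFRecord format and restoring it as a SparseTensor requires the following steps.

Create FeatureLists for "indices", "values", and "shape", with the intention of reading them back as a SparseTensor after deserialization
After deserialization, convert indices, values, and shape back to dense Tensors with "tf.sparse.to_dense" and pass them as arguments to SparseTensor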
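The two steps above can be sketched in memory, without writing a file, as follows. This is only an illustration under assumptions: the helper names to_sequence_example and to_sparse_tensor and the small 3x4 example values are hypothetical and do not come from the sample code. tf.sparse.reorder is applied before densifying as a precaution, because the stored indices are not necessarily in row-major order.

```python
import tensorflow as tf


def to_sequence_example(indices, values, dense_shape):
    """Step 1 (hypothetical helper): pack indices/values/shape into FeatureLists."""
    def int64_feature(v):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=v))

    # One Feature per non-zero element, each holding [row, col]
    index_features = [int64_feature(list(rc)) for rc in indices]
    values_feature = tf.train.Feature(float_list=tf.train.FloatList(value=values))
    feature_lists = tf.train.FeatureLists(feature_list={
        "indices": tf.train.FeatureList(feature=index_features),
        "values": tf.train.FeatureList(feature=[values_feature]),
        "shape": tf.train.FeatureList(feature=[int64_feature(dense_shape)]),
    })
    return tf.train.SequenceExample(feature_lists=feature_lists)


def to_sparse_tensor(serialized):
    """Step 2 (hypothetical helper): parse, densify, and rebuild the SparseTensor."""
    spec = {
        "indices": tf.io.VarLenFeature(tf.int64),
        "values": tf.io.VarLenFeature(tf.float32),
        "shape": tf.io.VarLenFeature(tf.int64),
    }
    _, parsed = tf.io.parse_single_sequence_example(
        serialized, sequence_features=spec)
    indices = tf.sparse.to_dense(parsed["indices"])                   # [nnz, 2]
    values = tf.reshape(tf.sparse.to_dense(parsed["values"]), [-1])   # [nnz]
    shape = tf.reshape(tf.sparse.to_dense(parsed["shape"]), [-1])     # [2]
    return tf.sparse.SparseTensor(indices, values, shape)


# Illustrative data only; the indices are intentionally not sorted
indices = [[2, 1], [0, 3], [1, 0]]
values = [3.5, 1.5, 2.5]
serialized = to_sequence_example(indices, values, [3, 4]).SerializeToString()

restored = to_sparse_tensor(serialized)
# Sort the indices into row-major order before converting to a dense array
restored_dense = tf.sparse.to_dense(tf.sparse.reorder(restored)).numpy()
```

Comparing restored_dense with the original matrix is a quick way to confirm that the round trip preserved both the values and their positions.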

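To me, this process feels more complicated than using "SparseFeature".

The indices have to be laid out differently from plain COO arrays (one [row, col] Feature per non-zero element) so that they can be passed to SparseTensor, and because the deserialized index information itself comes back as a SparseTensor, the parser function has to convert it to a dense Tensor. All of this costs extra lines.

In that case, restoring the SparseTensor directly with SparseFeature needs less code and is easier to follow, I think. It may be a minor point, though.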
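For comparison, a minimal sketch of the SparseFeature route might look like the following. The feature keys "row", "col", "values", the output key "matrix", and the 3x4 example data are assumptions made for illustration, not taken from the previous article. Here tf.io.SparseFeature is given the index keys, the value key, and the dense size, and the parser returns a SparseTensor directly.

```python
import tensorflow as tf

# Illustrative COO data for a 3x4 matrix (hypothetical values)
rows = [0, 1, 2]
cols = [3, 0, 2]
vals = [1.5, 2.5, 3.5]
dense_shape = [3, 4]

# Writer side: row indices, column indices, and values go into an ordinary Example
example = tf.train.Example(features=tf.train.Features(feature={
    "row": tf.train.Feature(int64_list=tf.train.Int64List(value=rows)),
    "col": tf.train.Feature(int64_list=tf.train.Int64List(value=cols)),
    "values": tf.train.Feature(float_list=tf.train.FloatList(value=vals)),
}))
serialized = example.SerializeToString()

# Reader side: SparseFeature ties the index keys and the value key together
sparse_spec = tf.io.SparseFeature(
    index_key=["row", "col"],
    value_key="values",
    dtype=tf.float32,
    size=dense_shape,
)
parsed = tf.io.parse_single_example(serialized, {"matrix": sparse_spec})

restored = parsed["matrix"]  # already a tf.sparse.SparseTensor
restored_dense = tf.sparse.to_dense(restored).numpy()
```

Note that the dense size has to be supplied in the parsing spec in this sketch, whereas the SequenceExample version stores the shape in the record itself.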

[Supplement] About FeatureLists

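When writing tf.train.FeatureLists, the structure is as follows.

An object that takes a list of tf.train.Feature as its argument ⇒ tf.train.FeatureList
An object that takes a dictionary with tf.train.FeatureList as the value and an arbitrary string as the key ⇒ tf.train.FeatureLists

As shown above, a FeatureList can be viewed as a Protocol Buffer message holding a list of Features, and FeatureLists as a message holding several named FeatureLists.

Related to FeatureLists, I will also touch on tf.train.SequenceExample. It is slightly unusual in that it takes two arguments, "context" and "feature_lists". The former is for fixed-length features and corresponds to "tf.train.Features"; the latter is for variable-length features and corresponds to "tf.train.FeatureLists".

One example that uses both context and feature_lists appears to be training a DNN that outputs a caption for an input image.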
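As a small illustration of how context and feature_lists sit side by side, the sketch below builds a SequenceExample that keeps a fixed-length value in context and a variable number of entries in feature_lists. Placing "shape" in context is an assumption made only for this example; the sample code earlier in the article stores everything in feature_lists.

```python
import tensorflow as tf

# Fixed-length information goes into context (a tf.train.Features)
context = tf.train.Features(feature={
    "shape": tf.train.Feature(int64_list=tf.train.Int64List(value=[3, 4])),
})

# Variable-length information goes into feature_lists (a tf.train.FeatureLists)
feature_lists = tf.train.FeatureLists(feature_list={
    "indices": tf.train.FeatureList(feature=[
        tf.train.Feature(int64_list=tf.train.Int64List(value=[0, 3])),
        tf.train.Feature(int64_list=tf.train.Int64List(value=[1, 0])),
    ]),
})

seq_example = tf.train.SequenceExample(
    context=context, feature_lists=feature_lists)

# The nesting can be inspected directly on the protobuf message
shape_values = list(seq_example.context.feature["shape"].int64_list.value)
num_steps = len(seq_example.feature_lists.feature_list["indices"].feature)
```

Here shape_values holds the single fixed-length entry, and num_steps is the number of Features stored in the "indices" FeatureList.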

[Supplement] About parse_single_sequence_example

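This function differs from "tf.io.parse_single_example" in that it returns two results, one for the deserialized context and one for the deserialized feature_lists, and in that it takes two arguments, "context_features" and "sequence_features".

For context_features, as with an ordinary Example, it is enough to specify the keys of the feature dictionary that was passed as context when the TFRecord was created, each paired with a tf.io.FixedLenFeature of the appropriate type.

For sequence_features, on the other hand, specify the keys of the "key name to FeatureList" dictionary that was passed as the SequenceExample's feature_lists, each paired with "tf.io.VarLenFeature" as in the sample code. "tf.io.VarLenFeature" returns the deserialized result as a SparseTensor, so care is needed when handling the output of "tf.io.parse_single_sequence_example".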
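The following sketch shows both arguments in use and the different result types they produce. It assumes a SequenceExample with a "shape" entry in context and an "indices" entry in feature_lists; this layout and the helper int64_feature are chosen for illustration and differ from the sample code above, which puts everything in feature_lists.

```python
import tensorflow as tf


def int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


# Hypothetical record: fixed-length "shape" in context, variable-length "indices"
seq_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={"shape": int64_feature([3, 4])}),
    feature_lists=tf.train.FeatureLists(feature_list={
        "indices": tf.train.FeatureList(
            feature=[int64_feature([0, 3]), int64_feature([1, 0])]),
    }),
)

context, sequences = tf.io.parse_single_sequence_example(
    seq_example.SerializeToString(),
    context_features={"shape": tf.io.FixedLenFeature([2], tf.int64)},
    sequence_features={"indices": tf.io.VarLenFeature(tf.int64)},
)

shape = context["shape"]                      # dense Tensor from FixedLenFeature
indices_sparse = sequences["indices"]         # SparseTensor from VarLenFeature
indices = tf.sparse.to_dense(indices_sparse)  # dense [num_entries, 2] Tensor
```

The FixedLenFeature result can be used as is, while the VarLenFeature result needs tf.sparse.to_dense before it can be passed on as indices.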