[Python Basics] Pandas: Getting Rows and Columns at Random (df.sample())

January 10, 2023 / January 15, 2023

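Last time, we studied how to use the while loop. This time, we look at pandas' df.sample(), which picks rows or columns from a DataFrame at random.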

Getting one row at random

Let's start with the most basic case: getting a single row at random.

To do this, call sample() on the DataFrame with no arguments, that is, df.sample().

selected_row = df.sample()

print(selected_row)

Output
   0   1   2
3  4  16  64
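
The article does not show how df is built. Judging from the printed outputs, it appears to hold n, n squared, and n cubed for n = 1 to 5, so the following is a minimal sketch under that assumption. The random_state argument is not used in the original; it is included here only to show how the result can be made repeatable.

```python
import pandas as pd

# Assumed reconstruction of df: the article never shows its definition.
# Each row holds n, n**2, n**3 for n = 1..5, which matches the printed outputs.
df = pd.DataFrame([[n, n**2, n**3] for n in range(1, 6)])

# sample() with no arguments returns one randomly chosen row as a DataFrame.
selected_row = df.sample()
print(selected_row)

# Passing random_state fixes the seed, so the same row is picked on every run.
reproducible_row = df.sample(random_state=0)
print(reproducible_row)
```

Without random_state, a different row may come back each time the cell is run, so the output above is only one possible result.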

Getting one column at random

To get one column at random, add the axis=1 option.

In other words, call df.sample(axis=1).

selected_column = df.sample(axis=1)

print(selected_column)

Output
     2
0    1
1    8
2   27
3   64
4  125
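
As a small sketch of what axis=1 changes, the code below checks the shape of the result and also draws two columns at once. It assumes the same reconstructed df as before (n, n squared, n cubed for n = 1 to 5), which the article itself does not define.

```python
import pandas as pd

# Assumed df: not defined in the article, reconstructed from its outputs.
df = pd.DataFrame([[n, n**2, n**3] for n in range(1, 6)])

# axis=1 makes sample() choose among column labels instead of row labels.
selected_column = df.sample(axis=1)
print(selected_column)

# The result is still a DataFrame: all 5 rows, but only 1 column.
print(selected_column.shape)

# axis="columns" is an equivalent spelling; n works for columns too.
two_columns = df.sample(n=2, axis="columns")
print(two_columns.columns.tolist())
```

Note that the result is a one-column DataFrame rather than a Series, which is why the output keeps the column label as a header.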

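Getting multiple rows at random without duplicates

To get several rows at random with no row appearing more than once, add the option n=<number of rows to get>. By default, sample() draws without replacement (replace=False), so the same row is never picked twice.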

selected_row3 = df.sample(n=3)

print(selected_row3)

Output
   0   1    2
0  1   1    1
1  2   4    8
4  5  25  125
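
To confirm that the sampled rows really are distinct, one option is to check the index of the result. This is a minimal sketch using the assumed df (reconstructed from the outputs, not given in the article).

```python
import pandas as pd

# Assumed df: not defined in the article, reconstructed from its outputs.
df = pd.DataFrame([[n, n**2, n**3] for n in range(1, 6)])

# replace=False is the default, so each row can be drawn at most once.
selected_row3 = df.sample(n=3)

# Because df has unique row labels, the sampled labels are unique as well.
print(selected_row3.index.is_unique)  # True

# Sampled rows come back in random order; sort_index() restores label order.
print(selected_row3.sort_index())
```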

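Note that, because the rows must not repeat, n cannot be larger than the number of rows in the DataFrame (5 in this example). Setting n equal to the number of rows is fine.

If you specify a number larger than the number of rows, a ValueError is raised.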

selected_row10= df.sample(n=10)

print(selected_row10)

Output
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [21], in <cell line: 1>()
----> 1 selected_row10= df.sample(n=10)
      3 print(selected_row10)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/generic.py:5446, in NDFrame.sample(self, n, frac, replace, weights, random_state, axis, ignore_index)
   5443 if weights is not None:
   5444     weights = sample.preprocess_weights(self, weights, axis)
-> 5446 sampled_indices = sample.sample(obj_len, size, replace, weights, rs)
   5447 result = self.take(sampled_indices, axis=axis)
   5449 if ignore_index:

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/sample.py:150, in sample(obj_len, size, replace, weights, random_state)
    147     else:
    148         raise ValueError("Invalid weights: weights sum to zero")
--> 150 return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype(
    151     np.intp, copy=False
    152 )

File mtrand.pyx:965, in numpy.random.mtrand.RandomState.choice()

ValueError: Cannot take a larger sample than population when 'replace=False'
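
If the requested count might exceed the number of rows, two simple ways to avoid stopping on this error are to cap n or to catch the exception. The sketch below shows both; the variable name requested and the assumed df are illustrative and do not come from the article.

```python
import pandas as pd

# Assumed df: not defined in the article, reconstructed from its outputs.
df = pd.DataFrame([[n, n**2, n**3] for n in range(1, 6)])

requested = 10  # hypothetical request, larger than the 5 rows in df

# Option 1: cap n at the number of rows before sampling.
safe_n = min(requested, len(df))
capped = df.sample(n=safe_n)
print(len(capped))  # 5

# Option 2: attempt the call and handle the ValueError.
error_message = None
try:
    df.sample(n=requested)
except ValueError as exc:
    error_message = str(exc)
    print("Sampling failed:", error_message)
```

Capping silently returns fewer rows than requested, so it suits cases where "up to n rows" is acceptable; catching the error is safer when the caller needs to know the request could not be met.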

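Getting multiple rows at random with duplicates

To get several rows at random while allowing the same row to be picked more than once, add replace=True along with n=<number of rows to get>.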

selected_row3_replace = df.sample(n=3, replace=True)

print(selected_row3_replace)

Output
   0   1    2
0  1   1    1
0  1   1    1
4  5  25  125

With replace=True, n can exceed the number of rows in the DataFrame without causing an error.

selected_row10_replace = df.sample(n=10, replace=True)

print(selected_row10_replace)

Output
   0   1    2
1  2   4    8
1  2   4    8
1  2   4    8
4  5  25  125
0  1   1    1
1  2   4    8
2  3   9   27
0  1   1    1
2  3   9   27
0  1   1    1
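
As the output above shows, sampling with replacement leaves repeated row labels in the index. The sketch below illustrates this and uses the ignore_index parameter (visible in the sample() signature in the earlier traceback) to relabel the result. It assumes the reconstructed df and a pandas version that supports ignore_index (1.3 or later, as far as I know).

```python
import pandas as pd

# Assumed df: not defined in the article, reconstructed from its outputs.
df = pd.DataFrame([[n, n**2, n**3] for n in range(1, 6)])

# With replace=True, n may exceed the 5 rows available in df.
selected_row10_replace = df.sample(n=10, replace=True)

# Ten draws from five rows must repeat at least one label.
print(selected_row10_replace.index.is_unique)  # False

# ignore_index=True discards the original labels and numbers rows 0..n-1.
relabeled = df.sample(n=10, replace=True, ignore_index=True)
print(relabeled.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Keeping the original labels is useful when you need to trace each sampled row back to df; relabeling is handier when the sample is treated as a new dataset of its own.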

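Random-sampling functions like this are less about the stage where you collect data and build a dataset, and more about what comes afterward: they are quite handy when you want to do something with that dataset.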

Next time, let's create an animated graph using matplotlib and PIL.