Sunday, April 24, 2022

Pandas DataFrame: group by consecutive equal values in multiple columns

I need to group consecutive rows that have the same values in a list of columns. Thanks to this, I figured out how to do it for one column, but I cannot get it to work for more than one.

My question is quite close to this one, but I cannot get that approach to do what I want either.

Here is a working snippet (it currently groups on value2 only). The columns user, group, value1 and value2 must all be identical for rows to be grouped together:

#! /bin/python3
import pandas as pd
data = [{"user":"paul","group":"accounting","value1":"foo","value2":3,"value3":"random123"},{"user":"paul","group":"accounting","value1":"foo","value2":3,"value3":"random456"},{"user":"paul","group":"accounting","value1":"foo","value2":3,"value3":"random789"},{"user":"paul","group":"accounting","value1":"foo","value2":5,"value3":"random789"},{"user":"paul","group":"accounting","value1":"foo","value2":5,"value3":"random789"},{"user":"paul","group":"accounting","value1":"foo","value2":5,"value3":"random158"},{"user":"jack","group":"administration","value1":"foo","value2":5,"value3":"random487"},{"user":"jack","group":"administration","value1":"foo","value2":5,"value3":"random435"},{"user":"jack","group":"administration","value1":"bar","value2":3,"value3":"random483"},{"user":"jack","group":"administration","value1":"foo","value2":3,"value3":"random431"},{"user":"jack","group":"administration","value1":"foo","value2":3,"value3":"random478"},{"user":"paul","group":"accounting","value1":"foo","value2":5,"value3":"random759"},{"user":"jack","group":"administration","value1":"bar","value2":3,"value3":"random431"},{"user":"jack","group":"administration","value1":"foo","value2":3,"value3":"random478"}]
df = pd.DataFrame(data)
print(df)
print("----")
grouped = df.groupby(((df['value2'].shift()!= df['value2'])).cumsum())
for k, v in grouped:
    print(f'[group {k}]')
    print(v)

This prints:

[group 1]
user group value1 value2 value3
0 paul accounting foo 3 random123
1 paul accounting foo 3 random456
2 paul accounting foo 3 random789
[group 2]
user group value1 value2 value3
3 paul accounting foo 5 random789
4 paul accounting foo 5 random789
5 paul accounting foo 5 random158
6 jack administration foo 5 random487
7 jack administration foo 5 random435
[group 3]
user group value1 value2 value3
8 jack administration bar 3 random483
9 jack administration foo 3 random431
10 jack administration foo 3 random478
[group 4]
user group value1 value2 value3
11 paul accounting foo 5 random759
[group 5]
user group value1 value2 value3
12 jack administration bar 3 random431
13 jack administration foo 3 random478

But this is what I need:

[group 1]
user group value1 value2 value3
0 paul accounting foo 3 random123
1 paul accounting foo 3 random456
2 paul accounting foo 3 random789
[group 2]
user group value1 value2 value3
3 paul accounting foo 5 random789
4 paul accounting foo 5 random789
5 paul accounting foo 5 random158
[group 3]
user group value1 value2 value3
6 jack administration foo 5 random487
7 jack administration foo 5 random435
[group 4]
user group value1 value2 value3
8 jack administration bar 3 random483
[group 5]
user group value1 value2 value3
9 jack administration foo 3 random431
10 jack administration foo 3 random478
[group 6]
user group value1 value2 value3
11 paul accounting foo 5 random759
[group 7]
user group value1 value2 value3
12 jack administration bar 3 random431
[group 8]
user group value1 value2 value3
13 jack administration foo 3 random478

I tried passing several columns to the groupby, but without success:

grouped = df.groupby(((df[['user', 'value2']].shift()!= df[['user', 'value2']])).cumsum())
# raises:
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
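
The error can be reproduced on a much smaller frame. The following is a minimal sketch, using a made-up three-row frame (not the data above), of what the expression passed to groupby looks like: comparing two columns at once yields a two-dimensional boolean DataFrame, and cumsum keeps it two-dimensional, whereas groupby expects a one-dimensional key such as a Series.

```python
import pandas as pd

# Hypothetical toy data, only for illustrating the shape of the key.
df = pd.DataFrame({"user": ["paul", "paul", "jack"], "value2": [3, 5, 5]})

# Element-wise comparison of each row with the previous row.
changed = df[["user", "value2"]].shift() != df[["user", "value2"]]
print(type(changed).__name__, changed.shape)  # DataFrame (3, 2)

# The cumulative sum is taken per column, so the result is still 2-D.
keys = changed.cumsum()
print(keys.ndim)  # 2
```

With one column selected as df['value2'], the same expression is a Series, which is why the single-column version works.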


Solution

Build the consecutive groups by comparing the columns in the list with their shifted values, reducing the result to one boolean per row with DataFrame.any(axis=1), and then taking the cumulative sum:

cols = ['user','group','value1','value2']
grouped = df.groupby(((df[cols].shift()!= df[cols]).any(axis=1)).cumsum())
for k, v in grouped:
    print(f'[group {k}]')
    print(v)

[group 1]
user group value1 value2 value3
0 paul accounting foo 3 random123
1 paul accounting foo 3 random456
2 paul accounting foo 3 random789
[group 2]
user group value1 value2 value3
3 paul accounting foo 5 random789
4 paul accounting foo 5 random789
5 paul accounting foo 5 random158
[group 3]
user group value1 value2 value3
6 jack administration foo 5 random487
7 jack administration foo 5 random435
[group 4]
user group value1 value2 value3
8 jack administration bar 3 random483
[group 5]
user group value1 value2 value3
9 jack administration foo 3 random431
10 jack administration foo 3 random478
[group 6]
user group value1 value2 value3
11 paul accounting foo 5 random759
[group 7]
user group value1 value2 value3
12 jack administration bar 3 random431
[group 8]
user group value1 value2 value3
13 jack administration foo 3 random478
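
To see why this works, it can help to look at the intermediate values. The sketch below uses a made-up five-row frame with only two key columns (an assumption for brevity, not the data above) and the illustrative names changed and run_id: a row starts a new group when any key column differs from the previous row, and the cumulative sum of those flags numbers the runs.

```python
import pandas as pd

# Hypothetical toy data with two key columns.
df = pd.DataFrame({
    "user": ["paul", "paul", "jack", "jack", "paul"],
    "value2": [3, 3, 3, 5, 3],
})
cols = ["user", "value2"]

# True where at least one key column differs from the row above.
# The first row is always True because shift() fills it with NaN.
changed = (df[cols].shift() != df[cols]).any(axis=1)
print(changed.tolist())  # [True, False, True, True, True]

# Each True starts a new run, so the running total is the run number.
run_id = changed.cumsum()
print(run_id.tolist())  # [1, 1, 2, 3, 4]

# The 1-D run_id can be used directly as a groupby key.
sizes = df.groupby(run_id).size()
print(sizes.tolist())  # [2, 1, 1, 1]
```

One caveat: because NaN != NaN evaluates to True, consecutive rows with missing values in a key column end up in separate groups under this comparison.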
