我需要对dataframe增加一列month,当某行数据的开始日期和结束日期包含某些月份的第一天,那就把这些月份添加到month列中,我只能用最原始的for循环来处理,我的数据大概有300行,如果用向量化处理的话,我该如何优化?
import pandas as pd
import numpy as np
data = pd.DataFrame(
{
"d_month": ['202109', '202109', '202109', '202106', '202106', '202106', '202105', '202105', '202105', '202104',
'202104', '202104', '202103', '202103', '202103', '202102', '202102', '202102', '202101', '202101',
'202101'],
"ts_code": ['A202109', 'B202109', 'C202109', 'A202106', 'B202106', 'C202106', 'A202105', 'B202105', 'C202105',
'A202104', 'B202104', 'C202104', 'A202103', 'B202103', 'C202103', 'A202102', 'B202102', 'C202102',
'A202101', 'B202101', 'C202101'],
"first_date": ['20210118', '20210118', '20210118', '20201019', '20201019', '20201019', '20210322', '20210322',
'20210322', '20210222', '20210222', '20210222', '20200720', '20200720', '20200720', '20201221',
'20201221', '20201221', '20201123', '20201123', '20201123'],
"end_date": ['20210917', '20210917', '20210917', '20210618', '20210618', '20210618', '20210521', '20210521',
'20210521', '20210416', '20210416', '20210416', '20210319', '20210319', '20210319', '20210219',
'20210219', '20210219', '20210115', '20210115', '20210115'],
})
data = data.sort_values(by=['d_month', 'ts_code'], ascending=[True, True]).reset_index(drop=True)
result = pd.DataFrame()
s = data['d_month'].sort_values(ascending=True).drop_duplicates()
for i in s.values:
d1 = str(i) + '01'
v1 = data[(data.first_date <= d1) & (data.end_date >= d1)].reset_index(drop=True)
v1['month'] = i
result = pd.concat([result, v1])
result = result.sort_values(by=['month', 'd_month', 'ts_code'], ascending=[True, True, True]).reset_index(drop=True)
result = result[['month', 'd_month', 'first_date', 'end_date']]
print('原始数据:')
print(data.head(10))
print('希望得到的数据:')
print(result.head(10))
输出:
原始数据:
d_month ts_code first_date end_date
0 202101 A202101 20201123 20210115
1 202101 B202101 20201123 20210115
2 202101 C202101 20201123 20210115
3 202102 A202102 20201221 20210219
4 202102 B202102 20201221 20210219
5 202102 C202102 20201221 20210219
6 202103 A202103 20200720 20210319
7 202103 B202103 20200720 20210319
8 202103 C202103 20200720 20210319
9 202104 A202104 20210222 20210416
希望得到的数据:
month d_month first_date end_date
0 202101 202101 20201123 20210115
1 202101 202101 20201123 20210115
2 202101 202101 20201123 20210115
3 202101 202102 20201221 20210219
4 202101 202102 20201221 20210219
5 202101 202102 20201221 20210219
6 202101 202103 20200720 20210319
7 202101 202103 20200720 20210319
8 202101 202103 20200720 20210319
9 202101 202106 20201019 20210618
Process finished with exit code 0