qq_38411575 2021-03-31 14:40 采纳率: 77.8%
浏览 67
已结题

python for 循环向量化优化

我需要对dataframe增加一列month,当某行数据的开始日期和结束日期包含某些月份的第一天,那就把这些月份添加到month列中,我只能用最原始的for循环来处理,我的数据大概有300行,如果用向量化处理的话,我该如何优化?

import pandas as pd
import numpy as np

data = pd.DataFrame(
    {
        "d_month": ['202109', '202109', '202109', '202106', '202106', '202106', '202105', '202105', '202105', '202104',
                    '202104', '202104', '202103', '202103', '202103', '202102', '202102', '202102', '202101', '202101',
                    '202101'],
        "ts_code": ['A202109', 'B202109', 'C202109', 'A202106', 'B202106', 'C202106', 'A202105', 'B202105', 'C202105',
                    'A202104', 'B202104', 'C202104', 'A202103', 'B202103', 'C202103', 'A202102', 'B202102', 'C202102',
                    'A202101', 'B202101', 'C202101'],
        "first_date": ['20210118', '20210118', '20210118', '20201019', '20201019', '20201019', '20210322', '20210322',
                       '20210322', '20210222', '20210222', '20210222', '20200720', '20200720', '20200720', '20201221',
                       '20201221', '20201221', '20201123', '20201123', '20201123'],
        "end_date": ['20210917', '20210917', '20210917', '20210618', '20210618', '20210618', '20210521', '20210521',
                     '20210521', '20210416', '20210416', '20210416', '20210319', '20210319', '20210319', '20210219',
                     '20210219', '20210219', '20210115', '20210115', '20210115'],

    })
data = data.sort_values(by=['d_month', 'ts_code'], ascending=[True, True]).reset_index(drop=True)
result = pd.DataFrame()
s = data['d_month'].sort_values(ascending=True).drop_duplicates()
for i in s.values:
    d1 = str(i) + '01'
    v1 = data[(data.first_date <= d1) & (data.end_date >= d1)].reset_index(drop=True)
    v1['month'] = i
    result = pd.concat([result, v1])


result = result.sort_values(by=['month', 'd_month', 'ts_code'], ascending=[True, True, True]).reset_index(drop=True)
result = result[['month', 'd_month', 'first_date', 'end_date']]
print('原始数据:')
print(data.head(10))
print('希望得到的数据:')
print(result.head(10))

输出:

原始数据:
  d_month  ts_code first_date  end_date
0  202101  A202101   20201123  20210115
1  202101  B202101   20201123  20210115
2  202101  C202101   20201123  20210115
3  202102  A202102   20201221  20210219
4  202102  B202102   20201221  20210219
5  202102  C202102   20201221  20210219
6  202103  A202103   20200720  20210319
7  202103  B202103   20200720  20210319
8  202103  C202103   20200720  20210319
9  202104  A202104   20210222  20210416
希望得到的数据:
    month d_month first_date  end_date
0  202101  202101   20201123  20210115
1  202101  202101   20201123  20210115
2  202101  202101   20201123  20210115
3  202101  202102   20201221  20210219
4  202101  202102   20201221  20210219
5  202101  202102   20201221  20210219
6  202101  202103   20200720  20210319
7  202101  202103   20200720  20210319
8  202101  202103   20200720  20210319
9  202101  202106   20201019  20210618

Process finished with exit code 0
  • 写回答

1条回答 默认 最新

  • qq_38411575 2021-03-31 15:45
    关注

    大牛的回答:

    Idea is get all unique months and pass to merge for cross join with helper column a for all combinations and then filter in boolean indexing, last sorting and change order of columns if necessary:

    df = data.assign(a=1)
    df1 = df[['a','d_month']].drop_duplicates().rename(columns={'d_month':'month'})
    df = df.merge(df1, on='a')
    df = df[(df.start_date <= df['month']) & (df.end_date >= df['month'])].drop('a', axis=1)
    
    df = df.sort_values(by=['month', 'd_month', 'code'], ignore_index=True)
    df = df[df.columns[-1:].tolist() + df.columns[:-1].tolist()]
    
    print (df.head(10))
        month d_month     code start_date  end_date
    0  202103  202103  A202103   20200720  20210319
    1  202103  202103  B202103   20200720  20210319
    2  202103  202103  C202103   20200720  20210319
    3  202103  202104  A202104   20210222  20210416
    4  202103  202104  B202104   20210222  20210416
    5  202103  202104  C202104   20210222  20210416
    6  202103  202106  A202106   20201019  20210618
    7  202103  202106  B202106   20201019  20210618
    8  202103  202106  C202106   20201019  20210618
    9  202103  202109  A202109   20210118  20210917
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

问题事件

  • 系统已结题 8月27日
  • 已采纳回答 8月19日

悬赏问题

  • ¥15 全部备份安卓app数据包括密码,可以复制到另一手机上运行
  • ¥15 Python3.5 相关代码写作
  • ¥20 测距传感器数据手册i2c
  • ¥15 RPA正常跑,cmd输入cookies跑不出来
  • ¥15 求帮我调试一下freefem代码
  • ¥15 matlab代码解决,怎么运行
  • ¥15 R语言Rstudio突然无法启动
  • ¥15 关于#matlab#的问题:提取2个图像的变量作为另外一个图像像元的移动量,计算新的位置创建新的图像并提取第二个图像的变量到新的图像
  • ¥15 改算法,照着压缩包里边,参考其他代码封装的格式 写到main函数里
  • ¥15 用windows做服务的同志有吗