
我在使用LSTM模型预测南极海冰的SIC时,我首先将每年每月的SIC呈现在折线图上,但在有些年份出现了异常点,不知道是什么问题

其次我又将预测的后六年的SIC与实际的SIC共同呈现在第二张图上。但在2023年出现了明显的异常。正常来说SIC都是在0到1直接,第一张图的数据出现了明显的异常。我多次询问导师,得到的结果是数据没问题。那会不会是我的代码有什么问题,请求指导


下午好🌅🌅🌅
本答案参考通义千问
你提到的问题非常常见,尤其是在处理气候或环境数据(如南极海冰浓度SIC)时。以下是对你问题的详细分析与解决方案建议:
sigmoid而非tanh等。import pandas as pd
import numpy as np
# 假设你的数据存储在一个DataFrame中,列名为 'SIC'
df = pd.read_csv('sic_data.csv')
# 检查异常值
Q1 = df['SIC'].quantile(0.25)
Q3 = df['SIC'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['SIC'] < (Q1 - 1.5 * IQR)) | (df['SIC'] > (Q3 + 1.5 * IQR))]
print("Outliers:", outliers)
MinMaxScaler 或 StandardScaler 对数据进行归一化。from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(df[['SIC']])
train_size = int(len(scaled_data) * 0.8)
train, test = scaled_data[:train_size], scaled_data[train_size:]
def create_dataset(data, time_steps=1):
X, y = [], []
for i in range(len(data)-time_steps-1):
X.append(data[i:(i+time_steps), 0])
y.append(data[i + time_steps, 0])
return np.array(X), np.array(y)
time_steps = 12 # 假设用过去12个月预测下一个月
X_train, y_train = create_dataset(train, time_steps)
X_test, y_test = create_dataset(test, time_steps)
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mean_squared_error')
from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='loss', patience=5)
model.fit(X_train, y_train, epochs=50, batch_size=32, callbacks=[early_stop], validation_split=0.1)
scaler.inverse_transform() 得到原始值。predicted_sic = model.predict(X_test)
predicted_sic = scaler.inverse_transform(predicted_sic.reshape(-1, 1))
test_real = scaler.inverse_transform(y_test.reshape(-1, 1))
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.plot(test_real, label='Actual SIC')
plt.plot(predicted_sic, label='Predicted SIC')
plt.legend()
plt.title("SIC Prediction vs Actual")
plt.show()
| 问题 | 解决方案 |
|------|----------|
| 异常值未处理 | 使用箱线图或Z-score检测并处理异常值 |
| 数据未归一化 | 使用 MinMaxScaler 对数据进行标准化 |
| 时间序列划分不科学 | 按时间顺序划分训练集与测试集 |
| LSTM结构不合理 | 使用合理层数与Dropout控制过拟合 |
| 预测结果异常点 | 反归一化后可视化对比,排查是否为数据分布问题 |
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt
# 加载数据
df = pd.read_csv('sic_data.csv')
# 检查异常值
Q1 = df['SIC'].quantile(0.25)
Q3 = df['SIC'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['SIC'] < (Q1 - 1.5 * IQR)) | (df['SIC'] > (Q3 + 1.5 * IQR))]
print("Outliers:", outliers)
# 标准化
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(df[['SIC']])
# 划分训练集和测试集
train_size = int(len(scaled_data) * 0.8)
train, test = scaled_data[:train_size], scaled_data[train_size:]
# 构建滑动窗口
def create_dataset(data, time_steps=1):
X, y = [], []
for i in range(len(data)-time_steps-1):
X.append(data[i:(i+time_steps), 0])
y.append(data[i + time_steps, 0])
return np.array(X), np.array(y)
time_steps = 12
X_train, y_train = create_dataset(train, time_steps)
X_test, y_test = create_dataset(test, time_steps)
# 构建模型
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mean_squared_error')
# 训练模型
early_stop = EarlyStopping(monitor='loss', patience=5)
model.fit(X_train, y_train, epochs=50, batch_size=32, callbacks=[early_stop], validation_split=0.1)
# 预测
predicted_sic = model.predict(X_test)
predicted_sic = scaler.inverse_transform(predicted_sic.reshape(-1, 1))
test_real = scaler.inverse_transform(y_test.reshape(-1, 1))
# 可视化
plt.figure(figsize=(12,6))
plt.plot(test_real, label='Actual SIC')
plt.plot(predicted_sic, label='Predicted SIC')
plt.legend()
plt.title("SIC Prediction vs Actual")
plt.show()
如有更多数据或代码细节,欢迎提供,我可以进一步帮助你调试。