2301_76620094 2024-04-15 14:46 · acceptance rate: 0%
Views: 25
Question closed

Spark throws an error when loading a local PipelineModel

In my Spark project I need to load a local PipelineModel (trained earlier and exported to a local directory), but an error is thrown every time I try to load it. The exact statement is val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model").

Below is my complete code (with the model-loading statement commented out, it runs fine):

package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
// $example off$
import org.apache.spark.sql.SparkSession

object PipelineExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("PipelineExample")
      .master("local[*]") // 使用本地模式,*表示使用所有可用的CPU核心
      .getOrCreate()

    // $example on$
    // Prepare training documents from a list of (id, text, label) tuples.
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.001)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // Fit the pipeline to training documents.
    val model = pipeline.fit(training)

    // Now we can optionally save the fitted pipeline to disk
    model.write.overwrite().save("/tmp/spark-logistic-regression-model")

    // We can also save this unfit pipeline to disk
    pipeline.write.overwrite().save("/tmp/unfit-lr-model")

    // Prepare test documents, which are unlabeled (id, text) tuples.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "spark hadoop spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    // Make predictions on test documents.
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }
    // $example off$
    // And load it back in during production
    // val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
    spark.stop()
  }
}

The code is based on the Spark Pipeline example code on GitHub.
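
For reference, once loading works, the commented-out line would be used roughly like this. This is only a minimal sketch that would sit right before spark.stop() in main(), reusing the test DataFrame defined above:

    // Load the pipeline model back from the local directory it was saved to...
    val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
    // ...and use it exactly like the freshly fitted `model`.
    sameModel.transform(test)
      .select("id", "text", "probability", "prediction")
      .show(truncate = false)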

The error message (partial):

Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Ljava/lang/String;)Lorg/apache/hadoop/io/nativeio/NativeIO$POSIX$Stat;
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.getStat(NativeIO.java:608)
    at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfoByNativeIO(RawLocalFileSystem.java:934)
    at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:848)
    at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:816)
    at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:52)
    at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:2199)
    at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:2179)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:244)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)

Project pom dependencies:

<dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>com.baomidou</groupId>
            <artifactId>mybatis-plus-boot-starter</artifactId>
            <version>3.5.3.1</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid</artifactId>
            <version>1.1.16</version>
        </dependency>

        <dependency>
            <groupId>com.mysql</groupId>
            <artifactId>mysql-connector-j</artifactId>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>

        <!--spark-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.2.0</version>
        </dependency>

        <!--spark_sql-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.2.0</version>
        </dependency>

        <!--scala-->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.13</version>
        </dependency>
        <!-- spark-xml dependency for parsing XML in Spark: https://mvnrepository.com/artifact/com.databricks/spark-xml -->
        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>spark-xml_2.12</artifactId>
            <version>0.12.0</version>
        </dependency>
        <!--spark-mllib-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>3.2.0</version>
            <scope>compile</scope>
        </dependency>
        <!-- janino -->
        <dependency>
            <groupId>org.codehaus.janino</groupId>
            <artifactId>commons-compiler</artifactId>
            <version>3.0.16</version>
        </dependency>
        <dependency>
            <groupId>org.codehaus.janino</groupId>
            <artifactId>janino</artifactId>
            <version>3.0.16</version>
        </dependency>
    </dependencies>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-dependencies</artifactId>
                <version>${spring-boot.version}</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>

        </dependencies>
    </dependencyManagement>


How can I solve this problem so that I can load the local model?


17 answers

  • 专家-赤兔 [online] · Quality Creator in programming framework technologies · 2024-04-15 14:46

    Quoted from the free WeChat mini program: 皆我百晓生

    The UnsatisfiedLinkError you get when loading the locally saved PipelineModel in your Spark project is usually related to missing or misconfigured Hadoop native libraries. To load the model successfully and resolve the current problem, try the following steps:

    1. Check the Hadoop native libraries: make sure your project environment actually contains the Hadoop native libraries; they are required for operations on the local file system (such as the /tmp path). If you are running in a Docker container, a cloud environment, or a local development environment without a standalone Hadoop installation, you may need to add these libraries manually. (A small diagnostic sketch follows the steps at the end of this answer.)

      You can obtain and configure the Hadoop native libraries in one of the following ways:

      • Use a Hadoop binary distribution: download a Hadoop binary distribution compatible with the Spark version you use (for example, one matching the prebuilt Spark 3.2.0), and add its lib/native directory to the system's JAVA_LIBRARY_PATH environment variable.

      • Use the maven-shade-plugin: add the maven-shade-plugin to your Maven pom.xml and configure it to package the Hadoop dependencies into the final JAR so they are available at run time. Add the following configuration inside the <build> tag of pom.xml:

        <build>
          <plugins>
            <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-shade-plugin</artifactId>
              <version>3.2.4</version>
              <executions>
                <execution>
                  <phase>package</phase>
                  <goals>
                    <goal>shade</goal>
                  </goals>
                  <configuration>
                    <filters>
                      <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                          <exclude>META-INF/*.SF</exclude>
                          <exclude>META-INF/*.DSA</exclude>
                          <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                      </filter>
                    </filters>
                    <transformers>
                      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                      <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>org.apache.spark.examples.ml.PipelineExample</mainClass>
                      </transformer>
                    </transformers>
                    <artifactSet>
                      <includes>
                        <include>org.apache.hadoop:hadoop-client</include>
                      </includes>
                    </artifactSet>
                    <relocations>
                      <relocation>
                        <pattern>com.google.common</pattern>
                        <shadedPattern>shaded.com.google.common</shadedPattern>
                      </relocation>
                    </relocations>
                  </configuration>
                </execution>
              </executions>
            </plugin>
          </plugins>
        </build>
        

        This configuration packages the Hadoop-related dependencies into the final JAR file and also works around potential class conflicts.

    2. Set the environment variable: if you chose to add the Hadoop native libraries to the system path manually, make sure the following environment variable is set before running the Spark application:

      export JAVA_LIBRARY_PATH=/path/to/hadoop-native-libraries/lib/native:$JAVA_LIBRARY_PATH
      

      Replace /path/to/hadoop-native-libraries with the actual directory containing the Hadoop native libraries.

    3. Rebuild and run the project: save the changes to pom.xml (if you used the second approach from step 1), rebuild the project, and run the Spark application again to see whether the UnsatisfiedLinkError is resolved.

    The steps above should resolve the UnsatisfiedLinkError you hit when loading the local PipelineModel. Make sure the Hadoop native libraries are configured correctly and included in your project environment in whichever way fits your setup. If the problem persists, re-check the environment configuration or provide a more complete error log for further troubleshooting.
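
    As the diagnostic sketch mentioned in step 1: from inside the application you can ask Hadoop's NativeCodeLoader whether the native library was actually picked up. On Windows, a commonly needed related workaround (not covered by the steps above) is to point hadoop.home.dir at a local Hadoop directory whose bin folder contains winutils.exe and a hadoop.dll matching the hadoop-client version on the classpath. This is only a minimal sketch; the C:\hadoop path is a hypothetical example.

      // Minimal diagnostic sketch (assumption: C:\hadoop is a hypothetical local
      // Hadoop directory whose bin folder holds winutils.exe and hadoop.dll
      // matching the Hadoop version used by Spark).
      import org.apache.hadoop.util.NativeCodeLoader
      import org.apache.spark.sql.SparkSession

      object NativeLibCheck {
        def main(args: Array[String]): Unit = {
          // Must be set before any Hadoop/Spark class touches the local file system.
          System.setProperty("hadoop.home.dir", "C:\\hadoop")

          val spark = SparkSession.builder()
            .appName("NativeLibCheck")
            .master("local[*]")
            .getOrCreate()

          // false => Hadoop fell back to its pure-Java code paths;
          // true while the UnsatisfiedLinkError persists usually points to a
          // hadoop.dll/libhadoop that does not match the hadoop-client jars.
          println(s"native hadoop library loaded: ${NativeCodeLoader.isNativeCodeLoaded}")
          println(s"java.library.path = ${System.getProperty("java.library.path")}")

          spark.stop()
        }
      }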



Question events

  • Question closed Apr 17
  • Question created Apr 15