Pipeline using the Go SDK is not fully parallel on Cloud Dataflow

I have an Apache Beam pipeline written with the Go SDK, shown below. The pipeline has three steps: textio.Read, CountLines, and ProcessLines. The ProcessLines step takes about 10 seconds per line (simulated with time.Sleep in the code).

After implementing the code below, I found that during pipeline execution each step waits for the previous step to finish before it starts. Each step runs in parallel internally, but only one step runs at a time, so the pipeline as a whole is not fully parallel.

This question is a follow-up to the question "Parallelism Problem on Cloud Dataflow using Go SDK". Although the answer there partially solved the problem, the pipeline still does not run fully in parallel.

package main

import (
    "context"
    "flag"
    "math/rand" // needed by addRandomKeyFn
    "time"

    "github.com/apache/beam/sdks/go/pkg/beam"
    "github.com/apache/beam/sdks/go/pkg/beam/io/textio"
    "github.com/apache/beam/sdks/go/pkg/beam/log"
    "github.com/apache/beam/sdks/go/pkg/beam/x/beamx"
)

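// Command-line flags and metrics to be monitored.
var (
    input         = flag.String("input", "", "Input file (required).")
    numberOfLines = beam.NewCounter("extract", "numberOfLines")
    lineLen       = beam.NewDistribution("extract", "lineLenDistro")
    // numberOfLinesProcess is used in processLines but was not declared in the
    // original post; the namespace and name given here are assumed.
    numberOfLinesProcess = beam.NewCounter("extract", "numberOfLinesProcess")
)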

func AddRandomKey(s beam.Scope, col beam.PCollection) beam.PCollection {
    return beam.ParDo(s, addRandomKeyFn, col)
}

func addRandomKeyFn(elm beam.T) (int, beam.T) {
    return rand.Int(), elm
}

// countLines records metrics for every line in a group and re-emits the line
// without its random key.
func countLines(ctx context.Context, _ int, lines func(*string) bool, emit func(string)) {
    var line string
    for lines(&line) {
        lineLen.Update(ctx, int64(len(line)))
        numberOfLines.Inc(ctx, 1)
        emit(line)
    }
}

// processLines simulates 10 seconds of work for every line in a group.
func processLines(ctx context.Context, _ int, lines func(*string) bool) {
    var line string
    for lines(&line) {
        time.Sleep(10 * time.Second)
        numberOfLinesProcess.Inc(ctx, 1)
    }
}

func CountLines(s beam.Scope, lines beam.PCollection) beam.PCollection {
    s = s.Scope("Count Lines")
    keyed := AddRandomKey(s, lines)
    grouped := beam.GroupByKey(s, keyed)

    return beam.ParDo(s, countLines, grouped)
}

func ProcessLines(s beam.Scope, lines beam.PCollection) {
    s = s.Scope("Process Lines")
    keyed := AddRandomKey(s, lines)
    grouped := beam.GroupByKey(s, keyed)

    beam.ParDo0(s, processLines, grouped)
}

func main() {
    // If beamx or Go flags are used, flags must be parsed first.
    flag.Parse()
    // beam.Init() is an initialization hook that must be called on startup. On
    // distributed runners, it is used to intercept control.
    beam.Init()

    // Input validation is done as usual. Note that it must be after Init().
    if *input == "" {
        log.Fatal(context.Background(), "No input file provided")
    }

    p := beam.NewPipeline()
    s := p.Root()

    l := textio.Read(s, *input)
    lines := CountLines(s, l)
    ProcessLines(s, lines)

    // The beamx.Run convenience wrapper allows a number of
    // pre-defined runners to be used via the --runner flag.
    if err := beamx.Run(context.Background(), p); err != nil {
        log.Fatalf(context.Background(), "Failed to execute job: %v", err.Error())
    }
}
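
One possible explanation for the behavior described above, offered as a hedged sketch rather than a confirmed diagnosis: in a batch pipeline, a GroupByKey cannot emit any group until it has seen all of its input, so the steps on either side of it run one after the other. The plain-Go simulation below does not use Beam at all. The names kv and runStages are hypothetical and exist only for this illustration, and numKeys is assumed to be greater than zero.

```go
package main

import "fmt"

// kv is a hypothetical keyed element, standing in for the (int, string)
// pairs produced by AddRandomKey in the pipeline above.
type kv struct {
	key   int
	value string
}

// runStages simulates read -> group -> process and returns the order in
// which events happen. The grouping loop must drain the whole input
// channel before any group is complete, which is the barrier effect.
// numKeys is assumed to be greater than zero.
func runStages(lines []string, numKeys int) []string {
	var events []string

	// Producer: plays the role of the read step.
	in := make(chan kv)
	go func() {
		defer close(in)
		for i, l := range lines {
			in <- kv{key: i % numKeys, value: l}
		}
	}()

	// Grouping: finishes only after the producer closes the channel.
	groups := make(map[int][]string)
	for e := range in {
		events = append(events, "read "+e.value)
		groups[e.key] = append(groups[e.key], e.value)
	}

	// Downstream step: starts only once every group is complete.
	for k := 0; k < numKeys; k++ {
		for _, v := range groups[k] {
			events = append(events, "process "+v)
		}
	}
	return events
}

func main() {
	for _, e := range runStages([]string{"a", "b", "c", "d"}, 2) {
		fmt.Println(e)
	}
}
```

In this simulation every "read" event is logged before the first "process" event. If the same effect applies here, the two GroupByKey calls in CountLines and ProcessLines would split the pipeline into stages that execute sequentially, even though each stage is parallel internally.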
