Selecting the first appearance of a value in a list (DISTINCT / GROUP BY)

I have a query that, using two JOINs, returns a list in this format:

unique_id | non_unique_id | timestamp

The full list is big (thousands of rows), but the query returns only a few dozen rows, because it has WHERE timestamp >= 'some timestamp in the past'.

So now I have a list like this:

89 | 286 | 1406219705
87 | 286 | 1406219518
79 | 922 | 1406216949
78 | 228 | 1406216871
77 | 126 | 1406216748
76 | 939 | 1406216722
74 | 126 | 1406216352
64 | 939 | 1406212540
63 | 126 | 1406212522
49 | 228 | 1406205715
48 | 228 | 1406204851
37 | 228 | 1406196435
32 | 228 | 1406190209
23 | 126 | 1406182577  <- 'limiting timestamp'
18 | 871 | 1406181991
10 | 922 | 1406178816
 9 | 764 | 1406178778
 7 | 609 | 1406178699
 5 | 126 | 1406177398
 4 | 871 | 1406177379  <- 'some timestamp in the past'

Now I only need the rows between the 'limiting timestamp' and the end of the list ('some timestamp in the past'). I could have specified the 'limiting timestamp' in the WHERE condition of the original query, but the problem is that the resulting set must not contain records whose non_unique_id has already appeared in the list above the 'limiting timestamp'. This is what the result of the query should look like:

                       <- 'limiting timestamp'
18 | 871 | 1406181991

 9 | 764 | 1406178778
 7 | 609 | 1406178699

                       <- 'some timestamp in the past'

So the query returns 3 rows, each with a non_unique_id that did not appear in the rows above the 'limiting timestamp'. If a non_unique_id appears more than once between the 'limiting timestamp' and 'some timestamp in the past', only the first occurrence should be kept. Note: this last condition is optional, as it is easy to remove the duplicates from the final list.
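To pin down the rule, here is a minimal reference sketch in plain Python over the sample rows above. It assumes that the row sitting exactly on the 'limiting timestamp' counts as part of the upper list, and that "first occurrence" means first in the newest-first listing; the function name is made up for illustration.

```python
# Sample rows from the question: (unique_id, non_unique_id, timestamp), newest first.
ROWS = [
    (89, 286, 1406219705), (87, 286, 1406219518), (79, 922, 1406216949),
    (78, 228, 1406216871), (77, 126, 1406216748), (76, 939, 1406216722),
    (74, 126, 1406216352), (64, 939, 1406212540), (63, 126, 1406212522),
    (49, 228, 1406205715), (48, 228, 1406204851), (37, 228, 1406196435),
    (32, 228, 1406190209), (23, 126, 1406182577), (18, 871, 1406181991),
    (10, 922, 1406178816), (9, 764, 1406178778), (7, 609, 1406178699),
    (5, 126, 1406177398), (4, 871, 1406177379),
]
LIMITING_TS = 1406182577  # the 'limiting timestamp' (row 23 in the sample)


def first_new_occurrences(rows, limiting_ts):
    """Keep rows older than limiting_ts whose non_unique_id has not been
    seen in any newer row. Hypothetical helper, not part of the original query."""
    seen = set()
    result = []
    # Walk the list newest first, exactly as it is printed in the question.
    for unique_id, non_unique_id, ts in sorted(rows, key=lambda r: r[2], reverse=True):
        if ts < limiting_ts and non_unique_id not in seen:
            result.append((unique_id, non_unique_id, ts))
        # Remember every id, whether or not the row was kept.
        seen.add(non_unique_id)
    return result


for row in first_new_occurrences(ROWS, LIMITING_TS):
    print(row)
```

On the sample data this keeps rows 18, 9 and 7, the same three rows as the expected result above.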

So far, the only solution I have come up with is a JOIN between the list >= 'some timestamp in the past' and the list > 'limiting timestamp'. That would show whether any values from the top list occur in the bottom list. However, assume that the query is complex; the time needed to produce its results shouldn't be doubled by running it again with a slightly different condition.
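One way to avoid writing the base query twice might be a window function: take the newest row per non_unique_id and keep it only if it is older than the 'limiting timestamp'. The sketch below is an assumption-laden illustration, not the asker's real query: the table `events`, the column `ts` and the CTE `base` are made-up stand-ins for the two-JOIN query, SQLite (3.25+) stands in for the real database, and window functions need MySQL 8.0 or later. Whether the engine really evaluates `base` only once depends on its optimizer.

```python
import sqlite3

# Hypothetical table standing in for the result of the original two-JOIN query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (unique_id INTEGER PRIMARY KEY, non_unique_id INTEGER, ts INTEGER)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (89, 286, 1406219705), (87, 286, 1406219518), (79, 922, 1406216949),
    (78, 228, 1406216871), (77, 126, 1406216748), (76, 939, 1406216722),
    (74, 126, 1406216352), (64, 939, 1406212540), (63, 126, 1406212522),
    (49, 228, 1406205715), (48, 228, 1406204851), (37, 228, 1406196435),
    (32, 228, 1406190209), (23, 126, 1406182577), (18, 871, 1406181991),
    (10, 922, 1406178816), (9, 764, 1406178778), (7, 609, 1406178699),
    (5, 126, 1406177398), (4, 871, 1406177379),
])

QUERY = """
WITH base AS (                      -- the expensive query, written once
    SELECT unique_id, non_unique_id, ts
    FROM events
    WHERE ts >= :past_ts
),
ranked AS (
    SELECT base.*,
           ROW_NUMBER() OVER (PARTITION BY non_unique_id ORDER BY ts DESC) AS rn
    FROM base
)
SELECT unique_id, non_unique_id, ts
FROM ranked
WHERE rn = 1                        -- newest row for each non_unique_id
  AND ts < :limiting_ts             -- ...and that row lies below the limit
ORDER BY ts DESC
"""

rows = conn.execute(
    QUERY, {"past_ts": 1406177379, "limiting_ts": 1406182577}
).fetchall()
for row in rows:
    print(row)
```

If the newest row of an id is below the limit, the id cannot have appeared above it, so the single `rn = 1` filter covers both the exclusion and the optional de-duplication.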

doumu2172 2014-07-24 18:31
You can try this if you are on SQL Server 2008+ (sqlfiddle: http://sqlfiddle.com/#!3/0bc33/3):

    WITH cteOrdered AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY t1.Non_Unique_ID ORDER BY t1.Timestamp) AS RID,
               t1.*
        FROM Table1 t1
        LEFT JOIN (
            SELECT Non_Unique_ID
            FROM Table1
            WHERE Timestamp < 1406177379 OR Timestamp > 1406182577
        ) t2 ON t1.Non_Unique_ID = t2.Non_Unique_ID
        WHERE t2.Non_Unique_ID IS NULL
          AND t1.Timestamp > 1406177379
          AND t1.Timestamp < 1406182577
    )
    SELECT Unique_ID, Non_Unique_ID, Timestamp
    FROM cteOrdered
    WHERE RID = 1;

I've added another row to the data

(18, 871, 1406181990),

to see if the query was producing what you want. You said that if there are duplicate non_unique_ids within the search range, only the "first" occurrence should be kept. I take it this is the one with the EARLIEST timestamp? If it is the opposite, you can change this line

SELECT ROW_NUMBER() OVER (PARTITION BY t1.Non_Unique_ID ORDER BY t1.Timestamp) AS RID,

to

SELECT ROW_NUMBER() OVER (PARTITION BY t1.Non_Unique_ID ORDER BY t1.Timestamp DESC) AS RID,

and that will flip the order to retain the LATEST timestamp for the duplicates.
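To see the two orderings side by side, the query can be run against the sample data plus the extra row. This is only a sketch: SQLite (3.25+) stands in for SQL Server, the `{direction}` placeholder and the `run` helper are made up for the demo, and the outer ORDER BY is added just to make the output order deterministic.

```python
import sqlite3

# Sample data from the question plus the answer's extra row (18, 871, 1406181990).
# Table1 and its column names follow the answer; no primary key, since 18 repeats.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Unique_ID INTEGER, Non_Unique_ID INTEGER, Timestamp INTEGER)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", [
    (89, 286, 1406219705), (87, 286, 1406219518), (79, 922, 1406216949),
    (78, 228, 1406216871), (77, 126, 1406216748), (76, 939, 1406216722),
    (74, 126, 1406216352), (64, 939, 1406212540), (63, 126, 1406212522),
    (49, 228, 1406205715), (48, 228, 1406204851), (37, 228, 1406196435),
    (32, 228, 1406190209), (23, 126, 1406182577), (18, 871, 1406181991),
    (18, 871, 1406181990),  # extra row added in the answer
    (10, 922, 1406178816), (9, 764, 1406178778), (7, 609, 1406178699),
    (5, 126, 1406177398), (4, 871, 1406177379),
])

ANSWER_QUERY = """
WITH cteOrdered AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY t1.Non_Unique_ID
               ORDER BY t1.Timestamp {direction}
           ) AS RID,
           t1.*
    FROM Table1 t1
    LEFT JOIN (
        SELECT Non_Unique_ID
        FROM Table1
        WHERE Timestamp < 1406177379 OR Timestamp > 1406182577
    ) t2 ON t1.Non_Unique_ID = t2.Non_Unique_ID
    WHERE t2.Non_Unique_ID IS NULL
      AND t1.Timestamp > 1406177379
      AND t1.Timestamp < 1406182577
)
SELECT Unique_ID, Non_Unique_ID, Timestamp
FROM cteOrdered
WHERE RID = 1
ORDER BY Timestamp DESC
"""


def run(direction):
    """Run the answer's query with ORDER BY ... ASC or DESC inside the window."""
    if direction not in ("ASC", "DESC"):  # whitelist, since the value is formatted into SQL
        raise ValueError(direction)
    return conn.execute(ANSWER_QUERY.format(direction=direction)).fetchall()


earliest = run("ASC")   # keeps the EARLIEST duplicate per Non_Unique_ID
latest = run("DESC")    # keeps the LATEST duplicate per Non_Unique_ID
print(earliest)
print(latest)
```

Only the row kept for 871 differs between the two runs. Note that both range bounds are exclusive in this query, so a row sitting exactly on either timestamp is neither returned nor used for exclusion.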
(This answer was selected as the best answer by the asker.)

Tags: mysql, php, sql. Asked 2014-07-24 17:08.
