V2EX = way to explore
V2EX 是一个关于分享和探索的地方
现在注册
已注册用户请  登录
Distand
V2EX  ›  程序员

sql 求助,果然只会 mysql 的写 hive 要命

  •  
  •   Distand · 2022-05-16 20:36:33 +08:00 · 2228 次点击
    这是一个创建于 906 天前的主题,其中的信息可能已经有所发展或是发生改变。

    数据开发跑路了,只能自己去 hive 导数据,需求是:

    1. xfrom='sugg'的全要

    2. 其它 xfrom 的按 sub_region 分组根据 pubtime 一个月前取 xx 条,一个月内取 xx 条,都要按 click 倒序排

    3. 汇总到一起

    渣渣业务开发只能写出这种 sql 了,跑了下执行一次要 49 块钱,求个优化🙏

    WITH total AS (
    	SELECT * FROM xx.xx WHERE log_date='20220515'
    ),
    other AS (
    	SELECT * FROM total WHERE xfrom != 'sugg'
    ),
    final AS (
      SELECT * FROM total WHERE xfrom='sugg' 
      UNION ALL
      SELECT * FROM (SELECT * FROM other WHERE sub_region='a' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 1000) a1
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='a' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 1000) a2
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='b' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 1000) b1
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='b' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 1000) b2
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='c' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 250) c1
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='c' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 250) c2
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='d' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 500) d1
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='d' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 500) d2
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='e' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 500) e1
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='e' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 500) e2
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='f' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 50) f1
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='f' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 50) f2
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='g' AND pubtime < '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 50) g1
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='g' AND pubtime >= '${yyyyMMdd,-1m}' ORDER BY click DESC LIMIT 50) g2
      UNION ALL 
      -- 这个 sub_region 有特殊要求,其它都是一样的条件,只是 limit 数量不同
      SELECT * FROM (SELECT * FROM other WHERE sub_region='h' AND pubtime < '${yyyyMMdd,-1m}' AND (tag LIKE '%th%' OR tag LIKE '%ph%') ORDER BY click DESC LIMIT 150) h1
      UNION ALL 
      SELECT * FROM (SELECT * FROM other WHERE sub_region='h' AND pubtime >= '${yyyyMMdd,-1m}' AND (tag LIKE '%th%' OR tag LIKE '%ph%') ORDER BY click DESC LIMIT 150) h2
    )
    INSERT OVERWRITE TABLE yy.yy SELECT * FROM final;
    
    9 条回复    2022-05-20 14:51:56 +08:00
    efaun
        1
    efaun  
       2022-05-16 21:07:32 +08:00
    看到第二条说什么分组什么取多少条, 你看看分区函数适不适合你
    https://www.cnblogs.com/linJie1930906722/p/6036053.html
    chineselittleboy
        2
    chineselittleboy  
       2022-05-16 21:55:15 +08:00 via Android
    最里面的 select * 改成想要的一些字段能便宜点
    zhusimaji
        3
    zhusimaji  
       2022-05-16 23:25:58 +08:00
    关于取数分组的语句可以参考 select * from (select XXX, row_number() over (partition by sub_region order by click desc ) as rank from XXX)a where rank <1000
    zhusimaji
        4
    zhusimaji  
       2022-05-16 23:26:46 +08:00
    所以后续一个月内和一个月外只要 union 一下就可以了,应该就可以输出结果了
    FYFX
        5
    FYFX  
       2022-05-16 23:30:44 +08:00
    你下面那一堆 sql 可以根据 pubtime 分成两部分,然后用窗口函数 partition by sub_region order by click 算一下就行了吧,而且你这只取一天数据要 49 块钱是不是有点离谱
    lyang
        6
    lyang  
       2022-05-17 08:36:42 +08:00
    49 块钱是什么意思
    Distand
        7
    Distand  
    OP
       2022-05-17 09:48:25 +08:00
    感谢各位老哥,先 case when pubtime < '${yyyyMMdd,-1m}' then 0 else 1 end as pt ,再 row_number() over(partition by sub_region,pt order by click desc) as rank 后只需要 2 块钱了
    512357301
        8
    512357301  
       2022-05-17 13:15:21 +08:00 via Android
    这是一条 SQL 就搞定的吧(通过辅助列判断需要哪些数据,需要用到开窗函数,然后 where 就行了,怎么你这搞了那么多的 union all ,hive 执行 union all 效率很低的
    shuianfendi6
        9
    shuianfendi6  
       2022-05-20 14:51:56 +08:00
    用 partition 和 row_number 吧

    spark-sql 执行效率会好不少
    关于   ·   帮助文档   ·   博客   ·   API   ·   FAQ   ·   实用小工具   ·   2884 人在线   最高记录 6679   ·     Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 · 30ms · UTC 11:10 · PVG 19:10 · LAX 03:10 · JFK 06:10
    Developed with CodeLauncher
    ♥ Do have faith in what you're doing.