묻고 답해요
141만명의 커뮤니티!! 함께 토론해봐요.
인프런 TOP Writers
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
윈도우 함수 연습 문제
윈도우함수(탐색함수) 연습문제연습문제1-- 문제 1) user들의 다음 접속 월과 다다음 접속 월을 구하는 쿼리를 작성해주세요 select user_id, visit_month, lead(visit_month) over (partition by user_id order by visit_month asc) as after_visit_month, lead(visit_month, 2) over (partition by user_id order by visit_month asc) as after_visit_month from `advanced.analytics_function_01` order by user_id;연습문제2-- 문제2) user들의 다음 접속 월과 다다음 접속 월, 이전 접속 월을 구하는 쿼리를 작성해주세요. select *, lead(visit_month) over (partition by user_id order by visit_month asc) as after_visit_month, lead(visit_month, 2) over (partition by user_id order by visit_month asc) as after_two_visit_month, lag(visit_month) over(partition by user_id order by visit_month asc) as before_visit_month from `advanced.analytics_function_01` order by user_id, visit_month;연습문제3 -- 3번 유저가 접속했을 때 다음 접속까지의 간격을 구하시오 select *, after_visit_month - visit_month as diff from( select *, lead(visit_month, 1) over (partition by user_id order by visit_month) as after_visit_month from `advanced.analytics_function_01` ) order by user_id, visit_month;윈도우 함수 frame 연습문제SELECT * , SUM(amount) OVER(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS amount_total , SUM(amount) OVER(ORDER BY order_id) AS cumulative_sum , SUM(amount) OVER(PARTITION BY user_id ORDER BY order_id) AS cumulative_sum_by_user , AVG(amount) OVER(ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_orders_avg_amount FROM `advanced.orders` ORDER BY order_id 윈도우 함수 연습문제연습문제1-- 1. 사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성해주세요 단, group by를 사용해서 집계하는것이 아닌 quary_log의 데이터의 우측에 새로운 컬럼을 작성해주세요 select *, count(query_date) over (partition by user) as total_query_cnt from `advanced.query_logs` order by user,query_date;연습문제2-- 2. 주차별로 팀 내에서 쿼리를 많이 실행한수를 구한후 실행 수를 활요해 랭킹을 구해주세요. 단 랭킹이 1등인 사람만 보여주세요. with query_cnt_by_team AS ( select extract(week FROM query_date) as week_number, team, user, count(user) as query_cnt from `advanced.query_logs` group by all ) select *, rank() over(partition by week_number, team order by query_cnt desc) AS rk from query_cnt_by_team qualify rk = 1 order by week_number, team, query_cnt desc; 연습문제3-- 3. (2번문제에서 사용한 주차별 쿼리 사용) 쿼리를 실행한 시점 기준 1주 전에 쿼리 실행 수를 별도의 컬럼으로 확인할수 있는 쿼리를 작성해주세요. with query_cnt_by_team AS ( select extract(week FROM query_date) as week_number, team, user, count(user) as query_cnt from `advanced.query_logs` group by all ) select *, lag(query_cnt, 1) over(partition by user order by week_number) as prev_week_query_cnt from query_cnt_by_team ; 연습문제4-- 4. 시간에 흐름에 따라 일자별로 유저가 실행한 누적 쿼리 수를 작성해주세요(query_date) select *, sum(query_cnt) over(partition by user order by query_date) as cumulative_sum, sum(query_cnt) over(partition by user order by query_date rows between unbounded preceding and current row) as cumulative_sum2 from( select query_date, team, user, count(user) as query_cnt from `advanced.query_logs` group by all ) order by user, query_date;연습문제5-- 5. 다음 데이터는 주문 횟수를 나타낸 데이터입니다. 만약 주문 횟수가 없으면 null로 기록됩니다.이런 데이터에서 null값이라고 되어 있는 부분은 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해주세요 WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) select *, last_value(number_of_orders ignore nulls) over(order by date) as last_value_orders from raw_data;연습문제6-- 6. 5번 문제에서 null을 채운후 2일전 ~ 현재의 데이터의 평균을 구하는 쿼리를 작성해주세요(이동평균) WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) , filled_data as( select * except(number_of_orders), last_value(number_of_orders ignore nulls) over(order by date) as number_of_orders from raw_data ) select *, avg(number_of_orders) over (order by date rows between 2 preceding and current row) as moving_avg from filled_data연습문제7-- 7) app_logs 테이블에서 custom session을 만들어 주세요 이전 이벤트 로그와 20초가 지나면 새로운 session을 만들어 주세요 session 숫자로 (1,2,3 ...)표시해도 됩니다. -- 2022-08-18일의 user_pseudo_id(1997494153.8491999091)은 session_id가 4까지 나옵니다. with base as( select event_date, datetime(timestamp_micros(event_timestamp), 'Asia/Seoul') as event_datetime, event_name, user_id, user_pseudo_id from advanced.app_logs where event_date = "2022-08-18" and user_pseudo_id = "1997494153.8491999091" ), diff_Data as( select *, from( select *, lag(event_datetime, 1) over(partition by user_pseudo_id order by event_datetime) as prev_event_datetime from base ) ) select *, sum(session_start) over (partition by user_pseudo_id order by event_datetime) as session_num from( select *, case when prev_event_datetime is null then 1 when second_diff >= 20 then 1 else 0 end as session_start from diff_data ) order by event_datetime
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습문제
윈도우 함수의 탐색 함수 : LEAD, LAG, FIRST_VALUE, LAST_VALUE## 문제1) user들의 다음 접속 월과 다다음 접속 월을 구하는 쿼리를 작성해주세요. -- SELECT -- user_id -- , visit_month -- , LEAD(visit_month, 1) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_month -- , LEAD(visit_month, 2) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_next_month -- FROM `advanced.analytics_function_01` -- ORDER BY user_id ## 문제2) user들의 다음 접속 월과 다다음 접속 월,이전 접속 월을 구하는 쿼리를 작성해주세요. -- LAG 함수를 사용할 때 NULL이 나온다 => 그 값은 처음이다! -- LEAD 함수를 사용할 때 NULL이 나온다 => 그 값은 마지막이다! -- SELECT -- user_id -- , visit_month -- , LEAD(visit_month, 1) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_month -- , LEAD(visit_month, 2) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_next_month -- , LAG(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS before_month -- FROM `advanced.analytics_function_01` -- ORDER BY user_id ## 3번 : 유저가 접속했을 때, 다음 접속까지의 간격을 구하시오 -- user_id | visit_month | after_visit_month | diff_month -- SELECT -- user_id -- , visit_month -- , LEAD(visit_month, 1) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_month -- , LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) - visit_month AS diff_month -- FROM `advanced.analytics_function_01` -- ORDER BY user_id ## 윈도우 함수를 이렇게 쓰는게 좋을까? => 중복된 쿼리는 줄이는 것이 좋을 수 있음 -- 쿼리를 수정할 상황이 생김 => 2번 수정 => 굉장히 많아지면 복잡해지고, 실수하기 좋음 -- 쿼리가 길어지는 것을 무서워하지 말고, 쿼리를 덜 수정할 수 있는 구조를 만들면 좋겠다! -- 윈도우 함수 쓰다보면 쿼리 줄이 길어짐. 감안하고 사용하면 좋겠다 -- -- 그래서 서브쿼리로 만들어보면,, -- SELECT -- * -- , (after_month - visit_month) AS diff_month -- FROM ( -- SELECT -- * -- , LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_month -- FROM `advanced.analytics_function_01` -- ) -- ORDER BY user_id ## 추가 문제 : 이 데이터셋을 기준으로 user_id의 첫번째 방문 월, 마지막 방문 월을 구하는 쿼리를 작성해주세요 SELECT user_id , visit_month , FIRST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS first_visit , LAST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS last_visit FROM `advanced.analytics_function_01` ORDER BY user_id, visit_month윈도우 함수 Frame 연습문제-- amount_total : 전체 SUM -- cumulative_sum : row 시점에 누적 SUM -- cumulative_sum_by_user : row 시점에 유저별 누적 SUM -- last_5_orders_avg_amount : order_id 기준으로 정렬하고, 직전 5개의 주문의 평균 amount -- 집계분석함수() OVER(PARTITION BY ~~~ ORDER BY ~~~ ROWS BETWEEN A and B) SELECT * , SUM(amount) OVER() AS amount_total ## OVER 안에 아무것도 들어가지 않는 경우도 있구나! , SUM(amount) OVER(ORDER BY order_id) AS cumulative_sum , SUM(amount) OVER(PARTITION BY user_id ORDER BY order_id) AS cumulative_sum_by_user , AVG(amount) OVER(ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_orders_avg_amount -- , AVG(amount) OVER(ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) -- , AVG(amount) OVER(ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) FROM advanced.orders ORDER BY order_id 윈도우 함수의 연습문제## 윈도우 함수 연습문제 ## 1) 사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성해주세요. 단, GROUP BY를 사용해서 집계하는 것이 아닌 query_logs의 데이터의 우측에 새로운 컬럼을 만들어주세요. -- 사용자별 쿼리를 실행한 총 횟수 : COUNT() 전체 실행. -- OVER(PARTITION BY user) -- SELECT -- * -- , COUNT(query_date) OVER(PARTITION BY user) AS total_query_cnt -- FROM `advanced.query_logs` -- ORDER BY query_date ## 2) 주차별로 팀 내에서 쿼리를 많이 실행한 수를 구한 후, 실행한 수를 활용해 랭킹을 구해주세요. 단, 랭킹이 1등인 사람만 결과가 보이도록 해주세요. -- 주차별로 개인당 실행한 쿼리 횟수 -- 위 쿼리 횟수를 기반으로 랭킹 -- 랭킹을 기반으로 필터링(랭킹=1) -- 문제의 의도 : 원본 데이터 => 1 row마다 데이터가 있고, 그걸 집계해서 사용. GROUP BY => 윈도우 함수 -- SELECT -- * -- , RANK() OVER(PARTITION BY week_number, team ORDER BY query_cnt DESC) AS team_rank -- FROM ( -- SELECT -- EXTRACT(week from query_date) AS week_number, -- team, -- user, -- COUNT(query_date) AS query_cnt -- FROM `advanced.query_logs` -- GROUP BY ALL -- ) -- QUALIFY team_rank = 1 -- ORDER BY week_number, team -- ## ## WITH 문 풀이 -- WITH query_cnt_by_team AS ( -- SELECT -- EXTRACT(week from query_date) AS week_number, -- team, -- user, -- COUNT(query_date) AS query_cnt -- FROM `advanced.query_logs` -- GROUP BY ALL -- ) -- SELECT -- *, -- RANK() OVER(PARTITION BY week_number, team ORDER BY query_cnt DESC) AS rk -- FROM query_cnt_by_team -- -- QUALIFY : 윈도우 함수의 조건을 설정할 때 사용 -- QUALIFY rk = 1 -- ORDER BY week_number, team, query_cnt DESC -- -- 16주차 - AI팀의 케이피, 16주차 - 코칭팀의 카일, 16주차 - 데이터 사이언스팀의 샘 ## 3)(2번 문제에서 사용한 주차별 쿼리 사용) 쿼리를 실행한 시점 기준 1주 전에 쿼리 실행 수를 별도의 컬럼으로 확인할 수 있는 쿼리를 작성해주세요 -- 1주 전의 쿼리 실행 수 => LAG -- WITH query_cnt_by_team AS ( -- SELECT -- EXTRACT(week from query_date) AS week_number, -- team, -- user, -- COUNT(query_date) AS query_cnt -- FROM `advanced.query_logs` -- GROUP BY ALL -- ) -- SELECT -- user, -- team, -- week_number, -- query_cnt, -- LAG(query_cnt) OVER(PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt -- FROM query_cnt_by_team -- ORDER BY user -- -- ans T) -- WITH query_cnt_by_team AS ( -- SELECT -- EXTRACT(week from query_date) AS week_number, -- team, -- user, -- COUNT(query_date) AS query_cnt -- FROM `advanced.query_logs` -- GROUP BY ALL -- ) -- SELECT -- *, -- LAG(query_cnt, 1) OVER(PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt -- FROM query_cnt_by_team ## 4) 시간의 흐름에 따라, 일자별로 유저가 실행한 누적 쿼리 수를 작성해주세요. -- 누적 쿼리 : 과거의 시간(UNBOUNDED PRECEDING)부터 current row까지 -- 출제 의도 : Default Frame에 대해 알려드리고 싶었음. -- For aggregate analytic functions, if the ORDER BY clause is present but the window frame clause is not, the following window frame clause is used by default: -- SELECT -- *, -- SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date) AS cumulative_sum -- FROM ( -- SELECT -- user, -- team, -- query_date, -- COUNT(query_date) AS query_cnt -- FROM `advanced.query_logs` -- GROUP BY ALL -- ) -- -- ans T) -- SELECT -- *, -- SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date) AS cumulative_sum, -- SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum2 -- -- Frame의 Default 값 : UNBOUNDED PRECEDING ~ CURRENDT ROW -- FROM ( -- SELECT -- query_date, -- team, -- user, -- COUNT(user) AS query_cnt -- FROM `advanced.query_logs` -- GROUP BY ALL -- ) -- -- QUALIFY cumulative_sum != cumulative_sum2 -- -- -- WHERE, QUALIFY 조건 설정해서 2가지 값이 모두 같은지 비교 => 모두 같으면 != 연산 결과에 반환하는 값이 없을 -- ORDER BY user, query_date ## 5) 다음 데이터는 주문횟수를 나타낸 데이터입니다. 만약 주문 횟수가 없으면 NULL로 기록됩니다. 이런 데이터에서 NULL값이라고 되어있는 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해주세요. -- LAG로 직전 값 가져오면 되지 않을까? -- number_of_orders가 null 이면, before_number_of_orders를 가져와라! -- 아래 쿼리는 어려운 방법 -- 그 다음 방법 : LAST VALUE를 쓰자! => 값이 없으면 NULL이 뜬다 ! -- FIRST_VALUE, LAST_VALUE => NULL 을 포함해서 연산 -- 출제 의도 : NULL 을 제외해서 연산하고 싶으면 IGNORE NULLS을 쓰면 된다 ! -- WITH raw_data AS ( -- SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL -- SELECT DATE '2024-05-02', 13 UNION ALL -- SELECT DATE '2024-05-03', NULL UNION ALL -- SELECT DATE '2024-05-04', 16 UNION ALL -- SELECT DATE '2024-05-05', NULL UNION ALL -- SELECT DATE '2024-05-06', 18 UNION ALL -- SELECT DATE '2024-05-07', 20 UNION ALL -- SELECT DATE '2024-05-08', NULL UNION ALL -- SELECT DATE '2024-05-09', 13 UNION ALL -- SELECT DATE '2024-05-10', 14 UNION ALL -- SELECT DATE '2024-05-11', NULL UNION ALL -- SELECT DATE '2024-05-12', NULL -- ), -- -- SELECT -- -- date, -- -- IFNULL(number_of_orders, last_value_orders) AS numbers_of_orders -- -- FROM ( -- -- SELECT -- -- *, -- -- -- LAG(number_of_orders) OVER(ORDER BY date) AS prev_orders, ## 마지막 값 NULL !! -- -- LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS last_value_orders -- -- FROM raw_data -- -- ) -- -- -- ans T) -- filled_data AS( -- SELECT -- * EXCEPT(number_of_orders), -- LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS number_of_orders -- FROM raw_data -- ) -- ## 6) 5번 문제에서 NULL을 채운 후, 2일 전 ~ 현재 데이터의 평균을 구하는 쿼리를 작성해주세요 (이동평균) -- -- Frame : 2일 전 => BTWEEN 2 PRECEDING AND CURRENT ROW -- -- 출제 의도 : Frame을 지정할 수 있는가? -- SELECT -- *, -- AVG(number_of_orders) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg -- FROM filled_data ## 7) app_logs 테이블에서 CustomSession을 만들어주세요. 이전 이벤트 로그와 20초가 지나면 새로운 Session을 만들어주세요. Session은 숫자로(1,2,3…) 표시해도 됩니다. ## 2022-08-18일의 user_pseudo_id(1997494153.8491999091)은 session_id가 4까지 나옵니다 WITH base as ( SELECT event_date, event_timestamp, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') as event_datetime, event_name, user_id, user_pseudo_id FROM advanced.app_logs WHERE event_date = '2022-08-18' and user_pseudo_id = '1997494153.8491999091' ), base2 as ( SELECT *, LAG(event_datetime, 1) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) as before_event_datetime FROM base ) SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp) as session_id FROM ( SELECT *, IF(second_diff is NULL or second_diff > 20, 1, NULL) as session_start FROM ( SELECT *, DATETIME_DIFF(event_datetime, before_event_datetime, second) as second_diff FROM base2 ) ) ORDER BY event_timestamp
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습 문제
1. 탐색함수 연습문제 문제 1. user들의 다음 접속 월과 다다음 접속 월을 구하는 쿼리SELECT user_id, visit_month, lead(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS lead_visit_month, lead(visit_month,2) OVER (PARTITION BY user_id ORDER BY visit_month) AS lead_visit_month2 FROM `advanced.analytics_function_01` ORDER BY user_id문제2. user들의 다음 접속 월과 다다음 접속 월, 이전 접속 월을 구하는 쿼리 SELECT user_id, visit_month, lead(visit_month,1) OVER (PARTITION BY user_id ORDER BY visit_month) AS lead_visit_month, lead(visit_month,2) OVER (PARTITION BY user_id ORDER BY visit_month) AS lead_visit_month2, lag(visit_month,1) OVER (PARTITION BY user_id ORDER BY visit_month) AS lag_visit_month FROM `advanced.analytics_function_01` ORDER BY user_id, visit_month # LAG 함수를 사용할 때 NULL 이면 그 값은 처음, # LEAD 함수를 사용할 때 NULL 이면 그 값은 마지막 문제3. 유저가 접속했을 때, 다음 접속까지의 간격을 구하기SELECT user_id, visit_month, lead(visit_month,1) OVER (PARTITION BY user_id ORDER BY visit_month) AS lead_visit_month, ((lead(visit_month,1) OVER (PARTITION BY user_id ORDER BY visit_month) ) - visit_month) AS diff_month # 별칭쓴거는 select 안에서 못함 FROM `advanced.analytics_function_01` ORDER BY user_id, visit_month # 이 경우에는 쿼리 많은경우에 수정할 경우 헷갈리기 시작한다. 따라서 서브쿼리로 묶어서 하면 더 편하고 실수 적어짐 # 쿼리 길어진다고 해도 무서워하지 말고 쿼리 덜 수정할 수 있는 구조를 만들자. # 윈도우 함수 쓰면 줄이 쿼리 길어짐. 감안하고 쓰자.문제 4. 이 데이터셋을 기준으로 user_id의 첫번째 접근 월을 구하는 쿼리를 작성하기# 마지막 추가 문제 SELECT *, FIRST_VALUE(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) AS First_Value, LAST_VALUE(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) AS Last_Value FROM ( SELECT *, lead(visit_month,1) OVER (PARTITION BY user_id ORDER BY visit_month) AS lead_visit_month FROM `advanced.analytics_function_01` ORDER BY user_id, visit_month ) ORDER BY user_id, visit_month 2. Frame 연습문제 문제 . 회사의 모든 주문량, 누적 주문량, 최근 직전5개 평균 주문량 구하기 SELECT * , SUM(amount) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS amount_total, # OVER 안에 아무것도 없으면 전체 출력이다! SUM(amount) OVER (ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum, SUM(amount) OVER (PARTITION BY user_id ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum_by_user, AVG(amount) OVER (ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_avg order by order_id 3. 연습문제문제 1. 사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성하기. 단, GROUP BY를 사용해서 집계하는 것이 아닌 query_logs의 데이터의 우측에 새로운 컬럼을 만들기SELECT *, COUNT(user) OVER (PARTITION BY user ) AS cnt FROM advanced.query_logs ORDER BY query_date, user문제 2. 주차별로 팀 내에서 쿼리를 많이 실행한 수를 구한 후, 실행한 수를 활용해 랭킹을 구하기. 단, 랭킹이 1등인 사람만 결과가 보이도록 하기 # query_date를 바탕으로 주차별로 구분하여 WITH함수로 묶기 WITH week_number_table AS ( SELECT *, EXTRACT(WEEK FROM query_date) AS week_number, # 하루에 유저별 쿼리 이용수 추출 COUNT(*) AS query_cnt, FROM advanced.query_logs GROUP BY ALL ) # 팀 내에서 유저별로 랭킹 구하기 SELECT week_number, team, query_date, query_cnt, RANK() OVER (PARTITION BY week_number,team ORDER BY query_cnt DESC) AS team_rank FROM week_number_table # 1등인 유저만 출력 QUALIFY team_rank = 1 ORDER BY week_number, team 문제 3. (2번 문제에서 사용한 주차별 쿼리 사용) 쿼리를 실행한 시점 기준 1주 전에 쿼리 실행 수를 별도의 컬럼으로 확인할 수 있는 쿼리를 작성하기# 2번쿼리 WITH week_number_table AS ( SELECT *, EXTRACT(WEEK FROM query_date) AS week_number, COUNT(*) AS query_cnt, FROM advanced.query_logs GROUP BY ALL ) SELECT *, # 1주전의 실행수 이므로 LAG를 이용해서 전 week_number의 실행수 구하기 LAG(query_cnt, 1) OVER(PARTITION BY user ORDER BY week_number) AS prev_week_query_count FROM week_number_table 문제 4. 시간의 흐름에 따라 일자별로 유저가 실행한 누적 쿼리 수를 작성하기# 시간의 흐름에 따라 일자별로 쿼리수 묶기 WITH user_query AS ( SELECT *, COUNT(*) AS query_count FROM advanced.query_logs GROUP BY ALL ) # 유저별로 누적 쿼리수 구하기 SELECT *, SUM(query_count) OVER (PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_query_count FROM user_query ORDER BY user # Frame 의 Default 값은 UNBOUNDED PRECEDING ~ CURRENT ROW 이다 # 데이터 정합성 확인 할때 2가지 값이 모두 같은지 비교하면 편리하다 ( WHERE, QUALIFY절에 활용 ) 문제 5. 만약 주문 횟수가 없으면 NULL로 기록됨. 이런 데이터에서 NULL값이라고 되어 있는 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성하기WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) # CASE문을 이용해서 주문량 NULL 값이면 전 날짜의 주문량으로 대치 SELECT *, (CASE WHEN number_of_orders IS NULL THEN LAG(number_of_orders,1) OVER (ORDER BY date) ELSE number_of_orders END ) AS number_of_orders2 FROM raw_data # BUT 맨 마지막값 NULL이 연속2번이라 NULL이 나옴 # LAST_VALUE의 IGNORE NULLS 쓰기 SELECT *, LAST_VALUE(number_of_orders IGNORE NULLS) OVER (ORDER BY date) AS number_of_orders2 FROM raw_data 문제 6. 5번 문제에서 NULL을 채운 후, 2일전 ~ 현재 데이터의 평균을 구하는 쿼리를 작성하기(이동평균)SELECT date, number_of_orders2, AVG(number_of_orders2) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_agv # NULL값 처리한 테이블 서브쿼리로 묶은 다음에 윈도우 함수써서 2틀전~현재 평균 출력하기 FROM ( SELECT *, LAST_VALUE(number_of_orders IGNORE NULLS) OVER (ORDER BY date) AS number_of_orders2, FROM raw_data ) 문제 7. app_logs 테이블에서 custom session을 만들기. 이전 이벤트 로그와 20초가 지나면 새로운 session을 만들기. session은 숫자로 (1, 2, 3 …) 표시해도 됨.WITH START AS ( SELECT event_date, # timestamp 를 서울 표준 시간으로 바꾸기 DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id FROM advanced.app_logs WHERE event_date = '2022-08-16' ), # 기존 시간의 차이 구하기 DIFF_DATE AS ( SELECT *, DATETIME_DIFF(event_datetime, before_event_datetime, SECOND) AS second_diff FROM ( SELECT # 전(LAG) 시간을 불러오기 위해 서브쿼리로 묶음 *, LAG(event_datetime, 1 ) OVER ( PARTITION BY user_pseudo_id ORDER BY event_datetime) AS before_event_datetime FROM START) ORDER BY event_datetime) # session_start를 누적합 이용하여 session_number 구하 SELECT *, SUM(session_start) OVER (PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_number FROM ( SELECT *, # CASE문을 써서 처음 시작부분 (NULL) 1로 바꾸기 (CASE WHEN second_diff IS NULL THEN 1 # second_diff 가 20초 이상이면 1 아니면 0 WHEN second_diff >= 20 THEN 1 ELSE 0 END) AS session_start FROM DIFF_DATE)
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[인프런 빅쿼리 빠짝스터디 2주차] 윈도우 함수, FRAME설정, QUALITY
윈도우 탐색 함수 연습문제(1) 연습문제 1-- 문제 1) USER의 다음 접속월, 다다음 접속 월 SELECT user_id, visit_month, LEAD(visit_month,1) OVER(PARTITION BY user_id ORDER BY visit_month) AS next_month, LEAD(visit_month,2) OVER(PARTITION BY user_id ORDER BY visit_month) AS the_month_after_next FROM `avdanced.analytics_function_01` (2) 연습문제 2-- 문제 2) USER의 다음 접속월, 다다음 접속 월, 이전 접속 월 SELECT user_id, visit_month, LEAD(visit_month,1) OVER(PARTITION BY user_id ORDER BY visit_month) AS next_month, LEAD(visit_month,2) OVER(PARTITION BY user_id ORDER BY visit_month) AS the_month_after_next, LAG(visit_month,1) OVER(PARTITION BY user_id ORDER BY visit_month) AS last_month FROM `avdanced.analytics_function_01` 윈도우 함수 FRAME 연습문제연습문제 (1~4)SELECT -- 1)모든 주문량 SUM(amount) OVER() AS amount_total, -- 2)특정주문시점에서 누적주문량 #SUM(amount) OVER(partition by order_date) AS cumulative_sum, SUM(amount) OVER (ORDER BY order_date) AS cumulative_sum, -- 3)고객별 주문 시점에서 누적 주문량 #SUM(amount) OVER(partition by user_id) AS cumulative_sum_by_user, SUM(amount) OVER(partition by user_id ORDER BY order_id) AS cumulative_sum_by_user, -- 4) 최근 직전 5개 평균 주문량 AVG(amount) OVER(ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_orders_avg_amount, AVG(amount) OVER(ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS last_5_unbounded_orders_avg_amount, AVG(amount) OVER(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS all_orders_avg_amount FROM `avdanced.orders` 윈도우 함수(1) 연습문제 1-- 연습문제1) 사용자별 쿼리 실행 횟수 WITH base AS( SELECT user, team, query_date, COUNT(*) OVER(PARTITION BY user) AS total_query_cnt, FROM `avdanced.query_logs` ) SELECT * FROM base(2) 연습문제 2-- 연습문제2) 주차별 팀내 쿼리 실행한 수 (RANK 1만 보이도록) WITH base2 AS( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(*) OVER(PARTITION BY EXTRACT(WEEK FROM query_date) ,user ORDER BY EXTRACT(WEEK FROM query_date) ) AS query_cnt, FROM `avdanced.query_logs` ORDER BY EXTRACT(WEEK FROM query_date) ) SELECT DISTINCT *, RANK() OVER(PARTITION BY team,week_number ORDER BY total_query_cnt DESC) AS team_rank FROM base2 QUALIFY team_rank = 1 ORDER BY week_number, team(3) 연습문제 3WITH base2 AS( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(*) OVER(PARTITION BY EXTRACT(WEEK FROM query_date) ,user ORDER BY EXTRACT(WEEK FROM query_date) ) AS query_cnt, FROM `avdanced.query_logs` #QUALIFY team_rank = 1 ORDER BY EXTRACT(WEEK FROM query_date) ), base3 AS( SELECT DISTINCT *, RANK() OVER(PARTITION BY team,week_number ORDER BY query_cnt DESC) AS team_rank FROM base2 QUALIFY team_rank = 1 ORDER BY week_number, team ) -- 연습문제3) 쿼리 실행 시점 1주전 쿼리 실행 SELECT DISTINCT *, LAG(query_cnt,1) OVER(PARTITION BY user ORDER BY week_number) AS prev_week_query_count FROM base2 GROUP BY ALL ORDER BY user, week_number(4) 연습문제 4--연습문제4) SELECT *, SUM(query_count) OVER(PARTITION BY user ORDER BY query_date) AS culmulative_query_count, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum2 FROM( SELECT DISTINCT *, COUNT(user) OVER(PARTITION BY query_date, user) AS query_count, FROM `avdanced.query_logs` ) ORDER BY user,query_date (5) 연습문제 5WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) --연습문제 5) null에 이전 값 삽입 SELECT raw_data.date, IF(raw_data.number_of_orders IS NULL, LAG(raw_data.number_of_orders,1) OVER(ORDER BY date), raw_data.number_of_orders) FROM raw_data(6) 연습문제 6WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ), null_is_lag AS( --연습문제 5) null에 이전 값 삽입 SELECT raw_data.date, IF(raw_data.number_of_orders IS NULL, LAG(raw_data.number_of_orders,1) OVER(ORDER BY date), raw_data.number_of_orders) AS number_of_orders FROM raw_data ) -- 연습문제 6) 이동평균 SELECT *, AVG(nl.number_of_orders) OVER(ORDER BY nl.date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM null_is_lag AS nl(7) 연습문제 7-- 1. TIMESTAMP → DATETIME -- 2. SECOND_DIFF 생성 : uSER로 묶어서 - -- 3. SESSION_START생성 : USER로 묶어서 LAG(DATA,1)이 NULL이면 1, SECOND_DIFF가 20이상이면 +1 -- 4. SESSION_ID생성: SESSION_START가 1일 경우 SESSION_ID +1, NULL일 경우 LAG(DATA,1) WITH add_date AS ( -- 1. TIMESTAMP → DATETIME SELECT event_date, event_timestamp, DATETIME(TIMESTAMP_MICROS(event_timestamp)) AS event_datetime, event_name, user_id, user_pseudo_id, LAG(DATETIME(TIMESTAMP_MICROS(event_timestamp))) OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS before_event_datetime FROM `avdanced.app_logs_temp` --,UNNEST(event_params) AS param -- FROM 절 안에서 UNNEST를 사용 WHERE event_date ="2022-08-18" AND user_pseudo_id = "1997494153.8491999091" ), add_diff AS ( -- 2. SECOND_DIFF 생성 : uSER로 묶어서 - SELECT *, DATE_DIFF(event_datetime, before_event_datetime,SECOND) AS second_diff, FROM add_date ), add_session AS( -- 3. SESSION_START생성 : USER로 묶어서 LAG(DATA,1)이 NULL이면 1, SECOND_DIFF가 20이상이면 +1 SELECT *, IF(second_diff IS NULL OR second_diff >=20, 1, NULL) AS session_start FROM add_diff ) -- 4. SESSION_ID생성 *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_num FROM add_session ORDER BY event_datetime
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습문제
1.--1) 사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성해주세요. 단 Group By를 사용해서 집계하는 것이 아닌 query_logs의 데이터의 우측에 새로운 컬럼을 만들어주세요. select * , count(query_date) over(partition by user) as total_query_cnt from advanced.query_logs order by user, query_date2.--2) 주차별로 팀 내에서 쿼리를 많이 실행한 수를 구한 후, 실행한 수를 활용해 랭킹을 구해주세요. 단 랭킹이 1등인 사람만 결과가 보이도록 해주세요. select * , rank() over(partition by week_number, team order by query_cnt desc) as team_rank from ( select EXTRACT(WEEK FROM query_date) as week_number , team , user , count(user) as query_cnt from advanced.query_logs group by all ) as base qualify team_rank = 1 order by 1,2,3 3.--3) (2번 문제에서 사용한 주차별 쿼리 사용) 쿼리를 실행한 시점 기준 1주 전에 쿼리 실행 수를 별도의 컬럼으로 확인할 수 있는 쿼리를 작성해주세요. select * , lag(query_cnt) over(partition by user order by week_number) as prev_week_query_cnt from ( select EXTRACT(WEEK FROM query_date) as week_number , team , user , count(user) as query_cnt from advanced.query_logs group by all ) as base order by user4.--4) 시간의 흐름에 따라, 일자별로 유저가 실행한 누적 쿼리 수를 작성해주세요. select * , sum(query_cnt) over(partition by user order by query_date) as cumulative_query_cnt from ( select user , team , query_date , count(user) as query_cnt from advanced.query_logs group by all ) as base order by user5.--5) 다음 데이터는 주문 횟수를 나타낸 데이터입니다. 만약 주문 횟수가 없으면 NULL로 기록됩니다. 이런 데이터에서 NULL 값이라고 되어있는 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해주세요. WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) select * , ifnull(number_of_orders,last_value(number_of_orders ignore nulls) over(order by date)) as last_value_orders from raw_data6.--6) 5번 문제에서 NULL을 채운 후, 2일 전 ~ 현재 데이터의 평균을 구하는 쿼리를 작성해주세요(이동 평균) WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) select date , number_of_orders , avg(number_of_orders) over(order by date rows between 2 preceding and current row) as moving_avg from (select date , ifnull(number_of_orders,last_value(number_of_orders ignore nulls) over(order by date)) as number_of_orders from raw_data) as base7.--7) app_logs 테이블에서 Custom Session을 만들어 주세요. 이전 이벤트 로그와 20초가 지나면 새로운 Session을 만들어 주세요. Session은 숫자로 (1,2,3...) 표시해도 됩니다 --2022-08-18일의 user_pseudo_id(1997494153.8491999091)은 session_id가 4까지 나옵니다. with base as ( select event_date , event_timestamp , datetime(timestamp_micros(event_timestamp), 'Asia/Seoul') as event_datetime , event_name , user_id , user_pseudo_id , lag(datetime(timestamp_micros(event_timestamp), 'Asia/Seoul')) over(partition by user_pseudo_id order by event_timestamp) as before_event_datetime from advanced.app_logs ) , diff_data as ( select * , if(second_diff > 20 or second_diff is null, 1, null) as session_start from ( select * , DATETIME_DIFF(event_datetime, before_event_datetime, second) as second_diff from base ) ) select * , sum(session_start) over(order by event_timestamp) as session_num from diff_data -- where event_date = '2022-08-18' -- and user_pseudo_id = '1997494153.8491999091' order by event_timestamp
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습문제
이번주는 평일 주말에 일이 많아서 따라가기 힘들었네요 ㅎㅎ그치만 마무리해서 올립니다!! #유저들의 다음 접속 월, 다다음 접속 월SELECT user_id,visit_month,LEAD(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month),LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month)FROM advanced.analytics_function_01#유저들의 다음 접속월, 다다음 접속 월, 이전 접속 월SELECT user_id,visit_month,LEAD(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month),LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month),LAG(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month)FROM advanced.analytics_function_01ORDER BY user_id#diff 구하기SELECT user_id,visit_month,LEAD(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month),LEAD(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) - visit_monthFROM advanced.analytics_function_01ORDER BY user_idSELECT after_visit_month - visit_monthFROM(SELECT user_id,visit_month,LEAD(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_visit_monthFROM advanced.analytics_function_01ORDER BY user_id)#윈도우함수SELECT *,SUM(amount) OVER () AS amount_total,SUM(amount) OVER (ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum,SUM(amount) OVER (PARTITION BY user_id ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum_user,AVG(amount) OVER (ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_avgFROM advanced.ordersORDER BY order_id#윈도우함수 연습문제 1SELECT *, COUNT(user) OVER(PARTITION BY user)FROM advanced.query_logs#윈도우함수 연습문제 2WITH nums AS (SELECT EXTRACT(week FROM query_date) AS week_number,team,userFROM advanced.query_logs), numss AS (SELECT week_number, team, user, COUNT(user) AS query_cntFROM numsGROUP BY week_number, team, user), rnks AS (SELECT *, RANK() OVER (PARTITION BY week_number, team ORDER BY query_cnt DESC) AS rkFROM numssORDER BY week_number, team, user)SELECT *FROM rnksWHERE rk = 1#윈도우함수 연습문제 3WITH nums AS (SELECT EXTRACT(week FROM query_date) AS week_number,team,userFROM advanced.query_logs), numss AS (SELECT week_number, team, user, COUNT(user) AS query_cntFROM numsGROUP BY week_number, team, user)SELECT user, team, week_number, query_cnt, LAG(query_cnt) OVER (PARTITION BY user ORDER BY week_number)FROM numssORDER BY user, team#윈도우함수 연습문제 4WITH cntcnt AS (SELECT user, team, query_date, COUNT(user) AS query_countFROM advanced.query_logsGROUP BY user, team, query_date)SELECT user, team, query_date, query_count, SUM(query_count) OVER (PARTITION BY user ORDER BY query_date) AS cntFROM cntcnt#윈도우함수 연습문제 5WITH raw_data AS(SELECT DATE '2024-05-01' AS date,15 AS number_of_orders UNION ALLSELECT DATE '2024-05-02',13 UNION ALLSELECT DATE '2024-05-03',NULL UNION ALLSELECT DATE '2024-05-04',16 UNION ALLSELECT DATE '2024-05-05',NULL UNION ALLSELECT DATE '2024-05-06',18 UNION ALLSELECT DATE '2024-05-07',20 UNION ALLSELECT DATE '2024-05-08',NULL UNION ALLSELECT DATE '2024-05-09',13 UNION ALLSELECT DATE '2024-05-10',14 UNION ALLSELECT DATE '2024-05-11',NULL UNION ALLSELECT DATE '2024-05-12',NULL), raws AS (SELECT date, number_of_orders, LAG(number_of_orders) OVER(ORDER BY date) AS beforesFROM raw_data)SELECT date, COALESCE(number_of_orders, befores)FROM raws#윈도우함수 연습문제 6WITH raw_data AS(SELECT DATE '2024-05-01' AS date,15 AS number_of_orders UNION ALLSELECT DATE '2024-05-02',13 UNION ALLSELECT DATE '2024-05-03',NULL UNION ALLSELECT DATE '2024-05-04',16 UNION ALLSELECT DATE '2024-05-05',NULL UNION ALLSELECT DATE '2024-05-06',18 UNION ALLSELECT DATE '2024-05-07',20 UNION ALLSELECT DATE '2024-05-08',NULL UNION ALLSELECT DATE '2024-05-09',13 UNION ALLSELECT DATE '2024-05-10',14 UNION ALLSELECT DATE '2024-05-11',NULL UNION ALLSELECT DATE '2024-05-12',NULL), raws AS (SELECT date, number_of_orders, LAG(number_of_orders) OVER(ORDER BY date) AS beforesFROM raw_data), dates AS (SELECT date, COALESCE(number_of_orders, befores) AS ordersFROM raws)SELECT date, orders, AVG(orders) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)FROM dates#윈도우함수 연습문제7마지막 문제는 스스로 풀진 못했고 강의의 힘을 빌려 마무리했습니다! ㅎㅎ
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝 스터디 2주차 과제] 윈도우 함수 연습 문제
1. 윈도우 함수 탐색 함수 연습 문제1) 유저의 다음 접속 월과 다다음 접속 월 구하기SELECT user_id , visit_month , LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS lead_visit_month , LEAD(visit_month, 2) OVER (PARTITION BY user_id ORDER BY visit_month) AS lead2_visit_month FROM advanced.analytics_function_01 2) 이전 접속월 구하기SELECT user_id , visit_month , LAG(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS before_visit_month FROM advanced.analytics_function_01 3) 다음 접속 월까지의 간격 구하기SELECT * , after_visit_month - visit_month AS diff_month FROM ( SELECT user_id , visit_month , LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_visit_month FROM advanced.analytics_function_01 ) 4) 첫번째와 마지막 방문 월 구하기SELECT user_id , visit_month , FIRST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS first_visit_month , LAST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS last_visit_month FROM advanced.analytics_function_01 2. 윈도우 함수 Frame 연습 문제SELECT * , SUM(amount) OVER () AS amount_total , SUM(amount) OVER (ORDER BY order_id) AS cumulative_sum , SUM(amount) OVER (PARTITION BY user_id ORDER BY order_id) AS user_cumulative_sum , AVG(amount) OVER (ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_avg FROM advanced.orders 3. 윈도우 함수 연습문제1) 사용자 별 쿼리를 실행한 총 횟수 구하는 쿼리를 작성해주세요.SELECT *, COUNT(query_date) OVER(PARTITION BY user ORDER BY query_date) AS total_query_cnt # 38 row FROM `advanced.query_logs` 2) 주차별로 팀내에서 쿼리를 많이 실행한 수를 구한후, 실행한수를 활용해 랭킹을 구해주세요. WITH base AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL) SELECT *, RANK() OVER(PARTITION BY team ORDER BY query_cnt) AS rk FROM base QUALIFY rk = 1 ORDER BY week_number, team, query_cnt DESC 3) 쿼리를 실행한 시점 기준1주전에 쿼리 실행 수를 별도의 컬럼으로 확인할 수 있는 쿼리를 작성해주세요WITH base AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL) SELECT *, LAG(query_cnt, 1) OVER(PARTITION BY user ORDER BY week_number) AS prev_week_cnt FROM base4) 시간의 흐름에 따라,일자별로 유저가 실행한 누적 쿼리수를 작성해주세요SELECT *, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date) AS cumulative_sum, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum2 FROM( SELECT query_date, team, user, COUNT(user) AS query_cnt FROM advanced.q5) NULL값이라고 되어있는 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해주세요WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) SELECT *, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS last_value_orders FROM raw_data6) 5번문제에서 NULL을 채운 후, 2일전~현재데이터의 평균을 구하는 쿼리를 작성해주세요(이동평균)WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) , filled_date AS( SELECT * EXCEPT(number_of_orders), LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS number_of_orders FROM raw_data) SELECT *, AVG(number_of_orders) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM filled_date7) app_logs테이블에서 CustomSession을 만들어주세요. 이전 이벤트 로그와 20초가 지나면 새로운 Session을 만들어주세요.WITH base AS( SELECT EVENT_DATE, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id FROM advanced.app_logs WHERE event_date = '2022-08-18' AND user_pseudo_id = "1997494153.8491999091" ) , diff_date AS(SELECT *, DATETIME_DIFF(event_datetime, prev_event_datetime, SECOND) AS second_diff FROM ( SELECT *, LAG(EVENT_DATETIME, 1) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS prev_event_datetime FROM base ORDER BY base.event_datetime)) SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_num FROM ( SELECT *, CASE WHEN prev_event_datetime IS NULL THEN 1 WHEN second_diff >= 20 THEN 1 ELSE 0 END AS session_start FROM diff_date)
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습 문제
#1SELECT *, LEAD(visit_month, 1) OVER(PARTITION BY user_id ORDER BY visit_month) AS lead_visit_month1, LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month) AS lead_visit_month2 FROM advanced.analytics_function_01 ORDER BY user_id;#2SELECT *, LEAD(visit_month, 1) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_visit_month1, LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_visit_month2, LAG(visit_month, 1) OVER(PARTITION BY user_id ORDER BY visit_month) AS before_visit_month1 FROM advanced.analytics_function_01 ORDER BY user_id, visit_month;#3SELECT *, after_visit_month - visit_month AS diff FROM ( SELECT *, LEAD(visit_month, 1) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_visit_month, FROM advanced.analytics_function_01 );#4SELECT *, FIRST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first, LAST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last FROM advanced.analytics_function_01;#5SELECT *, COUNT(user) OVER(PARTITION BY user) AS total_query_cnt FROM advanced.query_logs ORDER BY user, query_date;#6WITH query_cnt_cnt AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(*) AS query_cnt FROM advanced.query_logs GROUP BY week_number,team,user ) SELECT *, RANK() OVER(PARTITION BY week_number, team ORDER BY query_cnt DESC) AS rk FROM query_cnt_cnt QUALIFY rk = 1 ORDER BY week_number, team, query_cnt DESC;#7WITH query_cnt_cnt AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(*) AS query_cnt FROM advanced.query_logs GROUP BY week_number,team,user ) SELECT *, LAG(query_cnt, 1) OVER (PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt FROM query_cnt_cnt;#8WITH query_cnt AS ( SELECT *, COUNT(*) AS query_cnt_cnt FROM advanced.query_logs GROUP BY user, team, query_date ) SELECT *, SUM(query_cnt_cnt) OVER (PARTITION BY user ORDER BY query_date ) AS query_cnt_cnt_cnt FROM query_cnt;#9WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) SELECT *, LAST_VALUE(number_of_orders IGNORE NULLS) OVER (ORDER BY date) AS last_value_orders FROM raw_data;#10WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ), filled_data AS ( SELECT * EXCEPT(number_of_orders), LAST_VALUE(number_of_orders IGNORE NULLS) OVER (ORDER BY date) AS number_of_orders FROM raw_data ) SELECT *, AVG(number_of_orders) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM filled_data#11WITH base AS ( SELECT event_date, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id FROM advanced.app_logs WHERE 1=1 AND event_date = "2022-08-18" AND user_pseudo_id = "1997494153.8491999091" ) , diff_date AS ( SELECT *, DATETIME_DIFF(event_datetime, prev_event_datetime, SECOND) AS second_diff # second_diff 기반으로 새로운 세션의 시작일지, 아닐지를 판단할 수 있음 FROM( SELECT *, LAG(event_datetime, 1) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS prev_event_datetime -- DATETIME_DIFF() => 차이를 구할 수 있음 FROM base ) ) SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_num FROM ( SELECT *, CASE WHEN prev_event_datetime IS NULL THEN 1 WHEN second_diff >= 20 THEN 1 ELSE 0 END AS session_start FROM diff_data ORDER BY event_datetime )
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습 문제
1번-- 1) 사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성해주세요. -- 단, GROUP BY를 사용해서 집계하는 것이 아닌 query_logs의 데이터의 우측에 새로운 컬럼을 만들어주세요. SELECT *, COUNT(query_date) over(PARTITION BY user) AS total_query_cnt FROM advanced.query_logs ORDER BY user, query_date 2번-- 2) 주차별로 팀 내에서 쿼리를 많이 실행한 수를 구한 후, 실행한 수를 활용해 랭킹을 구해주세요. -- 단, 랭킹이 1등인 사람만 결과가 보이도록 해주세요 WITH query_cnt_team AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, RANK() OVER(PARTITION BY week_number, team ORDER BY query_cnt desc) AS rk FROM query_cnt_team QUALIFY rk = 1 ORDER BY week_number, query_cnt desc; 3번-- 3) (2번 문제에서 사용한 주차별 쿼리 사용) 쿼리를 실행한 시점 기준 1주 전에 쿼리 실행 수를 별도의 컬럼으로 확인할 수 있는 쿼리를 작성헤주세요. WITH query_cnt_team AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, LAG(query_cnt) OVER(PARTITION BY team, user ORDER BY week_number) AS prev_week_query_cnt FROM query_cnt_team 4번-- 4) 시간의 흐름(query_date)에 따라, 일자별로 유저가 실행한 누적 쿼리 수를 작성해주세요. WITH query_cnt_team AS ( SELECT query_date, team, user, COUNT(user) as query_cnt FROM advanced.query_logs GROUP BY ALL ORDER BY user, query_date ) SELECT user, team, query_date, query_cnt, SUM(query_cnt) OVER(PARTITION BY team, user ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_query_count FROM query_cnt_team ORDER BY user, query_date; 5번-- 5) 주문 횟수를 나타낸 데이터에서 NULL 값이라고 되어있는 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해주세요. WITH raw_data AS( SELECT DATE'2024-05-01'AS date,15 AS number_of_orders UNION ALL SELECT DATE'2024-05-02',13 UNION ALL SELECT DATE'2024-05-03',NULL UNION ALL SELECT DATE'2024-05-04',16 UNION ALL SELECT DATE'2024-05-05',NULL UNION ALL SELECT DATE'2024-05-06',18 UNION ALL SELECT DATE'2024-05-07',20 UNION ALL SELECT DATE'2024-05-08',NULL UNION ALL SELECT DATE'2024-05-09',13 UNION ALL SELECT DATE'2024-05-10',14 UNION ALL SELECT DATE'2024-05-11',NULL UNION ALL SELECT DATE'2024-05-12',NULL ) SELECT *, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS last_value_orders FROM raw_data; 6번-- 6) 5번 문제에서 NULL을 채운 후, 2일 전 ~ 현재 데이터의 평균을 구하는 쿼리를 작성해주세요.(이동 평균) WITH raw_data AS( SELECT DATE'2024-05-01'AS date,15 AS number_of_orders UNION ALL SELECT DATE'2024-05-02',13 UNION ALL SELECT DATE'2024-05-03',NULL UNION ALL SELECT DATE'2024-05-04',16 UNION ALL SELECT DATE'2024-05-05',NULL UNION ALL SELECT DATE'2024-05-06',18 UNION ALL SELECT DATE'2024-05-07',20 UNION ALL SELECT DATE'2024-05-08',NULL UNION ALL SELECT DATE'2024-05-09',13 UNION ALL SELECT DATE'2024-05-10',14 UNION ALL SELECT DATE'2024-05-11',NULL UNION ALL SELECT DATE'2024-05-12',NULL ), filled_data AS ( SELECT * EXCEPT(number_of_orders), LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS last_value_orders FROM raw_data ) SELECT * FROM filled_data; 7번-- 7) app_logs 테이블에서 Custom Session을 만들어 주세요. 이전 이벤트 로그와 20초가 지나면 새로운 Session을 만들어 주세요. -- Session은 숫자로 (1, 2, 3 ...) 표시해도 됩니다. WITH base AS ( SELECT event_date, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id FROM advanced.app_logs WHERE event_date = '2022-08-18' ), diff_date AS ( SELECT *, DATETIME_DIFF(event_datetime, pre_event_time, SECOND) AS second_diff FROM ( SELECT *, LAG(event_datetime) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime ASC) AS pre_event_time FROM base ) ), session_start AS ( SELECT *, CASE WHEN pre_event_time IS NULL THEN 1 WHEN second_diff >= 20 THEN 1 ELSE 0 END AS start_session FROM diff_date ) SELECT *, SUM(start_session) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id FROM session_start ORDER BY event_datetime;
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습 문제
# 윈도우 함수(탐색 함수) 연습 문제 1 SELECT *, LEAD(visit_month, 1) OVER(PARTITION BY user_id ORDER BY visit_month ASC) AS lead_1, LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month ASC) AS lead_2 FROM advanced.analytics_function_01 ORDER BY user_id; # 윈도우 함수(탐색 함수) 연습 문제 2 SELECT *, LEAD(visit_month, 1) OVER(PARTITION BY user_id ORDER BY visit_month ASC) AS lead_1, LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month ASC) AS lead_2, LAG(visit_month, 1) OVER(PARTITION BY user_id ORDER BY visit_month ASC) AS lag_1 FROM advanced.analytics_function_01 ORDER BY user_id;# 1) 사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성해주세요. 단, GROUP BY를 사용해서 집계하는 것이 아닌 query_logs의 데이터의 우측에 새로운 컬럼을 만들어주세요. SELECT *, COUNT(user) OVER(PARTITION BY user) AS total_query_cnt FROM advanced.query_logs ORDER BY query_date# 2) 주차별로 팀 내에서 쿼리를 많이 실행한 수를 구한 후, 실행한 수를 활용해 랭킹을 구해주세요. 단, 랭킹이 1등인 사람만 결과가 보이도록 해주세요 WITH base AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, RANK() OVER(PARTITION BY week_number, team ORDER BY query_cnt DESC) AS team_rank FROM base QUALIFY team_rank = 1 ORDER BY week_number, team # 3) (2번 문제에서 사용한 주차별 쿼리 사용) 쿼리를 실행한 시점 기준 1주 전에 쿼리 실행 수를 별도의 컬럼으로 확인할 수 있는 쿼리를 작성해주세요 WITH base AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT user, team, week_number, query_cnt, LAG(query_cnt, 1) OVER(PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt FROM base# 4) 시간의 흐름에 따라, 일자별로 유저가 실행한 누적 쿼리 수를 작성해주세요 WITH base AS ( SELECT user, team, query_date, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_query_count FROM base# 5) 다음 데이터는 주문 횟수를 나타낸 데이터입니다. 만약 주문 횟수가 없으면 NULL로 기록됩니다. 이런 데이터에서 NULL값이라고 되어있는 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해주세요 WITH raw_data AS ( SELECT DATE'2024-05-01'AS date, 15 AS number_of_orders UNION ALL SELECT DATE'2024-05-02', 13 UNION ALL SELECT DATE'2024-05-03', NULL UNION ALL SELECT DATE'2024-05-04', 16 UNION ALL SELECT DATE'2024-05-05', NULL UNION ALL SELECT DATE'2024-05-06', 18 UNION ALL SELECT DATE'2024-05-07', 20 UNION ALL SELECT DATE'2024-05-08', NULL UNION ALL SELECT DATE'2024-05-09', 13 UNION ALL SELECT DATE'2024-05-10', 14 UNION ALL SELECT DATE'2024-05-11', NULL UNION ALL SELECT DATE'2024-05-12', NULL ) SELECT *, IFNULL(number_of_orders, LAG(number_of_orders,1) OVER(ORDER BY date ASC)) AS non_null FROM raw_data; SELECT *, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date ASC) AS non_null FROM raw_data# 6) 5번 문제에서 NULL을 채운 후, 2일 전 ~ 현재 데이터의 평균을 구하는 쿼리를 작성해주세요(이동 평균) WITH raw_data AS ( SELECT DATE'2024-05-01'AS date, 15 AS number_of_orders UNION ALL SELECT DATE'2024-05-02', 13 UNION ALL SELECT DATE'2024-05-03', NULL UNION ALL SELECT DATE'2024-05-04', 16 UNION ALL SELECT DATE'2024-05-05', NULL UNION ALL SELECT DATE'2024-05-06', 18 UNION ALL SELECT DATE'2024-05-07', 20 UNION ALL SELECT DATE'2024-05-08', NULL UNION ALL SELECT DATE'2024-05-09', 13 UNION ALL SELECT DATE'2024-05-10', 14 UNION ALL SELECT DATE'2024-05-11', NULL UNION ALL SELECT DATE'2024-05-12', NULL ) SELECT *, AVG(number_of_orders) OVER(ORDER BY date ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM ( SELECT date, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS number_of_orders FROM raw_data )# 7) app_logs 테이블에서 Custom Session을 만들어 주세요. 이전 이벤트 로그와 20초가 지나면 새로운 Session을 만들어 주세요. Session은 숫자로(1,2,3...) 표시해도 됩니다 # 2022-08-18일의 user_pseudo_id(1997494153.8491999091)은 session_id가 4까지 나옵니다 WITH base AS ( SELECT event_date, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id FROM advanced.app_logs WHERE event_date = '2022-08-18' AND user_pseudo_id = '1997494153.8491999091' ), diff_data AS ( SELECT *, DATETIME_DIFF(event_datetime, before_1_event_datetime, SECOND) AS second_diff FROM ( SELECT *, LAG(event_datetime, 1) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime ASC) AS before_1_event_datetime FROM base ) ) SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id FROM ( SELECT *, CASE WHEN second_diff IS NULL THEN 1 WHEN second_diff >= 20 THEN 1 ELSE NULL END AS session_start FROM diff_data )
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습문제
탐색 함수 연습문제 11. user 들의 다음 접속 월과 다다음 접속 월을 구하는 쿼리를 작성해주세요SELECT user_id, visit_month, LEAD (visit_month, 1) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_visit_month, LEAD (visit_month, 2) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_two_visit_month FROM advanced.analytics_function_01 user 들의 다음 접속 월과 다다음 접속 월, 이전 접속 월을 구하는 쿼리를 작성해주세요-- 다음접속월 LEAD -- 다다음접속월 LEAD -- 이전접속월 LAG SELECT *, LEAD(visit_month, 1) OVER(PARTITION BY user_id ORDER BY visit_month ASC) AS after_visit_month, LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month ASC) AS after_two_visit_month, LAG(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month ASC) AS before_visit_month FROM advanced.analytics_function_01유저가 접속했을 때, 다음 접속까지의 간격을 구하시오-- user_id | visit_month | after_visit_month | diff_month -- 일단 LEAD, LAG 구하고 after visit month 에서 visit month 뺴기 SELECT *, (after_visit_month - visit_month) AS diff_month FROM ( SELECT *, LEAD(visit_month, 1) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_visit_month, FROM advanced.analytics_function_01 ) ORDER BY user_id, visit_month ASC별칭을 정의하고 실행 한 뒤 다음 쿼리에서 그걸 이용해야 하기 때문에 서브쿼리를 사용해야함길이가 짧은 쿼리 SELECT *, LEAD(visit_month, 1) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_visit_month, LEAD(visit_month, 1) OVER (PARTITION BY user_id ORDER BY visit_month) - visit_month AS diff_month FROM advanced.analytics_function_01길이는 더 짧지만 같은 코드가 중복되기 때문에 코드를 수정해야 할 경우 두 번 수정해야함.user_id 의 첫번째 방문 월, 마지막 방문 월을 구하는 쿼리를 작성해주세요-- 첫번째방문 : FIRST_VALUE -- 마지막방문 : LAST_VALUE SELECT *, FIRST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_visit_month, LAST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_visit_month FROM advanced.analytics_function_01 ORDER BY user_id, visit_monthFRAME 연습문제 2SUM : amount_total 구하기누적합 : cumulative_sum 구하기유저별 누적합 : cumulative_sum(user)직전 5개 평균값 : last_5_avg ** 직전이므로 current row 포함되면 안 됨SELECT *, -- amount 전체 합 SUM(amount) OVER() AS amount_total, -- 누적 합 cumulative_sum SUM(amount) OVER(ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_sum, -- 유저별 누적합 cumulative_sum(user) SUM(amount) OVER( PARTITION BY user_id ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) as cumulative_sum_user, -- 직전 5개 주문의 평균값 last_5_avg AVG(amount) OVER(ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) as last_5_avg FROM advanced.orders ORDER BY order_id 윈도우 함수 연습문제 3사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성해주세요 단 GROUP BY를 사용해서 집계하는 것이 아닌 query_logs의 데이터의 우측에 새로운 컬럼을 만들어주세요SELECT *, COUNT(query_date) OVER( PARTITION BY user ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) as total_query_cnt FROM advanced.query_logs ORDER BY query_date주차별로 팀 내에서 쿼리를 많이 실행한 수를 구한 후, 실행한 수를 활용해 랭킹을 구해주세요. 단, 랭킹이 1등인 사람만 결과가 보이도록 해주세요WITH weekly AS ( -- 날짜를 주로 변환 SELECT *, EXTRACT(WEEK FROM query_date) AS week_number FROM advanced.query_logs ), querycnt AS ( -- 쿼리 실행 횟수 구하기 SELECT week_number, team, user, COUNT(query_date) OVER (PARTITION BY week_number, user) AS query_cnt FROM weekly ), team_ranks AS ( -- 팀 내에서 쿼리 많이 작성한 사람 랭크 매겨서, 주차별로 파티션 나누고, 랭크 1만 보이게 SELECT week_number, team, user, query_cnt, ROW_NUMBER() OVER(PARTITION BY week_number, team ORDER BY query_cnt DESC) AS team_rank -- 랭킹 행이 여러개 출력되어서 ROW_NUMBER 로 하나만 나오게 구함 FROM querycnt ) SELECT week_number, team, user, query_cnt, team_rank FROM team_ranks WHERE team_rank = 1 ORDER BY week_number ASC, team ASC;-- GROUP BY 이용 WITH query_cnt_by_team AS( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, RANK() OVER(PARTITION BY week_number, team ORDER BY query_cnt DESC) AS team_rank FROM query_cnt_by_team QUALIFY team_rank =1 ORDER BY week_number, team, query_cnt DESC2번 문제에서 사용한 주차별 쿼리 사용해 쿼리를 실행한 시점 기준 1주전 에 쿼리 실행 수를 별도의 컬럼으로 확인할 수 있는 쿼리를 작성해주세요WITH query_cnt_by_team AS( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT user, team, week_number, query_cnt, LAG(query_cnt) OVER (PARTITION BY team, user ORDER BY week_number, query_cnt) AS prev_week_query_cnt FROM query_cnt_by_team ORDER BY user, team시간의 흐름에 따라, 일자별로 유저가 실행한 누적 쿼리 수를 작성해주세요-- 시간의 흐름에 따라 => query_date 기준 -- 유저별, 일자별 쿼리수 합계 => SUM() OVER(PARTITION BY user ~ -- 누적 쿼리수 계산 새 컬럼 cumulative_query_cnt => SUM() OVER(PARTITION BY ORDER BY ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_query_cnt WITH querycnt AS( SELECT user, team, query_date, COUNT(user) AS query_count FROM advanced.query_logs GROUP BY ALL ORDER BY user ASC, query_count DESC ) SELECT user, team, query_date, query_count, SUM(query_count) OVER(PARTITION BY team, user ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_query_cnt FROM querycnt ORDER BY user, query_date ASC, query_count DESC다음 데이터는 주문 횟수를 나타낸 데이터입니다. 만약 주문 횟수가 없으면 NULL로 기록됩니다. 이런 데이터에서 NULL 값이라고 되어있는 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해주세요SELECT *, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_orders FROM raw_data5번 문제에서 NULL을 채운 후 2일전 ~ 현재의 데이터의 평균을 구하는 쿼리를 작성해주세요SELECT date, last_orders, AVG(last_orders) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_avg FROM( SELECT *, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_orders FROM raw_data )app_logs 테이블에서 Custom Session을 만들어주세요. 이전 이벤트 로그와 20초가 지나면 새로운 Session을 만들어주세요 Session은 숫자로 (1,2,3...) 표시해도 됩니다. 2022-08-18일의 user_pseudo_id(1997494153.8491999091)은 session_id가 4까지 나옵니다.WITH time2 AS ( SELECT event_date, event_timestamp, TIMESTAMP_MICROS(event_timestamp) + INTERVAL 9 HOUR AS event_datetime, -- 구글링 검색하다가 한국 시간 기준으로 변환하는 방법찾아서 적용해봄 event_name, user_id, user_pseudo_id FROM advanced.app_logs ), lagged_data AS ( SELECT *, LAG(event_datetime) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS before_event_datetime FROM time2 -- 결과 짧게 보려고 날짜 필터링 WHERE event_date = '2022-08-18' ), session_data AS ( SELECT *, TIMESTAMP_DIFF(event_datetime, before_event_datetime, SECOND) AS second_diff FROM lagged_data ) SELECT *, -- session_start: 첫 이벤트에만 1, 나머지는 NULL IF(before_event_datetime IS NULL, 1, NULL) AS session_start, -- session_id: 첫 이벤트에서 1로 시작하고, 이후 20초 이상 차이가 생길 때마다 1씩 증가 SUM(IF(before_event_datetime IS NULL OR second_diff >= 20, 1, 0)) OVER ( PARTITION BY user_pseudo_id ORDER BY event_datetime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS session_id FROM session_data ORDER BY user_pseudo_id, event_datetime
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습문제
1. 탐색 함수 연습 문제-- 1. user들의 다음 접속 월과 다다음 접속 월을 구하는 쿼리를 작성해주세요. -- partition 유저, 접속 월 정렬, LEAD(1), LEAD(2) SELECT user_id , visit_month , LEAD(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_month , LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month) AS after2_month FROM `advanced.analytics_function_01` ORDER BY user_id, visit_month -- 2. user들의 다음 접속 월과 다다음 접속 월,이전 접속 월을 구하는 쿼리를 작성해주세요. -- partition: 유저, order: 접속 월, LEAD(1), LEAD(2), LAG(1) SELECT user_id , visit_month , LEAD(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_month , LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month) AS after2_month , LAG(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) AS before_month FROM `advanced.analytics_function_01` ORDER BY user_id, visit_month -- 3. 유저가 접속했을 때, 다음 접속까지의 간격을 구하시오. -- partition: 유저, order: 접속 월, 간격: LEAD(1) - 접속 월 SELECT after_visit_month - visit_month AS diff_month FROM ( user_id , visit_month , LEAD(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_visit_month FROM `advanced.analytics_function_01` ) ORDER BY user_id, visit_month -- 4. user_id의 첫번째 방문 월, 마지막 방문월을 구하는 쿼리를 작성해주세요. -- partition: 유저, order: 접속 월, FIRST(전체 범위), LAST(전체 범위) SELECT DISTINCT user_id, first_visit_month, last_visit_month FROM ( SELECT user_id -- , visit_month , FIRST_VALUE(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_visit_month , LAST_VALUE(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_visit_month FROM `advanced.analytics_function_01` ) ORDER BY user_id 2. Frame 연습 문제-- 1. 우리 회사의 모든 주문량은?(amount_total) : SUM(amount, 전체 범위) -- 2. 특정 주문 시점에서 누적 주문량은?(cumulative_sum) : SUM(order by 주문id) -- 3. 고객별 주문 시점에서 누적 주문량은?(cumnulative_sum_by_user) : SUM(partition by 유저,order by 주문id) -- 4. 최근 직전 5개의 평균 주문량은?(last_5_orders_avg_amount) : AVG(order by 주문id, 5전 ~ 1전) SELECT * # amount_total : SUM(amount) OVER()와 동일한 결과 출력됨 , SUM(amount) OVER(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS amount_total , SUM(amount) OVER(ORDER BY order_id) AS cumulative_sum , SUM(amount) OVER(PARTITION BY user_id ORDER BY order_id) AS cumulative_sum_by_user , AVG(amount) OVER(ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_orders_avg_amount FROM `advanced.orders` ORDER BY order_id 3. 윈도우 함수 총정리 연습 문제-- 1. 사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성해주세요. 단, GROUP BY를 사용해서 집계하는 것이 아닌, query_logs 데이터의 우측에 새로운 컬럼을 만들어주세요. -- 동명이인 없음. COUNT(query_date) OVER(partition by 유저) SELECT * , COUNT(query_date) OVER(PARTITION BY user) AS total_query_cnt FROM `advanced.query_logs` ORDER BY user, query_date -- 2. 주차별로 팀 내에서 쿼리를 많이 실행한 수를 구한 후, 실행한 수를 활용해 랭킹을 구해주세요. 단, 랭킹이 1등인 사람만 결과가 보이도록 해주세요. -- 1) week, 주차별 실행 수 (중복제거 포함) WITH query_log_week_cnt AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number , team , user , COUNT(query_date) AS query_cnt FROM `advanced.query_logs` GROUP BY ALL ) -- 2) RANK() OVER(PARTITION BY week, team ORDER BY 주차별 실행 수 DESC) / QUALIFY rank = 1 SELECT week_number , team , user , query_cnt , RANK() OVER(PARTITION BY week_number, team ORDER BY query_cnt DESC) AS team_rank FROM query_log_week_cnt WHERE 1=1 QUALIFY team_rank = 1 ORDER BY week_number, team, user -- 3. 쿼리를 실행한 시점 기준 1주 전에 쿼리 실행수를 별도의 컬럼으로 확인할 수 있는 쿼리를 작성해주세요. -- 1) week, 주차별 실행 수 (중복제거 포함) WITH query_log_week_cnt AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number , team , user , COUNT(query_date) AS query_cnt FROM `advanced.query_logs` GROUP BY ALL ) -- 2) LAG(쿼리cnt) OVER(partition by 유저 order by 주차) SELECT user , team , week_number , LAG(query_cnt) OVER(PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt FROM query_log_week_cnt ORDER BY user, team, week_number -- 4. 시간의 흐름에 따라, 일자별로 유저가 실행한 누적 쿼리 수를 작성해주세요. -- 1) 일자별 쿼리 수 (집계) WITH query_log_daily_cnt AS ( SELECT user , team , query_date , COUNT(query_date) AS query_cnt FROM `advanced.query_logs` GROUP BY ALL ) -- 2) 일자별 누적 쿼리 수 : SUM(쿼리수) OVER(파티션: 유저, 정렬: 일자) SELECT * , SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date) AS cumulative_query_cnt FROM query_log_daily_cnt ORDER BY user, query_date-- 5. NULL 값 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해주세요. WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) -- if null, lag, date 정렬 = 마지막 연속 null case 처리 불가.. -- : last value, ignore nulls 활용 SELECT * , LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS fill_order_num FROM raw_data ORDER BY date-- 6. NULL을 채운 후, 2일 전 ~ 현재 데이터의 평균을 구하는 쿼리를 작성해주세요. (이동평균) WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) -- last value, ignore nulls 활용 , fill_data AS ( SELECT * , LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS fill_order_num FROM raw_data ) -- AVG(주문 수) OVER(date 정렬 rows between 2일전 ~ 현재) SELECT date , fill_order_num , AVG(fill_order_num) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM fill_data ORDER BY date-- 7. custom session 생성 (이전 이벤트 로그와 20초가 지나면 새로운 session) -- 컬럼 생성: event_datetime, 이전 event_datetime, 기준 - 이전 event_datetime, >= 20초 : session_id 1 WITH app_log_datetime AS ( SELECT event_date , event_timestamp , DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime , event_name , user_id , user_pseudo_id FROM `advanced.app_logs` WHERE event_date = '2022-08-18' AND user_pseudo_id = '1997494153.8491999091' ) -- 이전 event_datetime , app_log_datetime_before AS ( SELECT * , LAG(event_datetime) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS before_event_datetime FROM app_log_datetime ) -- 기준 - 이전 event_datetime , app_log_diff AS ( SELECT * , DATETIME_DIFF(event_datetime, before_event_datetime, SECOND) AS second_diff FROM app_log_datetime_before ) -- session >= 20, 첫번째 값 null 처리 , app_log_session_start AS ( SELECT * , CASE WHEN before_event_datetime IS NULL THEN 1 WHEN second_diff >= 20 THEN 1 -- session을 나누는 기준(서비스 특성 고려하여 선정 필요, 보통 앱로그에서는 30초 60초) ELSE 0 END AS session_start FROM app_log_diff ) -- 누적 합을 통해 session_num 생성 SELECT * , SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_num FROM app_log_session_start
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습 문제
사용자별 쿼리를 실행한 총 횟수SELECT *, COUNT(user) OVER(PARTITION BY user) AS total_query_cnt FROM advanced.query_logs ORDER BY user,query_date주차별 팀 내 쿼리 실행횟수 1등SELECT *, RANK() OVER(PARTITION BY week_number,team ORDER BY query_cnt DESC) AS team_rank FROM ( SELECT EXTRACT(week FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) AS query_logs_byweek QUALIFY team_rank = 1쿼리 실행시점 기준 1주전 쿼리 실행 횟수SELECT *, LAG(query_cnt,1) OVER(ORDER BY week_number) AS prev_week_query_cnt FROM ( SELECT EXTRACT(week FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) AS query_logs_byweek 시간의 흐름에 따라, 일자별 유저의 누적 쿼리 실행 수SELECT *, SUM(query_count) OVER(PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_query_count FROM ( SELECT *, COUNT(user) AS query_count FROM advanced.query_logs GROUP BY ALL ) AS query_logs_byday 주문 횟수가 없으면 NULL로 기록된 데이터. NULL값을 이전 날짜의 값으로 채워주기 WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) SELECT date, IF(number_of_orders IS NULL, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date), number_of_orders) AS number_of_orders FROM raw_data ORDER BY date5번 문제에서 NULL을 채운 후, 2일전~현재데이터의 평균을 구하는 쿼리(이동 평균)SELECT *, AVG(number_of_orders) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM ( SELECT date, IF(number_of_orders IS NULL, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date), number_of_orders) AS number_of_orders FROM raw_data ) AS dataapp_logs테이블에서 custom session 만들기. 이전 이벤트 로그와 20초가 지나면 새로운 session을 표시(숫자로 1,2,3,..)WITH base AS ( SELECT event_date, DATETIME(TIMESTAMP_MICROS(event_timestamp),'Asia/Seoul') AS event_datetime, event_name, user_pseudo_id FROM advanced.app_logs WHERE user_pseudo_id = '1997494153.8491999091' AND event_date = '2022-08-18' ) SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_num FROM ( SELECT *, CASE WHEN second_diff IS NULL THEN 1 WHEN second_diff >= 20 THEN 1 END AS session_start FROM ( SELECT *, Lag(event_datetime,1) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS before_event_datetime, DATETIME_DIFF(event_datetime, Lag(event_datetime,1) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime),SECOND) AS second_diff FROM base ) AS A ) AS B
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습문제
문제 1.SELECT *, COUNT(query_date) OVER (PARTITION BY user ORDER BY user) AS total_query_cnt FROM advanced.query_logs 문제2.WITH table AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, RANK() OVER(PARTITION BY week_number ORDER BY query_cnt) AS team_rank FROM table QUALIFY team_rank = 1 문제3.WITH table AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, LAG(query_cnt) OVER (PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt FROM table문제4.WITH query_count_table AS ( SELECT *, COUNT(*) AS query_count FROM advanced.query_logs GROUP BY ALL ) SELECT *, SUM(query_count) OVER (PARTITION BY user ORDER BY query_date) AS cululative_query_count FROM query_count_table 문제5.WITH raw_data AS( SELECT DATE'2024-05-01'AS date,15 AS number_of_orders UNION ALL SELECT DATE'2024-05-02',13 UNION ALL SELECT DATE'2024-05-03',NULL UNION ALL SELECT DATE'2024-05-04',16 UNION ALL SELECT DATE'2024-05-05',NULL UNION ALL SELECT DATE'2024-05-06',18 UNION ALL SELECT DATE'2024-05-07',20 UNION ALL SELECT DATE'2024-05-08',NULL UNION ALL SELECT DATE'2024-05-09',13 UNION ALL SELECT DATE'2024-05-10',14 UNION ALL SELECT DATE'2024-05-11',NULL UNION ALL SELECT DATE'2024-05-12',NULL ) SELECT date, IF(number_of_orders is null , last_value(number_of_orders IGNORE NULLS) OVER(ORDER BY date asc), number_of_orders) AS number_of_orders_not_null FROM raw_data;문제6.WITH raw_data AS( SELECT DATE'2024-05-01'AS date,15 AS number_of_orders UNION ALL SELECT DATE'2024-05-02', 13 UNION ALL SELECT DATE'2024-05-03', NULL UNION ALL SELECT DATE'2024-05-04', 16 UNION ALL SELECT DATE'2024-05-05', NULL UNION ALL SELECT DATE'2024-05-06', 18 UNION ALL SELECT DATE'2024-05-07', 20 UNION ALL SELECT DATE'2024-05-08', NULL UNION ALL SELECT DATE'2024-05-09', 13 UNION ALL SELECT DATE'2024-05-10', 14 UNION ALL SELECT DATE'2024-05-11', NULL UNION ALL SELECT DATE'2024-05-12', NULL ), filled_data AS ( SELECT * EXCEPT(number_of_orders), LAST_VALUE(number_of_orders IGNORE NULLS) OVER (ORDER BY date) AS number_of_orders FROM raw_data ) SELECT *, AVG(number_of_orders) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM filled_data문제 7.WITH base AS( SELECT event_date, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id FROM advanced.app_logs WHERE event_date = '2022-08-18' AND user_pseudo_id = "1997494153.8491999091" ), diff_data AS( SELECT *, DATETIME_DIFF(event_datetime, prev_event_datetime, SECOND) AS second_diff FROM ( SELECT *, # 직전 이벤트 시간을 prev_event_datetime으로 만들기 LAG(event_datetime, 1) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS prev_event_datetime FROM base ) ) SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_num FROM( SELECT *, CASE WHEN prev_event_datetime IS NULL THEN 1 WHEN second_diff >= 20 THEN 1 ELSE NULL END AS session_start FROM diff_data ) ORDER BY event_datetime
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습문제
윈도우 함수(탐색 함수) 연습 문제윈도우 함수 연습 문제 1문제1) user들의 다음 접속 월과 다다음 접속 월을 구하는 쿼리를 작성해주세요.SELECT user_id, visit_month, LEAD(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_visit_month, LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_two_visit_month FROM advanced.analytics_function_01 윈도우 함수 연습 문제 2문제2) user들의 다음 접속 월과 다다음 접속 월, 이전 접속 월을 구하는 쿼리를 작성해주세요.SELECT user_id, visit_month, LEAD(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_visit_month, LEAD(visit_month, 2) OVER(PARTITION BY user_id ORDER BY visit_month) AS after_two_visit_month, LAG(visit_month) OVER(PARTITION BY user_id ORDER BY visit_month) AS before_visit_month FROM advanced.analytics_function_01 윈도우 함수 FRAME 연습 문제-- amount_total : 전체 sum-- cumulative_sum : row 시점에 누적 sum-- cumulative_sum_by_user : 유저별 row 시점에 누적 sum-- last_5_orders_avg_amount : order_id 기준으로 정렬하고, 직전 5개의 주문의 평균 amountSELECT -- SUM(amount), SUM(amount) OVER() AS amount_total, # OVER 안에 아무것도 들어가지 않는 경우도 있구나! SUM(amount) OVER (ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum, -- SUM(amount) OVER (PARTITION BY user_id ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum_by_user, SUM(amount) OVER (PARTITION BY user_id ORDER BY order_id) AS cumulative_sum_by_user, AVG(amount) OVER (ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_order_avg_amount FROM advanced.orders ORDER BY order_id 윈도우 함수 연습문제 (1번)1번사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성해주세요. 단,GROUP BY를 사용해서 집계하는 것이 아닌 query_logs의 데이터의 우측에 새로운 컬럼을 만들어주세요.-- 1) 사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성해주세요. 단,GROUP BY를 사용해서 집계하는 것이 아닌 query_logs의 데이터의 우측에 새로운 컬럼을 만들어주세요. -- SELECT * -- FROM advanced.query_logs SELECT user, team, query_date, COUNT(user) OVER(PARTITION BY user) AS total_query_cnt FROM advanced.query_logs ORDER BY user, query_date 윈도우 함수 연습문제 (2번~6번)주차별로 팀 내에서 쿼리를 많이 실행한 수를 구한 후, 실행한 수를 활용해 랭킹을 구해주세요. 단, 랭킹이 1등인 사람만 결과가 보이도록 해주세요.문제의 의도 : 원본 데이터 ⇒ 1 ROW 마다 데이터가 있고, 그걸 집계해서 사용. GROUP BY ⇒ 윈도우 함수WITH query_ct_by_team AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, RANK() OVER (PARTITION BY week_number, team ORDER BY query_cnt DESC ) AS rk FROM query_ct_by_team WHERE 1=1 -- QUALIFY : 윈도우 함수의 조건을 설정할 때 사용 QUALIFY rk = 1 -- and week_number = 16 ORDER BY week_number, team, query_cnt DESC (2번 문제에서 사용한 주차별 쿼리 사용) 쿼리를 실행한 시점 기준 1주 전에 쿼리 실행 수를 별도의 컬럼으로 확인할 수 있는 쿼리를 작성해주세요.WITH query_ct_by_team AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, LAG(query_cnt) OVER(PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt FROM query_ct_by_team ORDER BY user, week_number 시간의 흐름에 따라, 일자별로 유저가 실행한 누적 쿼리 수를 작성해주세요.WITHWITH qcnt AS( SELECT *, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) -- 누적 쿼리 : Frame. 과거의 시간(UNBOUNDED PRECEDING)부터 current, low까지 합쳐라 SELECT *, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date) AS cumulative_query_cc, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS cumulative_query_cc2 -- Frame의 Default 값 : UNBOUNDED PRECEDING ~ CURRENT ROW FROM qcnt ORDER BY user, query_date SUBQUERYSELECT *, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date) AS cumulative_query_cc, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS cumulative_query_cc2 -- Frame의 Default 값 : UNBOUNDED PRECEDING ~ CURRENT ROW FROM (SELECT *, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL) ORDER BY user, query_date cumulative_sum 비교SELECT *, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date) AS cumulative_query_cc, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS cumulative_query_cc2 -- Frame의 Default 값 : UNBOUNDED PRECEDING ~ CURRENT ROW FROM (SELECT *, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL) -- WHERE, QUALIFY 조건 설정해서 2가지 값이 모두 같은지 비교 -- => 모두 같으면 != 연산 결과에 반환하는 값이 없을 것 QUALIFY cumulative_query_cc != cumulative_query_cc2 ORDER BY user, query_date 다음 데이터는 주문 횟수를 나타낸 데이터입니다. 만약 주문 횟수가 없으면 NULL로 기록됩니다. 이런 데이터에서 NULL값이라고 되어 있는 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해주세요.LAGSELECT date, IF(number_of_orders IS NULL, LAG(number_of_orders) OVER(ORDER BY date), number_of_orders) AS number_of_orders, FROM raw_data order by date LAST_VALUE SELECT *, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS last_value_orders FROM raw_data -- FIRST_VALUE, LAST_VALUE => NULL을 포함해서 연산 -- NULL을 제외해서 연산하고 싶으면 IGNORE NULLS를 쓰면 된다! 5번 문제에서 NULL을 채운 후, 2일 전 ~ 현재 데이터의 평균을 구하는 쿼리를 작성해주세요 (이동평균).WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ), filled_data AS ( SELECT *, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS last_value_orders FROM raw_data ) SELECT * EXCEPT(number_of_orders), ROUND(AVG(last_value_orders) OVER(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),2) AS number_of_orders FROM filled_data -- SELECT * EXCEPT(number_of_orders), -- ROUND(AVG(last_value_orders) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),2) AS number_of_orders SELECT *, ROUND(AVG(last_value_orders) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),2) AS moving_avg FROM filled_data app_logs 테이블에서 Custom Session을 만들어주세요. 이전 이벤트 로그와 20초가 지나면 새로운 Session을 만들어주세요. Session은 숫자로 (1, 2, 3...) 표시해도 됩니다. 2022-08-18일의 user_pseudo_id(1997494153.8491999091)은 session_id가 4까지 나옵니다.SELECT * -- event_params가 필요하면 UNNEST => PIVOT해서 사용하면 됨 FROM advanced.app_logs WHERE event_date = "2022-08-18" AND user_pseudo_id = "1997494153.8491999091" ORDER BY event_timestamp WITH base AS ( SELECT --* event_date, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id -- event_params가 필요하면 UNNEST => PIVOT해서 사용하면 됨 FROM advanced.app_logs WHERE event_date = "2022-08-18" AND user_pseudo_id = "1997494153.8491999091" -- WHERE 조건에 이벤트를 필터링해서 계산해도 괜찮음 ), diff_data AS ( SELECT *, DATETIME_DIFF(event_datetime, prev_event_datetime, SECOND) AS second_diff # secend_diff 기반으로 새로운 세션의 시작일지, 아닐지를 판단할 수 있음 FROM ( SELECT *, LAG(event_datetime, 1) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS prev_event_datetime # event_datetime이랑 prev_event_datetime을 빼서 20초가 넘으면 새로운 세션으로 정의, # 20초가 넘지 않으면 기존 세션 # DATETIME_DIFF() => 차이를 구할 수 있음 FROM base ) ) SELECT * , # 누적합을 사용해서 session_number을 만들었다! SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_num # session을 구할 때 쿼리가 길어질 수 있음. 하루에 접속을 여러번 하는 서비스 => session 기반이 좋을 수도 있고, 아니라고 하면 일자별 유저 집계가 나을 수도 있다. FROM ( SELECT * , CASE WHEN prev_event_datetime IS NULL THEN 1 WHEN second_diff >= 20 THEN 1 # session을 나누는 기준 초, 데이터를 탐색하면서 결정. 보통 앱 로그에서는 30초, 60초 ELSE NULL -- ELSE 0 사용 가능 END AS session_start # session이 시작됨을 알리는 session_start FROM diff_data ) ORDER BY event_datetime -- 세션 정리 -- 이전 이벤트 로그와 현재 이벤트 로그의 diff => 초나 분을 구한다 -- 그 기준을 가지고 기준보다 높으면 새로운 세션이라고 생각한다 -- 첫번째 값엔 null이 있을 수 있어서, 이 부분도 챙겨야 한다 -- 새로운 세션, session_start 값을 기반으로 누적합 => session_num이 된다!
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습 문제
윈도우 함수 - 탐색함수 연습 문제1번-- 1번. user들의 다음 접속 월과 다다음 접속 월을 구해라. SELECT user_id, visit_month, LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS visit_after_one_month, LEAD(visit_month, 2) OVER (PARTITION BY user_id ORDER BY visit_month) AS visit_after_two_month FROM advanced.analytics_function_01 ORDER BY user_id; -- LEAD 함수는 항상 ORDER BY가 필요함!=> LEAD는 순서를 알아야 동작하는 함수이기 때문에~!!2번-- 2번. user들의 다음 접속 월과 다다음 접속 월, 이전 접속 월을 구해라. SELECT user_id, visit_month, LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS visit_after_one_month, LEAD(visit_month, 2) OVER (PARTITION BY user_id ORDER BY visit_month) AS visit_after_two_month, LAG(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS visit_before_one_month FROM advanced.analytics_function_01 ORDER BY user_id;3번-- 3번. user가 접속했을 때, 다음 접속까지의 간격을 구해라. -- LAG으로 이전 월을 구한 뒤, 현재 행과 뺄셈 진행하기.. (방문 텀을 구하는 것이므로.. 현재 달에 이전 월을 빼는게 더 맞는 방향이라 생각함....!) WITH base AS( SELECT *, LAG(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS before_visit_month FROM advanced.analytics_function_01 ) SELECT *, (visit_month - before_visit_month) AS visit_interval FROM base; -- 피연산자에 null값이 포함되어 있으면 해당 연산의 결과는 NULL이 된다. -- 3번을 서브쿼리 없이 짠다면.. -- SELECT -- *, -- LAG(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS before_visit_month, -- visit_month - LAG(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS gap -- FROM advanced.analytics_function_01 -- 하지만!! 여기서 만약 before_visit_month에 대한 내용을 수정한다고하면.. before_visit_month와 gap 두 부분을 수정해야한다... => 과연 괜찮은 쿼리일까? -- 쿼리가 길어지는 것을 무서워하지 말고, 쿼리를 덜 수정할 수 있는 구조를 만드는게 더 좋음!!!추가문제-- 추가문제 -- 이 데이터셋을 기준으로 user_id의 첫 번째 방문 월, 마지막 방문 월을 구해라. SELECT *, FIRST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS first_visit_month, LAST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS last_visit_month FROM advanced.analytics_function_01윈도우 함수 - FRAME 연습 문제1번-- 1)우리회사의모든주문량은? -- amount_total SELECT *, SUM(amount) OVER (ORDER BY order_id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS amount_total FROM advanced.orders;2번-- 2)특정주문시점에서누적주문량은? -- cumulative_sum SELECT *, SUM(AMOUNT) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum FROM advanced.orders;3번-- 3)고객별주문시점에서누적주문량은? -- cumulative_sum_by_uesr SELECT *, SUM(AMOUNT) OVER (PARTITION BY user_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum_by_uesr FROM advanced.orders;4번-- 4)최근직전5개의평균주문량은? -- last_5_orders_avg_amount SELECT *, SUM(AMOUNT) OVER (ORDER BY order_date ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_orders_avg_amount FROM advanced.orders; -- 정석 답 SELECT *, SUM(amount) OVER() AS amount_total, SUM(amount) OVER(ORDER BY order_id) AS cumulatvie_sum, SUM(amount) OVER(PARTITION BY user_id ORDER BY order_id) AS cumulatvie_sum_by_user, AVG(amount) OVER(ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_orders_avg_amount FROM advanced.orders ORDER BY user_id, order_date; 윈도우 함수 연습 문제1번# 1. 사용자별 쿼리를 실행한 총 횟수를 구해라. SELECT *, COUNT(*) OVER (PARTITION BY user) AS total_query_cnt FROM advanced.query_logs;2번# 2. 주차별로 팀 내에게 쿼리를 가장 많이 사용한 사람들 보여주세요. -- week_number | team | user | query_cnt | team_rank -- week_number는 format_date("%U", query_date) 이용 -- 팀 내에서 쿼리를 가장 많이 사용한 사람.. -- 한 사람은 하나의 팀에만 속해있음 -- => Rank구할 때 PARTITION으로 팀으로 나누고 ORDER BY 에 query_cnt한 다음에.. 근데 이때 주차로도 나눠져야함.... PARTITION에 두 가지가 가능할까? => 애초에 주차별로 나눠서 count했어야 함..! -- 혼자 풀어본 쿼리 WITH base AS ( SELECT *, FORMAT_DATE("%U", query_date) AS week_number, COUNT(*) OVER (PARTITION BY user, FORMAT_DATE("%U", query_date)) AS query_cnt FROM advanced.query_logs ) SELECT DISTINCT week_number, team, user, query_cnt, RANK() OVER (PARTITION BY team ORDER BY query_cnt DESC) AS team_rank FROM base QUALIFY team_rank = 1 ORDER BY week_number, team; -- 문제 의도.. 에 철저하게 당했(?)다 -- window연습문제니까 window만 생각하게 됨...! -- 2번 정석 정답 -- 팀, 주차별 실행 횟수 count하기 -- 위의 결과를 바탕으로 랭킹 구한 뒤, QUALFY로 1등만 뽑기 -- WITH count_of_week AS ( -- SELECT -- EXTRACT(WEEK FROM query_date) AS week_number, -- team, -- user, -- COUNT(query_date) AS query_cnt -- FROM advanced.query_logs -- GROUP BY ALL -- ) -- SELECT -- *, -- RANK() OVER (PARTITION BY week_number, team ORDER BY query_cnt DESC) AS team_rank -- FROM count_of_week -- QUALIFY team_rank = 1 -- ORDER BY week_number, team3번# 3번. 2번 문제에서 사용한 주차별 쿼리를 사용하여 쿼리 실행시점 기준 1주 전 쿼리 실행 수를 별도의 컬럼을 확인할 수 있는 쿼리 작성하기 -- user | team | week_number | query_count | prev_week_query_count -- from count_of_week -- LAG(query_count) OVER (PARTITION BY user ORDER BY week_number) SELECT user, team, week_number, query_cnt, LAG(query_cnt) OVER (PARTITION BY user ORDER BY week_number) AS prev_week_qeury_cnt FROM count_of_week ORDER BY user, week_number;4번# 4번. 시간의 흐름에 따라 일자별로 유저가 실행한 누적 쿼리수를 작성해라. -- count -- frame설정 - ROWS UNBOUNDED PRECEDING AND CURRENT ROW -- 혼자 풀어본 리 SELECT *, COUNT(user) OVER (PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_query_cnt FROM advanced.query_logs ORDER BY user, query_date; -- 일자별 흐름. 일자별 누적합.. 먼저 "일자별"로 합을 구해뒀어야했음. -- 정석 쿼리 -- SELECT -- *, -- SUM(query_count) OVER (PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_query_cnt -- FROM ( -- SELECT -- *, -- COUNT(*) AS query_count -- FROM `advanced.query_logs` -- GROUP BY ALL) -- ORDER BY user, query_date 5번# 5번. 다음 데이터는 주문 횟수를 나타낸 데이터임. 만약 주문 횟수가 없으면 NULL이 기록된다. NULL로 되어있는 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해라. -- WITH raw_data AS ( -- SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL -- SELECT DATE '2024-05-02', 13 UNION ALL -- SELECT DATE '2024-05-03', NULL UNION ALL -- SELECT DATE '2024-05-04', 16 UNION ALL -- SELECT DATE '2024-05-05', NULL UNION ALL -- SELECT DATE '2024-05-06', 18 UNION ALL -- SELECT DATE '2024-05-07', 20 UNION ALL -- SELECT DATE '2024-05-08', NULL UNION ALL -- SELECT DATE '2024-05-09', 13 UNION ALL -- SELECT DATE '2024-05-10', 14 UNION ALL -- SELECT DATE '2024-05-11', NULL UNION ALL -- SELECT DATE '2024-05-12', NULL -- ), filled_null AS (SELECT -- date, -- IF(number_of_orders IS NULL, LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), number_of_orders) AS number_of_orders -- FROM raw_data) -- frame을 위와 이 설정해주지 않아도 됐음!! -- 어짜피 null값을 제외한 마지막 값을 리턴하니까..!!!!!6번# 6번. 5번 문제에서 null을 채운 후, 2일 전~ 현재 데이터의 평균을 구하는 쿼리를 작성해라. SELECT *, AVG(number_of_orders) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM filled_null7번# 7번. app_logs 테이블에서 custom session을 만들어라.이전 이벤트 로그와 20초가 지나면 새로운 session을 만들 것. -- event_date | event_timestamp | event_datetime | event_name | user_id | user_pesudo_id | before_event_datetime | second_diff | session_start | session_id -- event_datetime, before_event_datetime, second_diff, session_start, session_id 컬럼을 생성해야한다. -- timestamp_micros()이용하여 timestamp만들고 datetime으로 'Asia/Seoul' -- LAG로 before_datetime -- datetime_diff 함수로 secound_diff -- session_start가 1 아니면 null -- WITH base AS ( -- SELECT -- event_date, -- event_timestamp, -- DATETIME(timestamp_micros(event_timestamp), 'Asia/Seoul') as event_datetime, -- event_name, -- user_id, -- user_pseudo_id, -- DATETIME(timestamp_micros(LAG(event_timestamp) OVER (ORDER BY event_timestamp)), 'Asia/Seoul') AS before_event_datetime, -- from advanced.app_logs -- where event_date = '2022-08-18' -- AND user_pseudo_id = -- '1997494153.849199901' -- ) -- SELECT -- *, -- datetime_diff(event_datetime, before_event_datetime, second) AS second_diff, -- IF(datetime_diff(event_datetime, before_event_datetime, second) >= 20, 1, NULL) AS session_start -- from base -- ORDER BY event_timestamp -- 첫 번째 값의 NULL값 처리 못함 -- session_id 설정 못함.. -- 정석 쿼리 WITH base AS ( SELECT event_date, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id FROM advanced.app_logs WHERE event_date = '2022-08-18' AND user_pseudo_id = '1997494153.8491999091' ), diff_data AS( SELECT *, DATETIME_DIFF(event_datetime, prev_event_datetime, SECOND) AS second_diff FROM ( SELECT *, LAG(event_datetime, 1) OVER (PARTITION BY user_pseudo_id ORDER BY event_datetime) AS prev_event_datetime, FROM base ) ) SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_id FROM ( SELECT *, CASE WHEN prev_event_datetime IS NULL THEN 1 WHEN second_diff >= 20 THEN 1 ELSE NULL END AS session_start FROM diff_data ) ORDER BY event_datetime 기억할 것윈도우 함수에 ORDER BY를 설정한다고 쿼리 결과값이 정렬되는 것은 아니다.(단지 함수를 위한 정렬일 뿐!) FROM 다음에 ORDER BY를 써줘야 쿼리 결과가 정렬되어 나타난다.frame 설정 없이도 기본값은 UNBOUNDED PRECEDING AND CURRENT ROW와 값이 같다. 하지만, 명시적 표현으로 해주는 것이 더 좋다LAG의 옵션으로 IGNORE NULLS는 없다. NULL을 구분하여 처리하고싶다면 FIRST_VALUE, LAST_VALUE 등의 함수를 이용해야 한다.세션을 구하려면 쿼리가 많아지는건 어쩔 수 없는 일! (쿼리가 많아진다고 겁 먹지 말고 일단 단계별로 진행해보자)하루 방문 횟수가 많다 하면 세션 기반이 좋을 수 있고 아니라면 그냥 일자별 유저 집계가 나을 수도 있다. (session 범위 기준은 데이터를 탐색하며 결정)time_diff 함수 사용할 때, 입력되는 데이터 타입에 맞춰 함수 이름을 써줘야 한다. (입력되는 데이터 타입이 timestamp라면 timestamp_diff 함수를 사용해야 하고, datetime이라면 datetime_diff 함수를 사용해야 한다.)'어떤 타이밍에 서브쿼리를 이용하는 것이 좋은 타이밍일까...' 를 고민했는데, 그냥 써야할 때 쓰면 되겠다고 생각했다. (코드가 길어지는 것을 두려워하지 말자~!)
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝 스터디 2주차 과제] 윈도우 함수 연습 문제
#1SELECT user_id, visit_month, LEAD(visit_month,1) OVER(PARTITION BY user_id ORDER BY visit_month) AS next_month, LEAD(visit_month,2) OVER(PARTITION BY user_id ORDER BY visit_month) AS the_month_after_next FROM `avdanced.analytics_function_01`#2SELECT user_id, visit_month, LEAD(visit_month,1) OVER(PARTITION BY user_id ORDER BY visit_month) AS next_month, LEAD(visit_month,2) OVER(PARTITION BY user_id ORDER BY visit_month) AS the_month_after_next, LAG(visit_month,1) OVER(PARTITION BY user_id ORDER BY visit_month) AS last_month FROM `avdanced.analytics_function_01` FrameSELECT SUM(amount) OVER() AS amount_total, #SUM(amount) OVER(partition by order_date) AS cumulative_sum, SUM(amount) OVER (ORDER BY order_date) AS cumulative_sum, #SUM(amount) OVER(partition by user_id) AS cumulative_sum_by_user, SUM(amount) OVER(partition by user_id ORDER BY order_id) AS cumulative_sum_by_user, AVG(amount) OVER(ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_orders_avg_amount, AVG(amount) OVER(ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS last_5_unbounded_orders_avg_amount, AVG(amount) OVER(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS all_orders_avg_amount FROM `avdanced.orders` 2)#1SELECT user, team, query_date, count(user) over(PARTITION BY user) AS total_query_cnt FROM advanced.query_logs ORDER BY query_date, user#2WITH query_cnt_by_team AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, RANK() OVER (PARTITION BY week_number, team ORDER BY query_cnt DESC) AS rk FROM query_cnt_by_team QUALIFY rk = 1 ORDER BY week_number, team, query_cnt DESC#3WITH query_cnt_by_team AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT *, LAG(query_cnt, 1) OVER (PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt FROM query_cnt_by_team ORDER BY user, week_number#4SELECT query_date, team, user, query_cnt, SUM(query_cnt) OVER (PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum FROM ( SELECT query_date, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) ORDER BY user, query_date#5WITH raw_data AS( SELECT DATE'2024-05-01'AS date,15 AS number_of_orders UNION ALL SELECT DATE'2024-05-02', 13 UNION ALL SELECT DATE'2024-05-03', NULL UNION ALL SELECT DATE'2024-05-04', 16 UNION ALL SELECT DATE'2024-05-05', NULL UNION ALL SELECT DATE'2024-05-06', 18 UNION ALL SELECT DATE'2024-05-07', 20 UNION ALL SELECT DATE'2024-05-08', NULL UNION ALL SELECT DATE'2024-05-09', 13 UNION ALL SELECT DATE'2024-05-10', 14 UNION ALL SELECT DATE'2024-05-11', NULL UNION ALL SELECT DATE'2024-05-12', NULL ) SELECT date, IF(number_of_orders is null , last_value(number_of_orders IGNORE NULLS) OVER(ORDER BY date asc), number_of_orders) AS number_of_orders_not_null FROM raw_data;#6WITH raw_data AS( SELECT DATE'2024-05-01'AS date,15 AS number_of_orders UNION ALL SELECT DATE'2024-05-02', 13 UNION ALL SELECT DATE'2024-05-03', NULL UNION ALL SELECT DATE'2024-05-04', 16 UNION ALL SELECT DATE'2024-05-05', NULL UNION ALL SELECT DATE'2024-05-06', 18 UNION ALL SELECT DATE'2024-05-07', 20 UNION ALL SELECT DATE'2024-05-08', NULL UNION ALL SELECT DATE'2024-05-09', 13 UNION ALL SELECT DATE'2024-05-10', 14 UNION ALL SELECT DATE'2024-05-11', NULL UNION ALL SELECT DATE'2024-05-12', NULL ), filled_data AS ( SELECT * EXCEPT(number_of_orders), LAST_VALUE(number_of_orders IGNORE NULLS) OVER (ORDER BY date) AS number_of_orders FROM raw_data ) SELECT *, AVG(number_of_orders) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM filled_data#7WITH base AS( SELECT event_date, event_timestamp, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id, DATETIME(TIMESTAMP_MICROS(LAG(event_timestamp) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp)), 'Asia/Seoul') AS before_event_datetime FROM advanced.app_logs WHERE event_date = '2022-08-18' AND user_pseudo_id = '1997494153.8491999091' ), session_info AS( SELECT *, TIMESTAMP_DIFF(event_datetime, before_event_datetime, SECOND) AS second_diff, CASE WHEN TIMESTAMP_DIFF(event_datetime, before_event_datetime, SECOND) >= 20 OR TIMESTAMP_DIFF(event_datetime, before_event_datetime, SECOND) IS NULL THEN 1 ELSE NULL END AS session_start FROM base ) SELECT *, SUM(session_start) OVER (PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_id FROM session_info ORDER BY event_date, event_timestamp;
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습문제
-- 함수(컬럼) OVER (PARTITION BY 파티션 컬럼 ORDER BY 정렬할 컬럼) AS -- LEAD, LAG -> 함수(컬럼,순서). 순서를 명시하지 않으면 디폴트 1 -- 1) select *, LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS lead_visit_month, LEAD(visit_month,2) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS lead2_visit_month FROM advanced.analytics_function_01 ORDER BY user_id ; -- 여기서 *를 쓰고 진행하는 이유가 있나? -- 2) select *, LEAD(visit_month,1) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS after_visit_month, LEAD(visit_month,2) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS after_visit_month, LAG(visit_month,1) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS before_visit_month FROM advanced.analytics_function_01 ORDER BY user_id, visit_month ; -- 3) select *, LEAD(visit_month,1) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS after_visit_month, LEAD(visit_month,1) OVER (PARTITION BY user_id ORDER BY visit_month ASC) - visit_month FROM advanced.analytics_function_01 ORDER BY user_id, visit_month ; -- 서브쿼리 SELECT *, after_visit_month-visit_month AS diff FROM ( select *, LEAD(visit_month,1) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS after_visit_month FROM advanced.analytics_function_01 ) ORDER BY user_id, visit_month -- 쿼리가 길어지는 것을 무서워하지말고 쿼리를 덜 수정하는 방향으로 하자 -- # 추가문제 : 이 데이터셋 기준으로 user_id의 첫번째 방문월, 마지막방문월 구하는 쿼리 작성하기 first_value, last_value SELECT *, FIRST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS first_visit_month, LAST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS last_visit_month FROM advanced.analytics_function_01 ORDER BY user_id, visit_month ; -- 4) 시간의 흐름에 따라, 일자별로 유저가 실행한 누적 쿼리 수를 작성해주세요 SELECT *, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date) AS cumulative_sum, SUM(query_cnt) OVER(PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum2 FROM ( SELECT query_date, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) ORDER BY user, query_date -- 5) WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07' , 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL UNION ALL ) SELECT *, LAST_VALUE(number_of_order IGNORE NULLS) OVER (ORDER BY date) AS last_value_orders FROM raw_data -- 6) WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07' , 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ), filled_data AS ( SELECT * EXCEPT(number_of_orders), **LAST_VALUE**(number_of_orders **IGNORE NULLS**) OVER (ORDER BY date) AS number_of_orders FROM raw_data ) SELECT *, AVG(number_of_orders) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM filled_data -- 7) app_logs 테이블에서 Custom Session을 만들어 주세요. 이전 이벤트 로그와 20초가 지나면 새로운 Session을 만들어 주세요. Session은 숫자로 (1, 2, 3 …) 표시해도 됩니다 -- 2022-08-18일의 user_pseudo_id(1997494153.8491999091)은 session_id가 4까지 나옵니다 WITH base AS ( SELECT event_date, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id -- event_params가 필요하면 UNNEST => PIVOT해서 사용하면 됨 FROM advanced.app_logs WHERE event_date = "2022-08-18" AND user_pseudo_id = "1997494153.8491999091" -- WHERE 조건에 이벤트를 필터링해서 계산해도 괜찮음. 여기서는 필터링을 하지 않고, 진행 ), diff_data AS ( SELECT *, DATETIME_DIFF(event_datetime, prev_event_datetime, SECOND) AS second_diff # second_diff 기반으로 새로운 세션의 시작일지, 아닐지를 판단할 수 있음 FROM ( SELECT *, LAG(event_datetime, 1) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS prev_event_datetime # event_datetime이랑 prev_event_datetime을 빼서 20초가 넘으면 새로운 세션으로 정의. # 20초가 넘지 않으면 기존 세션 -- DATETIME_DIFF() => 차이를 구할 수 있음 FROM base ) ) SELECT *, # 누적합을 사용해서 session_number를 만들었다! SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_num # session을 구할 때 쿼리가 길어질 수 있음. 하루에 접속을 여러번 하는 서비스 => session 기반이 좋을 수 있고, 아니라고 하면 일자별 유저 집계가 나을 수 있다 FROM ( SELECT *, CASE WHEN prev_event_datetime IS NULL THEN 1 WHEN second_diff >= 30 THEN 1 # session을 나누는 기준 초. 데이터를 탐색하면서 결정. 보통 앱 로그에서는 30초, 60초 ELSE 0 END AS session_start # session이 시작됨을 알리는 session_start FROM diff_data ) ORDER BY event_datetime
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습 문제
1. 윈도우 함수 탐색 함수 연습 문제1-1) 유저의 다음 접속 월과 다다음 접속 월 구하기SELECT user_id , visit_month , LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS lead_visit_month , LEAD(visit_month, 2) OVER (PARTITION BY user_id ORDER BY visit_month) AS lead2_visit_month FROM advanced.analytics_function_01 1-2) 이전 접속월 구하기SELECT user_id , visit_month , LAG(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS before_visit_month FROM advanced.analytics_function_01 1-3) 다음 접속 월까지의 간격 구하기SELECT * , after_visit_month - visit_month AS diff_month FROM ( SELECT user_id , visit_month , LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_visit_month FROM advanced.analytics_function_01 ) 1-4) 첫번째와 마지막 방문 월 구하기SELECT user_id , visit_month , FIRST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS first_visit_month , LAST_VALUE(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month) AS last_visit_month FROM advanced.analytics_function_01 2. 윈도우 함수 Frame 연습 문제SELECT * , SUM(amount) OVER () AS amount_total , SUM(amount) OVER (ORDER BY order_id) AS cumulative_sum , SUM(amount) OVER (PARTITION BY user_id ORDER BY order_id) AS user_cumulative_sum , AVG(amount) OVER (ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS last_5_avg FROM advanced.orders 3. 윈도우 함수 연습 문제1-1) 사용자 별 쿼리를 실행한 총 횟수 구하기SELECT user , team , query_date , COUNT(*) OVER(PARTITION BY user) AS total_query_cnt FROM advanced.query_logs ORDER BY user, query_date. -- 검증 1-2) 주차 별로 팀 내에서 쿼리를 많이 실행한 수, 단 랭킹이 1등인 경우만 구하기SELECT week_number , team , user , query_cnt , RANK() OVER(PARTITION BY week_number, team ORDER BY query_cnt DESC) AS team_rank FROM ( SELECT EXTRACT(WEEK FROM query_date) AS week_number , team , user , COUNT(query_date) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) QUALIFY team_rank=1 ORDER BY week_number 1-3) 쿼리 실행 시점 기준 1주 전 쿼리 실행 수 구하기WITH query_cnt_by_team AS ( SELECT EXTRACT(WEEK FROM query_date) AS week_number , team , user , COUNT(query_date) AS query_cnt FROM advanced.query_logs GROUP BY EXTRACT(WEEK FROM query_date) , team , user ) SELECT * , LAG(query_cnt) OVER(PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt FROM query_cnt_by_team ORDER BY user 1-4) 시간의 흐름에 따라 일자 별 유저가 실행한 누적 쿼리 수 구하기WITH query_cnt_by_user AS ( SELECT user , team , query_date , COUNT(query_date) AS query_count FROM advanced.query_logs GROUP BY ALL ) SELECT user , team , query_date , query_count , SUM(query_count) OVER(PARTITION BY user ORDER BY query_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_query_cnt -- , SUM(query_count) OVER(PARTITION BY user ORDER BY query_date) AS cumulative_query_cnt2 FROM query_cnt_by_user -- 검증 -- QUALIFY cumulative_query_cnt != cumulative_query_cnt2 ORDER BY user, query_date*집계 분석 함수를 사용하고 ORDER BY가 있는 경우 Frame의 디폴트 값은 UNBOUNDED PRECEDING ~ CURRENT ROW 이다. 1-5) 값이 null인 경우 직전 값으로 채우기null 값이 2번 연속 있는 경우가 있어 COALESCE를 사용했습니다.WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) SELECT date -- , number_of_orders , COALESCE( number_of_orders , LAG(number_of_orders, 1) OVER(ORDER BY date) , LAG(number_of_orders, 2) OVER(ORDER BY date) ) AS number_of_orders FROM raw_data 성윤님의 풀이를 보고 LAST_VALUE + IGNORE NULLS를 사용하면 연속되는 null의 수에 상관없이 직전 값을 채울 수 있어 깔끔하다는 생각이 들었습니다!SELECT * , LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS last_value_orders FROM raw_data 1-6) 같은 데이터 셋에서 1일 전 부터 현재까지의 평균 구하기SELECT * , AVG(last_value_orders) OVER(ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM ( SELECT date , LAST_VALUE(number_of_orders IGNORE NULLS) OVER(ORDER BY date) AS last_value_orders FROM raw_data ) 1-7) 20초를 기준으로 세션 아이디 구하기WITH base AS ( SELECT event_date , user_pseudo_id , event_name , DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime FROM advanced.app_logs WHERE user_pseudo_id='1997494153.8491999091' AND event_date="2022-08-18" ORDER BY event_timestamp ) , base_second_dff AS ( SELECT * , DATETIME_DIFF(event_datetime, before_event_datetime, SECOND) AS second_diff FROM ( SELECT * , LAG(event_datetime) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) before_event_datetime FROM base ) ) , result AS( SELECT * -- CASE 문 사용 가능 , IF(second_diff - before_second_diff > 20 OR second_diff IS NULL, 1, null) AS session_start FROM ( SELECT * , LAG(second_diff) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS before_second_diff FROM base_second_dff ) ) SELECT * , SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_id FROM result ✔ 느낀 점프레임은 처음 보는 개념이라 초반에는 어색했지만 문제를 풀면서 익숙해질 수 있었다. 누적합을 구할 때 등 유용하게 사용할 것 같다."쿼리가 길어지는 것을 무서워하지말고 쿼리를 잘 수정할 수 있는 구조를 만들자" 일할 때에도 동료들이 수정하고 테스트하기 쉬운 구조인가?를 생각해보자.CTE의 이름을 짓는건 언제나 고민된다. 서브쿼리를 적절하게 사용하자! ✔ 새롭게 알게된 점ROW_NUMBER를 사용할 때 중복값이 있으면 순서가 보장되지 않을 수 있다. 이때는 OVER 안에서 id를 사용하자.FIRST_VALUE는 MIN과 같이 동작하지만, LAST_VALUE는 MAX가 아니다.전체 집계 값을 구하고 싶을 때에는 OVER 안에 아무것도 안 쓰면 된다.윈도우 함수에 조건을 걸고 싶을 땐 QUALIFY를 사용하자. 연속된 NULL값을 채우는 경우 LAST_VALUE와 IGNORE NULLS조합을 사용하자.
-
미해결BigQuery(SQL) 활용편(퍼널 분석, 리텐션 분석)
[빠짝스터디 2주차 과제] 윈도우 함수 연습 문제
1. 윈도우 함수 연습문제 풀이문제 1)사용자별 쿼리를 실행한 총 횟수를 구하는 쿼리를 작성. 단, GROUP BY를 사용해서 집계하는 것이 아닌 query_logs의 데이터 우측에 새로운 컬럼을 만들어주세요SELECT user, team, query_date, COUNT(user) OVER (PARTITION BY user) AS total_query_cnt FROM advanced.query_logs ORDER BY query_date, team, user ;문제 2)주차별로 팀 내에서 많이 실행한 수를 구한 후, 실행한 수를 활용해 랭킹을 구해주세요. 단, 랭킹이 1등인 사람만 결과가 보이도록 해주세요.WITH weekly_team_query_cnt AS ( SELECT EXTRACT(week FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT week_number, team, user, query_cnt, ROW_NUMBER() OVER (PARTITION BY week_number, team ORDER BY query_cnt DESC) AS team_rank FROM weekly_team_query_cnt WHERE 1=1 QUALIFY team_rank = 1 ORDER BY week_number, team ;문제 3)(2번 문제에서 사용한 주차별 쿼리 사용) 쿼리를 실행한 시점 기준 1주 전에 쿼리 실행 수를 별도의 컬럼으로 확인할 수 있는 쿼리를 작성해주세요WITH weekly_team_query_cnt AS ( SELECT EXTRACT(week FROM query_date) AS week_number, team, user, COUNT(user) AS query_cnt FROM advanced.query_logs GROUP BY ALL ) SELECT user, team, week_number, query_cnt, LAG(query_cnt, 1) OVER (PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt FROM weekly_team_query_cnt ORDER BY user, team, week_number ;문제 4)시간의 흐름에 따라, 일자별로 유저가 실행한 누적 쿼리 수를 작성WITH timeseries_user_query_count AS ( SELECT user, team, query_date, COUNT(user) AS query_count FROM advanced.query_logs GROUP BY ALL ) SELECT user, team, query_date, query_count, SUM(query_count) OVER(PARTITION BY user ORDER BY query_date) AS cumulative_query_count FROM timeseries_user_query_count문제 5)다음 데이터는 주문 횟수를 나타낸 데이터입니다. 만약 주문 횟수가 없으면 NULL로 기록됩니다. 이런 데이터에서 NULL 값이라고 되어있는 부분을 바로 이전 날짜의 값으로 채워주는 쿼리를 작성해주세요.WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) SELECT date, LAST_VALUE(number_of_orders IGNORE NULLS) OVER (ORDER BY date) AS number_of_orders FROM raw_data ;문제 6)5번 문제에서 NULL을 채운 후, 2일 전 ~ 현재 데이터의 평균을 구하는 쿼리를 작성해주세요(이동 평균)WITH raw_data AS ( SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL SELECT DATE '2024-05-02', 13 UNION ALL SELECT DATE '2024-05-03', NULL UNION ALL SELECT DATE '2024-05-04', 16 UNION ALL SELECT DATE '2024-05-05', NULL UNION ALL SELECT DATE '2024-05-06', 18 UNION ALL SELECT DATE '2024-05-07', 20 UNION ALL SELECT DATE '2024-05-08', NULL UNION ALL SELECT DATE '2024-05-09', 13 UNION ALL SELECT DATE '2024-05-10', 14 UNION ALL SELECT DATE '2024-05-11', NULL UNION ALL SELECT DATE '2024-05-12', NULL ) SELECT date, number_of_orders, AVG(number_of_orders) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM ( SELECT * EXCEPT(number_of_orders), LAST_VALUE(number_of_orders IGNORE NULLS) OVER (ORDER BY date) AS number_of_orders FROM raw_data ) ;문제 7)** 새로운 Session 부여하기본인 해결 쿼리) WITH create_new_session_app_logs_raw_data AS ( SELECT *, LAG(event_datetime) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS before_event_datetime FROM ( SELECT event_date, event_timestamp, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id, FROM advanced.app_logs WHERE event_date = '2022-08-18' AND user_pseudo_id = '1997494153.8491999091' ) ), create_new_session_app_logs AS ( SELECT *, IF(second_diff is null OR second_diff > 20, 1, null) AS session_start FROM ( SELECT *, EXTRACT(second FROM event_datetime - before_event_datetime) AS second_diff FROM create_new_session_app_logs_raw_data ) ) SELECT *, SUM(session_start) OVER (ORDER BY event_timestamp) AS session_id FROM create_new_session_app_logs ;카일스쿨님 해결 쿼리)본인 해결 쿼리와의 차이점 second_diff 새로운 컬럼 생성 시 DATETIME_DIFF(col1, col2, 시간단위)라는 함수 적용CASE WHEN문을 적용하여 직관적이고 수정하기 용이한 형태의 쿼리 사용 WITH base AS ( SELECT event_date, DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Asia/Seoul') AS event_datetime, event_name, user_id, user_pseudo_id, FROM advanced.app_logs WHERE event_date = "2022-08-18" AND user_pseudo_id = "1997494153.8491999091" ORDER BY event_timestamp ), diff_data AS ( SELECT *, DATETIME_DIFF(event_datetime, prev_event_datetime, SECOND) AS second_diff # second_diff 기반으로 새로운 세션의 시작일지, 아닐지를 판단할 수 있음 FROM ( SELECT *, LAG(event_datetime, 1) OVER (PARTITION BY user_pseudo_id ORDER BY event_datetime) AS prev_event_datetime, # event_datetime이랑 prev_event_datetime을 뺴서 20초가 넘으면 새로운 세션으로 정의 # 20초가 넘지 않으면 기존 세션 FROM base ) ) SELECT *, # 누적합을 사용해서 session_number를 만들었다! SUM(session_start) OVER (PARTITION BY user_pseudo_id ORDER BY event_datetime) AS session_num # session을 구할 때 쿼리가 길어질 수 있음. 하루에 접속을 여러번 하는 서비스 => session기반이 좋을 수 있고, 아니라고 한다면 일자별 집계가 나을 수 있음 FROM ( SELECT *, CASE WHEN prev_event_datetime IS NULL THEN 1 WHEN second_diff >= 20 THEN 1 # session을 나누는 기준 초. 데이터를 탐색하면서 결정 ELSE NULL END AS session_start # session이 시작됨을 알리는 session_start FROM diff_data ) ORDER BY event_datetime ;