#1 - Kaggle Data Survey

์ด๋ฒˆ ๊ธ€์—์„œ๋Š” ๊ฐ„๋‹จํ•œ BI(Business Intelligence) ๋ถ„์„์„ ์ง„ํ–‰ํ•˜๋ฉฐ ๋น„์ฆˆ๋‹ˆ์Šค ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต์„ ์ฐพ์•„๊ฐ€๋Š” ๊ณผ์ •์„ ๋‹ค๋ค„๋ณด๊ณ ์ž ํ•œ๋‹ค.

BigQuery ์–ธ์–ด์— ์ต์ˆ™ํ•˜์ง€ ์•Š๋”๋ผ๋„ ์šฐ์„ ์€ ๋ถ„์„ํ๋ฆ„์— ์ต์ˆ™ํ•ด์ง€๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ์ œ์‹œ๋œ ์ฟผ๋ฆฌ๋“ค์„ ํ•˜๋‚˜์”ฉ ์‹คํ–‰ํ•˜๋ฉฐ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•ด ๋ณด๋„๋ก ํ•˜์ž.

โ–  ์‹ค์Šต ์ฃผ์ œ (Hands-On Topic)

Kaggle์—์„œ๋Š” ํ•ด๋งˆ๋‹ค ๊ฒฝ์ง„๋Œ€ํšŒ์— ์ฐธ์—ฌํ•˜๋Š” ๋ฐ์ดํ„ฐ ๊ณผํ•™์ž๋“ค์„ ๋Œ€์ƒ์œผ๋กœ ์„ค๋ฌธ์„ ์‹ค์‹œํ•˜๋ฉฐ ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ๊ณต์œ ํ•œ๋‹ค. ์„ค๋ฌธ์—๋Š” ๋ฐ์ดํ„ฐ ๊ณผํ•™์ž๋“ค์˜ ๊ฐ„๋‹จํ•œ ํ”„๋กœํŒŒ์ผ ํ†ต๊ณ„์™€ ๊ฒฝ๋ ฅ ๊ธฐ๊ฐ„, ๊ธ‰์—ฌ, ๊ต์œก ์ˆ˜์ค€ ๋ฐ ์„ ํ˜ธ ๋„๊ตฌ ๋“ฑ ๋‹ค์–‘ํ•œ ํ•ญ๋ชฉ์ด ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.

kaggle_survey

์ž์„ธํ•œ ๋‚ด์šฉ์€ ์•„๋ž˜ ๋งํฌ๋ฅผ ์ฐธ์กฐํ•˜์ž.

์—ฌ๋Ÿฌ ์„ค๋ฌธ ํ•ญ๋ชฉ ์ค‘ ์ด๋ฒˆ ํ•ธ์ฆˆ์˜จ์—์„œ๋Š” ์•„๋ž˜์™€ ๊ฐ™์€ ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต์„ ์ฐพ๊ณ ์ž ํ•œ๋‹ค.

โ–ท ํ•ต์‹ฌ ๋น„์ฆˆ๋‹ˆ์Šค ์งˆ๋ฌธ๋“ค(Key Business Questions)

  1. 2021๋…„๋„์— ๋ถ„์„๊ฐ€๋“ค๋กœ๋ถ€ํ„ฐ ๊ฐ€์žฅ ๋งŽ์ด ์„ ํƒ๋ฐ›์€ ์–ธ์–ด๋Š”?

  2. Kaggle ๋ถ„์„๊ฐ€๋“ค์€ ๋ช‡ ๊ฐ€์ง€ ์–ธ์–ด๋กœ ๋ถ„์„์„ ์ง„ํ–‰ํ• ๊นŒ?

  3. ์–ด๋–ค ์–ธ์–ด์กฐํ•ฉ์ด ๊ฐ€์žฅ ๋งŽ์ด ์„ ํ˜ธ๋˜๊ณ  ์žˆ์„๊นŒ?

โ–  Hands-On

์•ž์„œ์˜ ๋น„์ฆˆ๋‹ˆ์Šค ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต์„ ์ฐพ๋Š” ๊ณผ์ •์—๋Š” BigQuery์— ๋ฐ์ดํ„ฐ๋ฅผ ์ ์žฌํ•˜๊ณ  ๋ถ„์„์ด ์šฉ์ดํ•œ ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ELT(Extract-Load-Transform) ๊ณผ์ •์ด ์ˆ˜๋ฐ˜๋œ๋‹ค.

โ–ท ๋ฐ์ดํ„ฐ ์ค€๋น„ (Data Preparation)

์šฐ์„  Kaggle ์‚ฌ์ดํŠธ์—์„œ ์„ค๋ฌธ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์šด๋กœ๋“œ ๋ฐ›๊ณ  ๊ฐœ๋žต์ ์œผ๋กœ ๋ฐ์ดํ„ฐ์˜ ๋‚ด์šฉ๊ณผ ํ˜•์‹์„ ํŒŒ์•…ํ•˜๋„๋ก ํ•œ๋‹ค.

  1. Data Source - Kaggle Survey 2021 ์—์„œ ์›๋ณธ ํŒŒ์ผ์„ ๋‹ค์šด๋กœ๋“œํ•˜๊ณ  ์••์ถ•์„ ํ•ด์ œํ•œ๋‹ค.

  2. EDA(๋ฐ์ดํ„ฐํƒ์ƒ‰)

    • kaggle_survey_2021_responses.csv ์—์„œ ๋Œ€๋žต์ ์ธ ๋ฐ์ดํ„ฐ ํ˜„ํ™ฉ๊ณผ ๋น„์ฆˆ๋‹ˆ์Šค ์งˆ๋ฌธ๊ณผ ๊ด€๋ จ๋œ ์ปฌ๋Ÿผ์˜ ๋‚ด์šฉ์„ ์‚ดํ•€๋‹ค.

    • CSV์˜ ๊ฒฝ์šฐ ๋ถ„๋ฆฌ์ž(delimeter)์™€ ํ—ค๋” ๋ฐ ์ปฌ๋Ÿผ ์†์„ฑ ๋“ฑ์„ ํŒŒ์•…ํ•ด ๋‘”๋‹ค.

  3. ๋ฐ์ดํ„ฐ ์—…๋กœ๋“œ

    • ๋กœ์ปฌPC์˜ ํŒŒ์ผ์„ ๋ฐ”๋กœ BigQuery์— ์—…๋กœ๋“œ๋„ ๊ฐ€๋Šฅํ•˜๋‚˜ ํŒŒ์ผ ํฌ๊ธฐ์˜ ์ œํ•œ(10MB)์ด ์žˆ๋‹ค.

    • ์ด๋กœ ์ธํ•ด ํฐ ํŒŒ์ผ์€ Clodud Storage์— ์—…๋กœ๋“œ ํ›„ BigQuery๋กœ ์ ์žฌํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ด๋‹ค.

    • ์ด๋ฒˆ ํ•ธ์ฆˆ์˜จ์—์„œ Cloud Storage ๋Œ€์‹  Google Drive์— ํŒŒ์ผ์„ ์˜ฌ๋ ค ์‹ค์Šตํ† ๋ก ํ•˜๊ฒ ๋‹ค. Google Drive

  4. Drive URI ํ™•์ธ - BigQuery์— ์™ธ๋ถ€ ๋ฐ์ดํ„ฐ ์†Œ์Šค์˜ ์œ„์น˜๋ฅผ ์•Œ๋ ค์ค„ ๋•Œ ํ•„์š”ํ•œ ์ •๋ณด๋กœ ์‚ฌ์ „์— ํŒŒ์•…ํ•ด ๋‘”๋‹ค.

    • ์—…๋กœ๋“œ๋œ ํŒŒ์ผ์—์„œ ๋งˆ์šฐ์Šค ์˜ค๋ฅธ์ชฝ ํด๋ฆญ ํ›„ Get link ์„ ํƒํ•œ๋‹ค. Google Drive Get Link

    • Copy link ๋‚ด์˜ ์•”ํ˜ธ๋œ ์ˆซ์ž๊ฐ€ ํŒŒ์ผ์˜ id --> BigQuery ์ ์žฌ์‹œ ํ•„์š”ํ•œ ์ •๋ณด Google Drive Share

์ตœ์ข…์ ์œผ๋กœ ์„ค๋ฌธ ๋ฐ์ดํ„ฐ๋Š” ์•„๋ž˜์˜ Google Drive URI์— ์œ„์น˜ํ•˜๊ฒŒ ๋œ๋‹ค.

  • https://drive.google.com/open?id=1xGKxgLmk07GP7wfP9OjROqcw7zP-QkxU

โ–ท ๋ฐ์ดํ„ฐ ์ ์žฌ (Load)

๋ถ„์„ํ•  ๋ฐ์ดํ„ฐ๊ฐ€ ์ค€๋น„๋˜์—ˆ์œผ๋ฏ€๋กœ BigQuery์— ์ด๋ฅผ ์ ์žฌํ•˜๊ธฐ ์œ„ํ•œ Dataset์„ ์ค€๋น„ํ•˜์ž. BigQuery Dataset์€ ํ…Œ์ด๋ธ” ๋˜๋Š” ๋ทฐ(View)๋“ค์„ ๋‹ด๊ณ  ์žˆ๋Š” ์ปจํ…Œ์ด๋„ˆ์˜ ์—ญํ• ์„ ํ•œ๋‹ค.

  1. BigQuery Console ์˜คํ”ˆ

    • https://console.cloud.google.com/bigquery?project=************

    • project์—๋Š” ๋ณธ์ธ์ด ์ƒ์„ฑํ•œ GCP ํ”„๋กœ์ ํŠธ์˜ ์ด๋ฆ„ ์‚ฌ์šฉํ•  ๊ฒƒ - ์ƒ๋‹จ์— ๋ณธ์ธ ํ”„๋กœ์ ํŠธ(ex. learning-club-2020)๊ฐ€ ์„ ํƒ๋˜์–ด ์žˆ๋Š”์ง€ ๋จผ์ € ํ™•์ธ BigQuery Console

  2. Dataset ์ƒ์„ฑ

    • ํ…Œ์ด๋ธ”์„ ์ €์žฅํ•  Dataset ์ƒ์„ฑ - ์™ผ์ชฝ ํ”„๋กœ์ ํŠธ ๋ฆฌ์ŠคํŠธ ๋ชฉ๋ก์—์„œ ๋‚ด ํ”„๋กœ์ ํŠธ๋ฅผ ์„ ํƒํ•˜๋ฉด ์˜ค๋ฅธ์ชฝ [+] ๋ฐ์ดํ„ฐ์„ธํŠธ ๋งŒ๋“ค๊ธฐ ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ ํŒ์—…์ด ๋œธ. BigQuery Dataset1 BigQuery Dataset2

๋ฐ์ดํ„ฐ์…‹์ด ์ค€๋น„๋˜์—ˆ์œผ๋‹ˆ CSV๋กœ ๋˜์–ด ์žˆ๋Š” ์„ค๋ฌธ๋ฐ์ดํ„ฐ๋ฅผ BigQuery ํ…Œ์ด๋ธ”๋กœ ์˜ฎ๊ธธ ์ฐจ๋ก€์ด๋‹ค.

์ด ๋•Œ ์•„๋ž˜์™€ ๊ฐ™์ด ๊ฐ€๋Šฅํ•œ ๋ฐฉ๋ฒ• ๋ช‡ ๊ฐ€์ง€๊ฐ€ ์žˆ๋Š”๋ฐ ์ด๋ฒˆ ์‹ค์Šต์—์„œ๋Š” ์ด ์ค‘์—์„œ ๋‘ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋„๋ก ํ•œ๋‹ค.

  1. BigQuery Console UI ์ƒ์—์„œ๋‚˜ Command Line ์—์„œ CSV ๋ฐ์ดํ„ฐ๋ฅผ BigQuery ์Šคํ† ๋ฆฌ์ง€๋กœ ์ ์žฌํ•˜๋Š” ๋ฐฉ๋ฒ•

  2. ์™ธ๋ถ€ํ…Œ์ด๋ธ”(External Table)๋กœ ์ƒ์„ฑํ•˜์—ฌ ๋ฌผ๋ฆฌ์ ์ธ ๋ฐ์ดํ„ฐ์˜ ์ด๋™์—†์ด BigQuery๊ฐ€ ์™ธ๋ถ€๋ฐ์ดํ„ฐ(CSVํŒŒ์ผ)๋ฅผ ํ…Œ์ด๋ธ”์ฒ˜๋Ÿผ ์ฝ๋„๋ก ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•

BigQuery ์™ธ๋ถ€ ํ…Œ์ด๋ธ” (External Table) ์ƒ์„ฑ

BigQuery๋Š” DDL(Data Definition Language) ๋ฌธ์œผ๋กœ ์™ธ๋ถ€ ์ €์žฅ์†Œ์— ์œ„์น˜ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฐธ์กฐํ•˜๋Š” External Table ์ƒ์„ฑ์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

์•ž์„œ ํ™•์ธํ•œ Google Drive URI๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ…Œ์ด๋ธ”์„ ์ƒ์„ฑํ•˜๋Š” BigQuery ๊ตฌ๋ฌธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

CREATE OR REPLACE EXTERNAL TABLE learning_club.kaggle_survey_2021 
OPTIONS (
  format = 'CSV',
  skip_leading_rows = 2,
  field_delimiter = ',',
  uris = ['https://drive.google.com/open?id=1xGKxgLmk07GP7wfP9OjROqcw7zP-QkxU']
);

SELECT * FROM learning_club.kaggle_survey_2021;

uris๋ฅผ ํ†ตํ•ด CSVํŒŒ์ผ์ด ์œ„์น˜ํ•œ URI๋ฅผ BigQuery์— ์•Œ๋ ค์ฃผ๊ณ  ์žˆ๋‹ค. CSVํŒŒ์ผ์˜ ํ—ค๋”๊ฐ€ 2์ค„์— ๊ฑธ์ณ์žˆ์œผ๋ฏ€๋กœ skip_leading_rows ๊ฐ’์„ 2๋กœ ์„ค์ •ํ•˜์˜€๋‹ค.

โ–ท ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜ (Transform)

์„ค๋ฌธ ํŒŒ์ผ์—๋Š” ๋‹ค์–‘ํ•œ ์„ค๋ฌธ ํ•ญ๋ชฉ๊ณผ ๊ด€๋ จ๋œ ๋งŽ์€ ์ปฌ๋Ÿผ๋“ค์„ ํฌํ•จํ•˜๊ณ  ์žˆ๋‹ค.

์ด ์ค‘ ํ•ธ์ฆˆ์˜จ์˜ ๋น„์ฆˆ๋‹ˆ์Šค ์งˆ๋ฌธ๊ณผ ๊ด€๋ จ์ด ์žˆ๋Š” ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์–ธ์–ด ์ปฌ๋Ÿผ๋“ค๋งŒ ์ถ”์ถœํ•˜์ž. ๊ทธ๋ฆฌ๊ณ  External Table ์„ Internal Table๋กœ ์˜ฎ๊ฒจ์„œ ๋น ๋ฅด๊ฒŒ ์ฟผ๋ฆฌ๊ฐ€ ์ˆ˜ํ–‰๋  ์ˆ˜ ์žˆ๋„๋ก ํ•ด๋ณด์ž.

CREATE OR REPLACE TABLE learning_club.survey_raw AS
SELECT What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___Python AS Python,
       What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___R AS R,
       What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___SQL AS SQL,
       What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___C AS C,
       What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___Java AS Java,
       What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___Javascript AS Javascript,
       What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___Julia AS Julia,
       What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___Swift AS Swift,
       What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___Bash AS Bash,
       What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___MATLAB AS MATLAB,
       What_programming_languages_do_you_use_on_a_regular_basis___Select_all_that_apply____Selected_Choice___Other AS Other,
       ROW_NUMBER() OVER() AS id,  -- ์‚ฌ์šฉ์ž ID ์ƒ์„ฑ
  FROM learning_club.kaggle_survey_2021;

Kaggle ์„ค๋ฌธ ์ž๋ฃŒ๋Š” ๋ถ„์„ํ•˜๊ธฐ ์šฉ์ดํ•œ ๊ตฌ์กฐ๊ฐ€ ์•„๋‹ˆ์–ด์„œ Tidy Data๊ฐ€ ๋˜๋„๋ก ๋ณ€ํ™˜ ์ž‘์—…์ด ํ•„์š”ํ•œ๋ฐ ์•„๋ž˜์˜ ๋ช‡๊ฐ€์ง€ ๋ฐฉ๋ฒ•์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

๋ณ€ํ™˜๋œ ๊ตฌ์กฐ๋Š” ์•„๋ž˜์˜ ๊ทธ๋ฆผ๊ณผ ์œ ์‚ฌํ•˜๋‹ˆ ์ฐธ๊ณ ํ† ๋ก ํ•œ๋‹ค.

[์ฐธ๊ณ ] Tidy Data

  • https://vita.had.co.nz/papers/tidy-data.pdf

  • https://www.openscapes.org/blog/2020/10/12/tidy-data/

๋ฐฉ๋ฒ• 1 - UNION ALL ์‚ฌ์šฉ

ํ‘œ์ค€ SQL ๋ฌธ๋ฒ•์˜ UNION ALL์„ ์‚ฌ์šฉํ•ด์„œ ์•„๋ž˜์™€ ๊ฐ™์ด Stacked Data๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.

WITH tidy AS ( -- WITH ๊ตฌ๋ฌธ์€ SQL ๋ฌธ๋ฒ• ์†Œ๊ฐœ๋•Œ ๋‹ค๋ฃฐ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.
  SELECT id, Python AS lang FROM learning_club.survey_raw
   UNION ALL
  SELECT id, R AS lang FROM learning_club.survey_raw
   UNION ALL
  SELECT id, SQL AS lang FROM learning_club.survey_raw
--   UNION ALL
--  SELECT ....
)
SELECT * FROM tidy WHERE lang IS NOT NULL;

๋ฐฉ๋ฒ• 2 - ARRAY ์‚ฌ์šฉ

BigQuery ์Šค๋Ÿฝ๊ฒŒ ARRAY๋ฅผ ์‚ฌ์šฉํ•˜๊ฒŒ ๋˜๋ฉด ์ข€ ๋” ๊ฐ„๊ฒฐํ•˜๊ฒŒ ์ฟผ๋ฆฌ ๊ธฐ์ˆ ์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

WITH tidy AS (
  SELECT id, lang FROM (
    SELECT id, [Python, R, SQL, C, Java, Javascript, Julia, Swift, Bash, MATLAB, Other] AS languages 
      FROM learning_club.survey_raw
  ), UNNEST(languages) AS lang
)
SELECT * FROM tidy WHERE lang IS NOT NULL;

๋ฐฉ๋ฒ• 3 - UNPIVOT ์‚ฌ์šฉ

๋งˆ์ง€๋ง‰์œผ๋กœ BigQuery UNPIVOT ์—ฐ์‚ฐ์ž๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ€ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณด์ž.

WITH tidy AS (
  SELECT * EXCEPT(dummy)
    FROM learning_club.survey_raw
 UNPIVOT (dummy FOR lang IN (Python, R, SQL, C, Java, Javascript, Julia, Swift, Bash, MATLAB, Other))
)
SELECT * FROM tidy WHERE lang IS NOT NULL;  
Row
id
lang

1

16

Python

2

16

C

3

16

MATLAB

4

136

SQL

5

139

Python

6

139

SQL

...

...

...

์ฃผ์–ด์ง„ ์›์‹œ๋ฐ์ดํ„ฐ(raw data)๊ฐ€ ๋‚ด๊ฐ€ ์›ํ•˜๋Š” ํ˜•ํƒœ๋กœ ์ฃผ์–ด์ง€๋Š” ๊ฒฝ์šฐ๋Š” ๊ฑฐ์˜ ์—†์œผ๋ฏ€๋กœ ์ฟผ๋ฆฌ๋ฅผ ํ†ตํ•ด์„œ ํ…Œ์ด๋ธ”์˜ ํ˜•ํƒœ๋ฅผ ์›ํ•˜๋Š” ๋ชจ์Šต์œผ๋กœ ๋ณ€ํ™˜์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์—ฐ์Šต์ด ํ•„์š”ํ•˜๋‹ค.

ARRAY๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ๋‹ค์Œ ๋ถ„์„์„ ์ด์–ด๊ฐ€๋„๋ก ํ•˜์ž.

โ–ท ๋ฐ์ดํ„ฐ ๋ถ„์„ (Analysis)

๋ถ„์„์ด ์šฉ์ดํ•œ ๊ตฌ์กฐ๋กœ ๋ณ€ํ™˜์ด ๋˜์—ˆ์œผ๋ฏ€๋กœ GROUP BY์™€ ์ง‘๊ณ„ ํ•จ์ˆ˜(aggregation function)์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ„๋‹จํ•œ ๋ถ„์„์„ ์ง„ํ–‰ํ•œ๋‹ค.

  1. ์ฒซ๋ฒˆ์งธ ์งˆ๋ฌธ์˜ ๋‹ต์„ ์ฐพ๊ธฐ ์œ„ํ•œ ์ฟผ๋ฆฌ - ๊ณผํ•™์ž๋“ค์ด ๊ฐ€์žฅ ์„ ํ˜ธํ•˜๋Š” ๋„๊ตฌ๋Š”?

WITH tidy AS (
  SELECT id, lang FROM (
    SELECT id, [Python, R, SQL, C, Java, Javascript, Julia, Swift, Bash, MATLAB, Other] AS languages 
      FROM learning_club.survey_raw
  ), UNNEST(languages) AS lang
)
SELECT lang, COUNT(1) AS cnt 
  FROM tidy
 WHERE lang IS NOT NULL 
 GROUP BY 1 
 ORDER BY 2 DESC;
Row
lang
cnt

1

Python

21860

2

SQL

10756

3

R

5334

4

Java

4769

5

C

4709

...

...

...

์ฟผ๋ฆฌ ์ˆ˜ํ–‰์˜ ๊ฒฐ๊ณผ์…‹์œผ๋กœ ๋ถ€ํ„ฐ ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต์„ ํ™•์ธ ํ• ์ˆ˜ ์žˆ๊ณ  ๊ฒฐ๊ณผ์ฐฝ์˜ Explore Data (๋ฐ์ดํ„ฐ ํƒ์ƒ‰) ๋ฒ„ํŠผ์„ ํด๋ฆญํ•˜๋ฉด Data Studio์˜ ์‹œ๊ฐ์  ๋ถ„์„ ํ™˜๊ฒฝ์œผ๋กœ Seamless ํ•˜๊ฒŒ ๋„˜์–ด๊ฐˆ ์ˆ˜ ์žˆ๋‹ค.

bigquery_explore_data

์–ธ์–ด๋ณ„ ์„ ํ˜ธ๋„ ์‹œ๊ฐํ™” datastudio_kaggle_survey_2021

  1. ๋‘๋ฒˆ์งธ ์งˆ๋ฌธ - Kaggle ๋ถ„์„๊ฐ€๋“ค์€ ๋ช‡ ๊ฐ€์ง€ ์–ธ์–ด๋กœ ๋ถ„์„์„ ์ง„ํ–‰ํ• ๊นŒ?

์ด ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต์„ ์œ„ํ•ด์„œ๋Š” ์‚ฌ์šฉ์ž๋ณ„๋กœ ์‚ฌ์šฉํ•˜๋Š” ์–ธ์–ด์˜ ๊ฐฏ์ˆ˜๋ฅผ ๋จผ์ € ์ง‘๊ณ„ํ•œ ํ›„, ์–ธ์–ด์˜ ๊ฐฏ์ˆ˜๋ฅผ ์ฐจ์›(dimension)์œผ๋กœ ํ•˜๋Š” ์ง‘๊ณ„๋ฅผ ํ•œ ๋ฒˆ ๋” ์ˆ˜ํ–‰ํ•ด์•ผ ํ•œ๋‹ค.

WITH tidy AS (
  SELECT id, lang FROM (
    SELECT id, [Python, R, SQL, C, Java, Javascript, Julia, Swift, Bash, MATLAB, Other] AS languages 
      FROM learning_club.survey_raw
  ), UNNEST(languages) AS lang
)
SELECT languages, COUNT(1) AS cnt 
  FROM (
  -- ์‚ฌ์šฉ์ž๋ณ„๋กœ ์‚ฌ์šฉํ•˜๋Š” ์–ธ์–ด๋“ค์„ ๋ฐฐ์—ด๋กœ ๋ฌถ์Œ.
    SELECT id, COUNT(1) AS languages 
      FROM tidy WHERE lang IS NOT NULL
     GROUP BY 1
  ) 
 GROUP BY 1 
 ORDER BY 2 DESC;
Row
languages
cnt

1

2

7589

2

1

6798

3

3

5493

4

4

2716

5

5

1207

  1. ์„ธ๋ฒˆ์งธ ์งˆ๋ฌธ - ์–ด๋–ค ์–ธ์–ด์กฐํ•ฉ์ด ๊ฐ€์žฅ ๋งŽ์ด ์„ ํ˜ธ๋˜๊ณ  ์žˆ์„๊นŒ?

์ด ์งˆ๋ฌธ์— ๋Œ€ํ•œ ์ฟผ๋ฆฌ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ ์œ„ํ•ด์„œ ์–ธ์–ด์กฐํ•ฉ์„ ํ•˜๋‚˜์˜ ๋ฌธ์ž์—ด๋กœ ํ‘œํ˜„ํ•˜๊ณ  ์ด๋ฅผ GROUP BYํ•˜์—ฌ ์ง‘๊ณ„ํ•˜์˜€๋‹ค.

WITH tidy AS (
  SELECT id, lang FROM (
    SELECT id, [Python, R, SQL, C, Java, Javascript, Julia, Swift, Bash, MATLAB, Other] AS languages 
      FROM learning_club.survey_raw
  ), UNNEST(languages) AS lang
)
SELECT ARRAY_TO_STRING(languages, ', ') langs, COUNT(1) AS cnt 
  FROM (
  -- ์‚ฌ์šฉ์ž๋ณ„๋กœ ์‚ฌ์šฉํ•˜๋Š” ์–ธ์–ด๋“ค์„ ๋ฐฐ์—ด๋กœ ๋ฌถ์Œ.
    SELECT id, ARRAY_AGG(lang IGNORE NULLS ORDER BY lang) AS languages 
      FROM tidy GROUP BY 1
  )
 WHERE ARRAY_LENGTH(languages) = 2
 GROUP BY 1 ORDER BY 2 DESC;
Row
langs
cnt

1

Python, SQL

2636

2

Python, R

1018

3

C, Python

790

4

MATLAB, Python

652

5

Java, Python

575

๊ฒฐ๊ณผ๋กœ๋ถ€ํ„ฐ Python๊ณผ SQL ์กฐํ•ฉ์„ ๋ฐ์ดํ„ฐ ๊ณผํ•™์ž์™€ ๋ถ„์„๊ฐ€๋“ค์ด ๊ฐ€์žฅ ์„ ํ˜ธํ•˜๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

์ด์™€ ๊ฐ™์ด ๋ถ„์„์„ ์œ„ํ•œ ํ•ต์‹ฌ ์งˆ๋ฌธ(Key Business Question, KBQ)์„ ์ƒ๊ฐํ•˜๊ณ  ์ฟผ๋ฆฌ๋ฅผ ํ†ตํ•ด ๋‹ต์„ ์ฐพ๋Š” ๊ณผ์ •์„ ๋ฐ˜๋ณตํ•˜๋ฉฐ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์ดํ•ด์™€ ํ†ต์ฐฐ์„ ํ‚ค์›Œ๊ฐ€๋Š” ๊ณผ์ •์ด BI ๋ถ„์„์ด๋ผ ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด ๊ณผ์ •์—์„œ ๋น„์ฆˆ๋‹ˆ์Šค ์งˆ๋ฌธ์„ ์• ๋“œํ˜น ์ฟผ๋ฆฌ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋Šฅ๋ ฅ์ด ์š”๊ตฌ๋˜๋ฉฐ ์ฟผ๋ฆฌ ์ž‘์„ฑ ๋Šฅ๋ ฅ์ด ๋ถ„์„์ƒ์‚ฐ๋ ฅ์— ์ฐจ์ด๋ฅผ ๋งŒ๋“ ๋‹ค.

Last updated