EDA of YouTube channels in Vietnam

Dataset from top 200,000 channels in Vietnam - Nov 2019 - ver.1

Note: This analysis is for study and research purposes and covers the Vietnam dataset only. Apologies for any errors or mistakes in it. The data is also vast, so some of it may be missing from this analysis.

Reference: https://www.omnicoreagency.com/youtube-statistics/ for some facts.

The number of people watching YouTube keeps increasing every year, including children, teenagers, adults, and older people. But what are they watching? Because YouTube channels in Vietnam are so numerous and varied, it is hard to see the overall picture of YouTube Vietnam.

I. Goals of the analysis

Most users look only at the ‘subscribers’ number and assume that the channels with the most subscribers have the best videos. But if we look at other factors, we might find something interesting.

Here are some questions and assumptions that I would like to verify:

  1. Are there any great channels that we didn’t know existed?
    • Based on factors other than ‘subscribers’ (such as views, or view count/joined date, …)?
    • Based on categories instead of overall factors?
    • Based on keywords?
  2. Which channel category (e.g. Entertainment, Gaming, Comedy, etc.) has the largest number of views, subscribers, view count/joined date, …?
    • How do the categories relate to each other?
  3. Why do some channels have such a large view count?
    • Something causes view_count on kids’ channels to be always larger than on other channels. Children have the most free time, so might they spend much of their childhood watching YouTube? Or is it something else?
    • Teenagers have less time than children, but do they still spend a lot of it watching YouTube?
    • Adults have less time to watch because they have to work?
    • Older people more often watch TV channels?

P.S.:

If your YouTube homepage is boring and has nothing fun to watch, you can go to the analysis below, find interesting channels, and subscribe. Then your YouTube will have all the best videos to watch. :)

#invite people for the Kaggle party
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
from collections import Counter
import wordcloud
from underthesea import word_tokenize

warnings.filterwarnings('ignore')
%matplotlib inline
# Load the dataset, which has already been cleaned.
df_train = pd.read_excel('youtube_channel.xlsx')
df_train.columns
df_train.head()
Index(['_id', 'url', 'name', 'subscribers_text', 'video_count', 'view_count',
       'category', 'keywords', 'parent', 'subscribers_infer', 'r_view_p_sub',
       'r_view_p_joined', 'r_view_p_vid', 'r_joined_p_vid', 'joined_date',
       'joined_date_seconds', 'name_joined', 'name_vid', 'name_sub_vid',
       'name_sub', 'name_url'],
      dtype='object')
_id url name subscribers_text video_count view_count category keywords parent subscribers_infer ... r_view_p_joined r_view_p_vid r_joined_p_vid joined_date joined_date_seconds name_joined name_vid name_sub_vid name_sub name_url
0 5dd01bfac94e8ed89c37ea91 https://www.youtube.com/channel/UC5ezaYrzZpyIt... POPS Kids 9.71M 7085.0 9411018577 Film & Animation "phim hoat hinh" "ca nhac thieu nhi" "kenh thi... https://www.youtube.com/channel/UC5ezaYrzZpyIt... 9710000 ... 51018 1328301 26 2014-01-14 2014-01-14 POPS Kids (joined 2014-01-14) POPS Kids (7,085 videos) POPS Kids (9.71M - 7,085 videos) POPS Kids (9.71M) POPS Kids (https://www.youtube.com/channel/UC5...
1 5dd01bfac94e8ed89c37ea9b https://www.youtube.com/channel/UCUgXK2UjZ8G_E... POPS MUSIC 7.29M 25462.0 6610506521 Music "pops music" "pops vietnam" "nhac viet 2019" "... https://www.youtube.com/channel/UCUgXK2UjZ8G_E... 7290000 ... 31485 259622 8 2013-03-25 2013-03-25 POPS MUSIC (joined 2013-03-25) POPS MUSIC (25,462 videos) POPS MUSIC (7.29M - 25,462 videos) POPS MUSIC (7.29M) POPS MUSIC (https://www.youtube.com/channel/UC...
2 5dd01bfac94e8ed89c37eaa4 https://www.youtube.com/channel/UC6eq3sR4CtbvG... Kênh Thiếu Nhi - BHMEDIA 6.06M 6078.0 4307433103 Music "kenh thieu nhi bhmedia" "nhac thieu nhi" "thi... https://www.youtube.com/channel/UC6eq3sR4CtbvG... 6060000 ... 24631 708692 28 2014-05-05 2014-05-05 Kênh Thiếu Nhi - BHMEDIA (joined 2014-05-05) Kênh Thiếu Nhi - BHMEDIA (6,078 videos) Kênh Thiếu Nhi - BHMEDIA (6.06M - 6,078 videos) Kênh Thiếu Nhi - BHMEDIA (6.06M) Kênh Thiếu Nhi - BHMEDIA (https://www.youtube....
3 5dd01bfac94e8ed89c37ea9e https://www.youtube.com/channel/UCSJsjCiTl2lou... Thơ Nguyễn 6.91M 895.0 4227680361 People & Blogs "Thơ Nguyễn" https://www.youtube.com/channel/UCSJsjCiTl2lou... 6910000 ... 36138 4723665 130 2016-03-05 2016-03-05 Thơ Nguyễn (joined 2016-03-05) Thơ Nguyễn (895 videos) Thơ Nguyễn (6.91M - 895 videos) Thơ Nguyễn (6.91M) Thơ Nguyễn (https://www.youtube.com/channel/UC...
4 5dd01bfac94e8ed89c37ea8b https://www.youtube.com/channel/UC0jDoh3tVXCaq... FAP TV 10.3M 447.0 4034410388 Film & Animation "FAP TV" "Com nguoi" "phim hai" "fap tivi" faptv https://www.youtube.com/channel/UC0jDoh3tVXCaq... 10300000 ... 22320 9025526 404 2014-02-26 2014-02-26 FAP TV (joined 2014-02-26) FAP TV (447 videos) FAP TV (10.3M - 447 videos) FAP TV (10.3M) FAP TV (https://www.youtube.com/channel/UC0jDo...

5 rows × 21 columns
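
The preview above shows subscribers_text values such as 9.71M next to subscribers_infer values such as 9710000. How the dataset did this conversion is not shown, so the following is only a minimal sketch of one way such text could be parsed. The function name parse_count and the K/M/B suffix table are my assumptions, and the sketch does not handle hidden (~) subscriber counts.

```python
from decimal import Decimal

# Hypothetical suffix table; only K, M and B are assumed to occur.
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_count(text):
    """Convert abbreviated text such as '9.71M' to an integer count."""
    text = text.strip().upper().replace(",", "")
    multiplier = 1
    if text and text[-1] in SUFFIXES:
        multiplier = SUFFIXES[text[-1]]
        text = text[:-1]
    # Decimal avoids binary floating-point rounding when scaling 9.71 up.
    return int(Decimal(text) * multiplier)

print(parse_count("9.71M"))
print(parse_count("10.3M"))
```

Decimal is used instead of float so that the scaled value stays exact before it is converted to an integer.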

df_train.describe()
         view_count  subscribers_infer  r_view_p_sub  r_view_p_joined  r_view_p_vid  r_joined_p_vid
count  2.160700e+04       2.160700e+04  21607.000000     21607.000000  2.160700e+04    21607.000000
mean   2.668484e+07       8.124845e+04    407.559124       249.443236  1.667015e+05     2839.976674
std    1.557322e+08       3.213656e+05   1115.933585      1209.028898  7.942098e+05     9203.371213
min    2.896000e+03       1.200000e+01      0.000000         1.000000  0.000000e+00        0.000000
25%    7.875605e+05       2.740000e+03    178.000000         8.000000  7.585500e+03      282.000000
50%    2.274931e+06       8.771000e+03    345.000000        24.000000  2.470900e+04      759.000000
75%    9.681355e+06       4.005500e+04    393.000000        98.000000  8.770500e+04     2110.000000
max    9.411019e+09       1.030000e+07  81890.000000     51018.000000  4.304108e+07   274752.000000

General analysis

Channels with the most subscribers

Nothing to comment on here. These are just the facts.

Note: Some channels hide their subscriber stats. For those, the symbol ~ marks an estimated number.

cdf = df_train.sort_values("subscribers_infer", ascending=False).head(30)

fig, ax = plt.subplots(figsize=(10,10))
_ = sns.barplot(x="subscribers_infer", y="name_sub_vid", data=cdf,
                palette=sns.cubehelix_palette(n_colors=30, reverse=True), ax=ax)
_ = ax.set(xlabel="No. of subscribers", ylabel="Channel (Subscribers)")

[Figure: top 30 channels by number of subscribers]

Number of subscribers vs. views

Channels are sorted in descending order by subscribers and compared by view_count. All of the top channels belong to entertainment categories.

The kids’ channels all have noticeably more views than other channels, perhaps because children have the most time to watch.

  • POPS Kids (large no. of videos)
  • Bé Bún - Bé Bắp
  • Thơ Nguyễn
  • Kênh Thiếu Nhi - BHMEDIA (large no. of videos)
  • Kid Studio (large no. of videos)

Channels with few videos but a large number of views (most of their videos attract a lot of views):

  • Vanh Leg
  • Sơn Tùng M-TP Oficial
  • Hau Hoang
  • K-ICM Offical
  • iGAMING TV
cdf = df_train.sort_values("subscribers_infer", ascending=False).head(30)
fig, ax = plt.subplots(figsize=(12,10))
_ = sns.barplot(x="view_count", y="name_sub_vid", data=cdf,
                palette=sns.cubehelix_palette(n_colors=30, reverse=True), ax=ax)
_ = ax.set(xlabel="No. of views", ylabel="Channel (Subscribers)")

[Figure: view counts of the top 30 channels by number of subscribers]

Number of videos vs. views

Again, POPS MUSIC has the most views.

The other channels come from traditional TV. They have started broadcasting on YouTube because of its growing number of users, catching the trend before TV goes out of date.

cdf = df_train.sort_values("video_count", ascending=False).head(40).sort_values("view_count", ascending=False)

fig, ax = plt.subplots(figsize=(8,8))
_ = sns.barplot(x="view_count", y="name_vid", data=cdf,
                palette=sns.cubehelix_palette(n_colors=40, reverse=True), ax=ax)
_ = ax.set(xlabel="No. of views", ylabel="Channel (No. of Videos)")

[Figure: view counts of the 40 channels with the most videos]

Channels per category

Note: The ‘People & Blogs’ category is quite mixed. Some channels in it might belong to Education, but most of them are still entertainment.

Categories counted as entertainment:

  • People & Blogs
  • Entertainment
  • Music
  • Gaming
  • Film & Animation

Categories counted as education:

  • Education
  • Howto & Style
  • Science & Technology

The ratio is education (7%) to entertainment (93%).

  • Entertainment channels attract users more easily than education channels, which makes them easier to monetize.
  • The same goes for kids’ channels.
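
The education/entertainment ratio above could be computed along these lines. This is only a sketch: the grouping follows the two lists above, categories outside those lists are ignored, and the toy data is made up, so it will not reproduce the 7%/93% figures. The helper name group_share is hypothetical.

```python
import pandas as pd

# Grouping taken from the two lists in the text; categories not listed
# there (e.g. News & Politics) are left out of the comparison.
ENTERTAINMENT = {"People & Blogs", "Entertainment", "Music", "Gaming",
                 "Film & Animation"}
EDUCATION = {"Education", "Howto & Style", "Science & Technology"}

def group_share(categories):
    """Return the percentage of channels that fall in each group."""
    def to_group(cat):
        if cat in ENTERTAINMENT:
            return "entertainment"
        if cat in EDUCATION:
            return "education"
        return None  # ungrouped categories are dropped below
    groups = pd.Series(list(categories)).map(to_group).dropna()
    return (groups.value_counts(normalize=True) * 100).round(1)

# Toy data for illustration only, not taken from the dataset.
toy = ["Music", "Gaming", "Education", "Music", "Sports"]
share = group_share(toy)
print(share)
```

On the real data, the call would be something like group_share(df_train["category"]).
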
cdf = df_train.groupby("category").size().reset_index(name="channels").sort_values("channels", ascending=False)
list(cdf['category'])

fig, ax = plt.subplots(figsize=(12,8))
_ = sns.barplot(x="channels", y="category", data=cdf,
                palette=sns.cubehelix_palette(n_colors=20, reverse=True), ax=ax)
_ = ax.set(xlabel="No. of channels", ylabel="Category")
['People & Blogs',
 'Entertainment',
 'Music',
 'Gaming',
 'Film & Animation',
 'Education',
 'Howto & Style',
 'News & Politics',
 'Autos & Vehicles',
 'Science & Technology',
 'Travel & Events',
 'Sports',
 'Comedy',
 'Pets & Animals',
 'Nonprofits & Activism']

[Figure: number of channels per category]

Channels that are still keeping up or growing fast in a short time

Note: The dataset is from Nov 2019.

r_view_p_joined = view_count / time since joined_date

  • If a channel is discontinued, view_count stalls while the time since joined_date keeps increasing, so r_view_p_joined decreases.
  • If a channel grows fast, a large view_count over a short time since joined_date gives a high r_view_p_joined.
  • If a channel keeps up, view_count and the time since joined_date increase together, so r_view_p_joined stays steady.
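
To make the ratio concrete, here is a small sketch that divides views by channel age. The time unit of the dataset's r_view_p_joined column is not stated here, so the use of days, the exact snapshot date of 2019-11-18, and the helper name views_per_day are all my assumptions.

```python
import pandas as pd

def views_per_day(df, snapshot="2019-11-18"):
    """Views divided by channel age in days at the (assumed) snapshot date."""
    age = pd.Timestamp(snapshot) - pd.to_datetime(df["joined_date"])
    # Clip to at least one day so a channel created on the snapshot date
    # does not cause a division by zero.
    age_days = age.dt.days.clip(lower=1)
    return df["view_count"] / age_days

# Toy data for illustration only: same views, very different ages.
toy = pd.DataFrame({
    "name": ["old channel", "new channel"],
    "view_count": [1_000_000, 1_000_000],
    "joined_date": ["2017-02-21", "2019-08-10"],
})
toy["views_per_day"] = views_per_day(toy)
print(toy)
```

Both toy channels have the same total views, but the newer one gets a much higher ratio, which is the "grows fast" case from the list above.
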
cdf = df_train.sort_values("r_view_p_joined", ascending=False).head(30).sort_values("joined_date_seconds", ascending=False)
fig, ax = plt.subplots(figsize=(10,8))
_ = sns.barplot(x="r_view_p_joined", y="name_joined", data=cdf,
                palette=sns.cubehelix_palette(n_colors=30, reverse=True), ax=ax)
_ = ax.set(xlabel="Views / Joined Date", ylabel="Channel (Joined Date)")

[Figure: top 30 channels by views / joined date, ordered by joined date]

Channels with fewer videos but many views

r_view_p_vid = view_count / video_count

Channels with a lot of videos often have a large number of views. Channels that score high on this factor, however, rely on the quality of each video rather than on a large number of videos.

This factor is a check for great content.
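
As a small illustration of this factor, the sketch below computes views per video on toy data and drops channels with very few videos, since a single viral upload would otherwise dominate the ranking. The helper name views_per_video and the min_videos threshold are assumptions for the example.

```python
import pandas as pd

def views_per_video(df, min_videos=4):
    """Rank channels with at least min_videos videos by average views per video."""
    eligible = df[df["video_count"] >= min_videos].copy()
    eligible["views_per_video"] = eligible["view_count"] / eligible["video_count"]
    return eligible.sort_values("views_per_video", ascending=False)

# Toy data for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "name": ["many small videos", "few big videos", "single upload"],
    "video_count": [1000, 10, 1],
    "view_count": [2_000_000, 1_000_000, 5_000_000],
})
ranked = views_per_video(toy)
print(ranked[["name", "views_per_video"]])
```

The channel with only one video is excluded, and the channel with few but popular videos ranks above the one with many videos.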

cdf = df_train[df_train['video_count'] > 3].sort_values("r_view_p_vid", ascending=False).head(30)
fig, ax = plt.subplots(figsize=(10,8))
_ = sns.barplot(x="r_view_p_vid", y="name_vid", data=cdf,
                palette=sns.cubehelix_palette(n_colors=30, reverse=True), ax=ax)
_ = ax.set(xlabel="Views / Videos", ylabel="Channel")

[Figure: top 30 channels by views per video]

More specific on Entertainment

View count per category

  • Entertainment
  • Music
  • People & Blogs
  • Gaming
  • Film & Animation
x_metric = 'view_count'

def draw_per_category(category):
    cdf = df_train[df_train['category'] == category].sort_values(x_metric, ascending=False).head(30)
    fig, ax = plt.subplots(figsize=(10,8))
    _ = sns.barplot(x=x_metric, y="name_sub_vid", data=cdf,
                palette=sns.cubehelix_palette(n_colors=30, reverse=True), ax=ax)
    _ = ax.set(xlabel=x_metric, ylabel="Channel - " + category)


draw_per_category('Entertainment')
draw_per_category('Music')
draw_per_category('People & Blogs')
draw_per_category('Gaming')
draw_per_category('Film & Animation')

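[Figure: top 30 channels by view count, Entertainment]

[Figure: top 30 channels by view count, Music]

[Figure: top 30 channels by view count, People & Blogs]

[Figure: top 30 channels by view count, Gaming]

[Figure: top 30 channels by view count, Film & Animation]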
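
The charts above rank channels inside one category at a time. To compare the categories with each other (goal 2), the views could also be aggregated per category. The following is only a sketch on made-up data; the helper name summarize_categories is hypothetical.

```python
import pandas as pd

def summarize_categories(df):
    """Channel count, total views and median views per category."""
    summary = df.groupby("category")["view_count"].agg(
        channels="count", total_views="sum", median_views="median"
    )
    return summary.sort_values("total_views", ascending=False)

# Toy data for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "category": ["Music", "Music", "Gaming", "Education"],
    "view_count": [500, 300, 600, 100],
})
summary = summarize_categories(toy)
print(summary)
```

The median is included next to the total because a few very large channels can dominate a category's total views.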

More specific on Education

View count per category

  • Education
  • Howto & Style
  • Science & Technology

Most of the Education channels are for kids? Quite strange.

draw_per_category('Education')
draw_per_category('Howto & Style')
draw_per_category('Science & Technology')

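[Figure: top 30 channels by view count, Education]

[Figure: top 30 channels by view count, Howto & Style]

[Figure: top 30 channels by view count, Science & Technology]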

Keywords

Keywords that appear most frequently in channel names

title_words = list(df_train["name"].apply(lambda x: word_tokenize(str(x).lower()) ))
title_words = [x for y in title_words for x in y]
Counter(title_words).most_common(25)
[('tv', 1566),
 ('-', 1358),
 ('việt', 812),
 ('official', 778),
 ('channel', 669),
 ('nam', 636),
 ('nhạc', 488),
 ('anh', 403),
 ('phim', 399),
 ('vlogs', 388),
 ('vlog', 385),
 ('hay', 326),
 ('music', 303),
 ('vietnam', 295),
 ('kênh', 286),
 ('nguyễn', 284),
 ('.', 266),
 ('minh', 248),
 ('cuộc sống', 242),
 ('giải trí', 224),
 ('miền', 223),
 ('hoàng', 209),
 ('tin', 200),
 ('&', 198),
 ('nguyen', 198)]

Wordcloud

Note: việt, vietnam, channel, kênh, and official are just generic suffixes of channel names.

  1. TV is the keyword that appears most often in channel names
  2. nhạc, music
  3. vlogs, vlog
  4. phim
  5. other… (cuộc sống, giải trí, tin tức, …)
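
Since the generic suffixes and punctuation tokens add little information, they could be filtered out before counting. The sketch below shows one way to do that. It uses a plain str.split as a stand-in for underthesea's word_tokenize, and the GENERIC set and the helper name keyword_counts are my assumptions, so its counts will differ from the ones above.

```python
from collections import Counter
import string

# Tokens to ignore: the generic suffixes named in the note above.
GENERIC = {"việt", "vietnam", "channel", "kênh", "official"}

def keyword_counts(names):
    """Count channel-name tokens, skipping punctuation and generic suffixes."""
    counts = Counter()
    for name in names:
        # str.split is a simple stand-in for a Vietnamese tokenizer, so
        # multi-syllable words such as "giải trí" are not kept together.
        for token in str(name).lower().split():
            token = token.strip(string.punctuation)
            if token and token not in GENERIC:
                counts[token] += 1
    return counts

# Made-up channel names for illustration only.
toy = ["ABC TV Official", "Nhạc Hay - Channel", "XYZ TV."]
counts = keyword_counts(toy)
print(counts.most_common(3))
```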

wc = wordcloud.WordCloud(width=1200, height=500, 
                         collocations=False, background_color="white", 
                         colormap="tab20b").generate(" ".join(title_words))
_ = plt.figure(figsize=(15,10))
_ = plt.imshow(wc, interpolation='bilinear')
_ = plt.axis("off")

[Figure: word cloud of channel-name keywords]

Conclusion

Benefits

  • As a YouTuber, you should know what is trending on YouTube so that you can adapt to the audience’s demand.
  • Knowing what content your competitors produce, since they may already know what the audience wants, might help your channel survive.
  • Or, if you’re just a regular user who wants to find interesting channels to learn from or be entertained by, this analysis is for you. (All the best channels are included.)

About the content, this is just my opinion.

Reference: Veritasium channel

“My Video Went Viral. Here’s Why” (https://www.youtube.com/watch?v=fHsa9DqmId8)

  • “Audience change their taste, algorithm reflect their demand, youtuber chase the algorithm to survive”. The cycle repeats again and again.
  • The quality of content depends on the audience’s demand. What content does the audience want to watch? Whether content is great or poor depends on them.
  • I hope YouTubers keep producing great content.

I have provided a tool below that queries channels by sub-category. I hope you find something interesting. I will continue with an analysis of 500,000 videos from the top channels in Vietnam. To be continued…


II. Tool for querying top channels

Select a category or value, then click search.