Analyze Facebook Ads Report with Pandas
TL;DR
Analyzing the performance of Facebook ad campaigns and giving recommendations on how to get the best performance out of them. The target feature is the conversion
of the listed ads. A similar data set can be found here: Kaggle source. This analysis also exists as a runnable Jupyter notebook HERE. Also, check out the original Jupyter Notebook.
Intro
This article is based on the core script of a home assignment from a Dutch company. A data set of 1,143 observations was provided to interpret, analyze, and convert into action points. The assignment asks that the insights I uncover be accompanied by action points and supported by reasoning.
The Data
The given data set is an ad performance report. I assume it was created by the Facebook Ads Manager. Accordingly, I will focus on marketing metrics like conversion and ad costs.
Goals
The company is looking for opportunities to exponentially increase its growth rate in markets outside of the Netherlands. This suggests identifying the top-converting ads and applying their learnings to other, low-performing ads at a measurably low cost.
- Identify high converting ads
- Find most efficient ads
- Conclude must-have plan for future ads
Data 💾
Data Extraction
# imports used throughout this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# load data frame and parse date columns
date_cols = ['reporting_start', 'reporting_end']
df = pd.read_csv('raw_data/data_for_data_analytics.csv',
                 parse_dates=date_cols)

# glimpse at the df
df.head()
Some manual data handling is required due to missing values in the lower ~380 rows.
# cells are shifted left from index 761 onwards, leaving wrong values in the remaining columns
# detect NA rows and shift them 2 cells to the right
na_mask = df.total_conversion.isna()
df.loc[na_mask, 'campaign_id':'approved_conversion'] = \
    df.loc[na_mask, 'campaign_id':'approved_conversion'].shift(2, axis=1)

# fill the now empty id columns with a placeholder
df[['campaign_id', 'fb_campaign_id']] = df[['campaign_id', 'fb_campaign_id']].fillna(-1)

# fix the dtypes
df = df.astype(dtype={'campaign_id': int,
                      'fb_campaign_id': int,
                      'interest1': int,
                      'interest2': int,
                      'interest3': int})

# show fixed area (to check)
df.loc[758:763]
Preliminary analysis
print(f'There are {df.ad_id.nunique()} unique values in "ad_id" (primary key).')
print(f'The overall reporting time span is {(df.reporting_end.max() - df.reporting_start.min()).days} days.')
print(f'In total {round(df.spent.sum(), 2)} EUR were spent on all campaigns in the recorded time.')

There are 1143 unique values in "ad_id" (primary key).
The overall reporting time span is 13 days.
In total 58705.23 EUR were spent on all campaigns in the recorded time.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ad_id                1143 non-null   int64
 1   reporting_start      1143 non-null   datetime64[ns]
 2   reporting_end        1143 non-null   datetime64[ns]
 3   campaign_id          1143 non-null   int64
 4   fb_campaign_id       1143 non-null   int64
 5   age                  1143 non-null   object
 6   gender               1143 non-null   object
 7   interest1            1143 non-null   int64
 8   interest2            1143 non-null   int64
 9   interest3            1143 non-null   int64
 10  impressions          1143 non-null   float64
 11  clicks               1143 non-null   int64
 12  spent                1143 non-null   float64
 13  total_conversion     1143 non-null   float64
 14  approved_conversion  1143 non-null   float64
dtypes: datetime64[ns](2), float64(4), int64(7), object(2)
memory usage: 134.1+ KB
Feature Engineering 👷
Feature categories
I assume the report's features fall into two important groups: preset_features and performance_features. They will be assigned accordingly:
- preset_features: ['ad_id', 'reporting_end', 'age', 'gender', 'interest1', 'interest2', 'interest3']
- performance_features: ['impressions', 'clicks', 'spent', 'approved_conversion']
That means each scheduled ad came with a designated audience targeting, and Facebook assigned the reported performance metrics to the advertisement.
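Written out as plain Python (the variable names here are my own shorthand and are not used by the later snippets):
# feature grouping as described above (illustrative)
preset_features = ['ad_id', 'reporting_end', 'age', 'gender',
                   'interest1', 'interest2', 'interest3']
performance_features = ['impressions', 'clicks', 'spent', 'approved_conversion']
# e.g. a quick look at the performance side only
df[performance_features].describe()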
Data cleaning
Some minor data cleaning will be done due to incomplete data and redundant features.
- I will solely focus on analyzing the individual ads, given that ~30% of the campaign_id data is missing. The data set stays at its current granularity; however, the columns campaign_id and fb_campaign_id can be removed. Once the missing data is acquired, it can still be added back to the set with no loss.
- The columns reporting_start and reporting_end are identical. One will be removed.
- The very limited context of the data set doesn't allow me to define the difference between the columns total_conversion and approved_conversion. Therefore, I will only use approved_conversion as a valid measure of success (e.g. 'sign up for bank account').
- Some rows show conversions without a click. This could derive from overlapping recording time windows. Since their origin is not fully clear, I will remove those rows (71 rows).
df_ = df.copy().drop(columns=['campaign_id',
                              'fb_campaign_id',
                              'reporting_start',
                              'total_conversion'])

# conversions without clicks are excluded
mask = (df_.clicks == 0) & (df_.approved_conversion > 0)
df_ = df_.loc[~mask]
Feature calculations
Some additional calculations will be necessary for further analysis:
- CPC (cost per click)
- CPCon (cost per approved_conversion)
- CPM (cost per 1,000 impressions)
- CTR (click-through rate)
- conversion_rate (approved_conversion / impressions)
- weekd (day of the week, starting 0 = Monday)
df_['CPC'] = df_.spent / df_.clicks
df_['CPCon'] = df_.apply(lambda x: (x.spent / x.approved_conversion) if x.approved_conversion != 0 else np.nan,
                         axis=1)  # avoid inf values with 0 as conversion
df_['CPM'] = round((df_.spent / df_.impressions) * 1000, 2)
df_['CTR'] = df_.clicks / df_.impressions
df_['conversion_rate'] = df_.approved_conversion / df_.impressions
df_['weekd'] = df_.reporting_end.dt.dayofweek  # 0 = Monday
Remove feature bias / create interest-sets
Columns interest1-interest3 hold information about the targeted interest groups. The IDs are defined by Facebook. I wasn't able to decode them; hence, they will stay encoded. However, these columns show duplicated entries: some rows carry the same trailing value in interest2 and interest3. I will remove these duplicates.
Additionally, I assume that interests are not prioritized (from 1-3), but rather given as an equal keyword pool. Therefore, they should be concatenated as unique sets in order to see which interest combinations perform best.
# create interest_set with unique interests
def interest_set(x):
    """
    Return interest1-interest3 concatenated into one unique interest set.
    The set is sorted to avoid permutation duplicates.
    """
    s = sorted({x.interest1, x.interest2, x.interest3})
    return '_'.join(str(i) for i in s)

df_['interest_set'] = df_.apply(interest_set, axis=1)
# keep single interests but remove trailing duplicates
df_clean = df_.copy()
df_clean.loc[df_clean.interest2 == df_clean.interest3, 'interest3'] = -1

df_clean.head()
Data Analysis 📈
Socio analysis 🌈
plt.figure(figsize=(15,8))
# display CTR and conversion rate in violin plots
for i, metric in enumerate(['CTR', 'conversion_rate']):
    plt.subplot(1, 2, i+1)
    sns.violinplot(data=df_clean[df_clean[metric] != 0],
                   x='age',
                   y=metric,
                   hue='gender',
                   orient='vertical',
                   split=True)
    # give hints on significance / outlier behavior / balance
    sns.stripplot(x='age',
                  y=metric,
                  data=df_clean[df_clean[metric] != 0],
                  hue='gender',
                  jitter=True,
                  alpha=.3)
    plt.title(f'{metric.capitalize()} Analysis by Socio Factors',
              size=15)
    plt.grid(axis='y')
plt.show()
Disclaimer
- Data with 0 as the target value have been removed (no-performers).
Note
- Female CTR (mode, mean and median) is consistently higher across all age groups.
- Older generations are more inclined to click.
- Younger generations, however, are the ones who actually convert.
- High variances are visible; let's have a look at the outliers. A quick numeric check follows below.
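To back the visual impression with numbers, here is a minimal sketch on df_clean, using the median as a robust summary and excluding no-performers, as in the plot:
# median CTR and conversion rate per socio group
socio = (df_clean[df_clean.CTR != 0]
         .groupby(['age', 'gender'])[['CTR', 'conversion_rate']]
         .median())
print(socio)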
Day-by-Day performance 🗓
# display weekday performance
plt.figure(figsize=(10,7))
sns.barplot(data=df_clean,
            y='conversion_rate',
            x='weekd')
plt.title('Conversion per weekday',
          size=15)
plt.grid(axis='y')
plt.show()
Note
- Tuesdays and Fridays (1 and 4) perform best on average (a quick numeric check follows below).
- Wednesdays and Saturdays show the least user engagement (with high variance).
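The averages behind the bar chart, as a quick sanity check:
# mean conversion rate per weekday (0 = Monday)
print(df_clean.groupby('weekd').conversion_rate.mean().round(5))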
Interest performance 🤹
# group df by interest_sets
df_interests = (df_clean.groupby('interest_set')
                        .mean(numeric_only=True)  # skip the object columns (age, gender)
                        .sort_values(by='conversion_rate', ascending=False))
# display top20 interest sets
plt.figure(figsize=(15,6))
sns.barplot(x=df_interests.head(20).index,
            y=df_interests.head(20).conversion_rate)
plt.xticks(rotation=90)
plt.grid(axis='y')
plt.title('Top 20 Interest Sets by Conversion Rate', size=15)
plt.show()
Note
- Interest sets 30_33_34 and 21_24_26 show the best conversion performance.
# create df with the most expensive ads without conversion
interest_flops = df_interests[(df_interests.spent > 0)
                              & (df_interests.approved_conversion == 0.0)] \
                 .sort_values(by='spent', ascending=False)
# plot top30
plt.figure(figsize=(15,6))
sns.barplot(data=interest_flops.head(30),
            x=interest_flops.head(30).index,
            y='spent')
plt.xticks(rotation=90)
plt.ylabel('Avg. Ad Cost (EUR)')
plt.title('Top 30 Most Expensive Interest Sets with no Conversion', size=15)
plt.show()
Note
- The set 27_31_33 was the most expensive and thus least efficient interest set in the data set.
# check the top 100 converting ads and their interests
df_conversion = df_clean.sort_values(by='conversion_rate', ascending=False).head(100)
df_conversion_low = df_clean.sort_values(by='conversion_rate', ascending=True).head(100)
# concatenate all available values
plt.figure(figsize=(15,10))
for i, (k, part) in enumerate({'High performing': df_conversion,
                               'Low performing': df_conversion_low}.items()):
    # pool all interest columns and count the occurrences of each interest id
    new_lst = list(part.interest1.values) + list(part.interest2.values) + list(part.interest3.values)
    count = dict(Counter(new_lst))
    # create df
    interest_df = pd.DataFrame(list(count.items()), columns=['interest', 'count'])
    interest_df = interest_df.sort_values(by='count', ascending=False).reset_index(drop=True)
    interest_df = interest_df[interest_df['interest'] != -1]  # drop the -1 fill values
    # plot
    plt.subplot(2, 1, i+1)
    sns.barplot(data=interest_df,
                x='interest',
                y='count',
                order=interest_df['interest'])
    plt.title(f'{k} Interest Groups by Conversion', size=15)
    plt.grid(axis='y')
plt.tight_layout()
plt.show()
Note
- Interest group 16 appears 20 times in the top 100 converting ads.
- Interest group 32 appears most often in the low-performing ads.
Impressions performance 👀
Note
- Higher bids lead to higher impressions and thus more (paid) clicks; the correlation is clearly visible (a quick numeric check follows below).
- There are some outliers to be inspected, like ad 1122265.
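Since the underlying plot isn't reproduced here, a minimal sketch of how the relationship can be checked on df_clean; highlighting ad 1122265 in red is just for illustration:
# correlation between ad spend, impressions and clicks
print(df_clean[['spent', 'impressions', 'clicks']].corr())

# visual check, with the suspected outlier highlighted
ax = df_clean.plot.scatter(x='spent', y='impressions', alpha=.4)
outlier = df_clean[df_clean.ad_id == 1122265]
ax.scatter(outlier.spent, outlier.impressions, color='red', label='ad 1122265')
ax.legend()
plt.show()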
Set Cost goals 🎯
Let's have a look at some excerpts from the data set to set a goal for future ads and to corroborate the above findings.
- The cheapest 👼 impressions which still led to conversions:
# filter by CPM (cost per 1,000 impressions), excluding the non-converting / non-clicked data
df_clean[df_clean.CPM != 0].sort_values(by='CPM').head()
Note
- All of the top 5 cheapest costs per impression target a male audience, with an overrepresentation of interest groups 16 and 17.
- The worst (most expensive) 👺 ads by cost per conversion:
df_clean[df_clean.CPCon != 0].sort_values(by='CPCon', ascending=False).head(10)
Note
To avoid highly clicked and costly ads with no conversion:
- Avoid weekends.
- Avoid interest group 32 overall.
- Avoid age group 45-49.
- The CPCon range:
# show cheapest and most expensive conversions
df_clean.CPCon.describe()count 513.000000
mean 46.165392
std 55.627454
min 0.180000
25% 7.540000
50% 29.910000
75% 60.730000
max 352.449999
Name: CPCon, dtype: float64
Findings and Recommendations 💎
- The cost per conversion ranges between 0.18 and 352.45 EUR; about 75% of conversions cost less than ~61 EUR. Highly costly outliers show that some campaigns were lacking maintenance.
- Weekdays play a significant role for ad performance.
- Some targeted interest groups lead to high costs (clicks) but low conversion.
- There is a significant difference between age and gender groups.
Best practices for well-converting ads:
- Use Tuesdays and Fridays as preferred scheduling days. Details
- Interest sets 30_33_34 and 21_24_26 show the best conversion performance. Details
- Interest group 16 seems like a good general catch. Details
- The female audience could be targeted more strongly due to better conversions, provided the company's fairness policy allows that. Details
To avoid highly clicked and costly ads with no conversion (Details):
- Avoid weekends.
- Avoid interest group 32 overall.
- Avoid age group 45-49 (window shoppers).
- Create an alert for cost-exploding ads or set budget limits (a simple sketch follows below).